[ale] ZFS on Linux

Derek Atkins warlord at MIT.EDU
Tue Apr 2 12:08:49 EDT 2013


Brian MacLeod <nym.bnm at gmail.com> writes:

> On 4/2/13 10:47 AM, Derek Atkins wrote:
>> 
>> 
>> I wonder if this means you should spread your disks across multiple
>> controllers?  For example let's say you have three controllers in
>> your system, would it be better to put two drives from each array
>> on each controller?  That way if a single controller (or cable)
>> goes bad you don't lose your array.
>
>
> You absolutely can do this.
>
> Just be mindful that performance on each controller should be near
> identical, or else you risk making one controller the bounding
> restriction on the speed of a rebuild.

Sure, in the example I'm considering I'd effectively have three
controllers for each vdev.  In total I'd have the on-board controller
on the mobo (for 6 drives), two Supermicro AOC-SASLP-MV8 cards (8
drives each), and then a 2-port controller for the remaining 2 drives.
So basically each row of 4 drives would be on a single controller
(except for one row, two drives of which would be on the mobo and two
on the 2-port controller).  The mobo and 2-port card models are still
TBD.
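
Concretely, I'd probably build each vdev along these lines (just a
sketch; the by-path names are placeholders for whatever the real
controllers end up enumerating as):

  # One 6-drive raidz2 vdev, two drives per controller, addressed by
  # controller path so a dead controller/cable only costs the vdev two
  # drives (which raidz2 can survive).
  zpool create tank raidz2 \
      /dev/disk/by-path/<mobo-port-1> \
      /dev/disk/by-path/<mobo-port-2> \
      /dev/disk/by-path/<sas-card-1-port-1> \
      /dev/disk/by-path/<sas-card-1-port-2> \
      /dev/disk/by-path/<sas-card-2-port-1> \
      /dev/disk/by-path/<sas-card-2-port-2>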

> You are, in one sense, describing the design of our backup
> infrastructure behind the example file server I gave, as well as our
> test Coraid storage. :-)

At least I am reinventing a previously deployed configuration.  :)

>> Are you sure about that?  I did some research and according to 
>> http://forums.overclockers.com.au/showthread.php?t=961125 I should 
>> be able to expand the space in the vdev once all the disks have 
>> been upgraded.  Apparently there is a zpool feature called 
>> "autoexpand" that lets you do that, once you've scrubbed.  (I'm
>> not 100% sure what a scrub does).
>
>
> A scrub is a process that verifies bit flipping hasn't occurred and
> that the media is still reliable.  It is usually good practice to
> have scrubs regularly scheduled.  We're still working on that because
> there is a performance impact.
>
> It was a later-added feature of ZFS, and we've chosen to avoid it by
> buying larger chassis, anticipating that the cost of drives (per TB)
> will drop, and letting the participants in the HPC program buy in as
> they need to.  Thus, I'm not as well versed in it, but thank you for
> bringing it to my attention, as this may actually solve an issue we
> are coming upon.

Sure, but once your chassis is full it might be cheaper to replace the
drives with larger drives as part of your periodic drive replacement
strategy.  So let's say you installed 1TB drives 3 years ago.  Now, due
to drive prices, you decide to replace them with 3TB drives.  Voila,
you've refreshed your drives AND added more space!  :)
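
From what I've read, the whole cycle would go roughly like this
(untested on my end; the pool and device names are placeholders):

  # Let the vdev grow automatically once every member has been replaced.
  zpool set autoexpand=on tank

  # Swap each 1TB drive for a 3TB drive, one at a time, letting the
  # resilver finish before touching the next drive.
  zpool replace tank <old-1tb-disk> <new-3tb-disk>
  zpool status tank        # watch resilver progress

  # After the last replacement, verify the data and check the new size.
  zpool scrub tank
  zpool list tank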

>> This is probably due to the number of drives you need to hit to 
>> recover a block of data, or something like that.  On the system
>> I'm currently designing (based on a NORCO 4224 case) it looks like 
>> 6-drive raidz2 vdevs would fit nicely.
>
>
> Yes, 4 units with no hot spare.  Our design of similarly capable
> hardware would tend to put us in the more paranoid 7-drive vdev (x3)
> with 3 hot spares.

I've found that having a cold spare is "good enough" for my use.
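
For the 4224 that comes out to something like this (disk names are
placeholders; the cold spare just lives on a shelf):

  # Four 6-drive raidz2 vdevs filling the 24 bays, no hot spare.
  zpool create tank \
      raidz2 disk1  disk2  disk3  disk4  disk5  disk6  \
      raidz2 disk7  disk8  disk9  disk10 disk11 disk12 \
      raidz2 disk13 disk14 disk15 disk16 disk17 disk18 \
      raidz2 disk19 disk20 disk21 disk22 disk23 disk24

  # If I ever change my mind, adding a hot spare is just:
  #   zpool add tank spare <spare-disk>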

>> What about rebalancing usage?  Let's say, for example, that I
>> start with one raidz2 vdev in the zpool.  Now a bit later I'm using
>> 80% of that space and want to expand my pool, so I get more drives
>> and build a second raidz2 vdev and add it to the zpool.  Can I get
>> zfs to rebalance its usage such that the first and second vdevs
>> are each using 40%?  I'm thinking about this for spindle and
>> controller load balancing on data reads.
>
>
> That's actually what happens with our buy-in model.  As data gets
> written and such, ZFS will rebalance the usage between the vdevs, so
> I can attest that this works as you might expect.

I'm not 100% sure what you mean by your "buy-in model".

To me it's not a question of rebalancing as data gets written.  If I'm
only writing, let's say, 1GB/day, then it would take years to rebalance.
What I'm really asking is whether there is some command to tell ZFS to
move existing data, to effectively restripe itself, so that reads get
rebalanced immediately.
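
The closest thing I've run across so far is rewriting the data by hand
with a snapshot plus send/receive, which spreads the rewritten blocks
across all the current vdevs, but that isn't the one-command restripe
I was hoping for.  Roughly (dataset names are placeholders):

  # Rewrite a dataset so its blocks land on all current vdevs.
  zfs snapshot tank/data@rebalance
  zfs send tank/data@rebalance | zfs receive tank/data.new
  # After sanity-checking the copy, swap it into place.
  zfs destroy -r tank/data
  zfs rename tank/data.new tank/data
  zfs destroy tank/data@rebalance    # drop the leftover snapshot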

>> Thanks!
>
>
> You're welcome.  Now I have some additional experimenting to do...

Glad to give you ideas :)

> Brian

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available

