[ale] ZFS on Linux

Brian MacLeod nym.bnm at gmail.com
Mon Apr 1 17:46:40 EDT 2013


On 4/1/13 12:59 PM, Derek Atkins wrote:

> Me, I'm still researching to figure out what's the best option for
> my future use.  I'd like to be able to add more drives to expand
> the array.

Can be done, by adding ZFS vdevs, which are the units an array is
broken down into.  More in a second...


> I'd like to be able to replace the drives (potentially with larger 
> drives) and have the system expand the array when possible.  I'd
> like to be able to rebalance the system as I add more storage over
> time.  But I'd also like to have redundancy such that I can
> theoretically lose more than one drive and still survive (which
> would be a major issue if I had 20 or 24 drives in use).  So I'm
> not sure if a Stripe+Mirror or "RaidZ" or potentially a set of
> striped RaidZs would be better for me.


Redundancy for more than one drive can be provided by a combination of
which RAIDZ level you choose and the size of the vdevs.

A vdev is a virtual device inside of a zpool configuration.  In this
case it consists of a number of drive units.  A plain raidz (1 drive
of redundancy, similar to RAID5) can suffer 1 drive failure _PER
VDEV_.  Not per array, per vdev.  Raidz2 can sustain 2 (like RAID6),
and raidz3 can withstand 3 (with a more severe loss of capacity).
This is an important distinction, because your larger arrays could
(and very well should) be composed of multiple vdevs.
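
To make that concrete, creating a pool laid out as two 6-drive raidz2
vdevs plus a hot spare is a single command, roughly like this (the
disk names here are made up, not from my boxes):
------------
# zpool create data \
    raidz2 disk0  disk1  disk2  disk3  disk4  disk5  \
    raidz2 disk6  disk7  disk8  disk9  disk10 disk11 \
    spare  disk12
------------
Each raidz2 keyword starts a new vdev, which is why the real pool
below shows up as raidz2-0 and raidz2-1.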

Let's take a simple example from my work:
------------
Oracle Corporation      SunOS 5.10      Generic Patch   January 2005
# zpool status
  pool: data
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        data                       ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t50014EE002A2FBB1d0  ONLINE       0     0     0
            c0t50014EE002A693A4d0  ONLINE       0     0     0
            c0t50014EE0AD5157FCd0  ONLINE       0     0     0
            c0t50014EE0AD515B7Ad0  ONLINE       0     0     0
            c0t50014EE0AD518EC0d0  ONLINE       0     0     0
            c0t50014EE0AD518F08d0  ONLINE       0     0     0
          raidz2-1                 ONLINE       0     0     0
            c0t50014EE0AD518FF7d0  ONLINE       0     0     0
            c0t50014EE600FBB0B6d0  ONLINE       0     0     0
            c0t50014EE6565102A7d0  ONLINE       0     0     0
            c0t50014EE656511B0Bd0  ONLINE       0     0     0
            c0t50014EE65651389Fd0  ONLINE       0     0     0
            c0t50014EE656519637d0  ONLINE       0     0     0
        spares
          c0t50014EE656521C8Cd0    AVAIL

errors: No known data errors
------------

The zpool "data" actually consists of 2 vdevs, called raidz2-0 and
raidz2-1.  I didn't name these, Solaris did. Generally you don't work
directly with vdevs. Each of these vdevs consists of 6 drives,
identified here by SAS address (your system may and probably will
vary).  These are raidz2 arrays.

Because each vdev is, in and of itself, its own array, being raidz2,
the vdev itself can sustain 2 drive failures.  That means it is
POSSIBLE, in this case, for the zpool "data" to sustain 4 drive
failures, should each vdev sustain two.  Should a third drive fail in
either vdev, the whole pool is toast.

The "punishment" for this of course is reduced capacity -- these are
2Tb drives, so in this case, each vdev contributes ~8Tb, yielding
about 16Tb usable here. If you were thinking 12 drives with raidz2
(using a calculation similar to RAID6), you might expect 20Tb of
space, so you can see the tradeoff for reliability.
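
Spelling out the arithmetic (raw drive sizes, ignoring metadata and
filesystem overhead):
------------
Per raidz2 vdev:   6 drives - 2 parity = 4 data  ->  4 x 2TB =  8TB
This pool:         2 vdevs x 8TB                 ->           16TB
One 12-wide raidz2: 12 - 2 = 10 data             -> 10 x 2TB = 20TB
------------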

Now, in the case of expansion: you can technically swap in larger
drives for smaller ones, but you will not get the expanded space
unless you use partitions/slices instead of whole drives and then use
those partitions/slices as the units in a vdev.  I would caution you
against that, as you take a significant performance hit doing so.
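
If you do swap a drive (same size or larger), the mechanics are
simple; something along these lines, with hypothetical device names:
------------
# zpool replace data <old-device> <new-device>
# zpool status data
------------
zpool status will show the resilver progress while the replacement is
being rebuilt.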

You can add vdevs to a running zpool configuration.  In our case, we
use a lot of Penguin Computing IceBreaker 4736 and 4745 boxes (36-
and 45-drive chassis) and fill them as we go along.  You cannot,
however, resize vdevs once configured (though you can replace
individual drives).
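
Adding a vdev to a live pool is similarly a one-liner; something like
this (again, made-up disk names), and the new capacity is available
immediately:
------------
# zpool add data raidz2 disk13 disk14 disk15 disk16 disk17 disk18
------------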

There is math involved to prove the following assertions about
sizing vdevs, but most sysadmins have settled on a rule of thumb of
keeping vdevs between 5 and 11 units: 5 gives less storage but high
IOP counts, 11 gives higher capacity at the cost of IOPs.  Outside
that range the losses get severe.

This all said:
If you want the flexibility to change drive sizes on the fly, I would
caution you against ZFS.  If you can change the equation and adjust
the number of drives instead, ZFS works very well.


> I wonder how long until we see a LinNAS (ala FreeNAS but built on 
> Linux)?


Wouldn't surprise me at all if there's already an alpha-level
project underway.  I'm still hesitant to swap the OS on my storage
nodes right now -- I'm more likely to move these to OpenIndiana than
Linux.

Brian
