AFP548 Changing the world one server at a time.
Thursday, July 29 2010 @ 09:26 am MDT
   

AppleRAID 2 in Depth

Way back in the day I mentioned that AppleRAID 2 had received a fair bit of attention in Tiger. Then we proceeded to never say anything about it again. Today that changes as our in-depth coverage of AppleRAID, one of the most underutilized bits of Mac OS X, hits the site.

Read on for more...

There are few parts of Mac OS X that are understood, or appreciated, by fewer people than the built-in RAID software. There is a perception out there that it is flaky, inflexible, and unreliable. Over the course of hundreds of Mac OS X Server installs, though, we have found these perceptions to be false or at least ill-applied. By the end of this article I hope that you will have a greater respect for AppleRAID 2.

A bit of history

Apple has a long history with RAID software. In the classic Mac OS days, they bundled a re-branded copy of SoftRAID as AppleRAID with some installs of AppleShare or ASIP. RAID disappeared with the advent of Mac OS X, reappearing in Mac OS X 10.2. At the time it offered device level stripes and mirrors and nothing else. Rebuilding a RAID mirror before 10.2.4 was tricky and often failed. With the coming of Mac OS X 10.3 we received a few nice updates to AppleRAID, the main ones being the ability to convert an existing drive into a mirror and the ability to rebuild mirrors on-line. All of this was just a tease for the launch of 10.4 and AppleRAID 2.

What's new in AppleRAID 2

AppleRAID 2 brings a bunch of new features to the party:

  • The ability to work with volumes or devices
  • Tunable RAID block sizes
  • Auto-rebuild for mirrors
  • Dedicated warm-spares for mirrors
  • The ability to split a mirror set for instant backups
  • RAID levels can be combined to create 0+1 or 1+0 (or 10) sets
  • Concatenated disk sets (JBOD)
  • Significantly improved RAID tools in Disk Utility

    All of this means that AppleRAID 2 is now a very flexible tool. In particular the new mirror tools are a boon to sysadmins.

    RAID Primer

    Before we get too far into the Mac OS X Server specific details we should take a moment to review the basics of RAID.

    RAID stands for Redundant Array of Inexpensive (or Independent) Disks. The idea behind it is to take multiple cheap drives and band them together to achieve performance, or reliability, that is beyond the reach of a single drive. The different types of RAID arrays are designated by a level number. AppleRAID supports two levels of RAID, 0 and 1, along with nested combinations of the two and concatenated sets:

  • 0, or a stripe, combines multiple volumes into one large volume
  • 1, or a mirror, combines multiple volumes into one volume the size of its smallest member
  • 0+1 is a mirror of stripes
  • 1+0 (or 10) is a stripe of mirrors
  • Concatenated, or JBOD, disk sets combine volumes without a stripe.

    AppleRAID does not support any other common levels such as 3 or 5. If you want those you need to go with a hardware-based solution for the best performance. Let's look a bit closer at each of these RAID levels.

    RAID level 0 stripes multiple volumes together in an effort to maximize speed. It offers no data protection at all and the loss of a single member results in the loss of the entire array. (Astute readers will note that this means RAID 0 really isn't a RAID at all as it lacks the "Redundant" part.) In general a stripe is not often used in a server as it only offers a speed increase, and every member you add to the set increases the odds of losing the volume. It is possible to mirror two stripes, but with only 3 drive bays in an Xserve it is an option that not many will use. In general an Xserve RAID configured for RAID 5 is a better option if you want a single large, and protected, volume. While stripes find little use on Mac OS X Server outside of specific Xsan applications, they are often used on workstations for things like video capture volumes.

    RAID level 1 mirrors two volumes to provide redundancy. This way if a drive were to fail there would be no sudden failure of services. In general this is the level of RAID that most Mac OS X Server sysadmins will apply when hardening their servers against failure. RAID mirrors give the administrator a great deal of flexibility to replace troublesome hardware without downtime. Another common task is to split, or remove, a member from a RAID mirror to create an instant backup of that volume. AppleRAID 2 adds the ability to cleanly split a member from a live RAID under Mac OS X.

    Nested RAID levels such as 0+1 and 1+0 are an attempt to add flexibility to levels 0 and 1. For the most part 0+1 has all the problems of 0, with only slightly higher reliability. Since the underlying stripes are so fragile, the loss of a single drive in each of the two stripes will result in total data loss. RAID 1+0, or 10, is more robust as each member of the greater stripe is itself a mirror. You could lose a single drive from each mirror member of the stripe and not lose any data. As noted above these options are not often used on Mac OS X Server due to the relatively small number of drive bays in any shipping Mac.
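To put rough numbers on that comparison, here is a minimal sketch in shell and awk. It assumes every disk fails independently with the same probability over some period, which is a simplification and not something AppleRAID reports. The function name raid_loss_odds is made up for illustration.

```shell
# raid_loss_odds P K   (hypothetical helper, not part of diskutil)
#   P = assumed chance (0..1) that any one disk fails, independent of the others
#   K = disks per stripe (for 0+1) or number of mirror pairs (for 10)
# Prints the chance of losing the whole volume for each layout.
raid_loss_odds() {
    awk -v p="$1" -v k="$2" 'BEGIN {
        stripe = 1 - (1 - p) ^ k    # a stripe is lost if any member is lost
        mirror = p * p              # a two-disk mirror is lost only if both are lost
        printf "stripe of %d: %.6f\n", k, stripe
        printf "mirror of 2: %.6f\n", mirror
        # 0+1: two K-disk stripes mirrored; lost once both stripes are dead
        printf "0+1 (%d disks): %.6f\n", 2 * k, stripe * stripe
        # 10: K two-disk mirrors striped; lost once any mirror loses both disks
        printf "10 (%d disks): %.6f\n", 2 * k, 1 - (1 - mirror) ^ k
    }'
}

raid_loss_odds 0.05 2
```

With an assumed 5% per-disk failure chance and four disks, the sketch puts RAID 10 at roughly half the loss odds of 0+1 (about 0.5% versus about 1%).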

    A concatenated disk is sometimes also referred to as JBOD (Just a Bunch of Disks). This is very similar to a RAID 0 in that it combines volumes to form one larger volume. Where it is different on most OSes is that the failure of one volume won't result in the loss of all data, but rather just the data that was located on that particular disk in the set. This has an effect very similar to a large patch of bad blocks on a single device. Unfortunately, Apple's implementation of concatenated disks doesn't follow this model. The loss of a single disk will result in a wholly inaccessible volume, although if you restore the missing member it will spring back to life. The advantage of a concat over a stripe on Mac OS X is that you can dynamically add volumes to expand the set without taking the volume offline. If you administer Windows servers you know these disk sets as a spanning dynamic disk.
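The size rules for the three basic set types can be sketched in a few lines of shell. The mirror rule (size of the smallest member) comes from the list above. The stripe rule used here, where every member contributes only as much as the smallest one, is the conventional behaviour and is an assumption about AppleRAID rather than something measured for this article. raid_capacity is a hypothetical helper name.

```shell
# raid_capacity KIND SIZE [SIZE...]   (hypothetical helper)
#   KIND is mirror, stripe, or concat; sizes are member sizes in any one unit.
raid_capacity() {
    kind=$1; shift
    printf '%s\n' "$@" | awk -v kind="$kind" '
        NR == 1 || $1 < min { min = $1 }    # track the smallest member
        { sum += $1 }                       # and the total of all members
        END {
            if (kind == "mirror")      print min        # smallest member, per the list above
            else if (kind == "stripe") print min * NR   # assumption: smallest member times member count
            else if (kind == "concat") print sum        # members are simply appended
            else { print "unknown kind: " kind > "/dev/stderr"; exit 1 }
        }'
}

# The 16.8 GB and 13.7 GB slices used later in this article:
raid_capacity mirror 16.8 13.7
raid_capacity stripe 16.8 13.7
raid_capacity concat 16.8 13.7
```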

    One of the key things to remember about any RAID level is that a RAID set is not a backup! A RAID is either a performance aid or a guard against unexpected downtime. Nothing more.

    So now that we have been refreshed on what the levels of RAID are we can dive in.

    AppleRAID Specifics

    So how does Mac OS X know that a disk is a member of a RAID set? The basics of it are pretty simple. Take a look at the partition table for a non-RAID device.

    GPT formatted disk:

    josh$ diskutil list disk3
    /dev/disk3
       #:                   type name               size      identifier
       0:  GUID_partition_scheme                    *27.9 GB  disk3
       1:                    EFI                    200.0 MB  disk3s1
       2:              Apple_HFS Client             13.8 GB   disk3s2
       3:              Apple_HFS disk3              13.7 GB   disk3s3
    
    APT formatted disk:

    josh$ diskutil list disk1
    /dev/disk1
       #:                   type name               size      identifier
       0: Apple_partition_scheme                    *17.0 GB  disk1
       1:    Apple_partition_map                    31.5 KB   disk1s1
       2:             Apple_Boot                    128.0 MB  disk1s2
       3:              Apple_HFS disk1              16.8 GB   disk1s3
    
    Now let's look at the same disks once they are part of a RAID set.

    GPT formatted disk:

    josh$ diskutil list disk3
    /dev/disk3
       #:                   type name               size      identifier
       0:  GUID_partition_scheme                    *27.9 GB  disk3
       1:                    EFI                    200.0 MB  disk3s1
       2:              Apple_HFS Client             13.8 GB   disk3s2
       3:             Apple_raid                    13.7 GB   disk3s3
    
    APT formatted disk:

    josh$ diskutil list disk1
    /dev/disk1
       #:                   type name               size      identifier
       0: Apple_partition_scheme                    *17.0 GB  disk1
       1:    Apple_partition_map                    31.5 KB   disk1s1
       2:             Apple_Boot                    128.0 MB  disk1s2
       3:             Apple_raid                    16.8 GB   disk1s3
    
    Notice the data partition changed from a type of "Apple_HFS" to "Apple_raid" on the disks. What you can't see is that the partition was shrunk by about 8 KB and new header info was written. This header is used to store the information about the RAID set and this particular drive's membership and status in that set. Because the info is stored on the disk, you can move the set to a different Mac and it should come up fine. For more detailed info on the RAID header you should take a look at AppleRAIDMember.cpp and AppleRAIDUserLib.h. (You must have an Apple ID to access the Darwin source repository.)
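Since the partition type is the visible marker of RAID membership, a quick way to find member slices is to filter the diskutil list output on that column. The following is a small sketch, not an official tool: raid_slices is a made-up name, and it matches the type case-insensitively so it does not depend on how the listing capitalizes it.

```shell
# raid_slices (hypothetical helper): read `diskutil list` output on stdin and
# print the identifier of every slice whose type column is Apple_RAID.
raid_slices() {
    awk '$1 ~ /^[0-9]+:$/ && tolower($2) == "apple_raid" { print $NF }'
}

# Sample rows copied from the listing above. On a live system you would
# pipe in `diskutil list` instead.
raid_slices <<'EOF'
/dev/disk3
   #:                   type name               size      identifier
   0:  GUID_partition_scheme                    *27.9 GB  disk3
   1:                    EFI                    200.0 MB  disk3s1
   2:              Apple_HFS Client             13.8 GB   disk3s2
   3:             Apple_raid                    13.7 GB   disk3s3
EOF
```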

    If we were to look at the newly created RAID as a disk we would see that it only has one partition:

    josh$ diskutil list disk4
    /dev/disk4
       #:                   type name               size      identifier
       0:              Apple_HFS Untitled raid Set 2 *16.8 GB  disk4
    
    This is because this logical disk only contains the actual RAID set volume in an Apple_HFS partition. The physical disks retain all of the partitions needed for the disk to function.

    For all practical purposes the Apple_raid volumes of a mirror set and the Apple_HFS RAID volume itself are functionally identical. Using Amit Singh's hfsdebug tool we can examine the volume headers of all the disks in a set. For example:

    josh$ diskutil checkraid
    raid SETS
    ---------
    
    Name:                   Untitled raid Set 7
    Unique ID:              3E2B3BDE-345C-4060-9D61-F56D05AB6DF3
    Type:                   Mirror
    Status:                 Online
    Device Node:            disk4
    Apple raid Version:     2
    ----------------------------------------------------------------------
     #      Device Node     UUID                                    Status
    ----------------------------------------------------------------------
    0       disk1s3         F07F8405-3F4F-42B3-A443-55C30C0188D2    Online
    1       disk2s3         AE62FC67-5F6F-4252-B7F3-720B91653A0B    Online
    ----------------------------------------------------------------------
    
    josh$ sudo ./hfsdebug -d /dev/rdisk4 -v | grep UUID
           # File System Boot UUID
                      UUID = FE926987-7868-34C7-8E05-0B06AB3F366E
    
    josh$ sudo ./hfsdebug -d /dev/rdisk1s3 -v | grep UUID
           # File System Boot UUID
                      UUID = FE926987-7868-34C7-8E05-0B06AB3F366E
    
    josh$ sudo ./hfsdebug -d /dev/rdisk2s3 -v | grep UUID
           # File System Boot UUID
                      UUID = FE926987-7868-34C7-8E05-0B06AB3F366E
    
    Note that the devices remain discrete and maintain their original UUIDs. The fact that the HFS volumes are identical is what allows the mirror to seamlessly function even when missing a member. If you were to add a new member to the array then it would assume the HFS properties of the mirror as a whole.
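If you want to script that comparison rather than eyeball it, one possible sketch is below. It only looks at the "UUID = ..." lines that the hfsdebug commands above print, and same_fs_uuid is a hypothetical name chosen for this example.

```shell
# same_fs_uuid (hypothetical helper): read "UUID = ..." lines on stdin, in the
# form printed by the hfsdebug | grep UUID commands above, and succeed only
# if at least one UUID was seen and they are all the same.
same_fs_uuid() {
    count=$(awk '$1 == "UUID" && $2 == "=" { print $3 }' | sort -u | wc -l | tr -d ' ')
    [ "$count" -eq 1 ]
}

# The three values reported for the mirror set and its two members above:
if printf '%s\n' \
    'UUID = FE926987-7868-34C7-8E05-0B06AB3F366E' \
    'UUID = FE926987-7868-34C7-8E05-0B06AB3F366E' \
    'UUID = FE926987-7868-34C7-8E05-0B06AB3F366E' | same_fs_uuid
then
    echo "set and members carry the same HFS volume UUID"
fi
```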

    Things are very different when considering a stripe or concat set of disks. Here only the first member of the array carries any HFS information with the other volumes simply providing additional blocks for the set. Let's examine the same information on a stripe.

    josh$ diskutil checkraid
    raid SETS
    ---------
    
    Name:                   Untitled raid Set 16
    Unique ID:              410492D8-E48D-4B3E-99CA-75FD70B08354
    Type:                   Stripe
    Status:                 Online
    Device Node:            disk4
    Apple raid Version:     2
    ----------------------------------------------------------------------
     #      Device Node     UUID                                    Status
    ----------------------------------------------------------------------
    0       disk1s3         D9866327-6AF8-49D1-982F-BC40B5AEE3EB    Online
    1       disk2s3         DD5EA9D5-041D-4BB2-9F0C-D4EED21931C9    Online
    ----------------------------------------------------------------------
    josh$ sudo ./hfsdebug -d /dev/rdisk4 -v | grep UUID
           # File System Boot UUID
                      UUID = 06E430B8-F1A4-3156-ADB4-5361D6B15031
    josh$ sudo ./hfsdebug -d /dev/rdisk1s3 -v | grep UUID
           # File System Boot UUID
                      UUID = 06E430B8-F1A4-3156-ADB4-5361D6B15031
    josh$ sudo ./hfsdebug -d /dev/rdisk2s3 -v | grep UUID
    This is neither an HFS+ nor an HFSX volume.
    hfsdebug: failed to access the Volume Header.
    
    On Mac OS X a concat set will return similar results.

    Now that we understand the relationship between an Apple_raid partition and an Apple_HFS one, there are two different ways to get this new partition onto the disk. One is to create a new set, essentially re-partitioning the device or volume and destroying all data on the disks. The other way is to use the diskutil enableraid command. This will shrink the data partition slightly (by around 8 KB) and then create the Apple_raid header and partition. This operation depends on several things:

  • that the volume be mounted
  • that the volume can be un-mounted
  • that the volume format is shrinkable (HFS+J is)
  • that you have ownership of the volume (a.k.a. root access)
  • and that there is sufficient freespace on the volume to allow for the shrink.

    Really the only two situations in which I have seen the enableraid command fail are when the disk is slap full of data or when the Mac OS 9 drivers are installed. In these cases you will get errors like the following:

    josh$ sudo diskutil enableraid mirror disk2s10
    Changing filesystem size on disk 'disk2s10'...
    Attempting to change filesystem size from 18232721408 to 18234343424 bytes
    Filesystem grow failed, 1
    Disk Management could not shrink the filesystem to fit the new raid headers
    Error enabling disk to raid Invalid request (-9998)
    
    What the enableraid command gives you is very powerful: the ability to convert an existing volume into the first disk of a mirror or concatenated RAID set. We will further explore these options in just a bit.

    On Mac OS X there are two main interfaces for dealing with disks, Disk Utility and its command line counterpart diskutil. As was the case on Mac OS X 10.3, the diskutil tool can perform a few more functions than Disk Utility. The gap has narrowed though, and for many tasks it is far simpler to use Disk Utility's GUI. For the examples in this article we will use both tools where appropriate. A safe rule of thumb when dealing with the differences between the two tools is that the non-destructive creation and removal tools for RAID sets and members exist only in diskutil. Additionally, there are some cases where Disk Utility will become confused and out of sync with the status of RAID sets. In those instances you can typically clear things up by restarting Disk Utility.

    To write this article I grabbed an ancient AGP G4 500, threw three old 10K SCSI drives in it and grabbed a bus-powered FireWire drive. I'm running 10.4.7 PPC and the internal disks are APT formatted while the FireWire drive is GPT formatted. Really you can use just about any combination of devices to build RAID sets, as AppleRAID is very flexible.

    Mirrors

    Since this is what most of you are after I'll cover it first. Simply put, it is my opinion that almost no server should be running if it isn't booted from a mirror of some sort. Drive failures are one of the most common failures on servers and a RAID mirror can, and will, save your butt.

    The simplest way to create a mirror is with Disk Utility. Simply select a volume, click the RAID tab, and then drag the volumes you want in the mirror into the window. Before you click the "Create" button though let's take a look at the "Options..." one.

    When you click on the Options button an options panel appears that has two settings. The first one is an option to change the RAID block size. This is a performance tuning parameter that can assist you in tuning your RAID for speed when you have a specific sort of data that will be stored on it. If your array will host a MySQL database with tiny records you can lower the block size to improve access. If you are creating a volume that will store large video files then a larger block size will probably help speed access to those files. For general use the default of 32K is fine, but we will take a closer look at block size tuning and determination later. Take note though that you cannot change the block size after the initial creation.

    The second option is the RAID Mirror AutoRebuild setting and it, quite obviously, is only for mirror sets. What this setting does is to allow a mirror to automatically rebuild onto a spare drive in case of a failure. It will not grab just any drive, only those that are included in the set as a spare. Once a spare drive is activated the member of the set that failed is marked as a spare so that it will not get in the way if it happens to come back online during the rebuild. The simple act of adding a spare drive will activate the AutoRebuild option for the RAID and in my testing I have not been able to turn it off. Luckily I have not been able to cause the rebuild process to fail by reintroducing the missing member to the set either. Let's take a closer look at spare drives now.

    To add a spare, just drag another volume of equal or larger size into the RAID mirror and then select "Spare" from the type pick list. You can also add a spare from the command line using sudo diskutil addToraid spare newMember existingraid where newMember and existingraid are disk identifiers of the volumes you are working with. This will remove the volume from general use and put it on standby for an issue with the mirror. Check out the spiffy video (QuickTime 7 required) of a warm spare and AutoRebuild in action as I yank the cable on a member of the RAID. Notice that the Dataraid volume never drops from the desktop. You can have multiple spares if you wish, although you will probably begin to run out of drive bays before it becomes practical. In a standard Xserve configuration a boot mirror with a spare would be a very bulletproof setup.

    To create your RAID now just click the "Create" button and let Disk Utility do its thing. If we wanted to create our mirror from the command line then the syntax is simple: sudo diskutil createraid mirror raidname FilesystemName member disks. To see an example I'll create a mirror from disk1 and disk3s3.

    josh$ sudo diskutil createraid mirror DataMirror "Journaled HFS+" disk1 disk3s3
    Preparing partition 'disk1s3' for raid
    Adding disk 'disk1s3' to new raid set
    Preparing partition 'disk3s3' for raid
    Adding disk 'disk3s3' to new raid set
    Creating raid Set (disk1 , disk3s3 data1)
    Bringing raid partitions online
    Waiting for new raid to come online "2CFEC74D-5C0C-49CB-940E-6099AE3C2B97"
    Creating file system on raid volume "disk4 "
    
    The raid has been created successfully
    
    Another common use of RAID 1 is to split a member out of the mirror as a backup. This provides an instant snapshot of the volume as a whole. Before 10.4 Apple had no easy way to do this and the best you could get away with was to unmount the volume and remove a drive. This was a pretty nasty trick and not that easy if the OS was on the mirror as it required you to shut the server down. With AppleRAID 2 though we have a much easier option in the form of the removeFromraid command. Let's take a look...

    For this example I've created a mirror that has an external FireWire drive as one of its members. A quick look with diskutil checkraid will show us the mirror, its status, and its member devices.

    josh$ diskutil checkraid
    raid SETS
    ---------
    
    Name:                   DataMirror
    Unique ID:              2CFEC74D-5C0C-49CB-940E-6099AE3C2B97
    Type:                   Mirror
    Status:                 Online
    Device Node:            disk4
    Apple raid Version:     2
    ----------------------------------------------------------------------
     #      Device Node     UUID                                    Status
    ----------------------------------------------------------------------
    0       disk1s3         EF3F0D87-0E9A-4ED4-BB2B-F2A7EBFD0306    Online
    1       disk3s3         B1EC4F22-9C0D-4CDB-90FE-D49347195834    Online
    ----------------------------------------------------------------------
    
    In this case the FireWire disk is the second member of the array. To cleanly remove this disk from the array I simply call sudo diskutil removeFromraid disk3s3 disk4. After a few seconds I'll be notified that the RAID headers have been removed from the drive and it will mount on the Desktop. It looks like this when it happens.

    josh$ sudo diskutil removeFromraid disk3s3 disk4
    Password:
    Appleraid Headers removed from disk 'disk3s3'
    Changing filesystem size on disk 'disk3s3'...
    Attempting to change filesystem size from 14658928640 to 14658936832 bytes
    The disk has been removed from the raid
    
    Now I can eject the external drive and tuck it away with all of its data intact. If my mirror had a spare it would have begun the process of rebuilding onto it; otherwise I can add a new member to the array and rebuild onto that.

    Keep in mind that this operation actually removes the drive from the RAID. The headers are deleted and it is removed from the mirror's member list. In this way it's not an exact replica but since the data is exactly preserved it serves our purposes.
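The two sizes that removeFromraid reports in the transcript above also let us check the header size mentioned earlier. A quick bit of shell arithmetic, with reclaimed_bytes as a made-up helper name:

```shell
# reclaimed_bytes BEFORE AFTER (hypothetical helper): how much the filesystem
# grew once the RAID headers were removed.
reclaimed_bytes() {
    echo $(( $2 - $1 ))
}

# Filesystem sizes in bytes, copied from the removeFromraid output above.
before=14658928640
after=14658936832

bytes=$(reclaimed_bytes "$before" "$after")
echo "reclaimed $bytes bytes ($((bytes / 1024)) KB)"
```

The difference works out to 8192 bytes, which lines up with the roughly 8 KB that the partition gives up when it is enabled for RAID.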

    As mentioned earlier, you can turn any existing volume into a degraded mirror pair, then rebuild the mirror onto a second volume. We have an older document on the process here but I'll quickly run through the steps for you now. In this example we will turn a boot volume into a mirror.

    1. MAKE A BACKUP! You are about to live edit the partition table of a volume. To borrow a phrase from diskutil, "Enabling RAID is an inherently dangerous operation."
    2. Boot from something else. An external HD with the latest version of Mac OS X is probably your best bet.
    3. Identify the disk slice or volume mount point that you want to enable the RAID on. Let's say that the disk I want to enable is mounted at /Volumes/BootDisk.
    4. Continuing with our example, fire off 'sudo diskutil enableraid mirror /Volumes/BootDisk'. The volume will unmount and then re-appear a few moments later.
    5. Re-select the freshly RAID-enabled volume in the Startup Disk System Preference and reboot.
    6. Now you can add additional volumes to the mirror and rebuild it in the background. This may take a long time.

    Our example here can vary in a few different ways. If you are working with a data volume that can be safely unmounted then you can skip the external media boot. Likewise you can always pass the volume as a device node to diskutil. You can get the disk and slice number with diskutil list, which generates output like this:

    josh$ diskutil list
    /dev/disk1
       #:                   type name               size      identifier
       0: Apple_partition_scheme                    *17.0 GB  disk1
       1:    Apple_partition_map                    31.5 KB   disk1s1
       2:             Apple_Boot                    128.0 MB  disk1s2
       3:              Apple_HFS BootDisk           16.8 GB   disk1s3
    
    In this case the data partition we would want to name in the enableraid command is disk1s3. Remember that the disk identifiers on Mac OS X are dynamic and you never know what disk will get which number. So always verify the disk number before doing anything!
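Because the numbers move around, it can help to look the identifier up by volume name each time instead of typing it from memory. Here is a minimal sketch that does so from diskutil list output. slice_for_volume is a hypothetical helper, and it assumes the column layout shown above (index, type, name, size, unit, identifier).

```shell
# slice_for_volume NAME (hypothetical helper): read `diskutil list` output on
# stdin and print the identifier of the slice whose name column equals NAME.
# Exits non-zero if no slice has that name.
slice_for_volume() {
    awk -v want="$1" '
        $1 ~ /^[0-9]+:$/ && NF >= 6 {
            # The name may contain spaces, so rebuild it from the fields
            # between the type column and the size/unit/identifier columns.
            name = $3
            for (i = 4; i <= NF - 3; i++) name = name " " $i
            if (name == want) { print $NF; found = 1 }
        }
        END { exit (found ? 0 : 1) }'
}

# Sample listing from above. On a live system: diskutil list | slice_for_volume BootDisk
slice_for_volume BootDisk <<'EOF'
/dev/disk1
   #:                   type name               size      identifier
   0: Apple_partition_scheme                    *17.0 GB  disk1
   1:    Apple_partition_map                    31.5 KB   disk1s1
   2:             Apple_Boot                    128.0 MB  disk1s2
   3:              Apple_HFS BootDisk           16.8 GB   disk1s3
EOF
```

Rows without a name (the partition map, boot, and EFI slices) are skipped, so a lookup can only ever return a named volume.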

    What we have done here is to turn a single device into a degraded mirror set. All we need to do now is add a second volume and rebuild the mirror onto it. You can do this in Disk Utility with a simple drag and drop or with sudo diskutil repairMirror mirrorSet newMember where mirrorSet and newMember are disk identifiers like we just looked at. Once this process is started you can check it with Disk Utility or diskutil checkraid. From a Terminal it looks like this:

    josh$ sudo diskutil repairMirror disk4 disk3s3
    Password:
    Note:  Syncing data between mirror partitions can take a very long time.
    Note:  The mirror should now be repairing itself  You can check it's status using 'diskutil checkraid'.
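If you would rather not keep re-running the status check by hand during a long rebuild, a small filter over the diskutil checkraid member table can do the watching. This is a sketch under one stated assumption: it treats any member status other than "Online" as not yet healthy, because the exact wording diskutil uses for a rebuilding or failed member is not shown in this article. raid_offline_members is a made-up name.

```shell
# raid_offline_members (hypothetical helper): read `diskutil checkraid` output
# on stdin, print each member row whose status is not "Online", and exit
# non-zero if any such member was found.
raid_offline_members() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 ~ /^disk[0-9]+/ {
            # Member rows are: index, device node, UUID, status (may be several words).
            status = $4
            for (i = 5; i <= NF; i++) status = status " " $i
            if (status != "Online") { print $2 ": " status; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }'
}

# Member rows from the DataMirror listing earlier in this section:
if raid_offline_members <<'EOF'
0       disk1s3         EF3F0D87-0E9A-4ED4-BB2B-F2A7EBFD0306    Online
1       disk3s3         B1EC4F22-9C0D-4CDB-90FE-D49347195834    Online
EOF
then
    echo "all members online"
fi
```

On a live system the same filter could sit in a loop that pipes in diskutil checkraid and sleeps between passes until it exits cleanly.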
    

    After all the fun and games with mirror sets, the other RAID types will seem simple by comparison.

    Stripes

    Due to their relative fragility, there are not a lot of times that you will be using a stripe on your server. There are cases though, like a RAID 50 from an Xserve RAID, where it is common so we should take a look at it.

    As with the mirrors, the easiest way to create a stripe is with Disk Utility. The process is the same, but you would define that the array created should be a stripe. Just drag in the volumes to stripe, set your options, and click "Create". The process of creating a stripe from the command line is the same as well, with the only change being to define a stripe rather than a mirror: sudo diskutil createraid stripe raidname FilesystemName member disks. The big difference with a stripe though is that you cannot create one from an existing data disk. On Mac OS X the process of stripe creation will destroy any data on the disks.

    Stripes are also different from the other supported RAID levels on Mac OS X in that they are static. Once you create a stripe you are stuck with it. You can't add space, you can't remove members, and you can't change the block size. For this reason it's important to figure out how you want to set it up before you do it.

    When building your RAID stripe for performance the biggest factor is the number of spindles that you can stripe across. In general the more discrete devices in any given stripe, the faster it will go.

    Concatenated Disks

    New to AppleRAID 2 is the concatenated, or concat for short, disk set. A concat set takes multiple volumes and combines them into a larger one. In this way it is similar to a stripe, but there are three key differences.

  • A concat does not stripe the data
  • You can use the enableraid command for non-destructive creation
  • You can add disks to the concat set after the fact

    Concat sets are cool in that they enable you to span multiple volumes dynamically. When you add more storage space you can now stretch an existing volume to use it. Keep in mind though that, just like a mirror, any volumes you add to the set after its creation are formatted.

    It is important to note that Apple's concat sets are a bit different than your typical one. On most OSes a concat set will survive the loss of a member but will act as if it just has a large section of bad blocks. On Mac OS X this is not the case and the loss of any member will result in the loss of the volume as a whole. Along these same lines, you cannot remove a concat member once it is in the array. If you read the source code it appears that Apple wanted to let you remove the last member but that functionality is not present. It's probably just as well as removing a member of a concat could lead to catastrophic data loss if anything went wrong.

    Mixed RAID Types

    A big change in AppleRAID 2 is that you can now nest RAIDs within a larger RAID. As noted earlier, RAID 0+1 and 1+0 (aka RAID 10) allow you to mix the features of the different array types to create a more robust system. Of the two, RAID 10 is probably the one that you would be looking at. It provides the speed and large volume of a stripe while using mirrors for each of the stripe members.

    You create nested RAIDs just like any other in Mac OS X, either with drag and drop in Disk Utility or with the diskutil command line tool. In this case it's probably easier with diskutil.

    In Disk Utility you need to create your base RAIDs, then select the first one and go to the RAID tab. There you can click the + button to create a new set and drag the existing RAID sets into it. Click "Create" and you are done.

    From the command line it's a bit simpler. Using the diskutil checkraid command, determine the disk identifiers of your existing RAID volumes. Then use the createraid verb to build a new set using them. For example, if I wanted to create a RAID 10 from mirrors with disk IDs of disk4 and disk6 I would execute sudo diskutil createraid stripe NestedDisk HFS+ disk4 disk6. So while the creation of a nested set is a bit different in the GUI tools, it is exactly the same when using the CLI. Disks is disks.

    This all sounds great until you realize how unlikely it is that this will help you at all. To achieve any sort of gains from a nested RAID you need at least four disks. Currently the only shipping Apple hardware with four disk bays is the Mac Pro, and the new Xeon Xserve models continue the three bay configuration of the G5 Xserve. If you had a large eSATA or FireWire 800 drive enclosure this might be of use, but if you are going to go the external storage route your money is better spent on an Xserve RAID.

    Other Options...

    There are a few more things to cover before we can wrap up.

    If you are upgrading a server from Panther to Tiger you will also need to upgrade any RAID sets to take advantage of the new features. This is pretty easy to do with diskutil. Let's say we have an AppleRAID 1 volume at /Volumes/Pantherraid that we need to update. The command would simply be sudo diskutil convertraid /Volumes/Pantherraid, or you could use the disk identifier as well. When performing this operation all the same caveats apply as when you use the enableraid command, the most important being MAKE A BACKUP FIRST!

    Another, seemingly useless, command is the updateraid verb for diskutil. In theory you should be able to use this to adjust the settings of a RAID after it is created. In practice I've not been able to get it to change anything. The example shows "diskutil updateraid Appleraid-AutoRebuild 1 disk5", but the changes never seem to take when you try it yourself. Even more frustrating is the fact that the supposed settings aren't published anywhere and you need to dig in the source to figure them out. It seems that AutoRebuild is the only setting that this should be able to work with, but it doesn't.

    One of the most confusing settings is the RAID block size. There was a time when using tools like tunefs (which has a fun man page, by the way) was a standard practice to get the most out of your server's disks. Now IO is so fast that it doesn't really come up much, except in the case of RAIDs. As a result, tuning a file system is a bit of a lost art these days. This isn't an exhaustive look at it, but rather a quick summary to give you an idea of how to proceed if you really want to get into it.

    The first, and most obvious, step is to look at the data access patterns of the applications that will be accessing the disk. If it's a SQL database doing tiny queries then you may benefit from small blocks. If you are serving large video files then a larger size might provide a boost. An easy thing to do in many cases is to ask the person responsible for the application to tell you what the average size of a read or write is. You can't just make the assumption that a database is going to need small blocks.

    The next step is to take some measurements using iostat -w 1 on your server. This will tell you how many transactions per second are happening on each disk and the average size in KB of each transaction. Let's take a look at some examples...

    Here is the output from one of my Xserves at work. At the time I took this sample, disk0 was working on active print queues and disk1 had a radmind client in the middle of a large lapply run.

    macxmv2:~ mv2admin$ iostat -w 1
    iostat: sysctl(kern.tty_nin) failed: No such file or directory
    iostat: disabling TTY statistics
              disk0           disk1       cpu
      KB/t tps  MB/s   KB/t tps  MB/s  us sy id
     33.94   3  0.12  13.76   5  0.06   7  5 88
      0.00   0  0.00  73.12  24  1.71   4  2 94
      0.50   1  0.00  72.29  24  1.69   5  2 92
     80.14   7  0.55  73.84  16  1.15  46 16 38
     52.67   6  0.31  52.46  36  1.84  52 14 34
      6.08   6  0.04  67.82  25  1.65  26 20 53
     128.00  80  9.98  54.25  38  2.01   5 28 66
     126.56  89 10.98  58.48  25  1.43   5 28 67
     122.48  92 10.98  54.62  29  1.54   6 34 60
     118.44  32  3.69  39.10  25  0.95  42 16 42
     64.50   8  0.50  76.06  27  2.00  50 32 18
     70.62   8  0.55  54.40  20  1.06  43 16 40
     122.20 100 11.92   4.46  57  0.25  37 30 33
     120.45 202 23.70   5.06  67  0.33   6 35 59
      0.00   0  0.00  15.60  52  0.79   4  4 92
    There are a few things we can glean from this info. The first is that radmind doesn't make very large filesystem transactions, with most of them between 50 and 75 KB. You can see that the print spools on the boot drive make much larger reads and writes, but that is to be expected as they are sending and receiving large PostScript files. Note that the largest transaction you see on this Xserve is 128 KB. That's because it's a G4 and the biggest transaction a PATA drive can muster is 128 KB. Other drive types are capable of larger IO. In this case I might want to tune a file system that has lots of radmind traffic on it to a 64K block size, as that more closely matches the average transaction size. Really what you are trying to do here is reduce the number of transactions per second (tps) needed to get the job done. I would hold off on changing the boot drive in this case though, as the print spool is not the only thing it does. In general more analysis would be in order before trying to tune anything. Since block size isn't a setting you can change later, you need to be sure of what you are setting.
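
    If you would rather not eyeball the columns, the reasoning above can be roughed out in a few lines of Python. This is only a sketch: the function names are made up for illustration, taking the median KB/t of the busy samples as the "typical" transaction size is my assumption, and so is rounding it to the nearest power of two. It is not something diskutil or iostat does for you.

```python
import math
import statistics

# Sample copied from the iostat run above (one warning line kept on purpose).
SAMPLE = """\
iostat: disabling TTY statistics
          disk0           disk1       cpu
  KB/t tps  MB/s   KB/t tps  MB/s  us sy id
 33.94   3  0.12  13.76   5  0.06   7  5 88
  0.00   0  0.00  73.12  24  1.71   4  2 94
  0.50   1  0.00  72.29  24  1.69   5  2 92
 80.14   7  0.55  73.84  16  1.15  46 16 38
 52.67   6  0.31  52.46  36  1.84  52 14 34
  6.08   6  0.04  67.82  25  1.65  26 20 53
 128.00  80  9.98  54.25  38  2.01   5 28 66
 126.56  89 10.98  58.48  25  1.43   5 28 67
 122.48  92 10.98  54.62  29  1.54   6 34 60
 118.44  32  3.69  39.10  25  0.95  42 16 42
 64.50   8  0.50  76.06  27  2.00  50 32 18
 70.62   8  0.55  54.40  20  1.06  43 16 40
 122.20 100 11.92   4.46  57  0.25  37 30 33
 120.45 202 23.70   5.06  67  0.33   6 35 59
  0.00   0  0.00  15.60  52  0.79   4  4 92
"""


def parse_iostat(text):
    """Return {disk: [(kb_per_transfer, tps), ...]} for each sample row."""
    disks, samples = [], {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("disk"):
            # Device header row, e.g. "disk0 disk1 cpu"
            disks = [t for t in tokens if t.startswith("disk")]
            for disk in disks:
                samples.setdefault(disk, [])
            continue
        try:
            values = [float(t) for t in tokens]
        except ValueError:
            continue  # column titles or "iostat: ..." warnings
        if len(values) < 3 * len(disks):
            continue  # row too short to hold KB/t, tps, MB/s for every disk
        for i, disk in enumerate(disks):
            # Each disk owns three columns: KB/t, tps, MB/s
            samples[disk].append((values[3 * i], values[3 * i + 1]))
    return samples


def typical_transfer_kb(rows):
    """Median KB/t over the samples where the disk was actually busy."""
    busy = [kb for kb, tps in rows if tps > 0]
    return statistics.median(busy)


def suggest_block_kb(kb):
    """Nearest power of two (on a log scale) to a transfer size in KB."""
    return 2 ** round(math.log2(kb))


for disk, rows in parse_iostat(SAMPLE).items():
    kb = typical_transfer_kb(rows)
    print(f"{disk}: median {kb:.2f} KB/t -> consider {suggest_block_kb(kb)}K blocks")
```

    Fifteen seconds of samples is nowhere near enough to base a real decision on. Feed it a capture that covers a normal working day before you trust the number, and remember that a drive doing several jobs at once, like the boot drive here, needs more thought than a single median can give you.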

    Another way to help determine your block size is to use a tool like Bonnie++ to measure read/write performance as well as latency. My SCSI based G4 test rig is a good example, as it is completely bus constrained. If I stripe 2 SCSI drives together I get a noticeable improvement in some areas, but not all, over a single drive. By tuning the block size to 128KB I can sacrifice some overall read/write performance in exchange for better latency. It's up to you to determine if tuning your blocksize is worth it. You should see more dramatic results with less, um, stately hardware.

    Two Bonnie++ processes against a single device.

    Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    dhcp172-21s10n 300M    53  50  3844   7  1917   2    65  53  4007   2 598.3  22
    Latency               746ms     952ms    1199ms     278ms     157ms    2166ms
    Version 1.93c       ------Sequential Create------ --------Random Create--------
    dhcp172-21s10n88.09 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16   398  20  2603  44   282  16    78   6  4248  71    56   5
    Latency               317ms     234ms     419ms     426ms   55717us    1230ms

    Two Bonnie++ processes against a two drive stripe and default, 32K, blocksize.

    Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    dhcp172-21s10n 300M    54  51  5617  10  2374   3    67  54  5278   3 358.1  25
    Latency               470ms    1316ms    1446ms     474ms     148ms     658ms
    Version 1.93c       ------Sequential Create------ --------Random Create--------
    dhcp172-21s10n88.09 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16   433  22  2480  40   324  17   120  11  2957  50    80   8
    Latency               197ms     175ms     258ms     317ms   80632us     585ms

    Two Bonnie++ processes against a two drive stripe and 128K blocksize.

    Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
    Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    dhcp172-21s10n 300M    50  47  5107   9  2370   3    63  52  4972   3 845.1  40
    Latency               516ms     791ms     780ms     246ms     148ms     952ms
    Version 1.93c       ------Sequential Create------ --------Random Create--------
    dhcp172-21s10n88.09 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16   419  21  2274  37   316  17   109   9  3219  51    88   9
    Latency               210ms     197ms     272ms     597ms   78131us     518ms
    (Note that bonnie++ has a bunch of options you can use in the latest dev builds in order to multithread the tests. For these tests I created 2 semaphores and then used those to synchronize the tests. This is where you should see improvement on a raid stripe over a single disk as there are more spindles and more heads to move the data around. Check the bonnie++ documentation on how to get the best results out of it. Bonnie++ can also output some nice HTML tables, but those results don't include latency.)
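
    To put numbers on the trade-off, here is a small Python sketch that works out the percentage change between two of the runs above. The figures are copied straight from the Bonnie++ output (block write and block read K/sec, random seeks per second, and the block write latency). The dictionary keys and function names are just labels I made up for this example.

```python
# Figures copied from the three Bonnie++ runs above.
RUNS = {
    "single disk": {
        "block_write_kps": 3844,
        "block_read_kps": 4007,
        "seeks_per_sec": 598.3,
        "block_write_latency_ms": 952,
    },
    "stripe 32K": {
        "block_write_kps": 5617,
        "block_read_kps": 5278,
        "seeks_per_sec": 358.1,
        "block_write_latency_ms": 1316,
    },
    "stripe 128K": {
        "block_write_kps": 5107,
        "block_read_kps": 4972,
        "seeks_per_sec": 845.1,
        "block_write_latency_ms": 791,
    },
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(baseline, candidate):
    """Percent change of every metric, going from baseline to candidate."""
    return {
        metric: percent_change(RUNS[baseline][metric], RUNS[candidate][metric])
        for metric in RUNS[baseline]
    }


for metric, delta in compare("stripe 32K", "stripe 128K").items():
    print(f"{metric:>24}: {delta:+6.1f}%")
```

    Going from the 32K stripe to the 128K stripe costs roughly 9% of block write throughput and 6% of block read throughput on this rig, while random seeks more than double and block write latency drops by about 40%. These are single runs on one old machine, so treat them as an illustration of how to read the results, not as a benchmark.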

    A Cautionary Warning

    Currently the status of all of this on Intel based Macs is a bit up in the air. As of 10.4.8 it should all work, but Apple's related KB articles are so vague that it's not really clear. If anyone has some Intel Macs to test with please do, and post your results in the comments below.

    Wrapping up

    So there it is. The most comprehensive collection of information on Appleraid anywhere. While I was writing this, Apple posted a few KB notes on the subject. They have pretty pictures, but in general are pretty surface level. That's not to say you shouldn't read them though, as I'm a firm believer in reading all the documentation from the ReadMe to the source code.

    With the apparent lack of an internal hardware raid option in the new Xeon Xserve you can bet that Apple is going to be pushing to further improve Appleraid and I am excited to see where they take the technology. If you read the comments in the source it's plain to see that raid 5 and better concat sets are something they would like to implement.

    A critical thing to remember in all of this is that raid is not a backup. raid, in the server context, is uptime assurance. Even if something gets hosed up in the rebuild process, and you need to start over from a backup, the raid did its job by keeping services up and letting YOU pick the downtime needed to repair the server. When dealing with servers in general I belong to the school that availability is goal number one. Appleraid 2 can help you achieve this.

    As always, have fun and read the man pages...

    man diskutil
    man iostat
    man tunefs

    AppleRAID 2 in Depth | 12 comments
    The following comments are owned by whomever posted them. This site is not responsible for what they say.
    AppleRAID 2 in Depth - more on RAID 0+1 & RAID 10
    Authored by: iWiring on Wednesday, November 01 2006 @ 10:37 pm MST

    There are specific reasons for choosing raid 0+1 over raid 10 that are
    too quickly dismissed by this fine article and deserve further comment as
    the reasons and choices for each are more complex than presented.

    Put simply there are specific performance reasons for choosing raid 0+1
    vs raid 10 depending on the nature of the I/O load and one chooses one
    or the other based on reasons of performance for the size and mix of I/O
    operations expected to target the volume as both raid 0+1 and raid 10
    offer strikingly different performance. raid 0+1 may provide significantly
    better read performance (regardless of IO size) while raid 10 may
    provide better read performance when IO sizes are less than or equal to
    the stripe size. Write performance is more complex, but generally
    speaking raid 0+1 may be slightly better.

    In terms of failure tolerance the basic raid 0+1 is not any worse than
    raid 3 or raid 5 where you also can normally only afford to lose at most
    one member before suffering failure at a second member loss. So the
    argument for raid 0+1 having bad failure tolerance is generally not a
    concern. Fault tolerance can be improved, if it is a concern, by simply
    adding another mirror member or member stripe set. So if fault tolerance
    is a concern the raid 0+1 model offers a solution and can allow for a
    varying number of discrete disks to fail.

    raid 10 provides slightly better fault tolerance and significantly quicker
    rebuilds (as all disks don't have to take part in a rebuild), but offers very
    poor operational flexibility.

    One other major difference between raid 0+1 and raid 10 concerns
    expansion. You can continue to add stripe sets for additional mirroring
    with raid 0+1 but with raid 10 you can't expand at all. This has a
    major impact for sites looking to use broken mirrors for snapshots and
    backup purposes. And raid 0+1 need not be symmetrical; you could have
    a raid 0+1 that includes a non-striped mirror member.

    So while at first glance raid 0+1 looks unappealing, in practice it can be
    a better choice than raid 10 if properly understood and managed.

    Additionally the benefits of both raid levels as a software alternative to
    hardware raid 5 (which is not available in software) are a topic worth
    additional attention. While both raid 0+1 and raid 10 do require at
    least four disks, external SATA and SCSI disks remain solid options
    without having to resort to XServe raid for raid 5 or more costly
    hardware based raid appliances.

    ---
    -dhan

    Dan Shoop
    shoop@iwiring.net
    appleRAID limitation
    Authored by: jkrisko on Monday, November 27 2006 @ 06:46 pm MST
    nice article. something worth mentioning which i find a drawback of apples
    sw raid implementation: i prefer to partition my servers with a system
    partition, and the shared data (ie: the stuff that the server does) on the
    other partition. this means that its easy to rebuild the system without
    touching the shared data. unfortunately appleraid cannot partition a
    raidset.

    with our xserve G5s this was do-able using the megaraid card, which also
    nicely did raid5. now that this seems to have disappeared on the intel
    xserves, we're back to:
    - external xserve raid (not always appropriate, also expensive)
    - appleraid (can't partition a raidset, can do raid10 but the xserve only has
    3 drives, duh)

    this was confirmed by apple enterprise support who said its still active as a
    feature request. they asked for metrics on
    impact, sales figures etc, all the usual things. unfortunately xserves arent
    something you buy a whole lot of, i only have 14 so apple isnt going to care
    so much, but i see this as something that they should be doing in the
    server space, if only to enhance their low reputation in this area. especially
    if they arent going to give us internal hardware raid - still something a lot of
    organisations require in the datacentre.

    anyway just thought this deserves a mention in case anyone else was going
    to try this setup. might save you some time...

    cheers,
    jk.
    AppleRAID 2 in Depth
    Authored by: Anonymous on Sunday, December 17 2006 @ 04:22 pm MST

    I'm wondering if anyone's successfully run diskutil convertraid on a normal, functioning Appleraid 1 (Panther) setup and had it work - after Googling for convertraid I've seen several horror stories about how the Apple_Boot partition gets changed from a 516 KB partition into a 128 *MB* partition!

    A few months ago, I made the mistake of connecting my Panther Appleraid 1 volumes on a PowerBook G4 under then-10.3.9 to my then-new MacBook Pro under Tiger. The Intel Mac promptly degraded the raid (which was working fine until then) and when I tried to "fix" it in Disk Utility, it basically ran roughshod over it - first it byte-swapped all the values in the raid header (nice, so it appears the Appleraid stuff is byte-order dependent) and then it started writing NUL bytes over the actual raid data - luckily for me I stopped it as soon as it was clear that something was going wrong; it turns out it only overwrote the first 10 disk blocks after the header on one of the 2 disks in my mirror - phew.

    A little under-the-hood work with dd to grab the first 5K of the still-unzero'ed disk in the mirror, which I copied to the over-written disk's (raw) partition, combined with using a hex editor to manually un-byte swap all the byte-swapped values, and I had it recovered to where I could hook it back up to the old PowerBook G4 and run diskutil repairMirror on it. Man, what a nightmare!

    (I'm actually reading/posting this now because I just upgraded the PowerBook G4 to Tiger and I'm wondering whether to risk trying a diskutil convertraid in the hopes that the MacBook Pro might be able to read a Tiger Appleraid 2 volume from a PowerPC Mac without mangling it. Sounds like I better find a way to make a backup of my two raid1 volumes first ... )

    (This has been reported to Apple as bug id 4631176 in Radar. It has yet to even be evaluated, 6 months later as I write this.)

    I wish Apple's software RAID was reliable for me
    Authored by: reppep on Friday, May 18 2007 @ 05:14 pm MDT
    I've had several problems (on a very small set of machines using software raid), so I have very little confidence in it.

    There are problems (lack of info and confusing instructions/explanation) in both Disk Utility's online help and diskutil's manual page. There is confusion about whether one should work with disks or partitions.

    There are known issues with raid creation not working, and difficulties raiding an existing simple volume.

    That said, I am using software raid at home because I can't justify the cost for a raid controller for my personal server, but it makes me very nervous.
    Disk Utility has problem adding concat set to mirror set
    Authored by: tempel on Friday, February 19 2010 @ 03:30 am MST
    Having a Mac Pro (as a workstation) with 4 standard drive bays and room for two more drives in the DVD drive compartment, and having 6 S-ATA2 connectors, I now went for a "raid C+1" configuration for my Time Machine backup volume:

    What I had available was one 1TB and two 500GB drives, with the single 1TB drive containing the current TM backup data. So I first used "Appleraid enable mirror" on the 1TB drive to turn it into a mirror with one member without losing the current backup. The other two 500GB drives I turned into a concatenated volume (so that I can later expand my TM volume as desired).

    At that point, Disk Utility would not let me add the concatenated set as a second member to my mirror set. Well, in the UI it would let me drag it in, but the "Update" button remained disabled, so my change could not be activated.

    However, using the Terminal command "diskutil Appleraid add member", it worked. After that, the new member started rebuilding automatically, i.e. becoming a copy of the 1TB disk.

    Note that this also proves wrong the statements of some posters who say that a raid 0+1 needs at least four disks, when, in fact, it can be done with 3, too, in a way (yes, it's not really striped, but that doesn't matter in a 0+1 anyway, does it).

    Oh, BTW, if you want to look at raw disk contents, you can also use my tool "iBored" - it's a graphical disk editor, which not only shows blocks in hex but also understands (more and more) structures.