Getting all the Xsan pieces straight in your head
When Xsan came out, it was the first time that the average Apple system administrator had seen a Storage Area Network (SAN), and as a result there are a lot of questions about how best to deal with the very fundamentals of an Xsan setup. Here we’re hoping to cover a lot of the questions that are being asked, and help you on your way to the healthiest Xsan environment you can get.
We’re going to start with the Xserve RAID (Redundant Array of Independent Disks), and discuss what types of RAID can be applied to disks:
RAID 0 – Striping data across multiple disks for highest storage performance, 100% storage efficiency, however, no data protection.
RAID 1 – Mirroring disks for redundancy, high read performance, reasonable write performance, however, storage efficiency is at 50%. This highly-redundant setup is ideal for metadata.
RAID 3 – Parity data stored on a dedicated disk, reasonable read & write performance, increased data protection, unless the parity drive and a data drive fail simultaneously. This used to be a rather popular form of RAID, but has pretty much been abandoned in favor of RAID 5.
RAID 5 – Parity Data striped across all disks, high read & write performance, even greater data protection, and a storage efficiency of up to 85% across a set of seven drives. The Xserve RAID (XSR) is optimized for this setup, which is rather unusual in RAIDs, as most do best on a RAID 0.
RAID 10, 30 & 50 – These RAID types are created by striping RAID 1, 3, or 5 sets, and assume the use of software striping – since Xsan does its own striping (we’ll get more into this later on), these schemes aren’t appropriate for use with Xsan.
RAID 1 is ideal for metadata use with its high read performance and excellent redundancy. Since metadata only takes up about 1GB per million files, we’re not too worried about storage efficiency here either. You could use a RAID 5 set for the metadata, but you’d be tying up more drives for no real benefit.
RAID 5 is going to be our best option for redundancy and performance on our non-metadata disks.
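To put some rough numbers on the trade-offs above, here’s a back-of-the-envelope sketch in Python. The function names are mine and the figures are illustrative rather than vendor specifications; the 1GB-per-million-files rule of thumb comes from the metadata discussion above.

```python
# Rough capacity math for the RAID levels discussed above.
# Assumes equal-size drives; real-world formatting overhead is ignored.

def usable_fraction(level: str, drives: int) -> float:
    """Fraction of raw capacity left for data under each RAID level."""
    if level == "RAID0":
        return 1.0                      # striping only, no protection
    if level == "RAID1":
        return 0.5                      # mirrored pair
    if level in ("RAID3", "RAID5"):
        return (drives - 1) / drives    # one drive's worth of parity
    raise ValueError(f"unknown level: {level}")

def metadata_gb(file_count: int) -> float:
    """Rule of thumb from the text: about 1GB of metadata per million files."""
    return file_count / 1_000_000

# A seven-drive RAID 5 set: 6/7 ≈ 86%, the "up to 85%" efficiency above.
print(round(usable_fraction("RAID5", 7), 2))  # 0.86
print(metadata_gb(10_000_000))                # 10.0 GB for 10 million files
```

Notice how little the metadata actually needs: even a very large SAN is comfortably served by the mirrored pair.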
With this in mind, here is how our 2 RAIDs are set up visually. Notice that one drive in each of our RAID 5 setups is left unused: it acts as a hot spare, and is instantly picked up by the RAID set in case of a failure. Ideally, we do not want to use the remaining four drive bays on the RAID controller where our 2-disk RAID 1 (plus 1 hot spare) for metadata resides. Metadata consists of small files being read and written very frequently – anything else we task that RAID controller with is going to slow down how quickly metadata can be read and written, which means latency on our SAN that we don’t want!
Keep in mind that RAID 0 is traditionally marginally faster than RAID 5 (with the XSR being optimized for RAID 5, the performance is actually very close, and in many cases RAID 5, as strange as this sounds, will be faster than RAID 0). You should only even think about using RAID 0 if you don’t mind losing ALL of your data, and there aren’t many applications where this applies. One example might be using RAID 0 as scratch space for reading and writing rendered files: consider how long it took you to generate those rendered files, and what it would involve to recreate them if needed. If you can shrug your shoulders at losing all that data, then go ahead and play with RAID 0.
Once you’ve set up your RAID, you may have some interest in experimenting with slicing. The idea is that the best performance is found on the outer portion of the disks, so these would be the first slices you create, and assigning a LUN to these outer slices should give you faster reads and writes. Or at least this is a technique that was used extensively in video RAID arrays just a few years ago.
Certainly experiment with this; however, it does require some more advanced administration, and for the trouble it may not give you much of an increase. Remember that if you’re going to use any of the “slower” slices in a different LUN, they will be using the same controller as your “faster” slices, thus canceling the benefits that slicing could give you.
Ed. Note: Since the disks on an XSR can fully saturate the controller in a RAID 5 configuration, you most likely will not find any benefit to using slices in this way. Instead you’ll probably just introduce additional overhead to your controllers.
A LUN is going to be the smallest storage element you work with in Xsan. The abbreviation comes from the SCSI term “logical unit number” – the XSR disks are not SCSI, they are in fact Ultra ATA drives, but the term LUN still applies here.
In reality, a LUN just represents a group of drives, such as one of the RAID 5 arrays we created in the previous section – this is where we explain why we left a hot spare there: not only is it excellent to have a hot spare, but many operating systems get really touchy with volumes larger than 2TB (Mac OS X 10.3.9 and 10.4, however, do not have this issue).
Your LUN was already created when you created your RAID array, but it has not yet been labeled and initialized for use in Xsan. The labeling and initialization is done within Xsan Admin under the Setup tab, in the second section called “LUNs”, or from the command line with /Library/Filesystems/Xsan/bin/cvlabel.
Storage Pools are created from combined LUNs. Data is distributed across the LUNs in a storage pool using a RAID 0 scheme in Xsan – what this means to us is that the more LUNs we put in a pool, the better our access speed. Of course, there are limits to adhere to here – you can put up to 32 LUNs in a Storage Pool, which when you work it out is 16 XSRs and a whole lot of fibre! When creating a Storage Pool there is a list of names that are disallowed by the Xsan Admin software; you can find the list on page 76 of the Xsan Admin Guide. Your pool name can be up to 255 characters, however.

By default, the first pool you create will contain journalling and metadata information. When selecting this in Xsan Admin, there’s a choice of “Any data” or “Journalling and metadata only” – if you select “Any data”, both your metadata and user data will be allowed on this pool. This is not going to be your best option for performance, as you really want that dedicated controller for your metadata. After you’ve set up your first pool with your metadata, an additional choice appears allowing you to choose “User data only”. The other interesting thing to note here is that you also get the choice of making your pool either “Read & Write” or “Read Only”. This I find interesting – if you make your pool “Read Only”, how would you get any data onto that pool, ever?
Ed. Note: The read only option is for use after you have put data on the pool. Archiving is the most common use.
So, md_pool was the first pool I created, and I set it for journalling and metadata only; I then took the 3 LUNs I’d labeled for video and made them into my video_pool.
The Multipath Method choice is fairly straightforward. If you have a dual-link Fibre connection between each client and storage unit, this is where you can choose how Xsan uses the connections: Rotate will give you maximum throughput, while Static will assign each LUN in a volume to one of the connections when the volume is mounted.
Finally, there’s the Stripe Breadth and block size to look at in a Storage Pool. By default the stripe breadth is set to 256 blocks, which has generally been found to be the best for overall performance; however, you may find that “tuning” this number a bit suits your application better. Basically, Xsan uses the storage pool stripe breadth and the volume block allocation size to decide how to write data to a volume. The block allocation size multiplied by the stripe breadth should equal 1MB (1,048,576 bytes, to be precise), which is the optimal transfer size for XSR systems. So, a couple of examples to illustrate this:
If you have an application that you know writes small blocks of data, e.g. 8KB blocks, you may want to try adjusting the block allocation size using the following formula. (Stripe breadth is set in the Storage Pools section, while Block Allocation Size, with a default of 4KB, is set when you create your Volume; this is discussed in the next section.)
stripe breadth (number of blocks) = transfer size (bytes) / block allocation size (bytes)
stripe breadth = 1048576 / 8192
which gives us a stripe breadth of 128.
On the other hand, if you have an application that you know writes large blocks of data, you could choose the largest block allocation size (512KB) and use the equation to find a stripe breadth of 2.
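The arithmetic above is simple enough to sketch as a small helper; a minimal example in Python (the function name is mine, the 1MB optimal transfer size is from the text):

```python
# Compute the stripe breadth that pairs with a given block allocation
# size so that one full stripe moves the XSR's optimal 1MB transfer.

OPTIMAL_TRANSFER = 1_048_576  # 1MB in bytes

def stripe_breadth(block_allocation_size: int) -> int:
    """stripe breadth = transfer size / block allocation size."""
    return OPTIMAL_TRANSFER // block_allocation_size

print(stripe_breadth(8 * 1024))    # 8KB blocks   -> 128
print(stripe_breadth(512 * 1024))  # 512KB blocks -> 2
print(stripe_breadth(4 * 1024))    # 4KB default  -> 256, the default breadth
```

Note that the default 4KB block allocation size gives exactly the default stripe breadth of 256 – the two defaults were chosen to hit the 1MB transfer size together.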
Really the only way to find what setup is going to work best for you is by testing. Remember the equation, and see what sort of performance you get. It is important to note that every time you change the stripe breadth all data on the storage pool and the volume to which it belongs is lost, so testing this before setting up your final Xsan is crucial.
Finally, Volumes are what your Xsan users will actually see – from their perspective the SAN volume looks and behaves exactly like a large local disk, except of course that the volume can grow as you add more storage pools, and that other users can access the volume at the same time, much as your users would be familiar with on a mounted AFP volume.
A Volume is made up of combined storage pools, with a maximum of 512 pools per volume, which also gives us a maximum of 512 LUNs in a Volume. While we’re talking maximums, how about the maximum number of files in a volume? You’re looking at 4,294,967,296 files, which in Panther can take up a maximum of 16TB per volume; in Tiger this goes up to petabytes. Just one more maximum for you: the name of your volume can be up to 70 characters long.
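As a side note, the Panther figure lines up neatly with the file-count limit multiplied by the default 4KB block allocation size. This derivation is my own inference, not something stated in the Xsan documentation:

```python
# 2^32 addressable items times the default 4KB block size works out to
# exactly 16TB (binary terabytes) -- a plausible source of the limit.
MAX_FILES = 2**32          # 4,294,967,296 -- the file-count maximum above
DEFAULT_BLOCK = 4 * 1024   # default 4KB block allocation size

print(MAX_FILES)                            # 4294967296
print(MAX_FILES * DEFAULT_BLOCK // 2**40)   # 16 (TB)
```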
We’ll cover storage affinities in another article since they can get rather complex. In a nutshell, they allow you to create folders in the Xsan volume that are backed by specific LUNs. This is most commonly used to allow your faster disk arrays to be used for faster storage while your slower arrays are used for less speed-intensive uses.
So, after setting up your volume name and the Block Allocation Size that we discussed earlier, we have to think about our allocation strategy, which determines how data is going to be written to our storage pools. There are three allocation strategies to choose from: Round Robin, Fill and Balance. These strategies apply to all data saved to a volume, except for data that has an affinity assigned to it.
With the Round Robin allocation strategy, Xsan allocates space for successive writes to each available storage pool in turn – this is going to be your highest-performance option, as it makes use of every controller in your volume.
The Fill allocation strategy allocates all new writes to the first storage pool in the volume until that storage pool is full, and then moves on to the next pool. If you’re experimenting with the slices we mentioned above for speed, this may be an option for you, as you can have it fill the outer/faster slices first before moving on to the slower portions.
With the Balance allocation strategy, Xsan analyzes the available storage pools and allocates new writes to the storage pool with the most free space.
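To make the difference between the three strategies concrete, here’s a toy model in Python. The pool names and sizes are invented, and real Xsan allocates space at a much finer grain than whole writes; this only illustrates the order in which pools get chosen.

```python
# Toy simulation of Xsan's three allocation strategies.
# pools: dict mapping pool name -> free space; writes: list of write sizes.

def allocate(pools, writes, strategy):
    """Return the pool chosen for each write under the given strategy."""
    order = list(pools)
    chosen = []
    for i, size in enumerate(writes):
        if strategy == "round_robin":      # each available pool in turn
            name = order[i % len(order)]
        elif strategy == "fill":           # first pool until it is full
            name = next(p for p in order if pools[p] >= size)
        elif strategy == "balance":        # pool with the most free space
            name = max(order, key=lambda p: pools[p])
        else:
            raise ValueError(strategy)
        pools[name] -= size
        chosen.append(name)
    return chosen

writes = [10, 10, 10, 10]
print(allocate({"pool_a": 100, "pool_b": 100}, writes, "round_robin"))
# ['pool_a', 'pool_b', 'pool_a', 'pool_b']
print(allocate({"pool_a": 100, "pool_b": 100}, writes, "fill"))
# ['pool_a', 'pool_a', 'pool_a', 'pool_a']
```

Round Robin spreads the load over both controllers on every write, which is why it is the fastest option; Fill concentrates everything on the first pool until it runs out of room.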
Everything we’ve just discussed above are the parts that go into a SAN. In short, a SAN is a Storage Area Network – a network whose primary purpose is the transfer of data between computer systems (clients & servers) and storage elements, and between storage elements themselves. Xsan is a combination of one (or preferably more) controllers, storage volumes and storage clients, all of which are high speed, and the storage itself is easily expandable. We’ve only discussed one volume within our example SAN here; however, you can have multiple volumes within Xsan. Just remember that Xsan has a limit of 64 controllers and clients combined – your metadata controllers and any client attached to the SAN count toward this number, but storage elements do not.
If you’re new to SAN technology, it’s important to note that not all SANs are high-speed multi-user systems; it’s the Xsan software in combination with the XSR units that makes this such an efficient system. Most I/O card manufacturers have speed test tools that you can use to test the throughput of your SAN. Some require the card to be installed, others do not. These are excellent for helping you work out the best settings for your particular environment.
Before purchasing any equipment it would be best to check with the manufacturer or the Apple Xsan site to make sure things will work with Xsan.
Hopefully this helps clear up some of the questions that are out there about all the parts that make up Xsan, and how to best go about setting up your own Xsan. The biggest message that anyone can send out about all of this is to test thoroughly before setting up your production Xsan environment. A lot of the things we discussed here can’t be changed without drastically altering your SAN, and as a result data will be lost. Find the settings that best suit your environment first, test several variations, and only when you feel confident that you’ve got the best setup and performance should you deploy.