Introduction
This is a guide to Linux software RAID under the 2.4 and 2.6 kernels. If you don't know what RAID is, I suggest you read Everything2's excellent entry on the subject. Linux software RAID consists of two components:
- The kernel subsystem, also known as the Linux software RAID drivers or modules. Linux provides a module for each of the supported RAID levels (also known as personalities) and a module that handles communication between those personality modules and the Linux block layer.
- A userspace utility to create and manage Linux software RAID arrays. There are currently two popular packages that fill this role: raidtools and mdadm. Both will be covered here.
Bootloader Considerations
Since Linux software RAID arrays are implemented in software, bootloaders such as GRUB and LILO generally cannot read from them. There exist patches for some bootloaders that allow booting from certain RAID levels, but generally it's recommended that you provide one plain ol' partition for your /boot directory to avoid making your Linux installation unbootable. In the same vein, if your root partition is on a software RAID array, the Linux software RAID subsystem must be initialized before the kernel attempts to mount the root partition. This means that the RAID subsystem and the drivers for the block devices that make up the root partition's array must be either compiled into the kernel or loaded from an initial RAM disk. Check your distribution's documentation regarding this.
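One quick way to see which personalities the running kernel already has registered is the Personalities line at the top of /proc/mdstat. The helper below is a minimal sketch: the function name is made up for illustration, and the optional file argument is only there so it can be tried against a saved copy of the file.

```sh
# list_personalities [FILE]
# Print one registered RAID personality per line, as reported by the
# "Personalities :" line of /proc/mdstat (or of FILE, if given).
list_personalities() {
    file=${1:-/proc/mdstat}
    if [ ! -r "$file" ]; then
        echo "cannot read $file (is the md driver loaded?)" >&2
        return 1
    fi
    # Strip the label and the square brackets, then put each name on its own line.
    sed -n 's/^Personalities : *//p' "$file" | tr -d '[]' | tr -s ' ' '\n' | sed '/^$/d'
}
```

If the personality your root array needs is missing from that list once the system is up, it was neither compiled in nor loaded, and the same will be true at boot time unless your initial RAM disk loads it.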
Performance Considerations
The IDE protocol only allows one device per cable to "talk" at a time; therefore, it is highly recommended that only one IDE device per cable be part of an array (including spares). Otherwise, the array's performance will suffer greatly. SCSI does not suffer from this problem, and although the SATA protocol allows more than one device per cable, there are currently no SATA controllers, cables, or devices that support it. SCSI has a limitation of its own, though: the standard PCI bus allows, at most, approximately 133MB/s of traffic across it, while SCSI controllers commonly support speeds as high as 160 and 320MB/s. If you don't want your Linux software RAID arrays to max out at the speed of the PCI bus, a faster bus such as PCI-64, PCI-X, or PCIe is necessary. SATA shares this limitation, as the protocol allows speeds up to 150MB/s. IDE does not, as its maximum speed is 133MB/s.
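To put a number on the bus limit, the sketch below divides a bus's bandwidth by a per-drive throughput. The 133MB/s figure comes from the paragraph above; the 50MB/s per-drive rate in the example call is a round number I picked for illustration, not a measurement, and the function name is hypothetical.

```sh
# drives_to_saturate BUS_MBPS DRIVE_MBPS
# Print the smallest number of drives whose combined sequential
# throughput meets or exceeds the bus bandwidth (integer ceiling division).
drives_to_saturate() {
    bus=$1
    drive=$2
    if ! [ "$drive" -gt 0 ] 2>/dev/null; then
        echo "drive throughput must be a positive integer" >&2
        return 1
    fi
    echo $(( (bus + drive - 1) / drive ))
}

# Standard PCI (133MB/s) with drives assumed to sustain 50MB/s each.
drives_to_saturate 133 50    # prints 3
```

In other words, under that assumption a third drive is already enough to make the bus, not the disks, the bottleneck.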
Linux software RAID directly supports only the following RAID levels: RAID-0, RAID-1, RAID-4, RAID-5, RAID-6, and JBOD (called linear mode). RAID levels 10, 50, 0+1, and 0+5 can be faked by using existing RAID arrays as the devices in other arrays. RAID-2 is so complicated to implement that it doesn't even show up in commercial hardware RAID offerings, and RAID-3 isn't worth implementing if RAID-4 and RAID-5 are available.
The Linux software RAID subsystem requires that you configure a chunk size for RAID levels 0, 4, 5, and 6. The chunk size is given in kilobytes and must be a power of two; it defines how much contiguous data is written to one device before the array moves on to the next, and therefore the size of the array's stripes. If the chunk size is too small, the RAID subsystem will have to spend a lot of time talking to the Linux kernel block layer and, in the case of RAID levels 4, 5, and 6, computing parity information. If the chunk size is too large, the RAID subsystem will spend too much time waiting on the block layer. There is no easy way to determine a good chunk size for a particular array because it depends heavily on the hardware being used and the intended use of the array. However, there are a few guidelines which should be helpful:
- The chunk size should never be smaller than the block size of the filesystem that you plan to put on the array. For ext2/3, the block size is configurable at creation time but is generally the page size of the architecture, which for x86 is 4K (a chunk size value of 4). Other filesystems and architectures have different block sizes and page sizes. However, the chunk size need not equal the filesystem's block size, and it shouldn't be less than 32 unless you've got slow (read: anything slower than PCI, such as ISA) hardware.
- The chunk size should generally be smaller if the files you intend to store on the array are relatively small. For example, a common use of RAID-0 arrays in the ISP world is as high-performance storage for caching HTTP proxies such as Squid, and since web pages (and all the associated files such as images and stylesheets) are relatively small, a chunk size of 32 or 64 would be appropriate. On the other hand, an array used for storing relatively large files such as music or video should have a relatively large chunk size such as 256 or even 512. However, RAID arrays with a chunk size larger than 512 tend to perform poorly due to the latency involved with passing 1MB or more of data around in the kernel.
- SCSI generally implies a higher chunk size, but not always. This means that if you're debating between 128 and 256 and have SCSI, go with 256. SCSI's performance will make up the difference, if any.
- Benchmarking on the Linux kernel mailing list indicates that for RAID-0 and RAID-4, a chunk size of 32 is a good starting point. For RAID-5 and RAID-6, the "happy medium" is a chunk size of 128.
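The guidelines above are mechanical enough to check in a few lines of shell. This is a minimal sketch, not part of either RAID package: the function name is hypothetical, the thresholds (32 and 512) are the ones quoted in the list, and the power-of-two test reflects my understanding of what the tools accept rather than anything the list itself states.

```sh
# check_chunk CHUNK_KB FS_BLOCK_BYTES
# Check a candidate chunk size (in kilobytes) against the guidelines above.
# Prints "ok", a warning, or a reason for rejection. Returns non-zero only
# when the value is not a power of two or is smaller than the filesystem block.
check_chunk() {
    chunk=$1
    block=$2
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    if [ "$chunk" -le 0 ] || [ $(( chunk & (chunk - 1) )) -ne 0 ]; then
        echo "invalid: $chunk is not a power of two"
        return 1
    fi
    if [ $(( chunk * 1024 )) -lt "$block" ]; then
        echo "invalid: chunk is smaller than the ${block}-byte filesystem block"
        return 1
    fi
    if [ "$chunk" -lt 32 ]; then
        echo "warning: below 32 is only sensible on slow hardware"
    elif [ "$chunk" -gt 512 ]; then
        echo "warning: above 512 tends to perform poorly"
    else
        echo "ok"
    fi
}

# A 128K chunk under a filesystem with 4096-byte blocks.
check_chunk 128 4096    # prints ok
```

It says nothing about whether 64 beats 128 for your workload; only benchmarking on your own hardware can answer that.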
The Persistent Superblock
What if your root partition is a RAID array, and its configuration is stored on itself? The solution to this chicken-and-egg problem is the persistent superblock option. When you create an array with the persistent superblock option, the RAID subsystem writes the array's configuration to a special area on each of the devices in the array so that the RAID subsystem can later -- such as after a reboot -- re-read the configuration directly from the array's devices rather than having to read it off the filesystem. There is, however, one more stipulation: the RAID subsystem will only automatically scan partitions on hard disks of type Linux RAID autodetect (fdisk type "fd") for persistent superblocks. Even if you create an array of three raw disks with a persistent superblock, the RAID subsystem will not automatically scan those devices for persistent superblocks.
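To double-check which partitions the kernel would actually scan, you can filter fdisk's partition listing for type "fd" entries. A minimal sketch follows; it assumes the traditional MBR-style `fdisk -l` listing, where such partitions are described as "Linux raid autodetect", and the function name is made up for illustration.

```sh
# raid_autodetect_partitions
# Read `fdisk -l` output on stdin and print the device name of every
# partition whose type is reported as "Linux raid autodetect" (type fd).
raid_autodetect_partitions() {
    awk 'tolower($0) ~ /linux raid autodetect/ { print $1 }'
}

# Typical use, as root:
#   fdisk -l /dev/sda | raid_autodetect_partitions
```

Any array member that doesn't show up in the output will not be picked up by autodetection, no matter what its superblock says.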
Failover Considerations
Linux software RAID supports failover for RAID-1, RAID-4, RAID-5, and RAID-6 via spares. The RAID subsystem also supports hot-pluggable devices.
Configuration Examples
The first step in creating a Linux software RAID array is determining which devices will be used in the array. Every RAID level except linear mode requires that each device in an array, including spares, be approximately the same size; the RAID subsystem can account for small discrepancies. Partitions on hard disks should be of type Linux RAID autodetect (fdisk type "fd"). Instead of using a raw hard disk, create a Linux RAID autodetect partition that spans the whole disk and use that. It's a waste of time to format the devices that make up your RAID arrays with filesystems; the RAID subsystem will overwrite the formatting when it creates the arrays. You should only be formatting the arrays themselves with filesystems.
The standard device naming for RAID arrays is /dev/mdN where N is an integer starting at 0. So, your first RAID array should be /dev/md0, your second /dev/md1, and so on.
The raidtools package requires an entry in /etc/raidtab for every array. The format of an entry is as follows:
# Comments start with a #
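raiddev <device name>
raid-level <linear|0|1|4|5|6>
nr-raid-disks <number of primary devices in the array>
[nr-spare-disks <number of spares in the array>]
chunk-size <chunk size>
persistent-superblock 1
device <first primary device>
raid-disk 0
device <second primary device>
raid-disk 1
[continue the pattern for each primary device in the array]
[device <first spare device>
spare-disk 0
[continue the pattern for each spare device in the array]]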
Brackets mean the directive can be omitted where unnecessary; e.g., don't specify nr-spare-disks and spares for a RAID-0 array.
You'll notice that chunk size is required even for arrays that don't care about it. Here's an example /etc/raidtab with two arrays:
# Example /etc/raidtab
raiddev /dev/md0
raid-level linear # A linear mode array
nr-raid-disks 2
chunk-size 128
persistent-superblock 1
device /dev/hda3
raid-disk 0
device /dev/hdc1
raid-disk 1
raiddev /dev/md1
raid-level 5 # A RAID-5 array
nr-raid-disks 4
nr-spare-disks 2
chunk-size 256
persistent-superblock 1
device /dev/sda1
raid-disk 0
device /dev/sdb1
raid-disk 1
device /dev/sdc1
raid-disk 2
device /dev/sdd1
raid-disk 3
device /dev/sde1
spare-disk 0
device /dev/sdf1
spare-disk 1
An example /etc/raidtab with a linear array and a RAID-5 array
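Because every entry follows the same pattern, it can be generated rather than typed. The function below is a sketch of that idea, not a raidtools feature: its name and argument order are my own, and it assumes you list the primary devices first and the spares last.

```sh
# raidtab_entry MD_DEVICE LEVEL CHUNK_KB NR_SPARES DEVICE...
# Print an /etc/raidtab entry. The last NR_SPARES devices become spares;
# the rest are primary devices, numbered from 0 in the order given.
raidtab_entry() {
    md=$1; level=$2; chunk=$3; spares=$4
    shift 4
    primaries=$(( $# - spares ))
    echo "raiddev $md"
    echo "raid-level $level"
    echo "nr-raid-disks $primaries"
    if [ "$spares" -gt 0 ]; then
        echo "nr-spare-disks $spares"
    fi
    echo "chunk-size $chunk"
    echo "persistent-superblock 1"
    i=0
    for dev in "$@"; do
        echo "device $dev"
        if [ "$i" -lt "$primaries" ]; then
            echo "raid-disk $i"
        else
            echo "spare-disk $(( i - primaries ))"
        fi
        i=$(( i + 1 ))
    done
}

# Reproduce the RAID-5 entry from the example above.
raidtab_entry /dev/md1 5 256 2 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
```

Review the output before appending it to /etc/raidtab; the function does no checking of its own.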
/etc/raidtab only tells the programs in the raidtools package what your RAID configuration should be. To actually create an array, you must run mkraid:
mkraid <device name eg. /dev/md0>
To make active or "start" the newly created array, you must run raidstart:
raidstart <device name eg. /dev/md0>
If you need to manually stop an array, run raidstop:
raidstop <device name>
Gentoo Linux users: If you create /etc/raidtab during installation on the LiveCD, make sure you copy it to the root partition of your new install:
cp /etc/raidtab /mnt/gentoo/etc/raidtab
mdadm doesn't require a configuration file to create arrays. Instead, when you create an array, you specify all its options on the command line:
mdadm --create <device name> --chunk=<chunk size> --level=<linear|0|1|4|5|6> --raid-devices=<number of devices in the array> [--spare-devices=<number of spares in the array>] <all the devices in the array, including spares>
mdadm always creates arrays with persistent superblocks. Here's how to create the two arrays in the raidtools example above but with mdadm:
mdadm --create /dev/md0 --chunk=128 --level=linear --raid-devices=2 /dev/hda3 /dev/hdc1
Create a linear mode array with two devices, /dev/hda3 and /dev/hdc1, with a chunk size of 128K
mdadm --create /dev/md1 --chunk=256 --level=5 --raid-devices=4 --spare-devices=2 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
Create a RAID-5 array with four devices and two spares with a chunk size of 256K
mdadm doesn't differentiate between primary devices and spares at array creation time; it simply picks the number of devices specified in the --raid-devices option from the device list and uses them as the initial devices in the array. The others are the spares. mdadm also automatically starts arrays at creation time, but if for some reason you must start an array manually with mdadm, then you run it in assemble mode:
mdadm --assemble <device name> <device list including spares>
Run this to manually stop an array:
mdadm --stop <device name>
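Since mdadm treats the first --raid-devices entries of the device list as the primary devices and the rest as spares, the order of the list matters. One way to avoid miscounting is to let a small wrapper assemble the --create command line and print it for review. This is only a sketch; the function name and argument order are hypothetical, and it prints the command rather than running it.

```sh
# mdadm_create_cmd MD_DEVICE LEVEL CHUNK_KB NR_SPARES DEVICE...
# Print (but do not run) the mdadm --create command for the given devices.
# The trailing NR_SPARES devices in the list end up as the spares.
mdadm_create_cmd() {
    md=$1; level=$2; chunk=$3; spares=$4
    shift 4
    cmd="mdadm --create $md --chunk=$chunk --level=$level --raid-devices=$(( $# - spares ))"
    if [ "$spares" -gt 0 ]; then
        cmd="$cmd --spare-devices=$spares"
    fi
    echo "$cmd $*"
}

# Print the command for the RAID-5 array created earlier.
mdadm_create_cmd /dev/md1 5 256 2 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
```

Once the printed command looks right, run it as root.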
You can avoid specifying the device list for assemble mode on the command line by configuring your arrays in mdadm's configuration file, /etc/mdadm.conf. /etc/mdadm.conf has two directives for doing this, DEVICE and ARRAY:
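DEVICE <space-separated list of devices; shell-style wildcards are allowed>
ARRAY <array device name> devices=<comma-separated list of the devices in the array>
Syntax for specifying RAID arrays and the devices they're composed of in /etc/mdadm.conf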
/etc/mdadm.conf also allows you to specify an ARRAY by UUID or superblock minor number; read the examples in /etc/mdadm.conf or mdadm.conf's man page. Multiple DEVICE and ARRAY entries are allowed as long as you don't specify duplicate devices or arrays or configure more than one array with the same device.
So, for the two arrays we created above, our /etc/mdadm.conf should look like this:
DEVICE /dev/hda3 /dev/hdc1
DEVICE /dev/sd[abcdef]1
ARRAY /dev/md0 devices=/dev/hda3,/dev/hdc1
ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1,/dev/sde1,/dev/sdf1
Example /etc/mdadm.conf for a linear mode array with two devices and a RAID-5 array with six devices (four primary and two spares)
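A device listed under two arrays is an easy mistake to make when editing the file by hand. The sketch below looks for it. It is not an mdadm feature; the function name is made up, and it assumes ARRAY lines start in the first column, are written in upper case, and use the devices= form shown above.

```sh
# duplicate_array_devices [FILE]
# Print every device that appears more than once across the devices=
# options of the ARRAY lines in FILE (default: /etc/mdadm.conf).
duplicate_array_devices() {
    file=${1:-/etc/mdadm.conf}
    grep '^ARRAY' "$file" |
        sed -n 's/.*devices=\([^[:space:]]*\).*/\1/p' |
        tr ',' '\n' |
        sort | uniq -d
}
```

No output means no device is claimed twice.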
Monitoring your RAID arrays
The Linux software RAID subsystem can show you the status of all your arrays in /proc/mdstat. However, the information is terse, at best, so both mdadm and raidtools provide a way to fetch the information in a more user-friendly manner:
mdadm --detail /dev/md0
lsraid -a /dev/md0
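If all you want from /proc/mdstat is a yes-or-no answer, its terse format is easy enough to parse. The sketch below prints the arrays whose status map (the [UU] part, where an underscore stands for a missing or failed device) is not all U. The function name is hypothetical, and it only recognizes arrays named mdN.

```sh
# degraded_arrays [FILE]
# Print the name of every array in /proc/mdstat (or FILE) whose member
# status map, e.g. [U_], shows at least one missing or failed device.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdc1[1] sdb1[0]".
        /^md[0-9]+ :/ { name = $1 }
        # Status line carrying the [UU] map for the array named above.
        name != "" && /\[[U_]+\]/ {
            if ($0 ~ /\[[U_]*_[U_]*\]/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Linear and RAID-0 arrays have no status map, so they never show up in the output.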
You could conceivably monitor your RAID arrays by running either of these commands as a cron job, but that would generate quite a bit of unnecessary e-mail if you wanted to be notified of an array failure ASAP. mdadm solves this problem for us by providing a monitor mode that probes for changes in the status of your arrays and can e-mail you when such an event occurs. Here's how to do that:
mdadm --monitor --mail=<your e-mail address> --delay=<time between probes in seconds> <device name(s)>
So, if I wanted to monitor /dev/md0 and /dev/md2, probing every 30 seconds and e-mailing me (xunil@theanykey.com) when something happens, here's what I would run:
mdadm --monitor --mail=xunil@theanykey.com --delay=30 /dev/md0 /dev/md2
If you want to monitor all your arrays, you can substitute --scan for the device names like so:
mdadm --monitor --mail=xunil@theanykey.com --delay=30 --scan
mdadm won't daemonize (i.e., run in the background) by default when it enters monitor mode, so you should either run it in the background manually (check your shell's documentation on how to do this, but it's almost certainly by appending an & to the end of the command) or in a screen session. The e-mail mdadm sends when a probe detects an event will lack any details, so you'll have to investigate with the commands above to figure out what to do about the event.
raidtools does not offer anything similar to this.
Recovering from drive failure
So, you've received an e-mail from mdadm saying something's up with /dev/md0, a RAID-1 array with two primary drives (/dev/sdb1 and /dev/sdc1) and one spare (/dev/sdd1). What do you do next? First, investigate what's up by using either mdadm --detail /dev/md0 or lsraid -a /dev/md0. Let's suppose the primary drive /dev/sdb1 has failed. The spare /dev/sdd1 will have taken its place automatically, and the array should be resyncing if it hasn't finished yet. Next, you have to remove the faulty drive from the array:
mdadm /dev/md0 -r /dev/sdb1
or if you're using raidtools:
raidhotremove /dev/md0 /dev/sdb1
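Now physically replace the faulty drive. If you don't have the luxury of hot-swappable drives, don't worry: the array will survive the reboot. If the array is built from partitions, partition the replacement drive the same way as the old one (type "fd"). Then re-add the drive to the array: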
mdadm /dev/md0 -a /dev/sdb1
or if you're using raidtools:
raidhotadd /dev/md0 /dev/sdb1
The RAID subsystem will add the drive back to the array as the spare.
What if you're not using spares? The RAID subsystem will mark the array as "degraded." The process for replacing the faulty drive is the same, but your arrays will not perform as well while the faulty drive remains in place (hence the name, degraded). Also, resyncing will be performed after the faulty drive has been replaced.
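Either way, it's worth knowing when the resync has finished before you touch anything else. While a resync or rebuild is running, /proc/mdstat shows a progress line containing "resync =" or "recovery =", so a script can simply poll for it. The two helpers below are a sketch; their names and arguments are my own.

```sh
# sync_in_progress [FILE]
# Succeed (exit 0) if any array in /proc/mdstat (or FILE) is currently
# resyncing or rebuilding onto a replacement or spare device.
sync_in_progress() {
    grep -Eq '(resync|recovery) *=' "${1:-/proc/mdstat}"
}

# wait_for_sync [INTERVAL_SECONDS] [FILE]
# Poll every INTERVAL_SECONDS (default 10) until no resync or rebuild
# is reported any more.
wait_for_sync() {
    interval=${1:-10}
    while sync_in_progress "${2:-/proc/mdstat}"; do
        sleep "$interval"
    done
}
```

Running `wait_for_sync 30` after re-adding the drive blocks until the array has settled, which is more reliable than guessing at a sleep duration.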
Simulating drive failure
The following is a script that uses loopback devices to simulate drive failure in a RAID-1 array. Run all the commands as root. It requires 300M of free space on the partition /tmp is on.
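# Load the md driver and create three 100M files to act as disks.
modprobe md
dd if=/dev/zero of=/tmp/lo-raid-1.img bs=1M count=100
cp /tmp/lo-raid-1.img /tmp/lo-raid-2.img
cp /tmp/lo-raid-1.img /tmp/lo-raid-3.img
# Attach each file to a loopback device.
losetup /dev/loop0 /tmp/lo-raid-1.img
losetup /dev/loop1 /tmp/lo-raid-2.img
losetup /dev/loop2 /tmp/lo-raid-3.img
# Create a RAID-1 array with two primary devices and one spare, then give the initial sync time to finish.
mdadm --create /dev/md0 --chunk=128 --level=1 --raid-devices=2 --spare-devices=1 /dev/loop0 /dev/loop1 /dev/loop2
sleep 60
mdadm --detail /dev/md0
# Put a filesystem on the array, mount it, and write to it.
mke2fs /dev/md0
mkdir -p /mnt/tmp
mount /dev/md0 /mnt/tmp
touch /mnt/tmp/testing
# Mark the first device as faulty; the spare should take over and the array should resync.
mdadm --manage /dev/md0 --set-faulty /dev/loop0
sleep 60
mdadm --detail /dev/md0
touch /mnt/tmp/testing
# Remove the faulty device from the array.
mdadm /dev/md0 -r /dev/loop0
mdadm --detail /dev/md0
touch /mnt/tmp/testing
sleep 30
# Re-add the "replaced" device; it should come back as the spare.
mdadm /dev/md0 -a /dev/loop0
mdadm --detail /dev/md0
touch /mnt/tmp/testing
# Unmount and check that the filesystem survived.
umount /mnt/tmp
fsck -C /dev/md0
if [ $? -ne 0 ]; then echo "SnackTray's an idiot; this crap didn't work."; fi
# Clean up: stop the array, detach the loopback devices, and delete the image files.
mdadm --stop /dev/md0
losetup -d /dev/loop0
losetup -d /dev/loop1
losetup -d /dev/loop2
rm /tmp/lo-raid-?.img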
This article is ©2008 by the respective authors. Reproduction is prohibited without express permission from all contributors.