dean's linux raid wishlist

$Id: raid-wishlist.html,v 1.9 2006/04/24 01:47:43 dean Exp $

here's my wishlist for enhancements i'd like to see in the linux raid subsystem. i thought it'd be interesting to share... if i had more money than i knew what to do with, i'd fund someone to work on this stuff. alas. :)

overall i'm pretty damn happy with the systems i've been able to build with linux md. maintaining these systems over time, and through various forms of failure, has given me a few battle scars... those scars are what led to this wishlist.

the most common failures i experience are completely recoverable -- it's been rare to experience a completely dead disk. the common problems which lead to failures are:

  • sector read error -- this is the sign of a disk needing replacement, but it's not hopeless yet.

  • communication loss -- bad cabling, bad controllers, or whatever else causes a disk to become unreachable until a reboot, after which it works fine again.

wishlist

logging raid

send writes to a log first, sync the log, then ack the upper layers. play the log against the raid in the background.

if there's a system crash then it is sufficient to replay the logs in order to get the raid back in sync. note that such a log replay is in general more accurate than a resync or reconstruct -- because in a resync/reconstruct it's not guaranteed that the resulting data will be the most recently ack'd copy. (consider that resync/reconstruct need to select from several permutations of disks to decide what is the "master" copy of the data).
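
to make the ordering concrete, here's a toy user-space sketch in python. the record format and the file-based "log" and "array" are made up for illustration -- the real thing would live in the md driver and log to a dedicated (ideally mirrored) device:

    import os, struct

    REC = struct.Struct("<QI")               # per-record header: (offset, length)

    def logged_write(log_fd, offset, data):
        # 1. append the record to the log, 2. sync the log...
        os.write(log_fd, REC.pack(offset, len(data)) + data)
        os.fsync(log_fd)
        # 3. it is now safe to ack the upper layer; the array itself gets the
        #    write later, when the log is played against it in the background.
        return True

    def replay_log(log_path, array_path):
        # after a crash (or as the background apply step) play the log against
        # the array in order, so the array ends up holding the most recently
        # ack'd copy of every block the log covers.
        with open(log_path, "rb") as log, open(array_path, "r+b") as arr:
            while True:
                hdr = log.read(REC.size)
                if len(hdr) < REC.size:
                    break                    # end of log
                offset, length = REC.unpack(hdr)
                data = log.read(length)
                if len(data) < length:
                    break                    # torn final record: it was never ack'd
                arr.seek(offset)
                arr.write(data)
            arr.flush()
            os.fsync(arr.fileno())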

logs can address the "temporary communication loss" problem as well. when disk communication is restored, the raid software can replay the log and then place the array back into service automatically. today communication loss can easily require some manual intervention.

it should be possible to place the log on any block device -- i would expect to use either a mirrored pair of nvram devices, or a mirrored pair of disks. (a disk used exclusively for a single log has very high locality and seek latency is almost non-existent... it's faster to ack writes on such a disk than it is on a larger raid5.)

i understand that linux-2.6 will improve resync/reconstruct by saving a "progress indicator" so that the process can be restarted after a reboot. this has been a terrible source of headache on linux-2.4. but overall my wish is for logging techniques so that huge arrays will be even more feasible.

apparently some thought has been put into this effort.

delay resync/reconstruct boot option

there have been many times when i've been frustrated by the resync/reconstruct beginning immediately at md startup. the typical situations where this is undesirable include dealing with a disk failure, or dealing with some other system problem that requires lots of reboots and power cycles.

consistency check / repair tool

just like fsck is still useful with logging filesystems, it would be nice to have a consistency checking / repair tool -- no matter how hard we've tried to prove that a raid will never end up in an inconsistent state.

wishes which have been granted

thanks!

raid6

update: hpa has made progress on raid6, and it is merged in the kernel as of 2.6.3.

hpa has developed a second redundancy function using arithmetic in the galois field GF(2^8). his redundancy function is orthogonal to standard XOR-based parity. using them in concert allows for two disks of redundancy -- a very desirable situation. he has demonstrated assembly implementations which generate the two functions in parallel at speeds comparable to generating parity alone.
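
for the curious, here's a minimal python sketch of the two-syndrome arithmetic. it assumes the GF(2^8) representation from hpa's raid6 paper (reduction polynomial 0x11d, generator {02}); plain byte-at-a-time python, purely for illustration -- the real implementations do this with vectorized assembly:

    def gf_mul2(b):
        # multiply a GF(2^8) element by {02}, reducing by the raid6 polynomial
        b <<= 1
        if b & 0x100:
            b ^= 0x11d
        return b & 0xff

    def pq_syndromes(chunks):
        # P is plain xor parity; Q is the weighted sum D_0 + g*D_1 + g^2*D_2 + ...
        # with g = {02}, evaluated by horner's rule (walk the disks high to low).
        p = bytearray(len(chunks[0]))
        q = bytearray(len(chunks[0]))
        for chunk in reversed(chunks):
            for i, byte in enumerate(chunk):
                p[i] ^= byte
                q[i] = gf_mul2(q[i]) ^ byte
        return bytes(p), bytes(q)

    # losing one data disk can be repaired from P alone (xor); losing two disks
    # needs both P and Q plus a little field algebra to solve for the pair.
    p, q = pq_syndromes([b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"])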

partially failed disks

update: NeilBrown has implemented a fix for this -- and i believe it has been merged as of 2.6.15. the kernel now will reconstruct a sector after a read fails.

the "sector read error" problem is not generally fatal for a disk. in fact if the error-causing sectors are rewritten then the disk may correct the error, or it can reallocate the sectors from spare sectors designed for this purpose. a read error is often a sign of disk age and SMART should be consulted to decide on whether the drive should be replaced.

linux raid treats read errors as fatal for a disk and takes the disk offline. once enough disks are offline the raid is taken offline. this results in an unfortunate array failure mode:

consider this 4 disk raid5 with D = data, P = parity, and X = read error, in the following sectors:

          disk0   disk1   disk2   disk3
  stripe0   D       D       D       P
  stripe1   D       D       P       D
  stripe2   D       P       X       D
  stripe3   X       D       D       D

it is obvious that stripe2 can be reconstructed using disk0, disk1, and disk3. stripe3 can be reconstructed using disk1, disk2, and disk3.
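
to make that concrete, here's a tiny python sketch of the reconstruction. the chunk contents are made up, and each byte string stands in for a whole md chunk:

    def reconstruct_missing(readable_chunks):
        # xor together every chunk in the stripe that could be read (data and
        # parity alike); the result is the single chunk that could not be read.
        missing = bytearray(len(readable_chunks[0]))
        for chunk in readable_chunks:
            for i, b in enumerate(chunk):
                missing[i] ^= b
        return bytes(missing)

    # stripe2 from the table: disk2 is unreadable, so feed in disk0 and disk3
    # (data) plus disk1 (parity) and get disk2's data back.
    d0 = b"\x11\x22\x33\x44"
    d2 = b"\xde\xad\xbe\xef"                       # pretend this can't be read
    d3 = b"\x55\x66\x77\x88"
    parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d2, d3))   # what disk1 holds
    assert reconstruct_missing([d0, parity, d3]) == d2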

there are two improvements to be made:

  • on read error, generate a kernel log message, and attempt to reconstruct the data which couldn't be read. then attempt to rewrite the data. this keeps the array online, and gives the user the opportunity to replace the disk without losing data redundancy. (see the sketch after this list.)

  • on write error mark the drive as read-only and begin recovery onto a spare. if the array has insufficient writable disks to continue operation then mark the array read-only. at this point almost the entire array is readable and can be streamed off onto new disks.
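
and here's a rough, self-contained python sketch of the first improvement. the FlakyDisk stub and the function names are invented for illustration; this is not how the md driver is actually structured:

    class FlakyDisk:
        def __init__(self, chunks, bad=()):
            self.chunks = list(chunks)          # one chunk per stripe
            self.bad = set(bad)                 # stripes that fail to read
        def read(self, stripe):
            if stripe in self.bad:
                raise IOError("unreadable sector")
            return self.chunks[stripe]
        def write(self, stripe, chunk):
            self.chunks[stripe] = chunk
            self.bad.discard(stripe)            # a successful rewrite clears the error

    def xor_chunks(chunks):
        out = bytearray(len(chunks[0]))
        for c in chunks:
            for i, b in enumerate(c):
                out[i] ^= b
        return bytes(out)

    def raid_read(disks, disk_no, stripe):
        try:
            return disks[disk_no].read(stripe)
        except IOError:
            # reconstruct from every other disk in the stripe, then rewrite the
            # chunk so the drive gets a chance to remap the bad sector.
            peers = [d.read(stripe) for i, d in enumerate(disks) if i != disk_no]
            chunk = xor_chunks(peers)
            disks[disk_no].write(stripe, chunk)
            return chunk

    # demo: disk2's copy of stripe 0 is unreadable; the read still succeeds and
    # the rewrite clears the simulated error -- the array stays online.
    data = [b"\x01\x02", b"\x0a\x0b", b"\x10\x20"]
    disks = [FlakyDisk([c]) for c in data] + [FlakyDisk([xor_chunks(data)])]
    disks[2].bad.add(0)
    assert raid_read(disks, 2, 0) == data[2]
    assert 0 not in disks[2].bad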

note that raid1 benefits from these techniques as much as raid5 does -- especially raid1 on more than 2 disks.

some of these errors could be corrected by an offline tool.

proactive disk replacements

update: support for raid5 bitmaps (2.6.15+) essentially provides the functionality required to implement this. i've written directions and mailed them to linux-raid.

allow for proactively replacing disks in an array without first removing a disk. today the only way to proactively replace a disk is to mark an existing disk as faulty, bring the new disk in, and wait for a full reconstruct. it would be much better if the raid software could mirror an existing disk onto the new one and provide a clean switchover without ever losing redundancy. (it would take less I/O and cpu as well...)

this can already be accomplished by pre-building arrays out of 2-disk raid1 components, with one of the raid1 components "missing". but there's something unfortunate about all the /dev/mdN required for that (and you can't really do it to existing arrays which weren't preconfigured in this manner).

it might be possible to modify an existing array disk to be part of a "legacy" raid1 without a superblock. this would require some finessing to bring the main array online with one of its components replaced by this "legacy" raid1. it definitely requires the array to be taken offline to start and finish the process though.