Date: Sun, 23 Apr 2006 18:40:17 -0700 (PDT)
From: dean gaudet
To: linux-raid@vger.kernel.org
Subject: proactive raid5 disk replacement success (using bitmap + raid1)

i had a disk in a raid5 which i wanted to clone onto the hot spare...
without going offline and without long periods without redundancy.

a few folks have discussed using bitmaps and temporary (superblockless)
raid1 mappings to do this... i'm not sure anyone has tried / reported
success though.  this is my success report.

setup info:

- kernel version 2.6.16.9 (as packaged by debian)
- mdadm version 2.4.1
- /dev/md4 is the raid5
- /dev/sde1 is the disk in md4 i want to clone from
- /dev/sdh1 is the hot spare from md4, and is the clone target
- /dev/md5 is an unused md device name

here are the exact commands i issued:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1

... wait a few hours for md5 resync...

mdadm /dev/md4 -f /dev/md5 -r /dev/md5
mdadm --stop /dev/md5
mdadm /dev/md4 --re-add /dev/sdh1
mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

this sort of thing shouldn't be hard to script :)

the only times i was without full redundancy was briefly between the "-r"
and "--re-add" commands... and with bitmap support the raid5 resync for
each of those --re-adds was essentially zero.

thanks Neil (and others)!

-dean

p.s. it's absolutely necessary to use "--build" for the temporary raid1
... if you use --create mdadm will rightfully tell you it's already a
raid component and if you --force it then you'll trash the raid5
superblock and it won't fit into the raid5 any more...
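For reference, that sequence is easy to wrap in a script along the lines
below.  This is only a sketch (untested): the device names are the ones
from the post above, the resync wait is a crude poll of /proc/mdstat, and
--bitmap-chunk is omitted since, as noted later in this thread, newer
mdadm picks the chunk size for an internal bitmap on its own.  Note also
the caveat raised in the follow-ups below about read errors on the source
disk while the temporary raid1 is syncing.

#!/bin/sh
# sketch: clone a raid5 member onto its hot spare via a temporary raid1,
# following the sequence above.  adjust the four names before use.
set -e

RAID5=/dev/md4      # the raid5 array
OLD=/dev/sde1       # member to clone from (retired to spare afterwards)
NEW=/dev/sdh1       # current hot spare, clone target
TMP=/dev/md5        # unused md device for the temporary raid1

# add a write-intent bitmap so the --re-adds below resync almost nothing
# (skip this if the array already has one)
mdadm -Gb internal $RAID5

# pull the spare out, then swap the old member for the temporary raid1
mdadm $RAID5 -r $NEW
mdadm $RAID5 -f $OLD -r $OLD
mdadm --build $TMP -ayes --level=1 --raid-devices=2 $OLD missing
mdadm $RAID5 --re-add $TMP

# start the clone and wait for the raid1 resync to finish
# (crude: waits until no resync/recovery is running on any array)
mdadm $TMP -a $NEW
while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 60; done

# swap the freshly cloned disk back into the raid5
mdadm $RAID5 -f $TMP -r $TMP
mdadm --stop $TMP
mdadm $RAID5 --re-add $NEW

# wipe the old disk's superblock and keep it as the new hot spare
mdadm --zero-superblock $OLD
mdadm $RAID5 -a $OLD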
/mjt Date: Fri, 8 Sep 2006 02:24:40 -0700 (PDT) From: dean gaudet To: Michael Tokarev Cc: Linux RAID Subject: Re: proactive-raid-disk-replacement On Fri, 8 Sep 2006, Michael Tokarev wrote: > Recently Dean Gaudet, in thread titled 'Feature > Request/Suggestion - "Drive Linking"', mentioned his > document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt > > I've read it, and have some umm.. concerns. Here's why: > > .... > > mdadm -Gb internal --bitmap-chunk=1024 /dev/md4 > > mdadm /dev/md4 -r /dev/sdh1 > > mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1 > > mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing > > mdadm /dev/md4 --re-add /dev/md5 > > mdadm /dev/md5 -a /dev/sdh1 > > > > ... wait a few hours for md5 resync... > > And here's the problem. While new disk, sdh1, are resynced from > old, probably failing disk sde1, chances are high that there will > be an unreadable block on sde1. And this means the whole thing > will not work -- md5 initially contained one working drive (sde1) > and one spare (sdh1) which is being converted (resynced) to working > disk. But after read error on sde1, md5 will contain one failed > drive and one spare -- for raid1 it's fatal combination. > > While at the same time, it's perfectly easy to reconstruct this > failing block from other component devices of md4. this statement is an argument for native support for this type of activity in md itself. > That to say: this way of replacing disk in a software raid array > isn't much better than just removing old drive and adding new one. hmm... i'm not sure i agree. in your proposal you're guaranteed to have no redundancy while you wait for the new disk to sync in the raid5. in my proposal the probability that you'll retain redundancy through the entire process is non-zero. we can debate how non-zero it is, but non-zero is greater than zero. i'll admit it depends a heck of a lot on how long you wait to replace your disks, but i prefer to replace mine well before they get to the point where just reading the entire disk is guaranteed to result in problems. > And if the drive you're replacing is failing (according to SMART > for example), this method is more likely to fail. my practice is to run regular SMART long self tests, which tend to find Current_Pending_Sectors (which are generally read errors waiting to happen) and then launch a "repair" sync action... that generally drops the Current_Pending_Sector back to zero. either through a realloc or just simply rewriting the block. if it's a realloc then i consider if there's enough of them to warrant replacing the disk... so for me the chances of a read error while doing the raid1 thing aren't as high as they could be... but yeah you've convinced me this solution isn't good enough. -dean Date: Fri, 08 Sep 2006 14:47:31 +0400 From: Michael Tokarev To: dean gaudet Cc: Linux RAID Subject: Re: proactive-raid-disk-replacement dean gaudet wrote: > On Fri, 8 Sep 2006, Michael Tokarev wrote: > >> Recently Dean Gaudet, in thread titled 'Feature >> Request/Suggestion - "Drive Linking"', mentioned his >> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt >> >> I've read it, and have some umm.. concerns. Here's why: >> >> .... >>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4 By the way, don't specify bitmap-chunk for internal bitmap. It's needed for file-based (external) bitmap. With internal bitmap, we have fixed size in superblock for it, so bitmap-chunk is determined by dividing that size by size of the array. 
Date: Fri, 08 Sep 2006 14:47:31 +0400
From: Michael Tokarev
To: dean gaudet
Cc: Linux RAID
Subject: Re: proactive-raid-disk-replacement

dean gaudet wrote:
> On Fri, 8 Sep 2006, Michael Tokarev wrote:
>
>> Recently Dean Gaudet, in thread titled 'Feature
>> Request/Suggestion - "Drive Linking"', mentioned his
>> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
>>
>> I've read it, and have some umm.. concerns. Here's why:
>>
>> ....
>>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4

By the way, don't specify bitmap-chunk for an internal bitmap.  It's
needed for a file-based (external) bitmap.  With an internal bitmap we
have a fixed amount of space in the superblock for it, so the bitmap
chunk size is determined by dividing the size of the array by the number
of bits that fit in that space.

>>> mdadm /dev/md4 -r /dev/sdh1
>>> mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
>>> mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
>>> mdadm /dev/md4 --re-add /dev/md5
>>> mdadm /dev/md5 -a /dev/sdh1
>>>
>>> ... wait a few hours for md5 resync...
>>
>> And here's the problem.  While the new disk, sdh1, is resynced from the
>> old, probably failing disk sde1, chances are high that there will be an
>> unreadable block on sde1.  And this means the whole thing will not work --
>> md5 initially contained one working drive (sde1) and one spare (sdh1)
>> which is being converted (resynced) to a working disk.  But after a read
>> error on sde1, md5 will contain one failed drive and one spare -- for
>> raid1 that's a fatal combination.
>>
>> While at the same time, it's perfectly easy to reconstruct this failing
>> block from the other component devices of md4.
>
> this statement is an argument for native support for this type of
> activity in md itself.

Yes, definitely.

>> That is to say: this way of replacing a disk in a software raid array
>> isn't much better than just removing the old drive and adding the new
>> one.
>
> hmm... i'm not sure i agree.  in your proposal you're guaranteed to have
> no redundancy while you wait for the new disk to sync in the raid5.

It's not a proposal per se, it's just another possible way (used by the
majority of users I think, because it's way simpler ;)

> in my proposal the probability that you'll retain redundancy through the
> entire process is non-zero.  we can debate how non-zero it is, but
> non-zero is greater than zero.

Yes, there will be no redundancy in "my" variant, guaranteed.  And yes,
there is a chance of completing the whole of "your" process without a
glitch.

> i'll admit it depends a heck of a lot on how long you wait to replace
> your disks, but i prefer to replace mine well before they get to the
> point where just reading the entire disk is guaranteed to result in
> problems.
>
>> And if the drive you're replacing is failing (according to SMART
>> for example), this method is more likely to fail.
>
> my practice is to run regular SMART long self tests, which tend to find
> Current_Pending_Sectors (which are generally read errors waiting to
> happen) and then launch a "repair" sync action... that generally drops
> the Current_Pending_Sector back to zero. either through a realloc or just
> simply rewriting the block. if it's a realloc then i consider if there's
> enough of them to warrant replacing the disk...
>
> so for me the chances of a read error while doing the raid1 thing aren't
> as high as they could be...

So the whole thing goes this way:
0) do a SMART selftest ;)
1) do repair for the whole array
2) copy data from the failing to the new drive
   (using a temporary superblock-less array)
2a) if step 2 still failed, probably due to new bad sectors,
    go the "old way", removing the failing drive and adding
    the new one.

That's 2x or 3x (or 4x counting the selftest, but that should be done
regardless) more work than just going the "old way" from the beginning,
but still with some chance of having it completed flawlessly in 2 steps,
without losing redundancy.

Too complicated and too long for most people I'd say ;)

I can come up with yet another way, which is only partly possible with
the current md code.  In 3 variants:

1) Offline the array, stop it.  Make a copy of the drive using dd with
error skipping (conv=noerror,sync or the like), noting the bad blocks.
Mark those bad blocks in the bitmap as dirty.  Assemble the array with
the new drive, letting it resync the blocks which we were unable to copy
previously.

This variant does not lose redundancy at all, but requires the array to
be off-line during the whole copy procedure.

What's missing (which has been discussed on linux-raid@ recently too) is
the ability to mark those "bad" blocks in the bitmap.

2) The same, but not offlining the array.  Hot-remove the drive, make a
copy of it to the new drive, flip the necessary bitmap bits, and re-add
the new drive, and let the raid code resync the changed (during the copy,
while the array was still active, something might have changed) and
missing blocks.

This variant still loses redundancy, but not much of it, provided the
bitmap code works correctly.

3) The same as your way, with the difference that we tell md to *skip*
and ignore possible errors during resync (which is also not possible
currently).

> but yeah you've convinced me this solution isn't good enough.

But all this, all 5 (so far ;) ways, aren't nice ;)

/mjt
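The copy step of variant 1 might look roughly like this (a sketch only;
plain dd is shown for clarity, though dd_rescue/ddrescue handle retries
and log the bad spots for you, and the member list in the assemble
command is a placeholder).  The important gap is the one noted above:
conv=noerror,sync replaces unreadable sectors with zeros, and there is
currently no way to mark those blocks in the bitmap so that md would
rebuild them from parity.

mdadm --stop /dev/md4             # variant 1 works on the stopped array

# copy the failing member onto the new disk; noerror,sync skips bad
# sectors and pads them so offsets stay aligned.  keep bs small so a bad
# sector doesn't zero out a whole large block; dmesg shows what failed.
dd if=/dev/sde1 of=/dev/sdh1 bs=4k conv=noerror,sync

# reassemble with the new disk in place of the old one (list the real
# member devices; leave sde1 out, since it now shares a superblock with
# the copy on sdh1)
mdadm --assemble /dev/md4 /dev/sdh1 <other md4 members>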
Date: Fri, 8 Sep 2006 11:44:07 -0700 (PDT)
From: dean gaudet
To: Michael Tokarev
Cc: Linux RAID
Subject: Re: proactive-raid-disk-replacement

On Fri, 8 Sep 2006, Michael Tokarev wrote:

> dean gaudet wrote:
> > On Fri, 8 Sep 2006, Michael Tokarev wrote:
> >
> >> Recently Dean Gaudet, in thread titled 'Feature
> >> Request/Suggestion - "Drive Linking"', mentioned his
> >> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
> >>
> >> I've read it, and have some umm.. concerns. Here's why:
> >>
> >> ....
> >>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
>
> By the way, don't specify bitmap-chunk for an internal bitmap.  It's
> needed for a file-based (external) bitmap.  With an internal bitmap we
> have a fixed amount of space in the superblock for it, so the bitmap
> chunk size is determined by dividing the size of the array by the number
> of bits that fit in that space.

yeah sorry that was with an older version of mdadm which didn't calculate
the chunksize correctly for an internal bitmap on a large enough array...
i should have mentioned that in the post.  it's fixed in newer mdadm.

> > my practice is to run regular SMART long self tests, which tend to find
> > Current_Pending_Sectors (which are generally read errors waiting to
> > happen) and then launch a "repair" sync action... that generally drops
> > the Current_Pending_Sector back to zero. either through a realloc or just
> > simply rewriting the block. if it's a realloc then i consider if there's
> > enough of them to warrant replacing the disk...
> >
> > so for me the chances of a read error while doing the raid1 thing aren't
> > as high as they could be...
>
> So the whole thing goes this way:
> 0) do a SMART selftest ;)
> 1) do repair for the whole array
> 2) copy data from the failing to the new drive
>    (using a temporary superblock-less array)
> 2a) if step 2 still failed, probably due to new bad sectors,
>     go the "old way", removing the failing drive and adding
>     the new one.
>
> That's 2x or 3x (or 4x counting the selftest, but that should be done
> regardless) more work than just going the "old way" from the beginning,
> but still with some chance of having it completed flawlessly in 2 steps,
> without losing redundancy.

well it's more "work" but i don't actually manually launch the SMART
tests, smartd does that.  i just notice when i get mail indicating
Current_Pending_Sectors has gone up.

but i'm starting to lean towards SMART short tests (in case they test
something i can't test with a full surface read) and regular crontabbed
rate-limited repair or check actions.
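Such a crontabbed check can be as simple as a single cron entry along
these lines (a sketch, not dean's actual setup; md4 as elsewhere in this
thread, and the sysfs knobs need a kernel recent enough to have them --
swap "check" for "repair" if you want mismatches rewritten rather than
just counted):

# /etc/cron.d/md4-check -- hypothetical example
# weekly, rate-limited consistency check of md4; the speed cap (in KB/s)
# keeps the check from starving normal i/o
30 4 * * Sun  root  echo 20000 > /sys/block/md4/md/sync_speed_max && echo check > /sys/block/md4/md/sync_action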
> 2) The same, but not offlining the array.  Hot-remove the drive, make a
> copy of it to the new drive, flip the necessary bitmap bits, and re-add
> the new drive, and let the raid code resync the changed (during the copy,
> while the array was still active, something might have changed) and
> missing blocks.
>
> This variant still loses redundancy, but not much of it, provided the
> bitmap code works correctly.

i like this method.  it yields the minimal disk copy time because there's
no competition with the live traffic... and you can recover if another
disk has errors while you're doing the copy.

> 3) The same as your way, with the difference that we tell md to *skip*
> and ignore possible errors during resync (which is also not possible
> currently).

maybe we could hand it a bitmap to record the errors in... so we could
merge it with the raid5 bitmap later.  still not really the best solution
though, is it?  we really want a solution similar to raid10...

-dean
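For completeness, the hot-remove / copy / re-add variant dean likes above
might look roughly like this with the devices from this thread.  This is
a sketch under the same caveats as before: it assumes sdh1 is the same
size as sde1 (so the cloned md superblock ends up where md expects it),
and there is still no way to tell the bitmap about sectors the copy
couldn't read, so any such sectors come across as zeros.

mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1   # md4 degraded; the internal
                                           # bitmap now tracks every write

# copy the old member, superblock and all, onto the new disk
dd if=/dev/sde1 of=/dev/sdh1 bs=4k conv=noerror,sync

# the clone is recognized as the departed member; the write-intent bitmap
# limits the resync to whatever changed while the copy was running
mdadm /dev/md4 --re-add /dev/sdh1

# once happy with the result, wipe the old disk's superblock so the two
# copies can't be confused at the next assembly
mdadm --zero-superblock /dev/sde1

If the copy does hit read errors, the safer fallback is still the plain
remove-and-add replacement, letting the raid5 rebuild the new disk from
parity, as discussed earlier in the thread.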