Date: Sun, 23 Apr 2006 18:40:17 -0700 (PDT)
From: dean gaudet
To: linux-raid@vger.kernel.org
Subject: proactive raid5 disk replacement success (using bitmap + raid1)

i had a disk in a raid5 which i wanted to clone onto the hot spare...
without going offline and without long periods without redundancy.

a few folks have discussed using bitmaps and temporary (superblockless)
raid1 mappings to do this... i'm not sure anyone has tried / reported
success though.  this is my success report.

setup info:

- kernel version 2.6.16.9 (as packaged by debian)
- mdadm version 2.4.1
- /dev/md4 is the raid5
- /dev/sde1 is the disk in md4 i want to clone from
- /dev/sdh1 is the hot spare from md4, and is the clone target
- /dev/md5 is an unused md device name

here are the exact commands i issued:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1

... wait a few hours for md5 resync...

mdadm /dev/md4 -f /dev/md5 -r /dev/md5
mdadm --stop /dev/md5
mdadm /dev/md4 --re-add /dev/sdh1
mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

this sort of thing shouldn't be hard to script :)

the only times i was without full redundancy was briefly between the "-r"
and "--re-add" commands... and with bitmap support the raid5 resync for
each of those --re-adds was essentially zero.

thanks Neil (and others)!

-dean

p.s. it's absolutely necessary to use "--build" for the temporary raid1
... if you use --create mdadm will rightfully tell you it's already a
raid component and if you --force it then you'll trash the raid5
superblock and it won't fit into the raid5 any more...
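For reference, that sequence is easy to wrap in a script along the lines
below.  This is only a sketch (untested): the device names are the ones
from the post above, the resync wait is a crude poll of /proc/mdstat, and
--bitmap-chunk is omitted since, as noted later in this thread, newer
mdadm picks the chunk size for an internal bitmap on its own.  Note also
the caveat raised in the follow-ups below about read errors on the source
disk while the temporary raid1 is syncing.

#!/bin/sh
# sketch: clone a raid5 member onto its hot spare via a temporary raid1,
# following the sequence above.  adjust the four names before use.
set -e

RAID5=/dev/md4      # the raid5 array
OLD=/dev/sde1       # member to clone from (retired to spare afterwards)
NEW=/dev/sdh1       # current hot spare, clone target
TMP=/dev/md5        # unused md device for the temporary raid1

# add a write-intent bitmap so the --re-adds below resync almost nothing
# (skip this if the array already has one)
mdadm -Gb internal $RAID5

# pull the spare out, then swap the old member for the temporary raid1
mdadm $RAID5 -r $NEW
mdadm $RAID5 -f $OLD -r $OLD
mdadm --build $TMP -ayes --level=1 --raid-devices=2 $OLD missing
mdadm $RAID5 --re-add $TMP

# start the clone and wait for the raid1 resync to finish
# (crude: waits until no resync/recovery is running on any array)
mdadm $TMP -a $NEW
while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 60; done

# swap the freshly cloned disk back into the raid5
mdadm $RAID5 -f $TMP -r $TMP
mdadm --stop $TMP
mdadm $RAID5 --re-add $NEW

# wipe the old disk's superblock and keep it as the new hot spare
mdadm --zero-superblock $OLD
mdadm $RAID5 -a $OLD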
/mjt Date: Fri, 8 Sep 2006 02:24:40 -0700 (PDT) From: dean gaudet To: Michael Tokarev Cc: Linux RAID Subject: Re: proactive-raid-disk-replacement On Fri, 8 Sep 2006, Michael Tokarev wrote: > Recently Dean Gaudet, in thread titled 'Feature > Request/Suggestion - "Drive Linking"', mentioned his > document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt > > I've read it, and have some umm.. concerns. Here's why: > > .... > > mdadm -Gb internal --bitmap-chunk=1024 /dev/md4 > > mdadm /dev/md4 -r /dev/sdh1 > > mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1 > > mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing > > mdadm /dev/md4 --re-add /dev/md5 > > mdadm /dev/md5 -a /dev/sdh1 > > > > ... wait a few hours for md5 resync... > > And here's the problem. While new disk, sdh1, are resynced from > old, probably failing disk sde1, chances are high that there will > be an unreadable block on sde1. And this means the whole thing > will not work -- md5 initially contained one working drive (sde1) > and one spare (sdh1) which is being converted (resynced) to working > disk. But after read error on sde1, md5 will contain one failed > drive and one spare -- for raid1 it's fatal combination. > > While at the same time, it's perfectly easy to reconstruct this > failing block from other component devices of md4. this statement is an argument for native support for this type of activity in md itself. > That to say: this way of replacing disk in a software raid array > isn't much better than just removing old drive and adding new one. hmm... i'm not sure i agree. in your proposal you're guaranteed to have no redundancy while you wait for the new disk to sync in the raid5. in my proposal the probability that you'll retain redundancy through the entire process is non-zero. we can debate how non-zero it is, but non-zero is greater than zero. i'll admit it depends a heck of a lot on how long you wait to replace your disks, but i prefer to replace mine well before they get to the point where just reading the entire disk is guaranteed to result in problems. > And if the drive you're replacing is failing (according to SMART > for example), this method is more likely to fail. my practice is to run regular SMART long self tests, which tend to find Current_Pending_Sectors (which are generally read errors waiting to happen) and then launch a "repair" sync action... that generally drops the Current_Pending_Sector back to zero. either through a realloc or just simply rewriting the block. if it's a realloc then i consider if there's enough of them to warrant replacing the disk... so for me the chances of a read error while doing the raid1 thing aren't as high as they could be... but yeah you've convinced me this solution isn't good enough. -dean Date: Fri, 08 Sep 2006 14:47:31 +0400 From: Michael Tokarev To: dean gaudet Cc: Linux RAID Subject: Re: proactive-raid-disk-replacement dean gaudet wrote: > On Fri, 8 Sep 2006, Michael Tokarev wrote: > >> Recently Dean Gaudet, in thread titled 'Feature >> Request/Suggestion - "Drive Linking"', mentioned his >> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt >> >> I've read it, and have some umm.. concerns. Here's why: >> >> .... >>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4 By the way, don't specify bitmap-chunk for internal bitmap. It's needed for file-based (external) bitmap. With internal bitmap, we have fixed size in superblock for it, so bitmap-chunk is determined by dividing that size by size of the array. 
Date: Fri, 08 Sep 2006 14:47:31 +0400
From: Michael Tokarev
To: dean gaudet
Cc: Linux RAID
Subject: Re: proactive-raid-disk-replacement

dean gaudet wrote:
> On Fri, 8 Sep 2006, Michael Tokarev wrote:
>
>> Recently Dean Gaudet, in thread titled 'Feature
>> Request/Suggestion - "Drive Linking"', mentioned his
>> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
>>
>> I've read it, and have some umm.. concerns. Here's why:
>>
>> ....
>>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4

By the way, don't specify bitmap-chunk for an internal bitmap.  It's
needed for a file-based (external) bitmap.  With an internal bitmap we
have a fixed amount of space in the superblock for it, so the bitmap
chunk size is determined by dividing the size of the array by the number
of bits that fit in that space.

>>> mdadm /dev/md4 -r /dev/sdh1
>>> mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
>>> mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
>>> mdadm /dev/md4 --re-add /dev/md5
>>> mdadm /dev/md5 -a /dev/sdh1
>>>
>>> ... wait a few hours for md5 resync...
>>
>> And here's the problem.  While the new disk, sdh1, is resynced from the
>> old, probably failing disk sde1, chances are high that there will be an
>> unreadable block on sde1.  And this means the whole thing will not work --
>> md5 initially contained one working drive (sde1) and one spare (sdh1)
>> which is being converted (resynced) to a working disk.  But after a read
>> error on sde1, md5 will contain one failed drive and one spare -- for
>> raid1 that's a fatal combination.
>>
>> While at the same time, it's perfectly easy to reconstruct this failing
>> block from the other component devices of md4.
>
> this statement is an argument for native support for this type of
> activity in md itself.

Yes, definitely.

>> That is to say: this way of replacing a disk in a software raid array
>> isn't much better than just removing the old drive and adding the new
>> one.
>
> hmm... i'm not sure i agree.  in your proposal you're guaranteed to have
> no redundancy while you wait for the new disk to sync in the raid5.

It's not a proposal per se, it's just another possible way (used by the
majority of users I think, because it's way simpler ;)

> in my proposal the probability that you'll retain redundancy through the
> entire process is non-zero.  we can debate how non-zero it is, but
> non-zero is greater than zero.

Yes, there will be no redundancy in "my" variant, guaranteed.  And yes,
there is a chance of completing the whole of "your" process without a
glitch.

> i'll admit it depends a heck of a lot on how long you wait to replace
> your disks, but i prefer to replace mine well before they get to the
> point where just reading the entire disk is guaranteed to result in
> problems.
>
>> And if the drive you're replacing is failing (according to SMART
>> for example), this method is more likely to fail.
>
> my practice is to run regular SMART long self tests, which tend to find
> Current_Pending_Sectors (which are generally read errors waiting to
> happen) and then launch a "repair" sync action... that generally drops
> the Current_Pending_Sector back to zero. either through a realloc or just
> simply rewriting the block. if it's a realloc then i consider if there's
> enough of them to warrant replacing the disk...
>
> so for me the chances of a read error while doing the raid1 thing aren't
> as high as they could be...

So the whole thing goes this way:
0) do a SMART selftest ;)
1) do repair for the whole array
2) copy data from the failing to the new drive
   (using a temporary superblock-less array)
2a) if step 2 still failed, probably due to new bad sectors,
    go the "old way", removing the failing drive and adding
    the new one.

That's 2x or 3x (or 4x counting the selftest, but that should be done
regardless) more work than just going the "old way" from the beginning,
but still with some chance of having it completed flawlessly in 2 steps,
without losing redundancy.

Too complicated and too long for most people I'd say ;)

I can come up with yet another way, which is only partly possible with
the current md code.  In 3 variants:

1) Offline the array, stop it.  Make a copy of the drive using dd with
error skipping (conv=noerror,sync or the like), noting the bad blocks.
Mark those bad blocks in the bitmap as dirty.  Assemble the array with
the new drive, letting it resync the blocks which we were unable to copy
previously.

This variant does not lose redundancy at all, but requires the array to
be off-line during the whole copy procedure.

What's missing (which has been discussed on linux-raid@ recently too) is
the ability to mark those "bad" blocks in the bitmap.

2) The same, but not offlining the array.  Hot-remove the drive, make a
copy of it to the new drive, flip the necessary bitmap bits, and re-add
the new drive, and let the raid code resync the changed (during the copy,
while the array was still active, something might have changed) and
missing blocks.

This variant still loses redundancy, but not much of it, provided the
bitmap code works correctly.

3) The same as your way, with the difference that we tell md to *skip*
and ignore possible errors during resync (which is also not possible
currently).

> but yeah you've convinced me this solution isn't good enough.

But all this, all 5 (so far ;) ways, aren't nice ;)

/mjt
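The copy step of variant 1 might look roughly like this (a sketch only;
plain dd is shown for clarity, though dd_rescue/ddrescue handle retries
and log the bad spots for you, and the member list in the assemble
command is a placeholder).  The important gap is the one noted above:
conv=noerror,sync replaces unreadable sectors with zeros, and there is
currently no way to mark those blocks in the bitmap so that md would
rebuild them from parity.

mdadm --stop /dev/md4             # variant 1 works on the stopped array

# copy the failing member onto the new disk; noerror,sync skips bad
# sectors and pads them so offsets stay aligned.  keep bs small so a bad
# sector doesn't zero out a whole large block; dmesg shows what failed.
dd if=/dev/sde1 of=/dev/sdh1 bs=4k conv=noerror,sync

# reassemble with the new disk in place of the old one (list the real
# member devices; leave sde1 out, since it now shares a superblock with
# the copy on sdh1)
mdadm --assemble /dev/md4 /dev/sdh1 <other md4 members>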
Date: Fri, 8 Sep 2006 11:44:07 -0700 (PDT)
From: dean gaudet
To: Michael Tokarev
Cc: Linux RAID
Subject: Re: proactive-raid-disk-replacement

On Fri, 8 Sep 2006, Michael Tokarev wrote:

> dean gaudet wrote:
> > On Fri, 8 Sep 2006, Michael Tokarev wrote:
> >
> >> Recently Dean Gaudet, in thread titled 'Feature
> >> Request/Suggestion - "Drive Linking"', mentioned his
> >> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
> >>
> >> I've read it, and have some umm.. concerns. Here's why:
> >>
> >> ....
> >>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
>
> By the way, don't specify bitmap-chunk for an internal bitmap.  It's
> needed for a file-based (external) bitmap.  With an internal bitmap we
> have a fixed amount of space in the superblock for it, so the bitmap
> chunk size is determined by dividing the size of the array by the number
> of bits that fit in that space.

yeah sorry that was with an older version of mdadm which didn't calculate
the chunksize correctly for an internal bitmap on a large enough array...
i should have mentioned that in the post.  it's fixed in newer mdadm.

> > my practice is to run regular SMART long self tests, which tend to find
> > Current_Pending_Sectors (which are generally read errors waiting to
> > happen) and then launch a "repair" sync action... that generally drops
> > the Current_Pending_Sector back to zero. either through a realloc or just
> > simply rewriting the block. if it's a realloc then i consider if there's
> > enough of them to warrant replacing the disk...
> >
> > so for me the chances of a read error while doing the raid1 thing aren't
> > as high as they could be...
>
> So the whole thing goes this way:
> 0) do a SMART selftest ;)
> 1) do repair for the whole array
> 2) copy data from the failing to the new drive
>    (using a temporary superblock-less array)
> 2a) if step 2 still failed, probably due to new bad sectors,
>     go the "old way", removing the failing drive and adding
>     the new one.
>
> That's 2x or 3x (or 4x counting the selftest, but that should be done
> regardless) more work than just going the "old way" from the beginning,
> but still with some chance of having it completed flawlessly in 2 steps,
> without losing redundancy.

well it's more "work" but i don't actually manually launch the SMART
tests, smartd does that.  i just notice when i get mail indicating
Current_Pending_Sectors has gone up.

but i'm starting to lean towards SMART short tests (in case they test
something i can't test with a full surface read) and regular crontabbed
rate-limited repair or check actions.
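Such a crontabbed check can be as simple as a single cron entry along
these lines (a sketch, not dean's actual setup; md4 as elsewhere in this
thread, and the sysfs knobs need a kernel recent enough to have them --
swap "check" for "repair" if you want mismatches rewritten rather than
just counted):

# /etc/cron.d/md4-check -- hypothetical example
# weekly, rate-limited consistency check of md4; the speed cap (in KB/s)
# keeps the check from starving normal i/o
30 4 * * Sun  root  echo 20000 > /sys/block/md4/md/sync_speed_max && echo check > /sys/block/md4/md/sync_action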
> 2) The same, but not offlining the array.  Hot-remove the drive, make a
> copy of it to the new drive, flip the necessary bitmap bits, and re-add
> the new drive, and let the raid code resync the changed (during the copy,
> while the array was still active, something might have changed) and
> missing blocks.
>
> This variant still loses redundancy, but not much of it, provided the
> bitmap code works correctly.

i like this method.  it yields the minimal disk copy time because there's
no competition with the live traffic... and you can recover if another
disk has errors while you're doing the copy.

> 3) The same as your way, with the difference that we tell md to *skip*
> and ignore possible errors during resync (which is also not possible
> currently).

maybe we could hand it a bitmap to record the errors in... so we could
merge it with the raid5 bitmap later.  still not really the best solution
though, is it?  we really want a solution similar to raid10...

-dean
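For completeness, the hot-remove / copy / re-add variant dean likes above
might look roughly like this with the devices from this thread.  This is
a sketch under the same caveats as before: it assumes sdh1 is the same
size as sde1 (so the cloned md superblock ends up where md expects it),
and there is still no way to tell the bitmap about sectors the copy
couldn't read, so any such sectors come across as zeros.

mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1   # md4 degraded; the internal
                                           # bitmap now tracks every write

# copy the old member, superblock and all, onto the new disk
dd if=/dev/sde1 of=/dev/sdh1 bs=4k conv=noerror,sync

# the clone is recognized as the departed member; the write-intent bitmap
# limits the resync to whatever changed while the copy was running
mdadm /dev/md4 --re-add /dev/sdh1

# once happy with the result, wipe the old disk's superblock so the two
# copies can't be confused at the next assembly
mdadm --zero-superblock /dev/sde1

If the copy does hit read errors, the safer fallback is still the plain
remove-and-add replacement, letting the raid5 rebuild the new disk from
parity, as discussed earlier in the thread.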