Thursday, February 7, 2008

Fixing my software raid-5 with mdadm

On my server I run a software RAID-5, /dev/md0, made up of four 500 GB SATA hard disks (sda1, sdb1, sdc1 and sdd1). While I was busy on my other box, pulling off some dd stunts on an LVM volume, the RAID on the server suddenly died. dmesg reported that both sda and sdb were somehow corrupt and had been removed from the array, leaving it non-functional (a RAID-5 with n disks needs at least n-1 of them to stay operational). I was pretty shocked. I rebooted the machine and tried to re-assemble the array with

mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
to no avail. mdadm said 'bad superblock on device /dev/sda1' (or similar). Leaving out sda1 worked, which gave me an array assembled from 3 out of 4 disks. That is of course not satisfactory. I stopped the array and started a long S.M.A.R.T. self-test on each of the 4 disks with
smartctl -t long /dev/sdX
(SMART self-tests run on the whole disk, so the device is given without a partition number). This took over an hour, so I went to sleep and checked the results the next day: 100% error-free, according to SMART! That was really strange. Assembling the full array still did not work because of sda1. I opened sda in cfdisk and saw exactly the same partition size as on sdb and the others, but I knew that something was corrupt. So I wrote the partition table to a file as a backup, deleted the partition and re-created it. Then I used
mdadm --add /dev/md0 /dev/sda1
to re-add the partition that was formerly part of the array. mdadm did its job and rebuilt the RAID. You can watch the progress with
cat /proc/mdstat
The rebuild took around 7 hours, and now the RAID-5 is fully functional again.
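Besides eyeballing /proc/mdstat, a small script can tell whether an array is still running degraded. The following is only a sketch: it assumes the usual mdstat layout, where the status line of each array ends in something like "[4/3] [_UUU]" (configured members / active members), and the function name degraded_arrays is made up for this example.

```shell
#!/bin/sh
# Print the name of every md array in mdstat-formatted input (stdin)
# that has fewer active members than configured, e.g. "[4/3] [_UUU]".
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {      # status field such as [4/3]
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    '
}
```

Running degraded_arrays < /proc/mdstat should print md0 while the rebuild is still in progress and nothing once all four disks are active again. For a live view, watch -n 60 cat /proc/mdstat refreshes the output every minute.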
What a horror trip! I'm still wondering what was going on and why sda1 was kicked out of the array.
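Coming back to the SMART checks for a moment: starting a long self-test only queues it on the drive, and the result has to be read separately afterwards. Here is a rough sketch of how both steps could be wrapped for several disks. The function names and the SMARTCTL variable are my own invention for this example; the smartctl options used (-t long, -l selftest, -H) are standard ones.

```shell
#!/bin/sh
# SMARTCTL can be overridden for a dry run; by default the real tool is used.
SMARTCTL=${SMARTCTL:-smartctl}

# Start a long (extended) self-test on each whole disk, given by name.
start_long_tests() {
    for disk in "$@"; do
        "$SMARTCTL" -t long "/dev/$disk"
    done
}

# Show the self-test log and the overall health verdict for each disk.
show_results() {
    for disk in "$@"; do
        "$SMARTCTL" -l selftest "/dev/$disk"
        "$SMARTCTL" -H "/dev/$disk"
    done
}
```

Run as root, start_long_tests sda sdb sdc sdd kicks off the tests, and show_results sda sdb sdc sdd the next morning prints what they found. A clean result only says the drive itself saw no errors; it does not rule out cabling, controller or kernel problems.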
A small addition: after I fixed the RAID it was fine for a day or two, but then one day when I came home I noticed it had broken again. I remembered that I had stepped on the USB keyboard attached to the server right after coming home, and I found an unhandled-IRQ oops in the kernel log from exactly that time. So my guess is that the USB handler somehow messed something up, which in turn killed the RAID again. I'm still investigating the issue; for now, rebooting and forcing the assembly worked fine. I hope I won't have any more problems with it...
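For reference, forcing the assembly might look roughly like the sketch below. The exact command line I typed is not recorded here, so treat this as an assumption: it uses mdadm's --stop and --assemble --force options, and the force_assemble helper and the MDADM variable are names invented for this example.

```shell
#!/bin/sh
# MDADM can be overridden for a dry run; by default the real tool is used.
MDADM=${MDADM:-mdadm}

# Stop a (possibly half-assembled) array, then force assembly
# from the listed member partitions.
force_assemble() {
    array=$1
    shift
    "$MDADM" --stop "$array"
    "$MDADM" --assemble --force "$array" "$@"
}
```

As root this would be called as force_assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1. Note that --force tells mdadm to assemble even when some members look out of date, so it is worth checking the data afterwards.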
