This is a short collection of notes on preventing, detecting, and replacing a failed hard disk in a RAID1 array.
Which drive is broken?
1. Check for messages in dmesg
<pre lang="bash">[ 1040.470282] ata1.00: device reported invalid CHS sector 0
[ 1040.470287] ata1: EH complete
[ 6373.208104] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 6373.214488] ata1.00: failed command: FLUSH CACHE EXT
[ 6373.221215] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 6373.221217]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[ 6373.243107] ata1.00: status: { DRDY }
[ 6373.251006] ata1: hard resetting link
[ 6378.646534] ata1: link is slow to respond, please be patient (ready=0)
[ 6383.266995] ata1: COMRESET failed (errno=-16)</pre>
2. Install smartmontools:
<pre lang="bash">apt-get install smartmontools</pre>
3. Start SMART self-tests on the drives:
<pre lang="bash">smartctl -t short /dev/sdx #start a short test
smartctl -t long /dev/sdx #start a long test
smartctl -a /dev/sdx | less #check results once the test has finished</pre>
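The kernel log names an ATA port (ata1 above), not a device node, so the port still has to be translated into a drive name. The following is a minimal sketch rather than a finished tool. Both function names are made up for illustration, and the second function assumes that the resolved sysfs path of each /sys/block/sdX entry contains the port name (for example /ata1/), which is worth verifying on your own system.

```shell
#!/bin/sh
# Print the unique ATA ports (ata1, ata2, ...) that logged a failed command
# or a failed link reset. Reads kernel log text on stdin.
failing_ata_ports() {
    grep -E 'failed command|COMRESET failed' \
        | sed -n 's/.*\(ata[0-9][0-9]*\).*/\1/p' \
        | sort -u
}

# Print the sdX names whose resolved sysfs path goes through the given port.
# The second argument (sysfs root) is optional and exists only so the
# function can be tried against a copy of the tree.
# ASSUMPTION: the sysfs device path contains ".../ataN/...".
disk_for_ata_port() {
    port=$1
    sysroot=${2:-/sys}
    for dev in "$sysroot"/block/sd*; do
        [ -e "$dev" ] || continue
        case "$(readlink -f "$dev")" in
            */"$port"/*) basename "$dev" ;;
        esac
    done
}

# Possible usage (not executed here):
#   dmesg | failing_ata_ports | while read -r port; do
#       disk_for_ata_port "$port"
#   done
```

If the mapping prints nothing, fall back to comparing the drive serial number reported by smartctl -a with the label on the physical disk.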
Remove drive from md array
1. Check mdstat
<pre lang="bash">cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear] [multipath]
md1 : active raid1 sda2[0] sdb2[1]
      524224 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      729952192 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      2096064 blocks [2/2] [UU]
unused devices: <none></pre>
2. If the drive is not already marked as failed, mark it as failed:
<pre lang="bash">mdadm --manage /dev/md2 --fail /dev/sda3 #repeat for every partition of the failing disk</pre>
3. Remove it from the array:
<pre lang="bash">mdadm /dev/md2 -r /dev/sda3 #repeat for every partition of the failing disk</pre>
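Since the fail and remove steps have to be repeated for every partition of the failing disk, the pairs can be derived from /proc/mdstat instead of being typed by hand. This is a sketch under assumptions: the function names are hypothetical, disks are assumed to be named sdX with numbered partitions, and the second function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Print "array partition" pairs for every array member that lives on the
# given disk (e.g. sda). Reads /proc/mdstat-style text on stdin.
members_on_disk() {
    awk -v disk="$1" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*$/, "", part)      # sda3[0] or sda3[0](F) -> sda3
                if (part ~ ("^" disk "[0-9]+$")) print $1, part
            }
        }'
}

# Print (do not execute) the fail and remove commands for that disk.
print_removal_commands() {
    members_on_disk "$1" | while read -r md part; do
        echo "mdadm --manage /dev/$md --fail /dev/$part"
        echo "mdadm --manage /dev/$md --remove /dev/$part"
    done
}

# Possible usage (not executed here):
#   print_removal_commands sda < /proc/mdstat
```

Printing instead of executing is deliberate: failing the wrong member of a degraded RAID1 array would take the array down.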
Check grub
1. Check whether GRUB is also installed on the second drive:
<pre lang="bash">dd bs=512 count=1 if=/dev/sdb 2>/dev/null | strings
ZRr=`|f
\|f1
<strong>GRUB</strong>
Geom
Hard Disk
Read
Error</pre>
If GRUB does not appear in the output, you will not be able to boot from that drive once the other one is gone. Install GRUB on it:
<pre lang="bash">grub-install /dev/sdb</pre>
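The same check can be wrapped in a function and run against every disk of the array. This is a minimal sketch: the function name is hypothetical, and it only looks for the string GRUB in the first 512 bytes, exactly like the manual check above, so it says nothing about setups that do not boot from the MBR.

```shell
#!/bin/sh
# Succeed if the first sector of the given disk (or image file) contains
# the string "GRUB", i.e. GRUB boot code appears to be present in the MBR.
has_grub_in_mbr() {
    dd bs=512 count=1 if="$1" 2>/dev/null | LC_ALL=C grep -a -q GRUB
}

# Possible usage (not executed here):
#   for disk in /dev/sda /dev/sdb; do
#       if has_grub_in_mbr "$disk"; then
#           echo "$disk: GRUB found"
#       else
#           echo "$disk: no GRUB in MBR, run grub-install $disk"
#       fi
#   done
```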
Replace disk
1. Replace the disk and reboot.
2. Partition the new disk and add it to the arrays:
<pre lang="bash">dd if=/dev/sdb of=/dev/sda count=1 bs=512 #copy the MBR (boot code and primary partition table) from the healthy disk
blockdev --rereadpt /dev/sda #reread partition table (older systems: sfdisk -R /dev/sda)
mdadm /dev/md2 -a /dev/sda3 #add partitions to raid, for all partitions
cat /proc/mdstat #check mdstat until rebuild is complete</pre>
3. Install GRUB on the new disk:
<pre lang="bash">grub-install /dev/sda</pre>
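Rather than rereading /proc/mdstat by hand until the rebuild is complete, the check can be scripted. The sketch below uses hypothetical function names and assumes the usual mdstat layout, where each array line is followed by a status line ending in a field such as [UU] or [_U], and a rebuilding array shows a line containing "recovery" or "resync".

```shell
#!/bin/sh
# Succeed while the mdstat text on stdin shows a rebuild (or a pending one).
rebuild_in_progress() {
    grep -Eq 'recovery|resync'
}

# Print the names of arrays that have a missing member, i.e. an "_" in the
# [UU]-style status field. Reads mdstat text on stdin.
degraded_arrays() {
    awk '
        $1 ~ /^md[0-9]+$/ && $2 == ":" { name = $1 }
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
            name = ""
        }'
}

# Possible usage (not executed here):
#   while rebuild_in_progress < /proc/mdstat; do sleep 60; done
#   degraded_arrays < /proc/mdstat    # should print nothing when all is well
```

Running grub-install only after the loop ends is not required, but checking that degraded_arrays prints nothing is a cheap confirmation that every partition was added back.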
Restore grub
If you skipped step 3 of the disk replacement (installing GRUB on the new disk), your machine will most likely not boot. Boot from a rescue CD and run the following commands to fix this:
<pre lang="bash">mkdir /mnt/rescue
mount /dev/md2 /mnt/rescue #root filesystem
mount /dev/md1 /mnt/rescue/boot #only if /boot is a separate array; md1 is an assumption, mount everything according to your config
for i in /dev /dev/pts /proc /sys; do sudo mount -B $i /mnt/rescue/$i; done
chroot /mnt/rescue
grub-install /dev/sda</pre>
Prevention: configure raid and smartd notifications
1. mdadm: add the following line at the end of /etc/mdadm/mdadm.conf:
PROGRAM /root/scripts/raidMonitor.py
Replace raidMonitor.py with a script that does whatever you need: sends emails, rings a bell, etc.
2. smartd: on Ubuntu, enable smartd by uncommenting the following line in /etc/default/smartmontools:
start_smartd=yes
There is no need to study the -m and -M options in the smartd manual: the default configuration executes /usr/share/smartmontools/smartd-runner, which in turn executes every script in /etc/smartmontools/run.d. Add a script there that sends emails, rings a bell, etc. You can test it by adding -M test to the following line in /etc/smartd.conf:
DEVICESCAN -m root <strong>-M test</strong> -M exec /usr/share/smartmontools/smartd-runner
With -M test, smartd executes the script when it starts. Link your script in /etc/smartmontools/run.d as 10mail.
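Both hooks can share one small notification script. The sketch below is only an illustration with hypothetical function names, and it is a shell stand-in, not the raidMonitor.py referenced above. It relies on two documented behaviours: mdadm calls its PROGRAM with the event name, the md device and, when relevant, the component device as arguments, and smartd exports SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE to the script it executes. How the message is delivered (mail, logger, a bell) is left open.

```shell
#!/bin/sh
# Build a one-line message from the arguments mdadm passes to PROGRAM:
#   $1 = event name, $2 = md device, $3 = component device (optional)
format_md_event() {
    msg="mdadm event $1 on $2"
    if [ -n "${3:-}" ]; then
        msg="$msg (component $3)"
    fi
    echo "$msg"
}

# Build a one-line message from the environment variables set by smartd.
# Missing variables are replaced by placeholders instead of failing.
format_smartd_alert() {
    echo "smartd ${SMARTD_FAILTYPE:-unknown} on ${SMARTD_DEVICE:-unknown device}: ${SMARTD_MESSAGE:-no message}"
}

# Possible usage (not executed here), delivering via syslog:
#   format_md_event "$@" | logger -t raid-alert      # as the mdadm PROGRAM
#   format_smartd_alert  | logger -t raid-alert      # from run.d
```

Keeping the formatting separate from the delivery makes it easy to try the script by hand with made-up arguments before trusting it with a real failure.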