Backup, backup, backup

This is a short collection of notes on preventing, detecting, and replacing a failed hard drive in a RAID1 array.

Which drive is broken?

1. Check for messages in dmesg

[ 1040.470282] ata1.00: device reported invalid CHS sector 0
[ 1040.470287] ata1: EH complete
[ 6373.208104] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 6373.214488] ata1.00: failed command: FLUSH CACHE EXT
[ 6373.221215] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 6373.221217] res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[ 6373.243107] ata1.00: status: { DRDY }
[ 6373.251006] ata1: hard resetting link
[ 6378.646534] ata1: link is slow to respond, please be patient (ready=0)
[ 6383.266995] ata1: COMRESET failed (errno=-16)
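On a busy machine these messages are easy to miss among other kernel output. The following is a minimal sketch that counts error-like messages per ATA port. The function name `ata_error_ports` and the list of matched phrases are assumptions taken from the sample log above, not an exhaustive set of libata errors.

```shell
# Summarise which ATA ports report failures in kernel log text read from stdin.
# Prints "<port> <count>" per port, for example "ata1 4".
# The matched phrases are only the ones seen in the sample log above.
ata_error_ports() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: .*(failed command|exception|COMRESET failed|hard resetting link)' |
        sed -E 's/.*(ata[0-9]+)(\.[0-9]+)?: .*/\1/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on a live system (may need root to read the kernel log):
#   dmesg | ata_error_ports
# To see which disk sits behind a port, inspect the sysfs links; on most
# current kernels the link target contains the port name, e.g. "/ata1/":
#   ls -l /sys/block/sd*
```

A port that shows up repeatedly here is a good candidate for the SMART tests in the next steps.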

2. Install smartmontools

<pre lang="bash">apt-get install smartmontools

3. Start SMART tests on the drives:

<pre lang="bash">smartctl -t short /dev/sdx #start a short test
smartctl -t long /dev/sdx #start a long test
smartctl -a /dev/sdx | less #check results
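The full `smartctl -a` report is long. As a rough sketch, the attribute table can be filtered down to a few counters that are commonly watched on failing disks. The helper name `smart_warning_attrs` and the choice of attributes are assumptions; not every drive reports all three.

```shell
# Print name=raw_value for a few commonly watched SMART attributes.
# Reads `smartctl -A` output from stdin; the raw value is the 10th column
# of the attribute table.
smart_warning_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2 "=" $10
    }'
}

# Usage:
#   smartctl -A /dev/sdx | smart_warning_attrs
#   smartctl -H /dev/sdx    # overall health verdict only
```

Non-zero raw values are a warning sign rather than proof of failure, so compare both drives before deciding which one to pull.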

Remove drive from md array

1. Check mdstat

<pre lang="bash">cat /proc/mdstat

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear] [multipath] 
md1 : active raid1 sda2[0] sdb2[1]
 524224 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
 729952192 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
 2096064 blocks [2/2] [UU]

unused devices: <none>
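In the listing above every array shows `[UU]`, meaning both members are present; a missing member appears as `_`. A small sketch for spotting degraded arrays, assuming the helper name `degraded_arrays` and the usual mdstat layout where the status flags are the last field of the "blocks" line:

```shell
# List md arrays whose status flags contain "_" (a missing member).
# Takes the path of an mdstat-format file; defaults to /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /\[[U_]+\]/ && name != "" {          # the line carrying [UU] or [_U]
            if ($NF ~ /_/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Usage:
#   degraded_arrays    # prints nothing when all arrays are healthy
```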

2. If not already marked, mark drive as failed

<pre lang="bash">mdadm --manage /dev/md2 --fail /dev/sda3 #for all partitions

3. Remove the drive from the array

<pre lang="bash">mdadm /dev/md2 -r /dev/sda3 #for all partitions
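Because steps 2 and 3 have to be repeated for every partition of the failing disk, it can help to generate the commands first and review them. This is a sketch only: `print_remove_commands` is a hypothetical helper, and the array-to-partition pairs must match your own layout.

```shell
# Print the mdadm commands that fail and remove every partition of one disk.
# Arguments: disk name, then array:partition-number pairs.
# Nothing is executed; the commands are only printed.
print_remove_commands() {
    disk=$1; shift
    for pair in "$@"; do
        array=${pair%%:*}    # text before the colon, e.g. md2
        part=${pair##*:}     # text after the colon, e.g. 3
        echo "mdadm --manage /dev/$array --fail /dev/$disk$part"
        echo "mdadm --manage /dev/$array --remove /dev/$disk$part"
    done
}

# Layout taken from the mdstat listing above: md0=sda1, md1=sda2, md2=sda3.
#   print_remove_commands sda md0:1 md1:2 md2:3
# After checking the output, pipe it to `sh` as root to run it.
```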

Check grub

1. Check that GRUB is also installed on the second drive:

<pre lang="bash">dd bs=512 count=1 if=/dev/sdb 2>/dev/null | strings
ZRr=`|f 
\|f1
<strong>GRUB</strong> 
Geom
Hard Disk
Read
 Error

If GRUB does not appear in the output, you will not be able to boot from this drive once the other one is gone. Install GRUB on it:

<pre lang="bash">grub-install /dev/sdb
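The same boot-sector check can be wrapped in a function and run against each drive. This is a sketch, with `has_grub_bootsector` as an assumed name; it only looks for the string GRUB in the first 512 bytes, so it cannot tell whether the rest of the boot loader is intact.

```shell
# Succeeds if the first 512-byte sector of a device or image file
# contains the string "GRUB".
has_grub_bootsector() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -a -q 'GRUB'
}

# Usage (as root):
#   for d in /dev/sda /dev/sdb; do
#       has_grub_bootsector "$d" || echo "no GRUB found on $d"
#   done
```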

Replace disk

1. Replace the disk and reboot

2. Copy the partition table to the new disk and add it to the array

<pre lang="bash">dd if=/dev/sdb of=/dev/sda count=1 bs=512 #copy MBR and partition table (MBR-partitioned disks only)
sfdisk -R /dev/sda #reread partition table; newer util-linux dropped -R, use: blockdev --rereadpt /dev/sda
mdadm /dev/md2 -a /dev/sda3 #add partitions to raid, for all partitions
cat /proc/mdstat #check mdstat until rebuild is complete
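Instead of rerunning `cat /proc/mdstat` by hand, the rebuild can be polled. A minimal sketch, assuming the helper name `rebuild_in_progress` and that the kernel reports progress with the words "recovery" or "resync" followed by an equals sign:

```shell
# Succeeds while an mdstat-format file shows a rebuild running or queued,
# i.e. a line such as "recovery =  8.5% ..." or "resync=DELAYED".
# Takes a file path; defaults to /proc/mdstat.
rebuild_in_progress() {
    grep -E -q '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Poll once a minute, then show the final state:
#   while rebuild_in_progress; do sleep 60; done; cat /proc/mdstat
```

Wait until every array is back to `[UU]` before touching the other drive.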

3. Install GRUB on the new disk

<pre lang="bash">grub-install /dev/sda

Restore grub

If you forgot step 3, your machine will most likely not boot. In that case, boot from a rescue CD and run the following commands to fix it:

<pre lang="bash">mkdir /mnt/rescue
mount /dev/md2 /mnt/rescue #root filesystem
mount /dev/md1 /mnt/rescue/boot #separate /boot, assumed here to be md1; mount everything, depending on your config
for i in /dev /dev/pts /proc /sys; do mount -B "$i" "/mnt/rescue$i"; done #bind-mount the system directories
chroot /mnt/rescue
grub-install /dev/sda

Prevention: configure raid and smartd notifications

1. mdadm: add the following at the end of /etc/mdadm/mdadm.conf

PROGRAM /root/scripts/raidMonitor.py

Replace raidMonitor.py with a script that does whatever you need: sends emails, rings a bell, etc.
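When mdadm runs in monitor mode it calls the PROGRAM with two or three arguments: the event name, the md device, and sometimes the affected component device. The following is a shell sketch of such a handler, standing in for the author's raidMonitor.py, whose contents are not shown. The function name `format_md_event` and the `raid-monitor` log tag are assumptions.

```shell
#!/bin/sh
# Hypothetical replacement for /root/scripts/raidMonitor.py.
# mdadm passes: $1 = event name, $2 = md device, $3 = component (optional).
format_md_event() {
    event=$1
    array=$2
    component=${3:-}
    msg="mdadm event $event on $array"
    if [ -n "$component" ]; then
        msg="$msg (component $component)"
    fi
    echo "$msg"
}

# Only act when called with arguments, as mdadm does.
if [ "$#" -ge 2 ]; then
    # Swap logger for mail or any other notifier you prefer.
    format_md_event "$@" | logger -t raid-monitor
fi
```

The script has to be executable, and the PROGRAM line must point at its actual path.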

2. smartd: on Ubuntu, to enable smartd, uncomment the following line in /etc/default/smartmontools

start_smartd=yes

There is no need to read the smartd manual for the -m and -M options, since the default configuration executes /usr/share/smartmontools/smartd-runner, which in turn executes all scripts in /etc/smartmontools/run.d. Add a script there that sends emails, rings a bell, etc. You can test it by modifying the following line in /etc/smartd.conf:

DEVICESCAN -m root -M test -M exec /usr/share/smartmontools/smartd-runner

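The -M test directive sends a test notification, and so runs the script, every time smartd starts; remove it again once you have confirmed that the script works. Link your script in /etc/smartmontools/run.d as 10mail.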
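A script in run.d can be very small. The sketch below assumes that smartd-runner passes the path of the saved warning message as the first argument, as the stock Debian and Ubuntu runner does, and that smartd exports SMARTD_DEVICE and SMARTD_FAILTYPE as described in smartd.conf(5). The function name `smart_alert_line` and the `smart-alert` log tag are made up for illustration.

```shell
#!/bin/sh
# Hypothetical /etc/smartmontools/run.d script.
# $1 is assumed to be the file holding the warning text; SMARTD_DEVICE and
# SMARTD_FAILTYPE are assumed to be exported by smartd.
smart_alert_line() {
    first_line=$(head -n 1 "$1" 2>/dev/null)
    echo "SMART ${SMARTD_FAILTYPE:-unknown} on ${SMARTD_DEVICE:-unknown device}: $first_line"
}

# Only act when a readable message file was passed in.
if [ "$#" -ge 1 ] && [ -f "$1" ]; then
    # Swap logger for mail or any other notifier you prefer.
    smart_alert_line "$1" | logger -t smart-alert
fi
```

With -M test in place, restarting smartd should produce one such log line per monitored device.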
