Sometime during the night of April 14th a hard drive (/dev/sdd) went silent, its platters resting motionless as increasingly desperate read and write requests from the operating system went unanswered. After trying unsuccessfully to resuscitate /dev/sdd, the operating system pronounced the hard drive dead. The medical examiner (i.e. the sysadmin, me) was then summoned to the scene with the message: "RAID DEGRADED" (figure 1). The autopsy (i.e. a check of the server logs) turned up nothing that would explain /dev/sdd's death. We can only speculate about the true cause; perhaps it was an electrical spike, a speck of dust on the read/write head, a cosmic ray, or divine intervention; God only knows. Sudden deaths like these are especially tragic because those close to the deceased have no time to prepare, which often greatly extends the recovery process. Fortunately, /dev/sdd was a member of a RAID1 array!
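(In case you are wondering how a dead drive manages to send email: an alert like the one in figure 1 typically comes from mdadm's monitor mode, which most distributions run as a daemon. A minimal sketch of that setup, assuming a Debian-style config path and a placeholder address rather than my exact configuration:)

# /etc/mdadm/mdadm.conf -- where alert mail should go
MAILADDR you@example.com

# generate a test alert for every array to confirm mail delivery works
$ mdadm --monitor --scan --oneshot --test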
Alright, enough with the melodramatic storytelling. When I checked my email on April 14th I got the message shown in figure 1: the hard drive I used to store all my personal files and my family's home movies and photographs had failed. Decades' worth of irreplaceable videos, photographs, and personal files could have vanished in an instant when the drive died. It would have been an unmitigated disaster, except that I had configured the drive as part of a RAID array. Specifically, the hard drives referred to as /dev/sdc and /dev/sdd were configured as a RAID1 array. RAID1 provides redundancy against hard drive failures like the one I experienced by mirroring contents across all drives in the array; when /dev/sdd failed, /dev/sdc held an exact replica of the data, so the file server storing my and my family's files kept running without interruption.
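For readers who have never set one up, creating a two-disk mirror like the one described above takes a single mdadm command. A rough sketch (illustrative, not the exact command I ran when I first built the server, and the filesystem choice is an assumption):

# assemble /dev/sdc1 and /dev/sdd1 into a RAID1 mirror called /dev/md1
$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

# put a filesystem on the mirror and mount it like any other block device
$ mkfs.ext4 /dev/md1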
I had simulated disk failures when I built and configured my file server, but this was the first time I had experienced a disk failure in production. It may seem like a small thing, but seeing my system perform flawlessly through a real-world hardware failure was super awesome (at least to me). I can only imagine the relief system administrators felt when they first deployed RAID in their storage systems in the late 1980s and 1990s.
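If you want to run the same drill on your own array, mdadm can mark a healthy member as faulty on demand. A sketch of the rehearsal (use a test array or a machine you can afford to break; the device names here are simply the ones from my array):

# pretend /dev/sdd1 just died
$ mdadm --manage /dev/md1 --fail /dev/sdd1

# then practice the removal and re-add cycle described below
$ mdadm --manage /dev/md1 --remove /dev/sdd1
$ mdadm --manage /dev/md1 --add /dev/sdd1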
Anyway, even though my file server was still working and I had not lost any data (yet), I needed to replace the broken drive as soon as possible. If the remaining drive (/dev/sdc) failed before I rebuilt the RAID array, the data really would be lost. So I quickly ordered a new drive on Amazon (figure 2) and waited anxiously for the shipment to arrive.
Several days later, I got the new hard drive in the mail! If I were using hardware RAID, the recovery process would begin automatically as soon as I installed the new drive; however, when I built the file server I opted to save money (hardware RAID controllers are expensive!) and use Linux's built-in software RAID, managed with mdadm. The first step on the road to recovery was to remove the broken hard drive from the RAID array.
# get status of RAID array
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1 sda1
      976629568 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdd1(F) sdc1
      2930133824 blocks super 1.2 [2/1] [U_]

unused devices: <none>

# remove the failed drive
$ mdadm --manage /dev/md1 --remove /dev/sdd1
mdadm: hot removed /dev/sdd1
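Before powering anything off, it does not hurt to confirm that the array really has let go of the failed member. mdadm's detail view reports the array state and device counts:

# confirm md1 now reports a single working member
$ mdadm --detail /dev/md1 | grep -E 'State|Devices'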
Once the broken hard drive was removed from the RAID array in software, it was time to power off the server and replace the drive (figure 3). After carefully opening up the server and unplugging the dead drive from the motherboard, I pulled the new hard drive out of its antistatic bag and installed it. So far so good.
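One habit worth mentioning here (not a step I showed above, just a precaution I like): before powering down, map the surviving drive's device name to its serial number so you know which physical drive not to pull. The dead drive usually no longer answers, so the serial that does show up belongs to the healthy one:

# list drives by serial number; partition links are filtered out
$ ls -l /dev/disk/by-id/ | grep -v part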
Next, I needed to prepare the new hard drive and attach it to the RAID array. Using standard UNIX tools, we can copy the partition table of the working drive onto the newly installed drive. Then, using mdadm again, we can attach the new drive to the RAID array; once the new drive is attached, the recovery process begins automatically!
# copy the partitions from the healthy drive to the new drive
$ sfdisk -d /dev/sdc | sfdisk /dev/sdd

# attach the newly partitioned drive to the RAID array
$ mdadm --manage /dev/md1 --add /dev/sdd1
mdadm: re-added /dev/sdd1

# check the status of the RAID array
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1 sda1
      976629568 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdd1 sdc1
      2930135360 blocks super 1.2 [2/2] [UU]
      [>....................]  recovery =  0.4% (10923836/2930135360) finish=448.0min speed=10918K/sec

unused devices: <none>
Excellent, the RAID array is copying data from /dev/sdc to the newly installed /dev/sdd drive. Time to get some rest while we wait for the system to fully recover.
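If you are too impatient (or too nervous) to simply sleep on it, the rebuild can be watched from another terminal. Two equivalent ways to keep an eye on it:

# refresh the kernel's view of the rebuild every 60 seconds
$ watch -n 60 cat /proc/mdstat

# or ask mdadm directly for the rebuild status
$ mdadm --detail /dev/md1 | grep -i rebuild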
Waking up the next morning, I saw that the RAID array had completely recovered (figure 4). With all of my and my family's files once again stored across multiple hard drives, I felt at ease!
P.S. It is important to note that RAID is not a replacement for regular backups. RAID only protects you from hardware malfunctions (which are relatively rare); backups made with programs like rbackup protect you from yourself. If you accidentally delete an important file, RAID will happily delete that file from every drive in the array. As a precaution against this type of error, I have a separate drive in my file server that keeps the last six months of backups.
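For the curious, a backup drive like that can be driven by something as simple as a nightly hard-link snapshot script. The sketch below uses plain rsync rather than the tool mentioned above, and the paths are illustrative, not my actual layout:

#!/bin/sh
# nightly-snapshot.sh -- keep dated, hard-linked snapshots on the backup drive
SRC=/srv/data          # the RAID1-backed data
DEST=/mnt/backup       # the separate backup drive
TODAY=$(date +%F)

# files unchanged since the last snapshot are hard-linked, not copied again
rsync -a --delete --link-dest="$DEST/latest" "$SRC/" "$DEST/$TODAY/"
ln -sfn "$DEST/$TODAY" "$DEST/latest"

Pruning anything older than six months is then just a matter of deleting the old dated directories.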
P.P.S. And if you are extra paranoid about data loss, it is best to keep a copy of your backups at a physically separate location (the further away the better). This protects you in the event of natural disasters or other unforeseen events. I use a server at my parents' house in Minnesota (a good 2000+ miles away) to keep copies of my backups, just in case.
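Shipping those snapshots offsite can be as simple as an rsync over ssh; the hostname and paths below are placeholders, not my actual setup:

# push the local backup drive's contents to the offsite server
$ rsync -az --delete /mnt/backup/ minnesota:/srv/offsite-backup/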
P.P.P.S. The more paranoid you are, the more closely your system will resemble AWS S3.