The Problem:

  • A hardware issue on one of my FreeNAS servers1 causes hard drives to disappear every few hours
  • When the drives disappear, it puts my data at risk2

Warning Sign:

  • Prior to the failure, the system's dmesg3 command starts showing ahcich#: Timeout errors like these:

      ahcich2: Timeout on slot 10 port 0
      ahcich2: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 50 serr 00000000 cmd 10008917
      ahcich4: Timeout on slot 31 port 0
      ahcich4: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 50 serr 00000000 cmd 10009e17
      ahcich4: Timeout on slot 19 port 0
      ahcich4: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 40 serr 00000000 cmd 10009217
    

Goal:

  • Write a bash script to run once an hour that watches for the Timeout errors and reboots the machine if it sees one4

My Solution - Version 1 (Shameless Green):

The first thing I did was to write a Sandi Metz5 style Shameless Green6 version of the script. Meaning, I wrote the quickest thing I could put together that met my minimum requirements of:

  1. Watch for Timeout errors
  2. Reboot if one is found

Here's what I came up with:

#!/bin/bash

dmesg | grep Timeout

if [ $? == "0" ]
then
    /sbin/shutdown -r now
fi

The way it works is:

  1. Run the built-in FreeNAS dmesg command and pipe (i.e. |) the output to grep to search for the word Timeout.

    If a Timeout has occurred, then grep will identify the match which results in an exit code7 of 0 for the line

    If no Timeout has occurred, the exit code is something other than 0. It's usually 1, but I don't really care what it is as long as it's not zero because I then…

  2. Use the special $? bash parameter8 to grab the exit code and compare it with == against zero inside an if conditional statement

  3. If the exit code stored in $? is zero, then I run the FreeNAS reboot command (/sbin/shutdown -r) with the argument now to tell the server to initiate a reboot immediately

    Otherwise, nothing else happens and the script simply finishes without doing anything else

Updated Solution:

I decided to add some logging after confirming the first version of the script worked as expected.

Here's the final version. (Detailing out what each line does is left as an exercise to the reader.)

#!/bin/bash

LOG_FILE="/mnt/z/depot/Files/mingus_protection_tools/ahcich_issue_rebooter/log.txt"

DATE_TIME=$(date +%Y%m%d-%H%M%S) G
echo "${DATE_TIME}: Running script." >> $LOG_FILE

dmesg | grep Timeout

if [ $? == "0" ]
then
    echo "${DATE_TIME}: Found a timeout. Rebooting." >> $LOG_FILE
    /sbin/shutdown -r now
else
    echo "${DATE_TIME}: No Timeout issue found." >> $LOG_FILE
fi

Putting that script in place is letting me move files off the machine without the Timeout issue building up to the point of failure. As long as nothing changes, it should keep my data safe while I move things that weren't yet backed-up to another machine.

Footnotes

  1. This FreeNAS server, as a matter of fact.

  2. The server is setup with 11 hard drives in ZFS in RAID-Z3. Up to three can fail and my data won't be affected. However, if four fail at the same time all the data across all eleven drives will be lost in a way that's basically impossible to restore.

  3. More details about dmesg vai it's manual page.

  4. The script is run as the root user via cron once an hour.

  5. If you're not already familiar with her, you should check out Sandi's work. Her Practical Object-Oriented Design in Ruby class with Katrina Owen improved my programming by an order of magnitude.

  6. While it took a while to get use to, I'm now in love with the term (and concept) of Shameless Green. It's a specific reminder to do the minimum possible amount of work to get a test to pass. Without it, I have a tendency to internally scope-creep new code in a way that usually comes back to bite me. That little reminder has done more to improve and speed up my programming than anything else I've learned in my 20+ years hacking at code. (Why yes… it does deserve it's own post. Said post is one of the many items on my TODO list.)

  7. Here's more info on Exit Status (aka exit codes).

  8. And, here' more info on Bash Special Parameters including $?.