Thursday, March 20, 2014

Death of a Hard Drive

Diagnosis

You just never know what you're going to get hit with from one day to another. Late last week I got a call from our operator at the Five-Mile Pond Dam. He said that there was an odd alarm on the SCADA computer that wasn't going away when you acknowledge it. It was a strange alarm, but it was enough to point me in a direction:
MAC_DL Lost DB connection to SQL Server
There was also one for MAC_PTDL but since that part of the error didn't mean anything to me I figured it was probably secondary to the primary issue of the lost connection to the SQL Server. I don't pretend to be an SQL guy and I had nothing to do with how the computer was set up seven years ago. I do know that the SQL Service was running and it claimed that the database was online (this turned out to be too general to be useful).

I started to dig through the admin tools for the SCADA software (Cimplicity) and was on the verge of contacting the company that developed the SCADA for us. Before I did that I was trying to check everything I could find and stumbled into the Trends section. Trends are stored to the database but everything looked fine until I tried to close out of the Trends admin area. As soon as I tried to do that it told me that the database it was trying to use ("FiveMile") no longer existed. The list of available databases contained nothing that looked like it would be used for trends and definitely didn't contain one called "FiveMile."

Database Woes

As I said, I'm not a database guy. I am generally pretty savvy though and found my way into the SQL Database Admin tools pretty easily. At this point I found a list of available databases. This list included one called "FiveMile" but it was grayed out and next to it was the label "suspect".
FiveMile (suspect)
What does that even mean? I'm sure there are people out there who really know, but from what I found it basically means that something happened (computer crashed?) and when it started back up SQL tried to restart the database and found a problem that it couldn't repair. Thus it suspects there's something wrong.

I spent a great deal of time learning about suspect databases and emergency mode and running DBCC CHECKDB and all sorts of related errors and status codes. It was rough. Once I got the DB back up, CHECKDB didn't want to run on this ancient XP machine with a 5GB database. The errors in the SCADA had changed but weren't quite gone. I thought I had made progress but couldn't be sure and was planning to do some more DB research when things changed...

Hardware Failure

I had assumed that the DB issues were stemming from the hard drive starting to fail. That wasn't quite right though; it looks like the DB issues were the end of the failure, not the start. I had placed an order for a new hard drive the first day I was exploring the issues and I'm glad I did, because the next day the computer died. It locked up on boot, and further boot attempts resulted in "No boot device found" sorts of errors. The site is 400 miles away so I couldn't verify, but since my remote access was gone I was pretty sure that was it.

These machines are old, not very good to begin with (Dell 830 tower) and in very dirty environments. I'm surprised it lasted 7 years straight. I had anticipated this though. Back in January when I was doing the network upgrades I decided to make backups of the computers (FiveMile and Quinabaug) just in case. It wasn't part of the initial plan and I didn't run it past upper management, I just did it.

I bought a USB to SATA (bare drive) adapter and wanted a rugged external drive to save it all on. After a TON of research I ended up with the Silicon Power Rugged Armor A80 external 1TB drive. I'm not yet at the point where I would say it's the best drive ever, but it's supposedly waterproof and dustproof. It has a good warranty and it includes a very short USB3 cable that slots into the side of it, which is VERY handy. I've never used any other cable with it.

For the actual backups I went with a program called DriveImageXML from Runtime Software. I think I went with it because it was free and sounded like it did a reasonably good job. 

So, back in January I anticipated a hardware failure and made a backup of the SCADA computer at Five-Mile Pond Dam. Now, in March, the drive failed and it's time to test my backups...

A Simple Drive Restore

Everybody knows that you NEVER test your backups until you need them. I am, of course, being sarcastic, but we're not in a position to have a bunch of extra hardware lying around to actually perform those sorts of tests on. So I made backups, took them away with me and hoped that we'd never need them. Yet here we are, needing them.
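Short of a full test restore on spare hardware, one cheap partial check is to record a checksum for every file in the backup folder right after the image is made, and compare again before relying on it. It won't prove the image will boot, only that the files haven't changed or gone missing since they were written. Here is a minimal Python sketch of that idea; the function names and the folder layout are assumptions for illustration, not part of DriveImageXML.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large image files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir):
    """Map each file (path relative to backup_dir) to its SHA-256 hex digest."""
    root = Path(backup_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(backup_dir, manifest):
    """Return a list of problems: files that are missing or whose hash changed."""
    current = build_manifest(backup_dir)
    problems = []
    for name, expected in manifest.items():
        if name not in current:
            problems.append(f"missing: {name}")
        elif current[name] != expected:
            problems.append(f"changed: {name}")
    return problems
```

The manifest returned by `build_manifest` would be saved somewhere other than the backup drive itself, then passed to `verify_manifest` later; an empty list means every recorded file is still present and unchanged.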

My HOPE was that I could restore onto the new drive here at my house, then drive the 400 miles out to the dam, install it, then drive home the next morning. I had purchased a data-center-quality drive that was supposedly designed for long-term reliability, which should arrive Monday. I'd drive out Tuesday and come back Wednesday. 12 hours of driving in two days isn't my idea of fun but without that computer there was no remote access and the operator would have to be onsite a lot more. So let's do it.

The restore went smoothly on Monday afternoon and it looked like the drive was all ready to just be plugged in and it would work. I haven't done many Windows restores so I assumed it would go smoothly and struck out Tuesday morning with every expectation of this being a cake walk. I'd install the drive in 30 minutes then be off to dinner and the hotel where I would monitor it remotely during the evening, meet the operator briefly in the morning and head out. Oh the best laid plans...

Murphy is my project manager

I arrived at the site around 4:30pm after some delays and cracked open the PC. A dead drive doesn't really look any different from a good one, but seven years of industrial dust gives a drive a strange texture.

I suppose it could have been worse after seven years. Anyway, I easily swapped in the new drive and buttoned it up. Turned it on and... .... .... No love. No boot device present. Immediately I'm like, "it must be the Master Boot Record, it probably didn't get restored." I go to look up how to fix it and it requires the Windows install disk. Yeah... no idea where that is.

The operator thinks it's at the other plant so we run down there and manage to find a Windows 2000 Dell install disk. What's the OS on the machine at the other plant? I can't remember, might be 2000... (rookie error). So we grab the disk and head back. I boot into the repair console and run FIXBOOT which is the typical first step. Still no luck.

Things get a little hazy here as I started to get nervous. My tech cred was rapidly disappearing as the operator patiently waits for his computer to come back online. I try a bunch of stuff which may or may not have been in the following order:
  1. Realize the partition needs to be "ACTIVE" to boot, so I rip the drive out of the machine and hook it back up to my laptop to fix that.
  2. Once that's done it turns out the FIXBOOT for Win2k won't help a WinXP install (stupid), and now there's a MISSING NTLDR error.
  3. Operator takes THREE more trips to the other plant before we FINALLY find the Windows XP Dell install CD.
  4. XP Repair console doesn't really like what I did with the Win2K repair console. Out of desperation I try FIXMBR, which warns me that it might mess up my partition table, but I didn't read the warning.
  5. My partition table gets messed up and the drive is dead.
At this point it's over. There's nothing I can do at the plant. We pack the computer (big old Dell tower) and monitor, keyboard and mouse into my car and I tell the operator I'll fix it in the hotel tonight and we'll get it set up in the morning. I'm pretty down but determined to get this working.

Determination and tenacity win the day

I know this is it. If I want to head home tomorrow morning I need to get this done tonight no matter what. Earlier I had realized that I forgot my USB CD Writer at home (STUPID) and I saw the potential of needing to burn an emergency boot disk of some sort. At the very least I didn't want to realize that I need to do that at 11:00pm and be SOL. So I tried a Target first and then a Staples and found an external SuperDrive and some blank CDs and DVDs. Then I grabbed some takeout (Subway) and hunkered down at the hotel.

First order of business was to restore the data again. I repartitioned the drive (set partition 0 to Active) and began the restore. Thankfully there's not a lot of data on the drive and it only takes an hour or so to restore from the external drive. During this time I'm talking to my wife and kids and the question on everyone's mind is, "You are still coming home tomorrow, right?" My answer is that this is still the plan, but in my heart I was not yet at a point where I was making progress.
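For what it's worth, the two things the BIOS and boot code care about at this stage (the 0x55AA signature at the end of the first sector, and exactly one partition entry flagged active) can be checked straight from the first 512 bytes of the disk. The following is a minimal Python sketch, not the tool used here; the function names are made up for illustration, and it only understands a classic MBR partition table.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # four 16-byte entries start here
ENTRY_SIZE = 16
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of a valid MBR sector


def parse_mbr(sector):
    """Return (has_boot_signature, entries) for a 512-byte MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    has_signature = sector[510:512] == BOOT_SIGNATURE
    entries = []
    for slot in range(4):
        start = PARTITION_TABLE_OFFSET + slot * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        partition_type = entry[4]
        if partition_type == 0:
            continue  # unused slot
        first_lba, sector_count = struct.unpack_from("<II", entry, 8)
        entries.append({
            "slot": slot,
            "active": entry[0] == 0x80,  # 0x80 marks the bootable partition
            "type": partition_type,
            "first_lba": first_lba,
            "sectors": sector_count,
        })
    return has_signature, entries


def boot_problems(sector):
    """List the obvious reasons this MBR would not hand off to a partition."""
    has_signature, entries = parse_mbr(sector)
    problems = []
    if not has_signature:
        problems.append("missing 0x55AA boot signature")
    active = [entry for entry in entries if entry["active"]]
    if not entries:
        problems.append("no partitions defined")
    elif len(active) == 0:
        problems.append("no partition is marked active")
    elif len(active) > 1:
        problems.append("more than one partition is marked active")
    return problems
```

To point it at a real disk you would read the first sector from the raw device (for example `\\.\PhysicalDrive1` on Windows or `/dev/sdb` on Linux, both hypothetical device names here and both needing admin rights) and pass those bytes to `boot_problems`. An empty list only means the partition table looks sane; it says nothing about the boot sector or boot.ini.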

With the data restored I pop it back into the tower and boot. No luck, just as I thought. The Windows Repair console didn't seem to help much last time, so I tried something new. I had downloaded something called the Ultimate Boot CD and installed it on a USB drive. I tried to boot the Dell off the USB and it works. Unfortunately the Ultimate Boot CD is anything but straightforward. I spend about 30 minutes gingerly navigating the tools on the CD (USB) before giving up. Nothing makes sense and the documentation is all out of date.

Back to the Windows XP Repair Console. This time when I start the console it asks me which Windows install I want to repair, this is new, and lists C:\Windows. I'm very excited. It's recognized something new that it didn't last time and I eagerly selected that entry.
Please enter the Administrator password: 
Oh, no. I try all the standard options and nothing works. Every three attempts I have to reboot and launch the repair console again (about 5-10 minutes each time). I've got no idea. We only used the "Operator" user and never the Administrator so nobody has any idea. The repair console won't let me repair without the password. Ugh.

I remember anticipating this issue at the other plant back in January. I had made a CD of the Offline NT Password and Registry Editor, a tool I had found that supposedly allowed you to reset Windows passwords. Well if ever there was a time to test it out... I dig it out of my bag and try to boot off it, but it won't boot. Crap. Try again. Finally it boots. I don't know what the issue was but it does finally boot and the provided instructions for the tool are perfect. Everything goes smoothly, I reset the Admin password and I'm finally able to boot into the Repair Console! Whew!

Of course we're not done yet. I start again with the FIXBOOT command. It appears to work correctly and I reboot the machine. There's no boot device error this time which is GREAT, but instead I get another weird error:
Windows could not start because the following file is missing or corrupt: <Windows Root>\system32\hal.dll
HAL.DLL.
What the hell is that? First link off a Google search sends me to About.com but is promising. It provides a list that I start going through:

  1. Reboot, could be a fluke. Nope didn't work.
  2. Check bios boot order. Yup, it's fine.
  3. Run XP restore command. Looked this up, didn't like the look of it. Skipped it. 
  4. Repair boot.ini. Restored boot.ini from a backup, I'm sure it's fine.
  5. New Boot Sector. This is FIXBOOT. I did this.
  6. Recover bad sectors. New drive hopefully this isn't it.
  7. Restore HAL.DLL from XP install CD. Really? I just restored from a backup, this should be fine.
  8. Repair Install. Risky.
  9. Clean Install. No way...
Not feeling very good about this HAL.DLL thing. I decide to be more thorough. I go back up to boot.ini and start trying to look into it. The Repair console doesn't give you a lot of options and I eventually remember the MORE command to view text files. Running MORE on boot.ini gives me some crap that I don't fully understand. It shows a "path" to the hard drive where windows is supposed to be installed. Doesn't look quite right. Seems like it SHOULD be saying partition 0, not Partition 1. So I do some research into how to edit boot.ini which leads me to the BOOTCFG command which has an option to display the currently available OS installations. Sure enough it shows Windows on Partition 0. So somehow there was something else on Partition 0 of the original hard drive that I didn't back up (or restore), oops.

Some more searching lands me a bootcfg tutorial which helps me add the correct drive to boot.ini. 
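The check that finally found the problem, comparing the partition named in boot.ini against where Windows actually lives, can be sketched in a few lines of Python. This is an illustration only: the sample boot.ini contents, the function names and the partition numbers are assumptions, not copied from the machine. (boot.ini ARC paths count partitions from 1, so the numbers in the sample won't line up exactly with the numbering described above.)

```python
import re

# Matches the ARC path form used in boot.ini, e.g.
# multi(0)disk(0)rdisk(0)partition(2)\WINDOWS
ARC_RE = re.compile(
    r"multi\((?P<multi>\d+)\)disk\((?P<disk>\d+)\)"
    r"rdisk\((?P<rdisk>\d+)\)partition\((?P<partition>\d+)\)",
    re.IGNORECASE,
)

# Hypothetical boot.ini, made up for this example.
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(2)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(2)\WINDOWS="Microsoft Windows XP Professional" /fastdetect
"""


def arc_entries(boot_ini_text):
    """Return (line_number, rdisk, partition, line) for each ARC path found."""
    entries = []
    for line_number, line in enumerate(boot_ini_text.splitlines(), start=1):
        match = ARC_RE.search(line)
        if match:
            entries.append((
                line_number,
                int(match.group("rdisk")),
                int(match.group("partition")),
                line.strip(),
            ))
    return entries


def mismatched_entries(boot_ini_text, actual_partition, actual_rdisk=0):
    """Return the boot.ini entries that point somewhere other than the real install."""
    return [
        entry for entry in arc_entries(boot_ini_text)
        if (entry[1], entry[2]) != (actual_rdisk, actual_partition)
    ]


if __name__ == "__main__":
    # Suppose the restored drive has Windows on the first partition.
    for line_number, rdisk, partition, line in mismatched_entries(SAMPLE_BOOT_INI, 1):
        print(f"line {line_number}: points at rdisk({rdisk})partition({partition}): {line}")
```

Both the `default=` line and the matching line under `[operating systems]` need to agree with the real layout, which is why the sketch reports every ARC path rather than just the first one.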

BOOM!

No more HAL.DLL errors and the machine boots up and seems to be running fine.

PHEW.

I have a new level of loathing for Windows but at least I can say I learned something. I ran some more tests and in the morning we set it back up in the plant and everything is running smooth. YES! After a quick trip to Staples to return the stuff I bought the previous night I'm on my way home.

Wow. What a day. Anyway everyone was happy and I'm glad I was able to get everything working. I'm still not completely happy with the backup/restore process but it was WAY better than nothing.