
November 14, 2005

Hard drive failings

Got our Inbox back on the weekend, fortunately. But we were curious about the fact that FastMail's problem had been caused by the sudden premature failure of three hard drives in one RAID 6 array. The question we had for Jeremy was this: "What brand of hard drives was involved?" The answer? Western Digital. We're waiting for further details [we expect they've still got their hands full] but somehow, we think we'll continue to recommend Seagate drives in our quarterly workhorse PC specifications.

Posted by cw at November 14, 2005 09:43 PM

Comments

The above failure of 3 HDs highlights a backup problem for me. I've just installed a server in my business with a RAID array of just 2 HDs. Collapse of the server would be really bad news for my business. How do you build redundancy into the server setup: a second server, minus HDs, to switch to if something other than the disks fails? I thought a RAID array of 2 disks was good disk backup but, in light of the above, obviously not. What to do?

Posted by: Dixp8 at November 14, 2005 10:44 PM

I've used Western Digital, Maxtor, Seagate, IBM (now operating as Hitachi), Toshiba, Fujitsu and the defunct brands Conner and Quantum. Nearly all of these have been IDE drives or even earlier designs.

The point to make is that they are consumer level devices. Server hard disks are invariably SCSI. They run at relatively higher RPMs and have a higher statistical MTTF. They are/should be more expensive. Making a good pro drive range does not necessarily imply that the consumer drive range is built to the same quality standards. That having been said, Seagate has always made a name in the pro SCSI arena, and their IDE drives, although now made in China instead of Singapore or Malaysia, are doing well.

There are RAID arrays now, using IDE hard disks instead of SCSI. No doubt the cost is much lower since you're talking about 5-6 hard disks x cost difference between IDE and SCSI for each one. The lower IDE RAID cost however, does mean that you are not using devices made to the same MTTF.

Posted by: anandasim at November 15, 2005 09:18 AM

Three disks failing in a single RAID configuration in quick succession suggest to me one of, or a combination of, the following scenarios:

1) A faulty UPS unit failed to protect the server from a power spike or brown out,

2) All three disks came from the same faulty batch with a reduced life span, which is a no-no when it comes to maintaining stable RAID configurations,

3) Bad practice resulted in the disks being used beyond their use-by date, or,

4) Either faulty monitoring software or bad practice resulted in the first failure not being picked up in time for a disk replacement and re-build.

On the other hand, it was probably just bad luck. I look forward to JH’s explanation once things have calmed down and the cause is investigated.
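
As a rough back-of-the-envelope check on the "bad luck" scenario, here is a minimal sketch of how unlikely three independent failures in a short window would be. The MTTF, array size and time window below are hypothetical illustration values, not FastMail's figures, and the constant-failure-rate (exponential) model is an assumption.

```python
from math import comb, exp


def p_fail_within(hours, mttf_hours):
    """Probability one drive fails within `hours`, assuming a constant
    failure rate (exponential model) with the given MTTF."""
    return 1.0 - exp(-hours / mttf_hours)


def p_at_least(k, n, p):
    """Probability that at least k of n drives fail, assuming each fails
    independently with probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers, for illustration only.
MTTF_HOURS = 500_000   # assumed per-drive MTTF
DRIVES = 8             # assumed number of drives in the array
WINDOW_HOURS = 2       # assumed window in which the failures occur

p_single = p_fail_within(WINDOW_HOURS, MTTF_HOURS)
p_triple = p_at_least(3, DRIVES, p_single)

print(f"P(one given drive fails in window): {p_single:.3e}")
print(f"P(at least 3 of {DRIVES} fail in window): {p_triple:.3e}")
```

If the failures really were independent, a triple failure in such a window would be vanishingly rare, which is why a shared cause (power, batch, or the stress of a rebuild) tends to be the first suspect.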

CW, be thankful you don’t use Bigpond’s mail servers – remember how long it took them to fix a catastrophic email failure.

(Jeremy, the above is meant as some general RAID based thoughts – if it sounds accusatory that is not the intention.)

Posted by: wilbert at November 15, 2005 10:49 AM

The disks were about 6 months old, and we replaced the first disk within 30 mins of receiving an error. The 2nd disk failed 30 mins later. The third disk didn't fully fail, but a bad block couldn't be recovered, and resulted in a file system corruption an hour later.

Posted by: Jeremy Howard at November 15, 2005 11:52 AM

Dixp8 - 2 HD in a RAID? You're using RAID 1? Check out the brief description of RAID. You might want to upgrade to RAID 5 which uses at least 3 HD. And from what wilbert says, maybe 3 HD from different brands?
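
To make the difference between the RAID levels mentioned in this thread concrete, here is a small sketch that computes usable capacity and the number of disk failures each level can survive. The function name `raid_summary` and the 250 GB disk size are made up for illustration; the figures follow the standard definitions of RAID 0, 1, 5 and 6.

```python
def raid_summary(level, disks, disk_gb):
    """Return (usable_gb, failures_tolerated) for common RAID levels.

    Hypothetical helper for illustration; assumes all disks are the same size.
    """
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")

    if level == 0:   # striping only: all capacity, no redundancy
        return disks * disk_gb, 0
    if level == 1:   # mirroring: one disk of capacity, survives all but one
        return disk_gb, disks - 1
    if level == 5:   # single parity: loses one disk of capacity
        return (disks - 1) * disk_gb, 1
    return (disks - 2) * disk_gb, 2   # RAID 6: double parity


for level, disks in [(0, 2), (1, 2), (5, 3), (6, 6)]:
    usable, tolerated = raid_summary(level, disks, 250)
    print(f"RAID {level} with {disks} x 250 GB: {usable} GB usable, "
          f"survives {tolerated} disk failure(s)")
```

Note that even RAID 6 only survives two failed disks, which is why a third problem disk was enough to cause trouble in the incident described above.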

Try not to equate RAID (which is supposed to reduce data loss from a bad write and to keep you running regardless) with regular and routine Data / System Backup which you still do in some fashion - that backs up and allows you to archive data - protection from some hardware/software/end user accidents that RAID cannot do.

Posted by: anandasim at November 15, 2005 12:09 PM

Here's some more information about what happened, for those who are interested.

Posted by: Jeremy Howard at November 15, 2005 12:24 PM

Anandasim, I think Wilbert meant that you buy the same drive model and brand for each RAID array, but purchase the drives from different production batches, to eliminate the chance of the drives having come off the line one after another. That guards against simple QA mistakes, such as the inspector being asleep when that lot passed his desk, and other QA factors that can arise in a manufacturing process.

The problems become a little trickier once you get to 5 or more drives, as they then become like the tyres on your car. You have the hot-swap spare in the boot ready to change in an instant, but how do you know where the next pothole in the road is? Is it 100 metres down the road or 100,000 km? A piece-of-string theory.

So as you start monitoring the batch and serial numbers of all your drives in production, you still need to maintain your hot-swap drives for serial number and batch compatibility within your RAID arrays, and track the service retirement date for each drive. The quantity of replacement drives to keep on site is always a lovely discussion between the parties involved. Then bring up 'How about a hot-swap server?' and instantly double the anger and double the price.

You should also use network and file server management software such as HP OpenView, Dell OpenManage and Intel Server Management, both to check with the server manufacturer for BIOS and firmware updates and for the standard server monitoring that this type of software performs. The final bit of advice is to subscribe to the drive manufacturer's e-mail/RSS lists to be notified about any errors or updates involving that particular piece of hardware or software.

Posted by: Anonymous at November 15, 2005 05:51 PM

Anandasim - thanks for the link and comments. I'm deeply unknowing on this subject but I'm concerned about disaster planning, redundancy, cost of failure etc. The RAID 1 system we installed is expensive enough for my small business but I think I need some further backup.

Posted by: Dixp8 at November 18, 2005 01:10 PM

SEAGATE & Large HD drive Problems:

I think we are returning to the 2000-2004 era of big hard drive problems. Too often "techs" keep thinking of every micro problem and ignore the big picture. I'm a tech and I saw IBM's Deathstar (Deskstar, now Hitachi) successfully class-action sued for quality problems. I had a 40G in 2001 go bad, slowly corrupting the files. I RMA'ed it and got a 60G that failed the same way ~1.5 yrs later. I have seen 3 Deskstars fail in customer computers, all with the exact same tell-tale scratching noise.

Fast forward: I bought a Maxtor 250 SATA. I ran the diagnostics and it passed all tests. 7 days later I was getting corrupt files (the previous drive is a WD and has been run 7 days a week for 6 years without a single problem). I ran the diags again, which said the drive was failing, and returned it to Fry's Electronics. I got a 2nd drive, tested it before using it, and this time got the "drive is failing" code right away.

In spring 2004 I reluctantly bought my first PATA Seagate 200G drive. Ran all tests - no problems. 1 month later I was getting the dreaded Seagate power noise - it is as though you unplugged and quickly plugged the drive back in. I got an RMA code from their diag (the full test said bad). The 2nd, replacement drive failed too and I am RMA'ing that now. Seagate didn't want to RMA it because it was beyond the 1 yr warranty, but said that due to all the documented failures I had experienced they would send me a 3rd drive, though it would only carry a 90 day warranty.

The problem: I heard a few very minor noises over the past week. Yesterday my backup / test system, a fully 48-bit compliant (Seagate certified) ABIT NF7M system, would not boot W2K SP4. When I unplugged the Seagate 300G drive it booted fine. I ran the drive via USB and W2K couldn't mount it. The drive has read errors and makes a continuous clicking noise. Wow: when the 200G backup drive failed I said no problem, thank God it was just a backup drive. Now the 300G is bad a week later, with 250G of data on it of which 200G is NO LONGER BACKED UP (the 200G is now being RMA'ed to Seagate). I called "Rudy Cobb" at Seagate in Texas and left a message to call me.

Seagate's data recovery starts at $500 and goes up.

Also, a power mac dual G5 has a seagate 160G sata drive that failed - it too made the dreaded power on/off noise. Apple warranteed it when I brought it to the store.

Concluding: Seagate bumped their warranty from 1 yr to 5 yrs in July 2004 to save their failing reputation. Maxtor only offers 1 yr. Clearly, the drives are not being made correctly and getting lost in MTBFs is silly - the drives are failing at 1% of their life expectancy! I have had zero problems with WD 120G drives. I believe they are the most bullet-proof drives currently being produced. 200G and up - I say watch out. I felt something was sketchy when that first Seagate 200G drive went bad. The news is slowly getting around that Seagate is making a huge number of bad drives. Also, backups may not save you. I didn't know the 2nd Seagate 200G was bad until I tried to copy certain large files. Then I got a read error, meaning they were already corrupted, with no warning from Windows 2000 and no scandisk / checkdisk forced-check message.
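
One way to catch this kind of slow, silent corruption before you need the backup is to record checksums of your files and re-verify them periodically. The sketch below is a minimal illustration using Python's standard `hashlib`; the helper names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and a changed checksum can of course also mean the file was legitimately edited.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under `root` (by relative path) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, root):
    """List files that are unreadable, missing, or whose checksum no
    longer matches the one recorded in `old_manifest`."""
    root = Path(root)
    suspect = []
    for rel, expected in old_manifest.items():
        try:
            actual = sha256_of(root / rel)
        except OSError:  # read error or missing file
            suspect.append(rel)
            continue
        if actual != expected:
            suspect.append(rel)
    return suspect
```

In use, you would call `build_manifest` on the backup drive right after writing it, store the result somewhere else, and run `changed_files` against it on a schedule; any file in the returned list that you did not knowingly change deserves a closer look.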

SUMMARY:

BAD:
Hitachi (formerly IBM)
Seagate (SATA / PATA: great drive in the 90's, now low cost junk. Corrupts slowly too, with no warning until you attempt to access a bad file, so even backups may not save you. Expensive, if you must back up everything to two Seagate drives in case two fail at the same time. Probably about as bad as the IBM Deathstar series.)

Maxtor: 160G and below is ok, Over that is questionable.

GOOD: (Best)
WD 120 (most reliable IDE drive around today and for the past several years.)
WD 160 Also good.
WD 200GB+ unknown / untested.

Editor's note: Hard drives fail, as a matter of course. Every vendor has a percentage of faulty products, but my own experience with Seagate drives has been very good indeed. In fact, I've not had a single Seagate failure. Nor have the many readers to whom I've recommended Seagate drives reported unusually bad experiences. Their response to customers has also been good. By the same token, I've also got a high regard for Western Digital drives. (And Maxtor is now owned by Seagate.) The DeskStar problem was associated with a particular technology later dropped by IBM. But I had three DeskStar failures. They were a disaster. I don't think, though, that Hitachi has had similar problems. - CW

Posted by: Paul at January 13, 2006 04:19 AM

I've had 3 WD 120GB drives; all were RMA'ed due to the crappy ball-bearing motor whining after 6 months of use, even though they were adequately cooled.
The replacements seem to have hybrid fluid/ball-bearing motors like other manufacturers' drives.

Posted by: Daijoubu at March 26, 2007 03:40 PM

Over the last 3-4 years I have been running WD SATA 160's in my crazy, lose-all-data RAID 0. During that time I have had 3 deaths with the on/off cycling problem.

All was fine since WD RMA'd them every time, and I like the upfront replacement. That is, until I got my 3rd replacement yesterday. I ran the serial for kicks, only to find that I had 90 days left on my warranty. I had almost 2 years of warranty on my old drive before replacing it, though.

Has anyone noticed the same thing? Is there something I'm missing? I have e-mailed WD, but no answer as of yet. I hope somehow this is a mistake, and I don't end up leaving for another company.

Posted by: GSX at April 10, 2007 09:49 PM