The Myth of Redundancy in Data Storage

Redundancy in a data storage system does not by itself guarantee reliability.
Are two wooden houses safer than one brick house?

In Part 3 – Lack of write speed gain, we talked about the poor write performance of RAID, especially RAID 5 and RAID 6.
That lack of write speed is compensated by some increase in reliability, at least until we need to rebuild/resilver the array.

Risk management is a difficult concept and requires a lot of training. So we often substitute basic redundancy for real risk analysis and assume that we have fully banished the risk.

RAID and Risky Thinking

RAID is a good example of where a lack of risk thinking can lead to some strange decisions.
Consider a not uncommon scenario in which the goal of protecting against drive failure actually leads to an increase in risk, even though additional redundancy is built in.

Consider a RAID array consisting of 12x 10TB SATA hard drives. People usually say that RAID 5 is the best choice to get "maximum capacity and performance" while having "good protection against failures."

The idea here is that RAID 5 protects against the loss of a single drive, which can then be replaced. The array then rebuilds itself automatically (before a second drive fails).

That is great, in theory. But the real risks for an array with 120 TB of raw drive capacity do not come from multiple drive failures, as people generally suspect. The risks arise from the inability to rebuild the array after a single drive failure, or from a failure of the RAID array itself with no individual drive failing.

Of course, the risk of a second drive failing is usually quite low. Drives are very reliable these days. This is well documented by data from Google's data centers and from Backblaze.

What often scares us during a RAID 5 resilver or rebuild is that an unrecoverable read error (URE) can occur. When the resilver process stops because of it, the array is left in a useless state and all data on the array are lost. (When a device is replaced, a resilver/rebuild operation is initiated to move data from the good copies to the newly added device. This is a form of disk scrubbing.)

Today, a modern SATA drive has a URE rate of about 10^-14, that is, one unreadable bit per 10^14 bits read, or roughly one error for every 11 TiB (12.5 TB) of data read. That means even a small home-use 6 TB array being resilvered has a roughly 50% chance of encountering a URE and failing, so all data are lost in the end, even in a RAID 5 scenario.
A 50% chance of failure is high. Imagine if your aircraft had a 50% chance of the wings falling off every time you took off.
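
The figure above can be checked with a back-of-the-envelope calculation. The sketch below assumes that bit errors are independent and occur at the quoted rate of one per 10^14 bits read, which is a simplification of how real drives behave. The function name ure_probability is made up for this illustration. Under this simple model, reading 6 TB gives a result closer to 40% than to the rough 50% quoted above, which is still far too high for comfort.

```python
import math


def ure_probability(bytes_read: float, ure_rate_per_bit: float = 1e-14) -> float:
    """Chance of hitting at least one URE while reading bytes_read bytes.

    Assumption: every bit fails independently with the same probability.
    """
    bits_read = bytes_read * 8
    # P(no error) = (1 - rate) ** bits_read; log1p/expm1 keep this
    # numerically stable for very small rates and very large bit counts.
    return -math.expm1(bits_read * math.log1p(-ure_rate_per_bit))


TB = 1e12  # decimal terabyte, as used on drive labels

if __name__ == "__main__":
    print(f"6 TB read:  {ure_probability(6 * TB):.0%} chance of a URE")
    print(f"12 TB read: {ure_probability(12 * TB):.0%} chance of a URE")
```

The second line shows how quickly the number grows: doubling the amount of data that must be read pushes the chance of a URE well past one half.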

So with a small (again: home user) 6 TB RAID 5 array using 10^-14 URE drives: if we lose a single drive, we have only a 50% chance that the array will recover, assuming the failed drive is replaced immediately or within a relatively short time. That doesn’t include the risk of a second drive failing, only the risk of a URE failure.

It also assumes that the array is otherwise completely idle. If the healthy drives are serving production tasks at the same time, then the chances of something bad happening (a URE, a second drive failure, or a CPU/memory problem) increase dramatically.

But we are not talking about a 6 TB array, we are talking about 120 TB. That sounds huge, but it is a common size that some people even have at home today for all their movies.
At my clients I’ve seen estimated resilver times of weeks, even a month, on the system’s display. That is a long time to run without being able to lose another drive. When we are talking about hours or days only, the risks are lower, but still present. But when we are talking about weeks or a month of continuous drive abuse (resilver operations are extremely drive intensive) …
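
To see how a rebuild estimate can stretch from hours to weeks, it helps to divide the capacity of the replaced drive by the rate at which the array can actually rebuild it. The sketch below uses hypothetical rebuild rates. They are illustrative assumptions, not measurements, and the helper rebuild_days is made up for this example.

```python
def rebuild_days(drive_bytes: float, rebuild_bytes_per_s: float) -> float:
    """Days needed to rewrite one replaced drive at a sustained rebuild rate."""
    seconds = drive_bytes / rebuild_bytes_per_s
    return seconds / 86_400  # seconds per day


TB = 1e12
MB = 1e6

# Hypothetical sustained rebuild rates, chosen only for illustration.
SCENARIOS = [
    ("idle array", 150 * MB),
    ("busy array", 20 * MB),
    ("heavily loaded array", 5 * MB),
]

if __name__ == "__main__":
    for label, rate in SCENARIOS:
        print(f"{label}: {rebuild_days(10 * TB, rate):.1f} days for a 10 TB drive")
```

With these assumed rates the same 10 TB drive takes less than a day on an idle array and more than three weeks on a heavily loaded one, which is in line with the estimates seen on production systems.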

RAID 0 vs. RAID 5

With a RAID 5 array of this size, we can assume that the loss of a single drive means the loss of the complete array: all data are gone.

Now we compare that to an array of the same performance and the same capacity using RAID 0. Remember, RAID 0 provides us with no protection against drive loss.

For a RAID 0 of the same size, we need only 11 drives, compared to 12 for our RAID 5 array.

So instead of 12 hard drives, each of which has a roughly 3% chance of annual failure, we have only 11. That makes the RAID 0 array more reliable for us in this respect, since there are fewer drives to fail. An additional benefit is that there is no need to write parity blocks or to skip them when reading, which makes RAID 0 incredibly fast.
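
The effect of the drive count can be estimated with a simple model. The sketch below assumes that each drive fails independently with the 3% annual probability mentioned above, which ignores correlated failures (same batch, same enclosure, same power supply). The name any_drive_fails is made up for this illustration.

```python
def any_drive_fails(n_drives: int, annual_failure_rate: float = 0.03) -> float:
    """Probability that at least one of n independent drives fails within a year."""
    # P(all survive) = (1 - rate) ** n, so P(at least one fails) is the complement.
    return 1 - (1 - annual_failure_rate) ** n_drives


if __name__ == "__main__":
    for n in (11, 12):
        print(f"{n} drives: {any_drive_fails(n):.1%} chance of at least one failure per year")
```

The difference between 11 and 12 drives is about two percentage points per year. It is small, but it works in favour of the array with fewer drives.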

The RAID 0 array of 11 drives will be identical in capacity to the 12-drive RAID 5 but will have slightly better throughput and better access times, and it saves the cost of an additional drive.
If we move to enterprise SATA or SAS drives, the capacity at which this occurs becomes very high, so it is not a big deal today. But it will become one in the future as drive capacities grow. The most dangerous thing about a RAID 5 today is its huge overall size.

Of course, everyone understands the incredible risks of RAID 0. But it is very difficult to put into perspective how extreme RAID 5 failures can be: RAID 5 may be less reliable than RAID 0.
In an array of this size, RAID 5 might be less reliable than RAID 0 because of the rebuild. In a massive array like this, the resilver can take so long that the chance of a second drive failure becomes a measurable risk. Also take into account the additional risk of controller errors, which can trigger resilver algorithms that destroy an entire array even if no drive failure occurred.
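
The comparison can be put into rough numbers. The sketch below is a deliberately simplified model, not a reliability analysis: it assumes independent drive failures at 3% per year, independent bit errors at the given URE rate, and that any URE during the rebuild loses the whole RAID 5 array, as described above. It ignores second drive failures and controller errors, which would only add to the RAID 5 risk. All function names are made up for this illustration.

```python
import math


def any_drive_fails(n_drives: int, afr: float = 0.03) -> float:
    """Probability that at least one of n independent drives fails within a year."""
    return 1 - (1 - afr) ** n_drives


def ure_during_read(bytes_read: float, ure_rate_per_bit: float) -> float:
    """Probability of at least one URE while reading bytes_read bytes."""
    return -math.expm1(bytes_read * 8 * math.log1p(-ure_rate_per_bit))


def raid5_annual_loss(n_drives: int, drive_bytes: float,
                      ure_rate_per_bit: float, afr: float = 0.03) -> float:
    """Rough chance per year that a RAID 5 array loses a drive and the rebuild hits a URE."""
    # A rebuild has to read every surviving drive in full.
    rebuild_read = (n_drives - 1) * drive_bytes
    return any_drive_fails(n_drives, afr) * ure_during_read(rebuild_read, ure_rate_per_bit)


TB = 1e12

if __name__ == "__main__":
    raid0 = any_drive_fails(11)                    # any failure loses the stripe
    raid5 = raid5_annual_loss(12, 10 * TB, 1e-14)  # consumer-class URE rate
    raid5_ent = raid5_annual_loss(12, 10 * TB, 1e-15)  # assumed enterprise-class URE rate
    print(f"RAID 0, 11 drives:           {raid0:.1%}")
    print(f"RAID 5, 12 drives, 10^-14:   {raid5:.1%}")
    print(f"RAID 5, 12 drives, 10^-15:   {raid5_ent:.1%}")
```

Under these assumptions the 12-drive RAID 5 comes out at about 31% per year against about 28% for the 11-drive RAID 0, because a rebuild that has to read 110 TB is almost certain to hit a URE. With an assumed URE rate of 10^-15 the RAID 5 figure drops to about 18%, which matches the point that better drives push the problem out to larger array sizes.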

The Danger of using RAID 0

As we now know, RAID 5 can be less reliable than RAID 0.
RAID systems are generally used to lower the risk of a single hard drive failing. We all accept that a single drive can simply fail, burn, or brick, and that all its data are lost.
RAID 0 is a stripe of drives without any redundancy. A single drive failure causes total loss of the data on all drives. So in our 11-disk example above, if any of the 11 disks fails, everything is gone. Clearly, this is dramatically more dangerous than just using a single drive.

But redundancy does not mean reliability. If something is redundant, like our RAID 5, it provides no guarantee that it will always be more reliable than something that is not redundant.

Are 2 wooden houses safer than one brick house?

A brick house has some significant reliability advantages over 2 redundant wooden houses.
Redundancy doesn’t matter; reliability is what matters in the end.

Redundancy is often misunderstood.
Redundancy is a black-or-white question: Is it redundant? Yes or no.

Reliability is about measured failure rates and calculated probabilities; it is statistics based on analysis. I find it hard to explain reliability, especially when selling a system to business owners, so redundancy often becomes a simple substitute for this complex concept.

Instead of providing a redundant "system", it’s more common to make a highly reliable (and low cost :-)) subsystem redundant and then treat that subsystem redundancy as if it applied to the total system.

Good examples here are RAID controllers in SAN environments. Instead of having 2 SAN controllers from different manufacturers for redundancy, there is often only 1 controller integrated, yet the SAN is called redundant because of the drives built in. This means the SAN contains some redundancy, but that is not the same thing as a redundant SAN.

Compare having redundant aircraft (2 separate, working airplanes) with having 1 plane with a spare fuel pump in the wing locker, just in case the one fails.
Of course, having a spare fuel pump is not bad. But it is a very basic protection against aircraft failure compared to having a second aircraft ready to go.

Single Point of Failure

The myth of RAID 5’s reliability carries over to shared storage such as SAN/NAS, which often uses the same technology. Using the term "Single Point of Failure" (SPOF) creates a feeling of panic in everybody and is a great way to steer a sales conversation 😆.

A SPOF is something we like to remove if possible, but it is not the end of the world.

Think about our brick house. It is a SPOF. The 2 wooden houses are not. A single storm may remove our redundant solution faster than our reliable SPOF brick house. Analyzing SPOFs is a great method to explore fragility in the systems we use. But not every SPOF must be eliminated and made redundant in every scenario.

Our goal is reliability at appropriate cost. Redundancy, as we have learned, is not a substitute for reliability; it is simply a tool that we can use to get more reliability. Most businesses work this way and have many SPOFs in house.

It is critical for us as IT professionals to look at complete systems as a whole and consider reliability and risk, and then to think of redundancy simply as a tool to increase reliability.
Redundancy itself is not a science. But it is also not simple. Reliability is a complex problem to handle.

Continue with Part 5 – RAID comparison on an 8x Toshiba X300 High Performance HDD array. We’ll compare the performance of several RAID levels when using a state-of-the-art high performance hard drive.
