Risk & Fragility: Do you know where your bytes are tonight?


Goha, a character from Egyptian folklore, is often used to convey nuggets of wisdom in the fashion of Aesop's Fables, with a satirical twist. Goha was once found, at his wits' end, looking for his donkey. A passer-by noticed his befuddlement and the following dialogue ensued.

- "Goha, what's the trouble?"
- "I lost my donkey!"
- "Where was it?"
- "Tied to a tree."
- "Which tree?"
- "The one with the cloud above it."
- "Did someone steal it?"
- "I don't know. I can't find the cloud."


I have seen a lot of IT folk do worse. I saw them tie their bytes to the cloud.

Now, let us be clear. I love a fad no less than the next guy. And in my career, wearing my computer scientist hat, I have witnessed my fair share of fads. I have seen them come and go, only to come back again. I have seen the Artificial Intelligence (AI) fad, with its promise to solve all of humanity's issues by 1990. I have seen the Neural Networks fad, with its promise to find the optimal answer to any math problem you do not understand well enough to model properly. I have seen the Object Oriented (OO) fad, with its promise to make the pain of programming a thing of the past. Code would, all of a sudden, become clean, bug free and easy to maintain. Problems that were procedural, functional or logical in nature had to be shoehorned into an OO model just so you could look in keeping with the times.

Back to the cloud

Properly used, cloud computing is a tremendous resource. As a reliability engineer and financial risk manager, I became interested in the role of cloud computing as it pertains to system fragility. In future posts I'll discuss fragile systems and why human-built systems tend to grow more fragile over time. Fragility is often the flip side of convenience or efficiency, and it stems from remoteness, either physical or through divestiture of control. As a result, if the cloud is used as a way to shirk responsibility, it leads to serious fragility issues. On the other hand, if the cloud is used as an additional, locally managed resource, it can improve the reliability of the entire system.

Level of service (LOS) figures are among the concepts most misunderstood by IT professionals. What does it mean when a company guarantees 99.99% uptime? It usually means that, in its past experience, the failure rate was no more than 0.0001 (that is, 0.01%). Sounds great. But as every financial portfolio manager has to tell his clients, past experience is no guarantee of future performance.
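To put a number like 99.99% in perspective, it helps to translate it into the downtime it still permits. The sketch below is only back-of-the-envelope arithmetic: the function name `allowed_downtime_minutes` and the 365-day year are my own assumptions, not any provider's definition of LOS.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability: float, days: int = 365) -> float:
    """Downtime (in minutes) that is still consistent with a given uptime fraction.

    `availability` is a fraction such as 0.9999, measured over `days` days.
    Both the name and the 365-day default are illustrative assumptions.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for label, fraction in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        minutes = allowed_downtime_minutes(fraction)
        print(f"{label:>8}: {minutes:7.1f} minutes per year")
```

A 99.99% figure works out to roughly 52.6 minutes a year. The arithmetic says nothing about when those minutes fall, or about whether next year's conditions will resemble the years the figure was measured over.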

For example, my experience had been that RAID5 was extremely reliable. That is, until it became unreliable. This summer, the east coast of the US has seen unusual thunderstorm activity. As a result, I lost a few electronic components, including a RAID adapter, even though they were all on a UPS. This points to the concept of conditional probability. A provider may be able to guarantee 99.99% LOS only as long as past external conditions prevail.

Redefining reliable

To illustrate further, here is an example comparing two fictitious manufacturers. Prestige Peripherals is a hard drive manufacturer that sells expensive drives and claims a historical in-the-field failure rate of 1 in 1000 (one failure in 1000 days of operation). In other words, the drives are roughly 99.9% reliable. Nameless Nockoffs is also a hard drive manufacturer; it sells drives for a quarter of the cost of its name-brand competitor and claims a failure rate of 2 in 1000 (i.e. 99.8% reliability). At first blush, the PP drives are twice as reliable as the NN ones.

What you have not been told, however, is that since the PP drives are so expensive they are used primarily by top corporations with serious IT budgets, and are enclosed in temperature-controlled server closets. All the failures of the PP drives happened following a server room air-conditioning malfunction. Every time the ambient temperature exceeded 110°F, the PP drive failed. The NN drives, on the other hand, are used primarily by individuals and small companies with no IT budget, in environments where 110°F is a weekly occurrence. Now, which is the more reliable drive? Conditional on the ambient temperature exceeding 110°F, the PP drive has 0% reliability. The NN drive had only two failures when experiencing high-temperature events. This makes it 98% reliable during high-temperature events. In other words, the PP drive's headline figure is effectively conditional on the high-temperature event not occurring, which makes the drive look more impressive than it actually is.
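The comparison can be worked through with a few lines of arithmetic. This is a sketch under stated assumptions: the text gives only the failure rates and the 98% figure, so the counts below (one hot day for PP, 100 hot days for NN, over 1000 days each) are hypothetical numbers I picked to reproduce those figures. The `FieldRecord` class is likewise made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FieldRecord:
    """Hypothetical field history for one drive model (all counts are assumed)."""
    name: str
    days: int            # days of operation observed
    hot_days: int        # days on which ambient temperature exceeded 110°F
    failures_hot: int    # failures that occurred on hot days
    failures_cool: int   # failures that occurred on all other days

    def overall_reliability(self) -> float:
        # The headline figure: ignores the conditions under which failures occurred.
        return 1 - (self.failures_hot + self.failures_cool) / self.days

    def reliability_given_hot(self) -> float:
        # Conditional figure: reliability restricted to hot days only.
        return 1 - self.failures_hot / self.hot_days


# Assumed counts chosen to match the 99.9%, 99.8%, 0% and 98% figures in the text.
pp = FieldRecord("Prestige Peripherals", days=1000, hot_days=1, failures_hot=1, failures_cool=0)
nn = FieldRecord("Nameless Nockoffs", days=1000, hot_days=100, failures_hot=2, failures_cool=0)

if __name__ == "__main__":
    for drive in (pp, nn):
        print(
            f"{drive.name}: overall {drive.overall_reliability():.1%}, "
            f"given a hot day {drive.reliability_given_hot():.1%}"
        )
```

The headline figures favor PP, while the figures conditional on a hot day favor NN. The ranking flips once you condition on the event you actually care about.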

Reducing risk in the cloud

Cloud computing providers can be adversely affected by many external factors including political and legal issues, energy disruptions, natural disasters, and network and operating system vulnerabilities. As a result, in considering cloud computing one needs to worry about:

  • The LOS of the cloud provider.
  • The location, distribution and management of the cloud servers.
  • The provider's approach to data and network security.
  • The reliability of one's link to the cloud.
  • A fully functional and quickly accessible backup solution, in case one's link to the cloud fails.
  • Running daily tests to ascertain the reliability, integrity and scope of cloud storage.
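The last item on the list, a daily integrity test, might look something like the following. This is a minimal sketch and not any provider's API: `fetch` stands in for whatever call retrieves an object from your cloud store, and `manifest` is a hypothetical mapping from object names to SHA-256 digests that you recorded locally at upload time.

```python
import hashlib
from typing import Callable, Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(manifest: Dict[str, str], fetch: Callable[[str], bytes]) -> List[str]:
    """Return the names of objects that are missing or whose contents changed.

    `manifest` maps object name -> expected SHA-256 digest (kept locally).
    `fetch` is a placeholder for the real download call; any exception it
    raises is treated as "object could not be retrieved".
    """
    problems = []
    for name, expected_digest in manifest.items():
        try:
            data = fetch(name)
        except Exception:
            problems.append(name)   # unreachable or missing counts as a failure
            continue
        if sha256_hex(data) != expected_digest:
            problems.append(name)   # retrieved, but the bytes are not what we stored
    return problems


if __name__ == "__main__":
    # An in-memory dict plays the role of the cloud store for this demonstration.
    store = {"ledger.db": b"account data", "photos.tar": b"holiday pictures"}
    manifest = {name: sha256_hex(data) for name, data in store.items()}

    store["photos.tar"] = b"silently corrupted"
    print(verify_backup(manifest, store.__getitem__))
```

The point of keeping the manifest on your side is that the check does not depend on the provider's own word about what it is holding. A test that downloads and hashes everything covers integrity and scope at once, at the cost of bandwidth.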

Three areas where cloud computing offers a great advantage and, in fact, reduces fragility are the areas of distributed storage of properly encrypted data, the area of distributed web hosting, and the area of Software as a Service (SaaS) applications.

Fads are cool. They keep life exciting. Trouble is, fads turn into bubbles. And bubbles ultimately pop. As long as you buy into the fad for the right reasons and take the right precautions, you can wear your Silly Bands and remain safe.


About Author

Maged Tawfik

Financial Risk Specialist, SAS Institute, Cary, NC; PhD, Cornell University
