Wednesday, March 23, 2016

Redundancy and Fault Tolerance

Individual nodes can fail without causing system failure.

This tends to be a contentious topic, because all of a sudden, project managers feel like they're being pressured into buying at least two of everything, driving system cost and complexity up.  I'll admit that I'd rather use a device that's simpler and 10x more reliable than have to use a paired cluster of devices that triggers failure modes n times more often.  But we also have to take into account the consequences and impact of that one device failing, along with the cost of having a backup.  Unfortunately, people are bad at statistics when it comes to estimating when, not if, something critical to their system might fail.

Let me start with one counterexample.  Many airplanes have two engines.  The idea is that in the unlikely event that one engine fails in flight, the remaining engine can still get everyone safely to an emergency landing at the nearest serviceable airport instead of whatever field or body of water is within gliding distance.  The largest single-engine aircraft, such as the Cessna Grand Caravan or Pilatus PC-12, get by with one highly reliable Pratt & Whitney PT6 turboprop engine.  With regular maintenance, this turbine engine has reliability that is orders of magnitude better than piston engines in the same power range, so a single-turbine aircraft can maintain a safety record comparable to competing twin-engine piston aircraft.  Of course, turbine engines are a radically different technology from piston engines, with lower complexity and fewer moving parts, and the higher price is eventually offset by higher operating efficiency.  But the point stands: if the technology is radically different, sometimes one highly reliable device can be a better use of your money than a bunch of older, less reliable tech.  You're more likely to successfully drive across the country in one modern car than in a pair of '60s VW Beetles.  But that's about the difference in technology levels you'd need to be looking at.

Computers are such a commodity now, however, that it shouldn't hurt as much as it once did to buy two or more to do a job instead of throwing all your money at one big beefy server.  So you could configure one high-reliability server (often with redundant disks and NICs and fans and power supplies, the components most likely to fail) for about $2000, or you could configure two cheaper servers with similar specs for about $1000 each, and build a fully-redundant system where anything, including the mainboard, could break and the service could continue running.
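
To put rough numbers on that tradeoff (and these are purely illustrative figures, not vendor specs): if each of the two cheap servers is independently up 99% of the time, the odds of both being down at the same moment are about one in ten thousand.

    # Back-of-the-envelope availability math -- illustrative numbers only, and it
    # assumes failures are independent and failover is instant, which is optimistic.
    single = 0.99                      # assumed uptime of one cheap server
    both_down = (1 - single) ** 2      # probability both are broken at the same time
    pair_up = 1 - both_down            # ~0.9999, i.e. "four nines"
    print(f"one server : {single:.2%} available")
    print(f"A/B pair   : {pair_up:.4%} available")

On paper, the single $2000 server would have to be a hundred times more reliable than its cheaper siblings to match that, which is a lot to ask of one mainboard.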

The simplest way to make sure the service can continue running is to set up a Primary/Standby or Master/Backup style system.  Many would consider this a waste, since the standby machine is doing nothing most of the time.  Most people also forget to test the standby machine (particularly if it's turned off as a "cold" standby, as opposed to a "hot" standby that stays up and running and keeps its data in sync with the primary), so it's entirely possible that the standby breaks first and no one notices until the primary fails, at which point they discover the standby doesn't work either!  A backup system can end up giving you a false sense of security unless you go to the trouble of testing it regularly.  The best scenario makes this part of regular operations: exercise your fault recovery by actually switching between the two machines every week or so.  It helps to break out of the "primary/standby" terminology and just call them something more generic, like systems A and B.
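
To make that weekly switch concrete, here's a minimal sketch of what the exercise could look like if scripted.  The host addresses, health URLs, and the switch_traffic_to helper are all made up for illustration; in real life that last step might be moving a virtual IP, flipping a DNS record, or repointing a load balancer.

    # Sketch of a scheduled A/B switch (hypothetical hosts and helpers).
    import urllib.request

    HEALTH = {"A": "http://10.0.0.11/health", "B": "http://10.0.0.12/health"}

    def healthy(url):
        """True if the host answers its own health check."""
        try:
            return urllib.request.urlopen(url, timeout=5).status == 200
        except OSError:
            return False

    def switch_traffic_to(name):
        """Placeholder for whatever actually moves traffic (VIP, DNS, LB config)."""
        print(f"traffic now pointed at system {name}")

    def weekly_exercise(current):
        other = "B" if current == "A" else "A"
        if not healthy(HEALTH[other]):
            raise RuntimeError(f"system {other} failed its check -- fix it before switching!")
        switch_traffic_to(other)
        return other

The useful part isn't the switch itself; it's that a dead "standby" gets noticed on a boring Tuesday instead of in the middle of an outage.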

For running network services, having A and B systems gives you a lot more capability, such as zero-downtime A/B deployments, upgrades, and so on, that simply aren't possible on the more expensive single-host system.  But those are topics to dwell on in future posts.  For now, we just want to make sure our system is fully fault tolerant.  How can we be sure?

Well, you need to be able to take any individual component of your system out and have the system continue functioning.  The easy way to spot vulnerable components is to check for redundancy: there should be at least two of everything.  If there's just one of something, chances are it's a potential single point of failure (SPoF).  You should be able to invite someone to walk up to your system, grab any single box or cable, and remove it with no ill effect (a rough drill for this is sketched after the list below).  Ideally, this should include:

  • network cables
  • disks
  • entire computer nodes
  • power cables
  • UPS
  • switches and routers
  • network uplinks (can't be much of a service provider if you only have one ISP)
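
Here's a rough sketch of what that walk-up-and-yank drill could look like as a checklist script.  The health endpoint and the component list are placeholders, and the "removal" step is usually a human with their hands on the hardware.

    # Sketch of a "pull any one thing" drill -- component names and the health URL
    # are assumptions for illustration.
    import time
    import urllib.request

    SERVICE = "http://service.example.com/health"    # assumed service health endpoint
    COMPONENTS = [
        "network cable, node A",
        "disk 2, node B",
        "node B entirely",
        "power cable, feed 1",
        "UPS 2",
        "switch 1",
        "uplink to ISP 2",
    ]

    def service_ok():
        try:
            return urllib.request.urlopen(SERVICE, timeout=5).status == 200
        except OSError:
            return False

    for component in COMPONENTS:
        input(f"Remove {component}, then press Enter... ")
        time.sleep(10)                                # give failover a moment to settle
        assert service_ok(), f"service died without {component} -- you found a SPoF"
        input(f"Restore {component}, then press Enter... ")
        assert service_ok(), f"service never recovered after {component} came back"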

Monitoring and alerting are also key.  If a backup component silently fails, then you no longer have redundancy.  Anything that fails needs to generate an alert so that it gets fixed.  That needs to be part of the system's fault-tolerance validation checklist.
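
A minimal sketch of that kind of check, assuming you already have some probe you trust (ping, a health URL, an SNMP query) and some way to page a human; both are stand-ins here:

    # Redundancy monitor sketch: page someone the moment either half of a pair is down,
    # so a silently dead backup never lingers.  Hosts and the alert hook are assumptions.
    REDUNDANT_PAIRS = {
        "web": ["10.0.0.11", "10.0.0.12"],
        "db":  ["10.0.1.21", "10.0.1.22"],
    }

    def alert(message):
        print("PAGE THE ON-CALL:", message)           # stand-in for email/pager/chat

    def check_pairs(is_up):
        """is_up(host) -> bool is whatever probe you already trust."""
        for role, hosts in REDUNDANT_PAIRS.items():
            down = [h for h in hosts if not is_up(h)]
            if down:
                alert(f"{role}: {down} down -- redundancy lost until it's fixed")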

Isn't having two of everything a waste?  Only if the downtime, scrambling, and replacement effort that a component failure would cause cost you less than the spare does.  Parts break.  Think of it as replacing broken parts in advance.  Yes, if you're lucky enough that the parts don't break during the product lifecycle, then you've spent up to twice as much money.  But even that is only a waste if you're not creative enough to put the "working spare" parts to good use.

As a side benefit, a redundant system can be maintained without downtime, with components getting replaced or even upgraded while the system continues to operate in a degraded fashion.  This enables High Availability for the system, which we will discuss in the next section.

Thursday, May 28, 2015

Logically Distributed

"Logically distributed" simply means that nodes are capable of sharing their work.  No node is ever the only node that can perform a specific task -- other nodes must be able to come in and perform the same function.  They can either help (preferred) or take over completely.  A more practical and familiar way of looking at this is that the system should have no single point of failure (SPoF).

While this sounds kinda silly, this quality of logical distribution pretty much enables everything else we need for distributed operations.

Thursday, May 21, 2015

Geographically Distributed

Assuming we've made the step to a distributed model, it's necessary to consider where we're distributing our resources to.  There isn't much that can occupy the same space at the same time, so if we're going to have more than one of something, where should we put them?

Sometimes we want things as close to each other as possible.  With a distributed model, we often want to try to push them as far apart as practical.

Colocated Things

Performance and convenience are the main drivers that push the nodes of your system close together, or colocated.  Proximity makes communication faster, cheaper, and lower-latency.  Plus, it's easier to maintain everything if it's all in one place.  But the primary reason to consolidate and centralize resources is likely to minimize overhead.  Things that designers tend to centralize without really thinking about it too much:
  • Backups - yes, it's faster to make a local copy for disaster recovery, but you probably want to ship your backups as far away as practical if they're going to survive whatever kills your primary working copy.
  • Command HQ - everyone wants to hobnob with the big cheese, to the detriment of satellite offices.  But when the goal is to have clear and authoritative leadership and top-down communications, people haven't really figured out how to do it better yet.
  • Inventory and maintenance - buses and rail systems tend to have a single station that consolidates spares and specialists for repairs.
  • Databases and storage - network-attached storage is placed into large banks for centralized management and provisioning.  Stateful databases also tend to be the most performance and security critical element of an information system, so we try to keep them locked away in a secure central facility for compliance with various laws.
  • Network Switches - ironically, the major piece of equipment that makes distributed operations possible also tends to get consolidated into huge backplanes in a central switching room.  But this ensures that high network performance is always available to throw at problems with ill-defined or emerging requirements.  Throwing more bandwidth at a problem can often be a decent substitute for planning.

Dispersed Things

For distributed systems, you will often find yourself pushing nodes out as far as practical.

The network enables decentralization.  From a philosophical standpoint, the DoD has plenty of insight (yes, the DoD pays people to wax philosophical about the uses of ARPANET).  DoD white papers on Effects-Based Operations talk about how data networks can push the power to the edge, allowing the decision-making to occur where it is most needed.  Instead of long feedback and control loops where all sensors must report data to a central command for analysis and synthesis of a response to be transmitted back to the effectors, the network allows the effectors themselves to understand the situation and take the appropriate action on the spot.

With this in mind, let's consider some of the ways where geographical dispersion of system components is beneficial.
  • Backups - disasters take place on different scales.  For business continuity, the farther your backups live from your computer systems, the better they'll be able to survive progressively more catastrophic events:
    • Backup disk in your computer:  a virus or a simple power surge could corrupt both your system and your backup volume in one fell swoop
    • Backup disk next to your computer: a thief or fire sprinkler could wipe out your electronic equipment
    • Backup disk in another room: an actual fire could destroy your building
    • Backup disk in another building: a flood or earthquake could wipe out your city
    • Backup disk in another city: probably will take a cyberattack or government legal action to shut you down at this point
    • Backup disk in another country: good luck complying with all applicable export laws
    • Backup disk on another planet: what we all aspire to
  • Web servers:  sure, they're on the internet in the "cloud", so it shouldn't matter.  But studies by Amazon and other retailers have shown that server responsiveness measurably affects sales, and tenths of a second count.  The speed of light may be fast at 300,000 km per second, but light in fiber only manages about two-thirds of that, so a single coast-to-coast round trip across the US already costs on the order of 40 milliseconds before any router hops.  That may not sound like much, but add in the multiple round trips for connection and encryption handshakes, the data transfer times, and the backend API and database calls, each with their own delays, and you're easily counting web interaction response time in tenths of a second or even whole seconds (see the back-of-the-envelope numbers after this list).  For reference, video gaming lag becomes painfully noticeable above about 0.2 seconds, and 2-way human voice conversation becomes tortured above just 0.5 seconds, with people mistakenly talking over each other.  So with internet services, the extra gains in responsiveness from locating your servers geographically close to your customers are measurable and significant.
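
For the curious, here's the back-of-the-envelope arithmetic; the distance, fiber speed, and handshake count are round approximations, not measurements.

    # Rough latency budget for one coast-to-coast web request (all figures approximate).
    distance_km = 4000                    # roughly New York to Los Angeles
    fiber_kms = 200_000                   # light in fiber moves at about 2/3 of c (km/s)
    rtt = 2 * distance_km / fiber_kms     # one round trip ~ 0.04 s, before router hops

    round_trips = 4                       # TCP handshake (1) + TLS handshake (2) + the request (1)
    print(f"single round trip : {rtt * 1000:.0f} ms")
    print(f"{round_trips} round trips     : {round_trips * rtt * 1000:.0f} ms, before server")
    print("think-time, backend calls, and the next page's worth of assets")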

Monday, March 30, 2015

One

1. Distributed means "more than one".
Like the Buddha going to the hot dog stand and asking the vendor to "make me one with everything," let us contemplate upon the meaning of this title.
Now, more than ever, we live in a binary world.  Almost all digital logic can be expressed as a seemingly endless string of ones and zeros.  Computers can perform any operation and calculation imaginable in base-2.  People could be divided into the "haves" and the "have-nots."  Has the necessity for anything more become an outdated relic of the past?  A historical footnote of a simpler culture, like the aboriginal language that only had words for the concepts of none, one, and "more than one"?  Of course not.

But let us first consider the special circumstances conferred by 0 and 1.

e^(iπ) + 1 = 0
Euler's equation.  Well, one of them, anyway.  Notable for including most of the important numbers used in math.  Who would need anything more?

Is there any advantage to considering a third option for "many"?  That could add extra complexity, overhead, and waste.  And some things are impossible to duplicate.  You only live once, after all.

How can we justify investing extra energy and resources in redundancy?  Well, maybe it's not always worthwhile.  That's the first decision you need to be prepared to make right after deciding to build a product or capability -- going from none to one.  Should you go from one to many at the outset?

I would argue yes... inherently you will always be faced with a multiplicity of things that you will need to maintain throughout their life cycle, so you might as well plan for handling the many from the beginning.  It can be extremely difficult to take a system that was only designed to be a single instance and migrate it to anything else.  You will hurt yourself more in the long run by taking the short view.  It isn't terribly difficult to plan for handling the many from the onset if you have an organizational framework.  That's what this blog is all about.

But first, let us take a moment to consider what kinds of situations actually make sense for there to be only one.

We mentioned "one life".  Marley would add "one love", but the animal kingdom gives us several examples where there's an evolutionary advantage to being... flexible with that rule.

What else would be better if you only had one?  One car?  Sure, if it breaks down you can telecommute for a while or take public transit, but eventually that might take a toll on your job performance or personal time.  You'd probably end up renting a car while your only means of transportation is in the shop.  But that's like having more than one car available to you.

One house?  Yes, it doesn't make sense to purchase more than one house just in case one of them burns down or needs to be fumigated or simply loses power or some other utility for a week or so after a storm.  But chances are, if that does happen, you have friends or family nearby that you can crash with, or at least you can stay in a hotel or shower at the gym on occasion.  We share what we own in times of need; that's distributed.

One phone?  We could certainly disappear into the mountains, away from contact, for a while, but eventually our answering machine fills up, we miss bills, and we lose friends.  There's only so much time you can leave things to buffer up.  But you can certainly leave many things to buffer up for a few days while you replace a broken handset.  No financial harm done, unless you missed a job interview during that time or couldn't help a family member in need of emergency assistance.

Which brings us to computers.  Perhaps you only use your computer for entertainment, and you can only play one game or watch one movie at a time, so it doesn't make sense to have more than one, and if it breaks you just find some other pastime to keep your fun-meter pegged.  If you actually use your computer to do anything important, though, you may have found that losing it due to a disk failure can range from annoying to debilitating depending upon how much time it takes to set up a replacement.

So the common thread in all of this is that equipment loss or failure, in most cases, is a recoverable interruption in service that just means you need to spend some extra time, money, and apologies while you tinker with getting the replacement up and running again.  If you have extra time and money and reputation to spare, then by all means, do not worry too much about accounting for ever having more than one of a capability on hand.  Chances are, this won't happen during the worst possible moment.  As the old pilots' saying goes, "I'd rather be lucky than good."

There are several reasons why going the distributed route may not be necessary for your situation.  First, if you have the ability to convince your boss or customer to buy your excuse for the delay or failure, that's great!  Second, if you are the only person inconvenienced by an equipment failure, and you're willing to just grin and bear it, then you're not hurting anyone but yourself.  Third, perhaps you've poured substantial resources into this one house or commercial vehicle, and you just have to risk your livelihood on its continued reliable functioning.  If it does need maintenance, you simply drop everything and focus on getting that critical component up and running again.  Finally, perhaps you are blessed with a monopoly on some product or service.  If your ability to deliver is interrupted, your clients will just buffer up and be forced to wait, because there is no competition.  Then sure, you can go ahead and cut costs by neglecting fault tolerance or even preventative maintenance if you're not going to be losing any money in the long term.  That could work fine if you're some sort of government bureaucracy or sought-after artist, but it probably won't make you popular.

So far, we've only been talking about accounting for failure modes, which is likely only interesting for insurance actuaries.  There are plenty of other more interesting benefits for using the distributed model.  Let's engage in a few thought exercises now in order to save time and resources during crises in the future.  Then we'll be better equipped to decide if it's worth the extra complexity to consider design for distributed operation up front.

Tuesday, February 17, 2015

RoDS - Properties of Distributed Systems

What are some of the properties and capabilities enabled by distributed architectures?

This entry provides a brief outline to be expanded upon in this blog.
  1. "Distributed" means more than one.
  2. "GeographicallyDistributed" is a corollary of 1.
  3. "LogicallyDistributed" means the system is never partitioned so that a node is the only one that can perform a certain function or service part of the problem set; other nodes must be able to come in and perform the same function, which enables everything else:
  4. Redundancy and Fault Tolerance: individual nodes can fail without causing system failure. As a side benefit, the system can be maintained without downtime, with components getting replaced or even upgraded while the system continues to operate in a degraded fashion.
  5. High Availability: with redundancy, the system can continue to operate without interruption. For certain critical systems, such as air traffic control, scheduling downtime for repairs or upgrades is not cost-effective, practical, or acceptable.
  6. Load Balancing: The additional nodes providing fault tolerance are not always standing by or duplicating effort to provide redundancy. Under normal operating conditions, we want the extra nodes to be sharing the workload so the system can more effectively get its job done in parallel.
  7. Scalability: A flat organization of nodes can only grow so far. Eventually we need to provide a better means of organizing the nodes into groups to overcome various bottlenecks and other coordination limitations that prevent the system from tackling larger problem sets.
  8. Heterogeneity: The system must be able to grow and evolve gracefully over time as well, to meet availability goals. Inevitably, this means that the system will need upgrades. Rather than scheduling downtime to shut down the old system and turn on the new system, we should be able to introduce upgraded components to the new system and have them be able to interoperate... simultaneously running the older components in concert with the newer ones, gradually transitioning to newer protocols as necessary. Heterogeneity also increases the system's ability to survive total systems failure due to one fatal flaw or bug that affects all components of a homogeneous system at once (e.g., while a computer virus might take down all computers on one particular operating system, hopefully the system was designed to allow other computer clusters with different operating systems to resume those functions).
  9. Interoperability: In order to swap out and upgrade components of a system to achieve all of these worthy system features and capabilities, well-defined and standardized interoperability between components is key.

If you've read this far and you "get it", then great!  No need to go any further!
In each section I'll just be sharing some anecdotes, examples, and free association for each topic that builds upon the previous.

Rumblings on Distributed Systems


RoDS is about scalability, load balancing, and fault tolerance

But first a disclaimer: I am by no means an acknowledged expert in the field of reliability engineering.  This is merely a topic I've spent a fair amount of time reading, thinking, and practicing with, so hopefully someone might benefit from some random insight.

What are distributed systems?

Before coming to work for Boeing, I rode the crest of the 1990s tech wave at Ocean City, working at a supercomputing cluster start-up. We designed high-availability, fault-tolerant, scalable computing systems with a rather unique recursive network topology. We would often compare our reliability goals with the legendary reputation of Boeing aircraft, where double and triple redundancy would allow the plane to keep flying after multiple equipment failures. So I was kind of surprised, when I did start working for Boeing, not to find that philosophy pervasive in a lot of the work we were doing.

At the supercomputing company we would perform this demonstration where we'd start the cluster on an intensive task such as parallel raytracing. As the nodes were working, we'd walk up to the machine and pull out various components -- network cables, power supplies, entire computing blades -- and show how the system would keep on running. The rendering process might hiccup a bit, but it would keep going and come back to fill in the missing data.

A lot of my understanding and perhaps obsession with distributed systems was shaped by studying and designing for these types of computing components: RAID arrays, redundant power supplies, load balancing, etc. However, a lot of these patterns and considerations can be applied to many other fields, including products, systems, and people.

When most people mention Distributed Operations (and yes, that was the actual name of my workgroup), they generally mean it in the geographical sense, in that the network allows us to decouple people, tools, and resources from particular locations. Let's spend some time musing over some of the many other senses of distributed architectures, however.