Technology Musings

The Evolution Of Reliability and High Availability

Posted on: 08-16-2009 in Cloud Computing, High Availability (HA), Software as a Service, Software Design

Over the last few decades, the technologies we use and the approaches we take to make our systems reliable have undergone a steady evolution. In some cases the technology has simply become more reliable through quality control at the hardware level. Consider an Intel blade today compared to the 1986 Zenith 8088 on which I wrote my first automated trading programs. It is hard to believe that 8 MHz, two 5.25″ floppy drives and 512K of RAM was once the best machine money could buy, short of a mainframe. Ahh, the nostalgia... NOT.

For most of the time before the mid-90s we relied on hardware to make our systems reliable. We had mainframes for most business-critical work and, towards the latter part of that period, Unix machines were starting to be taken seriously by business as well as the scientific community. Whether you used Tandem NonStop technology, IBM Series 3X0s or Stratus, you relied on the hardware to be fault tolerant and to just stay up. For the most part it did, but at great cost and with relatively poor price/performance compared to the other platforms that were becoming available. Coupled with this resilient hardware we would typically have two data centers (and sometimes three) with essentially identical hardware for disaster recovery. Two of these centers were usually less than 30 miles apart and the data was synchronized between them, again using hardware, with technology such as EMC’s SAN replication. In fact a lot of systems still do this today where performance and latency in the system’s response time are not critical. Post 9/11, however, the SEC mandated that financial firms place their DR site 300 miles away, which means this SAN replication approach cannot be used for most new systems because it is distance limited. Most other countries followed the SEC’s lead. (Do you know how hard it is to find sites 300 km apart in Switzerland that are both still within Switzerland? Swiss data, depending on the data type, can’t be stored or transmitted outside of Switzerland, which is something for SaaS vendors to keep in mind. Well, you can’t, so we cheat: usually one site in Zurich and one in Lugano, which is as good as you can do.)
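To put rough numbers on the distance limit, here is a back-of-the-envelope sketch. It assumes the replication is synchronous (each write waits for the remote site to acknowledge it) and that light travels through fiber at roughly 200,000 km/s, about two thirds of its speed in a vacuum. It ignores switch, protocol and storage overhead, so real figures would be higher. The function name is made up for illustration.

```python
# Rough propagation delay for a synchronous write between two sites.
# Assumption: ~200,000 km/s in fiber, i.e. 200 km per millisecond.
FIBER_KM_PER_MS = 200.0
KM_PER_MILE = 1.609344


def round_trip_ms(distance_miles):
    """Minimum round-trip time in milliseconds over a straight fiber run."""
    return 2 * distance_miles * KM_PER_MILE / FIBER_KM_PER_MS


metro_rtt = round_trip_ms(30)    # two sites less than 30 miles apart
remote_rtt = round_trip_ms(300)  # a DR site 300 miles away

print(f"30 miles:  {metro_rtt:.2f} ms per acknowledged write")
print(f"300 miles: {remote_rtt:.2f} ms per acknowledged write")
```

Under these assumptions the delay grows linearly with distance, so the 300-mile site adds ten times the wait of the 30-mile one to every acknowledged write, before any equipment overhead is counted.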

By the mid-90s, though, we were starting to use more Unix machines. Sun SPARC systems, IBM RS/6000s and HP-UX machines were coming on strong. Their hardware was better than a typical Intel desktop at the time but it still didn’t have the nines of uptime that a mainframe had. For stateless applications such as those that were emerging on the web, we could put an IP sprayer or load balancer, such as the BIG-IP product line by F5, in front of a hot-hot pair and be pretty good to go. This is still the best way to achieve HA for most stateless applications today, but I digress. So, in order to assure reliability (which for this era we defined mostly as no loss of data, more so than sheer system uptime), we had to do more with software.
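The hot-hot idea can be sketched in a few lines. This is a toy illustration, not how BIG-IP or any particular product works: the class name, node names and the `mark_down`/`mark_up` calls are all hypothetical, and a real load balancer would discover failures through its own health checks.

```python
from itertools import cycle


class HotHotBalancer:
    """Round-robin over active nodes, skipping any that are marked down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node, or raise if none are left."""
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = HotHotBalancer(["app-a", "app-b"])
first_three = [lb.pick() for _ in range(3)]    # both nodes take traffic
lb.mark_down("app-a")                          # simulate one node failing
after_failure = [lb.pick() for _ in range(3)]  # survivor takes everything
```

Because the application is stateless, it does not matter which node serves a given request, which is why losing one of the pair costs capacity but not correctness.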

This software augmentation centered around two primary software technologies.

  • Messaging middleware such as IBM’s MQ, TIBCO EMS and Rendezvous
  • Databases such as DB2, Oracle, Sybase and Informix

Well, I won’t spend too much time on how we used these technologies 10 years ago because, to be honest, it really hasn’t changed much up to today. With the messaging software, we moved from a world in which all inter-process communication happened over a raw socket to one using messaging middleware, which removed the burden of message reliability from the programmer. No longer did we have to implement transactional semantics in every application by hand. We could instead rely on the middleware to make sure the messages got from point A to point B. Today we use IBM MQ to handle every message for virtually every trade of US Treasuries, Eurobonds, stocks, etc. in the world. We can rely on it to deliver messages of any size from one application to another: even if one of the machines goes down and doesn’t come back online for weeks, MQ ensures the message gets delivered. (Hopefully, being down for weeks never actually happens in production, but the guys at IBM’s Hursley labs do test these things.) Now I will say, we don’t use TIBCO or MQ when low latency and very high throughput are required. There is a new breed of messaging technologies that are preferred in those cases, and I will touch on some of them in coming articles.
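The core trick behind that kind of guaranteed delivery is store-and-forward: persist the message first, and only delete it once the receiver has taken it. The sketch below illustrates the idea using SQLite from the Python standard library. It is not how MQ is implemented, and the class and method names are invented for the example. It uses an in-memory database so it runs anywhere; a real outbox would need a file path to survive a restart.

```python
import sqlite3


class StoreAndForwardQueue:
    """Persist each message on put; delete it only after a successful send."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def put(self, body):
        with self.db:  # commits on success, rolls back on error
            self.db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))

    def deliver(self, send):
        """Send pending messages in order; stop at the first failure."""
        delivered = 0
        rows = self.db.execute(
            "SELECT id, body FROM outbox ORDER BY id"
        ).fetchall()
        for msg_id, body in rows:
            try:
                send(body)
            except ConnectionError:
                break  # receiver is down: keep the message and retry later
            with self.db:
                self.db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
            delivered += 1
        return delivered

    def pending(self):
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]


def receiver_offline(body):
    raise ConnectionError("receiver offline")


received = []
queue = StoreAndForwardQueue(":memory:")
queue.put("trade-1")
queue.put("trade-2")
sent_while_down = queue.deliver(receiver_offline)  # nothing gets through
still_pending = queue.pending()                    # both messages are kept
sent_after_recovery = queue.deliver(received.append)
```

Note that this gives at-least-once delivery: if the sender crashed between a successful send and the delete, the message would be sent again on the next attempt, so the receiver would need to tolerate duplicates.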

With the databases, we moved all of the transactional abilities we knew and loved off the mainframes and onto the distributed platforms. In addition, the database companies implemented ways to run the databases in a cluster. This meant that if the database server failed, I would, in theory, fail over to the backup after a slight pause, with no intervention on the part of my application. In practice this took a few missteps to get right, but today it is old hat and everyone relies on the big commercial databases to be able to do this. Some of the open source ones are not as strong here as their paid-for counterparts, but in time we will probably see them catch up as well.
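As a rough illustration of what that failover looks like from the outside, here is a minimal sketch of the retry logic a cluster-aware driver layer might run on the application’s behalf. Everything in it is hypothetical: the nodes are plain Python callables standing in for database connections, and no real database’s clustering works exactly this way.

```python
class FailoverClient:
    """Send each query to the active node; on failure, switch and retry."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # callables taking a query, returning a result
        self.active = 0           # index of the node currently in use

    def execute(self, query):
        for _ in range(len(self.nodes)):
            try:
                return self.nodes[self.active](query)
            except ConnectionError:
                # The "slight pause": move to the next node and try again.
                self.active = (self.active + 1) % len(self.nodes)
        raise ConnectionError("all database nodes are unavailable")


def primary(query):
    raise ConnectionError("primary is down")


def backup(query):
    return f"backup ran: {query}"


client = FailoverClient([primary, backup])
result = client.execute("SELECT 1")  # the caller never sees the failure
```

The hard part that this sketch skips is in-flight work: blindly retrying is only safe if the failed node did not commit the transaction, which is exactly the kind of detail that took the missteps to get right.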

So this brings us pretty much up to today’s state of the world (or at least that of a few years ago for a typical enterprise application) in a very CliffsNotes sort of summary. In the next article we will start a hypothetical design exercise as a way to ground the discussions going forward. This hypothetical will form the basis of the next few articles.

Paul Michaud

Paul Michaud is a co-founder and CEO of Nebility, an enterprise solutions company. Paul has been designing and building some of the world’s largest, most scalable and highest performing applications for over 25 years. Immediately prior to Nebility, Paul was Global Executive IT Architect for Financial Services at IBM.

Copyright 2011 Technical Musings, All Rights Reserved