The Evolution Of Reliability and High Availability

Posted on : 16-08-2009 | By : Paul Michaud | In : Cloud Computing, High Availability (HA), Software Design, Software as a Service

Over the last few decades, the technologies we use and the approaches we take to make our systems reliable have undergone a steady evolution. In some cases the technology has simply become more reliable through quality control at the hardware level. Consider an Intel blade today compared to the 1986 Zenith 8088 on which I wrote my first automated trading programs. It is hard to believe that 8MHz, two 5.25″ floppies and 512K of RAM was once the best machine money could buy, short of a mainframe. Ahh, the nostalgia... NOT.

For most of the time before the mid 90’s we relied on hardware to make our systems reliable. We had mainframes for most things business critical and, towards the latter part of that time, Unix machines were starting to be taken seriously by business as well as by the scientific community. Regardless of whether you used Tandem Non-Stop technology, IBM Series 3X0s or Stratus, you relied on the hardware to be fault tolerant and to just stay up. And for the most part it did, but at great cost and with relatively poor price/performance compared to the other platforms that were becoming available. Coupled with this resilient hardware we would typically have 2 data centers (and sometimes 3) with essentially identical hardware for disaster recovery. Two of these centers were usually less than 30 miles apart and the data was synchronized between them, again using hardware, with technology such as EMC’s SAN replication. In fact a lot of systems still do this today where performance and latency in the system’s response time are not critical. Post 9/11, however, the SEC mandated that financial firms have their DR sites 300 miles apart, which means this SAN replication approach cannot be used for most new systems as it is distance limited. Most other countries followed the SEC’s lead. (Do you know how hard it is to find sites 300 km apart in Switzerland that are both still within Switzerland? Swiss data, depending on the data type, can’t be stored or transmitted outside of Switzerland, which is something for SaaS vendors to keep in mind. Well, you can’t, so we cheat: usually one site in Zurich and one in Lugano, which is as good as you can do.)

By the mid 90’s, though, we were starting to use more UNIX machines. Sun SPARC systems, IBM RS/6000s and HP-UX machines were coming on strong. Their hardware was better than a typical Intel desktop at the time but it still didn’t have the 9’s of uptime that a mainframe had. Now for stateless applications such as those that were emerging on the web, we could throw an IP sprayer or load balancer, such as the BIG-IP product line by F5, in front of a hot-hot pair and be pretty good to go. This is still the best way to achieve HA for most stateless applications today, but I digress. So, in order to assure reliability (and for this era we defined that mostly as no loss of data, more so than sheer system uptime), we had to do more with software.

This software augmentation centered around two primary software technologies.

  • Messaging Middleware such as IBM’s MQ, Tibco EMS and Rendezvous
  • Databases such as DB2, Oracle, Sybase and Informix

Well I won’t spend too much time on how we used these technologies 10 years ago, because to be honest it really hasn’t changed much up to today. With the messaging software, we moved from a world in which all inter-process communication happened over a raw socket to one using messaging middleware, which removed the burden of message reliability from the programmer. No longer did we have to implement transactional semantics in every application by hand. We could instead rely on the middleware to make sure the messages got from point A to point B. Today we use IBM MQ to handle every message for virtually every trade of US treasuries, Eurobonds, Stocks, etc in the world. We can rely on it to deliver messages of any size from one application to another. Even if one of the machines goes down and doesn’t come back online for weeks, MQ ensures the message gets delivered. (Hopefully, being down for weeks doesn’t ever actually happen in production, but the guys at IBM’s Hursley labs do test these things.) Now I will say, we don’t use TIBCO or MQ when low latency and very high throughput are required. There is a new breed of messaging technologies out recently which are preferred, and I will touch on some of them in coming articles.
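
The guarantee described above boils down to store-and-forward: the middleware holds on to a message until the receiver has actually taken it. A minimal sketch of that idea in Python follows. It is a toy, in-memory illustration only. `StoreAndForwardQueue` and its methods are hypothetical names invented for this example and are not the API of IBM MQ, TIBCO or any other product, and real middleware would persist messages to disk and wrap delivery in transactions.

```python
from collections import deque


class StoreAndForwardQueue:
    """Toy queue: a message is held until the receiver accepts it."""

    def __init__(self) -> None:
        self._pending = deque()

    def send(self, message: str) -> None:
        # Real middleware would write the message to durable storage here.
        self._pending.append(message)

    def deliver(self, receiver) -> int:
        """Hand over pending messages in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                receiver(self._pending[0])
            except ConnectionError:
                break                    # receiver is down, keep the message
            self._pending.popleft()      # drop only after a successful hand-over
            delivered += 1
        return delivered
```

The sender calls `send` and moves on. Whether the receiver is reachable now or in three weeks is the queue’s problem, which is exactly the burden that was lifted from the application programmer.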

With the databases, we moved all of the transactional abilities we knew and loved off the mainframes and onto the distributed platforms. In addition, the database companies implemented ways to run the databases in a cluster. This meant that if the database server failed, I would in theory, with a slight pause, fail over to the backup, with no intervention on the part of my application. Now in practice this took a few missteps to get right, but today it is old hat and everyone relies on the big commercial databases to be able to do this. Some of the open source ones are not as strong here as their paid-for counterparts, but in time we will probably see this happen as well.

So this brings us pretty much up to today’s state of the world (or at least that of a few years ago for a typical enterprise application) in a very Cliff’s Notes sort of summary. In the next article we will start a hypothetical design exercise as a way to ground the discussions going forward. This hypothetical will form the basis of the next few articles to come after it.

High Availability Series: Series Outline

Posted on : 16-08-2009 | By : Paul Michaud | In : Cloud Computing, High Availability (HA), Service Oriented Architecture (SOA), Software Design, Software as a Service

With all of the talk about reliability, or lack thereof, of SaaS and Cloud based applications, I thought I would write a series on designing applications to be Resilient and Highly Available. The series sort of started with this post: “It’s Inadequate Design That Lets Systems Fail, Not Whether They Are SaaS or Deployed in The Cloud”.

As any of you who have read my Bio are aware, I have spent most of my career designing very large, high volume and high performance applications for the World’s largest financial institutions. In these systems High Availability and Reliability are key, as systems I have been involved in designing carry Trillions of dollars of transactions each day. Also, in the Financial Markets world, any down time can cost millions of dollars per minute. We have also been center stage in the evolution of technology and design best practice when it comes to performance and reliability. We have gone from just using a robust mainframe with hot swap hardware and assuming it stays up, to high performance distributed applications handling millions of transactions per second in stateful applications (much harder to make HA than stateless Web apps), where the time from failure to detection and takeover by a hot standby can be as little as 7 milliseconds.

The articles which will follow in this series will represent my personal opinion on how this is done.  It is by no means the only way to do it and I am sure others will clearly have other opinions.

Topics will tentatively be the following:

  1. The Evolution Of Reliability and High Availability
  2. Guaranteeing No Loss Of Data
  3. Designing For Disaster Recovery
  4. Designing For Maximum Uptime In A Distributed World
  5. High Availability in a High Volume Transactional Environment

Other topics will be considered based on feedback, user requests or if something just pops into my head.  So if you have a particular question or topic you would like answered just ask and if it is something I feel I can write about, I will.

We will start in the next article in the series with a brief discussion of The Evolution Of Reliability and High Availability.

It’s Inadequate Design That Lets Systems Fail, Not Whether They Are SaaS or Deployed in The Cloud

Posted on : 15-08-2009 | By : Paul Michaud | In : Cloud Computing, High Availability (HA), Software Design, Software as a Service

There have been many high profile outages lately which have caught people’s attention. These failures are being used as an argument for why critical systems should remain internal and not be deployed as SaaS or in the Cloud. Some of these outages included Google App Engine’s performance issues in early July, Rackspace’s loss of their Dallas data center due to power failure, and the fire in Seattle that took Authorize.Net offline for 12 hours, to name but a few.

What amazes me is how so many people point to these outages and argue that they are proof that Cloud and/or SaaS is bad and that everything should be in house. It’s preposterous. The fact that these systems went down with a data center failure (or otherwise) is nothing more than evidence of inadequate system design where High Availability (HA) is concerned. The bottom line is it takes planning, forethought and good design to make a system highly available, and most systems simply are not designed with that in mind.

The reasons for not making a system highly available are many and include the following:

  1. Naivete: People don’t believe it could happen to their system and thus choose not to put in the time, effort and cost of making a system highly available
  2. Cost: Bottom line is it costs a lot of money to make a system HA and for a lot of firms, particularly when starting out or for smaller businesses, it is just not a viable option
  3. Difficulty: It’s bloody hard to make a system HA. It’s one thing to ensure no data loss, it’s quite another to ensure little to no down time.

For most of my career I have built systems for the World’s largest financial companies including the World’s leading Investment Banks and Stock Exchanges.  These firms take high availability very seriously as a rule, but even with their resources and decades of experience systems still go down.

Consider the London Stock Exchange (whose system I did not design), which last year had a very public outage when it was down for most of a trading day. This was not a SaaS system or one deployed in a Cloud. It was an internal system run by a highly reputable company whose business is based on being reliable and never losing a trade. These exchanges, for the most part, have highly redundant systems and multiple backup data centers, design for High Availability and run fail over tests regularly, yet they still experience downtime from time to time.

The point is, failures happen, whether the system is run internally or in the cloud, and whether it’s a SaaS system or one of home grown legacy design. The objective is to minimize those failures and the downtime associated with them.

That said, with today’s technologies, some careful planning and good design, it is possible to build systems that should almost never go down, even in the face of a 9/11 type event, but that’s a topic for another day.

The Challenges of Allowing Offline Usage in a SaaS Based System

Posted on : 14-08-2009 | By : Paul Michaud | In : Software Design, Software as a Service, The Business of SaaS

So I was reading an article this evening over at CloudAve about the latest Google Reader and how it still can’t be used offline with full features. In particular the article focuses on its inability to let you read articles offline and then flag those articles as already read, so that when you get back online Google Reader doesn’t present them to you again, which is a waste of time.

Well this got me thinking about the general challenges of making SaaS based applications usable in an offline mode.

Consider the following application scenario:

  1. You have designed a simple SaaS application to do basic contact management
  2. Users need to be able to use it offline as well as online
  3. Users may log into the application from many computers
  4. The system allows more than one person to edit any one particular contact record

Now, I specifically structured this scenario so that each of the points above layers on additional complexity which needs to be considered when designing the system. We will tackle each in turn.

1. Basic SaaS Contact Management

Well this isn’t too hard if you want a basic system (and we don’t plan for any of items 2-4, or other complexities such as data design for multi-tenancy, fine grained permissioning, etc.). All you need is a database which contains contact records, which can be read or edited either through a public API (which doesn’t necessarily need to be Web Service based) or by using a GUI of some kind which you provide (either built on the API or tightly bound to the database).

2. Using the Contact Management System Offline

Now it gets more difficult. You face the following issues:

  1. Method two of using a GUI tightly bound to the database is no longer an option, unless you want to be maintaining that data access method as well as a separate API for the offline users, with all the inherent potential for inconsistent behavior that two data access methods would imply
  2. The offline user needs to have full access to a copy of his/her relevant data on the machine that they are using offline
  3. How do you accomplish this if the potential data in question is large (say many Gigabytes just to make it fun)?
  4. Said offline data needs to be stored in a secure manner, which is searchable, editable, and more importantly synchronizable
  5. Metadata capturing the status of each record also needs to be stored along with those records
  6. The metadata needs to be compared against the database when the user gets back online in order to determine how to update or merge the data
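
The metadata in items 5 and 6 can be sketched roughly as below. This is a minimal illustration in Python and not a prescription. The field names (`base_version`, `dirty`, `deleted`) and the idea of a server-assigned integer version per record are assumptions made purely for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LocalRecord:
    """A contact record as held in the offline store (hypothetical layout)."""
    record_id: str
    data: Dict[str, str]
    base_version: int      # server version this copy was last synced from
    dirty: bool = False    # True once edited locally since the last sync
    deleted: bool = False  # tombstone, so deletions can be synced too

    def edit(self, **changes: str) -> None:
        """Apply a local edit and remember that it still needs syncing."""
        self.data.update(changes)
        self.dirty = True


def pending_changes(store: Dict[str, LocalRecord]) -> List[LocalRecord]:
    """Records that must be compared with the server on reconnect."""
    return [r for r in store.values() if r.dirty or r.deleted]
```

Keeping the version the copy was synced from, rather than just a timestamp, means the comparison on reconnect does not depend on the clocks of the user’s machines agreeing with the server.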

3. Allowing for Multiple (Potentially Offline) Computers To Be Used By Each User

Well this is not too terribly difficult once you have solved the issues in #2 above. As long as the multiple computers are all online this is in fact trivial with a SaaS application, and is in fact where SaaS shines. How often have you cursed not being able to easily sync your iPhone, MS Outlook on your Home machine and MS Outlook at Work in real time? Well with a SaaS based app, this is no problem because they all read and write the same master copy of the data, which resides on the central server. That is, as long as they are online. So what if they aren’t all online? Well, the following could arise and needs to be considered when designing your SaaS system.

  1. Assume the user has 2 computers, one at home and one at work (we will ignore the iPhone for now)
  2. Assume he edits some data at home (in offline mode) and forgets to sync it online before heading out to work
  3. Assume that when at work he edits more records (let’s assume in offline mode again just for fun), and that some of the records are the same as what he edited at home.
  4. At the end of the day he sync’s to the central server from the office machine
  5. He gets back home and wants to sync the home machine and carry on from where he left off

The problem is that the records at home have been edited so they need to be sync’ed, but some of those edits may be stale and older than what was done at work. So, how do we do it? Well we ideally need to do the following:

  1. We need to check all records on the home machine against the central SaaS system
  2. For any record which was edited on the home machine but not on the server copy, just upload it
  3. Any records which were edited at home and at work need to be merged (which is always a pain and error prone so you need some good rules and just stick to them)
  4. Any record edited on the server version and not at home should have its home copy overwritten with the central copy.
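
The four rules above amount to a small decision table per record. Here is a rough sketch in Python, assuming (this is not from any particular product) that each local record carries a `dirty` flag plus the server version it was last synced from, and that the server bumps an integer version on every accepted change. The names `Action` and `decide` are made up for the example.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "nothing"                  # unchanged on both sides
    UPLOAD = "upload"                    # edited at home only
    MERGE = "merge"                      # edited at home and on the server
    OVERWRITE_LOCAL = "overwrite_local"  # edited on the server only


def decide(local_dirty: bool, local_base_version: int, server_version: int) -> Action:
    """Classify one record by who changed it since the last sync."""
    server_changed = server_version > local_base_version
    if local_dirty and server_changed:
        return Action.MERGE
    if local_dirty:
        return Action.UPLOAD
    if server_changed:
        return Action.OVERWRITE_LOCAL
    return Action.NOTHING
```

Only the `MERGE` case needs real thought. The other three are mechanical, which is one more reason to sync often and keep the merge case rare.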

After all of that, the system will be in as clean a state as it can be and the person can continue to edit the records. It’s important to note that if the records had been sync’ed in the morning, this procedure would not have been needed, which makes the case for auto saving/syncing the data often, to minimize the potential for data collisions in cases like this.

4. What If Multiple People Can Access and Edit The Same Data

Well the issue in this case is one of concurrency, and it needs to be allowed for whether the users are online or offline. Consider Person A and Person B both editing the same contact records while offline. Both then go to sync their data, having edited some of the same records, though maybe not with the same changes. So what does your system need to do?

  1. This gets a lot worse if they submit the data simultaneously, but we will not worry about that at this time
  2. Each record for each person needs to be compared against the central data copy
  3. If the server copy has not been modified, then just upload as normal
  4. If the server copy has been modified, perform a merge in accordance with the system’s rules
  5. Sync the local copy of the data to the new refreshed server version if needed
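
Steps 2 to 5 can be sketched as a version check on the server side. The code below is a simplified Python illustration under stated assumptions: every record carries an integer version, the client sends back the version it started from, and the merge rule is pluggable. `ServerRecord`, `submit` and `server_wins` are hypothetical names, and `server_wins` is only one possible rule, not a recommendation. Like the list above, the sketch ignores truly simultaneous submissions; a real system would need to make the check and the write atomic.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Fields = Dict[str, str]


@dataclass
class ServerRecord:
    version: int
    data: Fields


def submit(store: Dict[str, ServerRecord], record_id: str, base_version: int,
           local_data: Fields, merge: Callable[[Fields, Fields], Fields]) -> ServerRecord:
    """Accept an offline edit, merging if the server copy has moved on."""
    current = store[record_id]
    if current.version == base_version:
        new_data = dict(local_data)                 # step 3: plain upload
    else:
        new_data = merge(current.data, local_data)  # step 4: apply the merge rules
    store[record_id] = ServerRecord(current.version + 1, new_data)
    return store[record_id]                         # step 5: client refreshes from this


def server_wins(server: Fields, local: Fields) -> Fields:
    """Example rule only: keep server values, add fields only the client has."""
    return {**local, **server}
```

If Person A and Person B both start from version 1, whoever syncs first gets a plain upload and the second goes through the merge. A production merge would probably also want the common base copy, so that it can tell which side actually changed a given field.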

That’s really about it. It turns out that multiple users is really no different (from each user’s perspective) than if they had edited on two machines as a single user.

So what’s the point of all this? Well the bottom line is that if you are designing a SaaS based application you had better think about all of this. It is not sufficient to assume that

  • Everything is stateless
  • No data concurrency issues will arise
  • There is only one user on an account or that can edit any particular data

As software designers we need to make sure that what we design works when these use cases are considered, which is very difficult, especially if you are a startup and trying to get product out the door in a hurry. While I am a fan of Agile programming and have used variations of it building applications for Wall Street Investment Banks, even before the term Agile existed, the idea of a minimum viable product would get you into trouble here. These types of features are hard to layer on later and you will be forced to go back and redo what you already completed, resulting in increased development cost and time to market, to say nothing of dissatisfied customers if you already shipped a version that needed to be drastically changed.

Thus my personal design mantra:

Implement only what needs implementing, but don’t design yourself into a corner.

In my experience it pays in the long run to make sure of your design upfront, even if it costs a little bit of time.  But don’t get crazy and produce a 6000 page design doc because that will be wrong too.
