Featured Posts

The New Economics of Technology Startups? I have recently been reading the book "Free: The Future of a Radical Price" by Chris Anderson.  Well I am not actually reading it as I find I do not have time for reading books any more.  These days...

Readmore

Here is my hammer. Show me your screw! Well I have been traveling out of the country a lot these past few weeks so its been a while since I posted.  I will try and do better in the future.  During my travels I had a lot of interesting discussions...

Readmore

Consideration For The Technical Implementation of an... I had a lot of questions from people after my last post on BPM and SOA about the layered SOA I proposed and whether it would be slow performance wise.  The answer I gave people was "It depends".  In...

Readmore

Why a Business Process Modeling (BPM) Approach to SOA... I was having a Twitter conversation with Brenda Michelson (@bmichelson) and Todd Biske (@toddbiske) about the tight coupling in peoples minds between BPM and SOA, and why I find that when people take a...

Readmore

Enterprise 2.0 Needs To Stop Being So Naive You know I really struggle to get excited about Enterprise 2.0.  Not because I don't think IT needs to undergo change, but because I feel that Enterprise 2.0 as we seem to be defining it, and covering...

Readmore

  • Prev
  • Next

How To Build an SOA Based, High Performance, Scalable and Reliable Twitter on Steroids

Posted on : 20-08-2009 | By : Paul Michaud | In : High Availability (HA), Service Oriented Architecture (SOA), Software Design

Comments

Over the past few days I have been having some issues with my Twitter account.  Beyond the well known pauses in the service, outages, etc there are some less known but more annoying problems with twitter search.  It turns out that many accounts don’t show up in search at all.  Therefore, if you are one of those lucky accounts, no one other than direct followers can see your tweets and no one can find you or any of your Tweets.  This makes the accounts pretty useless.  It also turns out its been a know issue with no fix for over a year other than to create a new account and tweet with that.  Well it turns out that my account was one such account which needless to say was very annoying and cost me 2 days of my time trying to figure out a viable work around.  As a result, Twitter earned the place of honor in today’s blog.

Now in the defense of Evan Williams, Biz Stone and the rest of the gang at Twitter, they find themselves in the enviable position of having a hugely successful product on their hands which has no doubt outpaced their wildest growth projections over the past few years and thus put stress on their design and everything else.  I on the other hand have the advantage of 20/20 hindsight and thus in this blog we can design Twitter on Steroids from scratch using technology that was not even available when Twitter was conceived.  I know the Team at twitter is busting their butts to keep up with their phenomenal growth and my hats of to them for their success.

So for those who have not read my Bio, I have been designing and building ultra high performance systems for the World’s Largest Banks and Stock Exchanges for about 25 years.  Just this June a couple of colleagues of mine and I designed and ran a stock exchange prototype system capable of 4.5 million transaction per second with round trip response time as low as 15 microseconds (yes that’s microseconds for multiple network hops, I/O, parsing, matching and the whole shebang, everything the NYSE does to tell you that you just bought 100 shares of IBM).  We also showed this system can scale linearly for throughput by adding hardware, was fully fault tolerant and could do dynamic load balancing if traffic at the exchange spiked.  In this design, I will be leveraging the lessons learned over that 25 years and the technologies used for the system above.

So lets dive in.

Requirements:
So what does our Twitter on Steroids need to do.  Here is my overly simplistic list of requirements (I am only going to deal with the big ones):

Functional Requirements:
The system shall allow users to create accounts.
The system must provide a means for users to submit Tweets
The System must persist those Tweets
Users shall be able to follow other users Tweets
The system shall provide a mechanism to search Tweets

Non-Functional Requirements:
The system shall be highly responsive
The system shall maintain response times even under load
The system shall be highly scalable
The system shall be highly available with 99.999% or better uptime (its doable)

Where Do We Start?

First some design principles:

  • We will use a componentized SOA design
  • The Twitter Web Site will use the same Service API that is exposed publicly
  • The System will use a Hot/Hot High Availability Model based on component replication for reliability
  • All Service Components will be implemented in a manner that ensure deterministic behaviour (Easiest way to do that, but not the only way, is to make it single threaded which is what we do for most exchange systems.  Thread context switches are expensive at speed and multithreading can result in coherency issues which Twitter seems to be suffering from based on the comments on their support site)
  • To the maximum extent possible all I/O, remote Service calls, etc will be asynchronous
  • All internal communication will be message based using multicast for efficiency

About the Technology
I don’t normally like to reference specific technologies in my blog but in this case I am going to as there are a couple which provide unique capabilities to implement this system design, and which people are probably not familiar with.  Apologies in advance for the product plug.  They are as follows:

Websphere MQ Low Latency Messaging (LLM):
LLM is a unique high performance messaging product that has some purpose built capabilities specifically designed for ultra high throughput, low latency, transactional systems.  For one it’s the fastest messaging available on the market, capable of throughput in excess of 9 million Tweets per second per connection, and latency application to application across a switch as low as 3 microseconds with Infiniband Networking and about 12 microseconds with 10Gbe.
More important than its speed for this type of application though is its unique high availability mechanisms.  LLM provides a unique mechanism that allows me to deliver messages to a primary and secondary Service Component at speed, while maintaining total order across all receivers.  In addition it provides unique mechanisms to perform failure detection, failover, state synchronization and component replication all at speed.  In exchange systems,  LLM has detected and failed over from a primary system to a backup in as little as 7 milliseconds, with no loss of messages or duplication and no system level down time even though a component failed.

Datapower XM70:
This is an appliance that was originally designed for Web Service and Web Edge Security.  This model is specifically enabled to work with LLM above.  It will allow us to expose REST or SOAP based services and convert them to message based for internal consumption.  The XM70 can also do content based routing, parsing and transformation for us on the fly at wire speed taking load of the back end Service Components.

XIV Storage:
This is a low cost storage appliance that has great throughput and reliability.  I have been able to sustain write speeds with this in excess of 5.5 Gb per second per intel box writing to it.

The rest we can use pretty commodity stuff.  The disk above can also be easily swapped for your preferred flavour,  this one just has great price performance.

What Does Twitter on Steroids Look Like?

My version of Twitter on Steroids would look like this (except I didn’t have room on the drawing to add the Account Management Service Componets or the Follower Service Components, so just imagine they are in the diagram and follow the same pattern :D ):

Twitter on Steroids

Twitter on Steroids

So let’s walk through this diagram.

  1. Firstly we are using Big IP to load balance across the Web Servers and also across the Datapower appliances.  This is pretty standard Web design no surprises.  The BIG IP could also do this to a remote backup site as well if configured correctly, where we could twin this setup for failover or load balancing.  Or we could put the Instance 2’s in the second site.  It just depends on the SLA’s you are trying to meet.  The logical design and coding would not change regardless.
  2. The Web Servers are making calls through the Datapower to the back end Services Components just like the external API calls.  This ensures consistent behaviour and reduces the need to test and maintain two API’s
  3. Datapower is converting all REST and SOAP payload into messages on top of LLM
  4. This is important.  Datapower is multicasting all messages out of the appliance using LLM’s high availability mechanisms.  It is also putting those messages on different topics based on the content of the message.  I am suggesting partitioning the incoming Tweets based on the first few letters of the Tweeter’s ID.  The first 2 letters will do to start giving us 676 topics to work with for load balancing.  We can add more topics for finer partitioning later if need be.
  5. LLM is delivering the messages throughout the systems and also providing all the reliability.  It handles NAK’s and ACK’s automatically, retransmissions, etc to asssure messages get where they need to be without any additional work by the application.
  6. Tweets are first picked up by the Tweet Capture Service Components.  Each partition subscribes to and handles a subset of the topics in order to provide load balancing.  It is possible to add an external system which monitors load per topic and dynamically changes the subscriptions to adjust load.  Also by partitioning, we can use multiple databases in parallel thus eliminating the databases as a bottleneck, throughput wise.
  7. I/O, in the Tweet Capture Service Components, is Asynchronous providing very fast response times.  We can batch write the tweets for higher throughput and because we do compoent replication using LLM,  if the primary Instance 1 fails, Instance 2 just takes over where it left off with no loss of messages or duplication.
  8. The Tweet Capture rebroadcasts (multicast) all messages to the Tweet Indexing Service Component.  These are also twinned for High Availability and Partitioned for Scalability.  The indexing component does as the name says and indexes into the tweets and stores a record in a database.  I would recommend an in memory database be used with a traditional database behind it, with bi-directional synchronization of current data between the two.  SolidDB/DB2 is one pair or possibly TimesTen/Oracle is another (but the latter pair is slower).  I/O would be batched and asynchronous again for speed.
  9. When a search request comes in, it would be routed by Datapower to the Search Service Components, which would then query the Indexing Service and receive back the matching records for each key word in the search.  A fast parallel algorithm would then be used to handle any “or” or “and” statements in the search
  10. These results would be returned to the caller via the datapower box as a response to the original service call.

So how fast would this be and how big could it scale?

Well this is just a guess based on my experience and without ever having looked at any specific search algorithms that might be used by Twitter.  Lets assume we write everything in C behind the Datapower for speed and stability and that we use 1Gbe for networking which is the slowest at about 27 microseconds per hop.  All latencies are round trip to and from the Datapower Box.

  1. I think for Tweet Capture, we could achieve round trip latency per tweet of about 50-60 microseconds with throughput per partition somewhere in the 100,000-200,000 thousand Tweets per second range if using a fast database and some solid state disk for the database log files, etc.  Even higher if a custom binary file system is used (15,000,000+ Tweets per second which have done with stock orders with similar sized messages)
  2. Similar performance is possible for the Tweet Indexing per partition to that of the Capture
  3. For Tweet Search it a bit tougher to gauge, but I woudl guess it would be about 100-150 microseconds per search depending on the algorithm used.  Throughput should also be well into the 100’s of thousands per second if in mkemory databases are used.
  4. Response times could be reduced by as much as 25 microseconds per network hop by using Infiniband networking instead of 1Gbe
  5. From a scaling poerspective,  this should be able to scale linearly by adding hardware almost without limit (only limited by the avilable network bandwidth)

Now clearly,  this is a simplified case and I am sure there are lots of design details we are missing but I think you get the idea.  A bigger, badder Twitter (or any other app for that matter) is definately possible and by using the SOA pattern, Async I/O, component replication, etc we can do this to almost anything.  So if anyone from Twitter (or any one else for that matter) wants to talk specifics or other examples feel free to leave a comment or reach me on Twitter (@techmusings) any time.

Sorry for picking on Twitter they just seemed like a good example given my struggles.  We all wish we had the “problems” that come with such a huge success.

  • Share/Bookmark

The Evolution Of Reliability and High Availability

Posted on : 16-08-2009 | By : Paul Michaud | In : Cloud Computing, High Availability (HA), Software Design, Software as a Service

Comments

Over the last few decades, the technologies we used and the approaches we took to make our systems reliable have undergone a steady evolution. In some cases the technology has just gotten more reliable through quality control at the hardware level (consider an Intel Blade today compared to my 1986 Zenith 8088 that I wrote my first automated trading programs on. Hard to believe 8MHz, 2×5.25″ floppy’s and 512K of RAM was once the best machine money could buy, short of a mainframe. AHH.. the nostalgia……NOT)

For most of the time pre mid 90’s we relied on hardware to make our systems reliable. We had mainframes for most things business critical and towards the latter part of that time, the Unix machines were starting to be taken seriously by business as well as the scientific community. Regardless of whether you used Tandem Non-Stop technology, IBM Series 3X0’s or Stratus, you relied on the hardware to be fault tolerant and to just stay up. And for the most part they did, but at great cost and with relatively poor price/performance compared to the other platforms that were becoming available. Coupled with this resilient hardware we would have typically 2 data centers (and sometime 3) with essentially identical hardware for disaster recovery. Two of these centers were usually less than 30 miles apart and the data was synchronized between them again using hardware, with technology such as EMC’s SAN replication technology. In fact a lot of systems still do this today where performance and latency in the systems response time is not critical. Although post 9/11 the SEC mandated financial firms to have their DR site 300 miles apart which means this SAN replication approach cannot be used for most new systems as it’s distance limited. Most other countries followed the SEC’s lead (Do you know how hard it is to find site’s 300 km apart in Switzerland and still be within Switzerland, because Swiss data (depending on the data type) can’t be stored or transmitted outside of Switzerland, which is something for SaaS vendors to keep in mind. Well you can’t so we cheat. Usually one in Zurich and one in Lugano which is as good as you can do.)

By the mid 90’s though we were starting to use more UNIX machines. SUN Sparc Systems, IBM R6000’s and HP-UX machines were coming on strong. Their hardware was better than a typical Intel desktop at the time but it still didn’t have the 9’s of uptime that a mainframe had. Now for stateless applications such as those that were emerging on the web, we could throw an IP sprayer or Load balancer, such as the BIG-IP product line by F5, in front of a hot-hot pair and be pretty good to go. This is still the best way to achieve HA for most stateless applications today, but I digress. So in order to assure reliability, and for this era we defined that mostly as no loss of data more so than sheer system uptime, we had to do more with software to provide that reliability.

This software augmentation centered around two primary software technologies.

  • Messaging Middleware such as IBM’s MQ, Tibco EMS and Rendevous
  • Databases such as DB2, Oracle, Sybase and Informix

Well I won’t spend to much time on how we used these technologies back 10 years ago, because to be honest it really hasn’t changed much up to today. With the messaging software, we moved from a world in which all inter-process communication happened over a raw socket, to instead using messaging middleware, which removed the burden for message reliability from the programmer. No longer did we have to implement transactional semantics in every application by hand. We could instead rely on the middleware to make sure the messages got from point A to point B. Today we use IBM MQ to handle every message for virtually every trade of US treasuries, Eurobonds, Stocks, etc in the world. We can rely on it to deliver messages of any size from one application to another, even if one of the machines goes down and doesn’t come back online for weeks, MQ ensures it gets delivered. (Hopefully, being down for weeks doesn’t actually ever happen in production, but the guys at IBM’s Hursley labs due test these things.)  Now I will say, we don’t use TIBCO, or MQ when low latency and very high throughput are required.  There is a new breed of messaging technologies out recently which are prefered and I will touch on some of them in coming articles.

With the databases, we moved all of the transactional abilities we knew and loved off the mainframes and onto the distributed platforms. In addition, the database companies implemented ways to run the databases in a cluster. This meant that if the database server failed, I would in theory, with a slight pause, fail over to the backup, with no intervention on the part of my application. Now in practice this took a few missteps to get right but today is old hat and everyone relies on the big commercial databases to be able to do this. Some of the open source ones are not so strong here as their paid for counterparts, but in time we will probably see this happen as well.

So this brings us pretty much up to today’s state of the world (or atleast a few years ago for a typical enterprise application) in a very Cliff’s Notes sort of summary. In the next article we will start a hypothetical design exercise as a way to ground the discussions going forward. This hypothetical will form the basis of the next few articles to come after it.

  • Share/Bookmark

High Availability Series: Series Outline

Posted on : 16-08-2009 | By : Paul Michaud | In : Cloud Computing, High Availability (HA), Service Oriented Architecture (SOA), Software Design, Software as a Service

Comments

With all of the talk about reliability, or lack thereof, of SaaS and Cloud based applications, I thought I would write a series on designing applications to be Resilient and Highly Available.  The series sort of started with this post “It’s Inadequate Design That Lets Systems Fail, Not Whether They Are SaaS or Deployed in The Cloud“.

As any of you who have read my Bio are aware, I have spent most of my career designing very large, high volume and high performance applications for the World’s largest financial institutions.  In these systems High Availability and Reliability is Key, as systems I have been involved in designing carry Trillions of dollars of transactions on them each day.  Also in the Financial Markets world, and down time can cost millions of dollars per minute. We have also been center stage in the evolution of technology and design best practice when it comes to performance and reliability.  We have gone from just using a robust mainframe and assuming it stays up with hot swap hardware to high performance distributed applications handling millions of transactions per second in statefull applications (much harder to make HA than stateless Web apps), where time from failure to detection and takeover by a hot standby can be as little at 7 milliseconds.

The articles which will follow in this series will represent my personal opinion on how this is done.  It is by no means the only way to do it and I am sure others will clearly have other opinions.

Topic’s will tentatively the following:

  1. The Evolution Of Reliability and High Availability
  2. Guaranteeing No Loss Of Data
  3. Designing For Disaster Recovery
  4. Designing For Maximum Uptime In A Distributed World
  5. High Availability in a High Volume Transactional Environment

Other topics will be considered based on feedback, user requests or if something just pops into my head.  So if you have a particular question or topic you would like answered just ask and if it is something I feel I can write about, I will.

We will start in the next article in the series with a brief discussion of The Evolution Of Reliability and High Availability.

  • Share/Bookmark

It’s Inadequate Design That Lets Systems Fail, Not Whether They Are SaaS or Deployed in The Cloud

Posted on : 15-08-2009 | By : Paul Michaud | In : Cloud Computing, High Availability (HA), Software Design, Software as a Service

Comments

There have been many high profile outages lately which have caught peoples attention.  These failures are being used as an argument for why critical systems should remain internal and not be deployed as SaaS or in the Cloud.  Some of these outages included Google App Engine’s performance issues in early July , Rackspace’s loss of their Dallas data center due to power failure and the fire in Seattle that took Authorize.Net offline for 12 hours to name but a few.

What amazes me is how so many people point to this and argue that this is proof for why Cloud and/or SaaS is bad and that everything should be in house.  It’s preposterous.  The fact that these systems went down with a data center failure (or otherwise) is nothing more than an argument for inadequate system design, where High Availability (HA) is concerned.  The bottom line is it takes planning, forethought and good design to make a system highly available, and most systems simply are not designed with that in mind.

The reasons for not making a system highly available are many and include the following:

  1. Naivete: People don’t believe it could happen to their system and thus choose not to put in the time, effort and cost of making a system highly available
  2. Cost: Bottom line is it costs a lot of money to make a system HA and for a lot of firms, particularly when starting out or for smaller businesses, it just not a viable option
  3. Difficulty: Its bloody hard to make a system HA.  Its one thing to ensure no data loss,  its quite another to ensure little to no down time.

For most of my career I have built systems for the World’s largest financial companies including the World’s leading Investment Banks and Stock Exchanges.  These firms take high availability very seriously as a rule, but even with their resources and decades of experience systems still go down.

Consider the London Stock Exchange (whose system I did not design), who last year had a very public outage when they were down for most of a trading day.  This was not a SaaS system or one deployed in a Cloud.  It was an internal system run by a highly reputable company whose business is based on being reliable and never losing a trade.  These exchanges, for the most part, have highly redundant systems, multiple backup data centers, design for High Availability and run fail over tests regularly, yet they still experience downtime from time to time.

The point is, failures happen, whether the system is run internally, or in the cloud.  Whether its a SaaS system or one of home grown legacy design.  The objective is to minimize those failures and the downtime associated with them.

That said,  with today’s technologies, some careful planning and good design, it is possible to build systems that should almost never go down, even in the face of a 9/11 type event, but thats a topic for another day.

  • Share/Bookmark