
Trading Tech Lessons from an Amazon Outage

Amazon.com went down last week. The site accounts for over half the revenue of a trillion-dollar behemoth, so let’s just say uptime is critical.

A short window of grace

The first thing we can learn from this outage is that you probably didn’t even notice it.

This is a really important concept: no setup is infallible. The total Amazon outage was about eight minutes. Compare this to Delta’s notorious days-long outages, which caused massive physical disruptions and some very hot-under-the-collar customers. The takeaway is that even in the absence of perfection there is a window, a very narrow window of grace, in which an outage has minimal impact on the business.
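To put that window of grace in numbers, here is a rough sketch of the availability arithmetic. It assumes, purely for illustration, that the eight-minute outage was the only downtime in a 365-day year and that a "days-long" outage means three days; neither figure comes from Amazon or Delta. The function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def availability_pct(downtime_minutes: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def downtime_budget_minutes(target_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime allowed in the period for a given availability target."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    return period_minutes * (1.0 - target_pct / 100.0)


# Assumption: one eight-minute outage is the only downtime all year.
eight_minutes = availability_pct(8)
# Assumption: a hypothetical three-day outage stands in for "days-long".
three_days = availability_pct(3 * 24 * 60)

print(f"8-minute outage: {eight_minutes:.4f}% available")  # 99.9985%
print(f"3-day outage:    {three_days:.2f}% available")     # 99.18%
print(f"Budget at 99.99%: {downtime_budget_minutes(99.99):.2f} minutes per year")
```

Under these assumptions, eight minutes fits comfortably inside a "four nines" budget of roughly 52 minutes a year, while a multi-day outage is not even close.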

Cloud just lapped on-premise

The second thing we can learn is that this is probably a milestone. The cloud has your on-premise beat. Hands down. No contest. Hit the showers. By cloud, I mean Amazon Web Services (AWS), Azure, and maybe Google. These cloud providers have invested so much in reliability tooling that duplicating it on-premise is becoming inconceivable. How did Amazon recover so fast? Well, obviously it’s AWS. But they are likely using their own tools like Elastic Beanstalk (EB), which reduces downtime through redundancy. For example, EB can spread an application across multiple Availability Zones behind a load balancer and automatically replace instances that fail health checks. Surviving the loss of a whole region takes another layer, typically a second deployment in another region with DNS failover (Amazon Route 53, for instance) to redirect traffic. The mechanics of this are complex, but the takeaway is that expecting this type of reliability on-premise is plain old doubtful. Recreating it at the same cost: nearly impossible.
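The redundancy idea itself is simple even if the production mechanics are not. Below is a minimal sketch of health-check-based failover: try locations in priority order and send traffic to the first one that reports healthy. This is not how Amazon or Elastic Beanstalk implements it; the function, the location names, and the health table are all hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence


def pick_active_location(locations: Sequence[str],
                         is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy location in priority order, or None if all are down."""
    for location in locations:
        if is_healthy(location):
            return location
    return None


# Hypothetical health table; a real setup would probe an endpoint instead.
health = {"primary": False, "standby": True}

active = pick_active_location(
    ["primary", "standby"],
    lambda name: health.get(name, False),  # unknown locations count as unhealthy
)
print(active)  # standby
```

The hard part the cloud providers sell is everything around this loop: running the probes from many vantage points, keeping the standby warm and its data current, and shifting traffic without dropping sessions.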

You don’t have to move to the cloud, but you must, YOU MUST understand how these tools work. Some segments of your technical infrastructure, though not all, can live there without taking on additional security risk. Your business will fare better for it.

How?

First, here is BroadPeak’s plug for straight-through processing (STP). STP has never been more important. People are distributed now, and likely will be for a long time. “Home Internet” throws a major wrench into business as usual. If you are manually booking anything, you simply cannot guarantee the position will be updated. Takeaway: Cloud-hosted exchange connectivity is easier, safer, and more resilient than you might think.

Second, the cost of maintaining exchange connections keeps climbing. (Lots of exchange changes are afoot.) This has become a major cost center for vendors and IT teams alike. But it’s just not difficult to stand up an in-network dedicated cloud server that gets all the benefits of cloud while still being inside your firewall. Takeaway: Outsourced support gets you tons of dev time back.

Automation has never been more important, and neither has STP. If your team is curious how reliability tools like these work, we are very open about how we use them. Even if you just want to see what we bring to the table, we’d be happy to share what we know.