DevOps Day: When Success is 99% Failover – How Availability Can Persist in the AWS Cloud When Network Events Also Persist in an EC2 / RDS Region

Some might refer to today as a DevOps Day… and to those who haven't figured out their failover strategy, today might seem like the day the cloud stood still. But if you're familiar with Internet service at large, you've seen it before. Network events persist, whether in the datacenter or in the Cloud… a sad hardship we face on shared networks such as the Internet. Remember that infrastructure services such as EC2 and DBMS services such as RDS are merely service layers on top of a data-center. Afraid of the Cloud or the data-center? Fear not; perhaps the biggest "cloud" to fear is the dark one powered by those who allow their computers to be compromised. If a denial-of-service attack is distributed, a provider-of-service defender should work just as hard to distribute his or her eggs across multiple baskets. Failover is a difficult concept for many applications out of the box, because it requires a great deal of redundancy and synchronization. The database is perhaps the most difficult piece of the puzzle to distribute, especially if it is a relational database. Master -> slave replication is one way to achieve not only multi-tiered horizontal scalability on demand, but also multi-regional redundancy. Take a look at the reference architecture just announced as part of a RightScale + Zend horizontal scalability solution.
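To make the master -> slave idea a little more concrete, here is a minimal read/write-splitting sketch in Python: writes always hit the master, reads are spread across replicas in other regions and fall back to the master if a replica is unreachable. This is not the RightScale + Zend reference architecture itself; the hostnames, credentials, and replica list are illustrative assumptions.

    # Minimal read/write splitting for a master -> slave MySQL setup.
    # Hostnames and credentials below are illustrative assumptions.
    import random
    import pymysql

    MASTER = {"host": "db-master.us-east-1.example.com", "user": "app",
              "password": "secret", "database": "app"}
    REPLICAS = [
        {"host": "db-slave.us-west-1.example.com", "user": "app",
         "password": "secret", "database": "app"},
        {"host": "db-slave.eu-west-1.example.com", "user": "app",
         "password": "secret", "database": "app"},
    ]

    def connect(cfg):
        return pymysql.connect(host=cfg["host"], user=cfg["user"],
                               password=cfg["password"], database=cfg["database"])

    def run_write(sql, params=()):
        # All writes go to the master so replication has a single source of truth.
        conn = connect(MASTER)
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
            conn.commit()
        finally:
            conn.close()

    def run_read(sql, params=()):
        # Reads are spread across replicas; fall back to the master if none respond.
        for cfg in random.sample(REPLICAS, len(REPLICAS)) + [MASTER]:
            try:
                conn = connect(cfg)
                try:
                    with conn.cursor() as cur:
                        cur.execute(sql, params)
                        return cur.fetchall()
                finally:
                    conn.close()
            except pymysql.err.OperationalError:
                continue
        raise RuntimeError("no database endpoint reachable")

The payoff is that the read tier scales horizontally by adding replicas, and those same replicas double as warm failover candidates in other regions.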

The separation of static content from dynamic content is a concept that leads to higher efficiency and higher availability in any Cloud environment. Replication from master to slave databases across availability zones may seem expensive, but perhaps, after today, it is less expensive than we once thought.

Now let's think about Content Distribution Networks. Static content can be cached at the edge, which provides the most availability to your end users. When people think of CDN availability, they might assume "closest geographical region to the end user"… but what if your CDN were smart enough to weigh latency and system load as metrics in its load-balancing algorithm? Do we have that? Yeah. Skeptics blame AWS / EC2 for today's hardships, but perhaps some should be thanking them for edge-caching static content worldwide. It's a saving grace for those who have their eggs scattered amongst 18 geographic edge locations.
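To make "weigh latency and system load" concrete, here is a toy scoring function a smart routing layer might use to pick a serving region. It is not any particular CDN's actual algorithm; the endpoint names, weights, and metric values are assumptions.

    # Toy selection of a serving region by weighted latency and load.
    # Endpoints, weights, and metric values are illustrative assumptions.
    def pick_endpoint(endpoints, latency_weight=0.7, load_weight=0.3):
        """endpoints: dicts with 'name', 'latency_ms', 'load' (0..1), 'healthy'."""
        candidates = [e for e in endpoints if e["healthy"]]
        if not candidates:
            raise RuntimeError("no healthy endpoints; fail over to another provider")
        def score(e):
            return latency_weight * e["latency_ms"] + load_weight * e["load"] * 100
        return min(candidates, key=score)

    endpoints = [
        {"name": "us-east-1", "latency_ms": 40, "load": 0.95, "healthy": False},
        {"name": "us-west-1", "latency_ms": 90, "load": 0.40, "healthy": True},
        {"name": "eu-west-1", "latency_ms": 140, "load": 0.20, "healthy": True},
    ]
    print(pick_endpoint(endpoints)["name"])  # -> us-west-1

The point is that "closest" loses to "closest, healthy, and not overloaded" once you feed real metrics into the decision.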

For static content, content distribution networks often have multi-region high availability built in out-of-the-box. It’s a lot easier when dealing with static content, but with some systems architecture and database management expertise, the same caching principles can also be applied to maximize reliable delivery of both static and dynamic content.

If an application provider or platform service provisioner can separate static content from persistent data, and also separate important data from not-so-important temporary / session data, then those types of data and content can be delivered from discardable instances… and failover can be achieved, and even automated, by replicating data across providers (or at least across cloud regions / availability zones). Once static content, persistent data, and temporal data have been sorted out, a redundant, meshed / multi-homed front-end server-array tier can determine (based on monitoring and availability metrics) which cloud / data-center / availability zone to distribute static and dynamic content from.
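One concrete way to keep front-end instances discardable is to push the not-so-important session / temporary data into an external, replicated store while the important persistent data lives in the replicated database tier. A minimal sketch, assuming a Redis session store; the hostname, key layout, and TTL are illustrative.

    # Keep web instances stateless by storing session data outside the instance.
    # The Redis host, key prefix, and TTL are illustrative assumptions.
    import json
    import redis

    SESSION_TTL_SECONDS = 1800  # temporary data: cheap to lose, cheap to rebuild
    session_store = redis.Redis(host="sessions.us-west-1.example.com", port=6379)

    def save_session(session_id, data):
        session_store.setex("session:" + session_id, SESSION_TTL_SECONDS,
                            json.dumps(data))

    def load_session(session_id):
        raw = session_store.get("session:" + session_id)
        return json.loads(raw) if raw else {}

With sessions externalized, any front-end instance in any zone can be terminated or replaced without users noticing, which is exactly what automated failover needs.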

I think this type of architecture can be justified not only for failover reasons; it may also be a way to achieve more rapidly elastic, more impressive server performance.

When N. Virginia gets hit hard, it may be quite a hardship, but it shouldn’t be too hard to fail over to your other region’s slave database. If Soichiro Honda is going to tell me that success is 99% failure, then in the case of distributed, edge cached, redundant web systems architecture… perhaps success is 99% failover.
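What does "fail over to your other region's slave database" actually involve? For a self-managed MySQL slave, roughly this: check replication lag, stop the slave threads, and open the node for writes. The sketch below assumes illustrative hostnames and credentials, and it is only the database half of the job; a real failover also re-points application writes and fences off the old master.

    # Promote a self-managed MySQL slave in another region to accept writes.
    # Host and credentials are illustrative; this is a sketch, not a runbook.
    import pymysql

    conn = pymysql.connect(host="db-slave.us-west-1.example.com",
                           user="admin", password="secret")
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone() or {}
        print("replication lag (seconds):", status.get("Seconds_Behind_Master"))

        # Stop applying replication events and allow writes on the promoted node.
        cur.execute("STOP SLAVE")
        cur.execute("SET GLOBAL read_only = 0")
    conn.close()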

But don’t just go throwing shuriken at network-event coordinators unless your star has more than just these two points. I think a nice third point to sharpen and cut to is the reliability of monitoring systems. It’s good to be monitoring your auto-scaling processes if you’re in a situation where you scale on demand… and you also want to monitor who is demanding the computing resources. Ideally, you’re getting alarmed before your end users are. Reflexive firewalls are a good way to go, but just having good reflexes is part of wearing the agile cat’s hat in general. If you have a fast way to report trouble to the authorities charged with ownership of a compromised node attacking your system, you’re part of the solution and get a gold star.
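In AWS terms, "getting alarmed before your end users are" can be as simple as a CloudWatch alarm on the metrics your auto-scaling and traffic depend on. A minimal boto3 sketch; the instance ID, SNS topic, and threshold are assumptions, and CloudWatch is just one of several monitoring options.

    # Alarm on sustained high CPU so operators (or auto-scaling policies) react
    # before end users notice. Identifiers and thresholds are illustrative.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_alarm(
        AlarmName="frontend-cpu-high",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )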

Conversely, unnecessary reflexive post-mortem backups en masse may have been a somewhat panicked response to the network event and a contributor to the length of this outage.

Amazon Web Services has done an excellent job (as always) of not only describing what happened and when service is expected to be restored, but what you can do to maximize availability if your service has been adversely affected by the outage. You can access their status updates via RSS feeds directly from the AWS Service Health Dashboard at status.aws.amazon.com.
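If you would rather watch those feeds programmatically than refresh the dashboard, a small polling sketch with the feedparser library might look like this. The feed URL follows the dashboard's per-service pattern but is an assumption; grab the exact link from status.aws.amazon.com.

    # Poll an AWS Service Health Dashboard RSS feed for new status updates.
    # The feed URL is an assumption; copy the exact link from the dashboard.
    import time
    import feedparser

    FEED_URL = "http://status.aws.amazon.com/rss/ec2-us-east-1.rss"
    seen = set()

    while True:
        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            key = entry.get("id") or entry.get("link")
            if key not in seen:
                seen.add(key)
                print(entry.get("published", ""), "-", entry.get("title", ""))
        time.sleep(300)  # be polite: poll every five minutes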

Here’s a copy of what AWS is saying about EC2 services in the N. Virginia region [ RSS ]:

1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.
2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.
2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.
3:20 AM PDT Delayed EC2 instance launches and EBS API error rates are recovering. We’re continuing to work towards full resolution.
4:09 AM PDT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55AM PDT, began recovering at 2:55am PDT
5:02 AM PDT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone.
6:09 AM PDT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.
6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.
7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.
8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.
11:09 AM PDT A number of people have asked us for an ETA on when we’ll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We will update the community as we have more information.
12:30 PM PDT We have observed successful new launches of EBS backed instances for the past 15 minutes in all but one of the availability zones in the US-EAST-1 Region. The team is continuing to work to recover the unavailable EBS volumes as quickly as possible.
1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
6:18 PM PDT Earlier today we shared our high level ETA for a full recovery. At this point, all Availability Zones except one have been functioning normally for the past 5 hours. We have stabilized the remaining Availability Zone, but recovery is taking longer than we originally expected. We have been working hard to add the capacity that will enable us to safely re-mirror the stuck volumes. We expect to incrementally recover stuck volumes over the coming hours, but believe it will likely be several more hours until a significant number of volumes fully recover and customers are able to create new EBS-backed instances in the affected Availability Zone. We will be providing more information here as soon as we have it.

Here are a couple of things that customers can do in the short term to work around these problems. Customers having problems contacting EC2 instances or with instances stuck shutting down/stopping can launch a replacement instance without targeting a specific Availability Zone. If you have EBS volumes stuck detaching/attaching and have taken snapshots, you can create new volumes from snapshots in one of the other Availability Zones. Customers with instances and/or volumes that appear to be unavailable should not try to recover them by rebooting, stopping, or detaching, as these actions will not currently work on resources in the affected zone.

10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It’s taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.

Notice the ENTIRE CLOUD has certainly not collapsed. AWS is providing a way to spin up instances in the many Availability Zones that are operating as usual. These unaffected zones remain highly available and can serve as failover targets, given a proper implementation of redundant server architecture.
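Following the guidance in the updates above, here is a hedged boto3 sketch of the two short-term workarounds AWS describes: launch a replacement instance without pinning an Availability Zone, and rebuild EBS volumes from snapshots in an unaffected zone. The AMI ID, snapshot ID, instance type, and zone name are placeholders.

    # Sketch of the workarounds described in the status updates above.
    # AMI ID, snapshot ID, instance type, and zone are illustrative placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # 1) Launch a replacement instance WITHOUT targeting a specific Availability
    #    Zone, so placement can avoid the impaired zone.
    ec2.run_instances(
        ImageId="ami-00000000",
        InstanceType="m1.large",
        MinCount=1,
        MaxCount=1,
    )

    # 2) If you have snapshots of stuck volumes, create fresh volumes from them
    #    in a healthy Availability Zone.
    ec2.create_volume(
        SnapshotId="snap-00000000",
        AvailabilityZone="us-east-1b",
    )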

Here’s a copy of what Amazon Web Services is saying about RDS services in the N. Virginia Region [ RSS ]:

1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.
2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.
3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.
4:03 AM PDT We are making progress on failovers for Multi AZ instances and restore access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution.
5:06 AM PDT IO latency issues have recovered in one of the two impacted Availability Zones in US-EAST-1. We continue to make progress on restoring access and resolving IO latency issues for remaining affected RDS database instances.
6:29 AM PDT We continue to work on restoring access to the affected Multi AZ instances and resolving the IO latency issues impacting RDS instances in the single availability zone.
8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in US-EAST-1 region.
10:35 AM PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.
2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.

11:42 PM PDT In line with the most recent Amazon EC2 update, we wanted to let you know that the team continues to be all-hands on deck working on the remaining database instances in the single affected Availability Zone. It’s taking us longer than we anticipated. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
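The practical takeaway on the RDS side, per the 2:35 PM PDT update above, is the same as for EC2: snapshots are your escape hatch. Here is a hedged boto3 sketch of restoring an RDS snapshot into an unaffected Availability Zone; the instance identifier, snapshot name, instance class, and zone are placeholders.

    # Restore an RDS snapshot into an Availability Zone that is not impaired.
    # Identifiers, snapshot name, instance class, and zone are placeholders.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="app-db-failover",
        DBSnapshotIdentifier="app-db-latest-snapshot",
        DBInstanceClass="db.m1.large",
        AvailabilityZone="us-east-1b",  # pick a healthy zone, or omit to let AWS choose
    )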

Again, the status updates quoted above are not direct from Amazon, but merely a copy, so please subscribe to the Amazon Service Health Dashboard for the freshest information regarding their service (which I still insist is high quality).

– Asher Bond
It’s a long way down if your head is in the CLOUD.

Posted in cloud computing, content distribution, Test-Driven DevOps Design