In what is becoming a Holiday tradition for IT operations and security teams (SolarWinds anyone?), two recent events have added stress to an already hectic season by demonstrating the fragility and interdependence of the cloud services and open-source software at the foundation of enterprise applications and processes.

Last week's AWS ~~outage~~ Service Event™ demonstrated that even the largest and most sophisticated hyperscale operators aren't immune from mistakes that cascade into a multitude of other problems. While the AWS outage was spotty, leaving many customers only mildly inconvenienced, the vulnerability in the Log4j software library is broad (affecting millions of software and service users), deep (lodged within countless commercial and open source applications) and likely to be lengthy (with ramifications lasting well into next year). Together these events illustrate the fragility of modern IT systems and the 'digital transformation' strategies utterly dependent upon them.

AWS's demonstration of control plane dependencies

In one sense, AWS is justified in deliberately avoiding the word "outage" - it doesn't appear once in AWS's almost 2,000-word account of the incident - since many customers, i.e. those whose workloads were relatively static and didn't require any resource or configuration changes, operated normally throughout the event. However, anything that used the AWS control plane to create and manage services experienced significant delays. As AWS's post mortem puts it:

> Customers of AWS services like Amazon RDS, EMR, Workspaces would not have been able to create new resources because of the inability to launch new EC2 instances during the event. Similarly, existing Elastic Load Balancers remained healthy during the event, but the elevated API error rates and latencies for the ELB APIs resulted in increased provisioning times for new load balancers and delayed instance registration times for adding new instances to existing load balancers.

Similarly, while the Route53 DNS service - so critical to most multi-region customer applications, and deliberately designed for maximum redundancy and scalability - continued to respond to DNS queries, customers couldn't make changes, for example, to redirect a newly slow application in AWS US-East-1 to another region or cloud provider (a configuration sketch of this control plane/data plane split appears at the end of this post).

Likewise, the network congestion between AWS's internal and external networks had odd effects on seemingly unrelated services like VPC Endpoints and API Gateways. For example (emphasis added):

> AWS Lambda APIs and invocation of Lambda functions operated normally throughout the event. However, API Gateway, which is often used to invoke Lambda functions as well as an API management service for customer applications, experienced increased error rates. API Gateway servers were impacted by their inability to communicate with the internal network during the early part of this event. As a result of these errors, many API Gateway servers eventually got into a state where they needed to be replaced in order to serve requests successfully.
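To make that control plane/data plane distinction concrete, here is a minimal sketch of a Route53 failover configuration that is provisioned ahead of time, so that an incident like this one requires no control plane calls when it matters. It uses Python with boto3; the hosted zone ID, domain name, endpoint addresses, and health-check settings are hypothetical placeholders rather than anything taken from AWS's report.

```python
import boto3

# All of the calls below hit the Route53 *control plane*, so they must be made
# well before an incident. Once the records exist, failover is driven entirely
# by the *data plane* (health checks and DNS query answering), which continued
# to work during the event.
#
# Hypothetical values, for illustration only.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "app.example.com."
PRIMARY_IP = "198.51.100.10"    # us-east-1 endpoint
SECONDARY_IP = "203.0.113.10"   # endpoint in another region or provider

route53 = boto3.client("route53")

# Health check against the primary endpoint; Route53's data plane evaluates it
# continuously, independent of the control plane APIs.
health_check = route53.create_health_check(
    CallerReference="app-primary-healthcheck",
    HealthCheckConfig={
        "IPAddress": PRIMARY_IP,
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)
health_check_id = health_check["HealthCheck"]["Id"]

# Pre-provision PRIMARY and SECONDARY failover records. If the health check
# fails, Route53 starts answering queries with the secondary record on its
# own - no API call or console access is needed at incident time.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Pre-provisioned failover for app.example.com",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": PRIMARY_IP}],
                    "HealthCheckId": health_check_id,
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "SetIdentifier": "secondary-other-region",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": SECONDARY_IP}],
                },
            },
        ],
    },
)
```

The design point is that every call above touches the Route53 control plane and therefore has to happen before trouble starts; once the records exist, failover is handled by the data plane that kept answering queries throughout the event. A runbook that says "update the DNS record when us-east-1 slows down" depends on exactly the APIs that were degraded.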