When Cloud Platforms Fail
In October 2025, Amazon Web Services (AWS) experienced a major outage in its US East 1 region, disrupting thousands of enterprises worldwide. What began as a localized issue quickly spread across dependent services, highlighting how interconnected cloud platforms have become.
The disruption originated in a platform-level dependency: a Domain Name System (DNS) issue affecting DynamoDB, a core AWS service, triggered cascading failures across multiple platform services. The scale and duration of the outage exposed hidden single-region dependencies and underscored the importance of designing for resilience and continuously validating it through quality engineering (QE) practices, including platform testing, controlled failure scenarios, and recovery validation.
The Business Impact of Cloud Failure
The financial impact was staggering. Global losses exceeded US $1 billion, with industry benchmarks estimating US $75 million per hour in cumulative downtime. Amazon alone reportedly lost US $72 million per hour. More than 3,500 companies were affected, and over 17 million user reports were logged, making this one of the largest internet outages on record.
AWS is not an outlier. Similar incidents across Azure, Google Cloud, and CrowdStrike over the past two years reveal a recurring pattern. Business continuity risks are underestimated; centralized dependencies are poorly understood; infrastructure assurance is insufficient; and failover testing is often incomplete or entirely absent.
Structural Gaps in Cloud Platform Testing
Despite its criticality, platform-level testing is frequently sidelined for three reasons:
- Cost pressure: Testing is seen as a cost center rather than a business enabler.
- Compressed timelines: Migration deadlines push resilience validation into an uncertain future phase that rarely materializes.
- Assumed provider reliability: Blind trust in hyperscaler service level agreements (SLAs) creates a false sense of security.
The result is predictable. Quality is undervalued, even as the financial impact of outages far exceeds the investment required to prevent them.
Resilience as a QE Discipline
Preventing or containing disruptions requires resilience engineering as a core QE practice:
- Multi-region and multicloud resilience testing: Validate active-active architectures, run automated failover drills, test DNS failure scenarios, and measure recovery time and recovery point objectives (RTO/RPO) compliance under real-world conditions.
- Chaos engineering for control plane dependencies: Introduce controlled failures into DNS, identity, and database layers to observe how the platform degrades and to refine detection, failover, and recovery mechanisms.
- Platform observability and predictive analytics: Shift from reactive testing to synthetic monitoring and AI-driven anomaly detection to catch latent issues early.
- Cost of quality framework: Link resilience testing investment to quantified outage risk reduction, positioning testing as risk mitigation through metrics that resonate with finance stakeholders.
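The first two practices above can be combined in a single automated drill: inject a control plane fault, verify that traffic fails over, and record the recovery time against the RTO budget. The sketch below is a minimal, self-contained simulation; the region names, endpoint URL, and `resolve`/`call_with_failover` helpers are illustrative assumptions, not real AWS APIs, and a production drill would exercise live infrastructure instead.

```python
import time

# Hypothetical regions and endpoint pattern, for illustration only.
PRIMARY = "us-east-1"
SECONDARY = "us-west-2"

def resolve(region, dns_healthy):
    """Simulated DNS lookup: the primary region fails when DNS is unhealthy."""
    if region == PRIMARY and not dns_healthy:
        raise ConnectionError(f"DNS resolution failed for {region}")
    return f"https://service.{region}.example.com"

def call_with_failover(dns_healthy):
    """Try the primary region first; fall back to the secondary on failure."""
    for region in (PRIMARY, SECONDARY):
        try:
            return region, resolve(region, dns_healthy)
        except ConnectionError:
            continue
    raise RuntimeError("all regions unavailable")

def run_drill():
    """Inject a DNS fault and measure how long recovery takes (a toy RTO)."""
    start = time.monotonic()
    region, _endpoint = call_with_failover(dns_healthy=False)
    rto_seconds = time.monotonic() - start
    return region, rto_seconds

region, rto = run_drill()
print(f"failed over to {region} in {rto:.3f}s")
```

In a real pipeline, the drill's measured recovery time would be asserted against the agreed RTO so that a regression in failover behavior fails the build rather than surfacing during an outage.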
Strategic Recommendations for the Industry
Improving cloud resilience requires coordinated action across providers, architects, and development teams:
- Cloud providers: Increase transparency around architectural dependencies and move beyond single-region uptime metrics. Offer resilience-focused SLAs and mandate standardized resiliency testing supported by clear runbooks.
- Enterprise architects: Make infrastructure testing a mandatory part of cloud migration roadmaps, rather than treating it as a post-migration activity.
- Development teams: Bring resilience testing earlier in the development process by integrating end-to-end cloud infrastructure-as-code (IaC) testing into continuous integration and continuous delivery (CI/CD) pipelines.
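One way to shift IaC testing into CI/CD, as recommended above, is a pre-deployment policy check that fails the pipeline when a stack violates resilience rules. The sketch below assumes a simplified JSON representation of a stack; the resource schema, field names, and the two policies shown are hypothetical, and a real pipeline would evaluate a Terraform plan or CloudFormation template instead.

```python
import json

REQUIRED_REGIONS = 2  # illustrative policy: no single-region deployments

def validate_stack(stack_json):
    """Return a list of resilience policy violations for a (hypothetical)
    JSON stack description; an empty list means the stack passes."""
    stack = json.loads(stack_json)
    resources = stack.get("resources", [])
    violations = []
    regions = {r["region"] for r in resources}
    if len(regions) < REQUIRED_REGIONS:
        violations.append("stack is deployed to fewer than two regions")
    for r in resources:
        if r.get("type") == "database" and not r.get("cross_region_replica"):
            violations.append(f"{r['name']}: database lacks a cross-region replica")
    return violations

# A single-region stack with an unreplicated database fails both policies.
single_region = json.dumps({"resources": [
    {"name": "api", "type": "compute", "region": "us-east-1"},
    {"name": "users-db", "type": "database", "region": "us-east-1"},
]})
print(validate_stack(single_region))
```

Run as a CI step, a non-empty violation list blocks the merge, so single-region dependencies are caught at review time rather than discovered during a regional outage.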
Partners play a critical role. Infosys uses an AI-first approach for end-to-end automated validation. Working with enterprises, we embed operational readiness, infrastructure configuration validation, resilience testing, and network assurance into QE frameworks.
From Cloud Outages to Cloud Confidence
The AWS US East 1 outage was a warning, not an anomaly. As cloud adoption accelerates, resilience can no longer be treated as a secondary concern. Cloud platform validation and resilience must become core priorities within QE. They must be built through deliberate testing, continuous monitoring, and a clear understanding of failure.
Chaos testing, multi-region validation, and predictive analytics are now essential capabilities for operating at cloud scale. At Infosys, we believe resilience engineering must be embedded into QE strategies, with systems tested to survive failure so they deliver continuity as well as confidence.