Cloud Chaos Engineering Practice

Chaos Engineering:

Chaos engineering is a discipline for ensuring that a distributed computing system can withstand turbulent and unexpected conditions. It borrows from chaos theory and focuses on unexpected failure scenarios in which a system behaves in seemingly random ways. Chaos experiments help find breaking points and support the following goals:

• Build confidence in the organization's systems

• Use application behavior patterns to identify vulnerabilities before intruders or system failures expose them in a production environment

• Reduce maintenance costs

Chaos Engineering on Cloud Space:

Cloud computing is the on-demand availability of computer system resources such as data storage, compute, networking, file storage, and access management. These resources are offered by cloud service providers such as AWS, Azure, and GCP.

An organization can adopt a cloud model for its development and deployment. Whichever model it chooses, cloud computing services are involved in application deployment and data maintenance.

The main focus of chaos engineering on cloud computing services is to identify unpredictable failures of systems, networks, performance, applications, databases, and so on. Because cloud computing and cloud-hosted applications are growing rapidly, a sudden failure in a system or application will affect both the organization's productivity and its costs.

Working principle of Chaos Engineering:

Chaos engineering is similar to stress testing, which is used to identify and correct system or network issues. Unlike stress testing, chaos engineering does not test and correct one component at a time.

Chaos engineering inspects problems that have a seemingly infinite number of possible causes. It looks beyond the definite or obvious issues and tests distributed systems against problems, or sets of problems, that are less likely to happen.
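As a rough illustration of testing against sets of problems rather than one component at a time, the sketch below enumerates combinations of faults and picks a few at random to schedule. The fault labels, the `fault_sets` helper, and the seed are all hypothetical names chosen for this example; no real chaos tool or cloud API is being called.

```python
import itertools
import random

# Hypothetical fault labels, for illustration only.
FAULTS = ["cpu_spike", "storage_full", "pod_kill", "network_delay"]


def fault_sets(faults, max_size):
    """Yield every combination of 1..max_size distinct faults."""
    for size in range(1, max_size + 1):
        yield from itertools.combinations(faults, size)


all_sets = list(fault_sets(FAULTS, max_size=2))
rng = random.Random(7)            # fixed seed so a run can be reproduced
plan = rng.sample(all_sets, k=3)  # pick three fault sets to try

for faults in plan:
    print(" + ".join(faults))
```

With four faults and sets of at most two, there are ten candidate sets. The number grows quickly as faults are added, which is one reason to sample the space instead of testing it exhaustively.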

The chaos engineering process is listed below:

1. Cloud Baseline Data: Start by collecting baseline resource-utilization data for EC2 and VM instances on the corresponding cloud service providers. Then widen the data collection to other cloud services: S3 buckets, Blob Storage, Cosmos DB, VPC, network topology, and cloud Kubernetes environments (AKS, EKS, GKE). The tester must identify how the system works under normal conditions and use those values to define the normal working state.

2. Create the Hypothesis: Consider the potential weaknesses of the cloud system and formulate them as hypotheses about system failure. For example, the user needs to know what could happen to the system if there is an unexpected surge in traffic or memory usage. Other examples are a sudden increase in CPU utilization on an AWS EC2 instance, an S3 bucket filling up with junk data, pods being killed in an Azure, AWS, or GCP Kubernetes environment, and network delay in accessing cloud services.

3. Test the Hypothesis: Run an experiment to gauge the consequences of a sudden surge. The experiment will surface an error in a system or process, or an unexpected cause-and-effect relationship. For example, a sudden surge in traffic should reveal the memory bottlenecks in the application layer that cause performance issues.

4. Evaluate the Hypothesis: Measure and evaluate how well the hypothesis holds up, and determine which problems to fix.

5. Repeat the Hypothesis: Once the fault or issue has been addressed, repeat the experiment. If the issue persists, scale up the cloud resources and network bandwidth, then run through the chaos engineering steps again.
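The five steps above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not a definitive implementation: `read_cpu` is a hypothetical stand-in for a real monitoring query, the load figures and the 85% threshold are invented, and "scaling up" is modelled simply as a `capacity` factor that divides the extra load from the fault.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Step 2: a falsifiable statement about behaviour under a fault."""
    description: str
    max_cpu_percent: float  # CPU must stay at or below this during the fault


def read_cpu(rng, fault_active, capacity=1):
    """Simulated metric read; a real test would query the provider's monitoring."""
    normal_load = rng.uniform(20.0, 40.0)
    fault_load = 50.0 / capacity if fault_active else 0.0
    return min(100.0, normal_load + fault_load)


def collect_baseline(rng, samples=30):
    """Step 1: record the normal working state as mean and standard deviation."""
    readings = [read_cpu(rng, fault_active=False) for _ in range(samples)]
    return {"mean": statistics.mean(readings), "stdev": statistics.pstdev(readings)}


def run_experiment(hypothesis, rng, capacity, samples=30):
    """Steps 3 and 4: inject the fault, observe, and evaluate the hypothesis."""
    readings = [read_cpu(rng, fault_active=True, capacity=capacity) for _ in range(samples)]
    worst = max(readings)
    return {"capacity": capacity, "worst": worst, "holds": worst <= hypothesis.max_cpu_percent}


rng = random.Random(42)
baseline = collect_baseline(rng)
hypothesis = Hypothesis("CPU stays at or below 85% during a CPU-stress fault", 85.0)

# Step 5: repeat, scaling up capacity while the hypothesis keeps failing.
capacity = 1
for _ in range(3):
    result = run_experiment(hypothesis, rng, capacity)
    if result["holds"]:
        break
    capacity *= 2

print(baseline)
print(result)
```

In a real experiment, the simulated reading would be replaced by a query to the cloud provider's monitoring service, and the fault would be injected by a chaos engineering tool rather than by adding a constant to the metric.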


Chaos engineering helps build confidence in an organization's systems by identifying vulnerabilities before they reach the production environment. In distributed computing, more resources result in more chaos, which should be dealt with using chaos engineering methods. With established chaos engineering tools for AWS, Azure, and GCP, Kubernetes environments can also receive broad coverage.

Author Details

Siva Sankaran

Project Manager - Cloud Infra and Network - Design and Architect
