ML-based Anomaly Detection using AWS CloudWatch

Cloud workloads require continuous monitoring of the complex infrastructure components which can potentially experience failures of systems and applications. In such complex enterprise cloud workloads, finding, isolating, and troubleshooting issues with infrastructure resources and applications may require a reactive exercise where operations team respond to fired alarms by analyzing multiple metrics and dashboards to find the problematic resources. This approach needs extensive prior application domain knowledge, knowledge of the underlying system design which makes it difficult to differentiate between normal versus problematic behavior.

System operations can be quite cumbersome when troubleshooting applications with rapid growth based on cyclical or seasonal behavior such as requests that peak during the day and taper off at night. Other time bound peak-load patterns are difficult to monitor using static thresholds. Typical threshold-based monitoring and alerting features require lot of manual operational effort, are not cost optimal and have higher chances of errors. They may largely benefit from dynamic systems that can issue a correct early warning of unusual behavior after learning the system characteristics of normal operation, over a period.

How can these challenges be mitigated?

  • Leverage machine-learning algorithms using AWS CloudWatch to continuously analyze system and application metrics. Anomaly detection is an easily configurable out-of-the-box feature. It can analyze metrics across all AWS services monitored by CloudWatch. There are wide range of metric types include default and custom metrics. Examples of default metrics: CPUUtilization, MemoryUtilization, DiskReadOps, etc., It also offers the flexibility to configure custom Machine Learning algorithm band functions, based on application-specific thresholds.
  • Use the analysis outcome to determine a normal baseline and detect anomalies with minimal user intervention.
  • Create alarms to notify when anomalies occur, thus fully automating the proactive application monitoring with continuous learning.

Sample illustration – Configuring anomaly detection:

Sample illustration – CPU Utilization Anomaly Detection depicting normal & unusual behavior:

 

What are the value propositions?

Benefits the IT Operations team by fostering automation of application monitoring:

  • Helps identify runtime issues sooner, reducing system and application downtime
  • Adapts to metric trends, enabling the ability to monitor the dynamic nature of system and application behavior
  • Alarms can auto-adjust to situations such as time-of-day utilization peaks
  • Identifies unexpected changes that result from connectivity issues, code change deployments, database errors, and other operational issues
  • Quick remediation

By leveraging machine learning features to augment monitoring tasks, operations team can continuously learn and adapt to seasonality of workloads. The solution is not cost-prohibitive and fosters automation, thus enabling continuous optimization and operational excellence.

Looking forward to your comments.

Author Details

Mahesh Kothandaraman

Mahesh is a Solution Architect in API Microservices and Enterprise Cloud systems at Infosys Digital Experience. He helps deliver Digital transformation programs for enterprises, by leveraging cloud workloads and designing cloud-native application systems providing leadership, strategy, and technical consultation.

Leave a Comment

Your email address will not be published. Required fields are marked *