Monitoring, Logging & Remediation
Implementing and managing monitoring and alerting solutions
Exam Domain
Monitoring accounts for 20% of the exam. Master CloudWatch metrics, alarms, logs, and automated remediation.
CloudWatch Metrics
| Metric Type | Resolution | Examples |
|---|---|---|
| Standard (free) | 5-minute intervals | CPU, NetworkIn/Out, DiskReadOps |
| Detailed Monitoring | 1-minute intervals | Same metrics, higher resolution (additional charge) |
| Custom Metrics | 1-min standard, 1-sec high-res | Memory, disk space, app-specific |
| Anomaly Detection | ML-based bands | Auto-generated expected ranges |
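To make the custom-metric row concrete, here is a minimal sketch that builds the request for CloudWatch's `PutMetricData` API. The namespace `Custom/App`, the metric name, and the instance ID are placeholders invented for illustration. The function only builds a dictionary; sending it with boto3 is assumed and shown as a comment.

```python
from datetime import datetime, timezone


def build_put_metric_data(namespace, name, value, instance_id, high_resolution=False):
    """Build keyword arguments for a CloudWatch PutMetricData request.

    All names passed in are illustrative placeholders, not values from the guide.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(value),
                "Unit": "Percent",
                # 1 = high-resolution (1-second), 60 = standard (1-minute)
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


params = build_put_metric_data(
    "Custom/App", "MemoryUtilization", 72.5, "i-0123456789abcdef0", high_resolution=True
)
# Assuming boto3 is installed and credentials are configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**params)
```

The `StorageResolution` field is what separates a standard custom metric from a high-resolution one; everything else in the request is the same.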
CloudWatch Alarms
- States: OK, ALARM, INSUFFICIENT_DATA (an alarm can move between any of the three)
- Actions: SNS notification, Auto Scaling, EC2 action (stop/terminate/reboot/recover)
- Composite Alarms: Combine multiple alarms with AND/OR logic
- Evaluation periods: Number of most recent periods evaluated; Datapoints to Alarm sets how many of those must breach to trigger (M out of N)
- Treat missing data: missing (default), notBreaching, breaching, ignore
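The alarm settings above map onto parameters of the `PutMetricAlarm` API. The sketch below is one possible configuration, not a recommended one: the alarm name, the 2-out-of-3 setting, and the instance ID are assumptions made for illustration. It also shows how a composite alarm rule string could be assembled from hypothetical alarm names.

```python
def build_recover_alarm(instance_id, region, sns_topic_arn=None):
    """Build PutMetricAlarm arguments for an EC2 recover alarm (illustrative values)."""
    actions = [f"arn:aws:automate:{region}:ec2:recover"]  # EC2 recover action
    if sns_topic_arn:
        actions.append(sns_topic_arn)  # optional SNS notification
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,   # N: look at the 3 most recent periods
        "DatapointsToAlarm": 2,   # M: 2 of those 3 must breach
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": actions,
    }


def composite_rule(alarm_names, operator="AND"):
    """Join child alarm names into a composite AlarmRule expression."""
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


alarm = build_recover_alarm("i-0123456789abcdef0", "us-east-1")
rule = composite_rule(["cpu-high", "memory-high"])
# Assuming boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Here `rule` evaluates to `ALARM("cpu-high") AND ALARM("memory-high")`, which is the form the `AlarmRule` parameter of `PutCompositeAlarm` expects.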
CloudWatch Logs
Log Groups
Logical grouping of log streams with a retention policy (1 day to 10 years, or never expire)
Metric Filters
Extract metrics from log data, create alarms based on log patterns
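As a sketch of what a metric filter definition might look like, the function below builds arguments for the CloudWatch Logs `PutMetricFilter` API. The log group `/app/web`, the filter name, and the `Custom/App` namespace are hypothetical. The pattern `ERROR` is a simple term match that counts one for each matching log event.

```python
def build_error_metric_filter(log_group, namespace="Custom/App"):
    """Build PutMetricFilter arguments that count log events containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",      # hypothetical filter name
        "filterPattern": "ERROR",         # match events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": namespace,
                "metricValue": "1",       # add 1 per matching event
                "defaultValue": 0.0,      # report 0 when nothing matches
            }
        ],
    }


metric_filter = build_error_metric_filter("/app/web")
# Assuming boto3: boto3.client("logs").put_metric_filter(**metric_filter)
```

An alarm can then be created on the resulting `ErrorCount` metric in the same way as on any other metric.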
Logs Insights
Query language for interactive log analysis across log groups
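For a feel of the query language, the sketch below assembles arguments for the CloudWatch Logs `StartQuery` API with a query that counts error lines in 5-minute bins. The log group name and the time window are assumptions chosen for the example.

```python
import time


def build_insights_query(log_groups, minutes_back=60, now=None):
    """Build StartQuery arguments for a simple error-count query."""
    end = int(now if now is not None else time.time())
    query = (
        "filter @message like /ERROR/ "
        "| stats count(*) as errors by bin(5m)"
    )
    return {
        "logGroupNames": list(log_groups),
        "startTime": end - minutes_back * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


query_args = build_insights_query(["/app/web"], minutes_back=30, now=1_700_000_000)
# Assuming boto3: boto3.client("logs").start_query(**query_args) returns a queryId
# that can be polled with get_query_results.
```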
Subscription Filters
Stream log events in near real time to Lambda, Kinesis Data Streams, Amazon Data Firehose, or OpenSearch Service
Automated Remediation
- EventBridge rules → Lambda/SSM Automation for auto-remediation
- AWS Config rules with auto-remediation actions
- CloudWatch alarm actions for EC2 recovery
- Systems Manager Automation runbooks for complex workflows
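The EventBridge-to-Lambda pattern in the first bullet could be sketched as follows. The remediation policy here (restart any instance that enters the `stopped` state) is a made-up example, and the `start_instances` parameter is a hypothetical injection point so the handler can be exercised without AWS access. In a real function it would be the boto3 EC2 client's `start_instances` method.

```python
# Event pattern for an EventBridge rule: EC2 instances entering "stopped".
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def handler(event, context=None, start_instances=None):
    """Lambda-style handler that restarts an instance reported as stopped."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    if detail.get("state") != "stopped" or not instance_id:
        return {"action": "none"}
    if start_instances is not None:
        # e.g. boto3.client("ec2").start_instances
        start_instances(InstanceIds=[instance_id])
    return {"action": "start", "instance_id": instance_id}


# Local dry run with a fake client call that just records its arguments.
calls = []
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
result = handler(sample_event, start_instances=lambda **kw: calls.append(kw))
```

Keeping the AWS call injectable is a design choice for testability; the event pattern itself does the filtering in production, and the check inside the handler is only a second line of defense.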
Exam Focus Areas
- CloudWatch agent needed for memory/disk metrics (not included by default)
- Unified CloudWatch Agent replaces older Logs agent + monitoring scripts
- EC2 recover action: Recovered instance keeps the same instance ID, private and Elastic IP addresses, EBS volumes, and metadata (in-memory data is lost)
- Metric math: Combine multiple metrics in a single alarm expression
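A metric math alarm, as mentioned in the last bullet, could look like the sketch below. It builds `PutMetricAlarm` arguments that alarm on a 5XX error rate computed from two Application Load Balancer metrics. The load balancer dimension value, alarm name, and 5% threshold are illustrative assumptions.

```python
def build_error_rate_alarm(load_balancer, threshold_percent=5.0):
    """Build PutMetricAlarm arguments that alarm on a metric math expression."""

    def stat(metric_id, metric_name):
        # Input metric for the expression; ReturnData=False hides it from the alarm.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "alb-5xx-error-rate",  # hypothetical name
        "Metrics": [
            stat("errors", "HTTPCode_Target_5XX_Count"),
            stat("requests", "RequestCount"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5XX error rate (%)",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
        "EvaluationPeriods": 3,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


math_alarm = build_error_rate_alarm("app/my-alb/0123456789abcdef")
# Assuming boto3: boto3.client("cloudwatch").put_metric_alarm(**math_alarm)
```

Only the expression entry has `ReturnData` set to true, so the alarm watches the computed error rate and not either raw metric.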