Situation: The Challenge
Mid-sized technology company (250 employees) with rapidly escalating AWS costs exceeding $75K/month and growing ~15% quarterly. Limited visibility into resource utilization, significant over-provisioning, and the absence of a governance framework had led to inefficient spending without proportional business value.
Action: The Solution
Implemented comprehensive resource tagging and cost attribution system across all AWS accounts
Conducted data-driven right-sizing analysis using 30-day utilization patterns and CloudWatch metrics
Developed Python-based automated resource scheduler for development/staging environments
Migrated appropriate workloads to serverless architecture (Lambda) reducing operational overhead
Established governance policies with automated budget alerts and approval workflows
Result: Business Impact
42% cost reduction translating to $375K+ annual savings
68% faster average response times, markedly improving application performance
30+ underutilized resources eliminated through systematic optimization
Auto-scaling implemented for 12 critical workloads improving reliability
Complete cost visibility with dashboards by team, project, and environment
🛠️ How I Built This
Development approach for this cloud optimization project:
Initial Assessment: Self-led comprehensive audit of AWS environment using native tools (Cost Explorer, Trusted Advisor) combined with custom Python scripts for deeper analysis
🤝 AI-assisted debugging
Resource Scheduler: Self-coded Lambda function in Python with boto3, implementing tag-based automation for environment scheduling
Cost Analytics: Built custom CloudWatch dashboards and Python analysis scripts to identify spending patterns and optimization opportunities
Right-Sizing Engine: Developed data collection and analysis pipeline using CloudWatch metrics, with AI-assisted validation of recommendations
🤝 AI validation
Testing & Validation: Implemented comprehensive testing strategy with gradual rollout, monitoring for performance regressions
Documentation: Created runbooks and training materials for operations team
🤝 AI-assisted docs
Transparency: This project leveraged AI for debugging complex boto3 interactions, validating right-sizing recommendations, and creating comprehensive documentation. Core architecture, optimization logic, and implementation were self-developed based on AWS best practices and hands-on experience.
Project Overview
This project focused on comprehensive optimization of an enterprise AWS cloud environment to reduce costs, improve performance, and enhance resource utilization. The client was experiencing rapid growth in cloud spending without proportional business value, creating an urgent need for optimization.
Client Background
The client was a mid-sized technology company with:
Approximately 250 employees across multiple departments
A growing AWS infrastructure that had evolved organically over 3+ years
Multiple development and production environments
Monthly AWS spending exceeding $75,000 and increasing by ~15% quarterly
Performance issues with key business applications despite high cloud spending
Limited visibility into resource utilization and spending patterns
Challenges
The AWS environment presented several critical challenges:
Cost Escalation: Rapidly increasing cloud costs without clear understanding of drivers
Resource Inefficiency: Significant over-provisioning across multiple resource types
Limited Governance: No standardized tagging, resource allocation, or cost attribution
Performance Bottlenecks: Application performance issues despite high resource allocation
Architectural Fragmentation: Inconsistent approaches to similar problems across teams
Manual Scaling: Reliance on manual interventions for capacity adjustments
Solution Approach
I developed a comprehensive optimization strategy with four key pillars:
1. Assessment and Visibility
Established baseline understanding of the environment:
Implemented comprehensive resource tagging for accountability and cost attribution
Deployed cost analytics tools to identify spending patterns and anomalies
Created custom dashboards for resource utilization and performance metrics
Conducted architecture reviews to identify optimization opportunities
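The tagging and cost-attribution work above can be sketched as a simple compliance check. The required tag keys (`Team`, `Project`, `Environment`) are illustrative assumptions here, not the client's actual schema:

```python
# Minimal sketch of a tag-compliance check; the required keys are
# assumed for illustration, not the client's actual tagging schema.
REQUIRED_TAGS = {"Team", "Project", "Environment"}

def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tag dict."""
    return sorted(REQUIRED_TAGS - set(resource_tags))

# Example: an instance tagged only with Team and Environment
print(missing_tags({"Team": "payments", "Environment": "production"}))
# → ['Project']
```

In practice the tag dictionaries would come from `ec2.describe_instances()` or the Resource Groups Tagging API, with non-compliant resources surfaced in the cost dashboards.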
2. Resource Optimization
Implemented targeted resource optimizations:
Right-sized compute resources based on actual utilization patterns
Identified and consolidated underutilized database instances
Implemented storage tiering strategies for cost-efficient data management
Reviewed and optimized network traffic patterns to reduce data transfer costs
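The right-sizing step can be sketched as a classification over each instance's utilization history. The p95 statistic and the 40%/80% thresholds are illustrative tuning choices, not the exact values used in the engagement:

```python
# Hedged sketch of a right-sizing heuristic over 30 days of CPU samples
# (percent). Thresholds of 40% / 80% are assumed for illustration.
def rightsizing_action(cpu_datapoints, low=40.0, high=80.0):
    """Classify an instance as downsize / upsize / keep from its samples."""
    if not cpu_datapoints:
        return "insufficient-data"
    ordered = sorted(cpu_datapoints)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]  # simple 95th percentile
    if p95 < low:
        return "downsize"
    if p95 > high:
        return "upsize"
    return "keep"
```

Feeding this from `cloudwatch.get_metric_statistics()` for the `CPUUtilization` metric gives a first-pass recommendation list for human review.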
3. Architectural Improvements
Enhanced the architectural approach:
Migrated appropriate workloads to serverless architecture
Implemented auto-scaling for variable workloads
Designed multi-AZ deployments for critical applications
Optimized caching strategies to reduce compute requirements
Reviewed and enhanced load balancing configurations
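One way to express the auto-scaling piece is a target-tracking policy. The policy name, 50% CPU target, and metric choice below are assumptions for illustration, not the exact configuration deployed:

```python
# Illustrative target-tracking configuration for one of the variable
# workloads; the name and 50% target are assumed, tuned per workload.
scaling_policy = {
    "PolicyName": "cpu-target-tracking",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # keep average CPU near 50%
    },
}
```

A configuration like this would be applied per Auto Scaling group via `put_scaling_policy`, letting AWS add and remove capacity automatically instead of relying on manual interventions.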
4. Automation and Governance
Established ongoing management processes:
Developed automated resource scheduling for non-production environments
Created automated reporting for cost and utilization metrics
Implemented budget alerts and anomaly detection
Established clear governance policies for resource provisioning
Created standardized templates for common infrastructure needs
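The anomaly-detection piece of the governance layer can be sketched as a simple statistical check over daily spend. The three-sigma threshold is an assumed tuning choice, not the client's exact rule:

```python
import statistics

# Sketch of the anomaly check behind the budget alerts: flag a day's
# spend if it exceeds mean + sigmas * stdev of recent history.
# The 3-sigma default is an assumed tuning choice.
def is_spend_anomaly(history, today, sigmas=3.0):
    """Return True if today's cost is anomalously high vs. history."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return today > mean + sigmas * stdev
```

In production this kind of check would run on daily Cost Explorer totals and feed the automated budget alerts, alongside AWS's native anomaly-detection features.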
Implementation Process
The optimization project was executed in phases to deliver immediate value while building toward comprehensive improvements:
Phase 1: Discovery and Quick Wins
Conducted comprehensive audit of all AWS resources and spending
Implemented initial tagging strategy for resource accountability
Identified and eliminated obvious waste (unused resources, abandoned instances)
Deployed monitoring tools to establish performance baselines
Implemented initial cost dashboards and reporting
Phase 2: Systematic Optimization
Conducted detailed right-sizing analysis of compute resources
Optimized storage utilization through lifecycle policies and tiering
Enhanced database configurations for performance and cost efficiency
Reviewed and optimized data transfer patterns
Implemented reserved instances for stable workloads
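The lifecycle and tiering work above amounts to rules like the following S3 lifecycle configuration. The prefix, transition days, and expiration are assumed values for illustration:

```python
# Illustrative S3 lifecycle rule behind the storage tiering work; the
# "logs/" prefix and day counts are assumptions, not the client's values.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-and-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}
# Applied with s3.put_bucket_lifecycle_configuration(
#     Bucket=..., LifecycleConfiguration=lifecycle_config)
```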
Phase 3: Architectural Enhancements
Identified workloads suitable for serverless architecture
Migrated appropriate services to managed offerings
Implemented auto-scaling strategies for variable workloads
Enhanced caching implementations to reduce compute requirements
Optimized load balancing for improved performance and cost
Phase 4: Automation and Governance
Developed custom automation for resource management
Implemented automated environment scheduling for development resources
Created comprehensive dashboard for cost allocation and tracking
Established governance policies and resource approval workflows
Provided training for development and operations teams
Key Outcomes
Cost Reduction: 42% decrease in monthly AWS spending
Performance Improvement: 68% decrease in average response times
Resource Efficiency: Elimination of 30+ underutilized resources
Scalability: Implementation of auto-scaling for 12 critical workloads
Visibility: Comprehensive cost allocation dashboards by team, project, and environment
Governance: Clear policies and automated enforcement of best practices
Technical Highlights
The project leveraged several advanced techniques to maximize results:
Custom Resource Scheduler
Developed a Python-based resource scheduler that automatically managed non-production environments:
```python
import boto3
import json
import logging
from datetime import datetime

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """
    Automatically start/stop resources based on tags and schedules.

    Note: Lambda clocks run in UTC; adjust the window if the team
    works in another timezone.
    """
    now = datetime.now()
    current_time = now.strftime("%H:%M")
    current_day = now.strftime("%a").lower()

    # Check if current time is within working hours (8 AM - 6 PM) on weekdays
    is_working_hours = "08:00" <= current_time <= "18:00"
    is_weekday = current_day not in ("sat", "sun")
    should_be_running = is_working_hours and is_weekday

    ec2 = boto3.client('ec2')

    # Get all non-production instances opted in via the AutoSchedule tag
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoSchedule', 'Values': ['true']},
            {'Name': 'tag:Environment',
             'Values': ['development', 'staging', 'test']},
        ]
    )

    instances_to_start = []
    instances_to_stop = []

    # Determine action based on current schedule and instance state
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            state = instance['State']['Name']
            if should_be_running and state == 'stopped':
                instances_to_start.append(instance_id)
                logger.info(f"Will start instance {instance_id}")
            elif not should_be_running and state == 'running':
                instances_to_stop.append(instance_id)
                logger.info(f"Will stop instance {instance_id}")

    # Perform start/stop actions
    if instances_to_start:
        ec2.start_instances(InstanceIds=instances_to_start)
        logger.info(f"Started instances: {instances_to_start}")

    if instances_to_stop:
        ec2.stop_instances(InstanceIds=instances_to_stop)
        logger.info(f"Stopped instances: {instances_to_stop}")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'started': instances_to_start,
            'stopped': instances_to_stop,
        })
    }
```
Intelligent Right-Sizing
Implemented data-driven approach to resource sizing:
Collected detailed utilization metrics over 30-day periods
Analyzed patterns to identify peak usage and sustained requirements
Generated detailed right-sizing recommendations with estimated savings
Implemented changes with careful monitoring of performance impact
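The "estimated savings" attached to each recommendation can be sketched as a price-delta calculation. The instance types and hourly prices below are hypothetical placeholders, not current AWS rates:

```python
# Sketch of the savings estimate attached to each right-sizing
# recommendation. Prices are hypothetical placeholders, not AWS rates.
HOURLY_PRICE = {"m5.2xlarge": 0.384, "m5.xlarge": 0.192, "m5.large": 0.096}

def monthly_savings(current, recommended, hours=730):
    """Estimated monthly saving (USD) from moving current -> recommended."""
    return (HOURLY_PRICE[current] - HOURLY_PRICE[recommended]) * hours
```

Summing these estimates across all flagged instances gives the projected-savings figure that ranked which changes to roll out first.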
Hybrid Storage Strategy
Developed a comprehensive approach to storage optimization:
Analyzed data access patterns to identify appropriate storage tiers
Implemented lifecycle policies for automatic data migration
Optimized database storage through compression and index optimization
Implemented caching strategies to reduce storage access requirements
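The access-pattern analysis reduces to a tier-selection rule per object class. The day thresholds here are illustrative assumptions, chosen to mirror the lifecycle transitions above:

```python
# Hedged sketch of the tier-selection rule from the access-pattern
# analysis; the 30/90-day thresholds are assumed for illustration.
def storage_tier(days_since_last_access):
    """Map an object's access recency to a cost-appropriate S3 tier."""
    if days_since_last_access <= 30:
        return "STANDARD"
    if days_since_last_access <= 90:
        return "STANDARD_IA"
    return "GLACIER"
```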
Business Impact
The optimization project delivered significant business value:
Annual Savings: Projected $375,000+ reduction in annual cloud spending
Improved User Experience: Customer-facing applications responded markedly faster (average response times down 68%)
Increased Scalability: Enhanced ability to handle growth without proportional cost increases
Better Governance: Clear visibility into resource usage and spending by business unit
Enhanced DevOps Practices: Improved collaboration between development and operations teams
Environmental Impact: Reduced carbon footprint through more efficient resource utilization
Key Takeaways
This project demonstrated several important principles for cloud optimization:
Data-Driven Approach: The importance of metrics-based decision making
Holistic Optimization: Addressing architecture, resources, and processes together
Automation: The power of automated management to maintain efficiency
Governance: The critical role of policies and accountability in controlling costs
Balance: Finding the right equilibrium between cost, performance, and operational needs
Future Initiatives
Building on this successful optimization, several future initiatives were identified:
Implementation of machine learning-based forecasting for resource needs
Expansion of serverless architecture to additional workloads
Development of automated cost anomaly detection and remediation
Integration of cost awareness into CI/CD pipelines
Enhanced multi-region optimization for global performance and resilience