AWS powers 31% of the global cloud market and runs everything from two-person startups to Netflix, Airbnb, and NASA. But with 200+ services and a near-infinite number of ways to build on it, getting your architecture right from the start matters enormously.
A poorly designed architecture leads to runaway costs, security breaches, and systems that can't scale. A well-designed one gives you predictable performance, automatic scaling, and infrastructure costs that scale proportionally with your business.
At Codazz, we've architected AWS systems for startups scaling from 0 to millions of users. This guide distills everything into actionable best practices for 2026.
The AWS Well-Architected Framework: 5 Pillars
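The Well-Architected Framework is AWS's blueprint for building production systems. AWS added a sixth pillar, Sustainability, in 2021; this guide covers the original five, which drive most day-to-day architecture decisions. Evaluate every architecture decision against them: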
Operational Excellence
- Automate everything: deployments, scaling, recovery, alerting
- Infrastructure as Code (Terraform, CDK, CloudFormation) for all resources
- Runbooks and playbooks for common operational tasks
- Continuous improvement through post-incident reviews
Security
- Implement a strong identity foundation with least-privilege IAM
- Enable traceability: CloudTrail, Config, GuardDuty, Security Hub
- Apply security at all layers: network, compute, data, application
- Automate security best practices using AWS Config Rules and SCPs
Reliability
- Automatically recover from failure using health checks and auto-scaling
- Test recovery procedures: chaos engineering, game days
- Scale horizontally to increase aggregate system availability
- Stop guessing capacity: use auto-scaling groups and serverless where possible
Performance Efficiency
- Choose the right resource type: Graviton4 for general compute, GPU for ML
- Use managed services to reduce undifferentiated heavy lifting
- Use serverless architectures to remove operational burden
- Benchmark regularly and review performance metrics quarterly
Cost Optimization
- Adopt a consumption model: pay only for what you use
- Measure overall efficiency with AWS Cost Explorer and Trusted Advisor
- Stop spending money on undifferentiated heavy lifting
- Analyze and attribute expenditure with cost allocation tags
Serverless Architecture: Lambda, API Gateway & DynamoDB
Serverless is the default recommendation for new APIs and event-driven workloads in 2026. With Lambda SnapStart sharply reducing cold starts on supported runtimes (Java, Python, and .NET) and DynamoDB on-demand pricing removing capacity planning, the cost and operational benefits are compelling.
| Service | Role | Pricing | Best For |
|---|---|---|---|
| Lambda | Compute | $0.20/1M reqs + compute | Event-driven functions, APIs |
| API Gateway HTTP API | Request routing | $1.00/1M requests | Low-cost HTTP/REST-style APIs (WebSocket APIs are a separate API Gateway type) |
| DynamoDB On-Demand | Database | ~$0.625/1M writes, ~$0.125/1M reads (us-east-1, after the November 2024 price cut) | Key-value, variable traffic |
| SQS | Message queue | $0.40/1M requests | Async processing, decoupling |
| EventBridge | Event bus | $1.00/1M events | Service-to-service events |
| Step Functions | Orchestration | $25/1M state transitions | Complex workflows |
Lambda + API Gateway: Production Pattern
```yaml
# SAM template: Lambda API with SnapStart enabled
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Parameters:
  Stage:
    Type: String
    Default: prod

Globals:
  Function:
    Timeout: 30
    MemorySize: 512
    Environment:
      Variables:
        TABLE_NAME: !Ref AppTable
        STAGE: !Ref Stage

Resources:
  ApiFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: index.handler
      # SnapStart is available for Java, Python 3.12+, and .NET 8+ runtimes, not Node.js
      Runtime: python3.12
      SnapStart:
        ApplyOn: PublishedVersions  # Restore from a snapshot to cut cold-start latency
      AutoPublishAlias: live
      Policies:
        - DynamoDBCrudPolicy:       # Least-privilege access to this table only
            TableName: !Ref AppTable
      Events:
        Api:
          Type: HttpApi
          Properties:
            Path: /api/{proxy+}
            Method: ANY
            Auth:
              Authorizer: JwtAuthorizer  # Must be defined in the HTTP API's Auth settings (not shown)

  # DynamoDB with on-demand pricing (no capacity planning)
  AppTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      TableClass: STANDARD
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
      AttributeDefinitions:
        - AttributeName: PK
          AttributeType: S
        - AttributeName: SK
          AttributeType: S
      KeySchema:
        - AttributeName: PK
          KeyType: HASH
        - AttributeName: SK
          KeyType: RANGE
```
Serverless Cost Reality Check
A serverless API handling 10 million requests/month costs approximately $12/month (Lambda) + $10/month (API Gateway) + DynamoDB usage. Compare that to $150-300/month for equivalent EC2/RDS infrastructure. At low-to-medium traffic, serverless wins on cost. At very high, sustained traffic (>100M req/month), committed EC2 with Savings Plans may be cheaper.
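The arithmetic behind that estimate can be sketched in a few lines of Python. This is a rough model, not a billing calculator: the request prices come from the table above, while the per-GB-second compute rate (x86, us-east-1), the 120 ms average duration, and the 512 MB memory size are assumptions chosen for illustration. The free tier and DynamoDB usage are ignored.

```python
# Rough monthly cost estimate for a Lambda + HTTP API workload.
# Request prices come from the article's table; the compute rate is an assumed x86 list price.

LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20     # USD
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667    # USD, assumption: x86 rate, verify for your region
HTTP_API_PRICE_PER_MILLION_REQUESTS = 1.00   # USD


def monthly_serverless_cost(requests: int, avg_duration_ms: float, memory_mb: int) -> dict:
    """Return estimated Lambda and API Gateway costs in USD, ignoring the free tier."""
    millions = requests / 1_000_000
    # Lambda bills compute as memory (GB) multiplied by execution time (seconds).
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    lambda_cost = (millions * LAMBDA_PRICE_PER_MILLION_REQUESTS
                   + gb_seconds * LAMBDA_PRICE_PER_GB_SECOND)
    api_cost = millions * HTTP_API_PRICE_PER_MILLION_REQUESTS
    return {
        "lambda": round(lambda_cost, 2),
        "api_gateway": round(api_cost, 2),
        "total": round(lambda_cost + api_cost, 2),
    }


estimate = monthly_serverless_cost(requests=10_000_000, avg_duration_ms=120, memory_mb=512)
print(estimate)
```

Under these assumptions the Lambda share comes out at about $12 and API Gateway at $10. Doubling the average duration roughly doubles the compute portion, which is why function duration and memory size matter more than request count at scale.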
Containerized Apps: ECS vs EKS
When serverless doesn't fit (long-running processes, WebSockets, CPU-intensive workloads), containers on ECS or EKS are the answer. Here's when to choose each:
| Factor | ECS (Elastic Container Service) | EKS (Elastic Kubernetes) |
|---|---|---|
| Complexity | Low — AWS-native, simpler API | High — Kubernetes expertise required |
| Control plane cost | Free | $0.10/hr per cluster (~$73/mo) |
| Ecosystem | AWS services only | Huge CNCF/Kubernetes ecosystem |
| Scaling | Service Auto Scaling (target tracking, step, scheduled) | HPA, KEDA, Karpenter |
| Best for | Startups, AWS-only shops | Multi-cloud, large teams, k8s expertise |
| Launch type | EC2 or Fargate (serverless) | EC2 or Fargate |
| Migration effort | Lower from Docker Compose | Lower from existing k8s |
ECS Fargate: Auto-Scaling Configuration
```hcl
# Terraform: ECS Fargate service behind an Application Load Balancer
resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.private_app[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 3000
  }

  deployment_controller {
    type = "ECS"
  }

  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  # Let auto scaling own the task count after the first apply
  lifecycle {
    ignore_changes = [desired_count]
  }
}

# Scalable target the policy below attaches to (capacity bounds are illustrative)
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

# Target tracking: scale on CPU utilization
resource "aws_appautoscaling_policy" "cpu_scaling" {
  name               = "cpu-auto-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 65.0 # Hold average CPU near 65%
    scale_in_cooldown  = 300  # Wait 5 min between scale-in actions
    scale_out_cooldown = 60   # Wait 1 min between scale-out actions

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```
RDS vs DynamoDB vs Aurora: Database Selection
Database selection is one of the most consequential architecture decisions. Choose based on access patterns, not familiarity. Here's a complete comparison:
| Factor | RDS PostgreSQL | DynamoDB | Aurora Serverless v2 |
|---|---|---|---|
| Type | Relational SQL | NoSQL key-value | Relational SQL (serverless) |
| Latency | 1-10ms | Single-digit ms (microseconds with DAX) | 1-10ms (resuming from auto-pause can take ~15s) |
| Scale to zero | No (min $15/mo) | Yes (on-demand) | Yes (minimum capacity of 0 ACUs with auto-pause, on supported engine versions) |
| Max throughput | Vertical scaling plus read replicas | Effectively unlimited (horizontal) | Auto-scales up to 256 ACUs on recent engine versions (128 on older ones) |
| Complex queries | Full SQL, JOINs, ACID | Limited: key-based access patterns, single-table design | Full SQL, JOINs, ACID |
| Starting cost | ~$15/mo (t4g.micro) | $0 (pay per request) | $0 compute while paused (storage billed separately) |
| Best for | Complex relational data | High-throughput, key-value | Variable traffic, SQL |
Our Database Recommendation at Codazz
- New projects with variable traffic: Aurora Serverless v2. SQL that scales to zero when idle gives you the best of both worlds.
- High-throughput key-value (gaming, sessions, carts): DynamoDB on-demand.
- Existing PostgreSQL teams with predictable load: RDS with Multi-AZ and read replicas.
- Always add: ElastiCache (Redis) as a caching layer. With a high cache hit rate it can cut database read load by 80-90%.
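The caching recommendation usually takes the form of a cache-aside read path. The sketch below is illustrative only: an in-process dictionary stands in for Redis, and `CacheAside`, `fake_db_lookup`, and the 60-second TTL are hypothetical names and values, not part of any AWS SDK.

```python
import time
from typing import Callable, Dict, List, Tuple


class CacheAside:
    """Check the cache first and fall back to the database only on a miss."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float = 60.0):
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            self.hits += 1
            return entry[1]
        # Miss or expired entry: query the database and repopulate the cache.
        self.misses += 1
        value = self._load_from_db(key)
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value


db_calls: List[str] = []


def fake_db_lookup(key: str) -> str:
    """Stand-in for a real query; records each call so the load is visible."""
    db_calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(fake_db_lookup, ttl_seconds=60)
for _ in range(10):
    cache.get("user#42")

print(f"database calls: {len(db_calls)}, hits: {cache.hits}, misses: {cache.misses}")
```

Ten reads of the same key reach the database once, which is where the load reduction comes from. The real saving depends on how often keys repeat within the TTL, and writes need an invalidation or short-TTL strategy that this sketch leaves out.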
S3 Best Practices & CloudFront CDN
S3 stores virtually unlimited data at $0.023/GB/month. CloudFront is AWS's CDN with 450+ Points of Presence globally. Together they handle static assets, user uploads, and media delivery at any scale.
S3 Storage Classes (Choose Wisely)
- Standard ($0.023/GB): frequently accessed data.
- Intelligent-Tiering: unpredictable access; automatically moves objects between tiers.
- Standard-IA ($0.0125/GB): infrequent access, with a 30-day minimum storage duration.
- Glacier Instant Retrieval ($0.004/GB): archives with millisecond retrieval.
- Glacier Deep Archive ($0.00099/GB): long-term compliance retention (7-10 years).
CloudFront: Cache Configuration
Use Origin Access Control (OAC) to restrict S3 bucket access to CloudFront only. Enable both Brotli and Gzip compression (Brotli output is typically 15-25% smaller than Gzip for text assets). Cache TTL strategy: hashed JS/CSS assets = 1 year, index.html = 60 seconds, API responses = no-cache. Use Cache Policies and Origin Request Policies for fine-grained control.
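One way to apply that TTL strategy is to set `Cache-Control` headers when objects are uploaded or responses are generated. The following is a minimal sketch: the hashed-filename pattern (eight or more hex characters before the extension) and the one-hour fallback for other files are assumptions, so adjust both to match your build tool.

```python
import re

# Assumed naming convention for fingerprinted assets, e.g. app.3f2a1b9c.js
HASHED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.(js|css)$")

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header value for a request path."""
    if path.startswith("/api/"):
        return "no-cache"  # API responses are revalidated every time
    if HASHED_ASSET.search(path):
        # The content hash changes whenever the file changes, so it is safe to cache for a year.
        return f"public, max-age={ONE_YEAR_SECONDS}, immutable"
    if path == "/" or path.endswith(".html"):
        return "public, max-age=60"  # Short TTL so new deployments show up quickly
    return "public, max-age=3600"  # Assumed fallback for everything else


for example in ["/assets/app.3f2a1b9c.js", "/index.html", "/api/users"]:
    print(example, "->", cache_control_for(example))
```

Long TTLs are only safe for fingerprinted files; index.html stays short-lived because it is the file that points browsers at the new hashes after each deploy.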
Pre-Signed URLs for User Uploads
Generate short-lived (15 min) pre-signed PUT URLs server-side. The client uploads directly to S3, bypassing your servers entirely, so uploads consume none of your server bandwidth or memory. Validate file type and size with an S3 event trigger invoking Lambda before moving the object to its final location.
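The validation step might look like the sketch below, which a Lambda handler could call with the object's metadata before copying it to its final location. The allow-list, the 10 MiB limit, and the `validate_upload` name are hypothetical. The declared content type is supplied by the client, so treat this as a first filter and inspect the file bytes for anything security-sensitive.

```python
from dataclasses import dataclass

ALLOWED_CONTENT_TYPES = {"image/jpeg", "image/png", "application/pdf"}  # hypothetical allow-list
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB limit


@dataclass
class UploadCheck:
    ok: bool
    reason: str


def validate_upload(content_type: str, size_bytes: int) -> UploadCheck:
    """Decide whether an uploaded object may be moved to its final location."""
    if content_type not in ALLOWED_CONTENT_TYPES:
        return UploadCheck(False, f"content type {content_type!r} is not allowed")
    if size_bytes <= 0:
        return UploadCheck(False, "object is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        return UploadCheck(False, "object exceeds the size limit")
    return UploadCheck(True, "accepted")


print(validate_upload("image/png", 2_000_000))
print(validate_upload("application/x-msdownload", 2_000_000))
```

Rejected objects can be deleted or moved to a quarantine prefix, so the final bucket location only ever contains files that passed the check.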
S3 Lifecycle Policies
Auto-transition objects: Standard → Standard-IA after 30 days, → Glacier after 90 days. Delete incomplete multipart uploads after 7 days (surprisingly common source of waste). Expire non-current versions after 30 days. These policies alone typically save 40-70% on storage costs.
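Expressed as configuration, those rules map onto a single lifecycle rule. The sketch below builds the rule as a Python dictionary in the shape the S3 lifecycle API expects; the rule ID and the bucket-wide empty prefix are assumptions. With boto3 installed, the result could be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`.

```python
import json


def build_lifecycle_configuration(ia_after_days: int = 30,
                                  glacier_after_days: int = 90,
                                  abort_multipart_after_days: int = 7,
                                  expire_noncurrent_after_days: int = 30) -> dict:
    """Build an S3 lifecycle configuration matching the policy described above."""
    return {
        "Rules": [
            {
                "ID": "tier-and-clean-up",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # empty prefix applies the rule to every object
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_after_days
                },
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": expire_noncurrent_after_days
                },
            }
        ]
    }


lifecycle_configuration = build_lifecycle_configuration()
print(json.dumps(lifecycle_configuration, indent=2))
```

The non-current version rule only has an effect on buckets with versioning enabled. Scoping the `Filter` to a prefix or tag is a common way to apply different schedules to logs, uploads, and backups.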
IAM Security Best Practices
Security misconfigurations are the #1 cause of cloud data breaches. Here are non-negotiable IAM practices for production AWS accounts:
1. Least Privilege IAM Roles
Every service, ECS task, and Lambda function gets its own IAM role with minimum required permissions. Never use AdministratorAccess in production. Use IAM Access Analyzer to generate least-privilege policies from actual access patterns. Review and tighten policies quarterly.
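To make the contrast concrete, the sketch below builds a policy scoped to one DynamoDB table and runs a small check that flags wildcard grants. The account ID, table name, and the `wildcard_findings` helper are hypothetical, and the check is a simple lint for illustration, not a replacement for IAM Access Analyzer.

```python
from typing import List


def table_access_policy(table_arn: str) -> dict:
    """Policy for a function that reads and writes a single DynamoDB table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AppTableReadWrite",
                "Effect": "Allow",
                "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:UpdateItem",
                    "dynamodb:DeleteItem",
                    "dynamodb:Query",
                ],
                "Resource": [table_arn, f"{table_arn}/index/*"],
            }
        ],
    }


def wildcard_findings(policy: dict) -> List[str]:
    """Flag Allow statements that grant every action or every resource."""
    findings = []
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        actions = [actions] if isinstance(actions, str) else actions
        resources = statement["Resource"]
        resources = [resources] if isinstance(resources, str) else resources
        sid = statement.get("Sid", "<no Sid>")
        findings += [f"{sid}: wildcard action {a}" for a in actions
                     if a == "*" or a.endswith(":*")]
        if "*" in resources:
            findings.append(f"{sid}: wildcard resource")
    return findings


scoped = table_access_policy("arn:aws:dynamodb:us-east-1:123456789012:table/AppTable")
admin_like = {"Version": "2012-10-17",
              "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}

print(wildcard_findings(scoped))
print(wildcard_findings(admin_like))
```

The scoped policy produces no findings, while the administrator-style policy is flagged twice. A check like this can run in CI against policy files before they are deployed.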
2. Secrets Manager — Never Hardcode
Store all secrets (database passwords, API keys, OAuth credentials) in AWS Secrets Manager. Enable automatic rotation for database credentials (RDS, Aurora natively supported). Lambda and ECS tasks retrieve secrets at runtime. Cost: $0.40/secret/month — the cheapest insurance you can buy.
3. Encryption Everywhere
S3: default encryption with SSE-S3 or SSE-KMS. RDS/Aurora: enable encryption at rest (must be set at creation). DynamoDB: encryption at rest enabled by default. EBS volumes: encrypt all volumes. Use AWS Certificate Manager (free) for TLS 1.3 on all public endpoints.
4. Multi-Account Organization Structure
Use AWS Organizations with separate accounts for: production, staging, development, shared services (DNS, monitoring), and security audit. Apply Service Control Policies (SCPs) to prevent dangerous actions (disabling CloudTrail, removing MFA). Use AWS Control Tower for automated governance.
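A guardrail for the two dangerous actions named above could look like the following sketch, which builds the SCP document as a Python dictionary. The statement IDs are hypothetical and the action list is a starting point under that assumption, not a complete baseline.

```python
import json


def guardrail_scp() -> dict:
    """Service control policy that blocks tampering with CloudTrail and MFA devices."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProtectCloudTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
            {
                "Sid": "ProtectMfaDevices",
                "Effect": "Deny",
                "Action": ["iam:DeactivateMFADevice", "iam:DeleteVirtualMFADevice"],
                "Resource": "*",
            },
        ],
    }


scp = guardrail_scp()
print(json.dumps(scp, indent=2))
```

SCPs only restrict what member accounts can do and never grant permissions, so a Deny here applies even to administrators in those accounts. Test new SCPs on a sandbox organizational unit first, because an overly broad Deny can lock teams out of routine operations.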
5. Threat Detection & Compliance
Enable GuardDuty in all regions ($3-30/month depending on usage) for ML-powered threat detection. Use Security Hub to aggregate findings. Enable AWS Config with managed rules for continuous compliance checks. Set up CloudWatch Alarms for root account usage, unauthorized API calls, and MFA failures.
Cost Optimization: Reserved vs Spot Instances
Most teams overspend on AWS by 30-50%. These are the highest-ROI cost reduction levers, in order of impact:
| Pricing Model | Discount vs On-Demand | Commitment | Best For |
|---|---|---|---|
| On-Demand | 0% | None | Dev, testing, variable workloads |
| Savings Plans (1yr) | ~40% | 1 year spend | Predictable compute (EC2, Fargate, Lambda) |
| Savings Plans (3yr) | ~60% | 3 year spend | Stable long-term workloads |
| Reserved Instances (1yr) | ~40% | 1 year + instance family | Specific instance types, RDS |
| Reserved Instances (3yr) | ~72% | 3 year + instance family | Locked-in, stable databases |
| Spot Instances | 60-90% | None; instances can be interrupted with a 2-minute notice | Batch, CI/CD, fault-tolerant workers |
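To see how those discounts play out on a real bill, the sketch below applies the approximate rates from the table to a monthly on-demand baseline. Actual discounts vary by instance family, region, and payment option, and the `blended_cost` helper with its 70% coverage example is illustrative only.

```python
# Approximate discounts taken from the table above.
DISCOUNTS = {
    "on_demand": 0.00,
    "savings_plan_1yr": 0.40,
    "savings_plan_3yr": 0.60,
    "reserved_3yr": 0.72,
}


def monthly_cost(on_demand_monthly: float, model: str) -> float:
    """Cost if the whole baseline is covered by one pricing model."""
    return round(on_demand_monthly * (1 - DISCOUNTS[model]), 2)


def blended_cost(on_demand_monthly: float, committed_share: float, model: str) -> float:
    """Cost when only part of the baseline is covered by a commitment."""
    committed = on_demand_monthly * committed_share * (1 - DISCOUNTS[model])
    uncovered = on_demand_monthly * (1 - committed_share)
    return round(committed + uncovered, 2)


baseline = 1000.0  # USD per month at on-demand rates
print(monthly_cost(baseline, "reserved_3yr"))
print(blended_cost(baseline, committed_share=0.7, model="savings_plan_1yr"))
```

Covering 70% of a $1,000 baseline with a 1-year Savings Plan brings the bill to about $720. Committing to the steady floor of usage and leaving peaks on demand limits the risk of paying for capacity you stop needing.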
Right-Sizing (Do This First)
AWS Compute Optimizer analyzes 14 days of CloudWatch metrics and recommends right-sized instances. Most teams run 2-4x more compute than needed. Downsizing is the highest-impact, lowest-risk cost action. Typical savings: 30-50% of compute costs.
Spot for CI/CD and Dev Environments
Your GitHub Actions runners, Jenkins agents, and dev environments don't need guaranteed availability. Use Spot instances with a mixed strategy (On-Demand + Spot) to maintain availability while cutting costs 60-90%. ECS capacity providers make this straightforward.
NAT Gateway Elimination
NAT Gateways cost $0.045/GB processed — often the #1 surprise on AWS bills. Add S3 and DynamoDB gateway endpoints (free) to route traffic directly. Add interface endpoints for ECR, Secrets Manager, and CloudWatch to reduce NAT data. Typical savings: $100-2,000+/month.
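A quick estimate shows why the endpoints pay off. The sketch below uses only the per-GB processing price quoted above; the traffic volume and the share going to S3 and DynamoDB are made-up inputs, and hourly NAT Gateway charges and interface endpoint fees are left out.

```python
NAT_DATA_PRICE_PER_GB = 0.045  # USD per GB processed, from the text above


def nat_data_cost(gb_processed: float) -> float:
    """Monthly NAT Gateway data-processing charge for a given traffic volume."""
    return round(gb_processed * NAT_DATA_PRICE_PER_GB, 2)


def savings_from_gateway_endpoints(total_gb: float, share_to_s3_and_dynamodb: float) -> float:
    """Processing charges avoided by sending S3/DynamoDB traffic through free gateway endpoints."""
    return nat_data_cost(total_gb * share_to_s3_and_dynamodb)


monthly_gb = 20_000  # hypothetical traffic through the NAT Gateway
print(nat_data_cost(monthly_gb))
print(savings_from_gateway_endpoints(monthly_gb, share_to_s3_and_dynamodb=0.6))
```

With 20 TB a month through the NAT and 60% of it bound for S3 or DynamoDB, gateway endpoints would remove about $540 of a $900 processing charge. VPC Flow Logs or Cost Explorer usage types can tell you what your real share is.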
S3 Intelligent-Tiering
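Enable S3 Intelligent-Tiering for buckets whose objects you access unpredictably. AWS automatically moves objects between the Frequent Access and Infrequent Access tiers, shifting an object to Infrequent Access after 30 consecutive days without access. There are no retrieval fees. The monitoring charge is $0.0025 per 1,000 objects, so the savings are strongest for larger objects; objects under 128 KB are not auto-tiered.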
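Whether the monitoring charge is worth paying depends mostly on object size. The sketch below nets the storage saving against the fee, assuming the Infrequent Access tier is priced like Standard-IA ($0.0125/GB) and that a fixed share of objects sits in that tier; both are simplifications.

```python
STANDARD_PRICE_PER_GB = 0.023              # USD per GB-month, Frequent Access tier
INFREQUENT_PRICE_PER_GB = 0.0125           # USD per GB-month, assumed equal to Standard-IA
MONITORING_FEE_PER_OBJECT = 0.0025 / 1000  # USD per object-month


def monthly_net_saving(object_count: int, avg_object_mb: float, infrequent_share: float) -> float:
    """Storage saved by tiering minus the monitoring fee, in USD per month."""
    infrequent_gb = object_count * infrequent_share * avg_object_mb / 1024
    storage_saving = infrequent_gb * (STANDARD_PRICE_PER_GB - INFREQUENT_PRICE_PER_GB)
    monitoring_fee = object_count * MONITORING_FEE_PER_OBJECT
    return round(storage_saving - monitoring_fee, 2)


# One million objects, 70% of them untouched for 30+ days.
print(monthly_net_saving(1_000_000, avg_object_mb=5.0, infrequent_share=0.7))
print(monthly_net_saving(1_000_000, avg_object_mb=0.2, infrequent_share=0.7))
```

With 5 MB objects the bucket comes out ahead by roughly $33 a month, while with 200 KB objects the monitoring fee outweighs the saving. Buckets dominated by small objects are usually better served by plain Standard storage or lifecycle rules.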
Multi-Region Architecture & Disaster Recovery
Multi-region architecture protects against regional AWS outages (rare but catastrophic). It also reduces latency for globally distributed users. Here are the four DR strategies, ranked by cost and recovery capability:
Backup & Restore
Periodic backups copied to a second region, for example with S3 Cross-Region Replication. Restore from backups after a disaster. Lowest cost, longest recovery time. Good for non-critical systems.
Pilot Light
Minimal secondary region footprint: database replicas, no compute. Scale up compute on failover. Moderate cost with reasonable recovery. Best for most applications.
Warm Standby
Scaled-down but running secondary environment. Route 53 health-check failover. Fast recovery with moderate cost. Good for business-critical applications.
Active-Active
Full capacity in multiple regions simultaneously. Route 53 latency-based routing + health checks. DynamoDB Global Tables for active-active database. 2x cost but best resilience.
Aurora Global Database: Cross-Region Replication
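```hcl
# Terraform: Aurora Global Database (primary + secondary cluster)
# Each cluster also needs at least one aws_rds_cluster_instance
# (instance_class = "db.serverless" for Serverless v2); omitted here for brevity.
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "app-global-cluster"
  engine                    = "aurora-postgresql"
  engine_version            = "16.2"
  database_name             = "app"
}

# Primary cluster (us-east-1)
resource "aws_rds_cluster" "primary" {
  provider                    = aws.us_east_1
  engine                      = aws_rds_global_cluster.main.engine
  engine_version              = aws_rds_global_cluster.main.engine_version
  engine_mode                 = "provisioned" # Serverless v2 runs in provisioned mode
  global_cluster_identifier   = aws_rds_global_cluster.main.id
  cluster_identifier          = "app-primary"
  master_username             = var.db_username
  manage_master_user_password = true

  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 32
  }
}

# Secondary cluster (eu-west-1): read-only, replication lag typically under 1s
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.eu_west_1
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
  global_cluster_identifier = aws_rds_global_cluster.main.id
  cluster_identifier        = "app-secondary"

  # Create the primary first so this cluster joins as a secondary.
  # It can be promoted to primary during failover, typically in about a minute.
  depends_on = [aws_rds_cluster.primary]
}
```
Frequently Asked Questions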
How much does a production AWS architecture cost per month?
A typical early-stage startup (ECS Fargate 2 tasks, Aurora Serverless, S3 + CloudFront, ALB) runs $200-500/month. A mid-scale SaaS (auto-scaling ECS, RDS Multi-AZ, ElastiCache, WAF) costs $1,500-5,000/month. An enterprise multi-region architecture starts at $5,000-20,000+/month. Serverless (Lambda + DynamoDB) can start as low as $10/month for low traffic.
Should I choose ECS or EKS for containerized applications?
Choose ECS if: you're a startup or small team, you're AWS-only, and you want simplicity. Choose EKS if: you have existing Kubernetes expertise, you need multi-cloud portability, or you rely heavily on the CNCF/Kubernetes ecosystem (Istio, Karpenter, ArgoCD). For most startups and mid-size companies, ECS + Fargate is the right default — less complexity, no control plane cost, and full AWS integration.
What is the AWS Well-Architected Framework and why does it matter?
The Well-Architected Framework is AWS's set of guidelines organized into pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization, plus Sustainability, which AWS added in 2021. It matters because it gives you a systematic way to evaluate your architecture against proven best practices. AWS offers free Well-Architected Reviews through the console or AWS Partner Network. At Codazz, every architecture we design is reviewed against these pillars before launch.
What is the best disaster recovery strategy for a startup?
Start with Pilot Light: Aurora Global Database replica in a second region, S3 Cross-Region Replication, and Route 53 health-check failover. This gives you ~15-minute RTO with less than 1-minute RPO at a fraction of the cost of active-active. Test your failover quarterly. As you grow and SLAs tighten, evolve to Warm Standby, then Active-Active.
How do I reduce my AWS bill without impacting production?
Follow this sequence: (1) Use Compute Optimizer to identify over-provisioned instances and right-size. (2) Purchase 1-year Compute Savings Plans for predictable workloads: roughly 40% savings in exchange for a one-year spend commitment, so size the commitment to your steady baseline. (3) Add S3/DynamoDB VPC gateway endpoints to remove NAT Gateway data charges for that traffic. (4) Enable S3 Intelligent-Tiering. (5) Move CI/CD and dev environments to Spot instances. These five steps typically reduce AWS bills by 35-50% without touching production capacity.
Need Help Designing Your AWS Architecture?
We'll review your current setup (or design from scratch), identify cost savings, and deliver a production-ready architecture with full Terraform/CDK code.