AWS Multi-Account Landing Zone

Challenge

Design and implement a secure, scalable, and compliant AWS foundation for a fintech payment processing platform from scratch, supporting:

  • PCI DSS Compliance: Prepare for Level 1 certification
  • High Availability: 99.95% SLA for payment processing
  • Multi-Region DR: Active-passive disaster recovery across EU regions
  • Security First: Zero-trust principles and defense in depth
  • Cost Efficiency: Optimize for FinOps best practices
  • Scalability: Support 1M+ daily transactions

Architecture Overview

Multi-Account Strategy

Organization Structure (15+ Accounts):

Management OU:
  - management: Organization management account (Control Tower, Organizations)
  - logging: Centralized logging (CloudTrail, Config, Flow Logs)
  - security: Security Hub, GuardDuty findings aggregation
  - audit: Read-only audit access for compliance

Infrastructure OU:
  - network: Transit Gateway, shared networking
  - shared-services: DNS, Active Directory, central repositories

Workloads OU:
  Production:
    - prod-cde: PCI DSS Cardholder Data Environment
    - prod-non-cde: Non-CDE production workloads

  Non-Production:
    - staging: Pre-production testing environment
    - dev: Development environment
    - sandbox: Experimentation and POCs

Security OU:
  - security-tooling: Security tools and scanning
  - incident-response: IR automation and forensics

Network Architecture

Hub-and-Spoke Topology

Transit Gateway (TGW) Hub:
  Purpose: Central routing for all VPCs
  Regions:
    - Primary: eu-west-1 (Ireland)
    - DR: eu-west-2 (London)

  Routing:
    - Centralized egress via NAT Gateways
    - Inter-VPC communication controls
    - On-premises connectivity (future VPN/Direct Connect)
    - Route table segmentation for CDE isolation

VPC Design per Account:
  Subnets:
    - Public: NAT GW, ALB, bastion (jump hosts)
    - Private: Application tier, EKS nodes
    - Data: Databases, ElastiCache, MSK
    - Management: Systems Manager endpoints

  CIDR Strategy:
    - Non-overlapping ranges across all accounts
    - /16 for production, /20 for non-production
    - Reserved ranges for future expansion
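
The CIDR strategy above can be sketched with Python's standard `ipaddress` module. This is an illustration only: the `10.0.0.0/8` supernet, the `allocate_cidrs` helper, and the account list passed to it are assumptions, not the ranges actually used.

```python
import ipaddress


def allocate_cidrs(supernet, requests):
    """Carve non-overlapping VPC CIDRs out of one supernet.

    requests is a list of (account_name, prefix_length) pairs. Larger
    blocks are placed first so each block stays aligned to its own size.
    """
    pool = ipaddress.ip_network(supernet)
    cursor = int(pool.network_address)
    end = int(pool.broadcast_address) + 1
    plan = {}
    for name, prefix in sorted(requests, key=lambda r: r[1]):
        size = 2 ** (32 - prefix)
        # Round the cursor up to the next boundary aligned to the block size.
        start = (cursor + size - 1) // size * size
        if start + size > end:
            raise ValueError(f"supernet {supernet} exhausted at {name}")
        plan[name] = ipaddress.IPv4Network((start, prefix))
        cursor = start + size
    return plan


# Hypothetical request list: /16 for production, /20 for non-production.
plan = allocate_cidrs(
    "10.0.0.0/8",
    [("prod-cde", 16), ("prod-non-cde", 16),
     ("staging", 20), ("dev", 20), ("sandbox", 20)],
)
for account, network in plan.items():
    print(f"{account:14s} {network}")
```

Everything after the last allocated block stays free, which is one simple way to keep reserved ranges for future expansion.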

Security Controls

Network Firewall:
  - Centralized in network account
  - Deep packet inspection
  - Intrusion prevention (IPS)
  - Domain filtering for egress
  - Threat intelligence integration

Multi-Layer Protection:
  1. NACLs: Subnet-level stateless filtering
  2. Security Groups: Instance-level stateful filtering
  3. WAF: Application layer protection (ALB/CloudFront)
  4. Shield Standard: DDoS protection (all accounts)
  5. VPC Flow Logs: Network traffic analysis

Implementation Details

1. Infrastructure as Code

Terraform/OpenTofu Architecture

Repository Structure:
  terraform-live/
    ├── management/
    ├── production/
    │   ├── eu-west-1/
    │   │   ├── vpc/
    │   │   ├── eks/
    │   │   ├── rds/
    │   │   └── security/
    │   └── eu-west-2/
    ├── staging/
    └── modules/
        ├── vpc/
        ├── eks/
        ├── rds-aurora/
        └── security-baseline/

Terraform Stack:
  - 1000+ AWS resources managed
  - Terragrunt for DRY configuration
  - Remote state: S3 + DynamoDB locking
  - State encryption with KMS
  - Module versioning and testing

Security Scanning:
  - Checkov: Compliance and security checks
  - tfsec: Terraform security scanning
  - Terrascan: Policy as code enforcement
  - Automated in CI/CD before apply
  - Drift detection and remediation

GitLab CI/CD Integration

Pipeline Stages:
  1. Validate:
     - terraform validate
     - terraform fmt -check
     - Module version verification

  2. Security Scan:
     - Checkov (CIS, PCI DSS policies)
     - tfsec (AWS security best practices)
     - Secret detection (gitleaks)

  3. Plan:
     - terraform plan
     - Cost estimation (Infracost)
     - Plan review and approval

  4. Apply:
     - Manual approval gate
     - terraform apply
     - Drift detection scheduling
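
One way to support the plan review step is to flag destructive changes before the manual approval gate. The sketch below is an assumption about how such a check could look, not the pipeline's actual script. It reads the JSON produced by `terraform show -json <planfile>` and relies on the `resource_changes[].change.actions` field of that format. The function name and sample addresses are hypothetical.

```python
import json


def destructive_changes(plan_json):
    """Return the addresses of resources the plan would delete or replace.

    A replacement appears in the plan JSON as ["delete", "create"] or
    ["create", "delete"], so checking for "delete" covers both cases.
    """
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        if "delete" in change["change"]["actions"]:
            flagged.append(change["address"])
    return flagged


# Hypothetical plan excerpt for illustration.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.payments",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    ]
})
for address in destructive_changes(sample_plan):
    print(f"requires explicit approval: {address}")
```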

2. AWS Control Tower Setup

Landing Zone Features:
  Account Factory:
    - Automated account provisioning
    - Baseline security configuration
    - IAM Identity Center (SSO) integration
    - CloudTrail and Config enabled by default

  Guardrails (preventive SCPs and detective Config rules):
    Mandatory:
      - Deny disabling CloudTrail
      - Deny modifying Config rules
      - Deny root user access keys
      - Enforce MFA for root user

    Strongly Recommended:
      - Deny leaving organization
      - Deny disabling EBS encryption
      - Deny public S3 buckets
      - Enforce encrypted volumes

    Custom (PCI DSS):
      - Deny non-approved regions
      - Enforce KMS encryption
      - Restrict instance types
      - Deny IMDSv1 (require IMDSv2)
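
Two of the custom guardrails above can be expressed as a service control policy document. The following is a minimal sketch, not the policy actually deployed: the `build_pci_scp` helper and the list of exempted global services are assumptions. The condition keys `aws:RequestedRegion` and `ec2:MetadataHttpTokens` are standard IAM condition keys.

```python
import json


def build_pci_scp(approved_regions, global_services):
    """Build an SCP that denies non-approved regions and IMDSv1 launches."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonApprovedRegions",
                "Effect": "Deny",
                # Global services are exempted because their API calls
                # are not tied to the approved regions.
                "NotAction": [f"{svc}:*" for svc in global_services],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": approved_regions}
                },
            },
            {
                "Sid": "RequireImdsV2",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
                },
            },
        ],
    }


# Regions come from the DR design; the exemption list is an assumption.
scp = build_pci_scp(
    approved_regions=["eu-west-1", "eu-west-2"],
    global_services=["iam", "organizations", "route53", "cloudfront",
                     "sts", "support"],
)
print(json.dumps(scp, indent=2))
```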

Account Baseline:
  - VPC with private subnets
  - NAT Gateway for outbound
  - VPC endpoints for AWS services
  - CloudWatch log groups
  - SNS topics for alerts
  - Systems Manager access

3. Kubernetes (EKS) Platform

Cluster Architecture

Production EKS Clusters:
  Primary (eu-west-1):
    - 3 Availability Zones
    - Managed node groups (on-demand)
    - Spot instances for batch jobs
    - Fargate for serverless workloads

  DR (eu-west-2):
    - Pilot-light configuration
    - Minimal capacity (cost-optimized)
    - Automated scale-up on failover

Node Configuration:
  - Instance types: m5.xlarge, r5.xlarge
  - Auto Scaling: Cluster Autoscaler
  - OS: Amazon Linux 2
  - Container runtime: containerd
  - IRSA for pod-level IAM permissions
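
IRSA works by letting an IAM role trust the cluster's OIDC provider, scoped to a single Kubernetes service account. The sketch below builds such a trust policy. The account ID, OIDC provider path, namespace, and service account name are placeholders, and `irsa_trust_policy` is a hypothetical helper.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy that lets one service account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": (
                        f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                    )
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Restrict the token audience and the exact subject,
                        # so no other pod identity can assume the role.
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                        f"{oidc_provider}:sub": (
                            f"system:serviceaccount:{namespace}:{service_account}"
                        ),
                    }
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE",
    namespace="payments",
    service_account="payment-api",
)
print(json.dumps(policy, indent=2))
```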

Control Plane:
  - Control plane logging to CloudWatch
  - Private endpoint (VPC-only access)
  - Kubernetes version: 1.27+
  - Encryption: KMS for secrets

GitOps with ArgoCD

Deployment Strategy:
  - ArgoCD deployed in EKS
  - Git as single source of truth
  - 80+ microservices managed
  - Application-per-repo pattern
  - Automated sync (with approval for prod)

Progressive Delivery (Argo Rollouts):
  Strategies:
    - Blue-Green deployments
    - Canary releases (10% → 50% → 100%)
    - Automated rollback on metrics

  Analysis:
    - Prometheus metrics integration
    - Success rate, latency, error rate
    - Automated promotion or rollback
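
The promote-or-rollback decision behind the canary steps can be illustrated in plain Python. This is not the Argo Rollouts API. It only models the logic, and the thresholds (`min_success`, `max_p95_ms`) are assumed values rather than the ones used in production.

```python
from dataclasses import dataclass

CANARY_STEPS = [10, 50, 100]  # traffic weights from the strategy above


@dataclass
class Metrics:
    success_rate: float    # fraction of successful requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def next_weight(current, metrics, min_success=0.99, max_p95_ms=500.0):
    """Return the next canary weight, or 0 to signal a rollback."""
    healthy = (metrics.success_rate >= min_success
               and metrics.p95_latency_ms <= max_p95_ms)
    if not healthy:
        return 0
    index = CANARY_STEPS.index(current)
    # Stay at the final step once the canary is fully promoted.
    return CANARY_STEPS[min(index + 1, len(CANARY_STEPS) - 1)]


print(next_weight(10, Metrics(success_rate=0.999, p95_latency_ms=120)))
print(next_weight(50, Metrics(success_rate=0.95, p95_latency_ms=120)))
```

In Argo Rollouts itself, the equivalent checks would live in an analysis template that queries Prometheus. The function above only shows the decision rule.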

4. Observability Stack

Prometheus & Grafana

Prometheus Architecture:
  - Thanos for long-term storage (S3)
  - Multi-cluster monitoring
  - 7-day local retention
  - 1-year Thanos retention
  - AlertManager for notifications

Grafana Dashboards:
  Infrastructure:
    - EKS cluster health
    - Node and pod metrics
    - Network performance
    - Storage utilization

  Application:
    - Service-level metrics
    - Payment processing metrics
    - API response times
    - Error rates and SLIs

  Security:
    - GuardDuty findings
    - WAF blocked requests
    - Failed authentication attempts
    - Compliance posture

  Cost:
    - Per-service costs (Kubecost)
    - AWS Cost Explorer integration
    - Budget vs actual tracking

Logging (ELK + Loki)

OpenSearch (ELK):
  - Centralized log aggregation
  - 30-day retention in hot tier
  - 1-year retention in S3 (cold tier)
  - Vector for log collection
  - OpenSearch Dashboards (Kibana) for visualization

Loki + Promtail:
  - Kubernetes-native logging
  - Label-based log queries
  - Grafana integration
  - Lower storage costs vs ELK
  - Real-time log streaming

Log Sources:
  - Application logs (stdout/stderr)
  - AWS CloudTrail (API calls)
  - VPC Flow Logs (network traffic)
  - EKS control plane logs
  - Load balancer access logs
  - WAF logs

5. Security Architecture

IAM Identity Center (AWS SSO)

Configuration:
  - Centralized user management
  - Azure AD integration (SAML)
  - MFA enforcement
  - Permission sets per role:
    * Admin: Full access (break-glass only)
    * DevOps: Infrastructure management
    * Developer: Application deployment
    * ReadOnly: Audit and compliance
    * Security: Security tools access

  Access Patterns:
    - Time-limited sessions (8 hours)
    - JIT access for production
    - Approval workflow for sensitive accounts
    - Audit logging of all access

Secrets Management

HashiCorp Vault:
  Deployment:
    - HA cluster on EKS
    - Auto-unseal with AWS KMS
    - Consul storage backend
    - Cross-region replication

  Use Cases:
    - Database credentials (dynamic)
    - API keys and tokens
    - TLS certificates (PKI engine)
    - Encryption as a service

  Authentication:
    - Kubernetes auth for pods
    - AWS IAM for services
    - OIDC for users

AWS Secrets Manager:
  - RDS password rotation
  - Cross-account secret sharing
  - Lambda rotation functions
  - Backup to Vault for redundancy

6. Database Platform

Amazon Aurora PostgreSQL:
  Configuration:
    - Multi-AZ deployment
    - Read replicas (3x)
    - Cross-region read replica (DR)
    - Performance Insights enabled

  Security:
    - Encryption at rest (KMS)
    - Encryption in transit (TLS 1.3)
    - IAM database authentication
    - Private subnet deployment
    - Security group restrictions

  Backup:
    - Automated daily snapshots
    - 35-day retention
    - Cross-region snapshot copy
    - Point-in-time recovery (PITR)

MongoDB Atlas:
  - Managed service (AWS VPC peering)
  - Replica set configuration
  - Automated backups
  - Performance monitoring

ElastiCache Redis:
  - Cluster mode enabled
  - Multi-AZ automatic failover
  - Encryption in-transit and at-rest
  - Session storage and caching

7. Disaster Recovery

Multi-Region Strategy:
  Primary: eu-west-1 (Ireland)
  DR: eu-west-2 (London)

  Approach: Pilot Light
    - Network infrastructure pre-deployed
    - EKS cluster in standby (minimal nodes)
    - Database read replica in DR region
    - S3 cross-region replication
    - Route 53 health checks and failover

Automation:
  - Lambda-based failover orchestration
  - Automated DNS cutover (Route 53)
  - EKS cluster scale-up automation
  - Database promotion scripts
  - Runbooks in Confluence

Testing:
  - Quarterly DR drills
  - Documented runbooks
  - Automated validation scripts
  - RTO: 4 hours
  - RPO: 15 minutes
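
A drill validation script might compare the achieved recovery figures with the objectives above. The sketch below assumes three timestamps are recorded during each drill. The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=4)     # recovery time objective from the plan
RPO = timedelta(minutes=15)  # recovery point objective from the plan


def evaluate_drill(outage_start, last_replicated_write, service_restored):
    """Compare one drill's achieved RTO and RPO with the objectives."""
    achieved_rpo = outage_start - last_replicated_write  # data-loss window
    achieved_rto = service_restored - outage_start       # time to recover
    return {
        "rto": achieved_rto,
        "rpo": achieved_rpo,
        "rto_met": achieved_rto <= RTO,
        "rpo_met": achieved_rpo <= RPO,
    }


# Hypothetical drill timeline.
result = evaluate_drill(
    outage_start=datetime(2024, 3, 1, 12, 0),
    last_replicated_write=datetime(2024, 3, 1, 11, 52),
    service_restored=datetime(2024, 3, 1, 12, 28),
)
print(result)
```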

Backup Strategy:
  - Velero for EKS (daily)
  - RDS automated snapshots
  - S3 versioning enabled
  - Configuration backups in Git
  - 3-2-1 backup rule adherence

Results & Metrics

Reliability & Performance

Uptime & Availability:
├── Infrastructure Uptime: 99.95%
├── API Availability: 99.98%
├── Payment Processing: 1M+ daily transactions
└── Database Latency: p95 80ms (optimized from 500ms)
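
To put these figures in context, the availability percentages can be converted into a downtime budget and the daily transaction volume into an average rate. The 30-day month is an assumption, and the helper names are illustrative.

```python
def downtime_budget_minutes(availability_pct, days=30):
    """Minutes of allowed downtime over the period for a given availability."""
    return (1 - availability_pct / 100) * days * 24 * 60


def average_tps(daily_transactions):
    """Average transactions per second, assuming an even spread over 24 h."""
    return daily_transactions / (24 * 60 * 60)


print(f"99.95% over 30 days: {downtime_budget_minutes(99.95):.2f} min")
print(f"99.98% over 30 days: {downtime_budget_minutes(99.98):.2f} min")
print(f"1M transactions/day: {average_tps(1_000_000):.2f} TPS on average")
```

Under these assumptions, 99.95% allows roughly 21.6 minutes of downtime per month and 99.98% roughly 8.6 minutes. One million daily transactions average out to about 11.6 per second, although peak load will be higher than the average.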

Disaster Recovery:
├── RTO (Recovery Time Objective): 4 hours
├── RPO (Recovery Point Objective): 15 minutes
├── DR Tests: Quarterly (100% success rate)
└── Failover Time: <30 minutes (automated)

Cost Optimization (FinOps)

Monthly Cost Reduction: 45%
├── Before: $180,000/month
└── After: $99,000/month
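
The headline percentage follows directly from the two monthly figures. A short check, with the annualized saving added as a derived number:

```python
def reduction_pct(before, after):
    """Percentage reduction from a before value to an after value."""
    return (before - after) / before * 100


before_usd, after_usd = 180_000, 99_000
monthly_saving = before_usd - after_usd

print(f"reduction: {reduction_pct(before_usd, after_usd):.0f}%")
print(f"monthly saving: ${monthly_saving:,}")
print(f"annualized saving: ${monthly_saving * 12:,}")
```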

Optimization Strategies:
├── Reserved Instances: 30% compute savings
├── Savings Plans: Additional 15% savings
├── Spot Instances: 50-70% savings for batch jobs
├── Rightsizing: Reduced over-provisioned instances
├── S3 Lifecycle: Automated tiering to Glacier
└── EBS Optimization: gp2 to gp3 migration, cleanup of unattached volumes

Security Posture

  • GuardDuty Findings: <5 medium+ findings/month
  • Security Hub Score: 95+ compliance score
  • Config Compliance: 98% compliant resources
  • IAM Access Analyzer: Zero external exposure findings
  • Vulnerability Management: <24h MTTR for critical CVEs

Automation & Efficiency

  • Infrastructure Provisioning: 90% automated
  • Deployment Frequency: 10+ deployments/day
  • Deployment Time: Reduced from 4 hours to 15 minutes
  • MTTR (Mean Time To Recovery): <30 minutes
  • Change Failure Rate: <5%

Technologies Used

AWS Services

  • Governance: Organizations, Control Tower, IAM Identity Center (formerly AWS SSO)
  • Networking: VPC, Transit Gateway, Route 53, Network Firewall
  • Compute: EKS, EC2, Fargate, Lambda
  • Storage: S3, EBS, EFS
  • Database: Aurora PostgreSQL, ElastiCache Redis, DynamoDB
  • Security: KMS, Secrets Manager, GuardDuty, Security Hub, Config, WAF, Shield
  • Monitoring: CloudWatch, CloudTrail, VPC Flow Logs

Infrastructure as Code

  • Terraform/OpenTofu: Infrastructure provisioning
  • Terragrunt: DRY configuration management
  • Ansible AWX: Configuration management, OS hardening

Kubernetes Ecosystem

  • EKS: Managed Kubernetes
  • ArgoCD: GitOps continuous delivery
  • Argo Rollouts: Progressive delivery
  • Istio: Service mesh
  • Helm: Package management

Observability

  • Prometheus/Thanos: Metrics and monitoring
  • Grafana: Visualization and dashboards
  • OpenSearch (ELK): Log aggregation and analysis
  • Loki: Kubernetes-native logging
  • Jaeger: Distributed tracing

Security

  • HashiCorp Vault: Secrets management
  • Checkov: IaC compliance scanning
  • tfsec: Terraform security scanning
  • Wazuh: SIEM (added in 2024)

Key Learnings

Architectural Decisions

  1. Multi-Account Strategy: Critical for security, compliance, and blast radius reduction
  2. Transit Gateway: Simpler routing and operations than a full mesh of VPC peering connections
  3. GitOps: ArgoCD provided excellent deployment visibility and rollback capability
  4. Terraform Modules: Reusable modules accelerated account provisioning

Best Practices Established

  • Infrastructure as Code for all resources (100%)
  • Security scanning in CI/CD before deployment
  • Automated compliance monitoring (AWS Config)
  • Cost allocation tags on all resources
  • Documentation as code (README in every Terraform module)

Challenges Overcome

  • Service Quotas: Proactive quota increases for production
  • Cross-Account Networking: TGW routing and DNS resolution
  • EKS Upgrades: Blue-green cluster strategy for zero downtime
  • Cost Control: Implemented budget alerts and cost anomaly detection

Future Enhancements

  • AWS Network Firewall for advanced threat protection ✅ (Completed)
  • Service mesh (Istio) for zero-trust networking ✅ (Completed)
  • Automated security remediation (Security Hub + Lambda)
  • FinOps automation with cost recommendations
  • Infrastructure drift detection and auto-remediation