AWS Multi-Account Landing Zone

Challenge

Design and implement a secure, scalable, and compliant AWS foundation for a fintech payment processing platform from scratch, supporting:
- PCI DSS Compliance: Prepare for Level 1 certification
- High Availability: 99.95% SLA for payment processing
- Multi-Region DR: Active-passive disaster recovery across EU regions
- Security First: Zero-trust principles and defense in depth
- Cost Efficiency: Optimize for FinOps best practices
- Scalability: Support 1M+ daily transactions

Architecture Overview

Multi-Account Strategy

Organization Structure (15+ Accounts) :
Management OU :
- management : Root account, Control Tower, Organizations
- logging : Centralized logging (CloudTrail, Config, Flow Logs)
- security : Security Hub, GuardDuty findings aggregation
- audit : Read-only audit access for compliance
Infrastructure OU :
- network : Transit Gateway, shared networking
- shared-services : DNS, Active Directory, central repositories
Workloads OU :
Production :
- prod-cde : PCI DSS Cardholder Data Environment
- prod-non-cde : Non-CDE production workloads
Non-Production :
- staging : Pre-production testing environment
- dev : Development environment
- sandbox : Experimentation and POCs
Security OU :
- security-tooling : Security tools and scanning
- incident-response : IR automation and forensics

Network Architecture

Hub-and-Spoke Topology

Transit Gateway (TGW) Hub :
Purpose : Central routing for all VPCs
Regions :
- Primary : eu-west-1 (Ireland)
- DR : eu-west-2 (London)
Routing :
- Centralized egress via NAT Gateways
- Inter-VPC communication controls
- On-premises connectivity (future VPN/Direct Connect)
- Route table segmentation for CDE isolation
VPC Design per Account :
Subnets :
- Public : NAT GW, ALB, bastion (jump hosts)
- Private : Application tier, EKS nodes
- Data : Databases, ElastiCache, MSK
- Management : Systems Manager endpoints
CIDR Strategy :
- Non-overlapping ranges across all accounts
- /16 for production, /20 for non-production
- Reserved ranges for future expansion
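
The CIDR strategy above can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the project's actual allocator: the `10.0.0.0/8` supernet and the helper name `allocate_cidrs` are assumptions, and only the /16 and /20 sizes come from the text.

```python
import ipaddress

def allocate_cidrs(supernet, accounts):
    """Assign each account a non-overlapping block of the requested prefix length.

    `accounts` maps an account name to a prefix length (16 for production,
    20 for non-production in this write-up).
    """
    pool = ipaddress.ip_network(supernet)
    cursor = int(pool.network_address)
    allocations = {}
    # Hand out the largest blocks first so each block stays aligned to its size.
    for name, prefix in sorted(accounts.items(), key=lambda kv: kv[1]):
        size = 2 ** (32 - prefix)
        # Round the cursor up to the next boundary for this block size.
        cursor = (cursor + size - 1) // size * size
        block = ipaddress.ip_network((cursor, prefix))
        if not block.subnet_of(pool):
            raise ValueError(f"supernet {supernet} exhausted at {name}")
        allocations[name] = block
        cursor += size
    return allocations

# Hypothetical supernet; account names are taken from the Workloads OU above.
plan = allocate_cidrs(
    "10.0.0.0/8",
    {"prod-cde": 16, "prod-non-cde": 16, "staging": 20, "dev": 20, "sandbox": 20},
)
for account, block in plan.items():
    print(f"{account:14s} {block}")
```

Whatever address space remains after the last block is what "reserved ranges for future expansion" would draw from.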

Security Controls

Network Firewall :
- Centralized in network account
- Deep packet inspection
- Intrusion prevention (IPS)
- Domain filtering for egress
- Threat intelligence integration
Multi-Layer Protection :
1. NACLs : Subnet-level stateless filtering
2. Security Groups : Instance-level stateful filtering
3. WAF : Application layer protection (ALB/CloudFront)
4. Shield Standard : DDoS protection (all accounts)
5. VPC Flow Logs : Network traffic analysis

Implementation Details

1. Infrastructure as Code

Repository Structure :
terraform-live/
├── management/
├── production/
│   ├── eu-west-1/
│   │   ├── vpc/
│   │   ├── eks/
│   │   ├── rds/
│   │   └── security/
│   └── eu-west-2/
├── staging/
└── modules/
    ├── vpc/
    ├── eks/
    ├── rds-aurora/
    └── security-baseline/
Terraform Stack :
- 1000+ AWS resources managed
- Terragrunt for DRY configuration
- Remote state : S3 + DynamoDB locking
- State encryption with KMS
- Module versioning and testing
Security Scanning :
- Checkov : Compliance and security checks
- tfsec : Terraform security scanning
- Terrascan : Policy as code enforcement
- Automated in CI/CD before apply
- Drift detection and remediation
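
One common way to implement scheduled drift detection is to run `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when the plan contains changes. The sketch below assumes that approach; the function names and the idea of calling it once per stack directory are illustrative, not taken from the original pipeline.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {0: "in-sync", 1: "error", 2: "drift"}

def classify_plan_exit(code):
    """Map a terraform plan exit code to a drift status string."""
    return EXIT_MEANINGS.get(code, "error")

def check_drift(workdir):
    """Run a read-only plan in `workdir` and report whether live state has drifted."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode",
         "-lock=false", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduled job could call `check_drift` for each stack and raise an alert on `"drift"`, leaving remediation to a reviewed `apply`.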

GitLab CI/CD Integration

Pipeline Stages :
1. Validate :
- terraform validate
- terraform fmt -check
- Module version verification
2. Security Scan :
- Checkov (CIS, PCI DSS policies)
- tfsec (AWS security best practices)
- Secret detection (gitleaks)
3. Plan :
- terraform plan
- Cost estimation (Infracost)
- Plan review and approval
4. Apply :
- Manual approval gate
- terraform apply
- Drift detection scheduling
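
To illustrate the plan review step, the following sketch inspects the JSON form of a saved plan (as produced by `terraform show -json`) and lists resources that would be deleted or replaced, which is the kind of signal a reviewer would want before the manual approval gate. The helper name and the sample plan are hypothetical; only the `resource_changes[].change.actions` structure comes from Terraform's JSON plan format.

```python
def destructive_changes(plan):
    """Return addresses of resources the plan would delete or replace.

    A replacement appears as ["delete", "create"] or ["create", "delete"],
    so checking for "delete" covers both cases.
    """
    flagged = []
    for change in plan.get("resource_changes", []):
        if "delete" in change["change"]["actions"]:
            flagged.append(change["address"])
    return flagged

# Hypothetical, heavily trimmed plan document for demonstration.
sample_plan = {
    "resource_changes": [
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.ledger", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    ]
}
print(destructive_changes(sample_plan))
```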

2. AWS Control Tower Setup

Landing Zone Features :
Account Factory :
- Automated account provisioning
- Baseline security configuration
- IAM Identity Center (SSO) integration
- CloudTrail and Config enabled by default
Guardrails (SCPs) :
Mandatory :
- Deny disabling CloudTrail
- Deny modifying Config rules
- Deny root user access keys
- Enforce MFA for root user
Strongly Recommended :
- Deny leaving organization
- Deny disabling EBS encryption
- Deny public S3 buckets
- Enforce encrypted volumes
Custom (PCI DSS) :
- Deny non-approved regions
- Enforce KMS encryption
- Restrict instance types
- Deny IMDSv1 (require IMDSv2)
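
Two of the custom guardrails above, the region restriction and the IMDSv2 requirement, could be expressed as a service control policy along these lines. This is a sketch under assumptions: the list of exempted global services and the function name are illustrative, while `aws:RequestedRegion` and `ec2:MetadataHttpTokens` are the standard IAM condition keys for these checks.

```python
def build_pci_scp(approved_regions):
    """Build an SCP document that denies non-approved regions and IMDSv1 launches."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonApprovedRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working (assumed list).
                "NotAction": ["iam:*", "organizations:*", "route53:*",
                              "cloudfront:*", "support:*", "sts:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": approved_regions}
                },
            },
            {
                "Sid": "RequireImdsV2",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # Deny launches that do not require session tokens (IMDSv2).
                "Condition": {
                    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
                },
            },
        ],
    }

# The two regions named in this write-up.
policy = build_pci_scp(["eu-west-1", "eu-west-2"])
```

Serializing `policy` with `json.dumps` gives a document that could be attached to the Workloads OU through Organizations or managed in Terraform.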
Account Baseline :
- VPC with private subnets
- NAT Gateway for outbound
- VPC endpoints for AWS services
- CloudWatch log groups
- SNS topics for alerts
- Systems Manager access

Cluster Architecture

Production EKS Clusters :
Primary (eu-west-1) :
- 3 Availability Zones
- Managed node groups (on-demand)
- Spot instances for batch jobs
- Fargate for serverless workloads
DR (eu-west-2) :
- Pilot-light configuration
- Minimal capacity (cost-optimized)
- Automated scale-up on failover
Node Configuration :
- Instance types : m5.xlarge, r5.xlarge
- Auto Scaling : Cluster Autoscaler
- OS : Amazon Linux 2
- Container runtime : containerd
- IRSA for pod-level IAM permissions
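
IRSA works by letting an IAM role trust the cluster's OIDC provider for one specific Kubernetes service account. A minimal sketch of such a trust policy follows; the account ID, issuer path, namespace, and service account name are placeholders, not values from this platform.

```python
def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build an IAM trust policy that lets one service account assume a role.

    `oidc_issuer` is the cluster's OIDC issuer URL without the "https://" prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to exactly one namespace/service account.
                        f"{oidc_issuer}:sub":
                            f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }

# Placeholder values for illustration only.
trust = irsa_trust_policy(
    "111122223333",
    "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE",
    "payments",
    "payment-api",
)
```

Scoping the `sub` condition to a single service account is what gives pod-level rather than node-level permissions.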
Control Plane :
- Control plane logging to CloudWatch
- Private endpoint (VPC-only access)
- Kubernetes version : 1.27+
- Encryption : KMS for secrets
GitOps with ArgoCD Deployment Strategy :
- ArgoCD deployed in EKS
- Git as single source of truth
- 80+ microservices managed
- Application-per-repo pattern
- Automated sync (with approval for prod)
Progressive Delivery (Argo Rollouts) :
Strategies :
- Blue-Green deployments
- Canary releases (10% → 50% → 100%)
- Automated rollback on metrics
Analysis :
- Prometheus metrics integration
- Success rate, latency, error rate
- Automated promotion or rollback
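
In Argo Rollouts this analysis is declared in the rollout spec and evaluated by the controller, so the following is only a plain-Python illustration of the decision logic behind the 10% → 50% → 100% canary. The thresholds (`min_success`, `max_p95_ms`) and the function name are assumed values for the sketch, not the platform's real settings.

```python
CANARY_STEPS = [10, 50, 100]  # traffic weights from the canary strategy above

def next_action(current_weight, success_rate, p95_latency_ms,
                min_success=0.99, max_p95_ms=500.0):
    """Decide whether a canary should advance, finish, or roll back.

    Returns a tuple of (action, new_traffic_weight).
    """
    # Any failed metric sends all traffic back to the stable version.
    if success_rate < min_success or p95_latency_ms > max_p95_ms:
        return ("rollback", 0)
    step = CANARY_STEPS.index(current_weight)
    if step == len(CANARY_STEPS) - 1:
        return ("promoted", 100)
    return ("advance", CANARY_STEPS[step + 1])

print(next_action(10, success_rate=0.999, p95_latency_ms=120.0))
print(next_action(50, success_rate=0.95, p95_latency_ms=120.0))
```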

4. Observability Stack

Prometheus & Grafana

Prometheus Architecture :
- Thanos for long-term storage (S3)
- Multi-cluster monitoring
- 7-day local retention
- 1-year Thanos retention
- AlertManager for notifications
Grafana Dashboards :
Infrastructure :
- EKS cluster health
- Node and pod metrics
- Network performance
- Storage utilization
Application :
- Service-level metrics
- Payment processing metrics
- API response times
- Error rates and SLIs
Security :
- GuardDuty findings
- WAF blocked requests
- Failed authentication attempts
- Compliance posture
Cost :
- Per-service costs (Kubecost)
- AWS Cost Explorer integration
- Budget vs actual tracking

Logging (ELK + Loki)

OpenSearch (ELK) :
- Centralized log aggregation
- 30-day retention in hot tier
- 1-year retention in S3 (cold tier)
- Vector for log collection
- Kibana for visualization
Loki + Promtail :
- Kubernetes-native logging
- Label-based log queries
- Grafana integration
- Lower storage costs vs ELK
- Real-time log streaming
Log Sources :
- Application logs (stdout/stderr)
- AWS CloudTrail (API calls)
- VPC Flow Logs (network traffic)
- EKS control plane logs
- Load balancer access logs
- WAF logs

5. Security Architecture

IAM Identity Center (AWS SSO)

Configuration :
- Centralized user management
- Azure AD integration (SAML)
- MFA enforcement
- Permission sets per role :
* Admin : Full access (break-glass only)
* DevOps : Infrastructure management
* Developer : Application deployment
* ReadOnly : Audit and compliance
* Security : Security tools access
Access Patterns :
- Time-limited sessions (8 hours)
- JIT access for production
- Approval workflow for sensitive accounts
- Audit logging of all access

Secrets Management

HashiCorp Vault :
Deployment :
- HA cluster on EKS
- Auto-unseal with AWS KMS
- Consul storage backend
- Cross-region replication
Use Cases :
- Database credentials (dynamic)
- API keys and tokens
- TLS certificates (PKI engine)
- Encryption as a service
Authentication :
- Kubernetes auth for pods
- AWS IAM for services
- OIDC for users
AWS Secrets Manager :
- RDS password rotation
- Cross-account secret sharing
- Lambda rotation functions
- Backup to Vault for redundancy
Amazon Aurora PostgreSQL :
Configuration :
- Multi-AZ deployment
- Read replicas (3x)
- Cross-region read replica (DR)
- Performance Insights enabled
Security :
- Encryption at rest (KMS)
- Encryption in transit (TLS 1.3)
- IAM database authentication
- Private subnet deployment
- Security group restrictions
Backup :
- Automated daily snapshots
- 35-day retention
- Cross-region snapshot copy
- Point-in-time recovery (PITR)
MongoDB Atlas :
- Managed service (AWS VPC peering)
- Replica set configuration
- Automated backups
- Performance monitoring
ElastiCache Redis :
- Cluster mode enabled
- Multi-AZ automatic failover
- Encryption in-transit and at-rest
- Session storage and caching

7. Disaster Recovery

Multi-Region Strategy :
Primary : eu-west-1 (Ireland)
DR : eu-west-2 (London)
Approach : Pilot Light
- Network infrastructure pre-deployed
- EKS cluster in standby (minimal nodes)
- Database read replica in DR region
- S3 cross-region replication
- Route 53 health checks and failover
Automation :
- Lambda-based failover orchestration
- Automated DNS cutover (Route 53)
- EKS cluster scale-up automation
- Database promotion scripts
- Runbooks in Confluence
Testing :
- Quarterly DR drills
- Documented runbooks
- Automated validation scripts
- RTO : 4 hours
- RPO : 15 minutes
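
An automated validation script for a DR drill could compare measured data loss and downtime against the stated objectives. The sketch below assumes the drill records three timestamps; the function name and the sample times are hypothetical, and only the 4-hour RTO and 15-minute RPO come from the text.

```python
from datetime import datetime, timedelta, timezone

RTO = timedelta(hours=4)      # recovery time objective from the list above
RPO = timedelta(minutes=15)   # recovery point objective from the list above

def drill_report(last_replicated, failure_time, service_restored):
    """Compare measured data loss and downtime from a drill against RPO and RTO."""
    data_loss = failure_time - last_replicated
    downtime = service_restored - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= RPO,
        "rto_met": downtime <= RTO,
    }

# Hypothetical drill timeline.
failure = datetime(2023, 12, 1, 10, 0, tzinfo=timezone.utc)
report = drill_report(
    last_replicated=failure - timedelta(minutes=5),
    failure_time=failure,
    service_restored=failure + timedelta(minutes=28),
)
print(report["rpo_met"], report["rto_met"])
```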
Backup Strategy :
- Velero for EKS (daily)
- RDS automated snapshots
- S3 versioning enabled
- Configuration backups in Git
- 3-2-1 backup rule adherence

Results & Metrics

Uptime & Availability:
├── Infrastructure Uptime: 99.95%
├── API Availability: 99.98%
├── Payment Processing: 1M+ daily transactions
└── Database Latency: p95 80ms (optimized from 500ms)
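
As a worked check on what these availability figures allow, the downtime budget for a given percentage can be computed directly. The 30-day window is an assumption; the write-up does not state the measurement period.

```python
def downtime_budget_minutes(availability_percent, days=30):
    """Minutes of allowed downtime in a window at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

for target in (99.95, 99.98):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min per 30 days")
```

Under this arithmetic a 99.95% target allows about 21.6 minutes of downtime per 30-day month, so a failover at the upper end of the <30 minutes figure would only fit if availability is measured over a longer window such as a quarter or a year.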
Disaster Recovery:
├── RTO (Recovery Time Objective): 4 hours
├── RPO (Recovery Point Objective): 15 minutes
├── DR Tests: Quarterly (100% success rate)
└── Failover Time: <30 minutes (automated)

Cost Optimization (FinOps)

Monthly Cost Reduction: 45%
├── Before: $180,000/month
└── After: $99,000/month
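
The headline percentage follows directly from the two monthly figures; a short check, with a hypothetical helper name:

```python
def reduction_percent(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100

monthly_saving = 180_000 - 99_000
print(f"saving: ${monthly_saving:,}/month, {reduction_percent(180_000, 99_000):.0f}%")
```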
Optimization Strategies:
├── Reserved Instances: 30% compute savings
├── Savings Plans: Additional 15% savings
├── Spot Instances: 50-70% savings for batch jobs
├── Rightsizing: Reduced over-provisioned instances
├── S3 Lifecycle: Automated tiering to Glacier
└── EBS Optimization: gp3 vs gp2, volume cleanup

Security Posture
- GuardDuty Findings: <5 medium+ findings/month
- Security Hub Score: 95+ compliance score
- Config Compliance: 98% compliant resources
- IAM Access Analyzer: Zero external exposure findings
- Vulnerability Management: <24h MTTR for critical CVEs

Automation & Efficiency
- Infrastructure Provisioning: 90% automated
- Deployment Frequency: 10+ deployments/day
- Deployment Time: Reduced from 4 hours to 15 minutes
- MTTR (Mean Time To Recovery): <30 minutes
- Change Failure Rate: <5%

Technologies Used

AWS Services
- Governance: Organizations, Control Tower, SSO (Identity Center)
- Networking: VPC, Transit Gateway, Route 53, Network Firewall
- Compute: EKS, EC2, Fargate, Lambda
- Storage: S3, EBS, EFS
- Database: Aurora PostgreSQL, ElastiCache Redis, DynamoDB
- Security: KMS, Secrets Manager, GuardDuty, Security Hub, Config, WAF, Shield
- Monitoring: CloudWatch, CloudTrail, VPC Flow Logs

Infrastructure as Code
- Terraform/OpenTofu: Infrastructure provisioning
- Terragrunt: DRY configuration management
- Ansible AWX: Configuration management, OS hardening

Kubernetes Ecosystem
- EKS: Managed Kubernetes
- ArgoCD: GitOps continuous delivery
- Argo Rollouts: Progressive delivery
- Istio: Service mesh
- Helm: Package management

Observability
- Prometheus/Thanos: Metrics and monitoring
- Grafana: Visualization and dashboards
- OpenSearch (ELK): Log aggregation and analysis
- Loki: Kubernetes-native logging
- Jaeger: Distributed tracing

Security
- HashiCorp Vault: Secrets management
- Checkov: IaC compliance scanning
- tfsec: Terraform security scanning
- Wazuh: SIEM (added in 2024)

Key Learnings

Architectural Decisions
- Multi-Account Strategy: Critical for security, compliance, and blast radius reduction
- Transit Gateway: Simplified network architecture vs VPC peering
- GitOps: ArgoCD provided excellent deployment visibility and rollback capability
- Terraform Modules: Reusable modules accelerated account provisioning

Best Practices Established
- Infrastructure as Code for all resources (100%)
- Security scanning in CI/CD before deployment
- Automated compliance monitoring (AWS Config)
- Cost allocation tags on all resources
- Documentation as code (README in every Terraform module)

Challenges Overcome
- Service Quotas: Proactive quota increases for production
- Cross-Account Networking: TGW routing and DNS resolution
- EKS Upgrades: Blue-green cluster strategy for zero downtime
- Cost Control: Implemented budget alerts and cost anomaly detection

Future Enhancements
- AWS Network Firewall for advanced threat protection ✅ (Completed)
- Service mesh (Istio) for zero-trust networking ✅ (Completed)
- Automated security remediation (Security Hub + Lambda)
- FinOps automation with cost recommendations
- Infrastructure drift detection and auto-remediation

December 15, 2023
Tags: AWS, Landing Zone, Multi-Account, Security, Governance, Fintech