Enterprise SIEM Implementation with Wazuh
Challenge
Implement a comprehensive Security Information and Event Management (SIEM) solution for a fintech payment processing platform to:
- Achieve PCI DSS Compliance: Meet Requirements 10 (logging and monitoring) and 11 (security testing)
- Threat Detection: Identify security incidents in real time across 200+ nodes
- Compliance Automation: Automate evidence collection for audits
- Incident Response: Enable rapid investigation and response to security events
- Visibility: Centralize security monitoring across AWS and Kubernetes
Architecture Overview
High-Level Design
Wazuh Architecture:
Management Layer:
Wazuh Manager Cluster (HA):
- 3 manager nodes (active-active)
- Load balancing for agent connections
- Shared configuration via cluster sync
- Deployed on AWS EC2 (m5.xlarge)
Data Storage Layer:
OpenSearch Cluster:
- 5 data nodes (m5.2xlarge)
- 2 master nodes (m5.xlarge)
- S3 for snapshot backups
- 90-day online retention (hot and warm tiers)
- 1-year cold storage (S3 Glacier)
Agent Layer:
200+ Wazuh Agents:
- EKS worker nodes (100+)
- EC2 instances (80+)
- RDS/Aurora monitoring (indirect via CloudWatch)
- Container-based agents (DaemonSet)
Integration Layer:
AWS Services:
- CloudTrail → S3 → Wazuh ingestion
- VPC Flow Logs → S3 → Wazuh processing
- GuardDuty → EventBridge → Wazuh
- Security Hub → aggregation
- Config → compliance data
- ALB/WAF logs → S3 → analysis
Implementation Details
1. Wazuh Infrastructure Deployment
Manager Cluster (High Availability)
Deployment:
Platform: AWS EC2 (Auto Scaling Group)
Instance Type: m5.xlarge (4 vCPU, 16 GB RAM)
Count: 3 nodes (active-active cluster)
OS: Ubuntu 22.04 LTS (CIS hardened)
Configuration:
Cluster Communication:
- Wazuh cluster protocol (port 1516)
- Shared configuration synchronization
- Automatic failover
- Load balanced agent connections
Agent Communication:
- Port 1514: Agent data ingestion
- Port 1515: Agent enrollment
- TLS encryption enforced
- Certificate-based authentication
API:
- RESTful API (port 55000)
- JWT token authentication
- Integration with automation tools
- Rate limiting enabled
Storage:
- 500 GB EBS gp3 for local buffer
- S3 for long-term archive
- Daily snapshots to S3
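The JWT-authenticated API on port 55000 is what automation tools talk to. As a minimal sketch, assuming the standard Wazuh API endpoints `/security/user/authenticate` and `/agents`, the two requests could be built as follows. The host name and credentials in the usage note are hypothetical, and the functions only construct requests; sending them (and TLS verification settings) is left to the caller.

```python
import base64
import json
import urllib.request

API_PORT = 55000  # API port named in this document


def build_auth_request(host: str, user: str, password: str) -> urllib.request.Request:
    """Build the request that exchanges basic credentials for a JWT."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{host}:{API_PORT}/security/user/authenticate",
        headers={"Authorization": f"Basic {credentials}"},
        method="POST",
    )


def extract_token(response_body: bytes) -> str:
    """Pull the JWT out of the authentication response body."""
    return json.loads(response_body)["data"]["token"]


def build_agents_request(host: str, token: str) -> urllib.request.Request:
    """Build a bearer-authenticated request listing active agents."""
    return urllib.request.Request(
        url=f"https://{host}:{API_PORT}/agents?status=active",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

A caller would pass each request to `urllib.request.urlopen`, for example `build_auth_request("wazuh.example.internal", "svc-automation", "...")`, then feed the response body to `extract_token`.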
OpenSearch Cluster
Cluster Design:
Data Nodes:
- Count: 5 nodes
- Instance: m5.2xlarge (8 vCPU, 32 GB RAM)
- Storage: 2 TB EBS gp3 per node
- Purpose: Index and search operations
Master Nodes:
- Count: 2 nodes (quorum)
- Instance: m5.xlarge (4 vCPU, 16 GB RAM)
- Purpose: Cluster state management
Index Management:
Hot Tier (0-30 days):
- SSD storage (gp3)
- High IOPS for real-time queries
- Daily index rotation
- Replica count: 1
Warm Tier (31-90 days):
- SSD storage (gp3)
- Reduced replica count
- Force merge for optimization
Cold Tier (91-365 days):
- S3 storage via snapshots
- Searchable snapshots
- Minimal compute cost
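The hot/warm/cold tiers above map naturally onto an OpenSearch Index State Management (ISM) policy. The sketch below builds such a policy body as a Python dictionary. It is an illustration under assumptions, not the policy used in this deployment: the repository name `s3-archive` and snapshot name `wazuh-alerts` are hypothetical, "reduced replica count" is assumed to mean zero replicas, and no delete step is included because the source does not describe one.

```python
import json


def build_ism_policy(snapshot_repository: str = "s3-archive") -> dict:
    """Return an ISM policy body mirroring the hot, warm, and cold tiers."""
    return {
        "policy": {
            "description": "Hot (0-30d), warm (31-90d), cold (91-365d) tiers",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [{"replica_count": {"number_of_replicas": 1}}],
                    "transitions": [
                        {"state_name": "warm", "conditions": {"min_index_age": "31d"}}
                    ],
                },
                {
                    "name": "warm",
                    "actions": [
                        # Assumption: "reduced replica count" is taken to mean 0.
                        {"replica_count": {"number_of_replicas": 0}},
                        {"force_merge": {"max_num_segments": 1}},
                    ],
                    "transitions": [
                        {"state_name": "cold", "conditions": {"min_index_age": "91d"}}
                    ],
                },
                {
                    "name": "cold",
                    "actions": [
                        # Hypothetical repository and snapshot names.
                        {
                            "snapshot": {
                                "repository": snapshot_repository,
                                "snapshot": "wazuh-alerts",
                            }
                        }
                    ],
                    "transitions": [],
                },
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ism_policy(), indent=2))
```

The printed JSON is the kind of body that would be submitted to the ISM policy API; daily index rotation is handled separately by the index naming scheme and is not part of this sketch.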
Security:
- TLS 1.3 for all connections
- OpenSearch Security plugin
- Role-based access control (RBAC)
- Audit logging enabled
- VPC private subnet deployment
- Security group restrictions
2. Agent Deployment Strategy
Kubernetes (EKS) Agents
Deployment Method: DaemonSet
Purpose: One agent per node
Resource Limits:
CPU: 200m request, 500m limit
Memory: 256Mi request, 512Mi limit
Container Configuration:
Image: wazuh/wazuh-agent:4.8.0
Security Context:
- Privileged: true (for host monitoring)
- hostPID: true
- hostNetwork: true
Volumes:
- /var/log → container logs
- /var/ossec → agent data
- /etc/os-release → OS detection
- /var/run/docker.sock → container monitoring
Monitoring Capabilities:
- Container lifecycle events
- Kubernetes audit logs
- Pod security violations
- Node system logs
- File integrity monitoring
- Rootkit detection
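The DaemonSet settings listed above (resource limits, security context, and host volumes) can be expressed as a manifest generated from Python. This is a rough sketch: the namespace, labels, and volume names are hypothetical, and the mount paths inside the container are assumed to mirror the host paths, which may not match the actual deployment.

```python
import json


def build_agent_daemonset(namespace: str = "wazuh") -> dict:
    """Build a DaemonSet manifest from the settings listed in this section."""
    # Hypothetical volume names mapped to the host paths named above.
    host_paths = {
        "var-log": "/var/log",
        "ossec-data": "/var/ossec",
        "os-release": "/etc/os-release",
        "docker-sock": "/var/run/docker.sock",
    }
    labels = {"app": "wazuh-agent"}
    container = {
        "name": "wazuh-agent",
        "image": "wazuh/wazuh-agent:4.8.0",
        "securityContext": {"privileged": True},  # needed for host monitoring
        "resources": {
            "requests": {"cpu": "200m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"},
        },
        # Assumption: each host path is mounted at the same path in the container.
        "volumeMounts": [
            {"name": name, "mountPath": path} for name, path in host_paths.items()
        ],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "wazuh-agent", "namespace": namespace, "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "hostPID": True,
                    "hostNetwork": True,
                    "containers": [container],
                    "volumes": [
                        {"name": name, "hostPath": {"path": path}}
                        for name, path in host_paths.items()
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_agent_daemonset(), indent=2))
```

Because Kubernetes accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` after review.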
EC2 Agents
Installation:
Method: Ansible playbook automation
OS Support:
- Amazon Linux 2 / 2023
- Ubuntu 20.04 / 22.04
- CentOS 7 / 8
Enrollment:
- Automated via API
- Certificate-based authentication
- Group assignment by tags
Configuration Profile by Role:
Web Servers:
- Apache/Nginx log monitoring
- Web attack detection
- SSL/TLS monitoring
Database Servers:
- PostgreSQL audit logs
- Failed authentication attempts
- Privilege escalation detection
Application Servers:
- Application log parsing
- API abuse detection
- Performance metrics
3. Security Detection & Rules
Custom Rule Development (500+ Rules)
Payment-Specific Threats:
PAN Data Access Detection:
- Regex patterns for credit card numbers
- Unauthorized database queries
- File access to cardholder data
- Network transmission of sensitive data
- Alert severity: CRITICAL
Transaction Anomalies:
- Unusual transaction amounts
- Rapid transaction frequency
- Geographic anomalies
- Velocity checks (same card, multiple locations)
- ML-based anomaly detection
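The PAN data access detection above relies on regex patterns for card numbers. A regex alone matches many harmless digit runs, so a common refinement is to confirm candidates with the Luhn checksum. The sketch below illustrates that idea in Python; it is not the rule set used in this deployment, and the pattern and masking format are assumptions.

```python
import re

# Candidate PANs: 13 to 19 digits, optionally separated by single spaces or hyphens.
PAN_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def mask_pan(digits: str) -> str:
    """Keep the first six and last four digits, mask the rest."""
    return digits[:6] + "*" * (len(digits) - 10) + digits[-4:]


def find_pans(text: str) -> list[str]:
    """Return masked PANs found in text, keeping only Luhn-valid candidates."""
    hits = []
    for match in PAN_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(mask_pan(digits))
    return hits
```

Masking the match before it is logged keeps the detection pipeline itself from storing full card numbers.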
Authentication & Access:
Brute Force Detection:
- Failed SSH attempts (5+ in 1 min)
- Failed API authentication (10+ in 5 min)
- Account lockout monitoring
- Distributed brute force detection
Privilege Escalation:
- Sudo usage monitoring
- IAM permission changes
- Role assumption tracking
- Unauthorized service account usage
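The brute force thresholds above (5+ failed SSH attempts in 1 minute, 10+ failed API authentications in 5 minutes) are frequency-over-time-window conditions. The following Python sketch shows the underlying sliding-window logic. It illustrates the concept only and is not how Wazuh's rule engine is implemented; it assumes timestamps arrive in non-decreasing order per source.

```python
from collections import defaultdict, deque


class FailureWindow:
    """Count failures per source inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, source: str, timestamp: float) -> bool:
        """Record one failure; return True once the threshold is reached."""
        events = self._events[source]
        events.append(timestamp)
        # Drop failures that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


# Thresholds taken from the list above.
ssh_failures = FailureWindow(threshold=5, window_seconds=60)
api_failures = FailureWindow(threshold=10, window_seconds=300)
```

Keying the window on source IP catches a single noisy attacker; distributed brute force detection would need a second window keyed on the target account instead.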
Web Attacks:
OWASP Top 10 Detection:
- SQL injection attempts
- Cross-Site Scripting (XSS)
- Command injection
- Path traversal
- Insecure deserialization
- XML External Entities (XXE)
API Abuse:
- Rate limiting violations
- Invalid API token usage
- Unusual API endpoints
- Parameter tampering
Data Exfiltration:
Indicators:
- Large data transfers (>100MB)
- Unusual outbound connections
- Database dumps
- SSH/SCP file transfers
- S3 bucket data access anomalies
Integration Rules
AWS CloudTrail:
High-Risk Events:
- IAM policy changes
- Security group modifications
- S3 bucket policy changes
- Root account usage
- KMS key deletion attempts
- CloudTrail logging disabled
Compliance Events:
- Encryption disabled on resources
- Public access enabled
- Unencrypted snapshots
- Cross-region resource access
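The high-risk CloudTrail categories above can be matched on the `eventName` and `userIdentity.type` fields of each CloudTrail record. The Python sketch below shows one way to classify a record. The event names are real AWS API actions, but the selection is illustrative rather than exhaustive, and the category labels are hypothetical.

```python
# Illustrative, non-exhaustive mapping from category to CloudTrail eventName values.
HIGH_RISK_EVENT_NAMES = {
    "iam_policy_change": {
        "PutUserPolicy", "PutRolePolicy", "PutGroupPolicy",
        "AttachUserPolicy", "AttachRolePolicy", "AttachGroupPolicy",
        "CreatePolicyVersion",
    },
    "security_group_modification": {
        "AuthorizeSecurityGroupIngress", "AuthorizeSecurityGroupEgress",
        "RevokeSecurityGroupIngress", "RevokeSecurityGroupEgress",
    },
    "s3_bucket_policy_change": {"PutBucketPolicy", "DeleteBucketPolicy"},
    "kms_key_deletion_attempt": {"ScheduleKeyDeletion", "DisableKey"},
    "cloudtrail_logging_disabled": {"StopLogging", "DeleteTrail"},
}


def classify_cloudtrail_event(event: dict) -> list[str]:
    """Return the high-risk categories a CloudTrail record falls into."""
    name = event.get("eventName")
    categories = [
        category
        for category, names in HIGH_RISK_EVENT_NAMES.items()
        if name in names
    ]
    # Root usage is identified by the caller identity, not the action.
    if event.get("userIdentity", {}).get("type") == "Root":
        categories.append("root_account_usage")
    return categories
```

An empty result means the record matched none of the listed categories, not that it is benign.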
GuardDuty Findings:
- Malware detection
- Cryptocurrency mining
- Backdoor detection
- Unusual API calls
- Compromised credentials
- Data exfiltration attempts
VPC Flow Logs:
- Port scanning detection
- DDoS indicators
- Unusual traffic patterns
- Blocked connection attempts
- Internal lateral movement
4. File Integrity Monitoring (FIM)
Monitored Files (10,000+):
System Files:
Linux:
- /etc/passwd, /etc/shadow
- /etc/ssh/sshd_config
- /etc/sudoers
- /boot/grub/grub.cfg
- Systemd service files
Frequency: Real-time
Actions: Alert + snapshot
Configuration Files:
Application:
- Nginx/Apache configs
- Application .env files
- Database configuration
- SSL certificates
Kubernetes:
- Pod manifests
- ConfigMaps
- Secrets (metadata only)
- Service definitions
Frequency: Real-time
Actions: Alert + backup + change review
Code Directories:
- /var/www/html
- /opt/applications
- Container image layers
Frequency: Scheduled (daily)
Actions: Alert on unauthorized changes
Logs & Audit:
- /var/log/*
- Application log directories
- Audit logs
Frequency: Real-time
Actions: Detect log tampering
FIM Capabilities:
- Real-time change detection
- File checksum (SHA256)
- File attributes (permissions, owner)
- Who-data (who made the change)
- Baseline comparison
- Automated restoration (critical files)
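The checksum, attribute, and baseline-comparison capabilities above follow a simple pattern: record a fingerprint per file, then compare later fingerprints against it. The Python sketch below illustrates that pattern. It is a conceptual example, not the Wazuh FIM module, and the function names are hypothetical.

```python
import hashlib
import stat
from pathlib import Path


def fingerprint(path: Path) -> dict:
    """Capture the SHA-256 checksum, permission bits, and ownership of a file."""
    info = path.stat()
    return {
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "mode": stat.S_IMODE(info.st_mode),
        "uid": info.st_uid,
        "gid": info.st_gid,
    }


def build_baseline(paths: list[Path]) -> dict[str, dict]:
    """Record a fingerprint for every monitored file."""
    return {str(path): fingerprint(path) for path in paths}


def detect_changes(baseline: dict[str, dict]) -> dict[str, list[str]]:
    """Compare current state to the baseline; map each path to what changed."""
    changes: dict[str, list[str]] = {}
    for key, previous in baseline.items():
        path = Path(key)
        if not path.exists():
            changes[key] = ["deleted"]
            continue
        current = fingerprint(path)
        changed = [field for field in previous if current[field] != previous[field]]
        if changed:
            changes[key] = changed
    return changes
```

This periodic comparison corresponds to the scheduled mode; real-time detection and who-data depend on kernel facilities such as inotify and the audit subsystem, which this sketch does not cover.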
5. Vulnerability Management
Vulnerability Detection:
Methods:
- Agent-based scanning
- Package manager integration (apt, yum)
- CVE database correlation
- OVAL definitions
Scan Frequency:
- Critical systems: Daily
- Production servers: Daily
- Non-production: Weekly
- Containers: On image push
Risk-Based Prioritization:
Scoring:
- CVSS base score
- Exploitability (EPSS)
- Asset criticality
- Network exposure
- Data sensitivity
SLA by Severity:
- Critical: 24 hours
- High: 7 days
- Medium: 30 days
- Low: 90 days
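The SLA table above translates directly into a due-date calculation for each finding. A minimal Python sketch, with hypothetical function names:

```python
from datetime import datetime, timedelta, timezone

# Remediation windows taken from the SLA list above.
REMEDIATION_SLA = {
    "critical": timedelta(hours=24),
    "high": timedelta(days=7),
    "medium": timedelta(days=30),
    "low": timedelta(days=90),
}


def remediation_due(severity: str, detected_at: datetime) -> datetime:
    """Return the deadline for fixing a finding of the given severity."""
    return detected_at + REMEDIATION_SLA[severity.lower()]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """Return True if the remediation deadline has passed."""
    return now > remediation_due(severity, detected_at)


if __name__ == "__main__":
    found = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(remediation_due("High", found).isoformat())
```

The scoring inputs (CVSS, EPSS, asset criticality, exposure, data sensitivity) would decide which severity bucket a finding lands in; their weighting is not specified here, so it is left out.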
Integration:
- Jira ticket creation
- Slack notifications
- Email alerts to teams
- Dashboard for management
- Monthly vulnerability reports
Remediation Tracking:
- Patch deployment via Ansible
- Verification scanning
- Exception management
- Compliance reporting
6. Compliance Automation (PCI DSS)
PCI DSS Dashboard (150+ Checks):
Requirement 1 & 2: Network & Configuration:
Checks:
- Firewall rules in place
- Default passwords changed
- Unnecessary services disabled
- Configuration standards enforced
Requirement 3 & 4: Data Protection:
Checks:
- Encryption at rest enabled
- TLS version compliance (1.2+)
- Key rotation schedules
- Sensitive data masking
Requirement 5 & 6: Malware & Development:
Checks:
- Anti-malware running
- Malware signature updates
- Secure coding practices
- Change management process
Requirement 7 & 8: Access Control:
Checks:
- Least privilege enforcement
- User access reviews
- MFA enabled
- Password complexity
Requirement 10: Logging & Monitoring:
Checks:
- Log collection enabled
- Log retention (1 year)
- Clock synchronization (NTP)
- Log integrity protection
- Audit trail completeness
Requirement 11: Security Testing:
Checks:
- Vulnerability scans completed
- Penetration test schedule
- IDS/IPS operational
- File integrity monitoring
Automated Evidence Collection:
- Daily compliance snapshots
- Configuration backups
- Change logs
- Access reports
- Exception documentation
- Quarterly audit packages
7. Incident Response Integration
Active Response:
Automated Actions:
IP Blocking:
- Trigger: 5+ failed SSH attempts
- Action: iptables block for 30 minutes
- Scope: Source IP
Account Lockout:
- Trigger: 10+ failed logins
- Action: Disable account
- Notification: Security team + manager
Container Quarantine:
- Trigger: Malware detected in container
- Action: Kill pod, taint node
- Notification: DevOps + Security
Process Kill:
- Trigger: Cryptocurrency miner detected
- Action: Kill process, block binary
- Forensics: Memory dump
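The IP blocking action above amounts to inserting a firewall DROP rule for the source IP and removing it 30 minutes later. Wazuh ships its own `firewall-drop` active response for this; the Python sketch below is not that script. It only illustrates how the block and unblock commands could be built safely, and it assumes the caller (for example, an active-response timeout) issues the matching `delete` after `BLOCK_SECONDS`.

```python
import ipaddress

BLOCK_SECONDS = 30 * 60  # 30-minute block from the list above

_ACTION_FLAGS = {"add": "-I", "delete": "-D"}


def build_block_command(action: str, source_ip: str) -> list[str]:
    """Build the argv for adding or removing a DROP rule for one source IP."""
    if action not in _ACTION_FLAGS:
        raise ValueError(f"unsupported action: {action!r}")
    # Validate the address so untrusted alert data never reaches the shell raw.
    address = ipaddress.ip_address(source_ip)
    binary = "ip6tables" if address.version == 6 else "iptables"
    return [binary, _ACTION_FLAGS[action], "INPUT", "-s", str(address), "-j", "DROP"]
```

Returning an argument list rather than a shell string means the command can be passed to `subprocess.run` without `shell=True`, which avoids injection through a crafted source field.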
Incident Management:
PagerDuty Integration:
- Critical alerts → immediate page
- High severity → notification
- Escalation after 15 minutes
- 24/7 SOC coverage
Workflow:
1. Alert triggered in Wazuh
2. PagerDuty incident created
3. On-call engineer notified
4. Investigation in OpenSearch
5. Response action (manual/automated)
6. Incident documentation
7. Post-incident review
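Steps 1 and 2 of the workflow above (alert triggered, PagerDuty incident created) can be sketched as a conversion from a Wazuh alert to a PagerDuty Events API v2 trigger event. The rule-level thresholds used for severity mapping and the `dedup_key` format are assumptions for illustration, not values from this deployment.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def pagerduty_severity(rule_level: int) -> str:
    """Map a Wazuh rule level to a PagerDuty severity (assumed thresholds)."""
    if rule_level >= 12:
        return "critical"
    if rule_level >= 8:
        return "error"
    if rule_level >= 5:
        return "warning"
    return "info"


def build_trigger_event(alert: dict, routing_key: str) -> dict:
    """Convert a Wazuh alert into an Events API v2 trigger payload."""
    rule = alert["rule"]
    agent_name = alert["agent"]["name"]
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical dedup key: groups repeats of one rule on one agent.
        "dedup_key": f"wazuh-{rule['id']}-{agent_name}",
        "payload": {
            "summary": rule["description"][:1024],
            "source": agent_name,
            "severity": pagerduty_severity(int(rule["level"])),
            "custom_details": {"rule_id": rule["id"], "rule_level": rule["level"]},
        },
    }


def build_request(event: dict) -> urllib.request.Request:
    """Wrap the event in a POST request; sending it is left to the caller."""
    return urllib.request.Request(
        url=PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Using a stable `dedup_key` lets repeated alerts for the same rule and agent update one incident instead of paging the on-call engineer again.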
Playbooks:
- Brute force attack response
- Malware infection containment
- Data breach procedure
- DDoS mitigation
- Insider threat investigation
- Compromised credentials
8. Monitoring & Alerting
Alert Channels:
Email:
- Daily summary reports
- Critical alerts (immediate)
- Weekly compliance reports
Slack:
- Real-time alerts (high+)
- Compliance violations
- System health issues
PagerDuty:
- Critical security events
- System outages
- Escalation after 15 min
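The channel list above implies a severity-based routing rule: Slack receives high and above in real time, while critical events also go to email and PagerDuty immediately. A small Python sketch of that rule follows. The four-level severity scale is an assumption, and lower severities are assumed to surface only in the daily summary, so they return no immediate channel.

```python
# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def immediate_channels(severity: str) -> list[str]:
    """Return the channels notified immediately for an alert severity."""
    rank = SEVERITY_ORDER.index(severity.lower())  # ValueError on unknown input
    channels = []
    if rank >= SEVERITY_ORDER.index("high"):
        channels.append("slack")
    if rank >= SEVERITY_ORDER.index("critical"):
        channels.extend(["email", "pagerduty"])
    return channels
```

For example, `immediate_channels("high")` yields only Slack, while `immediate_channels("critical")` yields all three channels.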
Custom Dashboards:
Security Operations Center (SOC):
- Real-time event stream
- Alert count by severity
- Top attacked assets
- Geographic threat map
- MITRE ATT&CK mapping
Executive Dashboard:
- Security posture score
- Compliance status (%)
- Incident trends
- Vulnerability metrics
- Cost of incidents
Compliance Dashboard:
- PCI DSS requirement status
- Failed compliance checks
- Remediation progress
- Audit readiness score
Results & Metrics
Security Improvements
Threat Detection:
├── Security Incidents: 85% reduction (20/month → 3/month)
├── MTTD (Mean Time To Detect): 5 minutes average
├── MTTR (Mean Time To Respond): 30 minutes average
└── False Positive Rate: <10% (continuous tuning)
Visibility:
├── Monitored Nodes: 200+ agents
├── Events Per Day: 500K+ security events
├── Log Sources: 15+ integrated sources
└── Coverage: 100% of CDE infrastructure
Compliance Achievement
PCI DSS Level 1 Certification:
├── Audit Result: Zero findings
├── Requirement 10: PASS (Logging & Monitoring)
├── Requirement 11: PASS (Security Testing & Monitoring)
└── Automated Evidence: 150+ compliance checks
Operational Benefits:
├── Audit Preparation: Reduced from 2 weeks to 2 days
├── Evidence Collection: 90% automated
├── Compliance Reporting: Real-time dashboard
└── Audit Cost: Reduced by 60%
Operational Efficiency
- Alert Investigation: 70% faster with centralized SIEM
- Incident Response: 50% faster MTTR
- Compliance Effort: 90% reduction in manual checks
- Security Team Productivity: 3x improvement
Technologies Used
Core SIEM Stack
- Wazuh: 4.8.x (SIEM platform)
- OpenSearch: 2.11.x (data storage and search)
- OpenSearch Dashboards: Visualization and reporting
Integration & Automation
- AWS Services: CloudTrail, GuardDuty, VPC Flow Logs, Config, Security Hub
- Python: Custom integrations and scripts
- Ansible: Agent deployment automation
- PagerDuty: Incident management
- Slack: Real-time notifications
Infrastructure
- AWS EC2: Wazuh managers and OpenSearch nodes
- Kubernetes: Agent deployment via DaemonSet
- S3: Long-term log archive
- EBS: High-performance storage
Key Learnings
Best Practices
- Rule Tuning is Critical: Started with 1000+ alerts/day, tuned to 50/day
- Agent Performance: Proper resource limits prevent node impact
- Data Retention: Balance compliance requirements with storage costs
- Integration First: AWS service integration provides deeper visibility
- Automation: Active response reduces MTTR significantly
Challenges Overcome
- OpenSearch Scaling: Tuned cluster for 500K events/day ingestion
- Agent Overhead: Optimized configuration to <5% CPU usage
- Alert Fatigue: Implemented severity-based routing and aggregation
- Custom Rules: Iterative development with security team feedback
Future Enhancements
- SOAR Integration: Security Orchestration and Automation
- Threat Intelligence: Integrate external threat feeds
- User Behavior Analytics (UBA): ML-based anomaly detection
- MITRE ATT&CK Mapping: Automated attack technique identification
- Red Team Integration: Automated detection testing