# Chapter 30: Monitoring and Alerting

## Overview

This chapter defines the complete monitoring architecture for the POS Platform, including metrics collection, dashboards, alerting rules, and incident response procedures.

## Monitoring Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│                           MONITORING STACK                           │
└──────────────────────────────────────────────────────────────────────┘

  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
  │    POS-API-1    │     │    POS-API-2    │     │    POS-API-3    │
  │                 │     │                 │     │                 │
  │  /metrics:8080  │     │  /metrics:8080  │     │  /metrics:8080  │
  └────────┬────────┘     └────────┬────────┘     └────────┬────────┘
           │                       │                       │
           └───────────────────────┼───────────────────────┘
                                   │
                                   ▼
               ┌──────────────────────────────────────┐
               │              PROMETHEUS              │
               │            (Metrics Store)           │
               │                                      │
               │  - Scrape interval: 15s              │
               │  - Retention: 15 days                │
               │  - Port: 9090                        │
               └──────────────────┬───────────────────┘
                                  │
                ┌─────────────────┼─────────────────┐
                │                 │                 │
                ▼                 ▼                 ▼
      ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
      │     GRAFANA     │ │  ALERTMANAGER   │ │      LOKI       │
      │  (Dashboards)   │ │    (Alerts)     │ │     (Logs)      │
      │                 │ │                 │ │                 │
      │   Port: 3000    │ │   Port: 9093    │ │   Port: 3100    │
      └─────────────────┘ └────────┬────────┘ └─────────────────┘
                                   │
                   ┌───────────────┼───────────────┐
                   │               │               │
                   ▼               ▼               ▼
              ┌────────┐      ┌────────┐      ┌───────────┐
              │ Slack  │      │ Email  │      │ PagerDuty │
              └────────┘      └────────┘      └───────────┘
```
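A minimal sketch of how this stack could be wired together with Docker Compose (the runbooks later in this chapter assume Docker); the image tags, volume paths, and port mappings here are illustrative rather than taken from the platform repository:

```yaml
# Sketch: docker-compose.monitoring.yml (illustrative -- not from the deployment repo)
services:
  prometheus:
    image: prom/prometheus:v2.51.0            # pinned version is an assumption
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d     # matches the 15-day retention above
      - --web.enable-lifecycle                # allows POST /-/reload
    volumes:
      - ./prometheus:/etc/prometheus
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"

  loki:
    image: grafana/loki:2.9.6
    ports:
      - "3100:3100"
```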
## Key Metrics

### Business SLIs (Service Level Indicators)
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Transaction Success Rate | % of transactions completed successfully | > 99.9% | < 99.5% |
| Avg Transaction Time | End-to-end transaction processing | < 2s | > 5s |
| Payment Success Rate | % of payments processed successfully | > 99.5% | < 99% |
| Order Fulfillment Rate | Orders fulfilled within SLA | > 98% | < 95% |
| API Availability | Uptime of API endpoints | > 99.9% | < 99.5% |
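These SLIs are easier to keep consistent between dashboards and alerts if they are pre-computed as recording rules. A minimal sketch for two of them, using the metric names that appear in the alert rules and dashboard later in this chapter (the recording-rule names themselves are illustrative):

```yaml
# Sketch: business SLI recording rules (rule names are assumptions)
groups:
  - name: business_slis
    rules:
      # % of transactions completed successfully over the last 5 minutes
      - record: sli:transaction_success_rate:percent
        expr: |
          100 * sum(rate(pos_transactions_success_total[5m]))
              / sum(rate(pos_transactions_total[5m]))
      # % of pos-api scrape targets that were up over the last 5 minutes
      - record: sli:api_availability:percent
        expr: 100 * avg(avg_over_time(up{job="pos-api"}[5m]))
```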
### Infrastructure Metrics
| Category | Metric | Warning | Critical |
|---|---|---|---|
| CPU | Usage % | > 70% | > 90% |
| Memory | Usage % | > 75% | > 90% |
| Disk | Usage % | > 70% | > 85% |
| Disk | I/O Wait | > 20% | > 40% |
| Network | Packet Loss | > 0.1% | > 1% |
| Network | Latency (ms) | > 100ms | > 500ms |
### Application Metrics
| Metric | Description | Warning | Critical |
|---|---|---|---|
| Error Rate | 5xx errors per minute | > 1% | > 5% |
| Response Time (p99) | 99th percentile latency | > 500ms | > 2000ms |
| Response Time (p50) | Median latency | > 100ms | > 500ms |
| Request Rate | Requests per second | N/A (baseline) | > 200% of baseline |
| Queue Depth | Messages waiting in RabbitMQ | > 1000 | > 5000 |
| Active Connections | DB connections in use | > 80% of pool | > 95% of pool |
| Cache Hit Rate | Redis cache effectiveness | < 80% | < 60% |
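The last three rows in this table are derived ratios rather than metrics the exporters emit directly. A small sketch of recording rules that pre-compute them, using the postgres_exporter and redis_exporter metrics referenced elsewhere in this chapter (the rule names and the single-instance aggregation are assumptions):

```yaml
# Sketch: derived application ratios behind the Warning/Critical thresholds above
groups:
  - name: application_ratios
    rules:
      # DB connections in use as a % of max_connections (postgres_exporter metrics)
      - record: app:db_connection_pool:percent
        expr: 100 * sum(pg_stat_activity_count) / max(pg_settings_max_connections)
      # Redis cache hit rate over the last 5 minutes (redis_exporter metrics)
      - record: app:cache_hit_rate:percent
        expr: |
          100 * rate(redis_keyspace_hits_total[5m])
              / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```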
## Prometheus Configuration

### Complete prometheus.yml

```yaml
# File: /pos-platform/monitoring/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pos-production'
    environment: 'production'

#=============================================
# ALERTING CONFIGURATION
#=============================================
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

#=============================================
# RULE FILES
#=============================================
rule_files:
  - "/etc/prometheus/rules/*.yml"

#=============================================
# SCRAPE CONFIGURATIONS
#=============================================
scrape_configs:
  #-----------------------------------------
  # Prometheus Self-Monitoring
  #-----------------------------------------
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  #-----------------------------------------
  # POS API Instances
  #-----------------------------------------
  - job_name: 'pos-api'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pos-api-1:8080'
          - 'pos-api-2:8080'
          - 'pos-api-3:8080'
        labels:
          app: 'pos-api'
          tier: 'backend'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  #-----------------------------------------
  # PostgreSQL Exporter
  #-----------------------------------------
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          app: 'postgres'
          tier: 'database'

  #-----------------------------------------
  # Redis Exporter
  #-----------------------------------------
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          app: 'redis'
          tier: 'cache'

  #-----------------------------------------
  # RabbitMQ Exporter
  #-----------------------------------------
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          app: 'rabbitmq'
          tier: 'messaging'

  #-----------------------------------------
  # Nginx Exporter
  #-----------------------------------------
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          app: 'nginx'
          tier: 'ingress'

  #-----------------------------------------
  # Node Exporter (Host Metrics)
  #-----------------------------------------
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          tier: 'infrastructure'

  #-----------------------------------------
  # Docker Container Metrics
  #-----------------------------------------
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          tier: 'containers'
```
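Before a change to this file reaches production, it can be validated with promtool (shipped with Prometheus) and then hot-reloaded; the reload endpoint assumes Prometheus runs with the `--web.enable-lifecycle` flag:

```bash
# Validate the scrape config and any rule files it references
promtool check config /pos-platform/monitoring/prometheus/prometheus.yml

# Hot-reload a running Prometheus (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Confirm all scrape targets are healthy
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
```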
## Alert Rules

### Complete Alert Rules Configuration

```yaml
# File: /pos-platform/monitoring/prometheus/rules/alerts.yml

groups:
  #=============================================
  # P1 - CRITICAL (Page immediately)
  #=============================================
  - name: critical_alerts
    rules:
      #-----------------------------------------
      # API Down
      #-----------------------------------------
      - alert: APIDown
        expr: up{job="pos-api"} == 0
        for: 1m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "POS API instance {{ $labels.instance }} is down"
          description: "API instance has been unreachable for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/api-down"

      #-----------------------------------------
      # Database Down
      #-----------------------------------------
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database connection failed for 30 seconds"
          runbook_url: "https://wiki.internal/runbooks/db-down"

      #-----------------------------------------
      # High Error Rate
      #-----------------------------------------
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 2m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "High error rate detected: {{ $value | printf \"%.2f\" }}%"
          description: "Error rate exceeds 5% for more than 2 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      #-----------------------------------------
      # Transaction Failure Spike
      #-----------------------------------------
      - alert: TransactionFailureSpike
        expr: |
          (
            sum(rate(pos_transactions_failed_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "Transaction failure rate: {{ $value | printf \"%.2f\" }}%"
          description: "More than 1% of transactions are failing"
          runbook_url: "https://wiki.internal/runbooks/transaction-failures"

  #=============================================
  # P2 - HIGH (Page during business hours)
  #=============================================
  - name: high_alerts
    rules:
      #-----------------------------------------
      # High Response Time
      #-----------------------------------------
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "P99 response time is {{ $value | printf \"%.2f\" }}s"
          description: "99th percentile latency exceeds 2 seconds"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      #-----------------------------------------
      # Database Connection Pool Exhaustion
      #-----------------------------------------
      - alert: DBConnectionPoolLow
        expr: |
          pg_stat_activity_count / pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "DB connection pool at {{ $value | printf \"%.0f\" }}%"
          description: "Database connections nearly exhausted"
          runbook_url: "https://wiki.internal/runbooks/db-connections"

      #-----------------------------------------
      # Queue Backlog
      #-----------------------------------------
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages > 5000
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "Message queue backlog: {{ $value }} messages"
          description: "RabbitMQ queue has significant backlog"
          runbook_url: "https://wiki.internal/runbooks/queue-backlog"

      #-----------------------------------------
      # Memory Pressure
      #-----------------------------------------
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: P2
          team: infrastructure
        annotations:
          summary: "Memory usage at {{ $value | printf \"%.0f\" }}%"
          description: "System memory is critically low"
          runbook_url: "https://wiki.internal/runbooks/memory-pressure"

  #=============================================
  # P3 - MEDIUM (Email/Slack notification)
  #=============================================
  - name: medium_alerts
    rules:
      #-----------------------------------------
      # CPU Warning
      #-----------------------------------------
      - alert: HighCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 15m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "CPU usage at {{ $value | printf \"%.0f\" }}%"
          description: "CPU usage elevated for extended period"

      #-----------------------------------------
      # Disk Space Warning
      #-----------------------------------------
      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 70
        for: 30m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "Disk usage at {{ $value | printf \"%.0f\" }}% on {{ $labels.mountpoint }}"
          description: "Disk space running low"

      #-----------------------------------------
      # Cache Hit Rate Low
      #-----------------------------------------
      - alert: CacheHitRateLow
        expr: |
          redis_keyspace_hits_total /
          (redis_keyspace_hits_total + redis_keyspace_misses_total) * 100 < 80
        for: 30m
        labels:
          severity: P3
          team: platform
        annotations:
          summary: "Cache hit rate: {{ $value | printf \"%.0f\" }}%"
          description: "Redis cache effectiveness is low"

  #=============================================
  # P4 - LOW (Log/Dashboard only)
  #=============================================
  - name: low_alerts
    rules:
      #-----------------------------------------
      # SSL Certificate Expiry
      #-----------------------------------------
      - alert: SSLCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "SSL cert expires in {{ $value | printf \"%.0f\" }} days"
          description: "Certificate renewal needed soon"

      #-----------------------------------------
      # Container Restarts
      #-----------------------------------------
      - alert: ContainerRestarts
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times"
          description: "Container may be unstable"
```
## AlertManager Configuration

```yaml
# File: /pos-platform/monitoring/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@pos-platform.com'
  smtp_auth_username: 'alerts@pos-platform.com'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

#=============================================
# ROUTING
#=============================================
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    #-----------------------------------------
    # P1 - Critical: Page immediately
    #-----------------------------------------
    - match:
        severity: P1
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: P1
      receiver: 'slack-critical'
      continue: true
    #-----------------------------------------
    # P2 - High: Page during business hours
    #-----------------------------------------
    - match:
        severity: P2
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours
      continue: true
    - match:
        severity: P2
      receiver: 'slack-high'
    #-----------------------------------------
    # P3 - Medium: Slack + Email
    #-----------------------------------------
    - match:
        severity: P3
      receiver: 'slack-medium'
      continue: true
    - match:
        severity: P3
      receiver: 'email-team'
    #-----------------------------------------
    # P4 - Low: Slack only
    #-----------------------------------------
    - match:
        severity: P4
      receiver: 'slack-low'

#=============================================
# TIME INTERVALS
#=============================================
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

#=============================================
# RECEIVERS
#=============================================
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: error

  - name: 'slack-critical'
    slack_configs:
      - channel: '#pos-critical'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.internal/d/pos-overview'

  - name: 'slack-high'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
        color: 'warning'

  - name: 'slack-medium'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'slack-low'
    slack_configs:
      - channel: '#pos-info'
        send_resolved: false

  - name: 'email-team'
    email_configs:
      - to: 'platform-team@company.com'
        send_resolved: true
```
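The routing tree can be sanity-checked with amtool before deployment; a sketch, assuming amtool is run where the file above lives:

```bash
# Validate syntax and referenced receivers
amtool check-config /pos-platform/monitoring/alertmanager/alertmanager.yml

# Show which receivers a P1 alert would be routed to
amtool config routes test \
  --config.file=/pos-platform/monitoring/alertmanager/alertmanager.yml \
  severity=P1 alertname=APIDown
```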
## Grafana Dashboard

### POS Platform Overview Dashboard (JSON)

```json
{
"dashboard": {
"id": null,
"uid": "pos-overview",
"title": "POS Platform Overview",
"tags": ["pos", "production"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "Transaction Success Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"targets": [
{
"expr": "(sum(rate(pos_transactions_success_total[5m])) / sum(rate(pos_transactions_total[5m]))) * 100",
"legendFormat": "Success Rate"
}
],
"options": {
"colorMode": "value",
"graphMode": "area"
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 99},
{"color": "green", "value": 99.5}
]
}
}
}
},
{
"id": 2,
"title": "Requests per Second",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total[1m]))",
"legendFormat": "RPS"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
}
}
},
{
"id": 3,
"title": "P99 Response Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 0.5},
{"color": "red", "value": 2}
]
}
}
}
},
{
"id": 4,
"title": "Error Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"targets": [
{
"expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
"legendFormat": "Errors"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
},
{
"id": 5,
"title": "Active Transactions",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"targets": [
{
"expr": "pos_transactions_in_progress",
"legendFormat": "Active"
}
]
},
{
"id": 6,
"title": "API Health",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"targets": [
{
"expr": "count(up{job=\"pos-api\"} == 1)",
"legendFormat": "Healthy Instances"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 2},
{"color": "green", "value": 3}
]
}
}
}
},
{
"id": 10,
"title": "Request Rate by Endpoint",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
"legendFormat": "{{endpoint}}"
}
]
},
{
"id": 11,
"title": "Response Time Distribution",
"type": "heatmap",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"targets": [
{
"expr": "sum(increase(http_request_duration_seconds_bucket[1m])) by (le)",
"legendFormat": "{{le}}"
}
]
},
{
"id": 20,
"title": "Database Connections",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 0, "y": 12},
"targets": [
{
"expr": "pg_stat_activity_count",
"legendFormat": "Active"
},
{
"expr": "pg_settings_max_connections",
"legendFormat": "Max"
}
]
},
{
"id": 21,
"title": "Redis Operations",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 8, "y": 12},
"targets": [
{
"expr": "rate(redis_commands_processed_total[1m])",
"legendFormat": "Commands/sec"
}
]
},
{
"id": 22,
"title": "Queue Depth",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 16, "y": 12},
"targets": [
{
"expr": "rabbitmq_queue_messages",
"legendFormat": "{{queue}}"
}
]
},
{
"id": 30,
"title": "CPU Usage by Container",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
"targets": [
{
"expr": "rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]) * 100",
"legendFormat": "{{container}}"
}
],
"fieldConfig": {
"defaults": {"unit": "percent"}
}
},
{
"id": 31,
"title": "Memory Usage by Container",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
"targets": [
{
"expr": "container_memory_usage_bytes{container!=\"\"} / 1024 / 1024",
"legendFormat": "{{container}}"
}
],
"fieldConfig": {
"defaults": {"unit": "decmbytes"}
}
}
]
}
}
```
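Rather than importing this JSON by hand, the dashboard can be file-provisioned so it survives container rebuilds and stays in version control. A minimal sketch of a provisioning provider, assuming the JSON above is saved as `pos-overview.json` in the mounted dashboards directory (paths and folder name are illustrative):

```yaml
# File: /etc/grafana/provisioning/dashboards/pos.yml (illustrative)
apiVersion: 1
providers:
  - name: 'pos-dashboards'
    orgId: 1
    folder: 'POS Platform'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      # Directory containing pos-overview.json and other dashboard JSON files
      path: /var/lib/grafana/dashboards
```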
## Incident Response Runbooks

### Runbook: API Down (P1)

````markdown
# Runbook: API Down

**Alert**: APIDown
**Severity**: P1 (Critical)
**Impact**: Customers cannot complete transactions

## Symptoms
- Health check endpoint returning non-200
- Load balancer showing unhealthy targets
- Transaction error rate spike

## Immediate Actions (First 5 minutes)
1. **Verify the alert**
   ```bash
   curl -s http://pos-api:8080/health | jq
   docker ps | grep pos-api
   ```
2. **Check container logs**
   ```bash
   docker logs pos-api-1 --tail 100
   docker logs pos-api-2 --tail 100
   docker logs pos-api-3 --tail 100
   ```
3. **Check resource usage**
   ```bash
   docker stats --no-stream
   ```
4. **Restart unhealthy containers**
   ```bash
   docker restart pos-api-1   # Replace with affected container
   ```

## Escalation
- If all containers down: Page Infrastructure Lead
- If database issue: Page Database Team
- If network issue: Page Network Team

## Resolution Checklist
- Identify root cause
- Apply fix (restart, rollback, config change)
- Verify health checks passing
- Monitor for 15 minutes
- Update incident ticket
- Schedule postmortem if major outage

## Common Causes
| Cause | Solution |
|---|---|
| OOM (Out of Memory) | Restart, investigate memory leak |
| Database connection failure | Check DB health, restart connections |
| Deployment failure | Rollback to previous version |
| Network partition | Check network, restart networking |
````
### Runbook: High Error Rate (P1)
````markdown
# Runbook: High Error Rate

**Alert**: HighErrorRate
**Severity**: P1 (Critical)
**Impact**: Significant portion of requests failing

## Symptoms
- 5xx error rate > 5%
- Customer complaints about failures
- Transaction success rate dropping

## Immediate Actions
1. **Identify error patterns**
   ```bash
   # Check recent errors in logs
   docker logs pos-api-1 2>&1 | grep -i error | tail -50

   # Query Loki for error patterns (LogQL -- run in Grafana Explore):
   # {job="pos-api"} |= "error" | json | line_format "{{.message}}"
   ```
2. **Check which endpoints are failing**
   ```promql
   # In Grafana/Prometheus
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint, status)
   ```
3. **Check dependent services**
   ```bash
   # Database
   docker exec pos-postgres-primary pg_isready
   # Redis
   docker exec pos-redis redis-cli ping
   # RabbitMQ
   curl -u admin:password http://localhost:15672/api/healthchecks/node
   ```

## Root Cause Investigation
| Error Pattern | Likely Cause | Solution |
|---|---|---|
| 500 on /api/transactions | Database timeout | Check DB connections |
| 503 across all endpoints | Overload | Scale up or rate limit |
| 502 from nginx | Container crash | Restart containers |
| Timeout errors | Slow DB queries | Kill long queries, add indexes |

## Recovery Steps
- If DB issue: Restart connection pool
- If overload: Enable aggressive rate limiting
- If code bug: Rollback deployment
- If external dependency: Enable circuit breaker
````
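While remediation from either runbook is in progress, the firing alert can be silenced so the 4-hour `repeat_interval` configured above does not keep re-paging. A sketch using amtool; the author and comment text are illustrative:

```bash
# Silence APIDown for 2 hours while the fix is rolled out
amtool silence add alertname=APIDown \
  --alertmanager.url=http://alertmanager:9093 \
  --author="oncall" \
  --duration=2h \
  --comment="Restarting pos-api instances; tracked in the incident ticket"
```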
---
## Summary
This chapter provides complete monitoring coverage:
1. **Architecture**: Prometheus + Grafana + AlertManager stack
2. **Metrics**: Business SLIs and infrastructure metrics with thresholds
3. **Prometheus Config**: Complete scrape configuration
4. **Alert Rules**: P1-P4 severity levels with escalation
5. **Grafana Dashboard**: Production-ready JSON dashboard
6. **Runbooks**: Step-by-step incident response procedures
**Next Chapter**: [Chapter 31: Security Compliance](./Chapter-31-Security-Compliance.md)
---
*"You cannot improve what you do not measure."*