Chapter 30: Monitoring and Alerting

Overview

This chapter defines the complete monitoring architecture for the POS Platform, including metrics collection, dashboards, alerting rules, and incident response procedures.


Monitoring Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              MONITORING STACK                                        │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   POS-API-1     │     │   POS-API-2     │     │   POS-API-3     │
│                 │     │                 │     │                 │
│ /metrics:8080   │     │ /metrics:8080   │     │ /metrics:8080   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
              ┌──────────────────────────────────────┐
              │           PROMETHEUS                 │
              │          (Metrics Store)             │
              │                                      │
              │  - Scrape interval: 15s              │
              │  - Retention: 15 days                │
              │  - Port: 9090                        │
              └──────────────────┬───────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│    GRAFANA      │  │  ALERTMANAGER   │  │   LOKI          │
│  (Dashboards)   │  │    (Alerts)     │  │   (Logs)        │
│                 │  │                 │  │                 │
│  Port: 3000     │  │  Port: 9093     │  │  Port: 3100     │
└─────────────────┘  └────────┬────────┘  └─────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
          ┌────────┐     ┌────────┐     ┌───────────┐
          │ Slack  │     │ Email  │     │ PagerDuty │
          └────────┘     └────────┘     └───────────┘

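The stack above maps naturally onto a Docker Compose deployment. The snippet below is a minimal sketch, assuming the official prom/prometheus, prom/alertmanager, grafana/grafana, and grafana/loki images and the configuration paths used later in this chapter; exporters, persistent volumes, and networking details are omitted.

# Sketch only: image tags and host paths are assumptions, not part of the documented deployment
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus:/etc/prometheus          # prometheus.yml + rules/
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=15d"              # matches the 15-day retention above

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager:/etc/alertmanager      # alertmanager.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
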
Key Metrics

Business SLIs (Service Level Indicators)

| Metric | Description | Target | Alert Threshold |
|--------|-------------|--------|-----------------|
| Transaction Success Rate | % of transactions completed successfully | > 99.9% | < 99.5% |
| Avg Transaction Time | End-to-end transaction processing | < 2s | > 5s |
| Payment Success Rate | % of payments processed successfully | > 99.5% | < 99% |
| Order Fulfillment Rate | Orders fulfilled within SLA | > 98% | < 95% |
| API Availability | Uptime of API endpoints | > 99.9% | < 99.5% |
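
The SLIs above can be precomputed as Prometheus recording rules so that dashboards and alerts query one consistent expression. The snippet below is a sketch that assumes the pos_transactions_total and pos_transactions_success_total counters referenced later in this chapter; the file path and record names are illustrative.

# Sketch: /etc/prometheus/rules/sli-recording.yml (assumed path, picked up by the rule_files glob below)
groups:
  - name: sli_recording
    rules:
      # Transaction success rate over the last 5 minutes, in percent
      - record: sli:transaction_success_rate:percent
        expr: |
          (
            sum(rate(pos_transactions_success_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100
      # API availability: fraction of pos-api instances currently up
      - record: sli:api_availability:ratio
        expr: avg(up{job="pos-api"})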

Infrastructure Metrics

| Category | Metric | Warning | Critical |
|----------|--------|---------|----------|
| CPU | Usage % | > 70% | > 90% |
| Memory | Usage % | > 75% | > 90% |
| Disk | Usage % | > 70% | > 85% |
| Disk | I/O Wait | > 20% | > 40% |
| Network | Packet Loss | > 0.1% | > 1% |
| Network | Latency | > 100ms | > 500ms |

Application Metrics

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| Error Rate | % of requests returning 5xx errors | > 1% | > 5% |
| Response Time (p99) | 99th percentile latency | > 500ms | > 2000ms |
| Response Time (p50) | Median latency | > 100ms | > 500ms |
| Request Rate | Requests per second | N/A (baseline) | > 200% of baseline |
| Queue Depth | Messages waiting in RabbitMQ | > 1000 | > 5000 |
| Active Connections | DB connections in use | > 80% of pool | > 95% of pool |
| Cache Hit Rate | Redis cache effectiveness | < 80% | < 60% |

Prometheus Configuration

Complete prometheus.yml

# File: /pos-platform/monitoring/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pos-production'
    environment: 'production'

#=============================================
# ALERTING CONFIGURATION
#=============================================
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

#=============================================
# RULE FILES
#=============================================
rule_files:
  - "/etc/prometheus/rules/*.yml"

#=============================================
# SCRAPE CONFIGURATIONS
#=============================================
scrape_configs:
  #-----------------------------------------
  # Prometheus Self-Monitoring
  #-----------------------------------------
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  #-----------------------------------------
  # POS API Instances
  #-----------------------------------------
  - job_name: 'pos-api'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pos-api-1:8080'
          - 'pos-api-2:8080'
          - 'pos-api-3:8080'
        labels:
          app: 'pos-api'
          tier: 'backend'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  #-----------------------------------------
  # PostgreSQL Exporter
  #-----------------------------------------
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          app: 'postgres'
          tier: 'database'

  #-----------------------------------------
  # Redis Exporter
  #-----------------------------------------
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          app: 'redis'
          tier: 'cache'

  #-----------------------------------------
  # RabbitMQ Exporter
  #-----------------------------------------
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          app: 'rabbitmq'
          tier: 'messaging'

  #-----------------------------------------
  # Nginx Exporter
  #-----------------------------------------
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          app: 'nginx'
          tier: 'ingress'

  #-----------------------------------------
  # Node Exporter (Host Metrics)
  #-----------------------------------------
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          tier: 'infrastructure'

  #-----------------------------------------
  # Docker Container Metrics
  #-----------------------------------------
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          tier: 'containers'

Alert Rules

Complete Alert Rules Configuration

# File: /pos-platform/monitoring/prometheus/rules/alerts.yml

groups:
  #=============================================
  # P1 - CRITICAL (Page immediately)
  #=============================================
  - name: critical_alerts
    rules:
      #-----------------------------------------
      # API Down
      #-----------------------------------------
      - alert: APIDown
        expr: up{job="pos-api"} == 0
        for: 1m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "POS API instance {{ $labels.instance }} is down"
          description: "API instance has been unreachable for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/api-down"

      #-----------------------------------------
      # Database Down
      #-----------------------------------------
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database connection failed for 30 seconds"
          runbook_url: "https://wiki.internal/runbooks/db-down"

      #-----------------------------------------
      # High Error Rate
      #-----------------------------------------
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 2m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "High error rate detected: {{ $value | printf \"%.2f\" }}%"
          description: "Error rate exceeds 5% for more than 2 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      #-----------------------------------------
      # Transaction Failure Spike
      #-----------------------------------------
      - alert: TransactionFailureSpike
        expr: |
          (
            sum(rate(pos_transactions_failed_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "Transaction failure rate: {{ $value | printf \"%.2f\" }}%"
          description: "More than 1% of transactions are failing"
          runbook_url: "https://wiki.internal/runbooks/transaction-failures"

  #=============================================
  # P2 - HIGH (Page during business hours)
  #=============================================
  - name: high_alerts
    rules:
      #-----------------------------------------
      # High Response Time
      #-----------------------------------------
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "P99 response time is {{ $value | printf \"%.2f\" }}s"
          description: "99th percentile latency exceeds 2 seconds"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      #-----------------------------------------
      # Database Connection Pool Exhaustion
      #-----------------------------------------
      - alert: DBConnectionPoolLow
        expr: |
          sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "DB connection pool at {{ $value | printf \"%.0f\" }}%"
          description: "Database connections nearly exhausted"
          runbook_url: "https://wiki.internal/runbooks/db-connections"

      #-----------------------------------------
      # Queue Backlog
      #-----------------------------------------
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages > 5000
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "Message queue backlog: {{ $value }} messages"
          description: "RabbitMQ queue has significant backlog"
          runbook_url: "https://wiki.internal/runbooks/queue-backlog"

      #-----------------------------------------
      # Memory Pressure
      #-----------------------------------------
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: P2
          team: infrastructure
        annotations:
          summary: "Memory usage at {{ $value | printf \"%.0f\" }}%"
          description: "System memory is critically low"
          runbook_url: "https://wiki.internal/runbooks/memory-pressure"

  #=============================================
  # P3 - MEDIUM (Email/Slack notification)
  #=============================================
  - name: medium_alerts
    rules:
      #-----------------------------------------
      # CPU Warning
      #-----------------------------------------
      - alert: HighCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 15m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "CPU usage at {{ $value | printf \"%.0f\" }}%"
          description: "CPU usage elevated for extended period"

      #-----------------------------------------
      # Disk Space Warning
      #-----------------------------------------
      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 70
        for: 30m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "Disk usage at {{ $value | printf \"%.0f\" }}% on {{ $labels.mountpoint }}"
          description: "Disk space running low"

      #-----------------------------------------
      # Cache Hit Rate Low
      #-----------------------------------------
      - alert: CacheHitRateLow
        expr: |
          rate(redis_keyspace_hits_total[10m]) /
          (rate(redis_keyspace_hits_total[10m]) + rate(redis_keyspace_misses_total[10m])) * 100 < 80
        for: 30m
        labels:
          severity: P3
          team: platform
        annotations:
          summary: "Cache hit rate: {{ $value | printf \"%.0f\" }}%"
          description: "Redis cache effectiveness is low"

  #=============================================
  # P4 - LOW (Log/Dashboard only)
  #=============================================
  - name: low_alerts
    rules:
      #-----------------------------------------
      # SSL Certificate Expiry
      #-----------------------------------------
      - alert: SSLCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "SSL cert expires in {{ $value | printf \"%.0f\" }} days"
          description: "Certificate renewal needed soon"

      #-----------------------------------------
      # Container Restarts
      #-----------------------------------------
      - alert: ContainerRestarts
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times"
          description: "Container may be unstable"

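Note that the SSLCertExpiringSoon rule relies on probe_ssl_earliest_cert_expiry, which is exposed by the Prometheus blackbox exporter rather than by any scrape job defined in the prometheus.yml above. A sketch of the additional scrape job it would need (exporter address, module name, and probed URL are assumptions):

# Sketch: blackbox exporter probe job (would be added under scrape_configs)
  - job_name: 'blackbox-https'
    metrics_path: /probe
    params:
      module: [http_2xx]                       # assumed module in the blackbox exporter config
    static_configs:
      - targets:
          - 'https://pos.example.com'          # assumed public endpoint to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'  # assumed exporter address
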
AlertManager Configuration

# File: /pos-platform/monitoring/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@pos-platform.com'
  smtp_auth_username: 'alerts@pos-platform.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

  slack_api_url: '${SLACK_WEBHOOK_URL}'

  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

#=============================================
# ROUTING
#=============================================
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

  routes:
    #-----------------------------------------
    # P1 - Critical: Page immediately
    #-----------------------------------------
    - match:
        severity: P1
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: P1
      receiver: 'slack-critical'
      continue: true

    #-----------------------------------------
    # P2 - High: Page during business hours
    #-----------------------------------------
    - match:
        severity: P2
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours
      continue: true
    - match:
        severity: P2
      receiver: 'slack-high'

    #-----------------------------------------
    # P3 - Medium: Slack + Email
    #-----------------------------------------
    - match:
        severity: P3
      receiver: 'slack-medium'
      continue: true
    - match:
        severity: P3
      receiver: 'email-team'

    #-----------------------------------------
    # P4 - Low: Slack only
    #-----------------------------------------
    - match:
        severity: P4
      receiver: 'slack-low'

#=============================================
# TIME INTERVALS
#=============================================
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

#=============================================
# RECEIVERS
#=============================================
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: error

  - name: 'slack-critical'
    slack_configs:
      - channel: '#pos-critical'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.internal/d/pos-overview'

  - name: 'slack-high'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
        color: 'warning'

  - name: 'slack-medium'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'slack-low'
    slack_configs:
      - channel: '#pos-info'
        send_resolved: false

  - name: 'email-team'
    email_configs:
      - to: 'platform-team@company.com'
        send_resolved: true

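A worthwhile extension not shown in the file above is inhibition, so that lower-severity notifications are muted while a P1 such as APIDown is already paging for the same instance. A minimal sketch:

# Sketch: optional inhibit_rules block for alertmanager.yml (not part of the config above)
inhibit_rules:
  - source_matchers:
      - alertname = "APIDown"
    target_matchers:
      - severity =~ "P2|P3|P4"
    # Only mute alerts that refer to the same instance that is already down
    equal: ['instance']
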
Grafana Dashboard

POS Platform Overview Dashboard (JSON)

{
  "dashboard": {
    "id": null,
    "uid": "pos-overview",
    "title": "POS Platform Overview",
    "tags": ["pos", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Transaction Success Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(pos_transactions_success_total[5m])) / sum(rate(pos_transactions_total[5m]))) * 100",
            "legendFormat": "Success Rate"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 99},
                {"color": "green", "value": 99.5}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Requests per Second",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))",
            "legendFormat": "RPS"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "id": 3,
        "title": "P99 Response Time",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.5},
                {"color": "red", "value": 2}
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Error Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Errors"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      },
      {
        "id": 5,
        "title": "Active Transactions",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
        "targets": [
          {
            "expr": "pos_transactions_in_progress",
            "legendFormat": "Active"
          }
        ]
      },
      {
        "id": 6,
        "title": "API Health",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
        "targets": [
          {
            "expr": "count(up{job=\"pos-api\"} == 1)",
            "legendFormat": "Healthy Instances"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 2},
                {"color": "green", "value": 3}
              ]
            }
          }
        }
      },
      {
        "id": 10,
        "title": "Request Rate by Endpoint",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "id": 11,
        "title": "Response Time Distribution",
        "type": "heatmap",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
        "targets": [
          {
            "expr": "sum(increase(http_request_duration_seconds_bucket[1m])) by (le)",
            "legendFormat": "{{le}}"
          }
        ]
      },
      {
        "id": 20,
        "title": "Database Connections",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 0, "y": 12},
        "targets": [
          {
            "expr": "pg_stat_activity_count",
            "legendFormat": "Active"
          },
          {
            "expr": "pg_settings_max_connections",
            "legendFormat": "Max"
          }
        ]
      },
      {
        "id": 21,
        "title": "Redis Operations",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 8, "y": 12},
        "targets": [
          {
            "expr": "rate(redis_commands_processed_total[1m])",
            "legendFormat": "Commands/sec"
          }
        ]
      },
      {
        "id": 22,
        "title": "Queue Depth",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 16, "y": 12},
        "targets": [
          {
            "expr": "rabbitmq_queue_messages",
            "legendFormat": "{{queue}}"
          }
        ]
      },
      {
        "id": 30,
        "title": "CPU Usage by Container",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]) * 100",
            "legendFormat": "{{container}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "percent"}
        }
      },
      {
        "id": 31,
        "title": "Memory Usage by Container",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
        "targets": [
          {
            "expr": "container_memory_usage_bytes{container!=\"\"} / 1024 / 1024",
            "legendFormat": "{{container}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "decmbytes"}
        }
      }
    ]
  }
}

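To load this dashboard automatically instead of importing it by hand, Grafana's file-based provisioning can point at the JSON above. A minimal sketch, assuming the JSON is stored under /var/lib/grafana/dashboards; the provider name, folder, and paths are illustrative.

# Sketch: /etc/grafana/provisioning/dashboards/pos.yml (assumed location)
apiVersion: 1
providers:
  - name: 'pos-dashboards'
    orgId: 1
    folder: 'POS Platform'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards        # directory containing the dashboard JSON above
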
Incident Response Runbooks

Runbook: API Down (P1)

Alert: APIDown
Severity: P1 (Critical)
Impact: Customers cannot complete transactions

Symptoms

  • Health check endpoint returning non-200
  • Load balancer showing unhealthy targets
  • Transaction error rate spike

Immediate Actions (First 5 minutes)

  1. Verify the alert

     curl -s http://pos-api:8080/health | jq
     docker ps | grep pos-api

  2. Check container logs

     docker logs pos-api-1 --tail 100
     docker logs pos-api-2 --tail 100
     docker logs pos-api-3 --tail 100

  3. Check resource usage

     docker stats --no-stream

  4. Restart unhealthy containers

     docker restart pos-api-1  # Replace with affected container

Escalation

  • If all containers down: Page Infrastructure Lead
  • If database issue: Page Database Team
  • If network issue: Page Network Team

Resolution Checklist

  • Identify root cause
  • Apply fix (restart, rollback, config change)
  • Verify health checks passing
  • Monitor for 15 minutes
  • Update incident ticket
  • Schedule postmortem if major outage

Common Causes

| Cause | Solution |
|-------|----------|
| OOM (Out of Memory) | Restart, investigate memory leak |
| Database connection failure | Check DB health, restart connections |
| Deployment failure | Rollback to previous version |
| Network partition | Check network, restart networking |
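
Manual restarts can be reduced by letting the container runtime do part of this work: a restart policy brings crashed or OOM-killed containers back automatically, and a healthcheck against the same /health endpoint makes unhealthy instances visible in docker ps. A sketch, assuming the API runs under Docker Compose; the image name is hypothetical.

# Sketch: restart policy + healthcheck for one API service (image name assumed)
services:
  pos-api-1:
    image: pos-api:latest                      # hypothetical image name
    restart: unless-stopped                    # auto-restart on crash / OOM kill
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s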

Runbook: High Error Rate (P1)

Alert: HighErrorRate
Severity: P1 (Critical)
Impact: Significant portion of requests failing

Symptoms

  • 5xx error rate > 5%
  • Customer complaints about failures
  • Transaction success rate dropping

Immediate Actions

  1. Identify error patterns

     # Check recent errors in logs
     docker logs pos-api-1 2>&1 | grep -i error | tail -50

     # Query Loki for error patterns
     {job="pos-api"} |= "error" | json | line_format "{{.message}}"

  2. Check which endpoints are failing

     # In Grafana/Prometheus
     sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint, status)

  3. Check dependent services

     # Database
     docker exec pos-postgres-primary pg_isready

     # Redis
     docker exec pos-redis redis-cli ping

     # RabbitMQ
     curl -u admin:password http://localhost:15672/api/healthchecks/node

Root Cause Investigation

| Error Pattern | Likely Cause | Solution |
|---------------|--------------|----------|
| 500 on /api/transactions | Database timeout | Check DB connections |
| 503 across all endpoints | Overload | Scale up or rate limit |
| 502 from nginx | Container crash | Restart containers |
| Timeout errors | Slow DB queries | Kill long queries, add indexes |

Recovery Steps

  1. If DB issue: Restart connection pool
  2. If overload: Enable aggressive rate limiting
  3. If code bug: Rollback deployment
  4. If external dependency: Enable circuit breaker

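The LogQL query in step 1 above assumes the API container logs are shipped to Loki under a job="pos-api" label. Below is a minimal Promtail client sketch under that assumption; the log path and port are illustrative and would need to match the log driver actually in use.

# Sketch: promtail config pushing container logs to Loki (paths and labels assumed)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: pos-api
    static_configs:
      - targets: [localhost]
        labels:
          job: pos-api
          __path__: /var/lib/docker/containers/*/*-json.log   # assumed json-file log driver path
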
Summary

This chapter provides complete monitoring coverage:

  1. Architecture: Prometheus + Grafana + AlertManager stack
  2. Metrics: Business SLIs and infrastructure metrics with thresholds
  3. Prometheus Config: Complete scrape configuration
  4. Alert Rules: P1-P4 severity levels with escalation
  5. Grafana Dashboard: Production-ready JSON dashboard
  6. Runbooks: Step-by-step incident response procedures

Next Chapter: Chapter 31: Security Compliance (Chapter-31-Security-Compliance.md)

"You cannot improve what you do not measure."