# Chapter 30: Monitoring and Alerting

## Overview

This chapter defines the complete monitoring architecture for the POS Platform, including metrics collection, dashboards, alerting rules, and incident response procedures.

## Monitoring Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│                           MONITORING STACK                           │
└──────────────────────────────────────────────────────────────────────┘

  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
  │    POS-API-1    │     │    POS-API-2    │     │    POS-API-3    │
  │                 │     │                 │     │                 │
  │  /metrics:8080  │     │  /metrics:8080  │     │  /metrics:8080  │
  └────────┬────────┘     └────────┬────────┘     └────────┬────────┘
           │                       │                       │
           └───────────────────────┼───────────────────────┘
                                   │
                                   ▼
               ┌──────────────────────────────────────┐
               │              PROMETHEUS              │
               │            (Metrics Store)           │
               │                                      │
               │  - Scrape interval: 15s              │
               │  - Retention: 15 days                │
               │  - Port: 9090                        │
               └──────────────────┬───────────────────┘
                                  │
                ┌─────────────────┼─────────────────┐
                │                 │                 │
                ▼                 ▼                 ▼
      ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
      │     GRAFANA     │ │  ALERTMANAGER   │ │      LOKI       │
      │  (Dashboards)   │ │    (Alerts)     │ │     (Logs)      │
      │                 │ │                 │ │                 │
      │   Port: 3000    │ │   Port: 9093    │ │   Port: 3100    │
      └─────────────────┘ └────────┬────────┘ └─────────────────┘
                                   │
                   ┌───────────────┼───────────────┐
                   │               │               │
                   ▼               ▼               ▼
              ┌────────┐      ┌────────┐      ┌───────────┐
              │ Slack  │      │ Email  │      │ PagerDuty │
              └────────┘      └────────┘      └───────────┘
```
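A minimal sketch of how this stack could be wired together with Docker Compose (the runbooks later in this chapter assume Docker); the image tags, volume paths, and port mappings here are illustrative rather than taken from the platform repository:

```yaml
# Sketch: docker-compose.monitoring.yml (illustrative -- not from the deployment repo)
services:
  prometheus:
    image: prom/prometheus:v2.51.0            # pinned version is an assumption
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d     # matches the 15-day retention above
      - --web.enable-lifecycle                # allows POST /-/reload
    volumes:
      - ./prometheus:/etc/prometheus
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"

  loki:
    image: grafana/loki:2.9.6
    ports:
      - "3100:3100"
```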
## Key Metrics

### Business SLIs (Service Level Indicators)
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Transaction Success Rate | % of transactions completed successfully | > 99.9% | < 99.5% |
| Avg Transaction Time | End-to-end transaction processing | < 2s | > 5s |
| Payment Success Rate | % of payments processed successfully | > 99.5% | < 99% |
| Order Fulfillment Rate | Orders fulfilled within SLA | > 98% | < 95% |
| API Availability | Uptime of API endpoints | > 99.9% | < 99.5% |
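These SLIs are easier to keep consistent between dashboards and alerts if they are pre-computed as recording rules. A minimal sketch for two of them, using the metric names that appear in the alert rules and dashboard later in this chapter (the recording-rule names themselves are illustrative):

```yaml
# Sketch: business SLI recording rules (rule names are assumptions)
groups:
  - name: business_slis
    rules:
      # % of transactions completed successfully over the last 5 minutes
      - record: sli:transaction_success_rate:percent
        expr: |
          100 * sum(rate(pos_transactions_success_total[5m]))
              / sum(rate(pos_transactions_total[5m]))
      # % of pos-api scrape targets that were up over the last 5 minutes
      - record: sli:api_availability:percent
        expr: 100 * avg(avg_over_time(up{job="pos-api"}[5m]))
```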
### Infrastructure Metrics
| Category | Metric | Warning | Critical |
|---|---|---|---|
| CPU | Usage % | > 70% | > 90% |
| Memory | Usage % | > 75% | > 90% |
| Disk | Usage % | > 70% | > 85% |
| Disk | I/O Wait | > 20% | > 40% |
| Network | Packet Loss | > 0.1% | > 1% |
| Network | Latency (ms) | > 100ms | > 500ms |
### Application Metrics
| Metric | Description | Warning | Critical |
|---|---|---|---|
| Error Rate | 5xx errors per minute | > 1% | > 5% |
| Response Time (p99) | 99th percentile latency | > 500ms | > 2000ms |
| Response Time (p50) | Median latency | > 100ms | > 500ms |
| Request Rate | Requests per second | N/A (baseline) | > 200% of baseline |
| Queue Depth | Messages waiting in RabbitMQ | > 1000 | > 5000 |
| Active Connections | DB connections in use | > 80% of pool | > 95% of pool |
| Cache Hit Rate | Redis cache effectiveness | < 80% | < 60% |
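The last three rows in this table are derived ratios rather than metrics the exporters emit directly. A small sketch of recording rules that pre-compute them, using the postgres_exporter and redis_exporter metrics referenced elsewhere in this chapter (the rule names and the single-instance aggregation are assumptions):

```yaml
# Sketch: derived application ratios behind the Warning/Critical thresholds above
groups:
  - name: application_ratios
    rules:
      # DB connections in use as a % of max_connections (postgres_exporter metrics)
      - record: app:db_connection_pool:percent
        expr: 100 * sum(pg_stat_activity_count) / max(pg_settings_max_connections)
      # Redis cache hit rate over the last 5 minutes (redis_exporter metrics)
      - record: app:cache_hit_rate:percent
        expr: |
          100 * rate(redis_keyspace_hits_total[5m])
              / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```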
## Prometheus Configuration

### Complete prometheus.yml

```yaml
# File: /pos-platform/monitoring/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pos-production'
    environment: 'production'

#=============================================
# ALERTING CONFIGURATION
#=============================================
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

#=============================================
# RULE FILES
#=============================================
rule_files:
  - "/etc/prometheus/rules/*.yml"

#=============================================
# SCRAPE CONFIGURATIONS
#=============================================
scrape_configs:
  #-----------------------------------------
  # Prometheus Self-Monitoring
  #-----------------------------------------
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  #-----------------------------------------
  # POS API Instances
  #-----------------------------------------
  - job_name: 'pos-api'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pos-api-1:8080'
          - 'pos-api-2:8080'
          - 'pos-api-3:8080'
        labels:
          app: 'pos-api'
          tier: 'backend'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  #-----------------------------------------
  # PostgreSQL Exporter
  #-----------------------------------------
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          app: 'postgres'
          tier: 'database'

  #-----------------------------------------
  # Redis Exporter
  #-----------------------------------------
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          app: 'redis'
          tier: 'cache'

  #-----------------------------------------
  # RabbitMQ Exporter
  #-----------------------------------------
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          app: 'rabbitmq'
          tier: 'messaging'

  #-----------------------------------------
  # Nginx Exporter
  #-----------------------------------------
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          app: 'nginx'
          tier: 'ingress'

  #-----------------------------------------
  # Node Exporter (Host Metrics)
  #-----------------------------------------
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          tier: 'infrastructure'

  #-----------------------------------------
  # Docker Container Metrics
  #-----------------------------------------
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          tier: 'containers'
```
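Before a change to this file reaches production, it can be validated with promtool (shipped with Prometheus) and then hot-reloaded; the reload endpoint assumes Prometheus runs with the `--web.enable-lifecycle` flag:

```bash
# Validate the scrape config and any rule files it references
promtool check config /pos-platform/monitoring/prometheus/prometheus.yml

# Hot-reload a running Prometheus (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Confirm all scrape targets are healthy
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
```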
## Alert Rules

### Complete Alert Rules Configuration

```yaml
# File: /pos-platform/monitoring/prometheus/rules/alerts.yml

groups:
  #=============================================
  # P1 - CRITICAL (Page immediately)
  #=============================================
  - name: critical_alerts
    rules:
      #-----------------------------------------
      # API Down
      #-----------------------------------------
      - alert: APIDown
        expr: up{job="pos-api"} == 0
        for: 1m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "POS API instance {{ $labels.instance }} is down"
          description: "API instance has been unreachable for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/api-down"

      #-----------------------------------------
      # Database Down
      #-----------------------------------------
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database connection failed for 30 seconds"
          runbook_url: "https://wiki.internal/runbooks/db-down"

      #-----------------------------------------
      # High Error Rate
      #-----------------------------------------
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 2m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "High error rate detected: {{ $value | printf \"%.2f\" }}%"
          description: "Error rate exceeds 5% for more than 2 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      #-----------------------------------------
      # Transaction Failure Spike
      #-----------------------------------------
      - alert: TransactionFailureSpike
        expr: |
          (
            sum(rate(pos_transactions_failed_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "Transaction failure rate: {{ $value | printf \"%.2f\" }}%"
          description: "More than 1% of transactions are failing"
          runbook_url: "https://wiki.internal/runbooks/transaction-failures"

  #=============================================
  # P2 - HIGH (Page during business hours)
  #=============================================
  - name: high_alerts
    rules:
      #-----------------------------------------
      # High Response Time
      #-----------------------------------------
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "P99 response time is {{ $value | printf \"%.2f\" }}s"
          description: "99th percentile latency exceeds 2 seconds"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      #-----------------------------------------
      # Database Connection Pool Exhaustion
      #-----------------------------------------
      - alert: DBConnectionPoolLow
        expr: |
          pg_stat_activity_count / pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "DB connection pool at {{ $value | printf \"%.0f\" }}%"
          description: "Database connections nearly exhausted"
          runbook_url: "https://wiki.internal/runbooks/db-connections"

      #-----------------------------------------
      # Queue Backlog
      #-----------------------------------------
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages > 5000
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "Message queue backlog: {{ $value }} messages"
          description: "RabbitMQ queue has significant backlog"
          runbook_url: "https://wiki.internal/runbooks/queue-backlog"

      #-----------------------------------------
      # Memory Pressure
      #-----------------------------------------
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: P2
          team: infrastructure
        annotations:
          summary: "Memory usage at {{ $value | printf \"%.0f\" }}%"
          description: "System memory is critically low"
          runbook_url: "https://wiki.internal/runbooks/memory-pressure"

  #=============================================
  # P3 - MEDIUM (Email/Slack notification)
  #=============================================
  - name: medium_alerts
    rules:
      #-----------------------------------------
      # CPU Warning
      #-----------------------------------------
      - alert: HighCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 15m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "CPU usage at {{ $value | printf \"%.0f\" }}%"
          description: "CPU usage elevated for extended period"

      #-----------------------------------------
      # Disk Space Warning
      #-----------------------------------------
      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 70
        for: 30m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "Disk usage at {{ $value | printf \"%.0f\" }}% on {{ $labels.mountpoint }}"
          description: "Disk space running low"

      #-----------------------------------------
      # Cache Hit Rate Low
      #-----------------------------------------
      - alert: CacheHitRateLow
        expr: |
          redis_keyspace_hits_total /
          (redis_keyspace_hits_total + redis_keyspace_misses_total) * 100 < 80
        for: 30m
        labels:
          severity: P3
          team: platform
        annotations:
          summary: "Cache hit rate: {{ $value | printf \"%.0f\" }}%"
          description: "Redis cache effectiveness is low"

  #=============================================
  # P4 - LOW (Log/Dashboard only)
  #=============================================
  - name: low_alerts
    rules:
      #-----------------------------------------
      # SSL Certificate Expiry
      #-----------------------------------------
      - alert: SSLCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "SSL cert expires in {{ $value | printf \"%.0f\" }} days"
          description: "Certificate renewal needed soon"

      #-----------------------------------------
      # Container Restarts
      #-----------------------------------------
      - alert: ContainerRestarts
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times"
          description: "Container may be unstable"
```
## AlertManager Configuration

```yaml
# File: /pos-platform/monitoring/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@pos-platform.com'
  smtp_auth_username: 'alerts@pos-platform.com'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

#=============================================
# ROUTING
#=============================================
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    #-----------------------------------------
    # P1 - Critical: Page immediately
    #-----------------------------------------
    - match:
        severity: P1
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: P1
      receiver: 'slack-critical'
      continue: true
    #-----------------------------------------
    # P2 - High: Page during business hours
    #-----------------------------------------
    - match:
        severity: P2
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours
      continue: true
    - match:
        severity: P2
      receiver: 'slack-high'
    #-----------------------------------------
    # P3 - Medium: Slack + Email
    #-----------------------------------------
    - match:
        severity: P3
      receiver: 'slack-medium'
      continue: true
    - match:
        severity: P3
      receiver: 'email-team'
    #-----------------------------------------
    # P4 - Low: Slack only
    #-----------------------------------------
    - match:
        severity: P4
      receiver: 'slack-low'

#=============================================
# TIME INTERVALS
#=============================================
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

#=============================================
# RECEIVERS
#=============================================
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: error

  - name: 'slack-critical'
    slack_configs:
      - channel: '#pos-critical'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.internal/d/pos-overview'

  - name: 'slack-high'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
        color: 'warning'

  - name: 'slack-medium'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'slack-low'
    slack_configs:
      - channel: '#pos-info'
        send_resolved: false

  - name: 'email-team'
    email_configs:
      - to: 'platform-team@company.com'
        send_resolved: true
```
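The routing tree can be sanity-checked with amtool before deployment; a sketch, assuming amtool is run where the file above lives:

```bash
# Validate syntax and referenced receivers
amtool check-config /pos-platform/monitoring/alertmanager/alertmanager.yml

# Show which receivers a P1 alert would be routed to
amtool config routes test \
  --config.file=/pos-platform/monitoring/alertmanager/alertmanager.yml \
  severity=P1 alertname=APIDown
```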
## Grafana Dashboard

### POS Platform Overview Dashboard (JSON)

```json
{
"dashboard": {
"id": null,
"uid": "pos-overview",
"title": "POS Platform Overview",
"tags": ["pos", "production"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "Transaction Success Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"targets": [
{
"expr": "(sum(rate(pos_transactions_success_total[5m])) / sum(rate(pos_transactions_total[5m]))) * 100",
"legendFormat": "Success Rate"
}
],
"options": {
"colorMode": "value",
"graphMode": "area"
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 99},
{"color": "green", "value": 99.5}
]
}
}
}
},
{
"id": 2,
"title": "Requests per Second",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total[1m]))",
"legendFormat": "RPS"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
}
}
},
{
"id": 3,
"title": "P99 Response Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 0.5},
{"color": "red", "value": 2}
]
}
}
}
},
{
"id": 4,
"title": "Error Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"targets": [
{
"expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
"legendFormat": "Errors"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
},
{
"id": 5,
"title": "Active Transactions",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"targets": [
{
"expr": "pos_transactions_in_progress",
"legendFormat": "Active"
}
]
},
{
"id": 6,
"title": "API Health",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"targets": [
{
"expr": "count(up{job=\"pos-api\"} == 1)",
"legendFormat": "Healthy Instances"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 2},
{"color": "green", "value": 3}
]
}
}
}
},
{
"id": 10,
"title": "Request Rate by Endpoint",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
"legendFormat": "{{endpoint}}"
}
]
},
{
"id": 11,
"title": "Response Time Distribution",
"type": "heatmap",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"targets": [
{
"expr": "sum(increase(http_request_duration_seconds_bucket[1m])) by (le)",
"legendFormat": "{{le}}"
}
]
},
{
"id": 20,
"title": "Database Connections",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 0, "y": 12},
"targets": [
{
"expr": "pg_stat_activity_count",
"legendFormat": "Active"
},
{
"expr": "pg_settings_max_connections",
"legendFormat": "Max"
}
]
},
{
"id": 21,
"title": "Redis Operations",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 8, "y": 12},
"targets": [
{
"expr": "rate(redis_commands_processed_total[1m])",
"legendFormat": "Commands/sec"
}
]
},
{
"id": 22,
"title": "Queue Depth",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 16, "y": 12},
"targets": [
{
"expr": "rabbitmq_queue_messages",
"legendFormat": "{{queue}}"
}
]
},
{
"id": 30,
"title": "CPU Usage by Container",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
"targets": [
{
"expr": "rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]) * 100",
"legendFormat": "{{container}}"
}
],
"fieldConfig": {
"defaults": {"unit": "percent"}
}
},
{
"id": 31,
"title": "Memory Usage by Container",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
"targets": [
{
"expr": "container_memory_usage_bytes{container!=\"\"} / 1024 / 1024",
"legendFormat": "{{container}}"
}
],
"fieldConfig": {
"defaults": {"unit": "decmbytes"}
}
}
]
}
}
```
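Rather than importing this JSON by hand, the dashboard can be file-provisioned so it survives container rebuilds and stays in version control. A minimal sketch of a provisioning provider, assuming the JSON above is saved as `pos-overview.json` in the mounted dashboards directory (paths and folder name are illustrative):

```yaml
# File: /etc/grafana/provisioning/dashboards/pos.yml (illustrative)
apiVersion: 1
providers:
  - name: 'pos-dashboards'
    orgId: 1
    folder: 'POS Platform'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      # Directory containing pos-overview.json and other dashboard JSON files
      path: /var/lib/grafana/dashboards
```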
## Incident Response Runbooks

### Runbook: API Down (P1)

````markdown
# Runbook: API Down

**Alert**: APIDown
**Severity**: P1 (Critical)
**Impact**: Customers cannot complete transactions

## Symptoms
- Health check endpoint returning non-200
- Load balancer showing unhealthy targets
- Transaction error rate spike

## Immediate Actions (First 5 minutes)
1. **Verify the alert**
   ```bash
   curl -s http://pos-api:8080/health | jq
   docker ps | grep pos-api
   ```
2. **Check container logs**
   ```bash
   docker logs pos-api-1 --tail 100
   docker logs pos-api-2 --tail 100
   docker logs pos-api-3 --tail 100
   ```
3. **Check resource usage**
   ```bash
   docker stats --no-stream
   ```
4. **Restart unhealthy containers**
   ```bash
   docker restart pos-api-1   # Replace with affected container
   ```

## Escalation
- If all containers down: Page Infrastructure Lead
- If database issue: Page Database Team
- If network issue: Page Network Team

## Resolution Checklist
- Identify root cause
- Apply fix (restart, rollback, config change)
- Verify health checks passing
- Monitor for 15 minutes
- Update incident ticket
- Schedule postmortem if major outage

## Common Causes
| Cause | Solution |
|---|---|
| OOM (Out of Memory) | Restart, investigate memory leak |
| Database connection failure | Check DB health, restart connections |
| Deployment failure | Rollback to previous version |
| Network partition | Check network, restart networking |
````
### Runbook: High Error Rate (P1)
````markdown
# Runbook: High Error Rate

**Alert**: HighErrorRate
**Severity**: P1 (Critical)
**Impact**: Significant portion of requests failing

## Symptoms
- 5xx error rate > 5%
- Customer complaints about failures
- Transaction success rate dropping

## Immediate Actions
1. **Identify error patterns**
   ```bash
   # Check recent errors in logs
   docker logs pos-api-1 2>&1 | grep -i error | tail -50

   # Query Loki for error patterns (LogQL -- run in Grafana Explore):
   # {job="pos-api"} |= "error" | json | line_format "{{.message}}"
   ```
2. **Check which endpoints are failing**
   ```promql
   # In Grafana/Prometheus
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint, status)
   ```
3. **Check dependent services**
   ```bash
   # Database
   docker exec pos-postgres-primary pg_isready
   # Redis
   docker exec pos-redis redis-cli ping
   # RabbitMQ
   curl -u admin:password http://localhost:15672/api/healthchecks/node
   ```

## Root Cause Investigation
| Error Pattern | Likely Cause | Solution |
|---|---|---|
| 500 on /api/transactions | Database timeout | Check DB connections |
| 503 across all endpoints | Overload | Scale up or rate limit |
| 502 from nginx | Container crash | Restart containers |
| Timeout errors | Slow DB queries | Kill long queries, add indexes |

## Recovery Steps
- If DB issue: Restart connection pool
- If overload: Enable aggressive rate limiting
- If code bug: Rollback deployment
- If external dependency: Enable circuit breaker
````
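While remediation from either runbook is in progress, the firing alert can be silenced so the 4-hour `repeat_interval` configured above does not keep re-paging. A sketch using amtool; the author and comment text are illustrative:

```bash
# Silence APIDown for 2 hours while the fix is rolled out
amtool silence add alertname=APIDown \
  --alertmanager.url=http://alertmanager:9093 \
  --author="oncall" \
  --duration=2h \
  --comment="Restarting pos-api instances; tracked in the incident ticket"
```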
---
## Summary
This chapter provides complete monitoring coverage:
1. **Architecture**: Prometheus + Grafana + AlertManager stack
2. **Metrics**: Business SLIs and infrastructure metrics with thresholds
3. **Prometheus Config**: Complete scrape configuration
4. **Alert Rules**: P1-P4 severity levels with escalation
5. **Grafana Dashboard**: Production-ready JSON dashboard
6. **Runbooks**: Step-by-step incident response procedures
**Next Chapter**: [Chapter 31: Security Compliance](./Chapter-31-Security-Compliance.md)
---
*"You cannot improve what you do not measure."*