# Chapter 25: Monitoring and Alerting

## 25.1 Overview

This chapter defines the complete monitoring architecture for the POS Platform, including metrics collection, dashboards, alerting rules, and incident response procedures.


## 25.2 Monitoring Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              MONITORING STACK                                        │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   POS-API-1     │     │   POS-API-2     │     │   POS-API-3     │
│                 │     │                 │     │                 │
│ /metrics:8080   │     │ /metrics:8080   │     │ /metrics:8080   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
              ┌──────────────────────────────────────┐
              │           PROMETHEUS                 │
              │          (Metrics Store)             │
              │                                      │
              │  - Scrape interval: 15s              │
              │  - Retention: 15 days                │
              │  - Port: 9090                        │
              └──────────────────┬───────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│    GRAFANA      │  │  ALERTMANAGER   │  │   LOKI          │
│  (Dashboards)   │  │    (Alerts)     │  │   (Logs)        │
│                 │  │                 │  │                 │
│  Port: 3000     │  │  Port: 9093     │  │  Port: 3100     │
└─────────────────┘  └────────┬────────┘  └─────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
         ┌────────┐     ┌────────┐     ┌───────────┐
         │ Slack  │     │ Email  │     │ PagerDuty │
         └────────┘     └────────┘     └───────────┘

## 25.3 Key Metrics

### Business SLIs (Service Level Indicators)

| Metric | Description | Target | Alert Threshold |
|--------|-------------|--------|-----------------|
| Transaction Success Rate | % of transactions completed successfully | > 99.9% | < 99.5% |
| Avg Transaction Time | End-to-end transaction processing | < 2s | > 5s |
| Payment Success Rate | % of payments processed successfully | > 99.5% | < 99% |
| Order Fulfillment Rate | Orders fulfilled within SLA | > 98% | < 95% |
| API Availability | Uptime of API endpoints | > 99.9% | < 99.5% |

### Infrastructure Metrics

| Category | Metric | Warning | Critical |
|----------|--------|---------|----------|
| CPU | Usage % | > 70% | > 90% |
| Memory | Usage % | > 75% | > 90% |
| Disk | Usage % | > 70% | > 85% |
| Disk | I/O Wait | > 20% | > 40% |
| Network | Packet Loss | > 0.1% | > 1% |
| Network | Latency (ms) | > 100ms | > 500ms |

### Application Metrics

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| Error Rate | 5xx errors per minute | > 1% | > 5% |
| Response Time (p99) | 99th percentile latency | > 500ms | > 2000ms |
| Response Time (p50) | Median latency | > 100ms | > 500ms |
| Request Rate | Requests per second | N/A (baseline) | > 200% of baseline |
| Queue Depth | Messages waiting in RabbitMQ | > 1000 | > 5000 |
| Active Connections | DB connections in use | > 80% of pool | > 95% of pool |
| Cache Hit Rate | Redis cache effectiveness | < 80% | < 60% |
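
These targets translate directly into an error budget. A minimal sketch of the arithmetic, assuming a 30-day budget window (the chapter does not fix one):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

# 99.9% availability over 30 days allows roughly 43.2 minutes of downtime
print(round(error_budget_minutes(99.9), 1))
```

At 99.9% the platform can be fully unavailable for only about 43 minutes a month, which is why the alert thresholds sit slightly below each target: they trip while budget remains.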

## 25.4 Prometheus Configuration

### Complete prometheus.yml

# File: /pos-platform/monitoring/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pos-production'
    environment: 'production'

#=============================================
# ALERTING CONFIGURATION
#=============================================
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

#=============================================
# RULE FILES
#=============================================
rule_files:
  - "/etc/prometheus/rules/*.yml"

#=============================================
# SCRAPE CONFIGURATIONS
#=============================================
scrape_configs:
  #-----------------------------------------
  # Prometheus Self-Monitoring
  #-----------------------------------------
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  #-----------------------------------------
  # POS API Instances
  #-----------------------------------------
  - job_name: 'pos-api'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pos-api-1:8080'
          - 'pos-api-2:8080'
          - 'pos-api-3:8080'
        labels:
          app: 'pos-api'
          tier: 'backend'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  #-----------------------------------------
  # PostgreSQL Exporter
  #-----------------------------------------
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          app: 'postgres'
          tier: 'database'

  #-----------------------------------------
  # Redis Exporter
  #-----------------------------------------
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          app: 'redis'
          tier: 'cache'

  #-----------------------------------------
  # RabbitMQ Exporter
  #-----------------------------------------
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          app: 'rabbitmq'
          tier: 'messaging'

  #-----------------------------------------
  # Nginx Exporter
  #-----------------------------------------
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          app: 'nginx'
          tier: 'ingress'

  #-----------------------------------------
  # Node Exporter (Host Metrics)
  #-----------------------------------------
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          tier: 'infrastructure'

  #-----------------------------------------
  # Docker Container Metrics
  #-----------------------------------------
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          tier: 'containers'

## 25.5 Alert Rules

### Complete Alert Rules Configuration

# File: /pos-platform/monitoring/prometheus/rules/alerts.yml

groups:
  #=============================================
  # P1 - CRITICAL (Page immediately)
  #=============================================
  - name: critical_alerts
    rules:
      #-----------------------------------------
      # API Down
      #-----------------------------------------
      - alert: APIDown
        expr: up{job="pos-api"} == 0
        for: 1m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "POS API instance {{ $labels.instance }} is down"
          description: "API instance has been unreachable for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/api-down"

      #-----------------------------------------
      # Database Down
      #-----------------------------------------
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database connection failed for 30 seconds"
          runbook_url: "https://wiki.internal/runbooks/db-down"

      #-----------------------------------------
      # High Error Rate
      #-----------------------------------------
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 2m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "High error rate detected: {{ $value | printf \"%.2f\" }}%"
          description: "Error rate exceeds 5% for more than 2 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      #-----------------------------------------
      # Transaction Failure Spike
      #-----------------------------------------
      - alert: TransactionFailureSpike
        expr: |
          (
            sum(rate(pos_transactions_failed_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "Transaction failure rate: {{ $value | printf \"%.2f\" }}%"
          description: "More than 1% of transactions are failing"
          runbook_url: "https://wiki.internal/runbooks/transaction-failures"

  #=============================================
  # P2 - HIGH (Page during business hours)
  #=============================================
  - name: high_alerts
    rules:
      #-----------------------------------------
      # High Response Time
      #-----------------------------------------
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "P99 response time is {{ $value | printf \"%.2f\" }}s"
          description: "99th percentile latency exceeds 2 seconds"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      #-----------------------------------------
      # Database Connection Pool Exhaustion
      #-----------------------------------------
      - alert: DBConnectionPoolLow
        expr: |
          sum(pg_stat_activity_count) / max(pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "DB connection pool at {{ $value | printf \"%.0f\" }}%"
          description: "Database connections nearly exhausted"
          runbook_url: "https://wiki.internal/runbooks/db-connections"

      #-----------------------------------------
      # Queue Backlog
      #-----------------------------------------
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages > 5000
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "Message queue backlog: {{ $value }} messages"
          description: "RabbitMQ queue has significant backlog"
          runbook_url: "https://wiki.internal/runbooks/queue-backlog"

      #-----------------------------------------
      # Memory Pressure
      #-----------------------------------------
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: P2
          team: infrastructure
        annotations:
          summary: "Memory usage at {{ $value | printf \"%.0f\" }}%"
          description: "System memory is critically low"
          runbook_url: "https://wiki.internal/runbooks/memory-pressure"

  #=============================================
  # P3 - MEDIUM (Email/Slack notification)
  #=============================================
  - name: medium_alerts
    rules:
      #-----------------------------------------
      # CPU Warning
      #-----------------------------------------
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 15m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "CPU usage at {{ $value | printf \"%.0f\" }}%"
          description: "CPU usage elevated for extended period"

      #-----------------------------------------
      # Disk Space Warning
      #-----------------------------------------
      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 70
        for: 30m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "Disk usage at {{ $value | printf \"%.0f\" }}% on {{ $labels.mountpoint }}"
          description: "Disk space running low"

      #-----------------------------------------
      # Cache Hit Rate Low
      #-----------------------------------------
      - alert: CacheHitRateLow
        expr: |
          rate(redis_keyspace_hits_total[5m]) /
          (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100 < 80
        for: 30m
        labels:
          severity: P3
          team: platform
        annotations:
          summary: "Cache hit rate: {{ $value | printf \"%.0f\" }}%"
          description: "Redis cache effectiveness is low"

  #=============================================
  # P4 - LOW (Log/Dashboard only)
  #=============================================
  - name: low_alerts
    rules:
      #-----------------------------------------
      # SSL Certificate Expiry
      #-----------------------------------------
      - alert: SSLCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "SSL cert expires in {{ $value | printf \"%.0f\" }} days"
          description: "Certificate renewal needed soon"

      #-----------------------------------------
      # Container Restarts
      #-----------------------------------------
      - alert: ContainerRestarts
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times"
          description: "Container may be unstable"
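
A note on the `for:` clauses used throughout: the expression must hold on every evaluation (each 15s, per `evaluation_interval`) for the whole duration before the alert moves from pending to firing, and a single healthy evaluation resets it. A simplified sketch of that state machine (real Prometheus tracks wall-clock timestamps and alert resolution, which this omits):

```python
def simulate_alert(samples, for_seconds, eval_interval=15):
    """Return per-evaluation state: 'inactive', 'pending', or 'firing'."""
    states, true_since = [], None
    for i, breached in enumerate(samples):
        now = i * eval_interval
        if not breached:
            true_since = None          # any healthy evaluation resets the clock
            states.append('inactive')
        else:
            true_since = now if true_since is None else true_since
            states.append('firing' if now - true_since >= for_seconds else 'pending')
    return states

# One-minute 'for' with 15s evaluations: the fifth consecutive breach fires
print(simulate_alert([True] * 5 + [False], for_seconds=60))
```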

## 25.6 AlertManager Configuration

# File: /pos-platform/monitoring/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@pos-platform.com'
  smtp_auth_username: 'alerts@pos-platform.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

  slack_api_url: '${SLACK_WEBHOOK_URL}'

  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

#=============================================
# ROUTING
#=============================================
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

  routes:
    #-----------------------------------------
    # P1 - Critical: Page immediately
    #-----------------------------------------
    - match:
        severity: P1
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: P1
      receiver: 'slack-critical'
      continue: true

    #-----------------------------------------
    # P2 - High: Page during business hours
    #-----------------------------------------
    - match:
        severity: P2
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours
      continue: true
    - match:
        severity: P2
      receiver: 'slack-high'

    #-----------------------------------------
    # P3 - Medium: Slack + Email
    #-----------------------------------------
    - match:
        severity: P3
      receiver: 'slack-medium'
      continue: true
    - match:
        severity: P3
      receiver: 'email-team'

    #-----------------------------------------
    # P4 - Low: Slack only
    #-----------------------------------------
    - match:
        severity: P4
      receiver: 'slack-low'

#=============================================
# TIME INTERVALS
#=============================================
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

#=============================================
# RECEIVERS
#=============================================
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: error

  - name: 'slack-critical'
    slack_configs:
      - channel: '#pos-critical'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.internal/d/pos-overview'

  - name: 'slack-high'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
        color: 'warning'

  - name: 'slack-medium'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'slack-low'
    slack_configs:
      - channel: '#pos-info'
        send_resolved: false

  - name: 'email-team'
    email_configs:
      - to: 'platform-team@company.com'
        send_resolved: true
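
Under the routing tree above, one alert can reach several receivers because of `continue: true`: a P1 pages PagerDuty and posts to `#pos-critical`, while a P2 outside business hours skips the pager and lands only in `#pos-alerts`. A hypothetical sketch of that matching logic (it treats a route muted by `active_time_intervals` as skipped, which is the net effect here, and omits grouping and timing entirely):

```python
# Simplified routes: (severity, receiver, continue_matching, business_hours_only)
ROUTES = [
    ('P1', 'pagerduty-critical', True,  False),
    ('P1', 'slack-critical',     True,  False),
    ('P2', 'pagerduty-high',     True,  True),
    ('P2', 'slack-high',         False, False),
    ('P3', 'slack-medium',       True,  False),
    ('P3', 'email-team',         False, False),
    ('P4', 'slack-low',          False, False),
]

def receivers_for(severity, business_hours):
    matched = []
    for sev, receiver, cont, bh_only in ROUTES:
        if sev != severity or (bh_only and not business_hours):
            continue
        matched.append(receiver)
        if not cont:               # no 'continue: true' -> stop matching
            break
    return matched or ['default-receiver']

print(receivers_for('P2', business_hours=False))  # -> ['slack-high']
```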

## 25.7 Grafana Dashboard

### POS Platform Overview Dashboard (JSON)

{
  "dashboard": {
    "id": null,
    "uid": "pos-overview",
    "title": "POS Platform Overview",
    "tags": ["pos", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Transaction Success Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(pos_transactions_success_total[5m])) / sum(rate(pos_transactions_total[5m]))) * 100",
            "legendFormat": "Success Rate"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 99},
                {"color": "green", "value": 99.5}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Requests per Second",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))",
            "legendFormat": "RPS"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "id": 3,
        "title": "P99 Response Time",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.5},
                {"color": "red", "value": 2}
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Error Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Errors"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      },
      {
        "id": 5,
        "title": "Active Transactions",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
        "targets": [
          {
            "expr": "pos_transactions_in_progress",
            "legendFormat": "Active"
          }
        ]
      },
      {
        "id": 6,
        "title": "API Health",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
        "targets": [
          {
            "expr": "sum(up{job=\"pos-api\"})",
            "legendFormat": "Healthy Instances"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 2},
                {"color": "green", "value": 3}
              ]
            }
          }
        }
      },
      {
        "id": 10,
        "title": "Request Rate by Endpoint",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "id": 11,
        "title": "Response Time Distribution",
        "type": "heatmap",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
        "targets": [
          {
            "expr": "sum(increase(http_request_duration_seconds_bucket[1m])) by (le)",
            "legendFormat": "{{le}}"
          }
        ]
      },
      {
        "id": 20,
        "title": "Database Connections",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 0, "y": 12},
        "targets": [
          {
            "expr": "pg_stat_activity_count",
            "legendFormat": "Active"
          },
          {
            "expr": "pg_settings_max_connections",
            "legendFormat": "Max"
          }
        ]
      },
      {
        "id": 21,
        "title": "Redis Operations",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 8, "y": 12},
        "targets": [
          {
            "expr": "rate(redis_commands_processed_total[1m])",
            "legendFormat": "Commands/sec"
          }
        ]
      },
      {
        "id": 22,
        "title": "Queue Depth",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 16, "y": 12},
        "targets": [
          {
            "expr": "rabbitmq_queue_messages",
            "legendFormat": "{{queue}}"
          }
        ]
      },
      {
        "id": 30,
        "title": "CPU Usage by Container",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]) * 100",
            "legendFormat": "{{container}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "percent"}
        }
      },
      {
        "id": 31,
        "title": "Memory Usage by Container",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
        "targets": [
          {
            "expr": "container_memory_usage_bytes{container!=\"\"} / 1024 / 1024",
            "legendFormat": "{{container}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "decmbytes"}
        }
      }
    ]
  }
}
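
The P99 stat panel leans on PromQL's `histogram_quantile`, which picks the first cumulative bucket containing the target rank and interpolates linearly inside it. A rough Python sketch of that estimation, assuming cumulative `le` buckets as the exporter publishes them:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count); last bound is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation inside the chosen bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.1, 80), (0.5, 95), (2.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.99, buckets))  # -> 2.0 (the 99th observation sits at the 2.0s bound)
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.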

## 25.8 Incident Response Runbooks

### Runbook: API Down (P1)

**Alert**: APIDown
**Severity**: P1 (Critical)
**Impact**: Customers cannot complete transactions

#### Symptoms

- Health check endpoint returning non-200
- Load balancer showing unhealthy targets
- Transaction error rate spike

#### Immediate Actions (First 5 minutes)

1. **Verify the alert**
   ```bash
   curl -s http://pos-api:8080/health | jq
   docker ps | grep pos-api
   ```

2. **Check container logs**
   ```bash
   docker logs pos-api-1 --tail 100
   docker logs pos-api-2 --tail 100
   docker logs pos-api-3 --tail 100
   ```

3. **Check resource usage**
   ```bash
   docker stats --no-stream
   ```

4. **Restart unhealthy containers**
   ```bash
   docker restart pos-api-1  # Replace with affected container
   ```

#### Escalation

- If all containers down: page Infrastructure Lead
- If database issue: page Database Team
- If network issue: page Network Team

#### Resolution Checklist

- Identify root cause
- Apply fix (restart, rollback, config change)
- Verify health checks passing
- Monitor for 15 minutes
- Update incident ticket
- Schedule postmortem if major outage

#### Common Causes

| Cause | Solution |
|-------|----------|
| OOM (Out of Memory) | Restart, investigate memory leak |
| Database connection failure | Check DB health, restart connections |
| Deployment failure | Rollback to previous version |
| Network partition | Check network, restart networking |

### Runbook: High Error Rate (P1)

**Alert**: HighErrorRate
**Severity**: P1 (Critical)
**Impact**: Significant portion of requests failing

#### Symptoms

- 5xx error rate > 5%
- Customer complaints about failures
- Transaction success rate dropping

#### Immediate Actions

1. **Identify error patterns**
   ```bash
   # Check recent errors in logs
   docker logs pos-api-1 2>&1 | grep -i error | tail -50
   ```
   ```logql
   # Query Loki for error patterns
   {job="pos-api"} |= "error" | json | line_format "{{.message}}"
   ```

2. **Check which endpoints are failing**
   ```promql
   # In Grafana/Prometheus
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint, status)
   ```

3. **Check dependent services**
   ```bash
   # Database
   docker exec pos-postgres-primary pg_isready

   # Redis
   docker exec pos-redis redis-cli ping

   # RabbitMQ
   curl -u admin:password http://localhost:15672/api/healthchecks/node
   ```

#### Root Cause Investigation

| Error Pattern | Likely Cause | Solution |
|---------------|--------------|----------|
| 500 on /api/transactions | Database timeout | Check DB connections |
| 503 across all endpoints | Overload | Scale up or rate limit |
| 502 from nginx | Container crash | Restart containers |
| Timeout errors | Slow DB queries | Kill long queries, add indexes |

#### Recovery Steps

1. If DB issue: Restart connection pool
2. If overload: Enable aggressive rate limiting
3. If code bug: Rollback deployment
4. If external dependency: Enable circuit breaker
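
The Root Cause Investigation table is effectively a symptom-to-action lookup. As an illustration (not part of the platform), it could be encoded for a triage script like this, with the more specific status codes checked first:

```python
# (symptom substring, likely cause, first action) taken from the runbook table;
# order matters: specific codes before the generic 'timeout' pattern
TRIAGE = [
    ('503',     'Overload',         'Scale up or rate limit'),
    ('502',     'Container crash',  'Restart containers'),
    ('500',     'Database timeout', 'Check DB connections'),
    ('timeout', 'Slow DB queries',  'Kill long queries, add indexes'),
]

def triage(symptom: str):
    for pattern, cause, action in TRIAGE:
        if pattern.lower() in symptom.lower():
            return cause, action
    return 'Unknown', 'Escalate per runbook'

print(triage('nginx returning 502'))  # -> ('Container crash', 'Restart containers')
```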

---

## 25.18 OpenTelemetry Integration

### Overview

The monitoring stack is enhanced with OpenTelemetry (OTel) for comprehensive observability that prevents vendor lock-in and enables "Trace-to-Code" root cause analysis.

### Primary Pattern

| Attribute | Selection |
|-----------|-----------|
| **Pattern** | OpenTelemetry "Trace-to-Code" Pipeline |
| **Rationale** | Industry-standard protocol; trace errors from store terminal directly to source code line |
| **Vendor Lock-in** | None - OTel is open standard |

### Technology Stack (The "LGTM" Stack)

┌──────────────────────────────────────────────────┐
│                  THE LGTM STACK                  │
├──────────────────────────────────────────────────┤
│  L = Loki       (Log Aggregation)                │
│  G = Grafana    (Visualization & Dashboards)     │
│  T = Tempo      (Distributed Tracing)            │
│  M = Prometheus (Metrics Collection)             │
│                 ← Already configured             │
└──────────────────────────────────────────────────┘


| Component | Tool | Purpose | Port |
|-----------|------|---------|------|
| **L** - Logs | Loki | Log aggregation, search | 3100 |
| **G** - Grafana | Grafana | Unified dashboards | 3000 |
| **T** - Traces | Tempo (or Jaeger) | Distributed tracing | 4317 (OTLP), 16686 (UI) |
| **M** - Metrics | Prometheus | Metrics collection | 9090 |

### Docker Compose Addition

```yaml
# Add to docker-compose.monitoring.yml

services:
  # ... existing prometheus, grafana, alertmanager ...

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.9.0
    container_name: pos-loki
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring

  # Tempo - Distributed Tracing
  tempo:
    image: grafana/tempo:2.3.0
    container_name: pos-tempo
    ports:
      - "3200:3200"    # Tempo query
    # Tempo receives OTLP from the otel-collector over the internal
    # network; publishing 4317/4318 here as well would collide with
    # the collector's identical host port mappings below.
    volumes:
      - tempo_data:/var/tempo
      - ./tempo/tempo-config.yml:/etc/tempo/tempo.yaml
    command: -config.file=/etc/tempo/tempo.yaml
    networks:
      - monitoring

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.89.0
    container_name: pos-otel-collector
    ports:
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
      - "8888:8888"    # Prometheus metrics
    volumes:
      - ./otel/otel-collector-config.yml:/etc/otel/config.yaml
    command: --config=/etc/otel/config.yaml
    networks:
      - monitoring

volumes:
  loki_data:
  tempo_data:
```

### OpenTelemetry Collector Configuration

# monitoring/otel/otel-collector-config.yml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  # Send traces to Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Send metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel

  # Send logs to Loki (recent collector releases removed the inline
  # `labels:` block from this exporter; label mapping is done via
  # `loki.resource.labels` attribute hints instead)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
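
The `batch` processor above flushes a pipeline's data when either `send_batch_size` items accumulate or `timeout` elapses since the batch was opened, whichever comes first. A toy single-threaded sketch of that flush rule (the real processor is concurrent and also flushes on shutdown):

```python
def batch_flushes(events, batch_size=1024, timeout=10.0):
    """events: list of (timestamp, n_items); returns flush timestamps."""
    flushes, buf, opened = [], 0, None
    for ts, n in events:
        # A pending timeout flush happens before accepting the new items
        if opened is not None and ts - opened >= timeout:
            flushes.append(opened + timeout)
            buf, opened = 0, None
        if opened is None:
            opened = ts
        buf += n
        if buf >= batch_size:        # size-based flush
            flushes.append(ts)
            buf, opened = 0, None
    if opened is not None:           # leftover batch eventually times out
        flushes.append(opened + timeout)
    return flushes

# 600 spans at t=0 plus 600 at t=1 trigger a size flush at t=1;
# the 10 stragglers at t=2 go out on the timeout at t=12
print(batch_flushes([(0, 600), (1, 600), (2, 10)]))
```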

### .NET Application Instrumentation

// Program.cs - Add OpenTelemetry instrumentation

using System.Reflection;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Define resource attributes
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "pos-api",
        serviceVersion: Assembly.GetExecutingAssembly().GetName().Version?.ToString() ?? "1.0.0",
        serviceInstanceId: Environment.MachineName)
    .AddAttributes(new Dictionary<string, object>
    {
        ["deployment.environment"] = builder.Environment.EnvironmentName,
        ["tenant.id"] = "dynamic"  // Set per-request
    });

// Configure OpenTelemetry Tracing
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(resourceBuilder)
        .AddSource("PosPlatform.*")
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException = true;
            options.EnrichWithHttpRequest = (activity, request) =>
            {
                activity.SetTag("tenant.id", request.Headers["X-Tenant-Id"].FirstOrDefault());
            };
        })
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }))
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddPrometheusExporter()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }));

// Configure OpenTelemetry Logging
builder.Logging.AddOpenTelemetry(logging => logging
    .SetResourceBuilder(resourceBuilder)
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://otel-collector:4317");
    }));
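Note that `AddPrometheusExporter()` only registers the exporter; with the `OpenTelemetry.Exporter.Prometheus.AspNetCore` package, the scrape endpoint must still be mapped when building the app. A minimal sketch:

// Expose the Prometheus scrape endpoint (defaults to /metrics, matching
// the /metrics:8080 targets in the architecture diagram).
var app = builder.Build();

app.MapControllers();
app.MapPrometheusScrapingEndpoint();

app.Run();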

Custom Span Example (Trace-to-Code)

// SaleService.cs - Custom tracing for business operations

public class SaleService
{
    private static readonly ActivitySource ActivitySource = new("PosPlatform.Sales");
    private readonly ILogger<SaleService> _logger;

    public SaleService(ILogger<SaleService> logger) => _logger = logger;

    public async Task<Sale> CreateSaleAsync(CreateSaleCommand command)
    {
        // Create custom span with source code reference
        using var activity = ActivitySource.StartActivity(
            "CreateSale",
            ActivityKind.Internal,
            Activity.Current?.Context ?? default);

        activity?.SetTag("sale.location_id", command.LocationId);
        activity?.SetTag("sale.line_items_count", command.LineItems.Count);
        // Code-location tags enable trace-to-code navigation in Grafana.
        // Hardcoded here for illustration; in practice, populate them via
        // [CallerFilePath]/[CallerLineNumber] or a shared helper.
        activity?.SetTag("code.filepath", "SaleService.cs");
        activity?.SetTag("code.lineno", 25);
        activity?.SetTag("code.function", "CreateSaleAsync");

        try
        {
            // Business logic
            var sale = await ProcessSale(command);

            activity?.SetTag("sale.id", sale.Id);
            activity?.SetTag("sale.total", sale.Total);
            activity?.SetStatus(ActivityStatusCode.Ok);

            return sale;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            _logger.LogError(ex, "Failed to create sale for location {LocationId}", command.LocationId);
            throw;
        }
    }
}

Trace-to-Code Dashboard Query

# Grafana Tempo query - Find traces with errors from specific store
{
  resource.service.name = "pos-api" &&
  span.tenant.id = "NEXUS" &&
  status = error
}
| select(
    traceDuration,
    resource.service.name,
    span.code.filepath,
    span.code.lineno,
    span.code.function,
    statusMessage
)

Observability Overload Mitigation

To prevent alert fatigue and noise:

Strategy      Implementation
───────────   ───────────────────────────────────────────────
Sampling      Sample 10% of successful traces, 100% of errors
Aggregation   Batch traces before export (10s window)
Filtering     Exclude health check endpoints from tracing
Retention     Keep raw traces 7 days, aggregates 30 days
# Sampling configuration in the OTel Collector. The two processors below
# are alternative approaches (head-based vs. tail-based); Section 25.19
# covers the full tail-sampling policy set.
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces

  tail_sampling:
    policies:
      - name: always-sample-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-successful
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

Grafana Data Source Configuration

# grafana/provisioning/datasources/datasources.yml

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    url: http://loki:3100

  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
      tracesToMetrics:
        datasourceUid: prometheus
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true

Correlating Traces, Logs, and Metrics

With the LGTM stack (Loki for logs, Grafana, Tempo for traces, and Prometheus/Mimir for metrics), you can jump between correlated signals:

+------------------------------------------------------------------+
|                   OBSERVABILITY CORRELATION                       |
+------------------------------------------------------------------+
|                                                                   |
|  TRACE (Tempo)                                                    |
|  ┌────────────────────────────────────────────────────────────┐  |
|  │ TraceID: abc123                                             │  |
|  │ Span: CreateSale (45ms)                                     │  |
|  │   └─ Span: ValidateInventory (12ms)                         │  |
|  │   └─ Span: ProcessPayment (28ms) [ERROR]                    │  |
|  │         └─ code.filepath: PaymentService.cs:142             │  |
|  └────────────────────────────────────────────────────────────┘  |
|              │                                                    |
|              │ Click "Logs for this span"                        |
|              ▼                                                    |
|  LOGS (Loki)                                                      |
|  ┌────────────────────────────────────────────────────────────┐  |
|  │ 2026-01-24 10:15:32 ERROR Payment declined: Insufficient   │  |
|  │ 2026-01-24 10:15:32 INFO  Rolling back transaction abc123  │  |
|  └────────────────────────────────────────────────────────────┘  |
|              │                                                    |
|              │ Click "Metrics for this time"                     |
|              ▼                                                    |
|  METRICS (Prometheus)                                             |
|  ┌────────────────────────────────────────────────────────────┐  |
|  │ payment_failures_total{reason="insufficient_funds"} = 47   │  |
|  │ payment_latency_p99 = 2.3s                                  │  |
|  └────────────────────────────────────────────────────────────┘  |
|                                                                   |
+------------------------------------------------------------------+
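The "Logs for this span" jump in the diagram works via the Tempo data source's `tracesToLogs` setting; the reverse jump (log line to trace) requires a derived field on the Loki data source. A sketch, assuming JSON-formatted log lines that carry a `trace_id` field:

# grafana/provisioning/datasources/datasources.yml (Loki entry, extended)
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'  # assumes JSON log lines
          url: '$${__value.raw}'              # $$ escapes env-var expansion
          datasourceUid: tempo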

Reference

For the complete observability sampling strategy and its risk mitigations, see Section 25.19 below.


25.19 Observability Sampling Strategy

Overview

At production scale, collecting 100% of traces, metrics, and logs becomes prohibitively expensive. A thoughtful sampling strategy reduces costs while preserving visibility into errors and performance issues.

Attribute         Selection
───────────────   ─────────────────────────────────
Approach          Head-based + tail-based sampling
Error Retention   100% of errors sampled
Normal Traffic    1-10% sampled based on volume
Cost Target       < $500/month for LGTM stack

Sampling Strategy Matrix

+------------------------------------------------------------------+
|                    SAMPLING STRATEGY MATRIX                        |
+------------------------------------------------------------------+
|                                                                   |
|  SIGNAL TYPE       SAMPLE RATE    CONDITION                       |
|  ─────────────────────────────────────────────────────────────   |
|  Traces (errors)   100%           status_code >= 500 OR error=true|
|  Traces (slow)     100%           duration > 2s                   |
|  Traces (normal)   5%             All other traces                |
|  Traces (health)   0%             /health, /metrics endpoints     |
|                                                                   |
|  Metrics           100%           Always (cheap to store)         |
|  Metrics (custom)  Aggregated     Sum/avg over 15s window         |
|                                                                   |
|  Logs (ERROR+)     100%           severity >= ERROR               |
|  Logs (WARN)       50%            severity == WARN                |
|  Logs (INFO)       10%            severity == INFO                |
|  Logs (DEBUG)      0%             Production only; 100% in dev    |
|  Logs (health)     0%             Health check logs suppressed    |
|                                                                   |
+------------------------------------------------------------------+

Head-Based Sampling

Decision made at trace start. Simple but may miss errors that occur later in the trace.

// Program.cs - Head-based sampling configuration

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05))) // 5% sampling
        .AddAspNetCoreInstrumentation(options =>
        {
            // Always exclude health endpoints
            options.Filter = httpContext =>
                !httpContext.Request.Path.StartsWithSegments("/health") &&
                !httpContext.Request.Path.StartsWithSegments("/metrics");
        })
    );

Tail-Based Sampling

Decision made after the trace completes. Ensures all errors and slow requests are captured, at the cost of buffering spans in the Collector until the decision window closes.

# otel-collector-config.yaml

processors:
  # Tail-based sampling processor
  tail_sampling:
    decision_wait: 10s          # Wait for span completion
    num_traces: 100000          # Max traces in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Policy 1: Always sample errors (100%)
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Policy 2: Always sample slow requests (100%)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000    # > 2 seconds

      # Policy 3: Always sample payment operations (100%)
      - name: payments-policy
        type: string_attribute
        string_attribute:
          key: http.route
          values:
            - /api/v1/payments
            - /api/v1/refunds
          enabled_regex_matching: false

      # Policy 4: Sample normal traffic (5%)
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]

Log Sampling Configuration

# Loki pipeline configuration for log sampling

pipeline_stages:
  # Drop health check logs entirely
  - match:
      selector: '{job="pos-api"} |~ "GET /health"'
      action: drop

  # Drop metrics endpoint logs
  - match:
      selector: '{job="pos-api"} |~ "GET /metrics"'
      action: drop

  # Sample INFO logs at 10%
  - match:
      selector: '{level="info"}'
      stages:
        - sampling:
            rate: 0.1

  # Sample WARN logs at 50%
  - match:
      selector: '{level="warn"}'
      stages:
        - sampling:
            rate: 0.5

  # Keep 100% of ERROR and above
  - match:
      selector: '{level=~"error|fatal|critical"}'
      stages:
        - sampling:
            rate: 1.0

Application-Level Log Filtering

// Program.cs - Serilog with level-based filtering
// Requires: Serilog.AspNetCore, Serilog.Sinks.OpenTelemetry

using Serilog;
using Serilog.Events;
using Serilog.Filters;
using Serilog.Sinks.OpenTelemetry;

builder.Host.UseSerilog((context, config) =>
{
    config
        .MinimumLevel.Information()
        .MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
        .MinimumLevel.Override("Microsoft.EntityFrameworkCore", LogEventLevel.Warning)
        // Don't log health checks
        .Filter.ByExcluding(Matching.WithProperty<string>("RequestPath", p =>
            p.Contains("/health") || p.Contains("/metrics")))
        // Sample INFO logs in production
        .Filter.ByExcluding(e =>
            e.Level == LogEventLevel.Information &&
            context.HostingEnvironment.IsProduction() &&
            Random.Shared.NextDouble() > 0.1) // Keep 10%
        .WriteTo.Console()
        .WriteTo.OpenTelemetry(options =>
        {
            options.Endpoint = "http://otel-collector:4317";
            options.Protocol = OtlpProtocol.Grpc;
        });
});
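The `Random.Shared` filter above is simple but non-deterministic: the INFO logs it keeps will not necessarily belong to the traces the trace sampler keeps. A hedged alternative (the helper below is illustrative, not a Serilog API) is to hash the current trace ID, so log and trace sampling decisions stay aligned per request:

using System.Diagnostics;

// Illustrative helper: deterministic sampling keyed on the trace ID, so the
// ~10% of INFO logs that survive belong to a consistent subset of traces.
static class TraceAlignedSampling
{
    public static bool KeepInfoLog(double rate = 0.1)
    {
        var traceId = Activity.Current?.TraceId.ToString();
        if (string.IsNullOrEmpty(traceId)) return true; // no trace context: keep

        // FNV-1a hash of the trace ID, mapped onto [0, 1)
        uint h = 2166136261;
        foreach (char c in traceId) { h = (h ^ c) * 16777619; }
        return (h % 10_000) / 10_000.0 < rate;
    }
}

// Usage inside the Serilog configuration:
//   .Filter.ByExcluding(e =>
//       e.Level == LogEventLevel.Information && !TraceAlignedSampling.KeepInfoLog())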

Sampling Cost Analysis

+------------------------------------------------------------------+
|                    MONTHLY COST COMPARISON                         |
+------------------------------------------------------------------+
|                                                                   |
|  SCENARIO: 10 API instances, 1000 req/sec, 30-day retention       |
|                                                                   |
|  WITHOUT SAMPLING                    WITH SAMPLING                |
|  ─────────────────────────          ─────────────────────────    |
|  Traces:                             Traces:                      |
|    2.6B traces/month                   130M traces/month (5%)     |
|    Storage: ~2.6 TB                    Storage: ~130 GB           |
|    Cost: ~$2,000/month                 Cost: ~$100/month          |
|                                                                   |
|  Logs:                               Logs:                        |
|    5B log lines/month                  500M log lines (10% avg)   |
|    Storage: ~5 TB                      Storage: ~500 GB           |
|    Cost: ~$3,000/month                 Cost: ~$300/month          |
|                                                                   |
|  TOTAL: ~$5,000/month                TOTAL: ~$400/month           |
|  ─────────────────────────────────────────────────────────────   |
|  SAVINGS: 92% reduction with smart sampling                       |
|                                                                   |
+------------------------------------------------------------------+

Preserving Debug Capability

While sampling reduces volume, ensure debugging capability is preserved:

// Enable full sampling for specific requests via header

public class DynamicSamplingMiddleware
{
    private readonly RequestDelegate _next;

    public DynamicSamplingMiddleware(RequestDelegate next) => _next = next;
    public async Task InvokeAsync(HttpContext context)
    {
        // Check for debug header
        if (context.Request.Headers.TryGetValue("X-Force-Trace", out var forceTrace) &&
            forceTrace == "true")
        {
            // Tag the span so a tail-sampling policy can force-keep it.
            // (Tags alone do not change the head-sampling decision; the
            // Collector must have a matching policy.)
            Activity.Current?.SetTag("sampling.priority", 1);
            Activity.Current?.SetTag("debug.forced", "true");
        }

        await _next(context);
    }
}

// Usage: Add header to force sampling
// curl -H "X-Force-Trace: true" https://api.posplatform.io/api/v1/sales
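The middleware only tags the span; for the tag to take effect, the Collector's tail sampler needs a matching policy. A sketch, assuming the span carries a `debug.forced` attribute with the string value "true":

# otel-collector-config.yaml (additional tail_sampling policy, assumed names)
processors:
  tail_sampling:
    policies:
      # Always keep traces explicitly forced via the X-Force-Trace header
      - name: forced-debug-policy
        type: string_attribute
        string_attribute:
          key: debug.forced
          values: ["true"]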

Sampling Metrics

Monitor sampling effectiveness:

# prometheus/rules/sampling-rules.yml

groups:
  - name: sampling-metrics
    rules:
      - record: otel_traces_sampled_total
        expr: sum(rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]))

      - record: otel_traces_dropped_total
        expr: sum(rate(otelcol_processor_tail_sampling_count_traces_dropped[5m]))

      - record: otel_sampling_rate
        expr: |
          otel_traces_sampled_total / (otel_traces_sampled_total + otel_traces_dropped_total)

      - alert: SamplingRateTooLow
        expr: otel_sampling_rate < 0.01
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Trace sampling rate is below 1%"
          description: "Consider increasing sampling or checking for data loss"

      - alert: ErrorsNotSampled
        expr: |
          rate(http_server_requests_total{status=~"5.."}[5m]) >
          rate(otel_traces_sampled{has_error="true"}[5m]) * 1.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Errors may not be properly sampled"
          description: "More HTTP 5xx errors than sampled error traces"

Sampling Decision Flowchart

┌─────────────────────────────────────────────────────────────────┐
│                   SAMPLING DECISION FLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  New Request Arrives                                             │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────────┐                                             │
│  │ Health/Metrics  │──Yes──► DROP (0%)                           │
│  │ endpoint?       │                                             │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ X-Force-Trace   │──Yes──► SAMPLE (100%)                       │
│  │ header present? │                                             │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  [Request Processes...]                                          │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Error occurred? │──Yes──► SAMPLE (100%)                       │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Duration > 2s?  │──Yes──► SAMPLE (100%)                       │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Payment route?  │──Yes──► SAMPLE (100%)                       │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Random 5%?      │──Yes──► SAMPLE                              │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│         DROP                                                     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
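As a compact reference, the flowchart above can be sketched as a single decision function (the types and names below are illustrative, not part of the platform code):

// Illustrative encoding of the sampling decision flow above.
enum SamplingDecision { Drop, Sample }

record RequestInfo(
    string Path, bool ForceTrace, bool HasError,
    TimeSpan Duration, bool IsPaymentRoute);

static class SamplingFlow
{
    static readonly Random Rng = new();

    public static SamplingDecision Decide(RequestInfo r)
    {
        if (r.Path is "/health" or "/metrics")    return SamplingDecision.Drop;   // 0%
        if (r.ForceTrace)                         return SamplingDecision.Sample; // 100%
        if (r.HasError)                           return SamplingDecision.Sample; // 100%
        if (r.Duration > TimeSpan.FromSeconds(2)) return SamplingDecision.Sample; // 100%
        if (r.IsPaymentRoute)                     return SamplingDecision.Sample; // 100%
        return Rng.NextDouble() < 0.05
            ? SamplingDecision.Sample                                             // 5%
            : SamplingDecision.Drop;
    }
}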

25.20 Summary

This chapter provides complete monitoring coverage:

  1. Architecture: Prometheus + Grafana + AlertManager stack
  2. Metrics: Business SLIs and infrastructure metrics with thresholds
  3. Prometheus Config: Complete scrape configuration
  4. Alert Rules: P1-P4 severity levels with escalation
  5. Grafana Dashboard: Production-ready JSON dashboard
  6. Runbooks: Step-by-step incident response procedures
  7. Observability: OpenTelemetry traces, logs, and metrics with the LGTM stack and cost-aware sampling

Next Chapter: Chapter 26: Security Compliance


“You cannot improve what you do not measure.”


Document Information

Attribute   Value
─────────   ─────────────────
Version     5.0.0
Created     2025-12-29
Updated     2026-02-25
Author      Claude Code
Status      Active
Part        VII - Operations
Chapter     25 of 32

This chapter is part of the POS Blueprint Book. All content is self-contained.