# Chapter 25: Monitoring and Alerting

## 25.1 Overview

This chapter defines the complete monitoring architecture for the POS Platform, including metrics collection, dashboards, alerting rules, and incident response procedures.


## 25.2 Monitoring Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              MONITORING STACK                                        │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   POS-API-1     │     │   POS-API-2     │     │   POS-API-3     │
│                 │     │                 │     │                 │
│ /metrics:8080   │     │ /metrics:8080   │     │ /metrics:8080   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
              ┌──────────────────────────────────────┐
              │           PROMETHEUS                 │
              │          (Metrics Store)             │
              │                                      │
              │  - Scrape interval: 15s              │
              │  - Retention: 15 days                │
              │  - Port: 9090                        │
              └──────────────────┬───────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│    GRAFANA      │  │  ALERTMANAGER   │  │   LOKI          │
│  (Dashboards)   │  │    (Alerts)     │  │   (Logs)        │
│                 │  │                 │  │                 │
│  Port: 3000     │  │  Port: 9093     │  │  Port: 3100     │
└─────────────────┘  └────────┬────────┘  └─────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
         ┌────────┐     ┌────────┐     ┌───────────┐
         │ Slack  │     │ Email  │     │ PagerDuty │
         └────────┘     └────────┘     └───────────┘

## 25.3 Key Metrics

### Business SLIs (Service Level Indicators)

| Metric | Description | Target | Alert Threshold |
|--------|-------------|--------|-----------------|
| Transaction Success Rate | % of transactions completed successfully | > 99.9% | < 99.5% |
| Avg Transaction Time | End-to-end transaction processing | < 2s | > 5s |
| Payment Success Rate | % of payments processed successfully | > 99.5% | < 99% |
| Order Fulfillment Rate | Orders fulfilled within SLA | > 98% | < 95% |
| API Availability | Uptime of API endpoints | > 99.9% | < 99.5% |

### Infrastructure Metrics

| Category | Metric | Warning | Critical |
|----------|--------|---------|----------|
| CPU | Usage % | > 70% | > 90% |
| Memory | Usage % | > 75% | > 90% |
| Disk | Usage % | > 70% | > 85% |
| Disk | I/O Wait | > 20% | > 40% |
| Network | Packet Loss | > 0.1% | > 1% |
| Network | Latency (ms) | > 100ms | > 500ms |

### Application Metrics

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| Error Rate | 5xx errors per minute | > 1% | > 5% |
| Response Time (p99) | 99th percentile latency | > 500ms | > 2000ms |
| Response Time (p50) | Median latency | > 100ms | > 500ms |
| Request Rate | Requests per second | N/A (baseline) | > 200% of baseline |
| Queue Depth | Messages waiting in RabbitMQ | > 1000 | > 5000 |
| Active Connections | DB connections in use | > 80% of pool | > 95% of pool |
| Cache Hit Rate | Redis cache effectiveness | < 80% | < 60% |
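
These targets translate directly into an error budget. A minimal sketch of the arithmetic, assuming a 30-day budget window (the chapter does not fix one):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

# 99.9% availability over 30 days allows roughly 43.2 minutes of downtime
print(round(error_budget_minutes(99.9), 1))
```

At 99.9% the platform can be fully unavailable for only about 43 minutes a month, which is why the alert thresholds sit slightly below each target: they trip while budget remains.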

## 25.4 Prometheus Configuration

### Complete prometheus.yml

# File: /pos-platform/monitoring/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pos-production'
    environment: 'production'

#=============================================
# ALERTING CONFIGURATION
#=============================================
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

#=============================================
# RULE FILES
#=============================================
rule_files:
  - "/etc/prometheus/rules/*.yml"

#=============================================
# SCRAPE CONFIGURATIONS
#=============================================
scrape_configs:
  #-----------------------------------------
  # Prometheus Self-Monitoring
  #-----------------------------------------
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  #-----------------------------------------
  # POS API Instances
  #-----------------------------------------
  - job_name: 'pos-api'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pos-api-1:8080'
          - 'pos-api-2:8080'
          - 'pos-api-3:8080'
        labels:
          app: 'pos-api'
          tier: 'backend'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  #-----------------------------------------
  # PostgreSQL Exporter
  #-----------------------------------------
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          app: 'postgres'
          tier: 'database'

  #-----------------------------------------
  # Redis Exporter
  #-----------------------------------------
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          app: 'redis'
          tier: 'cache'

  #-----------------------------------------
  # RabbitMQ Exporter
  #-----------------------------------------
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          app: 'rabbitmq'
          tier: 'messaging'

  #-----------------------------------------
  # Nginx Exporter
  #-----------------------------------------
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          app: 'nginx'
          tier: 'ingress'

  #-----------------------------------------
  # Node Exporter (Host Metrics)
  #-----------------------------------------
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          tier: 'infrastructure'

  #-----------------------------------------
  # Docker Container Metrics
  #-----------------------------------------
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          tier: 'containers'

## 25.5 Alert Rules

### Complete Alert Rules Configuration

# File: /pos-platform/monitoring/prometheus/rules/alerts.yml

groups:
  #=============================================
  # P1 - CRITICAL (Page immediately)
  #=============================================
  - name: critical_alerts
    rules:
      #-----------------------------------------
      # API Down
      #-----------------------------------------
      - alert: APIDown
        expr: up{job="pos-api"} == 0
        for: 1m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "POS API instance {{ $labels.instance }} is down"
          description: "API instance has been unreachable for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/api-down"

      #-----------------------------------------
      # Database Down
      #-----------------------------------------
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database connection failed for 30 seconds"
          runbook_url: "https://wiki.internal/runbooks/db-down"

      #-----------------------------------------
      # High Error Rate
      #-----------------------------------------
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 2m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "High error rate detected: {{ $value | printf \"%.2f\" }}%"
          description: "Error rate exceeds 5% for more than 2 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      #-----------------------------------------
      # Transaction Failure Spike
      #-----------------------------------------
      - alert: TransactionFailureSpike
        expr: |
          (
            sum(rate(pos_transactions_failed_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "Transaction failure rate: {{ $value | printf \"%.2f\" }}%"
          description: "More than 1% of transactions are failing"
          runbook_url: "https://wiki.internal/runbooks/transaction-failures"

  #=============================================
  # P2 - HIGH (Page during business hours)
  #=============================================
  - name: high_alerts
    rules:
      #-----------------------------------------
      # High Response Time
      #-----------------------------------------
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "P99 response time is {{ $value | printf \"%.2f\" }}s"
          description: "99th percentile latency exceeds 2 seconds"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      #-----------------------------------------
      # Database Connection Pool Exhaustion
      #-----------------------------------------
      - alert: DBConnectionPoolLow
        expr: |
          sum(pg_stat_activity_count) / max(pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "DB connection pool at {{ $value | printf \"%.0f\" }}%"
          description: "Database connections nearly exhausted"
          runbook_url: "https://wiki.internal/runbooks/db-connections"

      #-----------------------------------------
      # Queue Backlog
      #-----------------------------------------
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages > 5000
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "Message queue backlog: {{ $value }} messages"
          description: "RabbitMQ queue has significant backlog"
          runbook_url: "https://wiki.internal/runbooks/queue-backlog"

      #-----------------------------------------
      # Memory Pressure
      #-----------------------------------------
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: P2
          team: infrastructure
        annotations:
          summary: "Memory usage at {{ $value | printf \"%.0f\" }}%"
          description: "System memory is critically low"
          runbook_url: "https://wiki.internal/runbooks/memory-pressure"

  #=============================================
  # P3 - MEDIUM (Email/Slack notification)
  #=============================================
  - name: medium_alerts
    rules:
      #-----------------------------------------
      # CPU Warning
      #-----------------------------------------
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 15m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "CPU usage at {{ $value | printf \"%.0f\" }}%"
          description: "CPU usage elevated for extended period"

      #-----------------------------------------
      # Disk Space Warning
      #-----------------------------------------
      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 70
        for: 30m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "Disk usage at {{ $value | printf \"%.0f\" }}% on {{ $labels.mountpoint }}"
          description: "Disk space running low"

      #-----------------------------------------
      # Cache Hit Rate Low
      #-----------------------------------------
      - alert: CacheHitRateLow
        expr: |
          rate(redis_keyspace_hits_total[5m]) /
          (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100 < 80
        for: 30m
        labels:
          severity: P3
          team: platform
        annotations:
          summary: "Cache hit rate: {{ $value | printf \"%.0f\" }}%"
          description: "Redis cache effectiveness is low"

  #=============================================
  # P4 - LOW (Log/Dashboard only)
  #=============================================
  - name: low_alerts
    rules:
      #-----------------------------------------
      # SSL Certificate Expiry
      #-----------------------------------------
      - alert: SSLCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "SSL cert expires in {{ $value | printf \"%.0f\" }} days"
          description: "Certificate renewal needed soon"

      #-----------------------------------------
      # Container Restarts
      #-----------------------------------------
      - alert: ContainerRestarts
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times"
          description: "Container may be unstable"
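
A note on the `for:` clauses used throughout: the expression must hold on every evaluation (each 15s, per `evaluation_interval`) for the whole duration before the alert moves from pending to firing, and a single healthy evaluation resets it. A simplified sketch of that state machine (real Prometheus tracks wall-clock timestamps and alert resolution, which this omits):

```python
def simulate_alert(samples, for_seconds, eval_interval=15):
    """Return per-evaluation state: 'inactive', 'pending', or 'firing'."""
    states, true_since = [], None
    for i, breached in enumerate(samples):
        now = i * eval_interval
        if not breached:
            true_since = None          # any healthy evaluation resets the clock
            states.append('inactive')
        else:
            true_since = now if true_since is None else true_since
            states.append('firing' if now - true_since >= for_seconds else 'pending')
    return states

# One-minute 'for' with 15s evaluations: the fifth consecutive breach fires
print(simulate_alert([True] * 5 + [False], for_seconds=60))
```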

## 25.6 AlertManager Configuration

# File: /pos-platform/monitoring/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@pos-platform.com'
  smtp_auth_username: 'alerts@pos-platform.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

  slack_api_url: '${SLACK_WEBHOOK_URL}'

  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

#=============================================
# ROUTING
#=============================================
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

  routes:
    #-----------------------------------------
    # P1 - Critical: Page immediately
    #-----------------------------------------
    - match:
        severity: P1
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: P1
      receiver: 'slack-critical'
      continue: true

    #-----------------------------------------
    # P2 - High: Page during business hours
    #-----------------------------------------
    - match:
        severity: P2
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours
      continue: true
    - match:
        severity: P2
      receiver: 'slack-high'

    #-----------------------------------------
    # P3 - Medium: Slack + Email
    #-----------------------------------------
    - match:
        severity: P3
      receiver: 'slack-medium'
      continue: true
    - match:
        severity: P3
      receiver: 'email-team'

    #-----------------------------------------
    # P4 - Low: Slack only
    #-----------------------------------------
    - match:
        severity: P4
      receiver: 'slack-low'

#=============================================
# TIME INTERVALS
#=============================================
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

#=============================================
# RECEIVERS
#=============================================
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: error

  - name: 'slack-critical'
    slack_configs:
      - channel: '#pos-critical'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.internal/d/pos-overview'

  - name: 'slack-high'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
        color: 'warning'

  - name: 'slack-medium'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true

  - name: 'slack-low'
    slack_configs:
      - channel: '#pos-info'
        send_resolved: false

  - name: 'email-team'
    email_configs:
      - to: 'platform-team@company.com'
        send_resolved: true
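
Under the routing tree above, one alert can reach several receivers because of `continue: true`: a P1 pages PagerDuty and posts to `#pos-critical`, while a P2 outside business hours skips the pager and lands only in `#pos-alerts`. A hypothetical sketch of that matching logic (it treats a route muted by `active_time_intervals` as skipped, which is the net effect here, and omits grouping and timing entirely):

```python
# Simplified routes: (severity, receiver, continue_matching, business_hours_only)
ROUTES = [
    ('P1', 'pagerduty-critical', True,  False),
    ('P1', 'slack-critical',     True,  False),
    ('P2', 'pagerduty-high',     True,  True),
    ('P2', 'slack-high',         False, False),
    ('P3', 'slack-medium',       True,  False),
    ('P3', 'email-team',         False, False),
    ('P4', 'slack-low',          False, False),
]

def receivers_for(severity, business_hours):
    matched = []
    for sev, receiver, cont, bh_only in ROUTES:
        if sev != severity or (bh_only and not business_hours):
            continue
        matched.append(receiver)
        if not cont:               # no 'continue: true' -> stop matching
            break
    return matched or ['default-receiver']

print(receivers_for('P2', business_hours=False))  # -> ['slack-high']
```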

## 25.7 Grafana Dashboard

### POS Platform Overview Dashboard (JSON)

{
  "dashboard": {
    "id": null,
    "uid": "pos-overview",
    "title": "POS Platform Overview",
    "tags": ["pos", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Transaction Success Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(pos_transactions_success_total[5m])) / sum(rate(pos_transactions_total[5m]))) * 100",
            "legendFormat": "Success Rate"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 99},
                {"color": "green", "value": 99.5}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Requests per Second",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))",
            "legendFormat": "RPS"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "id": 3,
        "title": "P99 Response Time",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.5},
                {"color": "red", "value": 2}
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Error Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Errors"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      },
      {
        "id": 5,
        "title": "Active Transactions",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
        "targets": [
          {
            "expr": "pos_transactions_in_progress",
            "legendFormat": "Active"
          }
        ]
      },
      {
        "id": 6,
        "title": "API Health",
        "type": "stat",
        "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
        "targets": [
          {
            "expr": "sum(up{job=\"pos-api\"})",
            "legendFormat": "Healthy Instances"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 2},
                {"color": "green", "value": 3}
              ]
            }
          }
        }
      },
      {
        "id": 10,
        "title": "Request Rate by Endpoint",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "id": 11,
        "title": "Response Time Distribution",
        "type": "heatmap",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
        "targets": [
          {
            "expr": "sum(increase(http_request_duration_seconds_bucket[1m])) by (le)",
            "legendFormat": "{{le}}"
          }
        ]
      },
      {
        "id": 20,
        "title": "Database Connections",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 0, "y": 12},
        "targets": [
          {
            "expr": "pg_stat_activity_count",
            "legendFormat": "Active"
          },
          {
            "expr": "pg_settings_max_connections",
            "legendFormat": "Max"
          }
        ]
      },
      {
        "id": 21,
        "title": "Redis Operations",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 8, "y": 12},
        "targets": [
          {
            "expr": "rate(redis_commands_processed_total[1m])",
            "legendFormat": "Commands/sec"
          }
        ]
      },
      {
        "id": 22,
        "title": "Queue Depth",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 8, "x": 16, "y": 12},
        "targets": [
          {
            "expr": "rabbitmq_queue_messages",
            "legendFormat": "{{queue}}"
          }
        ]
      },
      {
        "id": 30,
        "title": "CPU Usage by Container",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]) * 100",
            "legendFormat": "{{container}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "percent"}
        }
      },
      {
        "id": 31,
        "title": "Memory Usage by Container",
        "type": "timeseries",
        "gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
        "targets": [
          {
            "expr": "container_memory_usage_bytes{container!=\"\"} / 1024 / 1024",
            "legendFormat": "{{container}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "decmbytes"}
        }
      }
    ]
  }
}
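
The P99 stat panel leans on PromQL's `histogram_quantile`, which picks the first cumulative bucket containing the target rank and interpolates linearly inside it. A rough Python sketch of that estimation, assuming cumulative `le` buckets as the exporter publishes them:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count); last bound is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation inside the chosen bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.1, 80), (0.5, 95), (2.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.99, buckets))  # -> 2.0 (the 99th observation sits at the 2.0s bound)
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.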

## 25.8 Incident Response Runbooks

### Runbook: API Down (P1)

**Alert**: APIDown
**Severity**: P1 (Critical)
**Impact**: Customers cannot complete transactions

#### Symptoms

- Health check endpoint returning non-200
- Load balancer showing unhealthy targets
- Transaction error rate spike

#### Immediate Actions (First 5 minutes)

1. **Verify the alert**
   ```bash
   curl -s http://pos-api:8080/health | jq
   docker ps | grep pos-api
   ```

2. **Check container logs**
   ```bash
   docker logs pos-api-1 --tail 100
   docker logs pos-api-2 --tail 100
   docker logs pos-api-3 --tail 100
   ```

3. **Check resource usage**
   ```bash
   docker stats --no-stream
   ```

4. **Restart unhealthy containers**
   ```bash
   docker restart pos-api-1  # Replace with affected container
   ```

#### Escalation

- If all containers down: page Infrastructure Lead
- If database issue: page Database Team
- If network issue: page Network Team

#### Resolution Checklist

- Identify root cause
- Apply fix (restart, rollback, config change)
- Verify health checks passing
- Monitor for 15 minutes
- Update incident ticket
- Schedule postmortem if major outage

#### Common Causes

| Cause | Solution |
|-------|----------|
| OOM (Out of Memory) | Restart, investigate memory leak |
| Database connection failure | Check DB health, restart connections |
| Deployment failure | Rollback to previous version |
| Network partition | Check network, restart networking |

### Runbook: High Error Rate (P1)

**Alert**: HighErrorRate
**Severity**: P1 (Critical)
**Impact**: Significant portion of requests failing

#### Symptoms

- 5xx error rate > 5%
- Customer complaints about failures
- Transaction success rate dropping

#### Immediate Actions

1. **Identify error patterns**
   ```bash
   # Check recent errors in logs
   docker logs pos-api-1 2>&1 | grep -i error | tail -50
   ```
   ```logql
   # Query Loki for error patterns
   {job="pos-api"} |= "error" | json | line_format "{{.message}}"
   ```

2. **Check which endpoints are failing**
   ```promql
   # In Grafana/Prometheus
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint, status)
   ```

3. **Check dependent services**
   ```bash
   # Database
   docker exec pos-postgres-primary pg_isready

   # Redis
   docker exec pos-redis redis-cli ping

   # RabbitMQ
   curl -u admin:password http://localhost:15672/api/healthchecks/node
   ```

#### Root Cause Investigation

| Error Pattern | Likely Cause | Solution |
|---------------|--------------|----------|
| 500 on /api/transactions | Database timeout | Check DB connections |
| 503 across all endpoints | Overload | Scale up or rate limit |
| 502 from nginx | Container crash | Restart containers |
| Timeout errors | Slow DB queries | Kill long queries, add indexes |

#### Recovery Steps

1. If DB issue: Restart connection pool
2. If overload: Enable aggressive rate limiting
3. If code bug: Rollback deployment
4. If external dependency: Enable circuit breaker
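
The Root Cause Investigation table is effectively a symptom-to-action lookup. As an illustration (not part of the platform), it could be encoded for a triage script like this, with the more specific status codes checked first:

```python
# (symptom substring, likely cause, first action) taken from the runbook table;
# order matters: specific codes before the generic 'timeout' pattern
TRIAGE = [
    ('503',     'Overload',         'Scale up or rate limit'),
    ('502',     'Container crash',  'Restart containers'),
    ('500',     'Database timeout', 'Check DB connections'),
    ('timeout', 'Slow DB queries',  'Kill long queries, add indexes'),
]

def triage(symptom: str):
    for pattern, cause, action in TRIAGE:
        if pattern.lower() in symptom.lower():
            return cause, action
    return 'Unknown', 'Escalate per runbook'

print(triage('nginx returning 502'))  # -> ('Container crash', 'Restart containers')
```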

---

## 25.18 OpenTelemetry Integration

### Overview

The monitoring stack is enhanced with OpenTelemetry (OTel) for comprehensive observability that prevents vendor lock-in and enables "Trace-to-Code" root cause analysis.

### Primary Pattern

| Attribute | Selection |
|-----------|-----------|
| **Pattern** | OpenTelemetry "Trace-to-Code" Pipeline |
| **Rationale** | Industry-standard protocol; trace errors from store terminal directly to source code line |
| **Vendor Lock-in** | None - OTel is open standard |

### Technology Stack (The "LGTM" Stack)

┌──────────────────────────────────────────────────┐
│                  THE LGTM STACK                  │
├──────────────────────────────────────────────────┤
│  L = Loki       (Log Aggregation)                │
│  G = Grafana    (Visualization & Dashboards)     │
│  T = Tempo      (Distributed Tracing)            │
│  M = Prometheus (Metrics Collection)             │
│                 ← Already configured             │
└──────────────────────────────────────────────────┘


| Component | Tool | Purpose | Port |
|-----------|------|---------|------|
| **L** - Logs | Loki | Log aggregation, search | 3100 |
| **G** - Grafana | Grafana | Unified dashboards | 3000 |
| **T** - Traces | Tempo (or Jaeger) | Distributed tracing | 4317 (OTLP), 16686 (UI) |
| **M** - Metrics | Prometheus | Metrics collection | 9090 |

### Docker Compose Addition

```yaml
# Add to docker-compose.monitoring.yml

services:
  # ... existing prometheus, grafana, alertmanager ...

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.9.0
    container_name: pos-loki
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring

  # Tempo - Distributed Tracing
  tempo:
    image: grafana/tempo:2.3.0
    container_name: pos-tempo
    ports:
      - "3200:3200"    # Tempo query
    # Tempo receives OTLP from the otel-collector over the internal
    # network; publishing 4317/4318 here as well would collide with
    # the collector's identical host port mappings below.
    volumes:
      - tempo_data:/var/tempo
      - ./tempo/tempo-config.yml:/etc/tempo/tempo.yaml
    command: -config.file=/etc/tempo/tempo.yaml
    networks:
      - monitoring

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.89.0
    container_name: pos-otel-collector
    ports:
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
      - "8888:8888"    # Prometheus metrics
    volumes:
      - ./otel/otel-collector-config.yml:/etc/otel/config.yaml
    command: --config=/etc/otel/config.yaml
    networks:
      - monitoring

volumes:
  loki_data:
  tempo_data:
```

### OpenTelemetry Collector Configuration

# monitoring/otel/otel-collector-config.yml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  # Send traces to Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Send metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel

  # Send logs to Loki (recent collector releases removed the inline
  # `labels:` block from this exporter; label mapping is done via
  # `loki.resource.labels` attribute hints instead)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
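
The `batch` processor above flushes a pipeline's data when either `send_batch_size` items accumulate or `timeout` elapses since the batch was opened, whichever comes first. A toy single-threaded sketch of that flush rule (the real processor is concurrent and also flushes on shutdown):

```python
def batch_flushes(events, batch_size=1024, timeout=10.0):
    """events: list of (timestamp, n_items); returns flush timestamps."""
    flushes, buf, opened = [], 0, None
    for ts, n in events:
        # A pending timeout flush happens before accepting the new items
        if opened is not None and ts - opened >= timeout:
            flushes.append(opened + timeout)
            buf, opened = 0, None
        if opened is None:
            opened = ts
        buf += n
        if buf >= batch_size:        # size-based flush
            flushes.append(ts)
            buf, opened = 0, None
    if opened is not None:           # leftover batch eventually times out
        flushes.append(opened + timeout)
    return flushes

# 600 spans at t=0 plus 600 at t=1 trigger a size flush at t=1;
# the 10 stragglers at t=2 go out on the timeout at t=12
print(batch_flushes([(0, 600), (1, 600), (2, 10)]))
```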

### .NET Application Instrumentation

// Program.cs - Add OpenTelemetry instrumentation

using System.Reflection;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Define resource attributes
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "pos-api",
        serviceVersion: Assembly.GetExecutingAssembly().GetName().Version?.ToString() ?? "1.0.0",
        serviceInstanceId: Environment.MachineName)
    .AddAttributes(new Dictionary<string, object>
    {
        ["deployment.environment"] = builder.Environment.EnvironmentName,
        ["tenant.id"] = "dynamic"  // Set per-request
    });

// Configure OpenTelemetry Tracing
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(resourceBuilder)
        .AddSource("PosPlatform.*")
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException = true;
            options.EnrichWithHttpRequest = (activity, request) =>
            {
                activity.SetTag("tenant.id", request.Headers["X-Tenant-Id"].FirstOrDefault());
            };
        })
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }))
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddPrometheusExporter()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }));

// Configure OpenTelemetry Logging
builder.Logging.AddOpenTelemetry(logging => logging
    .SetResourceBuilder(resourceBuilder)
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://otel-collector:4317");
    }));
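Note that `AddPrometheusExporter()` only registers the exporter; with the `OpenTelemetry.Exporter.Prometheus.AspNetCore` package, the scrape endpoint must still be mapped when building the app. A minimal sketch:

// Expose the Prometheus scrape endpoint (defaults to /metrics, matching
// the /metrics:8080 targets in the architecture diagram).
var app = builder.Build();

app.MapControllers();
app.MapPrometheusScrapingEndpoint();

app.Run();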

Custom Span Example (Trace-to-Code)

// SaleService.cs - Custom tracing for business operations

public class SaleService
{
    private static readonly ActivitySource ActivitySource = new("PosPlatform.Sales");
    private readonly ILogger<SaleService> _logger;

    public SaleService(ILogger<SaleService> logger) => _logger = logger;

    public async Task<Sale> CreateSaleAsync(CreateSaleCommand command)
    {
        // Create custom span with source code reference
        using var activity = ActivitySource.StartActivity(
            "CreateSale",
            ActivityKind.Internal,
            Activity.Current?.Context ?? default);

        activity?.SetTag("sale.location_id", command.LocationId);
        activity?.SetTag("sale.line_items_count", command.LineItems.Count);
        // Code-location tags enable trace-to-code navigation in Grafana.
        // Hardcoded here for illustration; in practice, populate them via
        // [CallerFilePath]/[CallerLineNumber] or a shared helper.
        activity?.SetTag("code.filepath", "SaleService.cs");
        activity?.SetTag("code.lineno", 25);
        activity?.SetTag("code.function", "CreateSaleAsync");

        try
        {
            // Business logic
            var sale = await ProcessSale(command);

            activity?.SetTag("sale.id", sale.Id);
            activity?.SetTag("sale.total", sale.Total);
            activity?.SetStatus(ActivityStatusCode.Ok);

            return sale;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            _logger.LogError(ex, "Failed to create sale for location {LocationId}", command.LocationId);
            throw;
        }
    }
}

Trace-to-Code Dashboard Query

# Grafana Tempo query - Find traces with errors from specific store
{
  resource.service.name = "pos-api" &&
  span.tenant.id = "NEXUS" &&
  status = error
}
| select(
    traceDuration,
    resource.service.name,
    span.code.filepath,
    span.code.lineno,
    span.code.function,
    statusMessage
)

Observability Overload Mitigation

To prevent alert fatigue and noise:

Strategy      Implementation
───────────   ───────────────────────────────────────────────
Sampling      Sample 10% of successful traces, 100% of errors
Aggregation   Batch traces before export (10s window)
Filtering     Exclude health check endpoints from tracing
Retention     Keep raw traces 7 days, aggregates 30 days
# Sampling configuration in the OTel Collector. The two processors below
# are alternative approaches (head-based vs. tail-based); Section 25.19
# covers the full tail-sampling policy set.
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces

  tail_sampling:
    policies:
      - name: always-sample-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-successful
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

Grafana Data Source Configuration

# grafana/provisioning/datasources/datasources.yml

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    url: http://loki:3100

  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
      tracesToMetrics:
        datasourceUid: prometheus
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true

Correlating Traces, Logs, and Metrics

With the LGTM stack (Loki for logs, Grafana, Tempo for traces, and Prometheus/Mimir for metrics), you can jump between correlated signals:

+------------------------------------------------------------------+
|                   OBSERVABILITY CORRELATION                       |
+------------------------------------------------------------------+
|                                                                   |
|  TRACE (Tempo)                                                    |
|  ┌────────────────────────────────────────────────────────────┐  |
|  │ TraceID: abc123                                             │  |
|  │ Span: CreateSale (45ms)                                     │  |
|  │   └─ Span: ValidateInventory (12ms)                         │  |
|  │   └─ Span: ProcessPayment (28ms) [ERROR]                    │  |
|  │         └─ code.filepath: PaymentService.cs:142             │  |
|  └────────────────────────────────────────────────────────────┘  |
|              │                                                    |
|              │ Click "Logs for this span"                        |
|              ▼                                                    |
|  LOGS (Loki)                                                      |
|  ┌────────────────────────────────────────────────────────────┐  |
|  │ 2026-01-24 10:15:32 ERROR Payment declined: Insufficient   │  |
|  │ 2026-01-24 10:15:32 INFO  Rolling back transaction abc123  │  |
|  └────────────────────────────────────────────────────────────┘  |
|              │                                                    |
|              │ Click "Metrics for this time"                     |
|              ▼                                                    |
|  METRICS (Prometheus)                                             |
|  ┌────────────────────────────────────────────────────────────┐  |
|  │ payment_failures_total{reason="insufficient_funds"} = 47   │  |
|  │ payment_latency_p99 = 2.3s                                  │  |
|  └────────────────────────────────────────────────────────────┘  |
|                                                                   |
+------------------------------------------------------------------+
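The "Logs for this span" jump in the diagram works via the Tempo data source's `tracesToLogs` setting; the reverse jump (log line to trace) requires a derived field on the Loki data source. A sketch, assuming JSON-formatted log lines that carry a `trace_id` field:

# grafana/provisioning/datasources/datasources.yml (Loki entry, extended)
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'  # assumes JSON log lines
          url: '$${__value.raw}'              # $$ escapes env-var expansion
          datasourceUid: tempo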

Reference

For the complete observability sampling strategy and its risk mitigations, see Section 25.19 below.


25.19 Observability Sampling Strategy

Overview

At production scale, collecting 100% of traces, metrics, and logs becomes prohibitively expensive. A thoughtful sampling strategy reduces costs while preserving visibility into errors and performance issues.

Attribute         Selection
───────────────   ─────────────────────────────────
Approach          Head-based + tail-based sampling
Error Retention   100% of errors sampled
Normal Traffic    1-10% sampled based on volume
Cost Target       < $500/month for LGTM stack

Sampling Strategy Matrix

+------------------------------------------------------------------+
|                    SAMPLING STRATEGY MATRIX                        |
+------------------------------------------------------------------+
|                                                                   |
|  SIGNAL TYPE       SAMPLE RATE    CONDITION                       |
|  ─────────────────────────────────────────────────────────────   |
|  Traces (errors)   100%           status_code >= 500 OR error=true|
|  Traces (slow)     100%           duration > 2s                   |
|  Traces (normal)   5%             All other traces                |
|  Traces (health)   0%             /health, /metrics endpoints     |
|                                                                   |
|  Metrics           100%           Always (cheap to store)         |
|  Metrics (custom)  Aggregated     Sum/avg over 15s window         |
|                                                                   |
|  Logs (ERROR+)     100%           severity >= ERROR               |
|  Logs (WARN)       50%            severity == WARN                |
|  Logs (INFO)       10%            severity == INFO                |
|  Logs (DEBUG)      0%             Production only; 100% in dev    |
|  Logs (health)     0%             Health check logs suppressed    |
|                                                                   |
+------------------------------------------------------------------+

Head-Based Sampling

Decision made at trace start. Simple but may miss errors that occur later in the trace.

// Program.cs - Head-based sampling configuration

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05))) // 5% sampling
        .AddAspNetCoreInstrumentation(options =>
        {
            // Always exclude health endpoints
            options.Filter = httpContext =>
                !httpContext.Request.Path.StartsWithSegments("/health") &&
                !httpContext.Request.Path.StartsWithSegments("/metrics");
        })
    );

Tail-Based Sampling

Decision made after the trace completes. Ensures all errors and slow requests are captured, at the cost of buffering spans in the Collector until the decision window closes.

# otel-collector-config.yaml

processors:
  # Tail-based sampling processor
  tail_sampling:
    decision_wait: 10s          # Wait for span completion
    num_traces: 100000          # Max traces in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Policy 1: Always sample errors (100%)
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Policy 2: Always sample slow requests (100%)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000    # > 2 seconds

      # Policy 3: Always sample payment operations (100%)
      - name: payments-policy
        type: string_attribute
        string_attribute:
          key: http.route
          values:
            - /api/v1/payments
            - /api/v1/refunds
          enabled_regex_matching: false

      # Policy 4: Sample normal traffic (5%)
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]

Log Sampling Configuration

# Loki pipeline configuration for log sampling

pipeline_stages:
  # Drop health check logs entirely
  - match:
      selector: '{job="pos-api"} |~ "GET /health"'
      action: drop

  # Drop metrics endpoint logs
  - match:
      selector: '{job="pos-api"} |~ "GET /metrics"'
      action: drop

  # Sample INFO logs at 10%
  - match:
      selector: '{level="info"}'
      stages:
        - sampling:
            rate: 0.1

  # Sample WARN logs at 50%
  - match:
      selector: '{level="warn"}'
      stages:
        - sampling:
            rate: 0.5

  # Keep 100% of ERROR and above
  - match:
      selector: '{level=~"error|fatal|critical"}'
      stages:
        - sampling:
            rate: 1.0

Application-Level Log Filtering

// Program.cs - Serilog with level-based filtering
// Requires: Serilog.AspNetCore, Serilog.Sinks.OpenTelemetry

using Serilog;
using Serilog.Events;
using Serilog.Filters;
using Serilog.Sinks.OpenTelemetry;

builder.Host.UseSerilog((context, config) =>
{
    config
        .MinimumLevel.Information()
        .MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
        .MinimumLevel.Override("Microsoft.EntityFrameworkCore", LogEventLevel.Warning)
        // Don't log health checks
        .Filter.ByExcluding(Matching.WithProperty<string>("RequestPath", p =>
            p.Contains("/health") || p.Contains("/metrics")))
        // Sample INFO logs in production
        .Filter.ByExcluding(e =>
            e.Level == LogEventLevel.Information &&
            context.HostingEnvironment.IsProduction() &&
            Random.Shared.NextDouble() > 0.1) // Keep 10%
        .WriteTo.Console()
        .WriteTo.OpenTelemetry(options =>
        {
            options.Endpoint = "http://otel-collector:4317";
            options.Protocol = OtlpProtocol.Grpc;
        });
});
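The `Random.Shared` filter above is simple but non-deterministic: the INFO logs it keeps will not necessarily belong to the traces the trace sampler keeps. A hedged alternative (the helper below is illustrative, not a Serilog API) is to hash the current trace ID, so log and trace sampling decisions stay aligned per request:

using System.Diagnostics;

// Illustrative helper: deterministic sampling keyed on the trace ID, so the
// ~10% of INFO logs that survive belong to a consistent subset of traces.
static class TraceAlignedSampling
{
    public static bool KeepInfoLog(double rate = 0.1)
    {
        var traceId = Activity.Current?.TraceId.ToString();
        if (string.IsNullOrEmpty(traceId)) return true; // no trace context: keep

        // FNV-1a hash of the trace ID, mapped onto [0, 1)
        uint h = 2166136261;
        foreach (char c in traceId) { h = (h ^ c) * 16777619; }
        return (h % 10_000) / 10_000.0 < rate;
    }
}

// Usage inside the Serilog configuration:
//   .Filter.ByExcluding(e =>
//       e.Level == LogEventLevel.Information && !TraceAlignedSampling.KeepInfoLog())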

Sampling Cost Analysis

+------------------------------------------------------------------+
|                    MONTHLY COST COMPARISON                         |
+------------------------------------------------------------------+
|                                                                   |
|  SCENARIO: 10 API instances, 1000 req/sec, 30-day retention       |
|                                                                   |
|  WITHOUT SAMPLING                    WITH SAMPLING                |
|  ─────────────────────────          ─────────────────────────    |
|  Traces:                             Traces:                      |
|    2.6B traces/month                   130M traces/month (5%)     |
|    Storage: ~2.6 TB                    Storage: ~130 GB           |
|    Cost: ~$2,000/month                 Cost: ~$100/month          |
|                                                                   |
|  Logs:                               Logs:                        |
|    5B log lines/month                  500M log lines (10% avg)   |
|    Storage: ~5 TB                      Storage: ~500 GB           |
|    Cost: ~$3,000/month                 Cost: ~$300/month          |
|                                                                   |
|  TOTAL: ~$5,000/month                TOTAL: ~$400/month           |
|  ─────────────────────────────────────────────────────────────   |
|  SAVINGS: 92% reduction with smart sampling                       |
|                                                                   |
+------------------------------------------------------------------+

Preserving Debug Capability

While sampling reduces volume, ensure debugging capability is preserved:

// Enable full sampling for specific requests via header

public class DynamicSamplingMiddleware
{
    private readonly RequestDelegate _next;

    public DynamicSamplingMiddleware(RequestDelegate next) => _next = next;
    public async Task InvokeAsync(HttpContext context)
    {
        // Check for debug header
        if (context.Request.Headers.TryGetValue("X-Force-Trace", out var forceTrace) &&
            forceTrace == "true")
        {
            // Tag the span so a tail-sampling policy can force-keep it.
            // (Tags alone do not change the head-sampling decision; the
            // Collector must have a matching policy.)
            Activity.Current?.SetTag("sampling.priority", 1);
            Activity.Current?.SetTag("debug.forced", "true");
        }

        await _next(context);
    }
}

// Usage: Add header to force sampling
// curl -H "X-Force-Trace: true" https://api.posplatform.io/api/v1/sales
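The middleware only tags the span; for the tag to take effect, the Collector's tail sampler needs a matching policy. A sketch, assuming the span carries a `debug.forced` attribute with the string value "true":

# otel-collector-config.yaml (additional tail_sampling policy, assumed names)
processors:
  tail_sampling:
    policies:
      # Always keep traces explicitly forced via the X-Force-Trace header
      - name: forced-debug-policy
        type: string_attribute
        string_attribute:
          key: debug.forced
          values: ["true"]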

Sampling Metrics

Monitor sampling effectiveness:

# prometheus/rules/sampling-rules.yml

groups:
  - name: sampling-metrics
    rules:
      - record: otel_traces_sampled_total
        expr: sum(rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]))

      - record: otel_traces_dropped_total
        expr: sum(rate(otelcol_processor_tail_sampling_count_traces_dropped[5m]))

      - record: otel_sampling_rate
        expr: |
          otel_traces_sampled_total / (otel_traces_sampled_total + otel_traces_dropped_total)

      - alert: SamplingRateTooLow
        expr: otel_sampling_rate < 0.01
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Trace sampling rate is below 1%"
          description: "Consider increasing sampling or checking for data loss"

      - alert: ErrorsNotSampled
        expr: |
          rate(http_server_requests_total{status=~"5.."}[5m]) >
          rate(otel_traces_sampled{has_error="true"}[5m]) * 1.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Errors may not be properly sampled"
          description: "More HTTP 5xx errors than sampled error traces"

Sampling Decision Flowchart

┌─────────────────────────────────────────────────────────────────┐
│                   SAMPLING DECISION FLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  New Request Arrives                                             │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────────┐                                             │
│  │ Health/Metrics  │──Yes──► DROP (0%)                           │
│  │ endpoint?       │                                             │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ X-Force-Trace   │──Yes──► SAMPLE (100%)                       │
│  │ header present? │                                             │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  [Request Processes...]                                          │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Error occurred? │──Yes──► SAMPLE (100%)                       │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Duration > 2s?  │──Yes──► SAMPLE (100%)                       │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Payment route?  │──Yes──► SAMPLE (100%)                       │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │ Random 5%?      │──Yes──► SAMPLE                              │
│  └────────┬────────┘                                             │
│           │ No                                                   │
│           ▼                                                      │
│         DROP                                                     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
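As a compact reference, the flowchart above can be sketched as a single decision function (the types and names below are illustrative, not part of the platform code):

// Illustrative encoding of the sampling decision flow above.
enum SamplingDecision { Drop, Sample }

record RequestInfo(
    string Path, bool ForceTrace, bool HasError,
    TimeSpan Duration, bool IsPaymentRoute);

static class SamplingFlow
{
    static readonly Random Rng = new();

    public static SamplingDecision Decide(RequestInfo r)
    {
        if (r.Path is "/health" or "/metrics")    return SamplingDecision.Drop;   // 0%
        if (r.ForceTrace)                         return SamplingDecision.Sample; // 100%
        if (r.HasError)                           return SamplingDecision.Sample; // 100%
        if (r.Duration > TimeSpan.FromSeconds(2)) return SamplingDecision.Sample; // 100%
        if (r.IsPaymentRoute)                     return SamplingDecision.Sample; // 100%
        return Rng.NextDouble() < 0.05
            ? SamplingDecision.Sample                                             // 5%
            : SamplingDecision.Drop;
    }
}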

25.20 Summary

This chapter provides complete monitoring coverage:

  1. Architecture: Prometheus + Grafana + AlertManager stack
  2. Metrics: Business SLIs and infrastructure metrics with thresholds
  3. Prometheus Config: Complete scrape configuration
  4. Alert Rules: P1-P4 severity levels with escalation
  5. Grafana Dashboard: Production-ready JSON dashboard
  6. Runbooks: Step-by-step incident response procedures
  7. Observability: OpenTelemetry traces, logs, and metrics with the LGTM stack and cost-aware sampling

Next Chapter: Chapter 26: Security Compliance


“You cannot improve what you do not measure.”


Document Information

Attribute   Value
─────────   ─────────────────
Version     5.0.0
Created     2025-12-29
Updated     2026-02-25
Author      Claude Code
Status      Active
Part        VII - Operations
Chapter     25 of 32

This chapter is part of the POS Blueprint Book. All content is self-contained.