Chapter 25: Monitoring and Alerting
25.1 Overview
This chapter defines the complete monitoring architecture for the POS Platform, including metrics collection, dashboards, alerting rules, and incident response procedures.
25.2 Monitoring Architecture
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ MONITORING STACK │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ POS-API-1 │ │ POS-API-2 │ │ POS-API-3 │
│ │ │ │ │ │
│ /metrics:8080 │ │ /metrics:8080 │ │ /metrics:8080 │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ PROMETHEUS │
│ (Metrics Store) │
│ │
│ - Scrape interval: 15s │
│ - Retention: 15 days │
│ - Port: 9090 │
└──────────────────┬───────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ GRAFANA │ │ ALERTMANAGER │ │ LOKI │
│ (Dashboards) │ │ (Alerts) │ │ (Logs) │
│ │ │ │ │ │
│ Port: 3000 │ │ Port: 9093 │ │ Port: 3100 │
└─────────────────┘ └────────┬────────┘ └─────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌───────────┐
│ Slack  │ │ Email  │ │ PagerDuty │
└────────┘ └────────┘ └───────────┘
25.3 Key Metrics
Business SLIs (Service Level Indicators)
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Transaction Success Rate | % of transactions completed successfully | > 99.9% | < 99.5% |
| Avg Transaction Time | End-to-end transaction processing | < 2s | > 5s |
| Payment Success Rate | % of payments processed successfully | > 99.5% | < 99% |
| Order Fulfillment Rate | Orders fulfilled within SLA | > 98% | < 95% |
| API Availability | Uptime of API endpoints | > 99.9% | < 99.5% |
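These targets translate directly into an error budget. As a back-of-the-envelope illustration (assuming a 30-day month; the numbers below are arithmetic, not measured values), the allowed downtime per target is:

```python
# Error-budget sketch: allowed downtime for a given availability target.
# Illustrative only; assumes a 30-day (43,200-minute) month.

def error_budget_minutes(target_pct: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of downtime permitted per period at the given target."""
    return period_minutes * (1 - target_pct / 100)

for target in (99.9, 99.5, 98.0):
    print(f"{target}% -> {error_budget_minutes(target):.1f} min/month")
```

A 99.9% target leaves roughly 43 minutes of downtime per month, which is why the alert threshold sits below the target rather than at it.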
Infrastructure Metrics
| Category | Metric | Warning | Critical |
|---|---|---|---|
| CPU | Usage % | > 70% | > 90% |
| Memory | Usage % | > 75% | > 90% |
| Disk | Usage % | > 70% | > 85% |
| Disk | I/O Wait | > 20% | > 40% |
| Network | Packet Loss | > 0.1% | > 1% |
| Network | Latency (ms) | > 100ms | > 500ms |
Application Metrics
| Metric | Description | Warning | Critical |
|---|---|---|---|
| Error Rate | 5xx errors per minute | > 1% | > 5% |
| Response Time (p99) | 99th percentile latency | > 500ms | > 2000ms |
| Response Time (p50) | Median latency | > 100ms | > 500ms |
| Request Rate | Requests per second | N/A (baseline) | > 200% of baseline |
| Queue Depth | Messages waiting in RabbitMQ | > 1000 | > 5000 |
| Active Connections | DB connections in use | > 80% of pool | > 95% of pool |
| Cache Hit Rate | Redis cache effectiveness | < 80% | < 60% |
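The ratios in these tables are cheaper to alert and graph on if they are precomputed. A sketch of Prometheus recording rules for the two headline ratios, using the metric names that appear later in this chapter (the file path and rule names are assumptions, not part of the deployed configuration):

```yaml
# Sketch: recording rules for the headline ratios above.
# Hypothetical file: /etc/prometheus/rules/sli-recording.yml
groups:
  - name: sli_recording
    rules:
      - record: sli:transaction_success_ratio
        expr: |
          sum(rate(pos_transactions_success_total[5m]))
          /
          sum(rate(pos_transactions_total[5m]))
      - record: sli:http_error_ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```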
25.4 Prometheus Configuration
Complete prometheus.yml
```yaml
# File: /pos-platform/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pos-production'
    environment: 'production'

#=============================================
# ALERTING CONFIGURATION
#=============================================
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

#=============================================
# RULE FILES
#=============================================
rule_files:
  - "/etc/prometheus/rules/*.yml"

#=============================================
# SCRAPE CONFIGURATIONS
#=============================================
scrape_configs:
  #-----------------------------------------
  # Prometheus Self-Monitoring
  #-----------------------------------------
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  #-----------------------------------------
  # POS API Instances
  #-----------------------------------------
  - job_name: 'pos-api'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pos-api-1:8080'
          - 'pos-api-2:8080'
          - 'pos-api-3:8080'
        labels:
          app: 'pos-api'
          tier: 'backend'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  #-----------------------------------------
  # PostgreSQL Exporter
  #-----------------------------------------
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          app: 'postgres'
          tier: 'database'

  #-----------------------------------------
  # Redis Exporter
  #-----------------------------------------
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          app: 'redis'
          tier: 'cache'

  #-----------------------------------------
  # RabbitMQ Exporter
  #-----------------------------------------
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          app: 'rabbitmq'
          tier: 'messaging'

  #-----------------------------------------
  # Nginx Exporter
  #-----------------------------------------
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          app: 'nginx'
          tier: 'ingress'

  #-----------------------------------------
  # Node Exporter (Host Metrics)
  #-----------------------------------------
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          tier: 'infrastructure'

  #-----------------------------------------
  # Docker Container Metrics
  #-----------------------------------------
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          tier: 'containers'
```
25.5 Alert Rules
Complete Alert Rules Configuration
```yaml
# File: /pos-platform/monitoring/prometheus/rules/alerts.yml
groups:
  #=============================================
  # P1 - CRITICAL (Page immediately)
  #=============================================
  - name: critical_alerts
    rules:
      #-----------------------------------------
      # API Down
      #-----------------------------------------
      - alert: APIDown
        expr: up{job="pos-api"} == 0
        for: 1m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "POS API instance {{ $labels.instance }} is down"
          description: "API instance has been unreachable for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/api-down"

      #-----------------------------------------
      # Database Down
      #-----------------------------------------
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database connection failed for 30 seconds"
          runbook_url: "https://wiki.internal/runbooks/db-down"

      #-----------------------------------------
      # High Error Rate
      #-----------------------------------------
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 2m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "High error rate detected: {{ $value | printf \"%.2f\" }}%"
          description: "Error rate exceeds 5% for more than 2 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      #-----------------------------------------
      # Transaction Failure Spike
      #-----------------------------------------
      - alert: TransactionFailureSpike
        expr: |
          (
            sum(rate(pos_transactions_failed_total[5m]))
            /
            sum(rate(pos_transactions_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "Transaction failure rate: {{ $value | printf \"%.2f\" }}%"
          description: "More than 1% of transactions are failing"
          runbook_url: "https://wiki.internal/runbooks/transaction-failures"

  #=============================================
  # P2 - HIGH (Page during business hours)
  #=============================================
  - name: high_alerts
    rules:
      #-----------------------------------------
      # High Response Time
      #-----------------------------------------
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "P99 response time is {{ $value | printf \"%.2f\" }}s"
          description: "99th percentile latency exceeds 2 seconds"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      #-----------------------------------------
      # Database Connection Pool Exhaustion
      #-----------------------------------------
      - alert: DBConnectionPoolLow
        expr: |
          pg_stat_activity_count / pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "DB connection pool at {{ $value | printf \"%.0f\" }}%"
          description: "Database connections nearly exhausted"
          runbook_url: "https://wiki.internal/runbooks/db-connections"

      #-----------------------------------------
      # Queue Backlog
      #-----------------------------------------
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages > 5000
        for: 10m
        labels:
          severity: P2
          team: platform
        annotations:
          summary: "Message queue backlog: {{ $value }} messages"
          description: "RabbitMQ queue has significant backlog"
          runbook_url: "https://wiki.internal/runbooks/queue-backlog"

      #-----------------------------------------
      # Memory Pressure
      #-----------------------------------------
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: P2
          team: infrastructure
        annotations:
          summary: "Memory usage at {{ $value | printf \"%.0f\" }}%"
          description: "System memory is critically low"
          runbook_url: "https://wiki.internal/runbooks/memory-pressure"

  #=============================================
  # P3 - MEDIUM (Email/Slack notification)
  #=============================================
  - name: medium_alerts
    rules:
      #-----------------------------------------
      # CPU Warning
      #-----------------------------------------
      - alert: HighCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 15m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "CPU usage at {{ $value | printf \"%.0f\" }}%"
          description: "CPU usage elevated for extended period"

      #-----------------------------------------
      # Disk Space Warning
      #-----------------------------------------
      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 70
        for: 30m
        labels:
          severity: P3
          team: infrastructure
        annotations:
          summary: "Disk usage at {{ $value | printf \"%.0f\" }}% on {{ $labels.mountpoint }}"
          description: "Disk space running low"

      #-----------------------------------------
      # Cache Hit Rate Low
      #-----------------------------------------
      - alert: CacheHitRateLow
        # rate() over a window so the ratio reflects recent traffic,
        # not the lifetime totals of the hit/miss counters
        expr: |
          rate(redis_keyspace_hits_total[5m]) /
          (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100 < 80
        for: 30m
        labels:
          severity: P3
          team: platform
        annotations:
          summary: "Cache hit rate: {{ $value | printf \"%.0f\" }}%"
          description: "Redis cache effectiveness is low"

  #=============================================
  # P4 - LOW (Log/Dashboard only)
  #=============================================
  - name: low_alerts
    rules:
      #-----------------------------------------
      # SSL Certificate Expiry
      #-----------------------------------------
      - alert: SSLCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "SSL cert expires in {{ $value | printf \"%.0f\" }} days"
          description: "Certificate renewal needed soon"

      #-----------------------------------------
      # Container Restarts
      #-----------------------------------------
      - alert: ContainerRestarts
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 1h
        labels:
          severity: P4
          team: platform
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times"
          description: "Container may be unstable"
```
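The HighErrorRate expression divides the 5xx request rate by the total request rate. Its arithmetic is easy to sanity-check offline; this sketch mimics `rate()` over a 5-minute window using two synthetic counter samples (the numbers are made up for illustration):

```python
# Sanity-check of the HighErrorRate arithmetic with synthetic counter
# samples taken 300 s apart (mimicking rate(...[5m])).

def prom_rate(earlier: float, later: float, window_s: float = 300.0) -> float:
    """Per-second increase of a monotonically growing counter over a window."""
    return (later - earlier) / window_s

total_rate = prom_rate(earlier=120_000, later=150_000)  # 100 requests/s
error_rate = prom_rate(earlier=1_000, later=2_800)      # 6 errors/s

error_pct = error_rate / total_rate * 100
print(f"error rate = {error_pct:.1f}%")  # above the 5% P1 threshold
```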
25.6 AlertManager Configuration
```yaml
# File: /pos-platform/monitoring/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@pos-platform.com'
  smtp_auth_username: 'alerts@pos-platform.com'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

#=============================================
# ROUTING
#=============================================
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    #-----------------------------------------
    # P1 - Critical: Page immediately
    #-----------------------------------------
    - match:
        severity: P1
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: P1
      receiver: 'slack-critical'
      continue: true
    #-----------------------------------------
    # P2 - High: Page during business hours
    #-----------------------------------------
    - match:
        severity: P2
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours
      continue: true
    - match:
        severity: P2
      receiver: 'slack-high'
    #-----------------------------------------
    # P3 - Medium: Slack + Email
    #-----------------------------------------
    - match:
        severity: P3
      receiver: 'slack-medium'
      continue: true
    - match:
        severity: P3
      receiver: 'email-team'
    #-----------------------------------------
    # P4 - Low: Slack only
    #-----------------------------------------
    - match:
        severity: P4
      receiver: 'slack-low'

#=============================================
# TIME INTERVALS
#=============================================
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

#=============================================
# RECEIVERS
#=============================================
receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: critical
  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: error
  - name: 'slack-critical'
    slack_configs:
      - channel: '#pos-critical'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.internal/d/pos-overview'
  - name: 'slack-high'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
        color: 'warning'
  - name: 'slack-medium'
    slack_configs:
      - channel: '#pos-alerts'
        send_resolved: true
  - name: 'slack-low'
    slack_configs:
      - channel: '#pos-info'
        send_resolved: false
  - name: 'email-team'
    email_configs:
      - to: 'platform-team@company.com'
        send_resolved: true
```
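One operational note: Alertmanager does not expand `${VAR}` environment-variable placeholders in its config file on its own, so the file above needs to be rendered at deploy time (for example with `envsubst`). A minimal stand-in renderer, shown here only to make the substitution step concrete:

```python
# Render ${VAR} placeholders in alertmanager.yml before starting the
# service; a stand-in for `envsubst`.
import re

def render(template: str, env: dict) -> str:
    """Replace ${NAME} with values from env; leave unknown names intact."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), m.group(0)), template)

cfg = "smtp_auth_password: '${SMTP_PASSWORD}'\nslack_api_url: '${SLACK_WEBHOOK_URL}'"
rendered = render(cfg, {"SMTP_PASSWORD": "s3cret",
                        "SLACK_WEBHOOK_URL": "https://hooks.example/T000"})
print(rendered)
```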
25.7 Grafana Dashboard
POS Platform Overview Dashboard (JSON)
{
"dashboard": {
"id": null,
"uid": "pos-overview",
"title": "POS Platform Overview",
"tags": ["pos", "production"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "Transaction Success Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"targets": [
{
"expr": "(sum(rate(pos_transactions_success_total[5m])) / sum(rate(pos_transactions_total[5m]))) * 100",
"legendFormat": "Success Rate"
}
],
"options": {
"colorMode": "value",
"graphMode": "area"
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 99},
{"color": "green", "value": 99.5}
]
}
}
}
},
{
"id": 2,
"title": "Requests per Second",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total[1m]))",
"legendFormat": "RPS"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
}
}
},
{
"id": 3,
"title": "P99 Response Time",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 0.5},
{"color": "red", "value": 2}
]
}
}
}
},
{
"id": 4,
"title": "Error Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
"targets": [
{
"expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
"legendFormat": "Errors"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
},
{
"id": 5,
"title": "Active Transactions",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
"targets": [
{
"expr": "pos_transactions_in_progress",
"legendFormat": "Active"
}
]
},
{
"id": 6,
"title": "API Health",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
"targets": [
{
"expr": "count(up{job=\"pos-api\"} == 1)",
"legendFormat": "Healthy Instances"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 2},
{"color": "green", "value": 3}
]
}
}
}
},
{
"id": 10,
"title": "Request Rate by Endpoint",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
"legendFormat": "{{endpoint}}"
}
]
},
{
"id": 11,
"title": "Response Time Distribution",
"type": "heatmap",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"targets": [
{
"expr": "sum(increase(http_request_duration_seconds_bucket[1m])) by (le)",
"legendFormat": "{{le}}"
}
]
},
{
"id": 20,
"title": "Database Connections",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 0, "y": 12},
"targets": [
{
"expr": "pg_stat_activity_count",
"legendFormat": "Active"
},
{
"expr": "pg_settings_max_connections",
"legendFormat": "Max"
}
]
},
{
"id": 21,
"title": "Redis Operations",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 8, "y": 12},
"targets": [
{
"expr": "rate(redis_commands_processed_total[1m])",
"legendFormat": "Commands/sec"
}
]
},
{
"id": 22,
"title": "Queue Depth",
"type": "timeseries",
"gridPos": {"h": 6, "w": 8, "x": 16, "y": 12},
"targets": [
{
"expr": "rabbitmq_queue_messages",
"legendFormat": "{{queue}}"
}
]
},
{
"id": 30,
"title": "CPU Usage by Container",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 18},
"targets": [
{
"expr": "rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]) * 100",
"legendFormat": "{{container}}"
}
],
"fieldConfig": {
"defaults": {"unit": "percent"}
}
},
{
"id": 31,
"title": "Memory Usage by Container",
"type": "timeseries",
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 18},
"targets": [
{
"expr": "container_memory_usage_bytes{container!=\"\"} / 1024 / 1024",
"legendFormat": "{{container}}"
}
],
"fieldConfig": {
"defaults": {"unit": "decmbytes"}
}
}
]
}
}
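The dashboard JSON above already nests its definition under a `"dashboard"` key, which matches the payload shape Grafana's `POST /api/dashboards/db` endpoint expects once an `overwrite` flag is added. A sketch using an abbreviated stand-in for the full dashboard object (the endpoint path is Grafana's documented dashboard API; the stand-in content is illustrative):

```python
# Wrap the dashboard JSON for Grafana's POST /api/dashboards/db endpoint.
import json

# Abbreviated stand-in for the full dashboard JSON in this section.
dashboard_file = {"dashboard": {"uid": "pos-overview",
                                "title": "POS Platform Overview"}}

payload = {
    "dashboard": dashboard_file["dashboard"],
    "overwrite": True,  # replace an existing dashboard with the same uid
}
body = json.dumps(payload)
print(body)
```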
25.8 Incident Response Runbooks
Runbook: API Down (P1)
# Runbook: API Down
**Alert**: APIDown
**Severity**: P1 (Critical)
**Impact**: Customers cannot complete transactions
## Symptoms
- Health check endpoint returning non-200
- Load balancer showing unhealthy targets
- Transaction error rate spike
## Immediate Actions (First 5 minutes)
1. **Verify the alert**
   ```bash
   curl -s http://pos-api:8080/health | jq
   docker ps | grep pos-api
   ```
2. **Check container logs**
   ```bash
   docker logs pos-api-1 --tail 100
   docker logs pos-api-2 --tail 100
   docker logs pos-api-3 --tail 100
   ```
3. **Check resource usage**
   ```bash
   docker stats --no-stream
   ```
4. **Restart unhealthy containers**
   ```bash
   docker restart pos-api-1  # Replace with the affected container
   ```
## Escalation
- If all containers down: Page Infrastructure Lead
- If database issue: Page Database Team
- If network issue: Page Network Team
## Resolution Checklist
- [ ] Identify root cause
- [ ] Apply fix (restart, rollback, config change)
- [ ] Verify health checks passing
- [ ] Monitor for 15 minutes
- [ ] Update incident ticket
- [ ] Schedule postmortem if major outage
## Common Causes
| Cause | Solution |
|---|---|
| OOM (Out of Memory) | Restart, investigate memory leak |
| Database connection failure | Check DB health, restart connections |
| Deployment failure | Rollback to previous version |
| Network partition | Check network, restart networking |
### Runbook: High Error Rate (P1)
# Runbook: High Error Rate
**Alert**: HighErrorRate
**Severity**: P1 (Critical)
**Impact**: Significant portion of requests failing
## Symptoms
- 5xx error rate > 5%
- Customer complaints about failures
- Transaction success rate dropping
## Immediate Actions
1. **Identify error patterns**
   ```bash
   # Check recent errors in logs
   docker logs pos-api-1 2>&1 | grep -i error | tail -50
   ```
   Query Loki for error patterns:
   ```logql
   {job="pos-api"} |= "error" | json | line_format "{{.message}}"
   ```
2. **Check which endpoints are failing**
   ```promql
   # In Grafana/Prometheus
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint, status)
   ```
3. **Check dependent services**
   ```bash
   # Database
   docker exec pos-postgres-primary pg_isready
   # Redis
   docker exec pos-redis redis-cli ping
   # RabbitMQ
   curl -u admin:password http://localhost:15672/api/healthchecks/node
   ```
## Root Cause Investigation
| Error Pattern | Likely Cause | Solution |
|---|---|---|
| 500 on /api/transactions | Database timeout | Check DB connections |
| 503 across all endpoints | Overload | Scale up or rate limit |
| 502 from nginx | Container crash | Restart containers |
| Timeout errors | Slow DB queries | Kill long queries, add indexes |
## Recovery Steps
- If DB issue: Restart connection pool
- If overload: Enable aggressive rate limiting
- If code bug: Rollback deployment
- If external dependency: Enable circuit breaker
---
## 25.18 OpenTelemetry Integration
### Overview
The monitoring stack is enhanced with OpenTelemetry (OTel) for comprehensive observability that prevents vendor lock-in and enables "Trace-to-Code" root cause analysis.
### Primary Pattern
| Attribute | Selection |
|-----------|-----------|
| **Pattern** | OpenTelemetry "Trace-to-Code" Pipeline |
| **Rationale** | Industry-standard protocol; trace errors from store terminal directly to source code line |
| **Vendor Lock-in** | None - OTel is open standard |
### Technology Stack (The "LGTM" Stack)
+------------------------------------------------------------+
|                       THE LGTM STACK                       |
+------------------------------------------------------------+
|  L = Loki       (Log Aggregation)                          |
|  G = Grafana    (Visualization & Dashboards)               |
|  T = Tempo      (Distributed Tracing)                      |
|  M = Prometheus (Metrics Collection) <- Already configured |
+------------------------------------------------------------+
| Component | Tool | Purpose | Port |
|-----------|------|---------|------|
| **L** - Logs | Loki | Log aggregation, search | 3100 |
| **G** - Grafana | Grafana | Unified dashboards | 3000 |
| **T** - Traces | Tempo (or Jaeger) | Distributed tracing | 4317 (OTLP), 16686 (UI) |
| **M** - Metrics | Prometheus | Metrics collection | 9090 |
### Docker Compose Addition
```yaml
# Add to docker-compose.monitoring.yml
services:
  # ... existing prometheus, grafana, alertmanager ...

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.9.0
    container_name: pos-loki
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring

  # Tempo - Distributed Tracing
  # Note: Tempo's OTLP ports (4317/4318) stay internal to the monitoring
  # network; the OTel Collector owns the host-published OTLP ports, and
  # publishing both would conflict.
  tempo:
    image: grafana/tempo:2.3.0
    container_name: pos-tempo
    ports:
      - "3200:3200"   # Tempo query
    volumes:
      - tempo_data:/var/tempo
      - ./tempo/tempo-config.yml:/etc/tempo/tempo.yaml
    command: -config.file=/etc/tempo/tempo.yaml
    networks:
      - monitoring

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.89.0
    container_name: pos-otel-collector
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector self-metrics
    volumes:
      - ./otel/otel-collector-config.yml:/etc/otel/config.yaml
    command: --config=/etc/otel/config.yaml
    networks:
      - monitoring

volumes:
  loki_data:
  tempo_data:
```
OpenTelemetry Collector Configuration
```yaml
# monitoring/otel/otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  # Send traces to Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Expose metrics for Prometheus to scrape
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  # Send logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        service.instance.id: "instance_id"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```
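Applications normally export through an SDK, but the collector's OTLP/HTTP endpoint on port 4318 can also be smoke-tested with a hand-built payload. A minimal trace in the OTLP JSON encoding (field names follow the OTLP spec; the IDs are dummy hex values, and actually POSTing the body to `http://otel-collector:4318/v1/traces` is left out here):

```python
# Hand-built OTLP/HTTP JSON trace payload for smoke-testing the collector.
import json
import time

now_ns = int(time.time() * 1e9)
payload = {
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "pos-api"}},
        ]},
        "scopeSpans": [{
            "scope": {"name": "smoke-test"},
            "spans": [{
                "traceId": "0af7651916cd43dd8448eb211c80319c",  # 16 bytes, hex
                "spanId": "b7ad6b7169203331",                   # 8 bytes, hex
                "name": "smoke-span",
                "kind": 1,  # SPAN_KIND_INTERNAL
                "startTimeUnixNano": str(now_ns),
                "endTimeUnixNano": str(now_ns + 1_000_000),
            }],
        }],
    }],
}
body = json.dumps(payload)
```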
.NET Application Instrumentation
```csharp
// Program.cs - Add OpenTelemetry instrumentation
using System.Reflection;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Define resource attributes
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "pos-api",
        serviceVersion: Assembly.GetExecutingAssembly().GetName().Version?.ToString() ?? "1.0.0",
        serviceInstanceId: Environment.MachineName)
    .AddAttributes(new Dictionary<string, object>
    {
        ["deployment.environment"] = builder.Environment.EnvironmentName,
        ["tenant.id"] = "dynamic" // Set per-request
    });

// Configure OpenTelemetry Tracing
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(resourceBuilder)
        .AddSource("PosPlatform.*")
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException = true;
            options.EnrichWithHttpRequest = (activity, request) =>
            {
                activity.SetTag("tenant.id", request.Headers["X-Tenant-Id"].FirstOrDefault());
            };
        })
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }))
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddPrometheusExporter()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }));

// Configure OpenTelemetry Logging
builder.Logging.AddOpenTelemetry(logging => logging
    .SetResourceBuilder(resourceBuilder)
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://otel-collector:4317");
    }));
```
Custom Span Example (Trace-to-Code)
```csharp
// SaleService.cs - Custom tracing for business operations
using System.Diagnostics;

public class SaleService
{
    private static readonly ActivitySource ActivitySource = new("PosPlatform.Sales");
    private readonly ILogger<SaleService> _logger;

    public async Task<Sale> CreateSaleAsync(CreateSaleCommand command)
    {
        // Create custom span with source code reference
        using var activity = ActivitySource.StartActivity(
            "CreateSale",
            ActivityKind.Internal,
            Activity.Current?.Context ?? default);

        activity?.SetTag("sale.location_id", command.LocationId);
        activity?.SetTag("sale.line_items_count", command.LineItems.Count);
        activity?.SetTag("code.filepath", "SaleService.cs");
        activity?.SetTag("code.lineno", 25);
        activity?.SetTag("code.function", "CreateSaleAsync");

        try
        {
            // Business logic
            var sale = await ProcessSale(command);
            activity?.SetTag("sale.id", sale.Id);
            activity?.SetTag("sale.total", sale.Total);
            activity?.SetStatus(ActivityStatusCode.Ok);
            return sale;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            _logger.LogError(ex, "Failed to create sale for location {LocationId}", command.LocationId);
            throw;
        }
    }
}
```
Trace-to-Code Dashboard Query
Grafana Tempo query: find traces with errors from a specific store.

```traceql
{
  resource.service.name = "pos-api" &&
  span.tenant.id = "NEXUS" &&
  status = error
}
| select(
    traceDuration,
    resource.service.name,
    span.code.filepath,
    span.code.lineno,
    span.code.function,
    statusMessage
  )
```
Observability Overload Mitigation
To prevent alert fatigue and noise:
| Strategy | Implementation |
|---|---|
| Sampling | Sample 10% of successful traces, 100% of errors |
| Aggregation | Batch traces before export (10s window) |
| Filtering | Exclude health check endpoints from tracing |
| Retention | Keep raw traces 7 days, aggregates 30 days |
```yaml
# Sampling configuration in OTel Collector
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces
  tail_sampling:
    policies:
      - name: always-sample-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-successful
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```
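The combined effect of these policies can be estimated with a quick simulation (synthetic traffic with a seeded RNG; the 2% error rate is an assumption for illustration):

```python
# Estimate the effect of the policy mix: keep every error trace,
# keep ~10% of the rest. Synthetic traffic, seeded RNG.
import random

random.seed(42)
traces = [{"error": random.random() < 0.02} for _ in range(100_000)]  # ~2% errors

kept = [t for t in traces if t["error"] or random.random() < 0.10]
errors_total = sum(t["error"] for t in traces)
errors_kept = sum(t["error"] for t in kept)

print(f"kept {len(kept):,} of {len(traces):,} traces; "
      f"all {errors_kept} error traces retained")
```

The retained volume lands near 12% of the original while no error trace is lost.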
Grafana Data Source Configuration
```yaml
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
      tracesToMetrics:
        datasourceUid: prometheus
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
```
Correlating Traces, Logs, and Metrics
With the LGTM stack you can pivot between signals: a trace links to the logs emitted during its spans, and those link on to the metrics for the same time window:
+------------------------------------------------------------------+
| OBSERVABILITY CORRELATION |
+------------------------------------------------------------------+
| |
| TRACE (Tempo) |
| ┌────────────────────────────────────────────────────────────┐ |
| │ TraceID: abc123 │ |
| │ Span: CreateSale (45ms) │ |
| │ └─ Span: ValidateInventory (12ms) │ |
| │ └─ Span: ProcessPayment (28ms) [ERROR] │ |
| │ └─ code.filepath: PaymentService.cs:142 │ |
| └────────────────────────────────────────────────────────────┘ |
| │ |
| │ Click "Logs for this span" |
| ▼ |
| LOGS (Loki) |
| ┌────────────────────────────────────────────────────────────┐ |
| │ 2026-01-24 10:15:32 ERROR Payment declined: Insufficient │ |
| │ 2026-01-24 10:15:32 INFO Rolling back transaction abc123 │ |
| └────────────────────────────────────────────────────────────┘ |
| │ |
| │ Click "Metrics for this time" |
| ▼ |
| METRICS (Prometheus) |
| ┌────────────────────────────────────────────────────────────┐ |
| │ payment_failures_total{reason="insufficient_funds"} = 47 │ |
| │ payment_latency_p99 = 2.3s │ |
| └────────────────────────────────────────────────────────────┘ |
| |
+------------------------------------------------------------------+
Reference
For the complete observability strategy and its risk mitigations, see the sampling strategy in Section 25.19 below.
25.19 Observability Sampling Strategy
Overview
At production scale, collecting 100% of traces, metrics, and logs becomes prohibitively expensive. A thoughtful sampling strategy reduces costs while preserving visibility into errors and performance issues.
| Attribute | Selection |
|---|---|
| Approach | Head-based + Tail-based Sampling |
| Error Retention | 100% of errors sampled |
| Normal Traffic | 1-10% sampled based on volume |
| Cost Target | < $500/month for LGTM stack |
Sampling Strategy Matrix
+------------------------------------------------------------------+
| SAMPLING STRATEGY MATRIX |
+------------------------------------------------------------------+
| |
| SIGNAL TYPE SAMPLE RATE CONDITION |
| ───────────────────────────────────────────────────────────── |
| Traces (errors) 100% status_code >= 500 OR error=true|
| Traces (slow) 100% duration > 2s |
| Traces (normal) 5% All other traces |
| Traces (health) 0% /health, /metrics endpoints |
| |
| Metrics 100% Always (cheap to store) |
| Metrics (custom) Aggregated Sum/avg over 15s window |
| |
| Logs (ERROR+) 100% severity >= ERROR |
| Logs (WARN) 50% severity == WARN |
| Logs (INFO) 10% severity == INFO |
| Logs (DEBUG) 0% Production only; 100% in dev |
| Logs (health) 0% Health check logs suppressed |
| |
+------------------------------------------------------------------+
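Plugging representative volumes into this matrix shows the storage effect. A back-of-the-envelope sketch assuming 1M traces/day with 1% errors and 2% slow requests (both assumed figures, not measurements):

```python
# Back-of-the-envelope retention math for the sampling matrix above.
total = 1_000_000                   # assumed traces/day
errors = int(total * 0.01)          # kept at 100%
slow = int(total * 0.02)            # kept at 100%
normal = total - errors - slow      # kept at 5%

stored = errors + slow + int(normal * 0.05)
print(f"stored {stored:,} of {total:,} traces/day")
```

Under these assumptions only about 8% of trace volume is stored, yet every error and slow trace survives.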
Head-Based Sampling
Decision made at trace start. Simple but may miss errors that occur later in the trace.
```csharp
// Program.cs - Head-based sampling configuration
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05))) // 5% sampling
        .AddAspNetCoreInstrumentation(options =>
        {
            // Always exclude health endpoints
            options.Filter = httpContext =>
                !httpContext.Request.Path.StartsWithSegments("/health") &&
                !httpContext.Request.Path.StartsWithSegments("/metrics");
        })
    );
```
Tail-Based Sampling (Recommended)
Decision made after trace completes. Ensures all errors and slow requests are captured.
# otel-collector-config.yaml
processors:
  # Tail-based sampling processor
  tail_sampling:
    decision_wait: 10s                  # Wait for span completion
    num_traces: 100000                  # Max traces in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Policy 1: Always sample errors (100%)
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Policy 2: Always sample slow requests (100%)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000            # > 2 seconds
      # Policy 3: Always sample payment operations (100%)
      - name: payments-policy
        type: string_attribute
        string_attribute:
          key: http.route
          values:
            - /api/v1/payments
            - /api/v1/refunds
          enabled_regex_matching: false
      # Policy 4: Sample normal traffic (5%)
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]
Log Sampling Configuration
# Loki pipeline configuration for log sampling
pipeline_stages:
  # Drop health check logs entirely
  - match:
      selector: '{job="pos-api"} |~ "GET /health"'
      action: drop
  # Drop metrics endpoint logs
  - match:
      selector: '{job="pos-api"} |~ "GET /metrics"'
      action: drop
  # Sample INFO logs at 10%
  - match:
      selector: '{level="info"}'
      stages:
        - sampling:
            rate: 0.1
  # Sample WARN logs at 50%
  - match:
      selector: '{level="warn"}'
      stages:
        - sampling:
            rate: 0.5
  # Keep 100% of ERROR and above
  - match:
      selector: '{level=~"error|fatal|critical"}'
      stages:
        - sampling:
            rate: 1.0
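The per-level keep rates above can be condensed into a single function. A minimal Python sketch (the `should_keep` helper and its rate table are illustrative, mirroring the pipeline's 100/50/10/0 split; they are not part of the Loki configuration):

```python
import random

# Keep probability per severity level, mirroring the pipeline above:
# 100% ERROR+, 50% WARN, 10% INFO, 0% DEBUG (production).
KEEP_RATES = {
    "debug": 0.0, "info": 0.1, "warn": 0.5,
    "error": 1.0, "fatal": 1.0, "critical": 1.0,
}

def should_keep(level: str, rng=None) -> bool:
    """Decide whether to keep a log line at the given severity."""
    rate = KEEP_RATES.get(level.lower(), 1.0)  # unknown levels: keep
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    return (rng or random).random() < rate
```

Probabilistic sampling like this is stateless and cheap, but note that rates below 100% make absolute log counts unreliable; use metrics, not log volume, for counting.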
Application-Level Log Filtering
// Program.cs - Serilog with level-based filtering
// (requires the Serilog.AspNetCore and Serilog.Sinks.OpenTelemetry packages)
using Serilog;
using Serilog.Events;
using Serilog.Filters;
using Serilog.Sinks.OpenTelemetry;

builder.Host.UseSerilog((context, config) =>
{
    config
        .MinimumLevel.Information()
        .MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
        .MinimumLevel.Override("Microsoft.EntityFrameworkCore", LogEventLevel.Warning)
        // Don't log health checks or metrics scrapes
        .Filter.ByExcluding(Matching.WithProperty<string>("RequestPath", p =>
            p.Contains("/health") || p.Contains("/metrics")))
        // Sample INFO logs at 10% in production
        .Filter.ByExcluding(e =>
            e.Level == LogEventLevel.Information &&
            context.HostingEnvironment.IsProduction() &&
            Random.Shared.NextDouble() > 0.1) // Keep 10%
        .WriteTo.Console()
        .WriteTo.OpenTelemetry(options =>
        {
            options.Endpoint = "http://otel-collector:4317";
            options.Protocol = OtlpProtocol.Grpc;
        });
});
Sampling Cost Analysis
+------------------------------------------------------------------+
| MONTHLY COST COMPARISON |
+------------------------------------------------------------------+
| |
| SCENARIO: 10 API instances, 1000 req/sec, 30-day retention |
| |
| WITHOUT SAMPLING WITH SAMPLING |
| ───────────────────────── ───────────────────────── |
| Traces: Traces: |
| 2.6B traces/month 130M traces/month (5%) |
| Storage: ~2.6 TB Storage: ~130 GB |
| Cost: ~$2,000/month Cost: ~$100/month |
| |
| Logs: Logs: |
| 5B log lines/month 500M log lines (10% avg) |
| Storage: ~5 TB Storage: ~500 GB |
| Cost: ~$3,000/month Cost: ~$300/month |
| |
| TOTAL: ~$5,000/month TOTAL: ~$400/month |
| ───────────────────────────────────────────────────────────── |
| SAVINGS: 92% reduction with smart sampling |
| |
+------------------------------------------------------------------+
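The volumes in the box follow from simple arithmetic. A quick Python sketch (the two log lines per request is an assumed average chosen to reproduce the ~5B figure; the storage and dollar amounts are scenario estimates, not computed here):

```python
# Scenario from the cost box above: 1000 req/sec over a 30-day month.
REQ_PER_SEC = 1_000
SECONDS_PER_MONTH = 30 * 24 * 3600            # 2,592,000 s

traces = REQ_PER_SEC * SECONDS_PER_MONTH      # 2,592,000,000 (~2.6B traces/month)
sampled_traces = int(traces * 0.05)           # 5% kept -> 129,600,000 (~130M)

LINES_PER_REQ = 2                             # assumed average log lines per request
log_lines = traces * LINES_PER_REQ            # 5,184,000,000 (~5B lines/month)
sampled_lines = int(log_lines * 0.10)         # 10% average kept -> ~518M (~500M)

print(f"traces:    {traces:,} -> {sampled_traces:,}")
print(f"log lines: {log_lines:,} -> {sampled_lines:,}")
```

The 92% savings figure follows directly: both traces and logs scale roughly linearly with retained volume, so keeping 5-10% of the data keeps roughly 5-10% of the storage bill.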
Preserving Debug Capability
While sampling reduces volume, ensure debugging capability is preserved:
// Enable full sampling for specific requests via header
public class DynamicSamplingMiddleware
{
    private readonly RequestDelegate _next;

    public DynamicSamplingMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // Check for debug header
        if (context.Request.Headers.TryGetValue("X-Force-Trace", out var forceTrace) &&
            forceTrace == "true")
        {
            // Tag the current activity so downstream sampling policies can retain it
            Activity.Current?.SetTag("sampling.priority", 1);
            Activity.Current?.SetTag("debug.forced", true);
        }

        await _next(context);
    }
}

// Register in Program.cs: app.UseMiddleware<DynamicSamplingMiddleware>();
// Usage: Add header to force sampling
// curl -H "X-Force-Trace: true" https://api.posplatform.io/api/v1/sales
Sampling Metrics
Monitor sampling effectiveness:
# prometheus/rules/sampling-rules.yml
groups:
  - name: sampling-metrics
    rules:
      - record: otel_traces_sampled_total
        expr: sum(rate(otel_processor_tail_sampling_count_traces_sampled[5m]))
      - record: otel_traces_dropped_total
        expr: sum(rate(otel_processor_tail_sampling_count_traces_dropped[5m]))
      - record: otel_sampling_rate
        expr: |
          otel_traces_sampled_total / (otel_traces_sampled_total + otel_traces_dropped_total)
      - alert: SamplingRateTooLow
        expr: otel_sampling_rate < 0.01
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Trace sampling rate is below 1%"
          description: "Consider increasing sampling or checking for data loss"
      - alert: ErrorsNotSampled
        expr: |
          rate(http_server_requests_total{status=~"5.."}[5m]) >
          rate(otel_traces_sampled{has_error="true"}[5m]) * 1.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Errors may not be properly sampled"
          description: "More HTTP 5xx errors than sampled error traces"
Sampling Decision Flowchart
┌─────────────────────────────────────────────────────────────────┐
│ SAMPLING DECISION FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ New Request Arrives │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Health/Metrics │──Yes──► DROP (0%) │
│ │ endpoint? │ │
│ └────────┬────────┘ │
│ │ No │
│ ▼ │
│ ┌─────────────────┐ │
│ │ X-Force-Trace │──Yes──► SAMPLE (100%) │
│ │ header present? │ │
│ └────────┬────────┘ │
│ │ No │
│ ▼ │
│ [Request Processes...] │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Error occurred? │──Yes──► SAMPLE (100%) │
│ └────────┬────────┘ │
│ │ No │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Duration > 2s? │──Yes──► SAMPLE (100%) │
│ └────────┬────────┘ │
│ │ No │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Payment route? │──Yes──► SAMPLE (100%) │
│ └────────┬────────┘ │
│ │ No │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Random 5%? │──Yes──► SAMPLE │
│ └────────┬────────┘ │
│ │ No │
│ ▼ │
│ DROP │
│ │
└─────────────────────────────────────────────────────────────────┘
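The flowchart can be condensed into a single decision function. A Python illustration (the types and names are hypothetical, and a real tail-based sampler evaluates completed traces at the collector rather than individual requests in-process):

```python
import random
from dataclasses import dataclass

PAYMENT_ROUTES = {"/api/v1/payments", "/api/v1/refunds"}

@dataclass
class RequestInfo:
    path: str
    force_trace: bool = False    # X-Force-Trace header present
    error: bool = False          # status_code >= 500 or error flag set
    duration_ms: float = 0.0

def sample_decision(req: RequestInfo, rng=None) -> bool:
    """Walk the flowchart top to bottom and return the keep/drop decision."""
    if req.path in ("/health", "/metrics"):
        return False                            # DROP (0%)
    if req.force_trace:
        return True                             # forced debug trace (100%)
    if req.error or req.duration_ms > 2_000:
        return True                             # errors and slow requests (100%)
    if req.path in PAYMENT_ROUTES:
        return True                             # payment operations (100%)
    return (rng or random).random() < 0.05      # 5% of remaining normal traffic
```

Note the ordering matters: the unconditional drops and keeps must be evaluated before the probabilistic fallback, exactly as the tail-sampling policies are ordered in the collector configuration.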
25.20 Summary
This chapter covered the platform's monitoring stack end to end:
- Architecture: Prometheus + Grafana + AlertManager stack
- Metrics: Business SLIs and infrastructure metrics with thresholds
- Prometheus Config: Complete scrape configuration
- Alert Rules: P1-P4 severity levels with escalation
- Grafana Dashboard: Production-ready JSON dashboard
- Runbooks: Step-by-step incident response procedures
- Sampling: Head- and tail-based sampling to control observability cost
Next Chapter: Chapter 26: Security Compliance
“You cannot improve what you do not measure.”
Document Information
| Attribute | Value |
|---|---|
| Version | 5.0.0 |
| Created | 2025-12-29 |
| Updated | 2026-02-25 |
| Author | Claude Code |
| Status | Active |
| Part | VII - Operations |
| Chapter | 25 of 32 |
This chapter is part of the POS Blueprint Book. All content is self-contained.