Chapter 32: Disaster Recovery
Overview
This chapter defines the disaster recovery strategy, backup procedures, failover architecture, and recovery processes for the POS Platform.
Recovery Objectives
RTO/RPO Requirements by Data Type
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ RECOVERY TIME OBJECTIVE (RTO) / RECOVERY POINT OBJECTIVE (RPO) │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌───────────────────┬─────────────┬─────────────┬──────────────────────────────────────┐
│ Data Category │ RTO │ RPO │ Justification │
├───────────────────┼─────────────┼─────────────┼──────────────────────────────────────┤
│ Transaction Data │ < 1 hour │ 0 (no loss) │ Revenue-critical, legal requirements │
│ Inventory Data │ < 4 hours │ < 1 hour │ Business operations │
│ Customer Data │ < 4 hours │ < 1 hour │ Order fulfillment │
│ Product Catalog │ < 8 hours │ < 24 hours │ Can rebuild from source │
│ Audit Logs │ < 24 hours │ < 1 hour │ Compliance requirements │
│ Analytics Data │ < 72 hours │ < 24 hours │ Non-critical, can rebuild │
│ Configuration │ Immediate │ 0 (no loss) │ Stored in Git │
└───────────────────┴─────────────┴─────────────┴──────────────────────────────────────┘
Recovery Tier Definitions:
┌─────────┬─────────────────────────────────────────────────────────────────────────────┐
│ TIER 1 │ MISSION CRITICAL │
│ │ RTO: < 1 hour | RPO: 0 │
│ │ - Active transactions │
│ │ - Payment processing │
│ │ - Real-time inventory │
│ │ Strategy: Synchronous replication, hot standby │
├─────────┼─────────────────────────────────────────────────────────────────────────────┤
│ TIER 2 │ BUSINESS CRITICAL │
│ │ RTO: < 4 hours | RPO: < 1 hour │
│ │ - Customer data │
│ │ - Order history │
│ │ - Inventory levels │
│ │ Strategy: Asynchronous replication, warm standby │
├─────────┼─────────────────────────────────────────────────────────────────────────────┤
│ TIER 3 │ IMPORTANT │
│ │ RTO: < 24 hours | RPO: < 24 hours │
│ │ - Product catalog │
│ │ - Reports │
│ │ - Historical analytics │
│ │ Strategy: Daily backups, cold standby │
├─────────┼─────────────────────────────────────────────────────────────────────────────┤
│ TIER 4 │ NON-CRITICAL │
│ │ RTO: < 72 hours | RPO: < 72 hours │
│ │ - Archived data │
│ │ - Legacy exports │
│ │ Strategy: Weekly backups, rebuild if needed │
└─────────┴─────────────────────────────────────────────────────────────────────────────┘
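Tier 1's "synchronous replication, hot standby" strategy maps onto a small set of PostgreSQL settings. A minimal sketch, assuming the hot standby connects with application_name hot_standby_1 (an illustrative name, not a fixed platform value) and the postgres-primary container used throughout this chapter:
#!/bin/bash
# Sketch: make commits wait for the Tier 1 hot standby (illustrative names)
docker exec postgres-primary psql -U postgres -c \
"ALTER SYSTEM SET synchronous_standby_names = 'hot_standby_1';"
docker exec postgres-primary psql -U postgres -c \
"ALTER SYSTEM SET synchronous_commit = 'on';"
docker exec postgres-primary psql -U postgres -c "SELECT pg_reload_conf();"
Tier 2 data tolerates up to an hour of loss, so its replicas stay on the default asynchronous commit.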
Backup Strategy
Database Backup Architecture
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ DATABASE BACKUP STRATEGY │
└─────────────────────────────────────────────────────────────────────────────────────┘
PostgreSQL Primary
│
┌────────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────┐ ┌─────────────────┐
│ Streaming │ │ WAL │ │ pg_dump │
│ Replication │ │ Archiving│ │ (Daily) │
│ (Real-time) │ │ (PITR) │ │ │
└────────┬────────┘ └────┬─────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────┐ ┌─────────────────┐
│ Hot Standby │ │ WAL │ │ Backup Storage │
│ (Same Region) │ │ Archive │ │ (Encrypted) │
│ │ │ (S3/NFS) │ │ │
└─────────────────┘ └──────────┘ └─────────────────┘
│ │ │
│ │ │
└───────────────┼────────────────┘
│
▼
┌─────────────────────┐
│ Offsite Backup │
│ (Different DC) │
│ S3 Cross-Region │
└─────────────────────┘
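The same-region hot standby shown above is typically seeded with pg_basebackup from the primary. A minimal sketch run on the standby host; the host name, replication role, and slot name are illustrative:
#!/bin/bash
# Sketch: seed the hot standby from the primary (placeholder names)
pg_basebackup \
-h postgres-primary.internal \
-U replicator \
-D /var/lib/postgresql/data \
-X stream \
-C -S standby_slot_1 \
-R
The -R flag writes standby.signal and the primary connection settings into the data directory, so the node starts in standby mode and begins streaming immediately.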
BACKUP SCHEDULE:
┌──────────────────┬───────────────┬─────────────────┬────────────────────────────────┐
│ Backup Type │ Frequency │ Retention │ Storage Location │
├──────────────────┼───────────────┼─────────────────┼────────────────────────────────┤
│ WAL Archiving │ Continuous │ 7 days │ Local NFS + S3 │
│ pg_dump (Full) │ Daily 2AM │ 30 days │ S3 (encrypted) │
│ pg_dump (Weekly) │ Sunday 3AM │ 90 days │ S3 + Glacier │
│ Monthly Archive │ 1st of month │ 1 year │ Glacier │
│ Yearly Archive │ Jan 1st │ 7 years │ Glacier Deep Archive │
└──────────────────┴───────────────┴─────────────────┴────────────────────────────────┘
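The S3 retention tiers above (and the lifecycle policy the cleanup step below relies on) can be encoded as bucket lifecycle rules. A minimal sketch covering only the daily prefix; the rule name and the choice to expire rather than transition are assumptions:
#!/bin/bash
# Sketch: expire daily dumps after 30 days (the weekly/monthly/yearly tiers
# would transition to Glacier / Deep Archive along the same lines)
aws s3api put-bucket-lifecycle-configuration \
--bucket pos-backups \
--lifecycle-configuration '{
"Rules": [{
"ID": "expire-daily-dumps",
"Filter": { "Prefix": "postgres/daily/" },
"Status": "Enabled",
"Expiration": { "Days": 30 }
}]
}'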
Backup Scripts
#!/bin/bash
# File: /pos-platform/scripts/backup/daily-backup.sh
# Daily database backup script
set -eo pipefail
#=============================================
# CONFIGURATION
#=============================================
BACKUP_DIR="/backups/postgres/daily"
S3_BUCKET="s3://pos-backups/postgres"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="pos_db_${DATE}.sql.gz"
LOG_FILE="/var/log/pos-backup.log"
#=============================================
# FUNCTIONS
#=============================================
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
send_alert() {
# Send to Slack on failure
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-type: application/json' \
-d "{\"text\": \"BACKUP ALERT: $1\"}"
}
#=============================================
# BACKUP PROCESS
#=============================================
backup_database() {
log "Starting database backup..."
# Plain-format dump streamed through gzip so the .sql.gz name and the
# gunzip-based recovery scripts stay consistent
docker exec postgres-primary pg_dump \
-U pos_admin \
-d pos_db \
--format=plain \
| gzip -9 > "${BACKUP_DIR}/${BACKUP_FILE}"
# Verify archive integrity
if gunzip -t "${BACKUP_DIR}/${BACKUP_FILE}"; then
log "Backup verified successfully"
else
log "ERROR: Backup verification failed"
send_alert "Backup verification failed for ${BACKUP_FILE}"
exit 1
fi
log "Backup completed: ${BACKUP_FILE}"
}
upload_to_s3() {
log "Uploading to S3..."
# Encrypt and upload
aws s3 cp \
"${BACKUP_DIR}/${BACKUP_FILE}" \
"${S3_BUCKET}/daily/${BACKUP_FILE}" \
--sse aws:kms \
--sse-kms-key-id "$KMS_KEY_ID"
log "Upload completed"
}
cleanup_old_backups() {
log "Cleaning up old backups..."
# Local cleanup
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +$RETENTION_DAYS -delete
# S3 cleanup (handled by lifecycle policy)
log "Cleanup completed"
}
#=============================================
# PER-TENANT BACKUP
#=============================================
backup_tenant_data() {
log "Starting per-tenant backups..."
# Get all active tenant schemas (unaligned output, one per line)
TENANTS=$(docker exec postgres-primary psql -U pos_admin -d pos_db -t -A -c \
"SELECT schema_name FROM tenants WHERE status = 'active';")
for TENANT in $TENANTS; do
TENANT_BACKUP="${BACKUP_DIR}/tenants/${TENANT}_${DATE}.sql.gz"
log "Backing up tenant: $TENANT"
# Plain-format schema dump, gzipped to match the .sql.gz naming
docker exec postgres-primary pg_dump \
-U pos_admin \
-d pos_db \
--schema="${TENANT}" \
--format=plain \
| gzip -9 > "$TENANT_BACKUP"
# Upload tenant backup with the same KMS key as the full dump
aws s3 cp "$TENANT_BACKUP" \
"${S3_BUCKET}/tenants/${TENANT}/${TENANT}_${DATE}.sql.gz" \
--sse aws:kms \
--sse-kms-key-id "$KMS_KEY_ID"
log "Tenant backup completed: $TENANT"
done
}
#=============================================
# MAIN
#=============================================
main() {
log "=========================================="
log "Daily Backup Started"
log "=========================================="
mkdir -p "$BACKUP_DIR/tenants"
backup_database
backup_tenant_data
upload_to_s3
cleanup_old_backups
log "=========================================="
log "Daily Backup Completed Successfully"
log "=========================================="
}
main "$@"
WAL Archiving Configuration
# File: /pos-platform/docker/postgres/postgresql.conf (excerpt)
# WAL Settings
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://pos-backups/wal/%f --sse aws:kms'
archive_timeout = 60
# Replication Settings
max_wal_senders = 5
wal_keep_size = 1GB
hot_standby = on
# Recovery Settings (for standby)
restore_command = 'aws s3 cp s3://pos-backups/wal/%f %p'
recovery_target_timeline = 'latest'
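Archiving failures are easy to miss, so it is worth polling pg_stat_archiver alongside the backup jobs. A minimal sketch; thresholds and alert routing are left to the monitoring stack:
#!/bin/bash
# Sketch: report WAL archiver progress and failures
docker exec postgres-primary psql -U postgres -d pos_db -t -A -F ' | ' -c \
"SELECT archived_count, last_archived_wal, failed_count, last_failed_wal
FROM pg_stat_archiver;"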
Failover Architecture
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ MULTI-REGION FAILOVER ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ DNS (Route53) │
│ Health-based │
│ Failover │
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ │ ▼
┌───────────────────┐ │ ┌───────────────────┐
│ PRIMARY REGION │ │ │ SECONDARY REGION │
│ (US-East-1) │ │ │ (US-West-2) │
│ │ │ │ │
│ ┌─────────────┐ │ │ │ ┌─────────────┐ │
│ │ Load │ │ │ │ │ Load │ │
│ │ Balancer │ │ │ │ │ Balancer │ │
│ └──────┬──────┘ │ │ │ └──────┬──────┘ │
│ │ │ │ │ │ │
│ ┌──────┴──────┐ │ │ │ ┌──────┴──────┐ │
│ │ API (x3) │ │ │ │ │ API (x2) │ │
│ │ Active │ │ │ │ │ Standby │ │
│ └──────┬──────┘ │ │ │ └──────┬──────┘ │
│ │ │ │ │ │ │
│ ┌──────┴──────┐ │ Sync │ │ ┌──────┴──────┐ │
│ │ PostgreSQL │ │◄─────────┼──────│ │ PostgreSQL │ │
│ │ PRIMARY │ │ (Async) │ │ │ REPLICA │ │
│ └─────────────┘ │ │ │ └─────────────┘ │
│ │ │ │ │
│ ┌─────────────┐ │ Sync │ │ ┌─────────────┐ │
│ │ Redis │ │◄─────────┼──────│ │ Redis │ │
│ │ PRIMARY │ │ │ │ │ REPLICA │ │
│ └─────────────┘ │ │ │ └─────────────┘ │
└───────────────────┘ │ └───────────────────┘
│
NORMAL OPERATION:
100% traffic → Primary
FAILOVER STATE:
100% traffic → Secondary
FAILOVER TRIGGERS:
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Trigger │ Detection Time │ Failover Time │ Auto/Manual │
├──────────────────────────────────┼────────────────┼───────────────┼────────────────┤
│ Load balancer health check fail │ 30 seconds │ 1 minute │ Automatic │
│ Database connection failure │ 1 minute │ 5 minutes │ Automatic │
│ Region-wide outage (AWS) │ 5 minutes │ 10 minutes │ Automatic │
│ Planned maintenance │ N/A │ 0 (graceful) │ Manual │
│ Security incident │ Immediate │ 5 minutes │ Manual │
└─────────────────────────────────────────────────────────────────────────────────────┘
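The DNS failover at the top of the diagram is driven by a Route53 health check plus a failover record pair. A sketch of the primary side only; the hosted zone ID, domain names, and health check ID are placeholders:
#!/bin/bash
# Sketch: health check against the primary region's public endpoint
aws route53 create-health-check \
--caller-reference "pos-api-primary-$(date +%s)" \
--health-check-config '{
"Type": "HTTPS",
"FullyQualifiedDomainName": "api-us-east-1.pos.example.com",
"ResourcePath": "/health",
"Port": 443,
"RequestInterval": 30,
"FailureThreshold": 3
}'

# Sketch: PRIMARY half of the failover record pair (the SECONDARY record
# in us-west-2 is defined the same way, without a health check)
aws route53 change-resource-record-sets \
--hosted-zone-id Z0000000000000 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.pos.example.com",
"Type": "CNAME",
"SetIdentifier": "us-east-1-primary",
"Failover": "PRIMARY",
"TTL": 60,
"HealthCheckId": "<health-check-id>",
"ResourceRecords": [{ "Value": "api-us-east-1.pos.example.com" }]
}
}]
}'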
Recovery Procedures
Complete Database Recovery
#!/bin/bash
# File: /pos-platform/scripts/recovery/full-db-recovery.sh
# Complete database recovery from backup
set -eo pipefail
#=============================================
# RECOVERY MODES
#=============================================
# 1. full - Restore to latest available state
# 2. pitr - Point-in-time recovery to specific timestamp
# 3. tenant - Restore a specific tenant only (see tenant-recovery.sh)
RECOVERY_MODE=${1:-full}
TARGET_TIME=${2:-}
TENANT_ID=${3:-}
#=============================================
# CONFIGURATION
#=============================================
S3_BUCKET="s3://pos-backups"
WORK_DIR="/tmp/recovery_$(date +%s)"
LOG_FILE="/var/log/pos-recovery.log"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] RECOVERY: $1" | tee -a "$LOG_FILE"
}
#=============================================
# STEP 1: STOP SERVICES
#=============================================
stop_services() {
log "Stopping API services..."
docker-compose stop pos-api
log "Services stopped"
}
#=============================================
# STEP 2: DOWNLOAD BACKUP
#=============================================
download_backup() {
log "Downloading backup files..."
mkdir -p "$WORK_DIR"
# Get latest backup
LATEST_BACKUP=$(aws s3 ls "${S3_BUCKET}/postgres/daily/" | \
sort | tail -1 | awk '{print $4}')
aws s3 cp "${S3_BUCKET}/postgres/daily/${LATEST_BACKUP}" \
"${WORK_DIR}/backup.sql.gz"
log "Downloaded: ${LATEST_BACKUP}"
}
#=============================================
# STEP 3: VERIFY BACKUP INTEGRITY
#=============================================
verify_backup() {
log "Verifying backup integrity..."
# Check the archive is intact (the command runs inside the if so that
# set -e does not abort before the error is logged)
if ! gunzip -t "${WORK_DIR}/backup.sql.gz"; then
log "ERROR: Backup file is corrupted"
exit 1
fi
log "Backup verified"
}
#=============================================
# STEP 4: PREPARE DATABASE
#=============================================
prepare_database() {
log "Preparing database for recovery..."
# Disconnect remaining clients from the live database
docker exec postgres-primary psql -U postgres -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'pos_db';"
# Create a clean recovery database
docker exec postgres-primary psql -U postgres -c \
"DROP DATABASE IF EXISTS pos_db_recovery;"
docker exec postgres-primary psql -U postgres -c \
"CREATE DATABASE pos_db_recovery;"
log "Recovery database prepared"
}
#=============================================
# STEP 5: RESTORE DATA
#=============================================
restore_data() {
log "Restoring data..."
# Copy backup to container
docker cp "${WORK_DIR}/backup.sql.gz" postgres-primary:/tmp/
# Decompress and restore, aborting on the first SQL error
docker exec postgres-primary bash -c \
"gunzip -c /tmp/backup.sql.gz | psql -v ON_ERROR_STOP=1 -U postgres -d pos_db_recovery"
log "Data restored"
}
#=============================================
# STEP 6: POINT-IN-TIME RECOVERY (if needed)
#=============================================
apply_wal_logs() {
if [ "$RECOVERY_MODE" == "pitr" ]; then
log "Applying WAL until: $TARGET_TIME"
# NOTE: PITR replays WAL against a physical base backup (e.g. pg_basebackup)
# at the cluster level; it cannot be layered on top of the logical pg_dump
# restored above. This branch assumes the data directory was seeded from
# the most recent base backup.
# Optionally pre-stage WAL segments locally; restore_command in
# postgresql.conf will otherwise pull them from S3 on demand
aws s3 sync "${S3_BUCKET}/wal/" "${WORK_DIR}/wal/"
# Set the recovery target and signal targeted recovery (PostgreSQL 12+)
docker exec postgres-primary bash -c "
echo \"recovery_target_time = '$TARGET_TIME'\" >> /var/lib/postgresql/data/postgresql.auto.conf
touch /var/lib/postgresql/data/recovery.signal
"
# Restart the server so it enters recovery and replays WAL to the target
docker restart postgres-primary
log "PITR completed"
fi
}
#=============================================
# STEP 7: VERIFY RECOVERY
#=============================================
verify_recovery() {
log "Verifying recovery..."
# Check table counts
TABLES=$(docker exec postgres-primary psql -U postgres -d pos_db_recovery -t -c \
"SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
log "Restored tables: $TABLES"
# Check transaction count
TX_COUNT=$(docker exec postgres-primary psql -U postgres -d pos_db_recovery -t -c \
"SELECT COUNT(*) FROM transactions;")
log "Restored transactions: $TX_COUNT"
# Check latest transaction
LATEST_TX=$(docker exec postgres-primary psql -U postgres -d pos_db_recovery -t -c \
"SELECT MAX(created_at) FROM transactions;")
log "Latest transaction: $LATEST_TX"
}
#=============================================
# STEP 8: SWAP DATABASES
#=============================================
swap_databases() {
log "Swapping databases..."
# Rename databases
docker exec postgres-primary psql -U postgres -c \
"ALTER DATABASE pos_db RENAME TO pos_db_old;"
docker exec postgres-primary psql -U postgres -c \
"ALTER DATABASE pos_db_recovery RENAME TO pos_db;"
log "Databases swapped"
}
#=============================================
# STEP 9: RESTART SERVICES
#=============================================
restart_services() {
log "Restarting services..."
docker-compose start pos-api
# Wait for health checks
sleep 30
# Verify health
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health || echo "000")
if [ "$HTTP_CODE" -eq 200 ]; then
log "Services healthy"
else
log "ERROR: Services not healthy after recovery"
exit 1
fi
}
#=============================================
# STEP 10: CLEANUP
#=============================================
cleanup() {
log "Cleaning up..."
rm -rf "$WORK_DIR"
# Keep the old database for 24 hours, then drop it via an at job
echo "docker exec postgres-primary psql -U postgres -c 'DROP DATABASE pos_db_old;'" | at now + 24 hours
log "Cleanup scheduled"
}
#=============================================
# MAIN
#=============================================
main() {
log "=========================================="
log "DATABASE RECOVERY STARTED"
log "Mode: $RECOVERY_MODE"
[ -n "$TARGET_TIME" ] && log "Target Time: $TARGET_TIME"
log "=========================================="
stop_services
download_backup
verify_backup
prepare_database
restore_data
apply_wal_logs
verify_recovery
swap_databases
restart_services
cleanup
log "=========================================="
log "DATABASE RECOVERY COMPLETED"
log "=========================================="
}
main "$@"
Tenant-Specific Recovery
#!/bin/bash
# File: /pos-platform/scripts/recovery/tenant-recovery.sh
# Restore specific tenant data
set -e
TENANT_ID=$1
BACKUP_DATE=${2:-latest}
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] TENANT RECOVERY: $1"
}
#=============================================
# FIND TENANT BACKUP
#=============================================
find_backup() {
log "Finding backup for tenant: $TENANT_ID"
if [ "$BACKUP_DATE" == "latest" ]; then
BACKUP_FILE=$(aws s3 ls "s3://pos-backups/tenants/${TENANT_ID}/" | \
sort | tail -1 | awk '{print $4}')
else
BACKUP_FILE="${TENANT_ID}_${BACKUP_DATE}.sql.gz"
fi
log "Using backup: $BACKUP_FILE"
}
#=============================================
# RESTORE TENANT SCHEMA
#=============================================
restore_tenant() {
log "Restoring tenant schema..."
# Download the backup and copy it into the database container
aws s3 cp "s3://pos-backups/tenants/${TENANT_ID}/${BACKUP_FILE}" /tmp/
docker cp "/tmp/${BACKUP_FILE}" "postgres-primary:/tmp/${BACKUP_FILE}"
# Drop existing schema (with confirmation in production)
docker exec postgres-primary psql -U postgres -d pos_db -c \
"DROP SCHEMA IF EXISTS ${TENANT_ID} CASCADE;"
# Restore schema
docker exec postgres-primary bash -c \
"gunzip -c /tmp/${BACKUP_FILE} | psql -v ON_ERROR_STOP=1 -U postgres -d pos_db"
log "Tenant restored: $TENANT_ID"
}
#=============================================
# MAIN
#=============================================
main() {
if [ -z "$TENANT_ID" ]; then
echo "Usage: $0 <tenant_id> [backup_date]"
exit 1
fi
find_backup
restore_tenant
log "Recovery completed for tenant: $TENANT_ID"
}
main "$@"
DR Testing Schedule
# Disaster Recovery Test Schedule
## Quarterly Tests
### Q1 (January)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Full failover drill | Week 3 | 4 hours | Platform Team |
| Backup restoration test | Week 4 | 2 hours | DBA |
### Q2 (April)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Tenant recovery test | Week 2 | 2 hours | Platform Team |
| Network failover test | Week 3 | 2 hours | Network Team |
### Q3 (July)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Full failover drill | Week 3 | 4 hours | Platform Team |
| PITR recovery test | Week 4 | 3 hours | DBA |
### Q4 (October)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Annual DR exercise | Week 2-3 | 8 hours | All Teams |
| Tabletop exercise | Week 4 | 2 hours | Leadership |
## Monthly Tests
- Automated backup verification
- Replica lag monitoring (see the query sketch below)
- Health check validation
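For the replica lag check, a minimal sketch against pg_stat_replication on the primary; turning the result into an alert against the Tier 2 one-hour RPO is left to the monitoring stack:
#!/bin/bash
# Sketch: report streaming replica lag in seconds
docker exec postgres-primary psql -U postgres -d pos_db -t -A -F ' | ' -c \
"SELECT application_name, state, sync_state,
COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS replay_lag_seconds
FROM pg_stat_replication;"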
## Test Procedure
### Pre-Test Checklist
- [ ] Notify stakeholders
- [ ] Confirm maintenance window
- [ ] Verify backup freshness (see the check below)
- [ ] Prepare rollback plan
- [ ] Stage monitoring dashboards
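A minimal freshness check for the checklist item above, assuming the S3 layout used by daily-backup.sh and GNU date on the host:
#!/bin/bash
# Sketch: fail if the newest daily dump is older than ~26 hours
LATEST_DATE=$(aws s3 ls s3://pos-backups/postgres/daily/ | sort | tail -1 | awk '{print $1" "$2}')
AGE_SECONDS=$(( $(date +%s) - $(date -d "$LATEST_DATE" +%s) ))
if [ "$AGE_SECONDS" -gt $(( 26 * 3600 )) ]; then
echo "STALE: newest daily backup is ${AGE_SECONDS}s old"
exit 1
fi
echo "OK: newest daily backup is ${AGE_SECONDS}s old"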
### During Test
- [ ] Document all actions
- [ ] Record timestamps
- [ ] Note any issues
- [ ] Track RTO/RPO actual vs target
### Post-Test
- [ ] Generate test report
- [ ] Update runbooks if needed
- [ ] File improvement tickets
- [ ] Schedule follow-up for issues
Communication Templates
Outage Notification Templates
# Template: Initial Outage Notification
## Internal (Slack/Email)
Subject: [INCIDENT] POS Platform - Service Disruption
**Status**: Investigating
**Impact**: [High/Medium/Low]
**Start Time**: [YYYY-MM-DD HH:MM UTC]
**Affected Services**:
- [ ] Transaction Processing
- [ ] Inventory Management
- [ ] Order Fulfillment
- [ ] Reporting
**Current Actions**:
- Investigating root cause
- Engaged [Team Name]
**Next Update**: In 30 minutes or when status changes
---
# Template: Customer Notification
Subject: Service Status Update - POS Platform
Dear Valued Customer,
We are currently experiencing a service disruption affecting
[specific functionality]. Our team is actively working to
resolve this issue.
**What's Affected**:
[List specific features]
**What's Working**:
[List unaffected features]
**Workaround**:
[If applicable, provide workaround]
**Expected Resolution**:
We anticipate resolution within [timeframe].
We apologize for any inconvenience and will provide updates
as the situation progresses.
---
# Template: Resolution Notification
Subject: [RESOLVED] POS Platform - Service Restored
**Status**: Resolved
**Duration**: [X hours, Y minutes]
**Resolution Time**: [YYYY-MM-DD HH:MM UTC]
**Root Cause**:
[Brief description]
**Resolution**:
[What was done to fix]
**Preventive Measures**:
[What will prevent recurrence]
**Post-Incident Review**:
Scheduled for [date]
Thank you for your patience.
Summary
This chapter provides complete disaster recovery coverage:
- Recovery Objectives: RTO/RPO by data tier
- Backup Strategy: Daily dumps, WAL archiving, per-tenant backups
- Failover Architecture: Multi-region with automatic failover
- Recovery Procedures: Step-by-step scripts for full and tenant recovery
- DR Testing: Quarterly test schedule and procedures
- Communication: Templates for internal and customer notifications
Next Chapter: Chapter 33: Tenant Lifecycle
“Hope is not a strategy. Test your recovery procedures.”