Chapter 32: Disaster Recovery

Overview

This chapter defines the disaster recovery strategy, backup procedures, failover architecture, and recovery processes for the POS Platform.


Recovery Objectives

RTO/RPO Requirements by Data Type

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                    RECOVERY TIME OBJECTIVE (RTO) / RECOVERY POINT OBJECTIVE (RPO)   │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌───────────────────┬─────────────┬─────────────┬──────────────────────────────────────┐
│ Data Category     │ RTO         │ RPO         │ Justification                        │
├───────────────────┼─────────────┼─────────────┼──────────────────────────────────────┤
│ Transaction Data  │ < 1 hour    │ 0 (no loss) │ Revenue-critical, legal requirements │
│ Inventory Data    │ < 4 hours   │ < 1 hour    │ Business operations                  │
│ Customer Data     │ < 4 hours   │ < 1 hour    │ Order fulfillment                    │
│ Product Catalog   │ < 8 hours   │ < 24 hours  │ Can rebuild from source              │
│ Audit Logs        │ < 24 hours  │ < 1 hour    │ Compliance requirements              │
│ Analytics Data    │ < 72 hours  │ < 24 hours  │ Non-critical, can rebuild            │
│ Configuration     │ Immediate   │ 0 (no loss) │ Stored in Git                        │
└───────────────────┴─────────────┴─────────────┴──────────────────────────────────────┘


Recovery Tier Definitions:

┌─────────┬─────────────────────────────────────────────────────────────────────────────┐
│ TIER 1  │  MISSION CRITICAL                                                           │
│         │  RTO: < 1 hour | RPO: 0                                                     │
│         │  - Active transactions                                                      │
│         │  - Payment processing                                                       │
│         │  - Real-time inventory                                                      │
│         │  Strategy: Synchronous replication, hot standby                             │
├─────────┼─────────────────────────────────────────────────────────────────────────────┤
│ TIER 2  │  BUSINESS CRITICAL                                                          │
│         │  RTO: < 4 hours | RPO: < 1 hour                                             │
│         │  - Customer data                                                            │
│         │  - Order history                                                            │
│         │  - Inventory levels                                                         │
│         │  Strategy: Asynchronous replication, warm standby                           │
├─────────┼─────────────────────────────────────────────────────────────────────────────┤
│ TIER 3  │  IMPORTANT                                                                  │
│         │  RTO: < 24 hours | RPO: < 24 hours                                          │
│         │  - Product catalog                                                          │
│         │  - Reports                                                                  │
│         │  - Historical analytics                                                     │
│         │  Strategy: Daily backups, cold standby                                      │
├─────────┼─────────────────────────────────────────────────────────────────────────────┤
│ TIER 4  │  NON-CRITICAL                                                               │
│         │  RTO: < 72 hours | RPO: < 72 hours                                          │
│         │  - Archived data                                                            │
│         │  - Legacy exports                                                           │
│         │  Strategy: Weekly backups, rebuild if needed                                │
└─────────┴─────────────────────────────────────────────────────────────────────────────┘
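
Replication health is what makes these tiers real: the Tier 1 zero-RPO target holds only while the standby keeps pace with the primary. A minimal lag-check sketch against pg_stat_replication, reusing the postgres-primary container from the scripts below (the 16 MB alert threshold is an assumption):

#!/bin/bash
# Sketch: check streaming-replication lag against the tier RPO targets.

LAG_BYTES=$(docker exec postgres-primary psql -U pos_admin -d pos_db -t -A -c \
    "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
     FROM pg_stat_replication;")

# Tier 1 expects near-zero lag; alert if the standby falls a full WAL
# segment (16 MB) behind. The threshold is illustrative; tune per tier.
if [ "$LAG_BYTES" -gt 16777216 ]; then
    echo "WARNING: replication lag is ${LAG_BYTES} bytes"
fi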

Backup Strategy

Database Backup Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                           DATABASE BACKUP STRATEGY                                   │
└─────────────────────────────────────────────────────────────────────────────────────┘

                    PostgreSQL Primary
                          │
         ┌────────────────┼────────────────┐
         │                │                │
         ▼                ▼                ▼
┌─────────────────┐ ┌──────────┐ ┌─────────────────┐
│  Streaming      │ │   WAL    │ │   pg_dump       │
│  Replication    │ │ Archiving│ │   (Daily)       │
│  (Real-time)    │ │ (PITR)   │ │                 │
└────────┬────────┘ └────┬─────┘ └────────┬────────┘
         │               │                │
         ▼               ▼                ▼
┌─────────────────┐ ┌──────────┐ ┌─────────────────┐
│  Hot Standby    │ │   WAL    │ │  Backup Storage │
│  (Same Region)  │ │ Archive  │ │  (Encrypted)    │
│                 │ │ (S3/NFS) │ │                 │
└─────────────────┘ └──────────┘ └─────────────────┘
         │               │                │
         │               │                │
         └───────────────┼────────────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │   Offsite Backup    │
              │   (Different DC)    │
              │   S3 Cross-Region   │
              └─────────────────────┘


BACKUP SCHEDULE:

┌──────────────────┬───────────────┬─────────────────┬────────────────────────────────┐
│ Backup Type      │ Frequency     │ Retention       │ Storage Location               │
├──────────────────┼───────────────┼─────────────────┼────────────────────────────────┤
│ WAL Archiving    │ Continuous    │ 7 days          │ Local NFS + S3                 │
│ pg_dump (Full)   │ Daily 2AM     │ 30 days         │ S3 (encrypted)                 │
│ pg_dump (Weekly) │ Sunday 3AM    │ 90 days         │ S3 + Glacier                   │
│ Monthly Archive  │ 1st of month  │ 1 year          │ Glacier                        │
│ Yearly Archive   │ Jan 1st       │ 7 years         │ Glacier Deep Archive           │
└──────────────────┴───────────────┴─────────────────┴────────────────────────────────┘
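
The retention column above is enforced locally by the backup script and in S3 by a bucket lifecycle policy (referenced in cleanup_old_backups below). A hedged sketch of that policy, assuming the pos-backups bucket and prefixes used by the scripts; the exact rule set is illustrative:

# Sketch: lifecycle rules implementing the retention schedule above
aws s3api put-bucket-lifecycle-configuration \
    --bucket pos-backups \
    --lifecycle-configuration '{
      "Rules": [
        {"ID": "wal-7d", "Filter": {"Prefix": "wal/"},
         "Status": "Enabled", "Expiration": {"Days": 7}},
        {"ID": "daily-30d", "Filter": {"Prefix": "postgres/daily/"},
         "Status": "Enabled", "Expiration": {"Days": 30}},
        {"ID": "weekly-90d", "Filter": {"Prefix": "postgres/weekly/"},
         "Status": "Enabled",
         "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
         "Expiration": {"Days": 90}}
      ]
    }'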

Backup Scripts

#!/bin/bash
# File: /pos-platform/scripts/backup/daily-backup.sh
# Daily database backup script

set -e

#=============================================
# CONFIGURATION
#=============================================
BACKUP_DIR="/backups/postgres/daily"
S3_BUCKET="s3://pos-backups/postgres"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="pos_db_${DATE}.dump"   # pg_dump custom-format archive, not plain SQL
LOG_FILE="/var/log/pos-backup.log"

#=============================================
# FUNCTIONS
#=============================================
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

send_alert() {
    # Send to Slack on failure
    curl -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-type: application/json' \
        -d "{\"text\": \"BACKUP ALERT: $1\"}"
}

#=============================================
# BACKUP PROCESS
#=============================================
backup_database() {
    log "Starting database backup..."

    # Create backup with compression
    docker exec postgres-primary pg_dump \
        -U pos_admin \
        -d pos_db \
        --format=custom \
        --compress=9 \
        --file="/tmp/${BACKUP_FILE}"

    # Copy from container
    docker cp "postgres-primary:/tmp/${BACKUP_FILE}" "${BACKUP_DIR}/${BACKUP_FILE}"

    # Verify backup integrity (guard the command explicitly; under set -e
    # a bare failure would abort before the error handler runs)
    if docker exec postgres-primary pg_restore \
        --list "/tmp/${BACKUP_FILE}" > /dev/null 2>&1; then
        log "Backup verified successfully"
    else
        log "ERROR: Backup verification failed"
        send_alert "Backup verification failed for ${BACKUP_FILE}"
        exit 1
    fi

    log "Backup completed: ${BACKUP_FILE}"
}

upload_to_s3() {
    log "Uploading to S3..."

    # Encrypt and upload
    aws s3 cp \
        "${BACKUP_DIR}/${BACKUP_FILE}" \
        "${S3_BUCKET}/daily/${BACKUP_FILE}" \
        --sse aws:kms \
        --sse-kms-key-id "$KMS_KEY_ID"

    log "Upload completed"
}

cleanup_old_backups() {
    log "Cleaning up old backups..."

    # Local cleanup
    find "$BACKUP_DIR" -name "*.sql.gz" -mtime +$RETENTION_DAYS -delete

    # S3 cleanup (handled by lifecycle policy)

    log "Cleanup completed"
}

#=============================================
# PER-TENANT BACKUP
#=============================================
backup_tenant_data() {
    log "Starting per-tenant backups..."

    # Get all active tenants
    TENANTS=$(docker exec postgres-primary psql -U pos_admin -d pos_db -t -c \
        "SELECT schema_name FROM tenants WHERE status = 'active';")

    for TENANT in $TENANTS; do
        TENANT=$(echo "$TENANT" | tr -d ' ')
        TENANT_BACKUP="${BACKUP_DIR}/tenants/${TENANT}_${DATE}.dump"

        log "Backing up tenant: $TENANT"

        docker exec postgres-primary pg_dump \
            -U pos_admin \
            -d pos_db \
            --schema="${TENANT}" \
            --format=custom \
            --compress=9 \
            --file="/tmp/tenant_${TENANT}.sql"

        docker cp "postgres-primary:/tmp/tenant_${TENANT}.sql" "$TENANT_BACKUP"

        # Upload tenant backup
        aws s3 cp "$TENANT_BACKUP" \
            "${S3_BUCKET}/tenants/${TENANT}/${TENANT}_${DATE}.sql.gz" \
            --sse aws:kms

        log "Tenant backup completed: $TENANT"
    done
}

#=============================================
# MAIN
#=============================================
main() {
    log "=========================================="
    log "Daily Backup Started"
    log "=========================================="

    mkdir -p "$BACKUP_DIR/tenants"

    backup_database
    backup_tenant_data
    upload_to_s3
    cleanup_old_backups

    log "=========================================="
    log "Daily Backup Completed Successfully"
    log "=========================================="
}

main "$@"

WAL Archiving Configuration

# File: /pos-platform/docker/postgres/postgresql.conf (excerpt)

# WAL Settings
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://pos-backups/wal/%f --sse aws:kms'
archive_timeout = 60

# Replication Settings
max_wal_senders = 5
wal_keep_size = 1GB
hot_standby = on

# Recovery Settings (for standby)
restore_command = 'aws s3 cp s3://pos-backups/wal/%f %p'
recovery_target_timeline = 'latest'
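
The hot standby that consumes this stream is typically seeded with pg_basebackup and then marked as a standby. A minimal provisioning sketch for the standby host (the replicator role and connection details are assumptions):

# Sketch: seed and start a hot standby for the primary configured above
pg_basebackup \
    --host=postgres-primary \
    --username=replicator \
    --pgdata=/var/lib/postgresql/data \
    --wal-method=stream \
    --checkpoint=fast

# PostgreSQL 12+: an empty standby.signal enables standby mode; the
# restore_command above lets it fall back to the S3 WAL archive.
touch /var/lib/postgresql/data/standby.signal
echo "primary_conninfo = 'host=postgres-primary user=replicator'" \
    >> /var/lib/postgresql/data/postgresql.auto.conf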

Failover Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                           MULTI-REGION FAILOVER ARCHITECTURE                         │
└─────────────────────────────────────────────────────────────────────────────────────┘

                              ┌─────────────────┐
                              │   DNS (Route53) │
                              │   Health-based  │
                              │   Failover      │
                              └────────┬────────┘
                                       │
                    ┌──────────────────┼──────────────────┐
                    │                  │                  │
                    ▼                  │                  ▼
        ┌───────────────────┐          │      ┌───────────────────┐
        │   PRIMARY REGION  │          │      │  SECONDARY REGION │
        │   (US-East-1)     │          │      │  (US-West-2)      │
        │                   │          │      │                   │
        │  ┌─────────────┐  │          │      │  ┌─────────────┐  │
        │  │ Load        │  │          │      │  │ Load        │  │
        │  │ Balancer    │  │          │      │  │ Balancer    │  │
        │  └──────┬──────┘  │          │      │  └──────┬──────┘  │
        │         │         │          │      │         │         │
        │  ┌──────┴──────┐  │          │      │  ┌──────┴──────┐  │
        │  │  API (x3)   │  │          │      │  │  API (x2)   │  │
        │  │  Active     │  │          │      │  │  Standby    │  │
        │  └──────┬──────┘  │          │      │  └──────┬──────┘  │
        │         │         │          │      │         │         │
        │  ┌──────┴──────┐  │  Async   │      │  ┌──────┴──────┐  │
        │  │  PostgreSQL │  │──────────┼─────►│  │  PostgreSQL │  │
        │  │  PRIMARY    │  │  (WAL)   │      │  │  REPLICA    │  │
        │  └─────────────┘  │          │      │  └─────────────┘  │
        │                   │          │      │                   │
        │  ┌─────────────┐  │  Async   │      │  ┌─────────────┐  │
        │  │   Redis     │  │──────────┼─────►│  │   Redis     │  │
        │  │  PRIMARY    │  │          │      │  │  REPLICA    │  │
        │  └─────────────┘  │          │      │  └─────────────┘  │
        └───────────────────┘          │      └───────────────────┘
                                       │
                              NORMAL OPERATION:
                              100% traffic → Primary

                              FAILOVER STATE:
                              100% traffic → Secondary


FAILOVER TRIGGERS:
┌──────────────────────────────────┬────────────────┬───────────────┬────────────────┐
│ Trigger                          │ Detection Time │ Failover Time │ Auto/Manual    │
├──────────────────────────────────┼────────────────┼───────────────┼────────────────┤
│ Load balancer health check fail  │ 30 seconds     │ 1 minute      │ Automatic      │
│ Database connection failure      │ 1 minute       │ 5 minutes     │ Automatic      │
│ Region-wide outage (AWS)         │ 5 minutes      │ 10 minutes    │ Automatic      │
│ Planned maintenance              │ N/A            │ 0 (graceful)  │ Manual         │
│ Security incident                │ Immediate      │ 5 minutes     │ Manual         │
└──────────────────────────────────┴────────────────┴───────────────┴────────────────┘
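
Health-based failover of this shape maps directly onto Route53: a health check against the primary endpoint plus PRIMARY/SECONDARY failover record sets. A sketch with placeholder zone ID, domain, and health-check ID (the 30-second interval matches the detection time above):

# Sketch: Route53 health check and failover routing (values are placeholders)
aws route53 create-health-check \
    --caller-reference "pos-primary-$(date +%s)" \
    --health-check-config '{
      "Type": "HTTPS", "ResourcePath": "/health",
      "FullyQualifiedDomainName": "api-us-east-1.pos.example.com",
      "Port": 443, "RequestInterval": 30, "FailureThreshold": 3
    }'

aws route53 change-resource-record-sets \
    --hosted-zone-id ZEXAMPLE \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.pos.example.com", "Type": "CNAME", "TTL": 60,
          "SetIdentifier": "primary", "Failover": "PRIMARY",
          "HealthCheckId": "REPLACE_WITH_HEALTH_CHECK_ID",
          "ResourceRecords": [{"Value": "api-us-east-1.pos.example.com"}]
        }
      }]
    }'

A matching record set with "Failover": "SECONDARY" (and no health check) points at the US-West-2 endpoint.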

Recovery Procedures

Complete Database Recovery

#!/bin/bash
# File: /pos-platform/scripts/recovery/full-db-recovery.sh
# Complete database recovery from backup

set -e

#=============================================
# RECOVERY MODES
#=============================================
# 1. full    - Restore to latest available state
# 2. pitr    - Point-in-time recovery to specific timestamp
# 3. tenant  - Restore specific tenant only (see tenant-recovery.sh below)

RECOVERY_MODE=${1:-full}
TARGET_TIME=${2:-}
TENANT_ID=${3:-}

#=============================================
# CONFIGURATION
#=============================================
S3_BUCKET="s3://pos-backups"
WORK_DIR="/tmp/recovery_$(date +%s)"
LOG_FILE="/var/log/pos-recovery.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] RECOVERY: $1" | tee -a "$LOG_FILE"
}

#=============================================
# STEP 1: STOP SERVICES
#=============================================
stop_services() {
    log "Stopping API services..."

    docker-compose stop pos-api

    log "Services stopped"
}

#=============================================
# STEP 2: DOWNLOAD BACKUP
#=============================================
download_backup() {
    log "Downloading backup files..."

    mkdir -p "$WORK_DIR"

    # Get latest custom-format dump
    LATEST_BACKUP=$(aws s3 ls "${S3_BUCKET}/postgres/daily/" | \
                    sort | tail -1 | awk '{print $4}')

    aws s3 cp "${S3_BUCKET}/postgres/daily/${LATEST_BACKUP}" \
        "${WORK_DIR}/backup.dump"

    log "Downloaded: ${LATEST_BACKUP}"
}

#=============================================
# STEP 3: VERIFY BACKUP INTEGRITY
#=============================================
verify_backup() {
    log "Verifying backup integrity..."

    # Copy the dump into the container and check the archive catalog
    # (custom-format dumps are verified with pg_restore, not gunzip)
    docker cp "${WORK_DIR}/backup.dump" postgres-primary:/tmp/backup.dump

    if ! docker exec postgres-primary pg_restore \
        --list /tmp/backup.dump > /dev/null 2>&1; then
        log "ERROR: Backup file is corrupted"
        exit 1
    fi

    log "Backup verified"
}

#=============================================
# STEP 4: PREPARE DATABASE
#=============================================
prepare_database() {
    log "Preparing database for recovery..."

    # Create recovery database
    docker exec postgres-primary psql -U postgres -c \
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'pos_db';"

    docker exec postgres-primary psql -U postgres -c \
        "DROP DATABASE IF EXISTS pos_db_recovery;"

    docker exec postgres-primary psql -U postgres -c \
        "CREATE DATABASE pos_db_recovery;"

    log "Recovery database prepared"
}

#=============================================
# STEP 5: RESTORE DATA
#=============================================
restore_data() {
    log "Restoring data..."

    # Restore the custom-format dump (already copied in verify_backup)
    docker exec postgres-primary pg_restore \
        -U postgres \
        -d pos_db_recovery \
        /tmp/backup.dump

    log "Data restored"
}

#=============================================
# STEP 6: POINT-IN-TIME RECOVERY (if needed)
#=============================================
apply_wal_logs() {
    if [ "$RECOVERY_MODE" == "pitr" ]; then
        log "Applying WAL logs until: $TARGET_TIME"

        # NOTE: true PITR replays WAL on top of a physical base backup
        # (pg_basebackup), not a logical dump; this step assumes the data
        # directory was seeded from one.

        # Stage WAL files locally for inspection; the server itself pulls
        # what it needs via restore_command (files are archived uncompressed)
        aws s3 sync "${S3_BUCKET}/wal/" "${WORK_DIR}/wal/"

        # PostgreSQL 12+: the target time goes in postgresql.auto.conf;
        # recovery.signal is an empty marker file that enables recovery mode
        docker exec postgres-primary bash -c "
            echo \"recovery_target_time = '$TARGET_TIME'\" >> /var/lib/postgresql/data/postgresql.auto.conf
            touch /var/lib/postgresql/data/recovery.signal
            pg_ctl -D /var/lib/postgresql/data restart
        "

        log "PITR completed"
    fi
}

#=============================================
# STEP 7: VERIFY RECOVERY
#=============================================
verify_recovery() {
    log "Verifying recovery..."

    # Check table counts
    TABLES=$(docker exec postgres-primary psql -U postgres -d pos_db_recovery -t -c \
        "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")

    log "Restored tables: $TABLES"

    # Check transaction count
    TX_COUNT=$(docker exec postgres-primary psql -U postgres -d pos_db_recovery -t -c \
        "SELECT COUNT(*) FROM transactions;")

    log "Restored transactions: $TX_COUNT"

    # Check latest transaction
    LATEST_TX=$(docker exec postgres-primary psql -U postgres -d pos_db_recovery -t -c \
        "SELECT MAX(created_at) FROM transactions;")

    log "Latest transaction: $LATEST_TX"
}

#=============================================
# STEP 8: SWAP DATABASES
#=============================================
swap_databases() {
    log "Swapping databases..."

    # Rename databases
    docker exec postgres-primary psql -U postgres -c \
        "ALTER DATABASE pos_db RENAME TO pos_db_old;"

    docker exec postgres-primary psql -U postgres -c \
        "ALTER DATABASE pos_db_recovery RENAME TO pos_db;"

    log "Databases swapped"
}

#=============================================
# STEP 9: RESTART SERVICES
#=============================================
restart_services() {
    log "Restarting services..."

    docker-compose start pos-api

    # Wait for health checks
    sleep 30

    # Verify health
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)

    if [ "$HTTP_CODE" -eq 200 ]; then
        log "Services healthy"
    else
        log "ERROR: Services not healthy after recovery"
        exit 1
    fi
}

#=============================================
# STEP 10: CLEANUP
#=============================================
cleanup() {
    log "Cleaning up..."

    rm -rf "$WORK_DIR"

    # Keep old database for 24 hours, then drop. at(1) runs a shell
    # command, so wrap the SQL in a psql invocation.
    echo "docker exec postgres-primary psql -U postgres -c 'DROP DATABASE pos_db_old;'" \
        | at now + 24 hours

    log "Cleanup scheduled"
}

#=============================================
# MAIN
#=============================================
main() {
    log "=========================================="
    log "DATABASE RECOVERY STARTED"
    log "Mode: $RECOVERY_MODE"
    [ -n "$TARGET_TIME" ] && log "Target Time: $TARGET_TIME"
    log "=========================================="

    stop_services
    download_backup
    verify_backup
    prepare_database
    restore_data
    apply_wal_logs
    verify_recovery
    swap_databases
    restart_services
    cleanup

    log "=========================================="
    log "DATABASE RECOVERY COMPLETED"
    log "=========================================="
}

main "$@"

Tenant-Specific Recovery

#!/bin/bash
# File: /pos-platform/scripts/recovery/tenant-recovery.sh
# Restore specific tenant data

set -e

TENANT_ID=$1
BACKUP_DATE=${2:-latest}

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] TENANT RECOVERY: $1"
}

#=============================================
# FIND TENANT BACKUP
#=============================================
find_backup() {
    log "Finding backup for tenant: $TENANT_ID"

    if [ "$BACKUP_DATE" == "latest" ]; then
        BACKUP_FILE=$(aws s3 ls "s3://pos-backups/tenants/${TENANT_ID}/" | \
                      sort | tail -1 | awk '{print $4}')
    else
        BACKUP_FILE="${TENANT_ID}_${BACKUP_DATE}.sql.gz"
    fi

    log "Using backup: $BACKUP_FILE"
}

#=============================================
# RESTORE TENANT SCHEMA
#=============================================
restore_tenant() {
    log "Restoring tenant schema..."

    # Download backup and copy it into the container
    aws s3 cp "s3://pos-backups/postgres/tenants/${TENANT_ID}/${BACKUP_FILE}" /tmp/
    docker cp "/tmp/${BACKUP_FILE}" "postgres-primary:/tmp/${BACKUP_FILE}"

    # Drop existing schema (with confirmation in production)
    docker exec postgres-primary psql -U postgres -d pos_db -c \
        "DROP SCHEMA IF EXISTS ${TENANT_ID} CASCADE;"

    # Restore schema from the custom-format dump
    docker exec postgres-primary pg_restore \
        -U postgres -d pos_db "/tmp/${BACKUP_FILE}"

    log "Tenant restored: $TENANT_ID"
}

#=============================================
# MAIN
#=============================================
main() {
    if [ -z "$TENANT_ID" ]; then
        echo "Usage: $0 <tenant_id> [backup_date]"
        exit 1
    fi

    find_backup
    restore_tenant

    log "Recovery completed for tenant: $TENANT_ID"
}

main "$@"

DR Testing Schedule

# Disaster Recovery Test Schedule

## Quarterly Tests

### Q1 (January)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Full failover drill | Week 3 | 4 hours | Platform Team |
| Backup restoration test | Week 4 | 2 hours | DBA |

### Q2 (April)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Tenant recovery test | Week 2 | 2 hours | Platform Team |
| Network failover test | Week 3 | 2 hours | Network Team |

### Q3 (July)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Full failover drill | Week 3 | 4 hours | Platform Team |
| PITR recovery test | Week 4 | 3 hours | DBA |

### Q4 (October)
| Test | Date | Duration | Owner |
|------|------|----------|-------|
| Annual DR exercise | Week 2-3 | 8 hours | All Teams |
| Tabletop exercise | Week 4 | 2 hours | Leadership |

## Monthly Tests
- Automated backup verification (see the sketch below)
- Replica lag monitoring
- Health check validation
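
One way to automate the monthly backup verification is to restore the latest dump into a scratch database and run a sanity query. A sketch reusing the conventions from the scripts in this chapter:

#!/bin/bash
# Sketch: monthly automated backup verification
set -e

LATEST=$(aws s3 ls s3://pos-backups/postgres/daily/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://pos-backups/postgres/daily/${LATEST}" /tmp/verify.dump
docker cp /tmp/verify.dump postgres-primary:/tmp/verify.dump

docker exec postgres-primary psql -U postgres -c "DROP DATABASE IF EXISTS pos_db_verify;"
docker exec postgres-primary psql -U postgres -c "CREATE DATABASE pos_db_verify;"
docker exec postgres-primary pg_restore -U postgres -d pos_db_verify /tmp/verify.dump

# A restore that produces zero tables is a failed backup, not a passed test
TABLES=$(docker exec postgres-primary psql -U postgres -d pos_db_verify -t -A -c \
    "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog','information_schema');")
if [ "$TABLES" -gt 0 ]; then
    echo "Backup ${LATEST} verified (${TABLES} tables)"
else
    echo "ERROR: verification restore produced no tables" >&2
    exit 1
fi

docker exec postgres-primary psql -U postgres -c "DROP DATABASE pos_db_verify;"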

## Test Procedure

### Pre-Test Checklist
- [ ] Notify stakeholders
- [ ] Confirm maintenance window
- [ ] Verify backup freshness
- [ ] Prepare rollback plan
- [ ] Stage monitoring dashboards

### During Test
- [ ] Document all actions
- [ ] Record timestamps
- [ ] Note any issues
- [ ] Track RTO/RPO actual vs target

### Post-Test
- [ ] Generate test report
- [ ] Update runbooks if needed
- [ ] File improvement tickets
- [ ] Schedule follow-up for issues

Communication Templates

Outage Notification Templates

# Template: Initial Outage Notification

## Internal (Slack/Email)

Subject: [INCIDENT] POS Platform - Service Disruption

**Status**: Investigating
**Impact**: [High/Medium/Low]
**Start Time**: [YYYY-MM-DD HH:MM UTC]

**Affected Services**:
- [ ] Transaction Processing
- [ ] Inventory Management
- [ ] Order Fulfillment
- [ ] Reporting

**Current Actions**:
- Investigating root cause
- Engaged [Team Name]

**Next Update**: In 30 minutes or when status changes

---

# Template: Customer Notification

Subject: Service Status Update - POS Platform

Dear Valued Customer,

We are currently experiencing a service disruption affecting
[specific functionality]. Our team is actively working to
resolve this issue.

**What's Affected**:
[List specific features]

**What's Working**:
[List unaffected features]

**Workaround**:
[If applicable, provide workaround]

**Expected Resolution**:
We anticipate resolution within [timeframe].

We apologize for any inconvenience and will provide updates
as the situation progresses.

---

# Template: Resolution Notification

Subject: [RESOLVED] POS Platform - Service Restored

**Status**: Resolved
**Duration**: [X hours, Y minutes]
**Resolution Time**: [YYYY-MM-DD HH:MM UTC]

**Root Cause**:
[Brief description]

**Resolution**:
[What was done to fix]

**Preventive Measures**:
[What will prevent recurrence]

**Post-Incident Review**:
Scheduled for [date]

Thank you for your patience.

Summary

This chapter provides complete disaster recovery coverage:

  1. Recovery Objectives: RTO/RPO by data tier
  2. Backup Strategy: Daily dumps, WAL archiving, per-tenant backups
  3. Failover Architecture: Multi-region with automatic failover
  4. Recovery Procedures: Step-by-step scripts for full and tenant recovery
  5. DR Testing: Quarterly test schedule and procedures
  6. Communication: Templates for internal and customer notifications

Next Chapter: Chapter 33: Tenant Lifecycle


“Hope is not a strategy. Test your recovery procedures.”