Database Disaster Recovery: RPO, RTO, Cross-Region Replication

Disaster recovery (DR) ensures your database can survive catastrophic events: region outages, data corruption, accidental deletions, or ransomware attacks. Unlike high availability, which handles component failures, DR addresses large-scale disasters.

RPO and RTO

Two metrics define DR requirements:

Recovery Point Objective (RPO): The maximum acceptable data loss, measured in time. An RPO of 1 hour means you can lose at most the last hour of writes.

Recovery Time Objective (RTO): The maximum acceptable downtime. An RTO of 4 hours means the database must be operational again within 4 hours of the disaster.

| Scenario | RPO | RTO | Strategy |
|----------|-----|-----|----------|
| Internal tool | 24 hours | 24 hours | Daily backups, restore |
| E-commerce | 5 minutes | 1 hour | Cross-region replication |
| Financial trading | 0 (zero loss) | 5 minutes | Synchronous replication + DR site |
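To make the two targets concrete, a drill or incident can be scored against them. The following is a minimal sketch rather than part of any tool: the function name `meets_objective` and the convention of expressing both the measured value and the target in seconds are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed when a measured value (in seconds) is within its target.
# Usage: meets_objective <label> <measured_seconds> <target_seconds>
meets_objective() {
  local label=$1 measured=$2 target=$3
  if (( measured <= target )); then
    echo "$label OK: ${measured}s <= ${target}s"
  else
    echo "$label MISSED: ${measured}s > ${target}s"
    return 1
  fi
}

# E-commerce row from the table: RPO 5 minutes (300s), RTO 1 hour (3600s).
meets_objective RPO 120 300        # 2 minutes of lost writes
meets_objective RTO 4500 3600 || echo "escalate: recovery took too long"
```

Recording these numbers after every drill shows whether the chosen strategy still fits the tier it was assigned to.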

Cross-Region Replication

PostgreSQL Logical Replication Across Regions

-- On primary (us-east-1)
CREATE PUBLICATION dr_pub FOR ALL TABLES;

-- On standby (us-west-2)
CREATE SUBSCRIPTION dr_sub
    CONNECTION 'host=primary-us-east-1.example.com port=5432 dbname=proddb'
    PUBLICATION dr_pub
    WITH (copy_data = true, connect = true, create_slot = true);

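Logical replication is asynchronous, so the standby trails the primary by however much WAL it has not yet applied. In a region outage, that lag is your effective data loss. Note also that logical replication does not carry schema changes (DDL), so apply them on the standby yourself. Monitor lag from the primary: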

SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
       ) AS replication_lag
FROM pg_stat_replication
WHERE application_name = 'dr_sub';
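For alerting, the raw byte count is easier to compare than the pretty-printed size. The sketch below classifies a lag value against two thresholds. The function name `classify_lag` and both threshold values are assumptions chosen for illustration; suitable limits depend on your write volume and RPO.

```shell
#!/bin/bash
# Hypothetical thresholds; tune them to your write volume and RPO.
WARN_BYTES=$((64 * 1024 * 1024))    # 64 MiB
CRIT_BYTES=$((512 * 1024 * 1024))   # 512 MiB

# Print OK, WARN, or CRIT for a lag value in bytes; exit status is 0, 1, or 2.
# An empty value means the query returned no row, which is treated as critical.
classify_lag() {
  local bytes=$1
  if [[ -z $bytes ]]; then
    echo "CRIT: no replication row found (subscriber disconnected?)"
    return 2
  elif (( bytes >= CRIT_BYTES )); then
    echo "CRIT: lag ${bytes} bytes"
    return 2
  elif (( bytes >= WARN_BYTES )); then
    echo "WARN: lag ${bytes} bytes"
    return 1
  fi
  echo "OK: lag ${bytes} bytes"
}

# In production the value would come from the primary, for example:
#   lag=$(psql -At -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::bigint
#                      FROM pg_stat_replication WHERE application_name = 'dr_sub'")
classify_lag 1048576
```

Treating a missing row as critical matters here: when the subscriber disconnects, its row disappears from `pg_stat_replication`, and a check that only compares numbers would stay silent.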

AWS RDS Cross-Region Read Replicas

# Create cross-region read replica (the call is made in the destination region)
# A source in another region must be given as an ARN; 123456789012 is a placeholder account ID
aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-dr \
    --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb \
    --region us-west-2 \
    --db-instance-class db.r6g.large

# Promote to standalone for DR
aws rds promote-read-replica \
    --db-instance-identifier mydb-dr \
    --region us-west-2

Multi-Region with Patroni

Patroni can manage clusters across regions with careful configuration:

# DR site configuration
scope: myapp
namespace: /service/
name: pg-dr-node-1

consul:
  host: dr-consul.service.consul:8500
  # Separate DCS for DR isolation

tags:
  nofailover: true  # DR site should not automatically become primary

Backup-Based DR

For cost-sensitive environments, backups plus WAL archiving to S3 provide DR:

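# Continuous WAL archiving to cross-region S3 bucket
archive_command = 'aws s3 cp %p s3://myapp-wal-dr/region/us-east-1/%f'

# DR restore procedure
# WAL replay needs a physical base backup (pg_basebackup or pgBackRest); a logical
# dump loaded with pg_restore is a fixed snapshot and cannot be rolled forward.
pg_restore --dbname=proddb /backups/dr/latest_full.dump   # snapshot only: data loss = age of the dump
# pg_receivewal streams WAL from a running primary; it archives WAL and is not a restore step.
# During recovery, fetch archived WAL with restore_command instead:
restore_command = 'aws s3 cp s3://myapp-wal-dr/region/us-east-1/%f %p'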

Recovery Workflow

#!/bin/bash
# DR: restore to us-west-2

# 1. Restore latest full backup
#    (--pg1-path is the pgBackRest 2.x name for the older --db-path option)
pgbackrest --stanza=prod --pg1-path=/var/lib/postgresql/dr restore

# 2. Set recovery target
cat >> /var/lib/postgresql/dr/postgresql.conf << 'EOF'
restore_command = 'aws s3 cp s3://myapp-wal-dr/region/us-east-1/%f %p'
recovery_target_time = '2026-05-12 10:00:00 UTC'
recovery_target_action = promote
EOF

# 3. Start and recover
pg_ctl start -D /var/lib/postgresql/dr

# 4. Verify data integrity
psql -c "SELECT count(*) FROM critical_table;"
psql -c "SELECT max(created_at) FROM orders;"
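The `max(created_at)` check is most useful when it is turned into a number: the gap between the recovery target and the newest restored row is the data loss you actually incurred, which can then be compared with the RPO. A minimal sketch, assuming GNU `date` (for the `-d` flag) and timestamps in a format it can parse; the function name `data_loss_seconds` is hypothetical.

```shell
#!/bin/bash
# Seconds between the recovery target and the newest restored row.
# Assumes GNU date (the -d flag) and timestamps that date can parse.
data_loss_seconds() {
  local target=$1 newest_row=$2
  local target_epoch newest_epoch
  target_epoch=$(date -u -d "$target" +%s) || return 1
  newest_epoch=$(date -u -d "$newest_row" +%s) || return 1
  echo $(( target_epoch - newest_epoch ))
}

# The second argument would normally come from the restored database, e.g.
#   psql -At -c "SELECT max(created_at) FROM orders;"
data_loss_seconds "2026-05-12 10:00:00 UTC" "2026-05-12 09:57:30 UTC"   # prints 150
```

A gap much larger than normal write intervals suggests that WAL replay stopped early, for example because an archived segment was missing.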

Backup Testing

Backups are worthless until proven restorable. Regular testing is mandatory.

Automated Restore Test

#!/bin/bash
# Weekly restore test
set -euo pipefail

STAMP=$(date +%Y%m%d)
TEST_DIR=/tmp/dr_test_$STAMP
# Keep the log outside TEST_DIR so it survives cleanup
LOG_FILE=/tmp/dr_test_$STAMP.log
# Port for the throwaway instance, so it cannot clash with a local server on 5432
TEST_PORT=5433

mkdir -p "$TEST_DIR"
echo "=== DR Restore Test $(date) ===" >> "$LOG_FILE"

# Full restore
pgbackrest --stanza=prod --pg1-path="$TEST_DIR/data" restore >> "$LOG_FILE" 2>&1

# Start database on the test port and wait until it accepts connections
pg_ctl -D "$TEST_DIR/data" -l "$TEST_DIR/pg.log" -o "-p $TEST_PORT" -w start >> "$LOG_FILE" 2>&1

# Verify
echo "Database size:"
psql -p "$TEST_PORT" -d proddb -c "SELECT pg_size_pretty(pg_database_size('proddb'));"

echo "Row counts:"
psql -p "$TEST_PORT" -d proddb -c "
SELECT 'users' AS tbl, count(*) FROM users
UNION ALL
SELECT 'orders', count(*) FROM orders
UNION ALL
SELECT 'payments', count(*) FROM payments;
"

echo "Max dates (data freshness):"
psql -p "$TEST_PORT" -d proddb -c "
SELECT 'users' AS tbl, max(created_at) FROM users
UNION ALL
SELECT 'orders', max(created_at) FROM orders;
"

# Cleanup
pg_ctl -D "$TEST_DIR/data" stop >> "$LOG_FILE" 2>&1
rm -rf "$TEST_DIR"
echo "=== Test Complete ===" >> "$LOG_FILE"

Schedule this via cron:

0 2 * * 0 /usr/local/bin/dr_restore_test.sh

DR Plan Components

A complete DR plan should document:

1. Contact list: Who to contact and escalation paths.
2. RTO and RPO targets: Specific to each data tier.
3. Runbook: Step-by-step recovery procedures.
4. DR site details: Region, connection strings, credentials.
5. Validation steps: How to verify the recovery succeeded.
6. Communication plan: Internal and external notifications.
7. Post-mortem process: How to document and improve.

Disaster Scenarios and Mitigations

| Scenario | Mitigation | RPO Impact |
|----------|------------|------------|
| Region outage | Cross-region replica promotion | RPO = replication lag |
| Accidental DROP TABLE | PITR to before the statement | RPO = time since last WAL backup |
| Ransomware | Immutable WAL backups | RPO depends on backup frequency |
| Data corruption | Replay WAL; keep multiple backups | Dependent on detection time |

Testing DR with Chaos Engineering

# Simulate region failure: block traffic to primary
iptables -A INPUT -s dr-test-region -j DROP

# Trigger DR failover script
./dr_failover.sh --target us-west-2

# Verify applications work from DR region
curl -f https://dr-api.myapp.com/health

# Fail back
./dr_failback.sh --target us-east-1

# Clean up
iptables -D INPUT -s dr-test-region -j DROP
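A drill is also the best opportunity to measure the RTO you actually achieve. One way, sketched here under assumptions, is to poll the health check from the moment failover is triggered and record how long it takes to pass. The function name `measure_recovery` and its argument order are hypothetical.

```shell
#!/bin/bash
# Poll a health-check command until it succeeds, then print the elapsed seconds.
# Usage: measure_recovery <timeout_seconds> <interval_seconds> <command...>
measure_recovery() {
  local timeout=$1 interval=$2
  shift 2
  local start=$SECONDS          # bash built-in: seconds since the shell started
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS - start >= timeout )); then
      echo "not healthy after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo $(( SECONDS - start ))
}

# During a drill this might wrap the health check used in the drill script:
#   measure_recovery 3600 5 curl -f https://dr-api.myapp.com/health
measure_recovery 10 1 true
```

Setting the timeout to the RTO target makes the function fail exactly when the objective is missed, so the drill script can exit non-zero and flag the run.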

Run DR drills quarterly at minimum. Document every drill outcome and update the runbook with lessons learned. A DR plan that has never been tested is not a plan; it is a hope.