💾 FireMUD System Architecture: Backup & Disaster Recovery
This document defines the backup schedule and disaster recovery procedures for FireMUD. Backups are taken only for production. Development and staging environments rely on ad hoc snapshots as needed.
📦 PostgreSQL Snapshots
- Snapshots are taken every 15 minutes.
- Retention policy:
  - 24 hours of 15-minute snapshots
  - 3 weekly snapshots
  - 3 monthly snapshots
- If the database service fails completely:
  - Restore the latest snapshot.
  - Restart services to resume operation.
  - Redis repopulates transient state from PostgreSQL on access.
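
The retention tiers above can be sketched as a pruning rule. This is only an illustration: the function name `snapshots_to_keep` is hypothetical, and the source does not say how weekly and monthly snapshots are chosen, so the sketch assumes "the newest snapshot in each of the 3 most recent ISO weeks" and "the newest snapshot in each of the 3 most recent calendar months".

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots, now, weekly=3, monthly=3):
    """Return the set of snapshot timestamps a three-tier retention policy keeps."""
    ordered = sorted(snapshots, reverse=True)  # newest first

    # Tier 1: every 15-minute snapshot taken within the last 24 hours.
    cutoff = now - timedelta(hours=24)
    keep = {ts for ts in ordered if ts > cutoff}

    def newest_per_bucket(bucket_of, limit):
        # Walking newest-first, the first snapshot seen in a bucket is its newest.
        seen = {}
        for ts in ordered:
            seen.setdefault(bucket_of(ts), ts)
            if len(seen) == limit:
                break
        return seen.values()

    # Tier 2 (assumption): newest snapshot per ISO week, 3 most recent weeks.
    keep.update(newest_per_bucket(lambda ts: tuple(ts.isocalendar())[:2], weekly))
    # Tier 3 (assumption): newest snapshot per calendar month, 3 most recent months.
    keep.update(newest_per_bucket(lambda ts: (ts.year, ts.month), monthly))
    return keep


# Example: 100 days of 15-minute snapshots ending at an arbitrary "now".
now = datetime(2024, 6, 15, 12, 0)
snapshots = [now - timedelta(minutes=15 * i) for i in range(96 * 100)]
kept = snapshots_to_keep(snapshots, now)
print(len(kept))  # 96 recent + 2 older weekly + 2 older monthly = 100
```

In this example the current week and month are already covered by the 24-hour tier, so the weekly and monthly tiers each add only two older snapshots.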
🗄️ Redis Persistence
- Redis stores only transient gameplay state.
- AOF (Append-Only File) is enabled for crash recovery while the cluster is running.
- Redis is not restored from backup during a cold start; it is repopulated from PostgreSQL after recovery.
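
Repopulating "on access" matches the common cache-aside pattern. The sketch below is a minimal illustration, not FireMUD's actual code: the class `TransientStateCache`, the key format, and the loader function are hypothetical, and a plain dict stands in for Redis while a second dict stands in for PostgreSQL.

```python
class TransientStateCache:
    """Cache-aside lookup: read the cache first, fall back to the database on a miss."""

    def __init__(self, cache, load_from_db):
        self._cache = cache                # stand-in for a Redis client
        self._load_from_db = load_from_db  # stand-in for a PostgreSQL query

    def get(self, key):
        value = self._cache.get(key)
        if value is None:
            # Cache miss (for example after a cold start): load from the
            # system of record and store the result for later reads.
            value = self._load_from_db(key)
            if value is not None:
                self._cache[key] = value
        return value


# Example: the cache starts empty, as Redis does after a restore.
database = {"player:1": {"room": "tavern"}}  # hypothetical persisted state
db_reads = []


def load_from_db(key):
    db_reads.append(key)
    return database.get(key)


state = TransientStateCache(cache={}, load_from_db=load_from_db)
first = state.get("player:1")   # miss: read from the database, then cached
second = state.get("player:1")  # hit: served from the cache
print(first, len(db_reads))
```

Because the cache fills lazily, no separate warm-up step is needed after recovery; the first access to each key pays the cost of one database read.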
☁️ Kubernetes Production
- Velero backs up StatefulSets, PersistentVolumeClaims, ConfigMaps, and Secrets.
- Restoration process:
  - Use Velero to rehydrate the PostgreSQL volume from the latest snapshot.
  - Restore other resources (StatefulSets, ConfigMaps, Secrets).
  - Restart the affected pods; Redis starts empty and fills itself from PostgreSQL.
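
The restoration order above could be scripted roughly as follows. This sketch only builds the command lines and does not run them. The backup name, namespace, and workload names are hypothetical placeholders, and splitting the restore into two Velero invocations with `--include-resources` is one possible way to enforce the ordering, not necessarily how FireMUD does it.

```python
import shlex


def k8s_restore_commands(backup, namespace, workloads):
    """Build the ordered commands for a volume-first Velero restore."""
    commands = [
        # Step 1: restore the persistent volumes (PostgreSQL data) first.
        ["velero", "restore", "create", f"{backup}-volumes",
         "--from-backup", backup,
         "--include-resources", "persistentvolumeclaims,persistentvolumes",
         "--wait"],
        # Step 2: restore the remaining resources named in the backup scope.
        ["velero", "restore", "create", f"{backup}-resources",
         "--from-backup", backup,
         "--include-resources", "statefulsets,configmaps,secrets",
         "--wait"],
    ]
    # Step 3: restart the affected pods; Redis comes up empty and refills on access.
    for name in workloads:
        commands.append(
            ["kubectl", "rollout", "restart", f"statefulset/{name}", "-n", namespace]
        )
    return commands


# Example with placeholder names (assumptions, not taken from the source).
cmds = k8s_restore_commands("firemud-prod-latest", "firemud", ["postgres", "redis"])
for cmd in cmds:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the names have been verified against the real cluster.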
🐳 Local Development
- Backups are restored manually using `pg_restore` from local snapshot files.
- Services are restarted with Docker Compose.
- Redis starts empty and repopulates when services access the database.
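
A local restore along these lines might look like the sketch below. The snapshot directory, the `*.dump` file pattern, and the database name are assumptions for illustration. The sketch also assumes the snapshot files are in a `pg_restore`-compatible archive format and that the PostgreSQL instance is reachable from the host.

```python
from pathlib import Path


def latest_snapshot(directory, pattern="*.dump"):
    """Return the most recently modified snapshot file, or None if there is none."""
    files = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def local_restore_commands(dump_file, database):
    """Build the commands for a manual local restore; nothing is executed here."""
    return [
        # Drop and recreate the dumped objects, then load the snapshot.
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname", database, str(dump_file)],
        # Restart the services; Redis starts empty and refills on access.
        ["docker", "compose", "restart"],
    ]


# Example with placeholder values (assumptions, not taken from the source).
local_cmds = local_restore_commands(Path("snapshots/firemud.dump"), "firemud")
for cmd in local_cmds:
    print(" ".join(cmd))
```

In practice `latest_snapshot("snapshots")` would supply the `dump_file` argument, and each command can be run with `subprocess.run(cmd, check=True)`.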
Restore Workflow Summary
Environment | Steps |
---|---|
Kubernetes | Restore PostgreSQL via Velero → restore other resources → restart pods → allow Redis to repopulate |
Docker Compose | `pg_restore` local backup → restart containers → Redis repopulates automatically |
Redis always uses AOF for crash recovery during runtime but is never restored from backup images. Gameplay resumes after services restart and Redis repopulates from PostgreSQL.