Platform Operations Runbook
Operational reference for the ShackleAI platform. Covers VPS access, Docker service management, PostgreSQL, Redis, Caddy, CI/CD deploy and rollback, and emergency procedures. All commands assume a single-VPS deployment on Ubuntu 24.04.
Security note: Replace all placeholder values (YOUR_VPS_IP, path/to/key.pem) with your actual credentials. Never commit credentials to version control.
1. VPS Access & Health Checks
Connect to the VPS over SSH using the PEM key. Always verify system health before making changes.
SSH access
ssh -i "path/to/key.pem" ubuntu@YOUR_VPS_IP
System health
Run these commands after connecting to verify resource availability before a deploy or troubleshooting session.
# System resource usage
htop
# Disk usage
df -h
# Memory usage
free -m
# All running containers
docker ps
# Caddy proxy status
systemctl status caddy
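The disk check above can be turned into a simple gate that refuses to continue when the root filesystem is nearly full. This is a minimal sketch, not part of the platform tooling: the function names and the 85% default threshold are assumptions chosen for illustration.

```shell
# Pre-change disk gate (sketch). Function names and the 85% default are
# illustrative assumptions, not part of the platform.

# Print the used percentage (digits only) of the filesystem holding path $1.
disk_used_pct() {
  # -P forces POSIX output so each filesystem stays on a single line.
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Return 0 when value $1 is less than or equal to limit $2, 1 otherwise.
within_limit() {
  [ "$1" -le "$2" ]
}

# Check the root filesystem against DISK_LIMIT_PCT (default 85).
preflight() {
  local pct
  pct=$(disk_used_pct /)
  if within_limit "$pct" "${DISK_LIMIT_PCT:-85}"; then
    echo "disk ok (${pct}%)"
  else
    echo "disk usage too high (${pct}%)" >&2
    return 1
  fi
}
```

Calling `preflight && docker compose -f docker-compose.prod.yml up -d --build` would skip the rebuild when the disk is above the threshold.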
2. Docker Operations
All platform services run as Docker containers managed by Docker Compose. The production compose file is docker-compose.prod.yml.
List all containers
# Show all containers (running and stopped)
docker ps -a
Restart a service
# Restart a specific service
docker compose -f docker-compose.prod.yml restart <service>
# Examples:
docker compose -f docker-compose.prod.yml restart nextjs
docker compose -f docker-compose.prod.yml restart worker
View logs
# Tail 100 lines and follow logs for a container
docker logs --tail 100 -f <container>
# Examples:
docker logs --tail 100 -f shackleai-platform-nextjs-1
docker logs --tail 100 -f shackleai-platform-worker-1
Rebuild and restart all services
# Rebuild images from the current source and restart all services in detached mode
docker compose -f docker-compose.prod.yml up -d --build
Prune unused resources
Warning: This command removes ALL stopped containers, all images not used by a running container, unused networks, the build cache, and unused volumes (on recent Docker releases, only anonymous volumes). Only run during a planned maintenance window. Take a database backup first.
# WARNING: This removes ALL stopped containers, ALL unused images (not only dangling ones),
# unused networks, build cache, and unused volumes. Only run during maintenance.
docker system prune -af --volumes
3. PostgreSQL Management
PostgreSQL 16 with pgvector runs in the shackleai-platform-postgres-1 container. Database name: shackleai_platform.
Connect to the database
docker exec -it shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform
Common operational queries
-- Total registered users
SELECT COUNT(*) FROM users;
-- Active subscriptions by tier
SELECT tier, COUNT(*) FROM subscriptions WHERE status = 'active' GROUP BY tier;
-- API calls in the last 24 hours
SELECT COUNT(*) FROM api_calls WHERE created_at > NOW() - INTERVAL '24 hours';
-- Ten most recently applied migrations (pending ones do not appear here)
SELECT name, applied_at FROM _migrations ORDER BY applied_at DESC LIMIT 10;
Backup
Run backups before any schema migration or infrastructure change. Store the output file securely off-VPS.
# Dump the database to a local file
docker exec shackleai-platform-postgres-1 \
  pg_dump -U shackleai shackleai_platform > backup_$(date +%Y%m%d_%H%M%S).sql
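A plain redirect leaves a partial or empty file behind when pg_dump fails partway. The sketch below wraps the same dump command in a function that removes failed or empty dumps and compresses successful ones. The function name `backup_db` and the choice to gzip the result are assumptions for illustration, not part of the existing procedure.

```shell
# Backup wrapper (sketch). backup_db is a hypothetical helper name.
# Usage: backup_db [target_dir]  -> prints the path of the compressed dump.
backup_db() {
  local dir="${1:-.}"
  local out="$dir/backup_$(date +%Y%m%d_%H%M%S).sql"

  # Same dump command as above; remove the partial file if pg_dump fails.
  if ! docker exec shackleai-platform-postgres-1 \
        pg_dump -U shackleai shackleai_platform > "$out"; then
    echo "pg_dump failed; removing partial file $out" >&2
    rm -f "$out"
    return 1
  fi

  # An empty dump almost certainly means something went wrong.
  if [ ! -s "$out" ]; then
    echo "dump is empty: $out" >&2
    rm -f "$out"
    return 1
  fi

  # Compress the dump and report the final path.
  gzip "$out" && echo "$out.gz"
}
```

A compressed dump can be restored by piping it through `gunzip -c <file>.sql.gz` into the restore command in the next section.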
Restore
Warning: Restore will overwrite existing data. Ensure the target database is in a known state before running this command.
# Restore from a backup file
docker exec -i shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform < backup.sql
Check migration status
# Check migration status from the VPS (run from the repository root)
for f in src/migrations/*.sql; do
  name=$(basename "$f")
  exists=$(docker exec shackleai-platform-postgres-1 \
    psql -U shackleai -d shackleai_platform -t \
    -c "SELECT 1 FROM _migrations WHERE name='$name'" | tr -d ' ')
  echo "$name: $([ "$exists" = "1" ] && echo applied || echo PENDING)"
done
4. Redis Operations
Redis 7 runs in the shackleai-platform-redis-1 container. The platform uses DB index 3 for session and cache data.
Connect to Redis CLI
# Connect to Redis on DB index 3 (ShackleAI platform DB)
docker exec -it shackleai-platform-redis-1 redis-cli -n 3
Inspect keys and memory
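# List all keys in the current DB (KEYS is O(N) and blocks the server; prefer SCAN on large datasets)
KEYS *
# Check a specific key's value
GET <key>
# Check memory usage
INFO memory
# Monitor live commands (press Ctrl+C to stop; adds load, so keep sessions short)
MONITOR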
Flush the platform cache
Caution: FLUSHDB removes all keys in the selected DB index. Sessions will be invalidated and users will be logged out. Use only when necessary (e.g., cache poisoning, stale data incident).
# Switch to the platform DB index and flush it
# WARNING: This clears ALL cached data for the platform.
SELECT 3
FLUSHDB
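When only part of the cache is stale, deleting keys by pattern avoids logging every user out. The following is a sketch under assumptions: the helper name `delete_keys_matching` and the `cache:*` prefix are hypothetical, since this runbook does not document the platform's key naming. It relies on `redis-cli --scan --pattern`, which iterates with SCAN instead of KEYS.

```shell
# Targeted cache invalidation (sketch). The helper name is hypothetical and
# the key pattern must be adapted to the platform's real key naming.
# Usage: delete_keys_matching 'cache:*'
delete_keys_matching() {
  local pattern="$1" count=0 key keys

  # Collect matching keys from DB index 3 using SCAN-based iteration.
  keys=$(docker exec shackleai-platform-redis-1 \
    redis-cli -n 3 --scan --pattern "$pattern") || return 1

  # Delete keys one at a time; slow for huge key sets but easy to audit.
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    docker exec shackleai-platform-redis-1 \
      redis-cli -n 3 DEL "$key" </dev/null >/dev/null
    count=$((count + 1))
  done <<EOF
$keys
EOF

  echo "deleted $count keys"
}
```

Running the scan command on its own first shows which keys would be removed before anything is deleted.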
5. Caddy (Reverse Proxy & SSL)
Caddy handles HTTPS termination and reverse proxying. SSL certificates are auto-provisioned and renewed via Let's Encrypt — no manual renewal is needed under normal conditions.
Check SSL certificate status
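# The Caddy v2 CLI has no subcommand that lists managed certificates.
# Inspect the certificate currently served for the domain instead:
echo | openssl s_client -connect shackleai.com:443 -servername shackleai.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates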
Reload configuration
Use this after editing the Caddyfile. Caddy reloads without dropping active connections.
# Reload Caddy config without downtime
caddy reload --config /etc/caddy/Caddyfile
View access logs
# Stream Caddy access logs
journalctl -u caddy -f
# Or if running in Docker
docker logs --tail 100 -f shackleai-platform-caddy-1
Add a new subdomain
# Example Caddyfile entry for a new subdomain
api.shackleai.com {
    reverse_proxy localhost:3001
}
# After editing, reload:
caddy reload --config /etc/caddy/Caddyfile
6. CI/CD Pipeline
The platform uses a manual SSH deploy as the primary method. A GitHub Actions workflow is available when the self-hosted runner is online. Local quality gates (lint, unit tests, build) must pass before every push.
Manual deploy via GitHub Actions
Use when the self-hosted runner is online.
# Trigger a manual deploy via GitHub Actions (when runner is online)
gh workflow run deploy.yml --repo shackleai/platform
Manual deploy via SSH
Primary deploy method. Run locally after all PRs for the release batch are merged to main.
# Manual deploy via SSH (primary method when runner is offline)
ssh -i "path/to/key.pem" ubuntu@YOUR_VPS_IP
cd /home/ubuntu/shackleai-platform
export GIT_SSH_COMMAND="ssh -i /home/ubuntu/.ssh/shackleai_deploy -o StrictHostKeyChecking=no"
git fetch git@github.com:shackleai/platform.git main
git reset --hard FETCH_HEAD
docker compose -f docker-compose.prod.yml up -d --build
# Health check
curl -s -o /dev/null -w "HTTP %{http_code}" http://localhost:3002/api/health
Roll back to a previous release
Every milestone completion creates a tagged GitHub Release. Use the tag to roll back if a deploy introduces a regression.
# Roll back to a specific tagged release
git checkout v0.X.Y
docker compose -f docker-compose.prod.yml up -d --build
# Verify the rollback
curl -s -o /dev/null -w "HTTP %{http_code}" http://localhost:3002/api/health
7. Emergency Procedures
Structured triage steps for the most common production incidents. Work through each step in order before escalating to more disruptive actions.
Platform is down (site unreachable)
# Step 1: Check if containers are running
docker ps -a
# Step 2: Restart any stopped containers
docker compose -f docker-compose.prod.yml up -d
# Step 3: Check Caddy
systemctl status caddy
# If stopped:
sudo systemctl start caddy
# Step 4: Check DNS (from local machine)
dig shackleai.com +short
nslookup shackleai.com
# Step 5: Check disk space
df -h
# If full, see the "Disk full" procedure below
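After restarting containers, the app may need a few seconds before it answers. The sketch below polls the health endpoint used elsewhere in this runbook until it returns HTTP 200 or the attempts run out. The function name, the retry defaults, and the assumption that a healthy app answers with 200 are illustrative choices.

```shell
# Health polling (sketch). wait_for_health is a hypothetical helper; the
# defaults of 10 attempts and a 3-second delay are arbitrary.
# Usage: wait_for_health [url] [attempts] [delay_seconds]
wait_for_health() {
  local url="${1:-http://localhost:3002/api/health}"
  local attempts="${2:-10}" delay="${3:-3}" code i=1

  while [ "$i" -le "$attempts" ]; do
    # -m 5 caps each request at 5 seconds; curl prints 000 when it cannot connect.
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url") || true
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    i=$((i + 1))
    sleep "$delay"
  done

  return 1
}
```

A non-zero exit status after the final attempt is the signal to move on to the next triage step rather than keep waiting.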
Database connection pool exhausted
Symptoms: 500 errors with “too many connections” in logs, API calls timing out.
# Check current PostgreSQL connections
docker exec shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform \
  -c "SELECT count(*) FROM pg_stat_activity;"
# Terminate idle connections
docker exec shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform \
  -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < NOW() - INTERVAL '5 minutes';"
# Restart postgres as last resort
docker compose -f docker-compose.prod.yml restart postgres
Disk full
Symptoms: Docker containers crashing, write errors in logs, health check returning 500.
# Free up Docker resources
docker system prune -af
# Find the largest log files
du -sh /var/log/* | sort -rh | head -10
# Truncate large log files (do not delete; truncating keeps open file handles valid)
sudo truncate -s 0 /var/log/syslog
# Clean up old SQL backups
ls -lh ~/backups/*.sql
rm ~/backups/backup_2024*.sql  # remove backups older than retention window
# After cleanup, verify disk
df -h
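Deleting backups by filename glob depends on remembering which year to match. An age-based cleanup is one alternative, sketched below with `find`. The helper name and the 30-day default are assumptions, since this runbook does not state the retention window.

```shell
# Age-based backup cleanup (sketch). prune_old_backups is a hypothetical
# helper and 30 days is an assumed retention window.
# Usage: prune_old_backups [dir] [days]
prune_old_backups() {
  local dir="${1:-$HOME/backups}" days="${2:-30}"

  # -mtime +N matches files whose age in whole days is greater than N.
  # -print lists each file as it is deleted.
  find "$dir" -maxdepth 1 -type f -name 'backup_*.sql' \
    -mtime +"$days" -print -delete
}
```

Dropping `-delete` from the command gives a dry run that only lists the files that would be removed.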
SSL certificate issues
Caddy auto-renews certificates 30 days before expiry. Manual intervention is rarely needed. If you see browser certificate warnings:
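# Check the certificate currently being served (the Caddy v2 CLI has no list-certificates subcommand)
echo | openssl s_client -connect shackleai.com:443 -servername shackleai.com 2>/dev/null \
  | openssl x509 -noout -dates
# Reload the config (this re-applies the Caddyfile; it does not by itself force a renewal)
caddy reload --config /etc/caddy/Caddyfile
# If Caddy is stuck, restart it
sudo systemctl restart caddy
# Verify HTTPS is working
curl -I https://shackleai.com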
Quick Reference
| Service | Container name | Port | Restart command |
|---|---|---|---|
| Next.js app | shackleai-platform-nextjs-1 | 3002 | docker compose -f docker-compose.prod.yml restart nextjs |
| PostgreSQL | shackleai-platform-postgres-1 | 5432 | docker compose -f docker-compose.prod.yml restart postgres |
| Redis | shackleai-platform-redis-1 | 6379 | docker compose -f docker-compose.prod.yml restart redis |
| Caddy | caddy (systemd) | 80 / 443 | sudo systemctl restart caddy |