Platform Operations Runbook

Operational reference for the ShackleAI platform. Covers VPS access, Docker service management, PostgreSQL, Redis, Caddy, CI/CD deploy and rollback, and emergency procedures. All commands assume a single-VPS deployment on Ubuntu 24.04.

Security note: Replace all placeholder values (YOUR_VPS_IP, path/to/key.pem) with your actual credentials. Never commit credentials to version control.

1. VPS Access & Health Checks

Connect to the VPS over SSH using the PEM key. Always verify system health before making changes.

SSH access

ssh -i "path/to/key.pem" ubuntu@YOUR_VPS_IP

System health

Run these commands after connecting to verify resource availability before a deploy or troubleshooting session.

# System resource usage
htop

# Disk usage
df -h

# Memory usage
free -m

# All running containers
docker ps

# Caddy proxy status
systemctl status caddy
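
The checks above can be turned into a simple go/no-go gate before a deploy. The following is a minimal sketch, not part of the platform tooling: the function names and the 85% threshold are assumptions chosen for illustration.

```shell
# Sketch of a pre-change resource gate. Function names and the 85%
# threshold in the usage line are illustrative assumptions.

# Print the used-space percentage of a mount point as a bare integer.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print the used-memory percentage from MemTotal and MemAvailable (Linux only).
mem_used_pct() {
  awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
       END { printf "%d\n", (t - a) * 100 / t }' /proc/meminfo
}

# Succeed only when the first argument is strictly below the second.
below_limit() {
  [ "$1" -lt "$2" ]
}

# Usage:
#   below_limit "$(disk_used_pct /)" 85 || echo "disk above 85%, clean up first"
#   below_limit "$(mem_used_pct)" 85   || echo "memory above 85%, investigate first"
```

Keeping the measurement and the comparison in separate functions makes it easy to reuse the same threshold check for disk and memory.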

2. Docker Operations

All platform services run as Docker containers managed by Docker Compose. The production compose file is docker-compose.prod.yml.

List all containers

# Show all containers (running and stopped)
docker ps -a

Restart a service

# Restart a specific service
docker compose -f docker-compose.prod.yml restart <service>

# Examples:
docker compose -f docker-compose.prod.yml restart nextjs
docker compose -f docker-compose.prod.yml restart worker

View logs

# Tail 100 lines and follow logs for a container
docker logs --tail 100 -f <container>

# Examples:
docker logs --tail 100 -f shackleai-platform-nextjs-1
docker logs --tail 100 -f shackleai-platform-worker-1

Rebuild and restart all services

# Rebuild images from the current checkout and restart in detached mode
docker compose -f docker-compose.prod.yml up -d --build

Prune unused resources

Warning: This command removes ALL stopped containers, unused images, networks, and volumes. Only run during a planned maintenance window. Take a database backup first.

# WARNING: This removes ALL stopped containers, ALL unused images
# (not only dangling ones), unused networks, and unused volumes.
# Only run during maintenance.
docker system prune -af --volumes

3. PostgreSQL Management

PostgreSQL 16 with pgvector runs in the shackleai-platform-postgres-1 container. Database name: shackleai_platform.

Connect to the database

docker exec -it shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform

Common operational queries

-- Total registered users
SELECT COUNT(*) FROM users;

-- Active subscriptions by tier
SELECT tier, COUNT(*) FROM subscriptions
WHERE status = 'active'
GROUP BY tier;

-- Recent API key usage (last 24 hours)
SELECT COUNT(*) FROM api_calls
WHERE created_at > NOW() - INTERVAL '24 hours';

-- Most recently applied migrations (pending migrations do not appear here)
SELECT name, applied_at FROM _migrations
ORDER BY applied_at DESC LIMIT 10;

Backup

Run backups before any schema migration or infrastructure change. Store the output file securely off-VPS.

# Dump the database to a local file
docker exec shackleai-platform-postgres-1 \
  pg_dump -U shackleai shackleai_platform > backup_$(date +%Y%m%d_%H%M%S).sql
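
A dump that was cut short (for example by a full disk) still leaves a file behind, so it is worth checking the file before relying on it. The sketch below assumes the plain SQL format produced by the command above; verify_backup is a hypothetical helper name, not an existing script.

```shell
# Hypothetical helper: sanity-check a plain-format pg_dump file.
verify_backup() {
  local file="$1"
  # An empty or missing file usually means pg_dump failed before writing.
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  # Plain-format dumps end with a "PostgreSQL database dump complete" comment.
  tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete' \
    || { echo "dump looks truncated: $file" >&2; return 1; }
}

# Usage:
#   verify_backup backup_20250101_120000.sql && echo "backup looks complete"
```

This only confirms that pg_dump reached the end of its output. It does not prove the dump restores cleanly, so a periodic test restore is still worthwhile.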

Restore

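Warning: Restoring into a populated database is destructive and can fail partway. A plain pg_dump file does not drop existing objects unless it was created with --clean, so its statements may conflict with tables that already exist. Ensure the target database is in a known state (ideally empty) before running this command.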

# Restore from a backup file
docker exec -i shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform < backup.sql

Check migration status

# Check migration status from VPS
for f in src/migrations/*.sql; do
  name=$(basename "$f")
  exists=$(docker exec shackleai-platform-postgres-1 \
    psql -U shackleai -d shackleai_platform -t \
    -c "SELECT 1 FROM _migrations WHERE name='$name'" | tr -d ' ')
  echo "$name: $([ "$exists" = "1" ] && echo applied || echo PENDING)"
done

4. Redis Operations

Redis 7 runs in the shackleai-platform-redis-1 container. The platform uses DB index 3 for session and cache data.

Connect to Redis CLI

# Connect to Redis on DB index 3 (ShackleAI platform DB)
docker exec -it shackleai-platform-redis-1 redis-cli -n 3

Inspect keys and memory

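# List all keys in current DB
# (KEYS blocks Redis while it walks the keyspace; on a large DB prefer SCAN)
KEYS *

# Iterate keys in batches; repeat with the returned cursor until it is 0
SCAN 0 COUNT 100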

# Check a specific key's value
GET <key>

# Check memory usage
INFO memory

# Monitor live commands (press Ctrl+C to stop)
MONITOR

Flush the platform cache

Caution: FLUSHDB removes all keys in the selected DB index. Sessions will be invalidated and users will be logged out. Use only when necessary (e.g., cache poisoning, stale data incident).

# Switch to the platform DB index and flush it
# WARNING: This clears ALL cached data for the platform.
SELECT 3
FLUSHDB

5. Caddy (Reverse Proxy & SSL)

Caddy handles HTTPS termination and reverse proxying. SSL certificates are auto-provisioned and renewed via Let's Encrypt — no manual renewal is needed under normal conditions.

Check SSL certificate status

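# Show the validity dates of the certificate currently being served
# (current Caddy v2 releases have no list-certificates subcommand)
echo | openssl s_client -connect shackleai.com:443 -servername shackleai.com 2>/dev/null \
  | openssl x509 -noout -dates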

Reload configuration

Use this after editing the Caddyfile. Caddy reloads without dropping active connections.

# Reload Caddy config without downtime
caddy reload --config /etc/caddy/Caddyfile

View access logs

# Stream Caddy access logs
journalctl -u caddy -f

# Or if running in Docker
docker logs --tail 100 -f shackleai-platform-caddy-1

Add a new subdomain

# Example Caddyfile entry for a new subdomain
api.shackleai.com {
  reverse_proxy localhost:3001
}

# After editing, reload:
caddy reload --config /etc/caddy/Caddyfile
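
Validating the Caddyfile before reloading surfaces syntax errors without touching the running instance. The following is a minimal sketch: safe_caddy_reload, CADDY_BIN, and CADDYFILE are illustrative names introduced here, while `caddy validate` and `caddy reload` are the standard Caddy v2 subcommands.

```shell
# CADDY_BIN and CADDYFILE are illustrative overrides, not Caddy settings.
CADDY_BIN="${CADDY_BIN:-caddy}"
CADDYFILE="${CADDYFILE:-/etc/caddy/Caddyfile}"

# Reload only when the Caddyfile parses and validates.
safe_caddy_reload() {
  "$CADDY_BIN" validate --config "$CADDYFILE" --adapter caddyfile \
    || { echo "Caddyfile invalid, reload skipped" >&2; return 1; }
  "$CADDY_BIN" reload --config "$CADDYFILE" --adapter caddyfile
}

# Usage:
#   safe_caddy_reload && echo "Caddy reloaded"
```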

6. CI/CD Pipeline

The platform uses a manual SSH deploy as the primary method. A GitHub Actions workflow is available when the self-hosted runner is online. Local quality gates (lint, unit tests, build) must pass before every push.

Manual deploy via GitHub Actions

Use when the self-hosted runner is online.

# Trigger a manual deploy via GitHub Actions (when runner is online)
gh workflow run deploy.yml --repo shackleai/platform

Manual deploy via SSH

Primary deploy method. Run locally after all PRs for the release batch are merged to main.

# Manual deploy via SSH (primary method when runner is offline)
ssh -i "path/to/key.pem" ubuntu@YOUR_VPS_IP

cd /home/ubuntu/shackleai-platform
export GIT_SSH_COMMAND="ssh -i /home/ubuntu/.ssh/shackleai_deploy -o StrictHostKeyChecking=no"
git fetch git@github.com:shackleai/platform.git main
git reset --hard FETCH_HEAD
docker compose -f docker-compose.prod.yml up -d --build

# Health check
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://localhost:3002/api/health
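
Right after `up -d --build` the app may need a few seconds before it answers, so a single health check can fail even when the deploy is fine. The sketch below polls the same endpoint until it returns 200. The function names, attempt count, and delay are assumptions chosen for illustration.

```shell
# HEALTH_URL defaults to the health endpoint used in this runbook.
HEALTH_URL="${HEALTH_URL:-http://localhost:3002/api/health}"

# Print the HTTP status code (curl reports 000 when the connection fails).
probe_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$HEALTH_URL"
}

# Poll until the endpoint returns 200 or the attempts run out.
# Arguments: attempts (default 10), delay in seconds (default 3).
wait_for_health() {
  local attempts="${1:-10}" delay="${2:-3}" i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    code="$(probe_status)"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s), last status: $code" >&2
  return 1
}

# Usage:
#   wait_for_health 10 3 || echo "deploy needs attention"
```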

Roll back to a previous release

Every milestone completion creates a tagged GitHub Release. Use the tag to roll back if a deploy introduces a regression.

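# Roll back to a specific tagged release (run from /home/ubuntu/shackleai-platform).
# If the tag is not present locally, fetch tags first, with GIT_SSH_COMMAND
# exported as in the deploy steps:
#   git fetch --tags git@github.com:shackleai/platform.git
# Note: checking out a tag leaves the repository in detached HEAD state.
git checkout v0.X.Y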
docker compose -f docker-compose.prod.yml up -d --build

# Verify the rollback
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://localhost:3002/api/health
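
When the regression came in with the latest release, the rollback target is usually the tag just before it. The following sketch picks that tag from a list using version ordering. previous_tag is a hypothetical helper, the tag in the usage line is a made-up example, and `sort -V` assumes GNU coreutils (as on Ubuntu).

```shell
# Read tag names on stdin and print the highest version below the given tag.
# Prints nothing when the given tag is the oldest one or is not in the list.
previous_tag() {
  local current="$1"
  sort -V | awk -v cur="$current" '
    $0 == cur { if (prev != "") print prev; exit }
    { prev = $0 }'
}

# Usage (inside the repository; v0.4.2 is a made-up example tag):
#   git tag --list 'v*' | previous_tag v0.4.2
```

Version sort matters here because a plain alphabetical sort would place v0.10.0 before v0.9.1.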

7. Emergency Procedures

Structured triage steps for the most common production incidents. Work through each step in order before escalating to more disruptive actions.

Platform is down (site unreachable)

# Step 1: Check if containers are running
docker ps -a

# Step 2: Restart any stopped containers
docker compose -f docker-compose.prod.yml up -d

# Step 3: Check Caddy
systemctl status caddy
# If stopped:
sudo systemctl start caddy

# Step 4: Check DNS (from local machine)
dig shackleai.com +short
nslookup shackleai.com

# Step 5: Check disk space
df -h
# If full, see disk full procedure below

Database connection pool exhausted

Symptoms: 500 errors with “too many connections” in logs, API calls timing out.

# Check current PostgreSQL connections
docker exec shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform \
  -c "SELECT count(*) FROM pg_stat_activity;"

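# Terminate connections that have been idle for more than 5 minutes
# (state_change records when the connection last became idle)
docker exec shackleai-platform-postgres-1 \
  psql -U shackleai -d shackleai_platform \
  -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes' AND pid <> pg_backend_pid();"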

# Restart postgres as last resort
docker compose -f docker-compose.prod.yml restart postgres
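
Before terminating anything, it helps to see how close the server is to its connection limit and which states the connections are in. The sketch below is one way to do that. pg_query and pool_usage_pct are hypothetical helper names, and the queries rely only on the standard pg_stat_activity view and the max_connections setting.

```shell
PG_CONTAINER="${PG_CONTAINER:-shackleai-platform-postgres-1}"

# Run one SQL statement in the postgres container (unaligned, tuples only).
pg_query() {
  docker exec "$PG_CONTAINER" \
    psql -U shackleai -d shackleai_platform -At -c "$1"
}

# Integer percentage of used connections: pool_usage_pct <used> <max>
pool_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Usage:
#   pg_query "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC;"
#   used=$(pg_query "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';")
#   max=$(pg_query "SHOW max_connections;")
#   echo "pool usage: $(pool_usage_pct "$used" "$max")%"
```

A large share of connections in the idle in transaction state usually points at application code holding transactions open, which terminating idle connections will not fix.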

Disk full

Symptoms: Docker containers crashing, write errors in logs, health check returning 500.

# Free up Docker resources
docker system prune -af

# Check large log files
du -sh /var/log/* | sort -rh | head -10

# Truncate large log files (do not delete them: truncating keeps open file handles valid)
sudo truncate -s 0 /var/log/syslog

# Clean up old SQL backups
ls -lh ~/backups/*.sql
rm ~/backups/backup_2024*.sql  # remove backups older than retention window

# After cleanup, verify disk
df -h
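
Selecting old backups by age is less error-prone than a hard-coded filename glob. The following is a minimal sketch, assuming backups follow the backup_*.sql naming used in this runbook and a 30-day retention window. Both the helper name and the window are assumptions. It only lists candidates, so nothing is deleted until the output has been reviewed.

```shell
# List backup files older than N days (default 30). Dry run: prints only.
list_old_backups() {
  local dir="$1" days="${2:-30}"
  find "$dir" -maxdepth 1 -type f -name 'backup_*.sql' -mtime +"$days" -print
}

# Usage:
#   list_old_backups ~/backups 30                   # review the list first
#   list_old_backups ~/backups 30 | xargs -r rm --  # then delete
```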

SSL certificate issues

Caddy auto-renews certificates 30 days before expiry. Manual intervention is rarely needed. If you see browser certificate warnings:

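# Check certificate status (validity dates of the certificate being served)
echo | openssl s_client -connect shackleai.com:443 -servername shackleai.com 2>/dev/null \
  | openssl x509 -noout -dates

# Reload the config so Caddy re-evaluates its managed certificates
# (--force reloads even when the Caddyfile is unchanged; a reload does not
# by itself guarantee a renewal)
caddy reload --config /etc/caddy/Caddyfile --force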

# If Caddy is stuck, restart it
systemctl restart caddy

# Verify HTTPS is working
curl -I https://shackleai.com
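
To judge whether auto-renewal is overdue, it helps to know how many days the served certificate has left. The sketch below reads the expiry date with openssl and converts it to whole days. The function names are illustrative, and `date -d` assumes GNU date (as on Ubuntu).

```shell
# Print the notAfter date of the certificate served for a host.
cert_not_after() {
  local host="$1"
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2
}

# Print the whole days between now and a date string (negative if in the past).
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch="$(date -u -d "$1" +%s)" || return 1
  now_epoch="$(date -u +%s)"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage:
#   days_until "$(cert_not_after shackleai.com)"
```

A value well below 30 suggests renewal is not happening and the Caddy logs deserve a look.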

Quick Reference

Service | Container name | Port | Restart command
Next.js app | shackleai-platform-nextjs-1 | 3002 | docker compose -f docker-compose.prod.yml restart nextjs
PostgreSQL | shackleai-platform-postgres-1 | 5432 | docker compose -f docker-compose.prod.yml restart postgres
Redis | shackleai-platform-redis-1 | 6379 | docker compose -f docker-compose.prod.yml restart redis
Caddy | caddy (systemd) | 80 / 443 | sudo systemctl restart caddy