DevOps & Deployment
Use when setting up CI/CD pipelines, containerizing applications, deploying to Kubernetes, or writing infrastructure as code. DevOps & Deployment covers GitHub Actions, Docker, Helm, and Terraform patterns.
Primary Agent: data-pipeline-engineer
DevOps & Deployment Skill
Comprehensive frameworks for CI/CD pipelines, containerization, deployment strategies, and infrastructure automation.
Overview
Use this skill when:
- Setting up CI/CD pipelines
- Containerizing applications
- Deploying to Kubernetes or cloud platforms
- Implementing GitOps workflows
- Managing infrastructure as code
- Planning release strategies
Pipeline Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Code │──>│ Build │──>│ Test │──>│ Deploy │
│ Commit │ │ & Lint │ │ & Scan │ │ & Release │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
v v v v
   Triggers          Artifacts        Reports         Monitoring
Key Concepts
CI/CD Pipeline Stages
- Lint & Type Check - Code quality gates
- Unit Tests - Test coverage with reporting
- Security Scan - npm audit + Trivy vulnerability scanner
- Build & Push - Docker image to container registry
- Deploy Staging - Environment-gated deployment
- Deploy Production - Manual approval or automated
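A minimal GitHub Actions skeleton wiring these six stages together might look like the sketch below; the job names, registry URL, and environment gates are placeholders, and the full workflow lives in scripts/github-actions-pipeline.yml.
```yaml
name: CI/CD
on:
  push:
    branches: [main, dev]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint && npm run typecheck
  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test -- --coverage
  security:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
  build-and-push:
    needs: [test, security]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Registry login step omitted; image name is illustrative
      - run: docker build -t registry.example.com/app:${{ github.sha }} .
  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging          # environment-gated deployment
    steps:
      - run: echo "deploy ${{ github.sha }} to staging"
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production       # manual approval configured on the environment
    steps:
      - run: echo "deploy ${{ github.sha }} to production"
```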
Container Best Practices
Multi-stage builds minimize image size:
- Stage 1: Install production dependencies only
- Stage 2: Build application with dev dependencies
- Stage 3: Production runtime with minimal footprint
Security hardening:
- Non-root user (uid 1001)
- Read-only filesystem where possible
- Health checks for orchestrator integration
Kubernetes Deployment
Essential manifests:
- Deployment with rolling update strategy
- Service for internal routing
- Ingress for external access with TLS
- HorizontalPodAutoscaler for scaling
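Of these, the HorizontalPodAutoscaler is the only manifest not shown elsewhere in this skill; a minimal sketch, assuming a Deployment named myapp and a CPU-based target:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                 # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```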
Security context:
runAsNonRoot: true
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
  drop:
    - ALL
Deployment Strategies
| Strategy | Use Case | Risk |
|---|---|---|
| Rolling | Default, gradual replacement | Low - automatic rollback |
| Blue-Green | Instant switch, easy rollback | Medium - double resources |
| Canary | Progressive traffic shift | Low - gradual exposure |
Rolling Update (Kubernetes default):
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25%
    maxUnavailable: 0  # Zero downtime
Secrets Management
Use External Secrets Operator to sync from cloud providers:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault
- GCP Secret Manager
References
Docker Patterns
See: references/docker-patterns.md
Key topics covered:
- Multi-stage build examples with 78% size reduction
- Layer caching optimization
- Security hardening (non-root, health checks)
- Trivy vulnerability scanning
- Docker Compose development setup
CI/CD Pipelines
See: references/ci-cd-pipelines.md
Key topics covered:
- Branch strategy (Git Flow)
- GitHub Actions caching (85% time savings)
- Artifact management
- Matrix testing
- Complete backend CI/CD example
Kubernetes Basics
See: references/kubernetes-basics.md
Key topics covered:
- Health probes (startup, liveness, readiness)
- Security context configuration
- PodDisruptionBudget
- Resource quotas
- StatefulSets for databases
- Helm chart structure
Environment Management
See: references/environment-management.md
Key topics covered:
- External Secrets Operator
- GitOps with ArgoCD
- Terraform patterns (remote state, modules)
- Zero-downtime database migrations
- Alembic migration workflow
- Rollback procedures
Observability
See: references/observability.md
Key topics covered:
- Prometheus metrics exposition
- Grafana dashboard queries (PromQL)
- Alerting rules for SLOs
- Golden signals (SRE)
- Structured logging
- Distributed tracing (OpenTelemetry)
Railway Deployment
See: rules/railway-deployment.md
Key topics covered:
- railway.json configuration, Nixpacks builds
- Environment variable management, database provisioning
- Multi-service setups, Railway CLI workflows
- References:
references/railway-json-config.md, references/nixpacks-customization.md, references/multi-service-setup.md
Deployment Strategies
See: references/deployment-strategies.md
Key topics covered:
- Rolling deployment
- Blue-green deployment
- Canary releases
- Traffic splitting with Istio
Deployment Checklist
Pre-Deployment
- All tests passing in CI
- Security scans clean
- Database migrations ready
- Rollback plan documented
During Deployment
- Monitor deployment progress
- Watch error rates
- Verify health checks passing
Post-Deployment
- Verify metrics normal
- Check logs for errors
- Update status page
Helm Chart Structure
charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── configmap.yaml
│ ├── secret.yaml
│ ├── hpa.yaml
│ └── _helpers.tpl
└── values/
├── staging.yaml
    └── production.yaml
Related Skills
- zero-downtime-migration - Database migration patterns for zero-downtime deployments
- security-scanning - Security scanning integration for CI/CD pipelines
- ork:monitoring-observability - Monitoring and alerting for deployed applications
- ork:database-patterns - Python/Alembic migration workflow for backend deployments
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Container user | Non-root (uid 1001) | Security best practice, required by many orchestrators |
| Deployment strategy | Rolling update (default) | Zero downtime, automatic rollback, resource efficient |
| Secrets management | External Secrets Operator | Syncs from cloud providers, GitOps compatible |
| Health checks | Separate startup/liveness/readiness | Prevents premature traffic, enables graceful shutdown |
Extended Thinking Triggers
Use Opus 4.6 adaptive thinking for:
- Architecture decisions - Kubernetes vs serverless, multi-region setup
- Migration planning - Moving between cloud providers
- Incident response - Complex deployment failures
- Security design - Zero-trust architecture
Templates Reference
| Template | Purpose |
|---|---|
| github-actions-pipeline.yml | Full CI/CD workflow with 6 stages |
| Dockerfile | Multi-stage Node.js build |
| docker-compose.yml | Development environment |
| k8s-manifests.yaml | Deployment, Service, Ingress |
| helm-values.yaml | Helm chart values |
| terraform-aws.tf | VPC, EKS, RDS infrastructure |
| argocd-application.yaml | GitOps application |
| external-secrets.yaml | Secrets Manager integration |
Capability Details
ci-cd
Keywords: ci, cd, pipeline, github actions, gitlab ci, jenkins, workflow
Solves:
- How do I set up CI/CD?
- GitHub Actions workflow patterns
- Pipeline caching strategies
- Matrix testing setup
docker
Keywords: docker, dockerfile, container, image, build, compose, multi-stage
Solves:
- How do I containerize my app?
- Multi-stage Dockerfile best practices
- Docker Compose development setup
- Container security hardening
kubernetes
Keywords: kubernetes, k8s, deployment, service, ingress, helm, statefulset, pdb
Solves:
- How do I deploy to Kubernetes?
- K8s health probes and resource limits
- Helm chart structure
- StatefulSet for databases
infrastructure-as-code
Keywords: terraform, pulumi, iac, infrastructure, provision, gitops, argocd
Solves:
- How do I set up infrastructure as code?
- Terraform AWS patterns (VPC, EKS, RDS)
- GitOps with ArgoCD
- Secrets management patterns
deployment-strategies
Keywords: blue green, canary, rolling, deployment strategy, rollback, zero downtime
Solves:
- Which deployment strategy should I use?
- Zero-downtime database migrations
- Blue-green deployment setup
- Canary release with traffic splitting
observability
Keywords: prometheus, grafana, metrics, alerting, monitoring, health check
Solves:
- How do I add monitoring to my app?
- Prometheus metrics exposition
- Grafana dashboard queries
- Alerting rules for SLOs
Rules (6)
Protect CI/CD branches from direct pushes to enforce code review and audit trails — HIGH
CI/CD: Branch Protection
Configure branch protection rules to enforce code review, passing CI checks, and linear history on critical branches. This prevents untested or unreviewed code from reaching production.
Incorrect:
# Direct push to main — no review, no CI checks
git checkout main
git commit -m "quick fix"
git push origin main
# Force push overwrites history
git push --force origin main
# CI workflow with no branch restrictions
on:
push:
    branches: ['*']
Correct:
Branch strategy with protection rules:
main (production) ─────●────────●──────>
| |
dev (staging) ─────●──●────●──●──────>
| |
feature/* ─────────●────────┘
^
       └─ PR required, CI checks, code review
GitHub branch protection settings:
main branch:
- Require pull request before merging
- Required approving reviews: 2
- Require status checks to pass (lint, test, security)
- Require branches to be up to date before merging
- Do not allow force pushes
- Do not allow deletions
dev branch:
- Require pull request before merging
- Required approving reviews: 1
  - Require status checks to pass (lint, test)
Key rules:
- `main` requires PR + 2 approvals + all status checks passing before merge
- `dev` requires PR + 1 approval + all status checks passing
- Never allow direct commits or force pushes to `main` or `dev`
- Feature branches must be created from `dev` and merged back via PR
- Require branches to be up-to-date before merging to prevent integration gaps
- Enable "Require linear history" to keep the commit graph clean and auditable
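If protection settings are managed as code, the GitHub CLI can apply the same rules; a hedged sketch (OWNER/REPO and the status-check names are placeholders):
```bash
# protection.json - assumed file describing the desired settings
cat > protection.json <<'EOF'
{
  "required_status_checks": { "strict": true, "contexts": ["lint", "test", "security"] },
  "enforce_admins": true,
  "required_pull_request_reviews": { "required_approving_review_count": 2 },
  "restrictions": null,
  "allow_force_pushes": false,
  "allow_deletions": false
}
EOF
# Apply to the main branch (replace OWNER/REPO)
gh api -X PUT repos/OWNER/REPO/branches/main/protection --input protection.json
```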
Reference: references/ci-cd-pipelines.md (lines 5-21)
Cache CI/CD pipeline dependencies to avoid re-downloading and save minutes per run — HIGH
CI/CD: Pipeline Caching
Cache dependencies in CI pipelines using lockfile-based cache keys. Proper caching reduces dependency installation from 2-3 minutes to 10-20 seconds (~85% time savings).
Incorrect:
# No caching: re-downloads everything on every run
steps:
- uses: actions/checkout@v3
- run: npm install
  - run: npm test
# Bad cache key: no lockfile hash, stale deps served indefinitely
- uses: actions/cache@v3
with:
path: node_modules
    key: ${{ runner.os }}-modules
Correct:
- name: Cache Dependencies
uses: actions/cache@v3
with:
path: |
~/.npm
node_modules
backend/.venv
key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
restore-keys: |
${{ runner.os }}-deps-
- name: Install Dependencies
run: npm ci
- name: Run Tests
  run: npm test
# Python example with Poetry
- name: Cache Poetry Dependencies
uses: actions/cache@v3
with:
path: ~/.cache/pypoetry
key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}
- name: Install Dependencies
  run: poetry install
Key rules:
- Always include `hashFiles()` of the lockfile in the cache key so caches invalidate when dependencies change
- Use `restore-keys` as a fallback prefix to get a partial cache hit when the exact key misses
- Cache the package manager cache directory (`~/.npm`, `~/.cache/pypoetry`), not just `node_modules`
- Use `npm ci` (not `npm install`) after cache restore for reproducible installs
- Cache multiple dependency directories in a single step when possible (npm + pip + venv)
- Set artifact retention policies (`retention-days: 7`) to prevent storage bloat
Reference: references/ci-cd-pipelines.md (lines 23-68)
Run database migrations safely during deployment to prevent downtime and data loss — CRITICAL
DevOps: Database Migrations
All schema changes must be backward-compatible with the currently running application version. Destructive changes require a multi-phase migration to achieve zero-downtime deployments.
Incorrect:
-- Destructive: renames column while old code still references 'name'
ALTER TABLE users RENAME COLUMN name TO full_name;
-- Destructive: adds NOT NULL column, old inserts fail immediately
ALTER TABLE users ADD COLUMN email VARCHAR(255) NOT NULL;
-- Destructive: drops column while old code still reads it
ALTER TABLE users DROP COLUMN legacy_field;
Correct (3-phase zero-downtime migration):
-- Phase 1: Add nullable column (safe with old code running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
# Phase 2: Deploy new code that writes to both + backfill
def create_user(name: str, email: str):
db.execute(
"INSERT INTO users (name, email) VALUES (%s, %s)",
(name, email),
)
async def backfill_emails():
users = await db.fetch("SELECT id FROM users WHERE email IS NULL")
for user in users:
email = generate_email(user.id)
await db.execute(
"UPDATE users SET email = %s WHERE id = %s",
(email, user.id),
        )
-- Phase 3: Add constraint after backfill is verified complete
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Backward-compatible changes (safe to deploy directly):
- Add nullable column
- Add new table
- Add index
- Rename column with a view alias
Backward-incompatible changes (require 3-phase migration):
- Remove column
- Rename column without alias
- Add NOT NULL column
- Change column type
Deploy order: migrate (phase 1) --> deploy new code (phase 2) --> migrate (phase 3)
Key rules:
- Always deploy migrations before the application code that depends on them
- Never add a NOT NULL column in a single step — use the 3-phase pattern (add nullable, backfill, add constraint)
- Always write a `downgrade()` function so migrations can be rolled back (`alembic downgrade -1`)
- Always review auto-generated migrations before applying (`alembic revision --autogenerate`)
- Test rollback procedures regularly — do not assume `downgrade()` works without verification
- Column renames require a view alias to maintain backward compatibility during rollout
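The view-alias rename mentioned above is not shown elsewhere in this skill; one hedged way to do it in PostgreSQL (table and column names are illustrative):
```sql
-- Rename the base table and recreate the old name as a view, so old readers
-- and writers that reference users.name keep working during the rollout.
BEGIN;
ALTER TABLE users RENAME TO users_base;
ALTER TABLE users_base RENAME COLUMN name TO full_name;
-- Simple single-table views like this are updatable in PostgreSQL,
-- so existing INSERT/UPDATE statements against "users" still succeed.
CREATE VIEW users AS SELECT id, full_name AS name FROM users_base;
COMMIT;
-- Later, once all code reads users_base.full_name, drop the alias view.
```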
Reference: references/environment-management.md (lines 115-162)
Secure Docker layers by running as non-root and excluding secrets from image builds — CRITICAL
Docker: Layer Security
Every Docker image layer is immutable and inspectable. Running as root or embedding secrets in layers creates critical security vulnerabilities that persist even if later layers attempt to remove them.
Incorrect:
FROM node:20
WORKDIR /app
# BAD: Copies .env, .git, node_modules, and everything else
COPY . .
RUN npm install
# BAD: Secret baked into image layer (visible via docker history)
ARG DATABASE_URL
ENV DATABASE_URL=$DATABASE_URL
# BAD: Running as root (default)
EXPOSE 3000
CMD ["node", "dist/main.js"]Correct:
FROM node:20-alpine AS runner
WORKDIR /app
# GOOD: Create and use non-root user (uid 1001)
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
# GOOD: Copy only what's needed with explicit ownership
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
# GOOD: Run as non-root
USER nodejs
# GOOD: Secrets injected at runtime, never in image
# Use: docker run -e DATABASE_URL=... or Kubernetes secrets
ENV NODE_ENV=production
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]Required .dockerignore:
.git
.env
.env.*
node_modules
*.md
tests/
.vscode/
Key rules:
- Never run containers as root — always create a non-root user with the `USER` directive
- Never pass secrets via `ARG` or `ENV` in the Dockerfile — they are visible in `docker history`
- Always use a `.dockerignore` to exclude `.env`, `.git`, `node_modules`, and test files
- Use `COPY --chown` to set file ownership without a separate `chown` layer
- Prefer minimal base images (`-alpine`) to reduce the CVE surface area
- Enable read-only root filesystem in Kubernetes (`readOnlyRootFilesystem: true`)
- Add health checks so orchestrators can detect and restart unhealthy containers
Reference: references/docker-patterns.md (lines 52-85)
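When a build step genuinely needs a credential (for example a private npm registry token), BuildKit secret mounts keep it out of every layer; a hedged sketch (the npmrc secret id and mount path are assumptions):
```dockerfile
# syntax=docker/dockerfile:1
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
# The secret is available only during this RUN step and is never stored in a layer
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci --only=production
```
Built with `docker build --secret id=npmrc,src=$HOME/.npmrc .`, the credential file never enters the image or its history.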
Use Docker multi-stage builds to exclude dev dependencies and reduce image size by 4-5x — HIGH
Docker: Multi-Stage Builds
Separate build-time concerns from runtime to produce minimal, secure production images. A well-structured multi-stage build can reduce image size by 78% or more.
Incorrect:
# Single-stage: build tools and dev deps ship to production
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/main.js"]
# Result: ~850 MB image with dev dependencies, source files, build tools
Correct:
# Stage 1: Install production dependencies only
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force
# Stage 2: Build with dev dependencies
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build && npm run test
# Stage 3: Minimal production runtime
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]
# Result: ~180 MB image with only production runtime
Key rules:
- Use separate stages for dependency installation, building, and runtime
- Copy only production `node_modules` and compiled artifacts into the final stage
- Use `-alpine` base images to minimize base layer size
- Run `npm ci` (not `npm install`) for reproducible, lockfile-exact installs
- Clean caches (`npm cache clean --force`) in the same layer as install to avoid bloating layers
- Always include a `HEALTHCHECK` in the production stage for orchestrator integration
- Run tests in the builder stage so test failures prevent image creation
Reference: references/docker-patterns.md (lines 7-50)
Configure Railway PaaS deployment with correct Nixpacks, environment, and railway.json settings — HIGH
Railway Deployment Patterns
railway.json Configuration
{
"$schema": "https://railway.com/railway.schema.json",
"build": {
"builder": "NIXPACKS",
"buildCommand": "npm ci && npm run build"
},
"deploy": {
"startCommand": "npm start",
"healthcheckPath": "/health",
"healthcheckTimeout": 30,
"restartPolicyType": "ON_FAILURE",
"restartPolicyMaxRetries": 3
}
}
Nixpacks vs Dockerfile
| Factor | Nixpacks (default) | Dockerfile |
|---|---|---|
| Setup | Zero config, auto-detect | Manual, full control |
| Build time | Fast (Nix cache) | Depends on layers |
| Customization | nixpacks.toml | Unlimited |
| Use when | Standard apps | Custom runtimes, multi-stage |
Environment Variables
- Use Railway's shared variables for cross-service config (DATABASE_URL, REDIS_URL)
- Service-specific variables override shared ones
- Reference other vars: `${{shared.DATABASE_URL}}`
- Never hardcode secrets — use Railway's encrypted env vars
Database Provisioning
Railway provisions managed databases with one click:
- PostgreSQL, MySQL, Redis, MongoDB
- Connection string auto-injected as env var
- Backups included on paid plans
Multi-Service Setup
- Use monorepo config: set `rootDirectory` per service
- Internal networking: services communicate via `${{service.RAILWAY_PRIVATE_DOMAIN}}:port`
- Shared env groups for common config
Railway CLI
railway login # Authenticate
railway link # Connect to project
railway up # Deploy from local
railway logs # View deployment logs
railway variables # List env vars
railway shell       # Open shell in service
Anti-Patterns
Incorrect:
- Running `railway up` from CI without `railway link` — deploys to wrong project
- Using Dockerfile when Nixpacks handles the stack — unnecessary complexity
- Storing secrets in railway.json — use env vars
- Skipping healthcheck config — Railway can't detect failed deploys
Correct:
- Configure healthcheckPath for all web services
- Use shared variables for cross-service config
- Set restart policy for resilience
- Use Nixpacks unless you need custom runtime
References
- references/railway-json-config.md — Full railway.json schema and examples
- references/nixpacks-customization.md — Custom build configs, environment detection
- references/multi-service-setup.md — Monorepo deploy, service networking
References (9)
CI/CD Pipelines
Comprehensive CI/CD patterns for GitHub Actions, caching, matrix testing, and artifact management.
Branch Strategy
Recommended: Git Flow with Feature Branches
main (production) ─────●────────●──────>
┃ ┃
dev (staging) ─────●───●────●───●──────>
┃ ┃
feature/* ─────────●────────┘
▲
       └─ PR required, CI checks, code review
Branch protection rules:
- `main`: Require PR + 2 approvals + all checks pass
- `dev`: Require PR + 1 approval + all checks pass
- Feature branches: No direct commits to main/dev
GitHub Actions Caching Strategy
- name: Cache Dependencies
uses: actions/cache@v3
with:
path: |
~/.npm
node_modules
backend/.venv
key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
restore-keys: |
    ${{ runner.os }}-deps-
Cache hit ratio impact:
- Without cache: 2-3 min install time
- With cache: 10-20 sec install time
- ~85% time savings on typical workflows
Artifact Management
# Build and upload artifact
- name: Build Application
run: npm run build
- name: Upload Build Artifact
uses: actions/upload-artifact@v3
with:
name: build-${{ github.sha }}
path: dist/
retention-days: 7
# Download in deployment job
- name: Download Build Artifact
uses: actions/download-artifact@v3
with:
name: build-${{ github.sha }}
    path: dist/
Benefits:
- Avoid rebuilding in deployment job
- Deploy exact tested artifact (byte-for-byte match)
- Retention policies prevent storage bloat
Matrix Testing
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        node-version: [18, 20, 22]
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm test
Complete Backend CI/CD Example
name: Backend CI/CD
on:
push:
branches: [main, dev]
paths: ['backend/**']
pull_request:
branches: [main, dev]
paths: ['backend/**']
jobs:
lint-and-test:
runs-on: ubuntu-latest
defaults:
run:
working-directory: backend
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Cache Poetry Dependencies
uses: actions/cache@v3
with:
path: ~/.cache/pypoetry
key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}
- name: Install Poetry
run: pip install poetry
- name: Install Dependencies
run: poetry install
- name: Run Ruff Format Check
run: poetry run ruff format --check app/
- name: Run Ruff Lint
run: poetry run ruff check app/
- name: Run Type Check
run: poetry run mypy app/ --ignore-missing-imports
- name: Run Tests
run: poetry run pytest tests/ --cov=app --cov-report=xml
- name: Upload Coverage
uses: codecov/codecov-action@v3
with:
file: ./backend/coverage.xml
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run Trivy Scan
uses: aquasecurity/trivy-action@master
with:
scan-type: 'fs'
scan-ref: 'backend/'
    severity: 'CRITICAL,HIGH'
Key features:
- Path filtering (only run on backend changes)
- Poetry dependency caching
- Comprehensive quality checks (format, lint, type, test)
- Security scanning with Trivy
Pipeline Stages
- Lint & Type Check - Code quality gates
- Unit Tests - Test coverage with reporting
- Security Scan - npm audit + Trivy vulnerability scanner
- Build & Push - Docker image to container registry
- Deploy Staging - Environment-gated deployment
- Deploy Production - Manual approval or automated
Best Practices
- Fast feedback - tests complete in < 5 min
- Fail fast - stop on first failure
- Cache dependencies - npm/pip cache
- Matrix testing - multiple Node/Python versions
- Secrets management - use GitHub Secrets
- Branch protection - require passing tests
See scripts/github-actions-pipeline.yml for complete examples.
Deployment Strategies
Blue-green, canary, and rolling deployment patterns.
Rolling Deployment (Default)
Update pods gradually:
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
      maxSurge: 1
- Pros: No downtime, gradual rollout
- Cons: Mixed versions running simultaneously
Blue-Green Deployment
Two identical environments, switch traffic:
# Deploy to green (inactive)
kubectl apply -f green-deployment.yaml
# Test green
curl https://green.example.com/health
# Switch traffic (update service selector)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'
# Rollback if needed
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'
- Pros: Instant rollback, no mixed versions
- Cons: 2x resources, database migrations tricky
Canary Deployment
Gradually shift traffic:
# 90% to stable
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-stable
spec:
replicas: 9
# 10% to canary
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-canary
spec:
  replicas: 1
- Pros: Limit blast radius, test with real traffic
- Cons: Complex traffic management
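Weighted traffic shifting usually comes from a service mesh rather than replica counts; a hedged Istio sketch of the same 90/10 split (host and subset names are assumptions, and the subsets must be defined in a matching DestinationRule):
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts:
    - app.example.com
  http:
    - route:
        - destination:
            host: app            # Kubernetes Service name
            subset: stable       # defined in a DestinationRule
          weight: 90
        - destination:
            host: app
            subset: canary
          weight: 10
```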
See scripts/argocd-application.yaml for GitOps patterns.
Docker Patterns
Best practices for Dockerfile optimization, multi-stage builds, and container security.
Multi-Stage Build Example
# ============================================================
# Stage 1: Dependencies (builder)
# ============================================================
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force
# ============================================================
# Stage 2: Build (with dev dependencies)
# ============================================================
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci # Include dev dependencies
COPY . .
RUN npm run build && npm run test
# ============================================================
# Stage 3: Production runtime (minimal)
# ============================================================
FROM node:20-alpine AS runner
WORKDIR /app
# Security: Non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
# Copy only production dependencies and built artifacts
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]Image size comparison:
- Single-stage: 850 MB (includes dev dependencies, source files)
- Multi-stage: 180 MB (only runtime + production deps)
- 78% reduction
Layer Caching Optimization
Order matters for cache efficiency:
# BAD: Invalidates cache on any code change
COPY . .
RUN npm install
# GOOD: Cache package.json layer separately
COPY package*.json ./
RUN npm ci # Cached unless package.json changes
COPY . .            # Source changes don't invalidate npm install
Security Hardening
Non-root user (uid 1001):
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
USER nodejs
Read-only filesystem where possible:
# In K8s deployment
securityContext:
  readOnlyRootFilesystem: true
Health checks for orchestrator integration:
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
Security Scanning with Trivy
- name: Build Docker Image
run: docker build -t myapp:${{ github.sha }} .
- name: Scan for Vulnerabilities
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:${{ github.sha }}'
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
- name: Upload Scan Results
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
- name: Fail on Critical Vulnerabilities
run: |
    trivy image --severity CRITICAL --exit-code 1 myapp:${{ github.sha }}
Docker Compose Development Setup
version: '3.8'
services:
postgres:
image: pgvector/pgvector:pg16
environment:
POSTGRES_USER: orchestkit
POSTGRES_PASSWORD: dev_password
POSTGRES_DB: orchestkit_dev
ports:
- "5437:5432" # Avoid conflict with host postgres
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U orchestkit"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
backend:
build:
context: ./backend
dockerfile: Dockerfile.dev
ports:
- "8500:8500"
environment:
DATABASE_URL: postgresql://orchestkit:dev_password@postgres:5432/orchestkit_dev
REDIS_URL: redis://redis:6379
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_started
volumes:
- ./backend:/app # Hot reload
frontend:
build:
context: ./frontend
dockerfile: Dockerfile.dev
ports:
- "5173:5173"
environment:
VITE_API_URL: http://localhost:8500
volumes:
- ./frontend:/app
- /app/node_modules # Avoid overwriting node_modules
volumes:
pgdata:
  redisdata:
Key patterns:
- Port mapping to avoid host conflicts (5437:5432)
- Health checks before dependent services start
- Volume mounts for hot reload during development
- Named volumes for data persistence
See scripts/Dockerfile and scripts/docker-compose.yml for complete examples.
Environment Management
Secrets management, configuration, and environment variable patterns.
External Secrets Operator
Sync secrets from cloud providers to Kubernetes:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: app-secrets
creationPolicy: Owner
data:
- secretKey: database-url
remoteRef:
key: prod/app/database
property: url
- secretKey: api-key
remoteRef:
key: prod/app/api-keys
        property: main
Supported backends:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault
- GCP Secret Manager
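The ExternalSecret above points at a ClusterSecretStore named aws-secrets-manager that is not shown here; a hedged sketch of what it might look like for AWS Secrets Manager (region and service-account details are assumptions):
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa        # assumed IRSA-annotated service account
            namespace: external-secrets
```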
GitOps with ArgoCD
ArgoCD watches Git repository and syncs cluster state:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/org/repo
targetRevision: HEAD
path: k8s/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
retry:
limit: 5
backoff:
duration: 5s
        maxDuration: 3m
Features:
- Automated sync with pruning
- Self-healing (drift detection)
- Retry policies for transient failures
Infrastructure as Code (Terraform)
Remote state in S3 with DynamoDB locking:
terraform {
required_version = ">= 1.5"
backend "s3" {
bucket = "terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
}
}
Module-based architecture:
module "vpc" {
source = "./modules/vpc"
cidr = "10.0.0.0/16"
}
module "eks" {
source = "./modules/eks"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
}
module "rds" {
source = "./modules/rds"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.database_subnets
}
Environment-specific tfvars:
terraform plan -var-file=environments/production.tfvars
Database Migration Strategies
Zero-Downtime Migration Pattern
Problem: Adding a NOT NULL column breaks old application versions
Solution: 3-phase migration
Phase 1: Add nullable column
-- Migration v1 (deploy with old code still running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
Phase 2: Deploy new code + backfill
# New code writes to both old and new schema
def create_user(name: str, email: str):
db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (name, email))
# Backfill existing rows
async def backfill_emails():
users_without_email = await db.fetch("SELECT id FROM users WHERE email IS NULL")
for user in users_without_email:
email = generate_email(user.id)
        await db.execute("UPDATE users SET email = %s WHERE id = %s", (email, user.id))
Phase 3: Add constraint
-- Migration v2 (after backfill complete)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Backward/Forward Compatibility
Backward compatible changes (safe):
- Add nullable column
- Add table
- Add index
- Rename column (with view alias)
Backward incompatible changes (requires 3-phase):
- Remove column
- Rename column (no alias)
- Add NOT NULL column
- Change column type
Alembic Migration Pattern
# backend/alembic/versions/2024_12_15_add_langfuse_trace_id.py
"""Add Langfuse trace_id to analyses"""
from alembic import op
import sqlalchemy as sa
def upgrade():
# Add nullable column first (backward compatible)
op.add_column('analyses',
sa.Column('langfuse_trace_id', sa.String(255), nullable=True)
)
# Index for lookup performance
op.create_index('idx_analyses_langfuse_trace',
'analyses', ['langfuse_trace_id']
)
def downgrade():
op.drop_index('idx_analyses_langfuse_trace')
    op.drop_column('analyses', 'langfuse_trace_id')
Migration workflow:
# Create new migration
poetry run alembic revision --autogenerate -m "Add langfuse trace ID"
# Review generated migration (ALWAYS review!)
cat alembic/versions/abc123_add_langfuse_trace_id.py
# Apply migration
poetry run alembic upgrade head
# Rollback if needed
poetry run alembic downgrade -1
Rollback Procedures
# Helm rollback to previous revision
helm rollback myapp 3
# Kubernetes rollback
kubectl rollout undo deployment/myapp
# Database migration rollback (Alembic)
alembic downgrade -1
Critical: Test rollback procedures regularly!
See scripts/external-secrets.yaml and scripts/argocd-application.yaml for complete examples.
Kubernetes Basics
K8s deployments, services, health probes, and production patterns.
Health Probes
Three probe types with distinct purposes:
spec:
containers:
- name: app
# Startup probe (gives slow-starting apps time to boot)
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 30 # 30 * 5s = 150s max startup time
# Liveness probe (restarts pod if failing)
livenessProbe:
httpGet:
path: /health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 3 # 3 failures = restart
# Readiness probe (removes from service if failing)
readinessProbe:
httpGet:
path: /health/readiness
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
        failureThreshold: 2    # 2 failures = remove from load balancer
Probe implementation (FastAPI):
@app.get("/health/startup")
async def startup_check():
# Check DB connection established
if not db.is_connected():
raise HTTPException(status_code=503, detail="DB not ready")
return {"status": "ok"}
@app.get("/health/liveness")
async def liveness_check():
# Basic "is process running" check
return {"status": "alive"}
@app.get("/health/readiness")
async def readiness_check():
# Check all dependencies healthy
if not redis.ping() or not db.health_check():
raise HTTPException(status_code=503, detail="Dependencies unhealthy")
return {"status": "ready"}Security Context
spec:
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
        - ALL
Resource Management
Always set requests and limits:
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"- Use
requestsfor scheduling - Use
limitsfor throttling
PodDisruptionBudget
Prevents too many pods from being evicted during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: app-pdb
spec:
minAvailable: 2 # Always keep at least 2 pods running
selector:
matchLabels:
      app: myapp
Use cases:
- Cluster upgrades (node drains)
- Autoscaler downscaling
- Manual evictions
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: production
spec:
hard:
requests.cpu: "10" # Total CPU requests
requests.memory: 20Gi # Total memory requests
limits.cpu: "20" # Total CPU limits
limits.memory: 40Gi # Total memory limits
pods: "50" # Max podsStatefulSets for Databases
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: postgres
replicas: 3
selector:
matchLabels:
app: postgres
template:
# Pod spec here
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
          storage: 100Gi
Key differences from Deployment:
- Stable pod names (`postgres-0`, `postgres-1`, `postgres-2`)
- Ordered deployment and scaling
- Persistent storage per pod
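The serviceName field above must reference a headless Service, which this reference does not show; a minimal sketch, assuming labels that match the StatefulSet pods:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None        # headless: gives each pod a stable DNS name
  selector:
    app: postgres
  ports:
    - port: 5432
      name: postgres
```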
Helm Chart Structure
charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── configmap.yaml
│ ├── secret.yaml
│ ├── hpa.yaml
│ └── _helpers.tpl
└── values/
├── staging.yaml
    └── production.yaml
Essential Manifests
- Deployment with rolling update strategy
- Service for internal routing
- Ingress for external access with TLS
- HorizontalPodAutoscaler for scaling
See scripts/k8s-manifests.yaml and scripts/helm-values.yaml for complete examples.
Multi-Service Setup on Railway
Deploy multiple services in one Railway project for monorepos, microservices, or web + worker architectures.
Monorepo Configuration
Each service in a Railway project can point to a different root directory:
my-monorepo/
├── apps/
│ ├── api/ ← Service 1 (root: apps/api)
│ │ ├── package.json
│ │ └── railway.json
│ ├── web/ ← Service 2 (root: apps/web)
│ │ ├── package.json
│ │ └── railway.json
│ └── worker/ ← Service 3 (root: apps/worker)
│ ├── package.json
│ └── railway.json
├── packages/ ← Shared packages
└── package.json        ← Root workspace
Set each service's root directory in the Railway dashboard under Settings > Source.
Private Networking
Services within the same project communicate over Railway's private network:
# From the web service, call the API service:
http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}/endpoint
# In environment variables (set on web service):
API_URL=http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}
Key points:
- Private networking uses internal DNS, no public internet
- Zero egress costs between services
- Always use the `PORT` variable — not hardcoded ports
- Services must listen on `0.0.0.0` (not `localhost`)
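As a concrete illustration of the last two points, a minimal sketch for a Python service (FastAPI/uvicorn is assumed since this skill uses it elsewhere):
```python
import os
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    # Bind to all interfaces and the Railway-injected PORT, never a hardcoded port
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8000)))
```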
Common Architectures
Web + API + Worker
| Service | Role | Public? |
|---|---|---|
web | Frontend (Next.js, Vite) | Yes |
api | Backend API | Yes (or private if only web calls it) |
worker | Background jobs (BullMQ, Celery) | No |
postgres | Database | No (private only) |
redis | Cache / queue broker | No (private only) |
Shared Environment Variables
Use Railway's shared variables (project-level) for values needed by all services:
NODE_ENV=production
LOG_LEVEL=info
Use reference variables for cross-service connections:
DATABASE_URL=${{Postgres.DATABASE_URL}}
REDIS_URL=${{Redis.REDIS_URL}}
API_URL=http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}
Deploy Order
Railway deploys services in parallel by default. If you need ordering (e.g., run migrations before starting web):
- Put migrations in the API service's `startCommand`
- Use healthchecks — dependent services will retry connections until the API is healthy
- For strict ordering, use separate deploy triggers via Railway CLI
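For the healthcheck-driven option, a dependent service can simply poll the API before doing real work; a hedged Python sketch (the health URL and timings are assumptions):
```python
import os
import time
import urllib.error
import urllib.request

def wait_for_api(url: str, timeout: float = 120.0, interval: float = 3.0) -> None:
    """Block until the upstream healthcheck returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # not up yet, retry
        time.sleep(interval)
    raise RuntimeError(f"API at {url} not healthy after {timeout}s")

if __name__ == "__main__":
    # Assumed private-network URL injected via a Railway reference variable
    wait_for_api(os.environ.get("API_HEALTH_URL", "http://api:8000/health"))
```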
Nixpacks Customization
Railway uses Nixpacks to auto-detect your stack and generate a build plan. Customize when auto-detection falls short.
Auto-Detection
Nixpacks detects your language by looking for:
| Language | Detection File |
|---|---|
| Node.js | package.json |
| Python | requirements.txt, pyproject.toml, Pipfile |
| Go | go.mod |
| Rust | Cargo.toml |
| Ruby | Gemfile |
| Java | pom.xml, build.gradle |
| PHP | composer.json |
nixpacks.toml
Place at project root (or set nixpacksConfigPath in railway.json for monorepos).
Adding System Packages
[phases.setup]
nixPkgs = ["...", "ffmpeg", "imagemagick", "poppler_utils"]
aptPkgs = ["libvips-dev"]Custom Build Phases
[phases.install]
cmds = ["npm ci --production=false"]
[phases.build]
cmds = [
"npx prisma generate",
"npm run build"
]
dependsOn = ["install"]
[start]
cmd = "npm run start:prod"Environment Variables in Build
[variables]
NODE_ENV = "production"
NEXT_TELEMETRY_DISABLED = "1"Monorepo Root Path
For monorepos, set the root directory per service in the Railway dashboard or via railway.json:
{
"build": {
"builder": "NIXPACKS",
"nixpacksConfigPath": "apps/api/nixpacks.toml"
}
}
Each service points to its own directory and nixpacks.toml.
When to Switch to Dockerfile
Use Dockerfile instead of Nixpacks when:
- Multi-stage builds are needed to reduce image size
- Build requires conditional logic (e.g., `ARG`-based feature flags)
- Precise control over base image (e.g., distroless, Alpine variants)
- Nixpacks doesn't support a required system dependency
Set in railway.json:
{
"build": {
"builder": "DOCKERFILE",
"dockerfilePath": "Dockerfile.production"
}
}
Observability & Monitoring
Prometheus metrics, Grafana dashboards, and alerting patterns.
Prometheus Metrics Exposition
from prometheus_client import Counter, Histogram, generate_latest
# Define metrics
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint']
)
@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
duration = time.time() - start_time
# Record metrics
http_requests_total.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
http_request_duration_seconds.labels(
method=request.method,
endpoint=request.url.path
).observe(duration)
return response
@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type="text/plain")
Grafana Dashboard Queries
# Request rate (requests per second)
rate(http_requests_total[5m])
# Error rate (4xx/5xx as percentage)
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)
Alerting Rules
groups:
- name: app-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High p95 latency detected"
description: "p95 latency is {{ $value }}s"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
severity: critical
annotations:
summary: "Pod is crash looping"
description: "{{ $labels.pod }} has restarted {{ $value }} times"Key Metrics to Monitor
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Request rate | Traffic volume | Anomaly detection |
| Error rate | Service health | > 5% (critical) |
| p95 latency | User experience | > 2s (warning) |
| CPU usage | Resource utilization | > 80% sustained |
| Memory usage | Resource utilization | > 85% sustained |
| Pod restarts | Stability | > 3 in 1 hour |
Golden Signals (SRE)
- Latency - Time to serve a request
- Traffic - Requests per second
- Errors - Rate of failed requests
- Saturation - Resource utilization
Log Aggregation
Structured logging for observability:
import structlog
logger = structlog.get_logger()
@app.middleware("http")
async def logging_middleware(request: Request, call_next):
request_id = str(uuid.uuid4())
with structlog.contextvars.bound_contextvars(
request_id=request_id,
method=request.method,
path=request.url.path,
):
logger.info("request_started")
response = await call_next(request)
logger.info("request_completed", status=response.status_code)
response.headers["X-Request-ID"] = request_id
        return response
Distributed Tracing
OpenTelemetry integration:
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)
# Manual spans for business logic
tracer = trace.get_tracer(__name__)
async def process_order(order_id: str):
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order_id", order_id)
        # Processing logic here
railway.json Configuration
Complete reference for railway.json schema — the primary way to configure build and deploy behavior on Railway.
Full Schema
{
"$schema": "https://railway.com/railway.schema.json",
"build": {
"builder": "NIXPACKS",
"buildCommand": "npm ci && npm run build",
"watchPatterns": ["src/**", "package.json"],
"nixpacksConfigPath": "nixpacks.toml",
"dockerfilePath": "Dockerfile"
},
"deploy": {
"startCommand": "node dist/server.js",
"healthcheckPath": "/health",
"healthcheckTimeout": 30,
"restartPolicyType": "ON_FAILURE",
"restartPolicyMaxRetries": 3,
"numReplicas": 1,
"sleepApplication": false,
"region": "us-west1",
"cronSchedule": "0 */6 * * *"
}
}
Builder Options
| Builder | When to Use |
|---|---|
NIXPACKS | Default — auto-detects language and builds (Node, Python, Go, Rust, etc.) |
DOCKERFILE | Complex builds, multi-stage images, custom system deps |
PAKETO | Cloud Native Buildpacks alternative |
Deploy Settings
| Field | Default | Description |
|---|---|---|
startCommand | Auto-detected | Overrides default start command |
healthcheckPath | None | HTTP path to check for 200 response |
healthcheckTimeout | 30 | Seconds before healthcheck is considered failed |
restartPolicyType | ON_FAILURE | ON_FAILURE, ALWAYS, or NEVER |
restartPolicyMaxRetries | 3 | Max restart attempts before marking deploy failed |
numReplicas | 1 | Number of instances (horizontal scaling) |
sleepApplication | false | Sleep service when no traffic (saves credits) |
cronSchedule | None | Cron expression for scheduled services |
Examples
Node.js API with migrations
{
"$schema": "https://railway.com/railway.schema.json",
"build": {
"builder": "NIXPACKS",
"buildCommand": "npm ci && npx prisma generate && npm run build"
},
"deploy": {
"startCommand": "npx prisma migrate deploy && node dist/server.js",
"healthcheckPath": "/api/health",
"healthcheckTimeout": 60
}
}
Python FastAPI
{
"$schema": "https://railway.com/railway.schema.json",
"build": {
"builder": "NIXPACKS",
"buildCommand": "pip install -r requirements.txt"
},
"deploy": {
"startCommand": "uvicorn main:app --host 0.0.0.0 --port $PORT",
"healthcheckPath": "/health"
}
}
Cron Worker (no web traffic)
{
"$schema": "https://railway.com/railway.schema.json",
"deploy": {
"startCommand": "node dist/worker.js",
"cronSchedule": "*/15 * * * *",
"restartPolicyType": "NEVER"
}
}
Checklists (1)
Production Readiness Checklist
🔒 Security
- Secrets in environment variables or vault (not in code/config)
- HTTPS enforced (redirect HTTP → HTTPS)
- Security headers configured (HSTS, CSP, X-Frame-Options)
- CORS restricted to known origins
- Rate limiting enabled
- SQL injection prevention (parameterized queries)
- XSS prevention (output encoding, CSP)
- Dependencies scanned for vulnerabilities
- Container images scanned (Trivy/Snyk)
🧪 Testing
- Unit tests passing (>80% coverage)
- Integration tests passing
- E2E tests for critical paths
- Load testing completed (k6/Locust)
- Security testing (OWASP ZAP)
- Smoke tests for deployment verification
📊 Observability
- Structured logging (JSON format)
- Log aggregation configured (ELK/Loki)
- Metrics exported (Prometheus format)
- Dashboards created (Grafana)
- Distributed tracing enabled (Jaeger/Tempo)
- Error tracking configured (Sentry)
- Uptime monitoring (synthetic checks)
- Alerting rules defined
🚀 Deployment
- CI/CD pipeline configured
- Automated tests run on every PR
- Blue-green or canary deployment ready
- Rollback procedure documented and tested
- Database migrations tested
- Feature flags for risky changes
- Deployment notifications (Slack/Teams)
💾 Data
- Database backups automated (daily)
- Backup restoration tested
- Point-in-time recovery enabled
- Data retention policies defined
- PII handling documented (GDPR/CCPA)
- Encryption at rest enabled
- Encryption in transit (TLS)
🏗️ Infrastructure
- Infrastructure as Code (Terraform/Pulumi)
- Auto-scaling configured
- Health checks defined
- Resource limits set (CPU/memory)
- Multi-AZ deployment
- CDN for static assets
- DDoS protection enabled
📝 Documentation
- API documentation (OpenAPI)
- Architecture diagram
- Runbook for common issues
- Incident response playbook
- On-call rotation defined
- SLA/SLO documented
🔄 Reliability
- Graceful shutdown handling
- Connection pooling configured
- Circuit breakers for external calls
- Retry with exponential backoff
- Timeouts set on all external calls
- Chaos engineering tests (optional)
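Graceful shutdown is worth a sketch because it interacts with the readiness probes and rolling updates covered earlier; a minimal asyncio example (the drain step is a placeholder assumption):
```python
import asyncio
import signal

async def main() -> None:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Kubernetes sends SIGTERM on pod shutdown; translate it into an event
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)

    # ... start your server / workers here ...
    await stop.wait()

    # Drain: stop accepting new work, let in-flight requests finish,
    # then close connection pools before the process exits.
    await asyncio.sleep(5)  # placeholder for real drain logic

if __name__ == "__main__":
    asyncio.run(main())
```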
Pre-Launch Final Checks
# Security scan
npm audit --audit-level=high
pip-audit
# Performance baseline
k6 run load-test.js
# DNS and SSL
curl -I https://api.example.com
openssl s_client -connect api.example.com:443
# Health endpoint
curl https://api.example.com/health
# Logs flowing
docker logs -f backend
# Metrics exposed
curl http://localhost:9090/metrics
Examples (1)
GitHub Actions CI/CD Example
Complete pipeline for a Python FastAPI + React application.
Repository Structure
├── .github/workflows/
│ ├── ci.yml # Test on every PR
│ ├── deploy.yml # Deploy on main merge
│ └── security.yml # Weekly security scan
├── backend/ # FastAPI
├── frontend/ # React
└── docker-compose.yml
CI Pipeline (ci.yml)
name: CI
on:
pull_request:
branches: [main, dev]
push:
branches: [main, dev]
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
backend-test:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15
env:
POSTGRES_PASSWORD: test
ports: ["5432:5432"]
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install Poetry
uses: snok/install-poetry@v1
with:
virtualenvs-create: true
virtualenvs-in-project: true
- name: Cache dependencies
uses: actions/cache@v4
with:
path: backend/.venv
key: venv-${{ runner.os }}-${{ hashFiles('backend/poetry.lock') }}
- name: Install dependencies
working-directory: backend
run: poetry install --no-interaction
- name: Lint
working-directory: backend
run: |
poetry run ruff format --check app/
poetry run ruff check app/
- name: Type check
working-directory: backend
run: poetry run mypy app/ --ignore-missing-imports
- name: Test
working-directory: backend
env:
DATABASE_URL: postgresql://postgres:test@localhost:5432/test
run: poetry run pytest --cov=app --cov-report=xml -v
- name: Upload coverage
uses: codecov/codecov-action@v4
with:
files: backend/coverage.xml
frontend-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: "20"
cache: "npm"
cache-dependency-path: frontend/package-lock.json
- name: Install
working-directory: frontend
run: npm ci
- name: Lint
working-directory: frontend
run: npm run lint
- name: Type check
working-directory: frontend
run: npm run typecheck
- name: Test
working-directory: frontend
run: npm run test:coverage
- name: Build
working-directory: frontend
        run: npm run build
Deploy Pipeline (deploy.yml)
name: Deploy
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
environment: production # Requires approval
steps:
- uses: actions/checkout@v4
- name: Configure AWS
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- name: Login to ECR
uses: aws-actions/amazon-ecr-login@v2
- name: Build & push backend
run: |
docker build -t $ECR_REGISTRY/backend:${{ github.sha }} ./backend
docker push $ECR_REGISTRY/backend:${{ github.sha }}
- name: Deploy to ECS
run: |
aws ecs update-service \
--cluster production \
--service backend \
--force-new-deployment
- name: Deploy frontend to S3
run: |
cd frontend && npm ci && npm run build
aws s3 sync dist/ s3://$S3_BUCKET --delete
          aws cloudfront create-invalidation --distribution-id $CF_DIST --paths "/*"
Security Scan (security.yml)
name: Security
on:
schedule:
- cron: "0 0 * * 0" # Weekly Sunday midnight
workflow_dispatch:
jobs:
scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Python dependencies
run: |
pip install pip-audit
pip-audit -r backend/requirements.txt
- name: Node dependencies
working-directory: frontend
run: npm audit --audit-level=high
- name: Docker scan
uses: aquasecurity/trivy-action@master
with:
image-ref: backend:latest
          severity: HIGH,CRITICAL
Key Patterns
| Pattern | Implementation |
|---|---|
| Caching | Poetry venv, npm cache |
| Concurrency | Cancel in-progress on new push |
| Services | Postgres container for tests |
| Environments | Production requires approval |
| Artifacts | Coverage reports to Codecov |