OrchestKit v6.7.1 — 67 skills, 38 agents, 77 hooks with Opus 4.6 support

DevOps & Deployment

Use when setting up CI/CD pipelines, containerizing applications, deploying to Kubernetes, or writing infrastructure as code. DevOps & Deployment covers GitHub Actions, Docker, Helm, and Terraform patterns.


Primary Agent: data-pipeline-engineer

DevOps & Deployment Skill

Comprehensive frameworks for CI/CD pipelines, containerization, deployment strategies, and infrastructure automation.

Overview

Use this skill when:

  • Setting up CI/CD pipelines
  • Containerizing applications
  • Deploying to Kubernetes or cloud platforms
  • Implementing GitOps workflows
  • Managing infrastructure as code
  • Planning release strategies

Pipeline Architecture

┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│    Code     │──>│    Build    │──>│    Test     │──>│   Deploy    │
│   Commit    │   │   & Lint    │   │   & Scan    │   │  & Release  │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
       │                 │                 │                 │
       v                 v                 v                 v
   Triggers         Artifacts          Reports          Monitoring

Key Concepts

CI/CD Pipeline Stages

  1. Lint & Type Check - Code quality gates
  2. Unit Tests - Test coverage with reporting
  3. Security Scan - npm audit + Trivy vulnerability scanner
  4. Build & Push - Docker image to container registry
  5. Deploy Staging - Environment-gated deployment
  6. Deploy Production - Manual approval or automated

Container Best Practices

Multi-stage builds minimize image size:

  • Stage 1: Install production dependencies only
  • Stage 2: Build application with dev dependencies
  • Stage 3: Production runtime with minimal footprint

Security hardening:

  • Non-root user (uid 1001)
  • Read-only filesystem where possible
  • Health checks for orchestrator integration

Kubernetes Deployment

Essential manifests:

  • Deployment with rolling update strategy
  • Service for internal routing
  • Ingress for external access with TLS
  • HorizontalPodAutoscaler for scaling
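Of these manifests, the HorizontalPodAutoscaler is the one not shown elsewhere in this skill. A minimal sketch using the `autoscaling/v2` API (the name `app` and the 70% CPU target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2          # keep headroom for rolling updates
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```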

Security context:

  • runAsNonRoot: true
  • allowPrivilegeEscalation: false
  • readOnlyRootFilesystem: true
  • Drop all capabilities
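These four settings map onto a pod spec roughly as follows (a sketch; the container name is illustrative):

```yaml
# Pod-level settings
securityContext:
  runAsNonRoot: true
  runAsUser: 1001
containers:
  - name: app
    # Container-level settings
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```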

Deployment Strategies

Strategy     Use Case                        Risk
Rolling      Default, gradual replacement    Low - automatic rollback
Blue-Green   Instant switch, easy rollback   Medium - double resources
Canary       Progressive traffic shift       Low - gradual exposure

Rolling Update (Kubernetes default):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0  # Zero downtime

Secrets Management

Use External Secrets Operator to sync from cloud providers:

  • AWS Secrets Manager
  • HashiCorp Vault
  • Azure Key Vault
  • GCP Secret Manager

References

Docker Patterns

See: references/docker-patterns.md

Key topics covered:

  • Multi-stage build examples with 78% size reduction
  • Layer caching optimization
  • Security hardening (non-root, health checks)
  • Trivy vulnerability scanning
  • Docker Compose development setup

CI/CD Pipelines

See: references/ci-cd-pipelines.md

Key topics covered:

  • Branch strategy (Git Flow)
  • GitHub Actions caching (85% time savings)
  • Artifact management
  • Matrix testing
  • Complete backend CI/CD example

Kubernetes Basics

See: references/kubernetes-basics.md

Key topics covered:

  • Health probes (startup, liveness, readiness)
  • Security context configuration
  • PodDisruptionBudget
  • Resource quotas
  • StatefulSets for databases
  • Helm chart structure

Environment Management

See: references/environment-management.md

Key topics covered:

  • External Secrets Operator
  • GitOps with ArgoCD
  • Terraform patterns (remote state, modules)
  • Zero-downtime database migrations
  • Alembic migration workflow
  • Rollback procedures

Observability

See: references/observability.md

Key topics covered:

  • Prometheus metrics exposition
  • Grafana dashboard queries (PromQL)
  • Alerting rules for SLOs
  • Golden signals (SRE)
  • Structured logging
  • Distributed tracing (OpenTelemetry)
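The alerting-rules topic can be illustrated with a minimal Prometheus rule for an error-rate SLO (a sketch; the metric name `http_requests_total` and the 1% threshold are assumptions, not taken from the referenced file):

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m                # sustain for 5 minutes before paging
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```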

Railway Deployment

See: rules/railway-deployment.md

Key topics covered:

  • railway.json configuration, Nixpacks builds
  • Environment variable management, database provisioning
  • Multi-service setups, Railway CLI workflows
  • References: references/railway-json-config.md, references/nixpacks-customization.md, references/multi-service-setup.md

Deployment Strategies

See: references/deployment-strategies.md

Key topics covered:

  • Rolling deployment
  • Blue-green deployment
  • Canary releases
  • Traffic splitting with Istio

Deployment Checklist

Pre-Deployment

  • All tests passing in CI
  • Security scans clean
  • Database migrations ready
  • Rollback plan documented

During Deployment

  • Monitor deployment progress
  • Watch error rates
  • Verify health checks passing
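The steps above can be sketched as kubectl commands, assuming a Deployment named `app` in namespace `production` (names are illustrative):

```shell
# Monitor deployment progress; exits non-zero if the rollout stalls
kubectl rollout status deployment/app -n production --timeout=5m

# Verify health checks are passing (pods should be Ready)
kubectl get pods -n production -l app=app

# Watch error rates in recent logs
kubectl logs -n production -l app=app --since=5m | grep -i error

# Roll back immediately if the deployment degrades
kubectl rollout undo deployment/app -n production
```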

Post-Deployment

  • Verify metrics normal
  • Check logs for errors
  • Update status page

Helm Chart Structure

charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   └── _helpers.tpl
└── values/
    ├── staging.yaml
    └── production.yaml

Related Skills

  • zero-downtime-migration - Database migration patterns for zero-downtime deployments
  • security-scanning - Security scanning integration for CI/CD pipelines
  • ork:monitoring-observability - Monitoring and alerting for deployed applications
  • ork:database-patterns - Python/Alembic migration workflow for backend deployments

Key Decisions

Decision              Choice                                Rationale
Container user        Non-root (uid 1001)                   Security best practice, required by many orchestrators
Deployment strategy   Rolling update (default)              Zero downtime, automatic rollback, resource efficient
Secrets management    External Secrets Operator             Syncs from cloud providers, GitOps compatible
Health checks         Separate startup/liveness/readiness   Prevents premature traffic, enables graceful shutdown
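The health-check decision assumes three distinct probes; a sketch of how they divide responsibilities (paths, port, and thresholds are illustrative):

```yaml
startupProbe:             # tolerate slow boots without killing the pod
  httpGet: { path: /health, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5        # up to 150s before liveness takes over
livenessProbe:            # restart the container if it deadlocks
  httpGet: { path: /health, port: 3000 }
  periodSeconds: 10
readinessProbe:           # gate traffic until dependencies are reachable
  httpGet: { path: /ready, port: 3000 }
  periodSeconds: 5
```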

Extended Thinking Triggers

Use Opus 4.6 adaptive thinking for:

  • Architecture decisions - Kubernetes vs serverless, multi-region setup
  • Migration planning - Moving between cloud providers
  • Incident response - Complex deployment failures
  • Security design - Zero-trust architecture

Templates Reference

Template                      Purpose
github-actions-pipeline.yml   Full CI/CD workflow with 6 stages
Dockerfile                    Multi-stage Node.js build
docker-compose.yml            Development environment
k8s-manifests.yaml            Deployment, Service, Ingress
helm-values.yaml              Helm chart values
terraform-aws.tf              VPC, EKS, RDS infrastructure
argocd-application.yaml       GitOps application
external-secrets.yaml         Secrets Manager integration

Capability Details

ci-cd

Keywords: ci, cd, pipeline, github actions, gitlab ci, jenkins, workflow

Solves:

  • How do I set up CI/CD?
  • GitHub Actions workflow patterns
  • Pipeline caching strategies
  • Matrix testing setup

docker

Keywords: docker, dockerfile, container, image, build, compose, multi-stage

Solves:

  • How do I containerize my app?
  • Multi-stage Dockerfile best practices
  • Docker Compose development setup
  • Container security hardening

kubernetes

Keywords: kubernetes, k8s, deployment, service, ingress, helm, statefulset, pdb

Solves:

  • How do I deploy to Kubernetes?
  • K8s health probes and resource limits
  • Helm chart structure
  • StatefulSet for databases

infrastructure-as-code

Keywords: terraform, pulumi, iac, infrastructure, provision, gitops, argocd

Solves:

  • How do I set up infrastructure as code?
  • Terraform AWS patterns (VPC, EKS, RDS)
  • GitOps with ArgoCD
  • Secrets management patterns

deployment-strategies

Keywords: blue green, canary, rolling, deployment strategy, rollback, zero downtime

Solves:

  • Which deployment strategy should I use?
  • Zero-downtime database migrations
  • Blue-green deployment setup
  • Canary release with traffic splitting

observability

Keywords: prometheus, grafana, metrics, alerting, monitoring, health check

Solves:

  • How do I add monitoring to my app?
  • Prometheus metrics exposition
  • Grafana dashboard queries
  • Alerting rules for SLOs

Rules (6)

Protect CI/CD branches from direct pushes to enforce code review and audit trails — HIGH

CI/CD: Branch Protection

Configure branch protection rules to enforce code review, passing CI checks, and linear history on critical branches. This prevents untested or unreviewed code from reaching production.

Incorrect:

# Direct push to main — no review, no CI checks
git checkout main
git commit -m "quick fix"
git push origin main

# Force push overwrites history
git push --force origin main
# CI workflow with no branch restrictions
on:
  push:
    branches: ['*']

Correct:

Branch strategy with protection rules:

main (production) ─────●────────●──────>
                       |        |
dev (staging)  ─────●──●────●──●──────>
                    |        |
feature/*  ─────────●────────┘
                    ^
                    └─ PR required, CI checks, code review

GitHub branch protection settings:

main branch:
  - Require pull request before merging
  - Required approving reviews: 2
  - Require status checks to pass (lint, test, security)
  - Require branches to be up to date before merging
  - Do not allow force pushes
  - Do not allow deletions

dev branch:
  - Require pull request before merging
  - Required approving reviews: 1
  - Require status checks to pass (lint, test)

Key rules:

  • main requires PR + 2 approvals + all status checks passing before merge
  • dev requires PR + 1 approval + all status checks passing
  • Never allow direct commits or force pushes to main or dev
  • Feature branches must be created from dev and merged back via PR
  • Require branches to be up-to-date before merging to prevent integration gaps
  • Enable "Require linear history" to keep the commit graph clean and auditable

Reference: references/ci-cd-pipelines.md (lines 5-21)

Cache CI/CD pipeline dependencies to avoid re-downloading and save minutes per run — HIGH

CI/CD: Pipeline Caching

Cache dependencies in CI pipelines using lockfile-based cache keys. Proper caching reduces dependency installation from 2-3 minutes to 10-20 seconds (~85% time savings).

Incorrect:

# No caching: re-downloads everything on every run
steps:
  - uses: actions/checkout@v3
  - run: npm install
  - run: npm test
# Bad cache key: no lockfile hash, stale deps served indefinitely
- uses: actions/cache@v3
  with:
    path: node_modules
    key: ${{ runner.os }}-modules

Correct:

- name: Cache Dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.npm
      node_modules
      backend/.venv
    key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
    restore-keys: |
      ${{ runner.os }}-deps-

- name: Install Dependencies
  run: npm ci

- name: Run Tests
  run: npm test
# Python example with Poetry
- name: Cache Poetry Dependencies
  uses: actions/cache@v3
  with:
    path: ~/.cache/pypoetry
    key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}

- name: Install Dependencies
  run: poetry install

Key rules:

  • Always include hashFiles() of the lockfile in the cache key so caches invalidate when dependencies change
  • Use restore-keys as a fallback prefix to get a partial cache hit when the exact key misses
  • Cache the package manager cache directory (~/.npm, ~/.cache/pypoetry), not just node_modules
  • Use npm ci (not npm install) after cache restore for reproducible installs
  • Cache multiple dependency directories in a single step when possible (npm + pip + venv)
  • Set artifact retention policies (retention-days: 7) to prevent storage bloat

Reference: references/ci-cd-pipelines.md (lines 23-68)

Run database migrations safely during deployment to prevent downtime and data loss — CRITICAL

DevOps: Database Migrations

All schema changes must be backward-compatible with the currently running application version. Destructive changes require a multi-phase migration to achieve zero-downtime deployments.

Incorrect:

-- Destructive: renames column while old code still references 'name'
ALTER TABLE users RENAME COLUMN name TO full_name;

-- Destructive: adds NOT NULL column, old inserts fail immediately
ALTER TABLE users ADD COLUMN email VARCHAR(255) NOT NULL;

-- Destructive: drops column while old code still reads it
ALTER TABLE users DROP COLUMN legacy_field;

Correct (3-phase zero-downtime migration):

-- Phase 1: Add nullable column (safe with old code running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
# Phase 2: Deploy new code that writes to both + backfill
def create_user(name: str, email: str):
    db.execute(
        "INSERT INTO users (name, email) VALUES (%s, %s)",
        (name, email),
    )

async def backfill_emails():
    users = await db.fetch("SELECT id FROM users WHERE email IS NULL")
    for user in users:
        email = generate_email(user.id)
        await db.execute(
            "UPDATE users SET email = %s WHERE id = %s",
            (email, user.id),
        )
-- Phase 3: Add constraint after backfill is verified complete
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

Backward-compatible changes (safe to deploy directly):

  • Add nullable column
  • Add new table
  • Add index
  • Rename column with a view alias

Backward-incompatible changes (require 3-phase migration):

  • Remove column
  • Rename column without alias
  • Add NOT NULL column
  • Change column type

Deploy order: migrate (phase 1) --> deploy new code (phase 2) --> migrate (phase 3)

Key rules:

  • Always deploy migrations before the application code that depends on them
  • Never add a NOT NULL column in a single step — use the 3-phase pattern (add nullable, backfill, add constraint)
  • Always write a downgrade() function so migrations can be rolled back (alembic downgrade -1)
  • Always review auto-generated migrations before applying (alembic revision --autogenerate)
  • Test rollback procedures regularly — do not assume downgrade() works without verification
  • Column renames require a view alias to maintain backward compatibility during rollout
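The 3-phase pattern can be exercised end-to-end in a few lines. A minimal sketch using the stdlib sqlite3 module in place of Postgres (SQLite cannot add a NOT NULL constraint to an existing column, so phase 3 is shown as a comment; batching mirrors how a production backfill avoids long locks):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Phase 1: add nullable column (safe while old code is still running)
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Phase 2: new code writes both fields; backfill old rows in batches
def backfill_emails(conn, batch_size=100):
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE email IS NULL LIMIT ?", (batch_size,)
        ).fetchall()
        if not rows:
            break
        for (user_id,) in rows:
            conn.execute(
                "UPDATE users SET email = ? WHERE id = ?",
                (f"user{user_id}@example.com", user_id),
            )

backfill_emails(conn)

# Phase 3 (Postgres, after verifying the backfill is complete):
#   ALTER TABLE users ALTER COLUMN email SET NOT NULL;
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]
```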

Reference: references/environment-management.md (lines 115-162)

Secure Docker layers by running as non-root and excluding secrets from image builds — CRITICAL

Docker: Layer Security

Every Docker image layer is immutable and inspectable. Running as root or embedding secrets in layers creates critical security vulnerabilities that persist even if later layers attempt to remove them.

Incorrect:

FROM node:20
WORKDIR /app

# BAD: Copies .env, .git, node_modules, and everything else
COPY . .
RUN npm install

# BAD: Secret baked into image layer (visible via docker history)
ARG DATABASE_URL
ENV DATABASE_URL=$DATABASE_URL

# BAD: Running as root (default)
EXPOSE 3000
CMD ["node", "dist/main.js"]

Correct:

FROM node:20-alpine AS runner
WORKDIR /app

# GOOD: Create and use non-root user (uid 1001)
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001

# GOOD: Copy only what's needed with explicit ownership
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./

# GOOD: Run as non-root
USER nodejs

# GOOD: Secrets injected at runtime, never in image
# Use: docker run -e DATABASE_URL=... or Kubernetes secrets
ENV NODE_ENV=production
EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]

Required .dockerignore:

.git
.env
.env.*
node_modules
*.md
tests/
.vscode/

Key rules:

  • Never run containers as root — always create a non-root user with USER directive
  • Never pass secrets via ARG or ENV in the Dockerfile — they are visible in docker history
  • Always use a .dockerignore to exclude .env, .git, node_modules, and test files
  • Use COPY --chown to set file ownership without a separate chown layer
  • Prefer minimal base images (-alpine) to reduce the CVE surface area
  • Enable read-only root filesystem in Kubernetes (readOnlyRootFilesystem: true)
  • Add health checks so orchestrators can detect and restart unhealthy containers

Reference: references/docker-patterns.md (lines 52-85)

Use Docker multi-stage builds to exclude dev dependencies and reduce image size by 4-5x — HIGH

Docker: Multi-Stage Builds

Separate build-time concerns from runtime to produce minimal, secure production images. A well-structured multi-stage build can reduce image size by 78% or more.

Incorrect:

# Single-stage: build tools and dev deps ship to production
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/main.js"]
# Result: ~850 MB image with dev dependencies, source files, build tools

Correct:

# Stage 1: Install production dependencies only
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

# Stage 2: Build with dev dependencies
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build && npm run test

# Stage 3: Minimal production runtime
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]
# Result: ~180 MB image with only production runtime

Key rules:

  • Use separate stages for dependency installation, building, and runtime
  • Copy only production node_modules and compiled artifacts into the final stage
  • Use -alpine base images to minimize base layer size
  • Run npm ci (not npm install) for reproducible, lockfile-exact installs
  • Clean caches (npm cache clean --force) in the same layer as install to avoid bloating layers
  • Always include a HEALTHCHECK in the production stage for orchestrator integration
  • Run tests in the builder stage so test failures prevent image creation

Reference: references/docker-patterns.md (lines 7-50)

Configure Railway PaaS deployment with correct Nixpacks, environment, and railway.json settings — HIGH

Railway Deployment Patterns

railway.json Configuration

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npm ci && npm run build"
  },
  "deploy": {
    "startCommand": "npm start",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3
  }
}

Nixpacks vs Dockerfile

Factor          Nixpacks (default)         Dockerfile
Setup           Zero config, auto-detect   Manual, full control
Build time      Fast (Nix cache)           Depends on layers
Customization   nixpacks.toml              Unlimited
Use when        Standard apps              Custom runtimes, multi-stage

Environment Variables

  • Use Railway's shared variables for cross-service config (DATABASE_URL, REDIS_URL)
  • Service-specific variables override shared ones
  • Reference other vars: ${{shared.DATABASE_URL}}
  • Never hardcode secrets — use Railway's encrypted env vars

Database Provisioning

Railway provisions managed databases with one click:

  • PostgreSQL, MySQL, Redis, MongoDB
  • Connection string auto-injected as env var
  • Backups included on paid plans

Multi-Service Setup

  • Use monorepo config: set rootDirectory per service
  • Internal networking: services communicate via ${{service.RAILWAY_PRIVATE_DOMAIN}}:port
  • Shared env groups for common config

Railway CLI

railway login              # Authenticate
railway link               # Connect to project
railway up                 # Deploy from local
railway logs               # View deployment logs
railway variables          # List env vars
railway shell              # Open shell in service

Anti-Patterns

Incorrect:

  • Running railway up from CI without railway link — deploys to wrong project
  • Using Dockerfile when Nixpacks handles the stack — unnecessary complexity
  • Storing secrets in railway.json — use env vars
  • Skipping healthcheck config — Railway can't detect failed deploys

Correct:

  • Configure healthcheckPath for all web services
  • Use shared variables for cross-service config
  • Set restart policy for resilience
  • Use Nixpacks unless you need custom runtime

References

  • references/railway-json-config.md — Full railway.json schema and examples
  • references/nixpacks-customization.md — Custom build configs, environment detection
  • references/multi-service-setup.md — Monorepo deploy, service networking

References (9)

CI/CD Pipelines

CI/CD Pipelines

Comprehensive CI/CD patterns for GitHub Actions, caching, matrix testing, and artifact management.

Branch Strategy

Recommended: Git Flow with Feature Branches

main (production) ─────●────────●──────>
                       ┃        ┃
dev (staging) ─────●───●────●───●──────>
                   ┃        ┃
feature/* ─────────●────────┘
                   ^
                   └─ PR required, CI checks, code review

Branch protection rules:

  • main: Require PR + 2 approvals + all checks pass
  • dev: Require PR + 1 approval + all checks pass
  • Feature branches: No direct commits to main/dev

GitHub Actions Caching Strategy

- name: Cache Dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.npm
      node_modules
      backend/.venv
    key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
    restore-keys: |
      ${{ runner.os }}-deps-

Cache hit ratio impact:

  • Without cache: 2-3 min install time
  • With cache: 10-20 sec install time
  • ~85% time savings on typical workflows

Artifact Management

# Build and upload artifact
- name: Build Application
  run: npm run build

- name: Upload Build Artifact
  uses: actions/upload-artifact@v3
  with:
    name: build-${{ github.sha }}
    path: dist/
    retention-days: 7

# Download in deployment job
- name: Download Build Artifact
  uses: actions/download-artifact@v3
  with:
    name: build-${{ github.sha }}
    path: dist/

Benefits:

  • Avoid rebuilding in deployment job
  • Deploy exact tested artifact (byte-for-byte match)
  • Retention policies prevent storage bloat

Matrix Testing

strategy:
  matrix:
    node-version: [18, 20, 22]
    os: [ubuntu-latest, windows-latest]
jobs:
  test:
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm test

Complete Backend CI/CD Example

name: Backend CI/CD

on:
  push:
    branches: [main, dev]
    paths: ['backend/**']
  pull_request:
    branches: [main, dev]
    paths: ['backend/**']

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: backend
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Cache Poetry Dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pypoetry
          key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}

      - name: Install Poetry
        run: pip install poetry

      - name: Install Dependencies
        run: poetry install

      - name: Run Ruff Format Check
        run: poetry run ruff format --check app/

      - name: Run Ruff Lint
        run: poetry run ruff check app/

      - name: Run Type Check
        run: poetry run mypy app/ --ignore-missing-imports

      - name: Run Tests
        run: poetry run pytest tests/ --cov=app --cov-report=xml

      - name: Upload Coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./backend/coverage.xml

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Trivy Scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: 'backend/'
          severity: 'CRITICAL,HIGH'

Key features:

  • Path filtering (only run on backend changes)
  • Poetry dependency caching
  • Comprehensive quality checks (format, lint, type, test)
  • Security scanning with Trivy

Pipeline Stages

  1. Lint & Type Check - Code quality gates
  2. Unit Tests - Test coverage with reporting
  3. Security Scan - npm audit + Trivy vulnerability scanner
  4. Build & Push - Docker image to container registry
  5. Deploy Staging - Environment-gated deployment
  6. Deploy Production - Manual approval or automated

Best Practices

  1. Fast feedback - tests complete in < 5 min
  2. Fail fast - stop on first failure
  3. Cache dependencies - npm/pip cache
  4. Matrix testing - multiple Node/Python versions
  5. Secrets management - use GitHub Secrets
  6. Branch protection - require passing tests

See scripts/github-actions-pipeline.yml for complete examples.

Deployment Strategies

Deployment Strategies

Blue-green, canary, and rolling deployment patterns.

Rolling Deployment (Default)

Update pods gradually:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

  • Pros: No downtime, gradual rollout
  • Cons: Mixed versions running simultaneously

Blue-Green Deployment

Two identical environments, switch traffic:

# Deploy to green (inactive)
kubectl apply -f green-deployment.yaml

# Test green
curl https://green.example.com/health

# Switch traffic (update service selector)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback if needed
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'

  • Pros: Instant rollback, no mixed versions
  • Cons: 2x resources, database migrations tricky

Canary Deployment

Gradually shift traffic:

# 90% to stable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-stable
spec:
  replicas: 9

# 10% to canary
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-canary
spec:
  replicas: 1

  • Pros: Limit blast radius, test with real traffic
  • Cons: Complex traffic management
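Replica counts only give coarse ratios; a service mesh can split traffic precisely. A sketch of an Istio VirtualService routing 90/10 between the two Deployments above (the host and subset names are illustrative, and matching DestinationRule subsets are assumed):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts: ["app.example.com"]
  http:
    - route:
        - destination:
            host: app
            subset: stable
          weight: 90        # stable receives 90% of traffic
        - destination:
            host: app
            subset: canary
          weight: 10        # shift upward as confidence grows
```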

See scripts/argocd-application.yaml for GitOps patterns.

Docker Patterns

Docker Patterns

Best practices for Dockerfile optimization, multi-stage builds, and container security.

Multi-Stage Build Example

# ============================================================
# Stage 1: Dependencies (builder)
# ============================================================
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

# ============================================================
# Stage 2: Build (with dev dependencies)
# ============================================================
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci  # Include dev dependencies
COPY . .
RUN npm run build && npm run test

# ============================================================
# Stage 3: Production runtime (minimal)
# ============================================================
FROM node:20-alpine AS runner
WORKDIR /app

# Security: Non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001

# Copy only production dependencies and built artifacts
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./

USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]

Image size comparison:

  • Single-stage: 850 MB (includes dev dependencies, source files)
  • Multi-stage: 180 MB (only runtime + production deps)
  • 78% reduction

Layer Caching Optimization

Order matters for cache efficiency:

# BAD: Invalidates cache on any code change
COPY . .
RUN npm install

# GOOD: Cache package.json layer separately
COPY package*.json ./
RUN npm ci  # Cached unless package.json changes
COPY . .    # Source changes don't invalidate npm install

Security Hardening

Non-root user (uid 1001):

RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
USER nodejs

Read-only filesystem where possible:

# In K8s deployment
securityContext:
  readOnlyRootFilesystem: true

Health checks for orchestrator integration:

HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1

Security Scanning with Trivy

- name: Build Docker Image
  run: docker build -t myapp:${{ github.sha }} .

- name: Scan for Vulnerabilities
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'myapp:${{ github.sha }}'
    format: 'sarif'
    output: 'trivy-results.sarif'
    severity: 'CRITICAL,HIGH'

- name: Upload Scan Results
  uses: github/codeql-action/upload-sarif@v2
  with:
    sarif_file: 'trivy-results.sarif'

- name: Fail on Critical Vulnerabilities
  run: |
    trivy image --severity CRITICAL --exit-code 1 myapp:${{ github.sha }}

Docker Compose Development Setup

version: '3.8'
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: orchestkit
      POSTGRES_PASSWORD: dev_password
      POSTGRES_DB: orchestkit_dev
    ports:
      - "5437:5432"  # Avoid conflict with host postgres
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U orchestkit"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redisdata:/data

  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile.dev
    ports:
      - "8500:8500"
    environment:
      DATABASE_URL: postgresql://orchestkit:dev_password@postgres:5432/orchestkit_dev
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./backend:/app  # Hot reload

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile.dev
    ports:
      - "5173:5173"
    environment:
      VITE_API_URL: http://localhost:8500
    volumes:
      - ./frontend:/app
      - /app/node_modules  # Avoid overwriting node_modules

volumes:
  pgdata:
  redisdata:

Key patterns:

  • Port mapping to avoid host conflicts (5437:5432)
  • Health checks before dependent services start
  • Volume mounts for hot reload during development
  • Named volumes for data persistence

See scripts/Dockerfile and scripts/docker-compose.yml for complete examples.

Environment Management

Environment Management

Secrets management, configuration, and environment variable patterns.

External Secrets Operator

Sync secrets from cloud providers to Kubernetes:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-secrets
    creationPolicy: Owner
  data:
    - secretKey: database-url
      remoteRef:
        key: prod/app/database
        property: url
    - secretKey: api-key
      remoteRef:
        key: prod/app/api-keys
        property: main

Supported backends:

  • AWS Secrets Manager
  • HashiCorp Vault
  • Azure Key Vault
  • GCP Secret Manager

GitOps with ArgoCD

ArgoCD watches Git repository and syncs cluster state:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/repo
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m

Features:

  • Automated sync with pruning
  • Self-healing (drift detection)
  • Retry policies for transient failures

Infrastructure as Code (Terraform)

Remote state in S3 with DynamoDB locking:

terraform {
  required_version = ">= 1.5"
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
  }
}

Module-based architecture:

module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}

module "rds" {
  source     = "./modules/rds"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.database_subnets
}

Environment-specific tfvars:

terraform plan -var-file=environments/production.tfvars

Database Migration Strategies

Zero-Downtime Migration Pattern

Problem: Adding a NOT NULL column breaks old application versions

Solution: 3-phase migration

Phase 1: Add nullable column

-- Migration v1 (deploy with old code still running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);

Phase 2: Deploy new code + backfill

# New code populates the new column on every write
def create_user(name: str, email: str):
    db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (name, email))

# Backfill existing rows
async def backfill_emails():
    users_without_email = await db.fetch("SELECT id FROM users WHERE email IS NULL")
    for user in users_without_email:
        email = generate_email(user.id)
        await db.execute("UPDATE users SET email = %s WHERE id = %s", (email, user.id))

Phase 3: Add constraint

-- Migration v2 (after backfill complete)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

Backward/Forward Compatibility

Backward compatible changes (safe):

  • Add nullable column
  • Add table
  • Add index
  • Rename column (with view alias)

Backward incompatible changes (requires 3-phase):

  • Remove column
  • Rename column (no alias)
  • Add NOT NULL column
  • Change column type
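
A rename without a view alias follows the same 3-phase shape: during the transition, application code writes both the old and the new column so either app version keeps working. A minimal sketch (hypothetical `fullname` → `full_name` rename; SQLite is used only for illustration):

```python
import sqlite3

# Phase-2 dual-write: both columns exist, and every insert populates both.
# Old code still reads `fullname`; new code reads `full_name`. Phase 3
# drops `fullname` once no old version is deployed.
def create_user(conn: sqlite3.Connection, name: str) -> None:
    conn.execute(
        "INSERT INTO users (fullname, full_name) VALUES (?, ?)",
        (name, name),
    )

def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (fullname TEXT, full_name TEXT)")
    create_user(conn, "Ada Lovelace")
    return conn.execute("SELECT fullname, full_name FROM users").fetchall()
```

A backfill (`UPDATE users SET full_name = fullname WHERE full_name IS NULL`) covers rows written before the dual-write deploy.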

Alembic Migration Pattern

# backend/alembic/versions/2024_12_15_add_langfuse_trace_id.py
"""Add Langfuse trace_id to analyses"""
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Add nullable column first (backward compatible)
    op.add_column('analyses',
        sa.Column('langfuse_trace_id', sa.String(255), nullable=True)
    )
    # Index for lookup performance
    op.create_index('idx_analyses_langfuse_trace',
        'analyses', ['langfuse_trace_id']
    )

def downgrade():
    op.drop_index('idx_analyses_langfuse_trace')
    op.drop_column('analyses', 'langfuse_trace_id')

Migration workflow:

# Create new migration
poetry run alembic revision --autogenerate -m "Add langfuse trace ID"

# Review generated migration (ALWAYS review!)
cat alembic/versions/abc123_add_langfuse_trace_id.py

# Apply migration
poetry run alembic upgrade head

# Rollback if needed
poetry run alembic downgrade -1

Rollback Procedures

# Helm rollback to previous revision
helm rollback myapp 3

# Kubernetes rollback
kubectl rollout undo deployment/myapp

# Database migration rollback (Alembic)
alembic downgrade -1

Critical: Test rollback procedures regularly!

See scripts/external-secrets.yaml and scripts/argocd-application.yaml for complete examples.

Kubernetes Basics

K8s deployments, services, health probes, and production patterns.

Health Probes

Three probe types with distinct purposes:

spec:
  containers:
  - name: app
    # Startup probe (gives slow-starting apps time to boot)
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 30  # 30 * 5s = 150s max startup time

    # Liveness probe (restarts pod if failing)
    livenessProbe:
      httpGet:
        path: /health/liveness
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 3  # 3 failures = restart

    # Readiness probe (removes from service if failing)
    readinessProbe:
      httpGet:
        path: /health/readiness
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 2  # 2 failures = remove from load balancer

Probe implementation (FastAPI):

@app.get("/health/startup")
async def startup_check():
    # Check DB connection established
    if not db.is_connected():
        raise HTTPException(status_code=503, detail="DB not ready")
    return {"status": "ok"}

@app.get("/health/liveness")
async def liveness_check():
    # Basic "is process running" check
    return {"status": "alive"}

@app.get("/health/readiness")
async def readiness_check():
    # Check all dependencies healthy
    if not redis.ping() or not db.health_check():
        raise HTTPException(status_code=503, detail="Dependencies unhealthy")
    return {"status": "ready"}

Security Context

spec:
  securityContext:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop:
        - ALL

Resource Management

Always set requests and limits:

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"

  • Requests determine scheduling decisions
  • Limits cap usage (CPU is throttled; exceeding the memory limit gets the container OOM-killed)

PodDisruptionBudget

Prevents too many pods from being evicted during node maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp

Use cases:

  • Cluster upgrades (node drains)
  • Autoscaler downscaling
  • Manual evictions

Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"      # Total CPU requests
    requests.memory: 20Gi   # Total memory requests
    limits.cpu: "20"        # Total CPU limits
    limits.memory: 40Gi     # Total memory limits
    pods: "50"              # Max pods

StatefulSets for Databases

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    # Pod spec here
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

Key differences from Deployment:

  • Stable pod names (postgres-0, postgres-1, postgres-2)
  • Ordered deployment and scaling
  • Persistent storage per pod

Helm Chart Structure

charts/app/
├── Chart.yaml
├── values.yaml
├── scripts/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   └── _helpers.tpl
└── values/
    ├── staging.yaml
    └── production.yaml

Essential Manifests

  • Deployment with rolling update strategy
  • Service for internal routing
  • Ingress for external access with TLS
  • HorizontalPodAutoscaler for scaling

See scripts/k8s-manifests.yaml and scripts/helm-values.yaml for complete examples.

Multi-Service Setup on Railway

Deploy multiple services in one Railway project for monorepos, microservices, or web + worker architectures.

Monorepo Configuration

Each service in a Railway project can point to a different root directory:

my-monorepo/
├── apps/
│   ├── api/           ← Service 1 (root: apps/api)
│   │   ├── package.json
│   │   └── railway.json
│   ├── web/           ← Service 2 (root: apps/web)
│   │   ├── package.json
│   │   └── railway.json
│   └── worker/        ← Service 3 (root: apps/worker)
│       ├── package.json
│       └── railway.json
├── packages/          ← Shared packages
└── package.json       ← Root workspace

Set each service's root directory in the Railway dashboard under Settings > Source.

Private Networking

Services within the same project communicate over Railway's private network:

# From the web service, call the API service:
http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}/endpoint

# In environment variables (set on web service):
API_URL=http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}

Key points:

  • Private networking uses internal DNS, no public internet
  • Zero egress costs between services
  • Always use the PORT variable — not hardcoded ports
  • Services must listen on 0.0.0.0 (not localhost)
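
The last two points can be sketched in a few lines. A minimal example (names are illustrative) of resolving the bind address for any Python web server on Railway:

```python
import os

def bind_config(default_port: int = 8000) -> tuple:
    """Resolve (host, port) for a Railway service.

    Railway injects PORT at runtime; the service must listen on 0.0.0.0
    so the private network (and the public proxy) can reach it.
    """
    port = int(os.environ.get("PORT", default_port))
    return "0.0.0.0", port
```

Usage with uvicorn, for example: `host, port = bind_config()` then `uvicorn.run(app, host=host, port=port)`.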

Common Architectures

Web + API + Worker

Service    Role                               Public?
web        Frontend (Next.js, Vite)           Yes
api        Backend API                        Yes (or private if only web calls it)
worker     Background jobs (BullMQ, Celery)   No
postgres   Database                           No (private only)
redis      Cache / queue broker               No (private only)

Shared Environment Variables

Use Railway's shared variables (project-level) for values needed by all services:

  • NODE_ENV=production
  • LOG_LEVEL=info

Use reference variables for cross-service connections:

  • DATABASE_URL=${{Postgres.DATABASE_URL}}
  • REDIS_URL=${{Redis.REDIS_URL}}
  • API_URL=http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}

Deploy Order

Railway deploys services in parallel by default. If you need ordering (e.g., run migrations before starting web):

  1. Put migrations in the API service's startCommand
  2. Use healthchecks — dependent services will retry connections until the API is healthy
  3. For strict ordering, use separate deploy triggers via Railway CLI
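
Point 2 amounts to a polling loop on the dependent service's side. A hedged sketch (function and parameter names are illustrative, not a Railway API):

```python
import time
import urllib.error
import urllib.request

def wait_for_healthy(url: str, timeout_s: float = 60.0,
                     interval_s: float = 2.0, probe=None) -> bool:
    """Poll a healthcheck URL until it returns HTTP 200 or the deadline passes.

    `probe` is injectable for testing; by default it issues a real GET.
    """
    def default_probe(u: str) -> bool:
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    check = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check(url):
            return True
        time.sleep(interval_s)
    return False
```

Call this at worker/web startup before accepting traffic, e.g. `wait_for_healthy(f"{API_URL}/health")`.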

Nixpacks Customization

Railway uses Nixpacks to auto-detect your stack and generate a build plan. Customize when auto-detection falls short.

Auto-Detection

Nixpacks detects your language by looking for:

Language   Detection File
Node.js    package.json
Python     requirements.txt, pyproject.toml, Pipfile
Go         go.mod
Rust       Cargo.toml
Ruby       Gemfile
Java       pom.xml, build.gradle
PHP        composer.json

nixpacks.toml

Place at project root (or set nixpacksConfigPath in railway.json for monorepos).

Adding System Packages

[phases.setup]
nixPkgs = ["...", "ffmpeg", "imagemagick", "poppler_utils"]
aptPkgs = ["libvips-dev"]

Custom Build Phases

[phases.install]
cmds = ["npm ci --production=false"]

[phases.build]
cmds = [
  "npx prisma generate",
  "npm run build"
]
dependsOn = ["install"]

[start]
cmd = "npm run start:prod"

Environment Variables in Build

[variables]
NODE_ENV = "production"
NEXT_TELEMETRY_DISABLED = "1"

Monorepo Root Path

For monorepos, set the root directory per service in the Railway dashboard or via railway.json:

{
  "build": {
    "builder": "NIXPACKS",
    "nixpacksConfigPath": "apps/api/nixpacks.toml"
  }
}

Each service points to its own directory and nixpacks.toml.

When to Switch to Dockerfile

Use Dockerfile instead of Nixpacks when:

  • Multi-stage builds are needed to reduce image size
  • Build requires conditional logic (e.g., ARG-based feature flags)
  • Precise control over base image (e.g., distroless, Alpine variants)
  • Nixpacks doesn't support a required system dependency

Set in railway.json:

{
  "build": {
    "builder": "DOCKERFILE",
    "dockerfilePath": "Dockerfile.production"
  }
}

Observability & Monitoring

Prometheus metrics, Grafana dashboards, and alerting patterns.

Prometheus Metrics Exposition

import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest

app = FastAPI()

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    return response

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type="text/plain")

Grafana Dashboard Queries

# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (4xx/5xx as percentage)
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)

Alerting Rules

groups:
- name: app-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }}"

  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High p95 latency detected"
      description: "p95 latency is {{ $value }}s"

  - alert: PodCrashLooping
    expr: |
      increase(kube_pod_container_status_restarts_total[1h]) > 5
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Pod is crash looping"
      description: "{{ $labels.pod }} has restarted {{ $value }} times"

Key Metrics to Monitor

Metric         Purpose                Alert Threshold
Request rate   Traffic volume         Anomaly detection
Error rate     Service health         > 5% (critical)
p95 latency    User experience        > 2s (warning)
CPU usage      Resource utilization   > 80% sustained
Memory usage   Resource utilization   > 85% sustained
Pod restarts   Stability              > 3 in 1 hour

Golden Signals (SRE)

  1. Latency - Time to serve a request
  2. Traffic - Requests per second
  3. Errors - Rate of failed requests
  4. Saturation - Resource utilization
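
As a rough illustration (the `RequestRecord` type and function below are hypothetical, not part of any library), all four signals can be computed from a window of request records:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    duration_s: float
    status: int

def golden_signals(records: list, window_s: float, cpu_util: float) -> dict:
    """Summarize the four golden signals over one time window."""
    n = len(records)
    sorted_d = sorted(r.duration_s for r in records)
    # Nearest-rank p95; a real monitoring stack interpolates histogram buckets.
    p95 = sorted_d[int(0.95 * (n - 1))] if n else 0.0
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "latency_p95_s": p95,          # Latency
        "traffic_rps": n / window_s,   # Traffic
        "error_rate": errors / n if n else 0.0,  # Errors
        "saturation": cpu_util,        # Saturation (here: CPU utilization)
    }
```

In practice these come from the Prometheus queries above rather than in-process aggregation; the sketch only makes the definitions concrete.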

Log Aggregation

Structured logging for observability:

import uuid

import structlog
from fastapi import Request

logger = structlog.get_logger()

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())

    with structlog.contextvars.bound_contextvars(
        request_id=request_id,
        method=request.method,
        path=request.url.path,
    ):
        logger.info("request_started")
        response = await call_next(request)
        logger.info("request_completed", status=response.status_code)

    response.headers["X-Request-ID"] = request_id
    return response

Distributed Tracing

OpenTelemetry integration:

from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)

# Manual spans for business logic
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order_id", order_id)
        # Processing logic here

railway.json Configuration

Complete reference for railway.json schema — the primary way to configure build and deploy behavior on Railway.

Full Schema

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npm ci && npm run build",
    "watchPatterns": ["src/**", "package.json"],
    "nixpacksConfigPath": "nixpacks.toml",
    "dockerfilePath": "Dockerfile"
  },
  "deploy": {
    "startCommand": "node dist/server.js",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3,
    "numReplicas": 1,
    "sleepApplication": false,
    "region": "us-west1",
    "cronSchedule": "0 */6 * * *"
  }
}

Builder Options

Builder      When to Use
NIXPACKS     Default — auto-detects language and builds (Node, Python, Go, Rust, etc.)
DOCKERFILE   Complex builds, multi-stage images, custom system deps
PAKETO       Cloud Native Buildpacks alternative

Deploy Settings

Field                     Default         Description
startCommand              Auto-detected   Overrides default start command
healthcheckPath           None            HTTP path to check for 200 response
healthcheckTimeout        30              Seconds before healthcheck is considered failed
restartPolicyType         ON_FAILURE      ON_FAILURE, ALWAYS, or NEVER
restartPolicyMaxRetries   3               Max restart attempts before marking deploy failed
numReplicas               1               Number of instances (horizontal scaling)
sleepApplication          false           Sleep service when no traffic (saves credits)
cronSchedule              None            Cron expression for scheduled services

Examples

Node.js API with migrations

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npm ci && npx prisma generate && npm run build"
  },
  "deploy": {
    "startCommand": "npx prisma migrate deploy && node dist/server.js",
    "healthcheckPath": "/api/health",
    "healthcheckTimeout": 60
  }
}

Python FastAPI

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "pip install -r requirements.txt"
  },
  "deploy": {
    "startCommand": "uvicorn main:app --host 0.0.0.0 --port $PORT",
    "healthcheckPath": "/health"
  }
}

Cron Worker (no web traffic)

{
  "$schema": "https://railway.com/railway.schema.json",
  "deploy": {
    "startCommand": "node dist/worker.js",
    "cronSchedule": "*/15 * * * *",
    "restartPolicyType": "NEVER"
  }
}

Production Readiness Checklist

🔒 Security

  • Secrets in environment variables or vault (not in code/config)
  • HTTPS enforced (redirect HTTP → HTTPS)
  • Security headers configured (HSTS, CSP, X-Frame-Options)
  • CORS restricted to known origins
  • Rate limiting enabled
  • SQL injection prevention (parameterized queries)
  • XSS prevention (output encoding, CSP)
  • Dependencies scanned for vulnerabilities
  • Container images scanned (Trivy/Snyk)

🧪 Testing

  • Unit tests passing (>80% coverage)
  • Integration tests passing
  • E2E tests for critical paths
  • Load testing completed (k6/Locust)
  • Security testing (OWASP ZAP)
  • Smoke tests for deployment verification

📊 Observability

  • Structured logging (JSON format)
  • Log aggregation configured (ELK/Loki)
  • Metrics exported (Prometheus format)
  • Dashboards created (Grafana)
  • Distributed tracing enabled (Jaeger/Tempo)
  • Error tracking configured (Sentry)
  • Uptime monitoring (synthetic checks)
  • Alerting rules defined

🚀 Deployment

  • CI/CD pipeline configured
  • Automated tests run on every PR
  • Blue-green or canary deployment ready
  • Rollback procedure documented and tested
  • Database migrations tested
  • Feature flags for risky changes
  • Deployment notifications (Slack/Teams)

💾 Data

  • Database backups automated (daily)
  • Backup restoration tested
  • Point-in-time recovery enabled
  • Data retention policies defined
  • PII handling documented (GDPR/CCPA)
  • Encryption at rest enabled
  • Encryption in transit (TLS)

🏗️ Infrastructure

  • Infrastructure as Code (Terraform/Pulumi)
  • Auto-scaling configured
  • Health checks defined
  • Resource limits set (CPU/memory)
  • Multi-AZ deployment
  • CDN for static assets
  • DDoS protection enabled

📝 Documentation

  • API documentation (OpenAPI)
  • Architecture diagram
  • Runbook for common issues
  • Incident response playbook
  • On-call rotation defined
  • SLA/SLO documented

🔄 Reliability

  • Graceful shutdown handling
  • Connection pooling configured
  • Circuit breakers for external calls
  • Retry with exponential backoff
  • Timeouts set on all external calls
  • Chaos engineering tests (optional)
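
Retry with exponential backoff is simple enough to sketch inline. A minimal, illustrative implementation (names are not from any specific library):

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5,
                       max_delay: float = 30.0, sleep=time.sleep):
    """Call fn(), retrying on exception with capped exponential backoff.

    Jitter spreads retries out so failing clients don't stampede the
    dependency in lockstep. `sleep` is injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff
```

Pair this with timeouts on the underlying call; retrying an unbounded call just multiplies the hang.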

Pre-Launch Final Checks

# Security scan
npm audit --audit-level=high
pip-audit

# Performance baseline
k6 run load-test.js

# DNS and SSL
curl -I https://api.example.com
openssl s_client -connect api.example.com:443

# Health endpoint
curl https://api.example.com/health

# Logs flowing
docker logs -f backend

# Metrics exposed
curl http://localhost:9090/metrics

GitHub Actions CI/CD Example

Complete pipeline for a Python FastAPI + React application.

Repository Structure

├── .github/workflows/
│   ├── ci.yml          # Test on every PR
│   ├── deploy.yml      # Deploy on main merge
│   └── security.yml    # Weekly security scan
├── backend/            # FastAPI
├── frontend/           # React
└── docker-compose.yml

CI Pipeline (ci.yml)

name: CI

on:
  pull_request:
    branches: [main, dev]
  push:
    branches: [main, dev]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  backend-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true

      - name: Cache dependencies
        uses: actions/cache@v4
        with:
          path: backend/.venv
          key: venv-${{ runner.os }}-${{ hashFiles('backend/poetry.lock') }}

      - name: Install dependencies
        working-directory: backend
        run: poetry install --no-interaction

      - name: Lint
        working-directory: backend
        run: |
          poetry run ruff format --check app/
          poetry run ruff check app/

      - name: Type check
        working-directory: backend
        run: poetry run mypy app/ --ignore-missing-imports

      - name: Test
        working-directory: backend
        env:
          DATABASE_URL: postgresql://postgres:test@localhost:5432/test
        run: poetry run pytest --cov=app --cov-report=xml -v

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: backend/coverage.xml

  frontend-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
          cache-dependency-path: frontend/package-lock.json

      - name: Install
        working-directory: frontend
        run: npm ci

      - name: Lint
        working-directory: frontend
        run: npm run lint

      - name: Type check
        working-directory: frontend
        run: npm run typecheck

      - name: Test
        working-directory: frontend
        run: npm run test:coverage

      - name: Build
        working-directory: frontend
        run: npm run build

Deploy Pipeline (deploy.yml)

name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production  # Requires approval

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build & push backend
        run: |
          docker build -t $ECR_REGISTRY/backend:${{ github.sha }} ./backend
          docker push $ECR_REGISTRY/backend:${{ github.sha }}

      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service backend \
            --force-new-deployment

      - name: Deploy frontend to S3
        run: |
          cd frontend && npm ci && npm run build
          aws s3 sync dist/ s3://$S3_BUCKET --delete
          aws cloudfront create-invalidation --distribution-id $CF_DIST --paths "/*"

Security Scan (security.yml)

name: Security

on:
  schedule:
    - cron: "0 0 * * 0"  # Weekly Sunday midnight
  workflow_dispatch:

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Python dependencies
        run: |
          pip install pip-audit
          pip-audit -r backend/requirements.txt

      - name: Node dependencies
        working-directory: frontend
        run: npm audit --audit-level=high

      - name: Docker scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: backend:latest
          severity: HIGH,CRITICAL

Key Patterns

Pattern        Implementation
Caching        Poetry venv, npm cache
Concurrency    Cancel in-progress on new push
Services       Postgres container for tests
Environments   Production requires approval
Artifacts      Coverage reports to Codecov