OrchestKit v6.7.1 — 67 skills, 38 agents, 77 hooks with Opus 4.6 support

DevOps & Deployment

Use when setting up CI/CD pipelines, containerizing applications, deploying to Kubernetes, or writing infrastructure as code. DevOps & Deployment covers GitHub Actions, Docker, Helm, and Terraform patterns.


Primary Agent: data-pipeline-engineer

DevOps & Deployment Skill

Comprehensive frameworks for CI/CD pipelines, containerization, deployment strategies, and infrastructure automation.

Overview

Use this skill when:

  • Setting up CI/CD pipelines
  • Containerizing applications
  • Deploying to Kubernetes or cloud platforms
  • Implementing GitOps workflows
  • Managing infrastructure as code
  • Planning release strategies

Pipeline Architecture

┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│    Code     │──>│    Build    │──>│    Test     │──>│   Deploy    │
│   Commit    │   │   & Lint    │   │   & Scan    │   │  & Release  │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
       │                 │                 │                 │
       v                 v                 v                 v
   Triggers         Artifacts          Reports          Monitoring

Key Concepts

CI/CD Pipeline Stages

  1. Lint & Type Check - Code quality gates
  2. Unit Tests - Test coverage with reporting
  3. Security Scan - npm audit + Trivy vulnerability scanner
  4. Build & Push - Docker image to container registry
  5. Deploy Staging - Environment-gated deployment
  6. Deploy Production - Manual approval or automated

Container Best Practices

Multi-stage builds minimize image size:

  • Stage 1: Install production dependencies only
  • Stage 2: Build application with dev dependencies
  • Stage 3: Production runtime with minimal footprint

Security hardening:

  • Non-root user (uid 1001)
  • Read-only filesystem where possible
  • Health checks for orchestrator integration

Kubernetes Deployment

Essential manifests:

  • Deployment with rolling update strategy
  • Service for internal routing
  • Ingress for external access with TLS
  • HorizontalPodAutoscaler for scaling
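Of these manifests, the HorizontalPodAutoscaler is the one not shown elsewhere in this skill. A minimal sketch using the `autoscaling/v2` API (the name `app` and the 70% CPU target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2          # keep headroom for rolling updates
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```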

Security context:

  • runAsNonRoot: true
  • allowPrivilegeEscalation: false
  • readOnlyRootFilesystem: true
  • Drop all capabilities
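These four settings map onto a pod spec roughly as follows (a sketch; the container name is illustrative):

```yaml
# Pod-level settings
securityContext:
  runAsNonRoot: true
  runAsUser: 1001
containers:
  - name: app
    # Container-level settings
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```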

Deployment Strategies

Strategy     Use Case                        Risk
Rolling      Default, gradual replacement    Low - automatic rollback
Blue-Green   Instant switch, easy rollback   Medium - double resources
Canary       Progressive traffic shift       Low - gradual exposure

Rolling Update (Kubernetes default):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0  # Zero downtime

Secrets Management

Use External Secrets Operator to sync from cloud providers:

  • AWS Secrets Manager
  • HashiCorp Vault
  • Azure Key Vault
  • GCP Secret Manager

References

Docker Patterns

See: references/docker-patterns.md

Key topics covered:

  • Multi-stage build examples with 78% size reduction
  • Layer caching optimization
  • Security hardening (non-root, health checks)
  • Trivy vulnerability scanning
  • Docker Compose development setup

CI/CD Pipelines

See: references/ci-cd-pipelines.md

Key topics covered:

  • Branch strategy (Git Flow)
  • GitHub Actions caching (85% time savings)
  • Artifact management
  • Matrix testing
  • Complete backend CI/CD example

Kubernetes Basics

See: references/kubernetes-basics.md

Key topics covered:

  • Health probes (startup, liveness, readiness)
  • Security context configuration
  • PodDisruptionBudget
  • Resource quotas
  • StatefulSets for databases
  • Helm chart structure

Environment Management

See: references/environment-management.md

Key topics covered:

  • External Secrets Operator
  • GitOps with ArgoCD
  • Terraform patterns (remote state, modules)
  • Zero-downtime database migrations
  • Alembic migration workflow
  • Rollback procedures

Observability

See: references/observability.md

Key topics covered:

  • Prometheus metrics exposition
  • Grafana dashboard queries (PromQL)
  • Alerting rules for SLOs
  • Golden signals (SRE)
  • Structured logging
  • Distributed tracing (OpenTelemetry)
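The alerting-rules topic can be illustrated with a minimal Prometheus rule for an error-rate SLO (a sketch; the metric name `http_requests_total` and the 1% threshold are assumptions, not taken from the referenced file):

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m                # sustain for 5 minutes before paging
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```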

Railway Deployment

See: rules/railway-deployment.md

Key topics covered:

  • railway.json configuration, Nixpacks builds
  • Environment variable management, database provisioning
  • Multi-service setups, Railway CLI workflows
  • References: references/railway-json-config.md, references/nixpacks-customization.md, references/multi-service-setup.md

Deployment Strategies

See: references/deployment-strategies.md

Key topics covered:

  • Rolling deployment
  • Blue-green deployment
  • Canary releases
  • Traffic splitting with Istio

Deployment Checklist

Pre-Deployment

  • All tests passing in CI
  • Security scans clean
  • Database migrations ready
  • Rollback plan documented

During Deployment

  • Monitor deployment progress
  • Watch error rates
  • Verify health checks passing
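The steps above can be sketched as kubectl commands, assuming a Deployment named `app` in namespace `production` (names are illustrative):

```shell
# Monitor deployment progress; exits non-zero if the rollout stalls
kubectl rollout status deployment/app -n production --timeout=5m

# Verify health checks are passing (pods should be Ready)
kubectl get pods -n production -l app=app

# Watch error rates in recent logs
kubectl logs -n production -l app=app --since=5m | grep -i error

# Roll back immediately if the deployment degrades
kubectl rollout undo deployment/app -n production
```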

Post-Deployment

  • Verify metrics normal
  • Check logs for errors
  • Update status page

Helm Chart Structure

charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   └── _helpers.tpl
└── values/
    ├── staging.yaml
    └── production.yaml

Related Skills

  • zero-downtime-migration - Database migration patterns for zero-downtime deployments
  • security-scanning - Security scanning integration for CI/CD pipelines
  • ork:monitoring-observability - Monitoring and alerting for deployed applications
  • ork:database-patterns - Python/Alembic migration workflow for backend deployments

Key Decisions

Decision              Choice                                Rationale
Container user        Non-root (uid 1001)                   Security best practice, required by many orchestrators
Deployment strategy   Rolling update (default)              Zero downtime, automatic rollback, resource efficient
Secrets management    External Secrets Operator             Syncs from cloud providers, GitOps compatible
Health checks         Separate startup/liveness/readiness   Prevents premature traffic, enables graceful shutdown
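The health-check decision assumes three distinct probes; a sketch of how they divide responsibilities (paths, port, and thresholds are illustrative):

```yaml
startupProbe:             # tolerate slow boots without killing the pod
  httpGet: { path: /health, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5        # up to 150s before liveness takes over
livenessProbe:            # restart the container if it deadlocks
  httpGet: { path: /health, port: 3000 }
  periodSeconds: 10
readinessProbe:           # gate traffic until dependencies are reachable
  httpGet: { path: /ready, port: 3000 }
  periodSeconds: 5
```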

Extended Thinking Triggers

Use Opus 4.6 adaptive thinking for:

  • Architecture decisions - Kubernetes vs serverless, multi-region setup
  • Migration planning - Moving between cloud providers
  • Incident response - Complex deployment failures
  • Security design - Zero-trust architecture

Templates Reference

Template                      Purpose
github-actions-pipeline.yml   Full CI/CD workflow with 6 stages
Dockerfile                    Multi-stage Node.js build
docker-compose.yml            Development environment
k8s-manifests.yaml            Deployment, Service, Ingress
helm-values.yaml              Helm chart values
terraform-aws.tf              VPC, EKS, RDS infrastructure
argocd-application.yaml       GitOps application
external-secrets.yaml         Secrets Manager integration

Capability Details

ci-cd

Keywords: ci, cd, pipeline, github actions, gitlab ci, jenkins, workflow

Solves:

  • How do I set up CI/CD?
  • GitHub Actions workflow patterns
  • Pipeline caching strategies
  • Matrix testing setup

docker

Keywords: docker, dockerfile, container, image, build, compose, multi-stage

Solves:

  • How do I containerize my app?
  • Multi-stage Dockerfile best practices
  • Docker Compose development setup
  • Container security hardening

kubernetes

Keywords: kubernetes, k8s, deployment, service, ingress, helm, statefulset, pdb

Solves:

  • How do I deploy to Kubernetes?
  • K8s health probes and resource limits
  • Helm chart structure
  • StatefulSet for databases

infrastructure-as-code

Keywords: terraform, pulumi, iac, infrastructure, provision, gitops, argocd

Solves:

  • How do I set up infrastructure as code?
  • Terraform AWS patterns (VPC, EKS, RDS)
  • GitOps with ArgoCD
  • Secrets management patterns

deployment-strategies

Keywords: blue green, canary, rolling, deployment strategy, rollback, zero downtime

Solves:

  • Which deployment strategy should I use?
  • Zero-downtime database migrations
  • Blue-green deployment setup
  • Canary release with traffic splitting

observability

Keywords: prometheus, grafana, metrics, alerting, monitoring, health check

Solves:

  • How do I add monitoring to my app?
  • Prometheus metrics exposition
  • Grafana dashboard queries
  • Alerting rules for SLOs

Rules (6)

Protect CI/CD branches from direct pushes to enforce code review and audit trails — HIGH

CI/CD: Branch Protection

Configure branch protection rules to enforce code review, passing CI checks, and linear history on critical branches. This prevents untested or unreviewed code from reaching production.

Incorrect:

# Direct push to main — no review, no CI checks
git checkout main
git commit -m "quick fix"
git push origin main

# Force push overwrites history
git push --force origin main
# CI workflow with no branch restrictions
on:
  push:
    branches: ['*']

Correct:

Branch strategy with protection rules:

main (production) ─────●────────●──────>
                       |        |
dev (staging)  ─────●──●────●──●──────>
                    |        |
feature/*  ─────────●────────┘
                    ^
                    └─ PR required, CI checks, code review

GitHub branch protection settings:

main branch:
  - Require pull request before merging
  - Required approving reviews: 2
  - Require status checks to pass (lint, test, security)
  - Require branches to be up to date before merging
  - Do not allow force pushes
  - Do not allow deletions

dev branch:
  - Require pull request before merging
  - Required approving reviews: 1
  - Require status checks to pass (lint, test)

Key rules:

  • main requires PR + 2 approvals + all status checks passing before merge
  • dev requires PR + 1 approval + all status checks passing
  • Never allow direct commits or force pushes to main or dev
  • Feature branches must be created from dev and merged back via PR
  • Require branches to be up-to-date before merging to prevent integration gaps
  • Enable "Require linear history" to keep the commit graph clean and auditable

Reference: references/ci-cd-pipelines.md (lines 5-21)

Cache CI/CD pipeline dependencies to avoid re-downloading and save minutes per run — HIGH

CI/CD: Pipeline Caching

Cache dependencies in CI pipelines using lockfile-based cache keys. Proper caching reduces dependency installation from 2-3 minutes to 10-20 seconds (~85% time savings).

Incorrect:

# No caching: re-downloads everything on every run
steps:
  - uses: actions/checkout@v3
  - run: npm install
  - run: npm test
# Bad cache key: no lockfile hash, stale deps served indefinitely
- uses: actions/cache@v3
  with:
    path: node_modules
    key: ${{ runner.os }}-modules

Correct:

- name: Cache Dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.npm
      node_modules
      backend/.venv
    key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
    restore-keys: |
      ${{ runner.os }}-deps-

- name: Install Dependencies
  run: npm ci

- name: Run Tests
  run: npm test
# Python example with Poetry
- name: Cache Poetry Dependencies
  uses: actions/cache@v3
  with:
    path: ~/.cache/pypoetry
    key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}

- name: Install Dependencies
  run: poetry install

Key rules:

  • Always include hashFiles() of the lockfile in the cache key so caches invalidate when dependencies change
  • Use restore-keys as a fallback prefix to get a partial cache hit when the exact key misses
  • Cache the package manager cache directory (~/.npm, ~/.cache/pypoetry), not just node_modules
  • Use npm ci (not npm install) after cache restore for reproducible installs
  • Cache multiple dependency directories in a single step when possible (npm + pip + venv)
  • Set artifact retention policies (retention-days: 7) to prevent storage bloat

Reference: references/ci-cd-pipelines.md (lines 23-68)

Run database migrations safely during deployment to prevent downtime and data loss — CRITICAL

DevOps: Database Migrations

All schema changes must be backward-compatible with the currently running application version. Destructive changes require a multi-phase migration to achieve zero-downtime deployments.

Incorrect:

-- Destructive: renames column while old code still references 'name'
ALTER TABLE users RENAME COLUMN name TO full_name;

-- Destructive: adds NOT NULL column, old inserts fail immediately
ALTER TABLE users ADD COLUMN email VARCHAR(255) NOT NULL;

-- Destructive: drops column while old code still reads it
ALTER TABLE users DROP COLUMN legacy_field;

Correct (3-phase zero-downtime migration):

-- Phase 1: Add nullable column (safe with old code running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
# Phase 2: Deploy new code that writes to both + backfill
def create_user(name: str, email: str):
    db.execute(
        "INSERT INTO users (name, email) VALUES (%s, %s)",
        (name, email),
    )

async def backfill_emails():
    users = await db.fetch("SELECT id FROM users WHERE email IS NULL")
    for user in users:
        email = generate_email(user.id)
        await db.execute(
            "UPDATE users SET email = %s WHERE id = %s",
            (email, user.id),
        )
-- Phase 3: Add constraint after backfill is verified complete
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

Backward-compatible changes (safe to deploy directly):

  • Add nullable column
  • Add new table
  • Add index
  • Rename column with a view alias

Backward-incompatible changes (require 3-phase migration):

  • Remove column
  • Rename column without alias
  • Add NOT NULL column
  • Change column type

Deploy order: migrate (phase 1) --> deploy new code (phase 2) --> migrate (phase 3)

Key rules:

  • Always deploy migrations before the application code that depends on them
  • Never add a NOT NULL column in a single step — use the 3-phase pattern (add nullable, backfill, add constraint)
  • Always write a downgrade() function so migrations can be rolled back (alembic downgrade -1)
  • Always review auto-generated migrations before applying (alembic revision --autogenerate)
  • Test rollback procedures regularly — do not assume downgrade() works without verification
  • Column renames require a view alias to maintain backward compatibility during rollout
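The 3-phase pattern can be exercised end-to-end in a few lines. A minimal sketch using the stdlib sqlite3 module in place of Postgres (SQLite cannot add a NOT NULL constraint to an existing column, so phase 3 is shown as a comment; batching mirrors how a production backfill avoids long locks):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Phase 1: add nullable column (safe while old code is still running)
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Phase 2: new code writes both fields; backfill old rows in batches
def backfill_emails(conn, batch_size=100):
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE email IS NULL LIMIT ?", (batch_size,)
        ).fetchall()
        if not rows:
            break
        for (user_id,) in rows:
            conn.execute(
                "UPDATE users SET email = ? WHERE id = ?",
                (f"user{user_id}@example.com", user_id),
            )

backfill_emails(conn)

# Phase 3 (Postgres, after verifying the backfill is complete):
#   ALTER TABLE users ALTER COLUMN email SET NOT NULL;
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]
```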

Reference: references/environment-management.md (lines 115-162)

Secure Docker layers by running as non-root and excluding secrets from image builds — CRITICAL

Docker: Layer Security

Every Docker image layer is immutable and inspectable. Running as root or embedding secrets in layers creates critical security vulnerabilities that persist even if later layers attempt to remove them.

Incorrect:

FROM node:20
WORKDIR /app

# BAD: Copies .env, .git, node_modules, and everything else
COPY . .
RUN npm install

# BAD: Secret baked into image layer (visible via docker history)
ARG DATABASE_URL
ENV DATABASE_URL=$DATABASE_URL

# BAD: Running as root (default)
EXPOSE 3000
CMD ["node", "dist/main.js"]

Correct:

FROM node:20-alpine AS runner
WORKDIR /app

# GOOD: Create and use non-root user (uid 1001)
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001

# GOOD: Copy only what's needed with explicit ownership
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./

# GOOD: Run as non-root
USER nodejs

# GOOD: Secrets injected at runtime, never in image
# Use: docker run -e DATABASE_URL=... or Kubernetes secrets
ENV NODE_ENV=production
EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]

Required .dockerignore:

.git
.env
.env.*
node_modules
*.md
tests/
.vscode/

Key rules:

  • Never run containers as root — always create a non-root user with USER directive
  • Never pass secrets via ARG or ENV in the Dockerfile — they are visible in docker history
  • Always use a .dockerignore to exclude .env, .git, node_modules, and test files
  • Use COPY --chown to set file ownership without a separate chown layer
  • Prefer minimal base images (-alpine) to reduce the CVE surface area
  • Enable read-only root filesystem in Kubernetes (readOnlyRootFilesystem: true)
  • Add health checks so orchestrators can detect and restart unhealthy containers

Reference: references/docker-patterns.md (lines 52-85)

Use Docker multi-stage builds to exclude dev dependencies and reduce image size by 4-5x — HIGH

Docker: Multi-Stage Builds

Separate build-time concerns from runtime to produce minimal, secure production images. A well-structured multi-stage build can reduce image size by 78% or more.

Incorrect:

# Single-stage: build tools and dev deps ship to production
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/main.js"]
# Result: ~850 MB image with dev dependencies, source files, build tools

Correct:

# Stage 1: Install production dependencies only
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

# Stage 2: Build with dev dependencies
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build && npm run test

# Stage 3: Minimal production runtime
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]
# Result: ~180 MB image with only production runtime

Key rules:

  • Use separate stages for dependency installation, building, and runtime
  • Copy only production node_modules and compiled artifacts into the final stage
  • Use -alpine base images to minimize base layer size
  • Run npm ci (not npm install) for reproducible, lockfile-exact installs
  • Clean caches (npm cache clean --force) in the same layer as install to avoid bloating layers
  • Always include a HEALTHCHECK in the production stage for orchestrator integration
  • Run tests in the builder stage so test failures prevent image creation

Reference: references/docker-patterns.md (lines 7-50)

Configure Railway PaaS deployment with correct Nixpacks, environment, and railway.json settings — HIGH

Railway Deployment Patterns

railway.json Configuration

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npm ci && npm run build"
  },
  "deploy": {
    "startCommand": "npm start",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3
  }
}

Nixpacks vs Dockerfile

Factor          Nixpacks (default)         Dockerfile
Setup           Zero config, auto-detect   Manual, full control
Build time      Fast (Nix cache)           Depends on layers
Customization   nixpacks.toml              Unlimited
Use when        Standard apps              Custom runtimes, multi-stage

Environment Variables

  • Use Railway's shared variables for cross-service config (DATABASE_URL, REDIS_URL)
  • Service-specific variables override shared ones
  • Reference other vars: ${{shared.DATABASE_URL}}
  • Never hardcode secrets — use Railway's encrypted env vars

Database Provisioning

Railway provisions managed databases with one click:

  • PostgreSQL, MySQL, Redis, MongoDB
  • Connection string auto-injected as env var
  • Backups included on paid plans

Multi-Service Setup

  • Use monorepo config: set rootDirectory per service
  • Internal networking: services communicate via ${{service.RAILWAY_PRIVATE_DOMAIN}}:port
  • Shared env groups for common config

Railway CLI

railway login              # Authenticate
railway link               # Connect to project
railway up                 # Deploy from local
railway logs               # View deployment logs
railway variables          # List env vars
railway shell              # Open shell in service

Anti-Patterns

Incorrect:

  • Running railway up from CI without railway link — deploys to wrong project
  • Using Dockerfile when Nixpacks handles the stack — unnecessary complexity
  • Storing secrets in railway.json — use env vars
  • Skipping healthcheck config — Railway can't detect failed deploys

Correct:

  • Configure healthcheckPath for all web services
  • Use shared variables for cross-service config
  • Set restart policy for resilience
  • Use Nixpacks unless you need custom runtime

References

  • references/railway-json-config.md — Full railway.json schema and examples
  • references/nixpacks-customization.md — Custom build configs, environment detection
  • references/multi-service-setup.md — Monorepo deploy, service networking

References (9)

CI/CD Pipelines

CI/CD Pipelines

Comprehensive CI/CD patterns for GitHub Actions, caching, matrix testing, and artifact management.

Branch Strategy

Recommended: Git Flow with Feature Branches

main (production) ─────●────────●──────>
                       ┃        ┃
dev (staging) ─────●───●────●───●──────>
                   ┃        ┃
feature/* ─────────●────────┘
                   ^
                   └─ PR required, CI checks, code review

Branch protection rules:

  • main: Require PR + 2 approvals + all checks pass
  • dev: Require PR + 1 approval + all checks pass
  • Feature branches: No direct commits to main/dev

GitHub Actions Caching Strategy

- name: Cache Dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.npm
      node_modules
      backend/.venv
    key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
    restore-keys: |
      ${{ runner.os }}-deps-

Cache hit ratio impact:

  • Without cache: 2-3 min install time
  • With cache: 10-20 sec install time
  • ~85% time savings on typical workflows

Artifact Management

# Build and upload artifact
- name: Build Application
  run: npm run build

- name: Upload Build Artifact
  uses: actions/upload-artifact@v3
  with:
    name: build-${{ github.sha }}
    path: dist/
    retention-days: 7

# Download in deployment job
- name: Download Build Artifact
  uses: actions/download-artifact@v3
  with:
    name: build-${{ github.sha }}
    path: dist/

Benefits:

  • Avoid rebuilding in deployment job
  • Deploy exact tested artifact (byte-for-byte match)
  • Retention policies prevent storage bloat

Matrix Testing

strategy:
  matrix:
    node-version: [18, 20, 22]
    os: [ubuntu-latest, windows-latest]
jobs:
  test:
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm test

Complete Backend CI/CD Example

name: Backend CI/CD

on:
  push:
    branches: [main, dev]
    paths: ['backend/**']
  pull_request:
    branches: [main, dev]
    paths: ['backend/**']

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: backend
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Cache Poetry Dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pypoetry
          key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}

      - name: Install Poetry
        run: pip install poetry

      - name: Install Dependencies
        run: poetry install

      - name: Run Ruff Format Check
        run: poetry run ruff format --check app/

      - name: Run Ruff Lint
        run: poetry run ruff check app/

      - name: Run Type Check
        run: poetry run mypy app/ --ignore-missing-imports

      - name: Run Tests
        run: poetry run pytest tests/ --cov=app --cov-report=xml

      - name: Upload Coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./backend/coverage.xml

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Trivy Scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: 'backend/'
          severity: 'CRITICAL,HIGH'

Key features:

  • Path filtering (only run on backend changes)
  • Poetry dependency caching
  • Comprehensive quality checks (format, lint, type, test)
  • Security scanning with Trivy

Pipeline Stages

  1. Lint & Type Check - Code quality gates
  2. Unit Tests - Test coverage with reporting
  3. Security Scan - npm audit + Trivy vulnerability scanner
  4. Build & Push - Docker image to container registry
  5. Deploy Staging - Environment-gated deployment
  6. Deploy Production - Manual approval or automated

Best Practices

  1. Fast feedback - tests complete in < 5 min
  2. Fail fast - stop on first failure
  3. Cache dependencies - npm/pip cache
  4. Matrix testing - multiple Node/Python versions
  5. Secrets management - use GitHub Secrets
  6. Branch protection - require passing tests

See scripts/github-actions-pipeline.yml for complete examples.

Deployment Strategies

Deployment Strategies

Blue-green, canary, and rolling deployment patterns.

Rolling Deployment (Default)

Update pods gradually:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

  • Pros: No downtime, gradual rollout
  • Cons: Mixed versions running simultaneously

Blue-Green Deployment

Two identical environments, switch traffic:

# Deploy to green (inactive)
kubectl apply -f green-deployment.yaml

# Test green
curl https://green.example.com/health

# Switch traffic (update service selector)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback if needed
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'

  • Pros: Instant rollback, no mixed versions
  • Cons: 2x resources, database migrations tricky

Canary Deployment

Gradually shift traffic:

# 90% to stable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-stable
spec:
  replicas: 9

# 10% to canary
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-canary
spec:
  replicas: 1

  • Pros: Limit blast radius, test with real traffic
  • Cons: Complex traffic management
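Replica counts only give coarse ratios; a service mesh can split traffic precisely. A sketch of an Istio VirtualService routing 90/10 between the two Deployments above (the host and subset names are illustrative, and matching DestinationRule subsets are assumed):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts: ["app.example.com"]
  http:
    - route:
        - destination:
            host: app
            subset: stable
          weight: 90        # stable receives 90% of traffic
        - destination:
            host: app
            subset: canary
          weight: 10        # shift upward as confidence grows
```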

See scripts/argocd-application.yaml for GitOps patterns.

Docker Patterns

Docker Patterns

Best practices for Dockerfile optimization, multi-stage builds, and container security.

Multi-Stage Build Example

# ============================================================
# Stage 1: Dependencies (builder)
# ============================================================
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

# ============================================================
# Stage 2: Build (with dev dependencies)
# ============================================================
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci  # Include dev dependencies
COPY . .
RUN npm run build && npm run test

# ============================================================
# Stage 3: Production runtime (minimal)
# ============================================================
FROM node:20-alpine AS runner
WORKDIR /app

# Security: Non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001

# Copy only production dependencies and built artifacts
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./

USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]

Image size comparison:

  • Single-stage: 850 MB (includes dev dependencies, source files)
  • Multi-stage: 180 MB (only runtime + production deps)
  • 78% reduction

Layer Caching Optimization

Order matters for cache efficiency:

# BAD: Invalidates cache on any code change
COPY . .
RUN npm install

# GOOD: Cache package.json layer separately
COPY package*.json ./
RUN npm ci  # Cached unless package.json changes
COPY . .    # Source changes don't invalidate npm install

Security Hardening

Non-root user (uid 1001):

RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
USER nodejs

Read-only filesystem where possible:

# In K8s deployment
securityContext:
  readOnlyRootFilesystem: true

Health checks for orchestrator integration:

HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1

Security Scanning with Trivy

- name: Build Docker Image
  run: docker build -t myapp:${{ github.sha }} .

- name: Scan for Vulnerabilities
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'myapp:${{ github.sha }}'
    format: 'sarif'
    output: 'trivy-results.sarif'
    severity: 'CRITICAL,HIGH'

- name: Upload Scan Results
  uses: github/codeql-action/upload-sarif@v2
  with:
    sarif_file: 'trivy-results.sarif'

- name: Fail on Critical Vulnerabilities
  run: |
    trivy image --severity CRITICAL --exit-code 1 myapp:${{ github.sha }}

Docker Compose Development Setup

version: '3.8'
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: orchestkit
      POSTGRES_PASSWORD: dev_password
      POSTGRES_DB: orchestkit_dev
    ports:
      - "5437:5432"  # Avoid conflict with host postgres
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U orchestkit"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redisdata:/data

  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile.dev
    ports:
      - "8500:8500"
    environment:
      DATABASE_URL: postgresql://orchestkit:dev_password@postgres:5432/orchestkit_dev
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./backend:/app  # Hot reload

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile.dev
    ports:
      - "5173:5173"
    environment:
      VITE_API_URL: http://localhost:8500
    volumes:
      - ./frontend:/app
      - /app/node_modules  # Avoid overwriting node_modules

volumes:
  pgdata:
  redisdata:

Key patterns:

  • Port mapping to avoid host conflicts (5437:5432)
  • Health checks before dependent services start
  • Volume mounts for hot reload during development
  • Named volumes for data persistence

See scripts/Dockerfile and scripts/docker-compose.yml for complete examples.

Environment Management

Environment Management

Secrets management, configuration, and environment variable patterns.

External Secrets Operator

Sync secrets from cloud providers to Kubernetes:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-secrets
    creationPolicy: Owner
  data:
    - secretKey: database-url
      remoteRef:
        key: prod/app/database
        property: url
    - secretKey: api-key
      remoteRef:
        key: prod/app/api-keys
        property: main

Supported backends:

  • AWS Secrets Manager
  • HashiCorp Vault
  • Azure Key Vault
  • GCP Secret Manager

GitOps with ArgoCD

ArgoCD watches Git repository and syncs cluster state:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/repo
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m

Features:

  • Automated sync with pruning
  • Self-healing (drift detection)
  • Retry policies for transient failures

Infrastructure as Code (Terraform)

Remote state in S3 with DynamoDB locking:

terraform {
  required_version = ">= 1.5"
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
  }
}

Module-based architecture:

module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}

module "rds" {
  source     = "./modules/rds"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.database_subnets
}

Environment-specific tfvars:

terraform plan -var-file=environments/production.tfvars

Database Migration Strategies

Zero-Downtime Migration Pattern

Problem: Adding a NOT NULL column breaks old application versions

Solution: 3-phase migration

Phase 1: Add nullable column

-- Migration v1 (deploy with old code still running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);

Phase 2: Deploy new code + backfill

# New code populates the new column on every write
def create_user(name: str, email: str):
    db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (name, email))

# Backfill existing rows
async def backfill_emails():
    users_without_email = await db.fetch("SELECT id FROM users WHERE email IS NULL")
    for user in users_without_email:
        email = generate_email(user.id)
        await db.execute("UPDATE users SET email = %s WHERE id = %s", (email, user.id))

Phase 3: Add constraint

-- Migration v2 (after backfill complete)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

Backward/Forward Compatibility

Backward compatible changes (safe):

  • Add nullable column
  • Add table
  • Add index
  • Rename column (with view alias)

Backward incompatible changes (requires 3-phase):

  • Remove column
  • Rename column (no alias)
  • Add NOT NULL column
  • Change column type
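
A rename without a view alias follows the same 3-phase shape: during the transition, application code writes both the old and the new column so either app version keeps working. A minimal sketch (hypothetical `fullname` → `full_name` rename; SQLite is used only for illustration):

```python
import sqlite3

# Phase-2 dual-write: both columns exist, and every insert populates both.
# Old code still reads `fullname`; new code reads `full_name`. Phase 3
# drops `fullname` once no old version is deployed.
def create_user(conn: sqlite3.Connection, name: str) -> None:
    conn.execute(
        "INSERT INTO users (fullname, full_name) VALUES (?, ?)",
        (name, name),
    )

def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (fullname TEXT, full_name TEXT)")
    create_user(conn, "Ada Lovelace")
    return conn.execute("SELECT fullname, full_name FROM users").fetchall()
```

A backfill (`UPDATE users SET full_name = fullname WHERE full_name IS NULL`) covers rows written before the dual-write deploy.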

Alembic Migration Pattern

# backend/alembic/versions/2024_12_15_add_langfuse_trace_id.py
"""Add Langfuse trace_id to analyses"""
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Add nullable column first (backward compatible)
    op.add_column('analyses',
        sa.Column('langfuse_trace_id', sa.String(255), nullable=True)
    )
    # Index for lookup performance
    op.create_index('idx_analyses_langfuse_trace',
        'analyses', ['langfuse_trace_id']
    )

def downgrade():
    op.drop_index('idx_analyses_langfuse_trace')
    op.drop_column('analyses', 'langfuse_trace_id')

Migration workflow:

# Create new migration
poetry run alembic revision --autogenerate -m "Add langfuse trace ID"

# Review generated migration (ALWAYS review!)
cat alembic/versions/abc123_add_langfuse_trace_id.py

# Apply migration
poetry run alembic upgrade head

# Rollback if needed
poetry run alembic downgrade -1

Rollback Procedures

# Helm rollback to previous revision
helm rollback myapp 3

# Kubernetes rollback
kubectl rollout undo deployment/myapp

# Database migration rollback (Alembic)
alembic downgrade -1

Critical: Test rollback procedures regularly!

See scripts/external-secrets.yaml and scripts/argocd-application.yaml for complete examples.

Kubernetes Basics

K8s deployments, services, health probes, and production patterns.

Health Probes

Three probe types with distinct purposes:

spec:
  containers:
  - name: app
    # Startup probe (gives slow-starting apps time to boot)
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 30  # 30 * 5s = 150s max startup time

    # Liveness probe (restarts pod if failing)
    livenessProbe:
      httpGet:
        path: /health/liveness
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 3  # 3 failures = restart

    # Readiness probe (removes from service if failing)
    readinessProbe:
      httpGet:
        path: /health/readiness
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 2  # 2 failures = remove from load balancer

Probe implementation (FastAPI):

@app.get("/health/startup")
async def startup_check():
    # Check DB connection established
    if not db.is_connected():
        raise HTTPException(status_code=503, detail="DB not ready")
    return {"status": "ok"}

@app.get("/health/liveness")
async def liveness_check():
    # Basic "is process running" check
    return {"status": "alive"}

@app.get("/health/readiness")
async def readiness_check():
    # Check all dependencies healthy
    if not redis.ping() or not db.health_check():
        raise HTTPException(status_code=503, detail="Dependencies unhealthy")
    return {"status": "ready"}

Security Context

spec:
  securityContext:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop:
        - ALL

Resource Management

Always set requests and limits:

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"

  • Requests determine scheduling decisions
  • Limits cap usage (CPU is throttled; exceeding the memory limit gets the container OOM-killed)

PodDisruptionBudget

Prevents too many pods from being evicted during node maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp

Use cases:

  • Cluster upgrades (node drains)
  • Autoscaler downscaling
  • Manual evictions

Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"      # Total CPU requests
    requests.memory: 20Gi   # Total memory requests
    limits.cpu: "20"        # Total CPU limits
    limits.memory: 40Gi     # Total memory limits
    pods: "50"              # Max pods

StatefulSets for Databases

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    # Pod spec here
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

Key differences from Deployment:

  • Stable pod names (postgres-0, postgres-1, postgres-2)
  • Ordered deployment and scaling
  • Persistent storage per pod

Helm Chart Structure

charts/app/
├── Chart.yaml
├── values.yaml
├── scripts/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   └── _helpers.tpl
└── values/
    ├── staging.yaml
    └── production.yaml

Essential Manifests

  • Deployment with rolling update strategy
  • Service for internal routing
  • Ingress for external access with TLS
  • HorizontalPodAutoscaler for scaling

See scripts/k8s-manifests.yaml and scripts/helm-values.yaml for complete examples.

Multi-Service Setup on Railway

Deploy multiple services in one Railway project for monorepos, microservices, or web + worker architectures.

Monorepo Configuration

Each service in a Railway project can point to a different root directory:

my-monorepo/
├── apps/
│   ├── api/           ← Service 1 (root: apps/api)
│   │   ├── package.json
│   │   └── railway.json
│   ├── web/           ← Service 2 (root: apps/web)
│   │   ├── package.json
│   │   └── railway.json
│   └── worker/        ← Service 3 (root: apps/worker)
│       ├── package.json
│       └── railway.json
├── packages/          ← Shared packages
└── package.json       ← Root workspace

Set each service's root directory in the Railway dashboard under Settings > Source.

Private Networking

Services within the same project communicate over Railway's private network:

# From the web service, call the API service:
http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}/endpoint

# In environment variables (set on web service):
API_URL=http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}

Key points:

  • Private networking uses internal DNS, no public internet
  • Zero egress costs between services
  • Always use the PORT variable — not hardcoded ports
  • Services must listen on 0.0.0.0 (not localhost)
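
The last two points can be sketched in a few lines. A minimal example (names are illustrative) of resolving the bind address for any Python web server on Railway:

```python
import os

def bind_config(default_port: int = 8000) -> tuple:
    """Resolve (host, port) for a Railway service.

    Railway injects PORT at runtime; the service must listen on 0.0.0.0
    so the private network (and the public proxy) can reach it.
    """
    port = int(os.environ.get("PORT", default_port))
    return "0.0.0.0", port
```

Usage with uvicorn, for example: `host, port = bind_config()` then `uvicorn.run(app, host=host, port=port)`.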

Common Architectures

Web + API + Worker

Service    Role                               Public?
web        Frontend (Next.js, Vite)           Yes
api        Backend API                        Yes (or private if only web calls it)
worker     Background jobs (BullMQ, Celery)   No
postgres   Database                           No (private only)
redis      Cache / queue broker               No (private only)

Shared Environment Variables

Use Railway's shared variables (project-level) for values needed by all services:

  • NODE_ENV=production
  • LOG_LEVEL=info

Use reference variables for cross-service connections:

  • DATABASE_URL=${{Postgres.DATABASE_URL}}
  • REDIS_URL=${{Redis.REDIS_URL}}
  • API_URL=http://${{api.RAILWAY_PRIVATE_DOMAIN}}:${{api.PORT}}

Deploy Order

Railway deploys services in parallel by default. If you need ordering (e.g., run migrations before starting web):

  1. Put migrations in the API service's startCommand
  2. Use healthchecks — dependent services will retry connections until the API is healthy
  3. For strict ordering, use separate deploy triggers via Railway CLI
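
Point 2 amounts to a polling loop on the dependent service's side. A hedged sketch (function and parameter names are illustrative, not a Railway API):

```python
import time
import urllib.error
import urllib.request

def wait_for_healthy(url: str, timeout_s: float = 60.0,
                     interval_s: float = 2.0, probe=None) -> bool:
    """Poll a healthcheck URL until it returns HTTP 200 or the deadline passes.

    `probe` is injectable for testing; by default it issues a real GET.
    """
    def default_probe(u: str) -> bool:
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    check = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check(url):
            return True
        time.sleep(interval_s)
    return False
```

Call this at worker/web startup before accepting traffic, e.g. `wait_for_healthy(f"{API_URL}/health")`.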

Nixpacks Customization

Railway uses Nixpacks to auto-detect your stack and generate a build plan. Customize when auto-detection falls short.

Auto-Detection

Nixpacks detects your language by looking for:

Language   Detection File
Node.js    package.json
Python     requirements.txt, pyproject.toml, Pipfile
Go         go.mod
Rust       Cargo.toml
Ruby       Gemfile
Java       pom.xml, build.gradle
PHP        composer.json

nixpacks.toml

Place at project root (or set nixpacksConfigPath in railway.json for monorepos).

Adding System Packages

[phases.setup]
nixPkgs = ["...", "ffmpeg", "imagemagick", "poppler_utils"]
aptPkgs = ["libvips-dev"]

Custom Build Phases

[phases.install]
cmds = ["npm ci --production=false"]

[phases.build]
cmds = [
  "npx prisma generate",
  "npm run build"
]
dependsOn = ["install"]

[start]
cmd = "npm run start:prod"

Environment Variables in Build

[variables]
NODE_ENV = "production"
NEXT_TELEMETRY_DISABLED = "1"

Monorepo Root Path

For monorepos, set the root directory per service in the Railway dashboard or via railway.json:

{
  "build": {
    "builder": "NIXPACKS",
    "nixpacksConfigPath": "apps/api/nixpacks.toml"
  }
}

Each service points to its own directory and nixpacks.toml.

When to Switch to Dockerfile

Use Dockerfile instead of Nixpacks when:

  • Multi-stage builds are needed to reduce image size
  • Build requires conditional logic (e.g., ARG-based feature flags)
  • Precise control over base image (e.g., distroless, Alpine variants)
  • Nixpacks doesn't support a required system dependency

Set in railway.json:

{
  "build": {
    "builder": "DOCKERFILE",
    "dockerfilePath": "Dockerfile.production"
  }
}

Observability & Monitoring

Prometheus metrics, Grafana dashboards, and alerting patterns.

Prometheus Metrics Exposition

import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest

app = FastAPI()

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    return response

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type="text/plain")

Grafana Dashboard Queries

# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (4xx/5xx as percentage)
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)

Alerting Rules

groups:
- name: app-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }}"

  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High p95 latency detected"
      description: "p95 latency is {{ $value }}s"

  - alert: PodCrashLooping
    expr: |
      increase(kube_pod_container_status_restarts_total[1h]) > 5
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Pod is crash looping"
      description: "{{ $labels.pod }} has restarted {{ $value }} times"

Key Metrics to Monitor

Metric         Purpose                Alert Threshold
Request rate   Traffic volume         Anomaly detection
Error rate     Service health         > 5% (critical)
p95 latency    User experience        > 2s (warning)
CPU usage      Resource utilization   > 80% sustained
Memory usage   Resource utilization   > 85% sustained
Pod restarts   Stability              > 3 in 1 hour

Golden Signals (SRE)

  1. Latency - Time to serve a request
  2. Traffic - Requests per second
  3. Errors - Rate of failed requests
  4. Saturation - Resource utilization
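
As a rough illustration (the `RequestRecord` type and function below are hypothetical, not part of any library), all four signals can be computed from a window of request records:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    duration_s: float
    status: int

def golden_signals(records: list, window_s: float, cpu_util: float) -> dict:
    """Summarize the four golden signals over one time window."""
    n = len(records)
    sorted_d = sorted(r.duration_s for r in records)
    # Nearest-rank p95; a real monitoring stack interpolates histogram buckets.
    p95 = sorted_d[int(0.95 * (n - 1))] if n else 0.0
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "latency_p95_s": p95,          # Latency
        "traffic_rps": n / window_s,   # Traffic
        "error_rate": errors / n if n else 0.0,  # Errors
        "saturation": cpu_util,        # Saturation (here: CPU utilization)
    }
```

In practice these come from the Prometheus queries above rather than in-process aggregation; the sketch only makes the definitions concrete.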

Log Aggregation

Structured logging for observability:

import uuid

import structlog
from fastapi import Request

logger = structlog.get_logger()

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())

    with structlog.contextvars.bound_contextvars(
        request_id=request_id,
        method=request.method,
        path=request.url.path,
    ):
        logger.info("request_started")
        response = await call_next(request)
        logger.info("request_completed", status=response.status_code)

    response.headers["X-Request-ID"] = request_id
    return response

Distributed Tracing

OpenTelemetry integration:

from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)

# Manual spans for business logic
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order_id", order_id)
        # Processing logic here

railway.json Configuration

Complete reference for railway.json schema — the primary way to configure build and deploy behavior on Railway.

Full Schema

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npm ci && npm run build",
    "watchPatterns": ["src/**", "package.json"],
    "nixpacksConfigPath": "nixpacks.toml",
    "dockerfilePath": "Dockerfile"
  },
  "deploy": {
    "startCommand": "node dist/server.js",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3,
    "numReplicas": 1,
    "sleepApplication": false,
    "region": "us-west1",
    "cronSchedule": "0 */6 * * *"
  }
}

Builder Options

Builder      When to Use
NIXPACKS     Default — auto-detects language and builds (Node, Python, Go, Rust, etc.)
DOCKERFILE   Complex builds, multi-stage images, custom system deps
PAKETO       Cloud Native Buildpacks alternative

Deploy Settings

Field                     Default         Description
startCommand              Auto-detected   Overrides default start command
healthcheckPath           None            HTTP path to check for 200 response
healthcheckTimeout        30              Seconds before healthcheck is considered failed
restartPolicyType         ON_FAILURE      ON_FAILURE, ALWAYS, or NEVER
restartPolicyMaxRetries   3               Max restart attempts before marking deploy failed
numReplicas               1               Number of instances (horizontal scaling)
sleepApplication          false           Sleep service when no traffic (saves credits)
cronSchedule              None            Cron expression for scheduled services

Examples

Node.js API with migrations

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npm ci && npx prisma generate && npm run build"
  },
  "deploy": {
    "startCommand": "npx prisma migrate deploy && node dist/server.js",
    "healthcheckPath": "/api/health",
    "healthcheckTimeout": 60
  }
}

Python FastAPI

{
  "$schema": "https://railway.com/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "pip install -r requirements.txt"
  },
  "deploy": {
    "startCommand": "uvicorn main:app --host 0.0.0.0 --port $PORT",
    "healthcheckPath": "/health"
  }
}

Cron Worker (no web traffic)

{
  "$schema": "https://railway.com/railway.schema.json",
  "deploy": {
    "startCommand": "node dist/worker.js",
    "cronSchedule": "*/15 * * * *",
    "restartPolicyType": "NEVER"
  }
}

Production Readiness Checklist

🔒 Security

  • Secrets in environment variables or vault (not in code/config)
  • HTTPS enforced (redirect HTTP → HTTPS)
  • Security headers configured (HSTS, CSP, X-Frame-Options)
  • CORS restricted to known origins
  • Rate limiting enabled
  • SQL injection prevention (parameterized queries)
  • XSS prevention (output encoding, CSP)
  • Dependencies scanned for vulnerabilities
  • Container images scanned (Trivy/Snyk)

🧪 Testing

  • Unit tests passing (>80% coverage)
  • Integration tests passing
  • E2E tests for critical paths
  • Load testing completed (k6/Locust)
  • Security testing (OWASP ZAP)
  • Smoke tests for deployment verification

📊 Observability

  • Structured logging (JSON format)
  • Log aggregation configured (ELK/Loki)
  • Metrics exported (Prometheus format)
  • Dashboards created (Grafana)
  • Distributed tracing enabled (Jaeger/Tempo)
  • Error tracking configured (Sentry)
  • Uptime monitoring (synthetic checks)
  • Alerting rules defined

🚀 Deployment

  • CI/CD pipeline configured
  • Automated tests run on every PR
  • Blue-green or canary deployment ready
  • Rollback procedure documented and tested
  • Database migrations tested
  • Feature flags for risky changes
  • Deployment notifications (Slack/Teams)

💾 Data

  • Database backups automated (daily)
  • Backup restoration tested
  • Point-in-time recovery enabled
  • Data retention policies defined
  • PII handling documented (GDPR/CCPA)
  • Encryption at rest enabled
  • Encryption in transit (TLS)

🏗️ Infrastructure

  • Infrastructure as Code (Terraform/Pulumi)
  • Auto-scaling configured
  • Health checks defined
  • Resource limits set (CPU/memory)
  • Multi-AZ deployment
  • CDN for static assets
  • DDoS protection enabled

📝 Documentation

  • API documentation (OpenAPI)
  • Architecture diagram
  • Runbook for common issues
  • Incident response playbook
  • On-call rotation defined
  • SLA/SLO documented

🔄 Reliability

  • Graceful shutdown handling
  • Connection pooling configured
  • Circuit breakers for external calls
  • Retry with exponential backoff
  • Timeouts set on all external calls
  • Chaos engineering tests (optional)
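
Retry with exponential backoff is simple enough to sketch inline. A minimal, illustrative implementation (names are not from any specific library):

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5,
                       max_delay: float = 30.0, sleep=time.sleep):
    """Call fn(), retrying on exception with capped exponential backoff.

    Jitter spreads retries out so failing clients don't stampede the
    dependency in lockstep. `sleep` is injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff
```

Pair this with timeouts on the underlying call; retrying an unbounded call just multiplies the hang.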

Pre-Launch Final Checks

# Security scan
npm audit --audit-level=high
pip-audit

# Performance baseline
k6 run load-test.js

# DNS and SSL
curl -I https://api.example.com
openssl s_client -connect api.example.com:443

# Health endpoint
curl https://api.example.com/health

# Logs flowing
docker logs -f backend

# Metrics exposed
curl http://localhost:9090/metrics

GitHub Actions CI/CD Example

Complete pipeline for a Python FastAPI + React application.

Repository Structure

├── .github/workflows/
│   ├── ci.yml          # Test on every PR
│   ├── deploy.yml      # Deploy on main merge
│   └── security.yml    # Weekly security scan
├── backend/            # FastAPI
├── frontend/           # React
└── docker-compose.yml

CI Pipeline (ci.yml)

name: CI

on:
  pull_request:
    branches: [main, dev]
  push:
    branches: [main, dev]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  backend-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true

      - name: Cache dependencies
        uses: actions/cache@v4
        with:
          path: backend/.venv
          key: venv-${{ runner.os }}-${{ hashFiles('backend/poetry.lock') }}

      - name: Install dependencies
        working-directory: backend
        run: poetry install --no-interaction

      - name: Lint
        working-directory: backend
        run: |
          poetry run ruff format --check app/
          poetry run ruff check app/

      - name: Type check
        working-directory: backend
        run: poetry run mypy app/ --ignore-missing-imports

      - name: Test
        working-directory: backend
        env:
          DATABASE_URL: postgresql://postgres:test@localhost:5432/test
        run: poetry run pytest --cov=app --cov-report=xml -v

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: backend/coverage.xml

  frontend-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
          cache-dependency-path: frontend/package-lock.json

      - name: Install
        working-directory: frontend
        run: npm ci

      - name: Lint
        working-directory: frontend
        run: npm run lint

      - name: Type check
        working-directory: frontend
        run: npm run typecheck

      - name: Test
        working-directory: frontend
        run: npm run test:coverage

      - name: Build
        working-directory: frontend
        run: npm run build

Deploy Pipeline (deploy.yml)

name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production  # Requires approval

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build & push backend
        run: |
          docker build -t $ECR_REGISTRY/backend:${{ github.sha }} ./backend
          docker push $ECR_REGISTRY/backend:${{ github.sha }}

      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service backend \
            --force-new-deployment

      - name: Deploy frontend to S3
        run: |
          cd frontend && npm ci && npm run build
          aws s3 sync dist/ s3://$S3_BUCKET --delete
          aws cloudfront create-invalidation --distribution-id $CF_DIST --paths "/*"

Security Scan (security.yml)

name: Security

on:
  schedule:
    - cron: "0 0 * * 0"  # Weekly Sunday midnight
  workflow_dispatch:

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Python dependencies
        run: |
          pip install pip-audit
          pip-audit -r backend/requirements.txt

      - name: Node dependencies
        working-directory: frontend
        run: npm audit --audit-level=high

      - name: Docker scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: backend:latest
          severity: HIGH,CRITICAL

Key Patterns

Pattern        Implementation
Caching        Poetry venv, npm cache
Concurrency    Cancel in-progress on new push
Services       Postgres container for tests
Environments   Production requires approval
Artifacts      Coverage reports to Codecov