How do you deploy microservices in production?

1) Foundations (before any deploy)

  • Containers: one image per service, multi‑stage builds, SBOM + signature (Cosign).
  • IaC: VPC, clusters, gateways, DBs via Terraform/Pulumi; everything versioned.
  • Environments: dev → stage → prod with parity (same infra classes, smaller scale).
  • Secrets/config: externalized; secrets in Vault/Secrets Manager; config in env/ConfigMap/AppConfig.
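
As a sketch of externalized config in Kubernetes terms, the fragment below defines the `orders-config` ConfigMap and `orders-secrets` Secret names that the manifest later in this post loads via `envFrom`. It assumes the External Secrets Operator is installed and that a `ClusterSecretStore` named `aws-secrets-manager` points at AWS Secrets Manager. The store name, the key path `prod/orders/db` and all variable names are hypothetical.

```yaml
# Non-secret settings: a plain ConfigMap, versioned in Git with the rest of the IaC.
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-config
data:
  LOG_FORMAT: json                        # hypothetical keys, for illustration only
  PAYMENTS_BASE_URL: http://payments:8080
---
# Secrets: only a reference lives in Git; the operator copies the value
# from the external store into a Kubernetes Secret named orders-secrets.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-secrets
spec:
  refreshInterval: 1h           # re-sync periodically so rotated values propagate
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager   # hypothetical store name
  target:
    name: orders-secrets        # Secret created/updated by the operator
  data:
  - secretKey: DB_PASSWORD
    remoteRef:
      key: prod/orders/db       # hypothetical Secrets Manager entry
      property: password        # JSON field inside that entry
```

Values injected with `envFrom` are read at container start, so a rotated secret reaches running pods only after a restart or a new rollout.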

2) CI/CD flow

  • CI: build → unit/component → contract tests → integration (Testcontainers) → image scan → push to registry.
  • CD: GitOps (Argo CD/Flux) or pipelines (CodeDeploy/Spinnaker/Argo Workflows).
  • Promotion: tag/PR from stage to prod; require automated checks to pass.
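
For the GitOps option, a minimal Argo CD `Application` could look like the sketch below: Argo CD watches a path in a config repo and syncs it into the cluster, so promotion to prod becomes a reviewed change to that path. The repo URL, path layout and namespaces are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-prod
  namespace: argocd                   # namespace where Argo CD runs
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy-config.git   # hypothetical repo
    targetRevision: main
    path: apps/orders/overlays/prod   # hypothetical layout; a promotion PR edits files here
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: orders
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual drift back to the Git state
```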

3) Runtime platform (pick one)

  • Kubernetes (EKS/GKE/AKS): Deployments, Services, Ingress/Gateway API; optional service mesh (Istio/Linkerd) for mTLS, retries, canaries.
  • ECS Fargate: Tasks/Services + ALB; Cloud Map for discovery.
  • Serverless mix: API Gateway + Lambda for event‑driven bits.
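
On Kubernetes, the Deployment is typically paired with a Service for in-cluster discovery and, with the Gateway API, an `HTTPRoute` for edge traffic. A minimal sketch, assuming a Gateway named `public-gateway` already exists in the same namespace; the Gateway name, hostname and path are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders            # matches the pod labels of the orders Deployment
  ports:
  - port: 80               # port that in-cluster clients call
    targetPort: 8080       # containerPort on the pods
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders
spec:
  parentRefs:
  - name: public-gateway   # hypothetical Gateway in the same namespace as this route
  hostnames:
  - api.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /orders
    backendRefs:
    - name: orders         # the Service above
      port: 80
```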

4) Release strategies (safe rollouts)

  • Rolling updates: the Kubernetes default (maxSurge 25%, maxUnavailable 25%); tune both per service.
  • Blue‑green: stand up “green”, validate, flip traffic; instant rollback.
  • Canary: 1% → 5% → 25% → 100% using mesh/ALB weighted routing; auto‑promote while SLOs hold, abort on breach.
  • Feature flags: decouple code deploy from feature release; add kill‑switches.
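
One way to express the 1% → 5% → 25% → 100% canary with automated promotion is an Argo Rollouts `Rollout` (a tool swapped in for illustration; the post does not name one). This sketch assumes Argo Rollouts and Istio are installed, plus several hypothetical objects: Services `orders-stable` and `orders-canary`, an Istio VirtualService `orders` whose HTTP route `primary` points at both Services, and a Prometheus at the address shown. The metric name and labels in the query depend on how the service exports metrics.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders
spec:
  replicas: 6
  selector:
    matchLabels: { app: orders }
  workloadRef:                       # reuse the pod template of the existing Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  strategy:
    canary:
      stableService: orders-stable   # hypothetical Services; the controller adds a
      canaryService: orders-canary   # pod-template-hash selector to each
      trafficRouting:
        istio:
          virtualService:
            name: orders             # hypothetical VirtualService
            routes: [primary]        # route whose stable/canary weights get rewritten
      analysis:                      # background SLO check; a failure aborts back to stable
        templates:
        - templateName: orders-error-rate
        startingStep: 1
      steps:
      - setWeight: 1
      - pause: { duration: 10m }
      - setWeight: 5
      - pause: { duration: 10m }
      - setWeight: 25
      - pause: { duration: 30m }
      # after the last step completes, the canary is promoted to 100%
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: orders-error-rate
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 2                      # up to two failed measurements tolerated
    successCondition: result[0] < 0.01   # hypothetical SLO: under 1% 5xx
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # hypothetical address
        query: |
          sum(rate(http_server_requests_seconds_count{service="orders-canary",status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{service="orders-canary"}[5m]))
```

Weighted routing is what makes a 1% step possible: without a mesh or ALB, canary weights are approximated by replica counts, and with six replicas the smallest step is a whole pod. With `workloadRef`, the Rollout creates its own pods from the Deployment's template, so the referenced Deployment is normally scaled to zero.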

5) Data changes

  • Expand–contract migrations (add columns/paths first, backfill, switch, then drop).
  • Run migrations as jobs pre‑traffic; never bundle destructive DDL with app start.
  • Have rollback and replay/backfill plans.
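
Running migrations as a separate job before traffic moves can be sketched as a Kubernetes `Job` that Argo CD runs as a `PreSync` hook, so the sync (and with it the new Deployment) proceeds only if the migration succeeds. Flyway is swapped in as the migration tool. The image `registry/orders-migrations:v1.8.3` (assumed to be built from `flyway/flyway` with the SQL scripts included) and the `FLYWAY_URL`/`FLYWAY_USER`/`FLYWAY_PASSWORD` keys in `orders-secrets` are assumptions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orders-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync                           # run before the app manifests sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # replace the previous Job on each sync
spec:
  backoffLimit: 0              # a failed migration needs a human, not blind retries
  activeDeadlineSeconds: 900   # fail the hook instead of hanging the release
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: registry/orders-migrations:v1.8.3   # hypothetical image built FROM flyway/flyway
        args: ["migrate"]                          # appended to the flyway entrypoint
        envFrom:
        - secretRef: { name: orders-secrets }      # assumed to hold FLYWAY_URL/USER/PASSWORD
```

Under expand–contract, a job like this ships only the additive "expand" scripts with a release; the destructive "contract" step goes out in a later release, once no running version still reads the old columns.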

6) Reliability controls

  • Health probes: liveness/readiness/startup.
  • Resilience: timeouts, bounded retries with jitter, circuit breakers, bulkheads.
  • Autoscaling: HPA on CPU/memory plus custom metrics (RPS, queue depth); PodDisruptionBudgets to limit evictions during node drains.
  • Rate limiting & quota at gateway/mesh; WAF on public edges.
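
A sketch of these controls for the `orders` service, assuming Istio as the mesh and a custom-metrics adapter (for example Prometheus Adapter) that exposes a per-pod `http_requests_per_second` metric; that metric name and every threshold below are hypothetical. The VirtualService bounds timeouts and retries (Envoy, Istio's proxy, applies jittered exponential backoff between retries by default), the DestinationRule's connection-pool limits act as a bulkhead and its outlier detection as a circuit breaker, and the HPA plus PodDisruptionBudget handle scaling and voluntary disruptions.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts: [orders]
  http:
  - name: primary
    timeout: 2s                      # overall budget per request, retries included
    retries:
      attempts: 2                    # bounded retries
      perTryTimeout: 500ms
      retryOn: 5xx,reset,connect-failure
    route:
    - destination: { host: orders }
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    connectionPool:                  # bulkhead: cap concurrent work sent to orders
      tcp: { maxConnections: 100 }
      http: { http1MaxPendingRequests: 50, http2MaxRequests: 200 }
    outlierDetection:                # circuit breaker: temporarily eject failing pods
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: orders }
  minReplicas: 6
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: { type: Utilization, averageUtilization: 70 }
  - type: Pods
    pods:
      metric: { name: http_requests_per_second }   # hypothetical custom metric
      target: { type: AverageValue, averageValue: "100" }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders
spec:
  maxUnavailable: 1                  # node drains evict at most one orders pod at a time
  selector:
    matchLabels: { app: orders }
```

The HPA scales to whichever metric asks for the most replicas. Once an HPA owns the replica count, the `replicas` field is usually dropped from the Deployment in Git so that GitOps syncs do not fight the autoscaler. If the service is also canaried through Istio, the timeout and retry settings belong on that same VirtualService route rather than on a second VirtualService for the same host.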

7) Observability (mandatory)

  • Logs: structured JSON, correlation/trace IDs; centralized (ELK/OpenSearch/CloudWatch).
  • Metrics: RED/Golden signals; dashboards + alerts with SLOs & error budgets.
  • Tracing: OpenTelemetry SDK + Collector → Jaeger/Tempo/X‑Ray.
  • Deploy markers: annotate releases; link to commits and configs.
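
On the tracing side, a minimal OpenTelemetry Collector configuration could receive OTLP from services and forward traces to Tempo (or Jaeger, which also accepts OTLP). The backend address `tempo.observability:4317` is hypothetical, and `insecure: true` is only reasonable for in-cluster traffic already protected by mTLS or network policy. Services would point their SDKs at the Collector through the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable.

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  memory_limiter:                  # refuse data rather than run out of memory under bursts
    check_interval: 1s
    limit_mib: 400
  batch: {}                        # batch spans before export to cut request overhead
exporters:
  otlp:
    endpoint: tempo.observability:4317   # hypothetical Tempo/Jaeger OTLP gRPC endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter placed first in the chain
      exporters: [otlp]
```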

8) Security & compliance

  • mTLS service‑to‑service (mesh or sidecars), TLS at edge (ACM certs).
  • Least‑privilege IAM/RBAC; rotate secrets; image & dep scans; SBOMs.
  • Network policies (K8s) / Security Groups; private subnets + egress control.
  • Audit trails on config/secret changes.
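
With Istio on Kubernetes, strict mTLS plus a network policy for one namespace could be sketched as follows. It assumes `orders` runs in its own namespace with sidecar injection enabled, and that only pods in a namespace named `gateway` (hypothetical) should reach it on port 8080. NetworkPolicy is enforced only if the cluster's CNI plugin supports it.

```yaml
# Require mTLS for all workloads in the orders namespace; plaintext is rejected.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: orders
spec:
  mtls:
    mode: STRICT
---
# Once a policy selects these pods for Ingress, any ingress not listed below is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-allow-gateway
  namespace: orders
spec:
  podSelector:
    matchLabels: { app: orders }
  policyTypes: [Ingress]
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: gateway   # label Kubernetes sets on every namespace
    ports:
    - protocol: TCP
      port: 8080
```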

9) Testing in the pipeline

  • Unit/component + consumer/provider contract tests.
  • Integration with Testcontainers (DB, Kafka).
  • Staging E2E smoke tests, a performance smoke test (p95 latency), and an OWASP ZAP baseline scan.
  • Chaos tests (latency, drops) in staging; periodic prod game‑days.
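
For the staging latency experiment, one option is Chaos Mesh (swapped in here; the post names no chaos tool). This sketch injects 200 ms ± 50 ms of latency into every `orders` pod in a hypothetical `staging` namespace for five minutes; the values are illustrative.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: orders-latency
  namespace: staging
spec:
  action: delay            # "loss" (with a loss block) simulates dropped packets instead
  mode: all                # affect every selected pod
  selector:
    namespaces: [staging]
    labelSelectors:
      app: orders
  delay:
    latency: 200ms
    jitter: 50ms
  duration: 5m             # the experiment stops automatically after this
```

While it runs, the question is whether callers' timeouts, retries and circuit breakers keep their error rate and p95 latency within SLO.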

10) Deployment checklist (per release)

  • Build passed; image signed ✅
  • DB migration applied; backfill done ✅
  • Config/flags reviewed; blast radius understood ✅
  • Rollout plan + rollback plan documented ✅
  • Synthetic checks green after traffic shift ✅
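
The "image signed" item can be enforced at admission instead of checked by hand. A sketch using Kyverno (swapped in; the post names only Cosign) that rejects pods whose images under a hypothetical `registry.example.com/` prefix lack a valid Cosign signature. The public key is a placeholder, the pattern must match fully qualified image names, and field names should be checked against the installed Kyverno version.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce   # block admission, don't just audit
  webhookTimeoutSeconds: 30          # signature lookups call out to the registry
  rules:
  - name: verify-cosign-signature
    match:
      any:
      - resources:
          kinds: [Pod]
    verifyImages:
    - imageReferences:
      - "registry.example.com/*"     # hypothetical registry prefix
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <cosign public key goes here>
              -----END PUBLIC KEY-----
```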

11) Operate & roll back

  • Progressive delivery with automated rollback on SLO breach.
  • Runbooks: incidents, feature freeze, rollback, data fix, replay.
  • Post‑deploy verification: error rate, p95 latency, saturation, business KPIs.
  • Postmortems: blameless, action items tracked.
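
Part of post-deploy verification can be automated with alert rules. A sketch assuming the Prometheus Operator's `PrometheusRule` resource and Micrometer-style metrics (`http_server_requests_seconds_*`, typical for Spring Boot) carrying a `service` label, with histogram buckets enabled. The metric names, labels and the 1% / 500 ms thresholds are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orders-post-deploy
spec:
  groups:
  - name: orders.slo
    rules:
    - alert: OrdersErrorRateHigh
      expr: |
        sum(rate(http_server_requests_seconds_count{service="orders",status=~"5.."}[5m]))
          / sum(rate(http_server_requests_seconds_count{service="orders"}[5m])) > 0.01
      for: 10m                       # must persist, not just spike
      labels: { severity: page }
      annotations:
        summary: "orders 5xx ratio above 1% for 10 minutes"
    - alert: OrdersLatencyP95High
      expr: |
        histogram_quantile(0.95,
          sum by (le) (rate(http_server_requests_seconds_bucket{service="orders"}[5m]))) > 0.5
      for: 10m
      labels: { severity: page }
      annotations:
        summary: "orders p95 latency above 500 ms for 10 minutes"
```

The operator loads only `PrometheusRule` objects matching its `ruleSelector`, so real deployments usually add whatever label that selector expects.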

Minimal prod manifest (Kubernetes example)

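apiVersion: apps/v1
kind: Deployment
metadata: { name: orders, labels: { app: orders, version: v1.8.3 } }
spec:
  replicas: 6
  # During a rollout: at most 2 pods above and 1 pod below the desired count.
  strategy: { type: RollingUpdate, rollingUpdate: { maxSurge: 2, maxUnavailable: 1 } }
  selector: { matchLabels: { app: orders } }   # immutable after creation; keep version out of it
  template:
    metadata: { labels: { app: orders, version: v1.8.3 } }   # version label lets mesh/canary rules target pods
    spec:
      containers:
      - name: orders
        image: registry/orders:v1.8.3
        ports: [{ containerPort: 8080 }]
        envFrom:   # config and secrets injected at runtime, never baked into the image
        - configMapRef: { name: orders-config }
        - secretRef: { name: orders-secrets }
        # Startup probe holds off liveness/readiness checks until boot finishes (up to 30 x 5s = 150s).
        startupProbe:   { httpGet: { path: /actuator/health/liveness,  port: 8080 }, periodSeconds: 5, failureThreshold: 30 }
        readinessProbe: { httpGet: { path: /actuator/health/readiness, port: 8080 }, periodSeconds: 5 }
        livenessProbe:  { httpGet: { path: /actuator/health/liveness,  port: 8080 }, periodSeconds: 10 }
        resources: { requests: { cpu: "200m", memory: "512Mi" }, limits: { cpu: "1", memory: "1Gi" } }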

Anti‑patterns to avoid

  • Big‑bang deploys without canary/rollback.
  • Destructive schema changes tied to app start.
  • No SLOs/alerts (“monitoring by hope”).
  • Secrets in Git or baked into images.
  • One giant E2E suite blocking every deploy.