How do you deploy microservices in production?
1) Foundations (before any deploy)
- Containers: one image per service, multi‑stage builds, SBOM + signature (Cosign).
- IaC: VPC, clusters, gateways, DBs via Terraform/Pulumi; everything versioned.
- Environments: dev → stage → prod with parity (same infra classes, smaller scale).
- Secrets/config: externalized; secrets in Vault/Secrets Manager; config in env/ConfigMap/AppConfig.
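As a rough sketch of externalized config, a service can read everything from its environment at startup and fail fast when a required value is missing. The variable names `ORDERS_DB_URL` and `ORDERS_HTTP_TIMEOUT_S` are hypothetical, chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_url: str            # secret: injected at runtime, never committed or baked into the image
    http_timeout_s: float  # plain config: fine to keep in a ConfigMap or env file


def load_settings(env=os.environ) -> Settings:
    """Read settings from the environment and fail fast if a required one is missing."""
    db_url = env.get("ORDERS_DB_URL")  # hypothetical variable name
    if not db_url:
        raise RuntimeError("ORDERS_DB_URL is required")
    # hypothetical variable name, with a default for local runs
    timeout = float(env.get("ORDERS_HTTP_TIMEOUT_S", "2.0"))
    return Settings(db_url=db_url, http_timeout_s=timeout)
```

Passing a plain dict as `env` keeps the loader easy to test without touching the real process environment.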
2) CI/CD flow
- CI: build → unit/component → contract tests → integration (Testcontainers) → image scan → push to registry.
- CD: GitOps (Argo CD/Flux) or pipelines (CodeDeploy/Spinnaker/Argo Workflows).
- Promotion: tag/PR from stage to prod; require automated checks to pass.
3) Runtime platform (pick one)
- Kubernetes (EKS/GKE/AKS): Deployments, Services, Ingress/Gateway API; optional service mesh (Istio/Linkerd) for mTLS, retries, canaries.
- ECS Fargate: Tasks/Services + ALB; Cloud Map for discovery.
- Serverless mix: API Gateway + Lambda for event‑driven bits.
4) Release strategies (safe rollouts)
- Rolling updates: the Kubernetes Deployment default (maxSurge 25%, maxUnavailable 25%).
- Blue‑green: stand up “green”, validate, flip traffic; instant rollback.
- Canary: 1% → 5% → 25% → 100% using mesh/ALB rules; auto‑promote on SLOs.
- Feature flags: decouple code deploy from feature release; add kill‑switches.
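The canary progression above can be sketched as a small decision function: advance one step while the canary meets its SLOs, otherwise drop its traffic to zero. This is only an illustration of the gating logic; in practice a mesh or rollout controller does this, and the `Slo` thresholds here are hypothetical.

```python
from dataclasses import dataclass

CANARY_STEPS = [1, 5, 25, 100]  # percent of traffic, matching the canary bullet above


@dataclass
class Slo:
    max_error_rate: float  # e.g. 0.01 means at most 1% of requests may fail
    max_p95_ms: float      # ceiling for 95th percentile latency


def next_weight(current: int, error_rate: float, p95_ms: float, slo: Slo) -> int:
    """Return the next canary traffic weight, or 0 to roll back.

    error_rate and p95_ms are the canary's metrics observed at the current weight.
    """
    if error_rate > slo.max_error_rate or p95_ms > slo.max_p95_ms:
        return 0  # SLO breached: send all traffic back to the stable version
    idx = CANARY_STEPS.index(current)
    if idx == len(CANARY_STEPS) - 1:
        return current  # already fully promoted
    return CANARY_STEPS[idx + 1]
```

For example, with `Slo(max_error_rate=0.01, max_p95_ms=300)`, a healthy canary at 5% moves to 25%, while one showing a 5% error rate is rolled back.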
5) Data changes
- Expand–contract migrations (add columns/paths first, backfill, switch, then drop).
- Run migrations as jobs pre‑traffic; never bundle destructive DDL with app start.
- Have rollback and replay/backfill plans.
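A minimal sketch of the expand-contract sequence, using an in-memory SQLite database so it runs anywhere. The `users` table and its `name` to `display_name` rename are hypothetical; a real migration would go through your migration tool and database.

```python
import sqlite3


def expand(db: sqlite3.Connection) -> None:
    # Step 1: additive change only; old code that ignores the new column keeps working.
    db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(db: sqlite3.Connection) -> None:
    # Step 2: copy existing data into the new column; safe to re-run.
    db.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")


def contract(db: sqlite3.Connection) -> None:
    # Step 4: only after every reader and writer uses display_name, remove the old column.
    # A table rebuild is used because it works on every SQLite version.
    db.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('ada'), ('lin')")

expand(db)
backfill(db)
# Step 3 (switch) is an application deploy that reads and writes display_name.
contract(db)
rows = db.execute("SELECT id, display_name FROM users ORDER BY id").fetchall()
```

Each step ships as its own release, so any single step can be rolled back without losing data; only `contract` is destructive, which is why it runs last and on its own.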
6) Reliability controls
- Health probes: liveness/readiness/startup.
- Resilience: timeouts, bounded retries with jitter, circuit breakers, bulkheads.
- Autoscaling: HPA on CPU/memory plus RPS, queue depth, or other custom metrics; PodDisruptionBudgets to keep capacity during voluntary disruptions.
- Rate limiting & quota at gateway/mesh; WAF on public edges.
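"Bounded retries with jitter" can be sketched as follows: a fixed attempt budget, exponential backoff capped at a maximum, and a random delay so many clients do not retry in lockstep. The defaults and the set of retryable exceptions are assumptions; tune them per dependency.

```python
import random
import time


def call_with_retries(fn, attempts=4, base_delay_s=0.1, max_delay_s=2.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(); on a retryable error, wait a random (full-jitter) delay and try again.

    The attempt count is bounded so a failing dependency is not hammered forever.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the error to the caller
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter spreads out retries from many clients
```

Only retry operations that are idempotent, and keep the total retry time below the caller's own timeout.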
7) Observability (mandatory)
- Logs: structured JSON, correlation/trace IDs; centralized (ELK/OpenSearch/CloudWatch).
- Metrics: RED/Golden signals; dashboards + alerts with SLOs & error budgets.
- Tracing: OpenTelemetry SDK + Collector → Jaeger/Tempo/X‑Ray.
- Deploy markers: annotate releases; link to commits and configs.
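For the structured-logging bullet, here is a minimal sketch using only the standard library: one JSON object per line, with a trace ID attached through `extra=`. The field names (`ts`, `trace_id`, and so on) are assumptions; match whatever your log pipeline expects.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via `extra=`; the field name is an assumption
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

In a real service the trace ID would come from the active OpenTelemetry span context rather than being passed by hand.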
8) Security & compliance
- mTLS service‑to‑service (mesh or sidecars), TLS at the edge (for example, ACM certs on AWS).
- Least‑privilege IAM/RBAC; rotate secrets; image & dep scans; SBOMs.
- Network policies (K8s) / Security Groups; private subnets + egress control.
- Audit trails on config/secret changes.
9) Testing in the pipeline
- Unit/component + consumer/provider contract tests.
- Integration with Testcontainers (DB, Kafka).
- Staging E2E smoke, performance smoke (p95 latency), OWASP ZAP baseline scan.
- Chaos tests (latency, drops) in staging; periodic prod game‑days.
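To illustrate the idea behind consumer/provider contract tests, the sketch below has the consumer declare the fields it reads, and the provider's build check a sample response against that declaration. This is a hand-rolled illustration, not the API of Pact or any other contract-testing tool, and the order fields are hypothetical.

```python
# Contract published by the consumer: the fields it reads and their types.
ORDER_CONTRACT = {"id": str, "status": str, "total_cents": int}  # hypothetical fields


def contract_violations(response: dict, contract: dict) -> list:
    """Return human-readable violations; an empty list means the provider is compatible."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return problems
```

Extra fields in the response are allowed on purpose, so the provider can add data without breaking consumers; removing or retyping a declared field fails the provider's build.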
10) Deployment checklist (per release)
- Build passed; image signed ✅
- DB migration applied; backfill done ✅
- Config/flags reviewed; blast radius understood ✅
- Rollout plan + rollback plan documented ✅
- Synthetic checks green after shift ✅
11) Operate & roll back
- Progressive delivery with automated rollback on SLO breach.
- Runbooks: incidents, feature freeze, rollback, data fix, replay.
- Post‑deploy verification: error rate, p95 latency, saturation, business KPIs.
- Postmortems: blameless, action items tracked.
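One way to turn "rollback on SLO breach" into a concrete check is an error-budget burn rate over the post-deploy window: the observed error rate divided by the failure fraction the SLO allows. The sketch below assumes a 99.9% availability SLO and a 10x threshold, both hypothetical policy values.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed failure fraction, e.g. 0.001 for 99.9%
    return (errors / requests) / budget


def should_roll_back(errors: int, requests: int, slo_target: float = 0.999,
                     threshold: float = 10.0) -> bool:
    # threshold is a hypothetical policy value: roll back when the post-deploy
    # window burns budget more than 10x faster than the SLO allows
    return burn_rate(errors, requests, slo_target) > threshold
```

With these defaults, 200 failures in 10,000 requests gives a burn rate of about 20 and triggers a rollback, while 5 failures in 10,000 gives about 0.5 and does not.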
Minimal prod manifest (Kubernetes example)
apiVersion: apps/v1
kind: Deployment
metadata: { name: orders, labels: { app: orders, version: v1.8.3 } }
spec:
  replicas: 6
  strategy: { type: RollingUpdate, rollingUpdate: { maxSurge: 2, maxUnavailable: 1 } }
  selector: { matchLabels: { app: orders } }
  template:
    metadata: { labels: { app: orders } }
    spec:
      containers:
        - name: orders
          image: registry/orders:v1.8.3
          ports: [{ containerPort: 8080 }]
          envFrom:
            - configMapRef: { name: orders-config }
            - secretRef: { name: orders-secrets }
          readinessProbe: { httpGet: { path: /actuator/health/readiness, port: 8080 }, periodSeconds: 5 }
          livenessProbe: { httpGet: { path: /actuator/health/liveness, port: 8080 }, periodSeconds: 10 }
          resources: { requests: { cpu: "200m", memory: "512Mi" }, limits: { cpu: "1", memory: "1Gi" } }
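To see what the `maxSurge` and `maxUnavailable` settings in the manifest above mean in pod counts, here is a small helper that computes the bounds during a rollout. It is a sketch of the arithmetic only; the function name is made up. Kubernetes rounds a percentage `maxSurge` up and a percentage `maxUnavailable` down.

```python
import math


def rollout_bounds(replicas: int, max_surge, max_unavailable):
    """Return (min_available, max_total) pod counts during a rolling update.

    Values may be absolute ints or percentage strings such as "25%".
    """
    def resolve(value, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)              # extra pods allowed above replicas
    unavailable = resolve(max_unavailable, round_up=False)  # pods allowed to be down
    return replicas - unavailable, replicas + surge
```

For the manifest's 6 replicas, `rollout_bounds(6, 2, 1)` and `rollout_bounds(6, "25%", "25%")` both give `(5, 8)`: at least 5 pods stay available and at most 8 exist at once, so size node capacity and downstream connection limits for the higher number.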
Anti‑patterns to avoid
- Big‑bang deploys without canary/rollback.
- Destructive schema changes tied to app start.
- No SLOs/alerts (“monitoring by hope”).
- Secrets in Git or baked into images.
- One giant E2E suite blocking every deploy.