How do you handle inter-service communication failure?

1) Detect and bound the failure

  • Tight timeouts: Set per‑call timeouts on both client and server; never allow “infinite” waits (see the sketch after this list).
  • Deadlines/contexts: Propagate request deadlines (gRPC deadline, HTTP header).
  • Health checks: Use readiness for routing, liveness for restarts.
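
A minimal sketch of a per‑call timeout with the JDK’s built‑in HttpClient (the inventory URL and the 200 ms/500 ms budgets are illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofMillis(200))            // bound connection establishment
    .build();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://inventory/items/42"))      // illustrative endpoint
    .timeout(Duration.ofMillis(500))                    // per-call deadline; HttpTimeoutException if exceeded
    .GET()
    .build();

HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());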

2) Retry—but only when it’s safe

  • Retry policy: Limit attempts (e.g., 2–3) and use exponential backoff with jitter (see the sketch after this list).
  • Retry only idempotent ops: GET, PUT, DELETE (idempotent by contract), and POSTs guarded by idempotency keys; never blind‑retry non‑idempotent POSTs (payments!).
  • Timeout‑per‑attempt: Don’t exceed the caller’s overall deadline (retry budget).
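
A hand‑rolled sketch of bounded retries with exponential backoff and full jitter (callWithRetry, the 2 s cap, and the wrapped call are illustrative; libraries like Resilience4j provide the same behavior via configuration):

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

static <T> T callWithRetry(Supplier<T> idempotentCall, int maxAttempts, Duration baseDelay)
    throws InterruptedException {
  RuntimeException last = null;
  for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return idempotentCall.get();                     // only wrap idempotent operations
    } catch (RuntimeException e) {
      last = e;
      if (attempt == maxAttempts) break;
      // exponential backoff: base * 2^(attempt-1), capped at 2 s, with full jitter
      long capMillis = Math.min(baseDelay.toMillis() * (1L << (attempt - 1)), 2_000);
      Thread.sleep(ThreadLocalRandom.current().nextLong(capMillis + 1));
    }
  }
  throw last;                                          // retry budget exhausted
}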

3) Trip early with circuit breakers

  • Circuit breaker: Open on error/latency thresholds; half‑open probes; close on success.
  • Bulkheads: Separate connection pools/threads per dependency so one bad call doesn’t starve others.
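
A minimal bulkhead sketch: each dependency gets its own bounded thread pool, so a slow inventory call can exhaust only its own threads (pool sizes, the 500 ms wait, and inventoryClient are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ExecutorService inventoryPool = new ThreadPoolExecutor(
    4, 4, 0, TimeUnit.SECONDS,
    new ArrayBlockingQueue<>(20),                  // bounded queue = built-in back-pressure
    new ThreadPoolExecutor.AbortPolicy());         // reject instead of queueing forever

ExecutorService pricingPool = Executors.newFixedThreadPool(4);   // isolated from inventory

Future<String> stock = inventoryPool.submit(() -> inventoryClient.getStock("42"));
String result = stock.get(500, TimeUnit.MILLISECONDS);           // bound the wait as well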

4) Degrade gracefully (fallbacks)

  • Cached/last‑known data: Serve from cache if it’s fresh enough (see the sketch after this list).
  • Default responses: Skeleton UI, partial results, feature flags to hide non‑critical features.
  • Queue work: For write paths, accept and enqueue (outbox → broker) and process async.
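
A sketch of the cached/last‑known‑data fallback (inventoryClient and the 5‑minute freshness window are illustrative):

import java.time.Duration;
import java.time.Instant;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// value + timestamp of the last successful fetch, keyed by item id
Map<String, Map.Entry<String, Instant>> lastKnown = new ConcurrentHashMap<>();

String getInventory(String itemId) {
  try {
    String fresh = inventoryClient.get(itemId);                       // hypothetical remote client
    lastKnown.put(itemId, new SimpleEntry<>(fresh, Instant.now()));
    return fresh;
  } catch (Exception e) {
    Map.Entry<String, Instant> cached = lastKnown.get(itemId);
    if (cached != null && Duration.between(cached.getValue(), Instant.now()).toMinutes() < 5) {
      return cached.getKey();                                         // last-known data, fresh enough
    }
    return "{\"itemId\":\"" + itemId + "\",\"status\":\"unavailable\"}";  // default/partial response
  }
}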

5) Ensure correctness under retries

  • Idempotency keys: Deduplicate server side (operationId, Idempotency-Key).
  • Exactly‑once illusion: Combine idempotent handlers + at‑least‑once delivery.
  • Sagas/compensations: For multi‑service workflows, compensate instead of trying cross‑service ACID.
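
A minimal saga sketch: each completed step pushes a compensation, and a failure runs them in reverse instead of attempting cross‑service ACID (orderService, paymentService, and shippingService are illustrative; compensations must themselves be idempotent):

import java.util.ArrayDeque;
import java.util.Deque;

Deque<Runnable> compensations = new ArrayDeque<>();
try {
  String orderId = orderService.create(cart);
  compensations.push(() -> orderService.cancel(orderId));

  String paymentId = paymentService.charge(orderId, total);
  compensations.push(() -> paymentService.refund(paymentId));

  shippingService.schedule(orderId);                 // last step; nothing left to compensate
} catch (Exception e) {
  while (!compensations.isEmpty()) {
    compensations.pop().run();                       // undo completed steps in reverse order
  }
  throw e;
}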

6) Shed load before you tip over

  • Client‑side rate limiting: Maintain a token bucket per dependency (see the sketch after this list).
  • Server load shedding: Reject early with 429/503 when saturated.
  • Queue back‑pressure: Bound queue sizes; drop oldest/lowest‑priority first.
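
A simple token‑bucket sketch for client‑side rate limiting of one dependency (the 10 req/s rate and burst of 20 are illustrative):

// One bucket per dependency: refill ratePerSecond tokens up to capacity; a call proceeds only if a token is available.
class TokenBucket {
  private final long capacity;
  private final double ratePerSecond;
  private double tokens;
  private long lastRefillNanos = System.nanoTime();

  TokenBucket(long capacity, double ratePerSecond) {
    this.capacity = capacity;
    this.ratePerSecond = ratePerSecond;
    this.tokens = capacity;
  }

  synchronized boolean tryAcquire() {
    long now = System.nanoTime();
    tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * ratePerSecond);
    lastRefillNanos = now;
    if (tokens >= 1) { tokens -= 1; return true; }
    return false;                                    // caller fails fast or queues instead of hammering the dependency
  }
}

TokenBucket inventoryLimiter = new TokenBucket(20, 10);   // bursts up to 20, sustained 10 req/s
if (!inventoryLimiter.tryAcquire()) {
  // shed the call locally (serve cached/default data or return 429) instead of overloading a struggling dependency
}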

7) Observe and automate recovery

  • Golden signals: latency, traffic, errors, saturation per endpoint.
  • Distributed tracing: Correlate failures across hops with trace/correlation IDs (see the sketch after this list).
  • DLQs & replays: Poison messages go to DLQ with replay tools.
  • Chaos testing: Periodically inject failures (latency, drops) to validate policies.
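
A sketch of forwarding a correlation ID on outbound calls so failures can be stitched together across hops (incomingCorrelationId and the X-Correlation-ID header name are illustrative; W3C traceparent or your tracing library’s propagation is the usual choice):

import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

// Reuse the inbound request's correlation id if present, otherwise start a new one,
// and attach it to every outbound call so logs and traces across services can be joined.
String correlationId = (incomingCorrelationId != null) ? incomingCorrelationId : UUID.randomUUID().toString();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://pricing/quotes/42"))       // illustrative downstream call
    .header("X-Correlation-ID", correlationId)
    .GET()
    .build();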

8) Network hygiene

  • Connection pooling & keep‑alive: Tune pool sizes and keep‑alive reuse; avoid thundering‑herd reconnects (see the sketch after this list).
  • DNS/Discovery: Short TTL + caching; handle endpoint churn cleanly.
  • mTLS & time sync: Cert/clock issues often masquerade as “random” failures.
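
A sketch of a bounded, shared connection pool with Apache HttpClient (pool sizes and the 30 s idle eviction are illustrative):

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Bounded pool with keep-alive reuse; per-route caps stop one dependency from monopolizing sockets.
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);             // total sockets across all dependencies
cm.setDefaultMaxPerRoute(20);    // cap per dependency (route)

CloseableHttpClient client = HttpClients.custom()
    .setConnectionManager(cm)
    .evictIdleConnections(30, TimeUnit.SECONDS)   // drop stale keep-alive connections instead of reusing dead ones
    .build();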

Tiny examples

HTTP (Java + Resilience4j)

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import java.time.Duration;
import java.util.function.Supplier;

var cb = CircuitBreaker.ofDefaults("inventory");
var rt = Retry.ofDefaults("inventory"); // backoff & max-attempts tuned via RetryConfig in practice

// `http`, `cache`, and `Response` are placeholders for your HTTP client, local cache, and response type.
Supplier<Response> supplier = CircuitBreaker
  .decorateSupplier(cb, () -> http.get("/inventory/42").withTimeout(Duration.ofMillis(500)));

supplier = Retry.decorateSupplier(rt, supplier); // retry wraps the breaker-protected call
try { return supplier.get(); }
catch (CallNotPermittedException e) { return cache.getOrDefault("inventory-42", Response.EMPTY); } // breaker open: serve fallback

gRPC deadline (Go)

ctx, cancel := context.WithTimeout(ctx, 400*time.Millisecond)
defer cancel()
resp, err := client.GetPrice(ctx, req) // server sees deadline

Idempotent write with key (HTTP)

POST /orders
Idempotency-Key: 9e2c-7b...

{ "items": [...], "total": 1200 }

Server stores Idempotency-Key → response mapping; duplicates return the same result.
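
A sketch of that server‑side mapping (in‑memory here for brevity; OrderRequest, OrderResponse, and createOrder are illustrative, and a real service would persist the mapping with a TTL):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Idempotency-Key -> stored response; computeIfAbsent runs createOrder at most once per key
ConcurrentMap<String, OrderResponse> processed = new ConcurrentHashMap<>();

OrderResponse handleCreateOrder(String idempotencyKey, OrderRequest req) {
  return processed.computeIfAbsent(idempotencyKey, key -> createOrder(req));  // duplicates get the first result
}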


Quick checklist

  • Per‑dependency timeouts, retries with jitter, circuit breakers ✅
  • Idempotency on writes; outbox/queue for async work ✅
  • Graceful degradation paths defined and tested ✅
  • Metrics + traces + DLQ wired and watched ✅
  • Load shedding and bulkheads to prevent cascades ✅