How do you handle inter-service communication failure?
1) Detect and bound the failure
- Tight timeouts: Set per‑call timeouts (client + server). No “infinite” waits.
- Deadlines/contexts: Propagate request deadlines (gRPC deadline, HTTP header).
- Health checks: Use readiness for routing, liveness for restarts.
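A minimal sketch of the tight per-call bounds above, using Java's built-in java.net.http client (the URL and durations are illustrative):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofMillis(300))      // bound connection establishment
    .build();
HttpRequest request = HttpRequest.newBuilder(URI.create("http://inventory/items/42"))
    .timeout(Duration.ofMillis(500))             // bound the whole attempt; HttpTimeoutException on breach
    .GET()
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());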
2) Retry—but only when it’s safe
- Retry policy: Limited attempts (e.g., 2–3), exponential backoff + jitter.
- Retry only idempotent ops: GET, PUT/DELETE (idempotent by contract), or writes guarded by idempotency keys; never blind‑retry non‑idempotent POSTs (payments!).
- Timeout‑per‑attempt: Don’t exceed the caller’s overall deadline (retry budget).
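One way to sketch the backoff-plus-jitter loop in plain Java (attempt count, base delay, and the broad catch are illustrative; real code retries only errors known to be transient):
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Retry an idempotent call a bounded number of times with exponential backoff + full jitter.
<T> T retryWithJitter(Supplier<T> call, int maxAttempts, Duration baseDelay) throws InterruptedException {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            last = e;
            if (attempt == maxAttempts) break;                            // attempts exhausted
            long cap = baseDelay.toMillis() * (1L << attempt);            // exponential ceiling: base * 2^attempt
            Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));  // full jitter in [0, cap]
        }
    }
    throw last;
}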
3) Trip early with circuit breakers
- Circuit breaker: Open on error/latency thresholds; half‑open probes; close on success.
- Bulkheads: Separate connection pools/threads per dependency so one bad call doesn’t starve others.
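A possible Resilience4j configuration for the breaker and bulkhead described above (all thresholds are illustrative):
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open when >=50% of recent calls fail
    .slowCallDurationThreshold(Duration.ofMillis(500)) // calls slower than this count as slow
    .slowCallRateThreshold(50)                         // ...and open when >=50% are slow
    .waitDurationInOpenState(Duration.ofSeconds(10))   // stay open before half-open probes
    .permittedNumberOfCallsInHalfOpenState(3)          // probe calls before closing again
    .build();
CircuitBreaker cb = CircuitBreaker.of("inventory", cbConfig);

BulkheadConfig bhConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(20)                            // cap in-flight calls to this dependency
    .maxWaitDuration(Duration.ofMillis(10))            // fail fast instead of queueing callers
    .build();
Bulkhead bulkhead = Bulkhead.of("inventory", bhConfig);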
4) Degrade gracefully (fallbacks)
- Cached/last‑known data: Serve from cache if fresh enough.
- Default responses: Skeleton UI, partial results, feature flags to hide non‑critical features.
- Queue work: For write paths, accept and enqueue (outbox → broker) and process async.
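A minimal fallback sketch for a read path: refresh a last-known-good cache on every success and serve it when the call fails (Price, pricingClient, and Price.UNKNOWN are hypothetical names):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Last-known-good values, refreshed on every successful call.
Map<String, Price> lastKnown = new ConcurrentHashMap<>();

Price getPrice(String sku) {
    try {
        Price fresh = pricingClient.getPrice(sku);   // remote call with its own timeout/breaker
        lastKnown.put(sku, fresh);
        return fresh;
    } catch (Exception e) {
        Price stale = lastKnown.get(sku);
        if (stale != null) return stale;             // degrade: possibly stale, but usable
        return Price.UNKNOWN;                        // default response when nothing is cached
    }
}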
5) Ensure correctness under retries
- Idempotency keys: Deduplicate server side (operationId, Idempotency-Key).
- Exactly‑once illusion: Combine idempotent handlers + at‑least‑once delivery.
- Sagas/compensations: For multi‑service workflows, compensate instead of trying cross‑service ACID.
6) Shed load before you tip over
- Client‑side rate limiting: Token bucket per dependency.
- Server load shedding: Reject early with 429/503 when saturated.
- Queue back‑pressure: Bound queue sizes; drop oldest/lowest‑priority first.
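A bare server-side shedding sketch: cap in-flight requests with a semaphore and return 503 as soon as the cap is hit (Response, Request, and process are hypothetical names):
import java.util.concurrent.Semaphore;

// Cap concurrent work; anything beyond the cap is rejected immediately rather than queued.
Semaphore inFlight = new Semaphore(100);

Response handle(Request request) {
    if (!inFlight.tryAcquire()) {
        return Response.status(503).header("Retry-After", "1"); // shed load before timing out
    }
    try {
        return process(request);
    } finally {
        inFlight.release();
    }
}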
7) Observe and automate recovery
- Golden signals: latency, traffic, errors, saturation per endpoint.
- Distributed tracing: Correlate failures across hops (trace IDs).
- DLQs & replays: Poison messages go to DLQ with replay tools.
- Chaos testing: Periodically inject failures (latency, drops) to validate policies.
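Tracing libraries (e.g. OpenTelemetry) normally handle propagation, but correlation boils down to carrying the same trace context across hops; a bare sketch with the W3C traceparent header (incomingRequest and newTraceparent are hypothetical, and real tracers also mint a fresh span ID per hop):
import java.net.URI;
import java.net.http.HttpRequest;

// Forward the caller's trace context so this hop shows up in the same trace.
String traceparent = incomingRequest.getHeader("traceparent");
HttpRequest outbound = HttpRequest.newBuilder(URI.create("http://inventory/items/42"))
    .header("traceparent", traceparent != null ? traceparent : newTraceparent())
    .GET()
    .build();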
8) Network hygiene
- Connection pooling & keep‑alive tuned; avoid thundering‑herd reconnects.
- DNS/Discovery: Short TTL + caching; handle endpoint churn cleanly.
- mTLS & time sync: Cert/clock issues often masquerade as “random” failures.
Tiny examples
HTTP (Java + Resilience4j)
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import java.time.Duration;
import java.util.function.Supplier;

// "http", Response, and cache stand in for this service's own client, DTO, and cache types.
var cb = CircuitBreaker.ofDefaults("inventory");
var rt = Retry.ofDefaults("inventory"); // backoff & max-attempts tuned separately
Supplier<Response> supplier = CircuitBreaker
    .decorateSupplier(cb, () -> http.get("/inventory/42").withTimeout(Duration.ofMillis(500)));
supplier = Retry.decorateSupplier(rt, supplier); // each retry attempt passes through the breaker
try { return supplier.get(); }
catch (CallNotPermittedException e) { return cache.getOrDefault("inventory-42"); } // breaker open → serve fallback
gRPC deadline (Go)
ctx, cancel := context.WithTimeout(ctx, 400*time.Millisecond)
defer cancel()
resp, err := client.GetPrice(ctx, req) // server sees deadline
Idempotent write with key (HTTP)
POST /orders
Idempotency-Key: 9e2c-7b...
{ "items": [...], "total": 1200 }
Server stores Idempotency-Key → response mapping; duplicates return the same result.
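A minimal sketch of that mapping with an in-memory map (OrderRequest, OrderResponse, and placeOrder are hypothetical; production code persists the mapping with a TTL and scopes keys per client):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Idempotency-Key -> stored response; a duplicate key replays the original result instead of re-executing.
Map<String, OrderResponse> processed = new ConcurrentHashMap<>();

OrderResponse createOrder(String idempotencyKey, OrderRequest request) {
    return processed.computeIfAbsent(idempotencyKey, key -> placeOrder(request));
}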
Quick checklist
- Per‑dependency timeouts, retries with jitter, circuit breakers ✅
- Idempotency on writes; outbox/queue for async work ✅
- Graceful degradation paths defined and tested ✅
- Metrics + traces + DLQ wired and watched ✅
- Load shedding and bulkheads to prevent cascades ✅