How do you handle inter-service communication failure?
1) Detect and bound the failure
- Tight timeouts: Set per‑call timeouts (client + server). No “infinite” waits.
- Deadlines/contexts: Propagate request deadlines (gRPC deadline, HTTP header).
- Health checks: Use readiness for routing, liveness for restarts.
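A minimal sketch of the tight per-call bounds above, using Java's built-in java.net.http client (the URL and durations are illustrative):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofMillis(300))      // bound connection establishment
    .build();
HttpRequest request = HttpRequest.newBuilder(URI.create("http://inventory/items/42"))
    .timeout(Duration.ofMillis(500))             // bound the whole attempt; HttpTimeoutException on breach
    .GET()
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());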
2) Retry—but only when it’s safe
- Retry policy: Limited attempts (e.g., 2–3), exponential backoff + jitter.
- Retry only idempotent ops: GET, PUT/DELETE (idempotent by contract), or writes guarded by idempotency keys; never blind‑retry non‑idempotent POSTs (payments!).
- Timeout‑per‑attempt: Don’t exceed the caller’s overall deadline (retry budget).
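One way to sketch the backoff-plus-jitter loop in plain Java (attempt count, base delay, and the broad catch are illustrative; real code retries only errors known to be transient):
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Retry an idempotent call a bounded number of times with exponential backoff + full jitter.
<T> T retryWithJitter(Supplier<T> call, int maxAttempts, Duration baseDelay) throws InterruptedException {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            last = e;
            if (attempt == maxAttempts) break;                            // attempts exhausted
            long cap = baseDelay.toMillis() * (1L << attempt);            // exponential ceiling: base * 2^attempt
            Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));  // full jitter in [0, cap]
        }
    }
    throw last;
}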
3) Trip early with circuit breakers
- Circuit breaker: Open on error/latency thresholds; half‑open probes; close on success.
- Bulkheads: Separate connection pools/threads per dependency so one bad call doesn’t starve others.
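A possible Resilience4j configuration for the breaker and bulkhead described above (all thresholds are illustrative):
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open when >=50% of recent calls fail
    .slowCallDurationThreshold(Duration.ofMillis(500)) // calls slower than this count as slow
    .slowCallRateThreshold(50)                         // ...and open when >=50% are slow
    .waitDurationInOpenState(Duration.ofSeconds(10))   // stay open before half-open probes
    .permittedNumberOfCallsInHalfOpenState(3)          // probe calls before closing again
    .build();
CircuitBreaker cb = CircuitBreaker.of("inventory", cbConfig);

BulkheadConfig bhConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(20)                            // cap in-flight calls to this dependency
    .maxWaitDuration(Duration.ofMillis(10))            // fail fast instead of queueing callers
    .build();
Bulkhead bulkhead = Bulkhead.of("inventory", bhConfig);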
4) Degrade gracefully (fallbacks)
- Cached/last‑known data: Serve from cache if fresh enough.
- Default responses: Skeleton UI, partial results, feature flags to hide non‑critical features.
- Queue work: For write paths, accept and enqueue (outbox → broker) and process async.
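A minimal fallback sketch for a read path: refresh a last-known-good cache on every success and serve it when the call fails (Price, pricingClient, and Price.UNKNOWN are hypothetical names):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Last-known-good values, refreshed on every successful call.
Map<String, Price> lastKnown = new ConcurrentHashMap<>();

Price getPrice(String sku) {
    try {
        Price fresh = pricingClient.getPrice(sku);   // remote call with its own timeout/breaker
        lastKnown.put(sku, fresh);
        return fresh;
    } catch (Exception e) {
        Price stale = lastKnown.get(sku);
        if (stale != null) return stale;             // degrade: possibly stale, but usable
        return Price.UNKNOWN;                        // default response when nothing is cached
    }
}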
5) Ensure correctness under retries
- Idempotency keys: Deduplicate server side (operationId, Idempotency-Key).
- Exactly‑once illusion: Combine idempotent handlers + at‑least‑once delivery.
- Sagas/compensations: For multi‑service workflows, compensate instead of trying cross‑service ACID.
6) Shed load before you tip over
- Client‑side rate limiting: Token bucket per dependency.
- Server load shedding: Reject early with 429/503 when saturated.
- Queue back‑pressure: Bound queue sizes; drop oldest/lowest‑priority first.
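A bare server-side shedding sketch: cap in-flight requests with a semaphore and return 503 as soon as the cap is hit (Response, Request, and process are hypothetical names):
import java.util.concurrent.Semaphore;

// Cap concurrent work; anything beyond the cap is rejected immediately rather than queued.
Semaphore inFlight = new Semaphore(100);

Response handle(Request request) {
    if (!inFlight.tryAcquire()) {
        return Response.status(503).header("Retry-After", "1"); // shed load before timing out
    }
    try {
        return process(request);
    } finally {
        inFlight.release();
    }
}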
7) Observe and automate recovery
- Golden signals: latency, traffic, errors, saturation per endpoint.
- Distributed tracing: Correlate failures across hops (trace IDs).
- DLQs & replays: Poison messages go to DLQ with replay tools.
- Chaos testing: Periodically inject failures (latency, drops) to validate policies.
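Tracing libraries (e.g. OpenTelemetry) normally handle propagation, but correlation boils down to carrying the same trace context across hops; a bare sketch with the W3C traceparent header (incomingRequest and newTraceparent are hypothetical, and real tracers also mint a fresh span ID per hop):
import java.net.URI;
import java.net.http.HttpRequest;

// Forward the caller's trace context so this hop shows up in the same trace.
String traceparent = incomingRequest.getHeader("traceparent");
HttpRequest outbound = HttpRequest.newBuilder(URI.create("http://inventory/items/42"))
    .header("traceparent", traceparent != null ? traceparent : newTraceparent())
    .GET()
    .build();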
8) Network hygiene
- Connection pooling & keep‑alive tuned; avoid thundering‑herd reconnects.
- DNS/Discovery: Short TTL + caching; handle endpoint churn cleanly.
- mTLS & time sync: Cert/clock issues often masquerade as “random” failures.
Tiny examples
HTTP (Java + Resilience4j)
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import java.time.Duration;
import java.util.function.Supplier;

// "http", Response, and cache stand in for this service's own client, DTO, and cache types.
var cb = CircuitBreaker.ofDefaults("inventory");
var rt = Retry.ofDefaults("inventory"); // backoff & max-attempts tuned separately
Supplier<Response> supplier = CircuitBreaker
    .decorateSupplier(cb, () -> http.get("/inventory/42").withTimeout(Duration.ofMillis(500)));
supplier = Retry.decorateSupplier(rt, supplier); // each retry attempt passes through the breaker
try { return supplier.get(); }
catch (CallNotPermittedException e) { return cache.getOrDefault("inventory-42"); } // breaker open → serve fallback
gRPC deadline (Go)
ctx, cancel := context.WithTimeout(ctx, 400*time.Millisecond)
defer cancel()
resp, err := client.GetPrice(ctx, req) // server sees deadline
Idempotent write with key (HTTP)
POST /orders
Idempotency-Key: 9e2c-7b...
{ "items": [...], "total": 1200 }
Server stores Idempotency-Key → response mapping; duplicates return the same result.
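A minimal sketch of that mapping with an in-memory map (OrderRequest, OrderResponse, and placeOrder are hypothetical; production code persists the mapping with a TTL and scopes keys per client):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Idempotency-Key -> stored response; a duplicate key replays the original result instead of re-executing.
Map<String, OrderResponse> processed = new ConcurrentHashMap<>();

OrderResponse createOrder(String idempotencyKey, OrderRequest request) {
    return processed.computeIfAbsent(idempotencyKey, key -> placeOrder(request));
}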
Quick checklist
- Per‑dependency timeouts, retries with jitter, circuit breakers ✅
- Idempotency on writes; outbox/queue for async work ✅
- Graceful degradation paths defined and tested ✅
- Metrics + traces + DLQ wired and watched ✅
- Load shedding and bulkheads to prevent cascades ✅