Distributed transactions in Microservices?
0) First principle
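- Do NOT use XA/2PC across services. It is slow (locks are held across network round trips), fragile (a stalled coordinator blocks every participant), and ties you to XA-capable databases and brokers.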
- Prefer eventual consistency with compensations and idempotency.
1) Sagas (the default)
What
A Saga is a workflow of local transactions across services. Each step has a forward action and, if needed, a compensating action.
Styles
- Choreography (event‑driven): Services subscribe to each other’s events and react.
  - ✅ Simple, no central brain; good for short linear flows.
  - ❌ Harder to see/trace; risk of “event spaghetti.”
- Orchestration (controller): A saga orchestrator sends commands and awaits replies/events.
  - ✅ Clear control/visibility; easier to time out/compensate.
  - ❌ One more service to run; keep it thin to avoid “god service.”
Example (Order → Payment → Inventory)
- Order creates PENDING order; emits OrderCreated.
- Payment reserves/charges; emits PaymentAuthorized or PaymentFailed.
- Inventory reserves items; emits InventoryReserved or InventoryFailed.
- Order transitions to CONFIRMED on success; otherwise runs compensations:
  - If inventory fails → emit PaymentRefund and set order CANCELLED.
Compensations: reverse business effects (refund, release stock), not DB rollbacks.
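The compensation flow can be sketched as a small in-process runner. This is an illustration only, not a production orchestrator: `SagaStep`, `run_saga`, and the step functions are hypothetical names, and real steps would be commands sent to other services rather than local function calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]      # forward local transaction
    compensate: Callable[[dict], None]  # reverses the business effect


def run_saga(steps: List[SagaStep], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            for prev in reversed(done):
                prev.compensate(ctx)
            return False
    return True


# Hypothetical step implementations that only record what happened.
def create_order(ctx: dict) -> None:
    ctx["order"] = "PENDING"
    ctx["log"].append("OrderCreated")


def cancel_order(ctx: dict) -> None:
    ctx["order"] = "CANCELLED"
    ctx["log"].append("OrderCancelled")


def authorize_payment(ctx: dict) -> None:
    ctx["log"].append("PaymentAuthorized")


def refund_payment(ctx: dict) -> None:
    ctx["log"].append("PaymentRefund")


def reserve_inventory(ctx: dict) -> None:
    if ctx["stock"] < 1:
        raise RuntimeError("out of stock")
    ctx["stock"] -= 1
    ctx["log"].append("InventoryReserved")


def release_inventory(ctx: dict) -> None:
    ctx["stock"] += 1


ORDER_SAGA = [
    SagaStep("order", create_order, cancel_order),
    SagaStep("payment", authorize_payment, refund_payment),
    SagaStep("inventory", reserve_inventory, release_inventory),
]

ctx = {"stock": 0, "log": []}  # no stock, so the inventory step fails
ok = run_saga(ORDER_SAGA, ctx)
if ok:
    ctx["order"] = "CONFIRMED"
```

Only steps that completed are compensated, and in reverse order, so the refund is issued before the order is cancelled. The failed step itself is not compensated because it never took effect.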
2) Transactional Outbox + CDC (make events reliable)
Problem
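A service writes to its DB and then publishes an event. If the process crashes between the two steps, the DB and the event stream drift apart: the state changed but no event was sent (or, with the opposite ordering, an event was sent for a change that never committed).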
Solution
- Transactional Outbox: In the same DB transaction as state change, insert an outbox record.
- A background relay polls the outbox table (or a CDC tool such as Debezium tails the DB log) and publishes each record to the broker with at-least-once delivery.
- Consumers must be idempotent.
Outbox schema (example)
outbox(id PK, aggregate_id, event_type, payload JSON, created_at, published BOOLEAN)
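A minimal sketch of the pattern using Python's built-in `sqlite3`, assuming a polling relay rather than CDC. The `orders` table, `create_order`, `relay_once`, and the `publish` callback are hypothetical stand-ins; a real relay would hand records to a message broker client.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id TEXT PRIMARY KEY,
    aggregate_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,               -- JSON stored as text in SQLite
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    published INTEGER NOT NULL DEFAULT 0
);
""")


def create_order(order_id: str) -> None:
    """State change and outbox row commit or roll back together."""
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderCreated",
             json.dumps({"orderId": order_id})),
        )


def relay_once(publish) -> int:
    """Publish unpublished rows, then mark them as published.

    A crash between publish() and the UPDATE means the row is published
    again on the next run, which is why delivery is at-least-once.
    """
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY created_at, rowid"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
create_order("o-1")
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
```

Publishing before marking is a deliberate choice in this sketch: the failure mode is a duplicate event, which idempotent consumers absorb, rather than a lost one.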
3) Idempotency & exactly-once (the reality)
- Assume at-least-once delivery; design idempotent handlers:
  - Use idempotency keys (e.g., operationId) persisted in a “processed” table.
  - Use upserts/ON CONFLICT DO NOTHING or compare version numbers (optimistic locking).
  - Make side effects (emails, payments) deduplicable.
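The version-number variant can be sketched as a compare-and-set update. This assumes each message carries the version it was computed against; the `stock` table and `apply_reservation` are hypothetical names, and SQLite stands in for whatever database the service uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO stock VALUES ('sku-1', 10, 1)")
conn.commit()


def apply_reservation(sku: str, qty: int, expected_version: int) -> bool:
    """Apply the change only if the row is still at the expected version.

    A redelivered or stale message matches no row, so it becomes a no-op.
    """
    with conn:
        cur = conn.execute(
            "UPDATE stock SET qty = qty - ?, version = version + 1 "
            "WHERE sku = ? AND version = ?",
            (qty, sku, expected_version),
        )
    return cur.rowcount == 1


first = apply_reservation("sku-1", 2, expected_version=1)   # applied
replay = apply_reservation("sku-1", 2, expected_version=1)  # duplicate, ignored
qty = conn.execute("SELECT qty FROM stock WHERE sku = 'sku-1'").fetchone()[0]
```

The caller has to distinguish "duplicate" from "concurrent update by someone else", since both return `False` here; re-reading the row and deciding whether to retry is left out of the sketch.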
4) State, time, and failure semantics
- State machines: Model each aggregate (Order, Payment) with explicit states and valid transitions.
- Timeouts: Every step must have a timeout and a recovery path (e.g., auto‑cancel after 15 min).
- Retries with jitter: Back off on transient failures; cap retries.
- Poison messages: Send to DLQ with enough context to replay safely.
- User experience: Surface intermediate states (e.g., “Confirming payment…”) and reconcile asynchronously.
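Two of these points, explicit transitions and capped retries with jitter, can be sketched in a few lines. The state names follow this document's Order example; the transition table itself, the function names, and the default `base`/`cap` values are assumptions chosen for illustration.

```python
import random

# Allowed transitions for the Order aggregate (the table is an assumption).
ORDER_TRANSITIONS = {
    "PENDING": {"PAID", "CANCELLED"},
    "PAID": {"CONFIRMED", "CANCELLED"},
    "CONFIRMED": set(),   # terminal
    "CANCELLED": set(),   # terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ORDER_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng=random) -> list:
    """Full-jitter exponential backoff.

    Delay n is drawn uniformly from [0, min(cap, base * 2**n)];
    `attempts` is the retry cap.
    """
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


state = transition("PENDING", "PAID")
delays = backoff_delays(5, rng=random.Random(42))  # seeded for repeatability
```

Rejecting illegal transitions at one choke point means a late or duplicate event (for example, a payment confirmation arriving after cancellation) fails loudly instead of silently corrupting the order.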
5) TCC (Try‑Confirm/Cancel) when you can lock capacity
- Try: Tentatively reserve a resource (payment hold, stock hold).
- Confirm: Commit the reservation on success.
- Cancel: Release on failure/timeout.
- Works well for short‑lived reservations with explicit expiry.
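A minimal in-memory sketch of the three operations with explicit expiry, assuming a single stock counter. `StockReservations` and its method names are hypothetical, and a real implementation would persist holds in the owning service's database.

```python
import time
import uuid


class StockReservations:
    """In-memory Try/Confirm/Cancel over a single stock counter."""

    def __init__(self, available: int, clock=time.monotonic):
        self.available = available
        self.holds = {}  # hold_id -> (qty, expires_at)
        self.clock = clock

    def try_reserve(self, qty: int, ttl_seconds: float) -> str:
        """Try: tentatively take stock out of the available pool."""
        self.expire_holds()
        if qty > self.available:
            raise ValueError("insufficient stock")
        self.available -= qty
        hold_id = str(uuid.uuid4())
        self.holds[hold_id] = (qty, self.clock() + ttl_seconds)
        return hold_id

    def confirm(self, hold_id: str) -> bool:
        """Confirm: make the hold permanent; False if expired or cancelled."""
        self.expire_holds()
        return self.holds.pop(hold_id, None) is not None

    def cancel(self, hold_id: str) -> None:
        """Cancel: release the held quantity; safe to call more than once."""
        hold = self.holds.pop(hold_id, None)
        if hold is not None:
            self.available += hold[0]

    def expire_holds(self) -> None:
        """Release every hold whose expiry time has passed."""
        now = self.clock()
        for hold_id, (_, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                self.cancel(hold_id)


now = [0.0]  # fake clock so the example does not need to sleep
stock = StockReservations(available=5, clock=lambda: now[0])
h1 = stock.try_reserve(2, ttl_seconds=60)
confirmed = stock.confirm(h1)   # stock stays reduced by 2
h2 = stock.try_reserve(3, ttl_seconds=60)
now[0] = 61.0                   # h2 expires without a confirm
late = stock.confirm(h2)        # too late: the 3 units were released
```

Because `confirm` can return `False`, the caller needs a path for a confirm that arrives after expiry, typically retrying the Try step or compensating the rest of the flow.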
6) When 2PC is acceptable
- Inside a single service boundary (e.g., one Postgres instance with multiple schemas), or
- Within one data store that guarantees atomic multi‑record changes.
Avoid across services.
7) Payloads & contracts
- Commands vs Events:
  - Command (ReserveInventory) expects a success/failure reply/event.
  - Event (InventoryReserved) is a fact; consumers decide what to do.
- Version events: type=v2, evolve via expand‑contract.
- Include correlationId, causationId, sagaId, and occurredAt in every message.
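One possible shape for such a message envelope is sketched below. The four ID and timestamp fields come from the list above; `messageId`, the integer `version` field, and the `caused` helper are assumptions added so that `causationId` has something to point at.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _new_id() -> str:
    return str(uuid.uuid4())


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Envelope:
    type: str                          # e.g. "InventoryReserved"
    version: int                       # schema version of the payload
    sagaId: str
    correlationId: str                 # same for every message in one flow
    payload: dict
    causationId: Optional[str] = None  # id of the message that caused this one
    messageId: str = field(default_factory=_new_id)
    occurredAt: str = field(default_factory=_now)

    def caused(self, msg_type: str, version: int, payload: dict) -> "Envelope":
        """Build a follow-up message that keeps the saga and correlation ids."""
        return Envelope(msg_type, version, self.sagaId, self.correlationId,
                        payload, causationId=self.messageId)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


command = Envelope("ReserveInventory", 1, sagaId="saga-1",
                   correlationId="corr-1", payload={"orderId": "o-1"})
event = command.caused("InventoryReserved", 2, {"orderId": "o-1"})
```

Deriving replies through `caused` keeps the chain intact by construction: `correlationId` groups the whole flow, while `causationId` links each message to its direct parent.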
8) Observability for transactions
- Trace context (W3C traceparent) propagated in messages.
- Business metrics: success/fail counts per step, average saga duration, compensation rate.
- Audit log: append‑only journal of saga transitions.
9) Minimal code sketch (orchestrated saga)
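// Orchestrator (pseudocode)
startSaga(orderId):
  // reuse the same idempotencyKey on every retry so Payment can deduplicate
  send(Command.Payment.Authorize(orderId), idempotencyKey)
  await event PaymentAuthorized | PaymentFailed timeout T1
  if PaymentAuthorized:
    send(Command.Inventory.Reserve(orderId))
    await event InventoryReserved | InventoryFailed timeout T2
    if InventoryReserved:
      send(Command.Order.Confirm(orderId))
    else:  // InventoryFailed or timeout T2
      send(Command.Payment.Refund(orderId))
      send(Command.Order.Cancel(orderId))
  else:  // PaymentFailed or timeout T1
    // on timeout the payment outcome is unknown; an idempotent void/refund is a safe extra step
    send(Command.Order.Cancel(orderId))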
Service handler (idempotent)
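-- Process PaymentAuthorized event once (Postgres-flavoured pseudo-SQL).
-- Run the dedupe insert and the state change in ONE transaction.
BEGIN;
INSERT INTO processed_messages(id) VALUES (:eventId)
ON CONFLICT (id) DO NOTHING;
-- Only proceed if the insert affected one row, i.e. this is the first delivery
IF row_count = 1 THEN
  -- table is named orders because ORDER is a reserved word in SQL
  UPDATE orders SET status = 'PAID' WHERE id = :orderId AND status = 'PENDING';
END IF;
COMMIT;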
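The pseudo-SQL above can be exercised end to end with Python's built-in `sqlite3`. This is a sketch under a few assumptions: SQLite's `INSERT OR IGNORE` stands in for `ON CONFLICT DO NOTHING`, the table is named `orders` because `ORDER` is a reserved word, and `on_payment_authorized` is a hypothetical handler name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processed_messages (id TEXT PRIMARY KEY);
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
INSERT INTO orders VALUES ('o-1', 'PENDING');
""")


def on_payment_authorized(event_id: str, order_id: str) -> bool:
    """Return True if the event was applied, False if it was a duplicate."""
    with conn:  # dedupe record and state change commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_messages (id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount != 1:
            return False  # already processed: skip the side effect
        conn.execute(
            "UPDATE orders SET status = 'PAID' "
            "WHERE id = ? AND status = 'PENDING'",
            (order_id,),
        )
    return True


first = on_payment_authorized("evt-1", "o-1")
second = on_payment_authorized("evt-1", "o-1")  # redelivery of the same event
status = conn.execute("SELECT status FROM orders WHERE id = 'o-1'").fetchone()[0]
```

Keeping both statements in one transaction matters: if the update fails and rolls back, the dedupe row rolls back with it, so a retry of the same event is still processed.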
10) Quick decision guide
- Simple two‑step flows? Choreography + Outbox.
- Long/branching flows? Orchestrator + Outbox/CDC.
- Short‑lived capacity reservations? TCC.
- High monetary risk? Extra safeguards: manual compensation queue, escrow states, stronger reconciliation jobs.
11) Common pitfalls (avoid)
- Dual writes without outbox/CDC.
- Missing timeouts/compensations.
- Non‑idempotent consumers.
- Business rules hidden in broker configuration (keep logic in services, not in topic names or routing rules).
- No visibility into saga state.