Distributed Transactions in Microservices

0) First principle

  • Do NOT use XA/2PC across services. It is slow, fragile (a stalled coordinator blocks every participant), and ties you to stores and brokers that support it.
  • Prefer eventual consistency with compensations and idempotency.

1) Sagas (the default)

What

A Saga is a workflow of local transactions across services. Each step has a forward action and, if needed, a compensating action.

Styles

  • Choreography (event‑driven): Services subscribe to each other’s events and react.
    • ✅ Simple, no central brain; good for short linear flows.
    • ❌ Harder to see/trace; risk of “event spaghetti.”
  • Orchestration (controller): A saga orchestrator sends commands and awaits replies/events.
    • ✅ Clear control/visibility; easier to time out/compensate.
    • ❌ One more service to run; keep it thin to avoid “god service.”

Example (Order → Payment → Inventory)

  1. Order creates PENDING order; emits OrderCreated.
  2. Payment reserves/charges; emits PaymentAuthorized or PaymentFailed.
  3. Inventory reserves items; emits InventoryReserved or InventoryFailed.
  4. Order transitions to CONFIRMED on success; otherwise runs compensations:
    1. If payment fails → set order CANCELLED (nothing to undo yet).
    2. If inventory fails → emit PaymentRefund and set order CANCELLED.

Compensations: reverse business effects (refund, release stock), not DB rollbacks.
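
To make the forward/compensate pairing concrete, here is a minimal in-memory sketch of the Order → Payment → Inventory example. The names SagaStep, run_saga, and place_order, and the dictionary used as shared context, are assumptions made for illustration; in a real system each step would be a call or message to a separate service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]        # forward local transaction
    compensation: Callable[[dict], None]  # business-level undo, not a DB rollback


def run_saga(steps: List[SagaStep], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            for done in reversed(completed):
                done.compensation(ctx)
            return False
    return True


# Hypothetical stand-ins for the three services.
def create_order(ctx: dict) -> None:
    ctx["order"] = "PENDING"
    ctx["log"].append("OrderCreated")


def cancel_order(ctx: dict) -> None:
    ctx["order"] = "CANCELLED"
    ctx["log"].append("OrderCancelled")


def charge_payment(ctx: dict) -> None:
    ctx["payment"] = "AUTHORIZED"
    ctx["log"].append("PaymentAuthorized")


def refund_payment(ctx: dict) -> None:
    ctx["payment"] = "REFUNDED"
    ctx["log"].append("PaymentRefunded")


def reserve_inventory(ctx: dict) -> None:
    if ctx["stock"] < ctx["quantity"]:
        raise RuntimeError("out of stock")
    ctx["stock"] -= ctx["quantity"]
    ctx["log"].append("InventoryReserved")


def release_inventory(ctx: dict) -> None:
    ctx["stock"] += ctx["quantity"]
    ctx["log"].append("InventoryReleased")


ORDER_SAGA = [
    SagaStep("order", create_order, cancel_order),
    SagaStep("payment", charge_payment, refund_payment),
    SagaStep("inventory", reserve_inventory, release_inventory),
]


def place_order(stock: int, quantity: int) -> dict:
    ctx = {"stock": stock, "quantity": quantity, "log": []}
    if run_saga(ORDER_SAGA, ctx):
        ctx["order"] = "CONFIRMED"
    return ctx
```

Calling place_order(1, 2) fails at the inventory step, so the refund and the order cancellation run in reverse order. The sketch assumes compensations always succeed; in practice they need their own retries and a manual fallback.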


2) Transactional Outbox + CDC (make events reliable)

Problem

A service writes to its DB and then publishes an event; if the process crashes in between, the state change is committed but the event is never sent, and the DB and the event stream drift apart.

Solution

  • Transactional Outbox: In the same DB transaction as the state change, insert an outbox record.
  • A background relay (or Debezium CDC) reads the outbox table/DB log and publishes to the broker at least once.
  • Consumers must be idempotent.

Outbox schema (example)

outbox(id PK, aggregate_id, event_type, payload JSON, created_at, published BOOLEAN)
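
As a rough illustration of the write path and a polling relay, here is a sketch using SQLite from the Python standard library. The orders table, the function names, and the publish callback are assumptions for the example; a production relay would talk to a real broker, and a CDC setup would replace the polling loop entirely.

```python
import json
import sqlite3
import uuid
from typing import Callable


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def create_order(db: sqlite3.Connection, order_id: str) -> None:
    # One local transaction: the state change and the outbox row
    # commit together or roll back together.
    with db:
        db.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,)
        )
        db.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderCreated",
             json.dumps({"orderId": order_id})),
        )


def relay_once(db: sqlite3.Connection,
               publish: Callable[[str, str, dict], None]) -> int:
    """Publish unpublished rows in insertion order; return how many were sent."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY created_at, rowid"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried later.
        publish(event_id, event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)
```

If the relay crashes after publish but before marking the row, the event is sent again on the next pass. That is the at-least-once behaviour, and it is why the outbox row id travels with the event as a dedupe key.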

3) Idempotency & exactly-once (the reality)

  • Assume at‑least‑once delivery; design idempotent handlers:
    • Use idempotency keys (e.g., operationId) persisted in a “processed” table.
    • Use upserts/ON CONFLICT DO NOTHING or compare version numbers (optimistic locking).
    • Make side effects (emails, payments) deduplicable.
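
The version-comparison option can be sketched as a single conditional update. The orders table with a version column and the apply_event function are assumptions for this example, and it further assumes each event carries the aggregate version it produces.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            version INTEGER NOT NULL
        );
        INSERT INTO orders VALUES ('o1', 'PENDING', 1);
    """)
    return db


def apply_event(db: sqlite3.Connection, order_id: str,
                event_version: int, new_status: str) -> bool:
    """Apply an event only if it is newer than the stored version.

    Redeliveries and stale events match zero rows and become no-ops.
    """
    with db:
        cur = db.execute(
            "UPDATE orders SET status = ?, version = ? "
            "WHERE id = ? AND version < ?",
            (new_status, event_version, order_id, event_version),
        )
        return cur.rowcount == 1
```

Delivering the same event twice changes the row once; the second call returns False. This variant silently skips gaps in the version sequence, which may or may not be acceptable depending on the aggregate.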

4) State, time, and failure semantics

  • State machines: Model each aggregate (Order, Payment) with explicit states and valid transitions.
  • Timeouts: Every step must have a timeout and a recovery path (e.g., auto‑cancel after 15 min).
  • Retries with jitter: Back off on transient failures; cap retries.
  • Poison messages: Send to DLQ with enough context to replay safely.
  • User experience: Surface intermediate states (e.g., “Confirming payment…”) and reconcile asynchronously.
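
Three of these points fit in a few lines each: an explicit transition table, a timeout that auto-cancels, and capped backoff with jitter. The state names follow the Order example in this post; the function names, the 15-minute default, and the backoff parameters are illustrative assumptions.

```python
import random
from typing import Dict, List, Optional, Set

# Allowed Order transitions; anything not listed is rejected.
ORDER_TRANSITIONS: Dict[str, Set[str]] = {
    "PENDING": {"PAID", "CANCELLED"},
    "PAID": {"CONFIRMED", "CANCELLED"},
    "CONFIRMED": set(),   # terminal
    "CANCELLED": set(),   # terminal
}


def transition(current: str, target: str) -> str:
    if target not in ORDER_TRANSITIONS[current]:
        raise ValueError(f"invalid transition {current} -> {target}")
    return target


def expire_if_stale(state: str, age_seconds: float,
                    timeout_seconds: float = 15 * 60) -> str:
    """Auto-cancel an order that has sat in a non-terminal state too long."""
    if ORDER_TRANSITIONS[state] and age_seconds >= timeout_seconds:
        return transition(state, "CANCELLED")
    return state


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]
```

Once the list from backoff_delays is exhausted, the message would go to the DLQ rather than being retried forever.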

5) TCC (Try‑Confirm/Cancel) when you can lock capacity

  • Try: Tentatively reserve a resource (payment hold, stock hold).
  • Confirm: Commit the reservation on success.
  • Cancel: Release on failure/timeout.
  • Works well for short‑lived reservations with explicit expiry.
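
A minimal in-memory sketch of the three TCC operations for a stock hold with expiry. The StockReservations class, its method names, and the injectable clock are assumptions for illustration; a real implementation would persist holds and expire them in the owning service's database.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class StockReservations:
    def __init__(self, available: int, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.available = available
        self.ttl = ttl_seconds
        self.clock = clock
        self.holds: Dict[str, Tuple[int, float]] = {}  # id -> (qty, expires_at)

    def _expire(self) -> None:
        # Release every hold whose expiry has passed.
        now = self.clock()
        for rid, (_qty, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                self.cancel(rid)

    def try_reserve(self, qty: int) -> Optional[str]:
        """Try: tentatively hold stock; returns a reservation id or None."""
        self._expire()
        if qty > self.available:
            return None
        self.available -= qty
        rid = str(uuid.uuid4())
        self.holds[rid] = (qty, self.clock() + self.ttl)
        return rid

    def confirm(self, rid: str) -> bool:
        """Confirm: make the hold permanent; False if unknown or expired."""
        self._expire()
        return self.holds.pop(rid, None) is not None

    def cancel(self, rid: str) -> bool:
        """Cancel: release the hold; safe to call more than once."""
        hold = self.holds.pop(rid, None)
        if hold is None:
            return False
        self.available += hold[0]
        return True
```

A confirm that arrives after the expiry returns False and the stock is already back in the pool, so the caller has to treat a late confirm as a failed step and compensate.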

6) When 2PC is acceptable

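  • Inside a single service boundary, when that one service must atomically update several resources it owns (e.g., two databases that both support XA).
  • Note that a single data store with atomic multi‑record changes (e.g., one Postgres instance with multiple schemas) needs only an ordinary local transaction, not 2PC.
  • Avoid it across services.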

7) Payloads & contracts

  • Commands vs Events:
    • Command (ReserveInventory) expects a success/failure reply/event.
    • Event (InventoryReserved) is a fact; consumers decide what to do.
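  • Version events: carry an explicit version (e.g., type=v2) and evolve via expand‑contract (add the new fields, migrate consumers, then remove the old fields).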
  • Include correlationId, causationId, sagaId, and occurredAt in every message.
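
One possible shape for such a message envelope, sketched as a frozen dataclass. The Envelope class, the messageId field, and the caused helper are assumptions for illustration; only the four metadata fields and the version come from the list above.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def _new_id() -> str:
    return str(uuid.uuid4())


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Envelope:
    type: str                          # e.g. "ReserveInventory", "InventoryReserved"
    version: int                       # schema version of the payload
    sagaId: str                        # which saga instance this belongs to
    correlationId: str                 # ties all messages of one request together
    payload: Dict[str, Any]
    causationId: Optional[str] = None  # messageId of the message that caused this one
    messageId: str = field(default_factory=_new_id)
    occurredAt: str = field(default_factory=_utc_now)

    def caused(self, msg_type: str, version: int,
               payload: Dict[str, Any]) -> "Envelope":
        """Build a follow-up message in the same saga, caused by this one."""
        return Envelope(
            type=msg_type,
            version=version,
            sagaId=self.sagaId,
            correlationId=self.correlationId,
            payload=payload,
            causationId=self.messageId,
        )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

A handler that receives a ReserveInventory command would reply with command.caused("InventoryReserved", 1, {...}), which keeps sagaId and correlationId intact and lets you rebuild the causal chain from the audit log.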

8) Observability for transactions

  • Trace context (W3C traceparent) propagated in messages.
  • Business metrics: success/fail counts per step, average saga duration, compensation rate.
  • Audit log: append‑only journal of saga transitions.

9) Minimal code sketch (orchestrated saga)

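// Orchestrator
startSaga(orderId):
  key = "saga-" + orderId                 // stable idempotency key, reused on every retry
  send(Command.Payment.Authorize(orderId), key)
  await event PaymentAuthorized | PaymentFailed timeout T1
  if PaymentAuthorized:
     send(Command.Inventory.Reserve(orderId), key)
     await event InventoryReserved | InventoryFailed timeout T2
     if InventoryReserved:
        send(Command.Order.Confirm(orderId), key)
     else:                                // InventoryFailed, or timeout T2 elapsed
        send(Command.Payment.Refund(orderId), key)
        send(Command.Order.Cancel(orderId), key)
  else:                                   // PaymentFailed, or timeout T1 elapsed
     // After a timeout the charge may still have gone through, so refund anyway;
     // this assumes Refund is a no-op when nothing was charged.
     send(Command.Payment.Refund(orderId), key)
     send(Command.Order.Cancel(orderId), key)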

Service handler (idempotent)

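-- Process a PaymentAuthorized event once (PostgreSQL syntax).
-- The dedupe insert and the state change run as one atomic statement.
-- The table is named orders because ORDER is a reserved word.
WITH first_seen AS (
  INSERT INTO processed_messages(id) VALUES (:eventId)
  ON CONFLICT DO NOTHING
  RETURNING id
)
-- Only proceed if first time: first_seen is empty for a duplicate
UPDATE orders SET status = 'PAID'
WHERE id = :orderId AND status = 'PENDING'
  AND EXISTS (SELECT 1 FROM first_seen);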

10) Quick decision guide

  • Simple two‑step flows? Choreography + Outbox.
  • Long/branching flows? Orchestrator + Outbox/CDC.
  • Short‑lived capacity reservations? TCC.
  • High monetary risk? Extra safeguards: manual compensation queue, escrow states, stronger reconciliation jobs.

11) Common pitfalls (avoid)

  • Dual writes without outbox/CDC.
  • Missing timeouts/compensations.
  • Non‑idempotent consumers.
  • Business rules hidden in the broker (keep logic in services, not in topic names or routing rules).
  • No visibility into saga state.