How to Find Memory Leaks and Fix Them

1) Spot the leak (symptoms)

  • Process RSS/heap grows steadily and never comes down.
  • GCs become more frequent (including full GCs), old‑gen occupancy keeps rising after each collection, and pauses lengthen.
  • OOMKilled (K8s) or OutOfMemoryError in logs.
  • Throughput degrades over time without more traffic.
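
A quick way to confirm the trend on a live JVM is jstat: old‑gen occupancy (the O column) that climbs back up after every full GC is the classic leak signature.

jstat -gcutil <pid> 5000   # GC utilization summary every 5 seconds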

2) Reproduce & measure

  • Reproduce with realistic load (e.g., JMeter/k6).
  • Baseline metrics: heap used, GC time, RSS, objects allocated/sec, open FDs.
  • In containers, note memory limits (how -Xmx relates to the cgroup limit; leave headroom for metaspace, threads, and off‑heap).
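
For a baseline you can script, the JMX platform MXBeans expose heap usage and cumulative GC time; a minimal sketch (class name and the 5 s interval are illustrative):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapBaseline {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long gcMs = ManagementFactory.getGarbageCollectorMXBeans().stream()
                    .mapToLong(GarbageCollectorMXBean::getCollectionTime)
                    .sum();
            System.out.printf("heap used=%d MB committed=%d MB gc=%d ms%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, gcMs);
            Thread.sleep(5_000); // a leak shows up as a rising floor after each GC
        }
    }
}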

3) Capture evidence

Java

  • Enable GC logs (JDK 9+ unified logging): -Xlog:gc*:file=gc.log:time,uptime,level,tags
  • Take heap dumps at high usage or OOM:
    • On OOM: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
    • On demand: jcmd <pid> GC.heap_dump /tmp/heap.hprof or jmap -dump:format=b,file=/tmp/heap.hprof <pid>
  • Take JFR profile for allocation hot spots: jcmd <pid> JFR.start name=leak settings=profile filename=app.jfr
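
Heap dumps can also be triggered from code via the HotSpot diagnostic MXBean (HotSpot JVMs only); a minimal sketch, class name illustrative:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diag.dumpHeap(path, true); // live=true keeps only reachable objects (forces a GC first)
    }
}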

Node / Python / .NET (quick notes)

  • Node: Chrome DevTools → Heap snapshot; clinic heapprofiler.
  • Python: tracemalloc, objgraph, guppy3/heapy.
  • .NET: dotMemory / PerfView; dump with dotnet-gcdump, analyze with Visual Studio.

4) Analyze the heap

  • Open dump in Eclipse MAT or YourKit/VisualVM.
  • Look for:
    • Leak suspects / dominator tree: objects retaining large subgraphs.
    • Growing collections (e.g., HashMap, ArrayList, caches) with stack traces to allocation sites.
    • Class histograms: jcmd <pid> GC.class_histogram.
  • Cross‑check with JFR allocation flame graphs to find hot allocation paths.
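
A cheap growth check is to diff two class histograms taken a few minutes apart under steady load; whatever keeps climbing is your suspect:

jcmd <pid> GC.class_histogram > histo-1.txt
# ...wait a few minutes under load...
jcmd <pid> GC.class_histogram > histo-2.txt
diff histo-1.txt histo-2.txt | head -20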

5) Usual culprits (Java)

  • Unbounded caches/collections (often static singletons).
    • ✅ Fix: add size limits & eviction (Caffeine/Guava), time‑based expiry, or Weak/SoftReference where appropriate.
  • Listeners/observers not removed, event bus subscribers lingering.
    • ✅ Fix: unsubscribe on close; use weak listeners if supported (a sketch follows this list).
  • ThreadLocals not cleared (esp. in pools).
    • ✅ Fix: try { ... } finally { threadLocal.remove(); }
  • Connections/streams not closed (JDBC, HTTP, I/O).
    • ✅ Fix: use try‑with‑resources; pool with max‑lifetime; leak detection in HikariCP.
  • Classloader leaks in app servers (static caches, threads preventing unload).
    • ✅ Fix: stop non‑daemon threads on shutdown; avoid static refs to app classes; verify libraries are container‑friendly.
  • Logging/backpressure issues buffering in memory.
    • ✅ Fix: async logging with bounded queues; drop/flush policies.
  • JSON/XML mappers reusing builders incorrectly.
    • ✅ Fix: reuse safely or create per‑request; avoid holding on to full payloads.
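
For the listener/observer culprit above, a minimal sketch of the unsubscribe‑on‑close pattern; EventBus and Widget are hypothetical stand‑ins for whatever event API you use:

import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

// Hypothetical bus; stands in for Guava EventBus, GUI listeners, etc.
class EventBus {
    private final Set<Runnable> listeners = new CopyOnWriteArraySet<>();
    void subscribe(Runnable l)   { listeners.add(l); }
    void unsubscribe(Runnable l) { listeners.remove(l); }
}

class Widget implements AutoCloseable {
    private final EventBus bus;
    private final Runnable onEvent = () -> System.out.println("event");

    Widget(EventBus bus) {
        this.bus = bus;
        bus.subscribe(onEvent);
    }

    @Override
    public void close() {
        bus.unsubscribe(onEvent); // without this, the bus keeps every Widget reachable
    }
}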

6) Rectify systematically (step‑by‑step)

  1. Pinpoint allocation site (JFR/stack trace from MAT).
  2. Identify retention path (who holds it alive?) via dominator tree.
  3. Introduce bounds / lifecycle hooks (eviction, close, unsubscribe).
  4. Prove the fix:
    1. Re‑run load → heap stabilizes after GC, old‑gen plateau, GC time down.
    2. Compare before/after GC logs & heap histograms.
  5. Add guardrails:
    1. Leak‑detection in pools (Hikari leakDetectionThreshold; see the sketch after this list).
    2. Bounded queues, timeouts, backpressure.
    3. Canary + monitors (heap used %, GC time %, RSS vs Xmx).
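
For the Hikari leak‑detection guardrail above, a minimal sketch (URL, pool size, and timings are illustrative):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db:5432/app"); // illustrative
config.setMaximumPoolSize(20);
config.setMaxLifetime(30 * 60 * 1000L);   // recycle connections every 30 min
config.setLeakDetectionThreshold(10_000); // warn + stack trace if a connection is held > 10 s
HikariDataSource ds = new HikariDataSource(config);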

7) Quick Java examples

A) Unbounded cache → bounded with Caffeine

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

Cache<String, User> userCache = Caffeine.newBuilder()
    .maximumSize(100_000)                     // hard cap on entries
    .expireAfterWrite(Duration.ofMinutes(10)) // time-based eviction
    .recordStats()                            // expose hit/eviction metrics
    .build();

B) ThreadLocal cleanup

import java.text.SimpleDateFormat;
import java.util.Date;

private static final ThreadLocal<SimpleDateFormat> F = ThreadLocal.withInitial(
    () -> new SimpleDateFormat("yyyy-MM-dd") // SimpleDateFormat is not thread-safe
);

static String format(Date date) {
    try {
        return F.get().format(date);
    } finally {
        F.remove(); // important when threads are pooled; otherwise each thread retains its copy
    }
}

C) Always close resources

try (Connection c = ds.getConnection();
     PreparedStatement ps = c.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {
    // use rs
} // auto-closed

D) MAT workflow

  • Open the dump → run the Leak Suspects Report → inspect the top dominators.
  • Right‑click the suspect collection → Path to GC Roots → find the static/singleton holder.
  • Patch code, redeploy, retest.

8) Native/off‑heap leaks

  • Check direct ByteBuffers, JNI allocations, Netty buffer arenas, and native image libraries.
  • Track with NMT (Java): -XX:NativeMemoryTracking=detail + jcmd <pid> VM.native_memory summary.
  • Ensure Netty buffers are released; cap off‑heap usage (e.g., -XX:MaxDirectMemorySize).
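
For Netty specifically, reference‑counted buffers must be released explicitly; a minimal sketch of the try/finally pattern (the payload is illustrative):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

ByteBuf buf = ByteBufAllocator.DEFAULT.directBuffer(1024); // off-heap allocation
try {
    buf.writeBytes(new byte[] {1, 2, 3}); // use the buffer
} finally {
    buf.release(); // drops the refcount; native memory is freed when it hits 0
}

During tests, Netty's built‑in leak detector (-Dio.netty.leakDetection.level=paranoid) reports buffers that were garbage‑collected without release().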

9) Production hygiene (prevention)

  • Set sensible Xms/Xmx, container memory limits, and alerts (heap >80%, GC >10% CPU).
  • Autoscaling on CPU + GC time, not just requests.
  • SLOs for latency; dashboards for heap/GC/RSS/open FDs.
  • Regular heap snapshots in staging under load.
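
As a hedged starting point for containerized JVMs (percentages and paths are illustrative; tune to your workload):

-XX:MaxRAMPercentage=75.0
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
-Xlog:gc*:file=/logs/gc.log:time,uptime,level,tags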