How to Find Memory Leaks & Fix Them?
1) Spot the leak (symptoms)
- Process RSS/heap grows steadily and never comes down.
- More frequent/full GCs, rising old‑gen after each GC, GC pauses lengthen.
- OOMKilled (K8s) or `OutOfMemoryError` in logs.
- Throughput degrades over time without more traffic.
2) Reproduce & measure
- Reproduce with realistic load (e.g., JMeter/k6).
- Baseline metrics: heap used, GC time, RSS, objects allocated/sec, open FDs.
- In containers, note limits (`-Xmx` vs cgroup limits).
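For the baseline it can help to log heap and GC counters from inside the JVM as well as from the outside. Here is a minimal sketch using the standard `java.lang.management` MXBeans; the class name and output format are just for illustration:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapBaseline {
    public static void main(String[] args) {
        // Heap in use vs. committed vs. max (roughly what -Xmx allows)
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);

        // Cumulative GC counts and time; sample periodically to turn these into rates
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d totalTime=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```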
3) Capture evidence
Java
- Enable GC logs: `-Xlog:gc*:file=gc.log:tags,uptime,level`
- Take heap dumps at high usage or at OOM:
  - On OOM: `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps`
  - On demand: `jcmd <pid> GC.heap_dump /tmp/heap.hprof` or `jmap -dump:format=b,file=/tmp/heap.hprof <pid>`
- Take a JFR profile for allocation hot spots: `jcmd <pid> JFR.start name=leak settings=profile filename=app.jfr`
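If exec access to the container is awkward, a heap dump can also be triggered from inside the application. A minimal sketch using the HotSpot-specific `HotSpotDiagnosticMXBean` (only available on HotSpot-based JVMs; the `HeapDumper` wrapper is illustrative):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true dumps only objects that are still reachable
        diag.dumpHeap(path, true);
    }
}
```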
Node / Python / .NET (quick notes)
- Node: Chrome DevTools → Heap snapshot; `clinic heapprofiler`.
- Python: `tracemalloc`, `objgraph`, `guppy3`/heapy.
- .NET: dotMemory / PerfView; dump with `dotnet-gcdump`, analyze with Visual Studio.
4) Analyze the heap
- Open dump in Eclipse MAT or YourKit/VisualVM.
- Look for:
  - Leak suspects / dominator tree: objects retaining large subgraphs.
  - Growing collections (e.g., `HashMap`, `ArrayList`, caches) with stack traces to allocation sites.
  - Class histograms: `jcmd <pid> GC.class_histogram`
- Cross‑check with JFR allocation flame graphs to find hot allocation paths.
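For orientation, the retention paths these tools report typically end at code like the following deliberately leaky sketch (the `SessionRegistry` name and payload are purely illustrative): a static collection that only ever grows, so the dominator tree shows it retaining a large subgraph.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SessionRegistry {
    // Classic leak suspect: a static map that is added to on every request
    // and never evicted, so it keeps every entry reachable for the process lifetime.
    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    static void remember(String sessionId, byte[] payload) {
        CACHE.put(sessionId, payload); // no size bound, no expiry, no remove on logout
    }
}
```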
5) Usual culprits (Java)
- Unbounded caches/collections (often static singletons).
  - ✅ Fix: add size limits & eviction (Caffeine/Guava), time‑based expiry, or Weak/SoftReference where appropriate.
- Listeners/observers not removed, event bus subscribers lingering.
  - ✅ Fix: unsubscribe on close (see the sketch after this list); use weak listeners if supported.
- ThreadLocals not cleared (esp. in pools).
  - ✅ Fix: `try { ... } finally { threadLocal.remove(); }`
- Connections/streams not closed (JDBC, HTTP, I/O).
  - ✅ Fix: use try‑with‑resources; pool with max‑lifetime; leak detection in HikariCP.
- Classloader leaks in app servers (static caches, threads preventing unload).
  - ✅ Fix: stop non‑daemon threads on shutdown; avoid static refs to app classes; verify libraries are container‑friendly.
- Logging/backpressure issues buffering in memory.
  - ✅ Fix: async logging with bounded queues; drop/flush policies.
- JSON/XML mappers reusing builders incorrectly.
  - ✅ Fix: reuse safely or create per‑request; avoid holding on to full payloads.
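For the listener/observer case above, a minimal sketch of tying the subscription to the object's lifecycle; `EventBus`, `Listener`, and `PriceWatcher` are hypothetical stand-ins for whatever event API you actually use:

```java
// Illustrative only: minimal event-bus types, not a real library API.
interface Listener { void onEvent(Object event); }
interface EventBus {
    void register(Listener l);
    void unregister(Listener l);
}

final class PriceWatcher implements AutoCloseable {
    private final EventBus bus;
    private final Listener listener = event -> { /* handle update */ };

    PriceWatcher(EventBus bus) {
        this.bus = bus;
        bus.register(listener); // the bus now holds a strong reference to this object
    }

    @Override
    public void close() {
        bus.unregister(listener); // forgetting this keeps every PriceWatcher alive forever
    }
}
```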
6) Rectify systematically (step‑by‑step)
- Pinpoint allocation site (JFR/stack trace from MAT).
- Identify retention path (who holds it alive?) via dominator tree.
- Introduce bounds / lifecycle hooks (eviction, close, unsubscribe).
- Prove the fix:
  - Re‑run load → heap stabilizes after GC, old‑gen plateaus, GC time down.
  - Compare before/after GC logs & heap histograms.
- Add guardrails:
  - Leak detection in pools (Hikari `leakDetectionThreshold`; see the sketch below).
  - Bounded queues, timeouts, backpressure.
  - Canary + monitors (heap used %, GC time %, RSS vs `-Xmx`).
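A minimal sketch of the HikariCP guardrails mentioned above; the JDBC URL, pool size, and thresholds are placeholder values to adapt:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db:5432/app");
config.setMaximumPoolSize(20);
config.setMaxLifetime(30 * 60 * 1000L);       // recycle connections after 30 minutes
config.setLeakDetectionThreshold(10_000);     // warn if a connection is held longer than 10 s
HikariDataSource ds = new HikariDataSource(config);
```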
7) Quick Java examples
A) Unbounded cache → bounded with Caffeine
```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

Cache<String, User> userCache = Caffeine.newBuilder()
    .maximumSize(100_000)                        // hard cap on entries
    .expireAfterWrite(Duration.ofMinutes(10))    // time-based eviction
    .recordStats()
    .build();
```
B) ThreadLocal cleanup
```java
import java.text.SimpleDateFormat;
import java.util.Date;

private static final ThreadLocal<SimpleDateFormat> F =
    ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

String format(Date date) {
    try {
        return F.get().format(date);
    } finally {
        F.remove(); // important when threads are pooled
    }
}
```
C) Always close resources
```java
try (Connection c = ds.getConnection();
     PreparedStatement ps = c.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {
    // use rs
} // auto-closed
```
D) MAT workflow
- Open dump → Leak Suspects Report → inspect top dominators.
- Right‑click suspect collection → Path to GC Roots → find static/singleton holder.
- Patch code, redeploy, retest.
8) Native/off‑heap leaks
- Check direct byte buffers, JNI, Netty arenas, image libraries.
- Track with NMT (Java): `-XX:NativeMemoryTracking=detail` plus `jcmd <pid> VM.native_memory summary`
- Ensure Netty buffers are released (see the sketch below); cap off‑heap sizes.
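For the Netty case, a minimal sketch of the release pattern when handling reference-counted buffers by hand (handlers extending `SimpleChannelInboundHandler` release inbound messages for you):

```java
import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;

void onMessage(ByteBuf msg) {
    try {
        // read from msg ...
    } finally {
        // Reference-counted buffers are pooled/off-heap; an unreleased buffer
        // never returns to its arena and shows up as native memory growth.
        ReferenceCountUtil.release(msg);
    }
}
```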
9) Production hygiene (prevention)
- Set sensible `-Xms`/`-Xmx`, container memory limits, and alerts (heap >80%, GC >10% CPU).
- Autoscaling on CPU + GC time, not just requests.
- SLOs for latency; dashboards for heap/GC/RSS/open FDs.
- Regular heap snapshots in staging under load.