How does Apache Kafka prevent duplicates?
Short answer: you can’t 100% prevent duplicates with a plain Kafka consumer.
Kafka guarantees at-least-once delivery by default. To avoid processing an event twice you must either (a) use Kafka's exactly-once semantics (EOS) when you produce results back to Kafka, or (b) make your consumer idempotent, so reprocessing is harmless. Here are the practical paths:
Option A — Kafka “Exactly-Once” (read → process → write to Kafka)
Use a transactional, idempotent producer and commit the consumer offsets in the same transaction as the output writes.
Key configs
- Producer: enable.idempotence=true, acks=all, and a stable transactional.id (unique per app instance).
- Consumer: isolation.level=read_committed, enable.auto.commit=false.
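As a sketch, those settings map onto the consumerProps and producerProps used in the flow below roughly like this (broker address, serializers, group id, and transactional.id are placeholder values):

// Producer: idempotent + transactional
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("enable.idempotence", "true");
producerProps.put("acks", "all");
producerProps.put("transactional.id", "my-app-instance-1"); // stable and unique per app instance

// Consumer: only reads committed data; offsets are committed via the producer's transaction
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("group.id", "my-consumer-group");
consumerProps.put("isolation.level", "read_committed");
consumerProps.put("enable.auto.commit", "false");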
Flow (Java, high-level)
var consumer = new KafkaConsumer<String, String>(consumerProps);
var producer = new KafkaProducer<String, String>(producerProps);
consumer.subscribe(List.of("in-topic")); // source topic (name is illustrative)
producer.initTransactions();

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) continue;

    producer.beginTransaction();
    try {
        for (ConsumerRecord<String, String> r : records) {
            // process r and write the result inside the transaction
            producer.send(new ProducerRecord<>("out-topic", r.key(), transform(r.value())));
        }
        // atomically commit the consumed offsets together with the writes;
        // currentOffsets must map each partition to (last processed offset + 1)
        Map<TopicPartition, OffsetAndMetadata> offsets = currentOffsets(records);
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (Exception e) {
        producer.abortTransaction(); // safe rollback; neither the writes nor the offsets become visible
        // then reset the consumer to its last committed offsets so the batch is reprocessed
    }
}
This gives EOS for read→process→write pipelines. Downstream consumers must set isolation.level=read_committed to avoid seeing aborted writes. Alternatively, use Kafka Streams with processing.guarantee=exactly_once_v2 for the same guarantee with less code.
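A minimal Kafka Streams sketch under the same assumptions (placeholder application id, broker address, and topic names; transform stands in for your processing step):

Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-transform-app");   // placeholder
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
streamsProps.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

StreamsBuilder builder = new StreamsBuilder();
builder.stream("in-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .mapValues(value -> transform(value))                                  // your processing step
       .to("out-topic", Produced.with(Serdes.String(), Serdes.String()));

KafkaStreams streams = new KafkaStreams(builder.build(), streamsProps);
streams.start(); // Streams manages the transactions, offset commits, and recovery internally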
Option B — Idempotent consumer (read → process to external system)
When you write to a DB/HTTP service/etc., you can’t use Kafka transactions across systems. Make the effect idempotent and commit offsets only after the effect is durably recorded.
Patterns that work
- Upsert by key in the DB (e.g., INSERT ... ON CONFLICT UPDATE), or enforce business keys with unique constraints.
- Keep a processed_events table keyed by eventId (or (topic, partition, offset)) and process in this order (see the sketch after this list):
  - begin DB tx
  - apply effect (idempotent)
  - mark processed id
  - commit DB tx
  - commitSync offsets
- Inbox/Outbox: write message to an “inbox” table and process from there transactionally.
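A minimal JDBC sketch of the processed_events pattern, a variant of the applyIdempotentEffect call used in the loop below that takes the Connection explicitly. It assumes PostgreSQL-style ON CONFLICT, a processed_events(event_id) table, a hypothetical account_balance table for the business effect, and that the event id travels as the record key:

// Hypothetical helper; all table and column names are illustrative.
void applyIdempotentEffect(Connection db, ConsumerRecord<String, String> r) throws SQLException {
    db.setAutoCommit(false);
    try {
        // 1. Record the event id; a conflict means this event was already processed.
        try (PreparedStatement dedup = db.prepareStatement(
                "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING")) {
            dedup.setString(1, r.key()); // assumes the event id is the record key
            if (dedup.executeUpdate() == 0) { // duplicate delivery: skip the effect
                db.rollback();
                return;
            }
        }
        // 2. Apply the business effect; written as an upsert so it is itself idempotent.
        try (PreparedStatement upsert = db.prepareStatement(
                "INSERT INTO account_balance (account_id, balance) VALUES (?, ?) " +
                "ON CONFLICT (account_id) DO UPDATE SET balance = EXCLUDED.balance")) {
            upsert.setString(1, r.key());
            upsert.setLong(2, Long.parseLong(r.value())); // assumes the value is a numeric balance
            upsert.executeUpdate();
        }
        db.commit(); // effect and processed marker become durable atomically
    } catch (SQLException e) {
        db.rollback();
        throw e; // offsets stay uncommitted, so the record will be retried
    }
}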
Minimum Java loop
props.put("enable.auto.commit", "false");

while (running) {
    var records = consumer.poll(Duration.ofMillis(1000));
    for (var r : records) {
        // idempotent effect, e.g., UPSERT by eventId or natural key
        applyIdempotentEffect(r);
    }
    consumer.commitSync(); // after effects are durable
}
Why duplicates happen (and how to reduce them)
- Crash after processing but before committing offsets → same messages re-delivered.
- Rebalances and retries can replay the last batch.
- Network glitches during commits.
Hygiene checklist
- enable.auto.commit=false and commit after processing (at-least-once).
- Keep poll batches small (e.g., via max.poll.records) so replays after a failure are small.
- Implement ConsumerRebalanceListener: in onPartitionsRevoked, finish/flush in-flight work and commit offsets for those partitions (see the sketch after this list).
- Prefer cooperative-sticky rebalancing (Kafka clients 2.4+): partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.
- Use keys and make your handlers naturally idempotent.
- For producers: set retries high; with enable.idempotence=true, retries will not create duplicates on the topic.
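A sketch of that rebalance hook, assuming the processing loop maintains an offsetsToCommit map holding the next offset to consume per partition (the map, the topic name, and flushPendingWork are illustrative):

consumer.subscribe(List.of("in-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        flushPendingWork(partitions); // hypothetical: finish records already in flight
        Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
        for (TopicPartition tp : partitions) {
            OffsetAndMetadata next = offsetsToCommit.get(tp);
            if (next != null) toCommit.put(tp, next);
        }
        if (!toCommit.isEmpty()) consumer.commitSync(toCommit); // commit before losing the partitions
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // nothing special: processing resumes from the committed offsets
    }
});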
Quick decision guide
- You produce to Kafka as output? Use transactions (EOS) or Kafka Streams exactly_once_v2.
- You write to DB/HTTP? Make the operation idempotent and commit offsets after the durable effect (or use Inbox/Outbox).
- Pure read-only? Duplicates don’t matter; just be aware of replays.