I analyzed 25,000 DLQ messages. Here’s what I learned the hard way.
Secret Hint: TL;DR is at the bottom, so you might as well just read while scrolling down 😉
A couple of years back, around 2018, our Java-based Kafka streaming data pipeline silently broke. We had Dead Letter Queues (DLQs) enabled, or so we thought. The configuration was wrong and had never really been tested. Java exceptions disappeared into logs with 3-day retention. We discovered the issue only at midday on Monday, when the Analytics team reported missing data from the previous week. We got lucky: the source topic had 7-day retention, so we could reset offsets and reprocess. But we spent hours understanding and reconstructing what went wrong with zero error context.
If the DLQ had been properly configured and monitored, we would have known within minutes. We would have known why it broke, not just that it broke. That incident stuck with me, not because I had built the pipeline, but because my team inherited it, and I was the one trying to untangle the spaghetti of what went wrong. I enjoy debugging, but I didn’t enjoy being that guy at 12pm on a Monday, piecing together a puzzle of lost knowledge and missing data. Every similar incident since has reinforced the same lesson: DLQs aren’t optional infrastructure; they’re your early warning system.
I don’t want to make this another “here’s how to configure DLQ” tutorial. But after debugging thousands of DLQ messages, I realized there’s a systematic approach to triaging them that can save hours of frustration. So I want to focus on how to quickly identify root causes, avoid common pitfalls, and ensure your DLQ is properly configured and monitored. But first, let’s see what is wrong with most DLQ setups.
The Four Sins of DLQ
When it comes to debugging any kind of error, DLQs included, you would expect the standard measures to be in place. Yet the same mistakes repeat across different teams and companies:
- DLQs are enabled after the first outage, not before. Nobody thinks they need one until data goes missing.
- When enabled, they’re not monitored. The topic exists, but nobody’s watching it grow.
- When monitored, alerts are ignored. Alert fatigue or unclear ownership means the alerts get ignored.
- When actioned, there’s no replay strategy. You fix the bug, but those 25,000 messages? Still sitting there, becoming stale.
Ask me how I know 🙃. (If any of these sound familiar, you’ve probably been there too.) What I’ve learned from many hours triaging DLQ floods is that most floods aren’t 25,000 unique problems, they are 1-2 root causes amplified by batch processing and retries.
Here’s a real example from that initial 25,000 message incident:
What we saw: 25,000 messages in the DLQ
What actually happened:
- 500 bad records with a schema mismatch (missing a required field)
- Each bad record was part of a 50-message batch sent to the downstream Database sink
- When one message failed, the entire batch was rejected (typical batch processing behavior)
- 500 bad records × 50 messages/batch = 25,000 messages rejected because of batch failures
The real problem was just 500 malformed records caused by a producer schema bug. Everything else was collateral damage.
By this point, I had realized that analyzing every error log or Kafka message is futile. The only way to stay sane was to treat DLQ triage like a forensic investigation: you need a systematic way to cut through the amplification and find the signal in the noise. Methodical, pattern-first, not “headless chicken”-first.
The Triage Playbook
When you’re staring at a DLQ with thousands of messages, here’s your systematic approach.
Step 1: Sample First
In the first minutes, grab 10 messages with headers to get a feel for what’s going on. Don’t analyze everything.
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--group dlq-triage-$(date +%s) \
--from-beginning \
--max-messages 10 \
--property print.headers=true \
--property print.timestamp=true
You look for patterns:
- Timestamp clustering? Single incident (good) vs. ongoing issue (bad)
- Same connector name repeating? Isolate the problem to one connector
- Same error class? One root cause amplified thousands of times
⚠️ Note: In a long‑lived DLQ, --from-beginning may show you older failures. For quick triage of current incidents, use
a throwaway consumer group without --from-beginning so you only see newly arriving DLQ messages.
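If the incident is still unfolding, the quickest sanity check is a live tail: the same command as above, minus --from-beginning, so you only see what is arriving right now.
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--group dlq-triage-live-$(date +%s) \
--max-messages 10 \
--property print.headers=true \
--property print.timestamp=true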
At this point, you may already have a hypothesis. But don’t jump to conclusions yet. You need to quantify the problem.
Step 2: Group By Error Class
This is where you’ll start to see if your hypothesis is correct or if there are multiple root causes. Kafka Connect (KC) DLQ headers are your debugging friend, and these are the key ones to focus on:
- __connect.errors.exception.class.name — The Java exception type
- __connect.errors.exception.message — The actual error message
- __connect.errors.topic — Original source topic
- __connect.errors.connector.name — Which connector failed
Extract the class name and count occurrences:
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true 2>/dev/null | \
grep -oP '__connect.errors.exception.class.name:\K[^,]+' | \
sort | uniq -c | sort -rn
The output will typically look something like this:
| Count | Error Class | What It Means |
|---|---|---|
| 24,500 | BatchUpdateException | Entire batches rejected (amplification) |
| 490 | DataException | Schema/type mismatch (root cause) |
| 10 | RetriableException | Transient failures exhausted retries |
This is the 💡 AHA moment. You have 25,000 error messages, but really just 490 actual errors, which probably trace back to a single schema-mismatch root cause. Now you know where to focus. Most people would chase the biggest number first, but the biggest number is often just a secondary effect of the real problem. Focus on the most specific error class, the one that points at the root cause.
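The same header-grep pattern works for the other context headers too. For example, grouping by connector name (just a variation of the command above) quickly tells you whether one connector or several are affected:
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true 2>/dev/null | \
grep -oP '__connect.errors.connector.name:\K[^,]+' | \
sort | uniq -c | sort -rn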
Step 3: Diagnose the Root Cause
Once you’ve identified the dominant error class, inspect one representative message:
# Replace `DataException` with the error class you're investigating
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true \
--property print.key=true \
--property print.value=true 2>/dev/null | \
grep -m 1 '__connect.errors.exception.class.name:DataException'
Common culprits and what they mean:
DataException (most common):
- Missing required field → Producer sending incomplete records
- Type mismatch → Schema evolution bug or serialization error
- Value out of range → Timestamp overflow, integer too large for target column
RetriableException:
- Sink was down during the incident window → Check sink availability logs
- Network timeout → Check latency metrics for that time period
- Resource exhaustion → Sink couldn’t keep up (check batch sizes and throughput)
ConnectException:
- Wrong credentials or permissions
- Configuration mismatch between environments (dev URL in prod, etc.)
- Restarting the connector may help temporarily, but find the config error
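To make the most common case concrete, here is a purely hypothetical example of the “missing required field” flavor of DataException (field names invented for illustration). The sink expects every record to carry a channel field:
{"order_id": "A-1001", "channel": "web", "amount": 19.99}
After a producer deploy, records start arriving without it:
{"order_id": "A-1001", "amount": 19.99}
Each record like the second one fails, and every 50-message batch containing one drags its 49 healthy neighbors into the DLQ with it.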
Step 4: Fix It at the Source
| What You Found | Action | Who Owns It |
|---|---|---|
| Schema mismatch | Fix producer schema or add SMT to handle | Producer team or you |
| Missing field | Add default value via SMT, or fix producer | Depends on contract |
| Type casting error | Fix source data or add transformation | Producer team |
| Transient sink failure | Usually self-healed; replay if messages matter | You |
| Poison pill record | Quarantine, investigate, possibly discard | Data quality team |
ℹ️ Key insight: Most issues require fixing the producer or adding a Single Message Transform (SMT), not tweaking the connector config.
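As a sketch of what that can look like: if a defaultable field went missing from every record (like the hypothetical channel field above), an InsertField SMT on the sink connector can patch in a static default while the producer fix rolls out. The transform alias and values here are illustrative, not a drop-in config:
{
"transforms": "AddChannelDefault",
"transforms.AddChannelDefault.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.AddChannelDefault.static.field": "channel",
"transforms.AddChannelDefault.static.value": "unknown"
}
Note that InsertField adds the field to every record passing through the connector, so this only makes sense when the field is missing across the board. It’s a stopgap, not a substitute for fixing the producer.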
Essential Commands Reference
Let’s be honest: when things go wrong, you just want a handful of commands that tell you how bad it is, where it came from, and whether you can fix it. Here are the ones I keep at hand:
Count total messages in DLQ:
kafka-consumer-groups \
--bootstrap-server kafka-broker:9092 \
--group dlq-offset-checker \
--describe | \
awk '$2 == "dlq.connect" { sum += $6 } END { print sum + 0 }'
This sums the LAG column (the sixth column: messages not yet consumed by the dlq-offset-checker group) across all partitions of dlq.connect. It assumes that group exists and has committed offsets on the topic; its lag is then your unhandled backlog.
Extract the original topic for replay:
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true | \
grep -oP '__connect.errors.topic:\K[^,]+' | \
sort | uniq -c
Those DLQ headers (__connect.errors.topic and __connect.errors.partition) are your breadcrumbs. They tell you where these dead letters came from and where they might need to go back.
Monitoring: Don’t Let DLQ Become a Graveyard
A DLQ you don’t monitor is a DLQ you don’t have. It becomes a graveyard where messages go to die, literally dead letters. And if your clients realize you are losing their precious events, you will soon become the next Yahoo! So if you take DLQs seriously, make monitoring part of your MVP:
| Metric | Threshold | Why It Matters |
|---|---|---|
| DLQ message count | > 0 | Every message in DLQ needs attention |
| DLQ growth rate | > 100/hour | Something is actively failing |
| DLQ age (oldest message) | > 24 hours | Unactioned failures piling up |
| Unique error classes | > 2 | Multiple failure modes = systemic issue |
Calibrate these thresholds to your scale. If your system ingests 10M messages/hour, then 100 errors/hour (0.001%) might be acceptable noise. But if you’re processing 10K messages/hour and you see 100 errors/hour, that’s a 1% failure rate (🔔🔔🔔). The goal is a trend toward zero DLQ messages. If the count is growing, you have an active incident. If it’s static but non-zero, you have technical debt.
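None of this needs a fancy monitoring stack to get started. Here is a minimal cron-able sketch that reuses the backlog count from the reference section above; the topic, group name, and zero threshold are the same assumptions as before, and the alerting hook is whatever you already use.
#!/bin/bash
# Count the unconsumed DLQ backlog and shout if it's non-zero.
DLQ_BACKLOG=$(kafka-consumer-groups \
--bootstrap-server kafka-broker:9092 \
--group dlq-offset-checker \
--describe 2>/dev/null | \
awk '$2 == "dlq.connect" { sum += $6 } END { print sum + 0 }')

if [ "${DLQ_BACKLOG:-0}" -gt 0 ]; then
  echo "ALERT: dlq.connect has ${DLQ_BACKLOG} unhandled messages"  # wire this into Slack, PagerDuty, email, etc.
fi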
Remember my story from the beginning? A quiet DLQ doesn’t mean everything is fine; it might just be misconfigured!
Replaying DLQ Messages: Lessons from the Trenches
Once you’ve fixed the bug, in that euphoria you may want to replay the failed messages quickly. Do it safely.
Replaying messages blindly can re-trigger ⚠️ side effects (duplicate emails, duplicate payments, double API writes).
Evaluate the blast radius before pushing anything from the DLQ back into the source topic. Kafka Connect doesn’t provide
native replay support, so you’ll need a script or lightweight service. Chances are you’ll be doing at least part of this
manually, and because humans tend to make mistakes, please make sure to:
- Verify the fix works - Try a few messages manually. If messages fail again, your fix didn’t work.
- Check message age - Is this data still relevant? Replaying a 3-day-old expired session might cause more harm than good.
- Have a kill switch - Whether you replay via CLI or custom script, make sure you can pause it immediately.
The Replay Process
Once you are ready:
- Fix the underlying issue (schema, config, sink outage, etc.)
- Republish DLQ messages to the original topic using kafka-console-producer or a custom tool (see the sketch after this list)
- Watch the DLQ again; if the same messages reappear, your fix didn’t work
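Here is a minimal replay sketch using only the console tools, with big caveats: it replays values only (keys and headers are lost), it replays everything it reads rather than just the records you actually want back, and orders.incoming is a made-up stand-in for whatever __connect.errors.topic pointed you to.
# Replays DLQ record values into the (hypothetical) original topic.
# Filter between the two commands if only some records should go back.
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--timeout-ms 30000 2>/dev/null | \
kafka-console-producer \
--bootstrap-server kafka-broker:9092 \
--topic orders.incoming
This is acceptable for a one-off. For anything recurring, a small custom tool that preserves keys, carries a replay-count header, and respects a kill switch is worth the extra hour.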
Avoid Infinite Loops
While you are scripting “BASIC” replay solutions, sometimes they turn into “10 GOTO 10”:
- Add a header like replay-count=1 to track retries
- Set a max retry threshold (e.g., after 3 replay attempts, quarantine the message)
- Alert when the same message repeatedly fails
The Minimum Viable DLQ Config
If you take only one configuration snippet from this post, make it this one. This is your bare-minimum setup for Kafka Connect DLQ sanity:
{
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "dlq.<your-connector-name>",
"errors.deadletterqueue.context.headers.enable": true,
"errors.log.enable": true,
"errors.log.include.messages": true
}
Here’s why it matters:
"errors.tolerance": "all"ensures every single failure goes to the DLQ, without silent drops."context.headers.enable": truemakes debugging possible. Without headers, you see the failed message, but you don’t know why it failed.- Logging the message payload (
errors.log.include.messages) gives you quick visibility in emergencies.
If you ever wonder whether to enable headers, the answer is always yes. No headers = no debugging.
What I Wish I’d Known From Day One
I have learned that the Dead Letter Queue isn’t a trash can where bad messages go to die; it’s your pipeline’s early warning system, quietly telling you a story. We just need to stop talking and start listening. That’s true for any communication system, but especially for data pipelines.
If you came here for the TL;DR, here it is:
- Most DLQ floods come from one error repeated thousands of times. Sample first, aggregate second.
- Kafka Headers contain everything about the error. Enable context headers or suffer later.
- Monitor DLQ like you monitor your main pipeline. Zero messages is the goal.
- Have a replay plan before you need it. The 4am incident is not the time to figure this out.
The next time you see that DLQ topic start to grow, take a deep breath; you’ve got a plan now. I’m still surprised there are no tools that automate this triage process end to end, but hopefully that will change soon.
So, if you’ve ever built custom DLQ dashboards, DLQ analysis tools, or replay scripts, I’d love to hear about them and learn how others are solving this in the wild. Feel free to reach out!