I analyzed 25,000 DLQ messages. Here’s what I learned the hard way.
Secret Hint: TL;DR is at the bottom, so you might as well just read while scrolling down 😉
A couple of years back, around 2018, our Java-based Kafka streaming data pipeline silently broke. We had Dead Letter Queues (DLQs) enabled, or so we thought. The configuration was wrong and had never really been tested. Java exceptions disappeared into logs with 3-day retention. We discovered the issue only at midday on Monday, when the Analytics team reported missing data from the previous week. We got lucky: the source topic had 7-day retention, so we could reset offsets and reprocess. But we spent hours understanding and reconstructing what went wrong with zero error context.
If the DLQ had been properly configured and monitored, we would have known within minutes. We would have known why it broke, not just that it broke. That incident stuck with me, not because I had built the pipeline, but because my team inherited it, and I was the one trying to untangle the spaghetti of what went wrong. I enjoy debugging, but I didn’t enjoy being that guy at 12pm on a Monday, piecing together a puzzle of lost knowledge and missing data. Every similar incident since has reinforced the same lesson: DLQs aren’t optional infrastructure; they’re your early warning system.
I don’t want to make this another “here’s how to configure DLQ” tutorial. But after debugging thousands of DLQ messages, I realized there’s a systematic approach to triaging them that can save hours of frustration. So I want to focus on how to quickly identify root causes, avoid common pitfalls, and ensure your DLQ is properly configured and monitored. But first, let’s see what is wrong with most DLQ setups.
The Four Sins of DLQ
When it comes to debugging any kind of error, DLQs included, you would expect the standard measures to be in place. Yet the same mistakes repeat across different teams and companies:
- DLQs are enabled after the first outage, not before. Nobody thinks they need one until data goes missing.
- When enabled, they’re not monitored. The topic exists, but nobody’s watching it grow.
- When monitored, alerts are ignored. Alert fatigue or unclear ownership means the alerts get ignored.
- When actioned, there’s no replay strategy. You fix the bug, but those 25,000 messages? Still sitting there, becoming stale.
Ask me how I know 🙃. (If any of these sound familiar, you’ve probably been there too.) What I’ve learned from many hours triaging DLQ floods is that most floods aren’t 25,000 unique problems, they are 1-2 root causes amplified by batch processing and retries.
Here’s a real example from that initial 25,000 message incident:
What we saw: 25,000 messages in the DLQ
What actually happened:
- 500 bad records with a schema mismatch (missing a required field)
- Each bad record was part of a 50-message batch sent to the downstream Database sink
- When one message failed, the entire batch was rejected (typical batch processing behavior)
- 500 bad records × 50 messages/batch = 25,000 messages rejected because of batch failures
The real problem was just 500 malformed records caused by a producer schema bug. Everything else was collateral damage.
By this point, I had realized that analyzing every error log or Kafka message is futile. The only way to stay sane was to treat DLQ triage like a forensic investigation: you need a systematic way to cut through the amplification and find the signal in the noise. Methodical, pattern-first, not “headless chicken”-first.
The Triage Playbook
When you’re staring at a DLQ with thousands of messages, here’s your systematic approach.
Step 1: Sample First
In the first minutes, grab 10 messages with headers to get a feel for what’s going on. Don’t analyze everything.
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--group dlq-triage-$(date +%s) \
--from-beginning \
--max-messages 10 \
--property print.headers=true \
--property print.timestamp=true
You look for patterns:
- Timestamp clustering? Single incident (good) vs. ongoing issue (bad)
- Same connector name repeating? Isolate the problem to one connector
- Same error class? One root cause amplified thousands of times
⚠️ Note: In a long‑lived DLQ, --from-beginning may show you older failures. For quick triage of current incidents, use
a throwaway consumer group without --from-beginning so you only see newly arriving DLQ messages.
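If the incident is still unfolding, the quickest sanity check is a live tail: the same command as above, minus --from-beginning, so you only see what is arriving right now.
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--group dlq-triage-live-$(date +%s) \
--max-messages 10 \
--property print.headers=true \
--property print.timestamp=true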
At this point, you may already have a hypothesis. But don’t jump to conclusions yet. You need to quantify the problem.
Step 2: Group By Error Class
This is where you’ll start to see if your hypothesis is correct or if there are multiple root causes. Kafka Connect (KC) DLQ headers are your debugging friend, and these are the key ones to focus on:
- __connect.errors.exception.class.name — The Java exception type
- __connect.errors.exception.message — The actual error message
- __connect.errors.topic — Original source topic
- __connect.errors.connector.name — Which connector failed
Extract the class name and count occurrences:
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true 2>/dev/null | \
grep -oP '__connect.errors.exception.class.name:\K[^,]+' | \
sort | uniq -c | sort -rn
The output will typically look something like this:
| Count | Error Class | What It Means |
|---|---|---|
| 24,500 | BatchUpdateException | Entire batches rejected (amplification) |
| 490 | DataException | Schema/type mismatch (root cause) |
| 10 | RetriableException | Transient failures exhausted retries |
This is the 💡 AHA moment. You have 25,000 error messages, but really just 490 actual errors, which probably trace back to a single schema-mismatch root cause. Now you know where to focus. Most people would chase the biggest number first, but the biggest number is often just a secondary effect of the real problem. Focus on the most specific error class, the one that points at the root cause.
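The same header-grep pattern works for the other context headers too. For example, grouping by connector name (just a variation of the command above) quickly tells you whether one connector or several are affected:
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true 2>/dev/null | \
grep -oP '__connect.errors.connector.name:\K[^,]+' | \
sort | uniq -c | sort -rn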
Step 3: Diagnose the Root Cause
Once you’ve identified the dominant error class, inspect one representative message:
# Replace `DataException` with the error class you're investigating
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true \
--property print.key=true \
--property print.value=true 2>/dev/null | \
grep -m 1 '__connect.errors.exception.class.name:DataException'
Common culprits and what they mean:
DataException (most common):
- Missing required field → Producer sending incomplete records
- Type mismatch → Schema evolution bug or serialization error
- Value out of range → Timestamp overflow, integer too large for target column
RetriableException:
- Sink was down during the incident window → Check sink availability logs
- Network timeout → Check latency metrics for that time period
- Resource exhaustion → Sink couldn’t keep up (check batch sizes and throughput)
ConnectException:
- Wrong credentials or permissions
- Configuration mismatch between environments (dev URL in prod, etc.)
- Restarting the connector may help temporarily, but find the config error
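To make the most common case concrete, here is a purely hypothetical example of the “missing required field” flavor of DataException (field names invented for illustration). The sink expects every record to carry a channel field:
{"order_id": "A-1001", "channel": "web", "amount": 19.99}
After a producer deploy, records start arriving without it:
{"order_id": "A-1001", "amount": 19.99}
Each record like the second one fails, and every 50-message batch containing one drags its 49 healthy neighbors into the DLQ with it.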
Step 4: Fix It at the Source
| What You Found | Action | Who Owns It |
|---|---|---|
| Schema mismatch | Fix producer schema or add SMT to handle | Producer team or you |
| Missing field | Add default value via SMT, or fix producer | Depends on contract |
| Type casting error | Fix source data or add transformation | Producer team |
| Transient sink failure | Usually self-healed; replay if messages matter | You |
| Poison pill record | Quarantine, investigate, possibly discard | Data quality team |
ℹ️ Key insight: Most issues require fixing the producer or adding a Single Message Transform (SMT), not tweaking the connector config.
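As a sketch of what that can look like: if a defaultable field went missing from every record (like the hypothetical channel field above), an InsertField SMT on the sink connector can patch in a static default while the producer fix rolls out. The transform alias and values here are illustrative, not a drop-in config:
{
"transforms": "AddChannelDefault",
"transforms.AddChannelDefault.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.AddChannelDefault.static.field": "channel",
"transforms.AddChannelDefault.static.value": "unknown"
}
Note that InsertField adds the field to every record passing through the connector, so this only makes sense when the field is missing across the board. It’s a stopgap, not a substitute for fixing the producer.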
Essential Commands Reference
Let’s be honest: when things go wrong, you just want a handful of commands that tell you how bad it is, where it came from, and whether you can fix it. Here are the ones I keep at hand:
Count total messages in DLQ:
kafka-consumer-groups \
--bootstrap-server kafka-broker:9092 \
--group dlq-offset-checker \
--describe | \
awk '$2 == "dlq.connect" { sum += $6 } END { print sum + 0 }'
This sums the LAG column (the sixth column: messages not yet consumed by the dlq-offset-checker group) across all partitions of dlq.connect. It assumes that group exists and has committed offsets on the topic; its lag is then your unhandled backlog.
Extract the original topic for replay:
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--property print.headers=true | \
grep -oP '__connect.errors.topic:\K[^,]+' | \
sort | uniq -c
Those DLQ headers (__connect.errors.topic and __connect.errors.partition) are your breadcrumbs. They tell you where these dead letters came from and where they might need to go back.
Monitoring: Don’t Let DLQ Become a Graveyard
A DLQ you don’t monitor is a DLQ you don’t have. It becomes a graveyard where messages go to die, literally dead letters. And if your clients realize you are losing their precious events, you will soon become the next Yahoo! So if you take DLQs seriously, make monitoring part of your MVP:
| Metric | Threshold | Why It Matters |
|---|---|---|
| DLQ message count | > 0 | Every message in DLQ needs attention |
| DLQ growth rate | > 100/hour | Something is actively failing |
| DLQ age (oldest message) | > 24 hours | Unactioned failures piling up |
| Unique error classes | > 2 | Multiple failure modes = systemic issue |
Calibrate these thresholds to your scale. If your system ingests 10M messages/hour, then 100 errors/hour (0.001%) might be acceptable noise. But if you’re processing 10K messages/hour and you see 100 errors/hour, that’s a 1% failure rate (🔔🔔🔔). The goal is a trend toward zero DLQ messages. If the count is growing, you have an active incident. If it’s static but non-zero, you have technical debt.
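None of this needs a fancy monitoring stack to get started. Here is a minimal cron-able sketch that reuses the backlog count from the reference section above; the topic, group name, and zero threshold are the same assumptions as before, and the alerting hook is whatever you already use.
#!/bin/bash
# Count the unconsumed DLQ backlog and shout if it's non-zero.
DLQ_BACKLOG=$(kafka-consumer-groups \
--bootstrap-server kafka-broker:9092 \
--group dlq-offset-checker \
--describe 2>/dev/null | \
awk '$2 == "dlq.connect" { sum += $6 } END { print sum + 0 }')

if [ "${DLQ_BACKLOG:-0}" -gt 0 ]; then
  echo "ALERT: dlq.connect has ${DLQ_BACKLOG} unhandled messages"  # wire this into Slack, PagerDuty, email, etc.
fi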
Remember my story from the beginning? A quiet DLQ doesn’t mean everything is fine; it might just be misconfigured!
Replaying DLQ Messages: Lessons from the Trenches
Once you’ve fixed the bug, in that euphoria you may want to replay the failed messages quickly. Do it safely.
Replaying messages blindly can re-trigger ⚠️ side effects (duplicate emails, duplicate payments, double API writes).
Evaluate the blast radius before pushing anything from the DLQ back into the source topic. Kafka Connect doesn’t provide
native replay support, so you’ll need a script or lightweight service. Chances are you’ll be doing at least part of this
manually, and because humans tend to make mistakes, please make sure to:
- Verify the fix works - Try a few messages manually. If messages fail again, your fix didn’t work.
- Check message age - Is this data still relevant? Replaying a 3-day-old expired session might cause more harm than good.
- Have a kill switch - Whether you replay via CLI or custom script, make sure you can pause it immediately.
The Replay Process
Once you are ready:
- Fix the underlying issue (schema, config, sink outage, etc.)
- Republish DLQ messages to the original topic using kafka-console-producer or a custom tool (see the sketch after this list)
- Watch the DLQ again; if the same messages reappear, your fix didn’t work
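Here is a minimal replay sketch using only the console tools, with big caveats: it replays values only (keys and headers are lost), it replays everything it reads rather than just the records you actually want back, and orders.incoming is a made-up stand-in for whatever __connect.errors.topic pointed you to.
# Replays DLQ record values into the (hypothetical) original topic.
# Filter between the two commands if only some records should go back.
kafka-console-consumer \
--bootstrap-server kafka-broker:9092 \
--topic dlq.connect \
--from-beginning \
--timeout-ms 30000 2>/dev/null | \
kafka-console-producer \
--bootstrap-server kafka-broker:9092 \
--topic orders.incoming
This is acceptable for a one-off. For anything recurring, a small custom tool that preserves keys, carries a replay-count header, and respects a kill switch is worth the extra hour.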
Avoid Infinite Loops
While you are scripting “BASIC” replay solutions, sometimes they turn into “10 GOTO 10”:
- Add a header like replay-count=1 to track retries
- Set a max retry threshold (e.g., after 3 replay attempts, quarantine the message)
- Alert when the same message repeatedly fails
The Minimum Viable DLQ Config
If you take only one configuration snippet from this post, make it this one. This is your bare-minimum setup for Kafka Connect DLQ sanity:
{
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "dlq.<your-connector-name>",
"errors.deadletterqueue.context.headers.enable": true,
"errors.log.enable": true,
"errors.log.include.messages": true
}
Here’s why it matters:
"errors.tolerance": "all"ensures every single failure goes to the DLQ, without silent drops."context.headers.enable": truemakes debugging possible. Without headers, you see the failed message, but you don’t know why it failed.- Logging the message payload (
errors.log.include.messages) gives you quick visibility in emergencies.
If you ever wonder whether to enable headers, the answer is always yes. No headers = no debugging.
What I Wish I’d Known From Day One
I have learned that the Dead Letter Queue isn’t a trash can where bad messages go to die; it’s your pipeline’s early warning system, quietly telling you a story. We just need to stop talking and start listening. That’s true for any communication system, but especially for data pipelines.
If you came here for the TL;DR, here it is:
- Most DLQ floods come from one error repeated thousands of times. Sample first, aggregate second.
- Kafka Headers contain everything about the error. Enable context headers or suffer later.
- Monitor DLQ like you monitor your main pipeline. Zero messages is the goal.
- Have a replay plan before you need it. The 4am incident is not the time to figure this out.
The next time you see that DLQ topic start to grow, take a deep breath; you’ve got a plan now. I’m still surprised there are no tools that automate this triage process end to end, but hopefully that will change soon.
So, if you’ve ever built custom DLQ dashboards, DLQ analysis tools, or replay scripts, I’d love to hear about them and learn how others are solving this in the wild. Feel free to reach out!