Your DLQ Is a Graveyard - Here's How to Fix It
A payment processing queue accumulates 50,000 messages in its DLQ over three months. Nobody knows why they failed - the original errors weren’t captured, just a generic “processing failed” status. Nobody knows if they’re safe to replay - some might cause duplicate charges, some might reference customers who no longer exist. The team debates for a week, then deletes them all and hopes no customers notice.
That’s not a safety mechanism. That’s data loss with extra steps.
The problem isn’t the DLQ concept - it’s three specific design failures that make messages non-debuggable and non-replayable. Every team I’ve seen delete a DLQ in frustration made the same mistakes: they didn’t classify failures, they didn’t capture context, and they stored messages where they couldn’t query them. Fix those three things and your DLQ becomes an operational tool instead of a black hole.
Classify Failures at Write Time
Messages fail for different reasons, and those reasons determine how you should handle them. Most DLQs treat all failures the same - a message failed, it goes in the queue, someone will look at it eventually. That “eventually” often becomes “never” because manual triage doesn’t scale.
The failure taxonomy matters because different failure types require fundamentally different responses:
- **Transient failures** are temporary issues that resolve themselves: database connection timeouts, downstream service unavailability, rate limit errors, lock contention. These should succeed on retry. The right strategy is automatic replay after a cooldown period - no human intervention required if your retry policy is sound.
- **Permanent failures** won't resolve without intervention: invalid message format, referenced entity doesn't exist, business rule violation, schema version mismatch. These will never succeed without a fix. Someone needs to look at the payload, understand what's wrong, and either fix the data or discard the message.
- **Poison messages** crash the consumer: null pointer in a required field, infinite loop triggered by specific data, memory exhaustion from an oversized payload. The consumer dies on every attempt. These are the most dangerous because they can take down your processing capacity. The strategy is to isolate immediately, fix the consumer code, then replay.
- **Ordering failures** stem from sequence issues: an update arriving before the create, a delete for a record that doesn't exist yet. These may succeed if replayed in the correct order. The strategy is to hold for reordering, then replay - but this often requires understanding the business context to determine the correct sequence.
When you classify failures at write time - the moment the message moves to the DLQ - you enable automation. Transient failures route to an auto-replay queue with a delay. Permanent failures route to a team-specific triage queue. Poison messages trigger an immediate alert. Without classification, every message requires a human to open it, read the error, and decide what to do. That’s why DLQs accumulate for months.
The classification should be explicit in your DLQ schema, not inferred later. Your consumer already knows why it failed - a ValidationError is permanent, a ConnectionError is transient, an unhandled exception might be poison. Map exception types to failure classifications programmatically, and use that classification to drive routing and alerting.
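In practice this can be a small lookup from exception type to failure class, consulted at the moment the message moves to the DLQ. A minimal sketch - the exception types, class names, and queue names here are illustrative, not from any specific framework:

```python
# Classify a consumer failure at DLQ-write time and pick a route.
# The queue names and exception mapping are illustrative assumptions.

TRANSIENT = "transient"
PERMANENT = "permanent"
POISON = "poison"

# Ordered rules: first matching exception type wins.
CLASSIFICATION_RULES = [
    (TimeoutError, TRANSIENT),     # e.g. database connection timeout
    (ConnectionError, TRANSIENT),  # downstream service unavailable
    (ValueError, PERMANENT),       # invalid format / validation failure
    (KeyError, PERMANENT),         # referenced field or entity missing
]

def classify_failure(exc: BaseException) -> str:
    """Return a failure classification for a consumer exception."""
    for exc_type, classification in CLASSIFICATION_RULES:
        if isinstance(exc, exc_type):
            return classification
    # Anything unmapped is treated as potentially poison until a
    # human confirms otherwise - the conservative default.
    return POISON

def route_for(classification: str) -> str:
    """Pick a destination based on the classification."""
    return {
        TRANSIENT: "dlq.auto-replay",  # replayed after a cooldown
        PERMANENT: "dlq.triage",       # needs human attention
        POISON: "dlq.quarantine",      # isolate and alert immediately
    }[classification]
```

With a mapping like this, a `ConnectionError` routes straight to the auto-replay queue while an unrecognized exception is quarantined - no human in the loop for the common cases.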
If your DLQ treats all failures the same, you’re forcing humans to do what software should do. Classification is the foundation that makes everything else - auto-replay, targeted alerts, efficient triage - possible.
Capture Context Before It’s Gone
Traditional DLQs fail at their job because they don't capture enough context. Most give you the original message payload, a timestamp when the message was DLQ'd, and maybe the number of retries. That's it. When you're debugging a failure three weeks later, you're staring at a payload with no idea what went wrong.
| What You Need for Debugging | Why It Matters |
|---|---|
| Original payload (exact bytes) | Replay exactly what was sent, not a re-serialized interpretation |
| First failure, last failure, DLQ timestamps | Understand the failure timeline and retry behavior |
| Full attempt history with timing | See patterns across retries, not just the final error |
| Error type, message, stack trace per attempt | Debug without guessing what went wrong |
| Consumer version and host | Know which code version failed and where |
| Correlation ID, trace ID | Connect to distributed traces and related requests |
| Failure classification | Enable automated routing and alerting |
| Operational state | Track triage progress, avoid duplicate investigation |
The gap between what most DLQs capture and what you need for debugging is enormous. Closing that gap requires capturing context at the right moment.
The critical insight is timing. Once a message is in the DLQ, the processing context is gone. The consumer that experienced the failure - with the error object in memory, the trace span active, the request-scoped state available - is your only chance to capture that information. You can’t reconstruct it later.
This means the DLQ write path needs to be synchronous with failure handling, not an afterthought. When a message fails its final retry, capture the context and write to the DLQ in the same call stack. If you push enrichment to a background job or separate service, you’ll lose the error object, the stack trace, and any request-scoped context.
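Concretely, the enrichment happens inside the failure handler, while the exception object and its traceback are still live. A hedged sketch - the record fields follow the table above, and `write_to_dlq` is a stand-in for whatever storage layer you use:

```python
import time
import traceback

def handle_final_failure(message, attempts, exc, write_to_dlq,
                         consumer_version, host):
    """Build an enriched DLQ record in the same call stack as the failure.

    `message` and `attempts` are illustrative shapes: `attempts` is the
    full retry history the consumer accumulated (error and timing per
    attempt), not just the final error.
    """
    record = {
        # Exact bytes, so replay sends what was originally sent
        "payload": message["body"],
        "source_queue": message["queue"],
        "correlation_id": message.get("correlation_id"),
        # Full attempt history reveals patterns across retries
        "attempts": attempts,
        "first_failure_at": attempts[0]["at"],
        "last_failure_at": attempts[-1]["at"],
        "dlq_at": time.time(),
        # Captured now, while the error object is still in memory;
        # a background job could never reconstruct this
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "consumer_version": consumer_version,
        "host": host,
        # Operational state begins its lifecycle here
        "state": "new",
    }
    write_to_dlq(record)
    return record
```

The key property is that this runs synchronously in the `except` block of the final retry: everything request-scoped is still reachable.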
Store every attempt, not just the last one. A message might fail three times for three different reasons - connection timeout, rate limit, then validation error. If you only store the last error, you miss the pattern. The full history often reveals the root cause: maybe the first two failures were transient, but the third exposed a real data problem.
Track operational state alongside the message. Messages in your DLQ have a lifecycle: new, under investigation, ready for replay, replayed, discarded. Track that state. Track who’s looking at what. Add notes. Without this, you’ll have multiple engineers independently debugging the same failure, or messages that sit untouched for weeks because nobody knows if someone else is already handling them.
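The lifecycle is small enough to enforce in code. A sketch, assuming the state names above and a simple transition table (the `actor`/`note` fields are illustrative):

```python
# Illustrative DLQ message lifecycle. Enforcing transitions prevents,
# say, replaying a message someone is still investigating.

ALLOWED_TRANSITIONS = {
    "new": {"investigating", "discarded"},
    "investigating": {"ready_for_replay", "discarded"},
    "ready_for_replay": {"replayed"},
    "replayed": set(),    # terminal
    "discarded": set(),   # terminal
}

def transition(record, new_state, actor, note=None):
    """Move a DLQ record to a new state, recording who did it and why."""
    current = record["state"]
    if new_state not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current} to {new_state}")
    record["state"] = new_state
    # The audit trail answers "is someone already on this?"
    record.setdefault("history", []).append(
        {"from": current, "to": new_state, "by": actor, "note": note}
    )
    return record
```

The `history` list doubles as the audit trail the triage workflow needs: who claimed the message, when, and why it was replayed or discarded.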
Store Where You Can Query
Where you store your DLQ messages determines what operations you can perform on them. The choice comes down to two models: queue-based storage (native DLQ features in SQS, RabbitMQ, or Kafka) versus database-backed storage (PostgreSQL, MongoDB, or similar).
Queue-based storage is simple to set up - most brokers have built-in DLQ support. But it creates an operational problem: you can’t query a queue. To inspect messages, you have to receive them, which affects visibility timeout and delivery counts. You can’t answer “show me all validation failures from the orders queue in the last hour” without polling through the entire queue. You can’t join DLQ data against your application data. You can’t build dashboards or run aggregations.
For teams using Kafka, ecosystem tooling bridges some of this gap. Kpow and Conduktor are enterprise UIs that support browsing DLQ topics, filtering by schema, and manually re-injecting messages. DLQMan is an open-source tool specifically for Kafka DLQ management. You get the durability and replayability of Kafka’s log-based storage with inspection and filtering capabilities layered on top.
Database-backed storage trades native queue features for query capability - and for most teams, it’s a worthwhile trade. With DLQ messages in PostgreSQL or MongoDB, debugging becomes straightforward. Find all messages that failed with a specific error type. Filter by source queue, time range, or failure classification. Join against your application tables to enrich the view. Build operational dashboards that show failure trends over time.
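To make the contrast concrete, here is the kind of query that is a one-liner against a table but effectively impossible against a raw queue. SQLite stands in for Postgres to keep the sketch self-contained; the schema and sample rows are illustrative:

```python
import sqlite3

# In-memory SQLite as a stand-in for a Postgres-backed DLQ table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dlq_messages (
        id             INTEGER PRIMARY KEY,
        source_queue   TEXT,
        classification TEXT,
        error_type     TEXT,
        dlq_at         TEXT,
        state          TEXT
    )
""")
conn.executemany(
    "INSERT INTO dlq_messages "
    "(source_queue, classification, error_type, dlq_at, state) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("orders",   "permanent", "ValidationError", "2024-05-01T10:00:00", "new"),
        ("orders",   "transient", "TimeoutError",    "2024-05-01T10:05:00", "new"),
        ("payments", "permanent", "ValidationError", "2024-05-01T11:00:00", "new"),
    ],
)

# "Show me all validation failures from the orders queue" - a filter,
# not a destructive poll through the entire queue.
rows = conn.execute(
    "SELECT id FROM dlq_messages WHERE source_queue = ? AND error_type = ?",
    ("orders", "ValidationError"),
).fetchall()

# Aggregation for a failure-trend dashboard.
trend = conn.execute(
    "SELECT classification, COUNT(*) FROM dlq_messages GROUP BY classification"
).fetchall()
```

Nothing here is exotic SQL - which is the point. The moment DLQ messages live in a table, ordinary filters, joins, and aggregations become your triage tooling.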
Several libraries make database-backed DLQ storage easy to implement. pg-boss is a Node.js job queue backed by PostgreSQL with built-in retry policies and dead-lettering - failed jobs stay in Postgres tables where you can query them directly. pgmq is a lightweight message queue native to Postgres supporting DLQ management. River provides Go applications with Postgres-backed message persistence and sophisticated redrive policies. Wolverine gives .NET teams PostgreSQL-backed messaging with integrated DLQ handling.
The decision tree is relatively simple. If you’re running Kafka at scale, evaluate the ecosystem tools - they’ll save weeks of custom development. If you’re using Postgres and a supported language, the database-backed libraries give you queryable DLQ storage essentially for free. If your needs include custom enrichment, schema validation before replay, or conditional routing based on failure type, you’ll end up building a dedicated DLQ service regardless of storage choice.
Queue-based DLQ storage optimizes for simplicity at the cost of visibility. Database-backed storage optimizes for understanding at the cost of complexity. For anything beyond low-volume, simple workflows, the visibility is worth it.
The Payoff
With these three fixes - classification, context capture, queryable storage - the opening scenario plays out differently. The team queries the DLQ and finds that 80% of the 50,000 messages are transient_infrastructure failures from a two-week period when the downstream payment processor had intermittent issues. Those can be bulk-replayed now that the processor is healthy.
The remaining 10,000 messages split into clear categories. Some are validation failures with a specific error - a field format changed and older messages don’t match the new schema. Those need a migration script. Some are permanent failures referencing deleted customers - those can be safely discarded with an audit trail. A handful are poison messages that exposed a bug in the consumer - the team fixes the code, then replays those messages.
Instead of a week of debate followed by a mass deletion, the team resolves the backlog in a day with confidence. They know exactly what failed, why it failed, and whether replay is safe. The DLQ did its job.
The investment required isn’t enormous. Enrich messages at failure time with the context you’ll need later. Store them somewhere queryable. Classify failures so transient issues auto-resolve and permanent failures get human attention. Most of this work is one-time infrastructure that pays dividends every time something goes wrong.
That 50,000-message deletion? It didn’t have to happen. The cost of getting DLQ design right is a few days of infrastructure work. The cost of getting it wrong is data loss dressed up as operational hygiene.