Why BullMQ monitoring is different from app monitoring
Most engineering teams treat BullMQ like an implementation detail. They add Datadog to their Express server, set up error tracking with Sentry, maybe throw in some APM — and assume their queues are covered.
They're not.
Queue failures are fundamentally different from HTTP request failures. When your API throws a 500, a user sees an error page. When a BullMQ job silently stalls, nothing visibly breaks. The email doesn't get sent. The webhook doesn't fire. The report doesn't generate. And nobody knows until a customer complains hours later.
I've spent years debugging queue systems — first at AWS on the MWAA team helping Fortune 500 customers with Airflow, then building data pipelines at a quantitative hedge fund, then at a PLG platform running identity resolution pipelines on Temporal and BullMQ. The pattern is always the same: teams instrument their request path exhaustively and leave their async processing completely blind.
This guide covers what I've learned about monitoring BullMQ specifically — the failure modes unique to Redis-backed job queues, the metrics that actually predict problems before they become outages, and how to set up monitoring that earns its keep.
The 7 failure modes that will break your queues
Every monitoring setup needs to be designed around the specific things that go wrong. Here are the seven failure modes I've seen hit production BullMQ deployments repeatedly — all sourced from real incidents, many from the BullMQ GitHub issue tracker itself.
1. Stalled jobs
A worker picks up a job, sets a lock, and starts processing. Then the worker dies — OOM kill, unhandled exception, container eviction, network partition. The lock expires, but the job is still in the active list. BullMQ's stall checker will eventually move it back to wait, but only if another healthy worker for that queue is running and has stalledInterval configured.
The failure mode: if all workers for a queue are down, stalled jobs sit in active forever. Nothing moves them. Nothing alerts. They're zombies.
How to detect it: Periodically fetch the active list (LRANGE bull:<queue>:active 0 -1), then check each job's lock key (EXISTS bull:<queue>:<jobId>:lock). Any active job without a lock is stalled. This is a zero-baseline anomaly — even one stalled job should fire immediately.
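The check above can be sketched in a few lines. This is a minimal illustration, not a complete monitor: the queue name is hardcoded, an ioredis client is assumed, and the decision itself is kept as a pure function over the two Redis results.

```javascript
// Pure decision: any active job ID without a corresponding lock is stalled.
function findStalledJobs(activeJobIds, lockedJobIds) {
  const locked = new Set(lockedJobIds);
  return activeJobIds.filter((id) => !locked.has(id));
}

// Fetching the inputs might look like this (assumed ioredis client and
// queue name 'email-send'):
//
// const active = await redis.lrange('bull:email-send:active', 0, -1);
// const locked = [];
// for (const id of active) {
//   if (await redis.exists(`bull:email-send:${id}:lock`)) locked.push(id);
// }
// const stalled = findStalledJobs(active, locked);
// if (stalled.length > 0) alert(stalled); // zero-baseline: any count fires
```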
2. Redis OOM
BullMQ stores everything in Redis — job data, event streams, sorted sets for delayed jobs, lists for active and waiting queues. If you're not pruning, it grows without bound.
The biggest offenders are almost always the event streams. BullMQ writes a stream entry for every job lifecycle event (added, active, completed, failed, stalled). At 1,000 jobs/minute, that's 3,000-5,000 stream entries per minute, per queue. Over a week, a single busy queue can accumulate millions of stream entries consuming gigabytes of memory.
```javascript
// Configure stream trimming in your Queue options:
const queue = new Queue('email-send', {
  connection,
  streams: {
    events: {
      maxLen: 10000 // keep last 10K events
    }
  }
});

// Also configure auto-removal of completed/failed jobs:
await queue.add('send', payload, {
  removeOnComplete: { count: 1000 },
  removeOnFail: { count: 5000 }
});
```
The second offender is completed job hashes. By default, BullMQ keeps every completed job forever. If each job payload is 2KB and you process 100K jobs/day, that's 200MB/day in completed job data alone.
One non-negotiable setting: maxmemory-policy must be noeviction. If Redis is set to allkeys-lru or any other eviction policy, it will silently delete BullMQ's internal keys under memory pressure. Jobs will vanish. State will corrupt. This is the most common misconfiguration I see in production deployments. Check your Redis config right now.
How to detect it: Track used_memory from INFO memory over time. Compute a growth rate. Project when you'll hit maxmemory. Alert hours before the limit, not after. Also identify which keys are growing fastest — is it event streams, completed sets, or something else? The remediation is completely different for each.
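The projection is simple arithmetic over two samples of used_memory. A minimal sketch; the sample shape and function name here are my own, not a Redis or BullMQ API:

```javascript
// Project hours until maxmemory from two INFO memory samples.
// Each sample: { atMs: epoch millis, usedBytes: used_memory from INFO }.
// Returns Infinity when memory is flat or shrinking.
function hoursUntilOom(sampleA, sampleB, maxmemoryBytes) {
  const deltaBytes = sampleB.usedBytes - sampleA.usedBytes;
  const deltaHours = (sampleB.atMs - sampleA.atMs) / 3_600_000;
  if (deltaBytes <= 0 || deltaHours <= 0) return Infinity;
  const bytesPerHour = deltaBytes / deltaHours;
  return (maxmemoryBytes - sampleB.usedBytes) / bytesPerHour;
}
```

Alert when the projection drops below your response window (say, 6 hours), not when a static threshold is crossed.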
3. Silent backlog growth
"5,000 jobs waiting" — is that bad? It depends entirely on context. If your workers drain 10,000/minute, a 5K backlog clears in 30 seconds. If your workers drain 50/minute, that's a 100-minute delay and climbing.
The dangerous failure mode isn't a sudden spike (those are obvious). It's inflow slightly exceeding drain rate over hours. At 2 AM, your data enrichment queue starts receiving 12 jobs/minute instead of the usual 10. Your single worker processes 11/minute. The backlog grows by 1 job/minute. After 8 hours, you have 480 jobs waiting, processing times have degraded, and your customer-facing SLA is blown.
How to detect it: You need three computed metrics, not just queue depth:
- Inflow rate — jobs entering the queue per minute
- Drain rate — jobs leaving the queue per minute (completed + failed)
- Net rate — drain minus inflow. Negative means growing. Positive means draining.
Then you can answer the questions that actually matter: "Is this queue keeping up?" and "If it's falling behind, by how much, and how long until it clears?"
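Those three computed metrics plus a time-to-clear estimate fit in a few lines. A hedged sketch with invented function and field names:

```javascript
// Combine inflow and drain rates into the numbers that matter:
// net rate (positive = draining) and an ETA for clearing the backlog.
function queueHealth(inflowPerMin, drainPerMin, backlog) {
  const netPerMin = drainPerMin - inflowPerMin;
  return {
    inflowPerMin,
    drainPerMin,
    netPerMin,
    // Minutes until the backlog clears; Infinity if the queue is falling behind.
    minutesToClear: netPerMin > 0 ? backlog / netPerMin : Infinity,
  };
}
```

The 2 AM scenario above maps directly: inflow 12/min, drain 11/min gives a net rate of -1/min, and minutesToClear is Infinity until you add capacity.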
4. Flow deadlocks
BullMQ flows let you define parent-child job dependencies. A parent job enters waiting-children state and only resumes when all its children complete. This is powerful for building DAGs — but it creates a deadlock risk that's almost impossible to spot without dedicated tooling.
The scenario: A child job fails and exhausts all its retry attempts. The parent is waiting on that child. But failParentOnFailure isn't set (it defaults to false). The parent will wait forever. It's not failed — it's in waiting-children. It's not stalled — it doesn't have a lock. It just sits there. Permanently.
```javascript
// This creates a deadlock risk if fetch-data fails:
const flow = new FlowProducer({ connection });
await flow.add({
  name: 'generate-report',
  queueName: 'reports',
  children: [
    {
      name: 'fetch-data',
      queueName: 'data-fetch',
      opts: {
        attempts: 3,
        // Missing: failParentOnFailure: true
      }
    }
  ]
});
```
How to detect it: Scan all queues for jobs in waiting-children state. For each, resolve the child job IDs and check if any child is in the failed set with attemptsMade >= maxAttempts and no failParentOnFailure option. That's a deadlock.
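The per-child check is a small predicate. This sketch assumes you have already loaded the child job and determined whether it sits in the failed set; attemptsMade, opts.attempts, and opts.failParentOnFailure mirror fields on a BullMQ Job, while the isFailed flag and function name are my own:

```javascript
// True when a failed child can never resume its waiting parent:
// it has exhausted its attempts and won't propagate failure upward.
function isFlowDeadlock(child) {
  const maxAttempts = child.opts.attempts ?? 1;
  return (
    child.isFailed &&
    child.attemptsMade >= maxAttempts &&
    !child.opts.failParentOnFailure
  );
}
```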
5. Overdue delayed jobs
BullMQ's delayed job mechanism stores jobs in a sorted set scored by their execution timestamp. A promotion mechanism inside the Worker checks this sorted set and moves jobs to the wait list when their time arrives.
The problem: the promotion mechanism runs inside workers. If your workers crash, restart, or are down during a deployment window, delayed jobs scheduled during that gap are never promoted. They sit in the delayed sorted set, past their execution time, doing nothing.
This is especially brutal for scheduled jobs. A nightly cleanup job scheduled for 4 AM fails to execute because the worker was restarting during a deploy at 3:55 AM. Nobody notices until the next day.
How to detect it: Query the delayed sorted set (ZRANGEBYSCORE bull:<queue>:delayed 0 <now>) with a grace period to avoid false positives from normal promotion latency. Any jobs scored before the current timestamp minus the grace period are overdue. A grace period of 60 seconds works well — BullMQ's promotion check runs every few seconds, so anything older than a minute is genuinely stuck.
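Decoding a delayed-set score and applying the grace period might look like this. A sketch that assumes the score packing this article describes (timestamp * 0x1000 + counter); the helper name is mine:

```javascript
const GRACE_MS = 60_000; // tolerate normal promotion latency

// A delayed score packs the due timestamp in its upper bits;
// dividing by 0x1000 recovers the execution time in millis.
function isOverdue(score, nowMs, graceMs = GRACE_MS) {
  const dueAtMs = Math.floor(score / 0x1000);
  return dueAtMs < nowMs - graceMs;
}
```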
6. Error storms
A deployment introduces a bug. One job type starts failing at 100%. BullMQ dutifully retries each job 3 times with exponential backoff. Every retry fails again. Now you have 3x your failure volume, all consuming worker capacity, all failing, all generating Redis writes for retry state transitions.
Meanwhile, your healthy job types are starving for worker capacity because the failing jobs are eating all the concurrency slots during their retry processing.
How to detect it: Raw failure counts are noisy. You need two things: failure rate relative to a baseline (a rolling 7-day average), and error grouping to see that 312 of your 347 failures share the same TypeError. Without error grouping, you're scrolling through individual failed jobs trying to manually spot patterns.
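Error grouping doesn't need anything fancy: grouping by the first line of each failure reason catches most storms. A sketch (failedReason is the field BullMQ stores on failed jobs; the helper name is mine):

```javascript
// Group failure reasons by their first line and return the dominant error.
function topError(failedReasons) {
  const counts = new Map();
  for (const reason of failedReasons) {
    const key = (reason || 'unknown').split('\n')[0];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let best = null;
  for (const [message, count] of counts) {
    if (!best || count > best.count) best = { message, count };
  }
  return best; // e.g. { message: "TypeError: ...", count: 312 }
}
```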
7. Clock skew in containerized deployments
This is the subtle one. BullMQ packs delayed job scores as timestamp * 0x1000 + counter. Overdue detection relies on comparing these scores against the current time. If your monitoring tool runs on a different host than Redis with a clock drift of even 10 seconds, you'll get false positives (or miss real overdue jobs).
In containerized environments — Kubernetes, ECS, Docker Compose — clock drift is common. NTP might not be running in the container. The host clock might be off. Docker Desktop on macOS is notorious for clock skew after sleep/wake cycles.
How to detect it: On startup, compare local Date.now() against Redis TIME command. If the delta exceeds 5 seconds, apply a skew correction to all time-sensitive operations.
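The comparison itself is small: Redis TIME replies with [seconds, microseconds]; convert to millis and diff against the local clock. A sketch with an invented helper name:

```javascript
// redisTimeReply is the raw TIME reply: [seconds, microseconds] as strings.
// Positive result means the Redis clock is ahead of the local clock.
function computeSkewMs(redisTimeReply, localNowMs) {
  const [secs, micros] = redisTimeReply.map(Number);
  const redisNowMs = secs * 1000 + Math.floor(micros / 1000);
  return redisNowMs - localNowMs;
}

// Usage sketch: const skew = computeSkewMs(await redis.time(), Date.now());
// if (Math.abs(skew) > 5000) { /* add skew to all time comparisons */ }
```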
The 5 metrics that actually matter
Most BullMQ dashboards show you 20+ metrics per queue. Most of them are noise. Here are the five I care about:
1. Net drain rate — drain rate minus inflow rate, per queue. Negative means the backlog is growing; this is the single best early signal that a queue is falling behind.
2. Failure rate vs baseline — not absolute failure count, but the multiplier against your 7-day rolling average. 50 failures/hour is fine if your baseline is 45. 50 failures/hour is a five-alarm fire if your baseline is 2.
3. Stalled job count — zero-baseline. Even one is urgent.
4. Redis memory growth rate — not current usage, but the rate of change. Projecting time-to-OOM is more useful than a static memory threshold.
5. Oldest waiting job age — how long the oldest job in the wait list has been waiting. This is a proxy for end-to-end latency. If your oldest waiting job is 10 minutes old, your customers are waiting 10 minutes.
Everything else — completed count, active count, delayed count — is useful for debugging but shouldn't drive your alerting. These are state indicators, not health indicators.
Monitoring approaches compared
There are four common approaches to monitoring BullMQ in production. Each has tradeoffs.
| Approach | Pros | Cons |
|---|---|---|
| bull-board (embedded UI) | Simple setup. Good for dev. Familiar to most BullMQ users. | No history. No anomaly detection. No alerting. Requires code changes — you mount it as Express middleware. Can't see what happened 5 minutes ago. |
| Prometheus + Grafana (DIY metrics) | Full control. Integrates with existing monitoring stack. Custom dashboards. | Significant setup time. You write and maintain the exporter. BullMQ-specific anomalies (stalls, deadlocks, overdue delayed) require custom logic. No error grouping. No flow awareness. |
| Taskforce.sh (SaaS, by BullMQ creator) | Built by the BullMQ author. Polished UI. Hosted — no infrastructure to manage. | SaaS only — your Redis credentials go to a third party. No anomaly detection. No capacity planning. No error clustering. Costs money. |
| Standalone collector (e.g., Damasqas) | Zero code changes. Historical data. Anomaly detection. Self-hosted — data stays local. Runs alongside your app. | Another process to run. SQLite storage means single-node (fine for most teams, not for multi-region HA). |
My recommendation: use a standalone collector for production, and keep bull-board for local development. Bull-board is great for poking around during development, but it's a liability in production because it's embedded in your application process and has no historical data.
Setting up production monitoring
Here's the practical setup. I'll use Damasqas as the example because it covers all seven failure modes above out of the box, but the principles apply regardless of which tool you use.
Step 1: Install alongside your application
The collector is a separate process. It connects to the same Redis instance your BullMQ workers use but runs independently. This means it keeps monitoring even when your workers are down (which is exactly when you need monitoring most).
```yaml
# Docker Compose — add alongside your existing services:
services:
  damasqas:
    image: damasqas/damasqas
    ports:
      - "3888:3888"
    command: node dist/index.js --redis redis://redis:6379
    environment:
      DAMASQAS_DATA_DIR: /data
      SLACK_WEBHOOK: https://hooks.slack.com/services/...
    volumes:
      - damasqas-data:/data
    depends_on:
      redis:
        condition: service_healthy

# Named volumes must be declared at the top level:
volumes:
  damasqas-data:
```
Or if you just want to try it:
```shell
npx damasqas --redis redis://localhost:6379
```
Step 2: Verify Redis configuration
Before anything else, check two things:
```shell
# Must be 'noeviction':
redis-cli CONFIG GET maxmemory-policy

# Must have a limit set (not 0):
redis-cli CONFIG GET maxmemory
```
If maxmemory is 0 (unlimited), Redis will grow until the OS OOM-kills it. If maxmemory-policy isn't noeviction, Redis will silently delete your BullMQ keys. Both are production disasters.
Step 3: Configure job cleanup
The single most impactful thing you can do is configure auto-removal on your queues. This prevents the two biggest sources of Redis memory growth:
```javascript
// On your Queue instance — trim the event stream:
const queue = new Queue('email-send', {
  connection,
  streams: { events: { maxLen: 10000 } }
});

// On every job — auto-remove old completed/failed:
await queue.add('send-welcome', data, {
  removeOnComplete: { count: 1000 },
  removeOnFail: { count: 5000 },
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 }
});
```
Step 4: Set up webhook alerts
Configure Slack or Discord webhooks so anomalies reach you immediately. At minimum, you want alerts for:
- Stalled jobs — any count above zero, immediate alert
- Failure spikes — 3x above the 7-day rolling average
- Redis memory — projected OOM within 6 hours
- Backlog growth — queue growing for 5+ consecutive analysis intervals
```shell
npx damasqas \
  --redis redis://your-redis:6379 \
  --slack-webhook https://hooks.slack.com/services/... \
  --failure-threshold 3 \
  --backlog-threshold 5 \
  --cooldown 300
```
Alerting that doesn't suck
The failure mode of monitoring is too many alerts. Here's how to avoid that:
Use baseline-relative thresholds, not absolute values. "Alert when failures exceed 50/minute" is brittle — it fires during normal traffic spikes on your busiest queue and stays quiet during a 100% failure rate on your low-volume queue. "Alert when failures exceed 3x the 7-day rolling average" adapts automatically to each queue's normal behavior.
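As a sketch, a baseline-relative check needs only the current count, the rolling average, and a floor so near-zero baselines don't page you for a single failure (the names and the floor value are my own):

```javascript
// Fire only when failures exceed a multiple of the rolling baseline,
// and only above an absolute floor to suppress noise on quiet queues.
function isFailureSpike(current, baseline, multiplier = 3, floor = 10) {
  return current >= floor && current > baseline * multiplier;
}
```

With the article's numbers: 50 failures against a baseline of 45 stays quiet, while 50 against a baseline of 2 fires.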
Cooldown periods prevent alert storms. If a failure spike fires, don't fire again for the same queue within 5 minutes. The first alert told you about the problem. Repeating it every 10 seconds doesn't help — it trains you to ignore alerts.
Include context in every alert. A useful alert isn't "email-send failure spike." A useful alert is: "email-send failure spike — 347 failures in the last 5 minutes, 29x above baseline. Top error: TypeError: Cannot read 'email' of undefined (312 occurrences). 3,402 jobs waiting. 1 stalled job detected." That's the difference between an alert you need to investigate and an alert you can act on immediately.
Separate detection from dispatch. Your anomaly detection should run every 10 seconds. Your alert dispatch should run independently — it checks for unsent anomalies and delivers them. This way, a slow Slack webhook doesn't block your detection loop.
Wrapping up
BullMQ is an excellent queue system. It's well-designed, actively maintained, and battle-tested at scale. But like any infrastructure component, it needs monitoring that understands its specific failure modes.
The tools you use matter less than the coverage. At minimum, you need: stall detection, Redis memory tracking, baseline-relative failure alerting, and drain rate analysis. Everything else is a nice-to-have.
If you want to set all of this up in 30 seconds:
```shell
npx damasqas --redis redis://localhost:6379
```
It covers every failure mode in this guide out of the box. Open source, self-hosted, zero code changes.