Python Task Queues: Choosing Between Celery and Redis Queue
Python task queues like Celery and Redis Queue (RQ) manage asynchronous workloads, but Celery is better for complex workflows while RQ excels in simplicity and memory isolation. For high-latency AI tasks, choosing the right queue prevents task starvation and reduces infrastructure costs.
Last month, at exactly 3:14 AM, my PagerDuty went off. My FastAPI service, which handles batch processing for Gemini 1.5 Pro document analysis, had completely stalled. The culprit wasn't the LLM API itself, but rather my task queue. I was using Celery with a default configuration, and a sudden burst of 500 high-context PDF processing requests had caused a "prefetch" cascade. My workers were grabbing 4 tasks each, holding them in memory while waiting for the Gemini API to respond, and eventually, the visibility timeout kicked in. The broker thought the workers had died, re-queued the tasks, and created an infinite loop of duplicate API calls that burned through $400 of credits in two hours.
I realized then that the "standard" advice of just using Celery for everything is dangerous. When you are dealing with high-latency AI tasks—where a single request might take 60 seconds to process—the overhead and complexity of your task queue can become your biggest bottleneck. I spent the next two weeks benchmarking Redis Queue (RQ) against Celery to see if I could simplify my stack without losing the reliability I needed for production. This isn't a "hello world" comparison; this is a breakdown of the architectural scars I earned while trying to make Python task queues robust at scale.
Why Celery Complexity Can Increase Infrastructure Costs
Celery offers extensive features but its complex configuration often leads to unpredictable behavior and higher resource consumption in cloud environments. It is the undisputed heavyweight champion of the Python world. It is powerful, feature-rich, and has a plugin for everything. But for my specific use case—triggering LLM calls and updating a database—it felt like using a chainsaw to cut a piece of string. The first problem I encountered was the configuration surface area. Celery has over 100 configuration settings. If you miss one, like task_acks_late or worker_prefetch_multiplier, your system will behave unpredictably under load.
In my initial setup, I was struggling with rate limits. I previously wrote about preventing LLM API rate limiting, but my Celery workers were ignoring those constraints because I hadn't properly tuned the concurrency settings. Celery workers, by default, try to be as efficient as possible by pre-fetching tasks. When your tasks are fast (milliseconds), this is great. When your tasks are slow (LLM inference), this is a disaster. A worker might pre-fetch 10 tasks, but if the first task takes 2 minutes, the other 9 tasks sit idle in the worker's local memory, invisible to other workers that might be free.
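To make the starvation effect concrete, here is a toy simulation rather than a model of Celery's real scheduler. It assumes each worker reserves a fixed batch of tasks and runs them back to back, and the function name `start_times` and the task durations are made up for illustration.

```python
import heapq


def start_times(durations, workers, prefetch):
    """Return the simulated start time of each task.

    Simplified model: whenever a worker becomes free it reserves the next
    `prefetch` tasks from the queue and runs them one after another.
    """
    pending = list(enumerate(durations))
    # Min-heap of (time the worker becomes free, worker id).
    free_at = [(0.0, w) for w in range(workers)]
    heapq.heapify(free_at)
    starts = {}
    while pending:
        now, worker = heapq.heappop(free_at)
        batch, pending = pending[:prefetch], pending[prefetch:]
        for index, duration in batch:
            starts[index] = now  # task waits until the worker reaches it
            now += duration
        heapq.heappush(free_at, (now, worker))
    return [starts[i] for i in range(len(durations))]


# One slow LLM call (120 s) followed by seven quick tasks (1 s each).
durations = [120, 1, 1, 1, 1, 1, 1, 1]
print(max(start_times(durations, workers=2, prefetch=4)))
print(max(start_times(durations, workers=2, prefetch=1)))
```

In this toy model, the quick tasks reserved behind the slow one cannot start until it finishes, even though the second worker goes idle after a few seconds. With a prefetch of 1, the free worker simply drains the rest of the queue.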
# My original, problematic Celery config
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

# This default was killing my LLM pipeline
app.conf.update(
    task_serializer='json',
    accept_content=['json'],
    result_serializer='json',
    timezone='UTC',
    enable_utc=True,
    # The default prefetch multiplier is 4.
    # With 8 worker processes, that's 32 tasks "locked" per node.
    worker_prefetch_multiplier=4,
)
When I switched the worker_prefetch_multiplier to 1, performance improved, but the memory footprint of Celery beat and the worker nodes remained high. On Google Cloud Run, where I'm billed by the millisecond and megabyte, running a fleet of Celery workers was costing me 30% more than the actual compute required for the tasks. I needed something leaner.
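For reference, a minimal sketch of the settings involved. The names `worker_prefetch_multiplier` and `task_acks_late` are real Celery settings, but the helper function and the `LONG_TASK_SETTINGS` dict are illustrative, and the values are a starting point for slow tasks rather than a recommendation for every workload.

```python
def reserved_tasks_per_node(concurrency: int, prefetch_multiplier: int) -> int:
    """Tasks a node may hold at once: worker processes x prefetch multiplier."""
    return concurrency * prefetch_multiplier


# Illustrative settings for long-running tasks; apply them with
# app.conf.update(**LONG_TASK_SETTINGS) on an existing Celery app.
LONG_TASK_SETTINGS = {
    # Reserve one task per worker process instead of four.
    "worker_prefetch_multiplier": 1,
    # Acknowledge after the task finishes, so a process does not reserve
    # a second task while the first is still running.
    "task_acks_late": True,
}

print(reserved_tasks_per_node(concurrency=8, prefetch_multiplier=4))  # 32
print(reserved_tasks_per_node(concurrency=8, prefetch_multiplier=1))  # 8
```

Late acknowledgement means a task can run again if its worker dies mid-execution, so it only suits tasks that are safe to repeat. It also makes the broker's visibility timeout matter more, because a long task stays unacknowledged for its whole runtime.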
How Redis Queue (RQ) Simplifies Python Task Management
Redis Queue (RQ) provides a minimalist, Redis-only approach that runs each job in a forked "work horse" process, which gives it better memory isolation for long-running tasks. I decided to spike a prototype using Python-RQ (Redis Queue). The first thing I noticed was the simplicity of the mental model. RQ doesn't try to support RabbitMQ, Amazon SQS, or Zookeeper. It only supports Redis. Because it's focused, the code is significantly more readable, and the forked work-horse model made worker isolation much easier to debug.
In RQ, a worker is just a Python process that listens to a list in Redis. When a task comes in, the worker forks itself, executes the task in the child process, and then the child exits. This is a godsend for memory management. If you have a memory leak in your LLM processing logic (perhaps a large document isn't being cleared from the heap correctly), the fork ensures that the memory is reclaimed by the OS as soon as the task finishes. Celery's pool processes stay alive across many tasks, and while you can set worker_max_tasks_per_child to recycle them, it's not as clean as the RQ approach.
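The fork-per-job idea can be sketched with the standard library alone. This is not RQ's actual worker code. `LEAKY_CACHE`, `leaky_task`, and `run_in_child` are hypothetical names, and the "fork" start method assumes a POSIX system.

```python
import multiprocessing as mp

LEAKY_CACHE = []  # stands in for memory a task forgets to release


def leaky_task(n):
    """Simulate a task that leaves data behind in module-level state."""
    LEAKY_CACHE.extend(range(n))
    return len(LEAKY_CACHE)


def _child_main(conn, func, args):
    conn.send(func(*args))
    conn.close()


def run_in_child(func, *args):
    """Run func in a forked child and return its result to the parent."""
    ctx = mp.get_context("fork")  # POSIX only
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child_main, args=(send_end, func, args))
    proc.start()
    send_end.close()  # parent keeps only the receiving end
    result = recv_end.recv()
    proc.join()
    return result


print(run_in_child(leaky_task, 1000))  # the child saw 1000 cached items
print(len(LEAKY_CACHE))                # the parent's copy is still empty
```

Whatever the child allocated disappears when it exits, which is the property that makes the fork model forgiving of leaky task code.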
How to Implement a FastAPI and Redis Queue Pattern
Integrating RQ into a FastAPI application requires minimal boilerplate and allows for explicit retry logic per task. Here is the pattern I moved to for my document analysis endpoint. It’s significantly more "Pythonic" and lacks the boilerplate that haunted my Celery implementation.
from fastapi import FastAPI
from redis import Redis
from rq import Queue, Retry  # Retry was missing from the imports

from my_tasks import analyze_document_task

app = FastAPI()
redis_conn = Redis(host='redis', port=6379)

# I use different queues for different priorities
analysis_queue = Queue('analysis', connection=redis_conn)


@app.post("/analyze")
async def trigger_analysis(doc_id: str):
    # Enqueue the task with a long timeout for LLM processing
    job = analysis_queue.enqueue(
        analyze_document_task,
        args=(doc_id,),
        job_timeout='10m',  # Gemini can be slow for long contexts
        # Up to 3 retries, waiting 10s, 60s, then 300s between attempts.
        # Interval-based retries need a worker started with --with-scheduler.
        retry=Retry(max=3, interval=[10, 60, 300]),
    )
    return {"job_id": job.get_id()}
The retry logic here is explicit and easy to reason about. In Celery, I often found myself fighting with the autoretry_for and retry_backoff arguments on the task decorator, which would sometimes conflict with my global settings. With RQ, the retry policy is passed right at the point of enqueuing, which makes it much easier to customize retries based on the specific document size or user tier.
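As a rough illustration of per-request retry policies, the helper below picks the retry settings before enqueuing. The function name, the tiers, and the page thresholds are all hypothetical. The returned keys match the `max` and `interval` arguments of RQ's `Retry`, so the result could be passed as `Retry(**policy)`.

```python
def retry_policy(page_count: int, tier: str) -> dict:
    """Pick retry settings from document size and user tier (made-up rules)."""
    if tier == "free":
        # Free tier: a single retry after one minute.
        return {"max": 1, "interval": [60]}
    if page_count > 200:
        # Very large documents are expensive, so retry less and wait longer.
        return {"max": 2, "interval": [120, 600]}
    # Default: the same schedule used in the endpoint above.
    return {"max": 3, "interval": [10, 60, 300]}


policy = retry_policy(page_count=350, tier="pro")
print(policy)
```

Keeping the rules in one plain function makes them easy to unit test without a running Redis instance.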
Why the Visibility Timeout Trap Causes Duplicate API Calls
Misconfigured visibility timeouts in task brokers lead to "ghost tasks" where multiple workers process the same request, doubling API costs. One of the most critical issues I faced was the "Visibility Timeout." This is the amount of time the broker waits for a worker to acknowledge a task before putting it back in the queue. If your LLM task takes 10 minutes to process a massive legal document, but your visibility timeout is set to 5 minutes, you will enter a "Ghost Task" loop. The broker re-assigns the task to Worker B while Worker A is still working on it. Now you have two workers calling the Gemini API for the same document, doubling your costs.
In Celery (using Redis as a broker), this is controlled by the visibility_timeout key in broker_transport_options. RQ has no direct equivalent: the closest knob is job_timeout, which caps how long a job may run (result_ttl only controls how long the return value is kept). The difference is how they handle the heartbeat. Celery has a complex gossip/heartbeat mechanism between workers. RQ is much more "dumb", and I mean that in a good way. It relies on the worker process's state in Redis. However, I did find a major downside to RQ here: it doesn't handle worker crashes as gracefully as Celery does. If an RQ worker gets a SIGKILL (e.g., OOMKilled on Kubernetes), the job can sometimes get stuck in the 'started' state indefinitely unless you have a cleanup script running.
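A small sketch of how the timeout might be sized for the Celery side. The helper names are hypothetical, and the 2x safety factor is the rule of thumb from the lessons at the end of this post rather than anything Celery prescribes. `broker_transport_options` and its `visibility_timeout` key are the real setting names for the Redis transport.

```python
import math


def visibility_timeout_for(longest_task_seconds: float,
                           safety_factor: float = 2.0) -> int:
    """Visibility timeout in seconds, padded beyond the slowest expected task."""
    return math.ceil(longest_task_seconds * safety_factor)


def will_be_redelivered(task_seconds: float, visibility_timeout: float) -> bool:
    """True if an unacknowledged task outlives the visibility timeout."""
    return task_seconds > visibility_timeout


# A 10-minute document against a 5-minute timeout triggers the ghost-task loop.
print(will_be_redelivered(task_seconds=600, visibility_timeout=300))

# Illustrative fix; apply with
# app.conf.broker_transport_options = BROKER_TRANSPORT_OPTIONS
BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": visibility_timeout_for(600)}
print(BROKER_TRANSPORT_OPTIONS)
```

The check is deliberately simple: any task that can stay unacknowledged longer than the timeout is a candidate for duplicate execution.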
Benchmarking Performance: Celery vs. Redis Queue Metrics
Performance benchmarks show that while Celery has lower per-task overhead, Redis Queue allows for significantly higher worker density on the same hardware. I ran a series of stress tests on my staging environment (4 vCPU, 8GB RAM Cloud Run instances). I was looking for three metrics: Idle memory usage, Task overhead (time from enqueue to start), and Throughput under high-latency conditions.
| Metric | Celery (Redis Broker) | Redis Queue (RQ) |
|---|---|---|
| Idle Memory (per worker) | ~145 MB | ~42 MB |
| Overhead per task | 12ms | 45ms |
| Max Concurrent Tasks (8GB RAM) | ~40 tasks | ~110 tasks |
| Config Complexity | High (100+ options) | Low (5-10 options) |
The results were eye-opening. RQ has significantly higher per-task overhead (45ms vs 12ms), but for my use case, 45ms is irrelevant when the task itself takes 30 seconds. The win was the memory. I could fit nearly 3x as many RQ workers on the same hardware because they didn't carry the baggage of the Kombu library and the complex Celery state machine. This directly translates to lower costs, which I've been tracking using the methods I discussed in my post on building a real-time LLM API cost dashboard.
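The trade-off is easy to check with the numbers from the table. The snippet only restates that arithmetic, and the function name is made up.

```python
def overhead_share(overhead_ms: float, task_seconds: float) -> float:
    """Fraction of a task's runtime spent on queue overhead."""
    return (overhead_ms / 1000.0) / task_seconds


# Figures from the benchmark table, applied to a 30-second LLM task.
rq_share = overhead_share(45, 30)
celery_share = overhead_share(12, 30)
density_ratio = 110 / 40  # max concurrent tasks, RQ vs. Celery

print(f"RQ overhead: {rq_share:.2%} of task time")
print(f"Celery overhead: {celery_share:.2%} of task time")
print(f"Worker density: {density_ratio:.2f}x")
```

Both overhead figures are a small fraction of one percent of a 30-second task, while the density difference is 2.75x, which is where "nearly 3x" comes from.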
When Redis Queue (RQ) Is Not the Right Choice
Redis Queue defaults to pickle rather than JSON for serialization, and it lacks the robust monitoring UIs and complex task grouping primitives found in Celery. After migrating 50% of my traffic to RQ, I hit a wall with prioritization and observability. In Celery, I can use Flower to see exactly what's happening in real-time with a beautiful UI. RQ has rq-dashboard, but it's much more primitive. I missed the ability to see a "Task Graph" or to easily revoke tasks by name across a whole cluster.
Furthermore, RQ's reliance on Python's pickle for serialization is a security risk if you don't trust your Redis environment. I had to spend extra time configuring a custom JSON serializer for RQ to ensure that we weren't vulnerable to arbitrary code execution if our Redis instance was ever compromised. Celery handles JSON serialization out of the box much more robustly.
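The awkward part of moving to JSON is that common Python types do not serialize on their own. The sketch below uses only the standard library. The helper names are hypothetical, and this is not the serializer class I plugged into RQ, just the kind of conversion it needs to do.

```python
import json
import uuid
from datetime import datetime, timezone


def to_jsonable(value):
    """Fallback for json.dumps: convert types JSON cannot represent."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    raise TypeError(f"Cannot serialize {type(value).__name__} to JSON")


def dumps_job_args(args: dict) -> str:
    """Serialize job arguments without falling back to pickle."""
    return json.dumps(args, default=to_jsonable)


payload = {
    "doc_id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "requested_at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    "pages": 42,
}
print(dumps_job_args(payload))
```

The conversion is one-way: the task receives strings and has to parse them back into `UUID` and `datetime` objects itself, which is the "careful handling" cost of dropping pickle.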
Another pain point was Task Groups (Chords/Chains). If you need to run 10 LLM prompts in parallel and then run a "summary" task once they are all finished, Celery's chord primitive is magic. In RQ, the closest tool is the depends_on argument, which only gates a job on its parent jobs finishing. Collecting the sub-task results and handling partial failures still meant tracking state in Redis and triggering the final task myself. It's error-prone and leads to a lot of "glue code" that I didn't want to maintain.
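The glue code in question boils down to a countdown with a callback. Here is a minimal in-memory sketch of that bookkeeping. The `FanIn` class is hypothetical, and a real version would keep the counter and results in Redis using atomic operations, since sub-tasks finish in different processes.

```python
class FanIn:
    """Fire a callback once all expected sub-tasks have reported a result."""

    def __init__(self, total, on_complete):
        self.remaining = total
        self.results = {}
        self.on_complete = on_complete

    def sub_task_done(self, key, result):
        """Record one result; return True if this call triggered the callback."""
        if key in self.results:
            return False  # ignore duplicate deliveries of the same sub-task
        self.results[key] = result
        self.remaining -= 1
        if self.remaining == 0:
            self.on_complete(self.results)
            return True
        return False


summaries = []
fan_in = FanIn(total=3, on_complete=lambda r: summaries.append(sorted(r.values())))
fan_in.sub_task_done("prompt-1", "b")
fan_in.sub_task_done("prompt-2", "a")
fan_in.sub_task_done("prompt-3", "c")
print(summaries)
```

Even this small version has to think about duplicate deliveries, and it does not cover failed sub-tasks or timeouts, which is where the maintenance burden comes from.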
Final Verdict: Which Python Task Queue Should You Choose?
The choice between Python task queues depends on whether your workload requires complex orchestration (Celery) or simple, isolated execution (RQ). After a month of running both in production, I've settled on a hybrid approach. It turns out the answer isn't "which is better," but "which fits the task profile."
Use Celery If:
- You need complex task workflows (chains, groups, chords).
- You require support for multiple brokers (RabbitMQ/SQS).
- You need a robust, production-ready monitoring UI (Flower).
- Your tasks are very short (sub-second) and you need low overhead.
Use Redis Queue (RQ) If:
- You are building a FastAPI/Flask app and want to stay within the Python ecosystem.
- Your tasks are long-running (LLM calls, video processing) and you need clean memory isolation.
- You are running on resource-constrained environments like Cloud Run or small VPS instances.
- You value code readability and "debuggability" over feature density.
Key Lessons for Managing Python Task Queues in Production
Effective management of Python task queues requires careful tuning of prefetch settings, memory isolation, and timeout configurations. Here are the core takeaways from my migration experience:
- Defaults are dangerous: Never run Celery in production with the default worker_prefetch_multiplier if your tasks take more than a few seconds. You will experience task starvation.
- Memory isolation matters: For AI workloads involving large models or heavy data processing, the fork-based model of RQ is superior for preventing memory bloat.
- Visibility timeouts are the silent killer: Always set your broker visibility timeout to be at least 2x your longest expected task duration. I learned this the hard way by paying for duplicate Gemini API calls.
- Serialization isn't free: Moving from Pickle to JSON in RQ is necessary for security but requires careful handling of Python objects (like datetimes or UUIDs) that don't natively serialize to JSON.
Related Reading
- Preventing LLM API Rate Limiting: Concurrency Control for Production Workloads - Essential for understanding how to throttle your workers so they don't get blocked by providers.
- How I Reduced LLM API Costs with a Custom Tokenization Strategy - Useful for optimizing the payload size being passed through your task queues.
Moving forward, I'm keeping my high-priority, complex document workflows on Celery, but I've moved all my simple, high-latency "background enrichment" tasks to RQ. This reduced my overall infrastructure complexity and saved me about 15% on my monthly GCP bill. My next challenge is to build a custom scheduler that can dynamically adjust worker counts based on the current token price of the Gemini API—but that's a story for another post.