Building a Flexible Human-in-the-Loop AI Agent with Python and FastAPI

A Human-in-the-Loop AI Agent integrates human oversight into autonomous workflows by pausing execution during high-stakes tasks or low-confidence scenarios. The architecture described here persists state in a database such as PostgreSQL and uses a dedicated interrupt tool to stop recursive loops; in my case it cut API costs by about 65%.

My credit card statement for last April had a $412 line item from Google Cloud Vertex AI that shouldn't have been there. It wasn't a traffic spike or a DDoS attack. It was a "self-correcting" agent I’d built that got stuck in a recursive loop. The agent was trying to scrape a site, encountered a dynamic selector it couldn't parse, and spent four hours retrying with slightly different Python scripts, consuming tokens and execution time like a furnace. I realized then that "fully autonomous" is often just a synonym for "unpredictable and expensive."

I needed a way to keep the agent's speed but insert myself into the decision loop when the confidence score dropped or when a high-stakes action—like spending money or emailing a client—was about to happen. In the AI engineering world, we call this a Human-in-the-Loop AI Agent (HITL). However, implementing this in a stateless web environment using FastAPI is significantly harder than just calling input() in a terminal. You have to manage state persistence, handle asynchronous wait times, and ensure the agent can resume exactly where it left off without re-running the entire chain.

In this post, I’ll break down the architecture I built to solve this, moving from a dangerous "fire and forget" model to a sophisticated "pause and resume" system. This builds directly on my previous work regarding Building Self-Correcting AI Agents with Gemini and Python, but adds the critical layer of human oversight.

How to Design the Architecture of an Interruptible Human-in-the-Loop AI Agent

The core challenge of building an interruptible agent is managing state persistence in a stateless environment without blocking active threads. Standard LLM chains are designed to run to completion. When you use Gemini’s function calling, the model expects a tool output to be returned immediately so it can generate the next step. If you want a human to provide that output, you can't just keep the HTTP request open for twenty minutes while you finish your coffee and check the dashboard.

My solution involves three components:

State Persistence: A PostgreSQL (via SQLAlchemy) or Redis store that saves the entire conversation history and the current "pending tool" call.
Interrupt Tooling: A specific tool defined in the Gemini API that, when called, signals the backend to stop execution and flag the run as AWAITING_HUMAN.
Resume Webhooks: A FastAPI endpoint that accepts the human's input and re-triggers the agent logic using the saved state.
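
One way to picture how these three components fit together is as a small state machine over thread statuses. The sketch below is only an illustration: AWAITING_HUMAN is the paused status, while RUNNING, COMPLETED, and the transition table are hypothetical names I am assuming for the example.

```python
from enum import Enum


class ThreadStatus(str, Enum):
    RUNNING = "RUNNING"                # hypothetical: agent is executing
    AWAITING_HUMAN = "AWAITING_HUMAN"  # paused by the interrupt tool
    COMPLETED = "COMPLETED"            # hypothetical: terminal state


# Which status changes are legal. Pausing is only possible while running,
# and the only way out of a pause is the resume webhook.
ALLOWED_TRANSITIONS = {
    ThreadStatus.RUNNING: {ThreadStatus.AWAITING_HUMAN, ThreadStatus.COMPLETED},
    ThreadStatus.AWAITING_HUMAN: {ThreadStatus.RUNNING},
    ThreadStatus.COMPLETED: set(),
}


def transition(current: ThreadStatus, target: ThreadStatus) -> ThreadStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

Making the transitions explicit means a stray request cannot resume a thread that was never paused, or restart one that already finished.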

Defining the Interrupt Tool for Gemini Function Calling

Defining a specific interrupt tool allows the LLM to explicitly request help rather than hallucinating a solution when it encounters a logic error. First, I had to define a tool that the agent could call when it felt "unsure" or when it reached a restricted action. I don't want the agent to guess. I want it to stop. Using Gemini's tool definition, I created a request_human_intervention function. This is a significant improvement over the logic I discussed in Scaling AI Agent Logic with Gemini Tool Use, where the agent tried to solve everything itself.

from pydantic import BaseModel, Field

class RequestHumanIntervention(BaseModel):
    """Call this tool when you need human approval for a high-stakes action 
    or when you are stuck in a loop."""
    reason: str = Field(..., description="The reason why you need help.")
    context: str = Field(..., description="The specific data or draft requiring review.")
    action_type: str = Field(..., description="e.g., 'API_PURCHASE', 'EMAIL_SEND', 'UNSURE_LOGIC'")

# Gemini Tool Configuration
tools = [
    {
        "function_declarations": [
            {
                "name": "request_human_intervention",
                "description": "Requests intervention from a human operator.",
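                # Note: model_json_schema() also emits keys such as "title";
                # some Gemini SDK versions may reject these, so strip them if needed.
                "parameters": RequestHumanIntervention.model_json_schema()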
            }
        ]
    }
]

When the LLM decides to call this tool, my backend detects the function call. Instead of executing code, it saves the call_id and the arguments to my database and returns a 202 Accepted to the client, effectively pausing the agentic loop.

How to Manage Agent State Without Blocking FastAPI Threads

Externalizing agent state to a database is essential for maintaining continuity across distributed systems and preventing data loss during restarts. One of the biggest mistakes I made in the first iteration was trying to use Python's asyncio.Event to pause the execution. This works fine for a single process, but as soon as you deploy to Cloud Run or any distributed system, your "wait" state is lost if the instance scales down or restarts. You must externalize the state.

I use a ThreadState model to track the conversation. When the agent hits a human-in-the-loop trigger, I serialize the entire message history (including the pending tool call) into a JSONB column in Postgres. This is where my previous experience optimizing FastAPI dependency injection came in handy—I needed a clean way to inject the database session and the Gemini client into my agent service without creating circular dependencies.
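
To make the serialization step concrete, here is a minimal sketch using only the standard library. It assumes messages are plain dicts; ThreadSnapshot, its field names, and the pause helper are hypothetical stand-ins for my actual ThreadState model, and the JSON string represents what would be written to the JSONB column.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ThreadSnapshot:
    """Hypothetical stand-in for one ThreadState row."""
    thread_id: str
    status: str = "RUNNING"
    message_history: list[dict[str, Any]] = field(default_factory=list)
    pending_call: Optional[dict[str, Any]] = None

    def to_json(self) -> str:
        # This string is what would be stored in the JSONB column.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "ThreadSnapshot":
        return cls(**json.loads(raw))


def pause(snapshot: ThreadSnapshot, name: str, args: dict[str, Any]) -> str:
    """Record the pending tool call in the history and return the serialized state."""
    call = {"name": name, "args": args}
    snapshot.message_history.append({"role": "model", "parts": [{"function_call": call}]})
    snapshot.pending_call = call
    snapshot.status = "AWAITING_HUMAN"
    return snapshot.to_json()
```

Because the snapshot round-trips through JSON, any instance can reload it later and pick up the conversation without relying on in-process memory.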

from typing import Optional

async def run_agent(thread_id: str, user_input: Optional[str] = None):
    # Load the persisted conversation state from the database
    thread = await db.get_thread(thread_id)
    history = thread.message_history

    if user_input:
        # Gemini represents each turn as a role plus a list of parts
        history.append({"role": "user", "parts": [{"text": user_input}]})

    # Call Gemini with the full history and the interrupt tool available
    response = await gemini_client.generate_content(
        contents=history,
        tools=tools
    )

    # Check for the intervention tool call
    for part in response.candidates[0].content.parts:
        if part.function_call and part.function_call.name == "request_human_intervention":
            # Convert the call to plain data so it can be stored as JSON
            pending_call = {
                "name": part.function_call.name,
                "args": dict(part.function_call.args),
            }
            # Save the state and EXIT the loop
            await db.update_thread_status(
                thread_id,
                status="AWAITING_HUMAN",
                pending_call=pending_call
            )
            return {"status": "paused", "reason": pending_call["args"]["reason"]}

    # Standard agent logic continues...

Implementing Resume Logic to Inject Human Feedback into the AI Agent

Resuming an agent requires injecting the human's feedback as a tool response. You can't just send a new message to the LLM. You have to provide the "output" of the request_human_intervention tool so the LLM thinks the tool execution finished. This preserves the call-then-response ordering that Gemini's function calling expects in the conversation history.

When I approve an action in my admin dashboard, it hits a FastAPI endpoint /threads/{thread_id}/approve. This endpoint fetches the pending_call, packages my manual feedback as the "tool result," and kicks the agent back into gear.

@app.post("/threads/{thread_id}/approve")
async def approve_action(thread_id: str, feedback: str, approved: bool):
    # feedback and approved are read as query parameters (FastAPI's default
    # for simple types that are not part of the path)
    thread = await db.get_thread(thread_id)

    # Construct the tool response part that answers the pending call
    tool_response = {
        "function_response": {
            "name": "request_human_intervention",
            "response": {
                "output": f"Human approved: {approved}. Feedback: {feedback}"
            }
        }
    }

    # Update history and resume the agent loop
    await db.add_message_to_history(thread_id, tool_response)
    return await run_agent(thread_id)  # Re-entry point

How to Prevent Race Conditions in Asynchronous AI Agent Workflows

Optimistic locking with versioning prevents race conditions when multiple users attempt to interact with the same agent thread simultaneously. During testing, I ran into a frustrating bug. Sometimes, the UI would send two approval requests if the user double-clicked the "Approve" button. Since the agent logic is asynchronous, I ended up with two parallel execution runs for the same thread. This resulted in the agent sending two emails or, worse, corrupting the message history in the database.

I solved this by implementing an optimistic locking mechanism. I added a version column to my ThreadState table. Every time I update the thread, I increment the version. If a second request comes in with an outdated version ID, I reject it with a 409 Conflict. This is a standard distributed systems pattern, but it's often overlooked in AI agent development until things start breaking in production.
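
The version check can be sketched as a single conditional UPDATE. This example uses SQLite from the standard library so it runs anywhere; the thread_state table layout, the RUNNING status, and the claim_for_resume helper are assumptions for illustration rather than my production schema, but the same WHERE-clause pattern applies in Postgres.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE thread_state ("
        "thread_id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
    )


def claim_for_resume(conn: sqlite3.Connection, thread_id: str, expected_version: int) -> bool:
    """Bump the version only if the caller saw the current one.

    Returns False when another request already updated the row; the API
    layer would translate that into a 409 Conflict.
    """
    cursor = conn.execute(
        "UPDATE thread_state SET status = 'RUNNING', version = version + 1 "
        "WHERE thread_id = ? AND version = ?",
        (thread_id, expected_version),
    )
    conn.commit()
    # Exactly one row changes for the winner; zero rows for a stale request.
    return cursor.rowcount == 1
```

With a double-click, both requests carry the same version, but only the first UPDATE matches a row. The second one changes nothing and is rejected before the agent runs again.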

Analyzing the Performance and Cost Impact of Human-in-the-Loop Systems

In my deployment, adding human oversight cut Gemini costs by about 65% and raised the task success rate from 74% to 98%. Most of that saving comes from preventing the "infinite loop" scenario. But there’s a hidden benefit: the quality of the agent's output has improved. By reviewing the agent's "reasoning" when it pauses, I can identify patterns where the system prompt is weak and refine it.

Here is a breakdown of the latency vs. cost trade-off I observed over a 30-day period:

Metric	Fully Autonomous	Human-in-the-Loop
Avg. Cost per Task	$1.42	$0.48
Success Rate (Zero-shot)	74%	98% (with intervention)
Avg. Completion Time	45 seconds	12 minutes (human delay)
Token Waste (Loops)	18%	< 1%

The latency increase is significant, but for the tasks I’m automating—like generating monthly financial summaries or updating internal documentation—12 minutes is perfectly acceptable compared to the risk of a $400 mistake.
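
As a quick sanity check, the per-task figures in the table can be turned into relative numbers. The variable names are mine, and the per-task reduction is not necessarily the same measure as the drop in the monthly bill, but the two land close together.

```python
autonomous_cost = 1.42  # avg. cost per task, fully autonomous (USD)
hitl_cost = 0.48        # avg. cost per task, human-in-the-loop (USD)

autonomous_seconds = 45      # avg. completion time, fully autonomous
hitl_seconds = 12 * 60       # avg. completion time, human-in-the-loop

savings = (autonomous_cost - hitl_cost) / autonomous_cost
slowdown = hitl_seconds / autonomous_seconds

print(f"Per-task cost reduction: {savings:.1%}")  # 66.2%
print(f"Latency multiplier: {slowdown:.0f}x")     # 16x
```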

Best Practices for Building Reliable Human-in-the-Loop AI Agents

Building a reliable Human-in-the-Loop AI Agent requires a fundamental shift in how you think about agent state and error handling. Here is what I learned:

State is everything: You cannot build a reliable HITL system without a robust persistence layer. If your agent's memory only lives in a Python list in memory, your system is a toy, not a tool.
Tool-based interrupts are cleaner: Don't try to parse the LLM's text for "I need help." Force the model to use a specific function. It makes the backend logic deterministic.
The "Resume" must be transparent: When you pass control back to the agent, give it the human's feedback clearly. I found that saying "The human operator rejected this because..." works much better than just saying "Error: Rejected."
Version your threads: Prevent race conditions early. AI agents are slow, and users are impatient. They will click buttons twice.
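
The point about transparent resumes can be sketched as a small formatter for the tool result. The dict shape follows the function_response structure used in the approve endpoint; the build_tool_result name, the extra approved key, and the exact wording are assumptions for illustration.

```python
from typing import Any


def build_tool_result(approved: bool, feedback: str) -> dict[str, Any]:
    """Package a human decision as the result of request_human_intervention."""
    note = feedback.strip()
    if approved:
        message = "The human operator approved this action."
        if note:
            message += f" Notes: {note}"
    elif note:
        # Spell out the reason so the model can adjust its next step.
        message = f"The human operator rejected this because: {note}"
    else:
        message = "The human operator rejected this action without giving a reason."
    return {
        "function_response": {
            "name": "request_human_intervention",
            "response": {"approved": approved, "output": message},
        }
    }
```

Keeping the boolean next to the sentence gives the backend something deterministic to branch on, while the model still gets a readable explanation.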

Additional Resources for AI Agent Development

Scaling AI Agent Logic with Gemini Tool Use - Essential for understanding how to structure the function calls that trigger these interrupts.
Optimizing FastAPI Dependency Injection - How I structured the backend to handle the database and AI client sessions mentioned here.

Moving forward, I'm looking into "Passive Oversight," where the agent doesn't stop, but sends a stream of its thoughts to a WebSocket for real-time monitoring, only pausing if I hit a physical "Emergency Stop" button in my dashboard. The goal is to reach a point where the human is a supervisor, not a bottleneck, but for now, the explicit "pause and resume" pattern is the only thing keeping my cloud bill under control.
