Scaling AI Agent Logic with Gemini Tool Use and Structured Outputs

Gemini Tool Use enables AI agents to interact with external systems through structured function calling instead of unreliable text prompts. By defining explicit schemas, developers can reduce token costs by up to 80% and all but eliminate hallucinated API calls in production environments.

Three weeks ago, I woke up to a series of PagerDuty alerts that every developer dreads. My AI-powered infrastructure agent, designed to help my team manage ephemeral staging environments, had entered a recursive logic loop. It was trying to "clean up" a non-existent Cloud Storage bucket, failing, and then attempting to "fix" the failure by retrying the same deletion logic with slightly different (and equally wrong) parameters. By the time I killed the Cloud Run service, the agent had burned through $420 in Gemini API tokens and logged 15,000 redundant requests in under two hours.

The culprit wasn't a bug in my Python code. It was a failure in how I was asking the LLM to make decisions. I was relying on "Chain of Thought" prompting within a massive 15,000-token system instruction, hoping the model would figure out the right sequence of actions. It didn't. Instead, it hallucinated API parameters and lost track of the state transition. This incident forced me to scrap my prompt-heavy architecture and move entirely to a tool-use pattern (also known as function calling). In this post, I’ll break down why structured tool use is the only way to build reliable agents and how I implemented it using the Gemini API and FastAPI.

Why Traditional Prompt Engineering Fails for Production AI Agents

Relying on long-context system prompts for agent logic leads to schema drift, context window bloat, and frequent hallucinations of non-existent capabilities. When I first started building agents, I fell into the trap of thinking that more context equaled more intelligence. I had a system prompt that looked like a legal document. It defined the available tools in text, gave examples of JSON schemas, and pleaded with the model to "be concise and only output valid JSON."

There are three massive problems with this approach that I learned the hard way:

  • Schema Drift: If you change a function signature in your backend but forget to update the text-based prompt, the agent will continue sending old parameters, causing silent failures or validation errors.
  • Context Window Bloat: Every time you add a new "capability" to your agent via the prompt, you increase the per-request token count. For an agent that handles 10-15 turns in a conversation, that overhead is paid again on every turn, so the cost multiplies with conversation length.
  • Hallucinated Capabilities: Without strict tool enforcement, models often "invent" tools that don't exist because the prompt implies they might be possible.

My failed infrastructure agent was a victim of all three. It thought it could call a force_delete_bucket function that I had never actually implemented, simply because the prompt mentioned "handling stubborn resources." To fix this, I had to stop treating the LLM as a text generator and start treating it as a reasoning engine that selects from a strictly defined registry of functions.
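To put the context-bloat point in numbers, here is a rough back-of-the-envelope sketch. Only the 15,000-token prompt size comes from my agent; the 1,000-token tool-definition size, the 12 turns, and the 300 tokens of history added per turn are made-up values for illustration, and the calculation ignores output tokens and any caching.

```python
def cumulative_prompt_tokens(fixed_tokens: int, turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a whole conversation.

    Every request resends the fixed prefix (system prompt or tool
    definitions) plus all the history accumulated so far.
    """
    total = 0
    for turn in range(1, turns + 1):
        history = (turn - 1) * tokens_per_turn
        total += fixed_tokens + history
    return total


# Hypothetical conversation shape: 12 turns, 300 new tokens of history per turn.
long_prompt_total = cumulative_prompt_tokens(fixed_tokens=15_000, turns=12, tokens_per_turn=300)
tool_defs_total = cumulative_prompt_tokens(fixed_tokens=1_000, turns=12, tokens_per_turn=300)

print(f"long prompt: {long_prompt_total:,} input tokens")
print(f"tool definitions: {tool_defs_total:,} input tokens")
```

The fixed prefix dominates the total because it is resent on every turn, so trimming it pays off once per request rather than once per conversation.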

How to Implement a Tool-Centric Architecture with Gemini

Transitioning to a tool-centric architecture involves moving capability definitions from the system prompt into the API's native function-calling schema. This shift allows the model to signal its intent to call a function with specific, typed arguments, rather than just spitting out a block of text that you then have to parse with regex.

In my current setup, I use the Gemini API (specifically gemini-1.5-pro) because its native handling of function calling is significantly more robust than the previous generation. It doesn't just suggest a tool; when the function_calling_config mode is set to ANY, every response is constrained to a call that matches one of the declared tool schemas.
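For reference, here is roughly what that configuration looks like. Treat it as a sketch based on my reading of the SDK's tool_config parameter rather than a definitive recipe: build_tool_config is a helper name I made up for this post, and the commented-out call at the bottom is assumed usage that you should check against the current SDK documentation.

```python
from typing import Iterable

# Function-calling modes as I understand them: AUTO lets the model choose
# between text and a tool call, ANY forces a tool call, NONE disables tools.
VALID_MODES = {"AUTO", "ANY", "NONE"}


def build_tool_config(mode: str, allowed: Iterable[str] = ()) -> dict:
    """Build a tool_config dictionary (hypothetical helper, not part of the SDK)."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown function-calling mode: {mode}")

    config = {"mode": mode}
    names = list(allowed)
    if names:
        # Restricting the callable functions is only meaningful in ANY mode.
        if mode != "ANY":
            raise ValueError("allowed function names require mode ANY")
        config["allowed_function_names"] = names
    return {"function_calling_config": config}


tool_config = build_tool_config("any", ["delete_gcs_bucket"])

# Assumed usage with the SDK (not executed here):
# response = model.generate_content(prompt, tool_config=tool_config)
```

One caveat: in ANY mode the model always answers with a function call, so an agent loop needs some other way to finish, such as switching back to AUTO once the required call has been made.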

Before I could deploy this new logic, I had to ensure my deployment pipeline was solid. I've previously written about how to build a resilient Cloud Build pipeline for Cloud Run, which was instrumental here. If your agent is going to be making destructive changes to your infrastructure, your CI/CD needs to be rock solid to prevent accidental deployments of "broken" logic.

Defining Tools with Pydantic and Gemini

One of the best patterns I've found is pairing each tool function with a Pydantic model that validates its arguments. The function's type hints and docstring describe the tool to Gemini, and the model rejects bad input before anything touches GCP, which gives me a single source of truth for validation. Here is a simplified version of the tool registry I'm using in my FastAPI backend:


from pydantic import BaseModel, Field
import google.generativeai as genai

# Pydantic model that validates the arguments the LLM supplies
class CloudStorageTool(BaseModel):
    """Arguments for deleting a Google Cloud Storage bucket."""
    bucket_name: str = Field(..., description="The unique name of the bucket to delete")
    force: bool = Field(default=False, description="Whether to delete objects inside the bucket first")

def delete_gcs_bucket(bucket_name: str, force: bool = False):
    """Deletes a Google Cloud Storage bucket.

    Args:
        bucket_name: The unique name of the bucket to delete.
        force: Whether to delete objects inside the bucket first.
    """
    # Validate before doing anything destructive; raises ValidationError on bad input
    args = CloudStorageTool(bucket_name=bucket_name, force=force)
    # Real logic to interact with GCP SDK goes here
    print(f"Deleting bucket: {args.bucket_name} with force={args.force}")
    return {"status": "success", "bucket": args.bucket_name}

# Registering the tool with Gemini
tools = [delete_gcs_bucket]
model = genai.GenerativeModel(
    model_name='gemini-1.5-pro',
    tools=tools
)

By passing the function directly into the tools list, the Gemini SDK inspects the docstrings and type hints to build the JSON schema automatically. This is much cleaner than manually writing JSON definitions. When the model decides to delete a bucket, it returns a part.function_call object instead of a string. This eliminates the need for json.loads() and the inevitable JSONDecodeError that follows when a model includes a stray backtick.
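To make the "inspects the docstrings and type hints" part less magical, here is a toy version of the idea using only the standard library. This is not the SDK's actual implementation; describe_function and the JSON_TYPES mapping are my own illustrative names, and the sketch only handles a handful of primitive types.

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python annotations to JSON-schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_function(fn) -> dict:
    """Derive a JSON-schema-style tool declaration from a function signature."""
    hints = get_type_hints(fn)
    properties = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unsupported types fall back to "string" in this sketch.
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        # Parameters without a default value are mandatory.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def delete_gcs_bucket(bucket_name: str, force: bool = False):
    """Deletes a Google Cloud Storage bucket."""
    return {"status": "success", "bucket": bucket_name}


schema = describe_function(delete_gcs_bucket)
print(schema)
```

Because the declaration is generated from the function itself, changing the signature changes the schema in the same commit, which is what removes the schema-drift problem from the prompt-based approach.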

Managing the Agentic Execution Loop for Maximum Stability

A stable agentic loop treats tool outputs as structured conversation history rather than raw text to prevent infinite recursion and logic errors. The real complexity of an agent isn't calling a tool once; it's the loop. The agent calls a tool, gets a result (perhaps an error), and then needs to decide what to do next. My previous "text-prompt" agent failed because it didn't understand that a 404 error meant it should stop trying. In the new tool-use paradigm, I treat the tool output as a first-class citizen in the conversation history.

Here is the core logic I use to handle the agentic loop. I've integrated this with my GCP monitoring and alerting pipeline to ensure that if the loop runs more than 5 times, an alert is triggered immediately.


async def run_agent_loop(user_input: str):
    chat = model.start_chat(history=[])
    # Use the async variant so the event loop is not blocked inside FastAPI
    response = await chat.send_message_async(user_input)

    # Limit the loop to prevent infinite recursion
    max_iterations = 5
    for _ in range(max_iterations):
        part = response.candidates[0].content.parts[0]
        if not part.function_call:
            # No tool call, the agent is done or asking a question
            return response.text

        # Extract function call details
        call = part.function_call
        function_name = call.name
        arguments = dict(call.args)

        # Execute the actual Python function
        # In production, use a dispatcher map for security
        if function_name == "delete_gcs_bucket":
            result = delete_gcs_bucket(**arguments)
        else:
            # Always define a result so an unknown tool is reported, not a NameError
            result = {"error": f"Unknown tool: {function_name}"}

        # Send the result back to the model to decide the next step
        response = await chat.send_message_async(
            genai.protos.Content(
                parts=[genai.protos.Part(
                    function_response=genai.protos.FunctionResponse(
                        name=function_name,
                        response=result
                    )
                )]
            )
        )

    raise RuntimeError("Agent exceeded maximum reasoning steps.")

This loop is significantly more stable. Because the model receives the function_response in a structured format, it understands that the tool has been executed and the result is "canonical" truth. It no longer hallucinates whether the action happened; it knows because the function_response part is in its own history.
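An iteration cap stops a runaway loop, but it doesn't tell you why the loop ran away. One extra guard worth considering is refusing to re-run a call that has already failed with identical arguments. The sketch below is a hypothetical addition, not part of the loop above: RepeatCallGuard is an illustrative name, and it assumes tool results are dictionaries that carry an "error" key on failure.

```python
import json


class RepeatCallGuard:
    """Remembers failed tool calls so the loop can refuse exact repeats."""

    def __init__(self):
        self._failed = set()

    @staticmethod
    def _fingerprint(name: str, args: dict) -> str:
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} identical.
        return name + ":" + json.dumps(args, sort_keys=True, default=str)

    def should_block(self, name: str, args: dict) -> bool:
        """True if this exact call has already failed once."""
        return self._fingerprint(name, args) in self._failed

    def record(self, name: str, args: dict, result: dict) -> None:
        """Store the call's fingerprint when the tool reported an error."""
        if "error" in result:
            self._failed.add(self._fingerprint(name, args))


guard = RepeatCallGuard()
call_args = {"bucket_name": "staging-tmp", "force": False}

first_attempt_blocked = guard.should_block("delete_gcs_bucket", call_args)
guard.record("delete_gcs_bucket", call_args, {"error": "404 bucket not found"})
second_attempt_blocked = guard.should_block("delete_gcs_bucket", call_args)
```

This only catches exact repeats. My original incident involved retries with slightly different parameters, which is why the hard iteration limit still has to stay in place.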

Analyzing Performance Gains and Cost Benchmarks

Migrating to Gemini Tool Use resulted in a 96% task success rate and a 78.5% reduction in average token consumption per task. After migrating to this pattern, I ran some benchmarks to see if the increased complexity of the code was worth it. I compared the "Long-Prompt Agent" (LP) with the "Tool-Use Agent" (TU) across 100 simulated infrastructure management tasks.

| Metric | Long-Prompt Agent (LP) | Tool-Use Agent (TU) | Improvement |
| --- | --- | --- | --- |
| Success Rate (Task Completion) | 64% | 96% | +50% (relative) |
| Avg. Tokens Per Task | 22,400 | 4,800 | -78.5% |
| Avg. Latency (Seconds) | 14.2 | 6.1 | -57% |
| Hallucination Rate | 12% | <1% | Significant |

The cost reduction was the most shocking part. By removing the massive "how-to" guides from the system prompt and instead providing concise tool definitions, I slashed my token usage by nearly 80%. Furthermore, because the model didn't have to "think" through a 15,000-token prompt every time it wanted to make a decision, the time-to-first-token dropped significantly.
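The improvement column is plain relative change, which you can reproduce from the rounded averages in the table. A quick sketch (relative_change is just an illustrative helper name):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before * 100


# (LP value, TU value) pairs taken from the benchmark table.
rows = {
    "success_rate": (64, 96),
    "avg_tokens": (22_400, 4_800),
    "avg_latency_s": (14.2, 6.1),
}

changes = {metric: round(relative_change(lp, tu), 1) for metric, (lp, tu) in rows.items()}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

Note that the success-rate figure is a relative gain; in absolute terms the jump is 32 percentage points. The token figure computed from the rounded averages lands within rounding distance of the 78.5% quoted above.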

The reliability boost comes from the fact that Gemini (and other modern models like Claude 3.5 or GPT-4o) are specifically fine-tuned for tool calling. They are trained to recognize when a query requires an external action and to output the exact schema required. For more details on how these models are optimized for this, I highly recommend checking out the Vertex AI Function Calling documentation.

Securing Your AI Agent Against Prompt Injection and Unauthorized Execution

Implementing a strict allow-list for function dispatching is critical to prevent AI agents from executing unauthorized or malicious code. One mistake I see junior devs making when implementing Gemini Tool Use is using eval() or a dynamic dispatcher that looks up functions by string name without an allow-list. This is a massive security vulnerability. If an attacker can manipulate the input to your agent, they might be able to trick the model into calling a function it wasn't intended to call, or worse, execute arbitrary code.

In my production setup, I use a strictly mapped dictionary for dispatching:


AVAILABLE_TOOLS = {
    "delete_gcs_bucket": delete_gcs_bucket,
    "list_instances": list_instances,
    "resize_cluster": resize_cluster
}

# Inside the loop
if function_name in AVAILABLE_TOOLS:
    result = AVAILABLE_TOOLS[function_name](**arguments)
else:
    result = {"error": f"Tool {function_name} is not authorized."}

This "Allow-List" pattern ensures that even if the model hallucinates a function name (which is rare with tool-calling but possible), the execution layer will block it. I also wrap every tool execution in a broad try-except block that returns the error message back to the model. This allows the model to "self-heal": if it sends a malformed argument, the Python error message tells it exactly what went wrong, and it can try again with the correct types.
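Putting the allow-list and the try-except together looks roughly like the sketch below. The safe_dispatch helper and the stand-in resize_cluster body are illustrative, not my production code; the real tools talk to GCP.

```python
from typing import Any, Callable, Dict


def safe_dispatch(tools: Dict[str, Callable[..., Any]], name: str, arguments: dict) -> dict:
    """Run an allow-listed tool and always return a dictionary for the model."""
    tool = tools.get(name)
    if tool is None:
        # Unknown names never reach eval() or getattr(); they are simply refused.
        return {"error": f"Tool {name} is not authorized."}
    try:
        return {"result": tool(**arguments)}
    except Exception as exc:
        # Feed the failure back so the model can correct its arguments.
        return {"error": f"{type(exc).__name__}: {exc}"}


def resize_cluster(node_count: int) -> dict:
    """Stand-in tool used only for this example."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {"status": "success", "node_count": node_count}


AVAILABLE_TOOLS = {"resize_cluster": resize_cluster}

ok = safe_dispatch(AVAILABLE_TOOLS, "resize_cluster", {"node_count": 3})
wrong_argument = safe_dispatch(AVAILABLE_TOOLS, "resize_cluster", {"nodes": 3})
not_allowed = safe_dispatch(AVAILABLE_TOOLS, "drop_database", {})
```

One design caution: raw exception messages can leak internal details such as paths or resource names, so it may be worth trimming or mapping them before they go back into the conversation.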

Key Takeaways for Building Reliable AI Agents

Building reliable agents requires separating stylistic prompting from core logic and enforcing typed interfaces for all external interactions. Transitioning from a prompt-heavy architecture to a tool-centric one wasn't just about saving money; it was about building a system that I could actually trust in production. Here are the core lessons I've integrated into my workflow:

  • Prompting is for style, Tools are for logic: Use the system prompt to define the agent's personality and high-level goals. Use tool definitions to define its capabilities and constraints.
  • Typed interfaces are non-negotiable: Use Pydantic or similar libraries to ensure that the data flowing between your LLM and your backend is valid. Never trust the LLM to maintain types on its own.
  • State management belongs in the code: Don't ask the LLM to remember the state of your infrastructure. Provide it with "getter" tools (e.g., get_current_status) so it can query the real world at each step of the loop.
  • Monitoring is your safety net: Even the best Gemini Tool Use implementation can get stuck. Implement hard limits on the number of tool calls allowed per session and alert when those limits are hit.
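As a sketch of the last point, a per-session budget can sit in front of the dispatcher and fire an alert hook the first time it is exhausted. ToolCallBudget and the on_exceeded callback are hypothetical names for illustration; in a real deployment the callback would forward to whatever alerting pipeline you already run.

```python
from typing import Callable, Optional


class ToolCallBudget:
    """Hard cap on tool calls per session, with a one-time alert hook."""

    def __init__(self, limit: int, on_exceeded: Optional[Callable[[str, int], None]] = None):
        self.limit = limit
        self.used = 0
        self._on_exceeded = on_exceeded
        self._alerted = False

    def spend(self, tool_name: str) -> bool:
        """Return True if the call may proceed, False once the budget is gone."""
        if self.used >= self.limit:
            # Alert only once per session to avoid paging on every refusal.
            if not self._alerted and self._on_exceeded is not None:
                self._on_exceeded(tool_name, self.used)
                self._alerted = True
            return False
        self.used += 1
        return True


alerts = []
budget = ToolCallBudget(limit=2, on_exceeded=lambda name, used: alerts.append((name, used)))

# Four attempted calls against a budget of two.
decisions = [budget.spend("delete_gcs_bucket") for _ in range(4)]
```

Keeping the counter in code rather than in the prompt means the limit holds even when the model ignores its instructions.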

The $420 bill I racked up was a painful lesson, but it forced me to move away from the "magic" of prompting and toward the engineering discipline of structured tool use. My agents are now faster, cheaper, and, most importantly, they no longer try to delete things that don't exist.

My next challenge is implementing multi-agent orchestration using this same tool-use pattern. I’m looking at splitting the infrastructure agent into smaller, specialized sub-agents—one for storage, one for compute, and a "manager" agent that coordinates them. This should further reduce the context window and make the system even more modular. I'll be documenting that transition and the inevitable hurdles I hit with cross-agent communication in a future post.
