Building a Multi-Modal AI Agent with Gemini API and Python
A multi-modal AI agent leverages models like Gemini 1.5 Pro to process video, audio, and text natively without the need for lossy pre-processing steps like OCR. By using the Gemini File API and Python, developers can automate complex tasks such as incident reporting and log analysis with high spatial accuracy and significantly reduced latency. This architecture allows for direct reasoning across diverse data types within a single context window.
Two months ago, I was tasked with building an automated "Incident Investigator" for our DevOps team. The goal was simple but ambitious: when a production outage occurred, the agent needed to ingest screen recordings of the failing UI, parse 50-page PDF architecture diagrams, and analyze exported JSON logs to provide a root-cause summary. My first attempt was a disaster. I tried using a patchwork of Tesseract for OCR, Whisper for audio, and a standard text-based LLM. The latency was over 45 seconds, and the coordinate mapping for UI elements was consistently wrong.
I realized I was fighting the medium instead of using a model built for it. I migrated the entire pipeline to the Gemini 1.5 Pro API using Python. The shift from "text-plus-plugins" to a truly multi-modal architecture reduced my codebase by 1,200 lines and cut my processing latency by 60%. However, it wasn't a silver bullet. I ran into significant issues with the Gemini File API's persistence limits and some unexpected token bloat that nearly tripled my initial budget. In this post, I am going to walk through the exact architecture I settled on, the code that makes it work, and the cost-optimization strategies I had to implement to keep the project viable.
Why Native Multi-Modal Architectures Outperform Traditional OCR Pipelines
Native multi-modal processing eliminates the data loss inherent in converting images or videos to text before analysis. In my previous setups, I treated images and videos as data that needed to be converted into text before the LLM saw it. This "pre-processing" approach is inherently lossy. If your OCR fails to recognize a small "Error 500" toast notification in a 4K screen recording, the LLM has zero chance of knowing it happened. Gemini 1.5 Pro changes this because it treats video frames and audio snippets as native tokens. It "sees" the video directly.
My multi-modal AI agent architecture now follows a three-stage process:
- Ingestion: Files are uploaded to the Gemini File API. This is crucial for large assets like videos or long PDFs, as passing them as inline base64 strings is both slow and prone to hitting request size limits.
- Contextual Mapping: I use a system instruction that defines the agent's persona and provides a "Table of Contents" for the uploaded files.
- Reasoning Loop: The agent uses function calling to query specific parts of the logs or to "re-watch" specific timestamps in the video when it finds a correlation.
How to Manage Large Assets Using the Gemini File API
Using the Gemini File API is essential for handling assets larger than a few megabytes to prevent SDK timeouts and request size errors. One of the first mistakes I made was trying to send 100MB video files directly in the generate_content call. The SDK would frequently hang or time out. I learned that for anything larger than a few megabytes, you must use the files utility. Here is the robust upload wrapper I wrote to handle state management and ensure the files are actually ready before the agent tries to process them.
```python
import google.generativeai as genai
import time
import os

# Initialize the SDK
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def upload_to_gemini(path, mime_type=None):
    """Uploads the given file to Gemini and waits for processing."""
    file = genai.upload_file(path, mime_type=mime_type)
    print(f"Uploaded file '{file.display_name}' as: {file.uri}")
    # Files aren't immediately available for reasoning;
    # we must poll the state until it becomes 'ACTIVE'.
    while file.state.name == "PROCESSING":
        print(".", end="", flush=True)
        time.sleep(2)
        file = genai.get_file(file.name)
    if file.state.name == "FAILED":
        raise ValueError(f"File {file.name} failed to process.")
    return file

# Example usage for an incident investigation
video_asset = upload_to_gemini("repro_steps.mp4", mime_type="video/mp4")
log_asset = upload_to_gemini("server_logs.pdf", mime_type="application/pdf")
```
Files uploaded via the genai.upload_file method are stored for 48 hours and then deleted automatically. I initially thought I could use this as a permanent storage layer for audit logs, which led to a series of 404 errors in production after the first weekend. If you need longer persistence, you have to manage the storage in GCS and re-upload to Gemini as needed.
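To avoid a repeat of that weekend of 404s, I now record when each asset was uploaded and check the age before every query. Here is a minimal sketch of that check; the 48-hour TTL matches the documented File API behavior, while the one-hour safety margin is my own choice:

```python
from datetime import datetime, timedelta, timezone

# Gemini File API uploads expire 48 hours after upload.
GEMINI_FILE_TTL = timedelta(hours=48)

def needs_reupload(uploaded_at, now=None, safety_margin=timedelta(hours=1)):
    """True if a stored file handle has expired or is about to expire."""
    now = now or datetime.now(timezone.utc)
    return now >= uploaded_at + GEMINI_FILE_TTL - safety_margin
```

Before each investigation, if this returns True, I pull the original asset from GCS and run it through upload_to_gemini again instead of trusting the stale handle.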
What Are the Best Multi-Modal Prompting Strategies for AI Agents?
Effective multi-modal prompting requires establishing a chronological anchor to synchronize timestamps across different media types like video and logs. Prompting a multi-modal model is different from prompting a text-only one. You have to guide the model's "attention" across different media types. In my incident agent, I found that providing a chronological anchor helps significantly. I tell the model that the video's timestamp 00:00 corresponds to the first entry in the PDF log file. Without this synchronization, the model would often hallucinate that an error in the logs caused a UI glitch that actually happened three minutes earlier.
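I generate that anchor line programmatically so every investigation uses identical wording. A small sketch; the exact phrasing is just what worked for me, not anything prescribed by the API:

```python
def chronological_anchor(first_log_entry: str) -> str:
    """Build the sync preamble that maps video time 00:00 to the log timeline."""
    return (
        "Timestamp 00:00 in the screen recording corresponds to the first "
        f"log entry: '{first_log_entry}'. Convert video timestamps to log "
        "timestamps using this anchor before asserting any causal link "
        "between a UI event and a log line."
    )
```

This string gets prepended to the text portion of every send_message call that mixes video and logs.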
I also encountered a massive issue with "Prompt Bloat." When you are sending multiple files, your token count skyrockets. I previously wrote about debugging LLM API cost spikes, and the lessons there applied directly here. Every video frame counts as a set number of tokens (usually around 258 tokens per second of video at 1fps for Gemini). If you aren't careful, a 5-minute video can eat up 80,000 tokens before you even type a single word of instructions.
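I now estimate the media token bill before attaching anything. The per-second rates below match the figures above (roughly 258 tokens per second of video at 1 fps, and about 32 per second of audio), but treat them as approximations and verify against the current docs:

```python
# Approximate per-second token rates for Gemini 1.5 media inputs
# (video sampled at 1 fps); verify against current docs before budgeting.
VIDEO_TOKENS_PER_SEC = 258
AUDIO_TOKENS_PER_SEC = 32

def estimate_media_tokens(video_secs=0, audio_secs=0):
    """Rough token footprint of media attachments, before any text prompt."""
    return video_secs * VIDEO_TOKENS_PER_SEC + audio_secs * AUDIO_TOKENS_PER_SEC

print(estimate_media_tokens(video_secs=300))  # a 5-minute video: 77400
```

If the estimate crosses a budget threshold, my agent trims the video to the relevant window before uploading rather than attaching the whole recording.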
Implementing the Reasoning Agent
Here is the core logic of the agent. I use tools (function calling) to allow the agent to fetch specific data points from our internal database if the uploaded files don't provide enough context.
```python
def get_deployment_metadata(timestamp: str):
    """Tool to fetch what version of the code was live at a specific time."""
    # Logic to query GitHub/GitLab API
    return {"version": "v2.4.1", "author": "dev-team-a", "commit": "8f2d3e"}

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    tools=[get_deployment_metadata],
    system_instruction=(
        "You are a Senior Site Reliability Engineer. "
        "Analyze the provided video and PDF logs to find the root cause of the outage. "
        "Correlate timestamps between the screen recording and the server logs. "
        "If you see a version mismatch, use the get_deployment_metadata tool."
    )
)

chat = model.start_chat(enable_automatic_function_calling=True)

# The request includes both the uploaded file references and the text prompt
response = chat.send_message([
    video_asset,
    log_asset,
    "Look at the video from 01:20 to 01:45. The user sees a spinning loader. "
    "Check the logs at that same time and tell me what the backend was doing."
])
print(response.text)
```
How to Reduce Multi-Modal AI Agent Costs with Context Caching
Context caching reduces API costs by up to 90% by storing heavy assets like system architecture PDFs so they are not re-tokenized for every query. After running this in staging for a week, my GCP bill showed a projected monthly cost of $1,200 just for this agent. The culprit was the repeated uploading and processing of the same "Context" files (like our 100-page system architecture PDF) for every single query. Every time a developer asked a follow-up question, the entire 100-page PDF was re-tokenized.
To solve this, I implemented Context Caching. Gemini allows you to cache a set of files and instructions so that you only pay for the initial tokenization once. Subsequent calls that use that cache are significantly cheaper and faster. This is a game-changer for a multi-modal AI agent where the "base" context (videos/heavy docs) stays the same while the user asks multiple questions.
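Here is the shape of the caching setup I landed on with the google-generativeai SDK. Note that caching requires pinning an explicit model version (the "-001" suffix); video_asset and log_asset are the File API handles from the upload step, and the two-hour TTL is my choice, not a default:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

# Cache the heavy, reusable assets once; pay full tokenization only here.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",  # caching needs a pinned version
    display_name="incident-context",
    system_instruction="You are a Senior Site Reliability Engineer.",
    contents=[video_asset, log_asset],
    ttl=datetime.timedelta(hours=2),  # storage is billed while the cache lives
)

# Follow-up questions reuse the cached tokens instead of re-tokenizing.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize what changed between 01:20 and 01:45.")
print(response.text)
```

The TTL is worth tuning: too short and you re-tokenize anyway, too long and the cache storage fee eats into the savings.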
For more on how I optimized these specific costs, you should check out my deep dive on reducing LLM API costs by 40% with dynamic prompt compression. While that post focuses on text, the principles of caching and pruning irrelevant context are even more vital when dealing with high-token multi-modal inputs.
Optimizing Large-Scale PDF Parsing and Audio Processing
Gemini treats PDF pages as images, making it superior for visual diagrams but requiring strategic document splitting for files exceeding 200 pages. One specific nuance I discovered with Gemini's PDF handling is that it treats each page as an image. If you have a PDF that is mostly text, this is actually less efficient than extracting the text yourself and sending it as a string. However, for engineering diagrams or cloud architecture exports, the native PDF vision is superior. I found that if my PDF was over 200 pages, the model's "long-context" recall would start to dip slightly. I solved this by splitting the PDF into logical sections (e.g., "Networking," "Database," "Application") and only attaching the relevant sections based on the initial user query.
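The routing itself is simple keyword matching in my setup. A sketch, with made-up section names and keywords for illustration; mine are derived from the PDF's bookmark tree:

```python
# Hypothetical mapping from a pre-split PDF section to its trigger keywords.
PDF_SECTIONS = {
    "networking.pdf": ["dns", "load balancer", "vpc", "timeout"],
    "database.pdf": ["sql", "replica", "deadlock", "migration"],
    "application.pdf": ["ui", "frontend", "deploy", "loader"],
}

def route_sections(query: str) -> list:
    """Pick which PDF slices to attach based on the user's question."""
    q = query.lower()
    hits = [doc for doc, kws in PDF_SECTIONS.items() if any(k in q for k in kws)]
    # Fall back to attaching everything if nothing matches.
    return hits or list(PDF_SECTIONS)
```

The matched sections are then uploaded (or fetched from the cache) instead of the monolithic 200-page document, which keeps recall sharp and the token count down.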
You can find the official limits and best practices for file types in the Gemini API Vision documentation. One tip: if you are processing audio, Gemini 1.5 can handle up to 9.5 hours in a single request, but the model's ability to pinpoint a specific sound (like a fan whirring or a notification ping) improves drastically if you provide a 16kHz mono stream rather than a high-bitrate stereo file.
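I normalize audio with ffmpeg before upload. Building the argument list in one place keeps it testable; this assumes a stock ffmpeg build on the PATH:

```python
import subprocess

def downmix_cmd(src: str, dst: str) -> list:
    """Build the ffmpeg arguments for a 16 kHz mono transcode."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000", "-y", dst]

def downmix(src: str, dst: str) -> None:
    """Run the transcode; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(downmix_cmd(src, dst), check=True)
```

For example, downmix("incident_call.m4a", "incident_call.wav") produces a file the model localizes sounds in far more reliably than the original stereo recording.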
What I Learned / Key Takeaways
- The File API state is asynchronous: Never assume a file is ready immediately after the upload call returns. Always implement a polling loop to check for the ACTIVE state.
- Timestamps are your best friend: When working with video and logs, explicitly tell the model to use timestamps to correlate events. It prevents the model from hallucinating a causal relationship between unrelated events.
- Context Caching is mandatory: For a multi-modal AI agent where users ask follow-up questions, failing to use context caching is essentially throwing money away. It reduced my per-query cost from $0.15 to $0.02 for the same set of assets.
- Mime-type matters: Don't let the SDK guess. Explicitly setting mime_type="application/pdf" or video/mp4 during upload reduced "Unsupported File Type" errors by 100% in my production logs.
- Native is better than OCR: Stop using Tesseract or other pre-processors if you can avoid it. The model's ability to understand the spatial relationship of elements (like where a button is relative to a header) is lost when you convert everything to raw text.
Related Reading
- Dynamic LLM Model Routing for API Cost Optimization - Useful for learning how to switch between Gemini Flash and Pro depending on the complexity of the video being analyzed.
- Debugging LLM API Cost Spikes - A detailed look at how I caught a recursive loop in my agent that cost me $200 in a single afternoon.
Moving forward, I am looking into fine-tuning the model specifically on our company's UI component library. While Gemini's general vision is excellent, it sometimes struggles with our custom internal iconography. If I can teach the model to recognize our specific "System Degraded" icons as distinct tokens, the accuracy of the incident reports should hit that 99% mark I'm aiming for. The next step is integrating this into our Slack bot so the multi-modal AI agent can join the incident channel automatically and start "watching" the screen shares in real-time.