From Cloud Bloat to Pi Power: How I Squeezed LLM Inference onto a Raspberry Pi for AutoBlogger
When I was building the posting service for AutoBlogger, my vision was clear: a self-contained, low-cost blog automation bot that could generate content, optimize it, and publish, all from a tiny, energy-efficient device. The Raspberry Pi was the obvious choice for the hardware. It’s affordable, widely available, and perfect for an 'always-on' edge application. What wasn’t so obvious, however, was how I was going to run a Large Language Model on it. I mean, we're talking about a device with limited RAM and a relatively modest CPU, trying to handle models typically associated with beefy GPUs and cloud clusters. This wasn't just a challenge; it was an obsession.
My initial thought process was perhaps a bit naive, or at least overly optimistic. I’d been working with larger models in the cloud for other parts of AutoBlogger – the research and topic generation components, for instance – where I could throw unlimited compute at the problem. I figured, "Hey, if I can just get a smaller model, it should be fine, right?" Oh, how wrong I was. The reality of edge inference, especially with LLMs, hit me like a ton of bricks made of floating-point operations.
The Problem: Latency, Memory, and My Raspberry Pi's Existential Crisis
The core functionality of the AutoBlogger posting service relies on an LLM for several critical tasks: generating initial draft paragraphs, rephrasing sentences for SEO optimization, summarizing external content, and even crafting engaging headlines. These aren't one-off batch jobs; they need to happen relatively quickly, often in sequence, to create a coherent blog post. My target latency for any single LLM call was under 10-15 seconds for generating a short paragraph (around 100 tokens), and ideally much faster for rephrasing single sentences.
My first attempts involved simply trying to load some of the smaller, publicly available models directly onto a Raspberry Pi 4 (which was my starting point before upgrading to a Pi 5). I tried a 3B parameter model, a distilled version of a larger one, that I had been using for some light summarization in a different context. I converted it to a standard PyTorch format and tried to load it with the Hugging Face transformers library. The result? A spectacular Out-Of-Memory (OOM) error, consistently. The Pi 4's 8GB of RAM simply wasn't enough to even load the model weights, let alone perform inference, especially considering the OS and other AutoBlogger services also needed memory. Even if it *could* load, the inference speed was abysmal, measured in minutes per token, not tokens per second. It was completely unusable.
I distinctly remember staring at the terminal, seeing the "Killed" message, and thinking, "Okay, this isn't just about picking a 'small' model. This is fundamentally different." The cost aspect also became apparent. While the Pi itself is cheap, if I couldn't run the LLM locally, I'd be forced into constant API calls to cloud LLMs, which would quickly rack up a monthly bill that negated the "low-cost" aspect of the Pi. This was about more than just performance; it was about the entire economic model of AutoBlogger.
The Breakthrough: Quantization, Pruning, and the Rise of the Tiny Titans
It became clear that I couldn't just throw standard models at the problem. I needed to fundamentally shrink them. This led me down the rabbit hole of model optimization techniques, specifically quantization and pruning. I had heard of these concepts before, mostly in academic papers or talks about specialized hardware, but now I had a real, pressing need to understand and implement them.
Quantization: Squeezing Data, Not Performance (Mostly)
Quantization is essentially the process of reducing the precision of the numbers (weights and activations) used in a neural network. Most models are trained using 32-bit floating-point numbers (FP32). This offers high precision but takes up a lot of memory and compute. Quantization reduces these to 16-bit floats (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4). The memory footprint shrinks dramatically, and integer operations are much faster and more power-efficient on typical CPUs like those found in the Raspberry Pi.
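To put numbers on that shrinkage, here's a back-of-the-envelope calculation (mine, not from any particular toolchain) for the weight memory of a 3B-parameter model at each precision; activations, KV cache, and runtime overhead come on top of this:

```python
# Rough memory footprint of a 3B-parameter model's weights at each precision.
# (Weights only -- activations, KV cache, and runtime overhead are extra.)
PARAMS = 3_000_000_000

bytes_per_weight = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_weight.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB")
# → FP32: ~11.2 GiB, FP16: ~5.6 GiB, INT8: ~2.8 GiB, INT4: ~1.4 GiB
```

The INT4 row is why a 3B model that OOMs at full precision can fit comfortably alongside the OS and the rest of AutoBlogger on an 8GB board.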
My first experiments with quantization were, again, a mixed bag. I tried a simple post-training quantization (PTQ) to INT8 using tools like the Hugging Face Optimum library, aiming for ONNX export. The model size did shrink, and I could finally *load* some of the smaller models (around 1B parameters) into memory on the Pi 5 (I had upgraded by then, primarily for the increased RAM and better CPU). However, the inference speed was still far from ideal, often taking 30-45 seconds for a 100-token generation. More critically, the output quality suffered significantly. Some models, when quantized too aggressively without proper calibration, produced outright gibberish or nonsensical responses. I burned through a surprising amount of AWS credit just converting models and testing different quantization schemes, only to find the output quality unacceptable.
This is where the concept of a calibration dataset became crucial. For PTQ, you need a small, representative dataset to run through the model, allowing the quantizer to determine optimal scaling factors for the weights and activations. Without it, the model's internal representation gets skewed, leading to accuracy degradation. I curated a small dataset of typical blog post snippets (around 500 examples) that AutoBlogger would encounter, and this significantly improved the quality of the INT8 quantized models. It was still a delicate balance, though.
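As a toy illustration of what calibration buys you, here's the core arithmetic of affine (uint8) post-training quantization: the calibration data fixes the scale and zero-point, and anything outside the observed range gets clipped. This is a hand-rolled sketch of the idea, not the Optimum API:

```python
def calibrate(values):
    """Derive uint8 affine quantization params from calibration data."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0          # map the observed range onto 256 levels
    zero_point = round(-lo / scale)    # the integer that represents real 0.0
    return scale, zero_point

def quantize(x, scale, zero_point):
    return max(0, min(255, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# The range seen during calibration determines how finely values are resolved.
scale, zp = calibrate([-1.0, -0.2, 0.0, 0.5, 1.5])
x = 0.37
roundtrip = dequantize(quantize(x, scale, zp), scale, zp)
print(abs(roundtrip - x) <= scale)  # → True: error within one quantization step
```

If the calibration set misses the real activation range (too narrow or too wide), the scale is wrong and errors blow well past one step, which is exactly the gibberish failure mode I kept hitting.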
The real game-changer for me was discovering llama.cpp and the GGUF format. This project, specifically designed for efficient inference of LLMs on CPUs, has been an absolute godsend for edge devices. GGUF models (the successor to llama.cpp's original GGML file format) are optimized for CPU inference and support various levels of quantization, including INT4 and even experimental 2-bit schemes. The community around llama.cpp is incredibly active, and many popular models are quickly converted to GGUF. This meant I didn't have to roll my own quantization pipeline from scratch for every model.
Here's a simplified look at my workflow for getting a model into a usable GGUF format for the Pi:
# On a more powerful machine (e.g., cloud instance or local workstation)
# 1. Download the PyTorch model (e.g., from Hugging Face)
# 2. Convert it to the original GGML format (before GGUF was standard) or directly to GGUF if available
# This step often involves a Python script provided by llama.cpp or the model's author.
# Example (conceptual, actual script depends on model):
# python convert.py model_path --outtype f16 --outfile model.f16.gguf
# 3. Quantize the GGUF model to a lower precision
# The 'quantize' tool is part of llama.cpp
# ./quantize ./model.f16.gguf ./model.q4_k_m.gguf q4_K_M
# (q4_K_M is a common 4-bit quantization method that offers a good balance of size and quality)
# 4. Transfer 'model.q4_k_m.gguf' to the Raspberry Pi
Running inference on the Pi with a GGUF model was then remarkably straightforward using llama.cpp's main executable:
# On the Raspberry Pi
# ./main -m ./model.q4_k_m.gguf -p "Write a blog post title about LLM optimization on edge devices." -n 128 --temp 0.7
This command line approach, while simple, allowed me to quickly test different quantization levels and models. I found that q4_K_M quantization often provided the best balance of speed and quality for models around 1.5B to 3B parameters on the Raspberry Pi 5. My latency for 100 tokens dropped from minutes to around 15-20 seconds for a 3B parameter model, which was a massive improvement and barely acceptable for my use case. For shorter prompts, it was much faster, often under 5 seconds.
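In throughput terms (using my rough midpoint numbers, not precise benchmarks), that improvement looks like this:

```python
# Converting the latencies above into tokens/second.
def tokens_per_sec(tokens, seconds):
    return tokens / seconds

# Unquantized baseline: roughly a minute per token
print(round(tokens_per_sec(100, 100 * 60), 3))  # → 0.017
# q4_K_M on the Pi 5: 100 tokens in ~17.5 s (midpoint of 15-20 s)
print(round(tokens_per_sec(100, 17.5), 1))      # → 5.7
```

A jump of over two orders of magnitude, from quantization and a better board alone, before any model-specific tuning.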
Pruning: Trimming the Fat
While quantization reduces numerical precision, pruning reduces density: it removes redundant weights or neurons from the network. Imagine a dense forest; pruning removes some trees to make it less dense but still functional. This results in a smaller model size and fewer computations, leading to faster inference. There are different types: unstructured pruning (removing individual weights) and structured pruning (removing entire neurons or channels). Structured pruning is generally preferred for hardware efficiency because it maintains the regular structure of the tensors, which is easier for CPUs to process.
I experimented with pruning on some custom fine-tuned models I developed for specific AutoBlogger tasks. For instance, I had a small model fine-tuned specifically for generating blog post outlines. After training, I applied magnitude-based pruning, where weights below a certain threshold are set to zero. Then, I retrained the model for a few epochs (fine-tuning) to recover any lost accuracy. This process is often called "pruning and re-training" or "sparse training." While more complex to implement than simple PTQ, it allowed me to achieve even smaller model sizes without significant accuracy drops, especially for very specific, narrow tasks.
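Conceptually, the magnitude-pruning step is just a thresholding pass. Here's a hand-rolled sketch on a toy weight list (real pipelines use PyTorch's torch.nn.utils.prune on tensors and then fine-tune to recover accuracy):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction of weights with the smallest absolute value.

    Toy version: ties at the threshold are all pruned.
    """
    k = int(len(weights) * sparsity)  # number of weights to drop
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.4, 0.09]
print(magnitude_prune(weights, sparsity=0.5))
# → [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, -0.4, 0.0]
```

The retraining step then lets the surviving weights compensate for the zeroed ones, which is why "prune and re-train" loses far less accuracy than pruning alone.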
One of the challenges with pruning is that it's often model-specific and requires a deeper understanding of the model's architecture. I used tools like PyTorch's built-in pruning utilities, but it required careful experimentation and validation to ensure the pruned model still performed its task effectively. The benefit, however, was a model that was not only smaller in storage but also computationally lighter, translating to even better inference times on the Pi.
Model Selection: The Tiny Titans Reign Supreme on the Edge
This entire journey reinforced a crucial point: the model itself matters immensely. You can optimize endlessly, but if you start with a behemoth, you'll always be fighting an uphill battle. My colleague, Jun, touched on this in his recent post, "The Tiny Titans: Why Small, Domain-Specific LLMs with Hybrid Architectures are Winning the Inference War in 2026." He perfectly articulates why small, domain-specific LLMs are not just a compromise for edge devices but often a superior solution. For AutoBlogger, this meant moving away from general-purpose LLMs for specific tasks.
Instead of trying to run a scaled-down version of a colossal model, I focused on genuinely small models designed for efficiency. Models like TinyLlama, Phi-2, and even some highly distilled versions of Llama 2 (around 1B-3B parameters) became my go-to candidates. More importantly, I started fine-tuning these "tiny titans" on AutoBlogger's specific data and tasks. For example, I fine-tuned a 1.5B parameter model on a dataset of blog post outlines and content snippets, training it specifically to generate outlines and expand on specific points. This domain-specific fine-tuning meant that even with fewer parameters, the model could perform its specialized task with surprisingly high quality, often outperforming a much larger, general-purpose model that hadn't seen similar data.
The "hybrid architectures" Jun mentioned also resonated. While llama.cpp handles the bulk of my GGUF inference, for certain very specific, highly optimized tasks (like sentence rephrasing with a tiny, distilled encoder-decoder model), I found success using ONNX Runtime. Exporting a model to ONNX, especially after quantization, allows for highly optimized inference on various backends, including CPU. For these specific, smaller models, the overhead of Python and ONNX Runtime was acceptable, and the performance gains from the ONNX graph optimizations were noticeable. It's not a one-size-fits-all solution; you really have to pick the right tool for the right model and task.
The Software Stack and Hardware Tweaks
Beyond the models themselves, the surrounding software and hardware environment played a critical role in eking out every last bit of performance.
Operating System & System-Level Optimizations:
- Raspberry Pi 5: The upgrade from the Pi 4 was a significant factor. Both of my boards had 8GB of RAM, but the Pi 5's faster CPU and improved memory bandwidth made a substantial difference in preventing OOM errors and improving overall inference speed.
- Swap Space: While generally considered a performance killer, a well-configured swap file was essential for stability. I configured a 4GB ZRAM swap (compressed RAM as swap) to mitigate some of the performance penalties of disk-based swap, and a 2GB physical swap on an external SSD for overflow. This prevented hard crashes when memory usage spiked during inference.
- Minimizing Background Processes: I run a headless install of Raspberry Pi OS Lite. I disabled any unnecessary services, cron jobs, and desktop environments to free up as much RAM and CPU cycles as possible for AutoBlogger.
- CPU Governor: I set the CPU governor to 'performance' mode to ensure the CPU always runs at its maximum frequency, rather than scaling down to save power. For an 'always-on' application like AutoBlogger, a slight increase in power consumption is a worthy trade-off for consistent performance.
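For reference, the governor and swap tweaks boil down to a few commands like these (the sysfs paths are standard Linux; zram-tools is one way to set up ZRAM on Raspberry Pi OS, so treat the exact package and config file as my setup rather than the only option):

```shell
# Pin all cores to the 'performance' governor (resets on reboot unless
# persisted via a systemd unit or similar).
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Verify the current governor on core 0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# ZRAM swap via zram-tools; size and priority live in /etc/default/zramswap
sudo apt install zram-tools

# Confirm swap devices and priorities (ZRAM should outrank disk swap)
swapon --show
```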
Python Environment & Libraries:
My Python environment is relatively lean on the Pi, focusing only on what's absolutely necessary for inference and orchestration.
# Essential Python libraries on the Raspberry Pi
# For orchestration and API interactions
pip install fastapi uvicorn requests
# For ONNX Runtime inference (if used for specific models)
pip install onnxruntime
# (llama.cpp is typically run as a standalone executable,
# but if I needed Python bindings, I'd use llama-cpp-python)
pip install llama-cpp-python # (Optional, for Pythonic interaction with GGUF)
For the core AutoBlogger services, I use FastAPI to expose endpoints for various LLM-powered tasks. A typical service might look something like this (simplified):
# llm_service.py (simplified)
import subprocess

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

LLAMA_CPP_PATH = "/home/pi/llama.cpp/main"
MODEL_PATH = "/home/pi/autoblogger_models/autoblogger-outline-q4_k_m.gguf"

class PromptRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate_outline/")
async def generate_outline(request: PromptRequest):
    command = [
        LLAMA_CPP_PATH,
        "-m", MODEL_PATH,
        "-p", request.prompt,
        "-n", str(request.max_tokens),
        "--temp", str(request.temperature),
        "--log-disable",  # Suppress verbose llama.cpp logging
        "-e",  # Process escape sequences (\n, \t) in the prompt
    ]
    try:
        # Execute the llama.cpp process and capture its output.
        # Note: for long generations, consider a non-blocking alternative
        # (e.g. asyncio.create_subprocess_exec) so the event loop isn't blocked.
        process = subprocess.run(
            command,
            capture_output=True,
            text=True,
            check=True,
        )
    except subprocess.CalledProcessError as e:
        print(f"llama.cpp error: {e.stderr}")
        raise HTTPException(status_code=500, detail=f"LLM inference failed: {e.stderr}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {e}")

    # llama.cpp echoes the prompt before the completion, so strip it off.
    # This is a crude heuristic; production parsing needs more robust logic.
    generated_text = process.stdout.split(request.prompt, 1)[-1].strip()
    return {"generated_text": generated_text}
# To run this service:
# uvicorn llm_service:app --host 0.0.0.0 --port 8000
This simple FastAPI service allows other components of AutoBlogger to request LLM generations without directly interacting with the llama.cpp executable. It abstracts away the complexity and provides a clean API. For models optimized with ONNX Runtime, the structure is similar, but instead of subprocess.run, I'd initialize an onnxruntime.InferenceSession and feed it pre-processed tensors.
What I Learned / The Challenge
This entire process was a masterclass in iterative optimization and the harsh realities of resource constraints. Here are my biggest takeaways and the challenges that truly tested my patience:
- Accuracy vs. Speed vs. Size is a Real Trade-off: There's no magic bullet. Every quantization level, every pruning decision, forces a compromise. My first attempts at INT4 quantization often led to models that were fast and small but produced completely unusable output. Finding the sweet spot for each specific task and model required extensive trial and error and a robust evaluation pipeline. I learned that what works for one model might completely break another.
- The Importance of Toolchain Compatibility: Working with various quantization tools, model formats (PyTorch, ONNX, GGUF), and different versions of libraries was a constant headache. A model converted with an older version of a script might not work with a newer llama.cpp build, or vice-versa. I spent countless hours debugging "unsupported format" or "segmentation fault" errors that often boiled down to subtle version mismatches. Keeping a strict virtual environment and documenting every conversion step became crucial.
- Memory Management is Paramount: On a device with limited RAM like the Raspberry Pi, every megabyte counts. I developed a habit of constantly monitoring memory usage (htop was my best friend) during inference. OOM errors were common, especially when trying to load models that were just slightly too large. This led me to aggressively prune unnecessary Python libraries and background processes.
- The Power of Community and Open Source: Projects like llama.cpp are truly revolutionary for edge AI. Without the dedicated developers and community constantly improving these tools, running LLMs on a Pi would still be a pipe dream for most of us. This project relies heavily on the innovation happening in the open-source world.
- Patience and Persistence Pay Off: There were moments I seriously considered abandoning the local LLM idea and just biting the bullet on cloud API costs. But the vision of a truly self-contained, low-cost AutoBlogger kept me going. Each small improvement, each successful inference run, was a huge morale boost.
My takeaway is that edge LLM inference on devices like the Raspberry Pi is not just possible, but increasingly practical, provided you're willing to deeply understand the constraints and apply targeted optimization techniques. It's not about running the biggest, flashiest model; it's about finding the smallest model that can do the job effectively and then squeezing every last drop of performance out of it.
Related Reading
If you're interested in the broader context of why small, specialized models are becoming so important, I highly recommend checking out The Tiny Titans: Why Small, Domain-Specific LLMs with Hybrid Architectures are Winning the Inference War in 2026. Jun's post perfectly frames the architectural shift that makes my Raspberry Pi optimizations even more relevant. It explains the strategic advantage of not chasing ever-larger models but instead focusing on efficiency and specialization.
Also, if you're looking for inspiration on what kind of AI and programming topics are gaining traction in 2026, my recent post, Tech Blog Topics 2026: Finding Unique AI & Programming Insights might give you some ideas. It's a meta-look at the trends influencing projects like AutoBlogger and highlights the very niche of edge AI that I'm tackling here.
Next, I plan to explore even further model distillation techniques and potentially integrate a small, dedicated NPU (Neural Processing Unit) accelerator if one becomes more readily available and cost-effective for the Raspberry Pi platform. The journey to a perfectly optimized AutoBlogger continues!
From Cloud Bloat to Pi Power: How I Squeezed LLM Inference onto a Raspberry Pi for AutoBlogger
When I was building the posting service for AutoBlogger, my vision was clear: a self-contained, low-cost blog automation bot that could generate content, optimize it, and publish, all from a tiny, energy-efficient device. The Raspberry Pi was the obvious choice for the hardware. It’s affordable, widely available, and perfect for an 'always-on' edge application. What wasn’t so obvious, however, was how I was going to run a Large Language Model on it. I mean, we're talking about a device with limited RAM and a relatively modest CPU, trying to handle models typically associated with beefy GPUs and cloud clusters. This wasn't just a challenge; it was an obsession.
My initial thought process was perhaps a bit naive, or at least overly optimistic. I’d been working with larger models in the cloud for other parts of AutoBlogger – the research and topic generation components, for instance – where I could throw unlimited compute at the problem. I figured, "Hey, if I can just get a smaller model, it should be fine, right?" Oh, how wrong I was. The reality of edge inference, especially with LLMs, hit me like a ton of bricks made of floating-point operations.
The Problem: Latency, Memory, and My Raspberry Pi's Existential Crisis
The core functionality of the AutoBlogger posting service relies on an LLM for several critical tasks: generating initial draft paragraphs, rephrasing sentences for SEO optimization, summarizing external content, and even crafting engaging headlines. These aren't one-off batch jobs; they need to happen relatively quickly, often in sequence, to create a coherent blog post. My target latency for any single LLM call was under 10-15 seconds for generating a short paragraph (around 100 tokens), and ideally much faster for rephrasing single sentences.
My first attempts involved simply trying to load some of the smaller, publicly available models directly onto a Raspberry Pi 4 (which was my starting point before upgrading to a Pi 5). I tried a 3B parameter model, a distilled version of a larger one, that I had been using for some light summarization in a different context. I converted it to a standard PyTorch format and tried to load it with the Hugging Face transformers library. The result? A spectacular Out-Of-Memory (OOM) error, consistently. The Pi 4's 8GB of RAM simply wasn't enough to even load the model weights, let alone perform inference, especially considering the OS and other AutoBlogger services also needed memory. Even if it *could* load, the inference speed was abysmal, measured in minutes per token, not tokens per second. It was completely unusable.
I distinctly remember staring at the terminal, seeing the "Killed" message, and thinking, "Okay, this isn't just about picking a 'small' model. This is fundamentally different." The cost aspect also became apparent. While the Pi itself is cheap, if I couldn't run the LLM locally, I'd be forced into constant API calls to cloud LLMs, which would quickly rack up a monthly bill that negated the "low-cost" aspect of the Pi. This was about more than just performance; it was about the entire economic model of AutoBlogger.
The Breakthrough: Quantization, Pruning, and the Rise of the Tiny Titans
It became clear that I couldn't just throw standard models at the problem. I needed to fundamentally shrink them. This led me down the rabbit hole of model optimization techniques, specifically quantization and pruning. I had heard of these concepts before, mostly in academic papers or talks about specialized hardware, but now I had a real, pressing need to understand and implement them.
Quantization: Squeezing Data, Not Performance (Mostly)
Quantization is essentially the process of reducing the precision of the numbers (weights and activations) used in a neural network. Most models are trained using 32-bit floating-point numbers (FP32). This offers high precision but takes up a lot of memory and compute. Quantization reduces these to 16-bit floats (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4). The memory footprint shrinks dramatically, and integer operations are much faster and more power-efficient on typical CPUs like those found in the Raspberry Pi.
My first experiments with quantization were, again, a mixed bag. I tried a simple post-training quantization (PTQ) to INT8 using tools like the Hugging Face Optimum library, aiming for ONNX export. The model size did shrink, and I could finally *load* some of the smaller models (around 1B parameters) into memory on the Pi 5 (I had upgraded by then, primarily for the increased RAM and better CPU). However, the inference speed was still far from ideal, often taking 30-45 seconds for a 100-token generation. More critically, the output quality suffered significantly. Some models, when quantized too aggressively without proper calibration, produced outright gibberish or nonsensical responses. I burned through a surprising amount of AWS credit just converting models and testing different quantization schemes, only to find the output quality unacceptable.
This is where the concept of a calibration dataset became crucial. For PTQ, you need a small, representative dataset to run through the model, allowing the quantizer to determine optimal scaling factors for the weights and activations. Without it, the model's internal representation gets skewed, leading to accuracy degradation. I curated a small dataset of typical blog post snippets (around 500 examples) that AutoBlogger would encounter, and this significantly improved the quality of the INT8 quantized models. It was still a delicate balance, though.
The real game-changer for me was discovering llama.cpp and the GGUF format. This project, specifically designed for efficient inference of LLMs on CPUs, has been an absolute godsend for edge devices. GGUF (GGML Unified Format) models are specifically optimized for CPU inference and support various levels of quantization, including INT4 and even experimental INT2. The community around llama.cpp is incredibly active, and many popular models are quickly converted to GGUF. This meant I didn't have to roll my own quantization pipeline from scratch for every model.
Here's a simplified look at my workflow for getting a model into a usable GGUF format for the Pi:
# On a more powerful machine (e.g., cloud instance or local workstation)
# 1. Download the PyTorch model (e.g., from Hugging Face)
# 2. Convert it to the original GGML format (before GGUF was standard) or directly to GGUF if available
# This step often involves a Python script provided by llama.cpp or the model's author.
# Example (conceptual, actual script depends on model):
# python convert.py model_path --outtype f16 --outfile model.f16.gguf
# 3. Quantize the GGUF model to a lower precision
# The 'quantize' tool is part of llama.cpp
# ./quantize ./model.f16.gguf ./model.q4_k_m.gguf q4_K_M
# (q4_K_M is a common 4-bit quantization method that offers a good balance of size and quality)
# 4. Transfer 'model.q4_k_m.gguf' to the Raspberry Pi
Running inference on the Pi with a GGUF model was then remarkably straightforward using the llama.cpp/main executable:
# On the Raspberry Pi
# ./main -m ./model.q4_k_m.gguf -p "Write a blog post title about LLM optimization on edge devices." -n 128 --temp 0.7
This command line approach, while simple, allowed me to quickly test different quantization levels and models. I found that q4_K_M quantization often provided the best balance of speed and quality for models around 1.5B to 3B parameters on the Raspberry Pi 5. My latency for 100 tokens dropped from minutes to around 15-20 seconds for a 3B parameter model, which was a massive improvement and barely acceptable for my use case. For shorter prompts, it was much faster, often under 5 seconds.
Pruning: Trimming the Fat
While quantization helps with precision, pruning helps with density. It involves removing redundant weights or neurons from the network. Imagine a dense forest; pruning removes some trees to make it less dense but still functional. This results in a smaller model size and fewer computations, leading to faster inference. There are different types: unstructured pruning (removing individual weights) and structured pruning (removing entire neurons or channels). Structured pruning is generally preferred for hardware efficiency because it maintains the regular structure of the tensors, which is easier for CPUs to process.
I experimented with pruning on some custom fine-tuned models I developed for specific AutoBlogger tasks. For instance, I had a small model fine-tuned specifically for generating blog post outlines. After training, I applied magnitude-based pruning, where weights below a certain threshold are set to zero. Then, I retrained the model for a few epochs (fine-tuning) to recover any lost accuracy. This process is often called "pruning and re-training" or "sparse training." While more complex to implement than simple PTQ, it allowed me to achieve even smaller model sizes without significant accuracy drops, especially for very specific, narrow tasks.
One of the challenges with pruning is that it's often model-specific and requires a deeper understanding of the model's architecture. I used tools like PyTorch's built-in pruning utilities, but it required careful experimentation and validation to ensure the pruned model still performed its task effectively. The benefit, however, was a model that was not only smaller in storage but also computationally lighter, translating to even better inference times on the Pi.
Model Selection: The Tiny Titans Reign Supreme on the Edge
This entire journey reinforced a crucial point: the model itself matters immensely. You can optimize endlessly, but if you start with a behemoth, you'll always be fighting an uphill battle. My colleague, Jun, touched on this in his recent post, "The Tiny Titans: Why Small, Domain-Specific LLMs with Hybrid Architectures are Winning the Inference War in 2026." He perfectly articulates why small, domain-specific LLMs are not just a compromise for edge devices but often a superior solution. For AutoBlogger, this meant moving away from general-purpose LLMs for specific tasks.
Instead of trying to run a scaled-down version of a colossal model, I focused on genuinely small models designed for efficiency. Models like TinyLlama, Phi-2, and even some highly distilled versions of Llama 2 (around 1B-3B parameters) became my go-to candidates. More importantly, I started fine-tuning these "tiny titans" on AutoBlogger's specific data and tasks. For example, I fine-tuned a 1.5B parameter model on a dataset of blog post outlines and content snippets, training it specifically to generate outlines and expand on specific points. This domain-specific fine-tuning meant that even with fewer parameters, the model could perform its specialized task with surprisingly high quality, often outperforming a much larger, general-purpose model that hadn't seen similar data.
The "hybrid architectures" Jun mentioned also resonated. While llama.cpp handles the bulk of my GGUF inference, for certain very specific, highly optimized tasks (like sentence rephrasing with a tiny, distilled encoder-decoder model), I found success using ONNX Runtime. Exporting a model to ONNX, especially after quantization, allows for highly optimized inference on various backends, including CPU. For these specific, smaller models, the overhead of Python and ONNX Runtime was acceptable, and the performance gains from the ONNX graph optimizations were noticeable. It's not a one-size-fits-all solution; you really have to pick the right tool for the right model and task.
The Software Stack and Hardware Tweaks
Beyond the models themselves, the surrounding software and hardware environment played a critical role in eking out every last bit of performance.
Operating System & System-Level Optimizations:
- Raspberry Pi 5: The upgrade from Pi 4 was a significant factor. The faster CPU, improved memory bandwidth, and crucially, the ability to get up to 8GB of RAM made a substantial difference in preventing OOM errors and improving overall inference speed.
- Swap Space: While generally considered a performance killer, a well-configured swap file was essential for stability. I configured a 4GB ZRAM swap (compressed RAM as swap) to mitigate some of the performance penalties of disk-based swap, and a 2GB physical swap on an external SSD for overflow. This prevented hard crashes when memory usage spiked during inference.
- Minimizing Background Processes: I run a headless install of Raspberry Pi OS Lite. I disabled any unnecessary services, cron jobs, and desktop environments to free up as much RAM and CPU cycles as possible for AutoBlogger.
- CPU Governor: I set the CPU governor to 'performance' mode to ensure the CPU always runs at its maximum frequency, rather than scaling down to save power. For an 'always-on' application like AutoBlogger, a slight increase in power consumption is a worthy trade-off for consistent performance.
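The governor and swap tuning above boils down to a few shell commands. This is a sketch of my Raspberry Pi OS setup; the `zram-tools` package, its `/etc/default/zramswap` config, and the `/mnt/ssd` mount point are assumptions you'd adjust for your own system.

```shell
# Pin the CPU governor to 'performance' on all cores
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# ZRAM swap via zram-tools (SIZE is in MiB in /etc/default/zramswap)
sudo apt install zram-tools
echo "SIZE=4096" | sudo tee -a /etc/default/zramswap
sudo systemctl restart zramswap

# 2GB overflow swapfile on the external SSD (assumed mounted at /mnt/ssd)
sudo fallocate -l 2G /mnt/ssd/swapfile
sudo chmod 600 /mnt/ssd/swapfile
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile
```

Note that the governor setting does not survive a reboot on its own; I persist it with a small systemd unit.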
Python Environment & Libraries:
My Python environment is relatively lean on the Pi, focusing only on what's absolutely necessary for inference and orchestration.
# Essential Python libraries on the Raspberry Pi
# For orchestration and API interactions
pip install fastapi uvicorn requests
# For ONNX Runtime inference (if used for specific models)
pip install onnxruntime
# (llama.cpp is typically run as a standalone executable,
# but if I needed Python bindings, I'd use llama-cpp-python)
pip install llama-cpp-python # (Optional, for Pythonic interaction with GGUF)
For the core AutoBlogger services, I use FastAPI to expose endpoints for various LLM-powered tasks. A typical service might look something like this (simplified):
# llm_service.py (simplified)
import subprocess

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

LLAMA_CPP_PATH = "/home/pi/llama.cpp/main"
MODEL_PATH = "/home/pi/autoblogger_models/autoblogger-outline-q4_k_m.gguf"

class PromptRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate_outline/")
async def generate_outline(request: PromptRequest):
    command = [
        LLAMA_CPP_PATH,
        "-m", MODEL_PATH,
        "-p", request.prompt,
        "-n", str(request.max_tokens),
        "--temp", str(request.temperature),
        "--log-disable",  # suppress verbose llama.cpp logging
        "-e",             # interpret escape sequences in the prompt
    ]
    try:
        # Run llama.cpp as a blocking subprocess. For long generations,
        # consider asyncio.create_subprocess_exec so the event loop
        # isn't blocked.
        process = subprocess.run(
            command,
            capture_output=True,
            text=True,
            check=True,
        )
    except subprocess.CalledProcessError as e:
        print(f"llama.cpp error: {e.stderr}")
        raise HTTPException(status_code=500, detail=f"LLM inference failed: {e.stderr}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {e}")

    # llama.cpp echoes the prompt before the completion, so strip
    # everything up to and including the prompt. This is a crude
    # heuristic; real parsing needs more robust logic.
    stdout = process.stdout
    if request.prompt in stdout:
        generated_text = stdout.split(request.prompt, 1)[1]
    else:
        generated_text = stdout
    return {"generated_text": generated_text.strip()}

# To run this service:
# uvicorn llm_service:app --host 0.0.0.0 --port 8000
This simple FastAPI service allows other components of AutoBlogger to request LLM generations without directly interacting with the llama.cpp executable. It abstracts away the complexity and provides a clean API. For models optimized with ONNX Runtime, the structure is similar, but instead of subprocess.run, I'd initialize an onnxruntime.InferenceSession and feed it pre-processed tensors.
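That prompt-echo stripping is the most fragile part of the handler, so I find it useful to factor it into a small pure function that can be unit-tested in isolation. The helper name `strip_prompt_echo` is mine, not part of any library:

```python
def strip_prompt_echo(stdout: str, prompt: str) -> str:
    """llama.cpp echoes the prompt before the completion, so drop
    everything up to and including its first occurrence. If the prompt
    isn't found (e.g. it contained escapes), return the output as-is."""
    if prompt in stdout:
        return stdout.split(prompt, 1)[1].strip()
    return stdout.strip()

echoed = "system info...\nWrite an outline about edge AI: 1. Intro\n2. Hardware"
print(strip_prompt_echo(echoed, "Write an outline about edge AI:"))
# 1. Intro
# 2. Hardware
```

Keeping the parsing out of the request handler also makes it trivial to swap in a more robust scheme later, such as a unique end-of-prompt sentinel token.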
What I Learned / The Challenge
This entire process was a masterclass in iterative optimization and the harsh realities of resource constraints. Here are my biggest takeaways and the challenges that truly tested my patience:
- Accuracy vs. Speed vs. Size is a Real Trade-off: There's no magic bullet. Every quantization level, every pruning decision, forces a compromise. My first attempts at INT4 quantization often led to models that were fast and small but produced completely unusable output. Finding the sweet spot for each specific task and model required extensive trial and error and a robust evaluation pipeline. I learned that what works for one model might completely break another.
- The Importance of Toolchain Compatibility: Working with various quantization tools, model formats (PyTorch, ONNX, GGUF), and different versions of libraries was a constant headache. A model converted with an older version of a script might not work with a newer llama.cpp build, or vice-versa. I spent countless hours debugging "unsupported format" or "segmentation fault" errors that often boiled down to subtle version mismatches. Keeping a strict virtual environment and documenting every conversion step became crucial.
- Memory Management is Paramount: On a device with limited RAM like the Raspberry Pi, every megabyte counts. I developed a habit of constantly monitoring memory usage (htop was my best friend) during inference. OOM errors were common, especially when trying to load models that were just slightly too large. This led me to aggressively prune unnecessary Python libraries and background processes.
- The Power of Community and Open Source: Projects like llama.cpp are truly revolutionary for edge AI. Without the dedicated developers and community constantly improving these tools, running LLMs on a Pi would still be a pipe dream for most of us. This project relies heavily on the innovation happening in the open-source world.
- Patience and Persistence Pay Off: There were moments I seriously considered abandoning the local LLM idea and just biting the bullet on cloud API costs. But the vision of a truly self-contained, low-cost AutoBlogger kept me going. Each small improvement, each successful inference run, was a huge morale boost.
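For reference, the quantization workflow behind those trade-offs is a two-stage llama.cpp pipeline. Treat this as a sketch: the exact script and binary names vary between llama.cpp versions, and the model directory paths are hypothetical.

```shell
# 1. Convert the fine-tuned Hugging Face checkpoint to a 16-bit GGUF
#    (convert.py in older llama.cpp checkouts, convert_hf_to_gguf.py in newer)
python llama.cpp/convert_hf_to_gguf.py ./autoblogger-outline-hf \
    --outfile autoblogger-outline-f16.gguf

# 2. Quantize down to 4-bit Q4_K_M (the tool is ./quantize in older builds,
#    ./llama-quantize in newer ones)
./llama.cpp/quantize autoblogger-outline-f16.gguf \
    autoblogger-outline-q4_k_m.gguf Q4_K_M
```

Re-running only step 2 with a different target (Q5_K_M, Q8_0, ...) from the same f16 GGUF is what made it cheap to sweep quantization levels during evaluation.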
My takeaway is that edge LLM inference on devices like the Raspberry Pi is not just possible, but increasingly practical, provided you're willing to deeply understand the constraints and apply targeted optimization techniques. It's not about running the biggest, flashiest model; it's about finding the smallest model that can do the job effectively and then squeezing every last drop of performance out of it.
Related Reading
If you're interested in the broader context of why small, specialized models are becoming so important, I highly recommend checking out The Tiny Titans: Why Small, Domain-Specific LLMs with Hybrid Architectures are Winning the Inference War in 2026. Jun's post perfectly frames the architectural shift that makes my Raspberry Pi optimizations even more relevant. It explains the strategic advantage of not chasing ever-larger models but instead focusing on efficiency and specialization.
Also, if you're looking for inspiration on what kind of AI and programming topics are gaining traction in 2026, my recent post, Tech Blog Topics 2026: Finding Unique AI & Programming Insights might give you some ideas. It's a meta-look at the trends influencing projects like AutoBlogger and highlights the very niche of edge AI that I'm tackling here.
Next, I plan to explore model distillation techniques further, and potentially integrate a small, dedicated NPU (Neural Processing Unit) accelerator if one becomes more readily available and cost-effective for the Raspberry Pi platform. The journey to a perfectly optimized AutoBlogger continues!