Running LLMs Locally with Ollama and llama.cpp: A Developer's Guide

Carlos Mendoza
13 min read
Quick take

Practical developer guide to running local LLMs: hardware, quantization, setup, APIs, and integrating models into workflows.

Running large language models (LLMs) locally is now a practical option for developers in 2026. Tools like Ollama and llama.cpp allow you to bypass cloud APIs, offering advantages like privacy, cost savings, offline availability, and full control over model usage. While local models may not match the performance of advanced cloud models like GPT-4, they’re excellent for tasks such as code completion, data extraction, and text summarization.

Key Takeaways:

  • Why Local? Keep data private, reduce costs (e.g., $200/month vs. $1,000+ for APIs), and avoid vendor lock-in.
  • Hardware Needs: VRAM is critical. Options range from Apple Silicon Macs (16GB) for 7B models to GPUs like RTX 4090 (24GB) for 30B+ models.
  • Setup Tools: Ollama simplifies installation and offers an OpenAI-compatible API, while llama.cpp provides advanced performance tuning.
  • Performance: Local models using quantization can run efficiently on consumer hardware, delivering 30–95 tokens/second depending on setup.

If you’re handling sensitive data or running frequent queries, local LLMs can save money and provide more control. For occasional use or complex tasks, cloud solutions may still be better.

Hardware Requirements for Running LLMs Locally

VRAM is the key factor when running large language models (LLMs) locally. As the SitePoint team explains:

"The single most important number for local LLM inference is available VRAM."

While other components like CPU speed, storage, and system RAM are important, they take a backseat to GPU memory. Let’s break down the minimum hardware needed for different model sizes.

Minimum Hardware You Actually Need

  • Apple Silicon Mac (16GB unified memory): A great starting point for developers without extra costs. Thanks to its unified memory architecture, the GPU can use most of the 16GB as VRAM (macOS keeps a share for the system), enabling it to run 7B–8B models at speeds of 30–50 tokens per second. This is fast enough for tasks like coding assistance.
  • RTX 3060 (12GB VRAM) or RTX 4060 (8GB VRAM): These consumer GPUs are ideal for running quantized 7B–8B models entirely in VRAM, delivering similar speeds of 30–50 tokens per second.
  • RTX 4060 Ti (16GB VRAM): Perfect for handling 13B–14B models. For example, it can run a Q4_K_M-quantized 13B model fully in VRAM.
  • RTX 4090 (24GB VRAM): A powerhouse for 30B+ models. It can run a Llama 3.1 8B model at 95 tokens per second with GPU utilization between 92–96%.

For CPU-only inference, expect slower speeds - around 3–8 tokens per second for 7B models. While this is manageable for batch tasks, it’s too slow for interactive use. At a minimum, you’ll need a CPU with AVX2 support and 16–32GB of system RAM.

RAM, VRAM, and Quantization Tradeoffs

Quantization helps reduce memory requirements by lowering model precision, making it possible to run larger models on less hardware. For instance, a 7B model in full FP16 precision needs about 14GB of VRAM, but with Q4_K_M quantization, that drops to just 4–5GB - a reduction of around 70%. This tradeoff is crucial when planning your hardware setup.

Here’s a quick guide to VRAM usage and quality across quantization levels:

| Quantization | Quality | VRAM Usage vs. FP16 |
| --- | --- | --- |
| Q8_0 | Near-lossless | ~50% reduction |
| Q5_K_M | High | ~60% reduction |
| Q4_K_M | Recommended default | ~70% reduction |
| Q3_K_S | Noticeable degradation | ~75% reduction |
| Q2_K | Significant loss | ~80% reduction |

Start with Q4_K_M as your default. If you notice quality issues, move up to Q5 or Q8. Keep at least 1–2GB of VRAM free for the KV cache, as filling up your VRAM entirely can cause crashes or slowdowns during longer interactions.
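
As a rough check on these numbers, weight memory scales with parameter count times bits per weight. The sketch below assumes approximate bits-per-weight figures (16 for FP16, about 8.5 for Q8_0, about 4.85 for Q4_K_M); these are illustrative approximations rather than exact values reported by any runtime, and the helper names are hypothetical. It covers model weights only, not the KV cache.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# The bits-per-weight values are approximations (assumptions), and the
# result excludes the KV cache and runtime overhead.
APPROX_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8.0


def reduction_vs_fp16(quant: str) -> float:
    """Fractional memory saving relative to FP16."""
    return 1.0 - APPROX_BITS_PER_WEIGHT[quant] / APPROX_BITS_PER_WEIGHT["FP16"]


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weights_gb(7, quant):.1f} GB "
          f"({reduction_vs_fp16(quant):.0%} smaller than FP16)")
```

Under these assumptions a 7B model comes out at 14 GB for FP16 and a little over 4 GB for Q4_K_M, which lines up with the figures quoted above.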

If your model exceeds the available VRAM, tools like Ollama and llama.cpp can offload layers to system RAM. However, this comes at the cost of reduced throughput. For hybrid setups, 32GB of system RAM is a practical minimum.

Hardware Cost Estimates in USD

| Tier | Hardware | Approx. Cost | Best For |
| --- | --- | --- | --- |
| Budget | Existing Apple Silicon Mac (16GB+) | $0 (existing hardware) | 7B–8B models, daily development tasks |
| Mid-Range | RTX 4060 Ti (16GB VRAM) | ~$400 | 13B–14B models, full VRAM fit |
| Production | RTX 4090 (24GB VRAM) | ~$1,200–$2,000 | 30B+ models, high throughput |
| Advanced | 2× RTX 4090 or NVIDIA A6000 (48GB) | $5,000+ | Full 70B inference, enterprise use |

Higher-end hardware lets you run larger models at higher-precision quantization levels. Treat these cost estimates as a reference point when choosing hardware that fits your needs.

On the bright side, electricity costs are minimal. Running a $2,000 GPU continuously costs about $50 per month in power, which is far cheaper than relying on cloud API access for moderate workloads.

"At 10,000+ queries per day with substantial context windows, local hosting pays for itself within 3–6 months." - Isha Maggu, CodeWords

For most developers, the RTX 4060 Ti (16GB) at ~$400 offers an excellent balance of price and performance. And if you already have an Apple Silicon Mac, you can start experimenting without any additional investment.
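
The cost reasoning in this section can be sketched as simple arithmetic. Every input below is an illustrative assumption (a 450 W average draw, $0.15 per kWh, a $2,000 card, and $500 per month of avoided cloud spend), not a measurement, and the function names are hypothetical.

```python
# Back-of-the-envelope running cost and payback period.
# All inputs below are illustrative assumptions, not measurements.
HOURS_PER_MONTH = 24 * 30


def monthly_power_cost(avg_watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of running continuously for a 30-day month."""
    kwh = avg_watts / 1000.0 * HOURS_PER_MONTH
    return kwh * usd_per_kwh


def payback_months(hardware_usd: float, cloud_usd_per_month: float,
                   power_usd_per_month: float) -> float:
    """Months until the hardware cost is recovered by avoided cloud spend."""
    monthly_saving = cloud_usd_per_month - power_usd_per_month
    if monthly_saving <= 0:
        return float("inf")  # local hosting never pays for itself
    return hardware_usd / monthly_saving


power = monthly_power_cost(avg_watts=450, usd_per_kwh=0.15)
print(f"Power: ~${power:.2f}/month")
print(f"Payback: ~{payback_months(2000, 500, power):.1f} months")
```

Swap in your own power draw, electricity rate, and current API bill; the payback period is very sensitive to how much cloud spend the local setup actually replaces.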

Getting Started with Ollama

Once your hardware is set up, you're ready to install and run Ollama for a fast, local LLM experience. Ollama simplifies the process by bundling llama.cpp into a single binary that automatically detects your GPU.

Installing Ollama

Installation is quick and works across all major platforms:

  • macOS (Homebrew): brew install ollama
  • Linux (official script): curl -fsSL https://ollama.com/install.sh | sh
  • Windows (Winget): winget install Ollama.Ollama
  • Docker (official image): docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

If you're on Windows and prefer a graphical installer, you can download it directly from ollama.com/download. Once installed, the Ollama daemon automatically starts on localhost:11434. This setup highlights the balance of cost-efficiency and control Ollama offers.

Running Your First Model

Getting started with a model requires just two steps. First, download the Llama 3.1 8B model with Q4_K_M quantization - a great starting point for most development tasks. Then, launch an interactive chat session:

ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M

Now you're running local inference. Use ollama list to view all downloaded models and ollama ps to check which ones are loaded in memory.

For more customization, you can use a Modelfile to define settings like the system prompt, temperature, and context window size. The default context window is small (2,048 tokens in older Ollama releases; newer releases may default to a larger value), so raise it explicitly for tasks like coding.

FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
SYSTEM "You are a helpful coding assistant."

Save this as Modelfile, then build and run it with:

ollama create mydev -f Modelfile && ollama run mydev

Exposing OpenAI-Compatible API Endpoints

Once Ollama is up and running, it provides an OpenAI-compatible REST API. Key endpoints include /v1/chat/completions, /v1/completions, and /v1/embeddings, all accessible on port 11434.

To use this with the OpenAI Python SDK, update the base_url to point to your local instance:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder  -  not validated locally
)

response = client.chat.completions.create(
    model="llama3.1:8b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain async/await in Python."}],
)
print(response.choices[0].message.content)

Two key points to remember:

  • By default, Ollama binds to 127.0.0.1. If you need to expose the API to your local network, set OLLAMA_HOST=0.0.0.0:11434 in your environment. Make sure to secure access with an authenticated reverse proxy.
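  • For frequent API calls, set keep_alive (for example "30m") on requests to Ollama's native API, or set the OLLAMA_KEEP_ALIVE environment variable. By default, Ollama unloads a model after about five minutes of inactivity, so the next request pays a cold-start delay.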
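
A minimal sketch of setting keep_alive on a request, assuming Ollama's native /api/chat endpoint and only the Python standard library. The helper names are hypothetical, and the options values simply mirror the Modelfile example earlier in this section.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint


def build_chat_request(model: str, prompt: str,
                       keep_alive: str = "30m") -> urllib.request.Request:
    """Build (but do not send) a native-API chat request with keep_alive set."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,           # return one JSON object instead of a stream
        "keep_alive": keep_alive,  # keep the model loaded after this request
        "options": {"num_ctx": 8192, "temperature": 0.2},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request; requires a running Ollama server."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)["message"]["content"]
```

Calling chat("llama3.1:8b-instruct-q4_K_M", "Explain async/await in Python.") against a running server should return the reply text while keeping the model resident for the next call.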

With Ollama's local API ready to go, you can dive into deeper integrations and advanced features using tools like llama.cpp.

Using llama.cpp for Advanced Control

[Figure: Ollama vs llama.cpp: Local LLM Tools Compared]

Ollama operates on top of llama.cpp, but for those who want more control, llama.cpp itself lets you tune performance and make detailed adjustments. It’s ideal when you need to tweak compilation flags, manage GPU layers precisely, or explore the latest quantization techniques.

Installing llama.cpp

For macOS users, the fastest way to get started is through Homebrew: brew install llama.cpp. On Linux and Windows, you can download prebuilt binaries from the GitHub releases page. These binaries are sufficient for CPU-based inference, but building from source unlocks optimizations tailored to your specific hardware.

To build from source, clone the repository and configure CMake with the appropriate hardware acceleration flags:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON      # For NVIDIA GPUs
# cmake -B build -DGGML_METAL=ON   # For Apple Silicon
# cmake -B build -DGGML_HIPBLAS=ON # For AMD ROCm (newer releases use -DGGML_HIP=ON)
cmake --build build --config Release

The compiled binary is under 90MB and doesn’t rely on external dependencies. Once you’ve built it, you’re ready to run GGUF models and take advantage of llama.cpp’s advanced features.

Running GGUF Models with llama.cpp

llama.cpp uses the GGUF format, which consolidates model weights, tokenizer, and metadata into a single file. To get started, download a GGUF model from Hugging Face (look for Q4_K_M variants), and then launch the server:

./build/bin/llama-server \
  --model ./models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --n-gpu-layers 35 \
  --ctx-size 8192 \
  --threads 8 \
  --port 8080

Here’s a breakdown of the key flags:

  • --n-gpu-layers (or -ngl): Specifies the number of model layers to offload to GPU memory. For example, with a 7B model on an 8GB GPU, start with 35, which is enough to cover every layer of a model that size, and lower it if you run out of VRAM. Leave around 500MB of headroom to avoid KV cache crashes.
  • --ctx-size: Controls the context window size. Larger windows need more VRAM, so reducing from 32K to 8K can save 2–4 GB. Always set this explicitly - if not, runtimes might silently drop older tokens without notifying you.
  • --threads: Match this to the number of physical CPU cores on your machine, not the logical cores.
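
To pick a starting value for --n-gpu-layers when the model does not fully fit, one rough heuristic is to assume the weights are spread evenly across layers and see how many fit in the VRAM left after headroom. This is a simplifying assumption, not how llama.cpp itself allocates memory; it ignores the KV cache (which grows with --ctx-size), and the function name and inputs are hypothetical.

```python
import math


def estimate_gpu_layers(model_file_gb: float, total_layers: int,
                        vram_gb: float, headroom_gb: float = 0.5) -> int:
    """Heuristic: assume weights are spread evenly across layers."""
    per_layer_gb = model_file_gb / total_layers
    usable_gb = max(vram_gb - headroom_gb, 0.0)
    return min(total_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical inputs: a ~4.9 GB GGUF file with 32 layers.
print(estimate_gpu_layers(4.9, 32, vram_gb=8.0))  # all 32 layers fit
print(estimate_gpu_layers(4.9, 32, vram_gb=4.0))  # partial offload: 22 layers
```

Treat the result as a first guess, then watch actual VRAM usage and adjust the flag from there.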

Once configured, the server provides an OpenAI-compatible API at http://localhost:8080/v1.
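
A minimal sketch of calling that endpoint with only the Python standard library. The model name is a placeholder (assumed to be ignored when the server hosts a single model), and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 above


def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        # Placeholder name: assumed to be ignored when the server hosts one model.
        "model": "local-gguf",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Pull the assistant reply out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send the request; requires llama-server to be running."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return extract_text(json.load(resp))
```

Because the request and response shapes follow the OpenAI format, the same code should work against Ollama by changing BASE_URL to its port and using a real model tag.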

Ollama vs. llama.cpp: Feature Comparison

The table below highlights the key differences between Ollama and llama.cpp, helping you decide which tool suits your needs. While both aim to run LLMs locally, they differ in terms of ease of use and customization.

| Feature | Ollama | llama.cpp |
| --- | --- | --- |
| Setup complexity | Single command | Requires build/CLI knowledge |
| Model management | Automated (ollama pull) | Manual (download GGUF from Hugging Face) |
| Performance tuning | Limited (via Modelfiles) | Extensive (layer/thread control) |
| API port | 11434 | 8080 |
| OS support | macOS, Linux, Windows | macOS, Linux, Windows, RISC-V, Android |
| Best for | Integration with workflows | Performance benchmarking & custom hardware |

When running benchmarks for Llama 3.1 8B at Q4_K_M, llama.cpp typically delivers 60–70 tokens per second, compared to Ollama’s 55–65 tokens per second. This slight edge comes from reduced overhead. On high-end hardware like an RTX 4090, llama.cpp can achieve 95.51 tokens per second with GPU utilization between 92–96%. For most users, the difference may not be noticeable in everyday tasks, but if you’re optimizing for older or less common hardware, llama.cpp’s detailed controls can make a big difference.

Integrating Local LLMs into Your Dev Workflow

Once you've set up your local server and tuned it for your hardware, integrating local LLMs into your development workflow can greatly enhance your day-to-day tasks. Whether you're using Ollama on port 11434 or llama.cpp on port 8080, the next step is connecting these tools to your development environment.

Connecting Your IDE to a Local LLM

Both Ollama and llama.cpp provide OpenAI-compatible REST APIs. This means that any IDE extension designed for cloud-based LLMs can be easily redirected to your local endpoint. For instance, if you're using VS Code, the Continue.dev extension is a popular choice. To configure it, navigate to its config.json file and set the apiBase to http://localhost:11434/v1. This setup enables inline completions and chat features powered by your local LLM.

JetBrains IDEs can be configured in a similar way by using plugins like Continue.dev or Tabby. Additionally, Cursor supports a "local mode" that lets you replace its default endpoint with your own server URL. Once your IDE is connected, you can take things further by setting up a local Retrieval-Augmented Generation (RAG) pipeline tailored specifically to your codebase.

Building a Local RAG Pipeline

Creating a local RAG pipeline for your codebase typically involves a few key components: a document loader (e.g., PyPDF or a basic file reader), a text splitter like RecursiveCharacterTextSplitter, an embedding model such as nomic-embed-text (768 dimensions), a local vector store like ChromaDB or SQLite-vec, and your generative LLM.

To avoid losing data during processing, always set the num_ctx parameter explicitly. A value of 8192 tokens is a safe starting point. Here's why this matters:

"When input exceeds the context window, the runtime silently drops the oldest tokens, typically the earliest messages in a conversation, with no error or warning returned to the caller." - SitePoint Team

Another tip: batch your embedding calls. This can significantly reduce overall processing time, often by a factor of 3–5×. In terms of hardware, you'll need at least 16 GB of VRAM or unified memory to run both the embedding model and the generative LLM simultaneously. These optimizations not only improve efficiency but also set the stage for better performance, which we’ll explore next.
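
The retrieval half of such a pipeline can be sketched in plain Python. This is a simplified illustration under stated assumptions: the splitter is a fixed-size character chunker standing in for RecursiveCharacterTextSplitter, the vector store is an in-memory list rather than ChromaDB or SQLite-vec, and embed is a hypothetical callable you would back with your local embedding model (for example nomic-embed-text).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Takes a batch of texts and returns one vector per text (hypothetical interface).
EmbedFn = Callable[[List[str]], List[Vector]]


def split_text(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Fixed-size character chunks with overlap (a simple stand-in splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed_in_batches(texts: Sequence[str], embed: EmbedFn,
                     batch_size: int = 32) -> List[Vector]:
    """Call the embedding function once per batch instead of once per text."""
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed(list(texts[start:start + batch_size])))
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, chunk_vecs: List[Vector], chunks: List[str],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), chunk)
              for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The retrieved chunks would then be placed into the prompt sent to the generative model, with num_ctx set high enough to hold both the chunks and the question.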

Performance Benchmarks and Tradeoffs

One of the key benefits of local inference is the elimination of network latency. Cloud APIs often introduce a delay of 200–500 ms before generating the first token. On an RTX 4060 Ti with 16 GB, an 8B model using Q4_K_M quantization can generate around 55–70 tokens per second. For teams needing multi-user support, vLLM uses PagedAttention to handle up to 793 tokens per second in benchmarks, far surpassing Ollama’s 41 tokens per second under similar conditions.

That said, local models in the 8B–70B range might not match the complex reasoning capabilities of advanced cloud models like GPT-4o or Claude 3.5 Sonnet. For everyday tasks like code completion, docstring creation, or local RAG queries, this difference is usually acceptable. However, cloud solutions may still be better for tasks requiring deep reasoning or extensive world knowledge:

"The decision breaks down along three axes. Latency sensitivity favors local... Data sensitivity favors local... Budget favors cloud when usage is sporadic." - SitePoint Team

Lastly, consider the costs of running a local setup. A $5,000 GPU system operating 24/7 could incur daily electricity costs of about $0.50. If your workload exceeds 10,000 queries per day, the system could pay for itself within 3–6 months. For lower usage, however, cloud APIs might be more budget-friendly in the short term.

Conclusion: Picking the Right Approach for Local LLMs

Key Takeaways

By mid-2026, models in the 3B–8B parameter range running on standard hardware will be capable of handling most development tasks - without relying on the cloud. Just two years ago, achieving this level of quality required models with over 30B parameters. This shift, supported by advancements in hardware benchmarks and API integrations, highlights the growing practicality of local LLM deployment.

When deciding between local and cloud-based solutions, consider factors like data sensitivity, query volume, and task complexity. Local deployment works best for scenarios involving sensitive data or high query volumes. On the other hand, cloud-based solutions are better suited for tasks requiring advanced reasoning or when usage is too sporadic to justify investing in dedicated hardware.

A quote from Ilir Ivezaj, a technology executive, perfectly sums up this balance:

"Local models are smaller and less capable than cloud APIs (Claude Opus, GPT-4). But for code completion, data extraction, classification, and drafting - a 35B parameter model running locally is more than sufficient for 80% of development tasks." - Ilir Ivezaj, Technology Executive

Adopting a hybrid strategy - leveraging local models for day-to-day tasks while reserving cloud APIs for specialized needs - can help maintain cost efficiency and ensure data security.

Additionally, compliance requirements like the EU AI Act, which imposes strict regulations on high-risk systems, make local deployment increasingly critical for industries under heavy regulation.

Stay Current with daily.dev

The local AI landscape is evolving rapidly. New model releases, runtime optimizations like speculative decoding, and community-tested setups are constantly reshaping the field. daily.dev helps developers stay ahead by curating tutorials, benchmarks, and discussions that matter most. It’s a go-to resource for millions of developers navigating the ever-changing world of local AI infrastructure.

FAQs

How do I choose the right model size for my GPU VRAM?

To choose the right model size, consider how much VRAM you have, as this will be the main factor limiting local inference. With Q4_K_M quantization, you’ll need roughly 0.6–0.7 GB of VRAM per billion parameters. Here’s a quick breakdown:

  • 7–8B model: Requires 4–6 GB of VRAM
  • 13B model: Needs 8–10 GB of VRAM

Make sure to reserve 1–2 GB of VRAM for the KV cache to prevent crashes. If your VRAM falls short, you can use GPU+CPU offloading. Just keep in mind, this will impact performance.
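
This rule of thumb is easy to turn into a quick check. The sketch below uses the upper ends of the ranges above (0.7 GB per billion parameters and a 2 GB KV-cache reserve) as conservative defaults; the function names are hypothetical.

```python
def required_vram_gb(params_billion: float, gb_per_billion: float = 0.7,
                     kv_reserve_gb: float = 2.0) -> float:
    """Estimated VRAM for a Q4_K_M model plus a KV-cache reserve."""
    return params_billion * gb_per_billion + kv_reserve_gb


def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the estimate fits entirely in the given VRAM."""
    return required_vram_gb(params_billion) <= vram_gb


for size in (8, 13):
    for vram in (8, 12, 16):
        verdict = "fits" if fits(size, vram) else "needs offloading"
        print(f"{size}B on {vram} GB: {verdict}")
```

With these defaults an 8B model just fits on an 8 GB card, while a 13B model needs 12 GB or more to avoid offloading.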

What’s the best quantization level to start with?

The suggested starting point for quantization is Q4_K_M, as it strikes a good balance between model quality, speed, and size for the majority of applications. If you encounter issues with output quality, you can consider moving to higher-precision levels like Q5_K_M or Q8_0. When working with Ollama, Q4_K_M is typically the default setting, but you can tweak it based on your hardware capabilities and specific tasks.

How can I safely expose my local Ollama API on my network?

To make your Ollama API accessible across your network, you’ll need to adjust its configuration to listen on all network interfaces instead of just localhost. If you’re using Linux with systemd, follow these steps:

  1. Open the service file, typically located at /etc/systemd/system/ollama.service, and modify it to include the following line under the [Service] section:

    [Service]
    Environment="OLLAMA_HOST=0.0.0.0:11434"
    
  2. Save the changes, reload the systemd configuration (sudo systemctl daemon-reload), and restart the service (sudo systemctl restart ollama) to apply the update.

Keep in mind that this change will expose the API to your entire network. To maintain security, make sure your firewall settings are properly configured to limit access to trusted devices or IPs.
