
1/13/2025

Running LLMs locally with Ollama — when you don't want to send data to the cloud

TL;DR: Ollama runs LLMs locally (Llama 3, Mistral, Phi-3). OpenAI-compatible REST API. 8 GB RAM handles 7B models. Data never leaves your server.

More and more companies are integrating language models into their workflows — from ticket classification to content generation. But sending company data to the OpenAI or Anthropic API raises questions: where does that data go, can it be used for training, how does it relate to GDPR? Ollama is the answer for those who want the power of LLMs without giving up control over their data.

Why local LLMs

There are several reasons, and their weight varies depending on context. First — GDPR and trade secrets. If you’re processing customer personal data, contracts, or internal documentation, sending it to an external API can be legally problematic, and certainly requires a detailed analysis of data processing agreements. Second — cost at scale. Claude and GPT-4 with heavy use represent real expenses of $50–200 per month for a small team, and for larger operations the numbers grow linearly with volume. A local model costs electricity — literally a few dollars a month.

The third consideration is latency and availability. Your own endpoint has no rate limiting, no downtime during provider outages, and works without internet access.

Installing Ollama

The simplest approach is Docker Compose:

version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama_data:
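By default this container runs inference on the CPU. If the host has an NVIDIA GPU and the NVIDIA Container Toolkit installed, you can pass it through — a sketch of the extra keys for the `ollama` service, assuming a recent Docker Compose with GPU device reservations:

```yaml
services:
  ollama:
    # ...same image, ports, volumes and restart policy as above...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # or an integer to reserve specific GPUs
              capabilities: [gpu]
```

With a GPU attached, 7B/8B models typically answer in well under a second per token instead of several.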

After starting the container, pull a model:

docker exec -it ollama ollama pull llama3.1:8b

Verify it works:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain in one sentence what a REST API is",
  "stream": false
}'

The response will arrive after a few seconds — the first request is slower because the model is loading into memory.
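The same native endpoint is easy to call from Python using only the standard library — a minimal sketch, assuming the container above is running on localhost:11434:

```python
import json
import urllib.request

# Default Ollama endpoint from the Compose setup above
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str, stream: bool = False) -> dict:
    # Body shape expected by Ollama's native /api/generate endpoint
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model: str, prompt: str) -> str:
    # Single non-streaming request; the generated text comes back
    # in the "response" field of the JSON body
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `stream` left at `False` the server buffers the whole completion into one JSON object, which is the simplest shape to work with in scripts.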

Model comparison

With 8 GB RAM you have access to several solid models in the 7B/8B range. Llama 3.1 8B is currently the strongest open model in this class — particularly good for code and English. Mistral 7B is slightly faster and handles classification tasks well. Phi-3.5 Mini from Microsoft is optimized for reasoning at a small size — the fastest of the three, ideal for simple tasks like data extraction from structured text.

None of these models will match GPT-4o or Claude Sonnet for complex multi-step reasoning, but for 80% of typical tasks they are sufficient.

OpenAI-compatible REST API

This is Ollama’s killer feature — if you already have code using the OpenAI SDK, you only need to change one parameter. Example in Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string, Ollama doesn't verify it
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Write a Python function that sorts a list of dicts by a 'date' key"}
    ]
)

print(response.choices[0].message.content)

Zero changes to application logic — just base_url and model.

When the cloud wins

A local 7B model isn’t the answer to everything. GPT-4 and Claude Sonnet win in several areas: multi-step reasoning with long context (128k tokens and full utilization), code generation in niche frameworks where 7B models have sparse training data representation, and tasks requiring up-to-date knowledge (local models have a training cutoff date).

Practical rule of thumb: start with a local model, test quality on your own data, and only reach for the cloud for those specific cases where results are unacceptable.
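That rule of thumb is easy to encode as a thin routing layer. A minimal sketch — the task-type names and the cloud model ID are illustrative placeholders, not anything Ollama defines:

```python
# Task types that, per the comparison above, tend to exceed a local 7B/8B
# model. Both the category names and the cloud model ID are placeholders.
COMPLEX_TASKS = {"multi_step_reasoning", "long_context", "niche_framework"}

LOCAL_MODEL = "llama3.1:8b"
CLOUD_MODEL = "gpt-4o"

def pick_model(task_type: str) -> str:
    """Route known-hard task types to the cloud; everything else stays local."""
    return CLOUD_MODEL if task_type in COMPLEX_TASKS else LOCAL_MODEL
```

Because Ollama speaks the OpenAI protocol, the same client code can serve both targets — only `base_url` and `model` change per request.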

Summary

Ollama lowers the barrier to self-hosting an LLM to a minimum — Docker Compose, one pull, done. OpenAI API compatibility means existing code works without modification. For projects where data privacy is a priority, this isn’t a quality compromise — it’s a deliberate architectural decision.