Ollama Provider¶
No API keys. No cloud. No data leaving your machine. Ollama runs open-source LLMs directly on your hardware — your laptop, your workstation, your air-gapped server in the basement. The tradeoff is obvious: local models are smaller and slower than cloud-hosted frontier models. But for privacy-sensitive work, offline development, or just experimenting without burning API credits, nothing beats a model that runs entirely under your control.
Overview¶
The Ollama provider connects Amplifier to models running through Ollama, a local model server that manages downloading, running, and serving open-source LLMs. Ollama exposes an OpenAI-compatible API on localhost, so Amplifier talks to it exactly like it would talk to any cloud provider — except the inference happens on your CPU or GPU.
This is the only provider where your conversation data never touches a network. Not encrypted in transit, not stored on someone else's server — it simply never leaves your machine.
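Because the API is OpenAI-compatible, any OpenAI-style client can talk to Ollama by pointing at `http://localhost:11434/v1`. A minimal sketch using only the standard library (the running `chat()` call assumes `ollama serve` is up and `llama3.1` is pulled):

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint on the default port
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload for the local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to Ollama and return the reply text."""
    payload = build_request(model, prompt)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Response shape mirrors the OpenAI API
    return body["choices"][0]["message"]["content"]

# chat("llama3.1", "Say hello")  # requires `ollama serve` running locally
```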
Setup¶
Step 1: Install Ollama¶
Install Ollama on your system
[Tool: bash] curl -fsSL https://ollama.ai/install.sh | sh
On macOS, you can also install via Homebrew:
[Tool: bash] brew install ollama
On Windows, download the installer from ollama.ai.
Step 2: Start the Server¶
Start the Ollama server
[Tool: bash] ollama serve
Ollama is running on http://127.0.0.1:11434
On macOS and Windows, the Ollama desktop app starts the server automatically. On Linux, you may want to run it as a systemd service for persistence.
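For the systemd route, a minimal unit file keeps the server alive across reboots. This is a sketch — the `ExecStart` path, service user, and filename are assumptions, and the official Linux install script typically creates a similar unit for you:

```ini
# /etc/systemd/system/ollama.service — example unit (paths are assumptions)
[Unit]
Description=Ollama local model server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Restart=always
User=ollama

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now ollama`.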
Step 3: Pull a Model¶
Ollama doesn't include models by default — you pull them like Docker images.
Pull some useful models for development
[Tool: bash] ollama pull llama3.1
pulling manifest... done
pulling model... 100% ████████████████████ 4.7 GB
[Tool: bash] ollama pull codellama
pulling manifest... done
pulling model... 100% ████████████████████ 3.8 GB
[Tool: bash] ollama pull mistral
pulling manifest... done
pulling model... 100% ████████████████████ 4.1 GB
See what's available locally
[Tool: bash] ollama list
NAME SIZE MODIFIED
llama3.1:latest 4.7 GB 2 minutes ago
codellama:latest 3.8 GB 1 minute ago
mistral:latest 4.1 GB 30 seconds ago
Step 4: Verify the Connection¶
Test that Ollama responds
[Tool: bash] curl -s http://localhost:11434/api/tags | python3 -m json.tool | head -10
{
"models": [
{
"name": "llama3.1:latest",
"size": 4661223424,
...
If you see your models listed, the server is ready.
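The same check can be scripted. A small sketch that extracts model names from an `/api/tags` response — the `sample` below is an abbreviated, hand-written response in the shape shown above, not live server output:

```python
import json

# Abbreviated sample of a GET /api/tags response (shape as shown above)
sample = json.loads("""
{"models": [{"name": "llama3.1:latest", "size": 4661223424},
            {"name": "mistral:latest",  "size": 4113301824}]}
""")

def installed_models(tags_response: dict) -> list[str]:
    """Return the model names reported by /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]

print(installed_models(sample))  # ['llama3.1:latest', 'mistral:latest']
```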
Configuration¶
Minimal Configuration¶
# amplifier.yaml
provider:
  name: ollama
  model: llama3.1
The base_url defaults to http://localhost:11434 — no need to specify it unless you've changed the port or are running Ollama on another machine.
Full Configuration¶
provider:
  name: ollama
  base_url: http://localhost:11434
  model: llama3.1
  temperature: 0.2
  max_tokens: 4096
Configuration Parameters¶
| Parameter | Default | Description |
|---|---|---|
| name | — | Always ollama |
| model | — | Model name as shown in ollama list |
| base_url | http://localhost:11434 | Ollama server address |
| temperature | 0.8 | Randomness (0.0 = deterministic, higher = creative) |
| max_tokens | 2048 | Maximum tokens in the response |
Remote Ollama Server¶
Running Ollama on a beefy GPU server while developing on a lighter laptop? Point Amplifier at the remote instance:
provider:
  name: ollama
  base_url: http://gpu-server.local:11434
  model: llama3.1:70b
Just make sure the server is reachable on your network. Ollama binds to localhost by default — set OLLAMA_HOST=0.0.0.0 on the server to accept remote connections.
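On the server side, that looks like the following (the hostname above is an example; any way of setting the environment variable for the server process works, including a systemd drop-in):

```shell
# On the GPU server: bind Ollama to all interfaces, not just localhost.
# (0.0.0.0 exposes the port to your whole network — firewall accordingly.)
export OLLAMA_HOST=0.0.0.0
echo "Ollama will listen on ${OLLAMA_HOST}:11434"
# Restart the server so the new bind address takes effect:
# ollama serve
```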
Models¶
Open-source models vary widely in capability. Here's what works well with Amplifier.
Llama 3.1 — Best General Purpose¶
model: llama3.1
Meta's flagship open model. Llama 3.1 handles general coding, explanation, and conversation well. The 8B parameter version runs on most laptops; the 70B version needs a serious GPU but approaches cloud model quality.
[Tool: bash] ollama pull llama3.1 # 8B - runs on most hardware
[Tool: bash] ollama pull llama3.1:70b # 70B - needs ~40GB VRAM
Best for: General coding tasks, conversation, explanation, the default choice.
CodeLlama — Code-Specialized¶
model: codellama
Fine-tuned specifically for code generation. CodeLlama understands programming patterns better than general models of the same size. Supports fill-in-the-middle completions natively.
Best for: Code generation, completion, code-focused tasks on limited hardware.
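Fill-in-the-middle works by wrapping the code before and after the gap in sentinel tokens the model was trained on. A sketch of the prompt construction — the exact `<PRE>/<SUF>/<MID>` spacing follows Meta's published Code Llama format and is an assumption here:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a Code Llama infill prompt: the model generates the code
    that belongs between prefix and suffix.
    Token spacing per Meta's published format (an assumption here)."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

# The model would complete the function body between these two fragments
prompt = fim_prompt("def add(a, b):\n", "\nprint(add(2, 3))")
```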
Mistral — Fast and Compact¶
model: mistral
Mistral's 7B model punches above its weight. Fast inference, small memory footprint, and solid general performance. A good choice when you need speed over depth.
Best for: Quick tasks, limited hardware, fast iteration cycles.
DeepSeek Coder — Strong at Code¶
model: deepseek-coder-v2
A code-specialized model with strong benchmark performance. Handles complex code generation and understanding well for its size class.
Best for: Code-heavy workflows where you need better-than-average local code quality.
Choosing Model Sizes¶
The :latest tag typically pulls the smallest variant. Specify sizes explicitly:
[Tool: bash] ollama pull llama3.1:8b # 8B params, ~5GB, runs on 8GB RAM
[Tool: bash] ollama pull llama3.1:70b # 70B params, ~40GB, needs serious GPU
Rule of thumb: Use the largest model your hardware can run at acceptable speed. An 8B model responding in 2 seconds beats a 70B model that takes 30 seconds — for most interactive work.
A Fully Local Amplifier Environment¶
Here's the complete setup for running Amplifier with zero cloud dependencies:
Set up a fully offline Amplifier workspace
# amplifier.yaml — fully local configuration
provider:
  name: ollama
  base_url: http://localhost:11434
  model: llama3.1

# No API keys needed anywhere
# No network access required
# All data stays on this machine
[Tool: bash] ollama serve &
[Tool: bash] ollama pull llama3.1
Verify the local setup works
> Write a Python function to parse CSV files
Here's a CSV parser using the standard library...
Everything runs locally. The conversation, the model inference, the tool execution — all on your machine. This is ideal for:
- Working on classified or highly sensitive codebases
- Development in air-gapped environments
- Planes, trains, and places without internet
- Experimenting without spending money on API calls
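A quick scriptable health check rounds out the setup — useful in shell profiles or CI for air-gapped runners. A sketch using only the standard library:

```python
import urllib.request
import urllib.error

def ollama_up(base_url: str = "http://localhost:11434",
              timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers /api/tags at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

# print(ollama_up())  # True once `ollama serve` is running
```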
Tips¶
- Start with llama3.1. It's the best general-purpose local model. Branch out once you know your needs.
- GPU makes a huge difference. CPU inference works but is slow — 10-30x slower than GPU. If you have an NVIDIA GPU, Ollama uses it automatically via CUDA.
- Monitor memory usage. Models load into RAM (or VRAM). An 8B model needs ~5GB, 70B needs ~40GB. If your system starts swapping, use a smaller model.
- Expect shorter, simpler responses. Local models are less capable than cloud frontier models. They'll handle straightforward tasks well but may struggle with complex multi-step reasoning.
- Use Ollama for privacy, cloud for power. A common pattern: Ollama for sensitive code, Claude or GPT for everything else. The routing matrix makes this seamless.
- Keep models updated. Run ollama pull <model> periodically — new quantizations and model versions improve quality without hardware changes.
- Run the server as a service. On Linux, set up a systemd unit so Ollama starts on boot. No need to remember ollama serve every morning.
Next Steps¶
- See the Provider Index for mixing Ollama with cloud providers in a routing matrix
- Explore Community Providers for vLLM if you need higher-throughput local serving
- Check the Anthropic Provider or OpenAI Provider for cloud alternatives when you need more capable models