Step 1: Install Ollama
Ollama is the runtime that lets you download and run open-source models locally. Think of it like Docker, but for AI models.
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Once installed, verify it's working:
```shell
ollama --version
```
Ollama exposes a local REST API at http://localhost:11434. Every model you pull will be available through that same endpoint: no API keys, no internet required after the initial download.

Step 2: Pull and run Llama 3
Llama 3 is your reasoning model. Use it when a task needs thinking — complex prompts, multi-step logic, architecture decisions, long-form answers.
Pull and run it with:
```shell
ollama run llama3
```
The first run downloads the model (around 4–8 GB depending on the size variant). After that it's cached locally.
You can test it interactively in the terminal, but in your actual system you'll call it via HTTP:
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain the tradeoffs between REST and GraphQL",
  "stream": false
}'
```
The response comes back as JSON with a "response" field containing the model's output.

When to use Llama: complex reasoning, open-ended questions, architecture thinking, anything that needs more than a one-line answer.
Step 3: Pull and run Gemma 4 Effective
Gemma 4 is Google's latest release (April 2026). The Effective 4B (e4b) variant is your fast execution model. It's built for agentic workflows, meaning it's very good at following instructions and using tools quickly.

```shell
ollama run gemma4:e4b
```
Gemma 4 is also natively multimodal, so it can handle text, images, and even audio inputs locally. This makes it the perfect backbone for a coding agent like Claude Code when running in local-first mode.
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "Classify this transaction: Uber Eats $24.50",
  "stream": false
}'
```
When to use Gemma 4: structured tasks, heavy tool-calling, data extraction, or when you need multimodal (image/audio) support on a budget.
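Since structured extraction is Gemma's sweet spot, it's worth knowing that Ollama's /api/generate accepts a `format: "json"` field that constrains the reply to valid JSON. Here's a minimal sketch of building such a request body; the `buildClassifyRequest` helper name and the prompt wording are ours, not part of any library:

```javascript
// Sketch: build an Ollama request that asks for constrained JSON output.
// buildClassifyRequest is a hypothetical helper, not a library API.
function buildClassifyRequest(transaction) {
  return {
    model: 'gemma4:e4b',
    prompt: `Classify this transaction as JSON like {"category": "..."}: ${transaction}`,
    format: 'json', // tells Ollama to constrain the reply to valid JSON
    stream: false
  };
}

// The object below is what you'd POST to http://localhost:11434/api/generate.
const req = buildClassifyRequest('Uber Eats $24.50');
console.log(JSON.stringify(req, null, 2));
```

Parsing the model's reply then becomes a plain `JSON.parse` instead of regex scraping.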
Step 4: Design the system before you build it
This is the step most people skip, and it's why their local AI setups become a mess. High-performance local AI depends on routing: don't waste Llama's reasoning power on one-line tasks or Gemma's speed on complex architecture questions.
```
User Request
     │
Backend Orchestrator (Node.js / Express)
     │
Router Logic
  ┌──┴─────────────┐
"Complex"      "Fast Task"
  │                │
Llama 3        Gemma 4
(8B / 70B)     (Effective 4B)
  └──┬─────────────┘
  Response
```
Your backend is the orchestrator — it decides which model to call based on the complexity of the task. The models themselves don't know about each other. This separation is what makes the system clean, fast, and maintainable.
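As a sketch, that routing decision can be a single pure function. The `pickModel` name and the "complex"/"simple" task types are just illustrative conventions (they match the test requests used later in this guide):

```javascript
// Sketch: the router's only job is mapping a task type to a model name.
// pickModel is a hypothetical helper; the model names match the ones pulled above.
function pickModel(taskType) {
  // Default to the fast model; escalate to Llama only when explicitly asked.
  return taskType === 'complex' ? 'llama3' : 'gemma4:e4b';
}

console.log(pickModel('complex')); // the reasoning model
console.log(pickModel('simple'));  // the fast execution model
```

Keeping this logic in one place means changing your routing rules never touches the HTTP layer or the model services.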
Step 5: Build the backend
You can use any backend you're comfortable with. Here's how it looks in Node.js using Express and Axios.
```shell
npm install express axios
```
llamaService.js:
```javascript
const axios = require('axios');

const OLLAMA_URL = 'http://localhost:11434/api/generate';

async function generate(prompt) {
  const response = await axios.post(OLLAMA_URL, {
    model: 'llama3',
    prompt: prompt,
    stream: false
  });
  return response.data.response;
}

module.exports = { generate };
```
Create a matching gemmaService.js (same code, with the model set to gemma4:e4b), plus a router service and an Express controller. Now you can POST to /api/ai/ask with a taskType of "complex" or "simple" and the right model handles it.

Step 6: Use Claude Code to write your backend
Claude Code is a CLI tool that understands your entire codebase. Instead of writing boilerplate by hand, you describe what you want and it writes it for you.
```shell
npm install -g @anthropic-ai/claude-code
```
Open your project folder and run:
```shell
claude
```
Give it a clear, structured prompt:
```
I'm building a Node.js app that routes AI tasks to two local Ollama models:
- llama3 for complex reasoning tasks
- gemma4:e4b for fast execution tasks
Both are available at http://localhost:11434/api/generate

Please create:
- A service class for each model using axios
- A router service that picks the right model based on task type
- An Express controller that exposes a /api/ai/ask endpoint
```
Claude Code will read your existing files, write the new ones, and wire everything together. You review and approve each change before it's applied.
Claude Code helps you build the system. It doesn't run inside it by default.
Bonus: Run Claude Code entirely offline
Want 100% privacy? You can run Claude Code on your own hardware by pointing it to your local Ollama server. No more cloud billing for code generation.
The easiest way is using the native launcher:
```shell
ollama launch claude --model gemma4:e4b
```
Or manually set your environment variables:
```shell
# Set environment variables
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
claude --model gemma4:e4b
```
Context Window Tip: Claude Code needs a lot of memory to read your project. If it feels "forgetful", increase your Ollama context length before starting:
```shell
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```

Step 7: Test the whole thing
Start your Node.js app, then fire off a couple of test requests:
```shell
# Complex task → routes to Llama
curl -X POST http://localhost:8080/api/ai/ask \
  -H "Content-Type: application/json" \
  -d '{"taskType": "complex", "prompt": "What are the tradeoffs of microservices vs monolith?"}'

# Simple task → routes to Gemma
curl -X POST http://localhost:8080/api/ai/ask \
  -H "Content-Type: application/json" \
  -d '{"taskType": "simple", "prompt": "Classify this: Netflix $15.99"}'
```
If both return sensible answers, your system is working.
Common mistakes to avoid
Using one model for everything.
Llama is slow for simple tasks. Gemma isn't great at reasoning. Use each for what it's good at.
Hardcoding the model choice in the controller.
The router service should own that logic, not the HTTP layer. Keep it separate so you can change routing rules without touching the API.
Not setting stream: false.
By default Ollama streams tokens back one at a time. For a backend service, you almost always want the full response in one shot.
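With streaming on, Ollama returns one JSON object per line (NDJSON), each carrying a fragment of the reply in its response field, with "done": true marking the last chunk. If you ever do need streaming, you reassemble the fragments yourself. A sketch of what that looks like; the `collectStream` helper is ours, and the sample payload is made-up illustration, not real model output:

```javascript
// Sketch: reassemble a streamed Ollama reply from NDJSON lines.
// collectStream is a hypothetical helper for illustration only.
function collectStream(ndjson) {
  return ndjson
    .split('\n')
    .filter((line) => line.trim().length > 0) // drop blank lines
    .map((line) => JSON.parse(line))          // each line is one JSON chunk
    .map((chunk) => chunk.response)           // keep only the text fragment
    .join('');
}

// Illustrative stream payload, not real model output:
const sample = '{"response":"Hel","done":false}\n{"response":"lo","done":true}';
console.log(collectStream(sample)); // "Hello"
```

For a simple backend service, setting stream: false and reading one JSON body is far less machinery.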
Treating Claude Code as a magic button.
Give it context — what you're building, what already exists, and what you want. Vague prompts produce vague code.
What's next
From here you can extend the system in a few directions:
1. Confidence thresholds: if Gemma's response seems uncertain, re-route to Llama automatically.
2. Logging: log every request and response to a database for debugging.
3. Swapping models: swap in different models per task type as better ones get released (Ollama makes this trivial).
4. Frontend: add a simple frontend to test prompts without curl.
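The confidence-threshold idea can start very simply: treat short or hedging replies from the fast model as low-confidence and escalate them. A sketch under that assumption; the heuristic and the `needsEscalation` name are entirely ours, so tune both against your own traffic:

```javascript
// Sketch: a crude low-confidence detector for deciding when to re-route
// a fast-model answer to the reasoning model. Heuristic is illustrative only.
const HEDGES = ['i am not sure', "i'm not sure", 'it depends', 'cannot determine'];

function needsEscalation(answer) {
  const text = answer.toLowerCase();
  // Very short answers or explicit hedging suggest the fast model struggled.
  return text.length < 10 || HEDGES.some((h) => text.includes(h));
}

console.log(needsEscalation('Food delivery'));             // confident enough
console.log(needsEscalation("I'm not sure what this is")); // escalate to Llama
```

Because the router already owns model selection, this check slots in there without touching the controller.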
The foundation is solid. The rest is just building on top of it.
