Step 1: Install Ollama
Ollama is the runtime that lets you download and run open-source models locally. Think of it like Docker, but for AI models.
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Once installed, verify it's working:
```shell
ollama --version
```
Ollama exposes a local REST API at http://localhost:11434. Every model you pull will be available through that same endpoint: no API keys, no internet required after the initial download.

Step 2: Pull and run Llama 3
Llama 3 is your reasoning model. Use it when a task needs thinking — complex prompts, multi-step logic, architecture decisions, long-form answers.
Pull and run it with:
```shell
ollama run llama3
```
The first run downloads the model (around 4–8 GB depending on the size variant). After that it's cached locally.
You can test it interactively in the terminal, but in your actual system you'll call it via HTTP:
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain the tradeoffs between REST and GraphQL",
  "stream": false
}'
```
The response comes back as JSON with a "response" field containing the model's output.

When to use Llama: complex reasoning, open-ended questions, architecture thinking, anything that needs more than a one-line answer.
Step 3: Pull and run Gemma 4 Effective
Gemma 4 is Google's latest release (April 2026). The Effective 4B (e4b) variant is your fast execution model. It's built for agentic workflows, meaning it's very good at following instructions and using tools quickly.

```shell
ollama run gemma4:e4b
```
Gemma 4 is also natively multimodal, so it can handle text, images, and even audio inputs locally. This makes it the perfect backbone for a coding agent like Claude Code when running in local-first mode.
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "Classify this transaction: Uber Eats $24.50",
  "stream": false
}'
```
When to use Gemma 4: structured tasks, heavy tool-calling, data extraction, or when you need multimodal (image/audio) support on a budget.
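Since structured extraction is Gemma's sweet spot, it's worth knowing that Ollama's /api/generate accepts a `format: "json"` field that constrains the reply to valid JSON. Here's a minimal sketch of building such a request body; the `buildClassifyRequest` helper name and the prompt wording are ours, not part of any library:

```javascript
// Sketch: build an Ollama request that asks for constrained JSON output.
// buildClassifyRequest is a hypothetical helper, not a library API.
function buildClassifyRequest(transaction) {
  return {
    model: 'gemma4:e4b',
    prompt: `Classify this transaction as JSON like {"category": "..."}: ${transaction}`,
    format: 'json', // tells Ollama to constrain the reply to valid JSON
    stream: false
  };
}

// The object below is what you'd POST to http://localhost:11434/api/generate.
const req = buildClassifyRequest('Uber Eats $24.50');
console.log(JSON.stringify(req, null, 2));
```

Parsing the model's reply then becomes a plain `JSON.parse` instead of regex scraping.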
Step 4: Design the system before you build it
This is the step most people skip, and it's why their local AI setups become a mess. High-performance local AI depends on routing: don't waste Llama's reasoning power on one-line tasks or Gemma's speed on complex architecture questions.
```
User Request
     │
Backend Orchestrator (Node.js / Express)
     │
Router Logic
  ┌──┴─────────────┐
"Complex"      "Fast Task"
  │                │
Llama 3        Gemma 4
(8B / 70B)     (Effective 4B)
  └──┬─────────────┘
  Response
```
Your backend is the orchestrator — it decides which model to call based on the complexity of the task. The models themselves don't know about each other. This separation is what makes the system clean, fast, and maintainable.
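As a sketch, that routing decision can be a single pure function. The `pickModel` name and the "complex"/"simple" task types are just illustrative conventions (they match the test requests used later in this guide):

```javascript
// Sketch: the router's only job is mapping a task type to a model name.
// pickModel is a hypothetical helper; the model names match the ones pulled above.
function pickModel(taskType) {
  // Default to the fast model; escalate to Llama only when explicitly asked.
  return taskType === 'complex' ? 'llama3' : 'gemma4:e4b';
}

console.log(pickModel('complex')); // the reasoning model
console.log(pickModel('simple'));  // the fast execution model
```

Keeping this logic in one place means changing your routing rules never touches the HTTP layer or the model services.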
Step 5: Build the backend
You can use any backend you're comfortable with. Here's how it looks in Node.js using Express and Axios.
```shell
npm install express axios
```
llamaService.js:
```javascript
const axios = require('axios');

const OLLAMA_URL = 'http://localhost:11434/api/generate';

async function generate(prompt) {
  const response = await axios.post(OLLAMA_URL, {
    model: 'llama3',
    prompt: prompt,
    stream: false
  });
  return response.data.response;
}

module.exports = { generate };
```
Create a matching gemmaService.js (same code, with the model set to gemma4:e4b), plus a router service and an Express controller. Now you can POST to /api/ai/ask with a taskType of "complex" or "simple" and the right model handles it.

Step 6: Use Claude Code to write your backend
Claude Code is a CLI tool that understands your entire codebase. Instead of writing boilerplate by hand, you describe what you want and it writes it for you.
```shell
npm install -g @anthropic-ai/claude-code
```
Open your project folder and run:
```shell
claude
```
Give it a clear, structured prompt:
```
I'm building a Node.js app that routes AI tasks to two local Ollama models:
- llama3 for complex reasoning tasks
- gemma4:e4b for fast execution tasks
Both are available at http://localhost:11434/api/generate

Please create:
- A service class for each model using axios
- A router service that picks the right model based on task type
- An Express controller that exposes a /api/ai/ask endpoint
```
Claude Code will read your existing files, write the new ones, and wire everything together. You review and approve each change before it's applied.
Claude Code helps you build the system. It doesn't run inside it by default.
Bonus: Run Claude Code entirely offline
Want 100% privacy? You can run Claude Code on your own hardware by pointing it to your local Ollama server. No more cloud billing for code generation.
The easiest way is using the native launcher:
```shell
ollama launch claude --model gemma4:e4b
```
Or manually set your environment variables:
```shell
# Set environment variables
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
claude --model gemma4:e4b
```
Context Window Tip: Claude Code needs a lot of memory to read your project. If it feels "forgetful", increase your Ollama context length before starting:
```shell
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```

Step 7: Test the whole thing
Start your Node.js app, then fire off a couple of test requests:
```shell
# Complex task → routes to Llama
curl -X POST http://localhost:8080/api/ai/ask \
  -H "Content-Type: application/json" \
  -d '{"taskType": "complex", "prompt": "What are the tradeoffs of microservices vs monolith?"}'

# Simple task → routes to Gemma
curl -X POST http://localhost:8080/api/ai/ask \
  -H "Content-Type: application/json" \
  -d '{"taskType": "simple", "prompt": "Classify this: Netflix $15.99"}'
```
If both return sensible answers, your system is working.
Common mistakes to avoid
Using one model for everything.
Llama is slow for simple tasks. Gemma isn't great at reasoning. Use each for what it's good at.
Hardcoding the model choice in the controller.
The router service should own that logic, not the HTTP layer. Keep it separate so you can change routing rules without touching the API.
Not setting stream: false.
By default Ollama streams tokens back one at a time. For a backend service, you almost always want the full response in one shot.
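With streaming on, Ollama returns one JSON object per line (NDJSON), each carrying a fragment of the reply in its response field, with "done": true marking the last chunk. If you ever do need streaming, you reassemble the fragments yourself. A sketch of what that looks like; the `collectStream` helper is ours, and the sample payload is made-up illustration, not real model output:

```javascript
// Sketch: reassemble a streamed Ollama reply from NDJSON lines.
// collectStream is a hypothetical helper for illustration only.
function collectStream(ndjson) {
  return ndjson
    .split('\n')
    .filter((line) => line.trim().length > 0) // drop blank lines
    .map((line) => JSON.parse(line))          // each line is one JSON chunk
    .map((chunk) => chunk.response)           // keep only the text fragment
    .join('');
}

// Illustrative stream payload, not real model output:
const sample = '{"response":"Hel","done":false}\n{"response":"lo","done":true}';
console.log(collectStream(sample)); // "Hello"
```

For a simple backend service, setting stream: false and reading one JSON body is far less machinery.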
Treating Claude Code as a magic button.
Give it context — what you're building, what already exists, and what you want. Vague prompts produce vague code.
What's next
From here you can extend the system in a few directions:
1. Confidence thresholds: if Gemma's response seems uncertain, re-route to Llama automatically.
2. Logging: log every request and response to a database for debugging.
3. Swapping models: swap in different models per task type as better ones get released (Ollama makes this trivial).
4. Frontend: add a simple frontend to test prompts without curl.
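The confidence-threshold idea can start very simply: treat short or hedging replies from the fast model as low-confidence and escalate them. A sketch under that assumption; the heuristic and the `needsEscalation` name are entirely ours, so tune both against your own traffic:

```javascript
// Sketch: a crude low-confidence detector for deciding when to re-route
// a fast-model answer to the reasoning model. Heuristic is illustrative only.
const HEDGES = ['i am not sure', "i'm not sure", 'it depends', 'cannot determine'];

function needsEscalation(answer) {
  const text = answer.toLowerCase();
  // Very short answers or explicit hedging suggest the fast model struggled.
  return text.length < 10 || HEDGES.some((h) => text.includes(h));
}

console.log(needsEscalation('Food delivery'));             // confident enough
console.log(needsEscalation("I'm not sure what this is")); // escalate to Llama
```

Because the router already owns model selection, this check slots in there without touching the controller.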
The foundation is solid. The rest is just building on top of it.
