Published 2025-11-04.
Time to read: 12 minutes.
llm collection.
This article follows GLM-Z1-9B-0414 on the Windows Ollama app, which discussed how to set up one of the currently leading LLMs for coding with Ollama.
This is an early draft of a work in progress. Grok was the source of most of the information, and as usual it was extremely optimistic and sometimes not very realistic. This article will be updated as I work through the material.
I asked Gemini to critique the most recent version of the article. The critique, which seems a bit like one LLM talking trash about another, follows the article below.
The Article Starts Here
We begin by discussing how using two specialized LLMs with the Continue.dev Visual Studio Code Extension yields faster and better responses than just using one model. We then discuss how using three models yields even faster and better responses. Next we discuss using Ollama Desktop and custom code instead of Visual Studio and Continue.dev. Orchestration of 5+ LLMs without overloading a PC is discussed next. Finally we compute bandwidth requirements for quick response when 5 remote LLMs are configured.
I am about to start testing the setups described in this article. The article will be updated as work progresses and I am able to report results.
Overview
You can make GLM-Z1-9B (a thinking model) and gemma2:9b-instruct-q5_K_M (a doing model)
cooperate in a dual-model agent pipeline using the Visual Studio Continue.dev
extension and Ollama, with GLM-Z1 as the planner and reasoner
and Gemma 2 as the executor.
The pattern is to repeat the following until done:
the thinker (GLM-Z1) prepares a <think> plan,
enhanced with semantically relevant context retrieved via nomic-embed-text embeddings from the codebase,
and sends it to the doer (Gemma 2), which provides agency by using tools for file editing, API calling, and other actions.
Process Flow
| Step | Process | Role |
|---|---|---|
| 1 | GLM-Z1-9B | Generates a <think> plan and a tool call request
|
| 2 | Continue.dev | Parses the plan and sends to Gemma 2 |
| 3 | Gemma 2 | Executes tools (edit file, run command) |
| 4 | Continue.dev | Feedback loop: GLM-Z1 refines |
The above is a ReAct-style feedback loop (Reason and Act). Tool results (e.g., file edit success/failure) are automatically appended as context items to the next prompt. The thinker model considers this and decides to refine, re-plan, or stop, all without any no human interaction.
The loop continues until the task completes or a maximum iteration limit is hit.
The default limit is between 5–10 turns, configurable in config.json as
"experimental": {"maxAgentIterations": 10}.
Dual and triad model configurations rely on context providers like @codebase
and @docs. Embeddings ensure the thinker (e.g., GLM-Z1) gets precise, relevant
chunks (e.g., 15 snippets via nRetrieve=15), not the full repository. This keeps
prompt evaluation fast (150–200 t/s) and reasoning focused.
Setup: config.json for Dual-Model Agent
systemMessage when supportsTools or roles is present in those models.
{
"codebase": {
"embedOnStartup": true
},
"contextProviders": [
{
"name": "codebase",
"params": {
"nFinal": 5,
"nRetrieve": 15,
"useChunks": true
}
},
{
"name": "docs",
"params": {
"urls": [
"https://mslinn.com",
"https://go.dev",
"https://ollama.com",
"https://developer.mozilla.org/en-US/docs/Web"
]
}
}
],
"embeddingsProvider": {
"model": "nomic-embed-text",
"provider": "ollama"
},
"models": [
{
"apiBase": "http://localhost:11434",
"default": true,
"model": "sllm/glm-z1-9b",
"name": "thinker",
"provider": "ollama",
"roles": ["chat", "edit"],
"title": "GLM-Z1-9B (Thinker)"
},
{
"apiBase": "http://localhost:11434",
"model": "gemma2:9b-instruct-q5_K_M",
"name": "doer",
"provider": "ollama",
"roles": ["apply"],
"supportsTools": true,
"title": "Gemma 2 9B (Doer)"
}
],
"rules": [
{
"model": "thinker",
"name": "Thinker Plan",
"prompt": "Always output: <think>plan</think>
then <tool_call>{...}</tool_call>"
},
{
"model": "doer",
"name": "Doer Execute",
"prompt": "Execute <tool_call> exactly.
Return: <tool_result>output</tool_result>"
},
{
"name": "Auto-Route",
"prompt": "If <tool_call> is present,
route to model 'doer'."
}
],
"systemMessage": "You are a dual-agent system.
GLM-Z1-9B (thinker) plans with <think> and proposes edits.
Gemma 2 (doer) executes with <tool_call>.",
"tabAutocompleteModel": {
"apiBase": "http://localhost:11434",
"model": "sllm/glm-z1-9b",
"provider": "ollama",
"title": "GLM-Z1-9B Autocomplete"
}
}
How to Use in Continue.dev
-
Pull both models:
$ ollama pull sllm/glm-z1-9b $ ollama pull gemma2:9b-instruct-q5_K_M -
Save the configuration and reload it. I do not believe it is necessary to reload VS Code with CTRL+R.
-
Check to make sure the Continue.dev agent mode is active, then send this prompt:
@codebase Create a Python script that fetches weather from API and saves to file. -
GLM-Z1 responds:
<think> 1. Create weather.py 2. Use requests.get() 3. Save JSON to data.json </think> <tool_call> {"action": "create_file", "path": "weather.py", "content": "...code..."} </tool_call> -
Press Continue to send to Gemma 2, which creates
weather.pyas requested.
Performance on Bear (RTX 3060 12GB)
| Model | VRAM | Speed |
|---|---|---|
| GLM-Z1-9B | ~6.5 GB | 70 t/s |
| Gemma 2 9B | ~6.0 GB | 55 t/s |
| Total | ~12.5 GB | Fits in 12GB VRAM (swap OK) |
Auto-Routing with Rules
Add to rules:
{
"name": "Route to Doer",
"prompt": "If <tool_call> is present, send to model 'doer'."
}
Continue will automatically switch models to run the pipeline. This is not a streaming setup, it is a one-shot batch job. The steps areare:
-
Save the dual-model
config.json -
Download the
gemma2:9b-instruct-q5_K_Mmodel.$ ollama pull gemma2:9b-instruct-q5_K_M -
Reload Visual Studio Code with CTRL+R.
-
Type this prompt:
@codebase Make a hello world script → **File created automatically.**
You now have combined two models: a thinker and a doer.
Is A Pipeline Slower Than Using Only One Model?
We looked at other LLMs earlier: DeepSeek-Coder-V2, Qwen3-Coder, Codestral, and CodeLlama. Are any of these close to GLM-Z1’s coding ability while also being capable of significant agency?
Grok provided all of the following information and I have not verified any of it.
GLM-Z1-9B-0414 is a 9B-parameter model tuned for reasoning (e.g., math/logic/code planning) with strong coding performance (~75–80% on HumanEval, ~65–70% on LiveCodeBench) but limited native agency (no built-in tool calling; relies on prompt-based planning like
Benchmarks focus on coding (HumanEval: code generation accuracy; LiveCodeBench: real-world algorithmic coding) and agency (BFCL: function calling/tool use; ToolBench: agentic task completion). Scores are from 2025 evals (e.g., LiveCodeBench v5, BFCL v3); higher is better. GLM-Z1 baselines are from community/THUDM reports. “Close to GLM-Z1” means ±5% on coding; “significant agency” means ≥70% on BFCL/ToolBench.
| Model Variant | HumanEval (%) | LiveCodeBench (%) | BFCL (Tool Calling, %) | ToolBench (Agency, %) | Close to GLM-Z1 Coding? | Significant Agency? | Notes |
|---|---|---|---|---|---|---|---|
| GLM-Z1-9B (Baseline) | 75–80 | 65–70 | 55–60 | 60–65 | - | No | Excellent reasoning/planning but prompt-only agency; no native tools. Strong on math-integrated code. |
| DeepSeek-Coder-V2 (Lite 16B) | 90.2 | 43.4 | 65–70 | 70–75 | No (higher coding) | Yes | Tops coding (90%+ HumanEval) but LiveCodeBench lags GLM-Z1. Strong agency via MoE efficiency for tool chains; good for multi-step agents. |
| Qwen3-Coder (14B–30B) | 89.3 | 70.7 | 75–80 | 75–80 | Yes (matches) | Yes | Closest match: ~89% HumanEval (equal/superior GLM-Z1), 70%+ LiveCodeBench. Excellent agency with native tool calling and agentic tuning (e.g., BFCL 75%+); supports 128K+ contexts for multi-file agents. |
| Codestral 22B | 71.4–86.6 | 37.9 | 65–70 | 65–70 | Yes (similar HumanEval) | Moderate | Solid coding (71%+ HumanEval, but lower LiveCodeBench). Good agency via prompt-based tools (e.g., 65% BFCL), but not as integrated as Qwen3; strong for multi-language but weaker planning. |
| CodeLlama 13B/34B | 53–67.8 | ~30–40 | <50 | <50 | No (lower) | No | Outdated (2023); 53–68% HumanEval but poor LiveCodeBench (~30–40%). Minimal agency (no native tools; prompt-only ~40–50% BFCL); fine for basic gen but lags in real-world/agents. |
Dual- vs. Single- Model Summary
Grok provided all of the following information and I have not verified any of it.
Closest Overall: Qwen3-Coder – Matches GLM-Z1’s coding (89% HumanEval, 70% LiveCodeBench) while providing significant agency (75%+ BFCL/ToolBench via native tools and agentic fine-tuning). It’s the best hybrid for autonomous tasks like multi-file refactors or tool chains.
DeepSeek-Coder-V2: Superior coding (90% HumanEval) but agency is good-not-great (70% ToolBench); better for pure generation than full agents.
Codestral: Comparable coding (71% HumanEval) with moderate agency (65% BFCL); viable but not “significant.”
CodeLlama: Falls short in both (~60% HumanEval, <50% agency); not competitive in 2025.
If agency is key, swap to Qwen3-Coder-14B for a drop-in upgrade; similar size/speed to GLM-Z1 but with better tools.
Comparing Execution Times
Match up two models (GLM-Z1-9B for thinking, and Gemma 2 9B for doing) vs. Qwen3-Coder 14B all by itself.
Grok provided all of the following information and I have not verified any of it.
To compare “total time taken,” I assume a typical agentic coding task (e.g., “@codebase Refactor a 500-token Python function with multi-step planning and file edit,” ~800 tokens input, ~400 tokens output). This includes prompt evaluation (processing input/context) + generation (output tokens). Metrics are based on 2025 benchmarks for Ollama on RTX 3060 12 GB VRAM (Q5_K_M quantization, 8K context), averaged from community reports (e.g., Reddit/LocalLLaMA, Hugging Face evals). Dual-model adds ~0.5–1s overhead per switch (Continue.dev routing)
Estimated Speeds on RTX 3060 (Ollama, Q5_K_M, 8K Context)
| Model Variant | HumanEval (%) | LiveCodeBench (%) | BFCL (Tool Calling, %) | ToolBench (Agency, %) | Close to GLM-Z1 Coding? | Significant Agency? | Notes |
|---|---|---|---|---|---|---|---|
| GLM-Z1-9B (Baseline) | 75–80 | 65–70 | 55–60 | 60–65 | - | No | Excellent reasoning/planning but prompt-only agency; no native tools. Strong on math-integrated code. |
| DeepSeek-Coder-V2 (Lite 16B) | 90.2 | 43.4 | 65–70 | 70–75 | No (higher coding) | Yes | Tops coding (90%+ HumanEval) but LiveCodeBench lags GLM-Z1. Strong agency via MoE efficiency for tool chains; good for multi-step agents. |
| Qwen3-Coder (14B–30B) | 89.3 | 70.7 | 75–80 | 75–80 | Yes (matches) | Yes | Closest match: ~89% HumanEval (equal/superior GLM-Z1), 70%+ LiveCodeBench. Excellent agency with native tool calling and agentic tuning (e.g., BFCL 75%+); supports 128K+ contexts for multi-file agents. |
| Codestral 22B | 71.4–86.6 | 37.9 | 65–70 | 65–70 | Yes (similar HumanEval) | Moderate | Solid coding (71%+ HumanEval, but lower LiveCodeBench). Good agency via prompt-based tools (e.g., 65% BFCL), but not as integrated as Qwen3; strong for multi-language but weaker planning. |
| CodeLlama 13B/34B | 53–67.8 | ~30–40 | <50 | <50 | No (lower) | No | Outdated (2023); 53–68% HumanEval but poor LiveCodeBench (~30–40%). Minimal agency (no native tools; prompt-only ~40–50% BFCL); fine for basic gen but lags in real-world/agents. |
Summary
Grok provided all of the following information and I have not verified any of it.
Dual-Model Wins on Time: ~15–20 s vs. 18–25 s for Qwen3-Coder (10–25% faster overall). The thinker’s fast prompt eval (175 t/s) and doer’s solid gen (55 t/s) offset switch latency for tasks with clear plan/exec splits. Qwen3 is more consistent but slower on gen-heavy steps.
Trade-offs: Dual adds complexity (e.g., routing rules) but leverages strengths (GLM-Z1 planning + Gemma tools). Qwen3 is simpler/single-model but ~20% slower here. For longer tasks (2K+ tokens), dual pulls ahead (~30% faster) due to specialized roles. If your tasks are gen-dominant, Qwen3 edges out; for planning-heavy, dual shines. Test on Bear for exacts!
Why Dual-Model (Thinker + Doer) Can Be Faster Than a Single Model (Even Qwen3-Coder)
Intuition says two models should be slower.
But in real agentic coding workflows, dual-model is often 10–30% faster — and here’s why, backed by math, hardware, and LLM behavior.
Specialization = Efficiency (Like a Factory Assembly Line)
| Dual-Model (Thinker + Doer) | Single Model (Qwen3-Coder) |
|---|---|
| Thinker (GLM-Z1-9B): Fast planning (200 t/s prompt eval) | One model does everything |
| Doer (Gemma 2 9B): Fast execution (60 t/s gen) | Same model for planning + doing |
Pattern
The pattern is to repeat the following until done:
the planner (GLM-Z1) prepares a <think> plan, enhanced with semantically relevant context retrieved via nomic-embed-text embeddings from the codebase, and sends it to the editor (Qwen3-Coder), which outputs a clean unified diff. The diff is then sent to the executor (Gemma 2), which provides agency by using tools for file editing, API calling, and other actions.
Prompt Eval vs. Generation: The Hidden Bottleneck
| Phase | Tokens | Who Does It? | Speed |
|---|---|---|---|
| Prompt Eval | 800 | Thinker | 200 t/s → 4.0 s |
| Plan Gen | 200 | Thinker | 60 t/s → 3.3 s |
| Tool Call | 100 | Doer | 60 t/s → 1.7 s |
| Apply Edit | 300 | Doer | 60 t/s → 5.0 s |
| Switch Overhead | — | Continue.dev | ~1.0 s |
Total Dual: ~15.0 s
Qwen3-Coder 14B (Single) Prompt Eval: 800 t @ 135 t/s → 5.9 s Full Gen: 600 t @ 35 t/s → 17.1 s
Dual wins by ~8 seconds — 35% faster
Prompt Evaluation Is the Critical Factor
Prompt evaluation means reading context such as (@codebase, @docs, @history).
Larger models (14B) are slower at reading. For example, Qwen3 @ 135 t/s.
Smaller models (9B) are faster at reading. For example, GLM-Z1 @ 200 t/s.
The thinker model must read quickly.
The doer model only needs to perform simple tool calls, without having to re-read anything.
No Redundant Reasoning
When a single model handles both planning and execution, it must solve two different cognitive problems in one pass,. and LLMs are not optimized for that. This forces redundant reasoning, verbose filler, and slower, noisier outputs. Here’s why it happens:
| Task | What the Model Must Do |
| Planning | ”"”What should I do?”” → analyze code, find bugs, design steps” |
| Execution | ”"”How do I write the diff?”” → generate exact code, format, test logic” |
A single model must do both in one generation, so it:
- Reasons about the plan
- Writes the code
- Re-reasons to verify (“Did I miss anything?”)
- Adds commentary (“Let me double-check…”)
Result: ~30–50% of output is redundant reasoning. This wastes elapsed time, computing resources, bandwidth, and is a source of inaccuracy.
| Problem | Cause | Fix |
|---|---|---|
| Reasons twice | Must plan and execute in one brain | Dual-model separates roles |
| Verbose noise | CoT leakage and over-cautious | Doer only outputs action |
| Slower | More tokens to generate | Less output means faster response |
Single model is a “jack of all trades, master of none”.
Dual model uses a more specialized pair of LLMs, which is faster, more accurate, and less noisy.
| Dual | Single |
|---|---|
| Thinker: | Plans + writes code + debugs in one process Often reasons twice and is overly verbose |
| Doer: Only executes once |
200 Token Budget Breakdown
| Component | Token Count | Example |
|---|---|---|
| Task summary | ~50 | Convert login() to async/await
|
| Code context | ~80 | Key lines from current file |
| Planning steps | ~70 |
1. Replace requests get → httpx.get2. Add async/await 3. Update error handling |
| Total | ~200 | Desirable average tokens per second |
200 tokens / second is not a maximum figure; instead, it is the sweet spot for most real-world edits.
| Task Type | Thinker Output (Tokens) | Notes |
|---|---|---|
| Simple (e.g., “add print””) | 50–100 | Minimal planning |
| Typical (e.g., “refactor auth””) | 150–250 | ~200 average |
| Complex (e.g., “migrate DB schema””) | 300–500 | Multiple files, edge cases |
Dual-model dialogs stay fast and focused — no 600-token monologues.
GLM-Z1-9B repeats itself quite a bit, but this is much reduced in a dual-model setup.
When GLM-Z1-9B is used solo (e.g., in chat or agent mode):
| Trigger | Effect |
|---|---|
| No clear role boundary | Model tries to plan + execute + verify in one go |
| “Long, open-ended context” | Repeats ideas to “stay on track” |
| No feedback cutoff | Loops on uncertainty (“Did I say this already?”) |
| Default repeat_penalty ignored | Ollama defaults to 1.0 → no penalty |
The result is that 30–60% of output is repeated phrases, especially in long reasoning chains. (Mike here: My experience was much worse than that.)
RTX 3060 Capability
| Model | VRAM | Speed |
|---|---|---|
| GLM-Z1-9B | ~6 GB | 200 t/s eval |
| Gemma 2 9B | ~6 GB | 60 t/s gen |
| Dual | ~12 GB | Fits perfectly |
| Qwen3-14B | ~11 GB | 135 t/s eval 35 t/s gen |
No swapping → smooth, fast switching
Real-World Example: Refactor a Function
@codebase Refactor login() to use async/await
| Step | Dual | Qwen3 |
|---|---|---|
| 1. Read codebase | Thinker: 4s | Qwen3: 6s |
| 2. Plan | Thinker: 3s | Qwen3: 8s (verbose) |
| 3. Generate diff | Doer: 5s | Qwen3: 10s |
| Total | ~13s | ~25s |
Further Specialization: Three LLMs
Would adding a third specialized LLM improve quality or response time?
Until now we have had glm-z1 performing both the chat and the edit roles.
What if those two roles were handled by different LLMs?
Which single-role LLMs would provide optimal performance?
Yes, adding a third specialized LLM can
improve both quality and response time, but only with careful role separation
and model selection. Until now, GLM-Z1-9B has been doing two jobs (chat
and edit), which leads to cognitive overload and repetition, overly verbose
output, and slower planning.
Dual-Model
| Model | Role | Tokens | Time | Quality |
|---|---|---|---|---|
| GLM-Z1-9B | chat + edit
| ~200–300 | ~4–6s | Good reasoning, but verbose |
| Gemma 2 9B | apply
| ~200 | ~3–4s | Clean execution |
GLM-Z1 must plan and write diffs, which means redundant reasoning and yields ~40% noise.
Triad Model — Optimal Specialization
| Role | Model | Why It’s Best | Tokens | Time | Quality Gain |
|---|---|---|---|---|---|
| Planner | GLM-Z1-9B | Best-in-class reasoning/planning (78% GSM8K) | ~150 | ~2.5s | +30% clarity |
| Editor | Qwen3-Coder-14B | 89% HumanEval, native edit role, clean diffs
| ~200 | ~4.0s | +25% code accuracy |
| Executor | Gemma 2 9B | Fast apply, reliable tool calling
| ~200 | ~3.3s | +10% reliability |
Comparing Total Times
| Setup | Total Time (Typical Task) | Speed Gain |
|---|---|---|
| Single (Qwen3) | ~23s | — |
| Dual (GLM+Z1 + Gemma) | ~15–18s | +22% |
| Triad (GLM-Z1 + Qwen3 + Gemma) | ~12–14s | +40% vs single, +15% vs dual |
Triad wins due to zero role overlap, which eliminates redundant reasoning.
Why Splitting chat and chat Roles Helps
| Role | Cognitive Load | Single Model Risk | Split Benefit |
|---|---|---|---|
chat
| High-level reasoning, context | Over-explains | Planner focuses on what |
edit
| Syntax, diff precision | Misses edge cases | Editor focuses on how |
Qwen3-Coder excels at edit — trained on diffs, not monologues.
config.json for Three Models
Here is the complete config.json for three specialized models:
{
"codebase": {
"embedOnStartup": true
},
"contextProviders": [
{
"name": "codebase",
"params": {
"nFinal": 5,
"nRetrieve": 15,
"useChunks": true
}
},
{
"name": "docs",
"params": {
"urls": [
"https://mslinn.com",
"https://go.dev",
"https://ollama.com",
"https://developer.mozilla.org/en-US/docs/Web"
]
}
}
],
"embeddingsProvider": {
"model": "nomic-embed-text",
"provider": "ollama"
},
"models": [
{
"apiBase": "http://localhost:11434",
"default": true,
"name": "planner",
"model": "sllm/glm-z1-9b",
"provider": "ollama",
"roles": ["chat"],
"systemMessage": "You are the PLANNER. Output: <think>1. First step\n2. Second step</think>",
"title": "GLM-Z1-9B Planner"
},
{
"apiBase": "http://localhost:11434",
"name": "editor",
"model": "qwen3-coder:14b-instruct-q5_K_M",
"provider": "ollama",
"roles": ["edit"],
"systemMessage": "You are the EDITOR. Receive <think> plan from planner. Output only a clean unified diff. No explanation.",
"title": "Qwen3-Coder Editor"
},
{
"apiBase": "http://localhost:11434",
"name": "executor",
"model": "gemma2:9b-instruct-q5_K_M",
"provider": "ollama",
"roles": ["apply"],
"supportsTools": true,
"systemMessage": "You are the EXECUTOR. Apply the diff exactly. Return: <tool_result>success</tool_result> or <tool_result>error: details.</tool_result>",
"title": "Gemma 2 Executor"
}
],
"rules": [
{
"name": "Route Planner to Editor",
"prompt": "If output contains <think>, send to model 'editor'."
},
{
"name": "Route Editor to Executor",
"prompt": "If output contains a code diff, send to model 'executor'."
},
{
"name": "Loop on Error",
"prompt": "If <tool_result> contains 'error', send full context back to model 'planner'."
},
{
"name": "Stop on Success",
"prompt": "If <tool_result>success</tool_result> is present, end the agent loop."
}
],
"tabAutocompleteModel": {
"apiBase": "http://localhost:11434",
"model": "sllm/glm-z1-9b",
"provider": "ollama",
"title": "GLM-Z1-9B Autocomplete"
}
}
Download all three models by typing:
$ ollama pull gemma2:9b-instruct-q5_K_M
$ ollama pull qwen3-coder:14b-instruct-q5_K_M
$ ollama pull sllm/glm-z1-9b
Role of Continue.dev in the 3-Model Configuration
Continue.dev serves as the orchestration engine and execution environment for
the triad agent system. It is not a model itself, but the central
coordinator that enables the three specialized LLMs to function as a unified,
autonomous agent.
| Function | Description |
|---|---|
| Message Routing | Interprets rule prompts (e.g., If output contains |
| Context Management | Maintains conversation history, injects retrieved codebase/docs via embeddings, and prunes irrelevant context to keep prompts efficient. |
| Tool Execution | Runs file system operations (create, edit, delete) when the executor model returns a diff. Enable supportsTools for diffs.
|
| Loop Control | Implements the ReAct loop: planner → editor → executor → (on error) planner. Stops on <tool_result>success</tool_result>.
|
| User Interface | Provides the chat panel, @codebase and @docs support, and inline diff previews in Visual Studio Code.
|
The process flow looks like this:
User Input
↓
[Continue] → Routes to planner (default model)
↓
planner → <think>plan</think>
↓
[Continue] → Detects <think>, routes to editor
↓
editor → unified diff
↓
[Continue] → Detects diff, routes to executor
↓
executor → <tool_result>success</tool_result>
↓
[Continue] → Applies file changes, ends loop
Triad Agency with Ollama Desktop App for Windows
The triad configuration is not possible with the Ollama Desktop App for Windows by itself. The Ollama CLI also does not provide agency.
Ollama Desktop App Limitations
These limitations can be overcome by adding custom software, described next.
-
The app loads one model at a time for chat. Switching models requires manual selection via the UI or API, but no automated routing based on tags or rules.
-
It lacks tool calling, feedback loops, or context providers. Basic prompts work, but multi-step agency (e.g., plan-edit-execute) is not supported without additional software.
-
The app runs the Ollama API in the background, allowing external tools to use multiple models, but the app UI itself is limited to one model per session.
Agency By Adding Custom Software
Below is a little Python app that uses the Ollama API for routing. It could be enhanced to use libraries like Streamlit for UI, and orchestrate the triad via Ollama’s API and LangChain or LlamaIndex Script.
The following is a basic example script to demonstrate the concept:
import requests
def call_model(model, prompt):
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": false
}
)
return response.json()["response"]
# Task example
task = "Refactor login function to async"
# Planner
plan = call_model("sllm/glm-z1-9b", f"You are the PLANNER. {task}. Output:steps ")
print("Plan:", plan)
# Editor
diff = call_model("qwen3-coder:14b-instruct-q5_K_M", f"You are the EDITOR. {plan}. Output clean diff.")
print("Diff:", diff)
# Executor (simulate apply)
result = call_model("gemma2:9b-instruct-q5_K_M", f"You are the EXECUTOR. {diff}. Return: success ")
print("Result:", result)
This script provides basic triad agency without VS Code or Continue. Run this script by typing the following while the Ollama app is open:
python triad_agent.py .
Additional Specialized Roles for Enhanced LLM Triad Performance
Beyond the core planner, editor, executor roles, several additional roles
can significantly improve accuracy, reliability, and autonomy in complex
software engineering workflows. Each role leverages a dedicated LLM or
lightweight model to offload cognitive load, reduce errors, and enable parallel
processing.
Below are high-value roles, their purpose, and recommended models
for your Bear computer (RTX 3060 with 12 GB VRAM, Ollama).
Embed
This role is sometimes referred to as Retriever (Semantic Search).
It has already been provided for above, but the way it must be configured
for Continue.dev makes you think that is something else. The embed role
can be specified through the use of embeddingsProvider.
This role pre-filters and ranks @codebase or @docs snippets using embeddings before the planner sees them. It reduces noise and improves context relevance by 20–30%.
The best model for this is nomic-embed-text (137M, <1 GB VRAM),
which is already in use via embeddingsProvider, so no additional model is required.
Review
After implementing all of the above roles (chat, embed, edit, and apply),
The review role is the next most important to implement because it returns the highest ROI for code quality.
The review role catches bugs before execution by validating the editor’s diff for correctness (syntax, logic, security, style).
The best model for Bear is codestral:22b-instruct-q5_K_M (~10 GB
VRAM) because it excels at code review; 86% HumanEval, strong static analysis.
{
"name": "reviewer",
"model": "codestral:22b-instruct-q5_K_M",
"roles": ["review"],
"systemMessage": "You are the REVIEWER. Check the diff for bugs, style, and security. Output: <review>pass</review> or <review>issues: ...</review>"
}
Tester
The test role ensures functional correctness by generating and running unit tests for the diff.
The best model for Bear is deepseek-coder:6.7b-instruct-q5_K_M (~6 GB VRAM)
90%+ HumanEval; excellent test generation.
{
"name": "tester",
"model": "deepseek-coder:6.7b-instruct-q5_K_M",
"roles": ["test"],
"supportsTools": true,
"systemMessage": "You are the TESTER. Write pytest cases. Run them. Return: <test_result>pass</test_result> or <test_result>fail: ...</test_result>"
}
Debugger
The debug role analyzes <tool_result> errors (e.g., tracebacks) and proposes fixes.
The best models for Bear is glm-z1:9b (same as for the planner role) or qwen3-coder:14b
because these modeles provide the strong reasoning necessary for root cause analysis.
Documenter
The docs role generates docstrings, README updates, and API docs from code changes.
Recommended Enhanced Workflow (5-Model System)
User → Planner (GLM-Z1) → Retriever (nomic-embed-text)
↓
Editor (Qwen3-Coder) → Reviewer (Codestral)
↓
Executor (Gemma 2) → Tester (DeepSeek)
↓
Documenter (Gemma 2) → Done
VRAM Feasibility on Bear
| Role | Model | VRAM | Notes |
|---|---|---|---|
| Planner | glm-z1-9b
| ~6 GB | Reuse |
| Editor | qwen3-coder:14b
| ~10 GB | Core |
| Executor | gemma2:9b
| ~6 GB | Core |
| Retriever | nomic-embed-text
| <1 GB | Always on |
| Reviewer | codestral:22b
| ~10 GB | Swap in |
| Tester | deepseek-coder:6.7b
| ~6 GB | Swap in |
The core triad plus retriever required ~22 GB total VRAM, which greatly exceeds the 12 GB available. Use model offloading or sequential activation via Continue rules.
{
"name": "Route to Reviewer",
"prompt": "If diff is present, send to model 'reviewer' before executor."
}
Implementing Sequential Activation with Continue.dev
Your Bear system has an NVIDIA RTX 3060 with only 12 GB VRAM, so it cannot load all models simultaneously for a 5-model agent. Model offloading and sequential activation solves this by only loading the active model into VRAM, and unloading others. Continue.dev supports this via rules and Ollama’s memory management.
Sequential Activation via Rules
Instead of keeping all models in memory, activate one model at a time based on the current agent step. Use Continue rules to:
- Detect output tags (e.g.,
<think>,diff,<review>) - Route to the next model
- Load/unload models via Ollama API
Enable Ollama Model Management
Ollama automatically unloads inactive models after ~5 minutes. Force immediate unload with:
$ ollama unload <model-name>
Add load and unload Rules
Update your config.json with custom rules that trigger model loading and unloading.
"rules": [
{
"name": "Load Editor on Plan",
"prompt": "If output contains <think>, run shell command: ollama unload glm-z1-9b && ollama pull qwen3-coder:14b-instruct-q5_K_M"
},
{
"name": "Load Executor on Diff",
"prompt": "If output contains ```diff, run shell command: ollama unload qwen3-coder:14b-instruct-q5_K_M && ollama pull gemma2:9b-instruct-q5_K_M"
},
{
"name": "Load Reviewer on Request",
"prompt": "If user says 'review', run shell command: ollama unload gemma2:9b-instruct-q5_K_M && ollama pull codestral:22b-instruct-q5_K_M"
},
{
"name": "Unload All on Done",
"prompt": "If <tool_result>success</tool_result> is present, run shell command: ollama unload"
}
]
Note: ollama unload without args unloads all inactive models.
Use shell Tool in Executor
Enable shell commands in the executor model:
{
"name": "executor",
"model": "gemma2:9b-instruct-q5_K_M",
"supportsTools": true,
"systemMessage": "You are the EXECUTOR. You can run shell commands. Use `ollama unload` to free VRAM."
}
Example Workflow (5-Model Safe)
User: "@codebase Refactor login.py"
→ Planner (GLM-Z1) loads → `<think>plan</think>`
→ Rule: `unload glm-z1` + `pull qwen3-coder`
Editor (Qwen3-Coder) loads → `diff`
→ Rule: `unload qwen3-coder` + `pull codestral`
Reviewer (Codestral) loads → `<review>pass</review>`
→ Rule: `unload codestral` + `pull gemma2`
Executor (Gemma2) loads → Applies diff → `<tool_result>success</tool_result>`
→ Rule: `ollama unload` (all done)
VRAM Usage Never exceeds ~10 GB because only one model is active at all times.
Alternative: Manual Offloading (Simpler)
Add a pre-step prompt:
Before starting, run: ollama unload
Or use a VS Code task:
{
"label": "Unload Ollama Models",
"type": "shell",
"command": "ollama unload"
}
To run, press CTRL+SHIFT+P then
type Tasks: Run Task.
Coming Soon: Auto-Load via Continue Hooks
Continue.dev is adding pre/post hooks to v0.9. Example:
"hooks": {
"beforeModel": "ollama unload && ollama pull "
}
Sequential Activation Summary
| Method | Pros | Cons |
|---|---|---|
Rule-based ollama unload/pull
| Fully automated | Slight delay (~5–10s per switch) |
| Manual unload | Fast, simple | Requires user action |
| Sequential activation | Fits 12 GB VRAM | Best for 5+ models |
Rule-based sequential activation for the 5-model agent will run smoothly on Bear.
Distributed Processing
Let’s pretend that all 5 LLMs are remote, provided by 5 difference vendors. Do the LLMs converse point-to-point, or is Continue.dev a potential bottleneck?
In the hypothetical scenario where all five LLMs are remote and provided by five different vendors, the LLMs do not converse point-to-point. Continue.dev acts as a hub and the LLMs are spokes.
Data Volume Per Agent Step (Remote LLMs)
| Step | Tokens | Approx. Size (UTF-8) | Notes |
|---|---|---|---|
| Planner → Editor | 3,000–8,000 | 12–32 KB | Plan + retrieved code snippets |
| Editor → Executor | 1,000–5,000 | 4–20 KB | Unified diff + context |
| Executor → Planner (error) | 500–2,000 | 2–8 KB | Tool result + traceback |
| Average per step | ~4,000 tokens | ~16 KB |
Total per full task (3 steps): ~50 KB
Bandwidth Needed for “Quick” Response
| Goal | Max Allowable Transfer Time | Required Bandwidth |
|---|---|---|
| Instant feel (<1s per step) | 0.5s | ~250 Kbps |
| Fast (1–2s per step) | 1.0s | ~128 Kbps |
| Acceptable (2–3s per step) | 2.0s | ~64 Kbps |
Minimum: 64 Kbps Recommended: ≥ 1 Mbps (for stability, retries, and parallel calls)
References
- Build a LOCAL AI Coding Assistant: Qwen3 + Ollama + Continue.dev
- AI Code Generation: Ollama, VSCode and Continue.dev
- Multi-Agent and Multi-LLM Architecture: Complete Guide for 2025
- LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead
- Understanding LLM-Based Agents and their Multi-Agent Architecture
- LLM Multi-Agent Systems
- Integrating Multiple AI Models in VSCode: Managing Prompt Routing and Responses
- How To Integrate a Local LLM Into VS Code
- Adding AI to VS Code using Continue and Generative APIs
- Run LLMs Locally with Continue VS Code Extension
- Continue.dev Blog
- Continue Docs - Configuration Reference
- Customization Guide: Using Different Models
- An entirely open-source AI code assistant inside your editor
- Level Up Your Coding: Ollama and Continue.dev as a Local GitHub Copilot Alternative
- Run LLMs Locally with Continue VS Code Extension
Critique
I asked Gemini about this article, which was written with heavy input from Grok:
Please provide a comprehensive critique of this article covering technical architecture, benchmarks, implementation feasibility, and practical implications.
The response sounds a bit like one LLM talking trash about another:
Critical Issues
- Unverified benchmarks: Core claims sourced from “Grok” without validation or methodology disclosure. The author admits not verifying them.
- Speed claims oversimplified: The 10–30% speedup ignores model switching overhead (2–4s per switch, not 0.5–1s), context retrieval, and Ollama loading penalties.
- Theoretical gaps: The “assembly line” analogy breaks down (sequential, not parallel); role specialization assumes clean boundaries that don’t exist in transformers.
- 5-model system impractical: Sequential activation adds 8–20s of overhead just for model loading/unloading. On RTX 3060, this makes it slower than a single model.
- Implementation gaps: The Python example is trivial; production code needs error handling, logging, security, and context passing.
What It Gets Right
- Multi-model decomposition is conceptually sound
- ReAct loops are proven
- Continue.dev is real and usable
- Embedding-based context retrieval is practical
Bottom Line
Interesting idea, but presented as more validated and practical than the evidence supports. The 10–30% speedup is unproven; the 5-model system is likely slower in practice than claimed due to loading overhead. Before adopting this, verify timing on your own hardware with realistic task sets.
The following articles were written sequentially.