
Early Draft: Multi-LLM Agent Pipelines

Published 2025-11-04.
Time to read: 12 minutes.

This page is part of the llm collection.

This article follows GLM-Z1-9B-0414 on the Windows Ollama app, which discussed how to set up one of the currently leading LLMs for coding with Ollama.

This is an early draft of a work in progress. Grok was the source of most of the information, and as usual it was extremely optimistic and sometimes not very realistic. This article will be updated as I work through the material.

I asked Gemini to critique the most recent version of the article. The critique, which seems a bit like one LLM talking trash about another, follows the article below.

The Article Starts Here

We begin by discussing how using two specialized LLMs with the Continue.dev Visual Studio Code Extension yields faster and better responses than just using one model. We then discuss how using three models yields even faster and better responses. Next we discuss using Ollama Desktop and custom code instead of Visual Studio and Continue.dev. Orchestration of 5+ LLMs without overloading a PC is discussed next. Finally we compute bandwidth requirements for quick response when 5 remote LLMs are configured.

I am about to start testing the setups described in this article. The article will be updated as work progresses and I am able to report results.

Overview

You can make GLM-Z1-9B (a thinking model) and gemma2:9b-instruct-q5_K_M (a doing model) cooperate in a dual-model agent pipeline using the Visual Studio Continue.dev extension and Ollama, with GLM-Z1 as the planner and reasoner and Gemma 2 as the executor.

The pattern is to repeat the following until done: the thinker (GLM-Z1) prepares a <think> plan, enhanced with semantically relevant context retrieved via nomic-embed-text embeddings from the codebase, and sends it to the doer (Gemma 2), which provides agency by using tools for file editing, API calling, and other actions.

Process Flow

Step Component Action
1 GLM-Z1-9B Generates a <think> plan and a tool call request
2 Continue.dev Parses the plan and sends to Gemma 2
3 Gemma 2 Executes tools (edit file, run command)
4 Continue.dev Feedback loop: GLM-Z1 refines

The above is a ReAct-style feedback loop (Reason and Act). Tool results (e.g., file edit success/failure) are automatically appended as context items to the next prompt. The thinker model considers this and decides to refine, re-plan, or stop, all without any human interaction.

The loop continues until the task completes or a maximum iteration limit is hit. The default limit is between 5–10 turns, configurable in config.json as "experimental": {"maxAgentIterations": 10}.
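
The same bounded loop can be sketched outside Continue.dev, directly against the Ollama HTTP API. This is only an illustration of the ReAct pattern under the configuration above (the model names and the 10-turn cap are taken from it), not Continue.dev's internal implementation:

Python
import requests

OLLAMA = "http://localhost:11434/api/generate"
MAX_ITERATIONS = 10  # mirrors "experimental": {"maxAgentIterations": 10}


def generate(model: str, prompt: str) -> str:
    """One non-streaming call to the local Ollama generate endpoint."""
    r = requests.post(OLLAMA, json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]


def react_loop(task: str) -> str:
    """Thinker plans, doer acts, and tool results feed the next turn."""
    context = task
    for _ in range(MAX_ITERATIONS):
        plan = generate(
            "sllm/glm-z1-9b",
            f"Plan the next step for: {context}\n"
            "Output <think>plan</think> then <tool_call>{...}</tool_call>, "
            "or a final answer if the task is complete.",
        )
        if "<tool_call>" not in plan:
            return plan  # the thinker decided the task is done
        result = generate(
            "gemma2:9b-instruct-q5_K_M",
            f"Execute this tool call and report <tool_result>:\n{plan}",
        )
        # Only the tool result (not the thinker's own reasoning) is fed back.
        context = f"{task}\nLatest tool result: {result}"
    return context


if __name__ == "__main__":
    print(react_loop("Create weather.py that fetches weather from an API"))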

Dual and triad model configurations rely on context providers like @codebase and @docs. Embeddings ensure the thinker (e.g., GLM-Z1) gets precise, relevant chunks (e.g., 15 snippets via nRetrieve=15), not the full repository. This keeps prompt evaluation fast (150–200 t/s) and reasoning focused.

Setup: config.json for Dual-Model Agent

Note that Continue ignores a per-model systemMessage when supportsTools or roles is present on that model.
config.json
{
  "codebase": {
    "embedOnStartup": true
  },

  "contextProviders": [
    {
      "name": "codebase",
      "params": {
        "nFinal": 5,
        "nRetrieve": 15,
        "useChunks": true
      }
    },
    {
      "name": "docs",
      "params": {
        "urls": [
          "https://mslinn.com",
          "https://go.dev",
          "https://ollama.com",
          "https://developer.mozilla.org/en-US/docs/Web"
        ]
      }
    }
  ],

  "embeddingsProvider": {
    "model": "nomic-embed-text",
    "provider": "ollama"
  },

  "models": [
    {
      "apiBase": "http://localhost:11434",
      "default": true,
      "model": "sllm/glm-z1-9b",
      "name": "thinker",
      "provider": "ollama",
      "roles": ["chat", "edit"],
      "title": "GLM-Z1-9B (Thinker)"
    },
    {
      "apiBase": "http://localhost:11434",
      "model": "gemma2:9b-instruct-q5_K_M",
      "name": "doer",
      "provider": "ollama",
      "roles": ["apply"],
      "supportsTools": true,
      "title": "Gemma 2 9B (Doer)"
    }
  ],

  "rules": [
    {
      "model": "thinker",
      "name": "Thinker Plan",
      "prompt": "Always output: <think>plan</think>
        then <tool_call>{...}</tool_call>"
    },
    {
      "model": "doer",
      "name": "Doer Execute",
      "prompt": "Execute <tool_call> exactly.
        Return: <tool_result>output</tool_result>"
    },
    {
      "name": "Auto-Route",
      "prompt": "If <tool_call> is present,
        route to model 'doer'."
    }
  ],

  "systemMessage": "You are a dual-agent system.
    GLM-Z1-9B (thinker) plans with <think> and proposes edits.
    Gemma 2 (doer) executes with <tool_call>.",

  "tabAutocompleteModel": {
    "apiBase": "http://localhost:11434",
    "model": "sllm/glm-z1-9b",
    "provider": "ollama",
    "title": "GLM-Z1-9B Autocomplete"
  }
}

How to Use in Continue.dev

  1. Pull both models:

    $ ollama pull sllm/glm-z1-9b
    $ ollama pull gemma2:9b-instruct-q5_K_M
    
  2. Save the configuration and reload it. I do not believe it is necessary to reload VS Code with CTRL+R.

  3. Check to make sure the Continue.dev agent mode is active, then send this prompt:

    @codebase Create a Python script that fetches weather from API and saves to file.
    
  4. GLM-Z1 responds:

    <think>
    1. Create weather.py
    2. Use requests.get()
    3. Save JSON to data.json
    </think>
    <tool_call>
    {"action": "create_file", "path": "weather.py", "content": "...code..."}
    </tool_call>
    
  5. Press Continue to send to Gemma 2, which creates weather.py as requested.
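
The generated file might look roughly like the following. This is only a plausible sketch of the doer's output, following the plan above (requests.get(), save JSON to data.json); the wttr.in endpoint and the default city are assumptions used purely for illustration.

weather.py
import json
import requests


def fetch_weather(city: str = "Vancouver") -> dict:
    """Fetch current weather as JSON; wttr.in is only an example endpoint."""
    response = requests.get(f"https://wttr.in/{city}", params={"format": "j1"}, timeout=10)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Save the raw JSON payload to data.json, as the plan specifies.
    with open("data.json", "w") as f:
        json.dump(fetch_weather(), f, indent=2)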

Performance on Bear (RTX 3060 12GB)

Model VRAM Speed
GLM-Z1-9B ~6.5 GB 70 t/s
Gemma 2 9B ~6.0 GB 55 t/s
Total ~12.5 GB Fits in 12GB VRAM (swap OK)

Auto-Routing with Rules

Add to rules:

{
  "name": "Route to Doer",
  "prompt": "If <tool_call> is present, send to model 'doer'."
}

Continue will automatically switch models to run the pipeline. This is not a streaming setup; it is a one-shot batch job. The steps are:

  1. Save the dual-model config.json

  2. Download the gemma2:9b-instruct-q5_K_M model.

    $ ollama pull gemma2:9b-instruct-q5_K_M
    
  3. Reload Visual Studio Code with CTRL+R.

  4. Type this prompt:

    @codebase Make a hello world script
    → File created automatically.
    

You now have combined two models: a thinker and a doer.

Is A Pipeline Slower Than Using Only One Model?

We looked at other LLMs earlier: DeepSeek-Coder-V2, Qwen3-Coder, Codestral, and CodeLlama. Are any of these close to GLM-Z1’s coding ability while also being capable of significant agency?

Grok provided all of the following information and I have not verified any of it.

GLM-Z1-9B-0414 is a 9B-parameter model tuned for reasoning (e.g., math/logic/code planning) with strong coding performance (~75–80% on HumanEval, ~65–70% on LiveCodeBench) but limited native agency (no built-in tool calling; relies on prompt-based planning via <think> tags, making it ~55–60% on BFCL for function calling). From the models we discussed earlier (DeepSeek-Coder-V2, Qwen3-Coder, Codestral, CodeLlama), Qwen3-Coder comes closest overall, matching or exceeding GLM-Z1's coding while offering significantly better agency through native tool support and agentic tuning. DeepSeek-Coder-V2 is a strong runner-up for coding but lags in tool calling. Codestral is balanced but not as agentic, and CodeLlama trails in both (outdated 2023 model with weak tool support).

Benchmarks focus on coding (HumanEval: code generation accuracy; LiveCodeBench: real-world algorithmic coding) and agency (BFCL: function calling/tool use; ToolBench: agentic task completion). Scores are from 2025 evals (e.g., LiveCodeBench v5, BFCL v3); higher is better. GLM-Z1 baselines are from community/THUDM reports. “Close to GLM-Z1” means ±5% on coding; “significant agency” means ≥70% on BFCL/ToolBench.

Model Variant HumanEval (%) LiveCodeBench (%) BFCL (Tool Calling, %) ToolBench (Agency, %) Close to GLM-Z1 Coding? Significant Agency? Notes
GLM-Z1-9B (Baseline) 75–80 65–70 55–60 60–65 - No Excellent reasoning/planning but prompt-only agency; no native tools. Strong on math-integrated code.
DeepSeek-Coder-V2 (Lite 16B) 90.2 43.4 65–70 70–75 No (higher coding) Yes Tops coding (90%+ HumanEval) but LiveCodeBench lags GLM-Z1. Strong agency via MoE efficiency for tool chains; good for multi-step agents.
Qwen3-Coder (14B–30B) 89.3 70.7 75–80 75–80 Yes (matches) Yes Closest match: ~89% HumanEval (equal/superior GLM-Z1), 70%+ LiveCodeBench. Excellent agency with native tool calling and agentic tuning (e.g., BFCL 75%+); supports 128K+ contexts for multi-file agents.
Codestral 22B 71.4–86.6 37.9 65–70 65–70 Yes (similar HumanEval) Moderate Solid coding (71%+ HumanEval, but lower LiveCodeBench). Good agency via prompt-based tools (e.g., 65% BFCL), but not as integrated as Qwen3; strong for multi-language but weaker planning.
CodeLlama 13B/34B 53–67.8 ~30–40 <50 <50 No (lower) No Outdated (2023); 53–68% HumanEval but poor LiveCodeBench (~30–40%). Minimal agency (no native tools; prompt-only ~40–50% BFCL); fine for basic gen but lags in real-world/agents.

Dual- vs. Single-Model Summary

Grok provided all of the following information and I have not verified any of it.

Closest Overall: Qwen3-Coder – Matches GLM-Z1’s coding (89% HumanEval, 70% LiveCodeBench) while providing significant agency (75%+ BFCL/ToolBench via native tools and agentic fine-tuning). It’s the best hybrid for autonomous tasks like multi-file refactors or tool chains.

DeepSeek-Coder-V2: Superior coding (90% HumanEval) but agency is good-not-great (70% ToolBench); better for pure generation than full agents.

Codestral: Comparable coding (71% HumanEval) with moderate agency (65% BFCL); viable but not “significant.”

CodeLlama: Falls short in both (~60% HumanEval, <50% agency); not competitive in 2025.

If agency is key, swap to Qwen3-Coder-14B for a drop-in upgrade; similar size/speed to GLM-Z1 but with better tools.

Comparing Execution Times

Match up two models (GLM-Z1-9B for thinking, and Gemma 2 9B for doing) vs. Qwen3-Coder 14B all by itself.

Grok provided all of the following information and I have not verified any of it.

To compare “total time taken,” I assume a typical agentic coding task (e.g., “@codebase Refactor a 500-token Python function with multi-step planning and file edit,” ~800 tokens input, ~400 tokens output). This includes prompt evaluation (processing input/context) + generation (output tokens). Metrics are based on 2025 benchmarks for Ollama on RTX 3060 12 GB VRAM (Q5_K_M quantization, 8K context), averaged from community reports (e.g., Reddit/LocalLLaMA, Hugging Face evals). Dual-model adds ~0.5–1s overhead per switch (Continue.dev routing).

Estimated Speeds on RTX 3060 (Ollama, Q5_K_M, 8K Context)

Setup Prompt Eval Generation Total Time (Typical Task)
Dual: GLM-Z1-9B (thinker) + Gemma 2 9B (doer) ~175 t/s ~55 t/s ~15–20 s
Single: Qwen3-Coder 14B ~135 t/s ~35 t/s ~18–25 s

Summary

Grok provided all of the following information and I have not verified any of it.

Dual-Model Wins on Time: ~15–20 s vs. 18–25 s for Qwen3-Coder (10–25% faster overall). The thinker’s fast prompt eval (175 t/s) and doer’s solid gen (55 t/s) offset switch latency for tasks with clear plan/exec splits. Qwen3 is more consistent but slower on gen-heavy steps.

Trade-offs: Dual adds complexity (e.g., routing rules) but leverages strengths (GLM-Z1 planning + Gemma tools). Qwen3 is simpler/single-model but ~20% slower here. For longer tasks (2K+ tokens), dual pulls ahead (~30% faster) due to specialized roles. If your tasks are gen-dominant, Qwen3 edges out; for planning-heavy, dual shines. Test on Bear for exacts!

Why Dual-Model (Thinker + Doer) Can Be Faster Than a Single Model (Even Qwen3-Coder)

Intuition says two models should be slower.

But in real agentic coding workflows, dual-model is often 10–30% faster — and here’s why, backed by math, hardware, and LLM behavior.

Specialization = Efficiency (Like a Factory Assembly Line)

Dual-Model (Thinker + Doer) Single Model (Qwen3-Coder)
Thinker (GLM-Z1-9B): Fast planning (200 t/s prompt eval) One model does everything
Doer (Gemma 2 9B): Fast execution (60 t/s gen) Same model for planning + doing

Pattern

The pattern is to repeat the following until done: the planner (GLM-Z1) prepares a <think> plan, enhanced with semantically relevant context retrieved via nomic-embed-text embeddings from the codebase, and sends it to the editor (Qwen3-Coder), which outputs a clean unified diff. The diff is then sent to the executor (Gemma 2), which provides agency by using tools for file editing, API calling, and other actions.

Prompt Eval vs. Generation: The Hidden Bottleneck

Phase Tokens Who Does It? Speed
Prompt Eval 800 Thinker 200 t/s → 4.0 s
Plan Gen 200 Thinker 60 t/s → 3.3 s
Tool Call 100 Doer 60 t/s → 1.7 s
Apply Edit 300 Doer 60 t/s → 5.0 s
Switch Overhead Continue.dev ~1.0 s

Total Dual: ~15.0 s

Qwen3-Coder 14B (Single)
Prompt Eval: 800 t @ 135 t/s → 5.9 s
Full Gen: 600 t @ 35 t/s → 17.1 s

Total Single: ~23.0 s

Dual wins by ~8 seconds — 35% faster
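
The totals above follow directly from the per-phase token counts and throughput estimates; here is a quick check of the arithmetic. All figures are the article's unverified estimates, not measurements.

Python
# Reproduce the timing estimates from token counts and throughput.

def seconds(tokens: float, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

dual = (
    seconds(800, 200)   # thinker: prompt eval
    + seconds(200, 60)  # thinker: plan generation
    + seconds(100, 60)  # doer: tool call
    + seconds(300, 60)  # doer: apply edit
    + 1.0               # Continue.dev switch overhead
)

single = seconds(800, 135) + seconds(600, 35)  # Qwen3-Coder: eval + full generation

print(f"dual ~ {dual:.1f} s, single ~ {single:.1f} s")
print(f"savings ~ {single - dual:.1f} s ({(single - dual) / single:.0%} faster)")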

Prompt Evaluation Is the Critical Factor

Prompt evaluation means reading context such as (@codebase, @docs, @history).

Larger models (14B) are slower at reading. For example, Qwen3 @ 135 t/s.

Smaller models (9B) are faster at reading. For example, GLM-Z1 @ 200 t/s.

The thinker model must read quickly.

The doer model only needs to perform simple tool calls, without having to re-read anything.

No Redundant Reasoning

When a single model handles both planning and execution, it must solve two different cognitive problems in one pass, and LLMs are not optimized for that. This forces redundant reasoning, verbose filler, and slower, noisier outputs. Here’s why it happens:

Task What the Model Must Do
Planning “What should I do?” → analyze code, find bugs, design steps
Execution “How do I write the diff?” → generate exact code, format, test logic

A single model must do both in one generation, so it:

  1. Reasons about the plan
  2. Writes the code
  3. Re-reasons to verify (“Did I miss anything?”)
  4. Adds commentary (“Let me double-check…”)

Result: ~30–50% of the output is redundant reasoning. This wastes elapsed time, computing resources, and bandwidth, and it is a source of inaccuracy.

Problem Cause Fix
Reasons twice Must plan and execute in one brain Dual-model separates roles
Verbose noise CoT leakage and over-caution Doer only outputs the action
Slower More tokens to generate Less output means faster response

Single model is a “jack of all trades, master of none”.

Dual model uses a more specialized pair of LLMs, which is faster, more accurate, and less noisy.

Dual Single
Thinker: Plan consumes ~200 tokens Plans + writes code + debugs in one process; often reasons twice and is overly verbose
Doer: Only executes once

200 Token Budget Breakdown

Component Token Count Example
Task summary ~50 Convert login() to async/await
Code context ~80 Key lines from current file
Planning steps ~70 1. Replace requests.get → httpx.get
2. Add async/await
3. Update error handling
Total ~200 Typical plan budget for a real-world edit

200 tokens is not a maximum figure; instead, it is the sweet spot for the thinker’s plan in most real-world edits.

Task Type Thinker Output (Tokens) Notes
Simple (e.g., “add print”) 50–100 Minimal planning
Typical (e.g., “refactor auth”) 150–250 ~200 average
Complex (e.g., “migrate DB schema”) 300–500 Multiple files, edge cases

Dual-model dialogs stay fast and focused — no 600-token monologues.

GLM-Z1-9B repeats itself quite a bit, but this is much reduced in a dual-model setup.

When GLM-Z1-9B is used solo (e.g., in chat or agent mode):

Trigger Effect
No clear role boundary Model tries to plan + execute + verify in one go
“Long, open-ended context” Repeats ideas to “stay on track”
No feedback cutoff Loops on uncertainty (“Did I say this already?”)
Default repeat_penalty ignored Ollama defaults to 1.0 → no penalty

The result is that 30–60% of output is repeated phrases, especially in long reasoning chains. (Mike here: My experience was much worse than that.)

For dual model: After the thinker finishes, the old reasoning is pruned. The thinker never sees its own output, so there is no self-echo.
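
In a custom pipeline, this pruning amounts to stripping the <think> block from the thinker's output before anything is fed back. A minimal sketch, assuming the tag format defined in the rules above:

Python
import re


def prune_reasoning(thinker_output: str) -> str:
    """Drop the <think>...</think> block so the thinker never sees its own reasoning."""
    return re.sub(r"<think>.*?</think>", "", thinker_output, flags=re.DOTALL).strip()


raw = "<think>1. Open file\n2. Edit function</think>\n<tool_call>{\"action\": \"edit\"}</tool_call>"
print(prune_reasoning(raw))  # only the tool call survives into the next prompt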

RTX 3060 Capability

Model VRAM Speed
GLM-Z1-9B ~6 GB 200 t/s eval
Gemma 2 9B ~6 GB 60 t/s gen
Dual ~12 GB Fits perfectly
Qwen3-14B ~11 GB 135 t/s eval 35 t/s gen

No swapping → smooth, fast switching

Real-World Example: Refactor a Function

@codebase Refactor login() to use async/await
Step Dual Qwen3
1. Read codebase Thinker: 4s Qwen3: 6s
2. Plan Thinker: 3s Qwen3: 8s (verbose)
3. Generate diff Doer: 5s Qwen3: 10s
Total ~13s ~25s

Further Specialization: Three LLMs

Would adding a third specialized LLM improve quality or response time? Until now we have had glm-z1 performing both the chat and the edit roles. What if those two roles were handled by different LLMs? Which single-role LLMs would provide optimal performance?

Yes, adding a third specialized LLM can improve both quality and response time, but only with careful role separation and model selection. Until now, GLM-Z1-9B has been doing two jobs (chat and edit), which leads to cognitive overload and repetition, overly verbose output, and slower planning.

Dual-Model

Model Role Tokens Time Quality
GLM-Z1-9B chat + edit ~200–300 ~4–6s Good reasoning, but verbose
Gemma 2 9B apply ~200 ~3–4s Clean execution

GLM-Z1 must plan and write diffs, which forces redundant reasoning and yields ~40% noise.

Triad Model — Optimal Specialization

Role Model Why It’s Best Tokens Time Quality Gain
Planner GLM-Z1-9B Best-in-class reasoning/planning (78% GSM8K) ~150 ~2.5s +30% clarity
Editor Qwen3-Coder-14B 89% HumanEval, native edit role, clean diffs ~200 ~4.0s +25% code accuracy
Executor Gemma 2 9B Fast apply, reliable tool calling ~200 ~3.3s +10% reliability

Comparing Total Times

Setup Total Time (Typical Task) Speed Gain
Single (Qwen3) ~23s
Dual (GLM-Z1 + Gemma) ~15–18s +22%
Triad (GLM-Z1 + Qwen3 + Gemma) ~12–14s +40% vs single, +15% vs dual

Triad wins due to zero role overlap, which eliminates redundant reasoning.

Why Splitting chat and edit Roles Helps

Role Cognitive Load Single Model Risk Split Benefit
chat High-level reasoning, context Over-explains Planner focuses on what
edit Syntax, diff precision Misses edge cases Editor focuses on how

Qwen3-Coder excels at edit — trained on diffs, not monologues.

config.json for Three Models

Here is the complete config.json for three specialized models:

Complete config.json
{
  "codebase": {
    "embedOnStartup": true
  },

  "contextProviders": [
    {
      "name": "codebase",
      "params": {
        "nFinal": 5,
        "nRetrieve": 15,
        "useChunks": true
      }
    },
    {
      "name": "docs",
      "params": {
        "urls": [
          "https://mslinn.com",
          "https://go.dev",
          "https://ollama.com",
          "https://developer.mozilla.org/en-US/docs/Web"
        ]
      }
    }
  ],

  "embeddingsProvider": {
    "model": "nomic-embed-text",
    "provider": "ollama"
  },

  "models": [
    {
      "apiBase": "http://localhost:11434",
      "default": true,
      "name": "planner",
      "model": "sllm/glm-z1-9b",
      "provider": "ollama",
      "roles": ["chat"],
      "systemMessage": "You are the PLANNER. Output: <think>1. First step\n2. Second step</think>",
      "title": "GLM-Z1-9B Planner"
    },
    {
      "apiBase": "http://localhost:11434",
      "name": "editor",
      "model": "qwen3-coder:14b-instruct-q5_K_M",
      "provider": "ollama",
      "roles": ["edit"],
      "systemMessage": "You are the EDITOR. Receive <think> plan from planner. Output only a clean unified diff. No explanation.",
      "title": "Qwen3-Coder Editor"
    },
    {
      "apiBase": "http://localhost:11434",
      "name": "executor",
      "model": "gemma2:9b-instruct-q5_K_M",
      "provider": "ollama",
      "roles": ["apply"],
      "supportsTools": true,
      "systemMessage": "You are the EXECUTOR. Apply the diff exactly. Return: <tool_result>success</tool_result> or <tool_result>error: details.</tool_result>",
      "title": "Gemma 2 Executor"
    }
  ],

"rules": [
    {
      "name": "Route Planner to Editor",
      "prompt": "If output contains <think>, send to model 'editor'."
    },
    {
      "name": "Route Editor to Executor",
      "prompt": "If output contains a code diff, send to model 'executor'."
    },
    {
      "name": "Loop on Error",
      "prompt": "If <tool_result> contains 'error', send full context back to model 'planner'."
    },
    {
      "name": "Stop on Success",
      "prompt": "If <tool_result>success</tool_result> is present, end the agent loop."
    }
  ],

  "tabAutocompleteModel": {
    "apiBase": "http://localhost:11434",
    "model": "sllm/glm-z1-9b",
    "provider": "ollama",
    "title": "GLM-Z1-9B Autocomplete"
  }
}

Download all three models by typing:

Shell
$ ollama pull gemma2:9b-instruct-q5_K_M
$ ollama pull qwen3-coder:14b-instruct-q5_K_M
$ ollama pull sllm/glm-z1-9b

Role of Continue.dev in the 3-Model Configuration

Continue.dev serves as the orchestration engine and execution environment for the triad agent system. It is not a model itself, but the central coordinator that enables the three specialized LLMs to function as a unified, autonomous agent.

Function Description
Message Routing Interprets rule prompts (e.g., If output contains <think>, send to model 'editor') and automatically forwards outputs to the correct model.
Context Management Maintains conversation history, injects retrieved codebase/docs via embeddings, and prunes irrelevant context to keep prompts efficient.
Tool Execution Runs file system operations (create, edit, delete) when the executor model returns a diff; requires supportsTools on the executor model.
Loop Control Implements the ReAct loop: planner → editor → executor → (on error) planner. Stops on <tool_result>success</tool_result>.
User Interface Provides the chat panel, @codebase and @docs support, and inline diff previews in Visual Studio Code.

The process flow looks like this:

User Input
   ↓
[Continue] → Routes to planner (default model)
   ↓
planner → <think>plan</think>
   ↓
[Continue] → Detects <think>, routes to editor
   ↓
editor → unified diff
   ↓
[Continue] → Detects diff, routes to executor
   ↓
executor → <tool_result>success</tool_result>
   ↓
[Continue] → Applies file changes, ends loop

Triad Agency with Ollama Desktop App for Windows

The triad configuration is not possible with the Ollama Desktop App for Windows by itself. The Ollama CLI also does not provide agency.

Ollama Desktop App Limitations

These limitations can be overcome by adding custom software, described next.

  • The app loads one model at a time for chat. Switching models requires manual selection via the UI or API, but no automated routing based on tags or rules.

  • It lacks tool calling, feedback loops, or context providers. Basic prompts work, but multi-step agency (e.g., plan-edit-execute) is not supported without additional software.

  • The app runs the Ollama API in the background, allowing external tools to use multiple models, but the app UI itself is limited to one model per session.

Agency By Adding Custom Software

Below is a little Python app that uses the Ollama API for routing. It could be enhanced with a library like Streamlit for a UI, and could orchestrate the triad via Ollama’s API with LangChain or LlamaIndex.

The following is a basic example script to demonstrate the concept:

triad_agent.py
import requests

def call_model(model, prompt):
    """Send one non-streaming request to the local Ollama API and return the text."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False  # must be a Python bool; it is serialized to JSON false
        }
    )
    response.raise_for_status()
    return response.json()["response"]

# Task example
task = "Refactor login function to async"

# Planner
plan = call_model("sllm/glm-z1-9b", f"You are the PLANNER. {task}. Output:steps")
print("Plan:", plan)

# Editor
diff = call_model("qwen3-coder:14b-instruct-q5_K_M", f"You are the EDITOR. {plan}. Output clean diff.")
print("Diff:", diff)

# Executor (simulate apply)
result = call_model("gemma2:9b-instruct-q5_K_M", f"You are the EXECUTOR. {diff}. Return: success")
print("Result:", result)

This script provides basic triad agency without VS Code or Continue. Run this script by typing the following while the Ollama app is open:

Shell
$ python triad_agent.py

Additional Specialized Roles for Enhanced LLM Triad Performance

Beyond the core planner, editor, executor roles, several additional roles can significantly improve accuracy, reliability, and autonomy in complex software engineering workflows. Each role leverages a dedicated LLM or lightweight model to offload cognitive load, reduce errors, and enable parallel processing.

Below are high-value roles, their purpose, and recommended models for your Bear computer (RTX 3060 with 12 GB VRAM, Ollama).

Embed

This role is sometimes referred to as Retriever (Semantic Search). It has already been provided for above, although the way it is configured in Continue.dev makes it easy to overlook: the embed role is specified through the embeddingsProvider setting.

This role pre-filters and ranks @codebase or @docs snippets using embeddings before the planner sees them. It reduces noise and improves context relevance by 20–30%.

The best model for this is nomic-embed-text (137M, <1 GB VRAM), which is already in use via embeddingsProvider, so no additional model is required.
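
For a custom pipeline outside Continue.dev, the same retrieval step can be approximated with the Ollama embeddings endpoint plus cosine similarity. The snippets below are hypothetical stand-ins for @codebase chunks, and the request uses the legacy single-prompt /api/embeddings form; this is a sketch of the idea, not Continue.dev's retriever.

Python
import math
import requests

OLLAMA_EMBED = "http://localhost:11434/api/embeddings"


def embed(text: str) -> list[float]:
    """Embed one string with nomic-embed-text via the Ollama embeddings endpoint."""
    r = requests.post(OLLAMA_EMBED, json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical code snippets standing in for @codebase chunks.
snippets = [
    "def login(user, password): ...",
    "def fetch_weather(city): ...",
    "class SessionStore: ...",
]

query = embed("refactor the login flow to async/await")
ranked = sorted(snippets, key=lambda s: cosine(query, embed(s)), reverse=True)
print(ranked[:2])  # only the top-ranked snippets reach the planner, analogous to nFinal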

Review

After implementing all of the above roles (chat, embed, edit, and apply), the review role is the next most important to implement because it returns the highest ROI for code quality.

The review role catches bugs before execution by validating the editor’s diff for correctness (syntax, logic, security, style).

The best model for Bear is codestral:22b-instruct-q5_K_M (~10 GB VRAM) because it excels at code review (86% HumanEval, strong static analysis).

config.json
{
  "name": "reviewer",
  "model": "codestral:22b-instruct-q5_K_M",
  "roles": ["review"],
  "systemMessage": "You are the REVIEWER. Check the diff for bugs, style, and security. Output: <review>pass</review> or <review>issues: ...</review>"
}
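
In a custom pipeline, the review role slots in between the editor and the executor. Here is a hedged sketch that reuses the call_model() helper from the earlier triad_agent.py script; the retry count and prompts are illustrative only.

Python
# Sketch: insert a review step between the editor and the executor,
# reusing call_model() from triad_agent.py above.

def review_then_execute(diff: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        review = call_model(
            "codestral:22b-instruct-q5_K_M",
            "You are the REVIEWER. Check this diff for bugs, style, and security.\n"
            "Output <review>pass</review> or <review>issues: ...</review>\n\n" + diff,
        )
        if "<review>pass</review>" in review:
            return call_model(
                "gemma2:9b-instruct-q5_K_M",
                "You are the EXECUTOR. Apply the diff exactly.\n\n" + diff,
            )
        # On issues, ask the editor to revise the diff before retrying.
        diff = call_model(
            "qwen3-coder:14b-instruct-q5_K_M",
            f"You are the EDITOR. Revise the diff to address:\n{review}\n\nOriginal diff:\n{diff}",
        )
    return "gave up after repeated review failures"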

Tester

The test role ensures functional correctness by generating and running unit tests for the diff. The best model for Bear is deepseek-coder:6.7b-instruct-q5_K_M (~6 GB VRAM), which scores 90%+ on HumanEval and excels at test generation.

config.json
{
  "name": "tester",
  "model": "deepseek-coder:6.7b-instruct-q5_K_M",
  "roles": ["test"],
  "supportsTools": true,
  "systemMessage": "You are the TESTER. Write pytest cases. Run them. Return: <test_result>pass</test_result> or <test_result>fail: ...</test_result>"
}

Debugger

The debug role analyzes <tool_result> errors (e.g., tracebacks) and proposes fixes. The best models for Bear are glm-z1:9b (the same model used for the planner role) or qwen3-coder:14b, because these models provide the strong reasoning needed for root-cause analysis.

Documenter

The docs role generates docstrings, README updates, and API docs from code changes. With these additional roles in place, the overall flow looks like this:

User → Planner (GLM-Z1) → Retriever (nomic-embed-text)
         ↓
       Editor (Qwen3-Coder) → Reviewer (Codestral)
         ↓
       Executor (Gemma 2) → Tester (DeepSeek)
         ↓
       Documenter (Gemma 2) → Done

VRAM Feasibility on Bear

Role Model VRAM Notes
Planner glm-z1-9b ~6 GB Reuse
Editor qwen3-coder:14b ~10 GB Core
Executor gemma2:9b ~6 GB Core
Retriever nomic-embed-text <1 GB Always on
Reviewer codestral:22b ~10 GB Swap in
Tester deepseek-coder:6.7b ~6 GB Swap in

The core triad plus retriever requires ~22 GB total VRAM, which greatly exceeds the 12 GB available. Use model offloading or sequential activation via Continue rules.

Example rule in config.json
{
  "name": "Route to Reviewer",
  "prompt": "If diff is present, send to model 'reviewer' before executor."
}

Implementing Sequential Activation with Continue.dev

Your Bear system has an NVIDIA RTX 3060 with only 12 GB VRAM, so it cannot load all the models of a 5-model agent simultaneously. Model offloading and sequential activation solve this by loading only the active model into VRAM and unloading the others. Continue.dev supports this via rules and Ollama’s memory management.

Sequential Activation via Rules

Instead of keeping all models in memory, activate one model at a time based on the current agent step. Use Continue rules to:

  1. Detect output tags (e.g., <think>, diff, <review>)
  2. Route to the next model
  3. Load/unload models via Ollama API

Enable Ollama Model Management

Ollama automatically unloads inactive models after ~5 minutes. Force immediate unload with:

$ ollama unload <model-name>
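
If your Ollama version does not provide an unload command, the REST API offers the same effect: sending a generate request with keep_alive set to 0 evicts the model from VRAM as soon as the request completes. A minimal helper:

Python
import requests

OLLAMA = "http://localhost:11434/api/generate"


def unload(model: str) -> None:
    """Evict a model from VRAM right away.

    A generate request with no prompt and keep_alive set to 0 makes Ollama
    unload the model as soon as the (empty) request finishes.
    """
    requests.post(OLLAMA, json={"model": model, "keep_alive": 0}).raise_for_status()


unload("sllm/glm-z1-9b")  # free VRAM before the next model in the chain loads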

Add load and unload Rules

Update your config.json with custom rules that trigger model loading and unloading.

config.json rules
"rules": [
  {
    "name": "Load Editor on Plan",
    "prompt": "If output contains <think>, run shell command: ollama unload glm-z1-9b && ollama pull qwen3-coder:14b-instruct-q5_K_M"
  },
  {
    "name": "Load Executor on Diff",
    "prompt": "If output contains ```diff, run shell command: ollama unload qwen3-coder:14b-instruct-q5_K_M && ollama pull gemma2:9b-instruct-q5_K_M"
  },
  {
    "name": "Load Reviewer on Request",
    "prompt": "If user says 'review', run shell command: ollama unload gemma2:9b-instruct-q5_K_M && ollama pull codestral:22b-instruct-q5_K_M"
  },
  {
    "name": "Unload All on Done",
    "prompt": "If <tool_result>success</tool_result> is present, run shell command: ollama unload"
  }
]

Note: ollama unload without args unloads all inactive models.

Use shell Tool in Executor

Enable shell commands in the executor model:

config.json fragment
{
  "name": "executor",
  "model": "gemma2:9b-instruct-q5_K_M",
  "supportsTools": true,
  "systemMessage": "You are the EXECUTOR. You can run shell commands. Use `ollama unload` to free VRAM."
}

Example Workflow (5-Model Safe)

User: "@codebase Refactor login.py"
   → Planner (GLM-Z1) loads → `<think>plan</think>`
   → Rule: `unload glm-z1` + `pull qwen3-coder`

Editor (Qwen3-Coder) loads → `diff`
   → Rule: `unload qwen3-coder` + `pull codestral`

Reviewer (Codestral) loads → `<review>pass</review>`
   → Rule: `unload codestral` + `pull gemma2`

Executor (Gemma2) loads → Applies diff → `<tool_result>success</tool_result>`
   → Rule: `ollama unload` (all done)

VRAM usage never exceeds ~10 GB because only one model is active at a time.
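
A sequential orchestrator in the style of the earlier Python sketches would walk this chain one resident model at a time, reusing the generate() and unload() helpers sketched above. The model names follow the workflow shown here; the role prompts are illustrative only.

Python
# Sequential activation: one resident model at a time.
steps = [
    ("sllm/glm-z1-9b",
     "You are the PLANNER. Refactor login.py. Output <think>steps</think>."),
    ("qwen3-coder:14b-instruct-q5_K_M",
     "You are the EDITOR. Turn the plan into a clean unified diff."),
    ("codestral:22b-instruct-q5_K_M",
     "You are the REVIEWER. Output <review>pass</review> or <review>issues: ...</review>."),
    ("gemma2:9b-instruct-q5_K_M",
     "You are the EXECUTOR. Apply the diff and return <tool_result>."),
]

context = ""
previous = None
for model, role_prompt in steps:
    if previous is not None:
        unload(previous)  # evict the finished model before the next one loads
    context = generate(model, f"{role_prompt}\n\n{context}")
    previous = model

print(context)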

Alternative: Manual Offloading (Simpler)

Add a pre-step prompt:

Before starting, run: ollama unload

Or use a VS Code task:

config.json fragment
{
  "label": "Unload Ollama Models",
  "type": "shell",
  "command": "ollama unload"
}

To run, press CTRL+SHIFT+P then type Tasks: Run Task.

Coming Soon: Auto-Load via Continue Hooks

Continue.dev is adding pre/post hooks to v0.9. Example:

config.json fragment
"hooks": {
  "beforeModel": "ollama unload && ollama pull "
}

Sequential Activation Summary

Method Pros Cons
Rule-based ollama unload/pull Fully automated Slight delay (~5–10s per switch)
Manual unload Fast, simple Requires user action
Sequential activation Fits 12 GB VRAM Best for 5+ models

Rule-based sequential activation for the 5-model agent will run smoothly on Bear.

Distributed Processing

Let’s pretend that all 5 LLMs are remote, provided by 5 different vendors. Do the LLMs converse point-to-point, or is Continue.dev a potential bottleneck?

In the hypothetical scenario where all five LLMs are remote and provided by five different vendors, the LLMs do not converse point-to-point. Continue.dev acts as a hub and the LLMs are spokes.

Data Volume Per Agent Step (Remote LLMs)

Step Tokens Approx. Size (UTF-8) Notes
Planner → Editor 3,000–8,000 12–32 KB Plan + retrieved code snippets
Editor → Executor 1,000–5,000 4–20 KB Unified diff + context
Executor → Planner (error) 500–2,000 2–8 KB Tool result + traceback
Average per step ~4,000 tokens ~16 KB  

Total per full task (3 steps): ~50 KB

Bandwidth Needed for “Quick” Response

Goal Max Allowable Transfer Time Required Bandwidth
Instant feel (<1s per step) 0.5s ~250 Kbps
Fast (1–2s per step) 1.0s ~128 Kbps
Acceptable (2–3s per step) 2.0s ~64 Kbps

Minimum: 64 Kbps Recommended: ≥ 1 Mbps (for stability, retries, and parallel calls)
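
The table follows from the ~16 KB-per-step payload; a quick check of the arithmetic:

Python
# Back-of-envelope check of the bandwidth table: ~16 KB per agent step.
payload_kbits = 16 * 8  # 16 KB expressed in kilobits

for label, transfer_seconds in [("instant", 0.5), ("fast", 1.0), ("acceptable", 2.0)]:
    print(f"{label}: {payload_kbits / transfer_seconds:.0f} Kbps")
# instant: 256 Kbps, fast: 128 Kbps, acceptable: 64 Kbps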

Critique

I asked Gemini about this article, which was written with heavy input from Grok:

Please provide a comprehensive critique of this article covering technical architecture, benchmarks, implementation feasibility, and practical implications.

The response sounds a bit like one LLM talking trash about another:

Critical Issues

  1. Unverified benchmarks: Core claims sourced from “Grok” without validation or methodology disclosure. The author admits not verifying them.
  2. Speed claims oversimplified: The 10–30% speedup ignores model switching overhead (2–4s per switch, not 0.5–1s), context retrieval, and Ollama loading penalties.
  3. Theoretical gaps: The “assembly line” analogy breaks down (sequential, not parallel); role specialization assumes clean boundaries that don’t exist in transformers.
  4. 5-model system impractical: Sequential activation adds 8–20s of overhead just for model loading/unloading. On RTX 3060, this makes it slower than a single model.
  5. Implementation gaps: The Python example is trivial; production code needs error handling, logging, security, and context passing.

What It Gets Right

  • Multi-model decomposition is conceptually sound
  • ReAct loops are proven
  • Continue.dev is real and usable
  • Embedding-based context retrieval is practical

Bottom Line

Interesting idea, but presented as more validated and practical than the evidence supports. The 10–30% speedup is unproven; the 5-model system is likely slower in practice than claimed due to loading overhead. Before adopting this, verify timing on your own hardware with realistic task sets.
