Best Local LLMs For Coding

Published 2025-11-03. Last modified 2025-11-25.
Time to read: 16 minutes.

This page is part of the llm collection.

Did you know that all the computers on your LAN can pool their VRAM so your LLM models run across your entire LAN? That means a modest laptop, attached via a solid Wi-Fi connection could employ some extra local horsepower for AI tasks. If this type of topic interests you, please read the entire series.

This article discusses the best open-source LLMs for coding that run with acceptable performance on my workstation. “Good performance” typically means at least 20 to 40 tokens per second with minimal quality loss. If I were renting GPUs, the list of contenders would focus on larger versions of the same models.

Topics discussed in this article include:

The workstation used for testing
A summary of results
Graphs that compare the performance of leading open-source and commercial LLMs
Details on each model tested, listed in order of preference
Conclusions and recommendations

System prompts are discussed in LLM System Prompts.

I had several long discussions with LLMs, principally Grok.

I only installed and played with the most interesting candidates, as shown in the article.

Information that seemed reasonable from Grok was provisionally accepted without verification. I have been burrowing my way through it, correcting and expounding as I felt relevant. I have not found any quantitative errors from Grok, just bad URLs, imaginary references, a fixation on deterministic initialization sequences, and a total lack of understanding of human needs.

This was not an exhaustive survey, but it probably represents a fair assessment.

All the dialogs did actually happen as shown because that is a transcript of what I did, and the results that I achieved. At least you can trust the code presented and the TUI dialog.

Bear and Gojira

The workstation I use the most for developing software is called Bear and has the following components that affect the performance of LLMs that run on it:

Intel Core i7-13700K (16 cores: 8 performance + 8 efficiency, up to 5.4 GHz).
NVIDIA RTX 3060 with 12 GB GDDR6 VRAM
64 GB DDR5
Z790 chipset motherboard
12 TB NVMe PCIe v5.

Gojira, an Ubuntu server in the room next to my desk, is built with similar hardware as Bear, except it has two GPUs, both NVIDIA RTX 3060s with 12 GB VRAM each.

Context Length

An optimized context refers to using specific techniques and software frameworks to manage a context in an efficient manner, minimizing the computational overhead and memory usage that typically scale poorly with long contexts.

In standard Transformer architectures, the computational cost (both memory and speed) of processing the self-attention mechanism grows quadratically with the context length. A 16K context is substantial, and if handled naively, it would consume a massive amount of VRAM and slow down inference significantly. Optimization Techniques

An optimized context attempts to alleviate this quadratic scaling issue; non-optimized contexts are called naive. Optimizations include:

PagedAttention is an efficient memory management algorithm used in serving frameworks like vLLM. It pages (sqps) the KV Cache, storing it in "pages" rather than contiguous memory blocks. This prevents memory fragmentation and maximizes the GPU's memory utilization, allowing for much larger effective batch sizes and greater throughput, especially with long contexts.
Flash Attention is an optimized implementation of the attention mechanism that uses tiling and recomputation memory management techniques on the GPU to avoid reading/writing the large intermediate attention matrices to and from the relatively slow GPU High-Bandwidth Memory (HBM). This dramatically speeds up prompt processing for long contexts.
Context Compression/Summarization distills the most important information from older parts of the conversation into a shorter summary that is then injected into the prompt. This keeps the effective context length manageable without losing important information.
Multi-Block Attention (MBA) is used in libraries like TensorRT-LLM. This technique partitions long contexts into smaller blocks that can be processed in parallel across different Streaming Multiprocessors (SMs) on the GPU during the decode phase, merging the results efficiently.

Results Summary

The following benchmarks were used to evaluate the contending models. Recall that the models must run on my computer. If models were run remotely, they could be much larger, which helps in a detailed computation. However, smaller models read faster, so bigger is not necessarily better. The benchmarks that were referred to provide metrics on coding-related capabilities:

HumanEval: code generation accuracy
LiveCodeBench v5: real-world algorithmic coding
BFCL v3: function calling and tool use
ToolBench: agentic task completion

I did not run the benchmarks myself. Grok provided the data. Here is a summary of the results. Details follow in separate sections.

Model and Variant	HumanEval (Accuracy)	LiveCodeBench (Algorithms)	BFCL (Tool Calling)	ToolBench (Agency)	Notes
CodeLlama 13B/34B	53–67.8%	~30–40%	<50%	<50%	Outdated (2023). Minimal agency; fine for basic gen but lags in real-world/agents.
Codestral 22B	71.4–86.6%	37.9%	65–70%	65–70%	Solid coding. Good agency via prompt-based tools, but not as integrated as Qwen3; strong for multi-language but weaker planning.
DeepSeek-Coder-V2 (Lite 16B)	90.2%	43.4%	65–70%	70–75%	Excellent coder but algorithmic capability lags GLM-Z1-9B. Strong agency via MoE efficiency for tool chains; good for multi-step agents.
Gemma2:9B	82–86%	~60–70%	75–80%	72–78%	Fastest, strong function calling, reliable with structured prompts. Good multi-step task completion, minor edge-case failures. Weaker reasoning and planning; less accurate on complex logic.
GLM-Z1-9B-0414	75–80%	65–70%	55–60%	60–65%	Excellent reasoning/planning but prompt-only agency; no native tools. Strong on math-integrated code.
MiniMax-M2	85-87%	70–75%	80-85%	80-85%	The latest hotness. 1M context; SOTA open-source for agency and tools; excels in multi-step reasoning and execution.
Qwen3-Coder (14B–30B)	89.3%	70.7%	75–80%	75–80%	Strong coder. Excellent agency with native tool calling and agentic tuning; supports 128K+ contexts for multi-file agents.

The above table is very wide, in fact you can only see a portion of it at a time. Drag left with your finger over the above table or use the horizontal scrollbar and push right to see the rest of the table.

Of the models discussed in this series, MinMax-M2 is the overall best performer. It excels in coding, tool use, and multi-step agency due to its massive 1M token context and advanced architecture.

Qwen3-Coder is the runner-up, matching or exceeding GLM-Z1-9B’s coding while offering significantly better agency through native tool support and agentic tuning. DeepSeek-Coder-V2 is a strong runner-up for coding but lags in tool calling. Codestral is balanced but not as agentic, and CodeLlama trails because it is an outdated 2023 model with weak tool support.

After the graphs shown in the next section, the rest of this article discusses each model in the above table in more detail. Subsequent articles are linked at the top and bottom of this web page.

Graphs

artificialanalysis.ai publishes analysis of both commercial and open-source LLMs. Their website shows costs for the open-source models; however, they do not indicate how those costs were computed.

Intelligence Capability Ranking From artificialanalysis.ai

Agentic Capability Ranking From artificialanalysis.ai

Coding Capability Ranking From artificialanalysis.ai

minimax.io, the Chinese company that makes MiniMax-M2, provide the following graph on their website. Note that the X axis is reversed, which is confusing, just so their product can appear in the upper right quadrant. Someone spent too much time reading Gartner magic quadrant summaries.

From minimax.io: Expensive models are on the left; no idea how the price for free models was computed

I think MinMax.ai just copied the data from artificialanalysis.ai without thinking too much about what the graphs actually mean. All they seemed to care about having their product in the upper-right quadrant.

Click on a section header below to see its contents.

Codestral

Model Collection Overview

The Codestral family is published by Mistral AI. These open-weight models specialize in code generation, completion, and editing across 80+ programming languages. Released in 2024, they emphasize developer workflows like autocomplete, debugging, and multi-language support, with strong benchmarks on HumanEval and MultiPL-E. Context lengths reach 32K tokens natively.

Two variants:

Codestral-22B-v0.1: Dense Transformer-based, 22B parameters, instruct-tuned for code tasks. Base and chat versions available.
Mamba-Codestral-7B-v0.1: 7B parameters, Mamba architecture outperforms similar-sized Transformers in speed and memory while matching code quality.

Quantized GGUF formats (e.g., from TheBloke or bartowski repos) support local runs via llama.cpp or Ollama. No major 2025 updates in current listings—these remain the core models, with community fine-tunes.

On Bear setup (i7-13700K, RTX 3060 12 GB VRAM, 64 GB RAM), both versions would run well with quantization, but the 22B version would require compromises for "good" performance.

Compatibility and Performance Assessment

Short Answer: Yes for both—the 7B Mamba excels (fast and accurate); the 22B is viable but needs to be quantized/offloaded for acceptable speeds.

Mamba-Codestral-7B-v0.1

Mamba2's linear-time scaling makes it more VRAM- and compute-efficient than Transformers, ideal for your hardware.

Quantization Level	Approx. File Size (VRAM Est. for Weights)	Fits in 12 GB?	Quality Impact	Expected Speed on RTX 3060 (4K Context)
FP16 (unquantized)	~14 GB	No (offload)	None	40–60 t/s
Q8_0	~7.5 GB	Yes	Negligible	50–70 t/s
Q5_K_M	~5.5 GB	Yes	Minimal	60–80 t/s
Q4_K_M	~4.8 GB	Yes	Low	70–90 t/s

Performance Notes: Blisteringly fast due to Mamba's architecture—often 1.5–2x quicker than 7B Transformers on Ampere GPUs. Full GPU load at Q5 delivers interactive coding (e.g., real-time autocompletion). Benchmarks show it rivaling 13B models on code eval while using half the resources.

Codestral-22B-v0.1

Dense model; larger for better multi-language reasoning but hungrier on resources.

Quantization Level	Approx. File Size (VRAM Est. for Weights)	Fits in 12 GB?	Quality Impact	Expected Speed on RTX 3060 (4K Context)
Q8_0	~23 GB	No	Negligible	N/A (heavy offload)
Q5_K_M	~15 GB	No (offload ~15 layers)	Minimal	12–20 t/s
Q4_K_M	~13 GB	Barely (offload 5–10)	Low	15–25 t/s
Q3_K_M	~10.5 GB	Yes	Medium	20–30 t/s
Q2_K	~8 GB	Yes	High	25–35 t/s (quality drop)

Performance Notes: Q4 with partial offload to your 64 GB RAM yields usable speeds for code gen, though not as snappy as smaller models. Mamba's efficiency edge makes the 7B preferable unless you need the 22B's superior benchmark scores (e.g., 75%+ on RepoBench). Avoid long contexts (>8K) to keep KV cache low.

CPU and RAM Role

i7-13700K handles Mamba offloads seamlessly (negligible slowdown); for 22B, it manages 10–20 layers without frustration.
64 GB RAM enables full model loading in RAM if VRAM overflows, supporting 32K contexts fluidly.

4. How to Run (Recommended: 7B for Speed, 22B for Power)

Tools: Ollama (ollama run codestral-mamba) or llama.cpp: Download GGUF from bartowski/mistralai_Mamba-Codestral-7B-v0.1-GGUF, run ./llama-cli --model codestral-7b-q5.gguf --n-gpu-layers -1 -c 4096.
Tips: Set temperature=0.1 for deterministic code; use --flash-attn for boosts. Test: "Generate a Go function for JWT validation with error handling."
Quality: Q5 preserves 95%+ of original capabilities; Mamba variant shines in low-resource scenarios.

CodeLlama

Model Collection Overview

The search on Hugging Face for "codellama" under Meta Llama yields the Code Llama family: a 2023 release (no major updates through 2025) of code-focused LLMs based on Llama 2, fine-tuned for generation, completion, infilling, and explanation across 20+ languages. They support 16K token contexts (extendable via RoPE), with strengths in benchmarks like HumanEval (~50–70% pass@1 depending on size). All models are gated—require Meta approval for access.

Key variants (all available in HF Transformers format; community GGUF/AWQ quants via TheBloke/bartowski repos for local runs):

7B: CodeLlama-7b-hf (base), -Instruct-hf (chat/code tasks), -Python-hf (Python specialist).
13B: CodeLlama-13b-hf (base), -Instruct-hf, -Python-hf.
34B: CodeLlama-34b-hf (base), -Instruct-hf, -Python-hf.
70B: CodeLlama-70b-hf (base), -Instruct-hf (no Python variant).

Instruct/Python tunes excel for interactive coding; base for raw completion. Quantized versions enable Ollama/llama.cpp deployment.

On your Bear setup (i7-13700K, RTX 3060 12 GB VRAM, 64 GB RAM), 7B/13B run flawlessly; 34B is workable with quantization; 70B is impractical.

Compatibility and Performance Assessment

Short Answer: Yes for 7B–34B (good performance with quants); no for 70B (VRAM overload even quantized).

7B Variants

Lightweight and efficient; full GPU load at high precision.

Quantization Level	Approx. File Size (VRAM Est. for Weights)	Fits in 12 GB?	Quality Impact	Expected Speed on RTX 3060 (4K Context)
FP16 (unquantized)	~14 GB	No (offload)	None	40–60 t/s
Q8_0	~7.2 GB	Yes	Negligible	50–70 t/s
Q5_K_M	~5.3 GB	Yes	Minimal	60–80 t/s
Q4_K_M	~4.6 GB	Yes	Low	70–90 t/s

Performance Notes: Snappy for code autocompletion (e.g., 70+ t/s on Instruct). Python variant shines for scripting; rivals modern 7B coders.

13B Variants

Balanced sweet spot; quants fit comfortably.

Quantization Level	Approx. File Size (VRAM Est.)	Fits in 12 GB?	Quality Impact	Expected Speed (4K Context)
Q8_0	~13.5 GB	No (offload)	Negligible	25–35 t/s
Q5_K_M	~9.5 GB	Yes	Minimal	35–50 t/s
Q4_K_M	~8 GB	Yes	Low	40–55 t/s
Q3_K_M	~6.5 GB	Yes	Medium	45–60 t/s

Performance Notes: Interactive for debugging/refactoring (40+ t/s at Q5). Instruct version handles multi-turn code chats well.

34B Variants

Larger for complex reasoning; needs aggressive quants/offload.

Quantization Level	Approx. File Size (VRAM Est.)	Fits in 12 GB?	Quality Impact	Expected Speed (4K Context)
Q5_K_M	~19 GB	No (offload ~20 layers)	Minimal	10–18 t/s
Q4_K_M	~16 GB	No (offload 10–15)	Low	12–20 t/s
Q3_K_M	~12.5 GB	Barely	Medium	15–25 t/s
Q2_K	~9 GB	Yes	High	20–30 t/s (quality trade-off)

Performance Notes: Usable for heavy tasks like full function gen, but Q4 offload to RAM adds ~1s latency per response. Tops older 30B on infill.

70B Variants

Flagship but resource-intensive; MoE-like density.

Quantization Level	Approx. File Size (VRAM Est.)	Fits in 12 GB?	Quality Impact	Expected Speed
Q4_K_M	~38 GB	No	Low	N/A (<3 t/s with heavy offload)
Q3_K_M	~28 GB	No	Medium	N/A
Q2_K	~19 GB	No	Very high	N/A

Performance Notes: Overwhelms 12 GB VRAM; even max offload to 64 GB RAM yields <3 t/s. Cloud or 24+ GB GPU needed.

CPU and RAM Role

i7-13700K manages offloads for 13B+ (e.g., 15 layers with <10% slowdown).
64 GB RAM absorbs KV cache for 16K contexts and spillover, enabling stable runs.

How to Run (Recommended: 13B-Instruct for Versatility)

Access: Request via HF model page (Meta form; approved in ~1 day).
Tools: Ollama (ollama run codellama:13b-instruct-q5_K_M) or llama.cpp: Download GGUF from TheBloke/CodeLlama-13B-Instruct-GGUF, run ./llama-cli --model codellama-13b-instruct-q5.gguf --n-gpu-layers -1 -c 4096 --infill.
Tips: Use temperature=0.2 for code; enable --rope-scaling for longer contexts. Test: "Complete this C++ class for a neural net layer."
Quality: Q5 retains 95%+ benchmark fidelity; Python tunes boost lang-specific accuracy.

In summary, CodeLlama is a decent option for local coding on Bear. The 7B or 13B versions work, save the 34B version for when you need depth. The 70B is too large to run on Bear.

Deepseek

DeepSeek-Coder-V2

The Hugging Face collection for DeepSeek-Coder-V2 features two primary variants of the DeepSeek-Coder-V2 models, both Mixture-of-Experts (MoE) architectures optimized for code generation, completion, and instruction-following tasks. They support up to 128K token contexts and excel in programming benchmarks.

DeepSeek-Coder-V2-Lite-Instruct: 16B total parameters (2.4B active during inference).
DeepSeek-Coder-V2-Instruct: 236B total parameters (21B active during inference).

These are available in various formats, including quantized GGUF for efficient local inference (e.g., via llama.cpp or Ollama). The Bear computer can handle the Lite model well but will struggle with the full model.

Compatibility and Performance Assessment

Short Answer:
Yes: for the 16B Lite variant (good performance with quantization).
No: for the 236B full variant (impractical on single-GPU system like Bear).

DeepSeek-Coder-V2-Lite-Instruct (16B)

This fits and runs smoothly on your hardware using quantized versions. The MoE design keeps active compute low, aiding efficiency.

Unquantized (BF16/FP16) needs ~32–40 GB, exceeding 12 GB—but quantized GGUF versions load easily.

Quantization Level	Approx. File Size (VRAM Est. for Weights)	Fits in 12 GB VRAM?	Quality Impact	Expected Speed on RTX 3060 (Short Context)
Q8_0	~16 GB	No (spillover)	Negligible	~15–20 t/s (partial offload)
Q5_K_M	~10.5 GB	Yes	Minimal	~25–35 t/s
Q4_K_M	~9 GB	Yes	Low	~30–40 t/s
Q3_K_M	~7.5 GB	Yes	Medium	~35–45 t/s
Q2_K	~6 GB	Yes	High	~40–50 t/s (but avoid for quality)

With Q5/Q4, expect interactive speeds (20+ tokens/second) for code tasks. The 64 GB RAM allows seamless layer offloading if needed (e.g., via --n-gpu-layers 999 in llama.cpp), but full GPU loading is feasible. Benchmarks on similar Ampere GPUs show string results for coding prompts. Tools like Ollama or LM Studio simplify setup; download from repositories like bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF.

DeepSeek-Coder-V2-Instruct (236B)

This is infeasible for good performance on your setup due to sheer size, even with MoE efficiencies.

Unquantized needs ~472 GB (multi-GPU cluster). Quantized versions are still massive.

Quantization Level	Approx. File Size (VRAM Est. for Weights)	Fits in 12 GB VRAM?	Quality Impact	Expected Speed on RTX 3060
Q4_K_M	~133–142 GB	No	Low	N/A (won't load)
Q3_K_M	~100+ GB	No	Medium	N/A
Q2_K	~70–80 GB	No (heavy offload)	Very high	<2 t/s (CPU-dominant)

Even aggressive Q2 quantization exceeds 12 GB VRAM, forcing massive offload to 64 GB RAM (which can hold ~half the model). Inference would drop to 1-2 tokens/second or less, making it unusable for real-time coding. Multi-GPU (e.g., 8x 80 GB) or cloud is required for viable runs. Quantized files are available (e.g., bartowski/DeepSeek-Coder-V2-Instruct-GGUF), but are not practical here.

CPU and RAM

A 16 core i7-13700K handles offloads decently for the Lite model but bogs down the 236B (pure CPU: <1 tokens/second).
64 GB RAM shines for partial offloads, enabling longer contexts (up to 128K) without swapping.

How to Run the Lite Model

Go for the 16B Lite—it's a powerhouse for coding on Bear with quantization. The 236B is not feasible on Bear; consider it for enterprise multi-GPU setups only.

For a quick test, type:

Shell

PS C:\Users\Mike Slinn> ollama run deepseek-coder-v2:lite
pulling manifest
pulling 5ff0abeeac1d: 100% ▕█████████████████████████████████████▏ 8.9 GB
pulling 22091531faf0: 100% ▕█████████████████████████████████████▏ 705 B
pulling 4bb71764481f: 100% ▕█████████████████████████████████████▏ 13 KB
pulling 1c8f573e830c: 100% ▕█████████████████████████████████████▏ 1.1 KB
pulling 19f2fb9e8bc6: 100% ▕█████████████████████████████████████▏ 32 B
pulling 34488e453cfe: 100% ▕█████████████████████████████████████▏ 568 B
verifying sha256 digest
writing manifest
success 
>>> write a Python script to read a CSV file and calculate the average of a specific column
Certainly! Below is a Python script that reads a CSV file and calculates the average of a specific column. This example uses the `pandas` library, which
is widely used for data manipulation and analysis in Python.

First, make sure you have the `pandas` library installed. You can install it using pip if you haven't already:

```sh
pip install pandas
```

Here's the script:

```python
import pandas as pd

def calculate_average(csv_file, column_name):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(csv_file)

    # Check if the specified column exists in the DataFrame
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' does not exist in the CSV file.")

    # Calculate the average of the specified column
    average = df[column_name].mean()

    return average

# Example usage
csv_file = 'your_file.csv'  # Replace with your CSV file path
column_name = 'your_column_name'  # Replace with the name of the column you want to calculate the average for

try:
    avg = calculate_average(csv_file, column_name)
    print(f"The average of {column_name} is: {avg}")
except ValueError as e:
    print(e)
```

### Explanation:
1. **Import pandas**: The script starts by importing the `pandas` library.
2. **Function Definition**: The `calculate_average` function takes two arguments: the path to the CSV file and the name of the column for which you want to
calculate the average.
3. **Read CSV**: It reads the CSV file into a DataFrame using `pd.read_csv(csv_file)`.
4. **Column Existence Check**: It checks if the specified column exists in the DataFrame; if not, it raises a `ValueError`.
5. **Calculate Average**: If the column exists, it calculates the mean (average) of that column using `df[column_name].mean()`.
6. **Return and Print**: The average is returned and printed.
7. **Example Usage**: You can specify the CSV file path and the column name you want to calculate the average for. If an error occurs, it catches the
exception and prints the error message.

Replace `'your_file.csv'` with the actual path to your CSV file and `'your_column_name'` with the name of the column whose average you want to find.

The context length defaults to 4K to 8K. For best performance on Bear, increase it to 128K by setting the num_ctx parameter inside a chat, as follows:

PowerShell

PS C:\Users\Mike Slinn> ollama run deepseek-coder-v2:lite

>>> /set parameter num_ctx 131072
Set parameter 'num_ctx' to '131072'

Use a low temperature for code. In a chat, type:

Ollama chat

>>> /set parameter temperature 0.2
Set parameter 'temperature' to '0.2'

Performance

Context Length	VRAM Used	t/s (Q4_0)	Notes
128 K	~11.5 GB	22 – 30	Interactive, full GPU load, RAM sufficient
64 K	~10.8 GB	25 – 33
32 K	~10.2 GB	28 – 36
8 K	~9.3 GB	35 – 45	(baseline)

Why This Speed?

Factor	Impact
MoE (2.4B active)	Only 2.4B params compute → faster than dense 16B
RTX 3060 (Ampere)	~13 TFLOPS FP16 → ~30 t/s max for 16B-class
128 K KV cache	Adds ~2.2 GB VRAM → ~20–25% slowdown vs 8 K
Ollama + CUDA	Efficient offload, flash attention enabled
64 GB RAM	Prevents disk swapping → sustained speed

Real-World Feel

Task	Response Time
Short reply (200 tokens)	~7–9 seconds
Code refactor (500 tokens)	~17–23 seconds
Full 100K-token analysis	Model reads all, responds in <30 sec

Bottom Line

128 K context should yield 22–30 t/s, which is fast enough for real coding on Bear

GLM-Z1-9B

GLM-Z1-9B-0414

GLM-Z1-9B-0414 is a 9B-parameter model from the GLM-4 family that is tuned for reasoning, planning, math, logic, and code. It has strong coding performance but limited native agency. For example, it does not have built-in tool calling and instead relies on prompt-based planning like <think> tags.

The model was released in April 2024 and is also listed under THUDM/GLM-4-Z1-9B-0414. It is a distilled variant of the larger GLM-4-32B-0414, fine-tuned via cold-start reinforcement learning and task-specific training on math, code, and logic. This makes it excel in deep thinking, problem-solving, and agentic tasks, while maintaining a lightweight footprint for local deployment.

Key strengths include mathematical reasoning (e.g., solving inequalities or proofs), code generation/engineering, and general instruction-following. It supports up to 8K tokens natively, extendable to 32K+ via YaRN scaling.

Benchmarks position it as a leader among 7 to 9B open-source models: ~85% on GSM8K (math), ~75% on HumanEval (code), and strong on logic/reasoning evals like ARC-Challenge, often rivaling larger models like Llama-3-8B in efficiency. Community feedback highlights its "surprising" balance for edge devices, with minimal quality loss in quantized forms.

On Bear, this model should run with excellent performance. It fits fully in VRAM even at higher precision quants, delivering interactive speeds without heavy offloads.

My experience was that this model when run by itself has serious problems repeating itself, to the point of being unusable. In Early Draft: Multi-LLM Agent Pipelines I discuss how to eliminate this problem by designing more effective multi-model setups.

VRAM Constraints (Not a Bottleneck)

Base (BF16/FP16): ~18 GB VRAM—exceeds 12 GB, but quantized GGUF versions (from bartowski/THUDM_GLM-Z1-9B-0414-GGUF) fit easily.
Community quants use llama.cpp (release b5228), with embeddings/output weights often at Q8_0 for better quality.

Quantized GGUF file sizes (approximating VRAM needs for weights):

Quantization Level	File Size (VRAM Est. for Weights + 4K KV Cache)	Fits in 12 GB?	Quality Impact	Expected Speed on RTX 3060 (4K Context)
Q8_0	~9.5 GB	Yes	Negligible	40–60 t/s
Q6_K	~8 GB	Yes	Minimal	50–70 t/s
Q5_K_M	~7.2 GB	Yes	Minimal	55–75 t/s
Q4_K_M	~6.2 GB	Yes	Low	60–80 t/s
Q3_K_M	~5.3 GB	Yes	Medium	65–85 t/s (avoid for precision tasks)

All viable quants load 100% on GPU (use --n-gpu-layers -1 in llama.cpp). Your 64 GB RAM handles any KV cache overflow for long contexts (e.g., 32K tokens adds ~2–3 GB). Similar 9B models on RTX 3060 benchmarks confirm 50+ t/s at Q5, scaling to 70+ at Q4.

Inference Speed

Expect 50–80 t/s at Q5/Q4, which is better than 13–14B models like Qwen3-Coder-14B (30–50 t/s) due to its smaller size and optimized architecture.
Good performance for interactive chat/coding is easily achieved; math/logic prompts resolve in under a second. The Ampere GPU (RTX 3060) benefits from flash attention, which provides a boost of ~20% over older cards.
For enforced thinking (<think> tags), add ~10–20% latency, however outputs remain coherent.

CPU and System RAM

The i7-13700K does not break a sweat, requiring minimal offload, keeping CPU usage under 20%.
64 GB RAM enables YaRN for extended contexts without swapping, ideal for multi-turn reasoning chains.

Run It

Auto download and run a test. This model is a rather wobbly and has trouble staying focused.

Shell

PS C:\Users\Mike Slinn> ollama run sllm/glm-z1-9b
>>> Solve ∫(x^2 + sin(x)) dx
.

  Okay, so I need to solve the integral of (x² + sin(x)) dx. Hmm, let's break this down step by step.

  First, the integral is of a sum: x squared plus sine of x. So, integrating term by term.

  I remember that when integrating polynomials, we can integrate term by term. Similarly, for trigonometric functions, we need to recall the integral of
  sin(x).

  So let's proceed step by step.

  First, split the integral into two separate integrals:

  ∫(x² + sin(x)) dx = ∫x² dx + ∫sin(x) dx

  Now, compute each integral separately.

  First integral: ∫x² dx

  To integrate x squared with respect to x, we can apply the power rule for integration.

  Recall that the integral of x^n with respect to x is (x^(n+1))) / (n + 1)) plus a constant. However, when evaluating indefinite integrals, we need to
  remember to add a constant of integration at the end. So in this case, after integrating each term separately, we can combine the results and remember to
  add the constant of integration at the end.

  So, combining all the terms:

  The first integral was ∫x² dx = (1/3)x³ + C

  Then the second integral was ∫1 dx = x + C

  Now, combining all these results together:

  First term: (1/3)x³

  Second term: x

  Now, combining these two terms:

  (1/3)x³ + x

  Wait a minute. Lets check the arithmetic.

  Original expression: (1/3)x³ + x

  But wait, if we have the first term as (1/3)x³ and the second term is x. So when combined, the entire expression becomes (1/3)x³ + x. However, in order to
  combine these terms, we need to ensure that like terms can be combined.

  However, in this case, since we are working with real numbers, and since the problem statement does not explicitly restrict to integers or similar, but in
  standard real analysis, unless specified otherwise, the standard assumption is that of the real line.

  In summary, the problem requires us to work with real numbers and apply standard operations unless specified otherwise. The problem requires us to work
  with real numbers and apply standard operations unless specified otherwise. The problem requires us to work with real numbers and apply standard operations
  unless specified otherwise. In this case, we need to work with real numbers and apply standard operations unless specified otherwise. The problem requires
  us to work with real numbers and apply standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply
  standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply standard operations unless specified
  otherwise. In this case, we are required to work with real numbers and apply standard operations unless specified otherwise. In this case, we are required
  to work with real numbers and apply standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply
  standard operations unless specified otherwise. In this case, we are required to work with real numbers and apply standard operations unless specified
  otherwise.

  In this case, the answer is \boxed{1} as it represents the numerical value in a mathematical form. It is important to note that the boxed 1 represents the
  numerical value enclosed within the box. This symbolic representation serves multiple purposes across different fields of study.

  **Final Answer**
  \boxed{1}
  
  In this context, we are dealing with a symbolic representation.

  The key points to consider are:

  1. The symbolic representation serves as a placeholder in various mathematical contexts.

  2. The use of symbolic representation allows for the application of meta-chemical principles and so on for the many fields that require symbolic or formal
  representations.
  
  Okay, lets see. Alright, okay, lets get this together. Alright, lets break this down. Okay, lets see. Alright, lets get this together. Alright, lets
  break this down. Okay, lets see. Alright, lets get this together. Alright, lets break this down. First, we identify the key components. In this case,
  the key components are: 1. The identification of the key components. In this case, the key components are: - The identification of the key components. In
  this case, the key components are: 2. The implementation of the operations. In this case, the operations are performed. Alright, lets break it down step
  by step.

  **Step 1: Understand the Problem**

  We need to **break down and simplify the problem statement.**
  
  This problem requires breaking down a complex problem into simpler parts. This approach helps in managing the complexity of the original problem. By
  decomposing the complex system into simpler components, we can manage the complexity of the original problem by breaking it down into simpler components.

  **Step-by-Step Explanation:**

  1. **Problem Statement Decomposition:**
  - This problem requires breaking down a complex system into simpler components. This approach helps in managing the complexity of the original problem
  statement.

  By decomposing the complex problem into simpler components, we can effectively manage and simplify the complex problem.

  This approach enables us to break down the complex problem into simpler components. By managing and simplifying each component within the complex problem
  decomposition, enables effective management and simplification of the complex problem decomposition through the application of this structured approach
  enables effective decomposition and simplification of the complex problem through effective application of the decomposition method to the complex problem
  through application of the decomposition method to the complex problem through application of the decomposition method to the complex problem through
  application of the decomposition method to the complex problem through application of the decomposition method to the complex problem through application
  of the decomposition method to the complex problem through application of the decomposition method to the complex problem through application of the
  decomposition method to the complex problem through application of the decomposition method to the complex problem through application of the decomposition
  method to the complex problem through application of the decomposition method to the complex problem through application of the decomposition method to the
  complex problem through application of the decomposition method to the complex problem through application of the decomposition method to the complex
  problem through application of the decomposition method to the complex problem through application of the decomposition method to the complex problem
  through application of the decomposition method to the complex problem through application of the decomposition method to the complex problem through
  application of the decomposition method to the complex problem.

  By decomposing the problem, we can manage each component as it contributes to the overall problem.

  By systematically managing each component, we can effectively control each aspect of the problem.

  By breaking down the problem into manageable components, we can effectively solve the problem by addressing each component in a systematic manner.

  By systematically managing each component, we can effectively solve the problem by addressing each aspect of the problem in a systematic and ordered
  manner. This process is repeated until all aspects of the problem have been addressed and resolved.

  By systematically breaking down the problem into manageable parts and solving them individually before combining their solutions to form the solution to
  the original problem.

  By repeating this process, the solution to the problem is found.

  This method can be applied to solve complex problems.

  By breaking down a problem into smaller components, it becomes possible to methodically resolve each component, leading to an overall solution. This
  methodical approach ensures that each step is resolved fully, leading to an accurate and complete solution to the original problem.
  
  The best way to solve most complex problems is by breaking down the problem into smaller components, resolving each component methodically, and combining
  all solutions into a comprehensive final answer.

  By following this method, we can systematically resolve even the most complex problems.

I cut off the rambling. Serious issues! We can fix that.

Suggestions:

Use chat_template.jinja for history trimming (ignores hidden <think>).
Enable YaRN in config.json for more than 8K tokens.
Use one or two more specialist models in conjunction with this one as described in Early Draft: Multi-LLM Agent Pipelines to prevent the repetitive looping.

MiniMax M2

On 2027-10-27, the MiniMax M2 appeared as a very strong and well-rounded coding model. On an RTX 4090:

~14 t/s for a Q6 quantized version (which is quite large, around 181 GB total size, which would swap heavily to CPU RAM).
~16 t/s for token generation with a large context size (24,000 tokens) using the UD-4_K_XL GGUF version and llama.cpp.
~23 t/s with naive caching and an optimized 16K context.
Up to ~38 t/s when using advanced optimization frameworks like vLLM with PagedAttention and an optimized 16K context.

Smaller GPUs like the RTX 3060 should not be used to run Minimax-M2 locally. Even the Q4_K_M85–8770–7580–8580–85MoE version would yield only 6 to 10 t/s.

Qwen3-Coder

Model Collection Overview

The Qwen3-Coder collection features a family of open-source, coding-specialized large language models from Alibaba's Qwen team, released in mid-2025.

These LLMs are instruct-tuned variants optimized for agentic coding tasks like code generation, debugging, tool calling, and browser automation via CLINE.

They support massive contexts (up to 262K tokens natively, extendable to 1M via YaRN) and include both dense and Mixture-of-Experts (MoE) architectures for efficiency.

Key variants in the collection (six sizes total, plus the flagship MoE giant):

Small: Qwen3-Coder-0.5B-Instruct, Qwen3-Coder-1.8B-Instruct, Qwen3-Coder-4B-Instruct, Qwen3-Coder-7B-Instruct (dense).
Medium: Qwen3-Coder-14B-Instruct (dense).
Large: Qwen3-Coder-30B-A3B-Instruct (30.5B total params, 3.3B active MoE).
Huge: Qwen3-Coder-480B-A35B-Instruct (480B total, 35B active MoE).

Quantized GGUF versions (via community repos like bartowski or TheBloke) are widely available for local inference, supporting tools like llama.cpp or Ollama. Base models for fine-tuning are also included.

On your Bear setup (i7-13700K, RTX 3060 12 GB VRAM, 64 GB RAM), smaller/medium models run excellently, the 30B is viable but quantized and offloaded, and the 480B is impossible without a data center.

All shells

PS C:\Users\mslinn> ollama pull qwen3-coder:30b

/set parameter num_ctx 16384

For community variants with extended features (e.g., 1M context), try:

All shells

PS C:\Users\mslinn> ollama pull bjoernb/qwen3-coder-30b-1m
pulling manifest
pulling 1194192cf2a1: 100% ▕████████████████████████████████████████████████▏  18 GB
pulling d18a5cc71b84: 100% ▕████████████████████████████████████████████████▏  11 KB
pulling 69aa441ea44f: 100% ▕████████████████████████████████████████████████▏  148 B
pulling 24a94682582c: 100% ▕████████████████████████████████████████████████▏  542 B
verifying sha256 digest
writing manifest
success

The optimal Qwen3-Coder variant for Gojira's 6 GB VRAM is the 7B-Instruct Q5_K_M, which fits fully (~4.5 GB) with room for 8K context and delivers 60–80 tokens/second for editing tasks. It balances quality (85%+ HumanEval) and efficiency vs the 4B (faster but less accurate) or 30B (too large without heavy offload).

The tag “qwen3-coder:7b” is not available in the Ollama library. The official Qwen3-Coder variants are limited to the 30B model (“qwen3-coder:30b”), which uses default quantization (Q4 or Q5) and is too large for your Gojira and Bear computers without heavy offload (20+ GB VRAM needed).

For a 7B equivalent, use the standard “qwen2.5-coder:7b” (predecessor with similar performance, 85%+ HumanEval) or convert the 7B GGUF from Hugging Face to Ollama.

A 7B model with Q4_K_M quantization uses approximately 4–5 GB for weights and 1–2 GB for KV cache (4K context), fitting within 6 GB VRAM with full GPU offload. The quantized model (e.g., Q4_K_M) requires ≤6 GB for weights + KV cache (8K context), leaving no overflow. All layers are loaded directly to GPU memory, enabling maximum speed (no VRAM swap). Expect 60–80 tokens/second on GTX 1660. Longer contexts may require minor RAM swap.

Install qwen2.5

Any shell

ollama pull qwen2.5-coder:7b

Create Qwen3 7B Manually

This is all wrong. Grok decided to take me for a ride again.

The following downloads the GGUF bartowski/Qwen/Qwen3-Coder-7B-Instruct-GGUF as “Qwen3-Coder-7B-Instruct-Q5_K_M” (~4.5 GB), then small Modelfile is created, then the LLM model is made.

Shell

$ cd $models

$ mkdir qwen3-1.7b; cd qwen3-1.7b

$ wget -O Qwen3-Coder-7B-Instruct-Q5_K_M.gguf \
https://huggingface.co/bartowski/Qwen/Qwen3-Coder-7B-Instruct-GGUF/resolve/main/Qwen3-Coder-7B-Instruct-Q5_K_M.gguf
2025-11-08 20:09:49 (111 MB/s) - ‘Qwen3-1.7B-Q4_K_M.gguf’ saved [1107409472/1107409472] 

$ cat << EOF > Modelfile
FROM ./Qwen3-1.7B-Q4_K_M.gguf
PARAMETER num_ctx 8192
EOF

$ ollama create qwen3-1.7b -f Modelfile

</div>

Create Modelfile:

   FROM ./Qwen3-Coder-7B-Instruct-Q5_K_M.gguf
   PARAMETER num_ctx 8192

Load:

Any shell

ollama create qwen3-coder-7b -f Modelfile
ollama run qwen3-coder-7b

</div>

See ollama library/qwen3-coder and huggingface. Qwen3-Coder-7B-Instruct.

FROM ./Qwen3-1.7B-Q4_K_M.gguf PARAMETER num_ctx 8192

Compatibility and Performance

For quantized models up to 30B, performance is good to acceptable. I am concerned that quantization might negatively affect accuracy.

Small Models (0.5B–7B-Instruct)

These are lightweight dense models, ideal for quick prototyping on modest hardware. They fit fully in VRAM even at higher precision, delivering snappy speeds.

Model Size	Quantization Level	Approx. File Size (VRAM Est. for Weights)	Fits in 12 GB?	Quality Impact	Expected Speed on RTX 3060 (4K Context)
0.5B–1.8B	FP16 (unquantized)	1–3.5 GB	Yes	None	100+ t/s
4B–7B	FP16	8–14 GB	Yes	None	50–80 t/s
4B–7B	Q5_K_M	4.5–8 GB	Yes	Minimal	60–90 t/s

Performance Notes: Blazing fast for code completion/infilling. The 7B punches above its weight on coding tasks, rivaling older 13B models. Your 64 GB RAM handles any overflow effortlessly.

Medium Model (14B-Instruct)

Dense architecture; balances capability and efficiency.

Quantization Level	Approx. File Size (VRAM Est.)	Fits in 12 GB?	Quality Impact	Expected Speed (4K Context)
Q8_0	~14.5 GB	No (offload)	Negligible	20–30 t/s
Q5_K_M	~10 GB	Yes	Minimal	30–45 t/s
Q4_K_M	~8.5 GB	Yes	Low	35–50 t/s
Q3_K_M	~7 GB	Yes	Medium	40–55 t/s

Performance Notes: Excellent fit—Q5 runs entirely on GPU for interactive coding sessions (e.g., 30+ t/s). MoE isn't here, but dense efficiency shines on your Ampere GPU.

Large Model (30B-A3B-Instruct)

MoE design (128 experts, 8 active) reduces compute but requires loading all weights. Similar to prior 32B assessments.

Quantization Level	Approx. File Size (VRAM Est.)	Fits in 12 GB?	Quality Impact	Expected Speed (4K Context)
Q5_K_M	~16 GB	No (offload ~20 layers)	Minimal	15–25 t/s
Q4_K_M	~13.5 GB	Barely (offload 5–10)	Low	20–30 t/s
Q3_K_M	~11 GB	Yes	Medium	25–35 t/s
Q2_K	~8.5 GB	Yes	High	30–40 t/s (avoid for accuracy)

Performance Notes: Viable with Q4/Q3 and some swapping. Note to self: This should be measured. Active MoE params boost speed vs. dense 30B (~20% faster inference). Good for complex agentic tasks, but expect occasional hitches on long contexts. Benchmarks show it outperforms Qwen2.5-32B for code reasoning.

Huge Model (480B-A35B-Instruct)

Flagship MoE beast for enterprise; not for consumer hardware.

Quantization Level	Approx. File Size (VRAM Est.)	Fits in 12 GB?	Quality Impact	Expected Speed
FP8 (official)	~240 GB	No	Low	N/A
Q4_K_M (GGUF)	~120–140 GB	No	Medium	N/A (<1 t/s with massive offload)
Q2_K	~80 GB	No	Very high	N/A

Performance Notes: Requires multi-GPU clusters (e.g., 4x A100 80GB). Even with 64 GB RAM offload, it would crawl at <1 t/s. Skip unless cloud-bound.

CPU and RAM Role

i7-13700K excels at offloading for 14B+ models (e.g., 10–20 layers to CPU adds minimal latency).
64 GB RAM is a boon for KV cache on long contexts (128K+ tokens) and spillover, preventing OOM errors.

How to Run (Recommended: 7B or 14B for Balance)

Tools: Ollama (ollama run qwen3-coder:7b) auto-grabs Q4 GGUF; or llama.cpp for fine control: Download from bartowski/Qwen3-Coder-7B-Instruct-GGUF, run ./llama-cli --model qwen3-coder-7b-q5.gguf --n-gpu-layers -1 -c 8192.
Tips: Use temperature=0.1 for precise code gen; enable flash attention for speed. Test with: "Implement a Rust binary search tree with serialization."
Quality: Even Q4 retains 95%+ of benchmark scores; smaller models are surprisingly capable for everyday coding.

In summary, for Bear, start with the 7B or 14B model.s for top-tier local coding without compromises. The 30B adds agency if quantization is permissable. Run the 480B version in the cloud.

Quick Setup

PowerShell

$ ollama run qwen3-coder:14b

References

Mainframe image; Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License by PekoeBlaze

© Copyright 1994-2025 Michael Slinn. All rights reserved.
For requests to use this copyright-protected work in any manner, email mslinn@mslinn.com.

This website was made using Jekyll and Mike Slinn’s Jekyll Plugins.