Published 2026-02-18.
Time to read: 7 minutes.
llm collection.
I asked Gemini to explain why it failed at a simple task, and to recommend LLMs better suited to this task.
I found that Gemini is unable to copy comments between source files. I want a recommendation of which LLM would perform the following best. Choices are Sonnet 4.6, MiniMax-M2 v2.5, and Grok. You do not need to see the files in order to make a recommendation.
docs/diagnositc_apis_evolution.md is a transcript that shows how cmd/api_diff/demo2 was created. Unfortunately, the comments from cmd/api_diff/demo were not preserved in cmd/api_diff/demo2.
-
Copy the original comments from cmd/api_diff/demo to cmd/api_diff/demo2, adjusting the information as appropriate. Do this while dialogueing with the user. Use a “Draft-Review-Commit” cycle for every single comment block. Do not make wholesale changes without approval. Instead, for each block of comments in cmd/api_diff/demo:
- Consider if it is relevant for cmd/api_diff/demo2, or how it might become relevant if it was edited.
- Show the user the new text, in context so he can understand what the final result might look like.
- Make the change once the user approves.
-
Notice that some stated requirements may have drifted during the dialog; this is undesirable and unintentional and must be addressed. Maintain a “Requirements Traceability Matrix” in your internal memory to catch any drift.
-
Consider additional comments that would be helpful for a human to understand why this peculiar code must operate in this fashion.
-
Consider additional comments that would assist an LLM to convert the Bash code to Go at an unspecified later date.
-
Look for errors in cmd/api_diff/demo2 that might cause it to produce different results from cmd/api_diff/demo
-
Look for errors in cmd/api_diff/demo2 that might cause it to deviate from the desired behavior as per docs/diagnositc_apis_evolution.md
Based on the specific requirements of maintaining strict context, handling multi-file cross-referencing, and the “dialogue-first” approach for incremental approvals, here is the assessment of your three choices.
Recommendation: Claude 3.5 Sonnet (often referred to in current contexts as
the 4.x lineage)
Among your choices, Sonnet is the strongest candidate for this specific task. While MiniMax-M2 and Grok have made strides in raw speed and context window size, they often struggle with the “meticulous” nature of points 1 and 2—specifically, the restraint required not to make wholesale changes and the ability to track “requirement drift” over a long dialogue.
| Feature | Sonnet | MiniMax-M2 v2.5 | Grok |
|---|---|---|---|
| Logic & Reasoning | Superior; best at “thinking” before editing. | Strong, but tends toward completion over critique. | High energy, but can be prone to “creative” hallucinations. |
| Instruction Following | Excellent adherence to “wait for approval.” | Good, but often defaults to “here is the whole file.” | Can be “chatty” and miss negative constraints. |
| Refactoring/Go Context | Industry standard for idiomatic Go/Bash. | Reliable, but less nuanced with “why” comments. | Improving, but less proven in complex repo-wide logic. |
Why Sonnet Wins for Your Workflow
-
State Management (Point 2): Sonnet is currently the best at identifying when a conversation has veered off-track. It can “look back” at the initial requirements in the transcript (
docs/diagnositc_apis_evolution.md) and point out contradictions introduced during the chat. -
Incremental Execution (Point 1): You need a model that won’t get “lazy” and dump a 500-line file after the third iteration. Sonnet’s system prompt adherence is currently the most stable for “show me the diff/context, then wait.”
-
Cross-Language Intent (Point 4): Since you plan to convert Bash to Go later, you need a model that understands the limitations of Bash (like error handling and array nesting) and can write comments that specifically warn a future LLM about those hurdles.
Potential Implementation Plan
To get the best results from Sonnet for this specific task, I recommend the following setup:
-
Step A: Provide the transcript and both files as a “knowledge base” or initial context.
-
Step B: Explicitly instruct it to use a “Draft-Review-Commit” cycle for every single comment block.
-
Step C: Ask it to maintain a “Requirements Traceability Matrix” in its internal memory to catch the drift mentioned in Point 2.
Would you like me to draft the specific system prompt you should use to initialize this session with Sonnet?
excellent suggestions. why is gemini unable to perform this simple task?
It might seem surprising that a model as powerful as Gemini 3 Flash or Pro struggles with something as “simple” as copying and adapting comments. However, the failure usually stems from how Gemini manages its “attention” and “context” in the consumer web interface versus how a model like Claude 3.5 Sonnet operates.
Here is why Gemini likely failed your request:
1. The “Context Slicing” Problem
Gemini uses a technique often called Context Slicing or RAG (Retrieval-Augmented Generation) behind the scenes in the web UI.
-
How it works: To save compute power, Gemini doesn’t always keep every line of your uploaded files in “hot memory.” Instead, it skims the files and pulls out what it thinks is relevant.
-
The Failure: Comments are often treated as “low-priority noise” by retrieval algorithms. When you ask it to sync comments, Gemini’s “skimmer” may ignore them to focus on the “important” executable code, leading it to report that it can’t find them or simply skipping them during the output.
2. “Instruction Drift” and Laziness
Gemini is optimized for speed and conversation, whereas Sonnet is optimized for precision and reasoning.
-
Gemini’s Bias: Gemini has a strong tendency to summarize. When asked to perform a meticulous, line-by-line task, it often defaults to “wholesale” rewrites because generating a fresh, clean file is computationally “cheaper” for the model than tracking 50 individual state changes across a dialogue.
-
The “Simple” Paradox: While copying is simple for a human, for an LLM, it requires exact token-to-token mapping. Gemini’s architecture is more “fluid”—it prefers to paraphrase rather than replicate exactly, which is the opposite of what you need for code comments.
3. State Management in Dialogue
Your specific requirement—“Do this while dialogueing with the user”—is a high-level state-tracking task.
- Gemini 3 Flash is excellent at quick answers, but in a long dialogue, it often “forgets” the negative constraints (e.g., “Do not make wholesale changes”). * By turn 4 or 5, Gemini often loses the “drift” you mentioned in Point 2, whereas Sonnet’s “System Prompt” adherence is currently the industry benchmark for staying within the guardrails of a complex, multi-step instruction set.
Comparison of Performance for Your Task
| Issue | Gemini’s Likely Behavior | Sonnet’s Likely Behavior |
|---|---|---|
| Comment Preservation | Often strips them to “clean up” the code. | Treats them as vital semantic data. |
| Requirement Drift | High; tends to follow the latest prompt only. | Low; constantly cross-references the initial doc. |
| Incremental Edits | Likely to dump the whole file anyway. | Will provide the specific “Context -> Proposed Change” block. |
Would you like me to provide a “Master Prompt” designed for Sonnet that enforces the 6-point workflow you outlined?
you recommended sonnet 3.5, but sonnet 4.6 was released yesterday. are you aware of any differences between thise two versions for this task?
Actually, you’ve timed this perfectly. Claude Sonnet 4.6 was indeed just released (February 17, 2026), and it introduces several specific upgrades that make it even more suitable for your task than the older 3.5 version.
Here are the key differences between Sonnet 3.5 and 4.6 that matter for your “comment preservation and drift” project:
1. Adaptive Thinking (The “Dialogue-First” Engine)
Sonnet 4.6 introduces Adaptive Thinking. While 3.5 was a “straight-to-output” model, 4.6 can pause to “think” internally about how to decompose a complex task.
- Relevance to you: For your Point 2 (Requirement Drift), 4.6 is
significantly better at cross-referencing your current dialogue with the initial
transcript. It can catch itself before it makes a change that contradicts the
diagnostic_apis_evolution.mdhistory.
2. Massive Context Window (1 Million Tokens)
While Sonnet 3.5 had a 200k window, Sonnet 4.6 has a 1M token context window (currently in beta).
- Relevance to you: Since you want to maintain a long dialogue without making wholesale changes, the “memory” of that dialogue needs to stay perfectly intact. In smaller windows, models eventually start “forgetting” early constraints (like your instruction not to change code without approval). 4.6 will hold your entire multi-turn conversation in its “hot” memory without compression.
3. Agentic Precision (Instruction Following)
Anthropic has positioned 4.6 as an “agentic” model. This means it is better at following a multi-step checklist (like your 1–6 point plan) without skipping steps.
- Sonnet 3.5: Might get “eager” and skip from step 1 to step 5 if it thinks it knows the answer.
- Sonnet 4.6: Is designed to “loop” through instructions. It is much more likely to respect the “Show text -> Wait for approval -> Make change” cycle you’ve requested.
Final Recommendation
Stick with Sonnet 4.6. It is essentially a “smarter 3.5” with a much longer memory and the ability to reason about its own plan before it writes a single line of code.
Would you like me to generate the exact system prompt for Sonnet 4.6 to ensure it adheres to your incremental approval workflow and monitors for that requirement drift?
I would like to say that I have never experienced Sonnet 3.5 becoming eager and skipping steps. Other LLMs definitely have this problem, however.
haiku has not yet been updated like opus and sonnet have. how would the current version of haiku compare for this task?
Haiku 3.5, while a speed champion, is essentially the “budget” option in the Claude family. For the specific task of meticulously copying and adapting comments based on a transcript, it would likely struggle in ways similar to Gemini.
Here is how Haiku 3.5 compares to Sonnet 4.6 for your specific requirements:
1. Instruction Following (The “Approval” Workflow)
-
Sonnet 4.6: Features a new Adaptive Thinking Engine. It is designed to “pause” and verify its plan against your instructions. If you tell it “Wait for approval,” it has the reasoning depth to treat that as a hard constraint.
-
Haiku 3.5: Is optimized for throughput. It has a tendency to “rush.” In a multi-turn dialogue, Haiku often collapses your instructions and might provide the whole file (the “wholesale change” you want to avoid) just to be helpful and fast.
2. Context Retention and Drift (Point 2)
-
Sonnet 4.6: Boasts a 1-million-token context window and is significantly better at “long-horizon” tasks. It can remember the specific nuance of a comment from 20 messages ago and cross-reference it with your evolution transcript.
-
Haiku 3.5: Has a 200k context window. While large, its actual “reasoning density” over that window is lower. As the conversation grows, Haiku is more prone to “context rot,” where it begins to lose the subtle distinction between the original requirements and the drifted ones.
3. Reasoning Depth (Points 3 & 4)
-
Sonnet 4.6: Can handle the “Why” logic. It can look at your Bash code and infer why an LLM might struggle to convert it to Go later, then write comments specifically to mitigate those future risks.
-
Haiku 3.5: Is better at pattern matching than deep inference. It will copy comments accurately, but its ability to suggest new comments that aid a future Go conversion will be more generic and less insightful.
Comparison Summary
| Feature | Haiku 3.5 | Sonnet 4.6 |
|---|---|---|
| Primary Strength | Speed & Low Cost | Deep Reasoning & Agentic Planning |
| Dialogue Style | Brief, direct, sometimes “lazy” with constraints. | Methodical, verifies context, follows complex loops. |
| Best Use Case | Bulk text processing or simple chat. | Refactoring, code auditing, and precise editing. |
Verdict: For a task involving “unannounced products” and “peculiar code behavior,” the intelligence gap between Haiku 3.5 and Sonnet 4.6 is significant. Using Haiku would likely require more “hand-holding” from you to prevent it from deviating.
Would you like me to generate a “Safe-Edit” prompt for Sonnet 4.6 that forces it to use its thinking time specifically to check for requirement drift before it proposes any comment changes?