From regex cuts to real syntax trees: coco’s diff pipeline moved off line-oriented pattern matching and onto a real parser – and the same parser now highlights the hunks you read in the Workstation.

A few releases back I wrote that coco’s regex-based structural diff extractors worked but had a ceiling – one that showed up the moment a diff contained anything more interesting than a top-level function declaration. The plan was to replace the line-oriented matcher with a real parser: tree-sitter, the same incremental parser library that backs syntax highlighting in Neovim and Helix and code navigation on GitHub.

That migration has now largely landed. This is the “how it actually went” follow-up: what the layered swap looks like in the shipped code, where the plan changed on contact with reality, and the bonus payoff that fell out of it – tree-sitter syntax highlighting in the diff viewer itself.

Where this fits in the pipeline

Quick recap for folks new to the series. Coco’s commit-message flow feeds a diff to an LLM and asks for a summary. The naive version just dumps the raw unified diff in. It works, but a lot of those lines are noise: moved imports, whitespace shifts, reformat churn, comment touch-ups. The model spends tokens reading noise and hands back generic summaries.

The fix was language-aware diff summaries: extract structure from each file (function signatures, class outlines, impl blocks) and either replace the raw diff or prepend the extract so the model gets higher-signal context per token. The regex-based extractors for TypeScript, JavaScript, Python, Rust, and Go landed in #927 and #928. They were intentionally conservative: catch the obvious cases, fall through to the raw diff otherwise.

They worked. They also exposed the ceiling.

Where the regex cuts fall short

Regex is great at “is there a line that starts with function or fn or def.” It is terrible at almost everything else a real codebase throws at you. A few of the cases that kept showing up in real commits:

  • TypeScript arrow exports. export const foo = () => { ... } is a function. The regex captures it as const foo.
  • Rust impl blocks. The cut surfaces the impl Widget for Frob header but treats everything inside as opaque body. Three new methods on the same impl all get collapsed into one entry.
  • Python class methods. Indented def gets skipped because the regex only matches module-level functions. Class internals are invisible.
  • Go grouped declarations. The var ( ... ) and const ( ... ) block forms confuse the line-oriented matcher.
  • False positives from strings and comments. A template literal with the word function in it. A multi-line comment mentioning fn. The regex doesn’t know it’s inside a string. Tree-sitter does.

You can keep adding edge cases to a regex until it’s three thousand lines and still wrong. Or you can hand the problem to something that actually parses the language.
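To make the failure modes concrete, here is a minimal sketch of a line-oriented matcher hitting two of the cases above. The pattern and the `regexCut` helper are illustrative assumptions, not coco’s actual regex.

```typescript
// A deliberately naive line-oriented matcher, in the spirit of the regex cut
// described above. Hypothetical pattern, not the one coco ships.
const DECLARATION = /^\s*(?:export\s+)?(?:async\s+)?(function|const|class)\s+(\w+)/;

interface RegexHit {
  kind: string;
  symbol: string;
}

function regexCut(source: string): RegexHit[] {
  const hits: RegexHit[] = [];
  for (const line of source.split("\n")) {
    const match = DECLARATION.exec(line);
    if (match) hits.push({ kind: match[1], symbol: match[2] });
  }
  return hits;
}

const sampleSource = [
  "export const foo = () => { return 1; };", // a function, reported as a const
  "const doc = `",
  "function notReal() {}", // text inside a template literal
  "`;",
].join("\n");

// Three hits: `const foo`, `const doc`, and a false-positive `function notReal`.
console.log(regexCut(sampleSource));
```

The matcher has no notion of being inside a string, so the only way to fix the third hit is to start tracking lexical state by hand, which is exactly the work a parser already does.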

What a real parser changed

Tree-sitter parsers are real parsers generated from a formal grammar for each language, not heuristics, and there’s a maintained grammar for basically every language you’d plausibly commit code in. With the structural cut coming from a syntax tree instead of a line-oriented regex, here’s what changed:

  • Real scope detection. No more false positives from function inside a template string or a doc comment. The parser knows the difference between code and the text that looks like code.
  • Receiver types on methods. For Rust and Go the cut now qualifies a method by its owner – Go renders Receiver.method, Rust surfaces the impl Trait for Type it belongs to – so the same method name on different types stays distinguishable instead of collapsing into one bare entry. (Python method extraction is still on the list.)
  • Structure over noise. A change to an existing symbol surfaces as a signature change entry naming exactly which symbols moved, rather than a generic “modified file.” Slicing that even finer – “added schema parameter to parseRequest” as its own line – is still a someday refinement; the syntax tree makes it possible, the formatter just doesn’t cut that fine yet.
  • Cross-language uniformity. The scope-walking dispatch is one shared registry, not a bespoke matcher per language. Adding a language is a registry entry plus a grammar, not a new extractor.
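A rough sketch of what a shared registry plus one scope-walking function could look like. Everything here is an assumption for illustration: the `SyntaxNode` shape is a hand-rolled stand-in for a parser node, the registry layout and output format are invented, and the tree is built by hand rather than parsed. The node type names mirror the ones I’d expect from the Rust and TypeScript grammars.

```typescript
// Minimal stand-in for a syntax node; real tree-sitter nodes expose far more.
interface SyntaxNode {
  type: string;
  label?: string;
  children: SyntaxNode[];
}

// Hypothetical registry entry: which node types are symbols, and which are
// owners that qualify the symbols nested inside them.
interface LanguageRule {
  symbolTypes: Set<string>;
  ownerTypes: Set<string>;
}

const languageRules: Record<string, LanguageRule> = {
  rs: { symbolTypes: new Set(["function_item"]), ownerTypes: new Set(["impl_item"]) },
  ts: {
    symbolTypes: new Set(["function_declaration", "method_definition"]),
    ownerTypes: new Set(["class_declaration"]),
  },
};

// One walker for every language; only the registry entry differs.
function collectSymbols(lang: string, root: SyntaxNode): string[] {
  const rule = languageRules[lang];
  if (!rule) return [];
  const found: string[] = [];
  const walk = (node: SyntaxNode, owner?: string): void => {
    if (rule.symbolTypes.has(node.type) && node.label) {
      found.push(owner ? `${node.label} [${owner}]` : node.label);
    }
    const nextOwner = rule.ownerTypes.has(node.type) ? (node.label ?? owner) : owner;
    for (const child of node.children) walk(child, nextOwner);
  };
  walk(root);
  return found;
}

// Hand-built tree standing in for parsed Rust source.
const rustFn = (label: string): SyntaxNode => ({ type: "function_item", label, children: [] });
const rustTree: SyntaxNode = {
  type: "source_file",
  children: [
    { type: "impl_item", label: "impl Widget for Frob", children: [rustFn("draw"), rustFn("resize")] },
    { type: "impl_item", label: "impl Widget for Gizmo", children: [rustFn("draw")] },
    rustFn("main"),
  ],
};

console.log(collectSymbols("rs", rustTree));
```

The two `draw` methods stay distinguishable because the owner travels down the walk with them, and supporting another language means adding one more entry to `languageRules`.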

Layered, not hard-swap

One thing I went back and forth on was whether to rip the regex extractors out the moment tree-sitter was wired up. The clean answer was yes; the pragmatic answer was no – and the pragmatic answer is what shipped. Each language runs a chain: tree-sitter first, regex as the fallback when the parser for that language isn’t available or doesn’t produce a clean cut. It’s one table –

ts -> [tree-sitter, regex]
js -> [tree-sitter, regex]
py -> [tree-sitter, regex]
rs -> [tree-sitter, regex]
go -> [tree-sitter, regex]

– and each language walks its chain, taking the first parser that returns a summary. A parser that throws is swallowed so the next one still gets a shot. The reasons the regex layer earned its keep:

  • Offline use. Coco runs against local models via Ollama for a non-trivial fraction of users, often on planes or trains or coffee shops with bad wifi. If the first run on a new machine had to download a parser before anything worked, that’s a bad first impression. The bundled languages and the regex fallback both work with no network.
  • Parser failures. Tree-sitter is good but not infallible. JSX-in-TS with unclosed tags, half-finished macros, weird preprocessor pragmas – any of these can trip the parser up badly enough that the cut comes back unusable or the extractor throws. The regex fallback isn’t smart, but it doesn’t crash.
  • Bisecting quality regressions. With both paths in place I can A/B them on the same commit and see whether the tree-sitter cut actually leads to better commit messages downstream. The day I’m confident across enough real diffs, the regex layer comes out. Not before.
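The chain walk described above (first parser that returns a summary wins, a throwing parser is swallowed) fits in a few lines. This is a sketch under assumptions: the `Extractor` type, the stub extractors, and the `summarize` name are all hypothetical, and the real extractors presumably take more than a source string.

```typescript
type Lang = "ts" | "js" | "py" | "rs" | "go";

// Hypothetical extractor shape: a summary, or null when there is no clean cut.
type Extractor = (source: string) => string | null;

// Stubs standing in for the real tree-sitter and regex extractors.
const brokenParser: Extractor = () => {
  throw new Error("grammar not loaded");
};
const regexStub: Extractor = (source) =>
  source.startsWith("fn ") ? "regex cut: one function" : null;

// Same shape as the table above: every language gets an ordered chain.
const chains: Record<Lang, Extractor[]> = {
  ts: [brokenParser, regexStub],
  js: [brokenParser, regexStub],
  py: [brokenParser, regexStub],
  rs: [brokenParser, regexStub],
  go: [brokenParser, regexStub],
};

function summarize(chain: Extractor[], source: string): string | null {
  for (const extract of chain) {
    try {
      const summary = extract(source);
      if (summary) return summary; // first usable summary wins
    } catch {
      // Swallowed on purpose so the next extractor still gets a shot.
    }
  }
  return null; // nothing produced a cut; the caller falls back to the raw diff
}

console.log(summarize(chains.rs, "fn main() {}"));
```

Returning `null` instead of throwing keeps the “fall through to the raw diff” decision in one place, at the call site.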

The knob lives at service.fastPath.languageAware, next to the existing fastPath.markdown path. It’s an object, not a bare boolean: enabled (default false until quality is proven across enough real commits) plus an optional languages list to opt in a subset (ts, js, py, rs, go). Once it flips on by default, the layering underneath is invisible.
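A sketch of that config shape as a type, with a helper that decides whether a given language takes the language-aware path. The field names `enabled` and `languages` come from the description above; the `isLanguageAwareFor` helper, and the choice to treat an omitted `languages` list as “all supported languages”, are assumptions.

```typescript
type Lang = "ts" | "js" | "py" | "rs" | "go";

interface LanguageAwareConfig {
  enabled: boolean; // default false for now
  languages?: Lang[]; // optional opt-in subset
}

// Hypothetical helper. Assumes an omitted `languages` list means "all of them".
function isLanguageAwareFor(config: LanguageAwareConfig | undefined, lang: Lang): boolean {
  if (!config?.enabled) return false;
  return config.languages ? config.languages.includes(lang) : true;
}

// Example value for service.fastPath.languageAware: on, but only for two languages.
const exampleConfig = {
  service: {
    fastPath: {
      languageAware: { enabled: true, languages: ["ts", "rs"] } as LanguageAwareConfig,
    },
  },
};

console.log(isLanguageAwareFor(exampleConfig.service.fastPath.languageAware, "rs"));
console.log(isLanguageAwareFor(exampleConfig.service.fastPath.languageAware, "py"));
```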

Packaging: bundle the popular ones, lazy-load the rest

Tree-sitter parsers are wasm modules – 200KB to about 2MB each. Bundling all of them roughly doubles the install size for someone who only ever writes TypeScript, and that math gets worse the more languages you support. So the package splits the difference:

  • Bundled in the box: TypeScript, TSX, and JavaScript – one grammar covers all three – plus the web-tree-sitter engine. That’s the common case, and it works fully offline on a fresh install.
  • Lazy-loaded on first use: Python, Rust, and Go. (I’d originally planned to bundle Python too; keeping the default install lean won out, so it joined the lazy set.) The first time the extractor sees a .py, .rs, or .go file, the grammar is pulled from a CDN, SHA-256-verified against a pinned manifest, and cached under the user’s cache directory (overridable via COCO_CACHE_DIR). Subsequent runs use the cached copy.

That keeps the default install small without making the polyglot experience worse than it needs to be. The cache lives outside the npm install dir on purpose – so it survives version bumps and doesn’t get re-downloaded every time you upgrade coco.
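Here is a minimal sketch of the fetch, verify, cache sequence using Node’s standard library. The function names, the default cache path, and the injected `fetchBytes` callback are assumptions made for the example; only the SHA-256 pin, the cache directory, and the `COCO_CACHE_DIR` override come from the description above. Re-checking the hash of a cached copy is a choice made in this sketch, not something the post confirms.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Injected so the sketch runs without a network; a real loader would hit the CDN.
type FetchBytes = (url: string) => Promise<Uint8Array>;

const sha256 = (bytes: Uint8Array): string =>
  createHash("sha256").update(bytes).digest("hex");

// Assumed default location; only the COCO_CACHE_DIR override is from the post.
function cacheDir(): string {
  return process.env.COCO_CACHE_DIR ?? join(homedir(), ".cache", "coco");
}

async function loadGrammar(
  grammar: string,
  url: string,
  pinnedSha256: string,
  fetchBytes: FetchBytes,
): Promise<Uint8Array> {
  const file = join(cacheDir(), `${grammar}.wasm`);

  // Cached copy wins, but only if it still matches the pinned hash.
  if (existsSync(file)) {
    const cached = readFileSync(file);
    if (sha256(cached) === pinnedSha256) return cached;
  }

  const bytes = await fetchBytes(url);
  if (sha256(bytes) !== pinnedSha256) {
    throw new Error(`checksum mismatch for ${grammar} grammar`);
  }

  mkdirSync(cacheDir(), { recursive: true });
  writeFileSync(file, bytes);
  return bytes;
}
```

Verifying before writing means a bad download never reaches the cache, so a failed fetch costs one retry rather than a poisoned cache directory.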

How this shows up in the Workstation

For the Workstation TUI, tree-sitter now shows up in two places – one under the hood, one right on the screen.

  • Compose view (g c). The structural cut feeds the model when it drafts a commit message: renamed functions read as renames, impl-block internals stop disappearing into opaque headers, the model has less noise to fight through. This one is invisible by design – it shapes the context handed to the model, not the UI you look at.
  • Diff view (g d). The diff itself stays literal – raw hunks, because that’s what a diff viewer is for. What changed is that those hunks are now syntax-highlighted by the same parser stack. More on that next.

The rule of thumb across the surfaces still holds: the closer you are to the raw bytes, the less any summarization intrudes. Diff view stays literal (and now legible). Compose view sits one level up and uses the structural cut as the model’s context. PR review (g p) sits another level up again – that’s a later post in this series.

The bonus: syntax-highlighted diffs

Here’s the payoff I didn’t see coming when I started. Wiring tree-sitter in for structural extraction meant the parser stack was already sitting there in the process – so it got a second job. The Workstation’s diff view now renders syntax-highlighted hunks: keywords, strings, types, functions, comments, and the rest picked out per token and mapped to whatever color theme you’re running. (#1117.)

A few details that matter for a diff viewer specifically:

  • Per-line, not whole-file. A diff hunk is a pile of line fragments, not a coherent document – so the tokenizer runs per line rather than trying to parse the surrounding file. That sidesteps the “this line references a type declared 200 lines up that isn’t in the hunk” problem.
  • Both layouts. It works in the unified view and the side-by-side split, layered on top of the add/remove/context coloring rather than fighting it.
  • Quiet degradation. No-color themes, a missing grammar, or a pathologically long line all fall back to plain text instead of erroring. The highlight pass runs async off the render path and caches per line, so scrolling never waits on it and a line that has been seen once isn’t tokenized again.
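The per-line cache and the fall-back-to-plain-text rules can be sketched like this. It is a simplified, synchronous illustration: the token shape, the length cutoff, the cache key, and the toy keyword tokenizer are all assumptions, and the async scheduling off the render path is left out.

```typescript
interface HighlightToken {
  text: string;
  scope?: string; // e.g. "keyword"; undefined means plain text
}

type LineTokenizer = (line: string) => HighlightToken[];

const MAX_LINE_LENGTH = 2000; // hypothetical cutoff for "pathologically long"
const lineCache = new Map<string, HighlightToken[]>();

function highlightLine(
  lang: string,
  line: string,
  tokenize: LineTokenizer | undefined, // undefined models a missing grammar
  colorEnabled: boolean,
): HighlightToken[] {
  const plain: HighlightToken[] = [{ text: line }];
  if (!colorEnabled || !tokenize || line.length > MAX_LINE_LENGTH) return plain;

  const key = `${lang}\u0000${line}`;
  const cached = lineCache.get(key);
  if (cached) return cached;

  let tokens: HighlightToken[];
  try {
    tokens = tokenize(line);
  } catch {
    tokens = plain; // a tokenizer failure degrades to plain text
  }
  lineCache.set(key, tokens);
  return tokens;
}

// Toy tokenizer standing in for the real grammar-backed one.
const KEYWORD = /^(const|function|return)$/;
const toyTokenizer: LineTokenizer = (line) =>
  line
    .split(/\b(const|function|return)\b/)
    .filter((part) => part.length > 0)
    .map((part) => (KEYWORD.test(part) ? { text: part, scope: "keyword" } : { text: part }));

console.log(highlightLine("ts", "const x = 1;", toyTokenizer, true));
```

Every exit path returns a token list, so the renderer never has to know whether highlighting actually happened.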

It’s on by default (logTui.syntaxHighlight). So the title earns itself twice over: tree-sitter makes the diff coco hands the model sharper, and it makes the diff coco hands you easier to read.

Where things stand

As of this post, the migration has largely landed. All five languages – TypeScript, JavaScript, Python, Rust, and Go – have tree-sitter extractors live, each backed by the regex layer as a fallback. The work is tracked in issue #933: the foundation (parser loader, lazy-load infra, the layered registry), the bundled-vs-CDN packaging split, and the per-language extractors all went in, one piece at a time. What’s left is mostly polish – telemetry for when the whole chain falls through to the raw diff, and the eventual call to flip languageAware.enabled on by default once the A/B says the tree-sitter cut reliably produces better commit messages.

The build/test problem I flagged as unsolved last time – getting tree-sitter wasm to load in a Jest setup that never previously touched native modules – is sorted. web-tree-sitter@0.26 shipping dual CommonJS/ESM exports did most of the heavy lifting (no more new Function() import shim). The rest was plumbing: a pretest step that copies the wasm into dist/, a guard that skips the grammar-dependent tests when the wasm isn’t present, and a small reset seam so the process-lifetime parser cache doesn’t leak state between tests. If you hit the same wall, those three pieces are the whole trick.
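For anyone hitting the same wall, here is a rough sketch of the guard and the reset seam. The names (`getParser`, `resetParserCacheForTests`, `suiteFor`) and the wasm path in the usage comment are hypothetical; the sketch only shows the shape of the two pieces, not the code coco ships.

```typescript
import { existsSync } from "node:fs";

// Process-lifetime cache: parsers are expensive to set up, so keep one per language.
const parserCache = new Map<string, object>();

function getParser(lang: string, create: (lang: string) => object): object {
  let parser = parserCache.get(lang);
  if (!parser) {
    parser = create(lang);
    parserCache.set(lang, parser);
  }
  return parser;
}

// Reset seam: only tests call this, so cached parsers don't leak between cases.
function resetParserCacheForTests(): void {
  parserCache.clear();
}

// Guard: choose the real suite runner only when the wasm file is on disk.
function suiteFor<T>(wasmPath: string, run: T, skip: T): T {
  return existsSync(wasmPath) ? run : skip;
}

// Usage inside a Jest file (path is illustrative):
//   const describeIfWasm = suiteFor("dist/tree-sitter.wasm", describe, describe.skip);
//   afterEach(resetParserCacheForTests);
//   describeIfWasm("typescript extractor", () => { /* grammar-dependent tests */ });
```

Passing `describe` and `describe.skip` in as values keeps the guard free of any Jest import, so the same helper works in whatever runner the suite moves to.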

Next up in the BYO Git Workstation series: how the PR review view handles structural extracts across many files at once. Curious to hear if anyone else has built a tree-sitter-backed code analysis pipeline and run into the same packaging trade-offs.

Griffen Fargo

griffen.codes

© 2026, all rights reserved