
May 02, 2025 · 6 min read

From regex cuts to real syntax trees: coco’s diff pipeline moved off line-oriented pattern matching and onto a real parser – and the same parser now highlights the hunks you read in the Workstation.
A few releases back I wrote that coco’s regex-based structural diff extractors worked but had a ceiling – one that showed up the moment a diff contained anything more interesting than a top-level function declaration. The plan was to replace the line-oriented matcher with a real parser: tree-sitter, the same incremental parser library that backs syntax highlighting in Neovim, Helix, and GitHub’s code search.
That migration has now largely landed. This is the “how it actually went” follow-up: what the layered swap looks like in the shipped code, where the plan changed on contact with reality, and the bonus payoff that fell out of it – tree-sitter syntax highlighting in the diff viewer itself.
Quick recap for folks new to the series. Coco’s commit-message flow feeds a diff to an LLM and asks for a summary. The naive version just dumps the raw unified diff in. It works, but a lot of those lines are noise: moved imports, whitespace shifts, reformat churn, comment touch-ups. The model spends tokens reading noise and hands back generic summaries.
The fix was language-aware diff summaries: extract structure from each file (function signatures, class outlines, impl blocks) and either replace the raw diff or prepend the extract so the model gets higher-signal context per token. The regex-based extractors for TypeScript, JavaScript, Python, Rust, and Go landed in #927 and #928. They were intentionally conservative: catch the obvious cases, fall through to the raw diff otherwise.
They worked. They also exposed the ceiling.
Regex is great at “is there a line that starts with function or fn or def.” It is terrible at almost everything else a real codebase throws at you. A few of the cases that kept showing up in real commits:
- export const foo = () => { ... } is a function. The regex captures it as const foo.
- The regex catches the impl Widget for Frob header but treats everything inside as opaque body. Three new methods on the same impl all get collapsed into one entry.
- A def inside a class gets skipped because the regex only matches module-level functions. Class internals are invisible.
- Go’s var ( ... ) and const ( ... ) block forms confuse the line-oriented matcher.
- A template string with function in it. A multi-line comment mentioning fn. The regex doesn’t know it’s inside a string. Tree-sitter does.
You can keep adding edge cases to a regex until it’s three thousand lines and still wrong. Or you can hand the problem to something that actually parses the language.
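Two of those cases are easy to reproduce. Here is a minimal sketch of a line-oriented matcher; the pattern and the extractNames helper are hypothetical stand-ins, not coco's actual extractor. It reports the arrow function as a plain const and treats text inside a template string as a real function declaration.

```typescript
// Hypothetical line-oriented extractor, roughly the shape of a regex-based cut.
// The pattern is illustrative only; it is not coco's real regex.
const DECL = /^\s*(?:export\s+)?(function|const|def|fn)\s+(\w+)/;

function extractNames(source: string): string[] {
  const out: string[] = [];
  for (const line of source.split("\n")) {
    const m = DECL.exec(line);
    // Each line is judged in isolation: no notion of strings, comments, or scope.
    if (m) out.push(`${m[1]} ${m[2]}`);
  }
  return out;
}

const sample = [
  "export const foo = () => { return 1; };",
  "const help = `",
  "  function notReal() is only text inside a template string",
  "`;",
].join("\n");

const names = extractNames(sample);
// names is ["const foo", "const help", "function notReal"]:
// foo is labeled a const rather than a function, and notReal is a false positive.
```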
Tree-sitter grammars compile into real parsers, not heuristics, and there’s a maintained grammar for basically every language you’d plausibly commit code in. With the structural cut coming from a syntax tree instead of a line-oriented regex, here’s what changed:
- No false hits on function inside a template string or a doc comment. The parser knows the difference between code and the text that looks like code.
- Methods keep their owner. Go reports Receiver.method, Rust surfaces the impl Trait for Type it belongs to – so the same method name on different types stays distinguishable instead of collapsing into one bare entry. (Python method extraction is still on the list.)
- A changed signature gets a signature change entry naming exactly which symbols moved, rather than a generic “modified file.” Slicing that even finer – “added schema parameter to parseRequest” as its own line – is still a someday refinement; the syntax tree makes it possible, the formatter just doesn’t cut that fine yet.
One thing I went back and forth on was whether to rip the regex extractors out the moment tree-sitter was wired up. The clean answer was yes; the pragmatic answer was no – and the pragmatic answer is what shipped. Each language runs a chain: tree-sitter first, regex as the fallback when the parser for that language isn’t available or doesn’t produce a clean cut. It’s one table –
ts -> [tree-sitter, regex]
js -> [tree-sitter, regex]
py -> [tree-sitter, regex]
rs -> [tree-sitter, regex]
go -> [tree-sitter, regex]
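– and each language walks its chain, taking the first parser that returns a summary. If a parser throws, the error is swallowed so the next one still gets a shot. The regex layer earned its keep in exactly those cases: when a language’s grammar isn’t available, or when the parser doesn’t produce a clean cut.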
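The chain walk can be sketched in a few lines. Everything named here (Extractor, CHAINS, summarize, and the two stub layers) is hypothetical and stands in for the real tree-sitter and regex extractors. It is synchronous for brevity, whereas real wasm loading would be async.

```typescript
// Sketch of a layered registry. All names are assumptions for illustration.
type Lang = "ts" | "js" | "py" | "rs" | "go";

interface Extractor {
  name: string;
  // Returns a structural summary, or null when it cannot produce a clean cut.
  extract(diff: string): string | null;
}

// Stub layers: a real tree-sitter layer would parse; this one only simulates
// "grammar unavailable" (throws) and "no clean cut" (returns null).
const treeSitter: Extractor = {
  name: "tree-sitter",
  extract(diff) {
    if (diff.includes("<<no-grammar>>")) throw new Error("grammar unavailable");
    return diff.includes("fn ") ? "tree-sitter summary" : null;
  },
};

const regex: Extractor = {
  name: "regex",
  extract(diff) {
    return /^[+-]\s*(?:function|def|fn)\s/m.test(diff) ? "regex summary" : null;
  },
};

const CHAINS: Record<Lang, Extractor[]> = {
  ts: [treeSitter, regex],
  js: [treeSitter, regex],
  py: [treeSitter, regex],
  rs: [treeSitter, regex],
  go: [treeSitter, regex],
};

function summarize(lang: Lang, diff: string): string | null {
  for (const extractor of CHAINS[lang]) {
    try {
      const summary = extractor.extract(diff);
      if (summary !== null) return summary; // first layer with a result wins
    } catch {
      // Swallow the failure so the next layer still gets a shot.
    }
  }
  return null; // whole chain fell through: the caller uses the raw diff
}
```

Returning null instead of throwing at the end keeps the raw diff as the last resort, which matches the conservative behavior of the original regex extractors.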
The knob lives at service.fastPath.languageAware, next to the existing fastPath.markdown path. It’s an object, not a bare boolean: enabled (default false until quality is proven across enough real commits) plus an optional languages list to opt in a subset (ts, js, py, rs, go). Once it flips on by default, the layering underneath is invisible.
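As a rough sketch of how that object might be typed and consulted: the field names enabled and languages come from the description above, while the LanguageAwareConfig type, the usesLanguageAware helper, and the rule that an omitted languages list means every supported language are assumptions.

```typescript
// Hypothetical typing for service.fastPath.languageAware.
type Lang = "ts" | "js" | "py" | "rs" | "go";

interface LanguageAwareConfig {
  enabled?: boolean;   // treated as false when omitted
  languages?: Lang[];  // optional opt-in subset (assumption: omitted = all)
}

function usesLanguageAware(cfg: LanguageAwareConfig | undefined, lang: Lang): boolean {
  if (!cfg?.enabled) return false;
  return cfg.languages === undefined || cfg.languages.includes(lang);
}

// Example: opt in for TypeScript and Rust only.
const languageAware: LanguageAwareConfig = { enabled: true, languages: ["ts", "rs"] };
```

With this shape, usesLanguageAware(languageAware, "rs") is true and usesLanguageAware(languageAware, "py") is false, so a subset rollout needs no separate flag per language.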
Tree-sitter parsers are wasm modules – 200KB to about 2MB each. Bundling all of them roughly doubles the install size for someone who only ever writes TypeScript, and that math gets worse the more languages you support. So the package splits the difference:
- The TypeScript and JavaScript grammars ship inside the package.
- The first time coco meets a .py, .rs, or .go file, the grammar is pulled from a CDN, SHA-256-verified against a pinned manifest, and cached under the user’s cache directory (overridable via COCO_CACHE_DIR). Subsequent runs use the cached copy.
That keeps the default install small without making the polyglot experience worse than it needs to be. The cache lives outside the npm install dir on purpose – so it survives version bumps and doesn’t get re-downloaded every time you upgrade coco.
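The verify-then-cache step might look roughly like the sketch below. It uses only Node's built-in crypto, os, and path modules. The Manifest shape, the function names, the grammars subdirectory, and the fallback cache location are all assumptions; only SHA-256 verification against a pinned manifest and the COCO_CACHE_DIR override come from the description above.

```typescript
import { createHash } from "node:crypto";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical manifest: grammar file name -> expected SHA-256 hex digest.
type Manifest = Record<string, string>;

function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Returns the bytes unchanged if they match the pinned digest, otherwise throws.
function verifyGrammar(name: string, bytes: Uint8Array, manifest: Manifest): Uint8Array {
  const expected = manifest[name];
  if (expected === undefined) throw new Error(`no pinned digest for ${name}`);
  if (sha256Hex(bytes) !== expected) throw new Error(`digest mismatch for ${name}`);
  return bytes;
}

// COCO_CACHE_DIR wins when set; the fallback path here is an assumption.
function grammarCachePath(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const root = env["COCO_CACHE_DIR"] ?? path.join(os.homedir(), ".cache", "coco");
  return path.join(root, "grammars", name);
}
```

Verifying before writing to the cache means a corrupted or tampered download never becomes the copy that later runs trust.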
For the Workstation TUI, tree-sitter now shows up in two places – one under the hood, one right on the screen.
- Compose (g c). The structural cut feeds the model when it drafts a commit message: renamed functions read as renames, impl-block internals stop disappearing into opaque headers, the model has less noise to fight through. This one is invisible by design – it shapes the context handed to the model, not the UI you look at.
- Diff view (g d). The diff itself stays literal – raw hunks, because that’s what a diff viewer is for. What changed is that those hunks are now syntax-highlighted by the same parser stack. More on that next.
The rule of thumb across the surfaces still holds: the closer you are to the raw bytes, the less any summarization intrudes. Diff view stays literal (and now legible). Compose view sits one level up and uses the structural cut as the model’s context. PR review (g p) sits another level up again – that’s a later post in this series.
Here’s the payoff I didn’t see coming when I started. Wiring tree-sitter in for structural extraction meant the parser stack was already sitting there in the process – so it got a second job. The Workstation’s diff view now renders syntax-highlighted hunks: keywords, strings, types, functions, comments, and the rest picked out per token and mapped to whatever color theme you’re running. (#1117.)
A few details that matter for a diff viewer specifically:
It’s on by default (logTui.syntaxHighlight). So the title earns itself twice over: tree-sitter makes the diff coco hands the model sharper, and it makes the diff coco hands you easier to read.
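For a feel of the per-token mapping, here is a minimal sketch that colors one hunk line while leaving the diff marker untouched. The Token shape, the THEME table, and both function names are hypothetical; coco's actual renderer and theme format are not shown in this post.

```typescript
// Hypothetical token shape: spans over the code part of a hunk line
// (the text after the leading +, -, or space marker).
type TokenKind = "keyword" | "string" | "type" | "function" | "comment";
interface Token { start: number; end: number; kind: TokenKind }

// Illustrative theme: token kind -> ANSI SGR foreground color code.
const THEME: Record<TokenKind, number> = {
  keyword: 35, string: 32, type: 36, function: 34, comment: 90,
};

function paint(code: string, tokens: Token[], theme: Record<TokenKind, number> = THEME): string {
  let out = "";
  let pos = 0;
  for (const t of [...tokens].sort((a, b) => a.start - b.start)) {
    if (t.start < pos) continue;      // skip spans that overlap an earlier one
    out += code.slice(pos, t.start);  // text between tokens stays uncolored
    out += `\x1b[${theme[t.kind]}m${code.slice(t.start, t.end)}\x1b[0m`;
    pos = t.end;
  }
  return out + code.slice(pos);
}

function renderHunkLine(line: string, tokens: Token[]): string {
  // Keep the diff marker literal; only the code after it is colored.
  return line.slice(0, 1) + paint(line.slice(1), tokens);
}
```

Stripping the escape sequences from the result gives back the original line, so the diff itself stays literal and only its presentation changes.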
As of this post, the migration has largely landed. All five languages – TypeScript, JavaScript, Python, Rust, and Go – have tree-sitter extractors live, each backed by the regex layer as a fallback. The work is tracked in issue #933: the foundation (parser loader, lazy-load infra, the layered registry), the bundled-vs-CDN packaging split, and the per-language extractors all went in, one piece at a time. What’s left is mostly polish – telemetry for when the whole chain falls through to the raw diff, and the eventual call to flip languageAware.enabled on by default once the A/B says the tree-sitter cut reliably produces better commit messages.
The build/test problem I flagged as unsolved last time – getting tree-sitter wasm to load in a Jest setup that never previously touched native modules – is sorted. web-tree-sitter@0.26 shipping dual CommonJS/ESM exports did most of the heavy lifting (no more new Function() import shim). The rest was plumbing: a pretest step that copies the wasm into dist/, a guard that skips the grammar-dependent tests when the wasm isn’t present, and a small reset seam so the process-lifetime parser cache doesn’t leak state between tests. If you hit the same wall, those three pieces are the whole trick.
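Two of those pieces, the guard and the reset seam, can be sketched as follows. The names getParser, resetParserCacheForTests, and grammarPresent, plus the wasm path in the comment, are hypothetical. The loader is synchronous here to keep the sketch short, whereas real wasm loading is async.

```typescript
import { existsSync } from "node:fs";

// Process-lifetime cache keyed by language. The parser type is left generic
// because the real parser type is not shown in the post.
const parserCache = new Map<string, unknown>();

function getParser<T>(lang: string, load: () => T): T {
  if (!parserCache.has(lang)) parserCache.set(lang, load());
  return parserCache.get(lang) as T;
}

// Reset seam: tests call this between cases so cached parsers do not leak.
function resetParserCacheForTests(): void {
  parserCache.clear();
}

// Guard: grammar-dependent suites run only when the wasm was copied into dist/.
function grammarPresent(wasmPath: string): boolean {
  return existsSync(wasmPath);
}

// In a Jest file this might read (path is illustrative):
//   const suite = grammarPresent("dist/tree-sitter-rust.wasm") ? describe : describe.skip;
//   suite("rust extractor", () => {
//     afterEach(resetParserCacheForTests);
//     /* grammar-dependent tests */
//   });
```

Choosing between describe and describe.skip at module load keeps skipped suites visible in the test report instead of silently passing.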
Next up in the BYO Git Workstation series: how the PR review view handles structural extracts across many files at once. Curious to hear if anyone else has built a tree-sitter-backed code analysis pipeline and run into the same packaging trade-offs.