Arborist

Pure Go tree-sitter-inspired AST parsing library.

go get git.sharkk.net/go/arborist

Arborist loads the same parse-table format that tree-sitter's C runtime uses. Grammar tables are extracted from upstream parser.c files by ts2go, compressed into binary blobs, and deserialized on first use. 209 grammars ship in the registry.

Motivation

Every Go tree-sitter binding in the ecosystem depends on CGo:

  • Cross-compilation requires a C cross-toolchain per target. Builds for GOOS=wasip1, for GOARCH=arm64 from an amd64 Linux host, or for Windows without MSYS2/MinGW will not link.
  • CI images must carry gcc and the grammar's C sources. go install fails for downstream users who don't have a C compiler.
  • The Go race detector, coverage instrumentation, and fuzzer cannot see across the CGo boundary. Bugs in the C runtime or in FFI marshaling are invisible to go test -race.

Arborist eliminates the C dependency entirely. The parser, lexer, query engine, incremental reparsing, arena allocator, external scanners, and tree cursor are all implemented in Go. The only input is the grammar blob.

Quick start

import (
    "fmt"

    "git.sharkk.net/go/arborist"
    "git.sharkk.net/go/arborist/grammars"
)

func main() {
    src := []byte(`package main

func main() {}
`)

    lang := grammars.GoLanguage()
    parser := arborist.NewParser(lang)

    tree, _ := parser.Parse(src)
    fmt.Println(tree.RootNode())
}

grammars.DetectLanguage("main.go") resolves a filename to the appropriate LangEntry.

Queries

q, _ := arborist.NewQuery(`(function_declaration name: (identifier) @fn)`, lang)
cursor := q.Exec(tree.RootNode(), lang, src)

for {
    match, ok := cursor.NextMatch()
    if !ok {
        break
    }
    for _, cap := range match.Captures {
        fmt.Println(cap.Node.Text(src))
    }
}

The query engine supports the full S-expression pattern language: structural quantifiers (?, *, +), alternation ([...]), field constraints, negated fields (!field), anchors (.), and all standard predicates. See Query API.
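
As an example, a single pattern can combine alternation with a predicate. A hedged sketch against the Go grammar (node and field names as used in tree-sitter-go; the capture name is arbitrary):

```scheme
; Capture exported function and method names: alternation over two
; node shapes, field constraints, and a #match? predicate.
(
  [
    (function_declaration name: (identifier) @name)
    (method_declaration name: (field_identifier) @name)
  ]
  (#match? @name "^[A-Z]")
)
```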

Incremental reparsing

tree, _ := parser.Parse(src)

// User types "x" at byte offset 42
src = append(src[:42], append([]byte("x"), src[42:]...)...)

tree.Edit(arborist.InputEdit{
    StartByte:   42,
    OldEndByte:  42,
    NewEndByte:  43,
    StartPoint:  arborist.Point{Row: 3, Column: 10},
    OldEndPoint: arborist.Point{Row: 3, Column: 10},
    NewEndPoint: arborist.Point{Row: 3, Column: 11},
})

tree2, _ := parser.ParseIncremental(src, tree)

ParseIncremental walks the old tree's spine, identifies the edit region, and reuses unchanged subtrees by reference. Only the invalidated span is re-lexed and re-parsed. Both leaf and non-leaf subtrees are eligible for reuse; non-leaf reuse is driven by pre-goto state tracking on interior nodes, so the parser can skip entire subtrees without re-deriving their contents.
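
Computing the point fields of an edit by hand is error-prone. A self-contained helper that derives them from byte offsets, using local stand-in types shaped like the ones above (assumptions for illustration, not the library's definitions):

```go
package main

import (
	"bytes"
	"fmt"
)

// Point and InputEdit are local stand-ins shaped like the fields used
// above — illustrative only, not arborist's definitions.
type Point struct{ Row, Column uint32 }

type InputEdit struct {
	StartByte, OldEndByte, NewEndByte    uint32
	StartPoint, OldEndPoint, NewEndPoint Point
}

// pointAt computes the row/column of a byte offset within src.
func pointAt(src []byte, offset int) Point {
	before := src[:offset]
	row := bytes.Count(before, []byte("\n"))
	col := offset - (bytes.LastIndexByte(before, '\n') + 1)
	return Point{Row: uint32(row), Column: uint32(col)}
}

// insertionEdit builds the InputEdit describing insertion of text at offset.
func insertionEdit(src, text []byte, offset int) InputEdit {
	start := pointAt(src, offset)
	newSrc := append(append(append([]byte{}, src[:offset]...), text...), src[offset:]...)
	return InputEdit{
		StartByte:   uint32(offset),
		OldEndByte:  uint32(offset),
		NewEndByte:  uint32(offset + len(text)),
		StartPoint:  start,
		OldEndPoint: start,
		NewEndPoint: pointAt(newSrc, offset+len(text)),
	}
}

func main() {
	src := []byte("package main\n\nfunc main() {}\n")
	edit := insertionEdit(src, []byte("x"), 14)
	fmt.Printf("start %d:%d, new end byte %d\n",
		edit.StartPoint.Row, edit.StartPoint.Column, edit.NewEndByte)
}
```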

When no edit has occurred, ParseIncremental detects the no-op with a nil-edit pointer check and returns in single-digit nanoseconds with zero allocations.

Tree cursor

TreeCursor maintains an explicit (node, childIndex) frame stack. Parent, child, and sibling movement are O(1) with zero allocations — sibling traversal indexes directly into the parent's children[] slice.

c := arborist.NewTreeCursorFromTree(tree)

c.GotoFirstChild()
c.GotoChildByFieldName("body")

for ok := c.GotoFirstNamedChild(); ok; ok = c.GotoNextNamedSibling() {
    fmt.Printf("%s at %d\n", c.CurrentNodeType(), c.CurrentNode().StartByte())
}

idx := c.GotoFirstChildForByte(128)

Movement methods: GotoFirstChild, GotoLastChild, GotoNextSibling, GotoPrevSibling, GotoParent, named-only variants (GotoFirstNamedChild, etc.), field-based (GotoChildByFieldName, GotoChildByFieldID), and position-based (GotoFirstChildForByte, GotoFirstChildForPoint).

Cursors hold direct pointers into tree nodes; recreate them after Tree.Release(), Tree.Edit(...), or an incremental reparse.
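
The explicit frame-stack design can be illustrated with a toy tree. A self-contained sketch of the technique (local types, not arborist's):

```go
package main

import "fmt"

type node struct {
	kind     string
	children []*node
}

// frame records where a node sits inside its parent.
type frame struct {
	n        *node
	childIdx int // index of n within its parent's children
}

// cursor keeps an explicit stack of frames, so parent/child/sibling
// moves are O(1) and allocation-free once the stack is warm.
type cursor struct{ stack []frame }

func newCursor(root *node) *cursor {
	return &cursor{stack: []frame{{n: root}}}
}

func (c *cursor) current() *node { return c.stack[len(c.stack)-1].n }

func (c *cursor) gotoFirstChild() bool {
	cur := c.current()
	if len(cur.children) == 0 {
		return false
	}
	c.stack = append(c.stack, frame{n: cur.children[0], childIdx: 0})
	return true
}

func (c *cursor) gotoNextSibling() bool {
	if len(c.stack) < 2 {
		return false
	}
	top := &c.stack[len(c.stack)-1]
	parent := c.stack[len(c.stack)-2].n
	if top.childIdx+1 >= len(parent.children) {
		return false
	}
	// Sibling move indexes directly into the parent's children slice.
	top.childIdx++
	top.n = parent.children[top.childIdx]
	return true
}

func (c *cursor) gotoParent() bool {
	if len(c.stack) < 2 {
		return false
	}
	c.stack = c.stack[:len(c.stack)-1]
	return true
}

func main() {
	root := &node{kind: "file", children: []*node{
		{kind: "package"}, {kind: "func"},
	}}
	c := newCursor(root)
	c.gotoFirstChild()
	c.gotoNextSibling()
	fmt.Println(c.current().kind) // func
}
```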

Highlighting

hl, _ := arborist.NewHighlighter(lang, highlightQuery)
ranges := hl.Highlight(src)

for _, r := range ranges {
    fmt.Printf("%s: %q\n", r.Capture, src[r.StartByte:r.EndByte])
}

Tagging

entry := grammars.DetectLanguage("main.go")
lang := entry.Language()

tagger, _ := arborist.NewTagger(lang, entry.TagsQuery)
tags := tagger.Tag(src)

for _, tag := range tags {
    fmt.Printf("%s %s at %d:%d\n", tag.Kind, tag.Name,
        tag.NameRange.StartPoint.Row, tag.NameRange.StartPoint.Column)
}

Benchmarks

All measurements below use the same workload: a Go source file containing 500 function definitions (~19 KB). Three runtimes are compared:

  1. Native C — tree-sitter compiled with -O3, linked directly, no Go involved. This is the hardware floor.
  2. CGo binding — go-tree-sitter, calling the same C runtime through cgo.
  3. arborist — this project, pure Go.

goos: linux / goarch: amd64 / cpu: Intel(R) Core(TM) Ultra 9 285

Full parse (cold, no prior tree)

Runtime Time Relative
Native C (-O3) 1.75 ms 1.0x
CGo binding 1.95 ms 1.1x
arborist 19.2 ms 11.0x

arborist's full parse is roughly an order of magnitude slower than the C runtime. Both runtimes use the same algorithm — table-driven LR(1) with GLR fallback when the parse table contains conflicts — and operate on the same parse tables. The gap comes from Go language-level overhead: bounds checking on every table lookup and stack access, interface dispatch on the token source, GC write barriers on node allocation, and pointer indirection through slice-of-slice parse tables where C uses flat arrays. These costs are inherent to a pure-Go implementation, though targeted optimizations (unsafe table lookups, devirtualized token dispatch, flattened data layouts) can narrow the gap.

Incremental reparse (warm, prior tree available)

Workload Native C CGo binding arborist vs. native C
1-byte edit 104 μs 119 μs 1.0 μs 104x faster
No-op reparse 101 μs 115 μs 2.4 ns 42,000x faster

Incremental parsing is where the architecture diverges. The C runtime and CGo binding must re-enter the parser on every call, even when nothing has changed — the C API provides no mechanism to short-circuit a no-op. arborist holds the full tree in Go-managed memory. On a no-op reparse, a nil-pointer check on the edit list exits immediately. On a 1-byte edit, the parser walks the tree spine to the edit site, re-lexes only the affected tokens, and splices the result back in, reusing all unmodified subtrees by reference.

For editor-class workloads (keystroke-at-a-time on a single file), incremental performance dominates total latency. A sub-microsecond reparse on every keystroke is effectively free at 60 fps.

Raw benchmark output

# Pure Go benchmarks:
go test -run '^$' -bench 'BenchmarkGoParse' -benchmem -count=3

# CGo binding benchmarks:
cd cgo_harness
go test . -run '^$' -tags treesitter_c_bench -bench 'BenchmarkCTreeSitterGoParse' -benchmem -count=3

# Native C benchmarks (no Go, direct C binary):
bash cgo_harness/pure_c/run_go_benchmark.sh 500 2000 20000

Benchmark ns/op B/op allocs/op
Native C full parse 1,748,000
Native C incremental (1-byte edit) 104,300
Native C incremental (no-op) 100,900
CTreeSitterGoParseFull 1,946,000 600 6
CTreeSitterGoParseIncrementalSingleByteEdit 119,100 648 7
CTreeSitterGoParseIncrementalNoEdit 114,900 600 6
GoParseFull 19,270,000 7,645 1,226
GoParseFullDFA 19,920,000 2,636 6
GoParseIncrementalSingleByteEdit 1,000 424 8
GoParseIncrementalSingleByteEditDFA 1,335 629 10
GoParseIncrementalNoEdit 6.88 0 0
GoParseIncrementalNoEditDFA 2.36 0 0

Benchmark matrix

For repeatable multi-workload tracking:

go run ./cmd/benchmatrix --count 10

Emits bench_out/matrix.json (machine-readable), bench_out/matrix.md (summary), and raw logs under bench_out/raw/.

Supported languages

209 grammars ship in the registry. 203 produce error-free parse trees on their smoke samples; 6 are degraded. Run go run ./cmd/parity_report for current status.

  • 112 hand-written Go external scanners (Python indentation, HTML tag matching, Rust raw strings, etc.)
  • 9 hand-written Go token sources (authzed, c, go, html, java, json, lua, toml, yaml)
  • Remaining languages use the DFA lexer generated from grammar tables

Parse quality

Each LangEntry carries a Quality field:

Quality Meaning
full All scanner and lexer components present. Parser has full access to the grammar.
partial Missing external scanner. DFA lexer handles what it can; external tokens are skipped.
none Cannot parse.

full means the parser has every component the grammar requires. It does not guarantee error-free trees on all inputs — grammars with high GLR ambiguity may produce syntax errors on very large or deeply nested constructs due to parser safety limits (iteration cap, stack depth cap, node count cap). These limits scale with input size. Check tree.RootNode().HasError() at runtime.

Full language list (209)

ada, agda, angular, apex, arduino, asm, astro, authzed, awk, bash, bass, beancount, bibtex, bicep, bitbake, blade, brightscript, c, c_sharp, caddy, cairo, capnp, chatito, circom, clojure, cmake, cobol, comment, commonlisp, cooklang, corn, cpon, cpp, crystal, css, csv, cuda, cue, cylc, d, dart, desktop, devicetree, dhall, diff, disassembly, djot, dockerfile, dot, doxygen, dtd, earthfile, ebnf, editorconfig, eds, eex, elisp, elixir, elm, elsa, embedded_template, enforce, erlang, facility, faust, fennel, fidl, firrtl, fish, foam, forth, fortran, fsharp, gdscript, git_config, git_rebase, gitattributes, gitcommit, gitignore, gleam, glsl, gn, go, godot_resource, gomod, graphql, groovy, hack, hare, haskell, haxe, hcl, heex, hlsl, html, http, hurl, hyprlang, ini, janet, java, javascript, jinja2, jq, jsdoc, json, json5, jsonnet, julia, just, kconfig, kdl, kotlin, ledger, less, linkerscript, liquid, llvm, lua, luau, make, markdown, markdown_inline, matlab, mermaid, meson, mojo, move, nginx, nickel, nim, ninja, nix, norg, nushell, objc, ocaml, odin, org, pascal, pem, perl, php, pkl, powershell, prisma, prolog, promql, properties, proto, pug, puppet, purescript, python, ql, r, racket, regex, rego, requirements, rescript, robot, ron, rst, ruby, rust, scala, scheme, scss, smithy, solidity, sparql, sql, squirrel, ssh_config, starlark, svelte, swift, tablegen, tcl, teal, templ, textproto, thrift, tlaplus, tmux, todotxt, toml, tsx, turtle, twig, typescript, typst, uxntal, v, verilog, vhdl, vimdoc, vue, wat, wgsl, wolfram, xml, yaml, yuck, zig

Query API

Feature Status
Compile + execute (NewQuery, Execute, ExecuteNode) supported
Cursor streaming (Exec, NextMatch, NextCapture) supported
Structural quantifiers (?, *, +) supported
Alternation ([...]) supported
Field matching (name: (identifier)) supported
#eq? / #not-eq? supported
#match? / #not-match? supported
#any-of? / #not-any-of? supported
#lua-match? supported
#has-ancestor? / #not-has-ancestor? supported
#not-has-parent? supported
#is? / #is-not? supported
#set! / #offset! directives parsed and accepted

All shipped highlight and tags queries compile (156/156 highlight, 69/69 tags).

Known limitations

  • Full-parse throughput: ~11x slower than the C runtime on the Go grammar benchmark. Both runtimes use the same GLR algorithm; the gap is from Go bounds checking, interface dispatch, GC write barriers, and pointer indirection. Incremental reparsing amortizes this for interactive use.
  • GLR safety caps: The parser enforces iteration, stack depth, and node count limits proportional to input size. These prevent pathological blowup on grammars with high ambiguity but impose a ceiling on the maximum input complexity that parses without error. The caps are tunable but not removable without risking unbounded resource consumption.
  • Degraded grammars: 6 of 209 grammars produce syntax errors on smoke samples: norg (external scanner with 122 token types, not yet ported), caddy, cobol, hcl, properties, vimdoc (DFA lexer limitations). Check entry.Quality and tree.RootNode().HasError().

Adding a language

  1. Add the grammar repo to grammars/languages.manifest
  2. Generate tables: go run ./cmd/ts2go -manifest grammars/languages.manifest -outdir ./grammars -package grammars -compact=true
  3. Add smoke samples to cmd/parity_report/main.go and grammars/parse_support_test.go
  4. Verify: go run ./cmd/parity_report && go test ./grammars/...

Architecture

arborist is a ground-up reimplementation of the tree-sitter runtime in Go. No code is shared with or translated from the C implementation.

Parser — Table-driven LR(1) with GLR fallback. When a (state, symbol) pair maps to multiple actions in the parse table, the parser forks the stack and explores all alternatives in parallel. Stack merging collapses equivalent paths. Safety limits (iteration count, stack depth, node count) scale with input size and prevent runaway exploration on ambiguous grammars.

Incremental engine — Walks the edit region of the previous tree and reuses unchanged subtrees by reference. Non-leaf subtree reuse is enabled by storing a pre-goto parser state on each interior node, allowing the parser to skip an entire subtree and resume in the correct state without re-deriving its contents. External scanner state is serialized on each node boundary so scanner-dependent subtrees can be reused without replaying the scanner from the start.

Lexer — Two paths. A DFA lexer is generated from the grammar's lex tables by ts2go and handles the majority of languages. For grammars where the DFA is insufficient (e.g., Go's automatic semicolons, YAML's indentation-sensitive structure), hand-written Go token sources implement the TokenSource interface directly.

External scanners — 112 grammars require external scanners for context-sensitive tokens (Python indentation, HTML implicit close tags, Rust raw string delimiters, etc.). Each scanner is a hand-written Go implementation of the grammar's ExternalScanner interface: Create, Serialize, Deserialize, Scan. Scanner state is snapshotted after every token and stored on tree nodes so incremental reuse can restore scanner state on skip.
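
The four-method shape can be sketched as follows. All signatures here are illustrative assumptions, not the library's API; the real ExternalScanner interface's parameter types will differ:

```go
package main

import "fmt"

// Lexer is a hypothetical stand-in for the character stream a
// scanner reads from; arborist's real lexer type is richer.
type Lexer interface {
	Peek() rune
	Advance()
}

// ExternalScanner mirrors the four-method shape described above.
// Signatures are illustrative assumptions.
type ExternalScanner interface {
	Create()
	Serialize() []byte
	Deserialize(state []byte)
	Scan(lexer Lexer, validSymbols []bool) (symbol int, ok bool)
}

// indentScanner sketches a Python-style indentation tracker whose
// state (the stack of open indent levels) must round-trip through
// Serialize/Deserialize so subtrees can be reused incrementally.
type indentScanner struct {
	indents []int
}

func (s *indentScanner) Create() { s.indents = []int{0} }

func (s *indentScanner) Serialize() []byte {
	out := make([]byte, len(s.indents))
	for i, n := range s.indents {
		out[i] = byte(n) // sketch: real state encoding is wider than a byte
	}
	return out
}

func (s *indentScanner) Deserialize(state []byte) {
	s.indents = s.indents[:0]
	for _, b := range state {
		s.indents = append(s.indents, int(b))
	}
}

func (s *indentScanner) Scan(lexer Lexer, valid []bool) (int, bool) {
	// Real logic would compare the current line's indent to the stack
	// and emit INDENT/DEDENT tokens; omitted in this sketch.
	return 0, false
}

func main() {
	var sc ExternalScanner = &indentScanner{}
	sc.Create()
	fmt.Println(len(sc.Serialize())) // 1: the initial zero indent level
}
```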

Arena allocator — Nodes are allocated from slab-based arenas to reduce GC pressure. Arenas are released in bulk when a tree is freed.
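
A minimal slab arena in the same spirit (a sketch of the technique, not arborist's implementation):

```go
package main

import "fmt"

// treeNode is a stand-in for a parse-tree node.
type treeNode struct {
	kind       uint16
	start, end uint32
}

// arena hands out nodes from fixed-size slabs, so a whole tree's
// nodes can be dropped in one step instead of freed individually.
type arena struct {
	slabs [][]treeNode
	used  int // nodes used in the current (last) slab
}

const slabSize = 1024

func (a *arena) alloc() *treeNode {
	if len(a.slabs) == 0 || a.used == slabSize {
		a.slabs = append(a.slabs, make([]treeNode, slabSize))
		a.used = 0
	}
	n := &a.slabs[len(a.slabs)-1][a.used]
	a.used++
	return n
}

// release drops every slab at once; the GC reclaims them in bulk,
// which is far cheaper than tracing millions of individual nodes.
func (a *arena) release() {
	a.slabs, a.used = nil, 0
}

func main() {
	var a arena
	for i := 0; i < 3000; i++ {
		a.alloc()
	}
	fmt.Println(len(a.slabs)) // 3
}
```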

Query engine — S-expression pattern compiler with predicate evaluation and streaming cursor iteration. Supports all standard tree-sitter predicates (#eq?, #match?, #any-of?, #has-ancestor?, etc.) and directive annotations (#set!, #offset!).

Grammar loading — ts2go extracts parse tables, lex tables, field maps, symbol metadata, and external token lists from upstream parser.c files. These are serialized to compressed binary blobs under grammars/grammar_blobs/ and lazy-loaded via loadEmbeddedLanguage() with an LRU cache. String and transition interning reduce memory footprint across loaded grammars.
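
The lazy-load-plus-LRU pattern looks roughly like this small self-contained cache (illustrative only; the real cache also handles idle TTL eviction and interning):

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache keeps at most cap loaded values, evicting the least
// recently used entry when full — the shape of a grammar-blob cache.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding *entry
}

type entry struct {
	key string
	val any
}

func newLRU(cap int) *lruCache {
	return &lruCache{cap: cap, order: list.New(), items: map[string]*list.Element{}}
}

// get returns the cached value for key, invoking load only on a miss.
func (c *lruCache) get(key string, load func() any) any {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).val
	}
	if c.order.Len() == c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	v := load()
	c.items[key] = c.order.PushFront(&entry{key, v})
	return v
}

func main() {
	c := newLRU(2)
	c.get("go.bin", func() any { return "go tables" })
	c.get("rust.bin", func() any { return "rust tables" })
	c.get("go.bin", func() any { return nil })       // hit: loader not called
	c.get("c.bin", func() any { return "c tables" }) // evicts rust.bin
	fmt.Println(c.order.Len()) // 2
}
```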

Build tags and environment

External grammar blobs (avoid embedding in the binary):

go build -tags grammar_blobs_external
ARBORIST_GRAMMAR_BLOB_DIR=/path/to/blobs  # required
ARBORIST_GRAMMAR_BLOB_MMAP=false           # disable mmap (Unix only)

Curated language set (smaller binary):

go build -tags grammar_set_core  # c, go, java, javascript, python, rust, typescript, etc.
ARBORIST_GRAMMAR_SET=go,json,python  # runtime restriction

Grammar cache tuning (long-lived processes):

grammars.SetEmbeddedLanguageCacheLimit(8)    // LRU cap
grammars.UnloadEmbeddedLanguage("rust.bin")  // drop one
grammars.PurgeEmbeddedLanguageCache()        // drop all
ARBORIST_GRAMMAR_CACHE_LIMIT=8       # LRU cap via env
ARBORIST_GRAMMAR_IDLE_TTL=5m         # evict after idle
ARBORIST_GRAMMAR_IDLE_SWEEP=30s      # sweep interval
ARBORIST_GRAMMAR_COMPACT=true        # loader compaction (default)
ARBORIST_GRAMMAR_STRING_INTERN_LIMIT=200000
ARBORIST_GRAMMAR_TRANSITION_INTERN_LIMIT=20000

Testing

go test ./... -race -count=1

Test suite covers: smoke tests (209 grammars), golden S-expression snapshots, highlight query validation, query pattern matching, incremental reparse correctness, error recovery, GLR fork/merge, and fuzz targets.

Roadmap

v0.5.0 — 209 grammars, GLR parser, incremental reparsing, query engine, tree cursor, highlighting, tagging.

Next:

  • Corpus parity testing against C tree-sitter reference output
  • GLR ambiguity overhead reduction for large files
  • Query parity hardening (field-negation semantics, metadata directives)
  • External scanners for remaining degraded grammars
  • Fuzz coverage expansion

License

Sharkk Open License