I keep coming across people who consider parser technology a
forbidding, scary field of programming. This is nonsense—a small
parser can be very, very simple, and provide a wholesome exercise in
recursive thinking. At the same time, it is true that you can make
parsing extremely complicated.
Mostly, this tends to happen when you generalize parsing techniques to
work for different grammars. Since compiler textbooks usually describe
general approaches to parsing, they may be to blame for putting people
off.

This post describes a new parsing system I wrote for CodeMirror, a
source code editor. It frames the system with some history, and
digresses into some neat architectural details.

Editor features like syntax highlighting, bracket matching, code
folding, and autocompletion all involve some level of parsing.
Unfortunately, since editors have to handle many different languages,
they require a generalized approach to parsing.

CodeMirror is in the process of being rewritten, and I wanted to
improve the way it parses its content. Parsing inside of an editor
comes with its own unique set of constraints, which can be hard to
satisfy. Though I had been planning new approaches for years, all I
had to show for it so far was a pile of dead ends.

The constraints that make the parsing problem in a code editor hard
are roughly these:

  • The document is constantly changing.

  • You can’t do anything expensive. If the parsing work takes too
    long, it’ll introduce latency that makes editing feel
    sluggish and unresponsive.

  • The input is often not in a finished, syntactically correct form.
    But you still have to make some sense of it—nobody wants an editor
    where most features stop working when you have a syntax error in
    your document.

  • You often want to be able to mix several languages/grammars in a
    single document (think HTML with JavaScript and CSS embedded in
    it).

Keeping those in mind, let’s go over the approaches I’ve tried.

A Brief History of CodeMirror Parsing

The system as it exists in CodeMirror 5 (which is pretty much what
we’ve been using from the very beginning) is a simple one. For
each language, you write a tokenizer which splits the input into
pieces, and labels each piece with some syntactic category (such as
variable, keyword, or number). The tokenizers can be stateful,
which allows them to secretly be full parsers if they want to.

This state must be copyable, so that the editor can strategically
store tokenizer states from a previous run, and after a change, resume
one close to that change to avoid re-tokenizing the entire document.
Because we are usually only interested in the code in the visible
viewport, this means the complexity of re-tokenizing is bounded by the
distance between the change and the end of the viewport. Since most
changes happen inside of that viewport, this works well in practice.
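
To make this concrete, here is a minimal sketch of what such a mode might look like, loosely following the shape of CodeMirror 5's mode interface (the mode itself is a toy, invented for illustration):

// A toy CodeMirror 5-style mode: the token function labels pieces of a
// line, and the state records whether we are inside a block comment.
const toyMode = {
  startState() {
    return {inComment: false}
  },
  copyState(state) {
    // The editor stores copies of this state so it can resume
    // tokenizing near a change instead of starting from the top.
    return {inComment: state.inComment}
  },
  token(stream, state) {
    if (state.inComment) {
      if (stream.skipTo("*/")) { stream.match("*/"); state.inComment = false }
      else stream.skipToEnd()
      return "comment"
    }
    if (stream.match("/*")) { state.inComment = true; return "comment" }
    if (stream.match(/^\d+/)) return "number"
    stream.next()
    return null
  }
}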


Such tokenizers are awkward to write directly, so over the years
several attempts have been made to build abstractions over them. The
first was the Common JavaScript Syntax Highlighting Specification,
an attempt by the authors of Mozilla Skywriter (formerly Bespin, later
merged into ACE) to define a declarative format
for describing tokenizers as state machines with regular expressions
(describing the tokens) as edges. The ACE project ended up with an
incompatible but similar format (too entangled with their internals to
use in CodeMirror, unfortunately). I did an implementation of the
original spec for CodeMirror, and then another incompatible
extension because the base spec was
too limiting. There are a few CodeMirror modes still based on that
code, but it was no real success.

I think the reason such state machines (and the somewhat related
TextMate grammars, which are in wide use in desktop editors) never
felt like a great solution
is that, once you get past trivial grammars (where their declarative
simplicity does look really nice), they don’t really help that much
with abstraction. Manually designing complicated state machines is a
chore. Regular expressions, which are bad enough on their own, become
downright terrifying
when you have to construct all your edges out of them, often stuffing
multiple tokens into a single expression to avoid creating
intermediate states. This “abstraction” has a tendency to produce
uglier, less maintainable code than what you’d get when writing the
tokenizer as plain code.


So in 2017, I started an ambitious project to create a better way to
abstractly define incremental tokenizers. I had concluded that
classical parser generators based on context-free grammars were never
going to work in this context (for reasons that I’ll come back to
later on). But I kept coming across parsing expression grammars,
which weren’t based on context-free grammars and had some interesting
properties, such as being able to combine multiple grammars to create
a new grammar (which is great for mixed-language documents).

So I spent several months building a parsing system that took a
PEG-like grammar, compiled it down to a state machine, and made it
possible to run that state machine as a CodeMirror language mode.

This system is a marvel. It uses a moderately sophisticated
optimizing compiler to generate the
state machines. The result works quite well, and is used in several
real-world systems today. But unfortunately, if I’m honest, it is a
tragically bad idea taken way too far.

Parsing expression grammars are parsed by backtracking. And as such,
they are very poorly suited for implementing a stateful tokenizer. In
a backtracking system, you never know when you’ve definitely parsed
a piece of content—later input might require you to backtrack again.
So what I ended up with was actually not PEG at all, but a system
where you had to explicitly annotate where the parser should look
ahead. Though grammars written this way were relatively readable, they
involved a lot of finicky, error-prone kludges to resolve local
ambiguity.

Also, parsing PEG is just really inefficient. Such grammars are
“scannerless”, meaning they don’t make a distinction between tokenizing
and parsing. When parsing in that way naively, you basically have to
run your whole parsing logic for every input character. Probably
multiple times, due to backtracking. A lot of the magic in the
compiler was intended to recover the tokens that were implicit in the
grammar, in order to recover some efficiency. But the system never
came close to hand-written language modes in terms of speed.

Tree-sitter

So, though I knew I needed a new approach, I went into the CodeMirror
6 rewrite without any specific idea on what that approach would look
like.

And then I saw
tree-sitter, and was
enlightened.

Tree-sitter is a parser system written with the code editor use case
in mind, and is in the process of being integrated into the Atom
editor. It takes a much more ambitious approach to
what a parser inside an editor should do: It builds up a full,
accurate syntax tree for the content.

You can do so much more with an actual syntax tree than with a
sequence of tokens. Whereas tokens, possibly augmented with some
information stored in the tokenizer state, allow you to sort of
approximate understanding some aspects of the code’s structure, a tree
usually gives you precisely the information you need.

Most of the ideas that tree-sitter uses aren’t new; in fact, a paper
from 1997 describes a somewhat similar system. But as far as I know,
tree-sitter
is the first system that puts them all together into a practical piece
of software.

Unfortunately, tree-sitter is written in C, which is still awkward to
run in the browser (and CodeMirror targets non-WASM browsers). It
also generates very hefty grammar files because it makes the
size/speed trade-off in a different way than a web system would.

But good ideas can be ported. Lezer is
a JavaScript-based system heavily inspired by tree-sitter.

LR Parsing and Context-Free Grammars

For a long time, I was firmly convinced that classical parser systems
based on context-free grammars and LL or LR parsing algorithms were
just not suitable for the editor use case. My arguments for this
were…

Context-free grammars are a limiting abstraction that breaks down as
soon as the language does anything funky. Needing the grammar to be LR
or LL to please the parser generator further pins you into a corner.

This is not wrong. Expressing operator precedence in a pure
context-free grammar requires writing a silly formulaic rule for each
level of precedence. And when you need to implement something like
automatic semicolon insertion or whitespace-sensitivity, which would
be a couple of lines of code in a hand-written parser, you can’t
express that directly, and have to somehow escape the context-free
abstraction.

Making such a grammar suitable for an LR parser generator can be even
more tricky, and often requires you to have a rather deep
understanding of how the parser generator works.

But like many things, once you get to know them, they aren’t that bad.
Parser generators can support precedence declarations, which make
operator parsing a lot less terrible. They can even output decent
error messages.

Supporting dynamic resolution of ambiguities through something like
GLR parsing can provide a
practical way out of situations that parser generators are
traditionally bad at.

And contrary to some of the abstractions I mentioned before, this one
actually gets us something. Context-free grammars, when combined with
a proper parser generator, really do give us fast parsers from
readable, compact grammar declarations.

A strict separation between the tokenizer and parser is
problematic.

It is, in many languages (think of JavaScript’s ambiguity between
regular expressions and the division operator). It also tends to make
mixed-language parsing harder.

But just because this type of parser is traditionally run with a
completely separate tokenizer doesn’t mean it has to be. Having the
parse state drive the tokenizer is largely unproblematic. You can even
have the parser generator set this up automatically, without user
involvement.
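
As an illustration, a tokenizer driven by the parse state could settle JavaScript's slash ambiguity roughly like this. This is a hypothetical sketch, not Lezer's actual tokenizer interface:

// Hypothetical interface: `validTokens` is the set of token types that
// the current parse state can accept, supplied by the parser.
function tokenAfterSlash(restOfInput, validTokens) {
  if (validTokens.has("RegExp")) {
    // In expression position, a slash starts a regular expression.
    const m = /^\/(?:[^\/\\\n]|\\.)+\/[a-z]*/.exec(restOfInput)
    if (m) return {type: "RegExp", length: m[0].length}
  }
  // Otherwise it is the division operator.
  return {type: "Divide", length: 1}
}

tokenAfterSlash("/ab+c/g.test(x)", new Set(["RegExp"])) // {type: "RegExp", length: 7}
tokenAfterSlash("/ 2", new Set(["Divide"]))             // {type: "Divide", length: 1}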

Generated parsers are way too big.

A naively generated LR parser is huge, and many tools spit out
embarrassingly big files. But with careful parser state deduplication
and table compression such a parser can be made about as compact as a
hand-written one.

Making such a parser error-tolerant is extremely cumbersome.

If you search the scholarly literature for approaches to
error-tolerance in LR parser systems, you get a lot of results, with a
lot of different approaches, but none of them are very practical. Most
require the grammar writer to explicitly annotate the grammar with
error-recovery strategies, bloating the grammar and putting the
responsibility for getting it right on every grammar author.

Tree-sitter ingeniously abuses GLR parsing, where the parser can try
multiple interpretations simultaneously, to integrate automatic
error-correction without a lot of extra complexity. Lezer copies this
approach.

Lezer

I called my tree-sitter copycat project
Lezer, which is the Dutch word for
reader (and pronounced a lot like laser). It is a bit less
advanced than tree-sitter in some areas, a bit more advanced in
others, and simply different on quite a lot of points, as determined
by a different set of priorities and tastes.

CodeMirror 6 will retain the ability to run a classical stateful
tokenizer, but its recommended way to define a language mode is to
write a Lezer grammar and wrap it in a CodeMirror-specific package
that adds some editor-related metadata.

Lezer is an LR (with opt-in GLR) parser generator. It
has support for incremental parsing, where you can cheaply re-parse a
document after local changes have been made to it, by reusing pieces
of the old parse tree. It automatically tries to recover and continue
parsing when it runs into a syntax error, leaving markers in the
output tree that indicate where the recovery happened.

Lezer consists of an off-line parser generator tool, which takes a
grammar description and outputs a JavaScript module containing a
parser for that grammar, and a parser run-time system (which such
output files depend on) to do the actual parsing. Only the run-time
system and the generated parser need to be loaded by the editor.
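
In practice that looks roughly like this; the package and method names here are from memory and may not match the current API exactly:

// Only the generated parser module and the small run-time library end
// up in the browser; the parser generator itself runs ahead of time.
import {parser} from "lezer-javascript"  // a module emitted by the generator

const tree = parser.parse("let x = 10")
console.log(tree.toString())  // a compact description of the syntax tree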

The parser outputs non-abstract syntax trees, meaning that it just
creates a raw tree structure containing the constructs it parsed (with
information on where it found them), without organizing them into a
clean, easy-to-use data structure.

The system is optimized for compactness, both in parser table size and
syntax tree size. It needs to be practical to ship a bunch of parsers
to a user on the web without producing megabytes of network traffic,
and it needs to be realistic to keep syntax trees for large documents
around without running out of memory.

The Lezer guide provides a
more thorough introduction, as well as a description of its grammar
notation. In this blog post, I want to go into the neat implementation
details that aren’t relevant in user documentation.

Error Recovery

The point where I became convinced that I definitely needed to use or
copy tree-sitter was when I understood its error recovery strategy.

Say you reach a point where you can no longer proceed normally because
there is a syntax error. The rest of the input, after the error, is
probably full of meaningful constructs that could still be parsed. We
want those constructs in our syntax tree. But our regular parsing
process is stuck—it doesn’t know how to get from the error to a state
where the parse can continue.

I definitely did not want to require the grammar author to add error
recovery hints to their grammar. These tend to clutter up the grammar
and are error-prone to write. Writing a grammar is hard enough
without that distraction.

You can see error recovery as a search problem. There might be a parse
state and input position (past the error) where the parse can
meaningfully continue. We just have to find it.

The actions encoded in the parse tables, along with some
recovery-specific actions that the parser wouldn’t normally take,
provide a kind of search tree. You start at the state(s) where the
error occurred, and keep exploring new states from there.

But what does the accept condition look like? When do you know that
you’ve found an acceptable solution? You could define that precisely,
for example as the state that can handle the next N tokens without
further errors. But we can also be vague.

The solution found by Max Brunsfeld
in tree-sitter is to use the same mechanism that’s used to parse
ambiguous grammars. A GLR parser can split its parse stack and run
both sides alongside each other for a while until it becomes clear
which one works out.

That’s pretty much exactly what a search algorithm does—it tracks a
number of branches that it still has to explore, and continues to
explore them, possibly pruning unpromising branches with some
heuristic, until it finds a solution.

To be able to get good results, or at least some result, in messy
situations like longer stretches of invalid input, each branch has a
badness score associated with it, which is increased (linearly) each
time a recovery action is taken, and decreased (asymptotically) every
time it can consume a token normally.

What we want to do is, after an error, try all kinds of possible
recovery tricks, which recursively branch off a large number of states.
But then, after a bit of that, we should consolidate to one or, at
most, a few parse states again, because parsing input in a whole bunch
of different ways is expensive.

To get this effect, Lezer forbids states with a badness higher than a
given multiple of the best state’s badness (or some maximum threshold)
from applying further recovery actions, effectively dropping those
branches when they can’t proceed normally. In the case where one
branch finds a good way to continue, that branch’s badness will
converge to zero and eventually stop all worse branches. In cases
where the input continues to make no sense, all branches will
eventually get a badness score exceeding the maximum, and the parser
will only continue one of them.

The recovery strategies used are:

  • Skip the next token, and try again with the same state after that.

  • Invent a token—take any of the tokens that are valid in this state,
    and continue to the state that consuming it would produce. This
    is the main source of branching, since many states allow a lot of
    tokens.

  • Force the end of the innermost production that’s currently being
    parsed.
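
Putting the badness heuristic and these strategies together, the branch management might look roughly like this. The constants and the branch objects are invented for illustration; the real bookkeeping in Lezer is more involved:

// Toy model of advancing all active branches by one token. A "branch"
// stands in for a parse stack: it carries a badness score and methods
// that wrap the real parser operations.
const MAX_BADNESS = 500
const PRUNE_RATIO = 3

function advanceBranches(branches) {
  const best = Math.min(...branches.map(b => b.badness))
  const survivors = []
  for (const b of branches) {
    if (b.canTakeNextToken()) {
      b.takeNextToken()
      b.badness = Math.floor(b.badness / 2)   // recovers asymptotically
      survivors.push(b)
    } else if (b.badness <= Math.min(MAX_BADNESS, best * PRUNE_RATIO)) {
      // Still allowed to recover: each recovery action (skip a token,
      // invent a token, force-finish a rule) spawns a branch with a
      // linearly increased badness.
      for (const recovered of b.recoveryActions()) {
        recovered.badness = b.badness + 100
        survivors.push(recovered)
      }
    }
    // Branches that are too bad and cannot proceed normally are dropped.
  }
  return survivors
}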

There are situations where the result of this approach isn’t entirely
optimal, but it usually does well. The important thing is that it
always keeps parsing, and does so in a way that remains tractable
(exponential searches are quickly dampened). The system is biased a
bit towards the token-skipping rule, so that if all else fails it’ll,
in effect, just continue skipping tokens until it stumbles into a
situation where it can continue parsing.

Post-Order Parser Output

When you have a parser that may be splitting its state—a lot—and building
up parts of the tree multiple times, that duplicate tree building and
the bookkeeping involved in it can cause a lot of unnecessary work.

The order in which an LR parser creates nodes is inner-to-outer. It
will, for example, first create the node for the operands, and then
later the node for the operator expression. This suggests an approach:
What if, instead of building a tree structure right away, the parser
just keeps a flat log of the nodes it created? This can be an array in
which the nodes are arranged in post-order, with children
coming before parents.

The parser just appends to this array. When splitting the state, one
state keeps the existing array, and the other gets a new empty array
along with a pointer to the state that has the rest of the array, and
the length of that array at the time of the split.

Now splitting involves no node copying at all. You do need to copy the
state stack, which LR parsers use to track context, but that is
generally relatively shallow.

In addition, node allocation becomes as cheap as appending a few
numbers to an array. For actions that don’t result in tree nodes
(Lezer allows you to mark rules as uninteresting, to keep the tree
small), you don’t have to do anything at all. The control stack
stores the output array position at the start of each rule, and can
use that to emit enough data to later reconstruct parent-child
relationships.

After a parse finishes successfully, the final state’s parent-array
pointers can be used to find all the nodes that make up the tree, and
construct an actual tree structure out of them.
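
Reconstructing the tree from such a post-order log is straightforward. In this sketch the node records are plain objects with a child count; the real parser packs equivalent information into flat arrays of numbers:

// Turn a post-order node log into a nested tree.
function buildTree(log) {
  const stack = []
  for (const {type, from, to, childCount} of log) {
    // A node's children are the last `childCount` nodes finished before it.
    const children = stack.splice(stack.length - childCount, childCount)
    stack.push({type, from, to, children})
  }
  return stack // whatever remains are the top-level nodes
}

buildTree([
  {type: "Number", from: 0, to: 1, childCount: 0},
  {type: "Number", from: 4, to: 5, childCount: 0},
  {type: "BinaryExpr", from: 0, to: 5, childCount: 2}
])
// → [{type: "BinaryExpr", from: 0, to: 5, children: [two Number nodes]}]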

One tricky issue occurs when skipped content (whitespace and comments)
produces nodes. If you have code like this…

if (true) something()
// Comment
otherStatement()

… the comment should not be part of the if statement’s node. Yet
the parser only knows for sure that it can finish that node after
seeing the next statement (there might be an else still coming).

In cases like this, where the output array contains skipped nodes
immediately in front of a reduction, the parser has to move them
forward and store the end of the node before them. Fortunately, this
occurs relatively rarely (unless you add nodes for whitespace, in
which case it’ll happen at the end of every rule that has a possible
continuation).

Buffer Trees

A nice thing about the flat post-order tree representation is that it
is compact. Tree structures constructed the usual way, as separately
allocated nodes, incur a lot of extra overhead for pointers and
allocation headers. They can also have terrible locality, since who
knows how far from each other the memory allocator will put the nodes.

Unfortunately, we can’t just use a flat representation for our syntax
trees. The incremental parser has to be able to reuse parts of it
without copying those parts into a different buffer.

But we can use it for parts of the tree. Storing the coarse
structure as a classical tree, but the content of smaller nodes (say
less than a few thousand characters long) as flat arrays, gives us the
best of both worlds. Since most nodes, by number, live in the fine
structure, this saves a large amount of overhead (and helps with
locality).

That does mean that we can’t reuse small nodes. But since their size
is limited, the amount of work that is involved in re-parsing them is
also limited. And by removing them from consideration, the incremental
parser can avoid quite a bit of the work involved in preparing and
scanning the tree for reuse.

A small node stores its content in a typed array of 16-bit unsigned
integers. It uses 4 such numbers (64 bits) per node, storing a type, a
start position, an end position, and a child count for each node.
Contrary to the array created by the parser, these arrays are in
pre-order,
because that makes forward iteration (which tends to be more common
than backward iteration) cheaper. The child count is almost redundant
(the end position can sort of tell you which nodes are children), but
Lezer supports zero-length nodes, which might land on the end of their
parent node and make it ambiguous whether they belong to it or not.
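
Reading such a buffer might look roughly like this (simplified; the real implementation also maps the numeric type ids to shared node type objects):

// Each node takes 4 slots in a Uint16Array: type id, start position,
// end position, and child count, stored in pre-order.
function* nodesIn(buffer) {
  for (let i = 0; i < buffer.length; i += 4)
    yield {type: buffer[i], from: buffer[i + 1], to: buffer[i + 2],
           childCount: buffer[i + 3]}
}

// 64 bits per node, no per-node allocation, good locality.
const buffer = new Uint16Array([1, 0, 5, 2,   2, 0, 1, 0,   2, 4, 5, 0])
for (const node of nodesIn(buffer)) console.log(node)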

Client code, of course, doesn’t want to deal with this representation.
Lezer provides an abstract interface to searching in and walking
through trees that hides the buffer structure, allowing you to
conceptually work with a uniform tree of nodes.

Lezer, like tree-sitter, stores the result of repetitions in the
grammar (produced by the * and + operators) as balanced subtrees.
This means that, unless your input is pathological (say, a thousand
applications of a single binary operator in a row), you tend to get
shallow, well-balanced syntax trees, which are cheap to search and
allow effective reuse.

Contextual Tokens

Depending on the grammar’s complexity, an LR parser generator creates
between a dozen and a few thousand parse states for your grammar.
These represent syntactic positions like “after the opening paren of
an argument list” or “after an expression, possibly expecting some
expression suffix”.

The parser generator can figure out which tokens are valid in a given
state. It can also, for tokens specified as part of the grammar,
automatically determine which tokens conflict (match the same input,
or some prefix of each other).

A well-known example of conflicting tokens is the division operator
versus regular expression syntax in JavaScript. But others are
keywords that can also appear as property names, and the bitwise right
shift operator (>>) versus two closing angle brackets in C++.

Lezer will not complain about overlapping tokens if the tokens do not
appear in the same parse states. This implicitly resolves the regular
expression and property name issues, without any user interaction.

When conflicting tokens do appear in the same place, such as division
operators and C-style comments, you have to specify an explicit
precedence ordering (comments take precedence) to tell the tool that
you know what you’re doing.

Contextual tokenization is implemented with a concept called token
groups. Tokens that have unresolved conflicts with other tokens are
assigned to one or more groups, where each group contains only
non-conflicting tokens. Each state is assigned a single group (if it
expects tokens that conflict with each other, that’s an error). This
group is passed to the tokenizer, which then takes care to only return
tokens that are either in that group, or don’t conflict with any other
tokens. The check is optimized by storing group membership in a
bitset, and seeing if the right bit is set with a binary and.

Tokens are compiled down to a single deterministic state machine,
which is run on the input character stream. In cases like the
regexp-versus-division issue, you don’t want the machine to go running
through regexp-specific states in a situation where you only allow
division, since that would be wasteful. Therefore, each tokenizer
state is also tagged with a bitset that tells you which groups the
tokens reachable from that state belong to, and the tokenizer stops
running when it hits a state that has no overlap with the allowed
tokens for the parse state.
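
The checks themselves are cheap bit operations. Schematically (the actual encoding in Lezer differs, but the idea is the same):

// Group membership as bitmasks. A token is usable in a state if it is
// conflict-free, or if it belongs to the state's token group.
const GROUP_DIVISION = 1 << 0
const GROUP_REGEXP = 1 << 1

function tokenAllowed(tokenGroups, stateGroup, conflictFree) {
  return conflictFree || (tokenGroups & stateGroup) != 0
}

// The token state machine stops early when none of the tokens reachable
// from its current machine state can be used in this parse state.
function keepScanning(reachableGroups, stateGroup) {
  return (reachableGroups & stateGroup) != 0
}

tokenAllowed(GROUP_REGEXP, GROUP_DIVISION, false) // false: no regexp where only division is allowed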

Skip Expressions

Almost all programming languages have special syntactic elements like
whitespace and comments that may occur between any tokens. Encoding
these directly in the grammar is extremely tedious for most languages.

Traditionally, tokenizers just skip such elements when reading the next
token. That works well in most contexts, but makes it awkward to
include the elements in the parse tree.

Lezer treats skipped things like they are part of the grammar (though
in an optimized way to avoid increasing the size of the parse tables).
It is possible to skip things that aren’t single tokens (to implement
something like nestable comments, for example, or to make sure your
block comment nodes consist of smaller nodes so that you can
incrementally parse giant block comments).

Each rule or group of rules may have its own set of skipped
expressions, so that you can express different sublanguages in a
grammar, for example something like the content of interpolated
strings, without allowing spacing in places where the language doesn’t
allow it.

Each parse state has a pointer to a (shared) set of skip actions,
which, for the skipped tokens or tokens that start a compound skipped
expression, contains the actions to take for those tokens. For
single-token skipped elements, that action just tells the parser to
skip the token and stay in the same state. For compound elements, it
causes the state that handles the rest of the element to be pushed
onto the control stack.
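
Schematically, the main parse loop consults that table before falling back to a regular action lookup. This is an invented fragment, not Lezer's actual code:

// `skips` is the shared skip-action table the current state points to,
// mapping token types to skip actions (names here are illustrative).
function nextAction(state, skips, token, normalAction) {
  const skip = skips.get(token.type)
  if (skip == "stay")
    return {kind: "skipToken"}            // consume the token, stay in this state
  if (skip != null)
    return {kind: "pushState", to: skip}  // start of a compound skipped element
  return normalAction(state, token)       // regular shift/reduce lookup
}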

Tree Node Tagging

The languages that a tool like Lezer needs to handle are wildly
different, from JavaScript to Haskell to CSS to YAML. As such, it is
difficult to find a cross-language vocabulary to describe their
constructs. In fact, it seems like that would be a separate multi-year
project, and pull in a serious amount of complexity.

Yet it would be nice if the parser output came with some information
that can be interpreted without knowing what language you are working
with.

After several iterations, what I decided on was a system where nodes
have names, which only have a meaning within the language, and
props, which are values associated with tags defined by external
code. Integrating a language grammar into CodeMirror involves
assigning values for some of these props to the node types used by the
language—things like syntax highlighting style information and how to
indent such nodes.

Since the number of node types in a language is limited, we can
allocate an object for each node type to hold this information, and
have all nodes of that type point to the same object.

To allow code outside the grammar to add props without mutating global
state, parser instances can be extended with additional props,
creating a copy that will output nodes with the props attached. This
is especially useful in the context of mixed-language trees.
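
Conceptually, that amounts to something like the following sketch (a simplified model with made-up prop names, not Lezer's actual classes):

// One shared object per node type; every node of that type points at it.
const VariableName = {name: "VariableName", props: {highlight: "variable"}}
const nodeTypes = [VariableName /*, … */]

// Extending a parser copies the type objects with extra props attached,
// leaving the original grammar package untouched for other users.
function withProps(types, extraProps) {
  return types.map(t => ({...t, props: {...t.props, ...extraProps[t.name]}}))
}

const editorTypes = withProps(nodeTypes, {
  VariableName: {indentation: "flat"}  // made-up editor metadata
})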

Lezer has support for a limited form of grammar nesting. If language A
can appear inside a document in language B, and the end of the region
covered by A can be unambiguously found by scanning for a specific
token, Lezer can temporarily switch to another set of parse tables
while parsing such a region.

The syntax tree will then contain nodes from both grammars. Having
props directly attached to the nodes makes it much easier to work with
such trees (as opposed to using a language-specific table that
associates node names with metadata).
