This is the first post in a two-part series that takes a tutorial-based approach
to exploring the Go compiler. The compiler is large and would require a
small book to describe properly, so the idea of these posts is to provide a
quick depth-first dive instead. I plan to write more descriptive posts on
specific areas of the compiler in the future.

We’re going to change the Go compiler to add a new (toy) language feature, and
build a modified compiler to play with.

The task – adding a new statement

Many languages have awhilestatement, which in Go is expressed with
for:

for{
 
}

Adding awhilestatement to Go is rather trivial, therefore – we simply
translate it tofor. So I chose a slightly more challenging task, adding
until.untilis the same aswhileexcept that the condition is
negated. For example, this code:

i:=4
untili==0{
 i--
 fmt.Println("Hello, until!")
}

Is equivalent to:

i:=4
fori!=0{
 i--
 fmt.Println("Hello, until!")
}

In fact, we could even use an initializer in the loop declaration as follows:

untili:=4;i==0{
 i--
 fmt.Println("Hello, until!")
}

Our implementation will support this.

A mandatory disclaimer – this is just a toy exercise. I don’t think adding
untilto Go is a good idea at all, because Go’s minimalism is an absolutely
correct design choice.

High-level structure of the Go compiler

The default Go compiler (gc) has a fairly traditional structure that should
be immediately familiar if you worked on other compilers before:

Go gc compiler flow

Relative to the Go repository root, the compiler implementation lives in
src/cmd/compile/internal; all the code paths mentioned later in the post are
going to be relative to this directory. It’s all written in Go and the code is
fairly readable. Throughout this post we’re going to examine these stages one by
one, as we add the required code to support anuntilstatement.

Check out theREADMEfile insrc/cmd/compilefor a nice step-by-step
description of the compilation steps. That file is a good companion to this
blog post.

Scan

The scanner (also known aslexer) breaks up source code text into discrete
entities for the compiler. For example, the wordforbecomes the constant
_For; the characters...become_DotDotDot, while.on its own
becomes_Dot, and so on.

The scanner is implemented in thesyntaxpackage. All we need from it here
is to understand a new keyword –until. The filesyntax/tokens.gohas
a list of all tokens understood by the compiler, and we’ll add a new one:

_Fallthrough// fallthrough
_For       // for
_Until     // until
_Func      // func

The comment on the right-hand side of the token constant is important, as it’s
used to identify the token in text. This is done by means of code
generation fromsyntax/tokens.go, which has this line above the list of
tokens:

//go:generate stringer -type token -linecomment

go generatehas to be run manually and the output file
(syntax/token_string.go) is checked into the Go source repository. To
regenerate it I ran the following command from thesyntaxdirectory:

GOROOT=go generate tokens.go

TheGOROOTsetting isessential as of Go 1.12, and has to point to the root of
the source checkout where we’re modifying the compiler.

Having run the code generator and verified thatsyntax/token_string.gonow
has the new token, I tried rebuilding the compiler and ran into a panic:

It comes from this code insyntax/scanner.go:

// hash is a perfect hash function for keywords.
// It assumes that s has at least length 2.
funchash(s[]byte)uint{
 return(uint(s[0])4^uint(s[1])+uint(len(s)))&uint(len(keywordMap)-1)
}

varkeywordMap[16]token// size must be power of two

funcinit(){
 // populate keywordMap
 fortok:=_Break;tok_Var;tok++{
   h:=hash([]byte(tok.String()))
   ifkeywordMap[h]!=0{
     panic("imperfect hash")
   }
   keywordMap[h]=tok
 }
}

The compiler tries to build a “perfect” hash table
to perform keyword string to token lookups. By “perfect” it means it wants no
collisions, just a linear array where every keyword maps to a single index.
The hash function is rather ad-hoc (it only looks at the contents of the first
characters of the string token, for example) and it’s not easy to debug why
a new token creates collisions. To work around it, I increased the lookup table
size by changing it to[1, thus changing the size of the lookup
array from 64 to 128. This gives the hash function much more space to distribute
its keys, and the collision went away.

Parse

Go has a fairly standard recursive-descent parser, which converts a stream of
tokens produced by the scanner into aconcrete syntax tree. We’ll start by
adding a new node type foruntilinsyntax/nodes.go:

UntilStmtstruct{
 InitSimpleStmt
 CondExpr
 Body*BlockStmt
 stmt
}

I borrowed the overall structure fromForStmt, which is used forfor
loops. Similarly tofor, ouruntilstatement has several optional
sub-statements:

until;{
 
}

Bothandare optional, though it’s not common to omit
. TheUntilStmt.stmtembedded field is used for all syntax tree
statements and contains position information.

The parsing itself is done insyntax/parser.go. Theparser.stmtOrNil
method parses a statement in the current position. It looks at the current
token and makes a decision of which statement to parse. Here’s an excerpt
with the code we’re adding:

switchp.tok{
case_Lbrace:
 returnp.blockStmt("")

// ...

case_For:
 returnp.forStmt()

case_Until:
returnp.untilStmt()

And this isuntilStmt:

func(p*parser)untilStmt()Stmt{
 iftrace{
   deferp.trace("untilStmt")()
 }

 s:=new(UntilStmt)
 s.pos=p.pos()

 s.Init,s.Cond,_=p.header(_Until)
 s.Body=p.blockStmt("until clause")

 returns
}

We reuse the existingparser.headermethod which parses a header forif
andforstatements. In its most general form, it supports three parts
(separated by semicolons). Inforstatements the third part can be used for
the“post” statement, but we’re not
going to support this foruntilso we’re only interested in the first two.
Note thatheaderaccepts the source token to be able to differentiate
between the kinds of statements it’s serving; for example it would reject a
“post” statement forif. We should explicitly reject it foruntiltoo,
though I haven’t bothered to implement this right now.

These are all the changes we need for the parser. Sinceuntilis so similar
structurally to existing statements, we could reuse much of the functionality.

If we instrument the compiler to dump out the syntax tree (using
syntax.Fdump) after parsing and run it on:

i=4
untili==0{
 i--
 fmt.Println("Hello, until!")
}

We’ll get this fragment for theuntilstatement:

84  .  .  .  .  .  3: *syntax.UntilStmt {
 85  .  .  .  .  .  .  Init: nil
 86  .  .  .  .  .  .  Cond: *syntax.Operation {
 87  .  .  .  .  .  .  .  Op:==
 88  .  .  .  .  .  .  .  X: i @ ./useuntil.go:13:8
 89  .  .  .  .  .  .  .  Y: *syntax.BasicLit {
 90  .  .  .  .  .  .  .  .  Value: "0"
 91  .  .  .  .  .  .  .  .  Kind: 0
 92  .  .  .  .  .  .  .  }
 93  .  .  .  .  .  .  }
 94  .  .  .  .  .  .  Body: *syntax.BlockStmt {
 95  .  .  .  .  .  .  .  List: []syntax.Stmt (2 entries) {
 96  .  .  .  .  .  .  .  .  0: *syntax.AssignStmt {
 97  .  .  .  .  .  .  .  .  .  Op: -
 98  .  .  .  .  .  .  .  .  .  Lhs: i @ ./useuntil.go:14:3
 99  .  .  .  .  .  .  .  .  .  Rhs: *(Node @ 52)
100  .  .  .  .  .  .  .  .  }
101  .  .  .  .  .  .  .  .  1: *syntax.ExprStmt {
102  .  .  .  .  .  .  .  .  .  X: *syntax.CallExpr {
103  .  .  .  .  .  .  .  .  .  .  Fun: *syntax.SelectorExpr {
104  .  .  .  .  .  .  .  .  .  .  .  X: fmt @ ./useuntil.go:15:3
105  .  .  .  .  .  .  .  .  .  .  .  Sel: Println @ ./useuntil.go:15:7
106  .  .  .  .  .  .  .  .  .  .  }
107  .  .  .  .  .  .  .  .  .  .  ArgList: []syntax.Expr (1 entries) {
108  .  .  .  .  .  .  .  .  .  .  .  0: *syntax.BasicLit {
109  .  .  .  .  .  .  .  .  .  .  .  .  Value: ""Hello, until!""
110  .  .  .  .  .  .  .  .  .  .  .  .  Kind: 4
111  .  .  .  .  .  .  .  .  .  .  .  }
112  .  .  .  .  .  .  .  .  .  .  }
113  .  .  .  .  .  .  .  .  .  .  HasDots: false
114  .  .  .  .  .  .  .  .  .  }
115  .  .  .  .  .  .  .  .  }
116  .  .  .  .  .  .  .  }
117  .  .  .  .  .  .  .  Rbrace: syntax.Pos {}
118  .  .  .  .  .  .  }
119  .  .  .  .  .  }

Create AST

Now that it has a syntax tree representation of the source, the compiler builds
anabstract syntax tree. I’ve written aboutAbstract vs. Concrete syntax
trees
in
the past – it’s worth checking out if you’re not familiar with the differences.
In case of Go, however, this may get changed in the future. The Go compiler was
originally written in C and later auto-translated to Go; some parts of it are
vestigial from the olden C days, and some parts are newer. Future refactorings
may leave only one kind of syntax tree, but right now (Go 1.12) this is the
process we have to follow.

The AST code lives in thegcpackage, and the node types are defined in
gc/syntax.go(not to be confused with thesyntaxpackage where the
CST is defined!)

Go ASTs are structured differently from CSTs. Instead of each node type having
its dedicated struct type, all AST nodes are using thesyntax.Nodetype
which is a kind of adiscriminated unionthat holds fields for many different
types. Some fields are generic, however, and used for the majority of node
types:

// A Node is a single node in the syntax tree.
// Actually the syntax tree is a syntax DAG, because there is only one
// node with Op=ONAME for a given instance of a variable x.
// The same is true for Op=OTYPE and Op=OLITERAL. See Node.mayBeShared.
typeNodestruct{
 // Tree structure.
 // Generic recursive walks should follow these fields.
 Left*Node
 Right*Node
 NinitNodes
 NbodyNodes
 ListNodes
 RlistNodes

 // ...

We’ll start by adding a new constant to identify anuntilnode:

// statements
// ...
OFALL   // fallthrough
OFOR    // for Ninit; Left; Right { Nbody }
OUNTIL  // until Ninit; Left { Nbody }

We’ll rungo generateagain, this time ongc/syntax.go, to
generate a string representation for the new node type:

// from the gc directory
GOROOT=go generate syntax.go

This should update thegc/op_string.gofile to includeOUNTIL. Now it’s
time to write the actual CST->AST conversion code for our new node type.

The conversion is done ingc/noder.go. We’ll keep modeling our changes after
the existingforstatement support, starting withstmtFallwhich has a
switch for statement types:

case*syntax.ForStmt:
 returnp.forStmt(stmt)
case*syntax.UntilStmt:
returnp.untilStmt(stmt)

And the newuntilStmtmethod we’re adding to thenodertype:

// untilStmt converts the concrete syntax tree node UntilStmt into an AST
// node.
func(p*noder)untilStmt(stmt*syntax.UntilStmt)*Node{
 p.openScope(stmt.Pos())
 varn*Node
 n=p.nod(stmt,OUNTIL,nil,nil)
 ifstmt.Init!=nil{
   n.Ninit.Set1(p.stmt(stmt.Init))
 }
 ifstmt.Cond!=nil{
   n.Left=p.expr(stmt.Cond)
 }
 n.Nbody.Set(p.blockStmt(stmt.Body))
 p.closeAnotherScope()
 returnn
}

Recall the genericNodefields explained above. Here we’re using the
Initfield for the optional initializer, theLeftfield for the
condition and theNbodyfield for the loop body.

This is all we need to construct AST nodes foruntilstatements. If we dump
the AST after construction, we get:

.   .   UNTIL l(13)
.   .   .   EQ l(13)
.   .   .   .   NAME-main.i a(true) g(1) l(6) x(0) class(PAUTO)
.   .   .   .   LITERAL-0 l(13) untyped number
.   .   UNTIL-body
.   .   .   ASOP-SUB l(14) implicit(true)
.   .   .   .   NAME-main.i a(true) g(1) l(6) x(0) class(PAUTO)
.   .   .   .   LITERAL-1 l(14) untyped number

.   .   .   CALL l(15)
.   .   .   .   NONAME-fmt.Println a(true) x(0) fmt.Println
.   .   .   CALL-list
.   .   .   .   LITERAL-"Hello, until!" l(15) untyped string

Type-check

The next step in compilation is type-checking, which is done on the AST.
In addition to detecting type errors, type-checking in Go also includestype
inference
, which allows us to write statements like:

Without declaring the types ofresanderrexplicitly. The Go
type-checker does a few more tasks, like linking identifiers to their
declarations and computing compile-time constants. The
code is ingc/typecheck.go. Once again, following the lead of thefor
statement, we’ll add this clause to the switch intypecheck:

caseOUNTIL:
 ok|=ctxStmt
 typecheckslice(n.Ninit.Slice(),ctxStmt)
 decldepth++
 n.Left=typecheck(n.Left,ctxExpr)
 n.Left=defaultlit(n.Left,nil)
 ifn.Left!=nil{
   t:=n.Left.Type
   ift!=nil&&!t.IsBoolean(){
     yyerror("non-bool %L used as for condition",n.Left)
   }
 }
 typecheckslice(n.Nbody.Slice(),ctxStmt)
 decldepth--

It assigns types to parts of the statement, and also checks that the condition
is valid in a boolean context.

Analyze and rewrite AST

After type-checking, the compiler goes through several stages of AST analysis
and rewrite. The exact sequence is laid out in thegc.Mainfunction in
gc/main.go. In compiler nomenclature such stages are usually called
passes.

Many passes don’t require modifications to supportuntil
because they act generically on all statement kinds (here the generic structure
ofgc.Nodecomes useful). Some still do, however. For example escape
analysis, which tries to find which variables “escape” their function scope and
thus have to be allocated on the heap rather than on the stack.

Escape analysis works per statement type, so we have to add this switch clause
inEscape.stmt:

caseOUNTIL:
 e.loopDepth++
 e.discard(n.Left)
 e.stmts(n.Nbody)
 e.loopDepth--

Finally,gc.Maincalls into the portable code generator
(gc/pgen.go) to compile the analyzed code. The code generator starts by
applying a sequence of AST transformations to lower the AST to a more easily
compilable form. This is done in thecompilefunction, which starts by
callingorder.

This transformation (ingc/order.go) reorders statements and expressions to
enforce evaluation order. For example it will rewritefoo /=10tofoo=
foo / 10
, replace multi-assignment statements by multiple single-assignment
statements, and so on.

To supportuntilstatements, we’ll add this toOrder.stmt:

caseOUNTIL:
 t:=o.markTemp()
 n.Left=o.exprInPlace(n.Left)
 n.Nbody.Prepend(o.cleanTempNoPop(t)...)
 orderBlock(&n.Nbody,o.free)
 o.out=append(o.out,n)
 o.cleanTemp(t)

Afterorder,compilecallswalkwhich lives ingc/walk.go. This
pass collects a bunch of AST transformations that helps lower the AST to SSA
later on. For example, it rewritesrangeclauses inforloops to simpler
forms offorloops with an explicit loop variable[1]. It alsorewrites
map accesses to runtime calls
,
and much more.

To support a new statement inwalk, we have to add a switch clause in the
walkstmtfunction. Incidentally, this is also the place where we can
“implement” ouruntilstatement by rewriting it into AST nodes the compiler
already knows how to handle. In the case ofuntilit’s easy – we just
rewrite it into aforloop with an inverted condition, as shown in the
beginning of the post. Here is the transformation:

caseOUNTIL:
 ifn.Left!=nil{
   walkstmtlist(n.Left.Ninit.Slice())
   init:=n.Left.Ninit
   n.Left.Ninit.Set(nil)
   n.Left=nod(ONOT,walkexpr(n.Left,&init),nil)
   n.Left=addinit(n.Left,init.Slice())
   n.Op=OFOR
 }
 walkstmtlist(n.Nbody.Slice())

Note that we replacen.Left(the condition) with a new node of typeONOT
(which represents the unary!operator) wrapping the oldn.Left, and we
replacen.OpbyOFOR. That’s it!

If we dump the AST again after the walk, we’ll see that theOUNTILnode is
gone and a newOFORnode takes its place.

Trying it out

We can now try out our modified compiler and run a sample program that uses
anuntilstatement:

$ cat useuntil.go
package main

import "fmt"

func main() {
  i :=4
  until i==0 {
    i--
    fmt.Println("Hello, until!")
  }
}

$/bin/go run useuntil.go
Hello, until!
Hello, until!
Hello, until!
Hello, until!

It works!

Reminder:is the directory where we checked out Go, changed
it and compiled it (see Appendix for more details).

Concluding part 1

This is it for part 1. We’ve successfully implemented a new statement in the Go
compiler. We didn’t cover all the parts of the compiler because we could take a
shortcut by rewriting the AST ofuntilnodes to usefornodes instead.
This is a perfectly valid compilation strategy, and the Go compiler already has
many similar transformations tocanonicalizethe AST (reducing many forms to
fewer forms so the last stages of compilation have less work to do). That said,
we’re still interested in exploring the last two compilation stages –Convert
to SSA
andGenerate machine code. This will be covered inpart 2.

Read More

LEAVE A REPLY

Please enter your comment!
Please enter your name here