# MEP 1. Lexer
| Field | Value |
|---|---|
| MEP | 1 |
| Title | Lexer |
| Author | Mochi core |
| Status | Informational |
| Type | Informational |
| Created | 2026-05-08 |
## Abstract

Mochi uses a single `lexer.MustSimple` table (from `participle/v2`) with nine token rules. This MEP documents each rule, the reserved word set, and the design choices that make the rest of the parser possible. It is informational because the lexer is small enough that we can describe it completely.
## Motivation
Many parser bugs reduce to a token rule that ate too much or too little. A new keyword silently steals an identifier. A new punctuation character collides with an existing prefix. The block comment regex does not nest. Knowing the token rules in source order, and knowing why each rule is where it is, prevents the next change from breaking the existing fixtures.
## Specification

### Token classes

The rules, in source order, at `parser/parser.go:48-62`:
- Comment. Line comments with `//` or `#`, plus block comments `/* ... */`. The block form does not nest.
- Bool. Word-boundary-delimited `true` or `false`.
- Keyword. The reserved word set, listed below.
- Ident. A Unicode letter, a character in the `So` (other symbol) class, or an underscore, followed by letters, `So` characters, digits, or underscores. The Unicode coverage is intentional so identifiers can use mathematical or emoji characters.
- Float. Decimal literal with a fractional part or an exponent. No leading minus.
- Int. Hex (`0x`), binary (`0b`), octal (`0o`), or decimal. No leading minus.
- String. Double quoted with backslash escapes.
- Punct. Multi-character punctuation (`==`, `!=`, `<=`, `>=`, `&&`, `||`, `=>`, `:-`, `..`) or a single character from `-+*/%=<>!|{}[](),.:`.
- Whitespace. Space, tab, newline, carriage return, semicolon. The lexer treats `;` as whitespace, which is why statements do not require terminators.
Rule order matters: Bool must come before Keyword, and Keyword before Ident; otherwise reserved words would be lexed as identifiers.
## Reserved words

`parser/parser.go:51` lists the keyword regex. The full set:

```
test expect agent intent on stream emit type fun extern import return
break continue let var if else then for while in generate match fetch
load save package export fact rule all null
```
A few words that look like keywords are not in the global list and are matched as identifiers when used in expression contexts. They become significant only inside a specific production:

- Declaration words: `bench`, `model`, `update`. Used to introduce a `BenchBlock`, a `ModelDecl`, or an `UpdateStmt`.
- Query expression keywords: `from`, `where`, `select`, `group`, `by`, `into`, `having`, `sort`, `order`, `skip`, `take`, `distinct`, `join`, `left`, `right`, `outer`.
- Modifier words: `as`, `to`, `with`. Used after `load`, `save`, `cast`, or in import clauses.
- Set operators: `union`, `except`, `intersect`. Used in binary expressions.
All four buckets parse as Ident outside their production. `let bench = 1` is accepted today; the same is true for every word above.
### Numeric literal design

Numeric literals do not include a leading minus sign. The lexer tokenises `-1` as the punctuation `-` followed by the integer `1`. The parser then handles unary minus as a prefix operator on the postfix expression.

The reason is that `len(list)-1` should parse as `(len(list) - 1)` even when written as `list[len(list)-1]`. If the lexer were greedy about negative numbers, that input would tokenise as `list`, `[`, `len`, `(`, `list`, `)`, `-1`, `]`, which the parser would then read as indexing by `-1` after a missing operator, not as subtraction.
### String literals

Strings are double quoted. The participle option `participle.Unquote("String")` at `parser/parser.go:664` turns the quoted source into the unescaped string value before the parser sees it. There is no raw string form, no triple-quoted string, no string interpolation. Concatenation is done with `+`.
### Comment forms

Block and line comments are stripped by `participle.Elide("Whitespace", "Comment")` at `parser/parser.go:663`. There is also an out-of-band doc attachment pass (`parser/docs.go`) that re-reads the source and attaches preceding line comments to the nearest declaration as the `Doc` field. This means doc strings survive even though the parser drops the underlying token.
## Rationale

A single simple lexer table is enough for a small language. Participle's `MustSimple` is fast, easy to audit, and produces good error messages out of the box.

The decision to keep query and modifier words as soft keywords rather than reserving them globally trades parser complexity for user freedom. Reserving `where` or `union` globally would forbid those words as identifiers in user code. The cost is that the parser must work harder to distinguish identifier uses from keyword uses, and a careless grammar change can introduce ambiguity.
Treating semicolons as whitespace lets users write either one statement per line or several statements on one line, without imposing a syntactic terminator.
## Backwards Compatibility
This MEP describes the lexer as it is. No proposal is embedded.
When a future MEP adds a keyword:
- Test the new word against the existing fixture set under `tests/parser/valid/`.
- Search whatever third-party code the team can scan to estimate breakage.
- Decide between a hard keyword (always reserved) and a soft keyword (reserved only inside a specific production). Soft is cheaper; it scales worse.
## Reference Implementation

- `parser/parser.go:48-62` — lexer rules.
- `parser/parser.go:51` — keyword regex.
- `parser/parser.go:53-56` — comment on numeric literal design.
- `parser/parser.go:663-664` — elision and string unquote configuration.
- `parser/docs.go` — out-of-band doc attachment pass.
## Open Questions

- Block comments do not nest. A literal `*/` inside a block comment terminates it early. We could switch to a hand-written tokenizer that supports nesting. Low priority.
- Negative numeric literals. A future change could make the lexer recognise `-1` as a number when it is not preceded by an identifier or close bracket, removing the `x == -1` parens gotcha. Today `x == -1` and `x == - 1` both fail to parse; only `x == (-1)` works. The trade-off is extra grammar complexity in exchange for removing one syntactic surprise.
## References
- Participle v2 lexer documentation: https://github.com/alecthomas/participle.
## Copyright
This document is placed in the public domain.