In this series, I will be building a Humio-like query language which I’ll call Oimuh (Oi-Muh). You can read more about Humio here:
I’ll attempt to create the entire pipeline in Golang by leveraging existing libraries, and writing my own where none are available.
Lexer
The complete flow of building a language is as follows:
Lexer
→ Parser
→ Interpreter
A Lexer is a program that takes in an input and breaks it down into tokens and values.
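As a rough illustration, here is a minimal sketch of the kind of (token, value) stream a lexer might emit for a small query fragment. The Item type and the token names are invented purely for this example:

```go
package main

import "fmt"

// Item is a hypothetical (token, value) pair, invented for this
// illustration only; real lexers define richer structures.
type Item struct {
	Tok string // the token's type, e.g. IDENT or INT
	Val string // the literal text that produced the token
}

func main() {
	// For the input `age > 32`, a lexer could emit something like:
	items := []Item{
		{Tok: "IDENT", Val: "age"},
		{Tok: "GREATER", Val: ">"},
		{Tok: "INT", Val: "32"},
	}
	for _, it := range items {
		fmt.Printf("%-8s %q\n", it.Tok, it.Val)
	}
}
```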
You can take a look at Golang’s very own Lexer here:
The Lexer goes through the program character by character, literally advancing through the text one character at a time and deciding which token each piece of it belongs to.
```go
...
// next returns the next rune in the input.
func (l *lexer) next() rune {
	if int(l.pos) >= len(l.input) {
		l.atEOF = true
		return eof
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.pos += Pos(w)
	if r == '\n' {
		l.line++
	}
	return r
}
...
// thisItem returns the item at the current input point with the specified type
// and advances the input.
func (l *lexer) thisItem(t itemType) item {
	i := item{t, l.start, l.input[l.start:l.pos], l.startLine}
	l.start = l.pos
	l.startLine = l.line
	return i
}
...
```
The tokens must be defined first in order for the Lexer to map the characters accordingly. These are some of the keyword tokens defined in Go’s template Lexer:
```go
...
var key = map[string]itemType{
	".":        itemDot,
	"block":    itemBlock,
	"break":    itemBreak,
	"continue": itemContinue,
	"define":   itemDefine,
	"else":     itemElse,
	"end":      itemEnd,
	"if":       itemIf,
	"range":    itemRange,
	"nil":      itemNil,
	"template": itemTemplate,
	"with":     itemWith,
}
...
```
If a character cannot be mapped to any of these types, or the input has a syntax error, the Lexer will return an error:
```go
...
func (i item) String() string {
	switch {
	case i.typ == itemEOF:
		return "EOF"
	case i.typ == itemError: // <-- error handling for invalid characters or syntax
		return i.val
	case i.typ > itemKeyword:
		return fmt.Sprintf("<%s>", i.val)
	case len(i.val) > 10:
		return fmt.Sprintf("%.10q...", i.val)
	}
	return fmt.Sprintf("%q", i.val)
}
...
```
For example
```go
// throwing an error based on syntax
return l.errorf("unterminated quoted string")

// throwing an error based on unrecognizable characters
return l.errorf("unrecognized character in action: %#U", r)
```
Oimuh
Using the example code as a reference, we modify the approach so that it can tokenize Oimuh syntax.
Before we proceed, we need to briefly define the syntax so that the Lexer knows which tokens are valid and what counts as an illegal character.
Syntax
- Filters
Filters accept a field and a value for filtering. They can be exact matches, negations, relational comparisons, and fuzzy checks:
field = value
field != value
field = "value"
field > value
field contains "value%"
- Aggregation Functions
Aggregation functions calculate a value or perform certain functions over a single row or multiple rows:
max()
min()
avg()
count()
distinct_count()
groupby()
- Other operations
Miscellaneous operations to modify the output of the query
sort field
columns field field field..
limit 100
- Joining commands
Commands can be concatenated through pipes
|
name contains "bob%" and age > 32 | sort name
In this exercise, we will only be implementing Filters and the miscellaneous operations, for simplicity.
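Mirroring the keyword map from Go’s template Lexer shown earlier, here is a minimal, hypothetical sketch of how the Oimuh keywords above could map to token types. The itemType names and values are invented for illustration and are not the final implementation:

```go
package main

import "fmt"

// itemType is a hypothetical token kind in the style of Go's template Lexer.
type itemType int

// These names are invented for illustration only.
const (
	itemContains itemType = iota
	itemAnd
	itemOr
	itemSort
	itemColumns
	itemLimit
)

// oimuhKeywords maps Oimuh keywords to token kinds, in the same way
// text/template's lexer maps its keywords to itemTypes.
var oimuhKeywords = map[string]itemType{
	"contains": itemContains,
	"and":      itemAnd,
	"or":       itemOr,
	"sort":     itemSort,
	"columns":  itemColumns,
	"limit":    itemLimit,
}

func main() {
	// Words that are not in the map are treated as plain identifiers
	// (field names or values) rather than keywords.
	for _, word := range []string{"contains", "name"} {
		if tok, ok := oimuhKeywords[word]; ok {
			fmt.Printf("%q is keyword token %d\n", word, tok)
		} else {
			fmt.Printf("%q is a plain identifier\n", word)
		}
	}
}
```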
Golang Lexer Code
Thankfully for us, Golang has a built-in library called text/scanner, which helps us build a Lexer and already does the tokenization for us. As engineers, we need to be smart and use the tools that are available to us, rather than re-invent the wheel, possibly less efficiently.
package main import ( "fmt" "strings" "text/scanner" ) func main() { var s scanner.Scanner input := "name contains \"bob%\" and age > 32 | sort name" s.Init(strings.NewReader(input)) for { tok := s.Scan() if tok == scanner.EOF { break } fmt.Printf("Token: %s, Position: %s\n", s.TokenText(), s.Position.String()) } }
```
kali in ~/golex λ go run lexer.go
Token: name, Position: <input>:1:1
Token: contains, Position: <input>:1:6
Token: "bob%", Position: <input>:1:15
Token: and, Position: <input>:1:22
Token: age, Position: <input>:1:26
Token: >, Position: <input>:1:30
Token: 32, Position: <input>:1:32
Token: |, Position: <input>:1:35
Token: sort, Position: <input>:1:37
Token: name, Position: <input>:1:42
```
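text/scanner also lets us tune how it reports things. As a small sketch (the filename and the error handler here are arbitrary, purely to show the hooks), we can set Filename so positions read better than <input>, and set Error so scan problems such as unterminated string literals are reported however we want:

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

func main() {
	var s scanner.Scanner
	s.Init(strings.NewReader(`name contains "bob% | sort name`)) // note the unterminated string

	// Filename shows up in reported positions instead of the default <input>.
	s.Filename = "query"

	// Error is called by the scanner when it cannot tokenize something,
	// e.g. the unterminated string literal above.
	s.Error = func(s *scanner.Scanner, msg string) {
		fmt.Printf("scan error at %s: %s\n", s.Pos(), msg)
	}

	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		fmt.Printf("Token: %s, Position: %s\n", s.TokenText(), s.Position)
	}
}
```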
But if you don’t want to be lazy and would rather write your own code for the Lexer, here’s an implementation that lexes the following input:
(name contains "bob%" and age > 32) or age = 32 | sort name
I used this blog post as a guide
package main import ( "bufio" "fmt" "io" "os" "unicode" ) type Token int const ( EOF = iota ILLEGAL IDENT INT SEMI // ; EQUALS // = NEGATE // ! LESS // < GREATER // > LBRAC // ( RBRAC // ) PIPE // | QUOTE // " PERC // % ) var tokens = []string{ EOF: "EOF", ILLEGAL: "ILLEGAL", IDENT: "IDENT", INT: "INT", EQUALS: "=", NEGATE: "!", LESS: "<", GREATER: ">", LBRAC: "(", RBRAC: ")", PIPE: "|", QUOTE: "\"", PERC: "%", } func (t Token) String() string { return tokens[t] } type Position struct { line int column int } type Lexer struct { pos Position reader *bufio.Reader } func NewLexer(reader io.Reader) *Lexer { return &Lexer{ pos: Position{line: 1, column: 0}, reader: bufio.NewReader(reader), } } // Lex scans the input for the next token. It returns the position of the token, // the token's type, and the literal value. func (l *Lexer) Lex() (Position, Token, string) { // keep looping until we return a token for { r, _, err := l.reader.ReadRune() if err != nil { if err == io.EOF { return l.pos, EOF, "" } // at this point there isn't much we can do, and the compiler // should just return the raw error to the user panic(err) } // update the column to the position of the newly read in rune l.pos.column++ switch r { case '(': return l.pos, LBRAC, "(" case ')': return l.pos, RBRAC, ")" case '|': return l.pos, PIPE, "|" case '>': return l.pos, GREATER, ">" case '<': return l.pos, LESS, "<" case '"': return l.pos, QUOTE, "\"" case '=': return l.pos, EQUALS, "=" case '%': return l.pos, PERC, "%" default: if unicode.IsSpace(r) { continue // nothing to do here, just move on } else if unicode.IsDigit(r) { // backup and let lexInt rescan the beginning of the int startPos := l.pos l.backup() lit := l.lexInt() return startPos, INT, lit } else if unicode.IsLetter(r) { // backup and let lexIdent rescan the beginning of the ident startPos := l.pos l.backup() lit := l.lexIdent() return startPos, IDENT, lit } else { return l.pos, ILLEGAL, string(r) } } } } func (l *Lexer) backup() { if err := l.reader.UnreadRune(); err != nil { panic(err) } l.pos.column-- } // lexInt scans the input until the end of an integer and then returns the // literal. func (l *Lexer) lexInt() string { var lit string for { r, _, err := l.reader.ReadRune() if err != nil { if err == io.EOF { // at the end of the int return lit } } l.pos.column++ if unicode.IsDigit(r) { lit = lit + string(r) } else { // scanned something not in the integer l.backup() return lit } } } // lexIdent scans the input until the end of an identifier and then returns the // literal. func (l *Lexer) lexIdent() string { var lit string for { r, _, err := l.reader.ReadRune() if err != nil { if err == io.EOF { // at the end of the identifier return lit } } l.pos.column++ if unicode.IsLetter(r) { lit = lit + string(r) } else { // scanned something not in the identifier l.backup() return lit } } } func main() { file, err := os.Open("input.test") if err != nil { panic(err) } lexer := NewLexer(file) for { pos, tok, lit := lexer.Lex() if tok == EOF { break } fmt.Printf("%d:%d\t%s\t%s\n", pos.line, pos.column, tok, lit) } }
```
kali in ~/golex λ go run lexer.go
1:1	(	(
1:2	IDENT	name
1:7	IDENT	contains
1:16	"	"
1:17	IDENT	bob
1:20	%	%
1:21	"	"
1:23	IDENT	and
1:27	IDENT	age
1:31	>	>
1:33	INT	32
1:35	)	)
1:37	IDENT	or
1:40	IDENT	age
1:44	=	=
1:46	INT	32
1:49	|	|
1:51	IDENT	sort
1:56	IDENT	name
```
As a rule of thumb, the Lexer should only catch illegal characters and not syntax errors.
For example, if we feed the Lexer this input with an unclosed quote, it should still dutifully emit the tokens. This is in the spirit of each component doing one thing only.
(name contains "bob% and age > 32) or age = 32 | sort name
In the following posts, we will be using Participle to help us parse the tokens. Participle uses text/scanner as well, so it is a natural extension for parsing.