Some say that “everything boils down to ones and zeros”—but do we really understand how our programs get translated into those bits?

Compilers and interpreters both take a raw string representing a program, parse it, and make sense of it. Though interpreters are the simpler of the two, writing even a very simple interpreter (one that only does addition and multiplication) will be instructive. We’ll focus on what compilers and interpreters have in common: lexing and parsing the input.

The Dos and Don’ts of Writing Your Own Interpreter

Readers may wonder, “What’s wrong with a regex?” Regular expressions are powerful, but source code grammars aren’t simple enough to be parsed by them. Neither are domain-specific languages (DSLs), and a client might need a custom DSL for authorization expressions, for example. But even without applying this skill directly, writing an interpreter makes it much easier to appreciate the effort behind many programming languages, file formats, and DSLs.

Writing parsers correctly by hand can be challenging with all the edge cases involved. That’s why there are popular tools, like ANTLR, that can generate parsers for many popular programming languages. There are also libraries called parser combinators that enable developers to write parsers directly in their preferred programming languages. Examples include FastParse for Scala and Parsec for Haskell.

We recommend that, in a professional context, readers use such tools and libraries to avoid reinventing the wheel. Still, understanding the challenges and possibilities of writing an interpreter from scratch will help developers leverage such solutions more effectively.

Overview of Interpreter Components

An interpreter is a complex program, so there are multiple stages to it:

  1. A lexer is the part of an interpreter that turns a sequence of characters (plain text) into a sequence of tokens.
  2. A parser, in turn, takes a sequence of tokens and produces an abstract syntax tree (AST) of a language. The rules by which a parser operates are usually specified by a formal grammar.
  3. An interpreter is a program that interprets the AST of the source of a program on the fly (without compiling it first).

We won’t build a specific, integrated interpreter here. Instead, we’ll explore each of these parts and their common issues with separate examples. In the end, the user code will look like this:

val input = "2 * 7 + 5"
val tokens = Lexer(input).lex()
val ast = Parser(tokens).parse()
val res = Interpreter(ast).interpret()
println(s"Result is: $res")

Following the three stages, we’ll expect this code to calculate a final value and print Result is: 19. This tutorial happens to use Scala because it’s:

  • Very concise, fitting a lot of code in one screen.
  • Expression oriented, with no need for uninitialized/null variables.
  • Type safe, with a powerful collections library, enums, and case classes.

Specifically, the code here is written in Scala 3 optional-braces syntax (a Python-like, indentation-based syntax). But none of the approaches are Scala-specific, and Scala is similar to many other languages: Readers will find it straightforward to convert these code samples into other languages. Barring that, the examples can be run online using Scastie.

Lastly, the Lexer, Parser, and Interpreter sections have different example grammars. As the corresponding GitHub repo shows, the dependencies in later examples change slightly to implement these grammars, but the overall concepts stay the same.

Interpreter Component 1: Writing a Lexer

Let’s say we want to lex this string: "123 + 45 true * false1". It contains different types of tokens:

  • Integer literals
  • A + operator
  • A * operator
  • A true literal
  • An identifier, false1

Whitespace between tokens will be skipped in this example.

At this stage, expressions don’t have to make sense; the lexer simply converts the input string into a list of tokens. (The job of “making sense of tokens” is left to the parser.)

We’ll use this code to represent a token:

case class Token(
  tpe: Token.Type,
  text: String,
  startPos: Int
)

object Token:
  enum Type:
    case Num
    case Plus
    case Times
    case Identifier
    case True
    case False
    case EOF

Every token has a type, textual representation, and position in the original input. The position can help end users of the lexer with debugging.

The EOF token is a special token that marks the end of input. It doesn’t exist in the source text; we only use it to simplify the parser stage.

This will be the output of our lexer:

Lexing input:
123 + 45 true * false1

Tokens:
List(
  Token(tpe = Num, text = "123", startPos = 0),
  Token(tpe = Plus, text = "+", startPos = 4),
  Token(tpe = Num, text = "45", startPos = 6),
  Token(tpe = True, text = "true", startPos = 9),
  Token(tpe = Times, text = "*", startPos = 14),
  Token(tpe = Identifier, text = "false1", startPos = 16),
  Token(tpe = EOF, text = "", startPos = 22)
)

Let’s examine the implementation:

import scala.collection.mutable

class Lexer(input: String):
  import Token.Type

  def lex(): List[Token] =
    val tokens = mutable.ArrayBuffer.empty[Token]
    var currentPos = 0
    while currentPos < input.length do
      val tokenStartPos = currentPos
      val lookahead = input(currentPos)
      if lookahead.isWhitespace then
        currentPos += 1 // ignore whitespace
      else if lookahead == '+' then
        currentPos += 1
        tokens += Token(Type.Plus, lookahead.toString, tokenStartPos)
      else if lookahead == '*' then
        currentPos += 1
        tokens += Token(Type.Times, lookahead.toString, tokenStartPos)
      else if lookahead.isDigit then
        var text = ""
        while currentPos < input.length && input(currentPos).isDigit do
          text += input(currentPos)
          currentPos += 1
        tokens += Token(Type.Num, text, tokenStartPos)
      else if lookahead.isLetter then // first character must be a letter
        var text = ""
        while currentPos < input.length && input(currentPos).isLetterOrDigit do
          text += input(currentPos)
          currentPos += 1
        val tpe = text match
          case "true"  => Type.True // special-casing literals
          case "false" => Type.False
          case _       => Type.Identifier
        tokens += Token(tpe, text, tokenStartPos)
      else
        error(s"Unknown character '$lookahead' at position $currentPos")

    tokens += Token(Type.EOF, "", currentPos) // special end marker
    tokens.toList

We start with an empty list of tokens, then we go through the string and add tokens as they come.

We use the lookahead character to decide the type of the next token. Note that the lookahead character is not always the furthest character ahead being examined. Based on the lookahead, we know what the token looks like and use currentPos to scan all of the expected characters in the current token, then add the token to the list:

If the lookahead is whitespace, we skip it. Single-character tokens like + and * are trivial; we add them and increment the index. For integers, we keep consuming digits and advancing the index until we hit a non-digit character.

Now we get to something a bit complicated: identifiers versus literals. The rule is that we take the longest possible match and check if it’s a literal; if it’s not, it’s an identifier.
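To illustrate the longest-match rule, here is a minimal, self-contained sketch of the word-scanning step. (The names scanWord, Literal, and Identifier are illustrative, not taken from the repo.)

```scala
// Scan the longest run of letters and digits starting at `pos`,
// then decide whether the resulting word is a known literal or an identifier.
def scanWord(input: String, pos: Int): (String, String) =
  var end = pos
  while end < input.length && input(end).isLetterOrDigit do
    end += 1
  val text = input.substring(pos, end)
  val kind = text match
    case "true" | "false" => "Literal"
    case _                => "Identifier"
  (text, kind)
```

Because we scan the whole word before classifying it, false1 comes out as a single identifier rather than the literal false followed by the number 1.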

Interpreter Component 2: Writing a Parser

Take care when handling operators. Suppose our grammar describes + with an ambiguous rule:

expr -> expr '+' expr | NUM

Given the input 1 + 2 + 3, our parser could choose to calculate either the left expr or the right expr first in the resulting AST:

[Figure: left-handed and right-handed ASTs for 1 + 2 + 3.]

That’s why we need to introduce some asymmetry:

expr -> expr '+' NUM | NUM

The set of expressions we can represent with this grammar hasn’t changed from the first version, but now the grammar is unambiguous: The parser always goes left. Just what we needed!

This makes our + operation left-associative; why that matters will become apparent when we get to the Interpreter section.

Left-recursive Rules

Unfortunately, the above fix doesn’t solve our other problem, left recursion:

def expr() =
  expr()
  eat('+')
  eat(NUM)

We have infinite recursion here. If we were to step into this function, we’d eventually get a stack-overflow error. But parsing theory can help!

Suppose we have a grammar like this, where alpha could be any sequence of terminals and nonterminals:

A -> A alpha | B

We can rewrite this grammar as:

A   -> B A'
A'  -> alpha A' | epsilon

Here, epsilon is the empty string: nothing, no token.

Let’s take the current revision of our grammar:

expr -> expr '+' NUM | NUM

Following the method of rewriting parsing rules detailed above, with alpha being our '+' NUM tokens, our grammar becomes:

expr    -> NUM exprOpt
exprOpt -> '+' NUM exprOpt | epsilon

Now the grammar is OK, and we can parse it with a recursive descent parser. Let’s see how such a parser would look for this latest iteration of our grammar:

class Parser(allTokens: List[Token]):
  import Token.Type

  private var tokens = allTokens
  private var lookahead = tokens.head

  def parse(): Unit =
    expr()
    if lookahead.tpe != Type.EOF then
      error(s"Unknown token '${lookahead.text}' at position ${lookahead.startPos}")

  private def expr(): Unit =
    eat(Type.Num)
    exprOpt()

  private def exprOpt(): Unit =
    if lookahead.tpe == Type.Plus then
      eat(Type.Plus)
      eat(Type.Num)
      exprOpt()
    // else: end recursion, epsilon

  private def eat(tpe: Type): Unit =
    if lookahead.tpe != tpe then
      error(s"Expected: $tpe, got: ${lookahead.tpe} at position ${lookahead.startPos}")
    tokens = tokens.tail
    lookahead = tokens.head

Here we use the EOF token to simplify our parser. We are always sure that there is at least one token in our list, so we don’t need to handle a special case of an empty list.

Also, if we were to switch to a streaming lexer, we wouldn’t have an in-memory list but an iterator, so we’d need a marker to know when we reach the end of the input. At that point, the EOF token should be the last token remaining.

Walking through the code, we can see that an expression can be just a number. If there’s nothing left, the next token wouldn’t be a Plus, so we would stop parsing. The last token would be EOF, and we would be done.

If the input string has more tokens, then these would have to look like + 123. That’s where recursion on exprOpt() kicks in!
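The whole recognizer can be condensed into a self-contained sketch that operates on plain strings instead of the Token class. (This is illustrative only; the Parser above is the real version.)

```scala
// Recognizes the grammar: expr -> NUM exprOpt; exprOpt -> '+' NUM exprOpt | epsilon.
// Returns true when the whole token list matches.
def recognize(tokens: List[String]): Boolean =
  var rest = tokens
  def eat(t: String): Boolean =
    if rest.headOption.contains(t) then
      rest = rest.tail
      true
    else false
  def exprOpt(): Boolean =
    if rest.headOption.contains("+") then eat("+") && eat("NUM") && exprOpt()
    else true // epsilon: consume nothing, end the recursion
  eat("NUM") && exprOpt() && rest.isEmpty
```

Note how the epsilon branch is just “do nothing and succeed,” exactly as in the grammar.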

Generating an AST

We’ve now successfully parsed our expression, but it’s hard to do anything useful with it as is. We could put some callbacks in our parser, but that would be very cumbersome and unreadable. Instead, we will return an AST, a tree representing the input expression:

case class Expr(num: Int, exprOpt: ExprOpt)

enum ExprOpt:
  case Opt(num: Int, exprOpt: ExprOpt)
  case Epsilon

This resembles our rules, using simple data classes.
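For a concrete input such as 1 + 2 + 3, the parser would produce a right-nested chain of Opt nodes. Here is a hand-built sketch (the data classes are repeated so the snippet stands alone):

```scala
case class Expr(num: Int, exprOpt: ExprOpt)

enum ExprOpt:
  case Opt(num: Int, exprOpt: ExprOpt)
  case Epsilon

// The AST for "1 + 2 + 3": 1, then (+ 2), then (+ 3), then epsilon.
val ast = Expr(1, ExprOpt.Opt(2, ExprOpt.Opt(3, ExprOpt.Epsilon)))

// Walking the chain left to right sums the numbers, mirroring how an
// interpreter would later evaluate this tree.
def sum(e: Expr): Int =
  def go(o: ExprOpt, acc: Int): Int = o match
    case ExprOpt.Opt(n, rest) => go(rest, acc + n)
    case ExprOpt.Epsilon      => acc
  go(e.exprOpt, e.num)
```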

Our parser now returns a useful data structure:

class Parser(allTokens: List[Token]):
  import Token.Type

  private var tokens = allTokens
  private var lookahead = tokens.head

  def parse(): Expr =
    val res = expr()
    if lookahead.tpe != Type.EOF then
      error(s"Unknown token '${lookahead.text}' at position ${lookahead.startPos}")
    else
      res

  private def expr(): Expr =
    val num = eat(Type.Num)
    Expr(num.text.toInt, exprOpt())

  private def exprOpt(): ExprOpt =
    if lookahead.tpe == Type.Plus then
      eat(Type.Plus)
      val num = eat(Type.Num)
      ExprOpt.Opt(num.text.toInt, exprOpt())
    else
      ExprOpt.Epsilon

For eat(), error(), and other implementation details, please see the accompanying GitHub repo.

Simplifying Rules

Our ExprOpt nonterminal can still be improved:

'+' NUM exprOpt | epsilon

It’s hard to recognize the pattern it represents in our grammar just by looking at it. It turns out that we can replace this recursion with a simpler construct:

('+' NUM)*

This construct simply means '+' NUM occurs zero or more times.

Now our full grammar looks like this:

expr    -> NUM exprOpt*
exprOpt -> '+' NUM

And our AST looks nicer:

case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(num: Int)

The resulting parser is the same length but simpler to understand and use. We’ve eliminated Epsilon, which is now implied by starting with an empty structure.
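In this flattened representation, the 1 + 2 + 3 tree becomes a head number plus a sequence, and the recursive walk collapses into a left fold (classes repeated here so the sketch stands alone):

```scala
case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(num: Int)

// The AST for "1 + 2 + 3" in the simplified form.
val ast = Expr(1, Seq(ExprOpt(2), ExprOpt(3)))

// An empty exprOpts sequence now plays the role Epsilon used to.
def sum(e: Expr): Int = e.exprOpts.foldLeft(e.num)((acc, o) => acc + o.num)
```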

We didn’t even need the ExprOpt class here. We could have just put case class Expr(num: Int, exprOpts: Seq[Int]), or in grammar format, NUM ('+' NUM)*. So why didn’t we?

Consider that if we had multiple possible operators, like - or *, then we’d have a grammar like this:

expr    -> NUM exprOpt*
exprOpt -> [+-*] NUM

In this case, the AST then needs ExprOpt to accommodate the operator type:

case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: String, num: Int)

Note that the [+-*] syntax in the grammar means the same thing as in regular expressions: “one of those three characters.” We’ll see this in action soon.

Interpreter Component 3: Writing an Interpreter

Our interpreter will make use of our lexer and parser to get the AST of our input expression and then evaluate that AST whichever way we want. In this case, we’re dealing with numbers, and we want to evaluate their sum.

In the implementation of our interpreter example, we will use this simple grammar:

expr    -> NUM exprOpt*
exprOpt -> [+-] NUM

And this AST:

case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: Token.Type, num: Int)

(We’ve covered how to implement a lexer and a parser for similar grammars, but any readers who get stuck can peruse the lexer and parser implementations for this exact grammar in the repo.)

Now we’ll see how to write an interpreter for the above grammar:

class Interpreter(ast: Expr):

  def interpret(): Int = eval(ast)

  private def eval(expr: Expr): Int =
    var tmp = expr.num
    expr.exprOpts.foreach { exprOpt =>
      if exprOpt.op == Token.Type.Plus
      then tmp += exprOpt.num
      else tmp -= exprOpt.num
    }
    tmp

If we parsed our input into an AST without encountering an error, we’re sure that we’ll always have at least one NUM. Then we take the optional numbers and add them to (or subtract them from) our result.

The earlier note about the left associativity of + is now clear: We start from the leftmost number and add the others, from left to right. This may seem unimportant for addition, but consider subtraction: The expression 5 - 2 - 1 is evaluated as (5 - 2) - 1 = 3 - 1 = 2 and not as 5 - (2 - 1) = 5 - 1 = 4!
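We can check this with a small self-contained sketch (op is a plain Char here instead of Token.Type, purely to keep the snippet standalone):

```scala
case class Expr(num: Int, exprOpts: Seq[ExprOpt])
case class ExprOpt(op: Char, num: Int)

// A left fold gives left associativity: start from the leftmost number
// and apply each operator as it appears.
def eval(e: Expr): Int =
  e.exprOpts.foldLeft(e.num) { (acc, o) =>
    if o.op == '+' then acc + o.num else acc - o.num
  }

// The AST for "5 - 2 - 1".
val ast = Expr(5, Seq(ExprOpt('-', 2), ExprOpt('-', 1)))
```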

But if we want to go beyond interpreting plus and minus operators, there’s another rule to define.

Precedence

We know how to parse a simple expression like 1 + 2 + 3, but when it comes to 2 + 3 * 4 + 5, we have a bit of a problem.

Most people agree on the convention that multiplication has higher precedence than addition. But the parser doesn’t know that. We can’t just evaluate it as ((2 + 3) * 4) + 5. Instead we want (2 + (3 * 4)) + 5.

This means that we need to evaluate multiplication first. Multiplication needs to be further from the root of the AST to force it to be evaluated before addition. For this, we need to introduce yet another layer of indirection.

Fixing a Naive Grammar From Start to Finish

This is our original, left-recursive grammar, which has no precedence rules:

expr -> expr '+' expr | expr '*' expr | NUM

First, we give it rules of precedence and remove its ambiguity:

expr    -> expr '+' term | term
term    -> term '*' NUM | NUM

Then we rewrite it with non-left-recursive rules:

expr      -> term exprOpt*
exprOpt   -> '+' term
term      -> NUM termOpt*
termOpt   -> '*' NUM

The result is a beautifully expressive AST:

case class Expr(term: Term, exprOpts: Seq[ExprOpt])
case class ExprOpt(term: Term)

case class Term(num: Int, termOpts: Seq[TermOpt])
case class TermOpt(num: Int)

This leaves us with a concise interpreter implementation:

class Interpreter(ast: Expr):

  def interpret(): Int = eval(ast)

  private def eval(expr: Expr): Int =
    var tmp = eval(expr.term)
    expr.exprOpts.foreach { exprOpt =>
      tmp += eval(exprOpt.term)
    }
    tmp

  private def eval(term: Term): Int =
    var tmp = term.num
    term.termOpts.foreach { termOpt =>
      tmp *= termOpt.num
    }
    tmp

As before, the ideas in the requisite lexer and grammar were covered earlier, but readers can find them in the repo if needed.
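To see precedence working end to end, here is a hand-built AST for 2 + 3 * 4 + 5 and a fold-based evaluator equivalent to the Interpreter above (classes repeated so the sketch stands alone):

```scala
case class Expr(term: Term, exprOpts: Seq[ExprOpt])
case class ExprOpt(term: Term)
case class Term(num: Int, termOpts: Seq[TermOpt])
case class TermOpt(num: Int)

// Terms multiply their factors; the expression sums its terms.
def evalTerm(t: Term): Int = t.termOpts.foldLeft(t.num)((acc, o) => acc * o.num)
def evalExpr(e: Expr): Int =
  e.exprOpts.foldLeft(evalTerm(e.term))((acc, o) => acc + evalTerm(o.term))

// "2 + 3 * 4 + 5": the multiplication is nested inside the middle Term,
// farther from the root, so it is evaluated before the additions.
val ast = Expr(
  Term(2, Nil),
  Seq(ExprOpt(Term(3, Seq(TermOpt(4)))), ExprOpt(Term(5, Nil)))
)
```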

Next Steps in Writing Interpreters

We didn’t cover this, but error handling and reporting are crucial features of any parser. As developers, we know how frustrating it can be when a compiler produces confusing or misleading errors. It’s an area that has many interesting problems to solve, like giving correct and precise error messages, not deterring the user with more messages than necessary, and recovering gracefully from errors. It’s up to the developers writing an interpreter or compiler to ensure their future users have a better experience.

In walking through our example lexers, parsers, and interpreters, we only scratched the surface of the theories behind compilers and interpreters, which cover topics like:

  • Scopes and symbol tables
  • Static types
  • Compile-time optimization
  • Static program analyzers and linters
  • Code formatting and pretty-printing
  • Domain-specific languages

For further reading, I recommend these resources:

  • Language Implementation Patterns by Terence Parr
  • A free online book, Crafting Interpreters, by Bob Nystrom
  • Intro to Grammars and Parsing by Paul Klint
  • Writing Good Compiler Error Messages by Caleb Meredith
  • The notes from the East Carolina University course “Program Translation and Compiling”
