Writing a JVM compiler – Tokenization

Since I first started programming I have always been interested in understanding how a programming language works. A lot happens in the background, and I feel that understanding the layer underneath the language you program in is very beneficial to understanding a language as a whole.

I first decided to start by writing a compiler that converted my own language into C++. This worked quite well, but it felt somewhat like cheating as the GNU compiler was still doing the real work. I then started learning Scala, which got me interested in writing a language that compiles to Java bytecode.

My chosen language for writing the compiler is Scala, though the project could easily be ported to other languages.

The aim of this blog is to create a compiler that converts our code into Java bytecode. The language will have a syntax similar to Java/Scala, but it will use indentation instead of curly braces.

Why convert into JVM bytecode?

  • Full access to all existing Java libraries
  • Cross-platform
  • JVM deals with memory handling

Project Structure

Seeing as this project will be rather large, I've decided to split it into sections.

Lexer (Tokenization)

The lexer will take a string and convert it into tokens using regular expressions as a set of rules.

Parser

Checks are made using regular expressions to decide what the code represents.

Block Creation (Abstract Syntax Tree)

Blocks are used to represent the code in memory.

Generate Java ASM code 

The blocks are looped through and the ASM library is used to output the Java bytecode to a file.
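
Code generation will be covered in a later post, but as a rough preview of what the ASM library looks like, here is a minimal sketch (assuming the org.objectweb.asm dependency; this is not the project's actual code generator) that writes a HelloWorld class file:

// AsmPreview.scala (hypothetical example, not part of the compiler yet)
import java.nio.file.{Files, Paths}
import org.objectweb.asm.{ClassWriter, Opcodes}

object AsmPreview {
  def main(args: Array[String]): Unit = {
    val cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES)

    // public class HelloWorld extends Object
    cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "HelloWorld", null, "java/lang/Object", null)

    // public static void main(String[] args) { System.out.println("Hello"); }
    val mv = cw.visitMethod(Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "main",
      "([Ljava/lang/String;)V", null, null)
    mv.visitCode()
    mv.visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;")
    mv.visitLdcInsn("Hello")
    mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println",
      "(Ljava/lang/String;)V", false)
    mv.visitInsn(Opcodes.RETURN)
    mv.visitMaxs(0, 0)
    mv.visitEnd()
    cw.visitEnd()

    // Write the generated bytecode to a .class file
    Files.write(Paths.get("HelloWorld.class"), cw.toByteArray)
  }
}

Everything before that step is about getting our source text into a form the code generator can walk, which starts with the lexer.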

Lexer (Tokenization)

The main focus today is on tokenization. Tokenization splits a string into multiple tokens, assigning each one a type and a value.

// Token.scala
import TokenType.TokenType

// A token holds the matched text and the type it was assigned
class Token(val tokenInit: String, val typeInit: TokenType) {

  def token: String = tokenInit

  def tokenType: TokenType = typeInit

  override def toString: String = tokenInit

}
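
As an aside, in more idiomatic Scala the same thing could be written as a case class, which provides the accessors, equality and toString for free (a sketch, not what the project uses):

// Hypothetical case-class alternative to Token
case class Token(token: String, tokenType: TokenType.TokenType) {
  override def toString: String = token
}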

A token type is defined using an enumeration, and each type is linked to a regular expression. Every token will have its type set.

// TokenType.scala
object TokenType extends Enumeration {
  type TokenType = Value
  val EMPTY,                 // returned when the input has been consumed
  TOKEN,                     // single symbols: = ( ) . , '
  IDENTIFIER,                // "^([a-zA-Z][a-zA-Z0-9]*)"
  INTEGER_LITERAL,           // "^([0-9]+)"
  COLON = Value              // "^([:])"
}

The TokenData class stores the regular expression and TokenType.

// TokenData.scala
import scala.util.matching.Regex
import TokenType.TokenType

// Pairs a regular expression with the TokenType it produces
class TokenData(val pattern: Regex, val `type`: TokenType) {

  def getPattern: Regex = pattern

  def getType: TokenType = `type`

}

The tokenizer takes the string and loops through all of the TokenData entries to see if there is a match. If there is, that match becomes the next token.

// Tokenizer.scala
import scala.collection.mutable.ListBuffer

class Tokenizer(var str: String) {

  private val tokenDatas: ListBuffer[TokenData] = ListBuffer[TokenData](
    new TokenData("^([0-9]+)".r, TokenType.INTEGER_LITERAL),
    new TokenData("^([a-zA-Z][a-zA-Z0-9]*)".r, TokenType.IDENTIFIER),
    new TokenData("^([:])".r, TokenType.COLON)
  )

  // Single-character symbols all share the generic TOKEN type
  for (t <- List[String]("\\=", "\\(", "\\)", "\\.", "\\,", "\\'"))
    tokenDatas += new TokenData(("^(" + t + ")").r, TokenType.TOKEN)

  def nextToken: Token = {
    str = str.trim

    if (str.isEmpty) {
      new Token("", TokenType.EMPTY)
    } else {
      for (data <- tokenDatas) {
        val matched = data.pattern.findFirstIn(str)
        if (matched.isDefined) {
          val token: String = matched.getOrElse("")
          // Drop only the matched prefix (replace would remove every occurrence)
          str = str.substring(token.length)
          return new Token(token, data.getType)
        }
      }
      throw new IllegalStateException("Could not parse: " + str)
    }
  }

  def hasNextToken: Boolean = !str.isEmpty
}

Using this tokenizer we can now step through sections of code to extract data.

Here is some test code to show it in action.


// TokenizerTest.scala
object TokenizerTest {
  def main(args: Array[String]): Unit = {
    val tokenizer = new Tokenizer("x = 15")
    val varToken = tokenizer.nextToken
    println("Variable Token - Value: " + varToken.token + " Type: " + varToken.tokenType)
    val assignmentToken = tokenizer.nextToken
    println("Assignment Token - Value: " + assignmentToken.token + " Type: " + assignmentToken.tokenType)
    val intToken = tokenizer.nextToken
    println("Integer Token - Value: " + intToken.token + " Type: " + intToken.tokenType)
  }
}

Output:


Variable Token - Value: x Type: IDENTIFIER
Assignment Token - Value: = Type: TOKEN
Integer Token - Value: 15 Type: INTEGER_LITERAL
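
Because the tokenizer also exposes hasNextToken, the same thing can be driven in a loop instead of calling nextToken by hand. A minimal sketch (the input string is just an illustration):

// TokenizerLoopTest.scala (hypothetical extra test)
object TokenizerLoopTest {
  def main(args: Array[String]): Unit = {
    val tokenizer = new Tokenizer("println('Hello World')")
    // Keep pulling tokens until the input string is exhausted
    while (tokenizer.hasNextToken) {
      val t = tokenizer.nextToken
      println(t.token + " -> " + t.tokenType)
    }
  }
}

This prints each token with its type, e.g. println -> IDENTIFIER, ( -> TOKEN, and so on.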

Source Files

The complete files can be found in the tokenizer directory of my project:

https://github.com/Michael2109/cobalt


6 thoughts on “Writing a JVM compiler – Tokenization”

  1. Hi, nice tutorial, I always wanted to learn how compilers work.
    I have a doubt, in the output of TokenizerTest, the second line says the Token Type is EQUAL_TO, but I don’t see it declared in TokenType.scala, where does EQUAL_TO come from?
    Thanks in advance.


    1. Basically when testing the code I’m running it in my main project that contains more tokens etc and didn’t notice it was different. I’ve updated the output now. Thank you for noticing!


  2. In Leaf I do structured tokenization. I do extract tokens, but I also follow a high-level tree/line structure to provide context sensitivity. The result is not a token stream, but a token tree.

    It’s an option if you ever hit troubles with the linear stream tokenization.
