PA2: Decaf Lexer
Objective
The goal of our semester-long project is to gain experience in compiler implementation by constructing a simple compiler.
The goal of this part of the project is to gain experience specifying lexical analysis patterns by using regular expressions to implement the first pass of our Decaf compiler.
Introduction
The semester-long project for this course is to build a compiler for the Decaf language. In this project, you will implement a lexer for Decaf, which will be the first phase in our compiler. You will do this by creating regular expressions to describe various pieces of the Decaf language, and writing some Java code to repeatedly apply these regular expressions to an input character stream.
Thoroughly read the "Lexical Considerations" section of the reference document and look over the grammar in the next section. For this assignment, you will need to construct a lexer that will scan a Decaf program and produce a stream of tokens.
HINT: Pay close attention to the description of string and integer literals, especially the escape characters in strings.
You should download the project starter files from the "Files" tab in Canvas. You may also be interested in the Eclipse setup tutorial.
Before you begin writing code, you should spend some time reading through
all of the infrastructure files provided in the starter files. Our compiler this
semester is a large project. Your solution will need to interface with several
other classes, and you will need to thoroughly read through those classes to
understand their interactions. Also, there are utility functions that you may
find useful (extract
and getLastMatch
.
As you study the code, you may find it useful to generate Javadoc references for the existing project files. Assuming you have Javadoc set up properly, you can do this by running the following sequence of commands in the project folder:
mkdir doc cd src/main/java javadoc -d ../../../doc edu.jmu.decaf
This should generate Javadoc files in the doc
folder; to read
them, just open the index.html
file in your web browser.
If you have questions about the existing code, please ask them on the Piazza forum (see the link on the sidebar in Canvas).
Assignment
WARNING: You should only proceed with actual development once you are SURE you understand exactly what your task is and how your code should interact with the rest of the system.
Implement the MyDecafLexer
class by providing an implementation
for the following function:
public Queue<Token> lex(BufferedReader input, String filename) throws IOException, InvalidTokenException
This function takes as input a BufferedReader
object that
should be used to read the Decaf source line-by-line, and a filename that
should be used to create source line information. The function should return a
Queue
of scanned Token
objects.
You should split up the input using Java regular expression matchers. I
recommend using the utility functions in DecafLexer
(extract
and getLastMatch
). You will likely wish to
write at least one regular expression for each Token
type
(identifier, integer literal, etc.) and test the input for each regular
expression. As you consume input you will need to continuously test against
these regular expressions to find new tokens.
HINT: Use a caret ('^') to match against the beginning of an input string.
HINT: The order in which you test the regular expressions is important!
HINT: Recall from the book that it is easier in hand-written lexers to write a single recognizers for both keywords and identifiers, using a hash-based lookup on the resulting text to determine whether it is a keyword or a regular identifier.
The Token
objects should contain source code information,
encoded in SourceInfo
objects. The source info should contain
filename, line, and column information for all tokens. For multi-character
tokens, the reported column should be the column of the first character in the
token.
For this particular project, you should pay special attention to the following classes:
DecafCompiler
- The overall driver for the compiler. Right now it only implements the lexer phase of compilation; it will gradually expand over the semester as we implement more phases of the compiler.DecafLexer
- The base lexer class. You will implement a subclass of this class, and you may find some of the methods provided in the parent class useful in your solution.Token
- Stores a single Decaf token. Your code should generate a stream of these objects, stored in aQueue
.SourceInfo
- Stores source file information (i.e., filename, line number, and column number). Your code should make sure this information is present in anyToken
object you return.InvalidTokenException
- Special exception that your code should throw when you detect an invalid token.
Sample Input
def int main() { int a; a = 4 + 5; return a; }
Sample Output
KEY "def" [pa02-sample.decaf:1:1] KEY "int" [pa02-sample.decaf:1:5] ID "main" [pa02-sample.decaf:1:9] SYM "(" [pa02-sample.decaf:1:13] SYM ")" [pa02-sample.decaf:1:14] SYM "{" [pa02-sample.decaf:2:1] KEY "int" [pa02-sample.decaf:3:2] ID "a" [pa02-sample.decaf:3:6] SYM ";" [pa02-sample.decaf:3:7] ID "a" [pa02-sample.decaf:4:2] SYM "=" [pa02-sample.decaf:4:4] DEC "4" [pa02-sample.decaf:4:6] SYM "+" [pa02-sample.decaf:4:8] DEC "5" [pa02-sample.decaf:4:10] SYM ";" [pa02-sample.decaf:4:11] KEY "return" [pa02-sample.decaf:5:2] ID "a" [pa02-sample.decaf:5:9] SYM ";" [pa02-sample.decaf:5:10] SYM "}" [pa02-sample.decaf:6:1]