Decaf

P2: Decaf Lexer

Objective

The goal of our semester-long project is to gain experience in compiler implementation by constructing a simple compiler.

The goal of this part of the project is to gain experience specifying lexical analysis patterns by using regular expressions to implement the first pass of our Decaf compiler.

Introduction

The semester-long project for this course is to build a compiler for the Decaf language. In this project, you will implement a lexer for Decaf, which will be the first phase in our compiler. You will do this by creating regular expressions to describe various pieces of the Decaf language, and writing some Java code to repeatedly apply these regular expressions to an input character stream.

Thoroughly read the "Lexical Considerations" section of the reference document and look over the grammar in the next section. For this assignment, you will need to construct a lexer that will scan a Decaf program and produce a stream of tokens.

You should download the project starter files from the "Files" tab in Canvas. You may also be interested in the Eclipse setup tutorial (not tested yet this year!).

Before you begin writing code, you should spend some time reading through all of the infrastructure files provided in the starter files. Our compiler this semester is a large project. Your solution will need to interface with several other classes, and you will need to thoroughly read through those classes to understand their interactions. Also, there are utility functions that will significantly reduce the difficulty of this project if used properly.

As you study the code, you may find it useful to generate Javadoc references for the existing project files. To do this, run mvn site and then open (using a web browser) the index.html file that is created in the target/site/apidocs folder.

If you have questions about the existing code, please ask them on the Piazza forum (see the link on the sidebar in Canvas).

Assignment

WARNING: You should only proceed with actual development once you are SURE you understand exactly what your task is and how your code should interact with the rest of the system.

Implement the MyDecafLexer class by providing an implementation for the following function:

    public Queue<Token> lex(BufferedReader input, String filename)
            throws IOException, InvalidTokenException

This function takes as input a BufferedReader object that should be used to read in the Decaf source, and a filename that should be used to create source line information. The function should return a queue of scanned Token objects as described in the "Lexical Considerations" section of the Decaf language reference.

The Token objects should contain source code information, encoded in SourceInfo objects. The source info should contain filename and line number information for all tokens.

For this particular project, you should pay special attention to the following classes:

  • DecafCompiler - The overall driver for the compiler. Right now it only implements the lexer phase of compilation; it will gradually expand over the semester as we implement more phases of the compiler.
  • DecafLexer - The base lexer class. You will implement a subclass of this class, and you may find some of the methods provided in the parent class useful in your solution.
  • Token - Stores a single Decaf token. Your code should generate a stream of these objects, stored in a Queue.
  • SourceInfo - Stores source file information (i.e., filename and line number). Your code should make sure this information is present in any Token object you return.
  • InvalidTokenException - Special exception that your code should throw when you detect an invalid token.

The recommended solution for this project (as discussed in class) is to process the input one line at a time. Write one regex per token type in general (there may be one or two exceptions). Prioritize your regular expressions and try to match each of them in order against the beginning of the line. When you find a match, extract the matching text from the line and continue until either no regex matches the next input (i.e., an InvalidTokenException) or the entire input is consumed. Check out DecafLexer for a variety of useful functions that will do some of the above work for you!

CAUTION: Do NOT attempt to split the line on a particular delimiter (e.g., a space character)! This approach is not sufficiently powerful to aid in lexing Decaf code, and you will waste a lot of time building workarounds and handling edge cases that would be far simpler to express using regular expressions from the outset.

For this and all future projects, you must also make sure you check the inputs to public API functions (in this case, any public members of MyDecafLexer) to make sure they aren't null or otherwise invalid. If they are, you should throw an IllegalArgumentException error.

Finally, there may be cases that you encounter where the language specification and the above requirements do not fully describe the expected behavior. For such cases, feel free to ask on Piazza, but such situations should be rare and inconsequential in the big picture. In general, for such cases you should just make sure the output from your solution matches the output of the reference compiler. Run it with no arguments or "-h" to get a list of valid output flags. For this project, use the "--fdump-tokens" flag to print the output after the lexing phase (note that it will then finish the rest of compilation, but you can ignore any output from the later phases).

Sample Input

def int main()
{
	int a;
	a = 4 + 5;
	return a;
}

Sample Output

KEY	"def"	[pa02-sample.decaf:1]
KEY	"int"	[pa02-sample.decaf:1]
ID 	"main"	[pa02-sample.decaf:1]
SYM	"("	[pa02-sample.decaf:1]
SYM	")"	[pa02-sample.decaf:1]
SYM	"{"	[pa02-sample.decaf:2]
KEY	"int"	[pa02-sample.decaf:3]
ID 	"a"	[pa02-sample.decaf:3]
SYM	";"	[pa02-sample.decaf:3]
ID 	"a"	[pa02-sample.decaf:4]
SYM	"="	[pa02-sample.decaf:4]
DEC	"4"	[pa02-sample.decaf:4]
SYM	"+"	[pa02-sample.decaf:4]
DEC	"5"	[pa02-sample.decaf:4]
SYM	";"	[pa02-sample.decaf:4]
KEY	"return"	[pa02-sample.decaf:5]
ID 	"a"	[pa02-sample.decaf:5]
SYM	";"	[pa02-sample.decaf:5]
SYM	"}"	[pa02-sample.decaf:6]

Submission

Please submit ONLY your MyDecafLexer.java file to the appropriate assignment on Canvas by the due date indicated there. Additionally, make sure the package edu.jmu.decaf; line is not modified.

Code Reviews

One of the goals of this course is to encourage good software development practices, especially when building a large software system (such as a compiler). For this project, you will be assigned two other random students in the course. You must review their code and offer constructive feedback according to the given rubric.

Submit your review on Canvas by midnight a week following the original project due date.

Grading

Here is the grading breakdown for this project:

Automated tests80
Instructor review10
Peer review10
TOTAL100