Image

PA2 - Decaf Lexer

Goal

The goal of our semester-long project is to gain experience in compiler implementation by constructing a simple compiler. We will be implementing a compiler for a variant of the well-known Decaf language, which is a simple Java-like procedural programming language.

The goal of this part of the project is to gain experience specifying lexical analysis patterns by using a lexer/parser generator to construct the first pass of a Decaf compiler.

Preliminaries

At this point you must choose which language you wish to use for the semester-long compiler project. I recommend a well-known procedural language, such as C/C++, Java, Python, or Ruby.

Once you have chosen a language, you need to find a lexer/parser generator that will output code in the language of your choice. Some generators combine lexer and parser generation into a single process, while others leave the two as separate processes. The generator must be capable of producing a regular expression lexer and LR parser. Here are some suggestions:

Depending on your lexer/parser generator, you may not be able to achieve a clean break between the generated lexer and the generated parser. For the purposes of this assignment, this distinction should not matter because we are not yet trying to parse any language that requires a context-free grammar.

Description

Skim the entire Decaf language specification to become familiar with the language. I suggest trying to write a few simple programs in Decaf.

Thoroughly read the "Lexical Considerations" section and look over the grammar in the next section. For this assignment, you will need to construct a lexer that will scan a Decaf program and produce a stream of tokens. To prepare for this, you should build a comprehensive list of the tokens and token types that you will need to recognize. You will need to combine the reserved words list from page 1 with all of the symbols on pages 2 and 3, as well as token types for the various literals.

Finally, construct regular expressions to match these tokens and build a lexer specification file that can be used by your chosen lexer/parser generator to generate lexing code. You will probably also need to create some wrapper code to handle file I/O, token output, and error reporting.

Your program should accept a Decaf program (*.decaf) as a command-line parameter. It should scan the file for valid tokens, emitting each token on a separate line to standard output. Any error should be reported along with the line number where the error was detected. Error messages should be as descriptive as possible within the limitations of your chosen language and lexer/parser generator.

Minor Modification

We will use a few minor modifications to the language specification. Only one of these is relevant at this point:

  • We will be including "while" loops as well as "for" loops as a valid construct in our Decaf variant. This means that "while" should be a reserved word, included alongside the others in the list on page 1.

Sample Input

    12+4*2
    "test"
    int b=true;
    if (a >= 4)

Sample Output

    12
    +
    4
    *
    2
    "test"
    int
    b
    =
    true
    ;
    if
    (
    a
    >=
    4
    )

Submission

DUE: Friday, Feb 13, 2015, at 23:59 (11:59PM) EST

Submit a zip or tarball with your complete solution on Canvas. Your code must compile and work on the class server. Part of your grade will be based on automated testing and part will be based on style considerations. Please make sure that your project adheres to the specification exactly to avoid difficulties with the automated testing framework, and make sure that your code is clean and documented.