Decaf

P1: Decaf Lexer

Objective

The goal of our semester-long project is to gain experience in compiler implementation by constructing a simple compiler.

The goal of this part of the project is to gain experience specifying lexical analysis patterns by using regular expressions to implement the first pass of our Decaf compiler.

Introduction

The semester-long project for this course is to build a compiler for the Decaf language. In this project, you will implement a lexer for Decaf, which will be the first phase in our compiler. You will do this by creating regular expressions to describe various pieces of the Decaf language, and writing some C code to repeatedly apply these regular expressions to an input character stream.

Thoroughly read the "Lexical Considerations" section of the reference document and look over the grammar in the next section. For this assignment, you will need to construct a lexer that will scan a Decaf program and produce a queue of tokens that will serve as an input to the next phase of the compiler.

The starter files are available on stu at the following path:

/cs/students/cs432/f20/p1-lexer.tar.gz

Before you begin writing code, you should spend some time reading through all of the infrastructure files provided in the starter files. Our compiler this semester is a large project. Your solutions will need to interface with other modules, and you will need to read the documentation thoroughly to understand their interactions. There are also utility functions that will significantly reduce the difficulty of the projects if used properly.

As you study the code, you may find it useful to generate HTML or PDF references for the existing project files. To do this, run make docs and then see the resulting files in the "html" and "latex" subfolders. I have generated a copy of the HTML documentation for your convenience: P1 reference

If you did not take CS 261 at JMU or if it has been a while since you took it, you may wish to review the project guide from that course. Regardless, you should read the C for CS 432 reference guide about some of the features of C that we will be using that you may not have seen before.

If you have questions about the existing code, please ask them on the Piazza forum (see the link on the sidebar in Canvas).

Assignment

WARNING: You should only proceed with actual development once you are SURE you understand exactly what your task is and how your code should interact with the rest of the system. Make sure you read the "Lexical Considerations" section of the language reference carefully in addition to this description!

Implement lexical analysis by providing an implementation for the following function:

TokenQueue* lex(char* text);

This function takes as input a char* string containing the Decaf source code. The function should return a TokenQueue of scanned Token structures as described in the "Lexical Considerations" section of the Decaf language reference.

For this particular project, you should pay special attention to the following files:

  • src/main.c - The overall driver for the compiler. Right now it only implements the lexer phase of compilation; it will gradually expand over the semester as we implement more phases of the compiler. You may change this file if it is useful for testing, but your submission should work with the original version.
  • src/p1-lexer.c - The lexer implementation. Your implementation of lex will go here, and you may add helper functions here if you find them useful (but note that they will be inaccessible to other modules--you should NOT modify p1-lexer.h).
  • include/token.h - Describes the interface to Regex, Token, and TokenQueue structures and their associated functions/methods, which you should use to write your solution. YOU SHOULD NOT MODIFY THIS FILE. The implementations are in token.c, which you may want to skim as well (although you shouldn't change those either). The Regex typedef and associated methods are a loose wrapper around the C standard library implementation of POSIX regular expressions. You can run "man 7 regex" on stu for a reference.
  • include/common.h - Contains some constants, functions, and macros that are useful in multiple modules in the compiler. In particular, you should use Error_throw_printf for "throwing" front-end errors, and you may wish to look into how the DECL_LIST_TYPE and DEF_LIST_IMPL macros work together to ease the creation of append-only list structures (e.g., one singly-linked list implementation that can be re-used in a way similar to C++ templates). YOU SHOULD NOT MODIFY THIS FILE. The implementations are in common.c, which you may want to skim as well (although you shouldn't change those either).
  • tests - Testing framework folder. These files use the same general layout as described in the CS 261 project guide, so you may wish to review that document. UNLIKE IN CS 261, YOU WILL NOT BE PROVIDED ALL TEST CASES, only a few sanity test checks to help you estimate your progress in completing the project. You should significantly extend the provided test suite and convince yourself that your solution completely implements this phase of the compiler as described by the language reference and the description provided here.

The recommended solution for this project (as discussed in class) is to process the input by repeatedly extracting tokens from the beginning. Write one regex per token type (in general; there may be one or two exceptions). Prioritize your regular expressions and try to match each of them in order against the beginning (remember you can use the caret '^' to match the beginning of input). When you find a match, extract the matching text from the line into a token and continue until either no regex matches the next input (i.e., a lexing error) or the entire input is consumed.

CAUTION: DO NOT attempt to split the line on a particular delimiter (e.g., a space character)! This approach is not sufficiently powerful to aid in lexing Decaf code, and you will waste a lot of time building workarounds and handling edge cases that would be far simpler to express using regular expressions from the outset.

For this and all future projects, you must also make sure you check the inputs to public API functions (in this case, lex) to make sure they aren't null or otherwise invalid. If they are, you should report an error (in the front and middle end -- use Error_throw_printf) or just do nothing (in the back end). In other words, the compiler should never crash or segfault.

Finally, there may be cases that you encounter where the language specification and the above requirements do not fully describe the expected behavior. For such cases, feel free to ask on Piazza, but such situations should be rare and inconsequential in the big picture. In general, for such cases you should just make sure the output from your solution matches the output of the reference compiler. Run it with no arguments or "-h" to get a list of valid output flags. For this project, use the "--fdump-tokens" flag to print the output after the lexing phase (note that it will then finish the rest of compilation, but you can ignore any output from the later phases).

HINT: Use the provided rubric (see "Grading" below) as a guide to completing this project. First, complete all of the requirements to earn a "D", then the requirements to earn a "C", and so forth. Work incrementally and test thoroughly to ensure you have a comprehensive solution.

Sample Input

def int main()
{
    int a;
    a = 4 + 5;
    return a;
}

Sample Output

KEYWORD  [line 001]  def
KEYWORD  [line 001]  int
ID       [line 001]  main
SYMBOL   [line 001]  (
SYMBOL   [line 001]  )
SYMBOL   [line 002]  {
KEYWORD  [line 003]  int
ID       [line 003]  a
SYMBOL   [line 003]  ;
ID       [line 004]  a
SYMBOL   [line 004]  =
DECLIT   [line 004]  4
SYMBOL   [line 004]  +
DECLIT   [line 004]  5
SYMBOL   [line 004]  ;
KEYWORD  [line 005]  return
ID       [line 005]  a
SYMBOL   [line 005]  ;
SYMBOL   [line 006]  }

Submission

Please submit your p1-lexer.c file on stu AND to the appropriate assignment on Canvas by the due date indicated there.

To submit on stu, run the following command from your project folder:

/cs/students/cs432/f20/submit.sh p1

Code Reviews

One of the goals of this course is to encourage good software development practices, especially when building a large software system (such as a compiler). For this project, you will be assigned two other random students in the course. You must review their code and offer constructive feedback according to the given rubric.

Submit your review on Canvas by the Friday following this project's due date, as described in the appropriate Canvas assignment.

Grading

Here is the overall grading breakdown for this project:

Instructor grade90
Peer review8
Review response2
TOTAL100

Here is the rubric for the "Instructor grade" part of the overall grade:

Grade Description Requirements
A Exceptional
  • Properly ignore comments
  • Handle all edge cases correctly
  • No compiler warnings or memory leaks
  • All of the below
B Good
  • Lex string literals w/ escape codes
  • Lex all symbols (one- and two-character)
  • Lex keywords (differentiated from identifiers)
  • Report error for reserved words
  • All of the below
C Satisfactory
  • Ignore whitespace between tokens
  • Lex string literals w/o escape codes
  • Lex basic symbols w/o potential one- vs. two-character ambiguity
  • Lex hexadecimal literals
  • All of the below
D Deficient
  • Lex basic symbols: (, ), +, and *
  • Lex decimal integer constants
  • Lex identifiers
F Unacceptable
  • Some evidence of a good-faith attempt

The above rubric shows the base grade possible if your submission meets the criteria listed. Most of the items listed will be assessed via automated testing, consisting of test cases that are mostly NOT provided to you in advance (so write your own tests!).

I will also be examining your submission manually for acceptable style and documentation, as well as the use of any unsafe functions. If your submission is deficient in any of these categories you may earn a numerical deduction, and if it is particularly egregious it will earn a half- or full-letter grade deduction depending on the severity of the issue. If you are unsure regarding my standards in this regard, I would suggest reviewing the style guide and list of unsafe functions from CS 261.