C for CS 432
CS 432 is a systems elective, so we will use the C programming language for the major programming projects. C is a systems programming language with a tremendous history and influence, and it is still widely used in various areas of systems work (e.g., operating systems, networking, embedded devices, etc.). It is also the same language used in CS 261 and CS 361.
However, relative to CS 261 and CS 361, we will be working on a large and complex software system (a compiler) in CS 432, so we will be exercising some features of C and some techniques that you may not have encountered in other classes. This guide discusses those features and how we will be using them in CS 432. Note that we will specifically be using the C11 revision of the C language standard (primarily for the convenience of anonymous unions).
General resources for reviewing C:
- "C Programming Language" (classic "K&R book" available free through JMU libraries)
- Practical C Programming (online book)
- Understanding and Using C Pointers (online book)
- GNU C Reference Manual
- C Standard Library Reference
CS 261 resources (still relevant for CS 432):
- Project guide
- Coding standards
- C function quick reference (in particular, notice the list of unsafe functions, which are forbidden in CS 432 as well)
Object-Oriented Data Structures in C
C is not an "object-oriented" language (and in fact predates the rise of the paradigm's popularity), but it is certainly possible to use object-oriented software design in C. There is even an entire book about it, and C++ was originally implemented in C as "C with classes". We do not need the full range of functionality described and implemented in those works, but we will use a few object-oriented conventions to help tame the complexity of our compiler project.
In particular, we will use the term "object" to refer to a C struct that has member fields and some corresponding "methods," which are functions that take a pointer to an instance of the struct as their first argument and are considered to be associated with that struct type.
Here is an example from P1 (token.h, documentation omitted for brevity):
typedef struct TokenQueue { Token* head; Token* tail; } TokenQueue; void TokenQueue_add (TokenQueue* queue, Token* token); Token* TokenQueue_peek (TokenQueue* queue); Token* TokenQueue_remove (TokenQueue* queue); bool TokenQueue_is_empty (TokenQueue* queue); size_t TokenQueue_size (TokenQueue* queue); void TokenQueue_print (TokenQueue* queue, FILE* out);
As you can see, all of the TokenQueue_* functions take a pointer to a TokenQueue struct as their first parameter and are considered "methods" of a TokenQueue "object" (the first parameter for each of the functions, something Python programmers may find familiar). Together, the struct and functions form something like a "class" definition (something that was later formalized as a language feature in C++).
In addition, all objects have an associated "new" and "free" function that should be used for allocation and de-allocation:
TokenQueue* TokenQueue_new (); void TokenQueue_free (TokenQueue* queue);
As with standard malloc usage, any code that calls the "new" allocator associated with a particular object is also responsible for either calling the appropriate deallocator or passing "ownership" of that object to some other routine (e.g., the caller via a return pointer) or structure (e.g., you don't need to manually deallocate tokens in a TokenQueue because TokenQueue_free handles any leftover tokens).
Example:
TokenQueue* tqueue = TokenQueue_new(); // allocate new queue TokenQueue_add(tqueue, Token_new(KEY, "int", 0)); // allocate and add some tokens TokenQueue_add(tqueue, Token_new(ID, "foo", 0)); Token* token = TokenQueue_remove(tqueue); // removes "int" token Token_free(token); // (we're now responsible token = NULL; // for deallocating it) TokenQueue_free(tqueue); // but no need to deallocate tqueue = NULL; // any remaining tokens in tqueue
As in CS 261, your submission will be checked for memory leaks; any leaks found may negatively impact your grade.
Polymorphism in C
C provides several mechanisms for polymorphism that we will use in our semester-long project.
Union Types
One such mechanism is C's union types, which allow multiple kinds of data to be stored in the same memory region. By itself this is pretty unsafe because it's often impossible to distinguish between those different kinds of data at runtime. However, with the addition of a "discriminator" or "tag" field, the union can be used (relatively) safely by taking care to check the tag before accessing the union.
The primary example of this in our compiler is the ASTNode struct introduced in P2 that contains an anonymous union for the different AST node types (most of which are omitted here for brevity):
typedef struct ProgramNode { struct NodeList* variables; struct NodeList* functions; } ProgramNode; typedef struct FuncDeclNode { char name[MAX_ID_LEN]; DecafType return_type; ParameterList* parameters; struct ASTNode* body; } FuncDeclNode; typedef struct ASTNode { NodeType type; int source_line; Attribute* attributes; struct ASTNode* next; union { struct ProgramNode program; struct VarDeclNode vardecl; struct FuncDeclNode funcdecl; struct BlockNode block; struct AssignmentNode assignment; struct ConditionalNode conditional; struct WhileLoopNode whileloop; struct ReturnNode funcreturn; struct BinaryOpNode binaryop; struct UnaryOpNode unaryop; struct LocationNode location; struct FuncCallNode funccall; struct LiteralNode literal; }; } ASTNode;
This essentially designates ASTNode as an abstract parent class with all of the structs in the union as subtypes or child classes. As mentioned before, you do need to take care to check the tag field before accessing the union (although note also that the union is "anonymous" and you can access its members as if they were members of the struct itself--this is a C11 feature):
ASTNode* find_main (ASTNode* root) { if (root.type == PROGRAM) { /* check all function declarations */ FOR_EACH (ASTNode*, node, root->program.functions) { if (token_str_eq(node->funcdecl.name, "main")) { return node; } } } else if (root.type == FUNCDECL) { /* check a single function declaration */ if (token_str_eq(root->funcdecl.name, "main")) { return root; } } return NULL; /* couldn't find any 'main' function declaration */ }
List Macros
You may have noticed the FOR_EACH loop in the above example. This is a C macro declared alongside two others (DECL_LIST_TYPE and DEF_LIST_IMPL) in common.h. Together, these macros provide a second mechanism for polymorphism by expanding to the declarations and definitions (respectively) of an append-only, iterable list of some struct pointer type (although the exact identity of the struct is unimportant). These macros use C's concatenation feature to build a new "list" struct and its associated methods. The only restriction is that the struct type must contain a "next" field that is a pointer to a struct of the same type (and this field must be otherwise unused by the struct's methods).
For example, there is such a field in the ASTNode struct above, so we do not need to explicitly declare and implement a NodeList to store nodes; the following two lines are sufficient (and are located in ast.h and ast.c, respectively):
DECL_LIST_TYPE(Node, struct ASTNode*) DEF_LIST_IMPL(Node, struct ASTNode*, ASTNode_free)
Those two lines expand to the full list definition, which will be identical to every other list definition aside from the parameterized information provided like the name and type. This allows us to generate all of the list handling code as needed in the C preprocessor before the C compiler runs. For example, here are the generated functions for the Node list declaration shown above:
NodeList* NodeList_new () Allocate and initialize a new, empty list. void NodeList_add (NodeList *list, struct ASTNode *item) Add an item to the end of a list. int NodeList_size (NodeList *list) Look up the size of a list. bool NodeList_is_empty (NodeList *list) Test a list to see if it is empty. void NodeList_free (NodeList *list) Deallocate a list and any contained items.
If you have any experience with C++, you may recognize this pattern of as-needed polymophic code generation; in C++ the concept was formalized as templates.
Once a list type has been declared and defined, the FOR_EACH macro allows for clean, concise iteration as shown in the find_main example above.
Void* Pointers
The final form of polymorphism in C that we will use involves the use of void* pointers to store references to any kind of data using a single pointer. This is inherently unsafe, however, so we must use great caution. One example of this can be seen in the attributes of an AST node, which are stored as key-value pairs in an Attribute struct:
typedef void (*AttributeValueDOTPrinter)(void*, FILE*); typedef void (*Destructor)(void*); typedef struct Attribute { const char* key; void* value; AttributeValueDOTPrinter dot_printer; Destructor dtor; struct Attribute* next; } Attribute;
The value is stored as a void* so its type is unknown in general. However, a value's type is known when it is stored, and that information should be used to set the value's destructor/deallocator and pretty-printing function (stored as function pointers). Some useful functions are provided (e.g., int_attr_print for an integral value or dummy_free for a value that doesn't actually need to be freed).
Assuming that a particular key is consistently associated with a particular value type (and assuming that the printing and deallocating function pointers are set correctly), it will be safe to do any manual casts necessary to store and retrieve the value, and the value will be printed and cleaned up appropriately.
Multiple Dispatch in C
C does not inherently provide any form of dynamic dispatch (choosing a particular instantation of a polymorphic function to be called at runtime based on the type of some data), but it is fairly common to implement dynamic dispatch using function pointers inside a struct. This is essentially equivalent to implementing a manual virtual method table. The book about object-oriented C uses this technique extensively for objects.
We will be more sparing with our use of dynamic dispatch; however, it is quite useful for facilitating AST traversals so we shall use the visitor pattern to implement multiple dynamic dispatch (double dispatch in particular--note that each visit method takes two parameters).
Visitors are defined in visitor.h:
typedef struct NodeVisitor { /** * @brief Visitor-specific state information */ void *data; /** * @brief Pointer to destructor function (used to deallocate the @c data member) */ Destructor dtor; /* * Traversal routines; each of these is called at the appropriate time as * the visitor traverses the AST. */ void (* previsit_default) (struct NodeVisitor* visitor, ASTNode* node); void (*postvisit_default) (struct NodeVisitor* visitor, ASTNode* node); void (* previsit_program) (struct NodeVisitor* visitor, ASTNode* node); void (*postvisit_program) (struct NodeVisitor* visitor, ASTNode* node); void (* previsit_vardecl) (struct NodeVisitor* visitor, ASTNode* node); void (*postvisit_vardecl) (struct NodeVisitor* visitor, ASTNode* node); void (* previsit_funcdecl) (struct NodeVisitor* visitor, ASTNode* node); void (*postvisit_funcdecl) (struct NodeVisitor* visitor, ASTNode* node); /* others omitted */ } NodeVisitor;
The details of how this pattern works will be covered as part of CS 432, but the essence is that you build a "visitor" by writing functions that handle visiting particular kinds of AST nodes, then create a NodeVisitor struct and set the function pointers appropriately. The visitor can then be applied to any particular AST using the NodeVisitor_traverse method, which implements a generic traversal, invoking the visit/handler functions at the appropriate times.
Exception Handling in C
We are going to assume errors are fatal in the front end of the compiler. However, we would still like to be able to clean up after an error regardless of where the error occurs (i.e., we don't ever want the compiler just to crash or segfault). The historic mechanism for accomplishing this in C is the setjmp library, which provides an interface for non-local jumps. First, we must wrap the code that might raise an error using a call to setjmp (in main.c, for example, with some code omitted for brevity):
char decaf_error_msg[MAX_ERROR_LEN]; jmp_buf decaf_error; int main(int argc, char** argv) { ... if (setjmp(decaf_error) == 0) { /* PROJECT 1: lexer */ tokens = lex(text); /* PROJECT 2: parser */ tree = parse(tokens); /* clean up tokens (no longer needed) */ TokenQueue_free(tokens); tokens = NULL; } else { /* handle fatal error: print message and clean up */ fprintf(stderr, "%s", decaf_error_msg); if (tokens != NULL) TokenQueue_free(tokens); if (tree != NULL) ASTNode_free(tree); exit(EXIT_FAILURE); } ... }
The setjmp call saves the environment at the call site and returns zero. Later, a call to longjmp will restore this environment and it will appear as though setjmp has returned again, this time with the return value given to longjmp (presumably a non-zero value to indicate that an error has occurred).
However, you don't need to call longjmp directly because I have provided a wrapper called Error_throw_printf (also in main.c) that wraps a call to vsnprintf (using C's variadic function mechanism) to assemble the error message before delegating to the longjmp library function to perform the non-local jump back to the setjmp call.
void Error_throw_printf (const char* format, ...) { /* delegate to vsnprintf for error message formatting */ va_list args; va_start(args, format); vsnprintf(decaf_error_msg, MAX_ERROR_LEN, format, args); va_end(args); /* jump to location saved by setjmp */ longjmp(decaf_error, 1); }
In C++ and Java, this kind of mechanism is now a language feature ("exceptions") with streamlined and improved functionality.