Project 3: Mini-ELF diassembler

This project is based on a project originally written by Dr. Michael Kirkpatrick.

This project serves as an expansion of the Mini-ELF utility from the first and second projects, and the goal is to reinforce assembly code concepts and the standard components of executable files.

In this project, you will implement the "fetch" cycle of a von Neumann architecture by reading machine code instructions from memory and populating a C struct with the decoded instruction details. You will also write a disassembly routine to output the machine code as human-readable assembly code. Finally, for full credit you will also output some details about the data segments of an executable file.

For generic project instructions, refer to the project guide.

Here is the path to the starter tarball file on stu:

/cs/students/cs261/f16/src/p3-disas.tar.gz

The project distribution contains a compiled solution to projects 1 and 2. As for project 2, we recommend using your solutions instead if they are complete.

TESTS UPDATED 10/28: The test suite is subject to change. Make sure you use the most recent version. If you want to download just the most recent tests, I will keep the following location on stu up-to-date so you can always pull the most recent tests from here:

/cs/students/cs261/f16/src/p3-tests

Command-line parsing

As in project 2, you will expand your program to recognize two new command-line options: "-d" to disassemble code and "-D" to disassemble data.

Part 1: Y86 decoding

The first step of the standard von Neumann process involves fetching the bytes of an instruction from memory and decoding the instruction. To decode the instruction, you will examine the first byte of the instruction to determine what kind of instruction it is (we recommend using a switch statement for this). Depending on the instruction type, you may then need to read additional bytes to determine things like register numbers, immediate values, or memory addresses.

In your main() method, you'll need to step through executable segments, decoding all of the instructions. Note that Y86 uses a variable-length instruction encoding, so you will need to decode the current instruction in order to find out how many bytes forward to skip to get to the next instruction.

As an example, consider the simple.o file in tests/inputs. Here is the output of the previous project:

00000000  01 00 00 01 10 00 01 00  00 00 00 00 45 4c 46 00
Mini-ELF version 1
Entry point 0x100
There are 1 program headers, starting at offset 16 (0x10)
There is no symbol table present
There is no string table present
 Segment   Offset    VirtAddr  FileSize  Type      Flag
  00       0x0024    0x0100    0x0017    CODE      R X
Contents of memory from 0100 to 0117:
  0100  30 f0 68 24 00 00 00 00  00 00 30 f3 34 12 00 00
  0110  00 00 00 00 60 03 00

Examining the program headers, we see that there is a single 23-byte code segment that is loaded at address 0x100 in virtual memory. The first opcode is 0x30, which indicates an irmovq instruction. The second byte is 0xf0, which indicates that the destination register is %rax (register 0 in the low-order 4 bits and f in the high-order 4 bits). The next eight bytes contain the immediate value 0x2468 stored in little-endian format. The next instruction begins at address 0x10a and also has a 0x30 opcode. If we were to continue this process until the end of the segment, we would find the following instruction sequence (with associated assembly code):

  30 f0 68 24 00 00 00 00 00 00    irmovq 0x2468, %rax
  30 f3 34 12 00 00 00 00 00 00    irmovq 0x1234, %rbx
  60 03                            add %rax, %rbx
  00                               halt

Part 2: Y86 code disassembly

To disassemble Y86, you will need to write a routine that prints out decoded instructions in a human-readable form as shown in the following example:

$ ./y86 -d tests/inputs/simple.o
Disassembly of executable contents:
  0x100:                      | .pos 0x100 code
  0x100:                      | _start:
  0x100: 30f06824000000000000 |   irmovq 0x2468, %rax
  0x10a: 30f33412000000000000 |   irmovq 0x1234, %rbx
  0x114: 6003                 |   add %rax, %rbx
  0x116: 00                   |   halt

Each segment will start with a .pos line indicating the starting address of the segment. The _start: label must be provided for the instruction that is identified by the Mini-ELF header as the entry point. The remainder of the lines consist of the address of the instruction, the memory bytes for it, and then the disassembly of the instruction. If there are multiple code sections, you should print all of them with the appropriate .pos header and a trailing empty line.

Part 3: Data disassembly

Some Y86 object files have data segments as well as code segments. There are two cases for data segments: 1) read-write and 2) read-only. For the first case (read-write), the segment by convention consists of global variables that are all 8 bytes in size. To disassemble this data, simply print each variable in hex. Here is an example:

$ ./y86 -D tests/inputs/data.o
Disassembly of data contents:
  0x200:                      | .pos 0x200 data
  0x200: 8877665544332211     |   .quad 0x1122334455667788
  0x208: 7856341200000000     |   .quad 0x12345678
  0x210: 4200000000000000     |   .quad 0x42

For the second case (read-only), the segment by convention consists of a sequence of null-terminated ASCII-encoded strings (as in C). To disassemble this data, you will need to scan to find the trailing null characters that separate individual strings. Here are a couple of examples:

$ ./y86 -D tests/inputs/rodata.o
Disassembly of data contents:
  0x200:                      | .pos 0x200 rodata
  0x200: 68656c6c6f00         |   .string "hello"

$ ./y86 -D tests/inputs/rodata_long.o
Disassembly of data contents:
  0x200:                      | .pos 0x200 rodata
  0x200: 30313233343536373839 |   .string "0123456789"
  0x20a: 00                   |
  0x20b: 68656c6c6f20776f726c |   .string "hello world"
  0x215: 6400                 |
  0x217: 6162636465666768696a |   .string "abcdefghijklmnopqrstuvwxyz"
  0x221: 6b6c6d6e6f7071727374 |
  0x22b: 75767778797a00       |

As can be seen in the second example, the strings may be longer than ten characters. In this case, you should print the entire string on the first line, but wrap the hex digits to fit inside the ten-byte hex field.

Requirements

Here are the required functions that you must implement in p3-disas.c. We will use unit tests to exercise this portion of your submission.

  • bool parse_command_line_p3 (int argc, char **argv, bool *header, bool *segments, bool *membrief, bool *memfull, bool *disas_code, bool *disas_data, char **file);

    As in Projects 1 and 2, set the booleans pointed to by the parameters as appropriate based on the command-line parameters. The two new parameters disas_codeand disas_data correspond to the "-d" and "-D" flags, respectively.
  • y86_inst_t fetch (y86_t cpu, memory_t memory);

    Fetch the instruction and supporting operands from memory. The cpu parameter contains the PC value that is the address of the instruction. If there is no valid address at that location, the function should return an instruction with an INVALID type.
  • void disassemble (y86_inst_t inst);

    Disassemble a single Y86 instruction inst. This should ONLY print the assembly code (e.g., the mnemonic and operands), not all of the surrounding data (e.g., address, hex, etc.) described above. That should be done in disassemble_code.
  • void disassemble_code (memory_t memory, elf_phdr_t *phdr, elf_hdr_t *hdr);

    Disassemble an entire code segment. This function should repeatedly call fetch to retrieve instructions from the segment and print the information, using a call to disassemble for the actual assembly code output. Essentially, this function should implement a three-cycle von Neumann architecture where the stages are "fetch," "disassemble," and "update PC," and where the PC update is just an increment based on the current instruction size (i.e., no control flow changes). The function takes a pointer to a single program header struct (phdr) that describes the section to be disassembled as well as a pointer to the original ELF header struct (hdr).
  • void disassemble_data (memory_t memory, elf_phdr_t *phdr);

    Disassemble a single read-write data segment that consists of a sequence of .quad assembler directives. The segment is described by the phdr parameter.
  • void disassemble_rodata (memory_t memory, elf_phdr_t *phdr);

    Disassemble a single read-only data segment that consists of strings. The segment is described by the phdr parameter.

In addition, you must implement main() in main.c such that your program behaves as described above.

You are free (and encouraged!) to define multiple helper functions, as well. While you are free to call them anything you want, good coding practices recommend you provide meaningful names and clear purposes. Regardless, the functions above must execute correctly according to specification.

Error checking

Robust software proactively checks for potential errors and responds appropriately when an error occurs. Failure to build robust software leads to security breaches, lost sales, and other problems; this failure is not acceptable. Our grading procedures will try to break your code. The following list is a sample (not complete) of the types of errors we may test:

  • Passing NULL values in pointer parameters.
  • Passing a file that contains invalid instructions.
  • Passing a file with references to addresses that are outside Y86 virtual address space.
  • Passing a file with program headers that have invalid flags or types.
  • Passing invalid PC values (outside the Y86 virtual address space) to the fetch() function.
  • Passing invalid instructions to disassemble().

If an invalid opcode is detected during the disassembly routines, the program should print the error message "Invalid opcode: 0xYY" where YY is the value of the opcode. Disassembly of a segment should not continue past an invalid opcode. However, if there are more code segments at higher addresses, the program should continue to disassemble those.

The above list is not necessarily exhaustive, and you should think carefully about what sort of errors might occur so that you can prevent them or perform additional error checking as needed. In particular, we will also use valgrind to detect memory leaks. Failure to respond appropriately will result in grade reductions.

Hints

  • Start early. This project involves writing significantly more code than the two previous projects.
  • Work incrementally. As always, start small and gradually add more functionality. In addition, DO NOT FOCUS on fetch() at the expense of the disassembly methods; you will get no credit for a full fetch implementation if the disassembly portion is not also working. Instead, implement the first part of fetch() by determining what kind of instruction it is by examining the opcode. Start with the one-byte instructions and work your way up to the ten-byte instructions. Implement disassembly for the instructions as you implement fetch so that you can test them together. Leave the data disassembly for last.
  • Use switch statements and bitwise operators. This will make your fetch() and disassemble() functions much cleaner and easier to read.
  • Don't forget endianness. When printing the memory bytes for an instruction, the bytes have to appear in their memory order. However, when printing their interpretation in an instruction, you have to display them as the correct value.
  • Make sure you understand y86.h. This file contains a number of definitions and declarations that will make your work significantly easier if you use them properly.

To help you get started, here is the general framework (not complete!) for the three-stage von Neumann cycle that you must implement in disassemble_code:

    y86_t cpu;			// CPU struct to store "fake" PC
    y86_inst_t ins;		// struct to hold fetched instruction

    // start at beginning of the segment
    cpu.pc = phdr->p_vaddr;

    // iterate through the segment one instruction at a time
    while (cpu.pc < end_of_segment) {

        ins = fetch (cpu, memory);          // fetch instruction

        // print current address and raw bytes of instruction

        disassemble (ins);                  // print disassembly
        cpu.pc += ins.size;                 // update PC (next instruction)
    }

Grading

The following requirements are necessary to earn a grade of 'A' for this project:

  1. Correctly decode and disassemble all one-byte instructions
  2. Correctly decode and disassemble all two-byte instructions
  3. Correctly decode and disassemble all instructions
  4. Correctly decode and disassemble all data segments
  5. Accept all valid Y86 Mini-ELF files
  6. Reject all invalid Y86 Mini-ELF files

Completing steps 1-3 are required to earn a grade of 'B' while completing only steps 1-2 will yield a maximum grade of 'C'. You may receive a grade of 'D' for completing only step 1. Note that these are the maximum grades you can earn. Inadequate error checking, failure to adhere to the coding standards, or deviations from the submit procedures will result in grade reductions as described in the project guide.

Failure to submit code that compiles on stu.cs.jmu.edu will result in an automatic grade of 0. In particular, you SHOULD NOT modify the included header files; your code must compile without errors or warnings against the original headers.

Caution: No test suite is fully exhaustive, and the test suite distributed with this program is no exception. Be aware that we may test your code with new test cases created after the submission deadline. These test cases will not substantively change the project specification, but they may exercise your program more thoroughly than the current test suite does. You should treat the given test suite as providing your maximum possible base grade, and you should always anticipate how your program may fail to adhere to the project spec in ways that the initial test suite does not test.

Submission

Due: Fri, Oct 28 at 23:59:59 ET (midnight)

Please see the project guide for general project help and grading policies. Please refer to the coding standards for coding practice guidelines.