Project 1: Building a shell

In this project, you will build a command-line shell that operates similarly to the bash shell.

Set up a repository on stu.cs.jmu.edu based on the instructions in the CS 361 Submission Procedures, using the submit directory p1-sh.git and the file name p1-sh.tar.gz.

Background information

At each implementation level, you will implement three different kinds of requirements:

Utility programs
These are stand-alone programs that are written and compiled independently of the shell. For example, consider the ls program that lists files in a directory or the gcc compiler. These programs are not part of the shell. Instead, the shell will try to locate the executable program file in a directory based on the $PATH environment variable. If the program is found, the shell uses fork() and execlp() (or better yet, posix_spawn()) to run the program in a separate process (a minimal sketch of this appears after these descriptions).
Shell built-ins
These are commands that look like distinct programs, but they actually are not. Instead, the shell recognizes a small number of commands and calls an internal function rather than creating a new process. For instance, consider the cd command that can be used to change to a different directory. Since the notion of "current working directory" only exists within the context of the shell, this command is a built-in that just changes an internal variable in the shell process.
Additional shell features
In addition to running programs, shells provide a number of features that create an interactive experience. For instance, when a process exits (i.e., the code does return 0; from main() or calls exit (0);), the user might wish to know this return value. As an example, if you try to run the ls program on a file, the return code 0 would indicate the file exists and the listing was successful, while 1 could indicate the file wasn't found or permission was denied. Other features include the ability to pipe output streams to other processes or using environment variables to customize the user's experience.
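
To make the first category concrete, the following minimal sketch (not part of the provided code) shows how a shell might run ./bin/ls in a separate process with posix_spawn() and report its return code; the hard-coded command line is an assumption for the example.

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int
main (void)
{
  char *args[] = { "./bin/ls", "data", NULL }; // hypothetical command line
  pid_t child = -1;
  // Launch ./bin/ls in a new child process, inheriting this environment.
  if (posix_spawn (&child, args[0], NULL, NULL, args, environ) != 0)
    {
      perror ("posix_spawn");
      return 1;
    }
  int status = 0;
  waitpid (child, &status, 0); // wait for the child to finish
  printf ("return code: %d\n", WEXITSTATUS (status)); // e.g., 0 on success
  return 0;
}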

Project directory structure

You will mainly be working with files in the p1-sh/src directory. These source code files will be compiled as object files into p1-sh/build, then linked into the dukesh executable.

In addition, you will be implementing the utility programs in the p1-sh/utils directory. There is a Makefile in that directory, so you can run make there to build these programs. The compiled executables will be in p1-sh/bin. (Note that running make in the main p1-sh directory will also compile these programs.)

Within the p1-sh/tests directory structure, there are symbolic links to p1-sh/bin and p1-sh/data. These links help to ensure that the testing infrastructure (which runs with p1-sh/tests as the current working directory) can access these files appropriately. That is, once you have implemented the ls utility program, running ./bin/ls from either p1-sh or p1-sh/tests will run the same program.

The utility programs do not have their own test cases provided. Instead, they are tested indirectly when the shell test cases try to run them.

Lastly, the p1-sh/tests/scripts directory contains a number of text files that are shell scripts. A shell script is a file that contains a list of commands, with each line corresponding to a single command that could be typed at the shell prompt. These shell scripts can be passed to dukesh using the -b flag (for "batch processing") to launch the shell, run a few programs, then shut down.


Implementation Requirements

Your first task is to build a minimally functional shell that can run external programs and built-in commands, as well as provide access to return codes. From there, you will incrementally add features to control environment variables, pipe data between processes, and redirect output to persistent files.


Phase 1: Shell loop (partial credit)

For this stage, you will need to implement the following features:

  • Interactive shell and batch processing:
    • Shell prompt - print the "$ " string (no trailing newline) and read a command from STDIN.
    • Shell scripts - if dukesh is run with -b, read the commands from the file instead of STDIN.
  • Built-ins:
    • echo - print the given input to the screen exactly as entered. Escape sequences and environment variables are ignored.
    • quit - exit the shell.

For both the prompt and scripts, you should use fgets() to read a line of input. The difference is that the script will use fopen() to open a file as a stream while the interactive prompt uses stdin as the stream.
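
Putting these pieces together, a minimal sketch of the read-and-execute loop might look like the following (the buffer size, the interactive flag, and the command dispatch are assumptions for illustration):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

void
shell_loop (FILE *source, bool interactive)
{
  char buffer[256]; // assumed maximum command length
  while (true)
    {
      if (interactive)
        {
          printf ("$ "); // prompt with no trailing newline
          fflush (stdout); // make sure the prompt appears before fgets()
        }
      if (fgets (buffer, sizeof (buffer), source) == NULL)
        break; // end of input (Ctrl-D or end of the script)
      buffer[strcspn (buffer, "\n")] = '\0'; // strip the trailing newline
      if (strcmp (buffer, "quit") == 0)
        break; // the quit built-in exits the shell
      // ... handle echo and (later) other commands here ...
    }
}

In interactive mode, the shell would call shell_loop (stdin, true); with -b, it would instead pass the stream returned by fopen() on the script file.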

This phase will be graded both automatically and manually. Be sure to test your code by running dukesh interactively (without a script); doing so will make it easier to test your code in the later phases.


Phase 2: Shell built-ins and processes (C requirements)

For this stage, you will need to implement the following features:

  • Built-ins:
    • cd - change the current working directory.
    • echo - print the given input to the screen. If the input contains the escape sequence "\n", then replace these two bytes with the newline character ('\n'). If the input contains '$', this is a reference to a return value or environment variable. See below.
    • pwd - print the current working directory to the screen.
    • which - print the location of the executable file that would run for a command. If the command is a built-in, print cmd: dukesh built-in command (assuming cmd is the command name). If the command begins with ./ (e.g., which ./bin/ls), check if it is executable. Otherwise, traverse through the $PATH for an executable file that matches the command (a sketch of this traversal appears after this list).
  • Utilities:
    • ls - list the files in a requested directory. If no directory is specified, use the current working directory. Add support for the flags as specified in the usage() function.
    • chmod - change the permissions on a file. User, group, and other permissions are passed as separate strings of the form "rwx" with - in place of a permission not granted. You will need to process these strings and create the appropriate mode_t value to pass to the chmod() C function (a sketch of this conversion appears at the end of this phase).
    • head - print the first several lines of a file. The default is to print 5 lines, but the -n flag can specify a different number. If no file is specified on the command line, read from standard input.
    • cut - tokenize each line of input text from a file and print out a specified field. The default is to split lines based on spaces, but this can be changed with the -d flag. The default is to print the first field, but the -f flag changes this. For example, -f 2 for the input "hello, world it's me" would print world. Using the flags -d , -f 1 for the same input would print hello. If no file is specified on the command line, read from standard input.
  • Other features:
    • Spawning processes - run utility programs in a separate process using either fork()/exec() or posix_spawn().
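
For the which built-in, one common approach is to copy the $PATH value (assuming it is set), split it on colons, and test each directory with access(); the following sketch (the helper name is hypothetical) illustrates the idea:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper for which: search each directory in $PATH for an
// executable file that matches cmd, printing the first match found.
void
search_path (const char *cmd)
{
  char *path = strdup (getenv ("PATH")); // copy, because strtok() modifies its input
  for (char *dir = strtok (path, ":"); dir != NULL; dir = strtok (NULL, ":"))
    {
      char candidate[1024];
      snprintf (candidate, sizeof (candidate), "%s/%s", dir, cmd);
      if (access (candidate, X_OK) == 0) // file exists and is executable
        {
          printf ("%s\n", candidate);
          break;
        }
    }
  free (path);
}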

For the echo command, at this point, you should print the remaining arguments with a single space between the words as shown below:

$ echo hello world
hello world
$ echo hello world    how        are you
hello world how are you
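
For the chmod utility, one way to build the mode_t value is to map each character of the three permission strings onto the corresponding S_I* constants from sys/stat.h; this sketch (the helper name is hypothetical, and each string is assumed to be exactly three characters) shows the idea:

#include <sys/stat.h>

// Hypothetical helper: convert "rwx"-style strings into a mode_t value.
// Each string uses '-' in place of a permission that is not granted.
mode_t
build_mode (const char *user, const char *group, const char *other)
{
  mode_t mode = 0;
  if (user[0] == 'r') mode |= S_IRUSR;
  if (user[1] == 'w') mode |= S_IWUSR;
  if (user[2] == 'x') mode |= S_IXUSR;
  if (group[0] == 'r') mode |= S_IRGRP;
  if (group[1] == 'w') mode |= S_IWGRP;
  if (group[2] == 'x') mode |= S_IXGRP;
  if (other[0] == 'r') mode |= S_IROTH;
  if (other[1] == 'w') mode |= S_IWOTH;
  if (other[2] == 'x') mode |= S_IXOTH;
  return mode;
}

// Example: build_mode ("rw-", "r--", "---") yields 0640, which could then
// be passed to chmod ().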

Phase 3: Environment variables (B requirements)

For this stage, you will need to implement the following features:

  • Built-ins:
    • export - set an environment variable. You may use the provided hash table to keep track of all environment variables. CAUTION: You will need to change the way that you execute programs from the earlier phases. You will need to use execve() or posix_spawn() so that you can pass environment variables to the new process. By default, you must always pass the $PATH. (You can use getenv() to get your real $PATH and pass it along.)
    • unset - unset an environment variable. If using the provided hash table, remove the key from the table.
    • echo - extend the implementation so that environment variables can be printed. All environment variable names must be wrapped in curly braces as shown below.
  • Utilities:
    • env - set an environment variable for the execution of one particular process. This program will create the envp array from the environment variables passed as arguments (written as VAR=VAL with no space) and pass that array as the environment for a new process based on the rest of the command line (see below). The envp array should contain the variables specified as arguments as well as any that had been set by the export command above.
    • repeat - print out the specified environment variables a given number of times each.
  • Other features:
    • Return codes - using echo $? will print the return code from the previous utility or built-in command.

For the echo command, $? indicates that we want to print the return code. As an example, consider the following sequence of commands (note that the $ at the beginning of a line is the prompt and the other lines are output):

$ ./bin/ls -sa data
0 empty.txt
6 FIRST.txt
5 .hidden.txt
9 pwd.txt
22 yat.txt
$ echo $?
0
$ ./bin/ls flubberdub
$ echo $?
1

Recall that each environment variable gets passed to a process along with its value as a single string. If you set the $ALPHA environment variable to have the value beta, then the string ALPHA=beta gets passed in the envp array. Alternatively, you can retrieve the variable in the new process by calling getenv ("ALPHA"), which will return the string "beta".
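
As a minimal sketch of this mechanism (the command line and variable values here are assumptions; in your shell, the envp array would be built from the hash table along with the real $PATH retrieved with getenv()):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

void
spawn_with_env (void)
{
  char *args[] = { "./bin/repeat", "1", "ALPHA", NULL };
  char *envp[] = { "ALPHA=beta", "PATH=/usr/bin:/bin", NULL }; // assumed values
  pid_t child = fork ();
  if (child == 0)
    {
      execve (args[0], args, envp); // replaces the child; returns only on failure
      perror ("execve");
      _exit (1);
    }
  waitpid (child, NULL, 0); // the shell waits for the child to finish
}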

The following lines illustrate the expected behavior for environment variables in this project. The first echo tries to look up the $VAR variable, which does not exist yet. Then the variable gets set with export and the second echo finds the value.

$ echo VAR=${VAR}
VAR=
$ export VAR=hello
$ echo VAR=${VAR}
VAR=hello
$ ./bin/repeat 2 VAR
VAR=hello
VAR=hello
$ ./bin/env ALPHA=beta ./bin/repeat 1 ALPHA 1 VAR
ALPHA=beta
VAR=hello

CAUTION: Your code must NOT depend on manually setting environment variables before running make test. Doing so means you are using bash's implementation of environment variables and not your own. If you want to make sure that your code works and is not inheriting this implementation, you should open a new terminal window and run make clean && make test.


Phase 4: Piping STDIN and STDOUT (A requirements)

  • Utilities:
    • cat - open a file and print out its contents.
  • Other features:
    • Piping commands together - use a pipe to redirect the STDOUT from one process to become the STDIN of the next.

Consider the following command line:

$ ./bin/ls data
empty.txt
FIRST.txt
pwd.txt
yat.txt

When this command line is executed, the shell will create a new process to run the ./bin/ls executable program. This program uses printf() to write its output to STDOUT. The shell (both the one you are implementing and the underlying Linux shell) is set up so that STDOUT refers to an internal buffer that is used to write text to the screen.

In this phase, you will change this behavior. Consider the following lines as examples.

$ ./bin/ls data | ./bin/head -n 1
empty.txt
$ ./bin/cat data/pwd.txt | ./bin/head -n 1
pwd

In each case, your shell is creating two new processes instead of just one. One process executes ./bin/ls while the other process executes ./bin/head. However, the added twist is that the STDOUT from ./bin/ls becomes the STDIN for ./bin/head. Importantly, this data does not go to the shell and it doesn't appear on the screen. Instead, the shell only receives (and displays) the output from the last process on each line (./bin/head).

The approach is conceptually straightforward: use dup2() to redirect the first process's STDOUT to the write end of a pipe and another dup2() to redirect the second process's STDIN to read from the pipe. In principle, this can be done with any number of processes forming a chain with pipes connecting their output/input.

Where this gets tricky (in a real shell) is when some of the commands in this chain are shell built-ins rather than separate processes. When this occurs, the shell must know to read from or write to the pipe directly rather than relying on dup2(). Doing so can be particularly nasty once you start encountering exceptional cases where some of the commands have errors or do not produce output.

You do not need to handle these complicated cases. In this project, both of the commands will be utility programs and you do not have to support pipes for built-in commands. As such, you will just need to set up the pipe and use dup2() as appropriate to redirect standard I/O in the child processes.
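
The following sketch (argument arrays assumed, error handling omitted) shows the basic shape for a two-command pipeline such as ./bin/ls data | ./bin/head -n 1:

#include <sys/wait.h>
#include <unistd.h>

// Minimal sketch of running cmd1 | cmd2 with fork(), pipe(), and dup2().
void
run_pipeline (char *cmd1[], char *cmd2[])
{
  int fds[2];
  pipe (fds); // fds[0] is the read end, fds[1] is the write end
  if (fork () == 0)
    {
      dup2 (fds[1], STDOUT_FILENO); // cmd1's STDOUT becomes the pipe
      close (fds[0]);
      close (fds[1]);
      execvp (cmd1[0], cmd1);
      _exit (1); // only reached if execvp() fails
    }
  if (fork () == 0)
    {
      dup2 (fds[0], STDIN_FILENO); // cmd2 reads its STDIN from the pipe
      close (fds[0]);
      close (fds[1]);
      execvp (cmd2[0], cmd2);
      _exit (1);
    }
  close (fds[0]); // the shell must close both ends so cmd2 sees EOF
  close (fds[1]);
  wait (NULL);
  wait (NULL);
}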


Tips

The following tips may be helpful for completing this project on time.

Using the hash table

To assist with the environment variable handling, a hash table implementation has been included in the distribution. For simplicity, the hash table only accepts strings as keys and values. Also, the hash table is a global variable, so you can access it from anywhere. You can consult src/hash.h for the variable types, but the following code demonstrates the core functionality:

hash_init (100); // store up to 100 items initially
hash_insert ("key", "10"); // create mapping hash[key] = 10
char *value = hash_find ("key"); // search for hash[key]
hash_remove ("key"); // remove hash[key] from the table
hash_destroy (); // remove all keys and free the resources

If you try to find a key that does not exist, hash_find() will return NULL. You need to detect this NULL return value so that you do not cause a segmentation fault; instead, print the empty string.
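
For example, the lookup in your echo implementation might be guarded like this (the variable name is illustrative):

char *value = hash_find (name); // name is the variable name being looked up
printf ("%s", value == NULL ? "" : value); // print the empty string if unset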

The hash table also provides a hash_dump() function that you can use to print all values (for debugging purposes). You can also loop through all of the keys as follows:

char **keys = hash_keys (); // get all keys as a NULL-terminated array
for (size_t i = 0; keys[i] != NULL; i++) // length is unknown, so go until the NULL key
  {
    char *key = keys[i]; // get the key
    char *value = hash_find (key); // entry should be found since this is a key
    printf ("hash[%s] = %s\n", key, value); // print the mapping
  }

Managing with a partner

At each stage of this project, there are several components that can be implemented independently. In particular, the utility programs can be developed without having a functioning shell, and the shell built-ins and features can be done without supporting processes. As such, there is a natural structure for dividing the work.

As a word of caution, you should agree to a completion schedule. The integration of these components is not trivial, particularly at the Phase 2 implementation level. As such, you should set your own intermediate deadlines for each level. If you plan on 10-14 days for each level, you will have enough time to complete at least the Phase 3 implementation.



© 2011-2025 Michael S. Kirkpatrick.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.