Project 3 - OpenMP Application

Introduction

The primary purpose of this assignment is to gain experience with multicore parallelism using the OpenMP library. The secondary purpose is to gain experience parallelizing dense matrix operations.

Description

You will develop an optimized, parallel version of the following serial program: gauss.c. Your parallel version should have the filename par_gauss.c.

The program generates a random system of linear equations in the form of a coefficient matrix (A) and a right-hand-side column vector (b). The size of the system is given as a command-line parameter to the program. It then solves the system using Gaussian elimination and backward substitution. The system is generated in such a way that the solution (x) will be a column vector with all values equal to 1.0, making it easy to check the answers. This assignment is similar to programming assignments 5.4 and 5.5 in your textbook.

The serial code contains several functions as well as a few preprocessor directives to toggle program features on/off. For this assignment, you must efficiently parallelize (using OpenMP) all functions that can benefit from parallelization. In particular, you should focus on rand_system, gaussian_elimination, and the two back_substitute functions.

RESTRICTIONS: You should not use Pthreads, semaphores, or any other concurrency/parallelization system. You may perform minor rewrites within functions in order to parallelize them more effectively, but you should avoid any rewrites that cut across multiple functions or that fundamentally change the nature of a function. Both the serially-compiled and 1-thread OpenMP versions should produce the exact same numeric results as the serial version. For higher thread counts, it is acceptable for the randomly-generated matrices to differ from the serial version, but the computed solutions should still be accurate. Finally, the behavior of the functions mentioned above should be independent of how they are called in main(); i.e., you should not have OpenMP parallel regions surrounding their call sites.

In your code documentation, you should include a discussion of your changes and why you made them. This assignment is more open-ended than the first two projects, but the amount of code you actually write could be rather small in the end. Thus, you must include documentation explaining your experiments and the reasoning behind your final submission. You should include some performance results and an argument as to why your version is the optimal way to parallelize the original code. IMPORTANT: "I tried it and it didn't run faster" is not on its own a sufficient explanation! If you were unable to parallelize a particular function, explain what you tried and discuss reasons why it may not have worked well. If possible, suggest ways to improve the code to increase parallelism.

For your final submission, REAL must be set to "double". You may leave either back-substitution function enabled; for maximum credit you must justify your choice.

Use the "-d" command-line switch to toggle debug output. This output will likely be unhelpful for matrices larger than 10x10. Use the "-t" command-line switch to generate the coefficient matrix already in upper-triangular form, removing the need to do Gaussian elimination. Enable the USE_COLUMN_BACKSUB definition to enable the alternative back substitution method (column-oriented vs. the standard row-oriented). Finally, change the REAL macro if you wish to experiment with using single precision instead of double precision.

You should use omp_get_wtime() instead of clock() to time your OpenMP version. If you wish, you may copy /shared/cs470/omp_timer.h into your project directory and use the macros provided for your timings. You must include the header as follows, and you must NOT modify the header in any way:

#include "omp_timer.h"

You should use compiler directives to disable all OpenMP code if the compiler does not support it. (HINT: check for "_OPENMP") I suggest using a Makefile (see "Submission" section below) that builds three different programs: 1) the original serial version, 2) the new parallel version, and 3) the new parallel version with OpenMP disabled.

Your program should respect the OMP_NUM_THREADS environment variable; there is no need to accept the number of threads as a command-line parameter.

You may use the cluster to do development and testing. However, please be considerate and do not run your tests on the login node directly. I suggest writing a test script that runs your program using various thread counts, then converting this script into a SLURM job script so that it can run on the compute nodes.

If you would like to test with some custom (non-randomized) matrices, you can pass a file to the program instead of a system size. I have supplied a function (read_system) that handles the file I/O. The files should contain a line at the top with a single integer n (the size of the system). Subsequent lines should have rows of the augmented matrix [A][b]. Keep in mind that the calculated "error" is irrelevant for non-randomized matrices because the system was not generated in a way that guarantees the answers will be all 1's.

For example, here is the file representation of a 3x3 system (the worked example from the Wikipedia article on Gaussian elimination):

3
 3.0  2.0 -1.0  1.0
 2.0 -2.0  4.0 -2.0
-1.0  0.5 -1.0  0.0

The solution to this system is [ 1.0 -2.0 -2.0 ].

HINT: Test with larger matrices! You won't begin to see true performance trends until you start testing on matrices larger than 500x500. I suggest testing with even larger matrices (2000x2000 or 5000x5000 or more). Remember that you can use the "-t" option to disable Gaussian elimination in order to test the other functions with larger matrices in a reasonable amount of time.

Grading Criteria

Your submission will be graded on a combination of correctness, functionality, elegance, style, and documentation. Here is a tentative grading rubric:

Grade Description Requirements
A Exceptional
  • Optimal performance for all significant subroutines.
  • Thorough and introspective discussion of optimality for all significant subroutines.
  • Clean code with fully consistent style.
  • Use of "default(none)" on all parallel for loops.
  • Insightful documentation.
  • All of the below.
B Good
  • Measurable performance improvement on at least two subroutines.
  • Thorough discussion of performance with experimental results.
  • Consistent and reasonable style.
  • No memory leaks.
  • Serially-compiled and 1-thread results match serial version exactly.
  • Useful and consistent documentation.
  • All of the below.
C Satisfactory
  • Compiles without modifications even without OpenMP enabled.
  • Correct numeric results in all tests.
  • Measurable performance improvement on at least one subroutine.
  • Discussion of your modifications and their impact on performance with experimental results.
  • Readable style.
  • Minimal amount of documentation.
  • All of the below.
D Deficient
  • Some evidence of a good-faith attempt.
F Unacceptable

Submission

You should copy gauss.c into a new file called par_gauss.c and modify it to implement the parallel version described above. Your modified program should use the OMP_NUM_THREADS environment variable for scaling. Your program must compile on the cluster using the following Makefile:

default: gauss par_gauss par_gauss_serial

gauss: gauss.c
	gcc -g -O2 --std=c99 -Wall -o gauss gauss.c

par_gauss: par_gauss.c
	gcc -g -O2 --std=c99 -fopenmp -Wall -o par_gauss par_gauss.c

par_gauss_serial: par_gauss.c
	gcc -g -O2 --std=c99 -Wno-unknown-pragmas -Wall -o par_gauss_serial par_gauss.c

clean:
	rm -f gauss par_gauss par_gauss_serial

You should submit your modified program as par_gauss.c on Canvas by the due date. You should not submit your Makefile or anything else.

All submitted code should be elegant, clean, consistent, and well-documented. The code I have provided uses 4-space indentation and the 1TBS style. If you choose to reformat the file to match your own style preferences, make sure you apply the changes throughout.

Peer Reviews

One of the goals of this course is to encourage good software development practices, especially when building a parallel or distributed software system. For this project, you will be assigned the submissions of two other randomly selected students in the course. You must review their code and offer constructive feedback according to the given rubric.

After the submission deadline, I will assign you two other submissions to review on Canvas. I expect you to review two distinct submissions that are not your own; I will do my best to ensure that the assignments work out but sometimes Canvas does not cooperate. It is your responsibility to inform me immediately if you have been assigned yourself, your partner, or two other same-team partners to review.

To complete the peer review, submit a Canvas comment on the submission with your assessment of the submission according to the provided code review guidelines. I expect at least two paragraphs per review with detailed observations and an assigned overall numeric score. You will be graded on the quality, thoroughness, and professional tone of your review. The peer reviews are due one week after the original project due date.