CS 470

Project 2 - MPI Application

Introduction

The primary purpose of this assignment is to gain experience with distributed programming using the MPI library.

Description

You will develop an optimized, distributed version of the following serial program: mergesort.c Your parallel version should have the filename par_mergesort.c.

The program generates a random series of numbers, generates a histogram of the values, shifts them by a given offset, and then finally sorts them using merge sort. This assignment is similar to programming assignment 3.8 in your textbook.

The serial code contains four main subroutines. For this assignment, you must convert all four subroutines to be distributed as described below:

randomize() - To ensure that results are comparable with the serial version, all numbers should be generated on process 0 and should be efficiently split and transmitted in equal chunks to all processes.
histogram() - Histograms should be calculated in a distributed fashion (i.e., every process should calculate the histogram for the data that it received during randomize()) and then efficiently aggregated back to process 0 for the output.
left_shift() - The numbers should be rotated left (with wrap-around) in the global "array" by the offset given as the second command-line parameter. For instance, if the offset is "1" then each number should be shifted one place left in the array, and the first number will become the last number. For the distributed version, this will require exchanging values on the boundary between processes. If the offset is zero, this function is essentially a no-op.
merge_sort() - You should implement a distributed merge sort using a fan pattern as shown in Figure 3.6 of the textbook. Use the built-in qsort function to sort the local numbers on each individual process before the merging process begins. Note that you may use the provided merge function if you wish, but you shouldn't need to use the serial merge_sort_helper function. You SHOULD NOT perform any other sort (qsort or otherwise) AFTER after the initial local sort on each process--this will be considered cheating and your submission will receive a zero. Each local merge should require only O(n) time so that the overall sort is O(n log n).

Your changes should largely be confined to the above subroutines. Keep main() as clean as you can; do not put significant program logic there as it will interfere with your timing analysis. Use MPI collective operations where possible, but note that you will need point-to-point communication for at least one of the subroutines.

In your code documentation, you should include a discussion of your changes and the performance characteristics of your solution for each of the four main subroutines. Include experimental results and a thorough discussion of the computation and your parallelization for each of the four main subroutines named above. Here are some example questions that can guide your discussion:

Does your solution scale well? Does it exhibit strong or weak scaling (or both)? How does the performance compare to the serial version? How do the answers to these questions change for the different subroutines, and what causes those variations?
What trends do you see as you vary the number of MPI processes, the random number count, and the shift offset? What do you think is the reason for these trends?
Which of the four main subroutines are more compute-bound and which are more communication-bound? Do any subroutines shift characteristics during execution?

To verify correctness, the results (i.e, the histogram, the shifted values, and the sorted list) must match those of the serial version (for corresponding input sizes). The histogram is already printed; you can enable printing of the sorted list by enabling DEBUG (do not try this for large input sizes or you will generate very large outputs!). Your parallel version should only print these results from rank 0. Error messages should also only be printed from rank 0.

For your final submission, DEBUG must be disabled. You may assume 1) that the number of MPI processes is a power of two, 2) that the number count n is divisible by the number of processes, and 3) that the shift offset is equal to or less than the number count n divided by the number of processes.

You should use MPI_Wtime() instead of gettimeofday() to time your MPI version. To ensure accurate timings, you must make sure all tasks have reached the timing statements before actually taking the measurements.

You must use the cluster to do development and testing. However, do not run your tests on the login node directly, as this will obstruct others and produce inaccurate timings. I suggest writing a SLURM job script that runs your program using various task and number counts. Recall that SLURM refers to MPI processes as tasks and that cluster nodes can support a maximum of eight processes/tasks per node.

Grading Criteria

Your submission will be graded on a combination of correctness, functionality, elegance, style, and documentation. Here is a tentative grading rubric:

Grade	Description	Requirements
A	Exceptional	Optimal distributed implementation of all four subroutines. Thorough, insightful discussion with scaling and characteristic analyses. Clean code with fully consistent style. Insightful documentation. All of the below.
B	Good	Correct distributed implementation of all four subroutines. Reasonable discussion of parallelization with performance analysis and experimental results. Consistent and reasonable style. No memory leaks (check with MPICH, not OpenMPI). Useful and consistent documentation. All of the below.
C	Satisfactory	Compiles without modifications. Correct timing code. Correct distributed implementation of at least three subroutines. Readable style. Minimal amount of documentation. All of the below.
D	Deficient	Some evidence of a good-faith attempt.
F	Unacceptable

Submission

You should copy mergesort.c into a new file called par_mergesort.c and modify it to implement the system described above. Your program must compile on compliant systems using the following Makefile:

default: mergesort par_mergesort

mergesort: mergesort.c
	gcc -g -O2 --std=c99 -Wall -o mergesort mergesort.c

par_mergesort: par_mergesort.c
	mpicc -g -O2 --std=c99 -Wall -o par_mergesort par_mergesort.c

clean:
	rm -f mergesort par_mergesort

You should submit your modified program as par_mergesort.c on Canvas by the due date. You should not submit your Makefile or anything else.

All submitted code should be elegant, clean, consistent, and well-documented. The code I have provided uses 4-space indentation and the 1TBS style. If you choose to reformat the file to match your own style preferences, make sure you apply the changes throughout.

Peer Reviews

One of the goals of this course is to encourage good software development practices, especially when building a parallel or distributed software system. For this project, you will be assigned two other random students in the course. You must review their code and offer constructive feedback according to the given rubric.

After the submission deadline, I will assign you two other submissions to review on Canvas. I expect you to review two distinct submissions that are not your own; I will do my best to ensure that the assignments work out but sometimes Canvas does not cooperate. It is your responsibility to inform me immediately if you have been assigned yourself, your partner, or two other same-team partners to review.

To complete the peer review, submit a Canvas comment on the submission with your assessment of the submission according to the provided code review guidelines. I expect at least two paragraphs per review with detailed observations and an assigned overall numeric score. You will be graded on the quality, thoroughness, and professional tone of your review. The peer reviews are due one week after the original project due date.