CS 470

Project 2 - MPI Application

Introduction

The primary purpose of this assignment is to gain experience with distributed programming using the MPI library.

You will develop an optimized, distributed version of the following serial program: mergesort.c Your parallel version should have the filename par_mergesort.c.

The program generates a random series of numbers, generates a histogram of the values, shifts them by a given offset, and then finally sorts them using merge sort. This assignment is similar to programming assignment 3.8 in your textbook.

Code Conversion

The serial code contains four main subroutines. For this assignment, you must convert all four subroutines to be distributed using data decomposition as described below:

randomize() - To ensure that results are comparable with the serial version, all numbers should be generated on process 0 and should be efficiently split and transmitted in equal chunks to all processes. In other words, process 0 should generate random numbers into a temporary buffer, and after transmission each process should have an equal-sized partition of the numbers in its nums buffer.
histogram() - Histograms (i.e., counts of numbers split into bins using the modulus operation) should be calculated in a distributed fashion (i.e., every process should calculate the histogram for the data that it received during randomize()) and then efficiently aggregated back to process 0 for the output.
left_shift() - The numbers should be rotated left (with wrap-around) in the global "array" by the offset given as the second command-line parameter. For instance, if the offset is "1" then each number should be shifted one place left in the array, and the first number will become the last number. For the distributed version, this will require exchanging values on the boundary between processes. If the offset is zero, this function is essentially a no-op (although you shouldn't need to hard-code this).
merge_sort() - You should implement a distributed merge sort using a fan pattern as shown in Figure 3.6 of the textbook. Use the built-in qsort function to sort the local numbers on each individual process before the merging process begins. Note that you may use the provided merge function if you wish, but you shouldn't need to use the serial merge_sort_helper function. You SHOULD NOT perform any other sort (qsort or otherwise) AFTER after the initial local sort on each process--this will be considered cheating and your submission will receive a zero. Each local merge should require only O(n) time so that the overall sort is O(n log n) assuming no parallelism. This function should NOT be recursive.

Your changes should largely be confined to the above subroutines--do not put any of the above functionality in main(). In fact, you should keep main() as clean as you can; do not put significant program logic there as it will interfere with your timing analysis. Use MPI collective operations where possible, but note that you will need point-to-point communication for at least one of the subroutines.

Performance Analysis

In your code documentation, you should include a discussion of your changes and the performance characteristics of your solution for each of the four main subroutines individually. This analysis will be a significant portion of your grade for this project. Include experimental results and a thorough discussion of the computation and your parallelization for each of the four main subroutines named above. Here are some example questions that can guide your discussion:

Does your solution scale well? Does it exhibit strong or weak scaling (or both)? How does the performance compare to the serial version? How do the answers to these questions change for the different subroutines, and what causes those variations?
What trends do you see as you vary the number of MPI processes, the random number count, and the shift offset? What do you think is the reason for these trends?
Which of the four main subroutines are more compute-bound and which are more communication-bound? Do any subroutines shift characteristics during execution? How do you know this?

You must include at least a paragraph and table for each function. You may combine the results into a larger table if you wish, but you should still have results from all four functions. Feel free to include URL links to your results (e.g., in a Google Sheet) rather than trying to format them as text. DO NOT submit these materials via any other means (e.g., email, Canvas message, Canvas comment, etc.)--such submissions will not be graded.

Notes

To verify correctness, the results (i.e, the histogram, the shifted values, and the sorted list) must match those of the serial version (for corresponding input sizes). Your parallel version should modify the debug output so that prints the entire distributed list (HINT: recall similar code from the MPI lab), and does so only from rank 0. Timing info and command-line argument error messages should also only be printed from rank 0. In summary, the output should EXACTLY match the original serial version.
We will make some simplifying assumptions about the program input. In particular, you may assume 1) that the number of MPI processes is a power of two, 2) that the number count n is divisible by the number of processes, and 3) that the shift offset is equal to or less than the number count n divided by the number of processes. Your solution should give correct results for a number count of up to at least 640 million.
You should use MPI_Wtime() instead of gettimeofday() to time your MPI version. Note that these mechanisms work very differently--there is no need to convert to seconds when using MPI_Wtime. To ensure accurate timings, you must make sure all tasks have reached the timing statements before actually taking the measurements.
You must use the cluster to do development and testing. I suggest writing a Slurm job script that runs your program using various task and number counts. Please be considerate of others as you run timing tests; in general you should reserve no more than four nodes (half of the cluster). Recall that Slurm refers to MPI processes as tasks and that cluster nodes can support a maximum of sixteen processes/tasks per node, so this means the maximum number of tasks should be 64.
For weak scaling, you may experience out-of-memory errors with 64 tasks on four nodes. If this happens, you can limit the number of processes per node by passing "--cpus-per-task=2" to srun or sbatch. It is ok to reserve up to eight nodes for one or two experiments at this scale, but make sure these tests are run individually and don't tie up the cluster for more than approximately five minutes (enforce using "-t 5:00").
Unfortunately, OpenMPI does not properly free some of its data structures, and so if we use Valgrind to check for memory leaks or other memory-related errors we will get false positives. See item 13 on this FAQ for more information. For this project, you will not lose points for memory leaks unless you run out of memory entirely.

Hints

Work deliberately and insert as many temporary debug printouts as necessary to understand exactly what's going on (just make sure you remove them before submitting!). Include rank IDs in your debug printouts.
Be careful about data types (e.g., ints vs. unsigned longs).
Don't overwork! For example, there's no need to move all histograms to rank 0 in order to calculate aggregations.
Don't over-use MPI_Barrier; most collectives have implicit barriers.
Make sure you have the communication patterns correct in the merge tree before actually trying to code the mergesort. In other words, implement rank ID output similar to the tree lab.
Use a small, non-zero shift count for most of your tests, then have a separate sequence of tests to evaluate performance as you scale up the shift offset while keeping other parameters fixed.

Grading Criteria

Your submission will be graded on a combination of correctness, functionality, elegance, style, and documentation. Here is a tentative grading rubric:

Grade	Description	Requirements
A	Exceptional	Optimal distributed implementation of all four subroutines. No calls to unsafe functions. Thorough, insightful discussion with scaling and characteristic analyses. Clean code with fully consistent style. Insightful documentation. All of the below.
B	Good	Correct distributed implementation of at least three subroutines. Reasonable, correct discussion of parallelization with performance analysis and experimental results. Consistent and reasonable style. No out-of-memory errors. No compiler warnings. Useful and consistent documentation. All of the below.
C	Satisfactory	Compiles without modifications. Correctly-parallelized output and timing code. Correct distributed implementation of at least two subroutines. Decent (if perhaps incomplete or partially-incorrect) discussion and analysis. Readable style. Minimal amount of documentation. All of the below.
D	Deficient	Some evidence of a good-faith attempt.
F	Unacceptable

Submission

You should copy mergesort.c into a new file called par_mergesort.c and modify it to implement the system described above. Your program must compile on compliant systems using the following Makefile:

default: mergesort par_mergesort

mergesort: mergesort.c
	gcc -g -O2 --std=c99 -Wall -o mergesort mergesort.c

par_mergesort: par_mergesort.c
	mpicc -g -O2 --std=c99 -Wall -o par_mergesort par_mergesort.c -lm

clean:
	rm -f mergesort par_mergesort

You should submit your modified program as par_mergesort.c on Canvas by the due date. You should not submit your Makefile or anything else.

All submitted code should be elegant, clean, consistent, and well-documented. The code I have provided uses 4-space indentation and the 1TBS style. If you choose to reformat the file to match your own style preferences, make sure you apply the changes throughout.

Peer Reviews

One of the goals of this course is to encourage good software development practices, especially when building a parallel or distributed software system. For this project, you will be assigned two other random students in the course. You must review their code and offer constructive feedback according to the given rubric.

After the submission deadline, I will assign you two other submissions to review on Canvas. I expect you to review two distinct submissions that are not your own; I will do my best to ensure that the assignments work out but sometimes Canvas does not cooperate. It is your responsibility to inform me immediately if you have been assigned yourself, your partner, or two other same-team partners to review.

Submit your peer review in plain text on the appropriate Canvas assignment according to the provided code review guidelines. I expect at least two paragraphs per review with detailed observations and an assigned overall numeric score. You will be graded on the quality, thoroughness, and professional tone of your review. The peer reviews are due one week after the original project due date.

Parallel and Distributed Systems - Spring 2022