.. _Scaling:
.. |--| unicode:: U+2013 .. en dash
.. |---| unicode:: U+2014 .. em dash, trimming surrounding whitespace
:trim:
.. This file is part of the OpenCSF eTextbook project. It was
.. auto-generated by scripts from the OpenDSA eTextbook project.
.. See https://OpenCSF.org for more details. OpenCSF is distributed
.. under a Creative Commons Attribution-NonCommercial 4.0 International
.. License (see http://creativecommons.org/licenses/by-nc/4.0/),
.. Copyright (c) 2019-2021 by Michael S. Kirkpatrick. OpenDSA is
.. distributed under an MIT open source license, Copyright (c) 2012-2021
.. by the OpenDSA Project Contributors.
.. avmetadata::
:author: Michael S. Kirkpatrick
:requires:
:satisfies:
:topic:
Limits of Parallelism and Scaling
=================================
While hardware support is required to achieve parallel computation, it is not
sufficient on its own. Many problems or algorithms simply do not support
parallel computation. For example, consider merge sort as shown in `Code Listing
9.2 <#cl9-2>`_. Although it is true that the left and right halves of the array
could be sorted in parallel with two threads, the ``merge()`` routine cannot be
parallelized; a single thread must traverse both the left and right
halves to assemble the results. Consequently, there are limits to how much merge
sort can be improved with parallel execution.
Amdahl's Law and Strong Scaling
-------------------------------
:term:`Amdahl's law` provides a way to quantify the theoretical maximum
:term:`speedup in latency` (also called the :term:`speedup factor` or just
:term:`speedup`) that can occur with parallel execution. Specifically, Amdahl's
law describes the ratio of the original execution time to the improved
execution time, assuming perfect parallelism and no overhead penalty. That is,
Amdahl's law provides a theoretical limit to how much faster a program can run
if it is parallelized. If $p$ denotes the percent of a program that can be
executed in parallel and $N$ denotes the number of parallel execution units,
Amdahl's law states that the theoretical maximum speedup would be:
.. raw:: html
$\large S = \displaystyle\frac{1}{(1 - p) + \frac{p}{N}}$
This formula can be naturally derived by taking the ratio of the original
execution time and the improved execution time for the parallelized version. If
we use $T_{orig}$ to denote the original execution time, then $(1 - p) T_{orig}$
would signify the portion of the execution time that must run sequentially.
Assuming perfect parallelism as Amdahl's law does, the remainder of the time
would be divided across the $N$ processors. This gives us the derivation of Amdahl's law:
.. raw:: html
$\large S = \displaystyle\frac{T_{orig}}{T_{parallel}} =
\frac{T_{orig}}{(1 - p)T_{orig} + \frac{p}{N}T_{orig}} =
\frac{T_{orig}}{((1 - p) + \frac{p}{N}) T_{orig}}$
By cancelling out the $T_{orig}$ values from the numerator and denominator, we
are left with the formulation of Amdahl's law above. A variant of Amdahl's law uses
$f$ to denote the portion that must be run sequentially; that is, $f = 1 - p$.
This leads to another derivation of Amdahl's law. The result is identical
to the original formulation, but the calculation might be easier.
.. raw:: html
$\large S = \displaystyle\frac{1}{(1 - p) + \frac{p}{N}} =
\frac{N}{N(1 - p) + p} = \frac{N}{Nf + 1 - f} = \frac{N}{1 + (N - 1)f}$
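As a quick numerical check, the two formulations can be computed side by side. The sketch below is an illustration rather than one of the text's code listings, and the function names are our own:

```python
def amdahl(p, n):
    """Amdahl's law: p = parallelizable fraction, n = processing units."""
    return 1.0 / ((1.0 - p) + p / n)

def amdahl_seq(f, n):
    """Equivalent form using f = 1 - p, the sequential fraction."""
    return n / (1.0 + (n - 1) * f)

# Both formulations agree for any choice of p and N; for example,
# p = 0.75 and N = 4 give 1 / (0.25 + 0.1875) = 4 / 1.75 = 16/7.
s1 = amdahl(0.75, 4)
s2 = amdahl_seq(0.25, 4)
assert abs(s1 - s2) < 1e-9
```

Whichever form is used is purely a matter of convenience; they are algebraic rearrangements of the same ratio.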
.. topic:: Example
.. figure:: Images/CSF-Images-Example.png
:align: left
:width: 100%
:alt: Decorative example icon
As an example, consider a program that runs in 20 ms. The way the program is
written, 20% of it must be run sequentially; the remaining 80% will be run in
parallel on a quad-core. Per Amdahl's law, the maximum theoretical speedup of
this program would be:
.. raw:: html
$\large S = \displaystyle\frac{1}{0.2 + \frac{0.8}{4}} =
\frac{1}{0.2 + 0.2} = 2.5$
Using the alternative derivation, we would still get the same result:
.. raw:: html
$\large S = \displaystyle\frac{4}{1 + (4 - 1)(0.2)} =
\frac{4}{1 + 0.6} = \frac{4}{1.6} = 2.5$
In this case, we could also determine that the parallelized version would spend
4 ms in the sequential portion of the program. The remaining portion (16 ms in
the original) would be divided across the 4 cores, so the parallel portion would
take 4 ms. Consequently, parallelizing the program would improve the run time
from 20 ms to 8 ms, which is a speedup factor of 2.5. The advantage of Amdahl's
law, however, is that **we did not need to know the original run time**. As long
as we know what portion can be parallelized and how many processing units we
have, we can determine the speedup factor. Amdahl's law also emphasizes a key
point about parallelism: **Improving the percent of a program that can be
parallelized has more impact than increasing the amount of parallelism**.
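The arithmetic in this example is easy to reproduce. The short sketch below, with our own variable names, recomputes the 8 ms total and the 2.5 speedup from the run times:

```python
t_orig = 20.0   # original run time (ms)
p = 0.8         # fraction that can be parallelized
n = 4           # number of cores

t_seq = (1.0 - p) * t_orig   # 4 ms must remain sequential
t_par = (p * t_orig) / n     # 16 ms of work divided across 4 cores = 4 ms
t_new = t_seq + t_par        # 8 ms total

speedup = t_orig / t_new     # 20 / 8 = 2.5, matching Amdahl's law
assert abs(speedup - 2.5) < 1e-9
```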
.. _Amdahl:
.. topic:: Example
.. figure:: Images/CSF-Images-Example.png
:align: left
:width: 100%
:alt: Decorative example icon
To illustrate this point, let us consider two variants on the previous scenario.
In one variant, the program has been restructured so that 90% can be
parallelized rather than 80%. This now leads to a speedup factor of:
.. raw:: html
$\large S = \displaystyle\frac{1}{0.1 + \frac{0.9}{4}} =
\frac{1}{0.1 + 0.225} \approx 3.08$
In the second variant, we can still only parallelize 80% of the program, but we
have increased the number of cores from four to six. This variant produces a speedup of:
.. raw:: html
$\large S = \displaystyle\frac{1}{0.2 + \frac{0.8}{6}} =
\frac{1}{0.2 + 0.133} \approx 3.00$
In other words, increasing the percent of parallelized code by 12.5% had a
bigger improvement than increasing the number of cores by 50%.
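The two variants above can be compared directly in code. This sketch (with our own helper name) confirms that restructuring the program beats adding cores:

```python
def amdahl(p, n):
    """Amdahl's law speedup for parallel fraction p on n processing units."""
    return 1.0 / ((1.0 - p) + p / n)

more_parallel = amdahl(0.9, 4)   # restructure the code: ~3.08
more_cores = amdahl(0.8, 6)      # add two more cores: 3.00

# Raising p from 0.8 to 0.9 beats going from 4 cores to 6.
assert more_parallel > more_cores
```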
As the number of processing units continues to increase, the precise calculation
of Amdahl's law becomes less important. Specifically, we can determine a simpler
approximation of the speedup limit by considering the impact of an arbitrarily
large number of processing units; that is, we can derive a simplified estimate
by calculating the limit of $S$ as $N$ goes to infinity:
.. raw:: html
$\large \displaystyle\lim_{N \to \infty} \frac{1}{(1 - p) + \frac{p}{N}}
= \frac{1}{(1 - p) + 0} = \frac{1}{1 - p} = \frac{1}{f}$
Using this simplified estimate, we can determine that the upper bound on the
speedup for the program in :num:`Example #Amdahl` would be 5 (i.e., 1 / 0.2).
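This limiting behavior can also be observed numerically. Assuming $p = 0.8$ as in the example above, the speedup approaches but never reaches $1 / (1 - p) = 5$, no matter how many processing units are added:

```python
def amdahl(p, n):
    """Amdahl's law speedup for parallel fraction p on n processing units."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.8
bound = 1.0 / (1.0 - p)   # simplified estimate: 5.0

# More cores push the speedup toward the bound without exceeding it.
for n in (4, 100, 10_000, 1_000_000):
    assert amdahl(p, n) < bound

# With a million units, the speedup is within a tiny margin of 5.
assert bound - amdahl(p, 1_000_000) < 1e-4
```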
Gustafson's Law and Weak Scaling
--------------------------------
Although Amdahl's law provides an initial estimate to quantify the speedup from
parallel execution, it is important to note that it rests on unrealistic
assumptions. Amdahl's law assumes that the problem solved by the program
exhibits :term:`strong scaling`, meaning that the difficulty of the problem is
unaffected by the number of processors involved. In perfectly strong scaling,
there is no overhead penalty for creating more threads or using more processors.
A program run on a system with 100 processors will run in 1/100\ :superscript:`th`
of the time that it would on a single-processor system. In contrast, a more
realistic and common property is :term:`weak scaling`, which emphasizes
accomplishing more work rather than running in less time. In weak scaling, the
additional processors are used to tackle bigger and more complex problems, while
holding the expected run time to be the same.
.. _Gustafson:
.. figure:: Images/CSF-Images.9.6.png
:align: right
:width: 90%
:figwidth: 30%
:alt: Fence painting appears to show strong scaling initially, but only for
a few painters
Fence painting appears to show strong scaling initially, but only for a few
painters
To illustrate the difference between strong and weak scaling, consider a
painting business. :num:`Figure #Gustafson` illustrates the scenario
where the company has been hired to paint a fence that is 20 feet in length.
This job initially seems to exhibit strong scaling. If one painter could finish
painting the fence in one hour, then four painters could probably finish the job
in 15 minutes. However, as more painters are added, the scaling becomes weak. If
the company tries to send 20 painters for the same fence, they are unlikely to
finish the job in only three minutes. Rather, the fence would become too crowded
and painters would have to wait on each other. A better choice for the company
would be to send the additional 16 painters to paint other fences. If they work
in groups of four to paint multiple 20-foot fences, the company could paint five
fences in a 15-minute time period. Alternatively, the company could choose to
send just one painter per fence, completing 20 fences in a single hour. The
fences are not necessarily painted any faster than before, but the company is
getting more work accomplished in the same amount of time.
The usefulness of Amdahl's law is limited by its reliance on strong scaling and
unrealistic assumptions of parallel execution. Specifically, Amdahl's law
deliberately ignores any performance cost associated with creating and managing
threads, as well as system-specific factors such as NUMA or processor workload.
It is also limited by its exclusive focus on parallelism; Amdahl's law
cannot be used to predict the impact of changing the layout of data within a NUMA architecture.
:term:`Gustafson's law` provides an alternative formulation for speedup that
addresses these limitations.
Similar to Amdahl's law, Gustafson's law uses $p$ to denote the percent of the
work that can benefit from an improvement of some sort. Unlike Amdahl's law,
this improvement is not tied solely to parallelism; the improvement could result
from a change in how the OS manages threads, moving data around within a
NUMA architecture, increasing the parallelism, or any other such change. The
amount of the improvement [#f49]_ is denoted as $s$. Gustafson's law then states
that the maximum theoretical speedup of deploying the improvement is:
.. raw:: html
$\large S = 1 - p + sp$
.. topic:: Example
.. figure:: Images/CSF-Images-Example.png
:align: left
:width: 100%
:alt: Decorative example icon
As an example, consider a program that can be partially improved with parallel
execution. Let us assume that 20% of the program cannot be improved and some
initial empirical results suggest that the parallel execution portion runs in
1/5\ :superscript:`th` of the time that it takes sequentially (i.e., an
improvement factor of 5). Note that this does not assume anything about how
many processors are used, so it can be based on more realistic measurements by
running some initial tests. In this case, the speedup would be:
.. raw:: html
$\large S = 0.2 + 5 \cdot 0.8 = 0.2 + 4.0 = 4.2$
It is key to note that this speedup has a different meaning than the speedup
described by Amdahl's law. This speedup factor does not mean that the program
runs 4.2 times as fast as the original, which is an assertion built on strong
scaling. Instead, the proper interpretation of speedup under Gustafson's law is
that this program can achieve 4.2 times as much work in the same
amount of time, which is based on weak scaling. If the original program could
process 10 MB of data in a minute, then the improved version could process 42 MB
**in the same amount of time**. With Gustafson's law, the emphasis is on the
:term:`throughput` (amount of work done) rather than a faster time.
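Gustafson's law is just as simple to compute. The sketch below, with our own function name, reproduces the 4.2 figure and its throughput interpretation:

```python
def gustafson(p, s):
    """Gustafson's law: p = fraction that benefits, s = improvement factor."""
    return 1.0 - p + s * p

speedup = gustafson(0.8, 5.0)   # 0.2 + 4.0 = 4.2
assert abs(speedup - 4.2) < 1e-9

# Weak-scaling reading: same time, more work. A program that processed
# 10 MB per minute can now process 42 MB per minute.
throughput = 10.0 * speedup
assert abs(throughput - 42.0) < 1e-9
```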
.. [#f49] To be fair, Gustafson's law can also be criticized for ignoring
complicating factors such as synchronization or communication overhead.
However, this objection is not as strong as it is for Amdahl's, as the
improvement factor $s$ can be based more empirically.
.. avembed:: Exercises/ParallelDistributed/ScalingSumm.html ka
:module: Scaling
:points: 1.0
:required: True
:exer_opts: JXOP-debug=true&JOP-lang=en&JXOP-code=java
:long_name: Scaling questions
:threshold: 5