
Introduction To Parallel Computing

Note: The code in this notebook will work with version 7.01 or higher of Mathematica. Most of the code will run in earlier versions, but configuring the kernels will be harder.

Why compute in parallel?


In the old days (before 2004) clock speed doubled every 18 months or so, and computer manufacturers pushed what is known as the megahertz myth: the faster the clock speed, the faster the computer. Of course this is not true. All sorts of things affect performance. The depth of the instruction pipeline, cache size, number of operations per clock cycle and the number of cores all affect performance. Nowadays clock speeds are no longer increasing, but most computers have more than one core per processor and many have more than one processor. This is being written on a machine with two dual-core 2.66 GHz Intel Xeon processors, so I have a total of four cores. The Ohio Supercomputer Center's new supercomputer Glenn has over 10,000 cores. Each core on Glenn is about the same speed as the cores on my machine, but I only have four. If I just want to browse the web, use Word, send email and maybe run a single Mathematica kernel, then I can let my operating system thread processes over my processors so that the load is balanced and I don't have to think about it. But if I want to use Glenn, a cluster, or even just run several Mathematica kernels, I have to do parallel computing.

What can we compute in parallel?


Unfortunately not everything can be computed in parallel. A classical example is simplifying a large algebraic expression; there is no known parallel algorithm for this. Other examples include recursive calculations: since each step depends on previous steps, there is no way to do the calculation in parallel. Thus we cannot speed up the calculation of the Fibonacci numbers by doing the calculation in parallel.
fibonacci[n_Integer] := fibonacci[n] = fibonacci[n - 1] + fibonacci[n - 2]
fibonacci[2] = 1; fibonacci[1] = 1;
fibonacci[200]

280 571 172 992 510 140 037 611 932 413 038 677 189 525

However many common operations can be done in parallel. Most linear algebra operations can be done in parallel, as can FFTs, most image processing manipulations, Monte Carlo calculations and the numerical solution of many partial differential equations. Another type of parallelism, called data parallelism, occurs when you have many large chunks of data to which you want to do the same thing. In this case you can just ship each chunk off to a different processor. Explorations in parameter space also lend themselves to parallel computation.
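As a minimal sketch of data parallelism (not from the original notebook; processChunk and the random data are stand-ins for a real analysis and real data), each chunk is shipped to whichever kernel is free:

LaunchKernels[];                               (* start the default set of local kernels *)
processChunk[chunk_] := Total[chunk^2];        (* stand-in for a more expensive per-chunk calculation *)
chunks = Partition[RandomReal[1, 10^6], 10^5]; (* ten chunks of 100,000 numbers each *)
DistributeDefinitions[processChunk];
ParallelMap[processChunk, chunks]              (* one result per chunk, computed on the parallel kernels *)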


How much do we gain?


If you run your code on n processors you will never see a speed-up by a factor of n, although in some cases you may come close. Several factors determine the speed-up, the most important being the fraction of code that can be run in parallel and the communication overhead. By fraction of code I mean, of course, the fraction of time your code spends doing operations that can be done in parallel. If f is the fraction of code that can be run in parallel and p the number of processors, then the potential speed-up is given by Amdahl's law (see Landau, Páez and Bordeianu [1] for a derivation) and is
S[p_, f_] = 1/(1 - f + f/p)

1/(f/p - f + 1)

As you can see from the Manipulate below, the results are not encouraging. If you can write 75% of your code in parallel and you run on 16 processors you only get a speed-up of just over a factor of three, and going to 64 processors buys you very little over 16.
Manipulate[
 Plot[S[n, f], {f, 0, 1},
  FrameLabel -> {"fraction of code run in parallel", "Speedup"},
  PlotLabel -> "Amdahl's Law", RotateLabel -> True, Frame -> True, PlotRange -> {1, 16}],
 {{n, 2, "Processors"}, 2, 16, 2, Appearance -> "Labeled"}, SaveDefinitions -> True]

[Plot: Amdahl's Law — Speedup versus fraction of code run in parallel, for the chosen number of processors]

So unless your code spends most of its time doing operations that can be run in parallel, you are out of luck. The curve is quite steep: if f = 0.9 and you run on 16 processors you can expect a speed-up of more than a factor of six. In practice the situation is worse than Amdahl's law predicts because we have neglected the communication overhead. The processors have to communicate with each other, and this takes time. How much time depends both on how much information needs to be exchanged and on what type of cluster, grid or machine you are running on. In the best case, a set of calculations or data is shipped off to the processors by the controlling processor and no communication is needed except to send the results back to the controlling processor. In cases where the communication overhead is high compared to the computation time per process, adding more processors can actually lead to a slow-down rather than a speed-up.
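As a quick check of the numbers quoted above, the function S defined earlier can be evaluated directly:

S[16, 0.75]
3.36842

S[16, 0.9]
6.4

S[64, 0.75]
3.8209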


Different types of "clusters and grids"

The terms cluster and grid are often used interchangeably, but technically clusters are tightly coupled machines or processors, while grids are loosely coupled, not necessarily homogeneous, and may be geographically dispersed. Grids may also be collections of clusters, as, for example, the TeraGrid.
Single machines

A multiprocessor single machine is the simplest example of a cluster. The machine I am writing this on has two dual-core Intel Xeon chips, so I have a total of four cores. That is not very many cores, but they are convenient and an excellent tool for debugging parallel code. They also have very low communication overhead since all the cores sit on the same motherboard. I currently have eight kernels configured (four on my machine and four on a machine in a computer lab) plus a master or controlling kernel. The controlling kernel has a KernelID of zero and the worker kernels have IDs of one through eight. I can check this by evaluating $KernelID on each of the kernels
ParallelEvaluate[$KernelID]
{1, 2, 3, 4, 5, 6, 7, 8}

or by looking at the parallel kernel status window which you will find under the Evaluation menu. Mine is shown in the figure below.
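Kernels can also be launched and closed programmatically rather than through the configuration dialog; a minimal sketch (not from the original notebook):

LaunchKernels[4]   (* start four kernels on the local machine *)
$KernelCount       (* number of parallel kernels currently running *)
(* CloseKernels[] would shut them all down again *)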

The communication overhead can be checked by asking a kernel for its KernelID. Checking a local kernel on my machine
t1 = AbsoluteTime[]; ParallelEvaluate[$KernelID, 6]; timeLocal = AbsoluteTime[] - t1
0.000982

and comparing the response time with a kernel on the machine in the computer lab


t1 = AbsoluteTime[]; ParallelEvaluate[$KernelID, 1]; timeLab = AbsoluteTime[] - t1
0.001465

shows that it takes almost 50% longer. The machine in the lab is on a different sub-net than mine, but the network between them is fast, with an average ping time to the machine (at the time of the test) of less than half a millisecond.

timeLab/timeLocal
1.492

Both parallel kernels take longer to respond than the same request made on the master kernel.

t1 = AbsoluteTime[]; $KernelID; timeMaster = AbsoluteTime[] - t1
0.000220

timeLocal/timeMaster
4.21

Labs

Computer labs are easy to turn into a grid using Mathematica's Lightweight Grid Services. The disadvantage of using a computer lab as a grid resource is that communication between machines is relatively slow, as we saw in the example above, since the machines are just connected by standard Ethernet. The machines may also not all be the same, leading to an inhomogeneous grid. Depending on the type of calculation you are doing, an inhomogeneous grid may or may not be a problem.
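Before committing to a run on a lab grid it can be useful to see what you actually have. A small check along these lines (not in the original) asks every kernel to report itself:

ParallelEvaluate[{$KernelID, $MachineName, $ProcessorCount, $Version}]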
Dedicated Clusters

Dedicated clusters range from small Beowulf clusters to the clusters at national supercomputer centers. Big clusters offer a host of advantages. They may have thousands, tens of thousands or even two hundred thousand cores (Blue Waters), very fast interconnects between the cores, abundant memory and large fast file systems. The disadvantage is that they are harder to use. You must apply for an allocation and learn to use the cluster's queue system. Jobs of any significance need to be submitted as batch jobs, and this makes debugging much harder. You should always try to debug on a small local grid where you can run interactively before submitting a batch job.
Code Performance

Predicting how much speed-up you will get by running on multiple cores and how this will scale with the number of cores is very difficult. It depends not only on the fraction of your code that can be run in parallel and on the latency between processors, but also on factors such as cache size, chip architecture and of course your algorithm. A better algorithm will almost always beat more cores! It is always a good idea to run some benchmarks for your problem before embarking on a large run.

Parallel Computation
The details of parallel computation depend, of course, on the problem, but roughly speaking problems fall into two classes: those that are embarrassingly parallel and those that are not. Embarrassingly parallel problems (now sometimes called perfectly parallel problems for PR reasons) are those where every processor does an independent calculation. Some examples will make the distinction clear.


Examples
Table or Do loop in Parallel

A simple and common case is where we have a nested Table command or Do loop which we want to split up over a number of processors. The command for this in Mathematica is ParallelTable. Let's look at some examples. Suppose we want to compute 16 million random numbers. We can do this on a single processor with
t1 = AbsoluteTime[]; data = Table[RandomReal[], {i, 1, 16 10^6}]; timeLocal = AbsoluteTime[] - t1
1.721980

and it only takes 1.7 seconds. Seeking to speed things up I launch four kernels on my machine and use ParallelTable.
ParallelEvaluate[$KernelID]
{1, 2, 3, 4}

The result, shown below, is a slow-down by a factor of more than 2.6.


t1 = AbsoluteTime[]; data = ParallelTable[RandomReal[], {j, 1, 16}, {i, 1, 10^6}]; timeParallel = AbsoluteTime[] - t1
4.582354

For parallel computation to be useful the computation time must dominate the time spent by the processors communicating with one another. In this example the computation is trivial, but each processor has to send two million numbers to the controlling processor. So let's try a case that is more computationally intensive. In this example we evaluate a set of integrals that arise in the calculation of the bound states for a particle partially confined to an irregularly shaped region. Each integral takes a second or so to compute, and an accurate calculation of the bound states requires over 20,000 such integrals. That makes this an ideal problem for grid computing: it is computationally intensive, the calculations are independent and very little information needs to be exchanged between the processors. Run on just one processor our sample calculation takes 283 seconds. You will not get exactly the same result each time because the integration method used is a Monte Carlo method.


t1 = AbsoluteTime[];
results = Table[
  NIntegrate[
   1/25 Sin[1/10 Pi i (x + 5)]^2 Sin[1/10 Pi j (y + 5)]^2
     (2.5 Boole[x^2 + Sin[x y] + y^2/4 > 1] + 1/2 (x^2 + y^2)),
   {x, -10/2, 10/2}, {y, -10/2, 10/2},
   PrecisionGoal -> 3, Method -> "AdaptiveMonteCarlo"],
  {i, 1, 8}, {j, 1, 8}]
timing1 = AbsoluteTime[] - t1

(output: an 8 × 8 matrix of integral values, ranging from about 5.1 to 10.6)

282.964925

However, if we run on eight processors (four local and four on a machine in our computer lab) it takes only 50 seconds, a speed-up of more than a factor of five and a half.
t8 = AbsoluteTime[];
results = ParallelTable[
  NIntegrate[
   1/25 Sin[1/10 Pi i (x + 5)]^2 Sin[1/10 Pi j (y + 5)]^2
     (2.5 Boole[x^2 + Sin[x y] + y^2/4 > 1] + 1/2 (x^2 + y^2)),
   {x, -10/2, 10/2}, {y, -10/2, 10/2},
   PrecisionGoal -> 3, Method -> "AdaptiveMonteCarlo"],
  {i, 1, 8}, {j, 1, 8}]
timing8 = AbsoluteTime[] - t8

(output: an 8 × 8 matrix of integral values, the same as before up to Monte Carlo noise)

50.332087

Shown below is a picture showing the activity on the parallel kernels.


Now suppose we want to run on 16 processors. I have added a couple of additional machines to my cluster, as shown by the list of kernel IDs.
ParallelEvaluate[$KernelID]
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}

We cannot simply use ParallelTable anymore, since that will distribute the computations over only as many processors as the length of the outer list, which in this case is eight. Instead I will generate a list of all the indices that I want to evaluate the integral over
data = Table[{i, j}, {i, 1, 8}, {j, 1, 8}]

(output: the 8 × 8 array of index pairs {i, j} for i and j from 1 to 8)

and define a function that takes an ordered pair and returns the numerical integral associated with that ordered pair.
f[{i_, j_}] := NIntegrate[
  1/25 Sin[1/10 Pi i (x + 5)]^2 Sin[1/10 Pi j (y + 5)]^2
    (2.5 Boole[x^2 + Sin[x y] + y^2/4 > 1] + 1/2 (x^2 + y^2)),
  {x, -10/2, 10/2}, {y, -10/2, 10/2},
  PrecisionGoal -> 3, Method -> "AdaptiveMonteCarlo"]

The definition of the function needs to be known on all the kernels and this is done using the command DistributeDefinitions.
DistributeDefinitions[f]

We can now compose the function f with ParallelSubmit using the function Composition. This can then be mapped (see Map) across the flattened list of indices. This sends each computation off to the next available kernel. The function WaitAll waits until all of the computations are finished and then returns the result.


t16 = AbsoluteTime[];
result = WaitAll[Map[Composition[ParallelSubmit, f], Flatten[data, 1]]]
AbsoluteTime[] - t16

{5.14162, 7.38053, 7.69018, 7.75236, 7.8314, 7.8754, 7.90389, 7.89871, 7.58, 9.52288, 9.88889, 9.98128, 10.0533, 10.0661, 10.0922, 10.1204, 7.54541, 9.69595, 10.0003, 10.1177, 10.1727, 10.1919, 10.1995, 10.232, 7.89539, 9.91082, 10.2508, 10.368, 10.4188, 10.4496, 10.4692, 10.4724, 7.86373, 9.9584, 10.2597, 10.3817, 10.4212, 10.4814, 10.4807, 10.4938, 7.86593, 9.95216, 10.2981, 10.3935, 10.4444, 10.4819, 10.4896, 10.4989, 7.93739, 10.01, 10.3415, 10.4193, 10.5093, 10.5269, 10.5365, 10.5707, 7.91577, 10.0046, 10.2979, 10.4572, 10.48, 10.5223, 10.5284, 10.5554}

28.066349

You will notice that we get a speed-up of almost but not quite a factor of two over using eight processors. If we like we can recover the array by partitioning the list into groups of eight.
Partition[result, 8]

{{5.14162, 7.38053, 7.69018, 7.75236, 7.8314, 7.8754, 7.90389, 7.89871},
 {7.58, 9.52288, 9.88889, 9.98128, 10.0533, 10.0661, 10.0922, 10.1204},
 {7.54541, 9.69595, 10.0003, 10.1177, 10.1727, 10.1919, 10.1995, 10.232},
 {7.89539, 9.91082, 10.2508, 10.368, 10.4188, 10.4496, 10.4692, 10.4724},
 {7.86373, 9.9584, 10.2597, 10.3817, 10.4212, 10.4814, 10.4807, 10.4938},
 {7.86593, 9.95216, 10.2981, 10.3935, 10.4444, 10.4819, 10.4896, 10.4989},
 {7.93739, 10.01, 10.3415, 10.4193, 10.5093, 10.5269, 10.5365, 10.5707},
 {7.91577, 10.0046, 10.2979, 10.4572, 10.48, 10.5223, 10.5284, 10.5554}}

To get a better feeling for the scaling I will add a few more machines to my cluster so that I have 32 processors
ParallelEvaluate[$KernelID]
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32}

and re-run.
t32 = AbsoluteTime[];
result = WaitAll[Map[Composition[ParallelSubmit, f], Flatten[data, 1]]]
AbsoluteTime[] - t32

{5.14246, 7.37841, 7.67782, 7.77287, 7.82585, 7.86871, 7.8992, 7.90776, 7.57628, 9.53258, 9.89156, 9.96736, 10.0533, 10.0696, 10.103, 10.1093, 7.54151, 9.71273, 10.0207, 10.1048, 10.1647, 10.1924, 10.1953, 10.2227, 7.88558, 9.91974, 10.2577, 10.3665, 10.4452, 10.4514, 10.4682, 10.4799, 7.86765, 9.95272, 10.2691, 10.3854, 10.4519, 10.4594, 10.4849, 10.499, 7.87528, 9.97799, 10.2786, 10.3919, 10.4394, 10.4933, 10.4923, 10.5099, 7.95266, 10.0234, 10.3371, 10.4335, 10.5153, 10.5184, 10.5391, 10.5725, 7.93935, 9.99675, 10.3034, 10.4428, 10.4666, 10.5275, 10.5548, 10.5474}

18.198967

Plotting the speed-up versus the number of processors shows that although we get a substantial speed-up the scaling is not linear. That is, using n processors does not produce a speed-up by a factor of n. Since there is always some communication overhead you can never do as well as linear scaling.
[Plot: speed-up versus number of processors, from 1 to 32]
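The plot can be reproduced from the four wall times measured above; a minimal sketch (the variable names here are mine, not the notebook's):

procs = {1, 8, 16, 32};
times = {282.964925, 50.332087, 28.066349, 18.198967};  (* measured wall times from the runs above *)
speedup = First[times]/times;
ListPlot[Transpose[{procs, speedup}], PlotStyle -> {Red, PointSize[0.02]},
 AxesLabel -> {"number of processors", "Speedup"}]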


Monte Carlo in Parallel

Monte Carlo calculations form an important class of problems that are embarrassingly parallel. Suppose we want to calculate π, which is a classic Monte Carlo problem. We can calculate π by generating a list of ordered pairs of random numbers that lie in the unit square. We then check to see which of these lie within the unit circle and divide the number in the unit circle by the total number. This will give π/4. The geometry is shown below for a random set of 200 ordered pairs.
points = Table[{Random[]^2, Random[]^2}, {i, 1, 200}];
Show[Graphics[{Red, Circle[{0, 0}, 1]}],
 Graphics[{Blue, Opacity[.5], Rectangle[{0, 0}, {1, 1}]}],
 Graphics[Map[Point, points]]]

To check to see if an ordered pair lies inside the unit circle we compute the distance d of the point from the origin and then compute 1 - Floor[d]. Since 0 ≤ d ≤ √2, 1 - Floor[d] gives zero if d > 1 and one if d < 1. Summing the result then gives the number of points that lie inside the unit circle. The code below computes π on one processor using 40,000 points.
t1 = SessionTime[];
m = 10000; n = 4;
acceptpoint[j_] := Total[Table[1 - Floor[Sqrt[(Random[])^2 + (Random[])^2]], {i, 1, m}]]
hits = Table[acceptpoint[j], {j, 1, n}];
hits = Total[hits];
pi = hits/(n*m)*4.0
t2 = SessionTime[] - t1

3.1241
0.018890

We now repeat the calculation with the number of points running from 40 000 to 2 400 000,


data = Table[t1 = SessionTime[];
   n = 4; m = 10000 i;
   acceptpoint[j_] := Total[Table[1 - Floor[Sqrt[(Random[])^2 + (Random[])^2]], {i, 1, m}]];
   hits = Table[acceptpoint[j], {j, 1, n}];
   hits = Total[hits];
   pi = hits/(n*m)*4;
   t2 = SessionTime[] - t1, {i, 1, 60, 1}]

{0.017218, 0.033792, 0.050908, 0.067765, 0.083442, 0.101441, 0.116668, 0.134959, 0.149136, 0.167389, 0.182753, 0.198649, 0.215114, 0.239671, 0.250403, 0.267790, 0.281511, 0.301852, 0.314844, 0.330356, 0.347504, 0.366320, 0.381263, 0.397563, 0.413041, 0.433603, 0.446624, 0.468336, 0.477889, 0.504850, 0.513413, 0.533846, 0.549762, 0.566204, 0.578752, 0.597015, 0.612475, 0.634436, 0.647579, 0.673419, 0.686596, 0.703805, 0.710408, 0.730378, 0.752915, 0.774888, 0.786886, 0.812464, 0.823392, 0.837327, 0.848847, 0.870392, 0.882399, 0.907395, 0.915867, 0.933190, 0.947354, 0.970612, 0.990039, 1.002883}

size = Table[10000 n i, {i, 1, 60, 1}];

and plot the timings.


plot1 = ListPlot[Transpose[{size, data}], PlotStyle -> {Red, PointSize[0.02]},
  AxesLabel -> {"number of trials", "Wall Time"}]
[Plot: wall time versus number of trials for the single processor runs]

I now launch four local kernels on my machine and run the code in parallel by using ParallelTable instead of Table,
ParallelEvaluate[$KernelID]
{1, 2, 3, 4}

paralleldata = Table[t1 = SessionTime[];
   n = 4; m = 10000 i;
   acceptpoint[j_] := Total[Table[1 - Floor[Sqrt[(Random[])^2 + (Random[])^2]], {i, 1, m}]];
   DistributeDefinitions[n, m, acceptpoint];
   hits = ParallelTable[acceptpoint[j], {j, 1, n}];
   hits = Total[hits];
   pi = hits/(n*m)*4;
   t2 = SessionTime[] - t1, {i, 1, 60, 1}]

{0.019722, 0.020234, 0.046689, 0.039801, 0.043958, 0.040830, 0.076282, 0.061994, 0.075921, 0.059879, 0.085911, 0.107155, 0.098316, 0.097847, 0.101571, 0.107299, 0.102630, 0.109271, 0.108071, 0.117492, 0.114668, 0.122104, 0.128070, 0.158946, 0.133143, 0.151100, 0.152616, 0.157726, 0.157402, 0.160758, 0.163578, 0.191635, 0.185098, 0.178914, 0.180616, 0.187040, 0.192539, 0.208846, 0.198180, 0.222645, 0.227309, 0.210901, 0.219767, 0.234203, 0.229726, 0.235224, 0.241043, 0.252379, 0.275198, 0.253945, 0.290171, 0.263600, 0.265654, 0.274015, 0.303809, 0.293894, 0.277165, 0.288213, 0.319465, 0.309167}

and plot the timings.


plot2 = ListPlot[Transpose[{size, paralleldata}], PlotStyle -> {Blue, PointSize[0.02]},
  AxesLabel -> {"number of trials", "Wall Time"}]
[Plot: wall time versus number of trials for the four processor runs]

Combining the plots shows that we get a significant speed-up. The difference between this and the earlier Table example is that here each processor sends only a single number to the master kernel, not a large table, so there is very little communication overhead.
Show[plot1, plot2, ImageSize -> 72 6,
 PlotLabel -> "Red = Single Processor \n Blue = Four Processors"]

[Plot: wall time versus number of trials — red: single processor, blue: four processors]
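A quick way to quantify the gain (not in the original) is the element-wise ratio of the two timing lists; its mean is the average speed-up over all the problem sizes tested:

Mean[data/paralleldata]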

Simulated Annealing

Although it is based on Monte Carlo methods, simulated annealing is not embarrassingly parallel, since the state of the whole system is needed at each step. There are a number of algorithms for parallel simulated annealing and choosing the best one for your problem is non-trivial [2, 3].


Algorithm testing

Another application of parallel computing is algorithm testing. A classic case is global minimization, where there is no method that is guaranteed to find the global minimum in a finite amount of time. In Mathematica global minimization is done by the function NMinimize. NMinimize can use four different methods: differential evolution, Nelder-Mead, random search and simulated annealing. It is often very difficult to know which method is best for a given problem. With parallel computing we can test them all at once. A test problem given in the Mathematica documentation is to find the global minimum of the function

$f = \tfrac{1}{4}(x^2 + y^2) - \sin(10(x + y)) + e^{\sin(50 x)} + \sin(70 \sin x) + \sin(60\, e^y) + \sin(\sin(80 y))$.

Plotting shows that the function has a huge number of local minima.


Clear[f];
f = E^Sin[50 x] + Sin[60 E^y] + Sin[70 Sin[x]] + Sin[Sin[80 y]] - Sin[10 (x + y)] + 1/4 (x^2 + y^2);
Plot3D[f, {x, -1, 1}, {y, -1, 1}, PlotPoints -> 50, Mesh -> False]

There are four possible methods so I will launch four kernels


ParallelEvaluate[$KernelID]
{1, 2, 3, 4}

and define a list with the four methods as strings.


methods = {"DifferentialEvolution", "NelderMead", "RandomSearch", "SimulatedAnnealing"}
{DifferentialEvolution, NelderMead, RandomSearch, SimulatedAnnealing}

I need to distribute the definition of the function f .


DistributeDefinitions[f]

I can now use ParallelMap to try each method on a different kernel.


ParallelMap[{$KernelID, #, Timing[NMinimize[f, {x, y}, Method -> #]]} &, methods]

{{4, DifferentialEvolution, {0.187716, {-3.20814, {x -> -0.394526, y -> -0.0932002}}}},
 {3, NelderMead, {0.027803, {-2.07249, {x -> 0.750139, y -> 0.677648}}}},
 {2, RandomSearch, {0.046241, {-3.00383, {x -> 0.343687, y -> 0.44071}}}},
 {1, SimulatedAnnealing, {0.038024, {-1.97503, {x -> 1.34759, y -> 1.23296}}}}}

We see that differential evolution, run on kernel number 4, is by far the slowest but returns the best answer. If we let NMinimize use the default method, it uses Nelder-Mead which is fast but gives only the third best answer on this problem.
NMinimize[f, {x, y}]
{-2.07249, {x -> 0.750139, y -> 0.677648}}

The Diffusion Equation

The numerical solution of partial differential equations is one of the most important topics in computational physics. As an example consider the heat or diffusion equation

$\frac{\partial U}{\partial t} = k \frac{\partial^2 U}{\partial x^2}$.

One way to solve this equation is to put the spatial part on a grid and then step forward in time. This is easy to implement but is only conditionally stable: we must take time steps no larger than $\Delta x^2/(2k)$, where $\Delta x$ is the spatial step, in order to avoid instabilities. A better method, which is unconditionally stable, is the Crank-Nicolson method [4]. In the Crank-Nicolson method the spatial derivatives are discretized using a central difference and the time derivative is approximated with the trapezoidal rule. That means that
$\frac{\partial f(t, x)}{\partial t} = g(t, x)$

is discretized as

$\frac{f(t_{j+1}, x_i) - f(t_j, x_i)}{\Delta t} \approx \frac{1}{2}\left(g(t_{j+1}, x_i) + g(t_j, x_i)\right)$.

The advantage of this method is that it is unconditionally stable, so we can take much larger time steps. The disadvantage is that the $t_{j+1}$ terms appear on both sides of the equation, so we must solve a set of linear equations at each time step. Using the trapezoidal rule for the time derivative and central differences for the spatial derivatives converts the diffusion equation into

$\frac{u(t_{j+1}, x_i) - u(t_j, x_i)}{\Delta t} = \frac{k}{2 (\Delta x)^2}\left(u(t_{j+1}, x_{i+1}) - 2 u(t_{j+1}, x_i) + u(t_{j+1}, x_{i-1}) + u(t_j, x_{i+1}) - 2 u(t_j, x_i) + u(t_j, x_{i-1})\right)$.

It is customary to define $r = k \Delta t/(\Delta x)^2$, giving

$-r\,u(t_{j+1}, x_{i+1}) + (2 + 2r)\,u(t_{j+1}, x_i) - r\,u(t_{j+1}, x_{i-1}) = r\,u(t_j, x_{i+1}) + (2 - 2r)\,u(t_j, x_i) + r\,u(t_j, x_{i-1})$.

If we take r = 1 the equations simplify to

$-u(t_{j+1}, x_{i+1}) + 4\,u(t_{j+1}, x_i) - u(t_{j+1}, x_{i-1}) = u(t_j, x_{i+1}) + u(t_j, x_{i-1})$.

We could use larger values for r; this would allow larger time steps, with some loss of accuracy, but the method would still be stable. To implement the Crank-Nicolson method we start by defining the initial concentration. To begin with I will work on a 2000 point grid. So if the grid covers, say, a meter, then $\Delta x = 0.0005$ meters and thus, with r = 1, $k \Delta t = 2.5 \times 10^{-7}\ \mathrm{m}^2$. If $k = 1\ \mathrm{m}^2/\mathrm{s}$ each time step is $2.5 \times 10^{-7}$ s, which is very small. We could get larger time steps by using a larger value of r.


InitialConcentration = Table[N[Sin[2 Pi i/2000]]^2, {i, 0, 2000}];
plotInitial = ListPlot[InitialConcentration, Joined -> True]


[Plot: the initial concentration on the 2000 point grid]

We need to impose the boundary conditions, which we will take to be sinks, and initialize u[0, x]. We also need to define the variables u[j, i] that we will solve for at each time step j.
Clear[u, r]
leftend = 0.0; rightend = 0.0;
Table[u[0, i] = InitialConcentration[[i]], {i, 1, Length[InitialConcentration]}];
u[i_, 0] = leftend;
u[i_, Length[InitialConcentration] + 1] = rightend;
var[j_] := Table[u[j, i], {i, Length[InitialConcentration]}]

We can now define the equations for Crank-Nicolson, which we will do for arbitrary r. The function equations[j] generates the full set of equations for time step j.
eqn[j_, i_] := -u[j + 1, i - 1] + (2/r + 2) u[j + 1, i] - u[j + 1, i + 1] ==
  u[j, i - 1] + u[j, i + 1] + (2/r - 2) u[j, i]
equations[j_] := Table[eqn[j, i], {i, 1, Length[InitialConcentration]}]

The equations that we need to solve are of the form M x = b where M is tridiagonal. Taking advantage of Mathematica's SparseArray and PackedArray technologies we can quickly solve very large systems of tridiagonal equations. We first use CoefficientArrays to generate the matrix M and vector b. This can be slow for large arrays but we only need to do it once since M is the same at each time step. We will however have to recompute b with each time step.
r = 1; t1 = TimeUsed[];
{b, M} = -CoefficientArrays[equations[0], var[1]]
TimeUsed[] - t1

{SparseArray[<2001>, {2001}], SparseArray[<6001>, {2001, 2001}]}
0.073877

Notice that both M and b have been returned as SparseArray objects which take very little storage. We can check this by using ByteCount
ByteCount[M]
80 744

which shows that M takes up less than 81 kilobytes but the normal form of M takes just over 96 megabytes.


ByteCount[Normal[M]]
96 176 096

ByteCount[M]/ByteCount[Normal[M]] // N
0.000839543

To compute b at each step j we need u[j, i - 1] + u[j, i + 1] + (2/r - 2) u[j, i]. We can efficiently generate the u[j, i - 1] + u[j, i + 1] terms by list rotation (see RotateRight and RotateLeft) as the small symbolic example below shows.
testData = Table[U[i], {i, 1, 5}]
{U[1], U[2], U[3], U[4], U[5]}

RotateRight[testData] + RotateLeft[testData]
{U[2] + U[5], U[1] + U[3], U[2] + U[4], U[3] + U[5], U[1] + U[4]}

This gets the interior elements right but not the ends. To fix the ends we insert (see Insert) the left and right boundary conditions (note the factors of two) at the start and end of the list
RotateRight[Insert[Insert[testData, 2 leftend, 1], 2 rightend, -1]] +
 RotateLeft[Insert[Insert[testData, 2 leftend, 1], 2 rightend, -1]]
{U[1] + 0., U[2] + 0., U[1] + U[3], U[2] + U[4], U[3] + U[5], U[4] + 0., U[5] + 0.}

and then drop the first and last elements.


Take[RotateRight[Insert[Insert[testData, 2 leftend, 1], 2 rightend, -1]] +
  RotateLeft[Insert[Insert[testData, 2 leftend, 1], 2 rightend, -1]], {2, -2}]
{U[2] + 0., U[1] + U[3], U[2] + U[4], U[3] + U[5], U[4] + 0.}

The (2/r - 2) u[j, i] term is then just added to this result. Putting this together gives
Do[newU = LinearSolve[-M, b];
  b = Take[RotateRight[Insert[Insert[newU, 2 leftend, 1], 2 rightend, -1]] +
      RotateLeft[Insert[Insert[newU, 2 leftend, 1], 2 rightend, -1]], {2, -2}] +
    (2/r - 2) newU, {i, 1, 6000}] // Timing
{5.69219, Null}

Since LinearSolve uses very efficient methods for tridiagonal systems it takes only 5.7 seconds to do 6000 iterations.
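To see how much the sparse representation matters here, one can time a single solve against the dense form of the same system (a quick check, not in the original; it reuses the M and b defined above):

AbsoluteTiming[LinearSolve[-M, b];]
AbsoluteTiming[LinearSolve[-Normal[M], Normal[b]];]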


ListPlot[{InitialConcentration, newU},
 PlotLabel -> "Initial concentration in blue \n Concentration after 6000 steps in red",
 ImageSize -> 6 72]

[Plot: initial concentration in blue; concentration after 6000 steps in red]

If we change r we can take larger time steps, although we lose some accuracy. If, for example, we take r = 10 we need only 600 steps.
r = 10;
{b, M} = -CoefficientArrays[equations[0], var[1]];
Do[newULargeR = LinearSolve[-M, b];
  b = Take[RotateRight[Insert[Insert[newULargeR, 2 leftend, 1], 2 rightend, -1]] +
      RotateLeft[Insert[Insert[newULargeR, 2 leftend, 1], 2 rightend, -1]], {2, -2}] +
    (2/r - 2) newULargeR, {i, 1, 600}] // Timing
{0.485925, Null}

As you can see the change is small.


ListPlot[newU - newULargeR, PlotRange -> All]
[Plot: difference between the r = 1 and r = 10 solutions, of order 10^-9]



To solve the problem in parallel we break the spatial domain into pieces and give each processor one or more pieces. This problem is not embarrassingly parallel, since at the end of each time step the processors have to exchange information about the boundaries of their pieces. There are two general approaches to this: synchronous methods, in which each processor communicates the current values of its boundaries to its neighboring processors, and asynchronous methods, in which the boundary variables are allowed to lag by one time step. Synchronous methods are more accurate but slower. Asynchronous methods are faster and easier to program but make (hopefully small) errors at the boundaries. We will look at an asynchronous method.

An Asynchronous method

In order to develop the code we start with a small grid of forty points.
InitialConcentration = Table[N[Sin[2 Pi i/40]]^2, {i, 0, 40}];
plotInitial = ListPlot[InitialConcentration, Joined -> True]
[Plot: the initial concentration on the forty point grid]

The initialization is the same as for a single processor,


Clear[u, r, vectors, matrices]
leftend = 0.0; rightend = 0.0;
Table[u[0, i] = InitialConcentration[[i]], {i, 1, Length[InitialConcentration]}];
u[i_, 0] = leftend;
u[i_, Length[InitialConcentration] + 1] = rightend;
var[j_] := Table[u[j, i], {i, Length[InitialConcentration]}]

as is the equation generation.


eqn[j_, i_] := -u[j + 1, i - 1] + (2/r + 2) u[j + 1, i] - u[j + 1, i + 1] ==
  u[j, i - 1] + u[j, i + 1] + (2/r - 2) u[j, i]
equations[j_] := Table[eqn[j, i], {i, 1, Length[InitialConcentration]}]

We first set the number of processors (in this example four) and break the list into the same number of parts so that each part can be sent to a processor.
n = Length[ParallelEvaluate[$KernelID]]
4

width = Ceiling[Length[InitialConcentration]/n];
edges = Table[{i + 1, Min[i + width, Length[InitialConcentration]]},
   {i, 0, Length[InitialConcentration], width}];
segments[i_] := Transpose[{Map[Take[equations[i - 1], #] &, edges], Map[Take[var[i], #] &, edges]}]

We next map CoefficientArrays over the segments to generate the matrices and vectors that we will send to LinearSolve.

r = 10;
arrays = Map[CoefficientArrays[#[[1]], #[[2]]] &, segments[1]];
vectors = -Map[First, arrays];
matrices = Map[Last, arrays];

Checking shows that we do indeed have four different matrices and four different vectors
matrices
{SparseArray[<31>, {11, 11}], SparseArray[<31>, {11, 11}], SparseArray[<31>, {11, 11}], SparseArray[<22>, {8, 8}]}

vectors
{SparseArray[<11>, {11}], SparseArray[<11>, {11}], SparseArray[<11>, {11}], SparseArray[<8>, {8}]}

All the processors need to know the definitions of the matrices and vectors so we need to use DistributeDefinitions to achieve that.
DistributeDefinitions[matrices, vectors]

Each one of our vectors contains an undefined variable. In the first vector this is u[1, 12], which lies on the left edge of the second segment. You can check this by converting the first vector from a sparse array to a normal object.
vectors[[1]] // Normal
{0.0244717, 0.0514424, 0.0586944, 0.0699897, 0.0842227, 0.1, 0.115777, 0.13001, 0.141306, 0.148558, 1. u[1, 12] + 0.151057}

The first line of code in the next cell u[1,x_]=u[0,x] will set u[1,12] in vector one to u[0,12]. This is why the code is called asynchronous. This leads to an error but the error will be small if the time step is sufficiently small. The next line of code ships LinearSolve[matrices[[i]],vectors[[i]]] off to processor i (see MapThread).
u[1, x_] = u[0, x];
newU = WaitAll[MapThread[ParallelSubmit[LinearSolve[#1, #2]] &, {matrices, vectors}]];

Next, we must find the values of the vectors at the edges of each domain
edgeValues = Map[Drop[#, {2, -2}] &, newU];
edgeValues
{{0.0980425, 0.870331}, {0.87269, 0.167863}, {0.16165, 0.761212}, {0.765148, 0.10125}}

and impose the boundary conditions at the left-most and right-most edges (see ReplacePart).
boundaries = ReplacePart[edgeValues, {{1, 1} -> leftend, {n, 2} -> rightend}]
{{0., 0.870331}, {0.87269, 0.167863}, {0.16165, 0.761212}, {0.765148, 0.}}

We can bundle this all up in a Do loop. The only new wrinkle is that we have to use MapThread to form the new vectors, since we now have as many vectors as processors. We can test the speed and accuracy of our code with just a few iterations (in this case four).


t1 = AbsoluteTime[];
Do[u[i, x_] = u[i - 1, x];
  newU = WaitAll[MapThread[ParallelSubmit[LinearSolve[#1, #2]] &, {matrices, vectors}]];
  edgeValues = Map[Drop[#, {2, -2}] &, newU];
  boundaries = ReplacePart[edgeValues, {{1, 1} -> leftend, {n, 2} -> rightend}];
  vectors = MapThread[
     Take[RotateRight[Insert[Insert[#1, 2 #2[[1]], 1], 2 #2[[2]], -1]] +
        RotateLeft[Insert[Insert[#1, 2 #2[[1]], 1], 2 #2[[2]], -1]], {2, -2}] +
      (2/r - 2) #1 &, {newU, boundaries}],
  {i, 1, 4}]
t2 = AbsoluteTime[] - t1
0.017427

In order to plot the new concentration, we need to flatten the list since the result is returned as a nested list with as many sublists as processors.
plotMultiple = ListPlot[Flatten[newU], Joined -> True]

[Plot: the concentration after four asynchronous parallel steps]

We need to check our result against the single processor result. The code for a single processor is bundled in the cell below.
Clear[u, b, M]
leftend = 0.0; rightend = 0.0;
Table[u[0, i] = InitialConcentration[[i]], {i, 1, Length[InitialConcentration]}];
u[i_, 0] = leftend;
u[i_, Length[InitialConcentration] + 1] = rightend;
var[j_] := Table[u[j, i], {i, Length[InitialConcentration]}]
eqn[j_, i_] := -u[j + 1, i - 1] + (2/r + 2) u[j + 1, i] - u[j + 1, i + 1] ==
  u[j, i - 1] + u[j, i + 1] + (2/r - 2) u[j, i]
equations[j_] := Table[eqn[j, i], {i, 1, Length[InitialConcentration]}]
{b, M} = -CoefficientArrays[equations[0], var[1]];
t1 = TimeUsed[];
Do[newUSingle = LinearSolve[-M, b];
  b = Take[RotateRight[Insert[Insert[newUSingle, 2 leftend, 1], 2 rightend, -1]] +
      RotateLeft[Insert[Insert[newUSingle, 2 leftend, 1], 2 rightend, -1]], {2, -2}] +
    (2/r - 2) newUSingle, {i, 1, 4}];
TimeUsed[] - t1
0.000693


plotSingle = ListPlot[newUSingle]
[Plot: the single processor concentration after four steps]

Comparing the results shows that our multiple processor code took 25 times as long to run and returned an inaccurate answer. But all is not lost. The problem is our small grid. A small grid is good for debugging, but you would never run a problem this small on multiple processors. The task sent to each processor, solving roughly ten linear equations, is trivial, so the problem is dominated by the communication overhead. Furthermore we are taking very large time steps. Since $r = k \Delta t/(\Delta x)^2$ and we took r = 10, we have $\Delta t = 10 (\Delta x)^2/k$. If $k = 1\ \mathrm{mm}^2/\mathrm{s}$ and our spatial domain is 1 cm, then with forty grid points $\Delta x = 0.25$ mm and $\Delta t = 0.625$ seconds.
Show[plotMultiple, plotSingle]

[Plot: the parallel (asynchronous) and single processor solutions overlaid]

If we try a larger grid things look better. If we take a grid of 40 000 points,


InitialConcentration = Table[N[Sin[2 Pi i/40000]]^2, {i, 0, 40000}];
plotInitial = ListPlot[InitialConcentration, Joined -> True]


[Plot: the initial concentration on the 40 000 point grid]

run for 4000 steps (the code for this is given in the next cell)
Clear[u, r, vectors, matrices]
leftend = 0.0; rightend = 0.0;
Table[u[0, i] = InitialConcentration[[i]], {i, 1, Length[InitialConcentration]}];
u[i_, 0] = leftend;
u[i_, Length[InitialConcentration] + 1] = rightend;
var[j_] := Table[u[j, i], {i, Length[InitialConcentration]}]
eqn[j_, i_] := -u[j + 1, i - 1] + (2/r + 2) u[j + 1, i] - u[j + 1, i + 1] ==
  u[j, i - 1] + u[j, i + 1] + (2/r - 2) u[j, i]
equations[j_] := Table[eqn[j, i], {i, 1, Length[InitialConcentration]}]
n = Length[ParallelEvaluate[$KernelID]];
width = Ceiling[Length[InitialConcentration]/n];
edges = Table[{i + 1, Min[i + width, Length[InitialConcentration]]},
   {i, 0, Length[InitialConcentration], width}];
segments[i_] := Transpose[{Map[Take[equations[i - 1], #] &, edges], Map[Take[var[i], #] &, edges]}]
r = 10;
arrays = Map[CoefficientArrays[#[[1]], #[[2]]] &, segments[1]];
vectors = -Map[First, arrays];
matrices = Map[Last, arrays];
DistributeDefinitions[matrices, vectors]
t1 = AbsoluteTime[];
Do[u[i, x_] = u[i - 1, x];
  newU = WaitAll[MapThread[ParallelSubmit[LinearSolve[#1, #2]] &, {matrices, vectors}]];
  edgeValues = Map[Drop[#, {2, -2}] &, newU];
  boundaries = ReplacePart[edgeValues, {{1, 1} -> leftend, {n, 2} -> rightend}];
  vectors = MapThread[
     Take[RotateRight[Insert[Insert[#1, 2 #2[[1]], 1], 2 #2[[2]], -1]] +
        RotateLeft[Insert[Insert[#1, 2 #2[[1]], 1], 2 #2[[2]], -1]], {2, -2}] +
      (2/r - 2) #1 &, {newU, boundaries}],
  {i, 1, 4000}]
t2 = AbsoluteTime[] - t1
66.457858

and compare this to the single processor case,


Clear[u, b, M]
leftend = 0.0; rightend = 0.0;
Table[u[0, i] = InitialConcentration[[i]], {i, 1, Length[InitialConcentration]}];
u[i_, 0] = leftend;
u[i_, Length[InitialConcentration] + 1] = rightend;
var[j_] := Table[u[j, i], {i, Length[InitialConcentration]}]
eqn[j_, i_] := -u[j + 1, i - 1] + (2/r + 2) u[j + 1, i] - u[j + 1, i + 1] ==
  u[j, i - 1] + u[j, i + 1] + (2/r - 2) u[j, i]
equations[j_] := Table[eqn[j, i], {i, 1, Length[InitialConcentration]}]
{b, M} = -CoefficientArrays[equations[0], var[1]];
t1 = TimeUsed[];
Do[newUSingle = LinearSolve[-M, b];
  b = Take[RotateRight[Insert[Insert[newUSingle, 2 leftend, 1], 2 rightend, -1]] +
      RotateLeft[Insert[Insert[newUSingle, 2 leftend, 1], 2 rightend, -1]], {2, -2}] +
    (2/r - 2) newUSingle, {i, 1, 4000}];
TimeUsed[] - t1
53.4869

we see that the parallel code is now only about 1.2 times slower (not 25 times), and the accuracy is good. You can clearly see the errors (which are now small) at the edges of the domains due to the asynchronous updating.
ListPlot[Flatten[newU] - newUSingle, PlotRange -> All,
 PlotLabel -> "Error due to asynchronous updating",
 AxesLabel -> {"grid point", "Difference"}]
[Plot: error due to asynchronous updating — differences of order 5 × 10^-5, largest at the domain edges]

The plot below shows the speed-up or slow-down as a function of the grid size.


[Plot: speed-up as a function of grid size for four processors, for grids up to 10^6 points]

How much speed-up can you expect as you increase the number of processors? That depends on the size of your problem and the architecture of your grid. The plot below shows timings for a grid of one million elements run on Glenn at OSC. The code was run on dual-socket quad-core Opterons. Notice that using six processors is slower than using four, because the code is now split over two chips, not just cores on the same chip, so the I/O is slower.

Speed-up as a function of processors for a 10^6 element grid, normalized to the time for one processor
[Plot: speed-up versus number of processors for a 10^6 element grid; arrows show where a new chip is used]

References

[1] Rubin Landau, Manuel José Páez and Cristian C. Bordeianu, A Survey of Computational Physics, p. 36, Princeton University Press, 2008.
[2] D. J. Chen, C. Y. Lee, C. H. Park and P. Mendes, Parallelizing simulated annealing algorithms based on high-performance computer, Journal of Global Optimization 39, 261-289, 2007.
[3] Daniel R. Greening, Parallel simulated annealing techniques, Physica D 42, 293-306, 1990.
[4] J. Crank and P. Nicolson, A practical method for numerical evaluation of solutions of partial differential equations of the heat-conduction type, Proc. Cambridge Phil. Soc. 43, 50, 1947.
