useR! 2014
Downloads
This presentation is available at: http://r-pbd.org/tutorial
Installation Instructions
Installation instructions for setting up a pbdR environment are available:
http://r-pbd.org/install.html
This includes instructions for installing R, MPI, and pbdR.
Contents
Introduction
Introduction to pbdMPI
Data Input
MPI Profiling
Wrapup
Introduction

Contents
A Concise Introduction to Parallelism
A Quick Overview of Parallel Hardware
A Quick Overview of Parallel Software
Summary
A Concise Introduction to Parallelism
Parallelism

(Figure: serial programming vs. parallel programming.)
Difficulty in Parallelism
Speedup
Wallclock Time: the time on the clock on the wall from start to finish.
Speedup: a unitless measure of improvement; more is better.

S_{n1,n2} = (wallclock time using n1 cores) / (wallclock time using n2 cores)

n1 is often taken to be 1; in this case, we are comparing a parallel algorithm to the serial algorithm.
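As a small sketch of the calculation (the two functions below are placeholders for a serial and a parallel implementation of the same task):

t1 <- system.time(run_serial())["elapsed"]      # run_serial(): placeholder, n1 = 1 core
t2 <- system.time(run_parallel())["elapsed"]    # run_parallel(): placeholder, n2 cores
speedup <- unname(t1 / t2)                      # S_{1,n2}
speedup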
Good Speedup, Bad Speedup

(Figure: two speedup-vs-cores plots, each comparing the application's speedup against the optimal, linear speedup.)
(Figure: speedup vs. cores and wallclock time vs. cores for the same runs.)
Two Memory Models

Shared memory: direct access to read/change memory (one node).
Distributed memory: no direct access to read/change memory (many nodes); requires communication.
A Quick Overview of Parallel Hardware
(Diagram: three flavors of parallel hardware.)
Distributed memory: many nodes, each with its own processors (PROC + cache) and memory (Mem), connected by an interconnection network.
Shared memory: several cores (CORE + cache) on one node sharing a common memory.
Co-processor: a GPU or MIC accelerator with its own local memory.
(GPU: Graphical Processing Unit; MIC: Many Integrated Core.)
A Server or Cluster

(Same hardware diagram as above.)
Server to Supercomputer

(Same hardware diagram, scaled up to many nodes.)
Introduction
ing
t
r
u
e
ib
st
Clu Distr
Distributed Memory
Interconnection Network
PROC
+ cache
PROC
+ cache
PROC
+ cache
PROC
+ cache
Mem
Mem
Mem
Mem
Mu
CORE
+ cache
e
cor
y
n
a
ing
r M
d
o
a
U
o
ffl
GP
O
Co-Processor
ore
ltic
GPU
or
MIC
Shared Memory
CORE
+ cache
CORE
+ cache
Local Memory
GPU: Graphical Processing Unit
CORE
+ cache
Network
ing
d
a
re
ti t h
l
u
M
Memory
http://r-pbd.org/tutorial
ppbbddR
R Core Team
13/131
A Quick Overview of Parallel Software
(Diagram: parallel software for each flavor of hardware.)
Distributed memory: Sockets, MPI, Hadoop.
Shared memory: OpenMP, OpenACC, Pthreads, fork.
Co-processor (GPU or MIC): CUDA, OpenCL, OpenACC.
(Diagram: the same map with R packages added.)
Distributed memory (Sockets, MPI, Hadoop): snow, Rmpi, pbdMPI, RHadoop.
Shared memory (fork): multicore.
Co-processor and compiled code: foreign language interfaces (.C, .Call, Rcpp, inline, OpenCL, ...).
Summary

Three flavors of hardware: distributed is stable; multicore and co-processor are evolving.
Two memory models: distributed-memory programming also works on multicore.
Parallelism hierarchy: medium to big machines have all three.
Why Profile?
Because performance matters.
Bad practices scale up!
Your bottlenecks may surprise you.
Because R is dumb.
R users claim to be data people... so act like it!
A compiler will notice that this loop computes nothing useful:

int main() {
  int x, i;
  for (i = 0; i < 10; i++)
    x = 1;
  return 0;
}

Compiled without optimization, the generated assembly contains the full loop (initialize, compare, branch, store, jump back). Compiled with optimization, the loop is eliminated entirely:

main:
    .cfi_startproc
# BB#0:
    xorl    %eax, %eax
    ret
R will not!

Dumb Loop

for (i in 1:n) {
  tA <- t(A)
  Y <- tA %*% Q
  Q <- qr.Q(qr(Y))
  Y <- A %*% Q
  Q <- qr.Q(qr(Y))
}

Better Loop

tA <- t(A)
for (i in 1:n) {
  Y <- tA %*% Q
  Q <- qr.Q(qr(Y))
  Y <- A %*% Q
  Q <- qr.Q(qr(Y))
}
Another real example: coercing inside the loop versus hoisting the coercion.

while (i <= N) {
  for (j in 1:i) {
    d.k <- as.matrix(x)[l == j, l == j]
    ...

x.mat <- as.matrix(x)   # coerce once, outside the loops
while (i <= N) {
  for (j in 1:i) {
    d.k <- x.mat[l == j, l == j]
    ...
Some Thoughts

R is slow.
Bad programmers are slower.
R isn't very clever (compared to a compiler).
The bytecode compiler helps, but not nearly as much as a compiler does.
Profiling R Code
Timings

Getting simple timings as a basic measure of performance is easy and valuable.

system.time(): time a block of code.
Rprof(): time the execution of R functions.
Rprofmem(): report memory allocation in R.
tracemem(): detect when a copy of an R object is created.
The rbenchmark package: benchmark comparisons.
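A quick sketch of the first and last of these (the matrix here is just an arbitrary example):

library(rbenchmark)

x <- matrix(rnorm(1e6), ncol = 100)

# time one block of code
system.time(apply(x, 2, mean))

# compare two methods of computing the same thing
benchmark(apply = apply(x, 2, mean),
          colMeans = colMeans(x),
          replications = 100)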
Advanced R Profiling
The pbdPROF workflow:

1. Build pbdPROF and rebuild pbdMPI against it (details in the MPI Profiling section).

2. Run code:

mpirun -np 64 Rscript my_script.R

3. Analyze results:

library(pbdPROF)
prof <- read.prof("output.mpiP")
plot(prof, plot.type = "messages2")
Hardware counter measurements available through pbdPAPI:

Time, floating point instructions, and Mflips
Time, floating point operations, and Mflops
Cache misses, hits, accesses, and reads
Events per cycle
Idle cycles
CPU or RAM bound
CPU utilization
Summary
Profile, profile, profile.
Use system.time() to get a general sense of a method.
Use rbenchmark's benchmark() to compare two methods.
Use Rprof() for more detailed profiling.
Other tools exist for more hardcore applications (pbdPAPI and pbdPROF).
pbdR Packages
pbdR Motivation

Why HPC libraries (MPI, ScaLAPACK, PETSc, ...)?
The HPC community has been at this for decades.
They're tested. They work. They're fast.
You're not going to beat Jack Dongarra at dense linear algebra.
pbdMPI

Types in R

In Rmpi, you must specify the MPI type yourself:

# int
mpi.allreduce(x, type = 1)
# double
mpi.allreduce(x, type = 2)

In pbdMPI, the type is determined automatically:

allreduce(x)
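A small sketch of the point (the values are illustrative; the default operation is a sum):

allreduce(1L)           # integer: every rank contributes 1, so the result is comm.size()
allreduce(1.5)          # double: no type argument needed
allreduce(c(1, 2, 3))   # vectors are reduced element-wise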
(Figure: benchmark timings on 12 to 64 cores; the code uses library(pbdDMAT).)
(Table: run times for 500, 1000, and 2000 predictors across varying core counts.)
(Diagram: the pbdR packages mapped onto the HPC software stack.)

pbdMPI sits on MPI (profiled through pbdPROF, which works with mpiP, fpmpi, and Tau).
pbdDMAT sits on ScaLAPACK, PBLAS, and BLACS (related libraries: PETSc, Trilinos, DPLASMA; tuned BLAS such as MKL, LibSci, and ACML).
pbdPAPI sits on PAPI.
pbdNCDF4 sits on NetCDF4; pbdADIOS sits on ADIOS.
Co-processor libraries: MAGMA, PLASMA, cuBLAS (R packages magma, HiPLAR, HiPLARM).
Using pbdR
pbdR Paradigms

pbdR programs are R programs!

Differences:
Batch execution (non-interactive).
Parallel code uses the Single Program/Multiple Data (SPMD) style.
Emphasizes data parallelism.
Batch Execution

Running a serial R program in batch:

Rscript my_script.r

or

R CMD BATCH my_script.r
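A parallel pbdR script is launched the same way, just through mpirun (two ranks here as an example):

mpirun -np 2 Rscript my_script.r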
Paradigms

Programming models: OOP, functional, SPMD, ...
SIMD: hardware vector instructions (MMX, SSE, ...); not the same thing as SPMD.
Summary

pbdR connects R to scalable HPC libraries.
The pbdDEMO package offers numerous examples and explanations for getting started with distributed R programming.
pbdR programs are R programs.
Introduction to pbdMPI

Contents
Managing a Communicator
Reduce, Gather, Broadcast, and Barrier
Other pbdMPI Tools
Summary
Managing a Communicator
MPI Operations (1 of 2)

Managing a communicator: create and destroy communicators.
  init(): initialize the communicator.
  finalize(): shut down the communicator(s).
Rank query: determine the processor's position in the communicator.
  comm.rank(): who am I?
  comm.size(): how many of us are there?
Printing: print output from various ranks.
  comm.print(x)
  comm.cat(x)

WARNING: only use the printing functions on results, never on yet-to-be-computed expressions.
Quick Example 1

Rank Query: 1_rank.r

library(pbdMPI, quietly = TRUE)
init()

# each process reports its own rank
comm.print(comm.rank(), all.rank = TRUE)

finalize()

Sample Output (2 ranks):

COMM.RANK = 0
[1] 0
COMM.RANK = 1
[1] 1
Quick Example 2

Hello World: 2_hello.r

library(pbdMPI, quietly = TRUE)
init()

comm.print("Hello, world")

comm.print("Hello again", all.rank = TRUE, quietly = TRUE)

finalize()

Sample Output (2 ranks):

COMM.RANK = 0
[1] "Hello, world"
[1] "Hello again"
[1] "Hello again"
Reduce, Gather, Broadcast, and Barrier
MPI Operations

Reduce
Gather
Broadcast
Barrier
(Diagrams of the four operations:)
Reduce: many-to-one.
Gather: many-to-one.
Broadcast: one-to-many.
Barrier: synchronization.
MPI Operations (2 of 2)

Reduction: each processor has a number x; add them all up, find the largest/smallest, ...
  reduce(x, op = "sum"): reduce to one.
  allreduce(x, op = "sum"): reduce to all.
Gather: each processor has a number; create a new object on some processor containing all of those numbers.
  gather(x): gather to one.
  allgather(x): gather to all.
Broadcast: one processor has a number x that every other processor should also have.
  bcast(x)
Barrier: computation wall; no processor can proceed until all processors can proceed.
  barrier()
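allgather() and barrier() do not appear in the quick examples that follow, so here is a minimal sketch:

# every rank contributes its rank number and receives the full set
all.ranks <- allgather(comm.rank())

barrier()   # no rank proceeds past this point until all ranks reach it

comm.print(unlist(all.ranks))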
Quick Example 3

Reduce and Gather: 3_gt.r

library(pbdMPI, quietly = TRUE)
init()

# each rank owns one value; these are chosen to match the sample output below
n <- c(2, 8)[comm.rank() + 1]

# gather every rank's value to rank 0
gt <- gather(n)
comm.print(unlist(gt))

# sum across ranks; the result is available on every rank
sm <- allreduce(n, op = "sum")
comm.print(sm, all.rank = TRUE)

finalize()

Run with:

mpirun -np 2 Rscript 3_gt.r

Sample Output:

COMM.RANK = 0
[1] 2 8
COMM.RANK = 0
[1] 10
COMM.RANK = 1
[1] 10
Quick Example 4

Broadcast: 4_bcast.r

library(pbdMPI, quietly = TRUE)
init()

# only rank 0 owns the data to begin with
if (comm.rank() == 0) {
  x <- matrix(1:4, nrow = 2)
} else {
  x <- NULL
}

y <- bcast(x)
comm.print(y, rank.print = 1)   # print from rank 1 to show the data arrived

finalize()

Sample Output:

COMM.RANK = 1
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Other pbdMPI Tools
Random Seeds

pbdMPI offers a simple interface for managing random seeds:
  comm.set.seed(seed = 1234, diff = TRUE): all processors generate different streams.
  comm.set.seed(seed = 1234, diff = FALSE): all processors generate the same stream.
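A quick sketch of the difference between the two settings:

comm.set.seed(seed = 1234, diff = TRUE)
comm.print(runif(1), all.rank = TRUE)    # a different value on every rank

comm.set.seed(seed = 1234, diff = FALSE)
comm.print(runif(1), all.rank = TRUE)    # the same value on every rank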
Summary

Start by loading the package; every pbdMPI script has the same skeleton:

library(pbdMPI, quietly = TRUE)
init()

# ...

finalize()
GBD
Distributing Data

Problem: how to distribute the data

x = a 10 x 3 matrix with entries x_{1,1}, ..., x_{10,3}
Distributing a Matrix

(Figure: the 10 x 3 matrix x split by rows across four processors, 0 through 3; each processor owns a contiguous block of rows.)
The last row of the local storage of a processor is adjacent (by global row) to the first row of the local storage of the next processor (by communicator number) that owns data.

GBD is (relatively) easy to understand, but can lead to bottlenecks if you have many more columns than rows.
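A sketch of how a rank might work out its share under an even row split (the global row count here is an assumption for illustration):

n <- 10                                   # global number of rows (assumed known)
rows.per.rank <- rep(n %/% comm.size(), comm.size())
rem <- n %% comm.size()
if (rem > 0)
  rows.per.rank[seq_len(rem)] <- rows.per.rank[seq_len(rem)] + 1

my.nrows <- rows.per.rank[comm.rank() + 1]
my.first.row <- c(0, cumsum(rows.per.rank))[comm.rank() + 1] + 1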
Example: x = a 9 x 9 matrix (entries x11, ..., x99) to be distributed across six processors, 0 through 5.
One GBD distribution of x: local dimensions 2 x 9, 2 x 9, 2 x 9, 1 x 9, 1 x 9, 1 x 9 on processors 0 through 5.
The distribution does not have to be uniform: local row counts can differ, and some processors may own no rows at all (a 0 x 9 local matrix).
Summary

Need to distribute your data? Try splitting by row.
This may not work well if your data is square (or wider than it is tall).
Example: Monte Carlo Estimation of pi

(Figure: points sampled uniformly in the unit square; those with x^2 + y^2 <= 1 fall inside the quarter circle, so their fraction estimates pi/4.)

Parallel strategy: each rank counts its own points, then asks everyone else what their answer is and sums it all up.
Serial Code

N <- 50000
X <- matrix(runif(N * 2), ncol = 2)
r <- sum(rowSums(X^2) <= 1)
PI <- 4 * r / N
print(PI)

Parallel Code

library(pbdMPI, quietly = TRUE)
init()

comm.set.seed(seed = 1234, diff = TRUE)

N.gbd <- 50000 / comm.size()               # each rank samples its share of the points
X.gbd <- matrix(runif(N.gbd * 2), ncol = 2)
r.gbd <- sum(rowSums(X.gbd^2) <= 1)

r <- allreduce(r.gbd)                      # sum the local counts across all ranks
PI <- 4 * r / (N.gbd * comm.size())
comm.print(PI)

finalize()
Note
For the remainder, we will exclude loading, init, and finalize calls.
Example: Sample Covariance

cov(x_{n x p}) = 1/(n - 1) * sum_{i=1}^{n} (x_i - xbar)(x_i - xbar)^T
Subtract each column's mean from that column's entries in each local matrix.
Divide by N - 1.
Serial Code

N <- nrow(X)
mu <- colSums(X) / N
X <- sweep(X, STATS = mu, MARGIN = 2)
Cov.X <- crossprod(X) / (N - 1)
print(Cov.X)

Parallel Code

N <- allreduce(nrow(X.gbd))                       # total rows across all ranks
mu <- allreduce(colSums(X.gbd)) / N               # global column means
X.gbd <- sweep(X.gbd, STATS = mu, MARGIN = 2)
Cov.X <- allreduce(crossprod(X.gbd)) / (N - 1)
comm.print(Cov.X)
Example: Linear Regression

Fit the least squares solution beta = (X^T X)^{-1} X^T y.

Locally, compute tX = X^T, then the pieces A = tX X and B = tX y; allreduce them.
Locally, compute solve(A) %*% B.
Serial Code

tX <- t(X)
A <- tX %*% X
B <- tX %*% y
ols <- solve(A) %*% B
print(ols)

Parallel Code

tX.gbd <- t(X.gbd)
A <- allreduce(tX.gbd %*% X.gbd)   # the same small p x p matrix on every rank
B <- allreduce(tX.gbd %*% y.gbd)   # the same p x 1 vector on every rank
ols <- solve(A) %*% B
comm.print(ols)
ppbbddR
R Core Team
78/131
Summary

SPMD programming is (often) a natural extension of serial programming.
More pbdMPI examples can be found in pbdDEMO.
Data Input

Contents
Cluster Computer and File System
Serial Data Input
Parallel Data Input
Summary
Cluster Computer and File System
File System

(Diagram: compute nodes connected to a storage server and its disk.)
(Diagram: compute nodes connected to a parallel file system with multiple storage servers and disks.)
(Diagram: compute nodes connect through I/O nodes to the parallel file system's storage servers and disks.)
(Diagram: pbdADIOS on the compute nodes uses ADIOS to move data through the I/O nodes to the parallel file system.)
Serial Data Input
Read on one process, then distribute:

library(pbdDMAT)

if (comm.rank() == 0) {   # only read on process 0
  x <- read.csv("myfile.csv")
} else {
  x <- NULL
}

dx <- as.ddmatrix(x)
Parallel Data Input
New Issues

How to read in parallel?
CSV, SQL, NetCDF4, HDF, ADIOS, custom binary
How to partition data across nodes?
How to structure it for scalable libraries?
Read directly into the form needed, or restructure afterwards?
...

A lot of work is needed here!
CSV Data

Serial Code

x <- read.csv("x.csv", header = TRUE)
x

Parallel Code

dx <- read.csv.ddmatrix("x.csv", header = TRUE, sep = ",",
                        nrows = 10, ncols = 10, num.rdrs = 2, ICTXT = 0)

dx

finalize()
Reading a binary vector of doubles in parallel: each rank seeks to its own offset and reads its own block.

N <- 1000000                              # global length (assumed known)
size <- 8                                 # bytes per double
my_length <- N %/% comm.size()            # assume comm.size() divides N evenly
my_start <- comm.rank() * my_length * size

con <- file("binary.vector.file", "rb")
seekval <- seek(con, where = my_start, rw = "read")
x <- readBin(con, what = "double", n = my_length, size = size)
close(con)
Reading a binary (column-major) matrix in parallel: each rank reads a contiguous block of columns.

nrow_global <- 1000                       # global dimensions (assumed known)
ncol_global <- 100
size <- 8                                 # bytes per double

ncol_local <- ncol_global %/% comm.size() # assume comm.size() divides ncol_global
my_start <- comm.rank() * ncol_local * nrow_global * size
my_length <- ncol_local * nrow_global

con <- file("binary.matrix.file", "rb")
seekval <- seek(con, where = my_start, rw = "read")
x <- readBin(con, what = "double", n = my_length, size = size)
close(con)

x <- matrix(x, nrow = nrow_global, ncol = ncol_local)
NetCDF4 Data

A sketch using pbdNCDF4 (the file and variable names are placeholders):

library(pbdNCDF4, quietly = TRUE)

nc <- nc_open_par("myfile.nc")            # open the file for parallel access
nc_var_par_access(nc, "myvariable")
x <- ncvar_get(nc, "myvariable")          # use start/count to read a per-rank slab
nc_close(nc)

finalize()
Summary

Mostly do it yourself.
Use a parallel file system for big data.
Use binary files for true parallel reads.
Know the number of readers vs. the number of storage servers.
Distributed Matrices

You can only get so far with one node...

(Figure: script run time vs. cores, 16 to 256 cores.)
(Figures: one-dimensional data distributions, (a) Block, (b) Cyclic, (c) Block-Cyclic, and their two-dimensional analogues, (a) 2d Block, (b) 2d Cyclic, (c) 2d Block-Cyclic.)
(Figure: possible processor grids for 6 processors: (a) 1 x 6, (b) 2 x 3, (c) 3 x 2, (d) 6 x 1.)
A ddmatrix is made up of: Data (the local submatrix), dim (the global dimension), ldim (the local dimension), bldim (the blocking dimension), and ICTXT (the BLACS context, i.e. the processor grid).
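A small sketch of peeking at these slots (the grid, dimensions, and block size are arbitrary choices):

library(pbdDMAT, quietly = TRUE)
init.grid()

dx <- ddmatrix("rnorm", nrow = 9, ncol = 9, bldim = 2)

comm.print(dx@dim)                     # global dimension
comm.print(dx@bldim)                   # blocking dimension
comm.print(dx@ldim, all.rank = TRUE)   # local dimension differs by rank

finalize()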
Example: x = a 9 x 9 matrix with entries x11, ..., x99.
(Figures: x distributed over a 2 x 2 processor grid, processors 0, 1, 2, 3 at grid positions (0,0), (0,1), (1,0), (1,1); the slides step through which blocks of x each processor owns.)
pbdDMAT
(Figures: the 9 x 9 matrix x distributed block-cyclically, with 2 x 2 blocks, over a 2 x 3 processor grid, processors 0 through 5 at grid positions (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), together with the local submatrix stored by each processor.)
Cons
Confusing layout.
Serial Code

cov(x)

Parallel Code

cov(x)

The call is identical: cov() dispatches on the distributed ddmatrix class.
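A minimal end-to-end sketch (the matrix size and seed are arbitrary):

library(pbdDMAT, quietly = TRUE)
init.grid()

comm.set.seed(seed = 1234, diff = TRUE)
dx <- ddmatrix("rnorm", nrow = 1000, ncol = 10)   # distributed random matrix

cv <- cov(dx)            # the same call as in serial code
print(cv)                # prints a description of the distributed result

finalize()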
Summary

Every pbdDMAT program has the same skeleton:

library(pbdDMAT, quietly = TRUE)
init.grid()

# ...

finalize()
RandSVD
Randomized SVD

(Algorithm from Halko, Martinsson, and Tropp, "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions", SIAM Review, 2011.)

Stage A (randomized subspace iteration): given an m x n matrix A and integers l and q, compute an m x l orthonormal matrix Q whose range approximates the range of A.

1. Draw an n x l standard Gaussian matrix Omega.
2. Form Y0 = A Omega and compute its QR factorization Y0 = Q0 R0.
3. for j = 1, 2, ..., q
4.   Form Yj' = A^T Q_{j-1} and compute its QR factorization Yj' = Qj' Rj'.
5.   Form Yj = A Qj' and compute its QR factorization Yj = Qj Rj.
6. end
7. Q <- Qq.

Note: forming Y = (A A^T)^q A Omega directly is vulnerable to round-off error; the extra orthonormalization between each application of A and A^T avoids this.

Stage B: form B = Q^T A, compute an SVD of the small matrix B = U' Sigma V^T, and set U = Q U'.

Serial R

randSVD <- function(A, k, q = 3) {
  n <- ncol(A)
  ## Stage A
  Omega <- matrix(rnorm(n * 2 * k), nrow = n, ncol = 2 * k)
  Y <- A %*% Omega
  Q <- qr.Q(qr(Y))
  At <- t(A)
  for (i in 1:q) {
    Y <- At %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  ## Stage B
  B <- t(Q) %*% A
  U <- La.svd(B)$u
  U <- Q %*% U
  U[, 1:k]
}
Randomized SVD

Serial R vs. parallel pbdR: the two listings are nearly identical line for line; the pbdDMAT version simply operates on distributed ddmatrix objects.
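A sketch of the distributed call with pbdDMAT (the dimensions are illustrative, and the one assumed change inside randSVD() is that Omega is generated as a ddmatrix; pbdDMAT provides ddmatrix methods for %*%, t, qr, qr.Q, and La.svd, so the rest of the body carries over):

library(pbdDMAT, quietly = TRUE)
init.grid()

dA <- ddmatrix("rnorm", nrow = 100000, ncol = 1000)   # distributed input

# inside randSVD(): Omega <- ddmatrix("rnorm", nrow = n, ncol = 2 * k)
dU <- randSVD(dA, k = 30, q = 3)

finalize()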
Randomized SVD

30 Singular Vectors from a 100,000 by 1,000 Matrix

(Figure: speedup vs. cores, 16 to 128, for the full and the randomized SVD algorithms.)
Summary

pbdDMAT makes distributed (dense) linear algebra easier.
It can enable rapid prototyping at large scale.
MPI Profiling

Contents
Profiling with the pbdPROF Package
Installing pbdPROF
Example
Summary
Profiling with the pbdPROF Package
Introduction to pbdPROF

A successful Google Summer of Code 2013 project.
Available on CRAN.
Enables profiling of MPI-using R scripts.
pbdR packages are officially supported; it can work with others...
Also reads, parses, and plots profiler outputs.
How it works

MPI calls get hijacked by the profiler and logged.
Currently supports the profilers fpmpi and mpiP.
fpmpi is distributed with pbdPROF and installs easily, but offers minimal profiling capabilities.
mpiP is fully supported as well, but you have to install and link it yourself.
Installing pbdPROF
1. Build pbdPROF.
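Since the package is on CRAN, this is typically just (a sketch):

install.packages("pbdPROF")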
2. Rebuild pbdMPI:

R CMD INSTALL pbdMPI_0.2-2.tar.gz --configure-args="--enable-pbdPROF"

Any package which explicitly links with an MPI library must be rebuilt in this way (pbdMPI, Rmpi, ...).
Other pbdR packages link with pbdMPI, and so do not need to be rebuilt.
See the pbdPROF vignette if something goes wrong.
Example
Example Script

my_svd.r

library(pbdDMAT, quietly = TRUE)
init.grid()

n <- 1000
x <- ddmatrix("rnorm", n, n)

s <- La.svd(x)            # the MPI-heavy operation being profiled
comm.print(s$d[1:5])

finalize()
Run the example with 4 ranks:

$ mpirun -np 4 Rscript my_svd.r

mpiP:
mpiP: mpiP V3.3.0 (Build Sep 23 2013/14:00:47)
mpiP: Direct questions and errors to mpip-help@lists.sourceforge.net
mpiP:
Using 2x2 for the default grid size
mpiP:
mpiP: Storing mpiP output in [./R.4.5944.1.mpiP].
mpiP:
Read the profiler output into R:

library(pbdPROF)
prof.data <- read.prof("R.4.28812.1.mpiP")
Generate plots

plot(prof.data, plot.type = "messages2")

(Figures: the plots produced by pbdPROF for the profiled run; several plot types are available.)
Summary

pbdPROF offers tools for profiling MPI-using R code.
It builds fpmpi easily and also supports mpiP.
Wrapup
Summary

Profile your code to understand your bottlenecks.
pbdR makes distributed parallelism with R easier.
Distributing data to multiple nodes requires choosing an appropriate layout.
For truly large data, I/O must be parallel as well.
Questions?
http://r-pbd.org/
Come see our poster on Wednesday at 5:30!