Figure 1 presents the bandwidth performance of the RDMA write using the pgping microbenchmark available in the Elan library. It is evident that the bandwidth is doubled in the dual-rail system. The both-way single-rail and dual-rail elan_put() bandwidths are 670 MB/s and 1332 MB/s, respectively. The bandwidth for elan_get() is almost the same as that of elan_put() in each case (not shown). The Elan RDMA write short-message latency does not change much between single-rail and dual-rail; it varies between 2 µs and 2.77 µs for a 4-byte message. The elan_get() short-message latency is slightly larger than that of the RDMA write. That is why we decided to use elan_put() in our collective implementations.

Figure 1. Elan RDMA write performance (bandwidth in MB/s vs. message size in bytes; uni-directional, bi-directional, and both-way curves for 1-rail and 2-rail).

4.2. Tports Performance

Figure 2 shows the Tports bandwidth. Tests are done using the tping microbenchmark (except for the uni-directional case, where we wrote our own code). Like the Elan RDMA, the dual-rail Tports bandwidth outperforms the single-rail bandwidth in each case. The single-rail Tports bandwidth is roughly the same as the RDMA bandwidth; however, the dual-rail Tports bandwidth falls short of RDMA. The Tports short-message latency is slightly larger than that of RDMA.

Figure 2. Tports send/receive performance (bandwidth in MB/s vs. message size in bytes; uni-directional, bi-directional, and both-way curves for 1-rail and 2-rail).

4.3. MPI Send/Receive Performance

Figure 3 compares the MPI bandwidth under different cases. Unlike the both-way case, the uni-directional and bi-directional MPI bandwidths for dual-rail are almost doubled. This shows that the MPI point-to-point communication benefits from the dual-rail striping.

Figure 3. MPI send/receive performance (bandwidth in MB/s vs. message size in bytes; uni-directional, bi-directional, and both-way curves for 1-rail and 2-rail).

4.4. Collective Performance

Figure 4 depicts the aggregate bandwidth for the Elan hardware and software broadcasts, gather, and all-to-all, as well as MPI_Scatter and MPI_Allgather. For the Elan collectives, we have used the gping microbenchmark available in the Elan library. For the MPI collectives, we have written our own code.

From the results, except for elan_alltoall() (and elan_reduce(), not shown here), the other collectives at the Elan level do not benefit from the dual-rail QsNetII. MPI_Scatter and MPI_Allgather are implemented on top of Tports, but only MPI_Scatter achieves larger bandwidth under dual-rail.

Efficient implementation of collective operations is one of the keys to the performance of parallel applications. Given the multi-rail performance offered at the Elan and Tports levels, excellent opportunities exist for devising efficient collectives for such systems.

Basically, there are two ways to improve the performance of collectives on multi-rail systems. One is to implement single-port collective communication algorithms that gain multi-rail striping from the underlying communication subsystem. This is the approach currently used for MPI_Scatter and elan_alltoall(). However, it only improves the performance for large messages. The second approach, which we propose, is to design and implement multi-port algorithms for multi-rail systems that also benefit from the striping feature supported by QsNetII. We have used some known multi-port algorithms and implemented them on our dual-rail QsNetII network directly at the Elan level using RDMA write.

5. Collective Algorithms

In this section, we provide an overview of some known algorithms for scatter, gather, and all-to-all personalized exchange. In the following discussion, N is the number of processors (or processes) and k is the number of ports in the multi-port algorithms (equal to the number of available rails). In the k-port (or multi-port) modeling, each process has the ability to simultaneously send and receive k messages on its k links. The assumption is that the number of processes is a power of (k + 1). Otherwise, dummy processes can be assumed to exist until the next power of (k + 1), and the algorithms apply with little or no performance loss.

5.1. Scatter

The spanning binomial tree algorithm [10] can be extended to k-port modeling. In this algorithm, the scattering process sends k messages of length N/(k + 1) each to its k children. Therefore, there are (k + 1) processes holding N/(k + 1) different messages. These processes, at step 2, send one (k + 1)-th of their initial message to each of their immediate k children. This process continues, and all processes are informed after log_{k+1} N communication steps. Using Hockney's model [8], the total communication time, T, is:

T = (ts × log_{k+1} N) + (lm × τ) × Σ_{i=1..log_{k+1} N} (k + 1)^((log_{k+1} N) − i)
  = (ts × log_{k+1} N) + (lm × τ) × (N − 1)/k        (1)

where ts is the message startup time, lm is the message size in bytes, and τ is the time to transfer one byte.

The above algorithm has a logarithmic number of steps and is therefore suitable for short messages, and for networks where the cost of message transfer is dominated by the startup latency. Another algorithm, for large messages, is the extension of the sequential tree algorithm to k-port modeling. At each step, the source process sends k of its different messages to k other processes. There are a total of (N − 1)/k communication steps. Therefore, the total communication cost, T, is:
T = ((N − 1)/k) × (ts + lm × τ)        (2)

5.2. Gather

Gather is the reverse of the scatter operation: over the same spanning binomial tree, each message is forwarded through intermediate processes until it reaches the root. The total communication cost is the same as in Equation (1).
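The k-port scatter schedule and the geometric sum behind Equation (1) can be checked with a short simulation. This sketch is ours, not part of the original implementation: it marks process 0 as the root, propagates data along the (k + 1)-ary spanning tree, and verifies the step count log_{k+1} N and the total per-port data volume (N − 1)/k.

```python
def kport_scatter_steps(N, k):
    """Simulate the k-port spanning-tree scatter of Section 5.1.

    Process 0 is the root; N must be a power of (k + 1). Returns the
    number of communication steps, the set of informed processes, and
    the data volume sent per process at each step (in units of one
    process's final block).
    """
    d = 0
    while (k + 1) ** d < N:
        d += 1
    assert (k + 1) ** d == N, "N must be a power of (k + 1)"

    informed = {0}
    volumes = []                        # blocks sent per port per step
    for step in range(1, d + 1):
        stride = (k + 1) ** (d - step)  # distance to, and size for, each child
        for p in list(informed):
            for j in range(1, k + 1):   # one message per port
                informed.add(p + j * stride)
        volumes.append(stride)          # each message carries 'stride' blocks
    return d, informed, volumes

# Dual-rail case (k = 2): 9 processes are covered in log_3 9 = 2 steps,
# and the volumes sum to (N - 1) / k blocks, as in Equation (1).
d, informed, volumes = kport_scatter_steps(9, 2)
assert d == 2
assert informed == set(range(9))
assert sum(volumes) == (9 - 1) // 2
```

With k = 1 the same routine reproduces the classical binomial scatter (log_2 N steps), so the sketch covers both the single-rail and the dual-rail schedules.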
5.3. All-to-all Personalized Exchange

A lower bound on the all-to-all personalized exchange time is (N − 1)/k steps, since each process must receive N − 1 different messages and can receive at most k messages at a time. A simple algorithm is based on the extension of the direct algorithm to k-port modeling. The processes are arranged in a virtual ring; that is, at step i, process p sends its messages to processes (p + (i − 1)k + 1) mod N, (p + (i − 1)k + 2) mod N, …, (p + ik) mod N. The modulus operation avoids sending several messages to a single destination. The communication cost is the same as in Equation (2).

Figure 4. Collective performance on 16 processes (aggregate bandwidth in MB/s vs. message size in bytes for the Elan hardware broadcast, elan_hbcast; the software broadcast, elan_bcast; MPI_Scatter; and MPI_Allgather).
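The ring schedule of the direct algorithm can be enumerated directly from the destination formula above. The following sketch (ours, for illustration) confirms that each process sends exactly one personalized message to every other process:

```python
def direct_alltoall(N, k):
    """Destinations under the k-port direct algorithm on a virtual ring
    (Section 5.3): at step i, process p sends to
    (p + (i - 1)k + 1) mod N, ..., (p + i*k) mod N."""
    steps = -(-(N - 1) // k)            # ceil((N - 1) / k) communication steps
    dests = {p: [] for p in range(N)}
    for i in range(1, steps + 1):
        for p in range(N):
            for j in range(1, k + 1):
                d = (p + (i - 1) * k + j) % N
                if d != p:              # a self-offset can occur only in the last step
                    dests[p].append(d)
    return dests

# 16 processes, dual rail (k = 2): 8 steps, and every process reaches
# each of the other 15 processes exactly once.
dests = direct_alltoall(16, 2)
assert all(sorted(v) == [d for d in range(16) if d != p]
           for p, v in dests.items())
```

Because the offsets (i − 1)k + j sweep 1 through N − 1 without repetition, no two messages from the same step share a destination, which is the contention-avoidance property the ring arrangement provides.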
6. RDMA-based Implementation and Performance

In this section, the intention is to show the effectiveness of the multi-port algorithms introduced in Section 5 on multi-rail QsNetII with striping support, when they are implemented directly at the Elan layer using Elan RDMA write.

Memory registration/deregistration is a costly operation. However, contrary to InfiniBand and Myrinet, QsNetII does not need memory registration and address exchange for message transfers. This eases the implementation and effectively reduces the communication latency.

Our algorithms are two-port put-based algorithms, where a sending process has direct control in sending messages simultaneously over the two rails using the elan_doput() function. When a message is larger than a threshold (1KB), even message striping is used over the two rails. When a message is sent, the sending process uses elan_wait() to make sure the user buffer can be re-used.

In the latest Quadrics Hawk distribution, release 2.2.4-1, remote event notification is enabled in elan_doput() for both single-rail and multi-rail systems. This allows multi-rail striped (ELAN_RAIL_ALL) put messages to have a devent (destination event). In this case, the devent will be set once on each rail, and the target process will need to call elan_initEvent() once for each rail and then wait on each ELAN_EVENT to be returned. This guarantees that a message has been delivered in its entirety.

In the implementation of our algorithms, processes do not synchronize with each other. Note that our implementation is completely put-based, but we are in the process of optimizing our algorithms for the cases where processes are co-located on the same 4-way nodes.

Figure 5. Scatter performance and scalability (BSRW vs. SSRW; 16 processes, and scalability at 16B, 8KB, and 256KB).

6.1. Evaluation of Scatter

We have implemented the multi-port spanning binomial tree algorithm for the scatter operation on multi-rail QsNetII systems using RDMA write. We call this scheme BSRW. Likewise, we call the sequential tree implementation for scatter SSRW.

Figure 5 compares the performance of the two scatter algorithms, BSRW and SSRW, on our dual-rail QsNetII. As expected, BSRW is superior for short messages, while SSRW has a much better performance for medium and large messages. Figure 5 also presents the scalability of our implementation. The scalability figures show that BSRW is indeed the better algorithm for short messages with increasing system size.

6.2. Evaluation of Gather

The multi-port spanning binomial tree algorithm for the gather operation has been implemented on multi-rail QsNetII systems using the RDMA write feature. We call this scheme BGRW.

Figure 6 compares the performance of our gather algorithm, BGRW, with elan_gather(). The results are very promising, as our implementation is much better than the native implementation except, marginally, for short messages of less than 64 bytes. Interestingly, the proposed multi-port gather gains an improvement of up to a factor of 6.35 for a 1MB message. The scalability plots in Figure 6 verify the superiority of our gather algorithm for short-to-medium, medium, and large messages. However, they also show that with an increasing number of processes, elan_gather() is better for very short messages.

6.3. Evaluation of All-to-all Personalized Exchange

We have also implemented the multi-port direct algorithm for all-to-all personalized exchange on multi-rail QsNetII systems using RDMA write. We call this scheme DARW.
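For messages above the 1KB threshold, the even striping described in Section 6 splits each buffer in half across the two rails before the puts are issued. The following is a minimal sketch of that splitting policy only; the function name and the pure-Python setting are ours, not the Elan API:

```python
STRIPE_THRESHOLD = 1024   # 1KB striping threshold from Section 6

def stripe(message: bytes, rails: int = 2):
    """Split a message evenly across the rails; messages at or below
    the threshold are sent whole over a single rail."""
    if len(message) <= STRIPE_THRESHOLD:
        return [message]                  # no striping below the threshold
    # even striping: each rail carries an equal share of the buffer
    half = (len(message) + rails - 1) // rails
    return [message[i * half:(i + 1) * half] for i in range(rails)]

assert stripe(b"x" * 512) == [b"x" * 512]           # short: one rail
chunks = stripe(b"x" * 4096)
assert len(chunks) == 2 and all(len(c) == 2048 for c in chunks)
```

In the actual implementation, this splitting happens when the message is issued with elan_doput() over the two rails, and elan_wait() is used before the user buffer is re-used.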
Figure 7 compares the performance of our all-to-all algorithm, DARW, with elan_alltoall(). The results are again encouraging. Our multi-port all-to-all algorithm and its implementation are much better than the native elan_alltoall() for medium-size messages. In fact, the improvement is up to a factor of 2.19 for a 16KB message. For short messages, however, we are experimenting with the standard exchange [3] and Bruck's index [5] algorithms for multi-rail QsNetII. These algorithms are known to be superior for short messages.

Figure 7. All-to-all performance and scalability (Elan_alltoall vs. DARW; 16 processes, and scalability at 8KB and 256KB).
7. Related Work

Study of collective communication operations has been an active area of research. Recently, the authors in [16] presented a performance analysis of MPI collective operations. Thakur and his colleagues discussed recent collective algorithms used in MPICH [21].
Figure 6. Gather performance and scalability (Elan_gather vs. BGRW; 16 processes, and scalability at 8KB and 256KB).

8. Conclusions

Interconnection networks and the supporting communication system software are the deciding factors in the performance of clusters. Specifically, efficient implementation of collective operations is critical to the performance of MPI applications.

QsNetII is a high-performance network for clusters that implements some collectives at the Elan level. Their MPI counterparts use them directly. Therefore, optimizations and the inclusion of new collectives at the Elan level are extremely desirable.

Quadrics supports point-to-point message striping over multi-rail QsNetII. In this work, we have proposed and implemented a number of multi-port collectives at the Elan level over multi-rail QsNetII systems. Our performance results indicate that our multi-port gather gains an improvement of up to a factor of 6.35 for a 1MB message over the native elan_gather(). The proposed multi-port all-to-all performs better than elan_alltoall() by a factor of 2.19 for a 16KB message. Moreover, we have proposed two algorithms for short and long messages for the scatter operation.

The results are encouraging, and future work in this area is justified. Our all-to-all algorithm did not perform well for short messages; for this, we are currently experimenting with the standard exchange [3] and Bruck's index [5] algorithms. We are also trying to utilize the shared-memory wrapper facility of the Quadrics software to speed up the collectives for co-located processes on SMP nodes. Optimal static and dynamic striping mechanisms may also help boost the performance. We intend to extend our study by devising other collective communications of interest and testing them on larger multi-rail clusters. NIC-based or NIC-assisted collectives for multi-rail systems, and taking advantage of basic hardware collectives in their design, are other areas of interest.

Acknowledgments

The authors would like to thank the anonymous referees for their insightful comments. Many thanks go to David Addison of Quadrics for his help with our overall understanding of the notification events in multi-rail QsNetII systems. This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC), Queen's University, the Canada Foundation for Innovation (CFI), and the Ontario Innovation Trust (OIT).

References

[1] J. Beecroft, D. Addison, D. Hewson, M. McLaren, D. Roweth, F. Petrini, and J. Nieplocha. QsNetII: Defining high-performance network design. IEEE Micro, 25(4):34-47, July-Aug. 2005.
[2] J. Beecroft, M. Homewood, and M. McLaren. Meiko CS-2 interconnect Elan-Elite design. Parallel Computing, 20(10-11):1627-1638, Nov. 1994.
[3] S. H. Bokhari. Multiphase complete exchange on Paragon, SP2, and CS-2. IEEE Parallel and Distributed Technology, 4(3):45-59, Sept. 1996.
[4] R. Brightwell, D. Doerfler, and K. D. Underwood. A comparison of 4X InfiniBand and Quadrics Elan-4 technologies. In 2004 IEEE International Conference on Cluster Computing (Cluster 2004), pages 193-204, Sept. 2004.
[5] J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143-1156, Nov. 1997.
[6] S. Coll, E. Frachtenberg, F. Petrini, A. Hoisie, and L. Gurvits. Using multirail networks in high performance clusters. In 3rd IEEE International Conference on Cluster Computing (Cluster'01), pages 15-24, Oct. 2001.
[7] Cray Man Page Collection: Shared Memory Access (SHMEM), S-2383-23. Available: http://docs.cray.com/.
[8] R. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389-398, Mar. 1994.
[9] InfiniBand Architecture. Available: http://www.infinibandta.org/.
[10] S. L. Johnsson and C.-T. Ho. Optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249-1268, Sept. 1989.
[11] J. Liu, A. Vishnu, and D. K. Panda. Building multirail InfiniBand clusters: MPI-level design and performance evaluation. In 2004 ACM/IEEE Conference on Supercomputing (SC'04), Nov. 2004.
[12] Message Passing Interface Forum: MPI, A Message-Passing Interface Standard, Version 1.2, 1997.
[13] A. Moody, J. Fernandez, F. Petrini, and D. K. Panda. Scalable NIC-based reduction on large-scale clusters. In 2003 ACM/IEEE Conference on Supercomputing (SC'03), Nov. 2003.
[14] PDSH. Available: http://www.llnl.gov/linux/pdsh/.
[15] F. Petrini, S. Coll, E. Frachtenberg, and A. Hoisie. Performance evaluation of the Quadrics interconnection network. Journal of Cluster Computing, 6(2):125-142, Apr. 2003.
[16] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra. Performance analysis of MPI collective operations. In 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), pages 272a-272a, Apr. 2005.
[17] D. Roweth and A. Moody. Performance of all-to-all on QsNetII. Quadrics White Paper. Available: http://www.quadrics.com/.
[18] D. Roweth, A. Pittman, and J. Beecroft. Optimized collectives on QsNetII. Quadrics White Paper. Available: http://www.quadrics.com/.
[19] S. Sur, H.-W. Jin, and D. K. Panda. Efficient and scalable all-to-all personalized exchange for InfiniBand clusters. In 2004 International Conference on Parallel Processing (ICPP'04), pages 275-282, 2004.
[20] S. Sur, U. K. R. Bondhugula, A. Mamidala, H.-W. Jin, and D. K. Panda. High performance RDMA based all-to-all broadcast for InfiniBand clusters. In International Conference on High Performance Computing (HiPC 2005), Dec. 2005.
[21] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49-66, 2005.
[22] R. Zamani, Y. Qian, and A. Afsahi. An evaluation of the Myrinet/GM2 two-port networks. In 3rd IEEE Workshop on High-Speed Local Networks (HSLN 2004), pages 734-742, Nov. 2004.