…asynchronously, without explicit cooperation of user calls on the remote side. In particular, we use a so-called server thread [7]. The server thread is blocked in elan_queueRxWait, the queue wait call provided by the Quadrics libelan interface. When a message arrives, the thread is awakened and becomes available for processing. This interface is very well supported by the underlying Elan-4 hardware but supports only short messages that fit in a single network packet.

Figure 3: Host-Based-NIC-Assisted (HBNA) method. [Host and NIC on the source and destination sides, with pack/send buffers and arrows for request, response, pack, and send]
The first step is implemented as follows. Each NIC uses as many flags as there are receive buffers on its host. These flags are set by the thread processor for the requesting host and cleared by the local host. This scheme is analogous to the producer-consumer problem: the NIC can only set the flag for a request from a remote host if the flag is “clear”, and the host can only reset the flag on the local NIC after the flag becomes “full”.
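A minimal sketch of this flag protocol is given below. The names (rx_flag, NBUF, nic_grant_buffer) are illustrative rather than the actual ARMCI or libelan symbols, and the flag array is assumed to live in NIC memory visible to both the thread processor and the host.

/* One flag per host receive buffer; the NIC thread processor is the
 * producer (it marks a buffer FULL when granting it to a remote
 * requester) and the local host is the consumer (it marks the buffer
 * CLEAR once the data in it has been unpacked). */
#define NBUF 8
enum { CLEAR = 0, FULL = 1 };

static volatile int rx_flag[NBUF];   /* one flag per host receive buffer */

/* Runs on the NIC thread processor: find a free buffer for a remote
 * request and mark it FULL. */
int nic_grant_buffer(void)
{
    for (int i = 0; i < NBUF; i++) {
        if (rx_flag[i] == CLEAR) {   /* NIC may only set a CLEAR flag */
            rx_flag[i] = FULL;
            return i;                /* buffer index sent back to requester */
        }
    }
    return -1;                       /* no buffer free; requester retries */
}

/* Runs on the local host after buffer i has been consumed. */
void host_release_buffer(int i)
{
    if (rx_flag[i] == FULL)          /* host may only clear a FULL flag */
        rx_flag[i] = CLEAR;
}

Each side transitions a flag in only one direction, which is what makes the producer-consumer analogy work.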
To assure availability of the remote receive buffer, the source node first sends a request to the destination NIC to locate a free buffer. The NIC checks the receive buffer flags to find one marked clear. Once the destination NIC finds a cleared flag, this indicates that the corresponding receive buffer on the destination node is available and ready for use. The destination NIC then sends the information about the available buffer to the requesting source node. Once the destination node has received the data into this receive buffer and finished processing it, it can mark the receive buffer as available again; hence it clears the receive buffer flag on the NIC corresponding to that buffer. The second step involves packing and transmission of data.
The data and functional units involved in the implementation of the PUT operation are shown in Figure 3. To do the PUT operation, the source node first initiates its request for the remote buffer; this process is described as the first step above and can be seen as the arrow labeled request in Figure 3. While this request is in progress, the source packs the data into a local pack/send buffer, shown as the arrow labeled pack in Figure 3. After packing the source data, the source node waits for a response (arrow labeled response in Figure 3) from the destination NIC regarding the availability of a destination buffer. By waiting for the response only after packing the source data into the pack/send buffer, we overlap the time taken to request a buffer and receive a response with the packing of data into the pack/send buffer. After obtaining the destination receive buffer information, the source transmits the data to the destination buffer (send in Figure 3) on the destination node, and repeats this entire process of requesting, packing and sending until the entire source data is transmitted.
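The pipeline on the source side can be summarized by the following sketch. The helper calls (send_buffer_request, pack_next_chunk, wait_buffer_response, send_to_remote_buffer) are hypothetical stand-ins for the underlying libelan operations; the essential point is that the buffer request is already in flight while the packing copy runs.

/* Sketch of the HBNA PUT loop on the source node (hypothetical
 * helpers, not the actual library calls). */
#include <stddef.h>

typedef struct { int buf_index; void *dst_addr; } grant_t;

/* placeholders for the real communication primitives */
void    send_buffer_request(int dest);
size_t  pack_next_chunk(const void *src_desc, size_t offset); /* bytes packed */
grant_t wait_buffer_response(int dest);
void    send_to_remote_buffer(int dest, void *dst, size_t nbytes);

void hbna_put(const void *src_desc, size_t total_bytes, int dest)
{
    size_t done = 0;
    while (done < total_bytes) {
        send_buffer_request(dest);                  /* request  (Figure 3) */
        size_t n = pack_next_chunk(src_desc, done); /* pack, overlapped
                                                       with the round trip */
        grant_t g = wait_buffer_response(dest);     /* response (Figure 3) */
        send_to_remote_buffer(dest, g.dst_addr, n); /* send     (Figure 3) */
        done += n;
    }
}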
The successful completion of this operation has one additional piece of logic attached to it. In addition to the packed source data, the source node also needs to transmit a descriptor with information about how and where to unpack the data on the destination. Since the libelan message queuing interface only supports small messages (<2KB), the descriptor information must be transmitted separately from the data for larger message sizes. Addressing this requirement on networks (such as Elan-4) that do not provide ordered message delivery is not straightforward.

To minimize idle time for messages larger than 2KB, the destination server thread should not be awakened to unpack the data until both the descriptor and the data have completely arrived in the destination receive buffer. To solve this problem efficiently, we use an advanced hardware feature of the Quadrics NIC called DMA chaining. The DMA chaining concept is illustrated in Figure 4. We issue the DMA message with the data and link, or chain, a second message that carries the descriptor. The packed data is sent as a DMA message to the appropriate receive buffer on the destination node. Completion of that transfer sets a hardware event on the sending node. The event triggers transmission of the chained message that contains the descriptor. That message is sent to the remote message queue; upon arrival in an available queue slot, it generates an event that wakes up the destination server thread. The chaining technique we described solves an important flow control problem and provides efficient operation with minimal host CPU utilization. The host CPU is only used for packing and unpacking the data. The NIC thread processor is only involved in sending the source node information and selecting available destination receive buffers.

Figure 4: Message chaining of data and descriptor. [Source and destination NICs with thread processor, event queue and receive buffers; the packed-data DMA is chained to the descriptor message]
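The sequencing can be expressed roughly as follows; nic_dma_post, nic_event_t and nic_queue_send_on_event are invented names used for illustration, not the actual Elan-4 event/DMA API.

/* Illustrative sketch of DMA chaining: the descriptor message is armed
 * on the completion event of the data DMA, so it cannot overtake the
 * data even though the network does not order messages. */
#include <stddef.h>

typedef struct nic_event nic_event_t;   /* opaque hardware event (assumed) */

nic_event_t *nic_dma_post(int dest, void *remote_buf,
                          const void *data, size_t nbytes);
void nic_queue_send_on_event(nic_event_t *ev, int dest,
                             const void *desc, size_t desc_bytes);

void chained_put(int dest, void *remote_buf,
                 const void *packed, size_t nbytes,
                 const void *desc, size_t desc_bytes)
{
    /* 1. data DMA into the granted receive buffer; completion fires ev */
    nic_event_t *ev = nic_dma_post(dest, remote_buf, packed, nbytes);

    /* 2. descriptor goes to the remote message queue, but only after ev
     *    fires; its arrival wakes the destination server thread */
    nic_queue_send_on_event(ev, dest, desc, desc_bytes);
}

Because the descriptor can only leave after the data DMA completes, the server thread is never woken before both pieces are in place.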
5.2 NIC-Based method

For the NIC-based method we describe the design and implementation using the PUT operation on non-contiguous data. This is because the implementation of the GET RMA operation simply involves sending a message to the remote NIC requesting it to do a PUT.

In the NIC-based method, when a non-contiguous PUT operation is initiated, the source and destination descriptors are sent to the thread processor on the NIC. The thread processor then decodes the descriptors and initiates a series of contiguous data transfers. The NIC thread processor maintains a request queue (see Section 2) of limited size to do flow control for the series of data transfers it initiates. This request queue can be seen in Figure 5, which represents the basic data and functional units in the NIC-based algorithm. In Figure 5, the “current” requests are the ones currently being processed for issue. The “issued” requests are the requests that have already been submitted to the DMA engine of the NIC. The “processed” requests are the ones that have been completed. This queue works like a sliding window over the queue of DMA requests that correspond to the noncontiguous data transfer. When the queue is full, progress can be made only when a request is processed. The NIC-level code is written such that the thread processor is yielded after the non-contiguous requests are issued; thus the NIC quickly becomes available to service other requests. The thread processor CPU is a shared resource responsible for processing multiple protocols and services supported through the NIC, so yielding helps other protocols (e.g. MPI, where the thread processor is used for message tag matching) to make progress.

Figure 5: NIC based method. [Request queue on the NIC with current, issued and processed entries; user data and data descriptor on the host]
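A sketch of this sliding-window flow control is shown below; the queue depth and all names are illustrative, not those of the actual NIC firmware.

/* Illustrative sliding-window request queue for the chunk DMAs of one
 * noncontiguous transfer. "current - processed" is the window of DMAs
 * outstanding at the DMA engine. */
#define QDEPTH 4                 /* assumed window size */

typedef struct {
    int current;                 /* next chunk to issue          */
    int processed;               /* chunks whose DMA completed   */
    int total;                   /* chunks in the whole transfer */
} req_queue_t;

void issue_chunk_dma(int chunk);   /* hand one chunk to the DMA engine */
int  poll_completions(void);       /* number of newly finished DMAs    */
void yield_thread_processor(void); /* let other NIC protocols run      */

void nic_strided_put(req_queue_t *q)
{
    while (q->processed < q->total) {
        /* issue while the window is open and chunks remain */
        while (q->current < q->total &&
               q->current - q->processed < QDEPTH)
            issue_chunk_dma(q->current++);

        q->processed += poll_completions();
        yield_thread_processor();  /* NIC stays responsive to others */
    }
}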
6. Performance Evaluation

We used several strategies to evaluate our implementation of the non-contiguous RMA data transfers. Our objective was to choose experiments that expose the advantages and disadvantages of each of the methods discussed above and their impact on overall application performance. Ultimately, we found that a hybrid method that switches between the NIC-based and HBNA methods based on the message size and the shape of the non-contiguous data provides the best performance.
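Such a hybrid could be dispatched with a simple rule like the sketch below; the constants are illustrative placeholders that would have to be tuned from measurements such as those in Figure 6.

/* Hypothetical dispatch rule for the hybrid method. The text only
 * establishes that small or ragged transfers favor HBNA packing and
 * large-chunk transfers favor the NIC DMAs; both cutoffs are assumed. */
#include <stddef.h>

enum xfer_method { METHOD_NIC_BASED, METHOD_HBNA };

enum xfer_method choose_method(size_t seg_bytes, size_t nsegs)
{
    size_t total = seg_bytes * nsegs;
    if (total < 100 * 1024)        /* below the ~100KB saturation point */
        return METHOD_HBNA;        /* packing buys bandwidth            */
    if (seg_bytes < 1024)          /* many thin segments (assumed cut)  */
        return METHOD_HBNA;
    return METHOD_NIC_BASED;       /* large chunks: let the NIC DMA them */
}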
In the NIC-based method when a non-contiguous PUT
operation is initiated, the source and destination 6.1 Micro-Benchmarks
descriptors are sent to the thread processor on the NIC.
We ran a micro-benchmark to measure the point to point
The thread processor then decodes the descriptors and
performance of these methods. This micro-benchmark
initiates a series of contiguous data transfers. The NIC
involves one process doing a series of PUT operations to
thread processor maintains a request queue (see Section 2)
all other processes. The data transfers involve 2-
of a limited size to do flow control for the series of data
dimensional square array sections. We ran this benchmark
transfers it initiates. This request queue can be seen in
on 4 processes each situated on a different node. Figure 6
Figure 5 which represents basic data and functional units
shows the performance of each of these methods for
in the NIC-based algorithm. In Figure 5, the “current”
various message sizes. It can be seen from Figure 6 that
requests are the ones that are currently processed to be
although the HBNA method has some host involvement,
issued. The “issued” requests are the requests that have
for some message range, it out-performs both the NIC-
already been submitted to the DMA engine of the NIC.
based and the vendor provided methods. However the
The processed requests are the ones that have been
HBNA method may still impact the over-all application
time because it does interrupt the remote processor, which
could potentially be doing some useful computation. The
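If the benchmark is driven through the ARMCI strided interface (which the library discussed here supports), its core would look roughly like this; dimensions, the leading dimension, timing and error handling are illustrative.

/* Sketch of the 2-D strided PUT micro-benchmark: one process PUTs an
 * n x n section of doubles to every other process. dst[] is assumed to
 * hold one remote base address per process, e.g. as returned by
 * ARMCI_Malloc; ld is the leading dimension of the 2-D arrays. */
#include <armci.h>

void bench_2d_puts(double *src, double *dst[], int ld, int n,
                   int me, int nproc)
{
    int src_stride = ld * (int)sizeof(double);   /* bytes between rows */
    int dst_stride = ld * (int)sizeof(double);
    int count[2];
    count[0] = n * (int)sizeof(double);          /* contiguous bytes per row */
    count[1] = n;                                /* number of rows           */

    for (int p = 0; p < nproc; p++) {
        if (p == me) continue;
        ARMCI_PutS(src, &src_stride, dst[p], &dst_stride,
                   count, 1, p);                 /* one stride level */
    }
    ARMCI_AllFence();                            /* wait for remote completion */
}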
The primary reason for the better performance of the HBNA method here is the following. This method copies the smaller non-contiguous chunks of data into one larger chunk. Peak bandwidth for Elan-4 is attainable for strided messages over 100KB, and the attainable bandwidth increases with message size until then. This means that by copying smaller chunks into a larger chunk and then transmitting, the HBNA method may achieve better bandwidth: the cost of packing and unpacking is offset by the higher bandwidth it obtains when transmitting messages of a larger size. We later observed the same effect when analyzing the performance of the NAS BT benchmark.
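A back-of-the-envelope cost model makes the trade-off concrete. The sketch below is illustrative only: bw() stands for the measured bandwidth curve of Figure 6, COPY_BW for host memory-copy bandwidth, and the NIC-based estimate deliberately ignores pipelining (matching the “no pipelining” curve of Figure 7).

/* Illustrative cost model for transferring nchunks chunks of chunk
 * bytes each. HBNA pays two copies (pack + unpack) but ships one large
 * message; the NIC-based method ships each chunk at the bandwidth the
 * chunk size achieves on the wire. */
#include <stddef.h>

extern double bw(size_t msg_bytes);  /* measured bandwidth curve, bytes/s */
#define COPY_BW 2.0e9                /* assumed host copy bandwidth, bytes/s */

double t_hbna(size_t chunk, size_t nchunks)
{
    double total = (double)(chunk * nchunks);
    return 2.0 * total / COPY_BW     /* pack at source, unpack at destination */
         + total / bw(chunk * nchunks);
}

double t_nic_no_pipeline(size_t chunk, size_t nchunks)
{
    return (double)(chunk * nchunks) / bw(chunk);
}

/* HBNA wins whenever t_hbna < t_nic_no_pipeline, i.e. whenever the
 * bandwidth gained from the larger message outweighs the two copies. */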
6.3 Application Benchmarks – NAS BT

The NAS-BT code is one of the NAS Parallel Benchmarks (NPB). It is a simulated computational fluid dynamics application that solves systems of equations resulting from an approximately factored implicit finite-difference discretization of the three-dimensional Navier-Stokes equations. The code uses an implicit algorithm to compute a finite difference solution to the 3D compressible Navier-Stokes equations. The solution is [...]
[Figure 6: point-to-point bandwidth in MBps (log scale) of the three methods for various message sizes]

Figure 7: Performance comparisons of the NIC-based method (left) and the HBNA method (right). [Curves: NIC-Based, HBNA, Vendor Provided, calculated value for Copy+Send+Copy, and calculated NIC-based with no pipelining; x-axis: message size in bytes (0 to 2,500,000); y-axis: bandwidth in MBps]
Figure 8: BT performance for Class A and Class B. Left: one processor per node; right: two processors per node. [Panels: “Class A, 1 processor per node” and “Class A, 2 processors per node”; series: Vendor Provided, NIC Based, HBNA; x-axis: 4 to 64 processes]
[...] the entire non-contiguous data transfer; 2) the NIC-level code uses the DMA engine on the NIC to transmit multiple non-contiguous chunks of data without any further host involvement. Maximum benefit is obtained when the first stride is large enough to obtain high bandwidth from the network.

In the host-based method, the non-contiguous chunks of data are copied into a larger buffer and then transmitted. This may involve more than one communication call, depending on whether the host buffer is big enough to fit the entire user data.

When the overall message size is less than the message size necessary to obtain saturation bandwidth, copying the message into contiguous chunks and sending it may be better than the NIC-based approaches. It can be seen from Figure 8 that for the one-processor-per-node case, the host-based method does better than the other two NIC-based methods. This is because most of the messages in BT transmit rectangular data with the longer edge representing the number of segments. Such messages give better bandwidth with the host-based method, as the data is copied into larger contiguous buffers for transmission; thus saturation bandwidth may be reached sooner. The micro-benchmarks in Figure 6 likewise show the host-based method giving better bandwidth in the message range where copying and transmitting the non-contiguous data as one larger chunk pays off.

For the two-processors-per-node case, although communication time for the host-based method is less, some overhead at the application level is perceived: one of the application threads is interrupted to complete the data transfer on the remote side. The overall time spent in communication for each processor, when using all the processors on the node, is still less for the HBNA method than for the NIC-based method, but overall time spent on computation is slightly higher in the HBNA method because the user thread is interrupted to complete communication (packing/unpacking).

6.4 Application Benchmarks – SRUMMA
We use the SRUMMA parallel dense matrix multiplication [13], which relies on remote memory access communication. The matrices are decomposed into sub-matrices and distributed among processors with a 2D block distribution. Each sub-matrix is divided into chunks. Overlapping is achieved by issuing a call to get a chunk of data while computing on the previously received chunk. Figure 9 shows the performance of SRUMMA for a matrix of size 2000. It demonstrates that the performance of the algorithm is highly sensitive to the implementation of the strided interfaces. For smaller numbers of processors, where the block size is larger, the NIC-based scheme performs best. However, for larger numbers of processors the matrix block size is smaller, and thus the HBNA scheme that performs packing becomes faster.

Figure 9: Matrix multiplication for matrix size 2000. [Aggregate GFLOPs vs. number of processors (0 to 80) for the NIC Based, Vendor provided, and HBNA implementations]
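The get/compute overlap in SRUMMA follows a classic double-buffered pipeline, sketched below with stand-in names (get_chunk_nb, wait_chunk and dgemm_chunk are illustrative, not the actual SRUMMA internals):

/* Double-buffered chunk pipeline: fetch chunk i+1 with a non-blocking
 * get while multiplying chunk i. buf[0] and buf[1] are two local chunk
 * buffers; all helpers are placeholders. */
void get_chunk_nb(int chunk, double *dst);  /* non-blocking strided get */
void wait_chunk(int chunk);                 /* complete that get        */
void dgemm_chunk(const double *chunk);      /* multiply-accumulate      */

void srumma_block(double *buf[2], int nchunks)
{
    int cur = 0;
    get_chunk_nb(0, buf[cur]);              /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)
            get_chunk_nb(i + 1, buf[1 - cur]); /* prefetch next chunk */
        wait_chunk(i);                      /* chunk i has arrived */
        dgemm_chunk(buf[cur]);              /* compute while next is in flight */
        cur = 1 - cur;                      /* swap buffers */
    }
}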
7. Related work communication on Infiniband," IPDPS' 2004.
We are not aware of any prior work on exploiting the NIC for implementing efficient non-contiguous RMA data transfers. Much work has been done in the recent past describing methods of utilizing the NIC for point-to-point, one-sided and collective communication; notable examples are [1, 2]. Our previous work [3] focused on the InfiniBand network and discusses a combined technique that uses the vendor-provided non-contiguous communication interfaces together with the host to implement one-sided non-contiguous data transfers. Non-contiguous data transfers have been implemented as part of several libraries and standards. MPI [9] has provision for derived, user-defined datatypes that are able to handle non-contiguous data. ARMCI [3, 10] has support for two different kinds of non-contiguous data transfer operations. These operations are implemented on several networks, even on networks with limited or no support for non-contiguous data transfer. [11] gives a complete overview of the performance of non-contiguous data types. Methods of doing one-sided non-contiguous RMA are discussed in [12], which also discusses non-contiguous communication in the context of MPI-2 and two optimizations applied to one-sided non-contiguous MPI-2 communication. The optimization avoids a local copy of the data by sending the multiple non-contiguous chunks of data directly into the remote receive buffer. That method is comparable to the HBNA method described in this paper, the significant difference being that HBNA is not merely host-based: it uses NIC-level programming to manage access to the remote receive buffers.

8. Conclusions

The paper has introduced two schemes for optimizing noncontiguous strided remote memory access (RMA) data transfers using a programmable network interface card. The two schemes provide superior performance to the vendor-provided interfaces. Based on experimental results obtained in the context of micro-benchmarks and application kernels, it was found that the effectiveness of the two schemes depends on the message size. This observation implies that the most effective algorithm would rely on switching between them based on the message size.

9. References

[1] D. Buntinas, D. K. Panda, J. Duato, and P. Sadayappan, "Broadcast/multicast over Myrinet using NIC-assisted multidestination messages," Network-Based Parallel Computing, Proceedings, vol. 1797, pp. 115-129, 2000.
[2] D. Buntinas, D. K. Panda, and P. Sadayappan, "Performance Benefits of NIC-Based Barrier on Myrinet/GM," presented at Communication Architecture for Clusters, San Francisco, CA, 2001.
[3] V. Tipparaju, G. Santhanaraman, J. Nieplocha, and D. K. Panda, "Host-assisted zero-copy remote memory access communication on InfiniBand," IPDPS 2004.
[4] R. Thakur, W. Gropp, and E. Lusk, "A case for using MPI's derived datatypes to improve I/O performance," San Jose, CA: IEEE Computer Society, 1998.
[5] J. Beecroft, D. Addison, F. Petrini, and M. McLaren, "QsNetII: An Interconnect for Supercomputing Applications," Hot Interconnects, 2003.
[6] F. Petrini, S. Coll, E. Frachtenberg, and A. Hoisie, "Hardware- and Software-Based Collective Communication on the Quadrics Network," IEEE International Symposium on Network Computing and Applications, Boston, 2001.
[7] J. Nieplocha, E. Apra, J. Ju, and V. Tipparaju, "One-Sided Communication on Clusters with Myrinet," Cluster Computing, vol. 6, pp. 115-124, 2003.
[8] R. A. van de Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," Concurrency: Practice and Experience, vol. 9, pp. 255-274, 1997.
[9] W. Gropp and E. Lusk, User's Guide for MPICH, a Portable Implementation of MPI, 1996.
[10] J. Nieplocha and B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems," IPPS/SPDP'99, 1999.
[11] M. Ashworth, "A report on further progress in the development of codes for the CS2," 1996.
[12] J. Worringen, A. Gaer, and F. Reker, "Exploiting transparent remote memory access for non-contiguous and one-sided communication," 2002.
[13] M. Krishnan and J. Nieplocha, "SRUMMA: A matrix multiplication algorithm suitable for clusters and scalable shared memory systems," Proc. 18th International Parallel and Distributed Processing Symposium, IPDPS 2004, pp. 987-996.