Parallel Virtual File System (PVFS2) over Quadrics
[Figure: RDMA-based rendezvous protocol between send slots and recv slots, with a secondary buffer zone for short/unexpected messages and FIN/FIN_ACK completion messages. a) RDMA Write, b) RDMA Read.]
As discussed in Section 4.2, both RDMA read and RDMA write are utilized in the rendezvous protocol, which achieves zero-copy transmission of long messages. File systems such as DAFS [10] take advantage of similar RDMA-based message transmission. Typically, a receiver decides between RDMA read and RDMA write based on whether the sender is trying to read or write data: a read operation is implemented as an RDMA write from the receiver, and a write operation as an RDMA read. However, one process can potentially be overloaded with a large number of outstanding RDMA operations, which can lead to suboptimal performance due to bandwidth drop-off [5]. A basic throttling mechanism is therefore needed to control the number of concurrent outstanding RDMA operations. We introduce an adaptive throttling algorithm that takes the load information into account and balances the number of RDMA operations between the two sides. In this algorithm, a receiver gathers its own load information from the local communication state and the sender's load information from the sender's initial rendezvous packet. For fairness to multiple clients, the algorithm also considers whether one of the two processes is a server under heavy load: the client always carries out the RDMA operations when its peer is a heavily loaded server. Otherwise, the receiver uses RDMA read to pull the message when it is less loaded than the sender, and RDMA write when it is not. Fig. 6 shows a flow chart of the detailed algorithm.
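To make the decision concrete, the following C sketch mirrors the flow chart in Fig. 6. It is an illustration of the logic only: the types and field names (peer_info, is_server, under_heavy_load, load) are hypothetical and are not part of the PVFS2 or Quadrics interfaces.

    /* Illustrative sketch of the adaptive RDMA decision in Fig. 6.
     * All names are hypothetical; only the decision logic follows the text. */
    enum rdma_choice { USE_RDMA_READ, USE_RDMA_WRITE };

    struct peer_info {
        int is_server;        /* is this endpoint a PVFS2 server?            */
        int under_heavy_load; /* heavy-load flag derived from local state    */
        int load;             /* e.g. number of outstanding RDMA operations  */
    };

    /* Run by the receiver when the sender's initial rendezvous packet,
     * which carries the sender's load information, arrives. */
    static enum rdma_choice
    choose_rdma(const struct peer_info *receiver, const struct peer_info *sender)
    {
        /* Fairness: a heavily loaded server never issues the RDMA itself;
         * the client side carries out the data movement instead. */
        if (receiver->is_server && receiver->under_heavy_load)
            return USE_RDMA_WRITE;   /* the client (sender) pushes the data   */
        if (sender->is_server && sender->under_heavy_load)
            return USE_RDMA_READ;    /* the client (receiver) pulls the data  */

        /* Otherwise balance by load: the less loaded receiver pulls. */
        return (receiver->load < sender->load) ? USE_RDMA_READ : USE_RDMA_WRITE;
    }

Because the load information travels in the rendezvous packet itself, the receiver can evaluate this condition locally without any extra control message.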
[Fig. 6: Adaptive RDMA Algorithm (flow chart). Decisions: "A server under heavy load?", "The server is a receiver?", "receiver_load < sender_load"; each path ends in either RDMA Read or RDMA Write.]
5.2. Optimizing Completion Notification

Quadrics communication models provide event mechanisms for both local and remote completion notification. In particular, these mechanisms can be used to signal message completion along with RDMA read/write operations. In the rendezvous protocol, as long as the control information contained in the last control message is available to the remote process, the completion of a full message can be safely notified through an enabled remote event. We have designed this as an optimization to the rendezvous protocol. A sender process allocates a completion event and encodes the address of this event in the first rendezvous message. When the receiver pulls the message via RDMA read, it also triggers a remote event to the sender using the provided event address. Similarly, in the case of RDMA write, the receiver provides the address of such an event in its acknowledgment to the sender, and it detects the completion of the full message through the remote event triggered by the RDMA write operation.
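As an illustration of this optimization, the first rendezvous message only needs to carry the NIC address of the sender's completion event so that the receiver can chain a remote-event trigger to its RDMA read. The sketch below is hypothetical: the message layout and the helper post_rdma_read_with_remote_event stand in for the actual PVFS2 message format and Elan4 library calls.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical layout of the first rendezvous control message. */
    struct rndv_ctrl_msg {
        uint64_t src_addr;         /* sender-side source buffer address        */
        uint32_t length;           /* total message length                     */
        uint64_t completion_event; /* NIC address of the sender's Elan4 event  */
    };

    /* Hypothetical helper: issue one RDMA read and arm a remote event at the
     * given NIC address once the read has completed (illustration only). */
    extern void post_rdma_read_with_remote_event(uint64_t src, void *dst,
                                                 size_t len, uint64_t remote_event);

    /* Receiver side: pull the payload via RDMA read; the chained remote event
     * notifies the sender of completion without an extra FIN/FIN_ACK message. */
    static void recv_long_message(const struct rndv_ctrl_msg *ctrl, void *dst)
    {
        post_rdma_read_with_remote_event(ctrl->src_addr, dst, ctrl->length,
                                         ctrl->completion_event);
    }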
5.3. Zero-Copy Non-Contiguous IO with SEAMUR

Non-contiguous I/O access is the main access pattern in scientific applications. Thakur et al. [25] noted that providing native support of noncontiguous access in file systems is important for achieving high-performance MPI-IO. PVFS2 provides list I/O for structured non-contiguous IO in scientific applications. List IO can be built on top of an interconnect's native scatter/gather support if it is available; otherwise, it often resorts to memory packing and unpacking to convert non-contiguous communication into contiguous communication. An alternative is to use multiple contiguous operations, but this approach requires multiple send and receive operations from both the sender and the receiver, leading to more processing of smaller communication operations and degraded performance.

Quadrics provides non-contiguous communication operations in the form of elan_putv and elan_getv. However, these operations are specifically designed for the shared memory programming model (SHMEM) over Quadrics: the final placement of the data requires a memory copy from the global shared memory to the application destination memory. To support zero-copy non-contiguous IO, we propose a Single Event Associated MUltiple RDMA (SEAMUR) mechanism.

Fig. 7 shows a diagram of the RDMA write-based SEAMUR mechanism. The source and destination memory address/length pairs of the IO fragments are first collected by the process that is to initiate the RDMA operations, in this case the sender. SEAMUR is then carried out in four steps. In Step 1, the sender determines the number of required contiguous RDMA write operations, N, and constructs the same number of RDMA descriptors in host memory; a single Elan4 event is also created to wait on the completion of these N RDMA operations. In Step 2, the RDMA write descriptors are posted together into the Quadrics Elan4 command port (a command queue to the NIC formed by memory-mapped, user-accessible NIC memory) through programmed IO. In Step 3, multiple RDMA write operations are triggered to perform the zero-copy non-contiguous IO from the source to the destination memory. In Step 4, upon the completion of the multiple RDMA operations, the earlier associated Elan4 event is triggered, which in turn notifies the host process through a host-side event. The remote side can be notified through a separate chained message as described in Section 4.2, or simply through a remote event as described in Section 5.2. Note that with this approach, multiple RDMA operations are issued without calling extra sender or receiver routines, and zero-copy is achieved by directly addressing the source and destination memory. If RDMA read is chosen by the adaptive rendezvous protocol, similar zero-copy non-contiguous support is achieved by issuing multiple RDMA read operations, all chained to another Elan4 event of count N.
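The essence of SEAMUR is that all N descriptors reference one Elan4 event whose count is N, so the host receives a single notification for the whole non-contiguous transfer. The following C sketch shows this structure under stated assumptions: rdma_desc, alloc_event_with_count, post_descriptors, and wait_event are illustrative stand-ins, not the actual Elan4 command-port interfaces.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical descriptor for one contiguous RDMA write (one IO fragment). */
    struct rdma_desc {
        uint64_t src;   /* local source address of the fragment         */
        uint64_t dst;   /* remote destination address of the fragment   */
        size_t   len;   /* fragment length                              */
        uint64_t event; /* Elan4 event to fire when this DMA completes  */
    };

    /* Illustrative stand-ins for the NIC command-port interface. */
    extern uint64_t alloc_event_with_count(int count);            /* fires after 'count' completions   */
    extern void     post_descriptors(struct rdma_desc *d, int n); /* programmed IO to the command port */
    extern void     wait_event(uint64_t event);                   /* host-side wait on the event       */

    /* RDMA write-based SEAMUR: N fragments, one chained completion event. */
    void seamur_write(const uint64_t *src, const uint64_t *dst,
                      const size_t *len, int n)
    {
        struct rdma_desc desc[n];
        uint64_t done = alloc_event_with_count(n);   /* Step 1: one event for N DMAs     */

        for (int i = 0; i < n; i++) {
            desc[i].src   = src[i];
            desc[i].dst   = dst[i];
            desc[i].len   = len[i];
            desc[i].event = done;                    /* every descriptor chains to it     */
        }
        post_descriptors(desc, n);                   /* Step 2: post via the command port */
                                                     /* Step 3: the NIC performs N writes */
        wait_event(done);                            /* Step 4: single host notification  */
    }

An RDMA read-based variant would build read descriptors instead, chained to another event of count N, as described above.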
6. Implementation

With the design of the client/server connection model and the transport layer over Quadrics communication mechanisms, we have implemented PVFS2 over Quadrics/Elan4. The implementation is based on the recent release of PVFS2-1.1-pre1. Due to a compatibility issue between PVFS2 and the Quadrics RedHat Linux kernel distribution, we ...
[Fig. 7: RDMA write-based SEAMUR. RDMA descriptors and a host event are built in host memory (Step 1), posted to the Elan4 command port (Step 2), N RDMA writes move data from source memory to destination memory (Step 3), and the chained event fires on completion (Step 4).]

Table 2: Network Performance over Quadrics

    Operations              Latency     Bandwidth
    TCP/EIP                 23.92 µs    482.26 MB/s
    Quadrics RDMA/Write      1.93 µs    910.1 MB/s
    Quadrics RDMA/Read       3.19 µs    911.1 MB/s
    Quadrics QDMA            3.02 µs    368.2 MB/s
[Fig. 10: Performance of MPI-Tile-IO Benchmark. Bandwidth (MB/s, 0 to 300) for Write W/T, Read W/T, Write W/N, and Read W/N.]

7.5. Performance of NAS BT-IO

The BT-IO benchmarks are developed at NASA Ames Research Center, based on the Block-Tridiagonal problem of the NAS Parallel Benchmark suite. These benchmarks test the parallel IO capability of high-performance computing applications. The entire data set undergoes complex decomposition and partitioning and is eventually distributed among many processes; more details are available in [26]. The BT-IO problem size class A is evaluated. We have also evaluated the performance of BT-IO with the same version of PVFS2 built on top of Myrinet/GM. The Myrinet experiment is conducted on the same 8-node cluster. All nodes are equipped with two-port LANai-XP cards connected to a Myrinet 2000 network. We have used four of the eight nodes as server nodes and the other four as client nodes.

Table 4 shows the comparison of BT-IO performance over PVFS2/Elan4 and PVFS2/TCP on top of the Quadrics interconnect, and that of PVFS2/GM over Myrinet. The performance of the basic BT benchmark is measured as the time of the BT-IO benchmark without IO accesses. On the same Quadrics network, the BT-IO benchmark incurs only 2.12 seconds of extra IO time when accessing a PVFS2 file system provided by this implementation, but 5.38 seconds when accessing a PVFS2 file system with the TCP-based implementation. The IO time of BT-IO is thus reduced by about 60% ((5.38 - 2.12)/5.38 ≈ 61%) with our Quadrics/Elan4-based implementation compared to the TCP-based implementation. Compared to the PVFS2 implementation over Myrinet/GM, this Elan4-based implementation also reduces the IO time of BT-IO. This is because the bandwidth of Quadrics is higher than that of Myrinet 2000, which is about 500 MB/s with two-port LANai cards. These results suggest that our implementation can indeed enable applications to leverage the performance benefits of Quadrics/Elan4 for efficient file IO accesses.

8. Related Work

Previous research has studied the benefits of using user-level communication protocols to parallelize IO access to storage servers. Zhou et al. [31] have studied the benefits of VIA networks in database storage. Wu et al. [27] have described their work on InfiniBand over PVFS1 [20]. DeBergalis et al. [10] have further described a file system, DAFS, built on top of networks with VIA-like semantics. Our work is designed for Quadrics interconnects over PVFS2 [1].

Models that support client/server communication and provide generic abstractions for the transport layer have been described over different networks [30, 17, 6]. Yu et al. [29] have described the design of a dynamic process model over Quadrics for MPI-2. Our work explores ways to overcome Quadrics' static process/communication model and to optimize the transport protocols with Quadrics event mechanisms for PVFS2. Ching et al. [7] have implemented list IO in PVFS1 and evaluated its performance over TCP/IP on Fast Ethernet. Wu et al. [28] have studied the benefits of leveraging InfiniBand hardware scatter/gather operations to optimize non-contiguous IO access in PVFS1. Our work exploits a communication mechanism with a single event chained to multiple RDMA operations to support zero-copy non-contiguous network IO over Quadrics.

9. Conclusions

In this paper, we have examined the feasibility of designing a parallel file system over Quadrics [23] to take advantage of its user-level communication and RDMA operations. PVFS2 [1] is used as the parallel file system platform in this work. The challenging issues in supporting PVFS2 on top of Quadrics interconnects are identified, and strategies have been designed to overcome these challenges, such as constructing a client-server connection model, designing the PVFS2 transport layer over Quadrics RDMA read and write, and providing efficient non-contiguous network IO support. The performance of our implementation is compared to that of PVFS2/TCP over the Quadrics IP implementation. Our experimental results indicate that the performance of PVFS2 can be significantly improved with Quadrics user-level protocols and RDMA capabilities. Compared to PVFS2/TCP on top of the Quadrics IP implementation, our implementation improves the aggregated read performance by more than 140%. It also delivers significant improvements in IO access time for application benchmarks such as mpi-tile-io [24] and BT-IO [26]. To the best of our knowledge, this is the first high-performance design and implementation of a user-level parallel file system, PVFS2, over Quadrics interconnects.

In the future, we intend to leverage more features of Quadrics to support PVFS2 and to study their possible benefits to different aspects of the parallel file system. For example, we intend to study the feasibility of offloading PVFS2 communication-related processing onto the Quadrics programmable network interface to free up more host CPU computation power for disk IO operations. We also intend to study the benefits of integrating Quadrics NIC memory into the PVFS2 memory hierarchy, such as data caching with client- and/or server-side NIC memory.
Acknowledgment

We gratefully acknowledge Dr. Pete Wyckoff from the Ohio Supercomputing Center and Dr. Jiesheng Wu from Ask Jeeves, Inc. for many technical discussions. We would like to thank the members of the PVFS2 team for their technical help. Furthermore, we would also like to thank Drs. Daniel Kidger and David Addison from Quadrics, Inc. for their valuable technical support.

10. References

[1] The Parallel Virtual File System, version 2. http://www.pvfs.org/pvfs2.
[2] The Public Netperf Homepage. http://www.netperf.org/netperf/NetperfPage.html.
[3] J. Beecroft, D. Addison, F. Petrini, and M. McLaren. QsNetII: An Interconnect for Supercomputing Applications. In Proceedings of Hot Chips '03, Stanford, CA, August 2003.
[4] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29–36, 1995.
[5] D. Bonachea, C. Bell, P. Hargrove, and M. Welcome. GASNet 2: An Alternative High-Performance Communication Interface, November 2004.
[6] P. H. Carns, W. B. Ligon III, R. Ross, and P. Wyckoff. BMI: A Network Abstraction Layer for Parallel I/O, April 2005.
[7] A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp. Noncontiguous I/O through PVFS. In Proceedings of the IEEE International Conference on Cluster Computing, Chicago, IL, September 2002.
[8] Cluster File System, Inc. Lustre: A Scalable, High Performance File System. http://www.lustre.org/docs.html.
[9] D. Nagle, D. Serenyi, and A. Matthews. The Panasas ActiveScale Storage Cluster – Delivering Scalable High Bandwidth Storage. In Proceedings of Supercomputing '04, November 2004.
[10] M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T. Talpey, and M. Wittle. The Direct Access File System. In Proceedings of the Second USENIX Conference on File and Storage Technologies (FAST '03), 2003.
[11] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: An Engineering Approach. The IEEE Computer Society Press, 1997.
[12] J. Huber, C. L. Elford, D. A. Reed, A. A. Chien, and D. S. Blumenthal. PPFS: A High Performance Portable Parallel File System. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 385–394, Barcelona, Spain, July 1995. ACM Press.
[13] IBM Corp. IBM AIX Parallel I/O File System: Installation, Administration, and Use. Document Number SH34-6065-01, August 1995.
[14] InfiniBand Trade Association. http://www.infinibandta.org.
[15] Intel Scalable Systems Division. Paragon System User's Guide, May 1995.
[16] R. Latham, R. Ross, and R. Thakur. The Impact of File Systems on MPI-IO Scalability. In Proceedings of the 11th European PVM/MPI Users' Group Meeting (Euro PVM/MPI 2004), pages 87–96, September 2004.
[17] J. Liu, M. Banikazemi, B. Abali, and D. K. Panda. A Portable Client/Server Communication Middleware over SANs: Design and Performance Evaluation with InfiniBand. In SAN-02 Workshop (in conjunction with HPCA), February 2003.
[18] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 1997.
[19] N. Nieuwejaar and D. Kotz. The Galley Parallel File System. Parallel Computing, (4):447–476, June 1997.
[20] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A Parallel File System for Linux Clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Atlanta, GA, October 2000.
[21] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, IL, 1988.
[22] F. Petrini, W.-C. Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics Network: High Performance Clustering Technology. IEEE Micro, 22(1):46–57, January-February 2002.
[23] Quadrics, Inc. Quadrics Linux Cluster Documentation.
[24] R. B. Ross. Parallel I/O Benchmarking Consortium. http://www-unix.mcs.anl.gov/rross/pio-benchmark/html/.
[25] R. Thakur, W. Gropp, and E. Lusk. On Implementing MPI-IO Portably and with High Performance. In Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pages 23–32. ACM Press, May 1999.
[26] P. Wong and R. F. Van der Wijngaart. NAS Parallel Benchmarks I/O Version 2.4. Technical Report NAS-03-002, Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division.
[27] J. Wu, P. Wyckoff, and D. K. Panda. PVFS over InfiniBand: Design and Performance Evaluation. In Proceedings of the International Conference on Parallel Processing '03, Kaohsiung, Taiwan, October 2003.
[28] J. Wu, P. Wyckoff, and D. K. Panda. Supporting Efficient Noncontiguous Access in PVFS over InfiniBand. In Proceedings of Cluster Computing '03, Hong Kong, December 2003.
[29] W. Yu, T. S. Woodall, R. L. Graham, and D. K. Panda. Design and Implementation of Open MPI over Quadrics/Elan4. In Proceedings of the International Parallel and Distributed Processing Symposium '05 (IPDPS '05), Denver, CO, April 2005.
[30] R. Zahir. Lustre Storage Networking Transport Layer. http://www.lustre.org/docs.html.
[31] Y. Zhou, A. Bilas, S. Jagannathan, C. Dubnicki, J. F. Philbin, and K. Li. Experiences with VI Communication for Database Storage. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 257–268. IEEE Computer Society, 2002.