Abstract
This paper provides information about Lustre™ networking that can be used to plan cluster file system deployments
for optimal performance and scalability. The paper includes information on Lustre message passing, Lustre Network
Drivers, and routing in Lustre networks, and describes how these features can be used to improve cluster storage
management. The final section of this paper describes new Lustre networking features that are currently under
consideration or planned for future release.
Table of Contents
Challenges in Cluster Networking
Lustre Networking — Architecture and Current Features
Applications of LNET
    Remote direct memory access (RDMA) and LNET
    Using LNET to implement a site-wide or global file system
    Using Lustre over wide area networks
    Using Lustre routers for load balancing
Anticipated Features in Future Releases
Conclusion
Chapter 1
Challenges in Cluster Networking
Lustre networking (LNET) provides features that address many of the challenges inherent in cluster networking. Chapter 2 provides an overview of the key features of the LNET architecture. Chapter 3 discusses how these features can be used in specific high-performance computing (HPC) networking applications. Chapter 4 looks at how LNET is expected to evolve to enhance load balancing, quality of service (QoS), and high availability in networks on a local and global scale. Chapter 5 provides a brief summary.
Chapter 2
Lustre Networking — Architecture and
Current Features
The LNET architecture comprises a number of key features that can be used to simplify
and enhance HPC networking.
LNET architecture
The LNET architecture has evolved through extensive research into a set of protocols
and application programming interfaces (APIs) to support high-performance, high-
availability file systems. In a cluster with a Lustre file system, the system network is
the network connecting the servers and the clients.
LNET is used only over the system network, where it provides all the communication infrastructure required by the Lustre file system. The disk storage in a Lustre file system is connected to metadata servers (MDSs) and object storage servers (OSSs) using traditional storage area network (SAN) technologies. However, this SAN does not extend to the Lustre client systems and typically does not require SAN switches.
Figure 1 shows how these network features are implemented in a cluster deployed
with LNET.
Figure 1. A cluster deployed with LNET. Lustre clients (1 to 100,000s) on Elan, Myrinet, InfiniBand, and GigE networks, some connected through routers, reach metadata servers (MDSs) with disk storage containing metadata targets (MDTs) and object storage servers (OSSs, 1 to 1,000s) with object storage targets (OSTs). Multiple network types are supported simultaneously, and shared storage on enterprise-class storage arrays and SAN fabric enables failover.
LNET is implemented using layered software modules. The file system uses a remote
procedure API with interfaces for recovery and bulk transport. This API, in turn, uses
the LNET Message Passing API, which has its roots in the Sandia Portals message
passing API, a well-known API in the HPC community.
The LNET architecture supports pluggable drivers to provide support for multiple
network types individually or simultaneously, similar in concept to the Sandia Portals
network abstraction layer (NAL). The drivers, called Lustre Network Drivers (LNDs), are
loaded into the driver stack, with one LND for each network type in use. Routing between different networks is also possible. This capability was implemented early in the Lustre product cycle to provide a key customer, Lawrence Livermore National Laboratory (LLNL), with a site-wide file system (discussed in more detail in Chapter 3, Applications of LNET).
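To make the pluggable-driver concept concrete, the sketch below models how a message-passing layer could dispatch each send to the driver registered for the destination's network type. This is illustrative Python only, not the kernel-level C implementation; the names (LNet, LND, register_lnd, put) are invented for this example.

# Illustrative sketch only: pluggable Lustre Network Drivers (LNDs) behind one
# message-passing layer. The real implementation is kernel C code.

class LND:
    """Base class for a driver that handles one network type."""
    def send(self, src_nid: str, dst_nid: str, payload: bytes) -> None:
        raise NotImplementedError

class SockLND(LND):       # stands in for a TCP/IP (e.g., GigE) driver
    def send(self, src_nid, dst_nid, payload):
        print(f"tcp send {src_nid} -> {dst_nid} ({len(payload)} bytes)")

class IBLND(LND):         # stands in for an InfiniBand driver
    def send(self, src_nid, dst_nid, payload):
        print(f"ib send {src_nid} -> {dst_nid} ({len(payload)} bytes)")

class LNet:
    """Dispatches each message to the LND loaded for its network type."""
    def __init__(self):
        self._lnds = {}                       # network type -> driver

    def register_lnd(self, net_type: str, lnd: LND) -> None:
        self._lnds[net_type] = lnd            # one LND per network type in use

    def put(self, src_nid: str, dst_nid: str, payload: bytes) -> None:
        net_type = dst_nid.split("@", 1)[1].rstrip("0123456789")
        self._lnds[net_type].send(src_nid, dst_nid, payload)

lnet = LNet()
lnet.register_lnd("tcp", SockLND())
lnet.register_lnd("ib", IBLND())
lnet.put("192.168.1.1@tcp0", "192.168.1.9@tcp0", b"hello")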
Figure 2 shows how the software modules and APIs are layered.
Figure 2. The layered LNET software modules and APIs.
A Lustre network is a set of configured interfaces on nodes that can send traffic directly
from one interface on the network to another. In a Lustre network, configured interfaces
are named using network identifiers (NIDs). The NID is a string that has the form
<address>@<type><network id>. Examples of NIDs are 192.168.1.1@tcp0, designating
an address on the 0th Lustre TCP network, and 4@elan8, designating address 4 on the
8th Lustre Elan network.
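As a purely illustrative aid (not part of any Lustre API), the following Python fragment splits a NID string of the form <address>@<type><network id> into its parts:

import re

# Illustrative only: parse a NID such as 192.168.1.1@tcp0 or 4@elan8.
NID_PATTERN = re.compile(r"^(?P<address>[^@]+)@(?P<type>[a-z]+)(?P<net>\d*)$")

def parse_nid(nid: str):
    """Return (address, network type, network number) for a NID string."""
    match = NID_PATTERN.match(nid)
    if match is None:
        raise ValueError(f"not a valid NID: {nid!r}")
    net = int(match.group("net")) if match.group("net") else 0   # treat a missing number as 0 for this example
    return match.group("address"), match.group("type"), net

print(parse_nid("192.168.1.1@tcp0"))   # ('192.168.1.1', 'tcp', 0)
print(parse_nid("4@elan8"))            # ('4', 'elan', 8)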
The LNDs that support these networks are pluggable modules for the LNET
software stack.
When more than one Lustre network is present, LNET can route traffic between networks
using routing nodes in the network. An example of this is shown in Figure 3, where one
of the routers is also an OSS. If multiple routers are present between a pair of networks,
they offer both load balancing and high availability through redundancy.
Figure 3. Routing between Lustre networks. Elan clients and an MDS on an Elan network reach TCP clients on an Ethernet network through router nodes, one of which (132.6.1.2 on the Elan side, 192.168.0.2 on the TCP side) is also an OSS.
When multiple interfaces of the same type are available, load balancing traffic across all links becomes important. If the underlying network software for the network type supports interface bonding, resulting in one address, then LNET can rely on that mechanism. Such interface bonding is available for IP networks and Elan4, but not presently for InfiniBand.
If the network does not provide channel bonding, Lustre networks can help. Each of
the interfaces is placed on a separate Lustre network. The clients on each of these
Lustre networks together can utilize all server interfaces. This configuration also
provides static load balancing.
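The static load balancing just described can be pictured with a small sketch; the addresses, network names, and assignment helper below are hypothetical and only illustrate how clients split across two Lustre networks collectively use both server interfaces.

# Illustrative sketch: each server interface sits on its own Lustre network,
# and clients are statically divided between the networks.

server_interfaces = {          # Lustre network -> server interface (hypothetical NIDs)
    "o2ib0": "10.0.0.3@o2ib0",
    "o2ib1": "10.0.0.4@o2ib1",
}

clients = [f"client{i:02d}" for i in range(8)]
networks = list(server_interfaces)

# Static round-robin assignment of clients to Lustre networks.
assignment = {c: networks[i % len(networks)] for i, c in enumerate(clients)}

for client, net in assignment.items():
    print(f"{client} -> {net} -> {server_interfaces[net]}")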
Additional features that may be developed in future releases to allow LNET to manage multiple network interfaces even more effectively are discussed in Chapter 4, Anticipated Features in Future Releases.
Figure 4 shows how a Lustre server with several server interfaces can be configured to
provide load balancing for clients placed on more than one Lustre network. At the top,
two Lustre networks are configured as one physical network using a single switch. At
the bottom, they are configured as two physical networks using two switches.
Figure 4. A Lustre server with multiple network interfaces (vib0, vib1) offering load balancing to the cluster, with the two Lustre network rails configured as one physical network using a single switch (top) or as two physical networks using two switches (bottom).
Chapter 3
Applications of LNET
LNET provides great versatility for deployments. A few of the opportunities are described in this chapter.

Remote direct memory access (RDMA) and LNET
The Internet Wide Area RDMA Protocol (iWARP), developed by the RDMA Consortium, is an extension to TCP/IP that supports RDMA over TCP/IP networks. Linux supports the iWARP protocol using the OpenFabrics Alliance (OFA) code and interfaces. The LNET OFA LND supports iWARP as well as InfiniBand.
Using LNET to implement a site-wide or global file system
Site-wide file systems are typically desirable in HPC centers where many clusters exist on different high-speed networks. It is usually not easy to extend such networks or to connect them to other networks; LNET routing makes this possible.
Figure 5. A site-wide file system. Clients in multiple clusters connect through switches and Lustre routers to an IP network, which reaches a storage island containing the MDS and the storage network.
The benefits of site-wide and global file systems are not to be underestimated. Traditional data management for multiple clusters frequently involves staging data from the file system of one cluster to another. By deploying a site-wide Lustre file system, multiple copies of the data are no longer needed, and substantial savings can be achieved through improved storage management and reduced capacity requirements.
Using Lustre over wide area networks
Routers can also be used advantageously to connect servers distributed over a WAN. For example, a single Lustre cluster may consist of two widely separated groups of Lustre servers and clients, with each group interconnected by an InfiniBand network. As shown in Figure 6, Lustre routing nodes can be used to connect the two groups of Lustre servers and clients via an IP-based WAN. Alternatively, the servers could have both an InfiniBand and an Ethernet interface, but this configuration may require more ports on switches, so the routing solution may be more cost effective.
Figure 6. Two groups of Lustre servers and clients, each interconnected by InfiniBand at Location A and Location B, connected by IP routers over a WAN.
Using Lustre routers for load balancing
With a Lustre network, the purchase of costly switches to connect every client directly to a fast server network can be avoided. For a more cost-effective solution, two separate networks can be created. A smaller, faster network contains the servers and a set of router nodes with sufficient aggregate throughput. A second client network with slower interfaces contains all the client nodes and is also attached to the router nodes. If this second network already exists and has sufficient free ports to add the Lustre router nodes, no changes to this client network are required. Figure 7 shows an installation with this configuration.
Figure 7. An installation combining slow and fast networks using Lustre routers, with clients on a GigE switch reaching servers on a 10GigE switch through router nodes.
The routers provide a redundant, load-balanced path between the clients and the servers. This network configuration allows many clients together to use the full bandwidth of a server, even if individual clients have insufficient network bandwidth to do so. Because multiple routers stream data to the server network simultaneously, the server network can see data throughput in excess of what a single router can deliver.
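The following sketch (invented names, not Lustre's actual router selection logic) illustrates the effect: client traffic is spread round-robin across the routers, a router that is down is skipped, and the bytes arriving at the server network are the sum of what all live routers forward.

# Illustrative sketch: redundant, load-balanced routers between two networks.
import itertools

class Router:
    def __init__(self, name, alive=True):
        self.name, self.alive, self.forwarded = name, alive, 0

routers = [Router("rtr1"), Router("rtr2", alive=False), Router("rtr3")]
cycle = itertools.cycle(routers)

def send_via_routers(nbytes):
    """Round-robin across routers, skipping any that have failed."""
    for _ in range(len(routers)):
        router = next(cycle)
        if router.alive:
            router.forwarded += nbytes
            return router.name
    raise RuntimeError("no live router between the client and server networks")

for _ in range(6):                          # six client writes of 1 MB each
    send_via_routers(1_000_000)

for r in routers:
    state = "up" if r.alive else "down"
    print(f"{r.name} ({state}): forwarded {r.forwarded} bytes")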
Chapter 4
Anticipated Features in Future Releases
LNET offers many features today, and enhancements and new features are planned for future releases. Some possible new features include support for multiple network interfaces, implementation of server-driven quality-of-service (QoS) guarantees, asynchronous I/O, and a control interface for routers.
Support for multiple network interfaces
As described in Chapter 2, LNET currently relies on interface bonding, or on placing interfaces on separate Lustre networks, to use multiple interfaces of the same type. It is Sun's intention to address these limitations with the following design. First, LNET will virtualize multiple interfaces and offer the aggregate as one NID to the users of the LNET API. In concept, this is quite similar to the aggregation (also referred to as bonding or trunking) of Ethernet interfaces using protocols such as 802.3ad Dynamic Link Aggregation. The key features that a future LNET release may offer are:
• Load balancing: All links are used based on availability of throughput capacity.
• Link-level high availability: If one link fails, the other channels transparently
continue to be used for communication.
Figure 8. Link aggregation between clients and servers. With all links available, traffic is evenly balanced across links and switches; when a link fails, all traffic is accommodated on the remaining link without server failover.
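The sketch below (names invented; not the proposed LNET implementation) illustrates both behaviors: each send is steered to the aggregated link with the most spare capacity, and a failed link is excluded so communication continues without server failover.

# Illustrative sketch: several physical links presented as one aggregated NID.
class Link:
    def __init__(self, name, capacity_mbps):
        self.name, self.capacity, self.load, self.up = name, capacity_mbps, 0.0, True

    def headroom(self):
        return self.capacity - self.load if self.up else float("-inf")

class AggregatedNid:
    """Balances traffic across its links and survives single-link failures."""
    def __init__(self, nid, links):
        self.nid, self.links = nid, links

    def send(self, mbps):
        link = max(self.links, key=Link.headroom)   # link with most spare capacity
        if not link.up:
            raise RuntimeError("all links are down")
        link.load += mbps
        return link.name

nid = AggregatedNid("10.0.0.3@agg0", [Link("ib0", 1000), Link("ib1", 1000)])
print([nid.send(200) for _ in range(4)])   # load balancing: alternates ib0, ib1
nid.links[0].up = False                    # link failure on ib0...
print([nid.send(200) for _ in range(2)])   # ...ib1 transparently carries the traffic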
Server-driven QoS
QoS is often a critical issue, for example, when multiple clusters are competing for
bandwidth from the same storage servers. A primary QoS goal is to avoid overwhelming
server systems with conflicting demands from multiple clusters or systems, resulting in
performance degradation for all clusters. Setting and enforcing policies is one way to
avoid this.
For example, a policy can be established that guarantees that a certain minimal bandwidth is allocated to resources that must respond in real time, such as for visualization.
Or a policy can be defined that gives systems or clusters doing mission-critical work
priority for bandwidth over less important clusters or systems. The Lustre QoS system’s
role is not to determine an appropriate set of policies but to provide capabilities that
allow policies to be defined and enforced.
Two components proposed for the Lustre QoS scheduler are a global Epoch Handler
(EH) and a Local Request Scheduler (LRS). The EH provides a shared time slice among
all servers. This time slice can be relatively large (one second, for example) to avoid
overhead due to excessive server-to-server networking and latency. The LRS is responsible
for receiving and queuing requests according to a local policy. The EH and LRS together
allow all servers in a cluster to execute the same policy during the same time slice.
Note that the policy may subdivide the time slices and use the subdivision advantageously. The LRS also provides summary data to the EH to support global knowledge and adaptation.
Figure 9 shows how these features can be used to schedule rendering and visualization
of streaming data. In this implementation, LRS policy allocates 30 percent of each
Epoch time slice to visualization and 70 percent to rendering.
Figure 9. Using server-driven QoS to schedule video rendering and visualization. Epoch messaging among the OSS nodes keeps them synchronized, and each epoch time slice is divided between rendering and visualization.
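A toy sketch of the EH/LRS interplay (invented classes, not the proposed Lustre scheduler), using the 30/70 split from the example above: within each one-second epoch, a server's local request scheduler spends 30 percent of the slice serving visualization requests and 70 percent serving rendering requests.

# Illustrative sketch: epoch-based, server-driven QoS with a local request scheduler.
from collections import deque

POLICY = {"visualization": 0.3, "rendering": 0.7}   # share of each epoch time slice
EPOCH_SECONDS = 1.0
COST_PER_REQUEST = 0.05                              # assumed service time per request

class LocalRequestScheduler:
    def __init__(self):
        self.queues = {cls: deque() for cls in POLICY}

    def enqueue(self, cls, request):
        self.queues[cls].append(request)

    def run_epoch(self, epoch):
        # All servers would run the same policy during the same global epoch.
        for cls, share in POLICY.items():
            budget = share * EPOCH_SECONDS
            served = 0
            while self.queues[cls] and budget >= COST_PER_REQUEST:
                self.queues[cls].popleft()
                budget -= COST_PER_REQUEST
                served += 1
            print(f"epoch {epoch}: served {served} {cls} requests")

lrs = LocalRequestScheduler()
for i in range(40):
    lrs.enqueue("rendering", f"render-{i}")
for i in range(20):
    lrs.enqueue("visualization", f"vis-{i}")
for epoch in (1, 2):
    lrs.run_epoch(epoch)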
Lustre Router Control Plane
Features are also being considered for a Lustre Router Control Plane, a control interface for routers in large routed networks. For example, suppose data packets are being routed both from A to B and from C to D, and, for operational reasons, a preference needs to be given to routing the packets from C to D. The control plane would apply a policy to the routers so that packets are sent from C to D before packets are sent from A to B.
The Lustre Router Control Plane may also include the capability to provide input to a
server-driven QoS subsystem, linking router policies with server policies. It might be
particularly interesting to have an interface between the server-driven QoS subsystem
and the router control plane to allow coordinated adjustment of QoS in a cluster and a
wide area network.
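A small sketch of the preference just described (a hypothetical policy table, not a Lustre interface): a router drains its queue in priority order, so packets for the preferred C-to-D flow leave before packets for the A-to-B flow.

# Illustrative sketch: a router applying a control-plane policy that prefers one flow.
import heapq

FLOW_PRIORITY = {("C", "D"): 0, ("A", "B"): 1}   # lower value = forwarded first

queue, seq = [], 0

def enqueue(src, dst, packet):
    global seq
    heapq.heappush(queue, (FLOW_PRIORITY.get((src, dst), 9), seq, packet))
    seq += 1                                     # keeps arrival order within a flow

enqueue("A", "B", "A->B packet 1")
enqueue("C", "D", "C->D packet 1")
enqueue("A", "B", "A->B packet 2")
enqueue("C", "D", "C->D packet 2")

while queue:                                     # C->D packets drain before A->B packets
    _, _, packet = heapq.heappop(queue)
    print("forward", packet)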
Asynchronous I/O
In large compute clusters, the potential exists for significant I/O optimization. When a
client writes large amounts of data, a truly asynchronous I/O mechanism would allow
the client to register the memory pages that need to be written for RDMA and allow
the server to transfer the data to storage without causing interrupts on the client. This
makes the client CPU fully available to the application again, which is a significant
benefit in some situations.
Figure 10. Network-level DMA with handshake interrupts and without handshake interrupts. With the handshake, the source node sends a transfer description and the sink node obtains the DMA address before the RDMA; without the handshake, the RDMA address is carried in the initial message.
LNET supports RDMA; however, currently a handshake at the operating system level is
required to initiate the RDMA, as shown in Figure 10 (on the left). The handshake
exchanges the network-level DMA addresses to be used. The proposed change to LNET
would eliminate the handshake and include the network-level DMA addresses in the
initial request to transfer data as shown in Figure 10 (on the right).
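A simple model of the difference (illustrative only, not the LNET wire protocol): with the handshake, an extra round trip obtains the DMA address and interrupts the source again before the transfer can start; with the proposed change, the DMA address of the registered pages travels in the first message.

# Illustrative model: messages needed to set up a bulk transfer with and
# without the DMA-address handshake.
def bulk_transfer(with_handshake):
    steps, source_interrupts = [], 0
    if with_handshake:
        steps.append("source -> sink: transfer description")
        steps.append("sink -> source: request for network-level DMA address")
        source_interrupts += 1                  # source must answer the handshake
        steps.append("source -> sink: DMA address of registered pages")
    else:
        steps.append("source -> sink: transfer description + DMA address")
    steps.append("RDMA moves the data with no further source involvement")
    return steps, source_interrupts

for mode, label in ((True, "with handshake"), (False, "without handshake")):
    steps, interrupts = bulk_transfer(mode)
    print(f"{label}: {len(steps)} steps, {interrupts} extra source interrupt(s)")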
Chapter 5
Conclusion
LNET provides an exceptionally flexible and innovative infrastructure. Among the many
features and benefits that have been discussed, the most significant are:
• Native support for all commonly used HPC networks
• Extremely fast data rates through RDMA and unparalleled TCP throughput
• Support for site-wide file systems through routing, eliminating staging and copying of data between clusters
• Load-balancing router support to eliminate low-speed network bottlenecks
Lustre networking will continue to evolve with planned features to handle link aggregation, server-driven QoS, a rich control interface to large routed networks, and asynchronous I/O without interrupts.
Lustre™ Networking — On the Web: sun.com
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2008 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Lustre, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Information subject to change without notice.
SunWIN #524780 Lit. #SYWP13913-1 11/08