
LUSTRE™ NETWORKING
High-Performance Features and Flexible Support for a Wide Array of Networks

White Paper
November 2008

Abstract
This paper provides information about Lustre™ networking that can be used to plan cluster file system deployments
for optimal performance and scalability. The paper includes information on Lustre message passing, Lustre Network
Drivers, and routing in Lustre networks, and describes how these features can be used to improve cluster storage
management. The final section of this paper describes new Lustre networking features that are currently under
consideration or planned for future release.
Sun Microsystems, Inc.

Table of Contents

Challenges in Cluster Networking

Lustre Networking — Architecture and Current Features
    LNET architecture
    Network types supported in Lustre networks
    Routers and multiple interfaces in Lustre networks

Applications of LNET
    Remote direct memory access (RDMA) and LNET
    Using LNET to implement a site-wide or global file system
    Using Lustre over wide area networks
    Using Lustre routers for load balancing

Anticipated Features in Future Releases
    New features for multiple interfaces
    Server-driven QoS
    A router control plane
    Asynchronous I/O

Conclusion

Chapter 1
Challenges in Cluster Networking

Networking in today’s datacenters presents many challenges. For performance, file
system clients must access servers using native protocols over a variety of networks,
preferably leveraging capabilities such as remote direct memory access. In large
installations, multiple networks may be encountered, and all storage must be
simultaneously accessible over those networks through routers and by using multiple
network interfaces on the servers. Storage management nightmares such as staging
multiple copies of data on file systems local to a cluster are common practice, but they
are also highly undesirable.

Lustre networking (LNET) provides features that address many of these challenges.
Chapter 2 provides an overview of some of the key features of the LNET architecture.
Chapter 3 discusses how these features can be used in specific high-performance
computing (HPC) networking applications. Chapter 4 looks at how LNET is expected
to evolve to enhance load balancing, quality of service (QoS), and high availability in
networks on a local and global scale. Chapter 5 provides a brief summary.

Chapter 2
Lustre Networking — Architecture and
Current Features

The LNET architecture comprises a number of key features that can be used to simplify
and enhance HPC networking.

LNET architecture
The LNET architecture has evolved through extensive research into a set of protocols
and application programming interfaces (APIs) to support high-performance, high-
availability file systems. In a cluster with a Lustre file system, the system network is
the network connecting the servers and the clients.

LNET is used only over the system network, where it provides all of the communication
infrastructure required by the Lustre file system. The disk storage in a Lustre file system
is connected to metadata servers (MDSs) and object storage servers (OSSs) using
traditional storage area networking (SAN) technologies. However, this SAN does not
extend to the Lustre client systems and typically does not require SAN switches.

Key features of LNET include:


• Remote direct memory access (RDMA), when supported by underlying networks
such as Elan, Myrinet, and InfiniBand
• Support for many commonly used network types such as InfiniBand and IP
• High-availability and recovery features that enable transparent recovery in
conjunction with failover servers
• Simultaneous availability of multiple network types with routing between them

Figure 1 shows how these network features are implemented in a cluster deployed
with LNET.

[Figure 1 depicts a cluster deployed with LNET: a clustered MDS pool (1–100 metadata servers in active/standby pairs sharing storage that holds the metadata targets and enables failover), object storage servers (1 to 1000s of OSSs) with object storage targets on commodity or enterprise-class storage arrays and SAN fabric, and Lustre clients (1–100,000) reaching the servers simultaneously over Elan, Myrinet, InfiniBand, and GigE networks, with routers connecting the network types.]

Figure 1. Lustre architecture for clusters

LNET is implemented using layered software modules. The file system uses a remote
procedure API with interfaces for recovery and bulk transport. This API, in turn, uses
the LNET Message Passing API, which has its roots in the Sandia Portals message
passing API, a well-known API in the HPC community.

The LNET architecture supports pluggable drivers to provide support for multiple
network types individually or simultaneously, similar in concept to the Sandia Portals
network abstraction layer (NAL). The drivers, called Lustre Network Drivers (LNDs), are
loaded into the driver stack, with one LND for each network type in use. Routing is
possible between different networks. This capability was implemented early in the
Lustre product cycle to provide a key customer, Lawrence Livermore National Laboratory
(LLNL), with a site-wide file system (discussed in more detail in Chapter 3, Applications
of LNET).

Figure 2 shows how the software modules and APIs are layered.

[Figure 2 depicts the layered LNET stack: vendor network device libraries at the bottom; the Lustre Network Drivers (LNDs), which provide support for multiple network types, above them; the LNET library with its Network I/O (NIO) API (similar to Sandia Portals with some new and different features; it moves small and large buffers, uses RDMA, and generates events); and Lustre request processing at the top (zero-copy marshalling libraries, a service framework and request dispatch, connection and address naming, and generic recovery infrastructure).]

Figure 2. Modular LNET implemented with layered APIs

A Lustre network is a set of configured interfaces on nodes that can send traffic directly
from one interface on the network to another. In a Lustre network, configured interfaces
are named using network identifiers (NIDs). The NID is a string that has the form
<address>@<type><network id>. Examples of NIDs are 192.168.1.1@tcp0, designating
an address on the 0th Lustre TCP network, and 4@elan8, designating address 4 on the
8th Lustre Elan network.
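As an illustration of this naming convention, the short sketch below splits a NID string into its address, network type, and network number. The helper is hypothetical and written only to make the format concrete; it is not part of the LNET API.

# Hypothetical sketch of the NID convention <address>@<type><network id>;
# for illustration only, not part of the LNET API.
import re

def parse_nid(nid):
    """Split a NID such as '192.168.1.1@tcp0' or '4@elan8' into
    (address, network type, network number)."""
    address, network = nid.split("@")
    match = re.fullmatch(r"([a-z]+)(\d*)", network)
    if match is None:
        raise ValueError("malformed network part in NID: %r" % nid)
    net_type, net_num = match.group(1), match.group(2) or "0"
    return address, net_type, int(net_num)

print(parse_nid("192.168.1.1@tcp0"))   # ('192.168.1.1', 'tcp', 0)
print(parse_nid("4@elan8"))            # ('4', 'elan', 8)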

Network types supported in Lustre networks


The LNET architecture includes LNDs to support many network types, including:
• InfiniBand (IB): OpenFabrics IB versions 1.0, 1.2, 1.2.5 and 1.3
• TCP: Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
• Quadrics: Elan3 and Elan4
• Myricom: GM and MX
• Cray: SeaStar and RapidArray

The LNDs that support these networks are pluggable modules for the LNET
software stack.

Routers and multiple interfaces in Lustre networks


A Lustre network consists of one or more interfaces on nodes, configured with NIDs,
that communicate without the use of intermediate router nodes (which have NIDs of
their own). LNET can conveniently define a Lustre network by enumerating the IP
addresses of the interfaces forming that network. A Lustre network is not required to be
physically separated from another Lustre network, although that is possible.

When more than one Lustre network is present, LNET can route traffic between networks
using routing nodes in the network. An example of this is shown in Figure 3, where one
of the routers is also an OSS. If multiple routers are present between a pair of networks,
they offer both load balancing and high availability through redundancy.

[Figure 3 shows an elan0 Lustre network (Elan clients, an MDS at 132.6.1.4, and an OSS behind an Elan switch) and a tcp0 Lustre network (TCP clients behind an Ethernet switch) joined by routing nodes with an interface on each network: the OSS (132.6.1.2 and 192.168.0.2) and a dedicated router (132.6.1.10 and 192.168.0.10). The TCP clients access the MDS through the router.]

Figure 3. Lustre networks connected through routers
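To make the routing and redundancy idea concrete, the following sketch picks a router between two Lustre networks, rotating over the available routers for load balancing and skipping any that are down. The router table, liveness map, and function are invented for illustration and do not reflect LNET internals.

# Conceptual sketch of routing between two Lustre networks; the router
# table and liveness map are invented for illustration only.
from itertools import cycle

ROUTERS = {
    ("elan0", "tcp0"): ["132.6.1.10@elan0", "132.6.1.2@elan0"],
}
ALIVE = {"132.6.1.10@elan0": True, "132.6.1.2@elan0": True}
_rotation = {pair: cycle(routers) for pair, routers in ROUTERS.items()}

def next_hop(src_net, dst_net):
    """Return a live router between two networks, rotating for balance."""
    if src_net == dst_net:
        return "direct"                     # same Lustre network, no router needed
    pool = ROUTERS[(src_net, dst_net)]
    for _ in range(len(pool)):
        router = next(_rotation[(src_net, dst_net)])
        if ALIVE[router]:
            return router
    raise RuntimeError("no live router between %s and %s" % (src_net, dst_net))

print(next_hop("elan0", "tcp0"))   # first router
print(next_hop("elan0", "tcp0"))   # second router, so traffic is spread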

When multiple interfaces of the same type are available, load balancing traffic across
all links becomes important. If the underlying network software for the network type
supports interface bonding, resulting in one address, then LNET can rely on that
mechanism. Such interface bonding is available for IP networks and Elan4, but not presently
for InfiniBand.

If the network does not provide channel bonding, Lustre networks can help. Each of
the interfaces is placed on a separate Lustre network. The clients on each of these
Lustre networks together can utilize all server interfaces. This configuration also
provides static load balancing.
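The sketch below illustrates this static arrangement: with the server's two interfaces on separate Lustre networks, clients are divided between the networks at configuration time so that together they exercise both interfaces. The addresses follow Figure 4, and the assignment rule is invented for illustration.

# Sketch of static load balancing without interface bonding: each server
# interface sits on its own Lustre network and clients are split between
# them. Addresses and the assignment rule are illustrative only.
SERVER_NIDS = {"vib0": "10.0.0.1@vib0", "vib1": "10.0.0.2@vib1"}

def assign_network(client_index):
    """Place clients on alternating Lustre networks at configuration time."""
    return "vib0" if client_index % 2 == 0 else "vib1"

for client in range(4):
    net = assign_network(client)
    print("client %d -> %s (server %s)" % (client, net, SERVER_NIDS[net]))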

Additional features that may be developed in future releases to allow LNET to even
better manage multiple network interfaces are discussed further in Chapter 4,
Anticipated Features in Future Releases.

Figure 4 shows how a Lustre server with several server interfaces can be configured to
provide load balancing for clients placed on more than one Lustre network. At the top,
two Lustre networks are configured as one physical network using a single switch. At
the bottom, they are configured as two physical networks using two switches.

[Figure 4 shows a Lustre server with two interfaces, 10.0.0.1 on the vib0 Lustre network and 10.0.0.2 on the vib1 Lustre network, with clients split between the two networks. In the upper configuration the two Lustre networks share a single switch; in the lower configuration each network rail has its own switch.]

Figure 4. A Lustre server with multiple network interfaces offering load balancing to the cluster

Chapter 3
Applications of LNET

LNET provides versatility for deployments. A few representative applications are described in this chapter.

Remote direct memory access (RDMA) and LNET


With the exception of TCP, LNET provides support for RDMA on all network types.
When RDMA is used, nodes can achieve almost full bandwidth with extremely low
CPU utilization. This is advantageous, particularly for nodes that are busy running
other software, such as Lustre server software. The LND automatically uses this feature
for large message sizes.
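The behavior described above, a plain send for small messages and RDMA for bulk data, can be pictured with the sketch below. The threshold value and function names are assumptions made for illustration; an actual LND chooses its own cutover.

# Illustrative choice between an eager send and RDMA for bulk data;
# the 4 KB threshold and these names are assumptions for the sketch.
RDMA_THRESHOLD = 4096   # bytes

def transfer(payload):
    if len(payload) < RDMA_THRESHOLD:
        return "eager send: payload travels inside the message"
    return "RDMA: the peer moves the buffer directly, with little CPU involvement"

print(transfer(b"x" * 512))        # small request, eager path
print(transfer(b"x" * 1_000_000))  # bulk I/O, RDMA path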

However, nodes provisioned with sufficient CPU power and high-performance motherboards
can make TCP networking an acceptable trade-off to RDMA. On 64-bit processors, LNET
can saturate several GigE interfaces with relatively low CPU utilization, and with the
Dual-Core Intel® Xeon® processor 5100 series, the bandwidth on a 10 GigE network can
approach a gigabyte per second. LNET provides extraordinary bandwidth utilization of
TCP networks. For example, end-to-end I/O over a single GigE link routinely exceeds
110 MB/sec with LNET.

The Internet Wide Area RDMA Protocol (iWARP), developed by the RDMA Consortium,
is an extension to TCP/IP that supports RDMA over TCP/IP networks. Linux supports the
iWARP protocol using the OpenFabrics Alliance (OFA) code and interfaces. The LNET OFA
LND supports iWARP as well as InfiniBand.

Using LNET to implement a site-wide or global file system


Site-wide file systems and global file systems are implemented to provide transparent
access from multiple clusters to one or more file systems. Site-wide file systems are
typically associated with one site, while global file systems may span multiple locations
and therefore utilize wide area networking.

Site-wide file systems are typically desirable in HPC centers where many clusters exist
on different high-speed networks. Typically, it is not easy to extend such networks or to
connect such networks to other networks. LNET makes this possible.

An increasingly popular approach is to build a storage island at the center of such an
installation. The storage island contains storage arrays and servers and utilizes an
InfiniBand or TCP network. Multiple clusters can connect to this island through Lustre
routing nodes. The routing nodes are simple Lustre systems with at least two network
interfaces: one to the internal cluster network and one to the network used in the
storage island. Figure 5 shows an example of a global file system.

[Figure 5 shows two client clusters, one on an Elan4 network and one on an IP network, each connected through its own Lustre routers to a central storage island, where an InfiniBand storage network links the OSS and MDS server farm.]

Figure 5. A global file system implemented using Lustre networks

The benefits of site-wide and global file systems are not to be underestimated.
Traditional data management for multiple clusters frequently involves staging copies of
data from the file system of one cluster to another. By deploying a site-wide Lustre file
system, multiple copies of the data are no longer needed and substantial savings can
be achieved through improved storage management and reduced capacity requirements.

Using Lustre over wide area networks


The Lustre file system has been successfully deployed over wide area networks (WANs).
Typically, even over a WAN, 80 percent of raw bandwidth can be achieved, which is
significantly more than that achieved by many other file systems over local area networks
(LANs). For example, within the United States, Lustre file system deployments have
achieved a bandwidth of 970 MB/sec over a WAN using a single 10 GigE interface (from
a single client). Between Europe and the United States, 97 MB/sec has been achieved
with a single GigE connection. On LANs, observed I/O bandwidths are only slightly
higher: 1100 MB/sec on a 10 GigE network and 118 MB/sec on a GigE network.
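To put these figures in context, the short calculation below compares the quoted throughputs with nominal line rates (125 MB/sec for GigE, 1250 MB/sec for 10 GigE, before protocol overhead); the line-rate values are the usual nominal figures, not measurements from this paper.

# Rough efficiency check for the throughput figures quoted above, using
# nominal line rates: GigE = 125 MB/sec, 10 GigE = 1250 MB/sec.
measurements = {
    "10 GigE WAN": (970, 1250),
    "10 GigE LAN": (1100, 1250),
    "GigE WAN": (97, 125),
    "GigE LAN": (118, 125),
}
for link, (achieved, raw) in measurements.items():
    print("%-11s %4d MB/sec = %.0f%% of raw bandwidth"
          % (link, achieved, 100.0 * achieved / raw))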

Routers can also be used advantageously to connect servers distributed over a WAN.
For example, a single Lustre cluster may consist of two widely separated groups of
Lustre servers and clients with each group interconnected by an InfiniBand network.
As shown in Figure 6, Lustre routing nodes can be used to connect the two groups of
Lustre servers and clients via an IP-based WAN. Alternatively, each server could have both
an InfiniBand and an Ethernet interface. However, this configuration may require more ports
on switches, so the routing solution may be more cost effective.

[Figure 6 shows two groups of Lustre clients and servers, each on its own InfiniBand network at locations A and B, connected to each other by Lustre routers across an IP-based WAN.]

Figure 6. A Lustre cluster distributed over a WAN

Using Lustre routers for load balancing


Commodity servers can be used as Lustre routers to provide a cost-effective, load-
balanced, redundant router configuration. For example, consider an installation with
servers on a network with 10 GigE interfaces and many clients attached to a GigE network.
It is possible, but typically costly, to purchase IP switching equipment that can connect
to both the servers and the clients.

With a Lustre network, the purchase of such costly switches can be avoided. For a more
cost-effective solution, two separate networks can be created. A smaller, faster network
contains the servers and a set of router nodes with sufficient aggregate throughput. A
second client network with slower interfaces contains all the client nodes and is also
attached to the router nodes. If this second network already exists and has sufficient
free ports to add the Lustre router nodes, no changes to this client network are
required. Figure 7 shows an installation with this configuration.

[Figure 7 shows GigE clients on a GigE switch connected through a load-balancing, redundant router farm to 10 GigE servers on a 10 GigE switch.]

Figure 7. An installation combining slow and fast networks using Lustre routers

The routers provide a redundant, load-balanced path between the clients and the
servers. This network configuration allows many clients together to use the full
bandwidth of a server, even if individual clients have insufficient network bandwidth to do
so. Because multiple routers stream data to the server network simultaneously, the
server network can see data throughput in excess of what a single router can deliver.
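A back-of-the-envelope sizing of such a router farm might look like the sketch below. The numbers are assumptions for illustration (the per-router figure reuses the roughly 110 MB/sec per GigE link cited earlier), not recommendations.

# Illustrative sizing of a router farm between a GigE client network and a
# 10 GigE server network; all figures are assumptions for the example.
import math

server_bandwidth = 1250    # MB/sec, nominal 10 GigE on the server side
router_forwarding = 110    # MB/sec, roughly one GigE link per router

routers_needed = math.ceil(server_bandwidth / router_forwarding)
print("about %d routers are needed to keep one 10 GigE server busy" % routers_needed)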

Chapter 4
Anticipated Features in Future Releases

LNET offers many features today, and, as with most products, enhancements and new
features are planned for future releases. Possible additions include improved handling
of multiple network interfaces, implementation of server-driven quality-of-service (QoS)
guarantees, asynchronous I/O, and a control interface for routers.

New features for multiple interfaces


As previously mentioned, LNET can currently exploit multiple interfaces by placing
them on different Lustre networks. This configuration provides reasonable load balancing
for a server with many clients. However, it is a static configuration that does not
handle link-level failover or dynamic load balancing.

It is Sun’s intention to address these shortcomings with the following design. First,
LNET will virtualize multiple interfaces and offer the aggregate as one NID to the users
of the LNET API. In concept, this is quite similar to the aggregation (also referred to as
bonding or trunking) of Ethernet interfaces using protocols such as 802.3ad Dynamic
Link Aggregation. The key features that a future LNET release may offer are:
• Load balancing: All links are used based on availability of throughput capacity.
• Link-level high availability: If one link fails, the other channels transparently
continue to be used for communication.

These features are shown in Figure 8.

[Figure 8 shows a server with two aggregated links in two situations: evenly loaded traffic spread across both links, and a link failure in which all traffic is accommodated on the surviving link without server failover.]

Figure 8. Link-level load balancing and failover
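A minimal sketch of the intended behavior, with invented names: traffic offered to the aggregated NID is spread over whichever links are currently up, and a failed link is simply skipped rather than triggering server failover.

# Conceptual sketch of NID-level link aggregation: spread traffic across
# the available links and transparently skip failed ones. The class and
# link names are invented for illustration.
class AggregatedInterface:
    def __init__(self, links):
        self.links = list(links)                 # physical links behind one NID
        self.up = dict((link, True) for link in links)
        self._next = 0

    def fail(self, link):
        self.up[link] = False                    # link-level failure only

    def send(self, message):
        for _ in range(len(self.links)):
            link = self.links[self._next]
            self._next = (self._next + 1) % len(self.links)
            if self.up[link]:
                return "sent %r via %s" % (message, link)
        raise RuntimeError("all links down")

nid = AggregatedInterface(["ib0", "ib1"])
print(nid.send("bulk-1"))      # goes out on ib0
print(nid.send("bulk-2"))      # goes out on ib1
nid.fail("ib1")
print(nid.send("bulk-3"))      # continues on ib0, no server failover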



From a design perspective, these load-balancing and high-availability features are
similar to the features offered by LNET routing, described in Chapter 3 in the section
“Using Lustre routers for load balancing.” A challenge in developing these features is
providing a simple way to configure the network. Assigning and publishing NIDs for the
bonded interfaces should be simple and flexible and should work even if all links are
not available at startup. We expect to use the management server protocol to resolve
this issue.

Server-driven QoS
QoS is often a critical issue, for example, when multiple clusters are competing for
bandwidth from the same storage servers. A primary QoS goal is to avoid overwhelming
server systems with conflicting demands from multiple clusters or systems, resulting in
performance degradation for all clusters. Setting and enforcing policies is one way to
avoid this.

For example, a policy can be established that guarantees that a certain minimal
bandwidth is allocated to resources that must respond in real time, such as for visualization.
Or a policy can be defined that gives systems or clusters doing mission-critical work
priority for bandwidth over less important clusters or systems. The Lustre QoS system’s
role is not to determine an appropriate set of policies but to provide capabilities that
allow policies to be defined and enforced.

Two components proposed for the Lustre QoS scheduler are a global Epoch Handler
(EH) and a Local Request Scheduler (LRS). The EH provides a shared time slice among
all servers. This time slice can be relatively large (one second, for example) to avoid
overhead due to excessive server-to-server networking and latency. The LRS is responsible
for receiving and queuing requests according to a local policy. The EH and LRS together
allow all servers in a cluster to execute the same policy during the same time slice.
Note that the policy may subdivide the time slices and use the subdivision
advantageously. The LRS also provides summary data to the EH to support global knowledge
and adaptation.

Figure 9 shows how these features can be used to schedule rendering and visualization
of streaming data. In this implementation, LRS policy allocates 30 percent of each
Epoch time slice to visualization and 70 percent to rendering.

[Figure 9 shows a rendering cluster and a visualization cluster sharing OSS 1–3, with epoch messaging between the servers; in each of epochs 1, 2, and 3, 30 percent of the time slice is allocated to visualization and 70 percent to rendering.]

Figure 9. Using server-driven QoS to schedule video rendering and visualization
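The sketch below illustrates the EH/LRS division of labor with the 30/70 policy from Figure 9: every server applies the same split within each shared epoch. The policy table, queues, and epoch length are invented for illustration.

# Illustration of server-driven QoS: a shared epoch subdivided by a local
# policy on every server. Policy, queues, and epoch length are invented.
EPOCH_SECONDS = 1.0
POLICY = {"visualization": 0.30, "rendering": 0.70}   # share of each epoch

queues = {
    "visualization": ["vis-req-1", "vis-req-2"],
    "rendering": ["ren-req-1", "ren-req-2", "ren-req-3"],
}

def schedule_epoch(epoch):
    """Subdivide one epoch among traffic classes according to POLICY."""
    print("epoch %d:" % epoch)
    for traffic_class, share in POLICY.items():
        budget = EPOCH_SECONDS * share
        print("  %s: %.2f s budget, queued requests: %s"
              % (traffic_class, budget, queues[traffic_class]))

for epoch in (1, 2, 3):
    schedule_epoch(epoch)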

A router control plane


Lustre technology is expected to be used in vast worldwide file systems that traverse
multiple Lustre networks with many routers. To achieve wide-area QoS guarantees that
cannot be achieved with static configurations, the configurations of these networks
must change dynamically. A control interface is required between the routers and
external administrative systems to handle these situations. Requirements are currently
being developed for a Lustre Router Control Plane to help address these issues.

For example, features are being considered for the Lustre Router Control Plane that
could be used when routers carry packets both from A to B and from C to D, and, for
operational reasons, preference must be given to the traffic from C to D. The control
plane would apply a policy to the routers so that packets are sent from C to D before
packets are sent from A to B.
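Expressed as a sketch, such a policy could amount to per-path priorities that each router consults when forwarding queued traffic; the paths, priorities, and packet names below are invented for illustration.

# Sketch of a router applying a control-plane policy that prefers the
# C-to-D path over A-to-B; paths and priorities are illustrative only.
import heapq

PRIORITY = {("C", "D"): 0, ("A", "B"): 1}   # lower value is served first

queue = []
arrivals = [("A", "B"), ("C", "D"), ("A", "B"), ("C", "D")]
for seq, (src, dst) in enumerate(arrivals):
    heapq.heappush(queue, (PRIORITY[(src, dst)], seq, "packet %d %s->%s" % (seq, src, dst)))

while queue:
    _, _, packet = heapq.heappop(queue)
    print("forwarding", packet)   # C->D packets drain before A->B packets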

The Lustre Router Control Plane may also include the capability to provide input to a
server-driven QoS subsystem, linking router policies with server policies. It might be
particularly interesting to have an interface between the server-driven QoS subsystem
and the router control plane to allow coordinated adjustment of QoS in a cluster and a
wide area network.

Asynchronous I/O
In large compute clusters, the potential exists for significant I/O optimization. When a
client writes large amounts of data, a truly asynchronous I/O mechanism would allow
the client to register the memory pages that need to be written for RDMA and allow
the server to transfer the data to storage without causing interrupts on the client. This
would leave the client CPU fully available to the application, which is a significant
benefit in some situations.

[Figure 10 contrasts the message sequence between a source node and a sink node (each with LNET and LND layers) for an RDMA transfer with the DMA handshake, in which a descriptor exchange obtains the peer's DMA address before the RDMA data moves, and without the handshake, in which the source RDMA address accompanies the initial message description and is followed directly by the RDMA data and completion event.]

Figure 10. Network-level DMA with handshake interrupts and without handshake interrupts

LNET supports RDMA; however, currently a handshake at the operating system level is
required to initiate the RDMA, as shown in Figure 10 (on the left). The handshake
exchanges the network-level DMA addresses to be used. The proposed change to LNET
would eliminate the handshake and include the network-level DMA addresses in the
initial request to transfer data as shown in Figure 10 (on the right).
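The difference can be summarized as the number of control messages exchanged before the data moves. The sketch below simply lists the steps in each scheme, following the description above; the step names are paraphrased for illustration.

# Illustrative comparison of the two schemes in Figure 10: with the
# handshake, DMA addresses are exchanged before the transfer; without it,
# the DMA address rides in the initial request. Step names are paraphrased.
def put_with_handshake():
    return ["put message description",
            "exchange network-level DMA addresses (handshake)",
            "RDMA data",
            "completion event"]

def put_without_handshake():
    return ["put message description carrying the DMA address",
            "RDMA data",
            "completion event"]

print("with handshake:    %d steps" % len(put_with_handshake()), put_with_handshake())
print("without handshake: %d steps" % len(put_without_handshake()), put_without_handshake())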

Chapter 5
Conclusion

LNET provides an exceptionally flexible and innovative infrastructure. Among the many
features and benefits that have been discussed, the most significant are:
• Native support for all commonly used HPC networks
• Extremely fast data rates through RDMA and unparalleled TCP throughput
• Support for site-wide file systems through routing, eliminating staging and copying
of data between clusters
• Load-balancing router support to eliminate low-speed network bottlenecks

Lustre networking will continue to evolve with planned features to handle link
aggregation, server-driven QoS, a rich control interface to large routed networks, and
asynchronous I/O without interrupts.

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2008 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Lustre, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Information subject to change without notice.
SunWIN #524780 Lit. #SYWP13913-1 11/08
