Myricom's Myri-10G Software Support for Network Direct (MS MPI)
on either 10-Gigabit Ethernet or 10-Gigabit Myrinet

30 March 2009
2nd Meeting of the WinHPC UG, Dresden

Dr. Markus Fischer
Senior Software Architect
fischer@myri.com
www.myri.com · © 2009 Myricom, Inc.
Myri-10G
- Based on 10-Gigabit Ethernet PHYs (layer 1)
  - Standard 10-Gigabit Ethernet cabling, copper and fiber
- Myri-10G NICs support both Ethernet and Myrinet network protocols at the
  Data Link level (layer 2)
  - and, like earlier Myricom NICs, include processors and firmware for
    offload and kernel-bypass operation
- Myri-10G switches retain the efficiency and scalability of layer-2 Myrinet
  switching internally
  - but can have a mix of 10-Gigabit Myrinet and 10-Gigabit Ethernet ports
    externally
- Fourth-generation Myricom products: a convergence of Myrinet and Ethernet
  at 10-Gigabit/s data rates
The Spectrum of Myri-10G Applications
- 10-Gigabit Ethernet solutions for general networking
  - 10-Gigabit Ethernet NICs with wire-speed TCP/IP performance, low cost,
    fully compliant with Ethernet standards, interoperable with the
    10-Gigabit Ethernet products of other companies
  - 10-Gigabit Ethernet layer-2 switches that scale efficiently to thousands
    of ports, because they are Myrinet switches inside
- Optional capabilities enabled by NIC firmware for low-latency,
  low-host-CPU-load, kernel-bypass operation over either 10Gb Ethernet or
  10Gb Myrinet networks
  - MX (Myrinet Express) for HPC; Video Pump(TM) for video streaming; RAP
    for packet capture and playback; and more
- 10-Gigabit Ethernet and Myrinet solutions for HPC
  - A complete, low-latency, scalable cluster-interconnect solution
  - NICs, software, and switches software-compatible with Myrinet-2000, and
    highly interoperable with 10-Gigabit Ethernet
Myri-10G NICs
Protocol-offload 10-Gigabit Ethernet NICs:
- Myri10GE software distribution
- 10G Ethernet switch
- 9 µs TCP/IP latency
- 9.5-9.9 Gbit/s TCP/IP data rate
- Low host-CPU utilization
Kernel-bypass, low-latency 10-Gigabit Myrinet NICs:
- MX-10G software distribution
- 10G Myrinet switch
- 2 µs MPI latency
- 1.2 GByte/s MPI data rate
- Very low host-CPU utilization
PCI Express x8 NICs, available with many different 10GbE PHYs and form
factors. These NICs are all based on the Myricom Lanai-Z8E or -Z8ES chips,
and share the same software support and optional firmware-enabled
performance features.
- Two SFP+ ports for Direct Attach, 10GBase-SR, or 10GBase-LR
- One QSFP port for Direct Attach or EOE fiber cables
Lanai Z8ES Chip
- Lanai Z8E core with 2MB of on-chip fast SRAM, plus functional
  enhancements:
  - 2 ports for failover
  - SR-IOV
  - SMBus
  - more
- I/O is PCIe x8 and two XAUI ports
- 21.5% higher clock rate than the Z8E
- 2.5W typical at 364.6MHz
- 25mm-square FCBGA package
Block Diagram of the Lanai Z8ES Chip
Examples of Lanai-Z8ES-based NICs
- Simple NIC (10G-PCIE-8B-2S, -2I, -2C): one Lanai Z8ES on a PCIe x8 host
  interface; 10 Gb/s of network ports; two ports for failover
- Double NIC for blades and motherboards (10G-PCIE-8B2-4I): two Lanai Z8ES
  devices on a PCIe x8 host interface; 20 Gb/s of network ports
- Double NIC, Gen2 PCIe add-in card (10G-PCIE2-8B2-2QP, -2S, -2C): two Lanai
  Z8ES devices behind a PCIe switch on a Gen2 PCIe x8 host interface;
  20 Gb/s of network ports
Preferred NIC for HPC Applications
Myricom 10G-PCIE-8B-QP
- 3W with copper cables; add 1W for an EOE cable
- One QSFP network port; PCI Express x8 host port
Indiana University and TU Dresden used these fast but low-power NICs in an
IBM iDataplex system together with a Myri-10G switch to win the SC08 Cluster
Challenge (performance under power limits).
Advance Product Information
2-Port NIC for the Data Center
Myricom 10G-PCIE-8B-2S
- Two SFP+ network ports; PCI Express x8 host port
- Two ports for failover (HA)
- Low-profile PCI Express x8 add-in card
- Dual-protocol network ports: 10-Gigabit Ethernet or 10-Gigabit Myrinet
- Wire-speed performance from either port
- Firmware-controlled offloads
- 6 Watts typical, including power for two SFP+ transceivers
Gen2 PCIe v2.0 (5GT/s) NIC with 2 SFP+ Ports
for Performance (20Gb/s) and Failover
Product code: 10G-PCIE2-8B2-2S
- Two SFP+ network ports; PCI Express 5GT/s x8 host port
- IP: the two active ports can carry TCP/IP traffic concurrently at an
  aggregate data rate of 19.8 Gb/s with a 9KB MTU, or 18.9 Gb/s with a
  1500B MTU (a back-of-envelope check of these figures follows below)
- MX: can use both ports for data transfers, allowing 2x bandwidth
  (NIC bonding)
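Those aggregate rates line up with simple Ethernet framing arithmetic. Here
is a minimal sketch in Python, assuming standard framing overhead (7B
preamble + 1B SFD + 14B header + 4B FCS + 12B inter-frame gap = 38B outside
the MTU) plus 40B of TCP/IP headers inside each frame, and taking the 9KB
MTU as 9000 bytes:

    # Estimate per-port and two-port TCP goodput on a 10 Gb/s link.
    def tcp_goodput_gbps(mtu, line_rate_gbps=10.0):
        payload = mtu - 40    # TCP + IP headers consume 40B inside the MTU
        on_wire = mtu + 38    # preamble/SFD/header/FCS/gap sit outside it
        return line_rate_gbps * payload / on_wire

    for mtu in (9000, 1500):
        per_port = tcp_goodput_gbps(mtu)
        print(f"MTU {mtu}: {per_port:.2f} Gb/s per port, "
              f"{2 * per_port:.1f} Gb/s on two ports")

This prints 19.8 Gb/s for the 9000B MTU, matching the quoted figure, and
19.0 Gb/s for the 1500B MTU against the quoted 18.9 Gb/s; the small gap is
real-world overhead (ACKs, PCIe limits) that the sketch ignores.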
Myri-10G High Speed Expansion Cards (HSECs)
for the IBM BladeCenter H
- 10G-PCIE-8B2-4I: 4 ports (2 for performance, 2 for failover); 20 Gb/s
  throughput; two PCIe devices; 6 Watts
- 10G-PCIE-8B-2I: 2 ports for failover; 10 Gb/s throughput; single PCIe
  device; 3 Watts
The fastest, lowest-cost, lowest-power Ethernet expansion cards available
for the IBM BladeCenter H.
Myri-10G Modular Switches
Myricom's modular Myri-10G switches are based on a 110ns-latency, 32-port,
Myrinet-protocol crossbar-switch chip. The photos show the current choice of
enclosures for these modular switches.
Different mixes of Ethernet or Myrinet external ports, as well as ports with
different PHYs, are configured by using different types of line cards. The
line cards plug into a backplane that connects the switch chips on the line
cards in a diameter-3, full-bisection Clos network.
For applications requiring more than 512 host ports, the switch networks are
scalable by interconnecting the internal Myrinet fabrics.
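That 512-host-port figure follows from the crossbar radix. A minimal sizing
sketch, assuming an idealized two-level (diameter-3) folded Clos built from
radix-32 crossbars; the actual line-card topology has more detail than this
model:

    # Max host ports of a full-bisection folded Clos of radix-r crossbars.
    # Each leaf chip splits its ports: r/2 down to hosts and r/2 up to the
    # spine, preserving full bisection; a radix-r spine chip can tie
    # together up to r leaves, so hosts = r * (r / 2).
    def clos_host_ports(radix):
        return radix * (radix // 2)

    print(clos_host_ports(32))   # -> 512, the limit quoted above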
Myri-10G Software
The driver and firmware for 10-Gigabit Ethernet operation (the Myri10GE
software distribution) is included with the NIC.
The broader software support for 10-Gigabit Myrinet and low-latency
10-Gigabit Ethernet is MX-10G (Myrinet Express).

Distribution  Host APIs                            Network protocols
------------  -----------------------------------  --------------------------
MX-10G        MX kernel bypass (Sockets over MX,   MXoM (MX over Myrinet),
              MPI over MX); IP sockets via the     MXoE (MX over Ethernet),
              host-OS TCP/IP, UDP/IP stack         IPoM (IP over Myrinet),
                                                   IPoE (IP over Ethernet)
Myri10GE      IP sockets via the host-OS TCP/IP,   IPoE (IP over Ethernet)
              UDP/IP stack

MX-10G, Video Pump, and other software distributions are functional
supersets of the Myri10GE software.
Myri-10G 10-Gigabit Ethernet NICs
Hopefully, HPC people will excuse this short story about Myricom's very
popular 10GbE NICs.
Remember that Ethernet is the most widely used interconnect for HPC
clusters, and even dominates the TOP500 list.
Myri-10G NICs in the 10GbE Market
- Demonstrated interoperability with 10-Gigabit Ethernet switches from
  Myricom, Foundry, Extreme, HP, Quadrics, Fujitsu, SMC, Force10, Cisco,
  Blade Network, Broadcom, Arista, ... (the list keeps growing)
- The Myri10GE driver and firmware are currently available for Linux,
  Windows, Solaris, Mac OS X, FreeBSD, and VMware ESX
  - The driver was contributed to and accepted in the Linux kernel; it is
    included in the 2.6.18 and later kernels and in many Linux distributions
  - WHQL-certified for Windows XP, Windows Server 2003 and 2008, and Vista;
    NDIS 5.1, 5.2, and 6.0
- Highly effective stateless offloads: zero-copy send, checksum offloads,
  LRO, TSO, RSS, interrupt coalescing, multicast filtering, ...
- Not a TOE; no troublesome stateful offloads; no OS patches
  - See http://www.linuxfoundation.org/en/Net:TOE
Myri-10G TCP/IP Performance
The excellent netperf benchmark results below were measured with Linux
(2.6.20) between servers with two quad-core 2.66GHz Intel Xeon X5355s:

Netperf Test   MTU    BW (Mbit/s)   TX CPU %   RX CPU %
------------   ----   -----------   --------   --------
TCP_STREAM     9000     9910.26      11.32       5.89
TCP_SENDFILE   9000     9894.46       3.37       5.91
TCP_STREAM     1500     9463.06      10.54       8.71
TCP_SENDFILE   1500     9354.66       2.75       8.67

See http://www.myri.com/scs/performance/Myri10GE/ for additional TCP/IP and
UDP/IP performance benchmarks.
NIC firmware allows Myricom to take advantage of advances such as PCIe
Direct Cache Access (DCA) as soon as they appear in servers.
See http://www.10gbe.net/ for independent comparisons.
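The much lower sender CPU in the TCP_SENDFILE rows reflects zero-copy
transmission: the kernel hands the file's page-cache pages to the NIC
without copying them through user space. A minimal sketch of the same
technique using os.sendfile on Linux (Python 3; the host address and file
path are placeholders):

    import os
    import socket

    def send_file_zero_copy(path, host, port=5001):
        with socket.create_connection((host, port)) as sock, \
             open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                # copies straight from the page cache to the socket
                sent = os.sendfile(sock.fileno(), f.fileno(),
                                   offset, size - offset)
                if sent == 0:
                    break
                offset += sent

    send_file_zero_copy("/tmp/testfile", "10.0.130.50")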
Myri-10G TCP/IP Performance on Windows
Ntttcp results, MTU 9000.
Commands:
  Sender:   ntttcps -m 1,1,10.0.130.50 -l 1048576 -n 100000 -w -v -a 8
  Receiver: ntttcpr -m 1,1,10.0.130.50 -l 1048576 -rb 2097152 -n 1000000 -w -v -a 8

Results on the sender:
  Total Bytes (MB)   Realtime (s)   Avg Frame Size   Throughput (Mbit/s)
  104857.600000      85.500         60667.263        9811.237
  Packets Sent   Packets Received   Retransmits   Errors   Avg CPU %
  1728405        281845             2             0        10.70

Results on the receiver:
  Total Bytes (MB)   Realtime (s)   Avg Frame Size   Throughput (Mbit/s)
  104857.600000      85.735         8959.587         9784.345
  Packets Sent   Packets Received   Retransmits   Errors   Avg CPU %
  281837         11703396           0             0        27.27

(The sender's ~60KB average frame size is consistent with TSO: the host
hands large segments to the NIC, which splits them into wire-size frames.)
Myri-10G for HPC Applications
The Power of Programmable NICs
MX Software Interfaces
[Figure: the MX software stack. Inside the host OS kernel, applications
reach the network by two paths: Sockets over the TCP/UDP/IP stack and the
Ethernet driver, used for initialization and IP traffic; and MPI, Sockets,
or other middleware over the MX driver, which provides kernel bypass. The
MX firmware in the Myri NIC serves Myrinet-2000, 10G Myrinet, or 10G
Ethernet ports.]
MX Software
- MX-10G is the low-level message-passing system for low-latency,
  low-host-CPU-utilization, kernel-bypass operation of Myri-10G NICs over
  either 10Gb Myrinet or 10Gb Ethernet networks
- MX-2G for Myrinet-2000 PCI-X NICs was released in June 2005
  - Myricom software support always spans two generations of NICs
  - MX-2G and MX-10G are fully compatible at the application level
- Includes TCP/IP, UDP/IP, MPICH-MX, and Sockets-MX
  - Also available: MPICH2-MX, OpenMPI, HP-MPI, Intel-MPI, ...
- Supports NIC bonding (teaming)
  - With N NICs, nearly N times the data rate at the same latency
- Cluster file systems operate directly over MX
  - Lustre-MX, PVFS2-MX, others in progress
MX over Ethernet
- Myricom extended MX-10G to operate over 10-Gigabit Ethernet as well as
  10-Gigabit Myrinet
- MXoE works with Myri-10G NICs (kernel bypass) and standard 10-Gigabit
  Ethernet switches
- MXoE uses Ethernet as a layer-2 network, with an MX EtherType to identify
  MX packets (frames)
  - MX has its own efficient protocols for reliable delivery
  - The Myri-10G NICs can carry IP traffic (IPoE) together with MX traffic
    (MXoE)
- Myricom has made the MXoE protocols open for use with other Ethernet NICs
  - See the Open-MX project (http://open-mx.gforge.inria.fr/)
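To make "layer-2 network with an MX EtherType" concrete: on the wire, MX
frames are ordinary Ethernet frames whose 2-byte EtherType field marks them
as MX rather than IP, which is how MX and IP traffic share one port. A
minimal Linux sketch of classifying frames this way with a raw socket
(requires root; the ETHERTYPE_MX value below is an assumption, check the MX
or Open-MX sources for the registered value):

    import socket
    import struct

    ETHERTYPE_MX = 0x86DF      # assumed value; see the Open-MX sources
    ETHERTYPE_IPV4 = 0x0800
    ETH_P_ALL = 0x0003         # receive frames of every protocol

    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                      socket.htons(ETH_P_ALL))
    while True:
        frame, _ = s.recvfrom(65535)
        # bytes 12-13 of the Ethernet header hold the EtherType
        (ethertype,) = struct.unpack("!H", frame[12:14])
        if ethertype == ETHERTYPE_MX:
            pass   # handled by MX's own reliable-delivery protocol
        elif ethertype == ETHERTYPE_IPV4:
            pass   # ordinary IPoE traffic sharing the same port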
Myri-10G Software Stacks for Windows
- MX drivers are WHQL-certified
  - Carry both IP and MX traffic efficiently
- Windows CCS:
  - WSD-MX provides TCP sockets acceleration, achieving line rate even for
    ping-pong tests
- HPC Server 2008:
  - WSD-MX (lower latency)
  - NdProv-MX as a provider for Network Direct (ND Logo tested)
MS-MPI Performance using NdProv-MX
Myri-10G MPICH2-MX Performance on Windows

#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
#Date    : Wed Oct 17 13:29:30 2007
#Version : Service Pack 2 (Build 3790)

#PingPong
#bytes #repetitions t[usec] Mbytes/sec
0 500 3.27 0.00
2 500 3.69 0.52
4 500 3.71 1.03
8 500 3.67 2.08
16 500 3.72 4.10
32 500 3.73 8.19
64 500 4.10 14.88
128 500 4.65 26.25
256 500 6.31 38.71
512 500 6.99 69.87
1024 500 8.58 113.88
2048 500 10.13 192.87
4096 500 13.97 279.67
8192 500 19.39 402.83
16384 500 27.69 564.32
32768 500 42.32 738.46
65536 500 91.96 679.62
131072 320 146.52 853.12
262144 160 257.94 969.20
524288 80 498.73 1002.54
1048576 40 913.70 1094.46
2097152 20 1834.80 1090.04
4194304 10 3787.13 1056.21

#SendRecv
t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
5.59 5.60 5.60 0.00
5.60 5.61 5.61 0.68
5.59 5.59 5.59 1.37
5.62 5.62 5.62 2.71
5.61 5.61 5.61 5.44
5.70 5.70 5.70 10.71
6.06 6.07 6.07 20.12
6.17 6.19 6.17 39.58
7.74 7.75 7.74 63.04
8.74 8.74 8.74 111.75
10.35 10.37 10.35 188.64
11.96 11.97 11.97 326.41
16.34 16.34 16.34 478.08
20.10 20.10 20.10 777.26
30.90 30.91 30.90 1011.38
55.44 55.45 55.44 1127.18
123.27 123.27 123.27 1014.02
179.53 179.54 179.54 1392.45
278.56 278.58 278.57 1794.84
552.60 552.62 552.61 1809.57
954.56 954.59 954.57 2095.14
1888.87 1888.97 1888.92 2117.55
3780.72 3780.94 3780.83 2129.60
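
The numbers above come from MPICH2-MX and the Intel MPI Benchmark suite. As
a rough illustration of what the PingPong test measures, here is a minimal
ping-pong written with mpi4py (an assumption of convenience, not the setup
used above); run as: mpiexec -n 2 python pingpong.py

    # Minimal MPI ping-pong: ranks 0 and 1 bounce a fixed-size message
    # and report half the round-trip time, as IMB PingPong does.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nbytes, reps = 4096, 500
    buf = bytearray(nbytes)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1)
            comm.Recv([buf, MPI.BYTE], source=1)
        elif rank == 1:
            comm.Recv([buf, MPI.BYTE], source=0)
            comm.Send([buf, MPI.BYTE], dest=0)
    half_rtt = (MPI.Wtime() - t0) / (2 * reps)
    if rank == 0:
        print(f"{nbytes} bytes: {half_rtt * 1e6:.2f} usec, "
              f"{nbytes / half_rtt / 1e6:.2f} MB/s")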
Thank you! More questions?
Extras
Or to help answer likely questions
Myri-10G in the TOP500 List
We're pleased about the Myrinet 10G clusters in the Nov-2008 TOP500 list,
i.e., the T2K cluster at the University of Tokyo (#27), and the clusters at
Clemson and USC (#60 & #61). However, in keeping with Myricom's
diversification, our Myri-10G products are used invisibly in many other
TOP500 systems, for example:
- IBM used several hundred Myri-10G 10GbE NICs in the Roadrunner system (#1)
  for storage
- The IBM Blue Gene/P system at Argonne National Laboratory (#5) uses a
  Myri-10G dual-protocol switch with nearly 1000 ports: 10Gb Ethernet to the
  Blue Gene/P racks and 10Gb Myrinet to the file-system cluster
- Other TOP500 Blue Gene/P systems that use Myri-10G switches for storage
  switching: IDRIS (#16) and EDF (#24), both in France
- ... and similar off-the-list use in many other TOP500 systems
Photos of the T2K Cluster at U Tokyo
Photos taken while the cluster was being installed. The installation uses
fourteen 21U switches and CX4 cabling. This cluster is much too large, and
too tightly packed in its room, to capture in a single photograph.
Cabling Change in our HPC Offerings
- We are phasing out CX4 cabling for HPC clusters and for inter-switch links
  in switches with more than 512 host ports
- We recommend the use of components with QSFP (QP) ports, with
  QSFP-terminated copper cables up to 5m, or with QSFP-terminated EOE
  (Electrical-Optical-Electrical) cables for distances greater than 5m
- Additional cabling options, such as QSFP-terminated copper cables with
  active circuitry in the cable ends, are expected
[Photo: Myricom 10G-QP-10M QSFP-terminated EOE cable]
The Dominance of Ethernet
The Myricom technical team that designed Myri-10G (starting in 2002) was not
willing to settle for just a 4th generation of Myrinet, but decided that
Myricom should diversify from a specialty-network company into mainstream
Ethernet.
In retrospect, Myri-10G anticipated the current revolution in Ethernet, in
which the standards bodies and many companies are engaged today in
developing what is variously called Low Latency Ethernet, Data Center
Ethernet, Convergence Enhanced Ethernet, or Lossless Ethernet.
- Application targets: HPC, storage, Fibre Channel over Ethernet (FCoE), and
  others
- The goal: Ethernet domination; the elimination of specialty networks
Standards-based Ethernet will soon be capable of everything that the
specialty networks (Myrinet, Fibre Channel, and InfiniBand) do today; in the
meantime, Myri-10G comes closer than any other commercial product to
achieving this goal.
Aided in part by the firmware in its programmable NICs, Myri-10G already
supports FCoE and multiple forms of HPCoE.
Extra: Myth and Truth about I/O Performance
Data Rate vs. Signaling Rate (8b/10b Encoding)
- Traditional networking such as Ethernet advertises the data rate. 10Gb/s
  is the data rate of 10-Gigabit Ethernet, whatever the signal encoding.
- IB marketing advertises the signaling rate with 8b/10b encoding, so don't
  be disappointed when running performance tests. 10Gb/s IB is really 8Gb/s
  data rate; 20Gb/s IB is 16Gb/s data rate; and 40Gb/s IB is 32Gb/s data
  rate.
- PCI Express also lists signaling rate (GT/s = GigaTransfers/s = GBaud per
  lane), and is also 8b/10b encoded, so multiply by 8/10 and the number of
  lanes to get the data rate.
- Gen 1 PCI Express (2.5GT/s) x8 = 8 x 2.5 Gb/s (signaling) = 20 Gb/s
  (signaling) → 16 Gb/s (data rate) in each direction
  - Good implementations achieve 13 Gb/s data rate after protocol overhead
  - Well suited for 10Gb Ethernet
- Gen 2 PCI Express (5GT/s) x8 = 8 x 5.0 Gb/s (signaling) = 40 Gb/s
  (signaling) → 32 Gb/s (data rate) in each direction
  - A good implementation achieves 25 Gb/s data rate after protocol overhead
  - Well suited for dual-port 10Gb Ethernet NICs, allowing true 2 x 10Gb/s
- Now, think about how to handle 40Gb/s or 80Gb/s. (The sketch below makes
  the rule of thumb concrete.)
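The rule of thumb above reduces to one line of arithmetic: data rate =
signaling rate x 8/10 x lanes, before protocol overhead. A quick Python
sanity check:

    # Usable data rate of an 8b/10b-encoded link, before protocol overhead.
    def data_rate_gbps(signaling_gt_per_s, lanes=1):
        return signaling_gt_per_s * 8 / 10 * lanes

    print(data_rate_gbps(2.5, 8))  # Gen1 PCIe x8: 16.0 Gb/s per direction
    print(data_rate_gbps(5.0, 8))  # Gen2 PCIe x8: 32.0 Gb/s per direction
    print(data_rate_gbps(2.5, 4))  # '10 Gb/s' SDR InfiniBand 4x: 8.0 Gb/s

At 40Gb/s or 80Gb/s, the same arithmetic says that even a Gen2 x8 slot
(32 Gb/s raw, roughly 25 Gb/s achieved per direction) becomes the
bottleneck, which is the point of the closing question.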