
The Elan5 Network Processor

Jon Beecroft, David Hewson, Fred Homewood, Duncan Roweth and Ed Turner.
Abstract—The Elan5 is a single chip network processor which acts as a host adapter for high speed network protocols. It is capable of handling both 10Gb Ethernet and proprietary Quadrics protocols developed for ultra low latency communication in High Performance Computing applications. In order to provide flexibility in the choice of protocols, the device is implemented as an array of identical RISC processors, which can be dedicated to tasks such as input packet handling and host memory DMA handling.

Index Terms—10GbE, HPC, Ethernet, QsNet

INTRODUCTION

The Elan5 network processor marks a significant departure from the preceding generations of high speed Network Interface devices [1] in several aspects. The device implements a standards-based protocol, IEEE 802.3ae [2], in addition to the ultra low latency protocols developed for supercomputing applications. This choice was dictated by the requirement to address wider markets to offset the increasing development costs of the latest generations of CMOS technology, while at the same time maintaining a performance advantage when used in supercomputer class systems.

The requirement to implement multiple protocols necessitates a much greater degree of programmability in the underlying architecture than in preceding generations of device. For example, in Elan4 the only form of DMA supported is a simple RDMA operation in which a contiguous memory block is transferred between nodes. By implementing the DMA function in a processor instead of a dedicated state machine it is possible to implement more complex functions, such as Ethernet RDMA and various forms of scatter-gather operation.

The Elan5's processing power is provided by a pool of Packet Processing Engines (PPEs) designed specifically to support the high speed streaming of data. Multiple PPEs are provided so as to maximize the packet rate and minimize the packet size necessary to saturate link bandwidth. The PPEs share a multi-bank memory system on the chip; communication between them, the memory system and the host interface is provided by a high speed internal fabric.

To contact Quadrics send e-mail to sales@quadrics.com.

Device Architecture

The Elan5 has seven Packet Processing Engines, with one assigned to device management, two dedicated to link input and four available for output packet generation and for processing requests from remote Elans. The PPEs are identical with the exception of the two connected to the network links, which have additional input and output buffers. Although the input buffers are owned by their respective PPEs, each of the output buffers can be accessed by any PPE.

Each PPE consists of: a dual issue 500MHz RISC processor core optimized for data communication tasks, a 16Kbyte instruction cache, a 9Kbyte DMA buffer, and a port connecting it to the on-chip internal fabric.

Figure 1 - Elan5 Network Processor Architecture

All seven PPEs are connected through the on-chip fabric to a multi-bank memory system with built-in hardware buffer management. A dedicated host address path routes host memory references via the on-chip MMU to the host bus interface, a PCI Express interface supporting version 1 and version 2 protocols. The on-board MMU allows the Elan5 to replicate the virtual memory translations of the host processor. The MMU supports variable page sizes; up to 8 different page sizes can be in use at any time. Its TLB includes 64 fully associative tags, with each tag referencing up to 16 consecutive page table entries (PTEs), for a total of 1024 translations.
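As a rough illustration of this TLB organization (64 fully associative tags, each covering a run of up to 16 consecutive pages), the lookup can be sketched as follows. The class and field names are invented for illustration and are not taken from the Elan5 design:

```python
# Minimal sketch of a tag-per-run TLB (illustrative names, not Elan5's own).
# Each tag covers up to 16 consecutive virtual page numbers, so 64 tags
# can hold 64 * 16 = 1024 translations.

PTES_PER_TAG = 16
NUM_TAGS = 64

class TlbTag:
    def __init__(self, base_vpn, ppns):
        assert len(ppns) <= PTES_PER_TAG
        self.base_vpn = base_vpn   # first virtual page number covered
        self.ppns = ppns           # physical pages for consecutive VPNs

class Tlb:
    def __init__(self):
        self.tags = []             # fully associative: no index bits

    def insert(self, tag):
        if len(self.tags) == NUM_TAGS:
            self.tags.pop(0)       # simple FIFO replacement for the sketch
        self.tags.append(tag)

    def lookup(self, vpn):
        # In hardware every tag is compared in parallel (a CAM);
        # here we simply scan the list.
        for tag in self.tags:
            offset = vpn - tag.base_vpn
            if 0 <= offset < len(tag.ppns):
                return tag.ppns[offset]
        return None                # miss: the MMU walks the page tables
```

A single tag then translates a whole run of pages, which suits the large contiguous mappings typical of pinned communication buffers.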
Other functional units include the Command Launch Unit (CLU), the Object Cache, the external EEPROM interface, and the external SDRAM interface and cache.
The CLU is responsible for taking commands from the host processor and directing them to an appropriate PPE. The CLU can support up to 2048 separate command queues. Commands are issued by PIO write operations from the host processor across the PCI Express bus. The CLU assigns the commands to a queue based on the physical address, and signals the associated PPE, scheduling the task that processes the queue.

The Object Cache is a software-controlled content addressable memory (CAM) that provides efficient support for looking up objects in the local memory based on a key supplied as part of a command or network packet. The cache is a shared resource amongst the PPEs. We use it for looking up packet sequencers, queues and protocol control blocks. The CAM supports up to 256 entries that can be rapidly validated against a 32 or 64 bit key presented by the PPE.

Packet Processing Engine

The packet processing engines are based around a proprietary RISC processor, optimized for use in communication applications. The basic design consists of a 32 bit processor capable of supporting up to 8 outstanding loads. The high aggregate number of outstanding loads is required to handle the long latency associated with access to host memory. Load and store operations can be for any power-of-two number of bytes between one and 128. The processor can address up to 64 of 128 registers, with flexible register windowing to provide efficient subroutine call support.

Each PPE can execute two instructions per cycle for a single thread, or one instruction per cycle from each of two threads. The ability to support two simultaneous threads is useful for code segments where data has to be streamed in and out of a FIFO, with one thread acting as the producer and the other as the consumer. Fork and Join instructions allow the processor to rapidly switch in and out of dual thread mode.

The primary role of the PPE is high-speed streaming and re-alignment of data. Even with a relatively large number of registers, the host memory latency is such that it is not possible to code a host memory copy routine that transfers data through registers and sustains full bandwidth on the host bus interface. Host memory latencies vary greatly, from 250ns for a fast dual CPU node to 1500ns for a high CPU count SMP. Assuming a host memory latency in the region of a microsecond and a host memory bandwidth of around 2.8Gbytes/s (16 lane PCI Express, 70% efficiency), at any time at least 2.8Kbytes of memory or registers must be reserved for data returning from outstanding read requests.

The PPE supports a 9Kbyte DMA buffer. Up to 8 outstanding DMA loads and 3 outstanding store operations can be issued by each PPE; they complete by transferring data to or from the buffer rather than the PPE's registers. Operations may be any number of bytes up to 4Kbytes, the maximum transaction size permitted by PCI Express. The buffer has associated scoreboarding logic, which ensures that if a load is issued to bring data into the DMA buffer, followed by a dependent store (i.e. one which overlaps that region in buffer memory), the store will not be scheduled until the load has completed. The PPEs support on-the-fly checksum and CRC generation for data being streamed through the DMA buffers. Support is also provided for generating packets from the PPE's registers, allowing it to modify data in the DMA buffer before transmitting to the link.

Internal Fabric & Buffer Memory

The internal fabric allows each of the seven PPEs, and the host bus interface, to communicate with each other and access the on-chip memory. Each port onto the fabric has a separate 64-bit read and write path, for a per-port bandwidth of 8Gbytes/s and a total fabric bandwidth of 64Gbytes/s at 500MHz. The protocols and clocking strategy for the fabric were designed to be tolerant to delays in crossing the chip, to simplify the task of meeting timing closure, which can be a significant challenge in large sub-micron chip designs.

The memory system connected to the fabric consists of 512Kbytes of ECC corrected SRAM, arranged as 8 banks, each 64 bits wide. The individual banks support simultaneous reads and writes. Data is striped across the banks in 64 bit words, so that multiple block accesses to the memory occur without contention once the first access has been scheduled.

A hardware buffer manager is implemented to ensure efficient utilization of the on-chip memory. It supports up to 1024 separately controlled buffer structures, each of which can be either a simple FIFO that takes streamed data and is emptied as it is read, or an object buffer, which can be read and written multiple times and then returned to a free list when no longer in use.

Network Interface

The Elan5 has two DDR XAUI link interfaces, each consisting of four high speed SerDes operating at 6.25Gbits/s for the QsNet low latency remote memory protocols, or 3.125Gbits/s for CX4 Ethernet operation. Each link has a 32Kbyte input buffer and a 32Kbyte output buffer. Each link is controlled by a PPE, and only that PPE can access the input buffer. The output buffers can be accessed by any of the PPEs writing across the fabric, allowing any PPE to generate output packets. Each link supports up to 64 pending packets.

PCI Express Interface

The host interface supports PCI Express version 1.0 with up to 16 lanes, or PCI Express version 2.0 with up to 8 lanes. The interface has been designed to be as simple as possible, with much of the management and configuration functionality implemented in software running on the PPEs rather than as state machines within the interface.
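The hardware buffer manager described above supports two kinds of buffer structure; their behaviour can be sketched with the following illustrative software model (the names are invented, and this is not the hardware design):

```python
# Illustrative model of the two buffer structures: a streaming FIFO that
# is emptied as it is read, and an object buffer that can be read and
# written repeatedly and is returned to a free list when released.
from collections import deque

free_list = deque(range(1024))     # up to 1024 buffer structures

class StreamFifo:
    def __init__(self):
        self.buf_id = free_list.popleft()
        self.words = deque()
    def push(self, word):          # producer side: streamed data in
        self.words.append(word)
    def pop(self):                 # consumer side: reading empties it
        return self.words.popleft()

class ObjectBuffer:
    def __init__(self, size):
        self.buf_id = free_list.popleft()
        self.words = [0] * size    # random access, re-readable
    def release(self):             # explicit return to the free list
        free_list.append(self.buf_id)
```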
The interface is designed to support many outstanding reads to service the requirements of the multiple PPEs. Up to 32 concurrent register load operations can be supported. For register loads, the data is not transferred to the processor until the PCI Express bus CRC has been checked. This requires local buffering in the PCI Express interface. To avoid having to buffer the larger DMA loads in the PCI Express interface, they are allowed to complete directly to the DMA buffers, but are tagged as unchecked until the final CRC is validated. The DMA buffer score-boarding uses this tag to block store operations using that data until it has been both loaded and checked.

External Memory

The Elan5 provides a 32 bit DDR2 SDRAM interface for applications requiring larger amounts of adapter memory. This approach provides the option of significantly lower (and predictable) latency when compared to accessing the host memory, which may have to wait for lengthy DMA load operations to complete. An external EEPROM interface is also provided for a ROM to store the Elan5 boot code.

Fault Tolerance

Elan5 is designed for High Performance Computing (HPC) systems consisting of thousands of commodity servers. Fault tolerance of the system is of critical importance; the sheer number of components makes errors in the memory systems and packet corruption a frequent occurrence. For example, with 1024 nodes each driving a single link at 2Gbytes/s we would expect to see between 10 and 20 corrupt packets per second with a bit error rate (BER) of 10^-15. The approach taken on Elan5 is to use ECC to protect all memories containing state that is difficult to recreate (the SRAM, the DMA buffers, the cache and the local memory interface). Parity is sufficient for the instruction caches. Data transfers (and the input/output buffers) are protected by an end-to-end 32-bit CRC on each packet. PCI Express has its own CRCs.

Software Model

The allocation of the processing tasks to the pool of available PPEs is controlled by the device firmware. Common firmware is loaded from ROM as the Elan boots; it initializes the PCI Express interface and local memory. Application specific firmware modules (QsNet or Ethernet) are then loaded by the device driver.

The libelan5 library provides direct user space access to the device. The libelan library provides a binary interface to a variety of widely used parallel programming models; their implementation is common to Elan5 and the existing Elan3 and Elan4 adapters.

Figure 2: Elan5 software stack

Elan5 supports secure multi-user access to QsNet through the use of job specific capabilities that describe the rights of each user, and network context numbers assigned to each packet by the outputting PPE.

Software Model – QsNet

When used as a QsNet device, two PPEs are dedicated to input handling, taking packets from the links and either executing them directly or (typically when they contain requests from other Elans) dispatching them to one of the other PPEs. Two PPEs are dedicated to handling large DMA requests, processing DMA descriptors received from the host (put) or remote Elans (get). Two PPEs are assigned to short put/get requests where the address and data are supplied with the command. The management PPE performs configuration and status monitoring tasks. Use of the device is illustrated with several of the tasks common to HPC applications. Our first example illustrates how a short put is executed, a common operation in Partitioned Global Address Space (PGAS) programming models such as Shmem [3], ARMCI [4], or UPC [5]. The user process formats a command block containing the put command itself, the destination virtual address, the destination virtual process number (sometimes called the rank) and the data. The command block is then written to an Elan5 command queue as a PIO write. The command launch unit assigns the command to a PPE. The PPE builds a packet and forwards it to the link.

Figure 3: Execution of a short put on Elan5
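To make the short put concrete, the command block might be modelled as below. The field layout, sizes and opcode are invented for illustration; the actual Elan5 command format is not described here.

```python
# Hedged sketch of a short-put command block (field layout and opcode are
# invented; they do not reflect the real Elan5 command format). The block
# carries the put command, the destination virtual address, the destination
# virtual process number (rank), and the data itself.
import struct

CMD_SHORT_PUT = 0x01               # illustrative opcode

def format_short_put(dest_vaddr, dest_vproc, data):
    # A "short" put carries its data inline with the command.
    assert len(data) <= 64
    header = struct.pack("<IIQ", CMD_SHORT_PUT, dest_vproc, dest_vaddr)
    return header + data           # written to a command queue as a PIO write
```

The CLU would then assign the block to a command queue based on the physical address it was written to, and signal the PPE that processes that queue.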
On the input side the PPE takes the packet from the link and writes the data directly to host memory. The destination virtual address is translated to a physical address by the MMU. An acknowledgement (ACK) is returned to the source PPE on successful completion. In the event of an error or trap a negative acknowledgement (NACK) is returned and the source will retransmit. On Elan5 put operations are purely one sided. They involve the source process and the Elans; the destination process does not participate and there is no need to run a main CPU thread.

It is inefficient to transfer large amounts of data using PIO writes. Above a certain (programmable) threshold we write a DMA descriptor to the source Elan instead of the data.

Figure 4: Execution of a DMA transfer on Elan5

The CLU schedules the command on a general purpose PPE, which issues DMA reads to the host interface and packet writes to the network. The input PPE on the destination node consumes packets, writing the data to memory and returning ACKs to the source.

The PGAS models also require support for fetching data from a remote process: get. Elan5 implements get operations by sending a request to the destination Elan, which performs the load (or DMA read) and puts the resulting data back to the source. Again the operation is purely one sided, being executed by the initiating process and the Elans.

The Elan5 libraries provide a highly optimized implementation of MPI message passing [6], supporting both RDMA and independent progression of MPI messages. In an MPI application the source processes send messages and the destination processes receive them; a transfer completes when a send is matched to a receive, using either the source rank or an integer tag. In a well written program non-blocking receives will have been posted early so as to hide the time taken to transfer data.

On Elan5 the MPI tag matching is performed by one or more PPEs. The receiving process writes MPI receive descriptors to a queue in the Elan. The sender generates a command block containing the message header and (optionally) a small amount of data. This header is transmitted to the destination (step 1 in Figure 5) as per the short put example, except that this is a put to a remote queue rather than a put to memory. The input PPE recognizes the packet type and assigns the header to the user's queue.

Figure 5: Execution of an MPI message on Elan5

Arrival of data in the packet queue schedules the matcher thread on one of the general purpose PPEs. It determines whether there is a matching receive (step 2). In the short message case, data supplied with the header can then be written directly to memory (the receive descriptor contains the user virtual address), completing the transfer. In the large message case the matcher thread sends a get request to the source (step 3), which can then complete the transfer with a DMA (step 4). If no matching receive is available the header is buffered until a receive is posted. The QsNet firmware implements these features together with support for executing collectives in the adapter.

Software Model – Ethernet

When used as an Ethernet device, two PPEs are dedicated to output packet preparation. This entails:
- fetching the data from host memory space;
- data realignment, converting data from the host memory layout to the packet alignment;
- data preconditioning, such as the preparation of RDMA framing data;
- TCP/IP header generation.

The two input processes associated with the links handle packet recognition and classification, header checksum processing and RDMA CRC checking. The task of working out where to place the packets in memory, and of writing the packets to host memory, is performed by two protocol processes, which also generate the TCP/IP acknowledgements. The management PPE performs housekeeping tasks such as configuration, retransmission scheduling, and status monitoring.
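The header checksum processing mentioned above is, for TCP/IP, the standard Internet ones'-complement checksum (RFC 1071). A minimal reference implementation of the arithmetic, independent of the Elan5's on-the-fly hardware support, looks like this:

```python
# Internet checksum (RFC 1071): ones'-complement sum of 16-bit words.
# This is the checksum carried in IP/TCP headers; the arithmetic is the
# same whether computed in software or streamed through an adapter.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:              # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF
```

A receiver verifies a header by summing it including the checksum field; the result is zero for an intact header.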
Implementation
The Elan5 has a number of different clock domains, determined by the speeds of the various external interfaces. The 8 on-chip phase locked loops allow the domains to be clocked asynchronously, easing timing closure and removing complex dependencies. The PPEs, fabric and local memory are clocked at 500MHz, giving a total instruction throughput of 7Gops.

The device is implemented on a standard TSMC 90nm, 8 layer metal process and measures 10mm x 10mm. The total gate count for the functional logic is 4.5 million gates. The total on-chip memory exceeds one Mbyte. The memories include Built In Self Repair (BISR) to improve manufacturing yield. A full scan methodology is used for silicon manufacturing test, and JTAG boundary scan is included for manufactured board connectivity testing.

The Elan5 device is packaged in a 672-ball high performance flip-chip ball grid array. Flip-chip packaging was selected to meet the signal integrity requirements of the high-speed serial links. Elan5 uses a semi-custom package with particular attention paid to the routing of the differential pairs for the links.

Worst case power dissipation for the device is 17 Watts, requiring the use of a thermally enhanced package with a copper heat spreader. Typical power consumption is significantly lower, around 12W for an x8 PCI Express interface and a single link, and lower still when the SDRAM interface is not required.

Conclusion
The architecture of the Elan5 enables a single device to support a range of different communication protocols. Performing the protocol handling in firmware allows the device to provide a high level of communications processing off-load, without the complexity and verification challenges of a custom hardware implementation. The architecture has been designed so that future variants can scale the number of links and the number of packet processing engines. They will also utilize higher bandwidth host interfaces.

REFERENCES
[1] J. Beecroft, D. Addison, F. Petrini and M. McLaren, "Quadrics QsNetII: A Network for Supercomputing Applications," in Proceedings of Hot Chips 15, Palo Alto, California, 2003. See http://www.quadrics.com/
[2] IEEE 802.3ae, IEEE, Piscataway, NY. Network standard.
[3] Shmem, first referenced in the Cray T3E programmers guide.
[4] J. Nieplocha and B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems," Lecture Notes in Computer Science, Vol. 1586. See http://www.emsl.pnl.gov/docs/parsoft/armci/
[5] UPC Consortium, "UPC Language Specifications, v1.2," Lawrence Berkeley National Lab Tech Report LBNL-59208, 2005. See http://upc.gwu.edu/
[6] The MPI Forum, "MPI: A Message-Passing Interface Standard." See http://www.mcs.anl.gov/mpi/
