
Hardware- and Software-Based Collective

Communication on the Quadrics Network


Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg, and Adolfy Hoisie

CCS-3 Modeling, Algorithms, and Informatics


Los Alamos National Laboratory


Technical University of Valencia - SPAIN


scoll@lanl.gov

Hardware- and Software-Based Collective Communication on the Quadrics Network – p.1


Outline
Introduction
Quadrics network design overview
Hardware
Communication/programming libraries
Collective communication on the QsNET
Barrier synchronization
Broadcast
Performance analysis
Experimental framework
Results
Conclusions



Introduction
The efficient implementation of collective communication
is a challenging design effort
Very important to guarantee scalability of barrier
synchronization, broadcast, gather, scatter, reduce, etc.
Essential to implement system primitives to enhance
fault-tolerance.
Software or hardware support for multicast communication
can improve the performance and resource utilization of a
parallel computer
Software multicast: based on unicast messages, simple
to implement, no network topology constraint, slower
Hardware multicast: requires dedicated hardware,
network dependent, faster


Introduction
Some of the most powerful systems in the world use the
Quadrics interconnection network and the collective
communication services analyzed in this work:
The Terascale Computing System (TCS) at the
Pittsburgh Supercomputing Center, the second most
powerful computer in the world
[Figure: Barrier Test. Latency (µs), axis range 4 to 6.5, versus number of nodes, 2 to 512]
ASCI Q machine, currently under development at Los
Alamos National Laboratory (30 TeraOps, expected to
be delivered by the end of 2002)


Quadrics Network Design Overview
QsNET provides an abstraction of distributed virtual shared
memory
Each process can map a portion of its address space into the
global memory
These address spaces constitute the virtual shared memory
This shared memory is fully integrated with the native
operating system
Based on two building blocks:
a network interface card called Elan
a crossbar switch called Elite



Elan
[Figure: Elan block diagram. Blocks: link multiplexer with two FIFOs (0 and 1), thread processor, µcode processor, DMA buffers, inputter, SDRAM interface, MMU and TLB, table walk engine, clock and statistics registers, 4-way set-associative cache, 100 MHz 64-bit data bus, PCI interface]
Link: 400 MB/s, bidirectional, 200 MHz / 10 bits
2 virtual channels
Thread processor: 32-bit, SPARC-based, runs the communication protocols
TLB synchronized with the host
PCI interface: 66 MHz / 64-bit


Elite
8 bidirectional links with 2 virtual channels in each
direction
An internal 16x8 full crossbar switch
400 MB/s on each link direction
Packet error detection and recovery, with routing and data
transactions CRC protected
2 priority levels plus an aging mechanism
Adaptive routing
Hardware support for broadcast


Network Topology: Quaternary Fat-Tree
[Figure: quaternary fat-tree network topology]
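
As a rough aid for reading the topology, the size of a quaternary fat-tree can be computed from its dimension. The sketch below assumes the standard k-ary n-tree construction (k^n processing nodes and n * k^(n-1) switches); the function name fat_tree_size is made up for this example and is not part of any Quadrics library.

```python
def fat_tree_size(dimension, arity=4):
    """Return (nodes, switches) for a k-ary n-tree.

    Assumes the standard k-ary n-tree construction:
    arity**dimension processing nodes and
    dimension * arity**(dimension - 1) switches.
    """
    nodes = arity ** dimension
    switches = dimension * arity ** (dimension - 1)
    return nodes, switches


if __name__ == "__main__":
    for dim in (1, 2, 3):
        nodes, switches = fat_tree_size(dim)
        print(f"dimension {dim}: {nodes} nodes, {switches} switches")
```

Under this assumption, a dimension-three quaternary fat-tree connects 64 nodes, which matches the 64-node cluster used in the performance analysis.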


Packet Format
[Figure: packet layout. A packet consists of a route, one or more transactions, and an EOP token. Labeled fields: routing tags, packet header, CRC, context, transaction type, memory address, data, CRC]
320 bytes data payload (5 transactions with 64 bytes each)
74-80 bytes overhead
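
The payload and overhead figures above give a simple ratio of useful bytes per full packet. The sketch below only computes that ratio; it assumes every overhead byte is counted against the same packet, which the slides do not state, so it should not be read as a prediction of the measured bandwidths. The name payload_fraction is hypothetical.

```python
PAYLOAD_BYTES = 5 * 64  # 5 transactions of 64 bytes each, as on the slide


def payload_fraction(overhead_bytes, payload_bytes=PAYLOAD_BYTES):
    """Fraction of a full packet that is data payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


if __name__ == "__main__":
    for overhead in (74, 80):  # overhead range quoted on the slide
        print(f"{overhead} bytes overhead: {payload_fraction(overhead):.3f}")
```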


Programming Libraries
Elan3lib
event notification
memory mapping and allocation
remote DMA
Elanlib and Tports
collective communication
tagged message passing
MPI, shmem
[Figure: software stack. Layers from top to bottom: user applications; shmem, MPI; tport; elanlib; elan3lib (user space); system calls and elan kernel comms (kernel space)]


Collective communication on the QsNET
[Figure: broadcast tree for a 16-node network]
[Figure: serialization through the root switch to avoid deadlocks]
[Figure: deadlocked situation]


Barrier Synchronization
QsNET implements two synchronization primitives:
Software-based: it uses a balanced tree and point-to-point
messages
elan_gsync()
Hardware-based: it uses the hardware multicast support
elan_hgsync(): busy-wait
elan_hgsyncEvent(): event-based


Software-Based Barrier
Each process waits for 'ready' signals from its children (1)
and sends its own signal up to the parent process (2)
[Figure: balanced tree for 16 nodes. Root node 0; second level 1, 5, 9, 13; third level 2 3 4 6 7 8 10 11 12 14 15]
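
The two steps above can be sketched with threads standing in for processes. This is a minimal illustration, not the elan_gsync() implementation: it models only the upward 'ready' phase described on the slide, and it numbers the tree in level order with a fan-out of 4, which is an assumption and differs from the numbering in the figure. The names build_tree and gather_order are hypothetical.

```python
import threading


def build_tree(num_nodes, fanout=4):
    """Map each node to its children in a level-order tree (assumed layout)."""
    return {
        node: [c for c in range(fanout * node + 1, fanout * node + fanout + 1)
               if c < num_nodes]
        for node in range(num_nodes)
    }


def gather_order(num_nodes, fanout=4):
    """Run the upward phase and return the order in which nodes became ready."""
    children = build_tree(num_nodes, fanout)
    ready = [threading.Event() for _ in range(num_nodes)]
    order = []
    lock = threading.Lock()

    def run(node):
        for child in children[node]:
            ready[child].wait()      # (1) wait for 'ready' from each child
        with lock:
            order.append(node)
        ready[node].set()            # (2) signal the parent

    threads = [threading.Thread(target=run, args=(n,)) for n in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(gather_order(16))  # the root (node 0) is always last
```

The root becomes ready only after every other node has signalled, which is the point at which a barrier implementation could start releasing the processes.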


Hardware-Based Barrier
[Figure: example for 16 nodes, shown in four steps]
Init barrier: (1) init barrier, (2) update sequence #, (3) wait
Multicast transaction: test sequence #
Acknowledgment: finish barrier, return OK or FAIL
Final 'EOP' (End-Of-Packet) token: finish barrier
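
One way to read the steps above is as a polling loop: each node updates a local sequence number when it reaches the barrier, and a multicast test succeeds only when every node reports the expected value. The sketch below is a toy model under that reading; the combined OK/FAIL semantics and the retry on FAIL are assumptions, and rounds_until_barrier_completes is a hypothetical name, not an Elan call.

```python
def rounds_until_barrier_completes(arrival_round, max_rounds=1000):
    """Return the first polling round in which all nodes pass the test.

    arrival_round[i] is the round in which node i reaches the barrier
    and updates its sequence number. Returns None if the barrier does
    not complete within max_rounds.
    """
    num_nodes = len(arrival_round)
    sequence = [0] * num_nodes
    for current in range(max_rounds):
        for node in range(num_nodes):
            if arrival_round[node] == current:
                sequence[node] = 1          # (2) update sequence #
        # Multicast "test sequence #": OK only if every node matches.
        if all(value == 1 for value in sequence):
            return current                   # acknowledgment is OK
        # Otherwise the acknowledgment is FAIL and the test is retried.
    return None


if __name__ == "__main__":
    print(rounds_until_barrier_completes([0, 2, 1, 0]))
```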


Broadcast
QsNET implements two broadcast primitives:
Software-based: it uses a balanced tree and point-to-point
messages
elan_bcast()
Hardware-based: it uses the hardware multicast support
elan_hbcast()
Both implementations perform an initial barrier to
guarantee resource allocation



Performance Analysis
The experimental results are obtained on a 64-node cluster
of Compaq AlphaServer ES40s running Tru64 Unix.
Each AlphaServer is attached to a quaternary fat-tree of
dimension three through a 64 bit, 33 MHz PCI bus using
the Elan3 card.
In order to expose the real network performance, we place
the communication buffers in Elan memory.
We present:
unidirectional ping results, as a reference, and
barrier and broadcast results, analyzing the effect of
additional background traffic



Unidirectional Ping
[Figure: Ping Bandwidth. Bandwidth (MB/s) versus message size (1 byte to 4 MB) for MPI; Elan3, Elan to Elan; and Elan3, Main to Main]
Peak data bandwidth (Elan to Elan) of 335 MB/s, equivalent to 396 MB/s (99% of nominal
bandwidth)
Main to main asymptotic bandwidth of 200 MB/s


Unidirectional Ping
[Figure: Ping Latency. Latency (µs) versus message size (1 byte to 4 KB) for MPI; Elan3, Elan to Elan; and Elan3, Main to Main]
Latency of 2.4 µs up to 64-byte messages (Elan to Elan memory)
Higher MPI latency due to message tag matching


Barrier Synchronization
[Figure: Barrier Test, 1 CPU per node. Latency (µs) versus nodes (4, 16, 64) for elan_gsync(), elan_hgsync(), and elan_hgsyncEvent()]
Good hardware barrier scalability


Barrier Synchronization with Background Traffic
[Figure: Barrier Test, 1 CPU per node, two panels (complement traffic, uniform traffic). Latency (µs, log scale) versus nodes (4, 16, 64) for elan_gsync(), elan_hgsync(), and elan_hgsyncEvent()]
Software barrier significantly affected (slowdown by a factor of 40 in the worst case)
Little impact on the hardware barriers, whose average latency is only doubled
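
The slides do not define the two background traffic patterns. The sketch below uses the definitions commonly found in the interconnection-network literature, which is an assumption here: under complement traffic each node sends to the node whose identifier is the bitwise complement of its own, and under uniform traffic each node picks a destination uniformly at random among the other nodes. Both function names are hypothetical.

```python
import random


def complement_destination(node, num_nodes):
    """Destination under complement traffic (num_nodes must be a power of two)."""
    bits = num_nodes.bit_length() - 1
    return node ^ ((1 << bits) - 1)


def uniform_destination(node, num_nodes, rng):
    """Destination chosen uniformly at random, excluding the sender itself."""
    dest = rng.randrange(num_nodes - 1)
    return dest if dest < node else dest + 1


if __name__ == "__main__":
    rng = random.Random(0)
    for node in (0, 5, 63):
        print(node,
              complement_destination(node, 64),
              uniform_destination(node, 64, rng))
```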


Hardware Barrier with Background Traffic
[Figure: Barrier Test, 64 nodes, 1 CPU per node (latency distribution). Distribution over latency (µs), 4 to 512, for elan_hgsync() with no background traffic, complement traffic, and uniform traffic]
94% of the operations take less than 9 µs with no background traffic
93% of the tests take less than 20 µs with uniform traffic


Software Barrier with Background Traffic
[Figure: Barrier Test, 64 nodes, 1 CPU per node (latency distribution). Distribution over latency (µs), 8 to 4096, for elan_gsync() with no background traffic, complement traffic, and uniform traffic]
99% of the barriers take less than 30 µs with no background traffic
93% of the synchronizations complete in less than 605 µs with uniform traffic


Broadcast Bandwidth
[Figure: Broadcast Test, 64 nodes, 1 CPU per node. Bandwidth (MB/s) versus message size (1 byte to 1 MB) for elan_bcast() and elan_hbcast(), each with buffers in main memory and in Elan memory]
Asymptotic bandwidth of 288 MB/s when using Elan memory for both implementations


Broadcast Latency
[Figure: Broadcast Test, 64 nodes, 1 CPU per node. Latency (µs) versus message size (1 byte to 4 KB) for elan_bcast() and elan_hbcast(), each with buffers in main memory and in Elan memory]
Hardware latency with Elan buffers below 13 µs for messages up to 256 bytes
Software latencies are 3.5 µs higher than hardware latencies


Broadcast Scalability
[Figure: Broadcast Test, 1 CPU per node (256k bytes), two panels. Bandwidth (MB/s) and latency (µs) versus nodes (4, 16, 64) for elan_bcast() and elan_hbcast(), each with buffers in main memory and in Elan memory]
No significant effect when using buffers in main memory
With buffers in Elan memory performance depends on the number of switch layers traversed
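
For a large message, bandwidth and latency are two views of the same measurement, so the bandwidths quoted in these slides can be turned into approximate latencies for the 256k-byte broadcast. The sketch below assumes 256k means 256 * 1024 bytes and 1 MB/s means 10^6 bytes per second, and it ignores any fixed start-up cost; transfer_time_us is a hypothetical helper.

```python
def transfer_time_us(size_bytes, bandwidth_mb_s):
    """Time in microseconds to move size_bytes at bandwidth_mb_s (10^6 bytes/s).

    bytes / (MB/s * 1e6 bytes/s) seconds, times 1e6 for microseconds,
    simplifies to bytes / bandwidth_mb_s.
    """
    return size_bytes / bandwidth_mb_s


if __name__ == "__main__":
    size = 256 * 1024
    for label, bw in (("Elan memory, 288 MB/s", 288),
                      ("main memory, 200 MB/s", 200)):
        print(f"{label}: {transfer_time_us(size, bw):.0f} µs")
```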


Broadcast with Background Traffic
[Figure: Broadcast Test, 64 nodes, 1 CPU per node (complement traffic), two panels. Bandwidth (MB/s) versus message size (1 byte to 1 MB) and latency (µs, log scale) versus message size (1 byte to 64 KB) for elan_bcast() and elan_hbcast(), each with buffers in main memory and in Elan memory]


Broadcast with Background Traffic
[Figure: Broadcast Test, 64 nodes, 1 CPU per node (uniform traffic), two panels. Bandwidth (MB/s) versus message size (1 byte to 1 MB) and latency (µs, log scale) versus message size (1 byte to 64 KB) for elan_bcast() and elan_hbcast(), each with buffers in main memory and in Elan memory]
Latency differences between hw and sw implementations increase
Better performance with buffers in main memory (due to the background traffic
application)


Broadcast with Background Traffic
[Figure: Broadcast Test, 1 CPU per node (256k bytes, uniform traffic), two panels. Bandwidth (MB/s) and latency (µs) versus nodes (4, 16, 64) for elan_bcast() and elan_hbcast(), each with buffers in main memory and in Elan memory]
Significant performance degradation for all the alternatives


Conclusions
Hardware-based synchronization takes as little as 6 µs on a
64-node AlphaServer cluster, with very good scalability.
Good latency and scalability are achieved with the
software-based synchronization too, which takes about
15 µs.
The hardware barrier is almost insensitive to background
traffic, with 93% of the synchronizations completed in less
than 20 µs.
With the broadcast, both implementations can deliver a
sustained bandwidth of 288 MB/s Elan memory to Elan
memory and 200 MB/s main memory to main memory.
