Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg and Adolfy Hoisie
[Figure: latency (µs) versus number of nodes, 2 to 512 nodes]
Hardware- and Software-Based Collective Communication on the Quadrics Network – p.5
Introduction
Some of the most powerful systems in the world use the
Quadrics interconnection network and the collective
communication services analyzed in this work:
The Terascale Computing System (TCS) at the
Pittsburgh Supercomputing Center – the second most
powerful computer in the world
ASCI Q machine, currently under development at Los
Alamos National Laboratory (30 TeraOps, expected to
be delivered by the end of 2002)
Collectives
[Figure: Elan functional units. Labels shown: link mux with 2 virtual channels (FIFO 0, FIFO 1), thread processor, µcode processor, DMA buffers, inputter, SDRAM interface, 100 MHz data bus, MMU and TLB (synchronized with host), table walk engine, clock and statistics registers, 4-way set-associative cache, 66 MHz / 64-bit PCI interface]
[Figure: packet format. Fields shown: routing tags, packet header (transaction type, context, memory address), CRC, data, CRC]
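The packet fields named on this slide (routing tags, transaction type, context, memory address, data, and two CRCs) can be illustrated with a small sketch. This is not the Elan wire format: the field widths, the byte order, the field ordering, and the use of `zlib.crc32` as the checksum are all assumptions chosen only to make the example runnable, and the names `Packet` and `check` are hypothetical.

```python
import struct
import zlib
from dataclasses import dataclass

HEADER_FMT = ">HHQ"  # assumed widths: 16-bit type, 16-bit context, 64-bit address


@dataclass
class Packet:
    routing_tags: bytes     # assumed: one byte per switch hop
    transaction_type: int   # hypothetical numeric code
    context: int            # hypothetical context identifier
    memory_address: int     # destination address
    data: bytes             # payload

    def encode(self) -> bytes:
        header = struct.pack(HEADER_FMT, self.transaction_type,
                             self.context, self.memory_address)
        # First CRC covers routing tags + header, second CRC covers the data.
        head = self.routing_tags + header
        head_crc = struct.pack(">I", zlib.crc32(head))
        data_crc = struct.pack(">I", zlib.crc32(self.data))
        return head + head_crc + self.data + data_crc


def check(raw: bytes, n_tags: int) -> bool:
    """Recompute both CRCs of an encoded packet and compare them."""
    head_end = n_tags + struct.calcsize(HEADER_FMT)
    head, head_crc = raw[:head_end], raw[head_end:head_end + 4]
    data, data_crc = raw[head_end + 4:-4], raw[-4:]
    return (struct.pack(">I", zlib.crc32(head)) == head_crc
            and struct.pack(">I", zlib.crc32(data)) == data_crc)


pkt = Packet(b"\x01\x02", transaction_type=3, context=7,
             memory_address=0x1000, data=b"hello")
raw = pkt.encode()
# Flip one bit in the last payload byte to simulate a transmission error.
corrupted = raw[:-5] + bytes([raw[-5] ^ 1]) + raw[-4:]
```

Here `check(raw, 2)` accepts the intact packet, while `check(corrupted, 2)` rejects it because the data CRC no longer matches.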
[Figure: programming libraries. Labels shown: shmem, mpi, tport, elanlib]
[Figure: deadlocked situation, shown in three steps on a set of numbered nodes (1 to 15)]
[Figure: barrier steps: init barrier, test sequence #, multicast transaction, acknowledgment, finish barrier]
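The barrier steps named on this slide (init barrier, test sequence number, multicast transaction, acknowledgment, finish barrier) can be sketched as a toy simulation. The slide gives only the step names, so the logic below is an assumption: each node keeps a local sequence number, a multicast test asks every node whether it has reached the expected number, the per-node answers are combined into one acknowledgment, and the barrier finishes only if all nodes answered positively. `Node` and `try_barrier` are hypothetical names, not part of any Quadrics library.

```python
from dataclasses import dataclass


@dataclass
class Node:
    seq: int = 0        # number of barriers this node has entered
    released: int = 0   # number of barriers this node has completed

    def enter_barrier(self) -> None:
        # "Init barrier": the node announces arrival by advancing its sequence.
        self.seq += 1

    def test(self, expected: int) -> bool:
        # "Test sequence #": has this node reached the expected barrier?
        return self.seq >= expected


def try_barrier(nodes: list[Node], expected: int) -> bool:
    """One attempt: multicast the test, combine the acks, finish on success."""
    acks = [node.test(expected) for node in nodes]   # multicast transaction
    if all(acks):                                    # combined acknowledgment
        for node in nodes:
            node.released = expected                 # finish barrier
        return True
    return False                                     # caller retries later


nodes = [Node() for _ in range(4)]
for node in nodes[:3]:
    node.enter_barrier()
first_attempt = try_barrier(nodes, expected=1)   # one node has not arrived yet
nodes[3].enter_barrier()
second_attempt = try_barrier(nodes, expected=1)  # all four nodes have arrived
```

The first attempt fails because one node has not yet entered the barrier; the second succeeds and releases every node. In this sketch a failed attempt changes no state, so it can simply be repeated.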
[Figure: bandwidth (MB/s) versus message size (bytes), 1 byte to 4 MB]
Peak data bandwidth (Elan to Elan) of 335 MB/s, corresponding to 396 MB/s (99% of nominal bandwidth)
Main to main asymptotic bandwidth of 200 MB/s
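As a back-of-the-envelope check of the figures quoted above, the short sketch below derives the nominal bandwidth implied by "396 MB/s (99% of nominal bandwidth)" and expresses the two measured data rates as fractions of it. Only the numbers on the slide are used; the variable names are illustrative, and no assumption is made about how the 335 MB/s and 396 MB/s figures relate to each other.

```python
# Figures quoted on the slide, in MB/s.
quoted_rate = 396.0        # stated to be 99% of nominal bandwidth
elan_to_elan_peak = 335.0  # peak data bandwidth, Elan to Elan
main_to_main = 200.0       # asymptotic bandwidth, main memory to main memory

# Nominal bandwidth implied by "396 MB/s is 99% of nominal".
nominal = quoted_rate / 0.99

# Measured data rates as a fraction of that nominal bandwidth.
elan_fraction = elan_to_elan_peak / nominal
main_fraction = main_to_main / nominal

print(f"implied nominal bandwidth: {nominal:.0f} MB/s")
print(f"Elan-to-Elan peak: {elan_fraction:.1%} of nominal")
print(f"main-to-main asymptote: {main_fraction:.1%} of nominal")
```

The implied nominal value is about 400 MB/s, so the Elan-to-Elan peak is a little under 84% of nominal and the main-to-main asymptote is about half of it.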
[Figure: latency (µs) versus number of nodes, 4 to 64 nodes]
[Figure, two panels: latency (µs) versus number of nodes, 4 to 64 nodes]
[Figure: latency (µs), x-axis ticks from 4 to 512]
[Figure: latency (µs), x-axis ticks from 8 to 4096]
[Figure: plot versus message size (bytes), 1 byte to 1 MB; y-axis label not recovered]
[Figure: latency (µs) versus message size (bytes), 1 byte to 4 KB]
Hardware latency with Elan buffers below 13 µs for messages up to 256 bytes
[Figure, two panels: bandwidth (MB/s) and latency (µs) versus number of nodes, 4 to 64 nodes. Curves: elan_bcast() - global - main; elan_bcast() - global - elan; elan_hbcast() - global - main; elan_hbcast() - global - elan]
[Figure, two panels: latency (µs) versus message size (bytes), 1 byte to 1 MB and 1 byte to 64 KB]
[Figure, two panels: latency (µs) versus message size (bytes), 1 byte to 1 MB and 1 byte to 64 KB]
Better performance with buffers in main memory (due to the background traffic
application)
[Figure, two panels: latency (µs) versus number of nodes, 4 to 64 nodes]