High-Radix Networks
ABSTRACT
Increasing integrated-circuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a cost-efficient topology for high-radix networks. On benign (load-balanced) traffic, the flattened butterfly approaches the cost/performance of a butterfly network and has roughly half the cost of a comparable-performance Clos network. The advantage over the Clos is achieved by eliminating redundant hops when they are not needed for load balance. On adversarial traffic, the flattened butterfly matches the cost/performance of a folded-Clos network and provides an order of magnitude better performance than a conventional butterfly. In this case, global adaptive routing is used to switch the flattened butterfly from minimal to non-minimal routing – using redundant hops only when they are needed. Minimal and non-minimal, oblivious and adaptive routing algorithms are evaluated on the flattened butterfly. We show that load-balancing adversarial traffic requires non-minimal globally-adaptive routing and show that sequential allocators are required to avoid transient load imbalance when using adaptive routing algorithms. We also compare the cost of the flattened butterfly to folded-Clos, hypercube, and butterfly networks with identical capacity and show that the flattened butterfly is more cost-efficient than folded-Clos and hypercube topologies.

Categories and Subject Descriptors
C.1.2 [Computer Systems Organization]: Multiprocessors – Interconnection architectures; B.4.3 [Hardware]: Interconnections – Topology

General Terms
Design, Performance

Keywords
interconnection networks, topology, high-radix routers, flattened butterfly, global adaptive routing, cost model

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA’07, June 9–13, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-706-3/07/0006 ...$5.00.

1. INTRODUCTION
Interconnection networks are widely used to connect processors and memories in multiprocessors [23, 24], as switching fabrics for high-end routers and switches [8], and for connecting I/O devices [22]. As processor and memory performance continues to increase in a multiprocessor computer system, the performance of the interconnection network plays a central role in determining the overall performance of the system. The latency and bandwidth of the network largely establish the remote memory access latency and bandwidth.

Over the past 20 years k-ary n-cubes [6] have been widely used – examples of such networks include SGI Origin 2000 [17], Cray T3E [24], and Cray XT3 [5]. However, the increased pin density and faster signaling rates of modern ASICs are enabling high-bandwidth router chips with >1Tb/s of off-chip bandwidth [23]. Low-radix networks, such as k-ary n-cubes, are unable to take full advantage of this increased router bandwidth. Taking advantage of increased router bandwidth requires a high-radix router – with a large number of narrow links, rather than a conventional router with a small number of wide links. With modern technology, high-radix networks based on a folded-Clos [4] (or fat-tree [18]) topology provide lower latency and lower cost [14] than a network built from conventional low-radix routers. The next generation Cray BlackWidow vector multiprocessor [23] is one of the first systems that uses high-radix routers and implements a modified version of the folded-Clos network.

In this paper, we introduce the high-radix flattened butterfly topology which provides better path diversity than a conventional butterfly and has approximately half the cost of a comparable-performance Clos network on balanced traffic. The flattened butterfly is similar to a generalized hypercube network [2] but by utilizing concentration, the flattened butterfly can scale more effectively and also exploit high-radix routers. Routing algorithms on the flattened butterfly are compared and we show that global-adaptive routing is needed to load-balance the topology. We show that greedy implementations of global-adaptive routing algorithms give poor instantaneous load-balance. We describe a sequential implementation of the UGAL [27] routing algorithm that overcomes this load-imbalance and also introduce the CLOS AD routing algorithm that removes transient load imbalance as the flattened butterfly is routed similarly to a folded-Clos network. In addition to the performance analysis, we provide a detailed cost comparison of high-radix topologies and discuss the cost advantages of the flattened butterfly.

The remainder of the paper is organized as follows. We
describe the flattened butterfly topology in detail in Section 2 and discuss routing algorithms in Section 3. Section 4 presents a cost model for interconnection networks and provides a cost comparison between the flattened butterfly and alternative topologies. Additional discussion of the flattened butterfly is provided in Section 5 including a power comparison. Section 6 discusses related work, and Section 7 presents our conclusions.

[Figure 1: (a) 4-ary 2-fly and (b) the corresponding flattened butterfly; (c) 2-ary 4-fly and (d) the corresponding flattened butterfly, with dimensions 1 to 3 labeled.]

2. FLATTENED BUTTERFLY TOPOLOGY
The butterfly network (k-ary n-fly) can take advantage of high-radix routers to reduce latency and network cost [9]. However, there is no path diversity in a butterfly network which results in poor throughput for adversarial traffic patterns. Also, a butterfly network cannot exploit traffic locality. A Clos network [4] provides many paths¹ between each pair of nodes. This path diversity enables the Clos to route arbitrary traffic patterns with no loss of throughput. The input and output stages of a Clos network can be combined or folded on top of one another creating a folded Clos or fat-tree [18] network which can exploit traffic locality.

A Clos or folded Clos network, however, has a cost that is nearly double that of a butterfly with equal capacity and has greater latency than a butterfly. The increased cost and latency both stem from the need to route packets first to an arbitrary middle stage switch and then to their ultimate destination. This doubles the number of long cables in the network, which approximately doubles cost, and doubles the number of inter-router channels traversed, which drives up latency.

The flattened butterfly is a topology that exploits high-radix routers to realize lower cost than a Clos on load-balanced traffic, and provide better performance and path diversity than a conventional butterfly. In this section, we describe the flattened butterfly topology in detail by comparing it to a conventional butterfly and show how it is also similar to a folded-Clos.

¹ One for each middle-stage switch in the Clos.

2.1 Topology Construction: Butterfly to Flattened Butterfly
We derive the flattened butterfly by starting with a conventional butterfly (k-ary n-fly) and combining or flattening the routers in each row of the network into a single router. Examples of flattened butterfly construction are shown in Figure 1. 4-ary 2-fly and 2-ary 4-fly networks are shown in Figure 1(a,c) with the corresponding flattened butterflies shown in Figure 1(b,d). The routers R0 and R1 from the first row of Figure 1(a) are combined into a single router R0' in the flattened butterfly of Figure 1(b). Similarly routers R0, R1, R2, and R3 of Figure 1(c) are combined into R0' of Figure 1(d). As a row of routers is combined, channels entirely local to the row, e.g., channel (R0,R1) in Figure 1(a), are eliminated. All other channels of the original butterfly remain in the flattened butterfly. For example, channel (R0,R3) in Figure 1(a) becomes channel (R0',R1') in Figure 1(b). Because channels in a flattened butterfly are symmetrical, each line in Figures 1(b,d) represents a bidirectional channel (i.e. two unidirectional channels), while each line in Figures 1(a,c) represents a unidirectional channel.

A k-ary n-flat, the flattened butterfly derived from a k-ary n-fly, is composed of N/k radix k' = n(k − 1) + 1 routers where N is the size of the network. The routers are connected by channels in n' = n − 1 dimensions, corresponding to the n − 1 columns of inter-rank wiring in the butterfly. In each dimension d, from 1 to n', router i is connected to each router j given by

$$ j = i + \left[ m - \left( \left\lfloor \frac{i}{k^{d-1}} \right\rfloor \bmod k \right) \right] k^{d-1} \qquad (1) $$

for m from 0 to k − 1, where the connection from i to itself is omitted. For example, in Figure 1(d), R4' is connected to R5' in dimension 1, R6' in dimension 2, and R0' in dimension 3.

The number of nodes (N) in a flattened butterfly is plotted as a function of number of dimensions n' and switch radix k' in Figure 2. The figure shows that this topology is
suited only for high-radix routers. Networks of very limited size can be built using low-radix routers (k' < 16) and even with k' = 32 many dimensions are needed to scale to large network sizes. However with k' = 61, a network with just three dimensions scales to 64K nodes.

Figure 2: Network size (N) scalability as the radix (k') and dimension (n') are varied.

2.2 Routing and Path Diversity
Routing in a flattened butterfly requires a hop from a node to its local router, zero or more inter-router hops, and a final hop from a router to the destination node. If we label each node with an n-digit radix-k node address, an inter-router hop in dimension d changes the dth digit of the current node address to an arbitrary value, and the final hop sets the 0th (rightmost) digit of the current node address to an arbitrary value. Thus, routing minimally from node a = a_{n−1}, ..., a_0 to node b = b_{n−1}, ..., b_0, where a and b are n-digit radix-k node addresses, involves taking one inter-router hop for each digit, other than the rightmost, in which a and b differ. For example, in Figure 1(d) routing from node 0 (0000_2) to node 10 (1010_2) requires taking inter-router hops in dimensions 1 and 3. These inter-router hops can be taken in either order giving two minimal routes between these two nodes. In general, if two nodes a and b have addresses that differ in j digits (other than the rightmost digit), then there are j! minimal routes between a and b. This path diversity derives from the fact that a packet routed in a flattened butterfly is able to traverse the dimensions in any order, while a packet traversing a conventional butterfly must visit the dimensions in a fixed order – leading to no path diversity.

Routing non-minimally in a flattened butterfly provides additional path diversity and can achieve load-balanced routing for arbitrary traffic patterns. Consider, for example, Figure 1(b) and suppose that all of the traffic from nodes 0-3 (attached to router R0') was destined for nodes 4-7 (attached to R1'). With minimal routing, all of this traffic would overload channel (R0',R1'). By misrouting a fraction of this traffic to R2' and R3', which then forward the traffic on to R1', load is balanced. With non-minimal routing, a flattened butterfly is able to match the load-balancing (and non-blocking) properties of a Clos network – in effect acting as a flattened Clos. We discuss non-minimal routing strategies for flattened butterfly networks in more detail in Section 3.

2.3 Comparison to Generalized Hypercube
The flattened butterfly is similar to a generalized hypercube [2] (GHC) topology. The generalized hypercube is a k-ary n-cube network that uses a complete connection, rather than a ring, to connect the nodes in each dimension. In the 1980s, when this topology was proposed, limited pin bandwidth made the GHC topology prohibitively expensive at large node counts. Also, without load-balancing global adaptive routing, the GHC has poor performance on adversarial traffic patterns.

The flattened butterfly improves on the GHC in two main ways. First, the flattened butterfly connects k terminals to each router while the GHC connects only a single terminal to each router. Adding this k-way concentration makes the flattened butterfly much more economical than the GHC (reducing its cost by a factor of k), improves its scalability, and makes it more suitable for implementation using high-radix routers. For example, routers used in a 1K node network of a flattened butterfly with one dimension and a (8,8,16) GHC are shown in Figure 3. With 32 terminal nodes per router, the terminal bandwidth of the flattened butterfly is matched to the inter-router bandwidth. In the GHC, on the other hand, there is a mismatch between the single terminal channel and 32 inter-router channels. If the inter-router channels are of the same bandwidth as the terminal channel, the network will be prohibitively expensive and utilization of the inter-router channels will be low. If the inter-router channels are sized at 1/32 the bandwidth of the terminal channel, serialization latency will make the latency of the network prohibitively high and the overall bandwidth of the router will be low, making poor use of the high pin bandwidth of modern VLSI chips.

The second improvement over the GHC is the use of non-minimal globally-adaptive routing (Section 3) to load-balance adversarial traffic patterns and the use of adaptive-Clos routing with sequential allocators to reduce transient load imbalance. These modern routing algorithms enable the flattened butterfly to match the performance of a Clos network on adversarial traffic patterns. In contrast, a GHC using minimal routing is unable to balance load and hence suffers the same performance bottleneck as a conventional butterfly on adversarial traffic.

Figure 3: Block diagram of routers used in a 1K network for (a) flattened butterfly with one dimension and (b) (8,8,16) generalized hypercube.

3. ROUTING ALGORITHMS AND PERFORMANCE COMPARISON
In this section, we describe routing algorithms for the flattened butterfly and compare their performance on different traffic patterns. We also compare the performance of the flattened butterfly to alternative topologies.

3.1 Routing Algorithms on Flattened Butterfly
We evaluate the flattened butterfly on five routing algorithms: minimal adaptive (MIN AD), Valiant’s non-minimal oblivious algorithm (VAL), the UGAL non-minimal adaptive algorithm (UGAL), a variant of UGAL using a sequential allocator (UGAL-S), and non-minimal adaptive routing in a flattened Clos (CLOS AD). We describe each briefly here.
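As a minimal sketch of the addressing that the routing algorithms rely on, the Python below evaluates Equation (1) for one router and counts minimal routes as described in Section 2.2. The function names, and the convention that a router's index is its nodes' address with the rightmost digit dropped, are assumptions made for illustration rather than notation from the paper.

```python
from math import factorial


def flat_neighbors(i, k, n_prime):
    """Routers adjacent to router i in a k-ary n-flat with n' = n - 1 dimensions.

    Follows Equation (1): in dimension d (numbered from 1), router i connects to
    j = i + (m - (i // k**(d-1)) % k) * k**(d-1) for m = 0..k-1, skipping j == i.
    Returns a dict mapping each dimension to its list of neighbor indices.
    """
    neighbors = {}
    for d in range(1, n_prime + 1):
        stride = k ** (d - 1)
        digit = (i // stride) % k  # router i's coordinate in dimension d
        neighbors[d] = [i + (m - digit) * stride for m in range(k) if m != digit]
    return neighbors


def differing_dimensions(a, b, k, n):
    """Dimensions (1..n-1) in which n-digit radix-k node addresses a and b differ.

    Digit 0 (rightmost) only selects the terminal on a router, so it needs
    no inter-router hop and is ignored here.
    """
    return [d for d in range(1, n) if (a // k**d) % k != (b // k**d) % k]


def minimal_route_count(a, b, k, n):
    """Number of minimal routes: j! orderings of the j differing dimensions."""
    return factorial(len(differing_dimensions(a, b, k, n)))
```

For the 2-ary 4-flat of Figure 1(d), `flat_neighbors(4, 2, 3)` gives `{1: [5], 2: [6], 3: [0]}`, and `minimal_route_count(0, 10, 2, 4)` gives 2, in agreement with the examples in Sections 2.1 and 2.2.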
We consider routing in a k-ary n-flat where the source node s, destination node d, and current node c are represented as n-digit radix-k numbers, e.g., s_{n−1}, ..., s_0. At a given step of the route, a channel is productive if it is part of a minimal route; that is, a channel in dimension j is productive if c_j ≠ d_j before traversing the channel, and c_j = d_j after traversing the channel.

We assume an input-queued virtual channel router [9]. Adaptive algorithms estimate output channel queue length using the credit count for output virtual channels which reflects the occupancy of the input queue on the far end of the channel. For MIN AD and UGAL, the router uses a greedy allocator [13]: during a given routing cycle, all inputs make their routing decisions in parallel and then the queuing state is updated en masse. For UGAL-S and CLOS AD, the router uses a sequential allocator where each input makes its routing decision in sequence and updates the queuing state before the next input makes its decision.

Minimal Adaptive (MIN AD): The minimal adaptive algorithm operates by choosing for the next hop the productive channel with the shortest queue. To prevent deadlock, n' virtual channels (VCs) [7] are used with the VC selected based on the number of hops remaining to the destination.

Valiant (VAL) [30]: Valiant’s algorithm load balances traffic by converting any traffic pattern into two phases of random traffic. It operates by picking a random intermediate node b, routing minimally from s to b, and then routing minimally from b to d. Routing through b perfectly balances load (on average) but at the cost of doubling the worst-case hop count, from n' to 2n'. While any minimal algorithm can be used for each phase, our evaluation uses dimension order routing. Two VCs, one for each phase, are needed to avoid deadlock with this algorithm.

Universal Globally-Adaptive Load-balanced (UGAL [27], UGAL-S): UGAL chooses between MIN AD and VAL on a packet-by-packet basis to minimize the estimated delay for each packet. The product of queue length and hop count is used as an estimate of delay. With UGAL, traffic is routed minimally on benign traffic patterns and at low loads, matching the performance of MIN AD, and non-minimally on adversarial patterns at high loads, matching the performance of VAL. UGAL-S is identical to UGAL but with a sequential allocator.

Adaptive Clos (CLOS AD): Like UGAL, the router chooses between minimal and non-minimal on a packet-by-packet basis using queue lengths to estimate delays. If the router chooses to route a packet non-minimally, however, the packet is routed as if it were adaptively routing to the middle stage of a Clos network. A non-minimal packet arrives at the intermediate node b by traversing each dimension using the channel with the shortest queue for that dimension (including a “dummy queue” for staying at the current coordinate in that dimension). Like UGAL-S, CLOS AD uses a sequential allocator. The routing is identical to adaptive routing in a folded-Clos [13] where the folded-Clos is flattened into the routers of the flattened butterfly. Thus, the intermediate node is chosen from the closest common ancestors and not among all nodes. As a result, even though CLOS AD is non-minimal routing, the hop count is always equal to or less than that of a corresponding folded-Clos network.

Figure 4: Routing algorithm comparisons on the flattened butterfly with (a) uniform random traffic and (b) worst case traffic pattern. Both panels plot latency (cycles) against offered load for VAL, MIN, UGAL, UGAL-S, and CLOS AD.

3.2 Evaluation & Analysis
We use cycle accurate simulations to evaluate the performance of the different routing algorithms in the flattened butterfly. We simulate a single-cycle input-queued router switch. Packets are injected using a Bernoulli process. The simulator is warmed up under load without taking measurements until steady-state is reached. Then a sample of injected packets is labeled during a measurement interval. The simulation is run until all labeled packets exit the system.

We simulate a 32-ary 2-flat flattened butterfly topology (k' = 63, n' = 1, N = 1024). Simulations of other size networks and higher dimension flattened butterflies follow the same trend and are not presented due to space constraints. We simulate single-flit packets² and assume the total buffering per port is 32 flit entries. Although input-queued routers have been shown to be problematic in high-radix routers [14], we use input-queued routers but provide sufficient switch speedup in order to generalize the results and ensure that routers do not become the bottleneck of the network.

We evaluate the different routing algorithms on the flattened butterfly using both benign and adversarial traffic patterns. Uniform random (UR) traffic is a benign traffic pattern that balances load across the network links. Thus, minimal routing is sufficient for such traffic patterns and all of the routing algorithms except VAL achieve 100% throughput (Figure 4(a)). VAL achieves only half of network capacity regardless of the traffic pattern³ [28]. In addition, VAL adds additional zero-load latency with the extra hop count associated with the intermediate node.

We also simulate an adversarial traffic pattern where each node associated with router R_i sends traffic to a randomly selected node associated with router R_{i+1}. With this traffic pattern, all of the nodes connected to a router will attempt

² Different packet sizes do not impact the comparison results in this section.

³ The capacity of the network (or the ideal throughput for UR traffic) is given by 2B/N where B is the total bisection bandwidth and N is the total number of nodes [9]. For the flattened butterfly, similar to the butterfly network, B = N/2, thus the capacity of the flattened butterfly is 1.
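The difference between the greedy and sequential allocators of Section 3.1 can be illustrated with a toy model. The sketch below assumes one packet per input, that every output queue is a candidate for every input, and that ties go to the lowest-numbered queue; none of these simplifications come from the paper, and all function names are hypothetical. `ugal_prefers_minimal` applies the queue-length times hop-count delay estimate described for UGAL, with ties resolved toward the minimal route as a further assumption.

```python
def shortest_queue(queues):
    """Index of the shortest queue (lowest index wins ties; an assumption)."""
    return min(range(len(queues)), key=queues.__getitem__)


def greedy_allocate(queues, num_inputs):
    """Greedy allocator: all inputs decide in parallel on the same snapshot
    of queue lengths, and the queue state is updated only afterwards."""
    choice = shortest_queue(queues)
    choices = [choice] * num_inputs  # every input sees the same shortest queue
    updated = list(queues)
    for c in choices:
        updated[c] += 1
    return choices, updated


def sequential_allocate(queues, num_inputs):
    """Sequential allocator: each input decides in turn and updates the
    queue state before the next input makes its decision."""
    updated = list(queues)
    choices = []
    for _ in range(num_inputs):
        c = shortest_queue(updated)
        choices.append(c)
        updated[c] += 1
    return choices, updated


def ugal_prefers_minimal(q_min, hops_min, q_nonmin, hops_nonmin):
    """UGAL-style choice: estimate delay as queue length x hop count and
    route minimally when its estimate is no worse (tie rule is an assumption)."""
    return q_min * hops_min <= q_nonmin * hops_nonmin
```

With four empty output queues and four inputs, the greedy version sends every packet to queue 0, while the sequential version spreads them one per queue, which is the transient imbalance the sequential allocator is meant to remove.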
to three other topologies: the conventional butterfly, the folded Clos, and the hypercube. We use a network of 1024

Figure 6: Topology comparisons of the flattened butterfly and the folded Clos with (a) uniform random traffic and (b) worst case traffic pattern. Both panels plot latency (cycles) against offered load.

Figure 7: Cable Cost Data. (a) Cost of Infiniband 4x and 12x cables as a function of cable length and (b) cable cost model with the use of repeaters for cables >6m. The model is based on the Infiniband 12x cost model and the data points are fitted with a line to calculate the average cable cost. Both panels plot cost ($) against cable length (meters); the fitted line shown is Cost = 0.8117 × length + 3.7285.
The system components, processing nodes and routers, are packaged within a packaging hierarchy. At the lowest level of the hierarchy are the compute modules (containing the processing nodes) and routing modules (containing the routers). At the next level of the hierarchy, the modules are connected via a backplane or midplane printed circuit board. The modules and backplane are contained within a cabinet. A system consists of one or more cabinets with the necessary cables connecting the router ports according to the network topology. The network cables may aggregate multiple network links into a single cable to reduce cost and cable bulk.

Component                          Cost
router                             $390
  router chip                      $90
  development                      $300
links (cost per signal, 6Gb/s)
  backplane                        $1.95
  electrical                       $3.72 + $0.81 ℓ
  optical                          $220.00

Table 2: Cost breakdown of an interconnection network.

Network cost is determined by the cost of the routers and the backplane and cable links. The cost of a link depends on its length and location within the packaging hierarchy (Table 2). A link within the backplane⁴ is about $1.95 per differential signal⁵, whereas a cable connecting nearby routers (routers within 2m) is about $5.34 per signal, and an optical cable is about $220 per signal⁶ to connect routers in a distant cabinet. Inexpensive backplanes are used to connect modules in the same cabinet over distances that are typically less than 1m. Moderate cost electrical cables connect modules in different cabinets up to lengths of 5-10m.⁷ Transmitting signals over longer distances requires either an expensive optical cable or a series of electrical cables connected by repeaters that retime and amplify the signal. Because optical technology still remains relatively expensive compared to electrical signaling over copper, our cost analysis uses electrical signaling with repeaters as necessary for driving signals over long distances.

Router cost is divided into development (non-recurring) cost and silicon (recurring) cost. The development cost is amortized over the number of router chips built. We assume a non-recurring development cost of ≈$6M for the router which is amortized over 20k parts, about $300 per router. The recurring cost is the cost for each silicon part which we assume to be ≈$90 per router using the MPR cost model [19] for a TSMC 0.13µm 17mm×17mm chip which includes the cost of packaging and test.

The cost of electrical cables can be divided into the cost of the wires (copper) and the overhead cost which includes the connectors, shielding, and cable assembly. Figure 7(a) plots the cost per differential signal for Infiniband 4x and 12x cables as a function of distance. The cost of the cables was obtained from Gore [11]. The slope of the line represents the cable cost per unit length ($/meter) and the y-intercept is the overhead cost associated with shielding, assembly and test. With the Infiniband 12x cables, the overhead cost per signal is reduced by 36% because of cable aggregation – Infiniband 12x contains 24 wire pairs while Infiniband 4x contains only 8 pairs, thus the 12x cable is able to amortize the shielding and assembly cost.⁸

When the cable length exceeds the critical length of 6m, repeaters need to be inserted and the cable cost with repeaters is shown in Figure 7(b), based on the Infiniband 12x cable cost. When length is 6m, there is a step in the

⁴ We assume $3000 backplane cost for 1536 signals, or $1.95 per backplane signal which includes the GbX connector cost at $0.12 per mated signal to support signal rates up to 10 Gb/s [1].

⁵ To provide signal integrity at high speed, signals are almost always transmitted differentially - using a pair of wires per signal.

⁶ Pricing from www.boxfire.com – $480 for 20 meter 24 strand terminated fiber optic cable results in $20 per signal for the cost of the cable. Although optical transceiver modules have a price goal of around $200, that has not yet been achieved with current modules costing $1200 and up.

⁷ In our analysis, we will assume that a 6m cable is the longest distance that can be driven at full signalling rate of 6.25 Gb/s, based on SerDes technology similar to that used in the Cray BlackWidow [23].

⁸ It is interesting to note that the 12x has a slightly higher wire cost per unit length. The Infiniband 4x is a commodity cable whereas the 12x cables needed to be custom ordered – thus, resulting in the higher price.
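The figures in Table 2 can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: the constant and function names are hypothetical, ℓ is taken to be the cable length in meters (consistent with the $/meter slope described above), and the repeater cost for cables longer than 6m is not modeled because its value is not given in this section.

```python
ROUTER_CHIP_COST = 90.0          # recurring silicon cost per router ($)
DEVELOPMENT_COST = 6_000_000.0   # non-recurring development cost ($)
PARTS_BUILT = 20_000             # parts over which development is amortized

BACKPLANE_COST_PER_SIGNAL = 1.95   # $ per differential signal
ELECTRICAL_OVERHEAD = 3.72         # $ per signal: connectors, shielding, assembly
ELECTRICAL_COST_PER_METER = 0.81   # $ per signal per meter of cable
OPTICAL_COST_PER_SIGNAL = 220.00   # $ per signal


def router_cost():
    """Per-router cost: silicon cost plus amortized development cost."""
    return ROUTER_CHIP_COST + DEVELOPMENT_COST / PARTS_BUILT


def link_cost_per_signal(kind, length_m=0.0):
    """Cost of one differential signal for a given link type.

    The electrical formula is the repeater-free linear model of Table 2;
    repeaters for cables longer than 6 m are deliberately left out.
    """
    if kind == "backplane":
        return BACKPLANE_COST_PER_SIGNAL
    if kind == "electrical":
        return ELECTRICAL_OVERHEAD + ELECTRICAL_COST_PER_METER * length_m
    if kind == "optical":
        return OPTICAL_COST_PER_SIGNAL
    raise ValueError(f"unknown link kind: {kind!r}")
```

`router_cost()` evaluates to 390.0, and `link_cost_per_signal("electrical", 2.0)` evaluates to roughly 5.34, the figure quoted for routers within 2m.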
Figure 9: Sample cabinet packaging layout of 1024 nodes in (a) folded-Clos and (b) hypercube topologies. Each box represents a cabinet of 128 nodes and the hypercube is partitioned into two chassis. In the folded-Clos, the router cabinet is assumed to be placed in the middle and for illustration purposes only, the router cabinet is drawn taller.

Table 3: Technology and packaging assumptions used in the topology comparison. The values are representative of those used in the Cray BlackWidow parallel computer.

Figure 11: Cost comparison of the different topologies (hypercube, folded Clos, conventional butterfly, flattened butterfly) as a function of network size. The bottom plot is the same plot as the plot on the top with the x-axis zoomed in to display the smaller networks more clearly.
Figure 10: (a) The ratio of the link cost and the (b) average cable length of the different topologies as the network size is increased. The cable overhead is not included in this plot.

4.3 Cost Comparison
The assumptions and the parameters used in the topology cost comparison are shown in Table 3. In determining the actual cable length for both Lmax and Lavg, we add 2m of cable overhead – 1m of vertical cable run at each end of the cable. We multiply the depth of the cabinet footprint by a factor of 2 to allow for spacing between rows of cabinets. Based on these parameters and the cost model described earlier, we compare the cost of conventional butterfly, folded-Clos, hypercube, and flattened butterfly topologies while holding the random bisection bandwidth constant.

In Figure 10(a), the ratio of the link cost to the total interconnection network cost is shown. The total cost of the interconnection network is dominated by the cost of the links. Because of the number of routers in the hypercube, the routers dominate the cost for small configurations.¹⁰ However, for larger hypercube networks (N > 4K), link cost accounts for approximately 60% of network cost. For the other three topologies, link cost accounts for approximately 80% of network cost. Thus, it is critical to reduce the number of links to reduce the cost of the interconnection network.

The average cable length (Lavg) of the different topologies is plotted in Figure 10(b) as the network size is varied, based on the cable length model presented in Section 4.2. For small network sizes (<4K), there is very little difference in Lavg among the different topologies. For larger network sizes, Lavg for the flattened butterfly is 22% longer than that of a folded-Clos and 54% longer than that of a hypercube. However, with the reduction in the number of global cables, the flattened butterfly is still able to achieve a cost reduction compared to the other topologies.

We plot the cost per node of the four topologies in Figure 11 as the network size is increased. In general, the butterfly network provides the lowest cost while the hypercube and the folded-Clos have the highest cost. The flattened butterfly, by reducing the number of links, gives a 35-53% reduction in cost compared to the folded-Clos.

A step increase occurs in the cost of the folded-Clos when the number of stages increases. For example, increasing the network size from 1K to 2K nodes with radix-64 routers requires adding an additional stage to the network to go from a 2-stage folded-Clos to a 3-stage folded-Clos. The flattened butterfly has a similar step in the cost function when the number of dimensions increases. For example, a dimension must be added to the network to scale from 1K to 2K nodes. However, the cost increase is much smaller (approximately $40 per node increase compared to $60 per node increase for the folded-Clos) since only a single link is added by increasing the number of dimensions – whereas in the folded-Clos, two links are added.

For small networks (<1K), the cost benefit of the flattened butterfly compared to the folded-Clos is approximately 35% to 38%. Although the flattened butterfly halves the number of global cables, it does not reduce the number of local links from the processors to the routers. For small networks, these links account for approximately 40% of the total cost per node – thus, total reduction in cost is less than the expected 50%.

¹⁰ The router cost for the hypercube is appropriately adjusted, based on the number of pins required. Using concentration in the hypercube could reduce the cost of the network but will significantly degrade performance on adversarial traffic patterns and thus, is not considered.
k     n     k'     n'
64    2     127    1
16    3     46     2
8     4     29     3
4     6     19     5
2     12    12     11

Table 4: Different k and n parameters for a N = 4K network and the corresponding k' and n' flattened butterfly parameters.
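The relation between a k-ary n-fly and its flattened counterpart stated in Section 2.1 (N = k^n nodes, N/k routers of radix k' = n(k − 1) + 1, and n' = n − 1 dimensions) can be checked numerically. The helper below is a small sketch with a hypothetical name and return format, not code from the paper.

```python
def flat_parameters(k, n):
    """Parameters of the k-ary n-flat derived from a k-ary n-fly.

    Uses the relations of Section 2.1: N = k**n nodes, N/k routers,
    router radix k' = n(k - 1) + 1, and n' = n - 1 dimensions.
    """
    num_nodes = k ** n
    return {
        "N": num_nodes,
        "routers": num_nodes // k,
        "k_prime": n * (k - 1) + 1,
        "n_prime": n - 1,
    }
```

For example, `flat_parameters(32, 2)` describes the simulated 32-ary 2-flat (N = 1024, k' = 63, n' = 1), and `flat_parameters(16, 4)` gives the k' = 61, three-dimension network of 64K nodes mentioned in Section 2.1. For k = 2, n = 12 the same formula evaluates to k' = 13.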
Figure 14: Examples of alternative organization of a 4-ary 2-flat flattened butterfly by (a) using redundant channels and (b) increasing the scalability. The dotted lines represent the additional links added by taking advantage of the extra ports.

5.2 Wire Delay
As shown in Figure 10(b), the flattened butterfly increases the average length of the global cables and can increase the wire delay. However, longer wire delay does not necessarily lead to longer overall latency since the wire delay (time of flight) is based on the physical distance between a source and its destination. As long as the packaging of the topology has minimal physical distance, the time of flight is approximately the same regardless of the hop count. Direct networks such as torus or hypercube have minimal physical distance between two nodes but indirect networks such as folded Clos and the conventional butterfly have non-minimal physical distance that can add additional wire latency. However, since there are no intermediate stages in the flattened butterfly, the packaging is similar to a direct network with minimal¹¹ physical distance and does not increase the overall wire delay. For this reason, the wire delay of a

¹¹ Minimal distance here means minimal Manhattan distance and not the distance of a straight line between the two endpoints.

However, for direct topologies and for the flattened butterfly, a dedicated SerDes can be used to exploit the packaging locality – e.g. have separate SerDes on the router chip where some of the SerDes are used to drive local links while others are used to drive global links and reduce the power consumption. For example, a SerDes that can drive <1m of backplane only consumes approximately 40mW [32] (P_link_ll), resulting in over 5x power reduction. The different power assumptions are summarized in Table 5 and the power comparison for the different topologies is shown in Figure 15.

The comparison trend is very similar to the cost comparison that was shown earlier in Figure 11. The hypercube gives the highest power consumption while the conventional butterfly and the flattened butterfly give the lowest power consumption. For a 1K node network, the flattened butterfly provides lower power consumption than the conventional butterfly since it takes advantage of the dedicated SerDes to drive local links. For networks between 4K and 8K nodes, the flattened butterfly provides approximately 48% power reduction compared to the folded Clos because a flattened butterfly of this size requires only 2 dimensions while the folded-Clos requires 3 stages. However, for N > 8K, the flattened butterfly requires 3 dimensions and thus, the power reduction drops to approximately 20%.

6. RELATED WORK
The comparison of the flattened butterfly to the generalized hypercube was discussed earlier in Section 2.3. Non-
Component     Power
P_switch      40 W
P_link_gg     200 mW
P_link_gl     160 mW
P_link_ll     40 mW

Table 5: Power consumption of different components in a router.

to a Clos network. Dilated butterflies [15] can be created where the bandwidth of the channels in the butterflies is increased. However, as shown in Section 4.3, the network cost is dominated by the links and these methods significantly increase the cost of the network with additional links as well as routers.

Our non-minimal routing algorithms for the flattened butterfly are motivated by recent work on non-minimal and globally adaptive routing. Non-minimal routing was first applied to torus networks and has been shown to be critical to properly load-balancing tori [27, 28]. We show that these principles and algorithms can be applied to the flattened butterfly as well. We also extend this previous work