Narendra Karmarkar

AT&T Bell Laboratories, Murray Hill, NJ 07974
Indian Institute of Technology, Bombay
Abstract
Many problems in scientific computation involve sparse matrices. While dense matrix computations can be parallelized relatively easily, sparse matrices with arbitrary or irregular structure pose a real challenge to the design of highly parallel machines. In this paper we propose a new parallel architecture for sparse matrix computation based on finite projective geometries. The mathematical structure of these geometries plays an important role in defining the pattern of interconnection between memories and processors, as well as in solving several difficult problems arising in parallel systems (such as load balancing, data-routing, memory-access conflicts etc.) in an efficient manner.

1. Introduction

1.1 Parallelism in scientific computation

The architecture described in this paper arose in the domain of scientific computation: problems such as linear programming, solution of partial differential equations, signal processing, simulation of non-linear electronic circuits, dynamic programming etc. A large body of research is directed towards finding the amount of intrinsic parallelism available in a given computation. Fortunately, there is plenty of parallelism in the types of applications we are interested in, available at a fine-grain level. The real difficulties arise not in finding parallelism, but in exploiting it efficiently.

1.2 Difficulties in exploiting parallelism

These difficulties arise at roughly three different levels: the architectural level, the hardware operational level and the software level. Some of these difficulties are briefly described below.

1.2.1 Choice of interconnection topology

In the case of some important problem, one may be able to justify building a machine whose interconnection pattern is dedicated to solving that particular problem. On the other hand, one can base the architecture on a general interconnection network that can simulate any topology efficiently. There have been a number of interconnection networks proposed and explored, differing in the trade-off they make between generality and efficiency.

1.2.2 Sparse matrices with irregular structure

While dense matrix computations can be parallelized relatively easily on traditional architectures, many scientific applications give rise to sparse matrices with an arbitrary or irregular pattern of non-zero locations. Such problems pose a real challenge to the design of highly parallel machines. On the other hand, very efficient data structures and programming techniques have been developed for sparse matrix computation on sequential machines. When evaluating the effectiveness of a parallel architecture on a sparse matrix problem, it is important to compare the performance with the best sequential method for solving the same problem, rather than comparing the same algorithm on the two architectures.
At the hardware operational level, the movement of data between the various hardware elements (processors and memories) and the wires interconnecting them needs to be governed by a routing algorithm. The important issues are congestion, delays, the ability to find conflict-free paths, the amount of buffering needed at intermediate nodes, and the speed of the routing algorithm itself.

At the software level, the user needs to decompose the task, assign the subproblems to processors, and program the communication between the processors. In the case of a problem involving a regular square grid of nodes, it is easy to decompose the problem into isomorphic subproblems. Unfortunately, many real problems involve irregular grids or boundaries, sharp transitions in the functions being computed, etc. The task of decomposing such problems can be quite tedious. It is desirable to have the compiler do an efficient mapping of the user's problem onto the underlying hardware. Another issue that comes up is the level of granularity at which the parallelism is to be exploited. If one restricts oneself to the parallelism available at a coarse-grain level, it is easier to exploit, but as one goes to finer levels, there is more parallelism available. The real challenge is to provide the ability to exploit even fine-grain parallelism.

There have been a number of parallel architectures proposed in the last decade. Typically, one first decides a method of interconnecting the hardware elements such as processors and memories. Several algorithms are then found to control the operation of the system, such as assigning and moving data through the network, balancing the load on processors, etc. A question naturally arises: is it possible to define a parallel system in which the most elementary instruction you can give to the system as a whole automatically results in coherent, conflict-free, uniform and balanced operation of the memories and the wires connecting them, and can one express the computation specified by the user in terms of a sequence of such instructions? Thus an instruction for the system should be more than just a collection of instructions for the individual elements of the system. It should have three further properties. First, when the individual elements of the system follow such an instruction, there should be no conflicts or inefficient use of the resources. Secondly, the collection of such system-instructions should be powerful enough so that the computation specified by the user can be expressed in terms of these instructions. Furthermore, the instruction set should have a structure that permits this process of mapping user programs onto the underlying architecture to be carried out efficiently. It seems that the mathematical structure of the objects known as finite geometries is eminently suited for defining such a parallel system for solving the types of problems described earlier.

This work began as a mathematical curiosity: to explore how the structure and symmetry of finite geometries can be exploited to define the interconnection pattern between memories and processors, to assign load to processors, and to access memories in a conflict-free manner. Since then it has grown in several directions: how to design algorithms for various problems in scientific computation, how best to map a given computation onto the architecture, and how to design the hardware. These questions will be addressed in a series of papers that document the study performed by several researchers at AT&T Bell Laboratories and the Indian Institute of Technology. This paper, which is the first in the series, describes the mathematical structure based on projective geometries and two basic operations in scientific computation, namely matrix-vector multiplication and Gaussian elimination for sparse matrices. The other papers describe the compiler [DHI 89], simulation results [KR 91] and aspects of hardware design. In addition to showing how the mathematical structure of finite geometries helps in solving many difficult problems in parallel system design, our work is guided by the objective that even the first implementation based on these concepts should result in a practical machine that many scientists and engineers would want to use.
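Throughout the paper, the unit of work handed to the parallel system is a data-flow graph whose symbolic structure is fixed but which is executed many times with different numerical values. A minimal sketch of that idea (the representation and names here are ours, not the paper's):

```python
# Hypothetical illustration: a data-flow graph with fixed symbolic structure,
# executed repeatedly with different numerical values.

def run_dataflow(graph, inputs):
    """Execute a topologically ordered data-flow graph.

    graph: list of (result, op, arg1, arg2) tuples; values live in a dict,
    so the same symbolic graph can be re-run with new numerical inputs.
    """
    values = dict(inputs)
    for result, op, a, b in graph:
        x, y = values[a], values[b]
        values[result] = x + y if op == "+" else x * y
    return values

# The same symbolic graph (t = u*v; w = t + u), two different executions.
graph = [("t", "*", "u", "v"), ("w", "+", "t", "u")]
print(run_dataflow(graph, {"u": 2.0, "v": 3.0})["w"])  # 8.0
print(run_dataflow(graph, {"u": 1.5, "v": 4.0})["w"])  # 7.5
```

The compiler described later exploits exactly this repetition: the expensive mapping of the graph onto the hardware is done once, and amortized over the many numerical executions.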
2. Description

The machine proposed here is meant to be used as an attached processor to a general purpose machine referred to as the host processor. The two processors share a common global memory. The main program runs on the host processor. Computationally intensive subroutines or macros that have a fixed symbolic structure but are executed several times with different numerical values are carried out on the attached processor. Since the two processors share the same memory, it is not necessary to communicate large amounts of data between the processors. Only certain structural information, such as base addresses of arrays, needs to be communicated from the host processor to the attached processor before invoking a subroutine to be executed on the attached processor.

Consequently, the first version of the compiler and hardware is limited to applications that involve execution of a data-flow graph whose symbolic structure remains fixed but which is executed with different numerical values several times during the course of the algorithm. As an example, let us consider the problem of electronic circuit-simulation, a problem which is notoriously difficult to parallelize. A single time step, which requires solving non-linear algebraic equations, involves several linear-system solutions having the same symbolic structure, and each linear system, if solved by an iterative method like conjugate-gradient, involves several multiplications with a sparse matrix with fixed non-zero structure. Since the non-zero pattern of the matrix depends only on the structure of the electronic circuit being simulated, a single execution of the simulator may involve several thousand matrix-vector multiplications having the same data-flow graph. Typical numbers of iterations of the same data-flow graph for several other applications are compiled in Table #1.

Table #1. Iterations on the same data-flow graph (columns: application, problem number, repetition count; the recoverable entries include control-systems problems with repetition counts 7,254 and 1,863).

2.2 Interconnection

The interconnection is based on the structure of a projective geometry. A projective geometry of dimension d consists of a set of points S, and a collection of subsets of S associated with each integer i < d which constitute subspaces of dimension i. Thus subsets associated with dimension 1 are called lines, those associated with dimension 2 are called planes, etc. These subsets have intersection properties similar to the properties of the familiar 3-dimensional (infinite) geometry, e.g. two points determine a unique line, three non-collinear points determine a plane, two lines in a plane intersect in a point, etc. In section 3 we will define the class of geometries we are going to use more precisely. The geometric structure is used for defining the interconnection between the memories and processors as follows. Given a finite geometry of dimension d, choose a pair of dimensions d_m, d_p < d. Put the processors in the system in one-to-one correspondence with all subspaces of dimension d_p, put the memory modules in one-to-one correspondence with all subspaces of dimension d_m, and connect a processor and a memory module if the corresponding subspaces have a non-trivial intersection.

2.3 Memory system

The memory system of the attached processor is partitioned into n modules denoted by M_1, M_2, ..., M_n. The type of memory access possible in one machine cycle in this architecture is between the two extremes of random access and sequential access, and could be called structured access. Only certain combinations of words can be accessed in one cycle. These combinations are designed using certain symmetries present in the projective geometry, so that no conflicts can arise in either accessing the memory or sending the accessed data through the interconnection network. The total set of allowed combinations is powerful enough so that a sequence of such structured
accesses is as effective as random accesses. On the other hand, such a structured-access memory has a much higher bandwidth than a random-access memory implemented using comparable technology. In section 3, a mathematical characterization of the allowed access patterns is given. Furthermore, these individual access patterns can be combined to form certain special sequences, again using the symmetries present in the geometry. Each such sequence, called a perfect sequence, defines the operation of the hardware for several consecutive machine cycles at once, and has the effect of utilizing the communication bandwidth of the machine fully. As a result, it is possible to connect the memories and processors in the system so that the number of wires needed grows linearly with respect to the number of processors and memories in the system.

2.4 Assignment of operations to processors

The assignment of operations to processors is done at a fine-grain level. Consider a binary operation that takes two operands a and b as inputs and modifies one of them, say a, as output: a ← a ∘ b. Suppose the operand a belongs to the memory module M_i and the operand b belongs to the memory module M_j. Then we associate the index-pair (i, j) with this operation. Similarly, with a ternary operation we associate an index-triplet. If we number the processors as P_1, P_2, ..., P_n, then the processor P_l responsible for doing a binary operation with associated index pair (i, j) is given by a certain function l = f(i, j) that depends on the geometry. Thus two operations with the same index pair (or triplet) always get assigned to the same processor. Furthermore, the function used for the assignment is compatible with the structure of the geometry, i.e. the processor numbered f(i, j) is connected to memory modules i and j. Since the assignment of operations to processors is determined entirely by the location of the data, the compiler has indirect control over load-balancing by exploiting the freedom it has in assigning memory locations to intermediate operands in the data-flow graph, and by moving operands from one memory module to another by inserting move(i, j) instructions if necessary. (Even a random assignment of memory locations tends to distribute the load evenly as the size of the data-flow graph increases.)

2.5 Instructions

A system-instruction decides what action each of the individual hardware elements is to perform in a single machine cycle (or over a sequence of machine cycles, in the case of instructions corresponding to perfect sequences of conflict-free patterns). Each hardware element, such as an arithmetic processor, a switch or a memory module, has its own instruction set. The compiler takes a data-flow graph as input, implicitly uses system-level instructions, and produces as output a collection of programs, one for each element of the system, consisting of a list of instructions drawn from the instruction set of that element. Each hardware element of the system has local memory for storing the instruction sequence to be used by that element. The initial loading of these instruction sequences for each subroutine to be executed on the attached processor is done by the host. Once loaded, an instruction sequence may typically get executed many times before being overwritten by the instruction sequence for some other subroutine. Depending on the size of the local instruction memories, instruction sequences for several subroutines can reside simultaneously in these memories. Since the instruction sequences are accessed sequentially, a hardware implementation of the instruction storage memory can take advantage of sequential access.

2.6 Pipelining of memory accesses

The instruction for a memory module specifies a list of addresses of operands along with a read/write bit. The instruction sequence is stored in the same module. Since the address sequence is known in advance, full pipelining is possible in decoding addresses, accessing bits in memory, and in dynamic error detection and correction. Operands being written can also be placed in a shift register whose length depends on the number of stages in the pipeline, so that consistent values of operands can be supplied. Alternatively, the compiler can ensure that any memory location that is modified is not immediately accessed for reading (within as many machine cycles as the length of the pipeline for writing). In a large-scale implementation, each module can be connected to a separate local disk, to provide for secondary storage in addition to the secondary storage of the host processor.

2.7 Arithmetic processors

Arithmetic processors are capable of performing the basic arithmetic and logical operations. The instruction
sequence to be followed by a processor (created by the compiler) can also contain move(i, j) operations, simply for moving operands from memory module M_i to M_j, as well as no-ops. In the case of pipelined operations taking more than one machine cycle, execution of the corresponding instruction means initiating the operation. It is up to the compiler to ensure that the delay in the availability of the results is taken into account when the data-flow graph is processed. The instruction sequence to be followed by a processor is stored in its own local instruction memory. The instruction sequence does not contain addresses of operands, but only the type of operation to be performed (including no operation). There is no concept of fetching an operand. The processor simply operates on whatever data flows on its dedicated links. Each processor has its own local memory. In an application such as multiplication of a sparse matrix by a vector, the matrix elements can be stored in the local memories of the processors, and the input and output vectors can be stored in the shared, partitioned global memory.

2.8 Compiler

Given a sequence of instructions, many rearrangements of the sequence are possible that produce the same end result. An optimizing compiler for a sequential machine seeks to perform such rearrangements with the goal of reducing the sequential complexity of the program. In the case of a parallel machine, the number of possible rearrangements is even larger because of the greater degree of freedom available in assigning operations to processors, in moving the data through the interconnection network, etc. In order to make this freedom more visible and available to the compiler, the input to the compiler is expressed in terms of a data-flow graph that represents a partial order, based on dependencies, to be satisfied by the operations. The first version of the compiler [DHI 89] is restricted to applications that involve repeated execution of the same data-flow graph with different numerical values. The compiler does extensive processing of the data-flow graph to re-express the computation in balanced, conflict-free chunks as much as possible. Any remaining inefficiency shows up in the form of holes in the perfect patterns, which result in no-ops in the instruction sequences for processors.

3. Structure of finite projective geometries

3.1 Projective spaces over finite fields

In this section we briefly review the concept of a projective space over a finite field and introduce some notation used later.

Consider a finite field F = GF(s) having s elements, where s is a power of a prime number p: s = p^k, with k a positive integer. A projective space of dimension d over the finite field F, denoted by P_d(F), consists of the one-dimensional subspaces of the (d + 1)-dimensional vector space F^(d+1) over F. Elements of this vector space can be represented as (d + 1)-tuples (x_1, x_2, ..., x_(d+1)), where each x_i ∈ F. Clearly, the total number of such elements is s^(d+1) = p^(k(d+1)). Two non-zero elements x, y ≠ 0 of this vector space are said to be equivalent if there exists a λ ∈ GF(s) such that x = λy. Each equivalence class gives a point of the projective space. Hence the number of points in P_d(F) is given by

    P_d = (s^(d+1) − 1) / (s − 1).

An m-dimensional projective subspace of P_d(F) consists of all one-dimensional subspaces of an (m + 1)-dimensional subspace of the vector space. Let b_0, b_1, ..., b_m be a basis of the latter vector subspace. The elements of the vector subspace are given by

    x = Σ_(i=0..m) α_i b_i, where α_i ∈ F.

Hence the number of such elements is s^(m+1), and the number of points in the corresponding projective subspace is

    P_m = (s^(m+1) − 1) / (s − 1).

Let r = d − m. Then an (m + 1)-dimensional vector subspace of F^(d+1) can also be described as the set of all solutions of a system of r independent linear equations

    a_i · x = 0, where a_i belongs to the dual of F^(d+1), i = 1, ..., r,

so that to each m-dimensional projective subspace there corresponds a projective subspace of the dual space of dimension r − 1.

Let Ω_l denote the collection of all projective subspaces of dimension l. Thus Ω_0 is the set of all points in the projective space, Ω_1 is the set of all lines, Ω_(d−1) is the set of all hyperplanes, etc.
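The point count above is easy to check by brute force. The sketch below is our own illustration, restricted to prime s = p so that the field is just the integers mod p: it enumerates the points of P_d(GF(p)) by normalizing each non-zero (d + 1)-tuple so that its first non-zero coordinate is 1, and compares the count with (p^(d+1) − 1)/(p − 1).

```python
# Sanity check (ours, not from the paper): enumerate points of P_d(GF(p))
# for prime p and compare with the closed-form count.
from itertools import product

def projective_points(d, p):
    """Points of P_d(GF(p)), p prime, as canonical (d+1)-tuples.

    Each equivalence class is represented by the tuple whose first
    non-zero coordinate equals 1.
    """
    points = set()
    for v in product(range(p), repeat=d + 1):
        if any(v):
            lead = next(x for x in v if x)    # first non-zero coordinate
            inv = pow(lead, p - 2, p)         # multiplicative inverse in GF(p)
            points.add(tuple((inv * x) % p for x in v))
    return points

for d, p in [(2, 2), (2, 3), (4, 2)]:
    assert len(projective_points(d, p)) == (p ** (d + 1) - 1) // (p - 1)
print(len(projective_points(2, 3)))  # 13 points in the projective plane of order 3
```

Normalizing on the first non-zero coordinate picks exactly one representative per ray through the origin, which is why the set sizes agree with the formula.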
For n ≥ m, define

    φ(n, m, s) = [(s^(n+1) − 1)(s^n − 1) ... (s^(n−m+1) − 1)] / [(s^(m+1) − 1)(s^m − 1) ... (s − 1)],

the number of m-dimensional subspaces of an n-dimensional projective space over GF(s). The number of points in an l-dimensional subspace is φ(l, 0, s). The number of distinct l-dimensional subspaces through a given point is φ(d−1, l−1, s), and the number of distinct l-dimensional subspaces containing a given line is φ(d−2, l−2, s). More generally, for 0 ≤ l < m ≤ d, the number of m-dimensional subspaces of P_d(GF(s)) containing a given l-dimensional subspace is φ(d−l−1, m−l−1, s), and the number of l-dimensional subspaces contained in a given m-dimensional subspace is φ(m, l, s).

3.2 Interconnection based on the projective plane

We first describe a scheme that is based on the two-dimensional projective space P_2(F), where F = GF(s), also called the projective plane of order s. The number of points in a projective plane of order s is

    φ(2, 0, s) = (s^3 − 1) / (s − 1) = s^2 + s + 1,

which is the same as the number of lines. Each line contains (s + 1) points, and through any point there are (s + 1) lines. Every distinct pair of points determines a line, and every distinct pair of lines intersects in a point. Note that there are no exceptions to the latter rule in a projective geometry, since there are no parallel lines.

Let n = s^2 + s + 1. Given n processors and a memory system partitioned into n memory modules M_1, M_2, ..., M_n, a method of interconnecting them based on the incidence structure of a finite projective plane can be devised as follows. Put the memory modules in one-to-one correspondence with the points in the projective space, and the processors in one-to-one correspondence with the lines. A memory module and a processor are connected if the corresponding point belongs to the corresponding line. Thus each processor is connected to (s + 1) memory modules and vice versa. Consider a binary operation with associated index pair (i, j), and suppose i ≠ j. Then the memory modules M_i and M_j correspond to a distinct pair of points in the projective space. This pair of points determines a line in the projective space that corresponds to some processor p_l, and the binary operation is assigned to this processor. If i = j, or if the operation is unary, we have some freedom in assigning the operation. A ternary operation can not be directly performed in this scheme.

3.3 Matrix-vector multiplication

This incidence structure yields practically useful devices, such as a fast matrix-vector multiplier for sparse matrices with an arbitrary or irregular sparsity pattern. Suppose we have a p × q matrix A and we wish to compute the matrix-vector product y = Ax. First, the indices 1 to q of the vector x and the indices 1 to p of the vector y are assigned to the logical memory modules M_1, ..., M_n by means of two functions, say f and g, which can be hashing functions, i.e. let

    μ = f(i), ν = g(j), μ, ν ∈ Ω_0.

x(i) is stored in the memory module M_μ and y(j) is stored in the memory module M_ν. Now consider the multiply-and-accumulate operation corresponding to a non-zero element A(j, i) of the matrix A:

    y(j) ← y(j) + A(j, i) * x(i).
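The mapping just described can be sketched in a few lines. The construction below is our own illustration, not the paper's implementation: it models the projective plane of order s = 2 (n = 7) by the standard difference-set representation (points 0..6, lines taken as the cyclic shifts of {0, 1, 3} mod 7), and uses trivial hash functions f and g chosen so that f(i) ≠ g(j) for the nonzeros shown.

```python
# Illustration (ours): sparse matrix-vector multiply where each nonzero
# A[j, i] is handled by the processor owning the line through f(i) and g(j).

n = 7
lines = [frozenset({k, (k + 1) % n, (k + 3) % n}) for k in range(n)]

def line_through(a, b):
    """Index of the unique line through two distinct points."""
    return next(i for i, l in enumerate(lines) if a in l and b in l)

def matvec(A, x, f, g):
    """Compute y = A x for a sparse A given as {(j, i): value}."""
    tasks = {proc: [] for proc in range(n)}        # per-processor work lists
    for (j, i), a in A.items():
        tasks[line_through(f(i), g(j))].append((j, i, a))
    y = {}
    for ops in tasks.values():                     # conceptually in parallel
        for j, i, a in ops:
            y[j] = y.get(j, 0.0) + a * x[i]        # y(j) lives in module g(j)
    return y

A = {(0, 1): 2.0, (0, 4): 1.0, (3, 1): -1.0}       # arbitrary sparsity pattern
x = {1: 3.0, 4: 5.0}
f = lambda i: i % n                                # hash of a column index to a point
g = lambda j: (j + 2) % n                          # hash of a row index to a point
print(matvec(A, x, f, g))                          # {0: 11.0, 3: -3.0}
```

Because every pair of distinct points of the plane lies on exactly one line, `line_through` always succeeds, and the assignment of nonzeros to processors is determined entirely by the hashed data locations, as in the text.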
The operation is assigned to the processor p_l corresponding to the line passing through the pair of points corresponding to memory modules M_μ and M_ν. The matrix entry A(j, i) is stored in the local memory of the processor p_l. Note that the processor p_l is connected to the memory modules M_μ and M_ν, because the corresponding line contains the points corresponding to memory modules M_μ and M_ν. The input operands x(i) and y(j) are sent from the partitioned shared memory to the processor p_l through the interconnection network. The processor p_l also receives A(j, i) from its local memory and performs the arithmetic operation. The output y(j) is shipped back through the interconnection network to the memory module M_ν in the partitioned shared global memory. This method is very effective if the same matrix A is used for several matrix-vector multiplications, a typical situation occurring in many iterative methods or in applications solving linear-algebraic equations.

3.4 Perfect access patterns

We will first define a perfect access pattern for a two-dimensional geometry. Recall that a memory module is connected to a processor if the point α corresponding to the memory module belongs to the line β corresponding to the processor. Thus we can denote this connection by the ordered pair (α, β). Let C denote the collection of all processor-memory connections, i.e.

    C = {(α, β) | α ∈ Ω_0, β ∈ Ω_1, α ∈ β}.

Let the number of points (and hence the number of lines) in the geometry be n. A perfect access pattern is a collection of n ordered pairs of points

    P = {(a_1, b_1), (a_2, b_2), ..., (a_n, b_n) | a_i ≠ b_i, a_i, b_i ∈ Ω_0, i = 1, ..., n}

with the following properties:

1. The first members {a_1, a_2, ..., a_n} of all the pairs form a permutation of all the points of the geometry.
2. The second members {b_1, b_2, ..., b_n} of all the pairs form a permutation of all the points of the geometry.
3. Let l_i denote the line determined by the i-th pair of points, i.e. l_i = (a_i, b_i). Then the lines {l_1, l_2, ..., l_n} determined by these n pairs form a permutation of all the lines of the geometry.

    Point Pairs        Lines
    (p_1, q_1)         l_1 = (p_1, q_1)
    (p_2, q_2)         l_2 = (p_2, q_2)
    ...                ...
    (p_n, q_n)         l_n = (p_n, q_n)

    Table #2. Perfect access patterns for 2-d geometry

Clearly, if one schedules a collection of binary operations corresponding to such a set of index-pairs for simultaneous parallel execution, then we have the following situation:

1. There are no read or write conflicts in memory accesses.
2. There is no conflict or waiting in processor usage.
3. All processors are fully utilized.
4. The memory bandwidth is fully utilized.

If (a_i, b_i) is one of the pairs in a perfect pattern P and l_i is the corresponding line, we say that the perfect pattern P exercises the connections (a_i, l_i) and (b_i, l_i). A sequence of perfect access patterns is called a perfect sequence if every connection in C is exercised the same number of times collectively by the patterns contained in the sequence. If such a perfect sequence is packaged as a single instruction that defines the operation of the machine over several machine cycles, it leads to uniform utilization of the communication bandwidth of the wires connecting processors and memories. Perfect access patterns and perfect sequences can be easily generated using the group-theoretic structure of projective geometries, as described in the next section.
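As a concrete check of this definition, the sketch below (representation ours: points 0..6, lines as the cyclic shifts of the difference set {0, 1, 3} mod 7) builds a candidate pattern by repeatedly applying the shift x → x + 1 (mod 7) to an initial pair, in the spirit of the generation procedure described next, and verifies the three defining properties:

```python
# Checking the perfect-access-pattern definition on the plane of order 2
# (n = 7). The difference-set representation of the plane is ours.

n = 7
lines = [frozenset({k, (k + 1) % n, (k + 3) % n}) for k in range(n)]

def line_index(a, b):
    """Index of the unique line through two distinct points."""
    return next(i for i, l in enumerate(lines) if a in l and b in l)

def is_perfect(pattern):
    """pattern: list of n ordered point pairs (a_i, b_i), a_i != b_i."""
    firsts = [a for a, _ in pattern]
    seconds = [b for _, b in pattern]
    used = [line_index(a, b) for a, b in pattern]
    return (sorted(firsts) == list(range(n)) and   # property 1
            sorted(seconds) == list(range(n)) and  # property 2
            sorted(used) == list(range(n)))        # property 3

# Candidate pattern: shift the initial pair (0, 1) around the point cycle.
pattern = [(k % n, (k + 1) % n) for k in range(n)]
print(is_perfect(pattern))  # True
```

Shifting the initial pair visits every point once in each coordinate and, because the shift also permutes the lines, every line exactly once, which is why the three checks pass simultaneously.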
3.5 Group-theoretic structure of projective spaces

Points in the projective space P_d of dimension d over GF(s) correspond to rays through the origin in the vector space of dimension (d + 1) over GF(s), which contains s^(d+1) elements. Since s = p^k is a power of a prime number, so is s^(d+1) = p^(k(d+1)). Hence there is a unique finite field with s^(d+1) elements. One might suspect that there may be some relation between the finite field GF(s^(d+1)) and the projective space P_d(GF(s)), which would give the projective space additional structure based on the multiplication operation in GF(s^(d+1)), besides its geometric structure. Indeed, there is such a relation, and we want to elaborate it further. If p is a prime number, then GF(p^m) contains GF(p^n) as a subfield if and only if n | m. Therefore GF(s^(d+1)) (where s = p^k) contains GF(s) as a subfield, and the degree of GF(s^(d+1)) over GF(s) is (d + 1). Hence GF(s^(d+1)) is a vector space of dimension (d + 1) over GF(s). Each non-zero element of this vector space determines a ray through the origin and hence a point in the projective space P_d(s). Let G* be the multiplicative group of the non-zero elements of GF(s^(d+1)), and let H* be the multiplicative group of the non-zero elements of GF(s). Let x ∈ G*. The ray through x consists precisely of the coset of H* in G* determined by x, along with the origin. This establishes a one-to-one correspondence between the points of the projective space P_d(s) and the elements of the quotient group G*/H*. This correspondence allows us to define a multiplication operation in the projective space. Let g be a fixed point in the projective space P_d(s) and consider the mapping

    L_g : x → g ∘ x,

where g ∘ x is the multiplication operation defined above. This mapping maps a projective subspace onto another projective subspace of the same dimension. Such a mapping is called an automorphism of the projective geometry. Since G* and H* are cyclic, so is G*/H*. Let g be a generator of G*/H*. (By abuse of notation, we will henceforth not distinguish between the elements of G*/H* and the points of P_d(s).) Thus we can denote the points in P_d(s) as g^i, i = 0, ..., n − 1. The mapping L_g becomes

    L_g : g^i → g^(i+1).

This will be called a shift operation. Any power L_g^i of the shift operation is also an automorphism of the geometry, and the collection of all powers of the shift operation forms an automorphism group, denoted henceforth by L, which is a subgroup of the group of all automorphisms of the geometry. A subgroup G of the full automorphism group is said to act transitively on subspaces of dimension k if for any pair of subspaces H_1, H_2 of dimension k, there is an element of G which maps H_1 to H_2. For any projective space, the subgroup L acts transitively on the points and on the subspaces of co-dimension one. This property is used in the next section for generating perfect patterns and sequences for the two-dimensional geometries. Perfect patterns for the 4-dimensional case will be described in section 3.8.

3.6 Generation of perfect patterns for 2-d geometry

In the case of a two-dimensional geometry, the hyperplanes (co-dimension = 1) are the same as the lines (dimension = 1), so L acts transitively on lines. To generate a perfect pattern using the shift operation, take any pair of points a, b, a ≠ b, in the geometry. Let l denote the line generated by a and b, i.e.

    l = (a, b).

Set a_0 = a, b_0 = b and l_0 = l, and define a_k, b_k and l_k, k = 1, ..., n − 1, by successive application of the shift operation L_g as follows:

    a_k = L_g ∘ a_(k−1)
    b_k = L_g ∘ b_(k−1)
    l_k = L_g ∘ l_(k−1)
and let

    P = {(a_k, b_k), k = 0, ..., n − 1}.

Then P is a perfect access pattern of the geometry. Now, in order to generate a perfect sequence of such patterns, take any line l = {a_1, a_2, ..., a_m} of the geometry. Form all ordered pairs (a_i, a_j), i ≠ j, from the points on l, and generate a perfect pattern from each of the pairs. The collection of perfect patterns obtained this way (sequenced in any order) forms a perfect sequence.

3.7 Example of an application of 4-dimensional geometry

The four-dimensional geometry is suitable for performing sparse Gaussian elimination, an operation required in many scientific computations. A typical operation in symmetric Gaussian elimination applied to a matrix A is

    A(i, k) ← A(i, k) − A(i, j) * A(j, k) / A(j, j),

where A(j, j) is the pivot element. Such an operation needs to be carried out only if A(i, j) ≠ 0 and A(j, k) ≠ 0. When the non-zero elements in the matrix are not in consecutive locations, it is very difficult to obtain uniformly high efficiency on vector machines. However, the operation shown above has an interesting property, regardless of the pattern of non-zeros in the matrix: two elements A(i, j) and A(i', j') of the matrix need to be brought together for a multiplication, division or subtraction only if the index pairs (i, j) and (i', j') have at least one of the constituent indices in common. This property can be exploited as follows:

1. Map the row and column indices 1, 2, ..., l to points of the projective space by means of an assignment function f, which can be a hash function, i.e. α = f(i), α ∈ Ω_0.

2. Put the memory modules in the logical partition in one-to-one correspondence with the lines, i.e. with the elements of Ω_1.

3. A non-zero element A(i, j) is assigned to a memory module as follows: the pair of points f(i), f(j) in the projective space determines a line l ∈ Ω_1 (if f(i) = f(j) we have some freedom in determining the line). The element A(i, j) is stored in the memory module corresponding to the line l.

4. Put the processors in one-to-one correspondence with the planes, the elements of Ω_2. If a line corresponding to a memory module is contained in a plane corresponding to a processor, then a connection is made between the memory module and the processor.

Again consider a typical operation in Gaussian elimination, and let α = f(i), β = f(j) and γ = f(k). Then all the index-pairs involved in the above operation are subsets of the triplet (α, β, γ). Assuming that the points of the triplet are in general position, they determine a plane, say δ, of the projective space, i.e. δ ∈ Ω_2. The above operation is assigned to the processor corresponding to δ. In order to carry out this operation, the processor needs to be able to communicate with the memory modules corresponding to the pairs (α, β), (β, γ) and (α, γ). Note that the lines determined by these pairs are contained in the plane determined by the triplet (α, β, γ), hence the necessary connections exist.

Several observations on this scheme:

1. In a projective space, the number of subspaces of dimension m equals the number of subspaces of co-dimension m + 1. Hence if we are interested in a symmetric scheme with an equal number of processors and memory modules, then we should make the co-dimension of the subspaces corresponding to processors one more than the dimension of the subspaces corresponding to memory modules. Thus in the present case we should choose a four-dimensional geometry, i.e., d = 4.

2. In the projective space P_4(s), the number of lines is

    n = φ(4, 1, s) = [(s^5 − 1)(s^4 − 1)] / [(s^2 − 1)(s − 1)] = (s^2 + 1)(s^4 + s^3 + s^2 + s + 1).
+ss+sj+s+
1).
1.
Let u i, i = 1,... n, denote the lines generated by first two points from each triplet.
= Q(4, 1,s)
= n . The 2.
i.e.
U~ =
(di,
b,)
Hence the number of processors is also n. number of planes containing a given line is +(d -2,0,
of lines { u 1, u j,...
Un ] forms
s) = 4K2,o, s) = J+
s=s2+s+1. a permutation
3.
Similarly the number of lines contained in a given plane is $(2, 1, s) = Sz + s + 1 = m. Hence each processor modules. The number the communication of interconnections required between if is connected to tn = O(W ) memory
memories and processors can be reduced significantly carried out in a disciplined and co-ordinated
of
planes
{ h 1, h j,... of
h. } the
of all
the planes
based on the perfect sequences of conj?ict-free patterns. In a machine designed to support only these perfect patterns, it is possible to connect memories and processors so that the number of wires needed grows linearly with respect to the number of processors in the system. Note that the fundamental property of Gaussian
associated triplet is performed, the three memory modules accessed correspond to the three lines (a,, b,), (bi, c1) and (c,, a, ), and the processor performing the operation corresponds the plane (a,, b,, c,). Hence it is clear that if we schedule n operations where associated triplets form a perfect pattern to execute in parallel in the same machine cycle then 1. there is no read or write accesses 2. 3. 4. there is no conflict in the use of processors conflicts in memory
elimination exploited here is the fact that index pairs (i, j) and (i, j ) need to interact only when they have an index transitive operation in common. There are many examples of the computations having the same property, e.g. finding
closure of a dwected graph or the join for binary relations in a relational data-base. is also applicable to such
all the processors are fully utilized the memory bandwidth is fully utilized. is called perfect
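As a concrete illustration of this index-sharing property (our example, not from the original paper), Warshall's algorithm for transitive closure touches exactly the same triple-index access pattern as Gaussian elimination: every step combines entries whose index pairs share an index.

```python
# Warshall's algorithm for transitive closure of a directed graph.
# The three entries touched in the inner step, (i, j), (i, k) and
# (k, j), pairwise share an index -- the same access property that
# the projective-geometry mapping exploits for Gaussian elimination.

def transitive_closure(adj):
    """adj: square boolean adjacency matrix (list of lists)."""
    n = len(adj)
    r = [row[:] for row in adj]  # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                r[i][j] = r[i][j] or (r[i][k] and r[k][j])
    return r

# 0 -> 1 -> 2, so 0 can reach 2 after closure
g = [[False, True, False],
     [False, False, True],
     [False, False, False]]
c = transitive_closure(g)
print(c[0][2])  # True
```

The same observation applies to the relational join, where tuples (i, j) and (k, l) combine only when they agree on the join attribute.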
3.8 Perfect access patterns

In P4(s) let n denote the number of lines, which is also equal to the number of 2-dimensional planes. A perfect access pattern is a collection of n non-collinear triplets of points of the geometry, {(a_i, b_i, c_i), i = 1, ..., n}, such that

1. the lines u_i = (a_i, b_i), i = 1, ..., n, generated by the first two points of each triplet form a permutation of all the lines of the geometry, and

2. similarly, the planes h_i = (a_i, b_i, c_i), i = 1, ..., n, generated by the triplets form a permutation of all the planes of the geometry.

When the operation associated with triplet i is performed, the three memory modules accessed correspond to the three lines (a_i, b_i), (b_i, c_i) and (c_i, a_i), and the processor performing the operation corresponds to the plane (a_i, b_i, c_i). Hence it is clear that if we schedule n operations whose associated triplets form a perfect pattern to execute in parallel in the same machine cycle, then

1. there are no conflicts in memory read or write accesses,
2. there is no conflict in the use of processors,
3. all the processors are fully utilized,
4. the memory bandwidth is fully utilized.

This is why such a pattern is called perfect.
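The subspace counts used above are Gaussian binomial coefficients and can be checked numerically. The following short Python sketch (ours, added for illustration) verifies that the smallest 4-dimensional geometry P4(2) has 155 lines and 155 planes, with m = s^2 + s + 1 = 7 lines per plane:

```python
# phi(d, m, s): number of m-dimensional projective subspaces of the
# finite projective space PG(d, s). This is the Gaussian binomial
# coefficient [d+1 choose m+1]_s.

def phi(d, m, s):
    num = den = 1
    for i in range(m + 1):
        num *= s ** (d + 1 - i) - 1
        den *= s ** (i + 1) - 1
    return num // den  # always an exact integer

# Smallest 4-dimensional geometry, s = 2:
n_lines  = phi(4, 1, 2)   # lines  -> memory modules
n_planes = phi(4, 2, 2)   # planes -> processors
m        = phi(2, 1, 2)   # lines contained in a given plane

print(n_lines, n_planes, m)  # 155 155 7
```

Note that phi(4, 1, s) = phi(4, 2, s) for every s, which is the duality (dimension 1 vs. co-dimension 2) invoked in choosing d = 4.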
Cyclic shifts alone, however, are not enough for generating perfect access patterns for the 4-d geometry. First we consider other types of automorphisms. In a finite field of characteristic p, the operation of raising to the p-th power is an automorphism of the field, since

(x + y)^p = x^p + y^p.
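This "freshman's dream" identity holds because every binomial coefficient C(p, i) with 0 < i < p is divisible by the prime p. It can be checked by brute force over GF(p) = Z/pZ; a small Python sketch (ours, for illustration):

```python
# Verify that the Frobenius map x -> x^p is additive over GF(p),
# i.e. (x + y)^p = x^p + y^p mod p, for small primes p.

def frobenius_is_additive(p):
    return all(pow(x + y, p, p) == (pow(x, p, p) + pow(y, p, p)) % p
               for x in range(p) for y in range(p))

for p in (2, 3, 5, 7):
    print(p, frobenius_is_additive(p))  # prints True for every prime p
```

Since the map is clearly multiplicative as well, it is a field automorphism.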
Since the points in the projective space Pd(s) correspond to elements of the quotient G*/H* of the multiplicative group, the operation of raising to the p-th power can also be defined on the points of Pd(s). It is easy to show that this operation is an automorphism of the projective space. The operations of cyclic shift and raising to the p-th power together are adequate for generating the perfect access patterns for the smallest 4-dimensional geometry P4(2), which has 155 lines and 155 planes. For bigger 4-d geometries we need other automorphisms. The most general automorphism of Pd(s) is obtained by means of a non-singular (d + 1) x (d + 1) matrix over GF(s). A more detailed discussion of the generation of perfect patterns for 4-dimensional geometries, along with specific examples of practical interest, will be given in a subsequent paper.

4. Areas for Further Research

The ideas presented in this paper lead to a number of interesting areas for further investigation. Some of these are briefly described below.

4.1 Efficient mapping of algorithms

In this paper, we showed how matrix multiplication and inversion can be mapped onto the proposed architecture, using two and four dimensional geometries. How best can one map a variety of other commonly used methods in scientific computation? How can one exploit geometries of even higher dimension?

4.2 Compilation

The first version of the compiler for the proposed architecture has been designed and implemented. A description of this work can be found in [DHI89], [DKR91]. Results of applying the compiler to a number of large-scale real-life problems in scientific computation, arising in domains such as linear programming, electronic circuit simulation, partial differential equations, signal processing, queueing theory, etc., are reported in [DKR90]. This work on the compilation process suggests many questions for further research.

4.3 Physical design

Since the medium (silicon chips, circuit boards) for building hardware is basically two-dimensional and euclidean, how does one embed a two or four dimensional finite projective geometry into such a medium? How should the system be partitioned into parts? How does one minimize the number of wire-crossings between different parts?

4.4 Design of custom integrated circuits

How should one define the basic building blocks of the system for implementation as custom integrated circuits? How should one choose the architecture of such chips in order to fully exploit the structure of the underlying geometry and achieve high performance?

Results of ongoing exploration of many of these issues will be reported in a forthcoming series of papers.

References

[ADL89] Adler, I., Karmarkar, N. K., Resende, M. G. C., and Veiga, G., Data Structures and Programming Techniques for the Implementation of Karmarkar's Algorithm, ORSA Journal on Computing, Vol. 1, 1989.

[BN71] Bell, C. G., and Newell, A., Computer Structures: Readings and Examples, McGraw-Hill, 1971.

[DHI89] Dhillon, I. S., A Parallel Architecture for Sparse Matrix Computations, B. Tech. Project Report, Indian Institute of Technology, Bombay, 1989.

[HAL86] Hall, Marshall, Jr., Combinatorial Theory, Wiley-Interscience Series in Discrete Mathematics, 1986.

[DKR90] Dhillon, I., Karmarkar, N., and Ramakrishnan, K., Performance Analysis of a Parallel Architecture on Matrix Vector Multiply Like Routines, Technical Report #11216901004-13 TM, AT&T Bell Laboratories, Murray Hill, N.J., October 1990.

[DKR91] Dhillon, I., Karmarkar, N., and Ramakrishnan, K., Compilation ... of the Parallel ..., Proceedings of the Fifth ...

[KAR90A] Karmarkar, N., A New Parallel Architecture for Sparse Matrix Computations, Proceedings of the Workshop on Parallel Processing, BARC, Bombay, February 1990, pp. 1-18.

[KAR90B] Karmarkar, N., A New Parallel Architecture for Sparse Matrix Computations Based on Finite Projective Geometries, invited talk at the SIAM Conference on Discrete Mathematics, Atlanta, June 1990.

[VDW70] van der Waerden, B. L., Algebra, Vol. 1, Ungar, 1970.