VLIW and EPIC architectures
Nicholas FitzRoy-Dale
Document Revision: 1 Date: 2005/07/1 22:31:24
nfd@cse.unsw.edu.au
http://www.cse.unsw.edu.au/~disy/
Operating Systems and Distributed Systems Group
School of Computer Science and Engineering
The University of New South Wales
UNSW Sydney 2052, Australia
1 VLIW and EPIC architectures
Modern superscalar processors are complex, power-hungry devices that present an antiquated view of
processor architecture to the programmer in the interests of backwards compatibility — and do a lot
of work to achieve high performance while maintaining this illusion. The alternative to superscalar is a
VLIW architecture, but these have traditionally been actively backwards-incompatible, with performance
highly dependent on the (frequently mediocre) abilities of the compiler.
Neither VLIW nor superscalar is a perfect architecture: each has its own set of trade-offs. This
report discusses the relative strengths and weaknesses of the two, focusing on the benefits of VLIW and
the closely-related EPIC architecture as used in Intel’s Itanium processor family.
An introduction to the motivation behind VLIW is given, VLIW and EPIC are discussed in detail,
and then two case studies are presented: the Analog Devices SHARC family of DSPs, demonstrating the
VLIW influences present in a modern DSP; and Intel’s Itanium processor family, which is to date the
only implementation of EPIC.
2 Instruction-level parallelism
A common design goal for general-purpose processors is to maximise throughput, which may be defined
broadly as the amount of work performed in a given time.
Average processor throughput is a function of two variables: the average number of clock cycles
required to execute an instruction, and the frequency of clock cycles. To increase throughput, then,
a designer could increase the clock rate of the architecture, or increase the average instruction-level
parallelism (ILP) of the architecture.
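As a back-of-the-envelope illustration of this relationship (the clock rates and CPI figures below are invented for the example), throughput can be modelled as clock rate divided by average cycles per instruction:

```python
def throughput(clock_hz: float, avg_cpi: float) -> float:
    """Instructions per second = clock rate / average cycles per instruction."""
    return clock_hz / avg_cpi

# Two hypothetical ways to double throughput from a 1 GHz, CPI = 2 baseline:
base = throughput(1e9, 2.0)           # baseline: 0.5e9 instructions/s
faster_clock = throughput(2e9, 2.0)   # double the clock rate
more_ilp = throughput(1e9, 1.0)       # halve the CPI by increasing ILP

assert faster_clock == more_ilp == 2 * base
```

The two paths are equivalent in this simple model, which is why the report treats raising ILP as an alternative to raising the clock rate.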
Modern processor design has focused on executing more instructions in a given number of clock
cycles, that is, increasing ILP. A number of techniques may be used. One technique, pipelining, is
particularly popular because it is relatively simple, and can be used in conjunction with superscalar and
VLIW techniques. All modern CPU architectures are pipelined.
2.1 Pipelining
All instructions are executed in multiple stages. For example, a simple processor may have five stages:
first the instruction is fetched from cache, then it is decoded, then it is executed, and any memory
referenced by the instruction is loaded or stored (Figure 1). Finally, the result of the instruction is
written back to registers. The output from one stage serves as the input to the next stage,
forming a pipeline of instruction implementation. These stages are frequently independent of each other,
so, if separate hardware is used to perform each stage, multiple instructions may be “in flight” at once,
with each instruction at a different stage in the pipeline. Ignoring potential problems, the theoretical
increase in speed is proportional to the length of the pipeline: longer pipelines mean more simultaneous
in-flight instructions and therefore fewer average cycles per instruction.
The major problem with pipelining is the potential for hazards. A hazard occurs when an
instruction in the pipeline cannot be executed. Hennessy and Patterson identify three types of hazards:
structural hazards, where there simply isn’t sufficient hardware to execute all parallelisable instructions
at once; data hazards, where an instruction depends on the result of a previous instruction; and control
hazards, which arise from instructions that change the program counter (i.e., branch instructions). Various
techniques exist for managing hazards. The simplest of these is to stall the pipeline until the
instruction causing the hazard has completed.
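The cost of stalling can be sketched with a toy model; the instruction format and the fixed two-cycle stall penalty below are invented for illustration. An ideal pipeline of depth d completes n instructions in d + n - 1 cycles, and each data hazard adds stall cycles on top:

```python
# Toy model of pipeline data hazards. An ideal pipeline of depth d completes
# n instructions in d + n - 1 cycles; here each hazard on the immediately
# preceding instruction's result is charged an invented 2-cycle stall,
# chosen only to make the effect visible.
def pipeline_cycles(program, depth=5):
    """program: list of (dest_register, source_registers) tuples."""
    stalls = 0
    for i, (dest, sources) in enumerate(program):
        if i > 0 and program[i - 1][0] in sources:
            stalls += 2  # wait for the previous result to be written back
    return depth + len(program) - 1 + stalls

independent = [("r1", ()), ("r2", ()), ("r3", ())]
dependent = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",))]
assert pipeline_cycles(independent) == 7   # no hazards: 5 + 3 - 1
assert pipeline_cycles(dependent) == 11    # two hazards add four stall cycles
```

Even this crude model shows why dependent instruction sequences erode the speedup that pipelining promises.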
Despite pipelining being an almost-universal practice, some architectures are more amenable to pipelining
than others. Architectures that are designed for pipelining tend to require all instructions to take
roughly the same amount of time to complete, so that the pipeline can operate at the same speed regard-
less of what is being executed. In order to keep this speed high, instructions tend to be simple. A small
set of simple, quick-to-execute instructions is the hallmark of RISC (reduced instruction set computing)
architectures. It is no surprise, then, that RISCs tend to be well-suited to pipelining.
Nowadays, the RISC/non-RISC distinction is somewhat outdated. The key metric is performance,
and an architecture can be fast without being RISC. Similarly, it is possible to apply traditional RISC
techniques such as pipelining to non-RISC instruction sets such as Intel’s x86 instruction set. This instruction
set is not well-suited to pipelining, for two reasons. Firstly, it contains a number of instructions that take
a long time to execute. For example, all x86 compatibles support instructions to work with binary-coded
decimal (BCD) numbers. Secondly, x86 supports so-called complex addressing modes. For example, it
is possible to move data from one location in main memory to another location in main memory, without
requiring the data to go through a CPU register as an intermediate stage. Such instructions are difficult
to pipeline, because memory accesses are slow, and data should ideally be present before it is required.
To deal with these sorts of obstacles to pipelining, the pipelined processor translates CISC instruc-
tions into an internal instruction set before pipelining the instructions. This translation phase adds to the
complexity of the architecture.
2.2 Superscalar
Usually, the execution phase of the pipeline takes the longest. On modern hardware, the execution
of the instruction may be performed by one of a number of functional units. For example, integer
instructions may be executed by the ALU, whereas floating-point operations are performed by the FPU.
On a traditional, scalar pipelined architecture, either one or the other of these units will always be idle,
depending on the instruction being executed. On a superscalar architecture, instructions may be executed
in parallel on multiple functional units. The pipeline is essentially split after instruction issue.
Executing multiple instructions simultaneously brings several problems. The first problem is that the
possibility of hazards is increased, because more instructions are in-flight at once. Secondly, instructions
must be retired (executed, and their results written back) in the correct order, to correctly follow the
semantics of a scalar machine. On a superscalar machine, where the instruction stream may be re-ordered
on-the-fly by the hardware (so-called dynamic scheduling), the processor must still retire instructions in
program order.
2.3 The branch problem
Typical program code contains a branch every six or seven instructions. For a pipelined superscalar
architecture, this represents a potentially significant performance problem. It would be helpful if the
processor knew whether a branch would be taken ahead of time, so that the correct instructions could be
fetched and pipelined. Obviously this is not possible in general, but it is possible to predict the target of a
branch with better-than-chance probability by storing the previous results of the branch and using them
to predict the targets of future branches.
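One standard scheme for recording previous results (a common textbook design, not necessarily the one any particular processor uses) is a two-bit saturating counter per branch:

```python
# Sketch of a two-bit saturating counter predictor. Counter values 0-1
# predict "not taken", 2-3 predict "taken"; each outcome nudges the counter
# by one, so a single anomalous branch does not flip an established
# prediction. The branch address 0x400 below is arbitrary.
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, addr):
        return self.counters.get(addr, 0) >= 2  # True means "predict taken"

    def update(self, addr, taken):
        c = self.counters.get(addr, 0)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

# A mostly-taken loop branch: taken four times, not taken once, taken again.
p = TwoBitPredictor()
outcomes = [True] * 4 + [False] + [True] * 4
hits = 0
for taken in outcomes:
    hits += p.predict(0x400) == taken
    p.update(0x400, taken)
# hits == 6: two warm-up misses, one miss for the single not-taken outcome
```

The saturation is the point of the design: the one not-taken iteration costs a single misprediction rather than two.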
Such branch prediction hardware helps to reduce the cost, in terms of instructions unnecessarily
pipelined, of branches. However, the hardware must now deal with the consequences of incorrectly-
predicted branches when “wrong” instructions were speculatively executed. If these speculated instruc-
tions were permitted to modify hardware registers or memory, the program would be in an inconsistent
state. The simplest solution, employed by superscalar Pentiums, is not to allow speculatively-executed
instructions to modify real state until the branch target is known. However, for better performance
it is necessary to implement a reorder buffer, to store the results of instructions executed speculatively
(and out-of-order). Information stored in the reorder buffer is only used to update real state when the
corresponding instructions are known to be correct.
Figure 2 shows a typical superscalar architecture. Superscalar’s complexity is evident in the instruc-
tion decoder and the reorder buffer.
3 VLIW
All this additional hardware is complex, and contributes to the transistor count of the processor. All other
things being equal, more transistors equals more power consumption, more heat, and less on-die space
for cache.
Thus it seems beneficial to expose more of the architecture’s parallelism to the programmer. This
way, not only is the architecture simplified, but programmers have more control over the hardware, and
can take better advantage of it.
VLIW is an architecture designed to help software designers extract more parallelism from their
software than would be possible using a traditional RISC design. It is an alternative to better-known
superscalar architectures. VLIW is a lot simpler than superscalar designs, but has not so far been com-
mercially successful.
Figure 3 shows a typical VLIW architecture. Note the simplified instruction decode and dispatch
logic, and the lack of a reorder buffer.
A typical VLIW processor might contain, for example, two ALUs, two floating-point units, a memory access unit, and a branch unit. The instruction word may then
be divided into an integer portion, a floating-point portion, a memory load/store portion, and a branch
portion. Each portion of the instruction word constitutes a “mini-instruction” for the processing unit to
which it refers. All mini-instructions are implicitly parallel, which gives the processor greater flexibility
in scheduling the instructions among available execution units. Figure 4 shows the 256-bit instruction
word of an early VLIW, the MultiFlow Trace 7 series. This machine supported seven operations per
instruction word.
Making the architecture this explicit provides several advantages in terms of performance and re-
duced die size. The job of arranging code so that the processor is best utilised is left, to a large extent, to
the compiler. Thus VLIW architectures can execute code strictly in order, without requiring scheduling
hardware or reorder buffers. This makes for – theoretically at least – a simpler, less power-hungry chip.
The downside is that writing a good compiler for a VLIW is much more difficult than for a superscalar
architecture, and the difference between a good compiler and a bad one is far more noticeable.
Another problem with traditional VLIW is code size. Often it is simply not possible to completely
utilise all processor execution units at once. Thus many instructions contain no-ops in portions of the
instruction word, with a corresponding increase in the size of the object code. Increased code size has
obvious implications for the efficacy of caches and bus bandwidth. Modern VLIWs deal with this problem
in different ways. One simple method is to offer several instruction templates, and allow the compiler to
pick the most appropriate one – in this case, the one that utilises the most bits of the instruction word.
Another is to employ traditional data-compression techniques to the code.
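The no-op problem can be seen in a small sketch; the slot layout below is invented for illustration, not taken from any real VLIW:

```python
# Sketch of a fixed-format VLIW instruction word: one slot per functional
# unit (an invented six-unit layout), with every slot the compiler cannot
# use padded with a no-op.
SLOTS = ("alu0", "alu1", "fpu0", "fpu1", "mem", "branch")

def pack_word(ops):
    """ops: dict mapping unit name -> operation string for this cycle."""
    return tuple(ops.get(slot, "nop") for slot in SLOTS)

# A cycle with only one ALU operation and one load available to schedule:
word = pack_word({"alu0": "add r1,r2,r3", "mem": "load r4,[r5]"})
assert word.count("nop") == 4  # two useful operations, four wasted slots
```

Two-thirds of this word is padding, which is exactly the code-density problem that instruction templates and compression aim to reduce.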
Because VLIW exposes more information about the processor’s architecture to the programmer than a
superscalar does, VLIW instruction sets are architecture-specific to a significant degree. In a superscalar
implementation (presenting the illusion of a scalar architecture to the programmer), hardware designers
are free to add, for example, additional ALUs, increasing parallelism without affecting existing programs.
In a VLIW, the obvious solution is to increase the length of the instruction word, but this has
similarly-obvious compatibility problems. Another alternative is to add more instructions
to the instruction word, making use of the extra processing unit without causing compatibility problems.
This is the approach taken by the Analog Devices SHARC family. Yet another alternative is to remove the
implicit parallelism implied by the end of the instruction word, and instead make the limits of parallelism
explicit. This is the basis of the EPIC architecture, present in the Intel Itanium.
3.2 Interlocking
Another architectural feature present in some RISC and VLIW architectures, but never in superscalars, is the
lack of interlocks. In a pipelined processor, it is important to ensure that a stall somewhere in the pipeline
won’t result in the machine performing incorrectly. This could happen if later stages of the pipeline
do not detect the stall, and thus proceed as if the stalled stage had completed. To prevent this, most
architectures incorporate interlocks on the pipeline stages. Removing interlocks from the architecture is
beneficial, because they complicate the design and can take time to set up, lowering the overall clock
rate. However, doing so means that the compiler (or assembly-language programmer) must know details
about the timing of pipeline stages for each instruction in the processor, and insert NOPs into the code
to ensure correctness. This makes code extremely hardware-specific. Both of the architectures studied in
detail below are fully interlocked, though Sun’s ill-fated MAJC architecture was not, and relied on fast,
universal JIT compilation to solve the problem.
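What such a compiler (or assembly programmer) must do can be sketched as follows; the instruction latencies and three-address format below are invented for the example:

```python
# Sketch of NOP insertion for a non-interlocked pipeline: given each
# instruction's result latency (values invented), pad the code so that no
# instruction reads a register before the producing instruction has
# finished writing it.
LATENCY = {"load": 3, "mul": 2, "add": 1}  # cycles until the result is ready

def insert_nops(program):
    """program: list of (opcode, dest, sources). Returns the padded list."""
    out, ready = [], {}   # ready[reg] = earliest cycle the value is usable
    cycle = 0
    for op, dest, sources in program:
        need = max((ready.get(r, 0) for r in sources), default=0)
        while cycle < need:          # pad until every source is ready
            out.append(("nop", None, ()))
            cycle += 1
        out.append((op, dest, sources))
        cycle += 1
        ready[dest] = cycle + LATENCY[op] - 1
    return out

padded = insert_nops([("load", "r1", ("r9",)),
                      ("add", "r2", ("r1", "r1"))])
# The add consumes the load's result, so two NOPs separate them.
```

Note how the schedule bakes the latency table into the object code: change the pipeline timing and every binary must be recompiled, which is the hardware-specificity the text describes.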
3.3 Speculation
If the hardware supports software speculation, it needs to provide mechanisms to ensure that speculated
instructions that raise an exception do not affect the state of the machine until they become
non-speculative. VLIWs deal with these problems in various ways. The simplest method is never to raise
exceptions for speculated instructions, though other techniques exist, such as deferring the exceptions.
3.4 History
VLIW is not a new architecture. In fact, it was originally more popular than superscalar designs (which
arose to alleviate backwards compatibility problems). One of the first was the AP-120B, described by
Charlesworth in 1981. The mid 80s saw several attempts to introduce VLIW processors, notably the
Multiflow Trace and the Cydrome Cydra 5. The compiler for the Multiflow Trace was the first
to employ trace scheduling, as invented by Multiflow’s founder, Joseph Fisher. The Cydra 5 included
hardware support for software pipelining very similar to that now present on the Itanium. Specifically,
it supported an iteration frame pointer which could be used as an offset into its register data file. The
similarity is not a coincidence: Cydrome’s chief architect, Bob Rau, was later involved in the development
of Itanium.
The Trace and the Cydra 5 both had very large instruction words: 256 bits for both machines, with
later versions of the Trace supporting even larger words (the earliest Trace machines supported 7
instructions per 256-bit word; later models supported 14 or even 28 instructions per word). The machines
dealt with the corresponding bandwidth and storage problems in different ways: the Trace supported
instruction compression, and the Cydra 5 supported a sequential mode, where each of the 7 opcodes
contained within its instruction word was executed one at a time. Interestingly, the Multiflow and the
Cydra 5 did not contain caches of any description.
These products failed commercially; the prevailing opinion is that this was due to their position as
start-ups. On this view, several small technical mistakes, combined with the difficulty of selling expensive
machines from a poorly-established company, caused their downfall, rather than any major architectural
deficiencies.
The importance of efficient compilers also contributed to poor early acceptance of VLIW architectures.
For example, Intel’s i860 RISC processor, introduced in the early 1990s, had a simple VLIW mode
where each instruction consisted of an integer portion and a floating-point portion. Compilers for the i860
were expected to carefully order instructions in order to keep pipelines filled, but unfortunately were not
of sufficient quality to produce good code for the chip.
The SHARC uses a super Harvard architecture, making use of multiple instruction, data, and I/O buses. Almost all
registers are general-purpose, and the type of a register (fixed-point or floating-point) is determined by
bits set in the instruction (not all operations are permitted for all number formats).
The SHARC uses a number of instruction groups. Each group contains a related family of instruc-
tions, such as ALU-related instructions (Groups I and II) and memory access (Group III). There is some
commonality between bits allocated in the instruction word inside groups, but allocation is not wholly
regular.
Despite containing two ALUs, SHARC opcodes do not permit independent addressing of these func-
tional units. Instead, they work in tandem in SIMD mode. This is perhaps because the ADSP-2136x is an
evolutionary advance on previous processors in the SHARC family, which contained only a single ALU.
In the default (SISD) mode, the secondary ALU is disabled. SIMD mode may be enabled by setting a
bit in the CPU status register MODE1.
Many instructions include a 23-bit section called the compute field. This “mini-instruction” is es-
sentially the part of the instruction directing an ALU, and thus supports its own set of opcodes for all
arithmetic operations. As mentioned above, in SIMD mode, one compute field controls both ALUs;
the source and destination of data for PEy is defined by the architecture as an offset from the source and
destination of PEx. Many instructions which include a compute field also support predicated execution.
Despite the lack of independently-addressable ALUs, the SHARC is capable of several forms of
parallelism, mostly concerned with the efficient transformation of large amounts of data. Generally,
compute operations may be combined with predicates, data movement to / from memory, and register
manipulations such as shifts and transfers.
SHARC does not support speculation, but has a relatively short pipeline consisting of 5 stages, so the
cost of a branch is low.
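The SIMD arrangement described above can be sketched as follows; the register-offset value and the register numbering are invented for illustration, not taken from the SHARC manuals:

```python
# Sketch of SHARC-style SIMD: one compute operation names PEx's registers,
# and in SIMD mode PEy implicitly repeats it on partner registers at a
# fixed offset. The offset of 8 and the flat register file are invented
# for this example.
PE_OFFSET = 8  # assumed: PEy's operands sit 8 registers above PEx's

def simd_add(regs, dest, a, b, simd_mode=True):
    regs[dest] = regs[a] + regs[b]                  # PEx executes the op
    if simd_mode:                                   # PEy repeats it in lock-step
        regs[dest + PE_OFFSET] = regs[a + PE_OFFSET] + regs[b + PE_OFFSET]

regs = {i: 0 for i in range(16)}
regs[0], regs[1], regs[8], regs[9] = 1, 2, 10, 20
simd_add(regs, 2, 0, 1)
assert (regs[2], regs[10]) == (3, 30)  # both PEs computed from one opcode
```

With `simd_mode=False` the same call models the default SISD mode, in which the secondary ALU stays idle.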
The major problem addressed by EPIC is hardware dependence. VLIW is designed around the concept
that the limits of a processor’s parallelism are expressed by a single instruction word. Thus, processors
capable of a greater degree of parallelism require a different instruction set. EPIC’s solution to this
problem is to define several reasonably-abstract categories of mini-instructions, such as ALU operations,
floating-point operations, and branches. Mini-instructions are combined in groups of three into a bundle.
In addition to three 41-bit mini-instructions, bundles contain a 5-bit template type for a total bundle size
of 128 bits. Figure 5 shows the general format of an EPIC bundle.
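A sketch of this layout follows; the field order, with the template in the low-order bits, is an assumption made for illustration rather than a claim about the exact Itanium encoding:

```python
# Sketch of packing an EPIC-style bundle: three 41-bit instruction slots
# plus a 5-bit template in a 128-bit word (5 + 3 * 41 = 128). The placement
# of the template in the low bits is assumed for this example.
def pack_bundle(template: int, slots) -> int:
    assert 0 <= template < 2**5 and len(slots) == 3
    bundle = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < 2**41
        bundle |= slot << (5 + 41 * i)
    return bundle

# Hypothetical slot contents: an all-ones slot, an empty slot, and 0x5.
b = pack_bundle(0x1, [2**41 - 1, 0, 0x5])
assert b.bit_length() <= 128
assert b & 0x1F == 0x1                     # template field
assert (b >> 5) & (2**41 - 1) == 2**41 - 1 # slot 0 recovered intact
```

The arithmetic makes the trade-off concrete: the 5-bit template is cheap, yet it is what frees the slot encodings from describing the machine's parallelism directly.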
5.2 Instruction-level parallelism in EPIC
Template 0: MII
Template 1: MII*
Template 2: MI*I
Template 3: MI*I*
Figure 6: The first four EPIC bundle templates. Stops are indicated by asterisks.
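The role of stops can be sketched directly from the template notation above: instructions between stops form one issue group that may execute in parallel.

```python
# Sketch of splitting a template string (as written in Figure 6, with '*'
# marking a stop after the preceding slot) into issue groups: everything
# between stops may execute in parallel.
def issue_groups(template: str):
    groups, current = [], []
    for ch in template:
        if ch == "*":
            groups.append(current)  # a stop closes the current group
            current = []
        else:
            current.append(ch)
    if current:
        groups.append(current)
    return groups

assert issue_groups("MII") == [["M", "I", "I"]]     # all three parallel
assert issue_groups("MI*I") == [["M", "I"], ["I"]]  # stop splits the bundle
```

This is the sense in which EPIC makes the limits of parallelism explicit: the stops, not the word length, delimit what may issue together.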
Itanium has a relatively short pipeline: 8 stages in Itanium 2, compared with 30 for later revisions of the
Pentium 4.
6 Conclusions
Despite its history, VLIW has yet to see significant commercial success in general-purpose computers.
One reason for this is backwards-compatibility issues, which newer architectures, such as EPIC, are
starting to address. Another potential problem facing VLIWs is the widening gap between CPU performance
and memory bandwidth. Perhaps updates to the EPIC architecture, or some future VLIW-based archi-
tecture, will see optional support for some form of instruction compression, to reduce Itanium’s reliance
on caching for performance.