
The VLIW and EPIC processor architectures

Nicholas FitzRoy-Dale
Document Revision: 1, Date: 2005/07/01 22:31:24

nfd@cse.unsw.edu.au
http://www.cse.unsw.edu.au/~disy/
Operating Systems and Distributed Systems Group
School of Computer Science and Engineering
The University of New South Wales
UNSW Sydney 2052, Australia
1 VLIW and EPIC architectures
Modern superscalar processors are complex, power-hungry devices that present an antiquated view of
processor architecture to the programmer in the interests of backwards compatibility — and do a lot
of work to achieve high performance while maintaining this illusion. The alternative to superscalar is a
VLIW architecture, but these have traditionally been actively backwards-incompatible, with performance
highly dependent on the (frequently mediocre) abilities of the compiler.
Neither VLIW nor superscalar are perfect architectures: each has its own set of trade-offs. This
report discusses the relative strengths and weaknesses of the two, focusing on the benefits of VLIW and
the closely-related EPIC architecture as used in Intel’s Itanium processor family.
An introduction to the motivation behind VLIW is given, VLIW and EPIC are discussed in detail,
and then two case studies are presented: the Analog Devices SHARC family of DSPs, demonstrating the
VLIW influences present in a modern DSP; and Intel’s Itanium processor family, which is to date the
only implementation of EPIC.

2 Instruction-level parallelism
A common design goal for general-purpose processors is to maximise throughput, which may be defined
broadly as the amount of work performed in a given time.
Average processor throughput is a function of two variables: the average number of clock cycles
required to execute an instruction, and the clock frequency. To increase throughput, then, a designer
can either increase the clock rate of the architecture, or reduce the average number of cycles per
instruction, that is, increase the average instruction-level parallelism (ILP) of the architecture.
Modern processor design has focused on executing more instructions in a given number of clock
cycles, that is, increasing ILP. A number of techniques may be used. One technique, pipelining, is
particularly popular because it is relatively simple, and can be used in conjunction with superscalar and
VLIW techniques. All modern CPU architectures are pipelined.

2.1 Pipelining
All instructions are executed in multiple stages. For example, a simple processor may have five stages:
first the instruction is fetched from cache, then it is decoded, then it is executed; any memory referenced
by the instruction is loaded or stored; and finally the result of the instruction is written back to registers
(Figure 1). The output from one stage serves as the input to the next stage, forming a pipeline of
instruction execution. These stages are frequently independent of each other, so, if separate hardware
is used to perform each stage, multiple instructions may be “in flight” at once, with each instruction at
a different stage in the pipeline. Ignoring potential problems, the theoretical increase in speed is
proportional to the length of the pipeline: a longer pipeline means more simultaneous in-flight
instructions and therefore fewer average cycles per instruction.
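To make this claim concrete, the following C sketch (illustrative only, not part of the original report) computes the ideal speedup of a k-stage pipeline, ignoring hazards: n instructions take n * k cycles unpipelined, but only k + (n - 1) cycles pipelined, so the speedup approaches the pipeline depth k as n grows.

    #include <stdio.h>

    int main(void)
    {
        const int k = 5;  /* pipeline depth (stages) */

        /* Unpipelined: n * k cycles. Pipelined: k cycles to fill the pipe,
           then one instruction completes per cycle: k + (n - 1) cycles. */
        for (int n = 1; n <= 100000; n *= 10) {
            double speedup = (double)n * k / (k + n - 1);
            printf("n = %6d instructions: speedup = %.2f\n", n, speedup);
        }
        return 0;  /* speedup tends towards k = 5 as n grows */
    }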
The major problem with pipelining is the potential for hazards. A hazard occurs when an
instruction in the pipeline cannot be executed. Hennessy and Patterson identify three types of hazard:
structural hazards, where there simply isn’t sufficient hardware to execute all parallelisable instructions
at once; data hazards, where an instruction depends on the result of a previous instruction; and control
hazards, which arise from instructions that change the program counter (i.e., branch instructions). Var-
ious techniques exist for managing hazards. The simplest of these is simply to stall the pipeline until the
instruction causing the hazard has completed.

Figure 1: The 5-stage MIPS pipeline
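The latter two hazard types can be seen at the source level. The following C fragment is purely illustrative (the function and variable names are invented):

    /* Illustrative only: how data and control hazards arise at source level. */
    int hazards(int a, int b, int c)
    {
        int sum  = a + b;    /* instruction 1 writes sum                     */
        int prod = sum * c;  /* data hazard: needs sum before instruction 1
                                has written it back, forcing a stall or a
                                forwarding path                              */
        if (sum > 0)         /* control hazard: the next instruction to      */
            prod = -prod;    /* fetch is unknown until the branch resolves   */
        return prod;
    }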

2.1.1 Designing for pipelining

Despite pipelining being an almost-universal practice, some architectures are more amenable to pipelin-
ing than others. Architectures that are designed for pipelining tend to require all instructions to take
roughly the same amount of time to complete, so that the pipeline can operate at the same speed regard-
less of what is being executed. In order to keep this speed high, instructions tend to be simple. A small
set of simple, quick-to-execute instructions is the hallmark of RISC (reduced instruction set computing)
architectures. It is no surprise, then, that RISCs tend to be well-suited to pipelining.
Nowadays, the RISC/non-RISC distinction is somewhat outdated. The key metric is performance,
and an architecture can be fast without being RISC. Similarly, it is possible to apply traditional RISC tech-
niques such as pipelining to non-RISC instruction sets such as Intel’s x86 instruction set. This instruction
set is not well-suited to pipelining, for two reasons. Firstly, it contains a number of instructions that take
a long time to execute. For example, all x86 compatibles support instructions to work with binary-coded
decimal (BCD) numbers. Secondly, x86 supports so-called complex addressing modes. For example, it
is possible to move data from one location in main memory to another location in main memory, without
requiring the data to go through a CPU register as an intermediate stage. Such instructions are difficult
to pipeline, because memory accesses are slow, and data should ideally be present before it is required.
To deal with these sorts of obstacles to pipelining, the pipelined processor translates CISC instruc-
tions into an internal instruction set before pipelining the instructions. This translation phase adds to the
complexity of the architecture.
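The following C sketch illustrates the idea behind this translation for the memory-to-memory example above; the decomposition granularity is illustrative, not Intel’s actual micro-operation encoding:

    #include <stdint.h>

    /* A CISC memory-to-memory move decomposed into the RISC-style pair a
       pipeline actually wants: the data passes through a register (tmp) as
       an intermediate stage, so the load can be scheduled well before the
       store needs its result. */
    static inline void mem_to_mem_move(uint32_t *dst, const uint32_t *src)
    {
        uint32_t tmp = *src;  /* micro-op 1: load  */
        *dst = tmp;           /* micro-op 2: store */
    }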

2.2 Superscalar
Usually, the execution phase of the pipeline takes the longest. On modern hardware, the execution
of the instruction may be performed by one of a number of functional units. For example, integer
instructions may be executed by the ALU, whereas floating-point operations are performed by the FPU.
On a traditional, scalar pipelined architecture, either one or the other of these units will always be idle,
depending on the instruction being executed. On a superscalar architecture, instructions may be executed
in parallel on multiple functional units. The pipeline is essentially split after instruction issue.
Executing multiple instructions simultaneously brings several problems. The first problem is that the
possibility of hazards is increased, because more instructions are in-flight at once. Secondly, instructions
must be retired (executed, and their results written back) in the correct order, to correctly follow the
semantics of a scalar machine. On a superscalar machine, where the instruction stream may be re-ordered
on-the-fly by the hardware (so-called dynamic scheduling), the processor must still retire instructions in
program order.

Figure 2: Superscalar architecture

2.3 The branch problem

Typical program code contains a branch every six or seven instructions. For a pipelined superscalar
architecture, this represents a potentially significant performance problem. It would be helpful if the
processor knew ahead of time whether a branch would be taken, so that the correct instructions could be
fetched and pipelined. In general this is not possible, but it is possible to predict the outcome of a branch
with better-than-chance probability by storing the branch’s previous outcomes and using them to predict
its future behaviour.
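One widely-used predictor of this kind is a two-bit saturating counter per branch, which requires two consecutive mispredictions before the prediction changes. The report does not specify a particular predictor, so the following C model is a generic sketch:

    #include <stdio.h>

    /* A 2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken.
       Correct predictions strengthen the state; wrong ones weaken it. */
    typedef struct { int counter; } predictor_t;

    static int predict(const predictor_t *p) { return p->counter >= 2; }

    static void update(predictor_t *p, int taken)
    {
        if (taken  && p->counter < 3) p->counter++;
        if (!taken && p->counter > 0) p->counter--;
    }

    int main(void)
    {
        predictor_t p = { 2 };                        /* weakly taken      */
        int history[] = { 1, 1, 1, 0, 1, 1, 1, 0 };   /* loop-like branch  */
        int correct = 0;
        for (int i = 0; i < 8; i++) {
            correct += predict(&p) == history[i];
            update(&p, history[i]);
        }
        printf("%d of 8 predicted correctly\n", correct);
        return 0;  /* the single not-taken exit iteration costs one miss */
    }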
Such branch prediction hardware helps to reduce the cost, in terms of instructions unnecessarily
pipelined, of branches. However, the hardware must now deal with the consequences of incorrectly-
predicted branches when “wrong” instructions were speculatively executed. If these speculated instruc-
tions were permitted to modify hardware registers or memory, the program would be in an inconsistent
state. The simple solution, employed by superscalar Pentiums, is simply not to allow speculatively-
executed instructions to modify real state until the branch target is known. However, for better perfor-
mance it is necessary to implement a reorder buffer, to store the results of instructions executed specula-
tively (and out-of-order). Information stored in the reorder buffer is only used to update real state when
the corresponding instructions are known to be correct.
Figure 2 shows a typical superscalar architecture. Superscalar’s complexity is evident in the instruc-
tion decoder and the reorder buffer.

3 VLIW
All this additional hardware is complex, and contributes to the transistor count of the processor. All other
things being equal, more transistors equals more power consumption, more heat, and less on-die space
for cache.
Thus it seems beneficial to expose more of the architecture’s parallelism to the programmer. This
way, not only is the architecture simplified, but programmers have more control over the hardware, and
can take better advantage of it.
VLIW is an architecture designed to help software designers extract more parallelism from their
software than would be possible using a traditional RISC design. It is an alternative to better-known
superscalar architectures. VLIW is a lot simpler than superscalar designs, but has not so far been com-
mercially successful.
Figure 3 shows a typical VLIW architecture. Note the simplified instruction decode and dispatch
logic, and the lack of a reorder buffer.

Figure 3: VLIW architecture

3.1 ILP in VLIW


VLIW and superscalar approach the ILP problem differently. The key difference between the two is
where instruction scheduling is performed: in a superscalar architecture, scheduling is performed in
hardware (and is called dynamic scheduling, because the schedule of a given piece of code may differ
depending on the code path followed), whereas in a VLIW scheduling is performed in software (static
scheduling, because the schedule is “built in to the binary” by the compiler or assembly language pro-
grammer).
A traditional VLIW architecture divides the instruction word into separate regions, each of which
corresponds to a dedicated processor unit. For example, a VLIW architecture may contain two integer
ALUs, two floating-point units, a memory access unit, and a branch unit. The instruction word may then
be divided into an integer portion, a floating-point portion, a memory load/store portion, and a branch
portion. Each portion of the instruction word constitutes a “mini-instruction” for the processing unit to
which it refers. All mini-instructions are implicitly parallel, which gives the processor greater flexibility
in scheduling the instructions among available execution units. Figure 4 shows the 256-bit instruction
word of an early VLIW, the Multiflow Trace 7 series. This machine supported seven operations per
instruction word.

Figure 4: Instruction word for the Multiflow Trace 7 series
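The following C sketch shows how such an instruction word might be assembled for a hypothetical four-unit VLIW; the slot layout and NOP encoding are invented for illustration and are not the Multiflow format:

    #include <stdint.h>

    #define VLIW_NOP 0u  /* hypothetical "do nothing" opcode */

    /* A 128-bit instruction word divided into four fixed 32-bit regions,
       one per functional unit. All four mini-instructions are implicitly
       parallel; a unit with no work this cycle receives a NOP. */
    typedef struct {
        uint32_t alu0, alu1, mem, branch;
    } vliw_word;

    static vliw_word pack(uint32_t alu0, uint32_t alu1,
                          uint32_t mem, uint32_t branch)
    {
        vliw_word w = { alu0, alu1, mem, branch };
        return w;
    }

    /* A cycle with no branch work still occupies a full word:
       vliw_word w = pack(add_op, mul_op, load_op, VLIW_NOP);  */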
Making the architecture this explicit provides several advantages in terms of performance and re-
duced die size. The job of arranging code so that the processor is best utilised is left, to a large extent, to
the compiler. Thus VLIW architectures can execute code strictly in order, without requiring scheduling
hardware or reorder buffers. This makes for – theoretically at least – a simpler, less power-hungry chip.
The downside is that writing a good compiler for a VLIW is much more difficult than for a superscalar
architecture, and the difference between a good compiler and a bad one is far more noticeable.
Another problem with traditional VLIW is code size. Often it is simply not possible to completely
utilise all processor execution units at once. Thus many instructions contain no-ops in portions of the
instruction word, with a corresponding increase in the size of the object code. Increased code size has
obvious implications for the efficacy of caches and bus bandwidth. Modern VLIWs deal with this problem
in different ways. One simple method is to offer several instruction templates, and allow the compiler to
pick the most appropriate one – in this case, the one that utilises the most bits of the instruction word.
Another is to apply traditional data-compression techniques to the code.
Because a VLIW exposes more information about the processor’s architecture to the programmer than
a superscalar does, its instruction set is architecture-specific to a significant degree. In a superscalar
implementation (which presents the illusion of a scalar architecture to the programmer), hardware
designers are free to add, for example, additional ALUs, increasing parallelism without affecting existing
programs. In a VLIW, the obvious solution is to increase the length of the instruction word, but this has
similarly-obvious compatibility problems. Another alternative is to add new instructions to the instruction
set, making use of the extra processing units without causing compatibility problems. This is the approach
taken by the Analog Devices SHARC family. Yet another alternative is to remove the implicit parallelism
implied by the end of the instruction word, and instead make the limits of parallelism explicit. This is the
basis of the EPIC architecture, present in the Intel Itanium.

3.2 Interlocking
Some RISC and VLIW architectures, though never superscalars, omit pipeline interlocks. In a pipelined
processor, it is important to ensure that a stall somewhere in the pipeline
won’t result in the machine performing incorrectly. This could happen if later stages of the pipeline
do not detect the stall, and thus proceed as if the stalled stage had completed. To prevent this, most
architectures incorporate interlocks on the pipeline stages. Removing interlocks from the architecture is
beneficial, because they complicate the design and can take time to set up, lowering the overall clock
rate. However, doing so means that the compiler (or assembly-language programmer) must know details
about the timing of pipeline stages for each instruction in the processor, and insert NOPs into the code
to ensure correctness. This makes code incredibly hardware-specific. Both the architectures studied in
detail below are fully interlocked, though Sun’s ill-fated MAJC architecture was not, and relied on fast,
universal JIT compilation to solve the hardware-specificity problem.[1]

[1] The problems associated with removing interlocks are not unique to VLIW. The most famous non-interlocked RISC
architecture, MIPS, addressed the hardware-specificity problem by mandating that each instruction take exactly one clock cycle.

3.3 Code generation


The realisation behind VLIW is that the compiler (or, occasionally, the assembly-language programmer)
has more opportunity than the processor to exploit software parallelism, because it has better knowledge
of the code. The processor’s lack of global knowledge means that the scheduling it performs (dynamic
scheduling) must be conservative to ensure safety. Thus compilers are in a position to perform more and
better optimisations than the hardware could. Unfortunately compiler writers have been slow to fully
exploit VLIW, resulting in a number of lacklustre implementations. This section discusses a number of
optimisations that compilers could perform for VLIW that are traditionally performed by hardware in
superscalar implementations.

3.3.1 Loop parallelism


An important compiler technique, perhaps the most important compiler technique, for exploiting ILP is
loop parallelism. This simply refers to finding parallelisable loops and generating parallel code. De-
termining whether a loop is parallelisable for a given system may become arbitrarily complex, because
before it can parallelise the instructions the compiler must ensure that there are no data dependencies
between them. For example, consider the simple case of a loop copying one array to another. Ensuring
that the two arrays do not overlap could involve dataflow analysis along many code paths. However, the
compiler is certainly better-equipped to do this job than the scheduling hardware in a superscalar processor,
because more information is available, such as array bounds. Also, compilers can spend significantly
more time performing the analysis than can a processor.
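The compiler’s job can also be made trivial by the programmer. In C, for example, the C99 restrict qualifier asserts that two arrays cannot overlap, removing the need for the overlap analysis described above (a sketch; the names are illustrative):

    #include <stddef.h>

    /* Without 'restrict' the compiler must prove dst and src do not overlap
       before parallelising the loop. With it, the programmer asserts
       non-overlap, and every iteration is independent by construction. */
    void copy_array(float *restrict dst, const float *restrict src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];  /* iterations may be scheduled in parallel */
    }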

3.3.2 Branch speculation support


Branching can reduce throughput in a VLIW for the same reasons as throughput may be
reduced in a superscalar architecture. VLIWs reduce the cost in two ways. First, instructions may be
predicated – i.e., a section of the instruction word is devoted to a conditional test, and the instruction
is only executed if the condition is true. Handling a single predicated instruction is far cheaper than
handling a branch, because the program counter does not change in unpredictable ways. On multiple-
issue architectures, predication may allow, for example, mutually-exclusive branches of code to execute
simultaneously (such as both the if clause and the else clause of an if statement), with only the results
of the correct branch actually being retired: the incorrect branch effectively becomes a sequence of no-ops. Obvi-
ously there is a point where the benefit of avoiding branches is outweighed by the performance lost by
discarding incorrect predicated instructions, but for short sequences predicates can dramatically improve
performance.
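At the source level, the transformation performed by a predicating compiler (if-conversion) looks roughly like the following C sketch; on a predicated architecture the second form compiles to straight-line code with no branch:

    /* Branchy form: a conditional jump the pipeline must predict. */
    int max_branchy(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

    /* If-converted form: both values are available and the condition merely
       selects the result, typically one predicated or conditional-move
       instruction instead of a branch. */
    int max_predicated(int a, int b)
    {
        return (a > b) ? a : b;
    }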
Note that the advantages of predication aren’t limited to VLIW processors. Superscalars also stand
to benefit. The reason is that by turning a control dependency (a branch) into a data dependency (a
condition check), predication moves the decision as to what to do with the instruction from near the start
of the pipeline (in instruction decode and execution) to the end (in writeback).
In addition to predication, compilers may implement speculation. That is, the compiler may use
heuristics to determine the likely outcome of a branch. If the architecture supports it, the compiler may
speculatively load, and even speculatively execute, instructions at the likely branch target. The problem
of register renaming encountered by superscalar architectures can be solved simply by providing more
registers, so the compiler is not as constrained and need not re-use names.

If the hardware supports software speculation, it needs to provide mechanisms to ensure that spec-
ulated instructions that raise an exception do not affect the state of the machine until they become non-
speculative. VLIWs deal with these problems in various ways. The simplest method is never to raise
exceptions for speculated instructions, though other techniques exist, such as deferring the exceptions.

3.3.3 Code libraries


An alternative to smart compilers is to utilise a collection of highly-optimised code libraries written in
assembly language by platform experts. These libraries are common on superscalars to support special-
purpose SIMD instruction sets such as AltiVec on the Power family.

3.4 History
VLIW is not a new architecture. In fact, it was originally more popular than superscalar designs (which
arose to alleviate backwards compatibility problems). One of the first was the AP-120B, described by
Charlesworth in 1981. The mid-80s saw several attempts to introduce VLIW processors, notably the
Multiflow Trace and the Cydrome Cydra 5. The compiler for the Multiflow Trace was the first to
employ trace scheduling, a technique invented by Multiflow’s founder, Joseph Fisher. The Cydra 5 included
hardware support for software pipelining very similar to that now present on the Itanium. Specifically,
it supported an iteration frame pointer which could be used as an offset into its register data file. The
similarity is not a coincidence: Cydrome’s chief architect, Bob Rau, was later involved in the development
of Itanium.
The Trace and the Cydra 5 both had very large instruction words: 256 bits for both machines, with
later versions of the Trace supporting even larger words (the earliest Trace machines supported 7 in-
structions per 256-bit word; later models supported 14 or even 28 instructions per word). The machines
dealt with the corresponding bandwidth and storage problems in different ways: the Trace supported
instruction compression, and the Cydra 5 supported a sequential mode, in which each of the 7 opcodes
contained within its instruction word was executed one at a time. Interestingly, neither the Multiflow
nor the Cydra 5 contained caches of any description.
These products failed commercially, and the prevailing opinion is that this was due to their position as
start-ups: several small technical mistakes, combined with the difficulty of selling expensive machines
from a poorly-established company, caused their downfall, rather than any major architectural defi-
ciency.
The importance of efficient compilers also contributed to poor early acceptance of VLIW architec-
tures. For example, Intel’s i860 RISC processor, introduced in the early 1990s, had a simple VLIW mode
where each instruction consisted of an integer portion and a floating-point portion. Compilers for the i860
were expected to carefully order instructions in order to keep pipelines filled, but unfortunately were not
of sufficient quality to produce good code for the chip.

4 Case study: Analog Devices SHARC ADSP-2136x family


The SHARC family of DSPs is aimed at real-time audio and visual applications. Because embedded
devices are generally not user-programmable, and run only a very limited set of applications, there is
less of a requirement to maintain a legacy instruction set, a condition sometimes referred to as low
instruction-set inertia. Like many DSP manufacturers, Analog Devices has taken advantage of low
instruction-set inertia to create its own instruction set for the SHARC, using VLIW techniques to allow
programmers to get the most from the design while keeping the architecture simple.
The SHARC contains two separate ALUs, named PEx and PEy, each with its own register file. These
processing elements are not separately accessible, but work together when the chip is placed in SIMD
mode for increased throughput.
The focus on data throughput extends throughout the architecture. Address generation is separate from
the processing units, allowing addresses to be generated in parallel with data processing. The chip also
uses a “Super Harvard” architecture (the origin of the SHARC name), making use of multiple instruction,
data, and I/O buses. Almost all registers are general-purpose, and the type of a register (fixed-point or
floating-point) is determined by bits set in the instruction (not all operations are permitted for all number
formats).

4.1 Instruction format

The SHARC uses a number of instruction groups. Each group contains a related family of instruc-
tions, such as ALU-related instructions (Groups I and II) and memory access (Group III). There is some
commonality between bits allocated in the instruction word inside groups, but allocation is not wholly
regular.
Despite containing two ALUs, SHARC opcodes do not permit independent addressing of these func-
tional units. Instead, they work in tandem in SIMD mode. This is perhaps because the ADSP-2136x is an
evolutionary advance on previous processors in the SHARC family, which contained only a single ALU.
In the default (SISD) mode, the secondary ALU is disabled. SIMD mode may be enabled by setting a
bit in the CPU status register MODE1.
Many instructions include a 23-bit section called the compute field. This “mini-instruction” is es-
sentially the part of the instruction directing an ALU, and thus supports its own set of opcodes for all
arithmetic operations. As mentioned above, in SIMD mode, one compute field controls both ALUs;
the source and destination registers for PEy are defined by the architecture as a fixed offset from the
source and destination registers of PEx. Many instructions which include a compute field also support
predicated execution.
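The following C fragment is a toy model of SIMD-mode execution; the register-file layout and offset rule are invented for illustration and differ from the real SHARC register mapping:

    /* One compute opcode drives both processing elements in SIMD mode:
       PEy repeats the operation on registers at a fixed offset from PEx's
       operands. The offset and layout here are illustrative only. */
    #define PE_OFFSET 8

    void simd_add(int regfile[16], int dst, int srcA, int srcB, int simd_mode)
    {
        regfile[dst] = regfile[srcA] + regfile[srcB];   /* PEx always runs */
        if (simd_mode)                                  /* PEy mirrors PEx */
            regfile[dst + PE_OFFSET] =
                regfile[srcA + PE_OFFSET] + regfile[srcB + PE_OFFSET];
    }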

4.2 Instruction-level parallelism in the SHARC family

Despite the lack of independently-addressable ALUs, the SHARC supports several forms of
parallelism, mostly concerned with the efficient transformation of large amounts of data. Generally,
compute operations may be combined with predicates, data movement to / from memory, and register
manipulations such as shifts and transfers.
SHARC does not support speculation, but has a relatively short pipeline consisting of 5 stages, so the
cost of a branch is low.

5 Case study: Intel Itanium


Itanium is based around the explicitly-parallel instruction computer (EPIC) architecture, a fairly recent
architecture that emerged, circa 1997, from Hewlett-Packard’s PlayDoh research architecture. The EPIC
architecture is based on VLIW, but was designed to overcome the key limitations of VLIW (in particular,
hardware dependence) while simultaneously giving more flexibility to compiler writers. So far the only
implementation of this architecture is as part of the IA-64 processor architecture in the Itanium family of
processors.

5.1 Instruction bundles

The major problem addressed by EPIC is hardware dependence. VLIW is designed around the concept
that the limits of a processor’s parallelism are expressed by a single instruction word. Thus, processors
capable of a greater degree of parallelism require a different instruction set. EPIC’s solution to this
problem is to define several reasonably-abstract categories of mini-instructions, such as ALU operations,
floating-point operations, and branches. Mini-instructions are combined in groups of three into a bundle.
In addition to three 41-bit mini-instructions, bundles contain a 5-bit template type for a total bundle size
of 128 bits. Figure 5 shows the general format of an EPIC bundle.

Figure 5: An EPIC bundle
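The following C sketch packs such a bundle, using GCC’s unsigned __int128 extension; it assumes the IA-64 layout of a 5-bit template in the low bits, with the three 41-bit slots above it:

    #include <stdint.h>

    typedef unsigned __int128 u128;  /* GCC/Clang extension */

    /* Pack a 5-bit template and three 41-bit mini-instructions into a
       128-bit bundle: template in bits 0-4, slot 0 in bits 5-45,
       slot 1 in bits 46-86, slot 2 in bits 87-127. */
    static u128 make_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
    {
        const uint64_t MASK41 = (1ULL << 41) - 1;
        u128 b = tmpl & 0x1fu;
        b |= (u128)(s0 & MASK41) << 5;
        b |= (u128)(s1 & MASK41) << 46;
        b |= (u128)(s2 & MASK41) << 87;
        return b;
    }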

Template 0: MII
Template 1: MII*
Template 2: MI*I
Template 3: MI*I*

Figure 6: The first four EPIC bundle templates. Stops are indicated by asterisks.

5.2 Instruction-level parallelism in EPIC


Crucially, the length of a bundle does not define the limits of parallelism; the bundle template type
indicates, by the presence or absence of a stop, whether instructions following the bundle can execute
in parallel with instructions in the bundle. The claim of EPIC is that as the processor family evolves,
processors with greater support for parallelism will simply issue more bundles simultaneously.
Figure 6 illustrates the first four EPIC bundle templates, of the 32 available. Note that these first four
templates all contain memory (M) and integer ALU (I) mini-instructions; higher-numbered templates
contain different types of mini-instructions. Template 0 contains one memory and two integer instructions
and no stops, meaning that a sequence of template-0 bundles will be executed in parallel as much as
the hardware is capable, whereas template 1 contains a stop after the second integer instruction but is
otherwise identical. The hardware ensures that all instructions before the stop have been retired before
executing instructions after the stop.
To put it another way, compilers simply target a theoretical processor with support for an infinite
amount of parallelism (or, at least, a register-limited amount of parallelism), and the implementation
performs as much as it can. For example, all current Itaniums issue two bundles at a time through a
technique known as dispersal. The first part of a bundle is issued, and then the bundle is logically shifted
so that the next part of the bundle is available for execution. If the mini-instruction cannot be executed,
split issue occurs. The bundle continues to occupy its bundle slot, and another bundle is loaded to occupy
the next slot. Since some instructions from the bundle have been executed, leaving them in the bundle
slot reduces parallelism. EPIC trades this performance decrease against the relatively simple hardware
required to implement dispersal.
Itanium relies heavily on predication. All Itanium instructions are predicated, reducing the cost of
a branch significantly if it can be rewritten to predicated instructions. Itanium has 64 one-bit predicate
registers, meaning that 6 bits of every mini-instruction are devoted to specifying a predicate register.
Multiple predicated streams can execute in parallel, but only those whose predicate is true are retired.
The reasoning behind predication is that even though modern branch predictors are very efficient, the
cost of misprediction is still high because branches occur so frequently.
Itanium does not perform automatic speculation in hardware. This decision is definitely in the spirit
of VLIW. The cost of speculation is thus moved to program development time, where there are more
resources to deal with it. The architecture supports speculative loads, where a given load instruction
may be moved further away from the point where its result is required. The advantage of this is that it
hides memory latency. The potential problem is that the load may fail, triggering an exception.
Speculative load exceptions are handled using poison bits: an extra bit on every register that is set if
the result of a speculative load triggered an exception. The bit, named NaT (“not a thing”) for integer
registers and NaTVal (“not a thing value”) for floating-point registers, may propagate to other registers
through further speculative operations, but any other attempt to make use of a poisoned register results
in an exception.
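The following C fragment is a toy software model of the poison-bit mechanism (Itanium implements this in hardware, with instructions such as ld8.s and chk.s); the types and function names are invented:

    #include <stdint.h>
    #include <stdlib.h>

    /* Toy model: a register carries a NaT ("not a thing") poison bit. */
    typedef struct {
        uint64_t val;
        int      nat;  /* 1 if a speculative load faulted */
    } reg;

    /* Speculative load: a fault does not raise an exception immediately;
       it merely poisons the destination register. */
    static reg spec_load(const uint64_t *addr)
    {
        reg r = { 0, 0 };
        if (addr == NULL)   /* stands in for any faulting access */
            r.nat = 1;
        else
            r.val = *addr;
        return r;
    }

    /* Non-speculative use: if the register is poisoned, the deferred
       exception is finally raised (cf. Itanium's chk.s check). */
    static uint64_t check_and_use(reg r)
    {
        if (r.nat)
            abort();  /* recovery code would re-execute the load here */
        return r.val;
    }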
If all else fails, and Itanium is forced to take an unexpected branch, the cost is reduced by its
relatively-short pipeline: 8 stages in Itanium 2, compared with 30 for later revisions of the Pentium
4.

5.3 Problems with EPIC


Despite the advantages of EPIC over VLIW, IA-64 does not solve all of VLIW’s problems. The foremost
problem is program size: it is not always possible to completely fill all slots in a bundle, and empty slots
are filled with NOPs (IA-64 performs no compression beyond that offered by bundle templates).
As discussed above, increases in code size negatively impact cache performance and result in more bus
traffic. Itanium compensates for this by using large, fast caches on-die. Cache is relatively easy to add to
Itanium, because the lack of hardware dedicated to speculation, dynamic scheduling and the like results
in a small core size. However, cache increases die size and power consumption – though cache consumes
far less power than core logic.
Another problem common to VLIWs in general is the importance of compiler optimisation. Poor
compiler support can significantly impact the performance of EPIC code. Historically this has been a
problem for Itanium, though the situation should improve as compilers mature.

6 Conclusions
Despite its history, VLIW has yet to see significant commercial success in general-purpose computers.
One reason for this is backwards-compatibility issues, which newer architectures, such as EPIC, are start-
ing to address. Another potential problem facing VLIWs is the widening gap between CPU performance
and memory bandwidth. Perhaps updates to the EPIC architecture, or some future VLIW-based archi-
tecture, will see optional support for some form of instruction compression, to reduce Itanium’s reliance
on caching for performance.
