
How to provide a power-efficient architecture
It's anything but a smooth ride for processors with many challenges including the amount
of energy consumed per logic operation. Here's what can be done to reduce total power
consumption.

By Bob Crepps, Intel Corporation


Power Management DesignLine
(07/24/2006 11:02 AM EDT)

Microprocessor performance has scaled over the last three decades from devices that
could perform tens of thousands of instructions per second to tens of billions for today's
products. There seems to be no limit to the demand for increasing performance.
Processors have evolved from super-scalar architectures to instruction-level parallelism,
with each evolution making more efficient use of a fast single-instruction pipeline. The
goal is to continue that scaling, to reach a capability of 10 tera-instructions per second by
the year 2015. However, there are many challenges on the way to that goal.

Figure 1: Processor Performance

As semiconductor process technology advances at a rate predicted by Moore's Law, some
effects that could be ignored in previous process generations are having increasing
impacts on transistor performance, and ways to deal with those effects must be found.
Consider the Technology Outlook chart below. As indicated by "High Volume
Manufacturing", Intel is using a 65nm process technology and can integrate up to 4
billion transistors on a single die. If you look ahead just 5 years, the process technology
will be 22nm and the integration capacity will grow to 16-32 billion transistors! But note
also that certain aspects will not scale as they have in the past. Delay (CV/I) has been
scaling by a factor of about 0.7 per process step, and that rate of scaling will slow. The
amount of energy consumed per logic operation will also scale more slowly. Bulk planar
transistors as we use them today will become less likely; new transistor configurations
such as strained silicon, high-k dielectric, metal gate and tri-gate will be used. And
variability, the difference in performance measured in "identical" transistors on
the same die, will increase. Moore's Law is alive and well, and there are new challenges
ahead as process technology nodes shrink.

Figure 2: Technology outlook

There are a number of techniques we've developed to control these factors. Among those
are several techniques for power reduction, including leakage control and active power
reduction. In addition, power consumption can be reduced by using special purpose
hardware, multi-threading, chip multi-processing and dual and multi-core processors.
These techniques will be described in more detail. The Power Technology Roadmap
shows which technique will intercept each process technology step.
Figure 3: Power Reduction Technology Roadmap

Leakage control
The ideal logic transistor should act like a switch. When it's off, no current flows, and
when it's on, current flows with very low loss. As the insulating layers of transistors
become thinner, leakage current increases. Following the current trend, leakage power
could soon reach 50% of active power. These transistors act much less like switches and
more like dimmers.

There are several techniques to help reduce leakage power. One of these is called Body
Bias: when the transistor is off, a bias voltage is applied to the body of the transistor,
reducing the leakage current by 2-10X. Another technique is called Stack Effect: instead
of using a single transistor, two transistors stacked together perform the switching
function, reducing leakage current by 5-10X. A third technique is Sleep Transistors,
which isolate, or disconnect from the power supply, blocks of logic when they are not
needed. For example, a floating-point unit can be turned off until it is called. Sleep
Transistors can reduce leakage by 2-1000X, depending on the size of the logic block.
Note that in all of these techniques, leakage power is reduced by using more, not fewer,
transistors. This seems counter-intuitive, but it takes advantage of the higher transistor
integration capability.
Figure 4: Leakage Control
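To make the orders of magnitude concrete, here is a minimal back-of-the-envelope sketch in Python. The 10-watt baseline and the particular factors chosen are hypothetical, and real techniques do not necessarily compose multiplicatively in silicon; this only illustrates how the quoted ranges stack up.

# Hypothetical illustration: applying the quoted per-technique reduction
# factors to a made-up baseline leakage figure. Real silicon does not
# necessarily compose these multiplicatively.

def residual_leakage(baseline_watts, factors):
    """Divide a baseline leakage power by each technique's reduction factor."""
    leakage = baseline_watts
    for f in factors:
        leakage /= f
    return leakage

# Conservative ends of the quoted ranges: body bias (2X), stack effect (5X),
# sleep transistors on idle blocks (2X), against a hypothetical 10 W baseline.
print(residual_leakage(10.0, [2.0, 5.0, 2.0]))  # -> 0.5 W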

Active Power Reduction


We can extend that use of more transistors to active power reduction as well. By using
multiple supply voltages and dividing logic circuits into faster and slower sections, the
slower sections can run from a lower supply voltage and consume less active power. For
example, an ALU that must operate at high speed is connected to the higher supply
voltage, while a sequencer can be connected to the lower voltage.
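Because dynamic power scales roughly as CV²f, running the slow section at a reduced supply pays off quadratically. A minimal sketch, with illustrative (not measured) voltage and frequency values:

# Dynamic power scales roughly as C * V^2 * f. The voltage and frequency
# values below are illustrative, not taken from any real design.

def dynamic_power(cap_norm, vdd, freq):
    return cap_norm * vdd**2 * freq

alu       = dynamic_power(1.0, vdd=1.0, freq=1.0)  # fast domain: full V and f
sequencer = dynamic_power(1.0, vdd=0.7, freq=0.5)  # slow domain: lower V and f
print(alu, sequencer)  # 1.0 vs ~0.245 -- roughly 4X less in the slow domain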

Replicated design is another very powerful technique. Consider the figure below. We start
with a single logic block that has operating frequency, supply voltage, power, die area
and power density all normalized to 1 unit, and has a throughput of 1. If we replicate that
block and adjust the operating conditions, we can reduce the active power consumption.
By reducing the supply voltage to 0.5, with a corresponding reduction in operating
frequency to 0.5, the power is reduced to 0.25. The area must increase to 2, which,
combined with the reduced power consumption, gives a power density of 0.125. The throughput per
block is 0.5 so the total throughput is 1. The power has been reduced by a factor of 4 but
the throughput is the same. More transistors equals less power for the same performance.
Figure 5: Active Power Reduction
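The arithmetic in the example above follows directly from the P ∝ V²f relation; a short sketch reproducing it:

# Reproducing the replicated-design numbers above, using P ~ V^2 * f
# (all quantities normalized to the single-block baseline).
blocks    = 2
vdd, freq = 0.5, 0.5                        # both halved versus the baseline
power_per_block = vdd**2 * freq             # 0.125
total_power     = blocks * power_per_block  # 0.25 -> a 4X power reduction
area            = blocks * 1.0              # 2.0
power_density   = total_power / area        # 0.125
throughput      = blocks * freq             # 2 * 0.5 = 1.0 -> unchanged
print(total_power, area, power_density, throughput)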

Special-Purpose Hardware
So far, we've talked about using multiple but identical logic blocks or cores, but there are
other ways to scale performance by using special purpose cores. In the chart shown
below, one of the curves is labeled "GP MIPS @ 75W". This represents the general
purpose MIPS (millions of instructions per second) over time. The data was derived from
various processors used to service a saturated Ethernet link. The processor power was
normalized to 75 Watts. The curve labeled "TOE MIPS @~2W" comes from a special
purpose device. This TOE, or TCP Offload Engine, is a test chip made to process TCP
traffic. The chart clearly shows that these "specialized MIPS" consume much less power
to perform their specific function than general-purpose MIPS require to do the same
task. The die shot shows that the TOE is a very small die with relatively few
transistors. The high performance per watt of special-purpose hardware will find
applications in network processing, multimedia, speech recognition, encryption and XML
processing, to name a few.
Figure 6: Special Purpose Hardware

Code-named Paragon [1]
All of these techniques sound good, but can they really be implemented? The Lab
designed a prototype test chip called Paragon, a special-purpose test circuit for
digital signal processing. The design was executed for the highest energy efficiency at
the target performance level, rather than the highest obtainable performance (operating
frequency). Paragon incorporated Body Bias and Sleep Transistors to control leakage
power, and used dual VT with a tiled architecture for fast and slow data paths to reduce
active power. The end result was a filter accelerator that delivers 110 giga-operations
per second per watt. Paragon uses just 9 milliwatts of total power at 1GHz throughput as
a 16x16 single-cycle multiplier. Paragon achieved a leakage power reduction of 7X
compared to previous designs, with 75 microwatts of standby leakage and 540 microwatts
of active leakage. Overall efficiency, in giga-operations per watt, is 3X higher than
previously known designs.
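The headline efficiency figure is consistent with the quoted numbers: one 16x16 multiply per cycle at 1GHz on a 9-milliwatt budget works out to about 110 giga-operations per second per watt.

# Sanity-checking the 110 GOPS/W figure from the numbers quoted above.
ops_per_second = 1e9                 # single-cycle multiplier at 1 GHz
total_power_w  = 9e-3                # 9 mW total power
gops_per_watt  = ops_per_second / total_power_w / 1e9
print(gops_per_watt)                 # ~111, consistent with the reported ~110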

Dual and Multi-Core


We've looked at Tera-scale from the logic level and the thread level. Now let's look at
multiple cores.

First, consider this rule of thumb, derived from the relationships among power, voltage
and frequency, and accounting for both active and leakage power: a 1% change in voltage
requires a corresponding 1% change in frequency, and together they produce roughly a 3%
change in power (power varies approximately as a cubic function of voltage and
frequency), while performance changes by about 0.66%.
Assume we have a single core with cache whose voltage, frequency, power and
performance are normalized to 1. Now replicate the core and share the cache between the
two cores. Next, reduce the voltage and frequency by 15%, so that the power for each
core is 0.5 and the total for the two cores is 1. According to the rule of thumb, the
performance will be 1.8 for two cores consuming the same power as the single core.
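A minimal sketch of that arithmetic, using the linearized rule of thumb (the exact cubic relation gives slightly different numbers):

# Applying the linearized rule of thumb: a 1% voltage drop (with a matching
# 1% frequency drop) cuts power ~3% and performance ~0.66%.
delta = 0.15                          # 15% voltage/frequency reduction
power_per_core = 1.0 - 3.0 * delta    # ~0.55, close to the 0.5 quoted above
perf_per_core  = 1.0 - 0.66 * delta   # ~0.90
cores = 2
print(cores * power_per_core)         # ~1.1 -> roughly the single-core power
print(cores * perf_per_core)          # ~1.8 -> the quoted dual-core performance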

Next, compare a single large core to multiple small cores. Start with a large core and cache
that consumes 4 units of power and delivers 2 units of performance. Compare that to a
small core that has been scaled to use one unit of power and has a corresponding
performance of 1 unit. By combining 4 of the small cores together, the total power is
equal to that of the large core, or 4 units, and the total performance is equal to 4 units,
twice that of the large core for the same power.
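Stated as arithmetic, and assuming a workload parallel enough to keep every small core busy:

# Large core vs. four small cores, using the normalized units from the text.
# This assumes a workload parallel enough to keep all four small cores busy.
large_power, large_perf = 4.0, 2.0    # one big core plus cache
small_power, small_perf = 1.0, 1.0    # one scaled-down core
n_small = 4
print(n_small * small_power)          # 4.0 -> the same power as the large core
print(n_small * small_perf)           # 4.0 -> twice the large core's performance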

Figure 7: Single- to Dual-Core

Power and Scalar Performance


Historically, increasing frequency was a major factor in increasing performance. As the
total power consumed by microprocessors grew with increasing frequency, the need to
find other ways to scale performance became clear. In the chart below, note that if we
factor out all the contributions to performance due to process technology, that is,
frequency, delay scaling and so on, it's clear that scalar performance has increased at
about one-fourth the rate of power increase. Rising energy per instruction (EPI) has
been the trend over the last 15 years. The exception is the Pentium M processor, which
was designed to have a much lower EPI than previous versions. This processor shows
scalar performance much closer to 1:1 with increasing power. Is there more to be done to
control EPI?
Figure 8: Power and Scalar Performance

An Energy per Instruction Throttle [2]


Researchers began looking for ways to control EPI. There are several ways to do this,
each with different advantages and ranges of control. Changing voltage and frequency is
one method. By lowering voltage and frequency, power is reduced, as previously
discussed. This method has a practical EPI control range of 1:2 to 1:4 and a response time
of about 100 microseconds to ramp the supply voltage. Another technique is called
variable-sized core, where processor resources are reduced to reduce energy
consumption. This method has an EPI control range of 1:1 to 1:2 and a response time of 1
microsecond (to fill a 32KB L1 cache). A third technique is speculation control. When
speculative execution is used, each missed speculation wastes energy and increases
overall EPI, so limiting speculation reduces EPI. This method allows an EPI range of
1:1 to 1:1.4 and has a response time of 10 nanoseconds due to the latency of the pipeline.
We chose the voltage and frequency method for our experiments, as it offers a wide
control range.
Figure 9: Energy per Instruction
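The 1:2 to 1:4 range for voltage and frequency scaling follows from switching energy: if EPI is dominated by CV² energy, scaling the supply between its maximum and about half of it spans roughly a 4X range. A sketch under that assumption (the voltages are illustrative):

# If energy per instruction is dominated by C * V^2 switching energy,
# the achievable EPI control range follows from the usable voltage range.
def epi(vdd, cap_norm=1.0):
    return cap_norm * vdd**2       # switching energy per operation

print(epi(1.0) / epi(0.7))         # ~2X -> modest voltage scaling
print(epi(1.0) / epi(0.5))         # 4X  -> aggressive voltage scaling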

The experiment was intended to measure power savings and performance by varying
voltage and frequency on a multi-way system. Performance was measured as the
wall-clock run-time of each of several Linux benchmarks. Linux is
well suited to this experiment due to the ability to assign thread affinity, that is, assign a
thread to a processor. The benchmarks represent a combination of single and multi-
threaded applications.
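The affinity mechanism the experiment relied on is exposed directly by the Linux scheduler. As an illustration, here is how it looks in modern Python; the original work would have used the underlying sched_setaffinity system call or a tool such as taskset:

# Pinning the calling process to a single CPU on Linux (Python 3.3+).
import os

os.sched_setaffinity(0, {0})       # pid 0 means "the calling process"
print(os.sched_getaffinity(0))     # -> {0}: now eligible to run only on CPU 0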

The experimental platform was a 4-way Xeon processor system operating at 2GHz,
with 2MB of L3 cache, 4GB of main memory and three Ultra 320 disk drives. The platform was
operated in four modes:

1. One CPU at 2GHz as a baseline; power and performance normalized to 1.0.
2. Two CPUs operating at 1.5GHz; power normalized to 1.12 compared to the baseline,
performance normalized to 1.06.
3. Three CPUs operating at 1.25GHz; power normalized to 1.17, performance normalized
to 1.08.
4. Four CPUs operating at 1GHz; power and performance normalized to 1.0.

The run-times of the 2-CPU and 3-CPU configurations were adjusted to make their
power exactly the same as the baseline and 4-CPU configurations.

For multiple CPU configurations, the platform was operated in two modes: Symmetric
Multi-processing or SMP, where all processors run at the same speed, and Asymmetric
Multi-processing, where processors run at different speeds.

Benchmark Results
The benchmark results (wall-clock run times) fell into three categories:
* Benchmarks where the 4-CPU SMP and AMP configurations performed equally well.
* Benchmarks where the AMP configurations achieved significant speedup over SMP.
* Benchmarks where both AMP and 4-CPU SMP performed worse than the baseline configuration.

Intuitively, during the sequential or single-thread phases of each application (and even
highly threaded applications have sequential phases), the 4-CPU configuration leaves
most of its power budget underutilized because only one CPU is in use. The 1-CPU
configuration, on the other hand, is unable to exploit the thread-level parallelism
available in applications, so performance suffers. The AMP configuration continuously
varies EPI with thread-level parallelism to optimize power and performance.

Figure 10: Speeds on AMP

Why and When is AMP Better?


To understand the advantages of AMP, we used the following methodology:
* First, for each benchmark, the percentage of the run-time spent on the sequential and
parallel portions was computed.
* Second, the run-times were compared, both as measured on the AMP prototype and as
projected onto an ideal AMP system.

The results are clustered into three categories:


First, for applications that are mostly parallel, the SMP configuration gives the best
performance.
Second, for applications that are mostly sequential, the 1-CPU configuration gives the
best performance.
Third, for applications with a mix of sequential and parallel phases, AMP gives the best
overall performance. Remember, the operating conditions were set so that each
configuration consumes the same amount of power.
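A toy equal-power model reproduces these three regimes. Assuming power per CPU scales roughly as the cube of its speed, a fixed budget buys one fast CPU, four medium SMP CPUs, or an AMP mix of one fast plus three slow; which configuration wins then depends only on the sequential fraction s. All speeds below are illustrative, not measured data:

# Toy Amdahl-style model of the three equal-power configurations.
# Power per CPU ~ speed^3, so a budget of 1.0 buys: one CPU at speed 1.0;
# four SMP CPUs at 0.25**(1/3) ~ 0.63; or one fast CPU (0.85, power ~0.61)
# plus three slow CPUs (~0.505 each, power ~0.13 each). Illustrative only.

SMP_SPEED = 0.25 ** (1 / 3)            # ~0.63
AMP_FAST, AMP_SLOW = 0.85, 0.505       # 0.85**3 + 3 * 0.505**3 ~ 1.0

def runtime(s, seq_speed, par_speed_total):
    """Sequential fraction s runs on one CPU; the rest uses all CPUs."""
    return s / seq_speed + (1 - s) / par_speed_total

for s in (0.9, 0.5, 0.02):             # mostly sequential ... mostly parallel
    one_cpu = runtime(s, 1.0, 1.0)
    smp4    = runtime(s, SMP_SPEED, 4 * SMP_SPEED)
    amp     = runtime(s, AMP_FAST, AMP_FAST + 3 * AMP_SLOW)
    print(f"s={s}: 1-CPU {one_cpu:.2f}  SMP {smp4:.2f}  AMP {amp:.2f}")
# s=0.9 -> 1-CPU fastest; s=0.5 -> AMP fastest; s=0.02 -> SMP fastest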

In the graph below, the results are shown for each benchmark and for SMP and AMP
against the baseline system. Performance is measured as wall-clock run time so lower is
better. You can see that an AMP configuration of one CPU running at higher speed and
three at lower speed (1+3) delivers performance very close to the best of any
configuration across the range of applications, and therefore the best overall EPI.
Figure 11: Why and When AMP is better

Conclusion
Moore's Law is alive and well and our ability to integrate more transistors onto a single
device will continue to scale well into the future. However, as process technology
advances, we face new challenges that require new techniques for control of active and
leakage power. We can use more transistors to achieve higher performance at lower
power, as in the case of special purpose hardware. Multi-processing will allow
performance to scale while maintaining or reducing total power consumption and enable
more efficient energy per instruction usage.

To learn more about Moore's Law and enabling energy-efficient instruction usage, go to:
http://www.intel.com/technology/eep/index.htm?ppc_cid=c96

References:
1. "Paragon: A 110GOPS/W 16b Multiplier and Reconfigurable PLA Loop in 90nm
CMOS" available from IEEE, ISSCC 2005 Session 20
2. "Mitigating Amdahl's Law Through EPI Throttling" available from IEEE, 0-7695-
2270-X/05

About the author:


Bob Crepps is a ten-year veteran of Intel. He has worked as an engineer in Motherboard
Products and in Industry Enabling for technologies such as USB, AGP and PCI Express.
His current role is as a technology strategist for the Microprocessor Technology Lab in
Corporate Technology Group. Prior to Intel he was an analog design engineer and
information systems engineer for fifteen years for various technology companies in the
Northwest. Bob lives and works in Hillsboro, Oregon. bob.a.crepps@intel.com
