It's anything but a smooth ride for processors, which face many challenges, including
the amount of energy consumed per logic operation. Here's what can be done to reduce
total power consumption.
Microprocessor performance has scaled over the last three decades from devices that
could perform tens of thousands of instructions per second to tens of billions for today's
products. There seems to be no limit to the demand for increasing performance.
Processors have evolved from super-scalar architectures to ever-greater instruction-level
parallelism, where each evolution makes more efficient use of a fast single-instruction
pipeline. The goal is to continue that scaling, reaching a capability of 10 tera-instructions
per second by the year 2015. However, there are many challenges on the way to that goal.
There are a number of techniques we've developed to control power consumption. Among
those are several techniques for power reduction, including leakage control and active
power reduction. In addition, power consumption can be reduced by using special-purpose
hardware, multi-threading, chip multi-processing, and dual- and multi-core processors.
These techniques will be described in more detail. The Power Technology Roadmap
shows which technique will intercept each process technology step.
Figure 3: Power Reduction Technology Roadmap
Leakage control
The ideal logic transistor should act like a switch. When it's off, no current flows, and
when it's on, current flows with very low loss. As the insulating layers of transistors
become thinner, leakage current increases. Following the current trend, leakage power
could soon reach 50% of active power. These transistors act much less like switches and
more like dimmers.
There are several techniques to help reduce leakage power. One of those is called Body
Bias. When the transistor is off, a bias voltage is applied to the body of the transistor,
reducing the leakage current by a factor of 2-10X. Another technique is called Stack
Effect. Instead of using a single transistor, you can use two transistors stacked together to
perform the switching function. This Stack Effect can reduce leakage currents by a factor
of 5-10X. A third technique is called Sleep Transistors. Sleep Transistors can be used to
isolate, or disconnect from the power supply, blocks of logic when not needed. For
example, a floating point unit can be turned off until called. Sleep Transistors can reduce
leakage by a factor of 2-1000X, depending on the size of the logic block. Note that in all
of these techniques, leakage power is reduced by using more, not fewer, transistors. It
seems counter-intuitive but takes advantage of the higher transistor integration capability.
Figure 4: Leakage Control
Replicated design is another very powerful technique. Consider the figure below. We start
with a single logic block that has operating frequency, supply voltage, power, die area
and power density all normalized to 1 unit, and has a throughput of 1. If we replicate that
block and adjust the operating conditions, we can reduce the active power consumption.
By reducing the supply voltage to 0.5, with a corresponding reduction in operating
frequency to 0.5, the total power is reduced to 0.25. The area doubles to 2, and combined
with the reduced power consumption, the power density is 0.125. The throughput per
block is 0.5, so the total throughput is 1. The power has been reduced by a factor of 4 but
the throughput is the same. More transistors equals less power for the same performance.
Figure 5: Active Power Reduction
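The replicated-design arithmetic above can be checked with a few lines of Python. This is a toy model, assuming active power scales as V² × f with everything normalized to 1, which is what the text's numbers imply:

```python
def active_power(voltage, frequency):
    # Normalized dynamic power model: P ~ C * V^2 * f, with C = 1.
    return voltage ** 2 * frequency

baseline = active_power(1.0, 1.0)        # single block: power = 1, throughput = 1
replicated = 2 * active_power(0.5, 0.5)  # two blocks at half voltage and frequency
density = replicated / 2                 # die area doubled, so density drops further
throughput = 2 * 0.5                     # each block delivers half the throughput

print(baseline, replicated, density, throughput)  # 1.0 0.25 0.125 1.0
```

The model reproduces the figures in the text: a 4X power reduction (0.25), power density of 0.125, and unchanged total throughput of 1.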
Special-Purpose Hardware
So far, we've talked about using multiple but identical logic blocks or cores, but there are
other ways to scale performance by using special purpose cores. In the chart shown
below, one of the curves is labeled "GP MIPS @ 75W". This represents the general
purpose MIPS (millions of instructions per second) over time. The data was derived from
various processors used to service a saturated Ethernet link. The processor power was
normalized to 75 Watts. The curve labeled "TOE MIPS @~2W" comes from a special
purpose device. This TOE or TCP Offload Engine, is a test chip made to process TCP
traffic. The chart clearly shows that these "specialized MIPS" consume much less power
to perform the specific function than general-purpose MIPS require to do the same task.
The die shot shows that the TOE is a very small die and requires relatively few
transistors. The high performance-per-watt of special-purpose hardware will find
applications in network processing, multimedia, speech recognition, encryption and XML
processing, to name a few.
Figure 6: Special Purpose Hardware
Code-named Paragon [1]
All of these techniques sound good, but can they really be implemented? The Lab
designed a prototype test chip called Paragon, a special-purpose test circuit for
digital signal processing. The design was optimized for the highest energy efficiency
at the target performance level, rather than the highest obtainable performance (operating
frequency). Paragon incorporated Body Bias and Sleep Transistors to control leakage
power, and used Dual VT with a tiled architecture for fast and slow data paths to reduce
active power. The end result was a filter accelerator that operates at 110 Giga-operations
per watt. Paragon uses just 9 milliwatts of total power at 1 GHz throughput as a 16x16
single-cycle multiplier. Paragon achieved a leakage power reduction of 7X compared to
previous designs with 75 microwatts standby leakage and 540 microwatts active leakage.
Overall performance is 3X higher Giga-operations per watt than previously known
designs.
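The quoted figures are self-consistent, which a quick sanity check shows (assuming one 16x16 multiply per cycle at 1 GHz, as the "single-cycle multiplier" description implies):

```python
# Paragon sanity check: one operation per cycle at 1 GHz and 9 mW of
# total power works out to roughly 110 Giga-operations per watt,
# matching the figure quoted in the text.
ops_per_second = 1e9          # single-cycle 16x16 multiplies at 1 GHz
total_power_watts = 9e-3      # 9 milliwatts
gops_per_watt = ops_per_second / total_power_watts / 1e9
print(round(gops_per_watt))   # ~111, consistent with the quoted 110 GOPS/W
```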
First, consider this Rule of Thumb, derived from the relationship among power, voltage
and frequency, accounting for both active and leakage power. A 1% change in voltage
requires a corresponding 1% change in frequency; together these cause roughly a 3%
change in power, since power varies as a cubic function of voltage when frequency scales
with it. Performance, meanwhile, changes by only about 0.66%.
Assume we have a single core with cache whose voltage, frequency, power and
performance are normalized to 1. Now replicate the core and share the cache between the
two cores. Next, reduce the voltage and frequency by 15%, so that the power for each
core is 0.5 and the total for the two cores is 1. According to the Rule of Thumb, the
performance will be 1.8 for two cores consuming the same power as the single core.
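The dual-core numbers follow from the Rule of Thumb. A small sketch of the linearized arithmetic (an approximation, which is why it lands near, rather than exactly on, the 0.5-per-core power figure):

```python
# Rule of Thumb: a 1% drop in voltage (with a matching 1% drop in
# frequency) gives roughly a 3% drop in power and a 0.66% drop in
# performance. Apply a 15% reduction to each of two cores sharing a cache.
reduction = 15  # percent

power_per_core = 1.0 - 3.0 * reduction / 100   # ~0.55, near the 0.5 in the text
perf_per_core = 1.0 - 0.66 * reduction / 100   # ~0.90 per core

total_power = 2 * power_per_core               # ~1.1, roughly the single-core budget
total_perf = 2 * perf_per_core                 # ~1.8, as the text states

print(round(total_power, 2), round(total_perf, 2))  # 1.1 1.8
```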
Next consider two cores compared to multiple cores. Start with a large core and cache
that consumes 4 units of power and delivers 2 units of performance. Compare that to a
small core that has been scaled to use one unit of power and has a corresponding
performance of 1 unit. By combining 4 of the small cores together, the total power is
equal to that of the large core, or 4 units, and the total performance is equal to 4 units,
twice that of the large core for the same power.
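The large-core versus small-core comparison reduces to simple performance-per-watt arithmetic, using the normalized units from the text:

```python
# Large core: 4 units of power for 2 units of performance.
# Small core: 1 unit of power for 1 unit of performance.
large_power, large_perf = 4, 2
small_power, small_perf = 1, 1

cluster_power = 4 * small_power       # four small cores: same 4-unit power budget
cluster_perf = 4 * small_perf         # but 4 units of performance

print(large_perf / large_power)       # 0.5 performance per unit of power
print(cluster_perf / cluster_power)   # 1.0 -- twice the large core's efficiency
```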
The experiment was intended to measure power savings and performance by varying
voltage and frequency on a multi-way system. Various Linux benchmarks were used to
measure performance by measuring the wall-clock run-time of each benchmark. Linux is
well suited to this experiment due to the ability to assign thread affinity, that is, assign a
thread to a processor. The benchmarks represent a combination of single and multi-
threaded applications.
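On Linux, assigning a thread or process to a processor can be done through the scheduler-affinity API. A minimal sketch of the idea (the choice of logical CPU 0 is illustrative, not taken from the experiment):

```python
import os

pid = 0  # 0 means "the calling process"
if hasattr(os, "sched_setaffinity"):       # Linux-only API
    original = os.sched_getaffinity(pid)   # remember the current CPU set
    os.sched_setaffinity(pid, {0})         # run only on logical CPU 0
    pinned = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, original)    # restore the original affinity
else:
    pinned = {0}                           # platforms without the API
print(pinned)  # {0}
```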
The experimental platform used was a 4-way Xeon processor system operating at 2GHz,
with 2 MB of L3 cache, 4 GB of main memory and three Ultra-320 disk drives. The
platform was operated in four modes. For multiple-CPU configurations, the platform ran
in one of two ways: Symmetric Multi-processing (SMP), where all processors run at the
same speed, and Asymmetric Multi-processing (AMP), where processors run at different
speeds.
Benchmark Results:
The benchmark results (wall-clock run times) fell into three categories:
1. The 4-CPU SMP and AMP configurations performed equally well.
2. The AMP configurations achieved significant speedup over SMP.
3. The AMP and 4-CPU SMP configurations performed worse than the baseline configuration.
The intuitive explanation of the results is that, during the sequential or single-thread
phases of each application (and even highly threaded applications have sequential
phases), the 4-CPU configuration underutilizes its power budget, since only one CPU is
in use. The 1-CPU configuration, on the other hand, is unable to exploit the thread-level
parallelism available in the applications, so performance suffers. The AMP configuration
continuously varies energy per instruction (EPI) with thread-level parallelism to optimize
power and performance.
In the graph below, the results are shown for each benchmark and for SMP and AMP
against the baseline system. Performance is measured as wall-clock run time so lower is
better. You can see that an AMP configuration with 1 CPU running at higher speed and 3
at lower speed (1+3) gives performance very close to the best of any configuration across
the range of applications, and therefore the best overall EPI.
Figure 11: Why and When AMP is better
Conclusion
Moore's Law is alive and well and our ability to integrate more transistors onto a single
device will continue to scale well into the future. However, as process technology
advances, we face new challenges that require new techniques for control of active and
leakage power. We can use more transistors to achieve higher performance at lower
power, as in the case of special purpose hardware. Multi-processing will allow
performance to scale while maintaining or reducing total power consumption and enable
more efficient energy per instruction usage.
To learn more about Moore's Law and enabling energy efficient instruction usage go to:
http://www.intel.com/technology/eep/index.htm?ppc_cid=c96
References:
1. "Paragon: A 110GOPS/W 16b Multiplier and Reconfigurable PLA Loop in 90nm
CMOS," IEEE, ISSCC 2005, Session 20.
2. "Mitigating Amdahl's Law Through EPI Throttling," IEEE, 0-7695-2270-X/05.