In the past thirty years, advances in computer technology have fundamentally changed the practice of business and personal computing. During these three decades, the wide acceptance of personal computers and the explosive growth in the performance, capability, and reliability of computers have fostered a new era of computing. This computing revolution has been driven primarily by rapid advances in computer architecture and semiconductor technologies. In this paper, we examine key architectural and process technology trends that will affect microprocessor designs in the next decade.
2. Microprocessor Evolution
In 1965 Gordon Moore observed that the total number of devices on a chip doubled every 12 months at no additional cost. He predicted that the trend would continue in the 1970s but would slow after 1975 [1]. Widely known as Moore's Law, these observations made the case for continued wafer and die size growth, defect density reduction, and increased transistor density as technology scaled and manufacturing matured. Figure 1 shows that transistor count in leading microprocessors has doubled in each technology node, approximately every 18 to 24 months. The factors that drove up transistor count are increasingly complex processing cores, integration of multiple levels of caches, and inclusion of system functions. Figure 2 shows that microprocessor frequency has doubled in each generation, the result of a 25% reduction in gates per clock, faster transistors, and advanced circuit design. Figure 3 shows that die size has increased at 7% per year while feature size has shrunk by 30% every 2 to 3 years. Together, these fuel the transistor density growth predicted by Moore's Law. Die size is limited by the reticle size, power dissipation, and yield. Leading microprocessors typically have large die sizes that are reduced with more advanced process technology to improve frequency and yield. As feature size gets smaller, Figure 4 shows that longer pipelines enable frequency scaling, which has been a key driver for performance.
3. Transistor Scaling
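The scaling rates quoted for Figure 3 can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that "die size" means die area, that transistor density scales ideally with the square of the feature size, and that a process node lasts 2.5 years (the midpoint of "every 2 to 3 years"). The function names are hypothetical and none of these assumptions come from the paper.

```python
# Rough arithmetic behind the scaling rates quoted in the text.
# Assumptions (not from the paper): "die size" means die area, density
# scales ideally with feature size, and a node lasts 2.5 years.

LINEAR_SHRINK = 0.70          # feature size falls 30% per node (from the text)
DIE_GROWTH_PER_YEAR = 0.07    # die size grows 7% per year (from the text)
YEARS_PER_NODE = 2.5          # assumed midpoint of "every 2 to 3 years"


def density_gain_per_node(linear_shrink=LINEAR_SHRINK):
    """Transistors per unit area grow as 1 / shrink^2 under ideal scaling."""
    return 1.0 / linear_shrink ** 2


def die_area_gain_per_node(years=YEARS_PER_NODE, growth=DIE_GROWTH_PER_YEAR):
    """Compound die growth over one node interval."""
    return (1.0 + growth) ** years


def transistor_gain_per_node(years=YEARS_PER_NODE):
    """Combined gain from denser transistors on a larger die."""
    return density_gain_per_node() * die_area_gain_per_node(years)


if __name__ == "__main__":
    print(f"density gain per node:    {density_gain_per_node():.2f}x")
    print(f"die area gain per node:   {die_area_gain_per_node():.2f}x")
    print(f"transistor gain per node: {transistor_gain_per_node():.2f}x")
```

Under these assumptions the 30% linear shrink alone gives roughly a 2x density gain per node, which is consistent with the doubling of transistor count per technology node described above; die growth adds a further factor of about 1.2.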
As advances in lithography decrease feature size and transistor delay, on-chip interconnect increasingly becomes the bottleneck in microprocessor designs. Narrower metal lines and spacing resulting from process scaling increase interconnect delay. Figure 6 shows that local interconnects scale proportionally to feature size. Global interconnects, dominated by RC delay, not only fail to keep pace but are rapidly worsening. Repeaters can be added to mitigate the delay, but they consume power and die area. Low-resistivity copper metallization and low-k materials such as fluorine-doped SiO2 (FSG) are employed to mitigate the worsening interconnect scaling. In the long term, a radically different on-chip interconnect topology is needed to sustain the transistor density and performance growth rates of the last three decades.
5. Packaging
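The interconnect discussion above contrasts global wires with and without repeaters. A minimal sketch of why repeaters help, assuming the common Elmore approximation for a distributed RC line (delay of about 0.5 * r * c * L^2) and a fixed delay per repeater: the wire parameters, repeater delay, and function names are hypothetical values chosen for illustration, and the model ignores repeater drive resistance and input capacitance.

```python
# Simplified wire-delay model (assumption: Elmore delay of a distributed
# RC line; repeaters modeled as a fixed added delay per segment).

R_PER_UM = 0.1        # ohms per micron (hypothetical)
C_PER_UM = 0.2e-15    # farads per micron (hypothetical)
REPEATER_DELAY_S = 20e-12  # seconds per repeater stage (hypothetical)


def unrepeated_delay(length_um, r=R_PER_UM, c=C_PER_UM):
    """Delay grows with the square of wire length."""
    return 0.5 * r * c * length_um ** 2


def repeated_delay(length_um, n_segments, t_rep=REPEATER_DELAY_S):
    """Split the wire into equal segments, each followed by a repeater."""
    segment = unrepeated_delay(length_um / n_segments)
    return n_segments * (segment + t_rep)


def best_segment_count(length_um, max_segments=64):
    """Brute-force search for the segment count with the lowest delay."""
    return min(range(1, max_segments + 1),
               key=lambda n: repeated_delay(length_um, n))


if __name__ == "__main__":
    length = 10_000.0  # a 10 mm global wire
    n = best_segment_count(length)
    print(f"no repeaters: {unrepeated_delay(length) * 1e12:.0f} ps")
    print(f"{n} segments:  {repeated_delay(length, n) * 1e12:.0f} ps")
```

In this model, repeaters turn the quadratic dependence on length into a roughly linear one, at the cost of the extra stages, which is the power and area trade-off the text mentions.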
Device physics poses several challenges to future scaling of the bulk MOSFET structure. Leakage through
The microprocessor package is changing from its traditional role as a protective mechanical enclosure to a sophisticated thermal and electrical management platform. Recent advances in microprocessor packaging include the migration from wirebond to flip-chip and from ceramic to organic package substrates. Looking forward, emerging package technologies include bumpless build-up layer (BBUL) packages, which are built around the silicon die [3]. The BBUL package provides the advantages of small electrical loop inductance and reduced thermo-mechanical stresses on the die interconnect system using low dielectric constant (low-k) materials. This packaging technology allows for high pin count and easy integration of multiple electronic and optical components.
6. Power Dissipation
Power dissipation increasingly limits microprocessor performance. The power budget for a microprocessor is becoming a design constraint, similar to the die area and target frequency. Supply voltage continues to scale down with every new process generation, but at a rate too low to offset the increase in clock frequency and transistor count. Figure 7 shows that power increases with frequency for two processor architectures and the last two process generations. Architectural techniques such as on-die power management, and circuit methods such as clock gating and domino-to-static conversion, are employed to control the power increase of future microprocessors.
7. Clock Speed
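The power trend described for Section 6 can be illustrated with the standard first-order relation for dynamic CMOS power, P = a * C * V^2 * f. This is a hedged sketch, not the paper's model: the capacitance, voltages, frequencies, and activity factors below are hypothetical, leakage power is ignored, and clock gating is approximated simply as a lower activity factor.

```python
# First-order dynamic power: P = activity * C_eff * Vdd^2 * f.
# All numeric values below are hypothetical and chosen for illustration.


def dynamic_power(c_eff_farads, vdd_volts, freq_hz, activity=1.0):
    """Switching power in watts; leakage is not modeled."""
    return activity * c_eff_farads * vdd_volts ** 2 * freq_hz


if __name__ == "__main__":
    # One generation: frequency doubles, supply voltage drops only ~13%.
    old = dynamic_power(20e-9, 1.5, 1.0e9)
    new = dynamic_power(20e-9, 1.3, 2.0e9)
    # Clock gating approximated as a reduced activity factor (assumption).
    gated = dynamic_power(20e-9, 1.3, 2.0e9, activity=0.7)

    print(f"old generation:      {old:.1f} W")
    print(f"new generation:      {new:.1f} W ({new / old:.2f}x)")
    print(f"new, clock gating:   {gated:.1f} W")
```

Because power falls with the square of the supply voltage but rises linearly with frequency, a modest voltage reduction cannot cancel a doubling of clock rate, which matches the observation that voltage scaling does not keep up.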
to achieve this are hyper-threading, also known as multithreading, and chip multiprocessing (CMP).
9. Input/Output
Performance increases lead to higher demand for sustainable bandwidth between a microprocessor and external main memory and I/O. This has led to faster and wider external buses, as shown in Figure 10. In the future, high-speed point-to-point interconnects will replace shared buses to satisfy increasing bandwidth requirements. Distributed interconnects will provide a more scalable path to increasing external bandwidth when the practical limit of a pin is reached.
10. Conclusion
No fundamental barrier exists to extending Moore's Law into the next decade. As feature size continues to decrease by 30% in each process generation, the number of transistors in a high-performance microprocessor doubles. This vast increase in the number of on-chip transistors allows integration of critical functions and greatly enhances microprocessor performance.
The processor-to-memory gap continues to widen as microprocessor speed increases faster than main memory speed. Integration of multiple levels of cache memory reduces the impact of slow memory. Larger cache sizes reduce conflict and capacity misses more than coherency misses. Integrating memory and I/O controllers in a microprocessor will reduce memory access latency and bus bandwidth requirements.
Worsening global interconnects in a microprocessor pose an important challenge to frequency scaling and further integration. Improved metallization and low-k materials are medium-term solutions. The long-term solution may lie in restructuring the microprocessor and system architecture to minimize communication costs between components on a die as well as in a large system.
High power dissipation is becoming a critical barrier to frequency and performance scaling of a microprocessor. The Depleted Substrate Transistor and advanced power management are promising ways to curtail the rapid growth of microprocessor power dissipation.
11. References
[1] G. Moore, Cramming more components onto integrated circuits, Electronics, Vol. 38, No. 8, April 19, 1965.
[2] R. Chau et al., A 50nm Depleted-Substrate CMOS transistor (DST), IEDM Tech. Digest, 2001.
[3] S. Towle et al., Bumpless Build-Up Layer Packaging, ASME Intl. Mech. Eng. Digest, 2001.
[4] International Technology Roadmap for Semiconductors, 2001 edition.
[5] Intel Microprocessor Reference Guide, April 2002. (http://www.intel.com/pressroom/kits/quickref.htm)
[6] Standard Performance Evaluation Corporation, April 2002. (http://www.spec.org)
Microprocessor clock speed increases with faster transistors and longer pipelines. Figure 4 shows that frequency scales with process improvements for several generations of Intel microprocessors with different microarchitectures. Holding process technology constant, as the number of pipeline stages increases from 5 to 10 to 20 from the original Intel Pentium through the Pentium 4, clock speeds increase significantly. Frequency increases have translated into higher application performance. Additional transistors are used to reduce the negative performance impact of long pipelines; one example is increasingly sophisticated branch predictors. Process improvements also increase clock speed in each processor family with a similar number of pipeline stages. Later designs in a processor family usually gain a smaller frequency advantage from process improvements because many micro-architectural and circuit tunings have already been realized in earlier designs. Some of the later microprocessors are also targeted at power-constrained environments that limit their frequency gain.
8. Cache Memory
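The clock speed discussion for Section 7 notes that deeper pipelines raise frequency while branch prediction limits the cost of long pipelines. A toy model of that trade-off is sketched below. It is an assumption-laden illustration, not the paper's data: total logic delay is split evenly across stages, each stage adds a fixed latch overhead, and a branch misprediction costs one cycle per pipeline stage. All parameter values and function names are hypothetical.

```python
# Toy pipeline-depth model (all parameters are hypothetical).
# cycle time = logic delay / stages + per-stage latch overhead
# CPI        = 1 + mispredictions per instruction * flush penalty (stages)

LOGIC_DELAY_NS = 10.0   # total logic delay of the unpipelined path
LATCH_OVERHEAD_NS = 0.1  # fixed overhead added by each pipeline stage


def cycle_time_ns(stages):
    return LOGIC_DELAY_NS / stages + LATCH_OVERHEAD_NS


def frequency_mhz(stages):
    return 1000.0 / cycle_time_ns(stages)


def time_per_instruction_ns(stages, mispredicts_per_instr=0.02):
    """Average time per instruction, charging `stages` cycles per flush."""
    cpi = 1.0 + mispredicts_per_instr * stages
    return cpi * cycle_time_ns(stages)


if __name__ == "__main__":
    for stages in (5, 10, 20):
        print(f"{stages:2d} stages: {frequency_mhz(stages):7.0f} MHz, "
              f"{time_per_instruction_ns(stages):.2f} ns per instruction")
    # A better branch predictor halves the misprediction rate.
    print(f"20 stages, better predictor: "
          f"{time_per_instruction_ns(20, 0.01):.2f} ns per instruction")
```

In this model, frequency rises faster than delivered performance as the pipeline deepens, and a lower misprediction rate recovers part of the difference, which is why extra transistors spent on branch predictors pay off for long pipelines.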
Microprocessor clock speeds and performance demands have increased over the years. Unfortunately, external memory bandwidth and latency have not kept pace. This widening processor-to-memory gap has led to larger caches and an increased number of cache levels between the processing core(s) and main memory. Figure 8 shows the size of the first and second level caches in the last 7 generations of Intel microprocessors. As frequency increases, first level cache size has begun to decrease to maintain low access latency, typically 1 to 2 clocks, as shown in Figure 9. As aggregate cache sizes increase in symmetric multiprocessor (SMP) systems, the ratio of conflict, capacity, and coherency misses, or cache-to-cache transfers, will change. Set-associative caches will see a reduction in conflict and capacity misses as cache sizes increase. However, these increases will have a smaller impact on coherency misses in large SMP systems. This motivates system designers to optimize for cache-to-cache transfers over memory-to-cache transfers. Two approaches
[Plot: transistor count (log scale) by year of introduction, 1971 to 2006]
[Plot: frequency and performance for 5-, 10-, and 20-stage pipelines]
Figure 4 As feature size gets smaller, longer pipelines enable frequency scaling, which is a key driver for performance [5, 6]
[Plot: frequency (MHz, log scale) by year of introduction, 1987 to 2003, for the 386, 486, Pentium, Pentium Pro, Pentium II, and Pentium III]
Figure 2 Frequency doubles and the number of gates per clock is reduced by 25% each generation [5, 6]
Figure 5 Cross-section of a raised-source/drain depleted substrate transistor (DST) on thin silicon body [2]
[Plot: relative delay versus feature size (250, 180, 130, 90, 65, 45 nm) for gate delay (fanout 4), local interconnect (M1, M2), and global interconnect with and without repeaters]
[Plot: die size and feature size by year of introduction, 1971 to 2001]
Figure 3 Feature size reduces by 30% every 2 to 3 years. Die size grows at 7% per year [5, 6]
[Plot: power (W) versus frequency (MHz) for Pentium III and Pentium 4 on 0.18 um and 0.13 um processes]
Figure 7 Processor power as a function of frequency for two process generations [5]
Figure 10 Memory and I/O bandwidth are crucial to sustain high processor performance
[Plot: L1 and L2 cache sizes (KB) by processor generation, 386 through Pentium 4 (0.13 um)]
Figure 8 Increasing on-chip cache sizes reduce the impact of widening processor-memory gap
Figure 9 Short L1 cache latency dictates small L1 cache size. L2 cache latency is less critical and allows larger L2 cache sizes.