
ESSCIRC 2002

Future Trend of Microprocessor Design (Invited Paper)


Robert Yung, Stefan Rusu, Ken Shoemaker
Intel Corporation, Santa Clara, California USA
yung@intel.com
1. Introduction

In the past thirty years, advances in computer technology have fundamentally changed the practice of business and personal computing. During these three decades, the wide acceptance of personal computers and the explosive growth in the performance, capability, and reliability of computers have fostered a new era of computing. This computing revolution has been driven primarily by rapid advances in computer architecture and semiconductor technology. In this paper, we examine key architectural and process technology trends that will affect microprocessor design in the next decade.

2. Microprocessor Evolution

In 1965 Gordon Moore observed that the total number of devices on a chip doubled every 12 months at no additional cost. He predicted that the trend would continue in the 1970s but would slow after 1975 [1]. Widely known as Moore's Law, these observations made the case for continued wafer and die size growth, defect density reduction, and increased transistor density as technology scaled and manufacturing matured. Figure 1 shows that transistor count in leading microprocessors has doubled in each technology node, approximately every 18 to 24 months. Factors that drove up transistor count are increasingly complex processing cores, integration of multiple levels of caches, and inclusion of system functions. Figure 2 shows that microprocessor frequency has doubled in each generation, the result of a 25% reduction in gates per clock, faster transistors, and advanced circuit design. Figure 3 shows that die size has increased at 7% per year while feature size has shrunk by 30% every 2 to 3 years. Together, these fuel the transistor density growth predicted by Moore's Law. Die size is limited by the reticle size, power dissipation, and yield. Leading microprocessors typically have large die sizes that are reduced with more advanced process technology to improve frequency and yield. Figure 4 shows that, as feature size gets smaller, longer pipelines enable frequency scaling, which has been a key driver for performance.

3. Transistor Scaling

Device physics poses several challenges to future scaling of the bulk MOSFET structure. Leakage through the gate oxide by direct band-to-band tunnelling limits physical oxide thickness scaling and will drive high-k gate material adoption. Sub-threshold leakage current will continue to increase. Researchers have demonstrated experimental devices with a gate length of only 15nm, which will enable chips with more than one billion transistors by the second half of this decade. While bulk CMOS transistor scaling is expected to continue, novel transistor structures are being explored. Figure 5 shows a cross-section of the Depleted Substrate Transistor (DST) [2], which has higher active drive and lower leakage current than bulk CMOS technology.

4. Interconnect Scaling

As advances in lithography decrease feature size and transistor delay, on-chip interconnect increasingly becomes the bottleneck in microprocessor designs. Narrower metal lines and spacing resulting from process scaling increase interconnect delay. Figure 6 shows that local interconnects scale proportionally to feature size. Global interconnects, which are dominated by RC delay, not only fail to keep up but are rapidly getting worse. Repeaters can be added to mitigate the delay but consume power and die area. Low-resistivity copper metallization and low-k materials such as fluorine-doped SiO2 (FSG) are employed to slow this degradation. In the long term, a radically different on-chip interconnect topology is needed to sustain the growth rates in transistor density and performance seen over the last three decades.

5. Packaging

The microprocessor package is changing from its traditional role of protective mechanical enclosure to a sophisticated thermal and electrical management platform. Recent advances in microprocessor packaging include the migration from wirebond to flip-chip and from ceramic to organic package substrates. Looking forward, emerging package technologies include the bumpless build-up layer (BBUL) packages, which are built around the silicon die [3]. The BBUL package provides the advantages of small electrical loop inductance and reduced thermo-mechanical stresses on the die interconnect system using low dielectric constant (low-k) materials. This packaging technology allows for high pin count and easy integration of multiple electronic and optical components.


6. Power Dissipation

Power dissipation increasingly limits microprocessor performance. The power budget for a microprocessor is becoming a design constraint, similar to the die area and target frequency. Supply voltage continues to scale down with every new process generation, but at a lower rate that does not keep up with the increase in clock frequency and transistor count. Figure 7 shows that power increases with frequency for two processor architectures and the last two process generations. Architectural techniques such as on-die power management, and circuit methods such as clock gating and domino-to-static conversion, are employed to control the power increase of future microprocessors.

7. Clock Speed

Microprocessor clock speed increases with faster transistors and longer pipelines. Figure 4 shows that frequency scales with process improvements for several generations of Intel microprocessors with different microarchitectures. Holding process technology constant, as the number of pipeline stages increases from 5 to 10 to 20 from the original Intel Pentium through the Pentium 4, clock speeds increase significantly. Frequency increases have translated into higher application performance. Additional transistors are used to reduce the negative performance impact of long pipelines; an example is increasingly sophisticated branch predictors. Process improvements also increase clock speed within each processor family with a similar number of pipe stages. Later designs in a processor family usually gain a smaller frequency advantage from process improvements because many microarchitectural and circuit tunings have already been realized in earlier designs. Some of the later microprocessors are also targeted at power-constrained environments, which limits their frequency gain.

8. Cache Memory

Microprocessor clock speeds and performance demands have increased over the years. Unfortunately, external memory bandwidth and latency have not kept pace. This widening processor-to-memory gap has led to larger caches and more cache levels between the processing core(s) and main memory. Figure 8 shows the size of the first and second level caches in the last 7 generations of Intel microprocessors. As frequency increases, first level cache size has begun to decrease to maintain low access latency, typically 1 to 2 clocks, as shown in Figure 9. As aggregate cache sizes increase in symmetric multiprocessor (SMP) systems, the ratio of conflict, capacity, and coherency misses (cache-to-cache transfers) will change. Set associative caches will see a reduction in conflict and capacity misses as cache size increases. However, these increases will have a smaller impact on coherency misses in large SMP systems. This motivates system designers to optimize for cache-to-cache transfers over memory-to-cache transfers. Two approaches to achieve this are hyper-threading, also known as multithreading, and chip multiprocessing (CMP).

9. Input/Output

Performance increases lead to higher demand for sustainable bandwidth between a microprocessor and external main memory and I/Os. This has led to faster and wider external buses, as shown in Figure 10. In the future, high-speed point-to-point interconnects will replace shared buses to satisfy increasing bandwidth requirements. Distributed interconnects will provide a more scalable path to increase external bandwidth when the practical limit of a pin is reached.

10. Conclusion

No fundamental barrier exists to extending Moore's Law into the next decade. As feature size continues to decrease by 30% in each process generation, the number of transistors in a high performance microprocessor doubles. This vast increase in the number of on-chip transistors allows integration of critical functions and greatly enhances microprocessor performance.

The processor-to-memory gap continues to widen as microprocessor speed increases faster than main memory speed. Integration of multiple levels of cache memory reduces the impact of slow memory. Larger cache sizes reduce conflict and capacity misses more than coherency misses. Integrating memory and I/O controllers in a microprocessor will reduce memory access latency and reduce bus bandwidth requirements.

Worsening global interconnects in a microprocessor pose an important challenge to frequency scaling and further integration. Improved metallization and low-k materials are medium-term solutions. A long-term solution may lie in restructuring the microprocessor and system architecture to minimize communication costs between components on a die as well as in a large system.

High power dissipation is becoming a critical barrier to frequency and performance scaling of a microprocessor. The Depleted Substrate Transistor and advanced power management are promising ways to curtail the rapid growth of microprocessor power dissipation.

11. References

[1] G. Moore, "Cramming more components onto integrated circuits," Electronics, Vol. 38, No. 8, April 19, 1965.
[2] R. Chau et al., "A 50nm Depleted-Substrate CMOS transistor (DST)," IEDM Tech. Digest, 2001.
[3] S. Towle et al., "Bumpless Build-Up Layer Packaging," ASME Intl. Mech. Eng. Digest, 2001.
[4] International Technology Roadmap for Semiconductors, 2001 edition.
[5] Intel Microprocessor Reference Guide, April 2002. (http://www.intel.com/pressroom/kits/quickref.htm)
[6] Standard Performance Evaluation Corporation, April 2002. (http://www.spec.org)


Figure 1 Transistor count doubles every 18-24 months [5, 6]
(Plot: transistor count on a logarithmic scale versus year of introduction, 1971 to 2006.)

Figure 4 As feature size gets smaller, longer pipelines enable frequency scaling, which is a key driver for performance [5, 6]
(Plot: frequency and relative integer performance on logarithmic scales versus feature size (nm), with separate curves for 5-, 10-, and 20-stage pipelines.)
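
As a rough check on the scaling figures quoted in Section 2 (a 30% feature size reduction per generation and 7% die size growth per year), the arithmetic can be sketched as follows. The sketch assumes that transistor density scales with the inverse square of feature size and that one process generation lasts two years; both are simplifying assumptions made here for illustration, and the function names are hypothetical.

```python
def density_gain(shrink: float = 0.30) -> float:
    """Transistor density gain per generation for a given linear shrink.

    Assumption: area per transistor scales with the square of feature size.
    """
    linear_scale = 1.0 - shrink          # e.g. 0.7x feature size
    return 1.0 / (linear_scale ** 2)     # e.g. 1 / 0.49


def die_growth(rate_per_year: float = 0.07, years: float = 2.0) -> float:
    """Die area growth over one generation at a compound yearly rate."""
    return (1.0 + rate_per_year) ** years


def transistor_gain(shrink: float = 0.30,
                    rate_per_year: float = 0.07,
                    years: float = 2.0) -> float:
    """Combined gain in transistors per die over one generation."""
    return density_gain(shrink) * die_growth(rate_per_year, years)


if __name__ == "__main__":
    print(f"density gain per generation: {density_gain():.2f}x")
    print(f"die growth per generation:   {die_growth():.2f}x")
    print(f"transistors per die:         {transistor_gain():.2f}x")
```

With a 30% shrink, the density gain alone is 1/0.49, or about 2x, which is consistent with the doubling per technology node described in the text.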

Figure 2 Frequency doubles and the number of gates per clock is reduced by 25% each generation [5, 6]
(Plot: frequency [MHz] and gate delays per clock on logarithmic scales versus year of introduction, 1987 to 2003, for the 386, 486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4.)
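
The relation between gates per clock and frequency in Figure 2 can be illustrated with a minimal sketch. It assumes the clock period is simply the number of gate delays per clock multiplied by the delay of one gate, and that gate delay shrinks to roughly 0.7x per generation along with feature size. The 0.7x value and the function name are assumptions for illustration, not figures taken from the paper.

```python
def frequency_gain(gates_per_clock_reduction: float = 0.25,
                   gate_delay_scale: float = 0.7) -> float:
    """Frequency gain per generation under a simple period model.

    Assumed model: period = gates_per_clock * gate_delay.
    gate_delay_scale is the assumed per-generation gate delay ratio.
    """
    period_scale = (1.0 - gates_per_clock_reduction) * gate_delay_scale
    return 1.0 / period_scale


if __name__ == "__main__":
    # Contribution of each factor on its own, then combined.
    print(f"fewer gates per clock only: {frequency_gain(0.25, 1.0):.2f}x")
    print(f"faster transistors only:    {frequency_gain(0.0, 0.7):.2f}x")
    print(f"combined:                   {frequency_gain():.2f}x")
```

Under these assumptions the combined gain is 1/(0.75 x 0.7), about 1.9x per generation, close to the doubling shown in the figure.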
Figure 5 Cross-section of a raised-source/drain depleted substrate transistor (DST) on thin silicon body [2]

Figure 3 Feature size shrinks by 30% every 2 to 3 years. Die size grows at 7% per year [5, 6]
(Plot: die size (mm2) and feature size (um) on logarithmic scales versus year of introduction, 1971 to 2006.)

Figure 6 On-chip interconnect trend [4]
(Plot: relative delay on a logarithmic scale versus feature size from 250 nm to 45 nm, for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters.)
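
The gap between global interconnect with and without repeaters in Figure 6 can be illustrated with a first-order model. The sketch below is not taken from the paper: it assumes a distributed RC wire whose delay grows with the square of its length (using the commonly quoted 0.38 coefficient for the 50% delay point), and it assumes each wire segment is driven by one buffer with a fixed delay. All names and numeric values are hypothetical.

```python
def wire_delay(r_per_mm: float, c_per_mm: float, length_mm: float,
               k: float = 0.38) -> float:
    """Delay in seconds of an unrepeated distributed RC wire.

    r_per_mm is in ohm/mm and c_per_mm in F/mm; the delay grows with
    length squared. k = 0.38 is an assumed 50%-delay coefficient.
    """
    return k * r_per_mm * c_per_mm * length_mm ** 2


def repeated_wire_delay(r_per_mm: float, c_per_mm: float, length_mm: float,
                        segments: int, buffer_delay: float,
                        k: float = 0.38) -> float:
    """Delay of the same wire split into equal buffered segments.

    Assumption: every segment is driven by one buffer of fixed delay,
    so total delay becomes roughly linear in length.
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")
    segment = wire_delay(r_per_mm, c_per_mm, length_mm / segments, k)
    return segments * (segment + buffer_delay)


if __name__ == "__main__":
    r, c = 100.0, 0.2e-12      # hypothetical ohm/mm and F/mm
    for length in (5.0, 10.0, 20.0):
        plain = wire_delay(r, c, length)
        buffered = repeated_wire_delay(r, c, length, segments=5,
                                       buffer_delay=20e-12)
        print(f"{length:5.1f} mm  unrepeated {plain * 1e9:.3f} ns  "
              f"5 segments {buffered * 1e9:.3f} ns")
```

The buffers in this model stand in for the repeaters mentioned in Section 4; their delay, power, and area are the price paid for breaking the quadratic dependence on wire length.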


Figure 7 Processor power as a function of frequency for two process generations [5]
(Plot: power [W], 0 to 80, versus frequency [MHz], 500 to 2500, for the Pentium III and Pentium 4 on 0.18um and 0.13um processes.)
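
Section 6 notes that supply voltage scales more slowly than frequency and transistor count grow. A minimal sketch of why that raises power, assuming the standard first-order CMOS switching power model (power proportional to switched capacitance times voltage squared times frequency) and ignoring leakage. The model, function names, and scale factors are assumptions introduced here, not data from Figure 7.

```python
import math


def relative_dynamic_power(cap_scale: float, voltage_scale: float,
                           freq_scale: float) -> float:
    """Power ratio between two designs under P ~ C * V^2 * f."""
    return cap_scale * voltage_scale ** 2 * freq_scale


def voltage_scale_for_constant_power(cap_scale: float,
                                     freq_scale: float) -> float:
    """Voltage ratio needed to keep switching power unchanged."""
    return math.sqrt(1.0 / (cap_scale * freq_scale))


if __name__ == "__main__":
    # Hypothetical generation: same switched capacitance, 2x frequency,
    # supply voltage reduced to 0.85x.
    ratio = relative_dynamic_power(1.0, 0.85, 2.0)
    needed = voltage_scale_for_constant_power(1.0, 2.0)
    print(f"power ratio with 0.85x voltage: {ratio:.3f}")
    print(f"voltage scale for flat power:   {needed:.3f}")
```

In this hypothetical case the voltage would have to fall to about 0.71x to hold power flat at twice the frequency, so a smaller voltage reduction leaves power rising from one generation to the next.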
Figure 10 Memory and I/O bandwidth are crucial to sustain high processor performance
(Plot: bus bandwidth (MB/sec), 0 to 4000, by processor generation from the 386 to the Pentium 4 (.18u).)

Figure 8 Increasing on-chip cache sizes reduce the impact of the widening processor-memory gap
(Plot: L1 cache size (K), 0 to 32, and L2 cache size (K), 0 to 512, by processor generation from the 386 to the Pentium 4 (.13u).)

Figure 9 Short L1 cache latency dictates small L1 cache size. L2 cache latency is less critical and allows larger L2 cache sizes.
(Plot: L1 and L2 cache latency (clocks), 0 to 10, by processor generation from the 486 to the Pentium 4.)
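
The trade-off in Figure 9 can be made concrete with the usual average memory access time calculation for a two-level cache hierarchy. The formula is the standard textbook one rather than something stated in the paper, and the latencies and miss rates below are hypothetical values chosen only to illustrate the shape of the calculation.

```python
def average_access_time(l1_latency: float, l1_miss_rate: float,
                        l2_latency: float, l2_miss_rate: float,
                        memory_latency: float) -> float:
    """Average access time in clocks for an L1 + L2 + memory hierarchy.

    l2_miss_rate is the local miss rate: the fraction of L2 accesses
    that go on to main memory.
    """
    for rate in (l1_miss_rate, l2_miss_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("miss rates must be between 0 and 1")
    l2_cost = l2_latency + l2_miss_rate * memory_latency
    return l1_latency + l1_miss_rate * l2_cost


if __name__ == "__main__":
    # Hypothetical numbers: a small, fast L1 versus a larger, slower L1
    # with a lower miss rate, in front of the same L2 and memory.
    small_fast = average_access_time(2, 0.05, 8, 0.20, 200)
    large_slow = average_access_time(4, 0.03, 8, 0.20, 200)
    print(f"small, fast L1: {small_fast:.2f} clocks")
    print(f"large, slow L1: {large_slow:.2f} clocks")
```

Because the L1 latency is paid on every access while the L2 and memory terms are weighted by miss rates, added L1 latency is hard to recover through a lower miss rate, which is the point the Figure 9 caption makes.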
