
EE 4504 Computer Organization
Section 3: Computer Memory

Overview

Historically, the limiting factor in a computer's performance has been memory access time
– Memory speed has been slow compared to the speed of the processor
– A process could be bottlenecked by the memory system's inability to "keep up" with the processor
Our goal in this section is to study the development of an effective memory organization that supports the processing power of the CPU
– General memory organization and performance
– "Internal" memory components and their use
– "External" memory components and their use
Reading: Text, chapters 4 and 5

Terminology

Capacity: the amount of information that can be contained in a memory unit -- usually in terms of words or bytes
Word: the natural unit of organization in the memory, typically the number of bits used to represent a number
Addressable unit: the fundamental data element size that can be addressed in the memory -- typically either the word size or individual bytes
Unit of transfer: the number of data elements transferred at a time -- usually bits in main memory and blocks in secondary memory
Transfer rate: rate at which data is transferred to/from the memory device
Access time:
– For RAM, the time to address the unit and perform the transfer
– For non-random access memory, the time to position the R/W head over the desired location
Memory cycle time: access time plus any other time required before a second access can be started
Access technique: how are memory contents accessed
– Random access:
» Each location has a unique physical address
» Locations can be accessed in any order and all access times are the same
» What we term "RAM" is more aptly called read/write memory since this access technique applies to ROMs as well
» Example: main memory

– Sequential access:
» Data does not have a unique address
» Must read all data items in sequence until the desired item is found
» Access times are highly variable
» Example: tape drive units
– Direct access:
» Data items have unique addresses
» Access is done using a combination of moving to a general memory "area" followed by a sequential access to reach the desired data item
» Example: disk drives
– Associative access:
» A variation of random access memory
» Data items are accessed based on their contents rather than their actual location
» Search all data items in parallel for a match to a given search pattern
» All memory locations searched in parallel without regard to the size of the memory -- extremely fast for large memory sizes
» Cost per bit is 5-10 times that of a "normal" RAM cell
» Example: some cache memory units

Memory Hierarchy

Major design objective of any memory system
– Provide adequate storage capacity
– At an acceptable level of performance
– At a reasonable cost
Four interrelated ways to meet this goal
– Use a hierarchy of storage devices
– Develop automatic space allocation methods for efficient use of the memory
– Through the use of virtual memory techniques, free the user from memory management tasks
– Design the memory and its related interconnection structure so that the processor can operate at or near its maximum rate
Basis of the memory hierarchy
– Registers internal to the CPU for temporary data storage (small in number but very fast)
– External storage for data and programs (relatively large and fast)
– External permanent storage (much larger and much slower)
Characteristics of the memory hierarchy
– Consists of distinct "levels" of memory components
– Each level characterized by its size, access time, and cost per bit
– Each increasing level in the hierarchy consists of modules of larger capacity, slower access time, and lower cost/bit
Goal of the memory hierarchy
– Try to match the processor speed with the rate of information transfer from the lowest element in the hierarchy

The memory hierarchy (fastest/smallest to slowest/largest):
– Registers in the CPU
– Cache
– Main memory
– Disk cache
– Magnetic disk
– Optical disk
– Magnetic tape

Typical memory parameters

Memory Type    Technology          Size         Access Time
Cache          Semiconductor RAM   128-512 KB   10 ns
Main memory    Semiconductor RAM   4-128 MB     50 ns
Hard disk      Magnetic disk       Gigabytes    10 ms, 10 MB/sec
CD-ROM         Optical disk        Gigabytes    300 ms, 600 KB/sec
Tape           Magnetic tape       100s of MB   Sec-min., 10 MB/min

The memory hierarchy works because of locality of reference
– Memory references made by the processor, for both instructions and data, tend to cluster together
» Instruction loops, subroutines
» Data arrays, tables
– Keep these clusters in high speed memory to reduce the average delay in accessing data
– Over time, the clusters being referenced will change -- memory management must deal with this
Example:
– Two-level memory system
– Level 1 access time of 1 us
– Level 2 access time of 10 us
– If H is the fraction of references found in level 1 (the hit ratio), average access time = H(1) + (1-H)(10) us

Figure 4.2 2-level memory performance
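To make the example concrete, here is a small sketch (in C) that evaluates this average for a few hit ratios; the 1 us and 10 us values are the example's assumed access times:

```c
#include <stdio.h>

/* Average access time for a two-level memory, using the example's assumed
 * access times: T1 = 1 us (level 1), T2 = 10 us (level 2).
 * H is the fraction of accesses satisfied by level 1 (the hit ratio). */
int main(void)
{
    const double T1 = 1.0, T2 = 10.0;           /* microseconds */
    const double ratios[] = { 0.50, 0.90, 0.95, 0.99 };

    for (int i = 0; i < 4; i++) {
        double H = ratios[i];
        double Ts = H * T1 + (1.0 - H) * T2;    /* average access time */
        printf("H = %.2f  ->  average access time = %.2f us\n", H, Ts);
    }
    return 0;
}
```

With H = 0.95 the average is 1.45 us, so the closer the hit ratio gets to 1, the closer the average access time gets to the level 1 time.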


Main Memory

Core memory
– Used in generations 2 and 3
– Magnetic cores (toroids) used to store a logical 0 or 1 state by magnetizing them in one of two directions (hysteresis loop)
» 1 core = 1 bit of storage
– Required addressing and sensing wires ran through each core
– Destructive readout
– Obsolete
» Replaced in the 1970s by semiconductor memory
Semiconductor memory
– Typically random access
– RAM: actually read-write memory
» Dynamic RAM
Storage cell is essentially a transistor acting as a capacitor
Capacitor charge dissipates over time, causing a 1 to flip to a zero
Cells must be refreshed periodically to avoid this
Very high packaging density
» Static RAM: basically an array of flip-flop storage cells
Uses 5-10x more transistors than a similar dynamic cell, so packaging density is 10x lower
Faster than a dynamic cell

– Read Only Memories (ROM)
» "Permanent" data storage
» ROMs
Data is "wired in" during fabrication at a chip manufacturer's plant
Purchased in lots of 10k or more
» PROMs
Programmable ROM
Data can be written once by the user employing a PROM programmer
Useful for small production runs
» EPROMs
Erasable PROM
Programming is similar to a PROM
Can be erased by exposing to UV light
» EEPROMs
Electrically erasable PROMs
Can be written to many times while remaining in a system
Does not have to be erased first
Program individual bytes
Writes require several hundred usec per byte
Used in systems for development, personalization, and other tasks requiring unique information to be stored
» Flash memory
Similar to EEPROM in using electrical erase
Fast erasures, block erasures
Higher density than EEPROM

– Organization
» Each memory chip contains a number of 1-
bit cells
1, 4, and 16 million cell chips are
common
» Cells can be arranged as a single bit column
(e.g., 4Mx1) or in multiple bits per address
location (e.g., 1Mx4)
» To reduce pin count, address lines can be
multiplexed with data and/or as high and
low halves
Trade off is in slower operation
» Typical control lines
W* (write), OE* (output enable) for
write and read operations
CS* (chip select) derived from external
address decoding logic
RAS*, CAS* (row and column address
selects) used when the address is applied to
the chip in 2 halves

Figure 4.8 256Kx8 memory from 256Kx1 chips
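As a rough illustration of the two-half addressing idea, here is a hedged sketch for an assumed 256K-location part with an 18-bit address split into 9-bit halves; the signal-driving functions are illustrative stand-ins, not a real device interface:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of presenting an 18-bit address (256K locations) to a DRAM as two
 * 9-bit halves: the row half is latched with RAS*, then the column half with
 * CAS*.  The "drive"/"assert" functions below only print what would happen
 * on the pins; they are stand-ins for the actual strobe signalling. */

#define HALF_BITS 9
#define HALF_MASK ((1u << HALF_BITS) - 1u)

static void drive_address_pins(unsigned half) { printf("address pins <= 0x%03X\n", half); }
static void assert_ras(void)                  { printf("RAS* asserted (row latched)\n"); }
static void assert_cas(void)                  { printf("CAS* asserted (column latched)\n"); }

static void present_address(uint32_t addr)
{
    unsigned row = (addr >> HALF_BITS) & HALF_MASK;   /* high half of the address */
    unsigned col = addr & HALF_MASK;                  /* low half of the address */

    drive_address_pins(row);
    assert_ras();
    drive_address_pins(col);
    assert_cas();
}

int main(void)
{
    present_address(0x2A5C7);   /* any 18-bit address */
    return 0;
}
```

The saved pins come at the cost of the two-step RAS*/CAS* sequence, which is the slower operation noted above.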

– Improvements to DRAM
» Basic DRAM design has not changed much
since its development in the 70s
» Cache was introduced to improve
performance
Limited to no gain in performance after
a certain amount of cache is
implemented
» Enhanced DRAM
Add fast 1-line SRAM cache to DRAM
chip
Consecutive reads to the same line are
from this cache and thus faster than the
DRAM itself
Tests indicate these chips can perform
as well as traditional DRAM-cache
combinations
» Cache DRAM
Use larger SRAM cache on the chip as
a true multi-line cache
Use it as a serial data stream buffer for
block data transfers

Figure 4.9 1Mx8 memory from 256Kx1 chips

Error Correction

Semiconductor memories are subject to


errors
– Hard (permanent) errors
» Environmental abuse
» Manufacturing defects
» Wear
– Soft (transient) errors
» Power supply problems
» Alpha particles
Problematic as feature sizes shrink
– Memory systems include logic to detect and/or
correct errors
» Width of memory word is increased
» Additional bits are parity bits
» Number of parity bits required depends on
the level of detection and correction needed

Figure 4.10 Basic error detection and correction circuitry

General error detection and correction
– A single error is a bit flip -- multiple bit flips can occur in a word
– 2^M valid data words
– 2^(M+K) codeword combinations in the memory
– Distribute the 2^M valid data words among the 2^(M+K) codeword combinations such that the "distance" between valid words is sufficient to distinguish the error
– Example (figure): two valid codewords separated by single-bit flips -- it takes 7 bit flips to map the upper valid codeword into the lower one, so up to 6 errors can be detected or up to 3 corrected
Single error detection and correction
– For each valid codeword, there will be 2^K - 1 invalid codewords
– 2^K - 1 must be large enough to identify which of the M+K bit positions is in error
– Therefore 2^K - 1 >= M + K
» 8-bit data, 4 check bits
» 32-bit data, 6 check bits
– Arrange bits as shown in Figure 4.12
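As a quick check of the 2^K - 1 >= M + K condition, a short sketch that finds the smallest K for a given data width M (nothing assumed beyond the inequality itself):

```c
#include <stdio.h>

/* Smallest number of check bits K such that 2^K - 1 >= M + K,
 * i.e. the syndrome can name any of the M+K positions (or "no error"). */
static int check_bits(int m)
{
    int k = 1;
    while ((1 << k) - 1 < m + k)
        k++;
    return k;
}

int main(void)
{
    printf("M = 8  -> K = %d\n", check_bits(8));    /* 4 */
    printf("M = 32 -> K = %d\n", check_bits(32));   /* 6 */
    printf("M = 64 -> K = %d\n", check_bits(64));   /* 7 */
    return 0;
}
```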
– Bit position n is checked by check bits Ci such that the sum of the subscripts, i, equals n (e.g., position 10, bit M6, is checked by bits C2 and C8)
– To detect errors, compare the check bits read from memory to those computed during the read operation (use XOR)
» If the result of the XOR is 0000, no error
» If non-zero, the numerical value of the result indicates the bit position in error
If the XOR result was 0110, bit position 6 (M3) is in error
– Double error detection can be added by adding another check bit that implements a parity check for the whole word of M+K bits
SEC and SEC-DED are generally enough protection in typical systems
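The check-bit scheme described above is the standard Hamming single-error-correcting code; the sketch below (for 8 data bits and 4 check bits) shows the encode step and the syndrome computation. The exact bit layout is an assumption for illustration and may differ from Figure 4.12:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of a single-error-correcting Hamming code for M = 8 data bits and
 * K = 4 check bits.  Positions are numbered 1..12; check bits occupy the
 * power-of-two positions 1, 2, 4, 8, and each check bit Ci gives even parity
 * over every position whose index has bit i set.  The layout is an
 * illustrative assumption, not necessarily the text's Figure 4.12. */

static uint16_t encode(uint8_t data)
{
    uint16_t code = 0;
    int d = 0;

    /* place the 8 data bits in the non-power-of-two positions 3,5,6,7,9..12 */
    for (int pos = 1; pos <= 12; pos++) {
        if (pos == 1 || pos == 2 || pos == 4 || pos == 8)
            continue;
        if ((data >> d++) & 1)
            code |= (uint16_t)(1u << pos);
    }
    /* set each check bit so parity over the positions it covers is even */
    for (int c = 1; c <= 8; c <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & c) && ((code >> pos) & 1))
                parity ^= 1;
        if (parity)
            code |= (uint16_t)(1u << c);
    }
    return code;
}

/* Syndrome: XOR of recomputed and stored check bits, which works out to the
 * XOR of the position numbers of all 1 bits.  0 means no error; a non-zero
 * value is the position of the single flipped bit. */
static int syndrome(uint16_t code)
{
    int s = 0;
    for (int pos = 1; pos <= 12; pos++)
        if ((code >> pos) & 1)
            s ^= pos;
    return s;
}

int main(void)
{
    uint16_t word = encode(0xA7);
    printf("syndrome, no error: %d\n", syndrome(word));            /* 0 */
    word ^= (uint16_t)(1u << 6);                                    /* flip position 6 */
    printf("syndrome, position 6 flipped: %d\n", syndrome(word));  /* 6 */
    return 0;
}
```

Adding one more parity bit over the whole word, as described above, turns this into a SEC-DED code.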
Cache Memory

Cache memory is a critical component of the memory hierarchy
– Compared to the size of main memory, cache is relatively small
– Operates at or near the speed of the processor
– Very expensive compared to main memory
– Cache contains copies of sections of main memory
Cache -- main memory interface
– Assume an access to main memory causes a block of K words to be transferred to the cache
– The block transferred from main memory is stored in the cache as a single unit called a slot, line, or page
– Once copied to the cache, individual words within a line can be accessed by the CPU
– Because of the high speeds involved with the cache, management of the data transfer and storage in the cache is done in hardware -- the O/S does not "know" about the cache
– If there are 2^n words in the main memory, then there will be M = 2^n / K blocks in the memory
– M will be much greater than the number of lines, C, in the cache
– Every line of data in the cache must be tagged in some way to identify which main memory block it holds
» The line of data and its tag are stored in the cache
– Factors in the cache design
» Mapping function between main memory and the cache
» Line replacement algorithm
» Write policy
» Block size
» Number and type of caches
Mapping functions -- since M >> C, how are blocks mapped to specific lines in the cache?
Direct mapping
– Each main memory block is assigned to a specific line in the cache:

i = j modulo C

where i is the cache line number assigned to main memory block j
– If M=64, C=4:
Line 0 can hold blocks 0, 4, 8, 12, ...
Line 1 can hold blocks 1, 5, 9, 13, ...
Line 2 can hold blocks 2, 6, 10, 14, ...
Line 3 can hold blocks 3, 7, 11, 15, ...

– Direct mapping cache treats a main memory
address as 3 distinct fields
» Tag identifier
» Line number identifier
» Word identifier (offset)
– Word identifier specifies the specific word (or
addressable unit) in a cache line that is to be
read
– Line identifier specifies the physical line in
cache that will hold the referenced address
– The tag is stored in the cache along with the
data words of the line
» For every memory reference that the CPU
makes, the specific line that would hold the
reference (if it has already been copied
into the cache) is determined
» The tag held in that line is checked to see if
the correct block is in the cache

Figure 4.18 Direct mapping cache organization

– Example:
Memory size of 1 MB (20 address bits), addressable to the individual byte
Cache size of 1K lines, each holding 8 bytes
Word id = 3 bits
Line id = 10 bits
Tag id = 7 bits
Where is the byte at main memory location $ABCDE stored?
$ABCDE = 1010101 1110011011 110
Cache line $39B, word offset $6, tag $55 (a short sketch below verifies this breakdown)
– Advantages of direct mapping
» Easy to implement
» Relatively inexpensive to implement
» Easy to determine where a main memory reference can be found in cache
– Disadvantage
» Each main memory block is mapped to a specific cache line
» Through locality of reference, it is possible to repeatedly reference blocks that map to the same line number
» These blocks will be constantly swapped in and out of cache, causing the hit ratio to be low
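A small sketch confirming the field breakdown in this example; the field widths (3, 10, and 7 bits) come from the example, and only the bit arithmetic is added:

```c
#include <stdio.h>

/* Field breakdown for the direct-mapped example: 20-bit addresses, 8-byte
 * lines (3 word-offset bits), 1K lines (10 line bits), 7 tag bits. */

#define WORD_BITS 3
#define LINE_BITS 10

int main(void)
{
    unsigned addr = 0xABCDE;                        /* $ABCDE */
    unsigned word = addr & ((1u << WORD_BITS) - 1);
    unsigned line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    unsigned tag  = addr >> (WORD_BITS + LINE_BITS);

    printf("tag = $%X, line = $%X, word offset = $%X\n", tag, line, word);
    /* prints: tag = $55, line = $39B, word offset = $6 */
    return 0;
}
```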

Associative mapping
– Let a block be stored in any cache line that is
not in use
» Overcomes direct mapping’s main
weakness
– Must examine each line in the cache to find the
right memory block
» Examine the line tag id for each line
» Slow process for large caches!
– Line numbers (ids) have no meaning in the
cache
» Parse the main memory address into 2
fields (tag and word offset) rather than 3 as
in direct mapping
– Implement cache in 2 parts
» The lines themselves in SRAM
» The tag storage in associative memory
– Perform an associative search over all tags to
find the desired line (if it's in the cache)

Figure 4.20 Associative cache organization

– Our memory example again ...
Word id = 3 bits
Tag id = 17 bits
Where is the byte at main memory location $ABCDE stored?
$ABCDE = 10101011110011011 110
Cache line unknown, word offset $6, tag $1579B
– Advantages
» Fast
» Flexible
– Disadvantage
» Implementation cost
» Example above has an 8 KB cache and requires 1024 x 17 = 17,408 bits of associative memory for the tags!
Set associative mapping
– Compromise between direct and fully associative mappings that builds on the strengths of both
– Divide the cache into a number of sets (v), each set holding a number of lines (k)
– A main memory block can be stored in any one of the k lines in a set such that

set number = j modulo v

– If a set can hold X lines, the cache is referred to as an X-way set associative cache
» Most cache systems today that use set associative mapping are 2- or 4-way set associative

– Our memory example again

Assume the 1024 lines are 4-way set associative

1024/4 = 256 sets

Word id = 3 bits
Set id = 8 bits
Tag id = 9 bits

Where is the byte at main memory location $ABCDE stored?

$ABCDE=101010111 10011011 110

Cache set $9B, word offset $6, tag $157
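A minimal sketch of the lookup this example implies, assuming a 4-way set associative cache with the field widths above; the structure and array names are illustrative, not a hardware design:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a 4-way set associative lookup using the example's field widths
 * (3 word-offset bits, 8 set bits, 9 tag bits).  The cache array, valid
 * bits, and names are illustrative assumptions, not the text's design. */

#define WAYS      4
#define SETS      256
#define WORD_BITS 3
#define SET_BITS  8

struct line { bool valid; uint32_t tag; uint8_t data[1 << WORD_BITS]; };
static struct line cache[SETS][WAYS];

static bool lookup(uint32_t addr, uint8_t *out)
{
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t set  = (addr >> WORD_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag  = addr >> (WORD_BITS + SET_BITS);

    for (int way = 0; way < WAYS; way++)           /* compare the tags in this set */
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *out = cache[set][way].data[word];     /* hit: deliver the byte */
            return true;
        }
    return false;                                  /* miss: block must be fetched */
}

int main(void)
{
    /* pretend the block containing $ABCDE was loaded earlier: set $9B, tag $157 */
    cache[0x9B][0].valid = true;
    cache[0x9B][0].tag   = 0x157;
    cache[0x9B][0].data[6] = 42;

    uint8_t b = 0;
    bool hit = lookup(0xABCDE, &b);
    printf("hit = %d, data = %d\n", hit, b);
    return 0;
}
```

Only the k = 4 tags in the selected set are compared, which is why set associative mapping avoids the full associative search over all lines.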

Figure 4.22 Set associative cache organization (caption in the text is misleading -- not 2-way!)

Line replacement algorithms
– When an associative cache or a set associative cache set is full, which line should be replaced by the new line that is to be read from memory?
» Not a problem for direct mapping since each block has a predetermined line it must use
– Least recently used (a small sketch follows below)
– First in first out
– Least frequently used
– Random
Write policy
– When a line is to be replaced, must update the original copy of the line in main memory if any addressable unit in the line has been changed
– Write through
» Anytime a word in cache is changed, it is also changed in main memory
» Both copies always agree
» Generates lots of memory writes to main memory
– Write back
» During a write, only change the contents of the cache
» Update main memory only when the cache line is to be replaced
» Causes "cache coherency" problems -- different values for the contents of an address are in the cache and the main memory
» Complex circuitry to avoid this problem
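For the least recently used policy listed above, a minimal sketch of victim selection within one set; the software counter is an illustrative stand-in for the few status bits real caches keep per line:

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal sketch of least-recently-used (LRU) victim selection within one
 * set of a 4-way set associative cache.  Each line records when it was last
 * used; an invalid line is taken first, otherwise the oldest line is evicted. */

#define WAYS 4

struct line { int valid; uint32_t tag; uint64_t last_used; };

static uint64_t now;                     /* advances on every cache access */

static void touch(struct line *l)        /* call on every hit */
{
    l->last_used = ++now;
}

static int pick_victim(struct line set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                    /* empty way: use it first */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                  /* older than current candidate */
    }
    return victim;
}

int main(void)
{
    struct line set[WAYS] = {0};
    for (int w = 0; w < WAYS; w++) {     /* fill the set: ways 0..3 */
        set[w].valid = 1;
        set[w].tag = 0x100 + w;
        touch(&set[w]);
    }
    touch(&set[0]);                      /* way 0 becomes most recent */
    printf("victim = way %d\n", pick_victim(set));   /* way 1 is now LRU */
    return 0;
}
```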
Block / line sizes
– How much data should be transferred from main memory to the cache in a single memory reference?
– Complex relationship between block size and hit ratio as well as the operation of the system bus itself
– As block size increases,
» Locality of reference predicts that the additional information transferred will likely be used and thus increases the hit ratio (good)
» Number of blocks in cache goes down, limiting the total number of blocks in the cache (bad)
» As the block size gets big, the probability of referencing all the data in it goes down (hit ratio goes down) (bad)
» Size of 4-8 addressable units seems about right for current systems
Number of caches
– Single vs. 2-level
» Modern CPU chips have on-board cache (L1)
80486 -- 8 KB
Pentium -- 16 KB
PowerPC -- up to 64 KB
» L1 provides best performance gains
» Secondary, off-chip cache (L2) provides higher speed access to main memory
» L2 is generally 512 KB or less -- more than this is not cost-effective

– Unified vs. split cache
» Unified cache stores data and instructions in 1 cache
Only 1 cache to design and operate
Cache is flexible and can balance "allocation" of space to instructions or data to best fit the execution of the program -- higher hit ratio
» Split cache uses 2 caches -- 1 for instructions and 1 for data
Must build and manage 2 caches
Static allocation of cache sizes
Can outperform a unified cache in systems that support parallel execution and pipelining (reduces cache contention)

Pentium and PowerPC Cache

In the last 10 years, we have seen the introduction of cache into microprocessor chips
On-board cache (L1) is supplemented with external fast SRAM cache (L2)
– While off-chip, it can provide zero wait state performance compared to a relatively slow main memory access
Intel family
– '386 had no internal cache
– '486 had an 8 KB unified cache
– Pentium has a 16 KB split cache: 8 KB data and 8 KB instruction
– Pentium supports a 256 or 512 KB external L2 cache that is 2-way set associative
PowerPC
– 601 had a single 32 KB cache
– 603/604/620 have split caches of size 16/32/64 KB
MESI Cache Coherency Protocol

The MESI protocol provides cache coherency in both the Pentium and the PowerPC
Stands for
– Modified
– Exclusive
– Shared
– Invalid
Implemented with an additional 2-bit field for each cache line
Becomes interesting in the interactions of the L1 and the L2 caches -- each tracks the "local" MESI status as a line moves from main memory to L2 and then to L1
The PowerPC adds an additional state "A" for allocated
– A line is marked as A while its data is being swapped out

Operation of 2-Level Memory

Recall the goal of the memory system:
– Provide an average access time to all memory locations that is approximately the same as that of the fastest memory component
– Provide a system memory with an average cost approximately equal to the cost/bit of the cheapest memory component
A simple model:

Ts = H1 x T1 + H2 x (T1 + T2 + Tb21)

where H2 = 1 - H1, T1 and T2 are the access times to levels 1 and 2, and Tb21 is the block transfer time from level 2 to level 1
Can be generalized to 3 or more levels
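A short sketch evaluating this model; the access and transfer times below are made-up example values, not figures from the text:

```c
#include <stdio.h>

/* Two-level average access time: Ts = H1*T1 + H2*(T1 + T2 + Tb21),
 * where H2 = 1 - H1.  The times below are illustrative assumptions
 * (in nanoseconds), not figures from the text. */
int main(void)
{
    const double T1 = 10.0;      /* level 1 (cache) access time */
    const double T2 = 60.0;      /* level 2 (main memory) access time */
    const double Tb21 = 200.0;   /* block transfer time, level 2 -> level 1 */

    for (double H1 = 0.80; H1 <= 0.99; H1 += 0.05) {
        double Ts = H1 * T1 + (1.0 - H1) * (T1 + T2 + Tb21);
        printf("H1 = %.2f  ->  Ts = %.1f ns\n", H1, Ts);
    }
    return 0;
}
```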
External Memory

Magnetic disks
– The disk is a metal or plastic platter coated with
magnetizable material
– Data is recorded onto and later read from the
disk using a conducting coil, the head
– Data is organized into concentric rings, called
tracks, on the platter
– Tracks are separated by gaps
– Disk rotates at a constant speed -- constant
angular velocity
» Number of data bits per track is a constant
» Data density is higher on the inner tracks
– Logical data transfer unit is the sector
» Sectors are identified on each track during
the formatting process

Figures 5.1 and 5.2 Disk organization


– Disk characteristics
» Single vs. multiple platters per drive (each platter has its own read/write head)
» Fixed vs. movable head
Fixed head has a head per track
Movable head uses one head per platter
» Removable vs. nonremovable platters
Removable platter can be removed from the disk drive for storage or transfer to another machine
» Data accessing times
Seek time -- position the head over the correct track
Rotational latency -- wait for the desired sector to come under the head
Access time -- seek time plus rotational latency
Block transfer time -- time to read the block (sector) off of the disk and transfer it to main memory
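A small sketch that puts these timing components together, taking the average rotational latency as half a revolution; the RPM, seek time, sector size, and transfer rate are assumed example values:

```c
#include <stdio.h>

/* Rough disk access-time model: access time = seek time + rotational latency,
 * then add the time to transfer one sector.  All numbers below are assumed
 * example values, not specifications from the text. */
int main(void)
{
    const double seek_ms      = 10.0;                 /* average seek time */
    const double rpm          = 7200.0;
    const double sector_bytes = 4096.0;
    const double transfer_MBs = 10.0;                 /* sustained transfer rate */

    double rev_ms      = 60000.0 / rpm;               /* one revolution, in ms */
    double latency_ms  = rev_ms / 2.0;                /* average: half a revolution */
    double transfer_ms = sector_bytes / (transfer_MBs * 1e6) * 1000.0;

    printf("rotational latency: %.2f ms\n", latency_ms);
    printf("total for one sector: %.2f ms\n", seek_ms + latency_ms + transfer_ms);
    return 0;
}
```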
RAID Technology
– Disk drive performance has not kept pace with improvements in other parts of the system
– Limited in many cases by the electro-mechanical transport means
– Capacity of a high performance disk drive can be duplicated by operating many (much cheaper) disks in parallel with simultaneous access
– Data is distributed across all disks
– With many parallel disks operating as if they were a single unit, redundancy techniques can be used to guard against data loss in the unit (due to the aggregate failure rate being higher)
– "RAID" developed at Berkeley -- Redundant Array of Independent Disks
» Six levels: 0 -- 5
– RAID 0
» No redundancy techniques are used
» Data is distributed over all disks in the array
» Data is divided into strips for actual storage
Similar in operation to interleaved memory data storage
» Can be used to support high data transfer rates by having the block transfer size be in multiples of the strip
» Can support low response time by having the block transfer size equal a strip -- supports multiple strip transfers in parallel
– RAID 1
» All disks are mirrored -- duplicated
Data is stored on a disk and its mirror
Read from either the disk or its mirror
Write must be done to both the disk and mirror
» Fault recovery is easy -- use the data on the mirror
» System is expensive!
– RAID 2
» All disks are used for every access -- disks are synchronized together
» Data strips are small (byte)
» Error correcting code computed across all disks and stored on additional disks
» Uses fewer disks than RAID 1 but still expensive
Number of additional disks is proportional to the log of the number of data disks
– RAID 3
» Like RAID 2, however only a single redundant disk is used -- the parity drive
» Parity bit is computed for the set of individual bits in the same position on all disks (illustrated in the sketch after RAID 5 below)
» If a drive fails, parity information on the redundant disk can be used to calculate the data from the failed disk "on the fly"
– RAID 4
» Access is to individual strips rather than to all disks at once (RAID 3)
» Bit-by-bit parity is calculated across corresponding strips on each disk
» Parity bits stored in the redundant disk
» Write penalty
For every write to a strip, the parity strip must also be recalculated and written
Thus 1 logical write equals 2 physical disk accesses
The parity drive is always written to and can thus be a bottleneck
– RAID 5
» Parity information is distributed on the data disks in a round-robin scheme
» No parity disk needed
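To illustrate the parity idea behind RAID 3-5, a small sketch that builds a parity strip and rebuilds a lost strip; the strip size and data values are illustrative, and the small-write parity-update identity noted in the comment is the standard formulation rather than something stated on these slides:

```c
#include <stdio.h>
#include <stdint.h>

/* Parity across corresponding bytes of data strips: the parity strip is the
 * XOR of the strips on the data disks.  If one data disk fails, its strip is
 * the XOR of the surviving strips and the parity strip.  On a small write,
 * the new parity can be computed as old_parity ^ old_data ^ new_data. */

#define DISKS 4          /* data disks (illustrative) */
#define STRIP 8          /* bytes per strip (illustrative) */

static void xor_into(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < STRIP; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    uint8_t data[DISKS][STRIP] = {
        {1, 2, 3, 4, 5, 6, 7, 8},
        {9, 8, 7, 6, 5, 4, 3, 2},
        {0, 1, 0, 1, 0, 1, 0, 1},
        {7, 7, 7, 7, 7, 7, 7, 7},
    };
    uint8_t parity[STRIP] = {0};

    for (int d = 0; d < DISKS; d++)        /* build the parity strip */
        xor_into(parity, data[d]);

    /* "lose" disk 2 and rebuild its strip from the others plus parity */
    uint8_t rebuilt[STRIP] = {0};
    xor_into(rebuilt, parity);
    for (int d = 0; d < DISKS; d++)
        if (d != 2)
            xor_into(rebuilt, data[d]);

    printf("rebuilt[1] = %d (expected %d)\n", rebuilt[1], data[2][1]);
    return 0;
}
```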
Optical disks
– Advent of CDs in the early 1980s revolutionized the audio and computer industries
– Basic operation
» CD is operated using constant linear velocity
» Essentially one long track spiraled onto the disk
» Track passes under the disk's head at a constant rate -- requires the disk to change rotational speed based on what part of the track you are on
» To write to the disk, a laser is used to burn pits into the track -- write once!
» During reads, a low power laser illuminates the track and its pits
In the track, pits reflect light differently than no pits, thus allowing you to store 1s and 0s

» Master disk is made using the laser
Master is used to "press" copies in a mass production mechanical style
Cheaper than production of information on magnetic disks
– Storage capacity roughly 775 MB, or about 550 3.5" floppy disks
– Transfer rate standard is 176 KB/second
– Only economical for production of large quantities of disks
– Disks are removable and thus archival
– Slower than magnetic disks
– WORMs -- Write Once, Read Many disks
» User can produce CD ROMs in limited quantities
» Specially prepared disk is written to using a medium power laser
» Can be read many times just like a normal CD ROM
» Permits archival storage of user information, distribution of large amounts of information by a user
– Erasable optical disk
» Combines laser and magnetic technology to permit information storage
» Laser heats an area whose magnetic field orientation can then be changed to alter the stored information
» The state of the field can be detected using polarized light during reads

Magnetic Tape
– The first kind of secondary memory
– Still widely used
» Very cheap
» Very slow
– Sequential access
» Data is organized as records with physical air gaps between records
» One word is stored across the width of the tape and read using multiple read/write heads

Summary

Goal of the memory hierarchy is to produce a memory system that has an average access time of roughly the L1 memory and an average cost per bit roughly equal to the lowest level in the hierarchy
Range of performance spans 10 orders of magnitude!
Components / levels discussed
– Cache
– Main memory
– Secondary memory
