Cache Memory

EC3003 - Sistem Komputer
Bagian 10
Cache Memory
Departemen Teknik Elektro Institut Teknologi Bandung

2005
Pembahasan
Organisasi cache memory

Direct mapped cache
Set associative cache
Pengaruh cache pada kinerja komputer
Cache Memory 10-2

Cache Memory
Cache memory adalah memori berbasis SRAM berukuran kecil dan

berkecepatan tinggi yang dikendalikan secara otomatis oleh
hardware.
Menyimpan suatu blok dari main memory yang sering diakses oleh CPU
Dalam operasinya, pertama-tama CPU akan mencari data di L1,
kemudian di L2, dan main memory.
CPU
register file
L1
ALU
cache
cache bus system bus memory bus
I/O main
L2 cache bus interface
bridge memory
Cache Memory 10-3

L1 Cache
register dalam CPU memiliki
Satuan transfer antara tempat untuk menyimpan empat
register dan cache dalam word berukuran 4-byte.
blok berukuran 4-byte
baris 0 L1 cache memiliki tempat untuk
menyimpan dua blok berukuran
baris 1 4-word
Satuan transfer antara
cache dan main memory
dalam blok berukuran
4-word blok 10 abcd
...
main memory memiliki tempat
blok 21 pqrs
untuk meyimpan blok-blok
... berukuran 4-word
blok 30 wxyz
...
Cache Memory 10-4
Organisasi Cache Memory
1 valid bit t tag bit B = 2b byte
Cache adalah array per baris per baris per blok cache
dari kumpulan set.
Setiap set berisi valid tag 0 1 ••• B–1
satu atau lebih E baris
set 0: •••
baris. per set
Setiap baris valid tag 0 1 ••• B–1
menyimpan satu
blok data.
valid tag 0 1 ••• B–1
set 1: •••
S= 2s set valid tag 0 1 ••• B–1
•••
valid tag 0 1 ••• B–1
set S-1: •••

Ukuran cache :
C = B x E x S byte data valid tag 0 1 ••• B–1
Cache Memory 10-5

Pengalamatan Cache
Alamat A:
t bit s bit b bit
m-1 0
v tag 0 1 • • • B–1
set 0: •••
v tag 0 1 • • • B–1 <tag> <set index> <block offset>
v tag 0 1 • • • B–1
set 1: •••
v tag 0 1 • • • B–1
Data word pada alamat A berada
••• dalam cache jika bit <tag> dan <set
v tag 0 1 • • • B–1 index> cocok dan berada dalam baris
set S-1: ••• yang valid.
v tag 0 1 • • • B–1
Isi word dimulai dari byte ofset <block
offset> pada awal blok
Cache Memory 10-6

Direct-Mapped Cache
Cache yang sederhana

Setiap set hanya memiliki satu baris (line)
set 0: valid tag blok cache E=1 baris per set
set 1: valid tag blok cache
•••
set S-1: valid tag blok cache
Cache Memory 10-7

Mengakses Direct-Mapped Cache
Memilih set
Menggunakan bit set index untuk memilih set yang digunakan
set 0: valid tag blok cache
set dipilih valid tag blok cache

set 1:
•••
t bit s bit b bit
set S-1: valid tag blok cache
00 001
m-1
tag set index block offset0
Cache Memory 10-8

Mengakses Direct-Mapped Cache
Pencocokan baris dan pemilihan word

Pencocokan baris : mencari baris valid dalam set yang dipilih
dengan mencocokan tag
Pemilihan word : Selanjutnya mengekstraksi word
=1? (1) Bit valid harus di-set
0 1 2 3 4 5 6 7
Set dipilih (i): 1 0110 w0 w1 w2 w3
(2) Bit tag pada cache

harus cocok dengan = ? (3) Jika (1) dan (2), maka
bit tag pada alamat cache hit, dan block
offset memilih posisi
t bit s bit b bit awal byte
0110 i 100
m-1
Cache Memory 10-9

Simulasi Direct-Mapped Cache
M=16 byte alamat, B=2 byte/blok,
t=1 s=2 b=1 S=4 set, E=1 entri/set
x xx x
Penelusuran alamat (baca):
0 [00002], 1 [00012], 13 [11012], 8 [10002], 0 [00002]
0 [00002] (miss) 13 [11012] (miss)

v tag data v tag data
11 0 M[0-1]
m[1] m[0] 1
1 0 M[0-1]
m[1] m[0]
(1) (3) 1
1 1 M[12-13]
m[13] m[12]
8 [10002] (miss) 0 [00002] (miss)

v tag data v tag data
11 1 M[8-9]
m[9] m[8] 11 0 M[0-1]
m[1] m[0]
(4) 1 1 M[12-13]
(5) 11 1
1 M[12-13]
m[13] m[12]
Cache Memory 10-10

Bit Tengah Sebagai Indeks
4-baris cache Bit atas Bit tengah
00 0000 0000
01 0001 0001
10 0010 0010
11 0011 0011
0100 0100
Bit indeks orde tinggi 0101 0101
0110 0110
Baris memori yang bersebelahan
akan dipetakan pada lokasi cache 0111 0111
sama 1000 1000
Spatial locality yang buruk 1001 1001
Bit indeks orde tengah 1010 1010
Baris memori yang berurutan 1011 1011
dipetakan pada baris cache 1100 1100
berbeda 1101 1101
Dapat menyimpan urutan byte 1110 1110
pada cache dalam satu waktu 1111 1111
Cache Memory 10-11
Set Associative Cache
Setiap set memiliki lebih dari satu baris
valid tag blok cache

set 0: E=2 baris per set

set 1:
•••
set S-1:
Cache Memory 10-12

Mengakses Set Associative Cache
Memilih set
Serupa dengan direct-mapped cache

set 0:
Set dipilih valid tag blok cache

set 1:
•••
t bit s bit b bit set S-1:
00 001
m-1
Cache Memory 10-13

Mengakses Set Associative Cache
Pencocokan baris dan pemilihan word

Harus membandingkan setiap tag pada baris yang valid dalam
set yang dipilih
=1? (1) Bit valid harus di-set.
0 1 2 3 4 5 6 7
1 1001
Set dipilih (i):
1 0110 w0 w1 w2 w3
(2) Bit tag pada salah satu (3) Jika (1) dan (2), maka
=? cache hit, dan block
baris cache harus cocok
offset memilih posisi
dengan bit tag pada alamat
awal byte
t bit s bit b bit
0110 i 100
m-1
Cache Memory 10-14

Multi-Level Cache
Pada cache, data dan instruksi dapat dipisah atau diletakkan dalam
tempat yang sama
Reg L1
Prosesor Unified
d-cache Unified
L2
L2 Memori
Memori disk
disk
L1 Cache
Cache
i-cache
Ukuran : 200 B 8-64 KB 1-4MB SRAM 128 MB DRAM 30 GB

Kecepatan : 3 ns 3 ns 6 ns 60 ns 8 ms
$/Mbyte: $100/MB $1.50/MB $0.05/MB
Baris: 8B 32 B 32 B 8 KB
Bertambah besar, lambat dan murah
Cache Memory 10-15

Hirarki Cache Intel Pentium
L1 Data
1 cycle latency
16 KB
4-way assoc L2
Reg. L2Unified
Unified
Write-through 128KB--2
128KB--2MB MB Main
32B lines 4-way assoc Main
4-way assoc Memory
Write-back Memory
Write-back Hingga
Write Hingga4GB
4GB
L1 Instruction Writeallocate
allocate
32B lines
32B lines
16 KB, 4-way
32B lines
Chip
ChipProsesor
Prosesor
Cache Memory 10-16

Metrik Kinerja Cache
Miss Rate
Persentase referensi memori yang tidak ditemukan dalam cache
(miss/referensi).
Umumnya 3-10% untuk L1, < 1% untuk L2.
Hit Time
Waktu untuk mengirimkan data dari cache ke prosesor
(termasuk waktu untuk menentukan apakah data tersebut
terdapat dalam cache).
Umumnya 1 siklus clock untuk L1, 3-8 siklus clock untuk L2.
Miss Penalty
Waktu tambahan yang diperlukan karena terjadi miss
Umumnya 25-100 siklus untuk main memory.
Cache Memory 10-17

Menulis Kode yg Cache Friendly
Kode yang baik :
Melakukan referensi berulang-ulang terhadap suatu variabel
(temporal locality)
Pola referensi stride-1 (spatial locality)
Contoh : cold cache, 4-byte words, 4-word cache blocks
int sumarrayrows(int a[M][N]) int sumarraycols(int a[M][N])

{ {
int i, j, sum = 0; int i, j, sum = 0;
for (i = 0; i < M; i++) for (j = 0; j < N; j++)

for (j = 0; j < N; j++) for (i = 0; i < M; i++)
sum += a[i][j]; sum += a[i][j];
return sum; return sum;
} }
Miss rate = 1/4 = 25% Miss rate = 100%
Cache Memory 10-18

Gunung Memori
Membaca throughput (membaca bandwidth)

Banyaknya byte yang terbaca dari memori setiap
detik (MB/detik)
Gunung memori
Ukuran throughput sebagai fungsi dari spatial locality
dan temporal locality.
Cara untuk menentukan kinerja sistem memori.
Cache Memory 10-19

Fungsi Tes Gunung Memori
/* The test function */
void test(int elems, int stride) {
int i, result = 0;
volatile int sink;
for (i = 0; i < elems; i += stride)

result += data[i];
sink = result; /* So compiler doesn't optimize away the loop */
}
/* Run test(elems, stride) and return read throughput (MB/s) */

double run(int size, int stride, double Mhz)
{
double cycles;
int elems = size / sizeof(int);
test(elems, stride); /* warm up the cache */

cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */
return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
Cache Memory 10-20

Rutin Utama Gunung Memori
/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23) /* ... up to 8 MB */
#define MAXSTRIDE 16 /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)
int data[MAXELEMS]; /* The array we'll be traversing */
int main()
{
int size; /* Working set size (in bytes) */
int stride; /* Stride (in array elements) */
double Mhz; /* Clock frequency */
init_data(data, MAXELEMS); /* Initialize each element in data to 1 */

Mhz = mhz(0); /* Estimate the clock frequency */
for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
for (stride = 1; stride <= MAXSTRIDE; stride++)
printf("%.1f\t", run(size, stride, Mhz));
printf("\n");
}
exit(0);
}
Cache Memory 10-21
Gunung Memori
Pentium III Xeon

550 MHz
1200 16 KB on-chip L1 d-cache
16 KB on-chip L1 i-cache
read throughput (MB/s)
1000 512 KB off-chip unified

L1
L2 cache
800
600
400 Punggung gunung

xe
memperlihatkan
Kemiringan L2 Temporal Locality
200
untuk
Spatial
0
Locality
s1
2k
s3
s5
8k
mem
s7
32k
s9
128k
working set size

s11
512k
stride (words)
s13
(bytes)
2m
s15
8m
Cache Memory 10-22

Punggung Gunung - Temporal
Potongan gunung memori dengan stride=1
Memperlihatkan throughput dari cache dan memori berbeda.
1200
main memory L2 cache L1 cache
region region region
1000
read througput (MB/s)
800
600
400
200
0
8m
4m
2m
1k
1024k
512k
256k
128k
64k
32k
16k
8k
4k
2k
working set size (bytes)
Cache Memory 10-23

Kemiringan – Spatial Locality
Potongan pada gunung memori dengan ukuran 256 KB.
Memperlihatkan ukuran blok cache
800
700
600
read throughput (MB/s)
500
one access per cache line
400
300
200
100
0
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16
stride (words)
Cache Memory 10-24

Contoh Perkalian Matriks
Pengaruh utama cache yang penting :
Total ukuran cache
Memperlihatkan temporal locality, tetap mempertahankan working set tetap
kecil (contoh : dengan menggunakan blocking)
Ukuran blok
Memperlihatkan spatial locality /*
/* ijk
ijk */
*/
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++) {{
for
for (j=0;
(j=0; j<n;
j<n; j++)
j++) {{
sum
sum == 0.0;
0.0;
for
for (k=0;
(k=0; k<n;
k<n; k++)
k++)
sum
sum +=+= a[i][k]
a[i][k] ** b[k][j];
b[k][j];
Deskripsi : c[i][j]
c[i][j] == sum;
sum;
Perkalian matriks NxN }}
Total operasi O(N3) }}
Akses
N pembacaan untuk setiap elemen sumber
N nilai dijumlahkan untuk setiap tujuan
Dapat disimpan di register
Cache Memory 10-25

Analisis Miss Rate
Analisis miss rate pada perkalian matriks

Asumsi :
Ukuran baris = 32B (cukup besar untuk 4 buah 64-bit word)
Dimensi matriks (N) sangat besar
Aproksimasi 1/N sama dengan 0.0
Cache tidak terlalu besar untuk menyimpan beberapa baris.
Metoda analisis :
Melihat pola akses pada loop bagian dalam.
k j j
i k i
A B C
Cache Memory 10-26

Layout Array C dalam Memori
Array C dialokasikan dalam urutan row-major

Setiap baris (row) terletak dalam memori yang berurutan
Berpindah antar kolom dalam satu baris :
for (i = 0; i < N; i++)
sum += a[0][i];
Mengakses elemen yang berurutan
Jika ukuran blok (B) > 4 bytes, eksploit spatial locality
miss rate = 4 bytes / B
Berpindah antar baris dalam satu kolom :
for (i = 0; i < n; i++)
sum += a[i][0];
Mengakses elemen yang jauh
Tidak terjadi spatial locality!
miss rate = 1 (i.e. 100%)
Cache Memory 10-27

Perkalian Matriks ijk
/*
/* ijk
ijk */
*/ Loop bagian dalam :
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++) {{
for (*,j)
for (j=0;
(j=0; j<n;
j<n; j++)
j++) {{
sum (i,j)
sum == 0.0;
0.0; (i,*)
for
for (k=0;
(k=0; k<n;
k<n; k++)
k++)
sum A B C
sum +=+= a[i][k]
a[i][k] ** b[k][j];
b[k][j];
c[i][j]
c[i][j] == sum;
sum;
}}
}} Baris Kolom Tetap
Miss pada setiap iterasi loop bagian dalam :

A B C
0.25 1.0 0.0
Cache Memory 10-28

Perkalian Matriks jik
/*
/* jik
jik */
*/ Loop bagian dalam :
for
for (j=0;
(j=0; j<n;
j<n; j++)
j++) {{
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++) {{ (*,j)
sum
sum == 0.0;
0.0; (i,j)
for (i,*)
for (k=0;
(k=0; k<n;
k<n; k++)
k++)
sum A B C
sum +=+= a[i][k]
a[i][k] ** b[k][j];
b[k][j];
c[i][j]
c[i][j] == sum
sum
}}
}}
Baris Kolom Tetap

A B C
0.25 1.0 0.0
Cache Memory 10-29

Perkalian Matriks kij
/*
/* kij
kij */*/ Loop bagian dalam :
for
for (k=0;
(k=0; k<n;
k<n; k++)
k++) {{
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++) {{ (i,k) (k,*)
rr == a[i][k]; (i,*)
a[i][k];
for
for (j=0;
(j=0; j<n;
j<n; j++)
j++) A B C
c[i][j]
c[i][j] +=+= rr ** b[k][j];
b[k][j];
}}
}}
Tetap Baris Kolom

A B C
0.0 0.25 0.25
Cache Memory 10-30

Perkalian Matriks ikj
/*
/* ikj
ikj */*/ Loop bagian dalam :
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++) {{
for
for (k=0;
(k=0; k<n;
k<n; k++)
k++) {{ (i,k) (k,*)
rr == a[i][k]; (i,*)
a[i][k];
for
for (j=0;
(j=0; j<n;
j<n; j++)
j++) A B C
c[i][j]
c[i][j] +=+= rr ** b[k][j];
b[k][j];
}}
}}
Tetap Baris Baris

A B C
0.0 0.25 0.25
Cache Memory 10-31

Perkalian Matriks jki
/*
/* jki
jki */*/ Loop bagian dalam :
for
for (j=0;
(j=0; j<n;
j<n; j++)
j++) {{
for (*,k) (*,j)
for (k=0;
(k=0; k<n;
k<n; k++)
k++) {{
rr == b[k][j]; (k,j)
b[k][j];
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++)
c[i][j] A B C
c[i][j] +=+= a[i][k]
a[i][k] ** r;
r;
}}
}}
Kolom Tetap Kolom

A B C
1.0 0.0 1.0
Cache Memory 10-32

Perkalian Matriks kji
/*
/* kji
kji */*/ Loop bagian dalam :
for
for (k=0;
(k=0; k<n;
k<n; k++)
k++) {{
for (*,k) (*,j)
for (j=0;
(j=0; j<n;
j<n; j++)
j++) {{
rr == b[k][j]; (k,j)
b[k][j];
for
for (i=0;
(i=0; i<n;
i<n; i++)
i++)
c[i][j] A B C
c[i][j] +=+= a[i][k]
a[i][k] ** r;
r;
}}
}}
Kolom Tetap Kolom

A B C
1.0 0.0 1.0
Cache Memory 10-33

Ringkasan Perkalian Matriks
ijk (& jik): kij (& ikj): jki (& kji):

• 2 load, 0 store • 2 load, 1 store • 2 load, 1 store
• miss/iterasi = 1.25 • miss/iterasi = 0.5 • miss/iterasi = 2.0
for (i=0; i<n; i++) { for (k=0; k<n; k++) { for (j=0; j<n; j++) {
for (j=0; j<n; j++) { for (i=0; i<n; i++) { for (k=0; k<n; k++) {
sum = 0.0; r = a[i][k]; r = b[k][j];
for (k=0; k<n; k++) for (j=0; j<n; j++) for (i=0; i<n; i++)
sum += a[i][k] * b[k][j]; c[i][j] += r * b[k][j]; c[i][j] += a[i][k] * r;
c[i][j] = sum; } }
} } }
}
Cache Memory 10-34

Kinerja Perkalian Matriks Pentium
Miss rate bukan selalu perkiraan yang baik

Penjadwalan kode juga berpengaruh
60
50
40
Cycles/iteration
kji
jki
30
kij
ikj
20 jik
ijk
10
0
25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400
Array size (n)
Cache Memory 10-35

Meningkatkan Temporal Locality
Meningkatkan temporal locality dengan blocking.
Contoh : perkalian matriks dengan blocking
“blok” (di sini) bukan berarti “blok cache blok”.
Tetapi berarti suatu sub-blok dalam matriks.
Contoh : N = 8; ukuran sub-blok = 4
A11 A12 B11 B12 C11 C12
X =
A21 A22 B21 B22 C21 C22
Ide dasar: Sub-blok (mis., Axy) dapat diperlakukan seperti skalar
C11 = A11B11 + A12B21 C12 = A11B12 + A12B22

C21 = A21B11 + A22B21 C22 = A21B12 + A22B22
Cache Memory 10-36

Perkalian Matriks dengan Blok
for (jj=0; jj<n; jj+=bsize) {

for (i=0; i<n; i++)
for (j=jj; j < min(jj+bsize,n); j++)
c[i][j] = 0.0;
for (kk=0; kk<n; kk+=bsize) {
for (i=0; i<n; i++) {
for (j=jj; j < min(jj+bsize,n); j++) {
sum = 0.0
for (k=kk; k < min(kk+bsize,n); k++) {
sum += a[i][k] * b[k][j];
}
c[i][j] += sum;
}
}
}
}
Cache Memory 10-37

Analisis Perkalian Matriks Blok
Pasangan loop paling dalam mengalikan potongan A 1 x bsize dengan blok B
bsize x bsize dan mengakumulasikan menjadi C 1 x bsize.
Loop dengan j langkah melalui potongan A dan C n baris, memakai B sama.
for (i=0; i<n; i++) {

for (j=jj; j < min(jj+bsize,n); j++) {
sum = 0.0
for (k=kk; k < min(kk+bsize,n); k++) {
sum += a[i][k] * b[k][j];
Innermost }
Loop Pair c[i][j] += sum; kk jj jj
}
kk
i i
A B C
Potongan baris diakses update potongan

bsize kali blok dipakai n kali elemen berurutan
secara berurutan
Cache Memory 10-38
Kinerja Blocking pada Pentium
Kinerja perkalian matriks dengan blocking pada Pentium
Blocking (bijk and bikj) meningkatkan kinerja dengan faktor dua
kali di atas versi unblocked (ijk and jik)
Relatif tidak sensitive terhadap ukuran array.
60
50
kji
40 jki
Cycles/iteration
kij
30 ikj
jik
20 ijk
bijk (bsize = 25)
bikj (bsize = 25)
10
0
0
5
0
5
0
5
0
5
0
5
0
5
0
25
50
75
10
12
15
17
20
22
25
27
30
32
35
37
40
Array size (n)
Cache Memory 10-39

Kesimpulan
Pemrogram dalam melakukan optimisasi kinerja cache

Bagaimana struktur data dikelola
Bagaimana data diakses
Struktur nested loop
Blocking merupakan teknik umum
Seluruh sistem menyukai “cache friendly code”
Memperoleh kinerja optimum absolut sangat tergantung pada
platform yang digunakan.
Ukuran cache, ukuran line, associativities, dll.
Keuntungan paling besar dapat diperoleh dengan kode generik
Tetap bekerja dalam working set yang kecil (temporal locality)
Gunakan stride yang kecil (spatial locality)
Cache Memory 10-40

Cache Memory

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cache Memory

Uploaded by

Copyright:

Available Formats

EC3003 - Sistem Komputer

Departemen Teknik Elektro Institut Teknologi Bandung

Organisasi cache memory

Cache Memory 10-2

Cache memory adalah memori berbasis SRAM berukuran kecil dan

Cache Memory 10-3

valid tag 0 1 ••• B–1

set S-1: •••

Cache Memory 10-5

Cache Memory 10-6

Cache yang sederhana

set 0: valid tag blok cache E=1 baris per set

set 1: valid tag blok cache

set S-1: valid tag blok cache

Cache Memory 10-7

set 0: valid tag blok cache

set dipilih valid tag blok cache

Cache Memory 10-8

Pencocokan baris dan pemilihan word

Set dipilih (i): 1 0110 w0 w1 w2 w3

(2) Bit tag pada cache

Cache Memory 10-9

0 [00002] (miss) 13 [11012] (miss)

8 [10002] (miss) 0 [00002] (miss)

Cache Memory 10-10

Setiap set memiliki lebih dari satu baris

valid tag blok cache

valid tag blok cache

Cache Memory 10-12

valid tag blok cache

Set dipilih valid tag blok cache

Cache Memory 10-13

Pencocokan baris dan pemilihan word

Cache Memory 10-14

Ukuran : 200 B 8-64 KB 1-4MB SRAM 128 MB DRAM 30 GB

Bertambah besar, lambat dan murah

Cache Memory 10-15

Cache Memory 10-16

Cache Memory 10-17

int sumarrayrows(int a[M][N]) int sumarraycols(int a[M][N])

for (i = 0; i < M; i++) for (j = 0; j < N; j++)

Miss rate = 1/4 = 25% Miss rate = 100%

Cache Memory 10-18

Membaca throughput (membaca bandwidth)

Cache Memory 10-19

for (i = 0; i < elems; i += stride)

/* Run test(elems, stride) and return read throughput (MB/s) */

test(elems, stride); /* warm up the cache */

Cache Memory 10-20

int data[MAXELEMS]; /* The array we'll be traversing */

init_data(data, MAXELEMS); /* Initialize each element in data to 1 */

Pentium III Xeon

1000 512 KB off-chip unified

400 Punggung gunung

working set size

Cache Memory 10-22

Cache Memory 10-23

Cache Memory 10-24

Cache Memory 10-25

Analisis miss rate pada perkalian matriks

Cache Memory 10-26

Array C dialokasikan dalam urutan row-major

Cache Memory 10-27

Miss pada setiap iterasi loop bagian dalam :