Intel
VSSAD Group
{michael.adler, joel.emer}@intel.com
Introduction
The modern circuit design flow:
Specify System Requirements
Explore Architecture Alternatives
Write Circuit RTL
Physical Manufacture
Verify
Performance Models
Created early in the design process
Drive architectural exploration, feasibility analysis
parameterization, ease of change
Shows how many clock cycles an operation takes
Does not show what cycle time is
... within a model clock cycle
Homebrewed C or SystemC: synchronous simulation
The software performance modeling crisis: Simulation speed
FPGAs can help!
Performance models have high degree of parallelism
It's all modeling gates
But many interesting circuits have no good implementations on LUT-based FPGAs
CAMs, many-ported register files, nested MUXes, etc
The solution:
Configure FPGA into circuit simulator
Virtualize the FPGA clock
Model detail            Frequency (order of magnitude)
Low-detail models       100 kHz
Medium-detail models    10 kHz
High-detail models      1 kHz
Other axes: Accuracy, Development Time
22,873 (69%)
Block RAMs: 25 (7%)
Clock Speed: 95 MHz
Simulation Rate: 6 MHz
15.6
4.7 MIPS
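The figures above can be related with a little arithmetic. The sketch below assumes that the unlabeled value 15.6 is the average number of FPGA cycles spent per model cycle; the label did not survive on the slide, so this reading is an assumption, and the function and variable names are illustrative only. Under that assumption the simulation rate is the FPGA clock divided by the cycle ratio, and dividing the MIPS figure by the simulation rate gives the modeled instructions per model cycle.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fpga_cycles_per_model_cycle: float) -> float:
    """Model cycles simulated per microsecond (i.e. MHz of model clock)."""
    return fpga_clock_mhz / fpga_cycles_per_model_cycle


# Values taken from the slide; the meaning of 15.6 is an assumption (see lead-in).
fpga_clock_mhz = 95.0
assumed_cycle_ratio = 15.6
simulated_mips = 4.7

rate_mhz = simulation_rate_mhz(fpga_clock_mhz, assumed_cycle_ratio)
# Instructions committed per model cycle implied by the MIPS figure.
instructions_per_model_cycle = simulated_mips / rate_mhz

print(f"simulation rate ~ {rate_mhz:.2f} MHz")
print(f"instructions per model cycle ~ {instructions_per_model_cycle:.2f}")
```

If the assumption holds, the result is consistent with the rounded 6 MHz simulation rate quoted on the slide.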
Example Target
Register File with 2 Read Ports, 2 Write Ports
Reads take zero clock cycles
Direct configuration onto FPGA: 9242 slices, 104 MHz
Block diagram: 2R/2W Register File
Inputs: rd_addr1, rd_addr2, wr_addr1, wr_val1, wr_addr2, wr_val2
Outputs: rd_val1, rd_val2
Timing diagram (CC 1, CC 2):
rd_addr1 -> rd_val1: V(A) in CC 1, V(C) in CC 2
rd_addr2 -> rd_val2: V(B) in CC 1, V(D) in CC 2
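One way to picture a cheaper realization of this register file is to let a single model clock cycle span several FPGA cycles, so that one single-ported memory can serve all four ports in turn. The sketch below is a software illustration of that idea, not the authors' hardware: the class name `VirtualizedRegFile`, the one-access-per-FPGA-cycle memory, and the read-before-write ordering within a model cycle are all assumptions made for the example.

```python
class VirtualizedRegFile:
    """2R/2W register file modeled on a memory that allows one access per FPGA cycle.

    Hypothetical sketch: each port access costs one FPGA cycle, and a model
    cycle ends only after all requested accesses have been served.
    """

    def __init__(self, num_regs=32):
        self.mem = [0] * num_regs
        self.fpga_cycles = 0   # FPGA clock ticks consumed so far
        self.model_cycles = 0  # model clock ticks simulated so far

    def _tick(self):
        # One memory access consumes one FPGA cycle.
        self.fpga_cycles += 1

    def model_cycle(self, rd_addr1, rd_addr2, writes=()):
        """Simulate one model cycle: two reads, then up to two writes."""
        if len(writes) > 2:
            raise ValueError("only 2 write ports are modeled")
        # Reads return within the same model cycle ("zero clock cycles"
        # of model time), even though they take FPGA cycles to perform.
        self._tick()
        rd_val1 = self.mem[rd_addr1]
        self._tick()
        rd_val2 = self.mem[rd_addr2]
        # Assumed ordering: writes become visible in the next model cycle.
        for addr, val in writes:
            self._tick()
            self.mem[addr] = val
        self.model_cycles += 1
        return rd_val1, rd_val2


rf = VirtualizedRegFile(num_regs=8)
first = rf.model_cycle(1, 2, writes=[(1, 10), (2, 20)])
second = rf.model_cycle(1, 2)
print(first, second, rf.fpga_cycles, rf.model_cycles)
```

From the model's point of view the reads still complete in the cycle they are issued; only the FPGA cycle count reveals that the accesses were serialized.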
Separated model clock from FPGA clock
Timing diagram: rd_addr1 = A, rd_val1 = V(A); rd_addr2 = B, rd_val2 = V(B)
How do we compose these modules into a correct, efficient system?
Let's examine how a software performance model does it
Pipeline diagram: DEC, EXE, MEM, WB, with edge labels 1, 1, 1, 1, 1, 2
Barrier Synchronization
Diagram: Controller (curCC) connected to FET, DEC, EXE, MEM, WB
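As a rough illustration of barrier synchronization, the sketch below assumes a central controller that advances `cur_cc` only after every module has finished the current model cycle, so each model cycle costs as many FPGA cycles as the slowest module needs. The function name, the `costs` table, and the per-cycle numbers are hypothetical and chosen only to make the behavior visible.

```python
def barrier_sync_fpga_cycles(costs):
    """Total FPGA cycles when all modules wait at a barrier each model cycle.

    `costs` maps a module name to a list holding the number of FPGA cycles
    that module needs for each model cycle (hypothetical input format).
    """
    num_model_cycles = len(next(iter(costs.values())))
    cur_cc = 0        # current model clock cycle, held by the controller
    fpga_cycles = 0
    while cur_cc < num_model_cycles:
        # Every module must finish cur_cc before the controller moves on,
        # so the slowest module sets the cost of this model cycle.
        fpga_cycles += max(per_module[cur_cc] for per_module in costs.values())
        cur_cc += 1
    return fpga_cycles


# Made-up per-cycle costs for two of the pipeline stages.
costs = {
    "FET": [1, 3, 1, 1],
    "DEC": [3, 1, 1, 3],
}
barrier_total = barrier_sync_fpga_cycles(costs)
print(barrier_total)
```

With these numbers every model cycle pays the worst case among the modules (3 + 3 + 1 + 3 FPGA cycles), even when most modules finish early.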
Chart: Barrier Sync, A-Ports Default Buffers, and A-Ports Optimal Buffers compared on median, multiply, qsort, towers, vvadd, and their average (y-axis from 0 to 1.2)
Takeaways
Performance Modeling on FPGAs shows great potential
Cycle-accurate simulation at MHz rates instead of kHz
A-Ports:
Distributed, efficient tracking of time that scales
Manages dynamic slip in model time
Dynamic average case instead of worst case
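The slides do not spell out the A-Ports mechanism, so the following is only one plausible reading of "distributed tracking of time" and "dynamic slip". It assumes each port is a FIFO pre-filled with as many empty messages as its model latency, and that a module simulates its next model cycle as soon as every input port has a message and every output port has room. All class names, the buffer capacity, and the per-cycle costs are hypothetical.

```python
from collections import deque


class Port:
    """FIFO link between two modules (assumed A-Port behavior)."""

    def __init__(self, latency, capacity):
        # Pre-fill with `latency` empty messages so the consumer can start
        # its first model cycles without waiting for the producer.
        self.q = deque([None] * latency)
        self.capacity = capacity

    def can_deq(self):
        return len(self.q) > 0

    def can_enq(self):
        return len(self.q) < self.capacity


class Module:
    def __init__(self, name, costs, inputs=(), outputs=()):
        self.name = name
        self.costs = costs          # FPGA cycles needed per model cycle (>= 1)
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.model_cycle = 0        # local notion of model time
        self.busy = 0               # FPGA cycles left in the current model cycle

    def done(self):
        return self.model_cycle >= len(self.costs)

    def step(self):
        """Advance this module by one FPGA cycle."""
        if self.busy == 0 and not self.done():
            ready = all(p.can_deq() for p in self.inputs)
            room = all(p.can_enq() for p in self.outputs)
            if ready and room:
                for p in self.inputs:
                    p.q.popleft()
                self.busy = self.costs[self.model_cycle]
        if self.busy > 0:
            self.busy -= 1
            if self.busy == 0:
                for p in self.outputs:
                    p.q.append(self.model_cycle)
                self.model_cycle += 1


def run(modules, limit=10_000):
    """Step all modules until each has finished; return FPGA cycles used."""
    fpga_cycles = 0
    while not all(m.done() for m in modules):
        if fpga_cycles >= limit:
            raise RuntimeError("no progress; check latencies and capacities")
        fpga_cycles += 1
        for m in modules:
            m.step()
    return fpga_cycles


fet_costs = [1, 3, 1, 1]
dec_costs = [3, 1, 1, 3]
link = Port(latency=1, capacity=3)
fet = Module("FET", fet_costs, outputs=[link])
dec = Module("DEC", dec_costs, inputs=[link])

aports_total = run([fet, dec])
# Lock-step cost for the same workload: the slowest module each model cycle.
barrier_total = sum(max(a, b) for a, b in zip(fet_costs, dec_costs))
print(aports_total, barrier_total)
```

In this toy run the two modules drift apart in model time while the buffer absorbs the difference, so the total tracks the busier module's own work rather than the per-cycle worst case across modules.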