
A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs
Michael Pellauer
Muralidaran Vijayaraghavan
Michael Adler
Arvind
Joel Emer
MIT Computer Science and AI Lab, Computation Structures Group
{pellauer, vmurali, arvind}@csail.mit.edu

Intel VSSAD Group
{michael.adler, joel.emer}@intel.com

Introduction
The modern circuit design flow:
Specify System Requirements
Explore Architecture Alternatives
Write Circuit RTL
Verify (FPGAs used here: create prototype for verification)
Physical Manufacture (FPGAs used here: alternative to ASIC tapeout)

Interest in using FPGAs for Performance Models
HAsim: Re-implement Intel Asim simulator on FPGA (this talk)
Other Projects: Liberty, UT-FAST
Performance modeling efforts in RAMP

Performance Models
Created early in the design process
Drive architectural exploration, feasibility analysis
Parameterization, ease of change
Shows how many clock cycles an operation takes
Does not show what the cycle time is
Homebrewed C or SystemC: synchronous simulation

The software performance modeling crisis: simulation speed

Model detail            Frequency (order of magnitude)
Low-detail models       100 KHz
Medium-detail models    10 KHz
High-detail models      1 KHz
[Chart axes: Accuracy, Development Time]

FPGAs can help!
Performance models have a high degree of parallelism within a model clock cycle
It's all modeling gates
But many interesting circuits have no good implementations on LUT-based FPGAs
CAMs, many-ported register files, nested MUXes, etc.

The solution:
Configure FPGA into circuit simulator
Virtualize the FPGA clock

Performance Model on FPGA

4-way Superscalar, Out of Order

FPGA Slices                                   22,873 (69%)
Block RAMs                                    25 (7%)
Clock Speed                                   95 MHz
Simulation Rate                               6 MHz
Avg. FPGA cycle to Model cycle Ratio (FMR)    15.6
Average Simulator IPS                         4.7 MIPS

Implemented using Bluespec SystemVerilog
Xilinx Virtex IIPro 70 using Xilinx ISE 8.1i
Clock speed != simulation speed
Uses 1 execution unit to simulate 4 parallel execution units
Sequentially searches BlockRAM to simulate parallel CAM
Ends up taking about 15.6 FPGA cycles per model cycle
Result: 95 MHz / 15.6 ≈ 6 MHz simulation rate
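
The relationship between FPGA clock speed, FMR, and simulation rate can be checked with a few lines of Python. This is only a sketch of the arithmetic on this slide; the function name `simulation_rate_mhz` is hypothetical and not part of HAsim.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fmr: float) -> float:
    """Simulated model cycles per microsecond (MHz) = FPGA clock / FMR."""
    if fmr <= 0:
        raise ValueError("FMR must be positive")
    return fpga_clock_mhz / fmr


# Numbers from the slide: 95 MHz FPGA clock, 15.6 FPGA cycles per model cycle.
ooo_rate = simulation_rate_mhz(95.0, 15.6)
print(f"{ooo_rate:.2f} MHz")  # 6.09 MHz
```

A lower FMR helps as much as a faster FPGA clock, which is why the rest of the talk focuses on how many FPGA cycles each model cycle costs.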

Example Target
Register File with 2 Read Ports, 2 Write Ports
Reads take zero clock cycles
Direct configuration onto FPGA: 9242 slices, 104 MHz

[Diagram: 2R/2W Register File block. Inputs: rd_addr1, rd_addr2, wr_addr1, wr_val1, wr_addr2, wr_val2. Outputs: rd_val1, rd_val2.]
[Timing diagram: rd_val1 shows V(A) in CC 1 and V(C) in CC 2; rd_val2 shows V(B) in CC 1 and V(D) in CC 2.]

Example as Performance Model


Simulate the circuit using synchronous BlockRAM
First do reads, then serialize writes
Only update model time when all requests are serviced
Results: 94 slices, 1 BlockRAM, 224 MHz
Simulation rate is 224 MHz / 3 ≈ 75 MHz (FPGA-to-Model Ratio of 3)
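
The scheme above (reads first, then serialized writes, with model time advancing only once everything is serviced) can be sketched in software as follows. The class name `RegisterFileModel` and the cycle accounting are assumptions for illustration: it charges one FPGA cycle for both reads together and one per write, which reproduces the ratio of 3 but is not taken from the actual design.

```python
class RegisterFileModel:
    """Time-multiplexed model of a 2-read/2-write register file.

    Assumption (not from the slides): both reads share one FPGA cycle and
    each write takes one FPGA cycle, giving 3 FPGA cycles per model cycle.
    """

    def __init__(self, num_regs: int = 32):
        self.ram = [0] * num_regs  # stands in for the BlockRAM
        self.fpga_cycles = 0
        self.model_cycles = 0

    def model_cycle(self, rd_addrs, writes):
        """Service one model cycle: reads first, then serialized writes."""
        # Reads observe the state from before this model cycle's writes.
        rd_vals = [self.ram[addr] for addr in rd_addrs]
        self.fpga_cycles += 1
        for addr, val in writes:
            self.ram[addr] = val
            self.fpga_cycles += 1
        # Model time advances only after every request has been serviced.
        self.model_cycles += 1
        return rd_vals


rf = RegisterFileModel()
first = rf.model_cycle([1, 2], [(1, 10), (2, 20)])
second = rf.model_cycle([1, 2], [(1, 11), (2, 21)])
fmr = rf.fpga_cycles / rf.model_cycles
print(first, second, fmr)  # [0, 0] [10, 20] 3.0
```

From the outside, the model behaves like the target circuit on every model cycle; only the number of FPGA cycles spent per model cycle differs.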
[Timing diagram: within Model CC 1, over 3 FPGA cycles, rd_addr1 = A returns rd_val1 = V(A) and rd_addr2 = B returns rd_val2 = V(B).]

Separated model clock from FPGA clock
How do we compose these modules into a correct, efficient system?
Let's examine how a software performance model does it

Time in Software Asim Model


[Diagram: pipeline stages FET, DEC, EXE, MEM, WB connected by Ports; each Port is labeled with its model-time latency (1 on most Ports, 2 on one).]

Software has no inherent clock


Model time is tracked via Asim Ports
All communication goes through Ports
Ports have a model time latency for messages

Execution model: for each module in system


Simulate a model cycle for that module
Reads all input Ports, writes all output Ports
Can write special NoMessage value to indicate no activity

FPGA: Can simulate in parallel instead of sequentially
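
The software execution model described on this slide can be sketched as below. This is a minimal illustration, not Asim's actual API: the `Port` class, its `read`/`write` methods, and the three-stage example are all hypothetical, and every port is assumed to have a latency of at least 1 so that the order in which modules are simulated within a model cycle does not matter.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the special NoMessage value


class Port:
    """A message written in model cycle t is readable in cycle t + latency."""

    def __init__(self, latency: int):
        assert latency >= 1  # assumption made for this sketch
        self.latency = latency
        self.in_flight = deque()  # entries are (ready_cycle, message)

    def write(self, now, msg):
        self.in_flight.append((now + self.latency, msg))

    def read(self, now):
        if self.in_flight and self.in_flight[0][0] <= now:
            return self.in_flight.popleft()[1]
        return NO_MESSAGE


fet_dec, dec_exe = Port(latency=1), Port(latency=2)
received = []
for now in range(5):
    # Sequential execution: simulate one model cycle of each module in turn.
    fet_dec.write(now, f"inst{now}")        # FET writes its output Port
    dec_exe.write(now, fet_dec.read(now))   # DEC reads input, writes output
    received.append(dec_exe.read(now))      # EXE reads its input Port
print(received)  # [None, None, None, 'inst0', 'inst1']
```

The instruction written by FET in model cycle 0 reaches EXE in model cycle 3, the sum of the two port latencies, even though the software itself has no clock.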



Barrier Synchronization
[Diagram: a central Controller holding curCC, connected to the modules FET, DEC, EXE, MEM, WB.]

Controller tracks current Model CC


Tells all modules to begin
Modules copy input, compute, write output, say done
When all are done, increment cycle count and repeat
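
The cost of this scheme can be sketched in a few lines: because no module may start the next model cycle until all are done, each model cycle costs as many FPGA cycles as the slowest module needs in that cycle. The function name and the cost table below are made up for illustration, and controller overhead is ignored.

```python
def barrier_sync_fpga_cycles(costs_per_model_cycle):
    """costs_per_model_cycle[c][m] = FPGA cycles module m needs in model cycle c.

    Every module waits for the slowest one, so each model cycle costs the
    maximum over modules.
    """
    return sum(max(costs) for costs in costs_per_model_cycle)


# Hypothetical workload: each module averages 2 FPGA cycles per model cycle,
# but a different module is slow in each model cycle.
costs = [
    [4, 1, 1],
    [1, 4, 1],
    [1, 1, 4],
]
total = barrier_sync_fpga_cycles(costs)
fmr = total / len(costs)
print(total, fmr)  # 12 4.0
```

In this example every module averages 2 FPGA cycles per model cycle, yet the barrier forces a ratio of 4.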

More fine-grained parallelism than parallel software models


FPGA-to-Model Ratio: Dynamic worst case
But what about clock rate?

The problem with Barrier Sync


Massive fan-in/fan-out from controller
Quickly becomes the critical path with a large number of modules

Experiment: Linear topology of modules

Could pipeline or cluster, but we can do better

A-Ports: Asim Ports on an FPGA


[Diagram: modules FET, DEC, EXE, MEM, WB connected by A-Ports; each A-Port is labeled with its model-time latency (1 on most, 2 on one).]

Scalable: Distributed control, no combinational paths, no counters


Port with n latency starts with n NoMessage in it
Each module may proceed in a dataflow manner
Start a cycle whenever all inputs are available
Compute for any number of FPGA cycles
Stall if output ports are full

A module can derive the current model cycle by counting


Adjacent modules may not be on the same model cycle.
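
The rules on this slide can be sketched in software as a FIFO per port plus a local firing rule per module. This is an illustrative sketch only: the names `APort` and `Module` and the `capacity` parameter are assumptions, and the sketch checks for output space before starting a cycle rather than stalling at the end of it.

```python
from collections import deque

NO_MESSAGE = None  # placeholder for the NoMessage value


class APort:
    """FIFO that starts with `latency` NoMessages in it."""

    def __init__(self, latency, capacity):
        assert capacity > latency  # assumed buffer sizing for this sketch
        self.capacity = capacity
        self.fifo = deque([NO_MESSAGE] * latency)

    def can_deq(self):
        return len(self.fifo) > 0

    def can_enq(self):
        return len(self.fifo) < self.capacity


class Module:
    def __init__(self, ins, outs, compute):
        self.ins, self.outs, self.compute = ins, outs, compute
        self.model_cycle = 0  # derived purely by counting completed cycles

    def can_start(self):
        # Dataflow rule: all inputs available and no output port full.
        return (all(p.can_deq() for p in self.ins)
                and all(p.can_enq() for p in self.outs))

    def step(self):
        msgs = [p.fifo.popleft() for p in self.ins]
        for port, out in zip(self.outs, self.compute(self.model_cycle, msgs)):
            port.fifo.append(out)
        self.model_cycle += 1


seen = []

def decode(cc, msgs):
    seen.append((cc, msgs[0]))  # record (model cycle, message received)
    return []

port = APort(latency=1, capacity=3)
fet = Module([], [port], lambda cc, msgs: [f"inst{cc}"])
dec = Module([port], [], decode)

while fet.can_start():
    fet.step()  # producer runs ahead until the port fills
ahead = (fet.model_cycle, dec.model_cycle)
while dec.can_start():
    dec.step()  # consumer drains the port
print(ahead, seen)  # (2, 0) [(0, None), (1, 'inst0'), (2, 'inst1')]
```

There is no shared controller here. FET reaches model cycle 2 while DEC is still on cycle 0, yet DEC still sees each instruction exactly one model cycle after it was produced.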


Modules can slip in model time


[Diagram: modules FET, DEC, EXE, MEM, WB connected by A-Ports; each A-Port is labeled with its model-time latency (1 on most, 2 on one).]

Observation: when a port of latency n has exactly n messages in it

Producer and Consumer are on the same model cycle

Producers can run ahead, prebuffering data

Consumers can run ahead, draining data
Still works with backwards paths
With proper buffering we can get the average number of
FPGA cycles per model cycle
Much better than the worst case a la Barrier
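
The difference between the two schemes can be illustrated with a small timing calculation for a linear chain of modules. This is a simplified sketch under stated assumptions, not the experiment from the talk: the function names, the cost table, and the port parameters (`latency=1`, `capacity=3`) are hypothetical, and time is abstract FPGA cycles with no controller or wiring overhead.

```python
def barrier_total(costs):
    """costs[m][c] = FPGA cycles module m needs in model cycle c."""
    num_cycles = len(costs[0])
    return sum(max(row[cc] for row in costs) for cc in range(num_cycles))


def aports_total(costs, latency=1, capacity=3):
    """FPGA cycles until every module in a linear chain finishes its last cycle."""
    assert capacity > latency
    n_mod, n_cyc = len(costs), len(costs[0])
    start = [[0] * n_cyc for _ in range(n_mod)]
    end = [[0] * n_cyc for _ in range(n_mod)]
    for cc in range(n_cyc):
        for m in range(n_mod):
            # Start once the previous cycle is done and the input has arrived.
            ready = end[m][cc - 1] if cc > 0 else 0
            if m > 0 and cc >= latency:
                ready = max(ready, end[m - 1][cc - latency])
            start[m][cc] = ready
            done = ready + costs[m][cc]
            # Stall on a full output port until the consumer has dequeued enough.
            freed = cc + latency - capacity
            if m + 1 < n_mod and freed >= 0:
                done = max(done, start[m + 1][freed])
            end[m][cc] = done
    return max(row[-1] for row in end)


# Hypothetical costs: each module averages 2 FPGA cycles per model cycle,
# with its slow cycles staggered relative to its neighbors.
costs = [
    [4, 1, 1, 4, 1, 1],
    [1, 4, 1, 1, 4, 1],
    [1, 1, 4, 1, 1, 4],
]
print(barrier_total(costs), aports_total(costs))  # 24 18
```

In this made-up workload the barrier pays for the slowest module in every model cycle (24 FPGA cycles for 6 model cycles), while the buffered ports let modules slip past each other's slow cycles (18 FPGA cycles).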

Example: MIPS R10K-like Processor


4-Way Superscalar, Out-of-order Issue


Results: OOO Simulator


[Chart: Out-Of-Order Simulator Speedup. Y-axis from 0 to 1.4. Series: Barrier Sync, A-Ports Default Buffers, A-Ports Optimal Buffers. Benchmarks: median, multiply, qsort, towers, vvadd, average.]


Takeaways
Performance Modeling on FPGAs shows great potential
Cycle-accurate simulation in MHz vs KHz

A-Ports:
Distributed, efficient tracking of time that scales
Manages dynamic slip in model time
Dynamic average case instead of worst case

In paper: a technique to resynchronize modules to the same model clock cycle
Underway: Effort to model realistic multicore systems
Future Work: Combine with Chung-style virtualization
[FPGA 2008] to eliminate pipeline stalls
