A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs

Michael Pellauer, Muralidaran Vijayaraghavan, Arvind
MIT Computer Science and AI Lab, Computation Structures Group
{pellauer, vmurali, arvind}@csail.mit.edu

Michael Adler, Joel Emer
Intel VSSAD Group
{michael.adler, joel.emer}@intel.com

Introduction
The modern circuit design flow:
Specify System Requirements
Explore Architecture Alternatives
Write Circuit RTL
Verify (FPGAs used here: create prototype for verification)
Physical Manufacture (FPGAs used here: alternative to ASIC tapeout)

Performance Models
Created early in the design process
Drive architectural exploration, feasibility analysis
Parameterization, ease of change
Show how many clock cycles an operation takes
Do not show what the cycle time is
Homebrewed C or SystemC: synchronous simulation

The software performance modeling crisis: simulation speed
Frequency (order of magnitude):
Low-detail models: 100 kHz
Medium-detail models: 10 kHz
High-detail models: 1 kHz
[Figure: accuracy and development time for low-, medium-, and high-detail models]

Interest in using FPGAs for Performance Models
HAsim: Re-implement Intel Asim simulator on FPGA (this talk)
Other Projects: Liberty, UT-FAST
Performance modeling efforts in RAMP

FPGAs can help!
Performance models have a high degree of parallelism within a model clock cycle
It's all modeling gates
But many interesting circuits have no good implementations on LUT-based FPGAs
CAMs, many-ported register files, nested MUXes, etc.
The solution:
Configure FPGA into circuit simulator
Virtualize the FPGA clock

Performance Model on FPGA

4-way Superscalar, Out of Order
FPGA Slices: 22,873 (69%)
Block RAMs: 25 (7%)
Clock Speed: 95 MHz
Simulation Rate: 6 MHz
Avg. FPGA cycle to Model cycle Ratio (FMR): 15.6
Average Simulator IPS: 4.7 MIPS
Implemented using Bluespec SystemVerilog
Xilinx Virtex IIPro 70 using Xilinx ISE 8.1i
Clock speed != simulation speed
Uses 1 execution unit to simulate 4 parallel execution units
Sequentially searches BlockRAM to simulate parallel CAM
Ends up taking about 15.6 FPGA cycles per model cycle
Result: 95 / 15.6 = 6 MHz simulation rate
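
The arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name simulation_rate_mhz is made up here, and the inputs are the figures quoted on the slides.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fmr: float) -> float:
    """Model cycles simulated per microsecond, i.e. the simulation rate in MHz.

    fmr is the FPGA-to-Model Ratio: FPGA cycles spent per model cycle.
    """
    return fpga_clock_mhz / fmr


# Figures quoted on the slide: 95 MHz FPGA clock, average FMR of 15.6.
rate = simulation_rate_mhz(95.0, 15.6)
print(f"{rate:.2f} MHz")  # prints "6.09 MHz"
```

The same division gives the 75 MHz figure quoted for the register file example (224 MHz clock, FMR of 3).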
Example Target
Register File with 2 Read Ports, 2 Write Ports
Reads take zero clock cycles
Direct configuration onto FPGA: 9242 slices, 104 MHz
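[Figure: 2R/2W register file with inputs rd_addr1, rd_addr2, wr_addr1, wr_val1, wr_addr2, wr_val2 and outputs rd_val1, rd_val2. Timing diagram over CC 1 and CC 2: rd_val1 shows V(A) then V(C), rd_val2 shows V(B) then V(D), each in the same clock cycle as its address.]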
Example as Performance Model

Simulate the circuit using synchronous BlockRAM
First do reads, then serialize writes
Only update model time when all requests are serviced
Results: 94 slices, 1 BlockRAM, 224 MHz
Simulation rate is 224 / 3 = 75 MHz (FPGA-to-Model Ratio of 3)
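
The behavior described above can be sketched in Python. This is a minimal software illustration, not the Bluespec implementation: the class RegFileModel and its fields are hypothetical names, a Python list stands in for the BlockRAM, and the sketch assumes that reads observe the state from before the current model cycle's writes. The constant 3 is the FPGA-to-Model Ratio quoted on the slide; how those three FPGA cycles are divided between reads and writes is not modeled.

```python
class RegFileModel:
    """2-read/2-write register file simulated over several FPGA cycles."""

    # FMR quoted on the slide; the per-phase breakdown is not modeled here.
    FPGA_CYCLES_PER_MODEL_CYCLE = 3

    def __init__(self, num_regs: int):
        self.mem = [0] * num_regs  # stands in for the single BlockRAM
        self.model_cycle = 0       # virtual (model) clock
        self.fpga_cycles = 0       # FPGA clock cycles consumed so far

    def step(self, rd_addrs, writes):
        """Simulate one model cycle.

        rd_addrs: the read addresses for this model cycle.
        writes:   list of (addr, value) pairs for this model cycle.
        """
        # First do the reads (assumption: they see pre-write state).
        rd_vals = [self.mem[a] for a in rd_addrs]
        # Then serialize the writes; a later write to the same address wins.
        for addr, val in writes:
            self.mem[addr] = val
        # Only update model time once all requests are serviced.
        self.model_cycle += 1
        self.fpga_cycles += self.FPGA_CYCLES_PER_MODEL_CYCLE
        return rd_vals


rf = RegFileModel(num_regs=8)
first = rf.step([1, 2], [(1, 10), (2, 20)])  # reads happen before the writes
second = rf.step([1, 2], [])                 # now the written values are visible
print(first, second, rf.model_cycle, rf.fpga_cycles)
```

Two model cycles cost six FPGA cycles in this sketch, which is the separation of model clock from FPGA clock that the slide describes.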
[Figure: timing diagram of one model clock cycle (Model CC 1) spanning 3 FPGA clock cycles: rd_addr1 = A returns rd_val1 = V(A), and rd_addr2 = B returns rd_val2 = V(B)]
Separated model clock from FPGA clock
How do we compose these modules into a correct, efficient system?
Let's examine how a software performance model does it
Time in Software Asim Model

[Figure: pipeline of modules FET, DEC, EXE, MEM, WB connected by Ports, each Port annotated with its model-time latency (1 or 2 model cycles)]
Software has no inherent clock

Model time is tracked via Asim Ports
All communication goes through Ports
Ports have a model time latency for messages
Execution model: for each module in system

Simulate a model cycle for that module
Reads all input Ports, writes all output Ports
Can write special NoMessage value to indicate no activity
FPGA: Can simulate in parallel instead of sequentially
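
The sequential execution model above can be sketched as follows. This is a simplified illustration rather than Asim's actual code: the Port class, the three toy modules, and the use of None for NoMessage are all assumptions made for the example, and every Port is assumed to have a latency of at least one model cycle.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the special NoMessage value


class Port:
    """Channel whose messages arrive `latency` model cycles after being sent."""

    def __init__(self, latency: int):
        self.latency = latency
        self.in_flight = deque()  # (arrival_cycle, message) pairs in send order

    def send(self, msg, now: int):
        if msg is not NO_MESSAGE:
            self.in_flight.append((now + self.latency, msg))

    def receive(self, now: int):
        if self.in_flight and self.in_flight[0][0] <= now:
            return self.in_flight.popleft()[1]
        return NO_MESSAGE


def simulate(num_cycles: int):
    a = Port(latency=1)  # source -> doubler
    b = Port(latency=2)  # doubler -> sink
    received = []        # (model cycle, value) pairs seen by the sink

    def source(now):
        a.send(now if now < 3 else NO_MESSAGE, now)  # sends 0, 1, 2, then idles

    def doubler(now):
        x = a.receive(now)
        b.send(NO_MESSAGE if x is NO_MESSAGE else 2 * x, now)

    def sink(now):
        x = b.receive(now)
        if x is not NO_MESSAGE:
            received.append((now, x))

    modules = [source, doubler, sink]
    for now in range(num_cycles):  # model time exists only as this loop index
        for module in modules:     # one model cycle per module, sequentially
            module(now)
    return received


print(simulate(6))  # [(3, 0), (4, 2), (5, 4)]
```

Each value reaches the sink three model cycles after it was sent, the sum of the two Port latencies, even though the software itself has no clock.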

Barrier Synchronization
[Figure: central Controller holding curCC, connected to modules FET, DEC, EXE, MEM, WB]
Controller tracks current Model CC

Tells all modules to begin
Modules copy input, compute, write output, say done
When all are done, increment cycle count and repeat
More fine-grained parallelism than parallel software models

FPGA-to-Model Ratio: Dynamic worst case
But what about clock rate?
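
To make "dynamic worst case" concrete, here is a small Python sketch of the barrier scheme's cost. It is an idealized model under stated assumptions: the function name and the per-module cycle counts are invented for illustration, and the controller's begin/done handshake is treated as free.

```python
def barrier_fpga_cycles(per_module_costs):
    """Total FPGA cycles under barrier synchronization.

    per_module_costs[m][c] is the number of FPGA cycles module m needs
    to simulate model cycle c (hypothetical numbers in the example below).
    """
    num_model_cycles = len(per_module_costs[0])
    total = 0
    for c in range(num_model_cycles):
        # All modules begin together; curCC advances only after the
        # slowest module for this model cycle has said done.
        total += max(costs[c] for costs in per_module_costs)
    return total


costs = [
    [1, 4, 1, 1],  # module A: FPGA cycles for model cycles 0..3
    [3, 1, 1, 2],  # module B
]
total = barrier_fpga_cycles(costs)
fmr = total / len(costs[0])
print(total, fmr)  # 10 2.5
```

Each model cycle is charged at the cost of whichever module is slowest in that cycle, so the ratio tracks the per-cycle worst case rather than any one module's average.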
The problem with Barrier Sync

Becomes critical path with large number of modules
Massive fan-in/fan-out from controller
Quickly becomes critical path
Experiment: Linear topology of modules
Could pipeline or cluster, but we can do better
A-Ports: Asim Ports on an FPGA

[Figure: modules FET, DEC, EXE, MEM, WB connected by A-Ports, each annotated with its model-time latency (1 or 2 model cycles)]
Scalable: Distributed control, no combinational paths, no counters

Port with n latency starts with n NoMessage in it
Each module may proceed in a dataflow manner
Start a cycle whenever all inputs are available
Compute for any number of FPGA cycles
Stall if output ports are full
A module can derive the current model cycle by counting

Adjacent modules may not be on the same model cycle.
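
The rules on this slide can be sketched in Python. This is a software illustration only, not the hardware design: APort, Module, and try_step are hypothetical names, None stands in for NoMessage, each firing is treated as one scheduler step regardless of how many FPGA cycles the compute would take, and the model_cycle field exists only to show the counting mentioned above.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the NoMessage value


class APort:
    """FIFO that starts with `latency` NoMessages already in it."""

    def __init__(self, latency: int, capacity: int):
        assert capacity >= max(latency, 1)
        self.latency = latency
        self.capacity = capacity
        self.fifo = deque([NO_MESSAGE] * latency)

    def can_deq(self) -> bool:
        return len(self.fifo) > 0

    def can_enq(self) -> bool:
        return len(self.fifo) < self.capacity


class Module:
    def __init__(self, inputs, outputs, compute):
        self.inputs, self.outputs, self.compute = inputs, outputs, compute
        self.model_cycle = 0  # derived locally by counting completed cycles

    def try_step(self) -> bool:
        """Simulate one model cycle if possible; return True if it fired."""
        if not all(p.can_deq() for p in self.inputs):
            return False  # an input is not available yet
        if not all(p.can_enq() for p in self.outputs):
            return False  # stall: an output port is full
        args = [p.fifo.popleft() for p in self.inputs]
        results = self.compute(self.model_cycle, *args)
        for port, msg in zip(self.outputs, results):
            port.fifo.append(msg)
        self.model_cycle += 1
        return True


log = []

def produce(cycle):
    return [cycle]  # send the producer's own model cycle number

def consume(cycle, msg):
    log.append((cycle, msg))
    return []

port = APort(latency=1, capacity=3)
producer = Module([], [port], produce)
consumer = Module([port], [], consume)

while producer.try_step():  # producer runs until the port is full
    pass
ahead = producer.model_cycle - consumer.model_cycle
while consumer.try_step():  # consumer then drains the port
    pass
print(ahead, log)  # 2 [(0, None), (1, 0), (2, 1)]
```

Whatever the schedule, a message sent in producer model cycle t is consumed in consumer model cycle t + 1, the port's latency. In this sketch the port's occupancy always equals latency + producer cycle - consumer cycle, so the two modules are on the same model cycle exactly when the port holds `latency` messages.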
Modules can slip in model time

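[Figure: modules FET, DEC, EXE, MEM, WB connected by A-Ports with model-time latencies of 1 or 2 model cycles, shown slipping relative to one another in model time]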
Observation: when port has n messages in it

Producer and Consumer are on same model cycle
Producers can run ahead, prebuffering data

Consumers can run ahead, draining data
Still works with backwards paths
With proper buffering we can achieve the average number of FPGA cycles per model cycle
Much better than worst case a la Barrier
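
The difference between the average case and the worst case can be illustrated with a toy timing model in Python. Everything here is an assumption made for the example: a single producer feeding a single consumer through one port, invented per-cycle costs, a producer that computes first and then waits for a free slot, and a barrier handshake that costs nothing.

```python
def barrier_total(prod_costs, cons_costs):
    """FPGA cycles when both modules wait for each other every model cycle."""
    return sum(max(a, b) for a, b in zip(prod_costs, cons_costs))


def aport_total(prod_costs, cons_costs, latency, capacity):
    """FPGA cycles when the two modules are coupled only through one port."""
    assert 1 <= latency <= capacity
    n = len(prod_costs)
    p_end = [0] * n    # FPGA time at which the producer enqueues message i
    c_start = [0] * n  # FPGA time at which the consumer dequeues for cycle i
    c_end = 0          # FPGA time at which the consumer finished its last cycle
    p_free = 0         # FPGA time at which the producer finished its last cycle
    for i in range(n):
        # Consumer input i is an initial NoMessage or producer message i - latency.
        avail = 0 if i < latency else p_end[i - latency]
        c_start[i] = max(c_end, avail)
        c_end = c_start[i] + cons_costs[i]
        # Producer computes, then stalls until the port has a free slot.
        done = p_free + prod_costs[i]
        k = i + latency - capacity  # index of the dequeue that frees that slot
        p_end[i] = max(done, c_start[k]) if k >= 0 else done
        p_free = p_end[i]
    return max(p_free, c_end)


prod = [4, 1, 1, 4, 1, 1]  # hypothetical FPGA cycles per model cycle
cons = [1, 4, 1, 1, 4, 1]
print(barrier_total(prod, cons))                       # 18
print(aport_total(prod, cons, latency=1, capacity=4))  # 15
print(aport_total(prod, cons, latency=1, capacity=1))  # 17
```

Both modules average 2 FPGA cycles per model cycle, but their slow cycles do not line up, so the barrier pays the maximum every time. In this toy model the buffered port lets one module's slow cycles overlap the other's fast ones, and more buffering recovers more of the gap.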
Example: MIPS R10K-like Processor

4-Way Superscalar, Out-of-order Issue
Results: OOO Simulator

[Figure: Out-Of-Order Simulator Speedup. Bar chart comparing Barrier Sync, A-Ports Default Buffers, and A-Ports Optimal Buffers on the benchmarks median, multiply, qsort, towers, vvadd, and their average.]
Takeaways
Performance Modeling on FPGAs shows great potential
Cycle-accurate simulation in MHz vs kHz
A-Ports:
Distributed, efficient tracking of time that scales
Manages dynamic slip in model time
Dynamic average case instead of worst case
In paper: a technique to resynchronize modules to the same model clock cycle
Underway: Effort to model realistic multicore systems
Future Work: Combine with Chung-style virtualization
[FPGA 2008] to eliminate pipeline stalls
