
Comprehensive environment for benchmarking using FPGAs: ATHENa - Automated Tool for Hardware EvaluatioN

Modern Benchmarking: Natural Progression of Tools


[Diagram: natural progression of benchmarking tools. Software: eBACS (D. Bernstein, T. Lange). FPGAs: ATHENa. ASICs: ?]

ATHENa Automated Tool for Hardware EvaluatioN


http://cryptography.gmu.edu/athena

Set of scripts written in Perl aimed at an AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms

Currently under development at George Mason University.


Version 0.3.1

Why Athena?

"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena was known for her superb logic and intellect. Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest." (from "Athena, Greek Goddess of Wisdom and Craftsmanship")

Designers of ATHENa
- Venkata "Vinny", MS CpE student
- Ekawat "Ice", MS CpE student
- Marcin, PhD ECE student
- Xin, PhD ECE student
- Michal, exchange PhD student from Slovakia
- Rajesh, PhD ECE student

Basic Dataflow of ATHENa


[Diagram: the user downloads scripts, configuration files, designer interfaces, and testbenches from the ATHENa server, then prepares synthesizable source files, configuration files, constraint files, and a testbench. ATHENa drives the FPGA tools through synthesis and implementation, and returns a result summary (user-friendly) together with database entries (machine-friendly); the database entries feed the ATHENa server, which supports database queries and a ranking of designs.]

ATHENa Major Features (1)


- synthesis, implementation, and timing analysis in the batch mode
- support for devices and tools of multiple FPGA vendors
- generation of results for multiple families of FPGAs of a given vendor
- automated choice of a best-matching device within a given family

ATHENa Major Features (2)


- automated verification of the design through simulation in the batch mode
- exhaustive search for optimum options of the tools, OR heuristic adaptive optimization strategies aimed at maximizing selected performance measures (e.g., speed, area, speed/area ratio, power, cost, etc.)

Multi-Pass Place-and-Route Analysis


[Chart: minimum clock period of GMU SHA-512 on Xilinx Virtex 5 over 100 place-and-route runs with different placement starting points (the smaller the better); the spread between the best and the worst run is about 20%.]
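The roughly 20% spread reported above can be computed from the minimum clock periods returned by the individual place-and-route runs; a minimal sketch with made-up periods (not the actual GMU SHA-512 data):

```python
# Spread between best and worst minimum clock period across
# place-and-route runs with different placement starting points.
# Periods below are illustrative, not the actual GMU SHA-512 data.
periods_ns = [6.8, 7.1, 6.5, 7.4, 6.9, 7.7, 6.6, 7.2]

best = min(periods_ns)    # the smaller the better
worst = max(periods_ns)
spread_pct = 100.0 * (worst - best) / best

print(f"best = {best} ns, worst = {worst} ns, spread = {spread_pct:.1f}%")
```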

Dependence of Results on Requested Clock Frequency


ATHENa Applications
- single_run: one set of options
- placement_search: one set of options; multiple starting points for placement
- exhaustive_search: multiple sets of options; multiple starting points for placement; multiple requested clock frequencies

SHA-1 Results
[Chart: SHA-1 throughput (Mbit/s) for several architectures on Xilinx Virtex 5, Virtex 4, and Spartan 3.]

ATHENa Results for SHA-1, SHA-256 & SHA-512


[Chart: throughput (Mbit/s, 0 to 2000) of SHA-1, SHA-256, and SHA-512 across FPGA families: Spartan 3, Virtex 4, Virtex 5, Cyclone II, Cyclone III, Stratix II, Stratix III.]

Ideas (1)
Select several representative FPGA platforms with significantly different properties, e.g.:
- vendor: Xilinx vs. Altera
- process: 90 nm vs. 65 nm
- LUT size: 4-input vs. 6-input
- optimization target: low-cost vs. high-performance

Use ATHENa to characterize all SHA-3 candidates and SHA-2 on these platforms in terms of the target performance metrics (e.g., throughput/area ratio)

Ideas (2)
- Calculate the ratio of SHA-3 candidate performance to SHA-2 performance (for the same security level)
- Calculate the geometric mean over multiple platforms
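The two steps can be sketched as follows; the per-platform ratios are hypothetical, not measured SHA-3 results:

```python
from math import prod

# Hypothetical throughput/area ratios of a SHA-3 candidate vs. SHA-2
# at the same security level, one value per FPGA platform.
ratio_per_platform = {
    "Spartan 3": 0.8,
    "Virtex 5": 1.2,
    "Cyclone III": 0.9,
    "Stratix III": 1.1,
}

# Geometric mean over platforms; unlike the arithmetic mean, it treats
# a 2x speedup and a 2x slowdown on two platforms as canceling out.
values = list(ratio_per_platform.values())
geo_mean = prod(values) ** (1.0 / len(values))

print(f"geometric mean ratio = {geo_mean:.3f}")
```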


Xilinx FPGA Devices


Technology    Low-cost     High-performance
120/150 nm    -            Virtex 2, 2 Pro
90 nm         Spartan 3    Virtex 4
65 nm         -            Virtex 5
45 nm         Spartan 6    -
40 nm         -            Virtex 6

Xilinx FPGA Device Support by Tools

Version              Low-cost                                                High-performance
Xilinx ISE 10.1      all, up to Virtex 5                                     all, up to Virtex 5
Xilinx WebPACK 11.1  smallest, up to Virtex 5                                smallest, up to Virtex 5
Xilinx WebPACK 11.3  smallest, up to Virtex 5; smallest Spartan 6, Virtex 6  smallest, up to Virtex 5; smallest Spartan 6, Virtex 6

Altera FPGA Devices


Technology   Low-cost      Mid-range   High-performance
130 nm       Cyclone       -           Stratix
90 nm        Cyclone II    Arria I     Stratix II
65 nm        Cyclone III   -           Stratix III
40 nm        Cyclone IV    Arria II    Stratix IV

Altera FPGA Device Support by Tools


Version                    Low-cost                              Mid-range                           High-performance
Quartus 7.1                Cyclone III all, Cyclone IV none      Arria GX all, Arria II GX none      Stratix II smallest, Stratix III none
Quartus 8.1                Cyclone III all, Cyclone IV none      Arria GX all, Arria II GX none      Stratix I, II, III smallest
Quartus 9.0 sp2 (Sep. 09)  Cyclone III all, Cyclone IV none      Arria GX all, Arria II GX none      Stratix I, II, III smallest
Quartus 9.1 (Nov. 09)      Cyclone III all, Cyclone IV smallest  Arria GX all, Arria II GX smallest  Stratix I, II, III all, Stratix IV none

FPGA and ASIC Performance Measures


The common ground is vague


Hardware performance: cycles per block, cycles per byte, latency (cycles), latency (ns), throughput for long messages, throughput for short messages, throughput at 100 kHz, clock frequency, clock period, critical path delay, Modexp/s, PointMul/s

Hardware cost: slices, slices occupied, LUTs, 4-input LUTs, 6-input LUTs, FFs, gate equivalents (GE), size on ASIC, DSP blocks, BRAMs, number of cores, CLBs, MUL, XOR, NOT, AND

Hardware efficiency: hardware performance / hardware cost

Our Favorite Hardware Performance Metrics:

Mbit/s for Throughput
ns for Latency

Allows for easy cross-comparison among implementations in software (microprocessors), FPGAs (various vendors), and ASICs (various libraries)

But how to define and measure throughput and latency for hash functions?
Time to hash N blocks of message:
Htime(N, TCLK) = Initialization Time(TCLK) + N * Block Processing Time(TCLK) + Finalization Time(TCLK)

Latency = time to hash ONE block of message:
Latency = Htime(1, TCLK) = Initialization Time + Block Processing Time + Finalization Time

Throughput (for long messages):
Throughput = Block size / (Htime(N+1, TCLK) - Htime(N, TCLK)) = Block size / Block Processing Time(TCLK)

But how to define and measure throughput and latency for hash functions?
Initialization Time(TCLK) = cyclesI * TCLK
Block Processing Time(TCLK) = cyclesP * TCLK
Finalization Time(TCLK) = cyclesF * TCLK

where:
- TCLK comes from the place & route report (or experiment)
- Block size comes from the specification
- cyclesI, cyclesP, cyclesF come from analysis of the block diagram and/or functional simulation
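Combining the two slides, latency and long-message throughput follow directly from the cycle counts and the clock period; a sketch with illustrative numbers (a 512-bit block, hypothetical cycle counts, 10 ns clock, not any real design):

```python
def latency_ns(cycles_i, cycles_p, cycles_f, t_clk_ns):
    """Htime(1, TCLK): time to hash ONE block of message, in ns."""
    return (cycles_i + cycles_p + cycles_f) * t_clk_ns

def throughput_mbit_s(block_size_bits, cycles_p, t_clk_ns):
    """Long-message throughput = block size / block processing time."""
    return block_size_bits / (cycles_p * t_clk_ns) * 1e3  # bit/ns -> Mbit/s

# Illustrative values: 512-bit block, 5 init cycles, 65 cycles per
# block, 3 finalization cycles, 10 ns clock period (100 MHz).
print(latency_ns(5, 65, 3, 10.0))        # 730.0 ns
print(throughput_mbit_s(512, 65, 10.0))  # ~787.7 Mbit/s
```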

How to compare hardware speed vs. software speed?


eBASH reports (http://bench.cr.yp.to/results-hash.html):

In graphs: Time(n) = time in clock cycles vs. message size in bytes for n-byte messages, with n = 0, 1, 2, 3, ..., 2048, 4096
In tables: performance in cycles/byte for n = 8, 64, 576, 1536, 4096, and long messages

Performance for long message = (Time(4096) - Time(2048)) / 2048
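The long-message figure can be reproduced from two Time(n) measurements; the cycle counts below are illustrative, not real eBASH data:

```python
# eBASH long-message performance in cycles/byte:
#   (Time(4096) - Time(2048)) / 2048
# The subtraction cancels the fixed per-message overhead, leaving
# only the marginal cost per byte.
time_cycles = {2048: 31000, 4096: 59000}  # illustrative measurements

cpb_long = (time_cycles[4096] - time_cycles[2048]) / 2048
print(f"long-message performance = {cpb_long:.2f} cycles/byte")
```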

How to compare hardware speed vs. software speed?

Throughput [Gbit/s] = (8 bits/byte * clock frequency [GHz]) / performance for long message [cycles/byte]
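As a sanity check of the conversion (illustrative numbers: a 3 GHz processor at 13.7 cycles/byte, not a measured result):

```python
def sw_throughput_gbit_s(clock_ghz, cycles_per_byte):
    # Throughput [Gbit/s] = 8 bits/byte * clock frequency [GHz]
    #                       / performance [cycles/byte]
    return 8.0 * clock_ghz / cycles_per_byte

print(sw_throughput_gbit_s(3.0, 13.7))  # ~1.75 Gbit/s
```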


How to measure hardware cost in FPGAs?

1. Stand-alone cryptographic core on FPGA
Cost of the smallest FPGA that can fit the core. Unit: USD [FPGA vendors would need to publish the MSRP (manufacturer's suggested retail price) of their chips, which is not very likely], or size of the chip in mm2 (easy to obtain).

2. Part of an FPGA System-on-Chip
Vector: (CLB slices, BRAMs, MULs, DSP units) for Xilinx; (LEs, memory bits, PLLs, MULs, DSP units) for Altera

3. FPGA prototype of an ASIC implementation
Force the implementation to use only reconfigurable logic (no DSPs or multipliers, distributed memory instead of BRAM). Use CLB slices as the metric [LEs for Altera].

How to measure hardware cost in ASICs?

1. Stand-alone cryptographic core
Cost = f(die area, pin count); tables/formulas available from semiconductor foundries.

2. Part of an ASIC System-on-Chip
Cost ~ circuit area. Units: um2 or GE (gate equivalent), the size of a NAND2 cell.
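The GE metric normalizes circuit area by the NAND2 cell area of the same standard-cell library, making areas comparable across technologies; a sketch with hypothetical areas:

```python
# Gate equivalents (GE): circuit area divided by the area of one
# NAND2 cell from the same standard-cell library.
# Both areas below are hypothetical, not from any real library.
circuit_area_um2 = 52000.0
nand2_area_um2 = 5.2

ge = circuit_area_um2 / nand2_area_um2
print(f"area = {ge:.0f} GE")
```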


Deliverables (1)
1. Detailed block diagram of the Datapath, with names of all signals matching the VHDL code [electronic version a bonus]
2. Interface with the division into the Datapath and the Controller [electronic version]
3. ASM charts of the Controller, and a block diagram of connections among FSMs (if more than one is used) [electronic version a bonus]
4. RTL VHDL code of the Datapath, the Controller, and the Top-Level Circuit
5. Updated timing and area analysis formulas, with timing confirmed through simulation

Deliverables (2)
6. Report on verification
- highest-level entity verified for functional correctness: functional simulation, post-synthesis simulation, timing simulation [bonus]
- verification of lower-level entities: name of entity, testbench used for verification, result of verification, incorrect behavior, possible source of error

Deliverables (3)
7. Results of benchmarking using ATHENa
- entire core, or the highest-level entity verified for correct functionality
- Xilinx Spartan 3, Virtex 4, Virtex 5
- three methods of testing:
  - single_run
  - placement_search [cost_table = 1, 11, 21]
  - exhaustive_search [cost_table = 31, 41, 51; speed or area; two sets of requested frequencies]
- results generated by ATHENa, your own graphs and charts, observations and conclusions

Bonus Deliverables (4)


8. Pseudocode [but not C code]
9. Bugs and suspicious behavior of ATHENa
10. Additional results of benchmarking using ATHENa
- Altera Cyclone II, Stratix II, Cyclone III, Arria I, Stratix III
- three methods of testing:
  - single_run
  - placement_search [seed = 1, 1000, 2000]
  - exhaustive_search [seed = 3000, 4000, 5000; speed or area; two sets of requested frequencies]
- results generated by ATHENa, your own graphs and charts, observations and conclusions

Bonus Deliverables (5)


11. Report from the meeting with students working on the same SHA core
- summary of major differences
- advantages and disadvantages of your design

12. Bugs found in the:
- padding script
- testbench
- class examples
- slides
- documentation
- SHA-3 packages
- etc.

Bonus Deliverables (6)


13. Extending the design to cover all hash function variants
- hash value sizes: 512 [highest priority], 384, 224
- other variant/parameter support specific to a given hash function
- support through generics or constants

14. Padding in hardware, assuming that the message size before padding is already a multiple of:
- the word size
- the byte size
- a single bit

Composition of Students
4 GWU PhD candidates

14 local students (with 3 former BSCpE graduates)

14 international students


After Grading
1. Summary of results published on the course web page
2. Selected students invited to develop articles/reports to be posted on:
- the ATHENa web page
- the SHA-3 Zoo web page
3. Unification, generalization, and optimization of codes by Ice, myself, and other students
4. Presentation to NIST, conference submissions, presentation at the Second SHA-3 Conference in Santa Barbara in August 2010
