
An evaluation of methods for FPGA implementation from a Matlab description

Using a Matlab model as the origin, an evaluation of VHDL generation using AccelDSP and a method for comparing implementations on FPGA and PowerPC.

Kristofer Lindberg Kristofer Nissbrandt

Master's Degree Project, Stockholm, Sweden 2008

XR-EE-SB 2008:014

Public REPORT

Prepared (also subject responsible if other): Kristofer Lindberg, Kristofer Nissbrandt
Approved: DD/GFC (Anette Hansson)
No.: SMW/DD/G08:070En
Date: 2008-09-18

An evaluation of FPGA implementations from a Matlab description

Using a Matlab model as the origin, an evaluation of VHDL generation using AccelDSP and a method for comparing implementations on FPGA and PowerPC.

A Master Thesis by: Kristofer Lindberg, Kristofer Nissbrandt

Supervisor: Anders Bergker, Department of Data and Signal Processing, Saab Microwave Systems

Abstract

Algorithms for signal processing and radar control at Saab Microwave Systems (SMW) are implemented on a platform consisting of several boards equipped with programmable logic, in the form of Field Programmable Gate Array (FPGA) devices and PowerPC processors. Different boards are responsible for different parts of the processing, which is performed in parallel on the incoming Video Data from the antenna. It is hard to know whether a certain algorithm should be implemented on a FPGA or a PowerPC to obtain the best realisation. Today, implementations targeting FPGAs are written in the hardware description language VHDL, which leads to a very long implementation time. This is especially a problem for rapid prototyping, used by SMW to develop new functionality in their radars. Generating VHDL automatically from a high-level description would reduce the implementation time. In this thesis Xilinx AccelDSP, a software package for generating VHDL from a high-level MathWorks Matlab description, has been evaluated. A method for deciding whether an algorithm should be implemented on a FPGA or a PowerPC has also been developed. The method is based on profiling implementations on the two platforms made with code generated from the same Matlab description: VHDL is generated using AccelDSP and C-Code using MathWorks Real Time Workshop Embedded Coder (EMLC). Five different aspects of AccelDSP, and how it can be used by SMW, have been evaluated. The result is that the main purpose of AccelDSP at SMW is to be used in the method for platform decision. The software can also be used for generating parts of a design, but this is not recommended: the drawbacks of VHDL generated from AccelDSP outweigh the reduced design time. The main problems are the performance, the reliability of the description, the readability and the difficulty of maintaining the source. The method for platform decision gives an estimate of the execution time and the resource usage on both platforms for a specific design, and includes a discussion of other metrics important for the decision. The estimated metrics are accurate enough, since they give an indication of the resulting performance usable for deciding implementation platform.

Contents

List of acronyms
Terminology
1 Introduction
  1.1 Hardware implementation
  1.2 Problem definition
    1.2.1 Problem statement
  1.3 Related work
    1.3.1 Xilinx System Generator
    1.3.2 Altera DSP Builder
    1.3.3 MathWorks Simulink HDL Coder
    1.3.4 MathWorks Filter Design HDL Coder
    1.3.5 The HPEC Challenge Benchmark Suite
    1.3.6 Implementation using Xilinx System Generator
    1.3.7 Matlab to C-Code Translation
2 Background
  2.1 FPGA Architecture
    2.1.1 Applications
    2.1.2 Architecture description
  2.2 FPGA development using VHDL
    2.2.1 Implementation
    2.2.2 Simulation
    2.2.3 Intellectual Properties
  2.3 Processor Architecture
    2.3.1 Fundamental design
    2.3.2 Pipeline
    2.3.3 Memory
    2.3.4 PowerPC
  2.4 Benchmarking
    2.4.1 Benchmarking fundamentals
    2.4.2 Processor benchmarking
    2.4.3 FPGA benchmarking
    2.4.4 Digital signal processing benchmarking
  2.5 Xilinx AccelDSP
    2.5.1 Workflow
    2.5.2 AccelWare
    2.5.3 Original compiler implementation
  2.6 MathWorks Matlab
    2.6.1 Native code generation
3 Method
  3.1 VHDL generation and simulation
    3.1.1 VHDL code generation
    3.1.2 Verification
    3.1.3 Evaluation
  3.2 C-Code generation and simulation
    3.2.1 C-Code generation
    3.2.2 Compilation
    3.2.3 Evaluation
  3.3 Platform decision
    3.3.1 Method
4 Algorithms
  4.1 FFT
    4.1.1 Implementation
    4.1.2 Evaluation
  4.2 Correlation
    4.2.1 Implementation
    4.2.2 Evaluation
  4.3 Matrix multiplication
    4.3.1 Matlab implementation
    4.3.2 VHDL reference design
    4.3.3 Evaluation
  4.4 FIR filter
    4.4.1 Evaluation
  4.5 CORDIC
    4.5.1 Implementation
    4.5.2 Evaluation
  4.6 CFAR
    4.6.1 Implementation
    4.6.2 Evaluation
  4.7 MUSIC
    4.7.1 Implementation
    4.7.2 Evaluation
  4.8 Cartesian to polar map transformation
    4.8.1 Implementation
    4.8.2 Evaluation
5 Results
  5.1 VHDL generation using AccelDSP
    5.1.1 FFT
    5.1.2 Correlation
    5.1.3 Matrix multiplication
    5.1.4 FIR filter
    5.1.5 CORDIC
    5.1.6 CFAR
    5.1.7 MUSIC
    5.1.8 Cartesian to polar map transformation
  5.2 C-Code generation using EMLC
    5.2.1 FFT
    5.2.2 MUSIC
    5.2.3 Cartesian to polar map transformation
6 Discussion
  6.1 Evaluation of AccelDSP
    6.1.1 Usability
    6.1.2 Performance
    6.1.3 Rapid prototyping
    6.1.4 Technology independence
    6.1.5 Accuracy
  6.2 Comparing a processor to a FPGA
    6.2.1 Speed
    6.2.2 Throughput
    6.2.3 Resource usage
    6.2.4 Memory
    6.2.5 Power
    6.2.6 I/O
  6.3 Implementation decision based on profiling
    6.3.1 Implementation decision
    6.3.2 Accuracy
    6.3.3 Usability
  6.4 Future work
    6.4.1 Evaluation of AccelDSP
    6.4.2 Implementation decision
    6.4.3 Evaluation of other tools
7 Conclusions
  7.1 VHDL generation using AccelDSP
    7.1.1 Evaluation of algorithms for AccelDSP
    7.1.2 Evaluation of AccelDSP
  7.2 Implementation decision based on profiling
Appendices
A Programs used
B Performance data from PSIM
C Makefile for GCC and PSIM
D Cartesian to polar map transformation
  D.1 CPU implementation
  D.2 FPGA implementation
E VHDL Constraints

List of acronyms

Arithmetical Logic Unit (ALU) See p. 26.
American National Standards Institute (ANSI) See p. 16.
American Standard Code for Information Interchange (ASCII) See p. 38.
Application Specific Integrated Circuit (ASIC) See p. 19.
Abstract Syntax Tree (AST) See p. 39.
Computer-Aided Design (CAD) See p. 20.
Constant False Alarm Rate (CFAR) See p. 64.
Configurable Logical Block (CLB) See p. 20.
Component Object Model (COM) See p. 95.
COordinate Rotation DIgital Computer (CORDIC) See p. 44.
Cycles Per Instruction (CPI) See p. 53.
Central Processing Unit (CPU) See p. 11.
Defence Advanced Research Projects Agency (DARPA) See p. 16.
Digital Clock Management (DCM) See p. 21.
Discrete Fourier Transform (DFT) See p. 55.
Digital Signal Processing (DSP) See p. 12.
Digital Signal Processor (DSP) See p. 109.
Xilinx Virtex-5 DSP slices (DSP48E) See p. 21.
Executable and Linkable Format (ELF) See p. 50.
Embedded Matlab Subset (EML) See p. 41.
Real Time Workshop Embedded Coder (EMLC) See p. 41.
Fast Fourier Transform (FFT) See p. 55.
Finite Impulse Response (FIR) See p. 13.
Field Programmable Gate Array (FPGA) See p. 11.
Front Side Bus (FSB) See p. 104.
GNU Compiler Collection (GCC) See p. 50.
GNU Debugger (GDB) See p. 51.
Graphical User Interface (GUI) See p. 95.
Institute of Electrical and Electronics Engineers, Inc (IEEE) See p. 16.
Infinite Impulse Response (IIR) See p. 40.
Input/Output (I/O) See p. 20.
Instructions Per Cycle (IPC) See p. 25.
Look-Up Table (LUT) See p. 20.
Matlab Code (M-Code) See p. 41.
Multiply Accumulate (MAC) See p. 21.
Mapping (MAP) See p. 23.
Mega Samples Per Second (MSPS) See p. 72.
Multiple Signal Classification Method (MUSIC) See p. 67.
Operating System (OS) See p. 31.
Place And Route (PAR) See p. 24.
Program Counter (PC) See p. 25.
Programmable Logic Device (PLD) See p. 19.
Phase-Locked Loop (PLL) See p. 20.
Plane Position Indicator (PPI) See p. 13.
Random Access Memory (RAM) See p. 21.
Reduced Instruction Set Computer (RISC) See p. 27.
Read Only Memory (ROM) See p. 24.
Register Transfer Level (RTL) See p. 22.
Synthetic Aperture Radar (SAR) See p. 16.
Saab Microwave Systems (SMW) See p. 13.
Static Random Access Memory (SRAM) See p. 20.
VERIfy LOGic (VERILOG) See p. 14.
VHSIC Hardware Description Language (VHDL) See p. 11.
Very High Speed Integrated Circuit (VHSIC).
Wide Sense Stationary (WSS) See p. 58.

Terminology

AccelDSP: A software package from Xilinx for translating floating-point M-Code to fixed-point VHDL, p. 35.
AccelWare: A library of functions that can be used in AccelDSP to replace generic Matlab functions with functions optimized for generating VHDL, p. 39.
bitstream: Information used for programming a FPGA, generated after PAR in the VHDL workflow, p. 22.
CORE Generator: The Xilinx software for distributing IP-Blocks, p. 24.
execution unit: A unit in the datapath of a CPU responsible for a certain type of operations, p. 25.
IP-Block: An intellectual property block, an algorithm implementation provided by a FPGA manufacturer under a licence term, p. 24.
ISE: Xilinx ISE Design Suite, a suite containing all programs needed to create FPGA designs, p. 12.
Matlab: A high-level technical computation environment from MathWorks, p. 41.
netlist: A technology-dependent description of a design, generated by the synthesis step in the VHDL workflow, p. 23.
PowerPC: Processor with RISC architecture created by Apple, IBM and Motorola in 1991, based on the POWER architecture created by IBM, p. 27.
profiler: A software tool used for analysing the performance of a program by running it in a simulated environment, p. 31.
PSIM: A PowerPC simulator for performance evaluation and simulations, integrated with GDB, p. 51.
Simulink: An environment, supported by a graphical user interface, for simulation and model-based design of embedded and dynamic systems, from MathWorks, p. 15.
synthesize: A design process stage where VHDL is synthesized into a technology-dependent netlist, p. 23.
video data: Radar information after the early signal processing stages, before target identification, p. 64.
xc5vsx95t-ff1136-3: A Xilinx Virtex-5 SXT FPGA device optimized for DSP and memory-intensive applications, p. 46.
XST: Xilinx Synthesis Tool, a free synthesis tool included in ISE, p. 46.

Chapter 1 Introduction
Saab Microwave Systems (SMW) manufactures radar and other sensor systems for air, ground and sea, primarily for military but also for civil security domains. The goal is to provide their customers with information superiority, which gives them the ability to observe and react to situations more accurately and rapidly than their opponents.

The platform for signal processing is, together with the antenna and the components for sending, receiving and generating microwaves, the core functionality of a radar. The platform for signal processing modulates and directs generated radar pulses, commonly called beamforming, and processes the received signals. The processing extracts target identification, including velocity, course, bearing and distance, as well as classification of targets, including hovering helicopters. SMW can provide complete radar systems including cabin, operator interface and support systems. Identified targets are transferred through an Ethernet interface to a system for data processing and presentation. The most common way to present targets is to use a Plane Position Indicator (PPI), optionally projected on a map.

The platform for signal processing consists of several boards equipped with programmable logic, Field Programmable Gate Array (FPGA) devices and PowerPC processors. Different boards are responsible for different parts of the processing, which is performed in parallel on the incoming data. The data stream for the surface radars with 18 antenna lobes consists of 18 channels with complex data sampled at a rate of 3.125 MHz. This corresponds to a sequential rate of 112.5 million words per second (18 channels x 3.125 MHz x 2 words per complex sample).

When new or improved functionality is added, algorithms are realised in the same way as related algorithms in former projects, without knowing whether it could be done more effectively. For example, if a FIR filter with 16 taps was described in VHDL for implementation in a FPGA during one project and a filter with 32 taps is needed in a following project, the realisation will be done using the same method. The reason is that it is very hard to show how the best realisation is made. It could be possible that it is more suitable to write the FIR filter in the previous example in C-Code for implementation on a processor, or to generate VHDL from a Matlab model for implementation on a FPGA. To develop a method for making platform decisions and to reduce the implementation time towards FPGA platforms, SMW became interested in the software AccelDSP. The interest in AccelDSP, which can generate VHDL from Matlab, is the origin of this thesis.

1.1 Hardware implementation

An algorithm can be implemented in many ways, but the thesis is restricted to implementations on FPGA and PowerPC. The PowerPC is an example of a processor platform and the implementation procedure is similar if another processor is of interest. The natural way to make the implementations is to write code for the processor in a native programming language like C/C++ or Ada. The FPGA implementation is made in a hardware description language like VHDL or VERILOG. The drawback of using native languages is the very long implementation time. If the implementation can be generated from a description in a high-level language, where the algorithm can be implemented fast, a lot of time can be saved. Figure 1.1 shows how an algorithm described in Matlab can be implemented using translations between different languages.

Figure 1.1: A Matlab model can be translated using these paths for implementation on a processor or a FPGA (Matlab to C/C++ to Processor, or Matlab to VHDL/Verilog to FPGA).

1.2 Problem definition

The aim of the thesis is to develop a procedure for making better decisions about realisations of algorithms for signal processing and radar control than what is currently possible at SMW. The procedure should assist in deciding the implementation platform for a certain algorithm and whether the implementation should be generated or hand-coded. A Matlab design of the algorithm is the origin for the evaluation of implementations on PowerPC and FPGA. For generating VHDL from Matlab, AccelDSP has been evaluated, and for generating C-Code, Real Time Workshop Embedded Coder (EMLC) has been used. The evaluation of AccelDSP is a requirement from SMW, and EMLC is used because it has been evaluated by Marcus Mülleger and an evaluation license was provided by MathWorks. The work made by Mülleger, see section 1.3.7, is used as a reference for the performance of C-Code generation and the evaluation of the functionality of different translators. Only AccelDSP has been evaluated in this thesis and no further evaluation of EMLC has been performed.

1.2.1 Problem statement

1. Is AccelDSP useful for SMW and which purposes can it fulfil?

2. How can an algorithm be evaluated to see if it is suitable for VHDL generation from Matlab using AccelDSP, or if it fits better to describe the algorithm directly in VHDL?

3. How can profiling on the two platforms, with code generated from Matlab, be used to determine whether an algorithm should be implemented on a PowerPC or a FPGA?

1.3 Related work

This section describes software and work related to this thesis.

1.3.1 Xilinx System Generator

This software is closely related to AccelDSP since it is used for generating FPGA implementations from Matlab. The difference is that it generates technology-dependent netlists from Simulink, not VHDL from Matlab Code (M-Code). This software is therefore unsuitable for SMW. The product contains a block library for Simulink for modelling DSP algorithms suitable for FPGA implementation and a tool for generating the implementation.


1.3.2 Altera DSP Builder

This is the equivalent of Xilinx System Generator but from the FPGA manufacturer Altera. In this program it is possible to generate VHDL files, which is not possible in System Generator. The program is designed for Altera devices and is therefore not useful for SMW.

1.3.3 MathWorks Simulink HDL Coder

Simulink HDL Coder is MathWorks' equivalent to Xilinx System Generator and Altera DSP Builder. It generates VHDL and VERILOG files, which are stated to be target independent and compliant with the IEEE 1076 standard, from MathWorks Simulink. Matlab code from the Embedded Matlab Subset (EML) can be generated using a special block in Simulink. An evaluation of this program is considered as future work.

1.3.4 MathWorks Filter Design HDL Coder

This is a specialised tool for generating VHDL and VERILOG code for fixed-point filters designed with the Matlab Filter Design Toolbox. The code is stated to be compliant with the IEEE 1076 standard.

1.3.5 The HPEC Challenge Benchmark Suite

The Lincoln Laboratory at the Massachusetts Institute of Technology has developed an advanced benchmarking suite for radar systems in ANSI C, sponsored by DARPA. The suite contains both kernel benchmarks, including signal, image, information and knowledge processing algorithms, and a SAR system benchmark. The latter is available as M-Code.

1.3.6 Implementation using Xilinx System Generator

Xilinx System Generator has been evaluated in a master thesis at SMW, at the time called Ericsson Microwave Systems, in cooperation with Linköping University of Technology in 2004 [5]. The evaluation is made using two complex algorithms and not a kernel set of algorithms as in this thesis. The two algorithms are a base band modulator and an algorithm for automatic gain control. The resource utilisation is compared to a hand-coded implementation for the modulator but not for the gain control. The timing performance is not compared with a reference design but is verified to fulfil the requirements set by SMW. This is possible since SMW uses implementations of both algorithms and therefore has design specifications to compare with.


The result of the thesis is that Xilinx System Generator can be used to effectively implement a model of a digital function in an FPGA from Xilinx.

1.3.7 Matlab to C-Code Translation

A functional and performance evaluation of C-Code generation from Matlab has been made in a master thesis by Marcus Mülleger for Ericsson AB in cooperation with Halmstad University [17]. In detail, two software packages for translating M-Code to C-Code have been evaluated: EMLC, used in this thesis for C-Code generation, and Matlab to C Synthesis (MCS). The thesis compares the two translators, but also their performance compared to hand-coded implementations. The following three aspects are studied:

1. Generation of reference code

2. Target code generation

3. Floating to fixed-point conversion

The benchmarking technique is a combination of using both simple and complex algorithms, and different algorithms are used for evaluating different aspects. The result is that MCS provides better support for C algorithm reference generation, by covering a larger set of the Matlab language, while EMLC is more suitable for direct target implementation. EMLC only allocates memory statically, which makes the supported subset of the M-Code language smaller.


Chapter 2 Background
This chapter contains background information for better understanding of the report. Only the sections covering material the reader is not already familiar with need to be read, since the report refers back to this chapter wherever background information is needed. The chapter contains information about FPGA and processor architecture, an introduction to benchmarking and descriptions of the two software packages AccelDSP and Matlab.

2.1 FPGA Architecture

This section contains a brief introduction to FPGA architecture, needed to understand the performance metrics discussed in section 2.4.3 and the major differences between processor and FPGA architecture.

2.1.1 Applications

A FPGA has many advantages, compared to other platforms, since it allows for a fast and programmable implementation. The main drawbacks are the high price and the complicated implementation procedure compared to a CPU implementation. FPGAs are used for small batches, for rapid prototyping when the design of an Application Specific Integrated Circuit (ASIC) is too time-consuming, and when a reprogrammable hardware implementation is desired. Small batches will normally lead to a higher price per chip for ASICs than for FPGAs.

2.1.2 Architecture description

A Field Programmable Gate Array (FPGA) contains programmable logic and is thus a Programmable Logic Device (PLD). The logic is contained in blocks which can be programmed to perform basic logical operations as well as complex mathematical functions, and the blocks can be connected by programmable interconnections. FPGAs have a memory element, such as a flip-flop, in their blocks for performing synchronous operations. The logical blocks are arranged in a 2-dimensional array with interconnections for connecting different logical blocks to each other and blocks to I/O pads. There are different ways of creating a programmable connection matrix. The most common are SRAM, which is reprogrammable, and antifuse, which is not reprogrammable.

As explained above, the FPGA contains a predefined structure of logic which limits the possible designs and performance, such as timing and power usage, compared to ASICs. Most FPGA architectures are possible to reprogram in circuit; it is from this ability the term Field Programmable comes. Using a FPGA, instead of an ASIC, makes it possible to update the hardware in end-user products for patching or upgrades.

Since a FPGA contains configurable logic, designed using a CAD program, it is possible to design parallel structures optimised for a given task. The only limitation on the work that can be executed in parallel is the number of I/O pads and logic blocks. This should be compared to a processor with a fixed hardware layout capable of executing instructions sequentially. Some non-deterministic parallelism is introduced in modern processor architectures using advanced out-of-order execution, but this is far more limited than the parallelism that can be implemented in hardware.

For running different parts of the device at different clock frequencies, most FPGAs contain several Phase-Locked Loops (PLLs). A PLL is a control system used to generate a signal with higher or lower frequency than a reference signal while keeping a fixed relation to its phase. It uses a negative feedback loop to control an oscillator so it maintains a constant phase angle relative to the reference signal.

2.1.2.1 XILINX Virtex5

Xilinx uses a Look-Up Table (LUT) based design for the logical blocks. A LUT is a small one-bit-wide memory which can be used as a logical function, where the address lines are the inputs and the one-bit memory output is the output. The logic is realised by properly programming the 2^k bits of the memory, where k is the number of inputs. Each logical block, by Xilinx called a Configurable Logical Block (CLB), contains four slices, and each slice contains LUTs, flip-flops and logic for, for example, carry-chain operations. The interconnections between CLBs consist of segments of different length, spanning from one CLB pair to the entire length of the chip. For routing a signal between two CLBs the signal must pass through at least one switch. The switch is usually a SRAM cell. The performance depends on how well the CAD tools manage to route the signals.
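To make the LUT principle concrete, the following C fragment (an illustration added for this text, not code from the thesis) models a k-input LUT as a 2^k-entry truth table: "programming" the LUT corresponds to filling in the table, and evaluating the logic is just an array lookup.

    #include <stdint.h>
    #include <stdio.h>

    /* Conceptual model of a k-input LUT: the 2^k configuration bits form the
     * truth table and the k input signals form the address into it. */
    static int lut_eval(const uint8_t *config_bits, unsigned k, unsigned inputs)
    {
        unsigned address = inputs & ((1u << k) - 1u);  /* keep only k input bits */
        return config_bits[address] & 1;
    }

    int main(void)
    {
        /* "Program" a 2-input LUT as an XOR gate: truth table 0,1,1,0. */
        const uint8_t xor_table[4] = { 0, 1, 1, 0 };
        for (unsigned in = 0; in < 4; in++)
            printf("inputs %u%u -> %d\n", (in >> 1) & 1, in & 1,
                   lut_eval(xor_table, 2, in));
        return 0;
    }

In the FPGA the table is held in the configuration memory of the LUT, which is why any Boolean function of k inputs occupies the same amount of logic.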


The number of slices, their components, as well as the design of the supporting logic and interconnection matrix, differ between different devices. For increased performance, specialised blocks are added to the FPGA architecture by many manufacturers. These could be specialised versions of slices or blocks supporting advanced operations. The Virtex5 series has two types of slices, SLICEL and SLICEM, the latter slightly more complex, on-chip 36 Kbit BlockRAM and specialised DSP slices. A block diagram of the slices can be found in [27, pp. 173-174]. All slices have a 6-input LUT as logic function generator, a carry chain for implementation of fast adder/subtracter structures, a multiplexer and a flip-flop. The SLICEM has primitives instead of an ordinary LUT, which can be implemented as distributed RAM, shift registers or a LUT.

Since FPGAs are often used for signal processing, the Virtex5 series has built-in Xilinx Virtex-5 DSP slices (DSP48E) with support for commonly used DSP operations. These functions include multiply, Multiply Accumulate (MAC), multiply-add, three-input add, barrel shift, wide-bus multiplexing, magnitude comparison, etc. The architecture also supports cascading multiple slices to form wide math functions, DSP filters, and complex arithmetic without the use of general FPGA fabric [28].

For generating internal clock signals the Virtex5 series contains Digital Clock Management (DCM) blocks as well as PLLs. The DCM provides advanced clocking capabilities, including eliminating clock skew on clock nets, phase shifting signals and generating new clock signals by a mixture of clock multiplication and division. Clock skew is the difference in timing delay between the original clock signal and the signal that arrives at different components in the device. The difference arises from different wire lengths, temperature differences, capacitive coupling, etc.


2.2 FPGA development using VHDL

There are several steps involved in the typical design flow when implementing on a FPGA using the VHSIC Hardware Description Language (VHDL). The first is to describe the implementation using one of the three abstraction levels available in VHDL.

Behavioural level is the highest level of abstraction. This level is used to model algorithms and system behaviour without specifying clock information. Some synthesis tools are available that can take behavioural VHDL code as input, but the lack of architecture details means that the implementation can be inefficient in terms of area and performance.

RTL stands for Register Transfer Level. An RTL description of an algorithm has an explicit clock, which means that all operations are scheduled to occur in specific clock cycles. The behaviour of the circuit is defined in terms of all registers and the flow of signals between them. This is the most common level used for synthesis.

Gate level is the lowest level of abstraction in VHDL. A gate level description consists of a network of gates and registers instantiated from a technology-specific library. This library provides information about the components' timing and gate delays.

VHDL supports syntax for dividing the code into components and packages. This way generic components can be created, which are easy to reuse. The components can be connected together to form application specific algorithms. VHDL was developed at the behest of the US Department of Defense and is originally a modelling language for documentation and simulation of circuits. The language is standardised by IEEE for simulation, but only a subset of the language is supported for synthesis and there is no standard for which subset should be supported. Fortunately the same subset is supported by the most common programs used for synthesis [20].

2.2.1 Implementation

When the algorithm has been described in VHDL, the code must be translated to a bitstream, which can be used for programming the FPGA. The bitstream is the information stored in the SRAM used for the interconnections, the bits stored in the LUTs and configuration information, for example for the primitives. This translation is the equivalent of compiling in the software world and consists of several steps. After most steps a report containing information about the design, including performance data, is generated.


2.2.1.1 Synthesis

The first step is to synthesize the VHDL code into a netlist. In this process the tool tries to recognise components in the code, and the generated netlist contains the components and how they are connected. Examples of components are memory elements, registers and adders. It is possible to influence the process by adding constraints controlling timing, area limitations and I/O pin assignment. The constraints are used in the optimisation of the netlist and are also passed on to the next step of the translation.

Many components, like adders and multipliers, can be implemented in many ways. For example, a fast implementation of an adder is the Sklansky tree adder, which requires a large area, while the ripple-carry adder is small but slow. The constraints will guide the program in choosing an implementation, i.e. the implementation using the smallest area but still fulfilling the timing constraints will be used.

Information gathered during synthesis is summarised in a report. These reports are generated using Xilinx's FPGA design suite ISE. The reports contain information about macro statistics, resources and timing. The macro statistics are information about which components were found by the macros analysing the VHDL code. (The synthesis tools contain macro-scripts that recognise certain behaviour of the VHDL code; if the behaviour matches an internal component, like a blockRAM, the script will use that component instead of implementing the behaviour with LUTs.) The resource information is an estimate of which on-chip resources are required to implement the components on a certain chip. The timing information is a rough estimate of how fast the implementation will be. No routing delay is included in this estimate, since the layout performed by the Place And Route (PAR) process has not yet been made. It is also possible that optimisations done during mapping will decrease the timing.
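To make the adder trade-off above concrete, a bit-level C model of a ripple-carry adder is sketched below (an illustration added for this text, not part of the thesis). The loop mirrors the hardware carry chain: bit i cannot be computed before the carry out of bit i-1 is known, which is why the structure is small but has a long critical path, whereas a tree adder such as the Sklansky adder shortens the path at the cost of more logic.

    #include <stdint.h>

    /* Bit-level model of an n-bit ripple-carry adder (n <= 32). */
    static uint32_t ripple_carry_add(uint32_t a, uint32_t b, unsigned n)
    {
        uint32_t sum = 0;
        unsigned carry = 0;
        for (unsigned i = 0; i < n; i++) {
            unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
            sum |= (uint32_t)(ai ^ bi ^ carry) << i;          /* full-adder sum   */
            carry = (ai & bi) | (ai & carry) | (bi & carry);  /* full-adder carry */
        }
        return sum;
    }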

2.2.1.2 Mapping

The next translation step is Mapping (MAP), where the component implementations are mapped to primitive components in the target technology. This is usually the ordinary slices in the CLBs, but if memory elements or advanced mathematics are requested, these are mapped to SLICEM and DSP48E slices by the mapping process. The mapping needs information about the target architecture to know which resources are available. This information is provided by a technology library describing the target device. The report created after mapping contains essentially the same information as the synthesis report. Most interesting is the more accurate information about resource requirements.


2.2.1.3 Place and Route

Finally the mapped components are assigned to physical resources in the FPGA and the resources are routed together. This step is called Place And Route (PAR). The process includes optimisation of the location of every resource and of how they should be connected to meet the user-defined timing constraints. The PAR generates the information used to configure every resource on the FPGA. For example, certain SLICEMs are configured as ROM if memory of that type is defined in the design phase. The report generated after PAR contains the most accurate timing information, which includes routing delay. Since all translation steps take a long time for large designs, it can be better to use the estimates in the earlier reports when evaluating designs. The information in the PAR report is useful as a verification in design steps before hardware verification.

2.2.2 Simulation

The possibility to simulate a design before it is implemented in hardware is an important part of VHDL. Two types of simulation can be made during the synthesis flow. Behavioural simulation can be made directly from the VHDL models without considering the hardware implementation. A gate level simulation can be made after the PAR process, when accurate timing information is available. The gate level simulation is much slower but is able to find problems with timing and with optimisations made by the CAD tool. By simulating the code before implementation it is easier to verify correct behaviour, since internal error handling can be added to each component and internal signals not available to the outside world can be measured. The internal error handling can for example be assertions that certain signals are not high at the same time.

2.2.3 Intellectual Properties

To enable faster development and code reuse, most FPGA manufacturers provide implementations of various algorithms called IP-Blocks. These come with different licensing agreements and costs. To protect the source, device-dependent netlists are often provided together with a software package containing a library with the cores and functionality to adapt the cores to a specific usage. ISE contains a software tool called CORE Generator which contains free simple IP-Blocks, including DSP functions, memories, storage elements and math functions, as well as evaluation versions of complex blocks [21].


2.3 Processor Architecture

This section contains a brief introduction to processor architecture, needed to understand the performance metrics discussed in section 2.4.2 and the major differences between processor and FPGA architecture. The architecture described is a simple form used to introduce important concepts. Modern processors are far more advanced.

2.3.1 Fundamental design

A processor is a component capable of changing its functionality when given different instructions. The basic processor can be divided into two parts: a datapath and a control unit. The datapath includes logic, called execution units, for manipulating data. Examples of execution units are integer and floating-point units, which are used for performing mathematical operations on integer or floating-point data. The control unit configures the datapath for performing the operation represented by an instruction; for example, it activates the correct execution unit used by an instruction. A processor is a sequential machine, and the operation of the simple architecture described here can be summarised by fetch-decode-execute-store. This is called one instruction cycle.

Fetch: An instruction is fetched from the memory position indicated by the Program Counter (PC).

Decode: The instruction is decoded, which means that the datapath is configured by the control unit for executing the instruction.

Execute: The datapath then executes the instruction, for example calculates the sum of two integers.

Store: The result of the execution is stored in memory and the PC is updated to point to the next instruction.
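The instruction cycle can be illustrated with a minimal software model of a made-up two-instruction machine (a sketch added for this text; the instruction encoding and register layout are invented for illustration only):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy processor: 8 registers and two invented instructions
     * encoded as {opcode, dest, src1, src2}: ADD and HALT.     */
    enum { OP_ADD = 0, OP_HALT = 1 };

    int main(void)
    {
        uint8_t program[][4] = { { OP_ADD, 2, 0, 1 },   /* r2 = r0 + r1 */
                                 { OP_HALT, 0, 0, 0 } };
        int32_t reg[8] = { 40, 2 };
        unsigned pc = 0;

        for (;;) {
            uint8_t *instr = program[pc];                    /* fetch            */
            uint8_t op = instr[0];                           /* decode           */
            if (op == OP_HALT)
                break;
            int32_t result = reg[instr[2]] + reg[instr[3]];  /* execute (ALU)    */
            reg[instr[1]] = result;                          /* store the result */
            pc++;                                            /* update the PC    */
        }
        printf("r2 = %d\n", reg[2]);                         /* prints 42        */
        return 0;
    }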

2.3.2 Pipeline

This architecture is a valid design, but the clock frequency is limited by the delay of the combinational logic in the datapath. A better way is to introduce registers between the steps in the instruction cycle which store the intermediate values. This allows an increased clock frequency and will increase the performance if several instructions can be executed concurrently. Executing several instructions concurrently by storing the intermediate values in registers is called pipelining. The architecture described above has four steps in its pipeline. Increasing the number of steps will generally reduce the Instructions Per Cycle (IPC) but allows a higher clock frequency.

One reason that the IPC is decreased is data dependency. A simplified explanation is that one instruction in the pipeline needs data from another instruction in the pipeline which has not yet written its result to the memory. A resolution is to insert stalls, which are instructions without operation, between the two dependent instructions. During the stall the result is written to the memory to be available for the next instruction.

Another problem with pipelining is conditional branches, i.e. instructions where the value of the Program Counter (PC) is changed depending on a condition. This type of instruction is generated by conditional statements like case and if, and by loops like while and for. If the address of the next instruction to be inserted in the pipeline is calculated by a conditional branch, this information is not available until the branch is resolved. This will lead to a stall of four cycles. The resolution is advanced branch prediction techniques, where the processor predicts the next instruction to be executed and throws away the result if the prediction fails.

Modern general-purpose processors generally have much longer pipelines than four stages. For example, the Pentium 4 from Intel with the Prescott architecture has a 31-stage pipeline. This architecture allows clock frequencies up to almost 4 GHz but suffers heavily from problems with branch prediction. Such processors have several datapaths, which allow instructions to execute in parallel. Application specific processors have datapaths for application specific operations like mathematical functions, and specialised datapaths for feedback of results to prevent stalling. Examples are multiple Arithmetical Logic Units (ALUs) and specialised ALUs in processors for signal processing. Specialised datapaths and hardware require special instructions, and using these instructions is crucial for performance. An experienced programmer or an advanced compiler schedules instructions to take advantage of parallel datapaths and rearranges instructions to prevent data dependencies.
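As a simple illustration of such rearrangement (an example added for this text, not taken from the thesis), the first loop below carries a data dependency through a single accumulator, so each addition must wait for the previous one, while the second uses four independent accumulators so that more additions can be in flight in the pipeline at the same time:

    /* Single accumulator: every addition depends on the previous one,
     * so a pipelined processor may have to stall between iterations.   */
    static float sum_serial(const float *x, int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += x[i];
        return acc;
    }

    /* Four independent accumulators: consecutive additions no longer depend
     * on each other, which is the kind of rearrangement an experienced
     * programmer or an optimising compiler performs.                        */
    static float sum_unrolled(const float *x, int n)
    {
        float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            a0 += x[i];      a1 += x[i + 1];
            a2 += x[i + 2];  a3 += x[i + 3];
        }
        for (; i < n; i++)   /* remaining elements */
            a0 += x[i];
        return (a0 + a1) + (a2 + a3);
    }

Note that the two versions may give slightly different floating-point rounding, which is the usual price of this kind of reordering.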

2.3.3 Memory

The processor architecture described above, as well as the PowerPC, operates on data located in registers inside the core. This memory is extremely small and expensive, and cannot possibly contain all data and instructions needed by even very small programs. A larger memory is therefore made available to the processor, from which it can fetch instructions and data for storage in the registers. Such memory is considerably slower than the processor. The solution is to introduce cache memories on the same chip as the processor but outside the core, and to create dedicated hardware to make sure that all data and instructions needed by the processor are in the cache. Otherwise the processor is stalled while the data is transferred from the memory to the cache. One cache can be used for both instructions and data, but they are often separated to gain performance.


The details of the complex algorithms that handle the cache are outside the scope of this introduction. It is, however, important to understand that the memory bottleneck and cache misses are performance critical.
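A small example of how the access pattern affects the cache (an illustration added for this text, under the usual assumption that C stores two-dimensional arrays row by row):

    #define N 1024

    /* Row-major traversal touches consecutive memory addresses, so most
     * accesses hit data already brought into the cache.                  */
    static long sum_row_major(const int m[N][N])
    {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Column-major traversal jumps N*sizeof(int) bytes between accesses,
     * causing far more cache misses and stalls for the same arithmetic.  */
    static long sum_col_major(const int m[N][N])
    {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

Both functions perform exactly the same arithmetic; any difference in execution time comes from the memory system alone.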

2.3.4 PowerPC

PowerPC was designed by Apple, IBM and Motorola in 1991, based on the POWER architecture created by IBM. The architecture is a Reduced Instruction Set Computer (RISC) architecture designed for personal computers, but since Apple's transition to Intel-based workstations it is mostly used in embedded environments. The reason PowerPC is used as a target for the performance evaluation in this thesis is that it is a common embedded processor and that it is used by SMW.

2.3.4.1 PowerPC 604

This processor is the reference used for evaluation of processor performance using PSIM in this thesis. It is a general purpose processor, not specific to a certain application, but shows good DSP performance because of the support for MAC instructions. The architecture is a superscalar RISC architecture that can dispatch four instructions and execute up to seven instructions in parallel. This is possible since the processor contains seven execution units, including three integer units and one floating-point unit. The address width is 32 bits, like the maximum integer resolution, but up to 64-bit floating-point numbers are supported. A branch processing unit and a completion unit control the out-of-order execution of instructions on the processor, trying to maximise the performance. The processor includes separate on-chip instruction and data caches of 32 KByte. The maximum clock speed of the CPU is 180 MHz. More information is found in [3, PowerPC 604 Processor System] and [10].


2.4 Benchmarking

In this report, benchmarking means the act of running a synthetic workload on an object and measuring the performance in order to be able to compare the object with others. Analysing the performance of a fixed workload is called profiling. The object in this report is a FPGA device or a processor. Benchmarking has been used for evaluating the performance of VHDL generated with AccelDSP on an FPGA, and profiling on a PowerPC and FPGA has been used for comparing realisations on different platforms. With benchmarking as a major focus of this report, it is important to present a description of the fundamentals and problems concerning benchmarking and profiling.

2.4.1 Benchmarking fundamentals

Three considerations arise when benchmarking an object. The first consideration is what kind of performance to measure and in what unit. The next is how the performance should be measured, especially without influencing the object. The last one is that the workload should be designed to give a fair comparison between different objects. When benchmarking a compiler, in this report AccelDSP, it is important that the workloads contain enough diversity to be able to draw a general conclusion about the performance of the compiler. Traditionally, different approaches are used for performance measurement:

Simple metrics are parameters directly derived from the manufacturing of the object. Those are usually too simple to describe the actual performance.

Application benchmarking is when a complete application is used as workload and is a very good way to cover a lot of the possible operations of the object. The disadvantage is, as already mentioned, that the usage gets very specific, which gives a benchmark only interesting to a few applications similar to the workload.

Algorithm kernel benchmarking is an interesting method if a kernel set of algorithms can be extracted from the interesting applications. The performance of these benchmarks is then related to the performance of the application of interest. This method is used for the benchmarking of AccelDSP in this thesis.

Micro benchmarking is when a single metric is measured to identify peak capability and potential bottlenecks of a device. This can be very over-optimistic in terms of real application performance.

Functionality benchmarking can be seen as a kind of micro benchmarking and is when different types of functionality of an object are measured with different benchmarks. What kind of functionality the application of interest is using determines which measurements to rely on. Examples of functionality could be I/O performance, memory performance or computing power.

Benchmarking is becoming more and more advanced as the complexity of new hardware architectures increases. Profiling, i.e. determining the performance of a certain application, run in a specific way on a certain platform, is rather easy, but gaining generally usable information is not trivial. The more complex systems become, the harder it is to isolate the performance of the object of interest. Advanced hardware often requires the compiler to generate code that is able to utilise the advanced features. For a C-Code compiler this means generating specialised machine code instructions, and for a translator from M-Code to VHDL it means generating a description that can utilise optimised slices.

2.4.2 Processor benchmarking

Performance modelling and measurement for software and processors are thoroughly discussed in [11]. This section contains a brief summary of the theory regarding performance evaluation and metrics, as a background for finding techniques suitable for comparing the performance of processor and FPGA architectures.

2.4.2.1 Traditional processor metrics

There are a lot of simple performance metrics that can be used to describe processor performance, but they are almost solely interesting within an architecture, not when comparing different architectures to each other. The commonly used metrics are:

Clock frequency: A common metric in consumer electronics which is totally irrelevant unless the same model, but with different clock frequencies, is compared.

MIPS: Describes how many instructions can be executed per second. This may be interesting to compare on architectures with the same instruction set. It is important to remember that not all instructions can be executed concurrently if several datapaths are available, and that not all instructions take the same number of cycles to execute. For example, floating-point operations are usually slower than fixed-point operations.

MOPS: Describes how many operations can be executed per second. This suffers from problems related to those of MIPS. What counts as an operation, and how many operations are needed to perform useful work?

FLOPS: Describes how many floating-point operations can be executed per second. This is a more interesting metric than the previous ones, since a floating-point operation takes a defined number of cycles on a specific architecture. FLOPS is a common metric in scientific calculations. This metric, like the previous, does not tell anything about the overall performance, since it is theoretical and does not consider memory performance.

IPC/CPI: Describes how many instructions are executed per cycle. This is a very good metric which can be used to compare different processors with each other. The metric is not constant for a certain processor, for reasons previously discussed. An average can however be calculated for a certain piece of software.

Since the simple metrics only give limited information about the performance of a certain processor, some kind of performance evaluation must be carried out. How this is performed depends on the software of interest and on how accurate results are required.
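For reference, these metrics are tied together by the standard execution-time identity (general textbook background, not a formula taken from the report):

    T_{exec} = \frac{N_{instr} \cdot \mathrm{CPI}}{f_{clock}},
    \qquad
    \mathrm{MIPS} = \frac{f_{clock}}{\mathrm{CPI} \cdot 10^{6}}

so a lower CPI or a higher clock frequency only improves the execution time if the instruction count N_instr does not grow correspondingly.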

2.4.2.2 Performance evaluation

Performance evaluation is of interest for those designing both hardware and software, and can be divided into performance modelling and performance measurement. Performance modelling is an estimation of the performance of a system without running the code on the actual hardware. The method is more common in early stages of the design process and can further be divided into simulation-based modelling and analytical modelling. Using performance measurement, the code is run on actual hardware and some kind of instrumentation technique is used for gathering performance data.

2.4.2.3 Performance measuring

Observing how code is executed during runtime is very important to understand the bottlenecks of a system. This can be done by:

On-chip hardware monitoring: Most processors have built-in performance-monitoring counters that observe interesting metrics for the actual architecture. Depending on the architecture and configuration, the counters can monitor cycle count, instruction counts (fetched/completed), cache misses, branch mispredictions, etc. These counters can be read by software, using special instructions, to present the information to a developer.

Off-chip hardware monitoring is when dedicated hardware is attached to the processor for the purpose of monitoring the performance. The off-chip hardware can for example interrupt the processor after every instruction completion and save all information of interest available in the processor.

Software monitoring: A method similar to the off-chip monitoring can be performed in software by interrupting the processor using trap instructions. This is very invasive but easy to implement. A major drawback is that OS activity is hard to monitor unless trap instructions can be added to the OS source code.

Microcoded instrumentation: This is a method requiring hardware support for recording of instruction execution by modifying the microcode. This method does not gather performance information but an instruction trace that can be used in a trace simulation.

2.4.2.4 Analytical modelling

Analytical modelling is not very popular for microprocessors, but rather for whole systems. The method is based on models relying on probabilistic methods, queuing theory, Markov models or Petri nets. More information can be found in [11, 2.1.3 Analytical modelling].

2.4.2.5 Simulation based modelling

By creating a model of the system being simulated, i.e. the target machine, and running it on a host machine, the created software can be run without access to the specific hardware the software was designed for. The simulator can be either a functional simulator or a timing simulator, and be either trace-driven or execution-driven. A functional simulator is able to run the program in a simulated environment where performance data can be revealed. The functional simulator can be cycle accurate, which means that each clock cycle of the processor is simulated; this gives the most accurate results with the drawback of long simulation time. When only performance data is of interest, a fully functional simulator is not necessary. Much simulation can be made with only limited functionality if the performance, and not the result, is relevant. Performance analysis is commonly called profiling and is done using a profiler tool.

A trace-driven simulator can be seen as a simplified form of an execution-driven simulator, where the simulator is not able to execute the program but analyses a trace of information representing the instruction sequence that would have executed on the target machine. The input to a trace-driven simulator can either be fed continuously or fetched from a memory. These simulators suffer from mainly two problems: the large size of the trace, since it is proportional to the dynamic instruction count, and that the trace is not very representative for out-of-order processors.


Execution-driven simulators solve the two major problems with the trace-driven ones, since the static instructions are used as input and the out-of-order processing can be simulated as well. The simulator can either interpret all instructions, or only those of interest, letting the host processor execute the rest natively. The latter will reduce the intrusiveness and therefore increase the speed of the simulation. Execution-driven simulation is highly accurate but is very time-consuming and requires long periods of time for developing the simulator [11, 2.1.1.2 Execution-driven simulation].

2.4.2.6 Energy and power simulators

The power consumption of a processor consists of two parts. One part is static, which is independent of the processor activity but depends on the processor mode; modern processors can typically go into an idle mode where power-consuming parts are turned off. The other part is dynamic and depends on the work executed by the processor. Power simulation is done by modelling the power consumption of the individual components and letting the simulator calculate the consumption based on the activity statistics of each component. The accuracy of the models and the granularity of the simulation determine how exact the estimation will be.
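A first-order model often used for this split is the standard CMOS power relation (general background, not a formula stated in the report), where alpha is the activity (toggle) rate, C the switched capacitance, V_dd the supply voltage and f the clock frequency:

    P \approx P_{static} + P_{dynamic}
      \approx P_{static} + \alpha \, C \, V_{dd}^{2} \, f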

2.4.3 FPGA benchmarking

To accurately benchmark the performance of an FPGA is a costly and time-consuming task. The reason is that the complexity of today's FPGA architectures and CAD tools makes it difficult to obtain proper benchmarking results. For example, FPGAs can vary in terms of size, maximum clock frequency, number of I/O pins, chip-specific implementations, built-in PowerPC cores, DSP-specialised slices, LUT sizes, BlockRAM, etc. Even if VHDL and VERILOG are platform-independent languages, the compilers and CAD tools vary between platforms and some architecture-specific modules may have different interfaces. When implementing on FPGAs, the CAD tools have a big impact on the performance of the final design. Timing constraints and the configuration of optimisation trade-offs in the tools have a large impact on the performance of the generated design. Correctly adding constraints and configuring trade-offs for maximising the performance sets high requirements on a benchmarking methodology.

2.4.3.1 Traditional FPGA measurements

It is important to set constraints and configure trade-offs depending on what to measure. In FPGA design there are three important benchmarking and design directions:

Timing: When designing and benchmarking towards timing, important metrics are maximum clock frequency, input pin to setup delay and clock to output pin delay. To get comparable results for these measures there are a lot of things to consider, see section 2.4.3.2 for more details.

Area: When designing with area limitations the goal is to keep the design as small as possible, most probably to reduce cost. The resource usage is given by the CAD tools after a design has been successfully mapped to a device. Important metrics are LUT usage, register usage, I/O pins used, BlockRAM count, DSP48E usage and the number of PLL/DCM.

Power: Power consumption is split into two important factors, static and dynamic power usage. Static power usage comes from the device's physical parameters like size, package and supply voltage. Other factors that contribute to static power usage are design parameters like placement, routing and operating conditions. Dynamic power usage comes from the signal toggle rate. Measuring power consumption is done using probing or with the integrated power analysis tools often supplied with the CAD tools. The analysis tool for ISE is called XPower.

2.4.3.2 Performance benchmarking

To achieve maximum performance and meaningful benchmarks on a device requires a structured benchmark methodology. Without strict guidelines the performance results can be misleading or, in the worst case, wrong. This section gives an overview of these requirements. Manufacturer specific details for Xilinx and Altera are found in [1, 22]. To set up an FPGA performance benchmark there are a number of important points to consider:

Apply timing constraints in synthesis and place and route until the timing slack^3 is negative. This forces the tools to optimise the timing.

Constrain all important clocks. This forces the tools to improve all clocks in the design. If a single global clock constraint is used in a design with several clocks, the CAD tool will only apply the constraint to the clock domain with the worst performance.

Use the highest effort choice in both the synthesis and PAR tools.

Apply moderate I/O constraints. This forces the CAD tools to factor in any trade-offs in I/O timing requirements. Without this constraint the results may be unrealistic.

When comparing different FPGAs it is important to make sure that the devices run at similar speed grades. When interpreting the results given by the timing reports it is important to check that the CAD tools used all constraints. It is also important to run the synthesis and PAR a few times with a deviation of the constraints to tune the tools for better performance. Xilinx recommends 5% increments in the constraint until timing is no longer met.

^3 The timing slack is the difference between the constraint and the resulting performance. If the slack is positive it means that the constraint can be set tighter to force the CAD tools to work harder.

2.4.4 Digital signal processing benchmarking

In this thesis the focus is to benchmark and implement algorithms for Digital Signal Processing (DSP). This section presents a few important terms and arithmetic operations used in DSP.

Multiply-add is when two values are multiplied and a third value is added to the result. It is an important operation used in DSP applications, including computation of vector products, FIR filtering, correlation and the FFT.

Multiply-accumulate operation (MAC) is the more common term used to describe the multiply-add operation in hardware. The difference is that the result from the multiply-add operation is stored in a result register called an accumulator, hence the name.

Multiply-accumulates per second, or MACs, is a common micro benchmarking metric used to measure the peak performance in the number of multiply-accumulates a device can perform per second.

Complex number calculations: since complex numbers are very common for describing the phase and magnitude of signals, many calculations are carried through with complex numbers. Complex multiplication can be implemented in two ways, using either 3 real multiplications and 5 additions or 4 multiplications and 2 additions; a sketch of both variants is given below.

Matrix operations: many advanced algorithms use matrix operations, including multiplication, decompositions and singular value and eigenvalue calculations.
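The following M-Code sketch shows the two complex multiplication variants mentioned above for (a+bi)(c+di); the operand values are arbitrary and only serve to illustrate the operation counts.

a = 3; b = -2; c = 1.5; d = 4;    % illustrative operands

% Variant 1: 4 multiplications and 2 additions
re4 = a*c - b*d;
im4 = a*d + b*c;

% Variant 2: 3 multiplications and 5 additions/subtractions
k1 = c*(a + b);
k2 = a*(d - c);
k3 = b*(c + d);
re3 = k1 - k3;                    % equals a*c - b*d
im3 = k1 + k2;                    % equals a*d + b*c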


2.5 Xilinx AccelDSP

AccelDSP became a part of the Xilinx XtremeDSP solutions in 2006 when Xilinx acquired the company AccelChip, founded in 2000 in California. AccelChip was a provider of Matlab synthesis software for DSP systems, with roots in a research project at Northwestern University in cooperation with DARPA. The project was called MATCH, an acronym for Matlab Compilation Environment for Adaptive Computing [23, 18]. The only operating system AccelDSP supports is Microsoft Windows XP. The software is used for translating M-Code into VHDL which can be synthesised to create digital hardware. The translation includes automatic analysis of the types and shapes of variables and generation of a fixed-point design suitable for hardware implementation. The original research included functionality for mapping the application to multiple FPGAs by parallelising it, but this has not reached AccelDSP. Many programs exist for translating general purpose programming languages, including C/C++ and Java, to VHDL, but developing a direct synthesis path from Matlab enables a fast and easy evaluation of a lot of algorithms [7]. The reason is that Matlab is used by high technology companies for developing algorithms, and a direct path makes intermediate translations to other programming languages unnecessary. This was the reason behind the original research. Using Matlab as a source has both advantages and disadvantages. The main advantage, besides its superiority for developing algorithms, is the high level syntax of the M-Code language. Since most signal processing building blocks such as matrix multiplication, FFT, correlation, eigenvalue calculations etc. are available using a single function, it is easy to auto-infer optimised IP-Blocks when translating to VHDL. The main drawback is that Matlab is an interpreted language with dynamic type and shape resolution of its variables. A variable can even change shape, which means that it can change from being a scalar to a vector or a matrix during runtime. Dynamically changing the shape of matrices in loops is unfortunately common in M-Code; such code is very slow and not suitable for hardware implementation. There is no concept of constants in M-Code, so a workaround is needed for optimisations dependent on constants.
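The shape problem can be illustrated with a small M-Code example, written here only for illustration and not taken from any evaluated algorithm. The first loop grows its result vector dynamically, which prevents static shape resolution, while the second keeps a fixed, preallocated shape and is therefore better suited for hardware generation.

N = 16;
x = randn(1, N);

y = [];                 % shape changes every iteration
for n = 1:N
    y = [y, 2*x(n)];    % dynamically growing vector
end

z = zeros(1, N);        % shape known before the loop starts
for n = 1:N
    z(n) = 2*x(n);      % same result, fixed shape
end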

2.5.1 Workflow

Generating VHDL from a Matlab model using AccelDSP starts with dividing the model into two parts, a script file and a function file. The function file contains the actual function to be translated into VHDL, written as an ordinary Matlab function with an interface of input and output variables.

The script file has three tasks. It creates stimuli, feeds the stimuli to the function in a streaming loop and verifies the output from the function. The streaming loop simulates the infinite stream of data entering and leaving the design in hardware, and the combination of a Matlab function and the loop makes it possible to input the data in manageable partitions. The loop can be either a for or a while loop. The stimuli in the script file, generated or imported, are important for two reasons. The data is used as a reference in the automatic type and size identification and fixed-point generation framework implemented in AccelDSP; it is from the scaling of the input that the internal bit-widths are determined. The input must also represent the real world input for the verification of the function to be relevant. The verification is made in several steps in the AccelDSP workflow. An overview of the workflow is found in Figure 2.1 and is shortly explained in the following sections. The complete manual is found in [24].
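A minimal sketch of the script-file/function-file split is shown below. The file names, the trivial scaling function and the stimuli are invented for the illustration and are not part of the AccelDSP tool or the thesis designs.

% ----- script file (stimuli, streaming loop and verification) -----
stimuli   = sin(2*pi*0.05*(0:255));     % generated stimuli
reference = 0.5*stimuli;                % expected output
y = zeros(size(stimuli));
for n = 1:length(stimuli)               % streaming loop, one sample per iteration
    y(n) = scale_sample(stimuli(n));
end
max(abs(y - reference))                 % verify the output

% ----- function file (the design to be translated to VHDL) -----
function y = scale_sample(x)
y = 0.5*x;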
Figure 2.1: The AccelDSP workflow: floating-point model ("golden") → verify floating-point → analyse → generate fixed-point → fixed-point model → verify fixed-point → generate RTL → RTL model (VHDL/Verilog) → verify RTL → synthesise RTL → gate-level netlist → verify gate level → implement → bitstream and simulation file.


2.5.1.1 Verify floating-point

AccelDSP lets the user execute the script file inside the program and shows all plots, variables and output. The floating-point model is the golden source, which must be verified by the designer using this output. Errors in this model will propagate through all later steps and exist in the final bitstream. It is also important to check that all important variables are observed, since the output is used to verify the fixed-point model. If probes, see section 2.5.1.4, are used, floating-point data is stored in this step for later use.

2.5.1.2 Analyse

An in-memory model of the design is created in this step. This involves parsing the M-Code into an internal structure, and any construct that is not supported will cause an error to be raised. This step also tries to identify the streaming loop and the function file from the script file. AccelDSP supports a subset of the M-Code language with restrictions on usable functions, operators, syntax and shapes. A reference can be found in [26]. Matlab functions are supported using a library called AccelWare which is included with AccelDSP, see section 2.5.2. Some of the functions can be automatically inferred in the design, but some need to be generated and inserted into the M-Code manually. For example, a divider can be inferred automatically and directives for the hardware implementation can be set in the fixed-point parse tree. For some functions, like an FFT, the code must be generated and inserted manually.

2.5.1.3 Generate fixed-point

The framework already mentioned creates a fixed-point model in either Matlab or C++ from the floating-point source. The design is presented in a parse tree which includes all information about the design, including inferred IP-Blocks, operators and shapes. The interface lets the user graphically add directives to control bit-widths, add pipeline steps or choose hardware specific implementations.

2.5.1.4 Verify fixed-point

In this step the fixed-point model is executed by Matlab and the same output as in the floating-point code is presented for the user to compare with the output from the floating-point model. If the results are unsatisfying the user has to go back and annotate the design with more directives, or change the floating-point design. This iteration is performed until the user is satisfied with the results.


To simplify the analysis of variables inside functions and of the effect of fixed-point conversion, AccelDSP provides probes. These are inserted into the M-Code for observing both the floating- and fixed-point values of a variable. A plot presenting the values and the differences between the models is shown when running the AccelDSP function verify fixed-point, which can be used to identify errors in the quantisation.

2.5.1.5 Generate RTL

When the user is satisfied with the fidelity of the fixed-point results, an RTL description can be generated. Both VHDL and Verilog files containing the RTL description are created. A testbench for verifying the generated design is also created automatically, together with ASCII files containing the function stimuli and the output from the Matlab model. The testbench reads the input to the VHDL function from the ASCII files and compares the output with the Matlab reference. The verification passes if all values are the same. The testbench depends on Xilinx packages for performing the comparison with the ASCII files. This leads to compact testbenches, but the packages must be available if the testbench is to be run outside AccelDSP.

2.5.1.6 Verify RTL

The testbench is run inside AccelDSP using one of several commercially available tools. The result is either pass or fail.

2.5.1.7 Synthesise RTL

If the verification passes, the synthesis can also be performed inside the program by utilising an external tool, as in the verification. After this step a gate level netlist is created.

2.5.1.8 Implement

The netlist is mapped to the hardware using MAP and PAR in this step. After this process two files of particular interest are created: the configuration bitstream, which is used for programming the FPGA, and a gate level simulation file.

2.5.1.9 Verify Gate Level

The same testbench used in the RTL verification can be used to verify the gate level implementation generated after PAR. This verification is able to find both synthesis and timing related errors. If this verification passes, it guarantees that the implementation is bit-true with the Matlab fixed-point model.


2.5.2 AccelWare

M-Code implementations of Matlab functions optimised for generation of VHDL are included in AccelDSP through a library called AccelWare. The library only contains a small subset of all Matlab functions, and only some of the functions can be automatically inferred using the same or a similar syntax as in Matlab. Functions that can not be inferred automatically must be generated using a graphical interface with a form for choosing implementation parameters. This interface creates an AccelDSP project with both a script file, containing plots and test data to verify the functionality, and a function file which can be inserted in the user code instead of the unsupported Matlab function.

2.5.3 Original compiler implementation

This section contains information about the original implementation of the MATCH compiler as described in [6]. How much of this has changed before the current version of AccelDSP is not known, but at least some parts, like the state machine generation, are still implemented. As a first compiler step a Matlab Abstract Syntax Tree (AST) is generated from the source code, annotated with the information in the directive files. A type-shape inference phase infers the types and shapes of the variables by analysing the tree, including type and shape directives. Optimised library functions are recognised and IP-Blocks are inferred. For operations where no cores are available, a scalarisation phase expands the matrix and vector operations into loops. The original compiler then performed a parallelisation phase which split loops or assigned different tasks onto multiple FPGAs available on the same board. This required a compatible board with several FPGAs and libraries for communication between the devices. The Matlab AST is translated to a VHDL AST using a state machine description. The state machine is, among other things, required to hold the states of loops and calculations when using pipelining. A precision inference scheme operates on the new AST to find the minimum number of bits required to represent every variable. When the bit widths are inferred, hardware dependent optimisations are performed which can alter parts of the generated state machine. A traversal of the generated tree produces the output VHDL code.

2.5.3.1 Performance

During the development of the MATCH compiler the researchers used benchmarking to evaluate the performance and measure the effect of different optimisations. The benchmark suite presented in [6] consists of matrix multiplication, FIR filter, IIR filter, a Sobel edge detection algorithm, an average filter and a motion estimation algorithm. The conclusion drawn by the authors was that the auto-generated code was almost equivalent in execution time to manually designed hardware and in some cases superior. The resource utilisation was within a factor of four of the manual design, and the design time was reduced from months to minutes. The benchmarking was done on the WildChild board from Annapolis Micro Systems with 9 FPGA devices. How parallelism in the algorithms is exploited to divide the algorithm between the devices, and how the communication is implemented, is crucial for the performance and resource utilisation. This functionality is not a part of AccelDSP.


2.6 MathWorks Matlab

Matlab is a high-level technical computation environment from The MathWorks, founded by Jack Little, Steve Bangert and Cleve Moler in 1984. The foundation of Matlab was written by Cleve Moler and was an implementation of the EISPACK and LINPACK Fortran packages for linear algebra and eigenvalue calculations. The program was written when Moler was a math professor at the University of New Mexico, to let the students do scientific calculations without having to program in Fortran. Nowadays Matlab lets engineers develop algorithms, do numerical computations and visualise data easier and faster than using traditional programming languages. The core of Matlab is the Matlab Code language (M-Code), a high level, mathematically oriented language which can be used for algorithm design using Matlab built-in functions as well as for programming new functions. Matlab built-in functions are mostly written in M-Code, which makes it easy to inspect the algorithms and make changes. Fundamental and performance sensitive functions are compiled to gain better performance. The core functionality of Matlab can be extended by installing toolboxes which contain tools for a certain field. Toolboxes are available for everything from aerospace and finance to bioinformatics [13, 15, 14].

2.6.1 Native code generation

Matlab can generate C-Code with both the Matlab Compiler, which can be used to generate code and executables for workstations, and the software used in this thesis, Real Time Workshop Embedded Coder (EMLC), which is designed for embedded targets. The difference in generated code is considerable and EMLC is recommended for embedded platforms [17]. EMLC is a module of Real Time Workshop, which is an extension to Matlab for generating and executing stand-alone C-Code. The module can be used with both the Embedded Matlab Subset (EML) and Simulink. The latter is an environment, supported by a graphical user interface, for simulation and model-based design of embedded and dynamic systems. EMLC was originally a part of Real Time Workshop, which is why EMLC is still dependent on Simulink. Both fixed- and floating-point code generation are possible using EMLC. Parameters for the code generation target and its abilities are provided to EMLC using configuration objects. This information includes endianness, floating-point support, etc. For fixed-point generation, which is of interest in this report, the Fixed-Point Toolbox is needed. This toolbox contains object types which can be used for annotating variables with fixed-point directives. These directives are used directly in M-Code to create a fixed-point Matlab implementation supported for code generation.


For more information and a performance evaluation of code generation from Matlab, see section 1.3.7.

Chapter 3 Method
This chapter is divided into three sections. The first method, found in section 3.1, describes a method for generating VHDL and evaluating the performance on a Field Programmable Gate Array (FPGA). The next section, 3.2, describes a method for generating C-Code using Real Time Workshop Embedded Coder (EMLC) and evaluating the performance on a PowerPC. A method for making a platform decision based on the previous two methods is found in section 3.3. Descriptions of the algorithms used in all methods are found in chapter 4.

3.1 VHDL generation and simulation

This section describes the method, used in this thesis, for generating and simulating VHDL code from a Matlab implementation, both for evaluating AccelDSP and for generating VHDL for the platform decision method found in section 3.3. Background information on how AccelDSP should be used according to Xilinx is found in section 2.5. A flow chart of the method described in this section is found in Figure 3.1. To evaluate which purposes AccelDSP can fulfil at Saab Microwave Systems (SMW), the following aspects have been evaluated:

1. To attack the algorithm implementation from the perspective of an engineer with no or little knowledge of VHDL and FPGAs.
2. To evaluate the flexibility of the AccelDSP M-Code interpreter in terms of which functions and syntax can be synthesised.
3. To evaluate the performance and resource utilisation of generated designs compared to hand-coded or IP-Block implementations.
4. To evaluate AccelDSP as a tool for rapid prototyping.
5. To evaluate if the generated VHDL code is technology independent.


For evaluating the performance, algorithm kernel benchmarking, see section 2.4.1, was performed using the following kernel set: FFT, matrix multiplication, FIR filtering and CFAR. These algorithms represent commonly used Digital Signal Processing (DSP) algorithms and a common radar algorithm for which a reference could be found or hand-coded in a reasonable time. The kernel set, as well as correlation, CORDIC, MUSIC and the Cartesian to polar map transformation, was used. The results are found in section 5.1.
Figure 3.1: The VHDL generation workflow and evaluation. The Matlab model is rewritten as M-Code accepted by AccelDSP (iterating until the fixed-point model works as intended), the generated VHDL code and reference files are verified with the AccelDSP testbench in ModelSIM, and the design is then synthesised in ISE Foundation with XST (or Synplify PRO if XST fails to synthesise the design) and run through PAR with constraint iterations to obtain area utilisation and performance results.

3.1.1 VHDL code generation

The design process of an algorithm starts with creating a working Matlab model of the algorithm. How to generate VHDL with AccelDSP from this Matlab model is described in section 2.5. It is important to note that many Matlab built-in functions, like fft, filter, etc., are not supported by AccelDSP and must be implemented directly in M-Code or using an AccelWare core. A result of this limitation is that the design process requires a number of iterations before AccelDSP succeeds in generating the algorithm.

3.1.1.1 AccelDSP

As shown in Figure 3.1 there are two types of design iterations inside AccelDSP. These iterations are not fixed to a specific step in the design process, but rather depend on the type of error that occurs during the different steps in the VHDL generation workflow.

M-Code iteration: Several iterations can be needed to get AccelDSP to parse the M-Code. An iteration involves rewriting unsupported built-in functions or adding functions that can not be auto-inferred using AccelWare. If errors occur in the fixed-point verification that can not be solved by changing the fixed-point model, the M-Code source must be edited, which results in another iteration.

Fixed-point iteration: To get the fixed-point model to correspond to the floating-point model, several iterations can be needed. By adding quantisation directives to variables in the fixed-point model it is possible to interact with the automatic size and shape identification process. To assist in this process AccelDSP provides probes, see section 2.5.1.4. Unrolling and pipelining directives can also be added in this iteration. If errors are found, it is necessary to go back and modify the fixed-point directives. If it is not possible to solve the problem using directives, a change in the M-Code design is needed.

After the program has verified that the fixed-point model of the algorithm corresponds to the intended function, AccelDSP generates a Register Transfer Level (RTL) model. AccelDSP gives the possibility to synthesise the fixed-point model and continue the implementation workflow inside the program, as described in section 2.5. Because of the limited computing power of the Microsoft Windows XP workstation where AccelDSP was run, the remaining part of the workflow was performed in ISE Foundation running on a Linux server. AccelDSP can not run on the server because Linux is not a supported operating system.

3.1.1.2 ISE Foundation


ISE Foundation is used to synthesise the generated VHDL files with the Xilinx Synthesis Tool (XST), tune the implementation optimisations and assign constraints to the design. The program is executed in a Linux remote desktop environment which runs on a server used to handle resource heavy calculations and compilations. Another reason to use ISE is that this program is used to synthesise the hand-coded or IP-Block implementation used as a reference model. The performance of the generated VHDL is compared against the reference to evaluate the quality of the generated VHDL. To get comparable results, the designs must be implemented using the same programs with comparable settings and constraints.

3.1.2 Verification

AccelDSP delivers a VHDL testbench along with the generated RTL model, as explained in section 2.5.1.5. The testbench, along with the reference data files and dependent libraries, is copied from the AccelDSP project to the ISE Foundation project. The testbench is then executed with ModelSIM, a VHDL simulator which simulates the testbench and presents the results in a waveform table. The waveform is used to calculate the execution time and the start-up cycles of the generated design. The testbench will also run a verification procedure that compares the output waveform of the RTL model with the reference data files. The result of this procedure is reported as a pass or a fail. If the testbench fails, the fault can be in the fixed-point model or in the original M-Code description, since the testbench only reports a pass or fail. More detailed error information from the testbench would make it easier to locate the source of the error.

3.1.3 Evaluation

The goal of the evaluation is to simulate and compare the performance of the generated code with a hand-coded or IP-Block implementation of the same function. The profiling follows the guidelines described in section 2.4.3.2. These guidelines contain information about which constraints are added and which other parameters are set in ISE Foundation. The algorithms used in the cross-platform comparison are not used for evaluating AccelDSP and thus have no reference designs. For these benchmarks a Virtex-5 device used by SMW, xc5vsx95t-ff1136-3, has been used. This device is a part of the SXT device family optimised for DSP implementations.

There are two basic types of DSP algorithm implementations:

Burst I/O is when an algorithm first loads the input values, processes the given values and finally presents the output. During the processing the implementation is locked and can not accept any new values until the processing is done.

Streaming I/O is when the algorithm constantly accepts new values, processes the given values and presents the output in parallel. There are two types of streaming I/O: a sample streaming I/O implementation takes one value, or sample, each cycle, while a frame streaming I/O implementation needs a batch of values to be loaded before the processing can begin. When the first frame is loaded, the next frame starts to load without delay, unlike in batch processing. The difference between sample and frame streaming I/O is that a single value can not be processed using frame streaming I/O.

The algorithm implementations are compared in area utilisation and timing performance using data generated in the implementation workflow, see section 2.2.1. Area utilisation is important for this evaluation because it is a good indication of how well the implementation is generated and of how much of the platform can be used for other processes. The important measurements are:

Slice usage, including LUT and register use, is an indication of how much of the device the implementation uses.

DSP48E slice usage is used to compare how well the design utilises the primitive DSP resources.

BlockRAM usage gives an indication of the memory required by the implementation.

Global clock buffer (BUFG) usage is important to see if several clock domains are required to drive the generated design.

Timing performance is measured as the maximum clock frequency (fMAX) that the design can handle after Place And Route (PAR). The execution or calculation time of the algorithm is also measured. This is measured differently depending on the implementation:

Batch execution time is the time it takes to complete one burst I/O calculation.

Streaming execution time is the time from when a value is available on the input until the result is available on the output.


Frame execution time is the time it takes to load one entire frame, process the frame and present the result.

Other information related to timing is:

Input sampling, which is measured as the number of samples the design can load and process each second. The reason to use this measurement is to evaluate the I/O performance of the generated design.

Start-up cycles, or data latency, is the execution delay in cycles. The delay is measured from the point in time where a value is loaded into the algorithm until the processed version of that value is available as output. This delay often corresponds to the number of pipeline stages in a streaming I/O design. In a burst I/O design the delay is related to the calculation time of the implementation.

3.1.3.1 Generated design

To profile the generated VHDL code, it is first synthesised using XST, included in ISE Foundation. If the synthesis fails, Synplify PRO is used as a backup tool. This program is an expensive but high-performance, sophisticated logic synthesis software which often succeeds with synthesis when XST fails. If Synplify PRO is used, the reference design has to be transferred to and synthesised in this program as well. To get proper timing and area utilisation information the synthesised design is passed to PAR. The type of constraints used in this step is found in appendix E. To make the results as good as possible, a number of constraint iterations are done to tune the timing until a negative timing slack is reached, see section 2.4.3.2.

3.1.3.2 Reference design

The reference design is synthesised with the same settings as the generated design. To make the designs comparable, the same type of settings and constraints as for the generated design are used in PAR. The only difference is that the values of the constraints are tuned differently between the designs to get the best performance out of each design.


3.2 C-Code generation and simulation

This section describes the workflow for generating, verifying and profiling C-Code from a Matlab implementation. An overview of the flow is found in Figure 3.2. For evaluating and developing the method three algorithms have been used: FFT, MUSIC and Cartesian to polar map transformation. The FFT is a simple kernel algorithm used to try the methodology before the more complex algorithms, while the other two algorithms represent algorithms of interest for SMW. The results are found in section 5.2.
Figure 3.2: An overview of the process of evaluating the performance of a PowerPC implementation of an algorithm described in M-Code: the M-Code is rewritten as Embedded Matlab Subset (EML), converted to fixed-point EML and generated to C code, which is simulated in PSIM and functionally verified against the Matlab reference.

3.2.1 C-Code generation

EMLC only supports the Embedded Matlab Subset (EML), which means that algorithms using unsupported M-Code syntax or functions must be rewritten to conform to this subset. To be comparable to the FPGA implementation, and to be able to profile on fixed-point processors, the floating-point M-Code must be annotated with fixed-point information. This is done using the Fixed-Point Toolbox, see section 2.6.1. No automatic precision analysis is performed in EMLC as in AccelDSP. The simplest way to define input to the function is to use example data defined in a cell array, see Listing 3.1. More information about EMLC is found in the Matlab help pages and in section 1.3.7. For convenience, an example of using the PowerPC as target device is found in Listing 3.2.

3.2.2 Compilation

Before compilation a wrapper, serving as an entry point for the program, must be provided. A main-function in C-Code was written which provides stimuli to the function and saves the result.


% Define fixed-point properties using objects
% Use signed, word length = 8, fractional length = 0
T = numerictype(1, 8, 0);
% Wrap if overflow occurs and auto-generate widths for full
% precision in both sum and product operations
F = fimath('OverflowMode', 'wrap', 'SumMode', ...
    'FullPrecision', 'ProductMode', 'FullPrecision');
% Create example data using 2 different syntaxes
exdata = {zeros(1000, 1, 'uint16'), fi(0, T, F)}
% Compile the function music(data, length)
emlc -c -T RTW:EXE -eg exdata music

Listing 3.1: Definition of function input using example data in EMLC

% Create a hardware configuration object
hwicfg = emlcoder.HardwareImplementation;
hwicfg.ProdHWDeviceType = 'Freescale->32-bit PowerPC';
% Compile the function music(data, length)
emlc -c -T RTW:EXE -s hwicfg music

Listing 3.2: Definition of code generation target in EMLC

A small library for reading and storing data in ASCII format, for functional verification with Matlab, was written for use in the main-functions. To be able to compile for the PowerPC architecture on the x86 Linux machine used in the evaluation, a GNU Compiler Collection (GCC) cross compiler was built. For reference, see the Makefile in appendix C. GCC was configured to use Newlib, a C library designed for use on embedded systems. This library defines the real entry point of the program, which calls the main-function, and contains implementations of the standard C library functions. The cross-compiler built using the enclosed description uses the GCC target powerpc-eabisim, which generates an ELF file compatible with the simulator. The configuration of the build script for the compiler can easily be changed to use a target generating code for a processor used in production. The code was compiled with high optimisation by applying the -O3 option to GCC. This optimises the code for speed and not code size, since functions are inlined [12]. This optimisation level also optimises register usage, primarily for processors with many registers. The increased compilation time and code size are generally not a problem for the small algorithms evaluated in this thesis. The drawback is that an increased code size can generate more cache misses in the instruction cache. Since these misses are presented by the simulator, it is easy to change the optimisation level to -O2 to exclude function inlining.

3.2.3 Evaluation

PSIM is a PowerPC execution-driven functional simulator with a built-in profiler, see section 2.4.2.5, integrated with the GNU Debugger (GDB). By running the generated code in this simulated environment, the code can both be verified to be functionally correct and have its performance evaluated using the profiling support. The device simulated using PSIM is a PowerPC 604, described in section 2.3.4. This is the default device in PSIM when no option is used to specify a device. PSIM supports instruction counting, execution unit modelling and instruction cache performance of the PowerPC architecture. The simulator has no support for data cache simulation and is not cycle accurate. Examples of performance data received from PSIM are found in appendix B. The generated code in combination with the profiling is presumed to be used for performance evaluation, and optimised hand-written code for the implementation. In this case the lack of data cache simulation is not crucial for the evaluation in this thesis, since hand-coded implementations are often more optimised for cache performance than generated code.

3.2.3.1 Functional verification

Before the performance evaluation, the generated code was verified to be functionally correct by verifying it against the Matlab model. To be able to compare the results, the same stimuli must be used for both models. The stimuli generated in Matlab and provided to the M-Code implementation were saved in an ASCII file. This file is read by the main-function, which provides the stimuli to the generated code. The output from the generated code is saved by the main-function, loaded into Matlab and compared to the M-Code model. This approach guarantees correct results with a fixed-point resolution determined by the number of digits saved to the file.
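A sketch of the ASCII-based verification described above is shown below. The file names, the tolerance and the function name my_algorithm are assumptions made for the illustration.

x = randn(1000, 1);                             % stimuli generated in Matlab
save('stimuli.txt', 'x', '-ascii', '-double');  % read by the C main-function

y_ref = my_algorithm(x);                        % M-Code reference model

y_c = load('result.txt');                       % output written by the generated code
max_err = max(abs(y_c(:) - y_ref(:)))           % compare against the reference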

3.2.3.2 Performance evaluation

When evaluating the performance, all unnecessary operations, like the functional verification, were removed before compilation. The evaluation was made two times: one time without calling the generated code, to determine the overhead of the main-function, and another time including the code. It is important to review the result to make sure that parts of the main-function are not removed by optimisations when the generated code is not included. The estimated number of cycles to execute the generated code is the simulated result with the generated code called, subtracted by the simulated overhead of the main-function. To get a figure of the resource usage, the whole CPU is considered a resource shared in time between algorithms. By assuming a real-time requirement of how often the algorithm needs to be run and comparing this with the actual execution time, a figure of the resource usage is gained. For example, if the algorithm needs to be run every 20 ms, determined by the rate of incoming data, and the execution time is 10 ms, the usage is 10/20 = 50%.
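A worked example of the overhead subtraction and the resource-usage figure is shown below; the cycle counts and the clock frequency are invented for the illustration.

cycles_total    = 2.4e6;    % simulated cycles, main-function plus algorithm
cycles_overhead = 0.2e6;    % simulated cycles, main-function only
f_cpu           = 400e6;    % assumed processor clock frequency [Hz]

cycles_alg = cycles_total - cycles_overhead;
t_exec     = cycles_alg / f_cpu;       % estimated execution time [s]

t_period = 20e-3;                      % real-time requirement: run every 20 ms
usage    = t_exec / t_period           % fraction of the CPU used (0.275 here)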


3.3 Platform decision

Comparing two different platforms is not trivial, since there is no common simple metric that describes the performance of a general platform. An example is Cycles Per Instruction (CPI), which is a good metric^1 for comparing processors, but no such common simple metric exists for comparing FPGA with CPU architectures. A better approach is to profile an algorithm of interest on the platforms. This section contains a method for deciding if an algorithm should be implemented on an FPGA or a PowerPC using profiling.

3.3.1 Method

A Matlab model of the algorithm is the source for generation and evaluation of VHDL using the method described in section 3.1 and of C-Code with the method in section 3.2. The programs used in the two methods support two different subsets of the Matlab language. Two different Matlab models, conforming to the different subsets, need to be created from the original Matlab source and used in the two methods. It is important to keep the models as similar as possible to be able to compare the implementations. For comparing the implementations, two metrics have been found that are easy to measure on both platforms and contain vital information for deciding the implementation platform: execution time and resource usage. A discussion of other interesting metrics is found in the Discussion chapter at the end of the report.

3.3.1.1 Execution time

The execution time is easily calculated as the period time of the clock multiplied by the number of cycles it takes to finish the execution. The number of clock cycles on a CPU is estimated using one of the methods described in section 2.4.2.2; in this thesis the simulator PSIM is used. The frequency is fixed for a certain processor model. The maximum clock frequency on an FPGA is determined by the delay of the longest combinatorial path, and the number of clock cycles is determined by the VHDL description of the implementation. The number of clock cycles is generally a trade-off between speed and resource usage: exploiting parallelism reduces the number of clock cycles needed for the calculation at the cost of more resources. The clock frequency is estimated during implementation, see section 2.2.1. To get a fair comparison between pipelined and non-pipelined designs, the input data to the algorithm must conform to the required usage of the algorithm. Using a smaller set of input data will disfavour the pipelined design if the pipeline is not fully used.

^1 CPI is only relevant for processors if it can be assumed that all instructions take the same amount of cycles to complete.

3.3.1.2 Resource usage

To get a figure of the resource usage on the PowerPC, the whole CPU is considered a resource shared in time between algorithms. By assuming a real-time requirement of how often the algorithm needs to be run and comparing this with the actual execution time, a figure of the resource usage is gained. For example, if the algorithm needs to be run every 20 ms, determined by the rate of incoming data, and the execution time is 10 ms, the usage is 10/20 = 50%. For the FPGA, the slice usage is used as the metric of resource usage. Time sharing of the FPGA as a unit is not possible, since the hardware is built for performing a specific task and the required resources are allocated whether the task is executed or not.

Chapter 4 Algorithms
This chapter contains general descriptions and implementation details of all algorithms used for evaluating the performance of AccelDSP and for comparing FPGAs with processors. The FFT, correlation, matrix multiplication, FIR-filter, CORDIC and CFAR algorithms have been used for evaluating AccelDSP. For evaluating the comparison method, MUSIC and an algorithm for map transformation have been used. The transformation algorithm was designed in Matlab from a use case description, the CFAR algorithm from a VHDL design specification and the rest from existing Matlab implementations. The development time to get the algorithms to work with AccelDSP differs a lot between the algorithms. An estimation of the development time for each algorithm is found in section 5.1.

4.1 FFT

The Fast Fourier Transform (FFT) is a collection of algorithms for fast computation of the Discrete Fourier Transform (DFT). The DFT is a very common operation in signal processing, used for fast filter implementation, frequency domain analysis and efficient computation of various operations. At Saab Microwave Systems (SMW) the FFT is used in the Doppler channel of the radar signal processing, where moving targets are found and classified. The most common algorithm is the Cooley-Tukey algorithm, which recursively (most implementations rearrange the data instead of using explicit recursion) breaks down the DFT into smaller ones. The radix determines how the algorithm divides the data into smaller transforms. A radix-2 decimation in time algorithm, which is one of the most common FFT implementations, divides a signal of length N = 2^k into two transformations of length N/2. The first contains all even indexed samples and the second all odd indexed samples. The signals are then split into smaller signals and recursively transformed and added to produce the transformation of the whole signal. The combination is made by multiplications with the complex roots of unity, called the twiddle factors. Cf. the definition of the DFT in Equation 4.1.
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \qquad (4.1)$$
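A compact recursive radix-2 decimation-in-time FFT can be written directly from this description. The sketch below is a textbook illustration only, not the AccelWare or CORE Generator implementation evaluated later; it assumes a column-vector input whose length is a power of two.

function X = fft_radix2(x)
% Recursive radix-2 decimation-in-time FFT (x is a column vector, length 2^k)
N = length(x);
if N == 1
    X = x;
else
    E = fft_radix2(x(1:2:end));      % DFT of the even indexed samples
    O = fft_radix2(x(2:2:end));      % DFT of the odd indexed samples
    W = exp(-2j*pi*(0:N/2-1).'/N);   % twiddle factors
    X = [E + W.*O; E - W.*O];        % combine the half-length transforms
end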

4.1.1 Implementation

In Matlab the command X = fft(x, N) is used for calculating the N-point DFT of the signal x. The FFT is a Matlab built-in function which is supported by AccelDSP using AccelWare. An explanation of the parameters is found in [25, pp. 147-151]. To examine EMLC, a simple floating-point, Matlab built-in FFT of length 256 with real input was used.

4.1.2 Evaluation

To get an idea of the performance of the AccelWare generated function, the same algorithm is generated with CORE Generator and the implementations are compared. The parameters used for AccelWare and CORE Generator are found in Table 4.1. To analyse the CORE Generator FFT, which is provided as an optimised netlist, a VHDL wrapper was written.


Parameter                        CORE Gen                    AccelWare
Algorithm                        Cooley-Tukey                -(a)
Decimation                       frequency                   frequency
Radix                            2                           2
Algorithm                        Pipelined Streaming I/O     Pipelined Streaming I/O
Length                           1024                        1024
Output                           Bit/Digit Reversed Order    Bit/Digit Reversed Order
Input                            complex                     complex
Scaling                          Yes                         Yes
Number of stages using BlockRAM  3                           -(a)
Complex multiplier               4 mult & 2 adders           4 mult & 2 adders
Precision
  Input                          16                          16
  Input fract width              -(a)                        0
  Twiddle factors                16                          16
  Rounding                       convergent                  floor

(a) Information is not available

Table 4.1: FFT Implementation


4.2 Correlation

The cross correlation is a statistical metric of the similarity of two waveforms as a function of a time-lag applied to one of them. The auto correlation is the cross correlation of a signal with itself. The correlation of two random processes is defined as:

$$E[Y(n_1) X(n_2)] \qquad (4.2)$$

From the realisations of two ergodic, Wide Sense Stationary (WSS) random processes the correlation can be estimated using Equation 4.3 for positive k.

$$r_{YX}(k) = \frac{1}{N} \sum_{n=1}^{N-k} y(n+k)\, x(n) \qquad (4.3)$$

This formula is very similar to the convolution, but there is no reversal of one of the signals as in the convolution. The correlation can be implemented efficiently by the use of the FFT.

4.2.1 Implementation

In Matlab the command [Rxy, lags] = xcorr(x, y) is used for calculating both cross- and autocorrelation. The command handles both vectors and matrices and can calculate both biased and unbiased estimates for two random (jointly stationary) stochastic processes. The implementation is made in the frequency domain for efficiency, see Listing 4.1.
% Transform both vectors
X = fft(x, 2^nextpow2(2*M-1));
Y = fft(y, 2^nextpow2(2*M-1));
% Compute cross correlation
c = ifft(X.*conj(Y));

Listing 4.1: MathWorks XCorr implementation

4.2.2 Evaluation

The performance of the generated VHDL is not evaluated, since it is comparable to the performance of the FIR-filter, which is basically a convolution; the only difference between convolution and correlation is described in the former section. Instead, the complexity of getting a Matlab description of the function compatible with AccelDSP for generating VHDL is investigated.


4.3 Matrix multiplication

Matrix multiplication is a common mathematical operation resulting in Multiply Accumulate (MAC) operations. It is a fundamental operation used in every area of signal processing.

$$C = AB \qquad (4.4)$$

Two different types of implementations are made. The batch implementation takes two matrices (A, B) as arguments and returns the resulting matrix (C) after multiplication. The streaming implementation takes one row of an input matrix each clock cycle and returns a row of the resulting matrix after multiplication with a preloaded matrix. A streaming implementation is generally slower, but requires considerably smaller I/O busses, since only a row, and not the whole matrix, needs to be fed simultaneously to the function. The evaluation is made with 3x3 matrices for batch and 10x10 for streaming multiplication.

4.3.1 Matlab implementation

4.3.1.1 Batch implementation

The simple code in Listing 4.2 takes the matrices A and B as input and returns the resulting matrix C after multiplication. The multiplication was annotated to be fully unrolled using directives.
function C = mult(A, B)
% Perform matrix multiplication
C = A*B;

Listing 4.2: Batch 3x3 matrix multiplication

4.3.1.2 Streaming implementation

The Matlab implementation found in Listing 4.3 is an example of how M-Code is written for hardware implementation. First each row of the B matrix is loaded into memory; during this process the valid signal is asserted low. Then each row of the A matrix is multiplied with the B matrix, resulting in one row of the resulting matrix on the output. The ctrl signal determines if a B matrix should be loaded or if the multiplication should be made. Handshaking and reset signals are generated by AccelDSP. To optimise the speed of the code, the vector-matrix multiplication is set to non-resource shared/pipelined using directives in AccelDSP.


function [outdata, valid] = mult(indata, ctrl)
a = length(indata);
% Allocate memory for B matrix and row index
persistent B Brow;
% Initialisation of memory
if isempty(B)
    B = zeros(a);
    Brow = 1;
end
% Save B matrix in memory
if ctrl == 1
    B(Brow, :) = indata;
    Brow = Brow + 1;
    % Make sure outdata is always defined
    outdata = indata;
    valid = 0;
% Calculate one matrix row
else
    outdata = indata*B;
    valid = 1;
end

Listing 4.3: Streaming 10x10 matrix multiplication

4.3.2 VHDL reference design

Implementing matrix multiplication in hardware is very resource costly when the sizes of the matrices are large. This is because of the many multiplication and summation operations needed in the standard matrix multiplication algorithm.

4.3.2.1 Batch implementation

A good way of implementing a one-cycle matrix multiplication is to use a systolic array implementation [19]. The big downside of this approach is the resulting delay of using asynchronous adders. To combat this, and to better map the algorithm to the DSP48E slices, the design was pipelined. This approach is very resource costly, and for a 3x3 matrix multiplication 27 multipliers and 18 adders are needed.
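The quoted resource count follows directly from the structure of the algorithm: each of the N^2 output elements needs N multiplications and N-1 additions. A quick check in M-Code for N = 3:

N = 3;
multipliers = N^3          % one multiplier per partial product -> 27
adders      = N^2*(N - 1)  % N-1 additions per output element   -> 18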


4.3.2.2 Streaming implementation

The algorithm for the 10x10 matrix multiplication is designed like the Matlab model. The B matrix is loaded into a register bank, one column each clock cycle, and when the matrix is loaded one row of the A matrix is read each cycle. The output row is calculated in a 7-stage pipeline, which results in a delay of 7 cycles. The result is presented as a row in the resulting matrix. The heavy use of pipeline structures is to make the synthesis tools optimise the multiplier and adder components into DSP48E slices to get better performance and area utilisation.

4.3.3 Evaluation

The performance of the matrix multiplication is compared between the hand-coded VHDL and the AccelDSP generated implementation.


4.4 FIR filter

The basic principle of a filter is that a signal x(n) is convolved with an impulse response h(n), resulting in a filtered signal y(n). If the impulse response is finite, i.e. h(n) is zero for n > N, and the output only depends on current and previous input values, the filter is called a Finite Impulse Response (FIR) filter of length N. Such filters are very common in signal processing and are easy to implement in hardware.
$$y(n) = \sum_{k=0}^{N} h(k)\, x(n-k) \qquad (4.5)$$

The impulse response of a FIR-filter is given by the filter coefficients, also called taps when referring to the implementation. The input to the filter can consist of several interleaved signals, which lets one filter process several input channels. The number of interleaved signals is called the number of channels. In a real-time application the filter has to work at a clock rate M times faster than the input rate, where M denotes the number of channels.
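A direct-form M-Code sketch of Equation 4.5 is given below for a single channel; the coefficients and the input signal are arbitrary example values, not the filters used in the evaluation.

h = [0.1 0.2 0.4 0.2 0.1];          % example filter coefficients (taps)
x = randn(1, 256);                  % example input signal
y = zeros(size(x));
for n = 1:length(x)
    for k = 0:length(h)-1
        if n - k >= 1
            y(n) = y(n) + h(k+1)*x(n-k);   % y(n) = sum_k h(k) x(n-k)
        end
    end
end
y_ref = filter(h, 1, x);            % the built-in filter gives the same result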

4.4.1 Evaluation

The performance of the filter is evaluated using an AccelWare generated block compared with a hand-coded VHDL reference from SMW. The two implementations are adjusted to be as similar as possible and two different filter lengths have been compared. The implementation details are found in Table 4.2.

Parameter                  Hand-coded         AccelWare
Filter operation mode      real and complex   real
Number of taps             64/16              64/16
Number of channels         4                  4
Programmable coefficients  Yes                Yes
Precision
  Coefficients width       12                 14
  Input data width         25/16              25
  Output data width        32                 40

(a) Information is not available

Table 4.2: FIR-filter implementations


4.5 CORDIC

A common algorithm for calculating trigonometric functions in hardware is the COordinate Rotation DIgital Computer (CORDIC) algorithm. It uses an iterative method of calculating vector rotations using only shift and add computations. This makes the CORDIC ideal to implement on digital hardware without dedicated multipliers. A description of the algorithm can be found in [9]. The function used to test the CORDIC was an SMW function that converts a complex number from rectangular to polar representation. A Matlab model was provided for the implementation.
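For reference, a textbook sketch of the CORDIC vectoring mode (rectangular to polar conversion) is given below. It is not the SMW or AccelDSP implementation; it uses floating-point arithmetic and assumes the input lies in the right half-plane (x > 0).

function [mag, ang] = cordic_vec(x, y, n_iter)
% Vectoring-mode CORDIC: rotate (x, y) onto the x-axis with shift-add steps
ang = 0;
for i = 0:n_iter-1
    if y > 0
        d = -1;                    % rotate clockwise to drive y towards zero
    else
        d = 1;                     % rotate counter-clockwise
    end
    x_new = x - d*y*2^(-i);
    y_new = y + d*x*2^(-i);
    ang   = ang - d*atan(2^(-i));  % accumulate the angle of the input vector
    x = x_new;
    y = y_new;
end
K   = prod(1 ./ sqrt(1 + 2.^(-2*(0:n_iter-1))));  % compensate the CORDIC gain
mag = x * K;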

4.5.1 Implementation

Matlab code, used by SMW for verification of the reference design, was provided as a starting point for the AccelDSP implementation. This code was modified to match the coding guidelines for AccelDSP. The input to the Matlab function is a complex number and the number of CORDIC iterations. The result is presented as an angle in the first quadrant, quadrant information and the magnitude. The angle is a bit-vector where a 1 represents a positive and a 0 a negative contribution of the value at the position. Each position n, indexed from 0, has the value atan(2^-n). The reference design has the same interface as the Matlab function, except that it is implemented as a generic component where the number of iterations is defined during the component instantiation. This behaviour can not be created in AccelDSP, which means that the number of iterations needs to be declared as a constant before VHDL generation. The number of iterations was set to 9 for the evaluation.

4.5.2 Evaluation

This algorithm is implemented and in use by SMW. That implementation is used as a reference for the AccelDSP generated VHDL. In addition to the performance evaluation of AccelDSP, this algorithm evaluates how easy it is to implement a design when following a specification written for VHDL.


4.6 CFAR

Constant False Alarm Rate (CFAR) is an algorithm for keeping a constant false alarm rate in an environment with time variant interference, clutter and noise. The algorithm adapts the threshold that determines if a signal from the radar antenna is considered to be a target and not background noise. This is done by maximising the detection probability while the probability for a false alarm is held at a constant level. The implementation of CFAR at SMW is especially interesting to evaluate since it contains both mathematical operations, like a divider and mean value calculations, and logic, like an insertion sort implementation operating on data buffers.

4.6.1 Implementation

Matlab code, used by SMW for verification of their VHDL implementation of this algorithm, was used as a starting point for the implementation using AccelDSP. The implementation follows the same specification as the VHDL implementation at SMW and not the structure of the verification M-Code, which is fundamentally different. A brief outline of the algorithm is given below.

Into the algorithm comes a stream of video data from the antenna. Disregarding the initial start-up, when fewer than 37 samples^1 have been retrieved from the antenna, 16 values before and 16 values after the current value in the incoming data stream are collected and sorted into two lists. The current value is 16 samples before the last value retrieved from the antenna. From each sorted list a mean value is calculated from the two values in the middle. The current value is divided by the greatest of the two mean values and a reference value provided to the function. An illustration is found in Figure 4.1.

The algorithm described above is implemented using one sliding window collecting the 16 values located two guard samples before the current value. These values are sorted, and the mean value is calculated and stored in a buffer. The buffer values are used for both the previous and next mean values to save calculations. The sliding window is implemented as a buffer sorted using the insertion sort algorithm. To keep track of which value should be removed from the buffer, another buffer is needed. The mean values are stored in one last buffer for later use. The implementation is outlined in Figure 4.2.
^1 16 + 2 guard samples + current sample + 2 guard samples + 16 = 37 samples
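A simplified, non-streaming M-Code sketch of the calculation outlined above is shown below. It recomputes the sorted lists for every sample instead of using the buffered sliding-window design, and the input data and reference value are invented for the illustration.

video = abs(randn(1, 200));           % illustrative video data stream
ref   = 0.1;                          % reference value (assumed)
out   = zeros(size(video));
for n = 19:length(video)-18           % skip the 37-sample start-up region
    before = sort(video(n-18:n-3));   % 16 samples, 2 guard samples before T
    after  = sort(video(n+3:n+18));   % 16 samples, 2 guard samples after T
    Cf = mean(before(8:9));           % mean of the two middle values
    Ce = mean(after(8:9));
    cfar_mean = max([Cf, Ce, ref]);
    out(n) = video(n) / cfar_mean;    % CFAR video
end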


Figure 4.1: Outline of the CFAR algorithm. T is the current sample, fx are the samples before and ex the samples after the current sample. Sf and Se are the sorted lists, Cf is the mean value of the samples before and Ce the mean value of the samples after the current sample; CFAR Mean = max(Cf, Ce, Reference value) and CFAR Video = T / CFAR Mean.

Figure 4.2: Outline of the implementation of the CFAR algorithm. T is the current sample, fx are the samples before and ex the samples after the current sample. The sorted lists form a sliding window; the current values are cached in a 19-value buffer (16+2+1) and the mean values in a 22-value buffer (16+2+1+2+1), so that Ce(n) = Cf(n+21). CFAR Mean(n) = max(Ce(n), Cf(n), Reference value) and CFAR Video = T(n) / CFAR Mean(n).


4.6.2 Evaluation

This algorithm is implemented and in use by SMW. That implementation is used as a reference for the AccelDSP generated VHDL, which is designed using the same specification as the reference implementation. This makes the two implementations comparable.


4.7 MUSIC

Multiple Signal Classification Method (MUSIC) is an eigenvalue based method for frequency estimation and an improvement of the Pisarenko harmonic decomposition. The algorithm estimates the frequencies of a known number of complex exponentials in noise; the number is provided through the argument p in the implementation below.

4.7.1 Implementation

The Matlab implementation, Listing 4.4, is taken from [8, pp. 463-465]. The input to the function was a sine of length 1000 samples, which gives the parameter p in the function declaration in Listing 4.4 the value 2. The parameter M, determining the size of the autocorrelation matrix, was set to 8.
function Px = music(x, p, M)
R = covar(x, M);
[v, d] = eig(R);
[y, i] = sort(diag(d));
Px = 0;
for j = 1:M-p
    Px = Px + abs(fft(v(:, i(j)), 1024));
end
Px = -20*log10(Px);

Listing 4.4: Music Algorithm
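A usage sketch of the listing, matching the test setup described above (1000 samples, p = 2 and M = 8). The sine frequency and noise level are assumed for illustration, and the covar function from [8] must be available on the Matlab path.

% Build a 1000-sample test signal: one real sine in white noise, which
% corresponds to p = 2 complex exponentials.
n  = 0:999;
x  = sin(2*pi*0.12*n) + 0.1*randn(1, 1000);
Px = music(x, 2, 8);        % 1024-point pseudospectrum estimate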

4.7.2 Evaluation

This algorithm is evaluated as an example of a more advanced algorithm based on other kernel algorithms from linear algebra, signal processing and computer science. The algorithm is used by SMW, which has a PowerPC implementation but no FPGA implementation. The algorithm is implemented in Matlab, code is generated for both FPGA and CPU, and the performance on the two platforms is compared using the developed method.


4.8 Cartesian to polar map transformation

One function in the naval radar developed by SMW is to filter out land to make it easier to detect stationary targets. This is done using a chart database and instruments for positioning the boat on the chart. The chart database contains cartesian maps where the value of each pixel determines whether that position is land or sea. The radar returns target positions in polar coordinates with the origin at the boat. A coordinate transformation is thus required before the targets can be filtered using the chart data. The filtering can be implemented in two ways. A lookup can be made in the cartesian chart every time a target is detected, which requires the target position to be transformed to cartesian coordinates. Another possibility is to reset every position in the video data which corresponds to land before the target decision logic, which requires the chart data to be transformed to a polar map. Currently SMW has implemented a polar to cartesian transformation and lookup in the data processing of the radar, which runs on PowerPC processors. This requires all targets, even those which are later filtered out as land, to be transferred to the data processing, which consumes a lot of communication bandwidth. In this work the performance of the other method is evaluated.

4.8.1 Implementation

The implementation was designed from scratch with the intention of being implemented on both FPGA and CPU. This description is general and does not cover platform dependent details. An oversampling technique is used in the implementation, where every corner of the cartesian pixels is transformed to polar coordinates. The polar coordinates are then transformed to a cell in the Plane Position Indicator (PPI). The value (land or sea) of each pixel is transferred to all corners of the pixel, and every corner belongs to all neighbouring pixels. This means that it is enough for one pixel out of four to be land for a corner to be marked as land. Since the corner is transformed to a polar position and then to a cell, it is enough for one pixel to be land for the resulting cell in the PPI to be marked as land. For an illustration see Figure 4.3. A cell is the smallest resolution in the PPI and is delimited by the lines creating the sectors and the folds. The sectors are numbered from 1 and increase with the angle, and the folds are numbered from 1 and increase with R. The transformation from polar coordinates to cells is done by dividing the R and angle coordinates by the fold and sector resolutions and rounding the results up to integers. Figure 4.4 shows the cells that will be marked as land if the indicated pixel is land.


Figure 4.3: Cartesian map where 1 represents land and 0 sea. The indicated spot in the middle will be marked as land since at least one of its neighbours is land.

Given a cartesian map, the algorithm iterates through all pixels and transforms the two leftmost corners to polar coordinates if the pixel represents land. The polar coordinates are transformed to cells and the land data is stored in the polar map. A position is marked as land if either the current pixel or the former pixel in the row is land, since they share one border. In this way the cartesian matrix is read continuously, which gives good cache performance on processors, and only store operations are performed on the polar map, which reduces the impact of slow memory access.
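A minimal Matlab sketch of the corner-to-cell mapping described above. The pixel size, the fold and sector resolutions and the example row and column indices are assumed values for illustration, not the SMW configuration.

% Transform one pixel corner (cartesian, metres, origin at the boat) to a
% cell index (fold, sector) in the PPI by dividing with the resolutions
% and rounding up, as described in the text.
pixelSize = 48;                       % assumed cartesian resolution [m]
foldRes   = 48;                       % assumed fold (range) resolution [m]
sectorRes = 2*pi/4096;                % assumed sector (angle) resolution [rad]
row = 100;  col = 200;                % example pixel indices
xc = col * pixelSize;                 % corner position of pixel (row, col)
yc = row * pixelSize;
R     = sqrt(xc^2 + yc^2);            % polar coordinates of the corner
theta = mod(atan2(yc, xc), 2*pi);
fold   = max(ceil(R / foldRes), 1);        % cell indices, numbered from 1
sector = max(ceil(theta / sectorRes), 1);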

4.8.2 Evaluation

This algorithm is used to develop and evaluate the method for comparing the performance of different platforms by generating code for both CPU and FPGA from the Matlab implementation.


Figure 4.4: If the indicated pixel is land, all cells in the polar map marked with grey will be marked as land because their borders intersect with the corner of the pixel.

Chapter 5 Results
This chapter presents the results for all algorithms used for the evaluation. VHDL generation using AccelDSP is presented in section 5.1 and C-Code generation using Real Time Workshop Embedded Coder (EMLC) is presented in section 5.2. The algorithms are described in chapter 4.

5.1 VHDL generation using AccelDSP

This section presents the results of VHDL generation of all algorithms described in chapter 4 using the method described in section 3.1. Most algorithms are used for evaluating AccelDSP but the performance data for FFT, MUSIC and Cartesian to polar map transformation are used in the platform comparison, see section 6.2. A short note about the development time is also presented for each algorithm.

5.1.1 FFT

The performance and area utilisation results for the implementations are found in Table 5.1. It is important to note that the generated design uses a lot more memory than the reference, which can be seen in the blockRAM usage. The target device, xc5vsx95t-f1136-3, is designed for memory intensive applications and using 18% of the device for a single algorithm is a very poor result, especially since the reference only uses 1%. The pipeline delay for the generated design is also considerably longer than for the reference, which does not affect the input sampling but the signal delay. The calculation time for the AccelWare core is about 5.6 times that of the CORE Generator code. The time it takes to implement each of these designs is very similar since both are IP-Blocks, which means that there is no advantage in using AccelDSP to implement an FFT.


                           AccelWare        COREGen          Virtex-5
Utilisation
  Occupied Slices          2,564 (17%)      1,575 (10%)      out of 14,720
  DSP48Es                  40 (6%)          40 (6%)          out of 640
  Slice Registers          4,915 (8%)       3,465 (5%)       out of 58,880
  Slice LUTs               7,479 (12%)      2,689 (4%)       out of 58,880
  Number of bonded IOBs    70 (10%)         85 (13%)         out of 640
  BlockRAM/FIFO            44 (18%)         3 (1%)           out of 244
  Total Memory used (KB)   1,584 (18%)      72 (1%)          out of 8,784
  BUFG/BUFGCTRLs           2 (6%)           1 (3%)           out of 32
Timing
  Maximum Frequency        151.1 MHz        373.5 MHz        out of 550 MHz
  Minimum Period           6.62 ns          2.67 ns          out of 1.81 ns
  Start-up cycles          6144 cycles      2171 cycles
  Calculation time¹        47.45 µs         8.53 µs
  Input Sampling           151 MSPS         373 MSPS

¹ The calculation time is the time from when the first input sample of a frame is loaded into the FFT until the entire output of the frame has been collected from the FFT. Calculation time = (Start-up cycles + 1023 cycles for remaining outputs) * Minimum Period.

Table 5.1: Performance comparison for the FFT algorithm


5.1.2 Correlation

The performance of the xcorr function was never analysed; instead the complexity of the implementation was the major focus. The function can not be interpreted by AccelDSP since it contains a call to an unsupported built-in function, strncmpi. xcorr is a built-in function implemented in M-Code, and to be able to edit the function file it must be copied to the current working directory. Even if the function is rewritten, the fft utilised by xcorr is only supported by AccelDSP through an AccelWare block. This requires the function calls to be replaced by generated AccelWare blocks for fft and ifft, or a time domain implementation to be used instead. The latter method was chosen, leading to Listing 5.1.
N = length(x);
lags = -N+1:N-1;
R = zeros(1, 2*N-1);
i = 1;
% Negative lags
for m = N-1:-1:0
    R(i) = x(1:N-m) * conj(y(1+m:N)).';
    i = i + 1;
end
% Positive lags
for m = 0:N-1
    R(m+N) = x(1+m:N) * conj(y(1:N-m)).';
end

Listing 5.1: Correlation using common Matlab syntax

This code is still unsupported by AccelDSP since it is unable to infer a static shape for 1:N-m. The summation needs to be rewritten using loops, leading to Listing 5.2, which is correctly interpreted by AccelDSP. The for loops can be unrolled to increase the input sampling rate.


% Size of the input vectors x and y
N = 10;
lags = -N+1:N-1;
R = zeros(1, 2*N-1);
% Negative lags
i = 1;
for m = N-1:-1:1
    for xInd = 1:N-m
        yInd = xInd + m;
        yConj = conj(y(yInd));
        R(i) = R(i) + x(xInd)*yConj;
    end
    i = i + 1;
end
% Positive lags
for m = 0:N-1
    for yInd = 1:N-m
        xInd = yInd + m;
        yConj = conj(y(yInd));
        R(m+N) = R(m+N) + x(xInd)*yConj;
    end
end

Listing 5.2: Correlation compatible with AccelDSP


5.1.3 Matrix multiplication

5.1.3.1 Batch implementation

AccelDSP does not accept matrix inputs or outputs to the design function. The input matrices need to be reshaped to vectors before input, reshaped once more into matrices inside the function to perform the matrix multiplication, and then reshaped again before returning the result. At the time this function was written, AccelDSP could not accept integer constants in the reshape function. This was due to a bug that generated an internal error during the analyse step of the program. The interface of the function was therefore written as in Listing 5.3, which loads the constants from a file instead. This worked as a workaround for the internal error, and the bug is fixed in newer versions of AccelDSP. The results are found in Table 5.2. The time it takes to develop the design with AccelDSP is considerably shorter than the development time for the hand-coded version: the development time for the AccelDSP implementation is about an hour if the problems presented here are ignored, while the hand-coded version takes a day or two.
function Cret = mult(Ain, Bin)
% Read vector/matrix sizes from file
size = load('testfile.txt');
% Reshape input vectors to matrices
A = reshape(Ain, size(1), size(1));
B = reshape(Bin, size(1), size(1));
% Perform matrix multiplication
C = A*B;
% Reshape resulting matrix to vector
Cret = reshape(C, size(2), size(3));

Listing 5.3: Batch 3x3 matrix multiplication
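The reshape round trip that the wrapper relies on can be illustrated with a few lines of plain Matlab, independent of AccelDSP:

A  = magic(3);               % any 3x3 matrix
a  = reshape(A, 1, 9);       % flatten to a vector for the design port
A2 = reshape(a, 3, 3);       % restore the matrix inside the design function
isequal(A, A2)               % true: reshape is a lossless, column-major copy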

5.1.3.2 Streaming implementation

The generated VHDL using the non-resource-shared/pipelined optimisation is corrupt and generates incompatible bus widths; thus resource shared/pipelined is given as the directive for the vector-matrix multiplication. The synthesis error reads:

#ERROR:HDLParsers:837 "../mult_accel_mult_accum_001.vhd" Line 44. Width mismatch. Expected width 21, Actual width is 20 for dimension 1 of C.


                           AccelDSP         Hand-coded       Virtex-5
Utilisation
  Occupied Slices          203 (1%)         1 (1%)           out of 14,720
  DSP48Es                  27 (4%)          27 (4%)          out of 640
  Slice Registers          344 (1%)         0 (0%)           out of 58,880
  Slice LUTs               3 (1%)           1 (1%)           out of 58,880
  Number of bonded IOBs    292 (45%)        290 (45%)        out of 640
  BUFG/BUFGCTRLs           1 (3%)           1 (3%)           out of 32
Timing
  Maximum Frequency        250 MHz          550 MHz          out of 550 MHz
  Minimum Period           4 ns             1.81 ns          out of 1.81 ns
  Start-up cycles          6 cycles         3 cycles
  Calculation time¹        24 ns            5.43 ns
  Input sampling           250 MSPS         550 MSPS

¹ The calculation time is the time from when two matrices are sent as input until the output matrix is available.

Table 5.2: Comparison of batch 3x3 matrix multiplication implementations

Since the code will be very slow without the speed optimisation, another version of the multiplication using loops was written, see Listing 5.4. The performance of this code is compared with the VHDL reference. The loops are annotated to be fully unrolled. The results are found in Table 5.3.
for row = 1:a
    for col = 1:a
        outdata(row) = outdata(row) + indata(col)*B(col, row);
    end
end

Listing 5.4: Explicit vector-matrix calculation

The development time comparison between the AccelDSP implementation and the hand-coded version is very similar to that for the batch matrix implementation, except that the hand-coded VHDL version took around twice the time to complete. However, the superior design time of the AccelDSP implementation is not worth it considering the poor timing performance and area utilisation.


                           AccelDSP         Hand-coded       Virtex-5
Utilisation
  Occupied Slices          3,797 (25%)      194 (1%)         out of 14,720
  DSP48Es                  100 (15%)        140 (21%)        out of 640
  Slice Registers          4,404 (7%)       720 (1%)         out of 58,880
  Slice LUTs               5,576 (9%)       12 (1%)          out of 58,880
  Number of bonded IOBs    247 (38%)        328 (51%)        out of 640
  BUFG/BUFGCTRLs           1 (3%)           1 (3%)           out of 32
Timing
  Maximum Frequency        50.5 MHz         242.8 MHz        out of 550 MHz
  Minimum Period           19.8 ns          4.118 ns         out of 1.81 ns
  Start-up cycles          2.5 cycles       7 cycles
  Calculation time¹        445 ns           111.18 ns
  Input sampling           50.5 MSPS        242 MSPS

¹ Frame calculation time. The two matrices are loaded with 10 values per cycle, which results in 20 cycles; these 20 cycles are the frame. The calculation starts when the first value of the second matrix is loaded into the design.

Table 5.3: Comparison of streaming 10x10 matrix multiplication implementations


5.1.4 FIR-filter

The Saab Microwave Systems (SMW) implementation is a generic VHDL implementation, which means that the filter behaviour is defined in the instantiation of the component by setting certain parameters. A generic component has the same interface and the same performance as an ordinary component once instantiated, but allows greater flexibility since it can be instantiated to have different functionality. Generic components can not be created in AccelDSP, but the reference component can be instantiated to match the AccelDSP generated design. First a 64-tap asymmetric FIR-filter was evaluated, since the SMW reference implementation has documented performance results for this type of filter. A result for this filter can not be presented since the generated VHDL code could not be synthesised in XST and generated over 200,000 warnings in Synplify PRO. This behaviour is most likely the result of an internal bug in AccelDSP related to the asymmetric characteristic, since a symmetric 64-tap filter does not generate this amount of warnings. The 16-tap asymmetric FIR-filter was implemented without problems. The performance and area utilisation are presented in Table 5.4. It is important to note that the performance and area utilisation figures of the reference implementation were provided by SMW. Since the AccelDSP FIR-filter implementation was done using an AccelWare core, the development time is considerably lower than for the hand-coded version done by SMW. The SMW development time is over a month, but the result is a generic design which can implement a great number of different types of FIR-filters. The AccelDSP generated design differs from the reference design when it comes to interface and control logic; the more complex interface in the reference design causes extra pipeline delay.


                           AccelWare        Hand-coded       Virtex-5
Utilisation
  Occupied Slices          301 (2%)         304 (2%)         out of 14,720
  DSP48Es                  16 (2%)          16 (2%)          out of 640
  Slice Registers          1065 (1%)        606 (1%)         out of 58,880
  Slice LUTs               62 (1%)          464 (1%)         out of 58,880
  Number of bonded IOBs    66 (10%)         65 (10%)         out of 640
  BUFG/BUFGCTRLs           1 (3%)           1 (3%)           out of 32
Timing
  Maximum Frequency        196.7 MHz        213 MHz          out of 550 MHz
  Minimum Period           5.09 ns          4.69 ns          out of 1.81 ns
  Start-up cycles          67.5¹            90
  Calculation Time²        344 ns           422 ns
  Input Sampling           196 MSPS         213 MSPS

¹ The generated design delivers the output on a negative flank.
² Sample calculation time.

Table 5.4: Performance comparison for the 16-tap FIR filter


5.1.5 CORDIC

The goal for the COordinate Rotation DIgital Computer (CORDIC) algorithm was to create a functionally identical implementation of the VHDL reference. Large expressions in the provided Matlab code had to be divided into smaller expressions, compare Listing 5.5 and Listing 5.6. The reason is to make it possible to manually edit all quantisation properties of the expressions. If the expression is left in its original form, the auto-quantiser in AccelDSP will quantise the sub-expressions differently compared to the variable holding the result, which results in loss of precision.
re_vec(i) = re_vec(i-1) - im_vec(i-1)*(2^-(i-1));
im_vec(i) = im_vec(i-1) + re_vec(i-1)*(2^-(i-1));

Listing 5.5: Original expression of a part of the provided code

re_vec_prev = re_vec(i-1);
im_vec_prev = im_vec(i-1);
re_vec_scaled = re_vec_prev * (2^-(i-1));
im_vec_scaled = im_vec_prev * (2^-(i-1));
re_vec(i) = re_vec_prev - im_vec_scaled;
im_vec(i) = im_vec_prev + re_vec_scaled;

Listing 5.6: The same expression, but with the ability to edit the quantisation for the prev and scaled variables The resulting VHDL code from AccelDSP was veried to be functional correct using a testbench that tested both implementations with the same input data. The main dierence between the implementations are that the reference code uses a generic variable to dene the number of iterations. Since this behaviour can not be created in AccelDSP the number of iterations was set to 9 before VHDL generation. The results are found in Table 5.5. As shown in table the results of the two designs are very similar in both area and performance. The development time for the AccelDSP CORDIC implementation is measured in days. This is due to the many problems with the quantisation directives in the xed-point model. The development time for the hand-coded version is hard to determine since it was done by SMW.


                           AccelDSP         Hand-coded       Virtex-5
Utilisation
  Occupied Slices          454 (3%)         287 (1%)         out of 14,720
  Slice Registers          838 (1%)         708 (1%)         out of 58,880
  Slice LUTs               981 (1%)         727 (1%)         out of 58,880
  Number of bonded IOBs    87 (13%)         104 (16%)        out of 640
  BUFG/BUFGCTRLs           1 (3%)           1 (3%)           out of 32
Timing
  Maximum Frequency        259.9 MHz        358.2 MHz        out of 550 MHz
  Minimum Period           3.8 ns           2.8 ns           out of 1.81 ns
  Start-up cycles          9                9
  Calculation time¹        34.2 ns          25.2 ns
  Input Sampling           259 MSPS         358 MSPS

¹ Sample calculation time.

Table 5.5: Performance comparison for the CORDIC implementations with 9 iterations, which in this case is both the pipeline length and the number of start-up cycles


5.1.6 CFAR

Making an AccelDSP implementation of Constant False Alarm Rate (CFAR) following the specification written by SMW is not possible due to its use of multiple clocks, which is not supported by AccelDSP. As a result, a single clock implementation of the CFAR was designed instead of one with two different clocks. Since the M-Code implementation provided by SMW is written for verification and not for a real-time hardware realisation, a new implementation was written from scratch. This was because the design of the original code was batch oriented and contained a lot of unsupported syntax. The new code was written to follow the VHDL specification as strictly as possible, and the original code was used to verify the new implementation and as a functional reference. The specification states that a CORE Generator block shall be used for the divider that performs the normalisation. An AccelWare divider, auto-inferred using the rdivide function, was used instead. A number of different implementations are available for this function but only one managed to provide the correct result; none of the algorithms resulted in any error message, but the results differed from the floating-point design. The sorting of the sliding window was implemented using insert sort as in the reference design. The loops for removing and inserting values into the window could not be unrolled, which drastically influences the performance. The problem appears in the verify RTL step of the VHDL generation workflow, which fails if any of the loops are unrolled. This is the reason for the 50 calculation cycles. Considering the large calculation time for each value, the AccelDSP design is no longer useful as a streaming implementation. Since the implementations differ, a good metric of the performance is the calculation time for a test vector; the vector used in Table 5.6 has 780 values. The reference implementation is old and was compiled by SMW, which has a compilation environment set up for the Virtex-II PRO device for this algorithm. This device was therefore used as a reference for both implementations. The VHDL generated by AccelDSP for the Virtex-5 device was implemented on the Virtex-II without any problems. The development time for the AccelDSP implementation of CFAR is about a week in total. The design time for the hand-coded reference is hard to determine since it was done by SMW.


                           AccelDSP         Hand-coded          Virtex-II PRO
Utilisation
  Occupied Slices          4141 (17%)       2099 (8%)           out of 23,616
  Slice Registers          3654 (7%)        2490 (5%)           out of 47,232
  Slice LUTs               4850 (10%)       2680 (5%)           out of 47,232
  Number of bonded IOBs    114 (14%)        105 (12%)           out of 812
  GCLKs                    1 (6%)           3 (18%)             out of 16
Timing
  Maximum Frequency        67 MHz           6.25/12.5/50 MHz    out of 450 MHz
  Minimum Period           14.8 ns          160/80/20 ns        out of 2.22 ns
  Calculation cycles       50               1
  Calculation time¹        577 µs           125 µs
  Input Sampling           1.35² MSPS       6.25 MSPS

¹ Batch calculation time for 780 test values.
² Since the calculation takes 50 cycles, new samples can be sampled at 67 MHz / 50 cycles = 1.35 MSPS.

Table 5.6: Performance comparison for CFAR


5.1.7 MUSIC

The extremely short M-Code implementation in Listing 4.4, provided by Hayes [8], is not accepted by AccelDSP since fft, diag, eig and sort are unsupported Matlab built-in functions. Implementations of these functions can not be automatically inferred by the program, which means that the code must be implemented by hand or using IP-Blocks. The M-Code implementation provided by Hayes was modified and rewritten to match the AccelDSP coding guidelines. The code also needed to be written with hardware I/O restrictions in mind because of the large amount of data entering the algorithm. The Multiple Signal Classification Method (MUSIC) algorithm went through a series of iterations to get a functionally correct implementation. The first implementation passed the RTL testbench, but the area utilisation was so extensive that other FPGA devices had to be considered. A number of algorithm optimisations were done to the M-Code to make the design map better to a regular Virtex-5. The development time for the AccelDSP implementation of MUSIC was several weeks.

5.1.7.1 Covariance

The covariance function covar(x,M) used in Listing 4.4 is provided by Hayes [8]. The algorithm generates a covariance matrix of size M from the input x to the MUSIC function. covar can not be automatically inferred by AccelDSP and was rewritten for hardware implementation.
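The kind of rewrite needed can be illustrated as below: an M x M autocorrelation-matrix estimate built with explicit loops and scalar operations only. This is an illustrative structure under assumed example inputs, not the covar function from [8] and not the exact code used in the thesis.

% Loop-based M x M autocorrelation-matrix estimate of the input vector x.
x = randn(1, 1000);           % example input signal
M = 8;                        % example matrix size
N = length(x);
R = zeros(M, M);
for row = 1:M
    for col = 1:M
        acc = 0;
        for n = M:N
            acc = acc + x(n - row + 1) * conj(x(n - col + 1));
        end
        R(row, col) = acc / (N - M + 1);
    end
end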

5.1.7.2 Eigenvalues

In Matlab eigenvalues are calculated using the built-in function eig(), which is not supported by AccelDSP. The implementation of the eigenvalue calculation must therefore be made in a way that is suitable for hardware and compatible with AccelDSP. Matlab calculates eigenvalues using the QR-method, which has good numerical properties, but a test implementation showed that it was unsuitable for generation using AccelDSP. Since the correlation matrix generated from covar, from which the eigenvalues are to be determined, is square, symmetric and positive definite, its singular values are equal to its eigenvalues [16]. This fact makes it possible to use the singular value decomposition, available as an AccelWare block, to calculate the eigenvalues.
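The property used here is easy to check in Matlab; a small sketch with a random symmetric positive definite test matrix (not the actual correlation matrix):

A = randn(8);
R = A*A' + 8*eye(8);             % symmetric positive definite test matrix
e = sort(eig(R), 'descend');     % eigenvalues, largest first
s = svd(R);                      % singular values, returned largest first
max(abs(e - s))                  % essentially zero (numerical precision)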


5.1.7.3 Diag

The built-in function diag is not supported by AccelDSP. A replacement function was written using a for-loop to iterate over the diagonal of the matrix.
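A sketch of such a replacement, assuming a square example matrix d (not the exact code used in the design):

% Loop-based replacement for diag() on a square matrix d.
d = magic(8);                 % example matrix
n = size(d, 1);
dvec = zeros(n, 1);
for k = 1:n
    dvec(k) = d(k, k);        % copy the diagonal element by element
end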

5.1.7.4 Sort

A simple bubble sort algorithm for sorting the eigenvalues was written in Matlab code to replace the Matlab function sort(). There are better algorithms for this, like the bitonic sort, which can be implemented efficiently in hardware, but since only a small number of values are sorted this optimisation would not make any large difference in performance.
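A sketch of a bubble sort in the loop-only style AccelDSP accepts (illustrative only, with assumed example data; not the exact code used in the design):

% In-place bubble sort of the vector vals into ascending order.
vals = [3 1 4 1 5 9 2 6];     % example data, e.g. eigenvalues
n = length(vals);
for i = 1:n-1
    for j = 1:n-i
        if vals(j) > vals(j+1)
            tmp       = vals(j);       % swap neighbouring elements
            vals(j)   = vals(j+1);
            vals(j+1) = tmp;
        end
    end
end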

5.1.7.5 FFT

The FFT implementation was done using an AccelWare core. The generated design was inferred as a replacement for the built-in function fft in the Matlab code.

5.1.7.6 Performance and area utilisation

The I/O interface of the MUSIC implementation was defined as a burst implementation where all input values are loaded before execution. The generated VHDL code of the I/O interface behaves very strangely, since the testbench indicates that it requires 10 clock cycles to load each input value and 10 clock cycles to deliver each output value. The result can probably be improved considerably if the Matlab code is optimised more towards hardware. The reason for believing this is the very good results of the area optimisations done to make the design fit; one example is that the slice register usage was lowered from 133,800 to 7,931 registers.


                           AccelDSP         Virtex-5
Utilisation
  Occupied Slices          5,802 (39%)      out of 14,720
  BlockRAMs                20 (8%)          out of 244
  Memory used (KB)         720 (8%)         out of 8,784
  DSP48Es                  84 (13%)         out of 640
  Slice Registers          7,931 (13%)      out of 58,880
  Slice LUTs               15,989 (27%)     out of 58,880
  Number of bonded IOBs    74 (11%)         out of 640
  BUFG/BUFGCTRLs           2 (6%)           out of 32
Timing
  Maximum Frequency        42.8 MHz         out of 550 MHz
  Minimum Period           23.3 ns          out of 1.81 ns
  Calculation Time¹        15.3 ms

¹ Batch calculation time.

Table 5.7: Synthesis result for the MUSIC algorithm


5.1.8 Cartesian to polar map transformation

The algorithm presented here is evaluated as an example of an algorithm that is considered for implementation using AccelDSP. It is also an example algorithm used in the method for platform comparison. The AccelDSP model was designed to use the same cartesian maps as the PowerPC implementation used by SMW. The requirement on the implementation was that it should finish the map creation in under approximately 5 seconds. The algorithm is also required to use external memory to read and store map data; how the map data is retrieved and stored is assumed to be handled by a memory controller. The design fetches one new pixel of the cartesian map each clock cycle if the current or previous pixel is water, and every other cycle otherwise. This process continues until the algorithm has iterated through the entire map. To lower memory usage, only land (and not sea) is written to the polar map. The implementation is found in appendix section D.2. To verify that the AccelDSP design functions correctly, a simulation of a small 128*128 cartesian map was performed. A simulation of a cartesian map of the same size as in the SMW implementation can not be performed because of the poor scalability of AccelDSP and limited computation resources. The reason that the implementation needs to be remade in AccelDSP is that the program can not create a generic VHDL component; the implementation therefore needs to be regenerated each time the map size changes. The figures shown in Table 5.8 are from the 128*128 map implementation using a map covered entirely with land, a worst case scenario where each pixel in the cartesian map takes 2 cycles to calculate. Given this result it is possible to estimate the calculation time of a larger map, since the implementation only scales in the number of iterations and not in complexity. The development time for the AccelDSP implementation of this algorithm was about two weeks.


                           AccelDSP            Virtex-5
Utilisation
  Occupied Slices          4244 (28%)          out of 14,720
  DSP48Es                  22 (3%)             out of 640
  Slice Registers          7927 (13%)          out of 58,880
  Slice LUTs               11136 (18%)         out of 58,880
  Number of bonded IOBs    38 (5%)             out of 640
  BUFG/BUFGCTRLs           2 (6%)              out of 32
Timing
  Maximum Frequency        41.9 MHz            out of 550 MHz
  Minimum Period           23.9 ns             out of 1.81 ns
  Start-up cycles          30
  Calculation time¹        784 µs² / 802 ms³

¹ Batch calculation time.
² 128*128 map gives 23.9 ns/cycle * (30 + 128*128*2) cycles.
³ 4096*4096 map gives 23.9 ns/cycle * (30 + 4096*4096*2) cycles.

Table 5.8: Implementation results for Cartesian to polar map transformation


5.2 C-Code generation using EMLC

In this section the result of the C-Code generation is presented for the three algorithms used for developing a method for platform decision. The code was generated using EMLC and evaluated using PSIM as described in section 3.2. Descriptions of the algorithms are found in chapter 4 and the processor in section 2.3.4.

5.2.1 FFT

EMLC generated working code from the Matlab command X = fft(x, N) without any problems. The performance of the generated code is found in Table 5.9. Since this algorithm is small relative to the size of the instruction cache, level 3 optimisations generate fewer cache misses and better overall performance. Removing the overhead from the main function, used for evaluating the performance and verifying the results of the function, the number of cycles required to execute the algorithm is

194,333 - 52,216 = 142,117 cycles    (5.1)

which leads to an execution time of

142,117 cycles * 5.56 ns/cycle ≈ 790 µs.    (5.2)

For the resource usage it is assumed that a transformation of 256 values from one of the 18 channels of the surface radar is required. The input sampling rate is 3.125 MHz and real data is assumed for the sake of simplicity. The real time requirement is that the execution must be finished in the time it takes to buffer 256 values. This time is calculated as

256 samples / 3.125 MSPS = 81.92 µs.    (5.3)

Equations 5.2 and 5.3 result in a resource usage of

790 µs / 81.92 µs ≈ 960%.    (5.4)

The resource usage is thus 960% of the CPU which means that this use case is not possible.
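The resource-usage estimate above can be summarised in a few lines of Matlab; the numbers come from Table 5.9 and the assumptions stated in the text.

total_cycles    = 194333;            % level 3 optimisations, whole program
overhead_cycles = 52216;             % measurement harness (main function)
cycle_time      = 5.56e-9;           % 180 MHz clock -> 5.56 ns per cycle
exec_time = (total_cycles - overhead_cycles) * cycle_time;  % about 790 us
deadline  = 256 / 3.125e6;           % time to buffer 256 samples: 81.92 us
usage     = exec_time / deadline     % about 9.6, i.e. roughly 960 %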

5.2.2 MUSIC

The MUSIC algorithm described in section 4.7.1 was used for code generation without any problems. The performance of the generated code is found in Table 5.10.


                    Level 3 optimisations    Level 2 optimisations
Overhead
  Instructions      52,274                   52,747
  Cycles            52,216                   52,723
  Icache misses     433                      485
Total
  Instructions      203,731                  204,207
  Cycles            194,333                  194,841
  Icache misses     701                      747

Table 5.9: Performance of the FFT algorithm

Since this algorithm is quite large in number of instructions, a better instruction cache hit rate is obtained with level 2 optimisations. Despite the lower hit rate of the instruction cache, better performance is obtained using level 3 optimisations. Removing the overhead, the number of cycles required to execute the algorithm is

6,540,479 - 198,888 = 6,341,591 cycles    (5.5)

which leads to an execution time of

6,341,591 cycles * 5.56 ns/cycle ≈ 35.2 ms.    (5.6)

A slightly modified version of the use case for MUSIC at SMW has been used for the estimation of the resource usage. In a high resolution mode¹ of the surface radar it is possible to add a window where higher resolution is desired. According to SMW a window is 10 degrees wide and 45 folds long (for a definition of folds see section 4.8.1) and contains 35,000 samples. A window is retrieved each second, which corresponds to one 360 degree turn of the antenna. It is assumed that the calculation can be made in batches of 1000 samples, which requires 35 executions of the algorithm per window. 35 executions of the algorithm with an execution time according to equation 5.6 result in a resource usage of

(35 * 35.2 ms) / 1 s ≈ 123%.    (5.7)

The resource usage is thus 123% of the CPU which means that this use case is not possible to implement on this CPU.
¹ The mode is called HRFC, High Resolution Fire Control.


                    Level 3 optimisations    Level 2 optimisations
Overhead
  Instructions      206,367                  206,367
  Cycles            198,888                  198,888
  Icache misses     481                      481
Total
  Instructions      6,244,855                6,358,598
  Cycles            6,540,479                6,567,920
  Icache misses     408,115                  335,895

Table 5.10: Performance of the MUSIC algorithm

5.2.3 Cartesian to polar map transformation

The code for the algorithm was generated without any problems and the performance is found in Table 5.11. The implementation is found in appendix section D.1. Although this algorithm is large measured in number of instructions, the level 3 optimisations generate fewer instructions, a better instruction cache hit rate and better overall performance. Removing the overhead, the number of cycles required to execute the algorithm is

34,383,017 - 32,953 = 34,350,064 cycles    (5.8)

which leads to an execution time of

34,350,064 cycles * 5.56 ns/cycle ≈ 191.0 ms.    (5.9)

For the resource usage it is assumed that the algorithm is run on a boat with a maximum speed of 20 kn. A new map must be transformed before the boat has travelled one fold, which represents 48 m using the current resolution. Using the fact that one knot is 1.852 km/h ≈ 0.51 m/s, this gives a real time requirement of

48 m / (20 kn * 0.51 m/s) ≈ 4.7 s.    (5.10)

Equations 5.9 and 5.10 result in a resource usage of

191 ms / 4.7 s ≈ 4%.    (5.11)

The resource usage is thus 4% of the CPU which means that 96% of a time slot of 4.7s is free for other work.


                    Level 3 optimisations    Level 2 optimisations
Overhead
  Instructions      82,148                   82,148
  Cycles            32,953                   32,953
  Icache misses     238                      238
Total
  Instructions      28,310,504               29,360,976
  Cycles            34,383,017               35,726,107
  Icache misses     1,938,067                2,071,987

Table 5.11: Performance of Cartesian to polar map transformation

Chapter 6 Discussion
This chapter contains a discussion of the evaluation of AccelDSP in section 6.1, interesting areas when comparing implementation platforms in section 6.2, and a discussion of the method for platform decision and how it can be used in section 6.3. The discussions are the basis for the conclusions in chapter 7. The last section, 6.4, contains interesting software and evaluations considered for future work.

6.1 Evaluation of AccelDSP

As stated in section 3.1 the following aspects of AccelDSP, and its ability to generate VHDL, have been evaluated:

1. To attack the algorithm implementation from the perspective of an engineer with little or no knowledge of VHDL and FPGAs.
2. To evaluate the flexibility of the AccelDSP M-Code interpreter in terms of which functions and syntax can be synthesised.
3. To evaluate the performance and resource utilisation of generated designs compared to hand-coded or IP-Block implementations.
4. To evaluate AccelDSP as a tool for rapid prototyping.
5. To evaluate if the generated VHDL code is technology independent.

The perspective of an engineer with little knowledge of VHDL and the subset of M-Code that can be synthesised are discussed under Usability. The other three evaluation points have their own sections. The accuracy of the evaluation is discussed in the last section.


6.1.1 Usability

This section concerns both the general usability of AccelDSP and the M-Code interpreter. The interpreter is the part of the software that parses the M-Code and thus determines which functions and syntax are supported. In this thesis AccelDSP is evaluated for its ability to generate VHDL code intended to be inferred into a larger design. The design can change target device during its lifetime because of the long product life at Saab Microwave Systems (SMW) (20-25 years).

A workflow recommended by Xilinx, according to Olivier Tremois, a DSP expert at Xilinx, is to use AccelDSP together with Xilinx System Generator. This software is described in section 1.3.1. Using this approach AccelDSP is used for generating small building blocks and System Generator for creating the fundamental design in a graphical environment. This workflow is not usable for SMW since there is no guarantee that a single piece of software will be supported and able to run on new workstations over the long product lifetime at SMW. For example, 16 years before the writing of this report Windows 3.1 was popular, and software written for this Operating System (OS) is not likely to work on current systems. A hardware description language like VHDL is likely to have better support.

The results of the evaluation in this thesis correspond to the recommendations of Tremois. Most small algorithms are easy to implement using AccelDSP, but more complex algorithms cause problems. These algorithms cause many design iterations and often require utilised M-Code functions to be rewritten. AccelDSP is not as suitable for generation of VHDL by an engineer with little experience of FPGA implementations as hoped for. The main reasons are the fundamental difference between the streaming behaviour of processor and FPGA implementations and the limited subset of the M-Code language that is supported. Functions that are supported using AccelWare often need information about which type of hardware implementation should be used in order to work correctly. To choose the correct implementation, knowledge of both the hardware and the algorithms is needed. A good example is the rdivide AccelWare block, which is used in the CFAR implementation. This block contains a selection of different divider hardware implementations the user can choose from. The difference between these implementations is not documented in the manual, and only one of them provides the correct result for the algorithm. It is possible to tune each divider implementation with architecture specific options, which is probably the reason why only one of the dividers works in this design; if the others are configured with the correct options it could be possible to get correct results with them as well. Knowledge of these implementations is needed to make an optimal implementation decision, which requires previous knowledge of hardware implementations.


It is however a lot easier to learn how to use AccelDSP for generating an FPGA implementation than to learn VHDL. The user does not need to learn a new language or care about timing information or I/O handshaking. But knowledge of how to efficiently implement algorithms in Matlab alone is not enough. AccelDSP is not found usable for realising complete systems, not even for an experienced VHDL designer. However, only a fundamental knowledge of FPGA implementation is required for designing a subsystem that can be used by a VHDL developer. A discussion of the difference between M-Code for VHDL generation and ordinary Matlab is made in section 6.1.1.1.

The experience of working with AccelDSP is generally good, but some features and bugs limit the functionality:

- AccelDSP only runs in Microsoft Windows XP. This is considered a large limitation for SMW, mainly because SMW run most of their resource heavy software on remote Linux servers.
- The Component Object Model (COM) interface for communication between AccelDSP and Matlab creates zombie processes consuming a large amount of memory if AccelDSP is not shut down correctly.
- There is no functionality to cancel steps in the workflow without closing the program, which creates zombie processes.
- Every step in the workflow must be run separately. There is no functionality in the GUI for running several or all steps.

Some other minor problems like memory leaks exist, but they do not seriously interfere with the usability.

6.1.1.1 M-Code interpreter

The subset of the M-Code language supported by AccelDSP is very limited, which affects the usability of the program. Much of the syntax normally used by DSP designers writing M-Code does not work. An example of this is the result of the implementation of the xcorr() function presented in section 5.1.2. These limitations, together with the need to restructure the code from the sequential behaviour used in Matlab to a continuous flow of data more suitable for FPGA implementation, mean that AccelDSP can not be used directly for implementing algorithms designed for Matlab.

A good example of how the fundamental design of an algorithm differs between processor and FPGA is the Cartesian to polar map transformation included in appendix D. The first version is the original Matlab model, which works with EMLC without changes. The other version is a model that is suitable for use with AccelDSP. This version contains a state machine declaration for holding internal states, needed when rewriting a batch implementation to a streaming behaviour. Some functions have to be written with I/O restrictions in mind, which means that control logic must be created in M-Code. This involves writing state machine behaviour to load streaming data and store it in internal or external memory before the data can be processed. An example of I/O restrictions is the input to the MUSIC algorithm, which consists of 1000 values with a resolution of 16 bits each. This interface is easy to write in M-Code, but 16,000 bits can not be transferred simultaneously inside the FPGA. Large algorithms with a large I/O interface, dependent on several kernel algorithms written with a non-streaming behaviour, are not suitable for AccelDSP. A good example is the MUSIC algorithm, see section 5.1.7. The implementation compatible with AccelDSP is more similar to a behavioural description in VHDL than to the original Matlab implementation. The complexity of these Matlab models, as well as the performance of the generated designs, indicates that AccelDSP is not meant to be used for larger designs.
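As a minimal illustration of the streaming style with internal state described above, the sketch below receives one sample per call and keeps its state between calls. The function name and the 4-sample example are assumptions made for the sketch; the actual thesis implementations are found in appendix D.

function y = running_mean4(x_in)
% Streaming-style function: one input sample per call, internal state
% (a small buffer and a write index) kept between calls with persistent.
persistent buf idx
if isempty(buf)
    buf = zeros(1, 4);
    idx = 1;
end
buf(idx) = x_in;              % store the new sample
idx = mod(idx, 4) + 1;        % advance the circular write index
y = sum(buf) / 4;             % output the mean of the last four samples
end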

6.1.1.2 Generated VHDL code

A noticeable problem during the thesis has been that the generated VHDL code often produces a lot of warnings during synthesis. Even some of the AccelWare cores result in hundreds of warnings, even though they are designed for VHDL generation; an example is the FIR-filter, see section 5.1.4. The severity of these warnings is hard to determine since the code is verified, using simulation, to function correctly. The large number of warnings is a problem if the generated design is used in a larger design, since hundreds of warnings can make important warnings almost impossible to locate. An example of problematic warnings are those indicating that many signals are unconnected, sourceless or assigned but never used. This is an indication that AccelDSP generates a lot of redundant and unused code. The generated code is also very hard to understand and read, which makes it almost impossible to modify. This means that a generated implementation can not be fixed or corrected if errors occur; the only solution is to edit the design in AccelDSP and generate a new VHDL description of the design. This is problematic if AccelDSP can no longer be run, as discussed in Usability. It is important to note that none of the generated designs have been tested on real devices. Simulations have been made to verify that the designs function correctly, but this does not guarantee that they work on real devices. A few gate-level simulations were also done to check that the generated designs were mapped correctly after Place And Route (PAR).


6.1.1.3 Development time

The main reason to use AccelDSP is to reduce the development time. The program manages to reduce the development time considerably for many algorithms, see section 5.1, but the drawback is worse timing performance and area utilisation. A good example of this is the 3x3 matrix multiplication. When comparing the time it took to create the AccelDSP design with the hand-coded version, the AccelDSP design is superior: hand-coding a good VHDL description of the algorithm, and a testbench to test it, takes about two days, while in AccelDSP this takes an hour since both a VHDL description and a testbench are generated. Even if this design is worse in both performance and area utilisation, it performs the specified task without problems. The 3x3 matrix multiplication is a small design and easy to describe in M-Code. Larger algorithms like MUSIC take considerably longer to implement in AccelDSP than first assumed. The implementation described in section 5.1.7 had to be rewritten and modified many times to get a working implementation, and the time it took to create the first working implementation is measured in weeks. Writing the same implementation directly in VHDL would probably take even longer, but would result in an efficient implementation with a description that is easy to read.

6.1.2 Performance

This section presents a summary of the timing and area utilisation results, presented in chapter 5, for all algorithms used for evaluating AccelDSP.

6.1.2.1 Timing

A collection of the maximum clock frequencies for the designs is found in Figure 6.1. As seen in the figure, the frequencies of the generated designs are roughly half of the frequencies of their references; the exceptions are the FIR-filter and CFAR implementations. The clock frequency also affects many other metrics associated with timing, most notably the execution time and the input sampling.

Figure 6.1: The maximum clock frequencies for the implementations. Matrix B is the batch and Matrix S the streaming matrix multiplication implementation.

Another timing metric used is the execution time of the algorithms. This is measured differently between the algorithms, as explained in section 3.1.3. The execution times vary between the designs, hence the values are presented as the ratio between the execution times of the generated designs and their references. These figures are found in Figure 6.2. The most interesting result is the execution time for CFAR, which needs five times longer to calculate its test vector than the reference. It is important to note that the two designs have approximately the same maximum clock frequency. This is due to the problems with implementing an unrolled sort in AccelDSP, as explained in section 5.1.6; the generated design therefore needs more cycles for each input to complete the calculations.

Figure 6.2: Execution time for the generated designs divided by the execution time of their reference designs. Matrix B is the batch and Matrix S the streaming matrix multiplication implementation.

The input sample rate for the algorithms is equal to the maximum clock frequency, as shown in Figure 6.3. The only exception is the CFAR algorithm. This implementation is slowed down by the sorting part of the algorithm, which needs several clock cycles to complete each sort.

Figure 6.3: The input sample rates for the algorithms. Matrix B is the batch and Matrix S the streaming matrix multiplication implementation.

6.1.2.2 Area utilisation

The difference in area utilisation between the designs varies a lot, which is an indication that the program is better at generating certain designs than others. This means that it may be hard to estimate the area utilisation of an implementation before it is generated. A consequence of this may be that a generated design uses too many resources and has to be hand-coded instead. The area utilisation in occupied slices is found in Figure 6.4.

One of the more noticeable results is the area utilisation for the streaming matrix multiplication. As shown in Figure 6.4, the difference between the generated design and the reference design is very large. Matrix multiplication is a simple algorithm and as such this is a bad result, especially since the maximum clock frequency is five times lower than for the hand-coded reference. Another interesting result is the CORDIC as well as the FIR-filter, which both correspond very well to their reference implementations. The area utilisation of these implementations is good, as is the performance. This is a perfect example of how AccelDSP should work for all algorithms: a small decrease in performance and a small increase in area utilisation. Both these algorithms are small and simple implementations, which provides another indication that AccelDSP is more suitable for smaller designs. The result for the batch matrix multiplication can be confusing since the reference bar is missing in Figure 6.4. The explanation is that the hand-coded reference only uses one regular slice; the rest of the design is implemented entirely with DSP48E slices.

Figure 6.4: Area utilisation in occupied slices / total slice usage. Matrix B is the batch and Matrix S the streaming matrix multiplication implementation.

6.1.3 Rapid prototyping

Rapid prototyping is a development process used to develop and create prototypes faster than with ordinary implementation flows. The prototypes are used for evaluation of performance, function and the development process itself. AccelDSP can be used for this purpose, since the time it takes to create a functional design is much shorter than writing VHDL code. Even if AccelDSP can be used for rapid prototyping, it has a long way to go before the implementation workflow runs smoothly. The major issue is the poor support for built-in functions and Matlab syntax, which must be improved. The way AccelDSP uses AccelWare cores must also be more consistent: all AccelWare functions should be auto-inferred directly from M-Code. The hardware specific options can remain, but they should be editable in the fixed-point model after the core has been inferred. As it works now, a new project must be created for some AccelWare cores, but not all, before they can be inferred into the main design. If these issues are solved, AccelDSP may become a much better tool for rapid prototyping. No road map for the development and improvement of AccelDSP has been found, so no information is available about any plans for increasing the M-Code support.

6.1.4 Technology independence

An important aspect of the evaluation of AccelDSP is to check whether the program generates technology independent code. The reason behind this check is that the product lifetime of designs at SMW is very long (20-25 years). This requires the code to work on new hardware models and possibly hardware from other vendors. To evaluate the independence, a number of Xilinx and Altera devices were tested to check if the generated code could be synthesised and how well it performs on other architectures. The VHDL code used for this task was the AccelDSP generated FFT implementation, which contains a good selection of different components like blockRAMs, shift registers and DSP elements. The results were created using Synplify PRO, since ISE does not support devices from Altera. The results of the evaluation are found in Table 6.1. The performance information in the table is an estimation after synthesis and not after PAR. The results presented in Table 6.1 confirm that the VHDL code is technology independent, but it is not very portable to other FPGA architectures in terms of performance. The difference in FFT performance between the two vendors in Table 6.1 is a consequence of the way AccelDSP generates VHDL code. AccelDSP is configured with a target device for which it will generate VHDL code. Since the target device is set to a Virtex-5, the program generates VHDL code for the blockRAMs of this device. This declaration does not map well to the non-Virtex-5 devices since their blockRAM behaviour is not the same.


Device family          DSP slices    Performance
Xilinx Virtex-5        Yes           215 MHz
Xilinx Virtex-4        Yes           194 MHz
Xilinx Spartan 3E      Yes¹          97.1 MHz
Altera Stratix III     Yes           78.7 MHz
Altera Cyclone III     Yes           59.3 MHz

¹ This device has Mult18x18 blocks, a multiplier primitive used by the Spartan 3E instead of DSP48E slices.

Table 6.1: Results of the VHDL technology independence test after synthesis using Synplify PRO

6.1.5 Accuracy

The algorithm kernel set used in the evaluation of AccelDSP corresponds to sets used in other studies. The main difference is the number of variants of each algorithm; an example is the BTDI study [4], which has benchmarks for 4 different types of FIR-filters. Due to the limited extent of a Master's thesis, the diversity of the kernel set is reduced compared to professional studies. Compared to most other studies, the focus in this thesis is to evaluate the tool for generating VHDL and the performance of the generated description, not the platform. A more general algorithm kernel set is enough to get an overview of the capability of AccelDSP to implement these kernels. The major part of the performed simulations are behavioural, used to verify the function of the implementations. A behavioural simulation together with a successful PAR is often enough to verify that a design works on a real device. A more accurate post-PAR or gate-level simulation, which simulates the design after PAR, was performed for some of the algorithms. The most important reason for this is to make sure that the synthesis tools and PAR do not remove parts of the implementation during optimisation. Another reason is to check that it is possible to run the implementation at the reported maximum clock frequency without any timing errors.


6.2 Comparing a processor to a FPGA

The method for platform decision described in section 3.3 bases its decision on execution time and resource usage. There are several other interesting performance metrics to compare. This section contains a discussion of other areas found to be of interest when comparing platforms.

6.2.1 Speed

This is the most important metric. It describes the time it takes to execute the algorithm using the resources available on the platform and is used as the metric for comparing platforms in this report. The time is easily calculated as the clock period multiplied by the number of cycles it takes to finish the execution. A processor generally has a higher clock frequency but requires more clock cycles to execute an algorithm compared to an FPGA. The reason is that it is possible to customise the hardware in the FPGA to adapt to the requirements of the algorithm and to utilise parallelism. A description of how execution time on FPGA and CPU architectures is compared is found in section 3.3.1.1.

6.2.2 Throughput

The execution time can be misleading if a pipelined design is compared to a design without a pipeline. A complementing metric is the throughput, which describes how fast data can be fed to the implementation for processing. The throughput is also related to the I/O interface, since if more data can be read simultaneously, fewer instructions are required for loading input data to the algorithm. The throughput is taken into account in the method for platform decision by using a large number of executions of the implementation, i.e. a large set of input data.

6.2.3 Resource usage

Hardware is expensive, generates heat, requires space, and more components in a system increase the complexity. An implementation with low resource usage reduces these problems since other work can be done without the need for more hardware. It is easier to think of resources on an FPGA than on a processor, since the FPGA contains a limited number of slices, clock buffers, RAM, and so on. The amount of resources required for an implementation is only fixed for a certain speed goal, see section 2.2.1.1. This results in resources and time being tied together.


On a processor, resources are more abstract. A dual-core processor contains two cores which can, with some restrictions, be used individually by two processes. These two cores can be seen as resources; a process using one of the cores thus utilises 50% of the processor. A single core processor using out-of-order execution has several execution units which can also be seen as resources, and how they are used is a measure of the resource usage. In this case the resources can not be used by other processes, unless some kind of hyper-threading technique is used. This makes the resource usage, measured in usage of execution units, a metric of how effectively the algorithm is implemented and not of the amount of hardware needed. This is not the case on the dual-core processor, where the cores can be used by two processes simultaneously.

If the resource usage of a single core processor is to be compared with an FPGA, the most practicable metric is the execution time. A short execution time means that more work can be assigned to the processor during a time frame. How, for example, 5 ms execution time on a processor should be compared to 25 ms execution time on an FPGA using 20% of the slices is, however, not trivial. If the real time constraint requires that the algorithm is executed every 50 ms and the slices are the limiting resource on the FPGA, then 10% of the processor is used and 20% of the FPGA. This line of argument makes the processor more suitable. However, if another algorithm must run every 5 ms, this is only possible on the FPGA (as long as this algorithm requires less than 80% of the slices) and not on the processor. Thus there exists no simple way of comparing resources without considering the whole system. In the described method, slice usage is used as the metric for the FPGA and time sharing of the whole processor as the metric for the CPU.

6.2.4 Memory

In many radar applications the limiting factor is not the execution speed on the platform but the memory bandwidth available for accessing data. Analysing memory operations is complex and no simulation has been done in this thesis. The following is a brief idea of a method for this task. For a processor, the memory access depends on the speed of the FSB, the data cache and the memory. A better approach than modelling all of these is probably to use some kind of performance measurement, see section 2.4.2.3. On the FPGA, access to external memory can be written in VHDL and simulated along with the algorithm. The interface to external memory is too complex to be written in M-Code for generation using AccelDSP and is thus outside the scope of this thesis.

6.2.5 Power

For portable solutions, and when low heat generation is required, the power consumption is an important factor. For comparing the consumption of the platforms, three approaches are possible. Values of the typical consumption can be found in data sheets and represent a very rough estimate, since the dynamic consumption varies between implementations and workloads. More accurate figures are obtained by using a simulator, also called a power analyser, or by measuring the consumption on the hardware by attaching a probe. Brief descriptions of simulators for the platforms are found in chapter 2. No simulation of power usage is integrated into PSIM.

6.2.6 I/O

The I/O interface of a FPGA is much more versatile than that of a processor. The Xilinx Virtex-5 device used in this thesis has 640 configurable I/O pins, which allows a lot of data to be accessed simultaneously. A processor accesses data sequentially through the FSB, which is used for all communication with the processor. The speed of this bus can differ from the speed of the memory and the processor, or they can run synchronously. The versatile I/O of the FPGA gives this platform an advantage over the processor, since optimised interfaces can be created. To compare the actual performance between the platforms, an actual setup including external interfaces must be examined. Generally, complex interfaces like Ethernet are easier to utilise on a processor than on a FPGA, since communication stacks are more straightforward to implement.

6.3 Implementation decision based on profiling

In section 3.3 a method for generating, evaluating and comparing implementations on FPGA and PowerPC from a Matlab model is described. The results of the evaluation, divided between PowerPC and FPGA, for the three example algorithms used for developing the method are found in chapter 5. The first section, 6.3.1, contains a comparison between the platforms for two of the example algorithms, using the described method. The following sections contain a discussion of the accuracy and usability of the method.

6.3.1 Implementation decision

A comparison of the execution times of the example algorithms used for developing and evaluating the method is found in Table 6.2. The figures are based on the results in section 5.1 for the FPGA and section 5.2 for the CPU. The execution times are calculated using the maximum possible frequency on the FPGA and a clock speed of 180 MHz on the CPU. Modern processors can operate much faster than 180 MHz, but since the simulation is performed using the architecture of a PowerPC 604 it is not obvious that the same number of cycles can be achieved with a faster processor. A PowerPC 440 operates at 550 MHz and it is reasonable to assume that similar execution statistics can be obtained. A rough estimate, without using a simulator for a modern CPU, is that the execution would be 3 times faster. This extrapolation is not shown in Table 6.2 because of its uncertainty.

                                          FPGA      PowerPC
MUSIC                                     15 ms     35 ms
Cartesian to polar map transformation     784 µs    191 ms

Table 6.2: Execution time on both platforms.

The MUSIC algorithm is obviously not well fitted for a FPGA, considering the many problems with the implementation on this platform, even though the execution time is half of the time on the CPU. The values are too close to be reliable, which means that an actual modern CPU implementation is as likely to show good performance as a FPGA implementation. Considering the resource usage and the problematic implementation, a CPU implementation is preferred. The performance of the Cartesian to polar map transformation shows better confidence, and it is likely that a FPGA implementation is at least 100 times faster.

Other important factors besides the achievable performance are the cost of the platform, how easy the design and implementation process is, and how easy verification, testing and debugging are. Designing for processors is in most cases much cheaper and easier, with the exception of low level functionality where strict control over timing is needed. If a performance evaluation only shows a small increase in performance on a FPGA compared to a PowerPC, and if no restricting real time constraint exists, a CPU implementation is generally recommended. Another performance metric is the resource usage, which is determined differently on the two platforms. On the CPU it is measured as the time utilisation of the whole processor as one unit and on the FPGA as the slice utilisation. The result is shown in Table 6.3.

                                          FPGA      PowerPC
MUSIC                                     39%       123%
Cartesian to polar map transformation     28%       4%

Table 6.3: Resource usage on both platforms, measured as time utilisation for the PowerPC and slice utilisation for the FPGA.

A suitable implementation platform for MUSIC and the Cartesian to polar map transformation was not obvious before the performance evaluation. The described method shows that a FPGA implementation is at least 100 times faster for the transformation algorithm but requires roughly 30% of the FPGA slices. Running on a PowerPC the execution is slower, but only 4% of the resources are used, and 191 ms is fast enough to fulfil the real time requirement of 4.7 s (the requirement is described in section 5.2.3). The MUSIC algorithm requires 123% of the CPU, but this figure is based on an old processor model. Using a modern processor this platform will fulfil the requirement and is probably the better implementation platform.
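
As a check of how the PowerPC figure in Table 6.3 relates to Table 6.2, the sketch below assumes that the time utilisation is the execution time divided by the required period; only the transformation algorithm is shown, since its real time requirement is stated.

    t_exec_ppc = 191e-3;   % Cartesian to polar, PowerPC (Table 6.2)
    t_required = 4.7;      % real time requirement from section 5.2.3

    ppc_util = t_exec_ppc / t_required;   % roughly 0.04, the 4% in Table 6.3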

6.3.2 Accuracy

The described method contains mainly two sources of errors. The first source is errors deriving from the fact that the performance of auto-generated code is compared. The method for platform decision is based on the assumption that the performance of generated code can represent the performance of the code used for implementation. The second source is errors in the evaluation of the simulated implementations. The two groups are described in two separate sections below.

6.3.2.1 Code generation

If auto-generated descriptions are used for the performance evaluation, but hand-coded descriptions or IP-Blocks are used for the actual implementation, it is important that the quality of the generation is the same for both C-Code and VHDL. If auto-generated code is used for both the evaluation and the actual implementation there is no problem, but generally this is not the case. It is important to note that the generated description does not need to have the same performance as the hand-coded description used for implementation, but the performance loss when generating VHDL and C-Code from Matlab must be approximately the same. The work by Markus Möllegger (section 1.3.7) is mainly focused on comparing two different software packages for C-Code generation, and the comparison with hand-coded code is limited, especially for signal processing. For the sort, basic algebra and matrix multiplication algorithms analysed by Möllegger, EMLC performs between equal to and 3 times slower than hand-coded implementations. The evaluation of AccelDSP made during this thesis shows similar results: AccelDSP performs between equal to and 3 times slower than the reference implementations when comparing maximum clock frequency. The calculation time is hard to compare since the implementations sometimes have very different designs in pipeline steps, input sampling, etc. Input sampling shows performance similar to the clock frequency, since it is dependent on the clock, but the calculation time is sometimes 6 times longer because of very long pipelines. The longer calculation time is often not interesting since the algorithms run continuously on an incoming data stream. The performance of auto-generated VHDL and C-Code is thus roughly the same, equal to or up to 3 times worse than the reference implementations for continuously running algorithms. This makes the method meaningful. The variation in performance of the generated algorithms compared to the references gives an indication of the accuracy. An evaluation of the performance of EMLC for signal processing algorithms would make the comparison more accurate.
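
The argument can be made concrete with a small sketch; the slowdown figures below are hypothetical and only illustrate the reasoning, not measured results.

    t_hand_vhdl = 1.0;  t_gen_vhdl = 2.5;   % relative FPGA execution times
    t_hand_c    = 1.0;  t_gen_c    = 2.0;   % relative CPU execution times

    slowdown_vhdl = t_gen_vhdl / t_hand_vhdl;
    slowdown_c    = t_gen_c    / t_hand_c;

    % As long as slowdown_vhdl and slowdown_c are of the same magnitude
    % (both between 1 and 3 here), the ratio t_gen_vhdl/t_gen_c is still a
    % usable approximation of t_hand_vhdl/t_hand_c.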

6.3.2.2 Evaluation

Since the performance evaluation of the two implementations is performed using simulation and not on real hardware, it is important that the accuracy of the simulation is equal for both VHDL and C-Code. The analytical modelling of the FPGA performance measure is accurate, and simulation is a natural part of the realisation workflow for FPGAs. For the PowerPC the simulation is less accurate, since it is not as common to simulate the performance of processors. The simulator PSIM is not cycle accurate and has no data cache simulation, as discussed in section 3.2.3, and no figures of a comparison between PSIM and real hardware have been found.

As a result, the only credibility of the program comes from its common use. It cannot be shown that the method is accurate, but this is not required. The method is accurate enough to give a rough estimate of the execution time, which fulfils the requirement of the method. If the execution times on the two platforms are similar, other factors like implementation time, available hardware resources and human resources determine the implementation platform.

6.3.3 Usability

Using the method, enough information is gained to compare the implementations and make an implementation decision. If no significant difference exists in the values for resource usage and execution time, other factors determine the implementation. This is the case for the MUSIC algorithm. The main drawback is the complex algorithm design workflow for AccelDSP. An existing Matlab model of a system created by a system level engineer cannot easily be used for profiling implementations without considerable changes. The most practicable way is to create two Matlab models, optimised for FPGA and CPU respectively. Writing algorithms in Matlab is fast compared to C-Code and VHDL, but writing two algorithms still takes time and changes to the algorithm must be made in two separate sources. Since the two platforms are fundamentally different, two M-Code designs are generally desired even if the design workflow of AccelDSP is simplified in future versions of the software. The code generation and evaluation workflow after the design phase is also time consuming and involves several iterations in AccelDSP for complex designs. If hand-coded VHDL, and not the generated code, is used for implementation, the generated VHDL is useless after the implementation decision has been taken. Thus the decision itself must be worth the time if no other results, like a usable implementation, are produced. A reason not to use the generated VHDL for implementation is that easily readable code or better performance is needed. The described method can be used for any CPU for which a profiler can be found and which gives good performance for the generated ANSI C-Code, not only PowerPC processors. Some application specific processors, for example some DSPs, require special instructions to maximise performance; more information is found in section 2.3. The method is not suitable for performance evaluation and comparison for these processors. Until improvements have been made to AccelDSP to simplify the workflow, mainly by increasing the supported subset of M-Code syntax and functions, the method is only useful for designs where the decision is important enough to be worth spending a considerable amount of time on.

6.4 Future work

There are two main areas for future work. The first area is evaluations which can be made to improve the fidelity of the results and methods in this thesis. The second area is further investigation of workflows and factors for hardware implementations.

6.4.1 Evaluation of AccelDSP

To increase the accuracy of the evaluation of AccelDSP, real device testing of the generated code can be performed. The implementation step of the VHDL implementation workflow sometimes shows a huge number of warnings. The simulation verifies that the design is correct in spite of the warnings, but hardware testing would show that a generated block works in a complete design running on real hardware.

6.4.2 Implementation decision

The major sources of errors which influence the fidelity of the method for making implementation decisions are the performance of the code generated by AccelDSP and EMLC and the accuracy of PSIM. A further evaluation of the performance of C-Code generated with EMLC, using the same or at least similar algorithms as in the evaluation of AccelDSP, would increase the fidelity. Changing PSIM to a cycle accurate simulator with a verified precision and a wider range of supported processors is desirable. An interesting cycle accurate and functional simulator for PowerPC is fMW, developed by Carnegie Mellon University [2]. In this thesis the execution time is used as the most important performance metric for comparing platforms, using a direct path from Matlab to C-Code and VHDL. Further evaluations of other metrics, like those described in section 6.2, and of other factors, like implementation time and debugging possibilities, would be interesting.

6.4.3 Evaluation of other tools

AccelDSP is not the only tool which provides an implementation path from Matlab to VHDL. A very interesting piece of software is MathWorks Simulink HDL Coder. As mentioned in section 1.3.3, this software can generate VHDL from a design in Simulink. Using a special block in Simulink, EML, the subset of M-Code also supported by EMLC, can be used for generating VHDL. Other implementation paths than the direct path from Matlab to VHDL, for example using C-Code as an intermediate language, are also interesting to evaluate.

Chapter 7 Conclusions
The aim of the thesis, as stated in the introduction, is to develop a procedure for making decisions about realisations of algorithms for signal processing and radar control. The procedure should assist in deciding if an implementation of a certain algorithm should be made on a PowerPC or a Field Programmable Gate Array (FPGA), using profiling. It should also assist in deciding if VHDL for implementation on a FPGA should be generated from Matlab using AccelDSP or be written by hand. The procedure is divided into the following problem statements:

1. Is AccelDSP useful for SMW and which purposes can it fulfil?

2. How can an algorithm be evaluated to see if it is suitable for VHDL generation from Matlab using AccelDSP, or if it is better to describe the algorithm directly in VHDL?

3. How can profiling on the two platforms, with code generated from Matlab, be used to determine if an algorithm should be implemented on a PowerPC or a FPGA?

Statements 1 and 2 are answered in section 7.1 by concluding how the suitability of algorithms can be evaluated and by giving general recommendations on how AccelDSP should be used, as well as a more specific recommendation for SMW. Section 7.2 answers statement 3 by summarising the developed method for platform decision and the usability and accuracy of the method.

7.1 VHDL generation using AccelDSP

In section 6.1, VHDL generation using AccelDSP in general, and how it can be used at SMW, is thoroughly discussed. To evaluate which purposes AccelDSP can fulfil at SMW, five different aspects have been evaluated.

This section contains the conclusions of that discussion.

7.1.1 Evaluation of algorithms for AccelDSP

Small algorithms without a large number of unsupported sub functions, without strict timing requirements such as memory interfaces, and using one single clock can be considered for use with AccelDSP (refer to Usability). A good way of evaluating an algorithm which conforms to these restrictions starts with checking all Matlab functions to verify that they are supported, either directly by AccelDSP or through AccelWare. A reference of supported functions and syntax is found in [26]. If the unsupported functions are easy to rewrite using supported M-Code syntax, and the flow of the algorithm is easy to change to a streaming behaviour, the algorithm can be considered for implementation using AccelDSP. A small algorithm contains a small number of sub functions such as dividers, eigenvalue or singular value calculations, sort algorithms, etc. A typical example of a small algorithm is the CORDIC, described in section 4.5, which consists of simple mathematical calculations; a rough sketch in this style is shown below. The MUSIC algorithm, described in section 4.7, consists of a sorting algorithm, singular value calculation and creation of a covariance matrix. This algorithm is too complex for AccelDSP, which results in a long implementation time and bad performance.
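
As an illustration of the coding style that fits these restrictions, the function below implements a generic rotation-mode CORDIC iteration in plain M-Code: a small amount of simple arithmetic, no unsupported library calls and a fixed number of iterations. It is not the CORDIC from section 4.5, it uses floating point rather than the fixed-point types AccelDSP requires, and whether it would pass the tool's checks would still have to be verified against [26].

    function [xr, yr] = cordic_rotate(x, y, z)
    % Rotate the vector (x, y) by the angle z using rotation-mode CORDIC.
    % The result is scaled by the constant CORDIC gain, which is not
    % compensated for in this sketch.
    niter  = 8;                          % assumed number of iterations
    angles = atan(2.^(-(0:niter-1)));    % constant angle table
    for k = 1:niter
        p = 2^(1-k);                     % 2^-(k-1)
        if z >= 0
            xn = x - y*p;  yn = y + x*p;  z = z - angles(k);
        else
            xn = x + y*p;  yn = y - x*p;  z = z + angles(k);
        end
        x = xn;  y = yn;
    end
    xr = x;  yr = y;
    end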

7.1.2 Evaluation of AccelDSP

The main advantage of generating VHDL compared to hand-coding is the reduced implementation time. Generally, hand-coded VHDL was found in this thesis to be the preferred implementation method towards a FPGA. The reasons are that generated VHDL is hard to read, generates many warnings when synthesised and shows worse performance. AccelDSP can, however, reduce the implementation time for parts of a design if those parts are evaluated to be suitable for the program. The drawback of using AccelDSP is reduced performance. A better approach than using AccelDSP stand-alone is probably to use it together with Xilinx System Generator or another Simulink-based tool. Since System Generator only generates netlists, MathWorks Simulink HDL Coder is an interesting option; an evaluation of this approach is included in Future work (section 6.4.3). AccelDSP is not suitable for use by a person with no knowledge of VHDL design for FPGA implementation, but only a limited knowledge is required for generating parts of a design when working together with VHDL developers. An existing Matlab model is not guaranteed to reduce the development time using AccelDSP. This is because large modifications of the model can be needed to conform to the supported subset of the M-Code language, to create a streaming design suitable for FPGA implementation and to conform to I/O restrictions.

A typical use case for AccelDSP is rapid prototyping. In this use case the resulting performance is less important than a short development time. The requirements on algorithms are of course the same when using AccelDSP for rapid prototyping. The ability of AccelDSP to be used for rapid prototyping makes it useful for generating reference FPGA designs usable in the method for platform decision.

7.1.2.1 Performance

The maximum clock frequency of generated VHDL is generally half of the frequency possible to achieve when VHDL is hand-coded or IP-Blocks are used, but the deviation is large. The execution time is generally three times longer, but this figure also shows large deviations. Refer to section 6.1.2, Figure 6.1 and Figure 6.2. The execution time is generally not interesting for streaming implementations, since it is the same as the startup cycles, i.e. the latency for the first value. The major problem with the resource usage for AccelDSP is the unpredictable results. A good example is the streaming implementation of the matrix multiplication, described in section 4.3.1.2. The resource usage of this algorithm is roughly 18 times larger than that of a hand-coded reference design, even though the algorithm is simple. The area utilisation, measured in number of occupied slices, for all algorithms used in the evaluation of AccelDSP is found in Figure 6.4 in section 6.1.2.

7.1.2.2 Using AccelDSP at SMW

According to this study, the only purpose for AccelDSP at SMW is to be used in the method for platform decision. The software could also be used for generating parts of a design, since a lot of VHDL and FPGA knowledge exists at the company, but this is not recommended. The drawbacks of generated VHDL, such as the performance, the reliability of the description, the readability and the problems of maintaining the source, are more important than the reduced design time. It is likely that the design will need to be changed during the long product lifetime (20-25 years). If AccelDSP no longer exists, or is impossible to run on the workstations in 20 years, the generated VHDL needs to be editable. Considering the code quality, this will be almost impossible.

7.2 Implementation decision based on profiling

A method for deciding if a PowerPC or a FPGA should be used as implementation platform for a certain algorithm, based on profiling, has been developed. The outline of the method is to generate VHDL using AccelDSP and C-Code using EMLC for implementation on a FPGA and a CPU respectively.

The execution time and resource usage of the implementations are compared, and from the comparison a decision on a suitable platform for a certain algorithm can be made. A complete description of the method is found in section 3.3. The method is only usable for algorithms which are to be considered for FPGA implementation using AccelDSP, and when the decision is important enough to be worth the sometimes time-consuming workflow described in the method. The method gives an estimate of the execution time and the resource usage on the two platforms given a specific design, and includes a discussion of other metrics important for the decision. The estimated metrics are good enough to fulfil the requirement from SMW since they give an indication of the resulting performance, usable for deciding the implementation platform. The usability of the method will improve if the usability of AccelDSP for rapid prototyping is improved in future versions of the software.

Bibliography
[1] Altera. Guidance for accurately benchmarking FPGAs. Technical report, Altera, December 2007.
[2] Candice Bechem, Jonathan Combs, Noppanunt Utamaphethai, Bryan Black, R. D. Shawn Blanton, and John Paul Shen. An integrated functional performance simulator. IEEE Micro, 19(3):26-35, 1999.
[3] Berkeley Design Technology, Inc. DSP on general-purpose processors. Technical report, Berkeley Design Technology, Inc., 1997.
[4] Berkeley Design Technology, Inc. Evaluating DSP processor performance. Technical report, Berkeley Design Technology, Inc., 2002.
[5] Henrik Eriksson. Computer aided implementation using Xilinx System Generator. Master's thesis, Technical University of Linköping, January 2004.
[6] Malay Haldar, Anshuman Nayak, Alok Choudhary, and Prith Banerjee. A system for synthesizing optimized FPGA hardware from Matlab. ICCAD, 00:314, 2001.
[7] Malay Haldar, Anshuman Nayak, Alok Choudhary, Prith Banerjee, and Nagraj Shenoy. FPGA hardware synthesis from Matlab. VLSID, 00:299, 2001.
[8] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, Inc., 1996.
[9] Shen-Fu Hsiao and J. M. Delosme. Householder CORDIC algorithms. IEEE Transactions on Computers, 44(8):990-1001, August 1995.
[10] IBM and Motorola. PowerPC 604e RISC Microprocessor User's Manual.
[11] Lizy Kurian John and Lieven Eeckhout, editors. Performance Evaluation and Benchmarking. Taylor and Francis Group, LLC, 2006.
[12] M. Tim Jones. Optimizations in GCC. Linux Journal, January 2005.

[13] MathWorks. The origins of Matlab. Cleve's Corner, www.mathworks.com, December 2004.
[14] MathWorks. The growth of Matlab and The MathWorks over two decades. Cleve's Corner, www.mathworks.com, January 2006.
[15] MathWorks. About The MathWorks. www.mathworks.com/company/aboutus, July 2008.
[16] Cleve Moler. Professor SVD. In The MathWorks News and Notes. http://www.mathworks.com, October 2008.
[17] Markus Möllegger. Evaluation of compilers for Matlab to C-code translation. Master's thesis, Halmstad University, January 2008.
[18] Stephan Orr. AccelChip tool synthesizes Matlab designs. Electronic Engineering Times Asia, April 2002.
[19] S. Belkacemi, K. Benkrid, and D. Crookes. A logic based hardware development environment. In Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. School of Computer Science, The Queen's University of Belfast, IEEE, 2003.
[20] Stefan Sjöholm and Lennart Lindh. VHDL för konstruktion. Studentlitteratur, 2003.
[21] Xilinx. CORE Generator Overview, 2005.
[22] Xilinx. Achieving breakthrough performance in Virtex-4 FPGAs. Technical report, Xilinx, May 2006.
[23] Xilinx. Xilinx acquires DSP design tool leader AccelChip. Xilinx Press Release 0613, January 2006.
[24] Xilinx. AccelDSP Synthesis Tool User Guide, 10.1 edition, March 2008.
[25] Xilinx. AccelWare Reference Designs User Guide, 10.1 edition, March 2008.
[26] Xilinx. Matlab for Synthesis Style Guide, 10.1 edition, March 2008.
[27] Xilinx. Virtex-5 FPGA User Guide, March 2008.
[28] Xilinx. Virtex-5 FPGA XtremeDSP Design Considerations. Xilinx, April 2008.

Appendices

Appendix A Programs used


Operating Systems:
  Microsoft Windows XP Professional Service Pack 2
  Redhat Enterprise GNU Linux x86_64

Programs under Windows:
  Xilinx AccelDSP 10.1
  MathWorks Matlab R2007a, R2008a

Programs under Linux:
  Xilinx ISE Foundation 10.1
  Mentor Graphics Modelsim 6.3c
  Synplicity Synplify PRO 8.9
  GNU GCC 4.2.4
  PSIM PowerPC simulator included in GNU GDB 6.8

Appendix B Performance data from PSIM


Performance data from PSIM for FFT compiled using level 3 optimisations.

502     AND instructions.
763     AND Immediate instructions.
6,642   Add instructions.
13,649  Add Immediate instructions.
1       Add Immediate Carrying instruction.
1,798   Add Immediate Shifted instructions.
4,590   Branch instructions.
20,343  Branch Conditional instructions.
4,333   Branch Conditional to Link Register instructions.
1,028   Compare instructions.
7,869   Compare Immediate instructions.
1       Compare Logical Immediate instruction.
996     Floating Absolute Value instructions.
2,302   Floating Add instructions.
255     Floating Add Single instructions.
4,960   Floating Compare Unordered instructions.
763     Floating Convert To Integer Word with round towards Zero instructions.
6,493   Floating Move Register instructions.
4,817   Floating Multiply instructions.
5,418   Floating Multiply-Add instructions.
3,143   Floating Multiply-Subtract instructions.
1,010   Floating Negate instructions.
4,082   Floating Subtract instructions.
1       Load Byte and Zero instruction.

19,546  Load Floating-Point Double instructions.
765     Load Floating-Point Double Indexed instructions.
5,622   Load Floating-Point Single instructions.
2       Load Halfword Algebraic Indexed instructions.
31,259  Load Word and Zero instructions.
1       Load Word and Zero Indexed instruction.
764     Move From Condition Register instructions.
7,897   Move from Special Purpose Register instructions.
764     Move to Condition Register Fields instructions.
4,336   Move to Special Purpose Register instructions.
510     Negate instructions.
1,799   OR instructions.
2       OR Immediate instructions.
2,538   Rotate Left Word Immediate then AND with Mask instructions.
13      Rotate Left Word Immediate then Mask Insert instructions.
1       Shift Right Algebraic Word Immediate instruction.
10,473  Store Floating-Point Double instructions.
1,276   Store Floating-Point Double Indexed instructions.
2       Store Floating-Point Double with Update instructions.
3       Store Floating-Point Single instructions.
1       Store Half Word instruction.
13,778  Store Word instructions.
1       Store Word Indexed instruction.
4,333   Store Word with Update instructions.
1       Subtract From instruction.
1       Subtract From Extended instruction.
1       System Call instruction.
502     XOR instructions.
1       XOR Immediate instruction.
1,780   XOR Immediate Shifted instructions.
194,333 cycles.
83,092  stalls waiting for data.
2,300   stalls waiting for a function unit.
1       stall waiting for serialisation.
9,702   times a write-back slot was unavailable.
10,394  branches.
18,872  conditional branches fell through.
10,390  successful branch predictions.
5,497   unsuccessful branch predictions.
5,319   branch if the condition is FALSE conditional branches.

3       branch if the condition is FALSE, reverse branch likely conditional branches.
10,563  branch if the condition is TRUE conditional branches.
2       branch if the condition is TRUE, reverse branch likely conditional branches.
893     branch if --CTR != 0 conditional branches.
7,896   branch always conditional branches.
764     mtcrf moving 1 CR instructions.
34,414  1st single cycle integer functional unit instructions.
5,752   2nd single cycle integer functional unit instructions.
12,997  multiple cycle integer functional unit instructions.
34,239  floating point functional unit instructions.
61,474  load/store functional unit instructions.
29,266  branch functional unit instructions.
178,142 instructions that were accounted for in timing info.
57,196  reads (1 1-byte, 2 2-byte, 36,882 4-byte, 20,311 8-byte).
29,867  writes (0 1-byte, 1 2-byte, 18,115 4-byte, 11,751 8-byte).
701     icache misses.
203,731 instructions in total.

Simulator speed was 8,859,410 instructions/second.

Appendix C Makefile for GCC and PSIM


# Root directories
ROOT=/proj/fyxgexjobb/AccelDSP
GCC_ROOT=$(ROOT)/Gcc
PSIM_ROOT=$(ROOT)/Gdb
QEMU_ROOT=$(ROOT)/Qemu

# Src directories
BINUTILS_DIR=/tmp/gcc/binutils-2.18
GCC_DIR=/tmp/gcc/gcc-4.2.4
PSIM_DIR=/tmp/gcc/gdb-6.8
NEWLIB_DIR=/tmp/gcc/newlib-1.16.0
MPFR_DIR=/tmp/gcc/mpfr-2.3.1

# Qemu src
QEMU_DIR=/tmp/qemu/qemu-0.9.1

.PHONY: binutils gcc psim newlib qemu

all:
        # Build Bin Utilities
        make build_binutils
        # Build GCC
        make build_gcc
        # Build Newlib
        make build_newlib
        # Build PSim
        make build_psim
        # Build Qemu
        make build_qemu

build_psim:
        rm -fr $(PSIM_DIR)/build
        mkdir $(PSIM_DIR)/build
        make -f $(ROOT)/Makefile -C $(PSIM_DIR)/build psim

build_newlib:
        mkdir $(NEWLIB_DIR)/build
        make -f $(ROOT)/Makefile -C $(NEWLIB_DIR)/build newlib

build_gcc:
        rm -Rf $(GCC_DIR)/build
        mkdir $(GCC_DIR)/build
        make -f $(ROOT)/Makefile -C $(GCC_DIR)/build gcc

build_binutils:
        mkdir $(BINUTILS_DIR)/build
        make -f $(ROOT)/Makefile -C $(BINUTILS_DIR)/build binutils

build_qemu:
        rm -Rf $(QEMU_DIR)/build
        mkdir $(QEMU_DIR)/build
        make -f $(ROOT)/Makefile -C $(QEMU_DIR)/build qemu

binutils:
        ../configure --target=powerpc-eabisim --prefix=$(GCC_ROOT)
        make all install

gcc:
        ../configure --target=powerpc-eabisim --prefix=$(GCC_ROOT) \
          --with-mpfr-include=$(MPFR_DIR) --with-mpfr-lib=$(MPFR_DIR)/.libs \
          --enable-languages=c,c++ --with-newlib --disable-libssp --disable-shared
        make all install

newlib:
        ../configure --target=powerpc-eabisim --prefix=$(GCC_ROOT) \
          --enable-newlib-mb --enable-newlib-hw-fp
        PATH=$(GCC_ROOT):$$PATH make all install

psim:
        ../configure --target=powerpc-eabi --prefix=$(PSIM_ROOT) \
          --enable-sim-powerpc --enable-sim-stdio
        make all install

qemu:
        ../configure --prefix=$(QEMU_ROOT) --disable-gfx-check
        make all install

Appendix D Cartesian to polar map transformation


The skeleton of the cartesian to polar map transformation. The subfunctions are not included since they are similar for both the FPGA and CPU implementations.

D.1 CPU implementation

function PolarMap = cartim2pol(CartMap)
% #eml
% Read dimension constants
[map_x, map_y, max_x, max_y] = load_const_dim();
[r_step, phi_step] = load_const_steps();
PolarMap = load_const_PolarMap();

% Iterate row first through the input matrix
for X = 1:(map_x-1)
    prev = uint8(0);
    for Y = 1:map_y
        prev_tmp = CartMap(X,Y);
        % If current or previous value was land
        if (prev_tmp == 3) || (prev == 3)
            % Translate pixel coordinates
            [Xtrans, Ytrans] = trans_coord(X, Y, max_x, max_y);

            % Calculate polar coordinate to pixel
            % 1 X   Fetch coordinates to two corners of
            % | |   each pixel per iteration.
            % | |
            % 2 X
            % | |
            [phi1, r1] = cart2pol2(Xtrans+0.5, Ytrans+0.5);
            [phi2, r2] = cart2pol2(Xtrans+1.5, Ytrans+0.5);

            % Fetch sector and beam from polar coordinates
            [sec1, fold1] = pol2sec(phi1, r1, r_step, phi_step);
            [sec2, fold2] = pol2sec(phi2, r2, r_step, phi_step);

            % Save land data in polar map
            PolarMap(fold1, sec1) = 1;
            PolarMap(fold2, sec2) = 1;
        end
        prev = prev_tmp;
    end
end
end

D.2 FPGA implementation

function [done, roffset, wdata, woffset, we] = cartim2pol(start, rdata)
% Constants
persistent map_x map_y max_x max_y folds sectors phi_step r_step;
% States
persistent x y state;
% Variables
persistent cartmap_prev Xtrans Ytrans;

% Initialise memory
if isempty(x)
    cartmap_prev = 0;
    Xtrans = 0;
    Ytrans = 0;
    x = 1;
    y = 1;
    state = 0;
    % Constants
    dim = load('dim.txt');
    map_x = dim(1);
    map_y = dim(2);
    max_x = dim(3);
    max_y = dim(4);
    cells = load('cells.txt');
    folds = cells(1);
    sectors = cells(2);
    steps = load('steps.txt');
    r_step = steps(1);
    phi_step = steps(2);
end

% Default return values
done = 0;
wdata = 0;
woffset = 0;
we = 0;
roffset = -1;

% Internal variables
reqread = 0;

% Iterate thru cartesian map row wise using a state machine
if (start == 1)
    if (x < map_x)
        if (y <= map_y)
            % Request read from cartesian map
            if state == 0
                roffset = sub2ind(map_y, x, y);
                state = 1;
            % Read value from cartesian map, transform and
            % save first corner of pixel
            elseif state == 1
                % Transform cartesian positions to polar only if land
                % at current or previous position
                if (rdata == 1 || cartmap_prev == 1)
                    % Translate pixel coordinates
                    [Xtrans, Ytrans] = trans_coord(x, y, max_x, max_y);

                    % Calculate polar coordinate to pixel
                    % 1 X   Fetch coordinates to two corners
                    % | |   of each pixel per iteration.
                    % | |
                    % 2 X   This is done in 2 clock cycles using
                    % | |   state 1 and 2.
                    [sector, fold] = cart2sec(Xtrans, Ytrans, r_step, phi_step);

                    % Request write to polar map
                    we = 1;
                    woffset = sub2ind(sectors, sector, fold);
                    wdata = 1;
                    cartmap_prev = rdata;
                    state = 2;
                % If not land request read from next position and stay
                % in same state.
                else
                    state = 1;
                    reqread = 1;
                    cartmap_prev = 0;
                end
            % Request read from cartesian map, transform and save
            % second corner of pixel
            elseif state == 2
                % Calculate polar coordinate to next corner of pixel
                [sector, fold] = cart2sec(Xtrans+1, Ytrans, r_step, phi_step);

                % Request write to polar map
                we = 1;
                woffset = sub2ind(sectors, sector, fold);
                wdata = 1;
                % Request read
                reqread = 1;
                state = 1;
            end

            % Request a new read operation from cartesian map
            if reqread == 1
                if (y == map_y)
                    y = 1;
                    x = x + 1;
                    % Pixels at end and start of rows are not neighbours
                    cartmap_prev = 0;
                else
                    y = y + 1;
                end
                roffset = sub2ind(map_y, x, y);
            end
        end
    else
        done = 1;
    end
end
end

Appendix E VHDL Constraints


To force the tools to work harder to optimise the design, constraints must be used. To get comparable results, the same base set of constraints was used for all algorithms. The first constraint is the clock period constraint PERIOD, which is used to constrain the clock period length and forces the tools to optimise the routing for low delays between synchronous elements. The HIGH value indicates that the first pulse in the period is high, and the 50% after HIGH is the duty cycle of the first pulse. The TIMESPEC and NET constraints are used to group all elements driven by the clock buffer for the Clock signal.
NET "Clock" TNM_NET = "Clock";
TIMESPEC "TS_Clock" = PERIOD "Clock" 3 ns HIGH 50%;

Listing E.1: PERIOD constraint

The other constraints are concerned with the I/O and pad timing. The I/O constraints are necessary to get a realistic place and route result. To constrain this path, the FROM X TO Y constraint is used. It constrains the delay between two TIMESPEC groups, which in this case are the predefined groups for all the flip-flops (FFS) and for all the pads (PADS). The value at the end is the delay constraint in ns.
TIMESPEC "TS_P2DSP" = FROM PADS TO FFS 4 ns;
TIMESPEC "TS_DSP2P" = FROM FFS TO PADS 4 ns;

Listing E.2: FROM TO constraint with flip-flops (FFS)

The second group of I/O constraints is related to how the external clock relates to the input and output pads. The OFFSET IN BEFORE constraint is used to ensure that the external clock and the external input data meet the setup time of the internal flip-flop. The OFFSET OUT AFTER constraint is used to control the setup and hold times of the external output data pads and the external clock pads.

OFFSET = IN 3 ns BEFORE "Clock";
OFFSET = OUT 4 ns AFTER "Clock";

Listing E.3: The OFFSET constraint

It is also possible to use the OFFSET IN AFTER and OFFSET OUT BEFORE constraints as an alternative to the FROM PADS TO FFS and FROM FFS TO PADS constraints.
