
Mentor Graphics University Design Contest 2013

Sachin B S*, Kashyap N$

Abstract: A sequential implementation on a CPU usually delivers lower performance
than its parallel counterpart. FPGAs have abundant memory and registers at their
disposal, which facilitates the parallel implementation of an algorithm. Although RTL
implementation on an FPGA is difficult, the performance edge delivered by a parallel
implementation is substantial compared to a sequential implementation on a CPU. This
paper deals with the implementation of a parallel algorithm for the Mentor Graphics
University Design Contest. The aim is to accomplish the task consuming the minimum
number of clock cycles and FPGA resources while operating at a high clock frequency.
Keywords: Pipeline, Superscalar, High Performance Computing

1. PROBLEM DESCRIPTION
Two binary valued pictures of size 256x120 (big_picture) and 8x8 (small_picture) are given.
The problem is to implement an algorithm on an FPGA that scans the big_picture with the
small_picture and finds all the pixel locations at which the small_picture matches.
The requirements for a match are:
1. At any instant there must be a full 8x8 block of pixels to compare.
2. Each pixel of big_picture must be less than or equal to the corresponding pixel of
small_picture.
3. At least 50% of the pixels must satisfy condition 2.
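For clarity, the three conditions can be expressed as a small software predicate (a Python sketch of ours, not a contest deliverable; the name `is_match` is hypothetical): a window matches when it fits entirely inside big_picture and at least 32 of the 64 per-pixel comparisons hold.

```python
def is_match(big, small, r, c):
    """Check the match criteria for the 8x8 window of `big` anchored at (r, c).

    Condition 1: the full 8x8 window must lie inside the picture.
    Conditions 2 and 3: at least 50% of the 64 pixels must satisfy
    big[r+j][c+k] <= small[j][k].
    """
    rows, cols = len(big), len(big[0])
    if r + 8 > rows or c + 8 > cols:         # condition 1
        return False
    satisfied = sum(
        1
        for j in range(8)
        for k in range(8)
        if big[r + j][c + k] <= small[j][k]  # condition 2, per pixel
    )
    return satisfied >= 32                   # condition 3: 50% of 64
```
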
The other tasks include creating a separate 24-bit free-running counter to keep track of
elapsed clock cycles: it starts counting after the de-assertion of sync_reset and stops at
the assertion of the task_done_flag signal, which indicates task completion.
The entire algorithm is to be implemented on Xilinx Spartan-3A FPGA with the device name
XC3S700A (FGG484)-4. Total match count and address of all match locations are to be
stored internally using registers and memories which are accessible through output ports.

* bssachin45@gmail.com
4th Year, ECE, BMSCE, Bangalore
(Phone no: 9481786153)

$ karshyap@gmail.com
4th Year, ECE, BMSCE, Bangalore
(Phone no: 8970334614)

2. PROPOSED SOLUTION
The solution mainly consists of a custom processor that follows a superscalar
architecture with a pipelined implementation, along with memories and registers, hence
achieving parallelism to some extent [1][2]. The keys to the parallel processing are
pipelining and superscalar execution; a brief explanation of each is given below:
a) Pipelining
Pipelining is an implementation technique where multiple instructions are overlapped in
execution. The instruction execution pipeline is divided into stages, each of which completes
a part of an instruction in parallel. The stages are connected one to the next to form a pipe:
instructions enter at one end, progress through the stages, and exit at the other end. Pipelining
does not decrease the time for individual instruction execution. Instead, it increases
instruction throughput. The throughput of the instruction pipeline is determined by how often
an instruction exits the pipeline.
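To make the throughput point concrete, a back-of-the-envelope model (our own illustration, not part of the contest design) compares an ideal S-stage pipeline with fully sequential execution:

```python
def pipeline_cycles(n_items, n_stages):
    """Ideal pipeline: n_stages cycles to fill, then one result per cycle."""
    return n_stages + n_items - 1

def sequential_cycles(n_items, n_stages):
    """No overlap: each item passes through all stages before the next starts."""
    return n_stages * n_items

# e.g. 100 instructions through a 4-stage pipe: 103 cycles instead of 400
```

The latency of any single instruction is unchanged (it still traverses all stages); only the rate at which instructions complete improves.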
Deep pipelining increases latency, i.e. the time required for a signal to propagate
through the full pipe. A pipelined system typically requires more resources (circuit
elements, processing units, memories, etc.) than one that executes one batch at a time,
because its stages cannot reuse the resources of a previous stage. Moreover, pipelining may
increase the time it takes for an individual instruction to finish [2].
b) Superscalar:
A superscalar CPU architecture implements a form of parallelism called instruction level
parallelism within a single processor. It therefore allows faster CPU throughput than would
otherwise be possible at a given clock rate. A superscalar processor executes more than one
instruction during a clock cycle by simultaneously dispatching multiple instructions to
redundant functional units on the processor. Each functional unit is not a separate CPU core
but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter,
or a multiplier.
The simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. By contrast, each instruction
executed by a vector processor operates simultaneously on many data items. An analogy is
the difference between scalar and vector arithmetic. A superscalar processor is a mixture of
the two: each instruction processes one data item, but there are multiple redundant functional
units within each CPU, so multiple instructions can process separate data items
concurrently.
This combination is adopted mainly to achieve better performance at a high clock
speed [1].

3. ARCHITECTURE OF SPARTAN 3A XC3S700A FPGA


We first look at the types of memories and resources in the Spartan-3A
XC3S700A before proposing the architecture. The Spartan-3 platform was the industry's first 90
nm FPGA, delivering more functionality and bandwidth per dollar than was previously possible
and setting new standards in the programmable logic industry. The Spartan-3 generation FPGAs
provide a superior alternative to mask-programmed ASICs: FPGAs avoid the high initial cost,
the lengthy development cycles, and the inherent inflexibility of conventional ASICs. FPGA
programmability also permits design upgrades in the field with no hardware replacement
necessary, an impossibility with ASICs. [3]
The memory blocks and resources used in SPARTAN 3A XC3S700A are:
a. Block RAM:
A block RAM is a dedicated two-port memory containing several kilobits of RAM. As the
name says, a block RAM is memory (bits) arranged as a single block. Block RAMs are useful
for storing large chunks of data at one particular location, which is particularly helpful in image
processing, and block RAM is the most efficient memory present in the FPGA. There are two
rows of BRAM in the Spartan-3A XC3S700A, with each row having 10 BRAM units. Each
BRAM unit is 18 Kb in size, of which 16 Kb is used for data and 2 Kb for parity, so the total
BRAM size in the XC3S700A is 360 Kb. The various address x data-line configurations
available for a BRAM are shown in Figure 1 [3].

Figure 1: Various Configurations for BRAMs
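The BRAM budget quoted above can be checked with a few lines of arithmetic (our own sanity check; the depth x width list follows the standard data-side aspect ratios of one 18 Kb unit):

```python
BRAM_UNITS = 2 * 10                  # two rows of 10 BRAM units each
UNIT_KB, DATA_KB, PARITY_KB = 18, 16, 2

total_kb = BRAM_UNITS * UNIT_KB      # 360 Kb of BRAM in total
assert DATA_KB + PARITY_KB == UNIT_KB

# Every depth x width configuration of one unit holds the same 16 Kb of data
configs = [(16384, 1), (8192, 2), (4096, 4), (2048, 8), (1024, 16), (512, 32)]
assert all(depth * width == DATA_KB * 1024 for depth, width in configs)
```
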

b. Distributed RAM:
Distributed RAM is built from a parallel array of a large number of small elements
(look-up tables) spread over the entire FPGA. It is ideal for implementing small memories
which, if implemented in block RAM, might waste memory space. There are about 92 Kb of
distributed RAM in the Spartan-3A XC3S700A, residing in the configurable logic blocks
(CLBs). Because this RAM is distributed throughout the FPGA rather than concentrated in a
single block, it is called "distributed RAM". A look-up table on a Xilinx FPGA can be
configured as a 16x1-bit RAM, ROM, LUT, or 16-bit shift register. [3]
c. Configurable Logic Blocks (CLBs):
Configurable Logic Blocks (CLBs) contain flexible Look-Up Tables (LUTs) that implement
logic plus storage elements used as flip-flops or latches. CLBs perform a wide variety of logical
functions as well as store data. [3]
d. Input/Output Blocks (IOBs):
Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal
logic of the device. IOBs support bidirectional data flow plus 3-state operation. IOBs support a
variety of signal standards, including several high-performance differential standards. Double
Data-Rate (DDR) registers are included. [3]

4. DESIGN
a) Tools Used:
The following tools were used for the Mentor Graphics University Design Contest.
IDE: HDL Designer 2012.1 - HDL Designer is a powerful HDL-based environment
which delivers new approaches to designing today's most complex FPGAs and ASICs.
HDL Designer tackles the design management problem by automating and
simplifying project and team management throughout the design flow. It provides
the designer and design team with interfaces to other design tools within the
flow, including ReqTracer, Questa/ModelSim, Precision, FPGA vendor tools and other
EDA tools, for automated compilation, simulation, invocation and interactive debug. [4]
Simulator: ModelSim 10.1d - ModelSim is a verification and simulation tool for
VHDL, Verilog, SystemVerilog, and mixed-language designs. Simulation waveforms
are observed using ModelSim inside the HDL Designer IDE. It offers powerful waveform
compare for easy analysis of differences and bugs, advanced code coverage and
analysis tools for fast time to coverage closure, and a unified coverage database with
complete interactive and HTML reporting for understanding and debugging coverage
throughout a project. It is coupled with HDL Designer for complete design creation,
project management and visualization capabilities.
Synthesis: Precision Synthesis 2012b - Precision Synthesis offers high quality of results,
industry-unique features, and integration across the Mentor Graphics FPGA Flow, the
industry's most comprehensive FPGA vendor-independent solution. With a rich feature set
that includes advanced optimizations, award-winning analysis, and industry-leading language
support, Precision RTL enables vendor-independent design, accelerates time to market,
eliminates design defects, and delivers superior quality of results. [5]

b) Image Storage:
The image has a size of 256x120, i.e. 256 rows and 120 columns, with each pixel
represented by 6 bits, so its effective size is 184320 bits, equivalent to 180 Kb. The most
efficient storage mechanism, consuming the least area while giving a reasonable speed, is
block RAM. Hence, block RAM (BRAM) was selected as the storage element for the
image. [6]
A 512x32 configuration is selected as the basic configuration in order to read 32 bits
from a BRAM at a time. Since only 256 addresses are needed, the bottom 256 addresses are
left unused and each BRAM is effectively a 256x32 memory. This satisfies the row
requirement of 256 for the image, but a single read operation must return 720 bits (one full
image row), so a large number of BRAMs must be used together: 720/32 = 22.5, which
would require 23 BRAMs, but only 20 BRAMs are present in the Spartan-3A XC3S700A.
Hence, distributed-RAM-based storage is used for the remaining 80 bits of output, which is
equivalent to 20 Kb. Therefore, 20 BRAMs and 3 distributed RAMs are used for storing
the image. [3]
The image is split into 23 parts column-wise using a Perl script: 20 parts of 16 Kb
each (32 bits in each column), so that each can be stored in a BRAM, and 3 parts totalling
20 Kb (the remaining columns, i.e. two distributed RAMs with 32-bit outputs and one with
a 16-bit output), stored in the 3 distributed RAMs. To accommodate a higher system speed,
the dual-port mode of the RAM is selected; this mode supports two reads and one write in
one clock cycle. The memory layout is shown in Figure 2.
[Figure 2 depicts the memory layout: the 256 image rows are stored across
20 block RAMs (each 256x32, giving 32-bit outputs) and three distributed
RAMs (2 x 256x32 with 32-bit outputs and 1 x 256x16 with a 16-bit output).]

Figure 2: Storage of big_picture
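The storage arithmetic above (180 Kb image, 23 memories, 20 Kb of distributed RAM) can be reproduced in a few lines of Python; this is just a sanity check of the figures in the text:

```python
import math

ROWS, COLS, BITS_PER_PIXEL = 256, 120, 6

image_bits = ROWS * COLS * BITS_PER_PIXEL   # 184320 bits = 180 Kb
row_bits   = COLS * BITS_PER_PIXEL          # 720 bits per image row

rams_needed = math.ceil(row_bits / 32)      # 23 x 32-bit-wide memories
bram_bits   = 20 * 32                       # only 20 BRAMs exist -> 640 bits
dram_bits   = row_bits - bram_bits          # 80 bits left for distributed RAM
dram_kb     = dram_bits * ROWS / 1024       # 20 Kb of distributed RAM
```
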

The HDL code for inferring a RAM from a template [7] is shown below:


module bram1 #(parameter DATA = 32, parameter ADDR = 8)
(
    // Port A
    input  wire            a_clk,
    input  wire            a_wr,
    input  wire [ADDR-1:0] a_addr,
    input  wire [DATA-1:0] a_din,
    output reg  [DATA-1:0] a_dout,
    // Port B
    input  wire            b_clk,
    input  wire            b_wr,
    input  wire [ADDR-1:0] b_addr,
    input  wire [DATA-1:0] b_din,
    output reg  [DATA-1:0] b_dout
);

// Shared memory
reg [DATA-1:0] mem [(2**ADDR)-1:0];

// Initial block for loading initial values
initial
begin
    $readmemb("B1.dat", mem);
end

// Port A
always @(posedge a_clk) begin
    a_dout <= mem[a_addr];
    if (a_wr) begin
        a_dout      <= a_din;
        mem[a_addr] <= a_din;
    end
end

// Port B
always @(posedge b_clk) begin
    b_dout <= mem[b_addr];
    if (b_wr) begin
        b_dout      <= b_din;
        mem[b_addr] <= b_din;
    end
end

endmodule
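For reference, a small Python behavioral model of the template above (a `DualPortRAM` class of our own, not part of the design) captures the write-first behavior of each port: on a write, the new data both updates the memory and appears on the data output.

```python
class DualPortRAM:
    """Behavioral model of one port of the bram1 template above."""

    def __init__(self, depth=256, width=32):
        self.mem = [0] * depth
        self.mask = (1 << width) - 1

    def clock(self, addr, din=0, wr=False):
        """One clock edge on a port: returns the value on dout after the edge."""
        if wr:
            self.mem[addr] = din & self.mask
            return din & self.mask   # write-first: new data appears on dout
        return self.mem[addr]
```

Both ports of the real module share `mem`, which is why one write port suffices while the other reads with its write enable tied to 1'b0, as in the instantiations below.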

The code for each BRAM is altered slightly in terms of the module name and the .dat
file that the BRAM loads. For example, the name of the second BRAM module is bram2 and the
.dat file that it reads is B2.dat. The data in the RAMs are accessed through structural
coding [8], as shown below:
// Instantiating the 20 BRAM-based and 3 distributed-RAM-based storage blocks
// BRAM based storage
bram1 b1(clk,big_pic_wr_en,line1,din1[719:688],data1[719:688],clk,1'b0,line2,din1[719:688],data2[719:688]);
bram2 b2(clk,big_pic_wr_en,line1,din1[687:656],data1[687:656],clk,1'b0,line2,din1[687:656],data2[687:656]);
bram3 b3(clk,big_pic_wr_en,line1,din1[655:624],data1[655:624],clk,1'b0,line2,din1[655:624],data2[655:624]);
bram4 b4(clk,big_pic_wr_en,line1,din1[623:592],data1[623:592],clk,1'b0,line2,din1[623:592],data2[623:592]);
bram5 b5(clk,big_pic_wr_en,line1,din1[591:560],data1[591:560],clk,1'b0,line2,din1[591:560],data2[591:560]);
bram6 b6(clk,big_pic_wr_en,line1,din1[559:528],data1[559:528],clk,1'b0,line2,din1[559:528],data2[559:528]);
bram7 b7(clk,big_pic_wr_en,line1,din1[527:496],data1[527:496],clk,1'b0,line2,din1[527:496],data2[527:496]);
bram8 b8(clk,big_pic_wr_en,line1,din1[495:464],data1[495:464],clk,1'b0,line2,din1[495:464],data2[495:464]);
bram9 b9(clk,big_pic_wr_en,line1,din1[463:432],data1[463:432],clk,1'b0,line2,din1[463:432],data2[463:432]);
bram10 b10(clk,big_pic_wr_en,line1,din1[431:400],data1[431:400],clk,1'b0,line2,din1[431:400],data2[431:400]);
bram11 b11(clk,big_pic_wr_en,line1,din1[399:368],data1[399:368],clk,1'b0,line2,din1[399:368],data2[399:368]);
bram12 b12(clk,big_pic_wr_en,line1,din1[367:336],data1[367:336],clk,1'b0,line2,din1[367:336],data2[367:336]);
bram13 b13(clk,big_pic_wr_en,line1,din1[335:304],data1[335:304],clk,1'b0,line2,din1[335:304],data2[335:304]);
bram14 b14(clk,big_pic_wr_en,line1,din1[303:272],data1[303:272],clk,1'b0,line2,din1[303:272],data2[303:272]);
bram15 b15(clk,big_pic_wr_en,line1,din1[271:240],data1[271:240],clk,1'b0,line2,din1[271:240],data2[271:240]);
bram16 b16(clk,big_pic_wr_en,line1,din1[239:208],data1[239:208],clk,1'b0,line2,din1[239:208],data2[239:208]);
bram17 b17(clk,big_pic_wr_en,line1,din1[207:176],data1[207:176],clk,1'b0,line2,din1[207:176],data2[207:176]);
bram18 b18(clk,big_pic_wr_en,line1,din1[175:144],data1[175:144],clk,1'b0,line2,din1[175:144],data2[175:144]);
bram19 b19(clk,big_pic_wr_en,line1,din1[143:112],data1[143:112],clk,1'b0,line2,din1[143:112],data2[143:112]);
bram20 b20(clk,big_pic_wr_en,line1,din1[111:80],data1[111:80],clk,1'b0,line2,din1[111:80],data2[111:80]);
// DRAM based storage
bram21 b21(clk,big_pic_wr_en,line1,din1[79:48],data1[79:48],clk,1'b0,line2,din1[79:48],data2[79:48]);
bram22 b22(clk,big_pic_wr_en,line1,din1[47:16],data1[47:16],clk,1'b0,line2,din1[47:16],data2[47:16]);
bram23 b23(clk,big_pic_wr_en,line1,din1[15:0],data1[15:0],clk,1'b0,line2,din1[15:0],data2[15:0]);
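The bit-slice wiring above follows a simple pattern; this Python sketch (the helper name `split_row` is ours) shows how one 720-bit row value decomposes into the 22 32-bit slices and one 16-bit slice fed to the 23 memories:

```python
def split_row(row_value):
    """Split a 720-bit row value into the 23 slices fed to the 23 RAMs:
    bits [719:688] down to [47:16] as 32-bit slices, then [15:0] as 16 bits."""
    slices = []
    hi = 720
    for _ in range(22):                       # bram1 .. bram22
        slices.append((row_value >> (hi - 32)) & 0xFFFFFFFF)
        hi -= 32
    slices.append(row_value & 0xFFFF)         # bram23 gets the last 16 bits
    return slices
```
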

c. Pipelining
A 4-stage pipeline was implemented to increase throughput [9]. Figure 3 shows the
pipeline structure for our algorithm. Since we are using dual-port RAM, two address lines,
namely LINE-1 and LINE-2, are used to read data from the RAM. Once data is read from the
RAM, a combinational block performs the comparison operations described in the algorithm
and the result is stored in an integer array. The integer locations are checked to see whether at
least half the pixels of the big_picture window are less than or equal to the pixels of
small_picture; if a match occurs, the location is stored in a register and the register storing the
number of matches is incremented.

[Figure 3 depicts the pipeline: the dual-port RAM is read over the two address
lines LINE-1 and LINE-2, the two read values feed the comparison logic, the
comparison results accumulate in the integer array, and a final stage checks
for matches and writes the location array.]

Figure 3: Pipeline Stages

The code implementing the various combinational and sequential blocks of the 4-stage pipeline is shown below:
// Implementation of sync_reset
always @ (posedge clk)
begin
    sync_reset <= async_reset;
end

// -- Clock tick counter implementation and main algorithm --
always @ (posedge clk)
begin
    // Reset conditions
    if (sync_reset == 1'b1 && big_pic_wr_en == 1'b0)
    // synchronous reset generated from async_reset
    begin
        clk_tick_counter         <= 24'b0;
        task1_done_flag_internal <= 1'b0;
        // Taking advantage of dual reads:
        line1 <= 8'b00000000;   // set the first line of BRAM to the first row
        line2 <= 8'b00000001;   // set the second line of BRAM to the second row
        for (i = 0; i < 904; i = i + 1)
            carray[i] = 0;
    end
    // Runtime modifications
    else if (big_pic_wr_en == 1'b1)
    begin
        line1 <= big_pic_row_addr;
        din1  <= big_pic_row_data;
    end
    // Main implementation algorithm
    else if (sync_reset == 1'b0)
    begin

        // Enable write to RAM locations

        // Pipeline Stage 1
        if (line1 != 8'b11111110 && line2 != 8'b11111111)   // until reading of all rows is completed
        begin
            line1 <= line1 + 2'b10;   // read values onto data1
            line2 <= line2 + 2'b10;   // read values onto data2
        end

        // Pipeline Stage 2 is implemented using the BRAMs, which are
        // realised by structural coding
        // Pipeline Stage 3
        if (line1 == 'd3 && task1_done_flag_internal == 1'b0)
        begin
            for (i = 120; i > 0; i = i - 1)        // move across row of big image
            begin
                for (j = 8; j > 0; j = j - 1)      // move across row of small image
                begin
                    for (k = 0; k < 8 && (line1 + j) < 128; k = k + 1)   // move across column of small image
                    begin
                        if (j - line1 > 0 && k - i > 0)
                        begin
                            if (data1[(i*6-1)] <= small_array[(6*j-1)][k])
                            begin
                                ..
                                ..   // comparison by if statements
                                carray[(j-line1)*113 + (k-i)] = carray[(j-line1)*113 + (k-i)] + 1'b1;
                            end
                        end
                    end
                end
            end
        end
        // Pipeline Stage 4
        if (line1 == 'd5 && task1_done_flag_internal == 1'b0)
        begin
            // Optimising the carray size algorithms
            // Checking for matches
        end

        if (clk_tick_counter == 'd131)
        begin
            task1_done_flag_internal <= 1'b1;
        end
    end

    // Implementation of clk_tick_counter
    if (task1_done_flag_internal == 1'b0 && sync_reset == 1'b0)
    begin
        clk_tick_counter <= clk_tick_counter + 1;
    end
end

// Assign task1_done_flag: '1' if task completed
assign task1_done_flag = task1_done_flag_internal;

d. Algorithm for traversing across image and finding matches


The pixels of the big_picture are indexed from 0 to 119 column-wise and 0 to 255 row-wise,
while the pixels of small_picture are indexed from 0 to 7 in both directions.
Considering the boundary conditions, finding all matches requires 113 iterations
column-wise (120 - 8 + 1 = 113). Since dual-port RAM is used to store the big_picture, we
can read two 720-bit rows, i.e. 2 x 120 pixels, at a time [7].
Let linei be the line number, i the pixel number within the line, and j and k the row and
column of the pixel selected in small_picture. We compare the pixel (linei, i) of
big_picture with the pixel (j, k) of small_picture while traversing linei = 0 to 128 (since the
RAM is dual-port, the other 128 rows are read simultaneously), i = 0 to 113, j = 0 to 7 and
k = 0 to 7.

The boundary conditions can be avoided by using an if condition:


if (j - line1 > 0 && k - i > 0 && (line1 + j) < 128)
begin
    // Do comparisons
    // Increment carray based on comparisons
end

Consider a general row linei and column i; pixel (linei, i) is taken as the reference
location. We iterate with j = 0 to 7 and k = 0 to 7, representing the traversal of the window
across the big_picture, and compare each big_picture pixel with the corresponding pixel of
small_picture, i.e. a maximum of 64 comparisons per reference pixel [10]. If the boundary
conditions are satisfied and the pixel (linei, i) of big_picture is less than or equal to the pixel
(j, k) of small_picture, then the integer location (j-linei, k-i) is incremented and the iteration
continues. The algorithm is illustrated in figures 4 and 5.
[Figures 4 and 5 illustrate the movement of the 8x8 small image (the mask)
across row linei of the big image, with each pixel occupying 6 bits, first at
window row offset j and then at offset (j+1).]

Figure 4: Illustration of the small image movement-1
Figure 5: Illustration of the small image movement-2
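The traversal can be cross-checked against a plain software reference model; this Python sketch (our own, with the hypothetical name `find_matches`) slides the 8x8 window over every valid anchor and applies the 50% rule:

```python
def find_matches(big, small):
    """Slide the 8x8 small picture over every anchor where it fully fits and
    report anchors where at least 32 of the 64 comparisons big <= small hold."""
    rows, cols = len(big), len(big[0])
    hits = []
    for r in range(rows - 7):            # row anchors where the window fits
        for c in range(cols - 7):        # 120 - 8 + 1 = 113 column anchors
            score = sum(big[r + j][c + k] <= small[j][k]
                        for j in range(8) for k in range(8))
            if score >= 32:              # 50% of 64 comparisons
                hits.append((r, c))
    return hits
```
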

The integer array carray holds 243x113 integers. A match has occurred when an integer
value reaches 32, i.e. at least 50% of the 64 comparisons; the match register is then
incremented and the match location is stored. Since 243x113 is very large, the size is
optimized by storing only 8 rows of integer values at a time: only two rows are read per
clock and the computation at any point in time is limited to 8 rows, so the array size
reduces to 8x113.
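One plausible way to picture the 8x113 reduction is a rolling buffer indexed modulo 8 (our own illustration; the RTL's actual indexing may differ): a window anchored at image row r only accumulates while rows r to r+7 are processed, so its counter can live in buffer row r mod 8 and be recycled afterwards.

```python
WINDOW_ROWS, COL_ANCHORS = 8, 113

def buffer_slot(anchor_row, col):
    """Map a window anchor (row, col) onto the 8 x 113 rolling accumulator."""
    return (anchor_row % WINDOW_ROWS) * COL_ANCHORS + col

# Rows 0 and 8 reuse the same slots, which is safe because every window
# anchored at row 0 is decided before any window anchored at row 8 opens.
```
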
e. Displaying Matches through UART and LCD:
The number of matches and the match locations are displayed based on data received over
UART. A UART template is used and the input Rx is monitored. On receiving a particular
ASCII code, the number of matches or the locations is transmitted: ASCII code '0' is used
for displaying the number of matches, '1' for the locations, and so on [11]. The same values
are also shown on the LCD.
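The Rx dispatch described above can be sketched as a simple lookup (a behavioral illustration in Python; the function name and return formatting are ours, not the RTL's):

```python
def uart_command(rx_byte, match_count, locations):
    """Dispatch on the received ASCII code: '0' -> match count,
    '1' -> match locations; anything else is ignored."""
    if rx_byte == ord('0'):
        return str(match_count)
    if rx_byte == ord('1'):
        return ";".join(f"({r},{c})" for r, c in locations)
    return ""
```
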

5. RESULTS:
The algorithm was simulated using ModelSim and synthesized using Precision RTL for
the Spartan-3A XC3S700A FPGA. Figure 6 shows the simulation of the number of matches,
which is transmitted through UART.

Figure 6. Simulation for Matches

The number of matches was found out to be 19. The locations are (0,0), (0,1), (0,2), (1,0), (1,1),
(1,2), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2), (4,0), (4,1), (4,2), (5,1), (6,1), (7,1) and (8,1).
The resource utilization is:
BRAMs: 20 x 256x32 dual port
Distributed RAMs: 2 x 256x32 dual port + 1 x 256x16 dual port
DFFs: 83K

The maximum speed as per synthesis was 136.67 MHz and the output of clk_tick_counter was
131. Hence, the performance figure is 0.95. Figure 7 shows the simulation of clk_tick_counter.

Figure 7. Simulation for clk_tick_counter

6. ACKNOWLEDGEMENTS:

We are thankful to our faculty advisor, Ajay Sir, BMSCE, for his kind and able guidance
and for his help in the preparation of this abstract.
We are thankful to Vikas, 3rd Year, CSE, BMSCE, for writing a Perl script for dividing
the .dat file into the required partitions.
We are grateful to the HOD, ECE, BMSCE, for allowing us to use the labs beyond college
hours.
We are grateful to Mentor Graphics for allowing us to participate in the Mentor Graphics
University Design Contest, and thankful to Veeresh Shetty, Mentor Graphics, for
answering our queries regarding the contest.

7. REFERENCES:
[1]. Dwiel B.H., Choudhary N.K., Rotenberg E., "FPGA modeling of diverse superscalar processors",
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
[2]. Devlin B., Nakura T., Ikeda M., Asada K., "Throughput optimization by pipeline alignment of a
Self Synchronous FPGA", International Conference on Field-Programmable Technology, 2009.
[3]. UG331, Spartan-3 Generation FPGA User Guide, Xilinx.
[4]. HDL Designer Datasheet, Mentor Graphics.
[5]. Precision RTL Datasheet, Mentor Graphics.
[6]. WP335, Innovative Use of BRAMs, Xilinx.
[7]. Precision RTL Synthesis Style Guide, Mentor Graphics.
[8]. Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis.
[9]. Trivedi A.N., "4-stage pipeline architecture", IEEE Conference on Computer,
Communication, Control and Power Engineering, 1993.
[10]. Donald G. Bailey, Design for Embedded Image Processing on FPGAs.
[11]. Pong P. Chu, FPGA Prototyping by Verilog Examples: Xilinx Spartan-3 Version.
