1. PROBLEM DESCRIPTION
Two binary valued pictures of size 256x120 (big_picture) and 8x8 (small_picture) are given.
The problem is to implement an algorithm on an FPGA in which the small_picture scans the
big_picture and finds all pixel locations where a match with the small_picture occurs.
The requirements for a match are:
1. At any instant there should be no fewer than 8x8 pixels to compare.
2. Each pixel of big_picture must be less than or equal to the corresponding pixel of
small_picture.
3. At least 50% of the pixels must satisfy this condition.
The other tasks include creating a separate 24-bit free-running counter to keep track of
elapsed clock cycles: it starts counting after de-assertion of sync_reset and stops at
assertion of the task_done_flag signal, where task_done_flag indicates task completion.
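As a rough illustration of the intended counter behaviour, the following Python sketch models it in software. This is only a behavioural model, not the RTL; the class and method names are ours.

```python
class TickCounter:
    """Software model of the 24-bit free-running clock-tick counter."""
    WIDTH = 24

    def __init__(self):
        self.count = 0
        self.running = False

    def tick(self, sync_reset, task_done_flag):
        """Advance the model by one clock edge and return the count."""
        if sync_reset:            # counter held at zero while reset is asserted
            self.count = 0
            self.running = False
        elif task_done_flag:      # stops counting once the task completes
            self.running = False
        elif not task_done_flag:  # counts after de-assertion of sync_reset
            self.running = True
            self.count = (self.count + 1) % (1 << self.WIDTH)
        return self.count
```

The modulo keeps the model faithful to a 24-bit register, which wraps at 2^24.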
The entire algorithm is to be implemented on a Xilinx Spartan-3A FPGA, device
XC3S700A (FGG484)-4. The total match count and the addresses of all match locations are to be
stored internally in registers and memories that are accessible through output ports.
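As a reference for the matching criteria above, here is a small software model in Python. It is only a behavioural sketch of the specification, not the FPGA implementation; the function name and parameters are illustrative.

```python
def find_matches(big, small, threshold=32):
    """Scan the 8x8 small picture over the big picture and return all
    (row, col) positions where at least `threshold` of the 64 pixel
    comparisons satisfy big[r+j][c+k] <= small[j][k]."""
    R, C = len(big), len(big[0])
    matches = []
    for r in range(R - 7):          # requirement 1: a full 8x8 window must exist
        for c in range(C - 7):
            hits = sum(1
                       for j in range(8)
                       for k in range(8)
                       if big[r + j][c + k] <= small[j][k])  # requirement 2
            if hits >= threshold:   # requirement 3: at least 50% of 64 pixels
                matches.append((r, c))
    return matches
```

With the 8x8 window, requirement 3 translates to at least 32 of the 64 comparisons succeeding, hence the default threshold.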
* bssachin45@gmail.com, 4th Year, ECE, BMSCE, Bangalore (Phone no: 9481786153)
karshyap@gmail.com, 4th Year, ECE, BMSCE, Bangalore (Phone no: 8970334614)
2. PROPOSED SOLUTION
The solution mainly consists of a custom processor that follows a superscalar
architecture with a pipelined implementation, along with memories and registers, hence achieving
parallelism to some extent. [1][2] The keys to the parallel processing are pipelining and
superscalar execution. Brief explanations of each are given below:
a) Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in
execution. The instruction execution pipeline is divided into stages, and each stage completes a
part of an instruction in parallel. The stages are connected one to the next to form a pipe:
instructions enter at one end, progress through the stages, and exit at the other end. Pipelining
does not decrease the time for an individual instruction to execute; instead, it increases
instruction throughput. The throughput of the instruction pipeline is determined by how often
an instruction exits the pipeline.
Deeper pipelining increases latency, i.e. the time required for a signal to propagate
through the full pipe. A pipelined system typically requires more resources (circuit
elements, processing units, memories, etc.) than one that executes one batch at a time,
because its stages cannot reuse the resources of a previous stage. Moreover, pipelining may
increase the time it takes for an individual instruction to finish [2].
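The throughput gain can be illustrated with a toy calculation. This is an idealized model that ignores stalls and hazards; the function is purely illustrative and not part of the design.

```python
def pipeline_cycles(n_stages, n_items):
    """Cycles for n_items to flow through an ideal n_stages-deep pipeline:
    the first item takes n_stages cycles (latency), and each further item
    exits one cycle later (throughput of one result per cycle)."""
    return n_stages + n_items - 1

# An unpipelined unit needing n_stages cycles per item would instead take
# n_stages * n_items cycles: 400 vs 103 cycles for 4 stages and 100 items.
```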
b) Superscalar:
A superscalar CPU architecture implements a form of parallelism called instruction level
parallelism within a single processor. It therefore allows faster CPU throughput than would
otherwise be possible at a given clock rate. A superscalar processor executes more than one
instruction during a clock cycle by simultaneously dispatching multiple instructions to
redundant functional units on the processor. Each functional unit is not a separate CPU core
but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter,
or a multiplier.
The simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. By contrast, each instruction
executed by a vector processor operates simultaneously on many data items. An analogy is
the difference between scalar and vector arithmetic. A superscalar processor is a mixture of
the two: each instruction processes one data item, but there are multiple redundant functional
units within each CPU, so multiple instructions can process separate data items
concurrently.
This approach is used mainly to obtain better performance and hence complete the
task at high speed [1].
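In the same idealized spirit, the benefit of dispatching several instructions per cycle can be sketched as follows. The model again ignores data dependencies and structural hazards, and the names are illustrative.

```python
import math

def issue_cycles(n_instr, issue_width, depth=1):
    """Cycles to complete n_instr on an ideal superscalar machine that can
    dispatch issue_width independent instructions per cycle through a
    pipeline of the given depth."""
    return depth + math.ceil(n_instr / issue_width) - 1
```

Doubling the issue width roughly halves the issue time in this ideal case; real designs fall short of that because of dependencies between instructions.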
b. Distributed RAM:
Larger distributed RAMs are implemented using a parallel array of a large number of LUT
elements spread over the entire FPGA. Distributed RAM is ideal for implementing small
memories which, if implemented in block RAM, might lead to wastage of memory space.
There are about 92 Kb of distributed RAM in the Spartan-3A XC3S700A FPGA. It resides in the
configurable logic blocks (CLBs); because this RAM is distributed throughout the FPGA
rather than concentrated in a single block, it is called "distributed RAM". A
look-up table on a Xilinx FPGA can be configured as a 16x1-bit RAM, ROM, LUT or 16-bit shift
register. [3]
c) Configurable Logic Blocks (CLBs):
Configurable Logic Blocks (CLBs) contain flexible Look-Up Tables (LUTs) that implement
logic plus storage elements used as flip-flops or latches. CLBs perform a wide variety of logical
functions as well as store data. [3]
d) Input/Output Blocks(IOBs):
Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal
logic of the device. IOBs support bidirectional data flow plus 3-state operation. IOBs support a
variety of signal standards, including several high-performance differential standards. Double
Data-Rate (DDR) registers are included. [3]
4. DESIGN
a) Tools Used:
The following tools were used by us for the Mentor Graphics University Design Contest.
IDE: HDL Designer 2012.1 - HDL Designer is a powerful HDL-based environment
which delivers new approaches to designing today's most complex FPGAs and ASICs.
HDL Designer tackles the design management problem by automating and
simplifying project and team management throughout the design flow. It provides
the designer and design team with interfaces to other design tools within the
flow, including ReqTracer, Questa/ModelSim, Precision, and FPGA vendor and other
EDA tools, for automated compilation, simulation, invocation and interactive debug. [4]
Simulator: ModelSim 10.1d - ModelSim is a verification and simulation tool for
VHDL, Verilog, SystemVerilog, and mixed-language designs. Simulation waveforms
are observed using ModelSim inside the HDL Designer IDE. It has powerful waveform
compare for easy analysis of differences and bugs, advanced code coverage and
analysis tools for fast time to coverage closure, and a unified coverage database with
complete interactive and HTML reporting for understanding and debugging coverage
throughout a project. It is coupled with HDL Designer for complete design creation,
project management and visualization capabilities.
Synthesis: Precision Synthesis 2012b - Precision Synthesis offers high quality of results,
industry-unique features, and integration across the Mentor Graphics FPGA Flow, the
industry's most comprehensive FPGA vendor-independent solution. With a rich feature set
that includes advanced optimizations, award-winning analysis, and industry-leading language
support, Precision RTL enables vendor-independent design, accelerates time to market,
eliminates design defects, and delivers superior quality of results. [5]
b) Image Storage:
The image has a size of 256X120, with 256 rows and 120 columns and each pixel
represented by 6 bits, so the effective size is 184320 bits, equivalent to
180 Kbit. The most efficient storage mechanism, consuming the least area while
giving a reasonable speed, is block RAM. Hence, block RAM (BRAM) was selected as the
storage element for the image. [6]
512X32 is selected as the basic configuration in order to read 32 bits from a BRAM at
a time. To trade capacity for speed, the bottom 256 addresses are left unused, so each
BRAM is effectively a 256X32 memory. This satisfies the row requirement (256 rows in the
image), but a single read operation must deliver 720 bits, so a large number of BRAMs
must be used together. A 720-bit output would require 720/32 = 22.5, i.e. 23 BRAMs, but
only 20 BRAMs are present in the Spartan-3A XC3S700A FPGA. Hence, distributed-RAM based
storage is used for the remaining 80 bits of output, which is equivalent to 20 Kbit.
Therefore, 20 BRAMs and 3 DRAMs are used for storing the image. [3]
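The sizing argument above can be checked with a few lines of arithmetic. This is just a worked example in Python; the variable names are ours.

```python
import math

ROWS, COLS, BITS = 256, 120, 6            # image geometry from the problem
row_bits = COLS * BITS                    # 720 bits must be read per row
image_bits = ROWS * row_bits              # 184320 bits = 180 Kbit in total
brams_needed = math.ceil(row_bits / 32)   # 23 if BRAMs alone were used
bram_covered = 20 * 32                    # the 20 available BRAMs give 640 bits
dram_bits = (row_bits - bram_covered) * ROWS  # remaining 80 bits/row -> 20 Kbit
```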
The image is split into 23 parts column-wise using a Perl script: 20 parts of 16 Kbit each
(32 bits per column) so that each fits in a BRAM, and 3 parts totalling 20 Kbit (the
remaining columns, i.e. two DRAMs with 32-bit outputs and one DRAM with a 16-bit output)
stored in the 3 DRAMs. To achieve a higher system speed, the dual-port mode of the RAM is
selected; this mode supports two reads and one write in one clock cycle. The memory layout
is as shown in figure 2.
[Figure 2: Memory layout - 20 block RAMs (256X32), one 32-bit output per BRAM across the 256 image rows]
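The column-wise split performed by the Perl script can be sketched as follows. This is a Python stand-in for the script, under the assumption that each 720-bit row is packed MSB-first; the function name is ours.

```python
def split_row(row_bits):
    """Split one 720-bit image row (as a Python int, MSB = bit 719) into the
    23 column slices used for storage: 20 x 32-bit BRAM words, two 32-bit
    distributed-RAM words and one 16-bit distributed-RAM word."""
    widths = [32] * 22 + [16]   # 22*32 + 16 = 720 bits in total
    slices, shift = [], 720
    for w in widths:
        shift -= w
        slices.append((row_bits >> shift) & ((1 << w) - 1))
    return slices
```

The first 20 slices would correspond to the BRAM contents (B1.dat ... B20.dat) and the last three to the DRAM-based storage.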
The code for each BRAM differs only slightly, in the name of the BRAM and the .dat
file that the BRAM stores; for example, the second BRAM is named BRAM2 and the
.dat file it reads is B2.dat. The data in the RAMs are accessed through structural
coding [8] as shown below:
// Instantiating 20 BRAM and 3 distributed RAM based storage blocks
// BRAM based storage
bram1 b1(clk,big_pic_wr_en,line1,din1[719:688],data1[719:688],clk,1'b0,line2,din1[719:688],data2[719:688]);
bram2 b2(clk,big_pic_wr_en,line1,din1[687:656],data1[687:656],clk,1'b0,line2,din1[687:656],data2[687:656]);
bram3 b3(clk,big_pic_wr_en,line1,din1[655:624],data1[655:624],clk,1'b0,line2,din1[655:624],data2[655:624]);
bram4 b4(clk,big_pic_wr_en,line1,din1[623:592],data1[623:592],clk,1'b0,line2,din1[623:592],data2[623:592]);
bram5 b5(clk,big_pic_wr_en,line1,din1[591:560],data1[591:560],clk,1'b0,line2,din1[591:560],data2[591:560]);
bram6 b6(clk,big_pic_wr_en,line1,din1[559:528],data1[559:528],clk,1'b0,line2,din1[559:528],data2[559:528]);
bram7 b7(clk,big_pic_wr_en,line1,din1[527:496],data1[527:496],clk,1'b0,line2,din1[527:496],data2[527:496]);
bram8 b8(clk,big_pic_wr_en,line1,din1[495:464],data1[495:464],clk,1'b0,line2,din1[495:464],data2[495:464]);
bram9 b9(clk,big_pic_wr_en,line1,din1[463:432],data1[463:432],clk,1'b0,line2,din1[463:432],data2[463:432]);
bram10 b10(clk,big_pic_wr_en,line1,din1[431:400],data1[431:400],clk,1'b0,line2,din1[431:400],data2[431:400]);
bram11 b11(clk,big_pic_wr_en,line1,din1[399:368],data1[399:368],clk,1'b0,line2,din1[399:368],data2[399:368]);
bram12 b12(clk,big_pic_wr_en,line1,din1[367:336],data1[367:336],clk,1'b0,line2,din1[367:336],data2[367:336]);
bram13 b13(clk,big_pic_wr_en,line1,din1[335:304],data1[335:304],clk,1'b0,line2,din1[335:304],data2[335:304]);
bram14 b14(clk,big_pic_wr_en,line1,din1[303:272],data1[303:272],clk,1'b0,line2,din1[303:272],data2[303:272]);
bram15 b15(clk,big_pic_wr_en,line1,din1[271:240],data1[271:240],clk,1'b0,line2,din1[271:240],data2[271:240]);
bram16 b16(clk,big_pic_wr_en,line1,din1[239:208],data1[239:208],clk,1'b0,line2,din1[239:208],data2[239:208]);
bram17 b17(clk,big_pic_wr_en,line1,din1[207:176],data1[207:176],clk,1'b0,line2,din1[207:176],data2[207:176]);
bram18 b18(clk,big_pic_wr_en,line1,din1[175:144],data1[175:144],clk,1'b0,line2,din1[175:144],data2[175:144]);
bram19 b19(clk,big_pic_wr_en,line1,din1[143:112],data1[143:112],clk,1'b0,line2,din1[143:112],data2[143:112]);
bram20 b20(clk,big_pic_wr_en,line1,din1[111:80],data1[111:80],clk,1'b0,line2,din1[111:80],data2[111:80]);
// DRAM based storage
bram21 b21(clk,big_pic_wr_en,line1,din1[79:48],data1[79:48],clk,1'b0,line2,din1[79:48],data2[79:48]);
bram22 b22(clk,big_pic_wr_en,line1,din1[47:16],data1[47:16],clk,1'b0,line2,din1[47:16],data2[47:16]);
bram23 b23(clk,big_pic_wr_en,line1,din1[15:0],data1[15:0],clk,1'b0,line2,din1[15:0],data2[15:0]);
c. Pipelining
A 4-stage pipeline was implemented to increase throughput. [9] Figure 3 shows the
pipelining structure for our algorithm. Since we are using dual-port RAM, two address lines,
LINE-1 and LINE-2, are used to read data from the RAM. Once data is read from the RAM, a
combinational block performs the comparison operations described in the algorithm and the result
is stored in an integer array. The integer locations are then checked to see whether at least half
the pixels of big_picture are less than or equal to the corresponding pixels of small_picture;
if a match occurs, the location is stored in a register and the register holding the number of
matches is incremented.
[Figure 3: 4-stage pipeline - RAM read on LINE-1/LINE-2, comparison blocks, integer array, check for matches, location array]
The code implementing the combinational and sequential blocks of the 4-stage pipeline is shown below:
// Implementation of Sync_reset
always @ (posedge clk)
begin
sync_reset <= async_reset;
end
//-- clock tick counter implementation and main algorithm --
always @ (posedge clk)
begin
// Reset Conditions
if (sync_reset == 1'b1 && big_pic_wr_en == 1'b0)
// synchronous reset generated from async_reset
begin
clk_tick_counter <= 24'b0;
task1_done_flag_internal <= 1'b0;
// Taking advantage of dual reads
line1 <= 8'b00000000;
// Set the first line of BRAM to first row
line2 <= 8'b00000001;
// Set the second line of BRAM to second row
for(i=0;i<904;i=i+1)
carray[i]=0;
end
// Runtime Modifications
else if (big_pic_wr_en == 1'b1)
begin
line1 <= big_pic_row_addr;
din1 <= big_pic_row_data;
end
// Main Implementation Algorithm
else if (sync_reset == 1'b0)
begin
// Pipeline Stage 1
if(line1 != 8'b11111110 && line2 != 8'b11111111)
begin
line1 <= line1 + 2'b10;
line2 <= line2 + 2'b10;
end
// Pipeline Stage 2
// Pipeline Stage 2 is implemented using BRAMs, realized by the structural coding above
// Pipeline Stage 3
if(line1 == 8'd3 && task1_done_flag_internal == 1'b0)
begin
for(i=120;i>0;i=i-1)
// Move Across Row of Big Image
begin
for(j=8;j>0;j=j-1)
// Move Across Row of Small Image
begin
for(k=0; k<8 && (line1+j)<128; k=k+1) // Move Across Col of Small Image
begin
if(j-line1>0 && k-i>0)
begin
if(data1[(i*6-1)] <= small_array[(6*j-1)][k])
begin
..
..// Comparison by if statements
carray[(j-line1)*113+(k-i)] = carray[(j-line1)*113+(k-i)] + 1'b1;
end
end
end
end
end
// Pipeline Stage 4
if(line1 == 8'd5 && task1_done_flag_internal == 1'b0)
begin
// optimising the carray size algorithms
// Checking for Matches
end
if(clk_tick_counter == 24'd131)
begin
task1_done_flag_internal <= 1'b1;
end
end
// Implementation of clk_tick_counter
if (task1_done_flag_internal == 1'b0 && sync_reset == 1'b0)
begin
clk_tick_counter <= clk_tick_counter + 1;
end
end
// Assign task1_done_flag
assign task1_done_flag = task1_done_flag_internal;
Consider a general row linei and column i. Pixel (linei, i) is taken as the reference
location. We iterate over j = 0 to 7 and k = 0 to 7, traversing the pixels of the
big_picture window and comparing each one with the corresponding pixel of
small_picture, i.e. a maximum of 64 comparisons per reference pixel. [10] If the boundary
conditions are satisfied and the pixel of big_picture is less than or equal to pixel (j,k) of
small_picture, the integer location (j-linei, k-i) is incremented and the iteration continues.
The algorithm is illustrated in figures 4 and 5.
[Figure 4: Illustration of the small image movement-1 - small image masking row linei of the big image, 6 bits per pixel]
[Figure 5: Illustration of the small image movement-2 - small image advanced to row (j+1) of the big image]
The integer array carray holds 243X113 integers. A match has occurred at a location if its
integer value is at least 32 (50% of the 64 comparisons); in that case the match register is
incremented and the match location is stored. Since 243X113 is very large, the size is
optimized by storing only 8 rows of integer values at a time: only 8 rows of values are read
per clock and the operation at any point in time is limited to 8 rows. Hence, the size
reduces to 8X113.
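The 8-row optimisation can be sketched as a circular buffer. This is an illustrative Python model, not the authors' RTL; the class name and methods are ours.

```python
class RollingCounts:
    """Sketch of the carray size optimisation: instead of a full 243x113
    integer array, keep only the 8 rows the current window can touch and
    recycle a row's storage once the scan moves past it."""
    def __init__(self, cols=113, rows=8):
        self.rows, self.cols = rows, cols
        self.buf = [[0] * cols for _ in range(rows)]

    def bump(self, row, col):
        self.buf[row % self.rows][col] += 1   # row index wraps modulo 8

    def retire_row(self, row):
        """Read out a finished row's counts and clear it for reuse."""
        r = row % self.rows
        out, self.buf[r] = self.buf[r], [0] * self.cols
        return out
```

Indexing rows modulo 8 lets a finished row's storage be reused by the row entering the window, which is what shrinks 243X113 down to 8X113.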
e. Displaying Matches through UART and LCD:
The number of matches and the match locations are displayed based on data from the UART. A
UART template is used and the Rx input is monitored. On receiving a particular ASCII code,
the number of matches or the locations are transmitted: ASCII code '0' is used for
displaying the number of matches, '1' for the locations, and so on. [11] The same values
are also shown on the LCD.
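The command protocol can be sketched as a simple dispatch. This is a Python model of the described behaviour; the address packing shown is an assumption for illustration, not taken from the design.

```python
def uart_dispatch(rx_byte, match_count, match_locations):
    """Sketch of the UART command protocol: ASCII '0' returns the match
    count, ASCII '1' returns the list of match locations."""
    if rx_byte == ord('0'):
        return [match_count]
    if rx_byte == ord('1'):
        # Packing (row, col) into row*256 + col is an assumed encoding.
        return [r * 256 + c for (r, c) in match_locations]
    return []                      # unrecognized codes are ignored
```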
5. RESULTS:
The algorithm was simulated using ModelSim, and synthesis was carried out with Precision RTL
for the Spartan-3A XC3S700A FPGA. Figure 6 shows the simulation for the number of matches,
which is transmitted through UART.
The number of matches was found to be 19. The locations are (0,0), (0,1), (0,2), (1,0), (1,1),
(1,2), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2), (4,0), (4,1), (4,2), (5,1), (6,1), (7,1) and (8,1).
The resource utilization is:
BRAMs: 20 256X32 dual-port
DRAMs: 2 256X32 dual-port + 1 256X16 dual-port
DFF: 83K
The maximum speed per synthesis was 136.67 MHz and the final value of clk_tick_counter was 131;
hence the performance figure is 0.95. Figure 7 shows the simulation of clk_tick_counter.
6. ACKNOWLEDGEMENTS:
We are thankful to our faculty advisor, Ajay Sir, BMSCE, for his kind and able guidance,
and grateful for his help in the preparation of this abstract.
We are thankful to Vikas, 3rd Year, CSE, BMSCE, for writing the Perl script that divides
the .dat file into the required partitions.
We are grateful to the HOD, ECE, BMSCE, for allowing us to use the labs beyond college
hours.
We are grateful to Mentor Graphics for allowing us to participate in the Mentor Graphics
University Design Contest, and thankful to Veeresh Shetty, Mentor Graphics, for
answering our queries regarding the contest.
7. REFERENCES:
[1] Dwiel B.H., Choudhary N.K., Rotenberg E., "FPGA Modeling of Diverse Superscalar Processors,"
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
[2] Devlin B., Nakura T., Ikeda M., Asada K., "Throughput Optimization by Pipeline Alignment of a
Self Synchronous FPGA," International Conference on Field-Programmable Technology, 2009.