Performance Enhancement of Video Compression Algorithms With SIMD

Report Submitted By:

Shamik Valia
Saket Jamkar
1 Introduction
Video compression algorithms have a large number of applications, ranging from video
conferencing to video on demand to video phones. Video compression standards (such
as MPEG-1, 2, 4, 7) and teleconferencing standards (such as the H.2XX family) are vital
algorithms used in these and other multimedia applications, whose performance is very
critical given the high data rates that are common for video applications. The timing
constraints at such high data rates can be challenging even for custom video codecs
and overwhelming for some state-of-the-art superscalar processors.
Performing these operations in real time isn't easy on most platforms if image resolutions of
acceptable quality are desired.
The algorithms, however, consist of repetitive and regular operations by nature, which
could benefit greatly from the use of some architectures that are better able to perform
such repetitive tasks efficiently. In recent years, general purpose microprocessors have
also been endowed with functional units capable of Single Instruction Stream Multiple
Data Stream (SIMD) operation. This project attempts to study the speedup achievable for
the most critical parts of these algorithms by utilizing the Streaming SIMD Extensions 2
from the Intel Pentium 4 processors. We also deal with the improvement schemes for the
DCT algorithm on the SSE2 architecture.
2 Basic steps used in Video Compression

The video compression algorithm used in numerous standards (such as MPEG-1, MPEG-2 and
H.263) usually consists of the following steps:
1. Motion Estimation
2. Motion Compensation and Image Subtraction
3. Discrete Cosine Transform
4. Quantization
5. Run Length Encoding
6. Entropy Coding (Huffman Coding)
We examine each of these steps in greater detail in this section.

2.1 Motion Estimation
Motion estimation is the process of calculating motion vectors by finding matching
blocks in the future frame corresponding to blocks in the current frame. Motion
estimation helps in detecting temporal redundancy. Various search algorithms have
been devised for estimating motion. The basic assumption underlying these algorithms is
that only translational motion can be compensated for; rotational motion and zooming
cannot be estimated by block-based search algorithms. Motion estimation is known to be
the most crucial and computationally intensive process in the video compression algorithm.
Figure 2.1.1 Search region (-p, p) around the current block
Since most of the video streams have a frame rate ranging from 15 to 30 frames per
second, there is never a very large motion of any object between two successive frames.
Therefore most search algorithms search for a matching block in the neighborhood of the
position of the current block in the next frame. The region in which the matching block is
searched for is called the search region. The search region around a block is shown in
Figure 2.1.1.
The choice for the value of p will depend upon the type of broadcast that has to be sent.
For fast-moving videos such as sports events a higher value of p such as 16 or 32 may be
used. On the other hand for broadcasts with less motion such as a news-telecast a smaller
value of p such as 4 or 8 may be used.
Figure 2.1.2 Motion vector and best match: the block at (x,y) is matched to the block at
(x+u,y+v), giving the motion vector (u,v)
The task of the search algorithm is to find, in the next frame, the best match for a block
in the current frame. A typical block size is 8x8 or 16x16 pixels. The quality of a match
is measured by the Mean Absolute Error (MAE) between the blocks. This is the average
absolute pixel-wise difference between two blocks: the reference block in the current
frame and the probable match found in the next frame. The matching block is chosen on
the basis of the magnitude of its MAE: the smaller the magnitude, the better the match.
The displacement of the block with the minimum MAE is taken as the motion vector.
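The MAE for a candidate displacement (i, j) is given by:
\[ \mathrm{MAE}(i,j) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} \left| C(x+k,\, y+l) - R(x+i+k,\, y+j+l) \right| \]
where C(x+k, y+l) is a pixel of the M x N block in the current frame and R(x+i+k, y+j+l)
is the corresponding pixel of the candidate block displaced by (i, j).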
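The MAE formula translates almost directly into code. The following is a minimal sketch, not the code used in this project: the function name block_mae, the row-major 8-bit frame layout and the width parameter are assumptions made for illustration, and the caller is expected to keep every access inside the frame.

```c
#include <stdlib.h>

/* Mean absolute error between the M x N block of the current frame C at
 * (x, y) and the block of the searched frame R displaced by (i, j).
 * Assumption: both frames are 8-bit images stored row by row with
 * `width` pixels per row, and all accesses stay inside the frame. */
double block_mae(const unsigned char *C, const unsigned char *R, int width,
                 int x, int y, int i, int j, int M, int N)
{
    long sum = 0;
    for (int l = 0; l < N; l++)          /* rows of the block    */
        for (int k = 0; k < M; k++)      /* columns of the block */
            sum += abs((int)C[(y + l) * width + (x + k)] -
                       (int)R[(y + j + l) * width + (x + i + k)]);
    return (double)sum / (M * N);
}
```

Since M and N are fixed during a search, minimizing the plain sum of absolute differences gives the same motion vector as minimizing the MAE, which is why the division is often dropped in practice.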
Next we explain the two search algorithms that were used in this project.
2.1.1 Full Search
Full search is an exhaustive search algorithm. Full search is the simplest method to find
the motion vector for each block; in it the MAE(i,j) is found at each point in the search
region. Thus a search for the match block is made in the complete (-p, +p) range in the
future frames for every block of the current frame.
For each motion vector, there are about (2p)^2 search locations (exactly (2p+1)^2 if both
ends of the (-p, +p) range are included). At each search location (i,j) we
compare N x M pixels. Each pixel comparison requires three operations, namely: a
subtraction, an absolute value calculation and an addition. We ignore the cost of
accessing the pixels C(x + k, y + l) and R(x + i + k, y + j + l). Thus the total complexity
per block is (2p)^2 x MN x 3 operations. For a picture resolution of I x J, and a picture rate
of F pictures/second, the overall complexity is (IJF/MN) x (2p)^2 x MN x 3 operations per
second.
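As a rough worked example of this expression, take a QCIF picture (I x J = 176 x 144, the format of the news.qcif test stream used later), a search range of p = 8, and an assumed picture rate of F = 30 pictures/second. The block size MN cancels, so the estimate does not depend on it:

\[ \frac{IJF}{MN} \cdot (2p)^2 \cdot MN \cdot 3 = 176 \cdot 144 \cdot 30 \cdot 16^2 \cdot 3 = 583{,}925{,}760 \approx 5.8 \times 10^{8} \text{ operations per second} \]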
This makes full search a computationally very intensive method: its CPU time is the
highest of all the algorithms. At the same time the accuracy of full search is also the
highest, because the best match for every block in the current frame is always found.
Full search is therefore the benchmark against which the quality of other search
algorithms, measured by CPU time and accuracy, is compared. There is a trade-off
between the efficiency of the algorithm and the quality of the prediction image. Keeping
this trade-off in mind, many algorithms have been developed.
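The exhaustive search described above can be sketched as follows. This is an illustrative version, not the program from the appendix: the names MotionVector, block_sad and full_search are hypothetical, square N x N blocks are assumed, and candidates that leave the frame are simply skipped. It minimizes the sum of absolute differences, which selects the same displacement as the minimum MAE.

```c
#include <limits.h>
#include <stdlib.h>

typedef struct { int u, v; long sad; } MotionVector;

/* Sum of absolute differences between the N x N block of cur at (cx, cy)
 * and the N x N block of ref at (rx, ry). */
static long block_sad(const unsigned char *cur, const unsigned char *ref,
                      int width, int cx, int cy, int rx, int ry, int N)
{
    long sum = 0;
    for (int l = 0; l < N; l++)
        for (int k = 0; k < N; k++)
            sum += abs((int)cur[(cy + l) * width + cx + k] -
                       (int)ref[(ry + l) * width + rx + k]);
    return sum;
}

/* Tries every displacement (i, j) with -p <= i, j <= +p and keeps the one
 * with the smallest difference. */
MotionVector full_search(const unsigned char *cur, const unsigned char *ref,
                         int width, int height, int x, int y, int N, int p)
{
    MotionVector best = { 0, 0, LONG_MAX };
    for (int j = -p; j <= p; j++) {
        for (int i = -p; i <= p; i++) {
            int rx = x + i, ry = y + j;
            if (rx < 0 || ry < 0 || rx + N > width || ry + N > height)
                continue;                 /* candidate leaves the frame */
            long sad = block_sad(cur, ref, width, x, y, rx, ry, N);
            if (sad < best.sad) {
                best.u = i;
                best.v = j;
                best.sad = sad;
            }
        }
    }
    return best;
}
```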
2.1.2 Three Step Search

Three step search became very popular because of its simplicity and its robust, near
optimal performance. It searches for the best motion vector in a coarse to fine search
pattern. The algorithm may be described as follows (refer to Figure 2.1.3):
Step 1: An initial step size is picked. Eight points at a distance of one step size from the
centre (around the centre point) are picked for comparison.
Step 2: The centre is moved to the point with the minimum distortion and the step size
is halved. Steps 1 and 2 are repeated until the step size becomes smaller
than 1. A particular path for the convergence of this algorithm is shown
below:
Figure 2.1.3 Example path for convergence of Three Step Search (the legend marks the
points chosen for the first, second and third stage)
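The two steps above can be sketched in C as shown below. This is a hedged illustration rather than the appendix program: the names Vec2, tss_sad and three_step_search are hypothetical, the distortion measure is the sum of absolute differences, and candidates outside the frame are treated as infinitely bad. With an initial step of 4 the loop runs for steps 4, 2 and 1, which gives the three stages.

```c
#include <limits.h>
#include <stdlib.h>

typedef struct { int u, v; } Vec2;

/* SAD between the N x N block of cur at (x, y) and the block of ref at
 * (rx, ry); returns LONG_MAX when the candidate leaves the frame. */
static long tss_sad(const unsigned char *cur, const unsigned char *ref,
                    int width, int height, int x, int y,
                    int rx, int ry, int N)
{
    if (rx < 0 || ry < 0 || rx + N > width || ry + N > height)
        return LONG_MAX;
    long sum = 0;
    for (int l = 0; l < N; l++)
        for (int k = 0; k < N; k++)
            sum += abs((int)cur[(y + l) * width + x + k] -
                       (int)ref[(ry + l) * width + rx + k]);
    return sum;
}

Vec2 three_step_search(const unsigned char *cur, const unsigned char *ref,
                       int width, int height, int x, int y, int N, int step)
{
    Vec2 centre = { 0, 0 };
    long best = tss_sad(cur, ref, width, height, x, y, x, y, N);
    while (step >= 1) {
        Vec2 next = centre;
        /* Step 1: compare the eight points around the current centre. */
        for (int dj = -1; dj <= 1; dj++) {
            for (int di = -1; di <= 1; di++) {
                if (di == 0 && dj == 0)
                    continue;
                int u = centre.u + di * step;
                int v = centre.v + dj * step;
                long s = tss_sad(cur, ref, width, height, x, y,
                                 x + u, y + v, N);
                if (s < best) {
                    best = s;
                    next.u = u;
                    next.v = v;
                }
            }
        }
        /* Step 2: move the centre to the best point and halve the step. */
        centre = next;
        step /= 2;
    }
    return centre;
}
```

Each stage evaluates at most eight new candidates, so a search with initial step 4 needs at most 25 block comparisons.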
2.2 Motion Compensation and Image Subtraction

The process of Motion Estimation and Motion Compensation is similar to DPCM. The
idea is to reduce the bandwidth required for the video by sending only the difference
frames instead of the actual frames.
The motion vectors produced during Motion Estimation are utilized in the Motion
Compensation process in order to produce the predicted image in the encoder just like it
would be produced in the Decoder. The two images (current frame and the motion
compensated frame) are now subtracted and the difference is sent to the receiver along
with the motion vectors. Thus the decoder can produce the exact copy of the future frame
by first motion compensating the current frame using the motion vectors and then adding
the difference image.
The block diagram of the Encoder is given below in Figure 2.2.1 in order to illustrate the
idea.
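[Block diagram: Frame (n), I(x,y,t), and Frame (n+1), I(x,y,t+1), are the inputs to
Motion Estimation, which produces the motion vector (u,v). Motion Compensation then
forms the difference image E(x,y,t) = I(x,y,t) - I(x-u,y-v,t+1), which is passed to
DCT coding.]
Fig. 2.2.1 Block Diagram of Video Encoder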
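The image subtraction step for one block can be sketched as follows. This is only an illustration of the difference expression E(x,y,t) = I(x,y,t) - I(x-u,y-v,t+1) from the block diagram; the function name block_residual, the row-major 8-bit frame layout and the output buffer E are assumptions, and the caller must keep the displaced block inside the frame.

```c
/* Difference block for the N x N block at (x, y):
 * E(k, l) = frame_t(x + k, y + l) - frame_t1(x - u + k, y - v + l).
 * Assumption: frames are 8-bit, row-major, `width` pixels per row, and
 * E has room for N * N values. */
void block_residual(const unsigned char *frame_t,
                    const unsigned char *frame_t1,
                    int width, int x, int y, int N, int u, int v, short *E)
{
    for (int l = 0; l < N; l++)
        for (int k = 0; k < N; k++)
            E[l * N + k] = (short)(frame_t[(y + l) * width + (x + k)] -
                                   frame_t1[(y - v + l) * width + (x - u + k)]);
}
```

When the motion vector describes the motion exactly, every entry of E is zero, which is what makes the difference image cheap to code.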
2.3 Discrete Cosine Transform (DCT)

DCT based image coding is the basis for almost all the image and video compression
standards. Discrete Cosine Transform is a derivative of the Discrete Fourier Transform
(DFT), which is encountered very commonly in Digital Signal Processing.
The fundamental operation performed by DCT is to transform the space domain
representation of an image to a spatial frequency domain (known as DCT domain). The
formula for DCT is given below:
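\[ Y(k,l) = \frac{C(k)\,C(l)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} X_{ij} \cos\frac{(2i+1)k\pi}{16} \cos\frac{(2j+1)l\pi}{16} \]
\[ C(k) = \frac{1}{\sqrt{2}} \text{ if } k = 0, \qquad C(k) = 1 \text{ otherwise} \]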
The DCT transformation can be viewed as the process of finding for each waveform, the
corresponding weight Y(k,l) so that the sum of 64 waveforms scaled by the
corresponding weights Y(k,l) yields the reconstructed version of the original 8 x 8 block.
The energy compaction of the DCT is among the highest, second only to the Karhunen-Loeve
Transform. This means that the information can be compressed to a very high degree with
the DCT, which is why it is commonly used. At the same time the DCT also minimizes the
block artifacts that are present in many other transforms, owing to its favorable periodic
nature.
The DCT is, in principle, a lossless process. However, due to the finite word-lengths in a
microprocessor, there is some loss of information due to rounding and truncation of the
calculated DCT values. This loss of information is irreversible.
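A direct evaluation of the 8 x 8 DCT formula is sketched below, mainly to make the index conventions concrete. It is not the project's implementation: the names COS16, cos_pi16 and dct8x8 are hypothetical, and a small table of cos(n*pi/16) values replaces calls to the math library. This form needs 64 multiply-accumulate steps per coefficient and is far slower than the fast algorithms discussed later.

```c
/* cos(n*pi/16) for n = 0..8; other angles follow by symmetry. */
static const double COS16[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* Returns cos(n*pi/16) for any n >= 0. */
static double cos_pi16(int n)
{
    n %= 32;                                   /* period is 2*pi         */
    if (n > 16)
        n = 32 - n;                            /* cos(2*pi - a) = cos(a) */
    return n > 8 ? -COS16[16 - n] : COS16[n];  /* cos(pi - a) = -cos(a)  */
}

/* Y(k,l) = C(k)C(l)/4 * sum_i sum_j X[i][j]
 *          * cos((2i+1)k*pi/16) * cos((2j+1)l*pi/16) */
void dct8x8(double X[8][8], double Y[8][8])
{
    for (int k = 0; k < 8; k++) {
        for (int l = 0; l < 8; l++) {
            double sum = 0.0;
            for (int i = 0; i < 8; i++)
                for (int j = 0; j < 8; j++)
                    sum += X[i][j] * cos_pi16((2 * i + 1) * k)
                                   * cos_pi16((2 * j + 1) * l);
            double ck = (k == 0) ? COS16[4] : 1.0;   /* 1/sqrt(2) */
            double cl = (l == 0) ? COS16[4] : 1.0;
            Y[k][l] = ck * cl / 4.0 * sum;
        }
    }
}
```

For a constant block all the energy ends up in Y(0,0): a block filled with the value 10 gives Y(0,0) = 80 and zeros (up to rounding) everywhere else.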
2.4 Quantization
The human eye is not very sensitive to the high frequency content in an image. Therefore
removal of these spatial frequencies does not lead to any perceptible loss in image
quality. This is the basic principle behind quantization. The spatial frequency content of
the image is obtained by the DCT operation; quantization then divides each DCT
coefficient by a step size and rounds the result, so that with large step sizes for the
high frequencies most of the high frequency content is reduced to zero.
The JPEG standard recommends standard quantization tables, which are used to
de-emphasize the higher frequencies in the DCT image. Quantization is a lossy process:
some data is lost during quantization, and this loss of information is irreversible.
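The quantization of one block can be sketched as a divide-and-round over the 64 coefficients. This is a minimal sketch under stated assumptions: the name quantize8x8 is hypothetical, and the step-size table Q is supplied by the caller (for example a table recommended by the JPEG standard); no particular table values are implied here.

```c
/* out[k][l] = Y[k][l] / Q[k][l], rounded to the nearest integer.
 * Assumption: every entry of Q is a positive step size. */
void quantize8x8(double Y[8][8], int Q[8][8], int out[8][8])
{
    for (int k = 0; k < 8; k++) {
        for (int l = 0; l < 8; l++) {
            double v = Y[k][l] / Q[k][l];
            out[k][l] = (int)(v >= 0.0 ? v + 0.5 : v - 0.5);
        }
    }
}
```

A coefficient whose magnitude is less than half its step size becomes zero, which is where both the compression and the irreversible loss come from.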
2.5 Run Length Encoding (RLE)
Run-length encoding is the next stage of the compression process. It encodes the runs of
zeroes. If pixel values are correlated with their neighbors, then there will be sequences of
the same value. Instead of coding all the repeated values, we encode the first value and
then give the run length of the sequence. Intuitively, one can understand how RLE
helps in achieving compression. Suppose the data is 0000000000 (a zero repeated ten
times). Instead of writing ten zeroes one can send only the pair 0-10, which is taken to
mean that a zero occurs 10 times. This is how compression is achieved in Run-Length
Encoding. Runs of zeroes are encoded in a 16 bit or 8 bit format.
A higher compression can be achieved in Run-Length Encoding if we somehow obtain
longer strings of zeroes. This is achieved by performing RLE in a zigzag manner on a
block. In the DCT image the higher frequency content is always found towards the lower
right hand corner of the DCT image while the lowest frequencies are in the upper left
hand corner of the image. During quantization the higher frequencies are reduced to zero
and therefore the values in the lower right hand side are mostly zero. Therefore by
performing RLE in a zig-zag manner, we try to obtain runs of zeroes out of the lower
right hand side of the DCT domain representation.
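The zigzag scan followed by run-length coding can be sketched as below. This is an illustrative scheme, not the exact bitstream format of any standard: the names RunLevel and zigzag_rle are hypothetical, each nonzero coefficient is emitted as a (run of preceding zeroes, value) pair, and a final (0, 0) pair is used here as an end-of-block marker when the block ends in zeroes.

```c
typedef struct { int run; int level; } RunLevel;

/* Scans an 8 x 8 block of quantized coefficients in zigzag order and
 * writes (zero run, nonzero value) pairs to out, which must have room
 * for 65 entries. Returns the number of pairs written. */
int zigzag_rle(int block[8][8], RunLevel out[65])
{
    int zz[64], n = 0;

    /* Walk the 15 anti-diagonals, alternating direction. */
    for (int s = 0; s < 15; s++) {
        if (s % 2 == 0) {                 /* moving up and to the right */
            for (int r = (s < 8 ? s : 7); r >= 0 && s - r < 8; r--)
                zz[n++] = block[r][s - r];
        } else {                          /* moving down and to the left */
            for (int c = (s < 8 ? s : 7); c >= 0 && s - c < 8; c--)
                zz[n++] = block[s - c][c];
        }
    }

    int count = 0, run = 0;
    for (int i = 0; i < 64; i++) {
        if (zz[i] == 0) {
            run++;
        } else {
            out[count].run = run;
            out[count].level = zz[i];
            count++;
            run = 0;
        }
    }
    if (run > 0) {                        /* trailing zeroes: end of block */
        out[count].run = 0;
        out[count].level = 0;
        count++;
    }
    return count;
}
```

For a block whose only nonzero entries are 5 at (0,0) and 3 at (1,0), the scan produces 5, 0, 3 followed by 61 zeroes, and the output is just three pairs: (0,5), (1,3) and the end-of-block marker.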
2.6 Huffman Encoding

Huffman encoding is a form of entropy encoding and is based on Shannon's
information theory. The fundamental idea behind Huffman encoding is that symbols
which occur more frequently should be represented by fewer bits, while those occurring
less frequently should be represented by more bits. This scheme is similar to
the one used in Morse code.
Shannon proved that the entropy of a message is a lower bound on the average code
length, in bits per symbol, of any uniquely decodable code for that message.
Given n symbols S1 to Sn with probabilities of occurrence P1 to Pn in a certain
message, the entropy of the message is given by
\[ \text{Entropy} = \sum_{i=1}^{n} P_i \log_2 \frac{1}{P_i} \]
Huffman encoding attempts to minimize the average number of bits per symbol and to
get a value close to the entropy.
Example:
We describe the algorithm for Huffman encoding with the help of an example.
Consider 4 symbols with probabilities as shown in the first column of the table.
Symbol   Probability   Iteration 1   Iteration 2   Length in bits
S1       0.4 (1)       0.4 (1)       0.6 (0)       1
S2       0.3 (00)      0.3 (00)      0.4 (1)       2
S3       0.2 (010)     0.3 (01)                    3
S4       0.1 (011)                                 3
Table 2.6.1 Huffman encoding method
Step 1: Sort the probabilities and arrange them in descending order, as shown in the
column marked Probability.
Step 2: Add the probabilities of the last two symbols and enter the sum, together with
the remaining probabilities, in the next column, again sorted in descending order.
Step 3: Continue steps 1 and 2 until only two symbols remain.
Step 4: Assign bit 0 to the upper symbol and 1 to the lower symbol (or vice versa, but
then this format should be followed throughout the process).
Step 5: Trace back the probabilities according to where they have come from in the
previous column and append a 0 or 1 depending on the format chosen above.
Step 6: Follow this procedure back to the first column and the variable length code is
ready.
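The merging procedure in the steps above determines the length of each codeword, and that part can be sketched compactly. The function below is an illustration only: the name huffman_lengths and the limit HUFF_MAX are hypothetical, and it computes code lengths rather than the actual bit patterns. Each round merges the two groups with the smallest total probability, and every symbol inside a merged group gains one bit.

```c
#define HUFF_MAX 64

/* Computes Huffman code lengths for n symbols (1 <= n <= HUFF_MAX) with
 * probabilities prob[0..n-1]; the result is written to len[0..n-1]. */
void huffman_lengths(const double *prob, int n, int *len)
{
    double weight[HUFF_MAX];   /* total probability of each live group */
    int group[HUFF_MAX];       /* group that each symbol belongs to    */

    for (int i = 0; i < n; i++) {
        weight[i] = prob[i];
        group[i] = i;
        len[i] = 0;
    }
    for (int remaining = n; remaining > 1; remaining--) {
        int a = -1, b = -1;    /* the two live groups of smallest weight */
        for (int g = 0; g < n; g++) {
            if (weight[g] < 0.0)
                continue;      /* group already merged away */
            if (a < 0 || weight[g] < weight[a]) {
                b = a;
                a = g;
            } else if (b < 0 || weight[g] < weight[b]) {
                b = g;
            }
        }
        /* Merge b into a; every symbol involved gets one more bit. */
        for (int i = 0; i < n; i++) {
            if (group[i] == a || group[i] == b) {
                len[i]++;
                group[i] = a;
            }
        }
        weight[a] += weight[b];
        weight[b] = -1.0;
    }
}
```

For the probabilities 0.4, 0.3, 0.2 and 0.1 of Table 2.6.1 this gives lengths 1, 2, 3 and 3, i.e. an average of 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 1.9 bits per symbol, slightly above the entropy of about 1.85 bits.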
Huffman encoding isn't implemented in this manner in the JPEG image compression
standard. The standard code tables for Huffman coding are defined in the standard for the
DC values and the non-DC values as well. These values are looked up both for encoding
and decoding. Huffman code is a prefix code and hence it can be uniquely decoded.
The other alternative methods for entropy coding or source coding are Shannon-Fano
encoding, and arithmetic coding. Arithmetic coding has even been adopted in the JPEG
2000 standard for entropy coding after IBM agreed to release its patent on this technique
for the JPEG 2000 standard.
3. Key Bottlenecks
After analysing the video compression algorithms and surveying the literature on
improving their performance, we identified two algorithms (motion estimation and
DCT) which are the most resource intensive and in which a very high proportion of
the time of a video compression algorithm is spent. Motion estimation applies highly
repetitive methods to the whole image. The DCT is essentially a matrix multiplication
loop which has to be performed on every 8 x 8 or 16 x 16 pixel block of the image. As
the image resolution increases the problem becomes even worse, because the number of
loop iterations increases and more computational resources are required.
However, there isn't any data dependence between the various data elements that are
used in these algorithms. It is therefore possible to improve the performance of these
programs by exploiting the parallelism inherent in these media algorithms and
processing different data points in parallel to obtain higher throughput.
A SIMD architecture exploits this parallelism by using an increased datapath width and
performing the same operation on several data points (in our case pixels) at once.
4. SIMD Architecture:
Usually, processors process one data element in one instruction, a processing style called
Single Instruction Single Data, or SISD. In contrast, processors with SIMD capability
process more than one data element in one instruction. Single Instruction Stream
Multiple Data Stream (SIMD) architectures perform operations on many elements in a
lockstep fashion: the same instruction is performed on different data elements by
different functional units.
Intel's MMX/SSE/SSE2, AMD's 3DNow! and the PowerPC AltiVec ISA extensions are
testimony to the benefits of adding SIMD support to traditional superscalar processors.
4.1 Intel's Streaming SIMD Extensions
SIMD extensions for the IA-32 ISA began with the Multimedia Extensions (MMX) in
1997 for the Pentium processor. The MMX datapath of 64-bit subword-parallel ALUs for
bytes, words and doublewords enhanced performance on multimedia benchmarks.
However, these instructions had a very limited function, in that only integer data types
could be handled. Also, since the MMX instructions used the floating point registers, it
was very hard to intermingle floating point and MMX instructions.
The Streaming SIMD Extensions (SSE), introduced with the Pentium III, added 70 new
instructions to the IA-32 ISA, extending MMX. The biggest winners from the new
instructions were applications that handled 3D or streaming media, as applying identical
instructions to multiple pieces of data was now handled in parallel. AMD wasn't idle
over this time though, and introduced 3DNow! to the world. This much catchier-sounding
set offered capabilities similar to those made possible by SSE, but was incompatible with
it.
The SSE2 technology of the Pentium 4 introduced new Single Instruction Multiple
Data (SIMD) double-precision floating-point instructions and new SIMD integer
instructions into the IA-32 Intel architecture. The 128-bit SIMD integer extensions are a
full superset of the 64-bit integer SIMD instructions, with additional instructions to
support more integer data types, conversion between integer and floating-point data
types, and efficient operations between the caches and system memory. These
instructions provide a means to accelerate operations typical of 3D graphics, real-time
physics, spatial (3D) audio, video encoding/decoding, encryption, and scientific
applications.
4.1.1 SSE Vs MMX
MMX and SSE, both of which are extensions to existing architectures, share the concept
of SIMD, but they differ in the data types they handle, and in the way they are supported
in the processor.
MMX instructions are SIMD for integers, while SSE instructions add SIMD for single-
precision floating-point numbers. An MMX instruction operates on a 64-bit register
holding, for example, two 32-bit integers, while an SSE instruction operates on four
32-bit floats simultaneously.
A major difference between MMX and SSE is that no new registers were defined for
MMX, while eight new registers have been defined for SSE. Each of the registers for
SSE is 128 bits long and can hold four single-precision floating-point numbers (each
being 32 bits long). The arrangement of the floating-point numbers in the new data type
handled by SSE is illustrated in Figure 4.1.
Figure 4.1: Arrangement of numbers in the new data type.
The immediate question is: Where did the registers for MMX come from? The MMX
registers were allocated out of the floating-point registers of the floating-point unit. A
floating-point register is 80 bits long, of which 64 bits were used for an MMX register. A
limitation of this architecture is that an application cannot execute MMX instructions and
perform floating-point operations simultaneously. Additionally, a large number of
processor clock cycles are needed to change the state of executing MMX instructions to
the state of executing floating-point operations and vice versa. SSE does not have such a
restriction. Separate registers have been defined for SSE. Hence, applications can execute
SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously.
Applications can also execute non-SIMD floating-point and SIMD floating-point
instructions simultaneously.
The arrangement of the registers in MMX and SSE is illustrated in Figure 4.2. Figure
4.2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure
4.2(b) illustrates the SSE registers.
Figure 4.2: Registers in MMX and SSE.
MMX and SSE have one more similarity: Both have eight registers. MMX registers are
named mm0 through mm7, while SSE registers are named xmm0 through xmm7. For the
purpose of our experiment we make use of the SSE2 extensions.
4.2 SSE2 Coding Techniques
There is limited compiler support available for the SIMD ISA extensions. As a result, to
make use of the rich features provided by these extensions we need to use one of the
following programming techniques to code programs with SSE2:
a) Assembly level programming
b) Intrinsics
c) Vector Class Library
The advantage of Intrinsics and the Vector Class Library is that they free the programmer
from managing registers while making the code easier to maintain and modularize. The
compiler takes care of instruction scheduling and register allocation.
Each computation and data manipulation assembly instruction has a corresponding
intrinsic that implements it directly. The intrinsics in SSE2 carry suffixes that indicate the
datatype operated on by the instruction:
- ps, pd indicate a packed single or packed double precision floating point operation
- ss, sd indicate a scalar single or scalar double precision floating point operation
- pi, pu indicate packed signed or unsigned integers in a 64-bit register; epi, epu
indicate packed signed or unsigned integers in a 128-bit (extended) register; the
number that follows (8, 16, 32 or 64) is the element width in bits; si followed by a
width (for example si128) refers to an integer value of that width taken as a whole.
To use the SSE2 intrinsics, the header emmintrin.h must be included (xmmintrin.h covers
only the original SSE intrinsics). We chose the intrinsics style of coding for the motion
estimation algorithms, and we chose Intel's C++ compiler over Microsoft's Visual Studio
to compile them. For most parts we used normal C code constructs. In the places where
we could exploit parallelism with SIMD, we used SSE2 intrinsics to tell the compiler to
use the SSE2 instructions.
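To show how the suffix convention reads in practice, here is a small sketch that adds eight 16-bit integers with a single SSE2 instruction. It is an illustration rather than part of the project code: the function name add8_i16 is hypothetical, and the plain C fallback is only there so that the example also builds on targets without SSE2. The intrinsics used (_mm_loadu_si128, _mm_add_epi16, _mm_storeu_si128) are declared in emmintrin.h.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_ADD 1
#endif

/* out[i] = a[i] + b[i] for i = 0..7 (wrap-around on overflow). */
void add8_i16(const int16_t *a, const int16_t *b, int16_t *out)
{
#ifdef HAVE_SSE2_ADD
    /* si128: the whole 128-bit register; epi16: packed 16-bit integers */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

The unaligned load and store variants are used so that the arrays need no special alignment; with 16-byte aligned data _mm_load_si128 and _mm_store_si128 could be used instead.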
4.3. Motion Estimation

We perform motion estimation with full search and three step search for both the 16x16
and 8x8 block sizes and compare performance. The complete C code for all the
programs is provided in the appendix. Here we present the intrinsic optimizations made
to incorporate the SSE2 features.
4.3.1 Code Snippets
Blockdiff is the main computationally intensive function call in the program. It also can
make use of the SSE2 features to improve its performance. We change the code using
intrinsics to employ the SSE2 datapath. Snippet below provides the blockdiff function
call.
int blockdiff(int x1,int x2,int y11,int y22)
{
unsigned char block1[16], block2[16];
int i1,j1,k1,ch,offset1,offset2;
int diff1[16][16], totaldiff = 0;
FILE *fp1, *fp2;
__m128i *b1,*b2,m1;
union mmx
{
__m128i m;
short int x[8];
}m;
.
.
.
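// type casting pointers.
b1 = (__m128i*)block1;
b2 = (__m128i*)block2;
// SAD for 16 bytes. block1 and block2 are plain byte arrays with no
// 16-byte alignment guarantee, so they are read with unaligned loads.
m1 = _mm_sad_epu8(_mm_loadu_si128(b1), _mm_loadu_si128(b2));
m.m = m1;
// _mm_sad_epu8 leaves one partial sum in the low 16 bits of each 64-bit
// half of the result, i.e. in x[0] and x[4]; the other elements are zero.
totaldiff = totaldiff + m.x[0] + m.x[4];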
}
}
Figure 4.3 Blockdiff function to process 16 x 16 blocks.
Figure 4.3 above shows the SSE2 code for the blockdiff() function, which finds the
difference between two blocks located at (x1, y11) and (x2, y22).
The top part shows the declarations inside the function, while the bottom part shows the
calculation of the difference using the SSE2 intrinsic. We define a union called mmx,
which lets the same 128 bits be addressed either as the member m of type __m128i or as
an array of eight 16-bit integers. A __m128i register holds sixteen 8-bit integer values.
The block1 and block2 arrays contain sixteen 8-bit pixel values each from the image.
They are accessed as __m128i values through the pointers b1 and b2. Next the
_mm_sad_epu8() intrinsic computes the absolute differences of these 16 pairs of values
and sums them, producing two partial sums (one per 64-bit half of the register) in m1.
Adding both partial sums to totaldiff accumulates the total difference from the previous
iterations and this one.
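For comparison, a self-contained version of the same idea is sketched below: the sum of absolute differences of two complete 16 x 16 blocks, one row of 16 pixels per _mm_sad_epu8 call. This is an illustrative rewrite, not the blockdiff() function of this report: the name sad16x16 and the stride parameter are assumptions, and the scalar fallback only exists so that the sketch builds without SSE2. The partial sums are accumulated in the register and extracted once at the end.

```c
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SAD 1
#endif

/* Sum of absolute differences of two 16 x 16 blocks of 8-bit pixels.
 * `stride` is the distance in bytes between consecutive rows. */
int sad16x16(const unsigned char *a, const unsigned char *b, int stride)
{
#ifdef HAVE_SSE2_SAD
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; row++) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + row * stride));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + row * stride));
        /* two partial sums, one in each 64-bit half of the register */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* add the low half and the high half */
    return _mm_cvtsi128_si32(acc) +
           _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    int sum = 0;
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            sum += abs((int)a[row * stride + col] -
                       (int)b[row * stride + col]);
    return sum;
#endif
}
```

The largest possible result is 256 x 255 = 65280, so the accumulated sums fit comfortably in the 64-bit halves and in the int return value.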

{
FILE *fp1, *fp2;
__m64 *b1,*b2,m1;
union mmx
{
__m64 m;
int x[2];
}m;
.
.
.
b1 = (__m64*)block1;
b2 = (__m64*)block2;
//SAD for 8 bytes.

m1 = _m_psadbw(*b1,*b2);
m.m = m1;
totaldiff = totaldiff + m.x[0];
}
Figure 4.4 Snippet from blockdiff function for 8 x 8 blocks.
Figure 4.4 above shows the blockdiff function code for taking differences between 8 x 8
blocks in a similar manner to the one above.
The operations performed are similar but the datatype and intrinsic used are different.
The datatype used is the __m64 type, which consists of 8 8-bit values. The intrinsic used
to calculate the sum of differences is the _m_psadbw() operation.
Totaldiff will again contain the accumulated difference from all the iterations.
4.3.2. Results
                        Without SSE   With SSE
Full Search, 16 x 16    4 secs        1 sec
Full Search, 8 x 8      23 secs       6 secs
Three Step, 16 x 16     3 secs        1 sec
Three Step, 8 x 8       12 secs       3 secs
Table 1 Timing information for the various programs.
Figure 4.4.1 Frame 4 and 5 of the news.qcif video-stream
Figure 4.4.2 Motion Compensated Frame produced from Frame 4 of the news.qcif
video-stream with block size of 8 x 8 and 16 x 16 respectively
Figure 4.4.3 Part of frame (4)
Figure 4.4.4 Part of frame (5)
Figure 4.4.5 Part of Predicted frame (5) with block size of 8 x 8
Figure 4.4.6 Part of Predicted frame (5) with block size of 16 x 16
We draw the following conclusions from the results given above. We notice that the
speedup is by a factor of 3-4 for most programs with SSE. Also the 8 x 8 block programs
for both algorithms take longer to execute compared to the 16 x 16 block programs. The
reason is that the loop overhead for the programs goes up, even though the number of
addition or subtractions to be performed are the same. From the images we see that the 8
x 8 blocks perform a better job at matching than the 16 x 16 blocks. The predicted images
after motion compensation show that the 8 x 8 blocks are better suited for tracking
movement of the smaller image regions with these algorithms.
4.4 DCT
Our second candidate algorithm, which is also highly computationally intensive, is the
DCT. It is essentially a matrix multiplication of an image block by a matrix of DCT
constants. This structure can also make use of the SIMD architecture to improve
performance. In the following section we present a few suggestions that would improve
the performance of the DCT algorithm. However, we don't go into performance
comparisons of the algorithms, because we lacked the capability to add additional
functionality to the compiler tool. Below is a code snippet for the DCT algorithm.
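void DCT (int InBlock[][8], int OutBlock[][8])
{
int TempBlock[8][8], CosTrans[8][8];
/* CosTrans = CosBlock^T. This must be computed before CosTrans is used.
   CosBlock, the matrix of DCT constants, is assumed to be defined elsewhere. */
Transpose(CosBlock, CosTrans);
/*TempBlock = InBlock * CosBlock^T*/
MatrixMult(InBlock, CosTrans, TempBlock);
/*OutBlock = CosBlock * TempBlock*/
MatrixMult(CosBlock, TempBlock, OutBlock);
}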
Figure 4.4.1
This DCT code could be further enhanced using the SIMD support provided by Intel's
SSE2.
To illustrate this, let us consider a 4 x 4 matrix multiplication using the traditional
method and then using the SIMD architecture.
Figure 4.4.2: Matrix multiplication for computation of a single element. Parts (a), (b),
(c) and (d) show the steps for obtaining a single result of the matrix multiplication
The traditional method requires that the row and column elements of the two matrices
being multiplied are accessed one at a time and a multiply-accumulate (MAC) operation
is performed for each pair. For a 4 x 4 result with 4 MACs per element, this requires 64
sequential operations of fetching the elements from memory and multiply-accumulating.
With the SIMD architecture, which processes four elements per instruction, only 16
operations are required. The illustration is given below.
Figure 4.4.3: Matrix multiplication for four elements. Parts (a), (b), (c) and (d) show the
computation on the SIMD platform
Essentially, because of the wide datapath of the SSE architecture, it is possible to
concurrently perform operations that are independent of each other. The partial
products for 4 different elements of the matrix are therefore computed in parallel, which
improves performance at the cost of extra hardware.
We believe this can improve the performance of the DCT code by up to 4 times over the
original sequential code.
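The 16-operation scheme for a 4 x 4 product can be sketched with SSE intrinsics as follows. This is an illustration under stated assumptions, not the project's DCT code: the name matmul4x4 is hypothetical, single-precision floats are used because they fill a 128-bit register four at a time, and the scalar branch is only a fallback for targets without SSE2. Each inner step multiplies one element of A by a whole row of B and accumulates four partial products at once.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE_MATMUL 1
#endif

/* C = A * B for 4 x 4 matrices. */
void matmul4x4(float A[4][4], float B[4][4], float C[4][4])
{
#ifdef HAVE_SSE_MATMUL
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            /* four partial products of row i of C in one multiply-add */
            __m128 a = _mm_set1_ps(A[i][k]);
            __m128 b = _mm_loadu_ps(B[k]);
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_storeu_ps(C[i], acc);
    }
#else
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
#endif
}
```

The SIMD branch executes 4 x 4 = 16 multiply-add steps, against 64 in the scalar branch, which is the factor of 4 discussed above.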
4.4.2 Specialized hardware support for DCT on the SIMD architecture
Using the SIMD architecture of SSE2 for the implementation of the 8 point 1-D DCT
does improve performance over a simple C implementation of the 1-D DCT. A
specialized accelerator for the DCT incorporated into the SIMD unit would improve
performance further.
The motivation of this study is therefore to examine the trade-off between hardware
cost and the performance improvement obtained by using a dedicated accelerator for
the 1-D DCT. The choice of DCT accelerator was driven mainly by its capability to scale
with the SIMD architecture. We therefore implement the DCT with distributed
arithmetic, which scales easily with the SIMD architecture.
4.4.2.1 DCT Implementation on hardware:
The 2-D DCT has been recognized as one of the most cost effective techniques among
various transform coding schemes for image compression. The DCT is an orthogonal
transform, and the N x N 2-D DCT is defined as follows:
\[ X(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N} \]
where x(i,j) (i,j = 0,1,2,...,N-1) is the pixel data, X(u,v) (u,v = 0,1,2,...,N-1) is the
transformed coefficient, C(0) = 1/\sqrt{2}, and C(u) = C(v) = 1 if u,v \neq 0.
The 2-D DCT unit is comprised of two 1-D DCT units and a transpose operation. This 2-
D DCT is separated into two 1-D DCTs by the row-column decomposition technique.
The input data are fed into the first DCT unit where 1-D DCT is calculated in row order.
Then the intermediate data is transposed. Finally, the transposed data are inputted to the
second 1-D DCT unit and processed in column order.
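The recursive fast DCT algorithm is used to calculate the eight point DCT as shown:
\[ \begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} \]
\[ \begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \]
\[ A = \cos\frac{\pi}{4},\; B = \cos\frac{\pi}{8},\; C = \sin\frac{\pi}{8},\; D = \cos\frac{\pi}{16},\; E = \cos\frac{3\pi}{16},\; F = \sin\frac{3\pi}{16},\; G = \sin\frac{\pi}{16} \]
where x_i (i = 0,1,2,...,7) is the pixel data and X_u (u = 0,1,2,...,7) is the transformed
coefficient.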
Owing to this algorithm, the number of multiplications becomes half for the DCT.
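The even/odd split can be written out in C to make the saving visible: after four additions and four subtractions, each output needs only four multiplications instead of eight. The sketch below is illustrative and not the hardware design of this report: the function name dct8_even_odd is hypothetical, and the constants are the numerical values of A = cos(pi/4), B = cos(pi/8), C = sin(pi/8), D = cos(pi/16), E = cos(3pi/16), F = sin(3pi/16) and G = sin(pi/16).

```c
/* Eight point 1-D DCT, X[u] = (1/2) C(u) sum_i x[i] cos((2i+1)u*pi/16),
 * computed from the sums and differences of mirrored input pairs. */
void dct8_even_odd(const double x[8], double X[8])
{
    static const double A = 0.7071067811865476, B = 0.9238795325112867,
                        C = 0.3826834323650898, D = 0.9807852804032304,
                        E = 0.8314696123025452, F = 0.5555702330196022,
                        G = 0.1950903220161283;

    /* preprocessing: sums feed the even outputs, differences the odd ones */
    double s0 = x[0] + x[7], s1 = x[1] + x[6];
    double s2 = x[2] + x[5], s3 = x[3] + x[4];
    double d0 = x[0] - x[7], d1 = x[1] - x[6];
    double d2 = x[2] - x[5], d3 = x[3] - x[4];

    X[0] = 0.5 * (A * s0 + A * s1 + A * s2 + A * s3);
    X[2] = 0.5 * (B * s0 + C * s1 - C * s2 - B * s3);
    X[4] = 0.5 * (A * s0 - A * s1 - A * s2 + A * s3);
    X[6] = 0.5 * (C * s0 - B * s1 + B * s2 - C * s3);

    X[1] = 0.5 * (D * d0 + E * d1 + F * d2 + G * d3);
    X[3] = 0.5 * (E * d0 - G * d1 - D * d2 - F * d3);
    X[5] = 0.5 * (F * d0 - D * d1 + G * d2 + E * d3);
    X[7] = 0.5 * (G * d0 - F * d1 + E * d2 - D * d3);
}
```

This uses 32 multiplications for the eight outputs, against 64 for the direct 8 x 8 matrix-vector product.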
[Block diagram: the preprocessed inputs x0+x7, x1+x6, x2+x5, x3+x4 and x0-x7, x1-x6,
x2-x5, x3-x4 feed a bank of ROM-based DAP units over 16-bit paths; the DAP outputs are
the transformed coefficients.]
Figure 4.2.2.1: DAP structure for the SSE2 extension
Figure 4.2.2.1 shows the block diagram of the 1-D DCT processing unit on the SSE2
platform. The preprocessed sums and differences can be obtained with the parallel
addition and subtraction instructions of SSE2.
The multiplier-accumulator in the DCT core processor is designed with distributed
arithmetic. With distributed arithmetic, the parallel multipliers can be eliminated from
the core processor and the amount of hardware is greatly reduced. Furthermore, very
high speed operation can be achieved because the critical path is formed by an adder
instead of a multiplier.
Here, we illustrate the principle of distributed arithmetic (DA). Assume each input x_k
is represented in N-bit two's complement code as follows:
\[ x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \]
The multiply-accumulate in the normal way can be written as the following equation:
\[ y = \sum_{k=1}^{K} a_k x_k = \sum_{k=1}^{K} a_k \left( -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \right) \]
where a_k (k = 1,2,3,...,K) is the multiply coefficient. Based on distributed arithmetic, y
can be calculated as follows:
\[ y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} a_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} a_k (-b_{k0}) \]
The multiply operation is implemented with a ROM that stores the precalculated partial
products. Therefore, the hardware of the multiply accumulation based on the DA includes
ROM and an adder that accumulates the partial products read from ROM.
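The ROM-and-adder structure can be emulated in software to check the arithmetic. The sketch below is an assumption-laden illustration, not the Verilog design of this report: the names da_build_rom and da_dot are hypothetical, K = 4 coefficients are used, and each input is a 16-bit two's complement fraction (value = x / 2^15). The ROM holds the sum of the coefficients selected by each 4-bit address, and one ROM read plus one shift-add is done per bit position, from the LSB up to the sign bit.

```c
#include <stdint.h>

#define DA_K 4     /* number of coefficients a_k                 */
#define DA_N 16    /* word length of the two's complement inputs */

/* rom[addr] = sum of a[k] over all k whose bit is set in addr. */
void da_build_rom(const double a[DA_K], double rom[1 << DA_K])
{
    for (int addr = 0; addr < (1 << DA_K); addr++) {
        rom[addr] = 0.0;
        for (int k = 0; k < DA_K; k++)
            if (addr & (1 << k))
                rom[addr] += a[k];
    }
}

/* y = sum_k a_k * x_k with x_k = x[k] / 2^15, using only ROM reads,
 * additions and halvings (no multiplications). */
double da_dot(const double rom[1 << DA_K], const int16_t x[DA_K])
{
    double acc = 0.0;

    /* fractional bits, from the LSB (n = N-1) up to n = 1 */
    for (int pos = 0; pos < DA_N - 1; pos++) {
        int addr = 0;
        for (int k = 0; k < DA_K; k++)
            addr |= (((uint16_t)x[k] >> pos) & 1) << k;
        acc = (acc + rom[addr]) / 2.0;      /* shift-add step */
    }

    /* sign bits (n = 0) carry weight -1, so their term is subtracted */
    int sign_addr = 0;
    for (int k = 0; k < DA_K; k++)
        sign_addr |= (((uint16_t)x[k] >> (DA_N - 1)) & 1) << k;
    return acc - rom[sign_addr];
}
```

For example, with a = {0.5, -0.25, 1, 2} and inputs representing 0.5, -0.25, 0.125 and -1, the routine returns 0.25 + 0.0625 + 0.125 - 2 = -1.5625, the same value as the ordinary multiply-accumulate.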
In multiply-accumulate operations based on distributed arithmetic, the precalculated
partial products are read out from the ROMs and accumulated in a bit-wise manner from
the LSBs to the MSBs. To double the processing speed, two partial products for adjacent
bits can be read from individual ROMs at the same time. This method of calculation can
be written as the following equation:
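\[ y = \sum_{m=1}^{N/2} \left[ \sum_{k=1}^{K} a_k b_{k(2m-1)} \right] 2^{-(2m-1)} + \sum_{m=1}^{N/2-1} \left[ \sum_{k=1}^{K} a_k b_{k(2m)} \right] 2^{-2m} + \sum_{k=1}^{K} a_k (-b_{k0}) \]
(for even N; the first sum covers the odd bit positions n = 1, 3, ..., N-1 and the second
sum the even bit positions n = 2, 4, ..., N-2)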
In this case the two adjacent bits are processed simultaneously. Thus, two ROMs are
required to supply a pair of partial products for the higher and lower bits in every cycle,
and both have two banks for the two modes of the DCT. The ROM size itself was reduced
by a factor of 2^4 by using the fast algorithm in conjunction with the DA scheme. With
the present configuration of the DAP (Distributed Arithmetic Processor) structure we
would require 16 ROMs of 16 x 16 bits each. Each of the DAPs completes its operation
in 8 cycles. Hence the entire DCT is calculated in 8 cycles, because all the DAPs work in
parallel on the SIMD architecture of SSE2.
4.4.2.2 Implementation and Results
The DAP structure was implemented using Verilog and synthesis was done using the
gflx-p library. The control flow for the DAP is given below:
The DAP takes a total of 8 cycles to complete. For all the fractional binary bits we use
the shift-add method: essentially, we shift the operand right by 2 and add it every cycle, as
shown in the DAP figure.
The shift-add-complement method is used when the DAP is processing the integer
part of the binary fixed-point number. This can be further understood by looking at the
code placed in the appendix.
Synthesis Results
Clock period achieved: 1.8 ns
Area results:
Combinational Area:     6660.396
Non-Combinational Area: 1128.06
Interconnect Net Area:  1211.4841
Total Cell Area:        7788.476
Total Area:             8999.942
5 Conclusion
SIMD extensions to superscalar architectures have helped improve the performance
of general purpose processors on media applications. We took the motion estimation
algorithm and optimized it to make use of the SIMD architecture offered by today's
processors. There was a considerable performance improvement with the use of the
wide datapath. We also explored improving the performance of the DCT, both by using the
existing ISA and by employing dedicated hardware for the DCT implementation. The
performance analysis of this extension to the ISA and its hardware trade-off remains
an agenda for future work.
APPENDIX
A.1 Program for the Full Search Motion Estimation Algorithm with simple C for 16 x 16 image blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
int blockdiff(int, int, int, int);

int x1, x2, y11, y2, i,j,k, mindiff[12][10], p = 8;
int far diff[12][10][257];
int sort[257], temp,point, l, motionx, motiony, col ,row, x, y;
long peldiff = 0;
float amad = 0.0;
time_t first, second;
void main()
{
int i1,j1,k1;
FILE *fpold,*fpnew;
first = time(NULL);
for(j=0; j<9 ;j++)

{
for(i=0; i<11 ;i++)
{
x1 = 16*i; //START OF BLOCKS IN REFERENCE IMAGE
y11 = 16*j;
for(row = 0; row < 2*p ; row++)

{
for(col = 0; col < 2*p; col++)
{
x2 = x1 - p + col;
y2 = y11 - p + row;
diff[i][j][(16*row) + col] = blockdiff(x1, x2, y11,
y2);
}
}
//------------------------------------------------------------------------
//SUBPROGRAM FOR SORTING THE DIFFERENCES
//-----------------------------------------------------------------------
for(k=0; k<256 ; k++)

{
sort[k] = diff[i][j][k];
// printf("\tk%d diff%d", k, diff[i][j][k]);
}
for(k=0; k<256 ;k++)

{
for(l=0; l<255 ; l++) //STOP AT 255 SO THAT sort[l+1] STAYS WITHIN THE 256 COPIED VALUES
{
if(sort[l] < sort[l+1])
{
temp = sort[l];
sort[l] = sort[l+1];
sort[l+1] = temp;
}
}
}
mindiff[i][j] = sort[255];
// printf("\nmindiff=%d", mindiff[i][j]);
// printf("\nsort=%d", sort[255]);
// getch();
for(k=0; k<256; k++) //CHECK ALL 256 CANDIDATE POSITIONS

{
if(diff[i][j][k] == sort[255])
{
l = k;
}
}
x = l % 16;
y = (l - x)/16;
motionx = x - 8;
motiony = y - 8;
// printf("\t %d %d %d %d", x1, y11, motionx , motiony);

// printf("\n\t%d %d",(j*11)+i, mindiff[i][j]);
//----------------------------------------------------------------------
// SUBPROGRAM TO OVERWRITE OLD IMAGE AND PRODUCE SHIFTED IMAGE
//----------------------------------------------------------------------
/* fpold = fopen("C:\\ECE734\\f4.raw","rb");
fpnew = fopen("C:\\ECE734\\f5recon.raw","r+b");
if(motionx != 0 || motiony != 0)
{
//SKIP PIXELS UPTO INITIAL POINT
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP
PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
//COPY REQUIRED PIXELS FROM BLOCK 1
for(i1 = 0; i1<16; i1++)

{
for(j1 =0; j1<16; j1++)//FOR LOOP FOR
WRITING REQUIRED PIXELS IN NEW IMAGE
{
point = fgetc(fpold);
fputc(point, fpnew);
}
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold); */
amad = amad + sqrt(motionx*motionx + motiony*motiony);
peldiff = peldiff + mindiff[i][j];
}
}
amad = amad/99;
printf("\nAMAD = %f", amad);
printf("\nPixel Difference %ld", peldiff/99);
second=time(NULL);
printf("\nDifference in time %ld", second - first);
getch();
}
//--------------------------------------------------------------------------
// FUNCTION BLOCKDIFF
//--------------------------------------------------------------------------
int blockdiff(int x1, int x2, int y11, int y2)
{
int block1[16][16], block2[16][16],i1,j1,k1,ch;
int diff1[16][16], totaldiff = 0;
FILE *fp1, *fp2;
//DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
if(x2 < 0 || y2 < 0 || x2 >160 || y2 >128)

{
totaldiff = 10000;
}
else
{
fp1 = fopen("C:\\ECE734\\f4.raw", "rb");
if(fp1 == NULL || fp2 == NULL)

{
printf("File cannot be opened");
getch();
exit(0);
}
for(i1=0; i1<16 ; i1++)

{
for(j1=0; j1<16 ;j1++)
{
diff1[i1][j1] = 0;
block1[i1][j1]= 0;
block2[i1][j1]= 0;
}
}
for(i1=0; i1<(176 * y11) + x1; i1++)//SKIP PIXELS UPTO INTIAL POINT

ch = fgetc(fp1);
for(i1 = 0; i1<16; i1++)

{
for(j1 =0; j1<16; j1++)
{
block1[i1][j1] = fgetc(fp1);
}
for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
{
ch = fgetc(fp1);
}
}
//BLOCK COPIED FROM SECOND FRAME

for(i1=0; i1<(176*y2) + x2; i1++)//SKIP PIXELS UPTO INTIAL POINT
ch = fgetc(fp2);

for(i1 = 0; i1<16; i1++)
{
for(j1 =0; j1<16; j1++)
{ block2[i1][j1] = fgetc(fp2);}
for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
{ch = fgetc(fp2);}
}
for(i1=0; i1<16 ; i1++)

{
for(j1=0; j1<16 ;j1++)
{
diff1[i1][j1] = block2[i1][j1] - block1[i1][j1];
diff1[i1][j1] = abs(diff1[i1][j1]);
totaldiff = totaldiff + diff1[i1][j1];
}
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.2 Full Search Program with SSE 2 intrinsics for 16 x 16 blocks

//-------------------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH WITH SSE2
//-------------------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<xmmintrin.h>
#include<sse2mmx.h>
//#include <iostream.h>
#include <mmsystem.h>
#include <windows.h>

int diff[12][10][257];
long peldiff = 0;
float amad;
clock_t first, second;
FILE *in1,*in2;
void main()
{
int i1,j1,k1;
FILE *fpold,*fpnew;
int diff1,diff,diffx,diffy;
amad =0.0;
DWORD start, finish, duration;
//first = clock();
start = timeGetTime();
printf("START TIME!! %ld\n", start);
diff = 1000000;
diffy =0;
diffx =0;
//j and i move the reference blocks
for(j=0; j<9 ;j++)//j<9 9*16 = 144
{
for(i=0; i<11 ;i++)//i<11 16*11 = 176
{
y11 = 16*j;
for(row = 0; row<2*p; row++)

{
for(col = 0; col<2*p; col++)
{
x2 = x1 - p + col;
y22 = y11 - p + row;
diff1 = blockdiff(x1, x2, y11, y22);
if (diff > diff1)
{
diff = diff1;
diffx = x2;
diffy = y22;
}//else discard diff1
}
}
motionx = diffx - x1;

motiony = diffy - y11;
//----------------------------------------------------------------------
//----------------------------------------------------------------------
fpold = fopen("C:\\ECE734\\f4.raw","rb");
{
fseek(fpold,(176 * y11) + x1, 0);
fseek(fpnew,(176 * (y11 + motiony)) + x1 + motionx, 0);//SKIP PIXELS USING MOTION VECTOR FOR THE NEW IMAGE
for(i1 = 0; i1<16; i1++)

{
for(j1 =0; j1<16; j1++)//FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
{
}
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold);
// amad = 99.0f ;// 1.0;((float)motionx* motionx) + ((float)motiony*motiony);
peldiff = peldiff + diff;
}
}
//amad = amad/99;
// printf("\nAMAD = %f",amad);
second = clock();
printf("\nDifference in time %ld", second);
getch();
}
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------

{
FILE *fp1, *fp2;
__m128i *b1,*b2,m1;
union mmx
{
__m128i m;
short int x[8];
}m;
if(x2 < 0 || y22 < 0 || x2 >160 || y22 >128)

{
totaldiff = 10000;
}
else
{
{
getch();
exit(0);
}
//skip to the intial point and get the result.

for(i = 0 ;i<16;i++)
{
offset1 = (176*(y11+i) + x1);
offset2 = (176*(y22+i) + x2);
fseek(fp1,offset1,SEEK_SET);
fread(block1,1,16,fp1);
//for(i=0;i<16;i++)
//printf("%c \n",block1[i]);

b1 = (__m128i*)block1;
b2 = (__m128i*)block2;
//SAD for 16 bytes.

m.m = m1;
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
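The sum-of-absolute-differences step that the SSE2 version of blockdiff performs on each 16-byte row can be sketched as follows. This is a hedged illustration, not the code used for the measurements in this report: it assumes both frames are already held in memory rather than read from file, it uses the standard emmintrin.h intrinsics with unaligned loads, and it falls back to plain C when SSE2 is not available. The names sad16x16 and SAD_USE_SSE2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define SAD_USE_SSE2 1
#else
#define SAD_USE_SSE2 0
#endif

/* Sum of absolute differences between two 16 x 16 blocks of 8-bit pixels.
   `stride` is the frame width in bytes (176 for the frames in this report). */
static unsigned sad16x16(const uint8_t *ref, const uint8_t *cand, int stride)
{
    unsigned total = 0;
    assert(ref != NULL && cand != NULL && stride >= 16);
    for (int row = 0; row < 16; row++) {
#if SAD_USE_SSE2
        /* Unaligned loads: a block row need not start on a 16-byte boundary. */
        __m128i a = _mm_loadu_si128((const __m128i *)(ref + row * stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(cand + row * stride));
        /* PSADBW yields two partial sums, one in each 64-bit half. */
        __m128i s = _mm_sad_epu8(a, b);
        total += (unsigned)_mm_cvtsi128_si32(s)
               + (unsigned)_mm_extract_epi16(s, 4);
#else
        /* Portable fallback: one pixel at a time. */
        for (int col = 0; col < 16; col++)
            total += (unsigned)abs(ref[row * stride + col]
                                   - cand[row * stride + col]);
#endif
    }
    return total;
}
```

With the SSE2 path, one instruction replaces the 16 subtract, absolute-value and add operations of a block row, which is where the wide datapath pays off in the full search.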
A.3 Three Step Search Program with simple C for 16 x 16 image blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT 3 STEP SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

int x1, x2, y11, y22, diff[12][10][10],i,j,p = 16,k, mindiff[12][10];
int sort[10], temp,l, motionx, motiony;
double amad = 0.0, dist =0.0;
long peldiff = 0;
int locx, locy, point;
void main()
{
FILE *fpold, *fpnew;
int i1,j1,k1,ch,c;
first = time(NULL);
for(j=0; j<9 ;j++)

{
for(i=0; i<11 ;i++)
{
y11 = 16*j;
locx = x1;
locy = y11;
p = 16;
while(p >= 1)
{
//ALGORITHM FOR 3 STEP SEARCH
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
diff[i][j][0] = blockdiff(x1, x2, y11, y22);
//FOR POINT NO. 1

x2 = locx;
y22 = locy - p;
//FOR POINT NO. 2

x2 = locx + p;
y22 = locy - p;
//FOR POINT NO. 3

x2 = locx - p;
y22 = locy;
//FOR POINT NO. 4

x2 = locx;
y22 = locy;
//FOR POINT NO. 5

x2 = locx + p;
y22 = locy;
//FOR POINT NO. 6

x2 = locx - p;
y22 = locy + p;
//FOR POINT NO. 7

x2 = locx;
y22 = locy + p;
//FOR POINT NO. 8

x2 = locx + p;
y22 = locy + p;
//------------------------------------------------------------------------
//-----------------------------------------------------------------------
for(k=0; k<9 ; k++)

{
// printf("\nk=%d diff=%d", k, diff[i][j][k]);
}
for(k=0; k<9 ;k++)

{
for(l=0; l<9 ; l++)
{
{
temp = sort[l];
sort[l+1] = temp;
}
}
}
for(k=0; k<9; k++)

{
{
l = k;
}
}
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.
motionx = locx - x1;

motiony = locy - y11;
// printf("\n\tx1 %d y11 %d \n \tlocx %d locy %d \n\tmotion vector x=%d y=%d",x1,y11, locx, locy, motionx , motiony);
// getch();
dist = sqrt( motionx*motionx + motiony*motiony);

amad = amad + dist;
//*
//----------------------------------------------------------------------
//----------------------------------------------------------------------
{
fseek(fpold,(176 * y11) + x1, 0);
for(i1 = 0; i1<16; i1++)

{
{
}
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold);//*/
}
}
second = time(NULL);
printf("Time taken %ld", second - first);//TO FIND CPU TIME
printf("\nAMAD=%lf", amad/99);
getch();
}
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------

{

FILE *fp1, *fp2;
if(x2 < 0 || y22 < 0 || x2 >160 || y22 >128)

{
totaldiff = 10000;
}
else
{
if(fp2 == NULL)
{
printf("File 2 cannot be opened");
getch();
exit(0);
}
if(fp1 == NULL)
{
getch();
exit(0);
}
for(i1=0; i1<16 ; i1++)

{
for(j1=0; j1<16 ;j1++)
{
diff1[i1][j1] = 0;
block1[i1][j1]= 0;
block2[i1][j1]= 0;
}
}

ch = fgetc(fp1);
for(i1 = 0; i1<16; i1++)

{
for(j1 =0; j1<16; j1++)
{
}
for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
{
ch = fgetc(fp1);
}
}

ch = fgetc(fp2);

for(i1 = 0; i1<16; i1++)
{
for(j1 =0; j1<16; j1++)
for(k1=0; k1< 160; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
{ch = fgetc(fp2);}
}
for(i1=0; i1<16 ; i1++)

{
for(j1=0; j1<16 ;j1++)
{
}
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
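The search pattern of the listing above (nine points on a 3 x 3 grid, move to the best point, halve the step) can be summarized in a compact form. The following is a minimal sketch under stated assumptions rather than the program that was timed: both frames are assumed to be in memory as 176 x 144 arrays of 8-bit pixels, the initial step is passed as a parameter, and the names tss_cost, tss_search and the TSS_ constants are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TSS_W 176  /* frame width used throughout the appendix */
#define TSS_H 144  /* frame height: 9 rows of 16 x 16 blocks */
#define TSS_B 16   /* block size */

/* SAD between the block at (x1,y1) in `cur` and the block at (x2,y2) in `ref`.
   Candidates that fall outside the frame receive a very large cost. */
static long tss_cost(const uint8_t *cur, const uint8_t *ref,
                     int x1, int y1, int x2, int y2)
{
    long sad = 0;
    if (x2 < 0 || y2 < 0 || x2 > TSS_W - TSS_B || y2 > TSS_H - TSS_B)
        return LONG_MAX;
    for (int r = 0; r < TSS_B; r++)
        for (int c = 0; c < TSS_B; c++)
            sad += abs(cur[(y1 + r) * TSS_W + x1 + c]
                       - ref[(y2 + r) * TSS_W + x2 + c]);
    return sad;
}

/* Test the 3 x 3 grid of points spaced p apart around the current best
   location, move to the cheapest point, halve p, and repeat until p < 1. */
static void tss_search(const uint8_t *cur, const uint8_t *ref,
                       int x1, int y1, int start_step, int *mvx, int *mvy)
{
    int locx = x1, locy = y1;
    assert(mvx != NULL && mvy != NULL);
    for (int p = start_step; p >= 1; p /= 2) {
        int bestx = locx, besty = locy;
        long best = tss_cost(cur, ref, x1, y1, locx, locy);
        for (int dy = -p; dy <= p; dy += p)
            for (int dx = -p; dx <= p; dx += p) {
                long c = tss_cost(cur, ref, x1, y1, locx + dx, locy + dy);
                if (c < best) {
                    best = c;
                    bestx = locx + dx;
                    besty = locy + dy;
                }
            }
        locx = bestx;
        locy = besty;
    }
    *mvx = locx - x1;  /* motion vector, as in the listing above */
    *mvy = locy - y1;
}
```

Tracking the running minimum in this way avoids the sorting pass over the nine costs, and each step evaluates only nine candidates instead of the 256 examined per block by the full search.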
A.4 Three Step Search Program with SSE 2 intrinsics for 16 x 16 blocks
//------------------------------------------------------------------------
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<sse2mmx.h>

long peldiff = 0;
void main()
{
int i1,j1,k1,ch,c;
first = time(NULL);
for(j=0; j<9 ;j++)

{
for(i=0; i<11 ;i++)
{
y11 = 16*j;
locx = x1;
locy = y11;
p = 16;
while(p >= 1)
{
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
//FOR POINT NO. 1

x2 = locx;
y22 = locy - p;
//FOR POINT NO. 2

x2 = locx + p;
y22 = locy - p;
//FOR POINT NO. 3

x2 = locx - p;
y22 = locy;
//FOR POINT NO. 4

x2 = locx;
y22 = locy;
//FOR POINT NO. 5

x2 = locx + p;
y22 = locy;
//FOR POINT NO. 6

x2 = locx - p;
y22 = locy + p;
//FOR POINT NO. 7

x2 = locx;
y22 = locy + p;
//FOR POINT NO. 8

x2 = locx + p;
y22 = locy + p;
//------------------------------------------------------------------------
//-----------------------------------------------------------------------
for(k=0; k<9 ; k++)

{
}
for(k=0; k<9 ;k++)

{
for(l=0; l<9 ; l++)
{
{
temp = sort[l];
sort[l+1] = temp;
}
}
}
for(k=0; k<9; k++)

{
{
l = k;
}
}
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.

// getch();

amad = amad + dist;
//*
//----------------------------------------------------------------------
//----------------------------------------------------------------------
{
fseek(fpold,(176 * y11) + x1, 0);
for(i1 = 0; i1<16; i1++)

{
{
}
fseek(fpold,160,1);
fseek(fpnew,160,1);
}
}
fclose(fpnew);
fclose(fpold);//*/
}
}
printf("\nAMAD=%lf", amad/99);
getch();
}
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------

{
FILE *fp1, *fp2;
__m128i *b1,*b2,m1;
union mmx
{
__m128i m;
short int x[8];
}m;
if(x2 < 0 || y22 < 0 || x2 >160 || y22 >128)

{
totaldiff = 10000;
}
else
{

{
getch();
exit(0);
}

for(i = 0 ;i<16;i++)
{
offset1 = (176*(y11+i) + x1);
offset2 = (176*(y22+i) + x2);
//for(i=0;i<16;i++)
//printf("%c \n",block1[i]);

b1 = (__m128i*)block1;
b2 = (__m128i*)block2;
//SAD for 16 bytes.

m.m = m1;
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.5 Full Search with simple C for 8 x 8 blocks
//------------------------------------------------------------------------
// PROGRAM TO IMPLEMENT FULL SEARCH
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

//int diff[22][18][256];
long peldiff = 0;
float amad = 0.0;
int diff1,diff,diffx,diffy;
void main()
{
int i1,j1,k1;
FILE *fpold,*fpnew;
first = time(NULL);
for(j=0; j<18 ;j++)

{
for(i=0; i<22 ;i++)
{
y11 = 8*j;
diff = 1000000;
//defines search space

for(row = 0; row < 2*p ; row++)
{
//printf("\n");
for(col = 0; col < 2*p; col++)
{
x2 = x1 - p + col;
y22 = y11 - p + row;
diff1 = blockdiff(x1, x2, y11, y22);
if(diff > diff1)
{
diff = diff1;
diffx = x2;
diffy = y22;
}
//printf("diff %d ",diff[i][j][(16*row) + col]);
//if(diff[i][j][(16*row) + col] == 0)
//{printf(" i %d j %d row %d col %d",i,j,row,col);
//getchar();
}
}
//}
//------------------------------------------------------------------------
//-----------------------------------------------------------------------
/*
for(k=0; k<256 ; k++)
{
// printf("\tk%d diff%d", k, diff[i][j][k]);
}
for(k=0; k<256 ;k++)

{
for(l=0; l<256 ; l++)
{
{
temp = sort[l];
sort[l+1] = temp;
}
}
}
// printf("\nsort=%d", sort[255]);
// getch();
for(k=0; k<256; k++)
{
{
l = k;
}
}
x = l % 16;
y = (l - x)/16;
*/
motionx = diffx -x1;
motiony = diffy -y11;
// printf("\t %d %d %d %d", x1, y11, motionx , motiony);

//----------------------------------------------------------------------
//----------------------------------------------------------------------
{
fseek(fpold,(176 * y11) + x1, 0);
for(i1 = 0; i1<8; i1++)

{
for(j1 =0; j1<8; j1++)//FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
{
}
fseek(fpold,168,1);
fseek(fpnew,168,1);
}
}
fclose(fpnew);
fclose(fpold);
//amad = amad + sqrt(motionx*motionx + motiony*motiony);
}
}
amad = amad/(22*18); //22 x 18 BLOCKS OF 8 x 8 PIXELS PER FRAME
printf("\nAMAD = %f", amad);
second=time(NULL);
printf("\nDifference in time %ld", second - first);
getch();
}
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------

{
int diff1[8][8];
register int totaldiff = 0;
FILE *fp1,*fp2;
if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136)

{
totaldiff = 10000;
}
else
{

{
getch();
exit(0);
}
for(i1=0; i1<8 ; i1++)

{
for(j1=0; j1<8 ;j1++)
{
diff1[i1][j1] = 0;
block1[i1][j1]= 0;
block2[i1][j1]= 0;
}
}

ch = fgetc(fp1);
for(i1 = 0; i1<8; i1++)

{
for(j1 =0; j1<8; j1++)
{
}
for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
{
ch = fgetc(fp1);
}
}

ch = fgetc(fp2);

for(i1 = 0; i1<8; i1++)
{
for(j1 =0; j1<8; j1++)
{ch = fgetc(fp2);}
}
for(i1=0; i1<8 ; i1++)

{
for(j1=0; j1<8 ;j1++)
{
}
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.7 Three Step Search with simple C for 8 x 8 blocks
//------------------------------------------------------------------------
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>

long peldiff = 0;
void main()
{
int i1,j1,k1,ch,c;
//clrscr();
first = time(NULL);
for(j=0; j<18 ;j++)

{
for(i=0; i<22 ;i++)
{
y11 = 8*j;
locx = x1;
locy = y11;
p = 8;
while(p >= 1)
{
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
//FOR POINT NO. 1

x2 = locx;
y22 = locy - p;
//FOR POINT NO. 2

x2 = locx + p;
y22 = locy - p;
//FOR POINT NO. 3

x2 = locx - p;
y22 = locy;
//FOR POINT NO. 4

x2 = locx;
y22 = locy;
//FOR POINT NO. 5

x2 = locx + p;
y22 = locy;
//FOR POINT NO. 6

x2 = locx - p;
y22 = locy + p;
//FOR POINT NO. 7

x2 = locx;
y22 = locy + p;
//FOR POINT NO. 8

x2 = locx + p;
y22 = locy + p;
//------------------------------------------------------------------------
//-----------------------------------------------------------------------
for(k=0; k<9 ; k++)

{
}
for(k=0; k<9 ;k++)

{
for(l=0; l<9 ; l++)
{
{
temp = sort[l];
sort[l+1] = temp;
}
}
}
for(k=0; k<9; k++)

{
{
l = k;
}
}
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.

// getch();

amad = amad + dist;
///*
//----------------------------------------------------------------------
//----------------------------------------------------------------------
{
fseek(fpold,(176 * y11) + x1, 0);
for(i1 = 0; i1<8 ;i1++)

{
for(j1 =0; j1<8; j1++)//FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
{
}
fseek(fpold,168,1);
fseek(fpnew,168,1);
}
}
fclose(fpnew);
fclose(fpold);//*/
}
}
printf("\nAMAD=%lf", amad/(22*18));
printf("\nPixel Difference %ld", peldiff/(22*18));
getch();
}
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------

{

FILE *fp1, *fp2;
if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136)

{
totaldiff = 10000;
}
else
{
if(fp2 == NULL)
{
getch();
exit(0);
}
if(fp1 == NULL)
{
getch();
exit(0);
}
for(i1=0; i1<8 ; i1++)

{
for(j1=0; j1<8 ;j1++)
{
diff1[i1][j1] = 0;
block1[i1][j1]= 0;
block2[i1][j1]= 0;
}
}
// for(i1=0; i1<(176 * y11) + x1; i1++)//SKIP PIXELS UPTO INTIAL POINT

// ch = fgetc(fp1);
fseek(fp1,(176 * y11) + x1,0);
for(i1 = 0; i1<8; i1++)

{
for(j1 =0; j1<8; j1++)
{
}
for(k1=0; k1< 168; k1++) //SKIP PIXELS FROM SAME LINE NOT NEEDED
{
ch = fgetc(fp1);
}
}

ch = fgetc(fp2);
for(i1 = 0; i1<8; i1++)
{
for(j1 =0; j1<8; j1++)
{ch = fgetc(fp2);}
}
for(i1=0; i1<8 ; i1++)

{
for(j1=0; j1<8 ;j1++)
{
}
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
A.8 Three Step Search Program with SSE 2 for 8 x 8 blocks
//------------------------------------------------------------------------
//------------------------------------------------------------------------
#include<stdio.h>
#include<conio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<sse2mmx.h>

long peldiff = 0;
void main()
{
int i1,j1,k1,ch,c;
first = time(NULL);
for(j=0; j<18 ;j++)

{
for(i=0; i<22 ;i++)
{
y11 = 8*j;
locx = x1;
locy = y11;
p = 8;
while(p >= 1)
{
//FOR POINT NO. 0
x2 = locx - p;
y22 = locy - p;
//FOR POINT NO. 1

x2 = locx;
y22 = locy - p;
//FOR POINT NO. 2

x2 = locx + p;
y22 = locy - p;
//FOR POINT NO. 3

x2 = locx - p;
y22 = locy;
//FOR POINT NO. 4
x2 = locx;
y22 = locy;
//FOR POINT NO. 5

x2 = locx + p;
y22 = locy;
//FOR POINT NO. 6

x2 = locx - p;
y22 = locy + p;
//FOR POINT NO. 7

x2 = locx;
y22 = locy + p;
//FOR POINT NO. 8

x2 = locx + p;
y22 = locy + p;
//------------------------------------------------------------------------
//-----------------------------------------------------------------------
for(k=0; k<9 ; k++)

{
}
for(k=0; k<9 ;k++)

{
for(l=0; l<9 ; l++)
{
{
temp = sort[l];
sort[l+1] = temp;
}
}
}
for(k=0; k<9; k++)

{
{
l = k;
}
}
if(l==0)
{
locx = locx - p;
locy = locy - p;
}
if(l==1)
{
locx = locx;
locy = locy - p;
}
if(l==2)
{
locx = locx + p;
locy = locy - p;
}
if(l==3)
{
locx = locx - p;
locy = locy;
}
if(l==4)
{
locx = locx;
locy = locy;
}
if(l==5)
{
locx = locx + p;
locy = locy;
}
if(l==6)
{
locx = locx - p;
locy = locy + p;
}
if(l==7)
{
locx = locx;
locy = locy + p;
}
if(l==8)
{
locx = locx + p;
locy = locy + p;
}
p = p/2;
} //while loop end.

// getch();

amad = amad + dist;
//----------------------------------------------------------------------
//----------------------------------------------------------------------
{
fseek(fpold,(176 * y11) + x1, 0);
for(i1 = 0; i1<8 ;i1++)

{
for(j1 =0; j1<8; j1++)//FOR LOOP FOR WRITING REQUIRED PIXELS IN NEW IMAGE
{
}
fseek(fpold,168,1);
fseek(fpnew,168,1);
}
}
fclose(fpnew);
fclose(fpold);//*/
}
}
printf("\nAMAD=%lf", amad/(22*18));
printf("\nPixel Difference %ld", peldiff/(22*18));
getch();
}
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------

{
FILE *fp1, *fp2;
__m64 *b1,*b2,m1;
union mmx
{
__m64 m;
int x[2];
}m;
//DISCARDING LOCATIONS NEAR THE BOTTOM AND RIGHT END OF FRAME
if(x2 < 0 || y22 < 0 || x2 >168 || y22 >136)

{
totaldiff = 100000;
}
else
{

{
getch();
exit(0);
}

for(i1 = 0 ;i1<8;i1++)
{
offset1 = (176*(y11 + i1) + x1);
offset2 = (176*(y22 + i1) + x2);

b1 = (__m64*)block1;
b2 = (__m64*)block2;
//SAD for 8 bytes (one row of the 8 x 8 block).

m1 = _m_psadbw(*b1,*b2);
m.m = m1;
totaldiff = totaldiff + m.x[0];
}
fclose(fp1);
fclose(fp2);
}// else loop end
return(totaldiff);
}
/*****************************************************************************
* File Name : DCTmatrix.c
*
* Comment : This file will help produce ROM values stored to compute DCT
* using distributed arithmetic.
*
* Project :ECE734
* Author :Shamik Valia
*
****************************************************************************/
/* Arguments for the execution

* foo precision
* command line
* gcc -g DCTmatrix.c -o DCTmatrix -lm
*/
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
#include<string.h> /* strlen is used in dectobinary and complement */
#define FOR_HARDWARE
//#define FOR_SIMULATOR
int main(int argc,char *argv[])

{
FILE *R1,*R2,*R3,*R4,*R5,*R6,*R7,*R8 ;
double A,B,C,D,E,F,G,v1;
int i,j,k,l;
//char *C;
char c[atoi(argv[1])+1] ;
void dectobinary(double ,int, char[]);
void complement(char[]);
// printf("am here please print");
A = cos(M_PI/4);
B = cos(M_PI/8);
C = sin(M_PI/8);
D = cos(M_PI/16);
E = cos(3*M_PI/16);
F = sin(3*M_PI/16);
G = sin(M_PI/16);
printf("PI = %lf \n",M_PI);

printf("A = %lf \n",A);
printf("B = %lf \n",B);
printf("C = %lf \n",C);
printf("D = %lf \n",D);
printf("E = %lf \n",E);
printf("F = %lf \n",F);
printf("G = %lf \n",G);
R1 = fopen("dct_ROM1.dat","w");
for(i=0;i<=1;i++){
for(j=0;j<=1;j++){
for(k=0;k<=1;k++) {
for(l=0;l<=1;l++){
#ifdef FOR_SIMULATOR
v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
fprintf(R1,"sequence %d%d%d%d value = %lf \n",i,j,k,l,v1);
v1=0.5*((B*i)+(C*j)-(C*k)-(B*l));
v1=0.5*((A*i)-(A*j)-(A*k)+(A*l));
v1=0.5*((C*i)-(B*j)+(B*k)-(C*l));
fprintf(R4,"sequence %d%d%d%d value = %lf \n ",i,j,k,l,v1);
v1=0.5*((D*i)+(E*j)+(F*k)+(G*l));
v1=0.5*((E*i)-(G*j)-(D*k)-(F*l));
v1=0.5*((F*i)-(D*j)+(G*k)+(E*l));
v1=0.5*((G*i)-(F*j)+(E*k)-(D*l));
#endif
#ifdef FOR_HARDWARE
v1=0.5*((A*i)+(A*j)+(A*k)+(A*l));
dectobinary(v1,atoi(argv[1]),c);
fprintf(R1,"4'b%d%d%d%d : out = %s'b%s ;\n",i,j,k,l,argv[1],c);
v1=0.5*((B*i)+(C*j)-(C*k)-(B*l));
v1=0.5*((A*i)-(A*j)-(A*k)+(A*l));
v1=0.5*((C*i)-(B*j)+(B*k)-(C*l));
v1=0.5*((D*i)+(E*j)+(F*k)+(G*l));
v1=0.5*((E*i)-(G*j)-(D*k)-(F*l));
v1=0.5*((F*i)-(D*j)+(G*k)+(E*l));
v1=0.5*((G*i)-(F*j)+(E*k)-(D*l));
#endif
}
}
}
}
fclose (R1);
fclose (R2);
fclose (R3);
fclose (R4);
fclose (R5);
fclose (R6);
fclose (R7);
fclose (R8);
return 0;
}
void dectobinary(double no ,int precision,char c[]) {
//double no;
///int precision ;
//given the nos before the decimal point it will give binary representation
float fixed ;
int decimal ;
int ans = 0, x ,i=0,two_complement = 0;
void complement(char[]);
//char C[precision];
//printf("i am inside");
if (no<0){
two_complement = 1;
no = - no ;
}
decimal = (int)no ;
fixed = no-decimal;
while(decimal/2 !=0)
{
x = decimal % 2 ;
ans = ans + x*pow(10,i);
i++;
decimal = decimal / 2 ;
}
x= decimal%2;
ans = ans + x*pow(10,i);
sprintf(c,"%d",ans);
if(strlen(c)!=2)
{
if(strlen(c)>2) fprintf(stderr,"error at decimal..more than 2 places");
c[1] = c[0];
c[0] = '0';
c[2]='\0';
}
// printf("stringlenth is %d",strlen(c));
//given the nos after decimal point , gives the binary.
for(i=strlen(c);i<precision;i++)
{
fixed = 2*fixed ;
if(fixed>=1.0) {
fixed = fixed-1.0 ;
c[i] = '1';
}
else c[i] ='0';
}
c[i]='\0';
if(two_complement==1){
complement(c);
}
// return C ;
}
void complement(char c[]){
int i = strlen(c);
int flag = 0;
while(i!=-1){
if (flag ==0){
if(c[i]=='1')
flag = 1;
i--;
}else //flag ==0 ends
{ //flag ==1
if(c[i]=='1')
c[i]='0';
else c[i]='1';
i--;
} //flag==1 ends
} //while ends
}
/* Real dct implementation.
x0 = x0 + x7 ;
x1 = x1 + x6 ;
x2 = x2 + x5 ;
x3 = x3 + x4 ;
x4 = x0 - x7 ; (the differences use the original inputs)
x5 = x1 - x6 ;
x6 = x2 - x5 ;
x7 = x3 - x4 ;
X0 = A(x0) + A(x1) + A(x2) + A(x3);

X2 = B(x0) + C(x1) - C(x2) - B(x3);
X4 = A(x0) - A(x1) - A(x2) + A(x3);
X6 = C(x0) - B(x1) + B(x2) - C(x3);
X1 = D(x4) + E(x5) + F(x6) + G(x7);

X3 = E(x4) - G(x5) - D(x6) - F(x7);
X5 = F(x4) - D(x5) + G(x6) + E(x7);
X7 = G(x4) - F(x5) + E(x6) - D(x7);
*************************************************/
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
int main(void)
{
short A,B,C,D,E,F,G;
short int X0,X1,X2,X3,X4,X5,X6,X7 ;
short int x0,x1,x2,x3,x4,x5,x6,x7;
double X0_,X1_,X2_,X3_,X4_,X5_,X6_,X7_;
int Add_SS2(int ,int);
int n1,n2,n3;
/*
A = cos(M_PI/4);
B = cos(M_PI/8);
C = sin(M_PI/8);
D = cos(M_PI/16);
E = cos(3*M_PI/16);
F = sin(3*M_PI/16);
G = sin(M_PI/16);
*/
A =23170;
B=30273;
C=12539;
D=32138;
E=27245;
F=18204;
G=6392;
printf("input x0 : ");
scanf("%hd",&x0);
printf("\ninput x1 : ");
scanf("%hd",&x1);
scanf("%hd",&x2);
scanf("%hd",&x3);
scanf("%hd",&x4);
scanf("%hd",&x5);
scanf("%hd",&x6);
scanf("%hd",&x7);
/*
X0 = (short int)(((A*x0) + (A*x1) + (A*x2) + (A*x3))>>1);
X2 = (short int)(((B*x0) + (C*x1) - (C*x2) - (B*x3))>>1);
X4 = (short int)(((A*x0) - (A*x1) - (A*x2) + (A*x3))>>1);
X6 = (short int)(((C*x0) - (B*x1) + (B*x2) - (C*x3))>>1);
X1 = (short int)(((D*x4) + (E*x5) + (F*x6) + (G*x7))>>1);

X3 = (short int)(((E*x4) - (G*x5) - (D*x6) - (F*x7))>>1);
X5 = (short int)(((F*x4) - (D*x5) + (G*x6) + (E*x7))>>1);
X7 = (short int)(((G*x4) - (F*x5) + (E*x6) - (D*x7))>>1);
*/
n1 = Add_SS2((A*x0) ,(A*x1));
n2 = Add_SS2((A*x2) ,(A*x3));
n3 = Add_SS2(n1,n2);
X0 = (short int)(n3>>17);
n1 = Add_SS2((B*x0) ,- (C*x2));
n2 = Add_SS2((C*x1) ,- (B*x3));
// X2=X2+ (((B*x_[0]) + (C*x_[1]) - (C*x_[2]) - (B*x_[3]))/(power(2,15-i)));
n1 = Add_SS2((A*x0) , - (A*x1));
n2 = Add_SS2((A*x3),- (A*x2) );
//X4=X4+ (((A*x_[0]) - (A*x_[1]) - (A*x_[2]) + (A*x_[3]))/(power(2,15-i)));

n1 = Add_SS2((C*x0), - (B*x1));
n2 = Add_SS2((B*x2),- (C*x3) );
//X6=X6+ (((C*x_[0]) - (B*x_[1]) + (B*x_[2]) - (C*x_[3]))/(power(2,15-i)));

n1 = Add_SS2((D*x4),(E*x5));
n2 = Add_SS2( (F*x6), (G*x7) );
//X1=X1+ (((D*x_[4]) + (E*x_[5]) + (F*x_[6]) + (G*x_[7]))/(power(2,15-i)));

n1 = Add_SS2((E*x4),- (G*x5));
n2 = Add_SS2(- (D*x6), - (F*x7) );
// X3=X3+ (((E*x_[4]) - (G*x_[5]) - (D*x_[6]) - (F*x_[7]))/(power(2,15-i)));
n1 = Add_SS2((F*x4), - (D*x5));
n2 = Add_SS2((G*x6), (E*x7) );
// X5=X5+ ((F*x_[4]) - (D*x_[5]) + (G*x_[6]) + (E*x_[7]))/(power(2,15-i)));
n1 = Add_SS2((G*x4), - (F*x5));
n2 = Add_SS2((E*x6),- (D*x7));
printf("X0 = %hd \n",X0);

X0_ = 0.5*((A*x0) + (A*x1) + (A*x2) + (A*x3));

X2_ = 0.5*((B*x0) + (C*x1) - (C*x2) - (B*x3));
X4_ = 0.5*((A*x0) - (A*x1) - (A*x2) + (A*x3));
X6_ = 0.5*((C*x0) - (B*x1) + (B*x2) - (C*x3));
X1_ = 0.5*((D*x4) + (E*x5) + (F*x6) + (G*x7));

X3_ = 0.5*((E*x4) - (G*x5) - (D*x6) - (F*x7));
X5_ = 0.5*((F*x4) - (D*x5) + (G*x6) + (E*x7));
X7_ = 0.5*((G*x4) - (F*x5) + (E*x6) - (D*x7));
printf("X0 = %lf \n",X0_);
return 0;
}
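int Add_SS2(int a, int b) {
/* Saturating add: form the sum in 64 bits, then clamp to the 32-bit range */
long long out = (long long)a + (long long)b;
if (out > 0x7fffffffLL)
return 0x7fffffff;
else if (out < -0x7fffffffLL - 1)
return -0x7fffffff - 1; /* most negative 32-bit value */
else return (int)out;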
}
/*****************************************************************************
* File Name : ROM.c
*
* Comment : This file will help produce ROM code in verilog
*
* Project :ECE699
* Author :Shamik Valia
* Creation Date : 27 July 2003.
* Advisor : Prof Mike Schulte.
****************************************************************************/
/* Execution format will be

* ROM inputfile inputbits# outputfile outputbits#
*/
#include<stdio.h>
int main(int argc,char *argv[])

{
FILE *fo;
//checking for the right syntax given.

if(argc!=5)
{
fprintf(stderr,"Wrong number of arguments\n");
fprintf(stderr,"Requires 4 arguments: inputfile inputbits# outputfile outputbits#\n");
return 1;
}
//open file to read

// fi = fopen(argv[1],"r");
//open a new file to write output
fo = fopen(argv[3],"w");
//Copyright information...
fprintf(fo,"//#########################################################\n");
fprintf(fo,"//\n");
fprintf(fo,"// File Name:%s \n",argv[3]);
fprintf(fo,"//\n");
fprintf(fo,"// Comment : The file is ROM code for inbit#%s and output#%s \n",argv[2],argv[4]);
fprintf(fo,"// Project: \n");
fprintf(fo,"// Author : Shamik Valia \n");
fprintf(fo,"// Creation Date : \n");
fprintf(fo,"// Advisor:Prof.Mike Schulte\n");
fprintf(fo,"//\n");
fprintf(fo,"//#########################################################\n");
fprintf(fo,"\n\n");
//Generation of verilog code.

fprintf(fo,"module rom(\n ");
fprintf(fo," select, //input select line to ROM \n");
fprintf(fo," output1, //output data of the ROM \n");
fprintf(fo," output2, //output data of the ROM \n");
fprintf(fo,"); \n\n");
fprintf(fo,"parameter in_bits = %s, \n",argv[2]);
fprintf(fo," out_bits = %s; \n\n",argv[4]);
fprintf(fo,"input [in_bits-1:0] select ; \n");
fprintf(fo,"output1[out_bits-1:0] output ; \n\n");
fprintf(fo,"output2[out_bits-1:0] output ; \n\n");
fprintf(fo,"reg [out_bits-1:0] output ;\n\n");
fprintf(fo,"always@(select) \n");
fprintf(fo," case(select) \n");
fprintf(fo," `include \"%s\" \n ",argv[1]);
fprintf(fo," endcase \n");
fprintf(fo,"endmodule \n");
}
Distributed Arithmetic structure
//
module struc_16dap(in1,in2,in3,in4,reset,clk,start,out,done);
input [15:0] in1,in2,in3,in4 ;

input reset,clk,start;
output [20:0] out ;
output done ;
wire [3:0] in1_,in2_,in3_ ,in4_,in5_,in6_,in7_,in8_,in9_,in10_,in11_,in12_,in13_,in14_,in15_,in16_;

wire [3:0] in_ROM1,in_ROM2;
wire [15:0] out_ROM1,out_ROM2;
wire [18:0] ext_out1,ext_out2;
wire [20:0] adder_out ;
wire reset_reg,shiftAdd,shiftaddcomplement,rd,reset_reg_use;
wire [2:0] S1,S2;
wire strobe;
//instantiation of a DAP controller
dap_controller_4 D1(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);
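//wire the bits to be put in mux: inK_ collects bit K-1 of each of the four inputs.
assign in1_ = {in1[0],in2[0],in3[0],in4[0]};
assign in2_ = {in1[1],in2[1],in3[1],in4[1]};
assign in3_ = {in1[2],in2[2],in3[2],in4[2]};
assign in4_ = {in1[3],in2[3],in3[3],in4[3]};
assign in5_ = {in1[4],in2[4],in3[4],in4[4]};
assign in6_ = {in1[5],in2[5],in3[5],in4[5]};
assign in7_ = {in1[6],in2[6],in3[6],in4[6]};
assign in8_ = {in1[7],in2[7],in3[7],in4[7]};
assign in9_ = {in1[8],in2[8],in3[8],in4[8]};
assign in10_ = {in1[9],in2[9],in3[9],in4[9]};
assign in11_ = {in1[10],in2[10],in3[10],in4[10]};
assign in12_ = {in1[11],in2[11],in3[11],in4[11]};
assign in13_ = {in1[12],in2[12],in3[12],in4[12]};
assign in14_ = {in1[13],in2[13],in3[13],in4[13]};
assign in15_ = {in1[14],in2[14],in3[14],in4[14]};
assign in16_ = {in1[15],in2[15],in3[15],in4[15]};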

//select the input to be selected for ROM input

mux_8 #(4) M1(in_ROM1,in2_,in4_,in6_,in8_,in10_,in12_,in14_,in16_,S1);
mux_8 #(4) M2(in_ROM2,in1_,in3_,in5_,in7_,in9_,in11_,in13_,in15_,S2);
//ROM data structure.

ROM_16 RO1(in_ROM1,out_ROM1,rd);
ROM_16 RO2(in_ROM2,out_ROM2,rd);
//Sign extension for ROM values

assign ext_out1 = {out_ROM1[15],out_ROM1[15],out_ROM1[15],out_ROM1};//higher value
assign ext_out2 = {out_ROM2[15],out_ROM2[15],out_ROM2[15],out_ROM2};//lesser value
//adder block
adder Add1(adder_out,ext_out1,ext_out2,out,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use);
assign strobe = 1'b1;
reg_ #(21) RE1(out,adder_out,clk,strobe,reset_reg);
endmodule
module adder(adder_out,in_adder_,in_adder,r,shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use);
input [18:0] in_adder_,in_adder ;

input [20:0] r;
input shiftAdd,shiftaddcomplement,reset_reg,reset_reg_use;
output [20:0]adder_out;
reg [20:0] adder_out;
reg [20:0] c,d,e;
always@(in_adder,in_adder_,r,shiftAdd,shiftaddcomplement,reset_reg_use)
begin
if (shiftAdd ==1'b1)
begin
if (reset_reg_use == 1'b1)
begin
c = {in_adder_,1'b0,1'b0};
d = {in_adder[18],in_adder,1'b0};
adder_out = {in_adder_,1'b0,1'b0}+{in_adder[18],in_adder,1'b0};
end
else
begin
c = {in_adder_,1'b0,1'b0};
e = {r[20],r[20],r[20:2]} ;
adder_out = {in_adder_,1'b0,1'b0}+({in_adder[18],in_adder,1'b0})+
({r[20],r[20],r[20:2]});
end
end
else
begin
if (shiftaddcomplement ==1'b1)
begin
adder_out = {(~in_adder_ + 1'b1),1'b0,1'b0} +({in_adder[18],in_adder,1'b0}) +
({r[20],r[20],r[20:2]});
c = {(~in_adder_ + 1'b1),1'b0,1'b0};
e = {r[20],r[20],r[20:2]} ;
end
else
begin
c = 21'b0;
d = 21'b0;
e = 21'b0;
adder_out = 21'b0;
end
end
end
endmodule
module dap_controller_4(reset_reg_use,done,S1,S2,reset_reg,shiftAdd,shiftaddcomplement,rd,reset,start,clk);
input reset ,clk ,start;

output reset_reg,shiftAdd,shiftaddcomplement,rd,done,reset_reg_use;
output [2:0] S1,S2;
reg reset_reg,shiftAdd,shiftaddcomplement,rd,reset_reg_use;
reg [2:0] S1,S2;
reg [2:0]state,nextstate ;
reg done;
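//state encoding: one state per pair of bit planes (ST1..ST7 were used below but not declared)
parameter ST0 = 3'b000, ST1 = 3'b001, ST2 = 3'b010, ST3 = 3'b011,
ST4 = 3'b100, ST5 = 3'b101, ST6 = 3'b110, ST7 = 3'b111;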
always@(posedge clk or posedge reset)

begin
// reset_reg = 1'b0;
// shiftAdd = 1'b0;
// shiftaddcomplement =1'b0;
// rd = 1'b1;
if (reset == 1'b1)
begin
state = ST0;
//reset_reg = 1'b1;
rd = 1'b0;
reset_reg_use = 1'b1;
end
else
begin
case(state)
ST0:if(start ==1'b1)
begin
S1 = 3'b000;
S2 = 3'b000;
nextstate = ST1 ;
reset_reg_use = 1'b1;
shiftAdd = 1'b1;
rd = 1'b1;
done = 1'b0;
end
else
nextstate = ST0;
ST1:begin
nextstate = ST2;
S1 = 3'b001;
S2 = 3'b001;
shiftAdd = 1'b1;
reset_reg_use =1'b0;
rd = 1'b1;
end
ST2 : begin
nextstate = ST3;
S1 = 3'b010;
S2 = 3'b010;
shiftAdd = 1'b1;
rd = 1'b1;
end
ST3 : begin
nextstate = ST4;
S1 = 3'b011;
S2 = 3'b011;
shiftAdd = 1'b1;
rd = 1'b1;
end
ST4 : begin
nextstate = ST5;
S1 = 3'b100;
S2 = 3'b100;
shiftAdd = 1'b1;
rd = 1'b1;
end
ST5 : begin
nextstate = ST6;
S1 = 3'b101;
S2 = 3'b101;
shiftAdd = 1'b1;
rd = 1'b1;
end
ST6 : begin
nextstate = ST7;
S1 = 3'b110;
S2 = 3'b110;
shiftAdd = 1'b1;
rd = 1'b1;
end
ST7 : begin
nextstate = ST0;
S1 = 3'b111;
S2 = 3'b111;
shiftAdd = 1'b0; //cleared so that the adder takes the shiftaddcomplement branch
shiftaddcomplement = 1'b1;
rd = 1'b1;
done =1'b1;
end
default: nextstate = ST0;

endcase
state = nextstate ;
end
end
endmodule
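
The three modules above evaluate a four-term multiply-accumulate without a multiplier: each clock, bits of the four inputs address a ROM of precomputed coefficient sums, the ROM words are shifted and accumulated, and the word addressed by the sign bits is subtracted in the final step. The C fragment below is a minimal software model of that idea, intended only as a reference for checking results. It handles one bit plane per iteration and keeps full precision, whereas the hardware handles two bit planes per clock and shifts the accumulator right, so its low-order bits may differ. The names da_rom_lookup and da_dot_product and the integer coefficient format are assumptions for illustration.

```c
#include <stdint.h>

#define DA_INPUTS 4    /* inputs per ROM, as in struc_16dap */
#define DA_BITS   16   /* input word length */

/* Precomputed-sum lookup: address bit 3 selects coeff[0], bit 0 selects coeff[3]. */
static int64_t da_rom_lookup(const int coeff[DA_INPUTS], unsigned addr)
{
    int64_t sum = 0;
    for (int i = 0; i < DA_INPUTS; i++) {
        if (addr & (1u << (DA_INPUTS - 1 - i)))
            sum += coeff[i];
    }
    return sum;
}

/* Bit-serial evaluation of coeff[0]*x[0] + ... + coeff[3]*x[3]. */
static int64_t da_dot_product(const int coeff[DA_INPUTS],
                              const int16_t x[DA_INPUTS])
{
    int64_t acc = 0;
    for (int k = 0; k < DA_BITS; k++) {
        /* Build the ROM address from bit k of every input, x[0] first. */
        unsigned addr = 0;
        for (int i = 0; i < DA_INPUTS; i++) {
            unsigned bit = ((uint16_t)x[i] >> k) & 1u;
            addr = (addr << 1) | bit;
        }
        int64_t term = da_rom_lookup(coeff, addr) * ((int64_t)1 << k);
        /* The sign-bit plane carries weight -2^15 in two's complement. */
        if (k == DA_BITS - 1)
            acc -= term;
        else
            acc += term;
    }
    return acc;
}
```

Comparing da_dot_product against a direct sum of products over a set of test vectors gives a quick sanity check of the ROM contents before they are used in simulation.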
/*
module struc_16;
wire [15:0] in1,in2,in3,in4;

wire [20:0] out ;
reg clk , reset ,start ;
reg [63:0] t;
wire [31:0] out1;
wire [20:0] value;
wire [15:0] error;
wire done;
reg val;
assign in1=t[15:0];
assign in2=t[31:16];
check16 c1(in1,in2,in3,in4,out1);
struc_16dap D1(in1,in2,in3,in4,reset,clk,start,out,done);
assign value = out<<1;
//assign error = (val==1'b1) ? (out1[30:15] - out[16:1]) : 16'b0;
assign error = out1[30:15] - out[16:1];
always
#5 clk = ~clk ;
always
begin
# 5 start = 1'b1;val = 1'b0;
t = t+1;
# 10 start =1'b0;
# 95 val=1'b1;
end
initial
begin
clk = 1'b1;reset =1'b1;t = 64'h0000_0000_0000_0000;
#7 reset =1'b0;
end
endmodule
*/

