INTRODUCTION
Digital image processing is an area characterized by the need for extensive experimental work to establish the viability of proposed solutions to a given problem. An important characteristic underlying the design of image processing systems is the significant level of testing and experimentation that normally is required before arriving at an acceptable solution. This characteristic implies that the ability to formulate approaches and quickly prototype candidate solutions generally plays a major role in reducing the cost and time required to arrive at a viable system implementation.
types of computerized processes in this continuum: low-, mid-, and high-level processes. Low-level processes involve primitive operations such as image preprocessing to reduce noise, contrast enhancement, and image sharpening. A low-level process is characterized by the fact that both its inputs and outputs are images. Mid-level processes on images involve tasks such as segmentation, description of objects to reduce them to a form suitable for computer processing, and classification of individual objects. A mid-level process is characterized by the fact that its inputs generally are images but its outputs are attributes extracted from those images. Finally, higher-level processing involves making sense of an ensemble of recognized objects, as in image analysis, and, at the far end of the continuum, performing the cognitive functions normally associated with human vision.

Digital image processing, as already defined, is used successfully in a broad range of areas of exceptional social and economic value. Images are an everyday aspect of computing now. Web sites on the Internet generally contain many pictures, and a large proportion of transmission bandwidth and storage is taken up by images. Reducing the storage requirements of an image while retaining its quality is therefore very important; otherwise systems would become completely clogged. Since 1990, the JPEG picture format has been adopted as the standard for photographic images on the Internet. This project looks at another method for compressing images, using the Singular Value Decomposition (SVD).
processing books, the image origin is defined to be at (x, y) = (0, 0). The next coordinate values along the first row of the image are (x, y) = (0, 1). It is important to keep in mind that the notation (0, 1) is used to signify the second sample along the first row; it does not mean that these are the actual values of physical coordinates when the image was sampled. The following figure shows the coordinate convention. Note that x ranges from 0 to M-1 and y from 0 to N-1 in integer increments. The coordinate convention used in the toolbox to denote arrays differs from the preceding paragraph in two minor ways. First, instead of (x, y), the toolbox uses the notation (r, c) to indicate rows and columns. Note, however, that the order of coordinates is the same as in the previous paragraph, in the sense that the first element of a coordinate tuple, (r, c), refers to a row and the second to a column. The other difference is that the origin of the coordinate system is at (r, c) = (1, 1); thus, r ranges from 1 to M and c from 1 to N in integer increments. The IPT documentation refers to these as pixel coordinates. Less frequently, the toolbox also employs another convention, called spatial coordinates, which uses x to refer to columns and y to refer to rows. This is the opposite of our use of the variables x and y.
f(x, y) =
    f(0,0)      f(0,1)      ...   f(0,N-1)
    f(1,0)      f(1,1)      ...   f(1,N-1)
    ...         ...               ...
    f(M-1,0)    f(M-1,1)    ...   f(M-1,N-1)
The right side of this equation is a digital image by definition. Each element of this array is called an image element, picture element, pixel, or pel. The terms image and pixel are used throughout the rest of our discussion to denote a digital image and its elements. A digital image can be represented naturally as a MATLAB matrix:
f =
    f(1,1)   f(1,2)   ...   f(1,N)
    f(2,1)   f(2,2)   ...   f(2,N)
    ...      ...            ...
    f(M,1)   f(M,2)   ...   f(M,N)
where f(1,1) = f(0,0) (note the use of a monospace font to denote MATLAB quantities). Clearly, the two representations are identical except for the shift in origin. The notation f(p, q) denotes the element located in row p and column q. Matrices in MATLAB are stored in variables with names such as A, a, RGB, real_array, and so on. Variables must begin with a letter and contain only letters, numerals, and underscores. As noted in the previous paragraph, all MATLAB quantities are written using monospace characters. We use conventional Roman italic notation, such as f(x, y), for mathematical expressions.
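The origin shift between the two conventions can be illustrated with a small sketch (Python/NumPy here rather than MATLAB, purely for illustration; NumPy shares the 0-based convention of f(x, y)):

```python
import numpy as np

# A small "digital image" with M = 3 rows and N = 4 columns.
f = np.arange(12).reshape(3, 4)

# The first sample of the first row: f(0,0) in the book's notation,
# f(1,1) in MATLAB, and f[0, 0] in 0-based NumPy indexing.
first = f[0, 0]

# MATLAB's f(r, c) with r = 1..M, c = 1..N corresponds to f[r-1, c-1] here.
r, c = 2, 3
element = f[r - 1, c - 1]
print(first, element)
```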
Most monochrome image processing operations are carried out using binary or intensity images, so our initial focus is on these two image types. Indexed and RGB color images are discussed in the sections that follow.
1.5.1 Intensity Images
An intensity image is a data matrix whose values have been scaled to represent intensities. When the elements of an intensity image are of class uint8 or class uint16, they have integer values in the range [0, 255] and [0, 65535], respectively.

1.5.2 Binary Images
Binary images have a very specific meaning in MATLAB: a binary image is a logical array of 0s and 1s. Thus, an array of 0s and 1s whose values are of a numeric data class, say uint8, is not considered a binary image in MATLAB. A numeric array is converted to binary using the function logical. Thus, if A is a numeric array consisting of 0s and 1s, we create a logical array B using the statement

B = logical(A)

If A contains elements other than 0s and 1s, the logical function converts all nonzero quantities to logical 1s and all entries with value 0 to logical 0s. Using relational and logical operators also creates logical arrays. To test whether an array is logical we use the islogical function: islogical(C). If C is a logical array, this function returns a 1; otherwise it returns a 0. Logical arrays can be converted to numeric arrays using the data class conversion functions.

1.5.3 Indexed Images
An indexed image has two components: a data matrix of integers, X, and a color map matrix, map. Matrix map is an m*3 array of class double containing floating-point values in the range [0, 1]. The length m of the map is equal to the number of colors it defines. Each row of map specifies the red, green, and blue components of a single color. An indexed image uses direct mapping of pixel intensity values to color map values: the color of each pixel is determined by using the corresponding value of the integer matrix X as a pointer into map. If X is of class double, then all of its components with values less than or equal to 1 point to the first row in map, all components with value 2 point to the second row, and so on. If X is
of class uint8 or uint16, then all components with value 0 point to the first row in map, all components with value 1 point to the second, and so on.

1.5.4 RGB Images
An RGB color image is an M*N*3 array of color pixels, where each color pixel is a triplet corresponding to the red, green, and blue components of an RGB image at a specific spatial location. An RGB image may be viewed as a stack of three gray-scale images that, when fed into the red, green, and blue inputs of a color monitor, produce a color image on the screen. By convention, the three images forming an RGB color image are referred to as the red, green, and blue component images. The data class of the component images determines their range of values. If an RGB image is of class double, the range of values is [0, 1]; similarly, the range of values is [0, 255] or [0, 65535] for RGB images of class uint8 or uint16, respectively. The number of bits used to represent the pixel values of the component images determines the bit depth of an RGB image. For example, if each component image is an 8-bit image, the corresponding RGB image is said to be 24 bits deep. Generally, the number of bits in all component images is the same. In this case the number of possible colors in an RGB image is (2^b)^3, where b is the number of bits in each component image. For the 8-bit case the number is 16,777,216 colors.
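The (2^b)^3 color count above is easy to verify (an illustrative Python sketch; the function name rgb_colors is ours):

```python
# Number of representable colors in an RGB image: (2**b)**3 for b bits
# per component image, as stated in the text.
def rgb_colors(bits_per_component):
    levels = 2 ** bits_per_component   # intensity levels per channel
    return levels ** 3                 # independent R, G, B combinations

# 8-bit components give a 24-bit-deep RGB image with 16,777,216 colors.
print(rgb_colors(8))
```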
Compression reduces images to sizes of manageable and transmittable dimensions. Increasing the bandwidth is another method, but the cost sometimes makes this a less attractive solution.
Chapter 2
VIDEO COMPRESSION
Digital video compression technology has been developing rapidly for many years. Today, when people chat with their friends through a video telephone, enjoy a movie broadcast over the Internet, or listen to digital music such as MP3, the convenience that the digital video industry brings us cannot be overlooked. All of this is attributable to advances in mass storage media and streaming video/audio services, which have influenced our daily life deeply. In this project we implement the system in Simulink. Simulink is a platform for multidomain simulation and Model-Based Design for dynamic systems. It provides an interactive graphical environment and a customizable set of block libraries, and can be extended for specialized applications. The system performs video compression using motion compensation and Discrete Cosine Transform (DCT) techniques with the Video and Image Processing Blockset. It calculates motion vectors between successive frames and uses them to reduce redundant information. Then it divides each frame into submatrices and applies the discrete cosine transform to each submatrix. Finally, it applies a quantization technique to achieve further compression. The Decoder subsystem performs the inverse process to recover the original video.
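The idea of reducing temporal redundancy can be sketched in a few lines (a Python/NumPy toy, not the Simulink model: a real coder estimates per-block motion vectors, while this sketch assumes zero motion and codes a plain frame difference):

```python
import numpy as np

# Two successive frames that differ only in a small region.
rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(8, 8)).astype(np.int16)
frame2 = frame1.copy()
frame2[2:4, 2:4] += 5          # only a 2x2 patch changes between frames

# The encoder sends the residual instead of the whole new frame.
residual = frame2 - frame1

# The decoder adds the residual to its reference frame to reconstruct.
reconstructed = frame1 + residual
assert np.array_equal(reconstructed, frame2)

# Most residual samples are zero, so it codes far more compactly.
print(np.count_nonzero(residual), residual.size)
```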
Throw away too much, and the changes become noticeable; with heavy compression you can get video that is nearly unrecognizable. When you compress video, always try several compression settings. The goal is to compress as much as possible until the data loss becomes noticeable, and then notch the compression back a little. That will give you the right balance between file size and quality. And remember that every video is different.
a special token each time a chain of more than two equal input tokens is found. This special input advises the decoder to insert the following token n times into its output stream. The effectiveness of run length encoding is a function of the number of equal tokens in a row relative to the total number of input tokens. This ratio is very high in undithered two-tone images of the type used for facsimile. Effectiveness obviously degrades when the input does not contain many equal tokens: with rising information density, the likelihood of two successive tokens being the same sinks significantly, as there is always some noise distortion in the input. Run length coding is easily implemented, either in software or in hardware. It is fast and very easily verifiable, but its compression ability is very limited.

Huffman Encoding: This algorithm, developed by D. A. Huffman, is based on the fact that in an input stream certain tokens occur more often than others. Based on this knowledge, the algorithm builds up a weighted binary tree according to the tokens' rates of occurrence. Each element of this tree is assigned a new code word, where the length of the code word is determined by its position in the tree: the most frequent token, closest to the root of the tree, is assigned the shortest code, and each less common element is assigned a longer code word. The least frequent element may be assigned a code word up to twice as long as the input token. The compression ratio achieved by Huffman encoding on uncorrelated data is something like 1:2. On slightly correlated data, as in images, the compression ratio may be much higher, the absolute maximum being defined by the size of a single input token and the size of the shortest possible output token (max. compression = token size [bits] / 2 [bits]).
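A minimal run-length codec in this spirit (a Python sketch; it emits a (count, token) pair for every run, a simplification of the marker-based variant described above):

```python
# Encode a token stream as (run length, token) pairs.
def rle_encode(tokens):
    encoded = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] == tokens[i]:
            j += 1                       # extend the run of equal tokens
        encoded.append((j - i, tokens[i]))
        i = j
    return encoded

# Decode by expanding each pair back into a run.
def rle_decode(pairs):
    out = []
    for count, token in pairs:
        out.extend([token] * count)
    return out

data = [0, 0, 0, 0, 1, 1, 7, 7, 7, 7, 7, 7]
coded = rle_encode(data)
print(coded)                 # runs of 0s, 1s and 7s as three pairs
assert rle_decode(coded) == data
```

Note how the compression ratio depends entirely on how long the runs are, exactly as the text observes.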
While standard palettized images with a limit of 256 colors may be compressed by 1:4 if they use only one color, more typical images give results in the range of 1:1.2 to 1:2.5.

Entropy Coding: The typical implementation of an entropy coder follows J. Ziv and A. Lempel's approach. Nowadays, there is a wide range of so-called modified Lempel-Ziv codings. These algorithms all work in a common way: the coder and the decoder both build up an equivalent dictionary of metasymbols, each of which represents a whole sequence of input tokens. If a sequence is repeated after a symbol was found for it, then only the symbol becomes part of the coded data, and the sequence of tokens referenced by the symbol becomes part of the decoded data later. As the dictionary is built up from the data itself, it is not necessary to include it in the coded data, as it is with the tables in a Huffman
coder. Entropy coders are a little tricky to implement, as there are usually a few tables, all growing while the algorithm runs.

Area Coding: Area coding is an enhanced form of run length coding, reflecting the two-dimensional character of images. This is a significant advance over the other lossless methods. For coding an image it does not make much sense to interpret it as a sequential stream; it is in fact an array of sequences building up a two-dimensional object. Since the two dimensions are independent and of the same importance, a coding scheme aware of this clearly has advantages. The algorithms for area coding try to find rectangular regions with the same characteristics. These regions are coded in a descriptive form, as an element with two points and a certain structure. The whole input image has to be described in this form to allow lossless decoding afterwards. Practical implementations use recursive algorithms, reducing the whole area to equal-sized subrectangles until a rectangle fulfills the criterion of having the same characteristic for every pixel. This type of coding can be highly effective, but it has the problem of being a nonlinear method, which cannot be implemented in hardware; therefore its performance in terms of compression time is not competitive.

2.2.2 Lossy Coding Techniques
In most applications we do not need an exact restoration of the stored image. This fact can help make the storage more effective, and this is how we arrive at lossy compression methods. Lossy image coding techniques normally have three components: image modeling, which defines such things as the transformation to be applied to the image; parameter quantization, whereby the data generated by the transformation are quantized to reduce the amount of information; and encoding, where a code is generated by associating appropriate codewords with the raw data produced by the quantization.
Each of these operations is in part responsible for the compression. Image modeling is aimed at the exploitation of statistical characteristics of the image (i.e., high correlation, redundancy). Typical examples are transform coding methods, in which the data are represented in a different domain (for example, frequency in the case of the Fourier Transform [FT], the Discrete Cosine Transform [DCT], the Karhunen-Loève Transform [KLT], and so on), where a reduced number of coefficients contains most of the original information. In many cases this first phase does not result in any loss of information. The aim of quantization is to reduce the amount of data used to represent the information within the new domain. Quantization is in most cases not a reversible operation; therefore, it belongs to the so-called 'lossy' methods. Encoding is usually error free. It optimizes the representation of the information (sometimes helping to further reduce the bit rate) and may introduce some error detection codes. In the following sections, a review of the most important coding schemes for lossy compression is provided. Some methods are described in their canonical form (transform coding, region-based approximations, fractal coding, wavelets, hybrid methods), and some variations and improvements presented in the scientific literature are reported and discussed.

Transform Coding (DCT/Wavelets/Gabor): A general transform coding scheme involves subdividing an NxN image into smaller nxn blocks and performing a unitary transform on each subimage. A unitary transform is a reversible linear transform whose kernel describes a set of complete, orthonormal discrete basis functions. The goal of the transform is to decorrelate the original signal, and this decorrelation generally results in the signal energy being redistributed among only a small set of transform coefficients. In this way, many coefficients may be discarded after quantization and prior to encoding. Also, visually lossless compression can often be achieved by incorporating the HVS (human visual system) contrast sensitivity function in the quantization of the coefficients.
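The energy redistribution produced by a unitary transform can be demonstrated with the DCT (a Python/NumPy sketch; dct_matrix builds the orthonormal DCT-II matrix, and the smooth test block is ours):

```python
import numpy as np

# Orthonormal DCT-II matrix; C @ block @ C.T gives the 2-D DCT of a block.
def dct_matrix(n):
    C = np.zeros((n, n))
    for k in range(n):
        for m in range(n):
            C[k, m] = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    C[0, :] *= np.sqrt(1.0 / n)
    C[1:, :] *= np.sqrt(2.0 / n)
    return C

n = 8
C = dct_matrix(n)
# Unitary: the rows are orthonormal, so C @ C.T is the identity.
assert np.allclose(C @ C.T, np.eye(n))

# A smooth, highly correlated block (a linear ramp in x and y).
block = np.fromfunction(lambda x, y: x + y, (n, n))
X = C @ block @ C.T

# Nearly all of the energy lands in a handful of low-frequency coefficients.
energy = X ** 2
top3 = np.sort(energy, axis=None)[::-1][:3].sum()
print(top3 / energy.sum())   # very close to 1
```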
Transform coding can be generalized into four stages: image subdivision, image transformation, coefficient quantization, and Huffman encoding. For a transform coding scheme, logical modeling is done in two steps: a segmentation step, in which the image is subdivided into two-dimensional vectors (possibly of different sizes), and a transformation step, in which the chosen transform (e.g., KLT, DCT, Hadamard) is applied. Quantization can be performed in several ways. Most classical approaches use 'zonal coding', consisting of the scalar quantization of the coefficients belonging to a predefined area (with a fixed bit allocation), and 'threshold coding', consisting of the choice of the coefficients of each block characterized by an absolute value exceeding a predefined threshold. Another possibility, which leads to higher compression factors, is to apply a vector quantization scheme to the transformed coefficients. The same type of encoding is used for each coding method; in most cases a classical Huffman code can be used successfully. The JPEG and MPEG standards are examples of standards based on transform coding.

Vector Quantization: A vector quantizer can be defined mathematically as a transform operator T from a K-dimensional Euclidean space R^K to a finite subset X in R^K made up of N vectors. This subset X becomes the vector codebook or, more generally, the codebook. Clearly, the choice of the set of vectors is of major importance. The level of distortion due to the transformation T is generally computed as the mean squared error (MSE) between the "real" vector x in R^K and the corresponding vector x' = T(x) in X. This error should be such as to minimize the Euclidean distance d. An optimum scalar quantizer was proposed by Lloyd and Max. Later on, Linde, Buzo and Gray resumed and generalized this method, extending it to the case of a vector quantizer. The LBG algorithm for the design of a vector codebook always reaches a local minimum of the distortion function, but often this solution is not the optimal one. A careful analysis of the LBG algorithm's behavior reveals two critical points: the choice of the starting codebook and the uniformity of the Voronoi regions' dimensions. For this reason some algorithms have been designed that give better performance.
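The LBG iteration can be sketched as alternating nearest-neighbour assignment and centroid updates (a Python/NumPy toy with K = 2 codewords and synthetic clustered training vectors; a production design would also handle empty Voronoi regions and a proper stopping criterion):

```python
import numpy as np

# Synthetic training set with two natural clusters in R^2.
rng = np.random.default_rng(1)
training = np.vstack([rng.normal(0, 0.1, (50, 2)),
                      rng.normal(5, 0.1, (50, 2))])

# Starting codebook: two training vectors, one from each cluster.
codebook = training[[0, 50]].copy()

for _ in range(10):
    # Assign each training vector to its nearest codeword (Euclidean d).
    d = np.linalg.norm(training[:, None, :] - codebook[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    # Move each codeword to the centroid of its Voronoi region.
    for k in range(len(codebook)):
        codebook[k] = training[nearest == k].mean(axis=0)

# MSE distortion after convergence to a local minimum.
distortion = (np.linalg.norm(training - codebook[nearest], axis=1) ** 2).mean()
print(codebook.round(1), distortion)
```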
With respect to the initialization of the LBG algorithm, for instance, one can observe that a random choice of the starting codebook requires a large number of iterations before reaching an acceptable amount of distortion. Moreover, if the starting point leads to a local minimum solution, the relative stopping criterion prevents further optimization steps.

Segmentation and Approximation Methods: With segmentation and approximation coding methods, the image is modeled as a mosaic of regions, each one characterized by a sufficient degree of uniformity of its pixels with respect to a certain feature (e.g., grey level, texture); each region then has some parameters related to the characterizing feature associated with it. The operations of finding a suitable segmentation and an optimum set of approximating parameters are highly correlated, since the segmentation algorithm must take into account the error produced by the region reconstruction (in order to limit this value within determined bounds). These two operations constitute the logical modeling for this class of coding schemes; quantization and encoding are strongly dependent on the statistical characteristics of the parameters of this approximation. For polynomial approximation, regions are reconstructed by means of polynomial functions in (x, y); the task of the encoder is to find the optimum coefficients. In texture approximation, regions are filled by synthesizing a parameterized texture based on some model (e.g., fractals, statistical methods, Markov Random Fields [MRF]). It must be pointed out that, while in polynomial approximations the problem of finding optimum coefficients is quite simple (it is possible to use least squares approximation or similar exact formulations), for texture-based techniques this problem can be very complex.

Fractal Compression: This is a form of vector quantization, and it is a lossy compression. Compression is performed by locating self-similar sections of a video, then using a fractal algorithm to generate the sections. Like the DCT, the discrete wavelet transform mathematically transforms a video into frequency components. The process is performed on the entire video, which differs from the other methods (such as the DCT) that work on smaller pieces of the data. The result is a hierarchical representation of a video, where each layer represents a frequency band.
MPEG-2: Designed for rates between 1.5 and 15 Mbit/s, this is the standard on which digital television set-top boxes and DVD compression are based. It is based on MPEG-1, but designed for the compression and transmission of digital broadcast television. The most significant enhancement over MPEG-1 is its ability to efficiently compress interlaced video. MPEG-2 scales well to HDTV resolution and bit rates, obviating the need for an MPEG-3.

MPEG-4: A standard for multimedia and Web compression. MPEG-4 is based on object-based compression, similar in nature to the Virtual Reality Modeling Language. Individual objects within a scene are tracked separately and compressed together to create an MPEG-4 file. This results in very efficient compression that is very scalable, from low bit rates to very high ones. It also allows developers to control objects independently in a scene, and therefore to introduce interactivity.

JPEG: JPEG stands for Joint Photographic Experts Group. It is also an ISO/IEC working group, which works to build standards for continuous-tone image coding. JPEG is a lossy compression technique for full-color or gray-scale images that exploits the fact that the human eye will not notice small color changes. JPEG 2000 is an initiative to provide an image coding system using compression techniques based on wavelet technology.
2.4 Transforms
There are several common transforms used in signal processing, such as the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT), the Discrete Wavelet Transform (DWT), and others. The DCT is the most common transform used when processing images and video. The DWT is used in the image compression standard JPEG 2000, and will be used in this application as well. Both the DCT and DWT are described more thoroughly below. The basic idea of using transforms when processing, for example, a video is to decorrelate the pixels from one another. Doing so achieves compression, since the amount of redundant information is minimized. A transform can be seen as a projection onto orthonormal bases, separated in time and/or frequency. By transforming a signal, its energy is separated into subbands. By describing each subband with a different precision, higher precision within high-energy subbands and less precision in low-energy subbands, the signal can be compressed.
To transform a matrix Y, the transform matrix C is multiplied with Y, giving the transformed matrix X = CY. The cosine transform is real-valued and orthogonal, which means that C = C* and C^-1 = C^T. The DCT is also excellent at energy compaction, meaning that the energy of the matrix is concentrated in a small region of the transformed matrix, and it has good decorrelation properties. These properties are very suitable for image and video processing, and the DCT is therefore widely used (e.g., in JPEG, MPEG and H.263). The two-dimensional DCT of the image in Figure 2.1(a) can be seen in the figure; note that the energy is concentrated in the upper left corner.
mask = [1 1 1 1 0 0 0 0
        1 1 1 0 0 0 0 0
        1 1 0 0 0 0 0 0
        1 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0];
B2 = blkproc(B, [8 8], 'P1.*x', mask);
I2 = blkproc(B2, [8 8], 'P1*x*P2', T', T);

Although there is some loss of quality in the reconstructed image, it is clearly recognizable, even though almost 85% of the DCT coefficients were discarded. To experiment with discarding more or fewer coefficients, and to apply this technique to other images, try running the demo function dctdemo.
Chapter 3
The matrix U contains one orthonormal basis; its columns are known as the left singular vectors. The matrix V contains another orthonormal basis; its columns are known as the right singular vectors. The diagonal matrix S contains the singular values.
3.2 Factoring U
Eliminating V from the equation is very similar to eliminating U. Instead of multiplying on the left by A^T, we multiply on the right by A^T. This gives

A A^T = (U S V^T)(U S V^T)^T = U S V^T V S^T U^T

Since V^T V = I, this gives

A A^T = U S^2 U^T

Again we find the eigenvectors, but this time for A A^T. These are the columns of U (the left singular vectors).
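This relation is easy to check numerically (a Python/NumPy sketch; the eigenvalues of A A^T are the squared singular values, and the eigenvectors agree with the columns of U up to sign):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))

# Full SVD: A = U S V^T.
U, S, Vt = np.linalg.svd(A)

# Eigendecomposition of A A^T; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(A @ A.T)
eigvals = eigvals[::-1]            # descending, to match S

# The squared singular values equal the (nonzero) eigenvalues of A A^T.
assert np.allclose(S ** 2, eigvals[:len(S)])
print(np.round(S ** 2, 6))
```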
The rank of matrix A is equal to the number of its nonzero singular values. If A has rank r, then v1, v2, ..., vr form an orthonormal basis for the range space of A^T, R(A^T), and u1, u2, ..., ur form an orthonormal basis for the range space of A, R(A).
That is, A can be represented by the outer product expansion:

A = s1 u1 v1^T + s2 u2 v2^T + ... + sr ur vr^T

When compressing the image, the sum is not carried to the very last singular values; the SVs with small enough values are dropped. (Remember that the SVs are ordered on the diagonal.) The closest matrix of rank k is obtained by truncating the sum after the first k terms:

A_k = s1 u1 v1^T + s2 u2 v2^T + ... + sk uk vk^T

The total storage for A_k is k(m + n + 1). The integer k can be chosen considerably smaller than n, and the digital image corresponding to A_k will still be very close to the original image. Different choices of k, however, give different corresponding images and storage requirements. For typical choices of k, the storage required for A_k is less than 20 percent.
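The truncated expansion and its storage count can be illustrated as follows (a Python/NumPy sketch; the Eckart-Young property gives the spectral-norm error of the rank-k truncation as the first discarded singular value):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 10))   # m = 20, n = 10

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A_k = sum of the first k outer products s_i u_i v_i^T.
def rank_k(k):
    return (U[:, :k] * s[:k]) @ Vt[:k]

k = 4
# Spectral-norm error of the best rank-k approximation is s_{k+1}.
err = np.linalg.norm(A - rank_k(k), 2)
assert np.isclose(err, s[k])

# Storage: k(m + n + 1) numbers for A_k versus m*n for A itself.
m, n = A.shape
print(k * (m + n + 1), m * n)
```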
Figure 3.1: Frog Rock Test Image

A human can quickly look at a photograph and isolate the sections of high detail from low detail. However, this can be a difficult task for a computer, requiring a lot of
processing. Ideally the picture would be perfectly split into separate regions based on complexity, but in practice this would be too time consuming and require too much overhead information to keep track of the regions. A simple approach is to break the image into smaller blocks of the same size. Although the blocks won't perfectly align with the different regions of complexity, if there are enough blocks then they will generally match the regions of complexity. This is the approach used by JPEG; pictures are divided into blocks of 8x8 (the JPEG specification allows block sizes of 16x16, but this is rarely used). The second approach, used in this project, is to have adaptive block sizes. Initially the picture is broken up into a series of large blocks. Then each block is split into four quarter-size blocks. If less storage is required when the block is split into quarters, then these new blocks are accepted; otherwise the original block is kept. This process can be repeated on the new blocks, which get smaller and smaller each time.
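The adaptive splitting can be sketched as a recursive quadtree (a Python toy; here a block is split when its pixel variance exceeds a threshold, a stand-in for the "less storage when split" test described above):

```python
import numpy as np

# Recursively split a square block into quarters while it is "complex"
# (variance above threshold) and larger than the minimum size.
def split_blocks(img, x, y, size, threshold, out):
    block = img[y:y + size, x:x + size]
    if size > 2 and block.var() > threshold:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                split_blocks(img, x + dx, y + dy, half, threshold, out)
    else:
        out.append((x, y, size))      # accept this block as-is

# Smooth image with detail only in the bottom-right corner.
img = np.zeros((8, 8))
img[4:, 4:] = np.arange(16).reshape(4, 4)

blocks = []
split_blocks(img, 0, 0, 8, 1.0, blocks)
# Three large 4x4 blocks cover the flat regions; the detailed corner
# subdivides into four 2x2 blocks.
print(sorted(size for _, _, size in blocks))
```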
speed of decompression. The advantage of this adaptive block size technique is that it can better map the regions of complexity of the picture.
Unfortunately, image quality as perceived by the eye is a very subjective measurement. A human can quickly look at an image and determine whether the quality is acceptable, but it is difficult to represent this mathematically. The most common measurement used in image processing is the Peak Signal to Noise Ratio (PSNR), measured in decibels (dB). Although not a great model of the human eye, it is simple to calculate:

PSNR = 10 log10((max range)^2 / RMSE^2)
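The measure can be computed directly (a Python sketch; RMSE is the root mean square error between the original and the reconstruction, and pixels are assumed to lie in [0, 1]):

```python
import numpy as np

# PSNR in dB; 10*log10(max^2 / RMSE^2) is the same as 20*log10(max / RMSE).
def psnr(original, reconstructed, max_range=1.0):
    rmse = np.sqrt(np.mean((original - reconstructed) ** 2))
    return 20 * np.log10(max_range / rmse)

# Uniform error of 0.01 on every pixel gives RMSE = 0.01, hence 40 dB.
a = np.zeros((4, 4))
b = np.full((4, 4), 0.01)
print(round(psnr(a, b), 1))
```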
RMSE = sqrt((1 / (M N)) * sum over all pixels of (f(i, j) - f'(i, j))^2)

Here max range is the allowed value range of the pixels; for convenience, pixels will be in the range [0, 1], hence max range = 1. RMSE is the root mean square error between the original image f and the reconstruction f'.

Higher Order SVD: Tensor decomposition was studied in psychometric data analysis during the 1960s, when data sets having more than two dimensions (generally called three-way data sets) became widely used. A fundamental achievement was brought by Tucker (1963), who proposed to decompose a 3-D signal using directly a 3-D principal component analysis (PCA) instead of unfolding the data along one dimension and using the standard SVD. This three-way PCA is also known as the Tucker3 decomposition. In the 1980s, such multidimensional techniques were also applied to chemometrics analysis. The signal processing community only recently showed interest in the Tucker3 decomposition. The work of Lathauwer et al. (2000) proved that this decomposition is a multilinear generalization of the SVD to multidimensional data. Studying its properties with a notation more familiar to the signal processing community, the authors highlighted its properties concerning rank, oriented energy, and best reduced-rank approximation. As the decomposition can have more than three dimensions, they called it the higher order SVD (HOSVD). In the following, we adopt this notation and define the HOSVD decomposition.

Multiple-Level Decomposition: The decomposition process can be iterated, with successive approximations being decomposed in turn, so that one signal is broken down into many lower resolution components. This is called the wavelet decomposition tree.
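The wavelet decomposition tree (cA1 → cA2 → cA3) can be sketched with the Haar filters (a Python/NumPy toy; haar_step is our illustrative one-level analysis):

```python
import numpy as np

# One Haar analysis step: split a signal of even length into a coarser
# approximation (low-pass) and a detail signal (high-pass).
def haar_step(signal):
    pairs = signal.reshape(-1, 2)
    cA = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # approximation
    cD = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # detail
    return cA, cD

x = np.arange(8.0)
details = []
cA = x
for level in range(3):        # cA1 -> cA2 -> cA3, as in the tree
    cA, cD = haar_step(cA)
    details.append(cD)

# Three levels on 8 samples: details of length 4, 2, 1 plus one sample cA3.
print(len(cA), [len(d) for d in details])
```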
Number of Levels: Since the analysis process is iterative, in theory it can be continued indefinitely. In reality, the decomposition can proceed only until the individual details consist of a single sample or pixel. In practice, you'll select a suitable number of levels based on the nature of the signal, or on a suitable criterion such as entropy. Recently, the parametric model proposed by Doretto et al. was shown to be a valid approach for the analysis/synthesis of dynamic textures. Each video frame is unfolded into a column vector and constitutes a point that follows a trajectory as time evolves. The analysis consists in finding an appropriate space to describe this trajectory and in identifying the trajectory using methods of dynamical system theory. The first part is done by using the singular value decomposition (SVD) to perform dimension reduction to a lower dimensional space. The point trajectory is then described using a multivariate autoregressive (MAR) process of order 1. Dynamic textures are thus modeled using a linear dynamic system, and synthesis is obtained by driving this system with white noise. In this model, the SVD exploits the temporal correlation between the video frames, but the unfolding operation prevents the possibility of exploiting spatial and chromatic correlations. We use this parametric approach but perform the dynamic texture analysis with a higher order SVD, which permits simultaneously decomposing the temporal, spatial, and chromatic components of the video sequence. This approach was proposed by the authors in [10], and here it is described in detail. Our scheme is depicted in Fig. 1: the SVD in the analysis is substituted by the HOSVD.
Figure 3.3: Schematic Representation of the Tensor-Based Linear Model Approach for Analysis and Synthesis

The HOSVD is an extension of the SVD to higher order dimensions. It is not an optimal tensor decomposition in the sense of least squares data fitting, and it does not have the truncation property of the SVD, where keeping the first k singular values gives the best rank-k approximation of a given matrix. Despite this, the approximation obtained is not far from the optimal one and can be computed much faster. In fact, the computation of the HOSVD does not require iterative alternating least squares algorithms, but needs only standard SVD computations. The major advantage of the HOSVD is its ability to simultaneously consider the spatial, temporal, and chromatic correlations. This allows for better data modeling than a standard SVD, since dimension reduction can be performed not only in the time dimension but also separately for the spatial and chromatic content. The separate analysis of each signal component allows adapting the signal compression given by the dimension reduction to the characteristics of each dynamic texture. For comparable visual synthesis quality, we thus obtain a number of model coefficients that is on average five times smaller than that obtained using the standard SVD.
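A minimal HOSVD can be sketched by taking an ordinary SVD of each mode unfolding (a Python/NumPy toy following the Lathauwer et al. construction referenced above; the function names unfold and hosvd are ours):

```python
import numpy as np

# Unfold a tensor along one mode into a matrix (mode axis first).
def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# HOSVD: one orthogonal factor per mode (left singular vectors of the
# unfolding), plus a core tensor obtained by mode-wise products with U^T.
def hosvd(T):
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0]
               for m in range(T.ndim)]
    core = T
    for m, U in enumerate(factors):
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, m, 0), axes=1), 0, m)
    return core, factors

# e.g. height x width x time for a grayscale dynamic texture.
rng = np.random.default_rng(4)
T = rng.standard_normal((4, 5, 6))
core, factors = hosvd(T)

# Multiplying the core back by each factor reconstructs the tensor exactly.
R = core
for m, U in enumerate(factors):
    R = np.moveaxis(np.tensordot(U, np.moveaxis(R, m, 0), axes=1), 0, m)
assert np.allclose(R, T)
```

Dimension reduction, as described in the text, would amount to keeping only the leading columns of each factor before forming the core.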
Creating more compact models is also addressed in related work, where dynamic texture shape and visual appearance are jointly modeled, enabling the modeling of complex video sequences containing sharp edges. Both their approach and ours are characterized by a more computationally expensive analysis but a fast synthesis; in our case, synthesis can be done in real time. This makes our technique well suited to applications with memory constraints, such as mobile devices. We believe that the HOSVD is a very promising technique for other video analysis and approximation applications. Recently, it has been used successfully in image-based texture rendering, face super-resolution, and face analysis and recognition. In the framework of video compression and transmission, it is useful to find a way to analyze/synthesize dynamic textures. An efficient compression would open the possibility of accessing realistic video animations on devices with strong constraints on the available bandwidth, such as mobile phones. The approaches used to model dynamic textures can be classified as non-parametric or parametric. In the first case, the analysis and synthesis are conducted directly on a given representation of the image (the pixel values, or a description in a transformed domain obtained using certain bases, such as wavelets). In the second case, researchers aim to describe the dynamic texture using dynamical models. An interesting approach is to consider a linear dynamic system (LDS): if some simplifying assumptions are made, a closed-form solution for the estimation of the model's parameters can be found for such systems. Unfortunately, the sequences synthesized with this method are not visually appealing when compared to the original sequence. In a refinement of the model, periodicity (oscillation) has been introduced by forcing the poles of the dynamic system to lie on the unit circle.
This solution makes it possible to obtain more realistic sequences, but it is still based on the same assumptions used for the construction. A dynamic texture can be considered a multidimensional signal. In the case of a grayscale video, it can be represented as a 3-D tensor by assigning the spatial information to the first two dimensions and time to the third. In a color video sequence, the chromatic components add another dimension, and the input signal becomes 4-D. The analysis is done by first decomposing the input signal using the HOSVD and then considering the orthogonal matrix derived from the decomposition along the time dimension. This matrix contains the dynamics of the video sequence, since its columns, ordered along the time axis, correspond to the weights that control the appearance of the dynamic texture as time evolves.
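A minimal sketch of this representation for a toy grayscale video, showing how the orthogonal time-mode factor is obtained from the unfolding along the time dimension. This is an illustrative NumPy sketch with invented names, not the report's MATLAB code:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, num_frames = 8, 8, 16
# Toy grayscale video as a 3-D tensor: space in the first two modes, time in the third
video = rng.standard_normal((h, w, num_frames))

# Unfolding along the time mode: one row per frame
time_unfolding = np.moveaxis(video, 2, 0).reshape(num_frames, -1)

# The left singular vectors of this unfolding give the orthogonal time-mode
# factor; each frame corresponds to one entry along the time axis, holding the
# weights that control the appearance of the texture as time evolves
U_time, s, _ = np.linalg.svd(time_unfolding, full_matrices=False)
assert U_time.shape == (num_frames, num_frames)
assert np.allclose(U_time.T @ U_time, np.eye(num_frames))   # orthogonal factor
```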
occupation, and usually permits on-the-fly synthesis. Moreover, it can also be used for other applications, such as segmentation, recognition, and editing. The term specificity indicates whether a given approach is specific to a certain type of dynamic texture, such as fire, water, or smoke, or can be used for all kinds of dynamic textures. The term flexibility indicates whether the characteristics of the generated texture can easily be changed during synthesis. The physics-based approaches have high flexibility but also high specificity, since a model for fire cannot be used for the generation of water or smoke, for instance; their flexibility is high because the visual appearance of the synthetic texture can be modified by tuning the model parameters.
3.10 Tensor
Tensor is a general name for a multilinear mapping over a set of vector spaces: a vector is a 1-mode tensor and a matrix is a 2-mode tensor. A tensor T is an N-mode tensor in which the dimensionality of mode i is d_i. In the same way as a matrix can be pre-multiplied (mode-1 multiplication) or post-multiplied (mode-2 multiplication) by another matrix, a matrix can be multiplied with a higher order tensor with respect to different modes. The mode-n multiplication of a matrix M of size I_n x d_n with a tensor T is denoted T x_n M and results in a tensor U with the same number of modes, whose elements are computed as

U(d_1, ..., d_{n-1}, i_n, d_{n+1}, ..., d_N) = sum over d_n of T(d_1, ..., d_N) * M(i_n, d_n)

Tensor Decomposition: Principal Component Analysis (PCA) is closely related to the Singular Value Decomposition (SVD), a 2-mode tool commonly used in signal processing to reduce the dimensionality of a space and to reduce noise. The SVD decomposes a matrix into three other matrices such that

A = U S V^T

where the columns of U span the column space of A, the columns of V span the row space of A, and S is a diagonal matrix of singular values. The column vectors of U (and likewise of V) are orthonormal to each other, describing a new orthonormal coordinate system for the space spanned by A. N-mode SVD, or Higher Order SVD (HOSVD), is a generalization of the matrix SVD to tensors. It decomposes a tensor T by orthogonalizing its modes, yielding a core tensor and matrices spanning the vector spaces in each mode of the tensor:

T = S x_1 U_1 x_2 U_2 ... x_N U_N

The tensor S is called the core tensor and is analogous to the diagonal singular value matrix in the traditional SVD. However, for the HOSVD, S is not a diagonal tensor: it governs the interaction of the matrices U_i needed to produce the original tensor. The matrices U_i are again orthonormal, and the column vectors of U_i span the space of the tensor T flattened with respect to mode i. The row vectors of U_i are the coefficient sets describing each dimension in mode i. These coefficients can be thought of as the coefficients extracted by PCA, except that in an HOSVD analysis there is a different set of coefficients for each mode. Dimensionality Reduction: After decomposing the original data tensor to yield the core tensor and mode matrices, we can reduce the dimensionality with respect to any mode we choose, unlike PCA, where the dimensionality reduction is based only on the variances. By reducing the number of dimensions in one mode and keeping the others intact, we gain more control over the noise reduction, the classification accuracy, and the complexity of the problem. The dimensionality reduction is achieved by deleting the last m column vectors from the desired mode matrix and deleting the corresponding m hyperplanes from the core tensor. The error after dimensionality reduction is bounded by the Frobenius norm of the hyperplanes deleted from the core tensor.
Chapter 4
INTRODUCTION to MATLAB
MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include: math and computation; algorithm development; data acquisition; modeling, simulation, and prototyping; data analysis, exploration, and visualization; scientific and engineering graphics; and application development, including graphical user interface building. MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar noninteractive language such as C or FORTRAN. The name MATLAB stands for matrix laboratory. MATLAB was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the state of the art in software for matrix computation. MATLAB has evolved over a period of years with input from many users. In university environments, it is the standard instructional tool for introductory and advanced courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice for high-productivity research, development, and analysis. MATLAB features a family of add-on application-specific solutions called toolboxes. Very important to most users of MATLAB, toolboxes allow you to learn and apply specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. Areas in which toolboxes are available include signal processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and many others.
The Command History Window contains a record of the commands a user has entered in the command window, including both the current and previous MATLAB sessions. Previously entered MATLAB commands can be selected and re-executed from the command history window by right-clicking on a command or sequence of commands. This action launches a menu from which various options can be selected in addition to executing the commands, a useful feature when experimenting with various commands in a work session. Using the MATLAB Editor to Create M-Files: The MATLAB editor is both a text editor specialized for creating M-files and a graphical MATLAB debugger. The editor can appear in a window by itself, or it can be a subwindow in the desktop. M-files are denoted by the extension .m, as in pixelup.m. The MATLAB editor window has numerous pull-down menus for tasks such as saving, viewing, and debugging files. Because it performs some simple checks and uses color to differentiate between various elements of code, this text editor is recommended as the tool of choice for writing and editing M-functions. Typing edit filename at the prompt opens the M-file filename.m in an editor window, ready for editing. As noted earlier, the file must be in the current directory or in a directory on the search path. Getting Help: The principal way to get help online is to use the MATLAB help browser, opened as a separate window either by clicking on the question mark symbol (?) on the desktop toolbar or by typing helpbrowser at the prompt in the command window. The Help Browser is a web browser integrated into the MATLAB desktop that displays Hypertext Markup Language (HTML) documents. It consists of two panes: the help navigator pane, used to find information, and the display pane, used to view it. Self-explanatory tabs in the navigator pane are used to perform a search.
4.3 Commands
Uigetfile: Open standard dialog box for retrieving files Description
Uigetfile displays a modal dialog box that lists files in the current directory and enables the user to select or type the name of a file to be opened. If the filename is valid and if the file exists, uigetfile returns the filename when the user clicks Open. Otherwise uigetfile displays an appropriate error message from which control returns to the dialog box. The user can then enter another filename or click Cancel. If the user clicks Cancel or closes the dialog window, uigetfile returns 0. Aviinfo: Information about Audio/Video Interleaved (AVI) file Description Fileinfo = aviinfo (filename) It returns a structure whose fields contain information about the AVI file specified in the string filename. If filename does not include an extension, then .avi is used. The file must be in the current working directory or in a directory on the MATLAB path. Aviread: Read Audio/Video Interleaved (AVI) file Description Mov = aviread (filename) reads the AVI movie filename into the MATLAB movie structure mov. If filename does not include an extension, then .avi is used. Use the movie function to view the movie mov. frame2im: Convert movie frame to indexed image Description [X, Map] = frame2im (F) converts the single movie frame F into the indexed image X and associated colormap Map. The functions getframe and im2frame create a movie frame. If the frame contains true-color data, then Map is empty.
Im2frame: Convert indexed image into movie frame Description f = im2frame(X, map) converts the indexed image X and associated colormap map into a movie frame f. If X is a truecolor (m-by-n-by-3) image, then map is optional and has no effect. Imwrite: Write image to graphics file Description Imwrite(X, map, filename, fmt) writes the indexed image in X and its associated colormap map to filename in the format specified by fmt. If X is of class uint8 or uint16, imwrite writes the actual values in the array to the file. If X is of class double, imwrite offsets the values in the array before writing, using uint8(X-1). Map must be a valid MATLAB colormap. Note that most image file formats do not support colormaps with more than 256 entries. When writing multiframe GIF images, X should be a 4-dimensional M-by-N-by-1-by-P array, where P is the number of frames to write. Imread: Read image from graphics file Description A = imread (filename, fmt) reads a grayscale or color image from the file specified by the string filename. If the file is not in the current directory or in a directory on the MATLAB path, specify the full pathname. Movie: Play recorded movie frames Description Movie plays the movie defined by a matrix whose columns are movie frames (usually produced by getframe). movie(M) plays the movie in matrix M once, using the current axes as the default target. To play the movie in the figure instead of the axes, specify the figure handle (or gcf) as the first argument: movie(figure_handle, ...). M must be an array of movie frames (usually from getframe).
Chapter 5
Higher Order SVD Analysis for Dynamic Texture Synthesis: videos such as Flame, Pond, and Grass are given as input, and the output video obtained is compressed to one third the size of the input video.
Figure 5.1: Output Frame for Input Flame Video
Description: This is one of the output frames obtained from the given input video after compression. The following parameters were obtained from the compressed video:
input_file_size = 20505600
output_file_size = 6835200
compression = 3 (the output file size is the input file size compressed 3 times)
compression_ratio = 0.3333 (i.e., 33.3333%)
psnr = 37.2036
Figure 5.2: Output Frame for Input Pond Video
Description: This is one of the output frames obtained from the given input video after compression. The following parameters were obtained from the compressed video:
input_file_size = 39744000
output_file_size = 13248000
compression = 3 (the output file size is the input file size compressed 3 times)
compression_ratio = 0.3333 (i.e., 33.3333%)
psnr = 40.8908
Figure 5.3: Output Frame for Input Grass Video
Description: Figure 5.3 is one of the output frames obtained from the given input video after compression. The following parameters were obtained from the compressed video:
input_file_size = 9676800
output_file_size = 3225600
compression = 3 (the output file size is the input file size compressed 3 times)
compression_ratio = 0.3333 (i.e., 33.3333%)
psnr = 45.4285
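The parameters reported in the figures above can be reproduced directly from the file sizes and frame data. A hedged NumPy sketch of the same computations (the report itself computes them in MATLAB; the function names here are our own):

```python
import numpy as np

def compression_stats(input_size, output_size):
    """Compression factor and compression ratio (as a percentage)."""
    compression = input_size / output_size
    ratio_percent = 100.0 * output_size / input_size
    return compression, ratio_percent

def psnr(original, compressed, peak=255.0):
    """Peak signal-to-noise ratio between two frames, in dB."""
    mse = np.mean((np.asarray(original, float) - np.asarray(compressed, float)) ** 2)
    return 20.0 * np.log10(peak / np.sqrt(mse))

# Figures from the Flame video above: 20505600 -> 6835200 bytes
compression, ratio = compression_stats(20505600, 6835200)
assert compression == 3.0
assert abs(ratio - 33.3333) < 1e-3
```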
Conclusion
Here it is proposed to decompose the multidimensional signal that represents a dynamic texture by using a tensor decomposition technique. As opposed to techniques that unfold the multidimensional signal onto a 2-D matrix, our method analyzes the data in their original dimensions. This decomposition, only recently applied in image and video processing, makes it possible to better exploit the spatial, temporal, and chromatic correlation between the pixels of the video sequence, leading to an important decrease in model size. Compared to algorithms where the unfolding operations are performed in 2-D, or where the spatial information is exploited by carrying out the analysis in the Fourier domain, this method yields models with on average five times fewer coefficients while still ensuring the same visual quality. Despite being a suboptimal solution for the tensor decomposition, the HOSVD ensures close-to-optimal energy compaction and approximation error. The suboptimality derives from the fact that the HOSVD is computed directly from the SVD, without the expensive iterative algorithms used for the optimal solution. This is an advantage, since the analysis can be done faster and with less computational power. The small number of model parameters permits synthesis in real time. Moreover, the small memory occupancy favours the use of the HOSVD-based model in architectures characterized by constraints on memory and computational power, such as PDAs or mobile phones.
APPENDIX
Source Code
clear all; clc;
[filename, pathname] = uigetfile('*.avi');      % select the input video file
str2 = '.bmp';
file = aviinfo(filename);                       % get information about the video file
frm_cnt = file.NumFrames;                       % number of frames in the video file
for i = 1:frm_cnt
    frm(i) = aviread(filename, i);              % read frame i of the video file
    frm_name = frame2im(frm(i));                % convert the movie frame to an image
    filename1 = strcat(num2str(i), str2);
    imwrite(frm_name, filename1);               % write the image file
end
str3 = '.png';
for j = 1:frm_cnt
    filename_1 = strcat(num2str(j), str2);      % index with j, not i
    D = imread(filename_1);
    [u1, s1, v1] = svd(double(D));              % decompose the image data, not the filename string
    im = u1 * s1 * transpose(v1);
    file_2 = strcat(num2str(j), str3);
    imwrite(im, file_2);
end
for k = 1:frm_cnt
    file_2 = strcat(num2str(k), '.bmp');
    v = imread(file_2);
    [Y, map] = rgb2ind(v, 255);                 % convert to an indexed image
    F(k) = im2frame(flipud(Y), map);            % rebuild a movie frame
    save F F
end
mov = aviread(filename);
[h, w, p] = size(mov(1).cdata);
hf = figure('Name', 'INPUT VIDEO');
set(hf, 'position', [150 150 w h]);
movie(gcf, mov);                                % play the original video
[h, w, p] = size(F(1).cdata);
hf = figure('Name', 'HOSVD COMPRESSED VIDEO');
set(hf, 'position', [150 150 w h]);
movie(gcf, F);                                  % play the compressed video
input_file_size = frm_cnt * size(frm(1).cdata,1) * size(frm(1).cdata,2) * size(frm(1).cdata,3)
output_file_size = frm_cnt * size(F(1).cdata,1) * size(F(1).cdata,2) * size(F(1).cdata,3)
compression = input_file_size / output_file_size
fprintf('output file size is %d times compression of input file size', compression);
compression_ratio = output_file_size / input_file_size
compression_ratio = compression_ratio * 100     % as a percentage
d = double(mov(1).cdata(:,:,1)) - double(F(1).cdata(:,:,1));   % difference of the first frames
mse = mean(d(:).^2);                            % mean squared error
psnr = 20 * log10(255 / sqrt(mse))              % peak signal-to-noise ratio in dB
REFERENCES
[1] G. Doretto, A. Chiuso, Y. Wu, and S. Soatto, "Dynamic textures," Int. J. Comput. Vis., vol. 51, no. 2, pp. 91-109, 2003. [2] G. Doretto, D. Cremers, P. Favaro, and S. Soatto, "Dynamic texture segmentation," in Proc. IEEE Int. Conf. Image Processing, 2003, pp. 1236-1242. [3] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick, "Graphcut textures: Image and video synthesis using graph cuts," in Proc. ACM Siggraph, 2003, pp. 277-286. [4] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Second Edition. [5] A. Schodl, R. Szeliski, D. Salesin, and I. Essa, "Video textures," in Proc. ACM Siggraph, 2000, pp. 489-498.