
Data Compression Techniques

GROUP#1

Assignment# 1 Introduction to Data Compression


In computer science and information theory, data compression involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying marginally important information and removing it. Compression is useful because it helps reduce the consumption of resources such as storage space or transmission capacity. Because compressed data must be decompressed to be used, this extra processing imposes computational or other costs through decompression. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed, and the option to decompress the video in full before watching it may be inconvenient or require additional storage. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (e.g., when using lossy data compression), and the computational resources required to compress and decompress the data.

Lossless
Lossless data compression algorithms usually exploit statistical redundancy to represent data more concisely without losing information. Lossless compression is possible because most real-world data has statistical redundancy. For example, an image may have areas of colour that do not change over several pixels; instead of coding "red pixel, red pixel, ..." the data may be encoded as "279 red pixels". This is a simple example of run-length encoding; there are many schemes to reduce size by eliminating redundancy.

The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format.

The very best modern lossless compressors use probabilistic models, such as prediction by partial matching. The Burrows-Wheeler transform can also be viewed as an indirect form of statistical modelling. The class of grammar-based codes has recently attracted attention because such codes can compress highly repetitive text extremely well, for instance biological data collections of the same or related species, huge versioned document collections, internet archives, etc.

The basic task of grammar-based codes is constructing a context-free grammar deriving a single string. Sequitur and Re-Pair are practical grammar compression algorithms for which public code is available.
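Since the LZ family's table-based substitution is central to the discussion above, the following is a minimal sketch of an LZW-style compressor; the function name and the byte-oriented interface returning a list of integer codes are illustrative assumptions, not any particular library's API.

def lzw_compress(data: bytes) -> list:
    # Dictionary initially holds all single-byte strings (codes 0..255).
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                      # keep extending the current match
        else:
            out.append(table[w])        # emit code for the longest known match
            table[wc] = next_code       # add the new string to the table
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

Repeated substrings are replaced by ever-longer dictionary entries, which is why such methods do well on redundant input; a matching decompressor can rebuild the same table from the codes alone.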

In a further refinement of these techniques, statistical predictions can be coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image compression standard JBIG and the document compression standard DjVu. The text entry system Dasher is an inverse arithmetic coder.

Lossy
Lossy data compression is contrasted with lossless data compression. In these schemes, some loss of information is acceptable. Depending upon the application, detail can be dropped from the data to save storage space. Generally, lossy data compression schemes are guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" less-important visual information. There is a corresponding trade-off between information lost and the size reduction. A number of popular compression formats exploit these perceptual differences, including those used in music files, images, and video. Lossy image compression is used in digital cameras, to increase storage capacities with minimal degradation of picture quality. Similarly, DVDs use the lossy MPEG-2 Video codec for video compression. In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony for example, while audio compression is used for CD ripping and is decoded by audio players.


Assignment# 2 Data Compression Tools


7-Zip (Windows/Linux, Free)
7-Zip is a free, open-source file archive utility with a spare interface but a powerful feature set. With support for most popular compression formats (and quite a few not-so-popular ones), this lightweight, open-source option does the job quickly and without fuss. While some 7-Zip users complain about its spare interface, others are happy with 7-Zip's no-nonsense approach and fast operation.

IZArc (Windows, Freeware)
IZArc may take home the prize for the most supported read and write formats among the tools listed here. IZArc is also the only featured archiver apart from PeaZip that distributes a portable version on its web site (though third parties have made other apps portable, like 7-Zip Portable). Users go for IZArc for its attractive interface and its low price tag: IZArc is freeware, but donations are accepted.

WinRAR (Windows, Shareware)
WinRAR is a powerful file compression and decompression tool that has been around since 1993. It is also one of the few archivers capable of writing RAR archives, though overall it is limited to creating only RARs or ZIPs. WinRAR costs a fairly steep $29 for a license, but several users are happy to suffer through the nag screens to avoid the cost.

PeaZip (Windows and Linux, Free)
PeaZip is a free and open-source archive manager that supports a boatload of formats. Unlike its open-source sister 7-Zip, PeaZip also has a very attractive interface, from the main application window down to the desktop icons it uses when you set it as your default compression tool. Like IZArc, it is also available in a portable version, so even if you don't go with it as your default, it's worth tossing on your thumb drive just in case you need a little compression on the road.

The Unarchiver (Mac OS X, Freeware)
The Unarchiver is a popular free file decompression utility for Mac OS X. Unlike Windows, which only supports the ZIP format out of the box, The Unarchiver handles most major formats. The catch: The Unarchiver is a read-only application, so if you're on a Mac and you want to write to more obscure archive types than ZIP, you may need to add an extra tool to your arsenal. Most OS X users, however, are happy to stick with The Unarchiver for all their decompression needs.

OTHERS ARE:
Universal Extractor is a popular Windows compression tool because it integrates nicely with Windows Explorer. This free utility will open compressed files for viewing and extraction, as well as other file types such as executables and database files. Universal Extractor detects the type of file you want to open; you can open files even if they have no extension, so extracting files is always fast and easy. Universal Extractor only extracts files: it does not have the ability to create compressed files.

KGB Archiver has one of the highest compression ratios available when it uses its own PAQ7 algorithm. In fact, with KGB you can shrink files to less than one-fifth of their size. This program works with KGB and ZIP files, so you'll use it more to compress than to decompress.


KGB Archiver also integrates with Universal Extractor and incorporates the option to use AES-256 encryption for files you create. Version two of KGB Archiver is now in beta and is available for Windows only; the current release of KGB (1.x) runs on Windows, Linux, or Macintosh. This is a free utility.

TUGZip is a robust compression tool that supports scripting and other powerful options. File archives in ZIP, 7-ZIP, A, ACE, ARC, ARJ, BH, BZ2, CAB, CPIO, DEB, GCA, GZ, IMP, JAR, LHA (LZH), LIB, RAR, RPM, SQX, TAR, TGZ, TBZ, TAZ, YZ1 and ZOO formats can all be opened in TUGZip, and disk images in BIN, C2D, IMG, ISO and NRG formats can also be opened. TUGZip will create 7-ZIP, BH, BZ2, CAB, JAR, LHA (LZH), SQX, TAR, TGZ, YZ1 and ZIP files. The program integrates well with the Windows shell and enables powerful browsing and drag-and-drop functionality, as well as several encryption formats. TUGZip is free and runs only on Windows.


Assignment# 3 Image Compression


The objective of image compression is to reduce irrelevance and redundancy of the image data in order to be able to store or transmit data in an efficient form.

Lossy and lossless compression


Image compression may be lossy or lossless. Lossless compression is preferred for archival purposes and often for medical imaging, technical drawings, clip art, or comics. This is because lossy compression methods, especially when used at low bit rates, introduce compression artifacts. Lossy methods are especially suitable for natural images such as photographs in applications where a minor (sometimes imperceptible) loss of fidelity is acceptable to achieve a substantial reduction in bit rate. Lossy compression that produces imperceptible differences may be called visually lossless.

Methods for lossless image compression are:
Run-length encoding (used as the default method in PCX and as one of the possible methods in BMP, TGA and TIFF)
DPCM and predictive coding
Entropy encoding
Adaptive dictionary algorithms such as LZW (used in GIF and TIFF)
Deflation (used in PNG, MNG and TIFF)
Chain codes

Methods for lossy compression:
Reducing the color space to the most common colors in the image. The selected colors are specified in the color palette in the header of the compressed image. Each pixel just references the index of a color in the color palette. This method can be combined with dithering to avoid posterization.
Chroma subsampling. This takes advantage of the fact that the human eye perceives spatial changes of brightness more sharply than those of color, by averaging or dropping some of the chrominance information in the image (see the sketch after this list).
Transform coding. This is the most commonly used method. A Fourier-related transform such as the DCT or the wavelet transform is applied, followed by quantization and entropy coding.
Fractal compression.
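As a rough illustration of the chroma subsampling item above, the sketch below averages each 2x2 block of a chroma plane, which is the idea behind 4:2:0 subsampling; the function name and the plain list-of-rows image representation are assumptions made for this example.

def subsample_chroma_420(chroma):
    # chroma: a list of rows of chroma values; even width and height assumed.
    # Each 2x2 block is replaced by its average, halving both dimensions.
    h, w = len(chroma), len(chroma[0])
    out = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            block_sum = (chroma[y][x] + chroma[y][x + 1] +
                         chroma[y + 1][x] + chroma[y + 1][x + 1])
            row.append(block_sum // 4)
        out.append(row)
    return out

The luma plane is kept at full resolution, so the eye's lower spatial sensitivity to color is what hides the loss.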

Other properties

The best image quality at a given bit rate (or compression rate) is the main goal of image compression; however, there are other important properties of image compression schemes:

Scalability generally refers to a quality reduction achieved by manipulation of the bitstream or file (without decompression and re-compression). Other names for scalability are progressive coding or embedded bitstreams. Despite its contrary nature, scalability may also be found in lossless codecs, usually in the form of coarse-to-fine pixel scans. Scalability is especially useful for previewing images while downloading them (e.g., in a web browser) or for providing variable-quality access to, e.g., databases. There are several types of scalability:
Quality progressive or layer progressive: the bitstream successively refines the reconstructed image.
Resolution progressive: first encode a lower image resolution, then encode the difference to higher resolutions.
Component progressive: first encode grey, then color.

Region of interest coding. Certain parts of the image are encoded with higher quality than others. This may be combined with scalability (encode these parts first, others later).
Meta information. Compressed data may contain information about the image which may be used to categorize, search, or browse images. Such information may include color and texture statistics, small preview images, and author or copyright information.
Processing power. Compression algorithms require different amounts of processing power to encode and decode; some high-compression algorithms require a lot of processing power.
The quality of a compression method is often measured by the peak signal-to-noise ratio (PSNR). It measures the amount of noise introduced through a lossy compression of the image; however, the subjective judgment of the viewer is also regarded as an important measure, perhaps the most important one.
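Since PSNR is named above as the usual quality measure, here is a small sketch of how it can be computed from two images given as flat lists of pixel values; the function name and the 8-bit default peak value are assumptions for the example.

import math

def psnr(original, reconstructed, max_value=255):
    # Mean squared error between corresponding pixels.
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")          # identical images: no noise introduced
    # PSNR in decibels: 10 * log10(peak^2 / MSE).
    return 10 * math.log10(max_value ** 2 / mse)

Higher values mean less distortion; identical images give an infinite PSNR.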


Assignment# 4 Audio Compression


Audio data compression, as distinguished from dynamic range compression, reduces the transmission bandwidth and storage requirements of audio data. Audio compression algorithms are implemented in software as audio codecs. Lossy audio compression algorithms provide higher compression at the cost of fidelity and are used in numerous audio applications. These algorithms almost all rely on psychoacoustics to eliminate less audible or meaningful sounds, thereby reducing the space required to store or transmit them. In both lossy and lossless compression, information redundancy is reduced, using methods such as coding, pattern recognition and linear prediction to reduce the amount of information used to represent the uncompressed data.

The acceptable trade-off between loss of audio quality and transmission or storage size depends upon the application. For example, one 640 MB compact disc (CD) holds approximately one hour of uncompressed high-fidelity music, less than 2 hours of music compressed losslessly, or 7 hours of music compressed in the MP3 format at a medium bit rate. A digital sound recorder can typically store around 200 hours of clearly intelligible speech in 640 MB.

Lossless audio compression produces a representation of digital data that decompresses to an exact digital duplicate of the original audio stream, unlike playback from lossy compression techniques such as Vorbis and MP3. Compression ratios are around 50-60% of the original size, similar to those for generic lossless data compression. Lossy compression depends upon the quality required, but typically yields files of 5 to 20% of the size of the uncompressed original.

Lossless compression is unable to attain high compression ratios due to the complexity of waveforms and the rapid changes in sound forms. Codecs like FLAC, Shorten and TTA use linear prediction to estimate the spectrum of the signal. Many of these algorithms use convolution with the filter [-1 1] to slightly whiten or flatten the spectrum, thereby allowing traditional lossless compression to work more efficiently. The process is reversed upon decompression.

When audio files are to be processed, either by further compression or for editing, it is desirable to work from an unchanged original (uncompressed or losslessly compressed). Processing of a lossily compressed file for some purpose usually produces a final result inferior to creating the same compressed file from an uncompressed original. In addition to sound editing or mixing, lossless audio compression is often used for archival storage, or for master copies.

A number of lossless audio compression formats exist. Shorten was an early lossless format. Newer ones include Free Lossless Audio Codec (FLAC), Apple's Apple Lossless, MPEG-4 ALS, Microsoft's Windows Media Audio 9 Lossless (WMA Lossless), Monkey's Audio, and TTA. See the list of lossless codecs for a complete list. Some audio formats feature a combination of a lossy format and a lossless correction; this allows stripping the correction to easily obtain a lossy file. Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack, and OptimFROG DualStream.
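To make the [-1 1] whitening filter mentioned above concrete, here is a minimal sketch of the first-difference step and its exact inverse; the function names and the plain list-of-samples interface are assumptions for the example, not how FLAC or TTA are actually implemented.

def first_difference(samples):
    # Store each sample as the difference from its predecessor
    # (convolution with [-1, 1]); this flattens the spectrum of smooth signals.
    prev = 0
    residual = []
    for s in samples:
        residual.append(s - prev)
        prev = s
    return residual

def undo_first_difference(residual):
    # Exact inverse, applied on decompression: a running sum of the residual.
    samples, prev = [], 0
    for d in residual:
        prev += d
        samples.append(prev)
    return samples

The residual values cluster around zero for slowly varying audio, so an entropy coder spends far fewer bits on them than on the raw samples.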

Other formats are associated with a distinct system, such as: Direct Stream Transfer, used in Super Audio CD; and Meridian Lossless Packing, used in DVD-Audio, Dolby TrueHD, Blu-ray and HD DVD.

Lossy audio compression


Lossy audio compression is used in a wide range of applications. In addition to the direct applications (MP3 players or computers), digitally compressed audio streams are used in most video DVDs; digital television; streaming media on the internet; satellite and cable radio; and increasingly in terrestrial radio broadcasts. Lossy compression typically achieves far greater compression than lossless compression (data of 5 percent to 20 percent of the original stream, rather than 50 percent to 60 percent), by discarding less-critical data. The innovation of lossy audio compression was to use psychoacoustics to recognize that not all data in an audio stream can be perceived by the human auditory system. Most lossy compression reduces perceptual redundancy by first identifying sounds which are considered perceptually irrelevant, that is, sounds that are very hard to hear. Typical examples include high frequencies, or sounds that occur at the same time as louder sounds. Those sounds are coded with decreased accuracy or not coded at all.

Due to the nature of lossy algorithms, audio quality suffers when a file is decompressed and recompressed (digital generation loss). This makes lossy compression unsuitable for storing intermediate results in professional audio engineering applications, such as sound editing and multitrack recording. However, lossy formats are very popular with end users (particularly MP3), as a megabyte can store about a minute's worth of music at adequate quality.


Assignment# 5 Video Compression


Video compression uses modern coding techniques to reduce redundancy in video data. Most video compression algorithms and codecs combine spatial image compression and temporal motion compensation. Video compression is a practical implementation of source coding in information theory. In practice most video codecs also use audio compression techniques in parallel to compress the separate, but combined data streams. The majority of video compression algorithms use lossy compression. Large amounts of data may be eliminated while being perceptually indistinguishable. As in all lossy compression, there is a tradeoff between video quality, cost of processing the compression and decompression, and system requirements. Highly compressed video may present visible or distracting artifacts. Video compression typically operates on square-shaped groups of neighboring pixels, often called macroblocks. These pixel groups or blocks of pixels are compared from one frame to the next and the video compression codec sends only the differences within those blocks. In areas of video with more motion, the compression must encode more data to keep up with the larger number of pixels that are changing. Commonly during explosions, flames, flocks of animals, and in some panning shots, the high-frequency detail leads to quality decreases or to increases in the variable bitrate.

Encoding theory
Video data may be represented as a series of still image frames. The sequence of frames contains spatial and temporal redundancy that video compression algorithms attempt to eliminate or code in a smaller size. Similarities can be encoded by only storing differences between frames, or by using perceptual features of human vision. For example, small differences in color are more difficult to perceive than are changes in brightness. Compression algorithms can average a color across these similar areas to reduce space, in a manner similar to those used in JPEG image compression. Some of these methods are inherently lossy while others may preserve all relevant information from the original, uncompressed video.

One of the most powerful techniques for compressing video is interframe compression. Interframe compression uses one or more earlier or later frames in a sequence to compress the current frame, while intraframe compression uses only the current frame, effectively being image compression. The most commonly used method works by comparing each frame in the video with the previous one. If the frame contains areas where nothing has moved, the system simply issues a short command that copies that part of the previous frame, bit-for-bit, into the next one. If sections of the frame move in a simple manner, the compressor emits a (slightly longer) command that tells the decompressor to shift, rotate, lighten, or darken the copy: a longer command, but still much shorter than intraframe compression. Interframe compression works well for programs that will simply be played back by the viewer, but can cause problems if the video sequence needs to be edited.
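The block-by-block comparison described above can be sketched as follows; frames are represented as plain lists of pixel rows, and the operation names ('skip'/'update') and function name are illustrative assumptions, with real codecs adding motion search, transforms and entropy coding on top.

def encode_frame_delta(prev, cur, block=8, threshold=0):
    # Toy interframe step: for each block-sized tile, emit ('skip', y, x) if it
    # matches the previous frame closely enough, else ('update', y, x, pixels).
    ops = []
    for y in range(0, len(cur), block):
        for x in range(0, len(cur[0]), block):
            tile_cur = [row[x:x + block] for row in cur[y:y + block]]
            tile_prev = [row[x:x + block] for row in prev[y:y + block]]
            diff = sum(abs(c - p)
                       for rc, rp in zip(tile_cur, tile_prev)
                       for c, p in zip(rc, rp))
            if diff <= threshold:
                ops.append(("skip", y, x))          # copy the block from prev
            else:
                ops.append(("update", y, x, tile_cur))
    return ops

A static scene produces mostly 'skip' operations, which is why low-motion video compresses so well, while explosions or panning force many 'update' blocks and drive the bitrate up.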


Because interframe compression copies data from one frame to another, if the original frame is simply cut out (or lost in transmission), the following frames cannot be reconstructed properly. Some video formats, such as DV, compress each frame independently using intraframe compression. Making 'cuts' in intraframe-compressed video is almost as easy as editing uncompressed video: one finds the beginning and ending of each frame, and simply copies bit-for-bit each frame that one wants to keep, and discards the frames one doesn't want. Another difference between intraframe and interframe compression is that with intraframe systems, each frame uses a similar amount of data. In most interframe systems, certain frames (such as "I frames" in MPEG-2) aren't allowed to copy data from other frames, and so require much more data than other frames nearby.

It is possible to build a computer-based video editor that spots problems caused when I frames are edited out while other frames need them. This has allowed newer formats like HDV to be used for editing. However, this process demands a lot more computing power than editing intraframe-compressed video with the same picture quality.

Today, nearly all commonly used video compression methods (e.g., those in standards approved by the ITU-T or ISO) apply a discrete cosine transform (DCT) for spatial redundancy reduction. Other methods, such as fractal compression, matching pursuit and the use of a discrete wavelet transform (DWT), have been the subject of some research, but are typically not used in practical products (except for the use of wavelet coding as still-image coders without motion compensation). Interest in fractal compression seems to be waning, due to recent theoretical analysis showing a comparative lack of effectiveness of such methods.


Assignment# 6 Huffman Coding I


Huffman coding
Huffman coding is based on the frequency of occurrence of a data item (pixels in images). The principle is to use a lower number of bits to encode the data that occurs more frequently. Codes are stored in a Code Book, which may be constructed for each image or a set of images. In all cases the code book plus encoded data must be transmitted to enable decoding. The Huffman algorithm, a bottom-up approach, is briefly summarized as follows:

1. Initialization: put all nodes in an OPEN list, and keep it sorted at all times (e.g., ABCDE).
2. Repeat until the OPEN list has only one node left:
(a) From OPEN pick the two nodes having the lowest frequencies/probabilities and create a parent node for them.
(b) Assign the sum of the children's frequencies/probabilities to the parent node and insert it into OPEN.
(c) Assign code 0, 1 to the two branches of the tree, and delete the children from OPEN.
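The steps above can be sketched as follows, using a heap in place of the sorted OPEN list; the function name and tie-breaking are assumptions for the example, so the exact code words may differ from the table below even though the total bit count comes out the same.

import heapq
from collections import Counter

def huffman_codes(data):
    # Count symbol frequencies, then repeatedly merge the two lowest-frequency
    # nodes (step 2 above) until a single tree remains.
    freq = Counter(data)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate single-symbol input
        return {sym: "0" for sym in heap[0][2]}
    tick = len(heap)                          # unique tie-breaker for the heap
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}          # branch 0
        merged.update({s: "1" + c for s, c in right.items()})   # branch 1
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

With counts A:15, B:7, C:6, D:6, E:5 the code lengths come out as 1, 3, 3, 3, 3, giving the 87-bit total shown in the example table below.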

Symbol   Count   log(1/p)   Code   Subtotal (# of bits)
------   -----   --------   ----   --------------------
A        15      1.38       0      15
B        7       2.48       100    21
C        6       2.70       101    18
D        6       2.70       110    18
E        5       2.96       111    15
                                   TOTAL (# of bits): 87

The following points are worth noting about the above algorithm:
Decoding is trivial as long as the coding table (the statistics) is sent before the data. (There is a small overhead for sending this, negligible if the data file is big.)
Unique prefix property: no code is a prefix of any other code (all symbols are at the leaf nodes), which makes decoding unambiguous and easy for the decoder.
If prior statistics are available and accurate, then Huffman coding is very good.
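To illustrate why the unique prefix property makes decoding unambiguous, here is a small hypothetical decoder that walks the bit string against an inverted code table; the function name and table format are assumptions for the example.

def huffman_decode(bits, code_table):
    # code_table maps symbols to code words, e.g. the table above:
    # {'A': '0', 'B': '100', 'C': '101', 'D': '110', 'E': '111'}.
    inverse = {code: sym for sym, code in code_table.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:             # a complete code word has been read;
            out.append(inverse[buf])   # no other code word can start with it
            buf = ""
    return out

# huffman_decode("0100101", {'A': '0', 'B': '100', 'C': '101'}) -> ['A', 'B', 'C']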

Adaptive Huffman Coding
The basic Huffman algorithm has been extended for the following reasons: (a) the previous algorithm requires statistical knowledge which is often not available (e.g., live audio or video); (b) even when it is available, it can be a heavy overhead, especially when many tables have to be sent, as when a non-order-0 model is used, i.e. one taking into account the impact of the previous symbol on the probability of the current symbol (e.g., "q" and "u" often come together). The solution is to use adaptive algorithms. As an example, Adaptive Huffman Coding is examined below; the idea is, however, applicable to other adaptive compression algorithms.

ENCODER                                  DECODER
-------                                  -------
Initialize_model();                      Initialize_model();
while ((c = getc(input)) != eof)         while ((c = decode(input)) != eof)
{                                        {
    encode(c, output);                       putc(c, output);
    update_model(c);                         update_model(c);
}                                        }

The key is to have both the encoder and the decoder use exactly the same Initialize_model and update_model routines.

update_model does two things: (a) it increments the count, and (b) it updates the Huffman tree.
During the updates, the Huffman tree maintains its sibling property, i.e. the nodes (internal and leaf) are arranged in order of increasing weights (see figure).
When swapping is necessary, the farthest node with weight W is swapped with the node whose weight has just been increased to W+1. Note: if the node with weight W has a subtree beneath it, the subtree goes with it.
The Huffman tree can look very different after node swapping (Fig 7.2); e.g., in the third tree, node A is again swapped and becomes the #5 node. It is now encoded using only 2 bits.


Assignment# 7 Huffman Coding II


Run Length Encoding (RLE)
Run Length Encoding is one of the oldest compression methods. It is characterized by the following properties:

Simple implementation of each RLE algorithm
Compression efficiency restricted to a particular type of contents
Mainly utilized for encoding of monochrome graphic data

The general algorithm behind RLE is very simple. Any sequence of identical symbols will be replaced by a counter identifying the number of repetitions and the particular symbol. For instance, the original contents 'aaaa' would be coded as '4a'. The most important format using RLE is the Microsoft bitmap (RLE8 and RLE4).

Example
For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

If we apply the run-length encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following:

12W1B12W3B24W1B14W

This is to be interpreted as twelve Ws, one B, twelve Ws, three Bs, etc. The run-length code represents the original 67 characters in only 18. Of course, the actual format used for the storage of images is generally binary rather than ASCII characters like this, but the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space.
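The count-and-symbol scheme used in the scan-line example above can be sketched as follows; the function name and the textual output format are assumptions for the example.

def rle_encode(line):
    # Replace each run of identical characters by "<count><character>".
    if not line:
        return ""
    out, run_char, run_len = [], line[0], 1
    for ch in line[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(str(run_len) + run_char)
            run_char, run_len = ch, 1
    out.append(str(run_len) + run_char)     # flush the final run
    return "".join(out)

# rle_encode("W"*12 + "B" + "W"*12 + "B"*3 + "W"*24 + "B" + "W"*14)
# -> '12W1B12W3B24W1B14W'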

However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).

Applications
Run-length encoding performs lossless data compression and is well suited to palette-based iconic images. It does not work well at all on continuous-tone images such as photographs, although JPEG uses it quite effectively on the coefficients that remain after transforming and quantizing image blocks. Common formats for run-length encoded data include Truevision TGA, PackBits, PCX and ILBM. Run-length encoding is used in fax machines (combined with other techniques into Modified Huffman coding); it is relatively efficient because most faxed documents are mostly white space, with occasional interruptions of black.

Implementation
The run-length encoding implementation is as simple as move-to-front, except that the benefit from it is very limited. In fact, because we always have to specify the length of the run we have encoded, an isolated occurrence of one byte uses two bytes in the output. This is why run-length encoding yields a very limited gain in compression ratio and can even perform worse than move-to-front in combination with Huffman coding.

Disadvantages
This algorithm is very easy to implement and does not require much CPU horsepower, but RLE compression is only efficient with files that contain lots of repetitive data. These can be text files if they contain lots of spaces for indenting, but line-art images that contain large white or black areas are far more suitable. Computer-generated colour images (e.g. architectural drawings) can also give fair compression ratios.



Arithmetic Coding (AC)


The aim of arithmetic coding (AC) is to define a method that provides code words with an ideal length. As with every other entropy coder, the probability of appearance of the individual symbols must be known. The AC assigns an interval to each symbol, whose size reflects the probability of the appearance of this symbol. The code word of a symbol is an arbitrary rational number belonging to the corresponding interval. The entire set of data is represented by a rational number which always lies within the interval of each coded symbol; as data is added, the number of significant digits rises continuously.

The development of arithmetic coding started in the 1960s; the crucial algorithms were introduced from the end of the 1970s into the 1980s. Jorma Rissanen and Glen G. Langdon Jr., who presented their work in the "IBM Journal of Research and Development", were significantly involved. Although AC usually provides a better result than the widespread Huffman code, it is applied rarely. At first this was due to the demands on the computer's performance; later, legal considerations became more important, as the companies IBM, AT&T and Mitsubishi are said to hold relevant patents.

Principle of the Arithmetic Coding
The entire data set is represented by a single rational number between 0 and 1. This range is divided into sub-intervals, each representing a certain symbol. The number of sub-intervals is identical to the number of symbols in the current symbol set, and their size is proportional to the probability of appearance. For each symbol in the original data a new interval division takes place, on the basis of the last sub-interval.

Example (encoding the sequence 'b', 'a'):

Symbol   Probability   Start             After 'b'        After 'b','a'
a        0.5 (50%)     [0.0; 0.5)        [0.50; 0.65)     [0.500; 0.575)
b        0.3 (30%)     [0.5; 0.8)        [0.65; 0.74)     [0.575; 0.620)
c        0.2 (20%)     [0.8; 1.0)        [0.74; 0.80)     [0.620; 0.650)
Size of current interval:  1.0           0.30             0.150
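The interval narrowing shown in the table can be sketched as follows; the model dictionary and function name are assumptions for the example.

def narrow(interval, model, symbol):
    # One arithmetic-coding step: shrink the current interval to the
    # sub-interval belonging to `symbol`.
    lo, hi = interval
    width = hi - lo
    s_lo, s_hi = model[symbol]          # symbol's share of the unit interval
    return (lo + width * s_lo, lo + width * s_hi)

model = {'a': (0.0, 0.5), 'b': (0.5, 0.8), 'c': (0.8, 1.0)}
interval = (0.0, 1.0)
for sym in "ba":
    interval = narrow(interval, model, sym)
print(interval)   # roughly (0.5, 0.65): any number in this range encodes "ba"

Each additional symbol multiplies the interval width by that symbol's probability, which is exactly why probable symbols cost few digits and improbable ones cost many.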

Starting with the initial sub-division, the symbol 'b' is coded with a number that is greater than or equal to 0.5 and less than 0.8. If the symbol 'b' stood alone, any number from this interval could be selected.

The encoding of the following symbol is based on the current interval [0.5; 0.8), which is divided again into sub-intervals. If the second symbol is an 'a', then any number from the interval [0.50; 0.65) could be coded. With any further symbol the last current interval has to be divided again and again, and the number of significant digits of the code word increases continuously. Even though the accuracy of the number increases permanently, it is not necessary to process the entire number; with the help of suitable algorithms, parts that are no longer relevant can be eliminated.

Dividing into Intervals
On the basis of a well-known alphabet, the probability of all symbols has to be determined and converted into intervals. The size of an interval depends linearly on the symbol's probability. If this is 50%, for example, then the associated sub-interval covers half of the current interval. Usually the initial interval is [0; 1) for the encoding of the first symbol.

Example sub-division of the interval [0.0; 1.0):

Symbol   Frequency   Probability   Sub-Interval
a        4           0.4           [0.0; 0.4)
b        2           0.2           [0.4; 0.6)
c        2           0.2           [0.6; 0.8)
d        1           0.1           [0.8; 0.9)
e        1           0.1           [0.9; 1.0)
Sum      10          1.0           [0.0; 1.0)
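Converting frequencies into cumulative sub-intervals, as in the table above, might look like the following sketch; the function name is an assumption and the resulting bounds are only exact up to floating-point rounding.

def build_intervals(freqs):
    # Turn symbol frequencies into adjacent sub-intervals of [0, 1),
    # each with a width proportional to the symbol's probability.
    total = sum(freqs.values())
    low = 0.0
    intervals = {}
    for sym, f in freqs.items():        # dicts keep insertion order
        p = f / total
        intervals[sym] = (low, low + p)
        low += p
    return intervals

# build_intervals({'a': 4, 'b': 2, 'c': 2, 'd': 1, 'e': 1})
# -> roughly {'a': (0.0, 0.4), 'b': (0.4, 0.6), 'c': (0.6, 0.8),
#             'd': (0.8, 0.9), 'e': (0.9, 1.0)}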

Example sub-division of the interval [0.9; 1.0):

Symbol   Frequency   Probability   Sub-Interval
a        4           0.4           [0.90; 0.94)
b        2           0.2           [0.94; 0.96)
c        2           0.2           [0.96; 0.98)
d        1           0.1           [0.98; 0.99)
e        1           0.1           [0.99; 1.00)



Example sub-division of the interval [0.99; 1.00):

Symbol   Frequency   Probability   Sub-Interval
a        4           0.4           [0.990; 0.994)
b        2           0.2           [0.994; 0.996)
c        2           0.2           [0.996; 0.998)
d        1           0.1           [0.998; 0.999)
e        1           0.1           [0.999; 1.000)


Accordingly, a rational number in the range from 0.999 up to (but not including) 1.000 represents the string "eee".

Assignment of Intervals to Codes
The symbols are identified by any number that falls into the interval concerned. All values that are greater than or equal to 0.0 and less than 0.4 will be interpreted as 'a', for example.

Examples of assigning codes and intervals:

Symbol   Sub-Interval
a        [0.0; 0.4)
b        [0.4; 0.6)
c        [0.6; 0.8)
d        [0.8; 0.9)
e        [0.9; 1.0)

Code Word   Sub-Interval   Symbol
0.1         [0.0; 0.4)     a
0.2         [0.0; 0.4)     a
0.4         [0.4; 0.6)     b
0.55        [0.4; 0.6)     b
0.99        [0.9; 1.0)     e

All intervals have a closed lower endpoint (indicated by '[') and an open upper endpoint (indicated by ')'). The lower endpoint always belongs to the interval, while the upper endpoint lies outside of it; that is, the value 1 can never be encoded in this context.

Sub-Intervals
After the initial encoding, any further symbol no longer relates to the basic interval [0; 1). The sub-interval identified by the first symbol has to be used for the next step. In the case of an 'a' the new interval would be [0; 0.4), for 'b' it would be [0.4; 0.6).



Assuming a constant probability distribution and the first symbol 'a', the new sub-intervals would be:

Division of the interval after encoding 'a' (old total interval [0.0; 1.0), new total interval [0.0; 0.4)):

Sym.   P(x)   Intervals (in [0.0; 1.0))   Size   Sub-Intervals (in [0.0; 0.4))
a      0.4    [0.0; 0.4)                  0.16   [0.00; 0.16)
b      0.2    [0.4; 0.6)                  0.08   [0.16; 0.24)
c      0.2    [0.6; 0.8)                  0.08   [0.24; 0.32)
d      0.1    [0.8; 0.9)                  0.04   [0.32; 0.36)
e      0.1    [0.9; 1.0)                  0.04   [0.36; 0.40)
Sum    1.0                                0.40

Implementations

In the context of the JPEG compression procedure, arithmetic coding may be used for the final entropy coding of the AC and DC coefficients, presupposing appropriate licenses. Usually, however, Huffman coding is used.
