1. Physical Models: If we know something about the physics of the data generation process,
we can use that information to construct a model.
For example, in speech-related applications, knowledge about the physics of speech production
can be used to construct a mathematical model for the sampled speech process. Sampled
speech can be encoded using this model.
2. Probability Models: The simplest statistical model for the source is to assume that each
letter that is generated by the source is independent of every other letter, and each occurs
with the same probability. We could call this the ignorance model, as it would generally be
useful only when we know nothing about the source. The next step up in complexity is to
keep the independence assumption but remove the equal probability assumption and assign
a probability of occurrence to each letter in the alphabet.
For a source that generates letters from an alphabet A = {a1, a2, ..., aM}, we can have a
probability model P = {P(a1), P(a2), ..., P(aM)}.
4. Composite Source Model: In many applications it is not easy to use a single model to
describe the source. In such cases, we can define a composite source, which can be viewed
as a combination or composition of several sources, with only one source being active at
any given time. A composite source can be represented as a number of individual sources
Si , each with its own model Mi and a switch that selects a source Si with probability Pi.
This is an exceptionally rich model and can be used to describe some very complicated
processes.
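A minimal Python sketch of how such a switch can be simulated is given below; the component models and switching probabilities are purely illustrative.

import random

# Composite source: one of several component sources S_i is selected with
# probability P_i, and only that source emits the next symbol.
sources = [
    {"model": {"a": 0.7, "b": 0.3}, "prob": 0.6},   # S1 active with P1 = 0.6
    {"model": {"x": 0.5, "y": 0.5}, "prob": 0.4},   # S2 active with P2 = 0.4
]

def next_symbol(sources):
    # the switch: pick which source is active for this symbol
    s = random.choices(sources, weights=[s["prob"] for s in sources])[0]
    # the selected source generates a symbol according to its own model
    letters, weights = zip(*s["model"].items())
    return random.choices(letters, weights=weights)[0]

print("".join(next_symbol(sources) for _ in range(20)))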
Que-1(C) Explain update procedure of adaptive Huffman coding with suitable example. 7
Solution:
The update procedure requires that the nodes be in a fixed order. This ordering is preserved
by numbering the nodes. The largest node number is given to the root of the tree, and the
smallest number is assigned to the NYT node. The numbers from the NYT node to the root
of the tree are assigned in increasing order from left to right, and from lower level to upper
level. The set of nodes with the same weight makes up a block. The following figure is a
flowchart of the updating procedure.
The function of the update procedure is to preserve the sibling property. In order that the
update procedures at the transmitter and receiver both operate with the same information,
the tree at the transmitter is updated after each symbol is encoded, and the tree at the
receiver is updated after each symbol is decoded. The procedure operates as follows:
After a symbol has been encoded or decoded, the external node corresponding to the
symbol is examined to see if it has the largest node number in its block. If the external
node does not have the largest node number, it is exchanged with the node that has the
largest node number in the block, as long as the node with the higher number is not the
parent of the node being updated. The weight of the external node is then incremented. If
we did not exchange the nodes before incrementing the weight of the node, it is very
likely that the ordering required by the sibling property would be destroyed. Once we have
incremented the weight of the node, we have adapted the Huffman tree at that level. We
then turn our attention to the next level by examining the parent node of the node whose
weight was incremented to see if it has the largest number in its block. If it does not, it is
exchanged with the node with the largest number in the block. Again, an exception to this
is when the node with the higher node number is the parent of the node under
consideration.
Once an exchange has taken place (or it has been determined that there is no need for
an exchange), the weight of the parent node is incremented. We then proceed to a new
parent node and the process is repeated. This process continues until the root of the tree is
reached.
If the symbol to be encoded or decoded has occurred for the first time, a new external
node is assigned to the symbol and a new NYT node is appended to the tree. Both the new
external node and the new NYT node are offspring of the old NYT node. We increment
the weight of the new external node by one. As the old NYT node is the parent of the new
external node, we increment its weight by one and then go on to update all the other nodes
until we reach the root of the tree.
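A simplified Python sketch of this update loop is given below. It assumes the tree, the list of all its nodes, and the node numbering are maintained elsewhere; NYT handling (the first occurrence of a symbol) and the actual encoding/decoding are omitted, and all names are illustrative.

class Node:
    def __init__(self, weight=0, symbol=None, parent=None):
        self.weight = weight
        self.symbol = symbol        # None for internal nodes
        self.parent = parent
        self.left = None
        self.right = None
        self.number = 0             # node number; the root gets the largest number

def block_leader(all_nodes, node):
    # the block is the set of nodes with the same weight;
    # its leader is the node with the largest node number
    block = [n for n in all_nodes if n.weight == node.weight]
    return max(block, key=lambda n: n.number)

def swap_subtrees(a, b):
    # exchange the positions (and node numbers) of two subtrees
    a.number, b.number = b.number, a.number
    pa, pb = a.parent, b.parent
    a_is_left, b_is_left = pa.left is a, pb.left is b
    if a_is_left:
        pa.left = b
    else:
        pa.right = b
    if b_is_left:
        pb.left = a
    else:
        pb.right = a
    a.parent, b.parent = pb, pa

def update(all_nodes, node):
    # walk from the updated external node to the root,
    # preserving the sibling property at every level
    while node is not None:
        leader = block_leader(all_nodes, node)
        # swap with the block leader unless it is this node or its parent
        if leader is not node and leader is not node.parent:
            swap_subtrees(node, leader)
        node.weight += 1
        node = node.parent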
Que-2(A) Given an alphabet A = {a1, a2, a3, a4}, find the first order entropy in the following case: 3
P(a1) = 0.505, P(a2) = ¼, P(a3) = 1/8 and P(a4) =0.12.
Solution:
Entropy is given by the following expression:
H = ∑ P(ai) log2(1/P(ai))
So the entropy for the given probabilities is:
H = 0.505*0.986 + (1/4)*2 + (1/8)*3 + 0.12*3.058
H ≈ 1.74 bits/symbol
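A quick check of this computation in Python:

import math

probs = [0.505, 0.25, 0.125, 0.12]
H = sum(p * math.log2(1.0 / p) for p in probs)
print(round(H, 2))   # approximately 1.74 bits/symbol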
Que-2(B) Determine the minimum variance Huffman code with the given probabilities. 4
P(a1) = 0.2, P(a2) = 0.4, P(a3) = 0.2, P(a4) = 0.1 and P(a5) = 0.1
Solution:
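Since the tree construction itself is easiest to show as a figure, a short Python sketch of one common construction is given instead: ties among equal probabilities are broken by keeping the most recently combined node as high in the sorted list as possible, which yields the minimum-variance code. Names are illustrative.

import heapq

def min_variance_huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> code string."""
    # heap entries: (probability, creation_order, tree); ties on probability
    # are broken by creation_order, so newer combined nodes are merged later
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    order = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, order, (t1, t2)))
        order += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
        return codes
    return assign(heap[0][2], "")

print(min_variance_huffman({"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}))
# expected code lengths: a1, a2, a3 -> 2 bits; a4, a5 -> 3 bits (average 2.2 bits/symbol)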
Que-2(C) The probability model is given by P(a1) = 0.2, P(a2) = 0.3 and P(a3) = 0.5. Find the real
valued tag for the sequence a1a1 a3 a2 a3 a1. (Assume cumulative probability function:
F(0) = 0)
OR 7
Explain Tunstall codes with suitable example.
Solution:
In the Tunstall code, all codewords are of equal length. However, each codeword
represents a different number of letters.
The main advantage of a Tunstall code is that errors in codewords do not propagate, unlike
other variable-length codes, such as Huffman codes, in which an error in one codeword
will cause a series of errors to occur.
The design of a code that has a fixed codeword length but a variable number of symbols
per codeword should satisfy the following conditions:
1. We should be able to parse a source output sequence into sequences of symbols that
appear in the codebook.
2. We should maximize the average number of source symbols represented by each
codeword.
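A small Python construction sketch that satisfies both conditions above is given below; the alphabet, probabilities, and codeword length are illustrative.

import heapq

def tunstall_codebook(alphabet, n_bits):
    """alphabet: dict letter -> probability. Returns the list of parsed sequences."""
    limit = 2 ** n_bits
    # max-heap via negated probabilities
    heap = [(-p, seq) for seq, p in alphabet.items()]
    heapq.heapify(heap)
    # expanding one entry removes it and adds len(alphabet) new ones
    while len(heap) + len(alphabet) - 1 <= limit:
        p, seq = heapq.heappop(heap)          # most probable entry so far
        for letter, q in alphabet.items():
            heapq.heappush(heap, (p * q, seq + letter))
    return sorted(seq for _, seq in heap)

# Example: a 3-bit Tunstall code for A = {a, b} with P(a) = 0.7, P(b) = 0.3
print(tunstall_codebook({"a": 0.7, "b": 0.3}, 3))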
Que-3(A) Determine whether the following codes are uniquely decodable or not.
1. {0, 01, 11, 111} 2. {0, 10, 110, 111}
(a) The codeword 0 is a prefix for the codeword 01.
The dangling suffix is 1.
Add 1 to the list.
The codeword 11 is a prefix for the codeword 111.
The dangling suffix is 1, which is already in the list.
There is no other codeword that is a prefix of another codeword.
The new list is: {0, 01, 11, 111, 1}
In this new list, 1 is a prefix of the codeword 11, and the dangling suffix is again 1, which
is already in the list. However, 1 is also a prefix of the codeword 111, and the resulting
dangling suffix 11 is itself a codeword, so the first code {0, 01, 11, 111} is not uniquely
decodable.
(b) In {0, 10, 110, 111} no codeword is a prefix of any other codeword, so it is a prefix
code and therefore uniquely decodable.
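A small Python sketch of this dangling-suffix (Sardinas-Patterson) test, applied to both codes:

def is_uniquely_decodable(code):
    code = set(code)
    # initial dangling suffixes: one codeword is a prefix of another
    suffixes = {b[len(a):] for a in code for b in code if a != b and b.startswith(a)}
    seen = set()
    while suffixes:
        new = set()
        for s in suffixes:
            if s in code:
                return False          # a dangling suffix is itself a codeword
            for c in code:
                if c.startswith(s):   # suffix is a prefix of a codeword
                    new.add(c[len(s):])
                if s.startswith(c):   # codeword is a prefix of the suffix
                    new.add(s[len(c):])
        seen |= suffixes
        suffixes = {x for x in new if x and x not in seen}
    return True

print(is_uniquely_decodable(["0", "01", "11", "111"]))   # False
print(is_uniquely_decodable(["0", "10", "110", "111"]))  # True (prefix code)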
The Linde–Buzo–Gray algorithm (introduced by Yoseph Linde, Andrés Buzo and Robert
M. Gray in 1980) is a vector quantization algorithm to derive a good codebook.
For grammatical reasons, documents are going to use different forms of a word,
such as organize, organizes, and organizing. Additionally, there are families of
derivationally related words with similar meanings, such
as democracy, democratic, and democratization. In many situations, it seems as
if it would be useful for a search for one of these words to return documents that
contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form. For
instance:
am, are, is → be
car, cars, car's, cars' → car
The result of this mapping of text will be something like:
the boy's cars are different colors → the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma. If confronted with the token saw, stemming might return
just s, whereas lemmatization would attempt to return either see or saw depending on
whether the use of the token was as a verb or as a noun.
The most common algorithm for stemming English, and one that has repeatedly
been shown to be empirically very effective, is Porter's algorithm (Porter, 1980).
The entire algorithm is too long and intricate to present here, but we will indicate
its general nature. Porter's algorithm consists of 5 phases of word reductions,
applied sequentially. Within each phase there are various conventions to select
rules, such as selecting the rule from each rule group that applies to the longest
suffix. In the first phase, this convention is used with the following rule group:
SSES → SS (caresses → caress)
IES → I (ponies → poni)
SS → SS (caress → caress)
S → (cats → cat)
Many of the later rules use a concept of the measure of a word, which loosely
checks the number of syllables to see whether a word is long enough that it is
reasonable to regard the matching portion of a rule as a suffix rather than as part
of the stem of a word. For example, the rule:
(m > 1) EMENT →
would map replacement to replac, but not cement to c. The official site for the
Porter Stemmer is:
http://www.tartarus.org/~martin/PorterStemmer/
Stemmers use language-specific rules, but they require less knowledge than a
lemmatizer, which needs a complete vocabulary and morphological analysis to
correctly lemmatize words. Particular domains may also require special
stemming rules. However, the exact stemmed form does not matter, only the
equivalence classes it forms.
Rather than using a stemmer, you can use a lemmatizer, a tool from Natural
Language Processing which does full morphological analysis to accurately
identify the lemma for each word. Doing full morphological analysis produces at
most very modest benefits for retrieval. It is hard to say more, because either
form of normalization tends not to improve English information retrieval
performance in aggregate - at least not by very much. While it helps a lot for
some queries, it equally hurts performance a lot for others. Stemming increases
recall while harming precision. As an example of what can go wrong, note that
the Porter stemmer stems all of operate, operating, operates, operation, operative,
operatives, and operational to oper.
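A small illustration of the stemming/lemmatization contrast, assuming NLTK is installed (the word list is illustrative, and the WordNet data must be downloaded once):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the lemmatizer's vocabulary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["operate", "operating", "operates", "operational", "saw"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# The stemmer chops suffixes heuristically, while the lemmatizer uses a
# vocabulary to return a dictionary form (e.g., saw -> see when used as a verb).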
Alternate:
Use the ⌊log2 m⌋-bit binary representation of r for the first 2^⌈log2 m⌉ − m values of r,
and the ⌈log2 m⌉-bit binary representation of r + 2^⌈log2 m⌉ − m for the rest of the values.
Represent the quotient q by its unary code, where q = ⌊n/m⌋ and r = n − qm.
For m = 5: ⌈log2 5⌉ = 3, ⌊log2 5⌋ = 2, and 2^3 − 5 = 3, so r = 0, 1, 2 → 2-bit codes
00, 01, 10 and r = 3, 4 → 3-bit codes 110, 111.
For n = 21 → q = 4 (unary code 11110) and r = 1 (code 01), giving the codeword 1111001.
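A short Python sketch of this encoder, assuming the unary code for q is q ones followed by a zero:

import math

def golomb_encode(n, m):
    q, r = divmod(n, m)
    code = "1" * q + "0"                      # unary code for the quotient
    b = math.ceil(math.log2(m))
    if m & (m - 1) == 0:                      # m is a power of two (Rice code)
        code += format(r, "0{}b".format(b)) if m > 1 else ""
    elif r < 2 ** b - m:                      # first 2^b - m remainders: (b - 1) bits
        code += format(r, "0{}b".format(b - 1))
    else:                                     # remaining remainders: b bits of r + 2^b - m
        code += format(r + 2 ** b - m, "0{}b".format(b))
    return code

print(golomb_encode(21, 5))   # q = 4, r = 1 -> 11110 + 01 = 1111001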
data (3, 5)
structure (3, 6)
The word “hello” is in document 1 (“hello everyone”) starting at word 1, so it has the
entry (1, 1), and the word “is” is in documents 2 and 3 at the 3rd and 2nd word positions
respectively, giving the entries (2, 3) and (3, 2).
The index may have weights, frequencies, or other indicators.
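A minimal Python sketch of building such a positional index is given below; only document 1 ("hello everyone") is taken from the text above, and documents 2 and 3 are assumed for illustration.

from collections import defaultdict

docs = {
    1: "hello everyone",
    2: "my name is alice",      # "is" at word position 3 (assumed text)
    3: "this is bob",            # "is" at word position 2 (assumed text)
}

# each term maps to a list of (document_id, word_position) entries
index = defaultdict(list)
for doc_id, text in docs.items():
    for position, word in enumerate(text.split(), start=1):
        index[word].append((doc_id, position))

print(index["hello"])   # [(1, 1)]
print(index["is"])      # [(2, 3), (3, 2)]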
In lossless compression, in order to minimize the average number of bits per input symbol,
we assigned shorter codewords to symbols that occurred with higher probability and longer
codewords to symbols that occurred with lower probability. In an analogous fashion, in
order to decrease the average distortion, we can try to approximate the input better in
regions of high probability, perhaps at the cost of worse approximations in regions of lower
probability. We can do this by making the quantization intervals smaller in those regions
that have more probability mass.
For an input distribution peaked at the origin, we would have smaller intervals near the
origin. If we wanted to keep the number of
intervals constant, this would mean we would have larger intervals away from the origin. A
quantizer that has nonuniform intervals is called a nonuniform quantizer.
An example of a nonuniform quantizer is shown in the figure below.
Notice that the intervals closer to zero are smaller. Hence the maximum value that the
quantizer error can take on is also smaller, resulting in a better approximation. We pay for
this improvement in accuracy at lower input levels by incurring larger errors when the
input falls in the outer intervals. However, as the probability of getting smaller input values
is much higher than getting larger signal values, on the average the distortion will be lower
than if we had a uniform quantizer. While a nonuniform quantizer provides lower average
distortion, the design of nonuniform quantizers is also somewhat more complex. However,
the basic idea is quite straightforward: find the decision boundaries and reconstruction
levels that minimize the mean squared quantization error. We look at the design of
nonuniform quantizers in more detail in the following sections.
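A short sketch of the Lloyd iteration that does exactly this for a set of training samples is given below, assuming NumPy is available; the training distribution and the number of levels are illustrative.

import numpy as np

def lloyd_quantizer(samples, num_levels, iterations=50):
    # initialize reconstruction levels uniformly over the data range
    levels = np.linspace(samples.min(), samples.max(), num_levels)
    for _ in range(iterations):
        boundaries = (levels[:-1] + levels[1:]) / 2.0      # midpoints between levels
        bins = np.digitize(samples, boundaries)            # assign samples to intervals
        for i in range(num_levels):
            members = samples[bins == i]
            if members.size:                                # centroid update
                levels[i] = members.mean()
    return levels

# Laplacian-like input peaked at zero: the resulting intervals are small near the origin
rng = np.random.default_rng(0)
samples = rng.laplace(loc=0.0, scale=1.0, size=10000)
print(np.round(lloyd_quantizer(samples, 8), 3))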
There may be no match at all, so the output cannot contain just pointers. In LZ77 the sequence is
encoded in the form of a triple <o, l, c>, where ‘o’ stands for an offset to the match, ‘l’
represents the length of the match, and ‘c’ denotes the next symbol to be encoded. When
there is no match, a null pointer is generated (both the offset and the match length are
equal to 0) together with the first symbol in the look-ahead buffer, i.e. (0, 0, ”character”).
The offset and the match length must be limited to some maximum constants; moreover,
the compression performance of LZ77 depends mainly on these values.
Algorithm:
while (look-ahead buffer is not empty) {
    get a pointer (position, length) to the longest match;
    if (length > 0) {
        output (position, longest match length, next symbol);
        shift the window (length + 1) positions along;
    } else {
        output (0, 0, first symbol in the look-ahead buffer);
        shift the window 1 character along;
    }
}
Example:
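As an illustration, a minimal LZ77 encoder sketch in Python; the window sizes, names, and input string are illustrative, and the match search is a simple brute-force scan.

def lz77_encode(data, window_size=16, lookahead_size=8):
    """Encode data as a list of (offset, length, next_symbol) triples."""
    triples = []
    i = 0
    while i < len(data):
        search_start = max(0, i - window_size)
        best_offset, best_length = 0, 0
        # find the longest match that starts in the search buffer
        for j in range(search_start, i):
            length = 0
            while (length < lookahead_size
                   and i + length < len(data) - 1        # keep one symbol to emit as 'c'
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        next_symbol = data[i + best_length]
        triples.append((best_offset, best_length, next_symbol))
        i += best_length + 1
    return triples

print(lz77_encode("cabracadabrarrarrad"))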
Instead of retrieving large units and identifying subelements (top down), we can also
search all leaves, select the most relevant ones and then extend them to larger units in
postprocessing (bottom up).
The least restrictive approach is to index all elements. This is also problematic. Many XML
elements are not meaningful search results, e.g., typographical elements
like <b>definitely</b> or an ISBN number which cannot be interpreted without context.
Also, indexing all elements means that search results will be highly redundant. For the
query Macbeth's castle and the document in Figure 10.1, we would return all of the play,
act, scene and title elements on the path between the root node and Macbeth's castle. The
leaf node would then occur four times in the result set, once directly and three times as part
of other elements. We call elements that are contained within each other nested. Returning
redundant nested elements in a list of returned hits is not very user-friendly.
Because of the redundancy caused by nested elements it is common to restrict the set of
elements that are eligible to be returned. Restriction strategies include:
discard all small elements
discard all element types that users do not look at (this requires a working XML
retrieval system that logs this information)
discard all element types that assessors generally do not judge to be relevant (if
relevance assessments are available)
only keep element types that a system designer or librarian has deemed to be useful
search results
In most of these approaches, result sets will still contain nested elements. Thus, we may
want to remove some elements in a postprocessing step to reduce redundancy.
Alternatively, we can collapse several nested elements in the results list and
use highlighting of query terms to draw the user's attention to the relevant passages. If
query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes
little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section
and the paragraph both occur in the results list, it is sufficient to show the section. An
additional advantage of this approach is that the paragraph is presented together with its
context (i.e., the embedding section). This context may be helpful in interpreting the
paragraph (e.g., the source of the information reported) even if the paragraph on its own
satisfies the query.
If the user knows the schema of the collection and is able to specify the desired type of
element, then the problem of redundancy is alleviated as few nested elements have the
same type. But as we discussed in the introduction, users often don't know what the name
of an element in the collection is (Is the Vatican a country or a city?) or they may not know
how to compose structured queries at all.
Given a character sequence and a defined document unit, tokenization is the task of
chopping it up into pieces, called tokens, perhaps at the same time throwing away
certain characters, such as punctuation. Here is an example of tokenization:
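A minimal tokenization sketch in Python; the sentence and the simple word-character rule are illustrative, and real tokenizers handle many more cases.

import re

text = "Friends, Romans, Countrymen, lend me your ears;"
tokens = re.findall(r"\w+", text)   # keep maximal runs of word characters, drop punctuation
print(tokens)
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']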
These tokens are often loosely referred to as terms or words, but it is sometimes
important to make a type/token distinction. A token is an instance of a sequence of
characters in some particular document that are grouped together as a useful semantic
unit for processing. A type is the class of all tokens containing the same character
sequence. A term is a (perhaps normalized) type that is included in the IR system's
dictionary. The set of index terms could be entirely distinct from the tokens, for
instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR
systems they are strongly related to the tokens in the document. However, rather than
being exactly the tokens that appear in the document, they are usually derived from
them by various normalization processes.