1. Physical Models: If we know something about the physics of the data generation process,
we can use that information to construct a model.
For example, in speech-related applications, knowledge about the physics of speech production
can be used to construct a mathematical model for the sampled speech process. Sampled
speech can be encoded using this model.
2. Probability Models: The simplest statistical model for the source is to assume that each
letter that is generated by the source is independent of every other letter, and each occurs
with the same probability. We could call this the ignorance model, as it would generally be
useful only when we know nothing about the source. The next step up in complexity is to
keep the independence assumption but remove the equal probability assumption and assign
a probability of occurrence to each letter in the alphabet.
For a source that generates letters from an alphabet A = {a1, a2, ..., aM}, we can have a
probability model P = {P(a1), P(a2), ..., P(aM)}.
4. Composite Source Model: In many applications it is not easy to use a single model to
describe the source. In such cases, we can define a composite source, which can be viewed
as a combination or composition of several sources, with only one source being active at
any given time. A composite source can be represented as a number of individual sources
Si , each with its own model Mi and a switch that selects a source Si with probability Pi.
This is an exceptionally rich model and can be used to describe some very complicated
processes.
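A minimal Python sketch of how such a switch can be simulated is given below; the component models and switching probabilities are purely illustrative.

import random

# Composite source: one of several component sources S_i is selected with
# probability P_i, and only that source emits the next symbol.
sources = [
    {"model": {"a": 0.7, "b": 0.3}, "prob": 0.6},   # S1 active with P1 = 0.6
    {"model": {"x": 0.5, "y": 0.5}, "prob": 0.4},   # S2 active with P2 = 0.4
]

def next_symbol(sources):
    # the switch: pick which source is active for this symbol
    s = random.choices(sources, weights=[s["prob"] for s in sources])[0]
    # the selected source generates a symbol according to its own model
    letters, weights = zip(*s["model"].items())
    return random.choices(letters, weights=weights)[0]

print("".join(next_symbol(sources) for _ in range(20)))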
Que-1(C) Explain update procedure of adaptive Huffman coding with suitable example. 7
Solution:
The update procedure requires that the nodes be in a fixed order. This ordering is preserved
by numbering the nodes. The largest node number is given to the root of the tree, and the
smallest number is assigned to the NYT node. The numbers from the NYT node to the root
of the tree are assigned in increasing order from left to right, and from lower level to upper
level. The set of nodes with the same weight makes up a block. The following figure is a
flowchart of the updating procedure.
The function of the update procedure is to preserve the sibling property. In order that the
update procedures at the transmitter and receiver both operate with the same information,
the tree at the transmitter is updated after each symbol is encoded, and the tree at the
receiver is updated after each symbol is decoded. The procedure operates as follows:
After a symbol has been encoded or decoded, the external node corresponding to the
symbol is examined to see if it has the largest node number in its block. If the external
node does not have the largest node number, it is exchanged with the node that has the
largest node number in the block, as long as the node with the higher number is not the
parent of the node being updated. The weight of the external node is then incremented. If
we did not exchange the nodes before incrementing the weight of the node, it is very
likely that the ordering required by the sibling property would be destroyed. Once we have
incremented the weight of the node, we have adapted the Huffman tree at that level. We
then turn our attention to the next level by examining the parent node of the node whose
weight was incremented to see if it has the largest number in its block. If it does not, it is
exchanged with the node with the largest number in the block. Again, an exception to this
is when the node with the higher node number is the parent of the node under
consideration.
Once an exchange has taken place (or it has been determined that there is no need for
an exchange), the weight of the parent node is incremented. We then proceed to a new
parent node and the process is repeated. This process continues until the root of the tree is
reached.
If the symbol to be encoded or decoded has occurred for the first time, a new external
node is assigned to the symbol and a new NYT node is appended to the tree. Both the new
external node and the new NYT node are offspring of the old NYT node. We increment
the weight of the new external node by one. As the old NYT node is the parent of the new
external node, we increment its weight by one and then go on to update all the other nodes
until we reach the root of the tree.
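A simplified Python sketch of this update loop is given below. It assumes the tree, the list of all its nodes, and the node numbering are maintained elsewhere; NYT handling (the first occurrence of a symbol) and the actual encoding/decoding are omitted, and all names are illustrative.

class Node:
    def __init__(self, weight=0, symbol=None, parent=None):
        self.weight = weight
        self.symbol = symbol        # None for internal nodes
        self.parent = parent
        self.left = None
        self.right = None
        self.number = 0             # node number; the root gets the largest number

def block_leader(all_nodes, node):
    # the block is the set of nodes with the same weight;
    # its leader is the node with the largest node number
    block = [n for n in all_nodes if n.weight == node.weight]
    return max(block, key=lambda n: n.number)

def swap_subtrees(a, b):
    # exchange the positions (and node numbers) of two subtrees
    a.number, b.number = b.number, a.number
    pa, pb = a.parent, b.parent
    a_is_left, b_is_left = pa.left is a, pb.left is b
    if a_is_left:
        pa.left = b
    else:
        pa.right = b
    if b_is_left:
        pb.left = a
    else:
        pb.right = a
    a.parent, b.parent = pb, pa

def update(all_nodes, node):
    # walk from the updated external node to the root,
    # preserving the sibling property at every level
    while node is not None:
        leader = block_leader(all_nodes, node)
        # swap with the block leader unless it is this node or its parent
        if leader is not node and leader is not node.parent:
            swap_subtrees(node, leader)
        node.weight += 1
        node = node.parent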
Que-2(A) Given an alphabet A = {a1, a2, a3, a4}, find the first order entropy in the following case: 3
P(a1) = 0.505, P(a2) = ¼, P(a3) = 1/8 and P(a4) =0.12.
Solution:
Entropy is given by the following expression:
H = ∑ P(ai) log2(1/P(ai))
So the entropy for the given probabilities is:
H = 0.505*0.986 + (1/4)*2 + (1/8)*3 + 0.12*3.058
H ≈ 1.74 bits/symbol
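A quick check of this computation in Python:

import math

probs = [0.505, 0.25, 0.125, 0.12]
H = sum(p * math.log2(1.0 / p) for p in probs)
print(round(H, 2))   # approximately 1.74 bits/symbol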
Que-2(B) Determine the minimum variance Huffman code with the given probabilities. 4
P(a1) = 0.2, P(a2) = 0.4, P(a3) = 0.2, P(a4) = 0.1 and P(a5) = 0.1
Solution:
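Since the tree construction itself is easiest to show as a figure, a short Python sketch of one common construction is given instead: ties among equal probabilities are broken by keeping the most recently combined node as high in the sorted list as possible, which yields the minimum-variance code. Names are illustrative.

import heapq

def min_variance_huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> code string."""
    # heap entries: (probability, creation_order, tree); ties on probability
    # are broken by creation_order, so newer combined nodes are merged later
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    order = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, order, (t1, t2)))
        order += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
        return codes
    return assign(heap[0][2], "")

print(min_variance_huffman({"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}))
# expected code lengths: a1, a2, a3 -> 2 bits; a4, a5 -> 3 bits (average 2.2 bits/symbol)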
Que-2(C) The probability model is given by P(a1) = 0.2, P(a2) = 0.3 and P(a3) = 0.5. Find the real
valued tag for the sequence a1a1 a3 a2 a3 a1. (Assume cumulative probability function:
F(0) = 0)
OR 7
Explain Tunstall codes with suitable example.
Solution:
In the Tunstall code, all codewords are of equal length. However, each codeword
represents a different number of letters.
The main advantage of a Tunstall code is that errors in codewords do not propagate, unlike
other variable-length codes, such as Huffman codes, in which an error in one codeword
will cause a series of errors to occur.
The design of a code that has a fixed codeword length but a variable number of symbols
per codeword should satisfy the following conditions:
1. We should be able to parse a source output sequence into sequences of symbols that
appear in the codebook.
2. We should maximize the average number of source symbols represented by each
codeword.
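A small Python construction sketch that satisfies both conditions above is given below; the alphabet, probabilities, and codeword length are illustrative.

import heapq

def tunstall_codebook(alphabet, n_bits):
    """alphabet: dict letter -> probability. Returns the list of parsed sequences."""
    limit = 2 ** n_bits
    # max-heap via negated probabilities
    heap = [(-p, seq) for seq, p in alphabet.items()]
    heapq.heapify(heap)
    # expanding one entry removes it and adds len(alphabet) new ones
    while len(heap) + len(alphabet) - 1 <= limit:
        p, seq = heapq.heappop(heap)          # most probable entry so far
        for letter, q in alphabet.items():
            heapq.heappush(heap, (p * q, seq + letter))
    return sorted(seq for _, seq in heap)

# Example: a 3-bit Tunstall code for A = {a, b} with P(a) = 0.7, P(b) = 0.3
print(tunstall_codebook({"a": 0.7, "b": 0.3}, 3))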
Que-3(A) Determine whether the following codes are uniquely decodable or not.
1. {0, 01, 11, 111} 2. {0, 10, 110, 111}
(a) The codeword 0 is a prefix for the codeword 01.
The dangling suffix is 1.
Add 1 to the list.
The codeword 11 is a prefix for the codeword 111.
The dangling suffix is 1, which is already in the list.
There is no other codeword that is a prefix of another codeword.
The new list is: {0, 01, 11, 111, 1}
In this new list, 1 is a prefix of the codeword 11, and the dangling suffix is again 1, which
is already in the list. However, 1 is also a prefix of the codeword 111, and the resulting
dangling suffix 11 is itself a codeword, so the first code {0, 01, 11, 111} is not uniquely
decodable.
(b) In {0, 10, 110, 111} no codeword is a prefix of any other codeword, so it is a prefix
code and therefore uniquely decodable.
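A small Python sketch of this dangling-suffix (Sardinas-Patterson) test, applied to both codes:

def is_uniquely_decodable(code):
    code = set(code)
    # initial dangling suffixes: one codeword is a prefix of another
    suffixes = {b[len(a):] for a in code for b in code if a != b and b.startswith(a)}
    seen = set()
    while suffixes:
        new = set()
        for s in suffixes:
            if s in code:
                return False          # a dangling suffix is itself a codeword
            for c in code:
                if c.startswith(s):   # suffix is a prefix of a codeword
                    new.add(c[len(s):])
                if s.startswith(c):   # codeword is a prefix of the suffix
                    new.add(s[len(c):])
        seen |= suffixes
        suffixes = {x for x in new if x and x not in seen}
    return True

print(is_uniquely_decodable(["0", "01", "11", "111"]))   # False
print(is_uniquely_decodable(["0", "10", "110", "111"]))  # True (prefix code)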
The Linde–Buzo–Gray algorithm (introduced by Yoseph Linde, Andrés Buzo and Robert
M. Gray in 1980) is a vector quantization algorithm to derive a good codebook.
For grammatical reasons, documents are going to use different forms of a word,
such as organize, organizes, and organizing. Additionally, there are families of
derivationally related words with similar meanings, such
as democracy, democratic, and democratization. In many situations, it seems as
if it would be useful for a search for one of these words to return documents that
contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form. For
instance:
am, are, is → be
car, cars, car's, cars' → car
The result of this mapping of text will be something like:
the boy's cars are different colors → the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma. If confronted with the token saw, stemming might return
just s, whereas lemmatization would attempt to return either see or saw depending on
whether the use of the token was as a verb or as a noun.
The most common algorithm for stemming English, and one that has repeatedly
been shown to be empirically very effective, is Porter's algorithm (Porter, 1980).
The entire algorithm is too long and intricate to present here, but we will indicate
its general nature. Porter's algorithm consists of 5 phases of word reductions,
applied sequentially. Within each phase there are various conventions to select
rules, such as selecting the rule from each rule group that applies to the longest
suffix. In the first phase, this convention is used with the following rule group:
SSES → SS (caresses → caress)
IES → I (ponies → poni)
SS → SS (caress → caress)
S → (cats → cat)
Many of the later rules use a concept of the measure of a word, which loosely
checks the number of syllables to see whether a word is long enough that it is
reasonable to regard the matching portion of a rule as a suffix rather than as part
of the stem of a word. For example, the rule:
(m > 1) EMENT →
would map replacement to replac, but not cement to c. The official site for the
Porter Stemmer is:
http://www.tartarus.org/~martin/PorterStemmer/
Stemmers use language-specific rules, but they require less knowledge than a
lemmatizer, which needs a complete vocabulary and morphological analysis to
correctly lemmatize words. Particular domains may also require special
stemming rules. However, the exact stemmed form does not matter, only the
equivalence classes it forms.
Rather than using a stemmer, you can use a lemmatizer, a tool from Natural
Language Processing which does full morphological analysis to accurately
identify the lemma for each word. Doing full morphological analysis produces at
most very modest benefits for retrieval. It is hard to say more, because either
form of normalization tends not to improve English information retrieval
performance in aggregate - at least not by very much. While it helps a lot for
some queries, it equally hurts performance a lot for others. Stemming increases
recall while harming precision. As an example of what can go wrong, note that
the Porter stemmer stems all of operate, operating, operates, operation, operative,
operatives, and operational to oper.
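A small illustration of the stemming/lemmatization contrast, assuming NLTK is installed (the word list is illustrative, and the WordNet data must be downloaded once):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the lemmatizer's vocabulary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["operate", "operating", "operates", "operational", "saw"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# The stemmer chops suffixes heuristically, while the lemmatizer uses a
# vocabulary to return a dictionary form (e.g., saw -> see when used as a verb).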
Alternate:
Use the ⌊log2 m⌋-bit binary representation of r for the first 2^⌈log2 m⌉ − m values of r,
and the ⌈log2 m⌉-bit binary representation of r + 2^⌈log2 m⌉ − m for the rest of the values.
Represent the quotient q by its unary code, where q = ⌊n/m⌋ and r = n − qm.
For m = 5: ⌈log2 5⌉ = 3, ⌊log2 5⌋ = 2, and 2^3 − 5 = 3, so r = 0, 1, 2 → 2-bit codes
00, 01, 10 and r = 3, 4 → 3-bit codes 110, 111.
For n = 21 → q = 4 (unary code 11110) and r = 1 (code 01), giving the codeword 1111001.
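A short Python sketch of this encoder, assuming the unary code for q is q ones followed by a zero:

import math

def golomb_encode(n, m):
    q, r = divmod(n, m)
    code = "1" * q + "0"                      # unary code for the quotient
    b = math.ceil(math.log2(m))
    if m & (m - 1) == 0:                      # m is a power of two (Rice code)
        code += format(r, "0{}b".format(b)) if m > 1 else ""
    elif r < 2 ** b - m:                      # first 2^b - m remainders: (b - 1) bits
        code += format(r, "0{}b".format(b - 1))
    else:                                     # remaining remainders: b bits of r + 2^b - m
        code += format(r + 2 ** b - m, "0{}b".format(b))
    return code

print(golomb_encode(21, 5))   # q = 4, r = 1 -> 11110 + 01 = 1111001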
data (3, 5)
structure (3, 6)
The word “hello” is in document 1 (“hello everyone”) starting at word 1, so it has the
entry (1, 1), and the word “is” is in documents 2 and 3 at the 3rd and 2nd word positions
respectively, giving the entries (2, 3) and (3, 2).
The index may have weights, frequencies, or other indicators.
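A minimal Python sketch of building such a positional index is given below; only document 1 ("hello everyone") is taken from the text above, and documents 2 and 3 are assumed for illustration.

from collections import defaultdict

docs = {
    1: "hello everyone",
    2: "my name is alice",      # "is" at word position 3 (assumed text)
    3: "this is bob",            # "is" at word position 2 (assumed text)
}

# each term maps to a list of (document_id, word_position) entries
index = defaultdict(list)
for doc_id, text in docs.items():
    for position, word in enumerate(text.split(), start=1):
        index[word].append((doc_id, position))

print(index["hello"])   # [(1, 1)]
print(index["is"])      # [(2, 3), (3, 2)]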
In lossless compression, in order to minimize the average number of bits per input symbol,
we assigned shorter codewords to symbols that occurred with higher probability and longer
codewords to symbols that occurred with lower probability. In an analogous fashion, in
order to decrease the average distortion, we can try to approximate the input better in
regions of high probability, perhaps at the cost of worse approximations in regions of lower
probability. We can do this by making the quantization intervals smaller in those regions
that have more probability mass.
For an input distribution peaked at the origin, we would have smaller intervals near the
origin. If we wanted to keep the number of
intervals constant, this would mean we would have larger intervals away from the origin. A
quantizer that has nonuniform intervals is called a nonuniform quantizer.
An example of a nonuniform quantizer is shown in the figure below.
Notice that the intervals closer to zero are smaller. Hence the maximum value that the
quantizer error can take on is also smaller, resulting in a better approximation. We pay for
this improvement in accuracy at lower input levels by incurring larger errors when the
input falls in the outer intervals. However, as the probability of getting smaller input values
is much higher than getting larger signal values, on the average the distortion will be lower
than if we had a uniform quantizer. While a nonuniform quantizer provides lower average
distortion, the design of nonuniform quantizers is also somewhat more complex. However,
the basic idea is quite straightforward: find the decision boundaries and reconstruction
levels that minimize the mean squared quantization error. We look at the design of
nonuniform quantizers in more detail in the following sections.
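A short sketch of the Lloyd iteration that does exactly this for a set of training samples is given below, assuming NumPy is available; the training distribution and the number of levels are illustrative.

import numpy as np

def lloyd_quantizer(samples, num_levels, iterations=50):
    # initialize reconstruction levels uniformly over the data range
    levels = np.linspace(samples.min(), samples.max(), num_levels)
    for _ in range(iterations):
        boundaries = (levels[:-1] + levels[1:]) / 2.0      # midpoints between levels
        bins = np.digitize(samples, boundaries)            # assign samples to intervals
        for i in range(num_levels):
            members = samples[bins == i]
            if members.size:                                # centroid update
                levels[i] = members.mean()
    return levels

# Laplacian-like input peaked at zero: the resulting intervals are small near the origin
rng = np.random.default_rng(0)
samples = rng.laplace(loc=0.0, scale=1.0, size=10000)
print(np.round(lloyd_quantizer(samples, 8), 3))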
There may be no match at all, so the output cannot contain just pointers. In LZ77 the sequence is
encoded in the form of a triple <o, l, c>, where ‘o’ stands for an offset to the match, ‘l’
represents the length of the match, and ‘c’ denotes the next symbol to be encoded. When
there is no match, a null pointer is generated (both the offset and the match length are
equal to 0) together with the first symbol in the look-ahead buffer, i.e. (0, 0, ”character”).
The offset and the match length must be limited to some maximum constants; moreover,
the compression performance of LZ77 depends mainly on these values.
Algorithm:
while (look-ahead buffer is not empty) {
    get a pointer (position, length) to the longest match;
    if (length > 0) {
        output (position, longest match length, next symbol);
        shift the window (length + 1) positions along;
    } else {
        output (0, 0, first symbol in the look-ahead buffer);
        shift the window 1 character along;
    }
}
Example:
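As an illustration, a minimal LZ77 encoder sketch in Python; the window sizes, names, and input string are illustrative, and the match search is a simple brute-force scan.

def lz77_encode(data, window_size=16, lookahead_size=8):
    """Encode data as a list of (offset, length, next_symbol) triples."""
    triples = []
    i = 0
    while i < len(data):
        search_start = max(0, i - window_size)
        best_offset, best_length = 0, 0
        # find the longest match that starts in the search buffer
        for j in range(search_start, i):
            length = 0
            while (length < lookahead_size
                   and i + length < len(data) - 1        # keep one symbol to emit as 'c'
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        next_symbol = data[i + best_length]
        triples.append((best_offset, best_length, next_symbol))
        i += best_length + 1
    return triples

print(lz77_encode("cabracadabrarrarrad"))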
Instead of retrieving large units and identifying subelements (top down), we can also
search all leaves, select the most relevant ones and then extend them to larger units in
postprocessing (bottom up).
The least restrictive approach is to index all elements. This is also problematic. Many XML
elements are not meaningful search results, e.g., typographical elements
like <b>definitely</b> or an ISBN number which cannot be interpreted without context.
Also, indexing all elements means that search results will be highly redundant. For the
query Macbeth's castle and the document in Figure 10.1, we would return all of the play,
act, scene and title elements on the path between the root node and Macbeth's castle. The
leaf node would then occur four times in the result set, once directly and three times as part
of other elements. We call elements that are contained within each other nested. Returning
redundant nested elements in a list of returned hits is not very user-friendly.
Because of the redundancy caused by nested elements it is common to restrict the set of
elements that are eligible to be returned. Restriction strategies include:
discard all small elements
discard all element types that users do not look at (this requires a working XML
retrieval system that logs this information)
discard all element types that assessors generally do not judge to be relevant (if
relevance assessments are available)
only keep element types that a system designer or librarian has deemed to be useful
search results
In most of these approaches, result sets will still contain nested elements. Thus, we may
want to remove some elements in a postprocessing step to reduce redundancy.
Alternatively, we can collapse several nested elements in the results list and
use highlighting of query terms to draw the user's attention to the relevant passages. If
query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes
little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section
and the paragraph both occur in the results list, it is sufficient to show the section. An
additional advantage of this approach is that the paragraph is presented together with its
context (i.e., the embedding section). This context may be helpful in interpreting the
paragraph (e.g., the source of the information reported) even if the paragraph on its own
satisfies the query.
If the user knows the schema of the collection and is able to specify the desired type of
element, then the problem of redundancy is alleviated as few nested elements have the
same type. But as we discussed in the introduction, users often don't know what the name
of an element in the collection is (Is the Vatican a country or a city?) or they may not know
how to compose structured queries at all.
Given a character sequence and a defined document unit, tokenization is the task of
chopping it up into pieces, called tokens, perhaps at the same time throwing away
certain characters, such as punctuation. Here is an example of tokenization:
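A minimal tokenization sketch in Python; the sentence and the simple word-character rule are illustrative, and real tokenizers handle many more cases.

import re

text = "Friends, Romans, Countrymen, lend me your ears;"
tokens = re.findall(r"\w+", text)   # keep maximal runs of word characters, drop punctuation
print(tokens)
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']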
These tokens are often loosely referred to as terms or words, but it is sometimes
important to make a type/token distinction. A token is an instance of a sequence of
characters in some particular document that are grouped together as a useful semantic
unit for processing. A type is the class of all tokens containing the same character
sequence. A term is a (perhaps normalized) type that is included in the IR system's
dictionary. The set of index terms could be entirely distinct from the tokens, for
instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR
systems they are strongly related to the tokens in the document. However, rather than
being exactly the tokens that appear in the document, they are usually derived from
them by various normalization processes.