
2016 49th Hawaii International Conference on System Sciences

Text-based Document Similarity Matching Using sdtext

Clay Shields
Department of Computer Science
Georgetown University
Washington, D.C., 20057
clay@cs.georgetown.edu

I. ABSTRACT

Forensics examiners frequently try to identify duplicate files during an investigation. They might do so to identify known files of interest, or to allow more rapid review of documents that appear to be similar. Current forensic tools for detecting duplicate files operate over the low-level bits of the file, typically using hashing. While this can be a fast and effective method in many cases, it can fail due to differences in file format. We introduce sdtext, a tool developed to identify similar files based on their textual contents, which is robust to changes in format. We show that sdtext is far more accurate than existing tools in matching files that contain the same text in different formats.

II. INTRODUCTION

As technology improves and prices fall, the amount of digital media found in a forensic examination increases. For examiners in the field, this can create problems, as they may recover a large amount of digital media requiring triage to determine what is interesting or relevant. One method of isolating interesting media is to identify which media contains files similar to those already known to be of interest. Similarly, identifying near-duplicate material can allow an examiner to work more quickly through large amounts of data by examining similar documents at the same time rather than piecemeal as they are encountered.

One approach to finding similar files operates on the basis of hashes computed over the bits of the file [1], [2]. In many cases, however, this approach can fail to identify files containing similar content in different formatting or encodings. An example of this is a Microsoft Word document that has been exported as a PDF. While humans would perceive the same text, hashing will likely not identify the documents as being similar. Other examples might be different classes of documents with similar text, like web pages from the same source or documents with similar boilerplate text.

To improve the state of the art in matching documents that contain similar text, we have developed a tool named sdtext, which is available under an open-source license at sdtext.com. It operates by extracting text from files, constructing a dictionary of statistically important terms, then parsing files and recording the presence or absence of those terms. Digests are either hashes of the set of dictionary terms or a bit vector representing which terms are in the dictionary. Digests are portable and can be shared between investigators, and they can be matched against each other to find files that have similar content.

Below, we describe how existing approaches work, and then provide more detail about the operation of sdtext. We then present an experimental comparison of sdtext and sdhash: first, in a general case; next, over files containing text; and finally over files consisting of extracted text. We show that sdtext is able to identify similar files that are textual with much higher accuracy than existing tools, particularly in cases in which the same text is present in different file formats.

A. Similarity Hashing

Existing forensic approaches to duplicate document detection operate based on the bits that encode the file. In the simplest case, cryptographic hashes are used to identify files that are bit-for-bit identical. This is used most often to identify known child pornography and to identify known system and application files to exclude from further examination [3]. These techniques do not work if even a single bit is changed in the file, and consequently they fail to identify similar files that have been subjected to even small edits.

There have been a number of approaches taken to allow hashing to match files based on similarity. All involve breaking the file into smaller pieces and hashing those pieces, then comparing the piecewise hashes to find matches. The difference is in how pieces are created for hashing.

The simplest approach is to use pieces of fixed size, as is done in dcfldd [4]. This allows an examiner who is taking a forensic image to verify that image piecewise; if some piece is corrupted, the integrity of other pieces is maintained. When applied to document recognition, this approach suffers from an alignment problem. Should a single bit be added to or deleted from the file, that block and all following blocks will have differing hashes.

A better approach is called context triggered piecewise hashing (CTPH), and was built into a forensic tool named ssdeep [1]. First used for spam detection [5], CTPH uses a rolling hash to determine when to trigger computation of a

1530-1605/16 $31.00 © 2016 IEEE
DOI 10.1109/HICSS.2016.694
piece hash. The effect of similar files is to trigger the rolling hash in the same places even if some portions are altered. In practice enough similarities remain that many of the piece hashes match.

An issue with CTPH is determining how often to create piece hashes; this frequency is related to parameters controlling the rolling hash. Rather than attempting to determine an optimal frequency, a more accurate approach is to compute hashes for many different average piece sizes at once, then to compare among pieces computed using the same parameters. This approach is used in multi-resolution similarity hashing [2], which operates over a byte stream and creates multiple hashes on that stream with differing rolling-hash parameters. The resulting piece hashes are placed in Bloom filters [6], one for each parameter chosen. Later, when other files are compared, their hashes are compared against previously computed hashes in the Bloom filter.

There are several advantages to using hashes to find similar documents. They are file-format independent and can be used across many different file types. They are also fast to compute and do not require significant storage. Their disadvantage is that they do not reflect the way that computer users see files, which is based on file content and not its encoding. For example, a Word document can be saved in a variety of file formats, such as text, PDF, or HTML. Those files would be considered identical by a person, but would not be similar enough on a bit level to be seen as a match using existing tools. This creates problems for the forensic examiner, who might be trying to match similar files between systems. For example, on one system the file might be an email message in text or .eml format; on the other it might be part of an HTML cache file holding the page from a webmail server. Current similarity tools might miss this, and we show many instances where the probability of such a miss is high.

To help rectify this, we introduce sdtext, which matches files based solely on content, allowing much higher accuracy.

III. SIMILARITY MATCHING WITH sdtext

The operation of sdtext is based on principles from information retrieval, the general area of computer science research that powers search engines such as Google and Bing [7]. The techniques used are those used in duplicate document and plagiarism detection [8] and are novel only in their application to forensics problems. While the IR community has studied duplicate document detection, most work has been on finding similar documents in a single collection of files. This differs from some forensic applications in that an examiner may require a portable digest that can be shared with other examiners without sharing the full set of files from which the digest was first created.

A. Overview of operation

sdtext operates by performing some statistical analysis of the files that are to be compared, or of files which are expected to be similar to those being compared. This analysis consists of computing the inverse document frequency (IDF) of each term found in a file, which, for each term T in a set of documents D, where a document containing term T is denoted as D_T, is defined as:

idf_T = log(|D| / (1 + |D_T|))

The result of this analysis is a dictionary that contains a list of statistically significant terms. As described below, this dictionary is then used in creating file digests. Each digest contains some information about the file as well as a bit array representing information about which dictionary terms were found in the file.

To approximately compare two digests, sdtext uses cosine similarity. This treats each bit array as a vector in a space where the number of dimensions is the same as the number of bits in the array. The two digests are considered similar if the cosine of the angle between the vectors falls within a specified range. This range is a user-selectable parameter that allows the examiner to tune the results they see. This is analogous to looking down the list of results on a search engine; the lower-ranked results might also be a match, though the lower the score, the lower the likelihood.

sdtext therefore operates in three distinct rounds. First, a dictionary must be created, which requires reading files and performing statistical analysis. Second, digests of files are constructed using this dictionary; the dictionary can also be shared with other examiners so that they can create digests as well. Third, digests are compared in order to find matches. We describe each of these operations in more detail below.

B. Text Extraction

The operation of sdtext is predicated on being able to extract text from files found during an investigation. While simple in concept, the number of file formats that exist can make this difficult in some cases. Fortunately, however, there are a variety of approaches that can be used, particularly for common file formats.

For files that contain text but have a proprietary file format, a technology like Oracle's OutsideIn [9] can be used to extract the text. We have included the necessary hooks in sdtext to operate with OutsideIn, but do not distribute OutsideIn with sdtext due to licensing limitations.

For more common file formats, such as Microsoft Office documents or PDF files, there is a wider variety of choices for text extraction. One can use native versions of programs that can read the files and write them as text. A scripting language that can interface with the program can make bulk translation easier; for this paper, we used Applescript and Microsoft Office to convert hundreds of Office files to text and other processable formats.
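As a concrete illustration, the IDF computation defined in the overview above can be sketched as follows. This is an illustrative fragment, not the actual sdtext implementation; the function name compute_idf is hypothetical:

```python
import math

def compute_idf(documents):
    """Compute idf_T = log(|D| / (1 + |D_T|)) for every term T.

    documents: a list of sets, one set of extracted terms per document.
    Returns a dict mapping each term to its IDF score.
    """
    n_docs = len(documents)
    doc_freq = {}  # |D_T|: number of documents containing each term
    for terms in documents:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}
```

Terms that occur in nearly every document drive the ratio toward 1 and the IDF toward or below zero, while rare terms receive high scores; the dictionary-trimming step described later relies on this separation.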

In addition, there are other programs that have reverse-engineered file formats and allow text extraction. We used several, including OpenOffice and unoconv, a command-line tool that uses OpenOffice's libraries, to extract text from Microsoft Office files; pdftotext, a command-line Linux program that extracts text from PDF files; links, a command-line browser that can dump text from HTML files and web sites; and unrtf, a command-line tool that extracts text from RTF files.

In cases in which other methods fail, it is possible to extract text from printable files using optical character recognition. In our experiments below, we used the open-source OCR program tesseract-ocr to extract text from PDF files.

We also note that not all text is the same, as there exist a variety of possible encodings and line endings. Some of the most commonly encountered include ASCII, ISO 8859-1, and UTF-8. Different operating systems also use different markers to indicate line endings. As part of our experiments, we varied the encoding and line endings. We used iconv to convert between encodings; dos2unix to change line endings; and fold to shorten long lines by adding additional line endings.

It is important to note that not all text-extraction mechanisms produce identical text. Optical character recognition adds many additional characters as noise, and other methods treat special characters, like ligatures or smart quotes, differently. Because of this, any method that works to recognize text needs to be robust against small differences in the text. We show below that sdtext is robust in this way, whereas sdhash, due entirely to design differences, is not.

C. Tokenization

During processing, each file that is read is tokenized. The tokenization process splits the text into smaller segments, typically words or lines, each of which is one of the units used in making digests. We note that it is necessary to extract text from files during this process, and described approaches to recovering text in Section III-B, above.

The tokenization to be performed can and will vary, particularly by language. Tokenizers are generally used as a set; each file that is read is passed through all the tokenizers. In sdtext, each tokenizer has a name that reflects its function, and the first tokenizer in the list must be able to read the file to be processed. The current tokenizer set includes file-reading tokenizers that can read from text and gzipped text files; can use Oracle's OutsideIn for text extraction from a variety of file formats; and can use Apache Lucene [10] for processing Chinese and Arabic language documents.

Other tokenizers then filter this text by performing such operations as splitting tokens on punctuation or numerals, or filtering tokens based on length, which can be useful in removing tokens that are a byproduct of text extraction, such as when unencoded or base64-encoded text is part of a document. Some languages, such as English, benefit from stemming, which is the process of removing word endings but leaving the root. This helps ensure that words like "ensured" match "ensuring". sdtext includes the frequently-used Porter Stemmer [11] to accomplish this.

D. Dictionary creation

Creating a dictionary requires a sample of files that are expected to be similar to those being matched. This can be the entire set of documents to be compared, rather than some prior collection, but a set of files in the same language can suffice. Dictionaries are created by processing all files and discovering what tokens appear in which documents. We then extract statistically important terms using the inverse document frequency, or IDF, as described above.

sdtext then trims the set of gathered tokens by IDF, dropping the low-IDF and the high-IDF terms. The reason for this is that low-IDF terms are very common and require storage space while not adding significantly to the process of differentiating files. Similarly, high-IDF terms are removed, as they are so uncommon as to appear in only one or a few different documents in the collection. While they can help identify those few files that contain those terms, they are useless otherwise and can be dropped to preserve space.

Determination of appropriate IDF settings can be made by experimentation, or the examiner can use a heuristic. sdtext includes the ability to test a variety of IDF ranges to determine which would produce the most accurate results, though this can be computationally expensive. Alternatively, it is possible to use language-based heuristics. For example, a normalized IDF range of approximately 0.3 to 0.6 often works well for English-language documents, and is used in our tests below.

E. Digest creation

Once the dictionary has been created, text digests can be created. Each document is tokenized to extract tokens of interest. From this set of tokens, we currently support two different types of digests. First, sdtext supports hash-based digests that are made by taking all dictionary terms that are present, alphabetizing and concatenating them, then hashing the resulting string. This is the approach used in I-Match [12], and produces compact hashes which can be used only for exact matches between files.

A better approach to similarity matching, however, uses bit vectors. Each bit in the vector represents the presence or absence of a dictionary term in the file being processed. In a sense, these are a semantic representation of the file in a lossy format. These digests allow more flexibility in matching, as files that contain a sufficient number of matching terms with a smaller number of non-matching terms will still result in a high similarity score.
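The dictionary trimming and bit-vector digest construction described above can be sketched as follows. This is an illustrative fragment rather than the actual sdtext code: the function names are hypothetical, and min-max scaling is assumed for the IDF normalization, which the text does not specify:

```python
def trim_dictionary(idf, low=0.3, high=0.6):
    """Keep terms whose normalized IDF falls within [low, high].

    idf: dict mapping term -> raw IDF score.
    Normalization is assumed here to be min-max scaling onto [0, 1];
    low-IDF (too common) and high-IDF (too rare) terms are dropped.
    Returns terms in a fixed (sorted) order so that every digest
    assigns the same bit position to the same term.
    """
    lo, hi = min(idf.values()), max(idf.values())
    span = (hi - lo) or 1.0
    return sorted(t for t, v in idf.items() if low <= (v - lo) / span <= high)

def bit_vector_digest(dictionary, document_terms):
    """One bit per dictionary term: 1 if the term occurs in the document."""
    return [1 if term in document_terms else 0 for term in dictionary]
```

Because every digest uses the same term order, two digests can later be compared position by position, which is what the cosine comparison in the next section requires.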

(a) sdhash, score > 5; (b) sdtext, score > 40. Histograms of score frequency.

Figure 1: All-pairs matching of the T5 data set
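The cosine comparison introduced in the overview of Section III-A, which underlies the digest matching discussed next, can be sketched as follows. This is an illustrative fragment with hypothetical names; the 0-100 scaling mirrors how scores are reported but is an assumption about the internals:

```python
import math

def cosine_score(a, b):
    """Cosine similarity of two equal-length bit vectors, scaled to 0-100."""
    dot = sum(x * y for x, y in zip(a, b))
    # For 0/1 vectors, x * x == x, so each norm is the sqrt of the popcount.
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return 0 if norm == 0 else round(100 * dot / norm)

def is_match(a, b, threshold=60):
    """Apply the investigator-chosen cut-off; raising the threshold trades
    recall for precision, while lowering it widens the search."""
    return cosine_score(a, b) >= threshold
```

For example, digests [1, 1, 0, 1] and [1, 1, 0, 0] share two set bits, giving a cosine of 2 / (sqrt(3) * sqrt(2)), approximately 0.816, and a reported score of 82.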

F. Digest Matching

Digests can be matched against each other. For bit-vector digests we use a cosine similarity measure, which is the cosine of the angle between two vectors represented in n-dimensional space. Investigators can specify a parameter that represents the minimum cosine of the angle between the two vectors that can be considered a match, making it possible to control the precision and recall of their searches to a degree. Increasing it results in higher precision but lower recall. An investigator can use this fact to perform initial, highly precise searches, then expand the scope of the searches by decreasing the matching parameter. These later searches will return a wider variety of matches which are less accurate but contain more possibilities.

G. Score Interpretation

As output, sdtext and other similarity matching tools produce an integer between 0 and 100. While this number has mathematical meaning, the fact that two files have a score of 100 does not guarantee that they are identical, and in fact there can be significant differences between such files. It is common for files that share specific boilerplate to match, though the topics may be different. For example, in our data set, described below, we find that web pages from the same site often match, as do calls for grants from the same agency, as they contain standard text.

A better way of thinking about the scores relates to search engine performance. Matching in sdtext is essentially performing a query based on the terms recorded in the digest against the terms in other digests. A high score means it is likely the match will be of interest; a lower score, less so. In general, for English text, an sdtext score above 60 seems to be worth examining. For our experiments with sdhash, we use a cut-off of 20 as a reasonable value, as cited by Roussev [13].

IV. MATCHING PERFORMANCE

We first compare the performance of sdtext against that of sdhash using the methodology and data corpus described by Roussev [13], using a variety of text extraction techniques. We show that in the general case, composed of a corpus that contains a significant proportion of non-textual files, such as images, sdtext does not perform as well as sdhash, as one would expect. We then move on to files that are primarily textual and show that sdtext outperforms sdhash in terms of accurate matching of files that contain similar text in different formats.

We do not include performance comparisons in this paper. In performance, sdhash is far superior. Whereas sdtext may need to read the files twice, once for dictionary creation and once for digest creation, sdhash only has to read the files once. Once digests are created, sdtext must individually compare all digests of interest, resulting in an O(n) or O(n^2) operation. In contrast, sdhash can place all digests in a Bloom filter and do an O(1) operation to determine if there are any matches. That being said, sdtext is highly threaded, and text processing is a relatively fast operation. Our experience with sdtext shows that it runs fast enough, even over large data sets, to be of practical use.

A. t5 corpus

Past work by Roussev has compared the performance of similarity tools [13]. We add to that body of work by performing the same tests using the t5 data corpus he produced and comparing them to the results from the most recent version of sdhash. We do not expect sdtext to perform well on this dataset, as it includes a variety of files, including many that do not contain text.

We show the results in this section as a histogram of scores of individual files from the test set. In general, the more high scores, the better the result; as a heuristic, an sdhash score of 20 or greater would indicate a likely

(a) sdhash; (b) sdtext. Histograms of score frequency.

Figure 2: All-pairs matching, doc corpus

match; for sdtext, a score of 60 or greater might. The results for sdhash are shown in Figure 2a, and those for sdtext in Figure 2b.

For this experiment, we used version 3.3 of sdhash and sdtext version 0.4 with automatic text extraction produced using OutsideIn. In configuring sdtext, we used minimal tokenization in that we removed tokens with punctuation, limited the length of tokens to between 3 and 10 characters, and performed Porter stemming. We then created digests of each file in the t5 corpus, did an all-pairs comparison between digests, and recorded the resulting scores. We performed a similar operation for sdhash, setting the minimum score to report to zero and selecting the option to generate an all-pairs comparison, again recording the scores.

For each tool, the vast majority of pairs do not generate a matching score. For sdhash, 1,073 of 9,965,880 score above the sdhash cut-off score of 20, which is about 0.01%. In comparison, sdtext reports fewer total matches, providing scores for only 5,095,020 pairs. The reason for this is that many of the files in the t5 corpus produce no text when OutsideIn processes them. Files that do not produce text do not produce digests, as such digests would be useless, so they are not included in the matching process. Of the scores reported, 3,389 were above the sdtext cut-off score of 60, which is about 0.07%. Figure 1a shows the results for sdhash; Figure 1b shows those for sdtext.

Surprised at the seeming effectiveness of sdtext in a situation where we did not expect it to perform well, we took a closer look at the results. We noted that the high scores in the 99 and 94 columns consisted almost entirely of JPEG images. Checking the text extracted, we found that OutsideIn was reporting standard language EXIF headers that were causing a match. Removing the image files from comparison gave us the results shown in Figure 3. Examining the remaining hits showed reasonable results: for example, the top hit was between web pages from the same site; the second hit was between data sets in the same format.

Overall, however, sdtext does not perform as well in the general case as sdhash, which is fully expected. In particular, sdtext was unable to create digests for many files due to the lack of text to be extracted. In the next section we describe experiments on textual files and show that sdtext outperforms sdhash in this specific, textual domain.

Figure 3: sdtext scores over T5 data set, score > 40, images removed

B. Experiments with Textual Files

As sdtext is designed to operate effectively over text files, we designed and conducted experiments to demonstrate performance on pure text files. We developed a test corpus, as described below, and converted the text in each Microsoft Word document to PDF, HTML, OCRed text, and text. We then compared each Word document against all other versions of the same document using sdtext and sdhash. Performing experiments in this manner allows us to have a solid ground truth. Each different format of the same file should, ideally, match the others, as each contains the same

text when viewed by humans with the appropriate viewing software.

Corpus Creation: We first selected a random subset of 500 Microsoft Word files from the GOVDOCS [14] corpus. From this set, we removed documents that were only one page, as many of those were blank forms with little text. We then briefly examined each document to ensure it was in English and that, when saved as PDF, it created a single document; many of the .doc files created two or more PDFs when printed. Though we had a goal of 300 text documents, we ended up with 299, as late in testing it was determined that one was in Spanish, so its results were removed. This resulting set of documents, which we will refer to as the doc corpus, is available for download from anonymized-for-review.

Format conversion: For each document, we converted it from Word format to PDF by using the option to print as PDF through Microsoft Office on a Macintosh desktop, through a small program written in Applescript. We similarly converted the documents to HTML in the same manner, keeping only the HTML portion and not the corresponding subdirectories. The PDF files were then run through tesseract-OCR to produce a text version through optical character recognition.

1) All pairs of doc files: Our initial test was to compare each doc file against all others. The results are shown for sdhash in Figure 2a and for sdtext in Figure 2b. For sdtext, we used OutsideIn's text extraction, removed tokens with punctuation, limited the length to 3-10 characters, and performed Porter stemming. This is a demonstration that, for the most part, the set of files extracted are not generally self-similar. It also shows that sdhash identifies matches that sdtext does not.

To determine why this was the case, we examined all sdhash matches that scored above 20 and all sdtext matches that scored above 60. For sdhash, there were 108 such matches. Of those, a brief examination of each match revealed that 49 appeared to be legitimate, mostly due to cases in which the documents were produced by the same agency and had identifying header text and an embedded logo image. The other 59 matches were apparent false positives. Interestingly, most of these false positives involved the same 10 or so files. A hypothesis is that there may be some embedded code that is not visible, perhaps something like a font definition, that is triggering those matches.

In contrast, sdtext had 4 matches above a score of 60, three of which were documents from the same organization, and the fourth of which (and the lowest scored) was a false positive. This shows that each approach has a strength, as sdhash found many matches that sdtext did not, at the cost of many false positives. Meanwhile, sdtext found one match sdhash did not, but was more precise in what it reported.

2) Comparison between doc and PDF: We next moved on to comparison of the same text in different formats. We tested each document only against its re-formatted copy. As mentioned above, this experiment provides a strong measure of ground truth, as each comparison should be true. We started by comparing each doc file against the corresponding PDF file. For this, and all the comparisons between formats below, we used OutsideIn's text extraction, removed tokens with punctuation, limited the length to 3-10 characters, and performed Porter stemming. The results are shown for sdhash in Figure 4a and for sdtext in Figure 4b.

It is clear that something about the PDF conversion process badly affects the ability of sdhash to match a Microsoft Word document to its corresponding PDF file, as none of the matched pairs returned a score other than zero. In contrast, sdtext performed quite accurately, with only two pairs failing to meet a reasonable score, though they were close. This is a 99.33% success rate.

3) Comparison between doc and HTML: As conversion to HTML can be a common transformation, we compared each doc file to its corresponding HTML file. We note that the HTML produced by Microsoft Word is quite verbose, and that a different conversion method or post-processing of the HTML to reduce its complexity might produce different results. The results are shown for sdhash in Figure 5a and for sdtext in Figure 5b.

Again, sdhash performed poorly, matching only 1 pair out of 299, for a success rate of 0.33%. sdtext did much better, using the same tokenization as above, with a 100% success rate.

4) Comparison between doc and OCR text: Though text extraction is possible in many common electronic formats, in the case where the files are in image form or exist on paper, text can be retrieved using optical character recognition. This process, while often effective, can be very noisy and error prone. We note that our OCR corpus was produced using existing PDF files, and is likely cleaner than copies that were scanned from paper. Examination of the resulting text files does, however, show that there are a significant number of additional characters introduced into the files.

Again, as shown in Figure 6a and Figure 6b, sdhash performed poorly in this specific case, showing a 0% success rate, while sdtext missed only 3 matches, for a 98.99% success rate.

5) Comparison between doc and alternate text extractions: Finally, we compared doc files against text files that had been created using the Save as Text function of Microsoft Word. The text produced this way differs from that produced by OutsideIn, which was the default for sdtext in this experiment. The results are shown below in Figure 7a and Figure 7b.

In this test, sdhash was able to match a good portion of the files, scoring 220 out of 299 as matches, for a success rate of 73.6%. Again, sdtext did well with a 100% success rate, showing that it is robust to small differences in text

(a) sdhash; (b) sdtext. Histograms of score frequency.

Figure 4: Matching doc files to PDF files

(a) sdhash; (b) sdtext. Histograms of score frequency.

Figure 5: Matching doc files to HTML files

(a) sdhash; (b) sdtext. Histograms of score frequency.

Figure 6: Matching doc files to OCRed text files

extraction.

6) Results: The results show that sdtext is highly accurate at matching files containing text across a variety of formats. While sdhash is an excellent tool in general use, it performs less well in this specific case.

(a) sdhash; (b) sdtext. Histograms of score frequency.

Figure 7: Matching doc files to text files

(a) sdhash with Deleted Sections; (b) sdtext with Deleted Sections. Average score vs. percent mangled.

Figure 8: Matching text to text with deleted sections

Figure 9: Matching text to text with deleted whitespace ((a) sdhash; (b) sdtext; average score vs. percent mangled)

C. Using sdhash on extracted text

Given that sdtext operates over extracted text while sdhash needs to operate over all formatting, the question naturally arises of whether the text extraction is what provides the advantage. Would sdhash do as well if provided the extracted text to operate on?

Figure 10: Matching text to text with character errors ((a) sdhash; (b) sdtext; average score vs. percent mangled)

To determine if this is the case, we ran a series of experiments over extracted text. For sdhash, we used English-language text and normalized it by making the text lower case, removing punctuation, and replacing all white space with a single space. We refer to sdhash working over extracted text as sdhash-E. For sdtext, we extracted text; split tokens on whitespace; stripped digits and punctuation; removed short and long tokens; and then applied Porter stemming.

We then also applied automatic error insertion into the text at varying levels. These manipulations included deleting a contiguous chunk of the text, deleting whitespace characters with a percentage chance, or changing individual characters to random other characters with some percentage chance. These edits were intended to represent different situations that arise in forensic practice: recovering partially overwritten files might yield only a small portion of the file, and OCR of documents might produce erroneous characters or missed whitespace characters. We show that each of these affects the operation of sdtext and sdhash-E differently, but that sdtext is generally more resilient to errors in the text.

1) Deleting Sections: In the first experiment, we matched a large number of source plain-text files from the Enron data set [15] against the same file with a contiguous section deleted. The results are shown in Figures 8a and 8b, which report the average score over multiple trials for each deletion size. The 95% confidence interval is calculated, but is too close to the plotted points to be visible. As a reminder, scores above 20 are reasonably interpreted as matches for sdhash, and scores above 60 are reasonable matches for sdtext.

The graphs show that both sdhash-E running over extracted text and sdtext do a good job of matching segments of files to the original file. In this case, sdhash-E is able to provide a high-confidence score at a slightly larger deletion percentage, on average matching documents that had 77% of the text removed; sdtext was behind, matching on average with about 70% of the text removed. This indicates that sdhash-E can perform slightly better in the case of matching contiguous fragments of text to files.

2) Deleting or Changing Characters: We also considered other edits typical of the errors that can occur in matching noisy text, such as text created by OCR or recovered with other noise. In this case we tried two different experiments. In the first, we randomly deleted whitespace characters in the file; this is one common type of OCR error. The results are shown in Figures 9a and 9b. In this case, sdhash-E performs very poorly, failing to match files with 10% of the white space removed. sdtext performs much better, matching the files on average with as much as 40% of the whitespace removed.

Similarly, we examined what would happen when OCR-type errors altered individual characters. In this experiment, individual characters were randomly changed with the specified probability. The results are shown in Figures 10a and 10b. Again, sdtext is robust against about 60% of the characters being randomly changed; sdhash-E fails with even 10% of the characters changed.

3) Results: The results show that the different approaches of sdtext and sdhash lead to different performance. Because sdhash creates hashes over sections of text, any change in a section means it will not match the original section. However, given a clean hash of a small section, it will still match the original document. This gives it better performance than sdtext in matching segments, though the difference is not major. On the other hand, sdtext is far more resilient to errors in the characters of a file than sdhash. This indicates that it is likely preferable in situations in which the type of error that might be present in recovered text is not known.

We used the base set of text files created using Microsoft Word as described above. We then compared these files to the set of OCRed files, called the OCR set; a set of files that was created using the unoconv utility on Linux, called the unoconv set; and the base set of text files that had line endings changed with the dos2unix utility and the line length wrapped to 80 characters using the fold utility, called the folded set.
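The normalization and error-insertion steps described above can be sketched as follows. This is a minimal Python illustration, not the actual Java implementation: the token length bounds and the replacement alphabet are assumptions, and the Porter stemming step [11] is omitted.

```python
import random
import re
import string

def normalize_for_sdhash_e(text: str) -> str:
    # sdhash-E normalization: lower case, strip punctuation,
    # collapse every whitespace run to a single space.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def tokenize_for_sdtext(text: str, min_len: int = 3, max_len: int = 12) -> list:
    # sdtext pipeline: split on whitespace, strip digits and punctuation,
    # drop short and long tokens. (Porter stemming would follow here;
    # the length bounds are illustrative assumptions.)
    drop = str.maketrans("", "", string.punctuation + string.digits)
    tokens = (raw.lower().translate(drop) for raw in text.split())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def delete_section(text: str, fraction: float, rng: random.Random) -> str:
    # Remove one contiguous chunk, simulating partial file recovery.
    n = int(len(text) * fraction)
    start = rng.randrange(len(text) - n + 1)
    return text[:start] + text[start + n:]

def delete_whitespace(text: str, p: float, rng: random.Random) -> str:
    # Drop each whitespace character with probability p (an OCR-style error).
    return "".join(c for c in text if not (c.isspace() and rng.random() < p))

def change_chars(text: str, p: float, rng: random.Random) -> str:
    # Replace each character with a random letter with probability p.
    return "".join(rng.choice(string.ascii_lowercase) if rng.random() < p else c
                   for c in text)
```

In the experiments, each mangled copy would then be digested and scored against its unmodified original at increasing mangling percentages.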
V. LIMITATIONS OF sdtext AND FUTURE WORK

While sdtext is very accurate at matching text files, it has limitations compared to existing tools. First, the processing overhead is higher: to match a new set of files, sdtext must read the files once to create the dictionary, then again to compute the signatures. For large data sets this can induce a significant overhead. Second, there is an additional overhead required to load the dictionary. In some cases this can be expensive, particularly when sdtext is run repeatedly to create fingerprints, as the dictionary must be loaded each time. Third, the digests are not hashes and require more computation to match. The hashes produced by sdhash and ssdeep can be inserted into a Bloom filter [6] that later allows very fast checking for the presence of a matching hash. In contrast, the digests produced by sdtext must be checked individually to determine the score between them. Finally, sdtext is written in Java, which does not have the inherent performance of the C or C++ in which other tools are coded.
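The speed advantage of hash-based tools comes from Bloom filter membership tests costing a constant number of bit probes per lookup, regardless of how many hashes were inserted. A minimal sketch of the idea follows; the filter size m and probe count k are illustrative choices, not the parameters used by sdhash or ssdeep.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter [6]: k hash probes into an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, m_bits: int = 8192, k: int = 4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item: str):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # O(k) regardless of how many items were inserted.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(item))
```

A query hash is a match candidate only if all k of its bits are set, which is why checking a new hash against millions of stored ones stays fast; sdtext's digests, by contrast, must each be scored pairwise.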
Our planned future work intends to address some of these issues. First, we plan to place all tokens from each file, as it is read, into a Bloom filter. Instead of reading the file a second time, the Bloom filter will be probed to determine which tokens from the dictionary were present, removing the need to read each file twice. Second, we plan to implement an improved algorithm for all-pairs digest scoring [16]; this approach limits the number of comparisons that need to be made. Third, while sdtext is already heavily threaded, we plan to examine other options for parallelism, including Apache Spark.

VI. CONCLUSION

Similarity matching is an increasingly useful technique for working with large amounts of acquired evidence. Most similarity tools work over the bitstream of the file, which, while useful, does not do well in finding documents with similar text in differing formats. We introduced sdtext, which matches similar text files on the basis of their contents. We showed that while sdtext is not as suitable for finding duplicates in the general case, it is more accurate at identifying textual file matches than current tools.

VII. ACKNOWLEDGEMENTS

I would like to thank Lindsey Neubauer and Evan Bloomberg for their work on sdtext.

REFERENCES

[1] J. D. Kornblum, "Identifying almost identical files using context triggered piecewise hashing," Digital Investigation, vol. 3, no. Supplement-1, pp. 91-97, 2006.

[2] V. Roussev, G. G. Richard III, and L. Marziale, "Multi-resolution similarity hashing," in Digital Forensics Research Conference (DFRWS), 2007, pp. 105-113.

[3] National Institute of Standards and Technology, "National software reference library," http://www.nsrl.nist.gov/, August 2003.

[4] N. Harbour, "DCFLDD," http://dcfldd.sourceforge.net/, March 2002.

[5] A. Tridgell, "Spamsum README," http://samba.org/ftp/unpacked/junkcode/spamsum/README, 2000.

[6] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422-426, 1970.

[7] D. Grossman and O. Frieder, Information Retrieval, 2nd ed. Springer, 2004.

[8] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Trans. on Knowl. and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2007.9

[9] Oracle, "Oracle Outside In Technology," http://www.oracle.com/technology/products/content-management/oit/oit_all.html.

[10] "Lucene homepage," available at http://lucene.apache.org/.

[11] M. F. Porter, "An algorithm for suffix stripping," Readings in Information Retrieval, pp. 313-316, 1997.

[12] A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, "Collection statistics for fast duplicate document detection," ACM Transactions on Information Systems, vol. 20, no. 2, pp. 171-191, 2002.

[13] V. Roussev, "An evaluation of forensic similarity hashes," in Proceedings of the Eleventh Annual DFRWS Conference, New Orleans, LA, August 2011, pp. 34-41.

[14] S. Garfinkel, P. Farrell, V. Roussev, and G. Dinolt, "Bringing science to digital forensics with standardized forensic corpora," in Proceedings of the Ninth Annual DFRWS Conference, Montreal, Canada, August 2009.

[15] B. Klimt and Y. Yang, "Introducing the Enron corpus," in First Conference on Email and Anti-Spam (CEAS), 2004.

[16] R. J. Bayardo, Y. Ma, and R. Srikant, "Scaling up all pairs similarity search," in Proceedings of the 16th International Conference on World Wide Web, ser. WWW '07. New York, NY, USA: ACM, 2007, pp. 131-140.
