Clay Shields
Department of Computer Science
Georgetown University
Washington, D.C., 20057
clay@cs.georgetown.edu
I. ABSTRACT

Forensic examiners frequently try to identify duplicate files during an investigation. They might do so to identify known files of interest, or to allow more rapid review of documents that appear to be similar. Current forensic tools for detecting duplicate files operate over the low-level bits of the file, typically using hashing. While this can be a fast and effective method in many cases, it can fail due to differences in file format. We introduce sdtext, a tool developed to identify similar files based on their textual contents, which is robust to changes in format. We show that sdtext is far more accurate than existing tools in matching files that contain the same text in different formats.

II. INTRODUCTION

As technology improves and prices fall, the amount of digital media found in a forensic examination increases. For examiners in the field, this can create problems, as they may recover a large amount of digital media requiring triage to determine what is interesting or relevant. One method of isolating interesting media is to identify which media contains files similar to those already known to be of interest. Similarly, identifying near-duplicate material can allow an examiner to work more quickly through large amounts of data by examining similar documents at the same time rather than piecemeal as they are encountered.

One approach to finding similar files operates on the basis of hashes computed over the bits of the file [1], [2]. In many cases, however, this approach can fail to identify files containing similar content in different formats or encodings. An example of this is a Microsoft Word document that has been exported as a PDF. While humans would perceive the same text, hashing will likely not identify the documents as being similar. Other examples might be different classes of documents with similar text, like web pages from the same source or documents with similar boilerplate text.

To improve the state of the art in matching documents that contain similar text, we have developed a tool named sdtext, which is available under an open-source license at sdtext.com. It operates by extracting text from files, constructing a dictionary of statistically important terms, then parsing files and recording the presence or absence of those terms. Digests are either hashes of the set of dictionary terms or a bit vector representing which terms are in the dictionary. Digests are portable and can be shared between investigators, and they can be matched against each other to find files that have similar content.

Below, we describe how existing approaches work, and then provide more detail about the operation of sdtext. We then present an experimental comparison of sdtext and sdhash: first, in a general case; next, over files containing text; and finally over files consisting of extracted text. We show that sdtext is able to identify similar textual files with much higher accuracy than existing tools, particularly in cases in which the same text is present in different file formats.

A. Similarity Hashing

Existing forensic approaches to duplicate document detection operate based on the bits that encode the file. In the simplest case, cryptographic hashes are used to identify files that are bit-for-bit identical. This is used most often to identify known child pornography and to identify known system and application files to exclude from further examination [3]. These techniques do not work if even a single bit is changed in the file, and consequently they fail to identify similar files that have been subjected to even small edits.

There have been a number of approaches taken to allow hashing to match files based on similarity. All involve breaking the file into smaller pieces and hashing those pieces, then comparing the piecewise hashes to find matches. The difference is in how the pieces are created for hashing.

The simplest approach is to use pieces of fixed size, as is done in dcfldd [4]. This allows an examiner who is taking a forensic image to verify that image piecewise; if some piece is corrupted, the integrity of the other pieces is maintained. When applied to document recognition, this approach suffers from an alignment problem. Should a single bit be added to or deleted from the file, that block and all following blocks will have differing hashes.

A better approach is called context triggered piecewise hashing (CTPH), and was built into a forensic tool named ssdeep [1]. First used for spam detection [5], CTPH uses a rolling hash to determine when to trigger computation of a piecewise hash.
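The chunking idea behind CTPH can be sketched in a few lines. This is a simplified illustration of the concept only, not ssdeep's actual rolling hash, block-size logic, or digest encoding; the window and modulus values here are arbitrary:

```python
import hashlib

def ctph_chunks(data: bytes, window: int = 7, modulus: int = 64):
    """Split data at content-defined boundaries chosen by a rolling hash.

    A boundary is declared whenever the hash of the last `window` bytes
    hits a fixed trigger value, so boundaries depend on local content
    rather than absolute file offsets.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if rolling % modulus == modulus - 1:  # trigger: declare a boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def ctph_digest(data: bytes):
    """Hash each piece; pieces shared between two files indicate similarity."""
    return [hashlib.sha1(c).hexdigest()[:8] for c in ctph_chunks(data)]
```

Because an inserted or deleted byte only perturbs the window sums near the edit, the chunking re-synchronizes afterward and later piece hashes still match, which fixed-size blocking cannot offer.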
In addition, there are other programs that have reverse-engineered file formats and allow text extraction. We used several, including OpenOffice and unoconv, a command-line tool that uses OpenOffice's libraries, to extract text from Microsoft Office files; pdftotext, a command-line Linux program that extracts text from PDF files; links, a command-line browser that can dump text from HTML files and web sites; and unrtf, a command-line tool that extracts text from RTF files.

In cases in which other methods fail, it is possible to extract text from printable files using optical character recognition. In our experiments below, we used the open-source OCR program tesseract-ocr to extract text from PDF files.

We also note that not all text is the same, as there exist a variety of possible encodings and line endings. Some of the most commonly encountered include ASCII, ISO 8859-1, and UTF-8. Different operating systems also use different markers to indicate line endings. As part of our experiments, we varied the encoding and line endings. We used iconv to convert between encodings; dos2unix to change line endings; and fold to shorten long lines by adding additional line endings.

It is important to note that not all text-extraction mechanisms produce identical text. Optical character recognition adds many additional characters as noise, and other methods treat special characters, like ligatures or smart quotes, differently. Because of this, any method that works to recognize text needs to be robust against small differences in the text. We show below that sdtext is robust in this way, whereas sdhash, due entirely to design differences, is not.

C. Tokenization

During processing, each file that is read is tokenized. The tokenization process splits the text into smaller segments, typically words or lines, each of which is one of the units used in making digests. We note that it is necessary to extract text from files during this process, and described approaches to recovering text in Section III-B, above.

The tokenization to be performed can and will vary, particularly by language. Tokenizers are generally used as a set; each file that is read is passed through all the tokenizers. In sdtext, each tokenizer has a name that reflects its function, and the first tokenizer in the list must be able to read the file to be processed. The current tokenizer set includes file-reading tokenizers that can read from text and gzipped text files; can use Oracle's OutsideIn for text extraction from a variety of file formats; and can use Apache Lucene [10] for processing Chinese and Arabic language documents.

Other tokenizers then filter this text by performing such operations as splitting tokens on punctuation or numerals, or filtering tokens based on length, which can be useful in removing tokens that are a byproduct of text extraction, such as when unencoded or base64-encoded text is part of a document. Some languages, such as English, benefit from stemming, which is the process of removing word endings but leaving the root. This helps ensure that words like "ensured" match "ensuring". sdtext includes the frequently used Porter stemmer [11] to accomplish this.

D. Dictionary creation

Creating a dictionary requires a sample of files that are expected to be similar to those being matched. This can be the entire set of documents to be compared, rather than some prior collection, but a set of files in the same language can suffice. Dictionaries are created by processing all files and discovering which tokens appear in which documents. We then extract statistically important terms using the inverse document frequency, or IDF, as described above.

sdtext then trims the set of gathered tokens by IDF, dropping the low-IDF and the high-IDF terms. The reason for this is that low-IDF terms are very common and require storage space while not adding significantly to the process of differentiating files. Similarly, high-IDF terms are removed as they are so uncommon as to appear in only one or a few different documents in the collection. While they can help identify those few files that contain those terms, they are useless otherwise and can be dropped to preserve space.

Determination of appropriate IDF settings can be made by experimentation, or the examiner can use a heuristic. sdtext includes the ability to test a variety of IDF ranges to determine which would produce the most accurate results, though this can be computationally expensive. Alternatively, it is possible to use language-based heuristics. For example, a normalized IDF range of approximately 0.3 to 0.6 often works well for English-language documents, and is used in our tests below.

E. Digest creation

Once the dictionary has been created, text digests can be created. Each document is tokenized to extract tokens of interest. From this set of tokens, we currently support two different types of digests. First, sdtext supports hash-based digests that are made by taking all dictionary terms that are present, alphabetizing and concatenating them, then hashing the resulting string. This is the approach used in I-Match [12], and produces compact hashes which can be used only for exact matches between files.

A better approach to similarity matching, however, uses bit vectors. Each bit in the vector represents the presence or absence of a dictionary term in the file being processed. In a sense, these are a semantic representation of the file in a lossy format. These digests allow more flexibility in matching, as files that contain a sufficient number of matching terms with a smaller number of non-matching terms will still result in a high similarity score.
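The pipeline described in the tokenization, dictionary-creation, and digest-creation subsections above can be sketched as follows. The normalized-IDF computation and the Jaccard-style score scaled to 0 to 100 are our assumptions for illustration (sdtext's exact scoring function is not given here), and Porter stemming is omitted for brevity:

```python
import math
import re

def tokenize(text, min_len=3, max_len=10):
    """Split on whitespace, strip punctuation and digits, filter by length."""
    tokens = []
    for raw in text.lower().split():
        tok = re.sub(r"[^a-z]", "", raw)
        if min_len <= len(tok) <= max_len:
            tokens.append(tok)
    return tokens

def build_dictionary(corpus, idf_lo=0.3, idf_hi=0.6):
    """Keep terms whose normalized IDF falls inside [idf_lo, idf_hi].

    Normalization here divides by the IDF of a term that appears in
    exactly one document, so values run from 0 (in every document)
    to 1 (in a single document).
    """
    n = len(corpus)
    df = {}
    for text in corpus:
        for term in set(tokenize(text)):
            df[term] = df.get(term, 0) + 1
    max_idf = math.log(n)
    keep = []
    for term, count in df.items():
        idf = math.log(n / count)
        norm = idf / max_idf if max_idf > 0 else 0.0
        if idf_lo <= norm <= idf_hi:
            keep.append(term)
    return sorted(keep)

def digest(text, dictionary):
    """Bit vector: one bit per dictionary term, set if the term occurs."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in dictionary]

def similarity(d1, d2):
    """Jaccard-style score scaled to 0-100 (an assumed scoring function)."""
    both = sum(a & b for a, b in zip(d1, d2))
    either = sum(a | b for a, b in zip(d1, d2))
    return 100 * both // either if either else 0
```

Two files that set mostly the same bits score high even when a minority of terms differ, which is what gives the bit-vector digest its tolerance for extraction noise.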
[Figure 1: histograms of all-pairs comparison scores over the t5 corpus: (a) sdhash scores > 5; (b) sdtext scores > 40. Axes: score vs. frequency; plot data omitted.]
[Figure 2: histograms of all-pairs document comparison scores: (a) sdhash document comparison (x-axis: sdhash score); (b) sdtext document comparison (x-axis: sdtext score); y-axis: frequency. Plot data omitted.]
[...] match, for sdtext a score of 60 or greater might. The results for sdhash are shown in Figure 2a, and those for sdtext in Figure 2b.

For this experiment, we used version 3.3 of sdhash and sdtext version 0.4 with automatic text extraction produced using OutsideIn. In configuring sdtext, we used minimal tokenization in that we removed tokens with punctuation, limited the length of tokens to between 3 and 10 characters, and performed Porter stemming. We then compared created digests of each file in the t5 corpus and did an all-pairs comparison.

For each tool, the vast majority of pairs do not generate a matching score. For sdhash, 1,073 of 9,965,880 score [...] no text when OutsideIn processes them. Files that do not produce text do not produce digests, as such digests would be useless, so they are not included in the matching process. Of the scores reported, 3,389 were above the sdtext cut-off score of 60, which is about 0.07%. Figure 1a shows the results for sdhash; Figure 1b shows those for sdtext.

Surprised at the seeming effectiveness of sdtext in a situation where we did not expect it to perform well, we took a closer look at the results. We noted that the high scores in the 99 and 94 columns consisted almost entirely of JPEG images. Checking the text extracted, we found that OutsideIn was reporting standard-language EXIF headers that were causing a match. Removing the image files from comparison gave us the results shown in Figure 3. Examining the remaining hits showed reasonable results: for example, the top hit was between web pages from the same site; the second hit, data sets in the same format.

Overall, however, sdtext doesn't perform as well in the general case as sdhash, which is fully expected. In particular, sdtext was unable to create digests for many files due to the lack of text to be extracted. In the next section we describe experiments on textual files and show that sdtext outperforms sdhash in this specific, textual domain.

[Figure 3: sdtext scores over the t5 data set, score > 40, images removed. Plot data omitted.]

B. Experiments with Textual Files

As sdtext is designed to operate effectively over text files, we designed and conducted experiments to demonstrate performance on pure text files. We developed a test corpus, as described below, and converted the text in each Microsoft Word document to PDF, HTML, OCRed text, and text. We then compared each Word document against all other versions of the same document using sdtext and sdhash. Performing experiments in this manner allows us to have a solid ground truth. Each different format of the same file should, ideally, match the others, as they contain the same
text when viewed by humans with the appropriate viewing software.

Corpus Creation: We first selected a random subset of 500 Microsoft Word files from the GOVDOCS [14] corpus. From this set, we removed documents that were only one page, as many of those were blank forms with little text. We then briefly examined each document to ensure it was in English and that when saved as PDF it created a single document; many of the .doc files created two or more PDFs when printed. Though we had a goal of 300 text documents, we ended up with 299, as late in testing it was determined one was in Spanish, so its results were removed. This resulting set of documents, which we will refer to as the doc corpus, is available for download from anonymized-for-review.

Format conversion: For each document, we converted it from Word format to PDF by using the option to print as PDF through Microsoft Office on a Macintosh desktop, via a small program written in AppleScript. We similarly converted the documents to HTML in the same manner, keeping only the HTML portion and not the corresponding subdirectories. The PDF files were then run through tesseract-OCR to produce a text version through optical character recognition.

1) All pairs of doc files: Our initial test was to compare each doc file against all others. The results are shown for sdhash in Figure 2a and for sdtext in Figure 2b. For sdtext, we used OutsideIn's text extraction, removed tokens with punctuation, limited the length to 3-10 characters, and performed Porter stemming. This is a demonstration that, for the most part, the set of files extracted are not generally self-similar. It also shows that sdhash identifies matches that sdtext does not.

To determine why this was the case, we examined all sdhash matches that scored above 20 and all sdtext matches that scored above 60. For sdhash, there were 108 such matches. Of those, a brief examination of each match revealed that 49 appeared to be legitimate, mostly due to cases in which the documents were produced by the same agency and had identifying header text and an embedded logo image. The other 59 matches were apparent false positives. Interestingly, most of these false positives involved the same 10 or so files. A hypothesis is that there may be some embedded code that is not visible, perhaps something like a font definition, that is triggering those matches.

In contrast, sdtext had 4 matches above a score of 60, three of which were documents from the same organization, and the fourth of which (and the lowest scored) was a false positive. This shows that each approach has a strength, as sdhash found many matches that sdtext did not, at the cost of many false positives. Meanwhile, sdtext found one match sdhash did not, but was more precise in what it reported.

2) Comparison between doc and PDF: We next moved on to comparison of the same text in different formats. We tested each document only against its re-formatted copy. As mentioned above, this experiment provides a strong measure of ground truth, as each comparison should be true. We started by comparing each doc file against the corresponding PDF file. For this, and all the comparisons between formats below, we used OutsideIn's text extraction, removed tokens with punctuation, limited the length to 3-10 characters, and performed Porter stemming. The results are shown for sdhash in Figure 4a and for sdtext in Figure 4b.

It is clear that something about the PDF conversion process badly affects the ability of sdhash to match a Microsoft Word document to its corresponding PDF file, as none of the matched pairs returned a score other than zero. In contrast, sdtext performed quite accurately, with only two pairs failing to meet a reasonable score, though they were close. This is a 99.33% success rate.

3) Comparison between doc and HTML: As conversion to HTML can be a common transformation, we compared each doc file to its corresponding HTML file. We note that the HTML produced by Microsoft Word is quite verbose, and that a different conversion method or post-processing of the HTML to reduce its complexity might produce different results. The results are shown for sdhash in Figure 5a and for sdtext in Figure 5b.

Again, sdhash performed poorly, matching only 1 pair out of 299, for a success rate of 0.33%. sdtext did much better, using the same tokenization as above, with a 100% success rate.

4) Comparison between doc and OCR text: Though text extraction is possible in many common electronic formats, in cases where the files are in image form or exist on paper, text can be retrieved using optical character recognition. This process, while often effective, can be very noisy and error-prone. We note that our OCR corpus was produced using existing PDF files, and is likely cleaner than copies that were scanned from paper. Examination of the resulting text files does, however, show that a significant number of additional characters were introduced into the files.

Again, as shown in Figure 6a and Figure 6b, sdhash performed poorly in this specific case, showing a 0% success rate, while sdtext missed only 3 matches, for a 98.99% success rate.

5) Comparison between doc and alternate text extractions: Finally, we compared doc files against text files that had been created using the "Save as text" function of Microsoft Word. The text produced this way differs from that produced by OutsideIn, which was the default for sdtext in this experiment. The results are shown below in Figure 7a and Figure 7b.

In this test, sdhash was able to match a good portion of the files, scoring 220 out of 299 as matches for a success rate of 73.6%. Again, sdtext did well, with a 100% success rate, showing that it is robust to small differences in text.
[Figure 4: histograms of doc vs. PDF comparison scores: (a) sdhash doc vs. pdf comparison (x-axis: sdhash score); (b) sdtext doc vs. pdf comparison (x-axis: sdtext score); y-axis: frequency. Plot data omitted.]
[Figure 5: histograms of doc vs. HTML comparison scores: (a) sdhash doc vs. html comparison; (b) sdtext doc vs. html comparison; score vs. frequency. Plot data omitted.]
[Figure 6: histograms of doc vs. OCR-text comparison scores for sdhash and sdtext; score vs. frequency. Panel titles lost in extraction; plot data omitted.]
[Figure 7: histograms of doc vs. text comparison scores: (a) sdhash doc vs. text comparison; (b) sdtext doc vs. text comparison; score vs. frequency. Plot data omitted.]
[Figures 8 and 9: line plots of average score vs. percent mangled, presumably (a) sdhash-E and (b) sdtext, for the section-deletion and whitespace-deletion experiments. Plot data omitted.]
C. Using sdhash on extracted text

Given that sdtext operates over extracted text while sdhash needs to operate over all formatting, the question naturally arises whether the text extraction is what provides the advantage. Would sdhash do as well if provided the extracted text to operate on?
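The sdhash-E normalization and the error-insertion manipulations used in the experiments below can be sketched as follows. The function names and exact sampling choices are ours, for illustration only, not sdtext's actual implementation:

```python
import random
import re
import string

def normalize_for_sdhash(text: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace runs to a
    single space, as done to prepare input for sdhash-E."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def delete_section(text: str, frac: float, rng: random.Random) -> str:
    """Delete one contiguous chunk covering frac of the text."""
    n = int(len(text) * frac)
    start = rng.randrange(max(1, len(text) - n))
    return text[:start] + text[start + n:]

def delete_whitespace(text: str, p: float, rng: random.Random) -> str:
    """Drop each whitespace character with probability p (an OCR-style error)."""
    return "".join(c for c in text if not (c.isspace() and rng.random() < p))

def mangle_chars(text: str, p: float, rng: random.Random) -> str:
    """Replace each character with a random letter with probability p."""
    return "".join(rng.choice(string.ascii_lowercase) if rng.random() < p else c
                   for c in text)
```

Passing a seeded `random.Random` makes the manipulations repeatable across trials, which matters when averaging scores over multiple runs.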
[Figure 10: paired plots titled "Experiment Results" of average score vs. percent mangled, presumably (a) sdhash-E and (b) sdtext for the character-change experiment. Plot data omitted.]
To determine if this is the case, we ran a series of experiments over extracted text. For sdhash, we used English-language text and normalized it by making the text lower case, removing punctuation, and replacing all white space with a single space. We refer to sdhash working over extracted text as sdhash-E. For sdtext, we extracted text; split tokens on whitespace; stripped digits and punctuation; removed short and long tokens; then applied Porter stemming.

We then also applied automatic error insertion into the text at varying levels. These manipulations included deleting a contiguous chunk of the text; deleting whitespace characters with a percentage chance; or changing individual characters to random other characters with some percentage chance. These edits were intended to represent different situations that would occur in forensic settings. Recovering partially overwritten files might result in recovering only a small portion of the file. OCR of documents might result in erroneous characters, or missed whitespace characters. We show that each of these affects the operation of sdtext and sdhash-E differently, but that sdtext is generally more resilient to errors in the text.

1) Deleting Sections: In the first experiment, we matched a large number of source plain-text files from the Enron data set [15] against the same file with a contiguous section deleted. The results are shown in Figures 8a and 8b, which recount the average score over multiple trials for each deletion size. The 95% confidence interval is calculated, but is too close to the plotted points to be visible. As a reminder, scores above 20 are reasonably interpreted as matches for sdhash, and scores above 60 are reasonable matches for sdtext.

The graphs show that both sdhash-E running over extracted text and sdtext do a good job of matching segments of files to the original file. In this case, sdhash-E is able to provide a high-confidence score at a slightly larger deletion percentage, on average matching documents that had 77% of the text removed; sdtext was behind, matching on average with about 70% of the text removed. This indicates that sdhash-E can perform slightly better in the case of matching contiguous fragments of text to files.

2) Deleting or Changing Characters: We also considered other edits typical of the errors that can occur in matching noisy text, like that created by OCR or recovered with other noise. In this case we tried two different experiments. In the first, we randomly deleted whitespace characters in the file. This is one type of OCR error that is common. The results are shown in Figures 9a and 9b. In this case, sdhash-E performs very poorly, failing to match files with 10% of white space removed. sdtext performs much better, matching the files on average with as much as 40% of the whitespace removed.

Similarly, we examined what would happen when OCR-type errors altered individual characters. In this experiment, individual characters were randomly changed with the specified probability. The results are shown in Figures 10a and 10b. Again, sdtext is robust against about 60% of the characters being randomly changed; sdhash-E fails with even 10% of characters changed.

3) Results: The results show that the different approaches of sdtext and sdhash cause different performance. Because sdhash creates hashes over sections of text, any change in that section means it will not match the original section. However, given a clean hash of a small section, it will still match the original document. This gives it better performance than sdtext in matching segments, though the difference is not major.

On the other hand, sdtext is far more resilient to errors in characters in a file than sdhash. This indicates that it is likely preferable in situations when the type of error that might be present in recovered text is not known.

[...] sdtext: We used the base set of text files created using Microsoft Word as described above. We then compared these
files to the set of OCRed files, called the OCR set; a set of files that was created using the unoconv utility on Linux, called the unoconv set; and the base set of text files that had line endings changed with the dos2unix utility and the line length wrapped to 80 characters using the fold utility, called the folded set.

V. LIMITATIONS OF sdtext AND FUTURE WORK

While sdtext is very accurate at matching text files, it has limitations compared to existing tools. First, the processing overhead is higher. To match a new set of files, sdtext must read the files once to create the dictionary, then again for the signatures. For large data sets this can induce a significant overhead. Second, there is an additional overhead required to load the dictionary. In some cases, this can be expensive, particularly when sdtext is run repeatedly to create fingerprints, as the dictionary must load each time. Third, the digests are not hashes and require more computation to match. The hashes produced by sdhash and ssdeep can be inserted into a Bloom filter [6] that later allows very fast checking for the presence of a matching hash. In contrast, the digests produced by sdtext must be checked individually to determine the score between them. Finally, sdtext is written in Java, which does not have the inherent performance of the C or C++ in which other tools are coded.

Our planned future work intends to address some of these issues. First, we plan to place all tokens from each file, as it is read, into a Bloom filter. Instead of reading the file a second time, the Bloom filter will be probed to determine which tokens from the dictionary were present, removing the need to read each file twice. Second, we are planning to implement an improved algorithm for all-pairs digest scoring [16]. This approach limits the number of comparisons that need to be made. Third, while sdtext is already heavily threaded, we are planning on examining other options for parallelism, including Apache Spark.

VI. CONCLUSION

Similarity matching is an increasingly useful technique for working with large amounts of acquired evidence. Most similarity tools work over the bitstream of the file, which, while useful, does not do well in finding documents with similar text in differing formats. We introduced sdtext, which matches similar text files on the basis of their contents. We showed that while sdtext is not as suitable for finding duplicates in the general case, it is more accurate at identifying textual file matches than current tools.

VII. ACKNOWLEDGEMENTS

REFERENCES

[1] J. D. Kornblum, "Identifying almost identical files using context triggered piecewise hashing," Digital Investigation, vol. 3, no. Supplement-1, pp. 91–97, 2006.

[2] V. Roussev, G. R. III, and L. Marziale, "Multi-resolution Similarity Hashing," in Digital Forensics Research Conference (DFRWS), 2007, pp. 105–113.

[3] National Institute of Standards and Technology, "National software reference library," http://www.nsrl.nist.gov/, August 2003.

[4] N. Harbour, "DCFLDD," http://dcfldd.sourceforge.net/, March 2002.

[5] A. Tridgell, "Spamsum readme," http://samba.org/ftp/unpacked/junkcode/spamsum/README, 2000.

[6] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.

[7] D. Grossman and O. Frieder, Information Retrieval, 2nd ed. Springer, 2004.

[8] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Trans. on Knowl. and Data Eng., vol. 19, no. 1, pp. 1–16, Jan. 2007. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2007.9

[9] Oracle, "Oracle Outside In Technology," http://www.oracle.com/technology/products/content-management/oit/oit_all.html.

[10] "Lucene homepage," available at http://lucene.apache.org/.

[11] M. F. Porter, "An Algorithm for Suffix Stripping," Readings in Information Retrieval, pp. 313–316, 1997.

[12] A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, "Collection statistics for fast duplicate document detection," ACM Transactions on Information Systems, vol. 20, no. 2, pp. 171–191, 2002.

[13] V. Roussev, "An Evaluation of Forensic Similarity Hashes," in Proceedings of the Eleventh Annual DFRWS Conference, New Orleans, LA, August 2011, pp. 34–41.

[14] S. Garfinkel, P. Farrell, V. Roussev, and G. Dinolt, "Bringing Science to Digital Forensics with Standardized Forensic Corpora," in Proceedings of the Ninth Annual DFRWS Conference, Montreal, Canada, August 2009.

[15] B. Klimt and Y. Yang, "Introducing the Enron Corpus," in First Conference on Email and Anti-spam (CEAS), 2004.

[16] R. J. Bayardo, Y. Ma, and R. Srikant, "Scaling up all pairs similarity search," in Proceedings of the 16th International Conference on World Wide Web, ser. WWW '07. New York, NY, USA: ACM, 2007, pp. 131–140.