Data
Best Practices
Case Studies
* There are many more duplicates in the source collection than would exist in
Enron’s production email environment. Some collections account for this and
others do not. There is currently no way to correlate the various de-duplication
efforts.
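As an illustration of what correlating de-duplication efforts could involve, the sketch below groups messages that share a content hash. It is a minimal example under stated assumptions, not a method used by EDRM or by any of the collections: the choice of fields (From, To, Date, Subject, body), the whitespace and case normalization, and the names `dedup_key` and `group_duplicates` are all hypothetical.

```python
import hashlib
from email import message_from_string


def dedup_key(raw_message: str) -> str:
    """Return a SHA-256 digest over a few normalized fields of one message.

    Assumption: two copies of the same email agree on these fields once
    whitespace is collapsed and header values are lower-cased.
    """
    msg = message_from_string(raw_message)
    parts = []
    for name in ("From", "To", "Date", "Subject"):
        value = str(msg.get(name, ""))
        parts.append(" ".join(value.split()).lower())
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages return a list; this sketch ignores their bodies.
        body = ""
    parts.append(" ".join(body.split()))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def group_duplicates(messages: dict) -> dict:
    """Map each digest to the list of message identifiers that share it."""
    groups = {}
    for identifier, raw in messages.items():
        groups.setdefault(dedup_key(raw), []).append(identifier)
    return groups
```

Any group with more than one identifier is a candidate duplicate set. Collections that record the same key for each message could then be compared directly, whichever de-duplication approach each one applied.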
Accomplishments
◦ EDRM Enron PST files representing 132 custodians with
attachments
Upcoming Deliverables
◦ EDRM / TREC 2009 Data Set
◦ EDRM / TREC 2010 Data Set
Custodian PSTs        Y    N             Y
EDRM XML              N    Y             Y
TIFF                  N    N (note 5)    Y
Full Headers          Y    N             Y
SDOC NO (note 3)      N    N             Y
1. To be available shortly from EDRM. This set was contributed by Clearwell Systems for the TREC 2009 Legal Track project and will be
hosted by EDRM.
2. This data set is still in progress; it is to be delivered for use by the TREC Legal Track 2010.
3. Required for correlating various versions.
4. The EDRM XML specifies a single custodian. As part of the TREC 2009 Legal Track, ZL Technologies identified 104 custodians in this
data set.
5. TIFF images can be made available.
generated Probabilistic Hash dramatically enhancing culling of knowable non-user files