
John Wang, Michael McGuire

EDRM Data Set Project Co-Leads


EDRM VI Kickoff Meeting
May 12, 2010
Minneapolis, MN
 Overview
◦ 2009-2010 Accomplishments
◦ Project Trajectory
 EDRM DS Projects
◦ Reference ESI Data Sets
◦ Software Reference Data Set
◦ Probabilistic Hash Data Set
 Summary

EDRM VI Kickoff Meeting – May 12, 2010


Improve the Quality and Reduce the Costs of ED

By Providing Unencumbered Data Sets

And Best Practices Guidelines

In Partnership with Leading Organizations



 Launch of first three ESI Data Sets
◦ Enron PST Files v1.0
◦ File Formats Data Set v1.0
◦ Internationalization Data Set v1.0
 Partnership with TREC Legal Track
◦ To provide 2010 TREC Legal Track Data Set
 Identify and Articulate New Projects



ESI Data Sets (ESIDS)
Software Reference Data Sets (SRDS, “EDRM List”)
Probabilistic Hash Data Set (PHDS)

Data

Best Practices

Case Studies



 Background
◦ To provide multiple reference data sets for testing and
benchmarking
(Diagram: Unit Testing, System Testing, Regression Testing, Acceptance Testing)

 Goals & Benefits
◦ Improved Quality
◦ Lower Development Costs
◦ Lower Acquisition Costs



 Current ESI Data Set Efforts
◦ Enron ESI Data Set
◦ File Formats Data Set
◦ Internationalization Data Set



 Background
◦ The email released by FERC via the Enron Western Energy
Crisis investigation remains one of the largest collections of
email available as ESI
 1+ million before de-duplication*
 100,000s after de-duplication*
◦ Many different versions of the collection have been used
in various research projects, including TREC Legal Track
◦ It is hard to correlate the many studies

* There are many more duplicates in the source collection than would exist in
Enron’s production email environment. Some collections account for this and
others do not. There does not currently exist a way to correlate the various
de-duplication efforts.
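
One way to see why de-duplicated counts differ between versions of the collection is that every effort has to choose which fields define "the same message". The sketch below is only an illustration under assumed conventions: the dictionary keys (`from`, `to`, `date`, `subject`, `body`) and the normalization rules are hypothetical, not the method used by EDRM, TREC, or any published Enron collection.

```python
import hashlib

def message_key(msg: dict) -> str:
    """Digest over fields assumed to match across duplicate copies."""
    parts = [
        msg.get("from", "").strip().lower(),
        ",".join(sorted(a.strip().lower() for a in msg.get("to", []))),
        msg.get("date", "").strip(),
        msg.get("subject", "").strip(),
        " ".join(msg.get("body", "").split()),  # collapse runs of whitespace
    ]
    # A separator byte keeps adjacent fields from running together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def deduplicate(messages):
    """Keep the first message seen for each key, preserving input order."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Changing any of the assumed rules (for example, including or excluding the date) changes the resulting count, which is one reason the various studies are hard to correlate.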



 Goals
◦ To provide a data set that accurately represents Enron’s
email environment
◦ That is useful for a wide variety of research and industry
use, including the ability to correlate various studies

 Accomplishments
◦ EDRM Enron PST files representing 132 custodians with
attachments

 Upcoming Deliverables
◦ EDRM / TREC 2009 Data Set
◦ EDRM / TREC 2010 Data Set



                 EDRM PST v1.0   EDRM TREC 2009 [1]   EDRM TREC 2010 [2]
Custodians       132             1 (104) [4]          150
Custodian PSTs   Y               N                    Y
EDRM XML         N               Y                    Y
TIFF             N               N [5]                Y
Full Headers     Y               N                    Y
SDOC NO [3]      N               N                    Y

1. To be available shortly from EDRM. This set was contributed by Clearwell Systems for the TREC 2009 Legal Track project and will be
hosted by EDRM.
2. This data set is still in progress, for use by TREC Legal Track 2010.
3. Required for correlating various versions.
4. The EDRM XML specifies a single custodian. As part of the TREC 2009 Legal Track, ZL Technologies identified 104 custodians in this
data set.
5. TIFF images can be made available.



 File Formats Data Set
◦ 381 Files
◦ 200 File Formats
 Internationalization Data Set
◦ 23 languages
◦ 724 MB of email in MIME format



 Getting Involved
◦ Analyze and report on contents of existing data
sets
◦ Promote EDRM Data Set project usage by authoring
and working on case studies
◦ Contribute new data sets



 Background
◦ The NSRL or “NIST List” is a common way to cull system and
application files but file coverage is very limited
 Goals & Benefits
◦ To lower the costs of eDiscovery and forensics by
complementing the NSRL through a parallel EDRM offering
covering more knowable files

Installed / Uninstalled Online Distributed Software

Installed Software from CD / DVD Media

Uninstalled Software from CD / DVD Media
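
Culling against a known-file list amounts to a set-membership check on file hashes. The sketch below shows the idea, with a hypothetical EDRM list used alongside an NSRL-style list; the function name and the input layout (pairs of file name and hex digest) are assumptions made for illustration.

```python
def cull_known_files(collected, *hash_lists):
    """Split collected files into (kept, culled) by hash-list membership.

    collected  : iterable of (file_name, hex_digest) pairs
    hash_lists : one or more iterables of known hex digests,
                 e.g. an NSRL-style list plus a parallel EDRM list
    """
    known = set()
    for hashes in hash_lists:
        known.update(h.upper() for h in hashes)  # compare case-insensitively

    kept, culled = [], []
    for name, digest in collected:
        if digest.upper() in known:
            culled.append(name)
        else:
            kept.append(name)
    return kept, culled
```

Passing a second list widens coverage without changing the culling step, which is the sense in which a parallel offering would complement the NSRL.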



 Proposed Work Product
◦ Published hash lists
◦ Open source hashing software. We may be able to
leverage NSRL open source hashing code:
 http://www.nsrl.nist.gov/perl/
 Hashing Targets
◦ Public images, e.g. AWS, Azure VM images
 Community Involvement
◦ Have organizations install and use EDRM provided
tools to submit hashes on known hashing targets
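
A minimal sketch of what a hash-list tool could do, assuming a hashing target that has been mounted or unpacked to a local directory. The choice of SHA-1 and MD5 and the row fields (`sha1`, `md5`, `name`, `size`) are assumptions for illustration; this is not the NSRL code and not an EDRM-defined format.

```python
import hashlib
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    """Return (SHA-1, MD5) hex digests, reading the file in chunks."""
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return sha1.hexdigest().upper(), md5.hexdigest().upper()

def build_hash_list(root):
    """Walk `root` and return one row per regular file."""
    rows = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            sha1, md5 = hash_file(path)
            rows.append({
                "sha1": sha1,
                "md5": md5,
                "name": path.name,
                "size": path.stat().st_size,
            })
    return rows
```

Reading in chunks keeps memory use flat for large files, which matters when the target is a full software installation or VM image.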



 Getting Involved
◦ Establish bootstrap goals for initial v1.0 launch
◦ Identify and access initial software targets
◦ Develop and validate tools for creating hash lists
◦ Release toolsets and hash lists
◦ Partner with NSRL and other organizations



 Background
◦ There are many knowable system and application
files that can be culled for eDiscovery, e.g. OS,
application, and help files; however, there are also
many that cannot be pre-determined.

◦ What if there was a way to probabilistically
determine if any file was user-generated or not?



 Goals and Benefits
◦ To provide a hash database for theoretically all files
on a probabilistic basis so organizations can use
statistics to help determine whether or not a file may
be user generated
◦ Lower ED costs by dramatically enhancing culling of
knowable non-user files

(Diagram: ESI Data Sets (test files); Software Reference Data Sets (knowable files); Probabilistic Hash Data Set (all files))



1) Community Contribution
 Provide anonymous hashes of all collected files to EDRM DS

2) EDRM Aggregated Tracking
 Aggregated frequency and histogram analysis

3) eDiscovery Processing
 System checking for files not culled via NSRL or ESRDS

 By aggregating anonymous hashes across all collections, frequency
analysis can be used to assist in determining if any given file is user
generated
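
The frequency idea can be sketched as counting how many distinct collections contain each hash: a file seen in many unrelated collections is less likely to be user generated. The function names and the 0.5 threshold below are hypothetical; the slides do not specify a scoring rule.

```python
from collections import Counter

def collection_frequency(collections):
    """Count, for each hash, how many distinct collections contain it."""
    counts = Counter()
    for hashes in collections:
        counts.update(set(hashes))  # count each hash once per collection
    return counts

def likely_non_user(counts, total_collections, threshold=0.5):
    """Return hashes seen in at least `threshold` of all collections."""
    return {
        h for h, n in counts.items()
        if n / total_collections >= threshold
    }
```

Counting once per collection, not once per file, keeps a single custodian with many copies of their own document from making that document look like a common system file.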



 Getting Involved
◦ Recruiting interested parties
◦ Initial system
 Requirements / Development
 Interfaces (API, XML File Type)
 Systems and development
 POC deployment
 Using known ESIDS data
 Scalability testing



 Providing real benefits for the ED community
◦ Multiple ESI data sets have been released
◦ Case studies are in progress
◦ New projects are underway
◦ Discussions and partnerships with other
organizations are underway, e.g. TREC and NIST

 Get involved for 2010-2011!


◦ dataset@edrm.net
◦ http://edrm.net/activities/projects/dataset

