
Searching The United States Code with Solr/Lucene

Paul Nelson (pnelson@searchtechnologies.com) / Ronald Matamoros (rmatamoros@searchtechnologies.com), Search Technologies, 5/25/2011

Searching the United States Code


Who are we:
Paul Nelson, Chief Architect
Ronald Matamoros, Lead Engineer

Our Mission: Replace Personal Librarian Search


A 20-Year-Old Search Engine!

Key Challenges
How to index this massive, complex, 85-year-old document?
How to replicate 20-year-old search features?

Government Documents are Fun!



Search Technologies
The largest independent provider of enterprise search expertise and services
80 full-time dedicated search engine experts
200+ customers
Technology Neutral
(yeah, we know Sphinx too)

Offices All Over


DC, NY, CA, MD, OH, UK, CR

A Quick Civics Lesson


The United States Code
The general & permanent laws of the U.S. Government
All in one place
51 titles
Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health

First Version: 1926

The Office of the Law Revision Counsel (OLRC)


20 lawyers who author the U.S. Code
They report to the Speaker of the House of Representatives

Bonus Question: Which Title is the largest?



Major Challenges
1. Document Parsing
A 50 Volume Table Of Contents!

2. Query Parsing
Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields)

3. Searching & Highlighting Fields


Some fields are embedded in the document
These fields must be highlighted in context

screenshot


Part The First: Document Processing


Document Processing / Indexing


USC Title -> Parse & Granularize -> Embed Refs -> Construct XHTML -> Store (Repository) -> Xform & Index -> Solr


Field Type 1: Extracted to Index


Slide callouts: Page Numbers, Heading, Title, Source Credit

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of
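A minimal sketch of how the metadata comments and the field-start/field-end markers in the sample above could be pulled out for indexing. The deck does not show the actual extraction code; the function names and regular expressions here are hypothetical, and only the comment syntax is taken from the slide.

```python
import html
import re

# Hypothetical helpers; only the comment syntax comes from the sample document.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.S)
KEY_RE = re.compile(r"(?:^|\s)([A-Za-z]+):")
FIELD_RE = re.compile(
    r"<!--\s*field-start:([\w-]+)\s*-->(.*?)<!--\s*field-end:\1\s*-->", re.S
)


def extract_metadata(xhtml):
    """Collect key:value pairs from comments that are not field markers."""
    meta = {}
    for body in COMMENT_RE.findall(xhtml):
        body = body.strip()
        if body.startswith(("field-start:", "field-end:")):
            continue
        keys = list(KEY_RE.finditer(body))
        for i, match in enumerate(keys):
            # A value runs until the next key (or the end of the comment).
            end = keys[i + 1].start() if i + 1 < len(keys) else len(body)
            meta[match.group(1)] = body[match.end():end].strip()
    return meta


def extract_fields(xhtml):
    """Return the plain text found between matching field-start/field-end markers."""
    fields = {}
    for name, body in FIELD_RE.findall(xhtml):
        text = re.sub(r"<[^>]+>", " ", body)  # drop tags and nested comments
        fields[name] = html.unescape(" ".join(text.split()))
    return fields


sample = (
    '<!-- documentid:14_1 usckey:140000000000100000000000000000000 '
    'currentthrough:20080108 documentPDFPage:3 --> '
    '<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 --> '
    '<!-- itemsortkey:140AAAD --> '
    '<!-- field-start:head --><h3 class="section-head">&sect;1. '
    'Establishment of Coast Guard</h3> <!-- field-end:head -->'
)

print(extract_metadata(sample))
print(extract_fields(sample))
```

The metadata dictionary maps naturally onto separate index fields, while the field text stays inside the document body, which is what the later token-processing slides rely on.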


Document Processing / Indexing


[Diagram: Title 14 granularized into a tree of nodes]
Title 14
  ch. 1, ch. 2, ch. 3
    pt. A, pt. B, pt. C
      sec. 1, sec. 2, sec. 3

Field Type 2: Embedded Refs


Slide callouts (reference types): Statute at Large, Public Law, USC Refs, Other

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of
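As an illustration of the "Embed Refs" idea, the sketch below finds Public Law and Statutes at Large citations in a source-credit string. How the real system detects or stores references is not shown in the deck; the patterns, the kind labels and the function name are assumptions made for this example.

```python
import html
import re

# Assumed citation patterns, written for the two citation styles in the sample.
PATTERNS = {
    "public-law": re.compile(r"Pub\.\s*L\.\s*\d+[\u2013-]\d+"),
    "statute-at-large": re.compile(r"\d+\s+Stat\.\s+\d+"),
}


def find_refs(xhtml_text):
    """Return (kind, citation) pairs in the order they appear in the text."""
    text = html.unescape(xhtml_text)  # turns &ndash; into an en dash, etc.
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(0)))
    return [(kind, cite) for _, kind, cite in sorted(hits)]


credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),"
for kind, cite in find_refs(credit):
    print(kind, "->", cite)
```

Each hit could then be written back into the XHTML as another pair of start/end markers, so that references become searchable fields like the ones already in the sample.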


Document Processing / Indexing



/US-Code
  /2010
    /title2
      /USC-title2-section1532.htm
      /USC-title2-node3-rule5.htm


Part The Second: Token Processing


Token Processing 1
xhtml tag tokenizer
Before:
<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of <!-- field-end:amendment-note -->
After (one token per item):
<!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
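The tokenizer step can be approximated in a few lines of Python. This is only a sketch of the behaviour shown on the slide (comments and tags kept as single tokens, entities and punctuation treated as separators); the real component is presumably a Lucene tokenizer, and the regular expression here is an assumption.

```python
import re

# Comments first, then tags, then runs of letters/digits.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]+>|[A-Za-z0-9]+", re.S)


def tokenize(xhtml):
    """Split XHTML into comment tokens, tag tokens and word tokens."""
    # Character entities such as &mdash; act as separators, not as words.
    without_entities = re.sub(r"&\w+;", " ", xhtml)
    return TOKEN_RE.findall(without_entities)


snippet = (
    '<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    '&ldquo;Department of <!-- field-end:amendment-note -->'
)
for token in tokenize(snippet):
    print(token)
```

Keeping the markers and tags in the token stream at this stage is what lets the following steps turn them into start and end tokens before the markup is discarded.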


Field Type 3: Marked Within Doc


<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of


Token Processing 2
Mark Start and End Tags
Before:
<!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
After:
S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment


Token Processing 3
Remove XHTML Tags
Before:
S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
After:
S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
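Steps 2 and 3 can be sketched together: field markers become S/ and E/ tokens, and the remaining XHTML tags are dropped. The rule that strips the "-note" suffix (so that amendment-note becomes S/amendment) is inferred from the slide and is an assumption, as is the function name.

```python
import re

FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def mark_and_strip(tokens):
    """Replace field markers with S/ and E/ tokens and drop other markup."""
    out = []
    for token in tokens:
        match = FIELD_MARK.fullmatch(token)
        if match:
            prefix = "S/" if match.group(1) == "start" else "E/"
            name = match.group(2)
            if name.endswith("-note"):  # assumed naming rule, see lead-in
                name = name[: -len("-note")]
            out.append(prefix + name)
        elif token.startswith("<"):
            continue  # ordinary XHTML tag or unrelated comment: not indexed
        else:
            out.append(token)
    return out


tokens = [
    "<!-- field-start:amendment-note -->", '<h4 class="note-head">',
    "Amendments", "</h4>", '<p class="note-body">',
    "2002", "Pub", "L", "107", "296", "Substituted", "Department", "of",
    "<!-- field-end:amendment-note -->",
]
print(" ".join(mark_and_strip(tokens)))
```

Because S/ and E/ are ordinary tokens with positions, a positional query can later ask for words that fall between them.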


Token Processing 4
Tag Original Case & Lower Case
Before:
S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
After (two tokens stacked at each word position):
S/amendment O/Amendments O/2002 O/Pub O/L O/107 O/296 O/Substituted O/Department O/of E/amendment
            L/amendments L/2002 L/pub L/l L/107 L/296 L/substituted L/department L/of


Token Processing 5
Lemmatize
Uses dictionary-based lemmatizer based on GCIDE and WordNet
Before:
S/amendment O/Amendments O/2002 O/Pub O/L O/107 O/296 O/Substituted O/Department O/of E/amendment
            L/amendments L/2002 L/pub L/l L/107 L/296 L/substituted L/department L/of
After (lemma added as a third token at each word position):
S/amendment O/Amendments O/2002 O/Pub O/L O/107 O/296 O/Substituted O/Department O/of E/amendment
            L/amendments L/2002 L/pub L/l L/107 L/296 L/substituted L/department L/of
            amendment 2002 pub l 107 296 substitute department of
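Steps 4 and 5 amount to emitting several tokens per word position: the original-case form, the lowercase form, and the lemma. The sketch below models that with one list per position. The two-entry lemma table is a toy stand-in for the GCIDE/WordNet dictionary, which is not reproduced here; in Lucene, stacked tokens of this kind are normally produced by giving the extra tokens a position increment of 0, though the deck does not say how the authors implemented it.

```python
# Toy stand-in for the dictionary-based lemmatizer mentioned on the slide.
TOY_LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def expand(tokens, lemmas=TOY_LEMMAS):
    """Return one list per position: [O/original, L/lowercase, lemma]."""
    positions = []
    for token in tokens:
        if token.startswith(("S/", "E/")):
            positions.append([token])  # field boundary tokens pass through
        else:
            lower = token.lower()
            positions.append(["O/" + token, "L/" + lower, lemmas.get(lower, lower)])
    return positions


stream = ["S/amendment", "Amendments", "2002", "Pub", "Substituted", "E/amendment"]
for position, stacked in enumerate(expand(stream)):
    print(position, stacked)
```

With all three forms at the same position, an exact-case query, a case-insensitive query and a lemmatized query can each match the same word without separate copies of the document.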


Part The Third: Query Processing


Query Processing
(not all stages shown)
Query String -> parse -> mark exact: -> mark phrases -> lemmatize -> query template -> build lucene query -> search

Communicates via generic QNode Class


Simpler to manipulate than Lucene operators

Can produce FAST FQL as well


(cue the derisive catcalls)

But most importantly:


It is a Query Processing Pipeline
Mix and match query processing modules
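A minimal sketch of the pipeline idea, assuming a very small QNode shape (an operator name, optional term text, and children). The real QNode class is not shown in the deck, so the fields, the stage function and the renderer below are hypothetical; the point is only that stages are interchangeable functions from one tree to another, and that rendering to a target syntax happens last.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QNode:
    op: str                                   # "and", "or", "phrase" or "term"
    text: str = ""                            # only used by "term" nodes
    children: List["QNode"] = field(default_factory=list)


Stage = Callable[[QNode], QNode]


def lowercase_terms(node: QNode) -> QNode:
    """Example stage: return a copy of the tree with every term lowercased."""
    if node.op == "term":
        return QNode("term", node.text.lower())
    return QNode(node.op, node.text, [lowercase_terms(c) for c in node.children])


def run_pipeline(node: QNode, stages: List[Stage]) -> QNode:
    """Apply each stage in order; stages can be added, removed or reordered."""
    for stage in stages:
        node = stage(node)
    return node


def to_lucene(node: QNode) -> str:
    """Render the tree in Lucene-style query syntax (one possible back end)."""
    if node.op == "term":
        return node.text
    if node.op == "phrase":
        return '"' + " ".join(c.text for c in node.children) + '"'
    joiner = " AND " if node.op == "and" else " OR "
    return "(" + joiner.join(to_lucene(c) for c in node.children) + ")"


tree = QNode("and", children=[
    QNode("term", "FOIA"),
    QNode("phrase", children=[QNode("term", "Top"), QNode("term", "Secret")]),
])
print(to_lucene(run_pipeline(tree, [lowercase_terms])))
```

A second renderer for another engine's query language could consume the same tree, which is the benefit the slide attributes to keeping the stages independent of Lucene operators.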

Query Processing
Query String: exact:FOIA top secret amendment:RECORDS
Pipeline: parse -> mark original -> mark lowercase -> lemmatize -> query template -> build lucene query -> search

After parse:
and
  exact: |FOIA|
  phrase |top| |secret|
  amendment: |RECORDS|


Query Processing
Query String: exact:FOIA top secret amendment:RECORDS

After mark original:
and
  O/FOIA
  phrase |top| |secret|
  amendment: |RECORDS|


Query Processing
Query String: exact:FOIA top secret amendment:RECORDS

After mark lowercase:
and
  O/FOIA
  phrase |L/top| |L/secret|
  amendment: |records|


Query Processing
Query String: exact:FOIA top secret amendment:RECORDS

After lemmatize:
and
  O/FOIA
  phrase |L/top| |L/secret|
  amendment: |record|


Query Processing
Query String: exact:FOIA top secret amendment:RECORDS

After query template:
and
  O/FOIA
  phrase |L/top| |L/secret|
  between
    S/amendment
    |record|
    E/amendment
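The walkthrough for exact:FOIA top secret amendment:RECORDS can be replayed with four small tree-rewriting functions. The tuple representation and the rules below are reverse-engineered from the trees on the slides, not taken from the authors' code: exact: terms get an O/ prefix, phrase terms get L/, remaining untagged terms are lowercased and lemmatized, and a field restriction becomes a between node. The lemma table is a one-entry toy.

```python
TOY_LEMMAS = {"records": "record"}   # stand-in for the real lemmatizer


def tagged(term):
    return term.startswith(("O/", "L/"))


def walk(node, fn, parent="and"):
    """Apply fn(term, parent_op) to every term, keeping the tree shape."""
    if isinstance(node, str):
        return fn(node, parent)
    if node[0] == "field":               # ("field", name, term): skip the name
        return ("field", node[1], walk(node[2], fn, "field"))
    return (node[0],) + tuple(walk(c, fn, node[0]) for c in node[1:])


def mark_original(node):
    if isinstance(node, str):
        return node
    if node[0] == "exact":
        return "O/" + node[1]
    if node[0] == "field":
        return ("field", node[1], mark_original(node[2]))
    return (node[0],) + tuple(mark_original(c) for c in node[1:])


def mark_lowercase(node):
    def fn(term, parent):
        if tagged(term):
            return term
        return "L/" + term.lower() if parent == "phrase" else term.lower()
    return walk(node, fn)


def lemmatize(node):
    return walk(node, lambda t, _p: t if tagged(t) else TOY_LEMMAS.get(t, t))


def query_template(node):
    if isinstance(node, str):
        return node
    if node[0] == "field":
        return ("between", "S/" + node[1], "E/" + node[1], node[2])
    return (node[0],) + tuple(query_template(c) for c in node[1:])


parsed = ("and", ("exact", "FOIA"), ("phrase", "top", "secret"),
          ("field", "amendment", "RECORDS"))
tree = parsed
for stage in (mark_original, mark_lowercase, lemmatize, query_template):
    tree = stage(tree)
    print(stage.__name__, "->", tree)
```

Each printed line corresponds to one slide of the walkthrough, and the final tree is what would be handed to the stage that builds the Lucene query.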

The between() Operator


between(start-tag, end-tag, pos-clause, neg-clause)
start-tag: Starting tag, e.g. S/amendment
end-tag: Ending tag, e.g. E/amendment
pos-clause: words which must occur between start and end
  Note: Requires a nested ScanAnd() operator
neg-clause: words which must not occur between start and end
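The semantics of between() can be illustrated on a plain list of tokens. This is not the Lucene operator from the talk and does not use ScanAnd(); it is a sketch, under the assumption that a match means some start/end pair encloses all positive words and none of the negative ones.

```python
def between(tokens, start_tag, end_tag, pos, neg=()):
    """True if some start_tag..end_tag span holds all of pos and none of neg."""
    start = None
    for i, token in enumerate(tokens):
        if token == start_tag:
            start = i
        elif token == end_tag and start is not None:
            span = set(tokens[start + 1:i])
            if all(w in span for w in pos) and not any(w in span for w in neg):
                return True
            start = None                      # look for the next span
    return False


doc = ["S/amendment", "amendment", "2002", "pub", "l", "substitute",
       "department", "of", "E/amendment",
       "S/statute", "record", "E/statute"]

print(between(doc, "S/amendment", "E/amendment", ["substitute"]))
print(between(doc, "S/amendment", "E/amendment", ["record"]))
```

The second call fails even though "record" occurs in the document, because it lies outside the amendment span; that is the field-scoped behaviour the embedded markers are meant to provide.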

Part the Fourth: Hierarchical Navigation


screenshot


Hierarchies: Requirements
Any number of levels
Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section

Levels vary across titles


Title 1: 3 levels
Title 26: 8 levels

Multiple views:
Children
Ancestors
Ancestor Siblings

Multiple search scopes:


Only children, all descendants, everything

Hierarchies: Ancestor-Siblings
US-Code
  Title 1
  Title 2
    Chapter 1
    Chapter 2
      Part 1
      Part 2
        Section 2.1
        Section 2.2
      Part 3
      Part 4
    Chapter 3
    Chapter 4
  Title 3

Hierarchies: Fields
ancestors: for searching
  USC
  USC-title2
  USC-title2-chapter25
  USC-title2-chapter25-subchapter2

encodedAncestors: for display only
  Where the node exists within the hierarchy
  id;heading;subjectTitle//id;heading;subjectTitle//...
  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform

parentId: ID of the parent node
  USC-title2-chapter25-subchapter2

treesort: Hierarchical sort field, e.g. 13/000/0/00882
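A sketch of how the three navigation fields could be derived for one node from its chain of ancestors. The function names are hypothetical, and it is an assumption that the ancestors field lists the chain from the root down to the parent without the node itself; the separators (";" and "//") are the ones shown on the slide.

```python
def hierarchy_fields(ancestor_chain):
    """ancestor_chain: (id, heading, subjectTitle) tuples, root first,
    ending with the node's parent (the node itself is assumed excluded)."""
    ids = [entry[0] for entry in ancestor_chain]
    return {
        "ancestors": ids,                                   # multi-valued, searchable
        "encodedAncestors": "//".join(";".join(e) for e in ancestor_chain),
        "parentId": ids[-1] if ids else None,
    }


def decode_ancestors(encoded):
    """Inverse of the encoding, for building a breadcrumb display."""
    return [tuple(part.split(";")) for part in encoded.split("//") if part]


chain = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]
fields = hierarchy_fields(chain)
print(fields["parentId"])
print(fields["encodedAncestors"])
print(decode_ancestors(fields["encodedAncestors"]))
```

Storing the display data pre-encoded means a result page can render the ancestor path without further lookups, while the separate ancestors and parentId fields keep the search side simple.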



Hierarchies: Tree Sort


Sorting In Print Order
Front Matter, Titles, Tables, etc.
Everything padded to fixed-length

01/011/1/02032
01 = USC Title
011 = Title 11
1 = An Appendix
02032 = Sequence # in file
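The value of fixed-length padding is that a plain string sort then matches print order. The sketch below builds keys in the 2/3/1/5-digit layout of the slide's example; the parameter names and their interpretation (category, title number, appendix flag, sequence number) are assumptions based on the slide's labels.

```python
def make_treesort(category, title, is_appendix, sequence):
    """Build a fixed-width sort key such as 01/011/1/02032."""
    return f"{category:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"


keys = [
    make_treesort(1, 11, True, 2032),   # the slide's example
    make_treesort(1, 2, False, 5),
    make_treesort(1, 11, False, 1),
]
for key in sorted(keys):
    print(key)

# Without padding, string comparison would put title 11 before title 2.
print("2" < "11", make_treesort(1, 2, False, 5) < make_treesort(1, 11, False, 1))
```

Sorting search results on such a field returns them in the same order a reader would meet them in the printed Code.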


Hierarchies: Sample Searches


Assuming Node = USC-title2-chapter25
Search Children
parentId:USC-title2-chapter25

Search All Descendants


ancestors:USC-title2-chapter25

Ancestor Siblings
(parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
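The three sample searches follow directly from a node ID and its ancestor IDs, so they are easy to generate. The helper names below are hypothetical; the field names and query shapes are the ones on the slide. The strings could be used as a Solr query or filter query, though the deck does not say which.

```python
def children_query(node_id):
    return f"parentId:{node_id}"


def descendants_query(node_id):
    return f"ancestors:{node_id}"


def ancestor_siblings_query(ancestor_ids, node_id):
    """Children of every ancestor plus children of the node itself."""
    chain = list(ancestor_ids) + [node_id]
    return "(" + " OR ".join(f"parentId:{i}" for i in chain) + ")"


node = "USC-title2-chapter25"
print(children_query(node))
print(descendants_query(node))
print(ancestor_siblings_query(["USC", "USC-title2"], node))
```

Combined with a sort on treesort, the last query yields the expanded-tree view: every level from the root down to the current node, each with its siblings.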


Contact
Paul Nelson
pnelson@searchtechnologies.com

Ronald Matamoros
rmatamoros@searchtechnologies.com

Search Technologies
http://searchtechnologies.com
