Key Challenges
How to index this massive, complex, 85-year-old document?
How to replicate 20-year-old search features?
Search Technologies
The largest independent provider of enterprise search expertise and services
80 full-time dedicated search engine experts
200+ customers
Technology Neutral
(yeah, we know Sphinx too)
Major Challenges
1. Document Parsing
A 50 Volume Table Of Contents!
2. Query Parsing
Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields)
[Screenshots]
[Diagram: Solr and Repository]
[Diagram: Title 14 hierarchy with ch. 1, ch. 2, ch. 3; pt. A, pt. B, pt. C; sec. 1, sec. 2, sec. 3]
Token Processing 1
xhtml tag tokenizer
Input:
<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> <p class="note-body">2002—Pub. L. 107–296 substituted “Department of <!-- field-end:amendment-note -->
Output:
<!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
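A minimal sketch of such a tokenizer, assuming a regular expression is enough to separate comments, tags, and words. The names `TOKEN_RE` and `tokenize_xhtml` are hypothetical and not taken from the presented system; unlike the slide, this sketch leaves word case untouched.

```python
import re

# One alternative per token kind: HTML comments first, then tags,
# then runs of letters and digits. Punctuation and whitespace are skipped.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]+>|[A-Za-z0-9]+", re.DOTALL)

def tokenize_xhtml(text):
    """Split XHTML into comment, tag, and word tokens, in document order."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    sample = '<h4 class="note-head">Amendments</h4>'
    print(tokenize_xhtml(sample))
```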
Token Processing 2
Mark Start and End Tags
Input:
<!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
Output:
S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
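This stage can be sketched as a pass over the token list that rewrites the field-start and field-end comments. The function name and the rule that drops a trailing "-note" are assumptions, inferred only from "amendment-note" becoming "amendment" in the example above.

```python
import re

# Matches the field marker comments, e.g. "<!-- field-start:amendment-note -->".
FIELD_COMMENT_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")

def mark_field_boundaries(tokens):
    """Replace field marker comments with S/<field> and E/<field> tokens."""
    out = []
    for tok in tokens:
        m = FIELD_COMMENT_RE.fullmatch(tok)
        if m is None:
            out.append(tok)  # ordinary tag or word: pass through unchanged
            continue
        kind, name = m.groups()
        # Assumption: a trailing "-note" is not part of the field name.
        if name.endswith("-note"):
            name = name[: -len("-note")]
        out.append(("S/" if kind == "start" else "E/") + name)
    return out
```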
Token Processing 3
Remove XHTML Tags
Input:
S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
Output:
S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
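Once the field markers have been rewritten, dropping the remaining markup is a simple filter. The sketch below assumes that any token starting with "<" and ending with ">" is a tag; the function name is hypothetical.

```python
def remove_xhtml_tags(tokens):
    """Drop tag tokens, keeping words and the S/ and E/ field markers."""
    return [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]
```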
Token Processing 4
Tag Original Case & Lower Case
Input:
S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
Output (O/ = original case, L/ = lower case):
S/amendment O/Amendments O/2002 O/Pub O/L O/107 O/296 O/Substituted O/Department O/of E/amendment
            L/amendments L/2002 L/pub L/l L/107 L/296 L/substituted L/department L/of
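A sketch of the case-tagging step, assuming each word yields two tokens that sit at the same position (modelled here as a small list per position), while field markers pass through alone. The list-of-lists representation and the function name are illustrative choices, not the presenters' data structure.

```python
def tag_case(tokens):
    """Return one list per position: [O/word, L/word] for words, [marker] for S/ and E/."""
    positions = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            positions.append([tok])  # field markers are not case-tagged
        else:
            positions.append(["O/" + tok, "L/" + tok.lower()])
    return positions
```

Indexing both forms would let an exact-case search match only the O/ tokens while ordinary searches use the L/ tokens.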
Token Processing 5
Lemmatize
Uses a dictionary-based lemmatizer based on GCIDE and WordNet
Input:
S/amendment O/Amendments O/2002 O/Pub O/L O/107 O/296 O/Substituted O/Department O/of E/amendment
            L/amendments L/2002 L/pub L/l L/107 L/296 L/substituted L/department L/of
Output (lemmas added as a third row):
S/amendment O/Amendments O/2002 O/Pub O/L O/107 O/296 O/Substituted O/Department O/of E/amendment
            L/amendments L/2002 L/pub L/l L/107 L/296 L/substituted L/department L/of
            amendment 2002 pub l 107 296 substitute department of
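To illustrate the dictionary lookup, here is a sketch that uses a tiny hand-written table as a stand-in for the GCIDE and WordNet data. The table, the function name, and the per-position list structure are all assumptions made for the example.

```python
# Toy stand-in for a real lemma dictionary; only two entries for illustration.
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}

def add_lemmas(positions, lemmas=LEMMAS):
    """Append an unprefixed lemma to every position that has an L/ token."""
    out = []
    for stack in positions:
        new_stack = list(stack)
        for tok in stack:
            if tok.startswith("L/"):
                word = tok[2:]
                # Words missing from the dictionary are used as their own lemma.
                new_stack.append(lemmas.get(word, word))
        out.append(new_stack)
    return out
```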
Query Processing
(not all stages shown)
Query String
Stages: mark exact:, mark phrases, query template, build lucene query, parse, lemmatize
search
Query Processing
Query String: exact:FOIA top secret amendment:RECORDS
Stages: mark original, mark lowercase, query template, build lucene query, parse, lemmatize
search
|record|
E/amendment
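The slides do not spell out how each stage transforms the example query, so the following is only a guess at the idea. It assumes that an `exact:` prefix selects the original-case token, that any other prefix names a field, and that unprefixed words are lowercased and lemmatized. The function name, the tuple output, and the one-entry lemma table are hypothetical.

```python
def parse_query(query, lemmas=None):
    """Turn a query string into (field, term) clauses; field is None when unrestricted."""
    lemmas = lemmas or {}
    clauses = []
    for part in query.split():
        prefix, sep, word = part.partition(":")
        if not sep:
            prefix, word = None, part  # no "prefix:" present
        if prefix == "exact":
            # Assumption: exact-case terms match the O/ (original case) tokens.
            clauses.append((None, "O/" + word))
        else:
            lower = word.lower()
            clauses.append((prefix, lemmas.get(lower, lower)))
    return clauses

if __name__ == "__main__":
    print(parse_query("exact:FOIA top secret amendment:RECORDS", {"records": "record"}))
```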
neg-clause: words which must not occur between start and end
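The surrounding slide content is lost, so the exact operator is unknown. As a rough sketch of the idea, the function below scans a token list for a start token followed by an end token and rejects the span if any neg-clause word appears in between. All names are hypothetical.

```python
def between_matches(tokens, start, end, neg=()):
    """Return (start_index, end_index) spans with no neg word between them."""
    neg = set(neg)
    spans = []
    for i, tok in enumerate(tokens):
        if tok != start:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] in neg:
                break  # a forbidden word occurs before the end token
            if tokens[j] == end:
                spans.append((i, j))
                break
    return spans
```

With start and end set to S/amendment and E/amendment, this would accept only amendment notes that do not contain any of the excluded words.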
Hierarchies: Requirements
Any number of levels
Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
Multiple views:
Children, Ancestors, Ancestor's Siblings
Hierarchies: Ancestor-Siblings
US-Code
  Title 1
  Title 2
    Chapter 1
    Chapter 2
      Part 1
      Part 2
        Section 2.1
        Section 2.2
      Part 3
      Part 4
    Chapter 3
    Chapter 4
  Title 3
Hierarchies: Fields
ancestors
Searching
USC
USC-title2
USC-title2-chapter25
USC-title2-chapter25subchapter2

01/011/1/02032
  01 = USC Title
  011 = Title 11
  1 = An Appendix
  02032 = Sequence # in file
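The ancestors values look like cumulative paths from the root. A sketch of how such a list could be generated, assuming each level name is appended to its parent's id with a hyphen (which matches the first three values shown); the function name and the `root` parameter are hypothetical.

```python
def ancestor_ids(levels, root="USC"):
    """Build cumulative ancestor ids, e.g. USC, USC-title2, USC-title2-chapter25."""
    ids = [root]
    for level in levels:
        ids.append(ids[-1] + "-" + level)  # each id extends its parent's id
    return ids
```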
Ancestor Siblings
(parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
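A query of this shape can be assembled from a document's ancestor list: any node whose parent is one of the ancestors is an ancestor or a sibling of one. The sketch below only builds the query string; the function name is hypothetical, `parentId` is the field name from the slide, and escaping of special query characters is not handled.

```python
def ancestor_siblings_query(ancestors, field="parentId"):
    """Build an OR query matching every child of the given ancestors."""
    return "(" + " OR ".join(f"{field}:{a}" for a in ancestors) + ")"

if __name__ == "__main__":
    print(ancestor_siblings_query(["USC", "USC-title2", "USC-title2-chapter25"]))
```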
Contact
Paul Nelson
pnelson@searchtechnologies.com
Ronald Matamoros
rmatamoros@searchtechnologies.com
Search Technologies
http://searchtechnologies.com