Galene Linkedinssearcharchitecture 141126110730 Conversion Gate02

Galene: LinkedIn's search architecture
Diego Buthay & Sriram Sankar
LinkedIn's Vision
Create economic opportunity for every member of the
global workforce
Find work
Realize your dream job
Be great at what you do
LinkedIn's Vision
Search and Recommendations are core to our Vision
Overview
Infrastructure scaling
Developer productivity scaling
Result quality scaling
Comparison of different Search Engines

Netflix: 100K (Lucene)
AirBnB: 800K (Lucene)
Ebay: 500M (Custom C++)
Bing: 100s of Billions (Custom C++)
Google: 100s of Billions (Custom C++)
Facebook: Trillions (Custom C++)
LinkedIn: 100s of Millions (Lucene -> Galene (Lucene based) -> Galene (Custom))
Important Galene Features
Offline index building

Live updates at a fine granularity
Static rank and early termination
Faceting
Data distribution
Relevance framework
Offline index building

Live updates at a fine granularity
A little about LinkedIn data

Most datasets at LinkedIn are available in 2 ways
A real time change notification stream
A complete dataset, ETL'd to Hadoop
We often rely on derived datasets
Many derived datasets can't be crunched in real time
Anatomy of a Galene index

Base Index
Generated by Hadoop periodically
Single-segment Lucene index
On Disk. Immutable. MMAPed and MLOCKed
Contains complex / rich features that we can only afford to compute offline
Live Index
Inverted index with our own format
In-memory data structure
Contains incremental updates to documents
Snapshot Index
On Disk Snapshot of Live index when necessary
Initially empty
Single segment Lucene Index. Live index is folded in regularly
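The relationship between the three tiers above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, postings are plain Python sets rather than Lucene structures, updates only add terms to documents, and nothing is persisted to disk.

```python
from collections import defaultdict


class TieredIndex:
    """Toy model of a base / snapshot / live index (all names hypothetical)."""

    def __init__(self, base_postings):
        # Base: built offline and treated as immutable here.
        self.base = {term: frozenset(docs) for term, docs in base_postings.items()}
        # Snapshot: initially empty, receives folded-in live updates.
        self.snapshot = defaultdict(set)
        # Live: in-memory incremental updates.
        self.live = defaultdict(set)

    def add_term(self, doc_id, term):
        """Record an incremental update in the live tier only."""
        self.live[term].add(doc_id)

    def fold_live_into_snapshot(self):
        """Merge live postings into the snapshot, then reset the live tier."""
        for term, docs in self.live.items():
            self.snapshot[term] |= docs
        self.live = defaultdict(set)

    def search(self, term):
        """Return the union of matching documents across all three tiers."""
        return (
            set(self.base.get(term, ()))
            | self.snapshot.get(term, set())
            | self.live.get(term, set())
        )
```

In this sketch a search always consults all three tiers, so a result set is unchanged by folding the live tier into the snapshot.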
[Figure: Inverted Index (with Posting Lists) and Forward Index, shown across the Base Index, the Snapshot, and the In-Memory Live Updates]
Inverted Index: Three Segments

Three independent segments with non-overlapping UIDs:
B1S1L1 (Base/Snapshot/Live) segment
Base has all UIDs.
Neither Snapshot nor Live introduces new UIDs.
S2L2 (Snapshot/Live) segment
None of its UIDs exist in BSL.
Snapshot has all UIDs.
Live does not introduce any new UIDs.
L3 (Live) segment
None of its UIDs exist in BSL or SL.
[Figure: segment layout with B1, S1, L1 in the first segment, S2, L2 in the second, and L3 in the third]
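The membership rules above imply that a UID's segment can be decided from the Base and Snapshot UID sets alone. A minimal sketch, assuming those sets are available as Python sets (the function names are hypothetical):

```python
def segment_for(uid, base_uids, snapshot_uids):
    """Return which segment a UID belongs to under the rules above."""
    if uid in base_uids:
        # Base has every UID of the BSL segment.
        return "BSL"
    if uid in snapshot_uids:
        # Not in Base, but known to the Snapshot: the SL segment.
        return "SL"
    # Known only to the Live index.
    return "L"


def partition(uids, base_uids, snapshot_uids):
    """Group UIDs by segment; the three groups never overlap."""
    groups = {"BSL": set(), "SL": set(), "L": set()}
    for uid in uids:
        groups[segment_for(uid, base_uids, snapshot_uids)].add(uid)
    return groups
```

Checking Base first matters in this sketch: the Snapshot may hold updates for UIDs that already live in Base, and those still belong to BSL.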
Static rank and early termination
Search: Static Rank (SR)

A global score of a document
Each document must have one and only one SR
It could be anything that can globally represent the importance of a UID, for
example, the number of 1st degree connections
Different documents might have the same SR
B1S1L1 segment
Base knows SRs of all UIDs of the segment
S2L2 segment
Snapshot knows SRs of all UIDs of the segment
L3 segment
We assign artificial SRs in either of two ways:
Ascending order starting from the max SR of all UIDs in all 3 segments
Descending order starting from the min SR of all UIDs in all 3 segments
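The two assignment schemes for L3 can be sketched as follows. This assumes SRs are integers and that new values start one step beyond the current max or min; the slides do not say whether the starting value is inclusive, so that choice, like the function name, is an assumption.

```python
def assign_artificial_srs(new_uids, existing_srs, ascending=True):
    """Give each new L3 UID an artificial static rank.

    existing_srs maps every known UID (across all segments) to its SR.
    Assumption: integer SRs, stepping by 1 away from the current extreme.
    """
    if ascending:
        start = max(existing_srs.values(), default=0) + 1
        step = 1
    else:
        start = min(existing_srs.values(), default=0) - 1
        step = -1
    return {uid: start + i * step for i, uid in enumerate(new_uids)}
```

With ascending assignment every L3 document outranks all existing documents; with descending assignment every L3 document ranks below them.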
Search: Early Termination (ET)

Segment Level ET
Depending on the ordering of the static rank assignment of the L segment, which will
affect the ordering of all segments, we can search:
BSL -> SL -> L (if it is descending)
L -> SL -> BSL (if it is ascending)
Posting List Level ET

Since all postings are first sorted by SR, early termination on a posting list guarantees
that documents with the highest SRs are always retrieved first (however, this does not
guarantee that the final scores are also the highest scores).
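The caveat about final scores can be seen in a small sketch of posting list level early termination. It assumes a posting list is an iterable of (uid, static_rank) pairs already sorted by SR in descending order; the function and parameter names are hypothetical.

```python
import heapq
from itertools import islice


def search_posting_list(postings, final_score, retrieve_limit, top_k):
    """Scan at most retrieve_limit postings, then rank them by final score.

    postings: iterable of (uid, static_rank), sorted by static_rank descending.
    final_score: function mapping a uid to its query-dependent score.
    """
    # Early termination: stop after the first retrieve_limit postings.
    retrieved = islice(postings, retrieve_limit)
    # Only the retrieved documents compete on final score.
    return heapq.nlargest(top_k, (uid for uid, _sr in retrieved), key=final_score)
```

A document with a low SR but a high final score is never seen if the scan stops before reaching it, which is the trade-off the slide describes.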
Going Forward
Very efficient custom index in C++
Base index build can be run in a distributed manner
BSL supported at a more fundamental level
Faceting
Faceting
Types of facets supported:
discoverable (e.g. current company)
static values (e.g. network)
supplied values (e.g. my groups)
Legacy stack had no early termination, allowing for exact facet counting (at a
cost)
Current Galene stack applies heuristics to determine counts in an approximate
manner
Going forward, custom posting list format will encode facet details for more
efficient facet count estimation
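The slides do not say which heuristics Galene uses for approximate counts. Purely as an illustration of the idea, one possible heuristic is to count facet values over the documents scanned before early termination and scale them by an estimate of the total number of matches. Both the function and the scaling rule below are assumptions, not Galene's method.

```python
from collections import Counter


def estimate_facet_counts(scanned_facet_values, estimated_total_matches):
    """Extrapolate facet counts from the documents scanned before termination.

    scanned_facet_values: one facet value per scanned document.
    estimated_total_matches: assumed estimate of all matching documents.
    """
    scanned = len(scanned_facet_values)
    if scanned == 0:
        return {}
    # Never scale below the counts that were actually observed.
    scale = max(1.0, estimated_total_matches / scanned)
    counts = Counter(scanned_facet_values)
    return {value: round(n * scale) for value, n in counts.items()}
```

This kind of extrapolation is biased when facet values correlate with static rank, since the scanned documents are the highest ranked ones rather than a random sample.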
Relevance framework
Relevance Framework
Infrastructure to support common scoring needs
Provides framework to evaluate relevance changes
Enables rapid iterations over relevance experiments
Allows relevance engineers to focus on building features
Life of a Query Within A Rewriter
[Figure: Query -> Rewriter Module -> Rewriter Module -> Rewriter Module -> Rewritten Query, with Rewriter State and a DATA / MODEL pair for each module]
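The chain of rewriter modules can be sketched as a list of functions applied in order, each receiving the query and a shared state object. Everything here is hypothetical: the module interface, the state fields, and the two example modules (the synonym table stands in for a module's data or model).

```python
from dataclasses import dataclass, field


@dataclass
class RewriterState:
    """State shared by all modules while one query is rewritten."""
    raw_query: str
    annotations: dict = field(default_factory=dict)


def lowercase_module(query, state):
    """A trivial module that needs no data or model."""
    return query.lower()


def make_synonym_module(synonyms):
    """Build a module backed by its own data (a synonym table)."""
    def module(query, state):
        rewritten = []
        for token in query.split():
            alternatives = synonyms.get(token)
            if alternatives:
                state.annotations.setdefault("expanded", []).append(token)
                rewritten.append("(" + " OR ".join([token] + alternatives) + ")")
            else:
                rewritten.append(token)
        return " ".join(rewritten)
    return module


def rewrite(query, modules):
    """Run the query through each module in order."""
    state = RewriterState(raw_query=query)
    for module in modules:
        query = module(query, state)
    return query, state
```

Keeping modules as independent functions means a relevance engineer can add or reorder one without touching the others.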
Life of a Query Within A Search Shard
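[Figure: within a search shard, the Rewritten Query drives a loop that retrieves a document from the INDEX and scores the document while maintaining Top Results; the output is the Top Results From Shard]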
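The retrieve and score loop inside a shard can be sketched with a bounded min-heap that holds the best results seen so far. This is a minimal sketch: the function name, the shape of the inputs, and the use of a heap are assumptions rather than details from the slides.

```python
import heapq


def search_shard(matching_doc_ids, score, top_k):
    """Score each retrieved document and keep the top_k by score.

    matching_doc_ids: iterable of document ids retrieved from the index.
    score: function mapping a document id to its score.
    """
    if top_k <= 0:
        return []
    heap = []  # min-heap of (score, doc_id); the root is the weakest kept result
    for doc_id in matching_doc_ids:
        s = score(doc_id)
        if len(heap) < top_k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            # Replace the weakest kept result with the better one.
            heapq.heapreplace(heap, (s, doc_id))
    # Highest score first.
    return [doc_id for _s, doc_id in sorted(heap, reverse=True)]
```

The heap keeps memory bounded by top_k regardless of how many documents the query matches.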
Case study Instant Search
Case Study: Instant Member Search

The index contains connections as document terms
(term:diego AND prefix:buth AND (connection:35176 OR connection:418001 OR connection:1520032))
Static Rank of documents reflects popularity

Documents are augmented offline with spell correction data
shreeram sa:
(term:shreeram OR cluster:5678) AND (prefix:sa) AND (connection:1234)
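The two example queries above suggest how typed text could be turned into an index query: completed words become term clauses, the word still being typed becomes a prefix clause, and the searcher's connections are OR'd together. The sketch below assumes exactly that; the function name and the query-side lookup from a word to a spelling cluster id are hypothetical.

```python
def build_instant_query(typed, connection_ids, spelling_clusters=None):
    """Build a query string in the style of the examples above.

    typed: the text entered so far; the last word is treated as a prefix.
    connection_ids: ids of the searcher's connections.
    spelling_clusters: assumed mapping from a word to a spelling cluster id.
    """
    spelling_clusters = spelling_clusters or {}
    tokens = typed.lower().split()
    if not tokens:
        return ""
    *complete, partial = tokens
    clauses = []
    for token in complete:
        cluster = spelling_clusters.get(token)
        if cluster is None:
            clauses.append(f"term:{token}")
        else:
            # Match either the literal term or its spelling cluster.
            clauses.append(f"(term:{token} OR cluster:{cluster})")
    clauses.append(f"prefix:{partial}")
    if connection_ids:
        connections = " OR ".join(f"connection:{c}" for c in connection_ids)
        clauses.append(f"({connections})")
    return "(" + " AND ".join(clauses) + ")"
```

Because connections are indexed as document terms, restricting results to the searcher's network is just another clause in the same query.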
Summary
Infrastructure scaling
Developer productivity scaling
Result quality scaling