
Galene: LinkedIn's search architecture

Diego Buthay & Sriram Sankar

LinkedIn's Vision
Create economic opportunity for every member of the
global workforce
Find work
Realize your dream job
Be great at what you do

LinkedIn's Vision

Search and Recommendations are core to our Vision

Overview
Infrastructure scaling
Developer productivity scaling
Result quality scaling

Comparison of different Search Engines

Netflix: 100K (Lucene)
AirBnB: 800K (Lucene)
Ebay: 500M (Custom C++)
Bing: 100s of Billions (Custom C++)
Google: 100s of Billions (Custom C++)
Facebook: Trillions (Custom C++)

LinkedIn: 100s of Millions (Lucene)

Galene (Lucene based)

Galene (Custom)

Important Galene Features

Offline index building
Live updates at a fine granularity
Static rank and early termination
Faceting
Data distribution
Relevance framework

Offline index building


Live updates at a fine granularity

A little about LinkedIn data

Most datasets at LinkedIn are available in 2 ways:
  A real-time change notification stream
  A complete dataset, ETL'd to Hadoop
We often rely on derived datasets
Many derived datasets can't be crunched in real time

Anatomy of a Galene index

Base Index
  Generated by Hadoop periodically
  Single-segment Lucene index
  On disk. Immutable. MMAPed and MLOCKed
  Contains complex / rich features that we can only afford to compute offline
Live Index
  Inverted index with our own format
  In-memory data structure
  Contains incremental updates to documents
Snapshot Index
  On-disk snapshot of the live index, taken when necessary
  Initially empty
  Single-segment Lucene index. The live index is folded in regularly
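
The three layers above can be modelled with a toy sketch. This is not Galene's implementation: the class name `LayeredIndex`, its methods, and the rule that newer layers override older ones (live over snapshot over base) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredIndex:
    """Toy model: each layer maps a doc id to a {field: value} dict."""
    base: dict                                    # built offline, never mutated here
    snapshot: dict = field(default_factory=dict)  # initially empty
    live: dict = field(default_factory=dict)      # in-memory incremental updates

    def update(self, doc_id, fields):
        # Incremental updates only ever touch the live layer.
        self.live.setdefault(doc_id, {}).update(fields)

    def lookup(self, doc_id):
        # Assumption: newer layers win (live over snapshot over base).
        merged = {}
        for layer in (self.base, self.snapshot, self.live):
            merged.update(layer.get(doc_id, {}))
        return merged

    def fold_live_into_snapshot(self):
        # Fold the live layer into the snapshot, then start a fresh live layer.
        for doc_id, fields in self.live.items():
            self.snapshot.setdefault(doc_id, {}).update(fields)
        self.live = {}


idx = LayeredIndex(base={1: {"name": "diego", "title": "engineer"}})
idx.update(1, {"title": "manager"})
before_fold = idx.lookup(1)
idx.fold_live_into_snapshot()
after_fold = idx.lookup(1)
```

Folding leaves the base untouched and does not change what a lookup returns; it only moves the updates out of memory.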

[Figure: example documents with their inverted index (with posting lists) and forward index, spread across the base index, the snapshot, and the in-memory live updates]

Inverted Index: Three Segments

Three independent segments with non-overlapping UIDs:
B1S1L1 (base/snapshot/live) segment
  Base has all UIDs.
  Neither Snapshot nor Live introduces new UIDs.
S2L2 (snapshot/live) segment
  None of its UIDs exist in BSL.
  Snapshot has all UIDs.
  Live does not introduce any new UIDs.
L3 (live) segment
  None of its UIDs exist in BSL or SL.
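
One way to read the partition rules is as a lookup that places each UID in exactly one segment. The function `segment_for` and its arguments are hypothetical names, used only to illustrate the rules.

```python
def segment_for(uid, base_uids, snapshot_uids):
    """Return the segment that owns a UID under the rules above."""
    if uid in base_uids:
        return "BSL"  # base has the UID; snapshot/live only carry updates to it
    if uid in snapshot_uids:
        return "SL"   # first seen after the base build, already snapshotted
    return "L"        # seen only in live updates so far


base_uids = {1, 2, 3}
snapshot_uids = {2, 4}  # UID 2 was updated, UID 4 is new since the base build
owners = {uid: segment_for(uid, base_uids, snapshot_uids) for uid in (2, 4, 5)}
```

Checking the base first is what keeps the segments non-overlapping: an updated base document stays in BSL even though the snapshot also mentions it.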

[Figure: the three segments, made up of B1, S1, L1; S2, L2; and L3]

Static rank and early termination

Search: Static Rank (SR)

A global score of a document
Each document must have one and only one SR
It could be anything that can globally represent the importance of a UID, for example, the number of 1st degree connections
Different documents might have the same SR
B1S1L1 segment
  Base knows SRs of all UIDs of the segment
S2L2 segment
  Snapshot knows SRs of all UIDs of the segment
L3 segment
  We assign artificial SRs in either of two ways:
    Ascending order starting from the max SR of all UIDs in all 3 segments
    Descending order starting from the min SR of all UIDs in all 3 segments
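
A minimal sketch of the artificial SR assignment for live-only documents. The slides do not say whether the sequence starts at the extreme value itself or just beyond it, nor what the step is; this sketch assumes integer SRs, a step of 1, and values strictly outside the existing range. The function name is hypothetical.

```python
def assign_artificial_sr(known_sr, new_uids, ascending):
    """Give each live-only UID an SR outside the range of known SRs.

    known_sr: {uid: static_rank} for all UIDs that already have an SR.
    new_uids: live-only UIDs, in arrival order.
    """
    start = max(known_sr.values()) if ascending else min(known_sr.values())
    step = 1 if ascending else -1
    return {uid: start + step * (i + 1) for i, uid in enumerate(new_uids)}


known_sr = {1: 50, 2: 10}
sr_ascending = assign_artificial_sr(known_sr, [7, 8], ascending=True)
sr_descending = assign_artificial_sr(known_sr, [7, 8], ascending=False)
```

Either way, the live-only documents end up entirely above or entirely below every document with a real SR.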

Search: Early Termination (ET)

Segment Level ET
The ordering chosen for the static rank assignment of the L segment affects the ordering of all segments, so we can search:
  BSL -> SL -> L (if it is descending)
  L -> SL -> BSL (if it is ascending)
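
The segment ordering can be sketched as a loop that visits segments in the order given above and stops as soon as enough hits are collected. The function names and the shape of `segments` (segment name mapped to its matching UIDs, already in SR order) are assumptions for illustration.

```python
def segment_search_order(live_sr_ascending):
    # Order taken from the two cases listed above.
    return ["L", "SL", "BSL"] if live_sr_ascending else ["BSL", "SL", "L"]


def search_segments(segments, live_sr_ascending, limit):
    """Collect up to `limit` hits, skipping later segments once full."""
    hits = []
    for name in segment_search_order(live_sr_ascending):
        for uid in segments[name]:
            hits.append(uid)
            if len(hits) == limit:
                return hits  # early termination: remaining segments untouched
    return hits


segments = {"BSL": [1, 2, 3], "SL": [4, 5], "L": [6]}
hits_descending = search_segments(segments, live_sr_ascending=False, limit=4)
hits_ascending = search_segments(segments, live_sr_ascending=True, limit=2)
```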

Posting List Level ET

Since all postings are first sorted by SR, early termination on a posting list guarantees that documents with the highest SRs are always retrieved first (however, this does not guarantee that their final scores are also the highest scores).
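
A small sketch of posting-list early termination and of the caveat in parentheses. The posting list is modelled as `(uid, static_rank)` pairs already sorted by descending SR; the function name, the `matches` predicate, and the `final_score` values are made up for illustration.

```python
def retrieve_with_early_termination(posting_list, matches, limit):
    """Scan postings in SR order and stop after `limit` matching documents."""
    hits = []
    for uid, sr in posting_list:
        if matches(uid):
            hits.append((uid, sr))
            if len(hits) == limit:
                break  # the rest of the posting list is never read
    return hits


posting_list = [(4, 90), (9, 80), (2, 70), (5, 10)]  # sorted by descending SR
hits = retrieve_with_early_termination(posting_list, lambda uid: uid != 9, limit=2)
retrieved_uids = [uid for uid, _ in hits]

# Hypothetical final relevance scores: document 5 would have scored highest,
# but its low SR means the scan stopped before reaching it.
final_score = {4: 0.3, 9: 0.1, 2: 0.5, 5: 0.9}
best_overall = max(final_score, key=final_score.get)
```

The retrieved documents are the highest-SR matches, yet the best document by final score is not among them, which is exactly the trade-off the slide points out.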

Going Forward
Very efficient custom index in C++
Base index build can be run in a distributed manner
BSL supported at a more fundamental level

Faceting

Faceting
Types of facets supported:
  discoverable (e.g. current company)
  static values (e.g. network)
  supplied values (e.g. my groups)
Legacy stack had no early termination, allowing for exact facet counting (at a cost)
Current Galene stack applies heuristics to determine counts in an approximate manner
Going forward, custom posting list format will encode facet details for more efficient facet count estimation
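
The slides do not say which heuristics Galene uses for approximate counts. Purely as an illustration of why early termination makes counts approximate, the sketch below counts facet values over the documents actually scanned and extrapolates linearly to an estimated total; the function name, the extrapolation rule, and the availability of `estimated_total_matches` are all assumptions.

```python
from collections import Counter


def estimate_facet_counts(scanned_facet_values, estimated_total_matches):
    """Scale facet counts seen before early termination up to the full result set."""
    scanned = len(scanned_facet_values)
    if scanned == 0:
        return {}
    scale = estimated_total_matches / scanned
    counts = Counter(scanned_facet_values)
    return {value: round(n * scale) for value, n in counts.items()}


# Current-company facet of the 4 documents scanned before termination,
# with an assumed 400 matching documents overall.
estimates = estimate_facet_counts(
    ["LinkedIn", "LinkedIn", "Google", "LinkedIn"], estimated_total_matches=400
)
```

Because the scanned documents are the highest-SR ones rather than a random sample, an estimate like this can be biased, which is one reason exact counting needed the full scan.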

Relevance framework

Relevance Framework
Infrastructure to support common scoring needs
Provides framework to evaluate relevance changes
Enables rapid iterations over relevance experiments
Allows relevance engineers to focus on building features

Life of a Query Within A Rewriter

[Figure: the query and rewriter state pass through a chain of rewriter modules, each backed by its own data model, producing the rewritten query]
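
The rewriter flow can be sketched as a chain of functions that each take and return a state. The two modules here (lowercasing and synonym expansion) and the dictionary standing in for a module's data model are hypothetical examples, not Galene's actual modules.

```python
def lowercase_module(state):
    return {**state, "query": state["query"].lower()}


def make_synonym_module(synonyms):
    # `synonyms` plays the role of this module's data model.
    def module(state):
        words = [synonyms.get(word, word) for word in state["query"].split()]
        return {**state, "query": " ".join(words)}
    return module


def run_rewriter(query, modules):
    """Pass the rewriter state through each module in order."""
    state = {"query": query}
    for module in modules:
        state = module(state)
    return state["query"]


rewritten = run_rewriter(
    "Sr Engineer", [lowercase_module, make_synonym_module({"sr": "senior"})]
)
```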

Life of a Query Within A Search Shard

[Figure: the rewritten query enters the shard; documents are retrieved from the index and scored one at a time; the shard returns its top results]
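
The retrieve-then-score loop inside a shard can be sketched with a bounded heap that keeps only the best `k` documents seen so far. The function name and the example scores are assumptions; the slides do not describe how Galene keeps its top results.

```python
import heapq


def search_shard(retrieved_docs, score, k):
    """Score documents one at a time and keep the k highest-scoring ones."""
    heap = []  # min-heap of (score, doc_id); never holds more than k entries
    for doc_id in retrieved_docs:
        entry = (score(doc_id), doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the current worst
    return [doc_id for _, doc_id in sorted(heap, reverse=True)]


scores = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}
top_results = search_shard([1, 2, 3, 4], scores.get, k=2)
```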

Case study: Instant Search

Case Study: Instant Member Search

The index contains connections as document terms
(term:diego AND prefix:buth AND (connection:35176 OR connection:418001 OR connection:1520032))

Static Rank of documents reflects popularity

Documents are augmented offline with spell correction data
shreeram sa:
(term:shreeram OR cluster:5678) AND (prefix:sa) AND (connection:1234)
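
The first example query can be reproduced with a small builder. Treating every word but the last as a complete term and the last word as a prefix is inferred from the example, not stated in the slides; the function name is hypothetical, the input is assumed to be non-empty, and the spell-correction `cluster:` clause is left out.

```python
def build_instant_query(typed, connection_ids):
    """Build the boolean query string for what the member has typed so far."""
    *complete_words, partial_word = typed.lower().split()
    clauses = [f"term:{word}" for word in complete_words]
    clauses.append(f"prefix:{partial_word}")  # last word is still being typed
    connections = " OR ".join(f"connection:{cid}" for cid in connection_ids)
    clauses.append(f"({connections})")
    return "(" + " AND ".join(clauses) + ")"


query = build_instant_query("diego buth", [35176, 418001, 1520032])
```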

Summary
Infrastructure scaling
Developer productivity scaling
Result quality scaling
