Contents
Introduction
Preliminaries
Self-Join case
R-S Join case
Handling insufficient memory
Experimental evaluation
Conclusions
Introduction
Vast amount of data:
Google N-gram database: ~1 trillion records
GenBank: 100 million records, size = 416 GB
Facebook: 400 million active users
Examples
Detecting near-duplicate web pages in web crawling
Document clustering
Plagiarism detection
Master data management
"John W. Smith", "Smith, John", "John William Smith"
Making recommendations to users based on their similarity to other users
Query refinement
Mining in social networking sites
Users [1,0,0,1,1,0,1,0,0,1] and [1,0,0,0,1,0,1,0,1,1] have similar interests
Preliminaries
Problem Statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold τ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ τ
"I will call back" = [I, will, call, back]
"I will call you soon" = [I, will, call, you, soon]
Jaccard similarity = 3/6 = 0.5 (3 shared tokens, 6 distinct tokens in the union)
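The Jaccard computation above can be sketched in a few lines of Python (record strings taken from the example; the function name is illustrative):

```python
def jaccard(tokens1, tokens2):
    """Jaccard similarity: |intersection| / |union| of the two token sets."""
    a, b = set(tokens1), set(tokens2)
    return len(a & b) / len(a | b)

r1 = "I will call back".split()
r2 = "I will call you soon".split()
print(jaccard(r1, r2))  # 3 shared tokens / 6 distinct tokens = 0.5
```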
Why Hadoop ?
Set-Similarity Filtering
Efficient set-similarity join algorithms rely on effective filters.
Example: string s = "I will call back"; global token ordering: {back, call, will, I}; prefix of length 2 of s = [back, call]
The prefix-filtering principle states that similar strings must share at least one common token in their prefixes.
Record 1
Record 2
Each set has 5 tokens
Similar: they must share at least 4 tokens
Prefix length: 2
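A minimal sketch of the prefix-filtering check, assuming a fixed prefix length and the global ordering given as a rarest-first list (function names are illustrative, not from the paper):

```python
def prefix(tokens, ordering, p):
    """First p tokens of a record after sorting by the global
    (rarest-first) token ordering."""
    return sorted(set(tokens), key=ordering.index)[:p]

def may_be_similar(t1, t2, ordering, p):
    """Prefix-filtering principle: similar records must share at least
    one prefix token, so disjoint prefixes prune the pair early."""
    return bool(set(prefix(t1, ordering, p)) & set(prefix(t2, ordering, p)))

ordering = ["back", "call", "will", "I", "you", "soon"]  # rarest first
s = "I will call back".split()
print(prefix(s, ordering, 2))  # ['back', 'call'], as in the example above
```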
Stage I: Token Ordering (compute data statistics for good signatures)
Stage II: RID-Pair Generation
Stage III: Record Join (generate the actual pairs of joined records)
Input Data
RID = Row ID
a: the join column; its value (e.g. "A B C") is a string:
Address: 14th Saarbruecker Strasse Name: John W. Smith
Token Ordering
Creates a global ordering of the tokens in the join column, based on their frequency
Example: two records (RIDs 1 and 2) whose join values use the tokens A, B, D, and E; ordering the tokens by increasing frequency yields the global ordering E, D, B, A.
2 MapReduce cycles:
1st: compute token frequencies
2nd: sort the tokens by their frequencies
map (1st cycle): tokenize the join value of each record; emit each token with count 1
OPTO (One-Phase Token Ordering) Details
map: tokenize the join value of each record; emit each token with count 1
reduce: for each token, compute the total count (frequency)
RID-Pair Generation
Scans the original input data (records)
Outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
Consists of only one MapReduce cycle
Global ordering of tokens obtained in the previous stage
Grouping/Routing Strategies
Goal: distribute candidates to the right reducers to minimize the reducers' workload
Like hashing (projected) records into the corresponding candidate buckets
Each reducer handles one or more candidate buckets
2 routing strategies:
Using Individual Tokens Using Grouped Tokens
"A B C" => prefix of length 2: A, B => generate/emit 2 (key, value) pairs:
(A, (1, A B C))
(B, (1, A B C))
Advantage:
high-quality grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
Disadvantage:
high replication of data (the same records may be checked for similarity in multiple reducers, i.e. redundant work)
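A sketch of the individual-token routing map under these assumptions (prefix length 2, ordering given as a rarest-first list; names are illustrative):

```python
def route_by_token(rid, tokens, ordering, p=2):
    """Emit one (token, (rid, record)) pair per prefix token; the
    framework then groups values by token, so records sharing a
    prefix token meet in the same reducer (candidate bucket)."""
    pref = sorted(set(tokens), key=ordering.index)[:p]
    return [(tok, (rid, tokens)) for tok in pref]

ordering = ["A", "B", "C"]  # assumed global ordering, rarest first
print(route_by_token(1, ["A", "B", "C"], ordering))
# [('A', (1, ['A', 'B', 'C'])), ('B', (1, ['A', 'B', 'C']))]
```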
Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
For each record, generate a (key, value) pair for each of the groups of its prefix tokens:
Example: Given the global ordering:
Token Frequency
A 10
B 10
E 22
D 23
G 23
C 40
F 48
"A B C" => prefix of length 2: A, B
Suppose A, B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs:
(X, (1, A B C))
(Y, (1, A B C))
[Figure: the token list above partitioned into frequency-based groups, e.g. Group 2 and Group 3]
Disadvantage:
Quality of grouping is not as high (records that have no chance of being similar may be sent to the same reducer, which then checks their similarity)
Example: "A B C D" (A, B belong to group X; C belongs to group Y)
output: (X, _) and (Y, _)
Bucket of candidates
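Under one natural reading of the grouped-token strategy (emit one pair per distinct group hit by the record's prefix tokens; the grouping dict here is an assumption for illustration), the map looks like:

```python
def route_by_group(rid, tokens, ordering, group_of, p=2):
    """Like individual-token routing, but the keys are synthetic group
    ids; emitting once per distinct group lowers data replication, at
    the cost of coarser (lower-quality) candidate buckets."""
    pref = sorted(set(tokens), key=ordering.index)[:p]
    groups = sorted({group_of[tok] for tok in pref})
    return [(g, (rid, tokens)) for g in groups]

ordering = ["A", "B", "C", "D"]
group_of = {"A": "X", "B": "Y", "C": "Y", "D": "X"}  # assumed grouping
print(route_by_group(2, ["A", "B", "C", "D"], ordering, group_of))
# prefix [A, B] touches groups X and Y -> two pairs
```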
Uses a special index data structure
Not so straightforward to implement
map(): same as in the BK algorithm
Much more efficient
2 approaches:
Basic Record Join (BRJ) One-Phase Record Join (OPRJ)
R-S Join
Challenge: we now have 2 different record sources => 2 different input streams
A MapReduce job can work on only 1 input stream
The 2nd and 3rd stages are affected
Solution: extend the (key, value) pairs so that each value includes a relation tag for its record
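The relation-tag fix can be sketched as follows (the tag is prepended to each value so a single reduce function can pair records across the two inputs; all names and the threshold are illustrative):

```python
def tagged_map(relation, rid, tokens, ordering, p=2):
    """Self-join routing map, extended with a relation tag ('R' or 'S')."""
    pref = sorted(set(tokens), key=ordering.index)[:p]
    return [(tok, (relation, rid, tokens)) for tok in pref]

def tagged_reduce(values, sim, tau):
    """Verify only cross-relation candidate pairs: one side from R, one from S."""
    rs = [v for v in values if v[0] == "R"]
    ss = [v for v in values if v[0] == "S"]
    return [(r[1], s[1]) for r in rs for s in ss
            if sim(set(r[2]), set(s[2])) >= tau]

def jaccard(a, b):
    return len(a & b) / len(a | b)

ordering = ["A", "B", "C", "D"]
values = (tagged_map("R", 1, ["A", "B", "C"], ordering)
          + tagged_map("S", 7, ["A", "B", "D"], ordering))
# the reducer for key 'A' sees both records and verifies the pair
grouped = [v for k, v in values if k == "A"]
print(tagged_reduce(grouped, jaccard, 0.4))  # [(1, 7)]
```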
Evaluation
Cluster: 10-node IBM x3650, running Hadoop
Data sets:
DBLP: 1.2M publications
CITESEERX: 1.3M publications
Consider only the header of each paper (i.e. author, title, date of publication, etc.)
Data size synthetically increased (by various factors)
Measures: absolute running time, speedup, scaleup
Self-Join Speedup
Fixed data size, vary the cluster size Best time: BTO-PK-OPRJ
Self-Join Scaleup
Increase data size and cluster size together by the same factor Best time: BTO-PK-OPRJ
Self-Join Summary
Stage I: BTO was the best choice.
Stage II: PK was the best choice.
Stage III: the best choice depends on the amount of data and the size of the cluster.
OPRJ was somewhat faster, but the cost of loading the similar-RID pairs into memory stayed constant as the cluster size increased and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative.
Speed Up
Stage I: R-S join performance was identical to the first stage of the self-join case.
Stage II: a similar (almost perfect) speedup as in the self-join case.
Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach.
Conclusions
For both the self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method.
Useful in many data cleaning scenarios
SSJoin and MapReduce: one solution for huge datasets
Very efficient when based on prefix filtering and PPJoin+
Scales up nicely
Thank You!