Contents
Introduction
Preliminaries
Self-Join case
R-S Join case
Handling insufficient memory
Experimental evaluation
Conclusions
Introduction
Vast amount of data:
Google N-gram database: ~1 trillion records
GenBank: 100 million records, size = 416 GB
Facebook: 400 million active users
Examples
Detecting near-duplicate web pages in web crawling
Document clustering
Plagiarism detection
Master data management
"John W. Smith", "Smith, John", "John William Smith"
Making recommendations to users based on their similarity to other users
Query refinement
Mining in social networking sites
Users [1,0,0,1,1,0,1,0,0,1] and [1,0,0,0,1,0,1,0,1,1] have similar interests
Preliminaries
Problem Statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold τ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ τ
"I will call back" = [I, will, call, back]
"I will call you soon" = [I, will, call, you, soon]
Jaccard similarity = 3/6 = 0.5 (3 shared tokens, 6 distinct tokens in the union)
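The Jaccard computation above can be sketched in a few lines of Python (record strings taken from the example; the function name is illustrative):

```python
def jaccard(tokens1, tokens2):
    """Jaccard similarity: |intersection| / |union| of the two token sets."""
    a, b = set(tokens1), set(tokens2)
    return len(a & b) / len(a | b)

r1 = "I will call back".split()
r2 = "I will call you soon".split()
print(jaccard(r1, r2))  # 3 shared tokens / 6 distinct tokens = 0.5
```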
Why Hadoop ?
Set-Similarity Filtering
Efficient set-similarity join algorithms rely on effective filters.
Example: string s = "I will call back"; global token ordering: {back, call, will, I}; prefix of length 2 of s = [back, call]
The prefix-filtering principle states that similar strings must share at least one common token in their prefixes.
Record 1
Record 2
Each set has 5 tokens
Similar: they must share at least 4 tokens
Prefix length: 2
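A minimal sketch of the prefix-filtering check, assuming a fixed prefix length and the global ordering given as a rarest-first list (function names are illustrative, not from the paper):

```python
def prefix(tokens, ordering, p):
    """First p tokens of a record after sorting by the global
    (rarest-first) token ordering."""
    return sorted(set(tokens), key=ordering.index)[:p]

def may_be_similar(t1, t2, ordering, p):
    """Prefix-filtering principle: similar records must share at least
    one prefix token, so disjoint prefixes prune the pair early."""
    return bool(set(prefix(t1, ordering, p)) & set(prefix(t2, ordering, p)))

ordering = ["back", "call", "will", "I", "you", "soon"]  # rarest first
s = "I will call back".split()
print(prefix(s, ordering, 2))  # ['back', 'call'], as in the example above
```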
Stage I: Token Ordering (compute data statistics for good signatures)
Stage II: RID-Pair Generation
Stage III: Record Join (generate the actual pairs of joined records)
Input Data
RID = Row ID
a: the join column; its value (e.g. "A B C") is a string:
Address: 14th Saarbruecker Strasse Name: John W. Smith
Token Ordering
Creates a global ordering of the tokens in the join column, based on their frequency
Example: two records (RIDs 1 and 2) whose join values use the tokens A, B, D, and E; ordering the tokens by increasing frequency yields the global ordering E, D, B, A.
2 MapReduce cycles:
1st: compute token frequencies
2nd: sort the tokens by their frequencies
map (1st cycle): tokenize the join value of each record; emit each token with count 1
OPTO (One-Phase Token Ordering) Details
map: tokenize the join value of each record; emit each token with count 1
reduce: for each token, compute the total count (frequency)
RID-Pair Generation
Scans the original input data (records)
Outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
Consists of only one MapReduce cycle
Global ordering of tokens obtained in the previous stage
Grouping/Routing Strategies
Goal: distribute candidates to the right reducers to minimize the reducers' workload
Like hashing (projected) records into the corresponding candidate buckets
Each reducer handles one or more candidate buckets
2 routing strategies:
Using Individual Tokens Using Grouped Tokens
"A B C" => prefix of length 2: A, B => generate/emit 2 (key, value) pairs:
(A, (1, A B C))
(B, (1, A B C))
Advantage:
high-quality grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
Disadvantage:
high replication of data (the same records may be checked for similarity in multiple reducers, i.e. redundant work)
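A sketch of the individual-token routing map under these assumptions (prefix length 2, ordering given as a rarest-first list; names are illustrative):

```python
def route_by_token(rid, tokens, ordering, p=2):
    """Emit one (token, (rid, record)) pair per prefix token; the
    framework then groups values by token, so records sharing a
    prefix token meet in the same reducer (candidate bucket)."""
    pref = sorted(set(tokens), key=ordering.index)[:p]
    return [(tok, (rid, tokens)) for tok in pref]

ordering = ["A", "B", "C"]  # assumed global ordering, rarest first
print(route_by_token(1, ["A", "B", "C"], ordering))
# [('A', (1, ['A', 'B', 'C'])), ('B', (1, ['A', 'B', 'C']))]
```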
Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
For each record, generate a (key, value) pair for each of the groups of its prefix tokens:
Example: Given the global ordering:
Token Frequency
A 10
B 10
E 22
D 23
G 23
C 40
F 48
"A B C" => prefix of length 2: A, B
Suppose A, B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs:
(X, (1, A B C))
(Y, (1, A B C))
[Figure: the token list above partitioned into frequency-based groups, e.g. Group 2 and Group 3]
Disadvantage:
Quality of grouping is not as high (records that have no chance of being similar may be sent to the same reducer, which then checks their similarity)
Example: "A B C D" (A, B belong to group X; C belongs to group Y)
output: (X, _) and (Y, _)
Bucket of candidates
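Under one natural reading of the grouped-token strategy (emit one pair per distinct group hit by the record's prefix tokens; the grouping dict here is an assumption for illustration), the map looks like:

```python
def route_by_group(rid, tokens, ordering, group_of, p=2):
    """Like individual-token routing, but the keys are synthetic group
    ids; emitting once per distinct group lowers data replication, at
    the cost of coarser (lower-quality) candidate buckets."""
    pref = sorted(set(tokens), key=ordering.index)[:p]
    groups = sorted({group_of[tok] for tok in pref})
    return [(g, (rid, tokens)) for g in groups]

ordering = ["A", "B", "C", "D"]
group_of = {"A": "X", "B": "Y", "C": "Y", "D": "X"}  # assumed grouping
print(route_by_group(2, ["A", "B", "C", "D"], ordering, group_of))
# prefix [A, B] touches groups X and Y -> two pairs
```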
Uses a special index data structure
Not so straightforward to implement
map(): same as in the BK algorithm
Much more efficient
2 approaches:
Basic Record Join (BRJ) One-Phase Record Join (OPRJ)
R-S Join
Challenge: we now have 2 different record sources => 2 different input streams
A MapReduce job can work on only 1 input stream
The 2nd and 3rd stages are affected
Solution: extend the (key, value) pairs so that each value includes a relation tag for its record
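The relation-tag fix can be sketched as follows (the tag is prepended to each value so a single reduce function can pair records across the two inputs; all names and the threshold are illustrative):

```python
def tagged_map(relation, rid, tokens, ordering, p=2):
    """Self-join routing map, extended with a relation tag ('R' or 'S')."""
    pref = sorted(set(tokens), key=ordering.index)[:p]
    return [(tok, (relation, rid, tokens)) for tok in pref]

def tagged_reduce(values, sim, tau):
    """Verify only cross-relation candidate pairs: one side from R, one from S."""
    rs = [v for v in values if v[0] == "R"]
    ss = [v for v in values if v[0] == "S"]
    return [(r[1], s[1]) for r in rs for s in ss
            if sim(set(r[2]), set(s[2])) >= tau]

def jaccard(a, b):
    return len(a & b) / len(a | b)

ordering = ["A", "B", "C", "D"]
values = (tagged_map("R", 1, ["A", "B", "C"], ordering)
          + tagged_map("S", 7, ["A", "B", "D"], ordering))
# the reducer for key 'A' sees both records and verifies the pair
grouped = [v for k, v in values if k == "A"]
print(tagged_reduce(grouped, jaccard, 0.4))  # [(1, 7)]
```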
Evaluation
Cluster: 10-node IBM x3650, running Hadoop
Data sets:
DBLP: 1.2M publications
CITESEERX: 1.3M publications
Consider only the header of each paper (i.e. author, title, date of publication, etc.)
Data size synthetically increased (by various factors)
Measures: absolute running time, speedup, scaleup
Self-Join Speedup
Fixed data size, vary the cluster size Best time: BTO-PK-OPRJ
Self-Join Scaleup
Increase data size and cluster size together by the same factor Best time: BTO-PK-OPRJ
Self-Join Summary
Stage I: BTO was the best choice.
Stage II: PK was the best choice.
Stage III: the best choice depends on the amount of data and the size of the cluster.
OPRJ was somewhat faster, but the cost of loading the similar-RID pairs into memory stayed constant as the cluster size increased and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative.
Speed Up
Stage I: R-S join performance was identical to the first stage of the self-join case.
Stage II: a similar (almost perfect) speedup as in the self-join case.
Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach.
Conclusions
For both the self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method.
Useful in many data cleaning scenarios
SSJoin and MapReduce: one solution for huge datasets
Very efficient when based on prefix filtering and PPJoin+
Scales up nicely
Thank You!