Professional Documents
Culture Documents
I NTRODUCTION
Haoji Hu, Xiaoling Wang and Aoying Zhou are with the Shanghai Key
Laboratory of Trustworthy Computing, East China Normal University,
Shanghai, China.
E-mail: {hjhu}@ecnu.cn E-mail: {xlwang,ayzhou}@sei.ecnu.edu.cn
Kai Zheng is with the School of Information Technology and Electrical
Engineering, University of Queensland, Brisbane, Australia, 4072.
E-mail: {kevinz}@itee.uq.edu.au
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
2
2.1
P RELIMINARIES
Problem Definition
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
a b
b c c d d e e $
string: abde
a b
b d
a b
b c c d d e e $
query: abcde
d e e $
string id
1,3,5,7
1,3,9,11
1,3,7,9,11
1,4,5,10,15
1,3(4),9,10,16
1,2,7,8,13,14
M OTIVATION
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
j
gG
(s1 ) |I(g)|.
LB
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
jLB =
|I(g)| +
j
gGLB (s1 )
costv (c)
(1)
j
cCLB (s1 )
min
gi Gq (s1 ),LB
where:
gi Gq (s1 )
|I(g)| +
costv (c)
(2)
cCLB (s1 )
( LB+D )
1 j |G
q (s1 )|
gi Gq (s1 ) gi = LB
gi =0,1
(3)
ment of set s.
1
2
3
4
5
6
7
Input:
The q-gram set of query Q, G(Q);
The number of grams destroyed by edit operations, D;
The inverted lists, I;
Output:
The similar string set Re according to query
Choose D + 1 grams as the base query-gram set and
generate the candidate set C1 ;
for i = 2;i <= |G(Q)| D;+ + i do
if No benefits can be obtained from adding new grams
then
break;
Add gi with largest benefit into the query-gram set and
merge its inverted list with Ci1 to generate the
candidate set Ci ;
Verify the Cf inally set to generate the result set Re;
return Re;
sS
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
the reduction cost in the size of the candidate set and new
gram list length is used to quantify the gain of adding a
new gram.
5.2.1 Effect on Cost Model by Adding Gram in QueryGram Set
Assuming a base query-gram set is obtained, which contains D + 1 grams and common gram lower bound is 1. Let Ci
denote the candidate set whose common gram lower bound
is i and Ci= denote the subset, whose elements frequency
is equal to the common gram lower bound, of Ci . Based on
the total cost in Equation 1, absorbing new gram gn may
increase the access time but can reduce the candidate size
that saves the verification time. Let B denote the reduction
in the size of the candidate set and I(gn ) denote new added
gram gn in inverted list. The Gain of a newly added gram
is evaluated by the following equation:
Gain(gn ) = |B| costv (Q) |I(gn )|
(4)
JS (C
,I(gn ))
|Ci=
(5)
I(gn )| and
|Ci= I(gn )|
|Ci= I(gn )|
|Ci= | + |I(gn )|
1 + JS (Ci= , I(gn ))
(7)
5.2.2 Algorithm
The greedy algorithm revise framework in Algorithm 1 and
is illustrated in Algorithm 2. SelectGramByBenefit() is to
select gram with the largest Benef it(gn ) in the left q-gram
set. After the base query-gram set is fixed, new grams are
appended until the largest Gain(gn ) has no benefit, which
is check in CheckLargestGain(). The worst cost of greedy
algorithm is O(n2 ), where n is the number of querys qgrams.
Algorithm 2: Greedy Algorithm
1
2
3
4
(6)
we can obtain
|Ci= I(gn )| =
(8)
Gram set selection framework can reduce the cost of selection and the gain can help decide when to stop selection.
Nevertheless, we still need more efficient and effective
strategies to find high quality grams. In this section, we
improve this framework by introducing a ranking method
for better grams selection and a two-step stop condition
with less extra checking cost. Since the string elements
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
candidate number
Candidate Size
4000
3500
3000
2500
2000
1500
1000
500
0
DBLP
102
101
50
100
query id
150
200
Q = abcdef g generate grams {ab, bc, cd, de, ef, f g}, and
D = 1, LB = 2. For the traditional prefix filter, querygram set {ab, bc, cd} will be chosen as the query-gram set.
Then the candidates are {1, 3, 7, 9, 11}, since all of them
have at least 2 occurrences in the inverted list. However, if
we choose another three grams {ab, ef, f g} as the querygram set, the candidates are {1, 7}, where the number of
candidates is reduced due to the small list overlap among
the lists associated with the query grams. Although the
operation time increases for extra 3 elements, the candidate
size is reduced by more than a half, by which we can save
3( +1)|Q|) = 327 = 42 operations in the verification
step (We use verification method proposed in [33]).
We observe through analysis based on real dataset that
short gram inverted lists dominate the gram set of a string
collection. Figure 3(b) shows that 3-gram lists length in
DBLP follows a power law distribution, which means the
majority of gram lists is not very long. There is potential to
select longer lists without sacrificing a lot of access cost,
even though they are not optimized. If we can choose the a
little longer gram lists that just generate a few candidates,
the total cost is expected to decrease. Inspired by this
observation, we may improve the quality of the constructed
prefix by proposing a new ranking criterion for grams
based on the potential number of candidates that it may
generate. However, we are faced with several challenges
for developing this new criterion:
100
100
101
102
103
104
Token Frequency
105
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
gram
ab
bc
cd
de
ef
fg
18
4,0
4,0
4,0
4,3
4,3
0,1
1-sig
9 16
null,null
1,2
1,2
2,3
1,6
5,0
1-sig DIS
3.5
5
5
3.5
3.5
1.5
2-sig
1 16
4,1
4,3
4,3
2,0
3,0
0,5
2-sig DIS
2
3
3
1
1
0
discrimination of gi on G is defined as
F requencyI (sj ) 1
(9)
sj I(gi )
Relative Discrimination
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
1
2
6.5
Synthetic Criterion
g, and |IG
(g)| stands for the strings that survive in length
filter under the edit distance . The discrimination and
list length are combined into the total rank of gram after
normalization(Lines 3 4), which is sorted in ascending
order (Line 5). The time complexity of this algorithm is
O(N + N log N ), in which the N is the number of grams.
Recall the calculation of the Gain in Equation 4and
Lemma 4, another observation is Gain |I(gn )| and
Gain JS (Ci1,I(gn )) . It suggests gram with smaller length
and lower list overlap with others can have large Gain.
This is similar to synthetic criterion, which aims to rank
shorter and more discriminative grams in the front of order.
Inspired by this observation, we can simply select grams in
turn, which will be used in next subsection.
Algorithm 3: RankGram()
1
2
3
4
5
Input:
The discrimination set of gram set, U ;
The gram set corresponding inverted lists of G(Q), IG ;
The edit distance ;
for each g G(Q) do
N ormdiscrimination + = U (g);
N ormlength + = |IG
(g)|;
for each g G(Q) do
|IG
(g)|
U (g)
+ N orm
;
rank(g) = (1) N ormdiscrimination
length
3
4
5
6
7
8
9
10
Input:
The q-gram set of query Q, G(Q);
The number of grams destroyed by edit operations, D;
The min-hash signature set of gram lists, S;
The inverted lists, I;
Output:
The similar string set Re according to query
CalculateDiscriminationOfGrams(G(Q));
RankGram(G(Q), );
Choose the top D + 1 grams as the base query-gram set and
generate the candidate set C1 ;
for i = D + 2;i < |G(Q)|;+ + i do
if (|Ci= | |Ii |) costv (Q) |Ii | < 0 then
if Gain(gi ) < 0 then
break;
Add i-th gram gi in the ranked lists into the query-gram
set and merge its inverted list with Ci 1 to generate
the candidate set Ci ;
Verify the Cf inally set to generate the result set Re;
return Re;
EXPERIMENTAL EVALUATION
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
Data Sets
Uniref50
TREC
DBLP
DBLP-author
avg len
406
240
105
39
max len
4017
779
1626
100
min len
120
100
28
16
Sizes
878,148
417,698
650,207
650,207
10
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
4
5
hash number
20
15
10
(a) DBLP
4
5
hash number
1
0.95
0.9
0.85
0.8
0.75
0.7
0.65
0.6
0.55
0.5
time
NDCG
30
Time(ms)
25
35
NDCG
1
0.95
0.9
0.85
0.8
0.75
0.7
0.65
0.6
0.55
0.5
time
NDCG
Time(ms)
time
NDCG
30
25
20
15
10
5
0
4
5
hash number
(b) Uniref50
NDCG
NDCG
Time(ms)
11
(c) TREC
similarity search
80
CF-Select
GF-Select
Greedy
Adaptive-Search
ChunkGram
80
Time(ms)
Time(ms)
100
similarity search
100
60
40
60
20
0
8
12
16
edit distance
(a) DBLP
20
100
40
20
4
120
CF-Select
GF-Select
Baseline
Adaptive-Search
ChunkGram
Time(ms)
120
80
CF-Select
GF-Select
Greedy
Adaptive-Search
ChunkGram
60
40
20
0
12
16
edit distance
20
(b) Uniref50
12
16
edit distance
20
(c) TREC
Fig. 7. Comparison of general gram filter with query-gram set selection similarity-search algorithms and existing
methods
varied query length
Time(ms)
40
35
30
25
20
15
10
5
0
CF-Select
Adaptive-Search
ChunkGram
25
35
45
55
query length
65
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
Uniref
elems
10085
8672
9268
9872
10539
cands
3
1088
197
43
2
grams
6.7
3
5
7
9
TREC
elems
27045
22783
25107
28323
33724
cands
7
1576
218
5
2
12
R ELATED W ORK
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
=15
=20
=25
=30
250
200
150
50
30
20
50
10
0
2
4
6
8 10 adap
Prefix Schemes
=10
=15
=20
=25
50
40
100
=10
=15
=20
=25
60
Time(ms)
Time(ms)
300
Time(ms)
13
40
30
20
10
0
(a) DBLP
4
6
8
10 adap
Prefix Schemes
(b) Uniref50
6
8
10 12 adap
Prefix Schemes
(c) TREC
candidate size
6
4
2
0
2
8
7
6
5
4
3
2
1
0
candidate size
DIS
IDF
hybrid
DIS
IDF
hybrid
10
8
7
6
5
4
3
2
1
0
DIS
IDF
hybrid
(a) DBLP
(b) Uniref50
(c) TREC
35
estimate-DIS
total
30
Time(ms)
Time(ms)
80
60
40
est-DIS
total
25
Time(ms)
100
20
15
10
20
0
5
10
15
20
edit distance
(a) DBLP
25
10
15
20
edit distance
40
35
30
25
20
15
10
5
0
25
(b) Uniref50
est-DIS
total
10
15
20
edit distance
25
(c) TREC
Jaccard similarity, cosine similarity, edit distance metrics, Bregman Divergence and Earth Movers Distance
[7][1][16][15][17][18]. Edit distance is the most common
way to measure the difference between two strings.
A variant, top-k string similarity search, is also studied in
[34]. Another related problem, Approximate string match,
refers to the problem of matching a string or sub-string to a
given pattern in a text. There have also been a lot of studies
on this problem [24][27], and Navarro gives a detailed
analysis on the existing approaches in his survey [14]. The
selectivity estimation of similarity search and similarity join
in [3][4][5][28].
C ONCLUSIONS
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2014.2349914, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
R EFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish N. Koudas, S. Muthukrishnan and D. Srivastava, Approximate String Joins in a Database
(Almost) for Free, Proc. Intl Conf. Very Large Data Bases (VLDB),
pp. 491-500, 2001.
M. Hadjieleftheriou, A. Chandel, N. Koudas and D. Srivastava, Fast
Indexes and Algorithms for Set Similarity Selection Queries, Proc.
Intl Conf. Data Eng (ICDE), pp. 267-276, 2008.
M. Hadjieleftheriou, X. Yu, N. Koudas and D. Srivastava,
Hashed Samples: Selectivity Estimators For Set Similarity Selection
Queries, Proc. Intl Conf. Very Large Data Bases (VLDB), pp. 201212, 2008.
L. Jin and C. Li, Selectivity Estimation for Fuzzy String Predicates
in Large Data Sets, Proc. Intl Conf. Very Large Data Bases
(VLDB), pp 397-408, 2005.
H. Lee, R. T. Ng and K. Shim, Similarity join size estimation using
locality sensitive hashing, Proc. Intl Conf. Very Large Data Bases
(VLDB), pp.338-349, 2011.
C. Xiao, W. Wang and Xuemin Lin, Ed-Join: an efficient algorithm
for similarity joins with edit distance constraints, Proc. Intl Conf.
Very Large Data Bases (VLDB), pp.933-944, 2008.
S. Chaudhuri, V. Ganti and R. Kaushik, A primitive operator for
similarity joins in data cleaning, Proc. Intl Conf. Data Eng (ICDE),
pp.795-825, 2006.
J. Qin, W. Wang, Y. Lu, C. Xiao and Xuemin Lin, Efficient Exact
Edit Similarity Query Processing with the Asymmetric Signature
Scheme, Proc. ACM SIGMOD Intl Conf. Management of Data
(SIGMOD), pp.1033-1044, 2011.
J. Wang, G. Li and J. Feng, Can We Beat the Prefix Filtering? An
Adaptive Framework for Similarity Join and Search, Proc. ACM
SIGMOD Intl Conf. Management of Data (SIGMOD), pp.85-96,
2012.
C. Li, B. Wang and X. Yang, VGRAM: Improving Performance of
Approximate Queries on String Collections Using Variable Length
Grams, Proc. Intl Conf. Very Large Data Bases (VLDB), pp.303314, 2007.
X. Yang, B. Wang and C. Li, Cost-Based Variable-Length-Gram
Selection for String Collections to Support Approximate Queries
Efficiently, Proc. Intl Conf. Very Large Data Bases (VLDB),
pp.353-364, 2008.
C. Xiao, W. Wang, X. Lin and J. X. Yu, Efficient Similarity Joins
for Near Duplicate Detection, Proc. Intl Conf. World Wide Web
(WWW), 2008.
A. Z. Broder, On the resemblance and containment of documents,
Compression and Complexity of Sequences, pp.21-29, 1997.
G. Navarro, A Guided Tour to Approximate String Matching, Proc.
ACM Comp. Surveys, vol. 33, no. 1, pp. 31-88, 2001.
S. Sarawagi and A. Kirpal, Efficient set joins on similarity predicates, Proc. ACM SIGMOD Intl Conf. Management of Data
(SIGMOD), pp. 743-754, 2004.
R. J. Bayardo, Y. Ma and R. Srikant, Scaling up all pairs similarity
search, Proc. Intl Conf. World Wide Web (WWW), pp. 131-140
2007.
Z. Zhang, B. C. Ooi, S. Parthasarathy and A. K. H. Tung, Similarity
search on bregman divergence: Towards non-metric indexing, Proc.
Intl Conf. Very Large Data Bases (VLDB), vol. 2, no. 1, pp. 13-24,
2009.
J. Xu, Z. Zhang, A. K. H. Tung and G. Yu, Efficient and effective
similarity search over probabilistic data based on earth movers
distance, Proc. Intl Conf. Very Large Data Bases (VLDB), vol.
3, no. 1, pp. 758-769, 2010.
A. Arasu, V. Ganti and R. Kaushik, Efficient exact set-similarity
joins, Proc. Intl Conf. Very Large Data Bases (VLDB), pp. 918929, 2006.
14
[20] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi and D. Srivastava, Bedtree: an all-purpose index structure for string similarity search based
on edit distance, Proc. ACM SIGMOD Intl Conf. Management of
Data (SIGMOD), pp. 915-926, 2010.
[21] S. Schleimer, D. S. Wilkerson and A. Aiken, Winnowing: Local
Algorithms for Document Fingerprinting, Proc. ACM SIGMOD
Intl Conf. Management of Data (SIGMOD), pp. 76-85, 2003.
[22] W. Wang, J. Qin, C. Xiao, X. Lin and H. T. Shen, VChunkJoin:
An Efficient Algorithm for Edit Similarity Joins, IEEE Tran. on
Knowledge and Data Engineering, 2012.
[23] T. Bocek and B. Stiller, Fast Similarity Search in Large Dictionaries, Technical Report ifi-2007.02, Department of Informatics,
University of Zurich, 2007.
[24] W. Wang, C. Xiao, X. Lin and C. Zhang, Efficient approximate
entity extraction with edit constraints, Proc. ACM SIGMOD Intl
Conf. Management of Data (SIGMOD), pp. 759-770, 2009.
[25] S. Chaudhuri and R. Kaushik, Extending autocompletion to tolerate
errors, Proc. ACM SIGMOD Intl Conf. Management of Data
(SIGMOD), pp. 707-718, 2009.
[26] J. Wang, J. Feng and G. Li, Trie-Join: Efficient Trie-based String
Similarity Joins with Edit-Distance Constraints, Proc. Intl Conf.
Very Large Data Bases (VLDB), vol. 3, no. 1, pp. 1219-1230, 2010.
[27] C. Sun, J. Naughton and S. Barman, Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach,
Proc. Intl Conf. Data Eng (ICDE), pp. 882-893, 2012.
[28] A. Mazeika, M. H. Bohlen, N. Koudas and D. Srivastava, Estimating the selectivity of approximate string queries, ACM Tran. on
Database Systems, vol. 32, no. 2, pp. 12, 2009.
[29] S. F. Altschul, W. Gish, W. C. Miller, E. W. Myers and D. J. Lipman,
Basic local alignment search tool, Journal of molecular biology,
vol. 215, no. 3, pp. 403-410, 1990.
[30] A. Z. Broder, S. C. Glassman, M. S. Manasse and G. Zweig,
Syntactic Clustering of the Web, Computer Networks and Isdn
Systems, vol. 29, no. 8-13, pp. 1157-1166, 1997.
[31] A. Gionis, P. Indyk and R. Motwani, Similarity search in high
dimensions via hashing, Proc. Intl Conf. Very Large Data Bases
(VLDB), pp. 518-529, 1999.
[32] M. S. Charikar, Similarity estimation techniques from rounding
algorithms, ACM Symposium on Theory of Computing, pp. 380388, 2002.
[33] G. Li, D. Deng, J. Wang and J. Feng, Pass-Join: A Partitionbased
Method for Similarity Joins, Proc. Intl Conf. Very Large Data
Bases (VLDB), vol. 5, no. 3, pp. 253-264, 2012.
[34] D. Deng, G. Li, J. Feng and W. Li, Top-k String Similarity Search
with Edit-Distance Constraints, Proc. Intl Conf. Data Eng (ICDE),
pp. 925-936, 2013.
[35] C. Li, J. Lu and Y. Lu, Efficient merging and filtering algorithms
for approximate string searches. Proc. Intl Conf. Data Eng (ICDE),
pp. 257-266, 2008.
Haoji Hu is master candidate in Software Engineering Institute at
East China Normal University. He is expected to receive his master
degree in 2014. His research interests mainly include similarity
query processing and semi-supervised learning.
Kai Zheng is an ARC DECRA (Australia Research Council
Discovery Early-Career Researcher Award) Fellow with The
University of Queensland. He obtained his Ph.D in Computer
Science from The University of Queensland in 2012. His research
interests include uncertain database, spatial-temporal query
processing, and trajectory computing.
Xiaoling Wang is a professor of Software Engineering Institute
at East China Normal University. She received her Ph.D from
Southeastern University in 2003. She achieved the Programs of
New-Century Talent of Ministry of Education of China and Innovation
of Science and Technology of Department of Education of Shanghai.
Her research interests mainly include Web data management, data
service technology and applications.
Aoying Zhou is a professor in Software Engineering Institute at
East Normal University of China. He received the PhD degree from
Fudan University in 1993. His research interests include web search
and mining, data stream management, uncertain data management,
distributed storage and computing, and web service. He is a member
of the IEEE.
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.