
P2P-Join: A Keyword Based Join Operation in Relational Database Enabled Peer-to-Peer Systems

Zhigang Chen, Zhongding Huang
Shanghai Second Polytechnic University
No.2360 Jinhai Road, Shanghai, 201209
{chen zhigang, huangzd}@sohu.com

Bo Ling, Jiang Li
China Executive Leadership Academy, Pudong
No.99 Qiancheng Road, Shanghai, 201204
{bling, jli}@celap.org.cn

Abstract

Query-by-keywords is the most popular way to search for data in this computing age. However, most existing work was proposed for searching centralized relational databases. This paper investigates how keyword search can be deployed in relational database-enabled peer-to-peer systems. Unlike in a centralized system, the key challenges in this new computing paradigm come from the autonomy of peers, the lack of a global schema, and the dynamics of peer connectivity. First, the concept of P2P-Join is proposed, which is a join operation that combines tuples among relations from different peers containing certain keywords in the query. Second, a fully distributed framework to realize P2P-Join processing is devised, which not only inherits the syntax and semantics of the traditional join but also embraces the ideology of peer-to-peer computing. Finally, two mechanisms are proposed to improve the performance of the operation: a join path selection order scheme and a push-based load balancing mechanism across peers.

1. Introduction

Peer-to-peer (P2P) technology, also called peer computing, is an emerging paradigm that is now viewed as a potential technology that could restructure distributed architectures (e.g., the Internet). In a P2P distributed system, a large number of nodes (e.g., PCs connected to the Internet) can potentially be pooled together to share their resources, information and services. These nodes, which can both consume and provide data and/or services, may join and leave the P2P network at any time, resulting in a truly dynamic and ad-hoc environment. The nature of such a design provides exciting opportunities for new applications.

While data sharing is the dominant application in P2P computing at present, only file-system-like capabilities are provided, and the semantics of data is largely ignored. In some unstructured systems (e.g., Gnutella [4]), searching is restricted to strings contained in a filename and directory path. Structured systems (e.g., Chord [20], CAN [17], and Pastry [18]) only support "exact name match" file searching. In a word, existing P2P systems support only semantics-free, large-granularity data sharing, and lack data management capabilities and support for semantic search. This is because P2P systems do not take semantics, data transformation and data relationships into consideration [5].

Recently, many researchers have tried to integrate P2P with database technologies into a highly distributed data sharing environment [5, 7, 10, 11, 12, 15, 16]. Such systems, with autonomous relational databases as peers, are referred to as relational database-enabled P2P systems (PeerDBS). In this paper, a novel operation is introduced into this type of system to mine knowledge from P2P networks. Specifically, the main contributions of this paper are as follows:

• First, the concept of the P2P-Join operation is proposed, which joins tuples among relations from different peers containing certain keywords in the query;
• Second, a fully distributed framework to realize the P2P-Join operation is devised, which not only inherits the syntax and semantics of the traditional join operation, but also embraces the ideology of P2P computing;
• Finally, two mechanisms are proposed to improve the performance of the operation: a join path selection scheme, and a push-based load balancing mechanism across peers.

The rest of the paper is organized as follows. Section 2 reviews some related work. Section 3 states the problem and gives definitions. Section 4 presents the framework of P2P-Join. Section 5 discusses the heuristics to improve the performance, and Section 6 concludes the paper.

Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06)
0-7695-2641-1/06 $20.00 © 2006
Authorized licensed use limited to: Maharashtra Institute of Technology. Downloaded on August 16,2010 at 11:49:18 UTC from IEEE Xplore. Restrictions apply.
2 Related Work

P2P technologies have been deployed in many applications, such as instant messaging (IM) [13], collaborative workgroup tools [6], CPU cycle sharing [19] and data sharing [3, 4, 15]. While most of the applications are for data sharing, current mechanisms are largely restricted to file-level sharing without the capabilities of a relational database.

In [5], the issues of data management in a P2P environment are discussed from a database perspective. Its focus is largely on what database technologies can do for P2P applications. Though a preliminary architecture for peer data management (Piazza) is described, little is discussed about how peers cooperate. Different from [5], PeerOLAP [10] addresses the problem from the opposite direction: it looks at what P2P technologies can do for database applications. Essentially, PeerOLAP is still a client/server system. However, the cooperation among clients (peers) is explored: all data within clients is shared.

Since each peer maintains its data independently, some kind of understanding among peers is required to achieve semantically meaningful search. In PeerDB [15], an IR-based approach is used to mediate peer schemas, with a global thesaurus being assumed. In [16], a data model (LRM) is designed for P2P systems with domain relations and coordination formulas to describe relationships between two peer databases. Data Mapping [12] is a simplified implementation of this model.

Some contributions have also been made by data integration researchers. However, unlike traditional data integration systems, where a global schema is assumed with a few data sources, a P2P system cannot simply assume a global schema, due to its high dynamicity and large number of data sources. Nevertheless, it may be possible to compose mediators, having some mediators defined in terms of other mediators and data sources, thus achieving system extensibility. This is the main thrust of [7] and [11]. While [7] focuses on how queries are reformulated with views, [11] focuses on selective view expansion, considering that full view expansion may be prohibitively expensive. Similar to [7] and [11], our work is also mainly interested in the query processing aspects of peer database systems. However, we make few assumptions about peer availability, which means our work can be applied to a more general P2P environment.

Exploiting keyword search in database querying has also drawn some attention recently [1, 2, 9]. However, these works only focus on centralized systems, where the semantics across different relations (e.g., key/foreign-key relationships) can be exploited to improve the search accuracy. Such semantics are much harder to exploit in P2P systems, which is the main challenge of our work.

3 Problem Statement

3.1 An Overview of Relational Database-enabled Peer-to-Peer Systems

For generality, our work is based on common P2P protocols. Such a system has the following characteristics:

• The system is decentralized and self-organized, while its peers are autonomous and dynamic;
• Information is distributed among peers rather than concentrated at dedicated servers;
• Peers are equivalent in functionality and responsibility and interact with each other symmetrically.

To further exploit the merits of P2P protocols, we assume a three-layer architecture for a relational database-enabled P2P system. Specifically, from bottom to top, the layers are the structured layer, the unstructured layer and the application layer. The structured layer employs the protocol of structured P2P systems (e.g., Chord [20] and CAN [17]), which uses the DHT [8] mechanism to manage the meta-data of peers. The unstructured layer is implemented upon the protocol of unstructured P2P systems (e.g., BestPeer [14]), so that a peer in the system can be autonomous and able to reconfigure its neighbors dynamically. The application layer is a relational database system.

3.2 Query and Answer

A query is modelled as a set of keywords together with an identifier, i.e., q = <{k1, ..., kt}, qid>. Here, {k1, ..., kt} is a set of keywords that semantically describes what the user wants, while qid is a system-wide unique identifier for the query q, generated by its initiator. This query style is chosen mainly because there is no global schema in P2P database systems and, moreover, because it is user-friendly.

Each peer maintains its data in a relational database. When a query is processed, the peer searches its own database and returns tuples containing all or some of the keywords in the query. Keyword searching in local relational databases can be done by DISCOVER [9] or other methods. Our focus is not on local processing, but on how to integrate tuples from different peers.

The results of a local query can take two forms: either a tuple in a relation contains all keywords in the query, or a tuple contains only a subset of the keywords in the query. For the latter, the final answer that contains all keywords in the query may be obtained by joining tuples from multiple relations. In addition, we expect some answers to be obtainable from individual peers (even those that involve joins), while others require cooperation among peers. We refer to the latter category as P2P-Join, which we are most interested in and which is defined next.
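
As an illustrative sketch of the query model and the two forms of local result described in Section 3.2, the following Python fragment represents a query as a keyword set plus an identifier and splits a peer's local tuples into full matches and partial matches. The names `Query`, `matched_keywords` and `classify_local_tuples`, the dictionary representation of a tuple, and the rule that a keyword matches only when it equals a whole attribute value are all assumptions made for illustration; the paper itself delegates local keyword search to methods such as DISCOVER [9].

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical tuple representation: attribute name -> value.
Row = Dict[str, str]


@dataclass(frozen=True)
class Query:
    keywords: FrozenSet[str]  # the keyword set {k1, ..., kt}
    qid: str                  # system-wide unique id chosen by the initiator


def matched_keywords(row: Row, query: Query) -> FrozenSet[str]:
    """Query keywords that equal some attribute value of the row (assumed rule)."""
    values = set(row.values())
    return frozenset(k for k in query.keywords if k in values)


def classify_local_tuples(
    rows: List[Row], query: Query
) -> Tuple[List[Row], List[Tuple[Row, FrozenSet[str]]]]:
    """Split local tuples into full matches and partial matches.

    Full matches can be returned to the requester directly; partial matches
    are the candidates for a later join with tuples held by other peers.
    """
    full: List[Row] = []
    partial: List[Tuple[Row, FrozenSet[str]]] = []
    for row in rows:
        hit = matched_keywords(row, query)
        if hit == query.keywords:
            full.append(row)
        elif hit:
            partial.append((row, hit))
    return full, partial


rows = [
    {"A1": "k1", "B": "b1", "C": "k2"},  # contains every keyword
    {"A1": "k1", "B": "b2", "C": "x"},   # contains only k1
    {"A1": "z", "B": "b3", "C": "y"},    # contains no keyword
]
query = Query(keywords=frozenset({"k1", "k2"}), qid="q-001")
full, partial = classify_local_tuples(rows, query)
```

In this sketch the first row is a complete local answer, the second is kept as a partial match carrying the keyword it covers, and the third is discarded.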

3.3 Peer-to-Peer Join

Definition 1 P2P-Join is a join operation that combines two (or more) relations from two (or more) peers based on the semantics of keywords and the syntax of the join operation of relational database systems.

For example, consider a query Q={k1, k2}. Suppose there are two peers P1 and P2, such that peer P1 maintains a relation R1(A1, ..., B, ...) and peer P2 maintains a relation R2(A2, ..., B, ...), and both relations share a common attribute B. Furthermore, suppose some of the values of attribute A1 are k1 while some values of attribute A2 are k2. In addition, if there exist a tuple <k1, ..., b, ..., x> in R1 and a tuple <k2, ..., b, ..., y> in R2, then a P2P join operation can be performed and the result is <k1, b, k2, ..., x, y>.

From the above example, we present a theorem for P2P-Join, which can be easily proved.

Theorem 1 Two tuples in different peers are joined by a P2P join operation if and only if they have an equal value on the common attribute of the two relations belonging to the two different peers and, furthermore, the two tuples contain different keywords in the query Q.

4 Framework of P2P-Join

4.1 An Overview

In general, query processing consists of six steps: query distribution, local processing, information exchange among peers, join graph generation, P2P join, and result propagation.

When a query is submitted, it is distributed to all neighbors. The neighbors further forward the query to their own neighbors, and so on, until the query's lifetime (Time-to-Live, i.e., TTL) expires.

Each peer that receives the query first looks up the full-text index of its database to decide whether it contains some or all of the keywords in the query. Based on keyword searching methods in databases, tuples that contain all keywords are returned to the requesting peer directly, while tuples containing only some of the keywords are used for the peer-to-peer join. At present, there are many approaches that support local keyword-based query processing, such as DISCOVER [9], BANKS [2] and DBXplorer [1].

Based on the results of local processing, each peer exchanges information with its neighbors, namely the keywords found in the tuples that include only a subset of the keywords in the query. After obtaining the information of its neighbors, a peer generates a join graph, where the vertices are peers and the edges connect the pairs of peers that should be joined. How a join graph is generated is described in the next section.

With the join graph, a peer can identify the peers with which it can be joined, and try to perform a P2P join. The join results that contain all keywords in the query can then be returned directly to the requesting peer, while the results containing only some of the keywords are propagated along the join paths until the final results, which include all keywords in the query, are generated and returned to the user.

4.2 The Generation of Join Graph

A join graph is a graph G(V, E), in which V is the vertex set and E is the edge set. In such a graph, each vertex denotes a peer together with one keyword in the query Q. Accordingly, one peer may be denoted by several different vertices if it contains several keywords in the query. Two peers are connected by an edge if and only if they contain two different keywords in the query and they share certain common attributes with the same meaning. The vertices of a single peer may be connected if the peer contains more than one keyword in the query and the corresponding tuples can be joined locally.

After processing the query locally, each peer sends the query keywords that it contains to all of its neighbors that have also been accessed by the query with the same query identifier. Further, when a peer receives the keywords from one of its neighbors, it compares them with the query keywords that appear in its own local database. If one or more keywords in the local database differ from those held by the neighbor candidate, the peer establishes a connection with that neighbor. With these operations, the join graph for a query is generated.

From the above, we can see that only peers connected by an edge in the join graph need to be joined.

4.3 Peer-to-Peer Join

First, we consider the problem of a P2P join between two relations from two different peers. For example, relation R1(A1, ..., B, ...) and relation R2(A2, ..., B, ...) are two relational tables belonging to two different peers P1 and P2. According to the definition of P2P join, the tuples whose value of attribute A1 is not k1, or whose value of attribute A2 is not k2, can be filtered out directly, while the other tuples are retained. We would like the filtering operation to be performed locally first, since this reduces the bandwidth cost and distributes the processing load over different peers, thus improving query processing performance to some extent.

We can further improve performance with optimization techniques employed in RDBMSs, such as semi-join.

4.4 The propagation of Results

A join path is a path in the join graph in which each keyword in the query appears exactly once in the vertices. A

final result can only be obtained after all join operations on the edges in a join path have been executed. Furthermore, we suppose that the complete answer set to a given query is obtained when all join paths have been traversed. Therefore, to obtain the results of a query, each join path should be traversed, as described by the following procedure.

BEGIN
  For each join path
    For each edge connecting peers p and q
      P2P-Join p and q
      Reset neighbors and keywords
      Delete edge(p, q)
    Once the path is traversed, send results to the requesting peer
END

Note that the neighborhood relationship changes as the join process goes on. Furthermore, the process traverses all the join paths, and the complete answer set to the given query is obtained when it finishes. Compared to traditional processing, the above traversal procedure in a P2P system is fully distributed.

The above algorithm is similar to the problem of graph traversal, which has been proved to have exponential cost. However, with a fully distributed approach, the situation is different. Here, we only perform the worst-case time analysis. Suppose that, in a P2P system, n peers have been accessed by the query, and let the maximum expected number of neighbors of a peer be k. Obviously, k ≪ n. Then the inner loop is executed at most O(k · m!) times, where m is the number of keywords in the query. Therefore, the total execution time in the worst case is O(k · n · m!). Since k and m are much smaller than n, the time complexity of the algorithm is much less than n^2.

5 Improvements

In this section, two heuristics are presented to further improve the efficiency of query processing in peer database systems. First, one heuristic is presented to deal with the order of join path selection. Second, another heuristic is proposed to balance the load among peers.

5.1 Join Path Selection

In the previous section, we assumed that a P2P join is processed in a local manner, that is, the join operation between each pair of peers does not affect other peers. However, some peers may have more neighbors while others have fewer, so that the relative importance of edges among peers in a join graph or subgraph differs. Therefore, different sequences of the join operations along the edges on the join path will have different costs, thus providing some opportunities for further optimization.

Figure 1. An example of Join Graph (edges labeled 1 to 5)

In the context of databases, it is widely accepted that sharing common computation is beneficial. For example, if we need to process both (A ⋈ B) and (A ⋈ B ⋈ C), we can process (A ⋈ B) first and store its result for later use when processing (A ⋈ B ⋈ C). Here, we use this heuristic to direct the choice of join sequence.

Using the above join graph as an example, we now illustrate how the heuristic can help reduce the computation cost. We consider how different orders of operation on the edges affect the reuse of edge 3 in the join graph, which reflects how efficiently the resources of the relational database systems are utilized.

1. If the join operation on edge 3 is executed first, the partial result can be taken advantage of by the further join operations (1,3), (1,3,4), (1,3,5), (2,3), (2,3,4) and (2,3,5), so that the sum of reuse times of edge 3 is 6.

2. However, if the join operation on edge 1 (or 2) is executed first, the partial result of edge 3 can only be taken advantage of by (1,3,4) and (1,3,5) (or (2,3,4) and (2,3,5)). Thus, the sum of reuse times of edge 3 is only 2.

3. In summary, the sum of reuse times of edge 3 in case (1) is three times (6/2=3) that in case (2), which implies that the join order greatly affects the cost and efficiency of the whole P2P join processing.

As shown above, the sequence of join processing along the join path is a key factor for query processing performance in P2P systems. However, this heuristic may be difficult to implement, since it is not easy to synchronize peers. Additionally, it is expensive to obtain the complete graph and analyze it globally. We design a simple mechanism to approximate this heuristic: before joining, each peer sends its number of neighbors to its neighbors. A peer only agrees to join with the half of its neighbors that have more neighbors than the other half. A join operation can be executed only when both peers agree to join.
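
The approximation described at the end of Section 5.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `preferred_neighbors` and `agreed_joins`, rounding "half" upward, and breaking ties by peer name are choices the paper does not specify. The neighbor counts that peers would exchange by message are simply read from an in-memory adjacency map here.

```python
import math
from typing import Dict, FrozenSet, Set

# Hypothetical join graph: peer name -> set of neighboring peer names (undirected).
Graph = Dict[str, Set[str]]


def preferred_neighbors(graph: Graph, peer: str) -> Set[str]:
    """Return the half of a peer's neighbors that report the most neighbors."""
    # Rank by reported neighbor count, descending; ties broken by name (assumption).
    ranked = sorted(graph[peer], key=lambda n: (-len(graph[n]), n))
    quota = math.ceil(len(ranked) / 2)  # assumption: "half" is rounded upward
    return set(ranked[:quota])


def agreed_joins(graph: Graph) -> Set[FrozenSet[str]]:
    """Return the edges on which both endpoints agree to join."""
    agreed: Set[FrozenSet[str]] = set()
    for p in graph:
        for q in preferred_neighbors(graph, p):
            if p in preferred_neighbors(graph, q):
                agreed.add(frozenset((p, q)))
    return agreed


graph: Graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
joins = agreed_joins(graph)
```

In this small graph, peer a prefers b and c, while b, c and d each prefer a. Only the edges a-b and a-c are agreed on by both endpoints, so the joins on a-d and b-c are deferred.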

5.2 Push-based Balancing

After several iterations of the join operations, the peers that have many neighbors can become very busy, which has been observed in our experiments. To achieve better load balancing, we propose another heuristic: each peer can set a threshold, based on its processing power, that denotes how many join operations it can process simultaneously. If the number of join operations a peer is processing exceeds the threshold, it can broadcast a request to its neighbors to let other peers execute the join operation instead.

Two cases of this heuristic exist. If one of its neighbors takes the join task, only the join plan that places the result at the other endpoint of the connection is considered. Otherwise, the peers that should be joined send their data to a third peer, which takes over the join operation. Note that many existing join algorithms, such as pipeline join or RIPPLE join, can execute such tasks.

5.3 Implementation

A prototype with the P2P-Join operation has been built upon BestPeer [14], a generic P2P platform on which P2P applications can be developed efficiently. BestPeer integrates mobile agent and P2P techniques. While P2P provides resource sharing amongst nodes, mobile agents extend its functions, including the P2P-Join operation. In addition, peers in BestPeer can dynamically reconfigure their neighbor candidates. Further, Chord [20] is employed to map meta-data (e.g., key, foreign key) among peers.

An experimental study has been conducted upon the prototype, and the preliminary results are promising. Furthermore, with the two heuristics implemented, the performance is greatly improved compared with the original proposal.

6 Conclusion

This paper deployed a relational database operation upon P2P computing. First, the concept of P2P-Join is proposed, which combines tuples among relations from different peers containing certain keywords in the query. Further, a fully distributed method to realize P2P-Join processing is devised, which inherits the syntax and semantics of the traditional join and embraces the ideology of P2P as well. Finally, two enhancements are proposed to improve the performance of the proposed P2P-Join operation. Since relational database-enabled operation in P2P computing is still in its infancy, some other issues need to be addressed, e.g., network optimization and cache management, which are the topics of our future research.

References

[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th ICDE, CA, April 2002.
[2] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In Proceedings of the 18th ICDE, CA, April 2002.
[3] P. Druschel and A. Rowstron. PAST: Persistent and anonymous storage in a peer-to-peer networking environment. In Proceedings of the 8th IEEE Workshop on HotOS, 2001.
[4] Gnutella Homepage. http://gnutella.wego.com/.
[5] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB, 2001.
[6] Groove Home Page. http://www.groove.net.
[7] A. Y. Halevy, Z. G. Ives, and D. Suciu. Schema mediation in peer data management systems. In Proceedings of the 19th ICDE, 2003.
[8] M. Harren, J. Hellerstein, R. Huebsch, B. Loo, S. Shenker, and I. Stoica. Complex queries in DHT-based peer-to-peer networks. In IPTPS02, 2002.
[9] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB'2002, 2002.
[10] P. Kalnis, B. C. Ooi, W. S. Ng, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In ACM SIGMOD, 2002.
[11] T. Katchaounov. Query processing in self-profiling composable peer-to-peer mediator databases. In Proc. EDBT Ph.D. Workshop 2002, 2002.
[12] A. Kementsietsidis, M. Arenas, and R. Miller. Data mapping in peer-to-peer systems. In Proceedings of the 19th ICDE, 2003 (Poster Paper).
[13] MSN Home Page. http://www.msn.com/.
[14] W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In Proceedings of the 18th ICDE, San Jose, CA, April 2002 (Poster Paper).
[15] W. S. Ng, B. C. Ooi, K. L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the 19th ICDE, 2003.
[16] A. B. Philip, G. Fausto, K. Anastasios, M. John, S. Luciano, and Z. Ilya. Data management for peer-to-peer computing: A vision. In WebDB Workshop, 2002.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of SIGCOMM, 2001.
[18] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the International Conference on Distributed Systems Platforms (Middleware), Germany, Nov. 2001.
[19] Seti@home Home Page. http://setiathome.ssl.berkely.edu/.
[20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM, 2001.

