Professional Documents
Culture Documents
VOL. 26,
NO. 2,
FEBRUARY 2014
453
INTRODUCTION
HE
1.1 Motivations
To protect user privacy in profile-based PWS, researchers
have to consider two contradicting effects during the search
process. On the one hand, they attempt to improve the
search quality with the personalization utility of the user
profile. On the other hand, they need to hide the privacy
contents existing in the user profile to place the privacy risk
under control. A few previous studies [10], [12] suggest that
people are willing to compromise privacy if the personalization by supplying user profile to the search engine
yields better search quality. In an ideal case, significant
gain can be obtained by personalization at the expense
of only a small (and less-sensitive) portion of the user
profile, namely a generalized profile. Thus, user privacy can
be protected without compromising the personalized
search quality. In general, there is a tradeoff between the
search quality and the level of privacy protection achieved
from generalization.
Unfortunately, the previous works of privacy preserving
PWS are far from optimal. The problems with the existing
methods are explained in the following observations:
1.
454
FEBRUARY 2014
2.
3.
1.2 Contributions
The above problems are addressed in our UPS (literally for
User customizable Privacy-preserving Search) framework.
The framework assumes that the queries do not contain
any sensitive information, and aims at protecting the
privacy in individual user profiles while retaining their
usefulness for PWS.
As illustrated in Fig. 1, UPS consists of a nontrusty search
engine server and a number of clients. Each client (user)
accessing the search service trusts no one but himself/
herself. The key component for privacy protection is an
online profiler implemented as a search proxy running on the
client machine itself. The proxy maintains both the
complete user profile, in a hierarchy of nodes with semantics,
and the user-specified (customized) privacy requirements represented as a set of sensitive-nodes.
The framework works in two phases, namely the offline
and online phase, for each user. During the offline phase, a
RELATED WORKS
455
456
TABLE 1
Symbols and Descriptions
FEBRUARY 2014
supR t
;
supR s
t 2 subtrs; R:
457
458
FEBRUARY 2014
riskq; G < :
UPS PROCEDURES
1.
2.
3.
459
TABLE 2
Contents of T (Eagles)
2.
3.
prefH t; q
prefH t0 ; H:
t0 2Ct;H
prefH t; q
:
prefH t0 ; q
H q
t0 2T
relR t; q
:
relR t0 ; q
t0 2T q
3. This approximation is made based on the idea that the users queryrelated preference of t can be estimated with its long-term preference in the
user profile.
460
GENERALIZATION TECHNIQUES
5.1
Metrics
P Gq; G
P rt j q; G log
t2TG q
FEBRUARY 2014
P rt j q; G
P rt
P rt j q; GICt Ht j q; G ;
|{z}
ob2
|{z}
10
t2TG q
ob1
11
12
P
where t2TH q P rt j q; HICt is the expected IC of topics
in TH q, given the profile G is generalized from H. It is easy
to demonstrate that the value of DP q; G is bounded
within 0; 1.
Then, the personalization utility is defined as the gain of
discriminating power achieved by exposing profile G together
with query q, i.e.,
utilq; G DP q; G DP q; R;
where DP q; R quantifies the discriminating power of the
query q without exposing any profile, which can be
obtained by simply replacing all occurrences of P rt j q; G
in (12) with P rt j q (obtained in (8)). Note that utilq; G
can be negative. That is, personalization with a profile G
may generate poorer discriminating power. This may happen
when G does not reduce the uncertainty of P rt j q
effectively, i.e., TG q T q, and describes the related
topics in coarser granularity.
Since DP q; R is fixed whenever q is specified, the
profile generalization simply take DP q; G (instead of
utilq; G) to be the optimization target.
15
461
462
FEBRUARY 2014
IMPLEMENTATION ISSUES
Nt;w
;
t0 2R Nt0 ;w
Nd;w ln P
w2d
17
Nt;w
:
t0 2T qi Nt0 ;w
Nd;w ln P
w2d
18
463
EXPERIMENTAL RESULTS
t0 ;w
6. In the original AOL-data, the url field is truncated to the domain name
(e.g., www.une.edu/mwwc/ becomes www.une.edu) except the URL is of
the protocol https. As a result, we have to recover the full URLs by means of
REPLAY. That is, to reissue the query of the AOL-records to some designate
search engines and then retrieve the URLs matching url. We processed
REPLAY, respectively, on the top 100 results returned by Yahoo and ODP,
and 40 percent (#clicks 800) and 25 percent (#clicks 500) full URLs are
recovered.
464
FEBRUARY 2014
Fig. 5. Results of Distinct/Medium/Ambiguous queries during each iteration in GreedyDP/GreedyIL. All results are obtained from the same profile.
power, and study the tradeoff between the utility and the
privacy risk in the GreedyDP/GreedyIL algorithm. To
clearly illustrate the difference between the three groups of
queries, we use the synthetic profiles in this experiment. We
perform iterative generalization on the profile using one of
the original queries for creating the profile itself. The DP
and risk are measured after each iteration. As the results of
different profiles display similar trends, we only plot the
results of three representative queries (Wikipedia for
distinct queries, Freestyle for medium queries, and
Program for ambiguous queries) in Fig. 5.
As Figs. 5a, 5b, and 5c show the discriminating power of
all three sample queries displays a diminishing-returns
property during generalization, especially the ambiguous
one (i.e., Program). This indicates that the higher-level
topics in the profile are more effective in improving the
search quality during the personalization, while the lowerlevel ones are less. This property has also been reported in
[12], [10]. In addition, we also plot the results of profile
granularity and topic similarity (TS) across iterations in
these figures. We observe that for all three samples, 1) PG
shows an exactly similar trend as that of DP, 2) TS
remains unchanged until the last few iterations of generalization. In particular, the TS of the ambiguous one is
always 0. The reason of such results is that TS is fixed
before the generalization reaches the least common
ancestor of the related queries, which means PG shapes
the overall DP more.
Similarly, Figs. 5d, 5e, and 5f show the results of risk
during the generalization. The value of the metric first
declines rapidly, but the decrease slows down as more
specific profile information becomes hidden. Fig. 5g
illustrates the tradeoff pattern of DP versus risk of three
sample queries. For all queries, we observe an apparent
knee on their tradeoff curve. Before this turning point,
small concessions on risk can bring great promotion on
utility; while after that, any tiny increase of utility will lead
to enormous increase in risk. Hence, the knee is a nearoptimal point for the tradeoff. We also find that the knee
n
X
i1
i
=n;
li :rank
19
465
466
CONCLUSIONS
ACKNOWLEDGMENTS
This work was supported by the National Science Foundation of China (No. 61170034), and China Key Technology
R&D Program (No. 2011BAG05B04).
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
FEBRUARY 2014
467