Abstract

Label Propagation, a standard algorithm for semi-supervised classification, suffers from scalability issues involving memory and computation when used with large-scale graphs from real-world datasets. In this paper we approach Label Propagation as the solution to a system of linear equations which can be implemented as a scalable parallel algorithm using the map-reduce framework. In addition to semi-supervised classification, this approach to Label Propagation allows us to adapt the algorithm for ranking on graphs and to derive the theoretical connection between Label Propagation and PageRank. We provide empirical evidence to that effect using two natural language tasks – lexical relatedness and polarity induction. The version of the Label Propagation algorithm presented here scales linearly in the size of the data with a constant main memory requirement, in contrast to the quadratic cost of both in traditional approaches.

1 Introduction

Natural language data often lend themselves to a graph-based representation. Words can be linked by explicit relations as in WordNet (Fellbaum, 1989), and documents can be linked to one another via hyperlinks. Even in the absence of such a straightforward representation, it is possible to derive meaningful graphs such as the nearest-neighbor graphs built in certain manifold learning methods, e.g., Roweis and Saul (2000); Belkin and Niyogi (2001). Typically, these graphs share the following properties:

• The edge weight encodes some notion of relatedness between the vertices.

• The relation represented by the edges is at least weakly transitive. Examples of such relations include “is similar to”, “is more general than”, and so on. It is important that the selected relations are transitive for graph-based learning methods that use random walks.

Such graphs present several possibilities for solving natural language problems involving ranking, classification, and clustering. Graphs have been successfully employed in machine learning in a variety of supervised, unsupervised, and semi-supervised tasks. Graph-based algorithms perform better than their counterparts as they capture the latent structure of the problem. Further, their elegant mathematical framework allows simpler analysis to gain a deeper understanding of the problem. Despite these advantages, implementations of most graph-based learning algorithms do not scale well on large datasets from real-world problems in natural language processing. With large amounts of unlabeled data available, the graphs can easily grow to millions of nodes, and most existing non-parallel methods either fail to work due to resource constraints or find the computation intractable.

In this paper we describe a scalable implementation of Label Propagation, a popular random-walk-based semi-supervised classification method. We show that our framework can also be used for ranking on graphs. Our parallel formulation shows a theoretical connection between Label Propagation and PageRank. We also confirm this empirically using the lexical relatedness task. The
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, ACL-IJCNLP 2009, pages 58–65, Suntec, Singapore, 7 August 2009. © 2009 ACL and AFNLP
proposed Parallel Label Propagation scales up linearly in the data and in the number of processing elements available. Also, the main memory required by the method does not grow with the size of the graph.

The outline of this paper is as follows: Section 2 introduces the manifold assumption and explains why graph-based learning algorithms perform better than their counterparts. Section 3 motivates the random-walk-based approach for learning on graphs. Section 4 introduces the Label Propagation method by Zhu et al. (2003). In Section 5 we describe a method to scale up Label Propagation using Map-Reduce. Section 6 shows how Label Propagation can be used for ranking on graphs and derives the relation between Label Propagation and PageRank. Parallel Label Propagation is evaluated on ranking and semi-supervised classification problems in natural language processing in Section 8. We study the scalability of this algorithm in Section 9 and describe related work in the area of parallel algorithms and machine learning in Section 10.

2 Manifold Assumption

The training data D can be considered as a collection of tuples D = (X, Y), where Y are the labels and X are the features, and the learned model M is a surrogate for an underlying physical process which generates the data D. The data D can be considered as a sampling from a smooth surface, or manifold, which represents the physical process. This is known as the manifold assumption (Belkin et al., 2005). Observe that even in the simple case of Euclidean data (X = {x : x ∈ R^d}), as shown in Figure 1, points that lie close in the Euclidean space might actually be far apart on the manifold. A graph, as shown in Figure 1c, approximates the structure of the manifold, which is lost in vectorized algorithms operating in the Euclidean space. This explains the better performance of graph algorithms for learning as seen in the literature.

3 Distance measures on graphs

Most learning tasks on graphs require some notion of distance or similarity to be defined between the vertices of a graph. The most obvious measure of distance in a graph is the shortest path between the vertices, defined as the minimum number of intervening edges between two vertices. This is also known as the geodesic distance. To convert this distance measure to a similarity measure, we take the reciprocal of the shortest-path length. We refer to this as the geodesic similarity.

Figure 2: Shortest path distances on graphs ignore the connectivity structure of the graph.

While shortest-path distances are useful in many applications, they fail to capture the following observation. Consider the subgraph of WordNet shown in Figure 2. The term moon is connected to the terms religious leader and satellite.¹ Observe that both religious leader and satellite are at the same shortest-path distance from moon. However, the connectivity structure of the graph would suggest satellite to be more similar than religious leader, as there are multiple senses, and hence multiple paths, connecting satellite and moon.

Thus it is desirable to have a measure that captures not only path lengths but also the connectivity structure of the graph. This notion is elegantly captured using random walks on graphs.

4 Label Propagation: Random Walk on Manifold Graphs

An efficient way to combine labeled and unlabeled data involves constructing a graph from the data and performing a Markov random walk on the graph. This has been utilized in Szummer and Jaakkola (2001), Zhu et al. (2003), and Azran (2007). The general idea of Label Propagation involves defining a probability distribution F over the labels for each node in the graph. For labeled nodes, this distribution reflects the true labels, and the aim is to recover this distribution for the unlabeled nodes in the graph.

Consider a graph G(V, E, W) with vertices V, edges E, and an n × n edge weight matrix W = …

¹ The religious leader sense of moon is due to Sun Myung Moon, a US religious leader.
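To make the geodesic similarity described above concrete, here is a small Python sketch (not from the paper; the toy graph, vertex names, and helper function are illustrative assumptions). It computes the shortest-path length by breadth-first search on an unweighted, undirected graph and returns its reciprocal:

```python
from collections import deque

def geodesic_similarity(adj, source, target):
    """Reciprocal of the shortest-path length (in edges) between two vertices.

    adj: dict mapping each vertex to an iterable of its neighbors.
    Returns 0.0 if the vertices are disconnected.
    """
    if source == target:
        return float("inf")  # zero distance; similarity is unbounded
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return 1.0 / (dist + 1)
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return 0.0  # disconnected vertices

# Toy graph echoing Figure 2: both terms sit two hops from "moon",
# so geodesic similarity alone cannot distinguish them.
adj = {
    "moon": ["satellite_sense", "person_sense"],
    "satellite_sense": ["moon", "satellite"],
    "person_sense": ["moon", "religious leader"],
    "satellite": ["satellite_sense"],
    "religious leader": ["person_sense"],
}
```

On this toy graph the similarity of moon to both satellite and religious leader is 1/2, illustrating exactly the limitation the next paragraph discusses: the measure ignores how many distinct paths connect two terms.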
Figure 1: Manifold Assumption (Belkin et al., 2005): Data lies on a manifold (a) and points along the manifold are locally similar (b).
…identifier for a node in the graph and the value corresponds to the data associated with the node. The mappers run on different machines operating on different parts of the data, and the reduce function aggregates results from the various mappers.

Algorithm 2: Parallel Label Propagation

map(key, value):
begin
    d = 0
    neighbors = getNeighbors(value)
    foreach n ∈ neighbors do
        w = n.weight()
        d += w ∗ n.getDistribution()
    end
    normalize(d)
    value.setDistribution(d)
    Emit(key, value)
end

reduce(key, values): Identity Reducer

Algorithm 2 represents one iteration of Algorithm 1. This is run repeatedly until convergence or for a specified number of iterations. The algorithm is considered to have converged if the label distributions associated with each node do not change significantly, i.e., ||F(i+1) − F(i)||_2 < ε for a fixed ε > 0.

6 Label Propagation for Ranking

… PageRank (Page et al., 1998). PageRank is a random walk model that allows the random walk to “jump” to its initial state with a nonzero probability (α). Given the probability transition matrix P = [P_rs], where P_rs is the probability of jumping from node r to node s, the weight update for any vertex (say v) is derived as follows:

    v_{t+1} = α v_t P + (1 − α) v_0        (4)

Notice that when α = 0.5, PageRank reduces to Algorithm 1, up to a constant factor, with the additional (1 − α) v_0 term corresponding to the contribution from the “dummy vertices” V_D in Algorithm 1.

We can in fact show that Algorithm 1 reduces to PageRank as follows:

    v_{t+1} = α v_t P + (1 − α) v_0
            ∝ v_t P + ((1 − α)/α) v_0      (5)
            = v_t P + β v_0

where β = (1 − α)/α. Thus, by setting the edge weights to the dummy vertices to β, i.e., ∀(z, z′) ∈ E with z′ ∈ V_D, w_{zz′} = β, Algorithm 1, and hence Algorithm 2, reduces to PageRank. Observe that when β = 1 we recover the original Algorithm 1. We refer to this as the “β-correction”.
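The equivalence between Eq. 4 and the β-corrected update of Eq. 5 can be checked numerically. The Python sketch below (the three-node transition matrix and its values are toy assumptions, not taken from the paper) iterates both updates from the same start vector; since the β-corrected update holds only up to a constant factor, it is renormalized at every step, after which the two iterations coincide:

```python
import numpy as np

def pagerank_step(v, P, alpha, v0):
    # Eq. 4: v_{t+1} = alpha * v_t P + (1 - alpha) * v_0
    return alpha * (v @ P) + (1 - alpha) * v0

def beta_step(v, P, beta, v0):
    # Eq. 5: v_{t+1} ∝ v_t P + beta * v_0 (defined up to a constant factor)
    v = v @ P + beta * v0
    return v / v.sum()  # renormalize to a probability distribution

# Toy row-stochastic transition matrix over 3 nodes (illustrative values).
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
v0 = np.full(3, 1.0 / 3.0)        # uniform restart distribution
alpha = 0.5
beta = (1 - alpha) / alpha        # the "beta-correction"; here beta = 1

v_pr, v_beta = v0.copy(), v0.copy()
for _ in range(100):              # iterate both updates to convergence
    v_pr = pagerank_step(v_pr, P, alpha, v0)
    v_beta = beta_step(v_beta, P, beta, v0)
```

The renormalization divides by 1 + β = 1/α (each summand sums to 1 and β respectively), so the normalized β-update is algebraically identical to the PageRank update, mirroring the derivation of Eq. 5.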
…Label Propagation algorithm. The results are seen in Table 2.

    Method                                Spearman Correlation
    PageRank (α = 0.1)                    0.39
    Parallel Label Propagation (β = 9)    0.39

Table 2: Lexical-relatedness results: comparison of PageRank and β-corrected Parallel Label Propagation.

Algorithm 3: Graph Construction

map(key, value):
begin
    edgeEntry = value
    Node n(edgeEntry)
    Emit(n.id, n)
end

reduce(key, values):
begin
    Emit(key, serialize(values))
end
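Outside a full map-reduce runtime, the behavior of Algorithm 3 can be sketched in a few lines of Python (the field layout of an edge entry and all names here are illustrative assumptions): the map phase emits a neighbor record keyed by each endpoint of an undirected edge, and the reduce phase groups the records sharing a key into one adjacency record per node:

```python
from itertools import groupby

def map_phase(edge_entries):
    """Mapper: emit (node_id, neighbor_record) for each endpoint of an edge."""
    for src, dst, weight in edge_entries:
        yield src, (dst, weight)   # dst is a neighbor of src
        yield dst, (src, weight)   # undirected: src is also a neighbor of dst

def reduce_phase(pairs):
    """Reducer: group all neighbor records sharing a key into one adjacency list."""
    shuffled = sorted(pairs)       # stands in for map-reduce's shuffle/sort step
    return {key: [v for _, v in group]
            for key, group in groupby(shuffled, key=lambda kv: kv[0])}

edges = [("a", "b", 1.0), ("b", "c", 0.5)]
graph = reduce_phase(map_phase(edges))
```

Each node's record ends up containing all of its neighbors and edge weights, which is the property the appendix relies on for the constant-memory guarantee of Algorithm 2.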
…(F-scores) with another scalable previous work by Kim and Hovy (2006) in Table 2 for the same seed set. Their approach starts with a few seeds of positive and negative terms and bootstraps the list by considering all synonyms of a positive word as positive and antonyms of positive words as negative. This procedure is repeated mutatis mutandis for negative words in the seed list until there are no more words to add.

… In this case, we progressively increase the number of nodes in the cluster. Again, the speedup achieved is linear in the number of processing elements (CPUs). An appealing property of Algorithm 2 is that the memory used by each mapper process is fixed regardless of the size of the graph. This makes the algorithm feasible for use with large-scale graphs.

10 Related Work
… In addition, the Netflix Million Dollar Challenge⁴ generated sufficient interest in large-scale clustering algorithms. McCallum et al. (2000) describe algorithmic improvements to the k-means algorithm, called canopy clustering, to enable efficient parallel clustering of data.

While there is earlier work on scalable map-reduce implementations of PageRank (e.g., Gleich and Zhukov (2005)), there is no existing literature on parallel algorithms for graph-based semi-supervised learning or on the relationship between PageRank and Label Propagation.

11 Conclusion

In this paper, we have described a parallel algorithm for graph ranking and semi-supervised classification. We derived this by first observing that the Label Propagation algorithm can be expressed as the solution to a set of linear equations. This is easily expressed as an iterative algorithm that can be cast into the map-reduce framework. This algorithm uses fixed main memory regardless of the size of the graph. Further, our scalability study reveals that the algorithm scales linearly in the size of the data and in the number of processing elements in the cluster. We also showed how Label Propagation can be used for ranking on graphs and the conditions under which it reduces to PageRank. We evaluated our implementation on two learning tasks – ranking and semi-supervised classification – using examples from natural language processing, including lexical relatedness and sentiment polarity lexicon induction, with a substantial gain in performance.

A Appendix A: Interface definition for Undirected Graphs

In order to guarantee the constant main memory requirement of Algorithm 2, the graph representation should encode, for each node, the complete information about its neighbors. We represent our undirected graphs in Google's Protocol Buffer format.⁵ Protocol Buffers allow a compact, portable on-disk representation that is easily extensible. This definition can be compiled into efficient Java/C++ classes.

The interface definition for undirected graphs is listed below:

package graph;

message NodeNeighbor {
  required string id = 1;
  required double edgeWeight = 2;
  repeated double labelDistribution = 3;
}

message UndirectedGraphNode {
  required string id = 1;
  repeated NodeNeighbor neighbors = 2;
  repeated double labelDistribution = 3;
}

message UndirectedGraph {
  repeated UndirectedGraphNode nodes = 1;
}

⁴ http://www.netflixprize.com
⁵ Implementation available at http://code.google.com/p/protobuf/

References

Arik Azran. 2007. The rendezvous algorithm: Multiclass semi-supervised learning with Markov random walks. In Proceedings of the International Conference on Machine Learning (ICML).

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. 2005. On manifold regularization. In Proceedings of AISTATS.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proceedings of COLING.

Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. 2006. Map-reduce for machine learning on multicore. In Proceedings of Neural Information Processing Systems (NIPS).

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI).

Christiane Fellbaum, editor. 1989. WordNet: An Electronic Lexical Database. The MIT Press.

D. Gleich and L. Zhukov. 2005. Scalable computing for power law graphs: Experience with parallel PageRank. In Proceedings of SuperComputing.

Ananth Grama, George Karypis, Vipin Kumar, and Anshul Gupta. 2003. Introduction to Parallel Computing (2nd Edition). Addison-Wesley, January.

David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In Proceedings of HLT-NAACL.

Michael Kearns. 1993. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing (STOC).
Soo-Min Kim and Eduard H. Hovy. 2006. Identifying
and analyzing judgment opinions. In Proceedings of
HLT-NAACL.
Andrew McCallum, Kamal Nigam, and Lyle H. Un-
gar. 2000. Efficient clustering of high-dimensional
data sets with application to reference matching.
In Knowledge Discovery and Data Mining (KDD),
pages 169–178.
G. Miller and W. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes.
Larry Page, Sergey Brin, Rajeev Motwani, and Terry
Winograd. 1998. The pagerank citation ranking:
Bringing order to the web. Technical report, Stan-
ford University, Stanford, CA.
Siddharth Patwardhan, Satanjeev Banerjee, and Ted
Pedersen. 2005. Senserelate::targetword - A gen-
eralized framework for word sense disambiguation.
In Proceedings of ACL.
John M. Prager, Jennifer Chu-Carroll, and Krzysztof Czuba. 2001. Use of WordNet hypernyms for answering what-is questions. In Proceedings of the Text REtrieval Conference.
M. Szummer and T. Jaakkola. 2001. Clustering and
efficient use of unlabeled examples. In Proceedings
of Neural Information Processing Systems (NIPS).
Jason Wolfe, Aria Haghighi, and Dan Klein. 2008. Fully distributed EM for very large datasets. In Proceedings of the International Conference on Machine Learning (ICML).
W. Xiao and I. Gutman. 2003. Resistance distance and Laplacian spectrum. Theoretical Chemistry Accounts, 110:284–289.
Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty.
2003. Semi-supervised learning using Gaussian
fields and harmonic functions. In Proceedings of
the International Conference on Machine Learning
(ICML).