Topology
Andrew Yates (andrewyates@fb.com)
Manirupa Das (das@cse.ohio-state.edu)
Jose Diaz (diaz.6@osu.edu)
Daniel Dotson (dotson.77@osu.edu)
Stephanie Schulte (schulte.109@osu.edu)
Rajiv Ramnath (ramnath.6@osu.edu)
Ohio State University
ABSTRACT
Methods that are both computationally feasible and practically effective are needed to make sense of big corpora of
content, or big content. For example, in open access academic publishing, supervised categorization techniques are
ill-suited for automated categorization because they rely on
an existing categorization scheme, but no such scheme can
stay abreast of the evolving landscape of scholarly work.
This problem applies to any domain where no good categorization scheme exists. To address this challenge, we
present an unsupervised method to fit a hierarchical categorization scheme to a corpus based on clustering the network
of shared concepts in the corpus, or its concept topology.
Our method potentially applies to any type of content, and
it scales to large networks of millions of vertices. We demonstrate this by applying our method to a corpus of 1.5 million scholarly texts representing the majority of open access
(OA) academic publications on the web and validating our
results using expert librarian annotations. Since our corpus
represents all OA academic work, our resulting categorization scheme best represents OA academic publishing as it
exists today.
Keywords
categorization, graph clustering, content analysis
1. INTRODUCTION
The world is inundated with an increasing amount of different types of big content and few good ways to make
Research was conducted while the author was unaffiliated.
represented as sets of highly informative but sparsely distributed features like keywords.
We provide our unsupervised categorization results and
our raw OA academic corpus online for review and future research. Excluding the top-level summary labels like
7. Computer Science, concept tagging, categorization, and
category labeling were entirely unsupervised, generated solely
from author-provided text.
This paper is arranged as follows. In Section 2, we review
related research and background topics related to this work.
Section 3 is an introduction to state-of-the-art stochastic
blockmodeling, a newly improved clustering technique that
we apply to produce our categorizations. In Section 4, we
describe our method of hierarchically clustering the concept
topology of big content as a general workflow, and in Section
5, we describe our specific application of our method to a
corpus of millions of OA academic publications. We present
our experimental results and validation by expert librarian
review in Section 6 and conclude with discussion in Section
7.
2.
3. INTRODUCTION TO BLOCKMODELS
Figure 1: (left) An example adjacency matrix sorted by blocks (labeled Quantum Physics, Materials Science, Cancer, and Public Health). (right) Corresponding edge probabilities between blocks, where p_ij is the probability of an edge between vertices in blocks i and j and lighter means lower probability.
A stochastic blockmodel is a statistical network model
where vertices belong to a fixed number of blocks and the
probability of an edge between two vertices is a function
of which blocks those vertices belong to [20]. This can be
illustrated using an adjacency matrix where the rows and
columns are ordered so that vertices in the same block are
adjacent. A characteristic block pattern in non-zero matrix entries becomes apparent. See Figure 1(left) for an example adjacency matrix of documents sorted by topic, where
filled entries represent high similarity between documents.
Blockmodeling, then, is the statistical inference technique
of fitting empirical data to block assignments and a matrix of
edge probabilities between those blocks as in Figure 1(right).
This is in contrast to modularity-based techniques that only
consider in-versus-out cluster connectivities and not weak
connections between clusters to make cluster assignments.
For example, in Figure 1, when making block assignments,
the blockmodel includes evidence that Cancer and Public
Health blocks are weakly connected to each other but not
connected to the Quantum Physics or Materials Science
blocks.
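The generative process just described can be sketched in a few lines: vertices carry block labels, and each possible edge is drawn independently with the probability given by the blocks of its endpoints. The block assignments and the 2x2 probability matrix below are illustrative, not taken from the paper.

```python
import random

def sample_sbm(block_of, p):
    """Sample an undirected graph from a stochastic blockmodel.

    block_of[v] gives the block of vertex v; p[r][s] is the
    probability of an edge between a vertex in block r and one in s.
    """
    n = len(block_of)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Edge probability depends only on the endpoints' blocks.
            if random.random() < p[block_of[i]][block_of[j]]:
                edges.append((i, j))
    return edges

# Two dense blocks that are weakly connected to each other,
# mirroring the Cancer / Public Health example in Figure 1.
random.seed(0)
blocks = [0, 0, 0, 1, 1, 1]
p = [[0.9, 0.1],
     [0.1, 0.9]]
edges = sample_sbm(blocks, p)
```

Blockmodel inference runs this process in reverse: given the observed edges, it recovers block assignments and the edge-probability matrix.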
3.1 Blockmodel Formulation

$$P = \prod_{i<j} p_{b_i b_j}^{A_{ij}} \left(1 - p_{b_i b_j}\right)^{1 - A_{ij}} \quad (1)$$

3.1.1

$$\mathcal{L} \simeq \sum_{rs} m_{rs} \log \frac{m_{rs}}{n_r n_s} \quad (2)$$

$$\mathcal{L} \simeq \sum_{rs} m_{rs} \log \frac{m_{rs}}{e_r e_s} \quad (3)$$

3.1.2 Hierarchical Extension

4.
4.1
http://graph-tool.skewed.de
4.2
Figure 3: Step 2: More common concept tokens (circles) have lower specificity scores (numbered stars).

Concept tokens may be general or specific in meaning.
For example, in academic texts, consider the keyword
tokens HIV and medicine. Intuitively, HIV has a more
specific meaning than medicine, and texts that share the
token HIV should be more similar in topic than those that
share the token medicine.
Rather than relying on a concept ontology, which may be
unavailable or infeasible to customize for particular applications, we can estimate global concept specificity S by
counting the frequency of concept tokens, because our content corpus is large; see Figure 3. The less frequent a concept, the
more specific we estimate it to be. We compute a specificity
score per token using Equation 4, where S_i is the specificity
score for concept token t_i, c_i is the number of times concept
token t_i appears in the corpus where c_i > 1, and B is a user
parameter.
$$S_i = \frac{1}{\log_2(c_i + 2^B - 2) - B + 1} \quad (4)$$
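Reading Equation 4 as S_i = 1/(log2(c_i + 2^B - 2) - B + 1), a minimal sketch; the token counts below are illustrative:

```python
import math

def specificity(c, B):
    """Specificity score from Equation 4: less frequent concept
    tokens (smaller count c) receive higher scores; requires c > 1."""
    return 1.0 / (math.log2(c + 2 ** B - 2) - B + 1)

# The rarest scorable token (c = 2) always gets S = 1, for any B:
# log2(2 + 2^B - 2) - B + 1 = log2(2^B) - B + 1 = 1.
print(specificity(2, 10))    # 1.0
# Larger B flattens the curve, penalizing common tokens less.
print(specificity(1000, 1))  # steep penalty for a common token
print(specificity(1000, 10)) # gentler penalty for the same token
```

As B grows, the score approaches 1 for every token, matching the constant-specificity limit discussed in the parameter-selection results of Section 5.5.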
4.3

Figure 4: Example adjacency list of content items (A-F) with top-K concept-similarity scores.

(5)
Computation: Naively comparing and recording the similarity between all pairs of content items is an O(T N^2) operation that requires O(N^2) memory, where N is the number of
content items in the corpus and T = O(lg N) is the maximum number of terms per item. When N is large, this
is prohibitively computationally intensive. We improve this
to O(N \sqrt{N} T) compute time and O(N K) memory by only
comparing content items that share at least one concept and
recording the top K most similar per content item as follows.
There are at most O(\sqrt{N} lg N) operations to score at
most M = O(\sqrt{N} lg N) other content items per content
item. Top K is computed in O(M) time per content item
using quickselect [7]. In practice, M << N because concepts are not independently distributed in the corpus and
most concepts are shared by only a few content items. Computation performance can be further improved by computing
each similarity list in parallel.
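The candidate-generation step above can be sketched with an inverted index. The similarity function here (sum of shared-concept specificity scores) and the example data are simplifying assumptions, and heapq.nlargest stands in for the paper's quickselect step:

```python
from collections import defaultdict
import heapq

def top_k_similar(doc_concepts, spec, K):
    """For each document, score only documents sharing at least one
    concept, then keep the top-K most similar per document."""
    # Inverted index: concept -> documents containing it, so we never
    # touch pairs of documents with no concept in common.
    index = defaultdict(list)
    for doc, concepts in doc_concepts.items():
        for c in concepts:
            index[c].append(doc)

    top = {}
    for doc, concepts in doc_concepts.items():
        scores = defaultdict(float)
        for c in concepts:
            for other in index[c]:
                if other != doc:
                    scores[other] += spec.get(c, 0.0)
        # heapq.nlargest stands in for the quickselect step.
        top[doc] = heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
    return top

docs = {"d1": {"HIV", "medicine"}, "d2": {"HIV"}, "d3": {"medicine"}}
spec = {"HIV": 0.9, "medicine": 0.2}
print(top_k_similar(docs, spec, K=2)["d1"])  # [('d2', 0.9), ('d3', 0.2)]
```

Each per-document scoring pass is independent, which is what makes the parallelization mentioned above straightforward.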
4.4
4.7
Once clustering terminates, we automatically label each cluster with its most statistically enriched concept tokens, using a chi-squared test on the contingency
table shown in Table 1. That is, we label clusters by the
concepts they have more of in comparison to other clusters
at the same level. The test is performed top-down where
clusters are only compared to other clusters with the same
parent.
                 In Cluster   Not in Cluster
Has Concept          v              x
Hasn't Concept       y              z

Table 1: Contingency table used in chi-squared test.
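The labeling step can be sketched as follows. The chi-squared statistic is computed directly for the 2x2 table, and the helper names and counts are illustrative assumptions rather than the paper's exact implementation:

```python
def chi2_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 contingency table
    [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def enriched_labels(in_counts, n_in, out_counts, n_out, top=3):
    """Rank concepts by chi-squared enrichment in a cluster versus
    its sibling clusters (same parent), as in Table 1.

    in_counts[c]  = docs in the cluster tagged with concept c
    out_counts[c] = docs in sibling clusters tagged with c
    n_in, n_out   = total docs in the cluster / in its siblings
    """
    scored = []
    for concept, a in in_counts.items():
        b = out_counts.get(concept, 0)
        # Rows of Table 1: has concept / hasn't concept;
        # columns: in cluster / not in cluster.
        if a / n_in > b / max(n_out, 1):  # over-represented only
            scored.append((chi2_2x2(a, b, n_in - a, n_out - b), concept))
    return [concept for _, concept in sorted(scored, reverse=True)[:top]]

labels = enriched_labels({"cancer": 40, "cell": 20}, 50,
                         {"cancer": 5, "cell": 18}, 50)
print(labels)  # ['cancer', 'cell']
```

Restricting the comparison to siblings with the same parent mirrors the top-down labeling described above: each level is labeled relative to its peers, not the whole corpus.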
Figure 5: Undirected top-K concept similarity network corresponding to the adjacency list in Figure 4.
The top-K most-similar list produced in Step 3 is an adjacency list representation of an undirected network, where
vertices are content items and an edge represents a top-K
concept-similarity relationship. Optionally, edges can have
integer edge weights (multiplicities) that increase with similarity score, where Wrs is a function of similarity score SSrs
between vertices r and s that share at least one concept, see
Equation 6.
$$W_{rs} = \max(\lceil SS_{rs} - 1 \rceil, 1) \quad (6)$$
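Reading Equation 6 as W_rs = max(ceil(SS_rs - 1), 1), a small helper; the cap of 6 mirrors the ceiling multiplicity reported in the clustering computation of Section 5.6, and the example scores are illustrative:

```python
import math

def edge_multiplicity(ss, cap=6):
    """Integer edge weight W_rs = max(ceil(SS_rs - 1), 1),
    capped at a maximum multiplicity."""
    return min(max(math.ceil(ss - 1), 1), cap)

print(edge_multiplicity(0.5))  # 1: weak similarities still get one edge
print(edge_multiplicity(3.2))  # 3
print(edge_multiplicity(9.5))  # 6: capped
```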
5. EXPERIMENT ON OA PUBLICATIONS

5.1
4.5

4.6
5.2
For validation and for an understanding of the corpus composition, we expertly annotated a random sample of 498
publications using LCC two-letter categorizations and domain-specific librarian subject tags. Our expert reviewer
coauthors are: Jose Diaz, Curator for Special Collections;
Daniel Dotson, Associate Professor and Mathematical Sciences Librarian, who tagged using LCSH; Stephanie Schulte,
Associate Professor, Health Sciences Library, who tagged using MeSH.
We found a highly imbalanced distribution of publications
across LCC categories: 90% of our sample falls in 3 of 21 possible top-level LCC categories: Science (56%), Medicine (25%),
or Technology (8%). Also, the top 4 second-level LCC categories account for 50% of all documents: Physics (22%),
Math (11%), Internal Medicine (10%), and Astronomy (7%). Note
that LCC does not have a high-level category for relatively modern fields like computer science or for subdisciplines
of biomedicine or physics. We provide a complete table of
LCC categorization labels in our sample online.
5.3
(7)
In our 1.5M document abstract corpus, we found 556k total unique concepts and 195k concept tokens that appeared more than once.

Occurrences   Count
1             360712
2-9           128210
10-49         40017
50-199        16206
200-499       5710
500-1999      3585
>=2000        1298
When no particular method is necessarily the best, an effective approach can be to combine results from multiple independently designed and implemented methods [13]. Here,
we apply this approach to tag plain text with concept tokens
from text using two independent natural language processing (NLP) services, AlchemyAPI (http://www.alchemyapi.
com) and Aylien (http://aylien.com). By combining results, we increased our confidence in tokens produced by
both services and, given that both services operate independently, mitigated poor performance of one service with
results from the other. As input to both services, we formatted document title, abstract, and any author-provided keywords into HTML strings that annotated the document title.
From AlchemyAPI, we used the Entity Extraction and Concept Tagging services; from Aylien, we used only the Concept Extraction service. Entity extraction is the context- and
grammar-sensitive automated detection of important proper
nouns like the names of people, places, and things. Concept
extraction / tagging associates the inferred meaning of text
with terms in ontological databases, for example, DBpedia
[3]. Concept extraction is also capable of identifying concepts that are not explicitly referenced in text.
Our NLP tagging services produced two lists of scored
tokens, one from Aylien, and one from AlchemyAPI. We
standardized raw token strings, stemmed words, and filtered
low quality or irrelevant results using relevance thresholds.
We then assigned a quality score between 0 and 1 based on
service-specific attributes like relevance or confidence.
To combine the lists of standardized, scored tokens, first
we boosted the best score of tokens returned by both services as in Equation 7. We then sorted tokens by decreasing
score (highest score first) and chose up to the first 30 tokens.
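The combination step above can be sketched as follows. Since Equation 7 is not reproduced here, the additive boost below is a hypothetical stand-in for the paper's boost function, and the example tokens and scores are illustrative:

```python
def combine_token_lists(alchemy, aylien, boost=0.2, limit=30):
    """Merge two {token: score} dicts with scores in [0, 1].

    Tokens returned by both services have their best score boosted;
    the additive boost is a stand-in for the paper's Equation 7.
    """
    combined = {}
    for token in set(alchemy) | set(aylien):
        a, b = alchemy.get(token), aylien.get(token)
        if a is not None and b is not None:
            combined[token] = min(max(a, b) + boost, 1.0)  # boost agreement
        else:
            combined[token] = a if a is not None else b
    # Sort by decreasing score and keep up to the first `limit` tokens.
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]

tokens = combine_token_lists({"hiv": 0.8, "virus": 0.5},
                             {"hiv": 0.7, "medicine": 0.6})
print(tokens)  # [('hiv', 1.0), ('medicine', 0.6), ('virus', 0.5)]
```

Tokens confirmed by both services rise to the top of the ranking, which is the stated goal of the boost: agreement between independent services increases confidence.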
5.4

Min   Mean   Median   Std Dev   Max
1     12.5   12       4.9       36

5.5
Manual Review Results: B = 10 performed significantly better (90% confidence, binomial test) than both B =
1 and B = ∞ in best-set scoring and also better in error
frequency, but not significantly (see Tables 4 and 5; * indicates a significant result with 90% confidence per two-sided
binomial test). On inspecting errors, in general, the deleterious effects of erroneous concept tokens tended to average out
with other tokens. Instead, similarity-level errors seemed to
have two primary causes: 1) a single rare but tangentially
related or unrelated concept was given more weight than the sum
of evidence from other shared concepts; or 2) key specific
concepts were not given more weight than the sum of many
other common and possibly less important shared concepts.
This was especially aggravated when a target document's
concept list was short (6 or fewer concepts). These two failure cases tended to correspond to the extremes of B = 1,
where rare tokens have the highest relative specificity score,
and B = ∞, where the token specificity score is constant
and not related to token frequency. B = 10 blended these
two approaches by weighting rare tokens more than common tokens, but at a less extreme scale than B = 1. Thus,
B = 10 mitigated these two failure cases and tended to produce superior results.
B     # Best     # Errors
1     75/100     7/500
10    89/100*    4/500
∞     64/100     8/500
5.6 Clustering Computation
We generated an optimal, hierarchical categorization using our method. To produce our first level of clusters,
we removed records with < 5 tokens from our 1,502,473 records
and limited token set sizes to at most the 30
highest-scored tokens. Per our parameter selection procedure, we used B = 10 to compute document similarity scores
and Equation 6 to compute edge multiplicities with a ceiling multiplicity of 6. We set = 2.5 and K = 15 based on
our computational capacity. In total, we fitted a network of
1,459,110 documents and 18,041,069 similarity edges with
an average edge multiplicity of 1.6 (see Table 6).
The most expensive computation was the first-level clustering of 1.5 million documents, which took 59 hours
on an m2.4xlarge Amazon EC2 instance,
while other computations took comparably insignificant
time. This first, most expensive clustering produced 84,190
leaf clusters and 37 levels, with a resulting minimum
description length of 5.96 nats per multiedge. Of these
leaves, we filtered as outliers all 2,423 leaves of size 1 or 2, which contained 43,643 documents. To justify our use of
similarity-weighted multiedges, we clustered the first level
Multiplicity   1       2       3      4      5      6
% of edges     52.7%   37.1%   8.3%   1.5%   0.3%   0.1%

Table 6: Edge multiplicity distribution.
6. RESULTS

6.1 Summary of Categorization Results
Our method produced a meaningfully labeled categorization
scheme of 1.5 million documents that best describes
our corpus in a balanced way, which we present online at
http://goo.gl/vMiWU0. Clustering produced four hierarchical levels with roughly evenly balanced clusters at every level, as shown in Table 7. Consistent with our corpus composition, in which physics and biomedicine were over-represented,
our method produced three biology-themed categories among the nine top-level
categories and categorized public health topics in a social sciences cluster. Likewise, physics was divided
into condensed matter and other-physics top-level categories. Of note is the significant difference in shared concept
compositions that clustering found between genomic-based
(Category 3) and non-genomic (Category 0) biomedicine and
related topics, a distinction not recognized in traditional
subject categorizations like LCC. Likewise, clustering successfully separated mathematics and computer science into
separate top-level categories, a distinction commonly held
today but not made in LCC. We list all 9 top level categories
with our summary labels and percentage of total documents
below.
Table 7: Clusters per level: number of clusters (N) and documents per cluster (Min, Max, Mean, Median, SD).

Level   N       Min      Max      Mean     Median   SD
1       81767   3*       75       17.8     17       7.4
2       3549    28       1442     410.3    381      186
3       160     2894     20342    9100     8354     3622
4       9       109326   231619   161778   162586   39845
8. Mathematics (9.4%)
Lower-level Categories: As categorization progresses
to lower levels, categories become more specific within their
parent category as intended; see below for example Level 3
categories. However, while all lower-level categories have a
unique parent category, sub-categories with different parents may have many concepts in common. For example,
consider the two sub-categories for cancer tumors:
0.10. Cancer, Oncology, Neoplasm, Anatomical pathology,
Tumor, Carcinoma
3.36. Cancer, Gene expression, Breast cancer, Carcinogenesis, Oncology, Tumor
While both categories are about cancer tumors, subcategory 0.10 is under non-genomic biomedicine and approaches
cancer primarily from an anatomical perspective, while subcategory 3.36 is under the genomic biomedicine cluster and approaches cancer from a genomic perspective. We consistently observe this type
of categorization in our results, where documents are categorized by high-level approach, secondary approach, specific
approach, etc., rather than by necessarily hard boundaries
in topical matter.
6 (of 18) Example Sub-Categories with Automatic
Labels for Biology: Non-Genomic Biomedicine, Surgery:
Top level category: 0: Cancer, Surgery, Heart, Myocardial infarction, Atherosclerosis, Blood
Experiment               100k Permutes Mean   100k Permutes Max
Level 4 (top 9): 4.87    1.00                 1.19
Level 3 (160):   6.35    1.00                 1.56
6.3

7. CONCLUSION
2012.
[21] T. P. Peixoto. Efficient Monte Carlo and greedy
heuristic for the inference of stochastic block models.
Physical Review E, 89(1):012804, Jan. 2014.
[22] T. P. Peixoto. Hierarchical Block Structures and
High-Resolution Model Selection in Large Networks.
Physical Review X, 4(1):011047, Mar. 2014.
[23] M. Rosvall and C. T. Bergstrom. Multilevel
compression of random walks on networks reveals
hierarchical organization in large integrated systems.
PLoS ONE, 6(4):e18209, Jan. 2011.
[24] Y. Ruan, D. Fuhry, and S. Parthasarathy. Efficient community detection in large networks using content and links. In WWW '13: Proceedings of the 22nd International Conference on World Wide Web, pages 1089-1098, May 2013.
[25] M. N. Schmidt and M. Mørup. Nonparametric Bayesian modeling of complex networks: an introduction. IEEE Signal Processing Magazine, 30(3):110-128, May 2013.