
Hierarchical Categorization of Big Content Using Concept Topology

Andrew Yates, Facebook, andrewyates@fb.com
Manirupa Das, Ohio State University, das@cse.ohio-state.edu
Jose Diaz, Ohio State University, diaz.6@osu.edu
Daniel Dotson, Ohio State University, dotson.77@osu.edu
Stephanie Schulte, Ohio State University, schulte.109@osu.edu
Rajiv Ramnath, Ohio State University, ramnath.6@osu.edu

(Research was conducted while the author was unaffiliated.)

ABSTRACT
Methods that are both computationally feasible and practically effective are needed to make sense of big corpuses of
content, or big content. For example, in open access academic publishing, supervised categorization techniques are
ill-suited for automated categorization because they rely on
an existing categorization scheme, but no such scheme can
stay abreast of the evolving landscape of scholarly work.
This problem applies to any domain where no good categorization scheme exists. To address this challenge, we
present an unsupervised method to fit a hierarchical categorization scheme to a corpus based on clustering the network
of shared concepts in the corpus, or its concept topology.
Our method potentially applies to any type of content, and
it scales to large networks of millions of vertices. We demonstrate this by applying our method to a corpus of 1.5 million scholarly texts representing the majority of open access
(OA) academic publications on the web and validating our
results using expert librarian annotations. Since our corpus
represents all OA academic work, our resulting categorization scheme best represents OA academic publishing as it
exists today.

Categories and Subject Descriptors


H.2.8 [Database Management]: Database Applications - Data mining

Keywords
categorization, graph clustering, content analysis

1. INTRODUCTION

The world is inundated with an increasing amount of different types of big content and few good ways to make

sense of it all. Big content is the semantically rich analog of big data: content, including text, images, and user profiles, large enough to be a nuanced representation of its domain. Unlike simple data types like log entries and numbers, big content is additionally challenging to model because it is not straightforward to meaningfully represent it as a quantitative model of features.
When presented with a novel corpus of big content, a
first step to understand it is to categorize it. However,
many types of content have no good categorization scheme, and those that do may have evolved such that they are no longer well represented by that scheme. One example of such content is Open Access (OA) academic publications. Even with well-developed hierarchical vocabularies to describe scholarly works, these systems are relatively static compared to the ever-evolving language of science, and developing fields and emerging technologies are not easily described by existing terms. Likewise, as research activity evolves, categorization can become uninformative, placing most new work under the same few categories while rarely using other, now obsolete categories. Even when these systems are semi-automated, they are laborious and time consuming. As the
volume of scholarly works expands, this classification time
lag leads to a delay in organization and use. Clearly, there is
a need for a scalable unsupervised categorization scheme to
automatically and quickly classify large volumes of content
in a highly informative way.
Our method creates a balanced categorization scheme for
any corpus. It works by fitting hierarchical, agglomerative
clusters to the network of shared concepts in content. In our
method, we use a shared concept topology, where we consider only the local, most similar connections in content to be
meaningful. This takes advantage of the observation that in big content, many small differences in similarity between
unrelated content items are largely meaningless, yet they
add substantial computational complexity and complicate
modeling. Our method also produces interpretable intermediate outputs that can be manually validated. This makes
our method practical for industry applications that require
workflow customizations and method interpretability.
We apply our method specifically to OA academic publications because of the clear need, availability, and access
to expert librarian validation. However, the big picture is
that our method applies to previously difficult to categorize
content like user profiles, interest groups, and free-form text
posted on social media websites: any content that can be

represented as sets of highly informative but sparsely distributed features like keywords.
We provide our unsupervised categorization results (http://goo.gl/vMiWU0) and our raw OA academic corpus (http://goo.gl/KOlgVX) online for review and future research. Excluding the top-level summary labels like "7. Computer Science", concept tagging, categorization, and category labeling were entirely unsupervised, generated solely from author-provided text.
This paper is arranged as follows. In Section 2, we review
related research and background topics related to this work.
Section 3 is an introduction to state-of-the-art stochastic
blockmodeling, a newly improved clustering technique that
we apply to produce our categorizations. In Section 4, we
describe our method of hierarchically clustering the concept
topology of big content as a general workflow, and in Section
5, we describe our specific application of our method to a
corpus of millions of OA academic publications. We present
our experimental results and validation by expert librarian
review in Section 6 and conclude with discussion in Section
7.

2. BACKGROUND AND RELATED WORK

Unsupervised Text Clustering: Content clustering,


particularly of text, is a well-studied field with a rich history (see Aggarwal and Zhai [1] for a review). Big content,
like most data, can be clustered using variants of classical
clustering techniques like K-means or hierarchical clustering,
whereby clusters become categories. Unsupervised categorization of text is most popularly based on two methods,
Probabilistic Latent Semantic Indexing (PLSI) [8], and Latent Dirichlet Allocation (LDA) [4]. In these methods, a
corpus is modeled as a function of hidden random variables,
or topics, which are then estimated from corpus items represented as collections of features, typically words.
One direction in this research is to include additional information about the corpus to supplement existing topic
models. Newman et al. improve unsupervised topic clustering on low-quality content by including information derived from external texts [17]. Likewise, Mei et al. use social network topology to inform topic modeling [14]. Other
approaches reduce the complexity of the problem using sampling methods like random projections [2] or intelligent edge
filtering [24]. An alternate direction is to improve the way
concepts in content are represented as features. Recent
advances include Paragraph Vector [11] and the skip-gram
model with extensions [15].
Finally, ideas related to concept topology modeling extend
to other types of media. For example, Jing and Baluja use
the network structure of an inferred visual similarity graph
to improve the image search results [9].
Community Discovery: Community discovery by graph
clustering has been studied for several decades. However, recent advances motivated by improved computing power and
applications like social networking and bioinformatics have
made large scale community discovery more feasible and effective. Common modern methods include modularity maximization [5], random walk compression [23], and a variety
of other techniques [6, 25]. See Newman [18] for a review of
the state of the art in community mining and clustering.
One such recent advance in graph clustering is a technique called stochastic blockmodeling. We apply a variant of stochastic blockmodeling here over other methods for its superior empirical results, computational tractability, statistical model selection, and its design that incorporates patterns of weak links between communities when making community assignments [22, 21]. We discuss advances in stochastic blockmodeling in Section 3.
Library Categorization: Historically, text classification systems grew out of the library world. Systems such as
the Library of Congress Classification (LCC) were created to organize and
find physical items on shelves while grouping like items together by subject. However, books and other physical pieces
can cover multiple topics and cannot sit in multiple places.
Thus, there was a need to have subject headings, such as
Library of Congress Subject Headings (LCSH), to describe
the many subjects of a book. A single book should have only
one call number in a library, but it may have several subject headings. While it can be fairly specific in describing
an item, the LCSH system was designed for larger bodies of
work and is not ideal for describing journal articles, which
often tend to be extremely specific. Like LCC call numbers,
LCSH is designed to be a stable standard that covers every discipline. Thus, like LCC, LCSH is not easily able to
rapidly adapt to emerging topics, cover items with a disciplinary lens, or deal with items with an extremely narrow
focus.
Discipline-specific schemes were developed for this reason, and
many have their own controlled vocabulary schemes to describe items. The advantage of subject-specific database
schemes is that they use the language of that discipline and
do not need to take into account disambiguation between
different disciplines. The disadvantage of such schemes is
that they are of limited use outside their intended discipline.
One such domain-specific scheme is MeSH (Medical Subject
Headings) [12], which is produced by the United States National Library of Medicine and broadly covers journal articles in biomedicine.

3. INTRODUCTION TO BLOCKMODELS

[Figure 1: (left) An example adjacency matrix sorted by blocks, with blocks labeled Quantum Physics, Materials Science, Cancer, and Public Health. (right) Corresponding edge probabilities between blocks, where p_ij is the probability of an edge between vertices in blocks i and j and lighter means lower probability.]
A stochastic blockmodel is a statistical network model
where vertices belong to a fixed number of blocks and the
probability of an edge between two vertices is a function
of to which blocks those vertices belong [20]. This can be
illustrated using an adjacency matrix where the rows and
columns are ordered so that vertices in the same block are

adjacent. A characteristic block pattern in non-zero matrix entries becomes apparent. See Figure 1(left) for an example adjacency matrix of documents sorted by topic, where
filled entries represent high similarity between documents.
Blockmodeling, then, is the statistical inference technique
of fitting empirical data to block assignments and a matrix of
edge probabilities between those blocks as in Figure 1(right).
This is in contrast to modularity-based techniques that only
consider in-versus-out cluster connectivities and not weak
connections between clusters to make cluster assignments.
For example, in Figure 1, when making block assignments,
the blockmodel includes evidence that Cancer and Public
Health blocks are weakly connected to each other but not
connected to the Quantum Physics or Materials Science
blocks.
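To make the generative definition concrete, the following minimal sketch (ours; the block labels and edge probabilities are illustrative, not fit to data) samples an adjacency matrix from a four-block stochastic blockmodel in which the Cancer and Public Health blocks are weakly connected, as in Figure 1.

```python
# Minimal sketch (not from the paper): sampling a network from a stochastic
# blockmodel. Block labels and probabilities are illustrative.
import numpy as np

rng = np.random.default_rng(0)
blocks = ["Quantum Physics", "Materials Science", "Cancer", "Public Health"]
# p[i][j]: probability of an edge between a vertex in block i and one in block j.
p = np.array([[0.30, 0.10, 0.00, 0.00],
              [0.10, 0.30, 0.00, 0.00],
              [0.00, 0.00, 0.30, 0.08],   # Cancer weakly linked to Public Health
              [0.00, 0.00, 0.08, 0.30]])

n_per_block = 50
b = np.repeat(np.arange(len(blocks)), n_per_block)   # block assignment b_i per vertex
N = len(b)

# Draw A_ij ~ Bernoulli(p[b_i][b_j]) for i < j, yielding a symmetric adjacency matrix.
A = np.zeros((N, N), dtype=int)
for i in range(N):
    for j in range(i + 1, N):
        A[i, j] = A[j, i] = rng.random() < p[b[i], b[j]]

print("edges:", A.sum() // 2)
```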

3.1 Blockmodel Formulation

Let A_ij be the entry in the adjacency matrix between vertices i and j, which is 1 if there is an edge between i and j, and 0 otherwise. Let b_i be the block to which vertex i belongs and let p_{b_i b_j} be the probability of an edge between blocks b_i and b_j. The probability that an observed network was generated by a blockmodel is expressed in Equation 1 [18].

P = \prod_{i<j} p_{b_i b_j}^{A_{ij}} (1 - p_{b_i b_j})^{1 - A_{ij}}    (1)

Fundamentally, we maximize this function, given the data, to solve for the most likely, or optimal, blockmodel. To choose an optimal number of blocks, we apply the minimum description length extension by Peixoto [19].

3.1.1 Multiedge and Degree-Corrected Extensions

Rather than depending on the binary presence or absence of an edge, it may be desirable to include edge multiplicities (multiedges) so that some edges count more than others. To support edge multiplicities, rather than modeling edges as drawn from a binary Bernoulli distribution, we draw them from a Poisson distribution, where values can be any non-negative integer. As shown by Karrer and Newman [10], this can be formulated in terms of vertices and edge counts as in Equation 2,

\mathcal{L}(G|g) = \sum_{rs} m_{rs} \log \frac{m_{rs}}{n_r n_s}    (2)

where r and s are blocks, m_{rs} is the number of edges between blocks r and s, n_r is the number of vertices in block r, and G is an undirected graph model given the observations g. This model assumes that vertices have roughly the same degree, but in real networks, the vertex degree may vary quite a bit. To model the large differences in vertex degrees found in empirical networks, we apply the degree-corrected multigraph variant as in Karrer and Newman [10] (see Equation 3). In this formulation, we replace the vertex counts n_r with the sums of vertex degrees \kappa_r, where \kappa_r is the number of edge endpoints incident to vertices in block r.

\mathcal{L}(G|g) = \sum_{rs} m_{rs} \log \frac{m_{rs}}{\kappa_r \kappa_s}    (3)

3.1.2 Hierarchical Extension

One weakness of blockmodeling is that when the network grows large, the minimum detectable cluster size may grow larger than the size of the true clusters in the network. For blockmodels, this size scales with O(\sqrt{N}), where N is the number of vertices in the network. As a result, small but true clusters may not be detected. As shown by Peixoto [22], this clustering resolution limit can be reduced to O(\log N) by fitting a hierarchy of blockmodels and minimizing the description length of the model. This approach is based on the observation that a blockmodel can itself be fit to a higher-level blockmodel, where blocks map to vertices, edges between blocks map to multiedges, and edges within a block map to self-multiedges. We use the hierarchical variant of blockmodeling implemented by Peixoto in the graph-tool package (http://graph-tool.skewed.de).
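For reference, a minimal sketch of fitting this hierarchical variant with graph-tool follows. The edge list is a toy stand-in for a real similarity network, and exact keyword arguments vary across graph-tool releases; recent releases apply degree correction by default.

```python
# Hedged sketch: fitting the hierarchical (nested) blockmodel with graph-tool.
# Assumes a recent graph-tool release; arguments differ across versions.
from graph_tool.all import Graph, minimize_nested_blockmodel_dl

g = Graph(directed=False)
# Toy edge list standing in for the top-K concept-similarity network;
# vertices are created implicitly from the endpoint indices.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
g.add_edge_list(edges)

# Fit a hierarchy of blockmodels by minimizing the description length.
state = minimize_nested_blockmodel_dl(g)
state.print_summary()                              # number of blocks at each level
leaf_blocks = state.get_levels()[0].get_blocks()   # leaf-level block per vertex
print([int(leaf_blocks[v]) for v in g.vertices()])
print("description length (nats):", state.entropy())
```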

4. METHOD OF UNSUPERVISED CATEGORIZATION

In this section, we present our concept topology clustering workflow generally; in Section 5, we present our specific application of it to our motivating problem: the unsupervised categorization of scholarly texts. The workflow steps are as follows: 1) Summarize content items as concept token multisets; 2) Count token frequency and compute token specificity scores; 3) Construct a list of the top-K most similar content items per content item; 4) Construct the similarity network; 5) Cluster content using hierarchical blockmodeling; 6) If clustering succeeds, repeat the workflow from Step 1 to cluster the clusters; 7) After hierarchical clustering completes, generate category labels using the most highly statistically enriched concepts per cluster using a chi-squared test.

4.1 Step 1: Content as Concept Token Sets

[Figure 2: Step 1: Content items (lettered boxes) summarized as concept token sets (colored circles).]
Analysis of media like text, music, video, and images is
typically more challenging than that of quantitative data because, unlike numbers, how to represent rich content in a computer
to reveal patterns at scale is not straightforward. Two approaches are commonly applied. One is a high-dimensional
strategy, where content is represented by very large feature
vectors like n-gram counts. These feature vectors can be
unwieldy to use in computations and are not human interpretable. An alternate approach is to use human reviewers
to annotate content. This approach produces small sets of
highly interpretable, meaningful labels. However, it is frequently infeasible at scale due to the volume of content to
be reviewed and the expertise required to do so.
We address the challenge of representing content as features using a standardized concept token sets approach.
A concept token is a discrete unit of meaning, for example, a keyword. We provide a list of criteria for concept
tokens below. In Section 5, we implement concept tokens as
standardized keyphrases using natural language processing
(NLP); however, other approaches customized for different


types of media are equally applicable within the concept token set criteria.
Concept Token Set Generation Criteria: 1) Apply
the same method to all content items equally and independently. This reduces the risk of introducing feature bias
and batch effects. Additionally, this allows parallel and distributed computation of concept token sets from content
items. 2) Bound the token set size to be a small, finite
number. For example, in text processing, a single publication of hundreds to thousands of words should be represented as a concept token set of a dozen or so keyword
concept tokens. This constraint exists chiefly to constrain
the computational complexity of comparing large numbers
of concept token sets in future steps. Bound the specificity
of concept tokens so that tokens are not so unique that none
are shared with other content items, but not so general that
they are not specific enough to be informative. 3) Standardize the form of concept tokens so that they can be compared across contexts without ambiguity in meaning. For
example, if using NLP keywords, "black cat", "Black cat", and "black cats" are superficially different forms of the same concept and should all be represented as the same token. Conversely, "virus (Computer)" and "virus (Immunology)"
refer to different concepts and should be represented as different tokens.
We recommend bounding the number of concept tokens per content item to O(\lg N) and the maximum frequency of a concept token to O(\sqrt{N}), for the computational efficiency reasons described in Step 3 (Section 4.3).
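To make criterion 3 concrete, the following is a minimal sketch of one possible token standardization routine; the lowercasing, crude singularization, and domain-suffix disambiguation shown here are illustrative assumptions, not the NLP pipeline described in Section 5.

```python
# Illustrative sketch of concept-token standardization (criterion 3). The
# normalization rules here are assumptions, not the paper's exact pipeline.
import re

def standardize(raw_phrase, domain=None):
    token = raw_phrase.strip().lower()
    token = re.sub(r"\s+", " ", token)        # collapse whitespace
    if token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]                    # crude singularization ("black cats" -> "black cat")
    if domain:                                # disambiguate, e.g. "virus (computer)" vs "virus (immunology)"
        token = f"{token} ({domain})"
    return token

assert standardize("Black cat") == standardize("black cats") == "black cat"
assert standardize("virus", "computer") != standardize("virus", "immunology")
```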

4.2 Step 2: Count Tokens and Compute Specificity Scores

[Figure 3: Step 2: More common concept tokens (circles) have lower specificity scores (numbered stars).]
Concept tokens may be general or specific in meaning. For example, in academic texts, consider the keyword tokens "HIV" and "medicine". Intuitively, "HIV" has a more specific meaning than "medicine", and texts that share the token "HIV" should be more similar in topic than those that share the token "medicine".
Rather than relying on a concept ontology, which may be unavailable or infeasible to customize for particular applications, we can estimate a global concept specificity S by counting the frequency of concept tokens, because our content corpus is large; see Figure 3. The less frequent a concept, the more specific we estimate it to be. We compute a specificity score per token using Equation 4, where S_i is the specificity score for concept token t_i, c_i is the number of times concept token t_i appears in the corpus where c_i > 1, and B is a user parameter.

S_i = \frac{1}{\log_2(c_i + 2^B - 2) - B + 1}    (4)

The user parameter B adjusts the shape of the specificity score function. When B = 1, S is the inverse of the log token frequency, and rare concepts have a much higher relative specificity score than common ones do. As B increases, this difference is attenuated. When B = ∞, all concepts count equally regardless of their frequency.
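Equation 4 translates directly into code. The sketch below (with toy token counts) computes S_i per token; the corpus and the default B are illustrative.

```python
# Sketch of the specificity score in Equation 4. Counts below are illustrative.
import math
from collections import Counter

def specificity(c_i, B=10):
    """S_i for a token seen c_i > 1 times; B controls how sharply rarity is rewarded."""
    return 1.0 / (math.log2(c_i + 2 ** B - 2) - B + 1)

corpus_tokens = ["medicine"] * 5000 + ["hiv"] * 40     # toy frequencies
counts = Counter(corpus_tokens)
scores = {t: specificity(c) for t, c in counts.items() if c > 1}
# Rare "hiv" scores higher than common "medicine"; as B grows, all scores approach 1.
print(scores)
```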

4.3 Step 3: Construct Top-K Most Similar List

[Figure 4: An adjacency list with each content item (lettered box) and its corresponding list of the top K = 2 content items with the highest sum specificity score (numbered star).]
Intuitively, two content items that share many specific concepts are more similar than those that only share a few general ones. Building on this, we compute similarity scores SS between all content items that share at least one concept using Equation 5, which is the sum of the specificity scores of the concepts shared by a pair of content items. Note that the user parameter B in Equation 4 controls the balance in the similarity score computation between the number of shared concepts and the specificity of those concepts. We rank each list by decreasing similarity and choose the top K per content item.
We recommend selecting K = \alpha \log(N), where \alpha > 1 is a user parameter and N is the number of vertices being clustered, so that the graph has enough edges to be expected to be largely connected. By limiting the number of comparisons and selecting only the most similar pairs per content item, we can reduce and control the computational complexity as well as improve results by not considering many small differences in similarity between unrelated items.
SS_{rs} = \sum_{t_i \in \, tokens_r \cap \, tokens_s} S_i    (5)

Computation: Naively comparing and recording the similarity between all pairs of content items is a T N^2 operation that requires N^2 memory, where N is the number of content items in the corpus and T = O(\lg N) is the maximum number of terms per item. When N is large, this is prohibitively computationally intensive. We improve this to O(N \sqrt{N} T) compute time and O(N K) memory by only comparing content items that share at least one concept and recording the top K most similar per content item, as follows. There are at most O(\sqrt{N} \lg N) operations to score at most M = O(\sqrt{N} \lg N) other content items per content item. The top K is computed in O(M) time per content item using quickselect [7]. In practice, M << N because concepts are not independently distributed in the corpus and most concepts are shared by only a few content items. Computation performance can be further improved by computing each similarity list in parallel.
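The sketch below illustrates this strategy on toy data: an inverted index restricts scoring to items that share a concept, Equation 5 accumulates shared specificity, and heapq.nlargest stands in for quickselect. The data and K are illustrative.

```python
# Sketch of Step 3: score only items sharing a concept (via an inverted index)
# and keep the top K per item. Data and K are illustrative; heapq.nlargest is
# used in place of quickselect for brevity.
import heapq
from collections import defaultdict

def top_k_similar(token_sets, specificity, K=2):
    inverted = defaultdict(list)                  # concept -> items containing it
    for item, tokens in token_sets.items():
        for t in tokens:
            inverted[t].append(item)

    top_k = {}
    for item, tokens in token_sets.items():
        scores = defaultdict(float)
        for t in tokens:                          # Equation 5: sum shared specificities
            for other in inverted[t]:
                if other != item:
                    scores[other] += specificity[t]
        top_k[item] = heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
    return top_k

token_sets = {"A": {"hiv", "vaccine"}, "B": {"hiv", "trial"}, "C": {"vaccine", "trial"}}
specificity = {"hiv": 0.9, "vaccine": 0.6, "trial": 0.4}
print(top_k_similar(token_sets, specificity))
```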

4.4 Step 4: Construct Top-K Most Similar Network

[Figure 5: Undirected top-K concept similarity network corresponding to the adjacency list in Figure 4.]

The top-K most-similar list produced in Step 3 is an adjacency list representation of an undirected network, where vertices are content items and an edge represents a top-K concept-similarity relationship. Optionally, edges can have integer edge weights (multiplicities) that increase with similarity score, where W_{rs} is a function of the similarity score SS_{rs} between vertices r and s that share at least one concept; see Equation 6.

W_{rs} = \max(\lceil SS_{rs} - 1 \rceil, 1)    (6)

4.5 Step 5: Cluster Network

[Figure 6: Clustering of the network in Figure 5.]

We cluster the network produced in Step 4 using the leaf clusters produced by the hierarchical blockmodel clustering described in Section 3.1.2. Generally, clustering partitions content items so that items in the same set share many concepts while items in different sets share few concepts. If insufficient statistical evidence exists to support clustering, clustering will fail and the workflow terminates; go to Step 7.

4.6 Step 6: Repeat: Cluster Clusters

If clustering succeeds, repeat the workflow from Step 1 to cluster the clusters and attempt to generate the next level in the hierarchical clustering. In this next iteration, each cluster becomes a content item containing the multiset of concept tokens of its members. This next iteration may fail to produce statistically significant clusters, in which case hierarchical clustering ends at the last successful clustering.

4.7 Step 7: Label Clusters

Once clustering terminates, we automatically label clusters by the most statistically enriched concept tokens in each cluster, using a chi-squared test on the contingency table shown in Table 1. That is, we label clusters by the concepts they have more of in comparison to other clusters at the same level. The test is performed top-down, where clusters are only compared to other clusters with the same parent.

                 In Cluster    Not in Cluster
Has Concept      v             x
Hasn't Concept   y             v

Table 1: Contingency table used in the chi-squared test.
5. EXPERIMENT ON OA PUBLICATIONS

5.1 Corpus Compilation and Characteristics

[Figure 7: Publication records by source.]

To create a representative corpus of all OA publications,


we compiled the complete contents of the three largest public
directories of open access academic publications as of March
2014: arXiv, DOAJ, and PMC OA. See Figure 7 for relative
proportions of publications by their source directories.
The Directory of Open Access Journals (DOAJ)
aims to index all open access peer-reviewed journals in all
disciplines worldwide. To date, it has indexed 9,713 journals
and 1,502,700 articles from 133 countries [16]. URL: doaj.org
arXiv (pronounced "archive"): Since 1991, arXiv has hosted publications in physics, mathematics, nonlinear analysis, computer science, and quantitative biology. It consists of a repository of 929,442 open access academic preprints (author-published); many of these preprints have been published in other journals. URL: arxiv.org
PubMed Central Open Access (PMC OA) makes
available a subset of 774,993 biomedical publications in PubMed
Central that were released under permissive licenses. URL:
ncbi.nlm.nih.gov/pmc/tools/openftlist
From this collection, we standardized the record format,
de-duplicated records, and removed invalid entries to produce 2,737,377 records of 3,916,783 unique authors and 70
languages. From this, we selected a random subset of 1,502,473
documents with sufficient English-recognizable text using
spellchecking software for further processing.

5.2 Expert Librarian Annotations
For validation and for an understanding of the corpus composition, we expertly annotated a random sample of 498 publications using LCC two-letter categorizations and domain-specific librarian subject tags. Our expert reviewer coauthors are: Jose Diaz, Curator for Special Collections; Daniel Dotson, Associate Professor and Mathematical Sciences Librarian, who tagged using LCSH; and Stephanie Schulte, Associate Professor, Health Sciences Library, who tagged using MeSH.
We found a highly imbalanced distribution of publications in LCC categories: 90% of our sample are in 3 of 21 possible top-level LCC categories: Science (56%), Medicine (25%), or Technology (8%). Also, the top 4 second-level LCC categories account for 50% of all documents: Physics (22%), Math (11%), Internal Medicine (10%), and Astronomy (7%). Note that LCC does not have a high-level category for relatively modern fields like computer science or for subdisciplines of biomedicine or physics. We provide a complete table of LCC categorization labels for our sample online (http://goo.gl/FX5RvM).

5.3 NLP Concept Tagging

When no particular method is necessarily the best, an effective approach can be to combine results from multiple independently designed and implemented methods [13]. Here, we apply this approach to tag plain text with concept tokens using two independent natural language processing (NLP) services, AlchemyAPI (http://www.alchemyapi.com) and Aylien (http://aylien.com). By combining results, we increased our confidence in tokens produced by both services and, given that the services operate independently, mitigated poor performance of one service with results from the other. As input to both services, we formatted the document title, abstract, and any author-provided keywords into HTML strings that annotated the document title. From AlchemyAPI, we used the Entity Extraction and Concept Tagging services; from Aylien, we used only the Concept Extraction service. Entity extraction is the context- and grammar-sensitive automated detection of important proper nouns like the names of people, places, and things. Concept extraction (tagging) associates the inferred meaning of text with terms in ontological databases, for example, DBpedia [3]. Concept extraction is also capable of identifying concepts that are not explicitly referenced in the text.
Our NLP tagging services produced two lists of scored tokens, one from Aylien and one from AlchemyAPI. We standardized raw token strings, stemmed words, and filtered low-quality or irrelevant results using relevance thresholds. We then assigned a quality score between 0 and 1 based on service-specific attributes like relevance or confidence.
To combine the lists of standardized, scored tokens, we first boosted the best score of tokens returned by both services as in Equation 7. We then sorted tokens by decreasing score (highest score first) and chose up to the first 30 tokens.

MaxScore = \max(Score_{Aylien}, Score_{AlchemyAPI})
BoostedScore = MaxScore + \frac{1 - MaxScore}{2}    (7)
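A small sketch of this boost-and-merge (Equation 7) on toy service outputs follows; the token scores are illustrative values already standardized to [0, 1].

```python
# Sketch of combining the two NLP services' token scores (Equation 7).
# Token scores below are toy values in [0, 1] after per-service standardization.
def merge_tokens(aylien, alchemy, max_tokens=30):
    merged = {}
    for token in set(aylien) | set(alchemy):
        scores = [s for s in (aylien.get(token), alchemy.get(token)) if s is not None]
        max_score = max(scores)
        if len(scores) == 2:                      # returned by both services: boost
            max_score = max_score + (1 - max_score) / 2
        merged[token] = max_score
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:max_tokens]

aylien = {"black hole": 0.8, "accretion disk": 0.6}
alchemy = {"black hole": 0.7, "general relativity": 0.5}
print(merge_tokens(aylien, alchemy))
# "black hole" is boosted to 0.9 because both services returned it.
```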

5.4 Concept Token Validation

In our 1.5M document abstract corpus, we found 556k total unique concepts and 195k concept tokens that appeared at least twice, with an average of 12.5 tokens per document; see Tables 2 and 3.

Occurrences   Count
1             360712
2-9           128210
10-49         40017
50-199        16206
200-499       5710
500-1999      3585
>= 2000       1298

Table 2: Concept Token Frequencies

Min   Mean   Median   Std Dev   Max
1     12.5   12       4.9       36

Table 3: Concept Token Counts per Document

We validated the quality of our NLP concept token procedure using a manual review of results for 300 random records (http://goo.gl/vCKF6h). We reviewed both the quality of individual concept tokens and the overall quality of representation of the concept token set. Overall, we found that individual concept token errors were infrequent (3.4%; 2.8% to 4.0% with 95% confidence by binomial test).
Remarkably, we observed no document-level errors due to misleading token sets of sufficient size in our observed sample; the corpus document error rate is less than 1% with 95% confidence by a one-sided binomial test. The document-level "Good" rate, or the rate at which the human reviewer felt the results were not merely relevant but comparable with human annotation, was 89.3% (85.3% to 92.6% with 95% confidence by binomial test).
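Intervals of this kind can be reproduced with an exact (Clopper-Pearson) binomial interval. The sketch below uses scipy (assuming scipy >= 1.7; the paper does not name its statistical tool) with illustrative counts.

```python
# Sketch: 95% confidence interval for a reviewed error rate by exact binomial
# interval (scipy is our choice of implementation; counts are illustrative).
from scipy.stats import binomtest

errors, reviewed = 128, 3750     # illustrative counts of erroneous tokens reviewed
result = binomtest(errors, reviewed)
ci = result.proportion_ci(confidence_level=0.95, method="exact")
print(f"error rate {errors/reviewed:.1%}, 95% CI [{ci.low:.1%}, {ci.high:.1%}]")
```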

5.5 Specificity Parameter Selection and Top-K Validation

We validated the top-K most similar document results by a manual review of 100 randomly selected documents and their corresponding top K = 5 most similar documents. Included in this review was a justification for our selection of the user parameter B, which controls the smoothness of the specificity score function. We produced results for three different values of B: 1, 10, and ∞. Overall, we found that errors were rare for all three values of B (best: 0.8%; 0.2% to 2.0% with 95% confidence by binomial test), and that B = 10 had the best performance. Complete manual review results can be found online (http://goo.gl/7Gh4XD).
Scoring Rubric and Method: For each of 100 random documents in our sample, we reviewed the top K = 5 most similar results for B ∈ {1, 10, ∞} (15 total results reviewed per target document). For each top K = 5 set for each value of B, we recorded the number of errors (0 to 5) and whether that value of B produced the best results. We allowed duplicate "best" set scores; frequently, two or three sets were either the same or indistinguishably good.

Manual Review Results: B = 10 performed significantly better (90% confidence, binomial test) than both B = 1 and B = ∞ in "best" set scoring, and also better in error frequency, though not significantly (see Tables 4 and 5; * indicates a significant result with 90% confidence per two-sided binomial test). On inspecting errors, in general, the deleterious effects of erroneous concept tokens tended to average out with other tokens. Instead, similarity-level errors seemed to have two primary causes: 1) a single rare but tangentially related or unrelated concept was given more weight than the sum of evidence from other shared concepts; or 2) key specific concepts were not given more weight than the sum of many other common and possibly less important shared concepts. This was especially aggravated when a target document's concept list was short (6 or fewer concepts). These two failure cases tended to correspond to the extremes of B = 1, where rare tokens have the highest relative specificity score, and B = ∞, where the token specificity score is constant and not related to token frequency. B = 10 blended these two approaches by weighting rare tokens more than common tokens, but at a less extreme scale than B = 1. Thus, B = 10 mitigated these two failure cases and tended to produce superior results.

B     # Best
1     75/100
10    89/100*
∞     64/100

Table 4: Number of "Best" List Designations

B     # Errors
1     7/500
10    4/500
∞     8/500

Table 5: Number of Similarity Errors

5.6 Clustering Computation

We generated an optimal, hierarchical categorization using our method. To produce our first level of clusters, from our 1,502,473 records we removed records with < 5 tokens and limited token set sizes to at most the 30 highest-scored tokens. Per our parameter selection procedure, we used B = 10 to compute document similarity scores and Equation 6 to compute edge multiplicities with a ceiling multiplicity of 6. We set \alpha = 2.5 and K = 15 based on our computational capacity. In total, we fitted a network of 1,459,110 documents and 18,041,069 similarity edges with an average edge multiplicity of 1.6 (see Table 6).
The most expensive computation was the first-level clustering of 1.5 million documents, which took 59 hours on an m2.4xlarge Amazon EC2 instance, while the other computations took comparatively less time. This first, most expensive clustering produced 84,190 leaf clusters and 37 levels, with a resulting minimum description length of 5.96 nats per multiedge. Of these leaves, we filtered all 2,423 leaves of size 1 or 2, containing 43,643 documents, as outliers. To justify our use of similarity-weighted multiedges, we clustered the first level of all documents again with all edge multiplicities set to 1, resulting in a less optimal minimum description length of 4.91 nats per edge. At higher levels of clustering, we did not use multiedges, setting all edge multiplicities to 1.
Multiedge   1       2       3      4      5      6
Percent     52.7%   37.1%   8.3%   1.5%   0.3%   0.1%

Table 6: Document-document multiedge frequencies used in first-level clustering.

6. RESULTS

6.1 Summary of Categorization Results

Our method produced a meaningfully labeled categorization scheme of 1.5 million documents that best describes
our corpus in a balanced way, which we present online at
http://goo.gl/vMiWU0. Clustering produced four hierarchical levels with roughly evenly-balanced clusters at every level as shown in Table 7. Per our corpus composition in which physics and biomedicine were over-represented,
of nine top level categories, our method produced three
biology-themed categories and categorized public health topics in a social sciences cluster. Likewise, physics was divided
into condensed matter and other physics top level categories. Of note is the significant difference in shared concept
compositions that clustering found between genomic-based
(Category 3) and non-genomic (Category 0) biomedicine and
related topics, a distinction not recognized in traditional
subject categorizations like LCC. Likewise, clustering successfully separated mathematics and computer science into
separate top-level categories, a distinction commonly held
today but not made in LCC. We list all 9 top level categories
with our summary labels and percentage of total documents
below.
Documents per cluster
Level   N       Min      Max      Mean     Median   SD
1       81767   3*       75       17.8     17       7.4
2       3549    28       1442     410.3    381      186
3       160     2894     20342    9100     8354     3622
4       9       109326   231619   161778   162586   39845

Children per cluster
Level   N       Min   Max   Mean   Median   SD
2       3549    1     77    23     22       10
3       160     7     56    22.2   21       8.1
4       9       10    30    17.8   17       6.7

Table 7: Documents and children per cluster per hierarchical level. *Sizes 1 and 2 excluded.
9 Top Level Categories (level 4)
0. Biology: Non-Genomic Biomedicine, Surgery (9.7%)
1. Physics: Particle, Theory, Relativity, High Energy (12.2%)
2. Physics: Condensed matter (13.9%)
3. Biology: Genomics, Cells, Molecular Biology (12.2%)
4. Biology: Microbiology, Environmental, Agriculture (15.9%)
5. Social Sciences, Public Health (12.2%)
6. Astrophysics (7.5%)
7. Computer Science (8.1%)

8. Mathematics (9.4%)
Lower-level Categories: As categorization progresses to lower levels, categories become more specific within their parent category, as intended; see below for example Level 3 categories. However, while every lower-level category has a unique parent category, sub-categories with different parents may have many concepts in common. For example, consider the two sub-categories for cancer tumors:
0.10. Cancer, Oncology, Neoplasm, Anatomical pathology, Tumor, Carcinoma
3.36. Cancer, Gene expression, Breast cancer, Carcinogenesis, Oncology, Tumor
While both categories are about cancer tumors, subcategory 0.10 is under non-genomic biomedicine and approaches cancer primarily from an anatomical perspective, while subcategory 3.36 is under the genomic biomedicine cluster and approaches cancer from a genomic perspective. We consistently observe this type of categorization in our results, where documents are categorized by high-level approach, secondary approach, specific approach, and so on, rather than by necessarily hard boundaries in topical matter.
6 (of 18) Example Sub-Categories with Automatic Labels for Biology: Non-Genomic Biomedicine, Surgery:
Top level category 0: Cancer, Surgery, Heart, Myocardial infarction, Atherosclerosis, Blood
1. 0.0. Rheumatoid arthritis, Immune system, Rheumatology, Arthritis, Systemic lupus erythematosus, Inflammation
2. 0.1. Teeth, Dentistry, Maxilla, Molar (tooth), Periodontitis, Statistics
3. 0.10. Cancer, Oncology, Neoplasm, Anatomical pathology, Tumor, Carcinoma
4. 0.13. Cancer, Metastasis, Oncology, Radiation therapy, Chemotherapy, Carcinoma
5. 0.25. Medical imaging, Magnetic resonance imaging, Brain, Nuclear magnetic resonance, X-ray computed tomography, Cerebrospinal fluid
6. 0.34. Heart, Cardiology, Blood, Echocardiography, Heart failure, Myocardial infarction

6.2 Top-Down Validation by Expert Label Similarity

To quantify our top-down observations of topical clustering quality, we compared similarity in expert librarian labeling with our clusters using our 498 annotated sample documents (http://goo.gl/gZuWuT). Because different librarian categorization schemes (LCC, MeSH, LCSH) are not directly comparable, we constructed tf-idf feature vectors of standardized keywords produced from the expert labels for each document. We then computed the mean cosine similarity ratio between documents in and not in the same cluster, for our clustering and for 100k random leaf assignments, as shown in Table 8. We found that documents within the same category were much more similar in expert labeling than documents in different categories (p < 10^-5 by permutation test). Further, we found that documents in the same, more specific Level 3 subcategories were more similar to each other than documents in the same, more general Level 4 categories. Due to our limited sample size of 498, we could not compute in-out cluster similarity ratios for subcategories at Level 2 (3549 clusters) and Level 1 (81767 clusters).

                   Experiment   100k Permutes Mean   100k Permutes Max
Level 4 (top 9)    4.87         1.00                 1.19
Level 3 (160)      6.35         1.00                 1.56

Table 8: In-out cluster expert label similarity ratios.
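A condensed sketch of this in/out-cluster similarity ratio and permutation test follows; the vectors and cluster assignments are toy stand-ins for the tf-idf features and leaf assignments described above.

```python
# Sketch of the in/out-cluster expert-label similarity ratio and permutation
# test. Vectors and cluster assignments are toy data, not the paper's.
import numpy as np

def in_out_ratio(vectors, clusters):
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T                                 # pairwise cosine similarities
    same = np.equal.outer(clusters, clusters)
    off_diag = ~np.eye(len(clusters), dtype=bool)
    return sims[same & off_diag].mean() / sims[~same].mean()

rng = np.random.default_rng(0)
vectors = rng.random((60, 20))                      # stand-in tf-idf vectors
clusters = np.repeat(np.arange(3), 20)              # our cluster assignment
observed = in_out_ratio(vectors, clusters)

# Permutation test: the same ratio under random cluster assignments.
perms = np.array([in_out_ratio(vectors, rng.permutation(clusters)) for _ in range(1000)])
p_value = (perms >= observed).mean()
print(observed, perms.mean(), perms.max(), p_value)
```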

6.3 Bottom-up Validation Manual Review

A limitation of the top-down review of results discussed above is that it does not directly validate the quality of bottom-level leaf clustering. To address this, we manually reviewed the clustering quality of 45 randomly selected leaf (Level 1) clusters containing 971 documents. We provide our table of results (http://goo.gl/nz5MLZ) and the reviewed sample (http://goo.gl/wnw25W) online. Of the 45 leaves and 971 documents, we judged that 2 leaves (4.4%) and 68 documents (7%; 5.4% to 8.8% with 95% confidence) were unsatisfactorily clustered, respectively.

7. CONCLUSION

In this paper, we considered the problem of producing a balanced, intelligible, hierarchical categorization from a large corpus of content using stochastic blockmodeling. In particular, we categorized a novel corpus of 1.5 million documents representing all open access academic publications on the web, and we found that our results both conformed to our expectations about the corpus and produced novel insights about how ideas are presented in OA academic publishing today.
One weakness of our method is that it strongly depends on the concept token production method to produce meaningful, descriptive token sets. A second weakness is that category membership is discrete rather than probabilistic; how to handle multi-category membership is a topic of future work. Also of interest is the application of our method to other types and corpuses of content, and other extensions like the online cluster assignment of new content items not included in the original categorization. In conclusion, we produced a useful categorization scheme of a large corpus of academic works, and we presented how the method by which we did so can be applied to big content generally.

8. REFERENCES

[1] C. Aggarwal and C. Zhai. A survey of text clustering algorithms. Mining Text Data, 2012.
[2] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. Proceedings of the seventh ACM SIGKDD, 2001.
[3] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154-165, Sept. 2009.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
[5] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, Oct. 2008.
[6] S. Fortunato. Community detection in graphs. Physics Reports, 2010.
[7] C. A. R. Hoare. Algorithm 65: find. Communications of the ACM, 4(7):321-322, July 1961.
[8] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50-57, New York, NY, USA, 1999. ACM.
[9] Y. Jing and S. Baluja. Pagerank for product image search. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, page 307, New York, New York, USA, Apr. 2008. ACM Press.
[10] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, Jan. 2011.
[11] Q. V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, May 2014.
[12] C. E. Lipscomb. Medical subject headings (MeSH). Bulletin of the Medical Library Association, 88(3):265, 2000.
[13] D. Marbach, J. C. Costello, R. Küffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, M. Kellis, J. J. Collins, and G. Stolovitzky. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796-804, Aug. 2012.
[14] Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. Proceedings of the 17th International Conference on World Wide Web, 2008.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.
[16] H. Morrison. Directory of Open Access Journals (DOAJ), Jan. 2008.
[17] D. Newman, E. V. Bonilla, and W. Buntine. Improving Topic Coherence with Regularized Topic Models. In Advances in Neural Information Processing Systems, pages 496-504, 2011.
[18] M. E. J. Newman. Communities, modules and large-scale structure in networks. Nature Physics, 8(1):25-31, Dec. 2011.
[19] T. Peixoto. Parsimonious module inference in large networks. Physical Review Letters, 110(16):169905, Apr. 2013.
[20] T. P. Peixoto. Entropy of stochastic blockmodel ensembles. Physical Review E, 85(5):056122, May 2012.
[21] T. P. Peixoto. Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Physical Review E, 89(1):012804, Jan. 2014.
[22] T. P. Peixoto. Hierarchical Block Structures and High-Resolution Model Selection in Large Networks. Physical Review X, 4(1):011047, Mar. 2014.
[23] M. Rosvall and C. T. Bergstrom. Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLoS ONE, 6(4):e18209, Jan. 2011.
[24] Y. Ruan, D. Fuhry, and S. Parthasarathy. Efficient community detection in large networks using content and links. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 1089-1098, May 2013.
[25] M. N. Schmidt and M. Morup. Nonparametric Bayesian modeling of complex networks: an introduction. IEEE Signal Processing Magazine, 30(3):110-128, May 2013.
