
Scientometrics

DOI 10.1007/s11192-017-2286-1

Improving the co-word analysis method based on semantic distance

Jia Feng1 · Yun Qiu Zhang1 · Hao Zhang1

Received: 24 July 2016

© Akadémiai Kiadó, Budapest, Hungary 2017

Abstract We propose an improvement over the co-word analysis method based on
semantic distance. It combines semantic distance measurements with concept matrices
generated from ontology-based concept mapping. Our study suggests that the co-word
analysis method based on semantic distance produces a preferable research situation in
terms of matrix dimensions and clustering results. Despite this method's displayed
advantages, it has two limitations: first, it is highly dependent on domain ontology; second,
its efficiency and accuracy during the concept mapping process merit further study. Our
method optimizes co-word matrix conditions in two aspects. First, by applying concept
mapping to the labels of the co-word matrix, it combines words at the concept level to
reduce matrix dimensions and create a concept matrix that contains more content. Second,
it integrates the logical relationships and concept connotations among studied concepts into
a co-word matrix and calculates the semantic distance between concepts based on domain
ontology to create the semantic matrix.

Keywords Co-word analysis · Semantic distance · Concept mapping · Semantic matrices

Introduction

Co-word analysis is a content analysis method combining bibliometrics and text mining
technology to reveal the deep meaning of documents. Callon et al. (1983) proposed this
method based on Bruno Latour's actor-network theory. In order to comprehend the
development of a domain, Callon focused his research on the clustering and hierarchical
structure of keywords and took the evolution process into account by adding a time factor.
In Callon's research, each word was assigned to an appropriate cluster with a special
context to confirm its meaning. Ira Monarch (2000) produced a deep study of the history of

Corresponding author: Yun Qiu Zhang, yunqiu@jlu.edu.cn

1 Public Health School, Jilin University, Changchun 130021, China


co-word analysis and described it as a method for calculating the strength of the link
between representative terms within a domain to deduce the domain's future development
directions or trends.
Theoretical study of the co-word analysis method is at an advanced stage. Many
scholars worldwide have made an effort to optimize it by improving its velocity and effect
at each critical stage. From these studies, three general ways for improving the method can
be identified. First, it can be improved by calculating similarity among co-word matrices.
For example, Leydesdorff and Vaughan (2006) improved the algorithm for calculating
similarity and compared the differences between similar and dissimilar matrices. Sasson
et al. (2015) improved the co-word analysis effect by optimizing the algorithm for cal-
culating similarity. Second, word weights can be used to make the co-word analysis
method work better. For example, Wei-jin (2009) proposed a co-word analysis method
based on subject term weights; Ying and Lei (2011) optimized the results of the co-word
analysis method based on Zhong's adhesive force concept; and Gang and Tie (2011) and
Yan-rong and Yang (2011) developed a co-word analysis method using keyword weights.
The third method for improving results combines the citation analysis method with the co-
word analysis method. For example, Braam et al. (1991a, b) combined these two methods,
and Li-ying et al. (2015) integrated bibliographic coupling and the co-word analysis
method. In addition to these three approaches, many scholars have conducted research on
optimizing this method from other perspectives, such as statistics. Muller and Mancuso
(2008) created a better co-word statistical method and found that, when applying the co-
word analysis method to data classification, random sampling works better than Poisson
binomial distribution. Moreover, Fu-hai et al. (2011) developed the triad co-word analysis
method, and Pei et al. (2014) optimized the E index number of co-word net.
From the existing co-word analysis method studies we observe that research aimed at
improving the co-word analysis effect has primarily focused on matrix similarity calcu-
lations, term frequency weights, threshold optimization, and combining with the citation
analysis method. However, these studies do not consider the conceptual and logical rela-
tionships between words and knowledge; they remain at the level of grammar and have not
reached the level of semantics.
As a consequence, how to view the internal relationship and logical link among words on
the knowledge level has become an important question in the co-word analysis field. The
relationship among words at the knowledge level can be expressed through the ontology of a
domain, and the closeness among words can be measured by semantic distance. Semantic
distance is a method for calculating conceptual semantic similarities that has been widely
used in the text-mining field, with semantic distance calculations based on domain ontology.
Domain ontology includes both domain concepts and their hierarchical relationships. The
foundation for semantic distance calculations is the semantic similarity between two
different concepts; in other words, there is a linking path between them in the ontology
network. Semantic distance
based on domain ontology can represent the link strength among concepts on the knowledge
level. To study and discuss the questions above, this paper aims to apply concept mapping
based on domain ontology and semantic distance to co-word networks.

The necessity of integrating semantic distance in co-word analysis

Co-word analysis is used to explore a research topic through the co-occurrence of words.
Co-word analysis is based on two assumptions: first, that the words used in an article are
carefully selected by its author and accurately reflect the article's meaning; second, that the


co-occurrence of two words in different articles indicates a correlation (i.e., the more co-
occurrence, the higher the correlation). Based on this logical hypothesis, the co-word
analysis method can be used to explore a field of research and domain structure. However,
the co-occurrence of words can also have an internal knowledge relationship and logical
link. The co-word analysis method neglects this internal knowledge relationship and
logical link, analyzing research topics only through the co-occurrence relationship.
Therefore, it is necessary to introduce the internal knowledge relationship and logical link
between words into the co-word network.
The internal knowledge relationship and logical link between words can be expressed
by domain ontology. Domain ontology contains the conceptual structure, and the concepts
are organized into a hierarchical structure. The basis for calculating semantic distance over
domain ontology is the fact that two concepts have a certain semantic relevance; that is,
there is at least one path between the two concepts in the ontology network. The conceptual
affinity between terms (in other words, the relative position of terms in the ontology)
can be measured by semantic distance. Therefore, we can use ontological concept mapping
to convert the words in the co-word network into ontological concepts and measure their
conceptual affinity by the relative position of the terms in the ontology. Thus, the semantic
distance based on domain ontology can represent the internal knowledge relationship and
logical link between words.
Co-word analysis combined with the semantic distance between words can improve the
semantic relevance of words in the co-word network, increasing the accuracy of the results
of the co-word analysis.

Methodology

The major steps in co-word analysis include counting the frequency of word co-occur-
rence, choosing target words, creating a similarity matrix, clustering, and identifying the
meaning of each cluster. To conduct co-word analysis on a semantic level, this research
comprised five steps. First, the data were gathered. Second, concept mapping was
applied to target fields (abstracts, keywords, titles, etc.) based on domain ontology, and the
frequency of concepts was counted. Third, high-frequency concepts were chosen, and from
these, semantic distances based on domain ontology were calculated and a distance matrix
was created. Fourth, the concept matrix and the distance matrix were combined to create a
new semantic matrix. Last, cluster analysis was performed and the results were
interpreted. The entire process is shown in Fig. 1.

Measurement of the co-occurrence relationship

Co-word analysis is a method that counts the frequency of word co-occurrence in a
document and clusters these words based on co-occurrence to reveal the closeness between
them. This analysis usually suggests that the greater the frequency of words' co-occurrence
within a document, the closer they are. Thus, when counting within a set of documents, the
frequency of word co-occurrence within the same document can create a co-word
network generated by all co-words.
During this process, the relationship among words is easily influenced by the frequency
of co-words, and the interdependency among words is highly sensitive to the frequency of
co-words.

Fig. 1 Technology roadmap

In a co-word network, word forms are usually not normative, for instance: single
and plural forms, full names and abbreviations, capital and small letters, translated words,
punctuation differences, and acronyms. In addition, allographic synonyms are a huge
problem influencing the accuracy of co-word networks. To solve the above problems and
improve the accuracy of co-word networks, this article applies concept mapping based on
domain ontology to the free-text words in co-word networks. Concept mapping based on
ontology has been widely studied, is fully developed, and has a clear effect. Many studies
have shown that concept mapping can improve the accuracy of text mining. Applying
concept mapping to words in co-word networks can, on one hand, standardize the words
used in a network; on the other hand, it can improve the accuracy of the relationships
among these words.
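As an illustrative sketch of this standardization step (in Python, with a hypothetical variant-to-concept table; the real mappings in this study come from UMLS via MetaMap, not from a hand-built dictionary):

```python
# Hypothetical variant-to-concept table standing in for an ontology lookup.
# In practice these mappings would come from a domain ontology such as UMLS.
CONCEPT_MAP = {
    "e. coli": "Escherichia coli",
    "escherichia coli": "Escherichia coli",
    "salmonella": "Salmonella",
    "salmonellae": "Salmonella",
}

def map_to_concept(word):
    """Normalize case and whitespace, then look the word up in the concept map.

    Unmapped words are returned unchanged (a simplifying assumption; a real
    mapper would also handle plurals, abbreviations, and punctuation).
    """
    key = word.strip().lower()
    return CONCEPT_MAP.get(key, word)

print(map_to_concept("Salmonellae"))  # Salmonella
print(map_to_concept("E. coli"))     # Escherichia coli
```

Applying such a mapping before counting merges word variants into one concept label, which is what reduces the matrix dimensions described later in the paper.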

The calculation of semantic distance

The initial relationship between two words can be measured by semantic textual
similarity. Many existing semantic similarity algorithms are based on either ontology
dictionaries or a large-scale corpus. The former are characterized by low algorithmic
complexity and ease of calculation; however, the accuracy of their results needs
improvement. The latter, due to the high complexity of their algorithms, often face
challenges with polysemy and noisy information. We chose the method based on ontology
dictionaries to measure the semantic distance.
Semantic distance is the sum of the edge weights along the shortest path between two
concepts in the ontology hierarchy tree. It represents the degree of similarity among
concepts as a geometric measure. Semantic distance is the most basic way to measure the
similarity between two concepts, and it normally influences conceptual similarity much
more than any other factor.


Semantic distance is generally based on a semantic dictionary. A semantic dictionary is
the organization of concepts through an arborescence or network hierarchical structure, and
it generally takes the form of an ontology or a thesaurus. An ontology, for example,
calculates two concepts' similarity based on the distance between them on the ontology
tree. Semantic distance is mainly related to the length and depth of two concepts on an
ontology tree. Rada et al. (1989) proposed the algorithm based on semantic distance: the
ontology is considered a semantic network composed of concepts, whose similarity is
calculated using the shortest distance between concept nodes. Leacock and Chodorow
(1998a, b) improved Rada's method, which is simple in theory, low in algorithmic
complexity, relatively simple in computation, and easy to implement. Classic algorithms of
this kind include the Leacock and Chodorow algorithm, the weighted links algorithm, the
Wu and Palmer algorithm, the Lin algorithm, and Liuqun's semantic similarity algorithm
based on the China National Knowledge Infrastructure (CNKI).
Leacock and Chodorow's algorithm is a semantic distance calculation method based on
ontology. Compared with non-ontology-based semantic distance calculation methods, such
as Google Distance (Cilibrasi and Vitanyi 2007), which is a corpus-based method, ontology
provides more standardized, focused lexical semantic information and more standardized
word expressions. In this article we use the Leacock and Chodorow algorithm. This
algorithm's core theory is that the similarity of concepts is related to the length and depth
of their ontology hierarchy. The formula is as follows:

    Sim(C1, C2) = -log( len(C1, C2) / (2 × Depth) )

In this formula, len(C1, C2) represents the shortest path between concept words C1 and C2
in the ontology hierarchy tree, and Depth represents the depth of the ontology hierarchy tree.
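A minimal sketch of this formula in Python (using the natural logarithm; implementations of the Leacock and Chodorow measure differ in the log base they use, so the absolute values below are one possible convention):

```python
import math

def leacock_chodorow(path_length, depth):
    """Leacock-Chodorow similarity: -log(len(C1, C2) / (2 * Depth)).

    `path_length` is the shortest path between the two concepts in the
    ontology hierarchy tree; `depth` is the depth of that tree.
    """
    return -math.log(path_length / (2 * depth))

# Closer concepts (shorter path) score higher in a tree of depth 10:
print(leacock_chodorow(1, 10))  # -ln(1/20), about 3.0
print(leacock_chodorow(8, 10))  # -ln(8/20), about 0.92
```

Because the path length is divided by twice the tree depth, the ratio stays below 1 for any valid path, so the similarity is always positive and decreases monotonically with distance.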

The fusion of the co-occurrence relationship and semantic distance

The co-occurrence relationship represents the strength of the content relationship among
words in the co-word network, while semantic distance represents the strength of the
semantic relationship among concepts in the network. Combining these two into one
network should more accurately represent the objective structure and content of a research
theme in a domain.
Before applying this network combination, standardization is needed to remove
dimensional differences and give the two matrices the same weight. The values in each
matrix then lie in [0, 1], and the matrix multiplication method can be used to combine the
two matrices and create the semantic matrix.
The concrete steps are listed below. First, the co-occurrence matrix C is created, as
shown in Fig. 2. Every number in this matrix indicates the number of co-occurrences
between two words.
Second, the Leacock and Chodorow algorithm is applied based on a domain ontology,
and the semantic distance between each pair of words is calculated. The results are shown
in Fig. 3.
Last, the two matrices are combined to create the semantic matrix S, as shown in Fig. 4.
In the matrix multiplication process, matrix S can be seen as a weighted matrix C; the
weight is matrix D.
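These steps can be sketched as follows (in Python; the paper's standardization into [0, 1] is read here as min-max scaling, and its "matrix multiplication" as an element-wise Hadamard product so that D acts as a cell-by-cell weight on C; both readings are assumptions about the intended operations):

```python
def minmax(m):
    """Scale all matrix entries into [0, 1] to remove dimensional differences."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo) for v in row] for row in m]

def semantic_matrix(C, D):
    """Weight co-occurrence matrix C cell by cell with similarity matrix D.

    Both inputs are min-max scaled first so their values lie in [0, 1];
    the result S can then be read as C weighted by D.
    """
    Cn, Dn = minmax(C), minmax(D)
    return [[c * d for c, d in zip(cr, dr)] for cr, dr in zip(Cn, Dn)]

C = [[0.0, 4.0], [4.0, 0.0]]   # toy co-occurrence counts
D = [[0.5, 3.0], [3.0, 0.5]]   # toy semantic similarities
print(semantic_matrix(C, D))   # [[0.0, 1.0], [1.0, 0.0]]
```

A cell of S is large only when the two words both co-occur often and are semantically close, which is exactly the fusion the section describes.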


Fig. 2 Co-occurrence matrix

Fig. 3 Semantic distance matrix

Fig. 4 Semantic matrix S

The empirical study

In recent years, food contamination accidents have happened worldwide, making food
safety a global problem for consumers. This paper takes food safety as an example for
testing the effect of the co-word analysis method based on semantic distance. To collect
comprehensive data, PubMed was used with the MeSH retrieval mode to collect data for
the last 5 years, resulting in 8360 records.
Co-word analysis generally inspects two dimensions: matrix and cluster. Here, the
results of the co-word analysis method based on semantic distance and the normal co-word
analysis method are compared from these two dimensions, and R language is used to
analyze and visualize the data.

Matrix analysis

The size of a matrix usually influences the effect of co-word analysis, and the choice of
threshold is a key question. To make the results clear and even, this paper set the threshold
at 50, which means that words with a frequency higher than 50 were selected. Moreover,
because PubMed is a medical data set, this study used UMLS as the mapping ontology and
applied MetaMap to conduct the text mapping. MetaMap is a program that matches the
concepts of biomedical text with the UMLS Metathesaurus, and it is known for its
linguistic accuracy and its dependence on knowledge sources (the SPECIALIST lexicon).
We used the Web API to access the MetaMap Web service for online concept mapping
against the Metathesaurus. In the process of concept mapping, MetaMap always returns
many concepts for one input term, each with a different score. We chose the highest
scoring concept as the result of matching.
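MetaMap's actual response format is richer than this, but the selection rule used here (keep the highest-scoring candidate) can be illustrated with a simplified, assumed structure (the field names below are hypothetical, not MetaMap's real schema):

```python
def best_candidate(candidates):
    """Pick the highest-scoring concept from a list of candidate mappings.

    `candidates` mimics, in simplified form, the multiple scored concepts
    MetaMap returns for one input term; the dict keys are assumptions.
    """
    return max(candidates, key=lambda c: c["score"])["concept"]

candidates = [
    {"concept": "Food Safety", "score": 1000},
    {"concept": "Food", "score": 861},
    {"concept": "Safety", "score": 827},
]
print(best_candidate(candidates))  # Food Safety
```

Taking only the top-scoring candidate is a deliberate simplification; the paper notes that mapping accuracy under this rule merits further study.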
Figure 5 shows part of the co-word and concept matrices. The concept matrix's words
are more standard in the word-form dimension. In Fig. 5, the diagonal values represent
word frequency. They show that, after mapping, words have higher frequency and their
frequency distribution is more continuous. In the co-occurrence dimension, the co-word
matrix is too sparse: the value 0 covers 87.4% of the matrix, while in the concept matrix,
which has higher frequencies, the value 0 covers only 31.68%.
We applied the Leacock and Chodorow algorithm to calculate the semantic distance
among concepts, using the UMLS::Similarity online system. Before combining the two
matrices, we used z scores as the standardization method. Part of the semantic matrix is
shown in Fig. 6.
To further analyze the co-word network, we chose network centralization and network
centrality as indicators for evaluating the matrix, and the results are shown in Table 1.
Network centralization reflects the tightness of the network content: the higher the network
centralization, the tighter the network's content. In Table 1, we can see that the
semantic matrix's network centralization (3.64%) is higher than the co-word matrix's (2.39%).
Network centrality reflects the degree of equalization in the distribution of the network
nodes. Table 1 shows that the degree of equalization of the semantic network (1.927) is
higher than that of the co-word network (0.882).
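Network centralization can be computed in several ways; the sketch below uses Freeman's degree centralization on a binary undirected network, which may differ from the exact (valued-network) indicator used in the study:

```python
def degree_centralization(adjacency):
    """Freeman degree centralization of an undirected binary network.

    adjacency: square 0/1 matrix (list of lists). Returns 1.0 for a star
    network (one dominant node) and 0.0 when all degrees are equal.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    d_max = max(degrees)
    # Sum of gaps to the most central node, normalized by the star's maximum.
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))

# A 4-node star network has maximal centralization:
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print(degree_centralization(star))  # 1.0
```

Under this reading, the semantic matrix's higher centralization means its content is concentrated around a few strongly connected concepts rather than spread evenly.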

Cluster result analysis

This paper used R to apply cluster analysis to the co-word and semantic matrices and
visualized the results through a heat map.
Figure 7 shows the cluster results, with the darkness of each grid's shading directly
proportional to its value: the higher the value, the darker the shading. The diagonal values
are 1, and the other grids' values lie in [0, 1]. The top and right sides of the figures show
the cluster trees. From the shading we can tell that the semantic matrix is darker than the
co-word matrix, and there are fewer 0s in the semantic matrix. Using the same threshold to
choose the cluster results, there are 16 clusters for the co-word matrix and eight clusters
for the semantic matrix.
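The study cuts a hierarchical cluster tree at a threshold in R; as a simplified, assumed stand-in for that step (single-linkage behaviour via connected components of above-threshold similarities, in Python, with toy data):

```python
def threshold_clusters(S, labels, threshold):
    """Group items whose pairwise similarity exceeds `threshold`.

    A simplified stand-in for the paper's hierarchical clustering: items
    connected by above-threshold similarity end up in one cluster
    (single-linkage behaviour), using union-find to merge groups.
    """
    n = len(labels)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if S[i][j] > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(labels[i])
    return list(clusters.values())

# Toy semantic matrix over four hypothetical terms:
S = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.8],
     [0.1, 0.1, 0.8, 1.0]]
words = ["salmonella", "outbreak", "residue", "pesticide"]
print(threshold_clusters(S, words, 0.5))
# [['salmonella', 'outbreak'], ['residue', 'pesticide']]
```

Raising the threshold splits the network into more, smaller clusters; lowering it merges them, which mirrors how the same cut height yields 16 clusters on the sparser co-word matrix but eight on the denser semantic matrix.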
To analyze the content of each cluster, we used NetDraw software to visualize the
results, seen in Figs. 8 and 9. The core contents of the eight clusters of the semantic matrix
are harmful residues, pathogenic microorganism detection, disease outbreak, China's food
Fig. 5 Matrix effects comparison diagram


Fig. 6 Semantic matrix (partial)

Table 1 Co-occurrence network centrality analysis

Indicators               Co-word matrix    Semantic matrix
Network centralization   2.39%             3.64%
Network centrality       0.882             1.927

Fig. 7 Cluster effect comparison diagram

safety problems, exposure factor, pathogenic bacteria, food processing, and seafood
pollution. The 16 clusters of the co-word matrix also include meaningful content, such as
food safety evaluation, prevalence of heavy metal poisoning, and seafood pollution control,
but there are some clusters whose meaning is hard to discern, such as the "estimate",
"humans", "identification", "including", and "occurrence" clusters on the right side.
Clusters with isolated words cannot clearly suggest research direction and content.
We invited scholars in the field of food safety (Jilin University Professor Ye Lin,
Professor Xie Lin, etc.) to evaluate the results. There are eight clusters in the optimized
co-word matrix, of which five (2, 3, 5, 6, 7) represent the main research hotspots and
directions in this field. The other three clusters (1, 4, 8) have scattered words, making it
difficult to determine their main research directions. The domain experts determined that
the optimized matrix roughly reflects the composition of the research field and is
significantly improved in comparison to the pre-optimized matrix results in the study.


Fig. 8 Co-word matrix topic maps

Fig. 9 Semantic matrix topic maps

In summary, the results of the semantic matrix have more meaningful content and are
easier to explain.

Discussion and conclusions

Co-word analysis uses co-occurrence frequency to reveal the closeness of words and thus
demonstrate the development directions and trends of a domain. Improving the accuracy of
a co-word matrix is the key to improving the co-word analysis method. The co-word
analysis method based on semantic distance optimizes the co-word matrix in two aspects.


First, it applies concept mapping to co-word matrix labels, which combines the words at
the semantic level to reduce matrix dimensions, allowing a concept matrix to contain more
content. Second, it integrates the logical relationships and connotations among concepts
into a co-word matrix and calculates the semantic distances among concepts based on
domain ontology to create the semantic matrix.
This empirical study compared the effects of the co-word analysis method based on
semantic distance and the traditional co-word analysis method on matrix and cluster
dimensions. With regard to the matrix dimension, the word form in the concept matrix is
more standard, the frequency distribution is more even and continuous, and sparsity of the
text is obviously improved. With regard to the cluster dimension, the co-word analysis
method based on semantic distance better reveals the subject of documents and contains
more content.
This paper proposes a co-word analysis method based on semantic distance and
demonstrates its advantages in both methodology and empirical study, but the method still
has two limitations. First, it is highly dependent on domain ontology. Second, mapping
efficiency and accuracy during the concept mapping process require further study. In the
future, we will explore new, deeper, and multi-dimensional ways to combine co-occurrence
and distance matrices.

Authors' contribution JF proposed the research idea, planned and designed the outline, carried out the data
collection and data analysis, and wrote the first draft. YQZ revised the plan and outline, joined discussion of
the findings, and contributed to writing the paper and revising it after review. HZ joined discussion of the
findings and contributed to writing the paper.

References
Braam, R. R., Moed, H. F., & Van Raan, A. F. (1991a). Mapping of science by combined co-citation and
word analysis I. Structural aspects. Journal of the American Society for Information Science, 42(4),
233.
Braam, R. R., Moed, H. F., & Van Raan, A. F. (1991b). Mapping of science by combined co-citation and
word analysis II. Dynamical aspects. Journal of the American Society for Information Science, 42(4),
252.
Callon, M., Courtial, J. P., Turner, W. A., & Bauin, S. (1983). From translations to problematic networks:
An introduction to co-word analysis. Social Science Information, 22(2), 191–235.
Cilibrasi, R. L., & Vitanyi, P. M. (2007). The Google similarity distance. IEEE Transactions on Knowledge
and Data Engineering, 19, 370–383.
Fu-hai, L., Lin, W., & Yong, L. (2011). Ternary co-words analysis based on literature keywords: A case
study in knowledge discovery. Journal of the China Society for Scientific and Technical Information,
30(10), 1072–1077.
Gang, L., & Tie, L. (2011). A new method for weighted co-word analysis based on keywords. Information
Science, 29(3), 321–324.
Leacock, C., & Chodorow, M. (1998a). Combining local context and WordNet similarity for word sense
identification. WordNet: An Electronic Lexical Database, 49(2), 265–283.
Leacock, C., & Chodorow, M. (1998b). Combining local context and WordNet similarity for word sense
identification. In WordNet: An Electronic Lexical Database (pp. 265–283). Cambridge: MIT Press.
Leydesdorff, L., & Vaughan, L. (2006). Co-occurrence matrices and their applications in information
science: Extending ACA to the Web environment. Journal of the American Society for Information
Science and Technology, 57(12), 1616–1628.
Li-ying, Z., Fu-hai, L., & Wen-ge, Z. (2015). Research on the improvement of the co-word analysis method
of citation coupling: A case study on the topic of agricultural science research in ESI. Information
Studies: Theory & Application, 38(11), 120–125.
Monarch, I. (2000). Information science and information systems: Converging or diverging. In 28th Annual
Conference, Canadian Association for Information Science, CAIS.
Muller, H., & Mancuso, F. (2008). Identification and analysis of co-occurrence networks with NetCutter.
PLoS ONE, 3(9), e3178.
Pei, H. A., Jing, Z., & Xiao-yu, Z. (2014). Study on improvement of E index words in network analysis.
Information Studies: Theory & Application, 37(1), 46–50.
Rada, R., Mili, H., Bichnell, E., et al. (1989). Development and application of a metric on semantic nets.
IEEE Transactions on Systems, Man and Cybernetics, 19(1), 17–30.
Sasson, E., Ravid, G., & Pliskin, N. (2015). Improving similarity measures of relatedness proximity: Toward
augmented concept maps. Journal of Informetrics, 9(3), 618–628.
Wei-jin, Z. (2009). Clustered word group in co-word cluster analysis of hot subject terms of tumor therapy.
Chinese Journal of Medical Library and Information Science, 2, 48–53.
Yan-rong, Y., & Yang, Z. (2011). Research on weighted co-word analysis. Information Studies: Theory &
Application, 34(4), 61–63.
Ying, Y., & Lei, C. (2011). Evolution of topics about medical informatics by improved co-word cluster
analysis. New Technology of Library and Information Service, 27(1), 83–87.
Zhong, W., Jia, L., & Xing-jun, Y. (2008). The research of co-word analysis (3): The principle and
characteristics of the co-word cluster analysis. Journal of Information, 27(7), 118–120.
