

Missing Link Prediction in Social Networks

Jin Zhou and Chiman Kwan

Signal Processing, Inc., Rockville, Maryland, USA


ferryzhou@gmail.com, chiman.kwan@signalpro.net

Abstract. This paper summarizes our effort of applying matrix completion techniques to a popular social network problem: link prediction. The results of our matrix completion algorithm are comparable to, or even better than, those of state-of-the-art methods. This indicates that matrix completion is a promising technique for social network problems. In addition, we customized our algorithm and developed a recommender system for Github. The recommender helps users find software tools that match their interests.

Keywords: link prediction and recovery, social networks, recommender system, Github.

1 Introduction

Social networks, such as users' connections on Facebook, Wechat, LinkedIn, etc., can be viewed as graphs whose links/edges can be represented by a binary matrix. For instance, if the matrix is M, then M(i, j) represents the link from node i to node j: M(i, j) is 1 if there is a link from i to j, and 0 otherwise. If the edges are undirected, the matrix is symmetric; otherwise it is not. From the matrix point of view, the link prediction problem is the same as the matrix completion problem, i.e., given some known elements of the matrix, predict the unknown elements. However, this problem is challenging because: 1) the matrix is binary; 2) the links are sparse; and 3) the dimension of the matrix can be very large, e.g., millions by millions.
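To make the representation concrete, the following is a minimal Python sketch (our illustration, with a hypothetical edge list, not part of the paper's implementation) of building such a binary link matrix in sparse form:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical directed edge list: node i links to node j.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
n = 4  # number of nodes

rows, cols = zip(*edges)
# M[i, j] is 1 if there is a link from i to j, and 0 otherwise.
M = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

# For an undirected network, symmetrize the matrix:
# M = M.maximum(M.T)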
Matrix completion is a branch of sparsity-based algorithms [1-4] and has found a wide range of applications [5-9]. The basic idea is to determine the missing elements in a matrix.
We developed an efficient imputation-based matrix completion algorithm that can handle very large scale link prediction problems. Two real-world datasets were used to demonstrate the effectiveness of our algorithm. The first, from [10], involves predicting co-authorship. The second, from Kaggle [11], has millions of nodes and links. On the first dataset, our result is significantly better than the state-of-the-art result; on the second, our result ranks in the top 5 out of 119 teams. In addition to these experiments, we also developed a recommender for Github using the same link prediction algorithm.
The paper is organized as follows. In Section 2, we summarize the link prediction algorithm. Section 3 summarizes two applications of the prediction algorithm to social networks. In Section 4, we describe in detail the development of a recommender system that utilizes the link prediction algorithm. Finally, we conclude the paper in Section 5.

2 Link Prediction Algorithm

We present an efficient and high-performance link prediction algorithm, which is shown below. The input is the mask $\Omega$ of known values and the initial matrix $Y_0$. If the known values are $C$, then $Y_0(\Omega) = C$ and $Y_0(\sim\Omega) = 0$. The output is the final iterate of the predicted matrix $Y$. The algorithm starts with $Y_0$ and performs a set of iterations, each containing three steps: 1) compute a low-rank approximation (factorization) of the matrix $M_i$; 2) take the approximation as the prediction; 3) adjust the prediction with the known values.

Algorithm: IMPUTE
Input: $\Omega$, $Y_0$
Output: $Y_n$
Algorithm
1. $M_1 \leftarrow Y_0$
2. Iterate:
3.   $(U_i, V_i) \leftarrow f(M_i)$   // factorize (low-rank approximation)
4.   $Y_i \leftarrow U_i V_i$
5.   $M_{i+1} \leftarrow Y_i + (Y_0 - Y_i) \circ \Omega$   // adjust with known values

The factorization is done by singular value decomposition (SVD):

$[U, S, V] = \mathrm{svd}(M_i)$,
$U_i = U(:, 1{:}r)$,   (1)
$V_i = S(1{:}r, 1{:}r)\, V(:, 1{:}r)^T$

If the matrix is symmetric, we use

$V_i = S(1{:}r, 1{:}r)\, U(:, 1{:}r)^T$   (2)
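For concreteness, the following is a minimal dense-matrix sketch of the IMPUTE iteration in Python. It is our reconstruction of the steps above (the rank r and the iteration count are free parameters), not the authors' released code:

import numpy as np

def impute(Y0, omega, r=50, n_iter=100):
    # Y0    : initial matrix, known entries filled in and zeros elsewhere
    # omega : boolean mask of the known entries
    # r     : target rank (number of hidden features)
    M = Y0.copy()
    for _ in range(n_iter):
        # Step 1: rank-r factorization via truncated SVD (Eq. 1).
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        Ui = U[:, :r]
        Vi = np.diag(s[:r]) @ Vt[:r, :]
        # Step 2: take the low-rank approximation as the prediction.
        Y = Ui @ Vi
        # Step 3: re-impose the known values on the mask (step 5 above).
        M = Y + (Y0 - Y) * omega
    return Y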

For very large matrices, we cannot explicitly generate $Y_i$ and $M_i$ when applying the SVD. To overcome this issue, we use the lansvd routine from PROPACK [12], a software package for the SVD of large, sparse matrices. The lansvd algorithm uses two function handles to perform the SVD of a sparse matrix:

$yf(M, x) = Mx$ and $yt(M, x) = M^T x$,   (3)

where $M$ is the input matrix and $x$ is a vector. In our algorithm, we design the function handles as follows:

$Mx = U_i(V_i x) + Y_\Omega x$   (4)

$M^T x = V_i^T(U_i^T x) + Y_\Omega^T x$   (5)

where $Y_\Omega = (Y_0 - U_i V_i) \circ \Omega$ is a sparse matrix whenever the number of known elements (links) is small compared to the size of the whole matrix. This is the case for social networks, since the links are always very sparse.
By using function handles, we only need to store $U_i$ and $V_i$ in memory, whose sizes are $m \times r$ and $r \times m$ respectively. Here $m$ is the number of nodes and $r$ is the rank (the number of hidden features). Usually, $r$ is set to less than 100. In this way, it is easy to handle millions of nodes on an ordinary PC. Convergence is relatively fast (fewer than 100 iterations).
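The paper's implementation uses PROPACK's lansvd in MATLAB. As an illustrative analogue only (an assumption on our part, not the authors' code), SciPy's svds accepts a LinearOperator whose matvec/rmatvec play the role of the function handles in Eqs. (4)-(5), so the matrix $M = U_i V_i + Y_\Omega$ never has to be formed explicitly:

import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def make_operator(Ui, Vi, Y_omega):
    # Ui      : m x r dense factor
    # Vi      : r x m dense factor
    # Y_omega : m x m sparse correction (Y0 - Ui @ Vi) restricted to the mask
    m = Ui.shape[0]
    matvec = lambda x: Ui @ (Vi @ x) + Y_omega @ x           # Eq. (4)
    rmatvec = lambda x: Vi.T @ (Ui.T @ x) + Y_omega.T @ x    # Eq. (5)
    return LinearOperator((m, m), matvec=matvec, rmatvec=rmatvec)

# Truncated SVD of the implicit matrix, without forming it densely:
# U, s, Vt = svds(make_operator(Ui, Vi, Y_omega), k=r)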

3 Experiments

3.1 Predicting NIPS coauthorship


In the first example, we used the coauthorship data from the NIPS dataset compiled in [10]. This dataset contains a list of papers and author information from NIPS 1-17. Similar to [13], we took the 234 authors who had published with other people and constructed the coauthorship matrix, which is symmetric. To test the algorithm, we randomly extracted 20% of the matrix elements for testing and used the remaining 80% for learning. In the experiment, we set $r = 50$; a second algorithm parameter was set to 0.05. The results are shown in Table 1, together with the results of two other state-of-the-art methods (LFRM [13] and LFL [14]). Our results are significantly better than both of these state-of-the-art methods in terms of the area under the curve (AUC) metric.
Table 1. AUC scores of different methods on NIPS data.

Method   LFRM     LFL      Ours
AUC      0.9509   0.9424   0.9673

3.2 Predicting links of a large online social network

In the second experiment, we used data from an online challenge and compared our results with others'. The dataset was from the IJCNN 2011 social network challenge; the data were downloaded from an online social network (Flickr). There are 7.2 million contacts/edges originating from about 38k users/nodes, drawn randomly while ensuring a certain level of closeness. The training data contain 7,237,983 contacts/edges, with 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes, so the total number of unique nodes is 1,133,547. The test dataset contains 8,960 edges from 8,960 unique outbound nodes; 4,480 of these are true edges and 4,480 are false. The task is to predict which are true (1) and which are false (0).
Since we do not have the ground truth of the test data, we generated our own test data from the training samples, with the same composition of 4,480 true and 4,480 false edges. The test edges were randomly selected 10 times, and our algorithm was applied to each test set. Our results are shown in Table 2. The mean AUC of our results is 0.9326, which ranks 4th out of 119 teams (see the top AUC scores in Table 3). This is promising, since we did not use tricks such as blending the results of many different approaches, nor did we perform any problem-specific algorithm tuning.

Table 2. AUC scores on the Kaggle data over 10 runs (our results).

Run    1      2      3      4      5      6      7      8      9      10     Mean
AUC    0.934  0.930  0.931  0.936  0.932  0.933  0.930  0.937  0.930  0.933  0.9326

Table 3. Top 5 AUC scores on the leaderboard (the first-place entry is excluded because it used a de-anonymization approach). Our result in Table 2 ranks just below Jeremy.

Team   wcuk     vsh      Jeremy   grec     hans
AUC    0.96954  0.95272  0.94506  0.92712  0.92613

4 Application of Link Prediction to the Development of a Recommender for Github

4.1 Overview
This section summarizes our effort of applying the matrix completion technique of Section 2 to a real-world problem: Github repository recommendation. We developed a commercial software product that has attracted a substantial number of initial users.
Github is a web-based hosting service for software development projects [15], launched in April 2008, and is the most popular open source code repository site [15]. The site's slogan is "Social Coding", as it provides social networking functionality such as feeds, followers, and stars [15]. On January 16, 2013, Github hit 3 million users and about 5 million repositories. Today, Github has 64M repositories and 23M developers worldwide (see https://github.com/).
It is well known that a core feature of any social network site is recommendation. For instance, Facebook recommends friends, Twitter recommends followees, and LinkedIn recommends contacts. Recommendation is also a core feature of e-commerce websites; the most famous example is Amazon, one of the earliest companies to develop a recommendation system for suggesting products to users. Recommendation can bring more traffic and more revenue to a website, and it can also significantly improve the user experience. For instance, Amazon provides services such as "Customers who bought this item also bought" and "Amazon.com recommends". However, as the most popular open source code repository website, Github does not provide a recommendation service. In this performance period, we developed a commercial software product called Github Repository Recommendation, which aims to provide a recommendation service for Github users. Similar to Amazon, this product provides the following services:
1. Coders who like this repo also like ...
2. Signalpro recommends repos for a user based on his or her history
The two features are implemented as two web apps hosted on our company's website [16]. One screenshot of the web app is shown in Fig. 1.

Fig. 1. Screenshot of recommendation service (recommend by repository).


We also developed a Google Chrome extension [17] which injects the recommendations into the Github website.

4.2 Implementation
Data Acquisition. We use the data from Github Archive [18], an open source project which records the public Github timeline. The data can be retrieved via Google BigQuery. The timeline is a series of event logs with 18 event types. We use only the star event for repository recommendation; a star event means that a user is interested in a repository and has bookmarked it.
The query is shown below:
SELECT repository_name, actor, created_at
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent" and created_at >= '2012-04-01 00:00:00'
GROUP BY repository_name, actor, created_at
ORDER BY created_at ASC;
After running the query, we cannot directly download the results because they are too large; we need to save the results as a table and export the table to Google Cloud Storage.
Data Preprocessing. The raw data we collect is a sequence of records, each with three items: username, reponame, and time. Before using the data, we perform the following preprocessing tasks (a minimal sketch follows this list):
1. Index usernames and reponames, i.e., convert each username and reponame to an integer
2. Convert time to an integer
3. Remove duplicate records
4. Reorder/reindex the user and repo indices by star count, in descending order
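The sketch below shows one way to carry out these steps with pandas; the file name star_events.csv and the column names are our assumptions, not the production pipeline:

import pandas as pd

# Records of (username, reponame, time); "star_events.csv" is a hypothetical
# name for the data exported from Google Cloud Storage.
df = pd.read_csv("star_events.csv", names=["user", "repo", "time"])

# Step 3: remove duplicate (user, repo) records.
df = df.drop_duplicates(subset=["user", "repo"])

# Step 2: convert the timestamp to an integer.
df["time"] = pd.to_datetime(df["time"]).astype("int64")

# Steps 1 and 4: map users and repos to integer indices ordered by star
# count, in descending order (value_counts sorts largest first).
repo_order = df["repo"].value_counts().index
user_order = df["user"].value_counts().index
df["repo_id"] = df["repo"].map({r: i for i, r in enumerate(repo_order)})
df["user_id"] = df["user"].map({u: i for i, u in enumerate(user_order)})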

Data Statistics. After preprocessing, we have a large, sparse binary matrix whose columns represent repos and rows represent users. A nonzero value at (i, j) indicates that user i starred repo j. From the star events between 04/01/2012 and 01/10/2013, we obtained 344,120 users, 317,128 repos, and 3,253,207 stars. The total number of matrix elements is about 0.1 billion, and the sparsity level is about 1/33,545.
Table 4 and Table 5 show how many repos/users have a certain number of stars. The tables show the following facts:
1. About half of the users starred only one repo, and about half of the repos have only one starring user.
2. About 5,000 repos have more than 100 stars.
3. About 90% of repos have 10 or fewer stars.
4. About ten thousand repos have more than 50 stars.

Table 4. Number of repos by star count.

Star Count   # of Repos   Percentage
1            162,399      51.2%
<=5          259,335      81.8%
<=10         281,935      88.9%
>10          35,193       11.1%
>50          9,509        3.0%
>100         5,061        1.6%
>1000        254          0.08%

Table 5. Number of users by star count.

Star Count   # of Users   Percentage
1            141,639      41.2%
<=5          246,617      71.7%
<=10         282,384      82.1%
>10          61,736       17.9%
>50          12,468       3.6%
>100         4,341        1.3%
>1000        32           0.01%

Repository Recommendation. We try to answer two questions: 1) which repos do coders who like a given repo also like? 2) given a user's history, which repos might the user like? The first question can be answered using nearest neighbors. From the user-repo star matrix $M$, we compute the similarity between two repos with the normalized cross-correlation

$S(i, j) = \dfrac{M(:, i)^T M(:, j)}{\|M(:, i)\| \cdot \|M(:, j)\|}$   (6)

With the similarity scores, we can find the $K$ closest repos for any given repo, which answers the first question. In matrix form, the equation is

$S = \hat{M}^T \hat{M}$   (7)

where $\hat{M}$ is $M$ normalized such that each column has unit norm.
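In code, Eq. (7) amounts to a couple of lines. The following minimal sketch (our illustration on a toy matrix, not the deployed system) computes the repo-repo similarity matrix and the top-K neighbors of a repo:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Toy users x repos star matrix (4 users, 3 repos); in practice this is the
# large sparse matrix built in the preprocessing step.
M = csr_matrix(np.array([[1, 1, 0],
                         [1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 1]], dtype=float))

# Eq. (7): normalize each column to unit norm, then S = M_hat^T M_hat.
M_hat = normalize(M, norm="l2", axis=0)
S = (M_hat.T @ M_hat).toarray()

# The K most similar repos to repo j, excluding j itself.
K = 2
j = 0
top_k = np.argsort(-S[j])[1:K + 1]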


For the second question, we need to do matrix completion, i.e., estimate a user-repo score for every user-repo pair. Once we have the scores, we can sort the repo scores for each user and recommend the repos with the highest scores. In our system, the user-repo scores are computed with

$R(i, :) = \sum_{k \in M(i, :)} S(k, :)$   (8)

which is equivalent to

$R = M S$   (9)

The interpretation is intuitive: if a user starred repo A and repo B, we first collect the repos similar to A and the repos similar to B, and then merge them.
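Continuing the sketch above (again our illustration, not the released system), the user-repo scores of Eq. (9) and the top-N recommendations per user are:

# Eq. (9): score every repo for every user, R = M S.
R = M @ S

# Mask out repos the user has already starred, then take the N best per user.
R = np.where(M.toarray() > 0, -np.inf, R)
N = 2
recommendations = np.argsort(-R, axis=1)[:, :N]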
Based on the data statistics, we picked the first 5000 repos for recommendation.

Evaluation. We first split the data into a training set and a testing set based on time. From the training set, we compute the repo similarities. For each user in the testing set, we compute 20 recommendations and count how many of them are actually hit in the testing data. The average hit rate is used as the performance metric. We compared our method against a baseline that recommends purely based on repo popularity. For the evaluation, we picked the first 10,000 users and 1,000 repos. The baseline method's average hit rate is 9.2% and ours is 10.4%, a performance boost of about 13%. We plan to implement a C version of the recommender system for algorithm evaluation. We noticed that if we choose only 1,000 users, our results are very close to the baseline's, which suggests that the baseline method is biased toward very active users. When more (less active) users are included in the comparison, our method should perform much better than the baseline.
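The hit-rate evaluation can be sketched as follows; the data structures are our assumptions, not the actual evaluation harness:

import numpy as np

def average_hit_rate(recommendations, test_stars):
    # recommendations : dict user_id -> list of 20 recommended repo_ids
    # test_stars      : dict user_id -> set of repo_ids starred in the test period
    rates = []
    for user, recs in recommendations.items():
        actual = test_stars.get(user, set())
        if actual:
            hits = sum(1 for r in recs if r in actual)
            rates.append(hits / len(recs))
    return float(np.mean(rates)) if rates else 0.0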

Software Development. We implemented our recommendation software as a web application, so there is no need to download or install anything, and everyone can easily use our service. The front-end consists of static HTML files built with JavaScript/HTML/CSS. The back-end is implemented in Ruby and deployed to Heroku; it provides the repo recommendation web service, and the front-end uses JSONP to communicate with it. Currently, no database is used: all data is loaded into memory at startup. For space efficiency, we only load the S matrix. When a user submits a recommendation request for a given username, we first crawl Github to get the user's starred repos and then compute the user-repo scores on the fly. We also developed a Chrome extension which injects recommendations into Github's website.
The web app now has about 1,800 users, a good sign of the product's potential.

5 Conclusions

In this paper, we presented a high-performance link recovery algorithm for predicting missing links in social networks. The approach has been applied to a recommender for Github, which is currently in use by Github users. Comparisons with other recommenders [19-21] will be carried out in the future.

Acknowledgments. This research was supported by the Office of Naval Research under contract # N00014-12-C-0079. Distribution Statement A. Approved for public release; distribution is unlimited.

References
1. Candes, E.J., Recht, B.: Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics 9, 717-772 (2008)
2. Dao, M., Kwan, C., Ayhan, B., Tran, T.: Burn Scar Detection Using Cloudy MODIS Images via Low-rank and Sparsity-based Models. IEEE Global Conference on Signal and Information Processing, 177-181 (2016)
3. Dao, M., Kwan, C., Koperski, K., Marchisio, G.: A Joint Sparsity Approach to Tunnel Activity Monitoring Using High Resolution Satellite Images. IEEE Ubiquitous Computing, Electronics & Mobile Communication Conference, 322-328 (2017)
4. Kwan, C., Budavari, B., Dao, M., Zhou, J.: New Sparsity Based Pansharpening Algorithm for Hyperspectral Images. IEEE Ubiquitous Computing, Electronics & Mobile Communication Conference, 88-93 (2017)
5. Zhou, J., Kwan, C., Ayhan, B.: A High Performance Missing Pixel Reconstruction Algorithm for Hyperspectral Images. 2nd International Conference on Applied and Theoretical Information Systems Research (2012)
6. Kwan, C., Zhou, J.: Method for Image Denoising. Patent #9,159,121 (2015)
7. Zhou, J., Kwan, C.: High Performance Image Completion Using Sparsity Based Algorithms. SPIE Commercial + Scientific Sensing and Imaging Conference (2018)
8. Zhou, J., Kwan, C., Tran, T.: ATR Performance Improvement Using Images with Corrupted or Missing Pixels. SPIE Defense + Security Conference (2018)
9. Dao, M., Suo, Y., Chin, S., Tran, T.: Video Frame Interpolation via Weighted Robust Principal Component Analysis. Int. Conf. Acoustics, Speech, Signal Processing (2013)
10. NIPS data, http://ai.stanford.edu/~gal/Data/NIPS
11. Kaggle data, http://www.kaggle.com/c/socialNetwork/Data
12. PROPACK, http://soi.stanford.edu/~rmunk/PROPACK/
13. Miller, K., Griffiths, T., Jordan, M.: Nonparametric Latent Feature Models for Link Prediction. Advances in Neural Information Processing Systems, 1276-1284 (2009)
14. Menon, A., Elkan, C.: Dyadic Prediction Using a Latent Feature Log-Linear Model. arXiv:1006.2156v1 (2010)
15. Github, http://en.wikipedia.org/wiki/GitHub
16. Github Repository Recommendation by Repo, http://signalpro.net/github/repo_rec.htm
17. Github Repository Recommendation Chrome Extension, https://chrome.google.com/webstore/detail/github-repository-recomme/dpmjlcnijpnkklopinedkkhmjcchecia
18. Github Archive, http://www.githubarchive.org/
19. Rezaeimehr, F., Moradi, P., Ahmadiana, S., Qader, N.N., Jalili, M.: TCARS: Time- and Community-Aware Recommendation System. Future Generation Computer Systems 78, 419-429 (2018)
20. Azadjalal, M.M., Moradi, P., Abdollahpouri, A., Jalili, M.: A Trust-aware Recommendation Method based on Pareto Dominance and Confidence Concepts. Knowledge-Based Systems 116, 130-143 (2017)
21. Ranjbar, M., Moradi, P., Azami, M., Jalili, M.: An Imputation-based Matrix Factorization Method for Improving Accuracy of Collaborative Filtering Systems. Engineering Applications of Artificial Intelligence 46, 58-66 (2015)
