Chiman Kwan
Signal Processing, Inc.
1 Introduction
Social networks such as users' connections in Facebook, WeChat, LinkedIn, etc. can
be viewed as graphs, and the links/edges can be represented by a binary matrix. For
instance, if the matrix is M, then M(i, j) represents the link from node i to node j:
M(i, j) is 1 if there is a link from i to j, and 0 otherwise. If the edges are undirected,
the matrix is symmetric; otherwise it is not. From the matrix point of view, the link
prediction problem is the same as the matrix completion problem, i.e., given some
known elements of the matrix, predict the unknown elements. However, this problem
is challenging because: 1) the matrix is binary, 2) the links are sparse, and 3) the
dimension of the matrix can be very large, e.g., millions by millions.
Matrix completion is a branch of sparsity-based algorithms [1-4] and has found a
wide range of applications [5-9]. The basic idea is to determine the missing elements
of a matrix.
We developed an efficient imputation-based matrix completion algorithm that can
handle very large scale link prediction problems. Two real-world datasets were used
to demonstrate its effectiveness. The first, from [10], poses the task of predicting
co-authorship. The second, from Kaggle [11], has millions of nodes and links. On the
first dataset, our result is significantly better than the state-of-the-art; on the second,
our result ranks in the top 5 out of 119 teams. In addition to the above experiments,
we also developed a recommender for Github using the same link prediction
algorithm.
The paper is organized as follows. In Section 2, we summarize the link prediction
algorithm. Section 3 summarizes two applications of the prediction algorithm for
social networks. In Section 4, we describe in detail the development of a recommender
system that utilizes the link prediction algorithm. Finally, we conclude the paper in
Section 5.
Algorithm: IMPUTE
Input: Ω (the set of known entries), Y_0
Output: Y_n
Algorithm
1. M_1 ← Y_0
2. Iterate:
3.   (U_i, V_i) ← f(M_i)              // factorize or low-rank approximation
4.   Y_i ← U_i V_i
5.   M_{i+1} ← Y_i + P_Ω(Y_0 − Y_i)   // adjust with known values
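A minimal Python sketch of this loop, using a truncated SVD as the low-rank approximation f and a 0/1 mask for the known entries (the names `impute`, `mask`, and `rank` are ours, and the rank-1 toy example is made up):

```python
import numpy as np

def impute(Y0, mask, rank, n_iter=100):
    """IMPUTE loop: alternate a rank-r approximation with re-imposing
    the known entries (mask == 1 marks the known values of Y0)."""
    M = Y0.copy()
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        Y = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]  # Y_i = U_i V_i
        M = Y + mask * (Y0 - Y)                      # adjust with known values
    return M

# Rank-1 toy example: hide one entry of a rank-1 matrix and recover it.
truth = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
mask = np.ones_like(truth)
mask[2, 2] = 0.0                     # entry (2, 2) is treated as unknown
est = impute(truth * mask, mask, rank=1)
```

On this toy problem the iteration converges to the hidden value; for large sparse matrices the dense SVD here would be replaced by the implicit low-rank machinery described next.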
y_f(M, x) = M x and y_t(M, x) = M^T x   (3)
where M is the input matrix and x is a vector. In our algorithm, we design the function
handles as follows:
M x = U(V x) + ΔY x   (4)
M^T x = V^T(U^T x) + ΔY^T x   (5)
where ΔY = P_Ω(Y_0 − U_i V_i) is a sparse matrix when the number of known elements (or
links) is small compared to the whole matrix. This is true for social networks, since the
links are always very sparse.
By using function handles, we only need to store U_i and V_i in memory, whose
sizes are m × r and r × m respectively. Here m is the number of nodes and r is the
rank (number of hidden features) we use. Usually, r is set to less than 100. In
this way, it is easy to handle millions of nodes on a normal PC. The convergence
speed is relatively fast (fewer than 100 iterations).
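This function-handle trick can be sketched in Python with SciPy's `LinearOperator` (the sizes and sparsity density below are illustrative, not the paper's): the m × m matrix M = UV + ΔY is never formed, only its matrix-vector products, Eqs. (4)-(5).

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)
m, r = 2000, 10
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, m))
dY = sparse_random(m, m, density=1e-3, random_state=0, format="csr")

# Provide only matvec/rmatvec handles for M = U V + dY (Eqs. 4-5);
# memory stays O(m r + nnz) instead of O(m^2).
M = LinearOperator(
    (m, m),
    matvec=lambda x: U @ (V @ x) + dY @ x,
    rmatvec=lambda x: V.T @ (U.T @ x) + dY.T @ x,
)

# Truncated SVD computed through the handles alone.
u, s, vt = svds(M, k=r)
```

The same idea is what PROPACK-style Lanczos solvers expect: they only ever call the forward and transpose products, so a million-node matrix fits comfortably in memory.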
3 Experiments
In the second experiment, we used data from an online challenge and compared our
results with others'. The dataset was from the 2011 IJCNN Social Network Challenge,
with data downloaded from a social network (Flickr). There are 7.2 million
contacts/edges of 38 k users/nodes, drawn randomly while ensuring a certain level of
closeness. The training data contain 7,237,983 contacts/edges, with 37,689 outbound
nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes, so
the total number of unique nodes is 1,133,547. The test dataset contains 8,960 edges
from 8,960 unique outbound nodes; 4,480 of them are true edges and 4,480 are false
edges. The task is to predict which are true (1) and which are false (0).
Since we do not have the ground truth of the test data, we generated our own test data
from the training samples, with the same 4,480 true and 4,480 false edges. The testing
edges were randomly selected 10 times, and our algorithm was applied to each test
set. Our results are shown in Table 2. The mean AUC of our results is 0.9326, which
would rank 4th out of 119 teams (see the top AUC scores in Table 3). This is promising
since we did not perform any tricks such as blending the results of many different
approaches, nor any problem-specific algorithm tuning.
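For reference, the AUC used here can be read as the probability that a randomly chosen true edge is scored above a randomly chosen false edge. A minimal (O(n^2)) sketch, with made-up scores:

```python
def auc(true_scores, false_scores):
    """AUC = P(score of a true edge > score of a false edge), ties count 1/2."""
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            wins += 1.0 if t > f else (0.5 if t == f else 0.0)
    return wins / (len(true_scores) * len(false_scores))

# Made-up scores: perfect separation of true and false edges.
perfect = auc([0.9, 0.8], [0.1, 0.2])
```

Production evaluations use a sort-based O(n log n) computation, but the definition is the same.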
Table 2. AUC scores of our algorithm on the 10 randomly generated test sets.
#    1      2      3      4      5      6      7      8      9      10     Mean
AUC  0.934  0.930  0.931  0.936  0.932  0.933  0.930  0.937  0.930  0.933  0.9326
Table 3. Top 5 AUC scores on the leaderboard (the first one is not included because it
relied on a de-anonymization algorithm). Our result in Table 2 ranked just below Jeremy.
4.1 Overview
This section summarizes our effort in applying the matrix completion technique of
Section 2 to a real-world problem, i.e., Github repository recommendation. We
developed a commercial software product, and the product has attracted a sizable
number of initial users.
Github is a web-based hosting service for software development projects [15]
launched in April 2008. It is the most popular open source code repository site [15].
The slogan of the site is "Social Coding" as the site provides social networking func-
tionality such as feeds, followers and stars [15]. On January 16, 2013, Github hit 3
million users and about 5 million repositories. Today, Github has 64M repositories
and 23M developers worldwide, see https://github.com/.
It is well known that a core feature of any social network site is recommendation.
For instance, Facebook recommends friends, Twitter recommends followees, and
LinkedIn recommends contacts. A recommendation service is also a core feature of
e-business websites. The most famous example is Amazon, the earliest company to
develop a recommendation system to recommend products to users. Recommendation
can bring more traffic and more revenue to a website, and it can also significantly
improve the user experience. For instance, Amazon provides services like
"Customers who bought this item also bought" and "Amazon.com recommends".
However, as the most popular open source code repository website, Github does not
provide a recommendation service. In this work, we developed a commercial
software product called Github Repository Recommendation, which aims at
providing a recommendation service for Github users. Similar to Amazon, this
product provides the following services:
1. Coders who like this repo also like ...
2. Signalpro recommends repos for "a user" based on his history
The two features are implemented as two web apps, which are hosted on our
company's website ([16]). One screenshot of the web app is shown in Fig. 1.
4.2 Implementation
Data Acquisition. We use the data from Github Archive [18], an open source project
that records the public Github timeline. The data can be retrieved from Google
BigQuery. The timeline is a series of event logs with 18 event types. We only use the
star event for repository recommendation: a star event means a user is interested in,
and has bookmarked, a repository.
The query is shown below:
SELECT repository_name, actor, created_at
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent" and created_at >= '2012-04-01 00:00:00'
GROUP BY repository_name, actor, created_at
ORDER BY created_at ASC;
After running the query, we cannot directly download the results, as they are too large.
We need to save the results as a table and export the table to Google Cloud Storage.
Data Preprocessing. The raw data we collect are a sequence of records, each with
three items: username, reponame, and time. Before using the data, we perform the
following preprocessing tasks:
1. Index usernames and reponames, i.e., convert each username and reponame
to an integer
2. Convert each time to an integer
3. Remove repeated records
4. Reorder/reindex user and repo indices based on star count, in descending order
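The four steps above can be sketched as follows (the record tuples are made-up examples, not actual Github Archive rows):

```python
from collections import Counter
from datetime import datetime

# Made-up (username, reponame, time) records.
records = [
    ("alice", "repoA", "2012-05-01 10:00:00"),
    ("bob",   "repoA", "2012-05-02 11:30:00"),
    ("alice", "repoB", "2012-05-03 09:15:00"),
    ("alice", "repoA", "2012-05-04 12:00:00"),  # repeated star: dropped
]

# Step 3: remove repeated (user, repo) records, keeping the first.
seen, deduped = set(), []
for user, repo, ts in records:
    if (user, repo) not in seen:
        seen.add((user, repo))
        deduped.append((user, repo, ts))

# Step 4: index repos by star count, descending (popular repos get small ids).
star_counts = Counter(repo for _, repo, _ in deduped)
repo_id = {repo: i for i, (repo, _) in enumerate(star_counts.most_common())}

# Step 1: index usernames in order of first appearance.
user_id = {}
for user, _, _ in deduped:
    user_id.setdefault(user, len(user_id))

# Step 2: convert time strings to integer timestamps.
rows = [(user_id[u], repo_id[r],
         int(datetime.strptime(t, "%Y-%m-%d %H:%M:%S").timestamp()))
        for u, r, t in deduped]
```

Reindexing by popularity (step 4) is convenient later, when only the most-starred repos are kept for recommendation.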
Data Statistics. After data preprocessing, we have a big sparse binary matrix whose
columns represent repos and rows represent users. A nonzero value at (i, j) indicates
that user i starred repo j. From the star events between 04/01/2012 and 01/10/2013,
we got 344,120 users, 317,128 repos, and 3,253,207 stars. The total number of matrix
elements is about 0.1 trillion, and the sparsity level is about 1/33,545.
Table 4 and Table 5 show how many users/repos have a certain number of stars.
The tables show the following facts:
1. About half of the users starred only 1 repo, and about half of the repos have
only one starring user.
2. About 5,000 repos have more than 100 stars.
3. About 90% of repos have at most 10 stars.
4. About 10 thousand repos have more than 50 stars.
Table 4. The number of repos with a given star count.
Star Count   # of Repos   Percentage
1            162,399      51.2%
<=5          259,335      81.8%
<=10         281,935      88.9%
>10          35,193       11.1%
>50          9,509        3.0%
>100         5,061        1.6%
>1000        254          0.08%

Table 5. The number of users with a given star count.
Star Count   # of Users   Percentage
1            141,639      41.2%
<=5          246,617      71.7%
<=10         282,384      82.1%
>10          61,736       17.9%
>50          12,468       3.6%
>100         4,341        1.3%
>1000        32           0.01%
S(i, j) = M(:, i)^T M(:, j) / (||M(:, i)|| ||M(:, j)||)   (6)
With similarity scores, we can find the closest K repos for any given repo, and thus
the first question is answered. In matrix form, the equation is
S = M̂^T M̂   (7)
where M̂ is M with each column normalized to unit length. Basically, it is equivalent to
R = M S   (9)
The interpretation is intuitive: if a user starred repo A and repo B, then we first
collect the repos similar to A and the repos similar to B, and then merge them.
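Equations (6), (7), and (9) on a tiny made-up star matrix (rows = users, columns = repos):

```python
import numpy as np

# Made-up binary star matrix: 3 users x 3 repos.
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

# Eqs. (6)-(7): column-normalize M, then S = M_hat^T M_hat holds the
# cosine similarity between every pair of repo columns.
M_hat = M / np.linalg.norm(M, axis=0)
S = M_hat.T @ M_hat

# Eq. (9): R = M S scores every (user, repo) pair; a user's score for a
# repo sums that repo's similarity to the repos the user already starred.
R = M @ S
ranked_for_user0 = np.argsort(-R[0])  # repos ranked for user 0
```

On the full dataset M is sparse, so both the normalization and the products are done with sparse matrices rather than dense arrays.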
Based on the data statistics, we picked the first 5000 repos for recommendation.
Evaluation. We first split the data into a training set and a testing set based on time.
From the training set, we compute repo similarities. For each user in the testing set,
we compute 20 recommendations and count how many of them are actually hit in the
testing data. The average hit rate is used as the performance metric. We compared our
method with a baseline method that recommends purely based on repo popularity. For
evaluation, we picked the first 10,000 users and 1,000 repos. The baseline method's
average hit rate is 9.2% and ours is 10.4%, about a 15% performance boost. We plan
to implement a C version of the recommender system for algorithm evaluation. We
noticed that if we choose only 1,000 users, our results are very close to the baseline
method's. This means that the baseline method is more biased toward very active
users. When more (not so active) users are used for comparison, our results should
perform much better than the baseline method.
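A sketch of the hit-rate evaluation under one plausible reading of the metric (hits among the K recommendations, averaged over test users; the data below are made up):

```python
def hit_rate(recommended, actually_starred):
    """Fraction of the recommended repos the user actually starred
    in the test period."""
    return len(set(recommended) & set(actually_starred)) / len(recommended)

def average_hit_rate(recs_by_user, test_by_user):
    """Mean hit rate over the test users that received recommendations."""
    rates = [hit_rate(recs_by_user[u], starred)
             for u, starred in test_by_user.items() if u in recs_by_user]
    return sum(rates) / len(rates)

# Made-up example: two users, 3 recommendations each.
recs = {0: [1, 2, 3], 1: [4, 5, 6]}
test = {0: [2], 1: [4, 6]}
score = average_hit_rate(recs, test)
```

The popularity baseline simply sets every user's `recs` to the globally most-starred repos, which is why it favors very active (mainstream) users.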
5 Conclusions
In this paper, we presented a high performance link recovery algorithm for predicting
missing links in social networks. The approach has been applied to a recommender
for Github, which is now in use by Github users. Comparisons with other
recommenders [19-21] will be carried out in the future.
References
1. Candes, E.J., Recht, B.: Exact Matrix Completion via Convex Optimization. Foundations
of Computational Mathematics. 9, 717-772 (2008)
2. Dao, M., Kwan, C., Ayhan, B., Tran, T.: Burn Scar Detection Using Cloudy MODIS Im-
ages via Low-rank and Sparsity-based Models. IEEE Global Conference on Signal and In-
formation Processing. 177 – 181 (2016)
3. Dao, M., Kwan, C., Koperski, K., Marchisio, G.: A Joint Sparsity Approach to Tunnel Ac-
tivity Monitoring Using High Resolution Satellite Images. IEEE Ubiquitous Computing,
Electronics & Mobile Communication Conference. 322-328 (2017)
4. Kwan, C., Budavari, B., Dao, M., Zhou, J.: New Sparsity Based Pansharpening Algorithm
for Hyperspectral Images. IEEE Ubiquitous Computing, Electronics & Mobile Communi-
cation Conference. 88-93 (2017)
5. Zhou, J., Kwan, C., Ayhan, B.: A High Performance Missing Pixel Reconstruction Algo-
rithm for Hyperspectral Images. 2nd. International Conference on Applied and Theoretical
Information Systems Research. (2012)
6. Kwan, C., Zhou, J.: Method for Image Denoising. Patent #9,159,121. (2015)
7. Zhou, J., Kwan, C.: High Performance Image Completion Using Sparsity based Algo-
rithms. SPIE Commercial + Scientific Sensing and Imaging Conference. (2018)
8. Zhou, J., Kwan, C., Tran, T.: ATR Performance Improvement Using Images with Cor-
rupted or Missing Pixels. SPIE Defense + Security Conference. (2018)
9. Dao, M., Suo, Y., Chin, S., Tran, T.: Video Frame Interpolation via Weighted Robust
Principal Component Analysis. Int. Conf. Acoustics, Speech, Signal Processing. (2013)
10. NIPS data, http://ai.stanford.edu/~gal/Data/NIPS
11. Kaggle data, http://www.kaggle.com/c/socialNetwork/Data
12. Propack, http://soi.stanford.edu/~rmunk/PROPACK/
13. Miller, K., Griffiths, T., Jordan, M.: Nonparametric Latent Feature Models for Link Predic-
tion. Advances of Neural Information Processing Systems. 1276 – 1284 (2009)
14. Menon, A., Elkan, C.: Dyadic Prediction Using a Latent Feature Log-Linear Model.
arXiv:1006.2156v1, 10, (2010)
15. Github, http://en.wikipedia.org/wiki/GitHub
16. Github Repository Recommendation by Repo, http://signalpro.net/github/repo_rec.htm
17. Github Repository Recommendation Chrome Extension, https://chrome.google.com/
webstore/detail/github-repository-recomme/dpmjlcnijpnkklopinedkkhmjcchecia
18. Github Archive, http://www.githubarchive.org/
19. Rezaeimehr, F., Moradi, P., Ahmadiana, S., Qader, N.N., Jalili, M.: TCARS: Time- and
Community-Aware Recommendation System. Future Generation Computer Systems. 78,
419–429 (2018)
20. Azadjalal, M.M., Moradi, P., Abdollahpouri, A., Jalili, M.: A Trust-aware Recommenda-
tion Method based on Pareto Dominance and Confidence Concepts. Knowledge-Based
Systems. 116, 130-143 (2017)
21. Ranjbar, M., Moradi, P., Azami, M., Jalili, M.: An Imputation-based Matrix Factorization
Method for Improving Accuracy of Collaborative Filtering Systems. Engineering Applica-
tions of Artificial Intelligence. 46, 58-66 (2015)