
Empirical Analysis of Personal Email Network

Praveen Kumar Gurumurthy


gpraveenkumar5yamil.com

Project done for Coursera Course - Social Network Analysis 2013

Abstract
Analyzing Ego Networks has become a popular method to detect communities. Email networks are structurally different from Friendship networks.
To explore what this difference reveals, in this report I present
the growth of my email network and analyze three different types of subnetworks, obtained by filtering the email network at various levels.
Interestingly, each of these networks provides some unique insight about my
email network.

Introduction

I have always believed email to be a different kind of social network. I say this because, unlike
other social networks such as a Friendship network, your email network does not stabilize over time.
What do I mean by this? Say, for example, you join Facebook. In the first couple of months,
the number of friends you connect to on Facebook is going to be large (given that
a lot of your friends are already using Facebook), and the number of friends you add over
time is going to shrink as time progresses (because you have already added
most of them, and unless you are a rockstar you wouldn't make tens of new friends every
day). Is it not the same with email networks? Not quite. For example, my laptop's hard
drive failed exactly two years ago (in the Thanksgiving season of 2011), and this forced me
to contact a bunch of people, whom I had not connected to before, to figure out a solution
for the problem. That was a really rapid growth of my email network.
Ego networks, shown in Fig. 1 [2], consist of a focal node (ego) and the nodes to which the
ego is directly connected (these are called alters), plus the ties, if any, among the alters
[1]. Analyzing Ego Networks has become a popular and powerful tool to detect communities
in social networks [2].

Figure 1: An Ego Network



Driven by the structural difference of email networks and by my curiosity to detect communities, I analyze my Ego Email Network in this report. The rest of this report is organized as follows.
Section 2 describes the dataset. Sections 3, 4 and 5 present analyses of three different types of
networks. Finally, Section 6 has the conclusions.

DataSet

I have used emails from my Gmail account to do the analyses. I have made three graphs
out of these emails, all of them directed. There is one edge from the Sender (From:)
to each Receiver (To:, CC:, BCC:), as one would naturally expect. Their construction is
described in their respective sections. The three networks are:
RawGraph : Directed graph consisting of 9885 nodes and 24663 edges.
EgoGraph : Directed graph consisting of 7762 nodes and 20338 edges.
CoreEgoGraph : Directed graph consisting of 252 nodes and 2368 edges.
Due to the confidential nature of the data, I am not making them publicly available. However,
it is very easy to retrieve your own emails and repeat these analyses.
2.1 Data Collection Process

Given that I use 5 email accounts pretty frequently out of the n email accounts I have, these became
the candidates. Three of them are YahooMail! and two of them are GMail. I began to hunt
the web for tools to download emails from both services. To retrieve emails from GMail, I
found a pretty good tool: Got Your Back. Detailed instructions on how to retrieve the emails
are provided in the tool's documentation, so I am not going to go
over them here. Unfortunately, I didn't find any freely available tool to download
emails from my YahooMail! accounts. The only tool I found requires a
paid license, so I didn't use it.
Amongst the two GMail accounts, one is my primary account, which I use for all
social communication, i.e. I use it to communicate with all my friends, and all
my social networking accounts like Facebook, Twitter and a bunch of others are linked to it.
So, being able to collect data from this account was awesome! The other GMail account
is relatively new and has only incoming emails, so it wouldn't make sense to
analyze it and I chose to drop it.
Therefore, using the above-mentioned tool, I have collected about seven and a half years of
data, from the day I opened that GMail account on the 30th of May, 2006 till the 23rd of
November, 2013, when I began working on this project.
For the purposes of these analyses, I have extracted the email addresses of the Sender and
the Receiver(s) from each email using a Python script (footnote 1). This script gives an adjacency list
of emails, i.e. each row in the list has a Sender and one or more Receivers, in a format that
can be directly input to Gephi, which is used to construct the ego networks. It also generates
a file containing the number of email communications per day, which is used to describe
the growth of my email network.
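
The extraction step described above can be sketched roughly as follows. This is not the script from footnote 1, only a minimal illustration using Python's standard `email` module. The function names, the semicolon-separated row layout, and the choice to lower-case addresses are assumptions made for this sketch.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def adjacency_row(raw_message):
    """Return (sender, receivers, day) for one raw message, or None if it has no sender."""
    msg = message_from_string(raw_message)
    senders = getaddresses(msg.get_all("From", []))
    if not senders or not senders[0][1]:
        return None
    sender = senders[0][1].lower()

    # Receivers are everyone listed in To:, CC: and BCC:.
    headers = msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    receivers = []
    for _name, address in getaddresses(headers):
        address = address.lower()
        if address and address not in receivers:
            receivers.append(address)

    # The day is used for the per-day communication counts.
    day = None
    date_header = msg.get("Date")
    if date_header:
        try:
            day = parsedate_to_datetime(date_header).date()
        except (TypeError, ValueError):
            day = None  # unparseable date: skip it in the daily counts
    return sender, receivers, day


def build_adjacency_list(raw_messages):
    """Build adjacency rows ('sender;receiver1;receiver2;...') and emails-per-day counts."""
    rows, per_day = [], Counter()
    for raw in raw_messages:
        parsed = adjacency_row(raw)
        if parsed is None:
            continue
        sender, receivers, day = parsed
        if receivers:
            rows.append(";".join([sender] + receivers))
        if day is not None:
            per_day[day.isoformat()] += 1
    return rows, per_day
```

Each returned row lists the Sender first and then every Receiver, which matches the adjacency-list idea used in this report; the exact separator that Gephi expects should be checked against its import dialog.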
2.2 Growth of my GMail

Before we begin to analyze the growth of the network, you should know that I am the kind
of person who doesn't turn off email notifications from social networking sites. So, they
contribute a significant fraction of the emails. Table 1 shows that, on average, I have about
6000 email communications (sent and received) per year in the GMail account. I was surprised to see
the number for 2008 being about half of that average (I had no idea about it). But when I
think back, that was the year I was looking for an internship (to apply for positions, I used
one of my Yahoo accounts), and I did much less social networking on Orkut that year because it
was going out of fashion (I didn't have a Facebook account yet). Hence, I think the numbers went
down. August of 2011 was when I joined Graduate School, before which I had to
communicate with a lot of people about admission, results, accommodation etc. My activity
on social networking sites (only Facebook by now, my Orkut died) was also significantly higher, as I
met a good number of new friends at my Graduate School, hence the very high number of
email communications that year.

1 To keep the report short and avoid clutter, I have put the code at
http://gpraveenkumar.com/projects/snacourse2013/index.html

Table 1: No. of Email Communications per year
Year    No. of emails
2006    1189
2007    6020
2008    2928
2009    6398
2010    6296
2011    12202
2012    8832
2013    6887

Fig. 2 shows a plot (obtained using R code, see footnote 2) of the number of emails
communicated (sent and received) each month from May 2006 to November 2013.
The red line is the least squares regression fit of the number of emails communicated against
time in months. It is easy to see that they are positively correlated: the fitted number of
emails communicated per month was around 250 during 2006 and has
gradually risen to around 600 during 2013. I think this is a pretty natural trend
that can be observed for a lot of people, as online communication is increasing steeply
and becoming the primary form of communication in a large number of cases.

Figure 2: Month-wise distribution of emails. The red line corresponds to the least squares
fit.
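
The red line in Fig. 2 is an ordinary least squares fit of monthly email counts against the month index. The plot itself was made in R; the following is only a small Python sketch of the same calculation. The function name and the toy monthly counts are hypothetical and are not taken from the dataset.

```python
def least_squares_line(counts):
    """Fit counts[i] ~ intercept + slope * i, where i = 0, 1, 2, ... is the month index."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two months to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    # slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical monthly counts, only to show the call.
intercept, slope = least_squares_line([250, 300, 350, 400])
print(intercept, slope)
```

A positive slope corresponds to the positive correlation between time and email volume described above.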
2.3 Limitations

Due to the confidential nature of the data, it is in general very difficult to collect email datasets,
both at the individual level and at the organizational level. This is a major obstacle to analyzing
email networks when compared to social networks. On one hand, even if we get access to a
dataset, it is anonymised almost all the time. So, it becomes very difficult to understand in
depth what is happening in the network when these analyses are performed without
a domain expert. As you will see, some of the interesting observations I present
below could never have been detected in an anonymised network without a domain
expert (me in this case). On the other hand, even when one is able to collect emails at a
personal level, one is limited by the absence of a
straightforward way to retrieve the data (e.g. my YahooMail! accounts). Finally, email datasets
collected at a personal level (as I have done here) don't have any profile information
(e.g. location, gender), unlike other social networking sites. This is another limitation that
prevents interesting analyses.

2 The code is available at http://gpraveenkumar.com/projects/snacourse2013/index.html

Analysis of RawGraph

I loaded the adjacency list of emails, henceforth called RawGraph , in Gephi and ran the
network layout (footnote 3). The resulting network is shown in Fig. 3. Based on the layout's appearance, with clusters of nodes moving out from the center, I named it the Universe of my Email
Network, with the green dot being me (the point where the big bang happened) surrounded by the
ever-expanding galaxies. If you pay closer attention to the galaxies that are farther away
from me, you can see that there are good communication links between the leader of
each cluster (the center node, which seems to be well connected to all the other nodes in the
cluster) and the other nodes. These other nodes do not seem to be connected to me directly,
but the leader is. So, the natural question now is: who are those
people and who is the leader? We are going to find out soon!
Using Gephi, I calculated some network statistics for this RawGraph , most of which I won't
report because they are not meaningful. RawGraph is not my Ego Network, because I am in
the network and hence all the network statistics like average degree, betweenness, closeness
etc. are going to be highly influenced by my presence. However, in Table 2, I have reported
some of them, and analyzing them is going to shape how I should build my Ego Network.
Table 2: Some Network Statistics of RawGraph
Id  Label                                  In-Degree  Out-Degree  Degree  Weighted Degree
A   MYEMAIL@gmail.com                      2371       1456        3827    46210
B   mailer-daemon@googlemail.com           0          1074        1074    46052
C   FRIEND-1@gmail.com                     27         575         602     15503
D   notificationo9osa6@facebookmail.com    1          34          35      8222
E   updateo9osa6@facebookmail.com          0          1           1       7276
F   FRIEND-2@gmail.com                     22         189         211     4548
G   3504732672376623859@mail.orkut.com     0          1           1       4382

Table 2 shows the top seven nodes of the RawGraph ordered according to Weighted Degree,
where the Weighted Degree of each node is calculated as the sum of the weights of all its edges.
The key observations that I would like to highlight are:
1. Apart from me (A), only two of the top seven email addresses, C and F ,
correspond to real people.
2. Further, three of them, B, E and G, have an In-Degree of 0, as
I don't send emails to those addresses, nor does anyone CC or BCC emails to
them.
3. If you are wondering why D has an In-Degree of 1, this is because of
the Facebook feature that lets you reply to or comment on posts just by replying to the
notification emails. Sometimes I use this feature, and so D has one incoming edge (from me).
3 All the layouts I used had these settings: Dissuade Hubs enabled, Prevent Overlap enabled,
and Gravity set to 10 or 25, with all other settings left at their defaults.

Figure 3: Universe of my RawGraph . The center green dot is me. All the edges in the
graph are scaled according to their weight: the greener the edge, the higher its weight.

4. Next, two of the nodes, E and G, have an Out-Degree of 1. It means that I only
receive emails from those nodes, because that is the way Facebook and Orkut send
their notifications.
5. If you are now curious why D has an Out-Degree of 34 and not 1,
it turns out that 34 is the number of Facebook groups I have
chosen to receive notifications from. As each group has its own unique Facebook
email address, D sends emails to them and I receive them because I have subscribed
to those notifications. Elegant, right? I didn't know this, but found it out by analyzing
the neighbours of D.
6. Still, why do I have galaxies in the graph? The centers of the galaxies, the leaders,
are the primary people who forward information (funny emails, meeting arrangements
etc.). This was the general way to share information or create events until
Facebook became popular and made those things easy. When a person like C or F
sends or forwards an email, they send it to all their friends or to a subset of them. In the
cases where we (the person who sends the information and I) have very few common
friends, there is an edge between us, and between him and all the people to whom he forwards,
but not between me and them (as I might not know them and so we don't
communicate). Hence, the layout algorithm comes up with those beautiful galaxies.
7. The last thing we need to understand is why B has such a high weighted degree
although its degree is comparatively low. To understand this, think about what happens when
you send an email to hundreds of friends and, say, one of the email addresses doesn't
work anymore. You are going to receive an email from B informing you that this
id doesn't exist, along with the original message that you sent and all the email addresses
of the people to whom you sent it. Again, this kind of failure can happen with
totally different groups of friends (say, groups of friends from high school and
college to whom you send information), and hence B is associated with a lot of email
addresses. As seen in Table 2, B has an Out-Degree of 1074.
Therefore, given that RawGraph is not my Ego-Network and given some of the observations above,
we need to process the data to observe something useful.
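
The degree columns of Table 2 were computed in Gephi. As an illustration only, the same quantities can be derived from an adjacency list along the following lines. The sketch assumes rows of the hypothetical form `sender;receiver1;receiver2;...` and assumes that the weight of an edge is the number of emails sent along it; both are assumptions of this example rather than statements about Gephi's internals.

```python
from collections import Counter, defaultdict


def edge_weights(rows):
    """Count how many emails went along each directed (sender, receiver) edge."""
    weights = Counter()
    for row in rows:
        sender, *receivers = row.split(";")
        for receiver in receivers:
            weights[(sender, receiver)] += 1
    return weights


def node_statistics(weights):
    """Return In-Degree, Out-Degree, Degree and Weighted Degree for every node."""
    stats = defaultdict(lambda: {"in": 0, "out": 0, "degree": 0, "weighted": 0})
    for (source, target), weight in weights.items():
        stats[source]["out"] += 1
        stats[target]["in"] += 1
        # Weighted degree: sum of the weights of all edges touching the node.
        stats[source]["weighted"] += weight
        stats[target]["weighted"] += weight
    for values in stats.values():
        values["degree"] = values["in"] + values["out"]
    return dict(stats)
```

Under these assumptions a node such as B, which writes to many different addresses many times, ends up with a weighted degree far above its plain degree.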

EgoGraph

4.1 Filtering the Data

To build Ego-Networks, nodes similar to those mentioned in observation (4) in the
previous section are not useful, as they would lead to individual nodes not connected to
anyone else in the network. For this reason, I have removed all the emails that are sent to
just one Receiver and considered only those emails that are sent to more than one Receiver.
Note that doing this does not affect the structure of the graph: if the
Sender and the Receiver are already connected in the network, removing this edge would only
decrease the weight of the edge between them, and if either of them is not otherwise present, it would
lead to a single hanging node (which is not useful). If the previous statement is unclear,
remember that I have to be either the Sender or the Receiver.
Next, I have removed the nodes similar to B, D, E and G from the network. Although
a principled approach would be to write a script to eliminate such nodes from the
adjacency list, I have just done it using Gephi, mainly because there were fewer than 10 nodes
to remove and hence it was very easy to do in Gephi (sort the nodes according
to degree or weighted degree and remove those that you think are not suitable candidate
nodes). Finally, I removed myself from the network.
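
The node removal above was done by hand in Gephi. A scripted version of the same three filtering steps could look roughly like the sketch below. It assumes adjacency rows of the hypothetical form `sender;receiver1;receiver2;...`; the function name and its parameters are made up for this illustration.

```python
def filter_for_ego_graph(rows, ego, excluded):
    """Apply the three filtering steps to adjacency rows.

    rows     -- strings of the form 'sender;receiver1;receiver2;...'
    ego      -- my own email address, removed from the network
    excluded -- automated addresses (nodes similar to B, D, E and G)
    """
    kept = []
    for row in rows:
        sender, *receivers = row.split(";")
        # Step 1: keep only emails sent to more than one Receiver.
        if len(receivers) < 2:
            continue
        # Steps 2 and 3: dropping a node also drops every edge that starts at it.
        if sender == ego or sender in excluded:
            continue
        receivers = [r for r in receivers if r != ego and r not in excluded]
        if receivers:
            kept.append(";".join([sender] + receivers))
    return kept
```

For example, a row `friend@x;me@x;other@x` with ego `me@x` is reduced to `friend@x;other@x`, while any email with a single Receiver is discarded before the node removal is applied.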
4.2 Analysis of EgoGraph

Figure 4 shows the Universe of my EgoGraph . Comparing it to the RawGraph in Fig. 3, there
doesn't seem to be much of a difference structurally. Data filtering, apart from making the
galaxies and the interactions between them more prominent, does not seem to have been of much
help.
However, it can now be seen that my Ego Network has two major clusters.
The violet one shown in Figure 4 is the set of my school friends, and the big clustered
network above it corresponds to my college friends. Given that I know at least 500-odd
people in my college with whom I have interacted, I expected it to be messy. This
messiness is aggravated by the fact that at least 30 of them seem to be active in forwarding
information to friends via email. In fact, every prominent leader of a galaxy seen in Fig. 4
is one such person. This prominence has overshadowed most of the other small network
interactions and their discovery.
In Figure 5, I have pointed out a couple of interesting network activities. In Fig. 5a,
the blue network has no outgoing edge outside of its cluster. Its leader is John
Louis (footnote 4), a famous Grand Master of Memory. He actively sends good information to his
friends. None of my friends know him and I don't know any of his connections, and hence we
see that structure. A similar argument holds for the other disconnected structure in Fig. 5a, but
something more interesting is happening there. Before we proceed to that, we must remind
ourselves that all the leaders, i.e. the centers of the galaxies, are my friends; otherwise we
could never observe to whom they forward emails. Such a network
arises because I am friends with the leaders of those two networks and both of them have
their own network of friends. We observe such a structure because both of them have a common friend
(other than me) to whom both forward emails. Furthermore, the two leaders
share an edge. This made me curious to find out who those two leaders were, and it
turns out that they are siblings (brother and sister)...sigh!

4 http://www.world-memory-statistics.com/grandmasters.php

Figure 4: Universe of my EgoGraph .

Figure 5b shows a fraction of my Ego Network, with edges removed to better observe the
galactic structures. The interesting structure (marked with a black arrow in the figure) seems
to have two leaders, both of whom actively forward emails to the same group of friends, and it
seems like they send to very few friends outside the network. But no, apparently this is not
the case. Both of those leader nodes correspond to one person who has two different email
addresses and sends out emails to his friends from either of them. It is exciting to see such
intricate details revealed in the analysis of these Ego Networks.

(a) Disconnected and Sibling Galaxies

(b) School Friends

Figure 5: Interesting Observations on EgoGraph

5 CoreEgoGraph

5.1 Filtering the Data

Finally, I wanted to see if I could discover the core of my network by removing those
nodes connected to the leaders, as they correspond to friends of my friends. All those
nodes have an Out Degree of 0. We could observe activity from these nodes only if
they replied to a leader's message while CCing all the others; that way, I would also receive that
email, since I am friends with the leader (as discussed above). This is clearly not what we
observe; otherwise we would have had outgoing edges from these nodes to all the friends of
the leader (such a node would be replying to a leader's message that included the leader's friends). Interestingly,
the structure of that galaxy would then be similar to the one we observed in Fig. 5b, with
two prominent leaders.
Hence, I decided to remove all the nodes that have an Out Degree of 0 from the EgoGraph.
Again, I used Gephi to iteratively delete all the nodes with Out Degree 0. Although this
is not the best way to do it, I did it in this fashion for convenience.
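
The iterative deletion was done by hand in Gephi. As a hedged sketch of what a script for the same step might look like, the function below repeatedly drops every node with Out Degree 0 until none remain. The function name and the edge-set representation are assumptions of this example.

```python
def prune_zero_out_degree(edges):
    """Repeatedly remove nodes with Out Degree 0 from a directed edge set.

    edges -- iterable of (source, target) pairs
    Returns the set of edges that survive the pruning.
    """
    edges = set(edges)
    while True:
        nodes = {node for edge in edges for node in edge}
        sources = {source for source, _ in edges}
        sinks = nodes - sources  # nodes that never send anything
        if not sinks:
            return edges
        # Removing a sink removes every edge pointing at it. Nodes left
        # without any edge disappear from the edge set automatically.
        edges = {(s, t) for s, t in edges if t not in sinks}
```

The loop is needed because removing one layer of sinks can leave other nodes with no outgoing edges, which is why the deletion has to be repeated until the graph stops changing.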
5.2 Analysis of CoreEgoGraph

The layout of the resulting network is shown in Fig. 6. To my surprise, well-separated
connected components appeared. The network shown in Fig. 6 is coloured according to
modularity. The central component, which consists of my friends from undergraduate
college, has 6 of the 14 communities that exist in the CoreEgoGraph . Each such
community consists of a different interest group (say sports, travel etc.), but as some of
my friends (other than me) have a few friends belonging to multiple communities, it forms
a big cluster with 6 interacting communities.
The fact that this network retained the leaders of the sibling galaxy shown in
Figure 5a validated the approach of finding the core network of friends by removing
the nodes that have an Out Degree of 0. Next, it is easy to see that the High School
network shown in Fig. 5b has a few nodes that bind it together. Fig.
6 has those few nodes as a connected component. These are the nodes in the network that
actually communicate with each other.
The Average Degree of the CoreEgoGraph is 9.397. This means that, on average, my
friends send/forward information (interesting articles, project group meetings, etc. that they
send/forward to me) to about 9 other friends. Note that the average degree is highly
biased by the undergraduate college component, as it has more links than the rest
of the disconnected components, as can be seen in Fig. 7b.
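
As a quick sanity check, the reported Average Degree appears to equal the number of edges divided by the number of nodes of the CoreEgoGraph given in Section 2 (2368 edges, 252 nodes). That this is the exact formula Gephi applies to a directed graph is an assumption here; the arithmetic itself is easy to verify.

```python
# Node and edge counts of CoreEgoGraph as listed in the dataset section.
nodes = 252
edges = 2368

# Mean number of outgoing edges per node in a directed graph.
average_degree = edges / nodes
print(round(average_degree, 3))
```

Read this way, the value is the mean Out Degree, which matches the interpretation that a friend forwards to about 9 others on average.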

Figure 6: Universe of my CoreEgoGraph .


Table 3: Some Network Statistics of CoreEgoGraph
Metrics                        Value
Average Degree                 9.397
Modularity                     0.505
Clustering Co-efficient        0.574
No. of Connected Components    9

The Modularity of a graph is a measure of its ability to group into clusters or communities. A value of 0.505 implies that the graph has a good tendency to group into clusters, as
can be seen in Fig. 6. I calculated this measure treating the graph as an undirected network.
It can be easily seen from Fig. 6 that the number of Connected Components is 9.
The Average Clustering Coefficient is a measure of the degree to which nodes in a graph
tend to cluster together. Since it makes more sense to calculate this measure on an
undirected graph, I calculated it using Gephi as undirected. A value of 0.574 is somewhere
around the middle of the range from 0 to 1, and it means that, on average, a little over half of the
possible links among a node's neighbours are actually present in the CoreEgoGraph .
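
To make the Average Clustering Coefficient concrete, here is a minimal sketch of the standard undirected definition: for each node, the fraction of pairs of its neighbours that are themselves linked, averaged over all nodes. The values in Table 3 come from Gephi, whose handling of edge cases (for instance nodes with fewer than two neighbours, counted as 0 here) may differ from this sketch; the function name is hypothetical.

```python
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.

    edges -- iterable of (u, v) pairs; direction and self-loops are ignored
    """
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    if not neighbours:
        return 0.0

    total = 0.0
    for node, nbrs in neighbours.items():
        k = len(nbrs)
        if k < 2:
            continue  # no neighbour pairs: coefficient taken as 0
        # Count the pairs of neighbours that are linked to each other.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbours[a])
        total += 2 * links / (k * (k - 1))
    return total / len(neighbours)
```

For a triangle with one extra pendant node attached, the two plain triangle nodes score 1, the node carrying the pendant scores 1/3 and the pendant scores 0, giving an average of 7/12.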

(a) Degree Distribution

(b) Clustering Co-efficient

Figure 7: Distributions of some Network Statistics of the CoreEgoGraph . The brighter the
color of the node, the higher its value.

Conclusion

I have analyzed three different networks obtained from my email network. From these analyses,
we were able to discover some interesting aspects of the network, both in its growth
over time and in its structure.
I would like to conclude with a simple application where these analyses might be useful. The
Ego Networks EgoGraph and CoreEgoGraph let us detect leaders, amongst other things.
These leaders become the first candidates or hubs for disease spreading in the network, e.g.
spamming the network when their email addresses are compromised. Researchers
have started to explore email networks for behavioural analysis by building profiles of the people
a person communicates with and the way they communicate. Anomaly detectors can then be
built out of these profiles to notify the user when there is a difference in the communication
pattern, and hence can alert the user and enhance the security of the network.

References
[1] Ego networks : http://www.test.org/doe/.
[2] Julian McAuley and Jure Leskovec. Learning to discover social circles in ego networks.
In Advances in Neural Information Processing Systems 25, pages 548-556, 2012.
