
Reality Mining: Sensing Complex Social Systems
Nathan Eagle, Alex Pentland
Personal and Ubiquitous Computing, 2006

Aim
How data collected from mobile phones can be used to uncover regular rules and structure in the behavior of both individuals and organizations

Mobile Phones as Wearable Sensors


Social scientists use surveys to learn about human behavior
Usual survey techniques suffer from:
bias
sparsity of data
lack of continuity between discrete questionnaires
absence of dense, continuous data

Use of phones to collect data on human behavior

Bluetooth
Bluetooth is a short-range RF network
10-30 meters in practice

Device discovery is standard among Bluetooth devices

A scan returns the Bluetooth MAC address (BTID), device name, and device type
The BTID is unique

Bluetooth scan is energy-consuming

Dataset & Privacy


Prior consent and human subject approval
Dataset
100 Nokia 6600 users
75 Lab users
20 incoming master's students
5 incoming freshmen

~450k hours of information about users' location, communication, and usage behavior

http://reality.media.mit.edu

User modeling
Easily identifiable routines in every person's life
Simple model of behavior
Home, work, elsewhere

Data collected from


Bluetooth, cell tower, temporal information from phone
Incorporate information from static BT devices
BT on a desktop

User modeling
Inferring accurate location from cell towers
Complicated because a phone can receive signals from far-away towers
Accuracy improves if the user spends enough time in one place
The distribution of time spent with a set of towers adds accuracy

Cell tower probability density functions


The probability of being associated with one of the 25 visible cell towers is plotted above for five users who work on the third floor corner of the same office building. Each tower is listed on the x-axis and the probability of the phone logging it while the user is in his office is shown on the y-axis. (Range was assured to 10 m by the presence of a static Bluetooth device.) It can be seen that each user sees a different distribution of cell towers depending on the location of his office, with the exception of Users 4 and 5, who are officemates and have the same distribution despite being in the office at different times.
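The per-user tower distributions described in this caption can be estimated by counting how often each tower appears in a user's log. The following is a minimal sketch, not the authors' code: the tower IDs and readings are made up for illustration, and the total variation distance is an assumed way to compare two users, not a measure taken from the paper.

```python
from collections import Counter

def tower_distribution(tower_log):
    """Return {tower_id: fraction of readings} for one user's in-office readings."""
    counts = Counter(tower_log)
    total = sum(counts.values())
    return {tower: n / total for tower, n in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    towers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in towers)

# Hypothetical readings: users 4 and 5 share an office, user 1 sits elsewhere.
user4 = ["T3", "T3", "T7", "T3", "T9", "T7"]
user5 = ["T7", "T3", "T3", "T9", "T3", "T7"]
user1 = ["T1", "T1", "T2", "T1", "T3", "T2"]

p4, p5, p1 = (tower_distribution(log) for log in (user4, user5, user1))
print(total_variation(p4, p5))  # officemates: small distance
print(total_variation(p4, p1))  # different offices: larger distance
```

With real logs, the readings would first be restricted to times when the static Bluetooth device confirms that the user is in the office.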

Office mates

Observations
Users within a 10 m radius see different sets of towers
6% of the time, users were without signal
21% to 29% of the time, users were in range of Bluetooth devices or other mobile phones
Could Bluetooth be used for localization inside buildings during such times?
GPS does not work indoors

Encountered devices for a subject during the month of January


The subject is only regularly proximate to other Bluetooth devices between 9:00 and 17:00, while at work, but never at any other times. This predictable behavior will be defined as low entropy. The subject's desktop computer is logged most frequently throughout the day, with the exception of the hour between 14:00 and 15:00. During this time window, Subject 9 is most often proximate to Subject 4.

Models for location & activity


Human life is imbued with routine across time scales
From minute-to-minute routines to yearly patterns

There is inherent randomness present among the routines
Use of the information entropy metric to quantify how predictable a routine is

A low-entropy subject's daily distribution of home/work transitions.


The most likely location of the subject: Work, Home, Elsewhere, and No Signal. While the subject's state sporadically jumps to No Signal, the other states occur with very regular frequency. This is confirmed by the Bluetooth encounters plotted below representing the structured working schedule of the low-entropy subject.

A low-entropy subject's daily distribution of encountered Bluetooth devices.

Entropy across demographics


Entropy, H(x), was calculated from the {work, home, no signal, elsewhere} set of behaviors for 100 samples of a 7-day period. The Media Lab freshmen have the least predictable schedules, which makes sense because they come to the lab on a much less regular basis. The staff and faculty have the least entropic schedules, typically adhering to a consistent work routine.
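As a rough illustration of the entropy metric, the sketch below computes Shannon entropy, H(x) = -sum p(x) log2 p(x), over a sequence of location labels. The hourly sampling and the example sequences are assumptions made for illustration; the slides do not specify how the 7-day samples were discretized.

```python
import math
from collections import Counter

STATES = ("work", "home", "no signal", "elsewhere")

def shannon_entropy(samples):
    """Shannon entropy in bits of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical hourly labels for one day.
regular = ["home"] * 9 + ["work"] * 8 + ["home"] * 7  # strict routine
erratic = list(STATES) * 6                            # all states equally likely

print(shannon_entropy(regular))  # low: only two states, heavily skewed
print(shannon_entropy(erratic))  # 2 bits, the maximum for four states
```

A subject who spends all of the time in one state has an entropy of 0 bits, while a subject who is equally likely to be in any of the four states reaches the maximum of 2 bits.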

User modeling
The role of time in predicting user behavior is very clear
Uses an HMM trained with EM on 1 month of data
95% accuracy achieved
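To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding over the home/work/elsewhere states. It is not the authors' model: the probabilities are hand-set rather than learned with EM, the observation symbols ("A", "B", "C" standing for clusters of cell towers) are hypothetical, and the dependence on time of day is left out.

```python
import math

STATES = ("home", "work", "elsewhere")

def viterbi(observations, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete HMM (computed in log space)."""
    # best[t][s] = (log-probability of the best path ending in s at time t, previous state)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None) for s in STATES}]
    for obs in observations[1:]:
        row = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda item: item[1],
            )
            row[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(row)
    # Backtrack from the most likely final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]

# Hand-set, hypothetical parameters (all non-zero so the logarithms are defined).
start_p = {"home": 0.6, "work": 0.3, "elsewhere": 0.1}
trans_p = {s: {t: (0.7 if s == t else 0.15) for t in STATES} for s in STATES}
emit_p = {
    "home": {"A": 0.8, "B": 0.1, "C": 0.1},
    "work": {"A": 0.1, "B": 0.8, "C": 0.1},
    "elsewhere": {"A": 0.2, "B": 0.2, "C": 0.6},
}

obs = ["A", "A", "B", "B", "B"]
print(viterbi(obs, start_p, trans_p, emit_p))
```

In a trained model, the transition and emission tables would be estimated from the month of logged data instead of being written by hand.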

Mobile Usage Pattern


35% of subjects use the clock application regularly
Yet it takes 10 keystrokes to open the application
Used more at home

Sophisticated features see little use
Snake is used as much as the elaborate media player

Average application usage in three locations (other, work, and home) for 100 subjects.
The x-axis displays the fraction of time each application is used, as a function of total application usage. For example, the usage at home of the clock application comprises almost 3% of the total times the phone is used. The phone application itself comprises more than 80% of the total usage and was not included in this figure

Data characterization and validation


Data stored on a flash memory card
Flash memory cards have a finite number of read-write cycles

Frequent updates led to corruption of memory cards


10 cards were lost

Later increments were done in RAM and final logs were written to the card
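The fix described above (accumulate updates in RAM, write to the card rarely) can be sketched as a buffered logger. This is an illustrative reconstruction, not the logging application used in the study: the `BufferedLog` class, the `flush_every` threshold, and the in-memory sink standing in for the flash card are all assumptions.

```python
import io

class BufferedLog:
    """Accumulate log records in RAM and write them to storage only on flush."""

    def __init__(self, sink, flush_every=100):
        self.sink = sink              # any file-like object with write()
        self.flush_every = flush_every
        self.buffer = []
        self.writes = 0               # number of write operations issued to the sink

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.sink.write("\n".join(self.buffer) + "\n")
        self.writes += 1
        self.buffer.clear()

sink = io.StringIO()                  # stands in for a file on the memory card
log = BufferedLog(sink, flush_every=100)
for i in range(250):
    log.append(f"scan {i}")
log.flush()                           # write whatever is left at shutdown
print(log.writes)                     # far fewer writes than records
```

The trade-off is that records still in the buffer are lost if the application crashes before a flush, so the threshold balances card wear against potential data loss.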

Bluetooth errors
Several technical issues in verifying the accuracy of collected data
10 m range with the ability to penetrate walls
Periodic scans miss short proximity events
A device may not be discovered (1% to 3%)
Application crashes (once every three days)
Redundancy could be leveraged

Most of the time, the above problems were identified as noise


Logs help in finding anomalies

Human-induced errors
Two main errors
Phone being off
Battery exhausted
Explicit turn-off
1/5 of users do it regularly: classrooms, night, movies
Log is time-stamped before the turn-off

Separated from user


Phone is on but not carried by the user
More severe problem

Human-induced errors
Forgetting phone
30% claim never to forget it
40% claim to forget it once a month
30% claim to forget it once a week

A forgotten-phone classifier
Identifying a forgotten phone is challenging


Subject could be sick
Subject could have casually moved beyond 10 m of the phone
Not enough unique features

Missing data
Major causes
Data corruption
Powered-off devices

Logs account for 85.3% of the time


<5%: data corruption
Rest: powered-off devices by 1/5th of users
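A coverage figure like the one above can be computed by taking the union of the logged intervals and dividing by the length of the study window. The sketch below assumes logs are available as (start, end) pairs in hours; that representation and the example numbers are hypothetical, not taken from the dataset.

```python
def coverage(intervals, study_start, study_end):
    """Fraction of [study_start, study_end) covered by the union of logged intervals."""
    # Clip each interval to the study window and sort by start time.
    clipped = sorted(
        (max(s, study_start), min(e, study_end))
        for s, e in intervals
        if e > study_start and s < study_end
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or touching interval: extend the current run.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (study_end - study_start)

# Hypothetical day: logging stops from hour 12 to hour 14 (phone switched off).
logged = [(0, 8), (6, 12), (14, 24)]
print(coverage(logged, 0, 24))
```

Merging overlapping intervals before summing avoids counting the same hour twice when log files overlap.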

Surveys
Subjects were also surveyed about their social network
For senior students
High correlation between logged BTID proximity data and dyadic self-reports

For incoming students


No significant correlation

Community structure
Human landmarks
Who the user will meet can be guessed

Relationship inference
Nature of association can be inferred

Used GMM for clustering

Proximity Frequency

Proximity networks
Different than the organizational structure
Structured around the faculty director

Hub-and-spoke with changing roles
Proximity network data is extremely dynamic and sparse
Deadlines bring more reliance on the support of the group
Exploring dynamics of a group in response to both external and internal stimuli
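A proximity network of this kind can be built by counting how often each pair of subjects appears together in the Bluetooth scan logs. The following is a minimal sketch under stated assumptions: the (owner, BTID) log format, the BTID strings, and the subject labels are hypothetical, and devices that do not belong to a study subject are simply ignored.

```python
from collections import Counter

def proximity_counts(scan_log, known_btids):
    """Count Bluetooth sightings per unordered pair of study subjects."""
    counts = Counter()
    for owner, seen_btid in scan_log:
        other = known_btids.get(seen_btid)
        if other is None or other == owner:
            continue  # unknown device (for example, a stranger's phone) or the owner's own
        counts[frozenset((owner, other))] += 1
    return counts

# Hypothetical mapping from BTID to subject, and a few scan records.
known_btids = {"bt:04": "s4", "bt:09": "s9", "bt:07": "s7"}
scan_log = [
    ("s4", "bt:09"),
    ("s9", "bt:04"),
    ("s4", "bt:09"),
    ("s4", "bt:ff"),  # not a study participant
    ("s9", "bt:07"),
]

edges = proximity_counts(scan_log, known_btids)
for pair, weight in edges.items():
    print(sorted(pair), weight)
```

Computing these counts separately for successive time windows would expose the dynamics noted above, such as edges strengthening in the days before a deadline.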

Proximity networks
People's free time and schedules shift dramatically to meet deadlines and project goals
Spending much of the night in lab just before the event

How the aggregate work cycles expand in reaction to global deadlines


Visit of sponsors

Conclusions
First paper to log data at such a magnitude and depth
Provides ethnographic studies, individual user modeling, group user modeling
