
Introduction to plyr

Garrett Grolemund
PhD Student / Rice University
Department of Statistics

Sept 2010
1. US baby names data
2. Slicing & dicing data
3. Group-wise operations
4. Challenges
5. More plyr functions, including an
example of llply and ldply
Baby names
Top 1000 male and female baby
names in the US, from 1880 to
2008.
258,000 records (1000 * 2 * 129)
But only four variables: year,
name, sex and prop.

CC BY http://www.flickr.com/photos/the_light_show/2586781132
Getting started

library(plyr)
library(stringr)
options(stringsAsFactors = FALSE)
bnames <- read.csv("baby-names.csv")
> head(bnames, 15)
   year    name     prop sex
1  1880    John 0.081541 boy
2  1880 William 0.080511 boy
3  1880   James 0.050057 boy
4  1880 Charles 0.045167 boy
5  1880  George 0.043292 boy
6  1880   Frank 0.027380 boy
7  1880  Joseph 0.022229 boy
8  1880  Thomas 0.021401 boy
9  1880   Henry 0.020641 boy
10 1880  Robert 0.020404 boy
11 1880  Edward 0.019965 boy
12 1880   Harry 0.018175 boy
13 1880  Walter 0.014822 boy
14 1880  Arthur 0.013504 boy
15 1880    Fred 0.013251 boy

> tail(bnames, 15)
       year     name     prop  sex
257986 2008   Neveah 0.000130 girl
257987 2008   Amaris 0.000129 girl
257988 2008 Hadassah 0.000129 girl
257989 2008    Dania 0.000129 girl
257990 2008   Hailie 0.000129 girl
257991 2008   Jamiya 0.000129 girl
257992 2008    Kathy 0.000129 girl
257993 2008   Laylah 0.000129 girl
257994 2008     Riya 0.000129 girl
257995 2008     Diya 0.000128 girl
257996 2008 Carleigh 0.000128 girl
257997 2008    Iyana 0.000128 girl
257998 2008   Kenley 0.000127 girl
257999 2008   Sloane 0.000127 girl
258000 2008  Elianna 0.000127 girl
Slicing and
dicing
Function    Package
subset      base
summarise   plyr
transform   base
arrange     plyr
# They all have similar syntax. The first argument
# is a data frame, and all other arguments are
# interpreted in the context of that data frame

?subset
?transform
?summarise
?arrange
df                 result
color  value       color  value
blue   1           blue   1
black  2           blue   3
blue   3           blue   4
blue   4
black  5

subset(df, color == "blue")


df                 result
color  value       color  value  double
blue   1           blue   1      2
black  2           black  2      4
blue   3           blue   3      6
blue   4           blue   4      8
black  5           black  5      10

transform(df, double = 2 * value)


df                 result
color  value       double
blue   1           2
black  2           4
blue   3           6
blue   4           8
black  5           10

summarise(df, double = 2 * value)


df                 result
color  value       total
blue   1           15
black  2
blue   3
blue   4
black  5

summarise(df, total = sum(value))


df                 result
color  value       color  value
4      1           1      2
1      2           2      5
5      3           3      4
3      4           4      1
2      5           5      3

arrange(df, color)
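The slides never define df, so here is a minimal sketch that builds a data frame like the one in the colour/value tables and runs the four functions on it. The object names (blue, doubled, total, sorted) are only illustrative.

```r
library(plyr)

# Toy data frame modelled on the colour/value tables (illustrative only)
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

blue    <- subset(df, color == "blue")        # keep only matching rows
doubled <- transform(df, double = 2 * value)  # add a column, keep the rest
total   <- summarise(df, total = sum(value))  # collapse to a one-row summary
sorted  <- arrange(df, color)                 # reorder rows by color
```

In every call the bare names color and value are looked up inside df, which is what "interpreted in the context of that data frame" means.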
Your turn
Extract your name from the dataset and
save into a new dataset called myname.
Plot the trend over time with
qplot(year, prop, data = myname,
geom = "line", color = sex)

Extract the average prop value for the entire
data set.
Order myname from the year with the
highest prop to the year with the lowest.
library(ggplot2)  # provides qplot()

myname <- subset(bnames, name == "Garrett")
qplot(year, prop, data = myname, geom = "line",
  colour = sex)

summarise(bnames, avg_prop = mean(prop))

arrange(myname, desc(prop))
Brainstorm

Thinking about the data, what are some
of the trends that you might want to
explore? What additional variables would
you need to create? What other data
sources might you want to use?
Pair up and brainstorm for 2 minutes.
Some ideas
• First/last letter
• Length
• Number/percent of vowels
• Biblical names?
• Hurricanes?
• Rank
• Ecdf (how many babies have a name in the top 2, 3, 5, 100 etc)
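# Extract the n-th letter of x; a negative n counts from the end
letter <- function(x, n = 1) {
  if (n < 0) {
    nc <- nchar(x)
    n <- nc + n + 1
  }
  tolower(substr(x, n, n))
}

# Count vowels; lower-case first so an initial capital vowel is counted
vowels <- function(x) {
  nchar(gsub("[^aeiou]", "", tolower(x)))
}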
bnames <- transform(bnames,
  first = letter(name, 1),
  last = letter(name, -1),
  vowels = vowels(name),
  length = nchar(name)
)

summarise(bnames,
  min_length = min(length),
  max_length = max(length)
)

subset(bnames, length == 2)
subset(bnames, length == 10)
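The negative-index trick in letter() is easy to miss, so here is a small sketch that repeats the definition and calls it on a few names. The calls are illustrative; only the function itself comes from the slides.

```r
# Same helper as on the slide: a negative n counts from the end of the string
letter <- function(x, n = 1) {
  if (n < 0) {
    n <- nchar(x) + n + 1   # -1 -> last position, -2 -> second to last
  }
  tolower(substr(x, n, n))
}

first_letter <- letter("Garrett", 1)               # "g"
last_letter  <- letter("Garrett", -1)              # "t"
second_last  <- letter(c("John", "Elianna"), -2)   # "h" "n"
```

Because nchar() and substr() are vectorised, the same function works on a whole column of names, which is how transform() uses it.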
Your turn
Create a new variable that contains the
first three (or four, or five) letters of each
name. Subset just the names that start
the same as yours. Plot these over time
with
qplot(year, prop, data = ?, geom = "line",
group = name)
bnames$first3 <- tolower(substr(bnames$name, 1, 3))

gar <- subset(bnames, first3 == "gar")

qplot(year, prop, data = gar, geom = "line",
  group = name)

gary <- subset(gar, name == "Gary")

qplot(year, prop, data = gary, geom = "line",
  colour = sex)

qplot(year, prop, data = gar, geom = "line",
  colour = sex) + facet_wrap(~ name)
Per-group
operations
Group-wise

What about group-wise transformations
or summaries? e.g. what if we want to
compute the rank of a name within a sex
and year?
This task is easy if we have a single year
& sex, but hard otherwise.

Take two minutes to sketch out an approach


one <- subset(bnames, sex == "boy" & year == 2008)
one$rank <- rank(-one$prop,
  ties.method = "first")

# or
one <- transform(one,
  rank = rank(-prop, ties.method = "first"))
head(one)

What if we want to transform
every sex and year?
# Split
pieces <- split(bnames,
  list(bnames$sex, bnames$year))

# Apply
results <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  piece <- pieces[[i]]
  piece <- transform(piece,
    rank = rank(-prop, ties.method = "first"))
  results[[i]] <- piece
}

# Combine
result <- do.call("rbind", results)
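The explicit loop can also be written with lapply(), which makes the three stages a little easier to see. This is only a sketch on made-up data: the toy data frame and its values are hypothetical stand-ins for bnames.

```r
# Hypothetical toy data standing in for bnames (values are made up)
toy <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008),
  name = c("Ann", "Bea", "Cat", "Ann", "Bea", "Cat"),
  prop = c(0.3, 0.5, 0.2, 0.1, 0.2, 0.7)
)

# Split: one data frame per year
pieces <- split(toy, toy$year)

# Apply: rank names within each piece, highest prop first
results <- lapply(pieces, function(piece) {
  transform(piece, rank = rank(-prop, ties.method = "first"))
})

# Combine: stack the pieces back into one data frame
ranked <- do.call("rbind", results)
rownames(ranked) <- NULL
```

The ranks restart at 1 within each year, which is the behaviour the full bnames example is after.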
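Split Apply Combine
[Figure: a data frame with columns x and y (a 2, a 4, b 0, b 5, c 5, c 10) is split by x into three pieces; one value of y is computed per piece (3, 2.5, 7.5, consistent with the mean); the results are combined into a single data frame (a 3, b 2.5, c 7.5).]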
# Or equivalently

bnames <- ddply(bnames, c("sex", "year"), transform,
  rank = rank(-prop, ties.method = "first"))
# Or equivalently, with each argument labelled

bnames <- ddply(
  bnames,              # input data
  c("sex", "year"),    # way to split up input
  transform,           # function to apply to each piece
  rank = rank(-prop, ties.method = "first"))  # 2nd argument to transform()
Summaries
In a similar way, we can use ddply() for
group-wise summaries.
There are many base R functions for
special cases. When available, they can
be much faster, but you have to know
they exist, and have to remember how to
use them.
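As a rough illustration of both points, the sketch below computes the same per-group mean with ddply() plus summarise and with tapply(), one of the base R special cases. The toy data and the column name avg_prop are made up for the example.

```r
library(plyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  sex  = c("boy", "boy", "girl", "girl"),
  prop = c(0.2, 0.4, 0.1, 0.7)
)

# plyr: one summary row per sex, returned as a data frame
by_plyr <- ddply(toy, "sex", summarise, avg_prop = mean(prop))

# base R special case: the same means, returned as a named array
by_base <- tapply(toy$prop, toy$sex, mean)
```

The ddply() form keeps the result as a data frame and extends to several summary columns by adding more named arguments.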
ddply + transform =
group-wise transformation

ddply + summarise =
per-group summaries

ddply + subset =
per-group subsets
Challenges
Warmups

Which names were most popular in 1999?
Work out the average proportion for each
name.
List the 10 names with the highest
proportions.
# Which names were most popular in 1999?
subset(bnames, year == 1999 & rank < 10)
# Restrict to 1999 first so that max(prop) is the 1999 maximum
subset(subset(bnames, year == 1999), prop == max(prop))

# Average usage
overall <- ddply(bnames, "name", summarise,
  prop = mean(prop))

# Top 10 names
head(arrange(overall, desc(prop)), 10)
Challenge 1

For each name, find the year in which it
was most popular, and the rank in that
year. (Hint: you might find which.max
useful).
Print all names that have been the most
popular name at least once.
most_pop <- ddply(bnames, "name", summarise,
  year = year[which.max(prop)],
  rank = min(rank))

# or, keeping the whole row for each name's peak year
most_pop <- ddply(bnames, "name", subset,
  prop == max(prop))

subset(most_pop, rank == 1)
Challenge 2

What name has been in the top 10 most
often?
(Hint: you'll have to do this in three steps.
Think about what they are before starting.)
top10 <- subset(bnames, rank <= 10)
counts <- count(top10, c("sex", "name"))

ddply(counts, "sex", subset, freq == max(freq))

head(arrange(counts, desc(freq)), 10)
More plyr
functions
Many problems involve splitting up a large
data structure, operating on each piece
and joining the results back together:

split-apply-combine
How you split up depends on the type of
input: arrays, data frames, lists
How you combine depends on the type of
output: arrays, data frames, lists,
nothing
Input \ Output       array   data frame   list    nothing
array                aaply   adply        alply   a_ply
data frame           daply   ddply        dlply   d_ply
list                 laply   ldply        llply   l_ply
n replicates         raply   rdply        rlply   r_ply
function arguments   maply   mdply        mlply   m_ply
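To make the naming scheme concrete (first letter is the input type, second letter is the output type), here is a small sketch that applies the same per-group sum to one data frame with three different output types. The toy data and object names are made up.

```r
library(plyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  g = c("a", "a", "b"),
  v = c(1, 3, 10)
)

# d*ply: the input is always a data frame; only the output type changes
as_df    <- ddply(toy, "g", summarise, total = sum(v))      # -> data frame
as_list  <- dlply(toy, "g", function(piece) sum(piece$v))   # -> list
as_array <- daply(toy, "g", function(piece) sum(piece$v))   # -> array
```

All three hold the same numbers; which one is convenient depends on what the next step expects.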
Base R equivalents, where they exist:

Input \ Output       array       data frame   list        nothing
array                apply       adply        alply       a_ply
data frame           daply       aggregate    by          d_ply
list                 sapply      ldply        lapply      l_ply
n replicates         replicate   rdply        replicate   r_ply
function arguments   mapply      mdply        mapply      m_ply
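For list input the correspondence with base R is easy to check directly. The following sketch compares lapply() with llply() and sapply() with laply() on a small made-up list.

```r
library(plyr)

# Hypothetical toy list (values are made up)
x <- list(a = 1:3, b = 4:6)

base_list <- lapply(x, sum)   # list -> list, base R
plyr_list <- llply(x, sum)    # list -> list, plyr

base_vec <- sapply(x, sum)    # list -> vector, base R
plyr_vec <- laply(x, sum)     # list -> array, plyr
```

The values agree in both pairs; the plyr versions mainly add a consistent interface across input and output types.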
Case study
GSHS
Each country has its own GSHS SPSS file.
How can we easily combine them all
together?
Run the following R code and try to
figure out what each line does.
Look at 04-gshs.r for answers.
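library(foreign)
library(plyr)
library(stringr)

# Find every per-country file, keeping the directory in the path
files <- dir("gshs", pattern = "data-.*\\.sav",
  full.names = TRUE)
# Name each path by the part of the file name between "data-" and ".sav"
names(files) <- str_match(files, "data-(.*)\\.sav")[, 2]

# Read each file into a list, then stack them into one data frame
spss <- llply(files, read.spss)
gshs <- ldply(spss, as.data.frame)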
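The reason for naming the files is that ldply() carries the names of its input list into the result. The sketch below shows this on hypothetical stand-ins for the per-country data: the names aa and bb and the columns age and q1 are invented for illustration and are not taken from the GSHS files.

```r
library(plyr)

# Hypothetical stand-ins for what read.spss() returns for two countries
spss <- list(
  aa = list(age = c(13, 14), q1 = c(1, 2)),
  bb = list(age = c(15),     q1 = c(3))
)

# Convert each element to a data frame and stack them;
# the list names end up in an .id column
gshs <- ldply(spss, as.data.frame)
```

Each row of the combined data frame can then be traced back to the file it came from through the .id column.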
Resources

http://had.co.nz/plyr/plyr-intro-090510.pdf
http://groups.google.com/group/manipulatr/
This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.
