Garrett Grolemund
PhD Student / Rice University
Department of Statistics
Sept 2010
1. US baby names data
2. Slicing & dicing data
3. Group-wise operation
4. Challenges
5. More plyr functions, including an
example of llply and ldply
Baby names
Top 1000 male and female baby
names in the US, from 1880 to
2008.
258,000 records (1000 * 2 * 129)
But only four variables: year,
name, sex and prop.
CC BY http://www.flickr.com/photos/the_light_show/2586781132
Getting started
library(plyr)
library(stringr)
options(stringsAsFactors = FALSE)
bnames <- read.csv("baby-names.csv")
> head(bnames, 15)
   year    name     prop sex
1  1880    John 0.081541 boy
2  1880 William 0.080511 boy
3  1880   James 0.050057 boy
4  1880 Charles 0.045167 boy
5  1880  George 0.043292 boy
6  1880   Frank 0.027380 boy
7  1880  Joseph 0.022229 boy
8  1880  Thomas 0.021401 boy
9  1880   Henry 0.020641 boy
10 1880  Robert 0.020404 boy
11 1880  Edward 0.019965 boy
12 1880   Harry 0.018175 boy
13 1880  Walter 0.014822 boy
14 1880  Arthur 0.013504 boy
15 1880    Fred 0.013251 boy

> tail(bnames, 15)
       year     name     prop  sex
257986 2008   Neveah 0.000130 girl
257987 2008   Amaris 0.000129 girl
257988 2008 Hadassah 0.000129 girl
257989 2008    Dania 0.000129 girl
257990 2008   Hailie 0.000129 girl
257991 2008   Jamiya 0.000129 girl
257992 2008    Kathy 0.000129 girl
257993 2008   Laylah 0.000129 girl
257994 2008     Riya 0.000129 girl
257995 2008     Diya 0.000128 girl
257996 2008 Carleigh 0.000128 girl
257997 2008    Iyana 0.000128 girl
257998 2008   Kenley 0.000127 girl
257999 2008   Sloane 0.000127 girl
258000 2008  Elianna 0.000127 girl
Slicing and dicing
Function Package
subset base
summarise plyr
transform base
arrange plyr
# They all have similar syntax. The first argument
# is a data frame, and all other arguments are
# interpreted in the context of that data frame
?subset
?transform
?summarise
?arrange
Input (df)        Rows where color is blue
color  value      color  value
blue   1          blue   1
black  2          blue   3
blue   3          blue   4
blue   4
black  5
arrange(df, color)
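The four functions can be tried side by side on the small color/value data frame shown above. This is a minimal sketch: the data frame is rebuilt by hand, and the result names (`blue`, `doubled`, `total`, `sorted`) are chosen only for illustration.

```r
library(plyr)

# Toy data frame with the same rows as the color/value example
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

blue    <- subset(df, color == "blue")          # keep rows matching a condition
doubled <- transform(df, doubled = 2 * value)   # add or modify columns
total   <- summarise(df, total = sum(value))    # collapse to a one-row summary
sorted  <- arrange(df, color)                   # reorder rows by a column

sorted
```

In each call the column names (`color`, `value`) are looked up inside `df`, which is why no `df$` prefix is needed.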
Your turn
Extract your name from the dataset and
save into a new dataset called myname.
Plot the trend over time with
qplot(year, prop, data = myname,
  geom = "line", color = sex)
arrange(myname, desc(prop))
Brainstorm
summarise(bnames,
  min_length = min(length),
  max_length = max(length)
)
subset(bnames, length == 2)
subset(bnames, length == 10)
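The calls above rely on a `length` column whose creation is not shown here. As a hedged sketch, assuming `length` holds the number of letters in each name, it could be built with `transform()` and `nchar()`. The `toy` data frame below is made up and stands in for `bnames`.

```r
library(plyr)

# Made-up stand-in for bnames (not the real data)
toy <- data.frame(
  name = c("Al", "Ty", "Christopher", "Maximilian", "Anna"),
  prop = c(0.001, 0.002, 0.010, 0.0005, 0.020)
)

# Assumption: length = number of characters in the name
toy <- transform(toy, length = nchar(as.character(name)))

lens <- summarise(toy,
  min_length = min(length),
  max_length = max(length)
)
lens

subset(toy, length == 2)    # the shortest names in the toy data
subset(toy, length == 10)   # names with exactly ten letters
```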
Your turn
Create a new variable that contains the
first three (or four, or five) letters of each
name. Subset just the names that start
the same as yours. Plot these over time
with
qplot(year, prop, data = ?, geom = "line",
group = name)
bnames$first3 <- tolower(substr(bnames$name, 1, 3))
# or, equivalently, with transform()
bnames <- transform(bnames, first3 = tolower(substr(name, 1, 3)))
one <- transform(one,
  rank = rank(-prop, ties.method = "first"))
head(one)
# Apply
results <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  piece <- pieces[[i]]
  piece <- transform(piece,
    rank = rank(-prop, ties.method = "first"))
  results[[i]] <- piece
}
# Combine
result <- do.call("rbind", results)
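The split step that creates `pieces` is not shown above. The sketch below runs all three steps by hand in base R on a made-up data frame. The grouping variable (`year`) is an assumption made for illustration, and `lapply()` stands in for the explicit `for` loop.

```r
# Made-up stand-in for bnames (not the real data)
toy <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008),
  name = c("Ann", "Bea", "Cat", "Ann", "Bea", "Cat"),
  prop = c(0.03, 0.01, 0.02, 0.01, 0.04, 0.02)
)

# Split: one data frame per group (grouping by year is an assumption)
pieces <- split(toy, toy$year)

# Apply: rank names within each piece, most popular first
results <- lapply(pieces, function(piece) {
  transform(piece, rank = rank(-prop, ties.method = "first"))
})

# Combine: stack the pieces back into one data frame
result <- do.call("rbind", results)
rownames(result) <- NULL
result
```

Ranks restart at 1 in every piece, which is the point of applying the function group by group rather than to the whole data frame at once.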
Split Apply Combine
Input (x, y): a 2, a 4, b 0, b 5, c 5, c 10

Split and apply (mean of y in each piece):
piece  y values  result
a      2, 4      3
b      0, 5      2.5
c      5, 10     7.5

Combine:
x  y
a  3
b  2.5
c  7.5
# Or equivalently
2nd argument to transform()
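The `ddply()` call this slide refers to is not reproduced here. As a hedged sketch on made-up data, the manual split, apply and combine steps collapse into one call. Grouping by `year` is an assumption. Everything after the function name is passed on to that function, so `rank = ...` reaches `transform()` as its second argument, after the piece of data.

```r
library(plyr)

# Made-up stand-in for bnames (not the real data)
toy <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008),
  name = c("Ann", "Bea", "Cat", "Ann", "Bea", "Cat"),
  prop = c(0.03, 0.01, 0.02, 0.01, 0.04, 0.02)
)

# ddply(data, grouping variables, function, further arguments for the function)
ranked <- ddply(toy, "year", transform,
  rank = rank(-prop, ties.method = "first"))
ranked
```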
Summaries
In a similar way, we can use ddply() for
group-wise summaries.
There are many base R functions for
special cases. When available, they can
be much faster, but you have to know
they exist, and have to remember how to
use them.
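A few of those base R special cases, sketched on the six-row x/y data from the split-apply-combine diagram. The choice of `tapply()`, `aggregate()` and `ave()` is illustrative, not a complete list.

```r
# The x/y data from the split-apply-combine diagram
toy <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(2, 4, 0, 5, 5, 10)
)

# Group-wise summary as a named vector: one value per group
group_means <- tapply(toy$y, toy$x, mean)

# The same summary as a data frame
means_df <- aggregate(y ~ x, data = toy, FUN = mean)

# Group-wise transformation: one value per original row
toy$group_mean <- ave(toy$y, toy$x, FUN = mean)

group_means
means_df
toy
```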
ddply + transform =
group-wise transformation
ddply + summarise =
per-group summaries
ddply + subset =
per-group subsets
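The three combinations can be compared on one small example. This is a sketch on made-up data: the column names mirror `bnames`, but the rows and the grouping by `year` are assumptions.

```r
library(plyr)

# Made-up stand-in for bnames (not the real data)
toy <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008),
  name = c("Ann", "Bea", "Cat", "Ann", "Bea", "Cat"),
  prop = c(0.03, 0.01, 0.02, 0.01, 0.04, 0.02)
)

# ddply + transform: same number of rows, new column computed per group
shares <- ddply(toy, "year", transform, share = prop / sum(prop))

# ddply + summarise: one row per group
per_year <- ddply(toy, "year", summarise, total = sum(prop), top = max(prop))

# ddply + subset: a selection of rows from each group
top_each <- ddply(toy, "year", subset, prop == max(prop))

per_year
top_each
```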
Challenges
Warmups
# Average usage
overall <- ddply(bnames, "name", summarise,
  prop = mean(prop))
# Top 10 names
head(arrange(overall, desc(prop)), 10)
Challenge 1
subset(most_pop, rank == 1)
Challenge 2
split-apply-combine
How you split up depends on the type of
input: arrays, data frames, lists
How you combine depends on the type of
output: arrays, data frames, lists,
nothing
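In plyr the function name encodes both choices: the first letter gives the input type and the second the output type (a = array, d = data frame, l = list, _ = nothing). A minimal sketch with a list as input, using the y values from the split-apply-combine diagram:

```r
library(plyr)

# A list with one numeric vector per group
pieces <- list(a = c(2, 4), b = c(0, 5), c = c(5, 10))

as_list  <- llply(pieces, mean)   # list in, list out
as_array <- laply(pieces, mean)   # list in, array out (here a plain vector)
as_df    <- ldply(pieces, function(y) data.frame(mean = mean(y)))  # list in, data frame out

as_df
```

Only the combine step differs between the three calls. The function applied to each piece is the same.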
Input \ Output        array    data frame   list     nothing
function arguments    maply    mdply        mlply    m_ply

Input \ Output        array    data frame   list     nothing
function arguments    mapply   mdply        mapply   m_ply
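With the m*ply functions, each row of the input supplies one set of arguments to the function. A short sketch, in which `arg_grid` and `add` are hypothetical names introduced only for this example, next to base R's `mapply()`:

```r
library(plyr)

# Each row is one set of arguments (hypothetical example data)
arg_grid <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Hypothetical helper whose argument names match the column names
add <- function(x, y) x + y

as_df   <- mdply(arg_grid, add)              # data frame: the arguments plus the result
as_list <- mlply(arg_grid, add)              # list with one element per row
base    <- mapply(add, arg_grid$x, arg_grid$y)  # base R: vectors passed in parallel

as_df
```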
Case study
GSHS
Each country has its own GSHS spss file.
How can we easily combine them all
together?
Run the following R code and try and
figure out what each line does.
Look at 04-gshs.r for answers.
library(foreign)
library(plyr)
library(stringr)
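The case-study script itself (04-gshs.r) is not reproduced here. As a hedged sketch of the general pattern, the code below lists a set of files, reads each one with a helper, and lets `ldply()` stack the results. Small CSV files written to a temporary directory stand in for the SPSS files, and `read_one` and the file names are hypothetical. For real SPSS files, `read.csv(path)` could be swapped for `foreign::read.spss(path, to.data.frame = TRUE)`.

```r
library(plyr)

# Stand-in data: two small CSV files in a temporary directory
dir <- file.path(tempdir(), "gshs-sketch")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, score = c(3, 4)),
  file.path(dir, "aa.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, score = c(5, 6, 7)),
  file.path(dir, "bb.csv"), row.names = FALSE)

# Split: one path per file
paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Apply: read one file and record which file each row came from
read_one <- function(path) {
  data <- read.csv(path)
  data$source <- sub("\\.csv$", "", basename(path))
  data
}

# Combine: ldply() row-binds the per-file data frames
combined <- ldply(paths, read_one)
combined
```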
http://had.co.nz/plyr/plyr-intro-090510.pdf
http://groups.google.com/group/manipulatr/
This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.