
Data Wrangling with pandas Cheat Sheet
http://pandas.pydata.org

Tidy Data: A foundation for wrangling in pandas

In a tidy data set:
Each variable is saved in its own column.
Each observation is saved in its own row.

Tidy data complements pandas's vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.
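As a small illustration of the tidy layout described above, the sketch below builds a tidy table and applies one vectorized operation to it. The column names and values are made up for the example and are not from the cheat sheet.

```python
import pandas as pd

# Hypothetical tidy data set: one row per observation, one column per variable.
tidy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "length": [5.1, 4.9, 6.3],
    "width": [3.5, 3.0, 3.3],
})

# A vectorized operation works on whole columns at once and is aligned
# row by row, so each result stays attached to its own observation.
tidy["area"] = tidy["length"] * tidy["width"]

print(tidy)
```

Because every variable is a column, the new variable is a single expression over columns, and no row bookkeeping is needed.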
Reshaping Data: Change the layout of a data set

pd.melt(df)
Gather columns into rows.

df.pivot(columns='var', values='val')
Spread rows into columns.

pd.concat([df1,df2])
Append rows of DataFrames.

pd.concat([df1,df2], axis=1)
Append columns of DataFrames.

df.sort_values('mpg')
Order rows by values of a column (low to high).

df.sort_values('mpg',ascending=False)
Order rows by values of a column (high to low).

df.rename(columns = {'y':'year'})
Rename the columns of a DataFrame.

df.sort_index()
Sort the index of a DataFrame.

df.reset_index()
Reset index of DataFrame to row numbers, moving index to columns.

df.drop(['Length','Height'], axis=1)
Drop columns from DataFrame.

Syntax: Creating DataFrames

       a   b   c
    1  4   7   10
    2  5   8   11
    3  6   9   12

df = pd.DataFrame(
        {"a" : [4, 5, 6],
         "b" : [7, 8, 9],
         "c" : [10, 11, 12]},
        index = [1, 2, 3])
Specify values for each column.

df = pd.DataFrame(
        [[4, 7, 10],
         [5, 8, 11],
         [6, 9, 12]],
        index=[1, 2, 3],
        columns=['a', 'b', 'c'])
Specify values for each row.
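The two constructors above should produce the same table, and the reshaping calls can be tried on it directly. The following is a minimal sketch of that; the variable names (`by_column`, `by_row`, `long_form`, and so on) are chosen for illustration only.

```python
import pandas as pd

# Column-wise construction: a dict mapping column names to values.
by_column = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3])

# Row-wise construction: a list of rows plus explicit column names.
by_row = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]],
    index=[1, 2, 3],
    columns=["a", "b", "c"])

# Both constructors describe the same data.
same = by_column.equals(by_row)

# Gather the three columns into 'variable'/'value' rows (wide to long).
long_form = pd.melt(by_column)

# Keep the original index as an id column so the long table can be
# spread back into columns with pivot.
long_with_id = pd.melt(by_column.reset_index(), id_vars="index")
wide_again = long_with_id.pivot(index="index", columns="variable", values="value")

# Append rows, then append columns (a suffix keeps column names unique).
stacked = pd.concat([by_column, by_row])
side_by_side = pd.concat([by_column, by_row.add_suffix("_2")], axis=1)
```

Note that `pd.melt` discards the index unless it is first moved into a column, which is why the round trip goes through `reset_index()`.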

    n  v    a  b  c
    d  1    4  7  10
       2    5  8  11
    e  2    6  9  12

df = pd.DataFrame(
        {"a" : [4, 5, 6],
         "b" : [7, 8, 9],
         "c" : [10, 11, 12]},
        index = pd.MultiIndex.from_tuples(
            [('d',1),('d',2),('e',2)],
            names=['n','v']))
Create DataFrame with a MultiIndex.

Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
        .rename(columns={
            'variable' : 'var',
            'value' : 'val'})
        .query('val >= 200')
     )

Subset Observations (Rows)

df[df.Length > 7]
Extract rows that meet logical criteria.

df.drop_duplicates()
Remove duplicate rows (only considers columns).

df.head(n)
Select first n rows.

df.tail(n)
Select last n rows.

df.sample(frac=0.5)
Randomly select fraction of rows.

df.sample(n=10)
Randomly select n rows.

df.iloc[10:20]
Select rows by position.

df.nlargest(n, 'value')
Select and order top n entries.

df.nsmallest(n, 'value')
Select and order bottom n entries.

Logic in Python (and pandas)

    <     Less than                  !=                           Not equal to
    >     Greater than               df.column.isin(values)       Group membership
    ==    Equals                     pd.isnull(obj)               Is NaN
    <=    Less than or equals        pd.notnull(obj)              Is not NaN
    >=    Greater than or equals     &,|,~,^,df.any(),df.all()    Logical and, or, not, xor, any, all

Subset Variables (Columns)

df[['width','length','species']]
Select multiple columns with specific names.

df['width'] or df.width
Select single column with specific name.

df.filter(regex='regex')
Select columns whose name matches regular expression regex.

df.loc[:,'x2':'x4']
Select all columns between x2 and x4 (inclusive).

df.iloc[:,[1,2,5]]
Select columns in positions 1, 2 and 5 (first column is 0).

df.loc[df['a'] > 10, ['a','c']]
Select rows meeting logical condition, and only the specific columns.

regex (Regular Expressions) Examples

    '\.'                 Matches strings containing a period '.'
    'Length$'            Matches strings ending with word 'Length'
    '^Sepal'             Matches strings beginning with the word 'Sepal'
    '^x[1-5]$'           Matches strings beginning with 'x' and ending with 1,2,3,4,5
    '^(?!Species$).*'    Matches strings except the string 'Species'
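The row selectors, column selectors, logic operators and method chaining above can be combined in a few lines. The sketch below uses a made-up four-row table with iris-style column names; the data and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; column names chosen to exercise the selectors.
df = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3, 4.9],
    "Sepal.Width": [3.5, 3.2, 3.3, 3.0],
    "Species": ["setosa", "versicolor", "virginica", "setosa"],
})

# Rows meeting a logical condition; conditions combine with & | ~,
# and each condition needs its own parentheses.
long_rows = df[(df["Sepal.Length"] > 5) & ~df["Species"].isin(["virginica"])]

# Columns whose name matches a regular expression.
sepal_cols = df.filter(regex="^Sepal")
not_species = df.filter(regex="^(?!Species$).*")

# Rows by condition and columns by label in a single .loc call.
subset = df.loc[df["Sepal.Width"] > 3.1, ["Sepal.Length", "Species"]]

# Method chaining: each step returns a DataFrame for the next one.
chained = (pd.melt(df, id_vars="Species")
           .rename(columns={"variable": "var", "value": "val"})
           .query("val >= 5"))
```

Wrapping the chain in parentheses lets each method sit on its own line without backslash continuations.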
http://pandas.pydata.org/ This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) Written by Irv Lustig, Princeton Consultants
Summarize Data

df['w'].value_counts()
Count number of rows with each unique value of variable.

len(df)
# of rows in DataFrame.

df['w'].nunique()
# of distinct values in a column.

df.describe()
Basic descriptive statistics for each column (or GroupBy).

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

    sum()                    Sum values of each object.
    count()                  Count non-NA/null values of each object.
    median()                 Median value of each object.
    quantile([0.25,0.75])    Quantiles of each object.
    apply(function)          Apply function to each object.
    min()                    Minimum value in each object.
    max()                    Maximum value in each object.
    mean()                   Mean value of each object.
    var()                    Variance of each object.
    std()                    Standard deviation of each object.

Handling Missing Data

df.dropna()
Drop rows with any column having NA/null data.

df.fillna(value)
Replace all NA/null data with value.

Make New Columns

df.assign(Area=lambda df: df.Length*df.Height)
Compute and append one or more new columns.

df['Volume'] = df.Length*df.Height*df.Depth
Add single column.

pd.qcut(df.col, n, labels=False)
Bin column into n buckets.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

    max(axis=1)                 Element-wise max.
    min(axis=1)                 Element-wise min.
    clip(lower=-10,upper=10)    Trim values at input thresholds.
    abs()                       Absolute value.

Group Data

df.groupby(by="col")
Return a GroupBy object, grouped by values in column named "col".

df.groupby(level="ind")
Return a GroupBy object, grouped by values in index level named "ind".

All of the summary functions listed above can be applied to a group. Additional GroupBy functions:

    size()           Size of each group.
    agg(function)    Aggregate group using function.

The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.

    shift(1)                Copy with values shifted down by 1 row.
    shift(-1)               Copy with values shifted up by 1 row.
    rank(method='dense')    Ranks with no gaps.
    rank(method='min')      Ranks. Ties get min rank.
    rank(pct=True)          Ranks rescaled to interval [0, 1].
    rank(method='first')    Ranks. Ties go to first value.
    cumsum()                Cumulative sum.
    cummax()                Cumulative max.
    cummin()                Cumulative min.
    cumprod()               Cumulative product.

Windows

df.expanding()
Return an Expanding object allowing summary functions to be applied cumulatively.

df.rolling(n)
Return a Rolling object allowing summary functions to be applied to windows of length n.

Plotting

df.plot.hist()
Histogram for each column.

df.plot.scatter(x='w',y='h')
Scatter chart using pairs of points.

Combine Data Sets

    adf          bdf
    x1  x2       x1  x3
    A   1        A   T
    B   2        B   F
    C   3        D   T

Standard Joins

pd.merge(adf, bdf, how='left', on='x1')
Join matching rows from bdf to adf.

    x1  x2  x3
    A   1   T
    B   2   F
    C   3   NaN

pd.merge(adf, bdf, how='right', on='x1')
Join matching rows from adf to bdf.

    x1  x2   x3
    A   1.0  T
    B   2.0  F
    D   NaN  T

pd.merge(adf, bdf, how='inner', on='x1')
Join data. Retain only rows in both sets.

    x1  x2  x3
    A   1   T
    B   2   F

pd.merge(adf, bdf, how='outer', on='x1')
Join data. Retain all values, all rows.

    x1  x2   x3
    A   1    T
    B   2    F
    C   3    NaN
    D   NaN  T

Filtering Joins

adf[adf.x1.isin(bdf.x1)]
All rows in adf that have a match in bdf.

    x1  x2
    A   1
    B   2

adf[~adf.x1.isin(bdf.x1)]
All rows in adf that do not have a match in bdf.

    x1  x2
    C   3

Set-like Operations

    ydf          zdf
    x1  x2       x1  x2
    A   1        B   2
    B   2        C   3
    C   3        D   4

pd.merge(ydf, zdf)
Rows that appear in both ydf and zdf (Intersection).

    x1  x2
    B   2
    C   3

pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf (Union).

    x1  x2
    A   1
    B   2
    C   3
    D   4

(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(['_merge'],axis=1))
Rows that appear in ydf but not zdf (Setdiff).

    x1  x2
    A   1