What is R?
The R statistical programming language is a free
Why R?
It's free!
It runs on a variety of platforms including Windows, Unix
and Mac-OS.
It provides an unparalleled platform for programming new
statistical methods in an easy and straightforward
manner.
It contains advanced statistical routines not yet available
in other packages.
It has state-of-the-art graphics capabilities.
R Overview
R is a comprehensive statistical and graphical
programming language and is a dialect of the S
language:
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
R Overview
You can enter commands one at a time at the command
R Overview
Most functionality is provided through built-in and user-
R Overview
A key skill to using R effectively is learning how to use the
R Interface
object.
Each object has a class.
This class describes what the object contains and what
each function does with it.
For instance, plot(x) produces different outputs depending
on whether x is a regression object or a vector.
The assignment symbol is "<-". Alternatively, the classical
"=" symbol can be used.
The two following statements are equivalent :
> a <- 2
> a = 2
(parentheses).
One can easily combine functions. For instance you can
directly type
> mean(rnorm(1000)^2)
The symbol "#" comments to the end of the line:
# This is a comment
5 + 7 # This is also a comment
Commands are normally separated by a newline. If you
want to put more than one statement on a line, you can
use the ";" delimiter.
a <- 1:10 ; mean(a)
Help command
> help(function_name) # for example, help(mean)
Sample programs
R can be used as a simple calculator and we can perform
Obtaining help
For each package you have a reference manual available
as an HTML file from within R or as a PDF on the CRAN
website.
>library(help="package_name")
You can search for help inside all loaded packages using
apropos() and find() looks for all the functions in the loaded
R Warning !
R is a case sensitive language.
FOO, Foo, and foo are three different objects
Basic Functions
ls() lists the objects in your workspace.
list.files() lists the files located in the working directory
rm() removes objects from your workspace; rm(list = ls()) removes them all
load("file.Rda")
...
# assume you want to save an object called 'df'
You can save an R session (all the objects in memory) and load
the session.
>save.image(file="~/Documents/Logiciels/R/test.rda")
>load("~/Documents/Logiciels/R/test.rda")
Defining a working directory. Note for Windows users : R uses
slash ("/") in the directory instead of backslash ("\").
> setwd("~/Desktop") # Sets working directory (character string enclosed in "...")
> getwd() # Returns current working directory
[1] "/Users/username/Desktop"
> dir() # Lists the content of the working directory
Integer
In order to create an integer variable in R,
we invoke the as.integer function.
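A minimal sketch of creating and checking an integer; the value 3 is a hypothetical example, and y is the variable name used in the text.

```r
# Hypothetical example value; y is the name the surrounding text refers to
y <- as.integer(3)
y
class(y)            # "integer"
is.integer(y)       # TRUE

# as.integer also truncates decimals and parses numeric strings
as.integer(3.14)    # 3
as.integer("5.27")  # 5
```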
We can be assured that y is indeed an integer by applying
Complex
A complex value in R is defined via the pure imaginary
value i.
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following produces NaN with a warning, as -1 is not a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
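To take the square root of a negative number, one option is to coerce it to a complex value first. A short sketch using the base function as.complex:

```r
# Coerce -1 to complex before taking the square root
z <- sqrt(as.complex(-1))
z          # 0+1i
class(z)   # "complex"

# Writing the value as a complex literal works the same way
sqrt(-1 + 0i)
```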
Logical
A logical value is often created via comparison between
variables.
> x = 1; y = 2 # sample values
> z = x > y # is x larger than y?
> z # print the logical value
[1] FALSE
> class(z) # print the class name of z
[1] "logical"
Standard logical operations are "&" (and), "|" (or), and "!"
(negation).
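The three operators can be tried on a pair of sample values; u and v are hypothetical names chosen for this sketch.

```r
u <- TRUE; v <- FALSE   # sample logical values
u & v    # FALSE (and)
u | v    # TRUE  (or)
!u       # FALSE (negation)

# The operators also work element by element on logical vectors
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)   # TRUE FALSE FALSE
```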
Character
A character object is used to represent string values in R.
We convert objects into character values with the
as.character() function:
> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
Two character values can be concatenated with the paste
function.
> fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"
VECTORS
R-Vectors
A vector is a sequence of data elements of the same basic type.
Length of a vector:
> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
Examples:
We can also store vectors:
> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) # store a vector
> Height # print the vector
[1] 168 177 177 177 178 172 165 171 178 170
> Height[2] # Print the second component
[1] 177
Vectors
> Height[2:5] # Print the 2nd, 3rd, 4th and 5th components
[1] 177 177 177 178
> (obs <- 1:10) # Define a vector as a sequence (1 to 10)
[1] 1 2 3 4 5 6 7 8 9 10
Question
Create two vectors, weight and height. Weight takes the values
50 to 200 and height takes the values 100 to 250. Then:
a. Calculate the Body Mass Index, defined as
weight/(height/100)^2.
b. Plot the data using the plot() function.
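One possible solution sketch. It assumes the vectors are integer sequences in steps of 1, that weight is in kilograms and height in centimetres (the units are not stated in the question), and that BMI is plotted against height.

```r
weight <- 50:200    # assumed: kg, 151 values in steps of 1
height <- 100:250   # assumed: cm, 151 values in steps of 1

# a. Body Mass Index, computed element by element
bmi <- weight / (height / 100)^2
head(bmi)

# b. Plot the result (the choice of axes is an assumption)
plot(height, bmi, xlab = "Height (cm)", ylab = "BMI")
```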
Combining Vectors
Vectors can be combined via the function c. For example,
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> c(n, s)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
Value Coercion
In the code snippet above, notice how the numeric values
are being coerced into character strings when the two
vectors are combined. This is necessary so as to maintain
the same primitive data type for members in the same
vector.
Vector Arithmetic
Arithmetic operations of vectors are performed member-by-member.
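A small sketch of element-wise arithmetic on two vectors of equal length; a and b are hypothetical sample vectors.

```r
a <- c(1, 3, 5, 7)   # sample vectors of equal length
b <- c(1, 2, 4, 8)

a + b   # 2 5 9 15
a - b   # 0 1 1 -1
a * b   # 1 6 20 56
a / b   # 1.000 1.500 1.250 0.875
5 * a   # 5 15 25 35 (a scalar is applied to every member)
```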
Vector Index
We retrieve values in a vector by declaring an index inside
indexing.
Logical index vectors are recycled to match the length of
the vector being indexed, and return elements
corresponding to index elements that are TRUE.
x = c(5,3,0,8,7,5,0,0,4,9,2,0)
x > 5 # Conditional expressions evaluate to logical vectors
x[x>5] # Logical vectors apply directly as index vectors (get elements of x where condition >5 is TRUE)
x[x>2 & x<5] # Get elements of x that are >2 AND <5
x[!x==0] # Get elements of x that are NOT 0. (Compare negative numeric indexing)
x[c(T,F)] # Get odd-numbered elements. Logical vectors are recycled (numeric and character vectors are not)
NA Values
The is.na function returns TRUE where elements are NA
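A brief sketch of locating and skipping missing values; the vector x is a hypothetical sample.

```r
x <- c(5, NA, 0, 8, NA)   # sample vector with two missing values

is.na(x)              # FALSE TRUE FALSE FALSE TRUE
sum(is.na(x))         # 2 missing values
x[!is.na(x)]          # 5 0 8

mean(x)               # NA, because missing values propagate
mean(x, na.rm = TRUE) # mean of the non-missing values
```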
MATRIX
Matrix
Combining Matrices
The columns of two matrices having the same number of rows can be
combined into a larger matrix.
> C = matrix(
+ c(7, 4, 2),
+ nrow=3,
+ ncol=1)
> C # C has 3 rows
     [,1]
[1,]    7
[2,]    4
[3,]    2
Then we can combine the columns of B and C with cbind.
> cbind(B, C)
     [,1] [,2] [,3]
[1,]    2    1    7
[2,]    4    5    4
[3,]    3    7    2
> D = matrix(
+ c(6, 2),
+ nrow=1,
+ ncol=2)
> D # D has 2 columns
     [,1] [,2]
[1,]    6    2
> rbind(B, D)
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
[4,]    6    2
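The definition of B does not appear in this section. The self-contained sketch below assumes a 3 x 2 matrix inferred from the printed cbind and rbind output.

```r
# Assumption: B is inferred from the printed output, filled column by column
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)
C <- matrix(c(7, 4, 2), nrow = 3, ncol = 1)
D <- matrix(c(6, 2), nrow = 1, ncol = 2)

BC <- cbind(B, C)   # same number of rows: columns are joined
BD <- rbind(B, D)   # same number of columns: rows are joined

dim(BC)   # 3 3
dim(BD)   # 4 2
```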
LIST
Lists
A list is a generic vector containing other objects.
List Slicing
We retrieve a list slice with the single square bracket "[]" operator. The
following is a slice containing the second member of x, which is a copy
of s.
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
With an index vector, we can retrieve a slice with multiple members.
Here a slice containing the second and fourth members of x.
> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
[[2]]
[1] 3
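The list x is not defined in this section. The sketch below assumes a four-member list consistent with the printed slices (a numeric vector, the character vector s, a logical vector, and the number 3).

```r
# Assumption: these members are chosen to match the printed output
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

slice <- x[2]        # single brackets return a list, not the member itself
class(slice)         # "list"
slice[[1]]           # the character vector s

two <- x[c(2, 4)]    # slice with the second and fourth members
length(two)          # 2
```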
Member Reference
In order to reference a list member directly, we have to
Named Lists
We can assign names to list members, and reference them by name.
> v["bob"]
$bob
[1] 2 3 5
With an index vector, we can retrieve a slice with multiple members.
Here is a list slice with both members of v.
Notice how they are reversed from their original positions in v.
> v[c("john", "bob")]
$john
[1] "aa" "bb"
$bob
[1] 2 3 5
Member Reference
In order to reference a list member directly, we have to use the double
square bracket "[[]]" operator. The following references a member of v
by name.
> v[["bob"]]
[1] 2 3 5
A named list member can also be referenced directly with
the "$" operator in lieu of the double square bracket
operator.
> v$bob
[1] 2 3 5
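The three ways of accessing a named list can be compared side by side. The definition of v is not shown in this section, so the one below is an assumption inferred from the printed output.

```r
# Assumption: v is reconstructed from the output shown above
v <- list(bob = c(2, 3, 5), john = c("aa", "bb"))

v["bob"]       # single brackets: a list slice with one member
v[["bob"]]     # double brackets: the member itself
v$bob          # $ operator: same result as v[["bob"]]

names(v[c("john", "bob")])   # "john" "bob" (order follows the index vector)
```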
DATA FRAMES
Data Frames
A data frame is used for storing data tables. It is a list of
> mtcars
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 ...
Datsun 710    22.8   4  108  93 3.85 2.32 ...
Cell value from the first row, second column of mtcars
> mtcars[1, 2]
[1] 6
Lastly, the number of data rows in the data frame is given by the nrow
function.
> nrow(mtcars)
[1] 32
And the number of columns of a data frame is given by the ncol function.
> ncol(mtcars) # number of columns
[1] 11
Preview
Instead of printing out the entire data frame, it is often desirable to preview
it with the head function beforehand.
> head(mtcars)
           mpg cyl disp  hp drat   wt ...
Mazda RX4 21.0   6  160 110 3.90 2.62 ...
> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
operator.
Numeric Indexing
The following is a slice containing the first column of the built-in data set
mtcars.
> mtcars[1]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
Name Indexing
We can retrieve the same column slice by its name.
> mtcars["mpg"]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
To retrieve a data frame slice with the two columns mpg and
hp, we pack the column names in an index vector inside the
single square bracket operator.
> mtcars[c("mpg", "hp")]
               mpg  hp
Mazda RX4     21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710    22.8  93
Numeric Indexing
For example, the following retrieves two row records with a
numeric index vector.
> mtcars[c(3, 24),]
            mpg cyl disp  hp drat   wt ...
Datsun 710 22.8   4  108  93 3.85 2.32 ...
Camaro Z28 13.3   8  350 245 3.73 3.84 ...
Name Indexing
We can retrieve a row by its name.
> mtcars["Camaro Z28",]
            mpg cyl disp  hp drat   wt ...
Camaro Z28 13.3   8  350 245 3.73 3.84 ...
> L = mtcars$am == 0
> L
[1] FALSE FALSE FALSE TRUE ...
Here is the list of vehicles with automatic transmission.
> mtcars[L,]
                   mpg cyl  disp  hp drat    wt ...
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 ...
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 ...
And here is the gas mileage data for automatic transmission.
> mtcars[L,]$mpg
[1] 21.4 18.7 18.1 14.3 24.4 ...
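The indexing styles used on mtcars can be collected into one short sketch. The variable name auto is hypothetical; everything else uses the built-in data set.

```r
# Cell access by position and by name
mtcars[1, 2]                  # 6
mtcars["Mazda RX4", "cyl"]    # same cell, addressed by row and column name

# Size of the data frame
nrow(mtcars)                  # 32
ncol(mtcars)                  # 11

# Logical indexing: rows where am == 0 (automatic transmission)
auto <- mtcars$am == 0
head(mtcars[auto, "mpg"])     # gas mileage of the first few automatic cars
```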
DATA IMPORT
Data Import
It is often necessary to import sample textbook data into R
> library(gdata) # load gdata package
> help(read.xls) # documentation
> mydata = read.xls("mydata.xls") # read from first sheet
Alternatively, we can use the function loadWorkbook from
the XLConnect package to read the entire workbook,
and then load the worksheets with readWorksheet.
The XLConnect package requires Java to be preinstalled.
> library(XLConnect) # load XLConnect package
> wk = loadWorkbook("mydata.xls")
> df = readWorksheet(wk, sheet="Sheet1")
Table File
A data table can reside in a text file.
The cells inside the table are separated by blank
characters.
Here is an example of a table with 4 rows and 3 columns.
100 a1 b1
200 a2 b2
300 a3 b3
400 a4 b4
Now copy and paste the table above in a file named
"mydata.txt" with a text editor.
Then load the data into the workspace with the function
read.table.
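A self-contained sketch of the load step. To stay runnable anywhere, it writes the sample table to a temporary directory first; with a hand-made file you would pass "mydata.txt" directly.

```r
# Write the sample table to a temporary file (stands in for the hand-made file)
path <- file.path(tempdir(), "mydata.txt")
writeLines(c("100 a1 b1", "200 a2 b2", "300 a3 b3", "400 a4 b4"), path)

# Load it; with no header row, columns get the default names V1, V2, V3
mydata <- read.table(path)
mydata
dim(mydata)   # 4 3
```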
CSV
The sample data can also be in comma separated values
(CSV) format.
Each cell inside such data file is separated by a special
character, which usually is a comma, although other
characters can be used as well.
The first row of the data file should contain the column
names instead of the actual data.
Here is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
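A minimal sketch of reading this format with the base function read.csv. The file is created in a temporary directory so the example runs on its own; the file name is an assumption.

```r
# Create the sample CSV file (assumed name mydata.csv) in a temporary directory
path <- file.path(tempdir(), "mydata.csv")
writeLines(c("Col1,Col2,Col3", "100,a1,b1", "200,a2,b2", "300,a3,b3"), path)

# read.csv treats the first row as column names by default
mydata <- read.csv(path)
mydata
names(mydata)   # "Col1" "Col2" "Col3"
```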
DATA EXPORT
The above writes the data frame MyData into a CSV
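The export call itself is not shown in this section. A minimal sketch, assuming a small hypothetical data frame named MyData and the base function write.csv:

```r
# Hypothetical data frame standing in for MyData
MyData <- data.frame(Col1 = c(100, 200), Col2 = c("a1", "a2"))

# Write it to a CSV file in a temporary directory, without row names
out <- file.path(tempdir(), "MyData.csv")
write.csv(MyData, file = out, row.names = FALSE)

readLines(out)   # inspect the file that was written
```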
Mean
The mean of an observation variable is a numerical
Question
Problem:
Find the mean eruption duration in the data set faithful.
Solution:
We apply the mean function to compute the mean value of
eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878
Median
The median of an observation variable is the value at the
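A short sketch of computing the median of the eruption durations in faithful, with a manual check for an even number of observations (the names sorted and n are hypothetical).

```r
duration <- faithful$eruptions   # the eruption durations
median(duration)                 # apply the median function

# Manual check: with an even count, average the two middle sorted values
sorted <- sort(duration)
n <- length(sorted)              # 272 observations
(sorted[n / 2] + sorted[n / 2 + 1]) / 2
```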
Quartile
There are several quartiles of an observation variable.
The first quartile, or lower quartile, is the value that cuts
Problem
Find the quartiles of the eruption durations in the data set
faithful.
Solution
We apply the quantile function to compute the quartiles of
eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
    0%    25%    50%    75%   100%
1.6000 2.1627 4.0000 4.4543 5.1000
Percentile
The nth percentile of an observation variable is the value
that cuts off the first n percent of the data values when it is
sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption
durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles
of eruptions with the desired percentage ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
   32%    57%    98%
2.3952 4.1330 4.9330
Range
The range of an observation variable is the difference of
Problem
Find the range of the eruption duration in the data set
faithful.
Solution
We apply the max and min function to compute the largest
and smallest values of eruptions, then take the difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5
Variance
The variance is a numerical measure of how the data
Question
Find the variance of the eruption duration in the data set
faithful.
We apply the var function to compute the variance of
eruptions.
> duration = faithful$eruptions # the eruption durations
> var(duration) # apply the var function
[1] 1.3027
The var function calculates the sample variance.
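To make the "sample variance" point concrete, the sketch below recomputes it by hand with the n - 1 denominator and compares the result with var. The names n and manual are hypothetical.

```r
duration <- faithful$eruptions
n <- length(duration)

# Sample variance: squared deviations from the mean, divided by n - 1
manual <- sum((duration - mean(duration))^2) / (n - 1)

all.equal(manual, var(duration))   # TRUE
```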
Standard deviation
The standard deviation of an observation variable is the
Solution
We apply the sd function to compute the standard deviation
of eruptions.
> duration = faithful$eruptions # the eruption durations
> sd(duration) # apply the sd function
[1] 1.1414
Scatter Plot
A scatter plot pairs up values of two quantitative variables
Problem
Find the scatter plot of the eruption durations and waiting
intervals in faithful.
Does it reveal any relationship between the variables?
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x-axis label
+ ylab="Time waited") # y-axis label
Answer
The scatter plot of the eruption durations and waiting
intervals is as follows.
It reveals a positive linear relationship between them.
Correlation
The correlation coefficient is defined as follows, where x
Problem
Find the correlation coefficient of the eruption duration
and waiting time in the data set faithful. Observe if there is
any linear relationship between the variables.
Solution
We apply the cor function to compute the correlation
coefficient of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting period
> cor(duration, waiting) # apply the cor function
[1] 0.90081
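As a cross-check, the coefficient can be rebuilt from the covariance divided by the product of the two standard deviations, the usual definition of the sample correlation. The name manual is hypothetical.

```r
duration <- faithful$eruptions
waiting <- faithful$waiting

# Sample correlation: covariance scaled by both standard deviations
manual <- cov(duration, waiting) / (sd(duration) * sd(waiting))

all.equal(manual, cor(duration, waiting))   # TRUE
```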
Skewness
The skewness of a data population is defined by the
data values is less than the median, and the data distribution
is left-skewed.
Positive skewness would indicate that the mean of the data
values is larger than the median, and the data distribution is
right-skewed.
Problem
Find the skewness of eruption duration in the data set
faithful.
We apply the function skewness from the e1071 package
Simple Regression
A
Coefficient of Determination
The coefficient of determination of a linear regression
Question
Problem
Find the coefficient of determination for the simple linear
regression model of the data set faithful.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)
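One way to read the coefficient of determination off the fitted model is the r.squared component of its summary. A minimal sketch (the name r2 is hypothetical):

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Extract the coefficient of determination from the model summary
r2 <- summary(eruption.lm)$r.squared
r2

# For simple linear regression it equals the squared correlation
all.equal(r2, cor(faithful$eruptions, faithful$waiting)^2)
```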
Problem
Decide whether there is a significant relationship between
summary function.
> summary(eruption.lm)
Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
    Min      1Q  Median      3Q     Max
-1.2992 -0.3769  0.0351  0.3491  1.1933
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402    0.16014   -11.7   <2e-16 ***
waiting      0.07563    0.00222    34.1   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF, p-value: <2e-16
Answer
As the p-value is much less than 0.05, we reject the null
Solution
We apply the lm function to a formula that describes the variable
Answer
The 95% confidence interval of the mean eruption
Residual Plot
The residual data of the simple linear regression model is the
Problem
data set with the normal distribution. We can use it with the
standardized residual of the linear regression model and see if
the error term is actually normally distributed.
Problem
Create the normal probability plot for the standardized residual
of the data set faithful.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the linear
regression model in a new variable eruption.lm. Then we
compute the standardized residual with the rstandard function.
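The steps described above can be sketched as follows. The variable name eruption.stdres and the axis labels and title are assumptions chosen for illustration.

```r
# Fit the model and compute the standardized residuals
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
eruption.stdres <- rstandard(eruption.lm)

# Normal probability plot of the standardized residuals
qqnorm(eruption.stdres,
       ylab = "Standardized Residuals",
       xlab = "Normal Scores",
       main = "Old Faithful Eruptions")
qqline(eruption.stdres)   # reference line for comparison
```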
Example
For example, in the built-in data set stackloss from
Problem
Apply the multiple linear regression model for the data set
stackloss, and predict the stack loss if the air flow is 72,
water temperature is 20 and acid concentration is 85.
Solution
We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model
in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
Then we wrap the parameters inside a new data frame named newdata.
> newdata = data.frame(Air.Flow=72, # wrap the parameters
+ Water.Temp=20,
+ Acid.Conc.=85)
Lastly, we apply the predict function to stackloss.lm and
newdata.
> predict(stackloss.lm, newdata)
     1
24.582
Answer
Based on the multiple linear regression model and the given
parameters, the predicted stack loss is 24.582.
Example
Find the coefficient of determination for the multiple linear
Answer
The coefficient of determination of the multiple linear
regression model for the data set stackloss is 0.91358.
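A short sketch of how such a value can be obtained, assuming the same stackloss.lm model as in the other stackloss examples and the r.squared component of its summary (the name r2 is hypothetical):

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

# Coefficient of determination from the model summary
r2 <- summary(stackloss.lm)$r.squared
r2
```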
Example
Problem
Find the adjusted coefficient of determination for the
multiple linear regression model of the data set stackloss.
Solution
We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model in
a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
Then we extract the coefficient of determination from the
adj.r.squared attribute of its summary.
> summary(stackloss.lm)$adj.r.squared
[1] 0.89833
Answer
The adjusted coefficient of determination of the multiple
linear regression model for the data set stackloss is
0.89833.
Logistic Regression
The logistic regression equation is used to predict the
Example
The built-in data set mtcars, the data column am
Example
Problem
By use of the logistic regression equation of vehicle
transmission in the data set mtcars, estimate the
probability of a vehicle being fitted with a manual
transmission if it has a 120hp engine and weighs 2800
lbs.
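A possible solution sketch, assuming the model uses hp and wt as predictors of am and is fitted with glm and the binomial family. In mtcars, wt is recorded in units of 1000 lbs, so 2800 lbs is entered as 2.8. The names am.glm, newdata and p are hypothetical.

```r
# Fit the logistic regression of transmission type on horsepower and weight
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Wrap the given parameters in a data frame (wt is in 1000 lbs)
newdata <- data.frame(hp = 120, wt = 2.8)

# type = "response" returns a probability rather than log-odds
p <- predict(am.glm, newdata, type = "response")
p
```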