
INTRODUCTION TO R

What is R?
The R statistical programming language is a free,
open-source package based on the S language
developed at Bell Labs.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to
cutting edge research.
Since it is a programming language, generating
computer code to complete tasks is required.

Why R?
It's free!
It runs on a variety of platforms including Windows, Unix,
and macOS.
It provides an unparalleled platform for programming new
statistical methods in an easy and straightforward
manner.
It contains advanced statistical routines not yet available
in other packages.
It has state-of-the-art graphics capabilities.

R Overview
R is a comprehensive statistical and graphical
programming language and is a dialect of the S
language:
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers

R was initially written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the University of Auckland,
New Zealand, during the 1990s.
Since 1997: international R-core team of 15 people
with access to a common CVS archive.

R Overview
You can enter commands one at a time at the command
prompt (>) or run a set of commands from a source file.
There is a wide variety of data types, including vectors
(numerical, character, logical), matrices, data frames, and
lists.
To quit R, use q().

R Overview
Most functionality is provided through built-in and
user-created functions, and all data objects are kept in memory
during an interactive session.
Basic functions are available by default. Other functions are
contained in packages that can be attached to the current
session as needed.

R Overview
A key skill in using R effectively is learning how to use the
built-in help system. Other sections describe the working
environment, inputting programs and outputting results,
installing new functionality through packages, and so on.

A fundamental design feature of R is that the output from
most functions can be used as input to other functions. This is
described in reusing results.

R Interface

Getting Started with R


R is an object-oriented programming language.
This means that virtually everything can be stored as an R
object.
Each object has a class.
This class describes what the object contains and what
each function does with it.
For instance, plot(x) produces different outputs depending
on whether x is a regression object or a vector.
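The class-dependent behaviour described above can be seen with a small sketch. The built-in data set cars and the variable names x and fit are illustrative choices, not part of the original slides.

```r
# A plain numeric vector and a fitted regression object
x   <- c(1, 2, 3, 4)
fit <- lm(dist ~ speed, data = cars)   # cars is a built-in data set

class(x)     # class of the vector
class(fit)   # class of the regression object

# Generic functions such as summary() dispatch on the class of their argument,
# so the two calls below return different kinds of objects
class(summary(x))
class(summary(fit))
```

The same mechanism is what lets plot(x) draw different figures for a vector and for a regression object.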
The assignment symbol is "<-". Alternatively, the classical
"=" symbol can be used.
The following two statements are equivalent:
> a <- 2
> a = 2

Arguments are passed to functions inside round brackets
(parentheses).
One can easily combine functions. For instance, you can
directly type
> mean(rnorm(1000)^2)
The symbol "#" comments to the end of the line:
# This is a comment
5 + 7 # This is also a comment
Commands are normally separated by a newline. If you
want to put more than one statement on a line, you can
use the ";" delimiter.
a <- 1:10 ; mean(a)

Help command
> help(function_name) # replace function_name with the function of interest

Sample programs
R can be used as a simple calculator, and we can perform
any simple computation.

# Sample Session
# This is a comment
> 2 # print a number
[1] 2
> 2+3 # perform a simple calculation
[1] 5
> log(2)
[1] 0.6931472

We can also store numeric or string objects.

> x <- 2 # store an object
> x # print this object
[1] 2
> (x <- 3) # store and print an object
[1] 3
> x <- "Hello" # store a string object
> x
[1] "Hello"
Clear screen: Ctrl + L

Obtaining help
For each package you have a reference manual available
as an HTML file from within R or as a PDF on the CRAN
website.
> library(help="package_name")

You can search for help inside all loaded packages using
help() or ?. Usually you do not need to add quotes to
function names, but sometimes it can be useful. args()
gives the full syntax of a function.
> help(lm)
> ?lm
> args("lm")
function (formula, data, subset, weights, na.action, method
= "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
NULL

apropos() and find() look for all the functions in the loaded
packages containing a keyword or a regular expression.

> apropos("norm")
[1] "dlnorm" "dnorm" "plnorm"
[4] "pnorm" "qlnorm" "qnorm"
[7] "qqnorm" "qqnorm.default" "rlnorm"
[10] "rnorm" "normalizePath"

R Warning !
R is a case sensitive language.
FOO, Foo, and foo are three different objects

Basic Functions
ls() lists the objects in your workspace.
list.files() lists the files in the current working directory.
rm() removes objects from your workspace; rm(list = ls())
removes them all.

> rm(list=ls()) # remove all the objects in the workspace
Each object can be saved to disk using the save()
function. It can then be loaded back into memory using load().

# assume you want to save an object called 'df'
> save(df, file = "file.Rda")
...
> load("file.Rda")

You can save an R session (all the objects in memory) and load
the session.
> save.image(file="~/Documents/Logiciels/R/test.rda")
> load("~/Documents/Logiciels/R/test.rda")
Defining a working directory. Note for Windows users: R uses
a forward slash ("/") in directory paths instead of a backslash ("\").
> setwd("~/Desktop") # Sets working directory (character string enclosed in "...")
> getwd() # Returns current working directory
[1] "/Users/username/Desktop"
> dir() # Lists the content of the working directory

BASIC DATA TYPES

Basic Data Types


There are several basic R data types that occur frequently
in routine R calculations.


Numeric
Decimal values are called numerics in R.
It is the default computational data type.
If we assign a decimal value to a variable x as follows, x
will be of numeric type.
> x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"

Integer
In order to create an integer variable in R,
we invoke the as.integer function.
We can be assured that y is indeed an integer by applying
the is.integer function.

> y = as.integer(3)
> y # print the value of y
[1] 3
> class(y) # print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE

Incidentally, we can coerce a numeric value into an integer
with the same as.integer function.
> as.integer(3.14) # coerce a numeric value
[1] 3
And we can parse a string for decimal values in much the
same way.
> as.integer("5.27") # coerce a decimal string
[1] 5

On the other hand, trying to parse a non-decimal string yields NA with a warning.

> as.integer("Joe") # coerce a non-decimal string
[1] NA
Warning message:
NAs introduced by coercion

Complex
A complex value in R is defined via the pure imaginary
value i.
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following produces NaN with a warning, as -1 is not a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
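To take the square root of a negative number, the argument has to be a complex value first. The following is a minimal sketch of two ways to do that; the variable name root is only illustrative.

```r
# Convert -1 to a complex value before taking the square root
sqrt(as.complex(-1))

# Equivalent: write the argument as a complex literal
root <- sqrt(-1 + 0i)
root
class(root)
```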

Logical
A logical value is often created via comparison between
variables.
> x = 1; y = 2 # sample values
> z = x > y # is x larger than y?
> z # print the logical value
[1] FALSE
> class(z) # print the class name of z
[1] "logical"
Standard logical operations are "&" (and), "|" (or), and "!"
(negation).

> u = TRUE; v = FALSE
> u & v # u AND v
[1] FALSE
> u | v # u OR v
[1] TRUE
> !u # negation of u
[1] FALSE

Character
A character object is used to represent string values in R.
We convert objects into character values with the
as.character() function:
> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
Two character values can be concatenated with the paste
function.
> fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"

However, it is often more convenient to create a readable
string with the sprintf function, which has a C language
syntax.
> sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
To extract a substring, we apply the substr function. Here is
an example showing how to extract the substring between the
third and twelfth positions in a string.
> substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
And to replace the first occurrence of the word "little" by
another word "big" in the string, we apply the sub function.
> sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."

VECTORS

R-Vectors
A vector is a sequence of data elements of the same basic
type. Members in a vector are officially called components.

Here is a vector containing three numeric values 2, 3 and 5.
> c(2, 3, 5)
[1] 2 3 5
And here is a vector of logical values.
> c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1] TRUE FALSE TRUE FALSE FALSE

A vector can contain character strings.


> c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"

Length of a vector:
> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
Examples:
We can also store vectors:
> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) # store a vector
> Height # print the vector
[1] 168 177 177 177 178 172 165 171 178 170
> Height[2] # Print the second component
[1] 177

Vectors
> Height[2:5] # Print the 2nd, 3rd, 4th and 5th components
[1] 177 177 177 178
> (obs <- 1:10) # Define a vector as a sequence (1 to 10)
[1] 1 2 3 4 5 6 7 8 9 10

Question
Create two vectors, weight and height. Weight has values
ranging from 50 to 200 and height has values ranging from 100 to 250. Then:
a. Calculate the Body Mass Index, which is calculated as
weight/(height/100)^2.
b. Plot the data using the plot() function.

For plotting both weight and height:

> plot(Height, Weight, ylab="Weight", xlab="Height", main="Corpulence")
Here, ylab is the label for the y-axis, xlab is the label for the x-axis, and
main is the title given to the graph.
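One possible way to work through the exercise is sketched below. The specific numbers in weight and height are made-up sample values within the stated ranges, and kilograms and centimetres are assumed as units.

```r
# Hypothetical sample data: weights in kg, heights in cm (assumed units)
weight <- c(50, 65, 80, 120, 200)
height <- c(100, 150, 170, 180, 250)

# Body Mass Index: weight divided by the square of the height in metres
bmi <- weight / (height / 100)^2
bmi

# Scatter plot of weight against height
plot(height, weight,
     xlab = "Height", ylab = "Weight", main = "Corpulence")
```

Because arithmetic on vectors works element by element, no loop is needed to compute the index for every pair of values.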

Combining Vectors
Vectors can be combined via the function c. For example,
the following two vectors n and s are combined into a new
vector containing elements from both vectors.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> c(n, s)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
Value Coercion
In the code snippet above, notice how the numeric values
are being coerced into character strings when the two
vectors are combined. This is necessary so as to maintain
the same primitive data type for members in the same
vector.
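Coercion follows a fixed order, from logical through integer and numeric to character. The sketch below illustrates this with a few small vectors; the variable name mixed is only illustrative.

```r
# Mixing types in c() coerces every element to the most general type present
class(c(TRUE, 2L))     # logical combined with integer
class(c(2L, 3.5))      # integer combined with numeric
class(c(3.5, "aa"))    # numeric combined with character

# With a character value present, everything becomes a character string
mixed <- c(TRUE, 2, "aa")
mixed
```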

Vector Arithmetic
Arithmetic operations on vectors are performed
member-by-member, i.e., memberwise.


For example, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
> 5 * a
[1] 5 15 25 35
> a + b
[1] 2 5 9 15

Similarly for subtraction, multiplication and division, we get
new vectors via memberwise operations.
> a - b
[1] 0 1 1 -1
> a * b
[1] 1 6 20 56
> a / b
[1] 1.000 1.500 1.250 0.875
Recycling Rule
If two vectors are of unequal length, the shorter one will
be recycled in order to match the longer vector. For
example, the following vectors u and v have different
lengths, and their sum is computed by recycling values of
the shorter vector u.

> u = c(10, 20, 30)
> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> u + v
[1] 11 22 33 14 25 36 17 28 39
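When the longer length is not a multiple of the shorter one, R still recycles but raises a warning. The following sketch shows this case; the vector w and the handler are illustrative additions, and withCallingHandlers is used only so that the warning text is printed alongside the result.

```r
u <- c(10, 20, 30)
w <- c(1, 2, 3, 4)   # length 4 is not a multiple of length 3

# Print the warning message, then continue and keep the recycled result
total <- withCallingHandlers(
  u + w,
  warning = function(cond) {
    message("Warning caught: ", conditionMessage(cond))
    invokeRestart("muffleWarning")
  }
)
total   # u is recycled as 10, 20, 30, 10
```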

Vector Index
We retrieve values in a vector by declaring an index inside
a single square bracket "[]" operator.


> s = c("aa", "bb", "cc", "dd", "ee")
> s[3]
[1] "cc"
Negative Index
> s[-3]
[1] "aa" "bb" "dd" "ee"
Out-of-Range Index
> s[10]
[1] NA

Logical Index Vector


Logical index vectors are used for conditional
indexing.
Logical index vectors are recycled to match the length of
the vector being indexed, and return elements
corresponding to index elements that are TRUE.
x = c(5,3,0,8,7,5,0,0,4,9,2,0)
x > 5 # Conditional expressions evaluate to logical vectors
x[x>5] # Logical vectors apply directly as index vectors (get elements of x where condition >5 is TRUE)
x[x>2 & x<5] # Get elements of x that are >2 AND <5
x[!x==0] # Get elements of x that are NOT 0 (compare negative numeric indexing)
x[c(T,F)] # Get odd-numbered elements. Logical vectors are recycled (numeric and character vectors are not)

Counting Elements of Vector


x = rep(1:5, 5:1)
y = rep(letters[1:5], 5:1)
length(x) # Count elements in x
sum(x>1) # Count elements in x that are >1 (sum is an arithmetic operation, so logical vectors become numeric: T->1, F->0)
sum(x>1 & x<4) # Count elements in x that are >1 AND <4
sum(y=='a') # Count elements of y that are equal to 'a'
table(y) # Contingency table of the elements of y

NA Values
The is.na function returns TRUE where elements are NA,
indicating missing values:

x = c(4,8,0,2,NA,7,NA)
is.na(x) # Logical vector
sum(is.na(x)) # Count NA in x
x = x[!is.na(x)] # Set x to elements of x that are NOT NA (drop all NA from x)
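Many summary functions can skip missing values themselves, which avoids having to drop them from the vector first. A brief sketch using the same sample values:

```r
x <- c(4, 8, 0, 2, NA, 7, NA)

mean(x)                  # NA, because missing values propagate
mean(x, na.rm = TRUE)    # mean of the non-missing values only
sum(x, na.rm = TRUE)     # sum of the non-missing values only
```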

MATRIX

Matrix
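The indexing examples in this section refer to a matrix A whose definition is not shown. Judging from the printed results, A has 2 rows and 3 columns with rows 2 4 3 and 1 5 7. The call below is an assumed reconstruction that produces such a matrix, not necessarily the original definition.

```r
# Assumed definition, inferred from the outputs shown in this section
A <- matrix(c(2, 4, 3,
              1, 5, 7),
            nrow = 2, ncol = 3,
            byrow = TRUE)   # fill row by row instead of the default column order
A
```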

An element at the mth row, nth column of A can be
accessed by the expression A[m, n].

> A[2, 3] # element at 2nd row, 3rd column
[1] 7
The entire mth row of A can be extracted as A[m, ].
> A[2, ] # the 2nd row
[1] 1 5 7
Similarly, the entire nth column of A can be extracted as A[ ,n].
> A[ ,3] # the 3rd column
[1] 3 7
We can also extract more than one row or column at a
time.

> A[ ,c(1,3)] # the 1st and 3rd columns
     [,1] [,2]
[1,]    2    3
[2,]    1    7
If we assign names to the rows and columns of the matrix,
then we can access the elements by names.
> dimnames(A) = list(
+ c("row1", "row2"), # row names
+ c("col1", "col2", "col3")) # column names
> A # print A
     col1 col2 col3
row1    2    4    3
row2    1    5    7

> A["row2", "col3"] # element at 2nd row, 3rd column
[1] 7

There are various ways to construct a matrix. When we construct a matrix
directly with data elements, the matrix content is filled along the column
orientation by default.
> B = matrix(
+ c(2, 4, 3, 1, 5, 7),
+ nrow=3,
+ ncol=2)
> B # B has 3 rows and 2 columns
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
Transpose
> t(B) # transpose of B
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

Combining Matrices
The columns of two matrices having the same number of rows can be
combined into a larger matrix.
> C = matrix(
+ c(7, 4, 2),
+ nrow=3,
+ ncol=1)
> C # C has 3 rows
     [,1]
[1,]    7
[2,]    4
[3,]    2
Then we can combine the columns of B and C with cbind.
> cbind(B, C)
[,1] [,2] [,3]
[1,] 2 1 7
[2,] 4 5 4
[3,] 3 7 2

Similarly, the rows of two matrices having the same number of columns can be
combined with rbind.
> D = matrix(
+ c(6, 2),
+ nrow=1,
+ ncol=2)
> D # D has 2 columns
     [,1] [,2]
[1,]    6    2
> rbind(B, D)
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
[4,]    6    2

LIST

Lists
A list is a generic vector containing other objects.
For example, the following variable x is a list containing
copies of three vectors n, s, b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b

List Slicing
We retrieve a list slice with the single square bracket "[]" operator. The
following is a slice containing the second member of x, which is a copy
of s.

> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
With an index vector, we can retrieve a slice with multiple members.
Here is a slice containing the second and fourth members of x.
> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
[[2]]
[1] 3

Member Reference
In order to reference a list member directly, we have to
use the double square bracket "[[]]" operator.

The following object x[[2]] is the second member of x. In other words,
x[[2]] is a copy of s, but is not a slice containing s or its copy.

> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected

Named Lists
We can assign names to list members, and reference them by names
instead of numeric indexes.

For example, in the following, v is a list of two members, named "bob"
and "john".
> v = list(bob=c(2, 3, 5), john=c("aa", "bb"))
> v
$bob
[1] 2 3 5
$john
[1] "aa" "bb"
List Slicing
We retrieve a list slice with the single square bracket "[]" operator. Here
is a list slice containing a member of v named "bob".

> v["bob"]
$bob
[1] 2 3 5
With an index vector, we can retrieve a slice with multiple members.
Here is a list slice with both members of v.
Notice how they are reversed from their original positions in v.
> v[c("john", "bob")]
$john
[1] "aa" "bb"
$bob
[1] 2 3 5
Member Reference
In order to reference a list member directly, we have to use the double
square bracket "[[]]" operator. The following references a member of v
by name.

> v[["bob"]]
[1] 2 3 5
A named list member can also be referenced directly with
the "$" operator in lieu of the double square bracket
operator.
> v$bob
[1] 2 3 5

DATA FRAMES

Data Frames
A data frame is used for storing data tables. It is a list of
vectors of equal length. For example, the following variable
df is a data frame containing three vectors n, s, b.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
Built-in Data Frame

We use built-in data frames in R for our tutorials. For
example, here is a built-in data frame in R, called mtcars.

> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Cell value from the first row, second column of mtcars
> mtcars[1, 2]
[1] 6

Moreover, we can use the row and column names instead
of the numeric coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6

Lastly, the number of data rows in the data frame is given by the nrow
function.
> nrow(mtcars) # number of data rows
[1] 32

And the number of columns of a data frame is given by the ncol function.

> ncol(mtcars) # number of columns
[1] 11

Preview

Instead of printing out the entire data frame, it is often desirable to preview
it with the head function beforehand.
> head(mtcars)
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...

Data Frame Column Vector


We reference a data frame column with the double square
bracket "[[]]" operator.

For example, to retrieve the ninth column vector of the
built-in data set mtcars, we write mtcars[[9]].
> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can retrieve the same column vector by its name.
> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

We can also retrieve with the "$" operator in lieu of the
double square bracket operator.
> mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
Another way to retrieve the same column vector is to use the
single square bracket "[]" operator.
Prepend the column name with a comma character, which
signals a wildcard match for the row position.

> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

Data Frame Column Slice


We retrieve a data frame column slice with the single square bracket "[]"
operator.
Numeric Indexing
The following is a slice containing the first column of the built-in data set
mtcars.
> mtcars[1]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
Name Indexing
We can retrieve the same column slice by its name.
> mtcars["mpg"]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8

To retrieve a data frame slice with the two columns mpg and
hp, we pack the column names in an index vector inside the
single square bracket operator.
> mtcars[c("mpg", "hp")]
               mpg  hp
Mazda RX4     21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710    22.8  93

Data Frame Row Slice


We retrieve rows from a data frame with the single square bracket
operator, just like what we did with columns.
However, in addition to an index vector of row positions, we append
an extra comma character.
This is important, as the extra comma signals a wildcard match for the
second coordinate for column positions.

Numeric Indexing
For example, the following retrieves a row record of the
built-in data set mtcars.
Please notice the extra comma in the square bracket
operator; it is not a typo.
It states that the 1974 Camaro Z28 has a gas mileage of
13.3 miles per gallon, an eight-cylinder 245 horsepower
engine, etc.
> mtcars[24,]
mpg cyl disp hp drat wt ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...

To retrieve more than one row, we use a numeric index
vector.
> mtcars[c(3, 24),]
mpg cyl disp hp drat wt ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
Name Indexing
We can retrieve a row by its name.
> mtcars["Camaro Z28",]
mpg cyl disp hp drat wt ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...

And we can pack the row names in an index vector in order
to retrieve multiple rows.

> mtcars[c("Datsun 710", "Camaro Z28"),]
mpg cyl disp hp drat wt ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
Logical Indexing
Lastly, we can retrieve rows with a logical index vector.
In the following vector L, the member value is TRUE if the
car has automatic transmission, and FALSE otherwise.

> L = mtcars$am == 0
> L
[1] FALSE FALSE FALSE TRUE ...
Here is the list of vehicles with automatic transmission.
> mtcars[L,]
                   mpg cyl  disp  hp drat    wt ...
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 ...
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 ...
And here is the gas mileage data for automatic transmission.
> mtcars[L,]$mpg
[1] 21.4 18.7 18.1 14.3 24.4 ...
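Logical conditions can be combined to filter on several columns at once. The sketch below selects automatic cars with more than six cylinders; the variable names heavy_auto and same_rows are illustrative, and subset() is shown as an alternative spelling of the same filter.

```r
# Rows with automatic transmission (am == 0) AND more than 6 cylinders
heavy_auto <- mtcars[mtcars$am == 0 & mtcars$cyl > 6, ]
nrow(heavy_auto)

# subset() expresses the same filter without repeating the data frame name
same_rows <- subset(mtcars, am == 0 & cyl > 6)
all.equal(heavy_auto, same_rows)
```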

DATA IMPORT

Data Import
It is often necessary to import sample textbook data into R
before you start working on your homework.


Excel File
Quite frequently, the sample data is in Excel format, and needs
to be imported into R prior to use.
For this, we can use the function read.xls from the gdata
package.
It reads from an Excel spreadsheet and returns a data frame.
The following shows how to load an Excel spreadsheet named
"mydata.xls".
This method requires Perl runtime to be present in the system.

> library(gdata) # load gdata package
> help(read.xls) # documentation
> mydata = read.xls("mydata.xls") # read from first sheet
Alternatively, we can use the function loadWorkbook from
the XLConnect package to read the entire workbook,
and then load the worksheets with readWorksheet.
The XLConnect package requires Java to be preinstalled.
> library(XLConnect) # load XLConnect package
> wk = loadWorkbook("mydata.xls")
> df = readWorksheet(wk, sheet="Sheet1")

Table File
A data table can reside in a text file.
The cells inside the table are separated by blank
characters.
Here is an example of a table with 4 rows and 3 columns.
100 a1 b1
200 a2 b2
300 a3 b3
400 a4 b4
Now copy and paste the table above in a file named
"mydata.txt" with a text editor.
Then load the data into the workspace with the function
read.table.

> mydata = read.table("mydata.txt") # read text file
> mydata # print data frame
V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
4 400 a4 b4

CSV
The sample data can also be in comma-separated values
(CSV) format.
Each cell inside such data file is separated by a special
character, which usually is a comma, although other
characters can be used as well.
The first row of the data file should contain the column
names instead of the actual data.
Here is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3

After we copy and paste the data above in a file named
"mydata.csv" with a text editor,
we can read the data with the function read.csv.
> mydata = read.csv("mydata.csv") # read csv file
> mydata
Col1 Col2 Col3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3

DATA EXPORT

How to export data from R to CSV


# Write CSV in R
write.csv(MyData, file = "MyData.csv")

The above writes the data frame MyData into a newly created CSV
file called MyData.csv.
Note that the file is written to your working directory.
To omit the row names, add the argument
row.names=FALSE.
# Write CSV in R
write.csv(MyData, file = "MyData.csv", row.names=FALSE)
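A quick way to confirm that an export worked is to read the file back in. The following sketch does a round trip through a temporary file; the contents of MyData are made-up values mirroring the CSV sample shown earlier, and path and round_trip are illustrative names.

```r
# Small illustrative data frame (hypothetical values)
MyData <- data.frame(Col1 = c(100, 200, 300),
                     Col2 = c("a1", "a2", "a3"),
                     Col3 = c("b1", "b2", "b3"))

# tempfile() avoids cluttering the working directory
path <- tempfile(fileext = ".csv")
write.csv(MyData, file = path, row.names = FALSE)

# Read the file back and check its shape and column names
round_trip <- read.csv(path)
dim(round_trip)
names(round_trip)
```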

Basic Statistics with R

Mean
The mean of an observation variable is a numerical
measure of the central location of the data values.
It is the sum of its data values divided by data count.
Hence, for a data sample of size n, its sample mean is
defined as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Similarly, for a data population of size N, the population
mean is:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
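The definition can be checked directly in R. This sketch uses the built-in faithful data set that the problems in this part also rely on; manual_mean is an illustrative name.

```r
duration <- faithful$eruptions

# Sum of the values divided by the number of values
manual_mean <- sum(duration) / length(duration)
manual_mean

# Agrees with the built-in mean function
all.equal(manual_mean, mean(duration))
```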

Question
Problem:
Find the mean eruption duration in the data set faithful.

Solution:
We apply the mean function to compute the mean value of
eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878

Median
The median of an observation variable is the value at the
middle when the data is sorted in ascending order.
It is an ordinal measure of the central location of the data
values.
Problem
Find the median of the eruption duration in the data set
faithful.
Solution
We apply the median function to compute the median
value of eruptions.
> duration = faithful$eruptions # the eruption durations
> median(duration) # apply the median function
[1] 4

Quartile
There are several quartiles of an observation variable.
The first quartile, or lower quartile, is the value that cuts
off the first 25% of the data when it is sorted in ascending
order.
The second quartile, or median, is the value that cuts off
the first 50%.
The third quartile, or upper quartile, is the value that cuts
off the first 75%.

Problem
Find the quartiles of the eruption durations in the data set
faithful.
Solution
We apply the quantile function to compute the quartiles of
eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
    0%    25%    50%    75%   100%
1.6000 2.1627 4.0000 4.4543 5.1000

Percentile
The nth percentile of an observation variable is the value
that cuts off the first n percent of the data values when it is
sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption
durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles
of eruptions with the desired percentage ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330

Range
The range of an observation variable is the difference between
its largest and smallest data values.
It is a measure of how far apart the entire data spreads in
value.

Problem
Find the range of the eruption duration in the data set
faithful.
Solution
We apply the max and min functions to compute the largest
and smallest values of eruptions, then take the difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5

Variance
The variance is a numerical measure of how the data
values are dispersed around the mean.
In particular, the sample variance, with $\bar{x}$ denoting the sample mean, is defined as:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Similarly, the population variance is defined in terms of the
population mean $\mu$ and population size N:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Question
Find the variance of the eruption duration in the data set
faithful.
We apply the var function to compute the variance of
eruptions.
> duration = faithful$eruptions # the eruption durations
> var(duration) # apply the var function
[1] 1.3027
The var function calculates the sample variance.
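The difference between the two divisors can be made concrete with a short sketch. The names sample_var and pop_var are illustrative; the second value treats the data as if it were a whole population.

```r
duration <- faithful$eruptions
n <- length(duration)

# Sample variance: squared deviations from the mean divided by n - 1
sample_var <- sum((duration - mean(duration))^2) / (n - 1)
all.equal(sample_var, var(duration))

# Population-style variance: the same sum of squares divided by n
pop_var <- sample_var * (n - 1) / n
pop_var
```

For a data set of this size the two values differ only slightly, since (n - 1)/n is close to 1.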

Standard deviation
The standard deviation of an observation variable is the
square root of its variance.

Problem
Find the standard deviation of the eruption duration in the
data set faithful.

Solution
We apply the sd function to compute the standard deviation
of eruptions.
> duration = faithful$eruptions # the eruption durations
> sd(duration) # apply the sd function
[1] 1.1414

Scatter Plot
A scatter plot pairs up values of two quantitative variables
in a data set and displays them as geometric points inside
a Cartesian diagram.
Example:
In the data set faithful, we pair up the eruptions and
waiting values in the same observation as (x,y)
coordinates.
Then we plot the points in the Cartesian plane.

> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> head(cbind(duration, waiting))
     duration waiting
[1,]    3.600      79
[2,]    1.800      54
[3,]    3.333      74
[4,]    2.283      62
[5,]    4.533      85
[6,]    2.883      55

Problem
Find the scatter plot of the eruption durations and waiting
intervals in faithful.
Does it reveal any relationship between the variables?
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x-axis label
+ ylab="Time waited") # y-axis label

Answer
The scatter plot of the eruption durations and waiting
intervals is as follows.
It reveals a positive linear relationship between them.

Correlation
The correlation coefficient is defined as follows, where $\sigma_x$
and $\sigma_y$ are the population standard deviations, and $\sigma_{xy}$ is
the population covariance.

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Problem
Find the correlation coefficient of the eruption duration
and waiting time in the data set faithful. Observe if there is
any linear relationship between the variables.
Solution
We apply the cor function to compute the correlation
coefficient of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting period
> cor(duration, waiting) # apply the cor function
[1] 0.90081
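The coefficient can also be assembled from the covariance and the two standard deviations. In this sketch manual_cor is an illustrative name; cov and sd both use the sample (n - 1) divisor, and those factors cancel in the ratio.

```r
duration <- faithful$eruptions
waiting  <- faithful$waiting

# Covariance divided by the product of the standard deviations
manual_cor <- cov(duration, waiting) / (sd(duration) * sd(waiting))
manual_cor

# Agrees with the built-in cor function
all.equal(manual_cor, cor(duration, waiting))
```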

Skewness
The skewness of a data population is defined by the
following formula, where $\mu_2$ and $\mu_3$ are the second and third
central moments.

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

Intuitively, the skewness is a measure of symmetry.
As a rule, negative skewness indicates that the mean of the
data values is less than the median, and the data distribution
is left-skewed.
Positive skewness indicates that the mean of the data
values is larger than the median, and the data distribution is
right-skewed.
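The moment-based definition can be evaluated with base R alone. The sketch below is a direct transcription of that formula with illustrative names m2, m3 and skew; packaged implementations may apply a slightly different small-sample estimator, so their result need not match to every digit.

```r
duration <- faithful$eruptions

# Second and third central moments
m2 <- mean((duration - mean(duration))^2)
m3 <- mean((duration - mean(duration))^3)

# Moment-based skewness coefficient
skew <- m3 / m2^(3/2)
skew

# Rule of thumb: for negative skewness the mean lies below the median
mean(duration) < median(duration)
```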

Problem
Find the skewness of eruption duration in the data set
faithful.
We apply the function skewness from the e1071 package
to compute the skewness coefficient of eruptions.
As the package is not in the core R library, it has to be
installed and loaded into the R workspace.
> library(e1071) # load e1071
> duration = faithful$eruptions # eruption durations
> skewness(duration) # apply the skewness function
[1] -0.41355

The skewness of eruption duration is -0.41355. It
indicates that the eruption duration distribution is skewed
towards the left.

Simple Regression
A simple linear regression model that describes the
relationship between two variables x and y can be expressed
by the following equation.
The numbers $\alpha$ and $\beta$ are called parameters, and $\epsilon$ is the
error term.

$$y = \alpha + \beta x + \epsilon$$

For example, the data set faithful contains sample data
of two random variables named waiting and eruptions.
The waiting variable denotes the waiting time until the next
eruption, and eruptions denotes the duration. Its linear
regression model can be expressed as:

$$\text{Eruptions} = \alpha + \beta \cdot \text{Waiting} + \epsilon$$

Estimation of Simple Regression Equation


Problem
Apply the simple linear regression model for the data set
faithful, and estimate the next eruption duration if the
waiting time since the last eruption has been 80 minutes.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we extract the parameters of the estimated regression
equation with the coefficients function.
> coeffs = coefficients(eruption.lm); coeffs
(Intercept)     waiting
  -1.874016    0.075628
We now fit the eruption duration using the estimated
regression equation.
> waiting = 80 # the waiting time
> duration = coeffs[1] + coeffs[2]*waiting
> duration
(Intercept)
4.1762
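The same estimate can be obtained without handling the coefficients by hand, using the predict function. In this sketch newdata and predicted are illustrative names; the column name in newdata must match the predictor used in the formula.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# predict() expects a data frame whose column names match the predictors
newdata   <- data.frame(waiting = 80)
predicted <- predict(eruption.lm, newdata)
predicted
```

This form scales more easily to several waiting times at once, since newdata can hold any number of rows.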

Coefficient of Determination
The coefficient of determination of a linear regression
model is the quotient of the variances of the fitted values
and observed values of the dependent variable. If we
denote $y_i$ as the observed values of the dependent
variable, $\bar{y}$ as its mean, and $\hat{y}_i$ as the fitted value, then the
coefficient of determination is:

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$

Question
Problem
Find the coefficient of determination for the simple linear
regression model of the data set faithful.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we extract the coefficient of determination from the
r.squared attribute of its summary.
> summary(eruption.lm)$r.squared
[1] 0.81146
Answer
The coefficient of determination of the simple linear
regression model for the data set faithful is 0.81146.
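As a cross-check, the quotient described above can be computed directly from the fitted and observed values. This is a sketch under the assumption that the model includes an intercept (the lm default); the variable names y, y.hat and y.bar are illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

y     <- faithful$eruptions     # observed values
y.hat <- fitted(eruption.lm)    # fitted values
y.bar <- mean(y)                # mean of the observed values

# Variation of the fitted values divided by variation of the observed values
r.squared <- sum((y.hat - y.bar)^2) / sum((y - y.bar)^2)
r.squared
```

The result should agree with summary(eruption.lm)$r.squared up to rounding error.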

Significance Test for Regression


Assume that the error term $\epsilon$ in the linear regression
model is independent of x, and is normally distributed,
with zero mean and constant variance.
We can decide whether there is any significant
relationship between x and y by testing the null
hypothesis that $\beta = 0$.

Problem
Decide whether there is a significant relationship between
the variables in the linear regression model of the data set
faithful at the .05 significance level.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)


Then we print out the F-statistic of the significance test with the
summary function.
> summary(eruption.lm)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2992 -0.3769  0.0351  0.3491  1.1933

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402    0.16014   -11.7   <2e-16 ***
waiting      0.07563    0.00222    34.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811,     Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF,  p-value: <2e-16

Answer
As the p-value is much less than 0.05, we reject the null
hypothesis that $\beta = 0$. Hence there is a significant
relationship between the variables in the linear regression
model of the data set faithful.
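Rather than reading the p-value off the printed summary, it can be pulled out programmatically. The sketch below assumes the usual layout of a summary.lm object (a coefficients matrix and an fstatistic vector); the names eruption.sum, slope.p and f.p are illustrative.

```r
eruption.lm  <- lm(eruptions ~ waiting, data = faithful)
eruption.sum <- summary(eruption.lm)

# p-value of the t-test for the slope (null hypothesis: slope is 0)
slope.p <- eruption.sum$coefficients["waiting", "Pr(>|t|)"]

# p-value of the overall F-test, computed from the stored F-statistic
f   <- eruption.sum$fstatistic    # named vector: value, numdf, dendf
f.p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# TRUE means the null hypothesis is rejected at the .05 level
slope.p < 0.05
```

With a single predictor, the t-test on the slope and the overall F-test address the same hypothesis, so both p-values lead to the same decision.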

Confidence Interval for Linear Regression


Assume that the error term $\epsilon$ in the linear regression
model is independent of x, and is normally distributed,
with zero mean and constant variance. For a given value
of x, the interval estimate for the mean of the dependent
variable, $\bar{y}$, is called the confidence interval.
Problem
In the data set faithful, develop a 95% confidence interval
of the mean eruption duration for the waiting time of 80
minutes.

Solution
We apply the lm function to a formula that describes the variable
eruptions by the variable waiting, and save the linear regression
model in a new variable eruption.lm.
> attach(faithful) # attach the data frame
> eruption.lm = lm(eruptions ~ waiting)
Then we create a new data frame that sets the waiting time
value.
> newdata = data.frame(waiting=80)
We now apply the predict function and set the predictor variable
in the newdata argument. We also set the interval type as
"confidence", and use the default 0.95 confidence level.
> predict(eruption.lm, newdata, interval="confidence")
fit lwr upr
1 4.1762 4.1048 4.2476
> detach(faithful) # clean up

Answer
The 95% confidence interval of the mean eruption
duration for the waiting time of 80 minutes is between
4.1048 and 4.2476 minutes.
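To see where the interval bounds come from, the same interval can be rebuilt from the point estimate, its standard error and a t quantile. This is a sketch only: it passes data=faithful to lm instead of using attach, and the names p, t.crit and ci are illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
newdata     <- data.frame(waiting = 80)

# Point estimate with its standard error and residual degrees of freedom
p   <- predict(eruption.lm, newdata, se.fit = TRUE)
fit <- unname(p$fit)
se  <- unname(p$se.fit)

# Two-sided 95% interval: fit +/- t quantile * standard error
t.crit <- qt(0.975, df = p$df)
ci <- c(lwr = fit - t.crit * se, upr = fit + t.crit * se)
ci
```

A different confidence level would change only the quantile, for example qt(0.995, df = p$df) for a 99% interval.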

Residual Plot
The residual data of the simple linear regression model is the
difference between the observed data of the dependent
variable y and the fitted values $\hat{y}$.

Problem

Plot the residual of the simple linear regression model of the
data set faithful against the independent variable waiting.
Solution
We apply the lm function to a formula that describes the variable
eruptions by the variable waiting, and save the linear
regression model in a new variable eruption.lm. Then we
compute the residual with the resid function.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)


> eruption.res = resid(eruption.lm)
We now plot the residual against the observed values of
the variable waiting.
> plot(faithful$waiting, eruption.res,
+ ylab="Residuals", xlab="Waiting Time",
+ main="Old Faithful Eruptions")
> abline(0, 0) # the horizon
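The definition of the residual can be verified numerically before plotting. The following sketch subtracts the fitted values from the observed values and compares the result with resid; the name manual.res is illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Residual = observed value minus fitted value
manual.res <- faithful$eruptions - fitted(eruption.lm)

# TRUE if the hand-computed residuals agree with resid() up to rounding error
all.equal(unname(manual.res), unname(resid(eruption.lm)))

# With an intercept in the model, the residuals average to (numerically) zero
mean(manual.res)
```

This is why the horizontal line at zero drawn by abline(0, 0) is a natural reference for the residual plot.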

Normal Probability Plot of Residuals


The normal probability plot is a graphical tool for comparing a
data set with the normal distribution. We can use it with the
standardized residual of the linear regression model and see if
the error term is actually normally distributed.
Problem
Create the normal probability plot for the standardized residual
of the data set faithful.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the linear
regression model in a new variable eruption.lm. Then we
compute the standardized residual with the rstandard function.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)


> eruption.stdres = rstandard(eruption.lm)
We now create the normal probability plot with the qqnorm
function, and add the qqline for further comparison.


> qqnorm(eruption.stdres,
+ ylab="Standardized Residuals",
+ xlab="Normal Scores",
+ main="Old Faithful Eruptions")
> qqline(eruption.stdres)
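For readers who want to see what rstandard computes, the standardized residual can be rebuilt from the raw residual, the residual standard error and the leverage of each observation. This is a sketch for an ordinary lm fit; the names res, s, h and manual.stdres are illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

res <- resid(eruption.lm)           # raw residuals
s   <- summary(eruption.lm)$sigma   # residual standard error
h   <- hatvalues(eruption.lm)       # leverage of each observation

# Standardized residual: residual divided by its estimated standard deviation
manual.stdres <- res / (s * sqrt(1 - h))

# TRUE if the hand-computed values agree with rstandard() up to rounding error
all.equal(manual.stdres, rstandard(eruption.lm))
```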

Multiple Linear Regression


A multiple linear regression (MLR) model that describes a
dependent variable y by independent variables x1, x2, ...,
xp (p > 1) is expressed by the equation as follows, where
the numbers $\alpha$ and $\beta_k$ (k = 1, 2, ..., p) are the
parameters, and $\epsilon$ is the error term.

$$ y = \alpha + \sum_{k=1}^{p} \beta_k x_k + \epsilon $$

Example
For example, in the built-in data set stackloss from
observations of a chemical plant operation, if we assign
stack.loss as the dependent variable, and assign Air.Flow
(cooling air flow), Water.Temp (inlet water temperature)
and Acid.Conc. (acid concentration) as independent
variables, the multiple linear regression model is:

$$ \text{stack.loss} = \alpha + \beta_1 \cdot \text{Air.Flow} + \beta_2 \cdot \text{Water.Temp} + \beta_3 \cdot \text{Acid.Conc.} + \epsilon $$

Problem
Apply the multiple linear regression model for the data set
stackloss, and predict the stack loss if the air flow is 72,
water temperature is 20 and acid concentration is 85.
Solution
We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model
in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)

We also wrap the parameters inside a new data frame named
newdata.
> newdata = data.frame(Air.Flow=72, # wrap the parameters
+ Water.Temp=20,
+ Acid.Conc.=85)
Lastly, we apply the predict function to stackloss.lm and
newdata.
> predict(stackloss.lm, newdata)
1
24.582
Answer
Based on the multiple linear regression model and the given
parameters, the predicted stack loss is 24.582.
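The prediction is simply the estimated regression equation evaluated at the given values. The sketch below reproduces it from the coefficient vector; the names b, x and manual.pred are illustrative.

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

# Coefficients in order: (Intercept), Air.Flow, Water.Temp, Acid.Conc.
b <- coefficients(stackloss.lm)

# The leading 1 multiplies the intercept; the rest are the predictor values
x <- c(1, 72, 20, 85)

# Intercept plus the sum of coefficient * predictor value
manual.pred <- sum(b * x)
manual.pred
```

The order of the values in x has to follow the order of the terms in the formula, which is one reason predict with a named data frame is less error prone.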

Multiple Coefficient of Determination


The coefficient of determination of a multiple linear
regression model is the quotient of the variances of the
fitted values and observed values of the dependent
variable. If we denote $y_i$ as the observed values of the
dependent variable, $\bar{y}$ as its mean, and $\hat{y}_i$ as the
fitted value, then the coefficient of determination is:

$$ R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

Example
Find the coefficient of determination for the multiple linear
regression model of the data set stackloss.


Solution
We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model
in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
Then we extract the coefficient of determination from the
r.squared attribute of its summary.
> summary(stackloss.lm)$r.squared
[1] 0.91358

Answer
The coefficient of determination of the multiple linear
regression model for the data set stackloss is 0.91358.
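Another way to look at this value, for a least-squares fit that includes an intercept, is as the squared correlation between the observed and fitted values. A minimal sketch, with r2.cor as an illustrative name:

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

# Squared correlation between observed and fitted values
r2.cor <- cor(stackloss$stack.loss, fitted(stackloss.lm))^2
r2.cor
```

The result should match summary(stackloss.lm)$r.squared up to rounding error.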

Adjusted Coefficient of Determination


The adjusted coefficient of determination of a multiple
linear regression model is defined in terms of the
coefficient of determination as follows, where n is the
number of observations in the data set, and p is the
number of independent variables.

$$ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} $$

Example
Problem
Find the adjusted coefficient of determination for the
multiple linear regression model of the data set stackloss.

Solution
We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model in
a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
Then we extract the adjusted coefficient of determination
from the adj.r.squared attribute of its summary.
> summary(stackloss.lm)$adj.r.squared
[1] 0.89833
Answer
The adjusted coefficient of determination of the multiple
linear regression model for the data set stackloss is
0.89833.
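The adjustment can be reproduced from the unadjusted coefficient of determination, the number of observations n and the number of independent variables p. The sketch below does this for stackloss; the names r2, n, p and adj.r2 are illustrative.

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

r2 <- summary(stackloss.lm)$r.squared
n  <- nrow(stackloss)                          # number of observations
p  <- length(coefficients(stackloss.lm)) - 1   # predictors, excluding the intercept

# Penalize the coefficient of determination for the number of predictors
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj.r2
```

Because the factor (n - 1) / (n - p - 1) is greater than 1 whenever p is at least 1, the adjusted value is never larger than the unadjusted one.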

Logistic Regression
The logistic regression equation is used to predict the
probability of a dependent variable taking the dichotomous
values 0 or 1.
Suppose x1, x2, ..., xp are the independent variables,
$\alpha$ and $\beta_k$ (k = 1, 2, ..., p) are the parameters, and
E(y) is the expected value of the dependent variable y, then
the logistic regression equation is:

$$ E(y) = \frac{1}{1 + e^{-(\alpha + \sum_k \beta_k x_k)}} $$

Example
In the built-in data set mtcars, the data column am
represents the transmission type of the automobile model
(0 = automatic, 1 = manual).
With the logistic regression equation, we can model the
probability of a manual transmission in a vehicle based on
its engine horsepower and weight data.

Estimated Logistic Regression Equation


Using the generalized linear model, an estimated logistic
regression equation can be formulated as below.

$$ \hat{y} = \frac{1}{1 + e^{-(a + \sum_k b_k x_k)}} $$

The coefficients a and $b_k$ (k = 1, 2, ..., p) are determined
according to a maximum likelihood approach, which
allows us to estimate the probability of the dependent
variable y taking on the value 1 for given values of $x_k$ (k =
1, 2, ..., p).

Example
Problem
Using the logistic regression equation of vehicle
transmission in the data set mtcars, estimate the
probability of a vehicle being fitted with a manual
transmission if it has a 120hp engine and weighs 2800
lbs.

Solution
We apply the function glm to a formula that describes the transmission
type (am) by the horsepower (hp) and weight (wt). This creates a
generalized linear model (GLM) in the binomial family.
> am.glm = glm(formula=am ~ hp + wt,
+ data=mtcars,
+ family=binomial)
We then wrap the test parameters inside a data frame newdata.

> newdata = data.frame(hp=120, wt=2.8)


Now we apply the function predict to the generalized linear model
am.glm along with newdata. We will have to select response prediction
type in order to obtain the predicted probability.
> predict(am.glm, newdata, type="response")
1
0.64181
Answer
For an automobile with 120hp engine and 2800 lbs weight, the
probability of it being fitted with a manual transmission is about 64%.
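To connect this result to the estimated logistic regression equation, the probability can be computed from the fitted coefficients directly. This is a sketch only; the names b, eta and prob are illustrative, and wt is entered as 2.8 because mtcars records weight in units of 1000 lbs.

```r
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Coefficients in order: (Intercept), hp, wt
b <- coefficients(am.glm)

# Linear predictor for hp = 120 and wt = 2.8 (wt is in units of 1000 lbs)
eta <- b["(Intercept)"] + b["hp"] * 120 + b["wt"] * 2.8

# Inverse logit turns the linear predictor into a probability
prob <- unname(1 / (1 + exp(-eta)))
prob
```

Calling predict without type="response" would return eta, the value on the log-odds scale, rather than the probability.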
