Introduction To Rlogistic

INTRODUCTION TO R
What is R?
The R statistical programming language is a free
open source package based on the S language

developed by Bell Labs.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to
cutting edge research.
Since it is a programming language, generating
computer code to complete tasks is required.
Why R?
It's free!
It runs on a variety of platforms including Windows, Unix
and Mac-OS.
It provides an unparalleled platform for programming new
statistical methods in an easy and straightforward
manner.
It contains advanced statistical routines not yet available
in other packages.
It has state-of-the-art graphics capabilities.
R Overview
R is a comprehensive statistical and graphical
programming language and is a dialect of the S
language:
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
R: initially written by Ross Ihaka and Robert

Gentleman at Dep. of Statistics of U of Auckland,
New Zealand during 1990s.
Since 1997: international R-core team of 15 people
with access to common CVS archive.
R Overview
You can enter commands one at a time at the command
prompt (>) or run a set of commands from a source file.

There is a wide variety of data types, including vectors
(numerical, character, logical), matrices, dataframes, and

lists.
To quit R, use >q()
R Overview
Most functionality is provided through built-in and user-
created functions and all data objects are kept in memory

during an interactive session.
Basic functions are available by default. Other functions are
contained in packages that can be attached to a current

session as needed
R Overview
A key skill in using R effectively is learning how to use the
built-in help system. Other sections describe the working
environment, inputting programs and outputting results, and
installing new functionality through packages.
A fundamental design feature of R is that the output from
most functions can be used as input to other functions. This is
described under "reusing results".
R Interface
Getting Started with R

R is an object oriented programming language.
This means that virtually everything can be stored as an R
object.
Each object has a class.
This class describes what the object contains and what
each function does with it.
For instance, plot(x) produces different outputs depending
on whether x is a regression object or a vector.
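As a small sketch of how the class of an object drives what a function does, the same generic function summary() is applied here to a numeric vector and to a fitted regression object. The names v and fit are illustrative only, and cars is one of R's built-in data sets.

```r
# A numeric vector and a fitted linear model have different classes
v <- c(1, 2, 3, 4, 5)
fit <- lm(dist ~ speed, data = cars)   # cars is a built-in data set

class(v)     # "numeric"
class(fit)   # "lm"

# summary() is a generic function: the method it runs depends on the class
summary(v)    # minimum, quartiles, mean and maximum
summary(fit)  # regression table with coefficients and R-squared
```

plot() behaves in the same way, which is why plot(x) can draw either a scatter of values or a set of regression diagnostics.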
The assignment symbol is "<-". Alternatively, the classical
"=" symbol can be used.
The two following statements are equivalent :
> a <- 2
>a=2
Arguments are passed to functions inside round brackets
(parentheses).
One can easily combine functions. For instance you can
directly type
> mean(rnorm(1000)^2)
The symbol "#" comments to the end of the line:
# This is a comment
5 + 7 # This is also a comment
Commands are normally separated by a newline. If you
want to put more than one statement on a line, you can
use the ";" delimiter.
a <- 1:10 ; mean(a)
Help command
> help(function_name) # for example, help(mean)
Sample programs
R can be used as a simple calculator and we can perform
any simple computation.

# Sample Session
# This is a comment
>2 # print a number
[1] 2
2+3 # perform a simple calculation
[1] 5
log(2)
[1] 0.6931472
We can also store numeric or string objects.
x <- 2 # store an object

x # print this object
[1] 2
> (x <- 3) # store and print an object
[1] 3
> x <- "Hello" # store a string object
> x
[1] "Hello"
Clear Screen: Ctrl + L
Obtaining help
For each package you have a reference manual available
as an HTML file from within R or as a PDF on the CRAN
website.
>library(help="package_name")
You can search for help inside all loaded packages using
help() or ?. Usually you do not need to add quotes to

function names, but sometimes it can be useful. args()
gives the full syntax of a function.
> help(lm)
> ?lm
> args("lm")
function (formula, data, subset, weights, na.action, method
= "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
NULL
apropos() and find() look for all the functions in the loaded
packages whose names contain a keyword or match a regular expression.

> apropos("norm")
[1] "dlnorm" "dnorm" "plnorm"
[4] "pnorm" "qlnorm" "qnorm"
[7] "qqnorm" "qqnorm.default" "rlnorm"
[10] "rnorm" "normalizePath"
R Warning !
R is a case sensitive language.
FOO, Foo, and foo are three different objects
Basic Functions
ls() lists the objects in your workspace.
list.files() lists the files located in the current working directory.
rm() removes objects from your workspace; rm(list = ls())
removes them all.

rm(list=ls()) # remove all the objects in the workspace
Each object can be saved to the disk using the save()
function. They can then be loaded into memory using load().
load("file.Rda")
...
# assume you want to save an object called 'df'
save(df, file = "file.Rda")
You can save an R session (all the objects in memory) and load
the session.
>save.image(file="~/Documents/Logiciels/R/test.rda")
>load("~/Documents/Logiciels/R/test.rda")
Defining a working directory. Note for Windows users: R uses
a forward slash ("/") in paths instead of a backslash ("\").
> setwd("~/Desktop") # Sets working directory (character string enclosed in "...")
> getwd() # Returns current working directory
[1] "/Users/username/Desktop"
> dir() # Lists the content of the working directory
BASIC DATA TYPES
Basic Data Types

There are several basic R data types that are of frequent
occurrence in routine R calculations.

Numeric
Decimal values are called numerics in R.
It is the default computational data type.
If we assign a decimal value to a variable x as follows, x
will be of numeric type.
> x = 10.5
# assign a decimal value
>x
# print the value of x
[1] 10.5
> class(x)
# print the class name of x
[1] "numeric"
Integer
In order to create an integer variable in R,
we invoke the as.integer function.
We can be assured that y is indeed an integer by applying
the is.integer function.

> y = as.integer(3)
>y
# print the value of y
[1] 3
> class(y)
# print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
Incidentally, we can coerce a numeric value into an integer

with the same as.integer function.
> as.integer(3.14) # coerce a numeric value
[1] 3
And we can parse a string for decimal values in much the
same way.
> as.integer("5.27") # coerce a decimal string
[1] 5
On the other hand, trying to parse a non-decimal string yields NA with a warning.

> as.integer("Joe") # coerce a non-decimal string
[1] NA
Warning message:
NAs introduced by coercion
Complex
A complex value in R is defined via the pure imaginary
value i.
> z = 1 + 2i # create a complex number
>z
# print the value of z
[1] 1+2i
> class(z)
# print the class name of z
[1] "complex"
The following produces NaN with a warning, because -1 is a numeric value, not a complex one.
Supplying the complex value -1+0i instead gives 0+1i.
> sqrt(-1)
# square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
Logical
A logical value is often created via comparison between
variables.
> x = 1; y = 2 # sample values
>z=x>y
# is x larger than y?
>z
# print the logical value
[1] FALSE
> class(z)
# print the class name of z
[1] "logical"
Standard logical operations are "&" (and), "|" (or), and "!"
(negation).
> u = TRUE; v = FALSE

>u&v
# u AND v
[1] FALSE
>u|v
# u OR v
[1] TRUE
> !u
# negation of u
[1] FALSE
Character
A character object is used to represent string values in R.
We convert objects into character values with the
as.character() function:
> x = as.character(3.14)
>x
# print the character string
[1] "3.14"
> class(x)
# print the class name of x
[1] "character"
Two character values can be concatenated with the paste
function.
> fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"
However, it is often more convenient to create a readable

string with the sprintf function, which has a C language
syntax.
> sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
To extract a substring, we apply the substr function. Here is
an example showing how to extract the substring between the
third and twelfth positions in a string.
> substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
And to replace the first occurrence of the word "little" by
another word "big" in the string, we apply the sub function.
> sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
VECTORS
R-Vectors
A vector is a sequence of data elements of the same basic
type. Members in a vector are officially called components.

Here is a vector containing three numeric values 2, 3 and 5.
> c(2, 3, 5)
[1] 2 3 5
And here is a vector of logical values.
> c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1] TRUE FALSE TRUE FALSE FALSE
A vector can contain character strings.

> c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"
Length of a vector:
> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
Examples:
We can also store vectors:
> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178,
170) #store a vector
> Height # print the vector
[1] 168 177 177 177 178 172 165 171 178 170
> Height[2] # Print the second component
[1] 177
Vectors
> Height[2:5] # Print the second, the 3rd, the 4th and 5th
component
[1] 177 177 177 178
>(obs <- 1:10) # Define a vector as a sequence (1 to 10)
[1] 1 2 3 4 5 6 7 8 9 10
Question
Create two vectors, weight and height. Weight has values
ranging from 50 to 200 and height has values ranging from 100 to 250.
Then:
a. Calculate the Body Mass Index, which is computed as
weight/(height/100)^2.
b. Plot the data using the plot() function.
For plotting both weight and height:

> plot(height, weight, ylab="Weight", xlab="Height", main="Corpulence")
Here, ylab is label for y-axis, xlab is label for x-axis and
main is the name given to the graph.
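One possible way to approach the question is sketched below. The step size of 10 and the units (kilograms and centimetres) are assumptions made for illustration; the question itself only gives the ranges.

```r
# Illustrative values: 16 evenly spaced points in each range (step size is an assumption)
weight <- seq(50, 200, by = 10)    # assumed to be in kg
height <- seq(100, 250, by = 10)   # assumed to be in cm

# Body Mass Index: weight divided by squared height in metres, element by element
bmi <- weight / (height / 100)^2
round(bmi, 1)

# Scatter plot of weight against height
plot(height, weight, xlab = "Height", ylab = "Weight", main = "Corpulence")
```

Because arithmetic on vectors is performed element by element, a single expression computes the index for every weight/height pair.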
Combining Vectors
Vectors can be combined via the function c. For example,
the following two vectors n and s are combined into a new
vector containing elements from both vectors.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> c(n, s)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
Value Coercion
In the code snippet above, notice how the numeric values
are being coerced into character strings when the two
vectors are combined. This is necessary so as to maintain
the same primitive data type for members in the same
vector.
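The coercion rule can be checked with class(). The sketch below assumes the simplified ordering logical, then numeric, then character; R always coerces towards the more general type.

```r
# Combining types coerces every element to the most general type present
class(c(TRUE, 2))      # "numeric": TRUE becomes 1
class(c(2, "aa"))      # "character": 2 becomes "2"
class(c(TRUE, "aa"))   # "character": TRUE becomes "TRUE"

# Converting back has to be done explicitly
nums <- as.numeric(c("2", "3", "5"))
nums
```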
Vector Arithmetic
Arithmetic operations of vectors are performed member-
by-member, i.e., member wise.

For example, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
>5*a
[1] 5 15 25 35
>a+b
[1] 2 5 9 15
Similarly for subtraction, multiplication and division, we get

new vectors via member wise operations.
>a-b
[1] 0 1 1 -1
>a*b
[1] 1 6 20 56
>a/b
[1] 1.000 1.500 1.250 0.875
Recycling Rule
If two vectors are of unequal length, the shorter one will
be recycled in order to match the longer vector. For
example, the following vectors u and v have different
lengths, and their sum is computed by recycling values of
the shorter vector u.
> u = c(10, 20, 30)

> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
>u+v
[1] 11 22 33 14 25 36 17 28 39
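In the example above the longer length (9) is a multiple of the shorter one (3). The following sketch, using a hypothetical vector w of length 4, shows what happens when that is not the case: R still recycles, but it issues a warning.

```r
u <- c(10, 20, 30)
w <- c(1, 2, 3, 4)   # length 4 is not a multiple of 3

# u is recycled as 10, 20, 30, 10; R warns that the lengths do not match up
s <- suppressWarnings(u + w)
s

# Adding a single number is the most common case of recycling
u + 1
```

The warning is worth taking seriously: an unintended length mismatch is a frequent source of silent errors.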
Vector Index
We retrieve values in a vector by declaring an index inside
a single square bracket "[]" operator.

> s = c("aa", "bb", "cc", "dd", "ee")
> s[3]
[1] "cc"
Negative Index
> s[-3]
[1] "aa" "bb" "dd" "ee"
Out-of-Range Index
> s[10]
[1] NA
Logical Index Vector

Logical index vectors are used for conditional indexing.
Logical index vectors are recycled to match the length of
the vector being indexed, and return elements
corresponding to index elements that are TRUE.
x = c(5,3,0,8,7,5,0,0,4,9,2,0)
x > 5 # Conditional expressions evaluate to logical vectors
x[x>5] # Logical vectors apply directly as index vectors
# (get elements of x where the condition >5 is TRUE)
x[x>2 & x<5] # Get elements of x that are >2 AND <5
x[!x==0] # Get elements of x that are NOT 0 (compare negative numeric indexing)
x[c(T,F)] # Get odd-numbered elements. Logical index vectors are
# recycled (numeric and character index vectors are not)
Counting Elements of Vector

x = rep(1:5, 5:1)
y = rep(letters[1:5], 5:1)
length(x) # Count elements in x
sum(x>1) # Count elements in x that are >1 (sum is an arithmetic
# operation, so logical values become numeric: T->1, F->0)
sum(x>1 & x<4) # Count elements in x that are >1 AND <4
sum(y=='a') # Count elements of y that are equal to 'a'
table(y) # Contingency table of the elements of y
NA Values
The is.na function returns TRUE where elements are NA
indicating missing values:

x = c(4,8,0,2,NA,7,NA)
is.na(x) # Logical vector
sum(is.na(x)) # Count NA in x
x = x[!is.na(x)] # Set x to the elements of x that are NOT NA (drop all NA from x)
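Dropping the missing values is not always necessary. Many summary functions accept an na.rm argument, as in this short sketch that starts again from the original vector:

```r
x <- c(4, 8, 0, 2, NA, 7, NA)

mean(x)                 # NA: arithmetic involving NA yields NA
mean(x, na.rm = TRUE)   # mean of the five non-missing values
sum(x, na.rm = TRUE)    # sum of the non-missing values
which(is.na(x))         # positions of the missing values
```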
MATRIX
Matrix
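The examples in this section use a matrix A whose definition is not shown. The construction below is an assumption inferred from the printed output further down (rows 2 4 3 and 1 5 7); it fills the matrix row by row.

```r
# Assumed definition of A, reconstructed from the output shown in this section
A <- matrix(c(2, 4, 3,
              1, 5, 7),
            nrow = 2, ncol = 3,
            byrow = TRUE)   # fill row by row instead of the column-wise default
A
dim(A)   # number of rows and columns
```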
An element at the mth row, nth column of A can be
accessed by the expression A[m, n].

> A[2, 3]
# element at 2nd row, 3rd column
[1] 7
The entire mth row A can be extracted as A[m, ].
> A[2, ]
# the 2nd row
[1] 1 5 7
Similarly, the entire nth column A can be extracted as A[ ,n].
> A[ ,3]
# the 3rd column
[1] 3 7
We can also extract more than one row or column at a
time.
> A[ ,c(1,3)] # the 1st and 3rd columns

[,1] [,2]
[1,] 2 3
[2,] 1 7
If we assign names to the rows and columns of the matrix,
then we can access the elements by names.
> dimnames(A) = list(
+ c("row1", "row2"), # row names
+ c("col1", "col2", "col3")) # column names
>A
# print A
col1 col2 col3
row1 2 4 3
row2 1 5 7
> A["row2", "col3"] # element at 2nd row, 3rd column

[1] 7
There are various ways to construct a matrix. When we construct a matrix

directly with data elements, the matrix content is filled along the column
orientation by default.
> B = matrix(
+ c(2, 4, 3, 1, 5, 7),
+ nrow=3,
+ ncol=2)
>B
# B has 3 rows and 2 columns
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
Transpose
> t(B)
# transpose of B
[,1] [,2] [,3]
[1,] 2 4 3
[2,] 1 5 7
Combining Matrices
The columns of two matrices having the same number of rows can be
combined into a larger matrix.
> C = matrix(
+ c(7, 4, 2),
+ nrow=3,
+ ncol=1)
>C
# C has 3 rows
[,1]
[1,] 7
[2,] 4
[3,] 2
Then we can combine the columns of B and C with cbind.
> cbind(B, C)
[,1] [,2] [,3]
[1,] 2 1 7
[2,] 4 5 4
[3,] 3 7 2
Similarly, the rows of two matrices having the same number of columns can be
combined with rbind.
> D = matrix(
+ c(6, 2),
+ nrow=1,
+ ncol=2)
>D
# D has 2 columns
[,1] [,2]
[1,] 6 2
> rbind(B, D)
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
[4,] 6 2
LIST
Lists
A list is a generic vector containing other objects.
For example, the following variable x is a list containing

copies of three vectors n, s, b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b
List Slicing
We retrieve a list slice with the single square bracket "[]" operator. The
following is a slice containing the second member of x, which is a copy
of s.
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
With an index vector, we can retrieve a slice with multiple members.
Here a slice containing the second and fourth members of x.
> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
[[2]]
[1] 3
Member Reference
In order to reference a list member directly, we have to
use the double square bracket "[[]]" operator.

The following object x[[2]] is the second member of x. In other words,
x[[2]] is a copy of s, but is not a slice containing s or its copy.

> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
>s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
Named Lists
We can assign names to list members, and reference them by names
instead of numeric indexes.

For example, in the following, v is a list of two members, named "bob"
and "john".
> v = list(bob=c(2, 3, 5), john=c("aa", "bb"))
>v
$bob
[1] 2 3 5
$john
[1] "aa" "bb"
List Slicing
We retrieve a list slice with the single square bracket "[]" operator. Here
is a list slice containing a member of v named "bob".
> v["bob"]
$bob
[1] 2 3 5
With an index vector, we can retrieve a slice with multiple members.
Here is a list slice with both members of v.
Notice how they are reversed from their original positions in v.
> v[c("john", "bob")]
$john
[1] "aa" "bb"
$bob
[1] 2 3 5
Member Reference
In order to reference a list member directly, we have to use the double
square bracket "[[]]" operator. The following references a member of v
by name.
> v[["bob"]]
[1] 2 3 5
A named list member can also be referenced directly with
the "$" operator in lieu of the double square bracket
operator.
> v$bob
[1] 2 3 5
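The difference between a slice and a member reference shows up as soon as the result is used in a computation. A brief sketch with the same list v:

```r
v <- list(bob = c(2, 3, 5), john = c("aa", "bb"))

class(v["bob"])     # "list": a slice is still a list
class(v[["bob"]])   # "numeric": the member itself
class(v$bob)        # "numeric": same object as v[["bob"]]

# Arithmetic works on the member, not on the slice
mean(v[["bob"]])
# mean(v["bob"]) would return NA with a warning, because a list is not numeric
```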
DATA FRAMES
Data Frames
A data frame is used for storing data tables. It is a list of
vectors of equal length. For example, the following variable

df is a data frame containing three vectors n, s, b
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
# df is a data frame
Built-in Data Frame
We use built-in data frames in R for our tutorials. For

example, here is a built-in data frame in R, called mtcars.
> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Cell value from the first row, second column of mtcars
> mtcars[1, 2]
[1] 6
Moreover, we can use the row and column names instead

of the numeric coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6
Lastly, the number of data rows in the data frame is given by the nrow
function.
> nrow(mtcars) # number of data rows
[1] 32
And the number of columns of a data frame is given by the ncol function.
> ncol(mtcars) # number of columns
[1] 11
Preview
Instead of printing out the entire data frame, it is often desirable to preview
it with the head function beforehand.
> head(mtcars)
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Data Frame Column Vector

We reference a data frame column with the double square
bracket "[[]]" operator.

For example, to retrieve the ninth column vector of the
built-in data set mtcars, we write mtcars[[9]].
> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can retrieve the same column vector by its name.
> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can also retrieve with the "$" operator in lieu of the

double square bracket operator.
> mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
Another way to retrieve the same column vector is to use the
single square bracket "[]" operator.
Prepend the column name with a comma character, which
signals a wildcard match for the row position.
> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
Data Frame Column Slice

We retrieve a data frame column slice with the single square bracket "[]"
operator.
Numeric Indexing
The following is a slice containing the first column of the built-in data set
mtcars.
> mtcars[1]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
Name Indexing
We can retrieve the same column slice by its name.
> mtcars["mpg"]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
To retrieve a data frame slice with the two columns mpg and
hp, we pack the column names in an index vector inside the
single square bracket operator.
> mtcars[c("mpg", "hp")]
               mpg  hp
Mazda RX4     21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710    22.8  93
Data Frame Row Slice

We retrieve rows from a data frame with the single square bracket
operator, just like what we did with columns.

However, in addition to an index vector of row positions, we append
an extra comma character.

This is important, as the extra comma signals a wildcard match for the
second coordinate for column positions.
Numeric Indexing
For example, the following retrieves a row record of the
built-in data set mtcars.

Please notice the extra comma in the square bracket
operator, and it is not a typo.
It states that the 1974 Camaro Z28 has a gas mileage of
13.3 miles per gallon, and an eight cylinder 245 horse
power engine, ..., etc.
> mtcars[24,]
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
To retrieve more than one row, we use a numeric index
vector.
> mtcars[c(3, 24),]
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
Name Indexing
We can retrieve a row by its name.
> mtcars["Camaro Z28",]
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
And we can pack the row names in an index vector in order
to retrieve multiple rows.

> mtcars[c("Datsun 710", "Camaro Z28"),]
Datsun 710 22.8 4 108 93 3.85 2.32 ...
Camaro Z28 13.3 8 350 245 3.73 3.84 ...
Logical Indexing
Lastly, we can retrieve rows with a logical index vector.
In the following vector L, the member value is TRUE if the
car has automatic transmission, and FALSE if otherwise.
> L = mtcars$am == 0
>L
[1] FALSE FALSE FALSE TRUE ...
Here is the list of vehicles with automatic transmission.
> mtcars[L,]
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 ...
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 ...
And here is the gas mileage data for automatic transmission.
> mtcars[L,]$mpg
[1] 21.4 18.7 18.1 14.3 24.4 ...
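Logical index vectors can be combined with & and | to filter on several columns at once. The sketch below, with a threshold of 20 miles per gallon chosen only for illustration, also shows the base R function subset() as an alternative way to write the same filter.

```r
# Rows for automatic cars (am == 0) with more than 20 miles per gallon
sel <- mtcars$am == 0 & mtcars$mpg > 20
by_index <- mtcars[sel, c("mpg", "cyl", "am")]
by_index

# subset() expresses the same filter without repeating the data frame name
by_subset <- subset(mtcars, am == 0 & mpg > 20, select = c(mpg, cyl, am))
```

Both forms return a data frame; subset() is convenient interactively, while explicit indexing is usually preferred inside functions.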
DATA IMPORT
Data Import
It is often necessary to import sample textbook data into R
before you start working on your homework.

Excel File
Quite frequently, the sample data is in Excel format, and needs
to be imported into R prior to use.
For this, we can use the function read.xls from the gdata
package.
It reads from an Excel spreadsheet and returns a data frame.
The following shows how to load an Excel spreadsheet named
"mydata.xls".
This method requires Perl runtime to be present in the system.
> library(gdata)
# load gdata package
> help(read.xls)
# documentation
> mydata = read.xls("mydata.xls") # read from first sheet
Alternatively, we can use the function loadWorkbook from
the XLConnect package to read the entire workbook,
and then load the worksheets with readWorksheet.
The XLConnect package requires Java to be preinstalled.
> library(XLConnect)
# load XLConnect package
> wk = loadWorkbook("mydata.xls")
> df = readWorksheet(wk, sheet="Sheet1")
Table File
A data table can reside in a text file.
The cells inside the table are separated by blank
characters.
Here is an example of a table with 4 rows and 3 columns.
100 a1 b1
200 a2 b2
300 a3 b3
400 a4 b4
Now copy and paste the table above in a file named
"mydata.txt" with a text editor.
Then load the data into the workspace with the function
read.table.
> mydata = read.table("mydata.txt") # read text file

> mydata
# print data frame
V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
4 400 a4 b4
CSV
The sample data can also be in comma separated values
(CSV) format.
Each cell inside such data file is separated by a special
character, which usually is a comma, although other
characters can be used as well.
The first row of the data file should contain the column
names instead of the actual data.
Here is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
After we copy and paste the data above in a file named
"mydata.csv" with a text editor,

we can read the data with the function read.csv.
> mydata = read.csv("mydata.csv") # read csv file
> mydata
Col1 Col2 Col3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
DATA EXPORT
How to export data from R to CSV

# Write CSV in R
write.csv(MyData, file = "MyData.csv")
The above writes the data frame MyData to a new CSV
file called MyData.csv.

Note that the file is written to your working directory.
To omit the row names, add a comma and then
row.names=FALSE.
# Write CSV in R
write.csv(MyData, file = "MyData.csv",row.names=FALSE)
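A quick way to check an export is to read the file back and compare it with the original. In this sketch the contents of MyData are hypothetical, and a temporary file is used so that nothing is written to the working directory.

```r
# A small hypothetical data frame standing in for MyData
MyData <- data.frame(id = 1:3, label = c("a1", "a2", "a3"))

# Write to a temporary file instead of the working directory
path <- tempfile(fileext = ".csv")
write.csv(MyData, file = path, row.names = FALSE)

# Read the file back and compare it with the original
back <- read.csv(path)
all.equal(MyData, back)
```

Without row.names=FALSE the file would gain an extra first column of row names, and the round trip would no longer reproduce the original columns.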
Basic Statistics with R
Mean
The mean of an observation variable is a numerical
measure of the central location of the data values.

It is the sum of its data values divided by data count.
Hence, for a data sample of size n, its sample mean is
defined as follows:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Similarly, for a data population of size N, the population
mean is:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Question
Problem:
Find the mean eruption duration in the data set faithful.
Solution:
We apply the mean function to compute the mean value of
eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration)
# apply the mean function
[1] 3.4878
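The definition (sum of the values divided by their count) can be verified directly. The variable name manual is illustrative.

```r
duration <- faithful$eruptions

# Sum of the data values divided by the data count
manual <- sum(duration) / length(duration)

# Agrees with the built-in mean function
all.equal(manual, mean(duration))
```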
Median
The median of an observation variable is the value at the
middle when the data is sorted in ascending order.

It is an ordinal measure of the central location of the data
values.
Problem
Find the median of the eruption duration in the data set
faithful.
Solution
We apply the median function to compute the median
value of eruptions.
> median(duration)
# apply the median function
[1] 4
Quartile
There are several quartiles of an observation variable.
The first quartile, or lower quartile, is the value that cuts
off the first 25% of the data when it is sorted in ascending

order.
The second quartile, or median, is the value that cuts off
the first 50%.

The third quartile, or upper quartile, is the value that cuts
off the first 75%
Problem
Find the quartiles of the eruption durations in the data set
faithful.
Solution
We apply the quantile function to compute the quartiles of
eruptions.
> quantile(duration)
# apply the quantile function
0% 25% 50% 75% 100%
1.6000 2.1627 4.0000 4.4543 5.1000
Percentile
The nth percentile of an observation variable is the value
that cuts off the first n percent of the data values when it is
sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption
durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles
of eruptions with the desired percentage ratios.
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330
Range
The range of an observation variable is the difference of
its largest and smallest data values.

It is a measure of how far apart the entire data spreads in
value.
Problem
Find the range of the eruption duration in the data set
faithful.
Solution
We apply the max and min functions to compute the largest
and smallest values of eruptions, then take the difference.
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5
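R also has a range function, which returns the minimum and maximum as a two-element vector rather than their difference. A short sketch of how it relates to the computation above:

```r
duration <- faithful$eruptions

r <- range(duration)   # c(minimum, maximum)
r
diff(r)                # maximum minus minimum, same as max() - min()
```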
Variance
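The variance is a numerical measure of how the data
values are dispersed around the mean.

In particular, the sample variance is defined as follows, where $\bar{x}$ is the sample mean:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Similarly, the population variance is defined in terms of the
population mean $\mu$ and population size N:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$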
Question
Find the variance of the eruption duration in the data set
faithful.
We apply the var function to compute the variance of
eruptions.
> var(duration)
# apply the var function
[1] 1.3027
The var function calculates the sample variance.
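The sketch below recomputes the sample variance from its definition and then rescales it to a population variance. The names s2 and pop_var are illustrative; base R has no separate population-variance function.

```r
duration <- faithful$eruptions
n <- length(duration)

# Sample variance: squared deviations from the mean, divided by n - 1
s2 <- sum((duration - mean(duration))^2) / (n - 1)
all.equal(s2, var(duration))

# Population variance divides by n instead, so rescale the sample variance
pop_var <- var(duration) * (n - 1) / n
pop_var
```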
Standard deviation
The standard deviation of an observation variable is the
square root of its variance.

Problem
Find the standard deviation of the eruption duration in the
data set faithful.
Solution
We apply the sd function to compute the standard deviation
of eruptions.
> sd(duration)
# apply the sd function
[1] 1.1414
Scatter Plot
A scatter plot pairs up values of two quantitative variables
in a data set and displays them as geometric points inside
a Cartesian diagram.
Example:
In the data set faithful, we pair up the eruptions and
waiting values in the same observation as (x,y)
coordinates.
Then we plot the points in the Cartesian plane.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> head(cbind(duration, waiting))
     duration waiting
[1,]    3.600      79
[2,]    1.800      54
[3,]    3.333      74
[4,]    2.283      62
[5,]    4.533      85
[6,]    2.883      55
Problem
Find the scatter plot of the eruption durations and waiting
intervals in faithful.
Does it reveal any relationship between the variables?
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x-axis label
+ ylab="Time waited") # y-axis label
Answer
The scatter plot of the eruption durations and waiting
intervals is as follows.
It reveals a positive linear relationship between them.
Correlation
The correlation coefficient is defined as follows, where $\sigma_x$
and $\sigma_y$ are the population standard deviations, and $\sigma_{xy}$ is
the population covariance.
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Problem
Find the correlation coefficient of the eruption duration
and waiting time in the data set faithful. Observe if there is
any linear relationship between the variables.
Solution
We apply the cor function to compute the correlation
coefficient of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting period
> cor(duration, waiting) # apply the cor function
[1] 0.90081
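The same value can be obtained from the definition, as covariance divided by the product of the standard deviations. This sketch uses the sample versions cov and sd; their n - 1 factors cancel in the ratio.

```r
duration <- faithful$eruptions
waiting  <- faithful$waiting

# Covariance divided by the product of the standard deviations
r_manual <- cov(duration, waiting) / (sd(duration) * sd(waiting))
all.equal(r_manual, cor(duration, waiting))
```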
Skewness
The skewness of a data population is defined by the
following formula, where $\mu_2$ and $\mu_3$ are the second and third
central moments.
$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$
Intuitively, the skewness is a measure of symmetry.

As a rule, negative skewness indicates that the mean of the
data values is less than the median, and the data distribution
is left-skewed.
Positive skewness indicates that the mean of the data
values is larger than the median, and the data distribution is
right-skewed.
Problem
Find the skewness of eruption duration in the data set
faithful.
We apply the function skewness from the e1071 package
to compute the skewness coefficient of eruptions.

As the package is not in the core R library, it has to be
installed and loaded into the R workspace.
> library(e1071) # load e1071
> duration = faithful$eruptions # eruption durations
> skewness(duration) # apply the skewness function
[1] -0.41355
The skewness of eruption duration is -0.41355. It
indicates that the eruption duration distribution is skewed

towards the left.
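For readers without the e1071 package, the coefficient can be sketched in base R from the second and third central moments. This is the population-style formula; packaged implementations may apply a small-sample adjustment, so the result can differ slightly from the value printed above.

```r
duration <- faithful$eruptions

# Second and third central moments (dividing by n)
m2 <- mean((duration - mean(duration))^2)
m3 <- mean((duration - mean(duration))^3)

# Skewness coefficient: third moment over the second moment to the power 3/2
g1 <- m3 / m2^(3/2)
g1   # negative, so the distribution is left-skewed
```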
Simple Regression
A simple linear regression model that describes the
relationship between two variables x and y can be expressed
by the following equation.
$$y = \alpha + \beta x + \epsilon$$
The numbers $\alpha$ and $\beta$ are called parameters, and $\epsilon$ is the
error term.
For example, the data set faithful contains sample data
of two random variables named waiting and eruptions.

The waiting variable denotes the waiting time until the next
eruption, and eruptions denotes the duration. Its linear
regression model can be expressed as:
$$\text{eruptions} = \alpha + \beta \cdot \text{waiting} + \epsilon$$
Estimation of Simple Regression Equation

Problem
Apply the simple linear regression model for the data set
faithful, and estimate the next eruption duration if the

waiting time since the last eruption has been 80 minutes.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we extract the parameters of the estimated regression
equation with the coefficients function.
> coeffs = coefficients(eruption.lm); coeffs
(Intercept) waiting
-1.874016 0.075628
We now fit the eruption duration using the estimated
regression equation.
> waiting = 80
# the waiting time
> duration = coeffs[1] + coeffs[2]*waiting
> duration
(Intercept)
4.1762
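The same estimate can be obtained with the predict function, which applies the fitted equation to new predictor values supplied in a data frame. A sketch, refitting the model so that the block stands alone:

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# predict() evaluates the fitted equation at the new waiting time
newdata <- data.frame(waiting = 80)
pred <- predict(eruption.lm, newdata)
pred

# Same number as intercept + slope * 80
b <- coef(eruption.lm)
all.equal(unname(pred), unname(b[1] + b[2] * 80))
```

Using predict avoids the stray "(Intercept)" label that the manual calculation inherits from the coefficient vector.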
Coefficient of Determination
The coefficient of determination of a linear regression
model is the quotient of the variances of the fitted values
and observed values of the dependent variable. If we
denote $y_i$ as the observed values of the dependent
variable, $\bar{y}$ as its mean, and $\hat{y}_i$ as the fitted value, then the
coefficient of determination is:
$$r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
Question
Problem
Find the coefficient of determination for the simple linear
regression model of the data set faithful.
Solution
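We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)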
Then we extract the coefficient of determination from the

r.squared attribute of its summary.
> summary(eruption.lm)$r.squared
[1] 0.81146
Answer
The coefficient of determination of the simple linear
regression model for the data set faithful is 0.81146.
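The value reported by summary can be reproduced from the definition, as explained variation over total variation. The names y, yhat and r2 in this sketch are illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

y    <- faithful$eruptions     # observed values
yhat <- fitted(eruption.lm)    # fitted values

# Variation of the fitted values over variation of the observed values
r2 <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)
all.equal(r2, summary(eruption.lm)$r.squared)

# In simple linear regression this equals the squared correlation coefficient
all.equal(r2, cor(faithful$eruptions, faithful$waiting)^2)
```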
Significance Test for Regression

Assume that the error term $\epsilon$ in the linear regression
model is independent of x, and is normally distributed,
with zero mean and constant variance.
We can decide whether there is any significant
relationship between x and y by testing the null
hypothesis that the slope parameter $\beta$ = 0.
Problem
Decide whether there is a significant relationship between
the variables in the linear regression model of the data set

faithful at .05 significance level.
Solution
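We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the
linear regression model in a new variable eruption.lm.
> eruption.lm = lm(eruptions ~ waiting, data=faithful)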

Then we print out the F-statistics of the significance test with the
summary function.
> summary(eruption.lm)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2992 -0.3769  0.0351  0.3491  1.1933

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402    0.16014   -11.7   <2e-16 ***
waiting      0.07563    0.00222    34.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF, p-value: <2e-16
Answer
As the p-value is much less than 0.05, we reject the null
hypothesis that $\beta = 0$. Hence there is a significant
relationship between the variables in the linear regression
model of the data set faithful.
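The p-value printed by summary can also be extracted programmatically rather than read off the console. The sketch below assumes the fstatistic component of the summary object, a named vector with entries value, numdf and dendf; the variable names are illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# F statistic with its numerator and denominator degrees of freedom.
fstat <- summary(eruption.lm)$fstatistic

# Upper-tail probability of the F distribution gives the p-value.
p.value <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                     lower.tail = FALSE))

# Decision at the .05 significance level.
reject.null <- p.value < 0.05
print(reject.null)
```

In a simple linear regression with one predictor, the F statistic equals the square of the t value of the slope, so both tests lead to the same decision.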
Confidence Interval for Linear Regression

Assume that the error term $\epsilon$ in the linear regression
model is independent of x, and is normally distributed,
with zero mean and constant variance. For a given value
of x, the interval estimate for the mean of the dependent
variable, $\bar{y}$, is called the confidence interval.
Problem
In the data set faithful, develop a 95% confidence interval
of the mean eruption duration for the waiting time of 80

minutes.
Solution
We apply the lm function to a formula that describes the variable
eruptions by the variable waiting, and save the linear regression

model in a new variable eruption.lm.
> attach(faithful) # attach the data frame
> eruption.lm = lm(eruptions ~ waiting)
Then we create a new data frame that sets the waiting time
value.
> newdata = data.frame(waiting=80)
We now apply the predict function and set the predictor variable
in the newdata argument. We also set the interval type as
"confidence", and use the default 0.95 confidence level.
> predict(eruption.lm, newdata, interval="confidence")
fit lwr upr
1 4.1762 4.1048 4.2476
> detach(faithful) # clean up
Answer
The 95% confidence interval of the mean eruption
duration for the waiting time of 80 minutes is between

4.1048 and 4.2476 minutes.
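To see where the interval bounds come from, the same interval can be rebuilt from the standard error of the fitted mean and a t quantile. This is a sketch, not the slides' own method: it passes data=faithful to lm instead of using attach, and the names p, t.crit, manual and builtin are illustrative.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
newdata <- data.frame(waiting = 80)

# se.fit = TRUE returns the fitted value, its standard error,
# and the residual degrees of freedom.
p <- predict(eruption.lm, newdata, se.fit = TRUE)
fit <- unname(p$fit)
se  <- unname(p$se.fit)

# Two-sided 95% interval: fitted value +/- t quantile * standard error.
t.crit <- qt(0.975, df = p$df)
manual <- c(fit = fit, lwr = fit - t.crit * se, upr = fit + t.crit * se)

# Interval computed directly by predict() for comparison.
builtin <- predict(eruption.lm, newdata, interval = "confidence")

print(manual)
print(builtin)
```

Supplying the data frame through the data argument avoids the need for attach and detach, which can leave stray variables on the search path if detach is forgotten.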
Residual Plot
The residual data of the simple linear regression model is the
difference between the observed data of the dependent
variable y and the fitted values $\hat{y}$:
$$\text{Residual} = y - \hat{y}$$
Problem
Plot the residual of the simple linear regression model of the

data set faithful against the independent variable waiting.
Solution
We apply the lm function to a formula that describes the variable
eruptions by the variable waiting, and save the linear
regression model in a new variable eruption.lm. Then we
compute the residual with the resid function.

> eruption.res = resid(eruption.lm)
We now plot the residual against the observed values of
the variable waiting.
> plot(faithful$waiting, eruption.res,
+ ylab="Residuals", xlab="Waiting Time",
+ main="Old Faithful Eruptions")
> abline(0, 0) # the horizon
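Before plotting, it can be useful to confirm that resid returns exactly the observed values minus the fitted values. The following sketch checks this on the faithful model; manual.res is an illustrative name.

```r
eruption.lm  <- lm(eruptions ~ waiting, data = faithful)
eruption.res <- resid(eruption.lm)

# Residual = observed value minus fitted value.
manual.res <- faithful$eruptions - fitted(eruption.lm)

# The two vectors should agree up to floating-point error.
same <- isTRUE(all.equal(unname(eruption.res), unname(manual.res)))
print(same)

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero, which is why the plot is centred on the
# horizontal line drawn by abline(0, 0).
print(abs(sum(eruption.res)) < 1e-8)
```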
Normal Probability Plot of Residuals

The normal probability plot is a graphical tool for comparing a
data set with the normal distribution. We can use it with the
standardized residual of the linear regression model and see if
the error term is actually normally distributed.
Problem
Create the normal probability plot for the standardized residual
of the data set faithful.
Solution
We apply the lm function to a formula that describes the
variable eruptions by the variable waiting, and save the linear
regression model in a new variable eruption.lm. Then we
compute the standardized residual with the rstandard function.

> eruption.stdres = rstandard(eruption.lm)
We now create the normal probability plot with the qqnorm
function, and add the qqline for further comparison.

> qqnorm(eruption.stdres,
+ ylab="Standardized Residuals",
+ xlab="Normal Scores",
+ main="Old Faithful Eruptions")
> qqline(eruption.stdres)
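The standardized residual is the raw residual divided by an estimate of its standard deviation. A minimal sketch of that computation, assuming the usual definition based on the residual standard error and the leverage values from hatvalues, is shown below; e, s, h and manual are illustrative names.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

e <- resid(eruption.lm)            # raw residuals
s <- summary(eruption.lm)$sigma    # residual standard error
h <- hatvalues(eruption.lm)        # leverage of each observation

# Standardized residual: e_i / (s * sqrt(1 - h_i)).
manual <- e / (s * sqrt(1 - h))

# Compare with the built-in rstandard function.
same <- isTRUE(all.equal(unname(manual), unname(rstandard(eruption.lm))))
print(same)
```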
Multiple Linear Regression

A multiple linear regression (MLR) model that describes a
dependent variable y by independent variables x1, x2, ...,
xp (p > 1) is expressed by the equation as follows, where
the numbers $\alpha$ and $\beta_k$ (k = 1, 2, ..., p) are the parameters,
and $\epsilon$ is the error term.
$$y = \alpha + \sum_{k=1}^{p} \beta_k x_k + \epsilon$$
Example
For example, in the built-in data set stackloss from
observations of a chemical plant operation, if we assign
stack.loss as the dependent variable, and assign Air.Flow
(cooling air flow), Water.Temp (inlet water temperature)
and Acid.Conc. (acid concentration) as independent
variables, the multiple linear regression model is:
$$\text{stack.loss} = \alpha + \beta_1 \cdot \text{Air.Flow} + \beta_2 \cdot \text{Water.Temp} + \beta_3 \cdot \text{Acid.Conc.} + \epsilon$$
Problem
Apply the multiple linear regression model for the data set
stackloss, and predict the stack loss if the air flow is 72,
water temperature is 20 and acid concentration is 85.
Solution
We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model
in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
We also wrap the parameters inside a new data frame named
newdata.
> newdata = data.frame(Air.Flow=72, # wrap the parameters
+ Water.Temp=20,
+ Acid.Conc.=85)
Lastly, we apply the predict function to stackloss.lm and
newdata.
> predict(stackloss.lm, newdata)
1
24.582
Answer
Based on the multiple linear regression model and the given
parameters, the predicted stack loss is 24.582.
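The prediction is simply the estimated regression equation evaluated at the given inputs. The sketch below reproduces it from the coefficients; the vectors b and x and the names manual and via.predict are illustrative.

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

# Coefficients in the order: (Intercept), Air.Flow, Water.Temp, Acid.Conc.
b <- coefficients(stackloss.lm)

# Input vector with a leading 1 for the intercept term.
x <- c(1, 72, 20, 85)

# Intercept plus the sum of coefficient * predictor products.
manual <- sum(b * x)

# Same prediction through predict() for comparison.
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
via.predict <- unname(predict(stackloss.lm, newdata))

print(c(manual = manual, predict = via.predict))
```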
Multiple Coefficient of Determination

The coefficient of determination of a multiple linear
regression model is the quotient of the variances of the
fitted values and observed values of the dependent
variable. If we denote $y_i$ as the observed values of the
dependent variable, $\bar{y}$ as its mean, and $\hat{y}_i$ as the fitted value,
then the coefficient of determination is:
$$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
Example
Find the coefficient of determination for the multiple linear
regression model of the data set stackloss.

Solution
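We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model
in a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
Then we extract the coefficient of determination from the
r.squared attribute of its summary.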
> summary(stackloss.lm)$r.squared
[1] 0.91358
Answer
The coefficient of determination of the multiple linear
regression model for the data set stackloss is 0.91358.
Adjusted Coefficient of Determination

The adjusted coefficient of determination of a multiple
linear regression model is defined in terms of the
coefficient of determination as follows, where n is the
number of observations in the data set, and p is the
number of independent variables.
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
Example
Problem
Find the adjusted coefficient of determination for the
multiple linear regression model of the data set stackloss.
Solution
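We apply the lm function to a formula that describes the
variable stack.loss by the variables Air.Flow, Water.Temp
and Acid.Conc. And we save the linear regression model in
a new variable stackloss.lm.
> stackloss.lm = lm(stack.loss ~
+ Air.Flow + Water.Temp + Acid.Conc.,
+ data=stackloss)
Then we extract the adjusted coefficient of determination from the
adj.r.squared attribute of its summary.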
> summary(stackloss.lm)$adj.r.squared
[1] 0.89833
Answer
The adjusted coefficient of determination of the multiple
linear regression model for the data set stackloss is
0.89833.
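The adjusted value can be recomputed from the ordinary coefficient of determination using the standard adjustment 1 - (1 - R^2)(n - 1)/(n - p - 1). This sketch assumes n is the number of rows of stackloss and p = 3 predictors; adj.manual is an illustrative name.

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)
s <- summary(stackloss.lm)

n <- nrow(stackloss)   # number of observations
p <- 3                 # number of independent variables

# Adjustment penalises R^2 for the number of predictors used.
adj.manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(manual = adj.manual, summary = s$adj.r.squared))
```

Because the factor (n - 1)/(n - p - 1) is greater than 1 whenever p > 0, the adjusted value is always below the unadjusted one, here 0.89833 against 0.91358.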
Logistic Regression
The logistic regression equation is used to predict the
probability of a dependent variable taking the dichotomous
values 0 or 1.
Suppose x1, x2, ..., xp are the independent variables,
$\alpha$ and $\beta_k$ (k = 1, 2, ..., p) are the parameters, and E(y) is the
expected value of the dependent variable y, then the
logistic regression equation is:
$$E(y) = \frac{1}{1 + e^{-(\alpha + \sum_{k} \beta_k x_k)}}$$
Example
In the built-in data set mtcars, the data column am
represents the transmission type of the automobile model
(0 = automatic, 1 = manual).
With the logistic regression equation, we can model the
probability of a manual transmission in a vehicle based on
its engine horsepower and weight data.
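The logistic regression equation maps any real-valued linear combination of the predictors to a probability between 0 and 1. The sketch below illustrates this with a hand-written logistic function, assuming the standard form 1/(1 + exp(-eta)); the function name logistic and the sample values in eta are illustrative.

```r
# Logistic function: maps a linear predictor eta to a probability.
logistic <- function(eta) 1 / (1 + exp(-eta))

# A few sample values of the linear predictor.
eta <- c(-5, 0, 5)
probs <- logistic(eta)
print(probs)

# Base R provides the same function as plogis().
same <- isTRUE(all.equal(probs, plogis(eta)))
print(same)
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and large negative or positive values push the probability towards 0 or 1.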
Estimated Logistic Regression Equation

Using the generalized linear model, an estimated logistic
regression equation can be formulated as below.
$$\hat{P}(y = 1 \mid x_1, \ldots, x_p) = \frac{1}{1 + e^{-(a + \sum_{k} b_k x_k)}}$$
The coefficients a and bk (k = 1, 2, ..., p) are determined
according to a maximum likelihood approach, and it
allows us to estimate the probability of the dependent
variable y taking on the value 1 for given values of xk (k =
1, 2, ..., p).
Example
Problem
By use of the logistic regression equation of vehicle
transmission in the data set mtcars, estimate the
probability of a vehicle being fitted with a manual
transmission if it has a 120hp engine and weighs 2800
lbs.
Solution
We apply the function glm to a formula that describes the transmission
type (am) by the horsepower (hp) and weight (wt). This creates a
generalized linear model (GLM) in the binomial family.
> am.glm = glm(formula=am ~ hp + wt,
+ data=mtcars,
+ family=binomial)
We then wrap the test parameters inside a data frame newdata. The
column wt is recorded in units of 1000 lbs, so 2800 lbs is entered as 2.8.
> newdata = data.frame(hp=120, wt=2.8)
Now we apply the function predict to the generalized linear model
am.glm along with newdata. We have to select the response prediction
type in order to obtain the predicted probability.
> predict(am.glm, newdata, type="response")
1
0.64181
Answer
For an automobile with a 120hp engine and a weight of 2800 lbs, the
probability of it being fitted with a manual transmission is about 64%.
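The predicted probability is the estimated logistic regression equation evaluated at the given inputs. The following sketch rebuilds it from the fitted coefficients with plogis, the logistic function in base R; the names b, eta, manual and via.predict are illustrative.

```r
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Coefficients in the order: (Intercept), hp, wt.
b <- coefficients(am.glm)

# Linear predictor for hp = 120 and wt = 2.8 (wt is in 1000 lbs).
eta <- sum(b * c(1, 120, 2.8))

# Logistic transform of the linear predictor gives the probability.
manual <- plogis(eta)

# Same probability through predict() with type = "response".
via.predict <- unname(predict(am.glm, data.frame(hp = 120, wt = 2.8),
                              type = "response"))

print(c(manual = manual, predict = via.predict))
```

Without type = "response", predict returns the linear predictor eta on the log-odds scale rather than a probability.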
