
R Software: An Overview

Hukum Chandra
ICAR-National Fellow & Principal Scientist
Email: hchandra@iasri.res.in

ICAR-Indian Agricultural Statistics Research Institute


Library Avenue, PUSA, New Delhi, India
www.iasri.res.in
Workshop Objective

The objective of this workshop is to provide an overview of the
basic R environment and its applications

This workshop is intended as a starting point for any future
development with R

At the end of the workshop, you will be familiar with the basic R
language and able to import, export and manipulate data using R

You will do exploratory data analysis, perform basic statistics and
build plots using R / RStudio

Background
R is available as Free Software and maintained by volunteers

R is a language and environment for statistical computing and
graphics

It provides a wide variety of statistical (linear and nonlinear
modelling, classical statistical tests, time-series analysis,
classification, clustering, ...) and graphical techniques

The development of the R system for statistical computing is
heavily influenced by the open source idea

R is extensible: it can be expanded by installing packages

The base distribution of R is maintained by a small group of
statisticians, the R Development Core Team
Initial developers: Ross Ihaka & Robert Gentleman

Maintained by top quality experts, continuous improvement

Available on all platforms (Linux, Mac, Windows)

Download the software from the internet: http://www.r-project.org/
(or Google "Download R")

Free to install, no catches

RStudio: a free Integrated Development Environment (IDE)
for R (recommended!!)

Download from http://www.rstudio.com/

RStudio provides an easy way to access all of the information and
objects that you need in a straightforward manner

If you install R and RStudio, then you only need to run RStudio

To download R software
In any web browser (e.g. Microsoft Internet Explorer), go to R
webpage: http://www.r-project.org

Downloads: CRAN - in the left-hand menu on the screen, click on
CRAN, which is under the Download item

Set your mirror - pick a country site from which to download (for
example, IIT Madras, India, but really you can pick any; all this affects
is download speed).

On your right-hand side you will see Download R for Windows

This brings you to a page where you select the part of R you need. While
you may later want to download the set of user-contributed functions, for
now just click on base, which gets you the basic R program

Click there and click on base

Click on Download R 3.2.3 for Windows (62 megabytes, 32/64 bit),
R-3.2.3-win.exe, and save it to your hard disk. At this point you should be asked (via a
prompt box) where you want to save the file. Pick a place on your PC to save this (e.g.,
c:\).
Latest available version of the software

It is an .exe file, which you can save on your hard disk

By double-clicking on the name of this file, R is automatically installed. All you need to
do is follow the installation process
To open R software
The installation process automatically creates a shortcut for R

Double click this icon to open the R environment

Or Start > All Programs > R

R will open up with the appearance of a standard Windows application

To run R program code
The main active window within the R environment is the R Console

R processes commands on a line-by-line basis

R works fundamentally on a question-and-answer model

Consequently it is necessary to hit ENTER after typing in (or pasting) a
line of R code in order to get R to execute it

Here at the command prompt (the symbol >), we can enter R
commands, which run instantly upon pressing the ENTER (carriage return) key

We can also run blocks of code. Use the R-supplied editor or the
Windows-supplied editor Notepad to display and edit our R program
code.

To open the editor
We are using the R-supplied editor to display and edit our R program
code, although any general-purpose editor will suffice. Open R-Editor by
going to the File button and clicking on:

File > New Script

Use the code editor to enter R commands

Use the Run option to execute the commands (or Ctrl+R)

Outputs are shown in the Console

Introduction to RStudio

Before you start
The preferred assignment operator is <-
The usual equals sign (=) can be used instead

Path separator: forward slash (/) or two backslashes (\\)
E:/R Course/inputdata.txt
E:\\R Course\\inputdata.txt

Set Working Directory
Folder where R looks for input/output files

Command getwd()
Gets the working directory

Command setwd()
Sets the working directory, e.g. setwd("E:\\R Course")
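
A minimal sketch of getwd() and setwd() in use. The folder name r_course_demo is hypothetical and is created under R's temporary directory, so the example does not assume that an E: drive exists.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Hypothetical folder for this example, created under the session's temp directory
demo_dir <- file.path(tempdir(), "r_course_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Change the working directory and confirm the change
setwd(demo_dir)
new_wd <- getwd()
print(basename(new_wd))

# Restore the original working directory
setwd(old_wd)
```

Files read or written with a relative name (for example "inputdata.txt") are looked up in whatever folder getwd() reports at that moment.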

Getting started with R
R can be used in many ways

Simple calculations, vectors and graphics

To begin with, we'll use R as a calculator. Enter an arithmetic expression
and receive the result (the second line is the answer line). Try the following
4+5
4/(3+5)
sqrt(9)+5^2
sin(pi/2)-log(exp(1))
exp(2)
rnorm(10)
rnorm(25,5,10)
sqrt(9)-5^2+2 #To take the power of something, use the caret symbol (^)
## Try this and see what you get
sqrt(9=)-5^2+2

?rnorm
?log
help(rnorm)
> 4+5
[1] 9
> 4/(3+5)
[1] 0.5
> sqrt(9)+5^2
[1] 28
> sin(pi/2)-log(exp(1))
[1] 0
> exp(2)
[1] 7.389056

> rnorm(10)
[1] 1.78896720 -1.13840718 -0.14144555 -0.06581805 -0.36301621 -0.47357570
[7] 1.17758935 0.33800009 -0.03361512 1.43694640

Here [7] indicates that 1.17758935 is the seventh element in the vector
Help and documentation
Roughly, three different forms of documentation for the R system for
statistical computing may be distinguished:
Online help that comes with the base distribution or packages
Electronic manuals and
Published work in the form of books etc
help function: Help about a specific command can be obtained by typing a
question mark before the command, for instance:

> ?log
As an alternative, the help function can be used; in this case, help(log) or
help(mean)

Entering and Manipulating Data in R
Assignments - to store intermediate results
To assign the value 5 to the variable a, enter
a <- 5
a
[1] 5
b <- 9
b
[1] 9
a+b
[1] 14
a-b
[1] -4
a+b-a^2+(1/b)+(a^-b)
[1] -10.88889

msg <- "hello"
msg
[1] "hello"

The symbol <- (or alternatively =) should be read as "assigns".
The two characters <- should be read as a single symbol: an arrow
pointing to the variable to which the value is assigned
A couple of other useful things

R is case-sensitive; for example, data, Data and DATA are three
different names in R

A comment in R code begins with a hash symbol (#)
- Any line starting with # is a comment and is not executed
Comment your code so you remember what it does
R scripts are simply text files with a .R extension
Use Ctrl + R to submit code
Use the up and down arrows to cycle through previous commands in the
console
Don't be afraid of errors; you won't break R
If you get stuck, Google is your friend
Spacing around operators is generally disregarded by R
However, adding a space in the middle of <- changes the meaning to
"less than" followed by "minus"
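
The point about spacing inside the assignment arrow can be illustrated with a short sketch. The variable names r1 and r2 are chosen here only for illustration.

```r
a <- 5          # assignment: a now holds the value 5

# With a space between < and -, R reads a comparison instead of an assignment
r1 <- a < -5    # is a less than minus 5?
r2 <- a < - 5   # the same comparison; a is not changed

print(r1)
print(r2)
print(a)        # a still holds 5
```

Neither comparison changes a, which is why an accidental space can silently turn an intended assignment into a test that just prints TRUE or FALSE.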

OBJECTS
R has five basic classes of objects:
character
numeric (real numbers)
integer
complex
logical (TRUE/FALSE)

The most basic object is a vector

A vector can only contain objects of the same class

BUT: The one exception is a list, which is represented as a vector but
can contain objects of different classes (indeed, that's usually why we
use them)
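
A small sketch of the difference between a vector and a list, using made-up values. When c() is given elements of different classes, R coerces them to a common class, whereas list() keeps each element as it is.

```r
# A vector whose elements all have the same class
num_vec <- c(1.5, 2, 3)
print(class(num_vec))

# Mixing classes in c() does not fail: everything is coerced to character
mixed_vec <- c(1.5, "a", TRUE)
print(mixed_vec)
print(class(mixed_vec))

# A list keeps each element's own class
my_list <- list(1.5, "a", TRUE)
list_classes <- sapply(my_list, class)
print(list_classes)
```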

ATTRIBUTES

R objects can have attributes
names, dimnames
dimensions (e.g. matrices, arrays)
class
length
other user-defined attributes/metadata

Attributes of an object can be accessed using the attributes() function.
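
As a brief illustration, the sketch below inspects the attributes of a matrix and of a named vector, and then attaches a user-defined attribute. The attribute name "units" is a hypothetical choice made for this example.

```r
# A matrix carries a dim attribute
m <- matrix(1:6, nrow = 2)
print(attributes(m))

# A named vector carries a names attribute
v <- c(a = 1, b = 2)
print(attributes(v))

# Attach a user-defined attribute (the name "units" is only an example)
attr(v, "units") <- "kg"
print(attributes(v))
```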

Data Types in R
Scalars (numeric, character etc)
Vectors
Matrices
Data frames

Vectors and Matrices
Vectors and matrices are of great importance in many numerical
problems, since one cannot do much statistics on single numbers

Creating Vectors

The c() function can be used to create vectors of objects.


x <- c(0.5, 0.6) ## numeric
x <- c(TRUE, FALSE) ## logical
x <- c(T, F) ## logical
x <- c("a", "b", "c") ## character
x <- 9:29 ## integer
x <- c(1+0i, 2+4i) ## complex

Using the vector() function


x <- vector("numeric", length = 10)
x
[1] 0 0 0 0 0 0 0 0 0 0

Working with Vectors
To create a vector named tempdata and assign the values 5, -3, 8 to
it, we write as follows:
tempdata <- c(5,-3,8)
The construct c() is used to define a vector. We can do calculations
with vectors just like ordinary numbers as long as they are of the same
length
Vectors can be manipulated, for instance by adding a constant to all
elements
tempdata <- c(5,-3,8)
myconst <- 50
myconst+tempdata

weight <- c(60, 72, 57, 90, 55, 80)
height <- c(1.75, 1.80, 1.65, 1.90, 1.55, 1.85)
bmi <- weight/height^2

Here we note that the operation is carried out element-wise

Sequences
A vector X1 consisting of the integers between 1 and 10 can be created
by writing
X1 <- c(1:10) # 1:10 is short form for 1, 2, 3, ..., 10
X2 <- c(1:5, 10:15)
X1
X2
Sequence function
X3 <- seq(1,10)
X3
1:n produces 1, 2, ..., n
Function seq(from, to, by=) produces desired sequences
Vectors with sequences of numbers with particular increments can be
created with the seq command:
mydata1 <- seq(0,10,2) # numbers from 0 to 10, with increment 2

Component extraction
x<- c(2,3,1,5,4,6,5,7,6,8)
y <- c(10, 12, 14, 13, 34, 23, 12, 34, 25, 43)

Elements of a vector can be accessed as
x[1] # The first element of the vector x
[1] 2

x[2] # The 2nd element of the vector x
[1] 3

x[1:4] # Elements 1 to 4
[1] 2 3 1 5

x[x > 4] # Elements greater than 4
[1] 5 6 5 7 6 8

u <- x > 4
u
[1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE

x[u]
[1] 5 6 5 7 6 8

Functions on vectors:
length(x) # To compute the number of elements in x.
[1] 10

sum(x) # To compute the sum of the data in x.
[1] 47

sum(x^2) # To compute the sum of squares of the data in x.
[1] 265

mean(x) # To compute the mean of the data in x.
[1] 4.7

mean(y)
[1] 22

var(x) # To compute the variance of x.
[1] 4.9

sqrt(var(x)) # To compute the standard deviation of x.
[1] 2.213594

sum((x-mean(x))^2) # Sum of squared deviations from the mean
[1] 44.1

sqrt(var(x))/mean(x)*100 # To compute the coefficient of variation (in percent)

# To compute summary features of the data in x.
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.25    5.00    4.70    6.00    8.00

summary(x^2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   10.75   25.00   26.50   36.00   64.00

Some calculations
sum(weight)
mean(weight) is the same as sum(weight) / length(weight)

If we denote mean(weight) by xbar, then
sd(weight) is the same as sqrt(sum((weight - xbar)^2) / (length(weight) - 1))
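
The relationships between sum, mean and sd can be checked directly. This sketch uses the weight vector defined earlier and the sample standard deviation, which divides by n - 1; the name manual_sd is chosen here only for illustration.

```r
weight <- c(60, 72, 57, 90, 55, 80)

# Mean computed by hand
xbar <- sum(weight) / length(weight)

# Sample standard deviation computed by hand (divisor is n - 1)
manual_sd <- sqrt(sum((weight - xbar)^2) / (length(weight) - 1))

# Compare with the built-in functions
print(all.equal(xbar, mean(weight)))
print(all.equal(manual_sd, sd(weight)))
```

all.equal() is used instead of == because floating-point results that are mathematically equal can differ in the last few digits.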

cor(x,y) # To compute the correlation coefficient between x and y.
var(x,y) # To compute the covariance between x and y.

Rep function
The rep function repeats a number or vector a desired number of times

r1 <- rep(5,2)
r1
[1] 5 5

r2=rep(c("Imphal","Delhi"),3)
r2
[1] "Imphal" "Delhi" "Imphal" "Delhi" "Imphal" "Delhi"

r3=rep(c("Imphal","Delhi"),each=3)
r3
[1] "Imphal" "Imphal" "Imphal" "Delhi" "Delhi" "Delhi"

Slightly more complicated example

The rule of thumb is that the BMI for a normal weight individual
should be between 20 and 25, and we want to know if our data
deviate systematically from that.

We can use a one-sample t test to assess whether the six persons' BMI
can be assumed to have mean 22.5, given that they come from a
normal distribution.

We can use the function t.test

You might not know about the t test yet, but the example is just to
give some indication of what real statistical output looks like

t test (see ?t.test)

t.test(bmi, mu=22.5)

One Sample t-test

data: bmi
t = -0.22855, df = 5, p-value = 0.8283
alternative hypothesis: true mean is not equal to 22.5
95 percent confidence interval:
20.35465 24.29501
sample estimates:
mean of x
22.32483

If mu is not given then t.test would use default mu=0

The p value is not small, indicating that it is not at all unlikely to get data
like those observed if the mean were in fact 22.5
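
The printed output can also be accessed programmatically. This sketch rebuilds bmi from the weight and height vectors defined earlier and stores the test result in an object, here called res for illustration, whose components hold the individual numbers.

```r
weight <- c(60, 72, 57, 90, 55, 80)
height <- c(1.75, 1.80, 1.65, 1.90, 1.55, 1.85)
bmi <- weight / height^2

# Store the result instead of only printing it
res <- t.test(bmi, mu = 22.5)

print(res$statistic)  # the t value
print(res$p.value)    # the p-value
print(res$conf.int)   # the 95 percent confidence interval
print(res$estimate)   # the sample mean
```

Storing the result this way is convenient when a p-value or confidence interval has to be reused in later calculations or reports.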

Packages in R
Several packages are available to enhance R's data analysis capabilities
(last count 4955)

For the complete list see http://cran.r-project.org/

Required package(s) need to be downloaded and installed

Use Install Packages

The base distribution already comes with some high priority add on
packages, e.g., boot, nlme, stats, grid, foreign, MASS, spatial etc

The packages included by default in the base distribution implement
standard statistical functionality, for example, linear models, classical
tests, a huge collection of high-level plotting functions etc

Packages not included in the base distribution can be installed directly
from the R prompt
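
A minimal sketch of installing and loading a package from the R prompt. MASS is used only as an example because it normally ships with R already; the mirror URL is an assumption and any CRAN mirror can be substituted. Installation needs an internet connection.

```r
pkg <- "MASS"  # example package name

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Load the package so that its functions can be used in this session
library(pkg, character.only = TRUE)

# Check that the package is now attached
mass_loaded <- "package:MASS" %in% search()
print(mass_loaded)
```

A package is installed once per R installation but has to be loaded with library() in every new session.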
Classical Tests
To load the library of classical test statistics available with R, use

library(stats)

# To get results of the t-test for comparing population means of x and y when
# variances are not equal.
t.test(x,y)

# To get results for the usual t-test when variances are equal. If T is replaced
# by F then it is equivalent to t.test(x, y)

t.test(x,y,var.equal=T)

?t.test

library(stats)

x <- c(2,3,1,5,4,6,5,7,6,8)
y <- c(10, 12, 14, 13, 34, 23, 12, 34, 25, 43)

mean(x)
mean(y)
var(x,y)
cor(x,y)
t.test(x)
t.test(x,y)
t.test(x,y, var.equal=T)

var.test(x,y) #To compare variances of x and y

F Test to Compare Two Variances
Performs an F test to compare the variances of two samples from normal
populations.
var.test(x, ...)
x1 <- rnorm(100, mean = 0, sd = 2)
y1 <- rnorm(60, mean = 1, sd = 1)
var.test(x1, y1) # Do x1 and y1 have the same variance?

Shapiro-Wilk test of normality
The Shapiro-Wilk test assesses whether data are likely to have come from a
normal distribution.
shapiro.test()

A low p-value means the test is significant and the hypothesis that the sample
data come from a normal distribution is rejected

shapiro.test (bmi)
yy=rnorm(100,5,1)
shapiro.test(yy)
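
To see how the p-value behaves, the sketch below applies the test to two simulated samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The sample names and the seed are arbitrary choices for this illustration.

```r
set.seed(123)  # fixes the random numbers so the run is reproducible

normal_sample <- rnorm(100, mean = 5, sd = 1)  # normally distributed data
skewed_sample <- rexp(100, rate = 1)           # clearly non-normal data

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

print(p_normal)
print(p_skewed)
```

The skewed sample should give a very small p-value, so normality is rejected. The normal sample will usually, though not always, give a p-value above 0.05, since a true null hypothesis is still rejected about 5 percent of the time at that level.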
Nonparametric Tests of Group Differences

R provides functions for carrying out Mann-Whitney U, Wilcoxon
Signed Rank, Kruskal-Wallis, and Friedman tests

# Independent 2-group Mann-Whitney U Test
wilcox.test(y~A) # where y is numeric and A is a binary factor
wilcox.test(y,x) # where y and x are numeric

# Dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
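
The Kruskal-Wallis and Friedman tests mentioned above are available as kruskal.test() and friedman.test() in the stats package. The sketch below uses made-up numbers purely for illustration; the object names yield, group and scores are hypothetical.

```r
# Kruskal-Wallis test: compares more than two independent groups
# Hypothetical measurements for three groups of four units each
yield <- c(12, 15, 11, 14, 20, 22, 19, 21, 30, 28, 33, 31)
group <- factor(rep(c("A", "B", "C"), each = 4))
kw <- kruskal.test(yield ~ group)
print(kw)

# Friedman test: compares treatments measured on the same blocks
# Hypothetical scores: rows are blocks, columns are treatments
scores <- matrix(c(1, 2, 3,
                   2, 3, 4,
                   1, 3, 5,
                   2, 4, 6), nrow = 4, byrow = TRUE)
fr <- friedman.test(scores)
print(fr)
```

Both functions return the same kind of result object as t.test, so components such as kw$p.value can be extracted in the same way.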

Matrices
Two-dimensional structure whose elements are all of the same type

The commands rbind and cbind can be used to combine vectors into
matrices, as rows (rbind) or as columns (cbind)
x <- c(1,2,3)
y <- c(4,5,6)
A = cbind(x,y)
B = rbind(x,y)
C = t(B)
# The last command gives the matrix transpose of B.

Create matrices: matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,ncol=3,byrow=T)
z <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=T)
z
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Component extraction
A[r,] - rth row of object A
A[,c] - cth column of object A
A[r,c] - entry in row r and column c of object A
A[A<10] - extract all elements of A that are smaller than 10

z[2,3]
[1] 6

z[,1]
[1] 1 4 7

z[1,]
[1] 1 2 3
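
Extraction by a logical condition, as in A[A<10], can be sketched with the same matrix z. The names big and sub are chosen here only for illustration.

```r
z <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, byrow = TRUE)

# Elements greater than 5; the result is a plain vector, read column by column
big <- z[z > 5]
print(big)

# Several columns at once give a sub-matrix
sub <- z[, 2:3]
print(sub)
```

The order of the extracted elements follows R's column-wise storage of matrices, not the row-wise order in which z was filled.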

Arrays
Arrays are similar to matrices, but can be of more than 2 dimensions
Useful in programming new statistical methods
Natural Extension of Matrices
Must be of single type
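
A small sketch of a three-dimensional array, built with the array() function. The dimensions 2 x 3 x 4 are an arbitrary choice for illustration.

```r
# 24 integers arranged as 2 rows, 3 columns and 4 layers
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))

# One element: row 2, column 3, layer 4
last_value <- arr[2, 3, 4]
print(last_value)

# A whole layer is an ordinary matrix
first_layer <- arr[, , 1]
print(first_layer)
```

Indexing works as for matrices, with one index per dimension and an empty position meaning "everything along this dimension".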

Data Frame
Data frames - similar to tables (databases), datasets (SAS/SPSS) etc.
Consist of columns of different types
More general than a matrix
Columns are variables; rows are observations
Convenient to hold all the data required for a data analysis

Data Frame - Creation
Can be created in many ways
One way is the data.frame function
General syntax is data.frame(col1, col2, ...) where col1, col2, etc. are
columns of the same or different data types (numeric/character/logical)

Factors
R functions treat nominal and ordinal variables differently from
continuous variables
In R, nominal and ordinal variables are called factors
Use the factor() function to turn any variable into a factor
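
A short sketch of creating a nominal and an ordinal factor from the same made-up character data. The variable names are hypothetical.

```r
# Hypothetical categorical data
size <- c("small", "large", "medium", "small", "large")

# Nominal factor: levels are sorted alphabetically by default
size_nominal <- factor(size)
print(levels(size_nominal))

# Ordinal factor: the order of the levels is stated explicitly
size_ordinal <- factor(size,
                       levels = c("small", "medium", "large"),
                       ordered = TRUE)
print(table(size_ordinal))

# Comparisons are meaningful only for ordered factors
print(size_ordinal[1] < size_ordinal[2])  # is "small" below "large"?
```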

Handling Data
DATA FRAMES
Data frames are used to store tabular data

They are represented as a special type of list where every element of
the list has to have the same length

Each element of the list can be thought of as a column and the length
of each element of the list is the number of rows

Unlike matrices, data frames can store different classes of objects in
each column (just like lists); matrices must have every element be the
same class

Data frames also have a special attribute called row.names

Data frames are usually created by calling read.table() or read.csv()

Can be converted to a matrix by calling data.matrix()
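
A minimal sketch of exporting and importing a data frame and converting it to a matrix. The data and column names are made up, and a temporary file is used so that no particular folder is assumed.

```r
# Write a small hypothetical data set to a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(4.5, 3.2, 5.0)),
          csv_file, row.names = FALSE)

# Read it back: the result is a data frame
dat <- read.csv(csv_file)
print(dat)
print(is.data.frame(dat))
print(is.list(dat))  # a data frame is a special kind of list

# Convert the data frame to a numeric matrix
m <- data.matrix(dat)
print(m)
```

read.table() works in the same way for other delimited text files, with arguments such as header and sep describing the file layout.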


Creating data frames
Data frame: represents the data in the traditional table-oriented way

The command data.frame can be used to organize data of different
kinds and to extract subsets of said data. Assume that we have data
about three persons and that we store it as follows:
length <- c(180,175,190)
weight <- c(75,82,88)
name <- c("Anil","Ankit","Sunil")

Here name is a character vector, i.e. a vector of text strings. It does not
matter here whether you use single or double quote symbols, as long as
the left quote is the same as the right quote
friends <- data.frame(name,length,weight)
friends is now a data frame containing the data for the three persons
A data frame corresponds to what other statistical packages call a
"data matrix" or a "data set". It is a list of vectors and/or factors of
the same length

Data can easily be extracted:


my.names <- friends$name
length1 <- friends$length[1]
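
Beyond the $ operator, rows and columns can be extracted with square brackets, much as for matrices. This sketch rebuilds the friends data frame with the same values; the names second_person, heavy and name_weight are chosen here only for illustration.

```r
friends <- data.frame(name = c("Anil", "Ankit", "Sunil"),
                      length = c(180, 175, 190),
                      weight = c(75, 82, 88))

# One row: all data for the second person
second_person <- friends[2, ]
print(second_person)

# Rows that satisfy a condition: persons heavier than 80
heavy <- friends[friends$weight > 80, ]
print(heavy)

# Selected columns by name
name_weight <- friends[, c("name", "weight")]
print(name_weight)
```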
