16BDA71011

ASSIGNMENT 1.
CONFIDENCE INTERVAL ESTIMATION
Aim: To estimate a confidence interval for the population mean by determining the sample mean and standard error.

Software used: R-Studio.

Procedure:
A confidence interval (CI) is a type of interval estimation of a population parameter.
It is an observed interval (i.e., it is calculated from the observations), in principle different
from sample to sample, that potentially includes the unobservable true parameter of interest.
How frequently the observed interval contains the true parameter if the experiment is
repeated is called the confidence level. In other words, if confidence intervals are constructed
in separate experiments on the same population following the same process, the proportion
of such intervals that contain the true value of the parameter will match the given confidence
level. Two-sided confidence limits form a confidence interval, whereas one-sided limits
are referred to as lower or upper confidence bounds (or limits).
Mean: xbar = (1/n)*sum(xi)
Standard error: se = sigma/sqrt(n)
Random variable: z = (xbar - mu0)/(sigma/sqrt(n))
Lower confidence bound: LCI = xbar - 1.96*(sigma/sqrt(n))
Upper confidence bound: UCI = xbar + 1.96*(sigma/sqrt(n))
Confidence interval estimate:
mu = xbar ± z*(sigma/sqrt(n))

Where: Z is the normal distribution’s critical value for a probability of α/2 in each tail.
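For instance, the interval formula above can be computed directly in R. The following is a minimal sketch; the sample vector x and the 95% level are made-up illustration values, not part of the problem below.

x = c(4.9, 5.1, 5.0, 4.8, 5.2, 5.0) # hypothetical observations
n = length(x)
xbar = mean(x) # sample mean
se = sd(x)/sqrt(n) # standard error of the mean
z = qnorm(1-0.05/2) # critical value, 1.959964 for 95% confidence
c(xbar - z*se, xbar + z*se) # lower and upper confidence bounds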

Problem:
Load the data set mtcars in the datasets R package. Calculate a 90% confidence interval for
the variable mpg.

Input:
data(mtcars)
mpg1=mtcars$mpg
mpg1
n=length(mpg1)
n
xbar=mean(mpg1)
xbar
se=sd(mpg1)/sqrt(n)
se
z=qt(1-0.1/2,n-1) #table value
z
cl=xbar+c(1,-1)*z*se # upper and lower confidence limits
cl
#OR
t.test(mpg1,conf.level=.9) # to set level of confidence to be 90%


ci=t.test(mpg1,conf.level=.9)$conf.int # to display only the confidence interval
ci
round(ci)

Output:
> data(mtcars)
> mpg1=mtcars$mpg
> mpg1
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> n=length(mpg1)
> n
[1] 32
> xbar=mean(mpg1)
> xbar
[1] 20.09062
> se=sd(mpg1)/sqrt(n)
> se
[1] 1.065424
> z=qt(1-0.1/2,n-1) #table value
> z
[1] 1.695519
> cl=xbar+c(1,-1)*z*se
> cl
[1] 21.89707 18.28418
> #OR
> t.test(mpg1,conf.level=.9) # to set level of confidence to be 90%

One Sample t-test

data: mpg1
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 18.28418 21.89707
sample estimates:
mean of x 
 20.09062 
> ci=t.test(mpg1,conf.level=.9)$conf.int # to display only the confidence interval
> ci
[1] 18.28418 21.89707
attr(,"conf.level")
[1] 0.9
> round(ci)
[1] 18 22
attr(,"conf.level")
[1] 0.9

Result:
Therefore, the 90% confidence interval for the variable mpg of the mtcars dataset is
(18.28418, 21.89707).


ASSIGNMENT 2.
HYPOTHESIS TESTING WITH KNOWN POPULATION VARIANCE
Aim: To test the hypothesis using Z-test with known population variance.

Software used: R-studio.

Procedure:
A Z-test is any statistical test for which the distribution of the test statistic under the
null hypothesis can be approximated by a normal distribution. Because of the central limit
theorem, many test statistics are approximately normally distributed for large samples. For
each significance level, the Z-test has a single critical value (for example, 1.96 for a 5% two-
tailed test), which makes it more convenient than the Student's t-test, which has separate critical
values for each sample size. Therefore, many statistical tests can be conveniently performed
as approximate Z-tests if the sample size is large or the population variance known. If the
population variance is unknown (and therefore has to be estimated from the sample itself)
and the sample size is not large (n < 30), the Student's t-test may be more appropriate.
If T is a statistic that is approximately normally distributed under the null hypothesis, the next
step in performing a Z-test is to estimate the expected value θ of T under the null hypothesis,
and then obtain an estimate s of the standard deviation of T. After that the standard score Z
= (T − θ) / s is calculated, from which one-tailed and two-tailed p-values can be calculated as
Φ(−Z) (for upper-tailed tests), Φ(Z) (for lower-tailed tests) and 2Φ(−|Z|) (for two-tailed tests)
where Φ is the standard normal cumulative distribution function.
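As a sketch of these formulas in R (the helper function z_test and its example values are illustrative, not prescribed by the assignment):

z_test = function(xbar, mu0, sigma, n) {
  z = (xbar - mu0)/(sigma/sqrt(n)) # standard score Z = (T - theta)/s
  list(z = z,
       p_lower = pnorm(z), # lower-tailed p-value, Phi(Z)
       p_upper = pnorm(z, lower.tail = FALSE), # upper-tailed p-value, Phi(-Z)
       p_two = 2*pnorm(-abs(z))) # two-tailed p-value, 2*Phi(-|Z|)
}
z_test(xbar = 4.962, mu0 = 5, sigma = 0.1, n = 16)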

Problem:
The production manager of Twin Forks Ball Bearing, Inc., has asked your assistance in
evaluating a modified ball bearing production process. When the process is operating
properly, the process produces ball bearings whose weights are normally distributed with a
population mean of 5 ounces and a population standard deviation of 0.1 ounce. A new raw-
material supplier was used for a recent production run, and the manager wants to know if
that change has resulted in a lowering of the mean weight of the ball bearings. There is no
reason to suspect a problem with the new supplier, and the manager will continue to use the
new supplier unless there is strong evidence that underweight ball bearings are being
produced.

Solution:
We will test the null hypothesis H0: mu = mu0 = 5
i.e. there is no strong evidence that underweight ball bearings are being produced against
the alternative hypothesis H1: mu < 5
i.e. there is strong evidence that the production process is producing underweight ball
bearings.


Input:
#lower_tail (only this is enough according to the question)
xbar=4.962
sigma=0.1
mu0=5
n=16
z=(xbar-mu0)/(sigma/sqrt(n))
z
alpha=.05 # significance level
z.alpha=qnorm(1-alpha)
-z.alpha
#or
pval=pnorm(z) #lower tail
pval #lower tail p value

#upper_tail
xbar=4.962
sigma=0.1
mu0=5
n=16
z=(xbar-mu0)/(sigma/sqrt(n))
z
alpha=0.05
z.alpha=qnorm(1-alpha)
z.alpha
#or
pval=pnorm(z,lower.tail=FALSE) #otherwise by default we get the lower tail value
pval #upper tail p-value

#Two_tail
xbar=4.962
mu0=5
si=0.1
n=16
z=(xbar-mu0)/(si/sqrt(n))
z
alpha=.05
z.half.alpha=qnorm(1-alpha/2)
c(-z.half.alpha,z.half.alpha)
#or
pval=2*pnorm(z) #lower tail
pval #two tail p-value

Output:
> #lower_tail (only this is enough according to the question)
> xbar=4.962
> sigma=0.1
> mu0=5
> n=16
> z=(xbar-mu0)/(sigma/sqrt(n))
> z
[1] -1.52
> alpha=.05 # significance level
> z.alpha=qnorm(1-alpha)
> -z.alpha
[1] -1.644854
> #or
> pval=pnorm(z)#lower tail
> pval #lower tail p value
[1] 0.06425549

> #upper_tail
> xbar=4.962


> sigma=0.1
> mu0=5
> n=16
> z=(xbar-mu0)/(sigma/sqrt(n))
> z
[1] -1.52
> alpha=0.05
> z.alpha=qnorm(1-alpha)
> z.alpha
[1] 1.644854
> #or
> pval=pnorm(z,lower.tail=FALSE) #otherwise by default we get the lower tail value
> pval #upper tail p-value
[1] 0.9357445
> #Two_tail
> xbar=4.962
> mu0=5
> si=0.1
> n=16
> z=(xbar-mu0)/(si/sqrt(n))
> z
[1] -1.52
> alpha=.05
> z.half.alpha=qnorm(1-alpha/2)
> c(-z.half.alpha,z.half.alpha)
[1] -1.959964 1.959964
> #or
> pval=2*pnorm(z)#lower tail
> pval #two tail p-value
[1] 0.128511

Result:
Since Zcal = -1.52 > -Zα = -1.644854, we do not reject the null hypothesis,
i.e. there is no strong evidence that the production process is producing underweight ball
bearings.


ASSIGNMENT 3.
HYPOTHESIS TESTING WITH UNKNOWN POPULATION VARIANCE
Aim: To test the hypothesis using t-test with unknown population variance.

Software used: R-studio.

Procedure:
A t-test is any statistical hypothesis test in which the test statistic follows a Student's
t-distribution under the null hypothesis. It can be used to determine if two sets of data are
significantly different from each other.
A t-test is most commonly applied when the test statistic would follow a normal distribution
if the value of a scaling term in the test statistic were known. When the scaling term is
unknown and is replaced by an estimate based on the data, the test statistics (under certain
conditions) follow a Student's t distribution.
Most t-test statistics have the form t = Z/s, where Z and s are functions of the data. Typically,
Z is designed to be sensitive to the alternative hypothesis (i.e., its magnitude tends to be larger
when the alternative hypothesis is true), whereas s is a scaling parameter that allows the
distribution of t to be determined.
As an example, in the one-sample t-test, t = Z/s, where Z = (xbar - mu0)/(sigma/sqrt(n)),
xbar is the sample mean from a sample X1, X2, …, Xn of size n, s is the ratio of the sample
standard deviation to the population standard deviation, sigma (σ) is the population standard
deviation of the data, and mu0 (μ0) is the population mean under the null hypothesis.
The assumptions underlying a t-test are that
• X follows a normal distribution with mean μ and variance σ2
• s2 follows a χ2 distribution with p degrees of freedom under the null hypothesis, where p is a
positive constant
• Z and s are independent.
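As a quick illustration of the one-sample case (a minimal sketch; the sample x below is made up), the manually computed statistic agrees with the built-in t.test():

x = c(4.1, 3.9, 4.5, 4.2, 3.8, 4.4, 4.0) # hypothetical sample
t_stat = (mean(x) - 4)/(sd(x)/sqrt(length(x))) # one-sample t statistic against mu0 = 4
t_stat
2*pt(-abs(t_stat), df = length(x) - 1) # two-tailed p-value
t.test(x, mu = 4) # built-in check of the same test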

Problems:
1. A random sample of 1,562 undergraduates enrolled in management ethics courses was
asked to respond on a scale from 1 (strongly disagree) to 7 (strongly agree) to this proposition:
Senior corporate executives are interested in social justice. The sample mean response was
4.27, and the sample standard deviation was 1.32. Test at the 1% level, against a two-sided
alternative, the null hypothesis that the population mean is 4.

Solution: The null hypothesis is that H0: mu= 4 i.e. the population mean is 4 against the
alternative hypothesis H1: mu≠4.
Input:
#1
#lower_tail
n=1562
xbar=4.27
mu0=4
s=1.32
t=(xbar-mu0)/(s/sqrt(n))
t
alpha =0.01
t.alpha=qt(1-alpha,df=n-1)
-t.alpha # critical value


#or
pval=pt(t,df=n-1)
pval #lower_tail_p value

#upper_tail
t=(xbar-mu0)/(s/sqrt(n))
t
alpha=0.01
t.alpha=qt(1-alpha,df=n-1)
t.alpha
#or
pval=pt(t,df=n-1,lower.tail=FALSE)
pval #upper_tail p-value

#Two_tail (solution for the above question)
t=(xbar-mu0)/(s/sqrt(n))
t
alpha=0.01
t.half.alpha=qt(1-alpha/2,df=n-1)
c(-t.half.alpha,t.half.alpha)
#or
pval=2*pt(-abs(t),df=n-1) # -abs(t) handles either sign of t
pval
Output:
> #1
> #lower_tail
> n=1562
> xbar=4.27
> mu0=4
> s=1.32
> t=(xbar-mu0)/(s/sqrt(n))
> t
[1] 8.084075
> alpha =0.01
> t.alpha=qt(1-alpha,df=n-1)
> -t.alpha # critical value
[1] -2.328739
> #or
> pval=pt(t,df=n-1)
> pval #lower_tail_p value
[1] 1

> #upper_tail
> t=(xbar-mu0)/(s/sqrt(n))
> t
[1] 8.084075
> alpha=0.01
> t.alpha=qt(1-alpha,df=n-1)
> t.alpha
[1] 2.328739
> #or
> pval=pt(t,df=n-1,lower.tail=FALSE)
> pval #upper_tail_p value
[1] 6.218266e-16


> #Two_tail
> t=(xbar-mu0)/(s/sqrt(n))
> t
[1] 8.084075
> alpha=0.01
> t.half.alpha=qt(1-alpha/2,df=n-1)
> c(-t.half.alpha,t.half.alpha)
[1] -2.578983 2.578983
> #or
> pval=2*pt(-abs(t),df=n-1) # -abs(t) handles either sign of t
> pval
[1] 1.243653e-15

Result:
Since tcal = 8.084075 > tα/2 = 2.578983, we reject the null hypothesis, i.e. the population
mean is not equal to 4.

2. An engineering research center claims that through the use of a new computer control
system, automobiles should achieve, on average, an additional 3 miles per gallon of gas. A
random sample of 100 automobiles was used to evaluate this product. The sample mean
increase in miles per gallon achieved was 2.4, and the sample standard deviation was 1.8
miles per gallon. Test the hypothesis that the population mean is at least 3 miles per gallon.
Find the p-value of this test, and interpret your findings.
Solution:
The null hypothesis is that H0:mu ≥ 3 i.e. the population mean is at least 3 miles per gallon
against the alternative hypothesis is H1:mu < 3.

Input:
#2
#lower_tail(solution for the above question)
n=100
xbar=2.4
mu0=3
s=1.8
t=(xbar-mu0)/(s/sqrt(n))
t
alpha =0.05
t.alpha=qt(1-alpha,df=n-1)
-t.alpha # critical value
#or
pval=pt(t,df=n-1)
pval #lower_tail_p value

#upper_tail
t=(xbar-mu0)/(s/sqrt(n))
t
alpha=0.05
t.alpha=qt(1-alpha,df=n-1)
t.alpha
#or
pval=pt(t,df=n-1,lower.tail=FALSE)
pval #upper_tail p-value
#Two_tail
t=(xbar-mu0)/(s/sqrt(n))
t
alpha=0.05
t.half.alpha=qt(1-alpha/2,df=n-1)
c(-t.half.alpha,t.half.alpha)
#or
pval=2*pt(t,df=n-1)
pval #two_tailed_value

Output:
> #2
> #lower_tail
> n=100
> xbar=2.4
> mu0=3
> s=1.8
> t=(xbar-mu0)/(s/sqrt(n))
> t
[1] -3.333333
> alpha =0.05
> t.alpha=qt(1-alpha,df=n-1)
> -t.alpha # critical value
[1] -1.660391
> #or
> pval=pt(t,df=n-1)
> pval #lower_tail_p value
[1] 0.0006040021
> #upper_tail
> t=(xbar-mu0)/(s/sqrt(n))
> t
[1] -3.333333
> alpha=0.05
> t.alpha=qt(1-alpha,df=n-1)
> t.alpha
[1] 1.660391
> #or
> pval=pt(t,df=n-1,lower.tail=FALSE)
> pval #upper_tail_p value
[1] 0.999396
> #Two_tail
> t=(xbar-mu0)/(s/sqrt(n))
> t
[1] -3.333333
> alpha=0.05
> t.half.alpha=qt(1-alpha/2,df=n-1)
> c(-t.half.alpha,t.half.alpha)
[1] -1.984217 1.984217
> #or
> pval=2*pt(t,df=n-1)
> pval #two_tailed_value
[1] 0.001208004
Result:
Since tcal = -3.33 < -tα = -1.660391, we reject the null hypothesis, i.e. there is evidence
that the population mean increase is less than 3 miles per gallon.

ASSIGNMENT 4.
POPULATION PROPORTION

Aim: To find the population proportion.

Software used: R-Studio.

Procedure:
A population proportion, generally denoted by P and in some textbooks by π (pi), is a
parameter that describes a percentage value associated with a population. For example, the
2010 United States Census showed that 83.7% of the American Population was identified as
not being Hispanic or Latino. The value of 83.7% is a population proportion. In general, the
population proportion or any other population parameter is unknown. A census can be
conducted to determine the actual value of a population parameter, but in most statistical
practices, a census is not a practical method due to its costs and time consumption.
A population proportion is usually estimated through an unbiased sample statistic obtained
from an observational study or experiment. For example, the National Technological Literacy
Conference conducted a national survey of 2,000 adults to determine the percentage of
adults who are economically illiterate. The study showed that 72% of the 2,000 adults
sampled did not understand what a gross domestic product is. The value of 72% is a sample
proportion. The sample proportion is generally denoted by pbar.
A proportion is mathematically defined as the ratio of the values in a subset S to
the values in a set R. As such, the population proportion can be defined as follows: P = x/N,
where x is the count of successes in the population and N is the size of the population.
This mathematical definition can be generalized to provide the definition for the sample
proportion: pbar = x/n,
where x is the count of successes in the sample and n is the size of the sample obtained from
the population.
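A minimal sketch of these definitions in R (the counts x and n below are hypothetical):

x = 72; n = 200 # hypothetical count of successes and sample size
pbar = x/n # sample proportion pbar = x/n
p0 = 0.5 # proportion under the null hypothesis
z = (pbar - p0)/sqrt(p0*(1-p0)/n) # z statistic for one proportion
z
prop.test(x, n, p = p0, correct = FALSE) # equivalent test; its X-squared equals z^2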

Problems:
1. Market Research, Inc., wants to know if shoppers are sensitive to the prices of items sold in
a supermarket. A random sample of 802 shoppers was obtained, and 378 of those
supermarket shoppers could state the correct price of an item immediately after putting it
into their cart. Test at the 5% level the null hypothesis that at least one half of all shoppers
can state the correct price.

Solution:
Let P denote the population proportion of supermarket shoppers who state the
correct price in these circumstances.
The null hypothesis H0: P ≥ P0 = 0.50 i.e. at least one half of all shoppers can state the
correct price, against the alternative hypothesis H1: P < 0.50.
We reject the null hypothesis if Zcal is less than -Zα.
Input:
#1
#lower_tail (solution to the above question)
pbar=378/802
pbar
p0=0.50
n=802
z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
z
alpha=0.05
z.alpha=qnorm(1-alpha)
-z.alpha
#OR
pval=pnorm(z)
pval
#OR
prop.test(378,802,p=0.50,alt="less",correct=FALSE) #To compute p-value directly

#upper_tail
z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
z
alpha=0.05
z.alpha=qnorm(1-alpha)
z.alpha
#OR
pval=pnorm(z,lower.tail=FALSE)
pval
#OR
prop.test(378,802,p=0.50,alt="greater",correct=FALSE) #To compute p-value directly

#two_tail
z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
z
alpha=0.05
z.half.alpha=qnorm(1-alpha/2)
c(-z.half.alpha,z.half.alpha)
#OR
pval=2*pnorm(-abs(z)) # double the smaller tail
pval #two-tailed p-value
#OR
prop.test(378,802,p=0.50,correct=FALSE) #To compute p-value directly

Output:
> #1
> #lower_tail(solution to the above question)
> pbar=378/802
> pbar
[1] 0.4713217
> p0=0.50
> n=802
> z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
> z
[1] -1.624316
> alpha=0.05
> z.alpha=qnorm(1-alpha)
> -z.alpha
[1] -1.644854
> #OR
> pval=pnorm(z)
> pval
[1] 0.05215414
> #OR
> prop.test(378,802,p=0.50,alt="less",correct=FALSE) #To compute p-value directly

1-sample proportions test without continuity correction



data:  378 out of 802, null probability 0.5
X-squared = 2.6384, df = 1, p-value = 0.05215
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
 0.0000000 0.5003626
sample estimates:
        p 
0.4713217 

> #upper_tail
> z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
> z
[1] -1.624316
> alpha=0.05
> z.alpha=qnorm(1-alpha)
> z.alpha
[1] 1.644854
> #OR
> pval=pnorm(z,lower.tail=FALSE)
> pval
[1] 0.9478459
> #OR
> prop.test(378,802,p=0.50,alt="greater",correct=FALSE) #To compute p-value directly

1-sample proportions test without continuity correction

data:  378 out of 802, null probability 0.5
X-squared = 2.6384, df = 1, p-value = 0.9478
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.4424736 1.0000000
sample estimates:
        p 
0.4713217 

> #two_tail
> z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
> z
[1] -1.624316
> alpha=0.05
> z.half.alpha=qnorm(1-alpha/2)
> c(-z.half.alpha,z.half.alpha)
[1] -1.959964 1.959964
> #OR
> pval=2*pnorm(-abs(z)) # double the smaller tail
> pval #two-tailed p-value
[1] 0.1043083
> #OR
> prop.test(378,802,p=0.50,correct=FALSE) #To compute p-value directly

1-sample proportions test without continuity correction


data:  378 out of 802, null probability 0.5
X-squared = 2.6384, df = 1, p-value = 0.1043
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4369932 0.5059236
sample estimates:
        p 
0.4713217 

Result:
Since Zcal = -1.624316 > -Zα = -1.644854, we do not reject the null hypothesis H0, i.e. at least
one half of all shoppers can state the correct price.

2. A random sample of 202 business faculty members was asked if there should be a required
foreign language course for business majors. Of these sample members, 140 felt there was a
need for a foreign language course. Test the hypothesis that at least 75% of all business faculty
members hold this view. Use α = 0.05.
Solution:

Let P denote the population proportion of business faculty members who feel there
is a need for a required foreign language course for business majors.
The null hypothesis H0: P ≥ P0 = 0.75 i.e. at least 75% of all business faculty members
felt there was a need for a foreign language course, against the alternative hypothesis
H1: P < 0.75.
We reject the null hypothesis if Zcal is less than -Zα.
Input:

#2
#lower_tail (solution to the above question)
pbar=140/202
p0=0.75
n=202
z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
z
alpha=0.05
z.alpha=qnorm(1-alpha)
-z.alpha
#OR
pval=pnorm(z)
pval
#OR
prop.test(140,202,p=0.75,alt="less",correct=FALSE) #To compute p-value directly

#upper_tail


z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
z
alpha=0.05
z.alpha=qnorm(1-alpha)
z.alpha
#OR
pval=pnorm(z,lower.tail=FALSE)
pval
#OR
prop.test(140,202,p=0.75,alt="greater",correct=FALSE) #To compute p-value directly

#two_tail
z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
z
alpha=0.05
z.half.alpha=qnorm(1-alpha/2)
c(-z.half.alpha,z.half.alpha)
#OR
pval=2*pnorm(-abs(z)) # double the smaller tail
pval #two-tailed p-value
#OR
prop.test(140,202,p=0.75,correct=FALSE) #To compute p-value directly
Output:

> #2
> #lower_tail
> pbar=140/202
> p0=0.75
> n=202
> z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
> z
[1] -1.868622
> alpha=0.05
> z.alpha=qnorm(1-alpha)
> -z.alpha
[1] -1.644854
> #OR
> pval=pnorm(z)
> pval
[1] 0.03083769
> #OR
> prop.test(140,202,p=0.75,alt="less",correct=FALSE) #To compute p-value directly

1-sample proportions test without continuity correction

data:  140 out of 202, null probability 0.75
X-squared = 3.4917, df = 1, p-value = 0.03084
alternative hypothesis: true p is less than 0.75
95 percent confidence interval:
 0.0000000 0.7436027
sample estimates:
        p 
0.6930693 
> #upper_tail
> z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
> z
[1] -1.868622
> alpha=0.05
> z.alpha=qnorm(1-alpha)
> z.alpha
[1] 1.644854
> #OR
> pval=pnorm(z,lower.tail=FALSE)
> pval
[1] 0.9691623
> #OR
> prop.test(140,202,p=0.75,alt="greater",correct=FALSE) #To compute p-value directly

1-sample proportions test without continuity correction

data:  140 out of 202, null probability 0.75
X-squared = 3.4917, df = 1, p-value = 0.9692
alternative hypothesis: true p is greater than 0.75
95 percent confidence interval:
 0.6374324 1.0000000
sample estimates:
        p 
0.6930693 

> #two_tail
> z=(pbar-p0)/(sqrt(p0*(1-p0)/n))
> z
[1] -1.868622
> alpha=0.05
> z.half.alpha=qnorm(1-alpha/2)
> c(-z.half.alpha,z.half.alpha)
[1] -1.959964 1.959964
> #OR
> pval=2*pnorm(-abs(z)) # double the smaller tail
> pval #two-tailed p-value
[1] 0.06167538
> #OR
> prop.test(140,202,p=0.75,correct=FALSE) #To compute p-value directly

1-sample proportions test without continuity correction

data:  140 out of 202, null probability 0.75
X-squared = 3.4917, df = 1, p-value = 0.06168
alternative hypothesis: true p is not equal to 0.75
95 percent confidence interval:
 0.6263561 0.7525763
sample estimates:
        p 
0.6930693 

Result:
Since Zcal = -1.868622 < -Zα = -1.644854, we reject the null hypothesis H0, i.e. less than 75% of
the business faculty members feel that there is a need for a foreign language course.


ASSIGNMENT 5.
COMPARISON OF TWO POPULATION PROPORTIONS
Aim: To estimate the difference between the mean gas mileage (mpg) of manual and
automatic transmissions.

Software used: R-studio.

Procedure:
The point estimate is the difference between the two sample proportions, written
as:
pbar1 - pbar2 = (y1/n1) - (y2/n2)
The mean of its sampling distribution is p1 - p2 and the standard deviation is given
by:
sqrt((pbar1*(1 - pbar1)/n1) + (pbar2*(1 - pbar2)/n2))
When the observed number of successes and the observed number of failures are
greater than or equal to 5 for both populations, the sampling distribution of
pbar1 - pbar2 is approximately normal and we can use z-methods.
In the following, the formula of the confidence interval and the test statistic are given
for reference. You can use R to perform the inference; it is more important to
recognize the problem and use R to draw a conclusion than to train yourself
on the tedious formula. The conditions require that the number of successes and
failures in both populations be greater than or equal to 5; we base this check on the
number of successes and failures in both samples. Why the samples? If you recall our
discussion of the test for one proportion, the check of conditions used n*p0 and
n*(1 - p0), where p0 was the population proportion assumed under the null hypothesis.
Here, where we are comparing two proportions, we do not know what the population
proportions are; we only assume they are equal. For instance, the two population
proportions could both be 0.65 or 0.30, etc. It does not matter, as we are assuming they
are equal. Thus we do not have a fixed population value to use, and we substitute the
sample proportions for each group.
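A short sketch of these quantities in R (the success counts y and sample sizes n below are hypothetical):

y = c(45, 60); n = c(100, 120) # hypothetical successes and sizes for the two groups
pbar1 = y[1]/n[1]; pbar2 = y[2]/n[2]
pbar1 - pbar2 # point estimate of the difference
sqrt(pbar1*(1-pbar1)/n[1] + pbar2*(1-pbar2)/n[2]) # standard deviation given above
prop.test(y, n, correct = FALSE) # two-sample proportion test in R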

Problem:
If the data in mtcars follows the normal distribution, find the 95% confidence interval
estimate of the difference between the mean gas mileage (mpg) of manual and
automatic transmissions. (hint: for automatic cars am = 0, and for manual cars am = 1)
Input:
mtcars
cars=mtcars$am==1
print(cars)
mpg1<-mtcars[cars,]$mpg
print(mpg1)
mpg2<-mtcars[!cars,]$mpg
print(mpg2)
print(t.test(mpg1,mpg2))


Output:
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> cars = mtcars$am==1
> print(cars)
 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[27]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> mpg1<-mtcars[cars,]$mpg
> print(mpg1)
[1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> mpg2<-mtcars[!cars,]$mpg
> print(mpg2)
[1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 21.5 15.5
[17] 15.2 13.3 19.2
> print(t.test(mpg1,mpg2))

Welch Two Sample t-test

data:  mpg1 and mpg2
t = 3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.209684 11.280194
sample estimates:
mean of x mean of y 
 24.39231  17.14737 

Result:
Hence the 95% confidence interval for the difference between the mean mpg of manual and
automatic transmissions is (3.21, 11.28), with sample means 24.39 (manual) and 17.15 (automatic).


ASSIGNMENT 6.
CORRELATION MATRIX
Aim: To create a correlation matrix for all numeric variables in the iris dataset and to plot
the full and lower correlation matrices, which give the correlation coefficient values.

Software used: R-Studio.

Procedure:
Correlation is any of a broad class of statistical relationships involving
dependence, though in common usage it most often refers to the extent to which two
variables have a linear relationship with each other. Familiar examples of dependent
phenomena include the correlation between the physical statures of parents and their
offspring, and the correlation between the demand for a product and its price.
Correlations are useful because they can indicate a predictive relationship that can be
exploited in practice. For example, an electrical utility may produce less power on a
mild day based on the correlation between electricity demand and weather. In this
example there is a causal relationship, because extreme weather causes people to use
more electricity for heating or cooling; however, correlation is not sufficient to
demonstrate the presence of such a causal relationship (i.e., correlation does not imply
causation).
Formally, dependence refers to any situation in which random variables do not satisfy
a mathematical condition of probabilistic independence. In loose usage, correlation can
refer to any departure of two or more random variables from independence, but
technically it refers to any of several more specialized types of relationship between
mean values. There are several correlation coefficients, often denoted ρ or r, measuring
the degree of correlation. The most common of these is the Pearson correlation
coefficient, which is sensitive only to a linear relationship between two variables (which
may exist even if one is a nonlinear function of the other). Other correlation coefficients
have been developed to be more robust than the Pearson correlation – that is, more
sensitive to nonlinear relationships. Mutual information can also be applied to
measure dependence between two variables.
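As a small illustration in R (a sketch using the built-in mtcars data, separate from the iris problem below):

cor(mtcars$wt, mtcars$mpg) # Pearson correlation coefficient r
cor(mtcars$wt, mtcars$mpg, method = "spearman") # rank-based, more robust alternative
cor.test(mtcars$wt, mtcars$mpg) # test of H0: rho = 0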

Problem:
Use the iris dataset to create a correlation matrix for all the numeric variables. Also, plot the
full and lower correlation matrices, which give the correlation coefficient values (hint:
method="number").

Input:
#install.packages("corrplot")
#library("corrplot")
str(iris)
data=iris[,
c(1,2,3,4)]
data


m=cor(data)
m # m is the correlation matrix
corrplot(m)
corrplot(m,type="full",title="Correlation Matrix",method="number")
corrplot(m,type="lower",title="Correlation Matrix",method="number")

Output:
> #install.packages("corrplot")
> library("corrplot")
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> data=iris[,c(1,2,3,4)]
> m=cor(data)
> m # m is the correlation matrix
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
> corrplot(m)


> corrplot(m,type="full",title="Correlation Matrix",method="number")

> corrplot(m,type="lower",title="Correlation Matrix",method="number")

Result:
Therefore, the full and lower correlation matrix plots, which display the correlation coefficient
values, have been produced.


ASSIGNMENT 7.
ONE WAY ANOVA (F-TEST)
Aim: To perform one-way Analysis of Variation in R by determining the F-value.

Software used: R-studio.

Procedure:
In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a
technique used to compare means of three or more samples (using the F distribution).
This technique can be used only for numerical data.
The ANOVA tests the null hypothesis that samples in two or more groups are drawn
from populations with the same mean values. To do this, two estimates are made of
the population variance. These estimates rely on various assumptions. The ANOVA
produces an F-statistic, the ratio of the variance calculated among the means to the
variance within the samples. If the group means are drawn from populations with the
same mean values, the variance between the group means should be lower than the
variance of the samples, following the central limit theorem. A higher ratio therefore
implies that the samples were drawn from populations with different mean values.
Typically, however, the one-way ANOVA is used to test for differences among at least
three groups, since the two-group case can be covered by a t-test (Gosset, 1908). When
there are only two means to compare, the t-test and the F-test are equivalent; the
relation between ANOVA and t is given by F = t2. An extension of one-way ANOVA is
two-way analysis of variance that examines the influence of two different categorical
independent variables on one dependent variable.
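The relation F = t^2 for the two-group case can be verified directly (a minimal sketch with made-up data):

g = factor(rep(c("A", "B"), each = 5)) # two hypothetical groups
y = c(5.1, 4.9, 5.3, 5.0, 5.2, 4.6, 4.8, 4.5, 4.7, 4.9)
t_val = t.test(y ~ g, var.equal = TRUE)$statistic # pooled two-sample t
f_val = summary(aov(y ~ g))[[1]][1, "F value"] # one-way ANOVA F
c(t_squared = unname(t_val)^2, F = f_val) # the two values agree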

Problem:
A project manager is interested to test the difference between process completion time
under three different methods. The data is given below:
METHOD A: 9.3 ,9.4, 9.6, 10 ,12.4, 13, 10.4 ,11.1, 12.2, 13.5
METHOD B: 12.2, 11.4, 13.2, 14.4, 15.4, 13.4, 14.2, 10.5, 10.8, 12.4
METHOD C: 10.2, 8.7, 9.7, 12.1, 11.4, 12.4, 11.8, 13.4, 14.5, 15.4
Perform One-Way Analysis of variance.

Input:
time = c(9.3, 9.4, 9.6, 10, 12.4, 13, 10.4, 11.1, 12.2, 13.5, 12.2, 11.4, 13.2, 14.4, 15.4,
         13.4, 14.2, 10.5, 10.8, 12.4, 10.2, 8.7, 9.7, 12.1, 11.4, 12.4, 11.8, 13.4, 14.5, 15.4)
methods = c(rep("Method A",10), rep("Method B",10), rep("Method C",10))
comp_time = data.frame(time,methods)
comp_time
result = aov(time~methods, data = comp_time)
summary(result)


Output:
> time = c(9.3, 9.4, 9.6, 10, 12.4, 13, 10.4, 11.1, 12.2, 13.5, 12.2, 11.4, 13.2, 14.4, 15.4,
+          13.4, 14.2, 10.5, 10.8, 12.4, 10.2, 8.7, 9.7, 12.1, 11.4, 12.4, 11.8, 13.4, 14.5, 15.4)
> methods = c(rep("Method A",10), rep("Method B",10), rep("Method C",10))
> comp_time = data.frame(time,methods)

> comp_time
   time  methods
1   9.3 Method A
2   9.4 Method A
3   9.6 Method A
4  10.0 Method A
5  12.4 Method A
6  13.0 Method A
7  10.4 Method A
8  11.1 Method A
9  12.2 Method A
10 13.5 Method A
11 12.2 Method B
12 11.4 Method B
13 13.2 Method B
14 14.4 Method B
15 15.4 Method B
16 13.4 Method B
17 14.2 Method B
18 10.5 Method B
19 10.8 Method B
20 12.4 Method B
21 10.2 Method C
22  8.7 Method C
23  9.7 Method C
24 12.1 Method C
25 11.4 Method C
26 12.4 Method C
27 11.8 Method C
28 13.4 Method C
29 14.5 Method C
30 15.4 Method C

> result = aov(time~methods, data = comp_time)


> summary(result)
            Df Sum Sq Mean Sq F value Pr(>F)
methods      2  14.45   7.226   2.278  0.122
Residuals   27  85.66   3.173

Result:
One-way analysis of variance has been performed in R. Since the F value (2.278) is less than
the tabulated value and the p-value (0.122) exceeds 0.05, we fail to reject the null hypothesis:
there is no significant difference in completion time among the three methods.


ASSIGNMENT 8.
TWO WAY ANOVA
Aim: To perform two-way analysis of variance in R.

Software used: R-studio.

Procedure:
In statistics, the two-way analysis of variance (ANOVA) is an extension of the
one-way ANOVA that examines the influence of two different categorical independent
variables on one continuous dependent variable. The two-way ANOVA not only aims at
assessing the main effect of each independent variable but also if there is any
interaction between them.
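A compact sketch of a two-way layout in R (the factors and the simulated response below are illustrative, not the problem data):

set.seed(1) # make the simulated response reproducible
a = factor(rep(c("low", "high"), each = 6)) # first categorical factor
b = factor(rep(c("x", "y", "z"), times = 4)) # second categorical factor
resp = rnorm(12, mean = 10) # hypothetical continuous response
summary(aov(resp ~ a*b)) # main effects of a and b plus the a:b interaction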

Problem:
Use the following data to study whether there was any relationship between
the quantitative variable "number of shots made (i.e., successfully completed out of
50 tries)" and two qualitative variables "Time of Day" and "Shoes Worn".

Time Shoes Shots Made


Morning Others 25
Morning Others 26
Night Others 27
Night Others 27
Morning Favorite 32
Morning Favorite 22
Night Favorite 30
Night Favorite 34
Morning Others 35

Input:
shots_made = c(25, 26, 27, 27, 32, 22, 30, 34, 35)
shoes = c(rep("Others",4), rep("Favourite",4), "Others")
time = c(rep("Morning",2), rep("Night",2), rep("Morning",2), rep("Night",2), "Morning")
shoes1 = as.factor(shoes)
time1 = as.factor(time)
shots = data.frame(time1,shoes1,shots_made)
print(shots)
result = aov(shots_made ~ shoes+time, data = shots)
summary(result)


Output:
> shots_made = c(25, 26, 27, 27, 32, 22, 30, 34, 35)
> shoes = c(rep("Others",4), rep("Favourite",4), "Others")
> time = c(rep("Morning",2), rep("Night",2), rep("Morning",2), rep("Night",2), "Morning")
> shoes1 = as.factor(shoes)
> time1 = as.factor(time)
> shots = data.frame(time1,shoes1,shots_made)
> print(shots)
    time1    shoes1 shots_made
1 Morning    Others         25
2 Morning    Others         26
3   Night    Others         27
4   Night    Others         27
5 Morning Favourite         32
6 Morning Favourite         22
7   Night Favourite         30
8   Night Favourite         34
9 Morning    Others         35
> result = aov(shots_made ~ shoes+time, data = shots)
> summary(result)
            Df Sum Sq Mean Sq F value Pr(>F)
shoes        1   5.00   5.000   0.210  0.663
time         1   4.09   4.091   0.172  0.693
Residuals    6 142.91  23.818

Result:
Two-way analysis of variance has been performed in R. The conclusion we draw from
the data is that neither factor, shoes worn nor time of day, has a significant effect on
the target variable shots made (both p-values exceed 0.05).


ASSIGNMENT 9.
LINEAR REGRESSION
Aim: To perform simple linear regression in R.

Software used: R-studio.

Procedure:
In statistics, linear regression is an approach for modelling the relationship
between a scalar dependent variable y and one or more explanatory variables (or
independent variables) denoted X. The case of one explanatory variable is called simple
linear regression. For more than one explanatory variable, the process is called multiple
linear regression. (This term is distinct from multivariate linear regression, where
multiple correlated dependent variables are predicted, rather than a single scalar
variable.)
In linear regression, the relationships are modelled using linear predictor functions
whose unknown model parameters are estimated from the data. Such models are
called linear models. Most commonly, the conditional mean of y given the value of X is
assumed to be an affine function of X; less commonly, the median or some other
quantile of the conditional distribution of y given X is expressed as a linear function of
X. Like all forms of regression analysis, linear regression focuses on the conditional
probability distribution of y given X, rather than on the joint probability distribution of
y and X, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and
to be used extensively in practical applications. This is because models which depend
linearly on their unknown parameters are easier to fit than models which are non-
linearly related to their parameters and because the statistical properties of the
resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
• If the goal is prediction, or forecasting, or error reduction, linear regression can be
used to fit a predictive model to an observed data set of y and X values. After
developing such a model, if an additional value of X is then given without its
accompanying value of y, the fitted model can be used to make a prediction of the
value of y.
• Given a variable y and several variables X1, ..., Xp that may be related to y, linear
regression analysis can be applied to quantify the strength of the relationship
between y and the Xj, to assess which Xj may have no relationship with y at all, and
to identify which subsets of the Xj contain redundant information about y.
Linear regression models are often fitted using the least squares approach, but they
may also be fitted in other ways, such as by minimizing the "lack of fit" in some other
norm (as with least absolute deviations regression), or by minimizing a penalized
version of the least squares loss function as in ridge regression (L2-norm penalty) and
lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit
models that are not linear models. Thus, although the terms "least squares" and "linear
model" are closely linked, they are not synonymous.

Problem:
A random sample of data for 7 days of operation produced the following (price,
quantity) data values:
Price per Gallon of Paint, X Quantity Sold, Y
10 100
8 120
5 200
4 200
10 90
7 110
6 150

a. Prepare a scatter plot of the data.
b. Compute and interpret b1.
c. Compute and interpret b0.
d. How many gallons of paint would you expect to sell if the price is $7 per gallon?

Input:
price = c(10, 8, 5, 4, 10, 7, 6) # X independent variable
quantity = c(100, 120, 200, 200, 90, 110, 150) # Y dependent
variable cat("Scatter plot of the data is\n") plot(price,
quantity) #plotting x,y pricequantity.lm =
lm(quantity ~ price) cat("Coefficients are:\n")
cat("Intercept value b1=",coefficients(pricequantity.lm)[1],"\n")
cat("When Price=0,
quantity=",coefficients(pricequantity.lm)[1],"units\n")
cat("coefficient of regression value
b2=",coefficients(pricequantity.lm)[2],"\n") cat("When Price=1,
quantity=",coefficients(pricequantity.lm)[2],"units\n")
cat("When price = 7, quantity is=",predict(pricequantity.lm, data.frame(price = 7)),"\n")

Output:
> price = c(10, 8, 5, 4, 10, 7, 6) # X independent variable
> quantity = c(100, 120, 200, 200, 90, 110, 150) # Y dependent variable
> cat("Scatter plot of the data is\n")
Scatter plot of the data is
> plot(price, quantity) #plotting x,y


> pricequantity.lm = lm(quantity ~ price)
> cat("Coefficients are:\n")
Coefficients are:
> cat("Intercept value b0=",coefficients(pricequantity.lm)[1],"\n")
Intercept value b0= 268.6957 
> cat("When price=0, quantity=",coefficients(pricequantity.lm)[1],"units\n")
When price=0, quantity= 268.6957 units
> cat("Slope b1=",coefficients(pricequantity.lm)[2],"\n")
Slope b1= -18.21739 
> cat("For each $1 increase in price, quantity changes by",coefficients(pricequantity.lm)[2],"units\n")
For each $1 increase in price, quantity changes by -18.21739 units
> cat("When price=7, quantity is",predict(pricequantity.lm, data.frame(price=7)),"\n")
When price=7, quantity is 141.1739 

Result:
Simple linear regression has been performed in R: the fitted line is quantity = 268.70 - 18.22*price,
and the predicted quantity at a price of $7 per gallon is about 141 gallons.


ASSIGNMENT 10.
MULTIPLE LINEAR REGRESSION
Aim: To perform multiple linear regression in R.

Software used: R-studio.

Procedure:
A linear regression model that contains more than one predictor variable is called a
multiple linear regression model. The following model is a multiple linear regression
model with two predictor variables X1 and X2:
Y = B0 + B1*X1 + B2*X2 + e
The model is linear because it is linear in the parameters β0, β1 and β2. The model
describes a plane in the three-dimensional space of Y, X1 and X2. The parameter β0 is the
intercept of this plane. Parameters β1 and β2 are referred to as partial regression
coefficients. Parameter β1 represents the change in the mean response corresponding
to a unit change in X1 when X2 is held constant. Parameter β2 represents the change in
the mean response corresponding to a unit change in X2 when X1 is held constant.
Consider the following example of a multiple linear regression model with two predictor
variables X1 and X2:
Y = 30 + 5*X1 + 7*X2 + e
This regression model is a first order multiple linear regression model. This is because
the maximum power of the variables in the model is 1. (The regression plane
corresponding to this model is shown in the figure below.) Also shown is an observed
data point and the corresponding random error, e. The true regression model is usually
never known (and therefore the values of the random error terms corresponding to
observed data points remain unknown). However, the regression model can be
estimated by calculating the parameters of the model for an observed data set. This is
explained in Estimating Regression Models Using Least Squares.
One of the following figures shows the contour plot for the regression model in the above
equation. The contour plot shows lines of constant mean response values as a function
of X1 and X2. The contour lines for the given regression model are straight lines, as
seen on the plot. Straight contour lines result for first order regression models with no
interaction terms. A linear regression model may also take the following form:
Y = B0 + B1*X1 + B2*X2 + B12*X1*X2 + e
A cross-product term, X1X2, is included in the model. This term represents an interaction effect
between the two variables X1 and X2. Interaction means that the effect produced by a change
in the predictor variable on the response depends on the level of the other predictor
variable(s).
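The first-order model and the cross-product (interaction) model can be compared directly in R; the variable choice in this sketch is illustrative:

fit1 = lm(mpg ~ hp + wt, data = mtcars) # first order: Y = B0 + B1*X1 + B2*X2
fit2 = lm(mpg ~ hp*wt, data = mtcars) # adds the B12*X1*X2 cross-product term
anova(fit1, fit2) # F-test of whether the interaction term improves the fit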


Problem:

Consider the data set "mtcars" available in the R environment. It gives a comparison
between different car models in terms of mileage per gallon (mpg), cylinder
displacement("disp"), horse power("hp"), weight of the car("wt") and some more
parameters.
Establish the relationship between "mpg" as a response variable with "disp","hp" and
"wt" as predictor variables.
(i) Find out the coefficients of the model, R-squared and adjusted R-squared, and perform
hypothesis testing for the significance of the coefficients.
(ii) Predict mpg when disp = 180, hp = 200, wt = 6.

Input:

mtcars.lm = lm(mpg ~ disp + hp + wt, data = mtcars)
print(summary(mtcars)) # data summary; see summary(mtcars.lm) for the full coefficient table
cat("The coefficients of the model are:\n")
cat("coefficient of disp,b1=",coefficients(mtcars.lm)[2],"\n")
cat("coefficient of hp,b2=",coefficients(mtcars.lm)[3],"\n")
cat("coefficient of wt,b3=",coefficients(mtcars.lm)[4],"\n")
cat("R squared =",summary(mtcars.lm)$r.squared,"\n")
cat("Adjusted R squared =",summary(mtcars.lm)$adj.r.squared,"\n")
cat("Per summary(mtcars.lm), hp and wt have p-values below 0.05 and are statistically significant; disp is not.\n")
cat("Value of mpg when disp=180, hp=200 and wt=6 is\n")
newdata = data.frame(disp=180, hp=200, wt=6)
print(predict(mtcars.lm, newdata))

Output:

> mtcars.lm = lm( mpg ~ disp + hp + wt, data = mtcars)


> print(summary(mtcars))
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000
carb
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :2.812
3rd Qu.:4.000
Max. :8.000
> cat("The coefficients of the model
are:\n") The coefficients of the model
are:
> cat("coefficient of disp,b1=",coefficients(mtcars.lm)[2],"\n" )
coefficient of disp,b1= -0.0009370091
> cat("coefficient of hp,b2=",coefficients(mtcars.lm)[3],"\n")
coefficient of hp,b2= -0.03115655
> cat("coefficient of wt,b3=",coefficients(mtcars.lm)[4],"\n")
coefficient of wt,b3= -3.800891
> cat("R squared =",summary(mtcars.lm)$r.squared,"\n")
R squared = 0.8268361
> cat("Adjusted R squared =",summary(mtcars.lm)$adj.r.squared,"\n")
Adjusted R squared = 0.8082829
> cat("As the p-values of disp, hp and wt are less than 0.05, they are both statistically
significant in the multiple linear regression model of mtcars.\n")
As the p-values of disp, hp and wt are less than 0.05, they are both statistically
significant in the multiple linear regression model of mtcars.
> cat("Value of mpg when disp=180, hp=200 and wt=6 is\n")
Value of mpg when disp=180, hp=200 and wt=6 is
> newdata = data.frame(disp=180, hp=200, wt=6)
> print(predict(mtcars.lm, newdata))
1
7.90019

Result:
Multiple linear regression has been performed in R; the model explains about 83% of the
variation in mpg (R-squared = 0.827).


ASSIGNMENT 11.
LOGISTIC REGRESSION
Aim: To perform logistic regression in R.

Software used: R-studio.

Procedure:
Logistic Regression is a type of predictive model that can be used when the
target variable is a categorical variable with two categories – for example live/die, has
disease/doesn’t have disease, purchases product/doesn’t purchase, wins race/doesn’t
win, etc. A logistic regression model does not involve decision trees and is more akin to
nonlinear regression such as fitting a polynomial to a set of data values. Logistic
regression can be used only with two types of target variables:

• A categorical target variable that has exactly two categories (i.e., a binary or
dichotomous variable).
• A continuous target variable that has values in the range 0.0 to 1.0 representing
probability values or proportions.
The logistic formula takes each continuous predictor variable, each dichotomous
predictor variable coded 0 or 1, and, for every categorical predictor with more than
two categories, one dummy variable per category less one. The form of the
logistic model formula is: P = 1/(1+exp(-(B0 + B1*X1 + B2*X2 + ... + Bk*Xk)))
Where B0 is a constant and Bi are coefficients of the predictor variables (or dummy
variables in the case of multi-category predictor variables). The computed value, P, is a
probability in the range 0 to 1. The "exp" function is e raised to a power. You can exclude
the B0 constant by turning off the option “Include constant (intercept) term” on the
logistic regression model property page.
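The logistic formula itself is easy to evaluate (a sketch; the coefficient values below are hypothetical, not fitted from data):

logistic = function(eta) 1/(1 + exp(-eta)) # P = 1/(1+exp(-(B0 + B1*X1 + ... + Bk*Xk)))
B0 = -1.5; B1 = 0.8 # hypothetical constant and coefficient
X1 = 2.0 # value of a single continuous predictor
logistic(B0 + B1*X1) # predicted probability, always between 0 and 1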

Problem:
The in-built data set "mtcars" describes different models of a car with their various
engine specifications. In "mtcars" data set, the transmission mode (automatic or
manual) is described by the column am which is a binary value (0 or 1). Create a logistic
regression model between the columns "am" and 3 other columns - hp, wt and cyl.
Input:
am.glm = glm(formula=am ~ hp + wt + cyl,data=mtcars,family=binomial)
print(summary(am.glm))

Output:
> am.glm = glm(formula=am ~ hp + wt + cyl,data=mtcars,family=binomial)
> print(summary(am.glm))

Call:
glm(formula = am ~ hp + wt + cyl, family = binomial, data = mtcars)


Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) 19.70288    8.11637   2.428   0.0152 *
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
cyl          0.48760    1.07162   0.455   0.6491  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.2297  on 31  degrees of freedom
Residual deviance:  9.8415  on 28  degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8

Result:
Logistic regression has been performed in R. Weight is a significant predictor of the
transmission type (p = 0.0276), horsepower is marginally significant (p = 0.0840), and the
number of cylinders has no significant effect (p = 0.6491).
