
Probability and Statistics for Engineering Correlation and Regression Analysis

CHAPTER 3

3. Correlation and Regression Analysis

3.1 INTRODUCTION
The statistical methods discussed up to now have been concerned primarily with statistical summaries such as measures of central tendency, dispersion, skewness and kurtosis, which describe the average, variability and shape of data. These are usually sufficient when we have univariate data. In practice, however, we may come across a number of problems involving two or more variables. A distribution involving two variables is said to be a bivariate distribution. In this chapter we discuss various methods used to determine whether any relationship exists between two variables, for example, quality of steel and life of a bridge, or quantity of bitumen and strength of a road.

There are two basic tools for summarizing bivariate data: correlation analysis summarizes the strength of the relationship (if any) between two factors, while regression analysis shows how to predict or control one of the variables using the other.

Correlation analysis is the statistical tool generally used to describe the degree to which one variable is linearly related to another. There are various phenomena which are related to each other. For instance, when the quality of bitumen used in a road is reduced, the condition of the road becomes worse. When the demand for a certain commodity increases, its price goes up, and when its demand decreases, its price comes down. The theory by means of which the quantitative connection between two sets of phenomena is determined is called correlation theory. On the basis of the theory of correlation one can study the comparative change occurring in two related phenomena and their cause and effect analysis.

3.1.1 Types of Correlation
Correlation can be classified into three types.
i. Positive, negative and no correlation.
ii. Linear and non-linear correlation.
iii. Simple, partial and multiple correlations.

Positive, Negative and No-Correlation
Two variables are said to be positively correlated if they tend to change together in the same direction, that is, if they tend to increase or decrease together. Such positive correlation is postulated by economic theory for consumption and income. When income increases, consumption increases, and conversely, when income decreases, consumption decreases. For example,

i.  Advertisement Cost (Rs)     10   9   8   7
    Sales (Rs)                  80  50  40  30

ii. Income (Rs in '000)         10  12  15  19  20
    Consumption (Rs in '000)     6   8   9  10  12

Two variables are said to be negatively correlated if they tend to change in opposite directions: when X increases Y decreases, and vice versa. For example, the quantity of a commodity demanded and its price are negatively correlated. When the price increases, demand for the commodity decreases, and when the price falls, demand increases. For example,

    Demand (unit in '00)        10   9   8   6   4
    Price (Rs in '000)           6   8   9  10  12

Two variables are uncorrelated when they tend to change with no connection to each other. For example, one should expect zero correlation between the height of the inhabitants of a country and the production of steel.

The following table shows the nature of correlation between two variables.

Change in one variable    Change in other variable    Nature of correlation
Increase (↑)              Increase (↑)                Positive (+)
Decrease (↓)              Decrease (↓)                Positive (+)
Increase (↑)              Decrease (↓)                Negative (-)
Decrease (↓)              Increase (↑)                Negative (-)

Linear and Non-linear Correlation
Correlation may be linear, when all points (X, Y) on a scatter diagram seem to cluster near a straight line, or non-linear, when all points seem to lie near a curve. For example,

Linear correlation        X    4    5    6    7
                          Y   10   12   14   16

Non-linear correlation    X    4    5    6    7
                          Y   10   14   25   40

Simple, Partial and Multiple Correlations
In simple correlation, we measure the correlation between two variables (of which one is dependent and the other is independent). For example,
i. The correlation between amount of rainfall and production
ii. The correlation between number of units sold and price

3.2 COVARIANCE
Before we study correlation analysis we introduce the concept of covariance between two variables. Let X1, X2, ..., Xn be n observations of a random variable X and Y1, Y2, ..., Yn be the corresponding observations of another variable Y. Let the set of n pairs of observations be given by (X1, Y1), (X2, Y2), ..., (Xn, Yn). Then the covariance between the variables X and Y, denoted by Cov(X, Y), is defined by

$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y}) = \frac{1}{n}\sum XY - \bar{X}\,\bar{Y}$

The value of the covariance between two variables X and Y measures the simultaneous changes between the two variables.

3.3 METHOD OF STUDYING CORRELATION
There are the following two methods to study the correlation between two variables.
a. Graphical method or scatter diagram method
b. Mathematical method

3.3.1 Scatter Diagram Method
The first step in determining whether there is a relationship between two variables is to examine the graph of the observed (or known) data. In this method, the values of one variable are plotted on the X-axis and the values of the other variable on the Y-axis. The graph formed by plotting these pairs of values of X and Y is known as a scatter diagram. If the plotted dots show some upward or downward trend, then the two variables under consideration are said to be correlated. If the dots are close together and follow some increasing or decreasing trend, then there is a strong relationship between them.

The following scatter diagrams show different forms of correlation.

[Figure: scatter diagrams of Y against X showing perfectly positive correlation, perfectly negative correlation, high degree positive correlation, high degree negative correlation and low degree positive correlation.]


[Figure, continued: scatter diagrams of Y against X showing low degree negative correlation and no correlation.]

The scatter diagram is the simplest method of ascertaining the correlation between two variables and is not influenced by the size of extreme items, but it gives only a rough idea of the correlation and does not give the exact extent to which the two variables are correlated.

Merits and demerits of scatter diagram

Merits
i. This is the simplest method to measure correlation.
ii. It is least affected by the size of the extreme values.
iii. Even a non-mathematical person can get an idea about correlation.
iv. It establishes the base line for fitting the appropriate regression equation.

Demerits
i. This is a graphical method, so it does not give the exact numerical value of correlation.

3.3.2 MATHEMATICAL METHOD
The scatter diagram method does not give the exact numerical value of correlation; it only gives a rough idea about correlation. So, to get the exact numerical value of correlation between variables we use the following mathematical methods. By the following two methods correlation is measured in terms of coefficients.
i. Karl Pearson's correlation coefficient.
ii. Spearman's rank correlation coefficient.

3.4 KARL PEARSON'S CORRELATION COEFFICIENT
Karl Pearson's correlation coefficient measures the degree of linear association between two variables. It is popularly known as the Pearsonian correlation coefficient. It is a unitless measure of correlation. It is defined as the ratio of the covariance between two variables to the product of the standard deviations of the two variables.

Let X and Y be two variables. Then Karl Pearson's correlation coefficient between X and Y, denoted by $r_{xy}$, r(x, y) or simply r, is defined as

$r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

where
$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y}) = \frac{1}{n}\sum XY - \bar{X}\,\bar{Y}$
$\mathrm{Var}(X) = \sigma_X^2 = \frac{1}{n}\sum (X - \bar{X})^2 = \frac{1}{n}\sum X^2 - \bar{X}^2$
$\mathrm{Var}(Y) = \sigma_Y^2 = \frac{1}{n}\sum (Y - \bar{Y})^2 = \frac{1}{n}\sum Y^2 - \bar{Y}^2$
and $\bar{X}$ = the arithmetic mean of the values of X, $\bar{Y}$ = the arithmetic mean of the values of Y, and n = the number of pairs of observations.

3.4.1 Calculation of Karl Pearson's correlation coefficient
We use the following formulae to calculate Karl Pearson's correlation coefficient.

Direct method
Let X and Y be two variables. Then Karl Pearson's correlation coefficient between X and Y under this method is

$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$

This method is appropriate for calculating the correlation coefficient of two variables whose values are numerically small and have no common factor.
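The definition of r as covariance divided by the product of standard deviations, and the direct formula, should give the same number. The following is a minimal Python sketch of both, run on the income/consumption and demand/price figures from Section 3.1.1. The function names are illustrative assumptions, not part of the text, and the covariance uses the divisor n as in Section 3.2.

```python
from math import sqrt

def covariance(xs, ys):
    """Covariance with divisor n: mean product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    """r = Cov(X, Y) / (sigma_X * sigma_Y); note Var(X) = Cov(X, X)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def pearson_r_direct(xs, ys):
    """Direct method: r computed from raw sums only."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

# Data from Section 3.1.1
income = [10, 12, 15, 19, 20]
consumption = [6, 8, 9, 10, 12]
demand = [10, 9, 8, 6, 4]
price = [6, 8, 9, 10, 12]

r_pos = pearson_r(income, consumption)   # expected to be positive
r_neg = pearson_r(demand, price)         # expected to be negative
print(round(r_pos, 4), round(r_neg, 4))
```

Both functions return the same value up to floating-point rounding; the direct form avoids computing the means first, which is why it is convenient for hand calculation.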
3.4.2 Properties of Karl Pearson's correlation coefficient
The following are the important properties of Karl Pearson's correlation coefficient.
i. The correlation coefficient lies between -1 and +1, i.e. $-1 \le r \le +1$.
ii. The formula of the correlation coefficient is symmetrical, i.e. $r_{xy} = r_{yx}$.
iii. The correlation coefficient is the geometric mean of the two regression coefficients, i.e. $r = \pm\sqrt{b_{YX}\, b_{XY}}$.
iv. The correlation coefficient is a relative statistical measure and has no unit.
v. Two independent variables are uncorrelated, but the converse may not be true, i.e. uncorrelated variables may not be independent.
vi. The correlation coefficient is independent of change of origin as well as of scale, i.e. $r_{XY} = r_{uv}$, where u = (X - A)/h and v = (Y - B)/k with h, k > 0.

3.4.3 Interpretation of calculated value of correlation coefficient
The calculated value of Karl Pearson's correlation coefficient may be interpreted in the following manner.
i. If r = 1, there is perfect positive correlation between the variables.
ii. If r = -1, there is perfect negative correlation between the variables.
iii. If r = 0, there is no correlation between the variables.
iv. If r lies between 0.700 and 0.999, there is high degree positive correlation.
v. If r lies between -0.700 and -0.999, there is high degree negative correlation.
vi. If r lies between 0.500 and 0.699, there is moderate positive correlation.
vii. If r lies between -0.500 and -0.699, there is moderate negative correlation.
viii. If r lies between 0.001 and 0.499, there is low degree positive correlation.
ix. If r lies between -0.001 and -0.499, there is low degree negative correlation.

Example 1
If the covariance between the two variables X and Y is 6 and the standard deviations of X and Y are 2.45 and 2.61 respectively, find the coefficient of correlation.
Solution
We are given Cov(X, Y) = 6, $\sigma_x$ = 2.45, $\sigma_y$ = 2.61.
Correlation coefficient (r) = ?
Now, the correlation coefficient between X and Y is given by
$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{6}{2.45 \times 2.61} = 0.9383$
Thus, there is high degree positive correlation between X and Y.

Example 2
From the following data compute Karl Pearson's coefficient of correlation between X and Y.

                                X series    Y series
No. of items:                       6           6
Sum of variable:                  160         116
Sum of squares of variable:      4480        2462
Sum of products of variables is 3275.

Solution
Here, we are given in the usual notation n = 6, $\sum X$ = 160, $\sum Y$ = 116, $\sum X^2$ = 4480, $\sum Y^2$ = 2462 and $\sum XY$ = 3275.
Now, Karl Pearson's correlation coefficient is given by
$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} = \frac{6 \times 3275 - 160 \times 116}{\sqrt{6 \times 4480 - 160^2}\,\sqrt{6 \times 2462 - 116^2}} = 0.84$
Thus, the value of r indicates that there is high degree positive correlation between the two variables X and Y.

Example 3
Calculate Karl Pearson's correlation coefficient between the data of driving speed and mileage of a car.

Driving speed    30  50  40  55  30  25  60  25  50  55
Mileage          28  25  25  23  30  32  21  35  26  25

Solution

Let driving speed = X and mileage = Y.

Calculation of correlation coefficient
  X     Y     X²     Y²     XY
 30    28    900    784    840
 50    25   2500    625   1250
 40    25   1600    625   1000
 55    23   3025    529   1265
 30    30    900    900    900
 25    32    625   1024    800
 60    21   3600    441   1260
 25    35    625   1225    875
 50    26   2500    676   1300
 55    25   3025    625   1375
420   270  19300   7454  10865   (Total)

Here, n = 10, $\sum X$ = 420, $\sum Y$ = 270, $\sum X^2$ = 19300, $\sum Y^2$ = 7454 and $\sum XY$ = 10865.
Now, the value of Karl Pearson's correlation coefficient is given by
$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} = \frac{10 \times 10865 - 420 \times 270}{\sqrt{10 \times 19300 - 420^2}\,\sqrt{10 \times 7454 - 270^2}} = -0.91$
Thus, the value of r indicates that there is high degree negative correlation between driving speed and mileage.

Example 4
Calculate Karl Pearson's correlation coefficient for the following data.

X    4    5    6    7    8
Y    9   11   12   15   18

Solution
Let $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Table for the calculation of correlation coefficient
  X    Y    x    y    x²   y²   xy
  4    9   -2   -4    4   16    8
  5   11   -1   -2    1    4    2
  6   12    0   -1    0    1    0
  7   15    1    2    1    4    2
  8   18    2    5    4   25   10
 30   65    0    0   10   50   22   (Total)

Here, n = 5, $\sum X$ = 30, $\sum Y$ = 65, $\sum x^2$ = 10, $\sum y^2$ = 50 and $\sum xy$ = 22.
Then the means are $\bar{X} = \frac{\sum X}{n} = \frac{30}{5} = 6$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{65}{5} = 13$.
Therefore, the value of Karl Pearson's correlation coefficient is
$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{22}{\sqrt{10 \times 50}} = 0.984$
Therefore, there is high degree positive correlation between the two variables X and Y.

Example 5
Find the correlation coefficient between age of marriage and number of women in a certain locality.

Age of marriage     10 - 15   15 - 20   20 - 25   25 - 30   30 - 35   35 - 40
Number of women        10        22        46        25         5         2

Solution
Let X = age of marriage (mid-value of the class), Y = number of women, A = 22.5 and B = 25, and let
$u = \frac{X - 22.5}{5}$ and $v = Y - 25$.

Calculation of correlation coefficient
Age        X      Y     u     v    u²    v²    uv
10 - 15   12.5   10    -2   -15    4    225    30
15 - 20   17.5   22    -1    -3    1      9     3
20 - 25   22.5   46     0    21    0    441     0
25 - 30   27.5   25     1     0    1      0     0
30 - 35   32.5    5     2   -20    4    400   -40
35 - 40   37.5    2     3   -23    9    529   -69
Total                   3   -40   19   1604   -76

Here, n = 6, $\sum u$ = 3, $\sum v$ = -40, $\sum u^2$ = 19, $\sum v^2$ = 1604 and $\sum uv$ = -76.
Now, the value of Karl Pearson's correlation coefficient is given by
$r = \frac{n\sum uv - \sum u \sum v}{\sqrt{n\sum u^2 - (\sum u)^2}\,\sqrt{n\sum v^2 - (\sum v)^2}} = \frac{6 \times (-76) - 3 \times (-40)}{\sqrt{6 \times 19 - 3^2}\,\sqrt{6 \times 1604 - (-40)^2}} = -0.366$
Thus, the value of the correlation coefficient indicates that there is low degree negative correlation between age of marriage and number of women.
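The direct formula used in Examples 2, 3 and 5 needs only the six totals from the calculation table. As a rough check of those worked results, the sketch below recomputes r from the sums quoted in the examples, and also confirms on the Example 5 data that the step deviations u and v give the same r as the original X and Y. The helper names are assumptions made for illustration.

```python
from math import sqrt

def r_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Karl Pearson's r by the direct method, from column totals only."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator

def column_sums(xs, ys):
    """Totals of the X, Y, X^2, Y^2 and XY columns of a calculation table."""
    return (len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(y * y for y in ys),
            sum(x * y for x, y in zip(xs, ys)))

# Totals quoted in Example 2 and Example 3
r_example2 = r_from_sums(6, 160, 116, 4480, 2462, 3275)
r_example3 = r_from_sums(10, 420, 270, 19300, 7454, 10865)

# Example 5: class mid-values and frequencies, then step deviations
ages = [12.5, 17.5, 22.5, 27.5, 32.5, 37.5]
women = [10, 22, 46, 25, 5, 2]
u = [(x - 22.5) / 5 for x in ages]   # change of origin and scale
v = [y - 25 for y in women]          # change of origin only

r_example5_uv = r_from_sums(*column_sums(u, v))
r_example5_xy = r_from_sums(*column_sums(ages, women))

print(round(r_example2, 2), round(r_example3, 2), round(r_example5_uv, 3))
```

That `r_example5_uv` and `r_example5_xy` agree is property vi of Section 3.4.2 in action: r does not change under a change of origin or a positive change of scale.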


3.4.5 Coefficient of Determination
The square of the simple correlation coefficient is called the coefficient of determination. It is used for interpretation of the value of the calculated correlation coefficient. If r is the correlation coefficient, then $r^2$ is the coefficient of determination. If r = 0.8, then $r^2$ = 0.64. This shows that 64% of the total variation in the dependent variable is explained by the independent variable and the other 36% is due to other unknown factors. $1 - r^2$ is called the coefficient of non-determination.

Example 8
If the correlation coefficient between two random variables is found to be 0.9, find the coefficient of determination.
Solution
The correlation coefficient is r = 0.9, therefore the coefficient of determination is $r^2 = (0.9)^2 = 0.81$, i.e. 81% of the total variation in the value of the dependent variable is determined by the independent variable and the remaining 19% is due to other unknown variables.

3.4.6 Probable Error of Correlation Coefficient
The probable error of the correlation coefficient is an old measure for testing the reliability or significance of an observed correlation coefficient. It is abbreviated as PE(r). Let r be the correlation coefficient calculated from n pairs of sample observations. Then the probable error of r is defined by

$PE(r) = 0.6745 \times \frac{1 - r^2}{\sqrt{n}}$

It is used to interpret whether the calculated value of r is significant or not.
i. If |r| < PE(r), there is no evidence of correlation, i.e. the correlation coefficient is not significant.
ii. If |r| > 6 PE(r), there is evidence of correlation, i.e. the correlation coefficient is significant.
iii. For situations other than the above two cases, nothing can be stated.

Moreover, the probable error of a correlation coefficient enables us to find the limits within which the population correlation coefficient can be expected to lie. The limits are $r \pm PE(r)$.

Example 9
The correlation coefficient between the marks of 25 students in Statistics and Economics is found to be 0.86. Test the significance of the calculated correlation coefficient, and calculate the limits of the population correlation coefficient.
Solution
Here, sample size (n) = 25
Correlation coefficient (r) = 0.86
The probable error is $PE(r) = 0.6745 \times \frac{1 - r^2}{\sqrt{n}} = 0.6745 \times \frac{1 - (0.86)^2}{\sqrt{25}} = 0.0351$
Now, 6 PE(r) = 6 × 0.0351 ≈ 0.211
Here, |r| = 0.86 > 6 PE(r) ≈ 0.211
Therefore, the correlation coefficient is significant.
∴ The limits for the population correlation coefficient ($\rho$) are
$r \pm PE(r) = 0.86 \pm 0.0351$
∴ Lower limit = 0.86 - 0.0351 = 0.825 and
Upper limit = 0.86 + 0.0351 = 0.895

3.5 PARTIAL AND MULTIPLE CORRELATION
So far we have discussed the theory of correlation between two variables only, but the value of a variable is often influenced by several other variables. We now discuss three or more correlated variables, for which there are two concepts: one is partial correlation and the other is multiple correlation.

3.6.1 Partial Correlation
In partial correlation we measure the correlation between a dependent variable and one particular independent variable, keeping all other independent variables constant.
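For three variables, both kinds of coefficient can be computed from the three simple correlations alone. The sketch below assumes the standard first-order formulas $r_{12.3} = (r_{12} - r_{13} r_{23}) / \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}$ and $R_{1.23} = \sqrt{(r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}) / (1 - r_{23}^2)}$, and applies them to the values $r_{12}$ = 0.98, $r_{13}$ = 0.44, $r_{23}$ = 0.54 used in Example 13 of this chapter. The function names are illustrative, not from the text.

```python
from math import sqrt

def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation of a and b with the linear effect of c removed."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def multiple_R(r_ab, r_ac, r_bc):
    """Multiple correlation of a on the joint effect of b and c."""
    return sqrt((r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc ** 2))

# Simple correlations given in Example 13
r12, r13, r23 = 0.98, 0.44, 0.54

# Partial correlations: the argument order picks which variable is held constant
r12_3 = partial_r(r12, r13, r23)   # X1, X2 with X3 constant
r13_2 = partial_r(r13, r12, r23)   # X1, X3 with X2 constant
r23_1 = partial_r(r23, r12, r13)   # X2, X3 with X1 constant

# Multiple correlations: first argument pair involves the dependent variable
R1_23 = multiple_R(r12, r13, r23)  # X1 on X2 and X3
R2_13 = multiple_R(r12, r23, r13)  # X2 on X1 and X3
R3_12 = multiple_R(r13, r23, r12)  # X3 on X1 and X2

print(round(r12_3, 4), round(r13_2, 4), round(r23_1, 4))
print(round(R1_23, 4), round(R2_13, 4), round(R3_12, 4))
```

Note how $r_{13}$ = 0.44 turns negative once $X_2$ is held constant: the simple correlation was carried mostly by the strong link between $X_1$ and $X_2$.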

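An example is the correlation between the number of units sold and advertising expenditure, keeping price constant.

The correlation between two variables ($X_1$, $X_2$) when the linear effect of the other variables ($X_3, X_4, X_5, \ldots, X_n$) has been eliminated from both is called the partial correlation between $X_1$ and $X_2$.

It is often important to measure the correlation between a dependent variable and one particular independent variable when all other variables are kept constant, that is, when the effects of all other variables are removed. This can be done by defining a coefficient of partial correlation. If we denote by $r_{12.3}$ the coefficient of partial correlation between $X_1$ and $X_2$ keeping $X_3$ constant, we find that

$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$, similarly
$r_{13.2} = \frac{r_{13} - r_{12}\, r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}}$
$r_{23.1} = \frac{r_{23} - r_{12}\, r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}$

Similarly, if $r_{12.34}$ is the coefficient of partial correlation between $X_1$ and $X_2$ keeping $X_3$ and $X_4$ constant, then

$r_{12.34} = \frac{r_{12.4} - r_{13.4}\, r_{23.4}}{\sqrt{(1 - r_{13.4}^2)(1 - r_{23.4}^2)}} = \frac{r_{12.3} - r_{14.3}\, r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}$

These results are useful since, by means of these formulae, any partial correlation coefficient can ultimately be made to depend on the simple correlation coefficients $r_{12}$, $r_{13}$, $r_{23}$.

The square of the partial correlation coefficient is known as the coefficient of partial determination and is used to explain the total variation of the dependent variable due to an independent variable when the effect of the other variables is kept constant.

3.6.2 Multiple Correlations
Multiple correlation studies the relationship between a dependent variable and the combined effect of two or more independent variables. For example, the correlation between the number of units sold and the joint effect of advertising expenditure and price.

In multiple correlation we study three or more variables at a time. Let $X_1$, $X_2$ and $X_3$ be three variables. Then $R_{1.23}$ denotes the multiple correlation coefficient of the dependent variable $X_1$ on the independent variables $X_2$ and $X_3$, and is given as

$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}}$, similarly
$R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{13}^2}}$
$R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{12}^2}}$

The square of the multiple correlation coefficient is known as the coefficient of multiple determination and is used to explain the total variation of the dependent variable due to the combined effect of the independent variables.

Example 13
The following correlation coefficients between three related variables $X_1$, $X_2$ and $X_3$ are given:
$r_{12}$ = 0.98, $r_{13}$ = 0.44, $r_{23}$ = 0.54
Calculate the three partial correlation coefficients and the three multiple correlation coefficients.
Solution
The three partial correlation coefficients are calculated as follows.
$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{0.98 - 0.44 \times 0.54}{\sqrt{(1 - 0.44^2)(1 - 0.54^2)}} = 0.9822$
$r_{13.2} = \frac{r_{13} - r_{12}\, r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}} = \frac{0.44 - 0.98 \times 0.54}{\sqrt{(1 - 0.98^2)(1 - 0.54^2)}} = -0.53257$
$r_{23.1} = \frac{r_{23} - r_{12}\, r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}} = \frac{0.54 - 0.98 \times 0.44}{\sqrt{(1 - 0.98^2)(1 - 0.44^2)}} = 0.6088$
The three multiple correlation coefficients are calculated as follows.
$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{0.98^2 + 0.44^2 - 2 \times 0.98 \times 0.44 \times 0.54}{1 - 0.54^2}} = 0.9857$
$R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{13}^2}} = \sqrt{\frac{0.98^2 + 0.54^2 - 2 \times 0.98 \times 0.44 \times 0.54}{1 - 0.44^2}} = 0.9875$
$R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{12}^2}} = \sqrt{\frac{0.44^2 + 0.54^2 - 2 \times 0.98 \times 0.44 \times 0.54}{1 - 0.98^2}} = 0.7018$

3.7 REGRESSION ANALYSIS
In general, decisions are based on the relationship between two or more variables. For example, after considering the relationship between quality of cement and strength of concrete, an engineer might attempt to control the quality of cement production. An economist might be interested in predicting the demand for a particular product using the relationship between price and the demand for the product. Sometimes a manager will rely on intuition to judge how two variables are related. However, if data can be obtained, a statistical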

procedure called regression analysis can be used to develop an equation showing how the variables are related.

Regression analysis helps one determine the probable form of the relationship between the variables. The objective of this method of analysis is usually to predict or estimate the value of one variable corresponding to a given value of another variable. The English scientist Sir Francis Galton (1822-1911) first proposed the ideas of regression in reports of his research in the area of heredity. Galton used the word regression to describe a tendency of adult offspring, even those with short or tall parents, to revert toward the average height of the general population.

In regression terminology, the variable being predicted is called the dependent variable. The variable or variables being used to predict the value of the dependent variable are called the independent variables. For example, in analyzing the effect of advertising expenditure on sales, a marketing manager's desire to predict sales would suggest making sales the dependent variable. Advertising expenditure would be the independent variable used to predict sales. Economists might base their predictions of the annual gross domestic product, or GDP, on the final consumption spending within the economy. Thus, the final consumption spending is the independent variable and the GDP is the dependent variable. Generally, the dependent variable is denoted by Y and the independent variable by X.

3.7.1 Linear or Non-Linear Regression
If the graph of the dependent and independent variables shows a linear trend, then it is called linear regression. If it does not follow a straight line, then it is called non-linear or curvilinear regression.

3.7.2 A simple Probabilistic Model of Regression
Let $Y = \beta_0 + \beta_1 X$ be the deterministic mathematical model of regression of the dependent variable Y on the independent variable X, so called because a value of Y is determined by putting in a value of X. Here $\beta_0$ is the Y-intercept of the line and $\beta_1$ is the slope of the line.

In reality, for a given value of the independent variable X, the value of Y varies from this value by a random amount, which is described by the simple probabilistic model

$Y = \beta_0 + \beta_1 X + e$

where e is an error term, the difference between an observed value of Y and the mean value of Y. The error term is normally distributed with mean zero and variance $\sigma^2$.

3.7.3 Least square method of estimation of the parameters of simple linear model
Consider the simple linear model $Y = \beta_0 + \beta_1 X + e$. Let there be n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. In the above model Y is called the dependent variable, X the independent variable and e the random error. For the i-th pair of observations,
$Y_i = \beta_0 + \beta_1 X_i + e_i$
or $e_i = Y_i - \beta_0 - \beta_1 X_i$.
The error may be positive or negative. To avoid the problem of the sign of the error, square both sides and take the sum over all pairs. Then we have

$Q = \sum e_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$

To get the best estimates of $\beta_0$ and $\beta_1$ one has to minimize $Q = \sum e_i^2$, which amounts to minimizing $\sum (Y_i - \beta_0 - \beta_1 X_i)^2$. The values of $\beta_0$ and $\beta_1$ which minimize Q are known as the least square estimators. To estimate the parameters $\beta_0$ and $\beta_1$, differentiate Q partially with respect to $\beta_0$ and $\beta_1$ respectively and equate to zero, i.e.
$\frac{\partial Q}{\partial \beta_0} = 0$ and $\frac{\partial Q}{\partial \beta_1} = 0$, which gives
$\sum Y_i = n\beta_0 + \beta_1 \sum X_i$ and
$\sum X_i Y_i = \beta_0 \sum X_i + \beta_1 \sum X_i^2$
These two equations are known as the normal equations. On solving these two equations we obtain
$\beta_0 = \bar{Y} - \beta_1 \bar{X}$ and $\beta_1 = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - (\sum X_i)^2}$

3.7.4 Lines of Regression
The lines of regression are the lines which give the best estimated value of the dependent variable for a given value of the independent variable. Let X and Y be two variables. There are two lines of regression.

[Figure: the regression line of Y on X and the regression line of X on Y, intersecting at the point $(\bar{X}, \bar{Y})$; X on the horizontal axis, Y on the vertical axis.]
i. Regression line of Y on X, and
ii. Regression line of X on Y.

The derivation and assumptions of the two regression lines are quite different. These regression lines are not interchangeable, and they describe an average relationship, not the actual relationship. The regression line of Y on X is obtained by minimizing the error sum of squares for the estimation of Y, while the regression line of X on Y is obtained by minimizing the error sum of squares for the estimation of X. This is the reason why there are two different regression equations for a given set of data.

a. Regression line of Y on X
The regression line of Y on X gives the best estimated value of Y for given values of X. The regression equation of Y on X is
Y = a + bX … (i)
where a is a constant or Y-intercept and b is the slope of the regression line or regression coefficient of Y on X, which is denoted by $b_{YX}$. It is a measure of the average change in the dependent variable Y corresponding to a unit change in the independent variable X.

In equation (i), knowing the value of X, one cannot estimate the value of Y without knowing the values of the constants a and b. So, first we should estimate the values of the unknown regression parameters by using the principle of least square estimation (LSE). Using the LSE for equation (i) we obtain the following two normal equations:
$\sum Y = na + b\sum X$ … (ii)
$\sum XY = a\sum X + b\sum X^2$ … (iii)
where n is the number of pairs of observations. By solving these two normal equations, we get
$a = \bar{Y} - b\bar{X}$ and $b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = b_{YX}$
Now, putting the values of a and b in equation (i), we get the required estimated regression equation of Y on X, which is as follows:
$\hat{Y} = a + bX$
where $\hat{Y}$ is the estimated or calculated value of Y for a given value of X.

b. Regression line of X on Y
The regression line of X on Y gives the best estimated value of X for given values of Y, and its equation is given by
X = A + BY … (i)
where A is a constant or X-intercept and B is the slope of the regression line or regression coefficient of X on Y, which is denoted by $b_{XY}$. It is a measure of the average change in the dependent variable X corresponding to a unit change in the independent variable Y.

In equation (i), knowing the value of Y, one cannot estimate the value of X without knowing the values of the constants A and B. So, first we should estimate the values of the unknown regression parameters by using the principle of least square estimation (LSE). Using the LSE for equation (i) we obtain the following two normal equations:
$\sum X = nA + B\sum Y$ … (ii)
$\sum XY = A\sum Y + B\sum Y^2$ … (iii)
where n is the number of pairs of observations. By solving these two normal equations, we get
$A = \bar{X} - B\bar{Y}$ and $B = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2} = b_{XY}$
Now, putting the values of A and B in equation (i), we get the required estimated regression equation of X on Y, which is as follows:
$\hat{X} = A + BY$
where $\hat{X}$ is the estimated or calculated value of X for a given value of Y.

3.7.5 Alternative Method of Fitting Regression Lines
The regression lines of Y on X and X on Y can be expressed respectively as
$Y - \bar{Y} = b_{YX}(X - \bar{X})$
$X - \bar{X} = b_{XY}(Y - \bar{Y})$
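The closed-form solutions of the normal equations can be tried on a small data set. The sketch below is a minimal Python illustration using the X and Y values of Example 4. It fits both regression lines and checks that they meet at the point of means. The function name `fit_line` is an assumption for illustration, and the sign of r is taken from the regression coefficients.

```python
from math import copysign, sqrt

def fit_line(xs, ys):
    """Least-squares line ys = a + b*xs, using the solved normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # regression coefficient
    a = sy / n - b * sx / n                          # intercept: mean_y - b * mean_x
    return a, b

# Data of Example 4
X = [4, 5, 6, 7, 8]
Y = [9, 11, 12, 15, 18]

a, b_yx = fit_line(X, Y)   # regression line of Y on X
A, b_xy = fit_line(Y, X)   # regression line of X on Y (roles swapped)

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

# r is the geometric mean of the two coefficients, carrying their common sign
r = copysign(sqrt(b_yx * b_xy), b_yx)

print(f"Y on X: Y = {a:.2f} + {b_yx:.2f} X")
print(f"X on Y: X = {A:.2f} + {b_xy:.2f} Y")
print(f"r = {r:.3f}")
```

Swapping the arguments of `fit_line` is all that is needed for the second line, which reflects that the two lines minimize errors in different variables and are not algebraic inverses of each other.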

Both regression lines, written in the form $Y - \bar{Y} = b_{YX}(X - \bar{X})$ and $X - \bar{X} = b_{XY}(Y - \bar{Y})$, pass through $(\bar{X}, \bar{Y})$, which is therefore the common point of the two regression lines. Here $\bar{X}$ and $\bar{Y}$ are the arithmetic means of the X and Y series, and $b_{YX}$ and $b_{XY}$ are the regression coefficient of Y on X and the regression coefficient of X on Y respectively. The regression coefficients are computed as follows.

Direct method
$b_{YX} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$ and $b_{XY} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2}$

Shortcut method
If we take the deviations from assumed means A and B, then
$b_{YX} = b_{vu} = \frac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2}$ and $b_{XY} = b_{uv} = \frac{n\sum uv - \sum u \sum v}{n\sum v^2 - (\sum v)^2}$
where u = X - A and v = Y - B, and A and B are the assumed means of the X and Y series respectively.

Step deviation method
Under this method,
$b_{YX} = \frac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2} \times \frac{k}{h} = b_{vu} \times \frac{k}{h}$
$b_{XY} = \frac{n\sum uv - \sum u \sum v}{n\sum v^2 - (\sum v)^2} \times \frac{h}{k} = b_{uv} \times \frac{h}{k}$
where $u = \frac{X - A}{h}$ and $v = \frac{Y - B}{k}$, and h and k are the common factors of the X and Y series respectively.

It is to be noted that $b_{YX} = r\frac{\sigma_Y}{\sigma_X}$ and $b_{XY} = r\frac{\sigma_X}{\sigma_Y}$.
Also, $r = \pm\sqrt{b_{YX}\, b_{XY}}$, where r is the correlation coefficient.

Remarks
i. The two regression lines coincide if there is perfect correlation between the two variables, i.e. r = ±1.
ii. The two regression lines are perpendicular to each other if there is no correlation between the two variables, i.e. r = 0.

3.7.6 Properties of Regression Coefficients
The following are the important properties of regression coefficients.
i. Both regression coefficients must have the same sign, and the sign of the correlation coefficient is the same as the sign of the regression coefficients.
ii. The product of the two regression coefficients must be less than or equal to 1, i.e. $b_{YX}\, b_{XY} \le 1$.
iii. The correlation coefficient is the geometric mean of the two regression coefficients, i.e. $r = \pm\sqrt{b_{YX}\, b_{XY}}$.
iv. The two lines of regression intersect at the point $(\bar{X}, \bar{Y})$, where X and Y are the variables under study.
v. Regression coefficients are independent of change of origin but not of scale, i.e. $b_{XY} = b_{uv}$ if u = X - A and v = Y - B, and $b_{XY} \ne b_{uv}$ if $u = \frac{X - A}{h}$ and $v = \frac{Y - B}{k}$, where A and B are assumed means and h and k are the common factors of the X and Y series respectively.
vi. The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient.
Proof: we have the two regression coefficients $b_{YX} = r\frac{\sigma_Y}{\sigma_X}$ and $b_{XY} = r\frac{\sigma_X}{\sigma_Y}$. Then, taking r > 0, we have to show that
$\frac{b_{YX} + b_{XY}}{2} \ge r$
Or, $\frac{1}{2}\left(r\frac{\sigma_Y}{\sigma_X} + r\frac{\sigma_X}{\sigma_Y}\right) \ge r$
Or, $r\left(\frac{\sigma_Y}{\sigma_X} + \frac{\sigma_X}{\sigma_Y}\right) \ge 2r$
Or, $\sigma_Y^2 + \sigma_X^2 \ge 2\sigma_Y \sigma_X$

Or, $(\sigma_Y - \sigma_X)^2 \ge 0$, which is always true.

3.7.7 Angle between Two Regression lines
We have the two regression lines
$Y - \bar{Y} = r\frac{\sigma_Y}{\sigma_X}(X - \bar{X})$
$X - \bar{X} = r\frac{\sigma_X}{\sigma_Y}(Y - \bar{Y})$
Now, the slope of the regression equation of Y on X is $m_1 = r\frac{\sigma_Y}{\sigma_X}$
and the slope of the regression equation of X on Y is $m_2 = \frac{\sigma_Y}{r\,\sigma_X}$.
If $\theta$ is the angle between the two regression lines, then we have
$\tan\theta = \pm\frac{m_1 - m_2}{1 + m_1 m_2} = \pm\frac{r^2 - 1}{r} \cdot \frac{\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2}$
$= \frac{1 - r^2}{r} \cdot \frac{\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2}$ [taking the negative sign since $r^2 \le 1$]
∴ $\theta = \tan^{-1}\left\{\frac{1 - r^2}{r} \cdot \frac{\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2}\right\}$

Case I: When r = 0, $\theta = \pi/2$. This means that if the two variables are uncorrelated, the two regression lines are perpendicular to each other.
Case II: If r = ±1, $\theta = 0$ or $\pi$. This means that if the two variables are perfectly correlated, the two regression lines are either parallel or coincide.
Case III: If $\tan\theta > 0$ then $\theta$ is an acute angle, and if $\tan\theta < 0$ then $\theta$ is an obtuse angle.
Case IV: The correlation coefficient lies between -1 and 1.
Proof: The regression equation of Y on X is given as
$Y - \bar{Y} = r\frac{\sigma_Y}{\sigma_X}(X - \bar{X})$
Or, $\sigma_X (Y - \bar{Y}) = r\,\sigma_Y (X - \bar{X})$; on squaring both sides,
Or, $\sigma_X^2 (Y - \bar{Y})^2 = r^2 \sigma_Y^2 (X - \bar{X})^2$; taking expectation on both sides,
Or, $\sigma_X^2\, E(Y - \bar{Y})^2 = r^2 \sigma_Y^2\, E(X - \bar{X})^2$
Or, $\sigma_X^2 \sigma_Y^2 = r^2 \sigma_Y^2 \sigma_X^2$
Or, $r^2 = 1$
Or, |r| = 1
This equality holds when every point lies exactly on the regression line, which is the limiting case of perfect correlation. In general the points scatter about the line and the residual variance $\sigma_Y^2 (1 - r^2)$ cannot be negative, so $r^2 \le 1$, i.e. $-1 \le r \le 1$.

3.8 CURVILINEAR REGRESSION
There are various relationships between the variables X and Y, namely linear and non-linear. A non-linear relationship between the variables X and Y is also known as curvilinear regression. Some of the commonly used curvilinear regressions in engineering are as follows.
i. Quadratic or parabolic regression equation
ii. Exponential regression equation
iii. Reciprocal regression equation

3.8.1 Fitting of Quadratic Regression Equation
A quadratic, second degree or parabolic regression is the simplest of the curvilinear models. If the data are not properly described by a straight line regression, we use this method.
Let
$Y = a + bX + cX^2$ … (i)
be a quadratic or parabolic equation for the given data, where a, b and c are constants to be estimated by the method of least squares. For this we have to solve the following three normal equations:
$\sum Y = na + b\sum X + c\sum X^2$ … (ii)
$\sum XY = a\sum X + b\sum X^2 + c\sum X^3$ … (iii)
$\sum X^2 Y = a\sum X^2 + b\sum X^3 + c\sum X^4$ … (iv)
By solving equations (ii), (iii) and (iv), we get the values of a, b and c. Then the fitted second degree regression equation is
$Y_c = a + bX + cX^2$

3.8.2 Fitting of Exponential Regression Equation
An exponential regression equation is defined as
$Y = a\,b^X$
Taking logarithms on both sides, we have
$\log_{10} Y = \log_{10} a + X \log_{10} b$
This is the equation of a straight line in which $\log_{10} Y$ depends on X, and the constants $\log_{10} a$ and $\log_{10} b$ are obtained by the method of least squares.

Another form of exponential relationship is

$Y = aX^b$
Taking log on both sides we have
$\log_{10} Y = \log_{10} a + b \log_{10} X$.
This is the equation of a straight line: $\log_{10} Y$ is dependent on $\log_{10} X$, and $\log_{10} a$ and b are constants obtained by the method of least squares.

3.8.3 Fitting of Reciprocal Regression Equation
The reciprocal relationship between two variables X and Y commonly used in engineering is
$Y = \dfrac{1}{a + bX}$
Or, $a + bX = \dfrac{1}{Y}$.
Here for the points $(X, 1/Y)$ the graph is approximately a straight line. $1/Y$ depends on variable X, and the constants a and b are obtained by the method of least squares by solving the following two normal equations:
$\sum \dfrac{1}{Y} = na + b\sum X$ and
$\sum \dfrac{X}{Y} = a\sum X + b\sum X^2$

Example 14
The following measurements show the respective heights in inches of 10 fathers and their eldest sons.
Father (X)   67  63  66  71  69  65  62  70  61  72
Son (Y)      68  66  65  70  69  67  64  71  60  63
i. Find the regression line of son's height on father's height.
ii. Estimate the height of the son when the height of the father is 70 inches.
iii. Find the regression line of X on Y.
iv. At what point do the two lines of regression intersect?
v. Find the correlation between variables X and Y.

Solution
Calculation table
X      Y      X^2     Y^2     XY
67     68     4489    4624    4556
63     66     3969    4356    4158
66     65     4356    4225    4290
71     70     5041    4900    4970
69     69     4761    4761    4761
65     67     4225    4489    4355
62     64     3844    4096    3968
70     71     4900    5041    4970
61     60     3721    3600    3660
72     63     5184    3969    4536
666    663    44490   44061   44224   (totals)

Here n = 10, $\sum X = 666$, $\sum Y = 663$, $\sum X^2 = 44490$, $\sum Y^2 = 44061$ and $\sum XY = 44224$.
i. The regression equation of Y on X is Y = a + bX … (i), where Y is the height of the son, X is the height of the father, and a and b are constants obtained by solving the following two normal equations.
$\sum Y = na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
$\therefore$ 663 = 10a + 666b … (ii)
44224 = 666a + 44490b … (iii)
On solving equations (ii) and (iii) we get the values of a and b as
a = 32.53 and b = 0.507
$\therefore$ The required line of regression is Y = 32.53 + 0.507X.
ii. The estimated height of the son when the height of the father is 70 inches is Y = 32.53 + 0.507 × 70 = 68.02 inches.
iii. Similarly, the regression line of X on Y is X = A + BY, and the normal equations to get the values of A and B are
$\sum X = nA + B\sum Y$
$\sum XY = A\sum Y + B\sum Y^2$
$\therefore$ 666 = 10A + 663B … (iv)
44224 = 663A + 44061B … (v)
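The two pairs of normal equations in Example 14, (ii)-(iii) for Y on X and (iv)-(v) for X on Y, are 2 × 2 linear systems in the column totals, so they can be solved mechanically. The following is a minimal Python sketch of that step, assuming the totals n = 10, ΣX = 666, ΣY = 663, ΣX² = 44490, ΣY² = 44061 and ΣXY = 44224 from the calculation table; the helper name `solve_2x2` and the variable names are illustrative and not part of the text.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*p + a12*q = b1 and a21*p + a22*q = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the normal equations have no unique solution")
    p = (b1 * a22 - a12 * b2) / det
    q = (a11 * b2 - a21 * b1) / det
    return p, q


# Column totals from the calculation table of Example 14.
n, sum_x, sum_y = 10, 666, 663
sum_xx, sum_yy, sum_xy = 44490, 44061, 44224

# Y on X: equations (ii) and (iii), unknowns a and b.
a, b = solve_2x2(n, sum_x, sum_y, sum_x, sum_xx, sum_xy)

# X on Y: equations (iv) and (v), unknowns A and B.
A, B = solve_2x2(n, sum_y, sum_x, sum_y, sum_yy, sum_xy)

son_at_70 = a + b * 70  # estimated son's height for a 70-inch father

print(f"Y on X: a = {a:.3f}, b = {b:.3f}")
print(f"X on Y: A = {A:.3f}, B = {B:.3f}")
print(f"Estimate at X = 70: {son_at_70:.2f}")
```

The slopes agree with the hand calculation to three decimals (b ≈ 0.507, B ≈ 0.655). The intercepts can differ from the hand values in the second decimal place, because the hand calculation substitutes the already rounded slope back into the first normal equation.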

On solving equations (iv) and (v) we get the values of A and B as
A = 23.174 and B = 0.655
Therefore, the regression line of X on Y is X = 23.174 + 0.655Y.
iv. The two lines of regression intersect at the point $(\bar{X}, \bar{Y})$, where
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{666}{10} = 66.6$ and
$\bar{Y} = \dfrac{\sum Y}{n} = \dfrac{663}{10} = 66.3$
Therefore, the point of intersection of the two lines is (66.6, 66.3).
v. The correlation coefficient between X and Y is given as the geometric mean of the two regression coefficients:
$r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{0.507 \times 0.655} = 0.576$
i.e. there is a moderate positive correlation between the height of the father and the height of the son.

Example 15
Find the most likely production corresponding to a rainfall of 40 mm from the following data.
                      Rainfall (mm)   Production (m. tons)
Average               30              500
Standard deviation    5               100
The coefficient of correlation = 0.8.
Solution
Let rainfall and production be denoted by X and Y respectively; then we are given
$\bar{X} = 30$, $\bar{Y} = 500$, $\sigma_x = 5$, $\sigma_y = 100$ and r = 0.8.
First we find the regression equation of Y on X, i.e. the regression equation of production on rainfall.
Now, $b_{YX} = r\,\dfrac{\sigma_y}{\sigma_x} = 0.8 \times \dfrac{100}{5} = 16$
Therefore, the regression equation of Y on X is $Y - \bar{Y} = b_{YX}(X - \bar{X})$
Or, Y - 500 = 16(X - 30)
Or, Y = 16X + 20 … (i)
For the estimation of production, put X = 40 mm in equation (i); then
Y = 16 × 40 + 20 = 660.
Thus, the volume of production corresponding to a rainfall of 40 mm is 660 m. tons.

Example 16
Two lines of regression are given by 4X - 5Y + 33 = 0 and 20X - 9Y - 107 = 0, and the variance of X = 9. Find
i. The mean values of X and Y
ii. The standard deviation of Y.
Solution
i. Since the lines of regression pass through the averages of X and Y, i.e. $(\bar{X}, \bar{Y})$, we have $4\bar{X} - 5\bar{Y} + 33 = 0$ and $20\bar{X} - 9\bar{Y} - 107 = 0$.
On solving the two equations we get $\bar{X} = 13$ and $\bar{Y} = 17$.
ii. From the equation 4X - 5Y + 33 = 0,
$Y = \dfrac{4}{5}X + \dfrac{33}{5}$, which is the regression equation of Y on X,
so $b_{YX} = \dfrac{4}{5}$.
Again from the equation 20X - 9Y - 107 = 0,
$X = \dfrac{9}{20}Y + \dfrac{107}{20}$, which is the regression equation of X on Y.
So, $b_{XY} = \dfrac{9}{20}$.
Now, the correlation coefficient is the geometric mean of the regression coefficients, thus
$r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{\dfrac{4}{5} \times \dfrac{9}{20}} = 0.6$
Again we have $b_{YX} = r\,\dfrac{\sigma_Y}{\sigma_X}$
Or, $\dfrac{4}{5} = 0.6 \times \dfrac{\sigma_Y}{3}$, which implies $\sigma_Y = 4$.
Therefore, the value of the standard deviation of Y is 4.

Example 17
The following data give the experience of machine operators in years and their performance as given by the number of good parts turned out per 100 pieces.
Operator          I    II   III  IV   V    VI   VII  VIII
Experience (X)    16   12   18   4    3    10   5    12

Performance (Y)   87   88   89   68   78   80   75   83
Calculate the regression equation of performance rating on experience and estimate the probable performance if an operator has 8 years of experience.
Solution
Here we have to find the regression equation of performance rating (Y) on experience (X), i.e. Y on X. The regression equation of Y on X is given by
$Y - \bar{Y} = b_{YX}(X - \bar{X})$ …….. (i)
where $\bar{Y}$ is the mean of the Y series, $\bar{X}$ is the mean of the X series, and $b_{YX}$ is the regression coefficient of Y on X.
Let u = X - 12 and v = Y - 83, so that a = 12 and b = 83 are the assumed means.
Calculation of regression equation
X     Y     u     v     u^2   v^2   uv
16    87    4     4     16    16    16
12    88    0     5     0     25    0
18    89    6     6     36    36    36
4     68    -8    -15   64    225   120
3     78    -9    -5    81    25    45
10    80    -2    -3    4     9     6
5     75    -7    -8    49    64    56
12    83    0     0     0     0     0
Total       -16   -16   250   400   279
Here n = 8, $\sum u = -16$, $\sum v = -16$, $\sum u^2 = 250$, $\sum v^2 = 400$ and $\sum uv = 279$.
$\therefore \bar{X} = a + \dfrac{\sum u}{n} = 12 + \dfrac{-16}{8} = 10$ and
$\bar{Y} = b + \dfrac{\sum v}{n} = 83 + \dfrac{-16}{8} = 81$
$b_{YX} = \dfrac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2} = \dfrac{8 \times 279 - (-16)(-16)}{8 \times 250 - (-16)^2} = \dfrac{1976}{1744} = 1.133$
Now, substituting the values of $\bar{X}$, $\bar{Y}$ and $b_{YX}$ in equation (i),
Y - 81 = 1.133(X - 10)
Y = 1.133X + 69.67, which is the required regression equation of Y on X.
For estimation when X = 8 years, Y = ?
Therefore, Y = 1.133 × 8 + 69.67 = 78.734.
Thus, if an operator has 8 years of experience, his performance is about 78.73 good parts turned out per 100 pieces.

3.9 MULTIPLE REGRESSION
Multiple regression explains the relationship between two or more independent (predictor) variables and one dependent (criterion) variable. The dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with a constant term. Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.
The multiple regression equation explained above takes the following form:
$y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
Here, the $b_i$ (i = 1, 2, …, n) are the regression coefficients, each of which represents the amount by which the criterion variable changes when the corresponding predictor variable changes by one unit, the other predictors being held fixed. The constant a is the value of the dependent variable y when the values of all the independent variables are zero.
As an example, let us say that the test score of a student in an exam depends on various factors such as his focus while attending class, his intake of food before the exam and the amount of sleep he gets before the exam. Using multiple regression one can estimate the relationship between the test score and these factors.
Let us consider three variables $X_1$, $X_2$ and $X_3$. The multiple regression equation of the dependent variable $X_1$ on the two independent variables $X_2$ and $X_3$ is an equation for estimating $X_1$ from $X_2$ and $X_3$, which is given below:
$X_1 = a + b_1 X_2 + b_2 X_3$ (1)
where $b_1$ and $b_2$ are known as the coefficients of regression. The three constants a, $b_1$ and $b_2$ are obtained by solving simultaneously the following three normal equations, obtained by the method of least squares:
$\sum X_1 = na + b_1\sum X_2 + b_2\sum X_3$

$\sum X_1 X_2 = a\sum X_2 + b_1\sum X_2^2 + b_2\sum X_2 X_3$
$\sum X_1 X_3 = a\sum X_3 + b_1\sum X_2 X_3 + b_2\sum X_3^2$
These equations can be obtained by multiplying both sides of equation (1) by 1, $X_2$ and $X_3$ successively and summing on both sides.
Similarly, the multiple regression equation of the dependent variable $X_2$ on the two independent variables $X_1$ and $X_3$ is an equation for estimating $X_2$ from $X_1$ and $X_3$, which is given below:
$X_2 = a + b_1 X_1 + b_2 X_3$ (2)
where $b_1$ and $b_2$ are known as the coefficients of regression. The three constants are obtained by solving simultaneously the following three normal equations, obtained by the method of least squares:
$\sum X_2 = na + b_1\sum X_1 + b_2\sum X_3$
$\sum X_1 X_2 = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1 X_3$
$\sum X_2 X_3 = a\sum X_3 + b_1\sum X_1 X_3 + b_2\sum X_3^2$
Also, the multiple regression equation of the dependent variable $X_3$ on the two independent variables $X_1$ and $X_2$ is an equation for estimating $X_3$ from $X_1$ and $X_2$, which is given below:
$X_3 = a + b_1 X_1 + b_2 X_2$ (3)
where $b_1$ and $b_2$ are known as the coefficients of regression. The three constants are obtained by solving simultaneously the following three normal equations, obtained by the method of least squares:
$\sum X_3 = na + b_1\sum X_1 + b_2\sum X_2$
$\sum X_1 X_3 = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1 X_2$
$\sum X_2 X_3 = a\sum X_2 + b_1\sum X_1 X_2 + b_2\sum X_2^2$

Example 18
The following data show the sales of a company together with the number of sales persons and their years of experience.
Sales (Rs '000)        20   30   25   20   40   60   15
No. of sales persons   2    3    5    4    2    1    4
Years of experience    5    7    11   10   8    7    8
i. Estimate the best line of fit for the sales of the company in terms of the number of sales persons and their experience in years.
ii. Estimate the sales of the company using one sales person with experience of nine years.
Solution
To find the sales of the company, we take sales as the dependent variable. The multiple regression equation of the dependent variable $X_1$ on the two independent variables $X_2$ and $X_3$ is given below:
$X_1 = a + b_1 X_2 + b_2 X_3$ (1)
where $b_1$ and $b_2$ are known as the coefficients of regression. The three constants are obtained by solving simultaneously the following three normal equations, obtained by the method of least squares:
$\sum X_1 = na + b_1\sum X_2 + b_2\sum X_3$
$\sum X_1 X_2 = a\sum X_2 + b_1\sum X_2^2 + b_2\sum X_2 X_3$
$\sum X_1 X_3 = a\sum X_3 + b_1\sum X_2 X_3 + b_2\sum X_3^2$
Calculation Table
X1    X2    X3    X2^2   X3^2   X1X2   X1X3   X2X3
20    2     5     4      25     40     100    10
30    3     7     9      49     90     210    21
25    5     11    25     121    125    275    55
20    4     10    16     100    80     200    40
40    2     8     4      64     80     320    16
60    1     7     1      49     60     420    7
15    4     8     16     64     60     120    32
210   21    56    75     472    535    1645   181   (totals)
Substituting these values in the above equations, we get
7a + 21$b_1$ + 56$b_2$ = 210
21a + 75$b_1$ + 181$b_2$ = 535

56a + 181$b_1$ + 472$b_2$ = 1645
On solving these three equations we get
a = 21.22, $b_1$ = -15.34 and $b_2$ = 6.85.
i. Thus the required estimated best line of fit is
$X_1 = 21.22 - 15.34X_2 + 6.85X_3$
ii. The estimated value of the sales of the company using one sales person ($X_2 = 1$) with nine years of experience ($X_3 = 9$) is given as
$X_1$ = 21.22 - 15.34 × 1 + 6.85 × 9 = 67.53
Thus the estimated sales is Rs 67.53 thousand.

SOME WORKED OUT EXAMPLES
Example 19
Find the coefficient of correlation between the age and playing habit of the following students.
Age (year)               15    16    17    18    19    20
No. of students          125   100   75    60    50    40
No. of regular players   100   75    45    24    15    6
Solution
In order to calculate the correlation coefficient between age and playing habit of the students, the number of players must be expressed relative to the total number of students in each age group on a fixed base; for this purpose, the percentage of players is calculated. Then the correlation between age (X) and percentage of players (Y) is calculated as follows.
Percentage of players for age 15 = $\dfrac{100}{125} \times 100 = 80\%$, and so on.
Let u = X - 17, v = $\dfrac{Y - 60}{5}$
Calculation of correlation coefficient
X     Y     u     v     uv    u^2   v^2
15    80    -2    4     -8    4     16
16    75    -1    3     -3    1     9
17    60    0     0     0     0     0
18    40    1     -4    -4    1     16
19    30    2     -6    -12   4     36
20    15    3     -9    -27   9     81
Total: $\sum u = 3$, $\sum v = -12$, $\sum uv = -54$, $\sum u^2 = 19$, $\sum v^2 = 158$
Now, the correlation coefficient is
$r = \dfrac{n\sum uv - \sum u \sum v}{\sqrt{n\sum u^2 - (\sum u)^2}\,\sqrt{n\sum v^2 - (\sum v)^2}}$
$= \dfrac{6 \times (-54) - 3 \times (-12)}{\sqrt{6 \times 19 - 3^2}\,\sqrt{6 \times 158 - (-12)^2}}$
$= \dfrac{-288}{10.247 \times 28.355} = -0.991$
This shows that there is a high degree of negative correlation between age and playing habits of students.

Example 20
While calculating the correlation coefficient by Karl Pearson's method, the pairs of observations (20, 65) and (18, 50) were taken wrongly instead of (25, 60) and (28, 59), and the information used in the calculation is given as
r = 0.793, n = 10, $\sum X = 220$, $\sum Y = 590$,
$\sum X^2 = 4954$, $\sum Y^2 = 35232$ and $\sum XY = 13214$.
Calculate the corrected correlation coefficient.
Solution
To calculate the corrected value of r we first calculate the corrected totals as follows:
Corrected $\sum X$ = 220 - 20 - 18 + 25 + 28 = 235
Corrected $\sum Y$ = 590 - 65 - 50 + 60 + 59 = 594
Corrected $\sum X^2$ = 4954 - $20^2$ - $18^2$ + $25^2$ + $28^2$ = 5639
Corrected $\sum Y^2$ = 35232 - $65^2$ - $50^2$ + $60^2$ + $59^2$ = 35588
Corrected $\sum XY$ = 13214 - (20 × 65) - (18 × 50) + (25 × 60) + (28 × 59), which the working below takes as 13844. (Evaluated directly, the four products are 1300, 900, 1500 and 1652, giving 14166; with that value r would exceed 1, so the totals quoted in the question are not mutually consistent.)
Corrected $r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$
$= \dfrac{10 \times 13844 - 235 \times 594}{\sqrt{10 \times 5639 - (235)^2}\,\sqrt{10 \times 35588 - (594)^2}}$
$= \dfrac{138440 - 139590}{34.132 \times 55.1724}$
$= \dfrac{-1150}{34.132 \times 55.1724} = -0.610$
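The correction step used in Example 20 is mechanical: from each reported total, subtract the contribution of every wrongly recorded pair and add the contribution of the pair that should have been recorded. A minimal Python sketch of that bookkeeping follows, assuming the totals and pairs quoted in the question; the helper name `corrected_totals` and the dictionary keys are illustrative choices, not notation from the text.

```python
def corrected_totals(totals, wrong_pairs, right_pairs):
    """Remove the wrong (x, y) pairs from the running totals and add the right ones."""
    fixed = dict(totals)  # work on a copy so the reported totals stay untouched
    for sign, pairs in ((-1, wrong_pairs), (+1, right_pairs)):
        for x, y in pairs:
            fixed["sum_x"] += sign * x
            fixed["sum_y"] += sign * y
            fixed["sum_xx"] += sign * x * x
            fixed["sum_yy"] += sign * y * y
            fixed["sum_xy"] += sign * x * y
    return fixed


# Totals and pairs as quoted in Example 20.
reported = {"sum_x": 220, "sum_y": 590, "sum_xx": 4954,
            "sum_yy": 35232, "sum_xy": 13214}
fixed = corrected_totals(reported,
                         wrong_pairs=[(20, 65), (18, 50)],
                         right_pairs=[(25, 60), (28, 59)])

for name, value in fixed.items():
    print(name, value)
```

The helper reproduces the corrected values 235, 594, 5639 and 35588 for ΣX, ΣY, ΣX² and ΣY²; for ΣXY, direct evaluation of the four products gives 14166. The same function applies unchanged to any problem of this type in which one or more pairs were miscopied.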

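Therefore, the required corrected correlation coefficient is -0.610.

Example 21
From the following bi-variate distribution, compute the two regression coefficients, the coefficients of variation and the coefficient of correlation, and estimate the value of Y when the value of X is 45.
Y \ X      10-20   20-30   30-40   40-50
10-20      20      26      -       -
20-30      8       14      37      -
30-40      -       4       18      3
40-50      -       -       4       6

Solution
Let A = 25, B = 35 and $u = \dfrac{X - 25}{10}$ and $v = \dfrac{Y - 35}{10}$
In the table below each cell shows the frequency f, with the product fuv for that cell in parentheses.
                         X class:  10-20     20-30     30-40     40-50
                         Mid X:    15        25        35        45
                         u:        -1        0         1         2
Y class   Mid Y   v                                                          f     fv     fv^2   fuv
10-20     15      -2               20 (40)   26 (0)    -         -           46    -92    184    40
20-30     25      -1               8 (8)     14 (0)    37 (-37)  -           59    -59    59     -29
30-40     35      0                -         4 (0)     18 (0)    3 (0)       25    0      0      0
40-50     45      1                -         -         4 (4)     6 (12)      10    10     10     16
f                                  28        44        59        9           140   -141   253    27
fu                                 -28       0         59        18          49
fu^2                               28        0         59        36          123
fuv                                48        0         -33       12          27

Here, N = $\sum f$ = 140
$\sum fu = 49$, $\sum fv = -141$
$\sum fu^2 = 123$, $\sum fv^2 = 253$, $\sum fuv = 27$

i. The correlation coefficient for a bi-variate distribution is
$r = \dfrac{N\sum fuv - \sum fu \sum fv}{\sqrt{N\sum fu^2 - (\sum fu)^2}\,\sqrt{N\sum fv^2 - (\sum fv)^2}}$
= 0.706
ii. For the coefficients of variation, first we calculate the means and standard deviations of X and Y.
Mean of X, $\bar{X} = A + \dfrac{\sum fu}{N} \times h = 25 + \dfrac{49}{140} \times 10$
= 25 + 3.5 = 28.5
Mean of Y, $\bar{Y} = B + \dfrac{\sum fv}{N} \times k = 35 - \dfrac{141}{140} \times 10$
= 35 - 10.071 = 24.929
S.D. of X, $\sigma_x = \sqrt{\dfrac{\sum fu^2}{N} - \left(\dfrac{\sum fu}{N}\right)^2} \times h = \sqrt{\dfrac{123}{140} - \left(\dfrac{49}{140}\right)^2} \times 10$
= 0.8695 × 10 = 8.695
S.D. of Y, $\sigma_y = \sqrt{\dfrac{\sum fv^2}{N} - \left(\dfrac{\sum fv}{N}\right)^2} \times k = \sqrt{\dfrac{253}{140} - \left(\dfrac{-141}{140}\right)^2} \times 10$
= 0.8904 × 10 = 8.904.
$\therefore$ Coefficient of variation of X = $\dfrac{\sigma_x}{\bar{X}} \times 100\%$
= $\dfrac{8.695}{28.5} \times 100\%$ = 30.51%
Coefficient of variation of Y = $\dfrac{\sigma_y}{\bar{Y}} \times 100\%$
= $\dfrac{8.904}{24.929} \times 100\%$ = 35.72%
iii. The regression coefficient of Y on X is
$b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x} = 0.706 \times \dfrac{8.904}{8.695} = 0.723$
And the regression coefficient of X on Y is
$b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y} = 0.706 \times \dfrac{8.695}{8.904} = 0.689$
iv. To estimate the value of Y when the value of X is given, we have to find the regression line of Y on X:
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
Y - 24.929 = 0.723(X - 28.5)
Y = 24.929 + 0.723X - 20.6055
$\hat{Y}$ = 0.723X + 4.3235
which is the required regression line of Y on X.
Then, the value of Y when X = 45 is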

$\hat{Y}$ = 0.723 × 45 + 4.3235
$\therefore \hat{Y}$ = 36.858

Example 22
In two sets of variables X and Y with 50 observations each, the following data were observed.
$\bar{X} = 10$, $\bar{Y} = 6$, $\sigma_x = 3$, $\sigma_y = 2$
The coefficient of correlation between X and Y is 0.3. However, on subsequent verification it was found that one pair (X = 10, Y = 6) was inaccurate and hence removed. With the remaining 49 pairs of values, how is the original value of the correlation coefficient affected?
Solution
We are given n = 50, $\bar{X} = 10$, $\bar{Y} = 6$, $\sigma_x = 3$, $\sigma_y = 2$, $r_{xy} = 0.3$.
We have
$\bar{X} = \dfrac{\sum X}{n}$ and $\bar{Y} = \dfrac{\sum Y}{n}$
$10 = \dfrac{\sum X}{50}$ and $6 = \dfrac{\sum Y}{50}$
$\sum X = 500$ and $\sum Y = 300$
$\sigma_x^2 = \dfrac{1}{n}\sum (X - \bar{X})^2$ and $\sigma_y^2 = \dfrac{1}{n}\sum (Y - \bar{Y})^2$
$\sigma_x^2 = \dfrac{\sum X^2}{n} - \bar{X}^2$ and $\sigma_y^2 = \dfrac{\sum Y^2}{n} - \bar{Y}^2$
$n\sigma_x^2 = \sum X^2 - n\bar{X}^2$ and $n\sigma_y^2 = \sum Y^2 - n\bar{Y}^2$
$\sum X^2 = n(\sigma_x^2 + \bar{X}^2)$ and $\sum Y^2 = n(\sigma_y^2 + \bar{Y}^2)$
$\sum X^2$ = 50 (9 + 100) = 5450 and $\sum Y^2$ = 50 (4 + 36) = 2000
Also, the value of Karl Pearson's correlation coefficient is
$r = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$
$r\,\sigma_x \sigma_y = \mathrm{Cov}(X, Y) = \dfrac{\sum XY}{n} - \bar{X}\bar{Y}$
$0.3 \times 3 \times 2 = \dfrac{\sum XY}{50} - 10 \times 6$
$\sum XY$ = 50 × 61.8 = 3090
One pair of observations (X = 10, Y = 6) is wrong. Omitting this pair of observations we have
n = 50 - 1 = 49
Now, the corresponding corrected values are
$\sum X$ = 500 - 10 = 490, $\sum Y$ = 300 - 6 = 294
$\sum X^2$ = 5450 - $10^2$ = 5350, $\sum Y^2$ = 2000 - $6^2$ = 1964
$\sum XY$ = 3090 - 10 × 6 = 3030
Now putting the corrected values of $\sum X$, $\sum Y$, $\sum X^2$, $\sum Y^2$ and $\sum XY$ in the following formula we get the corrected correlation coefficient:
$r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$
$= \dfrac{49 \times 3030 - 490 \times 294}{\sqrt{49 \times 5350 - (490)^2}\,\sqrt{49 \times 1964 - (294)^2}}$
$= \dfrac{4410}{\sqrt{22050}\,\sqrt{9800}} = \dfrac{4410}{14700} = 0.3$
Therefore, the corrected correlation coefficient is 0.3. Thus, in this case, the original value of the correlation coefficient is not affected.

Example 23
A computer, while calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, obtained the following results.
n = 25, $\sum X = 125$, $\sum X^2 = 650$, $\sum Y = 100$, $\sum Y^2 = 460$, $\sum XY = 508$
It was however discovered at the time of checking that two pairs of observations were not correctly copied. They were taken as (6, 14) and (8, 6) while the correct values were (8, 12) and (6, 8). Prove that the correct value of the correlation coefficient should be $\dfrac{2}{3}$.
Solution
We have to add the correct values and subtract the wrong values in all the totals. The corresponding corrected totals are
Corrected $\sum X$ = 125 - 6 - 8 + 8 + 6 = 125
Corrected $\sum Y$ = 100 - 14 - 6 + 12 + 8 = 100
Corrected $\sum X^2$ = 650 - $6^2$ - $8^2$ + $8^2$ + $6^2$ = 650
Corrected $\sum Y^2$ = 460 - $14^2$ - $6^2$ + $12^2$ + $8^2$ = 436
Corrected $\sum XY$ = 508 - 6 × 14 - 8 × 6 + 8 × 12 + 6 × 8 = 520
The corrected value of r is given by
Corrected r =

$\dfrac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$
$= \dfrac{25 \times 520 - 125 \times 100}{\sqrt{25 \times 650 - (125)^2}\,\sqrt{25 \times 436 - (100)^2}}$
$= \dfrac{500}{\sqrt{625}\,\sqrt{900}} = \dfrac{500}{25 \times 30} = \dfrac{2}{3}$
Thus verified.

Example 24
A student calculates the value of r as 0.7 when the value of n is 5, and he concludes that r is highly significant. Is he correct? Calculate the limits for the population correlation coefficient. If the calculated value of PE(r) = 0.085 for r = 0.7, find the value of n.
Solution
We have r = 0.7, n = 5
$PE(r) = 0.6745\,\dfrac{1 - r^2}{\sqrt{n}} = 0.6745 \times \dfrac{1 - (0.7)^2}{\sqrt{5}} = 0.154$
And 6 PE(r) = 6 × 0.154 = 0.924
Hence r is not greater than 6 PE(r).
Thus we cannot draw any conclusion about the significance of the correlation coefficient, so his conclusion is not justified.
The limits for the population correlation coefficient are
r ± PE(r) = 0.7 ± 0.154
Upper limit of r = 0.7 + 0.154 = 0.854
Lower limit of r = 0.7 - 0.154 = 0.546
Now, if PE(r) = 0.085, r = 0.7, n = ?
We have $PE(r) = 0.6745\,\dfrac{1 - r^2}{\sqrt{n}}$
$0.085 = 0.6745 \times \dfrac{1 - (0.7)^2}{\sqrt{n}}$
$0.085 = \dfrac{0.344}{\sqrt{n}}$
$\sqrt{n} = \dfrac{0.344}{0.085}$
= 4.047
n = 16 (approximately)

Example 25
For the following information
                             X     Y
Arithmetic mean (in Rs)      6     8
Standard deviation (in Rs)   5     40/3
The correlation coefficient between X and Y is 8/15.
Find a. The regression coefficients of Y on X and X on Y
b. The two regression equations
c. The most likely value of Y when X = 100 rupees.
Solution
We have
$\bar{X} = 6$, $\bar{Y} = 8$, $\sigma_x = 5$, $\sigma_y = 40/3$, r = 8/15
a. The regression coefficient of Y on X is
$b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x} = \dfrac{8}{15} \times \dfrac{40/3}{5} = 1.422$
Similarly, the regression coefficient of X on Y is
$b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y} = \dfrac{8}{15} \times \dfrac{5}{40/3} = 0.2$
b. The regression equation of Y on X is
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
Y - 8 = 1.422 (X - 6)
$\hat{Y}$ = 1.422X - 0.532
Similarly, the regression equation of X on Y is
$X - \bar{X} = b_{xy}(Y - \bar{Y})$
X - 6 = 0.2 (Y - 8)
$\hat{X}$ = 0.2Y + 4.4
c. $\hat{Y}$ = ? when X = 100
$\hat{Y}$ = 1.422 × 100 - 0.532
= 142.2 - 0.532
= 141.67
Thus, the most likely value of Y is Rs 141.67.

EXERCISE – 6
1. What do you mean by correlation? Mention its types.
2. What are the different methods of finding correlation between two variables? Explain briefly.
3. Explain the concepts of simple, multiple and partial correlation coefficients.

4. Define Karl Pearson's correlation coefficient and interpret the result of its coefficient.
5. Define Spearman's correlation coefficient and interpret the result of its coefficient.
6. Define the term 'regression'. Discuss the two regression lines.
7. Mention the properties of regression coefficients.

Numerical for correlation coefficient
8. Draw a scatter diagram from the following data.
Height (inch)
Weight (lbs)
Also indicate whether the correlation is positive or negative.
9. If the covariance between the X and Y variables is 18 and the variances of X and Y are 25 and 81 respectively, find the coefficient of correlation between them.
10. Calculate the correlation coefficient between the X and Y series from the following data.
                        X series   Y series
No. of observations:    16         16
Standard deviation:     3.01       3.03
$\sum (X - \bar{X})(Y - \bar{Y}) = 122$
11. For 10 observations on Height (X) and Weight (Y), the following data were obtained (in appropriate units): $\sum X = 130$, $\sum Y = 220$, $\sum X^2 = 2290$, $\sum Y^2 = 5510$ and $\sum XY = 3467$.
Compute the coefficient of correlation.
12. Calculate the coefficient of correlation using the product moment formula from the data of price and supply given below:
Price (Rs.)   160   162   165   161   163   164   166
Supply        63    62    64    63    62    66    68
13. The following table gives the age and blood pressure of 10 patients.
Age        56    42    36    47    49    42    60    72    63    55
Pressure   147   125   118   128   145   140   155   160   149   150
Compute the coefficient of correlation, taking 49 and 140 as the assumed means of age and pressure respectively.
14. Calculate Karl Pearson's coefficient of correlation between expenditure on advertising (Rs. '000) and sales (Lakh Rs.) from the data given below:
Ad. Expenses
Sales
15. From the following table calculate the coefficient of correlation by Karl Pearson's method.
X   6   2    10   4   8
Y   9   11   ?    8   7
The arithmetic means of the X and Y series are 6 and 8 respectively.
16. Calculate Karl Pearson's coefficient of correlation from the following data:
Sum of deviations of X = 5
Sum of deviations of Y = 4
Sum of squares of deviations of X = 40
Sum of squares of deviations of Y = 50
Sum of products of deviations of X and Y = 32 and
Number of pairs of observations = 10
17. Calculate the coefficient of correlation between the age of students and the pass percentage given below:
Age (year)   % pass   Age (year)   % pass

13-14        39       18-19        39
14-15        40       19-20        48
15-16        43       20-21        49
16-17        44       21-22        54
17-18        36
18. In order to find the correlation coefficient between two variables X and Y from 12 pairs of observations, the following calculations were made:
$\sum X = 30$, $\sum Y = 5$, $\sum X^2 = 670$, $\sum Y^2 = 285$ and $\sum XY = 334$
On subsequent verification, it was found that the pair (X = 11, Y = 4) was copied wrongly, the correct value being (X = 10, Y = 14). Find the correct value of the correlation coefficient.
19. For a sample of 25 observations, the correlation coefficient is found to be 0.7. Find the limits within which the correlation coefficient lies for the population by using the probable error of the correlation coefficient.
20. If the correlation coefficient is found to be 0.6 for 64 pairs of observations, find the probable error of the correlation coefficient r and determine the limits of the population correlation coefficient.

Numerical for Regression
21. Fit the regression line of Y on X from the following data.
X   57   58   59   59   60   61   62   64
Y   67   68   65   68   72   72   69   71
22. Find the regression equations from the following data.
X   11   12   13   14   15
Y   11   13   15   17   19
23. From the data given below, estimate the most likely height of a brother whose sister's height is 50 cm.
                               Brother    Sister
Mean height                    170 cm     75 cm
Standard deviation of height   5.22 cm    3.85 cm
The coefficient of correlation between the heights of brothers and sisters is 0.60.
24. Find the most likely price in market A corresponding to the price of Rs. 75 at market B from the following data.
Average price in market A = Rs. 67
Average price in market B = Rs. 65
Coefficient of variation in market A = 5.22 %
Coefficient of variation in market B = 2.85 %
The coefficient of correlation between them = 0.82
25. Estimate the loss in production in a day when the number of workers on strike is 18,000 from the following information.
Mean number of workers on strike = 800
Mean loss of daily production in '000 Rs = 35
Standard deviation of number of workers on strike = 100
Standard deviation of daily production in '000 Rs = 2
The coefficient of correlation between the number of workers on strike and daily production = 0.80
26. The equations of the two regression lines between two variables are expressed as 3X - 4Y + 30 = 0 and 5Y - 2X + 8 = 0.
i. Identify which of the two can be called the regression equation of Y on X and which of X on Y.
ii. Find the means of X and Y and the correlation coefficient.
27. In a partially destroyed record the following data are available:
Variance of variable X = 25
The two regression equations are:
5X - Y - 22 = 0 and 64X - 45Y - 24 = 0
Find the
i. Mean values of variables X and Y
ii. Coefficient of correlation between variables X and Y
iii. Standard deviation of variable Y.
28. The following table gives the ages and blood pressures of 10 women.
Age        56    42    36    47    49    42    60    72    63    55
Pressure   147   125   118   128   145   140   155   160   149   150
Estimate the blood pressure of a woman whose age is 45 years.

29. From the data given below,
Marks in Research      25   28   35   32   31   36   29   38   34
Marks in Probability   43   46   49   41   36   32   31   30   33
Find,
i. The two regression equations
ii. The coefficient of correlation between the marks in probability and research methodology.
iii. The most likely marks in probability when the marks in research are 30.
30. From the following data on the ages of husbands and wives, calculate the two regression equations and find the husband's age when the wife's age is 20.
Wife's age      18   20   22   23   27   28   30
Husband's age   23   25   27   30   32   31   35

Numerical for Partial and multiple correlations
31. Three related variates $X_1$, $X_2$ and $X_3$ take the following sets of values
X1   1   2   3   4   5
X2   2   1   5   4   3
X3   3   1   4   5   2
Calculate the partial correlation coefficient $r_{12.3}$ and the multiple correlation coefficient $R_{1.23}$.
32. Compute the multiple correlation coefficient from the following data by treating the first variable $X_1$ as the dependent variable and the remaining two variables as independent.
X1   9   12   10   7   17
X2   2   5    4    3   6
X3   4   5    5    3   8
Also calculate the partial correlation coefficient $r_{12.3}$.

Numerical for Multiple Regression
33. Find the multiple regression equation of $X_1$ on $X_2$ and $X_3$ from the data relating to three variables given below
X1   5   7   8   4   9
X2   2   3   4   3   4
X3   2   0   3   1   2
Also estimate the value of $X_1$ when $X_2$ = 2.5 and $X_3$ = 2.
34. A family income and expenditure survey gave the following data
Expenditure on food (Rs '000)   5    7    8    9    11
Annual income (Rs '000)         25   40   30   50   25
Family size                     3    2    4    5    1
i. Estimate the expenditure on food of a family with annual income Rs 50,000 and having 4 family members
ii. Compute the multiple determination coefficient


ANSWERS:
11   r = 0.957
12   r = 0.725
13   r = 0.892, High
14   r = 0.7804
15   r = -0.92
16   r = 0.704
17   r = 0.7225
18   r (corrected) = 0.77
19   Lower limit = 0.631 and upper limit = 0.769
20   PE = 0.054, lower limit = 0.546 and upper limit = 0.654
21   $\hat{Y}$ = 29 + 0.67X
22   X = 5.5 + 0.5Y and Y = -11 + 2X
23   155 cm
24   Rs. 78.46
25   $\hat{Y}$ = 0.016X + 22.2, Rs. 310200
26   i. 3X - 4Y + 30 = 0 is Y on X and 5Y - 2X + 8 = 0 is X on Y   ii. Mean of X = 2, mean of Y = 9 and r = 0.5477
27   i. 6 and 8   ii. 0.533   iii. 13.33
28   Y = 83.756 + 1.11X and blood pressure = 133.708
29   i. X = 40.88 - 0.2337Y and Y = 59.146 - 0.664X   ii. r = -0.394   iii. 39.23
30   Husband's age = 25.34 years
31   0.57 and 0.60
32   0.986 and 0.42
33   X1 = 0.429 + 1.873X2 + 0.111X3 and 5.334
34   Rs 8.54 thousand and 0.122
