
Mathematical modeling using linear regression

CARMEN VIOLETA POPESCU


VASILE ALECSANDRI UNIVERSITY OF BACAU
ROMANIA
University of L'Aquila


Mathematical modeling
A mathematical model begins with a situation in the real world which we wish to understand.

Does mathematical modeling differ from what you have already learned, in particular from problem solving? Problem solving may not refer to the outside world at all. Even when it does, problem solving usually begins with an idealized real-world situation already expressed in mathematical terms, and ends with a mathematical result.

Mathematical modeling, on the other hand, begins in the unedited real world, requires problem formulating before problem solving, and, once the problem is solved, moves back into the real world, where the results are considered in their original context.

Mathematical modeling takes into account those characteristics of the original which, on the one hand, carry information and, on the other hand, admit formalization. Formalization implies the possibility of putting the characteristics of the original into correspondence with appropriate mathematical objects: numbers, functions, matrices, which can help in solving equations, approximation problems, etc.

Regression models
One such real situation is the development of prognoses and estimates, for which it is often necessary to build so-called statistical models or regression models.
Regression models describe the correlation between two variables and, in a particular case, between one variable and time.
The fitting of these models is based on large amounts of data, and the software packages designed to assist with forecast calculations prove their usefulness (Statistica, Matlab, SPSS, R, etc.).

For example, in the economic domain it is necessary to anticipate (forecast) the values of some variables, such as economic variables and specific socio-economic phenomena, which do not evolve independently but are connected with other economic variables.
This does not necessarily imply that one variable causes the other, but that there is some significant association between the two variables.
From this point of view, any regression analysis begins with a careful look at the data and, very importantly, at what the variables look like.

Example. An employer wants to know the following:
- Is it suitable to invest in advertising to sell a specific product?
- What form of advertising is best?
- What is the link between the amount invested in advertising and the sales volume for the product?
- If x monetary units (U.M.) have been invested in advertising, what is the forecast sales volume?

For such situations data are needed: statistical data whose processing and analysis by specific methods will provide the necessary information.
Moreover, it is important to find out whether, for example, there is some connection, and how strong this correlation is, between the amount invested in advertising and the sales volume for the product.
One solution for this kind of problem is the regression model.

Why regression analysis?


- To model some phenomenon in order to understand it better, so as to inform policy or to make decisions about appropriate actions to take.
- To model some phenomenon in order to predict values for that phenomenon at other places or other times.
- You can also use regression analysis to test hypotheses (assumptions).

The concept of regression was introduced by Sir Francis Galton (1822-1911) who, studying the relationship between the heights of children and parents, observed that exceptionally tall parents tend to have children shorter than themselves, i.e., closer to the average than their parents.
Galton found a correlation of r = +0.67 between parental height (X) and children's height (Y).
Galton called this tendency "regression to mediocrity", but it is now known as "regression to the mean".

Regression Analysis components

- Dependent variable (criterion variable) Y: the variable whose variation we want to explain or predict.
- Independent variables (explanatory variables) X: variables used to predict systematic changes in the dependent variable. For predicting annual purchases for a proposed store, you might include in your model explanatory variables representing the number of potential customers, distance to competition, store visibility, etc.
- Regression coefficients: they represent the strength and type of relationship the explanatory variable has to the dependent variable. When the relationship is positive, the sign of the associated coefficient is also positive, and when the relationship is a strong one, the coefficient is large.

Regression models are part of the stochastic (statistical) models, in which all the explanatory factors of a phenomenon that are not directly included in the model appear cumulatively as a random variable called the error.
The variable Y (the output parameter) that quantifies the phenomenon can be explained by regression on one or more explanatory factors X (the input parameters).
All the explanatory factors that are not relevant enough for Y enter the model as a cumulative error!

If the set of relevant explanatory factors is limited to a single factor X, the model is called the simple regression model, given by

Y = f(X) + e,   (3.1)

where e is the error of the model, a component which contains all the factors on which Y depends except X, and f is a function which describes the link between the variables, called the regression function.

Regression models are very important tools for one of the following purposes:
- Estimation: to determine the value of the regression function for a particular combination of values of the predictor variables, including values for which no data have been measured or observed.
- Prediction: to determine either the value of a new observation of the response variable, or the values of a specified proportion of all future observations of the response variable.

- Optimization: to determine the values of the process inputs that should be used to obtain the desired process output.

Scatterplot analysis
The analysis gives information about the direction, form and strength of the relationship between the variables.
Most scatterplots contain a line of best fit: a straight line drawn through the center of the data points that best represents the trend of the data.
Scatterplots provide a visual representation of the correlation, or relationship, between the two variables.

[Figure: (a) data that are ill-suited for linear least-squares regression; (b) an indication that a parabola may be more suitable.]

Regression analyses, through the correlation value, make a stronger claim: they attempt to demonstrate the degree to which one or more variables potentially promote positive or negative change in another variable.

A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables, then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association in the observed data for the two variables. A correlation of 1 in absolute value, whether +1 or -1, is a perfect correlation.

Example: investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

http://sphweb.bumc.bu.edu/otlt/MPHModules/BS/BS704_Multivariable/BS704_Multivariable5.html


The formula for the sample correlation coefficient is

r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}},

where \mathrm{Cov}(x,y) is the covariance of x and y, defined as

\mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1},

and s_x^2, s_y^2 are the sample variances of x and y, defined as

s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}, \qquad s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}.
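As an illustration, here is a minimal Python sketch of these formulas (the function name and the toy data below are my own, not the gestational-age data from the example):

import math

def sample_correlation(x, y):
    # r = Cov(x, y) / sqrt(s_x^2 * s_y^2), with the n-1 (sample) denominators
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): gestational age in weeks vs. birth weight in grams
age = [34.7, 36.0, 38.3, 40.1, 41.4]
weight = [2350, 2600, 3120, 3450, 3600]
print(sample_correlation(age, weight))  # close to +1: strong positive correlation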


Summarize the gestational age data.



Summarize the birth weight data.




- The sample correlation coefficient indicates a strong positive correlation.
- If the chart points spread all over, without following a particular rule, then we say that the two variables are not correlated.
- If, instead, the points describe a certain curve, called the empirical regression curve, then we say that there is a correlation, and it is the more intense, the narrower the band in which the points stretch.

Linear regression analysis consists of more than just fitting a straight line through a cloud of data points. It consists of:
- analyzing the correlation and directionality of the data,
- estimating the model, i.e., fitting the line,
- evaluating the validity and usefulness of the model.

Linear regression
A linear regression line has an equation of the form

Y = a + bX,

where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

For the inputs x_1, x_2, \dots, x_n we presume that the data (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) are collected, and from their scatterplot it is observed that these data approximate a line, given by the above equation.
Corresponding to the input data, the approximating (predicted) values on the approximation line will be evaluated by

\hat{Y}_i = a + b x_i, \quad i = 1, \dots, n,

where a and b have the same values for all points, because we presume that all the data are located in the vicinity of the line.

Fitting a straight line to a set of paired observations (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n):

y_i = b + a x_i + e, \qquad e = y_i - b - a x_i,

where y_i is the measured value and e is the error. Here a is the slope and b is the intercept, which represents the expected value of the dependent variable when all of the independent variables are zero.

The best strategy for minimizing the error term is to minimize the sum of the squares of the residuals between the measured y and the y calculated with the linear model. This is well known as the method of least squares:

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_{i,\mathrm{measured}} - y_{i,\mathrm{model}} \right)^2 = \sum_{i=1}^{n} (y_i - b - a x_i)^2.

The minimization is achieved by setting to zero the partial derivatives of S_r with respect to the regression coefficients:

\frac{\partial S_r}{\partial b} = -2 \sum_{i=1}^{n} (y_i - b - a x_i) = 0, \qquad
\frac{\partial S_r}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - b - a x_i) = 0.

After simple calculations this yields the system

\begin{pmatrix} \sum_i x_i^2 & \sum_i x_i \\ \sum_i x_i & n \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix} =
\begin{pmatrix} \sum_i x_i y_i \\ \sum_i y_i \end{pmatrix}. \quad (1)

Solving (1) gives the slope

a = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2},

and, using the second relation of (1), b can be expressed as

b = \bar{y} - a \bar{x}, \quad \text{where } \bar{y} = \frac{1}{n} \sum_i y_i, \; \bar{x} = \frac{1}{n} \sum_i x_i.
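A minimal Python sketch of these closed-form formulas (the function and variable names are my own):

def fit_line(x, y):
    # Least-squares slope a and intercept b:
    #   a = (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    #   b = y_bar - a * x_bar
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = sy / n - a * sx / n
    return a, b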

The example data in Table 1 are plotted in Figure 1. You can see that
there is a positive relationship between X and Y. If you were going to
predict Y from X, the higher the value of X, the higher your prediction
of Y.

Example (http://onlinestatbook.com/2/regression/intro.html)
[Table 1: the example data. Figure 1: scatterplot of the data.]

The black diagonal line in Figure 2 is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. The red point is very near the regression line, so its error of prediction is small.
By contrast, the yellow point is much higher than the regression line, and therefore its error of prediction is large.

[Figure 2: scatterplot of the data with the regression line and the errors of prediction marked.]

The formula for a regression line is

Y' = aX + b,

where Y' is the predicted score, a is the slope of the line, and b is the Y intercept.
The equation for the line in Figure 2 is

Y' = 0.425X + 0.785.

These coefficients follow from the least-squares formulas:

a = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2} = 0.4250, \qquad
b = \bar{y} - a \bar{x} = 0.785,

where \bar{y} and \bar{x} are the means of the y and x values.

The error of prediction for a point is the value of the point minus the predicted value (the value on the line).
Table 2 shows the predicted values (Y') and the errors of prediction (Y - Y'). For example, the first point has a Y of 1.00 and a predicted Y (called Y') of 1.21; therefore, its error of prediction is -0.21.
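To make this concrete, the following sketch reproduces Table 2, assuming the five (X, Y) pairs from the cited onlinestatbook example and reusing fit_line from the sketch above:

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.00, 2.00, 1.30, 3.75, 2.25]

a, b = fit_line(x, y)              # gives a = 0.425, b = 0.785 for these data
for xi, yi in zip(x, y):
    y_pred = a * xi + b            # predicted value Y'
    err = yi - y_pred              # error of prediction Y - Y'
    print(xi, yi, round(y_pred, 3), round(err, 3), round(err ** 2, 4))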


[Table 2: X, Y, predicted values Y', errors of prediction Y - Y', and squared errors (Y - Y')^2.]

The last column in Table 2 shows the squared errors of prediction. The sum of the squared errors of prediction shown in Table 2 is lower than it would be for any other regression line.

- Interpolation: the process by which we use the regression line to predict a value of the y variable for a value of the x variable that is not one of the data points, but is within the range of the data set.
- Extrapolation: the process by which we use the regression line to predict a value of the y variable for a value of the x variable that is outside of the range of the data set.
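As a quick illustration of the distinction, reusing the line fitted above (a sketch; the range [1, 5] simply reflects the X values in the example data):

def predict(x, a=0.425, b=0.785):
    return a * x + b

print(predict(3.5))   # interpolation: 3.5 lies inside the observed range [1, 5]
print(predict(8.0))   # extrapolation: 8.0 lies outside the range, so less reliable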


How to choose a regression model?


[Figure: the spread of the data (a) around the mean, (b) around the best-fit line; notice the improvement in the error due to linear regression.]

- S_r = sum of the squares of the residuals around the regression line:
  S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2
- S_t = total sum of the squares around the mean:
  S_t = \sum_i (y_i - \bar{y})^2
- (S_t - S_r) quantifies the improvement, or error reduction, due to describing the data in terms of a straight line rather than as an average value.
- r: the correlation coefficient, with

  r^2 = \frac{S_t - S_r}{S_t}.

For a perfect fit, S_r = 0 and r = r^2 = 1, signifying that the line explains 100 percent of the variability of the data.
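A small Python sketch of this computation (names are my own; a_0 is the intercept and a_1 the slope, matching the notation above):

def r_squared(x, y, a0, a1):
    # St: total sum of squares around the mean; Sr: residuals around the line
    y_bar = sum(y) / len(y)
    st = sum((yi - y_bar) ** 2 for yi in y)
    sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
    return (st - sr) / st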


R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination; it gives the percentage of the response-variable variation that is explained by a linear model.
The higher the R-squared, the better the model fits your data.

Multiple regression
For example, if Y is annual income ($1000/year), X1 is educational level (number of years of schooling), X2 is number of years of work experience, and X3 is gender (X3 = 0 for male, X3 = 1 for female), then the population mean function may be

E(Y \mid X_1, X_2, X_3) = 15 + 0.8 X_1 + 0.5 X_2 - 3 X_3.


Based on this mean function, we can determine the expected


income for any person as long as we know his or her educational
level, work experience, and gender. For example, according to this
mean function, a female with 12 years of schooling and 10 years of
work experience would expect to earn $26,600 annually. A male
with 16 years of schooling and 5 years of work experience would
expect to earn $30,300 annually.
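A quick check in Python (a sketch; note that the coefficients of the mean function above were reconstructed to reproduce exactly these two worked values):

def expected_income(schooling, experience, female):
    # E(Y) = 15 + 0.8*X1 + 0.5*X2 - 3*X3, in $1000/year
    return 15.0 + 0.8 * schooling + 0.5 * experience - 3.0 * female

print(expected_income(12, 10, 1))  # 26.6 -> $26,600
print(expected_income(16, 5, 0))   # 30.3 -> $30,300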


A useful extension of linear regression is the case where y is a linear function of two or more independent variables, for example

y = a_0 + a_1 x_1 + a_2 x_2.

For this two-dimensional case, the regression line becomes a plane, as shown in the figure below.

Multiple linear regression analysis helps us understand how much the dependent variable will change when we change the independent variables.

[Figure: fitted regression plane for two predictors.]

http://gerardnico.com/wiki/data_mining/multiple_regression

Example (2 variables). Minimize the error

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i})^2.

Setting the partial derivatives to zero,

\frac{\partial S_r}{\partial a_0} = -2 \sum_i (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i}) = 0,

\frac{\partial S_r}{\partial a_1} = -2 \sum_i x_{1i} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i}) = 0,

\frac{\partial S_r}{\partial a_2} = -2 \sum_i x_{2i} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i}) = 0,

gives the normal equations

n a_0 + \left( \sum_i x_{1i} \right) a_1 + \left( \sum_i x_{2i} \right) a_2 = \sum_i y_i,

\left( \sum_i x_{1i} \right) a_0 + \left( \sum_i x_{1i}^2 \right) a_1 + \left( \sum_i x_{1i} x_{2i} \right) a_2 = \sum_i x_{1i} y_i,

\left( \sum_i x_{2i} \right) a_0 + \left( \sum_i x_{1i} x_{2i} \right) a_1 + \left( \sum_i x_{2i}^2 \right) a_2 = \sum_i x_{2i} y_i,

or, in matrix form,

\begin{pmatrix} n & \sum_i x_{1i} & \sum_i x_{2i} \\ \sum_i x_{1i} & \sum_i x_{1i}^2 & \sum_i x_{1i} x_{2i} \\ \sum_i x_{2i} & \sum_i x_{1i} x_{2i} & \sum_i x_{2i}^2 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix} =
\begin{pmatrix} \sum_i y_i \\ \sum_i x_{1i} y_i \\ \sum_i x_{2i} y_i \end{pmatrix}.

Example of multiple regression


The following data were generated from the equation y = 5 + 4x_1 - 3x_2; use multiple linear regression to fit these data:

x_1: 0, 2, 2.5, 1, 4, 7
x_2: 0, 1, 2, 3, 6, 2
y:   5, 10, 9, 0, 3, 27

With n = 6, the normal equations become

\begin{pmatrix} 6 & 16.5 & 14 \\ 16.5 & 76.25 & 48 \\ 14 & 48 & 54 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix} =
\begin{pmatrix} 54 \\ 243.5 \\ 100 \end{pmatrix}.

This system can be solved using Gauss elimination. The result is a_0 = 5, a_1 = 4 and a_2 = -3, that is,

y = 5 + 4x_1 - 3x_2.
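The same system can also be solved directly; a minimal sketch using NumPy's linear solver in place of hand-done Gauss elimination:

import numpy as np

# Normal-equation system assembled above
A = np.array([[ 6.0, 16.5,  14.0],
              [16.5, 76.25, 48.0],
              [14.0, 48.0,  54.0]])
rhs = np.array([54.0, 243.5, 100.0])

a0, a1, a2 = np.linalg.solve(A, rhs)
print(a0, a1, a2)   # 5.0, 4.0, -3.0  ->  y = 5 + 4*x1 - 3*x2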

How to choose the best fitting model


In order to choose the best model, some statistics are involved:
- Adjusted R-squared and predicted R-squared: prefer higher adjusted and predicted R-squared values. The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it can also decrease with poor-quality predictors.
- P-values for the predictors: low p-values indicate terms that are statistically significant; remove the term with the highest p-value, one at a time, until you are left with only significant predictors.

- P-values: most regression methods perform a statistical test to compute a probability, called a p-value, for the coefficient associated with each independent variable. Small p-values reflect small probabilities and suggest that the coefficient is indeed important to your model, with a value that is significantly different from zero. A coefficient with a p-value of 0.01, for example, is statistically significant at the 99% confidence level; the associated variable is an effective predictor.
- Stepwise regression and best subsets regression:

Stepwise regression selects a model by automatically adding or removing individual predictors, one step at a time, based on their statistical significance. The end result of this process is a single regression model, which makes it nice and simple.

Best subsets regression compares all possible models built from a specified set of predictors and displays the best-fitting models that contain one predictor, two predictors, and so on. The end result is a number of models and their summary statistics; it is up to you to compare them and choose one.
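A minimal sketch of the best-subsets idea (names are my own; for brevity it ranks subsets by plain R-squared, where a real tool would use adjusted R-squared, Mallows' Cp, or similar):

from itertools import combinations
import numpy as np

def best_subsets(X, y, max_size=None):
    # Fit ordinary least squares on every subset of predictor columns
    # and return the subsets sorted by R^2, best first.
    n, p = X.shape
    ss_tot = float(((y - y.mean()) ** 2).sum())
    results = []
    for k in range(1, (max_size or p) + 1):
        for cols in combinations(range(p), k):
            A = np.column_stack([np.ones(n), X[:, list(cols)]])  # intercept + subset
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            results.append((cols, 1.0 - float(resid @ resid) / ss_tot))
    return sorted(results, key=lambda item: -item[1])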

