On
BY
HEMANT GANDHI
2014B4A4PS763H
Any work, irrespective of its magnitude or complexity, is always a group effort and is never fully
complete unless due gratitude is bestowed upon all who contributed to its success. I would like to
take the opportunity to thank Prof. ADDEPALLI RAMU, Associate Professor, Birla Institute of
Technology and Science Pilani, Hyderabad Campus, for having given me this wonderful chance to
work under his guidance.
We are grateful to the administration of BITS Pilani, Hyderabad Campus, for providing
opportunities to the students for development of their academic skills and logical thinking through
open-ended, study-oriented activities.
CERTIFICATE
This is to certify that the project report entitled Least Square Regression submitted by Mr.
Hemant Gandhi (2014B4A4763H), in partial fulfilment of the requirements of the course MATH
F266 (Study Oriented Project), embodies the work done by him under my supervision and guidance.
Curve fitting is the process of constructing a curve, or mathematical function that has the best fit to
a series of data points, possibly subject to constraints.
It is frequently used in engineering. For example, the empirical relations used in heat
transfer and fluid mechanics are functions fitted to experimental data.
Regression: Mainly used with experimental data, which might have a significant amount of error
(noise). The fitted function does not need to pass through every data point.
Interpolation: Used when the data are known to be very precise. The fitted function (or
series of functions) passes through every data point.
Common interpolation methods: polynomial interpolation and spline interpolation.
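The difference between the two approaches can be sketched in a few lines. The example below is only an illustration: the data values are hypothetical, and NumPy's `polyfit` is used as a convenient stand-in for both a regression line and an interpolating polynomial.

```python
import numpy as np

# Hypothetical noisy measurements of a roughly linear trend (not from the report)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

# Regression: a straight line that need not pass through any data point
line = np.polyfit(x, y, deg=1)

# Interpolation: a polynomial of degree n - 1 passes through all n points
interp = np.polyfit(x, y, deg=len(x) - 1)

# Differences between the data and each fitted function at the data points
line_residuals = y - np.polyval(line, x)
interp_residuals = y - np.polyval(interp, x)
```

The regression line leaves non-zero residuals at the data points, while the interpolating polynomial reproduces every point up to rounding error.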
Least Square Regression
The method of least squares is a standard approach in regression analysis to the approximate
solution of overdetermined systems, i.e., sets of equations in which there are more equations than
unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the
residuals of the individual equations.
The most important application is in data fitting. The best fit in the least-squares sense
minimizes the sum of squared residuals, a residual being the difference between an observed value
and the fitted value provided by a model.
Least squares problems fall into two categories: linear or ordinary least squares and non-linear least
squares, depending on whether or not the residuals are linear in all unknowns. The linear least-
squares problem occurs in statistical regression analysis; it has a closed-form solution. The non-
linear problem is usually solved by iterative refinement; at each iteration the system is approximated
by a linear one, and thus the core calculation is similar in both cases. Polynomial least
squares describes the variance in a prediction of the dependent variable as a function of the
independent variable and the deviations from the fitted curve.
Linear Regression
Linear least squares regression is by far the most widely used modeling method. It is what most
people mean when they say they have used "regression", "linear regression" or "least squares" to fit
a model to their data. Not only is linear least squares regression the most widely used modeling
method, but it has been adapted to a broad range of situations that are outside its direct scope.
Mathematically, linear least squares is the problem of approximately solving an overdetermined
system of linear equations, where the best approximation is defined as that which minimizes the
sum of squared differences between the data values and their corresponding modeled values. The
approach is called linear least squares because the assumed function is linear in the parameters to be
estimated.
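As a minimal sketch of this overdetermined-system view, the example below fits a straight line to four hypothetical observations (four equations, two unknowns) with NumPy's `numpy.linalg.lstsq`. The data values and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data: 4 observations give 4 equations in 2 unknowns (a0, a1)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Design matrix: a column of ones multiplies a0, the x column multiplies a1
A = np.column_stack([np.ones_like(x), x])

# Least squares solution of the overdetermined system A @ [a0, a1] ~= y
coeffs, residual_ss, rank, _ = np.linalg.lstsq(A, y, rcond=None)
a0, a1 = coeffs
```

For these values the fitted line is approximately y = 3.5 + 1.4 x, and `residual_ss[0]` holds the minimized sum of squared differences.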
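Several strategies can be used to minimize the error (deviation) and obtain a best-fit line
$y = a_0 + a_1 x$, that is, to find a0 and a1. The preferred strategy is to minimize the sum of
the squares of the individual errors.
Minimizing the sum of squares of individual errors
Sum of squares of the residuals:
$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$
Setting $\partial S_r / \partial a_0 = 0$ and $\partial S_r / \partial a_1 = 0$ gives the normal equations:
$n a_0 + \left(\sum x_i\right) a_1 = \sum y_i$
$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 = \sum x_i y_i$
Or, solving for the coefficients,
$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \quad a_0 = \bar{y} - a_1 \bar{x}$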
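The closed-form solution for a straight line y = a0 + a1 x that minimizes the sum of squared residuals can be sketched in plain Python. The function names `fit_line` and `sum_squared_residuals` and the data values are hypothetical, introduced only for this illustration.

```python
def fit_line(xs, ys):
    """Return (a0, a1) minimizing sum((y - a0 - a1*x)**2) for a straight line."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solution of the two normal equations for slope and intercept
    a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a0 = (sy - a1 * sx) / n
    return a0, a1


def sum_squared_residuals(xs, ys, a0, a1):
    """Sum of squared differences between observed and fitted values."""
    return sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to exercise the formulas
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.0, 5.0, 7.0, 10.0]
a0, a1 = fit_line(xs, ys)
s_r = sum_squared_residuals(xs, ys, a0, a1)
```

Any other choice of a0 and a1 gives a larger sum of squares for the same data, which is what "best fit" means in this sense.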
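Some non-linear relations can be linearized by a change of variables and then fitted with linear regression:
1. Exponential Equation: $y = A e^{Bx}$, which becomes $\ln y = \ln A + B x$
2. Power Equation: $y = A x^{B}$, which becomes $\ln y = \ln A + B \ln x$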
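One common way to fit exponential and power equations with linear regression is to take logarithms first, which is sketched below. The forms y = A·exp(B·x) and y = A·x^B, the function names, and the data are assumptions for illustration. Note that this approach minimizes the squared error of ln(y), not of y itself.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope of a straight line."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - slope * sx) / n, slope


def fit_exponential(xs, ys):
    """Fit y = A*exp(B*x) by regressing ln(y) on x (requires y > 0)."""
    ln_a, b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b


def fit_power(xs, ys):
    """Fit y = A*x**B by regressing ln(y) on ln(x) (requires x > 0, y > 0)."""
    ln_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_a), b


# Hypothetical noise-free data generated from known parameters
xs = [1.0, 2.0, 3.0, 4.0]
exp_a, exp_b = fit_exponential(xs, [2.0 * math.exp(0.5 * x) for x in xs])
pow_a, pow_b = fit_power(xs, [3.0 * x ** 1.5 for x in xs])
```

With noise-free data the generating parameters are recovered up to rounding error. With noisy data the result generally differs from a direct non-linear fit.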
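We can model the expected value of y as an nth degree polynomial, yielding the general polynomial
regression model
$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n + \varepsilon$
where $\varepsilon$ is a random error term.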
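Because a polynomial model is linear in its coefficients, it can be fitted with the same linear least squares machinery. The sketch below assumes NumPy; the helper name `fit_polynomial` and the data are hypothetical.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least squares coefficients, lowest power first, of a polynomial model."""
    # Columns of the design matrix are 1, x, x**2, ..., x**degree
    X = np.vander(x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


# Hypothetical data generated from y = 1 + 2x + 3x^2 without noise
x = np.arange(6.0)
y = 1.0 + 2.0 * x + 3.0 * x ** 2
coeffs = fit_polynomial(x, y, degree=2)
```

Here `coeffs` is approximately `[1, 2, 3]`. With real data, the degree is usually chosen by comparing several candidate models.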
Direction b: We find that the quadratic polynomial regression model appears to fit the data best.
The least squares parameter estimates for this model are (5.8667, 30.2242, 2.3636)^T.
There are several possible uses of a regression model. One is to understand the relationship between
two or more variables. A more common use of regression analysis is prediction: providing
estimates of the values of the dependent variable (or variables) by using the prediction equation. Point
predictions are not perfect and are subject to error. The error is due to the uncertainty in estimation as
well as the natural variation of points about the regression line.
We can compute, for example, a 95% prediction interval for the strains in the particular directions
marked a, b, c. Figures 1(b), 2(b), 3(b) show the 95% prediction interval for the strains in these
directions, obtained with the best polynomial regression model.
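A prediction interval can be sketched for the simpler case of a straight-line model. This is not the polynomial strain model from the figures. The sketch uses the standard formula ŷ ± t·s·sqrt(1 + 1/n + (x_new - x̄)²/Sxx). The function name, the data, and the critical value are assumptions: the caller supplies the Student-t critical value, for example from a table.

```python
import math


def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lower, upper) prediction limits for a new observation at x_new.

    t_crit is the two-sided Student-t critical value with n - 2 degrees of
    freedom, supplied by the caller (for example from a t-table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    # Residual standard error with n - 2 degrees of freedom
    sse = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    y_hat = a0 + a1 * x_new
    return y_hat - half_width, y_hat + half_width


# Hypothetical data; 4.303 is the 97.5% t quantile for 2 degrees of freedom
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.0, 5.0, 7.0, 10.0]
lower, upper = prediction_interval(xs, ys, x_new=2.5, t_crit=4.303)
```

The interval is centred on the point prediction and widens as `x_new` moves away from the mean of the observed x values.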
R-squared
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
The definition of R-squared is fairly straightforward; it is the percentage of the response variable
variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
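R-squared can be computed directly from observed and fitted values, as in the sketch below. It uses the form 1 - SS_residual / SS_total, which equals the explained fraction of the variation for a least squares fit with an intercept. The function name and the data are hypothetical.

```python
def r_squared(observed, fitted):
    """Fraction of the variation around the mean explained by the fitted values."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - ss_residual / ss_total


# Hypothetical data with its least squares line y = 3.5 + 1.4 x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.0, 5.0, 7.0, 10.0]
fitted = [3.5 + 1.4 * x for x in xs]
r2 = r_squared(ys, fitted)

# The two extremes: predicting the mean everywhere, and a perfect fit
r2_mean_only = r_squared(ys, [sum(ys) / len(ys)] * len(ys))
r2_perfect = r_squared(ys, ys)
```

For these values `r2` is about 0.70, so the line explains about 70% of the variation. The mean-only model gives 0 and a perfect fit gives 1.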
Plotting fitted values by observed values graphically illustrates different R-squared values for
regression models.
The regression model on the left accounts for 38.0% of the variance while the one on the right
accounts for 87.4%. The more variance that is accounted for by the regression model the closer the
data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the
variance, the fitted values would always equal the observed values and, therefore, all the data points
would fall on the fitted regression line.
R-squared cannot determine whether the coefficient estimates and predictions are biased, which is
why you must assess the residual plots.
R-squared does not indicate whether a regression model is adequate. You can have a low R-squared
value for a good model, or a high R-squared value for a model that does not fit the data!
Are Low R-squared Values Inherently Bad?
No! There are two major reasons why it can be just fine to have low R-squared values.
In some fields, it is entirely expected that your R-squared values will be low. For example, any field
that attempts to predict human behavior, such as psychology, typically has R-squared values lower
than 50%. Humans are simply harder to predict than, say, physical processes.
Furthermore, if your R-squared value is low but you have statistically significant predictors, you can
still draw important conclusions about how changes in the predictor values are associated with
changes in the response value. Regardless of the R-squared, the significant coefficients still represent
the mean change in the response for one unit of change in the predictor while holding other predictors
in the model constant. Obviously, this type of information can be extremely valuable.
A low R-squared is most problematic when you want to produce predictions that are reasonably
precise (have a small enough prediction interval). How high should the R-squared be for prediction?
Well, that depends on your requirements for the width of a prediction interval and how much
variability is present in your data. While a high R-squared is required for precise predictions, it's not
sufficient by itself, as we shall see.
Are High R-squared Values Inherently Good?
No! A high R-squared does not necessarily indicate that the model has a good fit. That might be a
surprise, but look at the fitted line plot and residual plot below. The fitted line plot displays the
relationship between semiconductor electron mobility and the natural log of the density for real
experimental data.
The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%,
which sounds great. However, look closer to see how the regression line systematically over and
under-predicts the data (bias) at different points along the curve. You can also see patterns in the
Residuals versus Fits plot, rather than the randomness that you want to see. This indicates a bad fit,
and serves as a reminder as to why you should always check the residual plots.
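The same effect can be reproduced with a small synthetic example. The data below are hypothetical (y = x squared, not the electron mobility data). A straight line fitted to them gives a high R-squared, yet the residuals show a clear systematic pattern.

```python
import numpy as np

# Hypothetical curved data: y = x^2 on x = 0, 1, ..., 10
x = np.arange(0.0, 11.0)
y = x ** 2

# Straight-line least squares fit (polyfit returns the highest power first)
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# R-squared of the straight-line model
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
```

R-squared is above 0.9 here, but the residuals are positive at both ends and negative in the middle. The line under-predicts at the extremes and over-predicts in between, which is the kind of bias a residual plot reveals.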
Residuals
The difference between the observed value of the dependent variable ($y$) and the predicted value ($\hat{y}$) is
called the residual ($e$). Each data point has one residual: $e = y - \hat{y}$.
For a least squares fit that includes an intercept term, both the sum and the mean of the residuals
are equal to zero. That is, $\sum e = 0$ and $\bar{e} = 0$.
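The zero-sum property of the residuals can be checked numerically for a straight-line least squares fit with an intercept. The data values and variable names below are hypothetical.

```python
# Hypothetical data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.0, 5.0, 7.0, 10.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept of the line y = a0 + a1 * x
a1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a0 = y_bar - a1 * x_bar

# One residual per data point: observed minus predicted
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)
residual_mean = residual_sum / n
```

Both `residual_sum` and `residual_mean` are zero up to floating-point rounding.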
Residual Plots
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on
the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a
linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
Non-Linear Regression
While simple and multiple linear regression functions are adequate for modeling a wide variety of
relationships between response variables and predictor variables, many situations require nonlinear
functions. Nonlinear regression is a form of regression analysis in which observational data are
modeled by a function which is a nonlinear combination of the model parameters and depends on one
or more independent variables. The data are fitted by a method of successive approximations.
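One method of successive approximations is the Gauss-Newton iteration, sketched below for the model y = a·exp(b·x). The choice of Gauss-Newton, the model form, the function name, the starting values, and the data are all assumptions for illustration. The source does not name a specific algorithm. Each iteration replaces the model with its linearization and solves a linear least squares problem for the parameter update.

```python
import numpy as np


def gauss_newton_exponential(x, y, a, b, iterations=20):
    """Refine (a, b) in y = a*exp(b*x) by repeated linearized least squares."""
    for _ in range(iterations):
        model = a * np.exp(b * x)
        residual = y - model
        # Jacobian columns: d(model)/da and d(model)/db at the current estimate
        jacobian = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
        # Linear least squares problem for the parameter update
        delta, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        a += delta[0]
        b += delta[1]
    return a, b


# Hypothetical noise-free data generated with a = 2, b = 0.5
x = np.linspace(0.0, 2.0, 9)
y = 2.0 * np.exp(0.5 * x)
a_fit, b_fit = gauss_newton_exponential(x, y, a=1.5, b=0.4)
```

The iteration needs a reasonable starting guess. A poor start can cause it to diverge, which is why practical implementations usually add damping or step-size control.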
Conclusion
Regression analysis is a statistical tool for the investigation of relationships between variables.
Multiple regression analysis is a useful method for generating mathematical models where there are
several (more than two) variables involved. A polynomial regression model consists of successive
power terms. Each model includes the highest order term plus all lower order terms (significant or
not). We can view polynomial regression as a particular case of multiple linear regression. Polynomial
models are an effective and flexible curve fitting technique. The most widely used method of
regression analysis is ordinary least squares analysis. This method fits a best-fit line to
all of the available data points, with the parameter estimates chosen to minimize the error sum of
squares. Fitting a regression model requires several assumptions. Estimation of the model parameters
requires the assumption that the errors are uncorrelated random variables with mean zero and constant
variance. Tests of hypotheses and interval estimation require that the errors are normally distributed.
There are a number of advanced statistical tests that can be used to examine whether or not these
assumptions are true for any given regression equation.
Bibliography
http://users.metu.edu.tr/csert/me310/me310_5_regression.pdf
https://en.wikipedia.org/wiki/Least_squares
http://www.sciencedirect.com/science/article/pii/S1877705812046085
https://en.wikipedia.org/wiki/Linear_regression
https://en.wikipedia.org/wiki/Polynomial_regression
http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm
http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit