
Regression Analysis

Least-Squares Linear Regression


Least-squares linear regression enables a linear or exponential function to be fitted to data. The goal in regression analysis is to develop a statistical model that can be used to predict the values of a dependent (response) variable from the values of the independent variable(s).

Linear fits are the most common. For exponential functions, the data must be transformed first.


Method of Least Squares

If we have N pairs of data (xi, yi), we seek to fit a straight line through the data of the form

$$y = a_0 + a_1 x$$

Determine the constants a0 and a1 such that the sum of the squared vertical distances between the actual y data and the fitted (predicted) line is minimized:

$$a_0 = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$$

$$a_1 = \frac{\sum x_i \sum y_i - N \sum x_i y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$$

Each xi is assumed to be error free. All the error is assumed to be in the y values.


Manual Calculation Method


Raw data and the sums needed for the fit (N = 5):

         yi      xi     xi*yi     xi^2
        1.2     1.0      1.20     1.00
        2.0     1.6      3.20     2.56
        2.4     3.4      8.16    11.56
        3.5     4.0     14.00    16.00
        3.5     5.2     18.20    27.04
Sum    12.6    15.2     44.76    58.16

Seeking an equation of the form y = a0 + a1 x:

$$a_0 = \frac{(15.2)(44.76) - (58.16)(12.6)}{(15.2)^2 - (5)(58.16)} = 0.878$$

$$a_1 = \frac{(15.2)(12.6) - (5)(44.76)}{(15.2)^2 - (5)(58.16)} = 0.540$$

y = 0.878 + 0.540x
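The manual calculation can be checked with a few lines of code. The following is a minimal sketch, not part of the original slides: the function name `fit_line` is an illustrative choice, and it simply evaluates the two summation formulas for the five data pairs in the table. It should reproduce the coefficients above to within rounding.

```python
# Least-squares line y = a0 + a1*x from the closed-form summation formulas.
# Data are the five (x, y) pairs from the manual-calculation table.
x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]


def fit_line(x, y):
    """Return (a0, a1) for y = a0 + a1*x (hypothetical helper name)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    denom = sum_x ** 2 - n * sum_xx          # shared denominator
    a0 = (sum_x * sum_xy - sum_xx * sum_y) / denom
    a1 = (sum_x * sum_y - n * sum_xy) / denom
    return a0, a1


a0, a1 = fit_line(x, y)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
```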



How good is the fit?


The coefficient of determination (R²) measures the goodness of fit: the proportion of the variation in the y values that is associated with the variation in the x variable in the regression. It is the ratio of the explained variation to the total variation.

R² = 1: perfect fit (good prediction).
R² = 0: no correlation between x and y.
For engineering data, R² will normally be quite high (0.80 to 0.90 or higher).
A low value might indicate that some important variable was not considered but is affecting the results.

$$R^2 = 1 - \frac{\sum (a_0 + a_1 x_i - y_i)^2}{\sum (y_i - \bar{y})^2}$$

= Excel function RSQ(yi's, xi's)

where $\bar{y}$ = average of the yi's.
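As an illustration of the R² formula, here is a small sketch applied to the five data pairs from the manual-calculation slide. The helper names `fit_line` and `r_squared` are assumptions made for this example; the slides themselves only refer to the Excel function RSQ.

```python
# R^2 = 1 - SS_residual / SS_total for a straight-line fit.
x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]


def fit_line(x, y):
    """Least-squares intercept a0 and slope a1 (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    a1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - a1 * x_bar, a1


def r_squared(x, y, a0, a1):
    """Coefficient of determination for the line y = a0 + a1*x."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a0 + a1 * xi - yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


a0, a1 = fit_line(x, y)
r2 = r_squared(x, y, a0, a1)
print(f"R^2 = {r2:.3f}")

# Data lying exactly on a line give a perfect fit.
r2_perfect = r_squared([0, 1, 2, 3], [1, 3, 5, 7], 1.0, 2.0)
```

Data that lie exactly on the line give R² = 1, matching the "perfect fit" case above.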



Standard Error of Estimate SEE

The standard error of estimate (SEE or Syx) is a statistical measure of how well the best-fit line represents the data. This is, effectively, the standard deviation of the differences between the data points and the best-fit line.

It provides an estimate of the scatter (random error) in the data about the fitted line.
This is analogous to the standard deviation for sample data.
It has the same units as y.
Two degrees of freedom are lost in calculating the coefficients a0 and a1.

$$s_{ey} = \mathrm{SEE} = S_{yx} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N - 2}}$$

= Excel function STEYX(yi's, xi's)

where $y_i$ = actual value of y for a given $x_i$, and $\hat{y}_i$ = predicted value of y for a given $x_i$.
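The definition translates directly into code. This is a sketch under the assumption that the fit is a straight line with two fitted coefficients (hence N - 2 degrees of freedom); the function name `standard_error_of_estimate` is illustrative, and the data are again the five pairs from the manual-calculation slide.

```python
import math

x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]


def standard_error_of_estimate(x, y):
    """SEE for a least-squares line through (x, y); same units as y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    # Sum of squared differences between actual and predicted y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ss_res / (n - 2))   # two degrees of freedom lost


see = standard_error_of_estimate(x, y)
print(f"SEE = {see:.3f}")
```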



Linear Regression Assumptions


Variation in the data is assumed to be normally distributed and due to random causes.
Random variation is assumed to exist in the y values, while the x values are error free.
Since the error has been minimized in the y direction, an erroneous conclusion may be drawn if x is estimated from a value of y.
For power-law or exponential relationships, the data need to be transformed before carrying out linear regression analysis. (As we will discuss later, the method of least squares can also be applied to nonlinear functional relationships.)


Linear Regression Example

Use Excel Chart >> Add Trendline to obtain the coefficients, and the functions RSQ() and STEYX() to determine R² and SEE.

[Figure: Output (Volts) versus Length (cm), both axes from 0.00 to 3.00, with linear trendline y = 0.9977x + 0.0295, R² = 0.9993]


Regression Analysis using Excel Analysis Tools

Linear regression is a standard feature of statistical programs and most spreadsheet programs. It is only necessary to input the x and y data. The remaining calculations are performed immediately.

Excel Regression Analysis macro


Performs linear regression only.
Non-linear relationships must be transformed.
Calculates the slope, intercept, SEE, and the upper and lower confidence intervals for the slope and intercept.
Does not produce any graphical output on the user's plot.
Does not update automatically.
The user must interpret the results.

Linear Regression in Excel 2008


Y = m1*X + b

Torque, N-m (Y)   RPM (X)   Y Predicted    Residual        Residual/SEE    Note
4.89              100       4.998433207     0.108433207     0.17558474
4.77              201       4.559896053    -0.210103947    -0.340219088
3.79              298       4.138726707     0.348726707     0.564689451
3.76              402       3.687163697    -0.072836303    -0.117943051
2.84              500       3.261652399     0.421652399     0.682777249
4.12              601       2.823115245    -1.296884755    -2.100031702    Outlier
2.05              699       2.397603947     0.347603947     0.562871377
1.61              799       1.963408745     0.353408745     0.572271025

Residual/SEE = Residual/sey.

LINEST output:

m1    -0.004341952     b      5.432628409
se1    0.000954031     seb    0.481645161
r^2    0.775391233     sey    0.617554846
F      20.71311576     df     6

=LINEST(A2:A9,B2:B9,TRUE,TRUE)
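For readers without Excel, the LINEST statistics shown here can be reproduced from their textbook definitions. The sketch below is an assumption-laden illustration rather than Excel's actual implementation: `linest_stats` is a hypothetical helper, and the formulas used for se1 and seb are the standard simple-linear-regression expressions.

```python
import math

# Torque (N-m) versus RPM data from the slide's table.
rpm = [100, 201, 298, 402, 500, 601, 699, 799]
torque = [4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61]


def linest_stats(x, y):
    """LINEST-style statistics for the fit y = m1*x + b (illustrative)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m1 = sxy / sxx
    b = y_bar - m1 * x_bar
    df = n - 2                                   # two fitted coefficients
    ss_resid = sum((yi - (m1 * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_reg = syy - ss_resid
    sey = math.sqrt(ss_resid / df)               # standard error of estimate
    return {
        "m1": m1,
        "b": b,
        "se1": sey / math.sqrt(sxx),             # standard error of slope
        "seb": sey * math.sqrt(1.0 / n + x_bar ** 2 / sxx),  # of intercept
        "r2": ss_reg / syy,
        "sey": sey,
        "F": ss_reg / (ss_resid / df),
        "df": df,
    }


stats = linest_stats(rpm, torque)
for name, value in stats.items():
    print(f"{name:>4} = {value:.6g}")
```

The printed values can be compared entry by entry with the m1, b, se1, seb, r^2, sey, F, and df cells of the LINEST output.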


Linear Regression Example: Omit Outlier


Torque, N-m (Y)   RPM (X)   Y Predicted    Residual        Residual/SEE
4.89              100       5.000219168     0.110219168     0.504559919
4.77              201       4.504157858    -0.265842142    -1.21696881
3.79              298       4.02774254      0.23774254      1.088334807
3.76              402       3.516946736    -0.243053264    -1.112646171
2.84              500       3.03561992      0.19561992      0.895506407
2.05              699       2.058231795     0.008231795     0.037683406
1.61              799       1.567081983    -0.042918017    -0.196469559

Residual/SEE = Residual/sey.

LINEST output:

m1      -0.004911498     b         5.49136898
se1      0.000348477     seb       0.170606738
r^2      0.975447633     sey       0.218446143
F        198.6463557     df        5
SSreg    9.479149271     SSresid   0.238593586


Uncertainties on Regression
Quantity                                      Excel expression                 Value
SEE = sey                                                                      0.218446143
t-value                                       TINV(0.05, 5)                    2.570581835
95% confidence interval for regression line   TINV(0.05, 5)*SEE/SQRT(7)        0.212239784
95% prediction band for regression line       TINV(0.05, 5)*SEE                0.561533687
Uncertainty in slope                          TINV(0.05, 5)*se1                0.000895789
Uncertainty in intercept                      TINV(0.05, 5)*seb                0.438558582

In TINV(α, ν), α = 0.05 and ν = 5.
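The four uncertainty values are simple products of the t-value and the LINEST standard errors. The sketch below repeats that arithmetic in code. The t-value is copied from the slide (TINV(0.05, 5)) rather than computed, because the Python standard library has no inverse t-distribution; the variable names are illustrative.

```python
import math

# Values taken from the outlier-omitted LINEST output on the slides.
sey = 0.218446143        # standard error of estimate (SEE)
se1 = 0.000348477        # standard error of the slope
seb = 0.170606738        # standard error of the intercept
n = 7                    # number of data points
t_value = 2.570581835    # TINV(0.05, 5), copied from the slide

ci_line = t_value * sey / math.sqrt(n)   # approximate 95% CI of the fit
pb_line = t_value * sey                  # approximate 95% prediction band
d_slope = t_value * se1                  # 95% uncertainty in the slope
d_intercept = t_value * seb              # 95% uncertainty in the intercept

print(f"CI   = +/-{ci_line:.6f}")
print(f"PB   = +/-{pb_line:.6f}")
print(f"m1   = +/-{d_slope:.9f}")
print(f"b    = +/-{d_intercept:.6f}")
```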


Regression Line Confidence Intervals & Prediction Band

Not only do you want to obtain a curve-fit relationship, but you also want to establish a confidence interval for the equation, that is, a measure of the random uncertainty in the curve fit.

ν = N - 2 in the determination of the t-value. Two degrees of freedom are lost because m1 and b are determined.

$$\mathrm{CI} = \Delta y \approx t_{\alpha,\nu}\,\frac{S_{yx}}{\sqrt{N}} = t_{\alpha,\nu}\,\frac{S_{ey}}{\sqrt{N}} = t_{\alpha,\nu}\,\frac{\mathrm{SEE}}{\sqrt{N}}$$

$$\mathrm{PB} \approx t_{\alpha,\nu}\,\mathrm{SEE} = t_{\alpha,\nu}\,S_{yx} = t_{\alpha,\nu}\,S_{ey}$$

where $t_{\alpha,\nu}$ = TINV(α, ν) (two-sided t-table) and α = 1 - P.

[Figure: Torque (N-m) versus RPM showing the data, the least-squares fit, the -95% and +95% confidence interval, and the -95% and +95% prediction band]


Regression Line Confidence Interval & Prediction Band

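$$\text{CI in curve fit} = t_{\alpha/2,\,n-2}\; s_{ey}\,\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \;\approx\; t_{\alpha/2,\,n-2}\;\frac{s_{ey}}{\sqrt{n}}$$

$$\Delta y\ \text{prediction band} = t_{\alpha/2,\,n-2}\; s_{ey}\,\sqrt{\frac{n+1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \;\approx\; t_{\alpha/2,\,n-2}\; s_{ey}$$

In each line, the first expression is the more accurate one: the interval is at its minimum at the mean of x and flares out at the low and high extremes. The second expression is the approximation.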
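To see how the more accurate expressions behave, the following sketch evaluates both half-widths at a chosen x* for the outlier-omitted torque data. It is an illustration only: `half_widths` is a hypothetical helper, the formulas are assumed to carry a square root over the bracketed terms (the standard form), and sey and the t-value are copied from the slides.

```python
import math

# RPM values of the outlier-omitted data set, with sey and TINV(0.05, 5)
# copied from the slides.
rpm = [100, 201, 298, 402, 500, 699, 799]
sey = 0.218446143
t_value = 2.570581835


def half_widths(x, x_star, sey, t_value):
    """Return (CI, prediction band) half-widths at x_star."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    spread = (x_star - x_bar) ** 2 / sxx      # grows away from the mean
    ci = t_value * sey * math.sqrt(1.0 / n + spread)
    pb = t_value * sey * math.sqrt((n + 1.0) / n + spread)
    return ci, pb


x_mean = sum(rpm) / len(rpm)
for x_star in (100, x_mean, 799):
    ci, pb = half_widths(rpm, x_star, sey, t_value)
    print(f"x* = {x_star:7.1f}: CI = +/-{ci:.4f}, PB = +/-{pb:.4f}")

ci_mean, pb_mean = half_widths(rpm, x_mean, sey, t_value)
ci_low, _ = half_widths(rpm, 100, sey, t_value)
```

At x* equal to the mean of x, the confidence interval reduces to the approximate value t*sey/sqrt(n); toward either end of the data range it widens.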


Summations Used in Statistics & Regression


Variable and expression:

Sample standard deviation

$$S_x = \left[\frac{1}{N-1}\sum (x_i - \bar{x})^2\right]^{1/2}$$

Expressions used in regression analysis:

Sum of squares for evaluating CI & PI

$$S_{xx} = \sum (x_i - \bar{x})^2$$

Standard error of estimate

$$s_{ey} = \mathrm{SEE} = S_{yx} = \left[\frac{\sum \left(y_i - y_{\text{predicted at } x = x_i}\right)^2}{N - 2}\right]^{1/2}$$

CI in slope and intercept:

Slope, m

$$\text{CI in slope} = t_{\alpha/2,\nu}\; se_1$$

Intercept, b

$$\text{CI in intercept} = t_{\alpha/2,\nu}\; se_b$$

Note 1: ν = n - 2.
Note 2: m and b are not independent variables. Therefore, do not apply RSS (root-sum-square) to y = mx + b to determine Δy. Instead, use the CI for the curve fit.

Outliers in x-y Data Sets

The method involves computing the ratio of the residuals (predicted - actual) to the standard error of estimate (sey = SEE).

1. Residuals = y_predicted - y_actual at each xi.
2. Plot the ratio residual/SEE for each xi. These are the standardized residuals.
3. Standardized residuals exceeding ±2 may be considered outliers. Assuming the residuals are normally distributed, you can expect 95% of the residuals to lie in the range ±2 (that is, within 2 standard deviations of the best-fit line).
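The three steps can be sketched in code using the eight-point torque data from the Excel 2008 example. This is an illustrative sketch, not the slides' own procedure: `standardized_residuals` is a hypothetical helper and the threshold of 2 follows the rule of thumb stated here.

```python
import math

# Torque (N-m) versus RPM data, including the suspect point.
rpm = [100, 201, 298, 402, 500, 601, 699, 799]
torque = [4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61]


def standardized_residuals(x, y):
    """Residual/SEE for a least-squares line, residual = predicted - actual."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    m1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b = y_bar - m1 * x_bar
    residuals = [(m1 * xi + b) - yi for xi, yi in zip(x, y)]   # step 1
    see = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / see for r in residuals]                        # step 2


z = standardized_residuals(rpm, torque)
# Step 3: flag points whose standardized residual exceeds 2 in magnitude.
outliers = [xi for xi, zi in zip(rpm, z) if abs(zi) > 2.0]
print("Suspected outliers at RPM:", outliers)
```

For this data set only the point at 601 RPM is flagged, which is the point marked as an outlier in the example.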


Linear Regression with Data Transformation


Data Transformation

Commonly, test data do not show an approximately linear relationship between the dependent (Y) and independent (X) variables, and a direct linear regression is not useful.
The form of the relationship expected between the dependent and independent variables is often known. The data need to be transformed before performing a linear regression. Transformations can often be accomplished by taking the common or natural logarithm of one or both sides of the equation.


Common Transformations
Relationship    Plot method                       Linearized form                  Transformed intercept, b   Transformed slope, m1
y = αx^β        Log y vs. Log x (log-log plot)    Log(y) = Log(α) + β Log(x)       Log(α)                     β
y = αx^β        Ln y vs. Ln x (log-log plot)      Ln(y) = Ln(α) + β Ln(x)          Ln(α)                      β
y = αe^(βx)     Log y vs. x (semi-log plot)       Log(y) = Log(α) + β Log(e) x     Log(α)                     β Log(e)
y = αe^(βx)     Ln y vs. x (semi-log plot)        Ln(y) = Ln(α) + βx               Ln(α)                      β
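As a brief illustration of the semi-log case, the sketch below fits an exponential by regressing Ln y on x and then undoing the transformation. Everything here is hypothetical: the data are synthetic and noise-free, generated from coefficient values chosen for the example, and the names `alpha`, `beta`, and `fit_line` are labels used only in this sketch.

```python
import math


def fit_line(x, y):
    """Least-squares (intercept, slope) for y = intercept + slope*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Synthetic data from y = 2.0 * exp(0.5 * x) (hypothetical values).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Transform: Ln(y) is linear in x with intercept Ln(alpha) and slope beta.
intercept, slope = fit_line(x, [math.log(yi) for yi in y])

alpha = math.exp(intercept)   # undo the logarithm on the intercept
beta = slope                  # the slope is the exponent coefficient
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

With a base-10 logarithm instead, the fitted slope would equal the exponent coefficient times Log(e), so it would have to be divided by Log(e) to recover the coefficient.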


Regression with Transformation

Example
A velocity probe provides a voltage output, E, that is related to velocity, U, by the form E = a + bU^m, where a, b, and m are constants.

U (ft/s)    Ei (V)
0           3.19
10          3.99
20          4.30
30          4.48
40          4.65

[Figure: Output voltage (VDC) versus velocity (ft/s), plotted on linear axes and on log-log axes]


Data Relationship Transformation


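E = a + bU^m (E = a = 3.19 at U = 0)
Log(E - 3.19) = Log(bU^m)
Log(E - 3.19) = Log(b) + Log(U^m) = Log(b) + m Log(U)
Y = B + m1 X, with Y = Log(E - 3.19), B = Log(b), m1 = m, and X = Log(U)

Let's transform the data:

U (ft/s)    Ei (V)    X = Log(U)    Y = Log(Ei - 3.19)
0           3.19      -             -
10          3.99      1.00          -0.097
20          4.30      1.30           0.045
30          4.48      1.48           0.111
40          4.65      1.60           0.164

Perform the regression on the transformed data.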
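The transformation and the regression on the transformed data can be sketched as follows. This is an illustrative sketch, assuming the offset 3.19 V (the reading at U = 0) is subtracted before taking base-10 logarithms, as in the table; the U = 0 point is skipped because its logarithm is undefined. The helper name `fit_line` and the variable names are choices made for this example.

```python
import math

# Calibration data from the slide: velocity U (ft/s) and output E (V).
velocity = [0, 10, 20, 30, 40]
voltage = [3.19, 3.99, 4.30, 4.48, 4.65]
offset = 3.19   # E at U = 0


def fit_line(x, y):
    """Least-squares (intercept, slope) for y = intercept + slope*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Transform, skipping U = 0 where Log(U) and Log(E - 3.19) are undefined.
pairs = [(u, e) for u, e in zip(velocity, voltage) if u > 0]
X = [math.log10(u) for u, _ in pairs]
Y = [math.log10(e - offset) for _, e in pairs]

B, m1 = fit_line(X, Y)
coefficient = 10 ** B   # undo the logarithm on the intercept
print(f"Y = {B:.3f} + {m1:.3f} X")
print(f"E = {offset} + {coefficient:.3f} U^{m1:.3f}")
```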



Solution (Excel 2004 Output)


SUMMARY OUTPUT

Regression Statistics
Multiple R            0.998723855
R Square              0.997449339
Adjusted R Square     0.996174009
Standard Error        0.01
Observations          4

ANOVA
              df    SS             MS          F           Significance F
Regression    1     0.038118269    0.038118    782.1106    0.00127614
Residual      2     9.74754E-05    4.87E-05
Total         3     0.038215745

Annotations: t-value 3.18; TINV(0.05, 2) = 4.3026; t*SEE = 0.02; SEE = 0.0070

                Coefficients    Standard Error    t Stat      P-value     Lower 95%      Upper 95%
Intercept       -0.525          0.021056315       -24.9274    0.001605    -0.61547736    -0.4342812
X Variable 1     0.432          0.015438034        27.96624   0.001276     0.36531922     0.49816831

Y = -0.525 + 0.432X

Regression with Transformation & Uncertainty


U (ft/s)    Y predicted    Y+         Y-         E       E+      E-
0           -              -          -          3.19    3.19    3.19
10          -0.0931        -0.0781    -0.1082    4.00    4.03    3.97
20           0.0368         0.0519     0.0218    4.28    4.32    4.24
30           0.1129         0.1279     0.0978    4.49    4.53    4.44
40           0.1668         0.1818     0.1518    4.66    4.71    4.61

Transform it back again:

B = Log b
-0.525 = Log b
b = 0.298

E = 3.19 + 0.298U^0.432

[Figure: Example 4.10, E (V) versus U (ft/s), 0 to 50 ft/s, showing the data and the fitted curve]
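The back-transformation can be sketched in code. This sketch rests on an assumption: the band applied to Y is taken to be the approximate confidence interval t*SEE/sqrt(N), which appears to reproduce the Y+ and Y- columns to within rounding. The fitted coefficients, t-value, and SEE are copied from the slides; `back_transform` is a hypothetical helper.

```python
import math

B, m1 = -0.525, 0.432       # fitted intercept and slope from the slide
offset = 3.19               # E at U = 0, from the slide
t_value = 4.3026            # TINV(0.05, 2), from the slide
see = 0.0070                # SEE of the transformed fit, from the slide
n = 4                       # number of transformed data points

# Assumed band on Y: approximate confidence interval t*SEE/sqrt(N).
delta = t_value * see / math.sqrt(n)


def back_transform(u):
    """Return (E, E+, E-) in volts for a velocity u in ft/s."""
    y = B + m1 * math.log10(u)
    return (offset + 10 ** y,
            offset + 10 ** (y + delta),
            offset + 10 ** (y - delta))


for u in (10, 20, 30, 40):
    e, e_plus, e_minus = back_transform(u)
    print(f"U = {u:2d}: E = {e:.2f}, E+ = {e_plus:.2f}, E- = {e_minus:.2f}")

e10, e10_plus, e10_minus = back_transform(10)
e40, e40_plus, e40_minus = back_transform(40)
```

Because the band is symmetric in Y but the transformation is exponential, the resulting limits on E are not quite symmetric about the fitted curve.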


Multiple and Polynomial Regression

Regression analysis can also be performed where there is more than one independent variable (multiple regression) or for polynomials of an independent variable (polynomial regression).

Polynomial regression seeks a function of the form

$$Y = b + m_1 x + m_2 x^2 + \dots + m_k x^k$$

Multiple regression seeks a function of the form

$$Y = b + m_1 x_1 + m_2 x_2 + m_3 x_3 + \dots + m_k x_k$$

where the x's may represent several independent variables. For example, x1 and x2 may be two measured variables and x3 their product, x3 = x1 · x2.
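Both forms are linear in the coefficients, so the same least-squares machinery applies. The sketch below solves the normal equations with a small Gaussian elimination; it is an illustration only, the helper names `solve` and `fit_linear_model` are hypothetical, and the data are synthetic, generated without noise from a known quadratic so the result can be checked. For ill-conditioned problems a library routine would be a safer choice than the normal equations.

```python
def solve(a, rhs):
    """Solve a*c = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - s) / m[r][r]
    return coeffs


def fit_linear_model(columns, y):
    """Least-squares [b, m1, ..., mk] for Y = b + m1*x1 + ... + mk*xk."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    ata = [[sum(design[i][r] * design[i][c] for i in range(n))
            for c in range(p)] for r in range(p)]
    aty = [sum(design[i][r] * y[i] for i in range(n)) for r in range(p)]
    return solve(ata, aty)


# Polynomial regression: the "variables" are x and x^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]      # synthetic data
b, m1, m2 = fit_linear_model([x, [xi ** 2 for xi in x]], y)
print(f"Y = {b:.3f} + {m1:.3f} x + {m2:.3f} x^2")
```

Passing columns such as x1, x2, and their product instead of powers of a single x gives the multiple-regression case.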

Linear Regression in Excel 2004

Input the result values

Input the independent variable

Input desired confidence level


Excel 2004 Linear Regression Output


SUMMARY OUTPUT

Regression Statistics
Multiple R            0.99964308
R Square              0.99928628     (R²)
Adjusted R Square     0.99910785
Standard Error        0.02788582     (SEE = sey)
Observations          6              (N)

ANOVA
              df    SS            MS            F             Significance F
Regression    1     4.35502286    4.35502286    5600.45805    1.9107E-07
Residual      4     0.00311048    0.00077762
Total         5     4.35813333

                              Coefficients    Standard Error    t Stat        P-value       Lower 95%      Upper 95%
Intercept (intercept b)       0.02952381      0.02018228        1.46285828    0.21733392    -0.02651117    0.08555879
X Variable 1 (slope m1)       0.99771429      0.01333197        74.8362082    1.9107E-07     0.9606988     1.03472978

Lower 95% and Upper 95% are the lower and upper bounds for the coefficients. To obtain the ± bound, subtract the lower bound from the upper bound and divide by two.
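The half-width rule is a one-line calculation. The sketch below applies it to the bounds in the output; the function name `half_width` is illustrative.

```python
# Lower and upper 95% bounds copied from the Excel output.
intercept_lower, intercept_upper = -0.02651117, 0.08555879
slope_lower, slope_upper = 0.9606988, 1.03472978


def half_width(lower, upper):
    """The +/- bound: half the distance between the two limits."""
    return (upper - lower) / 2.0


d_intercept = half_width(intercept_lower, intercept_upper)
d_slope = half_width(slope_lower, slope_upper)

# The midpoint of each interval recovers the fitted coefficient.
intercept_mid = (intercept_lower + intercept_upper) / 2.0
slope_mid = (slope_lower + slope_upper) / 2.0

print(f"b  = {intercept_mid:.4f} +/- {d_intercept:.4f}")
print(f"m1 = {slope_mid:.4f} +/- {d_slope:.4f}")
```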