Probability Distributions 1
Predictive Models
Predictive modeling is a process that uses data mining and
probability to forecast outcomes. Each model is made up of a
number of predictors, which are variables that are likely to
influence future results.
Regression and Classification Problems
Classification Problems
Classification is the process of predicting the class of given
data points. Classes are sometimes called targets, labels or
categories.
Examples: spam detection, medical diagnosis, fraud
detection, credit scoring, image detection, etc.
Classification Models
Logistic Regression
Linear Discriminant Analysis (LDA)
Classification Trees
K Nearest Neighbors
Linear Regression Model
Problems involving sets of variables when it is known that
there exists some inherent relationship among the variables
can be solved by regression models.
Examples: house price based on key features, gas mileage
based on engine capacity, epidemiology, economics.
Regression Models
Regression Trees
Linear Regression Model
Linear Regression Model
A simple linear regression model has the following structure:
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀
𝛽0 is the intercept
𝛽1 is the slope: the change in 𝑌 per unit change in 𝑋
𝑌 is the dependent variable, the variable to be predicted
𝑋 is the independent variable. It can be controlled and is not random.
𝜀 is the random error (perturbation), with E[𝜀] = 0
Linear Regression Model
• The objective is to estimate β0 and β1 based on a dataset with pairs
(x_i, y_i). The fitted line is:
$$\hat{y} = b_0 + b_1 x$$
Linear Regression Model Steps
• 1. Scatter plot (dispersion diagram)
• 2. Estimate 𝑏0 and 𝑏1
• 3. Validate the model with analysis of variance (ANOVA)
• 4. Coefficient of determination (R²) analysis
• 5. Residual analysis (normality, homoscedasticity and
independence)
• 6. Confidence intervals for the coefficients 𝛽0 and 𝛽1
• 7. Confidence interval for the mean of 𝑌 and prediction interval for 𝑌
Step 1: Dispersion Diagram
Step 2: Estimate 𝑏0 and 𝑏1
We shall find b0 and b1, the estimates of β0 and β1 , so that the sum of the
squares of the residuals is a minimum. The residual sum of squares is often
called the sum of squares of the errors about the regression line and is
denoted by SSE.
$$b_0 = \frac{\sum y_i - b_1 \sum x_i}{n} = \bar{y} - b_1 \bar{x}, \qquad b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
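The least-squares estimates can be computed directly from the sums in the formulas for b0 and b1. The following is a minimal sketch in plain Python; the function name `fit_simple_linear_regression` and the small data set are hypothetical and chosen only so that the result is easy to check by hand.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares intercept and slope."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    b1 = (n * sum_xy - sum_x * sum_y) / denominator
    b0 = (sum_y - b1 * sum_x) / n  # equivalent to y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x
b0, b1 = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

Because the hypothetical points lie exactly on a line, the estimates should recover that line's intercept and slope.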
Example Errors
Example
A company assigns different prices to an electronic product. The following
table shows the product sales for different prices.
Step 3: Model Validation ANOVA
The analysis of variance splits the variation of Y into different
components (regression model and error) and tests:
• 𝐻0 : 𝛽1 = 0
• 𝐻1 : 𝛽1 ≠ 0
Step 3: Model Validation ANOVA
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀
Step 3: Model Validation ANOVA
• 𝑆𝑆𝑇: Total variability of 𝑌.
• 𝑆𝑆𝑅: Variability of 𝑌 explained by the regression model.
• 𝑆𝑆𝐸: Variability of 𝑌 not explained by the regression model (attributed to the error).
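$$SST = S_{yy}, \qquad S_{yy} = \sum (y_i - \bar{y})^2$$
$$SSR = b_1 S_{xy}, \qquad S_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n}$$
$$SSE = S_{yy} - b_1 S_{xy}, \qquad S_{xx} = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n}$$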
Step 3: Model Validation ANOVA
Source      SS    DF      MS                    F test
Regression  SSR   1       MSR = SSR / 1         f = MSR / MSE = (SSR / 1) / (SSE / (n − 2))
Error       SSE   n − 2   MSE = SSE / (n − 2)
Total       SST   n − 1
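The entries of the ANOVA table can be computed from S_xx, S_xy and S_yy. The sketch below is one possible way to do it in plain Python; the function name `anova_simple_regression` and the data are hypothetical. It returns only the F statistic; comparing it with a critical value of the F distribution with 1 and n − 2 degrees of freedom still requires a table or a statistics library.

```python
def anova_simple_regression(x, y):
    """Return the ANOVA quantities for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Centered sums; these equal the S_xx, S_xy, S_yy shortcut formulas
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    b1 = s_xy / s_xx
    sst = s_yy
    ssr = b1 * s_xy
    sse = sst - ssr

    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SST": sst, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse}


# Hypothetical data, chosen so the sums are easy to verify by hand
table = anova_simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```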
Step 3: Model Validation ANOVA
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0
Step 4: Coefficient of Determination
$$R^2 = 1 - \frac{SSE}{SST}$$
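As an illustration, R² can be computed from the fitted values by accumulating SSE and SST separately. This is a small sketch with a hypothetical function name `r_squared` and hypothetical data, not a reference implementation.

```python
def r_squared(x, y):
    """Return R^2 = 1 - SSE / SST for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    # SSE: squared distances from the observations to the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: squared distances from the observations to their mean
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 4))
```

A value close to 1 means that most of the variability of Y is explained by the regression model; a value close to 0 means that little of it is.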
Example
The grades of a class of 9 students on a midterm report (x) and on the
final test (y) are as follows:

Y (final test)   X (midterm report)
82               77
66               50
78               71
34               40
47               46
85               90
99               96
99               99
68               67

(a) Estimate the linear regression line.
(b) Carry out an ANOVA.
(c) Estimate the final examination grade of a student who received a
grade of 85 on the midterm report.
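One way to work through this example is to apply the formulas from Steps 2 and 3 directly to the nine pairs of grades. The script below is a sketch of that calculation in plain Python; the variable names are illustrative. It stops at the F statistic, since the decision in the ANOVA still needs a critical value from an F table with 1 and 7 degrees of freedom.

```python
# Grades from the example: x = midterm report, y = final test
x = [77, 50, 71, 40, 46, 90, 96, 99, 67]
y = [82, 66, 78, 34, 47, 85, 99, 99, 68]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# (a) Least-squares estimates of the intercept and slope
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

# (b) ANOVA quantities
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = (n * sum_xy - sum_x * sum_y) / n
sst = s_yy
ssr = b1 * s_xy
sse = sst - ssr
f_stat = (ssr / 1) / (sse / (n - 2))

# (c) Point estimate of the final grade for a midterm grade of 85
y_hat_85 = b0 + b1 * 85

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}, F = {f_stat:.2f}")
print(f"Estimated final grade at x = 85: {y_hat_85:.2f}")
```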
Problem: Simple Linear Regression
A study was made on the amount of converted sugar in a certain process
at various temperatures. The data were coded and recorded as follows:

Converted Sugar   Temperature
7.7               1
7.8               1.1
8.2               1.2
8.4               1.3
8.8               1.4
8.9               1.5
8.6               1.6
9                 1.7
9.3               1.8
9.2               1.9
10.5              2

(a) Estimate the linear regression line.
(b) Estimate the mean amount of converted sugar produced when the
coded temperature is 1.75.
Step 5: Residuals Analysis- Normality test
Step 5: Residuals Analysis- Homogeneity of Variance
o 𝐻0: Residuals have homogeneous variance.
o 𝐻1: Residuals have heterogeneous variance.
Plot the standardized residuals versus ŷ or x and check whether there are
patterns or changes in variance.
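The values to plot can be obtained as follows. This sketch assumes the simplest definition of a standardized residual, e_i / s with s² = SSE / (n − 2); some texts additionally adjust each residual for leverage, which is not done here. The function name and data are hypothetical.

```python
import math


def standardized_residuals(x, y):
    """Return (fitted values, residuals divided by s) for a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    # s estimates the standard deviation of the error term
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return fitted, [e / s for e in residuals]


# Hypothetical data
fitted, std_res = standardized_residuals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for f, r in zip(fitted, std_res):
    print(f"fitted = {f:.2f}, standardized residual = {r:+.3f}")
```

Plotting the second list against the first (or against x) gives the diagnostic plot described above; a funnel shape or a clear trend would suggest heterogeneous variance.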
Step 5: Residuals Analysis- Independence
The independence of the residuals can be assessed by inspecting the residual
plot or by using a statistical test such as Durbin-Watson.
Step 5: Residuals Analysis- Independence
The Durbin-Watson statistic is computed from the residuals e_i.
The statistic satisfies 0 ≤ 𝐷𝑤 ≤ 4. When 𝐷𝑤 is close to 2, the residuals are assumed to be
independent. For a given significance level, the critical values dl and du are found from the
table. Conclusions are based on the following guidelines:
If 0 ≤ 𝐷𝑤 ≤ 𝑑𝑙, positive correlation
If 𝑑𝑙 ≤ 𝐷𝑤 ≤ 𝑑𝑢, inconclusive
If 𝑑𝑢 ≤ 𝐷𝑤 ≤ 4−𝑑𝑢, residuals are independent
If 4−𝑑𝑢 ≤ 𝐷𝑤 ≤ 4−𝑑𝑙, inconclusive
If 4−𝑑𝑙 ≤ 𝐷𝑤 ≤ 4, negative correlation
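The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that definition and the five-region decision rule, using the standard reading in which small values indicate positive autocorrelation and values near 4 indicate negative autocorrelation. The function names, the residual sequences and the bounds `d_l = 0.8` and `d_u = 1.3` are hypothetical placeholders; real bounds must be read from a Durbin-Watson table for the sample size and significance level at hand.

```python
def durbin_watson(residuals):
    """Return sum((e_t - e_{t-1})^2) / sum(e_t^2) for residuals in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def interpret_durbin_watson(dw, d_l, d_u):
    """Classify dw using the lower and upper critical values d_l and d_u."""
    if dw < d_l:
        return "positive autocorrelation"
    if dw < d_u:
        return "inconclusive"
    if dw <= 4 - d_u:
        return "independent"
    if dw <= 4 - d_l:
        return "inconclusive"
    return "negative autocorrelation"


# Hypothetical residuals: long runs of the same sign, then strict alternation
runs = [1, 1, 1, -1, -1, -1]
alternating = [1, -1, 1, -1, 1, -1]

for name, e in (("runs", runs), ("alternating", alternating)):
    dw = durbin_watson(e)
    # d_l and d_u below are placeholders, not values from a real table
    print(name, round(dw, 3), interpret_durbin_watson(dw, d_l=0.8, d_u=1.3))
```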
Step 6: Inference of 𝛽0 and 𝛽1
The estimates 𝑏0 and 𝑏1 of 𝛽0 and 𝛽1 are values of the random
variables 𝐵0 and 𝐵1.
𝐸(𝐵0) = 𝛽0 and 𝐸(𝐵1) = 𝛽1.
The variances of 𝐵0 and 𝐵1 are:
$$V(B_0) = s^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right], \qquad V(B_1) = \frac{s^2}{S_{xx}}$$
where $s^2$ is given by the following equation:
$$s^2 = \frac{SSE}{n-2}$$
Step 6: Inference of 𝛽0 and 𝛽1
The (1 − 𝛼)100% confidence intervals for 𝛽0 and 𝛽1 are:
$$b_0 - t_{n-2,\alpha/2} \sqrt{V(B_0)} \le \beta_0 \le b_0 + t_{n-2,\alpha/2} \sqrt{V(B_0)}$$
$$b_1 - t_{n-2,\alpha/2} \sqrt{V(B_1)} \le \beta_1 \le b_1 + t_{n-2,\alpha/2} \sqrt{V(B_1)}$$
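The intervals for the coefficients can be sketched in plain Python as shown below. The standard library has no Student t quantile function, so the critical value is passed in as the parameter `t_crit` and is assumed to come from a t table with n − 2 degrees of freedom. The function name and data are hypothetical; 3.182 is the tabulated two-sided 95% value for 3 degrees of freedom.

```python
import math


def coefficient_intervals(x, y, t_crit):
    """Return estimates, variances and confidence intervals for beta0 and beta1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    s2 = (s_yy - b1 * s_xy) / (n - 2)  # s^2 = SSE / (n - 2)
    var_b0 = s2 * (1 / n + x_bar ** 2 / s_xx)
    var_b1 = s2 / s_xx

    half_b0 = t_crit * math.sqrt(var_b0)
    half_b1 = t_crit * math.sqrt(var_b1)
    return {
        "b0": b0, "var_b0": var_b0, "ci_b0": (b0 - half_b0, b0 + half_b0),
        "b1": b1, "var_b1": var_b1, "ci_b1": (b1 - half_b1, b1 + half_b1),
    }


# Hypothetical data with n = 5, so the t table is read at 3 degrees of freedom
result = coefficient_intervals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182)
print("95% CI for beta0:", result["ci_b0"])
print("95% CI for beta1:", result["ci_b1"])
```

If the interval for β1 contains 0, the data do not rule out a zero slope at that confidence level, which agrees with the two-sided test of H0: β1 = 0.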
Step 7: Inference of 𝑌
The (1 − 𝛼)100% confidence interval for the mean of 𝑌 is:
Step 7: Inference of 𝑌
The (1 − 𝛼)100% prediction interval for 𝑌 is:
where
$$V(Y) = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]$$
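The sketch below computes both intervals of Step 7 at a chosen point x0. It assumes the standard forms ŷ0 ± t·√variance, with variance s²[1/n + (x0 − x̄)²/S_xx] for the mean of Y and the V(Y) given above for a new observation; these interval forms are an assumption of the sketch rather than a quotation from the text. The function name, the data and the point x0 are hypothetical, and `t_crit` is taken from a t table with n − 2 degrees of freedom.

```python
import math


def intervals_at(x, y, x0, t_crit):
    """Return (y_hat, mean-response interval, prediction interval) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    s2 = (s_yy - b1 * s_xy) / (n - 2)

    y_hat = b0 + b1 * x0
    leverage_term = 1 / n + (x0 - x_bar) ** 2 / s_xx

    # Mean response: uncertainty of the fitted line only
    half_mean = t_crit * math.sqrt(s2 * leverage_term)
    # New observation: adds the variance of a single error term
    half_pred = t_crit * math.sqrt(s2 * (1 + leverage_term))

    return (y_hat,
            (y_hat - half_mean, y_hat + half_mean),
            (y_hat - half_pred, y_hat + half_pred))


# Hypothetical data and point; 3.182 is the 95% t value for 3 degrees of freedom
y_hat, mean_ci, pred_int = intervals_at([1, 2, 3, 4, 5], [2, 4, 5, 4, 5],
                                        x0=3, t_crit=3.182)
print("fitted value:", y_hat)
print("interval for the mean of Y:", mean_ci)
print("prediction interval for Y:", pred_int)
```

The prediction interval is always wider than the interval for the mean, because it also accounts for the variability of an individual observation around the line.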