Chapter 2 Lecture Notes

ISA 291 Chapter 2 Notes
Review of Basic Statistical Concepts and Terms
Overview of Data
Data: Singular or Plural?
Data Set:
Example data set:

ID Number  Height  Gender  Hair_$  Job?  Studytime  Smoker?  Dates  HS_GPA  CUM_GPA
1          65      female  40      no    3.0        no       0      3.0     2.4
2          70      male    25      no    4.0        no       0      3.5     3.3
3          67      female  50      yes   2.0        no       1      3.8     3.3
4          69      female  60      no    8.0        no       1      4.0     3.9
5          66      female  45      no    10.0       no       2      3.6     3.2
Observation:
Variable:
Univariate Data:
Bivariate Data:
Multivariate Data:
Two Types of Variables

1. Qualitative or Categorical: a category or label indicates some attribute
ex. gender (M, F); degree received (Bachelors, Masters, Doctorate)
Sometimes categorical variables are coded using numbers. For
example, we might code the variable Gender as 1 = Male, 0 =
Female. We might also code Degree Received as 1 =
Bachelors, 2 = Masters, 3 = Doctorate.
Data coded in this way is still categorical. The numbers are
simply used to represent the categories.
With categorical variables, summary values such as the mean
don't make sense.
2. Numerical or Quantitative: a numerical value (how much, how many, how
long) ex. age, expected income
With numerical variables, it makes sense to compute arithmetic
quantities such as averages. For example, one can compute the
average age (quantitative variable) of students, but not the average
gender (qualitative variable).
There are two types of quantitative variables:
1. Discrete: the possible values form a set of separate numbers
such as 0, 1, 2, ..., e.g., the number of children in a family.
2. Continuous: the possible values form a continuous interval,
e.g., age, height, weight.
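The distinction between the two variable types can be sketched in a few lines of Python. This is only an illustration using two columns of the example data set; the variable names and the 1/0 coding shown are assumptions made for the sketch.

```python
from collections import Counter
from statistics import mean

# Two columns from the example data set in these notes.
heights = [65, 70, 67, 69, 66]                                # quantitative
genders = ["female", "male", "female", "female", "female"]    # categorical

average_height = mean(heights)      # an average is meaningful for a quantitative variable
gender_counts = Counter(genders)    # for a categorical variable, count the categories instead

# Assumed numeric coding (1 = male, 0 = female). The codes are only labels,
# so the variable is still categorical even though it now looks numeric.
gender_codes = [1 if g == "male" else 0 for g in genders]

print(average_height)
print(dict(gender_counts))
print(gender_codes)
```

Counting categories (or reporting proportions) is the usual summary for a categorical variable, whether or not it has been coded with numbers.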
Time Series Data: same variable measured at regular intervals over time, usually
on the same unit. The plot below shows a time series.
[Figure: Plot of the Average Annual Temperature in NYC from 1900-2000. Vertical
axis: Average Annual Temperature (49 to 58); horizontal axis: Year (1900 to 2000).]
Cross-Sectional Data: several variables are measured on several units at one
snapshot in time. The data sheet below represents survey responses of 5 students
taken on a given day.
ID Number  Height  Gender  Hair_$  Job?  Studytime  Smoker?  Dates  HS_GPA  CUM_GPA
1          65      female  40      no    3.0        no       0      3.0     2.4
2          70      male    25      no    4.0        no       0      3.5     3.3
3          67      female  50      yes   2.0        no       1      3.8     3.3
4          69      female  60      no    8.0        no       1      4.0     3.9
5          66      female  45      no    10.0       no       2      3.6     3.2
Sampling Concepts
Sample vs. Census.
A sample involves looking only at some items selected from the population.
A census is an examination of all items in a defined population.
Why can't the United States Census survey every person in the population?
mobility, undocumented workers, budget constraints, incomplete
responses, etc.
The census is currently conducted by a mail questionnaire. The
response rate for the census was around 65% in 2000.
Census takers make personal visits to households who do not return
the questionnaire.
Many are simply not counted in the census. Who might not be
counted?
It is often more practical to sample rather than trying to take a census.
[Figure: diagram of a Sample drawn from a Population]
Target Population
The population must be carefully specified and the sample must be drawn
scientifically so that the sample is representative.
The target population is the population we are interested in (e.g., U.S.
gasoline prices).
The sampling frame is the group from which we take the sample (e.g.,
115,000 stations).
The frame should not differ from the target population.
With or Without Replacement

If we allow duplicates when sampling, then we are sampling with
replacement.
Duplicates are unlikely when n is much smaller than N.
If we do not allow duplicates when sampling, then we are sampling without
replacement.
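The difference between the two sampling schemes can be sketched with Python's standard `random` module. The population of 10 unit IDs and the seed value are assumptions made only for this illustration.

```python
import random

# Hypothetical population of N = 10 unit IDs; the seed only makes the run repeatable.
population = list(range(1, 11))
rng = random.Random(291)

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: every drawn unit is distinct.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```

With n = 5 and N = 10, duplicates are fairly likely in the first sample. They become rare when n is much smaller than N, which is the point made above.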
Parameter:
Statistic:
Random Variables and Their Probability Distributions

A random variable is a numerical measurement of the outcome of a random
experiment or phenomenon. Often, the randomness results from the use of random
sampling or a randomized experiment to gather the data.
Examples of Discrete Random Variables:
The result of a coin toss
The result of a toss of a fair die
Whether or not a batter hits a home run
An individual's response in an exit poll as to who they voted for in the election
The starting salary of a randomly selected graduate of the college of business
The number of chips in a randomly selected chocolate chip cookie.
Probability Distribution: A discrete probability distribution assigns a probability
to each value of a discrete random variable X.
To be a valid probability distribution, each of the following must be satisfied.
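The standard textbook conditions are that every probability lies between 0 and 1 and that the probabilities sum to 1. A minimal sketch of that check follows; the function name and the two example distributions are assumptions made for illustration, and exact fractions are used so the sum can be compared to 1 without rounding issues.

```python
from fractions import Fraction

def is_valid_distribution(probabilities):
    """Return True when every probability is in [0, 1] and they sum to 1."""
    values = list(probabilities.values())
    each_in_range = all(0 <= p <= 1 for p in values)
    sums_to_one = sum(values) == 1
    return each_in_range and sums_to_one

# A fair die: each face has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical assignment that fails the check because it sums to 3/4.
not_valid = {0: Fraction(1, 2), 1: Fraction(1, 4)}

print(is_valid_distribution(fair_die))
print(is_valid_distribution(not_valid))
```

If the probabilities are stored as floats instead of fractions, the sum should be compared to 1 within a small tolerance.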
Examples of Continuous Random Variables:
weight of a randomly selected student
the time it takes to produce a manufactured product
the length of a manufactured part
the amount of rainfall in a given month
the grade a randomly selected student will receive on the next exam
Probability Distribution: A continuous probability distribution is the function or
model that describes the probability allocations for continuous random variables.
For a continuous random variable, y, the probability distribution is usually written
f(y). For example, if y is a normally distributed random variable,
\[ f(y) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \quad \text{for } -\infty < y < \infty. \]
Here, μ is the population mean, and σ is the population standard deviation. A
graph of this function looks like this.
[Figure: graph of the normal density f(y)]
A continuous random variable can take on any value over a continuous range, so it
can assume an infinite number of values.
Probabilities are Represented as Area Under the Probability Density Curve
Continuous probability functions are smooth curves.
Unlike discrete distributions, the area at any single point = 0.
The entire area under any Probability Density Function (PDF) must be 1.
The mean is the balance point of the distribution.
The normal distribution
The normal distribution is another continuous probability model.
It is commonly called the bell curve because of its symmetric bell shape.
The normal distribution is very important in statistics because it accurately
describes many kinds of data.
o distributions of scores on a standardized test (SAT)
o weight of a randomly selected student
o the length of a manufactured part
o the amount of rainfall in a given month
The mean (expected value) of the normal distribution is at the center of the PDF.
The distribution is symmetric about the mean. We use the symbol μ to represent
the mean of a density curve (as opposed to X̄ for a sample mean).
The standard deviation of the normal distribution describes the spread of the
distribution around the mean. We use the symbol σ to represent the standard
deviation of a density curve (as opposed to s for a sample standard deviation).
The 68-95-99.7 rule

If the data follow a normal distribution with mean, μ, and standard deviation, σ,
68% of the observations will fall within 1σ of the mean, μ.
95% of the observations will fall within 2σ of the mean, μ.
99.7% of the observations will fall within 3σ of the mean, μ.
Example: Suppose the heights of students are distributed normally with a mean of
μ = 66 inches, and a standard deviation of σ = 3 inches.
68% of the students are between ______ and ______ inches tall.
95% of the students are between ______ and ______ inches tall.
99.7% of the students are between ______ and ______ inches tall.
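The 68-95-99.7 rule amounts to computing the mean plus or minus 1, 2, and 3 standard deviations. The sketch below does that arithmetic for the heights example (mean 66 inches, standard deviation 3 inches); the function name is an assumption made for illustration.

```python
def empirical_rule_intervals(mu, sigma):
    """Return the mu +/- k*sigma ranges for k = 1, 2, 3, keyed by approximate coverage."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {coverage[k]: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}

intervals = empirical_rule_intervals(66, 3)
for label, (low, high) in intervals.items():
    print(f"About {label} of heights fall between {low} and {high} inches")
```

The percentages are approximations to the exact normal areas, so the rule is best used for quick mental checks rather than precise probabilities.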
The Standard Normal Distribution: Calculating Normal Distribution Probabilities
using a Z-Table
The Standard Normal Distribution is a particular Normal distribution with a mean
of μ = 0 and a standard deviation of σ = 1.
The probabilities for many Z values can be found by integrating the normal
distribution function. Alternately, we can use probability distribution tables that
have already taken care of the calculus.
Normal distribution tables are given in Appendix A, Table A-1.
Example: Heights of students are distributed normally with a mean of 66 inches
and a standard deviation of 3 inches.
a. Find the proportion of students taller than 70 inches.
b. What height represents the 90th percentile of heights?
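As an alternative to the Z-table, the two parts of this example can be checked in software. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the variable names are assumptions made for illustration, and table lookups will differ slightly because of rounding z to two decimals.

```python
from statistics import NormalDist

heights = NormalDist(mu=66, sigma=3)

# a. Proportion taller than 70 inches: the area to the right of 70.
z_score = (70 - heights.mean) / heights.stdev
proportion_taller = 1 - heights.cdf(70)

# b. 90th percentile: the height with 90% of the area to its left.
percentile_90 = heights.inv_cdf(0.90)

print(round(z_score, 2))
print(round(proportion_taller, 4))
print(round(percentile_90, 2))
```

Part (a) converts 70 inches to a z-score and takes the right-tail area. Part (b) works in the opposite direction, from an area back to a height.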
Sampling Distribution of the Mean

Here are histograms of the sampling distributions of the average values of one,
two, five, and ten rolls of a die.
[Figure: four histograms, each with vertical axis "Probability of Occurrence"
(0.00 to 0.18) and horizontal axis running from 1.0 to 5.8. Panels: Outcome of
Roll of One Die; Average Outcome on Two Rolls; Average Outcome on Five Rolls;
Average Outcome on Ten Rolls.]
What is the random variable in each case?
What does the sampling distribution represent?
What is happening to the sampling distributions as the sample size increases?
Facts:
1. X̄ is a random variable.
2. Different samples will give different values of X̄.
3. X̄ has a distribution.
4. The probability distribution of X̄ is called the sampling distribution of X̄. It
is called the sampling distribution because the differences in X̄ are due to the
fact that we sampled. If we plot all possible values of X̄ on a histogram
with the height of the bars equal to the probability of occurrence of a particular
value, we will see a picture of the sampling distribution of the sample means.
5. The larger the sample size, the more precisely X̄ estimates μ (equivalently, the
larger the sample size, the smaller the standard error of X̄).
6. The standard deviation of the sampling distribution of X̄ is known as the
standard error of X̄. The standard error of X̄ is equal to σ/√n, where σ is
the standard deviation of the population.
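Facts 5 and 6 can be checked for the die-rolling example by building the exact sampling distribution of the average of n rolls and comparing its standard deviation with σ/√n. This is a sketch under the assumption of a fair six-sided die, for which σ = √(35/12); the function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt

def sum_distribution(n_rolls):
    """Exact distribution of the sum of n_rolls fair six-sided dice."""
    dist = {0: Fraction(1)}
    for _ in range(n_rolls):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, Fraction(0)) + p * Fraction(1, 6)
        dist = new
    return dist

def mean_and_sd_of_average(n_rolls):
    """Mean and standard deviation of the average of n_rolls dice."""
    dist = sum_distribution(n_rolls)
    mean = sum(Fraction(total, n_rolls) * p for total, p in dist.items())
    var = sum((Fraction(total, n_rolls) - mean) ** 2 * p for total, p in dist.items())
    return float(mean), sqrt(var)

sigma = sqrt(35 / 12)  # population standard deviation of one fair die
results = {n: mean_and_sd_of_average(n) for n in (1, 2, 5, 10)}
for n, (mean, standard_error) in results.items():
    print(n, mean, round(standard_error, 4), round(sigma / sqrt(n), 4))
```

The mean of the sampling distribution stays at 3.5 for every n, while its standard deviation shrinks in step with σ/√n.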
Estimating a Mean with Confidence
o A sample statistic is often used to estimate a population parameter. For
example, we often use the sample mean, X̄, to estimate the population mean,
μ. X̄ is known as a point estimate of the population mean.
o The sample mean will change from sample to sample.
o The sample mean has a distribution known as the sampling distribution.
o Because the sample mean will change from sample to sample, it is often more
useful to provide not just a single estimate of the population mean, but a range
of values to estimate μ.
o We use the sampling distribution of X̄ to compute this interval or range of
values to estimate μ.
o This range of values is known as a confidence interval.
Example: Suppose we know that the amount of money spent monthly on credit
cards is normally distributed with an unknown mean. Suppose further (although
this is an unrealistic supposition) that we know the standard deviation of the
amount spent is equal to σ = $50. We randomly sample 100 credit card holders and
compute their average monthly expenditures to be X̄ = $500.
o The sampling distribution of X̄ is normal with mean equal to the unknown
value, μ, and standard deviation equal to
\[ \frac{\sigma}{\sqrt{n}} = \frac{50}{\sqrt{100}} = 5. \]
o We know from the 68-95-99.7 rule that approximately 95% of the time, X̄
will fall within the range
\[ \mu \pm 2 \cdot \frac{50}{\sqrt{100}} = \mu \pm 10. \]
o We can use this fact to determine that approximately 95% of the time, the
unknown population mean will fall within the range
\[ \bar{X} \pm 2 \cdot \frac{50}{\sqrt{100}} = 500 \pm 10 = (\$490, \$510). \]
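The arithmetic in this example can be reproduced with a short sketch. The multiplier 2 comes from the 68-95-99.7 rule, as in the text; the variable names are assumptions made for illustration.

```python
from math import sqrt

sigma = 50.0    # known population standard deviation, in dollars
n = 100         # sample size
x_bar = 500.0   # observed sample mean, in dollars

standard_error = sigma / sqrt(n)
margin = 2 * standard_error                  # "2" is the 68-95-99.7 rule multiplier
interval = (x_bar - margin, x_bar + margin)

print(standard_error)
print(interval)
```

A Z-table gives the slightly more exact multiplier 1.96 for 95% confidence, which would make the interval a little narrower.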
Reality: The calculations above assume we know the population standard
deviation, σ, which is not likely. Thus, we can rarely use the distribution of the
random variable,
\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}. \]
Rather, we use the sampling distribution of
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}}. \]
The random variable, t, follows a Student's t-distribution, or simply the
t-distribution.
The difference in the t- and z-distributions:
The z-distribution assumes that the population standard deviation, σ, is
known.
The t-distribution assumes that the population standard deviation is
estimated with s, the sample standard deviation.
The t-distribution looks almost like the standard normal distribution in that it
is symmetric about zero.
However, the tails of the t-distribution are fatter than those of the standard
normal.
This is to take into account the use of the sample standard deviation (s)
instead of the population standard deviation (σ).
The exact spread of the t-distribution depends on the degrees of freedom. In
the case of a sample mean, the degrees of freedom are n - 1.
As the degrees of freedom approach infinity, the t-distribution approaches
the z-distribution.
The t-table can be found in Table A2 of Appendix A in your text.
Example: We are interested in estimating the mean monthly rent for all
unfurnished one bedroom apartments in the community. We have no other
information. Assume we randomly sample n = 10 apartments advertised in the
local paper. They have the following monthly rental rates.
500  550  650  515  600  495  505  650  450  395
We compute X̄ = 531, and s = 82.79.

To compute a C level confidence interval for estimating μ, we use the following
formula:
This can be rewritten as
The margin of error is given by
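For reference, the usual textbook form of a t-based interval for a mean is the sample mean plus or minus t* times s/√n, where t* is the t-table value for the chosen confidence level with n - 1 degrees of freedom. The sketch below applies that form to the rent data. It is an illustration rather than the notes' own worked solution, and the value t* = 2.262 (95% confidence, 9 degrees of freedom) is taken from a standard t-table.

```python
from math import sqrt
from statistics import mean, stdev

rents = [500, 550, 650, 515, 600, 495, 505, 650, 450, 395]

n = len(rents)
x_bar = mean(rents)              # sample mean
s = stdev(rents)                 # sample standard deviation (n - 1 divisor)
standard_error = s / sqrt(n)

# Assumption: 2.262 is the t-table value for 95% confidence with n - 1 = 9 df.
t_star = 2.262

margin_of_error = t_star * standard_error
interval = (x_bar - margin_of_error, x_bar + margin_of_error)

print(x_bar, round(s, 2))
print(round(margin_of_error, 2))
print(tuple(round(limit, 2) for limit in interval))
```

The width of the interval is twice the margin of error, so a larger sample or a lower confidence level would both narrow it.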
Suppose we wish to compute a 95% confidence interval.

1. Find t
2. Compute the interval:
3. Interpret the interval:
What is the margin of error for the confidence interval?
What is the width of the confidence interval?
What is the confidence level of the confidence interval?
Suppose the degrees of freedom are not listed on the t-table. In this case, you
choose the next lower value, or preferably, look up a more exact value in a
computer software program.
The Meaning of 95% Confidence
One Sample Mean Test
Do Low Carb Diets Work? In a recent study, 41 overweight subjects were placed
on a low carbohydrate diet but given no limit on caloric intake. The prediction was
that subjects on such a diet would lose weight, on the average. After 16 weeks, the
weight change averaged 21.38 pounds (before-after) with a standard deviation of
7.5 pounds. Does the low carb diet really work?
Here we are conducting a test for a quantitative variable: Amount of weight
lost after 16 weeks on a low carb diet.
The parameter we are testing is the mean weight loss at the end of the 16 week
period for all those who may ever be on the diet. We represent this population
mean with the Greek letter, μ.
State Ho and Ha.
Ho:
Ha:
Compute the test statistic.

The test statistic is a measure of how many standard errors the actual observed
sample mean is from the hypothesized value (according to Ho). We measure this
distance with a t-statistic. The formula for the test statistic is as follows:
\[ t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
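The test statistic for the low carb example can be computed from the summary numbers given above. The sketch assumes the hypothesized value is 0, meaning no average weight loss; that choice is an assumption here, since the hypotheses are left for the reader to state.

```python
from math import sqrt

n = 41           # number of subjects
x_bar = 21.38    # mean weight change in pounds (before - after)
s = 7.5          # sample standard deviation in pounds
mu_0 = 0.0       # assumed hypothesized value: no weight loss on average

standard_error = s / sqrt(n)
t_statistic = (x_bar - mu_0) / standard_error
degrees_of_freedom = n - 1

print(round(standard_error, 3))
print(round(t_statistic, 2))
print(degrees_of_freedom)
```

A t-statistic this many standard errors above the hypothesized value lies far out in the right tail of a t-distribution with 40 degrees of freedom.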
Find the p-value.

The p-value is a probability from the t-distribution. It is the probability,
computed assuming Ho is true, of observing a sample mean in a subsequent sample
that is as extreme as or more extreme than (in the direction of Ha) the one you
observed. In our case, this would represent putting another random 41 subjects
on the same low carb regimen and observing an average weight loss of 21.38
pounds or more.
Base your decision on the p-value.

Small p-values give evidence against the null hypothesis. Large p-values mean
the data do not provide evidence against the null hypothesis.
In this case our p-value of .0000 means that if indeed the low carb diet does
not work, then there is virtually no chance of observing another group of 41
subjects on this diet who lost 21.38 pounds or more on average.
The near impossibility of repeating this study with similar or more extreme
results suggests that the null hypothesis assumption (the diet doesn't
work) is unlikely to be true.
Decision:
Conclusion:
How small is a small p-value?

We usually have a preset significance level of a test. The significance level is a
small value such as .05 or .10. This value tells us how much evidence we need
against the null hypothesis in order to reject it.
We will reject Ho if the p-value is smaller than the level of significance.
If we reject Ho, we say that the results are statistically significant.
Steps to conducting a hypothesis test:

1. Consider the Assumptions:
a. Quantitative variable with population mean, μ, defined.
b. A random sample
c. The population distribution is approximately normal.
2. State the Hypotheses:
Null:         Ho: μ = μ0
Alternative:  Ha: μ > μ0, μ < μ0, or μ ≠ μ0
3. Compute the Test Statistic:
\[ t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
4. Find the p-value

Alternative Hypothesis   p-value                  Graph
Ha: μ > μ0               Right tail probability   [figure]
Ha: μ < μ0               Left tail probability    [figure]
Ha: μ ≠ μ0               Two-tail probability     [figure]
5. Draw a conclusion.
a. If the p-value is smaller than the significance level (α), reject the null
hypothesis in favor of the alternative.
b. If the p-value is larger than the significance level (α), do not reject the
null hypothesis.
c. Relate the conclusion to the context of study.
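The decision rule in step 5 can be written as a small function. This is a sketch only: the function name is hypothetical, and the p-values passed to it below are made-up numbers chosen to show both outcomes.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"

# Hypothetical p-values chosen only to show both outcomes.
print(decide(0.003))        # reject H0
print(decide(0.20))         # do not reject H0
print(decide(0.07, 0.10))   # reject H0
```

The same p-value can lead to different decisions at different significance levels, which is why α should be set before looking at the data.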
Example: Is the ideal number of children in a family 2.0? The responses to a
recent GSS question, "What do you think is the ideal number of children to
have?", result in a mean of 2.49 and a standard deviation of .85 in a sample of
1302 respondents. A statistical software package gives the following output.
Hypothesis test results:
μ : population mean
H0 : μ = 2
HA : μ ≠ 2
Mean  Sample Mean  Std. Err.    DF    T-Stat     P-value
μ     2.49         0.023556644  1301  20.800924  <0.0001
Assumptions:
State Ho and Ha.
Find the p-value:
Conclusion:
Example: Suppose a school within a university is going to require a minimum
GPA of 2.5 to enter that school. Suppose the students enrolled in an introductory
course represent a random selection of students who desire entry into the school.
A survey gives the following results. Use the output to test whether or not the
mean GPA of all students desiring entry into the school differs from 2.5.
Hypothesis test results:
μ : mean of Variable
H0 : μ = 2.5
HA : μ ≠ 2.5
Variable  Sample Mean  Std. Err.   DF   T-Stat     P-value
CUM_GPA   2.8846443    0.02866734  266  13.417506  <0.0001

[Figure: Histogram with Frequency bars and a Cumulative % line. Horizontal axis:
bins from 1.8 to 4 in steps of 0.2; left vertical axis: Frequency (0 to 60);
right vertical axis: Cumulative % (0.00% to 100.00%).]
Assumptions:
State Ho and Ha.
Find the p-value:
Conclusion:
Hypothesis Tests and Confidence Intervals are Equivalent When
The confidence level = 1 - α.
The test is two-tailed.
99% confidence interval results:
μ : mean of Variable
Variable  Sample Mean  Std. Err.   DF   L. Limit   U. Limit
CUM_GPA   2.8846443    0.02866734  266  2.8102686  2.95902
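The equivalence can be illustrated with the GPA output: a two-tailed test at α = 0.01 rejects Ho exactly when the hypothesized value falls outside the 99% confidence interval. In the sketch below the numbers are copied from the software output in these notes, and the function name is an assumption made for illustration.

```python
# Values copied from the software output in these notes.
sample_mean = 2.8846443
lower_limit, upper_limit = 2.8102686, 2.95902   # 99% confidence limits
mu_0 = 2.5                                      # hypothesized mean GPA

def rejects_null(mu_0, lower, upper):
    """Two-tailed test at alpha = 1 - confidence level:
    reject H0 when mu_0 lies outside the confidence interval."""
    return not (lower <= mu_0 <= upper)

margin_of_error = (upper_limit - lower_limit) / 2
reject = rejects_null(mu_0, lower_limit, upper_limit)

print(round(margin_of_error, 4))
print(reject)
```

Because 2.5 lies below the lower limit, the interval and the two-tailed test reported earlier (p-value < 0.0001) point to the same decision.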
Compute the Confidence Interval:
Interpret the Confidence Interval:
Other Important Probability Models
We have discussed two very important continuous probability models, the normal
(standard normal) and the t-distribution. The Chi-Square and F-distributions are
other important continuous probability distributions used in statistical modeling.
Chi-Square
The shape of the curve is right-skewed, and it has a lower bound of zero.
The shape of the curve depends heavily on the number of degrees of freedom.
We are usually interested in right-tail areas of the chi-square distribution.
F-Distribution
The density curve of the F-distribution is right-skewed, and has a lower bound
of zero.
The shape of the curve depends heavily on two sets of degrees of freedom,
listed above as r1 and r2.
The F-distribution is also called the variance ratio distribution because it is
derived from the ratio of two variance estimates, each of which follows a
(scaled) chi-square distribution.
The value, r1, corresponds to the numerator degrees of freedom, or the
numerator of the ratio.
The value, r2, corresponds to the denominator degrees of freedom, or the
denominator of the ratio.
Table A3 in Appendix A (pp. 594-597) gives the upper tail critical values for the
F-distribution for several sets of numerator and denominator degrees of
freedom.
If the exact degrees of freedom combination you need is not in the table, you
should round down to the closest value, or preferably, use a software
package to obtain a more accurate calculation.
Example: Find the 95th percentile of an F-distribution with numerator d.f. = 2, and
denominator d.f. = 10.
Find the 99th percentile of an F-distribution with numerator d.f. = 2 and

Find the 99th percentile of an F-distribution with numerator d.f = 33 and

