
ISA 291 Chapter 2 Notes

Review of Basic Statistical Concepts and Terms

Overview of Data
Data: Singular or Plural?

Data Set:

Example data set:


ID Number  Height  Gender  Hair_$  Job?  Studytime  Smoker?  Dates  HS_GPA  CUM_GPA
1          65      female  40      no    3.0        no       0      3.0     2.4
2          70      male    25      no    4.0        no       0      3.5     3.3
3          67      female  50      yes   2.0        no       1      3.8     3.3
4          69      female  60      no    8.0        no       1      4.0     3.9
5          66      female  45      no    10.0       no       2      3.6     3.2

Observation:

Variable:

Univariate Data:

Bivariate Data:

Multivariate Data:

Two Types of Variables


1. Qualitative or Categorical: a category or label indicates some attribute,
e.g., gender (M, F); degree received (Bachelors, Masters, Doctorate).
Sometimes categorical variables are coded using numbers. For
example, we might code the variable Gender as 1 = Male, 0 =
Female. We might also code Degree Received as 1 =
Bachelors, 2 = Masters, 3 = Doctorate.
Data coded in this way is still categorical. The numbers are
simply used to represent the categories.
With categorical variables, summary values such as the mean
don't make sense.

2. Numerical or Quantitative: a numerical value (how much, how many, how
long), e.g., age, expected income.
With numerical variables, it makes sense to compute arithmetic
quantities such as averages. For example, one can compute the
average age (quantitative variable) of students, but not the average
gender (qualitative variable).
There are two types of quantitative variables:
1. Discrete: the possible values form a set of separate numbers
such as 0, 1, 2, ..., e.g., the number of children in a family.
2. Continuous: the possible values form a continuous interval,
e.g., age, height, weight.
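
The distinction between the two variable types can be sketched with two columns of the example data set. This is only an illustration; the variable names (heights, genders) are chosen here and are not part of the notes.

```python
from collections import Counter
from statistics import mean

# Values copied from the example data set in these notes.
heights = [65, 70, 67, 69, 66]                                # quantitative
genders = ["female", "male", "female", "female", "female"]   # categorical

# A mean is meaningful for a quantitative variable.
mean_height = mean(heights)

# For a categorical variable, count how often each category occurs instead.
gender_counts = Counter(genders)

print(mean_height)
print(gender_counts)
```

Coding Gender as 0/1 would let the mean be computed, but the result would be a proportion of one category, not an "average gender".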

Time Series Data: same variable measured at regular intervals over time, usually
on the same unit. The plot below shows a time series.

[Figure: Plot of the Average Annual Temperature in NYC from 1900-2000.
Vertical axis: Average Annual Temperature (49 to 58); horizontal axis: Year
(1900 to 2000).]

Cross-Sectional Data: several variables are measured on several units at one
snapshot in time. The data sheet below represents survey responses of 5 students
taken on a given day.

ID Number  Height  Gender  Hair_$  Job?  Studytime  Smoker?  Dates  HS_GPA  CUM_GPA
1          65      female  40      no    3.0        no       0      3.0     2.4
2          70      male    25      no    4.0        no       0      3.5     3.3
3          67      female  50      yes   2.0        no       1      3.8     3.3
4          69      female  60      no    8.0        no       1      4.0     3.9
5          66      female  45      no    10.0       no       2      3.6     3.2

Sampling Concepts
Sample vs. Census.

A sample involves looking only at some items selected from the population.

A census is an examination of all items in a defined population.

Why can't the United States Census survey every person in the population?

mobility, undocumented workers, budget constraints, incomplete responses, etc.
The census is currently conducted by a mail questionnaire. The
response rate for the census was around 65% in 2000.
Census takers make personal visits to households who do not return
the questionnaire.
Many are simply not counted in the census. Who might not be
counted?
It is often more practical to sample rather than trying to take a census.

[Figure: a Sample drawn from a Population]

Target Population
The population must be carefully specified and the sample must be drawn
scientifically so that the sample is representative.
The target population is the population we are interested in (e.g., U.S.
gasoline prices).
The sampling frame is the group from which we take the sample (e.g.,
115,000 stations).
The frame should not differ from the target population.

With or Without Replacement


If we allow duplicates when sampling, then we are sampling with
replacement.
Duplicates are unlikely when n is much smaller than N.
If we do not allow duplicates when sampling, then we are sampling without
replacement.
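
The two sampling schemes can be sketched with Python's standard random module. The population of 20 ID numbers and the seed are hypothetical, chosen only for illustration.

```python
import random

rng = random.Random(291)          # fixed seed so the sketch is repeatable
population = list(range(1, 21))   # hypothetical population of N = 20 ID numbers
n = 5                             # sample size

# With replacement: each draw is made from the full population, so the same
# ID can appear more than once.
with_replacement = rng.choices(population, k=n)

# Without replacement: an ID cannot be drawn twice, so duplicates never occur.
without_replacement = rng.sample(population, k=n)

print(with_replacement)
print(without_replacement)
```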

Parameter:

Statistic:

Random Variables and Their Probability Distributions


A random variable is a numerical measurement of the outcome of a random
experiment or phenomenon. Often, the randomness results from the use of random
sampling or a randomized experiment to gather the data.
Examples of Discrete Random Variables:

The result of a coin toss

The result of a toss of a fair die

Whether or not a batter hits a home run

An individual's response in an exit poll as to who they voted for in the
election

The starting salary of a randomly selected graduate of the college of
business

The number of chips in a randomly selected chocolate chip cookie.

Probability Distribution: A discrete probability distribution assigns a probability
to each value of a discrete random variable X.
To be a valid probability distribution, each of the following must be satisfied.
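
The conditions are left for the reader to fill in. The standard requirements are that every probability lies between 0 and 1 and that the probabilities sum to 1. A minimal sketch of that check, assuming a fair six-sided die as the random variable (the function name is made up for this illustration):

```python
from fractions import Fraction

# Hypothetical example: probability distribution of one roll of a fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def is_valid_distribution(dist):
    """Check the two standard conditions for a discrete distribution."""
    each_in_range = all(0 <= p <= 1 for p in dist.values())
    sums_to_one = sum(dist.values()) == 1
    return each_in_range and sums_to_one

print(is_valid_distribution(die))
```

Exact fractions are used so that the "sums to 1" check is not affected by floating-point rounding.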

Examples of Continuous Random Variables:

weight of a randomly selected student

the time it takes to produce a manufactured product

the length of a manufactured part

the amount of rainfall in a given month

the grade a randomly selected student will receive on the next exam

Probability Distribution: A continuous probability distribution is the function or
model that describes the probability allocations for continuous random variables.

For a continuous random variable, y, the probability distribution is usually written
f(y). For example, if y is a normally distributed random variable,

$$ f(y) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \quad \text{for } -\infty < y < \infty. $$

Here, μ is the population mean, and σ is the population standard deviation. A
graph of this function looks like this.

[Figure: graph of the normal density curve f(y)]

A continuous random variable can take on any value over a continuous range, so
it can assume an infinite number of values.

Probabilities are Represented as Area Under the Probability Density Curve

Continuous probability functions are smooth curves.
Unlike discrete distributions, the area at any single point = 0.
The entire area under any Probability Density Function (PDF) must be 1.
Mean is the balance point of the distribution.

The normal distribution

The normal distribution is another continuous probability model.

It is commonly called the bell curve because of its symmetric bell shape.

The normal distribution is very important in statistics because it accurately
describes many kinds of data.
o distributions of scores on a standardized test (SAT)
o weight of a randomly selected student
o the length of a manufactured part
o the amount of rainfall in a given month

The mean (expected value) of the normal distribution is at the center of the
PDF. The distribution is symmetric about the mean. We use the symbol μ to
represent the mean of a density curve (as opposed to X̄ for a sample mean).

The standard deviation of the normal distribution describes the spread of the
distribution around the mean. We use the symbol σ to represent the standard
deviation of a density curve (as opposed to s for a sample standard deviation).

The 68-95-99.7 rule

If the data follow a normal distribution with mean μ and standard deviation σ,

About 68% of the observations will fall within 1σ of the mean, μ.

About 95% of the observations will fall within 2σ of the mean, μ.

About 99.7% of the observations will fall within 3σ of the mean, μ.

Example: Suppose the heights of students are distributed normally with a mean of
μ = 66 inches, and a standard deviation of σ = 3 inches.

68% of the students are between ______ and ______ inches tall.

95% of the students are between ______ and ______ inches tall.

99.7% of the students are between ______ and ______ inches tall.
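
The blanks can be checked by computing μ ± kσ for k = 1, 2, 3. A small sketch using the numbers from the example (the dictionary layout is just one way to organize it):

```python
mu = 66      # mean height in inches (from the example)
sigma = 3    # standard deviation in inches (from the example)

# Each rule-of-thumb percentage pairs with a number of standard deviations.
rule = {68: 1, 95: 2, 99.7: 3}

intervals = {pct: (mu - k * sigma, mu + k * sigma) for pct, k in rule.items()}

for pct, (low, high) in intervals.items():
    print(f"{pct}% of students: between {low} and {high} inches")
```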

The Standard Normal Distribution: Calculating Normal Distribution
Probabilities using a Z-Table
The Standard Normal Distribution is a particular Normal distribution with a mean
of μ = 0 and a standard deviation of σ = 1.

The probabilities for many Z values can be found by integrating the normal
distribution function. Alternately, we can use probability distribution tables that
have already taken care of the calculus.

Normal distribution tables are given in Appendix A, Table A-1.

Example: Heights of students are distributed normally with a mean of 66 inches
and a standard deviation of 3 inches.
a. Find the proportion of students taller than 70 inches.

b. What height represents the 90th percentile of heights?
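
Software can stand in for the Z-table. The sketch below uses NormalDist from Python's standard statistics module; results may differ from table lookups in the last decimal place because the table rounds z to two decimals.

```python
from statistics import NormalDist

heights = NormalDist(mu=66, sigma=3)   # model from the example

# a. Proportion taller than 70 inches: the right-tail area beyond 70.
z = (70 - heights.mean) / heights.stdev        # standardized value
p_taller_than_70 = 1 - heights.cdf(70)

# b. 90th percentile: the height with 90% of the area to its left.
height_90th = heights.inv_cdf(0.90)

print(f"z = {z:.2f}, P(height > 70) = {p_taller_than_70:.4f}")
print(f"90th percentile = {height_90th:.2f} inches")
```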


Sampling Distribution of the Mean


Here are histograms of the sampling distributions of the average values of one,
two, five, and ten rolls of a die.
[Figure: four histograms, each with vertical axis Probability of Occurrence
(0.00 to 0.18) and horizontal axis ticks from 1.0 to 5.8. Panels: Outcome of
Roll of One Die; Average Outcome on Two Rolls; Average Outcome on Five Rolls;
Average Outcome on Ten Rolls.]

What is the random variable in each case?

What does the sampling distribution represent?

What is happening to the sampling distributions as the sample size increases?


Facts:
1. X̄ is a random variable.
2. Different samples will give different values of X̄.
3. X̄ has a distribution.
4. The probability distribution of X̄ is called the sampling distribution of X̄. It
is called the sampling distribution because the differences in X̄ are due to the
fact that we sampled. If we plot all possible values of X̄ on a histogram
with the height of the bars equal to the probability of occurrence of a particular
value, we will see a picture of the sampling distribution of the sample means.
5. The larger the sample size, the more precisely X̄ estimates μ (equivalently, the
larger the sample size, the smaller the standard error of X̄).
6. The standard deviation of the sampling distribution of X̄ is known as the
standard error of X̄. The standard error of X̄ is equal to σ/√n, where σ is
the standard deviation of the population and n is the sample size.
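
Fact 6 can be checked exactly for the die-rolling illustration. The sketch below builds the exact sampling distribution of the average of n fair-die rolls and compares its standard deviation with σ/√n. The function names are invented for this sketch.

```python
from fractions import Fraction
from math import sqrt

def mean_distribution(n):
    """Exact distribution of the average of n fair-die rolls.

    Returns a dict mapping each possible average to its probability.
    """
    sums = {0: Fraction(1)}                 # distribution of the running total
    for _ in range(n):
        nxt = {}
        for total, p in sums.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0) + p * Fraction(1, 6)
        sums = nxt
    return {Fraction(total, n): p for total, p in sums.items()}

def std_dev(dist):
    """Standard deviation of a discrete distribution {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return sqrt(var)

sigma = std_dev(mean_distribution(1))       # population sd of a single roll

for n in (1, 2, 5, 10):
    se_exact = std_dev(mean_distribution(n))
    print(n, round(se_exact, 4), round(sigma / sqrt(n), 4))
```

The two printed columns agree for every n, and both shrink as n grows.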


Estimating a Mean with Confidence

o A sample statistic is often used to estimate a population parameter. For
example, we often use the sample mean, X̄, to estimate the population mean,
μ. X̄ is known as a point estimate of the population mean.

o The sample mean will change from sample to sample.

o The sample mean has a distribution known as the sampling distribution.

o Because the sample mean will change from sample to sample, it is often more
useful to provide not just a single estimate of the population mean, but a range
of values to estimate μ.

o We use the sampling distribution of X̄ to compute this interval or range of
values to estimate μ.

o This range of values is known as a confidence interval.

Example: Suppose we know that the amount of money spent monthly on credit
cards is normally distributed with an unknown mean. Suppose further (although
this is an unrealistic supposition) that we know the standard deviation of the
amount spent is equal to σ = $50. We randomly sample 100 credit card holders and
compute their average monthly expenditures to be X̄ = $500.
o The sampling distribution of X̄ is normal with mean equal to the unknown
value, μ, and standard deviation equal to

$$ \frac{\sigma}{\sqrt{n}} = \frac{50}{\sqrt{100}} = 5. $$

o We know from the 68-95-99.7 rule that approximately 95% of the time, X̄
will fall within the range

$$ \mu \pm 2 \cdot \frac{50}{\sqrt{100}} = \mu \pm 10. $$

o We can use this fact to determine that approximately 95% of the time, the
unknown population mean will fall within the range

$$ \bar{X} \pm 2 \cdot \frac{50}{\sqrt{100}} = 500 \pm 10 = (\$490, \$510). $$
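
The arithmetic in this example can be reproduced in a few lines. This sketch uses the multiplier 2 from the 68-95-99.7 rule, as the example does, rather than the more exact 1.96.

```python
from math import sqrt

sigma = 50.0    # known population standard deviation (from the example)
n = 100         # sample size
x_bar = 500.0   # observed sample mean

standard_error = sigma / sqrt(n)          # 50 / 10 = 5
margin = 2 * standard_error               # "2" comes from the 68-95-99.7 rule

interval = (x_bar - margin, x_bar + margin)
print(interval)
```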

Reality: The calculations above assume we know the population standard
deviation, σ, which is not likely. Thus, we can rarely use the distribution of the
random variable

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}. $$

Rather, we use the sampling distribution of

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}}. $$

The random variable, t, follows a Student's t-distribution, or simply the
t-distribution.

The difference in the t- and z-distributions:

The z-distribution assumes that the population standard deviation, σ, is
known.
The t-distribution assumes that the population standard deviation is
estimated with s, the sample standard deviation.
The t-distribution looks almost like the standard normal distribution in that it
is symmetric about zero.
However, the tails of the t-distribution are fatter than those of the standard
normal.
This is to take into account the use of the sample standard deviation (s)
instead of the population standard deviation (σ).
The exact spread of the t-distribution depends on the degrees of freedom. In
the case of a sample mean, the degrees of freedom are n - 1.
As the degrees of freedom approach infinity, the t-distribution approaches
the z-distribution.
The t-table can be found in Table A2 of Appendix A in your text.

Example: We are interested in estimating the mean monthly rent for all
unfurnished one-bedroom apartments in the community. We have no other
information. Assume we randomly sample n = 10 apartments advertised in the
local paper. They have the following monthly rental rates.

500  650  600  505  450
550  515  495  650  395

We compute X̄ = 531, and s = 82.79.

To compute a C level confidence interval for estimating μ, we use the following
formula:

This can be rewritten as

The margin of error is given by


Suppose we wish to compute a 95% confidence interval.


1. Find t

2. Compute the interval:

3. Interpret the interval:
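
The computation can be sketched as follows, assuming the usual t-interval, sample mean ± t* · s/√n with n - 1 degrees of freedom. The value t* = 2.262 (95% confidence, 9 degrees of freedom) is hard-coded from a standard t-table because Python's standard library has no t-distribution quantile function.

```python
from math import sqrt
from statistics import mean, stdev

# Monthly rents from the example.
rents = [500, 650, 600, 505, 450, 550, 515, 495, 650, 395]

n = len(rents)
x_bar = mean(rents)            # sample mean
s = stdev(rents)               # sample standard deviation (n - 1 divisor)

# Assumption: t* for 95% confidence with n - 1 = 9 degrees of freedom,
# read from a standard t-table rather than computed here.
t_star = 2.262

margin_of_error = t_star * s / sqrt(n)
interval = (x_bar - margin_of_error, x_bar + margin_of_error)

print(f"mean = {x_bar}, s = {s:.2f}")
print(f"margin of error = {margin_of_error:.2f}")
print(f"95% CI = ({interval[0]:.2f}, {interval[1]:.2f})")
```

The width of the interval is twice the margin of error.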


What is the margin of error for the confidence interval?

What is the width of the confidence interval?

What is the confidence level of the confidence interval?

Suppose the degrees of freedom are not listed on the t-table. In this case, you
choose the next lower value, or preferably, look up a more exact value in a
computer software program.


The Meaning of 95% Confidence
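
One common reading of this heading, offered here as an assumption since the section itself is left blank: if the sampling were repeated many times, about 95% of the intervals built this way would contain μ. A simulation sketch with a made-up normal population and a known σ:

```python
import random
from math import sqrt

rng = random.Random(0)            # fixed seed so the run is repeatable

# Hypothetical population for the simulation (not from the notes).
mu, sigma = 500.0, 50.0
n = 25                            # size of each sample
repetitions = 1000                # number of repeated samples
z_star = 1.96                     # standard normal value for 95% confidence

covered = 0
for _ in range(repetitions):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    margin = z_star * sigma / sqrt(n)     # sigma treated as known
    if x_bar - margin <= mu <= x_bar + margin:
        covered += 1

coverage = covered / repetitions
print(coverage)                   # proportion of intervals that contain mu
```

Any single interval either contains μ or does not. The 95% describes the long-run success rate of the procedure.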


One Sample Mean Test

Do Low Carb Diets Work? In a recent study, 41 overweight subjects were placed
on a low carbohydrate diet but given no limit on caloric intake. The prediction was
that subjects on such a diet would lose weight, on the average. After 16 weeks, the
weight change averaged 21.38 pounds (before-after) with a standard deviation of
7.5 pounds. Does the low carb diet really work?
Here we are conducting a test for a quantitative variable: Amount of weight
lost after 16 weeks on a low carb diet.
The parameter we are testing is the mean weight loss at the end of the 16-week
period for all those who may ever be on the diet. We represent this population
mean with the Greek letter, μ.
State Ho and Ha.
Ho:

Ha:

Compute the test statistic.

The test statistic is a measure of how many standard errors the actual observed
sample mean is from the hypothesized value (according to Ho). We measure this
distance with a t-statistic. The formula for the test statistic is as follows:

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{\bar{x} - \mu_0}{se} $$
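
For the low carb study, the t-statistic can be sketched as below. The hypothesized value μ0 = 0 (no mean weight loss) is an assumption made here, since Ho is left for the reader to state.

```python
from math import sqrt

n = 41           # subjects in the study
x_bar = 21.38    # mean weight change in pounds (before - after)
s = 7.5          # sample standard deviation in pounds

# Assumption: the null hypothesis is "no mean weight loss", i.e. mu_0 = 0.
mu_0 = 0.0

se = s / sqrt(n)                 # standard error of the sample mean
t_stat = (x_bar - mu_0) / se     # standard errors between x_bar and mu_0
df = n - 1                       # degrees of freedom for the t-distribution

print(f"se = {se:.4f}, t = {t_stat:.2f}, df = {df}")
```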


Find the p-value.


The p-value is a probability from the t-distribution. It is the probability,
computed assuming Ho is true, of observing a sample mean in a subsequent sample
that is as extreme as or more extreme than (in the direction of Ha) the one you
observed. In our case, this would represent putting another random 41 subjects
on the same low carb regimen and observing an average weight loss of 21.38
pounds or more.

Base your decision on the p-value.


Small p-values give evidence against the null hypothesis. Large p-values mean
the data do not give evidence against the null hypothesis.
In this case our p-value of .0000 (to four decimal places) means that if indeed
the low carb diet does not work, then there is essentially no chance of observing
another group of 41 subjects on this diet who lost an average of 21.38 pounds or
more.
The near impossibility of repeating this study with similar or more extreme
results suggests that the null hypothesis assumption (the diet doesn't
work) is unlikely to be true.

Decision:

Conclusion:

How small is a small p-value?


We usually have a preset significance level of a test. The significance level is a
small value such as .05 or .10. This value tells us how much evidence we need
against the null hypothesis in order to reject it.
We will reject Ho if the p-value is smaller than the level of significance.
If we reject Ho, we say that the results are statistically significant.


Steps to conducting a hypothesis test:


1. Consider the Assumptions:
a. Quantitative variable with population mean, μ, defined.
b. A random sample
c. The population distribution is approximately normal.

2. State the Hypotheses:

Null:         Ho: μ = μ0
Alternative:  Ha: μ > μ0, or μ < μ0, or μ ≠ μ0

3. Compute the Test Statistic:

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{\bar{x} - \mu_0}{se} $$

4. Find the p-value

Alternative Hypothesis    p-value
Ha: μ > μ0                Right tail probability
Ha: μ < μ0                Left tail probability
Ha: μ ≠ μ0                Two-tail probability

5. Draw a conclusion.
a. If the p-value is smaller than the significance level (α), reject the null
hypothesis in favor of the alternative.
b. If the p-value is larger than the significance level (α), do not reject the
null hypothesis.
c. Relate the conclusion to the context of study.
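
Steps 4 and 5 can be sketched as two small functions. The names and the "greater" / "less" / "two-sided" labels are made up for this sketch. The default CDF is the standard normal, used only as a large-sample stand-in for the t-distribution; for small samples a t CDF from statistical software should be supplied instead.

```python
from statistics import NormalDist

def p_value(t_stat, alternative, cdf=NormalDist().cdf):
    """Tail probability for a test statistic.

    `alternative` is "greater", "less", or "two-sided" (hypothetical labels).
    `cdf` defaults to the standard normal as a large-sample stand-in for the
    t-distribution; pass a t CDF for small samples.
    """
    if alternative == "greater":          # Ha: mu > mu_0, right tail
        return 1 - cdf(t_stat)
    if alternative == "less":             # Ha: mu < mu_0, left tail
        return cdf(t_stat)
    if alternative == "two-sided":        # Ha: mu != mu_0, both tails
        return 2 * (1 - cdf(abs(t_stat)))
    raise ValueError(f"unknown alternative: {alternative}")

def reject_null(p, alpha=0.05):
    """Reject Ho when the p-value is smaller than the significance level."""
    return p < alpha

p = p_value(2.5, "two-sided")
print(round(p, 4), reject_null(p, alpha=0.05))
```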


Example: Is the ideal number of children in a family 2.0? The responses to a
recent GSS question, "What do you think is the ideal number of children to
have?", result in a mean of 2.49 and a standard deviation of .85 in a sample of
1302 respondents. A statistical software package gives the following output.
Hypothesis test results:
μ : population mean
H0 : μ = 2
HA : μ ≠ 2

Mean  Sample Mean  Std. Err.    DF    T-Stat     P-value
μ     2.49         0.023556644  1301  20.800924  <0.0001

Assumptions:

State Ho and Ha.

Compute the test statistic.
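
The Std. Err., DF, and T-Stat columns of the software output can be reproduced from the summary numbers in the example. A short sketch:

```python
from math import sqrt

n = 1302         # respondents
x_bar = 2.49     # sample mean
s = 0.85         # sample standard deviation
mu_0 = 2.0       # hypothesized mean under H0

std_err = s / sqrt(n)                  # standard error of the sample mean
t_stat = (x_bar - mu_0) / std_err      # test statistic
df = n - 1                             # degrees of freedom

print(f"Std. Err. = {std_err:.9f}, DF = {df}, T-Stat = {t_stat:.4f}")
```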


Find the p-value:

Conclusion:


Example: Suppose a school within a university is going to require a minimum
GPA of 2.5 to enter that school. Suppose the students enrolled in an introductory
course represent a random selection of students who desire entry into the school.
A survey gives the following results. Use the output to test whether or not the
mean GPA of all students desiring entry into the school differs from 2.5.
Hypothesis test results:
μ : mean of Variable
H0 : μ = 2.5
HA : μ ≠ 2.5

Variable  Sample Mean  Std. Err.   DF   T-Stat     P-value
CUM_GPA   2.8846443    0.02866734  266  13.417506  <0.0001


[Figure: Histogram with Frequency (0 to 60) and Cumulative % (0.00% to 100.00%)
plotted over bins from 1.8 to 4 in steps of 0.2.]

Assumptions:


State Ho and Ha.

Compute the test statistic.

Find the p-value:

Conclusion:


Hypothesis Tests and Confidence Intervals are Equivalent When

The confidence level = 1 - α.
The test is two-tailed.

99% confidence interval results:
μ : mean of Variable

Variable  Sample Mean  Std. Err.   DF   L. Limit   U. Limit
CUM_GPA   2.8846443    0.02866734  266  2.8102686  2.95902

Compute the Confidence Interval:

Interpret the Confidence Interval:
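
The equivalence can be illustrated with the numbers from the output: a two-tailed test at level α rejects Ho exactly when the hypothesized value lies outside the (1 - α) confidence interval. The sketch takes the limits from the output as given and backs out the critical t-value they imply.

```python
# Values copied from the 99% confidence interval output.
sample_mean = 2.8846443
std_err = 0.02866734
lower, upper = 2.8102686, 2.95902

mu_0 = 2.5        # hypothesized mean from the earlier two-tailed test
alpha = 0.01      # significance level matching 99% confidence (1 - alpha)

# A two-tailed test at level alpha rejects H0 exactly when mu_0 lies
# outside the (1 - alpha) confidence interval.
reject_h0 = not (lower <= mu_0 <= upper)

# The critical t-value implied by the output: margin of error / standard error.
t_star = (upper - sample_mean) / std_err

print(reject_h0, round(t_star, 3))
```

Because 2.5 falls below the lower limit, the interval and the test (p-value < 0.0001) lead to the same decision at α = 0.01.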


Other Important Probability Models

We have discussed two very important continuous probability models, the normal
(standard normal) and the t-distribution. The Chi-Square and F-distributions are
other important continuous probability distributions used in statistical modeling.

Chi-Square

The shape of the curve is right-skewed, and it has a lower bound of zero.
The shape of the curve depends heavily on the number of degrees of
freedom.
We are usually interested in right-tail areas of the chi-square distribution.

F-Distribution

The density curve of the F-distribution is right-skewed, and has a lower bound
of zero.
The shape of the curve depends heavily on two sets of degrees of freedom,
listed above as r1 and r2.
The F-distribution is also called the variance ratio distribution because it is
derived from the ratio of two independent chi-square random variables, each
divided by its degrees of freedom.
The value, r1, corresponds to the numerator degrees of freedom, or the
numerator of the ratio.
The value, r2, corresponds to the denominator degrees of freedom, or the
denominator of the ratio.

Table A3 in Appendix A (594-597) gives the upper tail critical values for the
F-distribution for several sets of numerator and denominator degrees of
freedom.
If the exact degrees of freedom combination you need is not in the table, you
should round down to the closest value, or preferably, use a software
package to obtain a more accurate calculation.
Example: Find the 95th percentile of an F-distribution with numerator d.f. = 2, and
denominator d.f. = 10.

Find the 99th percentile of an F-distribution with numerator d.f. = 2 and
denominator d.f. = 10.

Find the 99th percentile of an F-distribution with numerator d.f. = 33 and
denominator d.f. = 35.
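
For the two exercises with numerator d.f. = 2, table lookups can be checked with a closed form that holds only in that special case. The function name is made up for this sketch. The (33, 35) case has no such shortcut and needs the table (rounding down) or statistical software.

```python
def f_percentile_numerator_df_2(p, denom_df):
    """p-th quantile of an F-distribution with numerator d.f. = 2.

    Uses the closed form that holds only when the numerator d.f. is 2:
    P(F <= x) = 1 - (1 + 2x / denom_df) ** (-denom_df / 2).
    """
    return (denom_df / 2) * ((1 - p) ** (-2 / denom_df) - 1)

f_95 = f_percentile_numerator_df_2(0.95, 10)
f_99 = f_percentile_numerator_df_2(0.99, 10)

print(f"95th percentile, d.f. (2, 10): {f_95:.2f}")
print(f"99th percentile, d.f. (2, 10): {f_99:.2f}")
```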
