Overview of Data
Data: Singular or Plural?
Data Set:

Height  Gender  Hair_$  Job?  Studytime  Smoker?  Dates  HS_GPA  CUM_GPA
65      female  40      no    3.0        no       0      3.0     2.4
70      male    25      no    4.0        no       0      3.5     3.3
67      female  50      yes   2.0        no       1      3.8     3.3
69      female  60      no    8.0        no       1      4.0     3.9
66      female  45      no    10.0       no       2      3.6     3.2
Observation:
Variable:
Univariate Data:
Bivariate Data:
Multivariate Data:
Time Series Data: same variable measured at regular intervals over time, usually
on the same unit. The plot below shows a time series.
[Figure: time series plot; horizontal axis: Year, 1900 to 2000; vertical axis: values from 49 to 58]
Sampling Concepts
Sample vs. Census.
Why can't the United States Census survey every person in the population?
responses, etc.
The census is currently conducted by a mail questionnaire. The
response rate for the census was around 65% in 2000.
Census takers make personal visits to households who do not return
the questionnaire.
Many are simply not counted in the census. Who might not be
counted?
It is often more practical to sample rather than trying to take a census.
[Figure: a sample drawn from a population]
Target Population
The population must be carefully specified and the sample must be drawn
scientifically so that the sample is representative.
The target population is the population we are interested in (e.g., U.S.
gasoline prices).
The sampling frame is the group from which we take the sample (e.g.,
115,000 stations).
The frame should not differ from the target population.
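As a rough illustration of drawing a sample from a sampling frame, the sketch below takes a simple random sample with Python's standard `random` module. The station ID numbers, the sample size of 100, and the seed are hypothetical choices made only for this example, and simple random sampling is just one scientific way to draw a sample.

```python
import random

# Hypothetical sampling frame: ID numbers for 115,000 gasoline stations.
frame = list(range(1, 115_001))

random.seed(1)                         # fixed seed so the sketch is repeatable
sample = random.sample(frame, k=100)   # simple random sample, without replacement

print(len(sample))           # number of stations selected
print(sorted(sample)[:5])    # a few of the selected station IDs
```

Every station in the frame has the same chance of selection, which is what makes the sample representative of the frame on average.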
Parameter:
Statistic:
the grade a randomly selected student will receive on the next exam
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \quad \text{for } -\infty < y < \infty.$$
It is commonly called the bell curve because of its symmetric bell shape.
68% of the students are between ______ and ______ inches tall.
95% of the students are between ______ and ______ inches tall.
99.7% of the students are between ______ and ______ inches tall.
The probabilities for many Z values can be found by integrating the normal
density function. Alternatively, we can use probability distribution tables that
have already taken care of the calculus.
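Software can stand in for the table as well. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `prob_within` and the choice of 1.96 as a lookup value are ours, for illustration only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def prob_within(k):
    """P(-k < Z < k) for a standard normal Z."""
    return z.cdf(k) - z.cdf(-k)

# The areas behind the 68-95-99.7 rule.
for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))

# A left-tail area of the kind a Z-table lists, here P(Z < 1.96).
p_left = z.cdf(1.96)
print(round(p_left, 4))
```

The three printed areas are the exact values that the 68-95-99.7 rule rounds off.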
[Figure: histograms; vertical axis: Probability of Occurrence, 0.00 to 0.18; horizontal axis: values from 1.0 to 5.8]
Facts:
1. $\bar{X}$ is a random variable.
2. Different samples will give different values of $\bar{X}$.
3. $\bar{X}$ has a distribution.
4. The probability distribution of $\bar{X}$ is called the sampling distribution of
$\bar{X}$. It is called the sampling distribution because the differences in
$\bar{X}$ are due to the fact that we sampled. If we plot all possible values of
$\bar{X}$ on a histogram with the height of the bars equal to the probability of
occurrence of a particular value, we will see a picture of the sampling
distribution of the sample means.
5. The larger the sample size, the more precisely $\bar{X}$ estimates $\mu$
(equivalently, the larger the sample size, the smaller the standard error of
$\bar{X}$).
6. The standard deviation of the sampling distribution of $\bar{X}$ is known as the
standard error of $\bar{X}$. The standard error of $\bar{X}$ is equal to
$\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is
the sample size.
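Facts 5 and 6 can be checked by simulation. The sketch below is only an illustration: the normal population, its mean of 3.4 and standard deviation of 1.0, the sample sizes, and the number of repetitions are all assumed values, not taken from the notes. It draws many samples of each size, computes the standard deviation of the resulting sample means, and compares it with $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

random.seed(0)
MU, SIGMA = 3.4, 1.0   # hypothetical population mean and standard deviation

def simulated_standard_error(n, reps=2000):
    """Standard deviation of the sample mean across many samples of size n."""
    means = [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.stdev(means)

results = {}
for n in (4, 25, 100):
    # Store (simulated standard error, theoretical sigma / sqrt(n)).
    results[n] = (simulated_standard_error(n), SIGMA / math.sqrt(n))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

The simulated and theoretical columns should agree closely, and both shrink as the sample size grows.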
o Because the sample mean will change from sample to sample, it is often more
useful to provide not just a single estimate of the population mean, but a range
of values to estimate $\mu$.
Example: Suppose we know that the amount of money spent monthly on credit
cards is normally distributed with an unknown mean. Suppose further (although
this is an unrealistic supposition) that we know the standard deviation of the
amount spent is equal to $\sigma = \$50$. We randomly sample 100 credit card
holders and compute their average monthly expenditures to be $\bar{X} = \$500$.
o The sampling distribution of $\bar{X}$ is normal with mean equal to the unknown
value, $\mu$, and standard deviation equal to
$\sigma/\sqrt{n} = 50/\sqrt{100} = 5$.
o We know from the 68-95-99.7 rule that approximately 95% of the time, $\bar{X}$
will fall within the range $\mu \pm 2 \cdot 50/\sqrt{100} = \mu \pm 10$.
o We can use this fact to determine that intervals of the form $\bar{X} \pm 10$
will contain the unknown population mean approximately 95% of the time. For our
sample, the range is
$\bar{X} \pm 2 \cdot 50/\sqrt{100} = 500 \pm 10 = (\$490, \$510)$.
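The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch using only the numbers given in the example; the variable names are ours.

```python
import math

sigma = 50.0   # known population standard deviation, in dollars
n = 100        # sample size
x_bar = 500.0  # observed sample mean, in dollars

standard_error = sigma / math.sqrt(n)
margin = 2 * standard_error   # the multiplier 2 comes from the 68-95-99.7 rule
interval = (x_bar - margin, x_bar + margin)

print(standard_error)  # 5.0
print(interval)        # (490.0, 510.0)
```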
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}.$$
Example: We are interested in estimating the mean monthly rent for all
unfurnished one bedroom apartments in the community. We have no other
information. Assume we randomly sample n = 10 apartments advertised in the
local paper. They have the following monthly rental rates.
500, 550, 650, 515, 600, 495, 505, 650, 450, 395
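One way to turn these ten rents into an interval estimate is the usual one-sample t-interval, $\bar{x} \pm t^* \, s/\sqrt{n}$. The sketch below assumes a 95% confidence level, which the example does not state, and hard-codes the corresponding t-table value for 9 degrees of freedom (2.262) rather than computing it.

```python
import math
import statistics

rents = [500, 550, 650, 515, 600, 495, 505, 650, 450, 395]

n = len(rents)
x_bar = statistics.mean(rents)   # sample mean
s = statistics.stdev(rents)      # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)            # estimated standard error of the mean

# ASSUMPTION: 95% confidence, so the multiplier is the t-table value
# for n - 1 = 9 degrees of freedom with 0.025 in each tail.
T_CRIT_DF9 = 2.262

margin = T_CRIT_DF9 * se
lower, upper = x_bar - margin, x_bar + margin

print(f"mean = {x_bar}, s = {s:.2f}, se = {se:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

With only 10 observations and no known population standard deviation, the t multiplier is used in place of the 2 from the 68-95-99.7 rule.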
Suppose the degrees of freedom you need are not listed on the t-table. In this
case, choose the next lower value or, preferably, look up a more exact value
using statistical software.
Do Low Carb Diets Work? In a recent study, 41 overweight subjects were placed
on a low carbohydrate diet but given no limit on caloric intake. The prediction was
that subjects on such a diet would lose weight, on average. After 16 weeks, the
weight change averaged 21.38 pounds (before minus after) with a standard
deviation of 7.5 pounds. Does the low carb diet really work?
Here we are conducting a test for a quantitative variable: Amount of weight
lost after 16 weeks on a low carb diet.
The parameter we are testing is the mean weight loss at the end of the 16 week
period for all those who may ever be on the diet. We represent this population
mean with the Greek letter $\mu$.
State Ho and Ha.
Ho:
Ha:
$$t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
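To see the test statistic in action for the low carb data, the sketch below plugs in the summary numbers from the study. The null value $\mu_0 = 0$ (no mean weight loss) is our assumption about how the hypotheses would be filled in; the notes leave Ho and Ha blank.

```python
import math

n = 41          # number of subjects
x_bar = 21.38   # mean weight change (before minus after), in pounds
s = 7.5         # sample standard deviation, in pounds
mu_0 = 0.0      # ASSUMPTION: null value, "no mean weight loss"

se = s / math.sqrt(n)          # standard error of the sample mean
t_stat = (x_bar - mu_0) / se   # one-sample t statistic
df = n - 1                     # degrees of freedom

print(f"se = {se:.3f}, t = {t_stat:.2f}, df = {df}")
```

A t statistic this far from zero corresponds to a very small p-value, which would then be looked up in a t-table or computed with software.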
Decision:
Conclusion:
Ho: $\mu = \mu_0$
Alternative:
Ha: $\mu > \mu_0$, or $\mu < \mu_0$, or $\mu \neq \mu_0$
3. Compute the Test Statistic:
$$t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
Hypothesis              p-value                 Graph
Ha: $\mu > \mu_0$
Ha: $\mu < \mu_0$
Ha: $\mu \neq \mu_0$    Two-tail probability
5. Draw a conclusion.
a. If the p-value is smaller than the significance level ($\alpha$), reject the null
hypothesis in favor of the alternative.
b. If the p-value is larger than the significance level ($\alpha$), do not reject the
null hypothesis.
c. Relate the conclusion to the context of study.
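The decision rule in steps a and b can be written as a small function. This is a sketch only: the function name `decide` and the default significance level of 0.05 are our choices for illustration, and step c (relating the conclusion to the context of the study) still has to be done in words.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule at significance level alpha."""
    if p_value < alpha:
        return "reject Ho in favor of Ha"
    return "do not reject Ho"

print(decide(0.003))        # reject Ho in favor of Ha
print(decide(0.20))         # do not reject Ho
print(decide(0.03, 0.01))   # do not reject Ho
```

The last call shows that the same p-value can lead to different decisions under different significance levels, so $\alpha$ should be chosen before looking at the data.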
DF    T-Stat    P-value
Assumptions:
Conclusion:
Std. Err.    DF    T-Stat    P-value
[Figure: histogram with cumulative percentage line; horizontal axis: bins, 1.8 to 4; left vertical axis: Frequency, 0 to 60; right vertical axis: Cumulative %, 0.00% to 100.00%]
Assumptions:
Conclusion:
DF L. Limit U. Limit
We have discussed two very important continuous probability models, the normal
(standard normal) and the t-distribution. The Chi-Square and F-distributions are
other important continuous probability distributions used in statistical modeling.
Chi-Square
The shape of the curve is right-skewed, and it has a lower bound of zero.
The shape of the curve depends heavily on the number of degrees of
freedom.
We are usually interested in right-tail areas of the chi-square distribution.
F-Distribution
The density curve of the F-distribution is right-skewed, and has a lower bound
of zero.
The shape of the curve depends heavily on two sets of degrees of freedom,
listed above as r1 and r2.
The F-distribution is also called the variance ratio distribution because it is
derived from the ratio of two independent chi-square random variables, each
divided by its degrees of freedom.
The value, r1, corresponds to the numerator degrees of freedom, or the
numerator of the ratio.
The value, r2, corresponds to the denominator degrees of freedom, or the
denominator of the ratio.
Table A3 in Appendix A (594-597) gives the upper tail critical values for the
F-distribution for several sets of numerator and denominator degrees of
freedom.
If the exact degrees of freedom combination you need is not in the table, you
should round down to the closest value, or preferably, use a software
package to obtain a more accurate calculation.
Example: Find the 95th percentile of an F-distribution with numerator d.f. = 2, and
denominator d.f. = 10.
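This particular percentile can be checked without a table. When the numerator d.f. equals 2, the F-distribution's cumulative distribution function has the closed form $1 - (1 + 2x/r_2)^{-r_2/2}$, which can be inverted directly. The sketch below relies on that special case, so it does not apply to other numerator degrees of freedom; the function names are ours, and for general cases a table or a statistics package is the practical route.

```python
def f_cdf_numerator_df_2(x, d2):
    """CDF of an F-distribution with numerator d.f. = 2 and denominator d.f. = d2."""
    return 1.0 - (1.0 + 2.0 * x / d2) ** (-d2 / 2.0)

def f_percentile_numerator_df_2(p, d2):
    """Invert the CDF above: the value with area p to its left."""
    return (d2 / 2.0) * ((1.0 - p) ** (-2.0 / d2) - 1.0)

f_95 = f_percentile_numerator_df_2(0.95, 10)
print(round(f_95, 2))                            # 4.1
print(round(f_cdf_numerator_df_2(f_95, 10), 4))  # 0.95
```

The result, about 4.10, is the value that leaves an upper-tail area of 0.05 for 2 and 10 degrees of freedom.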