Lab 1: Introduction to Statistics
Fall 2010
Variables
A variable is anything we can measure/observe.
Three types:
Continuous: values span an uninterrupted range (e.g. height)
Discrete: only certain fixed values are possible (e.g. counts)
Categorical: values are qualitatively assigned (e.g. low/med/hi)
Dependence in variables:
Dependent variables depend on independent ones.
The independent variable may or may not be prescribed experimentally.
Determining dependence is often not trivial!
Descriptive Statistics
Descriptive statistics: techniques to summarize data.
Numerical: mean, variance, standard deviation, standard error, median, mode, skew, etc.
Graphical: histogram, boxplot, scatterplot, etc.
The Mean:
Most important measure of central tendency.
Population Mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$
The Mean:
Most important measure of central tendency.
Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Example data (n = 16, so n is even):
1, 1, 2, 4, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9, 9, 10
Which to use: mean, median, or mode?
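The lab itself uses Excel, but here is a minimal Python sketch (standard-library statistics module) that computes all three measures for the example data above:

```python
import statistics

# Example data from the slide (n = 16, so n is even)
data = [1, 1, 2, 4, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9, 9, 10]

print(statistics.mean(data))    # 6.125
print(statistics.median(data))  # 7.0 (average of the two middle values)
print(statistics.mode(data))    # 7 (appears four times)
```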
Variance:
Most important measure of dispersion.
Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Variance:
Most important measure of dispersion.
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
From now on, we'll ignore the sample vs. population distinction. But remember:
We are almost always interested in the population, but can measure only a sample.
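A minimal numpy sketch of the two formulas (ddof=1 puts n - 1 in the denominator, giving the sample version; ddof=0, the default, gives the population version):

```python
import numpy as np

data = np.array([1, 1, 2, 4, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9, 9, 10])

pop_var = np.var(data)             # population variance: divide by N
samp_var = np.var(data, ddof=1)    # sample variance: divide by n - 1
samp_sd = np.std(data, ddof=1)     # sample standard deviation
se = samp_sd / np.sqrt(len(data))  # standard error of the mean
print(pop_var, samp_var, samp_sd, se)
```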
Additional dispersion measures
Standard deviation: average distance from the mean.
$s = \sqrt{s^2}$ (duh!)
Standard error: the accuracy of our estimate of the population mean.
$SE = \frac{s}{\sqrt{n}}$
Bigger sample size (n) → smaller error.
Additional dispersion measures
Coefficient of Variation: standard deviation scaled to the data.
$V = \frac{s}{\bar{X}}$
Example: the same weights expressed in kg vs. g give the same V:
$\bar{X}$ = 1.8 kg, s = 0.43 kg → V = 24%
$\bar{X}$ = 1800 g, s = 432 g → V = 24%
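A minimal numpy sketch showing that V is unitless (the same measurements in kg and in g give identical V):

```python
import numpy as np

kg = np.array([1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])  # weights in kg
g = kg * 1000                                        # same weights in g

for x in (kg, g):
    v = np.std(x, ddof=1) / np.mean(x)
    print(f"mean={np.mean(x):.1f}, CV={100 * v:.0f}%")  # CV = 24% both times
```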
Graphical Statistics
Constructing a Histogram is Easy
Data (X): 7.4, 7.6, 8.4, 8.9, 10.0, 10.3, 11.5, 11.5, 12.0, 12.3
[Figure: histogram of X; x-axis = Value (6 to 14), y-axis = Frequency (count)]
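A minimal matplotlib sketch that produces this kind of plot (the bin edges are an assumption; the slide's exact binning is not recoverable):

```python
import matplotlib.pyplot as plt

x = [7.4, 7.6, 8.4, 8.9, 10.0, 10.3, 11.5, 11.5, 12.0, 12.3]

plt.hist(x, bins=range(6, 15, 2), edgecolor="black")  # assumed bin edges 6, 8, 10, 12, 14
plt.title("Histogram of X")
plt.xlabel("Value")
plt.ylabel("Frequency (count)")
plt.show()
```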
Interpreting Histograms
Mean? Median? Mode?
Standard deviation? Variance?
Skew? (which way does the tail point?)
Shape?
[Figure: histogram; x-axis = Value (0 to 60), y-axis = Frequency (0 to 40)]
Interpreting Histograms
Mean? = 9.2
Median? = 6.5
Mode? = 3
Standard deviation? = 8.3
Variance? = 8.3² ≈ 69
Skew? (which way does the tail point?) Mean > median > mode, so the tail points right: positive skew.
Shape?
[Figure: same histogram; x-axis = Value (0 to 60), y-axis = Frequency (0 to 40)]
An alternative: Boxplots
[Figure: the histogram (Frequency vs. Value) shown alongside its boxplot]
Boxplots also summarize a lot of information...
[Figure: boxplots of Weight (kg), 1 to 6, by Island (SC, Ba, Se, Es, Da, Is, Fe); each box marks the 75th percentile, median, and 25th percentile]
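A minimal matplotlib sketch of a comparable boxplot (the island weight data are not recoverable from the slide, so the samples below are randomly generated stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
islands = ["SC", "Ba", "Se", "Es", "Da", "Is", "Fe"]
# Hypothetical weight samples, one per island (illustration only)
weights = [rng.normal(3.5, 0.8, size=25) for _ in islands]

plt.boxplot(weights)  # each box shows the 25th/50th/75th percentiles
plt.xticks(range(1, len(islands) + 1), islands)
plt.xlabel("Island")
plt.ylabel("Weight (kg)")
plt.show()
```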
Normality
[Figure: normal (Gaussian) curve; ~68% of values fall within 1 SD of the mean, ~95% within 2 SD, >99% within 3 SD]
Amazing! Handy! Important!
It's OK!
Although lots of data are Gaussian (because of the CLT), many simply aren't.
Example: fire return intervals
Solutions:
Transform the data to make them normal (e.g. take logs; see the sketch below)
Use a test that doesn't assume normal data
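A minimal scipy sketch of the log-transform solution; the lognormal sample is made-up stand-in data for something right-skewed like fire return intervals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
intervals = rng.lognormal(mean=3.0, sigma=0.5, size=100)  # right-skewed data

# Shapiro-Wilk test: H0 = "data are normal"
stat, p = stats.shapiro(intervals)
print(p)  # typically tiny: reject normality

stat_log, p_log = stats.shapiro(np.log(intervals))
print(p_log)  # typically large: log-transformed data look normal
```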
Break!
Inferential Statistics: Introduction
Statistical Hypotheses
Should be related to a scientific hypothesis!
Very often presented in pairs:
Null Hypothesis (H0): usually the boring hypothesis of no difference
Alternative Hypothesis (HA): the hypothesis that there is a difference
Significance
Your sample will never match H0 perfectly, even when H0 is in fact true.
The question is whether your sample is different enough from the expectation under H0 to be considered significant.
If your test finds a significant difference, then you reject H0.
p-Values Measure Significance
The p-value of a test is the probability of observing data at least as extreme as your sample, assuming H0 is true.
If p is very small, it is unlikely that H0 is true.
(In other words, if H0 were true, your observed sample would be unlikely.)
"Proof" in statistics
Failing to reject (i.e. accepting) H0 does not prove that H0 is true!
And accepting HA doesn't prove that HA is true either!
Why?
Statistical inference tries to draw conclusions about the population from a small sample.
By chance, the samples may be misleading.
Example: if you always reject H0 at p = 0.05, then when H0 is actually true you will wrongly reject it 1 time in 20!
Assumptions of inferential statistics
All inferential tests are based on assumptions.
If your data cannot meet the assumptions, the test results are invalid!
In particular:
Inferential tests assume random sampling.
Many tests assume the data fit a theoretical distribution (often normal).
These are parametric tests.
Luckily, there are non-parametric alternatives.
Inferential Statistics: Methods
Student's t-Test
Student's t-test
Several versions, all using inference on a sample to test whether the true population mean (μ) is different from __________
The one-sample version tests whether the population mean equals a specified value, e.g. H0: μ = 0
t-Test Methods
Compute t.
For a one-sample test: $t = \frac{\bar{x} - \mu}{SE}$
Remember: $SE = \frac{s}{\sqrt{n}}$
For a two-sample test: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$, where $s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
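A minimal scipy sketch of both versions (the data arrays are made up; equal_var=False matches the unpooled standard error formula above):

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])  # hypothetical data

# One-sample test of H0: mu = 4.0
t, p = stats.ttest_1samp(sample, popmean=4.0)

# Two-sample test of H0: mu1 = mu2, using the unpooled SE as in the formula above
other = np.array([5.1, 5.9, 4.8, 5.5, 6.0, 5.3])
t2, p2 = stats.ttest_ind(sample, other, equal_var=False)
print(t, p, t2, p2)
```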
t-Test Methods
Once you have t, use a look-up table or computer to check for significance.
Significance depends on degrees of freedom (basically, sample size).
A bigger difference in means and a bigger sample size both improve the ability to reject H0.
[Figure: t distribution with the two rejection regions (tails) shaded red at α = 0.05]
If the null-hypothesis mean is very unlikely to fall under the curve, we reject H0.
Reject H0 if your t falls in one of the red rejection regions.
Try it!
Work through the Excel exercises for one- and two-sample t-tests now.
ANOVA
ANOVA: ANalysis Of VAriance
Tests for a significant effect of 1 or more factors.
Each factor may have 2 or more levels.
Can also test for interactions between factors.
For just 1 factor with 2 levels, ANOVA = t-test.
So why can't we just do lots of t-tests for more complicated experiments? (Each extra test inflates the overall chance of a false positive.)
ANOVA: Concept
Despite the name, ANOVA really looks for differences in means between groups (factors & levels).
To do so, we partition the variability in our data into:
(1) The variability that can be explained by factors
(2) The leftover unexplained variability (error or residual variability)
[Figure: growth (y) by soil type; square = clay soil, diamond = sand, triangle = loam soil; each replicate = one plot]
We can then go back and see which groups differ (e.g. by t-test).
Source   SS      df   MS     F      p
Soil     99.2    2    49.6   4.24   0.025
Error    315.5   27   11.7
Total    414.7   29

What do we conclude?
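A minimal scipy sketch of a one-way ANOVA like this (the three growth samples are made up; scipy.stats.f_oneway returns the F statistic and p-value):

```python
from scipy import stats

# Hypothetical growth measurements for 3 soil types, 10 plots each
clay = [12.1, 9.8, 11.5, 13.0, 10.2, 12.8, 9.5, 11.1, 10.9, 12.4]
sand = [8.2, 7.5, 9.1, 8.8, 7.9, 9.4, 8.0, 7.2, 8.6, 9.0]
loam = [13.5, 12.9, 14.2, 13.8, 12.5, 14.0, 13.1, 12.2, 13.9, 14.4]

f, p = stats.f_oneway(clay, sand, loam)
print(f, p)  # reject H0 (all group means equal) if p < 0.05
```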
Try it!
Work through the Excel exercise for ANOVA now.
Chi-square ($\chi^2$) Test
$\chi^2 = \sum \frac{(O - E)^2}{E}$
$\chi^2$ Example

           PB   fish   cheese   ?    Total
Observed   31   69     35       65   200
Expected   50   50     50       50   200
Then we compute $\chi^2$:
$\chi^2 = \frac{(31-50)^2}{50} + \frac{(69-50)^2}{50} + \frac{(35-50)^2}{50} + \frac{(65-50)^2}{50} = 23.4$
$\chi^2$ Example
In our example: $\chi^2 = 23.4$ and df = 4 - 1 = 3
The critical value is $\chi^2_{0.05,3} = 7.815$
Since 23.4 > 7.815, we reject H0.
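A minimal scipy sketch that reproduces this test (scipy.stats.chisquare takes the observed and expected counts):

```python
from scipy import stats

observed = [31, 69, 35, 65]
expected = [50, 50, 50, 50]

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2)  # 23.44
print(p)     # well below 0.05, so reject H0

# Critical value for alpha = 0.05, df = 3
print(stats.chi2.ppf(0.95, df=3))  # 7.815
```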
Try it!
Work through the Excel exercise for the Chi-square test now.
Correlation
Correlation measures the strength of the relationship between two variables.
When X gets larger, does Y consistently get larger (or smaller)?
Correlation
$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
Numerator: the amount that X and Y vary together.
Denominator: the total amount of variability in X and Y.
$-1 \le r \le 1$
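A minimal numpy sketch (the data are made up; np.corrcoef returns the full correlation matrix, and the off-diagonal entry is r):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
print(r)  # close to +1: strong positive correlation
```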
Correlation Examples
[Figure: example scatterplots with different r values]
Correlation Cautions
Correlation does not imply causality!
(Note: it doesn't matter which data are X vs. Y.)
r can be misleading:
it implies nothing about slope
it is blind to outliers and obvious nonlinear relationships
[Figure: several scatterplots with the same r in each panel!]
Try it!
Work through the Excel exercise for correlations now.
Regression
Unlike correlation, regression does imply a functional relationship between variables.
The dependent variable is a function of the independent variable(s).
Regression
There are many possible relationships between two variables...
Linear
Exponential
Quadratic
Hyperbolic
Logistic
Trigonometric
Regression
We'll focus on simple linear regression of one dependent (Y) and one independent (X) variable.
The model is:
$Y = a + bX + \epsilon$
Potential Regression Outcomes
Positive relationship: b > 0
Negative relationship: b < 0
No relationship: b = 0
How do we fit a Regression?
[Figure: scatterplot of data points (X1, Y1), (X2, Y2), ..., (X6, Y6)]
How do we fit a Regression?
Find the individual residuals ($\epsilon_i$):
$\epsilon_i = Y_i - \hat{Y}_i$ (observed value minus the value predicted by the regression line)
Then form the sum of squared residuals: $\sum_i (Y_i - \hat{Y}_i)^2$
A computer or clever mathematician can find the a and b that minimize this expression (producing the least-squares best-fit line).
Regression Example

Altitude (m)   Temperature (°C)
0              25.0
50             24.1
190            23.5
305            21.2
456            20.6
501            20.0
615            18.5
700            17.2
825            17.0
Regression Example
From the data on the previous slide, we fit a regression line and obtain these parameter estimates:
a = 24.88
b = -0.0101
Thus, our best-fit line has the equation:
temperature = 24.88 - 0.0101 × altitude
Does this work if we change units?
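A minimal scipy sketch that fits this line from the example data (scipy.stats.linregress returns the intercept a and slope b):

```python
import numpy as np
from scipy import stats

altitude = np.array([0, 50, 190, 305, 456, 501, 615, 700, 825])
temperature = np.array([25.0, 24.1, 23.5, 21.2, 20.6, 20.0, 18.5, 17.2, 17.0])

fit = stats.linregress(altitude, temperature)
print(fit.intercept)  # a = 24.88
print(fit.slope)      # b = -0.0101
```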
[Figure: scatterplot with points tightly clustered around the fitted line] Good fit
[Figure: scatterplot with points loosely scattered around the fitted line] OK fit (I'll take it!)
Regression Reservations
Again: regression does imply causality (unlike correlation), but importantly, it still does not test a causal biological relationship.
Some other variable might affect both X and Y.
You could even have the relationship backwards!
Try it!
Work through the Excel exercise for simple linear regression now.