
3. Descriptive Statistics

Describing data with tables and graphs (quantitative or categorical variables)

Numerical descriptions of center, variability, position (quantitative variables)
1. Tables and Graphs

Frequency distribution: Lists the possible values of a variable and the number of times each occurs

Example: Student survey (n = 60)
www.stat.ufl.edu/~aa/social/data.html

Political ideology measured as an ordinal variable with 1 = very liberal, ..., 4 = moderate, ..., 7 = very conservative
Histogram: Bar graph of frequencies or percentages

Shapes of histograms (for quantitative variables):
Bell-shaped (IQ, SAT, political ideology in the entire U.S.)
Skewed right (annual income, number of times arrested)
Skewed left (score on an easy exam)
Bimodal (polarized opinions)

Ex. GSS data on sex before marriage in Exercise 3.73: always wrong, almost always wrong, wrong only sometimes, not wrong at all

Stem-and-leaf plot (John Tukey, 1977)

Example: Exam scores (n = 40 students)

Stem Leaf
3 6
4
5 37
6 235899
7 011346778999
8 00111233568889
9 02238
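A stem-and-leaf plot like this is easy to generate in code. The sketch below reads the 40 scores off the display above and rebuilds it; the function is illustrative, not part of the original slides:

```python
# Build a stem-and-leaf plot for the exam scores shown above
# (scores reconstructed from the displayed stems and leaves).
scores = [36, 53, 57, 62, 63, 65, 68, 69, 69,
          70, 71, 71, 73, 74, 76, 77, 77, 78, 79, 79, 79,
          80, 80, 81, 81, 81, 82, 83, 83, 85, 86, 88, 88, 88, 89,
          90, 92, 92, 93, 98]

def stem_and_leaf(data):
    lines = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # e.g. 76 -> stem 7, leaf 6
        lines.setdefault(stem, []).append(str(leaf))
    # Include empty stems so gaps in the data stay visible (like stem 4 above)
    lo, hi = min(lines), max(lines)
    return ["%d | %s" % (s, "".join(lines.get(s, []))) for s in range(lo, hi + 1)]

for row in stem_and_leaf(scores):
    print(row)
```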
2. Numerical descriptions

Let y denote a quantitative variable, with observations y1, y2, y3, ..., yn

a. Describing the center

Median: Middle measurement of the ordered sample

Mean: ȳ = (y1 + y2 + ... + yn)/n = (Σ yi)/n
Example: Annual per capita carbon dioxide emissions (metric tons) for the n = 8 largest nations in population size:

Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4, Pakistan 0.7, Russia 9.9, U.S. 20.1

Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1

Median = (1.4 + 1.8)/2 = 1.6

Mean ȳ = (0.3 + 0.7 + 1.2 + ... + 20.1)/8 = 4.7
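These calculations can be checked with Python's statistics module; a minimal sketch using the slide's emissions data:

```python
import statistics

# Per capita CO2 emissions (metric tons) for the 8 nations above
emissions = [0.3, 1.8, 2.3, 1.2, 1.4, 0.7, 9.9, 20.1]

# median() sorts internally, then averages the 4th and 5th ordered values
median = statistics.median(emissions)   # 1.6
mean = statistics.mean(emissions)       # about 4.7, pulled up by the U.S. outlier

print(median, round(mean, 1))
```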
Properties of the mean and median:
For symmetric distributions, mean = median
For skewed distributions, the mean is drawn in the direction of the longer tail, relative to the median
Mean valid for interval scales, median for interval or ordinal scales
Mean sensitive to outliers (median often preferred for highly skewed distributions)
When the distribution is symmetric, mildly skewed, or discrete with few values, the mean is preferred because it uses the numerical values of the observations
Examples:

New York Yankees baseball team, 2006:
mean salary = $7.0 million
median salary = $2.9 million

How is this possible? What is the direction of skew?

Give an example for which you would expect mean < median.
b. Describing variability

Range: Difference between the largest and smallest observations (but highly sensitive to outliers, insensitive to shape)

Standard deviation: A typical distance from the mean

The deviation of observation i from the mean is yi − ȳ

The variance of the n observations is

s² = Σ(yi − ȳ)² / (n − 1) = [(y1 − ȳ)² + ... + (yn − ȳ)²] / (n − 1)

The standard deviation s is the square root of the variance,

s = √s²
Example: Political ideology

For those in the student sample who attend religious services at least once a week (n = 9 of the 60):

y = 2, 3, 7, 5, 6, 7, 5, 6, 4
ȳ = 5.0

s² = [(2 − 5)² + (3 − 5)² + ... + (4 − 5)²] / (9 − 1) = 24/8 = 3.0

s = √3.0 = 1.7
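The same arithmetic in code, following the n − 1 formula above:

```python
import math

# Political ideology scores for the 9 weekly attenders above
y = [2, 3, 7, 5, 6, 7, 5, 6, 4]
n = len(y)
ybar = sum(y) / n                        # 5.0

# Sample variance: sum of squared deviations divided by n - 1
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
s = math.sqrt(s2)

print(ybar, s2, round(s, 2))  # 5.0 3.0 1.73
```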

For the entire sample (n = 60), mean = 3.0, standard deviation = 1.6: similar variability, but the full sample tends to be more liberal

Properties of the standard deviation:
s ≥ 0, and s = 0 only if all observations are equal
s increases with the amount of variation around the mean
Division by n − 1 (not n) is due to technical reasons (explained later)
s depends on the units of the data (e.g., measurement in euros vs. dollars)
Like the mean, s is affected by outliers

Empirical rule: If the distribution is approximately bell-shaped,
about 68% of the data fall within 1 standard deviation of the mean
about 95% of the data fall within 2 standard deviations of the mean
all or nearly all data fall within 3 standard deviations of the mean

Example: SAT with mean = 500, s = 100 (sketch a picture summarizing the data)
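For the SAT example, the empirical-rule intervals can be tabulated directly; a minimal sketch (the function name is invented for illustration):

```python
# Empirical rule: for approximately bell-shaped distributions, about
# 68% / 95% / nearly 100% of observations fall within 1 / 2 / 3
# standard deviations of the mean.
def empirical_intervals(mean, s):
    return {k: (mean - k * s, mean + k * s) for k in (1, 2, 3)}

# SAT example from the slide: mean = 500, s = 100
for k, (lo, hi) in empirical_intervals(500, 100).items():
    print(f"within {k} sd: ({lo}, {hi})")
```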

Example: y = number of close friends you have

GSS: The variable frinum has mean 7.4, s = 11.0

Probably highly skewed: right or left?

The empirical rule fails; in fact, median = 5, mode = 4

Example: y = selling price of a home in Syracuse, NY.
If the mean = $130,000, what would be a realistic value for the median?
c. Measures of position

pth percentile: p percent of observations fall below it, (100 − p)% above it.

p = 50: median
p = 25: lower quartile (LQ)
p = 75: upper quartile (UQ)

Interquartile range IQR = UQ − LQ

Quartiles are portrayed graphically by box plots (John Tukey)
Example: weekly TV watching for n = 60 from the student survey data file, 3 outliers

Box plots have a box from LQ to UQ, with the median marked. They portray a five-number summary of the data:
Minimum, LQ, Median, UQ, Maximum
except for outliers, which are identified separately

Outlier = observation falling below LQ − 1.5(IQR) or above UQ + 1.5(IQR)

Ex. If LQ = 2, UQ = 10, then IQR = 8, and outliers are observations below 2 − 1.5(8) = −10 or above 10 + 1.5(8) = 22
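The 1.5 × IQR fences for this example can be computed directly; a small sketch (the function name is illustrative):

```python
def outlier_fences(lq, uq):
    """1.5*IQR rule: observations outside these fences are flagged as outliers."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

# Example from the slide: LQ = 2, UQ = 10, so IQR = 8
low, high = outlier_fences(2, 10)
print(low, high)  # -10.0 22.0
```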


3. Bivariate description

Usually we want to study associations between two or more variables (e.g., how does number of close friends depend on gender, income, education, age, working status, rural/urban, religiosity?)

Response variable: the outcome variable
Explanatory variable(s): define(s) the groups to compare

Ex.: number of close friends is a response variable, while gender, income, etc. are explanatory variables
Summarizing associations:
Categorical variables: show data using contingency tables
Quantitative variables: show data using scatterplots
Mixture of a categorical variable and a quantitative variable (e.g., number of close friends and gender): give numerical summaries (mean, standard deviation) or side-by-side box plots for the groups

Ex. General Social Survey (GSS) data
Men: mean = 7.0, s = 8.4
Example: Income by highest degree
Contingency Tables

Cross-classifications of categorical variables in which rows (typically) represent categories of the explanatory variable and columns represent categories of the response variable.

Counts in the cells of the table give the numbers of individuals at the corresponding combination of levels of the two variables.

Happiness and Family Income (GSS 2008 data: happy, finrela)

                     Happiness
Income        Very   Pretty   Not too   Total
---------------------------------------------
Above Aver.    164      233        26     423
Average        293      473       117     883
Below Aver.    132      383       172     687
---------------------------------------------
Total          589     1089       315    1993
Can summarize by percentages on the response variable (happiness)

Example: Percentage "very happy" is
39% for above average income (164/423 = 0.39)
33% for average income (293/883 = 0.33)
19% for below average income (132/687 = 0.19)

                        Happiness
Income     Very        Pretty      Not too     Total
----------------------------------------------------
Above      164 (39%)   233 (55%)    26 (6%)     423
Average    293 (33%)   473 (54%)   117 (13%)    883
Below      132 (19%)   383 (56%)   172 (25%)    687
----------------------------------------------------
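The conditional (row) percentages can be computed directly from the counts; a sketch using the table's numbers, rounded to whole percents as on the slide:

```python
# Happiness by family income (GSS 2008 counts from the table above)
table = {
    "Above average": {"Very": 164, "Pretty": 233, "Not too": 26},
    "Average":       {"Very": 293, "Pretty": 473, "Not too": 117},
    "Below average": {"Very": 132, "Pretty": 383, "Not too": 172},
}

def row_percents(counts):
    """Distribution of the response within one row, as whole percents."""
    total = sum(counts.values())
    return {category: round(100 * count / total) for category, count in counts.items()}

for income, counts in table.items():
    print(income, sum(counts.values()), row_percents(counts))
```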

Inference questions for later chapters?
Scatterplots (for quantitative variables) plot the response variable on the vertical axis and the explanatory variable on the horizontal axis

Example: Table 9.13 (p. 294) shows UN data for several nations on many variables, including fertility (births per woman), contraceptive use, literacy, female economic activity, per capita gross domestic product (GDP), cell-phone use, and CO2 emissions

Data available at http://www.stat.ufl.edu/~aa/social/data.html
Example: Survey in Alachua County, Florida, on predictors of mental health (data for n = 40 on p. 327 of text and at www.stat.ufl.edu/~aa/social/data.html)

y = measure of mental impairment (incorporates various dimensions of psychiatric symptoms, including aspects of depression and anxiety)
(min = 17, max = 41, mean = 27, s = 5)

x = life events score (events range from severe personal disruptions such as a death in the family or an extramarital affair, to less severe events such as a new job, birth of a child, or moving)
(min = 3, max = 97, mean = 44, s = 23)

Bivariate data from the 2000 Presidential election: butterfly ballot, Palm Beach County, FL (text p. 290)
Example: The Massachusetts Lottery (data for 37 communities)

[Scatterplot: % of income spent on the lottery vs. per capita income]

Correlation describes strength of association

Falls between −1 and +1, with the sign indicating the direction of association (formula later in Chapter 9)

The larger the correlation in absolute value, the stronger the association (in terms of a straight-line trend)

Examples: (positive or negative, how strong?)
Mental impairment and life events, correlation = 0.37
GDP and fertility, correlation = −0.56
GDP and percent using Internet, correlation = 0.89
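The slides defer the correlation formula to Chapter 9; as a preview, here is a sketch of the standard Pearson formula applied to a small made-up data set (the numbers are illustrative, not from the text):

```python
import math

def correlation(x, y):
    """Pearson correlation: standardized x-y covariation, always in [-1, +1]."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data with a clear positive straight-line trend
print(round(correlation([1, 2, 3, 4], [2, 4, 5, 9]), 2))  # 0.96
```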
Regression analysis gives a line predicting y using x

Example: y = mental impairment, x = life events

Predicted y = 23.3 + 0.09x

e.g., at x = 0, predicted y = 23.3
at x = 100, predicted y = 23.3 + 0.09(100) = 32.3
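The prediction equation is easy to turn into code; a minimal sketch using the slide's fitted line (the function name is invented for illustration):

```python
def predict_impairment(life_events):
    """Prediction equation from the slide: y-hat = 23.3 + 0.09 x."""
    return 23.3 + 0.09 * life_events

print(predict_impairment(0))            # 23.3
print(round(predict_impairment(100), 1))  # 32.3
```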

Inference questions for later chapters?

Example: student survey
y = college GPA, x = high school GPA
(data at www.stat.ufl.edu/~aa/social/data.html)

What is the correlation?
What is the estimated regression equation?

We'll see later in the course the formulas for finding the correlation and the best-fitting regression equation (with possibly several explanatory variables), but for now, try using software such as SPSS to find the answers.
Sample statistics / Population parameters

We distinguish between summaries of samples (statistics) and summaries of populations (parameters).

It is common to denote statistics by Roman letters and parameters by Greek letters:
Population mean = μ, standard deviation = σ, and proportion = π are parameters.

In practice, parameter values are unknown; we make inferences about their values using sample statistics.

The sample mean ȳ estimates the population mean μ (quantitative variable)

The sample standard deviation s estimates the population standard deviation σ (quantitative variable)

A sample proportion p estimates a population proportion π (categorical variable)
Chapter 1: Statistics

Chapter Goals:
Create an initial image of the field of statistics.
Learn how to obtain sample data.

Example: A recent study examined the math and verbal SAT scores of high school seniors across the country. Which of the following statements are descriptive in nature, and which are inferential?
The mean math SAT score was 492.
The mean verbal SAT score was 475.
Students in the Northeast scored higher in math but lower in verbal.
80% of all students taking the exam were headed for college.
32% of the students scored above 610 on the verbal SAT.
The math SAT scores are higher than they were 10 years ago.
1.2 Introduction to Basic
Terms
Population: A collection, or set, of
individuals or objects or events whose
properties are to be analyzed.
Two kinds of populations: finite or
infinite.

Sample: A subset of the population.


Variable: A characteristic about each individual
element of a population or sample.
Data (singular): The value of the variable
associated with one element of a population or
sample. This value may be a number, a word, or
a symbol.
Data (plural): The set of values collected for the
variable from each of the elements belonging to
the sample.
Experiment: A planned activity whose results
yield a set of data.
Parameter: A numerical value summarizing all
the data of an entire population.
Statistic: A numerical value summarizing the
sample data.
Example: A college dean is interested in learning about the
average age of faculty. Identify the basic terms in this
situation.

The population is the age of all faculty members at the


college.
A sample is any subset of that population. For example,
we might select 10 faculty members and determine their
age.
The variable is the age of each faculty member.
One data would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the
ages forming the sample and determining the actual age
of each faculty member in the sample.
The parameter of interest is the average age of all
faculty at the college.
The statistic is the average age for all faculty in the
sample.
Two kinds of variables:
Qualitative, or Attribute, or
Categorical, Variable: A variable that
categorizes or describes an element of a
population.
Note: Arithmetic operations, such as
addition and averaging, are not
meaningful for data resulting from a
qualitative variable.
Quantitative, or Numerical, Variable:
A variable that quantifies an element of a
population.
Note: Arithmetic operations, such as
addition and averaging, are meaningful for
data resulting from a quantitative variable.
Example: Identify each of the following examples as
attribute (qualitative) or numerical (quantitative)
variables.

1. The residence hall for each student in a statistics


class. (Attribute)
2. The amount of gasoline pumped by the next 10
customers at the local Unimart. (Numerical)
3. The amount of radon in the basement of each of 25
homes in a new development. (Numerical)
4. The color of the baseball cap worn by each of 20
students. (Attribute)
5. The length of time to complete a mathematics
homework assignment. (Numerical)
6. The state in which each truck is registered when
stopped and inspected at a weigh station. (Attribute)
Qualitative and quantitative variables may be
further subdivided:

Qualitative: Nominal or Ordinal
Quantitative: Discrete or Continuous
Nominal Variable: A qualitative variable that categorizes
(or describes, or names) an element of a population.

Ordinal Variable: A qualitative variable that incorporates


an ordered position, or ranking.

Discrete Variable: A quantitative variable that can


assume a countable number of values. Intuitively, a
discrete variable can assume values corresponding to
isolated points along a line interval. That is, there is a gap
between any two values.

Continuous Variable: A quantitative variable that can


assume an uncountable number of values. Intuitively, a
continuous variable can assume any value along a line
interval, including every possible value between any two
values.
Note:
1. In many cases, a discrete and continuous
variable may be distinguished by
determining whether the variables are
related to a count or a measurement.
2. Discrete variables are usually associated
with counting. If the variable cannot be
further subdivided, it is a clue that you are
probably dealing with a discrete variable.
3. Continuous variables are usually associated
with measurements. The values of continuous
variables are only limited by your ability to
measure them.
Example: Identify each of the following as
examples of qualitative or numerical variables:
1. The temperature in Barrow, Alaska at 12:00
pm on any
given day.
2. The make of automobile driven by each
faculty member.
3. Whether or not a 6 volt lantern battery is
defective.
4. The weight of a lead pencil.
5. The length of time billed for a long distance
telephone call.
6. The brand of cereal children eat for breakfast.
7. The type of book taken out of the library by
an adult.
Example: Identify each of the following as
examples of (1) nominal, (2) ordinal, (3) discrete,
or (4) continuous variables:
1. The length of time until a pain reliever begins
to work.
2. The number of chocolate chips in a cookie.
3. The number of colors used in a statistics
textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer's hard
disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.
1.3: Measure and Variability
No matter what the response
variable: there will always be
variability in the data.
One of the primary objectives of
statistics: measuring and
characterizing variability.
Controlling (or reducing) variability in
a manufacturing process: statistical
process control.
Example: A supplier fills cans of soda marked 12
ounces. How much soda does each can really
contain?

It is very unlikely any one can contains exactly


12 ounces of soda.
There is variability in any process.
Some cans contain a little more than 12
ounces, and some cans contain a little less.
On the average, there are 12 ounces in each
can.
The supplier hopes there is little variability in
the process, that most cans contain close to 12
ounces of soda.
1.4: Data Collection
First problem a statistician faces:
how to obtain the data.
It is important to obtain good, or
representative, data.
Inferences are made based on
statistics obtained from the data.
Inferences can only be as good as
the data.
Biased Sampling Method: A sampling method
that produces data which systematically differs
from the sampled population. An unbiased
sampling method is one that is not biased.

Sampling methods that often result in biased


samples:
1. Convenience sample: sample selected from
elements of a population that are easily
accessible.
2. Volunteer sample: sample collected from
those elements of the population which chose
to contribute the needed information on their
own initiative.
Process of data collection:

1. Define the objectives of the survey or


experiment.
Example: Estimate the average life of an
electronic component.
2. Define the variable and population of
interest.
Example: Length of time for anesthesia to
wear off after surgery.
3. Define the data-collection and data-
measuring schemes. This includes sampling
procedures, sample size, and the data-
measuring device (questionnaire, scale, ruler,
etc.).
4. Determine the appropriate descriptive or
inferential data-analysis techniques.
Methods used to collect data:

Experiment: The investigator controls or


modifies the environment and observes the
effect on the variable under study.

Survey: Data are obtained by sampling some of


the population of interest. The investigator does
not modify the environment.

Census: A 100% survey. Every element of the


population is listed. Seldom used: difficult and
time-consuming to compile, and expensive.
Sampling Frame: A list of the elements
belonging to the population from which the
sample will be drawn.

Note: It is important that the sampling frame be


representative of the population.

Sample Design: The process of selecting


sample elements from the sampling frame.

Note: There are many different types of sample


designs. Usually they all fit into two categories:
judgment samples and probability samples.
Judgment Samples: Samples that are selected
on the basis of being typical.

Items are selected that are representative of the


population. The validity of the results from a
judgment sample reflects the soundness of the
collector's judgment.

Probability Samples: Samples in which the


elements to be selected are drawn on the basis
of probability. Each element in a population has
a certain probability of being selected as part of
the sample.
Random Samples: A sample selected in such a
way that every element in the population has an
equal probability of being chosen. Equivalently,
all samples of size n have an equal chance of
being selected. Random samples are obtained
either by sampling with replacement from a finite
population or by sampling without replacement
from an infinite population.
Note:
1. Inherent in the concept of randomness: the next result (or
occurrence) is not predictable.
2. Proper procedure for selecting a random sample: use a random
number generator or a table of random numbers.
Example: An employer is interested in the time it
takes each employee to commute to work each
morning. A random sample of 35 employees will
be selected and their commuting time will be
recorded.

There are 2712 employees.


Each employee is numbered: 0001, 0002, 0003,
etc. up to 2712.
Using four-digit random numbers, a sample is
identified: 1315, 0987, 1125, etc.
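The four-digit random-number scheme above can be sketched with a random number generator; `random.sample` plays the role of the random-number table, drawing without replacement (the ID range comes from the slide):

```python
import random

# Select a simple random sample of 35 of the 2712 employee ID numbers.
# random.sample draws without replacement, so no employee is picked twice.
employee_ids = range(1, 2713)          # IDs 0001 through 2712
sample = random.sample(employee_ids, 35)

print(sorted(sample))
```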
Systematic Sample: A sample in which every
kth item of the sampling frame is selected,
starting from the first element which is randomly
selected from the first k elements.

Note: The systematic technique is easy to


execute. However, it has some inherent dangers
when the sampling frame is repetitive or cyclical
in nature. In these situations the results may not
approximate a simple random sample.

Stratified Random Sample: A sample obtained


by stratifying the sampling frame and then
selecting a fixed number of items from each of
the strata by means of a simple random
sampling technique.
Proportional Sample (or Quota Sample): A
sample obtained by stratifying the sampling
frame and then selecting a number of items in
proportion to the size of the strata (or by quota)
from each stratum by means of a simple random
sampling technique.

Cluster Sample: A sample obtained by


stratifying the sampling frame and then selecting
some or all of the items from some of, but not
all, the strata.
1.5: Comparison of Probability and
Statistics
Probability: Properties of the
population are assumed known.
Answer questions about the sample
based on these properties.

Statistics: Use information in the


sample to draw a conclusion about the
population.
Example: A jar of M&Ms contains 100 candy
pieces, 15 are red. A handful of 10 is selected.

Probability question: What is the probability that


3 of the 10 selected are red?

Example: A handful of 10 M&Ms is selected from


a jar containing 1000 candy pieces. Three
M&Ms in the handful are red.

Statistics question: What is the proportion of red


M&Ms in the entire jar?
1.6: Statistics and Technology
Electronic technology has had a
tremendous effect on the field of
statistics.
Many statistical techniques are
repetitive in nature: computers and
calculators are good at this.
Lots of statistical software packages:
MINITAB, SYSTAT, STATA, SAS,
Statgraphics, SPSS, and calculators.
Remember: Responsible use of statistical
methodology is very important. The
burden is on the user to ensure that the
appropriate methods are correctly applied
and that accurate conclusions are drawn
and communicated to others.

Note: The textbook illustrates statistical


procedures using MINITAB, EXCEL 97, and
the TI-83.
Chapter 1: Introduction to
Statistics

Variables
A variable is a characteristic or
condition that can change or take on
different values.
Most research begins with a general
question about the relationship
between two variables for a specific
group of individuals.

Population
The entire group of individuals is
called the population.
For example, a researcher may be
interested in the relation between
class size (variable 1) and academic
performance (variable 2) for the
population of third-grade children.

Sample
Usually populations are so large that
a researcher cannot examine the
entire group. Therefore, a sample is
selected to represent the population
in a research study. The goal is to
use the results obtained from the
sample to help answer questions
about the population.

Types of Variables
Variables can be classified as
discrete or continuous.
Discrete variables (such as class
size) consist of indivisible categories,
and continuous variables (such as
time or weight) are infinitely divisible
into whatever units a researcher may
choose. For example, time can be
measured to the nearest minute,
second, half-second, etc.
Real Limits
To define the units for a continuous
variable, a researcher must use real
limits which are boundaries located
exactly half-way between adjacent
categories.

Measuring Variables
To establish relationships between
variables, researchers must observe
the variables and record their
observations. This requires that the
variables be measured.
The process of measuring a variable
requires a set of categories called a
scale of measurement and a
process that classifies each individual
into one category.
4 Types of Measurement
Scales
1. A nominal scale is an unordered
set of categories identified only by
name. Nominal measurements only
permit you to determine whether
two individuals are the same or
different.
2. An ordinal scale is an ordered set
of categories. Ordinal
measurements tell you the direction
of difference between two
individuals.
4 Types of Measurement
Scales
3. An interval scale is an ordered series of
equal-sized categories. Interval
measurements identify the direction and
magnitude of a difference. The zero point
is located arbitrarily on an interval scale.
4. A ratio scale is an interval scale where a
value of zero indicates none of the
variable. Ratio measurements identify
the direction and magnitude of
differences and allow ratio comparisons
of measurements.
Correlational Studies
The goal of a correlational study is
to determine whether there is a
relationship between two variables
and to describe the relationship.
A correlational study simply
observes the two variables as they
exist naturally.

Experiments
The goal of an experiment is to
demonstrate a cause-and-effect
relationship between two variables;
that is, to show that changing the
value of one variable causes changes
to occur in a second variable.

Experiments (cont.)
In an experiment, one variable is
manipulated to create treatment
conditions. A second variable is observed
and measured to obtain scores for a group
of individuals in each of the treatment
conditions. The measurements are then
compared to see if there are differences
between treatment conditions. All other
variables are controlled to prevent them
from influencing the results.
In an experiment, the manipulated
variable is called the independent
variable and the observed variable is the
dependent variable.
Other Types of Studies
Other types of research studies,
known as non-experimental or
quasi-experimental, are similar to
experiments because they also
compare groups of scores.
These studies do not use a
manipulated variable to differentiate
the groups. Instead, the variable
that differentiates the groups is
usually a pre-existing participant
variable (such as male/female) or a
time variable (such as before/after).
Other Types of Studies
(cont.)
Because these studies do not use the
manipulation and control of true
experiments, they cannot
demonstrate cause and effect
relationships. As a result, they are
similar to correlational research
because they simply demonstrate
and describe relationships.

Data
The measurements obtained in a
research study are called the data.
The goal of statistics is to help
researchers organize and interpret
the data.

Descriptive Statistics
Descriptive statistics are methods
for organizing and summarizing data.

For example, tables or graphs are


used to organize data, and
descriptive values such as the
average score are used to
summarize data.
A descriptive value for a population
is called a parameter, and a
descriptive value for a sample is
called a statistic.
Inferential Statistics
Inferential statistics are methods for
using sample data to make general
conclusions (inferences) about
populations.
Because a sample is typically only a part
of the whole population, sample data
provide only limited information about the
population. As a result, sample statistics
are generally imperfect representatives of
the corresponding population parameters.
Sampling Error
The discrepancy between a sample
statistic and its population parameter
is called sampling error.
Defining and measuring sampling
error is a large part of inferential
statistics.

Notation
The individual measurements or scores
obtained for a research participant will be
identified by the letter X (or X and Y if
there are multiple scores for each
individual).
The number of scores in a data set will be
identified by N for a population or n for a
sample.
Summing a set of values is a common
operation in statistics and has its own
notation. The Greek letter sigma, Σ, will
be used to stand for "the sum of." For
example, ΣX identifies the sum of the X scores.
Order of Operations
1. All calculations within parentheses are
done first.
2. Squaring or raising to other exponents is
done second.
3. Multiplying, and dividing are done third,
and should be completed in order from
left to right.
4. Summation with the Σ notation is done
next.
5. Any additional adding and subtracting is
done last and should be completed in
order from left to right.
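One consequence of these rules is worth a worked check: ΣX² (square first, then sum) differs from (ΣX)² (sum first, then square). A quick sketch with made-up scores:

```python
# Order of operations with summation notation:
# the sum of squares is not the square of the sum.
X = [2, 3, 5]

sum_of_squares = sum(x ** 2 for x in X)   # 4 + 9 + 25 = 38
square_of_sum = sum(X) ** 2               # 10 squared = 100

print(sum_of_squares, square_of_sum)  # 38 100
```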
Basics of Statistics

Definition: Science of collection, presentation, analysis, and reasonable interpretation of data.

Statistics presents a rigorous scientific method for gaining insight into data. For example, suppose we measure the weight of 100 patients in a study. With so many measurements, simply looking at the data fails to provide an informative account. However, statistics can give an instant overall picture of data based on graphical presentation or numerical summarization, irrespective of the number of data points. Besides data summarization, another important task of statistics is to make inferences and predict relations of variables.
A Taxonomy of Statistics
Statistical Description of
Data
Statistics describes a numeric set of
data by its
Center
Variability
Shape
Statistics describes a categorical set
of data by
Frequency, percentage or proportion of each
category
Some Definitions
Variable - any characteristic of an individual or entity. A variable can
take different values for different individuals. Variables can be
categorical or quantitative. Per S. S. Stevens
Nominal - Categorical variables with no inherent order or ranking sequence, such
as names or classes (e.g., gender). Values may be numerical labels but carry no
numerical meaning (e.g., I, II, III). The only operation that can be applied to nominal
variables is enumeration.
Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe.
Can be compared for equality, or greater or less, but not how much greater or less.
Interval - Values of the variable are ordered as in Ordinal, and additionally,
differences between values are meaningful, however, the scale is not absolutely
anchored. Calendar dates and temperatures on the Fahrenheit scale are examples.
Addition and subtraction, but not multiplication and division are meaningful
operations.
Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero
point, e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication,
and division are all meaningful operations.
Some Definitions
Distribution - (of a variable) tells us what values the variable takes
and how often it takes these values.
Unimodal - having a single peak
Bimodal - having two distinct peaks
Symmetric - left and right half are mirror images.
Frequency Distribution
Consider a data set of 26 children of ages 1-6 years. Then the
frequency distribution of variable age can be tabulated as
follows:
Frequency Distribution of Age

Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Grouped Frequency Distribution of Age:
Age Group 1-2 3-4 5-6

Frequency 8 12 6
Cumulative Frequency
Cumulative frequency of data in previous page

Age 1 2 3 4 5 6

Frequency 5 3 7 5 4 2

Cumulative Frequency 5 8 15 20 24 26

Age Group 1-2 3-4 5-6

Frequency 8 12 6

Cumulative Frequency 8 20 26
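The cumulative frequencies above are running totals of the frequencies; a quick check in code:

```python
from itertools import accumulate

# Frequencies of ages 1-6 from the table above
ages = [1, 2, 3, 4, 5, 6]
freq = [5, 3, 7, 5, 4, 2]

# accumulate yields running totals: 5, 5+3, 5+3+7, ...
cum_freq = list(accumulate(freq))
print(cum_freq)  # [5, 8, 15, 20, 24, 26]
```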
Data Presentation
Two types of statistical presentation of data - graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. The overall pattern is usually described by the
shape, center, and spread of the data. An individual value that falls
outside the overall pattern is called an outlier.

Bar diagram and Pie charts are used for categorical variables.

Histogram, stem-and-leaf plot, and box plot are used for numerical variables.
Data Presentation - Categorical Variable

Bar Diagram: Lists the categories and presents the percent or count of
individuals who fall in each category.

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250    25.0
2                 25          (25/60) = 0.417    41.7
3                 20          (20/60) = 0.333    33.3
Total             60          1.000             100
Data Presentation: Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of
individuals who fall in each category.

    Treatment Group   Frequency   Proportion      Percent (%)
    1                 15          15/60 = 0.250   25.0
    2                 25          25/60 = 0.417   41.7
    3                 20          20/60 = 0.333   33.3
    Total             60          1.000           100.0
Graphical Presentation: Numerical Variable
Histogram: The overall pattern can be described by its shape, center,
and spread. The following age distribution is right-skewed. The
center lies between 80 and 100. No outliers.

Mean 90.41666667
Standard Error 3.902649518
Median 84
Mode 84
Standard Deviation 30.22979318
Sample Variance 913.8403955
Kurtosis -1.183899591
Skewness 0.389872725
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60
Graphical Presentation: Numerical Variable
Box Plot: Describes the five-number summary.

[Figure 3: Distribution of Age (box plot)]
Numerical Presentation
A fundamental concept in summary statistics is that of a central value for a set of
observations and the extent to which the central value characterizes the whole
set of data. Measures of central value such as the mean or median must be
coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:

A: 30, 50, 70
B: 40, 50, 60

The mean of both data sets is 50, but the observations in set A lie farther from
the mean than those in set B. Thus the mean of data set B is a better
representation of its data than is the mean of set A.
Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset.

Commonly used methods are the mean, median, mode, geometric mean, etc.

Mean: Sum all the observations and divide by the number of observations.
The mean of 20, 30, 40 is (20+30+40)/3 = 30.
Notation: Let x1, x2, ..., xn be the n observations of a variable x.
Then the mean of this variable is

    x̄ = (x1 + x2 + ... + xn) / n = (1/n) Σ xi   (sum over i = 1, ..., n)
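In code, the definition is a one-liner (a sketch in Python):

```python
def mean(xs):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(xs) / len(xs)

print(mean([20, 30, 40]))  # 30.0, as in the example above
```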
Median: The middle value in an ordered sequence of observations. To find the
median, order the data set and take the middle value. For example, to find the
median of {9, 3, 6, 7, 5}, we first sort the data, giving {3, 5, 6, 7, 9}, then
choose the middle value, 6. If the number of observations is even, e.g.
{9, 3, 6, 7, 5, 2}, the median is the average of the two middle values of the
sorted sequence, in this case (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is undefined for
sequences in which no observation is repeated.
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean, and
thus a better measure than the mean for highly skewed distributions, e.g.
family income. For example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270, while the median of these four observations is
(30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40, so the
mean of 270 fails to give a realistic picture of the major part of the data;
it is inflated by the extreme value 990.
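The standard library's statistics module reproduces this comparison directly (a sketch):

```python
import statistics

data = [20, 30, 40, 990]
print(statistics.mean(data))    # 270 -- pulled up by the extreme value 990
print(statistics.median(data))  # 35.0 -- average of the two middle values, (30+40)/2
```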
Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, inter-quartile
range, coefficient of variation, etc.

Range: The difference between the largest and the smallest observations.
The range of 10, 5, 2, 100 is (100 - 2) = 98. It's a crude measure of
variability.
Variance: The variance of a set of observations is the average of the squares
of the deviations of the observations from their mean. In symbols, the variance
of the n observations x1, x2, ..., xn is

    s² = [ (x1 - x̄)² + ... + (xn - x̄)² ] / (n - 1)

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5, and the variance is

    [ (5 - 5)² + (3 - 5)² + (7 - 5)² ] / (3 - 1) = 4

Standard Deviation: The square root of the variance. The standard deviation of
the above example is 2.
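The variance calculation above can be checked with a short function (a sketch; note the n - 1 denominator):

```python
def sample_variance(xs):
    """Average squared deviation from the mean, using the n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [5, 7, 3]
print(sample_variance(data))         # 4.0
print(sample_variance(data) ** 0.5)  # 2.0 (the standard deviation)
```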
Quartiles: Data can be divided into four regions that cover the total range of
observed values. The cut points for these regions are known as quartiles.

In notation, the q-th quartile of a data set is the q(n+1)/4-th observation of
the ordered data, where n is the number of observations.

The first quartile (Q1) cuts off the lowest 25% of the data; the second
quartile (Q2) is the median, with 50% of the data below it; the third quartile
(Q3) cuts off the lowest 75% (highest 25%) of the data.

Equivalently, Q1 is the median of the first half of the ordered observations
and Q3 is the median of the second half.
In the following example, Q1 is the ((15+1)/4) × 1 = 4th observation of the
ordered data. The 4th observation is 11, so Q1 of this data is 11.

An example with 15 numbers:

    3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is also
the median). The third quartile is Q3 = 61.

Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile
range of this example is 61 - 11 = 50. The middle half of the ordered data lie
between 11 and 61.
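The (n+1)/4 position rule can be applied mechanically when the positions come out as whole numbers, as they do for 15 observations (a sketch; other quartile conventions give slightly different answers):

```python
def quartile(sorted_xs, q):
    """q-th quartile via the q*(n+1)/4 position rule (assumes the position is a whole number)."""
    pos = q * (len(sorted_xs) + 1) // 4  # 1-based position in the ordered data
    return sorted_xs[pos - 1]

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
print(q1, q2, q3)  # 11 40 61
print(q3 - q1)     # 50 (the inter-quartile range)
```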
Deciles and Percentiles
Deciles: If the ordered data are divided into 10 equal parts, the cut points
are called deciles.
Percentiles: If the ordered data are divided into 100 equal parts, the cut
points are called percentiles. The 25th percentile is Q1, the 50th percentile
is the median (Q2), and the 75th percentile is Q3.

In notation, the p-th percentile of a data set is the p(n+1)/100-th observation
of the ordered data, where n is the number of observations.

Coefficient of Variation: The standard deviation of the data divided by its
mean, usually expressed as a percentage:

    CV = (s / x̄) × 100
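For example (a sketch; statistics.stdev uses the same n - 1 denominator as the variance above):

```python
import statistics

data = [20, 30, 40]
cv = statistics.stdev(data) / statistics.mean(data) * 100  # s = 10, mean = 30
print(round(cv, 1))  # 33.3 -- the standard deviation is about a third of the mean
```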
Five Number Summary

Five Number Summary: The five number summary of a distribution consists of the
smallest (minimum) observation, the first quartile (Q1), the median (Q2), the
third quartile (Q3), and the largest (maximum) observation, written in order
from smallest to largest.

Box Plot: A box plot is a graph of the five number summary. The central box
spans the quartiles. A line within the box marks the median. Lines extending
above and below the box mark the smallest and the largest observations (i.e.,
the range). Outlying samples may be additionally plotted outside the range.
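A minimal implementation (a sketch; here the quartiles are taken as medians of the lower and upper halves, per the earlier slide):

```python
import statistics

def five_number_summary(xs):
    """(minimum, Q1, median, Q3, maximum) of a data set."""
    s = sorted(xs)
    n = len(s)
    # lower and upper halves, excluding the middle point when n is odd
    lower, upper = s[:n // 2], s[n // 2 + (n % 2):]
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

print(five_number_summary([3, 5, 6, 7, 9]))  # (3, 4.0, 6, 8.0, 9)
```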
[Boxplot: Distribution of Age in Months]
Choosing a Summary
The five number summary is usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with
extreme outliers. The mean and standard deviation are reasonable for
symmetric distributions that are free of outliers.

In real life we can't always expect symmetry of the data. It's common practice
to include the number of observations (n), mean, median, standard deviation,
and range for data summarization. We can include other summary statistics,
such as Q1, Q3, and the coefficient of variation, if they are considered
important for describing the data.
Shape of Data
Shape of data is measured by
Skewness
Kurtosis
Skewness
Measures asymmetry of data
Positive or right skewed: Longer right tail
Negative or left skewed: Longer left tail

Let x1, x2, ..., xn be n observations. Then

    skewness = √n · Σ (xi - x̄)³ / [ Σ (xi - x̄)² ]^(3/2)   (sums over i = 1, ..., n)
Kurtosis
Measures peakedness of the distribution of
data. The kurtosis of normal distribution is 0.

Let x1, x2, ..., xn be n observations. Then

    kurtosis = n · Σ (xi - x̄)⁴ / [ Σ (xi - x̄)² ]² - 3   (sums over i = 1, ..., n)
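Both shape measures follow directly from the formulas above (a sketch; this is the excess-kurtosis convention, so a normal distribution scores 0 and symmetric data have zero skewness):

```python
def skewness(xs):
    """sqrt(n) * sum of cubed deviations / (sum of squared deviations)**1.5"""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs)
    s3 = sum((x - m) ** 3 for x in xs)
    return n ** 0.5 * s3 / s2 ** 1.5

def kurtosis(xs):
    """n * sum of 4th-power deviations / (sum of squared deviations)**2, minus 3"""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs)
    s4 = sum((x - m) ** 4 for x in xs)
    return n * s4 / s2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))  # 0.0 -- symmetric data have zero skewness
print(kurtosis(symmetric))  # about -1.3 -- flatter than a normal distribution
```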
Summary of the Variable Age in the given data set

    Mean                90.41666667
    Standard Error       3.902649518
    Median              84
    Mode                84
    Standard Deviation  30.22979318
    Sample Variance    913.8403955
    Kurtosis            -1.183899591
    Skewness             0.389872725
    Range               95
    Minimum             48
    Maximum            143
    Sum               5425
    Count               60

[Histogram of Age: number of subjects versus age in months (40-160)]
Summary of the Variable Age in the given data set

[Boxplot of Age in Months: ages range from about 48 to 143 months]
Class Summary (First Part)
So far we have learned:

Statistics and data presentation/data summarization

Graphical presentation: bar chart, pie chart, histogram, and box plot

Numerical presentation: measuring the central value of data (mean, median,
mode, etc.), measuring dispersion (standard deviation, variance, coefficient
of variation, range, inter-quartile range, etc.), quartiles, percentiles, and
the five number summary

Any questions?
Brief Overview of Statistical Software

There are many software packages for performing statistical analysis and
visualization of data. Some of them are SAS (Statistical Analysis System),
S-Plus, R, Matlab, Minitab, BMDP, Stata, SPSS, StatXact, Statistica, LISREL,
JMP, GLIM, HIL, MS Excel, etc. We will discuss MS Excel and SPSS in brief.

Some useful websites for more information on statistical software:

http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/sta
tsoft.htm#archiv
http://www.R-project.org
Microsoft Excel
A Spreadsheet Application. It features calculation, graphing tools, pivot
tables and a macro programming language called VBA (Visual Basic for
Applications).

There are many versions of MS-Excel. Excel XP, Excel 2003, Excel 2007
are capable of performing a number of statistical analyses.

Starting MS Excel: Double click on the Microsoft Excel icon on the


desktop or Click on Start --> Programs --> Microsoft Excel.

Worksheet: Consists of a grid of cells, with numbered rows down the page and
alphabetically-titled columns across the page. Each cell is referenced by its
coordinates. For example, A3 refers to the cell in column A and row 3, and
B10:B20 refers to the range of cells in column B, rows 10 through 20.
Opening a document: File → Open (from an existing workbook). Change the
directory or drive to look for files in other locations.
Creating a new workbook: File → New → Blank Document
Saving a file: File → Save

Selecting more than one cell: Click on a cell (e.g. A1), then hold the Shift
key and click on another (e.g. D4) to select the cells between A1 and D4, or
click on a cell and drag the mouse across the desired range.

Creating formulas: 1. Click the cell in which you want to enter the formula;
2. Type = (an equal sign); 3. Click the Function button (fx); 4. Select the
formula you want and step through the on-screen instructions.
Entering Date and Time: Dates are stored as MM/DD/YYYY, but there is no need
to enter them in that format. For example, Excel will recognize "jan 9" or
"jan-9" as 1/9/2007 and "jan 9, 1999" as 1/9/1999. To enter today's date,
press Ctrl and ; together. Use "a" or "p" to indicate am or pm; for example,
"8:30 p" is interpreted as 8:30 pm. To enter the current time, press Ctrl and
: together.

Copy and Paste all cells in a Sheet: Ctrl+A for selecting, Ctrl +C for copying
and Ctrl+V for Pasting.

Sorting: Data → Sort → Sort By

Descriptive statistics and other statistical methods: Tools → Data Analysis →
choose the statistical method. If Data Analysis is not available, click on
Tools → Add-Ins and then select Analysis ToolPak and Analysis ToolPak - VBA.
Statistical and mathematical functions: Start with the = sign and then select
a function from the function wizard (fx).

Inserting a chart: Click on the Chart Wizard (or Insert → Chart), select a
chart type, give the input data range, update the chart options, and select
the output range/worksheet.

Importing data into Excel: File → Open → set the file type → click on the
file → choose Delimited or Fixed Width → choose the delimiter
(Tab/Semicolon/Comma/Space/Other) → Finish.

Limitations: Excel uses algorithms that are vulnerable to rounding and
truncation errors and may produce inaccurate results in extreme cases.
Statistical Package for the Social Sciences (SPSS)
A general purpose statistical package SPSS is widely used in the social
sciences, particularly in sociology and psychology.
SPSS can import data from almost any type of file to generate tabulated
reports, plots of distributions and trends, descriptive statistics, and
complex statistical analyses.
Starting SPSS: Double-click the SPSS icon on the desktop, or Start →
Programs → SPSS.

Opening an SPSS file: File → Open
MENUS AND TOOLBARS


Data Editor
Various pull-down menus appear at the top of the Data Editor window. These
pull-down menus are at the heart of using SPSSWIN. The Data Editor menu
items (with some of the uses of the menu) are:
MENUS AND TOOLBARS

FILE - used to open and save data files

EDIT - used to copy and paste data values, find data in a file, and insert
variables and cases; OPTIONS allows the user to set general preferences as
well as the setup for the Navigator, Charts, etc.

VIEW - the user can change toolbars; value labels can be shown in cells
instead of data values

DATA - select, sort or weight cases; merge files

TRANSFORM - compute new variables, recode variables, etc.

MENUS AND TOOLBARS

ANALYZE - perform various statistical procedures

GRAPHS - create bar and pie charts, etc.

UTILITIES - add comments to accompany a data file (and other, advanced
features)

ADD-ONS - features not currently installed (advanced statistical procedures)

WINDOW - switch between data, syntax and navigator windows

HELP - access SPSSWIN Help information

MENUS AND TOOLBARS

Navigator (Output) Menus


When statistical procedures are run or charts are created, the output will appear
in the Navigator window. The Navigator window contains many of the pull-down
menus found in the Data Editor window. Some of the important menus in the
Navigator window include:

INSERT - used to insert page breaks, titles, charts, etc.

FORMAT - for changing the alignment of a particular portion of the output

Formatting Toolbar
When a table has been created by a statistical procedure, the user can edit
the table to create a desired look or add/delete information. Beginning with
version 14.0, the user has a choice of editing the table in the Output or
opening it in a separate Pivot Table window. Various pull-down menus are
activated when the user double-clicks on the table. These include:

EDIT - undo and redo a pivot; select a table or table body (e.g., to change
the font)

INSERT - used to insert titles, captions and footnotes

PIVOT - used to perform a pivot of the row and column variables

FORMAT - various modifications can be made to tables and cells

Additional menus
CHART EDITOR - used to edit a graph

SYNTAX EDITOR - used to edit the text in a syntax window

Show or hide a toolbar: Click on VIEW → TOOLBARS to show or hide it.

Move a toolbar: Click on the toolbar (but not on one of the pushbuttons) and
then drag the toolbar to its new location.

Customize a toolbar: Click on VIEW → TOOLBARS → CUSTOMIZE

Importing data from an EXCEL spreadsheet:
Data from an Excel spreadsheet can be imported into SPSSWIN as follows:
1. In SPSSWIN click on FILE → OPEN → DATA. The OPEN DATA FILE dialog box will
appear.
2. Locate the file of interest: Use the "Look In" pull-down list to identify the folder
containing the Excel file of interest
3. From the FILE TYPE pull down menu select EXCEL (*.xls).
4. Click on the file name of interest and click on OPEN or simply double-click on
the file name.
5. Keep the box checked that reads "Read variable names from the first row of
data". This presumes that the first row of the Excel data file contains the
variable names. [If the data resided in a different worksheet in the Excel
file, this would need to be entered.]
6. Click on OK. The Excel data file will now appear in the SPSSWIN Data
Editor.
Importing data from an EXCEL spreadsheet:
7. The former Excel spreadsheet can now be saved as an SPSS file (FILE →
SAVE AS) and is ready to be used in analyses. Typically, you would label
variables and values, and define missing values.
Importing an Access table
SPSSWIN does not offer a direct import for Access tables. Therefore, we must follow
these steps:
1. Open the Access file
2. Open the data table
3. Save the data as an Excel file
4. Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN.
Importing Text Files into SPSSWIN
Text data points typically are separated (or delimited) by tabs or commas.
Sometimes they can be of fixed format.
Importing tab-delimited data
In SPSSWIN click on FILE → OPEN → DATA. Look in the appropriate location for
the text file, then select Text from "Files of type:". Click on the file name
and then click on Open. You will see the Text Import Wizard (step 1 of 6)
dialog box.

You will now have an SPSS data file containing the former tab-delimited data. You
simply need to add variable and value labels and define missing values.

Exporting Data to Excel

Click on FILE → SAVE AS. Click on the file name for the file to be exported.
For "Save as type" select Excel (*.xls) from the pull-down menu. You will
notice the checkbox "Write variable names to spreadsheet"; leave this checked,
as you will want the variable names to be in the first row of each column in
the Excel spreadsheet. Finally, click on Save.
Running the FREQUENCIES procedure

1. Open the data file of interest (from the menus, click on FILE → OPEN →
DATA).

2. From the menus, click on ANALYZE → DESCRIPTIVE STATISTICS → FREQUENCIES.
3. The FREQUENCIES Dialog Box will appear. In the left-hand box will be a listing
("source variable list") of all the variables that have been defined in the data file. The
first step is identifying the variable(s) for which you want to run a frequency analysis.
Click on a variable name(s). Then click the [ > ] pushbutton. The variable name(s)
will now appear in the VARIABLE[S]: box ("selected variable list"). Repeat these
steps for each variable of interest.

4. If all that is being requested is a frequency table showing counts and
percentages (raw, adjusted and cumulative), then click on OK.
Requesting STATISTICS
Descriptive and summary STATISTICS can be requested for numeric variables. To
request Statistics:
1. From the FREQUENCIES Dialog Box, click on the STATISTICS... pushbutton.
2. This will bring up the FREQUENCIES: STATISTICS Dialog Box.
3. The STATISTICS Dialog Box offers the user a variety of choices:

DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics
(click on ANALYZE → DESCRIPTIVE STATISTICS → DESCRIPTIVES). The procedure
offers many of the same statistics as the FREQUENCIES procedure, but without
generating frequency analysis tables.
Requesting CHARTS
One can request a chart (graph) to be created for a variable or variables included in
a FREQUENCIES procedure.

1. In the FREQUENCIES Dialog box click on CHARTS.


2. The FREQUENCIES: CHARTS dialog box will appear. Choose the intended chart
(e.g. bar diagram, pie chart, histogram).

Pasting charts into Word

1. Click on the chart.
2. Click on the pull-down menu EDIT → COPY OBJECTS.
3. Go to the Word document in which the chart is to be embedded. Click on
EDIT → PASTE SPECIAL.
4. Select Formatted Text (RTF) and then click on OK.
5. Enlarge the graph to the desired size by dragging one or more of the black
squares along the perimeter (if the black squares are not visible, click once
on the graph).
BASIC STATISTICAL PROCEDURES: CROSSTABS
1. From the ANALYZE pull-down menu, click on DESCRIPTIVE STATISTICS →
CROSSTABS.
2. The CROSSTABS dialog box will then open.

3. From the variable selection box on the left click on a variable you wish to
designate as the Row variable. The values (codes) for the Row variable make up
the rows of the crosstabs table. Click on the arrow (>) button for Row(s). Next,
click on a different variable you wish to designate as the Column variable. The
values (codes) for the Column variable make up the columns of the crosstabs
table. Click on the arrow (>) button for Column(s).

4. You can specify more than one variable in the Row(s) and/or Column(s). A cross
table will be generated for each combination of Row and Column variables
Limitations: SPSS users have less control over data manipulation and
statistical output than other statistical packages such as SAS, Stata etc.

SPSS is a good first statistical package to perform quantitative research


in social science because it is easy to use and because it can be a good
starting point to learn more advanced statistical packages.
Introduction to Statistics

Colm O'Dushlaine
Neuropsychiatric Genetics, TCD
codushlaine@gmail.com
Overview

Descriptive Statistics & Graphical Presentation of Data
Statistical Inference
Hypothesis Tests & Confidence Intervals
T-tests (Paired/Two-sample)
Regression (SLR & Multiple Regression)
ANOVA/ANCOVA

Intended as an overview; slides will be provided after the lectures.
What's in the lectures?...
Lecture 1: Descriptive Statistics and Graphical Presentation of Data

1. Terminology
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)
Statistical Inference

1. Distributions & Densities
2. Normal Distribution
3. Sampling Distribution & Central Limit Theorem
4. Hypothesis Tests
5. P-values
6. Confidence Intervals
7. Two-Sample Inferences
8. Paired Data
Sample Inferences

1. Two-Sample Inferences
   Paired t-test
   Two-sample t-test
2. Inferences for more than two samples
   One-way ANOVA
   Two-way ANOVA
   Interactions in Two-way ANOVA
3. DataDesk demo
Lecture 4

1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites
FIRST, A REALLY USEFUL SITE
Explanations of outputs
Videos with commentary
Help with deciding what test
to use with what data

151
1. Terminology
Populations & Samples
Population: the complete set of
individuals, objects or scores of interest.
Often too large to sample in its entirety
It may be real or hypothetical (e.g. the results
from an experiment repeated ad infinitum)

Sample: A subset of the population.


A sample may be classified as random (each
member has an equal chance of being selected
from the population) or convenience (what's
available).
Random selection attempts to ensure the
sample is representative of the population.
152
Variables
Variables are the quantities measured in a
sample. They may be classified as:
Quantitative i.e. numerical
Continuous (e.g. pH of a sample, patient
cholesterol levels)
Discrete (e.g. number of bacteria
colonies in a culture)
Categorical
Nominal (e.g. gender, blood group)
Ordinal (ranked e.g. mild, moderate or
severe illness). Often ordinal variables
are re-coded to be quantitative.
Variables
Variables can be further classified as:
Dependent/Response. Variable of primary
interest (e.g. blood pressure in an
antihypertensive drug trial). Not controlled by
the experimenter.
Independent/Predictor: called a Factor when
controlled by the experimenter (it is often
nominal, e.g. treatment), and a Covariate
when not controlled.
If the value of a variable cannot be
predicted in advance then the variable is
referred to as a random variable.
Parameters & Statistics
Parameters: Quantities that describe a
population characteristic. They are usually
unknown and we wish to make statistical
inferences about parameters. (Not to be
confused with perimeters.)

Descriptive Statistics: Quantities and
techniques used to describe a sample
characteristic or display the sample data.
2. Frequency Distributions
An (Empirical) Frequency Distribution
or Histogram for a continuous variable
presents the counts of observations
grouped within pre-specified classes or
groups

A Relative Frequency Distribution


presents the corresponding proportions of
observations within the classes

A Barchart presents the frequencies for a


categorical variable
156
Example Serum CK
Blood samples taken from 36 male
volunteers as part of a study to
determine the natural variation in CK
concentration.

The serum CK concentrations, measured in
U/l, are as follows:

157
Serum CK Data for 36 male
volunteers

121 82 100 151 68 58


95 145 64 201 101 163
84 57 139 60 78 94
119 104 110 113 118 203
62 83 67 93 92 110
25 123 70 48 95 42
158
Relative Frequency Table

    Serum CK (U/l)   Frequency   Relative Freq.   Cumulative Rel. Freq.
    20-39            1           0.028            0.028
    40-59            4           0.111            0.139
    60-79            7           0.194            0.333
    80-99            8           0.222            0.555
    100-119          8           0.222            0.777
    120-139          3           0.083            0.860
    140-159          2           0.056            0.916
    160-179          1           0.028            0.944
    180-199          0           0.000            0.944
    200-219          2           0.056            1.000
    Total            36          1.000
Frequency Distribution

[Histogram: frequency distribution of CK concentration (U/l) for the 36
volunteers, with quantiles (minimum, quartiles, median, maximum) listed
alongside]
Relative Frequency Distribution

[Relative frequency histogram of CK concentration (U/l). The shaded area is
the percentage of males with CK values between 60 and 100 U/l, i.e. 42%. The
distribution has a long right tail (right-skewed).]
3. Measures of Central
Tendency (Location)
Measures of location indicate where on the
number line the data are to be found. Common
measures of location are:

(i) the Arithmetic Mean,


(ii) the Median, and
(iii) the Mode

162
The Mean

Let x1, x2, x3, ..., xn be the realised values of a random variable X, from a
sample of size n. The sample arithmetic mean is defined as

    x̄ = (1/n) Σ xi   (sum over i = 1, ..., n)
Example

Example 2: The systolic blood pressures of seven middle-aged men were as
follows: 151, 124, 132, 170, 146, 124 and 113.

    x̄ = (151 + 124 + 132 + 170 + 146 + 124 + 113) / 7 = 137.14
The Median and Mode
If the sample data are arranged in
increasing order, the median is
(i) the middle value if n is an odd
number, or
(ii) midway between the two middle
values if n is an even number
The mode is the most commonly
occurring value.

165
Example 1 n is odd
The reordered systolic blood pressure data seen
earlier are:

113, 124, 124, 132, 146, 151, and 170.

The Median is the middle value of the ordered


data, i.e. 132.

Two individuals have systolic blood pressure =


124 mm Hg, so the Mode is 124.

166
Example 2 n is even

Six men with high cholesterol participated in a study to


investigate the effects of diet on cholesterol level. At the
beginning of the study, their cholesterol levels (mg/dL)
were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The Median is halfway between the middle two readings,
i.e. (274 + 292) / 2 = 283.

Two men have the same cholesterol level, so the Mode is 274.
167
Mean versus Median

Large sample values tend to inflate the mean. This will happen if the
histogram of the data is right-skewed.

The median is not influenced by large sample values and is a better measure
of centrality if the distribution is skewed.

Note: if mean = median = mode then the data are said to be symmetrical.

E.g. in the CK measurement study, the sample mean = 98.28 and the median =
94.5; the mean is larger than the median, indicating that the mean is inflated
by the two large data values 201 and 203.
4. Measures of Dispersion

Measures of dispersion characterise how spread out the distribution is, i.e.,
how variable the data are. Commonly used measures of dispersion include:

1. Range
2. Variance & standard deviation
3. Coefficient of variation (or relative standard deviation)
4. Inter-quartile range
169
Range
the sample Range is the difference
between the largest and smallest
observations in the sample
easy to calculate;
Blood pressure example: min=113
and max=170, so the range=57
mmHg
useful for best or worst case
scenarios
sensitive to extreme values
170
Sample Variance

The sample variance, s², is the arithmetic mean of the squared deviations
from the sample mean:

    s² = Σ (xi - x̄)² / (n - 1)   (sum over i = 1, ..., n)
Standard Deviation

The sample standard deviation, s, is the square root of the variance:

    s = √[ Σ (xi - x̄)² / (n - 1) ]

s has the advantage of being in the same units as the original variable x.
Example

    Data   Deviation   Deviation²
    151     13.86        192.02
    124    -13.14        172.73
    132     -5.14         26.45
    170     32.86       1079.59
    146      8.86         78.45
    124    -13.14        172.73
    113    -24.14        582.88

    Sum = 960.0   Sum = 0.00   Sum = 2304.86

    x̄ = 137.14
Example (contd.)

From the table, Σ (xi - x̄)² = 2304.86. Therefore,

    s = √[ 2304.86 / (7 - 1) ] = 19.6
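The hand calculation can be verified in a couple of lines (a sketch; statistics.stdev uses the same n - 1 denominator):

```python
import statistics

bp = [151, 124, 132, 170, 146, 124, 113]  # systolic blood pressures
print(round(statistics.mean(bp), 2))      # 137.14
print(round(statistics.stdev(bp), 1))     # 19.6
```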
Coefficient of Variation

The coefficient of variation (CV) or relative standard deviation (RSD) is the
sample standard deviation expressed as a percentage of the mean, i.e.

    CV = (s / x̄) × 100%

The CV is not affected by multiplicative changes in scale. Consequently, it is
a useful way of comparing the dispersion of variables measured on different
scales.
175
Example

The CV of the blood pressure data is:

    CV = (19.6 / 137.1) × 100% = 14.3%

i.e., the standard deviation is 14.3% as large as the mean.
176
Inter-quartile range
The Median divides a distribution into two
halves.

The first and third quartiles (denoted Q1 and


Q3) are defined as follows:
25% of the data lie below Q1 (and 75% is above Q1),
25% of the data lie above Q3 (and 75% is below Q3)

The inter-quartile range (IQR) is the difference between the first and third
quartiles, i.e. IQR = Q3 - Q1.
Example

The ordered blood pressure data are:

    113  124  124  132  146  151  170
         Q1              Q3

The inter-quartile range (IQR) is 151 - 124 = 27.
178
60% of slides complete!

179
5. Box-plots
A box-plot is a visual description of
the distribution based on
Minimum
Q1
Median
Q3
Maximum
Useful for comparing large sets of
data
180
Example 1

The pulse rates of 12 individuals arranged in increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80

Q1 = (68+70)/2 = 69, Q3 = (76+78)/2 = 77

IQR = 77 - 69 = 8
181
Example 1: Box-plot

182
Example 2: Box-plots of intensities from 11 gene expression arrays

[Side-by-side box-plots of intensities for arrays AG_04659_AS.cel,
AG_11745_AS.cel, ..., KB_5828_AS.cel, KB_8840_AS.cel]
Outliers
An outlier is an observation which
does not appear to belong with the
other data
Outliers can arise because of a
measurement or recording error or
because of equipment failure during
an experiment, etc.
An outlier might be indicative of a sub-
population, e.g. an abnormally low or
high value in a medical test could indicate disease.
Outlier Boxplot

Re-define the upper and lower limits of the boxplot (the whisker lines) as:
Lower limit = Q1 - 1.5 × IQR
Upper limit = Q3 + 1.5 × IQR

Note that the whiskers may not extend as far as these limits. If a data point
is below the lower limit or above the upper limit, the data point is
considered to be an outlier.
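The limits are easy to compute; using the pulse-rate quartiles from Example 1 (a sketch):

```python
def outlier_limits(q1, q3):
    """Boxplot whisker limits: 1.5 * IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lo, hi = outlier_limits(69, 77)  # pulse-rate data: Q1 = 69, Q3 = 77
print(lo, hi)  # 57.0 89.0 -- observations outside this interval are flagged as outliers
```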
Example: CK data

[Outlier boxplot of the CK data, with the outlying points flagged]
6. Scatter-plot

Displays the relationship between two continuous variables.

Useful in the early stages of analysis when exploring data and determining
whether a linear regression analysis is appropriate.

May show outliers in your data.


Example 1: Age versus Systolic
Blood Pressure in a Clinical Trial

188
Example 2: Up-regulation/Down-regulation
of gene expression across an array
(Control Cy5 versus Disease Cy3)

189
Example of a Scatter-plot matrix
(multiple pair-wise plots)

190
Other graphical representations
Dot-Plots, Stem-and-leaf plots
Not visually appealing
Pie-chart
Visually appealing, but hard to compare two
datasets. Best for 3 to 7 categories. A total must be
specified.
Violin-plots
=boxplot+smooth density
Nice visual of data shape

191
Multivariate Data
Clustering is useful for visualising
multivariate data and uncovering patterns,
often reducing its complexity

Clustering is especially useful for high-


dimensional data (p>>n): hundreds or
perhaps thousands of variables

Obvious areas of application are gel


electrophoresis and microarray
experiments where the variables are
protein abundances or gene expression
ratios
192
7. Clustering

Aim: Find groups of samples or variables


sharing similarity

Clustering requires a definition of distance


between objects, quantifying a notion of
(dis)similarity
Points are grouped on the basis of minimum
distance apart (distance measures)

Once a pair are grouped, they are combined


into a single point (using a linkage method)
e.g. take their average. The process is then
repeated. 193
Clustering
Clustering can be applied to rows or columns of a data
set (matrix) i.e. to the samples or variables

A tree can be constructed with branch length


proportional to distances between linked clusters, called
a Dendrogram

Clustering is an example of unsupervised learning: No


use is made of sample annotations i.e. treatment groups,
diagnosis groups

194
UPGMA
Unweighted Pair-Group Method Average
Most commonly used clustering method
Procedure:
1. Each observation forms its own cluster
2. The two with minimum distance are grouped into
a single cluster representing a new observation-
take their average
3. Repeat 2. until all data points form a single
cluster

195
Contrived Example
5 genes of interest on 3 replicate arrays/gels

            Array1   Array2   Array3
p53            9        3        7
mdm2          10        2        9
bcl2           1        9        4
cyclinE        6        5        5
caspase 8      1       10        3

Euclidean distance:
d_xy = sqrt((x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2)

Calculate the distance between each pair of genes,
e.g. d(p53, mdm2) = sqrt((9 - 10)^2 + (3 - 2)^2 + (7 - 9)^2) ≈ 2.5
196
Example
Construct a distance matrix of all pair-wise
distances
p53 mdm2 bcl2 cyclinE caspase 8

p53 0 2.5 10.44 4.12 11.75


mdm2 - 0 12.5 6.4 13.93
bcl2 - - 0 6.48 1.41
cyclinE - - - 0 7.35
caspase 8 - - - - 0

Cluster the 2 genes with smallest distance


Take their average & re-calculate distances to other genes

197
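The pairwise distance matrix above can be reproduced with a few lines of plain Python; this sketch also finds the closest pair, which is the first pair UPGMA merges (the slide rounds sqrt(6) ≈ 2.45 up to 2.5 for the p53-mdm2 distance).

```python
import math
from itertools import combinations

# Expression values on the 3 arrays, from the contrived example
genes = {
    "p53":      (9, 3, 7),
    "mdm2":     (10, 2, 9),
    "bcl2":     (1, 9, 4),
    "cyclinE":  (6, 5, 5),
    "caspase8": (1, 10, 3),
}

def euclid(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# All pairwise distances, keyed by gene pair
dist = {(g, h): euclid(genes[g], genes[h]) for g, h in combinations(genes, 2)}

# The pair with minimum distance is merged first by UPGMA
closest = min(dist, key=dist.get)
```

Here `closest` comes out as the bcl2/caspase-8 pair at distance sqrt(2) ≈ 1.41, matching the smallest entry in the matrix.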
After merging {caspase-8 & bcl-2}:

                      p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                    0      2.5     4.12           10.9
mdm2                           0      6.4             9.1
cyclin E                               0              6.9
{caspase-8 & bcl-2}                                    0

After merging {p53 & mdm2}:

                      {p53 & mdm2}   cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}               0            3.7            9.2
cyclin E                                 0             6.9
{caspase-8 & bcl-2}                                     0

198
Example (contd)

…and the final cluster:

199
Example of a gene expression
dendrogram

200
Variety of approaches to clustering

Clustering techniques
agglomerative -start with every element in its own
cluster, and iteratively join clusters together
divisive - start with one cluster and iteratively divide it
into smaller clusters
Distance Metrics
Euclidean (as-the-crow-flies)
Manhattan
Minkowski (a whole class of metrics)
Correlation (similarity in profiles: called similarity
metrics)
Linkage Rules
average: Use the mean distance between cluster
members
single: Use the minimum distance (gives loose clusters)
complete: Use the maximum distance (gives tight
clusters)
median: Use the median distance
centroid: Use the distance between the average points (centroids) of the clusters
201
Clustering Summary
The clusters & tree topology often depend
highly on the distance measure and linkage
method used

Recommended to use two distance metrics,


such as Euclidean and a correlation metric

A clustering algorithm will always yield


clusters, whether the data are organised in
clusters or not!

202
What is Statistics?
Statistics is a way to get information
from data
Data → Statistics → Information

Data: Facts, Information:


especially numerical Knowledge
facts, collected communicated
together for concerning some
reference or particular fact.
information.

Definitions: Oxford English Dictionary


1.203
Interval Data
Interval data
Real numbers, i.e. heights, weights,
prices, etc.
Also referred to as quantitative or
numerical.

Arithmetic operations can be performed on


Interval Data, thus it's meaningful to talk
about 2*Height, or Price + $1, and so on.

1.204
Nominal Data
Nominal Data
The values of nominal data are categories.
E.g. responses to questions about marital status,
coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4

Because the numbers are arbitrary, arithmetic

operations don't make any sense (e.g. does
Widowed ÷ 2 = Married?!)

Nominal data are also called qualitative or


categorical.

1.205
Ordinal Data
Ordinal Data appear to be categorical in nature, but their
values have an order; a ranking to them:

E.g. College course rating system:


poor = 1, fair = 2, good = 3, very good = 4, excellent = 5

While it's still not meaningful to do arithmetic on this data

(e.g. does 2*fair = very good?!), we can say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what numeric
values are assigned to each category.

1.206
Graphical & Tabular Techniques for Nominal
Data

The only allowable calculation on nominal data is to


count the frequency of each value of the variable.

We can summarize the data in a table that presents


the categories and their counts called a frequency
distribution.

A relative frequency distribution lists the


categories and the proportion with which each occurs.

Refer to Example 2.1

1.207
Nominal Data (Tabular
Summary)

1.208
Nominal Data (Frequency)

Bar Charts are often used to display frequencies


1.209
Nominal Data
It's all the same information
(based on the same data).
Just a different presentation.

1.210
Graphical Techniques for Interval
Data
There are several graphical methods that are
used when the data are interval (i.e. numeric,
non-categorical).

The most important of these graphical methods


is the histogram.

The histogram is not only a powerful graphical


technique used to summarize interval data,
but it is also used to help explain probabilities.

1.211
Building a Histogram
1) Collect the Data
2) Create a frequency distribution for
the data.
3) Draw the Histogram.

1.212
Histogram and Stem &
Leaf

1.213
Ogive

An ogive is a graph of a cumulative frequency
distribution.

We create an ogive in three steps


1) Calculate relative frequencies.
2) Calculate cumulative relative
frequencies by adding the current class
relative frequency to the previous class
cumulative relative frequency.
(For the first class, its cumulative relative frequency is just its
relative frequency)

1.214
Cumulative Relative
Frequencies
first class: .355
next class: .355 + .185 = .540

:
:

last class: .930 + .070 = 1.00

1.215
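The running totals behind an ogive are a cumulative sum, which itertools.accumulate computes directly. In the sketch below the first, second, and last relative frequencies match the slide; the middle values are hypothetical fill-ins chosen so the list sums to 1.

```python
from itertools import accumulate

# Relative frequencies by class. First (.355), second (.185) and last (.070)
# come from the slide; the middle values are hypothetical.
rel_freq = [0.355, 0.185, 0.150, 0.120, 0.075, 0.045, 0.070]

# Cumulative relative frequencies: each entry adds the current class's
# relative frequency to the previous cumulative total
cum_rel_freq = list(accumulate(rel_freq))
```

The second entry reproduces .355 + .185 = .540, and the final entry is 1.00, as it must be.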
Ogive
The ogive can be used
to answer questions
like:

What telephone bill


value is at the 50th
percentile?

around $35
(Refer also to Fig. 2.13 in your textbook.)
1.216
Scatter Diagram
Example 2.9 A real estate agent wanted
to know to what extent the selling price
of a home is related to its size

1) Collect the data


2) Determine the independent variable (X =
house size) and the dependent variable
(Y = selling price)
3) Use Excel to create a scatter diagram

1.217
Scatter Diagram
It appears that there is in fact a
relationship: the greater the house
size, the greater the selling
price.

1.218
Patterns of Scatter
Diagrams
Linearity and Direction are two
concepts we are interested in

Positive Linear Relationship Negative Linear Relationship

Weak or Non-Linear Relationship 1.219


Time Series Data
Observations measured at the same point in
time are called cross-sectional data.

Observations measured at successive points


in time are called time-series data.

Time-series data are graphed on a line chart,
which plots the value of the variable on the
vertical axis against the time periods on the
horizontal axis.

1.220
Numerical Descriptive
Techniques
Measures of Central Location
Mean, Median, Mode

Measures of Variability
Range, Standard Deviation, Variance, Coefficient of
Variation

Measures of Relative Standing


Percentiles, Quartiles

Measures of Linear Relationship


Covariance, Correlation, Least Squares Line

1.221
Measures of Central
Location
The arithmetic mean, a.k.a.
average, shortened to mean, is the
most popular & useful measure of
central location.
Mean = (Sum of the observations) / (Number of observations)

It is computed by simply adding up
all the observations and dividing by
the total number of observations.

1.222
Arithmetic Mean

Sample Mean
Population Mean

1.223
Statistics is a pattern
language
            Population   Sample
Size            N           n
Mean            μ           x̄

1.224
The Arithmetic Mean
is appropriate for describing
measurement data, e.g. heights of
people, marks of student papers, etc.

is seriously affected by extreme values


called outliers. E.g. as soon as a
billionaire moves into a neighborhood,
the average household income increases
beyond what it was previously!

1.225
Measures of Variability
Measures of central location fail to
tell the whole story about the
distribution; that is, how much are
the observations spread out around the
mean value?

For example, two sets of class grades
are shown. The mean (= 50) is the
same in each case, but the red class has
greater variability than the blue class.

1.226
Range
The range is the simplest measure of variability,
calculated as:

Range = Largest observation - Smallest observation

E.g.
Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
The range is the same in both cases,
but the data sets have very different distributions

1.227
Statistics is a pattern
language
            Population   Sample
Size            N           n
Mean            μ           x̄
Variance        σ²          s²

1.228
Variance
The variance of a population is:
σ² = Σ(xᵢ - μ)² / N
where μ is the population mean and N is the population size.

The variance of a sample is:
s² = Σ(xᵢ - x̄)² / (n - 1)
where x̄ is the sample mean.

Note! the denominator is sample size (n) minus one!

1.229
Application
Example 4.7. The following sample consists of the
number of jobs six randomly selected students
applied for: 17, 15, 23, 7, 9, 13.
Find its mean and variance.

What are we looking to calculate?

The sample mean x̄ and the sample variance s²,
as opposed to μ or σ².
1.230
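This example can be checked with Python's standard statistics module, whose variance and stdev functions use the n - 1 denominator, matching the sample formulas above.

```python
import statistics

jobs = [17, 15, 23, 7, 9, 13]      # jobs applied for by the six students
xbar = statistics.mean(jobs)       # sample mean: 84 / 6 = 14
s2 = statistics.variance(jobs)     # sample variance, n - 1 denominator
s = statistics.stdev(jobs)         # sample standard deviation
```

The mean comes out at 14 jobs and the variance at 166/5 = 33.2.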
Sample
Sample Mean
Mean & Variance

Sample Variance

Sample Variance (shortcut method)

1.231
Standard Deviation
The standard deviation is simply the
square root of the variance, thus:

Population standard deviation:

Sample standard deviation:

1.232
Standard Deviation
Consider Example 4.8 where a golf
club manufacturer has designed a
new club and wants to determine if it
is hit more consistently (i.e. with less
variability) than with an old club.
Using Tools > Data Analysis > Descriptive
Statistics in Excel (you may need to add it in),
we produce the following tables for distance.

Interpretation: you get more consistent
distance with the new club.
1.233
The Empirical Rule
If the histogram is bell shaped:
Approximately 68% of all observations fall
within one standard deviation of the mean.

Approximately 95% of all observations fall


within two standard deviations of the mean.

Approximately 99.7% of all observations fall


within three standard deviations of the mean.

1.234
Chebysheff's Theorem
(Not often used because the interval is very wide.)

A more general interpretation of the
standard deviation is derived from
Chebysheff's Theorem, which applies to
all shapes of histograms (not just bell
shaped).

The proportion of observations in any
sample that lie within k standard
deviations of the mean is at least
1 - 1/k².

For k = 2 (say), the theorem states that at
least 3/4 of all observations lie within 2
standard deviations of the mean. This is
a lower bound compared to the Empirical
Rule's approximation (95%).
1.235
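The Chebysheff bound is one line of code; for k = 2 it gives the 3/4 quoted above, noticeably weaker than the Empirical Rule's 95% for bell shapes.

```python
def chebyshev_bound(k):
    """Lower bound on the proportion of observations within
    k standard deviations of the mean (any distribution shape)."""
    return 1 - 1 / k ** 2

bound2 = chebyshev_bound(2)   # at least 75% within 2 sd
bound3 = chebyshev_bound(3)   # at least ~88.9% within 3 sd
```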
Box Plots
These box plots are
based on data in
Xm04-15.

Wendy's service
time is shortest and
least variable.

Hardee's has the
greatest variability,
while Jack-in-the-Box
has the longest
service times.

1.236
Methods of Collecting
Data
There are many methods used to
collect or obtain data for statistical
analysis. Three of the most popular
methods are:
Direct Observation
Experiments, and
Surveys.

1.237
Sampling
Recall that statistical inference permits us to draw
conclusions about a population based on a sample.

Sampling (i.e. selecting a sub-set of a whole


population) is often done for reasons of cost (its less
expensive to sample 1,000 television viewers than
100 million TV viewers) and practicality (e.g.
performing a crash test on every automobile
produced is impractical).

In any case, the sampled population and the


target population should be similar to one another.

1.238
Sampling Plans
A sampling plan is just a method or
procedure for specifying how a sample will be
taken from a population.

We will focus our attention on these three


methods:

Simple Random Sampling,


Stratified Random Sampling, and
Cluster Sampling.
1.239
Simple Random Sampling
A simple random sample is a sample
selected in such a way that every possible
sample of the same size is equally likely to
be chosen.

Drawing three names from a hat containing


all the names of the students in the class is
an example of a simple random sample: any
group of three names is as equally likely as
picking any other group of three names.

1.240
Stratified Random
Sampling
After the population has been
stratified, we can use simple
random sampling to generate the
complete sample:

If we only have sufficient resources to sample 400 people total,
we would draw 100 of them from the low income group;

if we are sampling 1000 people, we'd draw
50 of them from the high income group.
1.241
Cluster Sampling
A cluster sample is a simple random sample of
groups or clusters of elements (vs. a simple
random sample of individual objects).

This method is useful when it is difficult or costly


to develop a complete list of the population
members or when the population elements are
widely dispersed geographically.

Cluster sampling may increase sampling error


due to similarities among cluster members.

1.242
Sampling Error
Sampling error refers to differences between the
sample and the population that exist only because of the
observations that happened to be selected for the
sample.

Another way to look at this is: the differences in results
for different samples (of the same size) are due to
sampling error:

E.g. Two samples of size 10 of 1,000 households. If we


happened to get the highest income level data points in
our first sample and all the lowest income levels in the
second, this delta is due to sampling error.

1.243
Nonsampling Error
Nonsampling errors are more serious and are
due to mistakes made in the acquisition of data or
due to the sample observations being selected
improperly. Three types of nonsampling errors:

Errors in data acquisition,


Nonresponse errors, and
Selection bias.

Note: increasing the sample size will not reduce


this type of error.

1.244
Approaches to Assigning
Probabilities
There are three ways to assign a probability, P(O i),
to an outcome, Oi, namely:

Classical approach: make certain assumptions


(such as equally likely, independence) about
situation.

Relative frequency: assigning probabilities


based on experimentation or historical data.

Subjective approach: Assigning probabilities


based on the assignors judgment.

1.245
Interpreting Probability
One way to interpret probability is this:

If a random experiment is repeated an infinite


number of times, the relative frequency for any
given outcome is the probability of this outcome.

For example, the probability of heads in flip of a


balanced coin is .5, determined using the classical
approach. The probability is interpreted as being
the long-term relative frequency of heads if the
coin is flipped an infinite number of times.

1.246
Conditional Probability
Conditional probability is used to
determine how two events are
related; that is, we can determine
the probability of one event given
the occurrence of another related
event.

Conditional probabilities are written
as P(A | B), read as "the probability
of A given B", and calculated as
P(A | B) = P(A and B) / P(B).
1.247
Independence
One of the objectives of calculating conditional
probability is to determine whether two events are
related.

In particular, we would like to know whether they are


independent, that is, if the probability of one event
is not affected by the occurrence of the other event.

Two events A and B are said to be independent if


P(A|B) = P(A)
or
P(B|A) = P(B)

1.248
Complement Rule
The complement of an event A is the event that occurs
when A does not occur.

The complement rule gives us the probability of an


event NOT occurring. That is:

P(Aᶜ) = 1 - P(A)

For example, in the simple roll of a die, the probability


of the number 1 being rolled is 1/6. The probability
that some number other than 1 will be rolled is
1 - 1/6 = 5/6.

1.249
Multiplication Rule
The multiplication rule is used to
calculate the joint probability of
two events. It is based on the
formula for conditional probability
defined earlier:
If we multiply both sides of the equation by P(B) we have:

P(A and B) = P(A | B)P(B)

Likewise, P(A and B) = P(B | A) P(A)

If A and B are independent events, then P(A and B) = P(A)P(B)

1.250
Addition Rule
Recall: the addition rule was introduced
earlier to provide a way to compute the
probability of event A or B or both A and B
occurring; i.e. the union of A and B.

P(A or B) = P(A) + P(B) - P(A and B)

Why do we subtract the joint probability P(A and B)
from the sum of the probabilities of A and B?
Because otherwise the outcomes in both A and B
would be counted twice:

P(A or B) = P(A) + P(B) - P(A and B)

1.251
Addition Rule for Mutually Excusive
Events
If A and B are mutually exclusive, the occurrence of
one event makes the other one impossible. This means
that

P(A and B) = 0

The addition rule for mutually exclusive events is

P(A or B) = P(A) + P(B)

We often use this form when we add some joint


probabilities calculated from a probability tree

1.252
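The complement and addition rules can be sanity-checked by enumerating the six equally likely outcomes of one die roll. The events A (an even number) and B (a number greater than 3) below are chosen purely for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)                      # one roll of a fair die

def p(event):
    """Exact probability of an event over the six equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), 6)

A = lambda o: o % 2 == 0                    # even number: {2, 4, 6}
B = lambda o: o > 3                         # greater than 3: {4, 5, 6}

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p(A) + p(B) - p(lambda o: A(o) and B(o))

# Complement rule: P(not rolling a 1) = 1 - P(rolling a 1)
p_not_1 = 1 - p(lambda o: o == 1)
```

Direct enumeration of the event "A or B" gives the same 2/3 as the addition rule, and the complement rule gives 5/6, matching the die example in the slides.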
Two Types of Random
Variables
Discrete Random Variable
one that takes on a countable number of values
E.g. values on the roll of dice: 2, 3, 4, , 12

Continuous Random Variable


one whose values are not discrete, not countable
E.g. time (30.1 minutes? 30.10000001 minutes?)

Analogy:
Integers are Discrete, while Real Numbers are
Continuous

1.253
Laws of Expected Value
1. E(c) = c
The expected value of a constant (c) is just
the value of the constant.

2. E(X + c) = E(X) + c
3. E(cX) = cE(X)
We can pull a constant out of the
expected value expression (either as part of
a sum with a random variable X or as a
coefficient of random variable X).
1.254
Laws of Variance
1. V(c) = 0
The variance of a constant (c) is zero.

2. V(X + c) = V(X)
The variance of a random variable and a constant is
just the variance of the random variable (per 1 above).

3. V(cX) = c2V(X)
The variance of a random variable and a constant
coefficient is the coefficient squared times the variance
of the random variable.

1.255
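Laws 2 and 3 above combine to E(cX + b) = cE(X) + b and V(cX + b) = c²V(X). A quick numeric check on a small, hypothetical three-point distribution:

```python
# A hypothetical discrete random variable: value -> probability
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

# Expectation of any function of X under the pmf
E = lambda f: sum(f(x) * p for x, p in pmf.items())

mean = E(lambda x: x)                       # E(X)
var = E(lambda x: (x - mean) ** 2)          # V(X)

# Transformed variable Y = 4X + 7
mean_t = E(lambda x: 4 * x + 7)             # should equal 4*E(X) + 7
var_t = E(lambda x: (4 * x + 7 - mean_t) ** 2)   # should equal 16*V(X)
```

The additive constant shifts the mean but leaves the variance untouched, while the coefficient enters the variance squared.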
Binomial Distribution
The binomial distribution is the probability
distribution that results from doing a binomial
experiment. Binomial experiments have the
following properties:

1. Fixed number of trials, represented as n.


2. Each trial has two possible outcomes, a success
and a failure.
3. P(success) = p (and thus: P(failure) = 1 - p), for all trials.
4. The trials are independent, which means that the
outcome of one trial does not affect the outcomes of
any other trials.

1.256
Binomial Random Variable
The binomial random variable
counts the number of successes in n
trials of the binomial experiment. It
can take on values from 0, 1, 2, , n.
Thus, its a discrete random variable.

To calculate the probability associated with
each value, for x = 0, 1, 2, …, n,
we use combinatorics:
1.257
Binomial Table
What is the probability that Pat fails
the quiz?
i.e. what is P(X ≤ 4), given
P(success) = .20 and n = 10?

P(X ≤ 4) = .967
1.258
Binomial Table
What is the probability that Pat gets
two answers correct?
i.e. what is P(X = 2), given
P(success) = .20 and n = 10?

P(X = 2) = P(X ≤ 2) - P(X ≤ 1) = .678 - .376 = .302

(remember, the table shows cumulative probabilities) 1.259
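Instead of the table, the same probabilities can be computed directly from the binomial formula; a minimal sketch using math.comb:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials, success prob p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """Cumulative probability P(X <= x)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

p_two_right = binom_pmf(2, 10, 0.20)   # P(X = 2): two answers correct
p_fail = binom_cdf(4, 10, 0.20)        # P(X <= 4): Pat fails the quiz
```

These reproduce the table values: P(X = 2) ≈ .302 and P(X ≤ 4) ≈ .9672.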
=BINOMDIST() Excel Function
There is a binomial distribution
function in Excel that can also be
used to calculate these probabilities.

For example: What is the probability that
Pat gets two answers correct?

=BINOMDIST(# successes, # trials, P(success), cumulative (i.e. P(X ≤ x)?))

P(X = 2) = .3020
1.260
=BINOMDIST() Excel Function
There is a binomial distribution
function in Excel that can also be
used to calculate these probabilities.

For example: What is the probability that
Pat fails the quiz?

=BINOMDIST(# successes, # trials, P(success), cumulative (i.e. P(X ≤ x)?))

P(X ≤ 4) = .9672
1.261
Binomial Distribution
As you might expect, statisticians
have developed general formulas for
the mean, variance, and standard
deviation of a binomial random
variable. They are:

1.262
Poisson Distribution
Named for Simeon Poisson, the Poisson
distribution is a discrete probability distribution
and refers to the number of events (a.k.a.
successes) within a specific time period or region
of space. For example:
The number of cars arriving at a service station in 1
hour. (The interval of time is 1 hour.)
The number of flaws in a bolt of cloth. (The specific
region is a bolt of cloth.)
The number of accidents in 1 day on a particular
stretch of highway. (The interval is defined by both time,
1 day, and space, the particular stretch of highway.)

1.263
The Poisson Experiment
Like a binomial experiment, a Poisson experiment
has four defining characteristic properties:
1. The number of successes that occur in any interval is
independent of the number of successes that occur
in any other interval.
2. The probability of a success in an interval is the
same for all equal-size intervals
3. The probability of a success is proportional to the
size of the interval.
4. The probability of more than one success in an
interval approaches 0 as the interval becomes
smaller.

1.264
Poisson Distribution
The Poisson random variable is the number
of successes that occur in a period of time or
an interval of space in a Poisson experiment.

E.g. On average, 96 trucks arrive at a border
crossing every hour (the success is a truck
arrival; the time period is 1 hour).

E.g. The number of typographic errors in a new
textbook edition averages 1.5 per 100 pages
(the success is an error(?!); the interval is 100 pages).
1.265
Poisson Probability
Distribution
The probability that a Poisson random
variable assumes a value of x is given by:

P(X = x) = e^(-μ) μ^x / x!,  for x = 0, 1, 2, …

where μ is the mean number of successes in the interval
and e is the natural logarithm base (FYI: e ≈ 2.71828).
1.266
Example 7.12
The number of typographical errors in new
editions of textbooks varies considerably
from book to book. After some analysis he
concludes that the number of errors is
Poisson distributed with a mean of 1.5 per
100 pages. The instructor randomly selects
100 pages of a new book. What is the
probability that there are no typos?

That is, what is P(X = 0) given that μ = 1.5?


There is about a 22% chance of finding zero errors
1.267
Poisson Distribution
As mentioned on the Poisson experiment slide:

The probability of a success is


proportional to the size of the interval

Thus, knowing an error rate of 1.5 typos per


100 pages, we can determine a mean value for
a 400 page book as:

μ = 1.5(4) = 6 typos / 400 pages.

1.268
Example 7.13
For a 400 page book, what is the
probability that there are
no typos?

P(X = 0) = e^(-6) ≈ .0025
there is a very small chance there are no typos

1.269
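Both typo probabilities follow directly from the Poisson formula; a minimal sketch:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu ** x / factorial(x)

p_no_typos_100 = poisson_pmf(0, 1.5)   # 100 pages, mu = 1.5
p_no_typos_400 = poisson_pmf(0, 6.0)   # 400 pages, mu = 1.5 * 4 = 6
```

The 100-page probability is e^(-1.5) ≈ .2231, the 400-page probability e^(-6) ≈ .0025.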
Example 7.13
Excel is an even better alternative:

1.270
Probability Density
Functions
Unlike a discrete random variable which
we studied in Chapter 7, a continuous
random variable is one that can assume
an uncountable number of values.
We cannot list the possible values
because there is an infinite number of
them.
Because there is an infinite number of
values, the probability of each individual
value is virtually 0.
1.271
Point Probabilities are Zero
Because there is an infinite number of values, the
probability of each individual value is virtually 0.

Thus, we can determine the probability of a range


of values only.

E.g. with a discrete random variable like tossing a die, it is


meaningful to talk about P(X=5), say.
In a continuous setting (e.g. with time as a random variable), the
probability the random variable of interest, say task length, takes
exactly 5 minutes is infinitesimally small, hence P(X=5) = 0.
It is meaningful to talk about P(X ≤ 5).

1.272
Probability Density
Function
A function f(x) is called a probability density
function (over the range a ≤ x ≤ b) if it meets
the following requirements:

1) f(x) ≥ 0 for all x between a and b, and


f(x)

area=1

a b x
2) The total area under the curve between a and b is
1.0

1.273
The Normal Distribution
The normal distribution is the most important of
all probability distributions. The probability density
function of a normal random variable is given by:

It looks like this:


Bell shaped,
Symmetrical around the mean

1.274
The Normal Distribution
Important things to note:
The normal distribution is fully defined by two parameters:
its standard deviation and mean

The normal distribution is bell shaped and


symmetrical about the mean

Unlike the range of the uniform distribution (a x b)


Normal distributions range from minus infinity to plus infinity
1.275
Standard Normal
Distribution
A normal distribution whose mean is zero (μ = 0)
and standard deviation is one (σ = 1) is called
the standard normal distribution.

As we shall see shortly, any normal distribution can be


converted to a standard normal distribution with
simple algebra. This makes calculations much easier.

1.276
Calculating Normal
Probabilities
We can use the following function to
convert any normal random variable
to a standard normal random
variable:

Z = (X - μ) / σ

Some advice:
always draw a
picture!
1.277
Calculating Normal
Probabilities
Example: The time required to build a computer is
normally distributed with a mean of 50 minutes
and a standard deviation of 10 minutes:

What is the probability that a computer is


assembled in a time between 45 and 60 minutes?

Algebraically speaking, what is P(45 < X < 60) ?

1.278
Calculating Normal
Probabilities
mean of 50 minutes and a
standard deviation of 10 minutes
P(45 < X < 60) ?

1.279
Calculating Normal
Probabilities
We can use Table 3 in
Appendix B to look-up
probabilities P(0 < Z < z)

We can break up P(-.5 < Z < 1) into:

P(-.5 < Z < 0) + P(0 < Z < 1)

The distribution is symmetric around zero, so we have:

P(-.5 < Z < 0) = P(0 < Z < .5)
Hence: P(-.5 < Z < 1) = P(0 < Z < .5) + P(0 < Z < 1)

1.280
Calculating Normal
Probabilities
How to use Table 3
This table gives probabilities P(0 < Z < z)
First column = integer + first decimal
Top row = second decimal place

P(0 < Z < 0.5)

P(0 < Z < 1)

P(-.5 < Z < 1) = .1915 + .3413 = .5328

1.281
Using the Normal Table (Table 3)
What is P(Z > 1.6)?

P(0 < Z < 1.6) = .4452

P(Z > 1.6) = .5 - P(0 < Z < 1.6)
= .5 - .4452
= .0548
1.282
Using the Normal Table (Table 3)
What is P(Z < -2.23)?

P(Z < -2.23) = P(Z > 2.23)
= .5 - P(0 < Z < 2.23)
= .5 - .4871
= .0129
1.283
Using the Normal Table (Table 3)
What is P(Z < 1.52)?

P(Z < 0) = .5

P(Z < 1.52) = .5 + P(0 < Z < 1.52)
= .5 + .4357
= .9357
1.284
Using the Normal Table (Table 3)
What is P(0.9 < Z < 1.9)?

P(0.9 < Z < 1.9) = P(0 < Z < 1.9) - P(0 < Z < 0.9)
= .4713 - .3159
= .1554
1.285
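Rather than Table 3, the same probabilities can be computed from the standard normal CDF via math.erf. (Table look-ups round each entry to four decimals, so the last value below comes out .1553 rather than the table's .1554.)

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, P(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Assembly-time example: P(45 < X < 60) for X ~ N(mean 50, sd 10);
# standardise both endpoints, then difference the CDF
p_assembly = phi((60 - 50) / 10) - phi((45 - 50) / 10)

p_right = 1 - phi(1.6)              # P(Z > 1.6)
p_between = phi(1.9) - phi(0.9)     # P(0.9 < Z < 1.9)
```

These reproduce .5328 and .0548, with the three-decimal value .155 for the last interval.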
Finding Values of Z

Other Z values are


Z.05 = 1.645
Z.01 = 2.33

1.286
Using the values of Z

Because z.025 = 1.96 and - z.025=


-1.96, it follows that we can state

P(-1.96 < Z < 1.96) = .95

Similarly
P(-1.645 < Z < 1.645) = .90

1.287
Other Continuous
Distributions
Three other important continuous
distributions which will be used
extensively in later sections are
introduced here:

Student t Distribution,
Chi-Squared Distribution, and
F Distribution.

1.288
Student t Distribution
Here the letter t is used to represent the random
variable, hence the name. The density function
for the Student t distribution is as follows

ν (nu) is called the degrees of freedom, and
Γ (the Gamma function) satisfies
Γ(k) = (k-1)(k-2)⋯(2)(1) = (k-1)! for integer k

1.289
Student t Distribution
In much the same way that μ and σ define the normal
distribution, ν, the degrees of freedom, defines the
Student t Distribution:

Figure 8.24
As the number of degrees of freedom increases, the t
distribution approaches the standard normal distribution.

1.290
Determining Student t
Values
The student t distribution is used extensively in
statistical inference. Table 4 in Appendix B lists values of

That is, values t_A of a Student t random variable with
ν degrees of freedom such that P(t > t_A) = A.

The values for A are pre-determined


critical values, typically in the
10%, 5%, 2.5%, 1% and 1/2% range.

1.291
Using the t table (Table 4) for values
For example, if we want the value of t with 10
degrees of freedom such that the area under the
Student t curve to its right is .05:

t.05,10 = 1.812

Area under the curve (the value A): COLUMN
Degrees of Freedom: ROW

1.292
F Distribution
The F density function is given by:

F > 0. Two parameters define this distribution, and
like we've already seen these are again degrees
of freedom:
ν₁ is the numerator degrees of freedom and
ν₂ is the denominator degrees of freedom.

1.293
Determining Values of F
For example, what is the value of F
for 5% of the area under the right
hand tail of the curve, with a
numerator degree of freedom of 3
and a denominator degree of
freedom of 7?

Solution: use the F look-up (Table 6): F.05,3,7 = 4.35

There are different tables for different values of A.
Make sure you start with the correct table!!
Denominator Degrees of Freedom: ROW
Numerator Degrees of Freedom: COLUMN
1.294
Determining Values of F
For areas under the curve on the left
hand side of the curve, we can
leverage the following relationship:

Pay close attention to the order of the terms!

1.295
Chapter 9

Sampling Distributions

1.296
Sampling Distribution of the
Mean
A fair die is thrown infinitely many times,
with the random variable X = # of spots on
any throw.

x 1 2 3 4 5 6
The probability distribution of X is:
P(x) 1/6 1/6 1/6 1/6 1/6 1/6

and the mean and variance are calculated


as well:
1.297
Sampling Distribution of Two
Dice
A sampling distribution is created by looking at
all samples of size n=2 (i.e. two dice) and their means

While there are 36 possible samples of size 2, there are


only 11 values for x̄, and some (e.g. x̄ = 3.5) occur more
frequently than others (e.g. x̄ = 1.0).

1.298
Sampling Distribution of Two Dice

The sampling distribution of x̄ is shown below:

 x̄     P(x̄)
1.0    1/36
1.5    2/36
2.0    3/36
2.5    4/36
3.0    5/36
3.5    6/36
4.0    5/36
4.5    4/36
5.0    3/36
5.5    2/36
6.0    1/36
1.299
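The 36-sample distribution above can be enumerated directly, counting how many of the 36 equally likely two-die samples give each sample mean:

```python
from itertools import product
from collections import Counter

# All 36 ordered samples of size 2 from a fair die, mapped to sample means
means = Counter((a + b) / 2 for a, b in product(range(1, 7), repeat=2))
```

There are 11 distinct means; x̄ = 3.5 occurs in 6 of the 36 samples while x̄ = 1.0 occurs in only 1, matching the table.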
Compare
Compare the distribution of X

1 2 3 4 5 6 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

with the sampling distribution of x̄.

As well, note that μx̄ = μ and σ²x̄ = σ²/2.

1.300
Central Limit Theorem
The sampling distribution of the
mean of a random sample drawn
from any population is
approximately normal for a
sufficiently large sample size.

The larger the sample size, the more


closely the sampling distribution of x̄
will resemble a normal distribution.
1.301
Central Limit Theorem
If the population is normal, then x̄ is normally
distributed for all values of n.

If the population is non-normal, then x̄ is

approximately normal only for larger values of n.

In many practical situations, a sample size of 30

may be sufficiently large to allow us to use the
normal distribution as an approximation for the
sampling distribution of x̄.

1.302
Sampling Distribution of the Sample
Mean
1. μx̄ = μ

2. σ²x̄ = σ²/n
3. If X is normal, X is normal. If X is
nonnormal, X is approximately normal for
sufficiently large sample sizes.
Note: the definition of sufficiently large
depends on the extent of nonnormality of x
(e.g. heavily skewed; multimodal)
1.303
Example 9.1(a)
The foreman of a bottling plant has
observed that the amount of soda in each
32-ounce bottle is actually a normally
distributed random variable, with a mean
of 32.2 ounces and a standard deviation of
.3 ounce.

If a customer buys one bottle, what is the


probability that the bottle will contain
more than 32 ounces?
1.304
Example 9.1(a)
We want to find P(X > 32), where X is
normally distributed with μ = 32.2
and σ = .3

P(X > 32) = P(Z > (32 − 32.2)/.3) = P(Z > −.67) = .7486

there is about a 75% chance
that a single bottle of soda
contains more than 32 oz.
1.305
Example 9.1(b)
The foreman of a bottling plant has observed
that the amount of soda in each 32-ounce
bottle is actually a normally distributed
random variable, with a mean of 32.2 ounces
and a standard deviation of .3 ounce.

If a customer buys a carton of four bottles,


what is the probability that the mean
amount of the four bottles will be greater
than 32 ounces?

1.306
Example 9.1(b)
We want to find P(x̄ > 32), where x̄ is normally
distributed
with μ = 32.2 and σ = .3

Things we know:
1) X is normally distributed, therefore so is x̄.

2) μ_x̄ = μ = 32.2 oz.

3) σ_x̄ = σ/√n = .3/√4 = .15
1.307
Example 9.1(b)
If a customer buys a carton of four bottles,
what is the probability that the mean
amount of the four bottles will be greater
than 32 ounces?

P(x̄ > 32) = P(Z > (32 − 32.2)/.15) = P(Z > −1.33) = .9082

There is about a 91% chance the mean
of the four bottles will exceed 32 oz.
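Both probabilities can be checked with the standard normal CDF, which Python's `math.erf` gives us directly (a sketch, not from the slides):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 32.2, 0.3, 4

# (a) one bottle: X ~ N(32.2, 0.3^2)
p_one = 1 - norm_cdf((32 - mu) / sigma)
# (b) mean of four bottles: standard error is sigma/sqrt(n) = 0.15
p_four = 1 - norm_cdf((32 - mu) / (sigma / sqrt(n)))

print(round(p_one, 4))   # 0.7475 (table value with z = -.67: .7486)
print(round(p_four, 4))  # 0.9088 (table value with z = -1.33: .9082)
```

The small differences from the slide values come from the z table rounding z to two decimals.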
1.308
Graphically Speaking
mean = 32.2

what is the probability that one bottle    what is the probability that the mean
will contain more than 32 ounces?          of four bottles will exceed 32 oz?

1.309
Sampling Distribution: Difference
of two means
The final sampling distribution introduced is that of the
difference between two sample means. This
requires:

independent random samples be drawn from each


of two normal populations

If this condition is met, then the sampling distribution of
the difference between the two sample means, i.e.
x̄1 − x̄2, will be normally distributed.
(note: if the two populations are not both normally
distributed, but the sample sizes are large (>30), the
distribution of x̄1 − x̄2 is approximately normal)

1.310
Sampling Distribution: Difference
of two means
The expected value and standard deviation of the
sampling distribution of x̄1 − x̄2 are given by:

mean: μ1 − μ2

standard deviation: √(σ1²/n1 + σ2²/n2)

(also called the standard error of the difference
between two means)

1.311
Estimation
There are two types of inference: estimation
and hypothesis testing; estimation is
introduced first.

The objective of estimation is to determine


the approximate value of a population
parameter on the basis of a sample statistic.

E.g., the sample mean (x̄) is employed to
estimate the population mean (μ).
1.312
Estimation
The objective of estimation is to determine
the approximate value of a population
parameter on the basis of a sample statistic.

There are two types of estimators:

Point Estimator

Interval Estimator

1.313
Point & Interval Estimation
For example, suppose we want to estimate the mean
summer income of a class of business students. For
n = 25 students, x̄ is calculated to be 400 $/week.

point estimate: 400 $/week    interval estimate: 380 to 420 $/week

An alternative statement is:


The mean income is between 380 and 420 $/week.

1.314
Estimating μ when σ is known:
the confidence interval

We established in Chapter 9 that

P(−z_{α/2} < (x̄ − μ)/(σ/√n) < z_{α/2}) = 1 − α

(the sample mean x̄ is in the center of the interval)

Thus, the probability that the interval:

x̄ − z_{α/2}·σ/√n  to  x̄ + z_{α/2}·σ/√n

contains the population mean μ is 1 − α. This is
a confidence interval estimator for μ.
1.315
Four commonly used confidence
levels

Confidence Level    α      α/2     z_{α/2}
0.90                0.10   0.05    1.645
0.95                0.05   0.025   1.96
0.98                0.02   0.01    2.33
0.99                0.01   0.005   2.575

cut & keep handy!    Table 10.1
1.316
Example 10.1
A computer company samples demand during
lead time over 25 time periods:

235  374  309  499  253
421  361  514  462  369
394  439  348  344  330
261  374  302  466  535
386  316  296  332  334

It is known that the standard deviation of
demand over lead time is 75 computers. We
want to estimate the mean demand over lead
time with 95% confidence in order to set
inventory levels.
1.317
CALCULATE

Example 10.1
In order to use our confidence interval estimator, we need
the following pieces of data:

x̄ = 370.16        calculated from the data
z_{α/2} = 1.96    given (95% confidence)
σ = 75            given
n = 25            given

therefore:

x̄ ± z_{α/2}·σ/√n = 370.16 ± 1.96(75/√25) = 370.16 ± 29.40

The lower and upper confidence limits are 340.76 and
399.56.
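The interval arithmetic can be verified in a few lines (a sketch, not part of the slides):

```python
from math import sqrt

xbar, sigma, n = 370.16, 75, 25
z = 1.96  # z.025 for 95% confidence

half_width = z * sigma / sqrt(n)  # 1.96 * 15 = 29.4
lower, upper = xbar - half_width, xbar + half_width

print(round(lower, 2), round(upper, 2))  # 340.76 399.56
```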

1.318
INTERPRET

Example 10.1
The estimate of the mean demand during lead
time lies between 340.76 and 399.56; we can use
this as input in developing an inventory policy.

That is, we estimated that the mean demand during


lead time falls between 340.76 and 399.56, and this
type of estimator is correct 95% of the time. That
also means that 5% of the time the estimator will be
incorrect.

Incidentally, the media often refer to the 95% figure


as 19 times out of 20, which emphasizes the
long-run aspect of the confidence level.

1.319
Interval Width
A wide interval provides little information.
For example, suppose we estimate with 95%
confidence that an accountant's average starting
salary is between $15,000 and $100,000.

Contrast this with: a 95% confidence interval


estimate of starting salaries between $42,000 and
$45,000.

The second estimate is much narrower, providing


accounting students more precise information about
starting salaries.

1.320
Interval Width
The width of the confidence interval
estimate is a function of the
confidence level, the population
standard deviation, and the sample
size

1.321
Selecting the Sample Size
We can control the width of the interval by
determining the sample size necessary to produce
narrow intervals.

Suppose we want to estimate the mean demand
to within 5 units; i.e. we want the interval
estimate to be: x̄ ± 5

Since the interval estimate is: x̄ ± z_{α/2}·σ/√n

It follows that: z_{α/2}·σ/√n = 5
Solve for n to get the requisite sample size!
1.322
Selecting the Sample Size
Solving the equation:

n = (z_{α/2}·σ/5)² = (1.96 × 75/5)² = 864.36, rounded up to 865

that is, to produce a 95% confidence
interval estimate of the mean (±5
units), we need to sample 865 lead
time periods (vs. the 25 data points
we have currently).
1.323
Sample Size to Estimate a
Mean
The general formula for the sample
size needed to estimate a population
mean with an interval estimate of:

x̄ ± W

requires a sample size of at least
this large:

n = (z_{α/2}·σ/W)²

1.324
Example 10.2
A lumber company must estimate the
mean diameter of trees to determine
whether or not there is sufficient lumber to
harvest an area of forest. They need to
estimate this to within 1 inch at a
confidence level of 99%. The tree
diameters are normally distributed with a
standard deviation of 6 inches.

How many trees need to be sampled?


1.325
Example 10.2
Things we know:

Confidence level = 99%, therefore α = .01
and z_{α/2} = z_.005 = 2.575

We want x̄ ± 1, hence W = 1.

We are given that σ = 6.
1.326
Example 10.2
We compute

n = (z_{α/2}·σ/W)² = (2.575 × 6/1)² = 238.7, rounded up to 239

That is, we will need to sample at
least 239 trees to have a
99% confidence interval of x̄ ± 1.
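Both sample-size calculations use the same formula; a quick sketch (not from the slides):

```python
from math import ceil

def sample_size(z, sigma, W):
    """n = (z * sigma / W)^2, rounded up to the next whole observation."""
    return ceil((z * sigma / W) ** 2)

print(sample_size(1.96, 75, 5))   # 865  (demand example, 95% confidence)
print(sample_size(2.575, 6, 1))   # 239  (tree example, 99% confidence)
```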
1.327
Nonstatistical Hypothesis Testing

A criminal trial is an example of hypothesis testing


without the statistics.
In a trial a jury must decide between two hypotheses.
The null hypothesis is
H0: The defendant is innocent

The alternative hypothesis or research hypothesis is


H1: The defendant is guilty

The jury does not know which hypothesis is true. They


must make a decision on the basis of evidence
presented.

1.328
Nonstatistical Hypothesis Testing

There are two possible errors.


A Type I error occurs when we reject
a true null hypothesis. That is, a Type
I error occurs when the jury convicts
an innocent person.

A Type II error occurs when we don't
reject a false null hypothesis. That
occurs when a guilty defendant is
acquitted. 1.329
Nonstatistical Hypothesis Testing

The probability of a Type I error is
denoted α (the Greek letter alpha).
The probability of a Type II error is
denoted β (the Greek letter beta).

The two probabilities are inversely


related. Decreasing one increases
the other.

1.330
Nonstatistical Hypothesis Testing

The critical concepts are these:


1. There are two hypotheses, the null and the alternative
hypotheses.
2. The procedure begins with the assumption that the
null hypothesis is true.
3. The goal is to determine whether there is enough
evidence to infer that the alternative hypothesis is true.
4. There are two possible decisions:
Conclude that there is enough evidence to support the
alternative hypothesis.
Conclude that there is not enough evidence to support
the alternative hypothesis.

1.331
Nonstatistical Hypothesis Testing

5. Two possible errors can be made.


Type I error: Reject a true null
hypothesis
Type II error: Do not reject a false
null hypothesis.

P(Type I error) = α
P(Type II error) = β
1.332
Concepts of Hypothesis Testing (1)

There are two hypotheses. One is called the null


hypothesis and the other the alternative or research
hypothesis. The usual notation is:
H0: the null hypothesis (pronounced "H nought")

H1: the alternative or research hypothesis

The null hypothesis (H0) will always state that the


parameter equals the value specified in the
alternative hypothesis (H1)

1.333
Concepts of Hypothesis
Testing
Consider Example 10.1 (mean demand for
computers during assembly lead time) again.
Rather than estimate the mean demand, our
operations manager wants to know whether the
mean is different from 350 units. We can
rephrase this request into a test of the hypothesis:

H0: μ = 350

Thus, this becomes our research hypothesis:

H1: μ ≠ 350    (this is what we are
                interested in determining)

1.334
Concepts of Hypothesis Testing (4)

There are two possible decisions that can be made:

Conclude that there is enough evidence to support


the alternative hypothesis
(also stated as: rejecting the null hypothesis in favor of
the alternative)

Conclude that there is not enough evidence to


support the alternative hypothesis
(also stated as: not rejecting the null hypothesis in favor
of the alternative)
NOTE: we do not say that we accept the null
hypothesis

1.335
Concepts of Hypothesis
Testing
Once the null and alternative hypotheses are stated, the
next step is to randomly sample the population and
calculate a test statistic (in this example, the sample
mean).

If the test statistic's value is inconsistent with the null
hypothesis, we reject the null hypothesis and infer
that the alternative hypothesis is true.
For example, if we're trying to decide whether the mean is
not equal to 350, a large value of x̄ (say, 600) would
provide enough evidence. If x̄ is close to 350 (say, 355) we
could not say that this provides a great deal of evidence to
infer that the population mean is different than 350.

1.336
Types of Errors
A Type I error occurs when we reject a true null
hypothesis (i.e. Reject H0 when it is TRUE)
                  H0 is True       H0 is False
Reject H0         Type I error     correct
Do not reject H0  correct          Type II error

A Type II error occurs when we don't reject a false
null hypothesis (i.e. do NOT reject H0 when it is FALSE)

1.337
Recap I
1) Two hypotheses: H0 & H1
2) ASSUME H0 is TRUE
3) GOAL: determine if there is enough
evidence to infer that H1 is TRUE
4) Two possible decisions:
Reject H0 in favor of H1
NOT Reject H0 in favor of H1
5) Two possible types of errors:
Type I: reject a true H0 [P(Type I) = α]
Type II: not reject a false H0 [P(Type II) = β]
1.338
Example 11.1
A department store manager determines that a
new billing system will be cost-effective only if
the mean monthly account is more than $170.

A random sample of 400 monthly accounts is


drawn, for which the sample mean is $178. The
accounts are approximately normally distributed
with a standard deviation of $65.

Can we conclude that the new system will


be cost-effective?
1.339
Example 11.1
The system will be cost effective if the mean account
balance for all customers is greater than $170.

We express this belief as a our research hypothesis, that


is:

H1: μ > 170 (this is what we want to determine)

Thus, our null hypothesis becomes:

H0: μ = 170 (this specifies a single value for the
parameter of interest)

1.340
Example 11.1
What we want to show:
H1: μ > 170
H0: μ = 170 (we'll assume this is true)

We know:
n = 400,
x̄ = 178, and
σ = 65

Hmm. What to do next?!


1.341
Example 11.1
To test our hypotheses, we can use two
different approaches:

The rejection region approach (typically used


when computing statistics manually), and

The p-value approach (which is generally used


with a computer and statistical software).

We will explore both in turn

1.342
Example 11.1 Rejection
Region
The rejection region is a range of
values such that if the test statistic
falls into that range, we decide to
reject the null hypothesis in favor of
the alternative hypothesis.

x̄_critical is the value of x̄ beyond which we reject H0.


1.343
Example 11.1
All that's left to do is calculate x̄_critical
and compare it to 170.

we can calculate this based on any level of
significance (α) we want

1.344
Example 11.1
At a 5% significance level (i.e. α = 0.05), we get

x̄_critical = μ + z_.05·σ/√n = 170 + 1.645(65/√400) = 175.34

Since our sample mean (178) is greater than the
critical value we calculated (175.34), we reject the null
hypothesis in favor of H1, i.e. that μ > 170 and that
it is cost effective to install the new billing system

1.345
Example 11.1 The Big
Picture

H1: μ > 170     x̄_critical = 175.34
H0: μ = 170     x̄ = 178

Reject H0 in favor of H1
1.346
Standardized Test Statistic
An easier method is to use the standardized test
statistic:

z = (x̄ − μ)/(σ/√n)

and compare its result to z_α (rejection region: z > z_α)

Since z = 2.46 > 1.645 (z.05), we reject H0 in favor of
H1

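The test statistic and the equivalent critical value of x̄ can be computed directly (a sketch; note the slides' 175.34 comes from rounding the half-width to 5.34):

```python
from math import sqrt

mu0, sigma, n, xbar = 170, 65, 400, 178
z_crit = 1.645  # z.05

# Standardized test statistic
z = (xbar - mu0) / (sigma / sqrt(n))
# The same rejection rule expressed on the x-bar scale
xbar_crit = mu0 + z_crit * sigma / sqrt(n)

print(round(z, 2))          # 2.46
print(round(xbar_crit, 2))  # 175.35 (175.34 in the slides, from rounding)
```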
1.347
p-Value
The p-value of a test is the probability of
observing a test statistic at least as extreme
as the one computed given that the null
hypothesis is true.

In the case of our department store example,
what is the probability of observing a
sample mean at least as extreme as the
one already observed (i.e. x̄ = 178), given
that the null hypothesis (H0: μ = 170) is true?
That probability is the p-value.

1.349
Interpreting the p-value
The smaller the p-value, the more statistical evidence
exists to support the alternative hypothesis.
If the p-value is less than 1%, there is overwhelming
evidence that supports the alternative hypothesis.
If the p-value is between 1% and 5%, there is
strong evidence that supports the alternative hypothesis.
If the p-value is between 5% and 10%, there is weak
evidence that supports the alternative hypothesis.
If the p-value exceeds 10%, there is no evidence that
supports the alternative hypothesis.
We observe a p-value of .0069, hence there is
overwhelming evidence to support H1: μ > 170.

1.350
Interpreting the p-value
Compare the p-value with the selected value of the
significance level:

If the p-value is less than α, we judge the p-value
to be small enough to reject the null hypothesis.

If the p-value is greater than α, we do not reject
the null hypothesis.

Since p-value = .0069 < α = .05, we reject H0
in favor of H1

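The p-value quoted above can be recomputed from the standard normal CDF (a sketch using `math.erf`, not from the slides):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Department store example: z = (178 - 170) / (65 / sqrt(400))
z = (178 - 170) / (65 / sqrt(400))
p_value = 1 - norm_cdf(z)  # upper-tail test: P(Z > z)

print(round(p_value, 4))  # 0.0069
```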
1.351
Chapter-Opening Example

The objective of the study is to draw a conclusion


about the mean payment period. Thus, the parameter
to be tested is the population mean. We want to know
whether there is enough statistical evidence to show
that the population mean is less than 22 days. Thus,
the alternative hypothesis is

H1: μ < 22

The null hypothesis is

H0: μ = 22

1.352
Chapter-Opening Example
The test statistic is

z = (x̄ − μ)/(σ/√n)

We wish to reject the null hypothesis in favor of
the alternative only if the sample mean, and
hence the value of the test statistic, is small
enough. As a result we locate the rejection
region in the left tail of the sampling distribution.
We set the significance level at 10%.

1.353
Chapter-Opening Example
Rejection region: z < −z_α = −z.10 = −1.28

From the data in SSA we compute

x̄ = Σxi/220 = 4,759/220 = 21.63

and

z = (x̄ − μ)/(σ/√n) = (21.63 − 22)/(6/√220) = −.91

p-value = P(Z < −.91) = .5 − .3186 = .1814

1.354
Chapter-Opening Example

Conclusion: There is not enough evidence


to infer that the mean is less than 22.

There is not enough evidence to infer


that the plan will be profitable.

Since z = −.91 > −z.10 = −1.28,
we fail to reject H0: μ = 22
at a 10% level of significance.
1.355
Right-Tail Testing
Calculate the critical value of the
mean (x̄_critical) and compare it against the
observed value of the sample mean (x̄)

1.357
Left-Tail Testing
Calculate the critical value of the
mean (x̄_critical) and compare it against the
observed value of the sample mean (x̄)

1.358
Two-Tail Testing
Two-tail testing is used when we want
to test a research hypothesis that a
parameter is not equal (≠) to some
value

1.359
Example 11.2
AT&T argues that its rates are such that customers won't
see a difference in their phone bills between them and
their competitors. They calculate the mean and standard
deviation for all their customers at $17.09 and $3.87
(respectively).

They then sample 100 customers at random and


recalculate a monthly phone bill based on competitors
rates.

What we want to show is whether or not:

H1: μ ≠ 17.09. We do this by assuming that:
H0: μ = 17.09

1.360
Example 11.2
The rejection region is set up so we can reject the null
hypothesis when the test statistic is large or when it is
small.

(test statistic is small)        (test statistic is large)

That is, we set up a two-tail rejection region. The total
area in the rejection region must sum to α, so we
divide this probability by 2.

1.361
Example 11.2
At a 5% significance level (i.e. α = .05), we have
α/2 = .025. Thus, z.025 = 1.96 and
our rejection region is:

z < −1.96  or  z > 1.96

        −z.025    0    +z.025

1.362
Example 11.2
From the data, we calculate x̄ = 17.55

Using our standardized test statistic:

z = (x̄ − μ)/(σ/√n) = (17.55 − 17.09)/(3.87/√100)

We find that: z = 1.19

Since z = 1.19 is not greater than 1.96, nor less than
−1.96, we cannot reject the null hypothesis in favor of H1.
That is, there is insufficient evidence to infer that
there is a difference between the bills of AT&T
and the competitor.

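The two-tail computation can be sketched the same way (values from the AT&T example; not part of the slides):

```python
from math import sqrt

mu0, sigma, n, xbar = 17.09, 3.87, 100, 17.55

z = (xbar - mu0) / (sigma / sqrt(n))
reject = z < -1.96 or z > 1.96  # two-tail rejection region at alpha = .05

print(round(z, 2), reject)  # 1.19 False -> cannot reject H0
```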
1.363
Summary of One- and Two-Tail
Tests

One-Tail Test    Two-Tail Test    One-Tail Test
(left tail)                       (right tail)
H1: μ < μ0       H1: μ ≠ μ0       H1: μ > μ0

1.365
Inference About A
Population [σ unknown]
Population

Sample

Inference

Statistic
Parameter

We will develop techniques to estimate and


test three population parameters:
Population Mean μ
Population Variance σ²
Population Proportion p
1.366
Inference With Variance Unknown

Previously, we looked at estimating and testing
the population mean when the population
standard deviation (σ) was known or given:

z = (x̄ − μ)/(σ/√n)

But how often do we know the actual
population variance?

Instead, we use the Student t-statistic,
given by:

t = (x̄ − μ)/(s/√n)

1.367
Testing μ when σ is
unknown
When the population standard
deviation is unknown and the
population is normal, the test
statistic for testing hypotheses about
μ is:

t = (x̄ − μ)/(s/√n)

which is Student t distributed with
ν = n − 1 degrees of freedom. The
confidence interval estimator of μ is:

x̄ ± t_{α/2}·s/√n
1.368
Example 12.1
Will new workers achieve 90% of the level of
experienced workers within one week of
being hired and trained?

Experienced workers can process 500


packages/hour, thus if our conjecture is
correct, we expect new workers to be able to
process .90(500) = 450 packages per hour.

Given the data, is this the case?

1.369
IDENTIFY

Example 12.1
Our objective is to describe the population of the
number of packages processed in 1 hour by new
workers; that is, we want to know whether the new
workers' productivity is more than 90% of that of
experienced workers. Thus we have:

H1: μ > 450

Therefore we set our usual null hypothesis to:

H0: μ = 450

1.370
COMPUTE

Example 12.1
Our test statistic is:

t = (x̄ − 450)/(s/√n)

With n=50 data points, we have n−1=49 degrees of
freedom. Our hypothesis under question is:
H1: μ > 450
Our rejection region becomes:

t > t_{.05,49} ≈ 1.676

Thus we will reject the null hypothesis in favor of the
alternative if our calculated test statistic falls in this
region.

1.371
COMPUTE

Example 12.1
From the data, we calculate x̄ = 460.38, s = 38.83,
and thus:

t = (460.38 − 450)/(38.83/√50) = 1.89

Since 1.89 > 1.676,

we reject H0 in favor of H1; that is, there is
sufficient evidence to conclude that the new
workers are producing at more than 90% of
the average of experienced workers.
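A sketch of the computation (the 1.676 critical value is the t table entry for roughly 49 degrees of freedom; not part of the slides):

```python
from math import sqrt

mu0, n = 450, 50
xbar, s = 460.38, 38.83

t = (xbar - mu0) / (s / sqrt(n))
t_crit = 1.676  # approximate t.05 with 49 degrees of freedom (table value)

print(round(t, 2))  # 1.89
print(t > t_crit)   # True -> reject H0 in favor of H1: mu > 450
```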
1.372
IDENTIFY

Example 12.2
Can we estimate the return on
investment for companies that won
quality awards?

We are given a random sample of n


= 83 such companies. We want to
construct a 95% confidence interval
for the mean return, i.e. what is the
interval x̄ ± t_{α/2}·s/√n ?
1.373
COMPUTE

Example 12.2
From the data, we calculate:

For this term

and so:

1.374
Check Requisite
Conditions
The Student t distribution is robust, which means
that if the population is nonnormal, the results of
the t-test and confidence interval estimate are still
valid provided that the population is not
extremely nonnormal.

To check this requirement, draw a histogram of


the data and see how bell shaped the resulting
figure is. If a histogram is extremely skewed (say in
the case of an exponential distribution), that could
be considered extremely nonnormal, and hence t-
statistics would not be valid in this case.

1.375
Inference About Population
Variance
If we are interested in drawing inferences about a
population's variability, the parameter we need to
investigate is the population variance: σ²

The sample variance (s²) is an unbiased, consistent and
efficient point estimator for σ². Moreover,

the statistic χ² = (n − 1)s²/σ² has a chi-squared
distribution

with n − 1 degrees of freedom.

1.376
Testing & Estimating Population
Variance
Combining this statistic:

χ² = (n − 1)s²/σ²

with the probability statement:

P(χ²_{1−α/2} < χ² < χ²_{α/2}) = 1 − α

yields the confidence interval estimator for σ²:

LCL = (n − 1)s²/χ²_{α/2}        UCL = (n − 1)s²/χ²_{1−α/2}

1.377
IDENTIFY

Example 12.3
Consider a container filling machine.
Management wants a machine to fill 1 liter
(1,000 cc) so that the variance of the fills is
less than 1 cc². A random sample of n=25 one-liter
fills was taken. Does the machine perform as it
should at the 5% significance level?

We want to show that the variance is less than 1 cc²:

H1: σ² < 1
(so our null hypothesis becomes H0: σ² = 1). We
will use this test statistic: χ² = (n − 1)s²/σ²

1.378
COMPUTE

Example 12.3
Since our alternative hypothesis is phrased as:
H1: σ² < 1

We will reject H0 in favor of H1 if our test statistic
falls into this rejection region:

χ² < χ²_{1−α, n−1} = χ²_{.95, 24} = 13.85

We compute the sample variance to be:

s² = .8088

And thus our test statistic takes on this value:

χ² = (n − 1)s²/σ² = 24(.8088)/1 = 19.41

1.379
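The chi-squared statistic for this example is a one-liner (a sketch; 13.85 is the χ² table value for 24 df at the lower 5% tail):

```python
n, s_sq, sigma0_sq = 25, 0.8088, 1.0

chi2 = (n - 1) * s_sq / sigma0_sq  # (n-1)s^2/sigma^2 under H0: sigma^2 = 1
chi2_lower_crit = 13.85            # chi-squared, 24 df, lower-tail 5% point

print(round(chi2, 2))          # 19.41
print(chi2 < chi2_lower_crit)  # False -> cannot reject H0
```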
Example 12.4
As we saw, we cannot reject the null hypothesis
in favor of the alternative. That is, there is not
enough evidence to infer that the claim is true.
Note: the result does not say that the variance
is greater than 1, rather it merely states that
we are unable to show that the variance is
less than 1.

We could estimate (at 99% confidence say) the


variance of the fills

1.380
COMPUTE

Example 12.4
In order to create a confidence interval
estimate of the variance, we need these
formulae:

LCL = (n − 1)s²/χ²_{α/2}        UCL = (n − 1)s²/χ²_{1−α/2}

we know (n − 1)s² = 19.41 from our previous
calculation, and we have from Table 5 in
Appendix B: χ²_{.005,24} = 45.6 and χ²_{.995,24} = 9.89
1.381
Comparing Two
Populations
Previously we looked at techniques to
estimate and test parameters for one
population:
Population Mean μ, Population Variance σ²
We will still consider these parameters when
we are looking at two populations,
however our interest will now be:
The difference between two means.
The ratio of two variances.

1.382
Difference of Two Means
In order to test and estimate the difference
between two population means, we
draw random samples from each of two
populations. Initially, we will consider
independent samples, that is, samples that
are completely unrelated to one another.

Because we are comparing two population
means, we use the statistic: x̄1 − x̄2

1.383
Sampling Distribution of x̄1 − x̄2
1. x̄1 − x̄2 is normally distributed if the original
populations are normal, or approximately normal
if the populations are nonnormal and the sample
sizes are large (n1, n2 > 30)

2. The expected value of x̄1 − x̄2 is μ1 − μ2

3. The variance of x̄1 − x̄2 is σ1²/n1 + σ2²/n2

and the standard error is: √(σ1²/n1 + σ2²/n2)


1.384
Making Inferences About μ1 − μ2
Since x̄1 − x̄2 is normally distributed if the
original populations are normal, or approximately
normal if the populations are nonnormal and the
sample sizes are large (n1, n2 > 30), then:

z = [(x̄1 − x̄2) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2)

is a standard normal (or approximately normal)
random variable. We could use this to build test
statistics or confidence interval estimators for μ1 − μ2 …

1.385
Making Inferences About μ1 − μ2
… except that, in practice, the z statistic is rarely
used since the population variances are unknown.

Instead we use a t-statistic. We consider two cases


for the unknown population variances: when we
believe they are equal and conversely when they
are not equal.

1.386
When are variances equal?
How do we know when the population
variances are equal?

Since the population variances are
unknown, we can't know for certain
whether they're equal, but we can
examine the sample variances and
informally judge their relative values to
determine whether we can assume that
the population variances are equal or not.
1.387
Test Statistic for μ1 − μ2 (equal
variances)
1) Calculate the pooled variance
estimator as:

s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

2) and use it here:

t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s_p²(1/n1 + 1/n2))

with ν = n1 + n2 − 2 degrees of freedom
1.388
CI Estimator for μ1 − μ2 (equal
variances)
The confidence interval estimator
for μ1 − μ2 when the population
variances are equal is given by:

(x̄1 − x̄2) ± t_{α/2}·√(s_p²(1/n1 + 1/n2))

(s_p² is the pooled variance estimator; ν = n1 + n2 − 2 degrees of freedom)

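A sketch of the pooled-variance formulas above, using made-up sample statistics (not the text's data) purely for illustration:

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    # s_p^2 = ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_t(x1bar, s1_sq, n1, x2bar, s2_sq, n2):
    """t statistic for H0: mu1 - mu2 = 0, with n1 + n2 - 2 df."""
    sp_sq = pooled_variance(s1_sq, n1, s2_sq, n2)
    se = (sp_sq * (1 / n1 + 1 / n2)) ** 0.5
    return (x1bar - x2bar) / se

# Hypothetical sample statistics (illustration only):
t = pooled_t(6.3, 0.85, 25, 6.0, 1.30, 25)
print(round(t, 2))  # 1.02
```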
1.389
Test Statistic for μ1 − μ2 (unequal
variances)
The test statistic for μ1 − μ2 when
the population variances are
unequal is given by:

t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2)

with (Welch-Satterthwaite) degrees of freedom:

ν = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]

Likewise, the confidence interval estimator is:

(x̄1 − x̄2) ± t_{α/2}·√(s1²/n1 + s2²/n2)
1.390
IDENTIFY

Example 13.2
Two methods are being tested for assembling
office chairs. Assembly times are recorded (25
times for each method). At a 5% significance
level, do the assembly times for the two
methods differ?

That is, H1: μ1 − μ2 ≠ 0

Hence, our null hypothesis becomes: H0: μ1 − μ2 = 0

Reminder: This is a two-tailed test.

1.391
COMPUTE

Example 13.2
The assembly times for each of the
two methods are recorded and
preliminary data is prepared

The sample variances are similar, hence we will assume that


the population variances are equal
1.392
COMPUTE

Example 13.2
Recall, we are doing a two-tailed test,
hence the rejection region will be:

The number of degrees of freedom


is:

Hence our critical values of t (and


our rejection region) becomes:
1.393
COMPUTE

Example 13.2
In order to calculate our t-statistic,
we need to first calculate the pooled
variance estimator, followed by
the t-statistic

1.394
INTERPRET

Example 13.2

Since our calculated t-statistic does not fall into
the rejection region, we cannot reject H0 in favor
of H1; that is, there is not sufficient evidence to
infer that the mean assembly times differ.

1.395
INTERPRET

Example 13.2
Excel, of course, also provides us
with the information

Compare

or look at p-value

1.396
Confidence Interval
We can compute a 95% confidence interval estimate
for the difference in mean assembly times as:

That is, we estimate that the mean difference between the
two assembly methods is between −.36 and .96 minutes.
Note: zero is included in this confidence interval

1.397
Matched Pairs Experiment
Previously when comparing two populations,
we examined independent samples.

If, however, an observation in one sample is


matched with an observation in a second
sample, this is called a matched pairs
experiment.

To help understand this concept, let's


consider example 13.4
1.398
Identifying Factors
Factors that identify the t-test and
estimator of μD (the mean difference):

1.399
Inference about the ratio of two
variances
So far we've looked at comparing measures of central
location, namely the mean of two populations.

When looking at two population variances, we consider
the ratio of the variances, i.e. the parameter of interest to
us is: σ1²/σ2²

The sampling statistic F = s1²/s2² is F distributed with
ν1 = n1 − 1 and ν2 = n2 − 1 degrees of freedom.

1.400
Inference about the ratio of two
variances
Our null hypothesis is always:

H0: σ1²/σ2² = 1

(i.e. the variances of the two populations will be
equal, hence their ratio will be one)

Therefore, our statistic simplifies to:

F = s1²/s2²

df1 = n1 − 1
df2 = n2 − 1

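The F ratio itself is trivial to compute; a sketch with hypothetical variances (the .58 and 1.61 critical values are the ones quoted later in the slides):

```python
def f_stat(s1_sq, s2_sq):
    """F = s1^2/s2^2, with n1-1 and n2-1 degrees of freedom."""
    return s1_sq / s2_sq

F = f_stat(1.2, 0.9)           # hypothetical sample variances
reject = F < 0.58 or F > 1.61  # two-tail rejection region from the slides

print(round(F, 2), reject)  # 1.33 False
```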
1.401
IDENTIFY

Example 13.6
In example 13.1, we looked at the variances of
the samples of people who consumed high fiber
cereal and those who did not, and assumed
they were not equal. We can use the ideas just
developed to test if this is in fact the case.

We want to show: H1: σ1²/σ2² ≠ 1
(the variances are not equal to each other)

Hence we have our null hypothesis: H0: σ1²/σ2² = 1
1.402
CALCULATE

Example 13.6
Since our research hypothesis is: H1: σ1²/σ2² ≠ 1
we are doing a two-tailed test, and
our rejection region is:

F < F_{1−α/2,ν1,ν2}  or  F > F_{α/2,ν1,ν2}
1.403
CALCULATE

Example 13.6
Our test statistic is: F = s1²/s2²

(rejection region: F < .58 or F > 1.61)

Since our calculated F statistic falls in the rejection region,
there is sufficient evidence to reject the null
hypothesis in favor of the alternative; that is, there is a
difference in the variance between the two populations.

1.404
INTERPRET

Example 13.6
We may need to work with the Excel
output before drawing conclusions.
Our research hypothesis
H1: σ1²/σ2² ≠ 1
requires two-tail testing,
but Excel only gives us values
for one-tail testing.

If we double the one-tail p-value Excel gives us, we have
the p-value of the test we're conducting
(i.e. 2 × 0.0004 = 0.0008). Refer to
the text and CD Appendices for more detail.
1.405
Show of Hands
Who is doing a study
that involves
statistical analysis of
data?
What type of
(quantitative) data are
you collecting?
Will there be enough
data to achieve
statistical
significance?
(adequate power vs.
pilot) If pilot:
Descriptive statistics
Chart/graph
9/14/2010 406
Types of data
Continuous
Equal
increments

Ordinal/Rank
In order but not
equal (Likert)

Categorical
Names
9/14/2010 407
Continuous Data
If comparing 2 groups
(treatment/control)
t-test
If comparing > 2 groups
ANOVA (F-test)
If measuring association between 2
variables
Pearson r correlation
If trying to predict an outcome
(crystal ball)
Regression or multiple regression
9/14/2010 408
Ordinal Data
Beyond the capability of Excel (just FYI)
If comparing 2 groups
Mann Whitney U (treatment vs. control)
Wilcoxon (matched pre vs. post)
If comparing > 2 groups
Kruskal-Wallis (median test)
If measuring association between 2
variables
Spearman rho (ρ)
Likert-type scales are ordinal data
9/14/2010 409
Categorical Data
Called a test of frequency: how
often something is observed (AKA:
Goodness of Fit Test, Test of
Homogeneity)
Chi-Square (χ²)
Examples of burning research
questions:
Do negative ads change how people
vote?
Is there a relationship between marital
status and health insurance coverage?
9/14/2010 410
Words we use to describe
statistics
Mean (μ)
The arithmetic
average (add all of
the scores
together, then
divide by the
number of scores)
μ = Σx / n

9/14/2010 412
Median
The middle number
(just like the
median strip that
divides a highway
down the middle;
50/50)
Used when data is
not normally
distributed
Often hear about
the median price of
housing
9/14/2010 413
Mode
The most
frequently
occurring number
(score,
measurement,
value, cost)
On a frequency
distribution, it's the
highest point (like
the à la mode on
a pie)
9/14/2010 414
Standard Deviation (σ)

(figure: normal curve showing the 95% and 99% coverage bands)
9/14/2010 415
We Make Mistakes!

Alpha level (α):
Set BEFORE we collect data / run statistics
Defines how much of an error we are willing
to make to say we made a difference
If we're wrong, it's an alpha error or Type 1 error
AKA: level of significance

p value:
Calculated AFTER we gather the data
The calculated probability of a mistake
by saying "it works"
Describes the percent of the population/area
under the curve (in the tail) that is beyond
our statistic

9/14/2010 416
2-tailed Test
The critical value is
the number that
separates the blue
zone from the middle
(±1.96 in this example)
In a t-test, in order to
be statistically
significant the t score
needs to be in the
blue zone
If α = .05, then 2.5%
of the area is in each
tail
9/14/2010 417
1-tailed Test

The critical value is
either + or −, but
not both.
In this case, you
would have
statistical
significance (p < .05)
if t ≥ 1.645.

9/14/2010 418
Chi-Square (χ²)

Any number squared
is a positive number.
Therefore, area under
the curve starts at 0
and goes to infinity.
To be statistically
significant, χ² needs to
be in the upper 5% (α
= .05).
Compares observed
frequency to what we
expected
9/14/2010 419
