
Biostatistics I Probability distributions 1

Master Degree Public Health 2016/2017

Probability distributions

Margarida Fonseca Cardoso


Relative frequency of an event


The relative frequency of an event is the proportion of all observed outcomes
that the event represents.
Consider an outcome set with two elements, such as the possible results
from tossing a coin (Head; Tail) or the sex of a person (male; female).
If n is the total number of coin tosses and f is the total number of heads
observed, then the relative frequency of heads is f/n.
If heads are observed 52 times in 100 coin tosses, the relative frequency is
52/100=0.52 (or 52%)

If 275 males occur in 500 human births, the relative frequency of males is
f/n=275/500=0.55 (or 55%)

$$\text{relative frequency of an event} = \frac{\text{frequency of that event}}{\text{total number of all events}} = \frac{f}{n}$$

The value of f may, of course, range from 0 to n, and the relative frequency
may, therefore, range from 0 to 1 (or 0% to 100%).
(Zar, 2010).
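
As a minimal sketch of the ratio f/n, the two examples above can be recomputed in Python. The function name `relative_frequency` is an illustrative choice, not something taken from the references.

```python
def relative_frequency(f: int, n: int) -> float:
    """Return f/n for an event observed f times in n observations."""
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("need n > 0 and 0 <= f <= n")
    return f / n


# Values quoted on the slide: 52 heads in 100 tosses, 275 males in 500 births.
heads = relative_frequency(52, 100)
males = relative_frequency(275, 500)
print(f"heads: {heads:.2f}, males: {males:.2f}")
```

The range check mirrors the statement that f lies between 0 and n, so the result always lies between 0 and 1.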


Probability of an event
The probability of an event is the likelihood of that event, expressed either
by the relative frequency observed in a large number of data or by
knowledge of the system under study.
Using the data from the preceding slide, we can estimate:
• the probability that a human birth will be a male is 0.55.
• the probability that a tossed coin lands head side up is 0.52.

A probability may sometimes be predicted on the basis of knowledge about
the system.
For example, if we assume that there is no reason why a tossed coin should
land “heads” more or less often than “tails”, we say that there is an equal
probability of each outcome: P(H)=1/2 and P(T)=1/2.
Probabilities, like relative frequencies, can range from 0 to 1.
A probability of 0 means that the event is impossible.
In tossing a coin, P(neither H nor T)=0
A probability of 1 means that an event is certain.
In tossing a coin, P(H or T)=1


Randomness and probability


• We call a phenomenon random if individual outcomes are uncertain but
there is nonetheless a regular distribution of outcomes in a large number of
repetitions.
• The probability of any outcome of a random phenomenon is the
proportion of times the outcome would occur in a very long series of
repetitions.
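
The long-run idea can be illustrated with a small simulation. This is only a sketch: the function name, the seed and the numbers of repetitions are arbitrary choices, and a pseudo-random generator stands in for a real coin with P(H) = 0.5.

```python
import random


def simulated_relative_frequency(p: float, n_trials: int, seed: int = 1) -> float:
    """Relative frequency of 'success' in n_trials simulated repetitions."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    successes = sum(rng.random() < p for _ in range(n_trials))
    return successes / n_trials


# Individual tosses are unpredictable, but the proportion of heads settles
# near 0.5 as the number of repetitions grows.
for n in (10, 100, 10_000, 100_000):
    print(n, simulated_relative_frequency(0.5, n))
```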

Probability models
• The sample space S of a random phenomenon is the set of all possible
outcomes.
• An event is an outcome or a set of outcomes of a random phenomenon.
That is, an event is a subset of the sample space.
• A probability model is a mathematical description of a random
phenomenon consisting of two parts: a sample space and a way of
assigning probabilities to events.


A sample space S can be very simple or very complex.

When one child is born, there are only two outcomes, male or female. The
sample space is S = {M, F}

When the National Health Survey records the body weights in a random
sample of adults, the sample space contains all possible adult weights over a
realistic interval.


Discrete probability model


A probability model with a sample space made up of a list of individual
outcomes is called discrete.
To assign probabilities in a discrete model, list the probabilities of all the
individual outcomes.
The probabilities must be numbers between 0 and 1 and must have sum 1.
The probability of any event is the sum of probabilities of the outcomes
making up the event.

Example: Hearing impairment in dalmatians

Pure dog breeds are often highly inbred, leading to a high number of
congenital defects. A study examined hearing impairment in 5333 dalmatians.
The researchers found the following probability model for X, the number of ears
impaired (deaf) in a randomly chosen dalmatian:

X             0      1      2
Probability   0.70   0.22   0.08

(Baldi and Moore, 2012).
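
A short sketch of how such a discrete model can be checked and used: the probabilities must sum to 1, and the probability of an event is the sum over its outcomes. The event "at least one ear impaired" and the helper name `event_probability` are illustrative choices, not part of the cited example.

```python
import math

# Probability model from the table: number of impaired ears -> probability.
model = {0: 0.70, 1: 0.22, 2: 0.08}

# Requirements of a discrete model: each value in [0, 1], total equal to 1.
valid = all(0 <= p <= 1 for p in model.values()) and math.isclose(
    sum(model.values()), 1.0
)


def event_probability(model: dict, event: set) -> float:
    """Sum the probabilities of the outcomes that make up the event."""
    return sum(p for x, p in model.items() if x in event)


# Event "at least one ear impaired" = {1, 2}, i.e. 0.22 + 0.08.
p_any_impairment = event_probability(model, {1, 2})
print(valid, round(p_any_impairment, 2))
```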


Continuous probability models


Most variables encountered in biology are continuous, such as the infant birth
weight, survival time, etc.
For continuous variables, the theoretical probability distribution, or
probability density function, can be represented by a continuous curve.
Density curve
A density curve is a curve that
• is always on or above the horizontal axis, and
• has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution.
The area under the curve and above any range of values on the horizontal
axis is the proportion of all observations that fall in that range.

$$P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\,dx$$


A random variable is a variable whose value is a numerical outcome of a
random phenomenon.
The probability distribution of a random variable X tells us what values X can
take and how to assign probabilities to those values.
There are two main types of random variables, corresponding to two types of
probability models: discrete or continuous.

Examples:
Discrete probability models (x = 0, 1, 2,..., n)
- Binomial distribution
- Poisson distribution

Continuous probability models (x ∈ ℝ)
- Normal distribution
- Student's t distribution
- Chi-square distribution
- F distribution


Normal distribution
Of all the continuous probability distributions, none is more widely used than
the family of normal distributions.
Normal distributions are extremely important in statistical inference.
They are also a good mathematical model for many (but certainly not all)
biological variables, such as blood pressure, bone mineral density, and the
height of plants or animals.
Formally, the normal probability density function can be represented by the
expression:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

A normal probability density function has two parameters: the parametric mean μ
and the parametric standard deviation σ, which determine the location and
shape (spread) of the distribution, respectively.
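
The density expression can be checked numerically. The sketch below codes the formula directly and integrates it with a simple trapezoidal rule, to confirm that the total area under the curve is 1. The values μ = 70 and σ = 10, the helper names and the step count are illustrative assumptions.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi * sigma ** 2)


def area(mu: float, sigma: float, a: float, b: float, steps: int = 10_000) -> float:
    """Trapezoidal approximation of the area under the density between a and b."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    total += sum(normal_pdf(a + i * h, mu, sigma) for i in range(1, steps))
    return total * h


mu, sigma = 70.0, 10.0  # illustrative parameter values
total_area = area(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)
one_sigma_area = area(mu, sigma, mu - sigma, mu + sigma)
print(round(total_area, 4), round(one_sigma_area, 4))
```

Changing μ moves the curve along the axis and changing σ stretches or narrows it, but the total area stays equal to 1.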


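[Figure: Normal density curve with the horizontal axis marked from µ − 3σ to µ + 3σ. Areas on each side of the mean: 34.1% between µ and µ ± 1σ, 13.6% between µ ± 1σ and µ ± 2σ, 2.2% between µ ± 2σ and µ ± 3σ, and 0.15% beyond µ ± 3σ.]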

There is not just one normal distribution, as one might naively believe when
encountering the same bell-shaped image.
Rather, the number of such curves is infinite, because the two parameters, the
parametric mean μ and the parametric standard deviation σ, can assume an
infinite number of values.
The curve is symmetrical around the mean; therefore, the mean and the median
of the normal distribution are at the same point.


The following percentages of items in a normal frequency distribution with
mean μ and standard deviation σ lie within the indicated limits:
Approximately 68% of the items fall within σ of the mean µ
Approximately 95% of the items fall within 2σ of µ
Approximately 99.7% of the items fall within 3σ of µ
Expressed in probability terms, this means that if the random variable X has a
normal distribution with mean μ and standard deviation σ, then
P(µ - σ < X < µ + σ) ≈ 0.68
P(µ - 2σ < X < µ + 2σ) ≈ 0.95
P(µ - 3σ < X < µ + 3σ) ≈ 0.997
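
These three probabilities can be reproduced with the standard library class `statistics.NormalDist` (Python 3.8 or later). A minimal sketch, in which the helper name `prob_within` is an illustrative choice:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def prob_within(k: float) -> float:
    """P(mu - k*sigma < X < mu + k*sigma) for any Normal distribution."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

Because standardizing removes μ and σ, the same three values hold for every Normal distribution.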


Example: The distribution of heights of young women aged 18 to
24 is approximately Normal with mean μ = 163.83 cm and standard
deviation σ = 6.35 cm.
The heights of the middle 95% of young women are approximately between
µ - 2σ = 163.83 - 2 × 6.35 = 163.83 - 12.70 = 151.13 cm
and
µ + 2σ = 163.83 + 2 × 6.35 = 163.83 + 12.70 = 176.53 cm

That is, about 95% of young women are between 151.13 and 176.53 cm tall.
By symmetry, the remaining 5% is split equally between the two tails, so the
tallest 2.5% of young women are taller than 176.53 cm.
In probability terms, there is a probability of approximately 0.025 that a
randomly selected young woman is taller than 176.53 cm.
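
As a rough check of this example, the limits and the upper-tail probability can be computed with `statistics.NormalDist`. The variable names are illustrative. Note that the exact Normal tail beyond µ + 2σ is a little below 0.025, because "95% within 2σ" is itself a rounded rule of thumb.

```python
from statistics import NormalDist

heights = NormalDist(mu=163.83, sigma=6.35)  # heights in cm, from the example

lower = heights.mean - 2 * heights.stdev
upper = heights.mean + 2 * heights.stdev

p_middle = heights.cdf(upper) - heights.cdf(lower)  # share within 2 sigma
p_taller = 1 - heights.cdf(upper)                   # share above mu + 2 sigma

print(f"{lower:.2f} to {upper:.2f} cm")
print(f"middle: {p_middle:.4f}, upper tail: {p_taller:.4f}")
```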


The standard normal distribution


A normal curve with mean μ = 0 and standard deviation σ = 1 is said to be a
standardized normal curve.
If a variable X has any Normal distribution with mean μ and standard
deviation σ, the standardized value of X

$$Z = \frac{X - \mu}{\sigma}$$

has the standard Normal distribution.

Areas under a Normal curve represent proportions (relative frequencies) of
observations from that Normal distribution.
Therefore, they also represent probabilities of randomly selecting an
individual from that Normal distribution.
Calculations use either software or tables.


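[Figure: a Normal curve for X, with the horizontal axis marked from µ − 3σ to µ + 3σ, is mapped by the standardization Z_i = (X_i − µ)/σ onto the standard Normal curve, with the axis marked from −3 to 3. The areas between successive standard deviations (34.1%, 13.6%, 2.2%, 0.15%) are the same in both curves.]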


Example:
The distribution of weights of men in a given population
is Normal with mean μ = 70 kg and standard deviation σ = 10 kg.

What is the probability that a randomly selected man weighs more than 80 kg?

$$P(X > 80) = P\left(Z > \frac{80 - 70}{10}\right) = P(Z > 1) = 0.1587$$

What is the probability that a randomly selected man weighs less than 50 kg?

$$P(X < 50) = P\left(Z < \frac{50 - 70}{10}\right) = P(Z < -2) = 0.0228$$

What is the probability that a randomly selected man weighs between 50 and
80 kg?

$$\begin{aligned}
P(50 < X < 80) &= P(-2 < Z < 1) \\
&= 1 - P(Z < -2) - P(Z > 1) \\
&= 1 - P(Z > 2) - P(Z > 1) \\
&= 1 - 0.0228 - 0.1587 \\
&= 0.8185
\end{aligned}$$
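
The same three probabilities can be obtained in software instead of tables. This sketch standardizes each value and then uses the standard Normal cumulative distribution from `statistics.NormalDist`. The helper `standardize` is an illustrative name. Software works with unrounded values, so the last digit can differ slightly from a result built from rounded table entries.

```python
from statistics import NormalDist


def standardize(x: float, mu: float, sigma: float) -> float:
    """Return the z value (x - mu) / sigma."""
    return (x - mu) / sigma


std_normal = NormalDist()  # mean 0, standard deviation 1
mu, sigma = 70.0, 10.0     # weights in kg, from the example

p_more_80 = 1 - std_normal.cdf(standardize(80, mu, sigma))  # P(Z > 1)
p_less_50 = std_normal.cdf(standardize(50, mu, sigma))      # P(Z < -2)
p_between = 1 - p_less_50 - p_more_80                       # P(-2 < Z < 1)

print(f"{p_more_80:.4f} {p_less_50:.4f} {p_between:.4f}")
```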


Example: The body length of fish of a given species follows
a Normal distribution with mean μ = 145 cm and standard
deviation σ = 4 cm.
What is the minimum length of the largest 5% of fish?
• X1 = ? : P(x > X1) = 0.05
• Find the given proportion in the body of the table of the standardized normal
distribution and then read the corresponding z from the left column and top
row.
• P(Z > z1) = 0.05: z1 = 1.645
• Unstandardize to transform z back to the original scale:

$$z_1 = \frac{X_1 - 145}{4} \Rightarrow 1.645 = \frac{X_1 - 145}{4}$$

• Solving this equation for X1 gives X1 = 145 + 4 × 1.645 = 151.6 cm

Fish longer than 151.6 cm are among the largest 5% of fish.
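
The table lookup in this example corresponds to the inverse cumulative distribution function. A minimal sketch using `statistics.NormalDist.inv_cdf`, with illustrative variable names:

```python
from statistics import NormalDist

std_normal = NormalDist()
mu, sigma = 145.0, 4.0  # body length in cm, from the example

# P(Z > z1) = 0.05 is the same as P(Z < z1) = 0.95.
z1 = std_normal.inv_cdf(0.95)

# Unstandardize: X1 = mu + sigma * z1.
x1 = mu + sigma * z1

print(f"z1 = {z1:.3f}, X1 = {x1:.1f} cm")
```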


What is the body length above which 95% of the fish are found?
• X2 = ? : P(x > X2) = 0.95
• Finding the corresponding z value:
• P(Z > z2) = 0.95 ⇔ P(Z < z2) = 0.05
• As P(Z > 1.645) = 0.05, then P(Z < -1.645) = 0.05: z2 = -1.645
• Unstandardize to transform z back to the original scale:

$$z_2 = \frac{X_2 - 145}{4} \Rightarrow -1.645 = \frac{X_2 - 145}{4}$$

• Solving this equation for X2 gives X2 = 145 - 4 × 1.645 = 138.4 cm

5% of the fish are shorter than 138.4 cm.


What are the lengths of the middle 95% of fish?
• X3, X4 = ? : P(X3 < x < X4) = 0.95
• P(x > X4) = 0.025 and P(x < X3) = 0.025
• Finding the corresponding z values:
• P(Z > z4) = 0.025: z4 = 1.96
• P(Z < z3) = 0.025: z3 = -z4 = -1.96
• Unstandardize to transform z back to the original scale:

$$z_4 = \frac{X_4 - 145}{4} \Rightarrow 1.96 = \frac{X_4 - 145}{4} \Leftrightarrow X_4 = 152.8 \text{ cm}$$

$$z_3 = \frac{X_3 - 145}{4} \Rightarrow -1.96 = \frac{X_3 - 145}{4} \Leftrightarrow X_3 = 137.2 \text{ cm}$$

95% of fish are between 137.2 and 152.8 cm long.
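
Both the lower 5% cut-off and the limits of the middle 95% are quantiles of the same Normal distribution, so they can be read off directly with `inv_cdf`. This is a sketch with illustrative variable names that reuse the X2, X3 and X4 labels of the example.

```python
from statistics import NormalDist

lengths = NormalDist(mu=145.0, sigma=4.0)  # body length in cm

x2 = lengths.inv_cdf(0.05)   # 95% of fish are longer than this
x3 = lengths.inv_cdf(0.025)  # lower limit of the middle 95%
x4 = lengths.inv_cdf(0.975)  # upper limit of the middle 95%

print(f"X2 = {x2:.1f} cm, middle 95%: {x3:.1f} to {x4:.1f} cm")
```

By symmetry, X3 and X4 lie at the same distance from the mean, on opposite sides.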


The distribution of means


If random samples of size n are drawn from a normal population, the
means of these samples will conform to a normal distribution.
The distribution of means from a nonnormal population will not be
normal, but it will tend to approximate a normal distribution as n increases
in size. This result is known as the central limit theorem.
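
The theorem can be illustrated by simulation. In the sketch below the population is exponential with mean 1 and standard deviation 1, chosen only because it is clearly nonnormal. The seed, the sample sizes and the function names are arbitrary assumptions. For a Normal distribution about 68% of values lie within one standard deviation of the mean, so that share is used as a simple indicator of normality.

```python
import math
import random
import statistics


def sample_means(n: int, n_samples: int, seed: int = 1) -> list:
    """Means of n_samples random samples of size n from an exponential population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]


def share_within_one_sd(means: list, n: int) -> float:
    """Share of sample means within one theoretical SD (1/sqrt(n)) of the mean 1."""
    sd = 1.0 / math.sqrt(n)
    return sum(abs(m - 1.0) < sd for m in means) / len(means)


for n in (1, 5, 30):
    means = sample_means(n, 5000)
    print(n, round(statistics.stdev(means), 3), round(share_within_one_sd(means, n), 3))

means_30 = sample_means(30, 5000)
share_1 = share_within_one_sd(sample_means(1, 5000), 1)
share_30 = share_within_one_sd(means_30, 30)
```

With n = 1 the share is well above 68%, reflecting the skewed population, and it moves toward 68% as n increases.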


Sampling distribution of means


The variance of the distribution of means will decrease as n increases.
The variance of the population of all possible means of samples of size n
from a population with variance σ² is:

$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$

The standard deviation of the mean, or the standard error of the mean
(sometimes abbreviated SEM), is:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

Just as Z = (X – μ) / σ is a normal deviate that refers to the normal distribution
of Xi values,

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$

is a normal deviate referring to the normal distribution of means.


[Figure: the sampling distribution of means, a Normal curve with mean µ and standard deviation σ_X̄ (axis marked from µ − 3σ_X̄ to µ + 3σ_X̄), is mapped by Z = (X̄ − µ)/σ_X̄ = (X̄ − µ)/(σ/√n) onto the standard Normal distribution (axis marked from −3 to 3). The areas between successive standard deviations (34.1%, 13.6%, 2.2%, 0.15%) are the same in both curves.]


Example: A population of one-year-old children’s chest circumferences has
μ = 47.0 cm and σ = 12.0 cm.
What is the probability of drawing from it a random sample of nine
measurements that has a mean larger than 50.0 cm?

$$\sigma_{\bar{X}} = \frac{12.0}{\sqrt{9}} = 4.0 \text{ cm}$$

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{50.0 - 47.0}{4.0} = 0.75$$

$$P(\bar{X} > 50.0) = P(Z > 0.75) = 0.2266$$

What is the probability of drawing a sample of 25 measurements from the
preceding population and finding that the mean of this sample is less than 40.0
cm?

$$\sigma_{\bar{X}} = \frac{12.0}{\sqrt{25}} = 2.4 \text{ cm}$$

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{40.0 - 47.0}{2.4} = -2.92$$

$$P(\bar{X} < 40.0) = P(Z < -2.92) = P(Z > 2.92) = 0.0018$$
(Zar, 2010).
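
A sketch of the same two calculations in Python, using the standard error σ/√n and the standard Normal cumulative distribution. The function names are illustrative. The code keeps z unrounded, so the second probability comes out close to, but not exactly equal to, a table value read at z = 2.92.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean for samples of size n."""
    return sigma / sqrt(n)


def z_for_mean(xbar: float, mu: float, sigma: float, n: int) -> float:
    """Normal deviate for a sample mean."""
    return (xbar - mu) / standard_error(sigma, n)


std_normal = NormalDist()
mu, sigma = 47.0, 12.0  # chest circumference in cm, from the example

z9 = z_for_mean(50.0, mu, sigma, 9)
p_mean_above_50 = 1 - std_normal.cdf(z9)   # sample of 9, mean larger than 50.0

z25 = z_for_mean(40.0, mu, sigma, 25)
p_mean_below_40 = std_normal.cdf(z25)      # sample of 25, mean less than 40.0

print(f"z = {z9:.2f}, P = {p_mean_above_50:.4f}")
print(f"z = {z25:.2f}, P = {p_mean_below_40:.4f}")
```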


If 500 random samples of size 25 are taken from the preceding population,
how many would have means larger than 50.0 cm?

$$\sigma_{\bar{X}} = \frac{12.0}{\sqrt{25}} = 2.4 \text{ cm}$$

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{50.0 - 47.0}{2.4} = 1.25$$

$$P(\bar{X} > 50.0) = P(Z > 1.25) = 0.1056$$

Therefore 0.1056 × 500 ≈ 53 samples would be expected to have means larger than
50.0 cm.


The binomial distribution


The binomial setting
1. There are a fixed number n of observations.
2. The n observations are all independent. That is, knowing the result of one
observation does not change the probabilities we assign to other
observations.
3. Each observation falls into one of just two categories, which for convenience
we call “success” and “failure”.
4. The probability of a success, call it p, is the same for each observation.

Example: An obstetrician overseeing n single-birth deliveries is an example
of the binomial setting.
Each single-birth delivery is either a baby girl or a baby boy. Knowing the
outcome of one birth doesn’t change the probability of a girl on any other birth, so
the births are independent. If you call a girl a “success”, then p is the probability
of a girl, which remains the same over all births overseen by that obstetrician.
The number of girls we count is a discrete random variable X. The distribution of
X is called a binomial distribution.
(Baldi and Moore, 2012).


Binomial distribution
The count X of successes in the binomial setting has the binomial distribution with
parameters n and p.
The parameter n is the number of observations, and p is the probability of a
success on any one observation.
The possible values of X are the whole numbers from 0 to n.

Binomial probability
If X has the binomial distribution with n observations and probability p of success
on each observation, the possible values of X are 0, 1, 2, …, n. If k is any one of
these values,
$$P(X = k) = C_k^n \, p^k (1 - p)^{n-k}, \qquad C_k^n = \frac{n!}{k!\,(n-k)!}$$
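
A minimal sketch of the binomial probability formula, using `math.comb` (Python 3.8 or later) for the number of combinations. The function name `binomial_pmf` and the values n = 5, p = 0.25 are illustrative choices.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial count with n observations and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 5, 0.25  # illustrative parameter values
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(distribution):
    print(f"P(X = {k}) = {prob:.4f}")
```

Summing the probabilities over k = 0, 1, ..., n gives 1, as required of any discrete probability model.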


Example: Each child born to a particular set of parents has
probability 0.25 of having blood type O. If these parents have 5 children, what is
the probability that exactly 2 of them have type O blood?

The count of children with type O blood is a binomial random variable X with n = 5
tries and probability p = 0.25 of success on each try. We want P(X = 2).

1. Find the probability that a specific 2 of the 5 tries give successes. The
probability is: 0.25 × 0.25 × 0.75 × 0.75 × 0.75 = (0.25)² × (0.75)³
2. Any arrangement of 2 successes and 3 failures has this same probability. Here
are all the possible arrangements:
SSFFF SFSFF SFFSF SFFFS FSSFF
FSFSF FSFFS FFSSF FFSFS FFFSS
3. There are 10 of them, all with the same probability. The overall probability of 2
successes is therefore:
P(X = 2) = 10 × (0.25)² × (0.75)³ = 0.2637

(Baldi and Moore, 2012).
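
The three steps can be sketched in code by listing the arrangements explicitly with `itertools.combinations`, which chooses the positions of the successes. The variable names are illustrative.

```python
from itertools import combinations

n, k, p = 5, 2, 0.25  # 5 children, 2 with type O blood, P(type O) = 0.25

# Step 2: every way of placing k successes (S) among n tries.
arrangements = [
    "".join("S" if i in positions else "F" for i in range(n))
    for positions in combinations(range(n), k)
]

# Step 1: probability of any one specific arrangement.
p_one_arrangement = p ** k * (1 - p) ** (n - k)

# Step 3: all arrangements are equally likely, so multiply by their number.
p_two_successes = len(arrangements) * p_one_arrangement

print(arrangements)
print(len(arrangements), round(p_two_successes, 4))
```

The number of arrangements found this way is the binomial coefficient for n = 5 and k = 2.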


Probability distribution with n = 5 and p = 0.25

P(X = 0) = 0.2373
P(X = 1) = 0.3955
P(X = 2) = 0.2637
P(X = 3) = 0.0879
P(X = 4) = 0.0146
P(X = 5) = 0.0010

[Figure: bar chart of the binomial probabilities above; horizontal axis: Count (0 to 5), vertical axis: Probability (0 to 0.5).]


Binomial mean and standard deviation


If a count X has the binomial distribution based on n observations with probability
p of success, what is its mean μ?
That is, in very many repetitions of the binomial setting, what will be the average
count of successes?

If the probability that a human birth will be a male is 0.50, the mean
number of baby boys in 10 births overseen by an obstetrician should be 50% of 10,
or 5.

Binomial mean and standard deviation

If a count X has the binomial distribution with number of observations n and
probability of success p, the mean and standard deviation of X are

$$\mu = np$$

$$\sigma = \sqrt{np(1 - p)}$$


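[Figure: bar chart of the binomial distribution with n = 5 and p = 0.25; horizontal axis: Count (0 to 5), vertical axis: Probability (0 to 0.5).]

The mean and standard deviation of the binomial distribution with n = 5 and p =
0.25 are

$$\mu = np = 5 \times 0.25 = 1.25$$

$$\sigma = \sqrt{np(1 - p)} = \sqrt{5 \times 0.25 \times (1 - 0.25)} = 0.968$$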
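
As a cross-check, the mean np and the standard deviation √(np(1 − p)) can be compared with the values obtained directly from the probability distribution. This is a sketch; the function and variable names are illustrative.

```python
from math import comb, sqrt


def binomial_mean_sd(n: int, p: float) -> tuple:
    """Mean np and standard deviation sqrt(np(1 - p)) of a binomial count."""
    return n * p, sqrt(n * p * (1 - p))


n, p = 5, 0.25
mean, sd = binomial_mean_sd(n, p)

# Same quantities computed from the probabilities P(X = k), k = 0..n.
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
mean_from_pmf = sum(k * pk for k, pk in enumerate(pmf))
sd_from_pmf = sqrt(sum((k - mean_from_pmf) ** 2 * pk for k, pk in enumerate(pmf)))

print(f"mean = {mean:.2f}, sd = {sd:.3f}")
```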


[Figure: binomial distributions with probability 0.2 and 0.4 of success on each observation.]

As the number of observations n gets larger, the binomial distribution gets
close to a Normal distribution.


Normal approximation for binomial distributions


Suppose that a count X has the binomial distribution with n observations and
success probability p.
When n is large, the distribution of X is approximately Normal with mean and
standard deviation

$$\mu = np, \qquad \sigma = \sqrt{np(1 - p)}$$

The Normal approximation to the binomial distribution does not work well when n
is small. As a rule of thumb, we will use the Normal approximation only when n is
large enough that
np ≥ 5 and n(1 - p) ≥ 5


Example: The body length of fish of a given species
follows a Normal distribution with mean μ = 145 cm and
standard deviation σ = 4 cm.

What is the probability that a randomly selected fish has a length greater than
149 cm?

$$P(X > 149) = P\left(Z > \frac{149 - 145}{4}\right) = P(Z > 1) = 0.1587$$

What is the probability that exactly 2 fish in a sample of 5 have a length greater
than 149 cm? Rounding the success probability to p = 0.16:

$$P(X = 2) = \frac{5!}{2!\,(5 - 2)!} \times 0.16^2 \times 0.84^3 = 0.15$$
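
This example chains two distributions: a Normal tail probability supplies the success probability p, and a binomial probability then counts successes in the sample. The sketch below does both steps and compares the result for the rounded p = 0.16 with the result for the unrounded tail probability. The names are illustrative.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial count with n observations and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Step 1: probability that one fish is longer than 149 cm.
p_long = 1 - NormalDist(mu=145.0, sigma=4.0).cdf(149.0)

# Step 2: probability that exactly 2 of 5 fish are longer than 149 cm.
p_two_rounded = binomial_pmf(2, 5, 0.16)      # p rounded to two decimals
p_two_unrounded = binomial_pmf(2, 5, p_long)  # p kept unrounded

print(f"p = {p_long:.4f}")
print(f"rounded p: {p_two_rounded:.3f}, unrounded p: {p_two_unrounded:.3f}")
```

Both versions agree to two decimal places, so the rounding of p does not change the answer at that precision.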


What is the probability that, in a sample of 100, at least 20
fish have a length greater than 149 cm?

$$\mu = np = 100 \times 0.1587 = 15.87$$

$$\sigma^2 = np(1 - p) = 100 \times 0.1587 \times 0.8413 = 13.35, \qquad \sigma = \sqrt{13.35} = 3.65$$

For slightly more accurate results with the Normal approximation, you can use a
continuity correction. Because counts can only take integer values but the Normal
distribution can take any real value, the proper continuous equivalent to a count is
the interval of width 1 centred on it.
In the example, the continuous equivalent to a count of 20 is the interval between 19.5
and 20.5, so "at least 20" corresponds to values above 19.5:

$$P(X \geq 20) \approx P(X > 19.5) = P\left(Z > \frac{19.5 - 15.87}{3.65}\right) = P(Z > 0.99) = 0.1611$$


References
The text of these slides may be found in the following references*:

Baldi B. and D.S. Moore - The practice of statistics in the life sciences, W.H. Freeman and
Company, 2012.
Sokal R. and F.J. Rohlf - Biometry: The principles and practice of statistics in biological
research, 4th edition, W.H. Freeman and Company, 2012.
Zar J.H. - Biostatistical Analysis, 5th edition, Prentice-Hall International Inc., 2010.

* The copying and reproduction of portions created by other authors was done only for
educational use.
