
PROBABILITY & STATISTICS
Prepared by:
Ms. KAREN S. TAFALLA

INTRODUCTION
STATISTICS is a collection of methods for planning experiments,
obtaining data, and then organizing, summarizing, analyzing,
interpreting, and drawing conclusions based on the data.

DESCRIPTIVE STATISTICS consists of procedures used to
summarize and describe the important characteristics of a set
of measurements.

INFERENTIAL STATISTICS consists of procedures used to
make inferences about population characteristics from
information contained in a sample drawn from this population.

"The theory of statistics uses probability to measure the
uncertainty associated with an inference. It enables us to
calculate the probabilities of observing specific samples,
under specific assumptions about the population. The
statistician uses these probabilities to evaluate the
uncertainties associated with sample inferences."
Definition of terms:
Data are information or facts necessary to conduct a certain
study.
A variable is a characteristic that changes or varies over time
and/or for different individuals or objects under
consideration.
A random variable is a variable whose numerical value is
determined by the outcome of some chance experiment.

An experimental unit is the individual or object on which a
variable is measured. A single measurement or data
value results when a variable is actually measured on an
experimental unit.
The population in a statistical study is the group of objects
about which conclusions are to be drawn.

A sample is a subset of measurements selected from the
population of interest.
A parameter is a numerical measurement describing some
characteristic of a population, and a statistic is a
numerical measurement describing some characteristic
of a sample.

Univariate data result when a single variable is measured on
a single experimental unit.
Bivariate data result when two variables are
measured on a single experimental unit.
Multivariate data result when more than two variables are
measured.

A. Types of variables:
A qualitative variable measures a quality or characteristic on
each experimental unit.
Ex. - taste ranking: excellent, good, fair, poor
- color of M&M candy: brown, yellow, red, orange,
green, blue
A quantitative variable measures a numerical quantity or
amount on each experimental unit.
Ex. - weight of a package ready to be shipped
- volume of orange juice in a glass

Types of Quantitative Data:
Discrete data result from either a finite number of possible values or
a countable number of possible values (that is, the number
of possible values is 0, 1, 2, and so on).
Continuous data result from infinitely many possible values that can
be associated with points on a continuous scale in such a
way that there are no gaps or interruptions.

B. Four Levels of Measurement:
The nominal level of measurement is characterized by
data that consist of names, labels or categories only; the
data cannot be arranged in an ordering scheme.
Ex. - collection of yes, no, undecided responses to a
survey question
- responses consisting of 10 nurses, 15 teachers,
16 engineers, 5 priests, 20 businessmen
The ordinal level of measurement involves data that may
be arranged in some order, but differences between data
values either cannot be determined or are meaningless.
Ex. - in a sample of 24 car stereos, 15 were rated
good, 6 were rated better, 3 were rated best
- in considering employee promotion, a manager
ranked Myrna 3rd, Al 7th, and Jena 10th

The interval level of measurement is like the ordinal level, with
the additional property that meaningful amounts of differences
between data can be determined. However, there is no
inherent zero starting point.
Ex. - body temperatures (in degrees Celsius)

The ratio level of measurement is the interval level modified
to include the inherent zero starting point. For values at
this level, differences and ratios are meaningful.
Ex. - heights of pine trees along Session Road
- temperature readings on the Kelvin scale, since
the scale has an absolute zero

Classify the following statements as belonging to the area of
descriptive statistics or statistical inference:
(a) As a result of recent cutbacks by the oil-producing
nations, we can expect the price of gasoline to double
in the next year.
(b) At least 5% of all fires reported last year in a certain city
were deliberately set by arsonists.
(c) Of all patients who have received this particular type of
drug at a local clinic, 60% later developed significant
side effects.
(d) Assuming that less than 20% of the Colombian coffee
beans were destroyed by frost this past winter, we
should expect an increase of no more than 30 cents for
a kilogram of coffee by the end of the year.
(e) As a result of a recent poll, most Americans are in favor
of building additional nuclear power plants.

EXERCISES: Understanding the concepts
A. Identify the experimental units on which the following
variables are measured:
1. Gender of student
2. Number of errors on a midterm exam
3. Age of a cancer patient
4. Number of flowers on an azalea plant
5. Color of a car entering the parking lot

B. Identify each variable as quantitative or qualitative:
1. Amount of time it takes to assemble a simple puzzle
2. Number of students in a first grade classroom
3. Rating of a newly elected politician (excellent, good,
fair, poor)
4. State in which a person lives

C. Identify the following quantitative variables as discrete
or continuous:
1. Population in a particular area of the Philippines
2. Weight of newspapers recovered for recycling on a
single day
3. Time to complete a probability exam

D. A data set consists of the ages at death for each of the
41 past presidents of the United States.
1. Is this set of measurements a population or a
sample?
2. What is the variable being measured?
3. Is the variable in part 2 quantitative or qualitative?

E. Determine which of the four levels of measurement is
most appropriate:
1. Weights of a sample of M&M candies
2. Instructors rated as superior, above average, average,
or poor
3. Lengths (in minutes) of movies
4. Zip codes
5. Movies listed according to their genre, such as comedy,
adventure, and romance

FREQUENCY DISTRIBUTION
When the set of data includes a large number of
observed values, it becomes practical to group the data into
classes or categories with the corresponding number of
items falling into each class. The result is a tabular
arrangement called a frequency distribution.
Definition of terms:
A frequency table lists categories (or classes) of scores,
along with counts (or frequencies) of the number of scores
that fall into each category.
The frequency for a particular class is the number of
original scores that fall into that class.

Lower class limits are the smallest numbers that can actually
belong to the different classes.
Upper class limits are the largest numbers that can actually
belong to the different classes.
Class boundaries are the numbers used to separate
classes, but without the gaps created by the class limits.
They are obtained by increasing the upper class limits and
decreasing the lower class limits by the same amount so
that there are no gaps between consecutive classes. The
amount to be added or subtracted is one-half the difference
between the upper limit of one class and the lower limit of
the following class.
Class marks are the midpoints of the classes. They can be
found by adding the lower class limit to the upper class limit
and dividing by 2.

Class width or class size is the difference between two
consecutive lower class limits or two consecutive lower
class boundaries.

Relative frequency is the ratio of the class frequency to the total
frequency.
Cumulative frequency is the accumulated frequency that is less than (<)
or greater than (>) a stated value. We obtain the greater than (>)
cumulative frequency if the
frequencies are summed from the bottom up to find the
number of observations greater than a specified lower
class boundary. The less than (<) cumulative frequency is constructed if
the frequencies are summed from the top down to find the
number of observations less than a particular upper class
boundary.

A. Steps in constructing a frequency table
Step 1: Count the number of data points in the set of data.
Step 2: Determine the range, R, for the entire data set. The
range is the smallest value in the set of data subtracted
from the largest value.
Step 3: Decide on the number of class intervals. The
ideal number of class intervals is somewhere between 5
and 15. To approximate the appropriate number of class
intervals, we may use Herbert Sturges' formula
K = 1 + 3.322 log n
where K stands for the number of classes suggested, log is the
base-10 logarithm, and
n represents the total frequency. Avoid having too many
classes or too few classes. Too many classes may lead to
several empty classes. Too few classes tend to lose
important details of the data.
Step 4: Determine the class width by dividing the range
by the number of classes. Round the result up to a
convenient number. This rounding up (not off) not only
is convenient, but also guarantees that all of the data will
be included in the frequency table.
Class width (i) = round up of (range / number of classes)
Step 5: Select as the lower limit of the first class either the
lowest score or a convenient value slightly less than the
lowest score. This value serves as the starting point.
Step 6: Add the class width to the starting point to get
the second lower class limit. Add the class width to the
second lower class limit to get the third, and so on.
Step 7: List the lower class limits in a vertical column,
and enter the upper class limits, which can be easily
identified at this stage.
Step 8: Represent each score by a tally in the
appropriate class.
Step 9: Replace the tally marks in each class with the
total frequency count for that class.

Example: The test scores of sixty students in Statistics are
recorded as follows:
78 51 61 74 68 78 62 71 88 72
66 77 82 68 68 73 56 82 66 71
58 75 67 75 86 66 70 71 64 73
85 74 62 84 66 92 91 57 61 78
63 73 58 79 61 83 88 81 75 57
68 70 54 79 62 78 59 70 66 81

1. Number of data points = 60
2. Range = 92 - 51 = 41
3. Using Sturges' formula, K = 1 + 3.322 log 60 = 6.91, or about 7.
Therefore, the suggested number of class intervals is seven.
4. The class size or width is computed as i = 41/7 = 5.86, rounded up to 6.
Instead of starting the first class at 51, choose to start
at the nice round number 50.
Thus, the first class is 50-55. Adding 6 to both limits, we
obtain the next interval, 56-61. Because the starting point was
lowered to 50, an eighth class (92-97) is needed to include the
highest score, 92.

CLASS INTERVAL   CLASS BOUNDARIES   MIDPOINT   TALLY   FREQUENCY
50-55            49.5-55.5
56-61            55.5-61.5
62-67            61.5-67.5
68-73            67.5-73.5
74-79            73.5-79.5
80-85            79.5-85.5
86-91            85.5-91.5
92-97            91.5-97.5
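The construction steps can be sketched in Python for the sixty test scores. This is only a minimal sketch under stated assumptions: the function name `frequency_table` and its `start` argument are illustrative, and the scores are assumed to be whole numbers so that each boundary sits 0.5 outside its class limit.

```python
import math

scores = [
    78, 51, 61, 74, 68, 78, 62, 71, 88, 72,
    66, 77, 82, 68, 68, 73, 56, 82, 66, 71,
    58, 75, 67, 75, 86, 66, 70, 71, 64, 73,
    85, 74, 62, 84, 66, 92, 91, 57, 61, 78,
    63, 73, 58, 79, 61, 83, 88, 81, 75, 57,
    68, 70, 54, 79, 62, 78, 59, 70, 66, 81,
]

def frequency_table(data, start=None):
    """Return rows of (lower, upper, lower_bd, upper_bd, midpoint, frequency)."""
    n = len(data)                                  # Step 1
    data_range = max(data) - min(data)             # Step 2
    k = round(1 + 3.322 * math.log10(n))           # Step 3: Sturges' formula
    width = math.ceil(data_range / k)              # Step 4: round up, not off
    lower = min(data) if start is None else start  # Step 5
    rows = []
    while lower <= max(data):                      # Steps 6 to 9
        upper = lower + width - 1                  # whole-number data assumed
        freq = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, lower - 0.5, upper + 0.5,
                     (lower + upper) / 2, freq))
        lower += width
    return rows

table = frequency_table(scores, start=50)
for lo, hi, lb, ub, mid, f in table:
    print(f"{lo}-{hi}  {lb}-{ub}  {mid}  {f}")
```

Starting at 50 rather than 51 is why the sketch produces eight classes even though Sturges' formula suggests seven.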

3. The number of television viewing hours per household
and the prime viewing times are two factors that affect
television advertising income. A random sample of 50
households in a particular viewing area produced the
following estimates of viewing hours per household.
3.0  6.0  7.5  15.0 12.0 6.6  9.5  6.5  8.0  4.0
5.5  6.0  5.6  13.3 13.1 5.5  12.5 5.0  12.0 1.0
3.5  3.0  2.4  3.8  4.5  8.0  2.5  7.5  5.0  10.0
8.0  3.5  2.6  8.5  2.5  6.4  7.6  9.0  2.0  6.5
5.0  7.7  9.3  6.5  8.2  8.8  1.0  14.5 10.5 11.0

a. Starting with the lowest value as the lower class limit,
construct a frequency distribution.
b. Determine the class marks, class boundaries, relative
frequency, <CF, and >CF.

GRAPHICAL REPRESENTATION OF FREQUENCY DISTRIBUTION
A histogram, or frequency histogram, is a bar
graph which consists of a set of rectangles, while the
frequency polygon is a line graph. Both graphs are
intended to show the more salient features of the frequency
distribution.
a. HISTOGRAM
The histogram is a set of vertical bars having their bases
on the horizontal axis, centered on the class marks.
The widths correspond to the class widths and the heights
correspond to the frequencies.
A histogram differs from a bar chart in that the base of each
bar is marked off by the class boundaries rather than the class
limits, so adjacent bars touch.

b. FREQUENCY POLYGON
The frequency polygon is a modification of the histogram:
it is a line graph in which the class
frequencies are plotted against the class marks. To close the
polygon, an extra class mark with zero frequency must be added at
each end. The frequency polygon can also be obtained by connecting the
midpoints of the tops of the rectangles in the histogram.
c. OGIVES
A line graph showing the cumulative frequency of a distribution
is called an ogive. For the less than ogive, the less than
cumulative frequencies are plotted against the upper class
boundaries. For the greater than ogive, the greater than
cumulative frequencies are plotted directly above the lower
class boundaries. These graphs are useful in estimating the
number of observations that are less than or more than a
specified value.
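As a rough illustration of the points an ogive is drawn through, the sketch below computes relative and cumulative frequencies for the test-score classes 50-55 through 92-97. The frequencies are an assumption: they are a tally of the sixty test scores made for this sketch and are not given in the slides, so they are worth re-checking by hand. All variable names are illustrative.

```python
from itertools import accumulate

lower_boundaries = [49.5, 55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5]
upper_boundaries = [55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5, 97.5]
frequencies = [2, 9, 11, 14, 12, 7, 4, 1]   # assumed tally of the 60 scores

total = sum(frequencies)
relative = [f / total for f in frequencies]

# "Less than" CF: frequencies summed from the top of the table down.
less_than_cf = list(accumulate(frequencies))
# "Greater than" CF: frequencies summed from the bottom of the table up.
greater_than_cf = list(accumulate(frequencies[::-1]))[::-1]

# Plotting points: <CF against upper boundaries, >CF against lower boundaries.
less_than_ogive = list(zip(upper_boundaries, less_than_cf))
greater_than_ogive = list(zip(lower_boundaries, greater_than_cf))

for point in less_than_ogive:
    print("less than", point)
for point in greater_than_ogive:
    print("greater than", point)
```

Reading the less than ogive at 73.5, for instance, gives the number of scores below 73.5.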

STEM AND LEAF PLOTS
Another simple way to display the distribution of a
quantitative data set is the stem and leaf plot. This
procedure was introduced by Tukey and is one of the
primary tools of exploratory data analysis. A stem and leaf
diagram consists of a series of horizontal rows of
numbers. The number used to label a row is called a stem,
and the remaining numbers in the row are called leaves.

Steps:
1. Divide each measurement into two parts: the stem and
the leaf.
2. List the stems in a column, with a vertical line to their right.
3. For each measurement, record the leaf portion in the
same row as its corresponding stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding so that the
reader can recreate the actual measurements if
necessary.

Sometimes the available stem choices result in a plot that
contains too few stems and a large number of leaves
within each stem. In this situation, you can stretch the
stems by dividing each one into several lines, depending
on the leaf values assigned to them. Stems are usually
divided in one of two ways:
- Into two lines, with leaves 0-4 in the first line and
leaves 5-9 in the second line.
- Into five lines, with leaves 0-1, 2-3, 4-5, 6-7, and
8-9 in the five lines respectively.

Example:
The data below are the GPAs of 30 Adamson University
freshmen, recorded at the end of the freshman year.
Construct a stem and leaf plot to display the distribution
of the data.

2.0 3.1 1.9 2.5 1.9 2.3 2.6 3.1 2.5 2.1
2.9 3.0 2.7 2.5 2.4 2.7 2.5 2.4 3.0 3.4
2.6 2.8 2.5 2.7 2.9 2.7 2.8 2.2 2.7 2.1
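One possible way to build the plot for these GPAs is sketched below. It assumes every value has exactly one decimal place, takes the units digit as the stem and the tenths digit as the leaf, and does not stretch the stems; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

gpas = [
    2.0, 3.1, 1.9, 2.5, 1.9, 2.3, 2.6, 3.1, 2.5, 2.1,
    2.9, 3.0, 2.7, 2.5, 2.4, 2.7, 2.5, 2.4, 3.0, 3.4,
    2.6, 2.8, 2.5, 2.7, 2.9, 2.7, 2.8, 2.2, 2.7, 2.1,
]

def stem_and_leaf(values):
    """Map each stem (units digit) to its sorted leaves (tenths digits)."""
    rows = defaultdict(list)
    for value in values:
        tenths = round(value * 10)        # 2.5 -> 25, avoids float noise
        stem, leaf = divmod(tenths, 10)   # 25 -> stem 2, leaf 5
        rows[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(rows.items())}

plot = stem_and_leaf(gpas)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 2 | 5 means 2.5")
```

With only three stems, most leaves land on stem 2, which is the situation where stretching the stems into two or five lines becomes useful.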

DESCRIPTIVE STATISTICS
MEASURES OF CENTRAL TENDENCY
A measure of central tendency gives a single
value that acts as a representative average of
the values of all the outcomes of your
experiment. Three parameters that measure the
center of the distribution in some sense are of
interest. These parameters are called the
population mean, the population median and the
population mode.

A. THE MEAN
For Ungrouped Data:
Let $x_1, x_2, x_3, \ldots, x_n$ be n observations of a random variable X. The
sample mean, denoted by $\bar{x}$ (x-bar), is the arithmetic average of these
values. That is,
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
For Grouped Data:
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
Where:
$f_i$ is the frequency of class interval i
$x_i$ is the class midpoint of class interval i

B. THE MEDIAN
For Ungrouped Data:
Let $x_1, x_2, x_3, \ldots, x_n$ be sample observations arranged in order from smallest to largest. The
sample median for this collection is given by the middle observation if n is odd. If n is even, the
sample median is the average of the two middle observations.
For Grouped Data:
When the data are grouped into a frequency distribution, the median is obtained by finding the cell
that contains the middle number and then interpolating within the cell.
$$\tilde{x} = Lb_i + \frac{n/2 - {<}cf_{i-1}}{f_i}\,(i)$$
OR
$$\tilde{x} = Ub_i - \frac{n/2 - {>}cf_{i-1}}{f_i}\,(i)$$
where:
$Lb_i$ = lower class boundary of the interpolated interval
$Ub_i$ = upper class boundary of the interpolated interval
${<}cf_{i-1}$ = less than cumulative frequency of the class before the interpolated interval
${>}cf_{i-1}$ = greater than cumulative frequency of the class before the interpolated interval, counting from the highest class down
$f_i$ = frequency of the interpolated interval
$i$ = class size
$n$ = number of data points

C. THE MODE
The last measure of central tendency is the mode. For a finite
population, the population mode is the value of X that occurs most
often. The mode of a sample is the value that occurs most often in
the sample. The drawback to this measure is that there might not be
a unique mode: there might be no single number that occurs more
often than any other. For this reason, the mode is not a particularly
useful descriptive measure.
When the data are grouped into a frequency distribution, the
midpoint of the cell with the highest frequency is the mode, since
this point represents the highest point (greatest frequency).

EXAMPLES:
1. The reaction times for a random sample of 9 subjects to a stimulant
were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1 and 4.3
seconds. Calculate the mean, median and mode.
$$\text{Mean} = \frac{2.5 + 3.6 + 3.1 + 4.3 + 2.9 + 2.3 + 2.6 + 4.1 + 4.3}{9} = \frac{29.7}{9}$$
Mean = 3.3
Median: 2.3, 2.5, 2.6, 2.9, 3.1, 3.6, 4.1, 4.3, 4.3
Median = 3.1
Mode = 4.3
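The same three measures can be checked with Python's standard `statistics` module. This is a quick sketch; the variable names are illustrative, and `multimode` is used instead of `mode` so that data with more than one mode would return every mode.

```python
import statistics

times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 4.3]  # seconds

mean = statistics.mean(times)         # arithmetic average
median = statistics.median(times)     # middle value of the sorted data
modes = statistics.multimode(times)   # list of the most frequent value(s)

print(round(mean, 2), median, modes)
```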

2. The frequency table below represents the final
examination scores for a statistics course. Find the mean, the
median and the mode.

Class Interval   Frequency   Class Mark   Cumulative Frequency (<CF)
10-19            3           14.5         3
20-29            2           24.5         5
30-39            3           34.5         8
40-49            4           44.5         12
50-59            5           54.5         17
60-69            11          64.5         28
70-79            14          74.5         42
80-89            14          84.5         56
90-99            4           94.5         60

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i}$$
$$\text{Mean} = \frac{(3)(14.5) + (2)(24.5) + (3)(34.5) + (4)(44.5) + (5)(54.5) + (11)(64.5) + (14)(74.5) + (14)(84.5) + (4)(94.5)}{3 + 2 + 3 + 4 + 5 + 11 + 14 + 14 + 4} = \frac{3960}{60}$$
Mean = 66

$$\text{Median} = Lb_i + \frac{n/2 - {<}cf_{i-1}}{f_i}\,(i)$$
$$\text{Median} = 69.5 + \frac{60/2 - 28}{14}\,(10)$$
Median = 70.93
Mode = class mark with the highest frequency
Mode = 74.5 and 84.5
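The grouped-data calculation above can be reproduced with a short sketch. It assumes whole-number class limits (so each lower boundary is the lower limit minus 0.5) and equal class sizes; the variable names are illustrative.

```python
lower_limits = [10, 20, 30, 40, 50, 60, 70, 80, 90]
frequencies = [3, 2, 3, 4, 5, 11, 14, 14, 4]
class_size = 10
midpoints = [lo + (class_size - 1) / 2 for lo in lower_limits]  # 14.5, 24.5, ...

n = sum(frequencies)

# Grouped mean: sum of f * x divided by sum of f.
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Grouped median: find the class holding the (n/2)th score, then interpolate.
cumulative = 0  # less-than CF of the classes before the current one
for lo, f in zip(lower_limits, frequencies):
    if cumulative + f >= n / 2:
        lower_boundary = lo - 0.5
        grouped_median = lower_boundary + (n / 2 - cumulative) / f * class_size
        break
    cumulative += f

# Grouped mode: class mark(s) with the highest frequency.
highest = max(frequencies)
grouped_modes = [x for f, x in zip(frequencies, midpoints) if f == highest]

print(grouped_mean, round(grouped_median, 2), grouped_modes)
```

Two classes share the highest frequency of 14, so this distribution is bimodal.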

MEASURES OF VARIABILITY
Measures of variability refer to the extent of scatter or dispersion around the
zone of central tendency.
A. RANGE
One measure of variation is the range, which has the advantage of
being very easy to compute. The range, R, of a set of n measurements is
defined as the difference between the largest and smallest
measurements.
Formula:
Range = Highest score - Lowest score, or R = H - L
B. VARIANCE and STANDARD DEVIATION
The variance of a population of N measurements is defined to be the
average of the squares of the deviations of the measurements about their
mean $\mu$. The population variance is denoted by $\sigma^2$ and is given by the
formula
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{for ungrouped data}$$
$$\sigma^2 = \frac{\sum f (x - \mu)^2}{\sum f} \quad \text{for grouped data}$$

The variance of a sample of n measurements is defined to be the sum of the
squared deviations of the measurements about their mean $\bar{x}$ divided by (n - 1). The
sample variance is denoted by $s^2$ and is given by the formula
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{for ungrouped data}$$
$$s^2 = \frac{\sum f (x - \bar{x})^2}{\sum f - 1} \quad \text{for grouped data}$$
The standard deviation, in essence, represents the average amount of
variability in a set of measures, using the mean as a reference point. Strictly
speaking, the standard deviation is the positive square root of the average of the
squared deviations about the mean, that is, the positive square root of the variance. The
standard deviation is basically a measure of how far each score, on the average,
is from the mean.

1. The reaction times for a random sample of 9 subjects to a stimulant
were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1 and 4.3
seconds. Calculate the range, variance and standard deviation.
Range = HV - LV
= 4.3 - 2.3 = 2
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{(2.5-3.3)^2 + (3.6-3.3)^2 + (3.1-3.3)^2 + (4.3-3.3)^2 + (2.9-3.3)^2 + (2.3-3.3)^2 + (2.6-3.3)^2 + (4.1-3.3)^2 + (4.3-3.3)^2}{9 - 1} = \frac{5.06}{8}$$
= 0.6325 (sample variance)

$$s = \sqrt{0.6325}$$
= 0.795298686 or 0.80 (sample standard deviation)
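As a check on the arithmetic, the sample variance and standard deviation can be computed with the standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1. The variable names below are illustrative.

```python
import statistics

times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 4.3]  # seconds

value_range = max(times) - min(times)        # highest minus lowest
sample_variance = statistics.variance(times) # divides by n - 1
sample_sd = statistics.stdev(times)          # square root of the variance

print(round(value_range, 2), round(sample_variance, 4), round(sample_sd, 2))
```

For a full population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by N instead.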

2. The frequency table below represents the final
examination scores for a statistics course. Find the population
range, population variance and population standard
deviation.

Class Interval   Frequency   Class Mark   Cumulative Frequency
10-19            3           14.5         3
20-29            2           24.5         5
30-39            3           34.5         8
40-49            4           44.5         12
50-59            5           54.5         17
60-69            11          64.5         28
70-79            14          74.5         42
80-89            14          84.5         56
90-99            4           94.5         60

Range = Highest Upper Class Boundary - Smallest Lower Class Boundary
= 99.5 - 9.5
= 90
$$\sigma^2 = \frac{\sum f (x - \mu)^2}{\sum f}$$
$$\sigma^2 = \frac{3(14.5-66)^2 + 2(24.5-66)^2 + 3(34.5-66)^2 + 4(44.5-66)^2 + 5(54.5-66)^2 + 11(64.5-66)^2 + 14(74.5-66)^2 + 14(84.5-66)^2 + 4(94.5-66)^2}{60} = \frac{25965}{60}$$
$\sigma^2$ = 432.75

$\sigma = \sqrt{432.75}$ = 20.80264406 or 20.80
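A short sketch of the grouped population variance follows, using the class marks and frequencies from the table. Treating every score in a class as if it sat at the class mark is the usual grouped-data approximation; the variable names are illustrative.

```python
import math

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [3, 2, 3, 4, 5, 11, 14, 14, 4]

N = sum(frequencies)
mu = sum(f * x for f, x in zip(frequencies, midpoints)) / N

# Population variance: frequency-weighted squared deviations divided by N.
population_variance = sum(
    f * (x - mu) ** 2 for f, x in zip(frequencies, midpoints)
) / N
population_sd = math.sqrt(population_variance)

print(mu, population_variance, round(population_sd, 2))
```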

Measures of Shape
- refer to the visual characteristics of a certain
distribution.
- knowledge of the shape of the distribution can
help in concluding whether the distribution is
normal or not.

Two (2) Principal Measures of Shape:
SKEWNESS
refers to the symmetry of a
distribution. A distribution
which is not symmetric with
respect to its mean can be
termed as either positively skewed or negatively skewed.
KURTOSIS
refers to the flatness or
peakedness of a particular
distribution.

Skewness
$$SK = \frac{\sum \left[ (X_i - \mu)/\sigma \right]^3}{N}$$
where:
$X_i$ = individual reading
$\sigma$ = standard deviation
$\mu$ = mean
$N$ = population size

SK = 0: Symmetric (Normal)
SK > 0: Positively Skewed
SK < 0: Negatively Skewed

negative skew: The left tail is longer than the right tail. It
has relatively few low values. The distribution is said to
be left-skewed or "skewed to the left". Example
(observations): 1, 1000, 1001, 1002, 1003.
positive skew: The right tail is longer than the left tail. It has
relatively few high values. The distribution is said to be
right-skewed or "skewed to the right". Example
(observations): 1, 2, 3, 4, 100.

The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.

Kurtosis
$$k = \frac{\sum \left[ (X_i - \mu)/\sigma \right]^4}{N}$$
where:
$X_i$ = individual reading
$\sigma$ = standard deviation
$\mu$ = mean
$N$ = population size

k = 3: Mesokurtic (Normal)
k > 3: Leptokurtic
k < 3: Platykurtic

A platykurtic data set has a flatter peak around its mean,
which causes thin tails within the distribution. The
flatness results from the data being less concentrated
around its mean, due to large variations within
observations.
Mesokurtic data have a kurtosis similar, or identical, to
the kurtosis of a normally distributed data set.
Leptokurtic distributions have higher peaks around the
mean compared to normal distributions, which leads to
thick tails on both sides. These peaks result from the data
being highly concentrated around the mean, due to lower
variations within observations.
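The two shape measures can be sketched directly from their formulas. The function name `shape_measures` is illustrative, the population standard deviation (dividing by N) is assumed, and the symmetric data set 1, 2, 3, 4, 5 is an added illustration rather than an example from the slides.

```python
import math

def shape_measures(values):
    """Return (skewness, kurtosis) using population formulas (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    z = [(x - mu) / sigma for x in values]      # standardized readings
    skewness = sum(v ** 3 for v in z) / n       # third power
    kurtosis = sum(v ** 4 for v in z) / n       # fourth power
    return skewness, kurtosis

left_sk, left_k = shape_measures([1, 1000, 1001, 1002, 1003])
right_sk, right_k = shape_measures([1, 2, 3, 4, 100])
sym_sk, sym_k = shape_measures([1, 2, 3, 4, 5])  # hypothetical symmetric data

print("left-skewed example: ", round(left_sk, 3))
print("right-skewed example:", round(right_sk, 3))
print("symmetric example:   ", round(sym_sk, 3), round(sym_k, 3))
```

The signs of the first two results match the negative skew and positive skew examples, and the symmetric set gives a skewness of zero with a kurtosis below 3.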

Examples
1. A technician checks the resistance value of 5 coils and
records the values in ohms: 3.35, 3.37, 3.28, 3.34 and
3.30. Determine the average.

2. Tensile tests on aluminum alloy rods are conducted at
three different times, which results in three different
average values in megapascals (MPa). On the first
occasion, 5 tests are conducted with an average of 207
MPa; on the second occasion, 6 tests, with an average of
203 MPa; and on the last occasion, 3 tests, with an
average of 206 MPa. Determine the weighted average.

3. Determine the standard deviation of the moisture content
of a roll of kraft paper. The results of six readings across
the paper web are 6.7, 6.0, 6.4, 6.4, 5.9, and 5.8%.

4. Given the frequency distribution of the life of 320
automotive tires in 1000 km as shown in the table below,
determine the average and standard deviation.

Boundaries   Midpoint   Frequency
23.5-26.5    25.0
26.5-29.5    28.0       36
29.5-32.5    31.0       51
32.5-35.5    34.0       63
35.5-38.5    37.0       58
38.5-41.5    40.0       52
41.5-44.5    43.0       34
44.5-47.5    46.0       16
47.5-50.5    49.0

PRACTICAL SIGNIFICANCE OF THE STANDARD DEVIATION
A. TCHEBYSHEFF'S THEOREM
Tchebysheff's theorem applies to any set of measurements and can
be used to describe either a sample or a population. An interval is
constructed by measuring a distance $k\sigma$ on either side of the mean
$\mu$. Note that the theorem is true for any number we choose for k as long as it
is greater than or equal to 1. Then at least $1 - (1/k^2)$ of the total
number of n measurements lie in the constructed interval.

[Figure: interval from $\mu - k\sigma$ to $\mu + k\sigma$ containing at least $1 - (1/k^2)$ of the measurements]

The theorem states that:
At least none of the measurements lie in the interval
$\mu - \sigma$ to $\mu + \sigma$.
At least 3/4 of the measurements lie in the interval
$\mu - 2\sigma$ to $\mu + 2\sigma$.
At least 8/9 of the measurements lie in the
interval $\mu - 3\sigma$ to $\mu + 3\sigma$.
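The three fractions quoted above follow from evaluating 1 - (1/k²) at k = 1, 2 and 3. A small sketch, using exact fractions so the results print as 0, 3/4 and 8/9 (the function name `chebyshev_bound` is illustrative):

```python
from fractions import Fraction

def chebyshev_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    k = Fraction(k)
    if k < 1:
        raise ValueError("the theorem is stated for k >= 1")
    return 1 - 1 / (k * k)

for k in (1, 2, 3):
    print(f"k = {k}: at least {chebyshev_bound(k)} of the measurements")
```

For k = 1 the bound is zero, which is why the theorem says nothing useful about the interval one standard deviation wide on each side.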

B. EMPIRICAL RULE
Another rule helpful in interpreting a value for a
standard deviation is the Empirical Rule, which
applies to a data set having a distribution that is
approximately bell-shaped. The empirical rule is
often stated in abbreviated form, sometimes
called the 68-95-99.7 rule: approximately 68% of the
measurements lie within one standard deviation of the
mean, 95% within two, and 99.7% within three.

1. A sample of 3000 observations has a mean of 82
and a standard deviation of 16.
Using the empirical rule, find what percentage of the
observations fall in the intervals
$\bar{x} \pm 2s$; $\bar{x} \pm 3s$.
2. The mean life of a certain brand of auto batteries is
44 months with a standard deviation of three
months. Assume that the lives of all auto batteries of
this brand have a bell-shaped distribution. Using the
empirical rule, find the percentage of auto batteries
of this brand that have a life of
a. 41 to 47 months
b. 38 to 50 months
c. 35 to 53 months

3. The ages of cars owned by all employees of
a large company have a bell-shaped
distribution with a mean of seven years and
a standard deviation of 2 years.
a. Using the empirical rule, find the
percentage of cars owned by these
employees that are i. 5 to 9 years old ii. 1 to 13
years old.
b. Using the empirical rule, find the interval
that contains the ages of the cars owned by
95% of all employees of this company.

MEASURES OF POSITION

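A. PERCENTILE
A set of n measurements on the variable
x has been arranged in order of
magnitude. The pth percentile is the value
that separates the bottom p% of the ranked
scores from the top (100-p)%.
$$\text{Any percentile} = \begin{cases} \dfrac{X_{np} + X_{np+1}}{2} & \text{if } np \text{ is an integer} \\[2ex] X_{\lceil np \rceil} \ (\text{round } np \text{ up to the next largest integer}) & \text{if } np \text{ is not an integer} \end{cases}$$
Here p is expressed as a proportion, for example 0.25 for the 25th percentile.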
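The ungrouped rule can be sketched as follows. To avoid floating-point trouble when testing whether np is an integer, this sketch takes p as a whole-number percentage (25 for the 25th percentile), which is an assumption of the sketch rather than the notation of the slides; the function name `percentile` is illustrative. The reaction times from the earlier example serve as test data.

```python
import math

def percentile(values, p):
    """pth percentile, with p a whole-number percentage strictly between 0 and 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if (n * p) % 100 == 0:
        # np is an integer: average the (np)th and (np + 1)th ranked values.
        position = n * p // 100
        return (ordered[position - 1] + ordered[position]) / 2
    # np is not an integer: round up and take that ranked value.
    position = math.ceil(n * p / 100)
    return ordered[position - 1]

times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 4.3]
print(percentile(times, 25), percentile(times, 50), percentile(times, 75))
```

The 50th percentile of the nine reaction times comes out as 3.1, agreeing with the median found earlier.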

For Grouped Data
$$\text{Any Percentile} = Lb + \frac{np - {<}cf_{i-1}}{f_i}\,(i)$$
OR
$$\text{Any Percentile} = Ub - \frac{n(1-p) - {>}cf_{i-1}}{f_i}\,(i)$$
where:
$Lb$ = lower class boundary of the interpolated interval
$Ub$ = upper class boundary of the interpolated interval
${<}cf_{i-1}$ = less than cumulative frequency of the class before the interpolated interval
${>}cf_{i-1}$ = greater than cumulative frequency of the class before the interpolated interval, counting from the highest class down
$f_i$ = frequency of the interpolated interval
$i$ = class size
$n$ = number of data points
$p$ = the desired proportion or percentile
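A sketch of the first (lower boundary) form of the grouped formula is given below, applied to the final examination table used in the earlier examples. The function name `grouped_percentile` is illustrative, equal class sizes are assumed, and p is a proportion such as 0.25.

```python
def grouped_percentile(lower_boundaries, frequencies, class_size, p):
    """Interpolate the percentile for proportion p from a frequency table."""
    n = sum(frequencies)
    target = n * p            # rank of the desired score
    cumulative = 0            # less-than CF of the classes before the current one
    for lb, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= target:
            return lb + (target - cumulative) / f * class_size
        cumulative += f
    raise ValueError("p must be between 0 and 1")

lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
frequencies = [3, 2, 3, 4, 5, 11, 14, 14, 4]

q1 = grouped_percentile(lower_boundaries, frequencies, 10, 0.25)
q2 = grouped_percentile(lower_boundaries, frequencies, 10, 0.50)
q3 = grouped_percentile(lower_boundaries, frequencies, 10, 0.75)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

With p = 0.50 the result matches the grouped median of 70.93 computed earlier.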

B. QUARTILES are values that divide a set of
observations into 4 equal parts. These
values, denoted by Q1, Q2 and Q3, are
such that 25% of the data fall below Q1,
50% fall below Q2, and 75% fall below
Q3.

C. DECILES are values that divide a set of
observations into 10 equal parts.
