Introduction
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to
read and write.”
H. G. WELLS
In the modern world of computers and information technology, the importance of
statistics is well recognized by all disciplines. Statistics originated as a
science of statehood and slowly and steadily found applications in agriculture,
economics, commerce, biology, medicine, industry, planning, education and so on.
Today there is hardly any walk of human life where statistics cannot be
applied. Hence, we are constantly being bombarded with statistics and statistical
information.
In the last 50 years, there has been great development in new statistical
methods, especially computationally demanding methods such as the bootstrap
and nonparametric smoothing. Due to the recent availability of high-speed
computers together with new simulation-based fitting algorithms, Bayesian
methods have become increasingly popular. In contrast to the middle period
of statistics, where frequentist methods were dominant, we currently live in a
frequentist/Bayesian world where statisticians routinely use Bayesian
methods in situations where this inferential perspective has particular
advantages.
The words statistics and statistical are both derived from the Latin word status, which means
a political state. Statistics has been defined differently by different authors over time.
In the olden days, statistics was confined to state affairs only, but in modern
days it embraces almost every sphere of human activity. Therefore, a number of old
definitions, which were confined to a narrow field of enquiry, have been replaced by other
definitions, which are much more comprehensive and exhaustive.
1.2 Definition and classification of Statistics
First let’s see different ways of defining statistics by different authors or Dictionaries.
The American Heritage Dictionary defines statistics as “The mathematics of
collection, organization and interpretation of numerical data, especially the
analyses of population characteristics by inference from sampling.”
The Merriam-Webster’s collegiate Dictionary defines statistics as “A branch of
mathematics dealing with the collection, analyses, interpretation, and
presentation of masses of numerical data.”
The former American Statistical Association president Jon Kettenring defined
statistics as “… the science of learning from data … It presents exciting
opportunities for those who work as professional statisticians. Statistics is
essential for the proper running of government, central to decision making in
industry and a core component of modern educational curricula at all levels.”
Taking the above concepts into consideration, we can define statistics in two senses:
a. In the plural sense: statistics are the raw data themselves, like statistics of births,
statistics of deaths, statistics of students, statistics of imports and exports, etc.
b. In the singular sense: statistics is the subject that deals with the collection, organization,
presentation, analysis and interpretation of numerical data.
Classifications:
Depending on how data are used, statistics is commonly divided into two main areas or
branches.
Descriptive statistics: deals with the meaningful presentation of data such that its
characteristics can be effectively observed. Descriptive statistics consists of the collection,
organization, summarization, and presentation of data. It encompasses the tabular, graphical
or pictorial display of data, condensation of large data into tables, preparation of summary
measures to give a concise description of complex information and also to exhibit pattern that
may be found in data sets.
Inferential statistics: Inferential statistics, on the other hand, deals with drawing inferences and
making decisions by studying a subset or sample from the population. That means it generalizes
the results from sample to population by performing estimations and hypothesis tests,
determining relationships among variables, and making predictions. For example, the average
income of all families (the population) in Ethiopia can be estimated from figures obtained
from a few hundred (the sample) families. It is important because statistical data usually arise
from a sample.
3. Presentation of Data: In this stage the collected and organized data are presented in
some systematic order to facilitate statistical analysis. The organized data are presented with
the help of tables, diagrams and graphs.
4. Analysis of Data: Analysis of data involves extraction of relevant information from the
collected data using some mathematical and statistical tools. In other words, it involves
extracting relevant information from the data (like mean, median, mode, range, variance…),
mainly through the use of elementary mathematical operation.
5. Interpretation of Data: This stage involves drawing a valid conclusion from the analyzed
data. That is interpretation of data involves making inferences (drawing conclusions) based
on the analysis of data.
Applications of statistics:
• In almost all fields of human endeavor.
• Almost all human beings in their daily life deal with obtaining numerical facts,
e.g. about prices.
• Applicable in some process e.g. invention of certain drugs, extent of environmental
pollution.
• In industries especially in quality control area.
Uses of statistics:
The main function of statistics is to enlarge our knowledge of complex phenomena.
The following are some uses of statistics:
1. It presents facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishes a technique of comparison
5. Estimating unknown population characteristics.
6. Testing and formulating of hypothesis.
7. Studying the relationship between two or more variable.
8. Forecasting future events.
Limitations of statistics
As a science statistics has its own limitations. The following are some of the limitations:
• Deals with only quantitative information.
• Deals with only aggregate of facts and not with individual data items.
• Statistical data are only approximately, not mathematically, correct.
• Statistics can be easily misused and therefore should be used by experts.
gender of statistics students, marital status of instructors at UoG, ethnic group of
patients, the age of patients seen in a dental clinic, the number of daily admissions to a
general hospital, and the number of decayed, missing or filled teeth per child in an
elementary school.
Data refers to a collection of facts, values, observations, or measurements that the
variables can assume. The raw material of statistics is data. A collection of data values
forms a data set. Each value in the data set is called a data value or a datum.
Weight, age, length, temperature, speed, salary and mark of students
Scales of measurements
We may generally refer to data as a collection of facts, values, observations, or
measurements. If our data consist of observations that can be classified, ordered, or
quantified, then at what level does the measurement take place? In other words, how are the data
classified, measured or counted? Here we are interested in the forms in which data are found, or the scales
on which data are measured. A measurement scale refers to the property of the values assigned to the
data based on the properties of order, distance and fixed zero. These scales, stated in order of
increasing information content, are classified as nominal, ordinal, interval, and ratio.
Nominal Scales
It is associated with the word name since this scale identifies categories. Observations on a
nominal scale possess neither numerical values nor order. However, observations on this
type of scale can be given numerical codes such as “0 or 1” or “1, 2, 3 . . .”. Note that when
dealing with a nominal scale, the categories defined must be mutually exclusive (each item
falls into one and only one category) and collectively exhaustive (the list of categories is
complete in that each item can be classified). These numbers serve only as identifiers; the
magnitude of the differences between these numerical values is meaningless. Classifying
residents according to zip codes is an example of the nominal level of measurement. Even
though numbers are assigned as zip codes, there is no meaningful order or ranking. The only
valid operations for variables represented by a nominal scale are the determination of “=” or
“≠.”
In short, in Nominal scales of measurements:
No order or ranking can be imposed on the data.
No arithmetic or relational (<, >) operations can be applied to the data.
Examples:
Using numbers to distinguish among the various medical diagnoses
Sex (Male or Female.)
Marital status (married, single, widowed, divorced)
Country code
Student identification number
Regional differentiation of Ethiopia.
Ordinal scales
The ordinal scale (think of the word order) includes all properties of the nominal scale with
the additional property that the observations can be ranked from the smallest to the largest or
from the least important to the most important. (Note that nominal measurements cannot be
ordered—all items are treated equally.) In this regard, the only valid operations for ordinally
scaled variables are “=, ≠, <, >.”
That means, in Ordinal scales of measurement:
There are orders or ranks among the data, but differences between the ranks are not
meaningful.
Arithmetic operations are not applicable but relational operations are applicable.
Example:
Letter grades (A, B, C, D, F).
Rating scales (Excellent, Very good, Good, Fair, poor).
Military status.
Note: Both the nominal and ordinal scales are termed nonnumeric scales since differences
among their values are of no consequence or meaningless.
Interval scales
It includes all the properties of the ordinal scale with the additional property that distance
between observations is meaningful. Here the numbers assigned to the observations indicate
order and possess the property that the difference between any two consecutive values is the
same as the difference between any other two consecutive values (the difference 10 − 9 = 1
has the same meaning as 3 − 2 = 1). It is important to note that while an interval scale has a
zero point, its location may be arbitrary. Hence ratios of interval scale values have no
meaning.
For example the Fahrenheit temperature scale, measured in degrees, is an interval scale, as is
the centigrade scale. The temperature difference between 50 and 60 degrees centigrade (10
degrees) equals the temperature difference between 80 and 90 degrees centigrade (10
degrees). Note that the 0 in each of these scales is arbitrarily placed, which makes the interval
scale different from the ratio scale. If the temperatures in Gondar and Bahirdar were 20 and 40 degrees
centigrade respectively, then we cannot say that Bahirdar is twice as hot as Gondar; hence,
ratios are meaningless on this scale of measurement. The operations for handling
variables measured on an interval scale are “=, ≠, >, <, +, −.”
In general,
Interval scale of measurement is a level of measurement which classifies data
that can be ranked and differences are meaningful. However, there is no
meaningful zero or true zero, so ratios are meaningless.
Example:
IQ
Temperature
SAT scores
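The Gondar and Bahirdar comparison above can be checked numerically. The following is a minimal sketch (the function name `celsius_to_fahrenheit` is our own, not part of the notes): the same two temperatures give a ratio of 2 on the centigrade scale but a different ratio on the Fahrenheit scale, while the difference simply rescales by the fixed factor 9/5.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees centigrade to degrees Fahrenheit."""
    return c * 9 / 5 + 32

gondar_c, bahirdar_c = 20, 40                    # temperatures from the example
gondar_f = celsius_to_fahrenheit(gondar_c)       # 68.0
bahirdar_f = celsius_to_fahrenheit(bahirdar_c)   # 104.0

# Ratios depend on where the arbitrary zero sits, so they are not meaningful.
ratio_c = bahirdar_c / gondar_c                  # 2.0
ratio_f = bahirdar_f / gondar_f                  # about 1.53, not 2

# Differences are meaningful: they only change by the unit factor 9/5.
diff_c = bahirdar_c - gondar_c                   # 20
diff_f = bahirdar_f - gondar_f                   # 36.0 = 20 * 9/5

print(ratio_c, round(ratio_f, 2), diff_c, diff_f)
```

Because the ratio changes with the choice of scale, "twice as hot" is not a property of the temperatures themselves.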
Ratio scales
It includes all the properties of the interval scale with the added property that ratios of
observations are meaningful. This is because absolute zero is uniquely defined. Clearly, the
variable "gift in dollars" is a ratio variable, in that $0 measures the absence of any gift and a gift
of $2000 is twice as large as a gift of $1000 (the ratio is 2/1 = 2). Valid operations for
variables measured on a ratio scale are “=, ≠, >, <, +, −, ×, ÷.”
Generally,
It is a level of measurement which classifies data that can be ranked,
differences are meaningful, and there is a true zero. True ratios exist between
the different units of measure.
All arithmetic and relational operations are applicable.
Examples:
Height, weight, time, salary, age and number of students in the class.
Note: Both the interval and the ratio scales are said to be metric scales (since differences
between values measured on these scales are meaningful), and variables measured on
these scales are said to be quantitative variables.
The following present a list of different attributes and rules for assigning numbers to objects.
Try to classify the different measurement systems into one of the four types of scales.
(Exercise)
Chapter Two: Data Collection, Presentation and Analysis
2 Methods of Data Collection and Presentation
2.1 Methods of data collection
Data: is the raw material of statistics. It can be obtained either by measurement or counting.
2.1.1. Sources of Data:
Statistical data may be obtained from two sources, namely, primary and secondary.
1. Primary Data: data measured or collected by the investigator or the user directly
from the source. Primary sources are sources that can supply first-hand information
to the immediate user.
-There are various methods of collecting primary data:
a. Direct observation: this is counting the data of interest in person.
Drawback: not always possible to observe directly.
Example: Data on the number of cigarette smokers in UoG.
b. Personal interview: this is contacting the desired people individually and
asking questions.
Example: to determine whether the salary of workers in a given factory is fair or
not, an investigator may contact each worker and ask his or her opinion.
Drawbacks: - It is time consuming
- Cost of training interviewers is high
- People may not be open in giving the information we need.
c. Telephone interview: this is contacting the desired people through telephone
lines.
Drawbacks: -Respondents may not be available to telephone calls
- Personal type of questions may not be answered.
d. Written questionnaires: in this case written questionnaires are mailed to
individuals. The method is the most widely used because:
a) A large number of individuals may be contacted within a very short
period of time (i.e. it takes less time).
b) It reduces cost.
2. Secondary Data: When an investigator uses data, which have already been collected
by others, such data are called secondary data. Such data are primary data for the
agency that collected them, and become secondary for someone else who uses these
data for his own purposes. Data gathered or compiled from published and unpublished
sources or files is known as secondary data.
• When our source is secondary data, check that:
o The type and objective of the situation are clear.
o The purpose for which the data were collected is well matched with the
present problem.
o The nature and classification of the data are appropriate to our problem.
o There are no biases or misreporting in the published data.
Note: Data which are primary for one may be secondary for the other purpose.
2.1.2. Methods of Data Collection
A data collection instrument is a document used for gathering and recording of data in a
survey. Questionnaire is the main data collection instrument in formal sample survey.
Data Gathering Techniques
The objective of the survey, the nature of the items of information, the operational feasibility
and cost will often determine the method of data collection. Of the various methods of
collecting the data just a few of them are outlined below.
1. Self-administered Questionnaire
A mail or self-administered questionnaire is a method of data collection in which researchers
give questionnaires with instructions directly to respondents, or mail them to respondents,
who read the instructions and questions, record their answers, and hand the questionnaire back
or return it by mail to the data-collecting agency.
Advantages of this method
Cheapest and can be conducted by a single researcher.
Researcher can send questionnaires to a wide geographical area
Disadvantage
Mail questionnaire is not suitable for illiterate community
Researchers can’t usually observe the respondent’s reactions to questions.
A low response rate is the biggest problem
3. Grouped Frequency Distributions
-There are specific procedures for constructing each type of frequency distribution.
-Tables: include the systematic arrangement of statistical data in columns and rows.
When a single variable is used for classification, the table formed is
called a one-way table.
When two variables are used for classification, the table formed is called a
two-way or contingency table.
When more than two variables are used for classification, the table formed is called a
higher-order table.
1. Categorical Frequency Distributions:
- Used for data that can be placed in specific categories such as nominal or ordinal.
Example: Twenty-five army inductees were given a blood test to determine their blood type.
The data set is as follows.
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
$$\% = \frac{f}{n} \times 100\%$$
Where: f = frequency of the class
n = total number of values.
Percentages are not normally a part of frequency distribution but they can be added
since they are used in certain types of graphical presentations, such as pie graphs.
Step-5: Find the totals for columns C and D.
Combining all the steps we can construct the following frequency distribution.
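As an illustration, the tallying and percentage steps can be sketched in Python for the blood type data listed above. This is only a sketch using the standard-library `Counter`; the variable names are our own choices.

```python
from collections import Counter

# Blood types of the 25 army inductees, as listed in the example.
blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

n = len(blood_types)                      # total number of values
freq = Counter(blood_types)               # frequency f of each class
percent = {cls: f * 100 / n for cls, f in freq.items()}   # % = f / n * 100

print("Class  Frequency  Percent")
for cls in ("A", "B", "O", "AB"):
    print(f"{cls:<6} {freq[cls]:>9}  {percent[cls]:>6.1f}")
print(f"{'Total':<6} {n:>9}  {sum(percent.values()):>6.1f}")
```

The frequencies come out as A = 5, B = 7, O = 9 and AB = 4, which add up to 25.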
Value   Tally   Frequency   Percent
63      /       1           5.0
65      /       1           5.0
70      ////    4           20.0
74      /       1           5.0
75      /       1           5.0
76      //      2           10.0
80      ///     3           15.0
85      ///     3           15.0
90      /       1           5.0
Total           20          100.0
• Cumulative frequency: is the number of observations less than/more than or equal to a
specific value.
• Cumulative frequency above (more than type): it is the total frequency of all values
greater than or equal to the lower class boundary of a given class.
• Cumulative frequency below (less than type): it is the total frequency of all values less
than or equal to the upper class boundary of a given class.
• Cumulative Frequency Distribution (CFD): it is the tabular arrangement of class interval
together with their corresponding cumulative frequencies. It can be more than or less than
type, depending on the type of cumulative frequency used.
• Relative frequency (rf): it is the frequency divided by the total frequency.
• Relative cumulative frequency (rcf): it is the cumulative frequency divided by the total
frequency.
1. There should be between 5 and 20 classes. We rarely use less than 5 or more than 20
classes. The exact number we use depends on the number of observations we have.
2. The classes must be mutually exclusive. This means that no data value can fall into
two different classes.
3. The classes must be all inclusive or exhaustive. This means that all data values must
be included.
4. The classes must be continuous. There are no gaps in a frequency distribution. Classes
that have no values in them must be included (unless it's the first or last class which is
dropped).
5. The classes must be equal in width. The exception here is the first or last class. It is
possible to have a "below ..." or "... and above" class. This is often used with ages.
Where: k= the number of classes desired;
n= the total number of observation of the given data
4. Find the class width by dividing the range by the number of classes and rounding up,
not off: $W = \frac{R}{k} = \frac{L - S}{k}$, where L = largest value and S = smallest value.
5. Pick a suitable starting point less than or equal to the minimum value. The starting
point is the lower limit of the first class. Continue to add the class width to this lower
limit to get the rest of the lower limits. The starting point plus the number of classes
times the class width must be greater than the maximum value.
6. To find the upper limit of the first class, subtract U from the lower limit of the second
class. Then continue to add the class width to this upper limit to find the rest of the
upper limits.
7. Find the boundaries by subtracting U/2 units from the lower limits and adding U/2
units to the upper limits. The boundaries are also half-way between the upper limit
of one class and the lower limit of the next class.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies. Depending on what you're trying to accomplish, it
may not be necessary to find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.
Example*: The blood glucose level for 50 patients is shown below. Construct a frequency
distribution for the following data.
44 50 79 63 66 54 56 70 56 63
60 87 60 70 59 60 62 88 71 53
56 65 74 80 51 83 69 77 69 50
58 42 43 85 43 75 55 60 58 49
72 67 55 77 48 45 61 47 44 61
Solution:
Step 1: Find the highest and the lowest value H=88, L=42
Step 2: Find the range; R=H-L=88-42=46.
Step 3: Select the number of classes desired using Sturges' formula:
$k = 1 + 3.322 \log_{10}(50) = 6.64 \approx 7$ (rounding up)
Step 4: Find the class width: $w = R/k = 46/7 = 6.57 \approx 7$ (rounding up)
Step 5: Select the starting observation as lowest class limit (this is usually the lowest
observation). Add the width to that observation to get the lower limit of the next class. Keep
adding until there are 7 classes.
42, 49, 56, 63, 70, 77, 84 are the lower class limits.
Step 6: Find the upper class limits; e.g. the first upper class limit = 49 − U = 49 − 1 = 48.
48, 55, 62, 69, 76, 83, 90 are the upper class limits.
So combining step 5 and step 6, one can construct the following classes.
Class limits
42-48
49-55
56-62
63-69
70-76
77-83
84-90
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit (LCL)
and adding 0.5 to each upper class limit (UCL), as shown:
$LCB_i = LCL_i - U/2$ and $UCB_i = UCL_i + U/2$
Example: For class 1, $LCB_1 = 42 - 0.5 = 41.5$ and $UCB_1 = 48 + 0.5 = 48.5$
• Then continue adding W to both boundaries to obtain the remaining boundaries. By
doing so one can obtain the following classes.
Class boundary
41.5 – 48.5
48.5 – 55.5
55.5 – 62.5
62.5 – 69.5
69.5 – 76.5
76.5 – 83.5
83.5 – 90.5
Step 8: Tally the data.
Step 9: Write the numeric values for the tallies in the frequency column.
Step 10: Find cumulative frequency.
Step 11: Find relative frequency and /or relative cumulative frequency.
The complete frequency distribution follows:
Class    Class         Class   Freq.   <CF   >CF   RF     <RCF   >RCF
limits   boundary      mark
42-48    41.5 – 48.5   45      8       8     50    0.16   0.16   1
49-55    48.5 – 55.5   52      8       16    42    0.16   0.32   0.84
56-62    55.5 – 62.5   59      13      29    34    0.26   0.58   0.68
63-69    62.5 – 69.5   66      7       36    21    0.14   0.72   0.42
70-76    69.5 – 76.5   73      6       42    14    0.12   0.84   0.28
77-83    76.5 – 83.5   80      5       47    8     0.10   0.94   0.16
84-90    83.5 – 90.5   87      3       50    3     0.06   1      0.06
Total                          50                  1
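The steps of this worked example can be reproduced with a short script. The sketch below assumes the data are recorded to the nearest whole number (so the unit of measurement is U = 1) and starts the first class at the lowest observation, as the example does; all variable names are our own.

```python
import math
from itertools import accumulate

# Blood glucose levels of the 50 patients in the example.
glucose = [
    44, 50, 79, 63, 66, 54, 56, 70, 56, 63,
    60, 87, 60, 70, 59, 60, 62, 88, 71, 53,
    56, 65, 74, 80, 51, 83, 69, 77, 69, 50,
    58, 42, 43, 85, 43, 75, 55, 60, 58, 49,
    72, 67, 55, 77, 48, 45, 61, 47, 44, 61,
]
unit = 1                                      # assumed unit of measurement U

n = len(glucose)
data_range = max(glucose) - min(glucose)      # R = 88 - 42 = 46
k = math.ceil(1 + 3.322 * math.log10(n))      # Sturges' formula, rounded up
width = math.ceil(data_range / k)             # class width, rounded up

lower_limits = [min(glucose) + i * width for i in range(k)]
upper_limits = [lcl + width - unit for lcl in lower_limits]

# Frequency of each class, then cumulative and relative frequencies.
freqs = [sum(lcl <= x <= ucl for x in glucose)
         for lcl, ucl in zip(lower_limits, upper_limits)]
less_cf = list(accumulate(freqs))                    # "less than" type
more_cf = list(accumulate(reversed(freqs)))[::-1]    # "more than" type
rel_freq = [f / n for f in freqs]

print("limits  boundaries  mark  f  <CF  >CF  RF")
for lcl, ucl, f, lcf, mcf, rf in zip(lower_limits, upper_limits, freqs,
                                     less_cf, more_cf, rel_freq):
    print(f"{lcl}-{ucl}  {lcl - unit / 2}-{ucl + unit / 2}  "
          f"{(lcl + ucl) / 2:g}  {f}  {lcf}  {mcf}  {rf:.2f}")
```

Running it gives the same class limits, frequencies and cumulative frequencies as the table above.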
Step 2: Find the angle of the sector for each class.
Step 3: Using a protractor and compass, graph each section and label it with its name and
corresponding percentage.
Class                 Frequency   Percent   Degree
Not immunized         49          37        133.2
Partially immunized   46          35        126
Fully immunized       37          28        100.8
Sum                   132         100       360
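The percent and degree columns can be computed as follows. This sketch assumes, as the table appears to do, that each percentage is first rounded to a whole number and that the angle is then taken as that percentage of 360 degrees.

```python
# Frequencies from the immunization table.
counts = {
    "Not immunized": 49,
    "Partially immunized": 46,
    "Fully immunized": 37,
}
total = sum(counts.values())                       # 132

# Percentage of each class, rounded to a whole percent (assumption).
percent = {cls: round(f * 100 / total) for cls, f in counts.items()}

# Sector angle: the same share of the full circle of 360 degrees.
degrees = {cls: round(p * 360 / 100, 1) for cls, p in percent.items()}

for cls in counts:
    print(f"{cls:<20} {counts[cls]:>4} {percent[cls]:>4} {degrees[cls]:>6}")
```

With these roundings the percentages sum to 100 and the angles to 360.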
[Figure: pie chart of immunization status. Not immunized 37%, Partially immunized 35%, Fully immunized 28%.]
2. Pictogram
-This is a diagrammatic representation of categorical data using small symbolic figures and
pictures to represent data.
-It can be drawn horizontally or vertically.
Note: pictograms (short for “picture diagrams”)
Example: Draw a pictogram to represent the following population of a town during the
years: 1989 to 1992.
Year 1989 1990 1991 1992
Population 2000 3000 5000 7000
3. Bar charts
A set of bars (thick lines or narrow rectangles) used to represent and
compare the frequency distribution of discrete variables and attributes or
categorical series.
In presenting data using bar diagram, all bars must have equal width and the
distance between bars must be equal.
The height or length of each bar indicates the size (frequency) of the figure
represented.
Bars can be drawn either horizontally or vertically.
There are different types of bar charts. The most common are:
i. Simple bar chart
ii. Component bar chart (subdivided bar chart)
iii. Multiple bar chart
iv. Percentage bar chart
v. Broken bar chart
vi. Deviation or two way bar chart
i. Simple bar chart
It is used to represent a single set of data (variable) classified into different
categories.
Example: Consider the immunization status of children
[Figure: simple bar chart. Vertical axis: Number of Children (0 to 50); horizontal axis: Immunization Status (Not immunized, Partially immunized, Fully immunized).]
Fig. Immunization status of children
Single 12 18
Married 24 21
Divorced 24 35
Widowed 14 16
Solution:
Plot the points.
Draw the bars or lines to connect the points.
Histogram
-A graph which places the class boundaries on the horizontal axis and the frequencies on the
vertical axis. Class marks and class limits are sometimes used as the quantity on the X axis.
-For each class in the distribution a vertical rectangle is drawn with its base on the horizontal
axis, extending from one class boundary of the class to the other class boundary; there will
never be any gap between the histogram rectangles.
- If all of the classes have equal width, then the histogram consists of a set of rectangles
having heights equal to the class frequencies and bases equal to the class width.
Example: Construct a histogram to represent the previous data (example *).
[Figure: histogram. Vertical axis: Number of Patients (0 to 14); horizontal axis: Blood Glucose Level, with class boundaries 41.5 – 48.5, 48.5 – 55.5, 55.5 – 62.5, 62.5 – 69.5, 69.5 – 76.5, 76.5 – 83.5, 83.5 – 90.5.]
Fig. Histogram for blood glucose level in milligrams per deciliter, for 50 patients
Frequency Polygon:
-It is a line graph of class frequency on the vertical axis plotted against class marks on the
horizontal axis. It is customary to add the next higher and lower class intervals with a corresponding
frequency of zero; this is to make it a complete polygon.
Remark: It can be obtained by connecting the midpoints of the tops of the rectangles in a
histogram.
Example: Consider example * and construct a frequency polygon
Fig: Frequency polygon for blood glucose level, in milligrams per deciliter, for 50 patients.
Cumulative frequency curve or Ogive
A graph showing the cumulative frequency (less than or more than type) plotted against upper
or lower class boundaries respectively. That is class boundaries are plotted along the
horizontal axis and the corresponding cumulative frequencies are plotted along the vertical
axis. The points are joined by a free hand curve.
To construct an ogive curve:
Compute the cumulative frequency of the distribution.
Prepare a graph with the cumulative frequency on the vertical axis and the true lower
class limits (class boundaries) of the interval scaled along the x-axis (horizontal axis).
- The true lower limit of the lowest class interval with lowest scores is included in the x-
axis scale. This is also the true lower limit of the next lower interval having a
cumulative frequency of 0.
Example: Consider example * and construct an ogive curve(less than type)
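The points of the less than type ogive for example * can be listed before any drawing is done. The sketch below is one way to do this; it pairs each class boundary with the cumulative frequency below it, starting from 0 at the lowest boundary, and leaves the plotting itself to hand or to a plotting tool of your choice.

```python
from itertools import accumulate

# Class boundaries and frequencies from the frequency distribution of example *.
boundaries = [41.5, 48.5, 55.5, 62.5, 69.5, 76.5, 83.5, 90.5]
freqs = [8, 8, 13, 7, 6, 5, 3]

# Cumulative frequency below each boundary; the lowest boundary has 0.
less_than_cf = [0] + list(accumulate(freqs))

# (x, y) points of the ogive: boundary on the x-axis, cumulative frequency on the y-axis.
ogive_points = list(zip(boundaries, less_than_cf))

for x, y in ogive_points:
    print(f"less than {x}: {y}")
```

Joining these eight points in order gives the less than type ogive, which rises from 0 at 41.5 to 50 at 90.5.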
Exercises
1. The following data shows the number of experimental rats tested for their response to
a given drug in 30-day period. Construct a frequency distribution using appropriate
class size.
68 32 28 28 32 53 29
59 23 32 33 20 59 29
31 58 18 32 48 47 28
19 45 25 31 60 31 43
28 37
2. Construct a histogram, a frequency polygon and a less than and/or more than ogive for the
data in exercise 5.
3. In a certain frequency distribution having 50 observations, the smallest and the
highest observations are 27 and 57 respectively. The distribution has constant class
width with classes. Then find:
a. The class width
b. The class limits
c. All the class marks
d. The class boundaries
CHAPTER 3
MEASURES OF CENTRAL TENDENCY
INTRODUCTION
Measures of central tendency are measures of the location of the middle or the center value of
a distribution. The definition of "middle" or "center" is purposely left somewhat vague so that
the term "central tendency" can refer to a wide variety of measures. The tendency of
statistical data to get concentrated at certain values is called the “central tendency” and the
various methods of determining the actual value at which the data tend to concentrate are
called measures of central tendency or averages.
Properties of Measures of Central Tendency:
It should be easy to understand and calculate
It should be rigidly (well) defined, in the sense that it should have one and only one
interpretation so that the personal bias of the investigator does not affect the value or
its usefulness.
It should be representative of the data
It should be affected as little as possible by extreme observations.
It should be capable of further algebraic treatment. For example, if we are given the
average of some groups, then we should be able to find the average of all the items
taken together.
It should be affected as little as possible by fluctuations of sampling.
It should be based on all observations under investigation.
The Summation Notation:
- Let $X_1, X_2, \ldots, X_n$ be a number of measurements, where $n$ is the total number of
observations and $X_i$ is the $i$th observation.
- The symbol $\sum_{i=1}^{n} X_i$ is used to denote the sum of all the $X_i$'s from $i = 1$ to $i = n$, i.e. by
definition:
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$$
- The symbol $\sum$ is the Greek capital letter sigma, denoting the sum.
Properties of Summation:
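1. $\sum_{i=1}^{n} C = nC$, where C is any constant number
2. $\sum_{i=1}^{n} C X_i = C \sum_{i=1}^{n} X_i$, where C is any constant number
3. $\sum_{i=1}^{n} (a + bX_i) = na + b \sum_{i=1}^{n} X_i$, where a and b are any constant numbers
4. $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$
5. $\sum_{i=1}^{n} X_i Y_i \neq \sum_{i=1}^{n} X_i \times \sum_{i=1}^{n} Y_i$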
Examples:
a) $\sum_{i=1}^{4} X_i = X_1 + X_2 + X_3 + X_4$
b) $\sum_{i=1}^{4} 3X_i = 3(X_1 + X_2 + X_3 + X_4) = 3X_1 + 3X_2 + 3X_3 + 3X_4$
c) $\sum_{i=1}^{4} a = a + a + a + a = 4a$
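The properties of summation can be checked numerically. In the sketch below the lists `X` and `Y` and the constants `C`, `a` and `b` are hypothetical values chosen only for illustration.

```python
# Hypothetical measurements and constants, chosen only for illustration.
X = [2, 5, 7, 10]
Y = [1, 3, 4, 6]
n = len(X)
C, a, b = 3, 2, 5

# Property 1: summing a constant n times gives n*C.
prop1 = sum(C for _ in range(n)) == n * C
# Property 2: a constant factor can be taken outside the sum.
prop2 = sum(C * x for x in X) == C * sum(X)
# Property 3: sum of (a + b*X_i) equals n*a + b * sum of X_i.
prop3 = sum(a + b * x for x in X) == n * a + b * sum(X)
# Property 4: the sum of a sum is the sum of the sums.
prop4 = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)
# Property 5: the sum of products is in general NOT the product of the sums.
sum_of_products = sum(x * y for x, y in zip(X, Y))   # 105
product_of_sums = sum(X) * sum(Y)                    # 24 * 14 = 336

print(prop1, prop2, prop3, prop4, sum_of_products, product_of_sums)
```

For these values the first four checks hold, while the last two numbers differ, as the fifth property states.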
-The arithmetic mean of a sample (or simply the sample mean) of n observations
$X_1, X_2, \ldots, X_n$, denoted by $\bar{X}$, is computed as:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$$
-The population mean, $\mu$ (mu), is defined as:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i = \frac{1}{N}(X_1 + X_2 + \cdots + X_N)$$
Solution:
First find the class marks
Find the product of frequency and class marks
Find the mean using the formula
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{k} f_i X_i = \frac{1}{50}(360 + 416 + \cdots + 261) = \frac{3104}{50} = 62.08$$
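The arithmetic in this solution can be verified with a few lines of Python. The class marks and frequencies below are those of the blood glucose frequency distribution (example *); the variable names are our own.

```python
# Class marks and frequencies from the blood glucose frequency distribution.
class_marks = [45, 52, 59, 66, 73, 80, 87]
freqs = [8, 8, 13, 7, 6, 5, 3]

n = sum(freqs)                                            # 50 observations
products = [f * x for f, x in zip(freqs, class_marks)]    # f_i * X_i for each class
total = sum(products)                                     # 360 + 416 + ... + 261
mean = total / n                                          # grouped-data mean

print(products, total, mean)
```

The products sum to 3104, so the mean of the grouped data is 3104 / 50 = 62.08.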
$$\sum_{i=1}^{k} f_i (X_i - \bar{X}) = 0.$$
o Uniqueness: For a given set of data there is one and only one arithmetic mean.
- The sum of squares of deviations from the arithmetic mean is less than that
computed from any other point. Symbolically,
$$\sum_{i=1}^{k} f_i (X_i - \bar{X})^2 < \sum_{i=1}^{k} f_i (X_i - A)^2, \text{ where } A \neq \bar{X}$$
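Both deviation properties can be illustrated with the blood glucose frequency distribution (example *). This is a numerical check rather than a proof, and the comparison points in `other_points` are arbitrary choices of ours.

```python
# Class marks and frequencies from the blood glucose frequency distribution.
class_marks = [45, 52, 59, 66, 73, 80, 87]
freqs = [8, 8, 13, 7, 6, 5, 3]
n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, class_marks)) / n

def sum_dev(center):
    """Sum of f_i * (X_i - center)."""
    return sum(f * (x - center) for f, x in zip(freqs, class_marks))

def sum_sq_dev(center):
    """Sum of f_i * (X_i - center)^2."""
    return sum(f * (x - center) ** 2 for f, x in zip(freqs, class_marks))

other_points = [55, 60, 65, 70]   # arbitrary points A different from the mean

print(round(sum_dev(mean), 9))    # deviations about the mean cancel out
print(sum_sq_dev(mean), [sum_sq_dev(A) for A in other_points])
```

The deviations about the mean sum to zero (up to floating-point rounding), and the sum of squared deviations is smaller at the mean than at any of the other points tried.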
Example: In a class of 60 students, who have taken an exam, 50 are male with an average
mark of 45 and the average mark of females was 60. Find the average mark obtained by the
entire class.
Solutions:
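Males: $n_m = 50$, $\bar{X}_m = 45$
Females: $n_f = 60 - 50 = 10$, $\bar{X}_f = 60$
$$\bar{X}_C = \frac{n_m \bar{X}_m + n_f \bar{X}_f}{n_m + n_f} = \frac{50 \times 45 + 10 \times 60}{50 + 10} = \frac{2850}{60} = 47.5.$$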
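The combined mean of several groups can be sketched as a small function. The helper name `combined_mean` is our own; the group sizes and means are those of the class example.

```python
def combined_mean(sizes, means):
    """Mean of all items taken together, given each group's size and mean."""
    total = sum(n * m for n, m in zip(sizes, means))   # sum of all marks
    return total / sum(sizes)

# 50 males with mean 45 and 60 - 50 = 10 females with mean 60.
class_mean = combined_mean([50, 10], [45, 60])
print(class_mean)   # 47.5
```

The same function works for any number of groups, since each group's total is recovered as its size times its mean.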
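$$\text{Correct } \bar{X} = \text{incorrect } \bar{X} + \frac{\sum X_{corr} - \sum X_{incorr}}{n}$$
Where: $\sum X_{corr}$ = sum of correct items
$\sum X_{incorr}$ = sum of incorrect items
$$\text{Correct } \bar{X} = \text{incorrect } \bar{X} + \frac{1}{n}\left(\sum X_{corr} - \sum X_{incorr}\right) = 53 + \frac{1}{100}(90 - 60) = 53 + 0.3 = 53.3$$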
o The effect of transforming original series on the mean:
a. If any constant k is added to (or subtracted from) every observation, then the new mean
will be the old mean plus (or minus) k.
b. If every observation is multiplied by a constant k, then the new mean will be k times the old
mean.
Example-1: The mean of n variables $X_1, X_2, \ldots, X_n$ is known to be 12. A new set of
variables is obtained by the linear transformation $Y_i = 2X_i - 0.5$. What will be the
mean of the new set of variables?
Solution: $\bar{Y}_{new} = 2\bar{X}_{old} - 0.5 = 2 \times 12 - 0.5 = 23.5.$
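The result of Example-1 can be checked on any data set whose mean is 12. The three values below are hypothetical (the example does not give the data); they are chosen only so that their mean is 12.

```python
from statistics import mean

# Hypothetical observations with mean 12 (the example gives only the mean).
X = [10, 12, 14]
Y = [2 * x - 0.5 for x in X]        # linear transformation Y_i = 2*X_i - 0.5

mean_x = mean(X)                    # 12
mean_y = mean(Y)                    # mean of the transformed values
shortcut = 2 * mean_x - 0.5         # transform the old mean directly

print(mean_x, mean_y, shortcut)
```

Transforming every observation and then averaging gives the same value, 23.5, as transforming the old mean directly.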
Example-2: The mean of a set of variables is 500.
a. If 10 is added to each of the numbers in the set, then what will be the mean
of the new set?
b. If each of the numbers in the set is multiplied by -5, then what will be the
mean of the new set?
Solution:
a. $\bar{X}_{new} = \bar{X}_{old} + 10 = 500 + 10 = 510$
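$$\bar{X}_W = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$$
$$\bar{X}_W = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i} = \frac{60 \times 1 + 75 \times 2 + 63 \times 1 + 59 \times 3 + 55 \times 3}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5.$$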
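The weighted mean computation above can be written as a short sketch. The values and weights are read off the worked figures; the helper name `weighted_mean` is our own.

```python
def weighted_mean(values, weights):
    """Sum of X_i * W_i divided by the sum of the weights W_i."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

values = [60, 75, 63, 59, 55]     # X_i from the worked example
weights = [1, 2, 1, 3, 3]         # W_i from the worked example

result = weighted_mean(values, weights)
print(result)   # 615 / 10 = 61.5
```

With all weights equal, the same function reduces to the ordinary arithmetic mean.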
It is meaningless for nominal or qualitative classified data
Geometric Mean (G.M)
Definition: If all the given observations $X_1, X_2, \ldots, X_n$ are positive, their geometric mean is
simply the nth root of their product. Like the arithmetic mean, it also depends on all
observations. That is,
$$GM = \left(\prod_{i=1}^{n} X_i\right)^{1/n} = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
The geometric mean gives a better measure of central tendency than other means if the
values are measured as ratios, proportions or percentages.
Its main drawback is that it cannot be calculated if one or more values
are zero or negative.
- In practice, GM can be computed by taking logarithms of the $X_i$, that is,
$$\log GM = \log\left(\prod_{i=1}^{n} X_i\right)^{1/n} = \frac{1}{n}\sum_{i=1}^{n} \log(X_i), \quad i = 1, 2, \ldots, n.$$
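The logarithmic route to the geometric mean can be sketched as below, alongside the direct nth-root definition for comparison. The data are hypothetical, and the positivity check reflects the restriction stated above.

```python
import math

def geometric_mean(xs):
    """Geometric mean via the mean of the logarithms."""
    if any(x <= 0 for x in xs):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

data = [2, 8, 4]                              # hypothetical observations
via_logs = geometric_mean(data)
direct = math.prod(data) ** (1 / len(data))   # nth root of the product
print(round(via_logs, 6), round(direct, 6))
```

The log form is usually preferred for long series, because the raw product can overflow or underflow before the root is taken.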
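If the data are arranged in the form of a frequency distribution in which an observation $X_i$ has
frequency $f_i$ (i = 1, 2, ..., k), the harmonic mean is given by
$$H = \frac{1}{\frac{1}{N}\sum_{i=1}^{k} \frac{f_i}{X_i}} = \frac{N}{\sum_{i=1}^{k} f_i / X_i}, \quad \text{where } N = \sum_{i=1}^{k} f_i.$$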
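A minimal sketch of the harmonic mean for a frequency distribution, cross-checked against Python's `statistics.harmonic_mean` applied to the expanded raw data. The values and frequencies are hypothetical.

```python
from statistics import harmonic_mean

def harmonic_mean_freq(values, freqs):
    """H = N / sum(f_i / x_i), with N the total frequency."""
    n_total = sum(freqs)
    return n_total / sum(f / x for x, f in zip(values, freqs))

values = [2, 4]     # hypothetical distinct observations
freqs = [1, 2]      # hypothetical frequencies: one 2 and two 4s
h = harmonic_mean_freq(values, freqs)
h_raw = harmonic_mean([2, 4, 4])   # same data written out in full
print(h, h_raw)
```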
- It fulfils almost all properties of a good measure of central tendency, except that it
cannot be calculated when any observation is zero. Its main advantage is that it gives more
weight to small values and less weight to large values.
Example 3.6:
1) A man travels from A.A to Awasa by car and takes four hours to cover the whole
distance. In the first hour he maintains a speed of 50 km/h, in the second hour
64 km/h, in the third 80 km/h, and in the fourth hour he travels at
55 km/h. Find the average speed of the motorist.
2) The price of a commodity increased by 5%, 8% and 77% in three consecutive years. What is
the average yearly price increase?
3) The arithmetic mean of two numbers is 13 and their geometric mean is 12. Find
a) The numbers
b) H.M
4) Prove the following theorems:
a) If x1 and x2 are two observed values, the geometric mean of their arithmetic mean and
harmonic mean is equal to the geometric mean of the numbers x1 and x2.
b) If A, G, and H stand for A.M, G.M and H.M respectively, the relation $A \geq G \geq H$
holds.
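Before attempting the proofs in item 4, it may help to check the two statements numerically. The sketch below uses two hypothetical positive values (4 and 9, not taken from the exercises); a numerical check illustrates the claims but does not prove them.

```python
import math

x1, x2 = 4.0, 9.0                  # hypothetical positive values
am = (x1 + x2) / 2                 # arithmetic mean
gm = math.sqrt(x1 * x2)            # geometric mean
hm = 2 / (1 / x1 + 1 / x2)         # harmonic mean

# Statement (a): the GM of the AM and HM equals the GM of x1 and x2.
gm_of_am_hm = math.sqrt(am * hm)

# Statement (b): A >= G >= H.
ordering_holds = am >= gm >= hm
print(am, gm, round(hm, 3), round(gm_of_am_hm, 3), ordering_holds)
```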
Mode ( X̂ ):
Definition: It is the value of the distribution that occurs with the highest frequency among all
the observations in a sample. The mode may not exist, and even if it does exist, it may not be
unique.
Unimodal: a distribution having one mode.
Bimodal: a distribution with two modes.
Multimodal: a data set that contains more than one mode.
- For individual series:
Mode = the value with the highest frequency.
Example: The modal age of the age distribution: 23, 28, 28, 31, 32, 34, 37, 42, 50, and 61 is
28, since it occurred twice while the other values occurred only once.
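A short sketch of finding the mode of an individual series by counting frequencies. The helper `modes` is an illustrative name; it returns every value tied for the highest frequency, and an empty list when all distinct values occur equally often (treated here as "no mode").

```python
from collections import Counter

def modes(data):
    """Return all values with the highest frequency, or [] if no mode exists."""
    counts = Counter(data)
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []                      # every value equally frequent
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

ages = [23, 28, 28, 31, 32, 34, 37, 42, 50, 61]
print(modes(ages))  # [28]
```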
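- For a grouped frequency distribution, the mode of the distribution is calculated by the
formula
$$\hat{X} = L_{mod} + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, w$$
Where: $L_{mod}$ = lower class boundary of the modal class
$\Delta_1$ = difference between the frequencies of the modal class and the pre-modal class
$\Delta_2$ = difference between the frequencies of the modal class and the post-modal class
$w$ = class width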
Example: Calculate the mode for grouped frequency distribution of blood pressure levels.
Class limits Freq.
42-48 8
49-55 8
56-62 13
63-69 7
70-76 6
77-83 5
84-90 3
Total 50
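Solutions:
- Identify the modal class: the modal class is the class having the highest frequency in
the distribution. 56-62 is the modal class, so $L_{mod} = 55.5$ and $w = 7$.
- Find the mode using the formula.
$$\hat{X} = L_{mod} + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, w = 55.5 + \frac{(13 - 8)}{(13 - 8) + (13 - 7)} \times 7 = 55.5 + 3.18 = 58.68$$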
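A sketch of the grouped-mode formula applied to the blood pressure table. Note that the fraction of frequency differences is multiplied by the class width, taken here as w = 7 (the distance between successive class boundaries, 48.5 to 55.5). The function and parameter names are illustrative.

```python
def grouped_mode(lower_boundary, width, f_modal, f_prev, f_next):
    """Mode of a grouped distribution from the modal class and its neighbours."""
    d1 = f_modal - f_prev      # modal minus pre-modal frequency
    d2 = f_modal - f_next      # modal minus post-modal frequency
    return lower_boundary + d1 / (d1 + d2) * width

# Modal class 56-62: boundary 55.5, width 7, frequencies 8, 13, 7.
mode_bp = grouped_mode(55.5, 7, f_modal=13, f_prev=8, f_next=7)
print(round(mode_bp, 2))  # 58.68
```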
Advantages of mode:
o It can be calculated for distribution with open end class.
o It is not affected by extreme values.
o We can change the size of the observations without changing the mode.
o Easy to calculate and simple to understand.
o It can be used when the data is nominal such as gender, religious preference, or
political affiliation
Disadvantage of mode:
o It is not based on all values
o The mode is not always unique; that is, a data set can have more than one mode.
o The mode doesn’t always exist for a data set.
Median
Definition: It is the central value of ordered data. It is denoted by $\tilde{X}$.
- Before one can find the median, the data must be arranged in order. Then,
i. When the number of observations is odd, the median is the middle
value.
ii. When the number of observations is even, the median is the arithmetic
mean of the two middle values.
Suppose there are n observations in a sample. If these observations are ordered from
the smallest to the largest, then the median ($\tilde{X}$) is:
$\tilde{X}$ = the $\left(\frac{n+1}{2}\right)$th value, if n is odd.
$\tilde{X}$ = the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th values, if n is even.
- It is used when one must find the center or middle value of a data set.
- Uniqueness: There is only one median for a given set of data.
- Unlike the mean, it is not affected by extreme values. Therefore, when there are
extreme values it is advisable to use the median instead of the mean, especially in
applications.
Example-1: Consider the following data, which consist of white blood counts (in
thousands) taken on admission of all patients entering a small hospital on a given day:
7, 35, 5, 9, 8, 3, 10, 12, and 8.
Solution:
First order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, and 35.
Since n is odd (n = 9), the median is given by the $\left(\frac{9+1}{2}\right)$th = 5th value, which is equal to
8.
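The odd and even rules for the median can be sketched as below, applied to the white blood count data. The function name `median_value` is illustrative; Python's built-in `statistics.median` gives the same result.

```python
def median_value(data):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # ((n+1)/2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of (n/2)th and (n/2+1)th

wbc = [7, 35, 5, 9, 8, 3, 10, 12, 8]   # counts in thousands, n = 9
print(median_value(wbc))  # 8
```

The extreme count of 35 pulls the arithmetic mean of these data above 10, while the median stays at 8.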
Example-2: Consider the grouped frequency distribution of blood pressure levels of
50 first-year male medical students. Calculate the median
Solution:
- First find the less than cumulative frequency
- Identify the median class: median class is the first class whose cumulative
frequency is at least n/2 = 50/2=25. 56-62 is a median class.
- Find the median using the formula.
Class limits Freq. <CF
42-48 8 8
49-55 8 16
56-62 13 29
63-69 7 36
70-76 6 42
77-83 5 47
84-90 3 50
Total 50
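$$\tilde{X} = L_{Med} + \frac{w}{f_{Med}}\left(\frac{n}{2} - C\right) = 55.5 + \frac{7}{13}(25 - 16) = 60.35$$
where $L_{Med}$ is the lower class boundary of the median class, $f_{Med}$ its frequency, $w$ the class width, and $C$ the cumulative frequency of the class preceding the median class.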
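The steps of the grouped-median calculation (accumulate frequencies, locate the median class, interpolate) can be sketched as follows. The function name and the `gap` parameter are assumptions made here: `gap` is the distance between the upper limit of one class and the lower limit of the next (1 in this table), used to turn class limits into class boundaries.

```python
def grouped_median(lower_limits, freqs, width, gap=1):
    """Median of a grouped distribution by interpolation in the median class."""
    n = sum(freqs)
    cum_before = 0                       # cumulative frequency before the class
    for lower, f in zip(lower_limits, freqs):
        if cum_before + f >= n / 2:      # first class reaching n/2
            boundary = lower - gap / 2   # lower class boundary
            return boundary + width / f * (n / 2 - cum_before)
        cum_before += f
    raise ValueError("empty frequency distribution")

lower_limits = [42, 49, 56, 63, 70, 77, 84]
freqs = [8, 8, 13, 7, 6, 5, 3]
median_bp = grouped_median(lower_limits, freqs, width=7)
print(round(median_bp, 2))  # 60.35
```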
Remark:
i. For nominal data (such as sex or race), the mode is the only valid measure.
ii. For ordinal data (such as salary categories), only the mode and median can be
used.
Remark: The kth quartile class (class containing ) is the class with the smallest cumulative
Note that:
Percentiles divide a given set of data into one hundred equal parts.
Note: