So, What is Statistics? It is the Collection of Data, and the Conversion of this data into Information.
What is Data, and what is Information? Data are individual recorded attributes, and there are two main
types, Quantitative and Qualitative data. Information is data organized for specific purposes. In statistics,
data is organized for two main purposes: to describe the data (Descriptive), and to use the information
to make decisions (Inferential). So statistical study can be broken into three parts: Statistical Data
Collection, Descriptive Statistics, and Inferential Statistics. We shall later expand on each of them.
So, Where does Statistics come from? On a bigger scale, rulers have always been interested in
knowing the composition of their subjects, especially the number of young men fit to be sent out to
kill, or be killed, for the glory of the ruler. Rulers have therefore been very fond of the Census; one of
the best known, according to the Bible, led to Jesus Christ being born in a manger. On a smaller
scale, statistics is a major part of the basis of most of our personal decisions, because it forms the basis of
what we call experience. “I shop at No Frills, because prices there are cheaper (most times) than
Loblaws’.” It is intertwined with chance or probability, so it comes from most of our daily routines.
So, Why should ‘we’ study it? Variation (according to Derek Stephens of Sick Kids) is what makes
the study of statistics necessary. For example, if all the people in a country were alike in every
conceivable characteristic, then what satisfies one would satisfy all, and there would be no need for a census.
We therefore study statistics primarily to make sense of the variation in a group. The scientific
method, which underlies experimental science, technological research and development, and
research in the social sciences, is essentially statistical in its methods, and for this alone it is worth
studying statistics. Knowledge of statistics makes us ‘better’ consumers of advertising and
propaganda. Statistics is invaluable in planning, especially for large groups like the people in a country.
We will now learn a bit more about the three parts of Statistics: Statistical Data Collection,
Descriptive Statistics, and Inferential Statistics.
Statistics Text - http://www.cimt.plymouth.ac.uk/projects/mepres/allgcse/pbtxt.pdf
Statistical Data Collection:
http://www.cimt.plymouth.ac.uk/projects/mepres/alevel/stats_ch2.pdf
Types of Data: 1. Quantitative Data 2. Qualitative (Category) Data.
Quantitative Data are Numerical Data. There are two types: a) Discrete and b) Continuous.
Discrete Numerical Data: Possible numerical outcomes can be counted. Example: the number of
students in a class is 0, 1, 2, 3, …; that is, it is a non-negative integer. Shoe sizes are also discrete, since
there are a finite number of them (note: some are fractions). Another example is the bank balances of Canadians.
Continuous Numerical Data: Possible numerical outcomes cannot be counted. Example: the size of
feet measured in any unit of length, e.g. centimetres. Another example is the weight of any three oranges
measured in any unit of weight, e.g. grams.
Qualitative (Category) Data: A ‘measure’ that puts subjects in non-quantifiable groups. There are
two types: a) Nominal: category by name, b) Rank: category by rank. Colour (e.g. of cars), place of
birth, and SIN that starts with 405, 905, etc. are examples of Nominal, and 1st, 2nd, 3rd year is an example
of Rank.
Statistical Population: All the subjects under a statistical study. TYP students become the
population if the study is restricted to TYP students, for example, finding the height of TYP students.
A Census is a statistical study which includes each member of the adult population of a country.
If each member of a population is included in the study, then the problem of Data Collection is
reduced to ‘how ‘information’ is collected’ from the subjects. For most statistical studies, the size of
the population and/or the cost of the study and/or the nature of the study make it impractical to
collect data from each member of the population. In such situations, data is collected from a
representative group, or sample, chosen from the population. The data from this sample is then
assumed to be applicable to the whole population. So to the problem of ‘how ‘information’ is
collected’ is added the problem of ‘how the sample was chosen’.
‘How ‘Information’ is collected’? Data is collected from subjects either passively, by not interacting
with the subjects, or actively, by interacting with the subjects. Collecting data passively is mostly by
observation and counting. An example is data on the number of people passing through a given
place in a given time interval. Active data collection involves measuring, usually with an instrument,
and most often by questionnaire. Problems with measuring with instruments are problems
associated with the instruments, which most times are resolved technically. Data collection by
questionnaire, on the other hand, has many ‘hidden’ problems, from the way the questions are framed
to whether subjects respond verbally or in writing. So we reduce the problems we will look at in
data collection to whether the whole population is studied and, if not, how a sample is chosen
(Sampling), and, if questionnaires are used, how they are prepared and used.
Sampling: http://www.cimt.plymouth.ac.uk/projects/mepres/book9/bk9_18.pdf
The criterion for choosing a sample to represent a population for statistical study is that each member
of the population must have an equal chance of being chosen. This is similar to Lotto 649, for which
each of the 49 numbers has an equal chance of being one of the 6 numbers chosen for the jackpot.
The best method to achieve this is Random Sampling. So to Randomly choose a sample is to
give each member of your population the same chance of being chosen. There are many methods of
Random Sampling, and one of the most used is by Random Numbers, for example Lotto 649
numbers.
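As a minimal sketch of random sampling with random numbers, assuming Python's standard `random` module and a hypothetical population of 200 numbered students (the population size, sample size, and seed are illustrative and do not come from the text):

```python
import random

# Hypothetical population: students numbered 1 to 200 (illustrative only).
population = list(range(1, 201))

# A fixed seed makes the draw repeatable; remove it for a fresh sample each run.
rng = random.Random(649)

# sample() draws without replacement, so every member has the same chance
# of being chosen and no member is chosen twice.
sample = rng.sample(population, k=20)

print(sorted(sample))
```

Drawing without replacement mirrors a lottery draw: once a number is picked it cannot be picked again.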
Questionnaire: http://www.cimt.plymouth.ac.uk/projects/mepres/book8/bk8_20.pdf
Check the above site for criteria a good questionnaire meets.
number of times the score is in the data.)
Graphical Illustration: Some of the illustrations are: 1) Line Graph; 2) Pie Charts; 3) Bar Charts;
4) Histogram; 5) Cumulative Frequency Diagram; etc.
Summary Statistics: These are values derived from the data to give a short description of the data.
These are the Measures of Central Tendency, and the Measures of Dispersion. Besides describing
the data, these measures are convenient for the comparison of two sets of data.
Measures of Central Tendency: These are the Mode, Median, and the Mean. They are also known
as the averages. The Mode is the score with the highest frequency. The Median is the ‘score’ with
the same number of scores greater than it as are less than it. The mean is the sum of all the scores in
the data, divided by the number of scores in the data.
We shall illustrate all these by examples using the following set of Data:
Test 1:
56 40 7 70 31 17 56 71 70 36 71 71
63 56 46 91 46 73 60 97 67 53 86 64
44 92 46 75 77 93 70 97 53 71 79 57
Test 2:
65 60 30 70 63 65 45 56 70 58 48 37
92 40 86 62 60 40 53 47 45 60 31 91
31 35 61 61 72 80 50 60 73 28 58 38
Median: To find the median, the scores are arranged in order of magnitude, and the score in the
middle position is the median. There are 36 scores or numbers in the data, so the middle position
lies between the 18th and the 19th positions. From the stem and leaf diagram, counting from the least
to the greatest score, the score in the 18th position is 64, and the score in the 19th position is 67. The
median is the sum of these two numbers divided by 2. The Median is 65.5.
For data with N scores, the Median Position = $\frac{N+1}{2}$. Arrange the data in order of magnitude.
Then the Median is the score in this position. It is found by counting. If the position falls between
two scores, as above, then the mean of the two scores is the median.
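The position rule above can be sketched in Python. This is an illustrative helper, not part of the original notes; the function name `median_by_position` is an assumption, and the data is Test 1 as listed earlier.

```python
test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

def median_by_position(scores):
    """Median found by counting to position (N + 1) / 2 in the ordered data."""
    ordered = sorted(scores)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position, may end in .5
    if position.is_integer():
        return ordered[int(position) - 1]
    # Position falls between two scores: take the mean of the two.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median_by_position(test1))  # 65.5
```

With 36 scores the position is 18.5, so the function averages the 18th and 19th ordered scores, 64 and 67.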
Arithmetic Mean:
This is the sum of all the scores divided by the number of scores. There are 36 scores, and the sum
can be found from the original data or from the stem and leaf diagram.
Sum of the scores = 2252. Then the Arithmetic Mean is 2252 divided by 36. Arithmetic Mean = 62.56
(2 decimal places), or 63 to the nearest whole number.
Some Properties of Arithmetic Mean:
The product of the Arithmetic Mean and the Number of scores gives the sum of the scores. This is
useful in finding the required mark to make a certain grade.
Example 1: Akua’s mean mark for her first 3 tests is 78.
i. Akua wants her mean mark for the course to be at least 80. What is the minimum mark she
must get on the 4th (and last) test for her mean to be 80?
ii. What is the highest possible mean mark Akua can get in the course?
Solution:
i. For Akua to get a mean mark of 80 for 4 tests, the sum of her marks for the 4 tests must be equal
to the product of 80 (mean of the tests) and 4 (the number of tests). This is 320. The sum of Akua’s
mark for the first three tests is the product of 78 (mean of the 3 tests) and 3 (number of tests). This
is 234. The difference between the sum of the 4 tests and the sum of the 3 tests is Akua’s mark for the 4th test.
That is the difference of 320 and 234. This is 86. That is, Akua must get 86 on the 4th test for her
mean for the course to be 80.
ii. The highest possible mark Akua can get on the 4th test is 100. The sum of the first three tests is
234. So the sum of the 4 tests cannot be more than the sum of 234 and 100. This is 334. Akua’s
maximum mean mark is 334 divided by 4. This is 83.5. So the highest possible mean mark Akua
can get on the course is 83.5.
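The arithmetic in this example can be sketched as two small functions. The function names and the maximum mark of 100 as a default parameter are illustrative assumptions; the numbers are those of Akua's example.

```python
def required_mark(current_mean, tests_taken, target_mean):
    """Mark needed on the next test so the overall mean equals target_mean."""
    total_needed = target_mean * (tests_taken + 1)  # mean x number of tests
    total_so_far = current_mean * tests_taken
    return total_needed - total_so_far

def highest_possible_mean(current_mean, tests_taken, max_mark=100):
    """Overall mean if the next test gets the maximum mark."""
    return (current_mean * tests_taken + max_mark) / (tests_taken + 1)

print(required_mark(78, 3, 80))      # 86
print(highest_possible_mean(78, 3))  # 83.5
```

Both functions use the same property: the sum of the scores is the product of the mean and the number of scores.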
Note: The sum of the frequencies in a frequency table is equal to the number of scores. The
frequency table lends itself to many uses in finding statistical measures: the measures of central
tendency and the measures of dispersion. For example, to find the Arithmetic Mean from a frequency
table: (i) multiply each score by its corresponding frequency, (ii) find the sum of the products in (i),
(iii) divide the sum in (ii) by the sum of the frequencies.
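Steps (i) to (iii) can be sketched in Python as follows, assuming the frequency table is built with `collections.Counter` from the Test 1 scores listed earlier. The variable names are illustrative.

```python
from collections import Counter

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

# Ungrouped frequency table: score -> frequency.
freq = Counter(test1)

# (i) multiply each score by its frequency
products = {x: x * f for x, f in freq.items()}
# (ii) sum the products
sum_xf = sum(products.values())
# (iii) divide by the sum of the frequencies
sum_f = sum(freq.values())
mean = sum_xf / sum_f

print(sum_f, sum_xf, round(mean, 2))  # 36 2252 62.56
```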
Exercise: From the frequency table for Test 1, find the Mode, Median, Arithmetic Mean, and
Range. Compare your answers to the answers obtained by using the Stem and Leaf diagram.
Notation : x (or y, or z) represents a score or number in a data, and f the frequency of a score.
N is the number of scores, and is equal to the sum of frequencies of the scores.
Symbol : ∑ (sigma) is the symbol for summation or addition. Example, ∑x means sum all (of)
the x (or scores) . ∑ f means sum all (of) f (or the frequencies).
Note: $\sum xf$ is the sum of all numbers under the column 'xf'; similarly, $\sum f$ is the sum of all the
numbers under the column 'f'.
Exercise: (i) Organize Test 2 in an Ungroup Frequency Table. (ii) Find the Arithmetic Mean of Test
2 (using the above procedure).
Measures of Dispersion: Variance and Standard Deviation
Notation: $\sigma^2$ (or $s^2$) represents Variance. $\sigma$ (or $s$) represents Standard Deviation.
Note: The square of the Standard Deviation is the Variance; or Standard Deviation = $\sqrt{\text{Variance}}$.

Variance: the sum of the squares of the difference of each score and the mean, divided by the number of scores.
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$$
It simplifies to:
$$\sigma^2 = \frac{\sum x^2}{N} - (\bar{x})^2$$

Standard Deviation: the square root of the sum of the squares of the difference of each score and the mean,
divided by the number of scores.
$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$
It simplifies to:
$$\sigma = \sqrt{\frac{\sum x^2}{N} - (\bar{x})^2}$$

For a frequency table:
$$\sigma^2 = \frac{\sum (x - \bar{x})^2 f}{\sum f} \qquad \sigma = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f}}$$
It simplifies to:
$$\sigma^2 = \frac{\sum x^2 f}{\sum f} - (\bar{x})^2 \qquad \sigma = \sqrt{\frac{\sum x^2 f}{\sum f} - (\bar{x})^2}$$
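As a quick check that the defining formula and its simplified form agree, the sketch below evaluates both on the Test 1 scores. It divides by N, as in the formulas above; the variable names are illustrative.

```python
import math

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

n = len(test1)
mean = sum(test1) / n

# Defining formula: mean of the squared differences from the mean.
var_definition = sum((x - mean) ** 2 for x in test1) / n

# Simplified formula: mean of the squares minus the square of the mean.
var_simplified = sum(x * x for x in test1) / n - mean ** 2

sd = math.sqrt(var_definition)

print(round(var_definition, 2), round(var_simplified, 2), round(sd, 2))
```

The two variances differ only by floating-point rounding, which is why the simplified form is usually preferred for hand calculation.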
Comments: The Standard Deviation and the Variance, as the formulas show, give a measure of the
spread of the scores of the data about the Arithmetic Mean (a measure of central tendency).
Unlike the Range, which depends only on the two extreme scores, the lowest and highest, the
Standard Deviation and the Variance depend on all the scores of the data. They are the most
widely used measures of dispersion, especially in Inferential Statistics.
Data with a large number of scores are most often given in a frequency table, or first organized in a
frequency table before any further analysis. So as an example, the Variance and Standard Deviation
will be calculated for Test 1 from the frequency table of Test 1.
Example: Find the Variance and Standard Deviation for Test 1 (in the Ungroup Frequency Table).
Solution:
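$\sigma^2 = \frac{\sum x^2 f}{\sum f} - (\bar{x})^2$ is the Variance and $\sigma = \sqrt{\frac{\sum x^2 f}{\sum f} - (\bar{x})^2}$ is the Standard Deviation, where $\bar{x} = \frac{\sum xf}{\sum f}$.
So for each score 'x', and corresponding frequency 'f', the following must be found: $xf$ and $x^2 f$.
Notation: x is a score; f is its frequency; xf is the product of a score and its frequency; x²f is the
product of the square of a score and its frequency.
x     f    xf     x²f
…
63    1    63     3969
…
93    1    93     8649
97    2    194    18818
From the table: $\sum f = 36$; $\sum xf = 2252$; $\sum x^2 f = 156\,504$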
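The columns of the table and the resulting Variance and Standard Deviation can be sketched in Python. This is an illustrative reconstruction from the Test 1 scores listed earlier, not the original table; it uses the frequency-table formulas with division by the sum of the frequencies.

```python
from collections import Counter
import math

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

freq = Counter(test1)

# One row per distinct score: (x, f, xf, x^2 f)
rows = [(x, f, x * f, x * x * f) for x, f in sorted(freq.items())]

sum_f = sum(f for _, f, _, _ in rows)
sum_xf = sum(xf for _, _, xf, _ in rows)
sum_x2f = sum(x2f for _, _, _, x2f in rows)

mean = sum_xf / sum_f
variance = sum_x2f / sum_f - mean ** 2
sd = math.sqrt(variance)

print(sum_f, sum_xf, sum_x2f)               # 36 2252 156504
print(round(variance, 2), round(sd, 2))     # 434.14 20.84
```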
Comment: The standard deviation acts as a unit of the scale of measurement of the scores in the
sense of the number of standard deviations of a score from the arithmetic mean.
Example: For Test 1, find the percentage of the number of scores that are within:
i. One standard deviation of the mean.
ii. Two standard deviations of the mean.
Also find:
iii. How many standard deviations from the mean is 95?
iv. How many standard deviations from the mean is 7?
Solution :
The mean $\bar{x} = 62.56$. The Variance is $\sigma^2 = \frac{156504}{36} - \left(\frac{2252}{36}\right)^2 = 434.14$, so the Standard Deviation $\sigma = \sqrt{434.14} = 20.84$.
i. A number within one standard deviation of the mean is greater than or equal to $\bar{x} - \sigma$
and less than or equal to $\bar{x} + \sigma$. So: $\bar{x} - \sigma \le$ a score within one $\sigma$ of $\bar{x}$ $\le \bar{x} + \sigma$.
Substituting, $\bar{x} - \sigma = 62.56 - 20.84 = 41.72$ and $\bar{x} + \sigma = 62.56 + 20.84 = 83.40$.
From the Frequency Table or Stem and Leaf, the number of scores greater than or equal to 41.72
and less than or equal to 83.40 is 25. This is the number of scores from 44 to 79. The number
of all the scores is 36. Therefore the percentage of the number of the scores that lie within one
standard deviation of the mean $= \frac{25}{36} \times 100\% = 69.44\%$
iii. For any number 'x': $z = \frac{x - \bar{x}}{\sigma}$ is the number of standard deviations of 'x' from the mean.
So for x = 95: $z = \frac{95 - 62.56}{20.84} = 1.56$. ∴ 95 is 1.56 standard deviations from the mean.
iv. From (iii), for x = 7: $z = \frac{7 - 62.56}{20.84} = -2.67$. ∴ the number of standard deviations 7
is from the mean is -2.67.
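The counting and the z calculation can be sketched in Python. The helper names `z_score` and `percent_within` are illustrative assumptions; the mean and standard deviation are computed directly from the Test 1 scores (dividing by N) rather than from rounded values.

```python
import math

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

n = len(test1)
mean = sum(test1) / n
sd = math.sqrt(sum(x * x for x in test1) / n - mean ** 2)

def z_score(x):
    """Number of standard deviations x lies from the mean (negative if below)."""
    return (x - mean) / sd

def percent_within(k):
    """Percentage of scores within k standard deviations of the mean."""
    low, high = mean - k * sd, mean + k * sd
    count = sum(1 for x in test1 if low <= x <= high)
    return 100 * count / n

print(round(percent_within(1), 2))  # 69.44
print(round(percent_within(2), 2))
print(round(z_score(95), 2), round(z_score(7), 2))
```

A negative z means the score lies below the mean.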
Exercise: For Test 2, find the percentage of the number of scores that are within:
i. One standard deviation of the mean. ii. Two standard deviations of the mean. Also find: iii. How
many standard deviations from the mean is 37? iv. How many standard deviations from the mean is 91?
Group Frequency Table: A table of groups of scores and the sum of the frequencies of the
individual scores in each group. A group of scores is called a class. Each score belongs to a class,
and can belong to only one class. So classes do not overlap. The other aspects of a class are: (i)
Class Limits (Lower and Upper), (ii) Class Size, (iii) Class Boundary (Lower and Upper), (iv)
Class Interval, and (v) Class Mark. These will be discussed at the appropriate points. Whilst
there is only one Ungroup Frequency Table for a given data, there is more than one Group
Frequency Table for the same data. The distinguishing features are the Class Size, which is the
number of scores in a class, and the Lowest or Greatest Class Limit.
Group Frequency Table 1 for Test 1
Scores = Test Marks of Students; Frequency = Number of Students
Scores      Frequency      Scores      Frequency
7 - 11      1              57 - 61     2
12 - 16     0              62 - 66     2
17 - 21     1              67 - 71     8
22 - 26     0              72 - 76     2
27 - 31     1              77 - 81     2
32 - 36     1              82 - 86     1
37 - 41     1              87 - 91     1
42 - 46     4              92 - 96     2
47 - 51     0              97 - 101    2
52 - 56     5
Comment: Each class has the same size, 5 (different scores). The lower limit of the fourth class is
22, and the upper limit of the first class is 11. In general the class sizes need not be equal.
Group Frequency Table 2 for Test 1
Scores Frequency Scores Frequency
4 - 13 1 54 - 63 6
14 - 23 1 64 - 73 10
24 - 33 1 74 - 83 3
34 - 43 2 84 - 93 4
44 - 53 6 94 - 103 2
Comment: Each class size is 10. The lowest limit is a score of 4 and the greatest limit 103. None of
these is a score of the data.
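As a sketch of how such tables can be built, the function below counts Test 1 scores into equal-size classes given the lowest class limit and the class size. The function name and parameters are illustrative assumptions, and it assumes whole-number scores.

```python
from collections import Counter

test1 = [
    56, 40, 7, 70, 31, 17, 56, 71, 70, 36, 71, 71,
    63, 56, 46, 91, 46, 73, 60, 97, 67, 53, 86, 64,
    44, 92, 46, 75, 77, 93, 70, 97, 53, 71, 79, 57,
]

def group_frequencies(scores, lowest_limit, class_size, n_classes):
    """Return [((lower limit, upper limit), frequency), ...] for equal-size classes."""
    # Index of the class each whole-number score falls in.
    counts = Counter((x - lowest_limit) // class_size for x in scores)
    table = []
    for i in range(n_classes):
        lower = lowest_limit + i * class_size
        upper = lower + class_size - 1
        table.append(((lower, upper), counts.get(i, 0)))
    return table

# Group Frequency Table 2: class size 10, lowest limit 4.
table2 = group_frequencies(test1, lowest_limit=4, class_size=10, n_classes=10)
for (lower, upper), f in table2:
    print(f"{lower} - {upper}: {f}")
```

Calling it with `lowest_limit=7, class_size=5, n_classes=19` gives the classes of Group Frequency Table 1.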
Large data is often given in a Group Frequency Table. This summarizes the data at the expense of
details. The larger the class size the shorter the summary and the more detail that is lost. It is
therefore necessary to balance brevity of summary against too much detail. This is comparable to
the assignment of grades to course marks. By the rule of thumb or by convention, the number of
classes must not be less than 5, and it must not be more than 25.
Mean, Variance, and Standard Deviation from Group Frequency Table: Each class is represented
by a Class Mark which then is given the frequency of the class. This ‘reduces’ the Group Frequency
Table to an Ungroup Frequency Table with the Class Marks as the scores, with frequencies of the
corresponding Classes.
Class Mark, x of a Class: Is the mean of the Lower and Upper Class Limits of the class. That is,
(Lower Class Limit + Upper Class Limit) ÷2.
Example: Find the Mean, Variance and Standard Deviation for Test 1 Group Frequency Table 2.
Solution: The following table is in reference to the formulas to be used;
Test Scores    # of students: f    Class Mark: x    xf       x²f = x(xf)
4 - 13         1                   8.5              8.5      72.25
14 - 23        1                   18.5             18.5     342.25
24 - 33        1                   28.5             28.5     812.25
34 - 43        2                   38.5             77       2964.5
44 - 53        6                   48.5             291      14113.5
54 - 63        6                   58.5             351      20533.5
64 - 73        10                  68.5             685      46922.5
74 - 83        3                   78.5             235.5    18486.75
84 - 93        4                   88.5             354      31329
94 - 103       2                   98.5             197      19404.5
Sum Σ          36                                   2246     154981
Mean $\bar{x} = \frac{2246}{36} = 62.388\ldots$ ∴ Mean $\bar{x} = 62$ (nearest whole number)
Variance $\sigma^2 = \frac{154981}{36} - \left(\frac{2246}{36}\right)^2 = 412.654321$ ∴ Variance $\sigma^2 = 413$ (nearest whole number)
and Standard Deviation $\sigma = 20.31$ (2 decimal places) and $\sigma = 20$ (nearest whole number)
Comment: Compare these values to the corresponding values for the Ungroup Frequency Table.
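The grouped calculation can be sketched in Python, using the classes and frequencies of Group Frequency Table 2 and the Class Mark as the representative score of each class. The variable names are illustrative.

```python
import math

# Group Frequency Table 2 for Test 1: ((lower limit, upper limit), frequency)
classes = [
    ((4, 13), 1), ((14, 23), 1), ((24, 33), 1), ((34, 43), 2), ((44, 53), 6),
    ((54, 63), 6), ((64, 73), 10), ((74, 83), 3), ((84, 93), 4), ((94, 103), 2),
]

rows = []
for (lower, upper), f in classes:
    mark = (lower + upper) / 2          # Class Mark
    rows.append((mark, f, mark * f, mark * mark * f))

sum_f = sum(f for _, f, _, _ in rows)
sum_xf = sum(xf for _, _, xf, _ in rows)
sum_x2f = sum(x2f for _, _, _, x2f in rows)

mean = sum_xf / sum_f
variance = sum_x2f / sum_f - mean ** 2
sd = math.sqrt(variance)

print(sum_f, sum_xf, sum_x2f)                            # 36 2246.0 154981.0
print(round(mean, 2), round(variance, 2), round(sd, 2))  # 62.39 412.65 20.31
```

Because each class is replaced by a single Class Mark, the results differ slightly from those computed from the individual scores.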
Frequency Table for Category Data: A table of the ‘non-numerical’ attributes of the Category Data with
their corresponding frequencies.
Example: The Frequency Table of the following Category Data of Colour of Cars in a Car Park;
Blue, White, Blue, Blue, Black, Blue, Black, Silver, Silver, Blue, Silver, Green, Black, White,
Silver, Silver, Black, Blue, Green, Blue, Green, Red, Black, Red.
Frequency Table of Colour of Cars in Car Park
Score Frequency
Colour of Car Number of Cars
Blue 7
White 2
Black 5
Silver 5
Green 3
Red 2
Mode is Blue. That is there are more Blue cars than any other Colour of cars.
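The same tally can be sketched in Python with `collections.Counter`, using the list of car colours from the example. The variable names are illustrative.

```python
from collections import Counter

colours = [
    "Blue", "White", "Blue", "Blue", "Black", "Blue", "Black", "Silver",
    "Silver", "Blue", "Silver", "Green", "Black", "White", "Silver", "Silver",
    "Black", "Blue", "Green", "Blue", "Green", "Red", "Black", "Red",
]

# Frequency table: colour -> number of cars.
freq = Counter(colours)

# The mode is the category with the highest frequency.
mode, mode_count = freq.most_common(1)[0]

for colour, count in freq.items():
    print(colour, count)
print("Mode:", mode)  # Mode: Blue
```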
Comment:
Numerical Data were organized by (i) Stem and Leaf, and (ii) Frequency Table, both ungroup,
that is single score and frequency, and group, that is class of scores and frequency. From these we
calculated (i) the Measures of Central Tendency or the Averages: Mode, Median, and Mean, and
(ii) some of the Measures of Dispersion: Range, Variance, and (from the Variance) the Standard Deviation.
Category Data were organized in a Frequency Table of category attribute and frequency, from which we
calculated the Mode, a Measure of Central Tendency or Average. The Mode is the only measure of
central tendency that makes sense for category data. There is no measure of dispersion, because
none makes sense for category data.
Pie Graph or Chart:
The pie chart is a circle divided into sectors, to represent the proportion of the frequency of a
‘Score’, ‘Class’ or ‘Category’ to the number of scores (sum of frequencies of the scores).
Example:
Colour of Cars    Number of Cars
Blue              7
White             2
Black             5
Silver            5
Green             3
Red               2
[Figure: Pie Graph for the Colour of Cars in a Car Park]
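As a sketch of how the sectors could be sized, each category's share of the circle is its frequency divided by the total, multiplied by 360 degrees. The use of degrees and the variable names are illustrative assumptions; the frequencies are those of the car colour table.

```python
freq = {"Blue": 7, "White": 2, "Black": 5, "Silver": 5, "Green": 3, "Red": 2}

total = sum(freq.values())  # 24 cars

# Sector angle in degrees is proportional to the frequency.
angles = {colour: 360 * count / total for colour, count in freq.items()}

for colour, angle in angles.items():
    print(f"{colour}: {angle} degrees")
```

The angles add up to 360 degrees, so the sectors fill the whole circle.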
The line graph has the score as the independent variable and the frequency as the value of a
function. The bar graph has ‘category’ on the horizontal axis as the base of a rectangle (same
width) and the frequency as the height. If it is group numerical data, the class defined by the lower
and upper class limit is the ‘category’ and then, the base of the rectangle is proportional to the class
size and the height is the frequency.
Histogram:
The Histogram is the graph formed by rectangles representing the classes of a group frequency table
of numerical data. The area of the rectangle for a class on a histogram is equal to the frequency of
the class. The lower and upper class boundary is the base, and the height of the rectangle is the
frequency of the class divided by the class interval (width). For equal class intervals the height of a
rectangle is ‘frequency’ of the class. There are no gaps between the rectangles of a histogram.
(There can be gaps between the rectangles of a Bar graph.)
One use of the histogram is to find the ratio or fraction of the scores between numbers of standard
deviations from the mean, (for example, one standard deviation from the mean) and the total
number of scores. This is the ratio or fraction of the area of the rectangles in the region (of interest)
to the total area of the histogram. These ratios interpreted as the probability of a score in the region
are used in statistical decision-making.
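The area rule can be sketched in Python for Group Frequency Table 2. The class boundaries are taken as half a mark below the lower limit and half a mark above the upper limit; that convention is an assumption here, since the notes do not define the boundaries at this point. The function names are illustrative.

```python
import math

# Group Frequency Table 2 for Test 1: ((lower limit, upper limit), frequency)
classes = [
    ((4, 13), 1), ((14, 23), 1), ((24, 33), 1), ((34, 43), 2), ((44, 53), 6),
    ((54, 63), 6), ((64, 73), 10), ((74, 83), 3), ((84, 93), 4), ((94, 103), 2),
]

def histogram_bars(classes, half_gap=0.5):
    """Return (lower boundary, upper boundary, height) for each class."""
    bars = []
    for (lower, upper), f in classes:
        lb, ub = lower - half_gap, upper + half_gap  # assumed boundaries
        height = f / (ub - lb)                       # frequency / class interval
        bars.append((lb, ub, height))
    return bars

def area_between(bars, a, b):
    """Area of the histogram lying between the values a and b."""
    area = 0.0
    for lb, ub, height in bars:
        overlap = max(0.0, min(ub, b) - max(lb, a))
        area += overlap * height
    return area

bars = histogram_bars(classes)
total_area = math.fsum((ub - lb) * h for lb, ub, h in bars)

# Fraction of the scores in the classes 44 - 53 to 74 - 83.
fraction = area_between(bars, 43.5, 83.5) / total_area
print(round(total_area, 2), round(fraction, 4))
```

Since each rectangle's area equals its class frequency, the total area equals the number of scores, and a ratio of areas is a ratio of counts.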
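[Figure: Histogram for Test 1 using Group Frequency Table 2. Classes 4 - 13, 14 - 23, 24 - 33, 34 - 43, 44 - 53, 54 - 63, 64 - 73, 74 - 83, 84 - 93, 94 - 103 with frequencies 1, 1, 1, 2, 6, 6, 10, 3, 4, 2.]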
http://www.cimt.plymouth.ac.uk/projects/mepres/book9/bk9_8.pdf
http://www.cimt.plymouth.ac.uk/projects/mepres/book8/bk8_5.pdf
http://www.cimt.plymouth.ac.uk/projects/mepres/allgcse/pbtxt.pdf