MAT 120D Introduction to Statistics
Descriptive Statistics (Organizing Data)
Complementary reading (Chapter 2 Weiss)
AbouEl-Makarim Aboueissa, Ph.D.
Frequency Distribution
Data collected in their original form are called raw data. Raw data by themselves convey little useful information and are hard to interpret, so they need to be organized in a way that is easy to understand. A good way to do this is to construct a frequency distribution, which shows how the frequencies are distributed over the various categories.
Copyright 2011
Example 1:
A teacher gave a test in basic science to his Form 1 pupils. Their marks were:
3 3 4 5
8 5 5 8
6 4 0 5
5 4 7 7
6 3 6
4 6 5
7 7 6
6 8 7
5 1 1
3 10 7
5 7 5
6 6 4
It should be noted that the minimum mark is 0 and the maximum mark is 10. Construct a frequency distribution for these data.
Answer:
Mark (x)    Frequency (f)
 0           1
 1           2
 2           0
 3           4
 4           5
 5           9
 6           8
 7           7
 8           3
 9           0
10           1
Total       $\sum f = 40$
You can see that it is easier to gather the following types of information from the frequency table than the raw data.
(a) The highest mark is 10 and the lowest mark is 0.
(b) 4 pupils scored more than 7.
(c) 12 pupils scored less than 5.
(d) 8 pupils scored 6.
(e) Nobody scored 9.
(f) 40 pupils did the test.
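The tallying in Example 1 can also be checked by machine. The following is a minimal sketch, not part of the original notes; the variable names (`marks`, `frequency`, `more_than_7`, `less_than_5`) are chosen here only for illustration.

```python
from collections import Counter

# Marks of the 40 pupils from Example 1, in the order listed.
marks = [3, 3, 4, 5, 8, 5, 5, 8, 6, 4, 0, 5, 5, 4, 7, 7,
         6, 3, 6, 4, 6, 5, 7, 7, 6, 6, 8, 7, 5, 1, 1, 3,
         10, 7, 5, 7, 5, 6, 6, 4]

counts = Counter(marks)

# Include marks that nobody scored (frequency 0) so the table covers 0..10.
frequency = {x: counts.get(x, 0) for x in range(0, 11)}

for x, f in frequency.items():
    print(f"{x:>4} {f:>4}")
print("sum of f =", sum(frequency.values()))

# Facts (b) and (c) read off the frequency table.
more_than_7 = sum(f for x, f in frequency.items() if x > 7)
less_than_5 = sum(f for x, f in frequency.items() if x < 5)
```

Working from the frequency table rather than the raw list is what makes statements such as (b) and (c) one-line sums.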
Notes:
(1) The distribution of the variable (i.e. the characteristic under study, e.g. marks in the above example) along with its frequencies is known as a frequency distribution.
(2) The frequency is the number of times a score (e.g. a mark in the example) is repeated.

The relative frequency of a class is the class frequency divided by the total number of data values, i.e., the overall sample size:

$$\text{relative frequency} = \frac{f}{\sum f}$$

For example, for the first class,

$$\frac{f}{\sum f} = \frac{1}{40} = 0.025 .$$
If each relative frequency is multiplied by 100%, we have a percentage distribution.
Example 2:
Construct a relative frequency distribution for the data given in Example 1.
Answer:
The corresponding relative frequency distribution of these data is given below:
Mark (x)    Relative frequency (f/n)
 0           1/40 = 0.025
 1           2/40 = 0.050
 2           0/40 = 0.000
 3           4/40 = 0.100
 4           5/40 = 0.125
 5           9/40 = 0.225
 6           8/40 = 0.200
 7           7/40 = 0.175
 8           3/40 = 0.075
 9           0/40 = 0.000
10           1/40 = 0.025
$n = \sum f = 40$

We can see from this table that the relative frequency of mark 4 is 0.125. It means that 12.5% of the pupils got 4 marks.
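As a quick check of Example 2, the relative frequencies and percentages can be computed directly from the frequencies of Example 1. This is a small illustrative sketch; the dictionary names are assumptions made here, not notation from the notes.

```python
# Frequencies from Example 1: mark -> number of pupils.
frequency = {0: 1, 1: 2, 2: 0, 3: 4, 4: 5, 5: 9, 6: 8, 7: 7, 8: 3, 9: 0, 10: 1}

n = sum(frequency.values())  # overall sample size

# Relative frequency = f / n; percentage = relative frequency * 100.
relative = {x: f / n for x, f in frequency.items()}
percent = {x: 100 * r for x, r in relative.items()}

for x in frequency:
    print(f"{x:>4}  {relative[x]:.3f}  {percent[x]:5.1f}%")
```

The relative frequencies always add up to 1 (and the percentages to 100), which is a convenient check on the arithmetic.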
Example 3:
Twenty-five army personnel were given a blood test to determine their blood type. The data set is:
A B B A O
AB O O O B AB AB
AB B B B O A O B A
A O O O
Construct a frequency distribution and relative frequency distribution for these data.
Blood type (x)   Frequency (f)   Relative frequency (f/n)   Percent ((f/n) x 100)
A                5               5/25 = 0.20                20
B                7               7/25 = 0.28                28
O                9               9/25 = 0.36                36
AB               4               4/25 = 0.16                16
Total            $n = \sum f = 25$   1.00                   100
You can see that it is easier to gather the following types of information from the frequency table than the raw data.
(a) More people have type O blood than any other type. (b) Very few have type AB.
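The same counting works for categorical data such as blood types. A minimal sketch follows, assuming the 25 observations are read row by row as listed in Example 3; the variable names are illustrative only.

```python
from collections import Counter

# Blood types of the 25 army personnel from Example 3, in the order listed.
blood = "A B B A O AB O O O B AB AB AB B B B O A O B A A O O O".split()

counts = Counter(blood)
n = len(blood)

# Frequency, relative frequency and percent for each blood type.
table = {t: (counts[t], counts[t] / n, 100 * counts[t] / n)
         for t in ["A", "B", "O", "AB"]}

for blood_type, (f, rel, pct) in table.items():
    print(f"{blood_type:<3} {f:>2}  {rel:.2f}  {pct:.0f}")
```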
We divide an interval containing all the data into a small number of segments, usually of equal width. These segments are called classes (or class intervals).
Example 4:
The weights (in kg) of 50 pieces of luggage are presented in a grouped frequency distribution with the class intervals as follows:

Weight (kg)   Frequency (f)
 7-9           2
10-12          8
13-15         14
16-18         19
19-21          7
Total         $\sum f = 50$

(a) The intervals of weights, i.e. 7-9, 10-12, ..., 19-21, are known as class intervals.
(b) 7, 10, 13, 16, 19 are called the lower limits of the respective classes.
(c) 9, 12, 15, 18, 21 are called the upper limits of the respective classes.
(d) 6.5-9.5, 9.5-12.5, 12.5-15.5, 15.5-18.5 and 18.5-21.5 are known as class boundaries. These class boundaries are obtained by:

$$\text{Lower class boundary} = \text{lower class limit} - \frac{d}{2}, \qquad \text{Upper class boundary} = \text{upper class limit} + \frac{d}{2}$$

where $d$ is the gap between two consecutive classes, i.e. the lower limit of a class minus the upper limit of the class before it. For the above example,

$$d = 10 - 9 = 1 \;\Rightarrow\; \frac{d}{2} = 0.5 .$$

(e) 2, 8, 14, 19 and 7 are called class frequencies.
(f) The class width is the difference between the upper and lower class boundaries of a class interval. Thus, the class width for the class interval 13-15 is $15.5 - 12.5 = 3$.
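The boundary and width rules in (d) and (f) are mechanical, so they are easy to verify. Below is a minimal sketch for the luggage classes of Example 4; the list name `limits` and the other identifiers are assumptions introduced here.

```python
# Class limits (lower, upper) of the luggage weights in Example 4.
limits = [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]

# d = gap between consecutive classes: lower limit of the second class
# minus upper limit of the first class (10 - 9 = 1).
d = limits[1][0] - limits[0][1]

# Lower boundary = lower limit - d/2, upper boundary = upper limit + d/2.
boundaries = [(lo - d / 2, hi + d / 2) for lo, hi in limits]

# Class width = upper boundary - lower boundary.
widths = [upper - lower for lower, upper in boundaries]

for (lo, hi), (lb, ub), w in zip(limits, boundaries, widths):
    print(f"{lo}-{hi}: boundaries {lb}-{ub}, width {w}")
```

Note that the width (3) is not the difference of the class limits (9 - 7 = 2), which is why the definition uses the boundaries.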
(1) Number of Classes: The number of classes usually varies from 5 to 20. It depends on the number of observations in the data set. It is preferable to have more classes as the size of a data set increases. A common guideline is

$$c = 1 + 3.3 \log(n)$$

where $c$ is the number of classes and $n$ is the number of observations in the data set.
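To get a feel for the guideline, it can be evaluated for a few sample sizes. The sketch below assumes that $\log$ means the base-10 logarithm and that the result is rounded to the nearest whole number; neither choice is stated explicitly in the notes, and the function name is hypothetical.

```python
import math


def number_of_classes(n):
    """Evaluate c = 1 + 3.3 * log10(n) (base-10 logarithm is an assumption)."""
    return 1 + 3.3 * math.log10(n)


for n in (40, 50, 70):
    c = number_of_classes(n)
    # Rounding to the nearest integer is an assumption made for illustration.
    print(f"n = {n}: c = {c:.2f}, about {round(c)} classes")
```

For the sample sizes used in these notes (50 marks, 70 glucose readings) the guideline suggests about 7 classes.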
(2) Class Width: One can determine the width of equal-sized classes by using the following formula.
(3) Lower Limit of First Class: The smallest value in the data set, or a value less than the smallest value, can be used as the lower limit of the first class.
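One widely used form of the class-width rule in (2), assumed here for illustration, divides the range of the data by the number of classes and rounds the result up to a convenient value. The sketch below applies that assumed rule to a data set whose largest value is 109 and smallest value is 45, split into 7 classes; the function name is hypothetical.

```python
import math


def class_width(largest, smallest, num_classes):
    """Assumed rule: (largest - smallest) / number of classes, rounded up."""
    return math.ceil((largest - smallest) / num_classes)


raw_width = (109 - 45) / 7      # about 9.14 before rounding
width = class_width(109, 45, 7)
print(raw_width, width)
```

Rounding up rather than down helps ensure that the chosen number of classes covers the whole range of the data.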
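Relative Frequency:
The relative frequencies and percentages are obtained as follows:

$$\text{Relative frequency of a class} = \frac{f}{\sum f}, \qquad \text{Percentage} = (\text{relative frequency}) \times 100$$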
Example 5:
The following are the marks (out of 100) obtained by 50 students of MAT120D in the Spring Semester 2009 examination.
55 62 72 78 81
54 64 53 55 86
76 80 54 69 58
70 85 76 80 72
77 78 90 72 92
80 42 66 74 78
84 72 85 74 38
66 63 82 54 85
80 85 79 54 69
61 50 83 54 82
Construct a grouped frequency distribution. Use classes 30-39, 40-49, 50-59, etc.
Answer:
Number of classes: 7

Class    Tally              Frequency (f)
30-39    /                   1
40-49    /                   1
50-59    ///// /////        10
60-69    ///// ///           8
70-79    ///// ///// ////   14
80-89    ///// ///// ////   14
90-99    //                  2
Total                       $\sum f = 50$

Note: The lower class limit of the first class, 30, is less than the smallest data value, 38. The upper class limit of the last class, 99, is greater than the largest data value, 92.
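The grouped table of Example 5 can be reproduced by counting how many marks fall between the limits of each class. This is a minimal sketch with illustrative variable names, not the procedure prescribed in the notes.

```python
# Marks of the 50 students from Example 5, row by row.
marks = [55, 62, 72, 78, 81, 54, 64, 53, 55, 86,
         76, 80, 54, 69, 58, 70, 85, 76, 80, 72,
         77, 78, 90, 72, 92, 80, 42, 66, 74, 78,
         84, 72, 85, 74, 38, 66, 63, 82, 54, 85,
         80, 85, 79, 54, 69, 61, 50, 83, 54, 82]

# Classes 30-39, 40-49, ..., 90-99 as (lower limit, upper limit) pairs.
classes = [(lo, lo + 9) for lo in range(30, 100, 10)]

# A mark belongs to a class when lower limit <= mark <= upper limit.
frequency = {c: sum(1 for m in marks if c[0] <= m <= c[1]) for c in classes}

for (lo, hi), f in frequency.items():
    print(f"{lo}-{hi}  {f}")
print("sum of f =", sum(frequency.values()))
```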
Example 6:
The following is the distribution of the ages of new employees who joined a factory.
Frequency Distribution Table
(a) Obtain the class boundaries and class marks of the class intervals.
(b) What is the upper class limit of the class 30-39?
(c) What is the lower class boundary of the class 50-59?
(d) What is the class mark of the class 40-49?
Answer:
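(a) The class boundaries and class marks are given in the following table:

Class limits   Class boundaries   Class mark
20-29          19.5-29.5          24.5
30-39          29.5-39.5          34.5
40-49          39.5-49.5          44.5
50-59          49.5-59.5          54.5
60-69          59.5-69.5          64.5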
It should be noted that:

$$d = 30 - 29 = 1 \;\Rightarrow\; \frac{d}{2} = 0.5$$

Therefore, for example, the lower class boundary of the first class is given by:

$$20 - \frac{d}{2} = 20 - 0.5 = 19.5$$

and the upper class boundary of the first class is given by:

$$29 + \frac{d}{2} = 29 + 0.5 = 29.5$$

The lower class boundary of the second class is given by:

$$30 - \frac{d}{2} = 30 - 0.5 = 29.5$$

and the upper class boundary of the second class is given by:

$$39 + \frac{d}{2} = 39 + 0.5 = 39.5$$

The lower class boundary of the third class is given by:

$$40 - \frac{d}{2} = 40 - 0.5 = 39.5$$

and the upper class boundary of the third class is given by:

$$49 + \frac{d}{2} = 49 + 0.5 = 49.5$$

The lower class boundary of the fourth class is given by:

$$50 - \frac{d}{2} = 50 - 0.5 = 49.5$$

and the upper class boundary of the fourth class is given by:

$$59 + \frac{d}{2} = 59 + 0.5 = 59.5$$

Finally, the lower class boundary of the fifth class is given by:

$$60 - \frac{d}{2} = 60 - 0.5 = 59.5$$

and the upper class boundary of the fifth class is given by:

$$69 + \frac{d}{2} = 69 + 0.5 = 69.5$$

(b) The upper class limit of the class 30-39 is 39.
(c) The lower class boundary of the class 50-59 is 49.5.
(d) The class mark of the class 40-49 is

$$\frac{40 + 49}{2} = 44.5 .$$
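The repeated boundary calculations of Example 6 follow a single pattern, which a small helper makes explicit. This is a sketch under the definitions used in these notes; the function name `describe_classes` is hypothetical.

```python
def describe_classes(limits):
    """Return (lower limit, upper limit, lower boundary, upper boundary,
    class mark) for each class, given its (lower, upper) limits."""
    # d = lower limit of the second class minus upper limit of the first.
    d = limits[1][0] - limits[0][1]
    return [(lo, hi, lo - d / 2, hi + d / 2, (lo + hi) / 2)
            for lo, hi in limits]


# Age classes of Example 6.
rows = describe_classes([(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)])

for lo, hi, lower_b, upper_b, mark in rows:
    print(f"{lo}-{hi}: boundaries {lower_b}-{upper_b}, class mark {mark}")
```

The upper boundary of each class coincides with the lower boundary of the next one, so the classes leave no gaps on the number line.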
Example 7:
The following frequency distribution gives the lengths of 15 cucumbers.
Length (cm)   Frequency (f)
 5-10          3
10-15          4
15-20          5
20-25          2
25-30          1
Total         $\sum f = 15$

(a) What is the upper class limit of the class interval 15-20?
(b) What is the lower class boundary of the class interval 15-20?
(c) What is the class width of the class interval 15-20?
(d) What is the class mark of the class interval 15-20?
Answer:
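The four quantities asked for can be worked out from the definitions given earlier. The sketch below is one way to do so, assuming the same rules as in the previous examples: here the upper limit of each class equals the lower limit of the next, so $d = 0$ and the class boundaries coincide with the class limits. Variable names are illustrative.

```python
# Class limits of the cucumber lengths in Example 7.
limits = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30)]

# Consecutive classes share their end points, so the gap d is 10 - 10 = 0.
d = limits[1][0] - limits[0][1]

lower, upper = limits[2]                       # the class interval 15-20

upper_limit = upper                            # (a)
lower_boundary = lower - d / 2                 # (b)
width = (upper + d / 2) - (lower - d / 2)      # (c)
class_mark = (lower + upper) / 2               # (d)

print(upper_limit, lower_boundary, width, class_mark)
```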
Example 8:
The following data represent glucose blood levels (mg/100 ml) after a 12-hour fast for a random sample of 70 women (Reference: American Journal of Clinical Nutrition, Vol. 19, pp. 345-351). These data are also available, with other software, on the statSpace CD-ROM. The data are:
45 85 81 93 65 101
66 77 76 85 89 71
83 82 96 83 70 109
71 90 83 80 80 73
76 87 67 78 84 73
64 72 94 80 77 80
59 79 101 85 65 72
59 69 94 83 46 81
76 83 89 84 80 63
82 71 94 74 70 74
80 87 73 81 75
81 69 99 70 45
(a) Find the class width.
(b) Make a frequency table showing class limits, class boundaries, midpoints, frequencies, relative frequencies, and cumulative frequencies.

(a) For these data we have:
largest data value = 109
smallest data value = 45
Number of classes: 7 (the classes 45-54, 55-64, ..., 105-114), each with class width 10.
Class limits   Frequency (f)
 45-54          3
 55-64          4
 65-74         19
 75-84         27
 85-94         12
 95-104         4
105-114         1
Total          $n = \sum f = 70$
The class boundaries are the halfway points between (i.e. the average of) the upper class limit of one class and the lower class limit of the next class. The lower class boundary of the first class is the lower class limit minus one-half unit. The upper class boundary of the last class is the upper class limit plus one-half unit. For the first class, the class boundaries are

$$45 - \frac{1}{2} = 44.5 \quad \text{and} \quad \frac{54 + 55}{2} = 54.5 .$$

For the last class, the class boundaries are

$$\frac{104 + 105}{2} = 104.5 \quad \text{and} \quad 114 + \frac{1}{2} = 114.5 .$$

OR:

$$d = 55 - 54 = 1 \;\Rightarrow\; \frac{d}{2} = \frac{1}{2} = 0.5$$

Therefore, the lower class boundary of the first class is given by:

$$\text{lower class boundary of the first class} = \text{lower class limit of the first class} - \frac{d}{2} = 45 - 0.5 = 44.5$$

and the upper class boundary of the first class is given by:

$$\text{upper class boundary of the first class} = \text{upper class limit of the first class} + \frac{d}{2} = 54 + 0.5 = 54.5$$

And so on.
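The "halfway point" description of class boundaries can be turned into a short calculation. The following sketch, with illustrative names, averages each upper limit with the next lower limit and treats the two outer boundaries with the half-unit rule described above.

```python
# Class limits of the glucose data: 45-54, 55-64, ..., 105-114.
limits = [(lo, lo + 9) for lo in range(45, 115, 10)]

# Inner boundaries: average of an upper limit and the next lower limit.
inner = [(upper + next_lower) / 2
         for (_, upper), (next_lower, _) in zip(limits, limits[1:])]

# Outer boundaries: first lower limit minus 0.5, last upper limit plus 0.5.
boundaries = [limits[0][0] - 0.5] + inner + [limits[-1][1] + 0.5]

print(boundaries)
```

Both approaches (averaging adjacent limits, or subtracting and adding $d/2$) give the same boundaries whenever the gap between classes is constant.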
The class mark or midpoint is the average of the class limits (lower and upper) for that class. For the first class, the class midpoint is

$$\frac{45 + 54}{2} = 49.5 .$$

The relative frequency of a class is the class frequency divided by the total number of data values, i.e., the overall sample size. For the first class, the relative frequency is

$$\frac{f}{\sum f} = \frac{3}{70} = 0.0429 .$$

The cumulative frequency of a class is the sum of the frequencies for all previous classes, plus the frequency of that class. For the first and second classes, the cumulative frequencies are 3 and 3 + 4 = 7, respectively.
Cumulative Relative Frequency and Cumulative Percentage: The cumulative relative frequency for any score (or class) is obtained by:

$$\text{Cumulative relative frequency} = \frac{\text{cumulative frequency}}{\sum f}$$

And the cumulative percentage for any score (or class) is obtained by:

$$\text{Cumulative percentage} = (\text{cumulative relative frequency}) \times 100$$

It gives the percentage of all the scores that fall at or below a particular score (or below the upper boundary of a particular class).
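As an illustration of these two definitions, the sketch below accumulates the seven class frequencies of the glucose data (3, 4, 19, 27, 12, 4, 1) and converts the running totals to relative frequencies and percentages. The variable names are assumptions made here.

```python
from itertools import accumulate

# Class frequencies of the glucose data, classes 45-54 through 105-114.
freqs = [3, 4, 19, 27, 12, 4, 1]
n = sum(freqs)

# Cumulative frequency: running total of the class frequencies.
cum_freq = list(accumulate(freqs))

# Cumulative relative frequency and cumulative percentage.
cum_rel = [cf / n for cf in cum_freq]
cum_pct = [100 * r for r in cum_rel]

for cf, r, p in zip(cum_freq, cum_rel, cum_pct):
    print(f"{cf:>3}  {r:.4f}  {p:5.1f}%")
```

For instance, the third row says that about 37.1% of the readings fall below 74.5, the upper boundary of the class 65-74.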
Class limits   Class boundaries   Frequency (f)   Class midpoint ($x_m$)   Relative frequency   Cumulative frequency
 45-54          44.5-54.5           3               49.5                     0.0429                3
 55-64          54.5-64.5           4               59.5                     0.0571                7
 65-74          64.5-74.5          19               69.5                     0.2714               26
 75-84          74.5-84.5          27               79.5                     0.3857               53
 85-94          84.5-94.5          12               89.5                     0.1714               65
 95-104         94.5-104.5          4               99.5                     0.0571               69
105-114        104.5-114.5          1              109.5                     0.0143               70
Total                              $n = \sum f = 70$                         1.000
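The whole table can be rebuilt from the raw glucose readings. The following is a minimal sketch, assuming the classes start at 45 with width 10 as in the table; the function name `frequency_table` and its parameters are hypothetical.

```python
from itertools import accumulate

# The 70 glucose readings of Example 8, row by row.
glucose = [45, 85, 81, 93, 65, 101, 66, 77, 76, 85, 89, 71,
           83, 82, 96, 83, 70, 109, 71, 90, 83, 80, 80, 73,
           76, 87, 67, 78, 84, 73, 64, 72, 94, 80, 77, 80,
           59, 79, 101, 85, 65, 72, 59, 69, 94, 83, 46, 81,
           76, 83, 89, 84, 80, 63, 82, 71, 94, 74, 70, 74,
           80, 87, 73, 81, 75, 81, 69, 99, 70, 45]


def frequency_table(data, first_lower, width, num_classes):
    """Rows of (limits, boundaries, f, midpoint, relative f, cumulative f)."""
    n = len(data)
    lowers = [first_lower + i * width for i in range(num_classes)]
    freqs = [sum(1 for x in data if lo <= x <= lo + width - 1) for lo in lowers]
    rows = []
    for lo, f, cf in zip(lowers, freqs, accumulate(freqs)):
        hi = lo + width - 1            # upper class limit (integer data)
        rows.append(((lo, hi), (lo - 0.5, hi + 0.5), f,
                     (lo + hi) / 2, round(f / n, 4), cf))
    return rows


table = frequency_table(glucose, first_lower=45, width=10, num_classes=7)
for row in table:
    print(row)
```

The half-unit boundaries in this sketch are appropriate only because the readings are whole numbers.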
(d)
Histograms
The histogram plots the class frequencies on the y axis against the class boundaries on the x axis. Since adjacent classes share boundary values, the bars touch each other. [Alternatively, the bars may be centered over the class marks (midpoints).] A histogram is the most commonly used graphic representation of a frequency distribution. Here the horizontal axis (x axis) represents the data and the vertical axis (y axis) represents the frequency.
[Figure: histogram of the glucose blood levels. x axis: Glucose Blood Levels, marked at the class boundaries 44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5, 114.5; y axis: Frequency.]
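Where no plotting software is at hand, a rough text version of the histogram already shows the shape. The sketch below draws the bars sideways, one `#` per observation, using the class boundaries and frequencies of the glucose data; it is only an illustration, not the figure from the notes.

```python
# Class boundaries and frequencies of the glucose data.
boundaries = [44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5, 114.5]
freqs = [3, 4, 19, 27, 12, 4, 1]

# One line per class: "lower-upper | ####... f".
lines = [f"{lo:>5.1f}-{hi:<5.1f} | {'#' * f} {f}"
         for lo, hi, f in zip(boundaries, boundaries[1:], freqs)]

print("\n".join(lines))
```

Each class uses one boundary pair, so 8 boundaries produce 7 bars.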
Shapes of Histograms
A histogram can have many shapes. The most common of these shapes are:
Symmetric Histogram
[Figure: symmetric histogram, with mean = median = mode.]
[Figure: histogram skewed to the right, with mean > median > mode.]
[Figure: histogram skewed to the left, with mean < median < mode.]
Uniform Histogram: If a histogram has the same frequency for each class, then it is said to be a uniform or rectangular histogram.
[Figure: Uniform Histogram.]
(e) Based on the histogram of these data (at the end of page 31), the data set is almost symmetric.
Frequency Polygons
Another way to represent a frequency distribution is by using a frequency polygon. The frequency polygon is especially useful in conveying the shape of the distribution. It is a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points. To construct a frequency polygon, first find the midpoint of each class. Draw a horizontal x axis and a vertical y axis. Label the midpoints on the x axis and use a suitable scale on the y axis for the frequency. Above each midpoint, place a dot at a height equal to the frequency of the class. Then connect adjacent dots with straight lines and extend the lines to the x axis at both ends. The extended lines meet the x axis at the midpoints of two hypothetical classes, one before the first class and one after the last class, each with frequency zero.
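A relative frequency polygon is drawn in the same way, using the relative frequency $f/n$ instead of the frequency $f$ on the y axis.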
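The points of a frequency polygon can be listed before any drawing is done. The sketch below builds them for the glucose data, adding the two hypothetical zero-frequency classes at the ends; the variable names are illustrative.

```python
# Class midpoints and frequencies of the glucose data.
midpoints = [49.5, 59.5, 69.5, 79.5, 89.5, 99.5, 109.5]
freqs = [3, 4, 19, 27, 12, 4, 1]

# Distance between consecutive midpoints (equals the class width).
width = midpoints[1] - midpoints[0]

# Anchor the polygon on the x axis at the midpoints of two hypothetical
# classes, one before the first class and one after the last.
points = ([(midpoints[0] - width, 0)]
          + list(zip(midpoints, freqs))
          + [(midpoints[-1] + width, 0)])

print(points)
```

Joining these points in order with straight line segments gives the polygon.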
[Figure: frequency polygon. y axis: Frequency.]
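To draw a cumulative frequency graph (ogive), label the class boundaries on the x axis and the cumulative frequencies (or cumulative percents) on the y axis. Plot points corresponding to each upper boundary and its cumulative frequency (or cumulative percent). Join the points by straight lines.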
The cumulative frequency graphs are used to locate visually how many values are below a certain upper class boundary.
To create the ogive, place a dot on the x axis at the lower class boundary of the first class and then, for each class, place a dot above the upper class boundary value at the height of the cumulative frequency for the class. Connect the dots with line segments.
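The dots described above can be listed explicitly. The following sketch computes the ogive points for the glucose data from its class boundaries and frequencies; the names used are assumptions for illustration.

```python
from itertools import accumulate

# Class boundaries and frequencies of the glucose data.
boundaries = [44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5, 114.5]
freqs = [3, 4, 19, 27, 12, 4, 1]

# Start on the x axis at the lower boundary of the first class, then pair
# each upper class boundary with the cumulative frequency of its class.
points = [(boundaries[0], 0)] + list(zip(boundaries[1:], accumulate(freqs)))

for boundary, cum_freq in points:
    print(f"{cum_freq:>2} values lie below {boundary}")
```

The last point always has height equal to the total number of observations, here 70.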
Ogive of Glucose Blood Levels
[Figure: ogive. x axis: Class Boundaries (40 to 120); y axis: Cumulative Frequency (0 to 70).]
Advantages and disadvantages of common graphs:

Histogram
Advantages: Visually strong. Can be compared to the normal curve. The vertical axis is usually a frequency count of the items falling into each category.
Disadvantages: Exact values cannot be read because the data are grouped into categories. More difficult to compare two data sets. Use only with continuous data.

Frequency Polygon
Advantages: Visually appealing.
Disadvantages: Anchors at both ends may imply zero as data points.

Scatter Plot
Disadvantages: Hard to visualize results in large data sets. A flat trend line gives inconclusive results. Data on both axes should be continuous.

Boxplot
Advantages: Shows the 5-point summary and outliers. Easily compares two or more data sets. Handles extremely large data sets.
Disadvantages: Not as visually appealing as other graphs. Exact values are not retained.

Stem and Leaf Plot: Stem and leaf plots record data values in rows, and can easily be made into a histogram. Large data sets can be accommodated by splitting stems.
Advantages: Concise representation of data. Shows range, minimum and maximum, gaps and clusters, and outliers easily. Can handle extremely large data sets.
Disadvantages: Not visually appealing. Does not easily indicate measures of centrality for large data sets.
Types of Distributions
When all of the scores in a set of data are considered together, the set is commonly called a distribution of scores or just a distribution. There are a number of specific types of distributions that deserve discussion. Here we discuss the normal distribution and two types of skewed distributions.
Normal Distributions
Perhaps the most common type of distribution in the social sciences is a Normal Distribution. This can also be called a bell-shaped distribution. In this type of distribution, most scores occur in the center of the distribution and fewer scores are present as you go further away from the mean. A normal distribution is symmetrical. This means that if you divide the distribution in the center, the area to the left (below) of the mean is a mirror image of the area to the right (above) of the mean. The normal distribution is shown in the following figure.
Skewed Distributions
Another type of distribution that can occur is a skewed distribution. In a skewed distribution, the majority of the scores are not in the center of the distribution of scores. This means that the distribution is not symmetrical. In a positively skewed distribution, the majority of the scores in the distribution are shifted to the left. Alternatively, in a negatively skewed distribution, the majority of the scores are shifted to the right side of the distribution.
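A rough numerical check for the direction of skew is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail pulls it below. The sketch below applies this rule of thumb to three small made-up data sets; the function name and the data are hypothetical, and the rule is only a heuristic, not a formal test of skewness.

```python
from statistics import mean, median


def skew_direction(scores):
    """Heuristic only: compare the mean with the median."""
    m, med = mean(scores), median(scores)
    if m > med:
        return "positively skewed"
    if m < med:
        return "negatively skewed"
    return "symmetric"


# Hypothetical data sets, invented for illustration.
right_tail = [1, 2, 2, 3, 3, 3, 4, 10]   # mean 3.5, median 3
left_tail = [0, 6, 7, 7, 8, 8, 8, 9]     # mean 6.625, median 7.5
balanced = [1, 2, 3, 4, 5]               # mean 3, median 3

for data in (right_tail, left_tail, balanced):
    print(data, "->", skew_direction(data))
```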