Objectives
- Define variable and data
- Describe types of data and measurement scales
- Define and calculate ratio, rate and proportion
- Define and calculate measures of central tendency and measures of spread
- Organize and display data
- Extract useful information
Variable: any aspect of an individual or object that is measured (e.g., BP) or recorded (e.g., age, sex) and can take different values. There may be one variable in a study or many, e.g., a study of treatment outcome of TB.
E.g. Nominal
Marital status: 1. Single 2. Married 3. Widowed 4. Divorced
E.g. Ordinal
Pain level: 1. None 2. Mild 3. Moderate 4. Severe
The numbers have LIMITED meaning: apart from their utility as labels, 4 > 3 > 2 > 1 is all we know.
E.g. Interval
- Temperature in °C on 4 consecutive days
  Day:        A   B   C   D
  Temp. (°C): 18  20  22  23
  For these data, day A (18 °C) is not only cooler than day D (23 °C), it is exactly 5 °C cooler.
- It has no true zero point: 0 °C is arbitrarily chosen and does not reflect the absence of temperature.
E.g. Ratio
- Height, age, weight, BP, etc.
  Someone who weighs 80 kg is twice as heavy as someone who weighs 40 kg.
Interval
Nominal
Ordinal
Ratio
Summary of Data
Variable
Qualitative or categorical
Quantitative measurement
Continuous (real-valued), e.g. height
Discrete (count data), e.g. number of admissions
Categorizing Data
Can facilitate data analysis Must choose:
Number of categories Category cut points
Categorizing Variables-Exercise
1. Year of birth
2. Marital status of women
3. Identification number of study participant
4. Class rank
5. Length of infants at ANC clinic
Categorizing Variables-Exercise
1. Year of birth: Quantitative/Discrete 2. Marital status: Categorical/Nominal 3. Identification number: Categorical/Nominal 4. Class rank: Categorical/Ordinal 5. Length: Quantitative/Continuous
Discrete or Continuous?
Identify whether the following data is discrete or continuous:
1. Distance from primary health center to reference lab 2. Number of times a child under 5 has experienced fever in the last month 3. Number of fatal accidents on a road over the past year 4. Weight gained or lost by a 9 month old in the past 3 months
Discrete or Continuous?
Identify whether the following data is discrete or continuous:
1. Distance from primary health center to reference lab: Continuous 2. Number of times a child under 5 has experienced fever in the last month: Discrete 3. Number of fatal accidents on a road over the past year: Discrete 4. Weight gained or lost by a 9 month old in the past 3 months: Continuous
Counts
Most basic measure of disease frequency is a simple count of affected individuals. Example: 350,000 cases of polio 350,000 cases of polio in 1988 350,000 cases of polio in 1988 in 125 countries
Example of Counts
Number of Cases of Hemorrhagic Fever by Age and Sex, Zaire, 1976
Age (years)   Male   Female   Total
<1              10       14      24
1 - 14          18       25      43
15 - 29         33       60      93
30 - 49         57       52     109
50+             23       26      49
Total          141      177     318
Ratio
- The quotient of 2 numbers
- Numerator NOT INCLUDED in the denominator
- No relationship necessary between numerator and denominator
- May be expressed as a/b or a:b
= 2 / 5 = .4 X 100 = 40
Ratio Example 1
A university has 4000 male students and 2000 female students. The ratio of male to female students is: 4000/2000 = 2/1 or 2:1 For every 2 male students there is one female student
Ratio Example 2
A foodborne epidemic occurred in an elementary school canteen. The attack rate in the first grade was 24% while the attack rate in the second grade was 16%. Compare these two attack rates. 24/16 = 3/2 or 3:2. The attack rate in the first grade was 1.5 times the attack rate in the second grade.
Ratio Example 3
A city of 4 million people has 400 clinics. Calculate the ratio of clinics per person.
400 / 4,000,000 = 0.0001 clinics per person
Multiply by 10^4:
Ratio = 0.0001 x 10^4 = 1 clinic / 10,000 persons
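The ratio examples above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `ratio_per` and its `multiplier` parameter are illustrative and not part of the slides.

```python
def ratio_per(numerator, denominator, multiplier=1):
    """Return numerator / denominator, scaled by a multiplier such as 10**4.

    Multiplying before dividing keeps whole-number results exact.
    """
    return numerator * multiplier / denominator

# Ratio Example 1: male to female students (2:1)
male_to_female = ratio_per(4000, 2000)

# Ratio Example 3: clinics per 10,000 persons
clinics_per_10k = ratio_per(400, 4_000_000, 10**4)

print(male_to_female)    # 2.0
print(clinics_per_10k)   # 1.0
```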
Proportion
- The quotient of 2 numbers
- Numerator is a sub-group of the population in the denominator
- Numerator is always INCLUDED in the denominator
- Proportion ranges between 0 and 1
- Percentage = proportion x 100
2 cases / 4 total = 0.5; 0.5 x 100 = 50%
Proportion Example 1
A university has 4000 male students and 2000 female students. Calculate the proportion of male and female students. Male: 4000/6000 x 100% = 66.7% Female: 2000/6000 x 100% = 33.3%
Proportion Example 2
40 children are currently ill with the measles, 80 children all together have had the measles 40 / 80 = .50 (proportion) 40 / 80 = .50 * 100 = 50% (percentage)
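As a sketch of the two proportion examples, the function below enforces the rule that the numerator must be included in the denominator. The name `proportion` and the error message are assumptions made for illustration.

```python
def proportion(part, whole):
    """Return part / whole, where `part` must be a subgroup of `whole`."""
    if whole <= 0 or not 0 <= part <= whole:
        raise ValueError("numerator must be a subgroup of a non-empty denominator")
    return part / whole

# Proportion Example 1: male students among all students
male_pct = round(proportion(4000, 4000 + 2000) * 100, 1)

# Proportion Example 2: children currently ill among all who have had measles
measles_pct = proportion(40, 80) * 100

print(male_pct)      # 66.7
print(measles_pct)   # 50.0
```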
Rate
- The quotient of 2 numbers
- Measures the probability of occurrence of an event over time
- Numerator: number of EVENTS
- Denominator: POPULATION at risk for the event in the numerator, observed for a given TIME
Rate Example 1
Mortality rate of tetanus in France in 1995
Tetanus deaths: 17 Population in 1995: 58 million Time period: 1 year Mortality rate = 0.029 per 100,000 population per year
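The tetanus figure can be checked as follows. This is a sketch only: `rate_per` and its parameters are hypothetical names, and it treats the 1995 population as the population at risk for the whole year.

```python
def rate_per(events, population_at_risk, years=1, per=100_000):
    """Events per `per` persons per year of observation."""
    return events / (population_at_risk * years) * per

# Rate Example 1: tetanus deaths in France, 1995
tetanus_rate = rate_per(events=17, population_at_risk=58_000_000)

print(round(tetanus_rate, 3))   # 0.029 per 100,000 population per year
```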
Rate Example 2
Maternal Mortality for Various Continents (1995)
Continent                    Rate
Africa                     273000
Asia                       217000
Europe                       2000
Latin America/Caribbean     22000
South America               15000
North America                 490
Australia/New Zealand          25
Summary
What is the Measure of Frequency?
Is numerator included in denominator?
- No: Ratio
- Yes: Is time included in denominator?
  - Yes: Rate
  - No: Proportion
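The decision tree in this summary can be written as a small function. This is an illustrative sketch; the function and argument names are not from the slides.

```python
def measure_of_frequency(numerator_in_denominator, time_in_denominator=False):
    """Classify a quotient as a ratio, proportion, or rate."""
    if not numerator_in_denominator:
        return "ratio"
    return "rate" if time_in_denominator else "proportion"

print(measure_of_frequency(False))        # ratio (e.g. males : females)
print(measure_of_frequency(True))         # proportion (e.g. males / all students)
print(measure_of_frequency(True, True))   # rate (e.g. deaths / population per year)
```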
Measures of Spread
Range, IQR, Variance, Standard deviation
[Figure: histogram of number of people by age group (0-9 through 90-99), illustrating central location and spread]
Age: 27, 30, 28, 31, 28, 36, 29, 37, 29, 34, 30, 30, 27, 30, 28, 31, 32, 30, 29, 29
Order the data set from the lowest value to the highest value and add observation numbers:
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Mode
Definition: Mode is the value that occurs most frequently
Method for identification
1. Arrange data into a frequency distribution or histogram, showing the values of the variable and the frequency with which each value occurs
2. Identify the value that occurs most often
Mode
Age     Frequency
27          2
28          3
29          4
30          5   (Mode)
31          2
32          1
33          0
34          1
35          0
36          1
37          1
Total      20
Mode
The most frequent value of the variable
Mode = 30
[Figure: bar chart of frequency by age (years), ages 27 to 37; the tallest bar is at age 30]
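The two-step method for the mode (build a frequency distribution, then pick the most frequent value) can be sketched with `collections.Counter`. The helper `modes` is an illustrative name; it returns a list so that a bimodal data set yields both values.

```python
from collections import Counter

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def modes(values):
    """Return every value that shares the highest frequency, in ascending order."""
    counts = Counter(values)          # step 1: frequency distribution
    top = max(counts.values())        # step 2: highest frequency
    return sorted(v for v, c in counts.items() if c == top)

age_modes = modes(ages)
print(age_modes)   # [30]
```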
Example
Finding Mode from Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Mode = 10
[Figure: histogram of number of patients by nights of stay (0 to 50)]
Outliers
[Figure: histogram of number of patients, highlighting outlying values]
Unimodal Distribution
[Figure: population distribution with a single peak]
Bimodal Distribution
[Figure: population distribution with two peaks]
Median
Definition: Median is the middle value; also, the value that splits the distribution into two equal parts
- 50% of observations are below the median
- 50% of observations are above the median
Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value at the middle
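Odd number of observations (N = 19):
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36
Middle position = (N + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10
Median age = value of the 10th observation = 30 years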
Even number of observations (N = 20):
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Middle position = (N + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5
Median age = average of the 10th and 11th observations = (30 + 30) / 2 = 30 years
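The (n + 1) / 2 position rule covers both the odd and the even case. The sketch below follows the three steps of the method; the function name `median_by_position` is illustrative.

```python
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def median_by_position(values):
    """Median via the middle position (n + 1) / 2 of the ordered data."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)                 # 1. arrange observations in order
    middle = (len(ordered) + 1) / 2          # 2. middle position (1-based)
    if middle.is_integer():                  # odd n: a single middle value
        return ordered[int(middle) - 1]
    lower = ordered[int(middle) - 1]         # even n: average the two
    upper = ordered[int(middle)]             # values around the middle
    return (lower + upper) / 2

print(median_by_position(ages))               # 30.0 (N = 20, position 10.5)
print(median_by_position(sorted(ages)[:19]))  # 30   (N = 19, position 10)
```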
Examples
Median at 50% = 10
Quartiles
Definition: Quartile is the value that splits the distribution into four equal parts
- 25% of observations are below the first quartile (Q1)
- 25% of observations are between Q1 and Q2 (median)
- 25% of observations are between Q2 (median) and Q3
- 25% of observations are above Q3
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Q1 observation = round((N + 1) / 4) = round((20 + 1) / 4) = round(21 / 4) = round(5.25) ~ 5th obs
Q1 age = 28
Q2 age = 30
Q3 age = 31
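A sketch of the rounded-position rule used for Q1 on this slide, extended to Q2 and Q3 as k(N + 1) / 4. Two assumptions are made here: the extension to Q2 and Q3, and rounding halves upward (Python's built-in `round` rounds halves to the nearest even number, so it is avoided). Other quartile conventions interpolate between observations and can give slightly different values.

```python
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def quartile_by_position(values, k):
    """k-th quartile (k = 1, 2, 3) as the observation nearest position k(N + 1) / 4."""
    ordered = sorted(values)
    position = k * (len(ordered) + 1) / 4
    nearest = int(position + 0.5)                     # round half up (assumption)
    nearest = min(max(nearest, 1), len(ordered))      # stay inside the data
    return ordered[nearest - 1]

q1, q2, q3 = (quartile_by_position(ages, k) for k in (1, 2, 3))
print(q1, q2, q3)   # 28 30 31
```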
Percentiles
Value of the variable that splits the distribution in 100 equal parts
35 % of observations are below the 35th percentile 65 % of observations are above 35th percentile
Values (Age)   Freq   Percent (Freq/Total)   Cumulative Percent
27               2          10%                    10%
28               3          15%                    25%    (25th percentile)
29               4          20%                    45%
30               5          25%                    70%
31               2          10%                    80%
32               1           5%                    85%
34               1           5%                    90%    (90th percentile)
36               1           5%                    95%
37               1           5%                   100%
Total           20         100%
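Reading a percentile off a cumulative percent column can be sketched as below: the p-th percentile is taken as the first value whose cumulative percent reaches p. This rule is an assumption that matches the table (25% is reached at age 28, 90% at age 34); the function name is illustrative. Integer arithmetic is used for the comparison to avoid floating-point surprises at exact boundaries.

```python
from collections import Counter

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def percentile_from_counts(values, p):
    """First value whose cumulative percent is at least p (0 < p <= 100)."""
    if not values or not 0 < p <= 100:
        raise ValueError("need data and a percentile in (0, 100]")
    counts = Counter(values)
    total = len(values)
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if cumulative * 100 >= p * total:   # cumulative percent >= p
            return value

print(percentile_from_counts(ages, 25))   # 28
print(percentile_from_counts(ages, 90))   # 34
```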
Arithmetic Mean
Arithmetic mean = average value
For the 20 ages:
$m = \frac{\sum x_i}{N}$
$N = 20$, $\sum x_i = 605$
$m = \frac{605}{20} = 30.25$
Example
Finding the Mean Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49 Sum = 360 n = 30 Mean = 360 / 30 = ?
Sensitive to Outliers
[Figure: histograms of number of patients by nights of stay]
Mean = 12.0 (original data)
Mean = 15.3 (with a more extreme outlier)
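The effect of an outlier on the mean can be checked numerically. The sketch below assumes the second histogram replaces the longest stay of 49 nights with 149 nights; that value is inferred from the "Range = 0 to 149" slide and is not stated here.

```python
from statistics import mean, median

stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Assumption: the more extreme data set swaps 49 nights for 149 nights.
stays_extreme = stays[:-1] + [149]

print(mean(stays), median(stays))                             # mean 12, median 10
print(round(mean(stays_extreme), 1), median(stays_extreme))   # mean 15.3, median 10
```

The mean moves from 12.0 to 15.3 while the median stays at 10, which is why the median is preferred for skewed data.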
Summary
Measure of Central Location: a single measure that represents an entire distribution
- Mode: most common value
- Median: central value
- Arithmetic mean: average value
- Mean uses all data, so it is sensitive to outliers
- Mean has the best statistical properties
- Mean preferred for normally distributed data
- Median preferred for skewed data
- Geometric mean for dilutional titers
Measures of Spread
Definition: Measures that quantify the variation or dispersion of a set of data from its central location
Also known as:
- Measure of dispersion
- Measure of variation
Common measures:
- Range
- Interquartile range
- Variance / standard deviation
- Standard error
- 95% confidence interval
Same center
but different dispersions
Range
Definition: difference between largest and smallest values
Length of stay data: Range = 0 to 49
[Figure: histogram of number of patients by nights of stay, axis 0 to 50]
With a more extreme outlier: Range = 0 to 149
[Figure: histogram of number of patients by nights of stay, axis 0 to 150]
Interquartile Range
Definition: the central 50% of a distribution
Properties / Uses
- Used with the median
- Used to show the most typical 50% of the values
Example
IQR Length of Stay Data
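Length of stay data, in order (n = 30):
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Q1 = 25th percentile: position (30 + 1) / 4 = 7.75, taken as the 7th observation; Q1 = 6
Median = 50th percentile: position (30 + 1) / 2 = 15.5; median = 10
Q3 = 75th percentile: position 3 (30 + 1) / 4 = 23.25, taken as the 23rd observation; Q3 = 14
[Figure: histogram of number of patients by nights of stay, with Q1 and Q3 marked]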
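A sketch of the interquartile range for the length of stay data. It keeps only the whole-number part of each quartile position, which is how this slide arrives at the 7th and 23rd observations; that truncation is an assumption about the slide's convention, and methods that interpolate between observations give slightly different quartiles.

```python
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

def iqr_by_position(values):
    """Return (Q1, Q3, Q3 - Q1) using truncated positions (n + 1)/4 and 3(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations")
    q1_pos = int((n + 1) / 4)        # 7.75 -> 7 for n = 30
    q3_pos = int(3 * (n + 1) / 4)    # 23.25 -> 23 for n = 30
    q1, q3 = ordered[q1_pos - 1], ordered[q3_pos - 1]
    return q1, q3, q3 - q1

print(iqr_by_position(stays))   # (6, 14, 8)
```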
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
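As a worked illustration of the sample variance and standard deviation (divisor n - 1), the sketch below applies them to the 20 ages used earlier in the deck. The function names are illustrative.

```python
from math import sqrt

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(values))

print(round(sample_variance(ages), 2))   # 7.36
print(round(sample_sd(ages), 2))         # 2.71
```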
Normal Distribution
2.5% 95 % 68% 2.5%
Standard deviation
Mean
Skewed right:
Mode < Median < Mean
Skewed left:
Mean < Median < Mode
[Figure: slides pairing measures of central location (mode, median, arithmetic mean) with measures of spread (interquartile range, standard deviation), followed by an age distribution annotated with the minimum, 1st quartile, median, 3rd quartile, and maximum]
Charts
Bar charts Pie charts
Frequency distributions
A simple and effective way of summarizing categorical data is to construct a frequency distribution table.
First column: levels of the variable.
Second column: count of observations at each level.
E.g. the table below shows the frequency distribution of birth weight for 9975 newborns between 1976-1996.
BWT        Freq.   Rel.Freq(%)   Cum. Freq
Very low      43        0.4            43
Low          793        8.0           836
Normal      8870       88.9          9706
Big          268        2.7          9974
Total       9974      100
Relative Frequency
It is useful to compute the proportion, or percentage, of observations in each level. The distribution of proportions is called the relative frequency distribution of the variable. Given the total number of observations, the relative frequency distribution is easily derived from the frequency distribution. Conversion in the opposite direction is also possible, but it is often inaccurate because of rounding. The third column of the birth weight table shows the relative frequency distribution of birth weight for 9975 newborns between 1976-1996.
Cumulative frequency
The cumulative frequency of a category is the number of observations in the category plus the observations in all categories smaller than it.
BWT        Freq.   Rel.Freq(%)   Cum.Freq   Cum.rel.freq.(%)
Very low      43        0.4           43          0.4
Low          793        8.0          836          8.4
Normal      8870       88.9         9706         97.3
Big          268        2.7         9974        100
Total       9974      100
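The relative and cumulative columns can be derived from the counts alone. The sketch below rebuilds the birth weight table; `frequency_table` is an illustrative helper, and percentages are rounded to one decimal place as in the table.

```python
def frequency_table(counts):
    """Rows of (level, freq, rel. freq %, cum. freq, cum. rel. freq %)."""
    total = sum(counts.values())
    rows, running = [], 0
    for level, freq in counts.items():     # levels must be listed in order
        running += freq
        rows.append((level, freq, round(100 * freq / total, 1),
                     running, round(100 * running / total, 1)))
    return rows

bwt = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}
bwt_rows = frequency_table(bwt)
for row in bwt_rows:
    print(row)
# ('Very low', 43, 0.4, 43, 0.4)
# ('Low', 793, 8.0, 836, 8.4)
# ('Normal', 8870, 88.9, 9706, 97.3)
# ('Big', 268, 2.7, 9974, 100.0)
```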
Table 2. Frequencies of serum cholesterol levels for 1067 US males of ages 25-34 1976-1980
------------------------------------------------------------------------
Cholesterol level
(mg/100 ml)         Freq   Relative freq   Cum freq   Cum. rel. freq
------------------------------------------------------------------------
80-119                13        1.2            13          1.2
120-159              150       14.1           163         15.3
160-199              442       41.4           605         56.7
200-239              299       28.0           904         84.7
240-279              115       10.8          1019         95.5
280-319               34        3.2          1053         98.7
320-359                9        0.8          1062         99.5
360-399                5        0.5          1067        100
------------------------------------------------------------------------
Total               1067      100
Charts
The frequency distribution of a categorical variable is often presented graphically as a bar chart or pie chart.
Bar charts: display the frequency distribution for nominal or ordinal data.
- Horizontal axis: labels of the variable
- Vertical bar: frequency or relative frequency
- The bars should be of equal width and should be separated from one another so as not to imply continuity
[Figure: bar charts showing the frequency distribution (Freq.) and relative frequency distribution (Rel. Freq.) of the variable BWT described in the table; horizontal axis: Very low, Low, Normal, Big]
[Figure: bar chart indicating categories of birth weight (percent) of 9975 newborns, grouped by antenatal follow-up of the mothers (Yes / No)]
Pie chart
Pie Chart: displays the frequency distribution for nominal or ordinal data.
In a pie chart, the categories into which the observations fall are represented as sectors of a circle. Each sector represents the frequency or relative frequency of the observations in that category, and the angle of each sector is proportional to that frequency or relative frequency.
Fig 3(b) Pie chart indicating relative frequency of categories of birth weight
[Figure: pie chart with sectors Very low (43; 0.4%), Low (793; 8%), Normal (8870; 88.9%), Big (268; 2.7%)]
Histograms
Histograms are frequency distributions with continuous class intervals that have been turned into graphs. Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a histogram.
- Horizontal axis: labels of the variable
- Vertical bar: frequency or relative frequency
- Except for the two boundaries, class intervals are usually chosen to be of equal width. If this is not the case, the histogram could give a misleading impression of the shape of the data.
Example: Consider the following table and the histogram showing the distribution of the age of women at the time of marriage.
Age group   No. of women
15-19            11
20-24            36
25-29            28
30-34            13
35-39             7
40-44             3
45-49             2
[Figure: histogram of number of women by age group at marriage]
A histogram displaying the frequency distribution of birth weight of newborns at Tikur Anbessa Hospital
[Figure: histogram; horizontal axis: birth weight, 800 to 5200; vertical axis: frequency, 0 to 2000]
Frequency polygons
Instead of drawing bars for each class interval, sometimes a single point is drawn at the midpoint of each class interval and consecutive points are joined by straight lines.
Frequency polygon of birth weight of 9975 newborns at Tikur Anbessa Hospital for males and females
[Figure: frequency polygons by sex (males, females); horizontal axis: birth weight, 500 to 5000; vertical axis: %, 0 to 50]
Cumulative frequency polygons can also be used to obtain percentiles of a set of data.
Roughly, the 50th percentile is the value that is greater than or equal to 50% of the observations.
Table 3. Frequencies of serum cholesterol levels for 1227 US males of ages 55-64 1976-1980
------------------------------------------------------------------------
Cholesterol level
(mg/100 ml)         Freq   Relative freq   Cum freq   Cum. rel. freq
------------------------------------------------------------------------
80-119                 5        0.4             5          0.4
120-159               48        3.9            53          4.3
160-199              265       21.6           318         25.9
200-239              458       37.3           776         63.2
240-279              281       22.9          1057         86.1
280-319              128       10.4          1185         96.5
320-359               35        2.9          1220         99.4
360-399                7        0.5          1227        100
------------------------------------------------------------------------
Total               1227      100
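With grouped data, the cumulative relative frequency shows which class interval contains a given percentile. The sketch below finds the class holding the 50th percentile for each age group in Tables 2 and 3; the function name is illustrative, and it returns only the class label rather than an interpolated value.

```python
def class_containing_percentile(classes, p):
    """Label of the first class whose cumulative relative frequency reaches p percent."""
    total = sum(freq for _, freq in classes)
    running = 0
    for label, freq in classes:
        running += freq
        if running * 100 >= p * total:
            return label
    raise ValueError("percentile must be between 0 and 100")

labels = ["80-119", "120-159", "160-199", "200-239",
          "240-279", "280-319", "320-359", "360-399"]
ages_25_34 = list(zip(labels, [13, 150, 442, 299, 115, 34, 9, 5]))    # Table 2
ages_55_64 = list(zip(labels, [5, 48, 265, 458, 281, 128, 35, 7]))    # Table 3

print(class_containing_percentile(ages_25_34, 50))   # 160-199
print(class_containing_percentile(ages_55_64, 50))   # 200-239
```

The median cholesterol level of the older group falls one class higher than that of the younger group.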
Frequency polygons and cumulative frequency polygons of serum cholesterol levels for 2294 males aged 25-34 and 55-64 years, 1976-1980
[Figure: vertical axis: cumulative relative frequency (%), 0 to 100]
Box Plots
A visual picture called a box plot can be used to convey a fair amount of information about the distribution of a set of data.
- The box shows the distance between the first and the third quartiles
- The median is marked as a line within the box
- The end lines show the minimum and maximum values, respectively
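The five values a box plot displays can be computed before any drawing. The sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) on the 20 ages from earlier in the deck. Note that this function interpolates between observations, so Q1 comes out as 28.25 rather than the 28 obtained with the rounded-position rule.

```python
from statistics import quantiles

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the values a box plot displays."""
    q1, q2, q3 = quantiles(values, n=4)   # default 'exclusive' method
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary(ages)
print(summary)   # (27, 28.25, 30.0, 31.0, 37)
```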
Illustration of Box-plot
[Figure: box plot drawn on a numeric axis from 18 to 36]
A box-plot indicating birth weight of 5092 newborns by gestational age at Tikur Anbessa Hospital
[Figure: box plots by gestational age (Pre, Term, Post); axis: birth weight (grams), 500 to 5000]
Summary
Although some information is lost when data are summarized using tables and graphs, a great deal is gained.
Tables
- Tables are effective ways of summarizing categorical data
- Tables are more informative when they are not overly complex
- Tables and the columns within them should always be clearly labeled, and units of measurement should be specified
Diagrams
- Diagrams are more attractive than mere figures: they please the eye, add a spark of interest, and catch the attention as much as raw figures dispel it
- They help in deriving the required information in less time and without mental strain
- They are easier to remember than mere figures, because the impression left by a diagram is lasting
- They facilitate comparison