Chapter I Descriptive Statistics

Objectives
Define variable and data
Describe types of data and measurement scales
Define and calculate ratio, rate and proportion
Define and calculate measures of central tendency and measures of spread
Organize and display data
Extract useful information
Variable: any aspect of an individual or object that is measured (e.g., blood pressure) or recorded (e.g., age, sex) and that can take different values. A study may have one variable or many, e.g., a study of treatment outcome of TB.
E.g. Nominal
Marital status: 1. Single 2. Married 3. Widowed 4. Divorced
The numbers have NO meaning. They are labels only.
E.g. Ordinal
Pain level: 1. None 2. Mild 3. Moderate 4. Severe
The numbers have LIMITED meaning: 4 > 3 > 2 > 1 is all we know, apart from their utility as labels.
E.g. Interval
- Temperature in °C on 4 consecutive days
  Days:      A   B   C   D
  Temp. °C:  18  20  22  23
  For these data, not only is day A (18 °C) cooler than day D (23 °C), it is 5 °C cooler.
- It has no true zero point. 0 is arbitrarily chosen and does not reflect the absence of temperature.
E.g. Ratio
- Height, age, weight, BP, etc. Someone who weighs 80 kg is two times as heavy as someone who weighs 40 kg.
. Note on meaningfulness of ratio-
Degree of precision in measuring (increasing): Nominal, Ordinal, Interval, Ratio
Summary of Data
Variable
- Qualitative or categorical
  - Nominal (not ordered), e.g. ethnic group
  - Ordinal (ordered), e.g. response to treatment
- Quantitative measurement
  - Discrete (count data), e.g. number of admissions
  - Continuous (real-valued), e.g. height
Categorizing Data
Categorizing data can facilitate data analysis. Must choose:
- Number of categories
- Category cut points
Some options for cut points: percentiles, natural breaks, established criteria
Example: WHO body mass index classification
Underweight: < 18.50 kg/m²
Normal: 18.50 to 24.99 kg/m²
Overweight: ≥ 25.00 kg/m²
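As a minimal sketch of applying cut points, the Python function below assigns a BMI value to one of the three categories of the WHO example, using 18.50 and 25.00 kg/m² as the cut points. The function name and the sample BMI values are hypothetical, chosen only for illustration.

```python
def bmi_category(bmi: float) -> str:
    """Assign a BMI value (kg/m^2) to one of the three categories in the example."""
    if bmi < 18.50:
        return "Underweight"
    if bmi < 25.00:
        return "Normal"
    return "Overweight"


# Hypothetical BMI values, chosen only to exercise each category
sample_bmis = [17.2, 18.5, 24.99, 25.0, 31.4]
categories = [bmi_category(b) for b in sample_bmis]
print(categories)
```

Writing the cut points as half-open comparisons means every value falls in exactly one category, including values that sit exactly on a cut point.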
Categorizing Variables-Exercise
1. Year of birth
2. Marital status of women
3. Identification number of study participant
4. Class rank
5. Length of infants at ANC clinic
Categorizing Variables-Exercise
1. Year of birth: Quantitative/Discrete
2. Marital status: Categorical/Nominal
3. Identification number: Categorical/Nominal
4. Class rank: Categorical/Ordinal
5. Length: Quantitative/Continuous
Discrete or Continuous?
Identify whether the following data is discrete or continuous:
1. Distance from primary health center to reference lab
2. Number of times a child under 5 has experienced fever in the last month
3. Number of fatal accidents on a road over the past year
4. Weight gained or lost by a 9 month old in the past 3 months
Discrete or Continuous?
Identify whether the following data is discrete or continuous:
1. Distance from primary health center to reference lab: Continuous
2. Number of times a child under 5 has experienced fever in the last month: Discrete
3. Number of fatal accidents on a road over the past year: Discrete
4. Weight gained or lost by a 9 month old in the past 3 months: Continuous
Describing categorical data

A prerequisite for any research is the ability to quantify the occurrence of disease.
- How many people are affected by a certain disease? (Count)
- What is the rate at which the disease is occurring through time? (Rate)
- How does the disease burden vary by location, by sex, by age, or by various modes of exposure? (Ratio, Proportion)
Counts
The most basic measure of disease frequency is a simple count of affected individuals.
Example:
- 350,000 cases of polio
- 350,000 cases of polio in 1988
- 350,000 cases of polio in 1988 in 125 countries
How is count data used?

1988 Polio > 350 000 cases 125 countries
2002 Polio 1918 cases 7 countries
Example of Counts
Number of Cases of Hemorrhagic Fever by Age and Sex, Zaire, 1976
Age (years)   Male   Female   Total
<1              10       14      24
1 - 14          18       25      43
15 - 29         33       60      93
30 - 49         57       52     109
50+             23       26      49
Total          141      177     318
Ratio Proportion Rate

What, or who, is in the denominator?
Ratio
- The quotient of 2 numbers
- Numerator NOT INCLUDED in the denominator
- No relationship necessary between numerator and denominator
- May be expressed as a/b or a:b
What is the sex ratio?
Sex ratio = males:females = (# males / # females) x 100
= (2 / 5) x 100 = 0.4 x 100 = 40 males per 100 females
When is a ratio used?

- Sex ratio: male to female
- Number of health facilities per population
- Number of participants in the course per facilitator
- Number of inhabitants per latrine
- Odds ratio
- Relative risk
- Prevalence ratio
- Maternal mortality ratio
Ratio Example 1
A university has 4000 male students and 2000 female students. The ratio of male to female students is: 4000/2000 = 2/1 or 2:1 For every 2 male students there is one female student
Ratio Example 2
A foodborne epidemic occurred in an elementary school canteen. The attack rate in the first grade was 24% while the attack rate in the second grade was 16%. Compare these two attack rates. 24/16 = 3/2 or 3:2. The attack rate in the first grade was 1.5 times the attack rate in the second grade.
Ratio Example 3
A city of 4 million people has 400 clinics. Calculate the ratio of clinics per person.
Ratio = 400 / 4,000,000 = 0.0001 clinics / person
Multiply by 10^4
Ratio = 0.0001 x 10^4 = 1 clinic / 10,000 persons
Proportion
- The quotient of 2 numbers
- Numerator is a sub-group of the population in the denominator
- Numerator is always INCLUDED in the denominator
- Proportion ranges between 0 and 1
- Percentage = proportion x 100
What is the proportion of cases?
Proportion of cases = 2 cases / 4 total = 0.5
Percentage = 0.5 x 100 = 50%
When is a proportion used?

Proportion of samples positive for P. falciparum
- 1000 samples, 236 positive
- Proportion of positive samples = 236/1000 = 0.236
- Percentage of positive samples = 0.236 x 100 = 23.6%
Proportion of malaria deaths
- 123 malaria cases, 7 deaths
- Proportion of malaria deaths = 7/123 = 0.057
- Percentage of malaria deaths = 0.057 x 100 = 5.7%
Proportion Example 1
A university has 4000 male students and 2000 female students. Calculate the proportion of male and female students. Male: 4000/6000 x 100% = 66.7% Female: 2000/6000 x 100% = 33.3%
Proportion Example 2
40 children are currently ill with measles; 80 children altogether have had measles.
40 / 80 = 0.50 (proportion)
0.50 x 100 = 50% (percentage)
Rate
- The quotient of 2 numbers
- Measures the probability of occurrence of an event over time
- Numerator: number of EVENTS
- Denominator: POPULATION at risk for the event in the numerator, observed for a given TIME
What is the rate of death?
Observed in one year: 2 deaths in a population of 100
Rate = 2 / 100 per year = 2 deaths per 100 population per year
When is a rate used?

Morbidity rates
- Attack rates
- Prevalence rates
- Incidence rates
Mortality rates
Natality rates
Rate Example 1
Mortality rate of tetanus in France in 1995
Tetanus deaths: 17
Population in 1995: 58 million
Time period: 1 year
Mortality rate = 17 / 58,000,000 = 0.029 per 100,000 population per year
Rate may be expressed in any power of 10

100, 1,000, 10,000, 100,000
Rate must include an aspect of time

Per year, per month, per day
Rate Example 2
Maternal Mortality for Various Continents (1995)
Continent                     Rate
Africa                      273000
Asia                        217000
Europe                        2000
Latin America/Caribbean      22000
South America                15000
North America                  490
Australia/New Zealand           25
Summary
What is the Measure of Frequency?
Is numerator included in denominator?
- No: Ratio
- Yes: Is time included in denominator?
  - Yes: Rate
  - No: Proportion
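The three measures can be sketched in Python using figures from the examples above: the 4000:2000 student sex ratio, the 236 positive samples out of 1000, and the 17 tetanus deaths in a population of 58 million. The function names and the `per` parameter are illustrative assumptions, not part of the original slides.

```python
def ratio(a: float, b: float) -> float:
    """Quotient of two numbers; the numerator is not part of the denominator."""
    return a / b


def proportion(part: float, whole: float) -> float:
    """Quotient where the numerator is included in the denominator (0 to 1)."""
    if not 0 <= part <= whole:
        raise ValueError("the numerator must be a subset of the denominator")
    return part / whole


def rate(events: float, population_at_risk: float, per: int = 100_000) -> float:
    """Events per `per` population at risk over the observation period."""
    return events / population_at_risk * per


male_to_female = ratio(4000, 2000)           # male and female students
positive_share = proportion(236, 1000)       # positive samples among all samples
tetanus_rate = rate(17, 58_000_000)          # deaths per 100,000 per year

print(male_to_female, positive_share, round(tetanus_rate, 3))
```

The time period is not a function argument here; it is carried by the data, since the 17 deaths were counted over one year.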
Describing Quantitative Variables

Measures of Central Location
Mean, Median, Mode
Measures of Spread
Range, IQR, Variance, Standard deviation
Measure of Central Location

Central Location / Position / Tendency
A single value that represents (is a good summary of) an entire distribution of data
Also known as:
- Measure of central tendency
- Measure of central position
Common measures:
- Arithmetic mean
- Median
- Mode
[Figure: histogram of number of people by age group (0-9 through 90-99), with central location and spread marked]
Raw data set: Ages of students in a class (years)
27 30 28 31 28 36 29 37 29 34
30 30 27 30 28 31 32 30 29 29
Order the data set from the lowest value to the highest value. Add observation numbers.
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Mode
Definition: Mode is the value that occurs most frequently
Method for identification
1. Arrange data into a frequency distribution or histogram, showing the values of the variable and the frequency with which each value occurs
2. Identify the value that occurs most often
Mode
Age:        27 28 29 30 31 32 33 34 35 36 37  Total
Frequency:   2  3  4  5  2  1  0  1  0  1  1     20
The mode is the age with the highest frequency: 30 (frequency 5)
Mode
The most frequent value of the variable
Mode = 30
[Figure: histogram of frequency by age (years), ages 27 to 37]
Example
Finding Mode from Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Mode = 10
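The identification method can be sketched in Python with the two data sets used above. `Counter` builds the frequency distribution and `statistics.multimode` returns every value tied for the highest frequency; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

# Ages of students in a class (years), from the raw data set
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

# Length of stay data (nights)
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
          9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Step 1: frequency distribution (value -> number of occurrences)
age_frequencies = Counter(ages)

# Step 2: the value(s) that occur most often
age_modes = multimode(ages)
night_modes = multimode(nights)

print(sorted(age_frequencies.items()))
print(age_modes, night_modes)
```

`multimode` returns a list because a data set may have more than one mode.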
Finding Mode from Histogram
[Figure: histogram of number of patients by nights of stay, 0 to 50]
Mode Properties / Uses

- Easiest measure to understand, explain, identify
- Always equals an original value
- Insensitive to extreme values (outliers)
- Good descriptive measure, but poor statistical properties
- May be more than one mode
- May be no mode
- Does not use all the data
Outliers
[Figure: histogram of number of patients by nights of stay, 0 to 150]
Unimodal Distribution
[Figure: population histogram with a single peak]
Bimodal Distribution
[Figure: population histogram with two peaks]
Median
Definition: Median is the middle value; also, the value that splits the distribution into two equal parts
- 50% of observations are below the median
- 50% of observations are above the median
Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value at the middle
Median: Odd Number of Values
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36
N = 19
Median = observation at position (N + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10
Median age = 30 years
Median: Even Number of Values
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
N = 20
Median = observation at position (N + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5
Median age = average of the 10th and 11th observations = (30 + 30) / 2 = 30 years
Examples
Find the Median of Length of Stay Data

0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Median at 50% = 10
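The (n + 1) / 2 position rule can be sketched as a small Python function. It is a minimal illustration that handles both the odd and the even case by averaging the observations on either side of the middle position; the function name is hypothetical.

```python
import math


def median_by_position(values):
    """Median via the middle position (n + 1) / 2 of the ordered data."""
    ordered = sorted(values)                 # step 1: arrange in order
    n = len(ordered)
    position = (n + 1) / 2                   # step 2: middle position (1-based)
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2               # step 3: value at the middle


ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]
nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
          9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

median_age_19 = median_by_position(sorted(ages)[:19])   # odd n: 10th observation
median_age_20 = median_by_position(ages)                # even n: position 10.5
median_nights = median_by_position(nights)              # n = 30: position 15.5

print(median_age_19, median_age_20, median_nights)
```

When n is odd the floor and ceiling of the position coincide, so the average simply returns that single observation.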
Median Properties / Uses

- Does not use all the data available
- Insensitive to extreme values (outliers)
- Good descriptive measure but poor statistical properties
- Measure of choice for skewed data
- Equals an original value if n is odd
Quartiles
Definition: Quartiles are the values that split the distribution into four equal parts
- 25% of observations are below the first quartile (Q1)
- 25% of observations are between Q1 and Q2 (median)
- 25% of observations are between Q2 (median) and Q3
- 25% of observations are above Q3
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Q1 observation = round((N + 1) / 4) = (20 + 1) / 4 = 21 / 4 = 5.25 ~ 5th obs, so Q1 age = 28
Q2 observation = 10.5 (median), so Q2 age = 30
Q3 observation = round(3(N + 1) / 4) = 3(20 + 1) / 4 = 3(21) / 4 = 15.75 ~ 16th obs, so Q3 age = 31
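The rounding rule for quartile positions can be sketched in Python as follows. This is an illustration of the rule as stated on the slide (position k(N + 1) / 4, rounded to the nearest observation), not a general-purpose quantile routine; the function name is hypothetical.

```python
import math


def quartile_by_rounded_position(values, k):
    """k-th quartile (k = 1 or 3) as the observation nearest to k(N + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4             # 1-based position, may be fractional
    rank = math.floor(position + 0.5)      # round to the nearest observation
    return ordered[rank - 1]


ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

q1_age = quartile_by_rounded_position(ages, 1)   # position 5.25 -> 5th obs
q3_age = quartile_by_rounded_position(ages, 3)   # position 15.75 -> 16th obs

print(q1_age, q3_age)
```

Statistical packages often interpolate between neighboring observations instead of rounding, so their quartiles can differ slightly from the values this rule gives.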
Percentiles
Percentiles are the values of the variable that split the distribution into 100 equal parts
- 35% of observations are below the 35th percentile
- 65% of observations are above the 35th percentile
Values (Age)   Freq   Percent (Freq/Total)   Cumulative Percent
27                2          10%                   10%
28                3          15%                   25%    25th percentile
29                4          20%                   45%
30                5          25%                   70%
31                2          10%                   80%
32                1           5%                   85%
34                1           5%                   90%    90th percentile
36                1           5%                   95%
37                1           5%                  100%
Total            20         100%
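A percent and cumulative percent table like the one above can be built in a few lines of Python. The percentile lookup assumes one simple reading of the table, namely that the p-th percentile is the first value whose cumulative percent reaches p; the function name is hypothetical.

```python
from collections import Counter

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

counts = Counter(ages)
total = len(ages)

# Each row: (value, frequency, percent, cumulative percent)
rows = []
cumulative = 0
for value in sorted(counts):
    cumulative += counts[value]
    rows.append((value, counts[value],
                 100 * counts[value] / total,
                 100 * cumulative / total))


def percentile_from_table(table_rows, p):
    """First value whose cumulative percent is at least p (assumed convention)."""
    for value, _freq, _percent, cumulative_percent in table_rows:
        if cumulative_percent >= p:
            return value
    raise ValueError("p must be between 0 and 100")


p25 = percentile_from_table(rows, 25)
p90 = percentile_from_table(rows, 90)

for row in rows:
    print(row)
print(p25, p90)
```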
Arithmetic Mean
Arithmetic mean = average value
Method for identification

1. Sum up all of the values
2. Divide the sum by the number of observations (n)
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
N = 20, \sum x_i = 605
$$\mu = \frac{\sum x_i}{N} = \frac{605}{20} = 30.25$$
Example
Finding the Mean Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49 Sum = 360 n = 30 Mean = 360 / 30 = ?
Arithmetic Mean Properties / Uses

- Probably the best known measure of central location
- Uses all of the data
- Affected by extreme values (outliers)
- Best for normally distributed data
- Not usually equal to one of the original values
- Good statistical properties
Sensitive to Outliers
[Figure: two histograms of number of patients by nights of stay. Panel with nights 0 to 50: Mean = 12.0. Panel with nights 0 to 150: Mean = 15.3]
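The effect of one extreme value on the mean can be checked with a short Python sketch. It assumes, as the two histograms suggest, that the second data set differs from the first only in its largest value (149 nights instead of 49); that substitution is an assumption made for illustration.

```python
from statistics import median

nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
          9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Assumed outlier version: the longest stay becomes 149 nights
nights_with_outlier = nights[:-1] + [149]

mean_original = sum(nights) / len(nights)
mean_with_outlier = sum(nights_with_outlier) / len(nights_with_outlier)

median_original = median(nights)
median_with_outlier = median(nights_with_outlier)

print(mean_original, round(mean_with_outlier, 1))
print(median_original, median_with_outlier)
```

The mean moves when the single largest value changes, while the median stays where it was, which is why the median is preferred for skewed data.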
When to use the arithmetic mean?

Centered distribution
- Approximately symmetrical
- Few extreme values (outliers)
OK!
Summary
Measure of Central Location: a single measure that represents an entire distribution
- Mode: most common value
- Median: central value
- Arithmetic mean: average value
Mean uses all data, so it is sensitive to outliers
Mean has the best statistical properties
Mean preferred for normally distributed data
Median preferred for skewed data
Geometric mean for dilutional titers
Other Measures of Central Location

Midrange = (minimum value + maximum value) / 2
- Quick and dirty
Geometric Mean
- Can be used if the logs of the data are normally distributed (e.g., lab titers)
- = nth root of (Obs1 x Obs2 x Obs3 x ... x Obsn)
- = antilog (sum of log xi / n)
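The two forms of the geometric mean can be compared in a short Python sketch. The titer values below are hypothetical, chosen only so that the result is easy to check by hand; base-10 logs are used, but any base gives the same answer.

```python
import math

# Hypothetical dilutional titers (reciprocal dilutions), for illustration only
titers = [10, 20, 40, 80, 160]
n = len(titers)

# Form 1: nth root of the product of the observations
geometric_mean_root = math.prod(titers) ** (1 / n)

# Form 2: antilog of the mean of the logs
mean_log = sum(math.log10(x) for x in titers) / n
geometric_mean_log = 10 ** mean_log

# Midrange for comparison: (minimum + maximum) / 2
midrange = (min(titers) + max(titers)) / 2

print(geometric_mean_root, geometric_mean_log, midrange)
```

The log form is usually preferred in practice because the product of many observations can become very large.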
Measures of Spread
Definition: Measures that quantify the variation or dispersion of a set of data from its central location
Also known as:
- Measure of dispersion
- Measure of variation
Common measures:
- Range
- Interquartile range
- Variance / standard deviation
- Standard error
- 95% confidence interval
[Figure: distributions with the same center but different dispersions]
Range
Definition: difference between largest and smallest values
Example: Finding the Range of Length of Stay Data

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Range Sensitive to Outliers?
[Figure: two histograms of number of patients by nights of stay. Panel with nights 0 to 50: Range = 0 to 49. Panel with nights 0 to 150: Range = 0 to 149]
Interquartile Range
Definition: the central 50% of a distribution
Properties / Uses
- Used with median
- Used to show the most typical 50% of the values
Example
IQR Length of Stay Data
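Ordered data (n = 30): 0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Q1 = 25th percentile: position (30 + 1) / 4 = 7.75, between the 7th observation (6) and the 8th (7)
Median = 50th percentile: position (30 + 1) / 2 = 15.5, between the 15th and 16th observations (both 10)
Q3 = 75th percentile: position 3(30 + 1) / 4 = 23.25, between the 23rd observation (14) and the 24th (16)
IQR Length of Stay Data
IQR = Q3 - Q1 = 14.5 - 6.75 = 7.75, with Q1 and Q3 interpolated at positions 7.75 and 23.25
[Figure: histogram of number of patients by nights of stay, 0 to 50, with Q1 and Q3 marked]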
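The range and interquartile range of the length of stay data can be computed with the sketch below. It assumes that quartile positions (n + 1) / 4 and 3(n + 1) / 4 are resolved by linear interpolation between the neighboring observations, which reproduces the IQR of 7.75 reported for these data; the helper name is hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    rank = int(position)                   # rank of the lower neighbor
    fraction = position - rank
    if fraction == 0:
        return float(ordered[rank - 1])
    lower, upper = ordered[rank - 1], ordered[rank]
    return lower + fraction * (upper - lower)


nights = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
          9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
          12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

ordered = sorted(nights)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)        # position 7.75
q3 = value_at_position(ordered, 3 * (n + 1) / 4)    # position 23.25
iqr = q3 - q1

data_range = (ordered[0], ordered[-1])              # smallest and largest values

print(q1, q3, iqr, data_range)
```

Python's `statistics.quantiles(nights, n=4)` uses the same (n + 1) positions by default, so it can serve as a cross-check.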
Variance and Standard Deviation

Definition: measures of variation that quantify how closely the observed values cluster around the mean
Variance = average of squared deviations from the mean
= Sum (x - mean)^2 / (n - 1)
Standard deviation = square root of variance
Equations for Variance and Standard Deviation

\bar{x}: Mean
x_i: Data value
n: No. of observations
s^2: Variance
s: Standard deviation
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
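The equations can be followed step by step in Python. The sketch below applies them to the ages data set from earlier in the chapter, dividing the sum of squared deviations by n - 1; the variable names are illustrative.

```python
import math

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

n = len(ages)
mean = sum(ages) / n                                   # x-bar

# Squared deviation of each data value from the mean
squared_deviations = [(x - mean) ** 2 for x in ages]

variance = sum(squared_deviations) / (n - 1)           # s squared
standard_deviation = math.sqrt(variance)               # s

print(mean, round(variance, 2), round(standard_deviation, 2))
```

`statistics.variance` and `statistics.stdev` use the same n - 1 divisor and can be used to confirm the result.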
Standard Deviation Properties / Uses

Standard deviation is usually calculated only when data are more or less normally distributed (bell-shaped curve)
For normally distributed data,
- 68.3% of the data fall within plus/minus 1 SD
- 95.5% of the data fall within plus/minus 2 SD
- 95.0% of the data fall within plus/minus 1.96 SD
- 99.7% of the data fall within plus/minus 3 SD
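These coverage figures can be checked approximately with the standard normal distribution in Python's `statistics` module. The sketch computes the share of a normal distribution lying within plus/minus k standard deviations of the mean; the helper name is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage_within(k: float) -> float:
    """Share of a normal distribution within plus/minus k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: coverage_within(k) for k in (1, 1.96, 2, 3)}

for k, share in coverage.items():
    print(f"within +/- {k} SD: {100 * share:.2f}%")
```

Because the shares are expressed in units of SD, they are the same for any mean and standard deviation.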
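Normal Distribution
[Figure: normal curve centered on the mean, with standard deviations along the horizontal axis; the central 68% and 95% areas are marked, with 2.5% in each tail]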
Comparison of Mode, Median and Mean

Symmetrical:
Mode = Median = Mean
Skewed right:
Mode < Median < Mean
Skewed left:
Mean < Median < Mode
Match the Measures of Central Location & Spread
Mode Median
Standard deviation Range
Arithmetic mean
Interquartile range
Name the Appropriate Measures of Central Location and Spread

Distribution                     Central Location   Spread
Single peak, symmetrical
Skewed or data with outliers
Name the Appropriate Measures of Central Location and Spread

Distribution                     Central Location   Spread
Single peak, symmetrical         Mean*              Standard deviation
Skewed or data with outliers     Median             Range or interquartile range

* Median and mode will be similar
Properties of Measures of Central Location & Spread

- Arithmetic mean: best for normally distributed data
- Median: best for skewed data
- Mode: simple, descriptive, not always useful
- Standard deviation: use with mean
- Range / interquartile range: use with median
[Figure: histogram of population by age, annotated with the mode, median, 1st quartile, 3rd quartile, minimum, maximum, interquartile interval and range]
Displaying categorical variables

Table of frequency distributions
Frequency Relative frequency Cumulative frequencies
Charts
Bar charts Pie charts
Frequency distributions
A simple and effective way of summarizing categorical data is to construct a frequency distribution table.
First column: levels of the variable
Second column: count of observations at each level
E.g. the table below shows the frequency distribution of birth weight for 9975 newborns between 1976-1996.
BWT         Freq.   Rel.Freq(%)   Cum. Freq   Cum.rel.freq.(%)
Very low       43       0.4           43            0.4
Low           793       8.0          836            8.4
Normal       8870      88.9         9706           97.3
Big           268       2.7         9974          100
Total        9974     100
Relative Frequency
It is useful to compute the proportion, or percentage, of observations in each level. The distribution of proportions is called the relative frequency distribution of the variable.
Given the total number of observations, the relative frequency distribution is easily derived from the frequency distribution. Conversion in the opposite direction is also possible, but it is often inaccurate because of rounding.
The third column of the table below shows the relative frequency distribution of birth weight for 9975 newborns between 1976-1996.
Table 1. Frequency Distribution of birth weight of newborns between 1976-1996 at TAH.

BWT         Freq.   Rel.Freq(%)   Cum. Freq   Cum.rel.freq.(%)
Very low       43       0.4           43            0.4
Low           793       8.0          836            8.4
Normal       8870      88.9         9706           97.3
Big           268       2.7         9974          100
Total        9974     100
Cumulative frequency
The cumulative frequency of a category is the number of observations in the category plus the observations in all categories below it.
BWT         Freq.   Rel.Freq(%)   Cum.Freq   Cum.rel.freq.(%)
Very low       43       0.4           43           0.4
Low           793       8.0          836           8.4
Normal       8870      88.9         9706          97.3
Big           268       2.7         9974         100
Total        9974     100
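The relative and cumulative columns can be derived from the counts alone, as the Python sketch below shows for the birth weight table. Percentages are rounded to one decimal place to match the table; the variable names are illustrative.

```python
# Frequency of each birth weight category, in table order
counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

total = sum(counts.values())

# Each row: (level, freq, rel. freq %, cum. freq, cum. rel. freq %)
table = []
running = 0
for level, freq in counts.items():
    running += freq
    table.append((level, freq,
                  round(100 * freq / total, 1),
                  running,
                  round(100 * running / total, 1)))

for row in table:
    print(row)
print("Total", total)
```

Cumulative columns are only meaningful when the categories have a natural order, as they do here.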
Table 2. Frequencies of serum cholesterol levels for 1067 US males of ages 25-34, 1976-1980
Cholesterol level
(mg/100 ml)         Freq   Relative freq   Cum freq   Cum. rel. freq
80-119                13        1.2            13           1.2
120-159              150       14.1           163          15.3
160-199              442       41.4           605          56.7
200-239              299       28.0           904          84.7
240-279              115       10.8          1019          95.5
280-319               34        3.2          1053          98.7
320-359                9        0.8          1062          99.5
360-399                5        0.5          1067         100
Total               1067      100
Charts
The frequency distribution of a categorical variable is often presented graphically as a bar chart or pie chart.
Bar charts: display the frequency distribution for nominal or ordinal data.
Horizontal axis: Labels of the variable
Vertical bar: Frequency or the relative frequency
The bars should be of equal width and should be separated from one another so as not to imply continuity.
Bar charts showing frequency distribution of the variable BWT described in Table
[Figure: two bar charts by BWT category (Very low, Low, Normal, Big). One panel shows Freq., the other shows Rel. Freq.]
Bar charts for comparison

In order to compare the distribution of a variable for two or more groups, the bars for the groups being compared are often drawn alongside each other in a single bar chart.
[Figure: bar chart indicating categories of birth weight of 9975 newborns grouped by antenatal follow-up of the mothers. Vertical axis: Percent (0 to 100); horizontal axis: BWT (Low, Normal, Big); legend: Yes, No. Bar labels: Low 9 and 7.9; Normal 88.9 and 89; Big 2.1 and 3.1]
Pie chart
Pie chart: displays the frequency distribution for nominal or ordinal data.
In a pie chart, the categories into which the observations fall are represented as sectors of a circle. The angle of each sector is proportional to the frequency or the relative frequency of the observations in that category.
Fig 3(a) Pie chart indicating frequency of categories of birth weight: Very low 43, Low 793, Normal 8870, Big 268
Fig 3(b) Pie chart indicating relative frequency of categories of birth weight: Very low 0.4, Low 8, Normal 88.9, Big 2.7
Displaying numerical variables

Graphs
- Histograms
- Frequency polygons
- Cumulative frequency polygons
- Box plots
Histograms
Histograms are frequency distributions with continuous class intervals that have been turned into graphs. Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a histogram.
Horizontal axis: Labels of the variable
Vertical bar: Frequency or the relative frequency
Except for the two boundaries, class intervals are usually chosen to be of equal width. If this is not the case, the histogram could give a misleading impression of the shape of the data.
Example: Consider the following table and the histogram showing the distribution of the age of women at the time of marriage
Age group   No. of women
15-19            11
20-24            36
25-29            28
30-34            13
35-39             7
40-44             3
45-49             2
[Figure: histogram of age of women at the time of marriage. Vertical axis: No. of women (0 to 40); horizontal axis: Age group, with class boundaries from 14.5-19.5 through 44.5-49.5]
[Figure: histogram displaying the frequency distribution of birth weight of newborns at Tikur Anbessa Hospital. Vertical axis: Frequency (0 to 2000); horizontal axis: Birth weight (800 to 5200). Std. Dev = 502.34, Mean = 3126, N = 9975]
Frequency polygons
Instead of drawing bars for each class interval, sometimes a single point is drawn at the midpoint of each class interval and consecutive points are joined by straight lines.
A graph drawn in this way is called a frequency polygon (line graph).

Frequency polygons are superior to histograms for comparing two or more sets of data.
[Figure: frequency polygon of birth weight of 9975 newborns at Tikur Anbessa Hospital for males and females. Vertical axis: % (0 to 50); horizontal axis: Birth Weight (500 to 5000); legend (SEX): Males, Females]
Cumulative frequency polygons

Horizontal axis: Labels of the variable
Vertical bar: cumulative relative frequency
The points are then connected by straight lines. Like frequency polygons, cumulative frequency polygons may be used to compare sets of data.
Cumulative frequency polygons can also be used to obtain percentiles of a set of data.
Roughly, the 50th percentile is the value at which the cumulative relative frequency reaches 50%.
Serum cholesterol levels, ages 25-34 (n = 1067)
Cholesterol level
(mg/100 ml)         Freq   Relative freq   Cum freq   Cum. rel. freq
80-119                13        1.2            13           1.2
120-159              150       14.1           163          15.3
160-199              442       41.4           605          56.7
200-239              299       28.0           904          84.7
240-279              115       10.8          1019          95.5
280-319               34        3.2          1053          98.7
320-359                9        0.8          1062          99.5
360-399                5        0.5          1067         100
Total               1067      100

Serum cholesterol levels, ages 55-64 (n = 1227)
Cholesterol level
(mg/100 ml)         Freq   Relative freq   Cum freq   Cum. rel. freq
80-119                 5        0.4             5           0.4
120-159               48        3.9            53           4.3
160-199              265       21.6           318          25.9
200-239              458       37.3           776          63.2
240-279              281       22.9          1057          86.1
280-319              128       10.4          1185          96.5
320-359               35        2.9          1220          99.4
360-399                7        0.5          1227         100
Total               1227      100
[Figure: frequency polygon and cumulative frequency polygon of serum cholesterol levels for 2294 males aged 25-34 and 55-64 years, 1976-1980. One panel shows Relative frequency (%) and the other Cumulative relative frequency (%), each against Serum cholesterol levels (mg/100 ml) in classes 80-119 through 360-399; legend: Ages 25-34, Ages 55-64]
Box Plots
A visual picture called a box plot can be used to convey a fair amount of information about the distribution of a set of data.
- The box shows the distance between the first and the third quartiles.
- The median is marked as a line within the box.
- The end lines show the minimum and maximum values respectively.
Illustration of Box-plot
[Figure: box plot on an axis labeled Numbers, 18 to 36]
[Figure: box plot indicating birth weight of 5092 newborns by gestational age at Tikur Anbessa Hospital. Groups (Gest. age): Pre, Term, Post; axis: Birth weight (grams), 500 to 5000]
Summary
Although some information is lost when data are summarized using tables and graphs, a great deal is gained.
Tables
- Tables are effective ways of summarizing categorical data
- Tables are more informative when they are not overly complex
- Tables and the columns within them should always be clearly labeled, and units of measurement should be specified
Diagrams
- Diagrams have greater attraction than mere figures. They give delight to the eye, add a spark of interest and as such catch the attention as much as the figures dispel it.
- They help in deriving the required information in less time and without any mental strain.
- They have greater memorizing value than mere figures, because the impression left by a diagram is of a lasting nature.
- They facilitate comparison
