
BIOSTATISTICS

Statistics deals with


Planning Research
Collecting Data
Describing Data
Summarizing- Presenting Data
Analyzing Data
Interpreting Results
Reaching decisions or discovering new knowledge
Biostatistics is the application of statistical methods to the health sciences.
DEFINITION:
It is the method of collecting, organizing, analysing, tabulating and
interpreting data related to living organisms and human beings.
[Soben Peter]
History
John Graunt (1620-1674), who was neither a physician nor a mathematician,
is regarded as the father of health statistics. The term biometry was coined by W.F.R. Weldon
(1860-1906), a zoologist at University College, London.
Use of biostatistics
To test whether the difference between two populations is real or a chance
occurrence.
To study the correlation between attributes in the same population.
To evaluate the effect of vaccines, sera etc.
To measure mortality and morbidity
To evaluate achievements of public health programs.
To fix priorities in public health programs.
To help promote health legislation and create administrative standards for
oral health.


Aims of biostatistics
To generate the statistical data through experimental investigation and
sample surveys.
To organize and represent the data in suitable tables, diagrams, charts or
graphs, etc
To draw valid inferences from the data collected, put forth definite
interpretations or predict the future outcomes from the data.

Why should medical/Dental students learn biostatistics?
1. Medicine is becoming increasingly QUANTITATIVE.
The aim is to improve the Health Status of the population.
We have to clarify the relationships between certain factors and
diseases.
Enumerate the occurrences of diseases
Explain the etiology of diseases (which factors cause which diseases)
Predict the number of disease occurrences
Read, understand and criticize the medical literature.
2. The planning, conduct and interpretation of much of medical research are
becoming increasingly reliant on statistical methods.
Planning:
How many patients must be treated?
How should the subjects be allocated to treatments?
What other factors may influence the response variable?
Conduct:
Under which conditions must the study be conducted?
Is matching necessary?
Is blinding (single blinding or double blinding) necessary?
Is there a need for a control group?
Should the placebo effect be considered?
Which experimental design technique is more appropriate?
Interpretation:
Example:

Distribution of Women with a Diagnosis of Thromboembolism Among Blood Groups
Blood Group   Frequency   %
A             32          58.2
AB             4           7.3
B              8          14.5
O             11          20.0
Total         55         100.0
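The percentage column of such a table follows directly from the frequencies. A minimal sketch in Python, using the frequencies from the table above (the variable names are illustrative only):

```python
# Frequencies copied from the blood group table above.
frequencies = {"A": 32, "AB": 4, "B": 8, "O": 11}

total = sum(frequencies.values())

# Percentage of each group = 100 * frequency / total, rounded to one decimal.
percentages = {group: round(100 * count / total, 1)
               for group, count in frequencies.items()}

for group, count in frequencies.items():
    print(f"{group:<6}{count:>4}{percentages[group]:>8.1f}")
print(f"{'Total':<6}{total:>4}")
```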

Terminologies:

Data : Set of values of one or more variables recorded on one or more
observational units.
Observation (case): Individual source of data.
Variable: This is a quantity which varies such that it may take any one of a
specified set of values. It may be measurable or non-measurable.
Population: A collection, or set, of individuals, objects, or measurements whose
properties are to be analyzed.
Sample: A subset of the population, selected in such a way that it is representative
of the larger population.
Parameter : A summary value which in some way characterizes the nature of the
population in the variable under study.
Statistic : A summary value calculated from a sample of observation.

DATA:
Sources of data
1. Routinely kept records
2. Published data sources
3. Data on electronic media
4. Surveys and Experimental research
5. Census
6. Generated or artificial data
Types of Data
1. Qualitative Data
Results from a variable that asks for a quality type of description of the subject.
2. Quantitative Data
Results from obtaining quantities-counts or measurements.
Types of biostatistics
Descriptive biostatistics
Inferential biostatistics

Descriptive biostatistics:
It is the study of biostatistical procedures which deal with the collection,
representation, calculation and processing, i.e., the summarization of the data to
make it more informative and comprehensible. The primary function of
descriptive statistics is to provide meaningful and convenient techniques for
describing features of data that are of interest. The failure to choose appropriate
descriptive statistics often lead to faulty scientific inference. The field of
descriptive statistics is not concerned with the implications or conclusions that can
be drawn from the sets of the data.


Inferential biostatistics:
It constitutes the procedures which serve to make generalization or drawing
conclusions on the basis of the studies of the sample. This is also known as
sampling biostatistics. The study of the quantitative aspects of the inferential
process provides a solid basis, on which the more general substantive process of
inference can be founded.

Basis for statistical analyses
Statistical analyses are based on three primary entities:
The population 'U' that is of interest
The set of characteristics (variables) of units of this population 'V'
The probability distribution 'P' of these characteristics in the population

The population 'U'
The population is a collection of units of observation that are of interest and
is the target of the investigation. For example , in determining the effectiveness of
a particular drug for a disease, the population would consist of all possible patients
with the disease. It is essential, in any research study, to identify the population
clearly and precisely. The success of the investigation will depend to a large extent
on the identification of the population of interest.

The variables 'V'
A variable is a state, condition, concept or event whose value is free to vary
within the population.
Once the population is identified, we should clearly define what
characteristics of the units of this population (subjects of the study) are we
planning to investigate.
For example in the case of a particular drug, one needs to define the disease
and what other characteristics of the people (e.g. age, sex, education, etc.) one
intends to study.
Clear and precise definitions and methods for measuring these
characteristics (a simple observation, a laboratory measurement, or tests using a
questionnaire) are essential for the success of the research study.

Variables can be classified as:
Independent variables: variables that are manipulated or treated in a study
in order to see what effect, differences in them will have on those
variables proposed as being dependent on them. Synonyms: cause, input,
predisposing factor, antecedent, risk factor, characteristic, attribute,
determinant.
Dependent variables: variables in which changes are results of the level or
amount of the independent variable or variables. Synonyms: effect,
outcome, consequence, result, condition, disease.
Confounding or intervening variables: variables that should be studied
because they may influence or 'confound' the effect of the independent
variables on the dependent variables. E.g. the study of tobacco
(independent variable) on oral cancer (dependent variable), the nutritional
status of the individual may play an intervening role.
Background variables: variables that are so often of relevance in
investigations of the groups or populations that they should be considered
for possible inclusion in the study. Examples: sex, age, ethnic origin,
education, marital status, social status.

The probability distribution 'P'
The probability distribution is a way to enumerate the different values the
variable can have, and how frequently each value appears in the population. The
actual frequency distribution is approximated to a theoretical curve that is used as
the probability distribution. Common examples of probability distributions are
binomial and normal.
For example, the incidence of a relatively common illness may be approximated
by a binomial distribution, whereas the distributions of continuous variables
(blood pressure, heart rate) are often considered to be normally distributed.
Probability distributions are characterized by parameters, i.e., quantities that
allow us to calculate the probabilities of various events concerning the variable, or
that allow us to determine the value of probability for a particular value.
The binomial distribution has two parameters, n and p. It occurs when a fixed
number of subjects (n) are observed, the characteristic is dichotomous in nature (only
two possible values), and each subject has the same probability (p) of having one
value and (1-p) of having the other value.
The normal distribution, on the other hand, is a mathematical curve
described by two quantities, μ and σ. The former represents the mean of the
values of the variable, and the latter the standard deviation. The type of statistical
analyses done depends on the design of the study.
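As a rough illustration of how the parameters determine probabilities, the sketch below evaluates a binomial probability and a normal probability. The numbers (10 subjects, p = 0.2, a blood pressure mean of 120 with standard deviation 10) are hypothetical and are not taken from these notes.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k subjects with the characteristic out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: 10 subjects, each with probability 0.2 of illness.
n, p = 10, 0.2
prob_exactly_two = binomial_pmf(2, n, p)

# Hypothetical example: blood pressure with mean 120 and SD 10.
blood_pressure = NormalDist(mu=120, sigma=10)
prob_below_130 = blood_pressure.cdf(130)  # area under the curve up to mean + 1 SD

print(round(prob_exactly_two, 4), round(prob_below_130, 4))
```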

Collection of data

In scientific research work, data is collected from personal experimental
study, i.e., primary data is used. Statistical data can be collected in two ways:
1. census method
2. sampling method
Census method
In this method the data is collected from all the individual items that are
connected with the inquiry.

Advantages of census method
i. The data has high degree of accuracy.
ii. The data is more representative and true.
iii. Results are more reliable.
iv. Possibility of bias is minimised.
Disadvantage of census method
i. It is less economical as it consumes more time, more energy and more
expenditure.
ii. It requires organizational skills and large number of investigators.
iii. It cannot be applied to all the situations, e.g to determine the blood cell
count it is not possible to analyse the whole blood.


Sampling method
In this method the data is collected from a small group of population which
is termed as sample. A sample is a portion of the population selected to represent
the population.

Types of samples
There are two types of samples which are used in biostatistics :

1. Qualitative samples: when we say that children from an African population are
taller than those in India, it is called a qualitative sample.
2. Quantitative samples: when we try to find the number of decayed teeth of
individuals of a particular age group, it is called a quantitative sample.

Size of samples

The total number of units which are used in the study to get significant
results is termed as sample size. To select the proper sample size is very important.
The sample size should not be very small or very large because the conclusions are
directly affected by it.
Advantages of the sampling method
This method is comparatively more economical as it consumes less energy,
less time and less expenditure.
It requires fewer investigators.
It is most suited to those places and situations where census method cannot
be applied.

Disadvantages of sampling method
It requires services of experts, otherwise incorrect or misleading results will
be obtained.
In this method selection of appropriate method of sampling is necessary.

If the population is very small and we need precise information then the census
method is preferred. If the population is very large or the field of investigation is
very wide and the quick results are required, sampling methods should be used.

Types of sampling methods
There are two types of sampling techniques:
a. Random or probability sampling
1. Simple randomized sampling
2. Stratified randomized sampling
3. Systematic sampling
4. Cluster sampling
5. Multistage sampling
b. Nonrandom or nonprobability sampling
1. Convenience sampling
2. Purposive sampling
3. Quota sampling

Random or probability sampling
In random sampling a sample is selected in such a way that every element in
the population has an equal opportunity of being included in the sample. It means
random sampling is made without deliberate discrimination. Random sampling is
carried out to ascertain a particular character of the population. It involves
unbiased or non preferential samples.
Selection of random samples
Sampling without replacement
In this type of sampling an observation is included only once and is selected
randomly without any preference or conscious effort.


Sampling with replacement
In this type of sampling the observation has a chance to be selected at each draw.

Properties of random samples:
The several samples drawn from the same population will differ, i.e. their
statistical characteristics will change from sample to sample.
Random sample should be large, because larger the sample, lesser will be
the variation of characteristics of the sample from one random sample to
another.
A random sample must be selected in such a way that every element in the
population had an equal opportunity of being included in the sample.

Advantages of random sampling
The main advantages of random sampling are:
1. The random sampling enables the researcher to draw inferences about the
whole population.
2. It eliminates personal bias. The researcher cannot reject those observations
which do not support his theory. Similarly, the researcher cannot select only
those observations which may support his theory.

Types of random sampling methods
1. Simple randomized sampling
2. Stratified randomized sampling
3. Systematic sampling
4. Cluster sampling
5. Multistage sampling
Simple random sampling
In this method samples are chosen at random and each member or sample
unit of the population has an equal chance of being selected in the sample. This
method is well applicable when the population is small, homogenous and readily
available. This method is sometimes called unrestricted random sampling.


Simple random sampling
o Every possible sample of a certain size within a population has a
known and equal probability of being chosen
o Most basic type of probability sampling.
o Actual selection is done by randomly picking the desired number of units
from the population.
o Statistically equivalent to
identifying all possible samples of the desired size
picking one of those samples at random
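The equivalence stated in the last point can be made concrete for a very small population. This is only a sketch: the five unit labels are hypothetical, and listing every possible sample is practical only when the population is tiny.

```python
import random
from itertools import combinations

population = ["P1", "P2", "P3", "P4", "P5"]  # hypothetical units
sample_size = 2

# Identify all possible samples of the desired size...
all_samples = list(combinations(population, sample_size))

# ...and pick one of those samples at random (each has probability 1/10).
rng = random.Random(0)
chosen = rng.choice(all_samples)

print(len(all_samples), chosen)
```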

Stratified random sampling
Samples are chosen at random from different strata, usually of different sizes, of
a population, based on a priori information about the variation and site.
Stratified random sampling is done in heterogeneous populations, i.e., this
procedure is followed when population is not homogenous. A heterogeneous
population is divided into several more or less homogenous sections or groups.
These are called strata. A sample is drawn from each stratum by simple random
sampling. Thus the variability in each stratum is adequately represented in the
sample also.
Stratified random sampling
o Probability sampling procedure
o The chosen sample is forced to contain units from each of the
segments, or strata, of the population
o Also known as proportional or quota random sampling; it involves
dividing the population into homogeneous subgroups
taking a simple random sample in each subgroup
o Statistically more efficient
o Provides more accurate estimates of population variables
o Two types of stratified random sampling
proportionate
disproportionate
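A proportionate stratified sample can be sketched as follows. The strata ("urban", "rural"), their sizes and the function name are hypothetical, chosen only to show the allocation rule; rounding the allocations may not always add up to the requested total in less tidy cases.

```python
import random


def proportionate_stratified_sample(strata, total_n, rng):
    """Draw a simple random sample from each stratum, sized in proportion
    to that stratum's share of the population."""
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_stratum = round(total_n * len(units) / population_size)
        sample[name] = rng.sample(units, n_stratum)
    return sample


# Hypothetical population: 60 urban and 40 rural subjects.
strata = {
    "urban": [f"U{i}" for i in range(60)],
    "rural": [f"R{i}" for i in range(40)],
}
sample = proportionate_stratified_sample(strata, total_n=10, rng=random.Random(1))
print({name: len(units) for name, units in sample.items()})
```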



Systematic sampling

This is a simple procedure and utilized when a complete list of population
from which a sample is to be drawn is available. It is more often applied to field
studies when population is large, scattered and heterogeneous. In this sampling
method, samples are drawn evenly spaced after a random start position A is
chosen. From a large population, samples are selected every 10th, 20th, 25th or
50th item.

Cluster sampling
In this method the population is divided into separate natural groups of
elements. These groups are called clusters. Each cluster includes only one type of
elements. A simple random sample of clusters is then taken. A cluster may
consist of units such as villages, wards, blocks, factories, slums of a town, children
of a school, etc.
Generally the clusters are natural groupings and if they are geographic
regions, the sampling is called as 'area sampling'.



Cluster sampling
o Probability sampling procedure
o Clusters of population units are selected at random
o All or some units in the chosen clusters are studied.
o When an adequate sampling frame of individual population units is not
readily available, cluster sampling is helpful.
o Even when such a sampling frame is available, if the frame can be
conveniently divided into a series of representative clusters, a cluster
sampling approach may be easier to use than a simple or stratified
random-sampling approach.
In cluster sampling, we follow these steps:
o divide population into clusters (usually along geographic boundaries)
o randomly sample clusters
o measure all units within sampled clusters
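The three steps above can be sketched in a few lines. The villages and their members are hypothetical placeholders, not data from these notes.

```python
import random

# Step 1: the population divided into clusters (hypothetical villages).
clusters = {
    "village_A": ["A1", "A2", "A3"],
    "village_B": ["B1", "B2"],
    "village_C": ["C1", "C2", "C3", "C4"],
    "village_D": ["D1", "D2"],
}

# Step 2: randomly sample clusters, not individual units.
rng = random.Random(7)
chosen_clusters = rng.sample(sorted(clusters), k=2)

# Step 3: measure all units within the sampled clusters.
measured_units = [unit for name in chosen_clusters for unit in clusters[name]]

print(chosen_clusters, measured_units)
```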




Multistage sampling
In multistage sampling the clusters or segments are selected in the primary
cluster sample and these secondary clusters are again sampled instead of being
fully inspected. This procedure is employed in large-scale, country-wide surveys.
Systematic sampling
o The researcher selects the first unit randomly and
the remaining units systematically
o Number the units in the population from 1 to N
o Decide on the n (sample size) that you want or need
o k = N/n = the interval size
o Randomly select an integer between 1 and k
o Take every kth unit from that starting point
o Simple relative to the other methods
o Requires only one random number to select a sample
o Statistical efficiency is practically equivalent to that of simple random sampling
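The procedure above can be sketched as a short function. The function name, the population of 100 numbered units and the seed are assumptions for illustration, and the sketch assumes N is an exact multiple of n.

```python
import random


def systematic_sample(units, n, rng):
    """Select every kth unit after a random start between 1 and k."""
    N = len(units)
    k = N // n                  # interval size (assumes N is a multiple of n)
    start = rng.randint(1, k)   # random integer between 1 and k, inclusive
    # Units are numbered 1..N, so unit number i sits at list index i - 1.
    return [units[i - 1] for i in range(start, N + 1, k)]


units = list(range(1, 101))     # hypothetical population numbered 1 to 100
sample = systematic_sample(units, n=10, rng=random.Random(3))
print(sample)
```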



Practical Considerations: Probability Sampling Methods
o Not all methods may be equally practical in any project.
o Base choice upon
Nature of the population
Degree of precision desired
Resources available for research.
Nonrandom or nonprobability sampling
Nonprobability sampling
o Subjective procedure
o Probability of selection for the population units cannot
be determined.
o The selection is not done on a strictly chance basis
o Offers researchers greater freedom and flexibility in sampling.
o Nonprobability samples
Cannot depend upon the rationale of probability theory.
May or may not represent the population well
Difficult to know how well we've done so.
o There may be circumstances where random sampling is not
feasible
practical
theoretically sensible.

In nonrandom sampling, the samples are drawn without following any
criteria or any yardstick. The sample collected does not show any specific
approach, nor can the samples be used to assess properly the accuracy of the
estimator. In this sampling procedure many investigator biases are likely to occur.
This is of three types :
1. Accidental, Haphazard or Convenience Sampling: this is also known as
accessibility sampling. The major reason for using it is
administrative convenience. The sample is chosen with ease of access being
the sole concern.
Convenience sampling
o a researcher's convenience forms the basis for selecting a
sample of units
o Very popular in online research, and is known as intercept sampling or
pop-up surveys.
o Traditional "man on the street" interviews conducted frequently by
television news programs
o Use of college students in much psychological research is primarily a
matter of convenience.
o Many research projects simply ask for volunteers.

2. Judgement/Purposive sampling: this is also known as judgemental
sampling. The experimenter exercises deliberate subjective choice in
drawing the representative sample. Judgemental sampling aims
at eliminating anticipated sources of distortion, but there will always
remain the risk of distortion due to personal prejudices or lack of knowledge
of certain crucial features in the structure of the population.
Judgment sampling
o researcher exerts some effort in selecting a sample that is believed to be most
appropriate.
Researcher will usually be knowledgeable about the nature of the ideal population.
Requires greater researcher effort
Generally more appropriate than a convenience sample.
Can be very useful
when you need to reach a targeted sample quickly
when sampling for proportionality is not the primary concern.
Likely to yield opinions of your target population
Likely to overweight more readily accessible subgroups.
3. Quota sampling: this combines convenience and judgement and is more
structured than either of the two. Quota sampling needs a proper statistical
design to determine what numbers are needed in each of the quotas.

Quota sampling
o Sampling a quota of units, selected from each population cell,
based on the researcher's judgment
Most refined form of nonprobability sampling
Often used in practice, especially in personal interviewing.
Resembles stratified random sampling
Features of judgment and convenience sampling as well.
Select people nonrandomly according to some fixed quota.
Proportional quota sampling
o Represent the major characteristics of the population
o Sample a proportional amount of each.
o Example
Population has 40% women and 60% men
Required sample size of 100
Continue sampling until achieving the percentages
Then stop.
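The 40% / 60% example above can be sketched as a loop that keeps accepting people until each quota is full. The stream of passers-by is simulated here with a random generator purely as a stand-in for whoever happens to be available; it does not make the procedure a probability sample.

```python
import random

quotas = {"woman": 40, "man": 60}        # 40% and 60% of a sample of 100
counts = {group: 0 for group in quotas}
sample = []

rng = random.Random(5)                   # stand-in for the order people turn up in
while any(counts[group] < quotas[group] for group in quotas):
    person = rng.choice(["woman", "man"])    # hypothetical next available person
    if counts[person] < quotas[person]:      # accept only while the quota is open
        counts[person] += 1
        sample.append(person)

print(counts)
```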
Non-proportional quota sampling
Specify the minimum number of sampled units you want in each category.
Not concerned with numbers that match the proportions in the population.
Simply need enough to assure the ability to talk about even small groups
in the population.
Nonprobabilistic analogue of stratified random sampling
Typically used to assure that smaller groups are adequately represented
4. Heterogeneity Sampling aka Sampling for diversity
o To include all opinions or views
o Not concerned about representing these views proportionately.
o Obtain a broad spectrum of ideas
o Not identifying the "average" or "modal instance" ones.
o Sampling ideas not people.
o To get all of the ideas (especially the unusual ones)
o Include a broad and diverse range of participants.
5. Snowball Sampling
o Identifying someone who meets the criteria for inclusion
o Ask them to recommend others who
they may know
also meet the criteria.
o Useful when trying to reach populations that are inaccessible or hard to find.
A. Sampling Error
o The difference between a statistic value generated through sampling and
the parameter value, which can be determined only through a census study
o Magnitude of the sampling error says how precisely the population parameter can be estimated from
a sample value
o Estimate the average amount of sampling error associated with a given sampling procedure.
o True population parameter value is unknown
o Sample statistic value may vary from sample to sample within the population
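Because the true parameter is normally unknown, sampling error is easiest to see in a simulation where the whole population is available. The sketch below builds a hypothetical population of 10,000 values, then compares repeated sample means with the population mean; all numbers are assumptions for illustration.

```python
import random
from statistics import mean

rng = random.Random(11)

# Hypothetical population; in this simulation we can act as if a census was done.
population = [rng.gauss(120, 10) for _ in range(10_000)]
parameter = mean(population)             # population parameter (census value)

# Draw repeated samples of 50 and record statistic minus parameter.
errors = []
for _ in range(200):
    sample = rng.sample(population, 50)
    errors.append(mean(sample) - parameter)

# Average size of the sampling error for this sampling procedure.
average_abs_error = mean(abs(e) for e in errors)
print(round(average_abs_error, 2))
```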



PRESENTATION OF DATA
Objective of classification of data :
make the data simple,
concise, meaningful,
interesting and
helpful in further analysis.
two main methods of presenting data:
Tabulation and
Diagrams
TABULATION
classified on the following bases:
Geographical, i.e., area-wise, e.g. cities, districts etc.
Chronological, i.e., on the basis of time.
Qualitative, i.e., according to some attribute.
Quantitative, i.e., in terms of magnitude.
The two elements of classification are
The variable and
The frequency.
Variable: a name denoting a condition , occurrence or effect that can assume
different values
The variable is divided into subgroups, or classes,
each of which has a lowest and a highest value.
Class interval: difference between the upper and lower limit of a class
E.g.: in the class 5-14,
5 is the lower limit and 14 is the upper limit.
class interval = 14 - 5 = 9.
Frequency: is the number of units belonging to each group of the variable.
Frequency distribution table: way of presenting data in the tables
Frequency distribution table
Title of the table named at the bottom
The no of class intervals - between 5 and 20. no rigidity about it.
The class intervals - at equal width.
Clearly defined class limits to avoid ambiguity.
e.g., 0-4, 5-9, 10-14, etc.
Clearly defined row and column with the headings
Units of measurement should be specified.
If the data is not original, the source of the data should be mentioned at the
bottom of the table.
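A frequency distribution table with equal-width, non-overlapping classes such as 0-4, 5-9, 10-14 can be sketched as follows. The twelve observations are hypothetical.

```python
from collections import Counter

observations = [2, 7, 3, 12, 9, 4, 11, 6, 1, 14, 8, 5]   # hypothetical data
width = 5                                                 # equal class width

# Assign each observation to the lower limit of its class (0, 5, 10, ...).
counts = Counter((value // width) * width for value in observations)

# Label each class with clearly defined limits, e.g. "0-4", "5-9", "10-14".
table = {f"{lower}-{lower + width - 1}": counts[lower] for lower in sorted(counts)}

for label, frequency in table.items():
    print(f"{label:<8}{frequency:>3}")
print(f"{'Total':<8}{sum(table.values()):>3}")
```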

Diagrams:
Extremely useful
attractive to the eyes,
give a bird's eye view of the entire data,
have a lasting impression
TYPES OF DIAGRAMS:
Bar Diagram : qualitative data.
Multiple Bar: qualitative data
Component Bar Diagram: qualitative data.
Proportional Bar Diagram
Histogram: quantitative data of continuous type.
Frequency Polygon: quantitative data
Pie Diagram: qualitative data
Line diagram: changes in the value of a variable over time
Cartograms or Spot Map: geographical distribution of frequencies

Basic rules :
Self explanatory
Simple and consistent with the data.
Values of the variables - on horizontal or X-axis and the frequency - vertical
line or Y-axis.
Not too many lines on the graph; it should not look cluttered.
The scale of presentation should be shown in the top right-hand corner of the graph.
The scale of division of the two axes should be proportional.
The details of the variables and frequencies presented on the axes.
Bar Diagram
Represent qualitative data.
Only one variable.
width of the bar remains the same
The length varies according to the frequency in each category.
Bars: vertical or horizontal.
Limitation:
represent only one classification
cannot be used for comparison
Facilitate comparison of data relating to different time periods and regions.

Multiple Bar:
compare qualitative data with respect to a single variable.
Eg: sex wise or with respect to time or region.
each category of the variable have a set of bars of the same width
corresponding to the different sections without any gap in between the width
and the length corresponds to the frequency.

Component Bar Diagram:
represent qualitative data.
both, the number of cases in major groups as well as the subgroups
simultaneously
cases of the major group drawn
each rectangle is divided according to no in the subgroups.

Proportional Bar Diagram:
represent qualitative data.
compare only the proportion of sub-groups between different major groups
of observations, then bars are drawn for each group with the same length,
either as 1 or 100%. These are then divided according to the sub-group
proportion in each major group.
PIE DIAGRAM
The frequency of the group is shown in a circle.
Degree of angle denotes the frequency.
Instead of comparing the length of bar , the areas of segments are compared.
Line diagram:
useful to study changes of values in the variable over time
simplest type
X-axis, - hours, days, weeks, months or years
Y-axis- value of any quantity pertaining to X-axis,
Histogram
quantitative data of continuous type.
bar diagram without gap between the bars.
represents a frequency distribution.
X-axis: the size of an observation is marked. Starting from 0 the limit of
each class interval is marked, the width corresponding to the width of the
class interval in the frequency distribution.
Y-axis :the frequencies are marked. A rectangle is drawn above each class
interval with height proportional to the frequency of that interval.
Frequency Polygon
frequency distribution of quantitative data
compare two or more frequency distributions.
a point is marked over the mid-point of the class interval, corresponding to
the frequency.
points are connected by straight lines.
The first point and last point are joined to the midpoint of previous and next
class respectively.
SCATTER DIAGRAM







Cartograms or Spot Map
show geographical distribution of frequencies of a characteristic.
PICTOGRAM
The pictures representing the value of items are called pictograms.
It is the most useful way of representing data to people who cannot
understand tables and graphs.
Measures of central tendency:
single estimate of a series of data that summarizes the data is
known as the parameter and one such parameter is the measure of central
tendency.
Objective:
to condense the entire mass of data
[Figure: scatter diagram of the height and weight of 20 students of CODS; X-axis: height in feet, Y-axis: weight in kg]
to facilitate comparison
Types:
Arithmetic mean- mathematical estimate.
Median - positional estimate.
Mode- based on frequency.
Properties of central tendency:
should be based on each and every item in the series.
should not be affected by extreme observations (either too small or too large
values).
should be capable of further statistical computations.
It should have sampling stability, i.e., if different samples of the same size
are picked from the same population and the measure of central
tendency is calculated, they should not differ from each other markedly.
Arithmetic Mean:
simplest measure of central tendency.
Ungrouped data:
Mean = (Sum of all the observations of the data) / (Number of observations in the data)
1. Grouped data with range for class interval:
frequencies in a class interval are equally distributed on either side of the
mid point of the class interval.
The formula:
\[ \bar{X} = \frac{\sum X_i f_i}{\sum f_i} \]
Where,
\( \bar{X} \): mean
\( X_i \): midpoint of the class interval
\( f_i \): corresponding frequency
2. Grouped data with single value for class interval:
Symbolically,
\[ \bar{X} = \frac{\sum X_i f_i}{\sum f_i} \]
Where,
\( X_i \): the grouped variable,
\( f_i \): corresponding frequency
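A short sketch of the grouped mean, taking the midpoint of each class interval as \( X_i \). The class intervals and frequencies are hypothetical.

```python
# Hypothetical grouped data: (lower limit, upper limit) -> frequency.
classes = {(0, 4): 4, (5, 9): 5, (10, 14): 3}

sum_fx = 0.0   # running total of X_i * f_i
sum_f = 0      # running total of f_i
for (lower, upper), frequency in classes.items():
    midpoint = (lower + upper) / 2       # X_i: midpoint of the class interval
    sum_fx += midpoint * frequency
    sum_f += frequency

grouped_mean = sum_fx / sum_f
print(round(grouped_mean, 2))
```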
MEDIAN
middle value in a distribution such that one half of the units in the
distribution have a value smaller than or equal to the median and one half
have a value higher than or equal to the median.
Calculation of Median:
Ungrouped Data:
observations are arranged in the order of magnitude & then the middle value
of the observations : median.
Odd number of observations: the ((n + 1) / 2)th value
Even number of observations: the mean of the two middle values
Grouped data: the value of the (total number of observations / 2)th item


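Symbolically, the arithmetic mean of ungrouped data is
\[ \bar{X} = \frac{\sum X_i}{n} \]
\( \sum \): sigma, means the sum of,
\( X_i \): is the value of each observation in the data,
\( n \): is the number of observations in the data.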
MODE
value in a series of observations which occurs with the greatest frequency.
Eg: series on age at eruption of the canine as 6,6,5,7, 8, 6, 7, 5;
6 - mode.
When the mode is ill defined:
Mode = 3 Median - 2 Mean.
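The median, the mode and the empirical relation above can be checked on the canine eruption ages quoted in the text, using Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median, mode

ages = [6, 6, 5, 7, 8, 6, 7, 5]      # age at eruption of the canine, from the text

middle_value = median(ages)           # even n: mean of the two middle values
most_frequent = mode(ages)            # value with the greatest frequency

# Empirical relation, for use when the mode is ill defined.
empirical_mode = 3 * median(ages) - 2 * mean(ages)

print(middle_value, most_frequent, empirical_mode)
```

For this small series the empirical relation gives only an approximation to the observed mode.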
Variability & its measures
Types
Biological variability
Real variability
Experimental variability
Biological variability
Normal or natural differences within accepted biological limits
Individual variability
Periodical variability
Class , group or category variability
Real variability
When the difference between two readings is more than the defined limits
Due to the external factors
Experimental variability
Errors or variations due to materials & methods
Observer error: subjective error, objective error
Instrument error
Sampling error
Measures of variability
Synonyms:
Measures of dispersion
Measures of variation or scatter
Dispersion is the degree of spread or variation of the variable about a central
value.
Uses:
Determine reliability of an average
Serve as a basis of control of variability
Comparison of two or more series
Facilitate further statistical analysis
A good measure of dispersion : simple , easy to compute , based on all items
, amenable for further analysis and not affected by extreme values.
Of individual observations -
Range
Interquartile range
Mean deviation
Standard deviation
Coefficient of variation
Variability of samples-
Standard error of mean
Standard error of difference between 2 means
Standard error of proportion
Standard error of difference between 2 proportions
Standard error of correlation coefficient
Standard deviation of regression coefficient
Range
Difference between the value of the smallest item and the value of the
largest item.
simplest method.
gives no information about the values that lie between the extreme values.
subject to fluctuations from sample to sample.
Mean deviation
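The average of the absolute deviations from the arithmetic mean
\[ M.D. = \frac{\sum |x - \bar{x}|}{n} \]
Example: two series with the same mean (60) but different dispersion:
52, 44, 54, 56, 60, 64, 66, 76, 60, 68
41, 54, 43, 45, 60, 75, 77, 66, 79, 60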
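The range and the mean deviation of the two series listed above can be computed as a quick check. This is a sketch; the function names are illustrative.

```python
series_a = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
series_b = [41, 54, 43, 45, 60, 75, 77, 66, 79, 60]


def value_range(values):
    """Difference between the largest and the smallest item."""
    return max(values) - min(values)


def mean_deviation(values):
    """Average of the absolute deviations from the arithmetic mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


for name, series in (("A", series_a), ("B", series_b)):
    print(name, sum(series) / len(series), value_range(series),
          round(mean_deviation(series), 1))
```

Both series have a mean of 60, but the second is more widely spread on both measures.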
Standard deviation:
most important and widely used
it is the square root of the mean of the squared deviations from arithmetic
mean.
root mean square deviation
Greater the deviation greater the dispersion
Smaller the deviation- higher degree of uniformity
Calculation of S.D
For ungrouped data:
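Calculate the mean, \( \bar{x} \)
Find the difference of each observation from the mean, \( d = x_i - \bar{x} \)
Square these: \( d^2 \)
Total these: \( \sum d^2 \)
Divide this by the number of observations minus 1:
\( \text{variance} = \sum d^2 / (n-1) \)
The square root of this variance is the S.D.:
\[ S.D. = \sqrt{\frac{\sum d^2}{n-1}} \]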
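The steps for ungrouped data can be followed one by one in code. The sketch below uses the first of the two example series from the mean deviation section and cross-checks the result against `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

values = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]

mean_x = sum(values) / len(values)            # calculate the mean
deviations = [x - mean_x for x in values]     # d = x_i - mean
squared = [d ** 2 for d in deviations]        # square these
total = sum(squared)                          # total these
variance = total / (len(values) - 1)          # divide by n - 1
sd = sqrt(variance)                           # square root of the variance

print(round(variance, 2), round(sd, 2), round(stdev(values), 2))
```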
For grouped data: with single units for class intervals
Make frequency table
Determine mid pt of each range
$SD = \sqrt{\frac{\sum (X_i - \bar{x})^2 f_i}{n - 1}}$
$X_i$ - individual observation in the class
$\bar{x}$ - mean
$f_i$ - frequency
$n$ - total frequency
Calculation for grouped data with range for class interval:
Class intervals in terms of range:
Frequencies are treated as centered at the class midpoints
$S = \sqrt{\frac{\sum (X_i - \bar{x})^2 f_i}{n - 1}}$
$X_i$ - midpoint of class interval
$\bar{x}$ - mean
$f_i$ - frequency
$n$ - total frequency
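A short sketch of the grouped-data calculation, assuming each class is represented by its midpoint and weighted by its frequency. The class intervals (10-19, 20-29, 30-39), their frequencies and the function name are hypothetical and chosen only for illustration.

```python
import math


def grouped_sd(midpoints, frequencies):
    """Mean and SD from class midpoints X_i and frequencies f_i."""
    n = sum(frequencies)  # total frequency
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    # Sum of squared deviations of each midpoint, weighted by frequency.
    ss = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return mean, math.sqrt(ss / (n - 1))


# Hypothetical classes 10-19, 20-29, 30-39 with midpoints 14.5, 24.5, 34.5.
midpoints = [14.5, 24.5, 34.5]
frequencies = [2, 5, 3]

mean, sd = grouped_sd(midpoints, frequencies)
print("mean:", mean, "SD:", round(sd, 3))
```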

Uses of standard deviation
Summarizes the deviations , of a large distribution
Indicates whether the variation from mean is by chance or real
Helps in finding standard error
Helps in finding the suitable size of sample
Standard deviation is only interpretable as a summary measure for variables
having approximately symmetric distributions
Coefficient of variation
Compare relative variability
Variation of same character in two or more series
compare the variability of one character in two different groups having
different magnitude of values or
to compare two characters in the same group by expressing in percentage
$CV = \frac{S.D.}{\text{mean}} \times 100$
Higher the C.V greater variability
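As an illustration of comparing two characters measured in different units, the sketch below computes the coefficient of variation for two made-up series. The height and weight values and the function name are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (SD / mean) x 100, expressed as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical measurements on the same group of five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print("CV height: %.2f%%" % cv_height)
print("CV weight: %.2f%%" % cv_weight)
```

In this made-up group, weight shows the higher C.V. and therefore the greater relative variability, even though the two characters are in different units.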
Normal distribution & Normal curve
Height of bars or curve greatest in middle
Values are spread around mean
Maximum values around mean , few at extremes
half values above & half below mean

Properties of the normal Distribution
curve is bell shaped.
The curve is symmetrical about the middle point.
The mean is located at the highest point of the curve
measures of central tendency coincide.
Maximum number of observations is at the value of the variable
corresponding to the mean
number of observations gradually decreases on either side, with very
few observations at the extreme points.
The area under the curve between any 2 points corresponds to the number of
observations between those 2 values of the variate, and can be expressed in terms
of the mean and the SD:
a) Mean ± 1 S.D. covers 68.3% of the observations;
b) Mean ± 2 S.D. covers 95.4% of the observations;
c) Mean ± 3 S.D. covers 99.7% of the observations.
This relationship is used for fixing confidence interval.
Normal distribution law forms the basis for various tests of significance
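The coverage figures for 1, 2 and 3 S.D. can be checked numerically. A minimal sketch, using the fact that for a normal distribution the proportion within k S.D. of the mean equals erf(k / sqrt(2)); the function name is illustrative.

```python
import math


def coverage_within(k):
    """Proportion of a normal distribution lying within mean +/- k SD."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print("mean +/- %d SD covers %.1f%%" % (k, coverage_within(k) * 100))
```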
Relative or standard normal deviate
Deviation from mean in normal distribution
Measured in terms of S.D
indicates how much an observation is bigger or smaller than the mean in units
of SD
$Z = \frac{\text{observation} - \text{mean}}{SD} = \frac{x - \bar{x}}{S}$
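A small sketch of the standard normal deviate, with hypothetical values (a mean of 160 and an SD of 10 are assumed purely for illustration):

```python
def z_score(observation, mean, sd):
    """Deviation of an observation from the mean, in units of SD."""
    return (observation - mean) / sd


# Hypothetical population: mean 160, SD 10.
z_above = z_score(175, 160, 10)   # observation above the mean
z_below = z_score(150, 160, 10)   # observation below the mean
print(z_above, z_below)
```

A positive Z means the observation lies above the mean, a negative Z below it.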
Probability or chance
relative frequency or probable chances of occurrence with which an event
is expected to occur on an average
Expressed as p
Ranges from 0-1
when p= 0, no chance of event happening
When p=1 , 100% chances of event happening
$p = \frac{\text{number of events occurring}}{\text{total number of trials}}$
Statistical hypothesis
Methods to estimate the difference b/w estimates of samples
Two hypotheses are made:
Null hypothesis or hypothesis of no difference
Alternative hypothesis of significant difference
Null hypothesis or hypothesis of no difference [$H_0$]
Asserts that there is no real difference between the sample & the general population
The difference found is accidental & arises out of sampling variations
Alternative hypothesis of significant difference [$H_1$]
States that the sample result is different from the hypothetical value of the
population
To minimize errors the sampling distribution or area under normal curve is
divided into two regions or zones
1. Zone of acceptance: the sample falls within the area of mean ± 1.96 SE; the null
hypothesis is accepted
2. Zone of rejection: the sample falls in the area beyond mean ± 1.96 SE; the
null hypothesis is rejected
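The two zones can be expressed as a simple check. This is a sketch under stated assumptions: the population mean, standard error and sample means are hypothetical, and the 1.96 limit corresponds to the 5% level of significance.

```python
def in_zone_of_acceptance(sample_mean, population_mean, se, z_limit=1.96):
    """True if the sample mean lies within population mean +/- z_limit * SE."""
    lower = population_mean - z_limit * se
    upper = population_mean + z_limit * se
    return lower <= sample_mean <= upper


# Hypothetical: population mean 120, standard error 2,
# so the zone of acceptance runs from 116.08 to 123.92.
print(in_zone_of_acceptance(123, 120, 2))  # inside the zone
print(in_zone_of_acceptance(125, 120, 2))  # beyond the zone
```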
Degrees of freedom
The quantity in the denominator which is one less than the independent no of
observations in sample.
Eg:
When there are 10 values, there are 9 choices or degrees of freedom
In unpaired t-test of difference between 2 means,
$df = n_1 + n_2 - 2$, where $n_1$ & $n_2$ are the numbers of observations in the two samples.
In paired t-test, $df = n - 1$
Standard error
A measure of variability of the sample mean
Obtained by SD / square root of the sample size
$SE = \frac{SD}{\sqrt{n}}$
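A brief sketch of the standard error of the mean, reusing the first example series from the mean deviation section; the function name is illustrative.

```python
import math
import statistics


def standard_error(values):
    """SE of the mean: sample SD divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))


observations = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
se = standard_error(observations)
print(round(se, 3))
```

Because n appears in the denominator, a larger sample with the same SD gives a smaller standard error.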
2 types of errors:
Type I error
Type II error
Type I error:
The null hypothesis is rejected when it is true
The null hypothesis is rejected even though the sample falls in the zone of acceptance
Serious error.
Type II error (beta error):
The null hypothesis is wrongly accepted
The null hypothesis is accepted even though the sample falls in the zone of rejection
Not a serious error; needs only confirmation of the result by changing the level of
significance

                            Accept it           Reject it
Null hypothesis is true     Correct decision    Type I error
Null hypothesis is false    Type II error       Correct decision

Tests of significance
Parametric and non parametric tests or methods
Parametric methods
Methods of statistical inference that are based on the assumption that the
population has a certain probability distribution, together with the resulting
collection of statistical tests and procedures, are referred to as parametric
methods. For example, the t-distribution and F-distribution are associated with
the values of parameters of an assumed normal probability distribution.
Non parametric methods
Statistical procedures that do not require assumptions about the form of the
probability distribution from which the observations come are known as non
parametric methods. These are also called distribution free methods. For example,
chi square frequency techniques are non parametric.

Parametric tests
Eg. t test, Z test, Chi-square test, Pearson correlation coefficient
Non parametric tests
Eg. Chi-square test, Kruskal-Wallis test, Spearman correlation
coefficient
Tests of significance- Steps involved
Define the problem
state the hypothesis
Null hypothesis
Alternate hypothesis
Fix the level of significance
Select appropriate test to find test statistic
Find degree of freedom (df)
Compare the observed test statistic with theoretical one at desired level of
significance & corresponding DF
If the observed test statistic value is greater than the theoretical value, reject
the null hypothesis.
Draw the inference based on the level of significance
Objective of using tests of significance
To compare sample mean with population
Means of two samples
Sample proportion with population
Proportion of two samples
Association b/w two attributes
t - test
Student's t-test
Designed by W.S. Gosset
Unpaired t- test (two independent samples)
Paired t- test ( single sample correlated observation)
Essential conditions:
randomly selected samples from the corresponding populations
Homogeneity of variances in the 2 samples
Quantitative data
Variable normally distributed
Sample size less than 30
Unpaired t- test
Applied to unpaired data of independent observations made on individuals of two
different or separate groups or samples drawn from 2 populations
Null hypothesis is stated
Difference between the means of the two samples,
$(\bar{X}_1 - \bar{X}_2)$, measures variation in the variable
Calculate the t value:
$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$
where SE is the standard error of the difference between the two means
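The notes do not spell out how the SE of the difference is obtained. The sketch below assumes the usual pooled-variance form, which fits the stated condition of homogeneity of variances: SE = sqrt(pooled variance x (1/n1 + 1/n2)). The data and the function name are hypothetical.

```python
import math
import statistics


def unpaired_t(sample1, sample2):
    """t value and degrees of freedom for two independent samples.

    Assumption: SE of the difference uses the pooled variance.
    """
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Hypothetical independent samples.
group_1 = [10, 12, 14, 16, 18]
group_2 = [11, 13, 15, 17, 19, 21]

t_value, df = unpaired_t(group_1, group_2)
print("t = %.3f, df = %d" % (t_value, df))
```

The calculated t is then compared with the table value at the chosen level of significance and the corresponding df.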
Paired t- test
To study the role of a factor or cause when the observations are made
before & after its play:
Eg: exertion on pulse rate, effect of a drug on blood pressure etc
To compare the effect of 2 drugs, given to the same individual in the sample
on two different occasions
eg: adrenaline & noradrenaline on pulse rate
To study the comparative accuracy of 2 different instruments
eg: 2 different types of sphygmomanometers
To compare the results of 2 different lab techniques
To compare the observations made at two different sites in the same body
Testing procedure:
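State the null hypothesis
Find the difference in each pair of observations: $x = X_1 - X_2$
Calculate the mean of the differences: $\bar{x} = \frac{\sum x}{n}$
Calculate the SD of the differences & the SE of the mean:
$SE = \frac{SD}{\sqrt{n}}$
Determine the t value:
$t = \frac{\bar{x} - 0}{SD / \sqrt{n}}$
Find the degrees of freedom, n-1
Refer to the table & find the probability
P > 0.05: not significant
P < 0.05: significant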
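The paired testing procedure can be sketched as below. The paired readings are hypothetical (for example, pulse rates of the same five people on two occasions), and the function name is illustrative. The probability itself is still read from a t table at n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t(x1, x2):
    """t value and degrees of freedom for paired observations."""
    diffs = [a - b for a, b in zip(x1, x2)]   # difference in each pair
    n = len(diffs)
    mean_diff = sum(diffs) / n                # mean of the differences
    sd_diff = statistics.stdev(diffs)         # SD of the differences
    se = sd_diff / math.sqrt(n)               # SE of the mean difference
    t = (mean_diff - 0) / se                  # null hypothesis: mean difference is 0
    return t, n - 1


# Hypothetical paired readings on the same five individuals.
second_reading = [74, 75, 70, 80, 85]
first_reading = [70, 72, 68, 75, 80]

t_value, df = paired_t(second_reading, first_reading)
print("t = %.3f, df = %d" % (t_value, df))
```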
Variance ratio test or F test
Variance: a measure of the extent of the variation present in a set of data
Obtained by dividing the sum of squares by the degrees of freedom
Measured in squared units
Comparison of variance b/w two samples
Test developed by Fisher & Snedecor
Involves another distribution called F distribution
Calculate the variances of the two samples first, $S_1^2$ & $S_2^2$
(Variance = $SD^2$)
$F = \frac{S_1^2}{S_2^2}$, where $S_1^2 > S_2^2$, i.e. the larger variance $S_1^2$ is the numerator
Significance of F is found by referring to the F-table
Degrees of freedom: $(n_1 - 1)$ & $(n_2 - 1)$ in the two samples
The table gives variance ratio values at different levels of significance, with df $(n_1 - 1)$
given horizontally and $(n_2 - 1)$ vertically
Eg: Sample A: sum of squares = 36; df = 8
Sample B: sum of squares = 42; df = 9
$F = \frac{42/9}{36/8} = \frac{42}{9} \times \frac{8}{36} = 1.04$
This value of F < table value at p = 0.05, so it is not significant
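The worked example above can be reproduced with a few lines of Python. The function name is illustrative; the sums of squares and degrees of freedom are the ones given for samples A and B.

```python
def variance_ratio(ss_a, df_a, ss_b, df_b):
    """F = larger variance / smaller variance, each variance = sum of squares / df."""
    var_a = ss_a / df_a
    var_b = ss_b / df_b
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    return larger / smaller


# Sample A: sum of squares 36, df 8. Sample B: sum of squares 42, df 9.
f_value = variance_ratio(36, 8, 42, 9)
print(round(f_value, 2))  # 1.04
```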
Analysis of variance(ANOVA) test
Compare more than two samples
Compares variation between the classes as well as within the classes
For such comparisons there is high chance of error using t or Z test
Variation in experimental studies:
Natural variation (random or error variation)
Variation caused by the experimenter (imposed variation or treatment
variation)
A: Between-groups variation = random variation (always) + imposed variation
(maybe)
B: Within-group variation = random variation
Total variation = A + B
If there is no real difference b/w groups, then

$\frac{\text{between-treatment variation}}{\text{within-treatment variation}} = \frac{\text{random variation}}{\text{random variation}} \approx 1$

If there is any real difference b/w the treatments, then

$\frac{\text{between-treatment variation}}{\text{within-treatment variation}} = \frac{\text{random variation} + \text{imposed variation}}{\text{random variation}} > 1$
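A minimal sketch of the between/within ratio for a one-way layout. It assumes the usual ANOVA form in which each source of variation is a mean square (sum of squares divided by its degrees of freedom); the three treatment groups and the function name are hypothetical.

```python
def one_way_anova_f(groups):
    """F = between-groups mean square / within-groups mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-groups sum of squares: values around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical treatment groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(groups)
print(round(f_value, 2))
```

A ratio well above 1, as in this made-up example, suggests imposed (treatment) variation in addition to random variation; significance is still judged against the F table.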
Chi square test ($\chi^2$ test)
Non parametric test
Developed by Karl Pearson
Not based on normal distribution of any variable
Used for qualitative data
To test whether the difference in distribution of attributes in different groups
is due to sampling variation or otherwise.
Applications
1. Test for goodness of fit
2. Test of association (independence)
3. Test of homogeneity or population variance
$\chi^2$ test is non parametric in the first two cases and parametric in the
third case
Calculation of $\chi^2$ value
Three requirements:
A random sample
Qualitative data
Lowest expected frequency not less than 5
$\chi^2 = \sum \frac{(\text{observed } f - \text{expected } f)^2}{\text{expected } f}$
Expected f = row total x column total / grand total
df = (r-1) x (c-1)
Calculated value is compared with the table value
Drawbacks :
Tells us about the association but fails to measure the strength of
association.

Test is unreliable if the expected frequency in any one cell
is less than 5.
Correction is done by subtracting 0.5 from |O - E|
(Yates' correction)
For tables larger than 2 x 2, Yates' correction cannot be applied
Not applicable when there is 0 or 1 in any of the cells [resort to Fisher's
exact probability test]
Values are interpreted with caution when sample < 50
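The chi square calculation, including expected frequencies, degrees of freedom and the 0.5 correction for a 2 x 2 table, can be sketched as follows. The observed counts and function names are hypothetical; the correction is applied to |O - E| in every cell, as described above.

```python
def expected_frequencies(table):
    """Expected f = row total x column total / grand total, for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table, yates=False):
    """Sum of (O - E)^2 / E over all cells, optionally with Yates' correction."""
    expected = expected_frequencies(table)
    total = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)  # subtract 0.5 from |O - E|
            total += diff ** 2 / e
    return total


def degrees_of_freedom(table):
    """df = (r - 1) x (c - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2 x 2 table of observed counts.
observed = [[20, 30],
            [30, 20]]

chi2 = chi_square(observed)
chi2_yates = chi_square(observed, yates=True)
df = degrees_of_freedom(observed)
print(chi2, round(chi2_yates, 2), df)
```

The corrected value is smaller than the uncorrected one, which makes the test more conservative for small 2 x 2 tables.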
Non parametric tests:
A family of statistical tests, also called distribution free tests, that do not require
any assumption about the distribution the data set follows and that do not require
the testing of distribution parameters such as means or variances.
Friedman's test: nonparametric equivalent of analysis of variance
Kruskal-Wallis test: to compare medians of several independent samples;
equivalent of one way analysis of variance
Mann-Whitney U test: to compare medians of two independent samples;
equivalent of unpaired t test
McNemar's test: variant of chi squared test, used when data is paired
Wilcoxon's signed rank test: paired data
Spearman's rank correlation: correlation coefficient

CONCLUSION
It is more important to understand the indications and limitations of various
statistical tests than the mathematical calculations, since the
latter are taken care of by software such as SPSS
Understanding the classification of data is crucial for the selection of an
appropriate test of significance


REFERENCES
B.K. Mahajan. Methods in Biostatistics, 6th edition, Jaypee Brothers.
P.S.S. Sundar Rao, J. Richard. An Introduction to Biostatistics, 3rd edition, Prentice Hall of India.
James F. Jekel, David L. Katz, Joann G. Elmore. Epidemiology, Biostatistics and Preventive Medicine, 2nd edition, WB Saunders Company.
C.R. Kothari. Research Methodology.
Soben Peter. Preventive and Community Dentistry, 4th edition.
