Biostatistics is the application of statistical methods to the health sciences. It is the method of collecting, organizing, analysing, tabulating and interpreting data related to living organisms and human beings. The aim is to improve the health status of the population.
The statistical workflow: planning research, collecting data, describing data, summarizing and presenting data, analyzing data, interpreting results, and reaching decisions or discovering new knowledge.

DEFINITION: Biostatistics is the application of statistical methods to the health sciences: the collection, organization, analysis, tabulation and interpretation of data related to living organisms and human beings. [Soben Peter]

History
John Graunt (1620-1674), who was neither a physician nor a mathematician, is regarded as the father of health statistics. The term biometry was coined by W.F.R. Weldon (1860-1906), a zoologist at University College, London.

Uses of biostatistics
- To test whether the difference between two populations is real or a chance occurrence.
- To study the correlation between attributes in the same population.
- To evaluate the effect of vaccines, sera, etc.
- To measure mortality and morbidity.
- To evaluate achievements of public health programs.
- To fix priorities in public health programs.
- To help promote health legislation and create administrative standards for oral health.
Aims of biostatistics
- To generate statistical data through experimental investigation and sample surveys.
- To organize and present the data in suitable tables, diagrams, charts or graphs.
- To draw valid inferences from the data collected, put forth definite interpretations, or predict future outcomes from the data.
Why should medical/dental students learn biostatistics?
1. Medicine is becoming increasingly QUANTITATIVE. The aim is to improve the health status of the population, so we have to clarify the relationships between certain factors and diseases:
- Enumerate the occurrences of diseases.
- Explain the etiology of diseases (which factors cause which diseases).
- Predict the number of disease occurrences.
- Read, understand and criticize the medical literature.
2. The planning, conduct and interpretation of much of medical research are becoming increasingly reliant on statistical methods.
Planning: How many patients must be treated? How do we allocate the subjects to treatments? What other factors may influence the response variable?
Conduct: Under which conditions must the study be conducted? Is matching necessary? Is blinding (single or double) necessary? Is there a need for a control group? Should the placebo effect be considered? Which experimental design technique is more appropriate?
Interpretation: Example:
Distribution of Women with a Diagnosis of Thromboembolism Among Blood Groups

Blood Group    Frequency    %
A              32           58.2
AB             4            7.3
B              8            14.5
O              11           20.0
Total          55           100.0
Terminologies:
Data: set of values of one or more variables recorded on one or more observational units.
Observation (case): individual source of data.
Variable: a quantity which varies such that it may take any one of a specified set of values; it may be measurable or non-measurable.
Population: a collection, or set, of individuals, objects, or measurements whose properties are to be analyzed.
Sample: a subset of the population, selected in such a way that it is representative of the larger population.
Parameter: a summary value which in some way characterizes the nature of the population with respect to the variable under study.
Statistic: a summary value calculated from a sample of observations.
DATA
Sources of data:
1. Routinely kept records
2. Published data sources
3. Data on electronic media
4. Surveys and experimental research
5. Census
6. Generated or artificial data

Types of data:
1. Qualitative data: results from a variable that asks for a quality-type description of the subject.
2. Quantitative data: results from obtaining quantities (counts or measurements).

Types of biostatistics:
- Descriptive biostatistics
- Inferential biostatistics
Descriptive biostatistics: the study of biostatistical procedures which deal with the collection, representation, calculation and processing, i.e., the summarization of the data to make it more informative and comprehensible. The primary function of descriptive statistics is to provide meaningful and convenient techniques for describing features of data that are of interest. The failure to choose appropriate descriptive statistics often leads to faulty scientific inference. Descriptive statistics is not concerned with the implications or conclusions that can be drawn from the sets of data.
Inferential biostatistics: the procedures which serve to make generalizations or draw conclusions on the basis of studies of the sample. This is also known as sampling biostatistics. The study of the quantitative aspects of the inferential process provides a solid basis on which the more general substantive process of inference can be founded.
Basis for statistical analyses Statistical analyses are based on three primary entities: The population 'U' that is of interest The set of characteristics (variables) of units of this population 'V' The probability distribution 'P' of these characteristics in the population
The population 'U' The population is a collection of units of observation that are of interest and is the target of the investigation. For example , in determining the effectiveness of a particular drug for a disease, the population would consist of all possible patients with the disease. It is essential, in any research study, to identify the population clearly and precisely. The success of the investigation will depend to a large extent on the identification of the population of interest.
The variables 'V'
A variable is a state, condition, concept or event whose value is free to vary within the population. Once the population is identified, we should clearly define what characteristics of the units of this population (the subjects of the study) we are planning to investigate. For example, in the case of a particular drug, one needs to define the disease and what other characteristics of the people (e.g. age, sex, education) one intends to study. Clear and precise definitions of these characteristics, and methods for measuring them (a simple observation, a laboratory measurement, or tests using a questionnaire), are essential for the success of the research study.
Variables can be classified as:
- Independent variables: variables that are manipulated or treated in a study in order to see what effect differences in them will have on the variables proposed as being dependent on them. Synonyms: cause, input, predisposing factor, antecedent, risk factor, characteristic, attribute, determinant.
- Dependent variables: variables whose changes result from the level or amount of the independent variable or variables. Synonyms: effect, outcome, consequence, result, condition, disease.
- Confounding or intervening variables: variables that should be studied because they may influence or 'confound' the effect of the independent variables on the dependent variables. E.g., in a study of tobacco (independent variable) and oral cancer (dependent variable), the nutritional status of the individual may play an intervening role.
- Background variables: variables so often relevant in investigations of groups or populations that they should be considered for possible inclusion in the study. Examples: sex, age, ethnic origin, education, marital status, social status.
The probability distribution 'P'
The probability distribution is a way to enumerate the different values the variable can have, and how frequently each value appears in the population. The actual frequency distribution is approximated by a theoretical curve that is used as the probability distribution. Common examples of probability distributions are the binomial and the normal. For example, the incidence of a relatively common illness may be approximated by a binomial distribution, whereas the distributions of continuous variables (blood pressure, heart rate) are often considered to be normally distributed.

Probability distributions are characterized by parameters, i.e., quantities that allow us to calculate the probabilities of various events concerning the variable, or to determine the probability for a particular value. The binomial distribution has two parameters. It arises when a fixed number of subjects is observed, the characteristic is dichotomous (only two possible values), and each subject has the same probability p of having one value and (1 - p) the other. The normal distribution, on the other hand, is a mathematical curve characterized by two quantities, μ and σ: the former is the mean of the values of the variable, and the latter the standard deviation. The type of statistical analysis done depends on the design of the study.
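As a small worked illustration of the binomial setup just described (a fixed number of subjects, each with the same probability p of a dichotomous outcome), the sketch below computes one binomial probability; the figures n = 5 and p = 0.3 are invented for demonstration, not taken from any study:

```python
from math import comb

# Binomial probability sketch: chance of exactly k "successes" among n
# subjects, each with the same success probability p (illustrative values).
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability that exactly 2 of 5 subjects show the characteristic, p = 0.3.
p_2_of_5 = binom_pmf(2, n=5, p=0.3)
print(round(p_2_of_5, 4))  # 0.3087
```

This matches the two-parameter description above: fixing n and p fully determines every such probability.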
Collection of data
In scientific research work, data are collected from personal experimental study, i.e., primary data are used. Statistical data can be collected in two ways: 1. the census method; 2. the sampling method.
Census method: the data are collected from all the individual items connected with the inquiry.
Advantages of the census method:
i. The data have a high degree of accuracy.
ii. The data are more representative and true.
iii. Results are more reliable.
iv. The possibility of bias is minimised.
Disadvantages of the census method:
i. It is less economical, as it consumes more time, energy and expenditure.
ii. It requires organizational skills and a large number of investigators.
iii. It cannot be applied to all situations; e.g., to determine the blood cell count it is not possible to analyse the whole blood.
Sampling method
In this method the data are collected from a small group of the population, termed a sample. A sample is a portion of the population selected to represent the population.
Types of samples
There are two types of samples used in biostatistics:
1. Qualitative samples: e.g., the statement that children in an African population are taller than those in India is based on a qualitative sample.
2. Quantitative samples: e.g., counting the number of decayed teeth of individuals of a particular age group uses a quantitative sample.
Size of samples
The total number of units used in the study to get significant results is termed the sample size. Selecting the proper sample size is very important: it should be neither very small nor very large, because the conclusions are directly affected by it.
Advantages of the sampling method:
- It is comparatively more economical, as it consumes less energy, time and expenditure.
- It requires fewer investigators.
- It is best suited to places and situations where the census method cannot be applied.
Disadvantages of the sampling method:
- It requires the services of experts; otherwise incorrect or misleading results will be obtained.
- Selection of an appropriate method of sampling is necessary.
If the population is very small and precise information is needed, the census method is preferred. If the population is very large, or the field of investigation is very wide and quick results are required, sampling methods should be used.
Types of sampling methods
There are two types of sampling techniques:
a. Random or probability sampling
   1. Simple random sampling
   2. Stratified random sampling
   3. Systematic sampling
   4. Cluster sampling
   5. Multistage sampling
b. Nonrandom or nonprobability sampling
   1. Convenience sampling
   2. Purposive sampling
   3. Quota sampling
Random or probability sampling
In random sampling, a sample is selected in such a way that every element in the population has an equal opportunity of being included in the sample; that is, random sampling is made without deliberate discrimination. Random sampling is carried out to ascertain a particular character of the population and involves unbiased, non-preferential samples.
Selection of random samples
Sampling without replacement: an observation is included only once and is selected randomly, without any preference or conscious effort.
Sampling with replacement: the observation has a chance to be selected at each draw.
Properties of random samples:
- Several samples drawn from the same population will differ, i.e., their statistical characteristics will change from sample to sample.
- A random sample should be large: the larger the sample, the smaller the variation in its characteristics from one random sample to another.
- A random sample must be selected in such a way that every element in the population has an equal opportunity of being included in the sample.
Advantages of random sampling:
1. Random sampling enables the researcher to draw inferences about the whole population.
2. It eliminates personal bias: the researcher can neither reject observations that do not support his theory nor select only those that do.
Types of random sampling methods
1. Simple random sampling
2. Stratified random sampling
3. Systematic sampling
4. Cluster sampling
5. Multistage sampling

Simple random sampling
In this method samples are chosen at random, and each member or sample unit of the population has an equal chance of being selected in the sample. The method is well applicable when the population is small, homogeneous and readily available. It is sometimes called unrestricted random sampling.
Simple random sampling:
- Every possible sample of a certain size within a population has a known, equal probability of being chosen.
- The most basic type of probability sampling.
- Actual selection is done by randomly picking the desired number of units from the population.
- Statistically equivalent to identifying all possible samples of the desired size and picking one of those samples at random.
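The selection step described above, including the with/without-replacement distinction from earlier, can be sketched with Python's standard library; the population of 20 patient IDs and the seed are invented illustrative values:

```python
import random

# Hypothetical sampling frame of 20 patient IDs (illustrative data only).
population = list(range(1, 21))

random.seed(42)  # fixed seed so the draw is reproducible

# Sampling without replacement: each unit can appear at most once, and
# every possible sample of size 5 is equally likely to be chosen.
sample = random.sample(population, k=5)

# Sampling with replacement: a unit has a chance of selection at each draw,
# so it may appear more than once.
sample_wr = random.choices(population, k=5)

print(sample)
print(sample_wr)
```

Note that `random.sample` implements exactly the "pick one of all possible samples at random" equivalence stated above.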
Stratified random sampling
Samples are chosen at random from different strata, usually of different sizes, of a population, based on a priori information about their variation and size. Stratified random sampling is done in heterogeneous populations, i.e., when the population is not homogeneous. The heterogeneous population is divided into several more or less homogeneous sections or groups, called strata. A sample is drawn from each stratum by simple random sampling. Thus the variability in each stratum is adequately represented in the sample.
- A probability sampling procedure.
- The chosen sample is forced to contain units from each of the segments, or strata, of the population.
- Also known as proportional or quota random sampling: divide the population into homogeneous subgroups, then take a simple random sample in each subgroup.
- Statistically more efficient; provides a more accurate estimate of population variables.
- Two types: proportionate and disproportionate.
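The proportionate variant above (each stratum contributes in proportion to its population share) can be sketched as follows; the two sex strata, their sizes, and the target sample size of 10 are invented illustrative values:

```python
import random

# Hypothetical population split into strata by sex (illustrative counts).
strata = {
    "male":   [f"M{i}" for i in range(60)],   # 60% of the population
    "female": [f"F{i}" for i in range(40)],   # 40% of the population
}
total = sum(len(units) for units in strata.values())
n = 10  # desired overall sample size

random.seed(1)
# Proportionate stratified sampling: each stratum contributes a share of
# the sample equal to its share of the population, drawn by simple
# random sampling within the stratum.
sample = []
for name, units in strata.items():
    k = round(n * len(units) / total)
    sample.extend(random.sample(units, k))

print(sample)  # 6 male units and 4 female units
```

Disproportionate stratified sampling would simply fix each stratum's `k` independently of its population share.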
Systematic sampling
This is a simple procedure, used when a complete list of the population from which the sample is to be drawn is available. It is more often applied in field studies when the population is large, scattered and heterogeneous. In this sampling method, units are drawn at evenly spaced positions after a random starting position is chosen: from a large population, every 10th, 20th, 25th or 50th item is selected.
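The "every kth item after a random start" rule can be sketched directly; the frame of 50 record IDs and the sample size of 10 are invented illustrative values:

```python
import random

# Hypothetical complete list (sampling frame) of N = 50 record IDs.
frame = list(range(1, 51))
N, n = len(frame), 10
k = N // n  # sampling interval k = N/n = 5

random.seed(7)
start = random.randint(0, k - 1)  # random start within the first interval
# Take every k-th unit from the random start onward.
sample = frame[start::k]

print(sample)
```

Only one random number is needed, which is the simplicity advantage noted later for this method.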
Cluster sampling
In this method the population is divided into separate natural groups of elements, called clusters. Each cluster includes only one type of element. A simple random sample of clusters is taken. A cluster may consist of units such as villages, wards, blocks, factories, slums of a town, children of a school, etc. Generally the clusters are natural groupings; if they are geographic regions, the sampling is called 'area sampling'.
Cluster sampling:
- A probability sampling procedure.
- Clusters of population units are selected at random, and all or some units in the chosen clusters are studied.
- Helpful when an adequate sampling frame of individual population units is not readily available.
- Even when such a frame is available, if it can be conveniently divided into a series of representative clusters, cluster sampling may be easier to use than a simple or stratified random-sampling approach.
In cluster sampling, we follow these steps:
- divide the population into clusters (usually along geographic boundaries);
- randomly sample clusters;
- measure all units within the sampled clusters.
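The three steps above can be sketched as follows; the four schools and their pupils are invented illustrative data:

```python
import random

# Hypothetical clusters: pupils grouped by school (illustrative data).
clusters = {
    "school_A": ["A1", "A2", "A3"],
    "school_B": ["B1", "B2"],
    "school_C": ["C1", "C2", "C3", "C4"],
    "school_D": ["D1", "D2", "D3"],
}

random.seed(3)
# Step 1-2: randomly sample whole clusters (here, 2 of the 4 schools).
chosen = random.sample(list(clusters), k=2)
# Step 3: measure ALL units within the sampled clusters.
sample = [unit for name in chosen for unit in clusters[name]]

print(chosen, sample)
```

Note that only a list of clusters is needed up front, not a frame of individual pupils, which is the practical advantage mentioned above.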
Multistage sampling
In multistage sampling, clusters are selected as a primary cluster sample, and these secondary clusters are again sampled instead of being fully inspected. This procedure is employed in large-scale, country-wide surveys.

Systematic sampling (procedure):
- The researcher selects the first unit randomly and the remaining units systematically:
  - number the units in the population from 1 to N;
  - decide on the sample size n that you want or need;
  - k = N/n is the interval size;
  - randomly select an integer between 1 and k;
  - take every kth unit thereafter.
- Simple relative to the other methods; requires only one random number to select a sample.
- Statistical efficiency is practically equivalent to simple random sampling.
Practical considerations: probability sampling methods
- Not all methods may be equally practical in any project.
- Base the choice upon the nature of the population, the degree of precision desired, and the resources available for research.

Nonrandom or nonprobability sampling
- A subjective procedure: the probability of selection for the population units cannot be determined.
- The selection is not done on a strictly chance basis.
- Offers researchers greater freedom and flexibility in sampling.
- Nonprobability samples cannot depend upon the rationale of probability theory; they may or may not represent the population well, and it is difficult to know how well they do so.
- There may be circumstances where random sampling is not feasible, practical, or theoretically sensible.
In nonrandom sampling, samples are drawn without following any criterion or yardstick. The samples collected do not reflect any specific approach, nor can they be used to assess properly the accuracy of the estimator. Many investigator biases are likely to occur with this procedure. It is of three types:
1. Accidental, haphazard or convenience sampling: also known as accidental, accessibility or haphazard sampling. The major reason is administrative convenience; the sample is chosen with ease of access as the sole concern.
Convenience sampling:
- The researcher's convenience forms the basis for selecting a sample of units.
- Very popular in online research, where it is known as intercept sampling or pop-up surveys.
- Traditional 'man on the street' interviews conducted frequently by television news programs.
- The use of college students in much psychological research is primarily a matter of convenience.
- Many research projects simply ask for volunteers.
2. Judgement/purposive sampling: also known as judgemental sampling. The experimenter exercises deliberate subjective choice in drawing the representative sample. Judgemental sampling aims at eliminating anticipated sources of distortion, but there always remains the risk of distortion due to personal prejudice or lack of knowledge of certain crucial features of the population's structure.
Judgement sampling:
- The researcher exerts some effort in selecting a sample believed to be most appropriate, and will usually be knowledgeable about the nature of the ideal population.
- Requires greater researcher effort; generally more appropriate than a convenience sample.
- Can be very useful when you need to reach a targeted sample quickly and sampling for proportionality is not the primary concern.
- Likely to yield the opinions of your target population, but also likely to overweight more readily accessible subgroups.
3. Quota sampling: combines convenience and judgement and is more structured than either of the two. Quota sampling needs a proper statistical design to determine what numbers are needed in each of the quotas.
Quota sampling:
- A quota of units is sampled from each population cell, based on judgment.
- The most refined form of nonprobability sampling; often used in practice, especially in personal interviewing.
- Resembles stratified random sampling, with features of judgment and convenience sampling as well.
- People are selected nonrandomly according to some fixed quota.
Proportional quota sampling:
- Represent the major characteristics of the population by sampling a proportional amount of each.
- Example: the population has 40% women and 60% men, and the required sample size is 100; continue sampling until those percentages are achieved, then stop.
Nonproportional quota sampling:
- Specify the minimum number of sampled units wanted in each category, without matching the proportions in the population.
- Simply need enough to be able to talk about even small groups in the population.
- The nonprobabilistic analogue of stratified random sampling, typically used to assure that smaller groups are adequately represented.
4. Heterogeneity sampling (sampling for diversity):
- Aims to include all opinions or views, without representing them proportionately.
- Obtains a broad spectrum of ideas rather than identifying the 'average' or 'modal instance' ones.
- Sampling ideas, not people: to get all of the ideas (especially the unusual ones), include a broad and diverse range of participants.
5. Snowball sampling:
- Identify someone who meets the criteria for inclusion, then ask them to recommend others they know who also meet the criteria.
- Useful when trying to reach populations that are inaccessible or hard to find.
Sampling error:
- The difference between a statistic value generated through sampling and the parameter value, which can be determined only through a census study.
- The magnitude of the sampling error says how precisely the population parameter can be estimated from a sample value.
- We estimate the average amount of sampling error associated with a given sampling procedure.
- The true population parameter value is unknown, and the sample statistic value may vary from sample to sample within the population.
PRESENTATION OF DATA
Objective of classification of data: to make the data simple, concise, meaningful, interesting and helpful in further analysis.
There are two main methods of presenting data: tabulation and diagrams.

TABULATION
Data may be classified on the following bases:
- Geographical, i.e., area-wise, e.g. cities, districts;
- Chronological, i.e., on the basis of time;
- Qualitative, i.e., according to some attribute;
- Quantitative, i.e., in terms of magnitude.
The two elements of classification are the variable and the frequency.
Variable: a name denoting a condition, occurrence or effect that can assume different values. It is divided into subgroups (classes), each with a lowest and a highest value.
Class interval: the difference between the upper and lower limit of a class. E.g., in the class 5-14, 5 is the lower limit and 14 the upper limit; class interval = 14 - 5 = 9.
Frequency: the number of units belonging to each group of the variable.
Frequency distribution table: a way of presenting data in tables.
Rules for a frequency distribution table:
- The title of the table is named at the bottom.
- The number of class intervals should be between 5 and 20; there is no rigidity about it.
- The class intervals should be of equal width.
- Class limits should be clearly defined to avoid ambiguity, e.g., 0-4, 5-9, 10-14, etc.
- Rows and columns should be clearly defined, with headings.
- Units of measurement should be specified.
- If the data are not original, the source of the data should be mentioned at the bottom of the table.
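The tabulation rules above (equal-width classes with unambiguous limits such as 0-4, 5-9, 10-14) can be sketched as a small routine; the ages used here are invented illustrative data:

```python
# A minimal sketch of building a frequency distribution table with
# equal-width class intervals (illustrative ages, class width 5).
ages = [2, 4, 7, 8, 9, 11, 12, 12, 13, 14, 16, 18, 19, 21, 23]

width = 5
table = {}
for age in ages:
    lower = (age // width) * width          # classes 0-4, 5-9, 10-14, ...
    key = f"{lower}-{lower + width - 1}"    # clearly defined class limits
    table[key] = table.get(key, 0) + 1      # frequency of each class

for interval, freq in table.items():
    print(interval, freq)
```

Because the limits 0-4, 5-9, ... do not overlap, every observation falls into exactly one class, which is the ambiguity the rule above guards against.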
Diagrams
Diagrams are extremely useful: they are attractive to the eye, give a bird's-eye view of the entire data, and leave a lasting impression.
TYPES OF DIAGRAMS:
- Bar diagram: qualitative data
- Multiple bar diagram: qualitative data
- Component bar diagram: qualitative data
- Proportional bar diagram: qualitative data
- Histogram: quantitative data of continuous type
- Frequency polygon: quantitative data
- Pie diagram: qualitative data
- Line diagram: quantitative data over time
- Cartograms or spot maps: geographical distribution of frequencies
Basic rules for graphs:
- Self-explanatory, simple and consistent with the data.
- Values of the variable on the horizontal (X) axis and the frequency on the vertical (Y) axis.
- Not too many lines on the graph; it should not look clumsy.
- The scale of presentation is noted at the right-hand top corner of the graph.
- The scale of division of the two axes should be proportional.
- The details of the variables and frequencies are presented on the axes.

Bar diagram
- Represents qualitative data for only one variable.
- The width of each bar remains the same; the length varies according to the frequency in each category.
- Bars may be vertical or horizontal.
- Limitation: it represents only one classification and cannot be used for comparison across classifications.
- Facilitates comparison of data relating to different time periods and regions.
Multiple bar diagram
- Compares qualitative data with respect to a single variable, e.g., sex-wise, or with respect to time or region.
- Each category of the variable has a set of bars of the same width, corresponding to the different sections, without any gap in between; the length corresponds to the frequency.
Component bar diagram
- Represents qualitative data, showing both the number of cases in major groups and in the subgroups simultaneously.
- A bar is drawn for each major group, and each rectangle is divided according to the numbers in the subgroups.
Proportional bar diagram
- Represents qualitative data. To compare only the proportions of sub-groups between different major groups of observations, bars are drawn for each group with the same length, either 1 or 100%. These are then divided according to the sub-group proportions in each major group.

PIE DIAGRAM
- The frequency of each group is shown in a circle; the degree of angle denotes the frequency.
- Instead of comparing the lengths of bars, the areas of segments are compared.

Line diagram
- Useful to study changes in the value of a variable over time; the simplest type of diagram.
- X-axis: hours, days, weeks, months or years; Y-axis: the value of the quantity pertaining to the X-axis.

Histogram
- For quantitative data of continuous type; a bar diagram without gaps between the bars, representing a frequency distribution.
- X-axis: the size of the observations. Starting from 0, the limits of each class interval are marked, the width corresponding to the width of the class interval in the frequency distribution.
- Y-axis: the frequencies. A rectangle is drawn above each class interval with height proportional to the frequency of that interval.

Frequency polygon
- Shows the frequency distribution of quantitative data; useful to compare two or more frequency distributions.
- A point is marked over the mid-point of each class interval, at a height corresponding to the frequency; the points are connected by straight lines. The first and last points are joined to the mid-points of the previous and next classes respectively.

SCATTER DIAGRAM
Cartograms or spot maps
Show the geographical distribution of frequencies of a characteristic.

PICTOGRAM
Pictures representing the values of items are called pictograms. This is the most useful way of presenting data to people who cannot understand statistics.

Measures of central tendency
A single estimate of a series of data that summarizes the data is known as a parameter, and one such parameter is the measure of central tendency. Objective: to condense the entire mass of data and to facilitate comparison.
[Fig.: scatter diagram of height (in feet) against weight (in kg) of 20 students of CODS]
Types:
- Arithmetic mean: a mathematical estimate.
- Median: a positional estimate.
- Mode: based on frequency.

Properties of a measure of central tendency:
- It should be based on each and every item in the series.
- It should not be affected by extreme observations (either too small or too large values).
- It should be capable of further statistical computation.
- It should have sampling stability, i.e., if different samples of the same size are picked from the same population and the measure of central tendency is calculated for each, they should not differ from each other markedly.

Arithmetic mean
The simplest measure of central tendency.
1. Ungrouped data:
   Mean = (sum of all the observations in the data) / (number of observations in the data)
2. Grouped data with a range for each class interval: the frequencies in a class interval are taken as equally distributed on either side of the midpoint of the class interval. The formula is
   X̄ = Σ(Xi fi) / Σfi
   where Xi is the midpoint of the class interval and fi the corresponding frequency.
3. Grouped data with a single value for each class interval: symbolically,
   X̄ = Σ(Xi fi) / Σfi
   where Xi is the grouped variable and fi the corresponding frequency.

MEDIAN
The middle value in a distribution, such that one half of the units have a value smaller than or equal to the median and one half have a value greater than or equal to it.
Calculation of the median:
- Ungrouped data: the observations are arranged in order of magnitude; the middle value is the median. With an odd number of observations it is the (n + 1)/2-th value; with an even number, the mean of the two middle values.
- Grouped data: the value at the (total number of observations)/2 position.
For ungrouped data, X̄ = ΣXi / n, where Σ (sigma) means 'the sum of', Xi is the value of each observation in the data, and n is the number of observations.

MODE
The value in a series of observations which occurs with the greatest frequency. E.g., in the series of ages at eruption of the canine 6, 6, 5, 7, 8, 6, 7, 5, the mode is 6.
For an ill-defined mode: Mode = 3 × Median − 2 × Mean.

Variability and its measures
Types:
- Biological variability: normal or natural differences within accepted biological limits. It includes individual variability; periodical variability; and class, group or category variability.
- Real variability: when the difference between two readings exceeds the defined limits, due to external factors.
- Experimental variability: errors or variations due to materials and methods, including observer error (subjective or objective), instrument error, and sampling error.

Measures of variability
Synonyms: measures of dispersion; measures of variation or scatter. Dispersion is the degree of spread or variation of the variable about a central value.
Uses:
- To determine the reliability of an average.
- To serve as a basis for the control of variability.
- To compare two or more series.
- To facilitate further statistical analysis.
A good measure of dispersion is simple, easy to compute, based on all items, amenable to further analysis, and not affected by extreme values.
Measures of dispersion of individual observations: range, interquartile range, mean deviation, standard deviation, coefficient of variation.
Measures of variability of samples: standard error of the mean, standard error of the difference between two means, standard error of a proportion, standard error of the difference between two proportions, standard error of the correlation coefficient, standard deviation of the regression coefficient.

Range
The difference between the value of the smallest item and the value of the largest item. It is the simplest method, but it gives no information about the values that lie between the extreme values and is subject to fluctuations from sample to sample.
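The three averages defined above can be checked on the canine-eruption series from the mode example, using only Python's standard library:

```python
import statistics

# Age at eruption of the canine, from the mode example above.
ages = [6, 6, 5, 7, 8, 6, 7, 5]

mean = sum(ages) / len(ages)       # arithmetic mean, X̄ = ΣXi / n
median = statistics.median(ages)   # middle value of the ordered data
mode = statistics.mode(ages)       # most frequent value

print(mean, median, mode)  # 6.25 6.0 6
```

With an even number of observations (n = 8), the median is the mean of the two middle values of the sorted series, here (6 + 6)/2 = 6.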
Mean deviation
The average of the absolute deviations from the arithmetic mean:
M.D. = Σ|x − x̄| / n
Example series: 52, 44, 54, 56, 60, 64, 66, 76, 60, 68 and 41, 54, 43, 45, 60, 75, 77, 66, 79, 60.

Standard deviation
The most important and widely used measure of dispersion. It is the square root of the mean of the squared deviations from the arithmetic mean, i.e., the root-mean-square deviation. The greater the deviation, the greater the dispersion; the smaller the deviation, the higher the degree of uniformity.

Calculation of S.D.
For ungrouped data:
- Calculate the mean, x̄.
- Take the difference of each observation from the mean: d = xi − x̄.
- Square these: d².
- Total these: Σd².
- Divide by the number of observations minus 1 to obtain the variance: Σd² / (n − 1).
- The square root of this variance is the S.D.: S.D. = √(Σd² / (n − 1)).

For grouped data with single values for class intervals:
- Make a frequency table and determine the midpoint of each range.
- S.D. = √( Σ(xi − x̄)² fi / (n − 1) )
where xi is the individual observation in the class, x̄ the mean, fi the frequency, and n the total frequency.

For grouped data with a range for each class interval, the frequencies are centred at the midpoints:
- S = √( Σ(xi − x̄)² fi / (n − 1) )
where xi is the midpoint of the class interval, x̄ the mean, fi the frequency, and n the total frequency.
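The ungrouped-data steps above can be worked through on the first example series (52, 44, 54, 56, 60, 64, 66, 76, 60, 68):

```python
import math

# First example series from the mean-deviation section above.
x = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
n = len(x)
mean = sum(x) / n                               # x̄ = 60.0

# Mean deviation: average absolute deviation from the mean.
mean_dev = sum(abs(xi - mean) for xi in x) / n  # = 6.8

# Standard deviation: square, total, divide by (n - 1), take the root.
variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
sd = math.sqrt(variance)

print(mean, mean_dev, round(sd, 2))  # 60.0 6.8 9.09
```

The same result is returned by `statistics.stdev`, which also uses the (n − 1) divisor.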
Uses of standard deviation: summarizes the deviations of a large distribution; indicates whether the variation from the mean is by chance or real; helps in finding the standard error; helps in finding the suitable size of a sample. The standard deviation is only interpretable as a summary measure for variables having approximately symmetric distributions.
Coefficient of variation: used to compare relative variability — the variation of the same character in two or more series, the variability of one character in two different groups having different magnitudes of values, or two characters in the same group — by expressing it as a percentage:
C.V. = (S.D. / mean) × 100
The higher the C.V., the greater the variability.
Normal distribution & normal curve: the height of the bars or curve is greatest in the middle; values are spread around the mean; the maximum number of values lies around the mean, with few at the extremes; half the values lie above & half below the mean.
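The two example series above happen to share the same mean, so the coefficient of variation shows which is more variable — a minimal sketch (stdev uses the n − 1 sample form):

```python
from statistics import mean, stdev

def cv(data):
    """Coefficient of variation: C.V. = (S.D. / mean) * 100, in percent."""
    return stdev(data) / mean(data) * 100

# The two series from the text; both have mean 60
a = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
b = [41, 54, 43, 45, 60, 75, 77, 66, 79, 60]

cv_a, cv_b = cv(a), cv(b)  # same mean, but b is clearly the more variable
```

Here C.V. is about 15.2% for the first series and about 23.7% for the second, so the second series is more variable despite the identical means.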
Properties of the normal distribution: the curve is bell shaped and symmetrical about the middle point. The mean is located at the highest point of the curve, and the measures of central tendency coincide. The maximum number of observations lies at the value of the variable corresponding to the mean; the number of observations gradually decreases on either side, with very few observations at the extreme points. The area under the curve between any 2 points corresponds to the number of observations between any 2 values of the variate. In terms of the relationship between the mean and the SD:
a) Mean ± 1 S.D. covers 68.3% of the observations;
b) Mean ± 2 S.D. covers 95.4% of the observations;
c) Mean ± 3 S.D. covers 99.7% of the observations.
This relationship is used for fixing confidence intervals; the normal distribution law forms the basis for various tests of significance.
Relative or standard normal deviate: the deviation from the mean in a normal distribution, measured in terms of S.D.; it indicates how much an observation is bigger or smaller than the mean, in units of SD.
Z = (observation − mean) / SD, i.e. Z = (x − x̄) / S
Probability or chance: the relative frequency or probable chance of occurrence with which an event is expected to occur on average. Expressed as p; ranges from 0 to 1. When p = 0, there is no chance of the event happening; when p = 1, there is a 100% chance of the event happening.
p = no. of events occurring / total no. of trials
Statistical hypothesis: methods to estimate the difference between estimates of samples. Two hypotheses are made: the null hypothesis (hypothesis of no difference) and the alternative hypothesis (of significant difference).
Null hypothesis or hypothesis of no difference [H0]: asserts that there is no real difference between the sample & the general population; the difference found is accidental & arises out of sampling variations.
Alternative hypothesis of significant difference [H1]: states that the sample result is different from the hypothetical value of the population.
To minimize errors, the sampling distribution (area under the normal curve) is divided into two
regions or zones:
1. Zone of acceptance: if the sample falls in the area of mean ± 1.96 SE, the null hypothesis is accepted.
2. Zone of rejection: if the sample falls in the shaded area beyond mean ± 1.96 SE, the null hypothesis is rejected.
Degrees of freedom: the quantity in the denominator, which is one less than the number of independent observations in the sample. Eg: when there are 10 values, there are 9 choices or degrees of freedom. In the unpaired t test of the difference between 2 means, df = n1 + n2 − 2, where n1 & n2 are the numbers of observations. In the paired t test, df = n − 1.
Standard error: a measure of the variability of the sample mean, obtained as the SD divided by the square root of the sample size:
SE = SD / √n
2 types of errors: Type I error and Type II error.
Type I error: the null hypothesis is rejected when it is true, i.e. rejected even though it falls in the zone of acceptance; a serious error.
Type II error: the null hypothesis is wrongly accepted.
                          Null hypothesis is true    Null hypothesis is false
Accept it                 Correct decision           Type II error
Reject it                 Type I error               Correct decision
Type II error: the null hypothesis is accepted even though it falls in the zone of rejection; not a serious error, and needs only confirmation of the result by changing the level of significance.
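The 68.3/95.4/99.7 rule, the standard normal deviate Z, and the standard error with its 95% zone of acceptance can be sketched with the standard library's NormalDist. The observation x = 72, mean 60, SD 9, and n = 100 below are illustrative assumptions, not values from the text:

```python
from statistics import NormalDist
import math

nd = NormalDist()  # standard normal distribution (mean 0, SD 1)

# 68-95-99.7 rule: proportion of observations within k SDs of the mean
within = [nd.cdf(k) - nd.cdf(-k) for k in (1, 2, 3)]  # ~0.683, ~0.954, ~0.997

# Relative (standard) normal deviate: Z = (observation - mean) / SD
# x, mu, sd, n are illustrative assumptions
x, mu, sd, n = 72, 60, 9.0, 100
z = (x - mu) / sd            # how many SDs the observation lies above the mean

# Standard error of the mean, and the 95% zone of acceptance (mean ± 1.96 SE)
se = sd / math.sqrt(n)
lo, hi = mu - 1.96 * se, mu + 1.96 * se
```

A sample mean falling inside (lo, hi) lands in the zone of acceptance, so the null hypothesis is accepted; outside it, the null hypothesis is rejected.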
Tests of significance: parametric and non-parametric tests or methods.
Parametric methods: methods of statistical inference based on the assumption that the population has a certain probability distribution; the resulting collection of statistical tests and procedures is referred to as parametric methods. For example, the t-distribution and F-distribution are associated with the values of parameters of an assumed normal probability distribution.
Non-parametric methods: statistical procedures that do not require assumptions about the form of the probability distribution from which the data come; also called distribution-free methods. For example, chi-square frequency techniques are non-parametric.
Parametric tests — eg. t test, Z test, chi-square test, Pearson correlation coefficient. Non-parametric tests — eg. chi-square test, Kruskal-Wallis test, Spearman correlation coefficient.
Tests of significance — steps involved: define the problem; state the hypotheses (null hypothesis and alternative hypothesis); fix the level of significance; select the appropriate test to find the test statistic; find the degrees of freedom (df); compare the observed test statistic with the theoretical one at the desired level of significance & the corresponding df; if the observed test statistic is greater than the theoretical value, reject the null hypothesis; draw the inference based on the level of significance.
Objectives of using tests of significance: to compare a sample mean with the population mean; the means of two samples; a sample proportion with the population proportion; the proportions of two samples; the association between two attributes.
t test (Student's t-test): designed by W.S. Gosset. Unpaired t test (two independent samples); paired t test (single sample, correlated observations).
Essential conditions: randomly selected samples from the corresponding populations; homogeneity of variances in the 2 samples; quantitative data; normally distributed variable; samples < 30.
Unpaired t test: for unpaired data of independent observations made on individuals of two different or separate groups or samples drawn from 2 populations. The null hypothesis is stated; the difference between the means of the two samples (X̄1 − X̄2) measures the variation in the variable; calculate the t value:
t = (X̄1 − X̄2) / SE
Paired t test: to study the role of a factor or cause when observations are made before & after it acts. Eg: exertion on pulse rate, effect of a drug on blood pressure, etc. Also to compare the effect of 2 drugs given to the same individuals in the sample on two different occasions (eg: adrenaline & noradrenaline on pulse rate); to study the comparative accuracy of 2 different instruments (eg: 2 different types of sphygmomanometers); to compare the results of 2 different lab techniques; to compare the observations
made at two different sites in the same body.
Testing procedure: state the null hypothesis — the mean of the differences is zero. Calculate the mean of the differences, x̄d = Σd / n. Calculate the SD of the differences & the SE of the mean, SE = SD / √n. Determine the t value:
t = (x̄d − 0) / (SD / √n)
Find the degrees of freedom, n − 1. Refer to the table & find the probability: P > 0.05 not significant; P < 0.05 significant.
Variance ratio test or F test
Variance: a measure of the extent of the variation present in a set of data; obtained by taking the sum of squares; measured in squared units. Used for comparison of the variances of two samples. Test developed by Fisher & Snedecor; involves another distribution called the F distribution. Calculate the variances of the two samples first, S1² & S2² (variance = SD²), then
F = S1² / S2², with S1² > S2² (the larger variance, S1², in the numerator).
The significance of F is found by referring to the F table at degrees of freedom (n1 − 1) & (n2 − 1) in the two samples; the table gives variance ratio values at different levels of significance, with df (n1 − 1) given horizontally and (n2 − 1) vertically.
Eg: sample A: sum of squares = 36, df = 8; sample B: sum of squares = 42, df = 9.
F = (42/9) / (36/8) = (42/9) × (8/36) = 1.04
This value of F < the table value at p = 0.05, so the difference is not significant.
Analysis of variance (ANOVA) test: compares more than two samples; compares the variation between the classes as well as within the classes. For such comparisons there is a high chance of error using the t or Z test. Variation in experimental studies is of two kinds: natural/random/error variation, and variation caused by the experimenter (imposed or treatment variation).
A: between-groups variation = random variation (always) + imposed variation (maybe)
B: within-group variation = random variation
Total variation = A + B
If there is no real difference between the groups, then
F = between-treatment variation / within-treatment variation = random / random ≈ 1
If there is a real difference between the groups, then
F = (random variation + imposed variation) / random variation > 1
Chi square test (χ² test): a non-parametric test developed by Karl Pearson; not based on the normal distribution of any variable; used for qualitative data; tests whether the difference in distribution of attributes in different groups is due to sampling variation or otherwise.
Applications: 1. Test for goodness of fit 2. Test of association (independence) 3. Test of homogeneity of population variance. The χ² test is non-parametric in the first two cases and parametric in the third.
Calculation of the χ² value — three requirements: a random sample; qualitative data; lowest expected frequency > 5.
χ² = Σ (observed f − expected f)² / expected f
Expected f = row total × column total / grand total
df = (r − 1) × (c − 1)
The calculated value is compared with the table value.
Drawbacks: tells us about the association but fails to measure the strength of the association.
The test is unreliable if the expected frequency in any one cell is less than 5; correction is done by subtracting 0.5 from |O − E| (Yates's correction). For tables larger than 2 × 2, Yates's correction cannot be applied. Not applicable when there is 0 or 1 in any of the cells (resort to Fisher's exact probability test). χ² values are interpreted with caution when the sample is < 50.
Non-parametric tests: a family of statistical tests, also called distribution-free tests, that do not require any assumption about the distribution the data set follows and that do not require the testing of distribution parameters such as means or variances.
Friedman's test: non-parametric equivalent of analysis of variance.
Kruskal-Wallis test: to compare the medians of several independent samples; equivalent of one-way analysis of variance.
Mann-Whitney U test: compares the medians of two independent samples; equivalent of the t test.
McNemar's test: a variant of the chi-squared test, used when the data are paired.
Wilcoxon's signed-rank test: for paired data.
Spearman's rank correlation: a non-parametric correlation coefficient.
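The χ² calculation described above (expected f = row total × column total / grand total, df = (r − 1)(c − 1)) can be sketched on a 2 × 2 table; the counts below are hypothetical, made up purely for illustration:

```python
# Hypothetical 2x2 table (made-up counts): caries present/absent
# in two illustrative groups
observed = [[10, 40],   # group 1: caries, no caries
            [30, 20]]   # group 2: caries, no caries

row_totals = [sum(r) for r in observed]            # [50, 50]
col_totals = [sum(c) for c in zip(*observed)]      # [40, 60]
grand = sum(row_totals)                            # 100

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected f = row x col / grand
        chi_sq += (o - e) ** 2 / e                 # sum of (O - E)^2 / E

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1) = 1
```

For these counts χ² ≈ 16.67 with df = 1, well above the table value of 3.84 at p = 0.05, so the null hypothesis of no association would be rejected.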
CONCLUSION It's more important to understand the indications and limitations of the various statistical tests than the underlying mathematical calculations, since the latter are taken care of by software such as SPSS. Understanding the classification of data is crucial for the selection of the appropriate test of significance.
REFERENCES
B.K. Mahajan. Methods in Biostatistics, 6th edition, Jaypee Brothers.
P.S.S. Sundar Rao, J. Richard. An Introduction to Biostatistics, 3rd edition, Prentice Hall of India.
James F. Jekel, David L. Katz, Joann G. Elmore. Epidemiology, Biostatistics and Preventive Medicine, 2nd edition, WB Saunders Company.
C.R. Kothari. Research Methodology.
Soben Peter. Preventive and Community Dentistry, 4th edition.