Introduction
Descriptive Statistics
Dr Asad Ali
Department of Space Science Institute of Space Technology Islamabad, Pakistan
About Me
Name: Asad Ali
PhD (2007-2011) in Astro-Statistics from the Department of Statistics, University of Auckland, New Zealand.
During PhD:
Worked as a gravitational wave data analyst on the NASA-ESA space mission, the Laser Interferometer Space Antenna (LISA), a space-borne GW detector.
Developed Bayesian Monte Carlo algorithms for gravitational wave spectrum analysis.
Used supercomputers such as BeSTGRID (AUS-NZ) and ATLAS (Max Planck Institute for Gravitational Physics (AEI), Hannover, Germany).
Currently associated with a European project, the Einstein Telescope (a deep-Earth GW detector), in the same role.
Office 022, Hostel Block.
Why I am here?
These are just reference notes. There is no need to memorize them.
The purpose of these lecture notes is just to teach you the "how to" of statistics.
They are self-explanatory.
You are an END user... You are not supposed to worry about how your car is manufactured... rather, you need to learn how to drive it...
So... spend your minds on understanding the statistical concepts and their applications.
If you cannot present and interpret your measurements, your knowledge is of an unsatisfactory kind.
There are several good books on statistics available in the IST library.
There are lots of good websites on the Internet as well.
Textbooks
1. Modern Mathematical Statistics with Applications, Second Edition, by Jay L. Devore
2. Mathematical Statistics with Applications, by John E. Freund

Reference Material
1. Probability and Statistics for Engineering and the Sciences, Fifth Edition, by Jay L. Devore and Kenneth N. Berk
2. Probability & Statistics for Engineers and Scientists, Fifth Edition, by Ronald E. Walpole

You can pick any book and visit any website that you think can help you in learning statistics. For example, I would recommend
1. Introduction to Statistical Theory, by Sher Muhammad Chaudhry and Dr Shahid Kamal (Part I, for now).
Chapter 1: Introduction
Everything dealing with the collection, processing, analysis, and interpretation of numerical data belongs to the domain of statistics. In engineering, this includes such diversified tasks as
calculating the average length of the downtimes of a computer (System Engineering)
collecting and analyzing data on various weather events: temperature, air pressure, water vapor (Meteorology)
evaluating the effectiveness of commercial products (Quality Control)
predicting the reliability of a rocket, or studying the vibrations of airplane wings (Aerospace and Aeronautics)
estimating the break-points and analyzing the stress-strain relationships of materials (Materials)
calculating the average life of electrical equipment (Electrical Engineering)
and many other tasks pertaining to engineering and other disciplines of science and art.
What is Statistics?
Statistics is the area of science that deals with the collection, organization, analysis, and interpretation of data to assist in making more effective decisions in the face of uncertainty (incomplete information. Why? Note it.).

Branches of statistics: statistics can be divided into two major branches.
Descriptive statistics involves the organization, summarization, and display of data. Descriptive statistics are typically presented graphically, in tabular form (in tables), or as summary statistics (single values).
Inferential statistics are used to interpret the meaning of descriptive statistics. They are procedures that allow researchers to infer or generalize observations made with samples to the larger population from which the samples were selected.

In this course you will learn how to apply and interpret both types of statistics in science and in practice, to make you a good interpreter of statistical information and an excellent decision maker in the face of incomplete information.

Further reading and exercises: Have a look at the introduction and Section 1.1 of Devore's book and the examples therein.
Variable
A variable is a characteristic (or an attribute) that describes a person, place, thing, or idea. The value of the variable varies from one entity to another. Variables are generally denoted by $X$, $Y$, $Z$ and their values/realizations by $x_i$, $y_i$, $z_i$, with the subscript $i$ denoting the $i$th object/item for which the observation is made. More clearly, $x_i$ is simply the $i$th observation on $X$.

Types of variables
Quantitative variable: A variable is called quantitative when the characteristic can be expressed numerically, such as temperature, time, weight, number of students in classes etc.
Qualitative variable: A variable is called qualitative when the characteristic can be expressed only with different categories, such as eye color (blue, brown, black), education (BA, MA, MS), survey response (yes, no, agree, disagree) etc.
Measurement scales: The types of measurements of observations are usually called measurement scales. There are four of them, listed below.
Nominal scale: Categorical with no ordering or ranking, e.g. red, blue, green
Ordinal scale: Categorical with ordering or ranking, e.g. low, medium, strong
Interval scale: A constant interval size, but with no meaningful zero point, e.g. temperature in degrees Celsius
Ratio scale: An interval scale with a meaningful zero point, e.g. length, age, weight
Another type of error: The difference between a statistic and a parameter is important to understand.
Further reading: Have a look at Section 1.1 in Devore and try to solve the exercises at the end of the section.
Descriptive Statistics
Researchers can measure many physical processes, such as pressure, strength, survival time, and amount. Often, hundreds or thousands of measurements are made, and procedures have been developed to organize, summarize, and make sense of these measurements. These procedures, referred to as descriptive statistics, are specifically used to condense and summarize numerical observations to get the initial (meaningful) information and make the data ready for further manipulations.

In the univariate case, descriptive statistics mainly covers the following tasks of data analysis.
Presentation of data using
  tabulation methods (frequency distributions)
  graphical methods (diagrams and graphs)
Measures of central tendency (averages and quantiles)
Measures of dispersion (ranges, deviations, variations)

In the multivariate case, descriptive statistics covers, along with the above, the analysis of the relationships (covariance, correlation, regression etc.) between different variables as well.
Presentation
Tabulation methods

Frequency distribution: The frequency ($f$) of a particular observation is the number of times that observation occurs in the data. A frequency distribution is a table that lists the observations along with their respective frequencies.

Frequency distribution with no grouping: For discrete data with a small range (or a small number of actually distinct values), the frequency table is constructed by arranging the collected data values in ascending order of magnitude with their corresponding frequencies.

Frequency distribution with grouping: In the case of a very broad range of values, or if the data is continuous, the entire data is divided into different non-overlapping groups or classes, with the number of observations falling in each group or class.

A frequency distribution condenses bulky data to a small table, which tells us about the pattern and shape of the distribution of values of the underlying variable or population.
A very simple example (without grouping)

Example 1. The marks awarded for an assignment set for a BE (MS&E) class of 20 students were as follows:
6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8
Present this information in a frequency table.

Solution: To construct a frequency table, we proceed as follows:
Draw a three-column table with the column headings Marks, Tally, and Frequency.
Put all the distinct values, without repetition, in the first column in ascending (or descending) order, as shown below.

Marks   Tally   Frequency
4
5
6
7
8
9
10
Data: 6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8
The first data value is 6, so put a tally bar against 6; the second is 7, so put a tally bar against 7 too. Go ahead and put tallies for all the values.
When the number of tally bars for a value reaches 5, draw the fifth bar as a slash across the previous four to bundle them into a group.
Count the bars for each data value; that count is the frequency.

Marks   Tally            Frequency
4       ||               2
5       ||               2
6       ||||             4
7       |||| (crossed)   5
8       ||||             4
9       ||               2
10      |                1

So we now have the data in a meaningful form. We can now answer the following questions:
Where is the data concentration (peak) point?
How is it declining?
Is this a normal marks distribution, or is there something wrong with the class performance?
Do we need further investigations?
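The tallying procedure of Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the variable names `marks` and `frequency` are assumptions made for this example and are not part of the notes.

```python
from collections import Counter

# Marks of the 20 students from Example 1.
marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7, 4, 10, 6, 8, 8, 9, 5, 6, 4, 8]

# Counter does the tallying; sorting puts the distinct values in ascending order.
frequency = dict(sorted(Counter(marks).items()))

print("Marks  Frequency")
for mark, f in frequency.items():
    print(f"{mark:>5}  {f:>9}")
print(f"Total  {sum(frequency.values()):>9}")
```

Checking that the frequencies add up to the number of observations (20 here) is the same safeguard as summing the Frequency column by hand.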
The "how to" of a frequency distribution with grouping.
When there are too many values in the data and they are more spread out, it is difficult to set up a frequency table for every data value, as there will be too many rows in the table. Before proceeding, we need to learn a few terms and rules that we will need for the construction of a frequency distribution with grouping or classes.

Class-limits: The numbers that describe a class or group. The two limits are called the lower class-limit and the upper class-limit. The class-limits (CL) should be inclusive and should not cause any overlap between adjacent classes, e.g. age in years can be classified as 10-14, 15-19, 20-24 or 10.0-14.9, 15.0-19.9, 20.0-24.9 etc.

Class-boundaries: The class-boundaries (CB) are precise numbers that separate one class from its nearest neighbours. A CB is just the midpoint between the upper limit of one class and the lower limit of the next class, e.g. for the first two classes 10-14 and 15-19 the class boundary is $\frac{14+15}{2} = 14.5$. Thus, for 10-14, 15-19, 20-24, the CBs are 9.5-14.5, 14.5-19.5, 19.5-24.5, so CBs are more precise than class-limits by one decimal place. The upper class-boundary of one class coincides with the lower class-boundary of the next class, thus leaving no gap.

Class marks: Class marks are simply the midpoints of classes. For example, the class mark of class 10-14 is $\frac{10+14}{2} = 12$.

Class interval or class width: The class interval, traditionally denoted by $h$, is the difference between the two class-boundaries of the same class, or the difference between the lower (or upper) limits of two consecutive classes. In the above case the class interval is 5. Ideally, all the classes should have equal intervals. Unequal intervals can also happen, but should be avoided unless required, because they make interpretation difficult.

Class frequency: The frequency of a particular class is the number of data values that fall within the limits of that class.
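As a small check of these definitions, the following sketch derives class-boundaries, class marks and the class width from inclusive class-limits. The helper name `class_summary` is hypothetical, and the sketch assumes equally spaced classes with the same gap between every pair of adjacent classes.

```python
def class_summary(limits):
    """Return (boundaries, marks, width) for inclusive, equally spaced class-limits."""
    # Gap between the upper limit of one class and the lower limit of the next,
    # e.g. 15 - 14 = 1; each boundary sits in the middle of that gap.
    gap = limits[1][0] - limits[0][1]
    boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in limits]
    # Class mark = midpoint of the class.
    marks = [(lo + hi) / 2 for lo, hi in limits]
    # Class width h = difference between the two boundaries of one class.
    width = boundaries[0][1] - boundaries[0][0]
    return boundaries, marks, width


limits = [(10, 14), (15, 19), (20, 24)]
boundaries, marks, width = class_summary(limits)
print(boundaries)  # [(9.5, 14.5), (14.5, 19.5), (19.5, 24.5)]
print(marks)       # [12.0, 17.0, 22.0]
print(width)       # 5.0
```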
A typical frequency distribution with grouping looks like the following table.

Classes   Class-boundaries   Tally bars   Class-marks        Frequency
10-14     9.5-14.5                        (10+14)/2 = 12
15-19     14.5-19.5                       17
20-24     19.5-24.5                       22

The columns of class-boundaries and class-marks help in the calculation of different statistical quantities such as the mean, median and quantiles, as we will see in the next chapter.
A few rules
How many classes? There is no hard rule for deciding how many classes we should make. Both too few and too many classes defeat the purpose of constructing the frequency distribution: too few classes result in the loss of a lot of information, and too many classes kill the purpose of condensation. As a rule of thumb, a number between 5 and 15 gives reasonable results.
(I think 15 is still too large; I would not take a number larger than 10 unless I am using a computer.)

Find the range, that is, the difference between the maximum and the minimum values in the data.

Calculate the class width/interval $h$ by dividing the range of the data by the number of classes. If the division results in a decimal number, take the next higher whole number. Avoid using fractional numbers as intervals; it brings you a headache. Taking a multiple of 5 or 10 eases the problem and also increases the readability of the table. The resulting classes should cover the whole of the data.
Note: you can also choose a proper interval first and then calculate the number of classes, provided the whole data is covered in a reasonable number of classes.

Where to start the first class from? Usually the lower class-limit is put at or below the smallest data value. Remember, the lower class-limit of the first class should never be larger than the smallest value of the data, otherwise the values at the lower end of the data will be lost. Starting from a multiple of 5 or 10 would not hurt.

Find the upper class-limit by counting from the lower class-limit to the end of the interval. Note that adding the interval directly to the lower class-limit is erroneous because, as we know, the classes are inclusive. Adding the interval to the lower class-limit of a class gives you the lower class-limit of the next class, rather than the upper limit of the same class. (Most students forget it... be careful.)
Find the rest of the classes by just adding the interval to the lower and the upper class-limits to get the lower and upper class-limits of the next class.

Now the hard part... scanning the data (mouse hunt)... and putting the values in the appropriate classes. Place tally marks and frequencies.

Determine the sum of the frequencies to check whether all the values were included.

An example of frequency distribution with grouping

Example 2. Thirty energy saver light bulbs were tested to determine how long they usually last. The results, to the nearest day, were recorded as follows:
423 392 399 369 408 415 387 431 428 411
401 422 393 363 396 394 391 372 371 405
410 377 382 419 389 400 386 409 381 390
Construct a frequency distribution for these values.

Solution: First we need to find the range:
$$\text{Range} = \text{Largest} - \text{Smallest} = 431 - 363 = 68$$
Let there be 8 classes; the class interval is therefore
$$h = \frac{\text{Range}}{\text{Number of classes}} = \frac{68}{8} = 8.5 \approx 10.0$$
We take $h = 10.0$ because it eases the data scanning process.
Now let's make the table and set the classes. The smallest value is 363, so we start from 360 and set the first class as 360-369, the second as 370-379 and so on. Now start scanning the data, allocate the values to their corresponding classes and put tallies for them accordingly. When a data value is allocated to some class, cancel that value in the actual data set, indicating that it has been counted, to avoid recounting.

Classes   Tally bars   Frequency (f)
360-369
370-379
380-389
390-399
400-409
410-419
420-429
430-439
Total
Go on scanning, canceling and counting, and put the tallies accordingly. Fill up the rest of the columns.

Classes   Frequency (f)
360-369   2
370-379   3
380-389   5
390-399   7
400-409   5
410-419   4
420-429   3
430-439   1
Total     $\sum f = n = 30$

Sum up the frequencies to check whether all the data values have been picked up. By looking at this frequency distribution, we can quickly find that most of the bulbs have a life between 390 and 399 days, as this group has the largest frequency (7). Thus, this group can be regarded as a representative group of this data. We can also see how the frequencies decrease toward the tails of the distribution, and that the distribution looks fairly symmetric.
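The construction steps of Example 2 (range, class width, starting point, scanning) can be sketched as follows. This is a hedged illustration only: the names `lifetimes`, `h`, `start` and `table` are assumptions, and rounding the width up to a multiple of 10 and starting at a multiple of the width simply mirror the choices made in the example rather than a general rule.

```python
import math

# Bulb lifetimes (days) from Example 2.
lifetimes = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
             401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
             410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

data_range = max(lifetimes) - min(lifetimes)   # 431 - 363 = 68
raw_width = data_range / 8                     # 8 classes requested: 68 / 8 = 8.5
h = 10 * math.ceil(raw_width / 10)             # round up to a multiple of 10 (choice made in the example)
start = (min(lifetimes) // h) * h              # multiple of h at or below the smallest value

table = {}
lower = start
while lower <= max(lifetimes):
    upper = lower + h - 1                      # inclusive upper class-limit, NOT lower + h
    table[(lower, upper)] = sum(lower <= x <= upper for x in lifetimes)
    lower += h                                 # adding h gives the next lower class-limit

for (lo, hi), f in table.items():
    print(f"{lo}-{hi}  {f}")
print("Total", sum(table.values()))
```

The line `upper = lower + h - 1` encodes the warning in the rules above: adding the interval to a lower class-limit yields the next class's lower limit, not the same class's upper limit.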
Relative frequency and percentage frequency

While studying these data we may want to know not only how long the bulbs last, but also what proportion of the bulbs falls into each class of bulb life. This is called the relative frequency (RF) of a particular observation or class, and is found by dividing its corresponding frequency ($f$) by the total number of observations $n$, that is:
$$\text{RF} = \frac{f}{n}$$
A clearer measure is the percentage frequency, which is found by multiplying each relative frequency value by 100. Thus:
$$\text{PRF} = \text{RF} \times 100$$
The PRF tells us what percent of the observations fall in a particular class. This gives us a somewhat clearer picture than the RF.
Example 3. Let's calculate the RF and PRF for Example 2.

Classes   f                   RF = f/n       PRF = RF x 100
360-369   2                   2/30 = 0.07    (2/30) x 100 = 7
370-379   3                   3/30 = 0.10    (3/30) x 100 = 10
380-389   5                   0.17           17
390-399   7                   0.23           23
400-409   5                   0.17           17
410-419   4                   0.13           13
420-429   3                   0.10           10
430-439   1                   0.03           3
Total     $\sum f = n = 30$   1.0            100

Looking at this table, we can now say that:
The chance of any randomly selected bulb having a life in the range 390-399 days is approximately 0.23.
23% of the bulbs have a life from 390 days up to, but less than, 400 days.
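The RF and PRF columns of Example 3 follow directly from the two formulas. The sketch below reproduces them; the list names `classes`, `f`, `rf` and `prf` are chosen for this illustration only.

```python
classes = ["360-369", "370-379", "380-389", "390-399",
           "400-409", "410-419", "420-429", "430-439"]
f = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(f)                       # total number of observations

rf = [fi / n for fi in f]        # relative frequency, RF = f / n
prf = [100 * r for r in rf]      # percentage relative frequency, PRF = RF * 100

print("Class     f    RF   PRF")
for label, fi, r, p in zip(classes, f, rf, prf):
    print(f"{label}  {fi:>2}  {r:.2f}  {p:>4.0f}")
```

The relative frequencies always sum to 1 (and the percentages to 100), which is a quick check on the arithmetic.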
Cumulative frequency distribution

A cumulative frequency distribution table is the same as a frequency distribution table with additional columns that give the cumulative frequency (CF) and the cumulative percentage (CP) of the data.

The cumulative frequency distribution gives us an idea of how many observations fall below or above a given value. It also tells us the number of observations that lie within an interval between two given values.

The CFs are obtained by adding the frequency of each class, in succession, to the cumulative total of the previous frequencies, that is, by accumulating (taking the running total of) the elements of the frequency column.

The accumulation can be conducted either from the top class (or value), in which case the CF is called the "less than" type CF, or from the bottom class (or value), which is known as the "more than" type CF.

In grouped data, the upper class boundaries are used for the "less than" type CF and the lower class boundaries are used for the "more than" type.
Example 4. We calculate a "less than" type CF and CP for the data in Example 2.

Upper Class Boundaries   f        CF            CP = (CF/n) x 100
<369.5                   2        2             (2/30) x 100 = 7
<379.5                   3        2+3 = 5       (5/30) x 100 = 17
<389.5                   5        5+5 = 10      33
<399.5                   7        10+7 = 17     57
<409.5                   5        17+5 = 22     73
<419.5                   4        22+4 = 26     87
<429.5                   3        26+3 = 29     97
<439.5                   1        29+1 = 30     100
Total                    n = 30

Suppose we have been asked to find how many, or what percent, of the observations lie below 399.5. From the table we quickly learn that there are 17 observations below the given value, which makes them 57% of the entire data.
Note: We use the upper class boundaries for a "less than" (<) type CF distribution.
Example 5. Now let's calculate a "more than" type CF and CP for the data in Example 2.

Lower Class Boundaries   f        CF            CP = (CF/n) x 100
>359.5                   2        28+2 = 30     (30/30) x 100 = 100
>369.5                   3        25+3 = 28     (28/30) x 100 = 93
>379.5                   5        20+5 = 25     83
>389.5                   7        13+7 = 20     67
>399.5                   5        8+5 = 13      43
>409.5                   4        4+4 = 8       27
>419.5                   3        1+3 = 4       13
>429.5                   1        1             3
Total                    n = 30

Suppose now we are asked how many, or what percent, of the observations lie above 399.5. From the table we quickly learn that there are 13 observations above the given value, which makes them 43% of the entire data.
Note: We use the lower class boundaries for a "more than" (>) type CF distribution.
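Both kinds of cumulative frequency are running totals of the same frequency column, taken from opposite ends. A minimal sketch using the standard library is given below; the variable names are assumptions made for this illustration.

```python
from itertools import accumulate

f = [2, 3, 5, 7, 5, 4, 3, 1]          # class frequencies of the bulb data
n = sum(f)
upper_cb = [369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5, 439.5]
lower_cb = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]

# "Less than" type: running total starting from the top (first) class.
less_than_cf = list(accumulate(f))
# "More than" type: running total starting from the bottom (last) class.
more_than_cf = list(accumulate(f[::-1]))[::-1]

# Cumulative percentages, rounded to whole percents as in the tables.
less_than_cp = [round(100 * c / n) for c in less_than_cf]
more_than_cp = [round(100 * c / n) for c in more_than_cf]

for cb, cf, cp in zip(upper_cb, less_than_cf, less_than_cp):
    print(f"<{cb}  CF={cf:>2}  CP={cp:>3}")
for cb, cf, cp in zip(lower_cb, more_than_cf, more_than_cp):
    print(f">{cb}  CF={cf:>2}  CP={cp:>3}")
```

At any shared boundary the two types add up to n; for instance, at 399.5 there are 17 observations below and 13 above.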
Graphical Methods

We now introduce the widely used graphic displays for data presentation in engineering sciences. Most of the time we want a visual presentation of data in order to see the patterns in it clearly. Patterns in data are commonly described in terms of center, spread, shape, and unusual features. Some common distributions have special descriptive labels, such as symmetric, bell-shaped, skewed, etc. We often need answers to questions like:
Where are the data located (center)?
How spread out are the data?
Are the data symmetric or skewed?
Are there outliers in the data?

Histogram

A histogram is a visual version of a frequency table. The main purpose of a histogram is to enhance the presentation of data. You can present the same information in a table; however, the graphic presentation format usually makes it easier to see the nature of the distribution. It consists of vertical bars, usually called bins or frequency bins, that represent the different classes of a frequency table. Usually, there is no space between adjacent bars. The height of the bars indicates the frequency of the classes. A histogram can typically help you answer the following questions:
What is the most frequent observation?
What distribution (center, variation and shape) does the data have?
Does the distribution of the data look symmetric, or is it skewed towards the left or right?
Example 6. Let's construct a histogram and a relative frequency histogram for the energy saver bulbs data given in Example 2. We have already constructed the frequency table in Example 3. Let's now depict it.

[Figure: "Histogram of Data" (frequency histogram, tallest bar at frequency 7) and the corresponding relative frequency histogram of the bulb data; the horizontal axes run from 360 to 440.]

One can also construct a percentage relative frequency histogram by multiplying the relative frequencies by 100.
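Without assuming any plotting library, a rough text rendering of the same histogram can be sketched as below: one row per class, with a bar whose length equals the class frequency. This is only a stand-in for a proper plot; the names `classes`, `f` and `bars` are illustrative.

```python
classes = ["360-369", "370-379", "380-389", "390-399",
           "400-409", "410-419", "420-429", "430-439"]
f = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(f)

# One '#' per observation: the bar length plays the role of the bar height.
bars = ["#" * fi for fi in f]

for label, bar, fi in zip(classes, bars, f):
    # Frequency, relative frequency and percentage share the same shape;
    # only the scale of the axis changes.
    print(f"{label} | {bar:<7} f={fi}  RF={fi / n:.2f}  PRF={100 * fi / n:.0f}%")
```

Because RF and PRF are just the frequencies divided by n (and multiplied by 100), all three histograms have identical shapes and differ only in the labelling of the vertical axis.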
Some of the key features that we usually look for in a histogram.

Center: Graphically, the center of a distribution is located at the median of the distribution. The median is the point in a graphic display where about half of the observations are on either side. In the chart to the right, the height of each column indicates the frequency of observations. Here, the observations are centered over 4.

Spread: The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are more clustered around a single value, the spread is smaller.
Shape: The shape of a distribution is described by the following characteristics.

Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal.

Symmetry. When it is graphed, a unimodal symmetric distribution can be divided at the center so that each half is a mirror image of the other. A single-peaked symmetric distribution is referred to as a bell-shaped distribution.

Skewness. When displayed graphically, some unimodal distributions have many more observations on one side of the graph than on the other. Distributions with most of their observations on the left (toward lower values) are said to be skewed right, and distributions with most of their observations on the right (toward higher values) are said to be skewed left.

Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peak(s).

Gaps. Gaps refer to areas of a distribution where there are no observations. The second-last figure on the next slide has a gap; there are no observations in that part of the distribution.

Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers.
[Figure: example histogram shapes, with panels titled "A normal distribution", "A skewed distribution", "A uniform distribution", "A bimodal distribution" and "Cliplike distribution"; horizontal axis $x_i$, vertical axis $f(x_i)$.]
Cumulative Histogram

Like the histogram and frequency table pairing, the cumulative histogram is a visual version of the cumulative frequency table. It tells what number or percentage of the total observations has accumulated up to each bin (or interval). It makes finding the percentage or proportion of observations falling within a given interval much easier. An ordinary and a cumulative histogram of the same data are given in the following figures.

[Figure: "Histogram of Data" (vertical axis: Frequency, tallest bar 7) next to the cumulative histogram of the same data (vertical axis: Cumulative Frequency, rising to 30); horizontal axes: Data values, from 360 to 440.]

The cumulative histogram embodies the concept that most probability distributions use to calculate the probabilities associated with different events. So learning about it, and understanding it, is a must.
Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values, especially discrete values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.

Example 7. The study included 33 students whose first-grade IQ scores are given here:

The following figure shows a dotplot for the above data. A representative IQ value is around 110, and the data is fairly symmetric about the center.
Stem and Leaf Displays

A stem-and-leaf plot (aka stemplot) of a quantitative variable is a textual graph that classifies data items according to their most significant numeric digits. It is generally used for small data sets (50 or fewer observations).

A stem and leaf display is similar to a histogram, since it shows how many values in a set fall within a certain interval. It carries even more information, because it also shows the actual values within the interval.

A stem is the leading digit (or digits) of an observation, whereas the remaining digits are the leaves. For example, the observation 327 can be split as stem = 3 and leaf = 27, or as stem = 32 and leaf = 7.

The stemplot is drawn with two columns separated by a vertical line, with the stems listed to the left of the vertical line. Each stem is listed only once and no numbers are skipped, even if a stem has no leaves. The leaves are listed in increasing order in a row to the right of each stem.

When there is a repeated number in the data (such as two 72s), the plot must reflect it (e.g. the plot of 72 72 75 76 would look like 7 | 2 2 5 6).
Example 8. The stem-and-leaf plot of the energy saver bulb data is constructed as below.

Stem | Leaf
36   | 3 9
37   | 1 2 7
38   | 1 2 6 7 9
39   | 0 1 2 3 4 6 9
40   | 0 1 5 8 9
41   | 0 1 5 9
42   | 2 3 8
43   | 1

(Key: 40|8 = 408)
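The splitting into stems and leaves is mechanical, so it can be sketched in code. The example below is an illustrative sketch only: it assumes three-digit integer data with a two-digit stem (value // 10) and a one-digit leaf (value % 10), and the names `lifetimes` and `stems` are chosen for this example.

```python
from collections import defaultdict

# Bulb lifetimes (days) from Example 2.
lifetimes = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
             401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
             410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

stems = defaultdict(list)
for x in sorted(lifetimes):          # sorting first keeps the leaves in increasing order
    stems[x // 10].append(x % 10)    # e.g. 408 -> stem 40, leaf 8

# List every stem between the smallest and the largest, even one with no leaves.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

The number of leaves in each row equals the class frequency of the corresponding class (360-369, 370-379, ...), which is why the display resembles a histogram turned on its side.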
In this example we could also use a single-digit stem, but then there would have been only two stems, 3 and 4, resulting in a much less informative plot.

In the case of values with decimal points (continuous data), the decimal part of each number is taken as the leaf. Rounding may be used to suppress a certain number of decimal places so that all data values have the same number of decimal places.

Further reading and exercises: Have a look at the introduction and Section 1.2 of Devore's book and the examples therein. Then solve questions 10, 11, 12, 13, 14, 15, 16.a, 16.b, 17, 20, 24, 25, 29 in Exercise 1.2.
For example, the mean of 4, 8, and 9 is
$$\frac{4+8+9}{3} = \frac{21}{3} = 7.$$
The sample mean is written as $\bar{x}$, and the population mean as the Greek letter mu ($\mu$).

Despite its popularity, the mean may not be an appropriate measure of central tendency in skewed distributions, or in data with outliers.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad i = 1, ..., n.$$
You can ignore the limits ($i = 1, n$) of the summation symbol and simply write it as $\sum$.

For grouped data, arranged in a frequency table with $k$ classes with midpoints $x_1, x_2, ..., x_k$ and frequencies $f_1, f_2, ..., f_k$, the mean is given by
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + ... + f_k x_k}{f_1 + f_2 + ... + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}, \qquad i = 1, ..., k.$$
Note that $\sum_{i=1}^{k} f_i = n$.
The arithmetic mean is quite sensitive to a change in any single value, which makes it an inappropriate measure under certain circumstances. It gives good results when the observations are reasonably similar. Its value can be greatly affected by the presence of a single outlier (an extreme-valued observation). For example, in the above example, if one data value, say 52, was mistakenly recorded as 352, the resulting mean would be 110, which is quite different from the previous one, leading to a different decision about the data.
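Classes   Class mark x   f     fx
360-369   364.5          2     729.0
370-379   374.5          3     1123.5
380-389   384.5          5     1922.5
390-399   394.5          7     2761.5
400-409   404.5          5     2022.5
410-419   414.5          4     1658.0
420-429   424.5          3     1273.5
430-439   434.5          1     434.5
Total                    30    11925.0

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{11925}{30} = 397.5$$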
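The grouped mean is a frequency-weighted average of the class marks. The following sketch recomputes it for the bulb data; the names `limits`, `class_marks` and `grouped_mean` are illustrative and not part of the notes.

```python
# Inclusive class-limits and frequencies of the bulb data.
limits = [(360, 369), (370, 379), (380, 389), (390, 399),
          (400, 409), (410, 419), (420, 429), (430, 439)]
f = [2, 3, 5, 7, 5, 4, 3, 1]

# Class mark = midpoint of each class.
class_marks = [(lo + hi) / 2 for lo, hi in limits]

sum_fx = sum(fi * xi for fi, xi in zip(f, class_marks))   # numerator, sum of f*x
n = sum(f)                                                # denominator, sum of f
grouped_mean = sum_fx / n

print(sum_fx, n, grouped_mean)   # 11925.0 30 397.5
```

Note that this value is computed from the class marks, so it treats every observation in a class as if it sat at the class midpoint; it approximates, but need not equal, the mean of the raw data.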
For grouped data the median is calculated by the following formula
$$\tilde{x} = l + \frac{h}{f}\left(\frac{n}{2} - C\right)$$
where $l$ is the lower class boundary of the median class, $h$ is the class interval, $f$ is the frequency of the median class, and $C$ is the cumulative frequency of the preceding class. The median class is the class containing the $\frac{n}{2}$th observation. There is no need to worry about $n$ being odd or even.
(a) Find the median for the full data. (b) Omit the largest (or the smallest) observation and find the median again.

Solution: Rearranging the values in ascending order, we have
7.6 8.3 9.3 9.4 9.4 9.7 10.4 11.5 11.9 15.2 16.2 20.4

(a) Here $n = 12$ is even, so the median is the average of the two middle values, the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations, i.e. the 6th and 7th values, 9.7 and 10.4:
$$\tilde{x} = \frac{9.7 + 10.4}{2} = 10.05$$

(b) Omitting the largest observation leaves
7.6 8.3 9.3 9.4 9.4 9.7 10.4 11.5 11.9 15.2 16.2
Now $n = 11$ is odd, so the median is the single middle value, the $\frac{n+1}{2}$th = 6th observation. Thus the median is $\tilde{x} = 9.7$.

For $n = 12$, the mean is $\bar{x} = \frac{139.3}{12} = 11.61$. Now let's replace 20.4 by 50.0: the mean becomes $\bar{x} = \frac{168.9}{12} = 14.07$, but the median remains unchanged. Thus the median is insensitive to outliers.
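The contrast between the mean and the median under an outlier can be checked numerically. The sketch below uses Python's standard `statistics` module; the variable names are assumptions made for this illustration.

```python
import statistics

values = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]

# n = 12 is even, so the median is the average of the 6th and 7th ordered values.
median_full = statistics.median(values)
mean_full = statistics.mean(values)

# Replace the largest value 20.4 by the outlier 50.0.
with_outlier = values[:-1] + [50.0]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)

# Omit the largest observation: n = 11 is odd, the median is the 6th ordered value.
median_without_largest = statistics.median(values[:-1])

print(f"mean: {mean_full:.2f} -> {mean_outlier:.2f}")
print(f"median: {median_full:.2f} -> {median_outlier:.2f}")
print(f"median without largest value: {median_without_largest}")
```

The mean moves by more than two units when a single value changes, while the median does not move at all, because it depends only on the middle of the ordered data.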
$$\frac{n}{2} = \frac{30}{2} = 15$$
Now we have $l = 389.5$, $h = 10$, $f = 7$, $C = 10$. Thus, by putting the values in the formula, we have
$$\tilde{x} = l + \frac{h}{f}\left(\frac{n}{2} - C\right) = 389.5 + \frac{10}{7}(15 - 10) = 396.64$$
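The grouped-median calculation can be sketched as a small function that first locates the median class from the running cumulative frequency and then applies the formula $l + \frac{h}{f}\left(\frac{n}{2} - C\right)$. The function name `grouped_median` and its argument names are hypothetical, and equal class widths are assumed.

```python
def grouped_median(lower_boundaries, freqs, h):
    """Median of grouped data with equal class width h (illustrative sketch)."""
    n = sum(freqs)
    cumulative = 0                       # C: cumulative frequency of the preceding classes
    for l, f in zip(lower_boundaries, freqs):
        if f > 0 and cumulative + f >= n / 2:
            # This class contains the (n/2)th observation: the median class.
            return l + (h / f) * (n / 2 - cumulative)
        cumulative += f
    raise ValueError("no observations")


lower_cb = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
f = [2, 3, 5, 7, 5, 4, 3, 1]
median = grouped_median(lower_cb, f, h=10)
print(round(median, 2))   # 396.64
```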
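$$\text{Mode} = l + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h$$
where $l$ is the lower class boundary of the modal class, $f_m$ is the frequency of the modal class, and $f_1$ and $f_2$ are the frequencies of the classes just before and just after it.

From the above table we have $l = 389.5$, $h = 10$, $f_1 = 5$, $f_m = 7$ and $f_2 = 5$. So the mode is
$$\text{Mode} = l + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h = 389.5 + \frac{7 - 5}{2(7) - 5 - 5} \times 10 = 394.5$$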
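A minimal sketch of the grouped-mode formula is given below. The function name `grouped_mode` is hypothetical; treating a missing neighbouring class as having frequency 0 and picking the first class in case of a tie are assumptions of this sketch, not rules stated in the notes.

```python
def grouped_mode(lower_boundaries, freqs, h):
    """Mode of grouped data with equal class width h (illustrative sketch)."""
    m = freqs.index(max(freqs))                        # modal class (first one if tied)
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                  # class before (assumed 0 if absent)
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0     # class after (assumed 0 if absent)
    return lower_boundaries[m] + (fm - f1) / (2 * fm - f1 - f2) * h


lower_cb = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
f = [2, 3, 5, 7, 5, 4, 3, 1]
mode = grouped_mode(lower_cb, f, h=10)
print(mode)   # 394.5
```

Here the two neighbouring frequencies are equal (5 and 5), so the mode falls exactly at the middle of the modal class boundaries, 389.5 + 5 = 394.5.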
Quartiles are calculated in the same manner as the median, except for the multiplication by an extra factor of 1, 2 or 3 for the first, second and third quartiles respectively. Since $Q_2$ = Median, we usually calculate only the first and the third quartiles for a given data set. Quartiles are also known as fourths (Devore's terminology). $Q_1$ is called the lower quartile or lower fourth, and $Q_3$ is known as the upper quartile or upper fourth.
If the resulting index contains a fraction (because $n$ is even), then the quartile is the mean of the values at the indices just below and just above it.
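Solution: First arrange the values in ascending order:
14 35 45 55 55 56 56 65 87 89 92
We have $n = 11$, so
$$Q_1 = \frac{(n+1)}{4}\text{th} = \frac{(11+1)}{4}\text{th} = \frac{12}{4}\text{th} = 3\text{rd observation}, \qquad Q_1 = 45$$
and
$$Q_3 = \frac{3(n+1)}{4}\text{th} = \frac{3(11+1)}{4}\text{th} = \frac{36}{4}\text{th} = 9\text{th observation}, \qquad Q_3 = 87$$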
Now, if $n$ were 10, the index of the 1st quartile would be $\frac{10+1}{4} = 2.75$, which is fractional. The quartile would then be the average of the 2nd and 3rd values in the list.
14 35 45 55 55 56 56 65 87 89 92

Since n is odd, there is a single middle value, 56. We divide the above data into two halves and include 56 in both sets, so both halves have an even number of data points. Using the method of finding the median for even n, we get

Lower half: 14 35 45 55 55 56, so $Q_1 = \frac{45 + 55}{2} = 50$
Upper half: 56 56 65 87 89 92, so $Q_3 = \frac{65 + 87}{2} = 76$

Deciles: Deciles divide a rank-ordered data set into ten equal parts. These are defined in the same way as quartiles, except that now the divisor is 10 instead of 4 and j runs from 1 to 10. These are denoted by $D_1, D_2, \dots, D_{10}$.

Percentiles: Percentiles divide a rank-ordered data set into a hundred equal parts. These are also defined in the same way as quartiles, except that now the divisor is 100 instead of 4 and j runs from 1 to 100. These are denoted by $P_1, P_2, \dots, P_{100}$.
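The "two halves" method for quartiles can also be sketched in Python. The function name is hypothetical, and the handling of even n (a plain split into two halves) is an assumption, since the notes only describe the odd case.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value is included in both halves.
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        # Even n (assumption): split the data into two equal halves.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(upper)


data = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]
q1, q3 = quartiles_by_halves(data)
print(q1, q3)
```

Note that this method gives Q1 = 50 and Q3 = 76, while the position rule gives 45 and 87 for the same data; both are in use, so it is worth stating which one you applied.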
For grouped data (frequencies) one has to use the cumulative frequency, as was done in the calculation of the median. Percentiles are useful for giving the relative standing of an individual observation in a population; they are essentially the rank position of an individual observation. For grouped data, we calculate the quartiles, deciles and percentiles using the same formula as that for the median, with a slight modification. Thus,

Quartiles
$$Q_j = l + \frac{h}{f}\left(\frac{jn}{4} - C\right), \quad \text{where } j = 1, 2, 3$$

Deciles
$$D_j = l + \frac{h}{f}\left(\frac{jn}{10} - C\right), \quad \text{where } j = 1, 2, \dots, 10$$

and Percentiles
$$P_j = l + \frac{h}{f}\left(\frac{jn}{100} - C\right), \quad \text{where } j = 1, 2, \dots, 100$$
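Since the three formulas differ only in the divisor, they can be sketched as one Python function. The name `grouped_quantile` and the parameter `parts` are hypothetical; l, f and C are assumed to refer to the class that contains the required quantile. The check below reuses the numbers of the grouped median example, because Q2, D5 and P50 must all coincide with the median.

```python
def grouped_quantile(l, h, f, n, c, j, parts):
    """j-th quantile of grouped data: l + (h / f) * (j * n / parts - c).

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    l, f and c are assumed to belong to the class containing the quantile.
    """
    return l + (h / f) * (j * n / parts - c)


# Median class of the earlier grouped example: l = 389.5, h = 10, f = 7, C = 10, n = 30.
q2 = grouped_quantile(389.5, 10, 7, 30, 10, j=2, parts=4)
d5 = grouped_quantile(389.5, 10, 7, 30, 10, j=5, parts=10)
p50 = grouped_quantile(389.5, 10, 7, 30, 10, j=50, parts=100)
print(round(q2, 2), round(d5, 2), round(p50, 2))
```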
Measures of Dispersion
Imagine you are comparing two different data sets (for now, measured in the same units, e.g. kg, km etc.). By chance, it happens that the two data sets have the same means, medians or modes. Does it mean that the two data sets are the same, or that they have the same features? No. Here we need some extra insight into the data; as a first step, we need to measure their respective dispersions or variabilities about the center and then compare them. Some of the most commonly used measures of dispersion are

Range, Mid-range
Inter-quartile Range (also called the fourth-spread), Semi-inter-quartile Range
Mean Deviation
Variance and Standard Deviation

Range is quite a simple measure: as you know, it is just the difference of the two extreme values in the data. The mid-range is just the average of the two extreme values, i.e.

$$\text{mid-range} = \frac{\text{max-value} + \text{min-value}}{2}$$
Inter-quartile range (aka fourth-spread): The inter-quartile range, denoted IQR, is a measure of spread from the lower quartile to the upper quartile,

$$IQR = Q_3 - Q_1$$

From now on we will denote IQR by $f_s$ for simplicity.

Semi-inter-quartile range (SIQR): This is just half of the IQR,

$$SIQR = \frac{Q_3 - Q_1}{2}$$

The pure measure is the coefficient of quartile deviation (CQD), defined as

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

This measure is free of units of measurement and can be used to compare two or more data sets with different units of measurement.
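The three quartile-based measures are simple enough to compute together. The sketch below uses Q1 = 45 and Q3 = 87 from the earlier quartile example; the function name is hypothetical. The second call rescales the quartiles by 1000 (e.g. kg to g) to illustrate that CQD does not depend on the unit.

```python
def quartile_spread(q1, q3):
    """Return (IQR, SIQR, CQD) for the given lower and upper quartiles."""
    iqr = q3 - q1
    siqr = iqr / 2
    cqd = (q3 - q1) / (q3 + q1)
    return iqr, siqr, cqd


iqr, siqr, cqd = quartile_spread(45, 87)

# Same quartiles expressed in a unit 1000 times smaller.
iqr_scaled, siqr_scaled, cqd_scaled = quartile_spread(45_000, 87_000)

print(iqr, siqr, round(cqd, 3))
print(iqr_scaled, siqr_scaled, round(cqd_scaled, 3))
```

IQR and SIQR change with the unit, whereas CQD stays the same.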
Mean Deviation: The mean (or median) deviation (MD), also called the mean absolute deviation (MAD), is a measure of dispersion defined as the average of the absolute differences (deviations) between the data values and the data center (usually the mean or the median). Mathematically, using the mean as the data center,

$$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

Similarly, for the median the MedD is defined as

$$MedD = \frac{\sum_{i=1}^{n} |x_i - \tilde{x}|}{n}$$

where $\bar{x} = \frac{\sum x}{n}$.

For grouped data, arranged in a frequency table with k classes having midpoints $x_1, x_2, \dots, x_k$ and frequencies $f_1, f_2, \dots, f_k$, the MD and MedD are given by

$$MD = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n} \quad \text{and} \quad MedD = \frac{\sum_{i=1}^{k} f_i |x_i - \tilde{x}|}{n}$$

where $\bar{x} = \frac{\sum f x}{n}$.
Example 16. Find the MD and MedD for the following simple data.

65 55 89 56 35 14 56 55 87 45 92

Solution: Let's denote the data by X. What we need first are the mean and the median. The mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{65 + 55 + \dots + 92}{11} = \frac{649}{11} = 59.$$

Since n is odd, the median is just the middle observation of the ordered data,

14 35 45 55 55 [56] 56 65 87 89 92

so $\tilde{x} = 56$.
Let's arrange the data in a table and calculate the quantities required by the above formulas, i.e. $\sum |x_i - \bar{x}|$ and $\sum |x_i - \tilde{x}|$.

x_i    x_i - x̄       |x_i - x̄|    x_i - x̃       |x_i - x̃|
65     65-59 = 6      6            65-56 = 9      9
55     55-59 = -4     4            55-56 = -1     1
89     30             30           33             33
56     -3             3            0              0
35     -24            24           -21            21
14     -45            45           -42            42
56     -3             3            0              0
55     -4             4            -1             1
87     28             28           31             31
45     -14            14           -11            11
92     33             33           36             36
Total                 194          33             185

Hence $MD = \frac{194}{11} = 17.64$ and $MedD = \frac{185}{11} = 16.82$.
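The column sums of Example 16 can be verified with a short Python sketch. The helper `mean_deviation` is a hypothetical name; `mean` and `median` come from the standard `statistics` module.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute deviation of the values from the given center."""
    return sum(abs(x - center) for x in values) / len(values)


data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

x_bar = mean(data)      # 59
x_tilde = median(data)  # 56

# Sums of absolute deviations, matching the totals of the table.
sum_abs_mean = sum(abs(x - x_bar) for x in data)
sum_abs_median = sum(abs(x - x_tilde) for x in data)

md = mean_deviation(data, x_bar)
medd = mean_deviation(data, x_tilde)
print(sum_abs_mean, sum_abs_median, round(md, 2), round(medd, 2))
```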
Example 17. Find the MD and MedD for the following grouped data.

x : 14  35  45  55  56  65  87  89  92
f :  4   7  11  13  18  13   8   6   3

Solution: Again, first we need the mean and the median to calculate the necessary columns. The mean is

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{4870}{83} = 58.7$$

and the median is

$$\tilde{x} = \text{the } \frac{n}{2}\text{th observation} = 41.5\text{th observation}$$

From the cumulative frequencies (cf) in the table that follows, we find that the median is $\tilde{x} = 56$.
x      f     x_i - x̄    f_i|x_i - x̄|    cf    x_i - x̃    f_i|x_i - x̃|
14     4     -44.7       178.7           4     -42         168
35     7     -23.7       165.7           11    -21         147
45     11    -13.7       150.4           22    -11         121
55     13    -3.7        47.8            35    -1          13
56     18    -2.7        48.1            53    0           0
65     13    6.3         82.2            66    9           117
87     8     28.3        226.6           74    31          248
89     6     30.3        182.0           80    33          198
92     3     33.3        100.0           83    36          108
Total  83                1182                              1120

We have all the stuff for MD and MedD,

$$MD = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i} = \frac{1182}{83} = 14.2$$

$$MedD = \frac{\sum f_i |x_i - \tilde{x}|}{\sum f_i} = \frac{1120}{83} = 13.5$$

That's it!
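Example 17 can be reproduced in Python as a sanity check. This is only a sketch with hypothetical helper names; the median is located as the first value whose cumulative frequency reaches n/2, which is how the cf column is used in the notes.

```python
x = [14, 35, 45, 55, 56, 65, 87, 89, 92]
f = [4, 7, 11, 13, 18, 13, 8, 6, 3]

n = sum(f)
grouped_mean = sum(fi * xi for xi, fi in zip(x, f)) / n


def grouped_median(values, freqs):
    """First value whose cumulative frequency reaches n/2."""
    half = sum(freqs) / 2
    cumulative = 0
    for value, freq in zip(values, freqs):
        cumulative += freq
        if cumulative >= half:
            return value


def grouped_mean_deviation(values, freqs, center):
    """Frequency-weighted average absolute deviation from the center."""
    total = sum(fi * abs(xi - center) for xi, fi in zip(values, freqs))
    return total / sum(freqs)


med = grouped_median(x, f)
md = grouped_mean_deviation(x, f, grouped_mean)
medd = grouped_mean_deviation(x, f, med)
print(n, round(grouped_mean, 1), med, round(md, 1), round(medd, 1))
```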
Variance and Standard Deviation: Variance is defined as the mean of the squared deviations of all the observations from the mean. The population variance is denoted by $\sigma^2$ and the sample variance is denoted by $S^2$ or $s^2$. Mathematically, for simple data,

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}, \quad \text{for population data,}$$

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n}, \quad \text{for large sample data } (n > 30),$$

or

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}, \quad \text{for small sample data } (n \le 30).$$

The standard deviation is just the positive square root of the variance, defined as

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}, \qquad S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}},$$

or

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}, \quad \text{for small sample data } (n \le 30).$$
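The difference between the divisors n and n - 1 is easy to see numerically. The sketch below reuses the simple data of Example 16 and compares a hand-written computation with the standard `statistics` module, where `pvariance` and `pstdev` divide by n while `variance` and `stdev` divide by n - 1. The variable names are illustrative only.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean.
sum_sq = sum((x - x_bar) ** 2 for x in data)

var_n = sum_sq / n          # divisor n
var_n1 = sum_sq / (n - 1)   # divisor n - 1

print(round(var_n, 2), round(pvariance(data), 2), round(pstdev(data), 2))
print(round(var_n1, 2), round(variance(data), 2), round(stdev(data), 2))
```

For small n the two versions differ visibly; as n grows, the difference becomes negligible.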
Standard deviation (SD) is a widely used measure of variability or diversity in statistics and probability theory. It shows how much variation or dispersion exists from the average (mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data points are spread out over a large range of values.

Do you remember the distance formula $D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$? Do you notice the similarity between SD and this formula? SD does almost the same job as D, except that it averages the squared deviations and that the coordinates of the second point (here the mean) are the same for all pairs, i.e. for two data points

$$SD = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2}}$$
For grouped data arranged in a frequency table with k classes with midpoints $x_1, x_2, \dots, x_k$ and frequencies $f_1, f_2, \dots, f_k$, the variance and standard deviation are given by

$$S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}$$

and

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}}$$

where

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$

In the next slides we will solve an example problem using the data from Example 17.
Example 18. Find the SD and variance of the data in Example 17.

Solution: Looking at the nature of the data (i.e. observations with frequencies), we need to use the formula for SD for grouped data given on the previous page. That is,

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}}$$

For this we need the mean $\bar{x}$, which is 58.7 (from Example 17), so let's just calculate the quantity required by the above formula, i.e. $\sum_{i=1}^{k} f_i (x_i - \bar{x})^2$. The variance is then the square of the SD. So we need to construct the following table.

x      f     x_i - x̄    f_i(x_i - x̄)^2
14     4     -44.7       7983
35     7     -23.7       3923
45     11    -13.7       2057
55     13    -3.7        176
56     18    -2.7        129
65     13    6.3         520
87     8     28.3        6419
89     6     30.3        5518
92     3     33.3        3332
Total  83                30056

We now have the required stuff for the formula; let's put the values in it:

$$S = \sqrt{\frac{30056}{83}} = 19.0$$

Thus the standard deviation (SD) of the said data is 19.0. The variance is simply the square of S, i.e. $S^2 = \frac{30056}{83} = 362.1$.
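The grouped SD of Example 18 can be checked with a few lines of Python. This is a minimal sketch that applies the grouped formula with divisor $\sum f_i$; the variable names are illustrative.

```python
import math

x = [14, 35, 45, 55, 56, 65, 87, 89, 92]
f = [4, 7, 11, 13, 18, 13, 8, 6, 3]

n = sum(f)
x_bar = sum(fi * xi for xi, fi in zip(x, f)) / n

# Frequency-weighted sum of squared deviations from the mean.
weighted_sum_sq = sum(fi * (xi - x_bar) ** 2 for xi, fi in zip(x, f))

grouped_variance = weighted_sum_sq / n
grouped_sd = math.sqrt(grouped_variance)

print(round(weighted_sum_sq), round(grouped_variance, 1), round(grouped_sd, 1))
```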
In practice, variance and SD are calculated using the computationally friendly formulas given below. For a sample of size n with values $x_i$; $i = 1, 2, \dots, n$,

$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2$$

Similarly for grouped data, distributed in k groups with midpoints $x_i$ and frequencies $f_i$ ($i = 1, 2, \dots, k$), we use

$$S^2 = \frac{1}{n}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{k} f_i x_i^2}{n} - \left(\frac{\sum_{i=1}^{k} f_i x_i}{n}\right)^2$$

The benefit of these formulas is that one does not need to calculate the column of differences, i.e. $x_i - \bar{x}$. By taking the positive square root of $S^2$, we get the SD.
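To see that the shortcut really gives the same answer as the definition, both can be coded side by side. The sketch below uses the grouped data of Example 17; the function names are hypothetical, and the shortcut is written in the form "mean of squares minus square of the mean".

```python
x = [14, 35, 45, 55, 56, 65, 87, 89, 92]
f = [4, 7, 11, 13, 18, 13, 8, 6, 3]


def variance_by_definition(values, freqs):
    """(1/n) * sum of f_i * (x_i - mean)^2."""
    n = sum(freqs)
    mean = sum(fi * xi for xi, fi in zip(values, freqs)) / n
    return sum(fi * (xi - mean) ** 2 for xi, fi in zip(values, freqs)) / n


def variance_by_shortcut(values, freqs):
    """sum(f_i * x_i^2) / n - (sum(f_i * x_i) / n)^2, no difference column."""
    n = sum(freqs)
    sum_fx = sum(fi * xi for xi, fi in zip(values, freqs))
    sum_fx2 = sum(fi * xi * xi for xi, fi in zip(values, freqs))
    return sum_fx2 / n - (sum_fx / n) ** 2


v_def = variance_by_definition(x, f)
v_short = variance_by_shortcut(x, f)
print(round(v_def, 2), round(v_short, 2))
```

On a computer the shortcut can lose precision when the mean is very large compared to the spread, so the definition is the safer choice in code; the shortcut is mainly a convenience for hand calculation.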
[Figure: boxplot of the data, with the lower quartile Q1, the median and the upper quartile Q3 marked.]
Boxplots That Show Outliers

A boxplot can be decorated further to indicate explicitly the presence of outliers.

Definition: Any observation farther than $1.5f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than $3f_s$ from the nearest fourth, and it is mild otherwise.

Example 19. The relevant summary quantities for Example 1.17 (page 39, Devore) are

$\tilde{x} = 92.17$, $Q_1 = 45.64$, $Q_3 = 167.79$
$f_s = Q_3 - Q_1 = 122.15$, $1.5f_s = 183.225$, $3f_s = 366.45$

Subtracting $1.5f_s$ from $Q_1$ gives a negative number, and none of the observations are negative, so there are no outliers on the lower end of the data. However,

$Q_3 + 1.5f_s = 351.015$ and $Q_3 + 3f_s = 534.24$

Thus the four largest observations 563.92, 690.11, 826.54, and 1529.35 are extreme outliers, and 352.09, 371.47, 444.68, and 460.86 are mild outliers. The boxplot for the above data can then be sketched as follows.
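The outlier rule can be written as a small Python function. This is only a sketch: the name `classify_outlier` and the returned labels are hypothetical, and only the eight observations quoted above are used, since the full data set is not reproduced in these notes.

```python
def classify_outlier(x, q1, q3):
    """Classify x as 'extreme', 'mild' or 'not an outlier'.

    Uses the fourth spread fs = q3 - q1 and the distance of x from
    the nearest fourth (zero if x lies between q1 and q3).
    """
    fs = q3 - q1
    distance = max(q1 - x, x - q3, 0.0)
    if distance > 3 * fs:
        return "extreme"
    if distance > 1.5 * fs:
        return "mild"
    return "not an outlier"


q1, q3 = 45.64, 167.79
observations = [352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54, 1529.35]

labels = {x: classify_outlier(x, q1, q3) for x in observations}
for x, label in labels.items():
    print(x, label)
```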
The whiskers in the boxplot in Figure 1.19 extend out to the smallest observation, 9.69, on the low end and to 312.45, the largest observation that is not an outlier, on the upper end. There is some positive skewness in the middle half of the data (the median line is somewhat closer to the left edge of the box than to the right edge, since $\tilde{x} - Q_1 = 46.53$ while $Q_3 - \tilde{x} = 75.62$) and a great deal of positive skewness overall. We will learn about positive/negative skewness in the next few slides.

Most importantly, boxplots can be used to compare several data sets at once, e.g. see the following figure of the monthly boxplots of the daily temperatures in some country.
[Figure: monthly boxplots of daily temperatures, January through December.]