Descriptive Statistics

Dr. C. George Boeree

Descriptive statistics are ways of summarizing large sets of quantitative (numerical) information.
If you have a large number of measurements, the best thing you can do is to make a graph with all the possible scores along the bottom (x axis), and the number of times you came across each score recorded vertically (y axis) in the form of a bar. But such a graph is just plain hard to do statistical analyses with, so we have other, more numerical ways of summarizing the data.

Here is a small set of data: the grades for 15 students. For our purposes, they range from 0 (failing) to 4 (an A), and go up in steps of .2.

John -- 3.0
Mary -- 2.8
George -- 2.8
Beth -- 2.4
Sam -- 3.2
Judy -- 2.8
Fritz -- 1.8
Kate -- 3.8
Dave -- 2.6
Jenny -- 3.4
Mike -- 2.4
Sue -- 4.0
Don -- 3.4
Ellen -- 3.2
Orville -- 2.2

Here is the information in bar graph form:

[Figure: bar graph of the 15 grades]
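As a rough stand-in for the bar graph, the following sketch tallies how often each possible score occurs and prints one row of marks per score. The chart is turned on its side (scores run down the page rather than along the bottom), and the variable names are illustrative only.

```python
from collections import Counter

# Grades copied from the list of 15 students above.
grades = [3.0, 2.8, 2.8, 2.4, 3.2, 2.8, 1.8, 3.8, 2.6,
          3.4, 2.4, 4.0, 3.4, 3.2, 2.2]

# Count how many students received each grade.
counts = Counter(grades)

# Every possible score from 0.0 to 4.0 in steps of 0.2.
possible_scores = [round(0.2 * step, 1) for step in range(21)]

# One text bar per possible score; scores nobody received get an empty bar.
for score in possible_scores:
    print(f"{score:.1f} | {'#' * counts[score]}")
```

Scores that nobody received still get a row, which mirrors the idea of laying out all possible scores along one axis.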
Central tendency

Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way "central" to the set.

The mode. The mode is the measurement that has the greatest frequency, the one you found the most of. Although it isn't used that much, it is useful when differences are rare or when the differences are non-numerical. The prototypical example of something is usually the mode. The mode for our example is 2.8. It is the grade with the most people (3).

The median. The median is the number at which half your measurements are more than that number and half are less than that number. The median is actually a better measure of centrality than the mean if your data are skewed, meaning lopsided. If, for example, you have a dozen ordinary folks and one millionaire, the distribution of their wealth would be lopsided towards the ordinary people, and the millionaire would be an outlier, or highly deviant member of the group. The millionaire would influence the mean a great deal, making it seem like all the members of the group are doing quite well. The median would actually be closer to the mean of all the people other than the millionaire. The median for our example is 2.8: when the 15 grades are sorted from lowest to highest, the eighth (middle) grade is 2.8, with seven grades at or below it on one side and seven above it on the other.

The mean. The mean is just the average. It is the sum of all your measurements, divided by the number of measurements. This is the most used measure of central tendency, because of its mathematical qualities. It works best if the data is distributed very evenly across the range, or is distributed in the form of a normal or bell-shaped curve (see below). One interesting thing about the mean is that it represents the expected value if the distribution of measurements were random! Here is what the formula looks like:

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \]

where the \(x_i\) are the measurements and \(N\) is the number of measurements.
So 3.0 + 2.8 + 2.8 + 2.4 + 3.2 + 2.8 + 1.8 + 3.8 + 2.6 + 3.4 + 2.4 + 4.0 + 3.4 + 3.2 + 2.2 is 43.8. Divide that by 15 and that is the mean or average for our example: 2.92.
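The three measures can be recomputed directly from the 15 grades. This is a minimal sketch using Python's standard `statistics` module; the dictionary and variable names are illustrative, not part of the original article.

```python
from statistics import mean, median, multimode

# Grades for the 15 students, as listed in the example.
grades = {
    "John": 3.0, "Mary": 2.8, "George": 2.8, "Beth": 2.4, "Sam": 3.2,
    "Judy": 2.8, "Fritz": 1.8, "Kate": 3.8, "Dave": 2.6, "Jenny": 3.4,
    "Mike": 2.4, "Sue": 4.0, "Don": 3.4, "Ellen": 3.2, "Orville": 2.2,
}
values = list(grades.values())

grade_mean = mean(values)        # sum of the grades divided by their count
grade_median = median(values)    # middle grade once the list is sorted
grade_modes = multimode(values)  # every grade tied for the highest frequency

print(f"mean   = {grade_mean:.2f}")
print(f"median = {grade_median}")
print(f"modes  = {grade_modes}")
```

`multimode` returns a list, so a data set with several equally common values would show all of them rather than just one.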
Statistical dispersion
Dispersion refers to the idea that there is a second number which tells us how "spread out" all the measurements are from that central number.

The range. The range is the measure from the smallest measurement to the largest one. This is the simplest measure of statistical dispersion or "spread." The range for our example is 2.2, the distance from the lowest score, 1.8, to the highest, 4.0.

Interquartile range. A slightly more sophisticated measure is the interquartile range. If you divide the data into quartiles, meaning that one fourth of the measurements are in quartile 1, one fourth in 2, one fourth in 3, and one fourth in 4, you will get a number that divides 1 and 2 and a number that divides 3 and 4. You then measure the distance between those two numbers; the interval between them contains half of the data. Notice that the number between quartile 2 and 3 is the median! The interquartile range for our example is .9, because the quartiles divide roughly at 2.45 and 3.35. The reason for the odd dividing lines is that there are 15 pieces of data, which, of course, cannot be neatly divided into quartiles!

The standard deviation. The standard deviation is the "average" degree to which scores deviate from the mean. More precisely, you measure how far all your measurements are from the mean, square each one, add them all up, and divide by the number of measurements. The result is called the variance. Take the square root of the variance, and you have the standard deviation. Like the mean, it is the "expected value" of how far the scores deviate from the mean. Here is what the formula looks like:

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]

where \(\bar{x}\) is the mean and \(N\) is the number of measurements.
So, subtract the mean from each score, square each difference, and sum: 5.024. Then divide by 15 and take the square root and you have the standard deviation for our example: .5787.... One standard deviation above the mean is at about 3.5; one standard deviation below is at about 2.3.
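The range and standard deviation can be recomputed step by step from the raw grades. This sketch follows the article's recipe of dividing by the number of measurements (the population form); note that `statistics.stdev` would instead divide by one less than that number, so `statistics.pstdev` is the matching library call. Variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev

# Grades for the 15 students in the example.
grades = [3.0, 2.8, 2.8, 2.4, 3.2, 2.8, 1.8, 3.8, 2.6,
          3.4, 2.4, 4.0, 3.4, 3.2, 2.2]

# Range: distance from the lowest score to the highest.
value_range = max(grades) - min(grades)

# Standard deviation, following the steps in the text.
grade_mean = sum(grades) / len(grades)
sum_of_squares = sum((g - grade_mean) ** 2 for g in grades)
variance = sum_of_squares / len(grades)   # divide by the number of scores
standard_deviation = sqrt(variance)

print(f"range              = {value_range:.1f}")
print(f"sum of squares     = {sum_of_squares:.4f}")
print(f"variance           = {variance:.4f}")
print(f"standard deviation = {standard_deviation:.4f}")
print(f"library check      = {pstdev(grades):.4f}")
```

Working through the steps by hand and then comparing against `pstdev` is a cheap way to catch arithmetic slips.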
The normal curve

At its simplest, the central tendency and the measure of dispersion describe a rectangle that is a summary of the set of data. On a more sophisticated level, these measures describe a curve, such as the normal curve, that contains the data most efficiently.

This curve, also called the bell-shaped curve, represents a distribution that reflects certain probabilistic events when extended to an infinite number of measurements. It is an idealized version of what happens in many large sets of measurements: Most measurements fall in the middle, and fewer fall at points farther away from the middle. A simple example is height: Very few people are below 3 feet tall; very few are over 8 feet tall; most of us are somewhere between 5 and 6. The same applies to weight, IQs, and SATs! In the normal curve, the mean, median, and mode are all the same.
In a normal curve, the region from the mean to one standard deviation below it contains 34.1% of the measures, as does the region from the mean to one standard deviation above it. From one to two standard deviations below contains 13.6%, as does from one to two above. From two to three standard deviations contains 2.1% on each end. Another way to look at it: Between one standard deviation below and above, we have 68% of the data; from two below to two above, we have 95%; from three below to three above, we have 99.7%.

Because of its mathematical properties, especially its close ties to probability theory, the normal curve is often used in statistics, with the assumption that the mean and standard deviation of a set of measurements define the distribution. It should be obvious that this assumption fails in nearly all real cases. The best representation of your measurements is a diagram which includes all the measurements, not just their mean and standard deviation! Our example above is a clear example: a normal curve with a mean of 2.92 and a standard deviation of .58 is quite different from the pattern of the original data. A good real-life example is IQ and intelligence: IQ tests are intentionally scored in such a way that they generate a normal curve, and because IQ tests are what we use to measure intelligence, we often assume that intelligence is normally distributed, which is not at all necessarily true!
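The percentages quoted for the normal curve can be reproduced from the error function, since for a normal distribution the share of measurements within k standard deviations of the mean equals erf(k / sqrt(2)). The helper names below are made up for this sketch.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

def share_between(low, high):
    """Share on ONE side of the mean, between `low` and `high` standard deviations."""
    return (share_within(high) - share_within(low)) / 2

# Bands on one side of the mean: 0 to 1, 1 to 2, and 2 to 3 standard deviations.
for low, high in [(0, 1), (1, 2), (2, 3)]:
    print(f"{low} to {high} SD on one side: {100 * share_between(low, high):.1f}%")

# Symmetric intervals around the mean.
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * share_within(k):.1f}%")
```

The printed values are slightly more precise than the rounded figures in the text (for example, 68.3% rather than 68%).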
Mean, Median, and Mode

In many real-life situations, it is helpful to describe data by a single number that is most representative of the entire collection of numbers. Such a number is called a measure of central tendency. The most commonly used measures are as follows.

1. The mean, or average, of \(n\) numbers is the sum of the numbers divided by \(n\).
2. The median of \(n\) numbers is the middle number when the numbers are written in order. If \(n\) is even, the median is the average of the two middle numbers.
3. The mode of \(n\) numbers is the number that occurs most frequently. If two numbers tie for most frequent occurrence, the collection has two modes and is called bimodal.

Example 1: Comparing Measures of Central Tendency

On an interview for a job, the interviewer tells you that the average annual income of the company's 25 employees is $60,849. The actual annual incomes of the 25 employees are shown below. What are the mean, median, and mode of the incomes? Was the person telling you the truth?

$17,305, $478,320, $45,678, $18,980, $17,408, $25,676, $28,906, $12,500, $24,540, $33,450, $12,500, $33,855, $37,450, $20,432, $28,956, $34,983, $36,540, $250,921, $36,853, $16,430, $32,654, $98,213, $48,980, $94,024, $35,671

Solution

The mean of the incomes is

\[ \text{Mean} = \frac{17{,}305 + 478{,}320 + 45{,}678 + 18{,}980 + \cdots + 35{,}671}{25} = \frac{1{,}521{,}225}{25} = \$60{,}849. \]

To find the median, order the incomes as follows.

$12,500, $12,500, $16,430, $17,305, $17,408, $18,980, $20,432, $24,540, $25,676, $28,906, $28,956, $32,654, $33,450, $33,855, $34,983, $35,671, $36,540, $36,853, $37,450, $45,678, $48,980, $94,024, $98,213, $250,921, $478,320

From this list, you can see that the median (the middle number) is $33,450. From the same list, you can see that $12,500 is the only income that occurs more than once. So, the mode is $12,500. Technically, the person was telling the truth because the average is (generally) defined to be the mean. However, of the three measures of central tendency

Mean: $60,849
Median: $33,450
Mode: $12,500

it seems clear that the median is most representative. The mean is inflated by the two highest salaries.
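Example 1 can be checked with a few lines of Python. The last step, recomputing the mean without the two highest salaries, is an added illustration of how strongly those two values pull the mean upward; it is not part of the original solution.

```python
from statistics import mean, median, mode

# Annual incomes of the 25 employees, in dollars, in the order given.
incomes = [
    17305, 478320, 45678, 18980, 17408, 25676, 28906, 12500, 24540,
    33450, 12500, 33855, 37450, 20432, 28956, 34983, 36540, 250921,
    36853, 16430, 32654, 98213, 48980, 94024, 35671,
]

income_mean = mean(incomes)
income_median = median(incomes)
income_mode = mode(incomes)

# Illustration only: drop the two largest incomes and average the rest.
without_top_two = sorted(incomes)[:-2]
trimmed_mean = mean(without_top_two)

print(f"mean   = {income_mean}")
print(f"median = {income_median}")
print(f"mode   = {income_mode}")
print(f"mean without the two highest salaries = {trimmed_mean:.2f}")
```

With the two highest salaries removed, the mean lands close to the median, which supports the conclusion that the median is the more representative figure here.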
A.2 Measures of Central Tendency and Dispersion

What you should learn
  How to find and interpret the mean, median, and mode of a set of data
  How to determine the measure of central tendency that best represents a set of data
  How to find the standard deviation of a set of data
  How to create and use box-and-whisker plots

Why you should learn it
Measures of central tendency and dispersion provide a convenient way to describe and compare sets of data. For instance, in Exercise 36, the mean and standard deviation are used to analyze the price of gold for the years 1981 through 2000.

Choosing a Measure of Central Tendency

Which of the three measures of central tendency is the most representative? The answer is that it depends on the distribution of the data and the way in which you plan to use the data. For instance, in Example 1, the mean salary of $60,849 does not seem very representative to a potential employee. To a city income tax collector who wants to estimate 1% of the total income of the 25 employees, however, the mean is precisely the right measure.

Example 2: Choosing a Measure of Central Tendency

Which measure of central tendency is the most representative of the data shown in each frequency distribution?

Number    Tally (a)    Tally (b)    Tally (c)
1         7            9            6
2         20           8            1
3         15           7            2
4         11           6            3
5         8            5            5
6         3            6            5
7         2            7            4
8         0            8            3
9         15           9            0

Solution

a. For this data, the mean is 4.23, the median is 3, and the mode is 2. Of these, the mode is probably the most representative.
b. For this data, the mean and median are each 5 and the modes are 1 and 9 (the distribution is bimodal). Of these, the mean or median is the most representative.
c. For this data, the mean is 4.59, the median is 5, and the mode is 1. Of these, the mean or median is the most representative.

Variance and Standard Deviation

Very different sets of numbers can have the same mean.
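A quick way to see this is to compare a few sets that share a mean but look nothing alike. The three sets below are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Three hypothetical sets chosen so that each one averages to 10.
tight_set = [10, 10, 10, 10]
moderate_set = [0, 5, 15, 20]
wide_set = [-100, 0, 20, 120]

for name, numbers in [("tight", tight_set),
                      ("moderate", moderate_set),
                      ("wide", wide_set)]:
    spread = max(numbers) - min(numbers)
    print(f"{name:8s} mean = {mean(numbers)}, "
          f"smallest = {min(numbers)}, largest = {max(numbers)}, "
          f"range = {spread}")
```

The mean alone cannot tell these sets apart, which is why a second number describing spread is needed.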
You will now study two measures of dispersion, which give you an idea of how much the numbers in a set differ from the mean of the set. These two measures are called the variance of the set and the standard deviation of the set.

Appendix A: Concepts in Statistics

Definitions of Variance and Standard Deviation

Consider a set of numbers \(\{x_1, x_2, \ldots, x_n\}\) with a mean of \(\bar{x}\). The variance of the set is

\[ v = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} \]

and the standard deviation of the set is \(\sigma = \sqrt{v}\) (\(\sigma\) is the lowercase Greek letter sigma).

The standard deviation of a set is a measure of how much a typical number in the set differs from the mean. The greater the standard deviation, the more the numbers in the set vary from the mean. For instance, each of the following sets has a mean of 5.

\(\{5, 5, 5, 5\}\), \(\{4, 4, 6, 6\}\), and \(\{3, 3, 7, 7\}\)

The standard deviations of the sets are 0, 1, and 2.

\[ \sigma_1 = \sqrt{\frac{(5 - 5)^2 + (5 - 5)^2 + (5 - 5)^2 + (5 - 5)^2}{4}} = 0 \]

\[ \sigma_2 = \sqrt{\frac{(4 - 5)^2 + (4 - 5)^2 + (6 - 5)^2 + (6 - 5)^2}{4}} = 1 \]

\[ \sigma_3 = \sqrt{\frac{(3 - 5)^2 + (3 - 5)^2 + (7 - 5)^2 + (7 - 5)^2}{4}} = 2 \]

Example 3: Estimations of Standard Deviation

Consider the three sets of data represented by the bar graphs in Figure A.4. Which set has the smallest standard deviation? Which has the largest?

[Figure A.4: bar graphs of Set A, Set B, and Set C, each over the values 1 through 7]

Solution

Of the three sets, the numbers in set A are grouped most closely to the center and the numbers in set C are the most dispersed. So, set A has the smallest standard deviation and set C has the largest standard deviation.
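The definitions of variance and standard deviation translate almost directly into code. The sketch below applies them to the three sets with mean 5 given above; the function names are chosen for this illustration.

```python
from math import sqrt

def variance(numbers):
    """Average squared distance from the mean, dividing by n as in the definition."""
    n = len(numbers)
    x_bar = sum(numbers) / n
    return sum((x - x_bar) ** 2 for x in numbers) / n

def standard_deviation(numbers):
    """Square root of the variance."""
    return sqrt(variance(numbers))

# The three sets from the text, each with a mean of 5.
sets = [[5, 5, 5, 5], [4, 4, 6, 6], [3, 3, 7, 7]]

for numbers in sets:
    print(numbers, "variance =", variance(numbers),
          "standard deviation =", standard_deviation(numbers))
```

Many calculators and libraries also offer a "sample" standard deviation that divides by n - 1 instead of n; that variant would give different values from the definition used in this section.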
Example 4: Finding Standard Deviation

Find the standard deviation of each set shown in Example 3.

Solution

Because of the symmetry of each bar graph, you can conclude that each has a mean of \(\bar{x} = 4\). The standard deviation of set A is

\[ \sigma = \sqrt{\frac{(-3)^2 + 2(-2)^2 + 3(-1)^2 + 5(0)^2 + 3(1)^2 + 2(2)^2 + (3)^2}{17}} \approx 1.53. \]

The standard deviation of set B is

\[ \sigma = \sqrt{\frac{2(-3)^2 + 2(-2)^2 + 2(-1)^2 + 2(0)^2 + 2(1)^2 + 2(2)^2 + 2(3)^2}{14}} = 2. \]

The standard deviation of set C is

\[ \sigma = \sqrt{\frac{5(-3)^2 + 4(-2)^2 + 3(-1)^2 + 2(0)^2 + 3(1)^2 + 4(2)^2 + 5(3)^2}{26}} \approx 2.22. \]

These values confirm the results of Example 3. That is, set A has the smallest standard deviation and set C has the largest.

The following alternative formula provides a more efficient way to compute the standard deviation.

Alternative Formula for Standard Deviation

The standard deviation of \(\{x_1, x_2, \ldots, x_n\}\) is

\[ \sigma = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n} - \bar{x}^2}. \]

Because of messy computations, this formula is difficult to verify. Conceptually, however, the process is straightforward. It consists of showing that the expressions

\[ \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} \quad \text{and} \quad \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n} - \bar{x}^2} \]

are equivalent. Try verifying this equivalence for the set \(\{x_1, x_2, x_3\}\) with \(\bar{x} = (x_1 + x_2 + x_3)/3\).

Example 5: Using the Alternative Formula

Use the alternative formula for standard deviation to find the standard deviation of the following set of numbers.

5, 6, 6, 7, 7, 8, 8, 8, 9, 10

Solution

Begin by finding the mean of the set, which is 7.4. So, the standard deviation is

\[ \sigma = \sqrt{\frac{5^2 + 2(6^2) + 2(7^2) + 3(8^2) + 9^2 + 10^2}{10} - (7.4)^2} = \sqrt{\frac{568}{10} - 54.76} = \sqrt{2.04} \approx 1.43. \]

You can use the statistical features of a graphing utility to check this result.

A well-known theorem in statistics, called Chebychev's Theorem, states that at least

\[ 1 - \frac{1}{k^2} \]

of the numbers in a distribution must lie within \(k\) standard deviations of the mean. So, at least 75% of the numbers in a set must lie within two standard deviations of the mean, and at least 88.9% of the numbers must lie within three standard deviations of the mean. For most distributions, these percentages are low. For instance, in all three distributions shown in Example 3, 100% of the numbers lie within two standard deviations of the mean.

Example 6: Describing a Distribution

The following table shows the number of hospitals (in thousands) in each state and the District of Columbia in 1999. Find the mean and standard deviation of the numbers. What percent of the numbers lie within two standard deviations of the mean? (Source: Health Forum)

AK 17     AL 109    AR 83     AZ 61     CA 395    CO 67
CT 35     DC 12     DE 6      FL 203    GA 154    HI 22
IA 115    ID 42     IL 198    IN 111    KS 131    KY 105
LA 122    MA 79     MD 49     ME 37     MI 145    MN 134
MO 118    MS 96     MT 53     NC 114    ND 41     NE 85
NH 28     NJ 81     NM 36     NV 22     NY 218    OH 167
OK 109    OR 59     PA 210    RI 11     SC 64     SD 48
TN 121    TX 408    UT 42     VA 89     VT 14     WA 86
WI 123    WV 58     WY 23

Solution

Begin by entering the numbers into a graphing utility that has a standard deviation program. After running the program, you should obtain \(\bar{x} \approx 97.18\) and \(\sigma \approx 81.99\). The interval that contains all numbers that lie within two standard deviations of the mean is

\[ [97.18 - 2(81.99),\ 97.18 + 2(81.99)] \quad \text{or} \quad [-66.80,\ 261.16]. \]

From the histogram in Figure A.5, you can see that all but two of the numbers (96%) lie in this interval: all but the numbers that correspond to the number of hospitals (in thousands) in California and Texas.

[Figure A.5: histogram; horizontal axis: number of hospitals (in thousands); vertical axis: number of states]

Box-and-Whisker Plots

Standard deviation is the measure of dispersion that is associated with the mean. Quartiles measure dispersion associated with the median.

Example 7: Finding Quartiles of a Set

Find the lower and upper quartiles for the set.

34, 14, 24, 16, 12, 18, 20, 24, 16, 26, 13, 27

Solution

Begin by ordering the set.

12, 13, 14  |  16, 16, 18  |  20, 24, 24  |  26, 27, 34
 1st 25%        2nd 25%        3rd 25%        4th 25%

The median of the entire set is 19. The median of the six numbers that are less than 19 is 15. So, the lower quartile is 15. The median of the six numbers that are greater than 19 is 25. So, the upper quartile is 25.
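The procedure in Example 7 can be sketched in code: sort the set, take the median, then take the median of the numbers before it and the median of the numbers after it. The function name is invented for this sketch, and when the set has an odd number of elements it leaves the middle element out of both halves, which is an assumption about how the halves are formed. Statistical libraries often use other quartile conventions and can return slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (lower quartile, median, upper quartile) by the median-of-halves method."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # With an odd count, skip the middle element so it belongs to neither half.
    upper_half = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(ordered), median(upper_half)

# The set from Example 7.
data = [34, 14, 24, 16, 12, 18, 20, 24, 16, 26, 13, 27]
lower_quartile, middle, upper_quartile = quartiles(data)

print("lower quartile:", lower_quartile)
print("median:        ", middle)
print("upper quartile:", upper_quartile)
```

Together with the smallest and largest numbers in the set, these three values give a five-number summary of the data.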
Definition of Quartiles

Consider an ordered set of numbers whose median is \(m\). The lower quartile is the median of the numbers that occur before \(m\). The upper quartile is the median of the numbers that occur after \(m\).

Quartiles are represented graphically by a box-and-whisker plot, as shown in Figure A.6. In the plot, notice that five numbers are listed: the smallest number, the lower quartile, the median, the upper quartile, and the largest number. Also notice that the numbers are spaced proportionally, as though they were on a real number line.

[Figure A.6: box-and-whisker plot marked at 12, 15, 19, 25, and 34]

The next example shows how to find quartiles when the number of elements in a set is not divisible by 4.

Example 8: Sketching Box-and-Whisker Plots

Sketch a box-and-whisker plot for each set.

a. 27, 28, 30, 42, 45, 50, 50, 61, 62, 64, 66
b. 82, 82, 83, 85, 87, 89, 90, 94, 95, 95, 96, 98, 99
c. 11, 13, 13, 15, 17, 18, 20, 24, 24, 27

Solution

a. This set has 11 numbers. The median is 50 (the sixth number). The lower quartile is 30 (the median of the first five numbers). The upper quartile is 62 (the median of the last five numbers). See Figure A.7.
b. This set has 13 numbers. The median is 90 (the seventh number). The lower quartile is 84 (the median of the first six numbers). The upper quartile is 95.5 (the median of the last six numbers). See Figure A.8.
c. This set has 10 numbers. The median is 17.5 (the average of the fifth and sixth numbers). The lower quartile is 13 (the median of the first five numbers). The upper quartile is 24 (the median of the last five numbers). See Figure A.9.

[Figure A.7: box-and-whisker plot marked at 27, 30, 50, 62, and 66]
[Figure A.8: box-and-whisker plot marked at 82, 84, 90, 95.5, and 99]
[Figure A.9: box-and-whisker plot marked at 11, 13, 17.5, 24, and 27]

A.2 Exercises

In Exercises 1-6, find the mean, median, and mode of the set of measurements.

1. 5, 12, 7, 14, 8, 9, 7
2. 30, 37, 32, 39, 33, 34, 32
3. 5, 12, 7, 24, 8, 9, 7
4. 20, 37, 32, 39, 33, 34, 32
5. 5, 12, 7, 14, 9, 7
6. 30, 37, 32, 39, 34, 32

7. Reasoning: Compare your answers for Exercises 1 and 3 with those for Exercises 2 and 4. Which of the measures of central tendency is sensitive to extreme measurements? Explain your reasoning.

8. Reasoning
(a) Add 6 to each measurement in Exercise 1 and calculate the mean, median, and mode of the revised measurements. How are the measures of central tendency changed?
(b) If a constant \(k\) is added to each measurement in a set of data, how will the measures of central tendency change?

9. Electric Bills: A person had the following monthly bills for electricity. What are the mean and median of the collection of bills?

January    $67.92    February   $59.84    March      $52.00
April      $52.50    May        $57.99    June       $65.35
July       $81.76    August     $74.98    September  $87.82
October    $83.18    November   $65.35    December   $57.00

10. Car Rental: A car rental company kept the following record of the numbers of miles a rental car was driven. What are the mean, median, and mode of this data?

Monday 410    Tuesday 260    Wednesday 320
Thursday 320  Friday 460     Saturday 150

11. Six-Child Families: A study was done on families having six children. The table shows the numbers of families in the study with the indicated numbers of girls. Determine the mean, median, and mode of this set of data.

Number of girls   0    1    2    3    4    5    6
Frequency         1    24   45   54   50   19   7

12. Sports: A baseball fan examined the records of a favorite baseball player's performance during his last 50 games. The numbers of games in which the player had 0, 1, 2, 3, and 4 hits are recorded in the table.

Number of hits   0    1    2    3    4
Frequency        14   26   7    2    1

(a) Determine the average number of hits per game.
(b) Determine the player's batting average if he had 200 at-bats during the 50-game series.

13. Think About It: Construct a collection of numbers that has the following properties. If this is not possible, explain why it is not.
Mean = 6, median = 4, mode = 4

14. Think About It: Construct a collection of numbers that has the following properties. If this is not possible, explain why it is not.
Mean = 6, median = 6, mode = 4

15. Test Scores: A professor records the following scores for a 100-point exam.
99, 64, 80, 77, 59, 72, 87, 79, 92, 88, 90, 42, 20, 89, 42, 100, 98, 84, 78, 91
Which measure of central tendency best describes these test scores?

16. Shoe Sales: A salesman sold eight pairs of men's black dress shoes. The sizes of the eight pairs were as follows: 8, 12, 10, 11, and [remaining sizes not recoverable]. Which measure (or measures) of central tendency best describes the typical shoe size for this data?

In Exercises 17-24, find the mean \(\bar{x}\), variance \(v\), and standard deviation \(\sigma\) of the set.

17. 4, 10, 8, 2
18. 3, 15, 6, 9, 2
19. 0, 1, 1, 2, 2, 2, 3, 3, 4
20. 2, 2, 2, 2, 2, 2
21. 1, 2, 3, 4, 5, 6, 7
22. 1, 1, 1, 5, 5, 5
23. 49, 62, 40, 29, 32, 70
24. 1.5, 0.4, 2.1, 0.7, 0.8

In Exercises 25-30, use the alternative formula to find the standard deviation of the set.

25. 2, 4, 6, 6, 13, 5
26. 10, 25, 50, 26, 15, 33, 29, 4
27. 246, 336, 473, 167, 219, 359
28. 6.0, 9.1, 4.4, 8.7, 10.4
29. 8.1, 6.9, 3.7, 4.2, 6.1
30. 9.0, 7.5, 3.3, 7.4, 6.0

In Exercises 31 and 32, line plots of sets of data are given. Determine the mean and standard deviation of each set.

31. [Line plots (a) through (d)]
32. [Line plots (a) through (d)]

33. Reasoning: Without calculating the standard deviation, explain why the set \(\{4, 4, 20, 20\}\) has a standard deviation of 8.

34. Reasoning: If the standard deviation of a set of numbers is 0, what does this imply about the set?

35. Test Scores: An instructor adds five points to each student's exam score. Will this change the mean or standard deviation of the exam scores? Explain.

36. Price of Gold: The following data represents the average prices of gold (in dollars per fine ounce) for the years 1981 to 2000. Use a computer or graphing utility to find the mean, variance, and standard deviation of the data. What percent of the data lies within two standard deviations of the mean? (Source: U.S. Bureau of Mines and U.S. Geological Survey)
460, 376, 424, 361, 318, 368, 478, 438, 383, 385, 363, 345, 361, 385, 386, 389, 332, 295, 280, 280

37. Think About It: The histograms represent the test scores of two classes of a college course in mathematics. Which histogram has the smaller standard deviation?
[Two histograms of frequency versus score]

38. Test Scores: The scores of a mathematics exam given to 600 science and engineering students at a college had a mean and standard deviation of 235 and 28, respectively. Use Chebychev's Theorem to determine the intervals containing at least 3/4 and at least 8/9 of the scores. How would the intervals change if the standard deviation were 16?

In Exercises 39-42, sketch a box-and-whisker plot for the data without the aid of a graphing utility.

39. 23, 15, 14, 23, 13, 14, 13, 20, 12
40. 11, 10, 11, 14, 17, 16, 14, 11, 8, 14, 20
41. 46, 48, 48, 50, 52, 47, 51, 47, 49, 53
42. 25, 20, 22, 28, 24, 28, 25, 19, 27, 29, 28, 21

In Exercises 43-46, use a graphing utility to create a box-and-whisker plot for the data.

43. 19, 12, 14, 9, 14, 15, 17, 13, 19, 11, 10, 19
44. 9, 5, 5, 5, 6, 5, 4, 12, 7, 10, 7, 11, 8, 9, 9
45. 20.1, 43.4, 34.9, 23.9, 33.5, 24.1, 22.5, 42.4, 25.7, 17.4, 23.8, 33.3, 17.3, 36.4, 21.8
46. 78.4, 76.3, 107.5, 78.5, 93.2, 90.3, 77.8, 37.1, 97.1, 75.5, 58.8, 65.6

47. Product Lifetime: A company has redesigned a product in an attempt to increase the lifetime of the product. The two sets of data list the lifetimes (in months) of 20 units with the original design and 20 units with the new design. Create a box-and-whisker plot for each set of data, and then comment on the differences between the plots.

Original Design
15.1  78.3  56.3  68.9  30.6
27.2  12.5  42.7  72.7  20.2
53.0  13.5  11.0  18.4  85.2
10.8  38.3  85.1  10.0  12.6

New Design
55.8  71.5  25.6  19.0  23.1
37.2  60.0  35.3  18.9  80.5
46.7  31.1  67.9  23.5  99.5
54.0  23.2  45.5  24.8  87.8
