
STAT 3704 Sections 2.1-2.

Introduction
Stem-and-Leaf Plots
  Construction
  Interpreting
  Depth
Boxplots
  Construction
  Interpreting
Summary

What do we actually do with a data set when it's handed to us?

Most data sets are too large for us to draw any useful conclusions just by staring at the list of numbers and characters. We need some way of summarizing the data in an informative form that is easy on the eyes.

This chapter covers several ways of creating visual summaries of data sets.

Using these visual tools is a critical first step when analyzing data, to be done before doing anything else!

These methods are designed for a data set consisting of one variable; other methods for two or more variables will be discussed as they come up.

By observing visual summaries of the data, we can
- Determine the general pattern of the data
- Pick out any outliers that seem like they don't belong
- Check whether the data follow some theoretical distribution
- Make quick comparisons between groups of data.

All of this can be done before any formal analysis, and it can often be sufficient in its own right!

There are several types of visual summaries. The ones we'll cover in this chapter include
- Stem-and-leaf plots
- Boxplots
- Histograms
- Time series plots (or time plots)

The first two are usually pretty easy to draw by hand, so we'll start with them.

A computer software package makes life easier for all four displays. We'll discuss some of these next time.

The first visual display we will look at is a stem-and-leaf plot. An example will be the easiest way to illustrate a stem plot.
This graph sorts the data in order by grouping elements with similar sizes together.

Example: Suppose that the homework scores (out of 100) for a statistics class (after sorting) are as follows, reading down each column:
50 60 72 76 80
56 61 72 76 81
58 67 72 77 83
59 67 74 79 85
59 68 75 80 86

Notice how we have several values sharing a digit in the 10's place. We can group these together as follows:
5 | 06899
6 | 01778
7 | 222456679
8 | 001356

Notice how the digit in the 10's place is pulled out to the left, while all the corresponding digits in the 1's place are grouped together.
5 | 06899
6 | 01778
7 | 222456679
8 | 001356

We say that the column on the left is the stem, while each of the leaves to the right represents an element of the data set when paired with its stem number.
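The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the routine any particular software package uses: the function name `stem_plot` is made up for this sketch, and it assumes whole-number data with the 10's digit as the stem and the 1's digit as the leaf.

```python
from collections import defaultdict

# Homework scores from the example above.
scores = [50, 56, 58, 59, 59, 60, 61, 67, 67, 68,
          72, 72, 72, 74, 75, 76, 76, 77, 79, 80,
          80, 81, 83, 85, 86]

def stem_plot(data):
    """Return stem plot lines (10's digit = stem, 1's digit = leaf)."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # e.g. 76 -> stem 7, leaf 6
        leaves[stem].append(str(leaf))
    # List every stem from the smallest to the largest, even an empty one.
    return [f"{stem} | {''.join(leaves[stem])}"
            for stem in range(min(leaves), max(leaves) + 1)]

for line in stem_plot(scores):
    print(line)
```

Empty stems are printed on purpose, so that a gap in the data shows up as a gap in the plot.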

It's a judgment call as to what place to pick for the stem values. Generally, we try to pick the stem so that no single stem holds nearly all the leaves and, at the other extreme, the stems don't each hold only one or two leaves.
Computer software packages will tend to make this selection automatically.


If we find that there are a lot of elements within a 10's place but that splitting to the 1's place would make the leaves too sparse, we can split the 10's place in one of two ways:
- Make 0-4 and 5-9 groups. Some software packages list these as * and ., respectively.
- Make groups for 0-1, 2-3, 4-5, 6-7, and 8-9. Some software packages list these groups as *, t, f, s, and ., respectively.


For our HW example, we can divide the scores appropriately as follows:

5 | 06899
6 | 01778
7 | 222456679
8 | 001356

5 | 0
5 | 6899
6 | 01
6 | 778
7 | 2224
7 | 56679
8 | 0013
8 | 56
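Splitting each stem can be sketched the same way. The function `split_stem_plot` and its `parts` parameter are hypothetical names chosen for this illustration; `parts=2` gives the 0-4 / 5-9 split used above, and `parts=5` gives the 0-1, 2-3, 4-5, 6-7, 8-9 split.

```python
scores = [50, 56, 58, 59, 59, 60, 61, 67, 67, 68,
          72, 72, 72, 74, 75, 76, 76, 77, 79, 80,
          80, 81, 83, 85, 86]

def split_stem_plot(data, parts=2):
    """Stem plot with each 10's stem split into `parts` leaf ranges (2 or 5)."""
    width = 10 // parts              # leaves per row: 5 if parts=2, 2 if parts=5
    rows = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        rows.setdefault((stem, leaf // width), []).append(str(leaf))
    first_stem, last_stem = min(rows)[0], max(rows)[0]
    lines = []
    for stem in range(first_stem, last_stem + 1):
        for part in range(parts):
            lines.append(f"{stem} | {''.join(rows.get((stem, part), []))}")
    return lines

for line in split_stem_plot(scores, parts=2):
    print(line)
```

Calling `split_stem_plot(scores, parts=5)` shows what the five-way split looks like for the same 25 scores.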

Why wouldn't it be a good idea to split each 10's place into 5 subgroups?


Now that we have a stem plot, we can consider what it shows us about the homework scores.
First of all, notice how the data seem shifted so that most of the elements are on the high side of the possible values. Another way of saying this is that the data have a long left tail. If there were any outlier points off in space compared to the others, we would see this easily on the plot.

We'll talk about this interpretation next time when we do histograms!

5 | 0
5 | 6899
6 | 01
6 | 778
7 | 2224
7 | 56679
8 | 0013
8 | 56


In general, the skewness of a data set can be described in one of three ways:
- Left-tailed (or left-skewed), where there are relatively few data points stretching into the lower regions of the range
- Symmetric, where the spread of data is fairly even about the median
- Right-tailed (or right-skewed), where there are relatively few data points stretching into the upper regions of the range.

We'll see why these names are used in the next lecture.


Notice also that there appear to be two regions where scores are concentrated: one in the high 50's, and an even bigger one centered in the high 70's. This sort of bimodal distribution usually indicates that two different subgroups are being observed together.
5 | 0
5 | 6899
6 | 01
6 | 778
7 | 2224
7 | 56679
8 | 0013
8 | 56


Suppose that we also had information about the college of the students.

Students in this class came either from the College of Engineering (blue background) or the College of Architecture (silver background).
50 60 72 76 80
56 61 72 76 81
58 67 72 77 83
59 67 74 79 85
59 68 75 80 86


Using this additional information, we can split up (or stratify) our stem plot into two separate plots, one for each group.
Whole Class    Engineering    Architecture
5 | 0          5 |            5 | 0
5 | 6899       5 | 9          5 | 689
6 | 01         6 |            6 | 01
6 | 778        6 | 78         6 | 7
7 | 2224       7 | 224        7 | 2
7 | 56679      7 | 5679       7 | 6
8 | 0013       8 | 013        8 | 0
8 | 56         8 | 56         8 |
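Stratifying amounts to building the same plot once per group on a shared set of stems. The sketch below assumes the group membership that can be read off the stratified plot, with the one Architecture score in the upper 70's taken to be 76, which is what makes the two groups recombine into the full data set. The helper name `split_stems` is hypothetical.

```python
from collections import defaultdict

# Assumed group membership, read off the stratified stem plot.
engineering = [59, 67, 68, 72, 72, 74, 75, 76, 77, 79, 80, 81, 83, 85, 86]
architecture = [50, 56, 58, 59, 60, 61, 67, 72, 76, 80]

def split_stems(data, stems):
    """Leaves for each stem, split into a 0-4 row and a 5-9 row."""
    rows = defaultdict(str)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        rows[(stem, leaf // 5)] += str(leaf)
    return [rows[(stem, half)] for stem in stems for half in (0, 1)]

stems = range(5, 9)                      # the same stems for both groups
labels = [str(s) for s in stems for _ in (0, 1)]
eng = split_stems(engineering, stems)
arch = split_stems(architecture, stems)

print("Engineering   Architecture")
for label, e, a in zip(labels, eng, arch):
    print(f"{label} | {e:<9} {label} | {a}")
```

Using the same stems for both groups is what makes the two columns directly comparable.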


From this stratified stem plot, it is pretty easy to see that the engineering students (with one exception) did fairly well on the homework, while the architecture students (with two or three exceptions) were struggling a bit.

Engineering    Architecture
5 |            5 | 0
5 | 9          5 | 689
6 |            6 | 01
6 | 78         6 | 7
7 | 224        7 | 2
7 | 5679       7 | 6
8 | 013        8 | 0
8 | 56         8 |

This is an example of the importance of separating your data on suspected confounding variables!


Often, we can add a depth column to the stem plot to get a better sense of where the median of the data set is.

The median is defined as the value in the data set where half the data set is less than the median and half the data set is greater than the median.

If there are an odd number of data points, the median is just the middle number of the sorted data set. If there are an even number, the median is the average of the two middle numbers.
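The odd/even rule translates directly into code. This is a minimal sketch (Python's standard library also provides `statistics.median`, which follows the same rule); it is applied here to the 25 homework scores from the stem plot example.

```python
def median(data):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two

scores = [50, 56, 58, 59, 59, 60, 61, 67, 67, 68,
          72, 72, 72, 74, 75, 76, 76, 77, 79, 80,
          80, 81, 83, 85, 86]

print(median(scores))        # 25 scores, so the 13th sorted value
print(median([1, 2, 3, 4]))  # even count, so the average of 2 and 3
```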


To add a depth column, we start counting leaves from the beginning and the end, as shown:
Depth  Stem | Leaf
    1     5 | 0
    5     5 | 6899
    7     6 | 01
   10     6 | 778
  (4)     7 | 2224
   11     7 | 56679
    6     8 | 0013
    2     8 | 56

The level with the median is denoted by () around the depth. Most computer software packages make the depth column automatically.
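The counting can be sketched as follows. The function `depth_column` is a hypothetical helper that takes the number of leaves on each row, top to bottom. The rule used to pick the parenthesized row (fewer than half of the data above it and fewer than half below it) is an assumption about the usual convention, not a description of any specific package.

```python
def depth_column(leaf_counts):
    """Depth entries for a stem plot, given the leaf count of each row."""
    n = sum(leaf_counts)
    depths = []
    above = 0                            # leaves on rows above the current one
    for count in leaf_counts:
        below = n - above - count        # leaves on rows below the current one
        if above < n / 2 and below < n / 2:
            depths.append(f"({count})")  # row holding the median: show its own count
        else:
            # Count in from whichever end of the data is closer.
            depths.append(str(min(above + count, below + count)))
        above += count
    return depths

# Leaf counts for the split homework plot: 0 / 6899 / 01 / 778 / 2224 / 56679 / 0013 / 56
print(depth_column([1, 4, 2, 3, 4, 5, 4, 2]))
```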


The depth column gives us an easy way of counting how many elements are between the edge of the data set and the median.


You may have noticed that the stem plot pretty much kept all the original information in the data set, aside from losing the time order when we sorted.
For large data sets in particular, keeping all this information makes stem plots a bit of an eyesore, not much better than the raw data set.

A much simpler visual summary of the data is the boxplot (or box-and-whisker plot).


The boxplot relies on five key numbers from the data set, often referred to as the five-number summary of the data.
The minimum, maximum, median, and quartiles of the data make up the five-number summary. The quartiles can be thought of in the following way:

When we find the median, we partition the data set in half. If we take the median of each of these halves, we get the 1st and 3rd quartiles from the low and high halves, respectively.

You can think of the median as being the 2nd quartile.


To make a boxplot, first obtain the five-number summary of the data.

We'll use the variables min, Q1, med, Q3, and max to denote these.

As an example, let's consider the data from Exercise 2.5, which is a series of measurements on a differential calorimeter.
343.0 342.4 343.4 343.1 343.3 343.7 343.5 343.1 343.3 343.4 343.8 343.3 343.3 343.3


The first thing we need to do is sort the data in order (note that we lose the time order when we do this!).
343.0 342.4 343.4 343.1 343.3 343.7 343.5 343.1 343.3 343.4 343.8 343.3 343.3 343.3

342.4 343.0 343.1 343.1 343.3 343.3 343.3 343.3 343.3 343.4 343.4 343.5 343.7 343.8


Next, we calculate the five number summary:


min = 342.4
Q1 = 343.1
med = 343.3 (the average of two identical values)
Q3 = 343.4
max = 343.8

342.4 343.0 343.1 343.1 343.3 343.3 343.3 343.3 343.3 343.4 343.4 343.5 343.7 343.8
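The five-number summary above can be reproduced with the "median of each half" rule. In this sketch, leaving the middle value out of both halves when the number of points is odd is an assumption (conventions differ between textbooks and software); it does not matter here because there are 14 points.

```python
def median(values):
    """Median of an already sorted list."""
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 == 1 else (values[mid - 1] + values[mid]) / 2

def five_number_summary(data):
    """Return (min, Q1, med, Q3, max), with quartiles as medians of the halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # low half of the data
    upper = ordered[n - half:]    # high half (skips the middle value when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

calorimeter = [343.0, 342.4, 343.4, 343.1, 343.3, 343.7, 343.5,
               343.1, 343.3, 343.4, 343.8, 343.3, 343.3, 343.3]

print(five_number_summary(calorimeter))
```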


Now, we'll draw a scale (it can be vertical or horizontal) that runs from at least the min to the max.
The distance from the min to the max is sometimes called the range of the data set.


Next, we draw lines denoting the median and quartiles of the data, forming a box with the median partitioning it.
The width of the box doesn't matter.


The next step gets a little technical, because we are going to flag outliers with this plot.
To determine which values are outliers, we need some sort of fence to mark off the data that are fine and separate out the extreme values.

The interquartile range (or IQR for short) turns out to be a useful way to check this.

We define the IQR as being the difference Q3 - Q1. It measures the typical spread of the data set. Note: your book uses the term "step" instead of "IQR".


We like the IQR because it is robust (or resistant) to outliers and therefore doesn't change much in their presence. Consider it this way: suppose we adjusted the original data set with a typo in the last value as follows:
Original:  342.4 343.0 343.1 343.1 343.3 343.3 343.3 343.3 343.3 343.4 343.4 343.5 343.7 343.8

With typo: 342.4 343.0 343.1 343.1 343.3 343.3 343.3 343.3 343.3 343.4 343.4 343.5 343.7 346.8

Would the median or quartiles be affected by the typo?
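One way to check is to compute the range and the IQR for both versions of the data. The sketch below uses `statistics.median` from the Python standard library together with the median-of-halves quartiles; the helper name `spread` is made up for this illustration.

```python
from statistics import median

original = [342.4, 343.0, 343.1, 343.1, 343.3, 343.3, 343.3,
            343.3, 343.3, 343.4, 343.4, 343.5, 343.7, 343.8]
with_typo = original[:-1] + [346.8]      # last value mistyped

def spread(data):
    """Range and IQR, with quartiles taken as medians of the two halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return {"range": ordered[-1] - ordered[0], "IQR": q3 - q1}

for name, data in [("original", original), ("with typo", with_typo)]:
    s = spread(data)
    print(f"{name:>9}: range = {s['range']:.1f}, IQR = {s['IQR']:.1f}")
```

The typo only moves the largest value, so the range changes while the quartiles, and therefore the IQR, stay where they were.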



For theoretical reasons, we'll define the inner fences as being
Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.
We'll draw whiskers on the plot from the box out to the last data point within these fences.


Any point outside these fences gets flagged as an outlier.


Some programs will even differentiate between outliers and extreme outliers.

Extreme outliers lie outside the outer fences, defined as
Q1 - 3 × IQR and Q3 + 3 × IQR.
Most programs will use a darker symbol to denote extreme outliers if there are any.
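The fence calculations can be sketched as below. The multipliers 1.5 (inner fences) and 3 (outer fences) are the conventional choices and are assumed here; individual programs may use different values. The function names are hypothetical.

```python
def fences(q1, q3):
    """Inner and outer fences, assuming the conventional 1.5 and 3 multipliers."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

def classify(data, q1, q3):
    """Return (whisker ends, outliers, extreme outliers)."""
    inner, outer = fences(q1, q3)
    # Whiskers stop at the last data points still inside the inner fences.
    whiskers = (min(x for x in data if x >= inner[0]),
                max(x for x in data if x <= inner[1]))
    outliers = [x for x in data
                if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    extreme = [x for x in data if x < outer[0] or x > outer[1]]
    return whiskers, outliers, extreme

calorimeter = [342.4, 343.0, 343.1, 343.1, 343.3, 343.3, 343.3,
               343.3, 343.3, 343.4, 343.4, 343.5, 343.7, 343.8]

# Q1 = 343.1 and Q3 = 343.4 come from the five-number summary.
print(classify(calorimeter, 343.1, 343.4))
```

For the calorimeter data the inner fences sit near 342.65 and 343.85, so under these assumptions 342.4 is flagged as an outlier and the whiskers run from 343.0 to 343.8.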


What does the shape of the plot tell you about the symmetry/skewness of the data?


Just as in the case with stem plots, we can stratify data and create side-by-side (or parallel) boxplots to compare two groups.
These plots make informal comparison between two (or more!) groups very easy.


For example, consider the situation in Exercise 2.16 of the text.

The data are the number of industrial accidents per quarter over two different five-year periods (in order).
Period 1:
 5  4  2  5  6
 5 10  5  7  8
 6  6  5  3  3
 8  3  9 10 10

Period 2:
 3  1  7  1  4
 4  2  3  2  7
 1  2  2  4  4
 0  2  4  1  4


The natural question of interest is whether the number of accidents generally went down in the second period.
We could use a time plot (discussed next time!) to check this, but it may not illustrate the difference between the two periods as neatly as parallel boxplots.


The boxplots are shown below. Did the number of accidents decrease in period 2?
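The numbers behind the two boxplots can be computed with the same median-of-halves rule. This is a sketch using `statistics.median` from the Python standard library; the variable names are made up, and the 40 values are taken to be the first 20 for Period 1 and the last 20 for Period 2.

```python
from statistics import median

period_1 = [5, 4, 2, 5, 6, 5, 10, 5, 7, 8, 6, 6, 5, 3, 3, 8, 3, 9, 10, 10]
period_2 = [3, 1, 7, 1, 4, 4, 2, 3, 2, 7, 1, 2, 2, 4, 4, 0, 2, 4, 1, 4]

def five_number_summary(data):
    """Return (min, Q1, med, Q3, max), with quartiles as medians of the halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

for name, data in [("Period 1", period_1), ("Period 2", period_2)]:
    print(name, five_number_summary(data))
```

Every entry of the Period 2 summary comes out lower than the corresponding Period 1 entry, which is the pattern the parallel boxplots display.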


Using visual tools is a great and easy way to make sense of data sets. We can use computer software to draw the graphs, though a few lend themselves to being drawn by hand easily enough.


Stem plots preserve most of the information in the data set, and can reveal patterns and structures in the data such as skewness/symmetry and concentrations (modes) in the data. Stem plots can also reveal outliers.
Multiple modes usually indicate a need to stratify the data into separate groups.


Boxplots lose more information than stem plots, but they are easier to read at a glance.
They provide information about the skewness/symmetry of the data, as well as explicitly flagging outlier points.

Boxplots are extremely useful when comparing multiple groups of data at once.
