
1/29/2008

Describing Categorical Data

VARIATION IN CATEGORICAL DATA
FREQUENCY TABLE
BAR CHART
PIE CHART
THE AREA PRINCIPLE
MODE AND MEDIAN
SUMMARY

3 Categorical Data

Shopping on the Web provides a flood of information to retailers. These data reveal when customers shop, what they buy, and, in special cases, who did the shopping. Although many web surfers guard their privacy, others can be enticed to reveal their browsing. In return for a free computer, these people offer analysts the chance to see them surf and shop the Web. Consumers spend more than $1.5 billion on-line every week, and the market is growing. Amazon remains a leading shopping destination on the Web. Every day, more than 100 million shoppers visit its web site, either by typing amazon.com into their browser, or by clicking through from another web site, a host with an ad that directs shoppers to Amazon. This small table describes six visits to Amazon in the fall of 2002.
Date       Purchase  Host           Household Size  Region         Income Range
24Oct2002  no        msn.com        5               North Central  35 - 50k
05Dec2002  no        (blank)        5               North Central  35 - 50k
05Dec2002  no        google.com     4               West           50 - 75k
05Oct2002  yes       dealtime.com   3               South          < 15k
19Nov2002  no        mycoupons.com  2               West           50 - 75k
18Oct2002  no        yahoo.com      5               North Central  100k+

Table 3-1. Several visits to amazon.com.

Each row describes a session in which a person visited Amazon. Columns indicate the date, whether the customer made a purchase, the host, and demographics of the shopper (number of people in the household, geographic location, and income). If the customer typed amazon.com into a browser, as in the second row, then the host field is blank. The data are a mix of categorical and numerical variables. For instance, the name of the host is categorical, whereas household size is numerical.

Which hosts generate the most business? Amazon might be willing to pay more to keep links on busy sites. Ad space on busy web sites has gotten more expensive. Spending for Internet ads grew from $10 billion in 2004 to $20 billion in 2007. Space on an Internet portal that went for $100,000 in 2002 cost more than $300,000 in 2006, if the site had room. A banner ad on Yahoo or MSN goes for about $500,000 per day, about the cost of a 30-second ad on a popular TV show. Revenue from on-line advertising is growing so fast that AOL dropped its $26 monthly member fee in 2006 to attract visitors to its web portal. More visitors mean more people see its ads, and those eyes generate more income than membership fees.1

1 Wiser about the Web, Business Week, Mar 27, 2006. The chart at left is from How Microsoft Is Learning to Love Online Advertising, Wall Street Journal, Nov 16, 2006.


Variation in Categorical Data


Our data track a small percentage of activity at Amazon, but that still amounts to 188,996 visits. Table 3-1 shows six of these that we picked out at random to illustrate the contents of each column. You would not want to read through the rest of these data, much less tabulate the number of visits from various hosts. That's the problem with huge data tables: you cannot see what's going on, and seeing is what we want to do. How else can we find patterns in data unless we can see the variation?

Variation: Differences among the values within a column of a data table.

What do we mean by variation? In general, variation in data refers to the differences among the values in the columns of a data table. For a categorical variable, variation measures the number of different values and how often each occurs. The column that identifies the host has 11,142 different values. If every visitor came from Google, then the host column in every row would identify Google and there would be no variation.

Variation is also related to predictability. No variation would mean every one of these visitors came from the same host. In that case, where would you guess the next customer came from? If every prior visitor came from Google, we would expect the next to come from Google as well. Variation reduces our ability to anticipate what will happen next. With 11,142 different hosts, we cannot be sure of the next host.

Before we continue our analysis, let's get specific about the motivation. You need a clear goal when you start to look at data. Our task for this analysis is to answer this question: Which hosts send the most visitors to Amazon's web site?

Frequency Table
Distribution: The collection of values of a variable and how often each occurs.

To describe the variation in categorical data, statistics comes down to counting. Once we have counts, we'll graph them. The trick is to decide what to count and how to graph the counts. That's the point of a clear motivation. Because we want to identify hosts that send the most visitors, we know to count the number from each host.

Frequency Table: A tabular summary that shows the distribution of a variable.

The distribution of a categorical variable is a list of the values of the variable along with the count or frequency of each. A frequency table summarizes the distribution of a categorical variable as a table. Each row of a frequency table lists a category along with the number of cases in this category.

A statistics package can list the visits from every host, but we need to be more choosy. After all, these data record 11,142 different hosts, and we only want to identify the prominent hosts that send the most visitors. We don't want to see a list of every one. Among these visits, 20 hosts delivered 500 or more visits. We'll show these individually and combine hosts with fewer visits into a category labeled Other. This frequency table summarizes the recoded categorical variable.
Host                  Frequency  Proportion
Typed amazon.com         89,919     0.47577
msn.com                   7,258     0.03840
yahoo.com                 6,078     0.03216
google.com                4,381     0.02318
recipesource.com          4,283     0.02266
aol.com                   1,639     0.00867
iwon.com                  1,573     0.00832
atwola.com                1,289     0.00682
bmezine.com               1,285     0.00680
daily-blessings.com       1,166     0.00617
imdb.com                    886     0.00469
couponmountain.com          813     0.00430
earthlink.net               790     0.00418
popupad.net                 589     0.00312
overture.com                586     0.00310
dotcomscoop.com             577     0.00305
netscape.com                544     0.00288
dealtime.com                543     0.00287
att.net                     533     0.00282
postcards.org               532     0.00281
24hour-mall.com             503     0.00266
Other                    63,229     0.33455
Total                   188,996     1.00

Table 3-2. Frequency table of hosts.

This frequency table reveals several interesting characteristics of the hosts. First, the most popular way to get to Amazon is to type amazon.com into a browser. You probably recognize the top 3 hosts: msn.com, yahoo.com, and google.com. More than 70% of Americans online visit these portals. The surprise is how many visits originate from less familiar hosts like recipesource.com. The last column of Table 3-2 shows that hosts that provide fewer than 500 visits supply about a third of all visitors to Amazon.
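The recoding behind a table like Table 3-2 (count each category, keep the large ones, combine the rest into Other, then attach proportions) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `frequency_table`, the `min_count` cutoff, and the toy host list are hypothetical and are not the ComScore data or the software used for the chapter.

```python
from collections import Counter

def frequency_table(values, min_count):
    """Count each category and lump those below min_count into 'Other'.

    Returns rows of (label, count, proportion), largest count first,
    with 'Other' (if any) listed last.
    """
    counts = Counter(values)
    total = sum(counts.values())
    kept = {}
    other = 0
    for label, n in counts.most_common():
        if n >= min_count:
            kept[label] = n
        else:
            other += n  # small categories are combined
    if other:
        kept["Other"] = other
    return [(label, n, n / total) for label, n in kept.items()]

# Toy host column (made-up values for illustration only)
hosts = ["msn.com"] * 5 + ["yahoo.com"] * 3 + ["a.example", "b.example"]
rows = frequency_table(hosts, min_count=3)
for label, n, p in rows:
    print(f"{label:<12}{n:>4}{p:>8.2f}")
```

The cutoff plays the role of the "500 or more visits" rule in the text; a different cutoff gives a longer or shorter table but the proportions always sum to 1.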
Relative Frequency: The frequency of a label divided by the number of cases; a proportion or percentage.

The last column in Table 3-2 adds the proportion of the visits from each host. A proportion (or percentage) is also known as a relative frequency, and a table that shows these is known as a relative frequency table. We prefer tables that show both counts and proportions side-by-side as in Table 3-2. With the proportions added, it is easy to see that nearly half of these visitors typed amazon.com and that a third came from small hosts. The choice between proportions or percentages is a matter of style. Just be clear when you label the table.

Bar Chart

Let's concentrate on the top 10 sites. A frequency table is hard to beat as a summary of several counts, but it becomes hard to compare the counts as the table grows. Unless you need to know the exact counts, a picture beats a table as a summary of the number of visits from the top 10 hosts.

One choice for the display of a frequency table is a bar chart. A bar chart displays the distribution of a categorical variable. The bars are positioned along a common baseline, and the length of each bar is proportional to the count of the category. Bar charts have small spaces between the bars to indicate that the bars could be rearranged into any order. This bar chart uses horizontal bars to show the counts for each host, ordered from largest to smallest.

Bar Chart: A display using horizontal or vertical bars to show the distribution of a categorical variable.

Figure 3-1. Bar chart of top 10 hosts, with the bars drawn horizontally.

By showing the hosts in order of frequency, this bar chart emphasizes the bigger hosts. You can also orient the bars vertically, like this.

Figure 3-2. Bar chart of top 10 hosts, with the bars drawn vertically.

We find it easier to compare the lengths of horizontal bars, but the orientation is a matter of taste. You'll commonly see bar charts drawn either way. The use of vertical bars limits the number of categories in order to show the labels.
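To make the defining rule of a bar chart concrete (a common baseline, bar length proportional to the count), here is a rough text-only sketch in Python. The counts are the top five hosts from Table 3-2; the function name `text_bar_chart` and the `width` parameter are our own illustrative choices, not part of any charting package.

```python
def text_bar_chart(counts, width=40):
    """Return one line per category, sorted from largest to smallest count.

    Every bar starts at the same column (the common baseline) and its
    length is proportional to the count; the largest bar is `width` long.
    """
    largest = max(counts.values())
    lines = []
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for label, n in ordered:
        bar = "#" * round(width * n / largest)
        lines.append(f"{label:<18}{bar} {n:,}")
    return lines

# Counts for the five largest hosts in Table 3-2
top_hosts = {
    "google.com": 4381,
    "msn.com": 7258,
    "aol.com": 1639,
    "yahoo.com": 6078,
    "recipesource.com": 4283,
}
lines = text_bar_chart(top_hosts)
print("\n".join(lines))
```

Sorting inside the function gives the largest-to-smallest arrangement of Figure 3-1 regardless of the order in which the counts are supplied.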

When the categories in a bar chart are sorted by frequency as in Figure 3-1 and Figure 3-2, the bar chart is sometimes called a Pareto chart. Pareto charts are popular in quality control to identify problems in a business process. We could have shown the hosts in alphabetical order to help readers find a particular host, but with 10 labels, that's not necessary. If the categorical variable is ordinal, however, you must preserve the ordering. To show a variable that measures customer satisfaction, for instance, it would not make sense to arrange the bars alphabetically as Fair, Great, Poor, Typical, and Very good.

Bar charts become cluttered when drawn with too many categories. A bar chart of all 20 hosts in Table 3-2 squishes the labels, and the smallest bars become invisible. Imagine how the bar chart would look if you tried to show all 11,142 hosts! There's also a problem if the frequency of one category is much larger than those of the rest. This bar chart adds a bar that counts the visits from every other host.

Figure 3-3. One category dominates this bar chart.

The long bar for other hosts dwarfs the popular hosts. That might be a point worth making: most visits don't come from the top 10 hosts. The accumulated number of hits from the other hosts, however, obscures the differences among the top 10. It's impossible to see that AOL generates almost twice as many visitors as imdb.com. The bar chart in Figure 3-1 provides the best answer to the motivating question. The message: among hosts, MSN sends the most visitors to Amazon, more than 7,000 during this period, followed by Yahoo, Google, and RecipeSource. A variety of other hosts make up the balance of the top 10 hosts.2

Pie Chart
Pie charts also show the distribution of a categorical variable, in this case as wedges of a circle. The area of each wedge is proportional to the count in a category. Large wedges indicate categories with large relative frequencies. Pie charts convey immediately how the whole divides into shares, which makes these a common choice when illustrating, for example, market shares or sources of revenue within a company.

Pie Chart: A display that uses wedges of a circle to show the distribution of a categorical variable.

Figure 3-4. Pie charts of the composition of two companies.

Pie charts are good for seeing if the relative frequency of a category is near 1/2, 1/4, or 1/8 because we're used to cutting pies into 2, 4, or 8 slices. In the pie chart below, you can tell right away that msn.com generates about 1/4 of the visitors among the top 10 hosts.
[Pie chart titled "Hits from Top 10 Hosts"; slices: msn.com, yahoo.com, google.com, recipesource.com, aol.com, iwon.com, atwola.com, bmezine.com, daily-blessings.com, imdb.com]

Figure 3-5. Pie chart of the top 10 hosts.

2 Edward Tufte offers several elegant improvements to the bar chart, but most of these have been ignored by software packages. Look at his suggestions in his book The Visual Display of Quantitative Information (1983), Graphics Press, Cheshire CT, pp. 126-128.


Pie charts are less useful than bar charts if we want to compare the actual counts. People are better at comparing lengths of bars than comparing the angles of wedges.3 Unless the slices align conveniently, it's hard to compare the sizes of the categories. Can you tell in Figure 3-5 whether RecipeSource or Google generates more visitors? Comparisons such as this are easier in a bar chart.

Because they slice the whole thing into portions, pie charts make it easy to overlook what is being divided. It is easy to look at the pie chart in Figure 3-5 and come away thinking that 25% of visitors to Amazon come from msn.com. Sure, msn.com makes up 1/4 of the pie. It's just that this pie shows less than 16% of the visits. The pie chart looks very different if we add back the other category or the typed-in hits.

Figure 3-6. Pie charts with large categories.
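The difference between "a share of the pie" and "a share of all visits" is just a change of denominator, and it can be checked directly from the counts in Table 3-2. The short Python sketch below is our own arithmetic check (the variable names are illustrative); it uses only numbers that appear in the table.

```python
# Counts for the top 10 hosts, copied from Table 3-2
top10 = {
    "msn.com": 7258, "yahoo.com": 6078, "google.com": 4381,
    "recipesource.com": 4283, "aol.com": 1639, "iwon.com": 1573,
    "atwola.com": 1289, "bmezine.com": 1285,
    "daily-blessings.com": 1166, "imdb.com": 886,
}
total_visits = 188_996  # all visits in the data (Total row of Table 3-2)

top10_total = sum(top10.values())

# Share of the pie in Figure 3-5: the denominator is the top 10 only
msn_share_of_pie = top10["msn.com"] / top10_total
# Share of all visits: the denominator is every visit in the data
msn_share_of_all = top10["msn.com"] / total_visits
# How much of the data the pie in Figure 3-5 covers
pie_coverage = top10_total / total_visits

print(f"msn.com share of the top-10 pie: {msn_share_of_pie:.1%}")
print(f"msn.com share of all visits:     {msn_share_of_all:.1%}")
print(f"top 10 hosts as share of visits: {pie_coverage:.1%}")
```

The first figure is close to 1/4, the second is under 4%, and the third confirms that the pie covers less than 16% of the visits.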

The chart on the left shows that the top 10 hosts generate slightly more than 25% of the host-generated visitors to Amazon. More come from small sites. Any marketing plan that forgets this fact might be in for a big surprise. The pie chart on the right makes you realize that any one of the large hosts generates only a small share of the traffic at Amazon.

Which pie chart is the best? As with choosing a bar chart, the motivation determines the best choice. Figure 3-5 emphasizes the relative sizes of the major hosts; it's best suited to our motivating question.

What about the choice between a bar chart and a pie chart? For this analysis, we could use either the bar chart in Figure 3-1 or the pie chart in Figure 3-5. Because it's easier to compare the sizes of categories using a bar chart, we prefer bar charts. That said, pie charts are popular, so make sure you can interpret these figures.

Are You There?

Consider this situation: a manager has partitioned the company's sales into 6 districts: North, East, South, Midwest, West, and International. What graph or table would you use to make these points in a presentation for management?
(a) A figure that shows that slightly more than half of all sales are made in the West district?
(b) A figure that shows that sales topped $10 million in every district?4

3 See the paper W. S. Cleveland and R. McGill (1984) Graphical perception: Theory, experimentation and application to the development of graphical methods, in the Journal of the American Statistical Association 79, 531-554.

The Area Principle


There's flexibility in the mechanics of a bar chart, from picking the categories to the layout of the bars. One aspect of a bar chart, however, is not open to choice. Displays of data must obey a fundamental rule called the area principle. The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the data it represents. Violations of the area principle are a common way to mislead with statistics.

The bar charts we've shown obey the area principle. The bars have the same width, so their areas are proportional to the counts from each host. By presenting the relative sizes accurately, bar charts make comparisons easy and natural. Figure 3-1 shows that msn.com generates almost twice as many visits as recipesource.com and reminds you that recipesource.com generated almost as many hits as google.com.

How do you violate the area principle? It's easier and more common than you might think. News articles often decorate charts to attract attention. Often, the decoration sacrifices accuracy because the plot violates the area principle. For instance, this chart shows the value of wine exports from the US.5

Figure 3-7. Decorated graphics may ignore the area principle.

Where's the baseline for this chart? Is the amount proportional to the visible portion of the bottle? The following bar chart shows the same data but obeys the area principle.
4 Use a table (there are only 2 numbers) or a pie chart for (a). Pie charts emphasize the breakdown of the total into pieces and make it easy to see that more than half of the total sales are in the western region. For (b), use a bar chart because bar charts show the values rather than the relative sizes. Every bar would be long enough to reach past a grid line at $10 million.

5 From USA Today (November 1, 2006). With only 5 numbers to show, it might make more sense to list the values, but readers tend to skip over articles that are dense with numbers.


Figure 3-8. Bar charts must respect the area principle.

This bar chart is less attractive than its counterpart in Figure 3-7, but it is accurate. For example, you can tell that exports to the United Kingdom and Canada are actually very similar.

You can respect the area principle and still be artistic. Just because a chart is not a bar chart or a pie chart doesn't mean that it's dishonest. For example, these plots divide net sales at Lockheed Martin into its main lines of business.6

Figure 3-9. Alternative graphs of Lockheed earnings.

The artist who prepared the chart on the left divided a grid with photos associated with the 5 divisions of Lockheed. This chart obeys the area principle. Each square in the grid represents $100 million in sales during 2003. The pie chart shows the same data. Which grabs your attention?

You may need to look at a plot carefully to see if it violates the area principle, particularly when it comes to circular displays. Here's one of the earliest charts of data, prepared by Florence Nightingale to illustrate the need for better care for soldiers on the battlefield.

6 This chart appeared in the business section of the November 28, 2004 issue of The New York Times.


Figure 3-10. Casualties in the Crimean War.

By arranging the counts with deaths from illness (blue) on the outside of the circle, her chart violates the Area Principle and exaggerates the number of deaths that might be reduced by hospitalization.
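One way to see why decoration can break the area principle is to compare two drawing rules for the same pair of values. With equal-width bars, doubling the value doubles the area. If instead a picture is enlarged in both height and width so that its height matches the value, the area grows with the square of the ratio. The Python sketch below illustrates that arithmetic; it is a hypothetical example with made-up values and does not reconstruct Figure 3-7 or Figure 3-10.

```python
def drawn_area_ratios(small, large):
    """Compare how much bigger the larger value looks under two drawing rules.

    Returns (bar_ratio, picture_ratio):
      bar_ratio     - equal-width bars, so area grows with length only
      picture_ratio - a picture scaled in height AND width, so area grows
                      with the square of the ratio of the values
    """
    ratio = large / small
    bar_ratio = ratio
    picture_ratio = ratio ** 2
    return bar_ratio, picture_ratio

# Made-up values: the second is twice the first
bar, picture = drawn_area_ratios(small=10, large=20)
print(f"bar chart:      larger value occupies {bar:.0f}x the area")
print(f"scaled picture: larger value occupies {picture:.0f}x the area")
```

Under the second rule a value that is twice as large occupies four times the area, which is exactly the kind of exaggeration the area principle rules out.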

4M: Rolling Over

Any time a fatal automobile accident happens, authorities enter a description of the accident into the Fatality Analysis Reporting System (FARS). Once obscure, FARS became well known when reporters from The New York Times discovered something unusual. Their tools: a question, some data, and a bar chart. The question arose anecdotally. There seemed to be a lot of accidents associated with vehicles rolling over. Were these coincidental reports, or was something systematic and dangerous happening on the highways? For this example, we'll put you in the place of a curious reporter following up on a lead that has important implications for the businesses involved. We'll use data for the year 2000 because that's when this issue came to the public's attention.

Motivation: What is the business question that we want to answer, and what are the $ implications?

News reports suggest some types of cars are more prone to roll-over accidents than others. Most of the reported incidents involve SUVs, but are they all dangerous? If some types of cars are more prone to these dangerous accidents, the manufacturer is going to have a real problem. Sounds like news.

Method: Identify the data and chart your path.

FARS tracked all fatal accidents in 2000. I'll extract those for which the primary cause was a rollover. I'll also stick to accidents on interstate highways. The rows of my data are accidents resulting in rollovers in 2000. The one column of interest is the model of the car. I plan to use a frequency table that includes percentages and a bar chart to show the results.

Mechanics: Generate the appropriate summary table and plot.

FARS reports 1024 fatal accidents on interstates in which the primary event was a rollover. The accidents include 189 different types of cars. Of these, 180 models were involved in fewer than 20 accidents. I'll combine these into a category called other. Here's the frequency table, with the names sorted alphabetically.
Model                       Count  Percentage
4-Runner                       34         3.3
Bronco/Bronco II/Explorer     122        11.9
Cherokee                       25         2.4
Chevrolet C, K, R, V           26         2.5
E-Series Van/Econoline         22         2.1
F-Series pickup                47         4.6
Large Truck                    36         3.5
Ranger                         32         3.1
S-10 Blazer                    40         3.9
Other                         640        62.5
Total                        1024

Most of the rows land in the other category. This also happened when showing the hosts that send visitors to Amazon. A bar chart sorted by count shows that the Ford Bronco (or Explorer) had the most fatal rollovers on interstates in 2000.

Message: Describe your conclusion. If there are any important caveats, put them here.

Data from FARS in 2000 show that Ford Broncos (and the comparable Explorer) were involved in more fatal rollover accidents than any other model. Ford Broncos were involved in more than twice as many rollovers as the next closest model (the Ford F-series truck) and more than three times as many as the comparable Chevy Blazer. Perhaps more Broncos show up in rollover accidents because more Broncos are on the road. Maybe Broncos are driven more often on interstates. The bar chart begs for an explanation.

An explanation came, along with a bitter dispute between Ford and Bridgestone, maker of the Firestone tires that were standard equipment on Ford Broncos and Explorers. Ford claimed that the Firestone tires caused the problem, not the vehicle. The dispute led to a massive recall of more than 13 million Firestone tires.
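The percentages in the rollover frequency table and the comparisons in the message ("more than twice", "more than three times") can be checked from the counts alone. The following Python sketch is our own verification of that arithmetic, using the counts from the frequency table; the variable names are illustrative.

```python
# Fatal interstate rollovers by model in 2000, from the frequency table
rollovers = {
    "4-Runner": 34,
    "Bronco/Bronco II/Explorer": 122,
    "Cherokee": 25,
    "Chevrolet C, K, R, V": 26,
    "E-Series Van/Econoline": 22,
    "F-Series pickup": 47,
    "Large Truck": 36,
    "Ranger": 32,
    "S-10 Blazer": 40,
    "Other": 640,
}
total = sum(rollovers.values())

# Relative frequencies, expressed as percentages rounded to one decimal
percentages = {m: round(100 * n / total, 1) for m, n in rollovers.items()}

# Sort the named models by count, leaving out the combined category
named = {m: n for m, n in rollovers.items() if m != "Other"}
ranked = sorted(named.items(), key=lambda kv: kv[1], reverse=True)

bronco = rollovers["Bronco/Bronco II/Explorer"]
vs_f_series = bronco / rollovers["F-Series pickup"]
vs_blazer = bronco / rollovers["S-10 Blazer"]

print(f"total accidents: {total}")
print(f"leading model:   {ranked[0][0]} ({percentages[ranked[0][0]]}%)")
print(f"Bronco vs. F-Series pickup: {vs_f_series:.2f}x")
print(f"Bronco vs. S-10 Blazer:     {vs_blazer:.2f}x")
```

These ratios compare raw counts only. As the message notes, they say nothing about how many vehicles of each model were on the road.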

4M: Comparing Chip Sales


Some tasks require us to compare categorical variables. Most often, the categorical variables measure the same thing at different points in time. The US Department of Justice investigated several manufacturers of computer memory for price fixing. The government alleged that the companies agreed to set price targets to avoid competing on price. The alleged price fixing occurred from 1999 through 2002. In September 2004, Infineon pleaded guilty to participating in a conspiracy to fix prices for dynamic random access memory (DRAM) and paid a $160 million fine.7 During the time of the alleged conspiracy, the total revenues of this industry fell from $20.8 billion in 1999 to $15.25 billion in 2002. The shares also changed as shown in this table.

Company   1999 Share (%)  2002 Share (%)
Hynix                 19              13
Infineon               6              13
Micron                16              19
Samsung               23              32
Others                36              23

Table 3-3. Shares of the computer memory market.

Motivation: What is the question that we want to answer, and what are the $ implications?

Infineon pleaded guilty to price fixing and paid a fine of $160 million. Did they gain a larger share of the market for chips during this period?

7 Price Fixing in the Memory Market, IEEE Spectrum, December 2004, 18-20.

Method: Identify the variable, report the Ws, and chart your path.

My data show shares of the market for DRAM chips produced by major vendors in 1999 and 2002. The columns are the numbers of chips sold in 1999 or 2002. The row labels identify the manufacturer. I plan to use both pie charts and bar charts to contrast the shares, and pick the plot that most clearly shows the differences in the two years.

Mechanics: Generate the appropriate plots.

Pie charts emphasize shares, whereas the following paired bar chart glues two bar charts of the shares together and simplifies the comparison.

I prefer this graph because it compares the shares for each company in the two years rather than comparing the proportions within each year.

Message: Describe your conclusion. If there are any important caveats, put them here.

Comparison of the share of the market for chips shows that Infineon and Samsung increased their shares during these years. Most of the gain in share appears to have been at the expense of smaller companies (the others). Had Infineon remained at a 6% market share (as in 1999), its sales would have been about $1 billion smaller in 2002. The plot shows the gain, but we have to use the table for these numbers.
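The "about $1 billion" figure in the message follows from Table 3-3 and the 2002 industry revenue of $15.25 billion quoted earlier. The Python sketch below works through that arithmetic and also lists the change in share for each company; it is our own check, and the variable names are illustrative.

```python
# Market shares (percent) from Table 3-3
share_1999 = {"Hynix": 19, "Infineon": 6, "Micron": 16, "Samsung": 23, "Others": 36}
share_2002 = {"Hynix": 13, "Infineon": 13, "Micron": 19, "Samsung": 32, "Others": 23}
revenue_2002 = 15.25  # total industry revenue in 2002, in $ billions

# Change in share for each company, in percentage points
change = {c: share_2002[c] - share_1999[c] for c in share_1999}

# Infineon's 2002 sales at its actual share versus its 1999 share
actual_sales = share_2002["Infineon"] / 100 * revenue_2002
sales_at_old_share = share_1999["Infineon"] / 100 * revenue_2002
difference = actual_sales - sales_at_old_share

for company, points in change.items():
    print(f"{company:<10}{points:+d} points")
print(f"Infineon sales gain from the larger share: ${difference:.2f} billion")
```

The difference is 7 percentage points of $15.25 billion, a little over $1 billion, which matches the rounded figure in the message.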

Mode and Median


Mode: The mode of a categorical variable is the most common category.

Plots are the best summaries of categorical data, but sometimes you need a more compact summary. The mode tells you which value is most common, and the median tells you what is in the middle if you can order the values. The mode of a categorical variable is the most common label, the category with the largest frequency. Among the visitors to Amazon, the modal behavior (the most common) is to type amazon.com into the browser. If two or more categories tie for the largest frequency, the data are said to be bimodal (in the case of two) or multimodal (more than two).

Median: The median of an ordinal variable is the category in the middle of the sorted values.

Ordinal data offer another summary, the median. The median is the group in the middle when you sort the values of an ordinal variable. If we have an even number of items, choose the group on either side of the middle of the sorted list as the median. For instance, letter grades in courses are ordinal; someone who gets an A did better than someone who got a B. In a class with 13 students, the sorted grades might look like

AAAAAABBBBCCC

The most common grade, the mode, is an A, but the median grade is a B.
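Both summaries are easy to compute once the ordering of the categories is written down. The Python sketch below applies them to the 13 grades in the example. The function names `modes` and `ordinal_median` are our own, and for an even number of items this sketch arbitrarily takes the lower of the two middle values, which is one of the two choices the text allows.

```python
from collections import Counter

def modes(values):
    """Return every category that ties for the largest frequency."""
    counts = Counter(values)
    largest = max(counts.values())
    return [label for label, n in counts.items() if n == largest]

def ordinal_median(values, order):
    """Return the middle category after sorting by the given ordering.

    `order` lists the categories from first to last. With an even number
    of items this picks the lower of the two middle values (an assumption
    of this sketch; either neighbor of the middle is acceptable).
    """
    ranked = sorted(values, key=order.index)
    return ranked[(len(ranked) - 1) // 2]

grades = list("AAAAAABBBBCCC")  # the 13 sorted grades from the example
grade_mode = modes(grades)
grade_median = ordinal_median(grades, order=["A", "B", "C"])
print("mode:", grade_mode, "median:", grade_median)
```

Returning a list from `modes` makes bimodal and multimodal data visible: a tie simply produces more than one label.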


Summary
Frequency tables display the distribution of a categorical variable by showing the counts associated with each category. Relative frequencies show the proportions or percentages. Bar charts summarize graphically the counts of the categories, and pie charts summarize the proportions of data in the categories. Both charts obey the area principle. The area principle requires that the share of the plot region devoted to a category is proportional to its frequency in the data. A bar chart arranged with the categories ordered by size is sometimes called a Pareto chart. When showing the bar chart for an ordinal variable, keep the labels sorted in their natural order rather than by frequency. The mode of a categorical variable is the most frequently occurring category. The median of an ordinal variable is the value in the middle of the sorted list of all the groups.

Key Terms
area principle, bar chart, distribution, frequency table, median, mode, Pareto chart, pie chart, relative frequency table, variation

Best Practices
Use a bar chart to show the frequencies of a categorical variable. Order the categories either alphabetically or by size. The bars can be oriented either horizontally or vertically.

Use a pie chart to show the proportions of a categorical variable. Arrange the slices (if you can) to make differences in the sizes more recognizable. A pie chart is a good way to show that one category makes up more than half of the total.

Preserve the ordering of an ordinal variable. Arrange the bars in order of the labels, not the frequencies. Avoid putting ordinal data into a pie chart.

Respect the area principle. The relative size of a bar or slice should match the count of the associated category in the data relative to the total number of cases.

Show the best plots to answer the motivating question. You may have looked at several plots when you analyzed your data, but that does not mean you have to show them all to someone else. Choose the plot that makes your point.

Label your chart to show the categories and indicate whether some have been combined or omitted. Name the bars in a bar chart and slices in a pie chart. If you have omitted some of the cases, make sure the label of the plot defines the collection that is summarized.


Pitfalls
Cool plots may be deceptive. This 3-D pie chart shows the top six hosts.
[3-D pie chart titled "Referring Sites"; slices: daily-blessings.com, dealtime.com, google.com, msn.com, recipesource.com, yahoo.com]

Looks pretty, doesn't it? But showing the pie on a slant violates the area principle and makes it harder to compare the shares, the principal feature of data that a pie chart ought to show.

Don't show too many categories. A bar chart or pie chart with too many categories might conceal the more important categories. In these cases, group other categories together, and be sure to let your audience know how you have done this.

Use pie charts when you need to emphasize parts of a whole, but avoid them for ordered data. Because of the circular arrangement, pie charts are not well suited to ordinal data; the order gets lost. The pie chart shown below conceals the ordinal nature of the data. The slice indicating companies who expect revenue to increase more than 50% is next to the slice for those who expect revenue to fall the most.8

8 The Wall Street Journal, Nov 2, 2006.


Be careful about rounding, particularly with a pie chart. Do you see the problem with the following pie chart? Evidently, the author rounded the shares to integers at some point, but then forgot to check that the rounded values add up to 100%. Few things hurt your credibility more than these little blunders.9

Not every chart with bars is a bar chart. These two charts from the front page of the Wall Street Journal look like bar charts, but neither shows the distribution of a categorical variable. Both charts use bars to show a sequence of counts over time. It is fine to use bars in this way to emphasize the changing size of the annual value, but these are not displays of a frequency table.10
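The rounding pitfall described above is easy to guard against with a quick check before a chart is published. The Python sketch below rounds each share to a whole percentage and tests whether the rounded values still add up to 100. The function name and the three-category counts are hypothetical and chosen only to trigger the problem; they are not the data from the newspaper chart.

```python
def rounded_percentages(counts):
    """Round each category's share of the total to a whole percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

# Made-up counts: three equal categories, each exactly one third of the total
counts = {"Plan A": 1, "Plan B": 1, "Plan C": 1}
shares = rounded_percentages(counts)
shares_total = sum(shares.values())

print(shares)
if shares_total != 100:
    print(f"Warning: rounded shares add up to {shares_total}%, not 100%")
```

When the check fails, one option is to show one more decimal place; another is to state in the label that the percentages do not sum to 100 because of rounding.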

9 The Philadelphia Inquirer, Mar 7, 2006.

10 The Wall Street Journal, Aug 21, 2006 and Sep 11, 2006.


About the Data

The data on Internet shopping in this chapter come from ComScore, one of many companies that gather data on web surfing. They get the data from computers that people receive for free in return for letting ComScore monitor how they surf the Internet. These visits cover the time period from September through December 2002. An interesting research project would be to see whether the distribution has changed since then.

Software Tips

Excel

Use pivot tables to create a frequency table, following the menu commands (see the on-line help for assistance with pivot tables)

Data > Pivot Table (report)

Once you have the frequency distribution, use the Chart Wizard to build a bar chart or a pie chart. The categories are shown in the chart in the order listed in the spreadsheet. If you'd like some other arrangement, just move the rows around and the chart will be redrawn.

Minitab

If your data use numerical codes for categories, you might first want to convert the numbers into text. That will keep you from averaging zip codes! Use the Data > Change data type commands to convert the data in a column. Columns of text data have a T in the column headers.

The commands for bar charts and pie charts are next to each other on the Graph menu. To get the frequency table of a categorical variable, follow the sequence of commands

Stat > Tables > Tally individual variables

For a bar chart, look in the graph menu and follow the commands Graph > Bar chart, indicate the type of chart, and then fill the dialog with the name of the variable. (Notice that you can produce stacked and clustered versions of bar charts using this dialog as well.) Options allow the plot to show percentages in place of counts. To make a pie chart, follow the commands Graph > Pie chart.

JMP

If you have used numbers to represent categorical data, you can tell JMP that these numbers are categories, not counts. A panel at the left of JMP's spreadsheet lists the columns. If you click on the symbol next to the column name, a pop-up dialog allows you to change numerical data (called continuous by JMP) to categorical data (either ordinal or nominal).

To get the frequency table of a categorical variable, use the Analyze > Distribution command and fill the dialog with the categorical variable of interest. You also will see a bar chart, but the bars are drawn next to each other without the sort of spacing that's more natural. To obtain a better bar chart and a pie chart, follow the menu items Graph > Chart, place the name of the variable in the Categories, X, Levels field, and click the OK button. That gets you a bar chart. To get a pie chart, use the option pop-up menu (click on the red triangle above the bar chart) to change the type of plot.

Version 6 (and later) of JMP includes an interactive table-building function. Check this one out by following the menu items

Tables > Tabulate

This procedure becomes more useful for summarizing two categorical variables at once.
