You are on page 1of 6

July 6, 2010

Quick Guide to Oracle Histograms


Statistics when youve got em, and when theyre reasonable, life is good. But even when theyre good, Oracles optimizer can make some strange decisions, generating some hard to explain query plans. Histograms can be part of an effective solution, but they can also make life worse.

Most software developers are aware of a number of factors that can affect the performance of their database queries. All have experienced the common situation of a query that has been thoroughly tested against a variety of data working well for a quite some time, and then later performing dismally, rather unexpectedly.

Often the culprit is database statistics. Unfortunately, for most developers thats the end of it. The DBA fixes the situation, and all is well again. This article hopes to illuminate a small portion of what is typically a foggy area of understanding. Well use Oracle 10g for our base of reference and examples, but the concepts are similar across most modern relational database systems (Sybase, Informix, MySQL, MS SQL Server, etc.).

Table and index statistics are simply a set of numbers gathered by Oracle to describe your data a way of describing or profiling the landscape of your information. These numbers have internal benefit to Oracle because there are many ways to retrieve data, and these methods can vary significantly in cost. By cost, we mean the resources used, most notably: CPU, memory, and disk i/os. As the volume of data increases, and the complexity of the SQL query increases, statistics become even more critical for Oracle to generate what is known as an optimal query plan a series of steps and techniques that Oracle has at its disposal to acquire the data being requested.

In the past, the optimizer component of an RDBMS would develop a query plan based on a set of rules in a relatively straightforward manner: SQL query + rules = query plan. Starting with Oracle 9i, the trend is costbased optimization rather than rule-based optimization. This is proving to be more flexible, dynamic, and effective. But it has its own set of perils.

Suboptimal query plans generated by Oracles cost based optimizer (CBO) are most often a result of a difference of opinion between your own knowledge of the data being queried, and CBOs opinion of what the data looks like. The CBOs opinion is base entirely on the statistics it has gathered about your data. So when these numbers fail to accurately describe reality, some interesting query plans can be generated.

Here are couple of very simple examples of important statistics: 1. The number of rows contained in a table 2. The number of distinct values that a column currently has.

These, and a number of other statistics are used as input for cost calculations. Oracle calculates a number of candidate query plans and their associated costs, then (usually) selects the plan with the least expensive cost.

Frequency Histograms
Histograms are type of statistic that can be an important ingredient in an accurate cost calculation. Oracle uses histograms to improve its selectivity and cardinality calculations for nonuniform data distributions. Ill explain what this means with a real world example, from Jonah Groups FARE 1 data warehouse project. Within some of the larger fact tables, we use numeric columns to describe the year & month of a date. Using an integer or numeric data type for partial date information rather than a date type is a common tactic to save storage space. It also can make the development of reporting criteria simpler.

The diagram below represents what is called a frequency histogram. In this example, an integer column called PAID_MONTH represents the month in which an insurance claim was paid. In SQL, such a column could be used to select claims paid between (and including) October 2009 and January 2010:

... WHERE PAID_MONTH BETWEEN 200910 AND 201001

Frequency histograms allow for every distinct value for the column to be categorized nicely into its own bucket. One limitation of Oracles frequency histograms is that it has a hard limit of 254 buckets. (In this situation, a frequency histogram works out rather well, since the number of months since the beginning of the FARE data warehouse, and today falls below 254.) By pre-constructing a frequency histogram for this column, Oracle can more accurately calculate the selectivity for this column. The number of qualifying rows would be: 61732 + 20152 + 20652 + 81636 = 184172.

Without a histogram, Oracle would make other assumptions concerning the distribution of data, and this would likely affect the outcome of the selectivity calculation. In this case, one assumption that Oracle uses would be that the number of integers between 200910 and 201001 is 91. But since we have inside knowledge of this data that Oracle doesnt have, we know that the number of months in this range of month values is just 5. Not surprisingly, this discrepancy can result in some very different cost calculations and the choice of some suboptimal query plans, for some SQL queries.

Without a histogram, Oracle sees the distribution of data somewhat like this:

Height Balanced Histograms


Height balanced histograms are similar to frequency histograms in their design, but in this case the number of distinct column values exceeds the hard limit of the number of buckets available (254). Here, the number of

column values are divided into ranges so that each range contains approximately the same number of rows. The number of distinct values represented by a bucket can be more than just one value.

Well use an example to illustrate. Lets suppose there is a small medical clinic of providers that specialize in just 999 different surgical procedures, but together they service many thousands of patients each year. You can imagine that a medical procedure code representing each procedure, might be a useful column in a table that contains these many thousands of patient encounters. A height balanced histogram would be represented by the following diagram.

As with frequency histograms, height balanced histograms can also provide the same benefits to more accurate query plan cost calculations because it could provide Oracle with a more accurate view of the distribution, should it be heavily skewed.

Histograms Or Not?
Oracle statistics gathering and histogram creation are specified through the use of the DBMS_STATS package. There are numerous online resources available that detail the use of this package, so we wont get into this here. Unfortunately, the down-side of having Oracle generate histograms during the statistics gathering process is that theyre quite expensive to create, and they arent actually used very much by the CBO.

Nonetheless, it is important to understand when the use of histograms is a good idea, and when its a bad idea. The default position on this matter should be to avoid constructing histograms if you can. There are large amounts of misinformation on the web indicating otherwise. Typical recommendations are to create them for all indexed columns, and/or let Oracle decide when to create them by using its documented auto or skewonly column parameter modifiers. Sadly, both recommendations are usually bad ideas, and depending on the circumstances lead to greater troubles.

Guidelines For Histograms:



Use as a way to solve a specific query plan issue; and explicitly choose them. And when column data is heavily skewed. And when a column is likely to appear in a querys WHERE clause. (Note: Histograms are not just for indexed columns.) And when the time and resources required to create the histogram are not excessive. This will be very dependent on the data volumes involved.

Avoid Histograms:

When column data is evenly distributed Or the column rarely appears in a WHERE clause Never create them on every column of a table Never on the Primary Key The details of the cost calculations that CBO performs are very well explained in Jonathan Lewiss book titled Cost-Based Oracle Fundamentals. Here youll find plenty of detail concerning Oracle histograms and their use.

You might also like