
Teradata Query Efficiency-TIPS

19/02/2010

BI / Teradata
Arockkia Martin Irudayaraja
arockkiamartin.i@tcs.com


Primary Index

The PI determines the distribution of data across the machine (AMPs).

A Non-Unique PI could cause skewed distribution.

A Unique PI is more likely to improve the performance of SQL.

A Non-Unique PI incurs extra checking when rows are inserted into a Set table, because
rows sharing a PI value must be checked for duplicates.

A PI consists of one or more columns.

Only one PI is allowed per Table.

The PI is created when the Table is created.

The PI is populated when the Table is loaded.
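
As a minimal sketch of the points above, the statements below define a Unique PI and then estimate how evenly a candidate PI column would spread rows across the AMPs. The table and column names (customer, customer_id, region_code) are hypothetical and not taken from any real system.

```sql
-- Hypothetical table: names and columns are illustrative only.
CREATE TABLE customer
(
    customer_id   INTEGER NOT NULL,
    customer_name VARCHAR(100),
    region_code   CHAR(2)
)
UNIQUE PRIMARY INDEX (customer_id);

-- Count the rows each AMP would hold if region_code were the PI.
-- Widely differing counts suggest a skewed distribution.
SELECT HASHAMP(HASHBUCKET(HASHROW(region_code))) AS amp_no,
       COUNT(*) AS row_count
FROM customer
GROUP BY 1
ORDER BY 2 DESC;
```

Running the same check on customer_id should show a far more even spread, because every value is unique.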

Data Retrieval (Select)


Explain the SQL
The Explain statement helps to identify potential performance issues. It analyses the
SQL and breaks it down into its low-level processing steps. Unfortunately, the output
can be very difficult to interpret, but there are two points to look for: the Confidence
Level and Product Joins.
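
A minimal sketch of its use: prefix the query with the EXPLAIN keyword. The customer and orders tables are hypothetical. The query is not executed; Teradata returns the plan as text, in which phrases such as "no confidence", "low confidence", "high confidence" and "product join" can be looked for.

```sql
-- Hypothetical tables, for illustration only.
EXPLAIN
SELECT c.customer_name, o.order_total
FROM customer c
INNER JOIN orders o
    ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2010-01-01';
```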
Confidence Level
Teradata attempts to predict the number of rows that will result at each stage of
the processing, and qualifies the prediction with a confidence level as follows:

No Confidence: Normally means NO STATISTICS are available, or that no Primary Index
is referenced in the JOIN.

Low Confidence: Normally means the STATS are difficult to use precisely, for example
when AND/OR conditions are used in the WHERE clause, or when Table1 is joined on its PI
to Non-PI column(s) of Table2.

High Confidence: Normally means Optimiser is sure of the results based on the
STATS available.

Joins
The following are the types of joins available that can be used in SQL:

Inner Join

Left Outer Join

Right Outer Join

Cross Join


Apart from these, Teradata carries out data retrieval internally using the following join methods:

Merge Join

Nested Join

Hash Join

Merge Join
The merge join requires that both tables (spool files) be sorted on the hash-code of
the columns being joined (or subset of columns being joined). This will be the case if they
both have the same primary index, or if the optimizer decides that the cost is not too great
to sort one or both tables (or spool files) to get them into this state.
Nested Join
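It requires an equality condition to be specified in the WHERE clause of the SQL, typically
one that selects a single row of one table through a unique index. That row is then used to
look up the matching rows in the other table through an index.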
Hash Join
The hash join does not require that both tables are sorted. The smaller table/spool
is "hashed" into memory (sometimes using multiple hash partitions). Then, the larger table
is scanned and for each row, it looks up the row from the smaller table in the hashed table
that was created in memory. If the smaller table must be broken into partitions to fit into
memory, the larger table must also be broken into the same partitions prior to the join.
The main advantage of the hash join is that the big table does not have to be
sorted and the small table can be much larger than for a product join. However, if the
optimizer thinks that the small table is too large, then it will not choose the hash join as it
will not be able to fit the small table in memory, even after breaking into partitions.
In any case, the optimizer will try to determine the best path based on the estimated cost
of the alternatives.
Product Joins
A Product Join occurs when Teradata compares every row of the first table to every row
of the second table. This process can use huge amounts of CPU and SPOOL. A Product Join
is likely to happen in one of the following conditions:

When a join condition is based on an inequality or is missing entirely.

When an Alias has been used to identify a table, but the Alias has not been used
consistently throughout the SQL to identify the table. Therefore, Teradata believes
that a reference is being made to another copy of the table, but there is no join
condition placed on the other table, resulting in a Product Join.

Sometimes a Product Join is an appropriate option for Teradata to use: when Teradata
believes it needs to compare only a small number of rows from one table to another,
a Product Join can be the right choice. HOWEVER, if the STATS on a Table are incorrect,
then Teradata may choose a Product Join when in fact it is the worst choice.
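
The inconsistent-alias case described above can be sketched as follows. The customer and orders tables and their columns are hypothetical.

```sql
-- Problem: the table is aliased as "o", but the WHERE clause refers to it
-- by its full name. Teradata treats "orders" as a second, unjoined copy
-- of the table, which results in a Product Join.
SELECT c.customer_name, o.order_total
FROM customer c
INNER JOIN orders o
    ON o.customer_id = c.customer_id
WHERE orders.order_date >= DATE '2010-01-01';

-- Fix: once an alias is defined, use it everywhere.
SELECT c.customer_name, o.order_total
FROM customer c
INNER JOIN orders o
    ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2010-01-01';
```

Running Explain on the first form should show the additional reference to orders and the Product Join step.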


Join columns of same data type


The data types of the columns should match when joining data because:

The join is inefficient due to the conversion required.

Teradata is unable to compare the demographics of columns that are of a different
type, even if Statistics have been collected. Therefore, the way in which the join is
performed may not be the best choice.

The same type of problem exists when a join is attempted on part of a column (e.g. when
using Substring). Even if Statistics have been collected for the column, Teradata cannot
know the distribution of values in the substring.
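
A sketch of the mismatch and one possible remedy, under stated assumptions: orders.customer_id is INTEGER, while the hypothetical legacy_customer.customer_no is CHAR(10) holding the same values as digits. The staging table is an illustration, not the only way to align the types.

```sql
-- Less efficient: an INTEGER column is joined to a CHAR column,
-- so a conversion is needed for the comparison.
SELECT o.order_id, l.customer_name
FROM orders o
INNER JOIN legacy_customer l
    ON o.customer_id = l.customer_no;

-- One possible remedy: convert once into a staging table whose join
-- column has the matching type, then collect statistics and join to that.
CREATE TABLE legacy_customer_stg AS
(
    SELECT CAST(customer_no AS INTEGER) AS customer_id,
           customer_name
    FROM legacy_customer
) WITH DATA
PRIMARY INDEX (customer_id);

COLLECT STATISTICS ON legacy_customer_stg COLUMN (customer_id);

SELECT o.order_id, s.customer_name
FROM orders o
INNER JOIN legacy_customer_stg s
    ON o.customer_id = s.customer_id;
```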
Avoid Manipulated Columns in Where clauses
Statistics should exist on columns used in Where clauses to restrict rows being
returned (or join conditions). When coding the restrictions or joins, avoid manipulating the
columns wherever possible. The optimiser is unable to utilise the statistics on manipulated
columns.
For example, rather than coding ColumnA - 2 < Date, code ColumnA < Date + 2.
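
Written out as full statements against a hypothetical orders table, with CURRENT_DATE standing in for the date value, the two forms look like this:

```sql
-- Less efficient: arithmetic is applied to the column itself.
SELECT order_id
FROM orders
WHERE order_date - 2 < CURRENT_DATE;

-- Preferred: the column is left untouched and the other side is adjusted.
SELECT order_id
FROM orders
WHERE order_date < CURRENT_DATE + 2;
```

Both queries return the same rows; only the second leaves order_date in a form the statistics describe.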
Use Date Functions where possible
There are a number of Date Functions that are available to help with the
manipulation of columns that are defined as Dates. Use them rather than attempting to
redefine the date as a character column and split it into its component parts. Teradata
does it much more efficiently.
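
A short sketch of the difference, assuming a hypothetical orders table with a DATE column named order_date:

```sql
-- Less efficient: convert the date to characters and cut it up.
SELECT SUBSTR(CAST(CAST(order_date AS FORMAT 'YYYY-MM-DD') AS CHAR(10)), 1, 4) AS order_year
FROM orders;

-- Preferred: use the built-in date functions.
SELECT EXTRACT(YEAR FROM order_date)  AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month,
       ADD_MONTHS(order_date, 1)      AS one_month_later
FROM orders;
```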
Use Union All instead of just Union
When creating a Union of 2 sets of rows, the default form of the statement checks for
and removes duplicate rows, which is unnecessary if duplicates are acceptable.
In the majority of situations, it is known that duplicates cannot possibly exist, or if they
do exist then it is correct to select them. Therefore, in the majority of cases it is better to
code Union All, which skips the duplicate check and returns every row.
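
For illustration, with two hypothetical tables orders_2009 and orders_2010 that cannot share rows:

```sql
-- Default form: Teradata removes duplicate rows from the combined result.
SELECT order_id, customer_id FROM orders_2009
UNION
SELECT order_id, customer_id FROM orders_2010;

-- Preferred when duplicates cannot occur or are acceptable:
-- the duplicate check is skipped.
SELECT order_id, customer_id FROM orders_2009
UNION ALL
SELECT order_id, customer_id FROM orders_2010;
```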

Data Maintenance (Insert/Update/Delete)


Collect Statistics
When processing SQL that joins 2 or more tables, Teradata's choice of join plan depends
on its knowledge of the values of the data in the columns referenced in the
SQL. Efficiency of query plans will be improved if Statistics on join columns are available
and up-to-date.
Statistics should be collected on:

Primary index columns of every table and also on all known columns used in joins
or restrictions in queries.

Combinations of columns known to be frequently joined.


Any column that features in WHERE conditions.

Statistics should normally be collected after the data has been loaded, or reloaded,
or significantly updated.

If the table changes so frequently that recollecting statistics every time would have
a resource impact, then a threshold after which statistics will be collected must be
identified.

If Statistics are not collected or are not current, and the wrong plan is used by
Teradata, then many thousands of CPU seconds can be used instead of a few hundred.

The elapsed time of queries is frequently reduced from hours to minutes through
judicious collection of statistics.
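
The guidelines above can be sketched as follows. The orders table and its columns are hypothetical, and order_id is assumed to be its primary index.

```sql
-- Hypothetical table and columns, for illustration only.
COLLECT STATISTICS ON orders INDEX (order_id);                  -- primary index
COLLECT STATISTICS ON orders COLUMN (customer_id);              -- join column
COLLECT STATISTICS ON orders COLUMN (order_date);               -- WHERE-clause column
COLLECT STATISTICS ON orders COLUMN (customer_id, order_date);  -- columns used together

-- After a load or significant update, refresh all statistics
-- already defined on the table.
COLLECT STATISTICS ON orders;

-- Show which statistics exist and when they were last collected.
HELP STATISTICS orders;
```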

Insert Select Rather than Update


Teradata is particularly fast at inserting new rows into an empty table, and because
of this, it can be more efficient to use this technique rather than performing an Update if it
would affect a lot of rows in the table.
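e.g. Update OldTable set Column1 = Column1+10;
Can be replaced by the following, where NewTable is an empty table whose columns match
the order of the Select list and which takes the place of OldTable afterwards:
Insert into NewTable Select Column2, Column3, Column1+10 from OldTable;
Note: As a rule of thumb, the Insert Select is better if updating more than 20% of a table
with more than 1 million rows.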
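
A fuller sketch of the technique, assuming OldTable has exactly the columns Column1, Column2 and Column3. The column list is spelled out here so that the column order of NewTable does not matter. The final swap is one possible way to finish, not something the text above prescribes.

```sql
-- Create an empty copy of the table definition.
CREATE TABLE NewTable AS OldTable WITH NO DATA;

-- Apply the "update" while copying the rows into the empty table.
INSERT INTO NewTable (Column1, Column2, Column3)
SELECT Column1 + 10, Column2, Column3
FROM OldTable;

-- After checking the result, swap the tables.
DROP TABLE OldTable;
RENAME TABLE NewTable TO OldTable;
```

Statistics would normally need to be collected again on the new table after the swap.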
Remove Secondary Indexes when the data is being loaded, updated or deleted
If a table requires a Secondary index, create the index after the data has been
loaded into the table to ensure that the load process is completed as fast as possible.
If an Update or Delete is being performed, remove all Secondary indexes before
applying the change. Once the change is complete, re-create the Secondary index.
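
As a sketch of this sequence, assuming a hypothetical orders table with an unnamed Secondary index on customer_id:

```sql
-- Remove the Secondary index before the bulk change.
DROP INDEX (customer_id) ON orders;

-- Apply the change.
DELETE FROM orders
WHERE order_date < DATE '2005-01-01';

-- Re-create the Secondary index once the change is complete.
CREATE INDEX (customer_id) ON orders;
```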
Create Tables that are appropriate to requirements
In Teradata, the default form of Table is the Set table, so most people use only
this. However, a Volatile table is much better for temporarily storing data that is not
required after the session has finished. The reason is that the Volatile Table is the only
table type whose creation does not take restrictive locks on the Dictionary.


Listed below are the 4 different table types and their characteristics:

Set Table: Duplicate rows not allowed. (Teradata default)

Multiset Table: Duplicate rows allowed.

Volatile Table: Defined for duration of session only. Rows only exist for duration of
transaction, unless Table definition includes On Commit Preserve Rows. Cannot collect
statistics.

Global Temporary Table: Same as Volatile except definition is permanent, and data is
deleted at end of session. Can collect statistics.
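
A minimal sketch of a Volatile table used as temporary storage. The orders table and the name tmp_customer_totals are hypothetical. ON COMMIT PRESERVE ROWS keeps the rows for the rest of the session rather than only for the transaction.

```sql
-- Hypothetical source table and volatile table name.
CREATE VOLATILE MULTISET TABLE tmp_customer_totals AS
(
    SELECT customer_id,
           SUM(order_total) AS total_spent
    FROM orders
    GROUP BY customer_id
) WITH DATA
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;

-- The table can be queried for the rest of the session and
-- disappears automatically when the session ends.
SELECT customer_id, total_spent
FROM tmp_customer_totals
WHERE total_spent > 1000;
```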
