
Pure Data for Analytics Best Practices

Technical Account Team


For questions about this presentation contact Joe Baric at jbaricr@us.ibm.com

V 1.1
July 18, 2013 © 2013 IBM Corporation
PDA Best Practices Agenda
Data Distribution and join strategies

Datatypes

Zonemaps

Clustered Base Tables

Materialized views

Statistics

Groom



Introduction
The PureData for Analytics appliance is designed from the ground up for simplicity of
management and operation.
As such, best practice does not mean hundreds of rules and regulations to follow.
Instead, it is recommended that basic principles are followed in these areas:
• Distribution
• Data types
• Statistics
• Zone maps
• Data organization (CBTs, materialized views)
• Groom
Follow the basic rules and 99% of applications will perform well.
Focus and effort can then target enhancing for ultra-high performance where required and
addressing any performance issues that are hit (the 1%).
Best practice means minimal effort early on for maximum gain.


Distribution
Good distribution is a fundamental element of performance!
A data slice is an individual element of parallelism.
E.g. an N1001-010 has 92 data slices and an N1001-005 has 46 data slices.
If all data slices have the same amount of work to do, a query will run 92 times quicker
than if a single data slice were asked to do all of the work.
Distribution is determined by the DISTRIBUTE ON clause in the CREATE TABLE statement.
The same value is always hashed to the same data slice.
Skew to one data slice is the worst-case scenario.
Skew affects both the query dealing with the skewed data and other queries, as the SPU
processing that data slice has more work to do.
Skew also means the machine fills up much quicker (you are as full as your largest data slice).
Simple rule: good distribution – good performance.
Distributions and Performance
[Figure: CPU, disk I/O, and network activity shown per data slice (1–6). Response time
is affected by the completion time for all of the data slices in the MPP array.]

A distribution method that distributes data fairly evenly across all data slices is one of
the most important factors that influence overall performance.
Hash Distributions and Data Skew
[Figure: distributing on Gender = M or F places all table records on just 2 data slices,
so response time is gated by those two slices while the rest sit idle.]

Select a distribution key with unique values and high cardinality.

Hash Distributions and Processing Skew
[Figure: rows distributed by month (Jan–Jul) land evenly across data slices 1–7, but a
query restricted to a date range drives all of its work to one or a few slices.]
Using a DATE as the distribution key may distribute rows evenly across all data slices.
However, most analysis (queries) is performed on a date range. Massively parallel
processing won’t be achieved when all of the records to be processed for a given date
range are located on one or a few data slices.
Collocated Tables
[Figure: data slices 32–34, with the customer (c_custkey) and orders (o_custkey) rows for
the same customer – e.g. all "Cust 007" rows – stored on the same data slice.]

• Both tables are distributed on the customer key (c_custkey / o_custkey).
• Joining on the customer key is done locally on the data slice.
• No data movement occurs between data slices, SPUs and the host.
• Maximum performance is achieved.
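As a sketch of the DDL behind this picture (the column definitions here are assumed, not taken from the slides), a collocated pair might look like:

```sql
-- Both tables hash-distributed on the customer key, so rows for the
-- same customer land on the same data slice
CREATE TABLE customer (
    c_custkey INTEGER NOT NULL,
    c_name    VARCHAR(64)
) DISTRIBUTE ON (c_custkey);

CREATE TABLE orders (
    o_orderkey INTEGER NOT NULL,
    o_custkey  INTEGER NOT NULL,
    o_date     DATE
) DISTRIBUTE ON (o_custkey);

-- The join key matches both distribution keys, so the join runs
-- locally on each data slice with no redistribution or broadcast
SELECT c.c_name, COUNT(*) AS order_count
FROM customer c
JOIN orders o ON c.c_custkey = o.o_custkey
GROUP BY c.c_name;
```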
Single Table Redistribute
[Figure: customer rows stay in place while orders rows are moved between data slices so
that matching customer keys meet on the same slice.]

• Customer is distributed on c_custkey but orders is not (it is distributed on o_orderkey).
• Data movement occurs between data slices and SPUs for orders.
• The optimizer chooses to redistribute the table that results in the lowest overall plan cost.
• Performance is good, but not as good as a collocated join.
Double Table Redistribute
[Figure: rows from both tables are moved between data slices before the join can run
locally.]

• Both tables are distributed on columns other than the join key.
• Data movement occurs between SPUs for both tables.
• There is a performance cost to this.
Using "... DISTRIBUTE ON RANDOM"
[Figure: orders distributed on random; rows for the same customer are scattered across
all data slices and must be moved before they can be joined.]

• Random distribution is round-robin, so the data is spread evenly across all SPUs.
• Lots of data movement occurs between data slices and SPUs.
• The worst case is when both tables are random and large.
• This will give very poor performance.
Broadcast Table (via DBOS/Host)
[Figure: the host broadcasts the small nation (n_nationkey) table so that every data
slice receives a complete copy – US, UK, Japan, China – to join against its local
customer rows.]

• A small table may be broadcast to all data slices ("small" being a relative term!).
• The data is then joined locally on each data slice.
• This is one case where the use of random distribution for the large table will still work.
• Performance will be good as long as the broadcast table does not get large.
Distribution
To create an explicit distribution key, the Netezza SQL syntax is:
usage: CREATE TABLE <tablename> [ ( <column> [, … ] ) ]
       DISTRIBUTE ON [HASH] ( <column> [, … ] ) ;

The phrase DISTRIBUTE ON specifies the distribution key; HASH is optional.
You cannot update columns specified as the DISTRIBUTE ON key.

To create a random distribution, the Netezza SQL syntax is:
usage: CREATE TABLE <tablename> [ ( <column> [, … ] ) ]
       DISTRIBUTE ON RANDOM;

Never create a table without specifying a distribution key.
If no key is specified, the NPS chooses a distribution key. There is no guarantee what
that key will be, and it can vary depending on the NPS software release.
You can set a system parameter to default to RANDOM distribution if no key is specified.
Distribution
When choosing a distribution key, consider the following factors:
• The more distinct the distribution key values, the better your distribution is likely to be.
• The same distribution key value always goes to the same SPU.
• Tables frequently joined together should use the same columns for their distribution
key when possible.
• Look at the ON clauses in your table joins for distribution candidates – this will allow
you to collocate tables that are commonly joined. What is your most painful join – fact
to fact, or fact to largest dimension? Look to collocate these.
• If performance is slower than you expect, check that there is no accidental processing
skew even where record distribution is good (see the date example).
• If there are no good choices for a single-column key, consider random, as it will give
perfect distribution.
• Small reference tables are likely to get broadcast, so random is usually a good choice for them.
Distribution

Checking Distribution
• You can review actual table distribution using the Netezza Performance Portal.
• Alternatively:
• Using SQL:
Select datasliceid, count(datasliceid) as "Rows"
from table_name group by datasliceid order by "Rows";

• Using the support tools:
nz_skew DBA script (runs on the Netezza host system)

Always review distribution after implementation to ensure there is no skew when deployed,
or skew that was hidden in development.
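A hypothetical variation on the query above summarizes the spread in a single row instead of listing every data slice:

```sql
-- table_name is a placeholder; the gap between min_rows and max_rows
-- indicates skew (ideally the two values are nearly equal)
SELECT MIN(cnt)            AS min_rows,
       MAX(cnt)            AS max_rows,
       MAX(cnt) - MIN(cnt) AS spread
FROM ( SELECT datasliceid, COUNT(*) AS cnt
       FROM table_name
       GROUP BY datasliceid ) AS per_slice;
```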


Data types
Picking the right data types will give you performance for free.

– You will save space – remember, on a multi-billion-row table, saving a few bytes per
row will result in tens of gigabytes of space savings. Whilst data at rest is compressed,
it is uncompressed as it is moved between SPUs or processed.

– You will maximize the benefit from zone maps.

Database operations such as displaying, sorting, aggregating, and joining produce
consistent results. There is no conflict over how different types are compared or displayed.

Having appropriate datatypes helps the system to process queries efficiently:

– Integer operations are quicker.

– The more compact your datatypes, the less data has to be moved around the system.
Data types
NUMERIC datatypes with a scale of 0 are similar to INTEGER datatypes.

– A switch to an INTEGER datatype means zone maps and better performance

Floating point data types (REAL/DOUBLE PRECISION) are, by definition, lossy in nature.

– Unless you are working with very large numbers or very small fractions are you sure
you want to be using these datatypes?

Inconsistent data types for the same column on different tables impacts performance

– If you join on them, there is additional processing needed to convert to the same type

– If it’s a distribution key, you will not get collocated data for the same VALUE in
different DATATYPES

The nz_best_practices script can be used to flag where a better data type is available.

– But use common sense!
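As an illustrative sketch (the table and column names here are invented), a CTAS rebuild is one way to tighten types on an existing table:

```sql
-- NUMERIC(18,0) -> BIGINT gains zone maps and cheaper arithmetic;
-- TIMESTAMP -> DATE saves space if the time part is never queried
CREATE TABLE sales_new AS
SELECT CAST(txn_id   AS BIGINT)   AS txn_id,
       CAST(store_id AS SMALLINT) AS store_id,
       CAST(txn_ts   AS DATE)     AS txn_date,
       amount
FROM   sales
DISTRIBUTE ON (txn_id);
```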




Zone Maps™ Justification

Rolling data: add a new day, roll off an old day.

Call records, web logs, financial transactions, etc.

1000 days of data, no data skew.

Why scan 999 days when the query only wants one day?
Zone Maps
Zone maps can be used to scan only relevant data.
The system knows where data is not, and only scans the relevant table extents/pages.

Bad zone map:

Extent # | DAT_CREAZIONE (Min) | DAT_CREAZIONE (Max) | ORDER'ed
---------+---------------------+---------------------+---------
       0 | 2006-10-01 00:21:47 | 2009-10-10 19:02:27 |
       1 | 2006-10-02 13:07:21 | 2009-10-10 19:30:32 |
       2 | 2006-10-01 21:00:48 | 2009-11-18 23:07:02 |
       3 | 2006-10-02 11:05:06 | 2009-11-18 23:31:25 |

Good zone map:

Extent # | DAT_CREAZIONE (Min) | DAT_CREAZIONE (Max) | ORDER'ed
---------+---------------------+---------------------+---------
       0 | 2006-10-01 00:21:47 | 2006-10-01 19:02:27 |
       1 | 2006-10-01 19:02:28 | 2006-10-02 19:30:32 | TRUE
     .........
     247 | 2009-11-17 21:00:48 | 2009-11-18 23:07:02 | TRUE
     248 | 2009-11-18 23:07:02 | 2009-11-21 23:31:25 | TRUE
Zone Maps: Automatic Performance
Zone maps enable the optimizer to take advantage of the inherent ordering of data within
a data slice.
– When stats are run, the zone map is created.
– For every eligible column (integers, timestamps, dates) in the table.
– Min and max values per 3MB extent (or 128KB page in NPS 7.0) are gathered.
When a query runs, the system skips over extents/pages where it knows the data does NOT
reside.
Automatically maintained:
– During stats
– During inserts, updates, loads, and groom (but not deletes!)
Also exploited by clustered base tables.
Zone Maps: Automatic Performance cont.
Design ETL architecture and batch jobs with zone maps in mind.
– Load in sets of data (dates, stores, sites, etc).
– Alternatively, take advantage of clustered base tables.
Tip: primary key lookups. If you have low-latency queries that target specific rows:
– Sort your table on the primary key.
– PK lookups will be as optimal as possible.
– Example: if customer is a dimension table, sort by customer_id (PK) when loading data.
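A minimal sketch of that tip, assuming a staging table feeds the dimension (both names invented):

```sql
-- Sorting on the primary key at insert time keeps the zone map on
-- customer_id tight, so a PK lookup touches very few extents
INSERT INTO customer_dim
SELECT *
FROM   customer_stage
ORDER BY customer_id;
```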
Zone Maps: Automatic Performance cont.
Validate/Check
– If a table is rebuilt with CTAS with a different DISTRIBUTION, the data ordering will
change and zone maps will behave differently.
– This can also happen when data is migrated between different-sized systems.
– The nz_zonemap script will tell you how sorted the data in your table is.

Extent # | customer_num (Min) | customer_num (Max) | ORDER'ed
---------+--------------------+--------------------+---------
       0 | 300127             | 9100808            |
       1 | 51775807           | 97100423           | TRUE
       2 | 100000053          | 381221123          | TRUE


Clustered base tables (CBT)
A new feature introduced in version 6.0.

A CBT is a user table whose data is organized using one to four organizing keys.

An organizing key is a column of the table that you specify for clustering the records.

Netezza uses the organizing keys to group records within the table.

Records are saved in the same or nearby extents/pages.

Netezza also creates zone maps for the organizing columns, including the first eight
characters of a character organizing field.

This accelerates the performance of queries that restrict using the organizing keys.
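A sketch of the DDL (table and column names assumed), following the ORGANIZE ON syntax:

```sql
-- Up to four organizing keys may be listed; here two are used so
-- that queries restricting on either key benefit from clustering
CREATE TABLE call_detail (
    call_date  DATE,
    caller_id  BIGINT,
    duration_s INTEGER
)
DISTRIBUTE ON (caller_id)
ORGANIZE ON (call_date, caller_id);
```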
Clustered base tables (CBT) cont.
Used generally on large fact tables (millions to billions of rows)

CBTs improve performance of “multi-dimension” lookups.

CBTs allow you to incrementally organize data within your user tables in situations where
data cannot easily be accumulated in staging areas for pre-ordering before insertions.

CBTs can help you to eliminate or reduce pre-sorting of new table records prior to a
load/insert operation.

CBTs do not replicate the base table data and do not allocate additional data structures.

CBTs may reduce compression (e.g. sequential rowids) but the impact will be small.

Guidelines on choosing organizing keys are in the system admin guide.


CBT vs manually ordered tables
The aim of a CBT is to give equal performance for all of the organizing columns.

– A standard ordered table will give great performance for the first column, less so for
the second, even less for the third, etc.

This means that in cases where you are only interested in restricting by a single column
(say a date field), a standard table will outperform or match a CBT.

However...

– CBTs only need grooming periodically, and groom will only reorder the data that needs it.

– You can change your ordering by altering the ORGANIZE ON clause and re-grooming.
No change to any ETL code is needed.

– Groom only needs a small space overhead and locks only at commit time. A CTAS
and rename needs 100% of the table space as an overhead.
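Changing the ordering might look like the following sketch (table and column names are hypothetical):

```sql
-- Swap the clustering keys, then re-groom; loads and other ETL code
-- are unaffected by the change
ALTER TABLE call_detail ORGANIZE ON (store_id, call_date);
GROOM TABLE call_detail RECORDS ALL;
```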
CBT in practice
The CBT uses a space-filling curve in n-dimensional space to determine the best key
order.

For a single column this may result in slightly poorer performance than a perfectly
ordered zone map, but it will still be vastly better than no ordering at all.

Unordered column zone map vs clustered column zone map – a significant improvement.

Ordered column zone map vs clustered column zone map – not as efficient, but still good.
CBT: Things to consider

Be sure to groom any CBT that is regularly updated (The admin tool will give you a
‘percent organized’ figure).

Consider how ordered the incoming data is and what the nature of the changes will be on
an ongoing basis.

You cannot have a materialized view on a CBT – it’s one or the other.


Materialized Views
A materialized view (mview) in a PDA system is a relatively specific tool:

– It exists on one table only.

– It provides a subset of columns materialized on disk.

– It can be ordered in a different way than the base table.

– It requires maintenance as data is added.

Typically used to provide high-speed lookups on specific columns of a table in
order to satisfy low-latency queries.

Alternatively, an mview provides zone maps on the first eight characters of a character field.

Finally, for a single row or a small number of rows the mview can be used as an index –
it contains a pointer to the block that holds the base row in the original table.
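A sketch of the DDL (view, table and column names assumed):

```sql
-- A thin projection of the customer table, re-sorted on the lookup
-- column so its zone maps serve low-latency phone-number lookups
CREATE MATERIALIZED VIEW cust_phone_mv AS
SELECT customer_id, phone_number
FROM   customer
ORDER BY phone_number;
```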
Materialized Views cont.
[Figure: a wide base table (columns 1…n) next to a thin materialized view holding a
sorted projection of a few frequently queried columns.]

• A MATERIALIZED VIEW reduces the width of data being scanned in a base table by
creating a thin version of the base table that contains a small subset of frequently
queried columns.
• The MATERIALIZED VIEW containing the sorted projection (columns) is stored in a
table on disk and is used to increase query performance.
• Materialized views transparently avoid scanning unreferenced columns.
Materialized Views cont.
Mviews are useful if you have a very wide table and run queries that use only a subset
of the columns.

When data is loaded into the base table, the columns the mview is interested in are
appended at the tail end of the mview in an unsorted manner.

This means that to keep good performance, the mview needs to be rebuilt periodically.

How often depends on your performance tolerances.

Use ALTER VIEW <mview_name> MATERIALIZE REFRESH;

Tables with a materialized view have a number of restrictions:

– You cannot GROOM a table with an active mview on it.

– You cannot have an mview on a clustered base table (or vice versa).

– Mview data is not backed up, only its definition. The mview is recreated when you
restore the table.
Materialized Views: Things to consider
Whilst a useful tool, these objects add maintenance overhead. This needs to be thought
through and built in, or you will hit performance issues over time:

– Periodic refresh.

– The mview needs to be suspended before a groom and refreshed afterwards.

Consider whether a CBT will do what you need before looking at mviews, or whether your
tables are appropriately distributed.
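The groom interaction can be scripted along these lines (object names assumed):

```sql
-- Suspend the mview so the base table can be groomed, then rebuild it
ALTER VIEW cust_phone_mv MATERIALIZE SUSPEND;
GROOM TABLE customer RECORDS ALL;
ALTER VIEW cust_phone_mv MATERIALIZE REFRESH;
```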


Statistics
Netezza uses a cost-based optimizer.

The more up to date and accurate table statistics are, the more chance the query
optimizer has of generating an optimal plan.

Statistics collection should be built into ETL and ELT processing wherever possible.

This does not mean you need to collect statistics on large tables for every load; it
means you need to think carefully about when and how to collect statistics.

'Mop up' jobs to capture or report on any missed statistics are also recommended.

Regular monitoring should be deployed to check for out-of-date statistics.

Consider running the nz_genstats script periodically to validate that statistics have
been deployed to standard, and to help with this monitoring.
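For reference, statistics are collected with GENERATE STATISTICS (the table and column names below are placeholders):

```sql
-- Full statistics on every column of the table
GENERATE STATISTICS ON sales;

-- Or restrict collection to the columns the optimizer relies on most
GENERATE STATISTICS ON sales (txn_date, store_id);
```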




Transaction and Row Ids – a brief background
It's important to understand how a PDA appliance handles transactions, concurrency, and
data consistency in order to understand how to manage your tables.

Transaction Control using RowID

Create XID | Delete XID | RowID   | Null Vector | Record: column_1 … column_n
8 bytes    | 8 bytes    | 8 bytes | n/8 bytes   | 4 bytes + column data …

The rowid is an 8-byte integer stored in each record in a table.

It is guaranteed to be unique,
– but not necessarily sequential within a table.

Rowids start at the value 100,000.

The SMP host allocates a block of sequential rowids to each SPU.
– When records are inserted, the SPU assigns a rowid.

Transaction IDs

Create XID | Delete XID | RowID   | Null Vector | Record: column_1 … column_n
8 bytes    | 8 bytes    | 8 bytes | n/8 bytes   | 4 bytes + column data …

Two transaction IDs (XIDs) are stored in each record in a table:

– Create XID is the transaction ID that created the record.

– Delete XID is the transaction ID that deleted the record.
  • It defaults to 0 if the record is not deleted.

– Each is an 8-byte integer,
  • of which 48 bits are significant.

– XIDs are sequential in nature.
  • The starting value is 1,024.
  • Each time the NPS starts up, the next XID value used will be a multiple of 1,024.


Concurrency and Isolation
Netezza SQL implements serializable transaction isolation, which provides the highest
possible level of consistency.

Netezza SQL does not use conventional locking to enforce consistency among concurrently
executing transactions.

Instead, Netezza implements a combination of:

– Multi-versioning
  • Transactions see a consistent state that is isolated from other transactions that
    have not been committed.

– Serialization dependency checking
  • Concurrent executions that are not serializable are not permitted.

A user cannot explicitly lock tables.


Updates

Records are not updated in place on disk:

– The original record is marked for deletion (logically deleted) and its Delete XID is set.

– A new record is inserted that reuses the deleted record's rowid, with the Create XID of
  the updating transaction.
  • The rowid is preserved.

This maintains data integrity, and simplifies and accelerates rollback/recovery
operations.


Transaction Example

Record is INSERTed or loaded:

Create XID | Delete XID | rowid  | record
1024       | 0          | 100000 | data, data, data

Record is UPDATEd – the original record is marked as deleted and a new record is added
to the database:

Create XID | Delete XID | rowid  | record
1024       | 1025       | 100000 | data, data, data
1025       | 0          | 100000 | data, some changed data, data

Record is DELETEd:

Create XID | Delete XID | rowid  | record
1025       | 1026       | 100000 | data, some changed data, data


Aborted Transaction Example

INSERT or load is aborted – the Delete XID is set to the special value 1:

Create XID | Delete XID | rowid  | record
2100       | 1          | 100000 | data, data, data

UPDATE is aborted – the system reverts to the prior version of the record:

Create XID | Delete XID | rowid  | record
2200       | 0          | 200000 | data, data, data
2201       | 1          | 200000 | data, some changed data, data

DELETE is aborted – the Delete XID is simply set back to a value of 0:

Create XID | Delete XID | rowid  | record
2300       | 0          | 300000 | data, some changed data, data


Groom
Introduced in version 6.0 to replace RECLAIM.

What does groom do?

• Recovers space occupied by outdated or deleted rows.
• Reorganises records in clustered base tables.
• Consolidates tables that have had columns added or removed.

Why is groom important?

• It frees up space that is being used by logically deleted data.
• Tables with large numbers of redundant records can perform poorly, as these records
still need to be scanned.
• Unless groomed regularly, clustered base tables will become less organized (affecting
performance).
Groom – the new reclaim
Grooming to recover space:

• Groom tables that receive frequent updates or deletes, or if you cancel/abort a load,
as this will result in deleted rows.
• As with reclaim, you can groom to remove fully empty leading/trailing pages (i.e. there
is no valid data in the entire page). This is the quickest option and is suitable if you
are, for example, loading data in day by day or housekeeping to remove old data
regularly.
• You can groom at a record level to remove all deleted records regardless of their
location. This will give you the best space gains but takes longer.
• Deciding which option to use requires knowledge of your specific circumstances.
• Consider using the nz_groom script to assist with automated grooming.
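The two space-recovery options correspond to different GROOM variants (the table name is a placeholder):

```sql
-- Quickest: free only pages that contain no valid rows at all
GROOM TABLE sales PAGES ALL;

-- Most thorough: remove every logically deleted record, wherever it is
GROOM TABLE sales RECORDS ALL;
```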
Grooming for ordering
Grooming to order tables:

• Groom clustered base tables when the organization percentage is low.
• Use /nz/kit/bin/adm/tools/cbts_needing_groom or the Netezza Performance Portal to
determine when you need to groom new data.
• If you've changed the organizing key or carried out an initial load, you need to GROOM
TABLE RECORDS ALL.
• For incremental loads after that, you should use GROOM RECORDS READY to only
reorganize new data.
Groom and backups
Groom and backups work together:

• Once you have carried out at least one full backup using nzbackup, you need to
consider carefully the impact on groom.
• Groom can only physically delete rows that have been recorded in a backup – this is
to ensure that any differential backups contain a complete record of the rows that have
been removed, for replay.
• Ensure your groom command follows a differential or full backup to maximize the
space recovered.
• If you need to groom immediately, you can specify RECLAIM BACKUPSET NONE to
force groom to reclaim all rows regardless of whether they have been backed up.
• If you do this, then the next backup will take a full table backup regardless of whether
you specified a full or differential backup.
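As a sketch (table name assumed), an immediate groom that ignores backup sets might look like:

```sql
-- Reclaims all deleted rows now; the next backup of this table will
-- then be a full backup regardless of what was requested
GROOM TABLE sales RECORDS ALL RECLAIM BACKUPSET NONE;
```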
Groom
Things to keep in mind:

• Groom is non-blocking, so it does not block other workloads.
• Groom is an intensive operation and will consume resources, but it is under GRA
control.
• Align groom schedules with backups.
• Groom is a SQL command, so it can be run by anyone with the right permissions from
any tool.

Can groom be avoided?

• Groom is recommended for clustered base tables to keep them ordered.
• From the space perspective, if you have a table whose contents are deleted
completely, consider using TRUNCATE rather than DELETE.
• This eliminates the need to run the groom command, BUT it does require an exclusive
table lock.
ETL/ELT guidelines
Avoid many small inserts or updates, especially single-row inserts; use bulk load
methods where possible.
• You can tell what your ETL is doing 'behind the scenes' by looking in the pg.log file
after a job has completed – you want to see external table loads.
• Bulk loads will be orders of magnitude more efficient.
Avoid forcing things into a serial stream with cursor-based processing. You will get far
better performance using a series of set-based SQL operations.
Use ORDER BY on inserts for primary key, date or common join fields to optimize zone
maps, or consider using CBTs and building groom into your schedule.
Look to establish standard load and ETL methods (best practice) for the ETL and load
tools that you use.
Minimize I/O between the host and the ETL server where possible.
When loading with multiple instances of nzload, consider how many parallel streams you
want – typically (number of host CPU cores - 2) is the highest you'll want to go.
Build backup, groom and generation of statistics into the schedule as closely as possible.
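One way to get the external-table loads mentioned above is a transient external table; the file path and options here are illustrative only:

```sql
-- Bulk load from a delimited file on the host via a transient
-- external table, instead of row-by-row INSERTs
INSERT INTO sales
SELECT *
FROM EXTERNAL '/data/sales_20130718.csv'
SAMEAS sales
USING (DELIMITER ',' SKIPROWS 1);
```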
© International Business Machines Corporation 2012
International Business Machines Corporation New Orchard Road Armonk, NY 10504
IBM, the IBM logo, PureSystems, PureFlex, PureApplication, PureData and ibm.com are trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide.
A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml
All rights reserved. WAP12402-USEN-01
