Cube Best Practices

Designing High Performance Cubes In Analysis Services 2008
Esendal Yasin Lead Architect Microsoft MEA HQ
Agenda
Dimension Design Cube Design Partitioning Aggregations Scale-out
Attribute Relationships
1 M relationships between attributes Server simply works better Examples
City State, State Country Day Month, Month Quarter, Quarter Year Product Subcategory Product Category
Rigid versus flexible relationships

Flexible (Default)
Customer City Customer PhoneNo
Rigid
Customer BirthDate City State
All attributes implicitly related to key attribute

3
Illustration
Country
State
City
Gender
Marital
Age
Customer
Where are they used
Storage
Query performance
Greatly improved effectiveness of in-memory caching Materialized hierarchies when present
Processing performance: Fewer, smaller hash tables result in faster, less memory intensive processing Aggregation design: Algorithm needs relationships in order to design effective aggregations Member properties: Attribute relationships identify member properties on levels
Natural Hierarchies
1:M relation (via attribute relationships) between every pair of adjacent levels Examples
Country-State-City-Customer (natural) Country-City (natural) Age-Gender-Customer (unnatural) Year-Quarter-Month (depends on key columns)
How many quarters and months? 4 & 12 across all years (unnatural) 4 & 12 for each year (natural)
6
Natural Hierarchies
Performance implications
Only natural hierarchies are materialized on disk during processing Unnatural hierarchies are built on the fly during queries (and cached in memory) Server internally decomposes unnatural hierarchies into natural components
Essentially operates like ad hoc navigation path (but somewhat better)
Aggregation designer favors user defined hierarchies

7
Large Dimensions
Optimizing Processing
Use natural hierarchies
Good attribute/hierarchy relationships forces the AS engine to build smaller DISTINCT queries versus one large and expensive query Consider size of other properties/attributes
Dimension SQL queries are in the form of select distinct Key1, Key2, Name, , RelKey1, RelKey2, from [DimensionTable]
Important to tune your SQL statements

Indexes to underlying tables Create a separate table for dimensions Avoid OPENROWSET queries Use Views to create your own version of query binding
Size limitations for string stores and effect on dimension size

4 GB, stored in Unicode, 6 byte per-string overhead. E.g. 50-character name: 4*1024*1024*1024 / (6+50*2) = 40.5 million members
Dimension Processing
Byattribute vs. Bytable
ProcessingGroup property on each Dimension
By default it is set to ByAttribute This is usually the best choice
What is ByTable?
Sends one SQL query per dimension table Can result in better performance in limited cases Memory requirements and resource cost much higher for AS!
Example
Two dimensions each with >25M members ByTable
Took 80% of available memory (25.6GB out of 32GB)
ByAttribute
Only took 9GB out of 32GB
Agenda
10
Cube Dimensions
Dimensions
Consolidate multiple hierarchies into single dimension (unless they are related via fact table) Use role-playing dimensions (e.g., OrderDate, BillDate, ShipDate)avoids multiple physical copies Use parent-child dimensions prudently
No aggregation support
Set Materialized = true on reference dimensions Use many-to-many dimensions prudently

Slower than regular dimensions, but faster than calculations Intermediate measure group must be small relative to primary measure group Consider creating aggregations on the shared common attributes of the intermediate measure group
11
Measure Groups
Common questions
At what point do you split from a single cube and create one or more additional cubes? How many is too many?
Why is this important?

New measure groups adding new dimensions result in an expansion of the cube space Larger calculation space = more work for the engine when evaluating calculations
Guidance
Look at increase in dimensionality. If significant, and overlap with other measure groups is minimal, consider a separate cube Will users want to analyze measures together? Will calculations need to reference unified measures collection?
12
AMO Design Warnings

~60 best practice rules integrated via real-time designer checks Think of it as auto BPA while you develop Subtle
Blue squiggly lines and build time warnings No pop-ups to get in your way
Dismissible:
By instance or globally Can specify comment in each case
13
Agenda
14
Partitioning
Mechanism to break up large cubes into manageable chunks Measure groups can be partitioned, but not dimensions Measure group can have one or more partitions Fact rows are distributed across partitions as per partitioning scheme Partitioning scheme is managed by DBA (server is unaware) Examples
By Time: Sales for 2001, 2002, 2003, By Geography: Sales for North America, Europe, Asia,
15
Benefits Of Partitioning
Partitions can be added, processed, deleted independently
Update to last months data does not affect prior months partitions Sliding window scenario easy to implement e.g., 24 month window add June 2006 partition and delete June 2004
Partitions can have different storage settings

Storage mode (MOLAP, ROLAP, HOLAP) Aggregation design Alternate disk drive Remote server
16
Benefits Of Partitioning
Partitions can be processed and queried in parallel
Better utilization of server resources Reduced data warehouse load times
Queries are isolated to relevant partitions less data to scan

SELECT FROM WHERE *Time+.*Year+.*2006+ Queries only 2006 partitions
Bottom line partitions enable

Manageability Performance Scalability
17
Best Practices For Partitioning

General guidance: 20M rows per partition
Use judgment, e.g., perhaps better to have 500 partitions with 40 million rows than 1000 20 million row partitions Standard tools unable to manage thousands of partitions
More partitions means more files

E.g. one 10GB cube with ~250,000 files (design issues) Deletion of database took ~25min to complete
Partition by time plus another dimension e.g. Geography

Limits amount of reprocessing Use query patterns to pick another partitioning attribute
When data changes

All data cache for the measure group is discarded Separate cube or measure groups by static and real-time analysis
18
Best Practices For Partitioning

Equal sized partitions
Equal Sized Partitions 11 20
21 30 31 40 41 50 January 2008
Not Equal Sized Partitions 11 15 16 20 21 25

26 50
19
Partition Slices
Defining partition slices allows server to avoid scanning partitions
E.g. Partition { [January 2008] } not scanned for query to [February 2008]
Simple partition slices are automatically detected for MOLAP partitions

You have to explicitly specify slices for ROLAP partitions Non-trivial slices (e.g. a partition with data for more than one month) will not be automatically set you can still set them explicitly
What if I get the error The slice specified for attribute_name attribute is incorrect"
MOLAP processing validates partition data against the defined slice Error raised if data received from source does not match slice specified E.g. Slice = [January] but partition query incorrectly defined to be SELECT FROM facts WHERE Month = February
20
Agenda
21
What Is An Aggregation?
A subtotal of partition data based on a set of attributes from each dimension
Highest-Level Aggregation
Customer All Product All Units Sold 347814123 Sales $345,212,301.30
Customers
All Customers Country State City Name
Products
All Products Category Brand Item SKU
Intermediate Aggregation
countryCode Can US productID sd452 yu678 Units Sold 9456 4623 Sales $23,914.30 $57,931.45
Facts
custID 345-23 563-01 SKU 135123 451236 Units Sold 2 34 Sales $45.67 $67.32
22
Aggregations Help Query Performance

Customers Products
Time
All Customers Country State City Name

Query levels
(All, All, All) (Country, Item, Quarter) (Country, Brand, Quarter) (Country, Category, All) (State, Item, Quarter) (City, Category, Year)
All Products Category Brand Item SKU

Aggregation used
All Time Year Quarter Month Day

Max Cells
Using a higher-level aggregation means fewer cells to consider

23
(All, All, All) 1 (Country, Item, Quarter) 274,356 (Country, Item, Quarter) 274,356 (Country, Item, Quarter) 274,356 (Name, SKU, Day) 34,264,872,495 (Name, SKU, Day) 34,264,872,495
Best Practices For Aggregations

Define all possible attribute relationships Set accurate attribute member counts and fact table counts Set AggregationUsage to guide agg designer
Set rarely queried attributes to None Set commonly queried attributes to Unrestricted
Do not build too many aggregations

In the 100s, not 1000s!
Do not build aggregations larger than 30% of fact table size (aggregation design algorithm doesnt)
24
Best Practices For Aggregations

Aggregation design cycle
Use Storage Design Wizard (~20% perf gain) to design initial set of aggregations Enable query log and run pilot workload (beta test with limited set of users) Use Usage Based Optimization (UBO) Wizard to refine aggregations Periodically use UBO to refine aggregations
25
Agenda
26
Scalable Shared Databases

Ask/Need Today's Problem
Easy way of scaling out AS data cross multiple machines. While MOLAP cubes are Read-Only databases, no two servers are share same data directory. Cube Sync works but has latency issues which are not acceptable in load balanced solutions.
AS 2008 Solution
Single read-only copy database is shared between several Analysis Servers.

Virtual IP
Analysis Server
Analysis Server
SAN storage
.. .
Analysis Server
27
Scalable Shared Databases In Practice

Clients
Load Balancer NLB, F5, Custom ASP.NET
Processing Server
Query Servers
Cube Processing Detach & Attach 28
Read Only DB on Shared SAN Drive
Enabling The SSD Scenario

DBStorageLocation
Ability to store database outside server data folder SAN drive, NAS network share, flash/SSD
Attach/Detach
Ability to attach/detach a database Attach from any location Attach as read-only or read-write Multiple instances can attach as read-only (shared) Only one instance can attach as read-write (exclusive)
Read Only Enforcement

Disallow all update operations (processing, writeback, restore, etc.) Disable lazy processing, proactive caching Allow loading database from read-only media
29
30

Cube Best Practices

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cube Best Practices

Uploaded by

Copyright:

Available Formats

Designing High Performance Cubes In Analysis Services 2008

Esendal Yasin Lead Architect Microsoft MEA HQ

Rigid versus flexible relationships

All attributes implicitly related to key attribute

Aggregation designer favors user defined hierarchies

Important to tune your SQL statements

Size limitations for string stores and effect on dimension size

Set Materialized = true on reference dimensions Use many-to-many dimensions prudently

Why is this important?

AMO Design Warnings

Partitions can have different storage settings

Queries are isolated to relevant partitions less data to scan

Bottom line partitions enable

Best Practices For Partitioning

More partitions means more files

Partition by time plus another dimension e.g. Geography

When data changes

Best Practices For Partitioning

Not Equal Sized Partitions 11 15 16 20 21 25

Simple partition slices are automatically detected for MOLAP partitions

Aggregations Help Query Performance

All Customers Country State City Name

All Products Category Brand Item SKU

All Time Year Quarter Month Day

Using a higher-level aggregation means fewer cells to consider

Best Practices For Aggregations

Do not build too many aggregations

Best Practices For Aggregations

Scalable Shared Databases

Single read-only copy database is shared between several Analysis Servers.

Scalable Shared Databases In Practice

Load Balancer NLB, F5, Custom ASP.NET

Cube Processing Detach & Attach 28

Read Only DB on Shared SAN Drive

Enabling The SSD Scenario

Read Only Enforcement

You might also like