
#PCM14

Pentaho Data Integration


Best Practices

Matt Casters
Chief Data Integration, Kettle Founder
About this session
Today You Will Learn
❯Best practices when working with Pentaho Data Integration
Agenda
❯Working with PDI
❯Design patterns
❯Demo fun stuff

2 © 2014, Pentaho. All Rights Reserved. #PCM14


Working with PDI



Working with PDI: Naming
❯Provide meaningful names for steps and job entries
❯Do not hesitate to use special characters



Working with PDI: Naming
❯Avoid environment-specific names
• "Test Database" → "CRM"
• "France MySQL" → "WWW"
• "East Coast Cluster" → "Cluster"



Working with PDI: Naming
❯Keep your environment tidy
• Folders can have sub-folders!
❯Use naming conventions for everything
• Database tables and fields
• Directories
• Server names



Working with PDI: Naming
❯Use a corporate standard
❯Verify and enforce periodically
❯Use rules to validate repository imports
• Database names
• Notes and descriptions
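
The kind of check such rules perform can be sketched in a few lines of Python. This is only an illustration of the idea, not PDI's import rules format: the `validate` helper, the dictionary layout and the list of environment words are all hypothetical.

```python
import re

# Hypothetical list of words that tie a name to one environment.
ENVIRONMENT_WORDS = re.compile(r"\b(dev|test|uat|prod|production)\b", re.IGNORECASE)

def validate(item):
    """Return a list of rule violations for one job or transformation."""
    problems = []
    for name in item["database_names"]:
        if ENVIRONMENT_WORDS.search(name):
            problems.append(f"environment-specific database name: {name}")
    if not item.get("description", "").strip():
        problems.append("missing description")
    return problems

# Hypothetical metadata extracted from two transformations.
bad = {"database_names": ["Test Database"], "description": ""}
good = {"database_names": ["CRM"], "description": "Loads customer dimension"}

bad_problems = validate(bad)
good_problems = validate(good)
```

Running a check like this on every import, and again periodically, is one way to keep the corporate standard enforced rather than merely documented.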



Working with PDI: Naming
❯Use a corporate standards
❯Verify and enforce periodically
❯Use rules to validate repository imports
• Database names
• Notes and descriptions

8 © 2014, Pentaho. All Rights Reserved. #PCM14


Working with PDI: Tidy up!
❯Limit the number of steps or job entries



Working with PDI: Tidy up!
❯Enable grid size 32 or 16
❯Prevents accidental move of step or entry



Working with PDI: Parameters
❯Explicit use of variables
❯Easier testing
❯Make re-use a breeze



Working with PDI: Variables
❯Environment-specific: kettle.properties
❯Prefer ${SOLUTION_HOME}
❯Avoid ${Internal.Transformation.Filename.Directory}
❯Configure step copies with variables
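
A minimal sketch of why a `${SOLUTION_HOME}`-style variable keeps paths environment-independent. This is illustrative Python, not PDI's own variable resolver; the property values, the `STAGING_COPIES` name and the `load_properties` and `resolve` helpers are assumptions made for the example.

```python
import re

# Hypothetical contents of a kettle.properties file on one environment.
PROPERTIES_TEXT = """
# environment-specific settings
SOLUTION_HOME=/opt/etl/solution
STAGING_COPIES=4
"""

def load_properties(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def resolve(template, props):
    """Replace each ${NAME} with its value; leave unknown names untouched."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: props.get(m.group(1), m.group(0)),
                  template)

props = load_properties(PROPERTIES_TEXT)
path = resolve("${SOLUTION_HOME}/staging/load_customers.ktr", props)
copies = int(resolve("${STAGING_COPIES}", props))
```

Only the properties file differs between environments; the path template and the step-copies setting stay the same everywhere.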



Working with PDI: Logging
❯Log everything!
❯Measurement is management
❯Use the Pentaho audit mart
❯Learn about all the possible logging features



Working with PDI: Mappings
❯Mapping vs Simple Mapping step
❯Realize this is a macro
❯Use completely different field names
❯Avoid renaming or removing fields



Working with PDI: Metadata Injection
❯Avoid manual population of dialogs
❯Whenever you need dynamic ETL
❯5.1 supports data streaming
❯Example:
• stage 50 different files with one transformation



Working with PDI: Performance
❯Transformations are networks
❯Network speed is limited to the slowest part
❯The slowest step is indicated while running in Spoon
❯Slow steps have a full input and empty output buffer
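
The buffer rule above can be sketched as a small heuristic: the bottleneck is the step whose input buffer is fullest relative to its output buffer. The `StepStats` type, the fill fractions and the step names below are hypothetical; in practice Spoon shows this information while the transformation runs.

```python
from dataclasses import dataclass

@dataclass
class StepStats:
    name: str
    input_fill: float   # fraction of the input buffer in use, 0.0 to 1.0
    output_fill: float  # fraction of the output buffer in use, 0.0 to 1.0

def find_bottleneck(steps):
    """Return the step whose input is fullest relative to its output."""
    return max(steps, key=lambda s: s.input_fill - s.output_fill)

# Hypothetical snapshot of a running transformation.
snapshot = [
    StepStats("Read CSV", 0.0, 1.0),          # blocked: downstream is slow
    StepStats("Lookup customer", 0.98, 0.02), # full input, empty output
    StepStats("Table output", 0.03, 0.0),     # starved: upstream is slow
]
bottleneck = find_bottleneck(snapshot)
```

Here the lookup step is the slowest part of the network, so tuning the reader or the writer would not make the transformation any faster.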



Working with PDI: Performance
❯First re-write, re-think, re-organize
❯Parallelize work
❯End-to-end data pipelining
❯Do work where it's the fastest (ELT)



Working with PDI: Lifecycle
❯Automate export of repositories
❯Use import rules to validate quality
❯Always use version control for file-based setup



Design patterns



Design patterns: Loops
❯Use the job or transformation executor steps
❯Much easier since version 5
❯Demonstration: process lots of small files



DEMONSTRATION



Design patterns: Queues
❯Process buffer
❯Facilitates parallelism
❯Forces process logging best practice
❯Only way to process recurring files
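
One possible shape for such a queue is a database table that records every unit of work and its status. The sketch below uses Python with an in-memory SQLite database purely for illustration; the `file_queue` table, its columns, the status values and the helper functions are assumptions, not part of PDI.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        filename TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING'
    )
""")

def enqueue(conn, filename):
    """Register one arrival of a file as a new queue entry."""
    conn.execute("INSERT INTO file_queue (filename) VALUES (?)", (filename,))
    conn.commit()

def claim_next(conn):
    """Mark the oldest pending entry RUNNING and return (id, filename), or None."""
    row = conn.execute(
        "SELECT id, filename FROM file_queue "
        "WHERE status = 'PENDING' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE file_queue SET status = 'RUNNING' WHERE id = ?", (row[0],))
    conn.commit()
    return row

def finish(conn, entry_id, ok):
    """Record the outcome, which doubles as the process log."""
    conn.execute("UPDATE file_queue SET status = ? WHERE id = ?",
                 ("DONE" if ok else "FAILED", entry_id))
    conn.commit()

# The same file name arriving twice gets two separate queue entries.
enqueue(conn, "orders.csv")
enqueue(conn, "orders.csv")
first = claim_next(conn)
finish(conn, first[0], ok=True)
second = claim_next(conn)
statuses = [r[0] for r in conn.execute("SELECT status FROM file_queue ORDER BY id")]
```

Because each arrival is its own row, a recurring file name is processed once per arrival. With several workers, the select-then-update in `claim_next` would need to be made atomic; this single-connection sketch does not attempt that.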



Design patterns: Load balancing
❯On a step level: PDI EE 5.x built-in
❯Balance jobs and transformations with Carte
❯Set up a Carte cluster
❯Use a queue
❯Interrogate Carte web services



DEMONSTRATION



Design patterns: Watchdog
❯“Who watches the watchmen?”
❯Simple recipe:
• On success increment a counter
• Periodically verify that the counter is advancing
• Take action when counter is not advancing
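
The recipe above can be sketched in Python as follows. The `Watchdog` class, the in-memory counter and the `alerts` list are hypothetical stand-ins; in a real setup the counter would live somewhere both jobs can reach (a table or a file, for instance) and the action might be an e-mail or a restart.

```python
class Watchdog:
    """Checks that a success counter keeps advancing between checks."""

    def __init__(self, read_counter, on_stall):
        self.read_counter = read_counter  # callable returning the current count
        self.on_stall = on_stall          # callable invoked when nothing advanced
        self.last_seen = read_counter()

    def check(self):
        """Return True if the counter advanced since the previous check."""
        current = self.read_counter()
        advanced = current > self.last_seen
        if not advanced:
            self.on_stall(current)
        self.last_seen = current
        return advanced

counter = {"value": 0}
alerts = []

def main_job_succeeded():
    # On success, the main job increments the counter.
    counter["value"] += 1

watchdog = Watchdog(lambda: counter["value"], alerts.append)
main_job_succeeded()
first_check = watchdog.check()   # counter moved from 0 to 1
second_check = watchdog.check()  # no success since the last check
```

The watchdog never inspects the main job itself; it only observes the side effect of success, so it also catches a job that has silently stopped being scheduled.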



Design patterns: Watchdog
❯Schema

[Diagram: Job, Counter, Action. Main Job: on success, increase the counter. Watchdog: validate that the counter increased; take action when the counter has not increased.]



Design patterns: Auto-recovery
❯Auto-skip:
• use anything incremental
• Add incremental ID to source tables if missing
❯Auto-cleanup:
• increment run ID after successful job
• Remove run ID from target table at start of job
❯Database recovery:
• Job and transformation level transactions (PDI 5 EE)
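
The auto-cleanup recipe can be sketched as below, using Python and an in-memory SQLite database for illustration only. The `run_control` and `target` tables, the `run_job` function and the simulated failure are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run_control (run_id INTEGER NOT NULL)")
conn.execute("INSERT INTO run_control VALUES (1)")
conn.execute("CREATE TABLE target (run_id INTEGER NOT NULL, payload TEXT)")
conn.commit()

def run_job(conn, rows, fail_after=None):
    run_id = conn.execute("SELECT run_id FROM run_control").fetchone()[0]
    # Auto-cleanup: remove anything a previous failed attempt left behind.
    conn.execute("DELETE FROM target WHERE run_id = ?", (run_id,))
    conn.commit()
    for i, payload in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated failure mid-load")
        conn.execute("INSERT INTO target VALUES (?, ?)", (run_id, payload))
        conn.commit()  # commit per row so a failure leaves partial data
    # The run ID only advances after the whole job succeeded.
    conn.execute("UPDATE run_control SET run_id = run_id + 1")
    conn.commit()

try:
    run_job(conn, ["a", "b", "c"], fail_after=2)
except RuntimeError:
    pass
partial = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

run_job(conn, ["a", "b", "c"])  # the rerun cleans up, then loads everything
final = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
next_run_id = conn.execute("SELECT run_id FROM run_control").fetchone()[0]
```

After the failed attempt the target holds two stray rows; the rerun removes them before loading, so the job can simply be restarted without manual repair.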



Fun stuff...
a short demo to wrap up



Summary
To take away:
❯Best practices improve quality, simplify, save time
❯Use this presentation as a checklist
❯Do regular audits of your data integration work

