
BigData

OOZIE

What is OOZIE?
Apache Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines
multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and
supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie is a
workflow scheduler system to manage Apache Hadoop jobs. Oozie workflow jobs are Directed
Acyclic Graphs (DAGs) of actions. Oozie coordinator jobs are recurrent Oozie workflow jobs
triggered by time frequency and data availability.
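A workflow DAG like the one described above is defined in an XML file. As a minimal, hypothetical sketch (the application name, action name, and mapper class are placeholders, and `${jobTracker}`/`${nameNode}` would be supplied at submission time):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-action"/>
    <action name="mr-action">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.example.DemoMapper</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Every action declares both an `ok` and an `error` transition, which is how the DAG encodes success and failure paths.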

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of
the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as
system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and
extensible system. After completing this chapter you will be able to understand:

- What is Oozie
- Features of Oozie
- Oozie components
- Basic types of Oozie jobs
- How Oozie works
- Different ways to interface with Oozie from the command line
- Pig and Hive operations through Oozie

Prerequisites: a basic understanding of Java or any scripting language, the basics of Big Data & Hadoop, and the basics of XML.

Apache Hadoop – Reliable, scalable distributed storage and computing

Apache Hive - SQL-like language and metadata repository

Apache Pig – High-level language for expressing data analysis programs

Apache HBase – Hadoop database for random, real-time read/write access

Apache Zookeeper – Highly-reliable distributed coordination service

Apache Whirr – Library for running Hadoop in the cloud

Apache Flume – Distributed service for collecting and aggregating log and event data

Apache Sqoop – Integrating Hadoop with RDBMS

Hue – Browser-based desktop interface for interacting with Hadoop

Oozie – Server-based workflow engine for Hadoop activities


[Diagram: Hadoop ecosystem stack – UI framework and SDK (Hue, Hue SDK); workflow scheduling (Oozie); metadata (Hive); data integration (Flume, Sqoop); languages/compilers (Pig, Hive); fast read/write access (HBase); coordination (ZooKeeper).]

Oozie Features
 Major flexibility – start, stop, suspend and re-run jobs
 Allows you to restart from a failure – skip the failed nodes
 Java client API / command line interface – launch, control and monitor jobs from your Java apps
 Web service API
 Run periodic jobs – jobs that need to run every hour, day or week
 Receive an email when a job is complete
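The start/stop/suspend/re-run flexibility above maps onto the `oozie job` command line. As a sketch (the server URL `http://oozie-host:11000/oozie` and the job ID are placeholders; these commands require a running Oozie server):

```shell
# Submit and start a workflow job (job.properties points at the app on HDFS)
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# Suspend, resume, and re-run an existing job by its ID
oozie job -oozie http://oozie-host:11000/oozie -suspend 0000001-123456789012345-oozie-W
oozie job -oozie http://oozie-host:11000/oozie -resume  0000001-123456789012345-oozie-W
oozie job -oozie http://oozie-host:11000/oozie -rerun   0000001-123456789012345-oozie-W -config job.properties

# Kill a job, or check its status
oozie job -oozie http://oozie-host:11000/oozie -kill 0000001-123456789012345-oozie-W
oozie job -oozie http://oozie-host:11000/oozie -info 0000001-123456789012345-oozie-W
```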

Oozie consists of action and control nodes.

Control nodes define job chronology, setting rules for beginning and ending a workflow, and control
the workflow execution path with decision, fork and join nodes. Action nodes trigger the
execution of tasks.

Control Flow
- Start, end, kill
- Decision
- Fork, join
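The control nodes listed above appear directly in the workflow XML. A hypothetical fragment (node names, the `inputDir` parameter, and the downstream action names are placeholders):

```xml
<start to="check-input"/>
<decision name="check-input">
    <switch>
        <case to="parallel-work">${fs:exists(inputDir)}</case>
        <default to="fail"/>
    </switch>
</decision>
<fork name="parallel-work">
    <path start="action-a"/>
    <path start="action-b"/>
</fork>
<!-- actions "action-a" and "action-b" would both transition to the join -->
<join name="merge" to="end"/>
<kill name="fail">
    <message>Input not found</message>
</kill>
<end name="end"/>
```

The fork launches its paths in parallel, and the join waits until all of them reach it before continuing.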


Actions
- Map-reduce
- Java
- Pig
- HDFS
 There are two basic types of Oozie jobs:
- Oozie workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions
to execute. The workflow job waits for the current action to finish before moving to the next one.
- Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and
data availability.
 Oozie bundles provide a way to package multiple coordinator and workflow jobs and to manage
the lifecycle of those jobs.
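A coordinator job wraps a workflow in a schedule. As a minimal, hypothetical sketch that runs a workflow once a day (the application name, HDFS path, and dates are placeholders):

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>hdfs://namenode/user/demo/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```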

How does Oozie work?


 Oozie triggers workflow actions, but Hadoop MapReduce executes them. This allows Oozie to
leverage other capabilities within the Hadoop stack to balance loads and handle failures.
 Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it
provides a unique callback HTTP URL to the task, and the task notifies that URL when it is complete.
If the task fails to invoke the callback URL, Oozie can poll the task for completion.
 Often it is necessary to run Oozie workflows on regular time intervals, but in coordination with
unpredictable levels of data availability or events. In these circumstances, the Oozie coordinator
allows you to model workflow execution triggers in the form of data, time or event
predicates. The workflow job is started after those predicates are satisfied.
 The Oozie coordinator can also manage multiple workflows that depend on the outcome of
preceding workflows: the output of one workflow becomes the input to the next
workflow. This chain is called a "data application pipeline".
 Oozie can be interfaced in two ways: through the command line and through the web services API, as shown below.
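The data-availability predicates mentioned above are expressed in the coordinator definition as datasets and input events. A hypothetical sketch (dataset name, path template and dates are placeholders):

```xml
<datasets>
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2024-01-01T00:00Z" timezone="UTC">
        <uri-template>hdfs://namenode/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
</datasets>
<input-events>
    <data-in name="input" dataset="logs">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>
```

The coordinator action only starts once the dataset instance it depends on actually exists on HDFS.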

[Diagram: Oozie architecture – clients reach Oozie through the web services API or the command line; the Oozie server (running in Tomcat, with the Oozie UI) stores its state in a database and submits work to Hadoop, Pig and Hive.]


How to run Oozie from the command line?

1. Create an application directory structure with workflow definitions and resources


- workflow.xml, jars, etc.
2. Copy the application directory to HDFS
3. Create an application configuration file
- Specify the location of the application directory on HDFS
- Specify the location of the name node and resource manager
4. Submit the workflow to Oozie
- Use the oozie command line
5. Monitor the running workflow(s)
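The five steps above can be sketched as shell commands (all paths, host names and the job ID are placeholders, and a running Hadoop cluster and Oozie server are assumed):

```shell
# 1-2. Copy the application directory (workflow.xml, lib/ jars) to HDFS
hdfs dfs -put demo-wf /user/demo/apps/demo-wf

# 3. Create job.properties pointing at the cluster and the app directory
cat > job.properties <<'EOF'
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/demo/apps/demo-wf
EOF

# 4. Submit and start the workflow
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# 5. Monitor it, using the job ID printed by the previous command
oozie job -oozie http://oozie-host:11000/oozie -info 0000001-123456789012345-oozie-W
```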

Find Max Letter – Example

[Diagram: workflow – Start → Count Each Letter (Map-Reduce) –OK→ Find Max Letter (Map-Reduce) –OK→ Clean Up –OK→ End; an Error transition from any action leads to Kill.]

 Count letters in a text file separated by lines and spaces


 Basic idea:
- Load the file using a loader
- Count each letter; on error, kill the job
- Group by each letter
- Find the max in a group; on error, kill the job
- Store to a file and clean up
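The steps above could be expressed as a workflow of two chained map-reduce actions plus a clean-up step. This is a hypothetical sketch: the node names are taken from the diagram, the mapper/reducer configuration is elided, and the temporary path is a placeholder:

```xml
<workflow-app name="find-max-letter" xmlns="uri:oozie:workflow:0.5">
    <start to="count-each-letter"/>
    <action name="count-each-letter">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer classes for the letter count would be configured here -->
        </map-reduce>
        <ok to="find-max-letter"/>
        <error to="kill"/>
    </action>
    <action name="find-max-letter">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer classes for the max step would be configured here -->
        </map-reduce>
        <ok to="clean-up"/>
        <error to="kill"/>
    </action>
    <action name="clean-up">
        <fs>
            <delete path="${nameNode}/tmp/find-max-letter"/>
        </fs>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Find Max Letter failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```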


Appendix

[Diagram: workflow – Start → RunHiveScript –OK→ RunSqoopExport –OK→ End; an Error transition from either action leads to Kill.]
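The Hive-to-Sqoop chain in the diagram above could be sketched as two actions (hypothetically; the script name, JDBC URL, table and export directory are placeholders):

```xml
<action name="RunHiveScript">
    <hive xmlns="uri:oozie:hive-action:0.5">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>export_prep.hql</script>
    </hive>
    <ok to="RunSqoopExport"/>
    <error to="Kill"/>
</action>
<action name="RunSqoopExport">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>export --connect jdbc:mysql://db-host/demo --table results --export-dir /user/demo/results</command>
    </sqoop>
    <ok to="End"/>
    <error to="Kill"/>
</action>
```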

[Diagram: workflow – Start → Pig → Decision; a Fork launches MR1 and MR2 map-reduce jobs in parallel alongside a Java action; a Join collects them, followed by HDFS actions → End.]
