
BigData

OOZIE

What is OOZIE?
Apache Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines
multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and
supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie is a
workflow scheduler system to manage Apache Hadoop jobs. Oozie workflow jobs are Directed
Acyclic Graphs (DAGs) of actions. Oozie coordinator jobs are recurrent Oozie workflow jobs
triggered by time frequency and data availability.
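A workflow DAG like the one described above is defined in an XML file. As a minimal, hypothetical sketch (the application name, action name, and mapper class are placeholders, and `${jobTracker}`/`${nameNode}` would be supplied at submission time):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-action"/>
    <action name="mr-action">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.example.DemoMapper</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Every action declares both an `ok` and an `error` transition, which is how the DAG encodes success and failure paths.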

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of
the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as
system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and
extensible system. After completing this chapter you will be able to understand:

- What is Oozie
- Features of Oozie
- Oozie components
- Basic types of Oozie jobs
- How Oozie works
- Different ways to interface with Oozie from the command line
- Pig and Hive operations through Oozie

Prerequisites: a basic understanding of Java or any scripting language, the basics of Big Data & Hadoop, and the basics of XML.

Apache Hadoop – Reliable, scalable distributed storage and computing

Apache Hive - SQL-like language and metadata repository

Apache Pig – High-level language for expressing data analysis programs

Apache HBase – Hadoop database for random, real-time read/write access

Apache Zookeeper – Highly-reliable distributed coordination service

Apache Whirr – Library for running Hadoop in the cloud

Apache Flume – Distributed service for collecting and aggregating log and event data

Apache Sqoop – Integrating Hadoop with RDBMS

Hue – Browser-based desktop interface for interacting with Hadoop

Oozie – Server-based workflow engine for Hadoop activities


[Diagram: Hadoop ecosystem stack – UI framework and SDK (Hue, Hue SDK); workflow scheduling (Oozie); metadata (Hive); data integration (Flume, Sqoop); languages/compilers (Pig, Hive); fast read/write access (HBase); coordination (ZooKeeper).]

Oozie Features
 Major flexibility – start, stop, suspend and re-run jobs
 Allows you to restart from a failure – skip the failed nodes
 Java client API / command line interface – launch, control and monitor jobs from your Java apps
 Web service API
 Run periodic jobs – jobs that need to run every hour, day or week
 Receive an email when a job is complete
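The start/stop/suspend/re-run flexibility above maps onto the `oozie job` command line. As a sketch (the server URL `http://oozie-host:11000/oozie` and the job ID are placeholders; these commands require a running Oozie server):

```shell
# Submit and start a workflow job (job.properties points at the app on HDFS)
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# Suspend, resume, and re-run an existing job by its ID
oozie job -oozie http://oozie-host:11000/oozie -suspend 0000001-123456789012345-oozie-W
oozie job -oozie http://oozie-host:11000/oozie -resume  0000001-123456789012345-oozie-W
oozie job -oozie http://oozie-host:11000/oozie -rerun   0000001-123456789012345-oozie-W -config job.properties

# Kill a job, or check its status
oozie job -oozie http://oozie-host:11000/oozie -kill 0000001-123456789012345-oozie-W
oozie job -oozie http://oozie-host:11000/oozie -info 0000001-123456789012345-oozie-W
```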

Oozie consists of action and control nodes.

Control nodes define job chronology, setting rules for beginning and ending a workflow, and control
the workflow execution path with decision, fork and join nodes. Action nodes trigger the
execution of tasks.

Control Flow
- Start, end, kill
- Decision
- Fork, join
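The control nodes listed above appear directly in the workflow XML. A hypothetical fragment (node names, the `inputDir` parameter, and the downstream action names are placeholders):

```xml
<start to="check-input"/>
<decision name="check-input">
    <switch>
        <case to="parallel-work">${fs:exists(inputDir)}</case>
        <default to="fail"/>
    </switch>
</decision>
<fork name="parallel-work">
    <path start="action-a"/>
    <path start="action-b"/>
</fork>
<!-- actions "action-a" and "action-b" would both transition to the join -->
<join name="merge" to="end"/>
<kill name="fail">
    <message>Input not found</message>
</kill>
<end name="end"/>
```

The fork launches its paths in parallel, and the join waits until all of them reach it before continuing.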


Actions
- Map-reduce
- Java
- Pig
- HDFS
 There are two basic types of Oozie jobs:
- Oozie workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions
to execute. The workflow job waits for the current action to finish before moving to the next one.
- Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and
data availability.
 Oozie bundles provide a way to package multiple coordinator and workflow jobs and to manage
the lifecycle of those jobs.
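A coordinator job wraps a workflow in a schedule. As a minimal, hypothetical sketch that runs a workflow once a day (the application name, HDFS path, and dates are placeholders):

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>hdfs://namenode/user/demo/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```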

How does Oozie work?


 Oozie triggers workflow actions, but Hadoop MapReduce executes them. This allows Oozie to
leverage other capabilities within the Hadoop stack to balance loads and handle failures.
 Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it
provides a unique callback HTTP URL to the task, and the task notifies that URL when it is complete.
If the task fails to invoke the callback URL, Oozie can poll the task for completion.
 Often it is necessary to run Oozie workflows on regular time intervals, but in coordination with
unpredictable levels of data availability or events. In these circumstances, the Oozie coordinator
allows you to model workflow execution triggers in the form of data, time or event
predicates. The workflow job is started after those predicates are satisfied.
 The Oozie coordinator can also manage multiple workflows that depend on the outcome of
preceding workflows: the output of one workflow becomes the input to the next
workflow. This chain is called a "data application pipeline".
 Oozie can be interfaced in two ways: through the command line and through the web services API, as shown below.
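The data-availability predicates mentioned above are expressed in the coordinator definition as datasets and input events. A hypothetical sketch (dataset name, path template and dates are placeholders):

```xml
<datasets>
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2024-01-01T00:00Z" timezone="UTC">
        <uri-template>hdfs://namenode/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
</datasets>
<input-events>
    <data-in name="input" dataset="logs">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>
```

The coordinator action only starts once the dataset instance it depends on actually exists on HDFS.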

[Diagram: Oozie architecture – clients reach Oozie through the web services API or the command line; the Oozie server (running in Tomcat, with the Oozie UI) stores its state in a database and submits work to Hadoop, Pig and Hive.]


How to run Oozie from the command line?

1. Create an application directory structure with workflow definitions and resources


- workflow.xml, jars, etc.
2. Copy the application directory to HDFS
3. Create an application configuration file
- Specify the location of the application directory on HDFS
- Specify the location of the name node and resource manager
4. Submit the workflow to Oozie
- Use the oozie command line
5. Monitor the running workflow(s)
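The five steps above can be sketched as shell commands (all paths, host names and the job ID are placeholders, and a running Hadoop cluster and Oozie server are assumed):

```shell
# 1-2. Copy the application directory (workflow.xml, lib/ jars) to HDFS
hdfs dfs -put demo-wf /user/demo/apps/demo-wf

# 3. Create job.properties pointing at the cluster and the app directory
cat > job.properties <<'EOF'
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/demo/apps/demo-wf
EOF

# 4. Submit and start the workflow
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# 5. Monitor it, using the job ID printed by the previous command
oozie job -oozie http://oozie-host:11000/oozie -info 0000001-123456789012345-oozie-W
```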

Find Max Letter – Example

[Diagram: workflow – Start → Count Each Letter (Map-Reduce) –OK→ Find Max Letter (Map-Reduce) –OK→ Clean Up –OK→ End; an Error transition from any action leads to Kill.]

 Count letters in a text file separated by lines and spaces


 Basic idea:
- Load the file using a loader
- Count each letter; on error, kill the job
- Group by each letter
- Find the max in a group; on error, kill the job
- Store to a file and clean up
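The steps above could be expressed as a workflow of two chained map-reduce actions plus a clean-up step. This is a hypothetical sketch: the node names are taken from the diagram, the mapper/reducer configuration is elided, and the temporary path is a placeholder:

```xml
<workflow-app name="find-max-letter" xmlns="uri:oozie:workflow:0.5">
    <start to="count-each-letter"/>
    <action name="count-each-letter">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer classes for the letter count would be configured here -->
        </map-reduce>
        <ok to="find-max-letter"/>
        <error to="kill"/>
    </action>
    <action name="find-max-letter">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer classes for the max step would be configured here -->
        </map-reduce>
        <ok to="clean-up"/>
        <error to="kill"/>
    </action>
    <action name="clean-up">
        <fs>
            <delete path="${nameNode}/tmp/find-max-letter"/>
        </fs>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Find Max Letter failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```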


Appendix

[Diagram: workflow – Start → RunHiveScript –OK→ RunSqoopExport –OK→ End; an Error transition from either action leads to Kill.]
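The Hive-to-Sqoop chain in the diagram above could be sketched as two actions (hypothetically; the script name, JDBC URL, table and export directory are placeholders):

```xml
<action name="RunHiveScript">
    <hive xmlns="uri:oozie:hive-action:0.5">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>export_prep.hql</script>
    </hive>
    <ok to="RunSqoopExport"/>
    <error to="Kill"/>
</action>
<action name="RunSqoopExport">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>export --connect jdbc:mysql://db-host/demo --table results --export-dir /user/demo/results</command>
    </sqoop>
    <ok to="End"/>
    <error to="Kill"/>
</action>
```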

[Diagram: workflow – Start → Pig → Decision; a Fork launches MR1 and MR2 map-reduce jobs in parallel alongside a Java action; a Join collects them, followed by HDFS actions → End.]
