
What is Pig?

Pig provides an engine for executing data flows in parallel on Hadoop. It is a scripting platform for exploring huge data sets, gigabytes or terabytes in size, very easily. Pig, a parallel dataflow language, started out as a research project at Yahoo! and saw its first Apache release in September 2008. It is now an Apache open source project.
Pig Latin scripting
Data flow language (it eliminates traditional procedural/OOP control flow)
Pig is built on top of MapReduce
A Pig job is submitted to the JobTracker, which converts it into map and reduce tasks
In Pig, every statement operates on the output of the previous statement, which is why it is called a data flow language (see the sketch below).
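A minimal sketch of this dataflow style, using a hypothetical web log file: each statement consumes the relation produced by the one before it.

logs    = LOAD 'web_logs' AS (user:chararray, url:chararray, time:long);  -- read input
grouped = GROUP logs BY user;                                             -- depends on logs
counts  = FOREACH grouped GENERATE group AS user, COUNT(logs) AS hits;    -- depends on grouped
STORE counts INTO 'hits_per_user';                                        -- write output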

What is Pig Latin?

Pig Latin is a data flow scripting language, like Perl, for exploring large data sets. A
Pig Latin program is made up of a series of operations, or transformations, that are
applied to the input data to produce output.
Pig Latin is a dataflow language. This means it allows users to describe how data
from one or more inputs should be read, processed, and then stored to one or more
outputs in parallel.

What is Pig Engine?

Pig Engine is the execution environment that runs Pig Latin programs. It converts the Pig Latin operators, or transformations, into a series of MapReduce jobs.

What is GRUNT?

Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to interact with HDFS.
pig -x local
grunt>
Pig is just an interface; you use it and exit once you are finished.
You can also run HDFS commands from Grunt:
grunt> fs -ls
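A short interactive session sketch (the input file is a placeholder): Pig Latin statements and HDFS commands can be mixed at the same prompt.

pig -x local
grunt> users = LOAD 'passwd' USING PigStorage(':') AS (name:chararray);
grunt> DUMP users;
grunt> fs -ls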

Advantages of Pig

Pig is a natural fit for traditional extract-transform-load (ETL) data pipelines (for example, turning web server logs into input for a behavior prediction model), for research on raw data, and for iterative processing.

What are the advantages of using Pig over Mapreduce?

In MapReduce, the development cycle is very long. Writing mappers and reducers, compiling and packaging the code, submitting jobs, and retrieving the results is a time-consuming process.
Performing joins between data sets is very difficult.
MapReduce is low level and rigid, which leads to a great deal of custom user code that is hard to maintain and reuse.
In Pig, there is no need to compile or package code. Pig operators are converted into map and reduce tasks internally.

Pig Latin provides all of the standard data-processing operations, such as join, filter, group by, order by, and union, offering a high level of abstraction for processing large data sets (see the sketch below).
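A sketch of how compact these operations are in Pig Latin; the files, fields, and output path are hypothetical, and an equivalent MapReduce program would need far more code:

users  = LOAD 'users'  AS (id:int, name:chararray);
orders = LOAD 'orders' AS (user_id:int, amount:double);
joined = JOIN users BY id, orders BY user_id;   -- join the two data sets
big    = FILTER joined BY amount > 100.0;       -- keep large orders only
sorted = ORDER big BY amount DESC;              -- order by amount
STORE sorted INTO 'big_orders';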

What is the difference between Pig Latin and HiveQL?

Pig Latin:
Pig Latin is a procedural language
Nested relational data model
Schema is optional

HiveQL:
HiveQL is declarative
Flat relational data model
Schema is required

Comparing all three:

                   MapReduce            Pig                   Hive
Language           Compiled language    Scripting language    SQL-like query language
Abstraction        Lower level          Higher level          Higher level
Lines of code      More                 Less                  Less than MR and Pig
Development effort More                 Comparatively less    Comparatively less

What are the common features in Pig and Hive?

Both provide a high-level abstraction on top of MapReduce.
Both convert their commands internally into MapReduce jobs.
Neither supports low-latency queries, so OLAP and OLTP workloads are not supported.

Pig and SQL

1) SQL is a query language, while Pig is a data flow language.
2) Pig generates and manages intermediate result sets within the pipeline, while multi-step SQL processing requires temporary tables.
3) SQL is designed for the RDBMS environment, where data is normalized and schemas and proper constraints are enforced, while Pig is designed for the Hadoop data-processing environment, where schemas are sometimes unknown or inconsistent. Data may not be properly constrained, and it is rarely normalized.
4) Pig does not require data to be loaded into tables first. It can operate on data as soon as it is copied into HDFS, as the sketch below shows.
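A sketch of point 4 from the Grunt shell, reusing the NYSE_dividends file that appears later in these notes: the data is queryable the moment it lands in HDFS, with no table definition step.

grunt> fs -copyFromLocal NYSE_dividends NYSE_dividends
grunt> dividends = LOAD 'NYSE_dividends';
grunt> DUMP dividends;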

What is the difference between logical and physical plans?

Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.
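You can inspect these plans yourself with Pig's explain command, which prints the logical, physical, and MapReduce plans for a relation (the load statement reuses the example from the schema section below):

grunt> dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
grunt> explain dividends;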

Is Pig Latin Case Sensitive?

The names (aliases) of relations and fields are case sensitive. The names of Pig
Latin functions are case sensitive. The names of parameters and all other Pig Latin
keywords are case insensitive.
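A quick illustration (the alias names are hypothetical): A and a below are two different relations, while LOAD and load are interchangeable.

grunt> A = load 'data';   -- keywords like 'load' are case insensitive
grunt> DUMP A;            -- works: A is a defined relation
grunt> DUMP a;            -- fails: no relation named lowercase 'a'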

Pig installation

Pig does not need to be installed on the Hadoop cluster itself. Instead, install it on a machine that has access to the cluster, either a gateway/edge machine or your own desktop.
Pig is portable because it is written in Java, but the shell script that starts Pig is a bash script, so it requires a Unix environment. Hadoop, which Pig depends on even in local mode, also requires a Unix environment for its filesystem operations.
Make sure JAVA_HOME is set to the directory that contains your Java distribution.
Copy core-site.xml, hdfs-site.xml, and mapred-site.xml from a cluster node to the gateway machine, and define PIG_CLASSPATH to point at them, as sketched below.
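A minimal setup sketch; every path below is a placeholder for your own environment:

export JAVA_HOME=/usr/lib/jvm/java      # directory containing your Java distribution
export PIG_CLASSPATH=/etc/hadoop/conf   # directory holding the copied *-site.xml files
export PATH=$PATH:/opt/pig/bin          # hypothetical Pig install location
pig -version                            # verify the setup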

What are the modes of Pig Execution?

Local Mode: local execution in a single JVM; all files are read from and run against the local host and local filesystem.
MapReduce Mode: distributed execution on a Hadoop cluster; this is the default mode.
Local mode: pig -x local
MapReduce (HDFS) mode: pig -x mapreduce, or simply pig

Local Mode
Local mode is useful for prototyping and debugging your Pig Latin scripts. Some people also use it for small data when they want to apply the same processing to large data, so that their data pipeline is consistent across data of different sizes, but they do not want to waste cluster resources on small files and small jobs.
In local mode you load data from the local filesystem, not from HDFS, and vice versa in MapReduce mode.
Batch method:
cd /my/localfile
ls -l NYSE_dividends
pig_path/bin/pig -x local average_dividend.pig
cat average_dividend/part-r-00000 | head -5
(In local mode, MapReduce's LocalJobRunner runs the job and generates the logs.)
Interactive method:
pig -x local    # local mode
pig             # cluster mode
grunt>          # the Grunt shell
Cluster Mode

Batch method:
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -copyFromLocal NYSE_dividends NYSE_dividends
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig average_dividend.pig
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e cat average_dividend
Interactive method:
pig
grunt>

What options are available in Pig?

-e or -execute: Execute a single command in Pig. For example, pig -e fs -ls will list your home directory.
-h or -help: List the available command-line options.
-h properties: List the properties that Pig will use if they are set by the user.
-P or -propertyFile: Specify a property file that Pig should read.
-version: Print the version of Pig.

Properties can be passed to Pig on the command line using -D, in the same format as any Java property, for example, bin/pig -D exectype=local. When placed on the command line, these property definitions must come before any Pig-specific command-line options (such as -x local). They can also be specified in the conf/pig.properties file that is part of your Pig distribution. Finally, you can specify a separate properties file by using -P. If properties are specified on both the command line and in a properties file, the command-line specification takes precedence.
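A small illustration of that precedence; the script name is a placeholder. If conf/pig.properties contains exectype=mapreduce, the command line still wins:

bin/pig -D exectype=local myscript.pig    # runs in local mode despite the properties file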

Pig return codes

Value  Meaning
0      Success
1      Retriable failure
2      Failure
3      Partial failure (used with multiquery; see Nonlinear Data Flows)
4      IllegalArguments passed to Pig
5      IOException thrown (would usually be thrown by a UDF)
6      PigException thrown (usually means a Python UDF raised an exception)
7      ParseException thrown (can happen after parsing if variable substitution is being done)
8      Throwable thrown (an unexpected exception)
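A sketch of how a wrapper script might act on these codes (the script name is a placeholder):

pig myscript.pig
rc=$?
case $rc in
    0) echo "success" ;;
    1) echo "retriable failure, resubmitting" ;;
    *) echo "failed with return code $rc" ;;
esac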

Pig Types?
Scalar (int, long, float, double, chararray, bytearray {default}) and Complex (map, tuple, bag); a single scalar value is also called an atom.
A map in Pig is a chararray-to-data-element mapping, where that element can be any Pig type, including a complex type. It is legitimate to have a map with two keys, name and age, where the value for name is a chararray and the value for age is an int.
For example, ['name'#'bob','age'#55] will create a map with two keys, name and age. The first value is a chararray, and the second is an integer.
A tuple is a fixed-length, ordered collection of Pig data elements.
For example, ('bob', 55) describes a tuple constant with two fields, name and age.
A bag is an unordered collection of tuples.
For example, {('bob', 55), ('sally', 52), ('john', 25)}
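A sketch of declaring all three complex types in a load schema (the file and field names are hypothetical):

grunt> A = LOAD 'data' AS (m:map[], t:tuple(name:chararray, age:int), b:bag{rec:(name:chararray, age:int)});
grunt> DESCRIBE A;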

Memory Requirements of Pig Data Types

As a rule of thumb, it takes about four times as much memory as it does disk to represent the uncompressed data; for example, a 1 GB file on disk can need roughly 4 GB of memory once materialized as Pig data.

How pig handle NULL value

It is important to understand that in Pig the concept of null is the same as in SQL: a null value means the value is unknown or nonexistent. This is completely different from the concept of null in C, Java, Python, etc., where null (or None) is a reference that points to nothing.
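A sketch of how nulls flow through expressions (the file and field are hypothetical): as in SQL, operators given a null input produce null rather than failing.

grunt> A = LOAD 'data' AS (x:int);
grunt> B = FOREACH A GENERATE x + 1;   -- rows where x is null yield null, not an error
grunt> DUMP B;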

Pig SCHEMA

Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig's philosophy of eating anything. If a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization. But if no schema is available, Pig will still process the data, making the best guesses it can based on how the script treats the data.
dividends = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray, date:chararray, dividend:float);
Pig now expects your data to have four fields. If it has more, it will truncate the extra ones. If it has less, it will pad the end of the record with nulls.
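A sketch of the schema-less case: fields are referenced by position ($0, $1, ...) and Pig guesses their types from how the script uses them.

dividends = load 'NYSE_dividends';
symbols   = foreach dividends generate $1 as symbol, $3 as dividend;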
Table = bag (also called a relation)
Record = tuple
Value = atom
No metastore is used by Pig.

How is a Pig script processed internally?

Pig script --> Parser --> Semantic check --> Logical optimizer --> Logical-to-physical plan --> Physical-to-MapReduce translation --> MapReduce launcher --> Hadoop
In short: parse --> compile --> optimize --> plan --> MapReduce jobs --> HDFS

The layers involved:
Client: Grunt shell, Pig scripts, embedded Pig Latin
Pig:    Pig Latin compiler
Hadoop: LocalJobRunner (local mode) or Hadoop cluster (cluster mode)
