
Welcome to:

Unit 2 - Introduction to Hadoop

© Copyright IBM Corporation 2015


Unit objectives

After completing this unit, you should be familiar with:

• Hadoop and the Hadoop Distributed File System (HDFS)
• HDFS administration
• MapReduce
• Setting up a Hadoop cluster
• Managing job execution
• An overview of Jaql
• Data loading
• An overview of the workflow engine



What do we care about when we process data?

• Handle partial hardware failure without going down
  – Should be able to switch over to a standby machine
• Be able to recover from major failures:
  – Regular backups
  – Logging
  – Mirror database at a different site
• Scalability
  – Should support increasing capacity without restarting
• Result consistency
  – Results should be consistent and returned in a reasonable amount of time



Why Hadoop when we have Relational Databases?

• One copy of data
  – Exchanging data requires synchronization
  – Deadlocks can become a problem
  – Need for recovery

• RDBMS systems are very strong in transactional processing

• RDBMS systems work extremely well with structured data
  – Some do a good job of supporting XML data
  – Not as strong with other unstructured data

• Reliability requires expensive hardware



Introduction to Hadoop

• Hadoop is a large-scale distributed batch-processing infrastructure.

• Although it can be used on a single machine, its real power lies in its ability
  to scale to hundreds or thousands of computers, each with multiple
  processor cores.
• Hadoop is also designed to efficiently distribute large amounts of work
  across a set of machines.
• Hadoop can handle data volumes orders of magnitude larger than many
  existing systems can. Hundreds of gigabytes of data are at the low end of
  Hadoop's scale.
• Hadoop is really built to process "web-scale" data on the order of hundreds
  of gigabytes to terabytes or petabytes.
• At this scale, the input data set is unlikely to fit on a single computer's hard
  drive, much less in memory.
• Hadoop therefore includes a distributed file system that splits up the input
  data and stores the fragments of the original data on several machines in
  the cluster.



Design Principles of Hadoop

• New way of storing and processing the data

• Lets the system handle most of the issues automatically:
  – Failures
  – Scalability
  – Reduced communication
  – Distribution of data and utilization of parallelism
  – Relatively inexpensive hardware

• Hadoop = HDFS + MapReduce infrastructure



RDBMS vs. Hadoop

RDBMS:
• Structured data with a known schema
• Records, long fields, objects, XML
• SQL & XQuery
• Quick response, random access (indexing)
• Data loss is not acceptable
• Security & auditing
• Encryption
• Sophisticated data compression
• Enterprise hardware

Hadoop:
• Unstructured and structured files
• Only inserts and deletes
• Hive, Pig, Jaql
• Batch processing
• Data loss can sometimes happen
• Simple file compression
• Commodity hardware
• Access to files only (streaming)



The Hadoop Approach

• Hadoop is designed to efficiently handle large volumes of information by
  connecting many commodity computers together to work in parallel.

• Performing computations on large volumes of data in a distributed
  environment has been done before.

• What makes Hadoop unique is its simplified programming model, which
  allows the user to quickly write and test distributed systems, and its
  efficient, automatic distribution of data.

• In a Hadoop cluster, the data is distributed to all the nodes of the
  cluster as it is being loaded.



Data Distribution

• The Hadoop Distributed File System (HDFS) will split up large data files into
fragments that are managed by different nodes of the cluster.
• Each fragment is replicated in multiple computers, so that a single failure in
a machine will not make the data unavailable.
• Content is universally accessible through a single namespace.
• Data is record-oriented in the Hadoop programming framework.
• Input files are divided into lines or other formats specific to the application
  logic. Each process running on a cluster node processes a subset of these
  records.
• The Hadoop framework schedules processes in proximity to the location
  of the data and records, using knowledge of the distributed file system.
• Since the files are spread throughout the distributed file system as
  fragments, each process in a compute node works with a subset of the
  data. The data operated on by a node is chosen based on its locality to
  the node.



The data is distributed across nodes at the time of loading



MapReduce: Isolated Processes

• Hadoop utilizes HDFS and executes programs on each of the
  nodes in parallel.

• These programs are MapReduce jobs that split the data into chunks,
  which are processed by the map tasks in parallel.

• The framework sorts the output of the map tasks and
  directs all output records with the same key values to the same
  nodes.

• This directed output then becomes the input to the reduce tasks,
  which also run in parallel.



Map and reduce tasks run on the nodes where the records of data are already present



Flat Scalability

• A major advantage of using Hadoop is its flat scalability curve.

• Running Hadoop on a limited amount of data on a small number of
  nodes does not show particularly stellar performance, as the
  overhead involved in starting Hadoop programs is relatively high.
• Other parallel/distributed programming paradigms, such as MPI
  (Message Passing Interface), may perform much better on two, four, or
  perhaps a dozen machines.
• But if a program is written for Hadoop and runs on ten nodes, almost no
  rework is necessary for the same program to run on a much larger
  amount of hardware.
• Orders of magnitude of growth can be managed with little rework
  required for the application.
• The Hadoop platform manages the data and hardware resources and
  offers a performance increase proportional to the number of available
  machines.
HDFS: Hadoop Distributed File System

• Invented at Yahoo! (Doug Cutting)
  – Process internet-scale data
  – Save costs: distribute the workload across a massively parallel system

• Tolerates a high component failure rate

• Throughput is given priority over response time

• Large streaming scans (reads); no random access

• Large files are preferred over small ones

• Reliability is provided through replication


HDFS: Hadoop Distributed File System (Cont.)

• HDFS is a block-structured file system.


• Individual files are divided into blocks of a fixed size. These blocks
  are stored across a cluster of one or more machines with data storage
  capacity.
• Each machine in the cluster that holds data is called a DataNode. A file can
  consist of several blocks, and they are not necessarily stored on the same
  machine.
• The target machines for each block are chosen at random on a block-by-
  block basis. Therefore, access to a file may require the
  cooperation of several machines.
• If multiple machines are involved in serving a file, the file could become
  unavailable with the loss of any one of those machines. HDFS combats
  this problem by replicating each block across a number of
  machines (3 by default).



DataNodes with Blocks of Multiple Files with a Replication Factor of 2



Cluster Configuration

• HDFS settings can be found in a set of XML files in the Hadoop
  configuration directory: conf/ under the main Hadoop installation directory
  (where you unzipped Hadoop)
• conf/hadoop-defaults.xml contains the default values of all Hadoop
  parameters. This file is considered read-only
• We override these defaults by setting new values in conf/hadoop-site.xml
• This file should be replicated consistently across all the machines in the
  cluster
• Configuration options are a set of key-value pairs, in the format shown on
  the next slide


Configuration Options

• Configuration options are a set of key-value pairs in the format:

<property>
  <name>property-name</name>
  <value>property-value</value>
</property>

• Add the line <final>true</final> within the body of the property to prevent
  the property from being overridden by user applications. This is useful for
  most configuration options.
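For illustration, here is a hypothetical conf/hadoop-site.xml entry that sets the NameNode address (using the example host name that appears in the table later in this unit) and marks it final so that user jobs cannot override it:

<property>
  <name>fs.default.name</name>
  <value>hdfs://alpha.milkman.org:9000</value>
  <final>true</final>
</property>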



Settings needed to configure HDFS

Key               Value                        Example

fs.default.name   protocol://servername:port   hdfs://alpha.milkman.org:9000

dfs.data.dir      path                         /home/username/hdfs/data

dfs.name.dir      path                         /home/username/hdfs/name



Keys configured in HDFS

• fs.default.name
  – A URI (protocol specifier, host name, and port) that describes the NameNode
    for the cluster
  – Each node in the system on which Hadoop is expected to operate needs to
    know the address of the NameNode
  – The DataNode instances will register with this NameNode and make their
    data available through it
  – Individual client programs connect to this address to retrieve the locations
    of file blocks.



Keys configured in HDFS (Cont.)

• dfs.data.dir
  – This is the path on the local file system in which the DataNode instance
    should store its data
  – It is not necessary that all DataNode instances store their data under the
    same local path prefix
  – The machines can be heterogeneous
  – By default, Hadoop places this in /tmp
  – This is fine for testing, but it is an easy way to lose data in a real production
    system, so this option must be overridden



Keys configured in HDFS (Cont.)

• dfs.name.dir
  – This is the path on the local file system of the NameNode instance where
    the NameNode metadata is stored
  – It is used only by the NameNode to find its information, and it does not exist
    on the DataNodes
  – Another configuration parameter, not included in the list above, is
    dfs.replication
  – This is the default replication factor of each block of data in the file
    system. For a production cluster, this should usually be left at its default
    value of 3 (see the sketch below).
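As an illustrative sketch (not from the original deck): a single-node test setup commonly lowers the replication factor to 1, since there is only one DataNode available to hold replicas. In conf/hadoop-site.xml this would look like:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>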



hadoop-site.xml for a Single-node Configuration

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>
Keys configured in HDFS (Cont.)

• After copying this information into your conf/hadoop-site.xml, copy that
  file to the conf/ directory on all the machines in the cluster
• The master node needs to know the addresses of all the machines to
  be used as DataNodes, because the startup scripts depend on this
• Also in the conf/ directory, edit the slaves file so that it contains
  a list of the fully qualified names of the slave instances, one host per line
• The user who owns the Hadoop processes must have read and write access
  to each of these directories
• In a large-scale environment, it is recommended that you create a
  user named "hadoop" on each node with the express purpose of
  owning and managing the Hadoop tasks
• For a single person on one machine, it is perfectly acceptable to run
  Hadoop under your own user name. It is not recommended that you
  run Hadoop as root.



HDFS Start

• Format the file system you have just configured:
  – user@namenode:hadoop$ bin/hadoop namenode -format
• This process is performed only once. When it is complete, we are
  free to start the distributed file system:
  – user@namenode:hadoop$ bin/start-dfs.sh
• This command starts the NameNode server on the master
  machine
• It also starts the DataNode instances on each of the slaves
• In a single-machine "cluster", this is the same machine as the
  NameNode instance. In a real cluster of two or more machines, this
  script will ssh into each slave machine and start a DataNode instance.



Interact with HDFS

• Interacting with HDFS (loading data, retrieving data, and manipulating
  files) is done using commands
• Most of the commands that communicate with the cluster are
  performed by a monolithic script called bin/hadoop
• This script loads the Hadoop system with the Java virtual machine and
  executes a user command
• Commands are specified in the following way:
  – user@machine:hadoop$ bin/hadoop moduleName -cmd arguments...
  – moduleName tells the program which subset of Hadoop functionality to
    use
  – -cmd is the name of a specific command within this module to run; its
    arguments follow the command name
• Two of these modules are relevant to HDFS: dfs and dfsadmin (see the
  example below).
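For example, the dfsadmin module's report command prints basic capacity and DataNode status information, which makes it a convenient first check that the cluster is up:

user@machine:hadoop$ bin/hadoop dfsadmin -report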



Insert the Data in the Cluster

• Create the home directory if it does not already exist
  – someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user
  – If there is no /user directory yet, it will automatically be created
    later if necessary
• Upload a file. To load a single file into HDFS, we can use the put
  command
  – someone@anynode:hadoop$ bin/hadoop dfs -put
    /home/someone/myFile.txt /user/yourusername/
  – This copies /home/someone/myFile.txt from the local file system to
    /user/yourusername/myFile.txt in HDFS
• Check that the file is in HDFS using either command
  – someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourusername
  – someone@anynode:hadoop$ bin/hadoop dfs -ls



Examples for the put command

• bin/hadoop dfs -put foo bar
  Assuming: no file or directory named /user/$USER/bar exists in HDFS
  Result: the local file foo is uploaded to a file called /user/$USER/bar

• bin/hadoop dfs -put foo bar
  Assuming: /user/$USER/bar is a directory
  Result: the local file foo is uploaded to a file called /user/$USER/bar/foo

• bin/hadoop dfs -put foo somedir/somefile
  Assuming: /user/$USER/somedir does not exist in HDFS
  Result: the local file foo is uploaded to a file called
  /user/$USER/somedir/somefile, creating the missing directory

• bin/hadoop dfs -put foo bar
  Assuming: /user/$USER/bar is already a file in HDFS
  Result: no change in HDFS, and an error is returned to the user



Retrieve Data from HDFS

• Data can be viewed with the cat command

• Assume that a file with the name "foo" has been loaded into your
  home directory in HDFS
  – someone@anynode:hadoop$ bin/hadoop dfs -cat foo
  – (The contents of foo are shown here)
  – …….
  – …….



Turning off HDFS

• To switch off the HDFS cluster, log in to the NameNode machine and
  run:
  – someone@namenode:hadoop$ bin/stop-dfs.sh

• This command must be performed by the same user that started
  HDFS with bin/start-dfs.sh.



HDFS Command Reference

Command        Operation

-ls path       Lists the contents of the directory specified by path, showing the
               names, permissions, owner, size, and modification date of each entry.

-lsr path      Behaves like -ls, but recursively displays entries in all
               subdirectories of path.

-du path       Shows disk usage, in bytes, of all files matching path; file names
               are reported with the full HDFS protocol prefix.

-dus path      Like -du, but prints a summary of the disk usage of all files and
               directories in path.

-mv src dest   Moves the file or directory indicated by src to dest, within HDFS.

-cp src dest   Copies the file or directory identified by src to dest, within HDFS.

-rm path       Deletes the file or empty directory identified by path.

-rmr path      Deletes the file or directory identified by path. Recursively deletes
               all child entries (i.e., files or subdirectories of path).
HDFS Command Reference (Cont.)

-cat filename          Displays the contents of filename on stdout.

-mkdir path            Creates a directory named path in HDFS. Creates any parent
                       directories that are missing (e.g., like mkdir -p in Linux).

-touchz path           Creates a file at path containing the current time as a
                       timestamp. Fails if a file already exists at path, unless
                       the file is already size 0.

-chmod [-R]            Changes the file permissions associated with one or more
mode,mode,... path...  objects identified by path.... Performs the changes
                       recursively with -R. mode is a 3-digit octal mode, or
                       {augo}+/-{rwxX}. Assumes "a" if no scope is specified, and
                       does not apply a umask.



HDFS Command Reference (Cont.)

-chgrp [-R] group path...    Sets the owning group for the files or directories
                             identified by path.... Sets the group recursively if
                             -R is specified.

-help cmd                    Returns usage information about one of the commands
                             listed above. You must omit the leading "-" character
                             in cmd.

-setrep [-R] [-w] rep path   Sets the target replication factor for files identified
                             by path to rep. (The actual replication factor will
                             move toward the target over time.)

-chown [-R] [owner][:[group]] path...    Sets the owning user and/or group for
                             files or directories identified by path.... Sets the
                             owner recursively if -R is specified.



HDFS Permissions and Security

• HDFS file permissions are based on the POSIX permission model

• The HDFS permissions system is designed to prevent accidental
  corruption of data or casual misuse of information within a
  group of users that share access to a cluster
• HDFS security is based on the POSIX model of users and groups.
  Each file or directory has 3 permissions (read, write, and execute)
  associated with it at three different granularities: the owner of the file,
  users in the same group as the owner, and the rest of the users on
  the system.
• Security permissions and ownership can be modified using
  the bin/hadoop dfs -chmod, -chown, and -chgrp operations described
  earlier in this document, which work in a similar fashion to the
  POSIX/Linux tools of the same name (see the example below).
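For example, using the file foo uploaded earlier, restricting it to owner-only read/write access with a 3-digit octal mode (as described in the command reference) would look like:

someone@anynode:hadoop$ bin/hadoop dfs -chmod 600 /user/yourusername/foo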



HDFS Additional Tasks: Rebalancing Blocks

• New nodes can be added to a cluster in a simple way

• On the new node, the same Hadoop version and configuration
  (conf/hadoop-site.xml) as on the rest of the cluster must be installed
• Starting the DataNode daemon on the machine causes it to contact
  the NameNode and join the cluster
• However, the new DataNode has no data at first; therefore, it is
  not alleviating space problems on the existing
  nodes
• New files will be stored on the new DataNode in addition to the
  existing ones, but for optimum usage, storage should be balanced
  across all nodes
• This can be achieved with the automatic balancing tool included with
  Hadoop. The Balancer class intelligently balances blocks across the nodes
  to achieve a uniform distribution of blocks within a certain threshold,
  expressed as a percentage (see the example below)
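A sketch of a typical invocation from the master node, with the threshold expressed in percent. The script name and flag below are the standard ones for this generation of Hadoop, though they may vary slightly between versions:

someone@namenode:hadoop$ bin/start-balancer.sh -threshold 5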
Decommissioning Nodes

• Nodes can be removed from a cluster while it is running,
  without data loss

• The nodes must be removed on a schedule that allows HDFS to
  ensure that no blocks are left under-replicated when the retiring
  DataNodes go away.

• HDFS provides a decommissioning feature that ensures this process is
  performed safely. To use this feature, follow the steps below (a
  configuration sketch follows this list):
  – Configure the cluster
  – Determine the hosts to remove
  – Force a configuration reload
  – Shut down the nodes
  – Edit the excludes file again
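A sketch of the first and third steps, assuming the list of retiring hosts is kept in a hypothetical file /home/hadoop/excludes. The dfs.hosts.exclude key and the dfsadmin -refreshNodes command are the standard mechanism in this generation of Hadoop. In conf/hadoop-site.xml:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
</property>

Then, after adding the host names to that file, force the NameNode to reload its configuration:

someone@namenode:hadoop$ bin/hadoop dfsadmin -refreshNodes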
Checking File System Health

• Hadoop provides an fsck command to verify that the file system is
  healthy

• It can be launched from the command line as follows:
  – bin/hadoop fsck [path] [options]
  – If run without arguments, it prints usage information and exits.
  – If run with the argument /, the health of the entire file system is
    checked and printed as a report.
  – If given the path of a particular directory or file, it only checks the
    files under that path.
  – If an option argument is given but no path, it starts from the
    file system root (see the example below).
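For example, to check everything under a user's home directory and print per-file and per-block detail (-files and -blocks are standard fsck options):

someone@anynode:hadoop$ bin/hadoop fsck /user/yourusername -files -blocks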



Rack Awareness

• For small clusters in which all the servers are connected by a single switch,
  there are only two levels of locality: "on-machine" and "off-machine"
• For large Hadoop installations that span multiple racks, it is important to
  ensure that replicas of the data exist on multiple racks
• HDFS can be made rack-aware by the use of a script that allows the
  master node to map out the network topology of the cluster
• While alternative configuration strategies can be used, the default
  implementation allows you to provide an executable script that returns the
  "rack address" of each of a list of IP addresses
• One or more IP addresses of the nodes in the cluster are sent as arguments
  to the network topology script
• Specify the key topology.script.file.name in conf/hadoop-site.xml to set the
  rack mapping script
• By default, Hadoop will attempt to send a set of IP addresses to the script
  as several command-line arguments. You can control the maximum
  acceptable number of arguments with the key topology.script.number.args
  (a minimal script sketch follows)
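A minimal sketch of such a topology script. This trivial version assigns every address to a single rack, which is effectively what Hadoop assumes when no script is configured; a real script would look each address up in a site-specific mapping instead:

#!/bin/bash
# Print one rack address per input IP address or host name.
for node in "$@"; do
  echo "/default-rack"
done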
Implementation of an Example Program

• The program will read the files you uploaded to HDFS in the
  previous section and determine how many times each word appears
  in them
  – Create the project
  – Create the source code files

• Our program requires three classes: a mapper, a reducer, and a
  driver.

• WordCountMapper, WordCountReducer, and WordCount


WordCountMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  // Emit (word, 1) for every whitespace-delimited token in the input line
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line.toLowerCase());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
WordCountReducer.java

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Sum the counts emitted for each word and emit (word, total)
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // process value
    }
    output.collect(key, new IntWritable(sum));
  }
}
WordCount.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(WordCount.class);

    // Specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Specify input and output directories
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    // Specify a mapper
    conf.setMapperClass(WordCountMapper.class);

    // Specify a reducer (also used as a combiner here)
    conf.setReducerClass(WordCountReducer.class);
    conf.setCombinerClass(WordCountReducer.class);

    client.setConf(conf);
    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
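Once the three classes are compiled and packaged, the job can also be launched from the command line instead of Eclipse. A hedged example; the jar name here is hypothetical and assumes the classes above were packaged into it:

someone@anynode:hadoop$ bin/hadoop jar wordcount.jar WordCount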
Running the Job

• After the code has been entered, it is time to run it

• You should already have created a directory called input below
  /user/hadoop-user in HDFS. This holds the input files for this process

• In the Project Explorer, right-click on the driver
  class, WordCount.java. In the pop-up menu, select Run As > Run on
  Hadoop

• A window will appear asking you to select a location to run Hadoop on.
  Select the VMware server you set up earlier, and click Finish



Basic Concepts of MapReduce

• MapReduce programs are designed to process large volumes of
  data in a parallel fashion

• This requires splitting the workload across a large number of
  machines

• The communication overhead needed to keep the data on the
  nodes synchronized at all times would prevent this from scaling, so
  MapReduce components do not share data arbitrarily

• All data elements in MapReduce are immutable, which means that
  they cannot be updated

• MapReduce programs transform lists of input
  data elements into lists of output data elements (a plain-Java sketch
  of this idea follows)
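To make the list-transformation idea concrete, here is a minimal plain-Java sketch. It is not Hadoop API code; it just walks an in-memory list through the same (word, count) scheme used by the WordCount example earlier in this unit:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListMapReduceSketch {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList("see spot run", "run spot run");

    // "Map" phase: transform each input element into (word, 1) pairs
    List<String[]> pairs = new ArrayList<String[]>();
    for (String line : lines) {
      for (String word : line.split(" ")) {
        pairs.add(new String[] { word, "1" });
      }
    }

    // "Reduce" phase: group the pairs by key and fold the values per key
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String[] pair : pairs) {
      Integer n = counts.get(pair[0]);
      counts.put(pair[0], (n == null ? 0 : n) + Integer.parseInt(pair[1]));
    }

    System.out.println(counts); // e.g. {spot=2, see=1, run=3} (order may vary)
  }
}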
Phases in MapReduce

• List processing

• Mapping lists

• Reducing lists



Phases in MapReduce (Cont.)

• Grouping in MapReduce

• Different colors represent different keys. All values with the same key
  are presented to a single reduce task.



MapReduce Data Flow



A Closer Look at MapReduce



MapReduce Additional Functionality



Fault Tolerance

• One of the major reasons to use Hadoop to run jobs is its high degree of
  fault tolerance

• Even in a large cluster where individual nodes or network
  components may experience high rates of failure, Hadoop can guide
  jobs toward a successful completion.

• Individual task nodes (TaskTrackers) are in constant
  communication with the head node of the system, called the JobTracker

• If a TaskTracker fails to communicate with the JobTracker for a
  period of time (by default, 1 minute), the JobTracker will assume that
  the TaskTracker in question has ceased to function.



Fault Tolerance (Cont.)

• The JobTracker knows which map and reduce tasks were assigned to
  each TaskTracker

• If the job is still in the map phase, other TaskTrackers will be
  asked to re-execute all of the map tasks previously run by the failed
  TaskTracker

• If the job is in the reduce phase, other TaskTrackers will re-execute all
  of the reduce tasks that were in progress on the failed TaskTracker



Checkpoint (1 of 4)

1. Apache Hadoop is a free, open-source framework for


A. Data Intensive Application
B. Web Search Technology
C. Read Intensive Application
D. All the above

2. PIG is a high-level language that generates


A. Map
B. Functions
C. Data sets
D. All the above



Checkpoint solution (1 of 4)

1. Apache Hadoop is a free, open-source framework for


A. Data Intensive Application
B. Web Search Technology
C. Read Intensive Application
D. All the above

2. PIG is a high-level language that generates


A. Map
B. Functions
C. Data sets
D. All the above



Checkpoint (2 of 4)

3. Parallel data processing supports the following:


A. Grid Computing
B. Workload balancing
C. Parallel databases
D. Option A and C

4. Hadoop Distributed File System (HDFS) was created by


A. Google
B. Yahoo
C. FB
D. None of the above



Checkpoint solution (2 of 4)

3. Parallel data processing supports the following:


A. Grid Computing
B. Workload balancing
C. Parallel databases
D. Option A and C

4. Hadoop Distributed File System (HDFS) was created by


A. Google
B. Yahoo (correct answer)
C. FB
D. None of the above



Checkpoint (3 of 4)

5. Hadoop Distributed File System (HDFS) was created for


A. Processing small files
B. Processing large files
C. Scaling the data
D. None of the above

6. This feature is not a design principle of Hadoop


A. Failures
B. Scalability
C. Workload balancing
D. Distribute Data and Utilize Parallelism



Checkpoint solution (3 of 4)

5. Hadoop Distributed File System (HDFS) was created for


A. Processing small files
B. Processing large files (correct answer)
C. Scaling the data
D. None of the above

6. This feature is not a design principle of Hadoop


A. Failures
B. Scalability
C. Workload balancing (correct answer)
D. Distribute Data and Utilize Parallelism



Checkpoint (4 of 4)

7. The touchz command is used to


A. create a file with zero length
B. create a file with zero byte
C. create a read only file
D. Both A and B

8. This is the way to process large data sets by distributing the work
across a large number of nodes.
A. Maps
B. MapReduce
C. MapDeduct
D. ReduceMap



Checkpoint solution (4 of 4)

7. The touchz command is used to


A. create a file with zero length
B. create a file with zero byte
C. create a read only file
D. Both A and B

8. This is the way to process large data sets by distributing the work
across a large number of nodes.
A. Maps
B. MapReduce (correct answer)
C. MapDeduct
D. ReduceMap



Unit summary

Having completed this unit, you should be familiar with:

• Hadoop and the Hadoop Distributed File System (HDFS)
• HDFS administration
• MapReduce
• Setting up a Hadoop cluster
• Managing job execution
• An overview of Jaql
• Data loading
• An overview of the workflow engine

