System Design and Implementation 5.1 System Design

LOGMINER WITH DISTRIBUTED AND PARALLEL SEARCH USING
HADOOP/MAPREDUCE
5. SYSTEM DESIGN AND IMPLEMENTATION
5.1 SYSTEM DESIGN
System design is the major process which can be termed as the backbone in
the development of any software product. Design phase in any project will
consider the results of the analysis phase and comes out with the prototype of the
product, which is almost similar to the product which is being developed.
The Design part of this project needs certain requirements. A MapReduce
job is a unit of work that the client wants to be performed: it consists of the input
data, the MapReduce program, and configuration information. Hadoop runs the
job by dividing it into tasks, of which there are two types:
 Map tasks
 Reduce tasks.
There are two types of nodes that control the job execution process: a
JobTracker and a number of TaskTracker. The JobTracker coordinates all the jobs
run on the system by scheduling tasks to run on TaskTracker. Tasktracker’s run
tasks and send progress reports to the JobTracker, which keeps a record of the
overall progress of each job. If a task fails, the JobTracker can reschedule it on a
different TaskTracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called
input splits, or just splits. Hadoop creates one map task for each split, which runs
the user defined map function for each record in the split.
Having many splits means the time taken to process each split is small
compared to the time to process the whole input. So if we are processing the splits
in parallel, the processing is better load-balanced if the splits are small, since a
faster machine will be able to process proportionally more splits over the course of
the job than a slower machine. Even if the machines are identical, failed processes
or other jobs running concurrently make load balancing desirable, and the quality
of the load balancing increases as the splits become more fine-grained.
38
HADOOP/MAPREDUCE
On the other hand, if splits are too small, then the overhead of managing the
splits and of map task creation begins to dominate the total job execution time. For
most jobs, a good split size tends to be the size of a HDFS block, 64 MB by
default, although this can be changed for the cluster (for all newly created files), or
specified when each file is created.
Hadoop does its best to run the map task on a node where the input data
resides in HDFS. This is called the data locality optimization. It should now be
clear why the optimal split size is the same as the block size: it is the largest size of
input that can be guaranteed to be stored on a single node. If the split spanned two
blocks, it would be unlikely that any HDFS node stored both blocks, so some of
the split would have to be transferred across the network to the node running the
map task, which is clearly less efficient than running the whole map task using
local data.
Map tasks write their output to local disk, not to HDFS. Map output is
intermediate output: it’s processed by reduce tasks to produce the final output, and
once the job is complete the map output can be thrown away. So storing it in
HDFS, with replication, would be overkill. If the node running the map task fails
before the map output has been consumed by the reduce task, then Hadoop will
automatically rerun the map task on another node to recreate the map output.
Reduce tasks don’t have the advantage of data locality—the input to a
single reduce task is normally the output from all mappers. In the present example,
we have a single reduce task that is fed by all of the map tasks. Therefore the
sorted map outputs have to be transferred across the network to the node where the
reduce task is running, where they are merged and then passed to the user-defined
reduce function. The output of the reduce is normally stored in HDFS for
reliability.
MapReduce - A Java-based job tracking, node management, and application
container for mappers and reducers written in Java or in any scripting language
that supports STDIN and STDOUT for job interaction.
39
HADOOP/MAPREDUCE
5.2 DATAFLOW DIAGRAMS
MapReduce Data Flow with a Single Reduce Task
MapReduce DataFlow with Multiple Reduce Tasks
40
HADOOP/MAPREDUCE
MapReduce DataFlow with No Reduce Tasks
A Client Writing Data To HDFS
41
HADOOP/MAPREDUCE
Search Process In Map Side
Hadoop MapReduce DataFlow
42
HADOOP/MAPREDUCE
Overall DataFlow Structure
Files loaded from HDFS
Input (key, value) pairs
Intermediate (key, value) pairs
Final (key, value) pairs
Writeback to HDFS
43
HADOOP/MAPREDUCE
5.3 IMPLEMENTATION
MapReduce refers to a framework that runs on a computational cluster to mine
large datasets. The name derives from the application of map() and reduce()
functions repurposed from functional programming languages.
 “Map” applies to all the members of the dataset and returns a list of results
 “Reduce” collates and resolves the results from one or more mapping
operations executed in parallel
 Very large datasets are split into large subsets called splits
 A parallelized operation performed on all splits yields the same results as if
it were executed against the larger dataset before turning it into splits
 Implementations separate business logic from multi-processing logic
 MapReduce framework developers focus on pr ocess dispatching, locking,
and logic flow
 App developers focus on implementing the business logic without worrying
about infrastructure or scalability issues
5.3.1 Implementation patterns
The Map(k1, v1) -> list(k2, v2) function is applied to every item in the split. It
produces a list of (k2, v2) pairs for each call.
The framework groups all the results with the same key together in a new split.
The Reduce(k2, list(v2)) -> list(v3) function is applied to each intermediate
results split to produce a collection of values v3 in the same domain. This
collection may have zero or more values. The desired result consists of all the v3
collections, often aggregated into one result file.
44
HADOOP/MAPREDUCE
5.3.2 Hadoop Cluster Building Blocks
Hadoop clusters may be deployed in three basic configurations:

Mode Description Usage
Local(Default) Multi-Threading Componet, Single Development, Test, Debug
JVM
Pseudo-Distributed Multiple JVM, Single node Development, Test, Debug
Fully-Distributed All Component run in separate node Staging, Production
Hadoop comprises tools and utilities for data serialization, file system access, and
interprocess communication pertaining to MapReduce implementations. Single
and clustered configurations are possible. This configuration almost always
includes HDFS because it’s better optimized for high throughput MapReduce I/O
than general-purpose file systems.
Components
45
HADOOP/MAPREDUCE
Each node in a Hadoop installation runs one/more daemons executing MapReduce
code or HDFS commands. Each daemon’s responsibilities in the cluster are:
 NameNode manages HDFS and communicates with every DataNode
daemon in the cluster
 JobTracker dispatches jobs, assigns splits to mappers/reducers as each stage
 TaskTracker executes tasks sent by the JobTracker and reports status
 DataNode manages HDFS content in the node and updates status to the
NameNode
These daemons execute in the three distinct processing layers of a Hadoop cluster:
master (Name Node), slaves (Data Nodes), and user applications.
Name Node (Master)
 Manages the file system name space
 Keeps track of job execution
 Manages the cluster
 Replicates data blocks and keeps them evenly distributed
 Manages lists of files, list of blocks in each file, list of blocks per node, and
file attributes and other meta-data
 Tracks HDFS file creation and deletion operations in an activity log
 Depending on system load, the NameNode and JobTracker daemons may
run on separate computers.
Data Nodes (Slaves)
 Store blocks of data in their local file system
 Store meta-data for each block
 Serve data and meta-data to the job they execute
 Send periodic status reports to the Name Node
 Send data blocks to other nodes required by the Name Node
46
HADOOP/MAPREDUCE
 Data nodes execute the DataNode and TaskTracker daemons described

earlier in this section.
Separate Daemons for Hadoop/MapReduce
User Applications
 Dispatch mappers and reducers to the Name Node for execution in the
Hadoop cluster
 Execute implementation contracts for Java and for scripting languages
mappers and reducers
 Provide application-specific execution parameters
 Set Hadoop runtime configuration parameters with semantics that apply to
the Name or the Data nodes
A user application may be a stand-alone executable, a script, a web application, or
any combination of these. The application is required to implement either the Java
or the streaming APIs.
47
HADOOP/MAPREDUCE
5.3.3 Daemons of Hadoop Cluster

A cluster will have one JobTracker server, one NameNode server, and one
secondary NameNode server, and DataNode and Task-Trackers. The JobTracker
coordinates the activities of the TaskTracker, and the NameNode manages the
DataNode. In the context of Hadoop, a node/machine running the TaskTracker or
DataNode server is considered a slave node. It is common to have nodes that run
both the TaskTracker and DataNode servers. The Hadoop server processes on the
slave nodes are controlled by their respective masters, the JobTracker and
NameNode servers.
48
HADOOP/MAPREDUCE
 JobTracker: The JobTracker provides command and control for job

management. It supplies the primary user interface to a MapReduce cluster.
It also handles the distribution and management of tasks. There is one
instance of this server running on a cluster. The machine running the
JobTracker server is the MapReduce master.
 TaskTracker: The TaskTracker provides execution services for the
submitted jobs. Each TaskTracker manages the execution of tasks on an
individual compute node in the MapReduce cluster. The JobTracker
manages all of the TaskTracker processes. There is one instance of this
server per compute node.
 NameNode: The NameNode provides metadata storage for the shared file
system. The NameNode supplies the primary user interface to the HDFS. It
also manages all of the metadata for the HDFS. There is one instance of this
server running on a cluster. The metadata includes such critical information
as the file directory structure and which DataNode have copies of the data
blocks that contain each file’s data. The machine running the NameNode
server process is the HDFS master.
 Secondary NameNode: The secondary NameNode provides both file
system metadata backup and metadata compaction. It supplies near real-
time backup of the metadata for the NameNode. There is at least one
instance of this server running on a cluster, ideally on a separate physical
machine than the one running the NameNode. The secondary NameNode
also merges the metadata change history, the edit log, into the NameNode’s
file system image.
 DataNode: The DataNode provides data storage services for the shared file
system. Each DataNode supplies block storage services for the HDFS. The
NameNode coordinates the storage and retrieval of the individual data
49
HADOOP/MAPREDUCE
blocks managed by a DataNode. There is one instance of this server process
per HDFS storage node.
General Mapping Process
The Hadoop framework provides a very simple map function, called
IdentityMapper. It is used in jobs that only need to reduce the input, and not
transform the raw input. The magic piece of code is the line output.collect(key,
Val), which passes a key/value pair back to the framework for further processing.
All map functions must implement the Mapper interface, which guarantees that the
map function will always be called with a key. The key is an instance of a Writable
Comparable object, a value that is an instance of a Writable object, an output
object, and a reporter. The framework will make one call to your map function for
each record in your input. There will be multiple instances of your map function
running, potentially in multiple Java Virtual Machines (JVMs), and potentially on
multiple machines. The basic Java code implementation for the Mapper has the
form:
public class LineIndexMapper extends MapReduceBase
implements Mapper<Long Writable, Text, Text, Text>
{
public void map(Long Writable k, Text v, OutputCollector<Text, Text> o,
Reporter r) throws IOException
{ /* implementation here
…….
…….
*/ }
……
……
}
50
HADOOP/MAPREDUCE
General Reducer Process

The Hadoop framework calls the reduce function one time for each unique key.
The framework provides the key and the set of values that share that key. If you
require the output of your job to be sorted, the reducer function must pass the key
objects to the output.collect() method unchanged. The reduce phase is, however,
free to output any number of records, including zero records, with the same key
and different values. This particular constraint is also why the map tasks may be
multithreaded, while the reduce tasks are explicitly only single-threaded.
The combiner is an output handler for the mapper to reduce the total data
transferred over the network. It can be thought of as a reducer on the local node.
The reducer iterates over keys and values generated in the previous step adding a
line number to each word’s occurrence index. A complete index shows the line
where each word occurs, and the file/work where it occurred.
public class LineIndexReducer extends MapReduceBase
implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text k, Iterator<Text> v, OutputCollector<Text, Text>
o,
Reporter r) throws IOException
{
/* implementation */ }
……
……
}
……
……
}
51

System Design and Implementation 5.1 System Design

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

System Design and Implementation 5.1 System Design

Uploaded by

Copyright:

Available Formats

LOGMINER WITH DISTRIBUTED AND PARALLEL SEARCH USING

MapReduce Data Flow with a Single Reduce Task

MapReduce DataFlow with Multiple Reduce Tasks

A Client Writing Data To HDFS

Hadoop MapReduce DataFlow

Overall DataFlow Structure

Files loaded from HDFS

Input (key, value) pairs

Intermediate (key, value) pairs

Final (key, value) pairs

Hadoop clusters may be deployed in three basic configurations:

Fully-Distributed All Component run in separate node Staging, Production

 Data nodes execute the DataNode and TaskTracker daemons described

Separate Daemons for Hadoop/MapReduce

5.3.3 Daemons of Hadoop Cluster

 JobTracker: The JobTracker provides command and control for job

General Reducer Process

You might also like