• The Hadoop Distributed File System (HDFS) splits large data files into blocks that are managed by different nodes of the cluster.
• Each block is replicated across multiple machines, so that a single machine failure does not make any data unavailable.
• Content is universally accessible through a single namespace.
• In the Hadoop programming framework, data is record-oriented.
• Input files are divided into lines or other formats specific to the application logic. Each process running on a cluster node processes a subset of these records.
• Using its knowledge of the distributed file system, the Hadoop framework schedules processes close to the location of the data and records they work on.
• Since files are spread across the distributed file system as blocks, each process running on a compute node works with a subset of the data. The data a node operates on is chosen based on its locality to the node (see the sketch below).
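• As an illustration of block placement and locality, the following minimal sketch asks HDFS which hosts hold the blocks of a file, using the FileSystem API; the file path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path("/user/someone/bigfile"));

    // one BlockLocation per block: which hosts hold each replica
    BlockLocation[] blocks =
        fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset() + ": "
          + String.join(", ", b.getHosts()));
    }
  }
}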
• These programs are MapReduce jobs, which split the data into chunks that are processed by map tasks in parallel.
• The framework sorts the output of the map tasks and directs all output records with the same key to the same node.
• This directed output then becomes the input to the reduce tasks, which also run in parallel; the toy sketch below simulates that grouping step.
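• To make the sort-and-direct guarantee concrete, here is a small toy simulation in plain Java (no Hadoop dependencies; the words and counts are made up) of how map output is grouped by key before being handed to the reduce tasks:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleToy {
  public static void main(String[] args) {
    // simulated output of two map tasks: (word, 1) pairs
    List<String[]> mapOutput = Arrays.asList(
        new String[] {"apple", "1"}, new String[] {"pear", "1"},
        new String[] {"apple", "1"}, new String[] {"plum", "1"});

    // the "shuffle": collect every value that shares a key, sorted by key
    Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
    for (String[] record : mapOutput) {
      grouped.computeIfAbsent(record[0], k -> new ArrayList<Integer>())
             .add(Integer.parseInt(record[1]));
    }

    // each entry is what a single reduce() invocation would receive
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}

• Running this prints apple -> [1, 1], pear -> [1], plum -> [1]; in a real cluster the same grouping happens across machines, with each key's bundle delivered to a single reduce task.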
<property>
  <name>property-name</name>
  <value>property-value</value>
</property>
• Add the line <final>true</final> within the body of the property to prevent the property from being overridden by user applications. This is useful for most configuration options.
• fs.default.name
  – This is a URI (protocol specifier, host name, and port) that describes the NameNode for the cluster.
  – Each node on which Hadoop is expected to operate must know the address of the NameNode.
  – The DataNode instances register with this NameNode and make their data available through it.
  – Individual client programs connect to this address to retrieve the locations of file blocks (as sketched below).
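• As a sketch of what this means for a client, the following minimal, hypothetical Java program uses fs.default.name to reach the NameNode through Hadoop's FileSystem API; the host name and port are the placeholders used in the configuration example later in this section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class NameNodeLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // the same URI that conf/hadoop-site.xml supplies via fs.default.name
    conf.set("fs.default.name", "hdfs://your.server.name.com:9000");

    // FileSystem.get() contacts the NameNode named by that URI; the
    // locations of file blocks are then resolved through it
    FileSystem fs = FileSystem.get(conf);
    System.out.println("connected, home dir = " + fs.getHomeDirectory());
  }
}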
• dfs.data.dir
  – This is the path on the local file system where the DataNode instance should store its data.
  – It is not necessary for all DataNode instances to store their data under the same local path prefix.
  – The machines can be heterogeneous.
  – By default, Hadoop places this in /tmp.
  – That is fine for testing, but it is an easy way to lose data in a real production system, so this option must be overridden.
• dfs.name.dir
  – This is the path on the local file system of the NameNode instance where the NameNode metadata is stored.
  – It is used only by the NameNode to find its information, and does not exist on the DataNodes.
  – Another configuration parameter, not included in the list above, is dfs.replication.
  – This is the default replication factor for each block of data in the file system. For a production cluster, this should usually be left at its default value of 3.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>
Keys configured in HDFS (Cont.)
Command | Precondition | Outcome
bin/hadoop dfs -put foo bar | No file/directory named /user/$USER/bar exists in HDFS | Uploads local file foo to a file named /user/$USER/bar
bin/hadoop dfs -put foo bar | /user/$USER/bar is a directory | Uploads local file foo to a file named /user/$USER/bar/foo
bin/hadoop dfs -put foo somedir/somefile | /user/$USER/somedir does not exist in HDFS | Uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory
bin/hadoop dfs -put foo bar | /user/$USER/bar is already a file in HDFS | No change is made in HDFS, and an error is returned to the user.
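• For comparison, here is a minimal sketch of the programmatic counterpart of dfs -put, using Hadoop's FileSystem API; the file names foo and bar are the ones used in the table above, and the exact overwrite behavior should be verified against your Hadoop version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // copies the local file foo into HDFS; the relative destination path
    // resolves against the user's home directory, i.e. /user/$USER/bar
    fs.copyFromLocalFile(new Path("foo"), new Path("bar"));
  }
}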
• Assume that a file named "foo" has been loaded into your home directory in HDFS; you can display it as follows (a programmatic equivalent is sketched below):
  – someone@anynode:hadoop$ bin/hadoop dfs -cat foo
  – (the contents of foo are shown here)
  – …….
  – …….
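• A hedged sketch of the programmatic equivalent of dfs -cat, again via the FileSystem API (the file name foo follows the example above):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // open the HDFS file and stream its contents to stdout, like -cat
    InputStream in = fs.open(new Path("foo"));
    try {
      IOUtils.copyBytes(in, System.out, conf, false);
    } finally {
      in.close();
    }
  }
}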
• To switch off the HDFS cluster, log in to the NameNode machine and run:
  – someone@namenode:hadoop$ bin/stop-dfs.sh
Command | Operation
-ls path | Displays the contents of the directory specified by path, listing the names, permissions, owner, size, and modification date of each entry.
-lsr path | Behaves like -ls, but recursively displays the entries of all subdirectories of path.
-du path | Displays disk usage, in bytes, of all files that match path; file names are reported with the full HDFS protocol prefix.
-dus path | Like -du, but prints a summary of the disk usage of all files and directories in the path.
-mv src dest | Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest | Copies the file or directory identified by src to dest, within HDFS.
-rmr path | Deletes the file or directory identified by path. Recursively deletes all child entries (i.e., files or subdirectories of path).
HDFS Command Reference (Cont.)
-help cmd | Returns usage information about one of the commands listed above. You must omit the leading "-" character in cmd.
-setrep [-R] [-w] rep path | Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
-chown [-R] [owner][:[group]] path... | Sets the owning user and/or group for files or directories identified by path.... Sets the owner recursively if -R is specified.
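• Several of the shell commands above have direct FileSystem API counterparts. The following is a minimal sketch (all paths are illustrative) showing a few of them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // -ls path: list a directory's entries with their sizes
    for (FileStatus stat : fs.listStatus(new Path("/user"))) {
      System.out.println(stat.getPath() + "\t" + stat.getLen());
    }

    // -mv src dest: move within HDFS
    fs.rename(new Path("/user/old"), new Path("/user/new"));

    // -rmr path: recursive delete
    fs.delete(new Path("/user/tmp"), true);

    // -setrep rep path: set the target replication factor
    fs.setReplication(new Path("/user/new"), (short) 3);
  }
}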
• Nodes must be decommissioned on a schedule that allows HDFS to ensure that no blocks are replicated entirely on the set of DataNodes being retired.
• Hadoop provides a command, fsck, to verify that the file system is healthy.
• For small clusters in which all the servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine".
• When a large Hadoop installation spans several racks, it is important to ensure that replicas of the data exist on multiple racks.
• HDFS can be made rack-aware by means of a script that allows the master node to map out the network topology of the cluster.
• Although alternative configuration strategies can be used, the default implementation allows you to provide an executable script that returns the "rack address" of each of a list of IP addresses (a Java alternative is sketched below).
• One or more IP addresses of nodes in the cluster are sent as arguments to the network topology script.
• Specify the key topology.script.file.name in conf/hadoop-site.xml to set the rack mapping script.
• By default, Hadoop will attempt to send a set of IP addresses to the script as several command-line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key.
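• As a hedged sketch of such an alternative strategy: in the old (1.x-era) API, the rack mapping can also be supplied in Java by implementing the org.apache.hadoop.net.DNSToSwitchMapping interface. The class name and the one-subnet-per-rack rule below are hypothetical.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Hypothetical mapper: returns one rack path per host name or IP it is
// given, mirroring what the external topology script would print.
public class SimpleRackMapping implements DNSToSwitchMapping {
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>();
    for (String name : names) {
      // illustrative rule only: one subnet per rack
      if (name.startsWith("10.1.")) {
        racks.add("/rack1");
      } else {
        racks.add("/rack2");
      }
    }
    return racks;
  }
}

• Whichever form is used, the mapping must return exactly one rack path for each host it is asked about, just as the external script does.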
Implementation of an Example Program
• The program will read the files you uploaded to HDFS in the previous section and determine how many times each word appears in those files.
  – Create the project.
  – Create the source code files.
    client.setConf(conf);
    try {
      // submit the configured job and wait for it to complete
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
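• The driver fragment above refers to mapper and reducer classes that are not shown in this excerpt. The following is a hedged sketch of what they could look like for the word-count task, written against the old org.apache.hadoop.mapred API that matches the JobClient call; the class names WordCountMapper and WordCountReducer are illustrative assumptions, not names from the original program.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// emits (word, 1) for every word in every input line
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);
    }
  }
}

// sums the counts for each word; the framework delivers all values
// emitted for a single key in one reduce() call
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}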
Running the Job
• In the figure, different colors represent different keys; all values with the same key are presented to a single reduce task.
• One of the major reasons to use Hadoop for running jobs is its high degree of fault tolerance.
• The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
• If the job is still in the map phase, other TaskTrackers are asked to re-execute all of the map tasks previously run by the failed TaskTracker.
• If the job is in the reduce phase, other TaskTrackers will re-execute all of the reduce tasks that were in progress on the failed TaskTracker.
8. This is the way to process large data sets by distributing the work across a large number of nodes.
A. Maps
B. MapReduce
C. MapDeduct
D. ReduceMap