
Some of the frequently asked Interview questions for hadoop developers are:

(1)What is the difference between Secondary NameNode, Checkpoint NameNode & Backup Node?


(Hint: the Secondary NameNode is a poorly named component of Hadoop; it is not a standby NameNode.)
(2)What are the side data distribution techniques?
(3)What is shuffling in MapReduce?
(4)What is partitioning?
(5)Can we change the file cached by Distributed Cache?
(6)What if the JobTracker machine is down?
(7)Can we deploy the JobTracker on a node other than the NameNode?
(8)What are the four modules that make up the Apache Hadoop framework?
(9)Which modes can Hadoop be run in? List a few features for each mode.
(10)Where are Hadoop's configuration files located?
(11)List Hadoop's three configuration files.

(12)What are slaves and masters in Hadoop?

(13)How many datanodes can run on a single Hadoop cluster?

(14)What is job tracker in Hadoop?

(15)How many job tracker processes can run on a single Hadoop cluster?

(16)What sorts of actions does the job tracker process perform?

(17)How does job tracker schedule a job for the task tracker?

(18)What does the mapred.job.tracker command do?

(19)What is PID?

(20)What is jps?

(21)Is there another way to check whether Namenode is working?

(22)How would you restart Namenode?

(23)What is fsck?

(24)What is a map in Hadoop?

(25)What is a reducer in Hadoop?

(26)What are the parameters of mappers and reducers?

(27)Is it possible to rename the output file, and if so, how?

(28)List the network requirements for using Hadoop.

(29)Which port does SSH work on?

(30)What is streaming in Hadoop?

(31)What is the difference between Input Split and an HDFS Block?

(32)What does the file hadoop-metrics.properties do?

(33)Name the most common InputFormats defined in Hadoop. Which one is the default?

(34)What is the difference between TextInputFormat and KeyValueInputFormat class?


(35)What is InputSplit in Hadoop?

(36)How is the splitting of a file invoked in the Hadoop framework?


(37)Consider case scenario: In M/R system,

- HDFS block size is 64 MB

- Input format is FileInputFormat

- We have 3 files of size 64 KB, 65 MB and 127 MB

(38)How many input splits will be made by Hadoop framework?

(39)What is the purpose of RecordReader in Hadoop?

(39)After the Map phase finishes, the Hadoop framework does Partitioning, Shuffle and sort.
Explain what happens in this phase?

(40)If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the
reducer?

(41)What is JobTracker?

(42)What are some typical functions of Job Tracker?

(43)What is TaskTracker?

(44)What is the relationship between Jobs and Tasks in Hadoop?

(46)Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?

(47)Hadoop achieves parallelism by dividing tasks across many nodes, so it is possible for a few
slow nodes to rate-limit the rest of the program and slow it down. What mechanism does
Hadoop provide to combat this?

(48)How does speculative execution work in Hadoop?

(49)Using command line in Linux, how will you

- See all jobs running in the Hadoop cluster

- Kill a job?

(50)What is Hadoop Streaming?

(51)What characteristic of the streaming API makes it flexible enough to run MapReduce jobs in
languages like Perl, Ruby, Awk etc.?

(52)What is Distributed Cache in Hadoop?

(53)Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple
directories as input to the Hadoop job?

(54)Is it possible to have Hadoop job output in multiple directories? If yes, how?

(55)What will a Hadoop job do if you try to run it with an output directory that is already present?
Will it

- Overwrite it

- Warn you and continue

- Throw an exception and exit

(56)How can you set an arbitrary number of mappers to be created for a job in Hadoop?

(57)How can you set an arbitrary number of Reducers to be created for a job in Hadoop?

(58)How will you write a custom partitioner for a Hadoop job?

(59)How did you debug your Hadoop code?

(60)What is BIG DATA?

(61)Can you give some examples of Big Data?

(62)Can you give a detailed overview about the Big Data being generated by Facebook?

(63)According to IBM, what are the three characteristics of Big Data?

(64)How Big is Big Data?

(65)How is the analysis of Big Data useful for organizations?

(66)Who are Data Scientists?

(67)What are some of the characteristics of Hadoop framework?

(68)Give a brief overview of Hadoop history.

(69)Give examples of some companies that are using Hadoop structure?

(70)What is the basic difference between traditional RDBMS and Hadoop?

(71)What is structured and unstructured data?

(72)What are the core components of Hadoop?

(73)What is HDFS?

(74)What are the key features of HDFS?

(75)What is Fault Tolerance?

(76)Replication causes data redundancy, so why is it pursued in HDFS?

(77)Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node
will also be replicated on the other two?

(78)What is throughput? How does HDFS get a good throughput?

(79)What is streaming access?

(80)What is commodity hardware? Does commodity hardware include RAM?

(81)What is metadata?

(82)Why do we use HDFS for applications having large data sets and not when there are a lot of small
files?

(83)What is a daemon?

(84)Is Namenode machine same as datanode machine as in terms of hardware?

(85)What is a heartbeat in HDFS?

(86)Are Namenode and job tracker on the same host?

(87)What is a block in HDFS?

(88)What are the benefits of block transfer?

(89)If we want to copy 10 blocks from one machine to another, but another machine can copy only
8.5 blocks, can the blocks be broken at the time of replication?

(90)How is indexing done in HDFS?

(91)If a DataNode is full, how is it identified?

(92)If the number of DataNodes increases, do we need to upgrade the NameNode?

(93)Are job tracker and task trackers present in separate machines?

(94)When we send data to a node, do we allow settling-in time before sending more data to
that node?

(95)Does hadoop always require digital data to process?

(96)On what basis does the NameNode decide which DataNode to write to?

(97)Doesn't Google have its very own version of DFS?

(98)Who is a user in HDFS?

(99)Is client the end user in HDFS?

(100)What is the communication channel between client and namenode/datanode?

(101)What is a rack?

(102)On what basis will data be stored on a rack?

(103)Do we need to place the 2nd and 3rd replicas in rack 2 only?

(104)What if rack 2 and the DataNode fail?

(105)What is a Secondary Namenode? Is it a substitute to the Namenode?

(106)What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?

(107)What is Key value pair in HDFS?

(108)What is the difference between MapReduce engine and HDFS cluster?

(109)Is map like a pointer?

(110)Do we require two servers for the Namenode and the datanodes?

(111)Why are the number of splits equal to the number of maps?

(112)Is a job split into maps?

(113)Which are the two types of writes in HDFS?

(114)Why is reading done in parallel in HDFS while writing is not?

(115)Can Hadoop be compared to NOSQL database like Cassandra?

(116)How can I install Cloudera VM in my system?

(117)What is a Task Tracker in Hadoop? How many instances of Task Tracker run on a Hadoop
cluster?

(118)What are the four basic parameters of a mapper?

(119)What is the input type/format in MapReduce by default?

(120)Can we do online transactions (OLTP) using Hadoop?

(121)Explain how HDFS communicates with Linux native file system

(122)What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?

(123)What is the InputFormat ?

(124)What is the InputSplit in map reduce ?

(125)What are IdentityMapper and IdentityReducer in MapReduce?

(126)How JobTracker schedules a task?

(127)When are the reducers started in a MapReduce job?

(128)On What concept the Hadoop framework works?

(129)What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?

(130)What other technologies have you used in the Hadoop stack?

(131)How does the NameNode handle DataNode failures?

(132)How many Daemon processes run on a Hadoop system?

(133)What is configuration of a typical slave node on Hadoop cluster?


(134) How many JVMs run on a slave node?

(135)How will you make changes to the default configuration files?

(136)Can I set the number of reducers to zero?

(137)What is the default port that the JobTracker listens on?

(138)How do you resolve an "unable to read options file" error when trying to import data from MySQL to HDFS?

(139)What problems have you faced when you are working on Hadoop code?

(140)How would you modify that solution to count only the number of unique words in all the
documents?

(141)What is the difference between Hadoop, a relational database and NoSQL?

(142)How are the HDFS blocks replicated?

(143)What is a Task instance in Hadoop? Where does it run?

(144)What is the meaning of replication factor?

(145)If reducers do not start before all mappers finish, then why does the progress on a MapReduce
job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed
when the mappers are not finished yet?

(146)How does the client communicate with HDFS?


(147)Which object can be used to get the progress of a particular job?
(148)What is the next step after the Mapper or MapTask?
(149)What are the default configuration files that are used in Hadoop?
(150)Does the MapReduce programming model provide a way for reducers to communicate with each
other? In a MapReduce job, can a reducer communicate with another reducer?
(151)What is the HDFS block size? How is it different from the traditional file system block size?
(152)What is SPF (single point of failure)?
(153)Where do you specify the Mapper implementation?

(154)What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?


(155)Explain the core methods of the Reducer?
(156)What is Hadoop framework?
(157)Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple
directories as input to the Hadoop job

(158)How would you tackle counting words in several text documents?

(159)How does the master-slave architecture work in Hadoop?

(160)How would you tackle calculating the number of unique visitors for each hour by mining a huge
Apache log? You can use post processing on the output of the MapReduce job.

(161)How did you debug your Hadoop code ?

(162)How will you write a custom partitioner for a Hadoop job?

(163)How can you add the arbitrary key-value pairs in your mapper?

(164)what is a datanode?

(165)What are combiners? When should I use a combiner in my MapReduce Job?

(166)How Mapper is instantiated in a running job?

(167)Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?

(168)What happens if you don't override the Mapper methods and keep them as they are?

(169)What does a Hadoop application look like, i.e. what are its basic components?


(170)What is the meaning of speculative execution in Hadoop? Why is it important?

(170)What are the restrictions on the key and value classes?

(171)Explain the WordCount implementation via the Hadoop framework.

(172)What does the Mapper do?

(173)What is MapReduce?

(174)Explain the Reducer's sort phase.

(175)What are the primary phases of the Reducer?

(176)Explain the Reducer's reduce phase?

(177)Explain the shuffle?

(178)What happens if the number of reducers is 0?

(179)How many Reducers should be configured?

(180)What is Writable & WritableComparable interface?

(181)What is the Hadoop MapReduce API contract for a key and value Class?

(182)Where is the Mapper output (intermediate key-value data) stored?

(183)What is the difference between HDFS and NAS?

(184)What is Distributed Cache in Hadoop?

(185)Have you ever used Counters in Hadoop? Give an example scenario.

(186)Can we write a MapReduce program in a programming language other than Java? How?

(187)What alternate way does HDFS provide to recover data in case a NameNode, without backup,
fails and cannot be recovered?

(188)What is the use of Context object?

(189)What is the Reducer used for?

(190)What is the use of Combiner?


(191)Explain the input and output data formats of the Hadoop framework.

(192)What are compute and storage nodes?

(193)What is a NameNode?

(194)How does the Mapper's run() method work?

(195)What is the default replication factor in HDFS?

(196)Is it possible that a job has 0 reducers?

(197)How many maps are there in a particular Job?

(198)How many instances of JobTracker can run on a Hadoop cluster?

(199)How can we control which specific reducer a particular key goes to?

(200)What is the typical size of an HDFS block?

(201)What do you understand about Object Oriented Programming (OOP)? Use Java examples.

(202)What are the main differences between versions 1.5 and version 1.6 of Java?

(203)Describe what happens to a MapReduce job from submission to output?

(204)What mechanism does the Hadoop framework provide to synchronize changes made in the
Distributed Cache during runtime of the application?

(205)Did you ever build a production process in Hadoop? If yes, what was the process when
your Hadoop job failed for any reason?

(206)Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, how did
you handle it?

(207)What is HDFS ? How it is different from traditional file systems?

(208)What is the benefit of Distributed Cache? Why can't we just keep the file in HDFS and have the
application read it?

(209)How JobTracker schedules a task?

(210)How many Daemon processes run on a Hadoop system?

(211)What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a
slave node?


(213)What is the difference between HDFS and NAS ?

(214)How does the NameNode handle DataNode failures?

(215)Does the MapReduce programming model provide a way for reducers to communicate with each
other? In a MapReduce job, can a reducer communicate with another reducer?

(216)Where is the Mapper output (intermediate key-value data) stored?

(217)What are combiners? When should I use a combiner in my MapReduce job?

(218)What are IdentityMapper and IdentityReducer in MapReduce?

(219)When are the reducers started in a MapReduce job?

(220)If reducers do not start before all mappers finish, then why does the progress on a MapReduce
job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed
when the mappers are not finished yet?

(221)What is the HDFS block size? How is it different from the traditional file system block size?

(222)How does the client communicate with HDFS?

(223)What is NoSQL?

(224)We already have SQL, so why NoSQL?

(225)What is the difference between SQL and NoSQL?

(226)Does NoSQL follow the relational DB model?

(227)Why would NoSQL be better than using a SQL Database? And how much better is it?

(228)What do you understand by Standalone (or local) mode?

(229)What is Pseudo-distributed mode?

(230)What does /var/hadoop/pids do?

(231)Pig for Hadoop - Give some points?

(232)Hive for Hadoop - Give some points?

(233)File permissions in HDFS?

(234)What are ODBC and JDBC connectivity in Hive?

(235)What is Derby database?

(236)What is Schema on Read and Schema on Write?

(237)What infrastructure do we need to process 100 TB data using Hadoop?

(238)What is Internal and External table in Hive?

(239)What is the small files problem in Hadoop?

(240)How does a client read/write data in HDFS?

(241)What should be the ideal replication factor in Hadoop?

(242)What is the optimal block size in HDFS?

(243)Explain metadata in the NameNode.

(244)How do you enable the recycle bin (trash) in Hadoop?

(245)What is the difference between int and IntWritable?

(246)How do you change the replication factor (for the cases below)?

(247)In MapReduce, why does the map write its output to local disk instead of HDFS?

(248)Explain rack awareness of the NameNode.


(250)What is bucketing in Hive?

(251)What is Clustering in Hive?

(252)What type of data should we put in the Distributed Cache? When should we put the data in the DC? How
much volume should we put in?

(253)What is Distributed Cache?

(254)What is a Partitioner in Hadoop? Where does it run, mapper or reducer?

(255)What are the new and old MapReduce APIs used when writing a MapReduce program? Explain how they
work.

(256)How to write a Custom Key Class?

(257)What is the utility of using WritableComparable (a custom class) in MapReduce code?

(258)What are InputFormat, InputSplit & RecordReader, and what do they do?

(259)Why do we use IntWritable instead of int? Why do we use LongWritable instead of long?

(260)How to enable Recycle bin in Hadoop?

(261)If data is present in HDFS and the replication factor is defined, how can we change the replication factor?

(262)How can we change the replication factor while data is on the fly?

(262)How do you resolve the error: mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory
/user/hadoop/inpdata. Name node is in safe mode?

(263)What does Hadoop do in safe mode?

(264)What should be the ideal replication factor in Hadoop Cluster?

(265)What is the heartbeat in Hadoop?

(266)What should be considered while doing hardware planning for the master in a Hadoop
architecture?

(267)When should a Hadoop archive be created?

(268)What factors does the block size take into account before creation?

(269)In which location does the NameNode store its metadata, and why?

(270)Should we use RAID in Hadoop or not?

(271)How blocks are distributed among all data nodes for a particular chunk of data?

(272)How to enable Trash/Recycle Bin in Hadoop?

(273)What is a Hadoop archive?

(274)How do you create a Hadoop archive?

(275)How can we take Hadoop out of safe mode?

(276)What is safe mode in Hadoop?

(277)Why is MapReduce map output written to local disk?

(278)When does Hadoop enter safe mode?

(279)DataNode block size in HDFS: why 64 MB?

(280)What is "Non DFS Used"?

(281)Virtual Box & Ubuntu Installation

(282)What is Rack awareness?

(283)On what basis name node distribute blocks across the data nodes?

(284)What is Output Format in hadoop?

(285)How do you write data to HBase using Flume?

(286)What is the difference between the memory channel and the file channel in Flume?

(287)How do you create a table in Hive for a JSON input file?

(288)What is speculative execution in Hadoop?

(289)What is a Record Reader in hadoop?

(290)How do you resolve the following error while running a query in Hive: Error in metadata: Cannot
validate serde?

(291)What is difference between internal and external tables in hive?

(292)What is Bucketing and Clustering in Hive?

(293)How to enable/configure the compression of map output data in hadoop?

(294)What is InputFormat in hadoop?

(295)How to configure hadoop to reuse JVM for mappers?

(296)What is difference between split and block in hadoop?

(297)What is Input Split in hadoop?

(298)How can one write custom record reader?

(299)What is balancer? How to run a cluster balancing utility?

(300)What is version-id mismatch error in hadoop?

(301)How to handle bad records during parsing?

(302)What is identity mapper and reducer? In which cases can we use them?

(303)What are reduce-only jobs?

(304)What is crontab? Explain with suitable example.

(305)Safe-mode exceptions

(306)What is the meaning of the term "non-DFS used" in Hadoop web-console?

(307)What is an AMI?

(308)Can we submit the mapreduce job from slave node?

(309)How do you resolve the small files problem in HDFS?

(310)How do you overwrite an existing output file during execution of MapReduce jobs?

(311)What is difference between reducer and combiner?

(311)What do you understand by node redundancy, and does it exist in a Hadoop cluster?

(312)How do you proceed to write your first MapReduce program?

(313)How do you change the replication factor of files already stored in HDFS?

(314)How do you resolve "IOException: Cannot create directory" while formatting the NameNode in Hadoop?

(315)How can one set space quota in Hadoop (HDFS) directory

(316)How can one increase replication factor to a desired value in Hadoop?

Are there any problems which can only be solved by MapReduce and cannot be solved by Pig? In
which kinds of scenarios will MR jobs be more useful than Pig?
2. How can we change the split size if our commodity hardware has less storage space?
3. What is the difference between an HDFS Block and an Input Split?
4. How can we check whether the Namenode is working or not?
5. Why do we need password-less SSH in a Fully Distributed environment?
6. Give some details about SSH communication between the masters and the slaves.
7. Why is Replication pursued in HDFS in spite of its data redundancy?
Difference between map-side join and reduce-side join?
Difference between static and dynamic partitioning?
What is safe mode?

How do you avoid SELECT * kinds of queries in Hive?


What are sequence files?
What are map files?
There are 3 input files. Write an MR program for word count such that the output is in 3 different
files corresponding to the respective word counts of the 3 input files.
8. How can the number of mappers be controlled?
Different configuration files in Hadoop?
Different modes of execution?
Explain JVM profiling?
Load balancing in an HDFS cluster?
Difference between partitioning and bucketing?
Difference between managed and external tables?
Explain how performance tuning is done in Hive?
Explain about MRUnit?
Command for moving data from one cluster to another cluster?
Difference between RC and ORC file formats?
How do you check the schema of a table in Hive?
What is metadata? Where is it stored in Hive?

For a Hadoop developer, the questions most frequently asked during interviews are:
1. What is shuffling in MapReduce?
2. Difference between an HDFS block and a split?
3. What are map files in Hadoop?
4. What is the use of the .pagination class?
5. What are the core components of Hadoop?

1.What is Apache Spark?


Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced execution
engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, standalone
or in the cloud and is capable of accessing diverse data sources including HDFS, HBase, Cassandra
and others.
2.Explain key features of Spark.

Allows Integration with Hadoop and files included in HDFS.


Spark has an interactive language shell, as it has an independent interpreter for Scala (the language in
which Spark is written).
Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing
nodes in a cluster.
Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis
and graph processing.

3.Define RDD.
RDD is the acronym for Resilient Distributed Datasets, a fault-tolerant collection of operational
elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are
primarily two types of RDD:
Parallelized collections: existing collections from the driver program, parallelized so that they run in parallel with one another
Hadoop datasets: RDDs that perform a function on each file record in HDFS or another storage system

4.What does a Spark Engine do?


Spark Engine is responsible for scheduling, distributing and monitoring the data application across
the cluster.
5.Define Partitions.
As the name suggests, a partition is a smaller, logical division of data, similar to a split in
MapReduce. Partitioning is the process of deriving logical units of data to speed up processing.
Everything in Spark is a partitioned RDD.
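A minimal pyspark illustration (assuming the pyspark shell, where sc is the SparkContext; the numbers are made up):

data = range(0, 100)
rdd = sc.parallelize(data, 4)        # explicitly request 4 partitions
print(rdd.getNumPartitions())        # -> 4; each partition is processed by a separate parallel task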
6.What operations does an RDD support?
Transformations
Actions

7.What do you understand by Transformations in Spark?


Transformations are functions applied on an RDD, resulting in another RDD. They do not execute until
an action occurs. map() and filter() are examples of transformations: the former applies the
function passed to it to each element of the RDD and results in another RDD, and filter() creates a
new RDD by selecting elements from the current RDD that pass the function argument.
8. Define Actions.
An action brings the data back from an RDD to the local machine. An action's execution is the
result of all previously created transformations. reduce() is an action that applies the function
passed to it again and again until one value is left. take() brings values from the RDD back to the
local node.
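As an illustrative pyspark sketch of the above (the values are made up; sc is the pyspark shell's SparkContext), map() and filter() only build up the RDD graph, and nothing runs until an action such as reduce() or take() is called:

nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)              # transformation: lazily defines a new RDD
evens = squares.filter(lambda x: x % 2 == 0)     # transformation: still nothing has executed
total = evens.reduce(lambda x, y: x + y)         # action: triggers the actual computation (20)
first_two = evens.take(2)                        # action: brings the first 2 values to the driver ([4, 16])
print(total, first_two)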
9.Define functions of SparkCore.

Serving as the base engine, SparkCore performs various important functions like memory
management, monitoring jobs, fault-tolerance, job scheduling and interaction with storage systems.
10.What is RDD Lineage?
Spark does not support data replication in memory, and thus, if any data is lost, it is rebuilt using
RDD lineage. RDD lineage is the process that reconstructs lost data partitions. The best part is that an
RDD always remembers how to rebuild itself from other datasets.
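The lineage itself can be inspected; a tiny sketch (pyspark shell assumed, values made up):

rdd = sc.parallelize(range(10)).map(lambda x: x + 1).filter(lambda x: x > 5)
print(rdd.toDebugString())   # shows the chain of parent RDDs Spark would replay to rebuild lost partitions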
11.What is Spark Driver?
The Spark Driver is the program that runs on the master node of the cluster and declares
transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext,
connected to a given Spark Master.
The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
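For illustration, a PySpark driver program typically begins by creating its own SparkContext; the master URL and application name below are hypothetical:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("spark://master-host:7077").setAppName("MyDriverApp")
sc = SparkContext(conf=conf)          # the driver creates the SparkContext
rdd = sc.parallelize([1, 2, 3])       # transformations and actions are declared here, in the driver
print(rdd.count())
sc.stop()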
12.What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
13.Name commonly-used Spark Ecosystems.
Spark SQL (Shark)- for developers
Spark Streaming for processing live data streams
GraphX for generating and computing graphs
MLlib (Machine Learning Algorithms)
SparkR to promote R Programming in Spark engine.

14.Define Spark Streaming.


Spark supports stream processing through Spark Streaming, an extension to the Spark API that allows
stream processing of live data streams. Data from different sources like Flume and HDFS is streamed and
finally processed to file systems, live dashboards and databases. It is similar to batch processing in that
the input data is divided into small batches.
15.What is GraphX?
Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX
component enables programmers to reason about structured data at scale.
16.What does MLlib do?
MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning
easy and scalable, with common learning algorithms and use cases like clustering, regression,
filtering, dimensionality reduction, and the like.
17.What is Spark SQL?

Spark SQL, earlier known as Shark, is a module introduced in Spark to work with structured
data and perform structured data processing. Through this module, Spark executes relational SQL
queries on the data. The core of the component supports a different kind of RDD called a
SchemaRDD, composed of Row objects and schema objects defining the data type of each column in
the row. It is similar to a table in a relational database.
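A minimal sketch of running a relational query this way (written against the Spark 1.x Python API to match the SchemaRDD description above; the table and column names are made up):

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)                        # assumes an existing SparkContext sc
people = sc.parallelize([Row(name="alice", age=30), Row(name="bob", age=19)])
schemaPeople = sqlContext.createDataFrame(people)  # infers the schema from the Row objects
schemaPeople.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
print(adults.collect())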
18.What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL
performs both read and write operations with Parquet files and considers it to be one of the best
big data analytics formats so far.
19.What file systems Spark support?
Hadoop Distributed File System (HDFS)
Local File system
S3
20.What is YARN?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource
management platform to deliver scalable operations across the cluster. Running Spark on YARN
necessitates a binary distribution of Spark that is built with YARN support.
21.List the functions of Spark SQL.
Spark SQL is capable of:
Loading data from a variety of structured sources
Querying data using SQL statements, both inside a Spark program and from external tools that
connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using
business intelligence tools like Tableau
Providing rich integration between SQL and regular Python/Java/Scala code, including the ability
to join RDDs and SQL tables, expose custom functions in SQL, and more
22.What are the benefits of Spark over MapReduce?
Due to the availability of in-memory processing, Spark executes processing around 10-100x
faster than Hadoop MapReduce, which makes use of persistent storage for all of its data
processing tasks.
Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as
batch processing, streaming, machine learning and interactive SQL queries, whereas Hadoop only
supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset. This is called
iterative computation, while there is no iterative computing implemented by Hadoop.
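The caching/iterative point can be illustrated with a short pyspark sketch (the HDFS path is reused from the examples later in this post and is illustrative only):

data = sc.textFile("hdfs://namenode:9000/user/kalyan/bigtextfile.txt")
data.cache()                                               # keep the RDD in memory after the first computation
print(data.count())                                        # first action reads from disk and populates the cache
print(data.filter(lambda line: "spark" in line).count())   # the second pass reuses the in-memory data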

23.Is there any benefit of learning MapReduce, then?


Yes. MapReduce is a paradigm used by many big data tools, including Spark. It is extremely
relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive
convert their queries into MapReduce phases to optimize them better.

24.What is Spark Executor?


When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster.
Executors are Spark processes that run computations and store the data on the worker nodes. The
final tasks from SparkContext are transferred to executors for execution.
25.Name types of Cluster Managers in Spark.
The Spark framework supports three major types of Cluster Managers:
Standalone: a basic manager to set up a cluster
Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and
other applications
Yarn: responsible for resource management in Hadoop

26.What do you understand by worker node?


Worker node refers to any node that can run the application code in a cluster.
27.What is PageRank?
PageRank, a distinctive feature and algorithm in GraphX, is the measure of each vertex in a graph. For
instance, an edge from u to v represents an endorsement of v's importance by u. In simple terms, if a
user on Instagram is followed massively, that user will rank high on that platform.
28.Do you need to install Spark on all nodes of Yarn cluster while running Spark on Yarn?
No because Spark runs on top of Yarn.
29.Illustrate some demerits of using Spark.
Since Spark utilizes more memory compared to Hadoop and MapReduce, certain problems may
arise. Developers need to be careful while running their applications in Spark. Instead of
running everything on a single node, the work must be distributed over the cluster.
30.How to create an RDD?
Spark provides two methods to create an RDD (both are sketched below):
By parallelizing a collection in your driver program. This makes use of SparkContext's parallelize
method:
val data = Array(2,4,6,8,10)
val distData = sc.parallelize(data)
By loading an external dataset from external storage like HDFS, HBase or a shared file system.
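The same two methods in pyspark, as a sketch (the HDFS path is illustrative, not from the original text):

data = [2, 4, 6, 8, 10]
distData = sc.parallelize(data)                                               # method 1: parallelize a driver-side collection
distFile = sc.textFile("hdfs://namenode:9000/user/kalyan/mynumbersfile.txt")  # method 2: load an external dataset
print(distData.count(), distFile.count())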

Interview Questions & Answers on Apache Spark [Part 2]
Q1: Say I have a huge list of numbers in an RDD (say myrdd), and I wrote the following code to
compute the average:
def myAvg(x, y):
    return (x+y)/2.0;
avg = myrdd.reduce(myAvg);
What is wrong with it? And how would you correct it?
Ans: The average function is commutative but not associative, so it is not a valid reducer;
I would simply sum the values and then divide by the count.
def sum(x, y):
    return x+y;
total = myrdd.reduce(sum);
avg = total / myrdd.count();
The only problem with the above code is that the total might become very big and thus overflow. So, I
would rather divide each number by the count and then sum, in the following way:
cnt = myrdd.count();
def divideByCnt(x):
    return x/cnt;
myrdd1 = myrdd.map(divideByCnt);
avg = myrdd1.reduce(sum);

Q2: Say I have a huge list of numbers in a file in HDFS. Each line has one number, and I want to
compute the square root of the sum of squares of these numbers. How would you do it?
Ans:
# We would first load the file as an RDD from HDFS on Spark
numsAsText = sc.textFile("hdfs://namenode:9000/user/kayan/mynumbersfile.txt");
# Define the function to compute the squares
def toSqInt(str):
    v = int(str);
    return v*v;
# Run the function on the Spark RDD as a transformation
nums = numsAsText.map(toSqInt);
# Run the summation as a reduce action (sum here is the two-argument helper defined in Q1)
total = nums.reduce(sum)
# Finally compute the square root. For this we need to import math.
import math;
print math.sqrt(total);

Q3: Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?

numsAsText = sc.textFile("hdfs://namenode:9000/user/kalyan/mynumbersfile.txt");
def toInt(str):
    return int(str);
nums = numsAsText.map(toInt);
import math;
def sqrtOfSumOfSq(x, y):
    return math.sqrt(x*x+y*y);
total = nums.reduce(sqrtOfSumOfSq);
print total;
Ans: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer.

Q4: Could you compare the pros and cons of your approach (in Question 2 above) and my
approach (in Question 3 above)?
Ans:
You are doing the square and the square root as part of the reduce action, while I am squaring in map()
and summing in reduce in my approach.
My approach will be faster, because in your case the reducer code is heavy, as it is calling math.sqrt(),
and the reducer code is generally executed approximately n-1 times over the Spark RDD.
The only downside of my approach is that there is a huge chance of integer overflow because I am
computing the sum of squares as part of map.

Q5: If you have to compute the total counts of each of the unique words on Spark, how would you
go about it?

Ans:
# This will load bigtextfile.txt as an RDD in Spark
lines = sc.textFile("hdfs://namenode:9000/user/kalyan/bigtextfile.txt");

# Define a function that can break each line into words
def toWords(line):
    return line.split();

# Run the toWords function on each element of the RDD on Spark as a flatMap transformation.
# We use flatMap instead of map because our function returns multiple values.
words = lines.flatMap(toWords);

# Convert each word into a (key, value) pair. Here the key will be the word itself and the value will be 1.
def toTuple(word):
    return (word, 1);
wordsTuple = words.map(toTuple);

# Now we can easily do the reduceByKey() step.
def sum(x, y):
    return x+y;
counts = wordsTuple.reduceByKey(sum)

# Now, print the result
counts.collect()
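A small variant of this is often asked (cf. question 140 earlier): counting only the number of unique words. Assuming the words RDD built above, a minimal sketch is:

# count distinct words instead of per-word totals
uniqueWordCount = words.distinct().count()
print(uniqueWordCount)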
Q6: In a very huge text file, you want to just check if a particular keyword exists. How would you
do this using Spark?
Ans:
# mykeyword is assumed to hold the keyword being searched for
lines = sc.textFile("hdfs://namenode:9000/user/kalyan/bigtextfile.txt");
def isFound(line):
    if line.find(mykeyword) > -1:
        return 1;
    return 0;
foundBits = lines.map(isFound);
total = foundBits.reduce(sum);   # sum is the two-argument helper defined earlier
if total > 0:
    print "FOUND";
else:
    print "NOT FOUND";

Q7: Can you improve the performance of the code in the previous answer?
Ans: Yes.
The search does not stop even after the word we are looking for has been found. Our map code
would keep executing on all the nodes, which is very inefficient.

We could utilize accumulators to report whether the word has been found or not and then stop the
job. Something along these lines:
import thread, threading
from time import sleep
result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)
def map_func(line):
    # introduce delay to emulate the slowness
    sleep(1);
    if line.find("Adventures") > -1:
        accum.add(1);
        return 1;
    return 0;
def start_job():
    global result
    try:
        sc.setJobGroup("job_to_cancel", "some description")
        lines = sc.textFile("hdfs://namenode:9000/user/kalyan/wordcount/input/big.txt");
        result = lines.map(map_func);
        result.take(1);
    except Exception as e:
        result = "Cancelled"
    lock.release()
def stop_job():
    while accum.value < 3:
        sleep(1);
    sc.cancelJobGroup("job_to_cancel")
supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
supress = lock.acquire()

Interview Questions & Answers on Apache Spark [Part 1]
Q1: When do you use apache spark? OR What are the benefits of Spark over Mapreduce?
Ans:
Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk. It aptly utilizes RAM to produce faster results.

In the MapReduce paradigm, you write many MapReduce tasks and then tie these tasks together using
Oozie/shell scripts. This mechanism is very time consuming and the MapReduce tasks have heavy
latency.
And quite often, translating the output of one MR job into the input of another MR job might
require writing more code, because Oozie may not suffice.
In Spark, you can basically do everything from a single application/console (pyspark or the Scala console)
and get the results immediately. Switching between 'running something on a cluster' and 'doing
something locally' is fairly easy and straightforward. This also leads to less context switching for the
developer and more productivity.
Spark is kind of equal to MapReduce and Oozie put together.
Q2: Is there any point in learning MapReduce, then?
Ans: Yes, for the following reasons:
MapReduce is a paradigm used by many big data tools, including Spark. So, understanding the
MapReduce paradigm and how to convert a problem into a series of MR tasks is very important.
When the data grows beyond what can fit into the memory of your cluster, the Hadoop MapReduce paradigm is still very relevant.
Almost every other tool, such as Hive or Pig, converts its query into MapReduce phases. If you
understand MapReduce then you will be able to optimize your queries better.
Q3: When running Spark on YARN, do I need to install Spark on all nodes of the YARN cluster?
Ans:
Since Spark runs on top of YARN, it utilizes YARN for the execution of its commands over the cluster's
nodes.
So, you just have to install Spark on one node.
Q4: What are the downsides of Spark?
Ans:
Spark utilizes memory, so the developer has to be careful. A casual developer might make the
following mistakes:
She may end up running everything on the local node instead of distributing work over to the
cluster.
She might hit some web service too many times by way of using multiple clusters.
The first problem is well tackled by the Hadoop MapReduce paradigm, as it ensures that the data your
code is churning is fairly small at any point in time, so you cannot make the mistake of trying to handle
the whole data on a single node.
The second mistake is possible in MapReduce too. While writing MapReduce, a user may hit a
service from inside map() or reduce() too many times. This overloading of a service is also possible
while using Spark.
Q5: What is an RDD?
Ans:

The full form of RDD is Resilient Distributed Dataset. It is a representation of data located on a
network which is:
Immutable - You can operate on the RDD to produce another RDD, but you can't alter it.
Partitioned / Parallel - The data located in an RDD is operated on in parallel. Any operation on an RDD is
done using multiple nodes.
Resilient - If one of the nodes hosting a partition fails, another node takes over its data.
An RDD provides two kinds of operations: Transformations and Actions.
Q6: What are Transformations?
Ans: Transformations are the functions that are applied on an RDD (resilient distributed data
set). A transformation results in another RDD. A transformation is not executed until an action
follows.
Examples of transformations are:
map() - applies the function passed to it on each element of the RDD, resulting in a new RDD.
filter() - creates a new RDD by picking the elements from the current RDD which pass the function
argument.
Q7: What are Actions?
Ans:
An action brings the data back from the RDD to the local machine. Execution of an action results in
the execution of all the previously created transformations. Examples of actions are:
reduce() - executes the function passed to it again and again until only one value is left. The function
should take two arguments and return one value.
take() - brings values from the RDD back to the local node.
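A tiny pyspark illustration of the two actions just described (values are made up; sc is the pyspark shell's SparkContext):

rdd = sc.parallelize([5, 3, 8, 1])
print(rdd.reduce(lambda x, y: x + y))   # repeatedly combines two values until one is left -> 17
print(rdd.take(2))                      # brings the first 2 values back to the local node -> [5, 3]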

Hadoop Developer Interview Questions


Explain how Hadoop is different from other parallel computing solutions.
What are the modes Hadoop can run in?
What will a Hadoop job do if developers try to run it with an output directory that is already
present?
How can you debug your Hadoop code?

Did you ever build a production process in Hadoop? If yes, what was the process when your
Hadoop job failed due to any reason? (Open-ended question)
Give some examples of companies that are using Hadoop architecture extensively.
Hadoop Admin Interview Questions
If you want to analyze 100TB of data, what is the best architecture for that?
Explain about the functioning of Master Slave architecture in Hadoop?
What is distributed cache and what are its benefits?
What are the points to consider when moving from an Oracle database to Hadoop clusters?
How would you decide the correct size and number of nodes in a Hadoop cluster?
How do you benchmark your Hadoop Cluster with Hadoop tools?
Hadoop Interview Questions on HDFS
Explain the major difference between an HDFS block and an InputSplit.
Does HDFS make block boundaries between records?
What is streaming access?
What do you mean by Heartbeat in HDFS?
If there are 10 HDFS blocks to be copied from one machine to another, but the other
machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down
during the time of replication?
What is Speculative execution in Hadoop?
What is WebDAV in Hadoop?
What is fault tolerance in HDFS?

How are HDFS blocks replicated?


Which command is used to do a file system check in HDFS?
Explain about the different types of writes in HDFS.
Hadoop MapReduce Interview Questions
What is a NameNode and what is a DataNode?
What is Shuffling in MapReduce?
Why would a Hadoop developer develop a MapReduce job with the reduce step disabled?
What is the functionality of Task Tracker and Job Tracker in Hadoop? How many instances of
a Task Tracker and Job Tracker can be run on a single Hadoop Cluster?
How does NameNode tackle DataNode failures?
What is InputFormat in Hadoop?
What is the purpose of RecordReader in Hadoop?
What is InputSplit in MapReduce?
In Hadoop, if a custom partitioner is not defined, how is data partitioned before it is sent to
the reducer?
What is replication factor in Hadoop and what is default replication factor level Hadoop
comes with?
What is SequenceFile in Hadoop and Explain its importance?
If you are the user of a MapReduce framework, then what are the configuration parameters
you need to specify?
Explain about the different parameters of the mapper and reducer functions.

How can you set an arbitrary number of mappers and reducers for a Hadoop job?
How many Daemon processes run on a Hadoop System?
What happens if the number of reducers is 0?
What is meant by Map-side and Reduce-side join in Hadoop?
How can the NameNode be restarted?
Hadoop attains parallelism by isolating tasks across various nodes; it is possible for some
of the slow nodes to rate-limit the rest of the program and slow it down. What
method does Hadoop provide to combat this?
What is the significance of conf.setMapperClass?
What are combiners and when are these used in a MapReduce job?
How does a DataNode know the location of the NameNode in Hadoop cluster?
How can you check whether the NameNode is working or not?
Pig Interview Questions
When doing a join in Hadoop, you notice that one reducer is running for a very long time.
How will you address this problem in Pig?
Are there any problems which can only be solved by MapReduce and cannot be solved by
Pig? In which kinds of scenarios will MR jobs be more useful than Pig?
Give an example scenario on the usage of counters.
Hive Interview Questions
Explain the difference between ORDER BY and SORT BY in Hive?
Differentiate between HiveQL and SQL.

Gartner predicted that "the Big Data movement will generate 4.4 million new IT jobs by the end of 2015
and Hadoop will be in most advanced analytics products by 2015." With the increasing demand for
Hadoop for Big Data related issues, the prediction by Gartner is ringing true.
During March 2014, there were approximately 17,000 Hadoop Developer jobs advertised online. As
of 4th April 2015, there are about 50,000 job openings for Hadoop Developers across the world,
with close to 25,000 openings in the US alone. Of the 3,000 Hadoop students that we have trained so
far, the most popular blog article request was one on Hadoop interview questions.
There are 4 steps which you must take if you are trying to get a job in emerging technology
domains:
Carefully outline the roles and responsibilities
Make your resume highlight the required core skills
Document each and every step of your efforts
Purposefully Network
With more than 30,000 open Hadoop developer jobs, professionals must familiarize themselves with
each and every component of the Hadoop ecosystem to make sure that they have a deep
understanding of what Hadoop is, so that they can form an effective approach to a given big data
problem.
With the help of Hadoop Instructors, we have put together a detailed list of Hadoop latest interview
questions based on the different components of the Hadoop Ecosystem such as MapReduce, Hive,
HBase, Pig, YARN, Flume, Sqoop, HDFS, etc.
Hadoop Basic Interview Questions
What is Big Data?
Any data that cannot be stored into traditional RDBMS is termed as Big Data. As we know most of
the data that we use today has been generated in the past 20 years. And this data is mostly
unstructured or semi structured in nature. More than the volume of the data it is the nature of the
data that defines whether it is considered as Big Data or not.
What do the four Vs of Big Data denote?
IBM has a nice, simple explanation of the four critical features of big data:
a) Volume - Scale of data
b) Velocity - Analysis of streaming data
c) Variety - Different forms of data
d) Veracity - Uncertainty of data
Hadoop HDFS Interview Questions
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a block
in HDFS. The default size of a block in HDFS is 64MB.
Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to
find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk
bandwidth on the datanode.
Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: The NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the
data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files
present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:
fsimage file - it keeps track of the latest checkpoint of the namespace.
edits file - it is a log of changes that have been made to the namespace since the checkpoint.
Checkpoint Node:
The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as that
of the NameNode's directory. The Checkpoint Node creates checkpoints for the namespace at regular
intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The
new image is then updated back to the active NameNode.
Backup Node:
The Backup Node also provides checkpointing functionality like that of the Checkpoint Node, but it also
maintains its up-to-date in-memory copy of the file system namespace, which is in sync with the active
NameNode.
MapReduce Interview Questions
Explain the usage of Context Object.
The Context Object is used to help the mapper interact with the rest of the Hadoop system. The Context
Object can be used for updating counters, reporting progress and providing any application-level status
updates. The Context Object has the configuration details for the job and also interfaces that help it
generate the output.
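In the Java MapReduce API these calls go through the Context object itself; in a Python Hadoop Streaming job, the equivalent counter and status updates are written to standard error, which the framework picks up. A hedged sketch (the group and counter names below are made up):

import sys

# inside a streaming mapper or reducer script:
sys.stderr.write("reporter:counter:MyJobCounters,BadRecords,1\n")   # increment a user-defined counter
sys.stderr.write("reporter:status:processing partition 3\n")        # report an application-level status message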
What are the core methods of a Reducer?

The 3 core methods of a reducer are:

1)setup() - This method of the reducer is used for configuring various parameters like the input
data size, distributed cache, heap size, etc.
Function definition - public void setup(Context context)
2)reduce() - It is the heart of the reducer and is called once per key with the associated reduce task.
Function definition - public void reduce(Key, Value, Context context)
3)cleanup() - This method is called only once at the end of the reduce task for clearing all the temporary
files.
Function definition - public void cleanup(Context context)
Hadoop HBase Interview Questions
When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has:
1)A variable schema
2)Data stored in the form of collections
3)A demand for key-based access to data while retrieving.
Key components of HBase are:
Region - this component contains the memory data store and the HFile.
Region Server - this monitors the Region.
HBase Master - it is responsible for monitoring the Region Server.
Zookeeper - it takes care of the coordination between the HBase Master component and the client.

Catalog Tables - the two important catalog tables are ROOT and META. The ROOT table tracks where the
META table is, and the META table stores all the regions in the system.
Hadoop Sqoop Interview Questions
Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which can import the table data from an RDBMS
table to HDFS. The following command is used to create a job that imports data from the
employee table in the db database to an HDFS file.
$ Sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
--list argument is used to verify the saved jobs. The following command is used to verify the list of
saved Sqoop jobs.
$ Sqoop job --list
Inspect Job (--show)
--show argument is used to inspect or verify particular jobs and their details. The following
command and sample output is used to verify a job called myjob.
$ Sqoop job --show myjob
Execute Job (--exec)
--exec option is used to execute a saved job. The following command is used to execute a saved job
called myjob.
$ Sqoop job --exec myjob



Hadoop Flume Interview Questions
Explain about the core components of Flume.
The core components of Flume are
Event- The single log entry or unit of data that is transported.
Source- This is the component through which data enters Flume workflows.
Sink-It is responsible for transporting data to the desired destination.
Channel- it is the duct between the Sink and Source.
Agent- Any JVM that runs Flume.
Client - the component that transmits events to the source that operates with the agent.
Hadoop Zookeeper Interview Questions
Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without Zookeeper, because if Zookeeper is down, Kafka
cannot serve client requests.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
Pig Interview Questions
What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
Does Pig support multi-line commands?
Yes
Hive Interview Questions
What is a Hive Metastore?
Hive Metastore is a central repository that stores metadata in external database.

Are multiline comments supported in Hive?


No
Hadoop YARN Interview Questions
What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large-scale
distributed system for running big data applications.

What is Big Data?


Any data that cannot be stored into traditional RDBMS is termed as Big Data. As we know most of
the data that we use today has been generated in the past 20 years. And this data is mostly
unstructured or semi structured in nature. More than the volume of the data it is the nature of the
data that defines whether it is considered as Big Data or not.

What do the four Vs of Big Data denote?


IBM has a nice, simple explanation of the four critical features of big data:
a) Volume - Scale of data
b) Velocity - Analysis of streaming data
c) Variety - Different forms of data
d) Veracity - Uncertainty of data
How big data analysis helps businesses increase their revenue? Give example.
Big data analysis is helping businesses differentiate themselves. For example, Walmart, the world's
largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through
better predictive analytics, providing customized recommendations and launching new products
based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in
online sales, for $1 billion in incremental revenue. There are many more companies like Facebook,
Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc. using big data analytics to boost
their revenue.
Name some companies that use Hadoop.
Yahoo (One of the biggest user & more than 80% code contributor to Hadoop)
Facebook
Netflix
Amazon
Adobe
eBay
Hulu
Spotify
Rubikloud
Twitter
Differentiate between Structured and Unstructured data.
Data which can be stored in traditional database systems in the form of rows and columns, for
example the online purchase transactions can be referred to as Structured Data. Data which can be
stored only partially in traditional database systems, for example, data in XML records can be
referred to as semi structured data. Unorganized and raw data that cannot be categorized as semi
structured or structured data is referred to as unstructured data. Facebook updates, Tweets on
Twitter, Reviews, web logs, etc. are all examples of unstructured data.
On what concept the Hadoop framework works?
Hadoop Framework works on the following two core components:
1)HDFS - Hadoop Distributed File System is the Java-based file system for scalable and reliable
storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on the Master-
Slave architecture.
2)Hadoop MapReduce - This is a Java-based programming paradigm of the Hadoop framework that
provides scalability across various Hadoop clusters. MapReduce distributes the workload into
various tasks that can run in parallel. Hadoop jobs perform 2 separate tasks. The map job breaks
down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map
job and combines the data tuples into a smaller set of tuples. The reduce job is always performed
after the map job is executed.


7) What are the main components of a Hadoop Application?


Hadoop applications have wide range of technologies that provide great advantage in solving
complex business problems.
Core components of a Hadoop application are:
Hadoop Common
HDFS
Hadoop MapReduce
YARN
Data Access Components are - Pig and Hive
Data Storage Component is - HBase
Data Integration Components are - Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper.
Data Serialization Components are - Thrift and Avro
Data Intelligence Components are - Apache Mahout and Drill.
What is Hadoop streaming?
Hadoop distribution has a generic application programming interface for writing Map and Reduce
jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop
Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper
or Reducer.
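For illustration only, a minimal Hadoop Streaming word count in Python might look like the sketch below; the file names are placeholders, not taken from the original article.

# mapper.py - reads lines from stdin and emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - Hadoop Streaming delivers the mapper output sorted by key,
# so counts can be accumulated per word and flushed when the word changes
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A typical invocation would pass these scripts to the hadoop-streaming jar with the -mapper, -reducer, -input and -output options (the exact jar path depends on the installation).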
What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB
or 8GB RAM that use ECC memory. Hadoop highly benefits from using ECC memory, though it is not
low-end. ECC memory is recommended for running Hadoop because most Hadoop users
have experienced various checksum errors by using non-ECC memory. However, the hardware
configuration also depends on the workflow requirements and can change accordingly.
What are the most commonly defined input formats in Hadoop?
The most common Input Formats defined in Hadoop are:
Text Input Format- This is the default input format defined in Hadoop.
Key Value Input Format- This input format is used for plain text files wherein the files are broken
down into lines.
Sequence File Input Format- This input format is used for reading files in sequence.

We have further categorized Big Data Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,7,8,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 3,8,9,10
Hadoop HDFS Interview Questions and Answers
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a block
in HDFS. The default size of a block in HDFS is 64MB.
Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to
find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk
bandwidth on the datanode.
Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system. It manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a hadoop cluster. NameNode uses two files for the namespace-
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of changes that have been made to the namespace since the last checkpoint.
Checkpoint Node-
Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. The checkpoint node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
BackupNode:
Backup Node also provides check pointing functionality like that of the checkpoint node but it also
maintains its up-to-date in-memory copy of the file system namespace that is in sync with the active
NameNode.
What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed in RAM. Hadoop can be run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode 50070
Job Tracker 50030
Task Tracker 50060
Explain about the process of inter cluster data copying.
HDFS provides a distributed data copying facility through DistCP from source to destination. When this data copying takes place between two different hadoop clusters it is referred to as inter cluster data copying. DistCP requires both the source and the destination to have a compatible or same version of hadoop.
How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways-
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below-
$ hadoop fs -setrep -w 2 /my/test_file
(test_file is the filename whose replication factor will be set to 2)
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below-
$ hadoop fs -setrep -w 5 /my/test_dir
(test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
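The same change can also be made programmatically through the HDFS Java API. Below is a minimal sketch (not from the original text), assuming the file /my/test_file from the example above already exists in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of a single existing file to 2
        boolean changed = fs.setReplication(new Path("/my/test_file"), (short) 2);
        System.out.println("Replication factor changed: " + changed);
        fs.close();
    }
}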
Explain the difference between NAS and HDFS.
NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of different machines and provides data redundancy through its replication protocol.
NAS stores data on dedicated hardware whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computation is moved to the data.
Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1
instead of the default value 3.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are to be replicated to ensure high data availability. For every block stored in HDFS with replication factor n, the cluster holds n copies in total, i.e. n-1 duplicates. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, there will be a single copy of the data. Under these circumstances, if the DataNode holding that copy crashes, the only copy of the data is lost.
What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in a file, or multiple writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
Explain about the indexing process in HDFS.
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that
further points to the address where the next part of data chunk is stored.
What is rack awareness and on what basis is data stored in a rack?
All the data nodes put together form a storage area i.e. the physical location of the data nodes is
referred to as Rack in HDFS. The rack information i.e. the rack id of each data node is acquired by
the NameNode. The process of selecting closer data nodes depending on the rack information is
known as Rack Awareness.
The contents of a file are divided into data blocks as soon as the client is ready to load the file into the hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, two copies exist in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.
We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced-

Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3,7,9,10,11


Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2, 4,5,6,7,8
Hadoop MapReduce Interview Questions and Answers
Explain the usage of Context Object.
The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing any application-level status updates. The Context object also carries the configuration details for the job and the interfaces that help it generate output.
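A minimal sketch of a Mapper that uses the Context object, assuming the newer org.apache.hadoop.mapreduce API; the counter group and counter name below are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineCountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Update a custom counter through the Context object
        context.getCounter("MyApp", "LINES_SEEN").increment(1);
        // Report an application-level status update
        context.setStatus("Processing offset " + key.get());
        // Emit output through the Context object
        context.write(new Text("lines"), ONE);
    }
}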
What are the core methods of a Reducer?
The 3 core methods of a reducer are-
1) setup() - This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
Function definition - public void setup(Context context)
2) reduce() - The heart of the reducer, called once per key with the associated list of values.
Function definition - public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() - This method is called only once at the end of the reduce task, for clearing all temporary files.
Function definition - public void cleanup(Context context)
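A skeleton reducer showing the three methods, assuming the newer org.apache.hadoop.mapreduce API; the sum-per-key logic is only an illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // One-time configuration, e.g. reading job parameters or the distributed cache
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with the associated list of values
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // One-time teardown, e.g. deleting temporary files
    }
}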
Explain about the partitioning, shuffle and sort phase
Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform several other
map tasks and also exchange the intermediate outputs with the reducers as required. This process
of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.
Sort Phase- Hadoop MapReduce automatically sorts the set of intermediate keys on a single node
before they are given as input to the reducer.
Partitioning Phase-The process that determines which intermediate keys and value will be received
by each reducer instance is referred to as partitioning. The destination partition is same for any key
irrespective of the mapper instance that generated it.
How to write a custom partitioner for a Hadoop MapReduce job?
Steps to write a custom partitioner for a Hadoop MapReduce job-
A new class must be created that extends the pre-defined Partitioner class.
The getPartition method of the Partitioner class must be overridden.
The custom partitioner can be added to the job through a config file in the wrapper which runs Hadoop MapReduce, or it can be added to the job programmatically using the setPartitionerClass method on the Job.
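As an illustration of those steps, here is a minimal sketch of a custom partitioner; the key type and the routing rule (partition by first character of the key) are hypothetical.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        // Route keys by their first character so related keys land on the same reducer
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : Character.toUpperCase(k.charAt(0));
        return first % numReduceTasks;
    }
}

It would then be attached to the job in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).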
What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
Is it important for Hadoop MapReduce jobs to be written in Java?
It is not necessary to write Hadoop MapReduce jobs in java but users can write MapReduce jobs in
any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop
Streaming API.
What is the process of changing the split size if there is limited storage space on Commodity
Hardware?
If there is limited storage space on commodity hardware, the split size can be changed by
implementing the Custom Splitter. The call to Custom Splitter can be made from the main
method.
What are the primary phases of a Reducer?
The 3 primary phases of a reducer are
1)Shuffle
2)Sort
3)Reduce
What is a TaskInstance?
The actual hadoop MapReduce tasks that run on each slave node are referred to as task instances. Every task instance has its own JVM process; by default, a new JVM process is spawned for each task instance.
Can reducers communicate with each other?
Reducers always run in isolation and they can never communicate with each other as per the
Hadoop MapReduce programming paradigm.
We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,5,6
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,3,4,7,8,9,10

Hadoop HBase Interview Questions and Answers


When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has
1)A variable schema
2)When data is stored in the form of collections
3)If the application demands key based access to data while retrieving.
Key components of HBase are-
Region - This component contains the in-memory data store and the HFile.
Region Server - This monitors the Region.
HBase Master - It is responsible for monitoring the region server.
Zookeeper - It takes care of the coordination between the HBase Master component and the client.
Catalog Tables - The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
What are the different operational commands in HBase at record level and table level?
Record Level Operational Commands in HBase are put, get, increment, scan and delete.
Table Level Operational Commands in HBase are-describe, list, drop, disable and scan.
What is Row Key?
Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells
logically and it ensures that all cells that have the same RowKeys are co-located on the same server.
RowKey is internally regarded as a byte array.
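A short sketch of how a RowKey is supplied when writing a row, assuming the HBase 1.x client API; the table name, column family, qualifier and values are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // The RowKey is just a byte array; here "user#12345" identifies the row
            Put put = new Put(Bytes.toBytes("user#12345"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
        }
    }
}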
Explain the difference between RDBMS data model and HBase data model.
RDBMS is a schema-based database whereas HBase has a schema-less data model.

RDBMS does not have support for in-built partitioning whereas in HBase there is automated
partitioning.
RDBMS stores normalized data whereas HBase stores de-normalized data.
Explain about the different catalog tables in HBase?
The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
What is column families? What happens if you alter the block size of ColumnFamily on an already
populated database?
The logical division of data is represented through a key known as the column family. Column families are the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas the new data that comes in takes the new block size. When compaction takes place, the old data also takes the new block size so that the existing data is read correctly.
Explain the difference between HBase and Hive.
HBase and Hive both are completely different hadoop based technologies-Hive is a data warehouse
infrastructure on top of Hadoop whereas HBase is a NoSQL key value store that runs on top of
Hadoop. Hive helps SQL savvy people to run MapReduce jobs whereas HBase supports 4 primary
operations - put, get, scan and delete. HBase is ideal for real time querying of big data whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
Explain the process of row deletion in HBase.
On issuing a delete command in HBase through the HBase client, data is not actually deleted from
the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are
removed at regular intervals during compaction.
What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion-
1) Family Delete Marker - This marker marks all columns for a column family.
2) Version Delete Marker - This marker marks a single version of a column.
3) Column Delete Marker - This marker marks all the versions of a column.
Explain about HLog and WAL in HBase.

All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions performed by that particular region server. WAL stands for Write Ahead Log, and the HLog is where all these edits are written immediately. In case of deferred log flush, WAL edits remain in memory until the flush period.
We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced-

Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,5,7


Hadoop Interview Questions and Answers for Experienced - Q.Nos-2,3,6,8,9,10
Hadoop Sqoop Interview Questions and Answers
Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which can import table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database into HDFS.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
Verify Job (--list)
--list argument is used to verify the saved jobs. The following command is used to verify the list of
saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
--show argument is used to inspect or verify particular jobs and their details. The following
command and sample output is used to verify a job called myjob.

$ sqoop job --show myjob


Execute Job (--exec)
--exec option is used to execute a saved job. The following command is used to execute a saved job
called myjob.
$ sqoop job --exec myjob
How Sqoop can be used in a Java program?
The Sqoop jar should be included in the Java classpath. After this, the Sqoop.runTool() method must be invoked. The necessary parameters should be passed to Sqoop programmatically, just like on the command line.
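A minimal sketch of such an invocation, assuming the Sqoop 1.4.x client jar (org.apache.sqoop.Sqoop) is on the classpath; the connection string, credentials, table name and target directory are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopFromJava {
    public static void main(String[] args) {
        // Build the same arguments that would be passed on the sqoop command line
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://localhost/db",
            "--username", "root",
            "--table", "employee",
            "--target-dir", "/user/hadoop/employee",
            "-m", "1"
        };
        // Sqoop.runTool parses the arguments exactly like the command line client
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}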
What is the process to perform an incremental data load in Sqoop?
The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data is brought in through Sqoop's incremental load capability.
Incremental load can be performed by using the Sqoop import command, or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-
1) Mode (--incremental) - The mode defines how Sqoop determines which rows are new. It can take the value append or lastmodified.
2) Check column (--check-column) - This attribute specifies the column that should be examined to find out the rows to be imported.
3) Last value (--last-value) - This denotes the maximum value of the check column from the previous import operation.
Is it possible to do an incremental import using Sqoop?
Yes, Sqoop supports two types of incremental imports1)Append
2)Last Modified

To import only new rows, append should be used in the import command; to import rows that are new or have been updated, lastmodified should be used in the import command.
What is the standard location or path for Hadoop Sqoop scripts?
/usr/bin/Hadoop Sqoop
How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows-
sqoop list-tables --connect jdbc:mysql://localhost/user
How are large objects handled in Sqoop?
Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store-
1) CLOBs - Character Large Objects
2) BLOBs - Binary Large Objects
Large objects in Sqoop are handled by importing them into a file referred to as a LobFile, i.e. a Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object.
Can free form SQL queries be used with Sqoop import command? If yes, then how can they be
used?
Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the --query option (or its short form -e) to execute free form SQL queries. When using the --query option with the import command, the --target-dir value must be specified.
Differentiate between Sqoop and distCP.
DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer
data only between Hadoop and RDBMS.
What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into Hcatalog directly by making use of the --hcatalog-database option with --hcatalog-table, but the limitation is that several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.
We have further categorized Hadoop Sqoop Interview Questions for Freshers and Experienced-

Hadoop Interview Questions and Answers for Freshers - Q.Nos- 4,5,6,9


Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2,3,6,7,8,10
Hadoop Flume Interview Questions and Answers
Explain about the core components of Flume.
The core components of Flume are
Event- The single log entry or unit of data that is transported.
Source- This is the component through which data enters Flume workflows.
Sink-It is responsible for transporting data to the desired destination.
Channel- it is the duct between the Sink and Source.
Agent- Any JVM that runs Flume.
Client- The component that transmits event to the source that operates with the agent.
Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.
How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the
novel HBase IPC that was introduced in the version HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBase
sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink-
In HBaseSink, a Flume event is converted into HBase increments or puts. The serializer implements the HBaseEventSerializer interface and is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume event into HBase increments and puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink-
AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods, just as with the HBase sink. When the sink stops, the cleanUp method of the serializer is called.
Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are-
MEMORY Channel - Events are read from the source into memory and passed to the sink.
JDBC Channel - JDBC Channel stores the events in an embedded Derby database.
FILE Channel - File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three, however it carries the risk of data loss. The channel you choose depends completely on the nature of the big data application and the value of each event.
Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.
Explain about the replication and multiplexing selectors in Flume.
Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to a single channel or to multiple channels. If a channel selector is not specified for the source, the replicating selector is used by default. With the replicating selector, the same event is written to all the channels in the source's channel list. The multiplexing channel selector is used when the application has to send different events to different channels.
How can a multi-hop agent be set up in Flume?
Avro RPC Bridge mechanism is used to setup Multi-hop agent in Apache Flume.
Does Apache Flume provide support for third party plug-ins?
Yes. Apache Flume has a plug-in based architecture that allows it to load data from external sources and transfer it to external destinations, and most data analysts make use of such plug-ins.
Is it possible to leverage real time analysis on the big data collected by Flume directly? If
yes, then explain how.
Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers using MorphlineSolrSink.
Differentiate between FileSink and FileRollSink

The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events
into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the
local file system.
Hadoop Flume Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,10
Hadoop Flume Interview Questions and Answers for Experienced- Q.Nos- 3,7,8,9
Hadoop Zookeeper Interview Questions and Answers
Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without Zookeeper because if Zookeeper is down, Kafka cannot serve client requests.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
What is the role of Zookeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides different services like
tracking server failure and network partitions, maintaining the configuration information,
establishing communication between the clients and region servers, usability of ephemeral nodes to
identify the available servers in the cluster.
Explain about ZooKeeper in Kafka
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by Kafka to store various configurations and use them across the cluster in a distributed manner. To achieve this, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect directly to Kafka by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve client requests.
Explain how Zookeeper works
ZooKeeper is referred to as the King of Coordination and distributed applications use ZooKeeper to
store and facilitate important configuration information updates. ZooKeeper works by coordinating
the processes of distributed applications. ZooKeeper is a robust replicated synchronization service
with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed
between multiple nodes.
3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the servers and migrates if that node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails, the role of master migrates to another dynamically selected node. Writes are linear and reads are concurrent in ZooKeeper.
List some examples of Zookeeper use cases.

Found by Elastic uses ZooKeeper comprehensively for resource allocation, leader election, high priority notifications and discovery. The entire Found service is built up of various systems that read and write to ZooKeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
How to use Apache Zookeeper command line interface?
ZooKeeper has a command line client support for interactive use. The command line interface of
ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy
of Znodes where each znode can contain data just similar to a file. Each znode can also have children
just like directories in the UNIX file system.
Zookeeper-client command is used to launch the command line client. If the initial prompt is hidden
by the log messages after entering the command, users can just hit ENTER to view the prompt.
What are the different types of Znodes?
There are 2 types of Znodes namely- Ephemeral and Sequential Znodes.
The Znodes that get destroyed as soon as the client that created it disconnects are referred to as
Ephemeral Znodes.
A sequential znode is one in which a sequential number chosen by the ZooKeeper ensemble is prefixed to the name the client assigns to the znode.
What are watches?
Client disconnection can be a troublesome problem, especially when we need to keep track of the state of znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a znode to trigger an event whenever the znode is removed or altered, or any new children are created below it.
What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating your own protocols for coordinating the hadoop
cluster results in failure and frustration for the developers. The architecture of a distributed system
can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in
making the hadoop cluster fast, reliable and scalable. To address all such problems, Apache
ZooKeeper can be used as a coordination service to write correct distributed applications without
having to reinvent the wheel from the beginning.
Hadoop ZooKeeper Interview Questions and Answers for Freshers - Q.Nos- 1,2,8,9
Hadoop ZooKeeper Interview Questions and Answers for Experienced- Q.Nos-3,4,5,6,7, 10
Hadoop Pig Interview Questions and Answers
What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
Does Pig support multi-line commands?
Yes

What are different modes of execution in Apache Pig?


Apache Pig runs in 2 modes- one is the Pig (Local Mode) Command Mode and the other is the
Hadoop MapReduce (Java) Command Mode. Local Mode requires access to only a single machine
where all files are installed and executed on a local host whereas MapReduce requires accessing the
Hadoop cluster.
Explain the need for MapReduce while programming in Apache Pig.
Apache Pig programs are written in a query language known as Pig Latin that is similar to the SQL
query language. To execute the query, there is need for an execution engine. The Pig engine
converts the queries into MapReduce jobs and thus MapReduce acts as the execution engine and is
needed to run the programs.
Explain about co-group in Pig.
COGROUP operator in Pig is used to work with multiple tuples. COGROUP operator is applied on
statements that contain or involve two or more relations. The COGROUP operator can be applied on
up to 127 relations at a time. When using the COGROUP operator on two tables at once-Pig first
groups both the tables and after that joins the two tables on the grouped columns.
Explain about the BloomMapFile.
BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for the keys, using dynamic bloom filters.
Differentiate between Hadoop MapReduce and Pig
Pig provides higher level of abstraction whereas MapReduce provides low level of abstraction.
MapReduce requires the developers to write more lines of code when compared to Apache Pig.
Pig coding approach is comparatively slower than the fully tuned MapReduce coding approach.
Read More in Detail- http://www.dezyre.com/article/-mapreduce-vs-pig-vs-hive/163
What is the usage of foreach operation in Pig scripts?
FOREACH operation in Apache Pig is used to apply transformation to each element in the data bag
so that respective action is performed to generate new data items.
Syntax- FOREACH data_bagname GENERATE exp1, exp2
Explain about the different complex data types in Pig.
Apache Pig supports 3 complex data typesMaps- These are key, value stores joined together using #.
Tuples- Just similar to the row in a table where different items are separated by a comma. Tuples
can have multiple attributes.
Bags- Unordered collection of tuples. Bag allows multiple duplicate tuples.
What does Flatten do in Pig?

Sometimes there is data in a tuple or bag and if we want to remove the level of nesting from that
data then Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For tuples, the
Flatten operator will substitute the fields of a tuple in place of a tuple whereas un-nesting bags is a
little complex because it requires creating new tuples.
We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,7,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 3,5,6,8,10
Hadoop Hive Interview Questions and Answers
What is a Hive Metastore?
Hive Metastore is a central repository that stores metadata in an external database.
Are multiline comments supported in Hive?
No
What is ObjectInspector functionality?
ObjectInspector is used to analyze the structure of individual columns and the internal structure of
the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in
multiple formats.

Hadoop Hive Interview Questions and Answers for Freshers- Q.Nos-1,2,3


Hadoop YARN Interview Questions and Answers
1)What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 2.4.1
Release 1.2.1 (stable)
2) What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large scale distributed system for running big data applications.
Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop MapReduce but a more powerful and efficient technology that supports MapReduce; it is also referred to as Hadoop 2.0 or MapReduce 2.

We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1
Hadoop Interview Questions Answers Needed
Interview Questions on Hadoop Hive
1)Explain about the different types of join in Hive.
2)How can you configure remote metastore mode in Hive?
3)Explain about the SMB Join in Hive.
4)Is it possible to change the default location of Managed Tables in Hive, if so how?
5)How does data transfer happen from Hive to HDFS?
6)How can you connect an application, if you run Hive as a server?
7)What does the overwrite keyword denote in Hive load statement?
8)What is SerDe in Hive? How can you write your own custom SerDe?
9)In case of embedded Hive, can the same metastore be used by multiple users?
Hadoop YARN Interview Questions
1)What are the additional benefits YARN brings in to Hadoop?
2)How can native libraries be included in YARN jobs?
3)Explain the differences between Hadoop 1.x and Hadoop 2.x
Or
4)Explain the difference between MapReduce1 and MapReduce 2/YARN
5)What are the modules that constitute the Apache Hadoop 2.0 framework?

6)What are the core changes in Hadoop 2.0?


7)How is the distance between two nodes defined in Hadoop?
8)Differentiate between NFS, Hadoop NameNode and JournalNode.
We hope that these Hadoop Interview Questions and Answers have pre-charged you for your next Hadoop interview. Get the ball rolling and answer the unanswered questions in the comments below. Please do! It's all part of our shared mission to ease Hadoop interviews for all prospective Hadoopers. We invite you to get involved.
What is Hadoop MapReduce?
For processing large data sets in parallel across a hadoop cluster, Hadoop MapReduce framework is
used. Data analysis uses a two-step map and reduce process.
How does Hadoop MapReduce work?
Taking word count as an example: during the map phase MapReduce counts the words in each document, while in the reduce phase it aggregates the counts across the entire collection. During the map phase the input data is divided into splits for analysis by map tasks running in parallel across the Hadoop framework.
Explain what is shuffling in MapReduce ?
The process by which the system performs the sort and transfers the map outputs to the reducer as
inputs is known as the shuffle
Explain what is distributed Cache in MapReduce Framework ?
Distributed Cache is an important feature provided by map reduce framework. When you want to
share some files across all nodes in the Hadoop cluster, DistributedCache is used. The files could be executable jar files or simple properties files.
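A minimal sketch of shipping a file through the distributed cache, assuming the Hadoop 2.x Job API; the HDFS path and job name below are hypothetical.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-example");
        // The file is copied once to every node that runs tasks for this job
        job.addCacheFile(new URI("/user/hadoop/lookup/countries.txt"));
        // ... set mapper, reducer, input and output formats as usual, then submit.
        // Inside a task, context.getCacheFiles() returns the cached URIs.
    }
}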
Explain what is NameNode in Hadoop?
NameNode in Hadoop is the node, where Hadoop stores all the file location information in HDFS
(Hadoop Distributed File System). In other words, NameNode is the centrepiece of an HDFS file
system. It keeps the record of all the files in the file system, and tracks the file data across the
cluster or multiple machines

Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?
In Hadoop, JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
The JobTracker performs the following actions in Hadoop-
Client applications submit jobs to the JobTracker
The JobTracker communicates with the NameNode to determine the data location
The JobTracker locates TaskTracker nodes with available slots at or near the data
It submits the work to the chosen TaskTracker nodes
When a task fails, the JobTracker is notified and decides how to proceed
The TaskTracker nodes are monitored by the JobTracker
Explain what is heartbeat in HDFS?
Heartbeat refers to a signal used between a data node and the NameNode, and between a task tracker and the job tracker. If the NameNode or job tracker does not receive the heartbeat signal, it is considered that there is some issue with the data node or task tracker.
Explain what combiners are and when you should use a combiner in a MapReduce job?
Combiners are used to increase the efficiency of a MapReduce program. They reduce the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. The execution of the combiner is not guaranteed in Hadoop.
What happens when a datanode fails ?
When a datanode fails-
The JobTracker and NameNode detect the failure
All tasks on the failed node are re-scheduled
The NameNode replicates the user's data to another node
Explain what is Speculative Execution?

In Hadoop, during speculative execution, a certain number of duplicate tasks are launched. Multiple copies of the same map or reduce task can be executed on different slave nodes using speculative execution. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate task on another node. The copy that finishes first is retained and the other copies are killed.
Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are
LongWritable and Text
Text and IntWritable
Explain what is the function of MapReducer partitioner?
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps distribute the map output evenly over the reducers.
Explain what is difference between an Input Split and HDFS Block?
Logical division of data is known as Split while physical division of data is known as HDFS Block
Explain what happens in TextInputFormat?
In TextInputFormat, each line in the text file is a record. The value is the content of the line while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
Mention the main configuration parameters that the user needs to specify to run a MapReduce job.
The user of the MapReduce framework needs to specify-
The job's input location in the distributed file system
The job's output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
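A minimal driver sketch covering the parameters listed above, assuming the Hadoop 2.x Job API; the word-count mapper and reducer are illustrative stand-ins for the user's own classes, and the input/output paths come from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);                // JAR containing mapper, reducer and driver

        FileInputFormat.addInputPath(job, new Path(args[0]));    // job input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job output location in HDFS

        job.setInputFormatClass(TextInputFormat.class);          // input format
        job.setOutputFormatClass(TextOutputFormat.class);        // output format

        job.setMapperClass(TokenMapper.class);                   // class containing the map function
        job.setReducerClass(SumReducer.class);                   // class containing the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}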

Explain what is WebDAV in Hadoop?


WebDAV is a set of extensions to HTTP that support editing and updating files. On most operating systems WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
Explain what is sqoop in Hadoop ?
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS like MySQL or Oracle into HDFS, as well as exported from HDFS back to an RDBMS.
Explain how JobTracker schedules a task ?
The task tracker sends out heartbeat messages to the JobTracker, usually every few seconds, to make sure that the JobTracker is active and functioning. The message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Explain what is Sequencefileinputformat?
Sequencefileinputformat is used for reading files in sequence. It is a specific compressed binary file
format which is optimized for passing data between the output of one MapReduce job to the input
of some other MapReduce job.
Explain what conf.setMapperClass does?
conf.setMapperClass sets the mapper class and all the stuff related to the map job, such as reading the data and generating a key-value pair out of the mapper.
With the help of our top Hadoop instructors we've put together a comprehensive list of questions to help you get through your first Hadoop interview. We've made sure that the most probable questions asked during interviews are covered in this list. If you want to learn more, check out our new courses in Hadoop!
Q1. Name the most common Input Formats defined in Hadoop? Which one is default?
The most common Input Formats defined in Hadoop are:
TextInputFormat
KeyValueInputFormat
SequenceFileInputFormat
TextInputFormat is the Hadoop default.
Q2. What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: It reads lines of text files and provides the offset of the line as key to the Mapper
and actual line as Value to the mapper.

KeyValueInputFormat: Reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value to the mapper.
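A short sketch of selecting the alternative format on a job, assuming the Hadoop 2.x API; the property name below is the Hadoop 2.x name for the key/value separator used by KeyValueTextInputFormat, and the comma value is arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatChoice {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default tab separator with a comma
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "input-format-demo");
        // TextInputFormat is the default and needs no explicit call; switching to
        // KeyValueTextInputFormat makes the text before the separator the key
        // and the remainder of the line the value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}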
Q3. What is InputSplit in Hadoop?
When a Hadoop job is run, it splits input files into chunks and assigns each split to a mapper to process. This is called an InputSplit.
Q4. How is the splitting of file invoked in Hadoop framework?
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (like FileInputFormat) defined by the user.
Q5. Consider case scenario: In M/R system, - HDFS block size is 64 MB
Input format is FileInputFormat
We have 3 files of size 64K, 65Mb and 127Mb
How many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows:
1 split for the 64K file
2 splits for the 65MB file
2 splits for the 127MB file
Q6. What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader
class actually loads the data from its source and converts it into (key, value) pairs suitable for
reading by the Mapper. The RecordReader instance is defined by the Input Format.
Q7. After the Map phase finishes, the Hadoop framework does Partitioning, Shuffle and sort.
Explain what happens in this phase?
Partitioning: It is the process of determining which reducer instance will receive which intermediate
keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer
will receive them. It is necessary that for any key, regardless of which mapper instance generated it,
the destination partition is the same.
Shuffle: After the first map tasks have completed, the nodes may still be performing several more
map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to
where they are required by the reducers. This process of moving map outputs to the reducers is
known as shuffling.

Sort: Each reduce task is responsible for reducing the values associated with several intermediate
keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they
are presented to the Reducer.
Q8. If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to
the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this
result.
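The sketch below shows this default behaviour, which is essentially the logic of the stock HashPartitioner (reproduced here for illustration rather than copied from the Hadoop source).

import org.apache.hadoop.mapreduce.Partitioner;

public class HashStylePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so negative hash codes still map to a valid partition
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}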
Q9. What is a Combiner?
The Combiner is a mini-reduce process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
Q10. What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
Q11. What are some typical functions of Job Tracker?
The following are some typical tasks of JobTracker:
Accepts jobs from clients
It talks to the NameNode to determine the location of the data.
It locates TaskTracker nodes with available slots at or near the data.
It submits the work to the chosen TaskTracker nodes and monitors progress of
each task by receiving heartbeat signals from Task tracker.
Q12. What is TaskTracker?
TaskTracker is a node in the cluster that accepts tasks like MapReduce and Shuffle operations from
a JobTracker.
Q13. What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.
Q14. Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop
do?
It will restart the task again on some other TaskTracker and only if the task fails more than four
(default setting and can be changed) times will it kill the job.

Q15. Hadoop achieves parallelism by dividing the tasks across many nodes, it is possible for a few
slow nodes to rate-limit the rest of the program and slow down the program. What mechanism
Hadoop provides to combat this?
Speculative Execution.
Q16. How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
Q17. Using command line in Linux, how will you
See all jobs running in the Hadoop cluster
Kill a job?
hadoop job -list
hadoop job -kill <jobID>
Q18. What is Hadoop Streaming?
Streaming is a generic API that allows programs written in virtually any language to be used as
Hadoop Mapper and Reducer implementations.
Q19. What is the characteristic of streaming API that makes it flexible run MapReduce jobs in
languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows to use arbitrary programs for the Mapper and Reducer phases of a
MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output
(key, value) pairs on stdout.
Q20. What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives,
jars and so on) needed by applications during execution of the job. The framework will copy the
necessary files to the slave node before any tasks for the job are executed on that node.
Q21. What is the benefit of Distributed cache? Why can we just have the file in HDFS and have the
application read it?
This is because the distributed cache is much faster. It copies the file to all task trackers at the start of the job. Now if a task tracker runs 10 or 100 Mappers or Reducers, they will all use the same local copy from the distributed cache. On the other hand, if you put code in the MR job to read the file from HDFS, then every Mapper will try to access it from HDFS; so if a TaskTracker runs 100 map tasks it will try to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
Q.22 What mechanism does the Hadoop framework provide to synchronise changes made in the Distributed Cache during runtime of the application?
This is a tricky question. There is no such mechanism. Distributed Cache by design is read only during
the time of Job execution.
Q23. Have you ever used Counters in Hadoop. Give us an example scenario?
Anybody who claims to have worked on a Hadoop project is expected to use counters.
Q24. Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple
directories as input to the Hadoop job?
Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using Multiple Outputs class.
Q26. What will a Hadoop job do if you try to run it with an output directory that is already
present? Will it
Overwrite it
Warn you and continue
Throw an exception and exit
The Hadoop job will throw an exception and exit.
Q27. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You cannot set it directly; the number of mappers is determined by the number of input splits.
Q28. How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically by using method setNumReduceTasks in the Jobconf Class or
set it up as a configuration setting.
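For example, with the newer Job API (a minimal sketch; the value 10 and the job name are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        // Equivalent to setting mapreduce.job.reduces=10 (mapred.reduce.tasks in Hadoop 1.x)
        job.setNumReduceTasks(10);
    }
}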
Q29. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you will have to do, at minimum, the following three things:
Create a new class that extends Partitioner Class

Override method getPartition


In the wrapper that runs the MapReduce, either
add the custom partitioner to the job programmatically using the setPartitionerClass method, or
add the custom partitioner to the job as a config property (if your wrapper reads from a config file or Oozie)
Q30. How did you debug your Hadoop code?
There can be several ways of doing this but the most common ways are:
By using counters.
The web interface provided by the Hadoop framework.
Q31. Did you ever built a production process in Hadoop? If yes, what was the process when your
Hadoop job fails due to any reason?
It is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, like an email being sent or their monitoring system sending an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, because unexpected data can very easily break the job.
