1.
Let's start simple: Big Data is data so big that it cannot fit into, or be
processed on, one box, and it cannot travel from one place to another
as a whole.
2.
Since it cannot fit into one box, each file has to be split and spread
across multiple boxes using a distributed file system called HDFS.
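To make this concrete, here is a minimal sketch of writing one logical file into HDFS through the standard Hadoop Java FileSystem API; the NameNode address hdfs://namenode:8020 and the path /data/raw/events.txt are made-up placeholders, not a prescribed layout. The client sees a single stream while HDFS splits the file into blocks behind the scenes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/raw/events.txt"); // hypothetical path

        // The client writes one logical file; HDFS splits it into blocks
        // (128 MB by default in Hadoop 2.x) and spreads them across data nodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("one logical file, many physical blocks\n");
        }
        fs.close();
    }
}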
3.
4.
The Name Node is a bookkeeper that, at a minimum, keeps track of which
split resides on which Data Node. The Name Node is periodically
backed up by the Secondary Name Node, since most of its metadata is
held in memory for performance reasons.
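You can see this bookkeeping in action by asking where a file's blocks live; the answer (block offsets and the Data Nodes holding each replica) is served from the Name Node's in-memory metadata. This is a small sketch using the same FileSystem API; the path is again a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/raw/events.txt")); // placeholder path

        // Each BlockLocation is one split of the file; getHosts() lists the
        // data nodes holding a replica of that block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}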
5.
HDFS also replicates the data, storing the replicas across nodes and
data-centre racks to improve reliability against node failures.
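The replication factor is a per-file setting. As a small sketch (the path and the factor of 5 are illustrative only), a frequently read file can be given extra replicas and the Name Node will schedule the additional copies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Keep 5 replicas of this (hypothetical) hot file instead of the default 3;
        // the Name Node places the extra copies on other nodes and racks.
        fs.setReplication(new Path("/data/raw/events.txt"), (short) 5);
    }
}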
6.
7.
8.
Hadoop 2.0 alleviated several scale and performance issues that
were present in Hadoop 1.0. The Job Tracker in Hadoop 1.0 used to
perform scheduling, monitoring and job-history tracking; those
responsibilities are now split between the Resource Manager, the
Application Master and the Timeline Server. Task Trackers have been
replaced by Node Managers, which run generic application resource
containers, an upgrade over Hadoop 1.0's fixed split of cluster
capacity into map and reduce slots.
9.
The Name Node was a single point of failure in Hadoop 1.0. As of Hadoop
2.0 it can be made highly available using a cluster capability called the
Quorum Journal Manager, along with ZooKeeper, which helps manage
leader election.
10.
Name Nodes can be federated in Hadoop 2.0, with each Name Node
managing a portion of the files in a massive cluster, thereby
improving scale.
11.
The servers that host the Hadoop components are called commodity
servers. Commodity server doesn't mean a cheap, low-end server; it
means a device that is relatively inexpensive, widely available and
interchangeable with other hardware of its type.
12.
13.
Since the data cannot travel, tasks have to come to the data that
resides on each data node: they process each split, convert it into
key-value pairs (map) and then aggregate the results by key (reduce)
to perform a meaningful computation. This job is called MapReduce
and, as of Hadoop 1.0, involves the Job Tracker (the scheduler) and
Task Trackers (the workers that process the data splits).
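To show what that pattern looks like in practice, here is a minimal sketch of the classic word-count job written against the Hadoop 2.x Java MapReduce API; the input and output HDFS paths are supplied on the command line and everything else is illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each task runs next to its split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all values for the same key are aggregated into a count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}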
14.
In the past we had to model the data before ingesting it; now, with Big
Data, structured (databases), semi-structured (log files) and
unstructured (images) data can all become part of your data-analysis
landscape. The sources of data can range from log files, sensors and
click streams to databases.
15.
16.
Data can be stored in the data lake in plain-text format, but it is
preferable to store it in a compressed, splittable, binary format to
save space and exploit the underlying power of HDFS and MapReduce.
17.
18.
Hadoop provides a key-value file format called the SequenceFile format
for this purpose, but it is limited to Java programs. There are
other file formats like Avro, Thrift and Parquet that support both
reading and writing files to HDFS in a language-agnostic way.
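As a quick sketch of the Java-only option, here is how a SequenceFile can be written with the Hadoop 2.x API; the output path and the key/value types are illustrative choices, not requirements.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/lake/events.seq"); // hypothetical path

        // A SequenceFile stores binary key-value records and remains splittable,
        // so a MapReduce job can still process it block by block.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("clicks"), new IntWritable(42));
            writer.append(new Text("views"), new IntWritable(7));
        }
    }
}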
19.
These splittable data formats can be further compressed using a
compression codec like Snappy or Bzip2, which allows them to
occupy much less space and reduces network bandwidth when
shuffling data across the nodes during job execution, thereby
improving performance. Please note that compression is a CPU
intensive process, so the trade-off pays off mainly for I/O-bound
jobs like MapReduce.
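As a hedged sketch, both the intermediate (shuffle) data and the final job output can be compressed through job configuration; this assumes the native Snappy libraries are installed on the cluster, and the mapper, reducer and paths would be set exactly as in the word-count sketch above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic across the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "compressed job");

        // Compress the final job output written back to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // ... set mapper/reducer/input/output paths as in the word-count sketch, then submit.
    }
}

Note that a plain Snappy-compressed text file is not splittable on its own, which is one more reason to pair the codec with container formats such as SequenceFile, Avro or Parquet.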
20.
Structured and semi-structured batch data that flows into the lake
from databases, FTP servers or mainframes can be ingested using
tools like Sqoop, which fetches data from the database and loads it
in parallel into HDFS without overwhelming the database.
21.
Unstructured and semi-structured data can flow into HDFS via
streaming tools like Kafka, Storm, Spark Streaming and Flume.
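As one illustration of that streaming path, here is a minimal sketch of a Kafka producer publishing a log line; the broker address broker1:9092 and the "weblogs" topic are placeholders, and a downstream consumer (Flume, Spark Streaming and so on) would land the messages in HDFS.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one semi-structured log line to a (hypothetical) "weblogs" topic.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("weblogs",
                    "host42", "GET /index.html 200 123ms"));
        }
    }
}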
22.
Data stored in HDFS in the form of files is not reporting-friendly
and hence needs to be loaded into other big data stores for CRUD and
reporting purposes.
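For example, records can be pushed into HBase (which also appears later as a Flume sink), where individual cells become readable and updatable. This is a sketch only, assuming the HBase 1.x+ Java client; the table name, row key and column family are made up.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("web_events"))) { // placeholder table

            // One row keyed by user and timestamp; individual cells can later be
            // read, updated or deleted, unlike raw files sitting in HDFS.
            Put put = new Put(Bytes.toBytes("user42#2016-01-01T10:00"));
            put.addColumn(Bytes.toBytes("clicks"), Bytes.toBytes("page"), Bytes.toBytes("/index.html"));
            table.put(put);
        }
    }
}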
23.
24.
25.
26.
Integrate Sqoop, Pig, Hive, MapReduce and HDFS jobs into a
single workflow that can be scheduled or run on demand with tools
like Oozie.
27.
Oozie can schedule jobs based on a recurring time interval and
frequency, or based on data availability from the upstream
system (this is a powerful feature).
28.
For free-text searching on unstructured files, move the data into
Elasticsearch or Solr.
29.
30.
31.
32.
Flume, on the other hand, is the standard for fetching data from
streaming sources like syslog, directories and HTTP. It comes with a
plethora of adapters that allow data to be pulled from various
sources, buffered in a messaging channel (file- or memory-based)
and written into sinks like HDFS and HBase in a format that is
convenient to you.
33.
34.
35.
Graph databases like Neo4j and Titan can also be populated during the
ingestion phase and later used for correlating and aggregating
events during stream processing.
36.
Cypher and Gremlin are graph query languages that can be used to
query the graphs you created and also to write rules on them.
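As an illustration only, here is a small sketch that runs Cypher from Java, assuming the official Neo4j Java driver (4.x package names); the bolt URL, credentials, node labels and relationship type are all made up.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

import static org.neo4j.driver.Values.parameters;

public class CypherExample {
    public static void main(String[] args) {
        // Placeholder connection details for a Neo4j instance.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Create two host nodes and relate them.
            session.run("MERGE (a:Host {name: $a}) MERGE (b:Host {name: $b}) "
                    + "MERGE (a)-[:SENT_TRAFFIC_TO]->(b)",
                    parameters("a", "host42", "b", "host99"));

            // Query the relationship back out.
            Result result = session.run(
                    "MATCH (a:Host)-[:SENT_TRAFFIC_TO]->(b:Host) RETURN a.name AS src, b.name AS dst");
            while (result.hasNext()) {
                Record row = result.next();
                System.out.println(row.get("src").asString() + " -> " + row.get("dst").asString());
            }
        }
    }
}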
37.
38.
39.
40.
Kerberos can sync up with LDAP if that is the default store for users
within the organisation. However, for that to happen, both LDAP and
Kerberos have to add each other to their trust stores.
41.
42.
43.
There are incubating projects like Apache Knox and Apache
Sentry that are trying to centralize the management of security
policies.
Hope you find it useful! Thanks for reading; suggestions are welcome, and
apologies in advance if I left out any important detail in the process.