
Demystifying the Big Data Ecosystem...

When I started reading up on big data, I was perplexed by the terminology and the vast ecosystem that surrounds it. This article is a humble attempt to demystify that complex landscape and give you just enough information to start relating to the terms and technologies surrounding the big data ecosystem. You will find it useful as a starter kit, whether you are embarking on a new IoT project or trying to build out a data lake for your enterprise. I have stuck to the open source options for ease of explanation...

1. Let's start simple... Big Data is data so big that it cannot fit on, or be processed by, a single box, and it cannot travel from one place to another as a whole.

2. Since it cannot fit into one box, each file has to be split and spread across multiple boxes using a distributed file system called HDFS.

3. HDFS therefore has Data Nodes that store these splits.

4. The Name Node is the bookkeeper that, at a minimum, keeps track of which split resides on which Data Node. Because this metadata is held in memory for performance reasons, it is periodically checkpointed by the Secondary Name Node.

5. HDFS also replicates the data and stores the replicas across nodes and data centre racks to improve reliability against node failures.

6. The Hadoop 1.0 ecosystem consists of HDFS (Name Node, Data Nodes and Secondary Name Node), a job scheduler called the Job Tracker and a task executor called the Task Tracker.

7. The Task Tracker runs tasks allocated by the Job Tracker in a separate JVM directly on the Data Node. When scheduling tasks on the Task Trackers, the Job Tracker keeps data locality in mind to avoid network latency.

8. Hadoop 2.0 alleviated several scale and performance issues present in Hadoop 1.0. The Job Tracker used to perform scheduling, monitoring and job history tracking; these responsibilities are now split between the Resource Manager, the Application Masters and the Timeline Server. Task Trackers have been replaced by Node Managers, which run generic application resource containers, an improvement over carving cluster capacity into fixed map and reduce slots as in Hadoop 1.0.

9. The Name Node was a single point of failure in Hadoop 1.0. As of Hadoop 2.0 it can be made highly available using a cluster capability called the Quorum Journal Manager, along with ZooKeeper to manage leader election.

10. Name Nodes can also be federated in Hadoop 2.0, with each Name Node managing a portion of the filesystem namespace in a massive cluster, thereby improving scale.

11. The servers that host the Hadoop components are called commodity servers. Commodity doesn't mean cheap, low-end hardware; it means a device that is relatively inexpensive, widely available and interchangeable with other hardware of its type.

12. The definition of a commodity server changes year on year because, at the same price point, the processing power available in data centres continues to increase rapidly. As an example, consider these definitions of a commodity server: in 2009, 8 cores, 16GB of RAM and 4x1TB of disk; in 2012, 16+ cores, 48-96GB of RAM and 12x2TB or 12x3TB of disk.

13. Since the data cannot travel, tasks have to come to the data residing on the Data Nodes: each split is processed in place and converted into key-value pairs (map), and the results are then aggregated by key (reduce) to perform a meaningful computation. This job model is called MapReduce and, as of Hadoop 1.0, involves the Job Tracker (scheduler) and Task Trackers (the workers that operate on the data splits); a small sketch follows below.
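
As an illustration, here is a minimal word count expressed as Hadoop Streaming style mapper and reducer scripts in Python (the file names and the word-count use case are my own, purely illustrative): the mapper turns each split into key-value pairs, and the reducer aggregates them by key.

    #!/usr/bin/env python
    # mapper.py - emit a (word, 1) pair for every word in this task's input split
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - receive pairs sorted by key and aggregate the count per word
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))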

14. In the past we had to model data before ingesting it. With big data, structured (databases), semi-structured (log files) and unstructured (images) data can all become part of your analysis landscape. The sources can range from log files, sensors and clickstreams to databases.

15. Organisations aggregate and analyse this data in a central place called a data lake. A data lake differs from a data warehouse in that a warehouse stores only structured, modelled data, while a data lake can store structured, semi-structured and unstructured data.

16. Data can be stored in the data lake in plain text, but it is preferable to store it in a compressed, splittable, binary format to save space and exploit the underlying power of HDFS and MapReduce.

17. Also, given the changing nature of data, these file formats have to support schema evolution. It is also desirable that every data set carries its own schema, making it self-describing.

18. Hadoop provides a key-value file format called the Sequence File for this purpose, but it is limited to Java programs. Other file formats such as Avro, Thrift and Parquet support reading and writing files to HDFS in a language-agnostic way; a small Avro sketch follows below.
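
For example, here is a minimal sketch of writing and reading a schema-carrying Avro file in Python using the fastavro library (the schema, record contents and file name are illustrative assumptions):

    from fastavro import writer, reader, parse_schema

    # Illustrative schema: each data file carries it, so the data is self-describing
    schema = parse_schema({
        "name": "SensorReading",
        "type": "record",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "temperature", "type": "double"},
        ],
    })

    records = [{"device_id": "device-42", "temperature": 21.5}]

    with open("readings.avro", "wb") as out:     # in practice this would land in HDFS
        writer(out, schema, records)

    with open("readings.avro", "rb") as fo:      # the schema travels with the file
        for record in reader(fo):
            print(record)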

19. These splittable formats can be further compressed using a codec such as Snappy or Bzip2, which lets them occupy far less space and reduces network bandwidth when data is shuffled across nodes during job execution, thereby improving performance. Note that compression is CPU intensive, so the trade-off suits I/O-bound jobs like MapReduce (see the sketch below).
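
As a small illustration, a PySpark job can request a compression codec when writing its output; the paths and the choice of Bzip2 here are assumptions made for the example:

    from pyspark import SparkContext

    sc = SparkContext(appName="compressed-output")
    events = sc.textFile("hdfs:///data/lake/raw/events")

    # Write the output compressed; BZip2 output remains splittable for later jobs
    events.saveAsTextFile(
        "hdfs:///data/lake/staged/events-bz2",
        compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec",
    )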

20. Structured and semi-structured batch data that flows into the lake from databases, FTP servers or mainframes can be ingested using tools like Sqoop, which fetches data from the database and loads it into HDFS in parallel without overwhelming the source database (see the sketch below).
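
Sqoop imports are typically driven from the command line; the sketch below simply wraps one such invocation from Python. The JDBC URL, credentials, table and target directory are placeholders, and the mapper count is the knob that limits how hard the source database is hit.

    import subprocess

    # Placeholder connection string, user, table and target directory
    sqoop_import = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl_user", "-P",            # -P prompts for the password
        "--table", "orders",
        "--target-dir", "/data/lake/raw/orders",
        "--num-mappers", "4",                      # parallelism; keep modest to spare the source DB
    ]

    subprocess.run(sqoop_import, check=True)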

21. Unstructured and semi-structured data can flow into HDFS via streaming tools like Kafka, Storm, Spark Streaming and Flume.

22. Data stored in HDFS as files is not reporting friendly and hence needs to be loaded into other big data stores for CRUD and reporting purposes.

23. For CRUD operations you can use a column-oriented database called HBase that stores its data in HDFS (see the sketch below).
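
A minimal sketch of HBase CRUD from Python via the happybase client; the Thrift gateway host, table name and row key layout are assumptions made for illustration:

    import happybase

    # Placeholder host: happybase talks to the HBase Thrift gateway
    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("sensor_events")

    row_key = b"device-42|2016-07-01T10:00"

    # Create / update: cells are addressed as column-family:qualifier
    table.put(row_key, {b"cf:temperature": b"21.5", b"cf:status": b"OK"})

    # Read
    print(table.row(row_key))

    # Delete
    table.delete(row_key)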

24. For SQL and analytical capabilities, connect Hive/Accumulo to HBase or HDFS to perform aggregations and joins (see the sketch below).
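
For example, a Hive aggregation can be issued from Python through the PyHive client; the HiveServer2 host, database and table are placeholders for illustration:

    from pyhive import hive

    # Placeholder HiveServer2 host and database
    connection = hive.Connection(host="hiveserver2-host", port=10000, database="lake")
    cursor = connection.cursor()

    # Aggregation pushed down to Hive, which runs over data in HDFS/HBase
    cursor.execute(
        "SELECT region, COUNT(*) AS order_count "
        "FROM orders GROUP BY region"
    )

    for region, order_count in cursor.fetchall():
        print(region, order_count)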

25. Every entity will be represented in a hundred different ways within an organisation, so it is best to transform these entities into a canonical structure that consumers can rely on. To transform data residing in HDFS into a canonical format you can use Pig, which runs Hadoop MapReduce jobs underneath, or Spark workflows (see the sketch below).
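
A minimal Spark-based sketch of such a canonicalisation step; the input layout (pipe-delimited customer exports) and the target comma-separated format are assumptions made for illustration:

    from pyspark import SparkContext

    sc = SparkContext(appName="canonicalise-customers")

    # Assume raw exports are pipe-delimited: id|name|email (purely illustrative)
    raw = sc.textFile("hdfs:///data/lake/raw/customers")

    def to_canonical(line):
        customer_id, name, email = [field.strip() for field in line.split("|")[:3]]
        # Canonical form: consistent casing and comma-separated fields
        return "%s,%s,%s" % (customer_id, name.title(), email.lower())

    raw.map(to_canonical).saveAsTextFile("hdfs:///data/lake/canonical/customers")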

26. Integrate Sqoop, Pig, Hive, MapReduce and HDFS jobs into a single workflow, which can be scheduled or run on demand with tools like Oozie.

27. Oozie can schedule jobs on a recurring time interval and frequency, or based on data availability from an upstream system (a powerful feature).

28. For free-text searching over unstructured files, move the data into Elasticsearch or Solr (see the sketch below).
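
A minimal sketch of indexing and searching log lines with the Python Elasticsearch client; the cluster address, index name and documents are illustrative assumptions:

    from elasticsearch import Elasticsearch

    # Placeholder cluster address
    es = Elasticsearch(["http://elastic-host:9200"])

    # Index a raw log line so it becomes searchable
    es.index(index="app-logs", body={"host": "web-01",
                                     "message": "Payment service timed out after 30s"})

    # Free-text query
    results = es.search(index="app-logs",
                        body={"query": {"match": {"message": "timed out"}}})

    for hit in results["hits"]["hits"]:
        print(hit["_source"]["message"])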

29. Spark provides a tightly integrated environment to ingest, transform and load data from a variety of big data sources into a variety of big data stores. Through an abstraction called RDDs (Resilient Distributed Datasets) it hides the fact that an input file in HDFS is broken into splits distributed across multiple nodes, and hence it is quickly rising in popularity as a one-stop shop (see the sketch below).
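
A minimal RDD sketch; the file path and the word-count computation are assumptions for illustration, but they show how a file stored as many splits is handled as one collection:

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-word-count")

    # One logical collection, even though the file lives as many splits on many nodes
    lines = sc.textFile("hdfs:///data/lake/raw/events.txt")

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))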

30. Tableau and Kibana can connect to Hive/Accumulo and Elasticsearch to provide dashboarding and visualisation capabilities for big data.

31. Kafka is a high-performance messaging middleware that can buffer huge amounts of data in its topics. It is quite useful when the producer sends data faster than the consumer can consume it, which is usually the case (see the sketch below).
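
A minimal sketch using the kafka-python client; the broker address, topic name and message payload are placeholders:

    from kafka import KafkaProducer, KafkaConsumer

    # Placeholder broker and topic
    producer = KafkaProducer(bootstrap_servers="kafka-host:9092")
    producer.send("sensor-readings", b'{"device_id": "device-42", "temperature": 21.5}')
    producer.flush()

    # The topic buffers messages until a (possibly slower) consumer reads them
    consumer = KafkaConsumer("sensor-readings",
                             bootstrap_servers="kafka-host:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)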

32. Flume, on the other hand, is the standard for fetching data from streaming sources like syslog, directories and HTTP. It comes with a plethora of adapters that pull data from various sources, buffer it in a channel (file or memory based) and write it to sinks like HDFS and HBase in a format convenient to you.

33. Stream processing involves data being read from various devices using device adapters. This data is pushed into the organisation over various protocols, via Flume into Kafka; it is then fetched from Kafka by Apache Storm or Spark (as DStreams) and processed further (see the sketch below).
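
A minimal sketch of the Kafka-to-Spark leg using the Spark 1.x/2.x era spark-streaming-kafka integration; the broker, topic and batch interval are assumptions, and the job has to be submitted with the matching Kafka integration package on the classpath:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-to-spark")
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches

    # Placeholder topic and broker; messages arrive as (key, value) pairs
    stream = KafkaUtils.createDirectStream(
        ssc, ["sensor-readings"], {"metadata.broker.list": "kafka-host:9092"})

    stream.map(lambda kv: kv[1]).pprint()  # further processing would go here

    ssc.start()
    ssc.awaitTermination()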

34. The outcome of stream processing is a set of actions, which may include interacting with the devices, kicking off workflows or simply running some Java, Scala or shell programs.

35. Graph databases like Neo4j and Titan can also be populated during the ingestion phase and later used for correlating and aggregating events during stream processing.

36. Cypher and Gremlin are graph query languages that can be used to query the graphs you have created and to write rules on them (see the sketch below).
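
A minimal Cypher sketch via the official Neo4j Python driver; the connection details, node labels and relationship are all assumptions made for illustration:

    from neo4j import GraphDatabase

    # Placeholder URI and credentials
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Populate during ingestion: a device located at a site
        session.run(
            "MERGE (d:Device {id: $device_id}) "
            "MERGE (s:Site {name: $site}) "
            "MERGE (d)-[:LOCATED_AT]->(s)",
            device_id="device-42", site="plant-7")

        # Correlate later: which devices share that site?
        result = session.run(
            "MATCH (d:Device)-[:LOCATED_AT]->(s:Site {name: $site}) RETURN d.id AS id",
            site="plant-7")
        for record in result:
            print(record["id"])

    driver.close()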

37. Correlation and aggregation of events can also be done using the window interval concept provided in tools like Spark. A window is a set of micro-batches, i.e. groups of trickling events that arrive within a configured time interval, and all events within a window can be processed in a single pass (see the sketch below).
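
A minimal windowing sketch in Spark Streaming; the socket source, the key layout, and the 60-second window sliding every 20 seconds are all illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="windowed-aggregation")
    ssc = StreamingContext(sc, 10)                     # 10-second micro-batches
    ssc.checkpoint("hdfs:///tmp/stream-checkpoints")   # required for incremental windows below

    # Demo source; in practice this would be the Kafka DStream from the previous sketch
    events = ssc.socketTextStream("stream-host", 9999)

    # Key each event by, say, a device id in the first field
    pairs = events.map(lambda event: (event.split(",")[0], 1))

    # Count events per device over a 60-second window, sliding every 20 seconds;
    # the inverse function makes the sliding window incremental
    windowed_counts = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b, lambda a, b: a - b, 60, 20)
    windowed_counts.pprint()

    ssc.start()
    ssc.awaitTermination()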

38. Machine learning algorithms can be applied on top of the streams using Spark's built-in libraries to take actions on the events that trickle in. However, training the model has to be done in a batch job that is periodically kicked off, for example via an Oozie workflow (see the sketch below).
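
A hedged sketch of the batch half of that loop using Spark MLlib; the training data layout, file paths and the choice of logistic regression are my own assumptions. The saved model would then be loaded in the streaming job and applied to each incoming record.

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="train-anomaly-model")

    # Assume CSV rows of "label,feature1,feature2,..." (illustrative layout)
    training = (sc.textFile("hdfs:///data/lake/training/labelled.csv")
                  .map(lambda line: [float(x) for x in line.split(",")])
                  .map(lambda values: LabeledPoint(values[0], values[1:])))

    model = LogisticRegressionWithLBFGS.train(training)
    model.save(sc, "hdfs:///models/anomaly-v1")

    # In the streaming job, the saved model is loaded once and applied per event, e.g.:
    #   model = LogisticRegressionModel.load(sc, "hdfs:///models/anomaly-v1")
    #   scores = feature_stream.map(lambda features: model.predict(features))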

39. Authentication in the Hadoop ecosystem can be achieved using the Kerberos protocol; all devices in the ecosystem need to be part of the trusted Kerberos network.

40. Kerberos can sync up with LDAP if that is the default user store within the organisation. However, for that to happen, LDAP and Kerberos have to add each other to their trust stores.

41. Streams of data flowing in through Flume or Sqoop can be encrypted using SSL.

42. Authorization is more complex and happens at every component level. For example, authorization in Hadoop is controlled by a policy file (hadoop-policy.xml) that defines which activities can be performed by users or groups of users.

43. There are incubating projects like Apache Knox and Apache Sentry that aim to manage security policies centrally.

Hope you find it useful! Thanks for reading; suggestions are welcome, and apologies in advance if I left out any important detail in the process.
