Google hit a bottleneck: they simply could not write a large enough check, i.e.
buy a single machine big enough, to process the data:
• To address this, the Google Labs team developed an algorithm that allowed
large data calculations to be chopped into smaller chunks and mapped to
many computers; when the calculations were done, the results were brought
back together to produce the resulting data set.
• They called this algorithm MapReduce. It was later used to develop an
open-source framework called Hadoop, which allows applications to run
using the MapReduce algorithm.
• Data is processed in parallel, not serially. Big data workloads have high
computational complexity, e.g. statistical simulations.
• We will increasingly see Hadoop playing a central role in statistical
analysis, ETL processing, and business intelligence.
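The chop-map-bring-back-together flow described above can be sketched in plain Python. This is a toy word count, not Hadoop's actual implementation; on a real cluster each mapper and reducer would run on a different machine:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

# The data set is chopped into chunks, each chunk is mapped independently,
# and the reduced results are brought back together.
chunks = ["big data big calculations", "big data processed in parallel"]
mapped = [map_phase(c) for c in chunks]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["big"])  # → 3
```

The key property is that every `map_phase` call is independent of the others, which is exactly what lets the work be spread over many computers.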
• Volume: the amount of data matters.
Organizations collect data from a variety of sources,
including business transactions, social media, and
information from sensors or machine-to-machine data.
In the past, storing it would have been a problem, but
new technologies (such as Hadoop) have eased the
burden.
• Variety: different kinds of data are being
generated from various sources. Data comes in
all types of formats, from structured, numeric data in
traditional databases to unstructured text documents,
email, video, audio, stock ticker data, and financial
transactions.
The five V's of big data:
• Volume
• Variety
• Velocity
• Value
• Veracity
Big data as an opportunity
Problems of Big Data
• Storing exponentially growing, huge datasets
• Spending ever more money on servers as data grows
• Processing data with complex structure
• Processing data faster
[Diagram illustrating the problems of big data]
DATA NODES:
• Slave daemons
• Store the actual data
• Serve read and write requests
• Data is replicated and stored across different nodes
• Perform low-level reads and writes
• Send heartbeats to the NameNode
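The heartbeat relationship between DataNodes and the NameNode can be modelled with a minimal sketch. This is a simplified, in-process model, not HDFS's real RPC protocol; the class names mirror the HDFS roles and the timeout value is an assumption for illustration:

```python
import time

class NameNode:
    """Master: tracks which DataNodes have reported in recently."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout          # seconds before a node is presumed dead
        self.last_heartbeat = {}        # node_id -> time of last heartbeat

    def receive_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def live_nodes(self):
        """Nodes whose last heartbeat is within the timeout window."""
        now = time.monotonic()
        return [n for n, t in self.last_heartbeat.items()
                if now - t < self.timeout]

class DataNode:
    """Slave: periodically tells the NameNode it is still alive."""
    def __init__(self, node_id, name_node):
        self.node_id = node_id
        self.name_node = name_node

    def send_heartbeat(self):
        self.name_node.receive_heartbeat(self.node_id)

nn = NameNode()
for node_id in ("dn1", "dn2"):
    DataNode(node_id, nn).send_heartbeat()
print(sorted(nn.live_nodes()))  # → ['dn1', 'dn2']
```

A node that stops sending heartbeats drops out of `live_nodes()` after the timeout, which is how the NameNode knows to re-replicate that node's blocks elsewhere.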
HDFS BLOCKS
SECONDARY NAME NODE
APPLICATION MASTER
• The application manager accepts the job submission
• Negotiates containers for executing application-specific tasks
• The Application Master manages the application and monitors its progress
• Application Masters are daemons which reside on the data nodes
• They communicate with containers for the execution of tasks on
each data node
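The submit, negotiate, and monitor flow above can be sketched as a toy simulation. These are illustrative Python classes named after the YARN roles, not the real YARN API:

```python
class Container:
    """A slice of CPU/memory on one data node where a task runs."""
    def __init__(self, node):
        self.node = node

    def run(self, task):
        return f"{task} ran on {self.node}"

class ResourceManager:
    """Accepts job submissions and grants containers on data nodes."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes

    def negotiate_containers(self, num):
        # Grant one container per request, spread round-robin over nodes.
        return [Container(self.data_nodes[i % len(self.data_nodes)])
                for i in range(num)]

class ApplicationMaster:
    """Per-application daemon: negotiates containers, dispatches tasks."""
    def __init__(self, resource_manager):
        self.rm = resource_manager

    def execute(self, tasks):
        containers = self.rm.negotiate_containers(len(tasks))
        return [c.run(t) for c, t in zip(containers, tasks)]

am = ApplicationMaster(ResourceManager(["node-1", "node-2"]))
results = am.execute(["map-0", "map-1"])
print(results[0])  # → map-0 ran on node-1
```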
COMPONENTS EXPLAINED
YARN AND HDFS STRUCTURE COMBINED
Processing data faster
• Provides parallel processing of the data present in HDFS
• Processes data locally, i.e. each node works on the part of the data
that is stored on it.
MAP REDUCE
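Data locality, each worker processing only the block it holds, can be sketched with Python's standard thread pool. Here threads stand in for cluster nodes and small lists stand in for HDFS blocks; this is an analogy, not how Hadoop schedules tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def process_local_block(block):
    """Each worker processes only the block co-located with it,
    mirroring Hadoop's data locality: move compute to the data."""
    return sum(block)

# Four HDFS-style blocks of one data set, one per worker.
blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_local_block, blocks))

# Only the small partial results travel back to be combined.
total = sum(partials)
print(total)  # → 36
```

Because each block is processed where it sits, the expensive step (reading the data) never crosses the network; only the tiny partial sums do.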