Sharath Bandaru & Sai Dinesh Koppuravuri
Advanced Topics Presentation
ISYE 582: Engineering Information Systems

Overview
Understanding Big Data
Structured/Unstructured Data
Limitations of Existing Data Analytics Architecture
Apache Hadoop
Hadoop Architecture
HDFS
MapReduce
Conclusions
References

Understanding Big Data
Big data is creating large and growing files, measured in terabytes (10^12 bytes) and petabytes (10^15 bytes), and is largely unstructured.

Structured/Unstructured Data: Why Now?
[Chart: data growth, 1980-2013 — structured data 20%, unstructured data 80%. Source: Cloudera, 2013]

Challenges Posed by Big Data: Volume, Velocity, Variety
- Volume: 2.5 petabytes created by Wal-Mart transactions in an hour
- Velocity: 400 million tweets per day on Twitter; 1 million transactions per hour at Wal-Mart
- Variety: videos, photos, text messages, images, audio, documents, emails, etc.
Limitations of Existing Data Analytics Architecture
[Diagram: Instrumentation → Collection → Storage-Only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps]
- Moving data to compute doesn't scale
- Can't explore original high-fidelity raw data
- Archiving = premature data death

So What Is Apache Hadoop?
A set of tools that supports running applications on big data. Core Hadoop has two main systems:
- HDFS (Hadoop Distributed File System): self-healing, high-bandwidth clustered storage.
- MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

History
Source: Cloudera, 2013

The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
- Schema must be created before any data can be loaded.
- An explicit load operation has to take place which transforms the data to the database's internal structure.
- New columns must be added explicitly before data for those columns can be loaded.
- Pros: reads are fast; standards and governance.
Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no transformation is needed.
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
- Pros: loads are fast; flexibility and agility.

Use the Right Tool for the Right Job
Relational databases — use when you need:
- Interactive OLAP analytics (< 1 sec)
- Multistep ACID transactions
- 100% SQL compliance
Hadoop — use when you need:
- Structured or unstructured data (flexibility)
- Scalability of storage/compute
- Complex data processing

Traditional Approach vs. Enterprise Approach
Traditional: push big data through one powerful computer and hit its processing limit. Enterprise approach: Hadoop distributes the work across a cluster.

Hadoop Architecture
[Diagram: an application submits jobs to the master node, which runs the Job Tracker (MapReduce) and Name Node (HDFS); each slave node runs a Task Tracker (MapReduce) and a Data Node (HDFS).]

HDFS: Hadoop Distributed File System
A given file is broken into blocks (default = 64 MB), and the blocks are replicated across the cluster (default = 3 replicas).
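The block-and-replica scheme can be sketched in a few lines of Python. This is a toy placement model, not actual HDFS code; the node names, file size, and round-robin placement policy are illustrative assumptions (real HDFS is also rack-aware):

```python
# Toy model of HDFS block splitting and replica placement (illustrative only).
BLOCK_SIZE = 64 * 1024 * 1024   # default block size: 64 MB
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size):
    """Number of blocks a file of `file_size` bytes occupies (ceiling division)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(num_blocks, data_nodes):
    """Round-robin each block's replicas across distinct Data Nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file -> 4 blocks
layout = place_replicas(blocks, nodes)
```

Because every block lives on three distinct nodes, losing any single node still leaves at least two copies of each block — which is what gives HDFS its durability and availability.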
[Diagram: a file's blocks 1-5 replicated across four Data Nodes, three copies each.]
Optimized for: throughput, Put/Get/Delete, appends.
Block replication for: durability, availability, throughput.
Block replicas are distributed across servers and racks.

Fault Tolerance for Data (HDFS)
[Diagram: if a Data Node fails, its blocks survive on the remaining replicas.]

Fault Tolerance for Processing (MapReduce)
[Diagram: if a Task Tracker fails, the Job Tracker reassigns its tasks; tables are backed up.]

MapReduce Data Flow
Input Data → Map → Shuffle → Reduce → Results

Understanding the Concept of MapReduce: The Story of Sam
Sam's mother believed an apple a day keeps the doctor away, so she gave Sam an apple.
Sam thought of drinking the apple: he used a knife to cut the apple and a blender to make the juice.
The next day, Sam applied his invention to all the fruits he could find in the fruit basket: (map (cut each fruit)), then (reduce (blend the pieces)).

Classical Notion of MapReduce in Functional Programming
A list of values is mapped into another list of values, which then gets reduced into a single value.

18 Years Later
Sam got his first job at Tropicana for his expertise in making juices. Now it's not just one basket but a whole container of fruits, and a list of juice types must be produced separately — but Sam still had just ONE knife and ONE blender. Not enough! Large data, and a list of values for output. Wait!
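The classical map → shuffle → reduce pattern can be sketched as a toy in-memory Python program. This counts fruits instead of juicing them, and the data and function names are illustrative; Hadoop runs the same three phases distributed across a cluster:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit one <key, value> pair per fruit (like cutting each fruit)."""
    for line in records:
        for fruit in line.split():
            yield (fruit, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: collapse each <key, value-list> into one result (like blending)."""
    return {key: sum(values) for key, values in groups.items()}

baskets = ["apple orange apple", "pear apple orange"]   # hypothetical input
counts = reduce_phase(shuffle(map_phase(baskets)))
# counts -> {'apple': 3, 'orange': 2, 'pear': 1}
```

Note that the mapper and reducer here are pure functions of their inputs — the side-effect-free property the knife-and-blender analogy insists on — which is what lets Hadoop rerun a failed task anywhere without corrupting the result.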
Brave Sam: A Parallel Version of His Innovation
- Each input to a map is a list of <key, value> pairs, e.g. (<a, >, <o, >, <p, >, ...).
- Each output of a map is a list of <key, value> pairs, grouped by key, e.g. (<a, >, <o, >, <p, >, ...).
- Each input to a reduce is a <key, value-list> pair (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a, ( )>, which is reduced into a list of values.

The Combiner
Sam realized that to create his favorite mixed-fruit juice he could use a combiner after the reducers.
If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), the blender (reducer) is applied separately to each of them.
The knife (mapper) and blender (reducer) should not contain residue after use — they are side-effect free.
Source: (Map Reduce, 2010).

Conclusions
The key benefits of Apache Hadoop:
1) Agility/ Flexibility (Quickest Time to Insight)
2) Complex Data Processing (Any Language, Any Problem)
3) Scalability of Storage/Compute (Freedom to Grow)
4) Economical Storage (Keep All Your Data Alive Forever)
The key systems for Apache Hadoop are:
1) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
2) MapReduce: distributed fault-tolerant resource management coupled with scalable data processing.

References
Ekanayake, S. (2010, March). MapReduce: The Story of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html.
Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.
The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/.
Drost, I. (2010, February). Apache Hadoop: Large-Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8.
Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI.