
HadoopDB

An Architectural Hybrid of MapReduce & DBMS Technologies for Analytical Workloads

Road Map
Motivation
Introduction
Desired Properties
Background & Shortfalls
HadoopDB
Benchmarks
Fault Tolerance
Conclusion
Related Work
References

Introduction
Analyzing massive structured data on 1000s of shared-nothing nodes
Shared-nothing architecture:
A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network

Approaches:
Parallel databases
Map/Reduce systems

Desired Properties
Performance
A primary characteristic that commercial database systems use to distinguish themselves

Fault tolerance
Heterogeneous environments
As the number of nodes increases, keeping them homogeneous becomes difficult

Flexible query interface


Usually JDBC or ODBC
UDF mechanism
Desirable: both SQL and non-SQL interfaces

Background-PDBMS
Standard relational tables and SQL
Indexing, compression, caching, I/O sharing
Tables partitioned over nodes
Transparent to the user

Meets the performance requirement
Needs a highly skilled DBA

Flexible query interfaces


UDF support varies across implementations
Fault tolerance
Does not score so well
Assumption: failures are rare
Assumption: dozens of nodes in clusters

Background-MapReduce
Satisfies fault tolerance
Works in heterogeneous environments
Drawback: performance
No performance-enhancing techniques

Interfaces
Write M/R jobs in multiple languages
SQL not supported directly (except through layers such as Hive)

MapReduce (Hadoop)
MapReduce is a programming model which specifies:


A map function that processes a key/value pair to generate a set of intermediate key/value pairs
A reduce function that merges all intermediate values associated with the same intermediate key
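
The two functions above can be illustrated with a minimal in-memory sketch. This is not Hadoop code: `run_mapreduce` is a hypothetical single-process driver that only imitates the map, shuffle, and reduce steps, and word counting is chosen purely as an example task.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in one line."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver (an assumption, not a Hadoop API)."""
    # Shuffle step: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce step: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

# Input records are (line number, line text) pairs.
lines = [(0, "map reduce map"), (1, "reduce map")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

In a real cluster the map calls and reduce calls would run on different machines; only the grouping-by-key contract carries over from this sketch.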

Hadoop
Is a MapReduce implementation for processing large data sets over 1000s of nodes
Maps and Reduces run independently of each other over blocks of data distributed across a cluster

Background-MapReduce

Differences between Parallel Databases and MapReduce?


HadoopDB


HadoopDB
Hadoop as communication layer above multiple nodes running single-node DBMS instances
Full open-source solution:
PostgreSQL as DB layer
Hadoop as communication layer
Hive as translation layer


HadoopDB
RDBMS
Careful layout of data
Indexing
Sorting
Query optimization
Compression

Hadoop
Job scheduling
Task coordination
Parallelization


Ideas
Main goal: achieve the properties described before
Connect multiple single-datanode systems
Hadoop as the task coordination & network communication layer
Queries parallelized across the nodes using the MapReduce framework
Fault tolerant and works on heterogeneous nodes
Parallel database performance
Query processing in the database engine


Architecture Background
Data storage layer (HDFS)
Block-structured file system managed by a central NameNode
Files broken into blocks and distributed
Data processing layer (Map/Reduce framework)
Master/slave architecture
Job and Task trackers


HadoopDB Components
Database Connector
Catalog
Data Loader
Planner (SMS)


Database Connector
Interface between DBMS and TaskTracker
Responsibilities
Connect to the database
Execute the SQL query
Return the results as key-value pairs
Achieved goal
Data sources are similar to data blocks in HDFS
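
The three responsibilities listed above (connect, execute, return key-value pairs) can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the PostgreSQL connection, and `run_query_as_pairs`, the `rankings` table, and its columns are hypothetical names, not part of HadoopDB.

```python
import sqlite3

def run_query_as_pairs(conn, sql, key_column=0):
    """Execute sql and yield (key, row) pairs, keyed on one result column.

    Hypothetical helper: it mimics a connector handing query results
    to a map task as key-value pairs.
    """
    cursor = conn.execute(sql)
    for row in cursor:
        yield row[key_column], row

# Connect to the database (an in-memory SQLite database as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (page_url TEXT, page_rank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.html", 12), ("b.html", 3), ("c.html", 40)],
)

# Execute the SQL query and collect the results as key-value pairs.
pairs = list(run_query_as_pairs(
    conn,
    "SELECT page_url, page_rank FROM rankings "
    "WHERE page_rank > 10 ORDER BY page_url",
))
conn.close()
print(pairs)
```

From the point of view of the caller, the database behaves like any other input source that produces keyed records, which is the sense in which data sources resemble HDFS blocks.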


Catalog
Maintains information about the databases
Database location, driver class
Datasets in cluster, replica or partitioning

Catalog stored as an XML file in HDFS
Plan to deploy as a separate service


Data Loader
Responsibilities:
Globally partition the data on a given key
Break single-node data into chunks
Bulk-load chunks into single-node databases

Two main components:


Global hasher
Map/Reduce job reads from HDFS and repartitions

Local Hasher
Copies from HDFS to local file system
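
The two-level partitioning described on this slide can be sketched roughly as below. This is a simplified in-memory model, not the HadoopDB loader: `stable_hash`, `global_hash`, and `local_hash` are hypothetical names, the HDFS copy and the bulk load are omitted, and the use of CRC32 with a salt for the second level is an assumption made so the chunk choice is not tied to the node choice.

```python
import zlib

def stable_hash(key, salt=""):
    """Deterministic hash; Python's built-in hash() is randomized for strings."""
    return zlib.crc32(f"{salt}{key}".encode("utf-8"))

def global_hash(records, key_fn, num_nodes):
    """Global step: repartition all records across nodes on the given key."""
    partitions = [[] for _ in range(num_nodes)]
    for record in records:
        partitions[stable_hash(key_fn(record)) % num_nodes].append(record)
    return partitions

def local_hash(node_records, key_fn, num_chunks):
    """Local step: break one node's partition into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for record in node_records:
        index = stable_hash(key_fn(record), salt="chunk:") % num_chunks
        chunks[index].append(record)
    return chunks

records = [(f"user{i}", i) for i in range(100)]
by_key = lambda record: record[0]

partitions = global_hash(records, by_key, num_nodes=4)
chunked = [local_hash(p, by_key, num_chunks=3) for p in partitions]

for node_id, node_chunks in enumerate(chunked):
    print(node_id, [len(c) for c in node_chunks])
```

Each chunk would then be bulk-loaded into the single-node database on its node; records with the same key always land on the same node and in the same chunk.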


SMS Planner
Extends Hive
Steps
Parser transforms query to abstract syntax tree (AST)
Get table schema information from catalog
Logical plan generator creates query plan
Optimizer breaks up plan to Map or Reduce phases
Executable plan generated for one or more MapReduce jobs
SMS tries to push maximum work to database layer


Benchmarking
Environment
Amazon EC2 large instances
Each instance
7.5 GB memory, 2 virtual cores, 850 GB storage, 64-bit Linux Fedora 8

Systems
Hadoop
256 MB data blocks, 1024 MB heap size, 200 MB sort buffer

HadoopDB
Similar to the Hadoop configuration, PostgreSQL 8.2.5, no data compression

Vertica
Used a cloud edition
All data is compressed

DBMS-X
Commercial parallel row store
Run on EC2 (no cloud edition available)

Benchmarking
Used data
HTTP log files, HTML pages, rankings
Sizes (per node):
155 million user visits (~20 GB)
18 million rankings (~1 GB)
Stored as plain text in HDFS


Evaluating HadoopDB
Compare HadoopDB to
1 Hadoop
2 Parallel databases (Vertica, DBMS-X)

Features:
1 Performance:
We expected HadoopDB to approach the performance of parallel databases

2 Scalability:
We expected HadoopDB to scale as well as Hadoop

We ran the Pavlo et al. SIGMOD09 benchmark on Amazon EC2 clusters of 10, 50, 100 nodes.

Benchmark tasks
Data loading
Grep task
Selection task
Aggregation task
Join task
UDF Aggregation task
Fault tolerance and Heterogeneous environment


Data Load


Queries Result


Load: data loads are slower than Hadoop, but faster than parallel databases
Runtime:
Structured data: HadoopDB is faster than Hadoop but slower than parallel databases (HadoopDB's performance is close to parallel databases)
Unstructured data: HadoopDB's performance matches Hadoop


Scalability:Setup
Simple aggregation task - full table scan
Data replicated across 10 nodes
Fault-tolerance: Kill a node halfway
Fluctuation-tolerance: Slow down a node for the entire experiment


Scalability:Results
HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks
Parallel databases restart the entire query on node failure or wait for the slowest node
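
The benefit of chunk-level runtime scheduling can be seen in a toy simulation. This is a deliberately simplified model, not the Hadoop scheduler: `run_with_rescheduling` is a hypothetical function, nodes take turns in round-robin order, and chunks already finished on the failed node are assumed to remain valid, which is a simplifying assumption.

```python
from collections import deque

def run_with_rescheduling(chunks, nodes, failed_node=None, fail_after=0):
    """Hand out chunks one at a time; re-queue a chunk lost to a node failure.

    failed_node dies when asked to start a chunk after it has already
    completed fail_after chunks. Returns a dict: chunk -> completing node.
    """
    pending = deque(chunks)
    alive = list(nodes)
    completed_by = {node: 0 for node in nodes}
    done = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no nodes left to run the remaining chunks")
        node = alive[turn % len(alive)]
        turn += 1
        chunk = pending.popleft()
        if node == failed_node and completed_by[node] >= fail_after:
            # The node dies: drop it and reschedule only the chunk it held.
            alive.remove(node)
            pending.append(chunk)
            continue
        done[chunk] = node
        completed_by[node] += 1
    return done

chunks = list(range(6))
done = run_with_rescheduling(
    chunks, ["n1", "n2", "n3"], failed_node="n2", fail_after=1
)
print(done)
```

Only the chunk that was in flight on the failed node is redone by a surviving node, whereas a system that treats the query as one unit would have to restart all six chunks.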


To Summarize
HadoopDB - a hybrid of DBMS and MapReduce
HadoopDB is close in performance to parallel databases
HadoopDB is able to operate in truly heterogeneous environments and has the fault tolerance of Hadoop
Is free and open-source

http://hadoopdb.sourceforge.net

Related Work
Pig project at Yahoo
SCOPE project at Microsoft
Hive project


Future Work
Integration with other open source databases
Full automation of the loading and replication process
Dynamically adjusting fault-tolerance levels based on failure rate


Thank You!
