You are on page 1of 17

C-Store: Data Management in

the Cloud

Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
Jun 5, 2009
What is the Cloud?

 A definition from Wikipedia


 Cloud computing is a style of computing in which
dynamically scalable and often virtualized resources are
provided as a service over the Internet.

 Platform as a service (e.g, Amazon EC2)


 allows customers to rent computers (virtual machines) on
which to run their own computer applications.
 Software as a service
 infrastructure as a service
Amazon EC2 (Elastic Compute Cloud)

 EC2 uses Xen virtualization.


 Each virtual machine, called an "instance",
functions as a virtual private server in one of
three sizes:
 small, large or extra large.
 Amazon.com sizes instances based on "EC2
Compute Units"
 the equivalent CPU capacity of physical hardware.
One EC2 Compute Unit equals 1.0-1.2 GHz 2007
Opteron or 2007 Xeon processor.
Pricing

 Amazon charges customers in two primary


ways:
 Hourly charge per virtual machine
 Data transfer charge

 Amazon advertising describes the pricing


scheme as "you pay for resources you
consume".
Advantage of Public Cloud

 Public clouds are hosted by large


infrastructure companies such as
 Amazon, Google, Yahoo, Microsoft, Sun
 Can afford huge cloud.
 For many companies, especially for start-ups
and medium-sized business), setting up a
private cloud can be too expensive
 hardware cost
 Software cost
 Personnel cost for maintaining the system
Cloud Characteristics

 Computing power is elastic, but noly if workload is


parallelizable.
 Computing power comes from shared-nothing architecture.

 Data is stored at an un-trusted host.


 A possible solution is encrypting data.

 Data is replicated, often across large geographic


distance.
 To provide data availability and durability.
Transactional Data Management (OLTP)

 Typically does not use a shared-nothing architecture.


 OLTP systems are usually less than 1TB in size.

 It is hard to maintain ACID guarantees in the face of


data replication over large geographic distances.
 Google’s Bigtable implements a replicated shared-nothing
database, by weaking “A” from ACID.
 The H-Store project still remains in vision stage.

 There are big risks in storing transactional data on


an un-trusted host.
 Transactional data include details at the lowest granularity.
First Conclusion

 Transactional data management applications


are not well suited for deployment in the
cloud.
Analytical Data Management (DW)

 Tend to be read-mostly (read-only), with occasional


batch inserts.

 Shared-nothing architecture is a good match.


 The ever increasing amount of data is the primary driver for
choosing shared-nothing.
 Large scans, multidimensional aggregations, and star
schema joins for analytical workload are easy to parallelize
on shared-nothing system.
 Infrequent writes eliminates the need for complex
distributed locking and commit protocols.
Analytical Data Management (DW):
continued
 ACID guarantees are typically not needed.
 Snapshot isolation is usually enough.

 Particularly sensitive data can often be left


out of the analysis.
 Less granular versions of the data are usually
used for analysis.
Second Conclusion

 Analytical Data Management applications are


well-suited for deployment in the cloud.
Vertica (C-Store) for the Cloud
Cloud DBMS Wish List
 Efficiency

 Fault Tolerance
 If a query must restart each time a node fails, then long, complex
queries are difficult to complete.

 Ability to run in a heterogeneous environment.


 Should prevent the slowest node from making a disproportionate
affect on total query performance.

 Ability to operate on encrypted data.

 Ability to interface with business intelligence products.


MapReduce vs. Parallel DBMS (1)

 Efficiency
 MapReduce is good for brute-force scan over unstructured
data such as text documents.
 Parallel DBMS is good for selective access of structured
data.
 Fault Tolerance
 MapReduce takes it as a high priority.
 Most parallel DBMS restart a query upon a faiure.
 Ability to run in a heterogeneous environment.
 MapReduce does well.
 Parallel DBMS are generally designed to run in a
homogeneous environment.
MapReduce vs. Parallel DBMS (2)

 Ability to operate on encrypted data.


 Neither has the native ability to operate on
encrypted data.

 Ability to interface with business intelligence


products.
 MapReduce is not intended for interfacing with BI
products.
 Parallel DBMS supports BI products well.
A Call for A Hybrid Solution

 Bring together ideas from MapReduce and


Parallel DBMS.

 The hybrid solution should combine


 Fault tolerance, heterogeneous cluster, and ease
of use out-of-the-box capabilities of MapReduce
 With the efficiency, performance, and tool
plugability of shared-nothing parallel DBMS.
References

1. Abadi, Daniel J. Data Management in the


Cloud: Limitations and Opportunities. In
IEEE Data Engineering Bulletin, 2009.
2. Vertica Company. Getting Started with
Vertica Analytic Database for the Cloud.
2009.

You might also like