C Store Cloud

C-Store: Data Management in
the Cloud
Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
Jun 5, 2009
What is the Cloud?
 A definition from Wikipedia

 Cloud computing is a style of computing in which
dynamically scalable and often virtualized resources are
provided as a service over the Internet.
 Platform as a service (e.g, Amazon EC2)

 allows customers to rent computers (virtual machines) on
which to run their own computer applications.
 Software as a service
 infrastructure as a service
Amazon EC2 (Elastic Compute Cloud)
 EC2 uses Xen virtualization.

 Each virtual machine, called an "instance",
functions as a virtual private server in one of
three sizes:
 small, large or extra large.
 Amazon.com sizes instances based on "EC2
Compute Units"
 the equivalent CPU capacity of physical hardware.
One EC2 Compute Unit equals 1.0-1.2 GHz 2007
Opteron or 2007 Xeon processor.
Pricing
 Amazon charges customers in two primary

ways:
 Hourly charge per virtual machine
 Data transfer charge
 Amazon advertising describes the pricing

scheme as "you pay for resources you
consume".
Advantage of Public Cloud
 Public clouds are hosted by large

infrastructure companies such as
 Amazon, Google, Yahoo, Microsoft, Sun
 Can afford huge cloud.
 For many companies, especially for start-ups
and medium-sized business), setting up a
private cloud can be too expensive
 hardware cost
 Software cost
 Personnel cost for maintaining the system
Cloud Characteristics
 Computing power is elastic, but noly if workload is

parallelizable.
 Computing power comes from shared-nothing architecture.
 Data is stored at an un-trusted host.

 A possible solution is encrypting data.
 Data is replicated, often across large geographic

distance.
 To provide data availability and durability.
Transactional Data Management (OLTP)
 Typically does not use a shared-nothing architecture.

 OLTP systems are usually less than 1TB in size.
 It is hard to maintain ACID guarantees in the face of

data replication over large geographic distances.
 Google’s Bigtable implements a replicated shared-nothing
database, by weaking “A” from ACID.
 The H-Store project still remains in vision stage.
 There are big risks in storing transactional data on

an un-trusted host.
 Transactional data include details at the lowest granularity.
First Conclusion
 Transactional data management applications

are not well suited for deployment in the
cloud.
Analytical Data Management (DW)
 Tend to be read-mostly (read-only), with occasional

batch inserts.
 Shared-nothing architecture is a good match.

 The ever increasing amount of data is the primary driver for
choosing shared-nothing.
 Large scans, multidimensional aggregations, and star
schema joins for analytical workload are easy to parallelize
on shared-nothing system.
 Infrequent writes eliminates the need for complex
distributed locking and commit protocols.
Analytical Data Management (DW):
continued
 ACID guarantees are typically not needed.
 Snapshot isolation is usually enough.
 Particularly sensitive data can often be left

out of the analysis.
 Less granular versions of the data are usually
used for analysis.
Second Conclusion
 Analytical Data Management applications are

well-suited for deployment in the cloud.
Vertica (C-Store) for the Cloud
Cloud DBMS Wish List
 Efficiency
 Fault Tolerance
 If a query must restart each time a node fails, then long, complex
queries are difficult to complete.
 Ability to run in a heterogeneous environment.

 Should prevent the slowest node from making a disproportionate
affect on total query performance.
 Ability to operate on encrypted data.
 Ability to interface with business intelligence products.

MapReduce vs. Parallel DBMS (1)
 Efficiency
 MapReduce is good for brute-force scan over unstructured
data such as text documents.
 Parallel DBMS is good for selective access of structured
data.
 Fault Tolerance
 MapReduce takes it as a high priority.
 Most parallel DBMS restart a query upon a faiure.
 Ability to run in a heterogeneous environment.
 MapReduce does well.
 Parallel DBMS are generally designed to run in a
homogeneous environment.
MapReduce vs. Parallel DBMS (2)
 Ability to operate on encrypted data.

 Neither has the native ability to operate on
encrypted data.
 Ability to interface with business intelligence

products.
 MapReduce is not intended for interfacing with BI
products.
 Parallel DBMS supports BI products well.
A Call for A Hybrid Solution
 Bring together ideas from MapReduce and

Parallel DBMS.
 The hybrid solution should combine

 Fault tolerance, heterogeneous cluster, and ease
of use out-of-the-box capabilities of MapReduce
 With the efficiency, performance, and tool
plugability of shared-nothing parallel DBMS.
References
1. Abadi, Daniel J. Data Management in the

Cloud: Limitations and Opportunities. In
IEEE Data Engineering Bulletin, 2009.
2. Vertica Company. Getting Started with
Vertica Analytic Database for the Cloud.
2009.

C Store Cloud

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

C Store Cloud

Uploaded by

Copyright:

Available Formats

C-Store: Data Management in

 A definition from Wikipedia

 Platform as a service (e.g, Amazon EC2)

 EC2 uses Xen virtualization.

 Amazon charges customers in two primary

 Amazon advertising describes the pricing

 Public clouds are hosted by large

 Computing power is elastic, but noly if workload is

 Data is stored at an un-trusted host.

 Data is replicated, often across large geographic

 Typically does not use a shared-nothing architecture.

 It is hard to maintain ACID guarantees in the face of

 There are big risks in storing transactional data on

 Transactional data management applications

 Tend to be read-mostly (read-only), with occasional

 Shared-nothing architecture is a good match.

 Particularly sensitive data can often be left

 Analytical Data Management applications are

 Ability to run in a heterogeneous environment.

 Ability to operate on encrypted data.

 Ability to interface with business intelligence products.

 Ability to operate on encrypted data.

 Ability to interface with business intelligence

 Bring together ideas from MapReduce and

 The hybrid solution should combine

1. Abadi, Daniel J. Data Management in the

You might also like