http://vargas-solar.com/bigdata-management
3
Collection of data sets so large and complex that
it becomes difficult to process using
on-hand database management tools or
traditional data processing applications
4
THE Vs & NEEDS OF BIG DATA
5
BIG DATA AT BRONTO SCALE
Helluva lot of data!!
dealing with just big data
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
http://spectrum.ieee.org/computing/software/beyond-just-big-data
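The ladder of units on this slide can be sketched in Python as follows. This is a minimal sketch, not part of the original slides: the function name `humanize` is hypothetical, the Byte and Kilobyte rungs are added for completeness, and "Brontobyte" and "Geopbyte" are the informal names used on the slide rather than standardized prefixes.

```python
# Decimal units, each 1000 times the previous one.
# "Brontobyte" and "Geopbyte" are informal names taken from the slide.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte",
         "Exabyte", "Zettabyte", "Yottabyte", "Brontobyte", "Geopbyte"]


def humanize(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    exponent = 0
    scaled = n_bytes
    # Integer division avoids floating-point drift when picking the unit.
    while scaled >= 1000 and exponent < len(UNITS) - 1:
        scaled //= 1000
        exponent += 1
    value = n_bytes / 1000 ** exponent
    return f"{value:g} {UNITS[exponent]}"


print(humanize(10 ** 27))
print(humanize(1500 * 10 ** 9))
```

With these definitions, 10^27 bytes is reported as one Brontobyte, which is the scale the slide title refers to.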
6
DATA CENTRIC SCIENCES
Data Science Computational Science
Network Science
Social Data Science
Digital humanities
Data collections as the backbone for conducting experiments, driving hypotheses and reaching valid
conclusions, models, simulations and understanding
7
SMART ENERGY MANAGEMENT FOR ALL
Figure: Smart Grid; intensive use of Xbox online 16:00-22:00; 22-25 °C constant; moderate use of lights; lamp at night 19:00-6:00
Use Smart Grid technology (sensors, actuators, automatic cooling systems, meteorology services)
Observe people's energy consumption behaviour at home
Propose strategies for reducing the electricity invoice: when to open/close curtains, how to coordinate the use of electric
appliances
Encourage intelligent energy consumption at home, which is inversely proportional to:
The energy consumption rate
The size of the CO2 footprint
Respect comfort requirements
8
GREEN & INTELLIGENT HOUSE
Challenge: determine personalized energy consumption and provisioning models
Comfort preferences
9
DATA PROCESSING REQUIREMENTS
Comfort requirements
Cloud
10
DATA COLLECTIONS
Consumed data: different sizes; degree of quality, uncertainty and ambiguity; evolution in structure, completeness,
production conditions and retrieval conditions; content and explicit cultural, contextual and background properties;
modification of access policies
Conditions of consumption: reproducibility, degree of transparency (avoid software artefacts)
Figure: data collections and releases; applications & data consumers; computing resources; data volumes from Peta (10^15) to Yotta (10^24); disk
14
OBJECTIVES
Manage data collections with different uses and access patterns, because
these properties tend to reach the limits of:
The storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a
certain period, and
The pace (computing speed) at which data must be consumed (harvested, prepared and processed).
Build services based on underlying value-added data managers that can:
Exploit available resources, making a compromise between QoS properties and SLA requirements considering all the
levels of the stack
Deliver request results at a reasonable price and in a reliable, efficient manner, regardless of the devices, resource
availability and data properties
15
Final remarks & Lecture program
16
FINAL REMARKS
Data collections
New scales: bronto scale due to emerging IoT
New types: thick, long, hot, cold
New quality measures: QoS, QoE, SLA
17
CONTENT
18
19
Some historical facts about data management
20
Data was not stored
Beginning of the use of DBs & basic reports
Great variety of visual resources to analyse data
21
SEQUENTIAL DATA PROCESSING
Figure labels: Master Tape, Orders, Computer, Updated Master, Invoice
External medium (tape or cards)
Not much different from Hollerith's tabulator
Data dedicated to one application
Batch processing
Data seen as flowing through a stationary processor
22
DISKS MADE MODERN DATABASES POSSIBLE
Don Chamberlain
23
CODE & DATA: SEPARATED AT BIRTH
COBOL
IDENTIFICATION: document
AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL,
DATE-WRITTEN, DATE-COMPILED, SECURITY.
ENVIRONMENT: OS
CONFIGURATION SECTION.
INPUT-OUTPUT SECTION.
DATA: Files/Records
FILE SECTION.
WORKING-STORAGE SECTION.
LINKAGE SECTION.
REPORT SECTION.
SCREEN SECTION.
PROCEDURE: code
CODASYL DBTG (1967)
COnference on DAta SYstems Languages, Data Base Task Group
Defined DDL for a network data model
Set-Relationship semantics
Cursor Verbs
Isolated from procedures
No encapsulation
DATA division is the Schema ancestor
Slide labels: "Us", "them"
Jim Gray
24
INTEGRATED DATABASES (MID 60S)
25
Big data variety: the right model according to data
26
1970 - 2000 RELATIONAL DB
More than 30 years: maturity!
Theoretical & Practical aspects (DBMS)
Domains & R ⊆ D1 x D2 x ... x Dn, Algebra
1st Order Predicate Logic
Languages: SQL (wins), QUEL, QBE
DBMS Prototypes (1975), Products (1980)
A major improvement in DB: provide data independence & a simple, tabular view of data
Normal Forms & Dependencies (DB design, consistency)
Controversial: missing values, duplicates
Algebra operators shown on the slide: RxS, RS, R-S, R[a], R:j, R*S
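A few of the algebra operators can be illustrated with Python sets of tuples. This is a minimal sketch under assumptions: relations are modelled as sets with positional attributes, the sample relations `R` and `S` and all function names are hypothetical, and reading `R[a]` as projection is an interpretation of the slide's notation.

```python
from itertools import product

# A relation is modelled here as a set of tuples with positional attributes.
R = {(1, "ann"), (2, "bob")}
S = {(2, "bob"), (3, "cid")}


def cartesian(r, s):
    """R x S: concatenate every tuple of r with every tuple of s."""
    return {a + b for a, b in product(r, s)}


def difference(r, s):
    """R - S: tuples of r that do not appear in s."""
    return r - s


def project(r, positions):
    """Projection (assumed meaning of R[a]): keep only the listed positions."""
    return {tuple(t[i] for i in positions) for t in r}


def select(r, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return {t for t in r if predicate(t)}


print(difference(R, S))
print(project(R, [1]))
```

Because sets discard repeated tuples, projection here never returns duplicates. SQL tables behave as multisets instead, which is one side of the "duplicates" controversy noted on the slide.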
27
DATA MODELS
Tuple
Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar
Document
Allows values to be nested documents or lists, as well as scalar values.
Attributes are not defined in a global schema
Extensible record
Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be
added on a per-record basis
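The three data models can be contrasted with plain Python values. This is only an illustrative sketch: the schema, attribute names, family names and helper functions are all hypothetical and do not correspond to any particular system.

```python
SCALARS = (int, float, str, bool, type(None))

# Tuple: attributes are pre-defined in a schema and every value is scalar.
USER_SCHEMA = ("id", "name", "city")
row = (1, "Ana", "Grenoble")


def is_valid_tuple(candidate, schema=USER_SCHEMA):
    """Check arity against the schema and that all values are scalar."""
    return len(candidate) == len(schema) and all(
        isinstance(value, SCALARS) for value in candidate
    )


# Document: values may be nested documents or lists; no global schema.
document = {"id": 1, "name": "Ana", "addresses": [{"city": "Grenoble"}]}

# Extensible record: attribute families come from a schema,
# but attributes inside a family can be added per record.
FAMILIES = {"profile", "activity"}


def add_attribute(record, family, name, value):
    """Add an attribute to a record, rejecting families outside the schema."""
    if family not in FAMILIES:
        raise KeyError(f"unknown family: {family}")
    record.setdefault(family, {})[name] = value
    return record


record = {"profile": {"name": "Ana"}}
add_attribute(record, "activity", "last_login", "2015-01-01")
```

The tuple is rejected if a value is a list, the document accepts any nesting, and the extensible record sits in between: new attributes are free, new families are not.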
28
DATA STORES
Key-value
Systems that store values and an index to find them, based on a key
Document
Systems that store documents, providing index and simple query mechanisms
Extensible record
Systems that store extensible records that can be partitioned vertically and horizontally across nodes
Graph
Systems that model and store data as graphs, where nodes can represent content modelled as document or key-value
structures and arcs represent a relation between the data modelled by the nodes
Relational
Systems that store, index and query tuples
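Two of the store families listed above can be sketched in memory. This is a toy sketch under assumptions: the class and method names are hypothetical, a Python dict stands in for the index, and nothing here reflects the API of a real product.

```python
class KeyValueStore:
    """Stores values plus an index (here a dict) to find them by key."""

    def __init__(self):
        self._index = {}

    def put(self, key, value):
        self._index[key] = value

    def get(self, key, default=None):
        return self._index.get(key, default)


class GraphStore:
    """Nodes hold content (a document or key-value structure); arcs relate nodes."""

    def __init__(self):
        self.nodes = {}    # node id -> content
        self.arcs = set()  # (source id, label, target id)

    def add_node(self, node_id, content):
        self.nodes[node_id] = content

    def add_arc(self, source, label, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.arcs.add((source, label, target))

    def neighbours(self, node_id, label):
        """Targets reachable from node_id through arcs with the given label."""
        return sorted(t for s, l, t in self.arcs if s == node_id and l == label)


kv = KeyValueStore()
kv.put("user:1", {"name": "Ana"})

graph = GraphStore()
graph.add_node("ana", {"name": "Ana"})
graph.add_node("luis", {"name": "Luis"})
graph.add_arc("ana", "friend_of", "luis")
print(graph.neighbours("ana", "friend_of"))
```

The key-value store can only answer lookups by key, whereas the graph store can also answer questions about relations between the stored contents.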
29
KEY-VALUE STORES
Replication, versioning, locking, transactions, sorting
Voldemort: project-voldemort.com
Riak: riak.basho.com
Membrain: schoonerinfotech.com/products
30
SELECT name, pic, profile_url
FROM user
WHERE uid = me()

SELECT message, attachment
FROM stream
WHERE source_id = me() AND type = 80

SELECT name
FROM friendlist
WHERE owner = me()
32
DOCUMENT STORES
33
DOCUMENT STORES
34
EXTENSIBLE RECORD STORES
35
SCALABLE RELATIONAL SYSTEMS
SQL: rich declarative query language
Databases enforce referential integrity
ACID semantics
Well understood operations: configuration, care and feeding, backups, tuning, failure and ...
ACID transactions inefficient with communication and 2PC overhead
SYSTEM ADDRESS
MySQL C mysql.com/cluster
Nimbus DB nimbusdb.com
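The communication overhead of 2PC (two-phase commit) mentioned on this slide can be made concrete with a small simulation. This is a simplified sketch under assumptions: the class and function names are hypothetical, messages are only counted rather than sent, and failures, timeouts and logging are ignored.

```python
class Participant:
    """A node that votes in phase 1 and applies the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run 2PC and return (decision, number of messages exchanged)."""
    messages = 0
    votes = []
    for p in participants:  # phase 1: prepare request + vote
        messages += 2
        votes.append(p.prepare())
    decision = all(votes)
    for p in participants:  # phase 2: decision + acknowledgement
        messages += 2
        if decision:
            p.commit()
        else:
            p.abort()
    return decision, messages


nodes = [Participant("a"), Participant("b"), Participant("c")]
print(two_phase_commit(nodes))
```

In this model every transaction costs four messages per participant and two rounds of waiting, whatever the outcome, which is the overhead that grows as data is spread across more nodes.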
36
DBMS EVOLUTION
37
SERVICE ORIENTED DBMS
Extension services: streaming, XML, procedures, queries, replication
Data services
Access services
Storage services
Additional extension services
Other services
1 Ionut Subasu, Patrick Ziegler, and Klaus R. Dittrich. Towards service-based data management systems. In Workshop Proceedings of Datenbanksysteme in Business, Technologie und Web (BTW 2007).
Klaus R. Dittrich and Andreas Geppert. Component Database Systems. Morgan Kaufmann, 2000.
38
SERVICE ORIENTED DBMS
Service Level Agreement example: in the event of a corruption or other disaster, the maximum amount of data loss is the last 15 minutes of transactions
Service level agreement: the contracted delivery time of the service or performance
Required SLA: agreements between the user and SDBMS expressed as a combination of weighted measures associated to a query
39
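One way to read "a combination of weighted measures associated to a query" is as a weighted average of normalised measures. The sketch below is an assumption, not the definition used in the cited work: the function names, the measure names and the normalisation to the range 0 to 1 (higher is better) are all hypothetical.

```python
def sla_score(measures, weights):
    """Weighted average of normalised measures (each in [0, 1], higher is better)."""
    if set(measures) != set(weights):
        raise ValueError("measures and weights must cover the same names")
    total_weight = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total_weight


def within_data_loss_bound(minutes_since_last_safe_point, bound_minutes=15):
    """Check the slide's example clause: at most 15 minutes of transactions lost."""
    return minutes_since_last_safe_point <= bound_minutes


# Hypothetical query requirements: latency matters three times more than cost.
score = sla_score({"latency": 1.0, "cost": 0.5}, {"latency": 3, "cost": 1})
print(score)
```

A query plan could then be accepted only if its score reaches a threshold agreed in the contract, while hard clauses such as the data loss bound are checked separately.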
CHALLENGES & OBJECTIVE
41
How to combine, deploy, and deliver DBMS functionalities:
Compliant to application/user requirements
Optimizing the consumption of computing resources in the presence of greedy data
processing tasks
Delivered according to Service Level Agreement contracts
Deployed in elastic architectures
"Design the next generation of data management systems guided by SLA"
42
DBMS architecture issues
43
3-TIER ARCHITECTURE
Client
Logical vs. Physical Architectures
Data server
44
SYSTEMS & ARCHITECTURES
45
PARALLEL DATABASE ARCHITECTURES
46
THE CLOUD
Software as a service (SaaS)
Platform as a service (PaaS)
Infrastructure as a service (IaaS)
Illusion of infinite resources
No up-front cost
Fine-grained billing (e.g. hourly)
Promotes a style of computing in which dynamically scalable and often virtualized resources are
provided as a service over the Internet
IaaS: allows customers to rent computers (virtual machines) on which to run their own computer
applications.
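Fine-grained billing can be illustrated with a small calculation. This is a sketch under assumptions: the function name, the price and the rule of rounding each usage session up to a whole hour are hypothetical and do not describe any provider's actual tariff.

```python
import math


def on_demand_cost(usage_hours, price_cents_per_hour):
    """Bill each usage session separately, rounding up to whole hours."""
    billed_hours = sum(math.ceil(hours) for hours in usage_hours)
    return billed_hours * price_cents_per_hour


# Two sessions of 0.5 h and 2.25 h at a hypothetical 10 cents per hour.
print(on_demand_cost([0.5, 2.25], 10))
```

The customer pays only for the hours actually billed and nothing when no machine is running, which is the contrast with buying hardware up front.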
47
THE CLOUD
49
SQL AS A SERVICE
User applications
Relational DBMS
Implemented on top of parallel clusters of common DBMS servers, e.g., MySQL, MS SQL Server
50
DATA MANAGEMENT WITHOUT RESOURCE CONSTRAINTS
ELASTIC
COST-AWARE
Algorithms
Systems
Reduce the cost to manage and exploit data sets according to unlimited storage,
memory and computation resources
51
CLOUD DATA MANAGEMENT: SERVICES VIEWS
Definition
Querying and exploiting
Manipulation
Maintenance
Storage (persistency)
Efficient retrieval (indexing, caching)
Fault tolerance (recovery, replication)
Figure: data volume (Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24) against hardware (RAID, magnetic, tape) and cloud
52
Genoveva Vargas-Solar
CR1, CNRS, LIG-LAFMIA
Genoveva.Vargas@imag.fr
http://vargas-solar.com
53