Session I Data Centric Sciences

Big Data Analytics Trends
putting forward data centric sciences

Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA
genoveva.vargas@imag.fr
http://vargas-solar.com/bigdata-management
Smart Grid Master, September-October, 2017

http://www.gartner.com/newsroom/id/2819918
INTERNET OF THINGS
Collection of data sets so large and complex that
it becomes difficult to process using
on-hand database management tools or
traditional data processing applications
THE Vs & NEEDS OF BIG DATA
Volume (increasing amount of data)
Velocity (speed of data in and out)
Variety (range of data types and sources)
Veracity (data consistency)
Value (what is the real value of the data?)
BIG DATA AT BRONTO SCALE
1 bit = binary digit
8 bits = 1 byte
1000 bytes = 1 Kilobyte
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Helluva lot of data!!
We will no longer have the luxury of dealing with just big data
http://spectrum.ieee.org/computing/software/beyond-just-big-data
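To get a feel for the scale listed on this slide, here is a minimal sketch that expresses a raw byte count in the largest fitting unit, using the decimal (factor 1000) steps from the slide. The function name `humanize` and the unit list are illustrative assumptions; "Brontobyte" and "Geopbyte" are informal names, not standardized prefixes.

```python
# Decimal units in ascending order, as listed on the slide.
# "Brontobyte" and "Geopbyte" are informal, non-standard names.
UNITS = ["byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte",
         "Exabyte", "Zettabyte", "Yottabyte", "Brontobyte", "Geopbyte"]


def humanize(n_bytes: int) -> str:
    """Express n_bytes in the largest unit whose value is at least 1."""
    index = 0
    # Climb one unit at a time while the next unit still fits.
    while index < len(UNITS) - 1 and n_bytes >= 1000 ** (index + 1):
        index += 1
    value = n_bytes / 1000 ** index
    return f"{value:g} {UNITS[index]}"


if __name__ == "__main__":
    print(humanize(1500))
    print(humanize(10 ** 27))
```

The unit names are not pluralized, to keep the sketch short.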
DATA CENTRIC SCIENCES
Data Science Computational Science
Network Science
Social Data Science
Digital humanities
Develop methodologies weaving data management, greedy algorithms, and programming models that must be tuned to be deployed in different target computer architectures
Data collections as the backbone for conducting experiments, driving hypotheses, and leading to valid conclusions, models, simulations, and understanding
SMART ENERGY MANAGEMENT FOR ALL
Household profiles observed through the Smart Grid:
Profile 1: intensive use of Xbox online 16:00-22:00; 22-25 °C constant; moderate use of lights; lamp at night 19:00-6:00
Profile 2: intensive use of electric appliances; use of washing machine at the same time as microwave twice a week; use of computer and TV between 6:00-8:00 and 14:00-24:00; artificial lights 6:00-8:00, 17:00-24:00 and 22:00-24:00 in winter; natural illumination at least 4 hours/day
Profile 3: enjoys fresh air between 16:00 and 18:00; watches stars at night; uses computer and TV 7:00-8:00 and 14:00-22:00
Use Smart Grid technology (sensors, actuators, automatic cooling systems, meteorology services)
Observe people's energy consumption behaviour at home
Propose strategies for reducing the electricity invoice: when to open/close curtains, how to coordinate the use of electric appliances
Encourage intelligent energy consumption at home, which is inversely proportional to:
Energy consumption rate
The size of the CO2 footprint
Respect comfort requirements
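One way to reason about coordinating the use of electric appliances is to compute the peak simultaneous load of a usage schedule. The sketch below is only an illustration: the schedule format, the wattages and the function name `peak_load` are assumptions, not part of the lecture material.

```python
def peak_load(schedule):
    """Return the highest simultaneous load in watts.

    schedule: list of (start_hour, end_hour, watts) tuples (hypothetical format).
    """
    events = []
    for start, end, watts in schedule:
        events.append((start, watts))   # appliance switched on
        events.append((end, -watts))    # appliance switched off
    # At equal times, process switch-offs (negative deltas) first.
    events.sort()
    load = peak = 0
    for _, delta in events:
        load += delta
        peak = max(peak, load)
    return peak


# Washing machine and microwave at the same time vs. one after the other.
together = [(18, 19, 2000), (18, 18.25, 1000)]
staggered = [(18, 19, 2000), (19, 19.25, 1000)]

if __name__ == "__main__":
    print(peak_load(together), peak_load(staggered))
```

Staggering the two appliances lowers the peak without changing the total energy consumed, which is the kind of strategy a personalized model could propose.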
GREEN & INTELLIGENT HOUSE
Challenge: determine personalized energy consumption and provisioning models
Comfort preferences
Processing the big data involved in observing energy consumption
Computing energy consumption behavior models
Analyse and optimize energy consumption while respecting the comfort requirements of inhabitants
Need for efficient data processing solutions
DATA PROCESSING REQUIREMENTS
Comfort requirements
Cloud
DATA COLLECTIONS
Consumed data: different sizes; quality, uncertainty, and degree of ambiguity; evolution in structure, completeness, production conditions, and the conditions in which data is retrieved; content and explicit cultural, contextual, and background properties; modification of access policies
Conditions of consumption: reproducibility, degree of transparency (avoid software artefacts)
Diagram: data collection releases; applications & data consumers; computing resources

NEITHER MANAGEABLE NOR EXPLOITABLE AS SUCH
RAW DATA: heterogeneous (variety), huge (volume), incomplete, imprecise, missing, contradictory (veracity), continuous releases produced at different rates (velocity), proprietary, critical, private (value)
BIG DATA MANAGEMENT
Data volume: Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
Storage: Disk, RAID, Cloud
Data formats
Data collection sizes
Data storage supports
Data delivery mechanisms
OBJECTIVES
Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
the storage capacity (main memory, cache and disks) required for archiving data collections permanently or for a certain period, and
the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
Build services on top of underlying value-added data managers that can:
Exploit available resources, making a compromise between QoS properties and SLA requirements across all levels of the stack
Deliver request results at a reasonable price and in a reliable and efficient manner, regardless of the devices, resource availability and data properties
Final remarks & Lecture program
FINAL REMARKS
Data collections
New scales: bronto scale due to emerging IoT
New types: thick, long, hot, cold
New quality measures: QoS, QoE, SLA
Data processing & analytics
Complex jobs and stream analytics are still open issues
Economic cost models & business models (Big Data value & pay-as-you-go)
CONTENT
Big Data Analytics Trends

Big data and beyond the mirror
Big Data analytics, Data mining, Data science
Data management at scale: all you need for cooking data

Cooking data: the big picture
Data as service tools: distributed storage, data access API, more complex data processing, declarative languages
New data analytics stacks
Modeling & Predictive analytics

Clustering at different scales
Graph analytics
Some historical facts about data management
Data was not stored
Beginning of the use of DBs & basic reports
Great variety of visual resources to analyse data
SEQUENTIAL DATA PROCESSING
Diagram: Orders and Master Tape feed the Computer, which produces the Invoice and the Updated Master
External medium (tape or cards)
Data dedicated to one application
Data seen as flowing through a stationary processor
Batch processing
Not much different from Hollerith's tabulator
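The batch pattern on this slide (master tape plus orders in, updated master out) can be sketched as a single sequential merge of two inputs sorted by key. This is a minimal illustration, not a historical program: the record layout (customer id, balance) and the name `update_master` are assumptions.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, float]  # (customer_id, amount): hypothetical layout


def update_master(master: Iterable[Record],
                  orders: Iterable[Record]) -> Iterator[Record]:
    """Merge a sorted master file with sorted orders in one sequential pass."""
    orders_it = iter(orders)
    order = next(orders_it, None)
    for cust_id, balance in master:
        # Skip orders for customers that are not on the master file.
        while order is not None and order[0] < cust_id:
            order = next(orders_it, None)
        # Apply every order that belongs to the current customer.
        while order is not None and order[0] == cust_id:
            balance += order[1]
            order = next(orders_it, None)
        yield (cust_id, balance)


old_master = [(1, 10.0), (2, 0.0), (3, 5.0)]
todays_orders = [(1, 2.5), (1, 2.5), (3, 1.0)]

if __name__ == "__main__":
    print(list(update_master(old_master, todays_orders)))
```

Neither input is ever accessed out of order, which is why tape or cards were sufficient: the data flows past a stationary processor.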
DISKS MADE MODERN DATABASES POSSIBLE
The key property of disks: random access to stored data
Records can contain pointers to other records
Records can be indexed by their values and accessed directly using B-trees
Prof. Rudolf Bayer, Tech. Univ. Munich
Don Chamberlain
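The two ideas on this slide, random access and indexing records by value, can be sketched together. In this illustration an in-memory byte buffer plays the role of the disk and a sorted list searched with `bisect` stands in for a B-tree; it is not a B-tree implementation. The fixed record size and the function names are assumptions.

```python
import bisect
import io

RECORD_SIZE = 16  # fixed-length records; names are assumed to fit


def write_records(names):
    """Write padded records to a byte buffer; return it with a sorted index."""
    disk = io.BytesIO()
    index = []  # (key, byte offset) pairs
    for name in names:
        offset = disk.tell()
        disk.write(name.encode("ascii").ljust(RECORD_SIZE))
        index.append((name, offset))
    index.sort()
    return disk, index


def lookup(disk, index, key):
    """Binary-search the index, then seek directly to the record."""
    pos = bisect.bisect_left(index, (key, 0))
    if pos == len(index) or index[pos][0] != key:
        return None
    disk.seek(index[pos][1])  # random access: no sequential scan
    return disk.read(RECORD_SIZE).decode("ascii").rstrip()


disk, index = write_records(["gray", "bayer", "codd"])

if __name__ == "__main__":
    print(lookup(disk, index, "codd"))
```

A real B-tree keeps the index itself on disk in pages, so that both the index and the records benefit from random access.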
CODE & DATA: SEPARATED AT BIRTH
COBOL
IDENTIFICATION: document
AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.
ENVIRONMENT: OS
CONFIGURATION SECTION.
INPUT-OUTPUT SECTION.
DATA: Files/Records
FILE SECTION.
WORKING-STORAGE SECTION.
LINKAGE SECTION.
REPORT SECTION.
SCREEN SECTION.
PROCEDURE: code

CODASYL DBTG (1967)
COnference on DAta SYstems Languages, Data Base Task Group
Defined DDL for a network data model
Set-Relationship semantics
Cursor Verbs
Isolated from procedures
No encapsulation
DATA division is the Schema ancestor

"Us" / "them" (Jim Gray)
INTEGRATED DATABASES (MID 60S)
Many applications share data in common
Data is managed by a centralized system
Costs are shared
Redundancy and inconsistency are minimized
Control is improved
Access language can be standardized
Data becomes an enterprise resource
Shared utilities: backup, recovery, replication, ...
Diagram: Database Management System; Schema & Data
Big data variety: the right model according to data
1970 - 2000 RELATIONAL DB
More than 30 years: maturity!
Theoretical & practical aspects (DBMS)
Domains & relations: R ⊆ D1 x D2 x ... x Dn; algebra; 1st-order predicate logic
Algebra operators: R x S, R ∪ S, R - S, R[a], R:j, R*S
Languages: SQL (wins), QUEL, QBE
DBMS prototypes (1975), products (1980)
A major improvement in DB: provide data independence & a simple, tabular view of data
Normal forms & dependencies (DB design, consistency)
Controversial: missing values, duplicates
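The algebra behind the relational model can be tried out with plain Python sets, treating a relation as a set of tuples. This is a minimal sketch: the sample relations and helper names are invented, and reading the operators as product, union, difference, projection, selection and join is an assumption about the slide's notation.

```python
from itertools import product as cartesian

# Relations as sets of tuples with positional attributes (sample data).
R = {(1, "ann"), (2, "bob")}          # R(id, name)
S = {(2, "bob"), (3, "eve")}          # S(id, name), same schema as R
T = {(2, "sales"), (3, "hr")}         # T(id, dept)


def project(rel, positions):
    """Keep only the attributes at the given positions."""
    return {tuple(row[i] for i in positions) for row in rel}


def select(rel, predicate):
    """Keep only the rows that satisfy the predicate."""
    return {row for row in rel if predicate(row)}


def times(r, s):
    """Cartesian product: concatenate every pair of rows."""
    return {a + b for a, b in cartesian(r, s)}


def join_on_first(r, s):
    """Equi-join on the first attribute, built from the basic operators."""
    width = len(next(iter(r)))
    matched = select(times(r, s), lambda row: row[0] == row[width])
    keep = [i for i in range(len(next(iter(matched)))) if i != width]
    return project(matched, keep)


union = R | S
difference = R - S

if __name__ == "__main__":
    print(join_on_first(R, T))
```

Because relations are sets, duplicates disappear automatically, which is one of the points listed as controversial when compared with SQL tables. The join helper assumes non-empty inputs with at least one match.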
DATA MODELS
Tuple
Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar
Document
Allows values to be nested documents or lists, as well as scalar values.
Attributes are not defined in a global schema
Extensible record
Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added on a per-record basis
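The three data models can be contrasted with small Python values. This is only an illustration: the attribute names, the `family:attribute` key convention and the `valid_extensible` helper are assumptions made for the example.

```python
# Tuple: attributes are fixed by a schema and every value is scalar.
USER_SCHEMA = ("uid", "name", "city")
user_tuple = (42, "Ana", "Grenoble")

# Document: no global schema; values may be nested documents or lists.
user_document = {
    "uid": 42,
    "name": "Ana",
    "address": {"city": "Grenoble"},
    "meters": [{"id": "m1"}, {"id": "m2"}],
}

# Extensible record: attribute families come from a schema,
# but attributes inside a family can be added per record.
FAMILIES = {"profile", "readings"}
user_record = {"profile:name": "Ana", "readings:2017-09-01": 3.2}


def valid_extensible(record, families=FAMILIES):
    """Check that every attribute belongs to a declared family."""
    return all(key.split(":", 1)[0] in families for key in record)


if __name__ == "__main__":
    print(valid_extensible(user_record))
    print(valid_extensible({"billing:iban": "unknown family"}))
```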
DATA STORES
Key-value
Systems that store values and an index to find them, based on a key
Document
Systems that store documents, providing index and simple query mechanisms
Extensible record
Systems that store extensible records that can be partitioned vertically and horizontally across nodes
Graph
Systems that model and store data as graphs, where nodes can represent content modelled as document or key-value structures and arcs represent a relation between the data modelled by the nodes
Relational
Systems that store, index and query tuples
KEY-VALUE STORES
Simplest data stores use a data model similar to the memcached distributed in-memory cache
Single key-value index for all data
Provide a persistence mechanism
Replication, versioning, locking, transactions, sorting
API: inserts, deletes, index lookups
No secondary indices or keys

SYSTEM      ADDRESS
Redis       code.google.com/p/redis
Scalaris    code.google.com/p/scalaris
Tokyo       tokyocabinet.sourceforge.net
Voldemort   project-voldemort.com
Riak        riak.basho.com
Membrain    schoonerinfotech.com/products
Membase     membase.com
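The API described on this slide (inserts, deletes, index lookups over a single key index) is small enough to sketch in a few lines. The class below is a toy in-memory stand-in, not the interface of any of the systems listed; all names are assumptions.

```python
class KeyValueStore:
    """Toy in-memory key-value store with a single index on the key."""

    def __init__(self):
        self._index = {}

    def put(self, key, value):
        """Insert or overwrite the value stored under key."""
        self._index[key] = value

    def get(self, key, default=None):
        """Index lookup: direct access by key."""
        return self._index.get(key, default)

    def delete(self, key):
        """Remove key; return True if it existed."""
        return self._index.pop(key, None) is not None

    def scan(self, predicate):
        """Without secondary indices, searching by value is a full scan."""
        return [k for k, v in self._index.items() if predicate(v)]


store = KeyValueStore()
store.put("user:1", {"name": "Ana", "city": "Grenoble"})
store.put("user:2", {"name": "Luis", "city": "Puebla"})

if __name__ == "__main__":
    print(store.get("user:1"))
    print(store.scan(lambda v: v["city"] == "Puebla"))
```

The `scan` method shows the cost of having no secondary indices: any query that is not a key lookup must visit every value.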
SELECT name, pic, profile_url
FROM user
WHERE uid = me()

SELECT message, attachment
FROM stream
WHERE source_id = me() AND type = 80

SELECT name
FROM friendlist
WHERE owner = me()

SELECT name
FROM group
WHERE gid IN ( SELECT gid
               FROM group_member
               WHERE uid = me() )

SELECT name, pic
FROM user
WHERE online_presence = "active"
  AND uid IN ( SELECT uid2
               FROM friend
               WHERE uid1 = me() )

https://developers.facebook.com/docs/reference/fql/
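The query for friends who are currently online uses a nested subquery with IN. A rough way to experiment with that shape is an in-memory SQLite database, sketched below. This is not FQL: the tables, their contents and the function name are invented, and the FQL function me() is replaced by a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (uid INTEGER PRIMARY KEY, name TEXT, online_presence TEXT);
CREATE TABLE friend (uid1 INTEGER, uid2 INTEGER);
INSERT INTO user VALUES (1, 'Me', 'active'), (2, 'Ana', 'active'),
                        (3, 'Luis', 'idle'), (4, 'Zoe', 'active');
INSERT INTO friend VALUES (1, 2), (1, 3);
""")


def online_friends(connection, me):
    """Names of the friends of `me` whose presence is 'active'."""
    rows = connection.execute(
        """
        SELECT name FROM user
        WHERE online_presence = 'active'
          AND uid IN (SELECT uid2 FROM friend WHERE uid1 = ?)
        ORDER BY name
        """,
        (me,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(online_friends(conn, 1))
```

Zoe is online but is not a friend of user 1, and Luis is a friend but idle, so only one name qualifies.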
<805114856,
>
DOCUMENT STORES
Support more complex data: pointerless objects, i.e., documents
Secondary indexes (e.g. B-trees), multiple types of documents (objects) per database, nested documents and lists
Automatic sharding (scale writes), no explicit locks, weaker concurrency (eventual, for scaling reads) and atomicity properties
API: select, delete, getAttributes, putAttributes on documents
Queries can be distributed in parallel over multiple nodes using a map-reduce mechanism

SYSTEM      ADDRESS
SimpleDB    amazon.com/simpledb
Couch DB    couchdb.apache.org
Mongo DB    mongodb.org
Terrastore  code.google.com/terrastore
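The idea of distributing a query over several nodes with a map-reduce mechanism can be sketched locally: each shard computes a partial aggregate (map) and the partial results are then merged (reduce). In this illustration threads stand in for nodes; the documents, field names and function names are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Two shards of documents (sample data); each would live on its own node.
SHARDS = [
    [{"city": "Grenoble", "kwh": 3.0}, {"city": "Lyon", "kwh": 2.0}],
    [{"city": "Grenoble", "kwh": 1.5}],
]


def map_shard(docs):
    """Map step: aggregate consumption per city inside one shard."""
    partial = Counter()
    for doc in docs:
        partial[doc["city"]] += doc["kwh"]
    return partial


def total_by_city(shards):
    """Run the map step on all shards in parallel, then reduce."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_shard, shards))
    total = Counter()
    for partial in partials:   # reduce step: merge partial aggregates
        total.update(partial)
    return dict(total)


if __name__ == "__main__":
    print(total_by_city(SHARDS))
```

Only the small partial aggregates travel between nodes, not the documents themselves.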
EXTENSIBLE RECORD STORES
Basic data model is rows and columns
Basic scalability model is splitting rows and columns over multiple nodes
Rows split across nodes through sharding on the primary key
Split by range rather than hash function
Rows analogous to documents: variable number of attributes, attribute names must be unique
Grouped into collections (tables)
Queries on ranges of values do not go to every node
Columns are distributed over multiple nodes using column groups
Which columns are best stored together
Column groups must be pre-defined with the extensible record stores

SYSTEM      ADDRESS
HBase       hbase.apache.com
HyperTable  hypertable.org
Cassandra   incubator.apache.org/cassandra
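Splitting rows by key range rather than by hash is what lets a range query touch only some of the nodes. The sketch below illustrates this with hypothetical split points over string keys; the function names and the four-node layout are assumptions.

```python
import bisect

# Hypothetical split points: keys < "g" go to node 0, "g".."m" to node 1,
# "n".."s" to node 2, and keys >= "t" to node 3.
SPLIT_POINTS = ["g", "n", "t"]


def node_for(key):
    """Return the node that stores the row with this primary key."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def nodes_for_range(low, high):
    """Return only the nodes that can hold keys between low and high."""
    return list(range(node_for(low), node_for(high) + 1))


if __name__ == "__main__":
    print(node_for("alice"), node_for("zoe"))
    print(nodes_for_range("h", "m"))
```

With hash partitioning, neighbouring keys land on unrelated nodes, so the same range query would have to be sent to every node.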
SCALABLE RELATIONAL SYSTEMS
SQL: rich declarative query language
Databases enforce referential integrity
ACID semantics
Well understood operations: configuration, care and feeding, backups, tuning, failure and recovery, performance characteristics
Use small-scope operations
Challenge: joins that do not scale with sharding
Use small-scope transactions
ACID transactions inefficient with communication and 2PC overhead
Shared nothing architecture for scalability
Avoid cross-node operations

SYSTEM         ADDRESS
MySQL Cluster  mysql.com/cluster
Volt DB        voltdb.com
Clustrix       clustrix.com
ScaleDB        scaledb.com
Scale Base     scalebase.com
Nimbus DB      nimbusdb.com
DBMS EVOLUTION
No more monolithic DBMS
Extensible, lightweight DBMS
Unbundled technology*
Component-based architectures* (thick-grain vs. fine-grain)
OO Frameworks
Components are providing Services
Blur the boundaries between OS & DBMS
Self-adaptive Systems
Multi-tier architectures, Web, P2P, GRID, CLOUD
* See Dittrich, Geppert, Eds, Component Database Systems, MK 2000
* Chaudhuri & Weikum, Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System, VLDB 2000
SERVICE ORIENTED DBMS
Diagram: data services, access services, storage services and other services; extension services (streaming, XML, procedures, queries, replication) and additional extension services
1. Ionut Subasu, Patrick Ziegler, and Klaus R. Dittrich. Towards service-based data management systems. In Workshop Proceedings of Datenbanksysteme in Business, Technologie und Web (BTW 2007)
Klaus R. Dittrich and Andreas Geppert. Component database systems. Morgan Kaufmann, 2000.
Service Level Agreement
In the event of a corruption or other disaster:
the maximum amount of data loss is the last 15 minutes of transactions
the maximum amount of downtime the application can tolerate is 20 minutes
Service level agreement: the contracted delivery time of the service or performance
Required SLA: agreements between the user and the SDBMS, expressed as a combination of weighted measures associated with a query
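The two recovery limits quoted on this slide (15 minutes of data loss, 20 minutes of downtime) and the idea of an SLA as a combination of weighted measures can be made concrete with a small sketch. The class, the function names, the normalized measure values and the weights are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoverySla:
    """Limits taken from the slide's example, in minutes."""
    max_data_loss_min: float = 15.0
    max_downtime_min: float = 20.0


def violations(sla, data_loss_min, downtime_min):
    """List which limits an incident exceeded."""
    broken = []
    if data_loss_min > sla.max_data_loss_min:
        broken.append("data loss")
    if downtime_min > sla.max_downtime_min:
        broken.append("downtime")
    return broken


def weighted_score(measures, weights):
    """Combine normalized measures (0..1) into a single weighted value."""
    total_weight = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total_weight


if __name__ == "__main__":
    print(violations(RecoverySla(), data_loss_min=10, downtime_min=25))
    print(weighted_score({"latency": 0.5, "cost": 1.0}, {"latency": 3, "cost": 1}))
```

A weighted average is only one possible way to combine measures; the slide does not say which combination is used.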
CHALLENGES & OBJECTIVE
How to combine, deploy, and deliver DBMS functionalities:
Compliant to application/user requirements
Optimizing the consumption of computing resources in the presence of greedy data processing tasks
Delivered according to Service Level Agreement (SLA) contracts
Deployed in elastic and distributed platforms
"Design the next generation of data management systems guided by SLA"
DBMS architecture issues
3-TIER ARCHITECTURE
Tiers: Client, Application server, Data server
Logical vs. Physical Architectures
Components linked in different ways
SYSTEMS & ARCHITECTURES
HW influences: machine architecture (P, M, D), networks, data storage units, shared disk, shared nothing
SW influences: O.S. & Middleware
Very large spectrum of DBMS: from smartcard to (parallel) DB Machine
Client(s)/Server(s)
Main Memory DBMS
Monolithic DBMS
Partition and/or replication
PARALLEL DATABASE ARCHITECTURES
THE CLOUD
Software as a service (SaaS)
Platform as a service (PaaS)
Infrastructure as a service (IaaS)
Illusion of infinite resources
No up-front cost
Fine-grained billing (e.g. hourly)
Promotes a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet
IaaS: allows customers to rent computers (virtual machines) on which to run their own computer applications.
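Fine-grained hourly billing with no up-front cost can be illustrated with a one-line cost model. The sketch assumes that every started hour is billed in full, which is a simplification; the rate and the function name are invented and do not reflect any provider's price list.

```python
import math


def rental_cost(seconds_used, hourly_rate, instances=1):
    """Cost of renting virtual machines billed per started hour (assumed)."""
    hours_billed = math.ceil(seconds_used / 3600)
    return hours_billed * hourly_rate * instances


if __name__ == "__main__":
    # Two machines used for one hour and one second each.
    print(rental_cost(3601, hourly_rate=0.10, instances=2))
```

Since nothing is paid for unused capacity, ten machines for one hour cost the same as one machine for ten hours under this model.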
THE CLOUD
Individual users & applications
Software as a service (SaaS), e.g., Salesforce, Google Apps
Platform as a service (PaaS), e.g., Microsoft Azure, Google App Engine
Infrastructure as a service (IaaS), e.g., Amazon EC2, GoGrid, Rackspace
Enabling technologies (hardware & software)
[FurhtEscalante 2010]
THE CLOUD
Software as a service (SaaS), Platform as a service (PaaS), Infrastructure as a service (IaaS)
Computing power is elastic, but only if the workload is parallelizable
Shared-nothing architecture
Data is stored at untrusted hosts
Solution: encrypting data
Data is replicated, across large geographic distances
Availability and durability
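Why elasticity only helps parallelizable workloads can be shown with Amdahl's law, a standard model that is brought in here as an illustration and is not named on the slide. The function name and the sample fractions are assumptions.

```python
def speedup(parallel_fraction: float, nodes: int) -> float:
    """Amdahl's law: speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5):
        print(fraction, round(speedup(fraction, 1000), 2))
```

With half of the work serial, even a thousand rented nodes cannot make the job twice as fast, so adding machines mostly adds cost.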
SQL AS A SERVICE
User applications
Relational model and SQL as a service, e.g. Amazon Relational Database Service (RDS), MS SQL Azure
Relational Cloud storage service
Implemented on top of parallel clusters of common DBMS servers, e.g., MySQL, MS SQL Server
Diagram: user applications on top of a relational cloud storage service, backed by relational DBMS servers
DATA MANAGEMENT WITHOUT RESOURCES CONSTRAINTS
ELASTIC
COST-AWARE
Algorithms
Systems
Reduce the cost of managing and exploiting data sets, given unlimited storage, memory and computation resources
CLOUD DATA MANAGEMENT: SERVICES VIEWS
Definition
Querying and exploiting
Manipulation
Maintenance
Storage (persistency)
Efficient retrieval (indexing, caching)
Fault tolerance (recovery, replication)
Data volume: Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
Hardware: magnetic tape, RAID, Cloud
Genoveva Vargas-Solar
CR1, CNRS, LIG-LAFMIA
Genoveva.Vargas@imag.fr
http://vargas-solar.com