
Big Data Analytics Trends

putting forward data centric sciences


Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA
genoveva.vargas@imag.fr

http://vargas-solar.com/bigdata-management

Smart Grid Master, September-October, 2017


http://www.gartner.com/newsroom/id/2819918
INTERNET OF THINGS

Collection of data sets so large and complex that
it becomes difficult to process using
on-hand database management tools or
traditional data processing applications

THE Vs & NEEDS OF BIG DATA

Volume (increasing amount of data)
Velocity (speed of data in and out)
Variety (range of data types and sources)
Veracity (data consistency)
Value (what is the real value of the data?)

BIG DATA AT BRONTO SCALE

1 bit = binary digit
8 bits = 1 byte
1000 bytes = 1 Kilobyte
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte

We will no longer have the luxury of dealing with just big data
Helluva lot of data!!
http://spectrum.ieee.org/computing/software/beyond-just-big-data

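As a rough illustration of the scale ladder above, the following Python sketch converts a raw byte count into the largest unit of the table, assuming decimal steps of 1000 as on the slide. The function name `to_scale` is hypothetical and not part of the lecture material.

```python
# Unit names in increasing order, each 1000 times the previous one.
UNITS = ["byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte",
         "Exabyte", "Zettabyte", "Yottabyte", "Brontobyte", "Geopbyte"]


def to_scale(n_bytes: int):
    """Return (value, unit) for the largest unit that does not exceed n_bytes."""
    idx = 0
    # Integer comparison avoids rounding errors for very large counts.
    while idx < len(UNITS) - 1 and n_bytes >= 1000 ** (idx + 1):
        idx += 1
    return n_bytes / 1000 ** idx, UNITS[idx]


if __name__ == "__main__":
    for count in (999, 1500, 10 ** 27):
        value, unit = to_scale(count)
        print(f"{count} bytes = {value:g} {unit}(s)")
```

Working on integers until the final division keeps the unit boundary exact even at bronto scale.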
DATA CENTRIC SCIENCES
Data Science Computational Science
Network Science
Social Data Science
Digital humanities

Develop methodologies weaving data management, greedy algorithms, and programming models that must be tuned to be deployed in different target computer architectures

Data collections as the backbone for conducting experiments, driving hypotheses and leading to valid conclusions, models, simulations, understanding

SMART ENERGY MANAGEMENT FOR ALL
[Figure: household energy-use profiles around the Smart Grid]
Intensive use of Xbox online 16:00-22:00; 22-25 C constant; moderate use of lights; lamp at night 19:00-6:00
Intensive use of electric appliances; use of washing machines at the same time as microwave twice a week; use of computer and TV between 6:00-8:00 and 14:00-24:00; artificial lights 6:00-8:00, 17:00-24:00 and 22:00-24:00 in winter; natural illumination at least 4 hours/day
Enjoys fresh air between 16:00 and 18:00; watches stars at night; uses computer and TV 7:00-8:00 and 14:00-22:00

Use Smart Grid technology (sensors, actuators, automatic cooling systems, meteorology services)
Observe people's energy consumption behaviour at home
Propose strategies for reducing the electricity invoice: when to open/close curtains, how to coordinate the use of electric appliances
Encourage intelligent energy consumption at home, which is inversely proportional to:
Energy consumption rate
The size of the CO2 footprint
Respect comfort requirements

GREEN & INTELLIGENT HOUSE
Challenge: determine personalized energy consumption and provisioning models
Comfort preferences

Processing the big data involved in observing energy consumption
Computing energy consumption behavior models
Analyse and optimize energy consumption while respecting the comfort requirements of inhabitants
Need for efficient data processing solutions

DATA PROCESSING REQUIREMENTS

Comfort requirements

Cloud

DATA COLLECTIONS

Consumed data: different sizes; quality, uncertainty, degree of ambiguity; evolution in structure, completeness, production conditions, conditions in which data is retrieved; content, explicit cultural, contextual, background properties; changes in access policies
Conditions of consumption: reproducibility, degree of transparency (avoid software artefacts)
[Diagram labels: data collection releases; applications & data consumers; computing resources]

NEITHER MANAGEABLE NOR EXPLOITABLE AS SUCH


RAW DATA:
heterogeneous (variety), huge (volume), incomplete, imprecise, missing, contradictory (veracity), continuous releases produced at different rates (velocity), proprietary, critical, private (value)
BIG DATA MANAGEMENT

Data collection sizes and data storage supports:
Yotta 10^24
Zetta 10^21 Cloud
Exa 10^18 RAID
Peta 10^15 Disk

Data formats
Data delivery mechanisms

OBJECTIVES

Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
the pace (computing speed) at which data must be consumed (harvested, prepared and processed).

Build services on top of underlying value-added data managers that can
Exploit available resources, making a compromise between QoS properties and SLA requirements and considering all the levels of the stack
Deliver request results at a reasonable price and in a reliable and efficient manner, regardless of the devices, the availability of resources and the data properties
Final remarks & Lecture program

FINAL REMARKS

Data collections
New scales: bronto scale due to emerging IoT
New types: thick, long, hot, cold
New quality measures: QoS, QoE, SLA

Data processing & analytics


Complex jobs and stream analytics are still open issues
Economic cost model & business models (Big Data value & pay-as-you-go)

CONTENT

Big Data Analytics Trends


Big data and beyond the mirror
Big Data analytics, Data mining, Data science

Data management at scale: all you need for cooking data


Cooking data: the big picture
Data as service tools: distributed storage, data access API, more complex data processing, declarative languages
New data analytics stacks

Modeling & Predictive analytics


Clustering at different scales
Graph analytics

Some historical facts about data management

[Timeline figure labels]
Data was not stored
Beginning of the use of DBs & basic reports
Great variety of visual resources to analyse data

SEQUENTIAL DATA PROCESSING
[Diagram labels: Master Tape, Orders, Computer, Updated Master, Invoice]

External medium (tape or cards)
Data dedicated to one application
Data seen as flowing through a stationary processor
Not much different from Hollerith's tabulator
Batch Processing

DISKS MADE MODERN DATABASES POSSIBLE

The key property of disks:


Random access to stored data

Records can contain pointers to other records


Records can be indexed by their values and accessed directly using B-trees

Prof. Rudolf Bayer, Tech. Univ. Munich
Don Chamberlain
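To make the idea of direct, value-based access concrete, here is a minimal Python sketch. It does not implement a B-tree: it swaps in a flat sorted index searched with binary search (the standard `bisect` module), which gives the same logarithmic lookup and range access on a small in-memory example. The class name `SortedIndex` and the sample records are assumptions made for illustration.

```python
import bisect


class SortedIndex:
    """Sorted (value, position) index over a list of records.

    A flat stand-in for a B-tree: entries are kept in key order so that
    lookups use binary search instead of scanning every record.
    """

    def __init__(self, records, field):
        self.records = records
        self.entries = sorted((rec[field], pos) for pos, rec in enumerate(records))
        self.keys = [key for key, _ in self.entries]

    def lookup(self, key):
        """Return all records whose indexed field equals key."""
        i = bisect.bisect_left(self.keys, key)
        found = []
        while i < len(self.keys) and self.keys[i] == key:
            found.append(self.records[self.entries[i][1]])
            i += 1
        return found

    def range(self, low, high):
        """Return records with low <= field <= high, in key order."""
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        return [self.records[pos] for _, pos in self.entries[lo:hi]]


records = [{"id": 3, "name": "c"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
index = SortedIndex(records, "id")
```

The stored position plays the role of the record pointer: the index never copies the record, it only says where to find it.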
CODE & DATA: SEPARATED AT BIRTH
COBOL ("them")
IDENTIFICATION: document
AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.
ENVIRONMENT: OS
CONFIGURATION SECTION. INPUT-OUTPUT SECTION.
DATA: Files/Records
FILE SECTION. WORKING-STORAGE SECTION. LINKAGE SECTION. REPORT SECTION. SCREEN SECTION.
PROCEDURE: code

CODASYL DBTG (1967) ("us")
COnference on DAta SYstems Languages, Data Base Task Group
Defined DDL for a network data model
Set-Relationship semantics
Cursor Verbs
Isolated from procedures
No encapsulation
DATA division is the Schema ancestor

Jim Gray
INTEGRATED DATABASES (MID-60s)

Many applications share data in common
Data is managed by a centralized system
Costs are shared
Redundancy and inconsistency are minimized
Control is improved
Access language can be standardized
Data becomes an enterprise resource
Shared utilities: backup, recovery, replication, ...

[Diagram labels: Database Management System; Schema & Data]

Big data variety: the right model according to data

1970 - 2000 RELATIONAL DB
More than 30 years: maturity!
Theoretical & practical aspects (DBMS)
Domains & R ⊆ D1 × D2 × ... × Dn, Algebra
Algebra operators: R × S, R ∪ S, R − S, R[a], R:j, R * S
1st Order Predicate Logic
Languages: SQL (wins), QUEL, QBE
DBMS Prototypes (1975), Products (1980)
A major improvement in DB: provide data independence & a simple, tabular view of data
Normal Forms & Dependencies (DB design, consistency)
Controversial: missing values, duplicates
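The relational algebra mentioned on this slide can be sketched with plain Python sets of tuples. This is an illustrative toy, assuming relations are sets (so duplicates disappear) and attributes are addressed by position; the helper names `product`, `project`, `select` and `join` are hypothetical.

```python
# Two relations over the same domains (id, name), modelled as sets of tuples.
R = {(1, "ann"), (2, "bob")}
S = {(2, "bob"), (3, "eve")}


def product(r, s):
    """Cartesian product: every tuple of r concatenated with every tuple of s."""
    return {a + b for a in r for b in s}


def project(r, columns):
    """Projection: keep only the listed column positions."""
    return {tuple(t[c] for c in columns) for t in r}


def select(r, predicate):
    """Selection: keep only the tuples that satisfy the predicate."""
    return {t for t in r if predicate(t)}


def join(r, s, i, j):
    """Equi-join on r's column i and s's column j, dropping the repeated column."""
    return {a + b[:j] + b[j + 1:] for a in r for b in s if a[i] == b[j]}


union = R | S        # set union
difference = R - S   # set difference
cities = {(1, "paris"), (2, "rome")}
joined = join(R, cities, 0, 0)
```

Because the relations are sets, the "duplicates" controversy noted on the slide does not arise here; SQL tables, which are bags, behave differently.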

DATA MODELS

Tuple
Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar
Document
Allows values to be nested documents or lists, as well as scalar values.
Attributes are not defined in a global schema

Extensible record
Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be
added on a per-record basis
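The three data models can be contrasted in a few lines of Python. The sketch below is only an illustration under stated assumptions: the schema, the family names (`profile`, `usage`) and the sample values are invented for the example.

```python
SCALARS = (int, float, str, bool)

# Tuple: attributes are fixed by a schema and every value is a scalar.
USER_SCHEMA = ("uid", "name", "city")


def is_valid_tuple(row, schema=USER_SCHEMA):
    return len(row) == len(schema) and all(isinstance(v, SCALARS) for v in row)


# Document: no global schema; values may be nested documents or lists.
user_document = {
    "uid": 1,
    "name": "Ann",
    "address": {"city": "Grenoble"},
    "devices": ["lamp", "heater"],
}

# Extensible record: attribute families come from a schema,
# while attributes inside a family can be added per record.
FAMILIES = {"profile", "usage"}


def make_record(key, **families):
    unknown = set(families) - FAMILIES
    if unknown:
        raise ValueError(f"undefined attribute families: {sorted(unknown)}")
    return {"key": key, **{name: dict(attrs) for name, attrs in families.items()}}


record = make_record("u1", profile={"name": "Ann"}, usage={"kwh": 3.2})
```

The extensible record accepts a new attribute such as `kwh` without any schema change, but rejects a family that was never declared.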

DATA STORES
Key-value
Systems that store values and an index to find them, based on a key
Document
Systems that store documents, providing index and simple query mechanisms
Extensible record
Systems that store extensible records that can be partitioned vertically and horizontally across nodes
Graph
Systems that model data as graphs, where nodes can represent content modelled as document or key-value structures and arcs represent a relation between the data modelled by the nodes
Relational
Systems that store, index and query tuples
KEY-VALUE STORES

Simplest data stores use a data model similar to the memcached distributed in-memory cache
Single key-value index for all data
Provide a persistence mechanism
Replication, versioning, locking, transactions, sorting
API: inserts, deletes, index lookups
No secondary indices or keys

SYSTEM ADDRESS
Redis code.google.com/p/redis
Scalaris code.google.com/p/scalaris
Tokyo tokyocabinet.sourceforge.net
Voldemort project-voldemort.com
Riak riak.basho.com
Membrain schoonerinfotech.com/products
Membase membase.com
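The API listed above (inserts, deletes, index lookups on a single key index) is small enough to sketch in Python. This is a hypothetical in-memory toy, not the interface of any of the systems in the table; it leaves out persistence, replication and versioning.

```python
class KeyValueStore:
    """Single key index over opaque values, in memory only."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def lookup(self, key):
        """Index lookup by key; returns None when the key is absent."""
        return self._data.get(key)

    def scan(self, predicate):
        """Without secondary indices, searching by value means reading every pair."""
        return [key for key, value in self._data.items() if predicate(value)]


store = KeyValueStore()
store.insert("user:1", {"name": "Ann", "city": "Grenoble"})
store.insert("user:2", {"name": "Bob", "city": "Lyon"})
```

`scan` is included only to show the price of having no secondary index: any query that is not a key lookup has to touch all the data.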

SELECT name, pic, profile_url
FROM user
WHERE uid = me()

SELECT message, attachment
FROM stream
WHERE source_id = me() AND type = 80

SELECT name
FROM friendlist
WHERE owner = me()

SELECT name
FROM group
WHERE gid IN ( SELECT gid
               FROM group_member
               WHERE uid = me() )

SELECT name, pic
FROM user
WHERE online_presence = "active"
  AND uid IN ( SELECT uid2
               FROM friend
               WHERE uid1 = me() )

https://developers.facebook.com/docs/reference/fql/
<805114856,

>

DOCUMENT STORES

Support more complex data: pointerless objects, i.e., documents
Secondary indexes (e.g. B-trees), multiple types of documents (objects) per database, nested documents and lists
Automatic sharding (scale writes), no explicit locks, weaker concurrency (eventual for scaling reads) and atomicity properties
API: select, delete, getAttributes, putAttributes on documents
Queries can be distributed in parallel over multiple nodes using a map-reduce mechanism

SYSTEM ADDRESS
SimpleDB amazon.com/simpledb
Couch DB couchdb.apache.org
Mongo DB mongodb.org
Terrastore code.google.com/terrastore
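The map-reduce style of querying mentioned above can be sketched as follows. This is a single-process Python illustration under stated assumptions: each inner list stands for the documents held by one node, and the document fields (`room`, `kwh`) are invented. In a real store the map step would run on each node in parallel.

```python
from collections import Counter
from functools import reduce

# One list of documents per (simulated) node.
shards = [
    [{"room": "kitchen", "kwh": 2}, {"room": "lounge", "kwh": 1}],
    [{"room": "kitchen", "kwh": 3}],
]


def map_shard(documents):
    """Map step: each node aggregates its own documents locally."""
    partial = Counter()
    for doc in documents:
        partial[doc["room"]] += doc["kwh"]
    return partial


def reduce_partials(left, right):
    """Reduce step: merge two partial aggregates."""
    return left + right


def run_query(all_shards):
    return reduce(reduce_partials, map(map_shard, all_shards), Counter())


totals = run_query(shards)
```

Only the small partial aggregates travel between nodes, not the documents themselves, which is what lets the query scale with the number of shards.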

EXTENSIBLE RECORD STORES

Basic data model is rows and columns
Basic scalability model is splitting rows and columns over multiple nodes
Rows split across nodes through sharding on the primary key
Split by range rather than hash function
Rows analogous to documents: variable number of attributes, attribute names must be unique
Grouped into collections (tables)
Queries on ranges of values do not go to every node
Columns are distributed over multiple nodes using column groups
Which columns are best stored together
Column groups must be pre-defined with the extensible record stores

SYSTEM ADDRESS
HBase hbase.apache.com
HyperTable hypertable.org
Cassandra incubator.apache.org/cassandra
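A minimal Python sketch of range-based row sharding with pre-defined column groups follows. It is a toy under stated assumptions: the class `RangeShardedTable`, the split keys and the group names are hypothetical, nodes are simulated by dictionaries, and only rows (not column groups) are spread over nodes.

```python
import bisect


class RangeShardedTable:
    """Rows are assigned to nodes by key range; column groups are fixed up front."""

    def __init__(self, split_keys, column_groups):
        self.split_keys = sorted(split_keys)
        self.column_groups = set(column_groups)
        self.nodes = [{} for _ in range(len(self.split_keys) + 1)]

    def node_for(self, row_key):
        """Index of the node whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, group, **columns):
        if group not in self.column_groups:
            raise KeyError(f"column group {group!r} was not pre-defined")
        row = self.nodes[self.node_for(row_key)].setdefault(row_key, {})
        row.setdefault(group, {}).update(columns)

    def nodes_for_range(self, low, high):
        """Only the nodes whose ranges overlap [low, high] are contacted."""
        return list(range(self.node_for(low), self.node_for(high) + 1))

    def scan(self, low, high):
        hits = {}
        for n in self.nodes_for_range(low, high):
            for key, row in self.nodes[n].items():
                if low <= key <= high:
                    hits[key] = row
        return hits


table = RangeShardedTable(["h", "p"], ["profile", "usage"])
table.put("alice", "profile", city="Lyon")
table.put("mallory", "usage", kwh=4)
table.put("zed", "profile", city="Oslo")
```

With hash sharding, neighbouring keys would land on unrelated nodes and a range query would have to ask all of them; splitting by range keeps `nodes_for_range` short.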

SCALABLE RELATIONAL SYSTEMS
SQL: rich declarative query language
Databases enforce referential integrity
ACID semantics
Well understood operations: configuration, care and feeding, backups, tuning, failure and recovery, performance characteristics
Use small-scope operations
Challenge: joins that do not scale with sharding
Use small-scope transactions
ACID transactions inefficient with communication and 2PC overhead
Shared nothing architecture for scalability
Avoid cross-node operations

SYSTEM ADDRESS
MySQL Cluster mysql.com/cluster
Volt DB voltdb.com
Clustrix clustrix.com
ScaleDB scaledb.com
Scale Base scalebase.com
Nimbus DB nimbusdb.com
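The advice to keep transactions small-scope can be illustrated with a short Python sketch that checks whether all the rows touched by a transaction live on one node. The modulo placement rule, the node count and the function names are assumptions chosen for the example, not the scheme of any system listed above.

```python
N_NODES = 4  # assumed cluster size


def node_of(customer_id: int) -> int:
    """Hypothetical placement rule: shard rows by customer id."""
    return customer_id % N_NODES


def plan_transaction(customer_ids):
    """Classify a transaction by the set of nodes it would touch."""
    nodes = sorted({node_of(c) for c in customer_ids})
    mode = "single-node" if len(nodes) == 1 else "cross-node (needs 2PC)"
    return mode, nodes


local_plan = plan_transaction([1, 5, 9])
distributed_plan = plan_transaction([1, 2])
```

A transaction classified as single-node can commit locally; the cross-node case is where the communication and two-phase-commit overhead from the slide appears.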

DBMS EVOLUTION

No more monolithic DBMS
Extensible, lightweight DBMS
Unbundled technology*
Component-based architectures* (thick-grain vs. fine-grain)
OO Frameworks

Components are providing Services
Blur the boundaries between OS & DBMS
Self-adaptive Systems
Multi-tier architectures, Web, P2P, GRID, CLOUD

* See Dittrich, Geppert, Eds, Component Database Systems, MK 2000
* Chaudhuri & Weikum, Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System, VLDB 2000

SERVICE ORIENTED DBMS
[Diagram labels: storage services; access services; data services; extension services (streaming, XML, procedures, queries, replication); additional extension services; other services]

1 Ionut Subasu, Patrick Ziegler, and Klaus R. Dittrich. Towards service-based data management systems. In Workshop Proceedings of Datenbanksysteme in Business, Technologie und Web (BTW 2007)
Klaus R. Dittrich and Andreas Geppert. Component database systems. Morgan Kaufmann, 2000.

SERVICE ORIENTED DBMS

Service Level Agreement
In the event of a corruption or other disaster:
the maximum amount of data loss is the last 15 minutes of transactions
the maximum amount of downtime the application can tolerate is 20 minutes

Service level agreement: the contracted delivery time of the service or performance
Required SLA: agreements between the user and SDBMS expressed as a combination of weighted measures associated to a query
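The two notions of SLA above can be sketched in Python. The recovery bounds (15 minutes of data loss, 20 minutes of downtime) come from the slide; everything else is an assumption for illustration: the class and function names, the measure names, and the choice of a normalized weighted average as the "combination of weighted measures".

```python
from dataclasses import dataclass


@dataclass
class RecoverySLA:
    """Recovery bounds taken from the slide's example."""
    max_data_loss_min: float = 15
    max_downtime_min: float = 20

    def is_met(self, data_loss_min: float, downtime_min: float) -> bool:
        return (data_loss_min <= self.max_data_loss_min
                and downtime_min <= self.max_downtime_min)


def weighted_sla_score(measures: dict, weights: dict) -> float:
    """Weighted average of per-query measures, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total_weight


sla = RecoverySLA()
score = weighted_sla_score({"latency": 1.0, "cost": 0.5}, {"latency": 3, "cost": 1})
```

A user who cares three times more about latency than about cost, as in the example weights, obtains a score dominated by the latency measure.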

CHALLENGES & OBJECTIVE

How to combine, deploy, and deliver DBMS functionalities:


Compliant to application/user requirements
Optimizing the consumption of computing resources in the presence of greedy data
processing tasks
Delivered according to Service Level Agreement (SLA) contracts
Deployed in elastic and distributed platforms

"Design the next generation of data management systems guided by SLA"

DBMS architecture issues

3-TIER ARCHITECTURE

Client
Application server
Data server

Logical vs. Physical Architectures
Components linked in different ways

SYSTEMS & ARCHITECTURES

HW influences: machine architecture (P, M, D), networks, data storage units, shared disk, shared nothing
SW influences: O.S. & middleware

Very large spectrum of DBMS: from smartcard to (parallel) DB machine
Client(s)/Server(s)
Main Memory DBMS
Monolithic DBMS
Partition and/or replication

PARALLEL DATABASE ARCHITECTURES

THE CLOUD
Software as a service (SaaS)
Platform as a service (PaaS)
Infrastructure as a service (IaaS)

Illusion of infinite resources
No up-front cost
Fine-grained billing (e.g. hourly)

Promotes a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet
IaaS: allows customers to rent computers (virtual machines) on which to run their own computer applications.

THE CLOUD

Individual users & applications

Software as a service (SaaS)


e.g., Salesforce, Google Apps

Platform as a service (PaaS)


e.g., Microsoft Azure, Google App Engine

Infrastructure as a service (IaaS)


e.g., Amazon EC2, GoGrid, Rackspace

Enabling technologies (hardware & software)

[Furht & Escalante 2010]
THE CLOUD
[Diagram labels: Software as a service (SaaS); Platform as a service (PaaS); Infrastructure as a service (IaaS)]

Computing power is elastic, but only if workload is parallelizable
Shared-nothing architecture
Data is stored at untrusted hosts
Solution: encrypting data
Data is replicated across large geographic distances
Availability and durability

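The caveat that elasticity only helps parallelizable workloads can be quantified with Amdahl's law, which is not named on the slide and is brought in here as an illustration. The Python sketch below assumes a workload with a fixed parallel fraction and ignores communication costs; the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, n_nodes: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_nodes)


if __name__ == "__main__":
    for p in (0.5, 0.9, 1.0):
        print(f"parallel fraction {p}: speedup on 1000 nodes = "
              f"{amdahl_speedup(p, 1000):.1f}")
```

With half of the work inherently serial, renting a thousand nodes yields a speedup below 2, so the extra elastic capacity is paid for but barely used.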
SQL AS A SERVICE

[Diagram labels: user applications; relational cloud storage service; relational DBMS instances]

Relational model and SQL as a service, e.g. Amazon Relational Database Service (RDS), MS SQL Azure
Implemented on top of parallel clusters of common DBMS servers, e.g., MySQL, MS SQL Server

DATA MANAGEMENT WITHOUT
RESOURCE CONSTRAINTS

ELASTIC
COST-AWARE
Algorithms
Systems

Reduce the cost of managing and exploiting data sets given unlimited storage, memory and computation resources
CLOUD DATA MANAGEMENT: SERVICES VIEWS
Definition
Querying and exploiting
Manipulation
Maintenance

Storage (persistency)
Efficient retrieval (indexing, caching)
Fault tolerance (recovery, replication)

Data volume: Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
Hardware: magnetic tape, RAID, Cloud

Genoveva Vargas-Solar
CR1, CNRS, LIG-LAFMIA
Genoveva.Vargas@imag.fr

http://vargas-solar.com