AGENDA
BIG DATA
Who is KEYRUS?
Big Data & Analytics: what is it?
Positioning
Software & Tools
Technical Architecture
Value Proposition
Big Data
KEYRUS
A UNIQUE VALUE PROPOSITION
A GROUP STRONG AND AGILE
SPECIALIST IN ORGANIZATIONS PERFORMANCE
OUR VALUES FOR THE BENEFIT OF OUR CUSTOMERS
AN INTERNATIONAL DIMENSION
153m
2012 Revenues
350
Large accounts* & LME
3800
SME customers
1650
Employees
12
countries
on 4 continents
The infrastructures and processes (quality, HR, ...) of a large professional services Group; Simple and formalized governance to maintain agility at all times; A customer-focused decision center; Listed on NYSE-Euronext Paris
An ability to act on performance management strategy, systems and organizations; Different Business Units to serve different types of clients (large corporations, mid-market, and SMEs); Functional, Industry and Technology skills
Entrepreneurship; Customer proximity; Building our brand on quality of service; A culture of innovation that defines how we operate and is also part of our value proposition; Diversity as a key component of our HR policy
Revenue by Sector
Keyrus - All rights reserved
Industries: 31%; Banking - Insurance: 19%; Telecom: 8%; Services - Distribution: 16%; Public Services: 14%; Utilities: 12%
Belgium, Brazil, Canada, China, Spain, France, Mauritius, Israel, Luxembourg, Switzerland, Tunisia, USA
5 Billion
# of cell phone users worldwide in 2010
10x
Growth in digital data every 5 years
2 Billion
# of Internet users worldwide in 2010
30 Billion
Pieces of content shared on Facebook every month
35 ZB
By 2020, the Digital Universe will be 44 times as big as it was in 2009
BIG DATA
LARGE HADRON COLLIDER OF CERN (SWITZERLAND)
BIG DATA ?
NOT ONLY DATA VOLUME
Volume
Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. TB, Records, Transactions, Tables, Files
Velocity
Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business. Batch, Near time, Real time, Streams
Value
Innovate new business models; Replace/Support human decision; Custom actions; Discover needs; Improve performance; Create transparency
Variety
Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more. Multi-structured: Unstructured, Semi-Structured, Structured
Analysis
Quarterly sales reporting Sales growth plan
Future
Strategy
Past
ANALYZING DATA
SKILL OF THE FUTURE
McKinsey: by 2018, the United States alone could face a shortage of:
140,000-190,000 people with deep analytical skills
1.5M managers/analysts with the know-how to use the analysis of big data to make effective decisions
www.mckinsey.com/mgi/publications/big_data/
Data Scientist
Working on global data; Modeling complex business problems; Using Big Data software packages (Mahout, Lucene); Discovering business insights; Identifying opportunities; Skills for coding, integrating and preparing large, varied data sets; Advanced analytics and modeling skills to reveal and understand hidden relationships; Business knowledge and communication skills to present results
Statistician
Working on data sampling; From data sampling to global data by projection; Using statistical software packages (SAS, SPSS); Skills for probability, regression and modeling; Practical experience in data cleansing, simulation and data visualization; Skills for data interpretation, analysis, categorization, correlation, explanation; Communication skills to present results
Actionable Analytics
Provides simulation, prediction, optimization and other analytics, to empower even more decision flexibility at the time and place of every business process action
In Memory Computing
The execution of certain types of hours-long batch processes can be squeezed into minutes or even seconds
Integrated Ecosystems
Packaging of software and services to address infrastructure or application workload
Petabytes
35 ZB -> a stack of 50 GB Blu-ray Discs reaching from the Earth to the Moon, twice over
1 ZB = 10 ** 21 bytes
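The Blu-ray comparison can be checked with a few lines of arithmetic. This is a rough sketch, not the source of the slide's figure: the disc thickness (1.2 mm) and the average Earth-Moon distance (384,400 km) are assumptions added here, and only the 35 ZB volume, the 50 GB disc capacity and the 10 ** 21 factor come from the slide.

```python
# Back-of-the-envelope check of the "stack of Blu-ray Discs" comparison.
ZETTABYTE = 10 ** 21              # bytes, as on the slide
DISC_CAPACITY = 50 * 10 ** 9      # 50 GB per disc, as on the slide
DISC_THICKNESS_M = 1.2e-3         # assumption: 1.2 mm per disc
EARTH_MOON_M = 384_400e3          # assumption: average Earth-Moon distance


def stack_height_m(total_bytes: int) -> float:
    """Height in metres of a stack of discs holding total_bytes."""
    discs = total_bytes / DISC_CAPACITY
    return discs * DISC_THICKNESS_M


height = stack_height_m(35 * ZETTABYTE)
ratio = height / EARTH_MOON_M
print(f"{height / 1e3:,.0f} km, about {ratio:.1f}x the Earth-Moon distance")
```

Under these assumptions the stack comes out slightly above twice the Earth-Moon distance, which is consistent with the "x2" on the slide.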
Bank/Insurance: risk management, Basel III, customer qualification, fraud management
Telecommunications: a more reliable network where failures can be predicted and prevented, customer attrition
Media: more content that is lined up with your personal preferences
Marketing: e-reputation, trend analysis on web sites
Healthcare: prevention system, epidemiological surveillance
Life Science: better targeted medicines with fewer complications and side effects
Retail: a personal experience with products and offers that are just what you need
Government: government services that are based on hard data, not just gut feeling
IT: support optimization, electricity consumption analysis
Gaming: determining the future direction of the games
Digital marketing optimization (e.g., web analytics, attribution, golden path analysis)
Data exploration and discovery (e.g., data scientists, identifying new data-driven products, new markets)
Fraud detection and prevention (e.g., revenue protection, site integrity, credit card protection, suspect transactions, fight against money laundering)
Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence)
Social network and relationship analysis (e.g., influencer marketing, crowdsourcing, attrition prediction)
Source: Teradata
Compute Trend
New Analytics
(Massively Parallel Processing, MapReduce, Algorithms)
Storage Trend
New Data Structure
(Distributed File Systems, NoSQL, NewSQL)
Enterprise data warehouse; General purpose data warehouse; Proprietary and dedicated data warehouse; OLTP is the data warehouse
Object Storage
Multi-Structured Data
Master/Master
Distributed FS
Federated/ Sharded
Shared source code; Publicly available and free; Support subscription not free; No software vendor lock-in; For the use and benefit of all without favour
Commercial software
DATA WAREHOUSE
GARTNER ANALYSIS
Parallel Data Warehouse (Microsoft) IBM Netezza Oracle Exadata SAP HANA ParAccel Analytic Database Teradata HP Vertica
Column-Oriented database
Source Gartner January 2013
In-Memory database
DATA MANAGEMENT
GARTNER ANALYSIS
Data Integration
Source Gartner October 2012
Data Quality
Source Gartner October 2012
Master Data
Source Gartner October 2012
Data acquisition; Consolidation; Data migrations/conversions; Synchronization of data between operational applications; Inter-enterprise data sharing; Delivery of data services in an SOA context
Identify, link and synchronize the information across heterogeneous data sources; Create and manage a central database of record or index; Support master data and governance requirements through workflow
Governance
Accessibility
External data access; Open data access; Easy data collection; Wrong figures; Visualization not clear for decision-making; Incorrect data, duplicates; Decision-making impact; Data update; All data in context; Global data; Data understanding; Data life cycle; From sources to users; Data loss; Data intrusion; Data access rights
Business consistency
Technical consistency
Freshness
Completeness
Explicable
Traceability
Security
BUSINESS INTELLIGENCE
GARTNER ANALYSIS
Predictive analysis; Advanced visualization; Geospatial analysis; Cloud analytics platform; Innovation; Recent years' acquisitions
IBM > Cognos, Algorithmics
HADOOP OVERVIEW
OPEN SOURCE FRAMEWORK
What is Hadoop ?
Top level Apache Foundation project Large, active user base, mailing lists, user groups Very active community, strong development team
Why Hadoop ?
Searching Log Processing Data Analytics Video and Image Analysis Data Retention
Open source software: a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
HADOOP PROVIDERS
FORRESTER ANALYSIS
Amazon is the most prominent Hadoop cloud service provider
IBM has the deepest Hadoop platform and application portfolio
EMC Greenplum is the first mover in Hadoop appliances
MapR has a strong OEM business for its Hadoop distribution
Cloudera is the Hadoop pure play with the greatest adoption
Hortonworks provides professional services to the Hadoop ecosystem
Pentaho executes Hadoop MapReduce models and Pig scripts for data integration and analytics products
DataStax embeds Cassandra for real-time Hadoop applications
Datameer provides a user-friendly Hadoop modeling tool
Platform Computing brings proven cluster management tools to Hadoop
Zettaset specializes in Hadoop cluster management tools
Outerthought focuses on Hadoop search applications
HStreaming provides complex event processing middleware for Hadoop
Source: Forrester Research, Inc. - February 2012
CLOUDERA
HADOOP DISTRIBUTION - CDH
Hadoop is a framework based on a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.
HDFS / MapReduce: Hadoop Distributed File System for storage and Hadoop MapReduce for compute. High availability and scalability. Open source software.
Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Hadoop, it provides tools to enable easy data extract/transform/load, a mechanism to impose structure on a variety of data formats, access to files stored either directly in HDFS or in other data storage systems such as HBase, and query execution via MapReduce.
Pig is a high-level data-flow language and execution framework for parallel computation. It makes MapReduce programs simple to write, abstracts away low-level detail, and keeps the focus on data processing (data flow and data manipulation) when extracting, transforming and loading data into HDFS or from HDFS into any target system. Open source software.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Hadoop Framework
Job Workflow: APACHE OOZIE
Data Processing Lib: DataFu for Pig
Web Console: HUE
Interactive SQL: Impala
Metadata: APACHE HIVE MetaStore
Batch Processing Languages: APACHE PIG, APACHE HIVE
Data Integration: APACHE FLUME, APACHE SQOOP
Cloud Deployment: APACHE WHIRR
Hadoop Core Kernel: MapReduce, HDFS
Connectivity: ODBC/JDBC/FUSE/HTTPS
MAPREDUCE
MASSIVELY PARALLEL PROCESSING
MapReduce
MapReduce is the programming paradigm popularized by Google researchers
Open-source Hadoop implementation of MapReduce by Yahoo
Open source software framework for distributed computation
Parallel computation (Map) on each block (Split) of data in an HDFS file, outputting a stream of (Key, Value) pairs to the local file system
JobTracker schedules and manages jobs
TaskTracker executes individual map() and reduce() tasks on each cluster node
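The map / shuffle / reduce flow described above can be illustrated with a word count in plain Python. This is only a single-process sketch of the paradigm, not Hadoop code: there is no HDFS, JobTracker or TaskTracker here, and the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical names chosen for the illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(split: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (key, value) pair per word found in a split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(splits: Iterable[str]) -> Dict[str, int]:
    """Run map on every split, then shuffle, then reduce each key."""
    pairs = (pair for split in splits for pair in map_phase(split))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each string stands in for one block (split) of a larger file.
splits = ["big data is big", "data is everywhere"]
counts = run_job(splits)
print(counts)
```

In a real cluster the map calls would run in parallel on the nodes holding each split, and the shuffle would move the intermediate pairs across the network before the reduce tasks run.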
Algorithms
Association Rule Learning Algorithms; Genetic Algorithms; Neural Network Algorithms; Statistical Algorithms (Pandas); Machine Learning Algorithms (Mahout, Weka, scikit-learn); Natural Language Processing Algorithms; Trading Algorithms; Clinical design Algorithms; Searching Algorithms (Lucene, Solr, Katta, Elasticsearch, OpenSearch Server)
Languages
PHP, Erlang, Python, Ruby, R, Java
Column
Key-Value
Document
MongoDB (10gen), CouchDB, Terrastore, SimpleDB (AWS)
Graph
Cloud
Cloud models
SalesForce.com, Facebook, Twitter, LinkedIn
Amazon Web Services, Microsoft Windows Azure, Google
(Linux, Windows, Unix)
SaaS: applications
PaaS: Java, Ruby, Python, PHP, Erlang
IaaS: Hardware (server, storage, network)
INFRASTRUCTURE AS A SERVICE
IAAS MODEL
General Purpose
Combine server with storage & networking (Hyper-Scale Server)
Specialized software enables general-purpose system designs to provide high-performance data services
Emerging
Application Data Services
Future
Application
Infrastructure
Analytics Architecture, New Generation
Not only SQL based: Hadoop, Cassandra
High scalability, availability and flexibility
Compute and storage in the same box to reduce network latency
Right design for semi-structured and unstructured data
SAN Switch
HADOOP ARCHITECTURE
OVERVIEW
Network Switches
2 x EdgeNode 2 CPU 6 core 96GB RAM 6 x HDD 600GB 15K (Raid10) 2 x 10GbE Ports
2 x NameNode/BackupNode 2 CPU 6 core 96GB RAM 6 x HDD 600GB 15K (Raid10) 2 x 10GbE Ports
3 to n DataNode 2 CPU 6 core 48GB RAM 12 x HDD 3TB 7.5K 2 x 10GbE Ports
Edge Nodes
Control Nodes
Worker Nodes
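The DataNode specification above (12 x 3 TB disks per worker) translates into cluster capacity roughly as sketched below. The replication factor of 3 is the HDFS default; the 25% share reserved for temporary and intermediate data is a hypothetical planning figure, not something stated on the slide, and the names DataNode and usable_tb are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    """Worker node disk layout taken from the slide: 12 x 3 TB HDD."""
    disks: int = 12
    disk_tb: float = 3.0

    @property
    def raw_tb(self) -> float:
        return self.disks * self.disk_tb


def usable_tb(n_nodes: int,
              node: DataNode = DataNode(),
              replication: int = 3,      # HDFS default replication factor
              reserved: float = 0.25     # assumption: share kept for temp data
              ) -> float:
    """Approximate usable HDFS capacity in TB for n_nodes DataNodes."""
    raw = n_nodes * node.raw_tb
    return raw * (1 - reserved) / replication


for n in (3, 10, 20):
    print(f"{n:>2} DataNodes: {n * DataNode().raw_tb:.0f} TB raw, "
          f"~{usable_tb(n):.0f} TB usable")
```

With these assumptions the minimum cluster of 3 DataNodes holds 108 TB raw but only about 27 TB of user data, which is why worker nodes are the part of the design that scales "3 to n".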
Modeling Tools
BI / Analytics
Enterprise Reporting
Web/Mobile Applications
Logs
Files
Web Data
RDBMS
Costs
Hardware & software product costs; Services & Support costs; Training & communication costs; Energy & professional costs
Benefits
Increase productivity; Increase margins and revenues; Reduce time to access relevant information; Reduce time to decision making; Enhance quality of information; Enhance user satisfaction
BIG DATA VALUE PROPOSITION
Keyrus, leader in Business Intelligence (Consulting & Delivery)
Works closely with the big data leaders
Works with high-level profiles: Statistician, Architect, BIDW Specialist, Consultant, Manager
Develops partnerships
Develops innovation
Uses open source software
No software vendor lock-in
Low TCO
THANK YOU
FOR YOUR ATTENTION