Getting Started with Greenplum for Big Data Analytics
Getting Started with Greenplum for Big Data Analytics - Gollapudi Sunila
Table of Contents
Getting Started with Greenplum for Big Data Analytics
Credits
Foreword
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Instant Updates on New Packt Books
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Big Data, Analytics, and Data Science Life Cycle
Enterprise data
Classification
Features
Big Data
So, what is Big Data?
Multi-structured data
Data analytics
Data science
Data science life cycle
Phase 1 – state business problem
Phase 2 – set up data
Phase 3 – explore/transform data
Phase 4 – model
Phase 5 – publish insights
Phase 6 – measure effectiveness
References/Further reading
Summary
2. Greenplum Unified Analytics Platform (UAP)
Big Data analytics – platform requirements
Greenplum Unified Analytics Platform (UAP)
Core components
Greenplum Database
Hadoop (HD)
Chorus
Command Center
Modules
Database modules
HD modules
Data Integration Accelerator (DIA) modules
Core architecture concepts
Data warehousing
Column-oriented databases
Parallel versus distributed computing/processing
Shared nothing, massive parallel processing (MPP) systems, and elastic scalability
Shared disk data architecture
Shared memory data architecture
Shared nothing data architecture
Data loading patterns
Greenplum UAP components
Greenplum Database
The Greenplum Database physical architecture
The Greenplum high-availability architecture
High-speed data loading using external tables
External table types
Polymorphic data storage and historic data management
Data distribution
Hadoop (HD)
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Chorus
Greenplum Data Computing Appliance (DCA)
Greenplum Data Integration Accelerator (DIA)
References/Further reading
Summary
3. Advanced Analytics – Paradigms, Tools, and Techniques
Analytic paradigms
Descriptive analytics
Predictive analytics
Prescriptive analytics
Analytics classified
Classification
Forecasting or prediction or regression
Clustering
Optimization
Simulations
Modeling methods
Decision trees
Association rules
The Apriori algorithm
Linear regression
Logistic regression
The Naive Bayesian classifier
K-means clustering
Text analysis
R programming
Weka
In-database analytics using MADlib
References/Further reading
Summary
4. Implementing Analytics with Greenplum UAP
Data loading for Greenplum Database and HD
Greenplum data loading options
External tables
gpfdist
gpload
Hadoop (HD) data loading options
Sqoop 2
Greenplum BulkLoader for Hadoop
Using external ETL to load data into Greenplum
Extraction, Load, and Transformation (ELT) and Extraction, Transformation, Load, and Transformation (ETLT)
Greenplum target configuration
Sourcing large volumes of data from Greenplum
Unsupported Greenplum data types
Push Down Optimization (PDO)
Greenplum table distribution and partitioning
Distribution
Data skew and performance
Optimizing the broadcast or redistribution motion for data co-location
Partitioning
Querying Greenplum Database and HD
Querying Greenplum Database
Analyzing and optimizing queries
The ANALYZE function
The EXPLAIN function
Dynamic Pipelining in Greenplum
Querying HDFS
Hive
Pig
Data communication between Greenplum Database and Hadoop (using external tables)
Data Computing Appliance (DCA)
Storage design, disk protection, and fault tolerance
Master server RAID configurations
Segment server RAID configurations
Monitoring DCA
Greenplum Database management
In-database analytics options (Greenplum-specific)
Window functions
The PARTITION BY clause
The ORDER BY clause
The OVER (ORDER BY…) clause
Creating, modifying, and dropping functions
User-defined aggregates
Using R with Greenplum
DBI Connector for R
PL/R
Using Weka with Greenplum
Using MADlib with Greenplum
Using Greenplum Chorus
Pivotal
References/Further reading
Summary
Index
Getting Started with Greenplum for Big Data Analytics
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Production Reference: 1171013
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78217-704-3
www.packtpub.com
Cover Image by Aniket Sawant (<aniket_sawant_photography@hotmail.com>)
Credits
Author
Sunila Gollapudi
Reviewers
Brian Feeny
Scott Kahler
Alan Koskelin
Tuomas Nevanranta
Acquisition Editor
Kevin Colaco
Commissioning Editor
Deepika Singh
Technical Editors
Kanhucharan Panda
Vivek Pillai
Project Coordinator
Amey Sawant
Proofreader
Bridget Braund
Indexer
Mariammal Chettiyar
Graphics
Valentina D'silva
Ronak Dhruv
Abhinash Sahu
Production Coordinator
Adonia Jones
Cover Work
Adonia Jones
Foreword
In the last decade, we have seen the impact of exponential advances in technology on the way we work, shop, communicate, and think. At the heart of this change is our ability to collect and gain insights into data; comments like "Data is the new oil" or "we have a Data Revolution" only amplify the importance of data in our lives.
Tim Berners-Lee, inventor of the World Wide Web, said, "Data is a precious thing and will last longer than the systems themselves."
IBM recently stated that people create a staggering 2.5 quintillion bytes of data every day (that's roughly equivalent to over half a billion HD movie downloads). This information is generated from a huge variety of sources including social media posts, digital pictures, videos, retail transactions, and even the GPS tracking functions of mobile phones.
This data explosion has led to the term "Big Data" moving very rapidly from an industry buzzword to practically a household term. Harnessing "Big Data" to extract insights is not an easy task; the potential rewards for finding these patterns are huge, but it will require technologists and data scientists to work together to solve these problems.
The book written by Sunila Gollapudi, Getting Started with Greenplum for Big Data Analytics, has been carefully crafted to address the needs of both the technologists and data scientists.
Sunila starts by providing excellent background to the Big Data problem and why new thinking and skills are required. Along with a deep dive into advanced analytic techniques, she brings out the difference in thinking between the "new" Big Data science and the traditional "Business Intelligence"; this is especially useful to help understand and bridge the skill gap.
She moves on to discuss the computing side of the equation: handling scale, complexity of data sets, and rapid response times. The key here is to eliminate the "noise" in data early in the data science life cycle. Here, she talks about how to use one of the industry's leading product platforms, Greenplum, to build Big Data solutions, and explains the need for a unified platform that brings essential software components (commercial/open source) together, backed by a hardware/appliance.
She then puts the two together to get the desired result—how to get meaning out of Big Data. In the process, she also brings out the capabilities of the R programming language, which is mainly used in the area of statistical computing, graphics, and advanced analytics.
Her easy-to-read, practical style of writing with real examples shows her depth of understanding of this subject. The book would be very useful both for data scientists (who need to learn the computing side and the technologies involved) and for those who aspire to learn data science.
V. Laxmikanth
Managing Director
Broadridge Financial Solutions (India) Private Limited
www.broadridge.com
About the Author
Sunila Gollapudi works as a Technology Architect for Broadridge Financial Solutions Private Limited. She has over 13 years of experience in developing, designing, and architecting data-driven solutions, including around eight years focused on the banking and financial services domain. She drives the Big Data and data science practice for Broadridge. Her key roles have been Solutions Architect, Technical leader, Big Data evangelist, and Mentor.
Sunila has a Master's degree in Computer Applications, and her passion for mathematics drew her into data and analytics. She worked on Java and Distributed Architecture, and was a SOA consultant and Integration Specialist before she embarked on her data journey. She is a strong follower of open source technologies and believes in the innovation that the open source revolution brings.
She has been a speaker at various conferences and meetups on Java and Big Data. Her current Big Data and data science specialties include Hadoop, Greenplum, R, Weka, MADlib, advanced analytics, machine learning, and data integration tools such as Pentaho and Informatica.
With a unique blend of technology and domain expertise, Sunila has been instrumental in conceptualizing architectural patterns and providing reference architecture for Big Data problems in the financial services domain.
Acknowledgement
It was a pleasure to work with Packt Publishing on this project. Packt has been most accommodating, extremely quick, and responsive to all requests.
I am deeply grateful to Broadridge for providing me the platform to explore and build expertise in Big Data technologies. My greatest gratitude to Laxmikanth V. (Managing Director, Broadridge) and Niladri Ray (Executive Vice President, Broadridge) for all the trust, freedom, and confidence in me.
Thanks to my parents for having relentlessly encouraged me to explore any and every subject that interested me.
"Authors usually thank their spouses for their patience and support" or words to that effect. Unless one has lived through the actual experience, one cannot fully comprehend how true this is. Over the last ten years, Kalyan has endured what must have seemed like a nearly continuous stream of whining punctuated by occasional outbursts of exhilaration and grandiosity, all of it against the background of the self-absorbed attitude of a typical author. His patience and support were unfailing.
Last but not least, my love, my daughter, my angel, Nikita, who has been my continuous drive. Without her being as accommodating as she was, this book wouldn't have been possible.
About the Reviewers
Brian Feeny is a technologist/evangelist working with many Big Data technologies such as analytics, visualization, data mining, machine learning, and statistics. He is a graduate student in Software Engineering at Harvard University, primarily focused on data science, where he gets to work on interesting data problems using some of the latest methods and technology.
Brian works for Presidio Networked Solutions, where he helps businesses with their Big Data challenges and helps them understand how to make best use of their data.
I would like to thank my wife, Scarlett, for her tolerance of my busy schedule. I would like to thank Presidio, my employer, for investing in our Big Data practice. Lastly, I would like to thank EMC and Pivotal for the excellent training and support they have given Presidio and me.
Scott Kahler started down the path in the mid 80s when he disconnected the power LED on his Commodore 64. In this fashion he could run his handwritten Dungeons and Dragons' random character generator, and his parents wouldn't complain about the computer being