Ebook384 pages3 hours

Data Lake Development with Big Data

Name: Data Lake Development with Big Data
Author: Pasupuleti Pradeep
ISBN: 9781785881664

By Pasupuleti Pradeep and Purra Beulah Salome

Rating: 0 out of 5 stars

()

Read preview

About this ebook

This book is for architects and senior managers who are responsible for building a strategy around their current data architecture. The reader will need a good knowledge of master data management, information lifecycle management, data governance, data product design, data engineering, and systems architecture. Also required is experience of Big Data technologies such as Hadoop, Spark, Splunk, and Storm.

Skip carousel

LanguageEnglish

PublisherPackt Publishing

Release dateNov 26, 2015

ISBN9781785881664

Author

Pasupuleti Pradeep

Related authors

Skip carousel

Related to Data Lake Development with Big Data

Related ebooks

Skip carousel

THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE: "THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE"
Ebook
THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE: "THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE"
byAJIT DASH
Rating: 3 out of 5 stars
3/5
Data Mapping for Data Warehouse Design
Ebook
Data Mapping for Data Warehouse Design
byQamar Shahbaz
Rating: 5 out of 5 stars
5/5
Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
Ebook
Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
byApril Reeve
Rating: 0 out of 5 stars
0 ratings
Data Warehousing in the Age of Big Data
Ebook
Data Warehousing in the Age of Big Data
byKrish Krishnan
Rating: 0 out of 5 stars
0 ratings
Real-Time Big Data Analytics
Ebook
Real-Time Big Data Analytics
byShilpi
Rating: 5 out of 5 stars
5/5
HDInsight Essentials - Second Edition
Ebook
HDInsight Essentials - Second Edition
byRajesh Nadipalli
Rating: 0 out of 5 stars
0 ratings
Building Big Data Applications
Ebook
Building Big Data Applications
byKrish Krishnan
Rating: 0 out of 5 stars
0 ratings
The Data Governance Imperative
Ebook
The Data Governance Imperative
bySteve Sarsfield
Rating: 0 out of 5 stars
0 ratings
Data Virtualization for Business Intelligence Systems: Revolutionizing Data Integration for Data Warehouses
Ebook
Data Virtualization for Business Intelligence Systems: Revolutionizing Data Integration for Data Warehouses
byRick van der Lans
Rating: 4 out of 5 stars
4/5
Big Data Analytics
Ebook
Big Data Analytics
byVenkat Ankam
Rating: 0 out of 5 stars
0 ratings
Data Stewardship: An Actionable Guide to Effective Data Management and Data Governance
Ebook
Data Stewardship: An Actionable Guide to Effective Data Management and Data Governance
byDavid Plotkin
Rating: 3 out of 5 stars
3/5
Data Engineering on Azure
Ebook
Data Engineering on Azure
byVlad Riscutia
Rating: 0 out of 5 stars
0 ratings
Hadoop Essentials
Ebook
Hadoop Essentials
byShiva Achari
Rating: 5 out of 5 stars
5/5
Designing Cloud Data Platforms
Ebook
Designing Cloud Data Platforms
byDanil Zburivsky
Rating: 0 out of 5 stars
0 ratings
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
Ebook
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
byalasdair gilchrist
Rating: 5 out of 5 stars
5/5
Testing the Data Warehouse Practicum: Assuring Data Content, Data Structures and Quality
Ebook
Testing the Data Warehouse Practicum: Assuring Data Content, Data Structures and Quality
byDoug Vucevic
Rating: 0 out of 5 stars
0 ratings
Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph
Ebook
Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph
byDavid Loshin
Rating: 5 out of 5 stars
5/5
Smarter Data Science: Succeeding with Enterprise-Grade Data and AI Projects
Ebook
Smarter Data Science: Succeeding with Enterprise-Grade Data and AI Projects
byNeal Fishman
Rating: 0 out of 5 stars
0 ratings
MLOps Engineering at Scale
Ebook
MLOps Engineering at Scale
byCarl Osipov
Rating: 0 out of 5 stars
0 ratings
PostgreSQL for Data Architects
Ebook
PostgreSQL for Data Architects
byJayadevan Maymala
Rating: 0 out of 5 stars
0 ratings
Data Science: What the Best Data Scientists Know About Data Analytics, Data Mining, Statistics, Machine Learning, and Big Data – That You Don't
Ebook
Data Science: What the Best Data Scientists Know About Data Analytics, Data Mining, Statistics, Machine Learning, and Big Data – That You Don't
byHerbert Jones
Rating: 5 out of 5 stars
5/5
Principles of Data Management: Facilitating information sharing
Ebook
Principles of Data Management: Facilitating information sharing
byKeith Gordon
Rating: 0 out of 5 stars
0 ratings
Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information<sup>TM</sup>
Ebook
Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information<sup>TM</sup>
byDanette McGilvray
Rating: 3 out of 5 stars
3/5
Modern Enterprise Business Intelligence and Data Management: A Roadmap for IT Directors, Managers, and Architects
Ebook
Modern Enterprise Business Intelligence and Data Management: A Roadmap for IT Directors, Managers, and Architects
byAlan Simon
Rating: 0 out of 5 stars
0 ratings
Data Analytics with Google Cloud Platform
Ebook
Data Analytics with Google Cloud Platform
byMurari Ramuka
Rating: 0 out of 5 stars
0 ratings
Expert Cube Development with SSAS Multidimensional Models
Ebook
Expert Cube Development with SSAS Multidimensional Models
byMarco Russo
Rating: 0 out of 5 stars
0 ratings
Business Intelligence Guidebook: From Data Integration to Analytics
Ebook
Business Intelligence Guidebook: From Data Integration to Analytics
byRick Sherman
Rating: 4 out of 5 stars
4/5
Hadoop Blueprints
Ebook
Hadoop Blueprints
byAnurag Shrivastava
Rating: 0 out of 5 stars
0 ratings
Practical DataOps: Delivering Agile Data Science at Scale
Ebook
Practical DataOps: Delivering Agile Data Science at Scale
byHarvinder Atwal
Rating: 0 out of 5 stars
0 ratings
Data Lake for Enterprises
Ebook
Data Lake for Enterprises
byPankaj Misra
Rating: 0 out of 5 stars
0 ratings

Enterprise Applications For You

Skip carousel

Bitcoin For Dummies
Ebook
Bitcoin For Dummies
byPrypto
Rating: 4 out of 5 stars
4/5
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
Ebook
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
byCea West
Rating: 4 out of 5 stars
4/5
The Ridiculously Simple Guide to Google Docs: A Practical Guide to Cloud-Based Word Processing
Ebook
The Ridiculously Simple Guide to Google Docs: A Practical Guide to Cloud-Based Word Processing
byScott La Counte
Rating: 0 out of 5 stars
0 ratings
50 Useful Excel Functions: Excel Essentials, #3
Ebook
50 Useful Excel Functions: Excel Essentials, #3
byM.L. Humphrey
Rating: 5 out of 5 stars
5/5
ChatGPT Ultimate User Guide - How to Make Money Online Faster and More Precise Using AI Technology
Ebook
ChatGPT Ultimate User Guide - How to Make Money Online Faster and More Precise Using AI Technology
byMaximus Wilson
Rating: 0 out of 5 stars
0 ratings
QuickBooks Online For Dummies
Ebook
QuickBooks Online For Dummies
byDavid H. Ringstrom
Rating: 0 out of 5 stars
0 ratings
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
Ebook
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
byKevin Clark
Rating: 5 out of 5 stars
5/5
QuickBooks 2023 All-in-One For Dummies
Ebook
QuickBooks 2023 All-in-One For Dummies
byStephen L. Nelson
Rating: 0 out of 5 stars
0 ratings
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
Ebook
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
byDavid Mayer
Rating: 0 out of 5 stars
0 ratings
Scrivener For Dummies
Ebook
Scrivener For Dummies
byGwen Hernandez
Rating: 4 out of 5 stars
4/5
The New Email Revolution: Save Time, Make Money, and Write Emails People Actually Want to Read!
Ebook
The New Email Revolution: Save Time, Make Money, and Write Emails People Actually Want to Read!
byRobert W. Bly
Rating: 5 out of 5 stars
5/5
Excel Formulas and Functions 2020: Excel Academy, #1
Ebook
Excel Formulas and Functions 2020: Excel Academy, #1
byAdam Ramirez
Rating: 4 out of 5 stars
4/5
Mastering ChatGPT: Create Highly Effective Prompts, Strategies, and Best Practices to Go From Novice to Expert
Ebook
Mastering ChatGPT: Create Highly Effective Prompts, Strategies, and Best Practices to Go From Novice to Expert
byTJ Books
Rating: 3 out of 5 stars
3/5
Excel 2019 For Dummies
Ebook
Excel 2019 For Dummies
byGreg Harvey
Rating: 3 out of 5 stars
3/5
MrExcel XL: The 40 Greatest Excel Tips of All Time
Ebook
MrExcel XL: The 40 Greatest Excel Tips of All Time
byBill Jelen
Rating: 4 out of 5 stars
4/5
Excel for Beginners 2023: A Step-by-Step and Quick Reference Guide to Master the Fundamentals, Formulas, Functions, & Charts in Excel with Practical Examples | A Complete Excel Shortcuts Cheat Sheet
Ebook
Excel for Beginners 2023: A Step-by-Step and Quick Reference Guide to Master the Fundamentals, Formulas, Functions, & Charts in Excel with Practical Examples | A Complete Excel Shortcuts Cheat Sheet
byJames H. Moyle
Rating: 0 out of 5 stars
0 ratings
Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture
Ebook
Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture
byJamshid Gharajedaghi
Rating: 4 out of 5 stars
4/5
Microsoft Office 365 Bible: 10:1 Mastery | Excel in Your Profession, Enhance Time Management, and Foster Exceptional Collaboration [III EDITION]: Career Elevator
Ebook
Microsoft Office 365 Bible: 10:1 Mastery | Excel in Your Profession, Enhance Time Management, and Foster Exceptional Collaboration [III EDITION]: Career Elevator
byKevin Pitch
Rating: 5 out of 5 stars
5/5
QuickBooks Online For Dummies
Ebook
QuickBooks Online For Dummies
byElaine Marmel
Rating: 0 out of 5 stars
0 ratings
Excel 2023 for Beginners: A Complete Quick Reference Guide from Beginner to Advanced with Simple Tips and Tricks to Master All Essential Fundamentals, Formulas, Functions, Charts, Tools, & Shortcuts
Ebook
Excel 2023 for Beginners: A Complete Quick Reference Guide from Beginner to Advanced with Simple Tips and Tricks to Master All Essential Fundamentals, Formulas, Functions, Charts, Tools, & Shortcuts
byTerry R. Hoffmann
Rating: 0 out of 5 stars
0 ratings
QuickBooks 2021 For Dummies
Ebook
QuickBooks 2021 For Dummies
byStephen L. Nelson
Rating: 0 out of 5 stars
0 ratings
Excel 2016 For Dummies
Ebook
Excel 2016 For Dummies
byGreg Harvey
Rating: 4 out of 5 stars
4/5
Microsoft Power Platform A Deep Dive: Dig into Power Apps, Power Automate, Power BI, and Power Virtual Agents (English Edition)
Ebook
Microsoft Power Platform A Deep Dive: Dig into Power Apps, Power Automate, Power BI, and Power Virtual Agents (English Edition)
byBijay Kumar Sahoo
Rating: 0 out of 5 stars
0 ratings
Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program
Ebook
Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program
byJohn Ladley
Rating: 4 out of 5 stars
4/5
Excel Formulas That Automate Tasks You No Longer Have Time For
Ebook
Excel Formulas That Automate Tasks You No Longer Have Time For
byErik Kopp
Rating: 5 out of 5 stars
5/5
Excel Tips and Tricks
Ebook
Excel Tips and Tricks
byM.L. Humphrey
Rating: 0 out of 5 stars
0 ratings
Enterprise AI For Dummies
Ebook
Enterprise AI For Dummies
byZachary Jarvinen
Rating: 3 out of 5 stars
3/5
Mastering QuickBooks 2020: The ultimate guide to bookkeeping and QuickBooks Online
Ebook
Mastering QuickBooks 2020: The ultimate guide to bookkeeping and QuickBooks Online
byCrystalynn Shelton
Rating: 0 out of 5 stars
0 ratings
Experts' Guide to OneNote
Ebook
Experts' Guide to OneNote
byJeremy P. Jones
Rating: 5 out of 5 stars
5/5
101 Ready-to-Use Excel Formulas
Ebook
101 Ready-to-Use Excel Formulas
byMichael Alexander
Rating: 4 out of 5 stars
4/5

Related podcast episodes

Skip carousel

Data Modeling That Evolves With Your Business Using Data Vault - Episode 119: An interview about the data vault method of data modeling and how it simplifies integrating the evolving data sources that you are dealing with in your enterprise data warehouse
Podcast episode
Data Modeling That Evolves With Your Business Using Data Vault - Episode 119: An interview about the data vault method of data modeling and how it simplifies integrating the evolving data sources that you are dealing with in your enterprise data warehouse
byData Engineering Podcast
0 ratings
0% found this document useful
Reflections On Designing A Data Platform From Scratch: A monologue by Tobias Macey, the host of the show, about the design considerations involved in building a data platform and how the lessons learned from running the Data Engineering Podcast are influencing the choices made.
Podcast episode
Reflections On Designing A Data Platform From Scratch: A monologue by Tobias Macey, the host of the show, about the design considerations involved in building a data platform and how the lessons learned from running the Data Engineering Podcast are influencing the choices made.
byData Engineering Podcast
100%
100% found this document useful
Build Your Analytics With A Collaborative And Expressive SQL IDE Using Querybook: An interview about the Querybook SQL IDE for big data analytics and how you can use it to build more expressive and maintainable analytics.
Podcast episode
Build Your Analytics With A Collaborative And Expressive SQL IDE Using Querybook: An interview about the Querybook SQL IDE for big data analytics and how you can use it to build more expressive and maintainable analytics.
byData Engineering Podcast
0 ratings
0% found this document useful
SnowflakeDB: The Data Warehouse Built For The Cloud - Episode 110: An interview about how SnowflakeDB was built to provide a performant and flexible data platform for the cloud era
Podcast episode
SnowflakeDB: The Data Warehouse Built For The Cloud - Episode 110: An interview about how SnowflakeDB was built to provide a performant and flexible data platform for the cloud era
byData Engineering Podcast
0 ratings
0% found this document useful
Introduction to Data Mesh
Podcast episode
Introduction to Data Mesh
byThe Cloudcast
0 ratings
0% found this document useful
Ali Ghodsi – The Past, Present, and Future of Big Data – [Founder’s Field Guide, EP.18]: My Guest today is Ali Ghodsi, founder and CEO of Databricks, a data analytics platform for data scientists and developers. He's also the founder of Apache Spark, the open-source project that Databricks is built on, and is an accomplished researcher at...
Podcast episode
Ali Ghodsi – The Past, Present, and Future of Big Data – [Founder’s Field Guide, EP.18]: My Guest today is Ali Ghodsi, founder and CEO of Databricks, a data analytics platform for data scientists and developers. He's also the founder of Apache Spark, the open-source project that Databricks is built on, and is an accomplished researcher at...
byInvest Like the Best with Patrick O'Shaughnessy
0 ratings
0% found this document useful
Straining Your Data Lake Through A Data Mesh - Episode 90: An interview about how the data mesh architectural and organizational pattern can lead to a more maintainable data platform
Podcast episode
Straining Your Data Lake Through A Data Mesh - Episode 90: An interview about how the data mesh architectural and organizational pattern can lead to a more maintainable data platform
byData Engineering Podcast
0 ratings
0% found this document useful
Stryker on How to Connect Data Strategy to Business Value: Modern data leaders know creating a data-informed culture requires cross-functional partnership and collaboration across the entire business. IT by themselves can’t do it. Nor can individual business departments. Both the IT and business strategy must be in lock step to achieve results. On this episode of The Data Chief, Dora Boussias, Senior Director of Data Strategy and Architecture at Stryker, discusses the role of modern data executives, three keys to creating a data-informed culture, and her approach to breaking down silos based on her own 28 years of experience building effective data strategies across industries.
Podcast episode
Stryker on How to Connect Data Strategy to Business Value: Modern data leaders know creating a data-informed culture requires cross-functional partnership and collaboration across the entire business. IT by themselves can’t do it. Nor can individual business departments. Both the IT and business strategy must be in lock step to achieve results. On this episode of The Data Chief, Dora Boussias, Senior Director of Data Strategy and Architecture at Stryker, discusses the role of modern data executives, three keys to creating a data-informed culture, and her approach to breaking down silos based on her own 28 years of experience building effective data strategies across industries.
byThe Data Chief
0 ratings
0% found this document useful
Azure Databricks: I sat down with Ali Ghodsi, CEO and found of Databricks, and John Chirapurath, GM for Data Platform Marketing at Microsoft related to the recent announcement of Azure Databricks. When I heard about the announcement, my first thoughts were...
Podcast episode
Azure Databricks: I sat down with Ali Ghodsi, CEO and found of Databricks, and John Chirapurath, GM for Data Platform Marketing at Microsoft related to the recent announcement of Azure Databricks. When I heard about the announcement, my first thoughts were...
byData Skeptic
0 ratings
0% found this document useful
Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams: With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.
Podcast episode
Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams: With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.
byData Engineering Podcast
0 ratings
0% found this document useful
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46: Building A Master Data Catalog Using Machine Learning (Interview)
Podcast episode
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46: Building A Master Data Catalog Using Machine Learning (Interview)
byData Engineering Podcast
100%
100% found this document useful
A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With Singlestore: An interview with Shireesh Thota about how the Singlestore database engine allows you to reduce architectural sprawl in your data systems by combining performant and scalable transactional and analytical capabilities into a single platform
Podcast episode
A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With Singlestore: An interview with Shireesh Thota about how the Singlestore database engine allows you to reduce architectural sprawl in your data systems by combining performant and scalable transactional and analytical capabilities into a single platform
byData Engineering Podcast
0 ratings
0% found this document useful
Getting Technical about the Data Center Revolution with Jonathan Friedmann, CEO of Speedata
Podcast episode
Getting Technical about the Data Center Revolution with Jonathan Friedmann, CEO of Speedata
byMaking Data Simple
0 ratings
0% found this document useful
#81 The Gradual Process of Building a Data Strategy
Podcast episode
#81 The Gradual Process of Building a Data Strategy
byDataFramed
100%
100% found this document useful
#456: Data Architectures with AWS Hero Elliott Cordo: AWS Data Hero and Head of Data at Capsule, Elliott Cordo, has built many ground-up data architecture
Podcast episode
#456: Data Architectures with AWS Hero Elliott Cordo: AWS Data Hero and Head of Data at Capsule, Elliott Cordo, has built many ground-up data architecture
byAWS Podcast
0 ratings
0% found this document useful
Best Integration Practices for Architecture Automation | BiZZdesign
Podcast episode
Best Integration Practices for Architecture Automation | BiZZdesign
byEnterprise Architecture Podcast
0 ratings
0% found this document useful
Maintaining Your Data Lake At Scale With Spark - Episode 85: A conversation with the architect of Delta Lake on the challenges of building a sustainable data lake at scale
Podcast episode
Maintaining Your Data Lake At Scale With Spark - Episode 85: A conversation with the architect of Delta Lake on the challenges of building a sustainable data lake at scale
byData Engineering Podcast
0 ratings
0% found this document useful
#122 How Organizations Can Bridge the Data Literacy Gap
Podcast episode
#122 How Organizations Can Bridge the Data Literacy Gap
byDataFramed
0 ratings
0% found this document useful
Kafka Streams with Jay Kreps: Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics. In a time when there are numerous streaming frameworks already out there, why do we need yet another?
Podcast episode
Kafka Streams with Jay Kreps: Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics. In a time when there are numerous streaming frameworks already out there, why do we need yet another?
byCloud Engineering Archives - Software Engineering Daily
0 ratings
0% found this document useful
2155: Databricks - The Story Behind the Lakehouse Company: Many are citing open source as the future. The UK Government's National Data Strategy even talks about the importance of opening public sector datasets to form the backbone of innovation, efficiency, and growth. This is a trend that Databricks...
Podcast episode
2155: Databricks - The Story Behind the Lakehouse Company: Many are citing open source as the future. The UK Government's National Data Strategy even talks about the importance of opening public sector datasets to form the backbone of innovation, efficiency, and growth. This is a trend that Databricks...
byThe Tech Talks Daily Podcast
0 ratings
0% found this document useful
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42: A Whirlwind Tour Of The PostgreSQL Database (Interview)
Podcast episode
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42: A Whirlwind Tour Of The PostgreSQL Database (Interview)
byData Engineering Podcast
100%
100% found this document useful
MLA 021 Databricks: Discussing Databricks with Ming Chang from (part of )
Podcast episode
MLA 021 Databricks: Discussing Databricks with Ming Chang from (part of )
byMachine Learning Guide
0 ratings
0% found this document useful
108: PySpark - Jonathan Rioux: Apache Spark is a unified analytics engine for large-scale data processing. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task.
Podcast episode
108: PySpark - Jonathan Rioux: Apache Spark is a unified analytics engine for large-scale data processing. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task.
byTest and Code
0 ratings
0% found this document useful
Delivering Data and Analytics Value: CEOs cite data and analytics as the top capability for enabling growth over the next two years. In this podcast, Gartner’s chief of research for data and analytics, Carlie Idoine, highlights the top issues facing chief data and analytics officers (CDAOs) and how to demonstrate value.
Podcast episode
Delivering Data and Analytics Value: CEOs cite data and analytics as the top capability for enabling growth over the next two years. In this podcast, Gartner’s chief of research for data and analytics, Carlie Idoine, highlights the top issues facing chief data and analytics officers (CDAOs) and how to demonstrate value.
byTechWave: A Gartner Podcast for IT Leaders
0 ratings
0% found this document useful
Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig: An interview with Ashish Mrig about his career in data engineering, his experiences managing data teams at Wayfair, and the technical considerations that factor into platform design decisions in large organizations.
Podcast episode
Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig: An interview with Ashish Mrig about his career in data engineering, his experiences managing data teams at Wayfair, and the technical considerations that factor into platform design decisions in large organizations.
byData Engineering Podcast
100%
100% found this document useful
Build Your Data Analytics Like An Engineer - Episode 81: An interview about how dbt enables your data teams to build better analytics in your data warehouse
Podcast episode
Build Your Data Analytics Like An Engineer - Episode 81: An interview about how dbt enables your data teams to build better analytics in your data warehouse
byData Engineering Podcast
0 ratings
0% found this document useful
Cloud Dataflow with Eric Anderson: Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large volume data processing have become more sophisticated as the industry and open source communities have ...
Podcast episode
Cloud Dataflow with Eric Anderson: Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large volume data processing have become more sophisticated as the industry and open source communities have ...
byCloud Engineering Archives - Software Engineering Daily
0 ratings
0% found this document useful
#75 The Data Storytelling Skills Data Teams Need with Andy Cotgreave, Technical Evangelist at Tableau
Podcast episode
#75 The Data Storytelling Skills Data Teams Need with Andy Cotgreave, Technical Evangelist at Tableau
byDataFramed
0 ratings
0% found this document useful
#1 Data Science, Past, Present and Future: Hilary Mason talks about the past, present, and future of data science with Hugo. Hilary is the VP of Research at Cloudera Fast Forward, a machine intelligence research company, and the data scientist in residence at Accel. If you want to hear about wh...
Podcast episode
#1 Data Science, Past, Present and Future: Hilary Mason talks about the past, present, and future of data science with Hugo. Hilary is the VP of Research at Cloudera Fast Forward, a machine intelligence research company, and the data scientist in residence at Accel. If you want to hear about wh...
byDataFramed
100%
100% found this document useful
GKE Cost Optimization with Kaslin Fields and Anthony Bushong: This week on the podcast, fellow Googlers Kaslin Fields and Anthony Bushong chat with hosts Mark Mirchandani and Stephanie Wong about how to budget and optimize spending with Google Kubernetes Engine.
Podcast episode
GKE Cost Optimization with Kaslin Fields and Anthony Bushong: This week on the podcast, fellow Googlers Kaslin Fields and Anthony Bushong chat with hosts Mark Mirchandani and Stephanie Wong about how to budget and optimize spending with Google Kubernetes Engine.
byGoogle Cloud Platform Podcast
0 ratings
0% found this document useful

Skip carousel

Understanding ELT & ETL
Techfastly
Article
Understanding ELT & ETL
Apr 1, 2021
8 min read
Why Is ELT Better For Cloud Data Warehousing?
Techfastly
Article
Why Is ELT Better For Cloud Data Warehousing?
Apr 1, 2021
2 min read
What is ELT?
Techfastly
Article
What is ELT?
Apr 1, 2021
It stands for extract, load, and transform- the processes a data pipeline uses for replicating the data from a source system into a target system such as a cloud data warehouse. 1. Extraction is the first step in which data is copied from the source
6 min read
Data-driven Decision Making That Uses Data, Mind And Heart
The European Business Review
Article
Data-driven Decision Making That Uses Data, Mind And Heart
Jan 31, 2020
14 min read
Enter the Industry 4.0 Era Today by Using “Dark Data” You Already Have
The European Business Review
Article
Enter the Industry 4.0 Era Today by Using “Dark Data” You Already Have
Aug 2, 2019
7 min read
The Future Of The Data Economy
The European Business Review
Article
The Future Of The Data Economy
Jun 1, 2022
6 min read
DataStax The Real-time Data Company, Unveiled “Change Data Capture” (CDC) for Astra DB
Techfastly
Article
DataStax The Real-time Data Company, Unveiled “Change Data Capture” (CDC) for Astra DB
May 1, 2022
3 min read
The Future Of The Database
Linux Format
Article
The Future Of The Database
Aug 27, 2019
7 min read
Want A Job In Data Science? You Might Have To Take A Standardized Test When Applying
Chicago Tribune
Article
Want A Job In Data Science? You Might Have To Take A Standardized Test When Applying
Jul 10, 2018
3 min read
Types Of Databases
Linux Format
Article
Types Of Databases
Aug 27, 2019
NoSQL databases provide the performance, scalability and stability that’s required by the modern data-driven apps we interact with these days. But that is where the similarity between NoSQL systems end. In fact, it wouldn’t be wrong to say that the o
1 min read
AWS Vs Azure What’s The Difference?
PC Pro Magazine
Article
AWS Vs Azure What’s The Difference?
Sep 11, 2022
7 min read
How Netflix’s OTT Architecture Functions?
Techfastly
Article
How Netflix’s OTT Architecture Functions?
May 1, 2022
With so many OTT platforms in the market today, Netflix has managed to capture a majority of the audience on a global scale. Netflix has become the go-to source of so much entertainment for consumers in less than 20 years. It can even be said that Ne
4 min read
Elasticsearch And Kibana Basics
Linux Format
Article
Elasticsearch And Kibana Basics
Dec 15, 2020
1 min read
DJANGO Create A Database-driven Website
Linux Format
Article
DJANGO Create A Database-driven Website
Jun 4, 2019
The Django web framework was named after the famous guitarist Django Reinhardt and was first created by web developers at a small newspaper in Kansas. The main goals of Django is to enable fast development of complex websites with database needs. It
7 min read
Build A Search And Analytic Engine
Linux Format
Article
Build A Search And Analytic Engine
Mar 10, 2020
7 min read
Saxo Bank And Thoughtworks: Enabling Data Democratization At A Global Investment Bank
Business Today
Article
Saxo Bank And Thoughtworks: Enabling Data Democratization At A Global Investment Bank
Jan 20, 2023
2 min read
Facilities Systems
Facility Management
Article
Facilities Systems
Oct 21, 2018
5 min read
Inform And Enhance Your Business With Open Data
PC Pro Magazine
Article
Inform And Enhance Your Business With Open Data
Jun 10, 2021
7 min read
Extending The Time Equation
The European Business Review
Article
Extending The Time Equation
Jul 26, 2021
4 min read
Leadership Forum: Making Digital Transformation A Reality
Rotman Management
Article
Leadership Forum: Making Digital Transformation A Reality
Jan 1, 2018
Glenda Crisp Senior Vice President and Chief Data Officer, TD Bank Group + Connie Bonello Associate Partner, Financial Services, IBM Canada IN MOST OF TODAY’S ORGANIZATIONS, data underpins every transaction, operation and interaction. And yet, the ab
8 min read
TimeXtender HELPS EUROPEAN COMPANIES FROM NUMEROUS INDUSTRIES MANAGE THEIR DATA
The European Business Review
Article
TimeXtender HELPS EUROPEAN COMPANIES FROM NUMEROUS INDUSTRIES MANAGE THEIR DATA
May 25, 2021
3 min read
ARTIFICIAL INTELLIGENCE (AI) IN SUPPLY CHAIN PLANNING THE Future is Here & Now
The European Business Review
Article
ARTIFICIAL INTELLIGENCE (AI) IN SUPPLY CHAIN PLANNING THE Future is Here & Now
Dec 3, 2019
7 min read
Integrated Workplace Management Systems
Facility Management
Article
Integrated Workplace Management Systems
Dec 23, 2018
Property and facilities management are data-rich operating worlds. This is becoming even more complex as the Internet of Things (IoT) provides the capability to imbed sensors and diagnostic tools to monitor the use and performance of everything in re
4 min read
Data Fabric
PC Pro Magazine
Article
Data Fabric
Aug 13, 2020
3 min read
Putting Artificial Intelligence to Work
Rotman Management
Article
Putting Artificial Intelligence to Work
May 1, 2018
11 min read
Good Governance for Dark Data: GUIDELINES FOR INDUSTRIAL IOT MANAGERS
The European Business Review
Article
Good Governance for Dark Data: GUIDELINES FOR INDUSTRIAL IOT MANAGERS
Mar 31, 2020
7 min read
Signals Of Change: how To Evolve For The New Global Reality
Rotman Management
Article
Signals Of Change: how To Evolve For The New Global Reality
May 1, 2022
11 min read
Code Meets Cognition
Business Today
Article
Code Meets Cognition
May 12, 2023
3 min read
Will Generative AI Disrupt Your Company And Your need For Workers?
The European Business Review
Article
Will Generative AI Disrupt Your Company And Your need For Workers?
Jul 31, 2023
5 min read
Cognitive Agents and Reinforcement of User Experience
Techfastly
Article
Cognitive Agents and Reinforcement of User Experience
Dec 1, 2021
3 min read

Related categories

Skip carousel

Reviews for Data Lake Development with Big Data

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Data Lake Development with Big Data - Pasupuleti Pradeep

Data Lake Development with Big Data

Credits

About the Authors

Acknowledgement

About the Reviewer

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Errata

Piracy

Questions

1. The Need for Data Lake

Before the Data Lake

Need for Data Lake

Defining Data Lake

Key benefits of Data Lake

Challenges in implementing a Data Lake

When to go for a Data Lake implementation

Data Lake architecture

Architectural considerations

Architectural composition

Architectural details

Understanding Data Lake layers

The Data Governance and Security Layer

The Information Lifecycle Management layer

The Metadata Layer

Understanding Data Lake tiers

The Data Intake tier

The Source System Zone

The Transient Zone

The Raw Zone

Batch Raw Storage

The real-time Raw Storage

The Data Management tier

The Integration Zone

The Enrichment Zone

The Data Hub Zone

The Data Consumption tier

The Data Discovery Zone

The Data Provisioning Zone

Summary

2. Data Intake

Understanding Intake tier zones

Source System Zone functionalities

Understanding connectivity processing

Understanding Intake Processing for data variety

Structured data

The need for integrating Structured Data in the Data Lake

Structured data loading approaches

Semi-structured data

The need for integrating semi-structured data in the Data Lake

Semi-structured data loading approaches

Unstructured data

The need for integrating Unstructured data in the Data Lake

Unstructured data loading approaches

Transient Landing Zone functionalities

File validation checks

File duplication checks

File integrity checks

File size checks

File periodicity checks

Data Integrity checks

Checking record counts

Checking for column counts

Schema validation checks

Raw Storage Zone functionalities

Data lineage processes

Watermarking process

Metadata capture

Deep Integrity checks

Bit Level Integrity checks

Periodic checksum checks

Security and governance

Information Lifecycle Management

Practical Data Ingestion scenarios

Architectural guidance

Structured data use cases

Semi-structured and unstructured data use cases

Big Data tools and technologies

Ingestion of structured data

Sqoop

Use case scenarios for Sqoop

WebHDFS

Use case scenarios for WebHDFS

Ingestion of streaming data

Apache Flume

Use case scenarios for Flume

Fluentd

Use case scenarios for Fluentd

Kafka

Use case scenarios for Kafka

Amazon Kinesis

Use case scenarios for Kinesis

Apache Storm

Use case scenarios for Storm

Summary

3. Data Integration, Quality, and Enrichment

Introduction to the Data Management Tier

Understanding Data Integration

Introduction to Data Integration

Prominent features of Data Integration

Loosely coupled Integration

Ease of use

Secure access

High-quality data

Lineage tracking

Practical Data Integration scenarios

The workings of Data Integration

Raw data discovery

Data quality assessment

Profiling the data

Data cleansing

Deletion of missing, null, or invalid values

Imputation of missing, null, or invalid values

Data transformations

Unstructured text transformation techniques

Structured data transformations

Data enrichment

Collect metadata and track data lineage

Traditional Data Integration versus Data Lake

Data pipelines

Addressing the limitations using Data Lake

Data partitioning

Addressing the limitations using Data Lake

Scale on demand

Addressing the limitations using Data Lake

Data ingest parallelism

Addressing the limitations using Data Lake

Extensibility

Addressing the limitations using Data Lake

Big Data tools and technologies

Syncsort

Use case scenarios for Syncsort

Talend

Use case scenarios for Talend

Pentaho

Use case scenarios for Pentaho

Summary

4. Data Discovery and Consumption

Understanding the Data Consumption tier

Data Consumption – Traditional versus Data Lake

An introduction to Data Consumption

Practical Data Consumption scenarios

Data Discovery and metadata

Enabling Data Discovery

Data classification

Classifying unstructured data

Named entity recognition

Topic modeling

Text clustering

Applications of data classification

Relation extraction

Extracting relationships from unstructured data

Feature-based methods

Understanding how feature-based methods work

Implementation

Semantic technologies

Understanding how semantic technologies work

Implementation

Extracting Relationships from structured data

Applications of relation extraction

Indexing data

Inverted index

Understanding how inverted index works

Implementation

Applications of Indexing

Performing Data Discovery

Semantic search

Word sense disambiguation

Latent Semantic Analysis

Faceted search

Fuzzy search

Edit distance

Wildcard and regular expressions

Data Provisioning and metadata

Data publication

Data subscription

Data Provisioning functionalities

Data formatting

Data selection

Data Provisioning approaches

Post-provisioning processes

Architectural guidance

Data Discovery

Big Data tools and technologies

Elasticsearch

Use case scenarios for Elasticsearch

IBM InfoSphere Data Explorer

Use case scenarios for IBM InfoSphere Data Explorer

Tableau

Use case scenarios for Tableau

Splunk

Use case scenarios for Splunk

Data Provisioning

Big Data tools and technologies

Data Dispatch

Use case scenarios for Data Dispatch

Summary

5. Data Governance

Understanding Data Governance

Introduction to Data Governance

The need for Data Governance

Governing Big Data in the Data Lake

Data Governance – Traditional versus Data Lake

Practical Data Governance scenarios

Data Governance components

Metadata management and lineage tracking

Data security and privacy

Big Data implications for security and privacy

Security issues in the Data Lake tiers

The Intake Tier

The Management Tier

The Consumption Tier

Information Lifecycle Management

Big Data implications for ILM

Implementing ILM using Data Lake

The Intake Tier

The Management Tier

The Consumption Tier

Architectural guidance

Big Data tools and technologies

Apache Falcon

Understanding how Falcon works

Use case scenarios for Falcon

Apache Atlas

Understanding how Atlas works

Use case scenarios for Atlas

IBM Big Data platform

Understanding how governance is provided in IBM Big Data platform

Use case scenarios for IBM Big Data platform

The current and future trends

Data Lake and future enterprise trajectories

Future Data Lake technologies

Summary

Index

Data Lake Development with Big Data

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2015

Production reference: 1241115

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-808-3

www.packtpub.com

Credits

Authors

Pradeep Pasupuleti

Beulah Salome Purra

Reviewer

Dr. Kornel Amadeusz Skałkowski

Commissioning Editor

Priya Singh

Acquisition Editor

Ruchita Bhansali

Content Development Editor

Rohit Kumar Singh

Technical Editor

Saurabh Malhotra

Copy Editor

Trishya Hajare

Project Coordinator

Izzat Contractor

Proofreader

Safis Editing

Indexer

Hemangini Bari

Graphics

Jason Monteiro

Kirk D'Penha

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade

About the Authors

Pradeep Pasupuleti has 18 years of experience in architecting and developing distributed and real-time data-driven systems. He constantly explores ways to use the power and promise of advanced analytics-driven platforms to solve the problems of the common man. He founded Datatma, a consulting firm, with a mission to humanize Big Data analytics, putting it to use to solve simple problems that serve a higher purpose.

He architected robust Big Data-enabled automated learning engines that enterprises regularly use in production in order to save time, money, and the lives of humans.

He built solid interdisciplinary data science teams that bridged the gap between theory and practice, thus, creating compelling data products. His primary focus is always to ensure his customers are delighted by assisting and addressing their business problems through data products that use Big Data technologies and algorithms. He consistently demonstrated thought leadership by solving high-dimensional data problems and getting phenomenal results.

He has performed strategic leadership roles in technology consulting, advising Fortune 100 companies on Big Data strategy and creating Big Data Centers of Excellence.

He has worked on use cases such as enterprise Data Lake, fraud detection, patient re-admission prediction, student performance prediction, claims optimization sentiment mining, cloud infrastructure SLA violation prediction, data leakage prevention, and mainframe offloaded ETL on Hadoop.

In the book Pig Design Patterns, Packt Publishing, he has compiled his learning and experiences from the challenges involved in building Hadoop-driven data products such as data ingest, data cleaning and validating, data transformation, dimensionality reduction, and many other interesting Big Data war stories.

Out of his office hours, he enjoys running marathons, exploring archeological sites, finding patterns in unrelated data sources, mentoring start-ups, and budding researchers.

He can be reached at <Pasupuleti.pradeepkumar@gmail.com> and https://in.linkedin.com/in/pradeeppasupuleti.

Acknowledgement

This book is dedicated to the loving memory of my mother, Smt. Sumathy; without her never-failing encouragement and everlasting love I would have never been half as good.

First and foremost, I have to thank my father, Sri. Prabhakar Pasupuleti, who never ceases to be a constant source of inspiration, a ray of hope, humility and strength, and whose support and guidance have given me the courage to chase my dreams.

I should also express my deep sense of gratitude to each of my family members, Sushma, Sresht, and Samvruth, who stood by me at every moment through very tough times and enabled me to complete this book.

I would like to sincerely thank all my teachers who were instrumental in shaping me. Among them, I would like to thank Usha Madam, Vittal Rao Sir, Gopal Krishna Sir, and Brindavan Sir for their stellar role in improving me.

I would also like to thank all my friends for their understanding in many ways. Their friendship makes my life a wonderful experience. I cannot list all the names here, but you are always on my mind.

Special thanks to the team at Packt for their contribution to this book.

Finally, I would like to thank my team, Salome et.al, that has placed immense faith in the power of Big Data analytics and built cutting edge data products.

Thank you lord, for always being there for me.

Beulah Salome Purra has over 11 years of experience and she specializes in building highly scalable distributed systems. She has worked extensively on architecting multiple large-scale Big Data solutions for Fortune 100 companies. Her core expertise lies in working on Big Data Analytics. In her current role at ATMECS, her focus is on building robust and scalable data products that extract value from huge data assets.

She can be reached at https://www.linkedin.com/in/beulahsalomep.

I am grateful to my parents, Rathnam and Padma, who have constantly encouraged and supported me throughout. I would like to thank my husband, Pratap, for his help on this book, his patience, love, and support; my brothers, Joel and Michael, for all their support.

I would like to profusely thank Pradeep Pasupuleti for mentoring me; working with him has been an enriching experience. I can't thank him enough for his constant encouragement, guidance, support, and for providing me an opportunity to work with him on this book.

Special thanks to David Hawke, Sanjay Singh, and Ravi Velagapudi—the leadership team at ATMECS—for their encouragement and support while I was writing this book.

Thanks to the editors and reviewers at Packt for all their effort in making this book better.

About the Reviewer

Dr. Kornel Amadeusz Skałkowski has a solid academic and industrial background. For more than 5 years, he worked as an assistant at AGH University of Science and Technology in Krakow. In 2015, he obtained his PhD. in the subject of machine learning-based adaptation of the SOA systems. He has cooperated with several companies on various projects concerning intelligent systems, machine learning, and Big Data. Currently, he works as a Big Data developer for SAP SE.

He is the co-author of 19 papers concerning software engineering, SOA systems, and machine learning. He also works as a reviewer for the American Journal of Software Engineering and Applications. He has participated in numerous European and national scientific projects. His research interests include machine learning, Big Data, and software engineering.

I would like to kindly thank my family, relatives, and friends, for their endless patience and support during the reviewing of this book.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

The book Data Lake Development with Big Data is a practical guide to help you learn the essential architectural approaches to design and build Data Lakes. It walks you through the various components of Data Lakes, such as data

Enjoying the preview?

Page 1 of 1

Data Lake Development with Big Data

About this ebook

Pasupuleti Pradeep

Related authors

Related to Data Lake Development with Big Data

Related ebooks

Enterprise Applications For You

Related podcast episodes

Related articles

Related categories

Reviews for Data Lake Development with Big Data

What did you think?

Book preview

Data Lake Development with Big Data - Pasupuleti Pradeep

Table of Contents

Data Lake Development with Big Data

Data Lake Development with Big Data

Credits

About the Authors

Acknowledgement

About the Reviewer

Support files, eBooks, discount offers, and more

Why subscribe?

Preface