Ebook, 483 pages, 2 hours

Spark Cookbook

Name: Spark Cookbook
Author: Rishi Yadav
ISBN: 9781783987078

About this ebook

About This Book

Become an expert at graph processing using GraphX
Use Apache Spark as your single big data compute platform and master its libraries
Learn with recipes that can be run on a single machine as well as on a production cluster of thousands of machines

Who This Book Is For

If you are a data engineer, an application developer, or a data scientist who would like to leverage the power of Apache Spark to get better insights from big data, then this is the book for you.

Language: English
Publisher: Packt Publishing
Release date: Jul 27, 2015
ISBN: 9781783987078

Skip carousel


Spark Cookbook - Rishi Yadav

Spark Cookbook

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Conventions

Reader feedback

Customer support

Downloading the color images of this book

Errata

Piracy

Questions

1. Getting Started with Apache Spark

Introduction

Installing Spark from binaries

Getting ready

How to do it...

Building the Spark source code with Maven

Getting ready

How to do it...

Launching Spark on Amazon EC2

Getting ready

How to do it...

See also

Deploying on a cluster in standalone mode

Getting ready

How to do it...

How it works...

See also

Deploying on a cluster with Mesos

How to do it...

Deploying on a cluster with YARN

Getting ready

How to do it...

How it works…

Using Tachyon as an off-heap storage layer

How to do it...

See also

2. Developing Applications with Spark

Introduction

Exploring the Spark shell

How to do it...

Developing Spark applications in Eclipse with Maven

Getting ready

How to do it...

Developing Spark applications in Eclipse with SBT

How to do it...

Developing a Spark application in IntelliJ IDEA with Maven

How to do it...

Developing a Spark application in IntelliJ IDEA with SBT

How to do it...

3. External Data Sources

Introduction

Loading data from the local filesystem

How to do it...

Loading data from HDFS

How to do it...

There's more…

Loading data from HDFS using a custom InputFormat

How to do it...

Loading data from Amazon S3

How to do it...

Loading data from Apache Cassandra

How to do it...

There's more...

Merge strategies in sbt-assembly

Loading data from relational databases

Getting ready

How to do it...

How it works…

4. Spark SQL

Introduction

Understanding the Catalyst optimizer

How it works…

Analysis

Logical plan optimization

Physical planning

Code generation

Creating HiveContext

Getting ready

How to do it...

Inferring schema using case classes

How to do it...

Programmatically specifying the schema

How to do it...

How it works…

Loading and saving data using the Parquet format

How to do it...

How it works…

There's more…

Loading and saving data using the JSON format

How to do it...

How it works…

There's more…

Loading and saving data from relational databases

Getting ready

How to do it...

Loading and saving data from an arbitrary source

How to do it...

There's more…

5. Spark Streaming

Introduction

Word count using Streaming

How to do it...

Streaming Twitter data

How to do it...

Streaming using Kafka

Getting ready

How to do it...

There's more…

6. Getting Started with Machine Learning Using MLlib

Introduction

Creating vectors

How to do it…

How it works...

Creating a labeled point

How to do it…

Creating matrices

How to do it…

Calculating summary statistics

How to do it…

Calculating correlation

Getting ready

How to do it…

Doing hypothesis testing

How to do it…

Creating machine learning pipelines using ML

Getting ready

How to do it…

7. Supervised Learning with MLlib – Regression

Introduction

Using linear regression

Getting ready

How to do it…

Understanding cost function

Doing linear regression with lasso

How to do it…

Doing ridge regression

How to do it…

8. Supervised Learning with MLlib – Classification

Introduction

Doing classification using logistic regression

Getting ready

How to do it…

Doing binary classification using SVM

How to do it…

Doing classification using decision trees

Getting ready

How to do it…

How it works…

Doing classification using Random Forests

Getting ready

How to do it…

How it works…

Doing classification using Gradient Boosted Trees

Getting ready

How to do it…

Doing classification with Naïve Bayes

Getting ready

How to do it…

9. Unsupervised Learning with MLlib

Introduction

Clustering using k-means

Getting ready

How to do it…

Dimensionality reduction with principal component analysis

Getting ready

How to do it…

Dimensionality reduction with singular value decomposition

Getting ready

How to do it…

10. Recommender Systems

Introduction

Collaborative filtering using explicit feedback

Getting ready

How to do it…

Collaborative filtering using implicit feedback

Getting ready

How to do it…

How it works…

There's more…

11. Graph Processing Using GraphX

Introduction

Fundamental operations on graphs

Getting ready

How to do it…

Using PageRank

Getting ready

How to do it…

Finding connected components

Getting ready

How to do it…

Performing neighborhood aggregation

Getting ready

How to do it…

12. Optimizations and Performance Tuning

Introduction

Optimizing memory

Using compression to improve performance

Using serialization to improve performance

How to do it…

Optimizing garbage collection

How to do it…

Optimizing the level of parallelism

How to do it…

Understanding the future of optimization – project Tungsten

Manual memory management by leveraging application semantics

Using algorithms and data structures

Code generation

Index

Spark Cookbook

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2015

Production reference: 1160715

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78398-706-1

www.packtpub.com

Cover image by: InfoObjects design team

Credits

Author

Rishi Yadav

Reviewers

Thomas W. Dinsmore

Cheng Lian

Amir Sedighi

Commissioning Editor

Kunal Parikh

Acquisition Editors

Shaon Basu

Neha Nagwekar

Content Development Editor

Ritika Singh

Technical Editor

Ankita Thakur

Copy Editors

Ameesha Smith-Green

Swati Priya

Project Coordinator

Milton Dsouza

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Sheetal Aute

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

About the Author

Rishi Yadav has 17 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He finished his bachelor's degree at the prestigious Indian Institute of Technology (IIT) Delhi in 1998.

About 10 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data.

InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest-growing companies for 4 years in a row. InfoObjects was also named the #1 best place to work in the Bay Area in 2014 and 2015.

Rishi is an open source contributor and active blogger.

My special thanks go to my better half, Anjali, for putting up with the long, arduous hours that were added to my already swamped schedule; our 8-year-old son, Vedant, who tracked my progress on a daily basis; InfoObjects' CTO and my business partner, Sudhir Jangir, for leading the big data effort in the company; Helma Zargarian, Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and our internal review team, especially Arivoli Tirouvingadame, Lalit Shravage, and Sanjay Shroff, for helping with the review. I could not have written this book without your support. I would also like to thank Marcel Izumi for putting together amazing graphics.

About the Reviewers

Thomas W. Dinsmore is an independent consultant, offering product advisory services to analytic software vendors. He brings to this role 30 years of experience delivering analytics solutions to enterprises around the world. He combines hands-on analytics experience with the ability to lead analytic projects and interpret results.

Thomas's previous experience includes roles with SAS, IBM, The Boston Consulting Group, PricewaterhouseCoopers, and Oliver Wyman.

Thomas coauthored Modern Analytics Methodologies and Advanced Analytics Methodologies, published in 2014 by Pearson FT Press, and is under contract for a forthcoming book on business analytics from Apress. He publishes The Big Analytics Blog at www.thomaswdinsmore.com.

I would like to thank the entire editorial and production team at Packt Publishing, who work tirelessly to bring out quality books to the public.

Cheng Lian is a Chinese software engineer and Apache Spark committer from Databricks. His major technical interests include big data analytics, distributed systems, and functional programming languages.

Cheng is also the translator of the Chinese edition of Erlang and OTP in Action and Concurrent Programming in Erlang (Part I).

I would like to thank Yi Tian from AsiaInfo for helping me review some parts of Chapter 6, Getting Started with Machine Learning Using MLlib.

Amir Sedighi is an experienced software engineer, a keen learner, and a creative problem solver. His experience spans a wide range of software development areas, including cross-platform development, big data processing and data streaming, information retrieval, and machine learning. He is a big data lecturer and expert, working in Iran. He holds bachelor's and master's degrees in software engineering. Amir is currently the CEO of Rayanesh Dadegan Ekbatan, the company he cofounded in 2013 after several years of designing and implementing distributed big data and data streaming solutions for private sector companies.

I would like to thank the entire team at Packt Publishing, who work hard to bring awesomeness to the books and the readers' professional life.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

The success of Hadoop as a big data platform raised user expectations, both for solving different analytics challenges and for reducing latency. Various tools evolved over time, but when Apache Spark arrived, it provided a single runtime to address all these challenges. It eliminated the need to combine multiple tools, each with its own challenges and learning curve. By using memory for persistent storage in addition to compute, Apache Spark eliminates the need to store intermediate data on disk and increases processing speed by up to 100 times. It also provides a single runtime that addresses various analytics needs, such as machine learning and real-time streaming, using various libraries.

This book covers the installation and configuration of Apache Spark and building solutions using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX libraries.

Note

For more information on this book's recipes, please visit infoobjects.com/spark-cookbook.

What this book covers

Chapter 1, Getting Started with Apache Spark, explains how to install Spark on various environments and cluster managers.

Chapter 2, Developing Applications with Spark, talks about developing Spark applications in different IDEs and with different build tools.

Chapter 3, External Data Sources, covers how to read from and write to various data sources.

Chapter 4, Spark SQL, takes you through the Spark SQL module, which lets you access Spark functionality through a SQL interface.

Chapter 5, Spark Streaming, explores the Spark
