
Data Mining Applications with R

About this ebook

Data Mining Applications with R is a great resource for researchers and professionals to understand the wide use of R, a free software environment for statistical computing and graphics, in solving different problems in industry. R is widely used in leveraging data mining techniques across many different industries, including government, finance, insurance, medicine, scientific research and more. This book presents 15 different real-world case studies illustrating various techniques in rapidly growing areas. It is an ideal companion for data mining researchers in academia and industry looking for ways to turn this versatile software into a powerful analytic tool.

R code, data, and color figures for the book are provided at the RDataMining.com website.

  • Helps data miners learn to use R in their specific area of work and see how R can be applied in different industries
  • Presents various case studies in real-world applications, which will help readers apply the techniques in their own work
  • Provides code examples and sample data so that readers can easily learn the techniques by running the code themselves
Language: English
Release date: Nov 26, 2013
ISBN: 9780124115200
Author

Yanchang Zhao

Yanchang Zhao has been a Senior Data Mining Analyst in the Australian Government since 2009. Before joining the public sector, he was an Australian Postdoctoral Fellow (Industry) in the Faculty of Engineering & Information Technology at the University of Technology, Sydney, Australia. His research interests include clustering, association rules, time series, outlier detection, and data mining applications, and he has over forty papers published in journals and conference proceedings. He is a member of the IEEE and of the Institute of Analytics Professionals of Australia, and has served as a program committee member for more than thirty international conferences.


    Book preview

    Data Mining Applications with R - Yanchang Zhao

    1

    Power Grid Data Analysis with R and Hadoop

    Ryan Hafen, Tara Gibson, Kerstin Kleese van Dam, and Terence Critchlow, Pacific Northwest National Laboratory, Richland, Washington, USA

    Abstract

    In this chapter, we use the R and Hadoop Integrated Programming Environment (RHIPE) as a flexible, scalable environment for analyzing multiterabyte data sets being produced by a phasor measurement unit sensor network on the electrical power grid. RHIPE enables exploratory data analysis on the entire data set, allowing us to develop both data cleaning and event classification methods that reflect event characteristics as represented by the actual data instead of relying on theoretical models. We describe several of the data cleaning filters that we have developed as well as one approach we have used for event detection. To ensure the generality of this chapter, we focus on the techniques we are using for our data analysis and example code that demonstrates how these techniques are used within the RHIPE package, instead of the domain-specific details of the data or events that we are extracting.

    Keywords

    R; Hadoop; RHIPE; Large data; Data cleaning; Event detection; Power grid

    1.1 Introduction

    This chapter presents an approach to analysis of large-scale time series sensor data collected from the electric power grid. This discussion is driven by our analysis of a real-world data set and, as such, does not provide a comprehensive exposition of either the tools used or the breadth of analysis appropriate for general time series data. Instead, we hope that this section provides the reader with sufficient information, motivation, and resources to address their own analysis challenges.

    Our approach to data analysis is based on exploratory data analysis techniques. In particular, we perform an analysis over the entire data set to identify sequences of interest, use a small number of those sequences to develop an analysis algorithm that identifies the relevant pattern, and then run that algorithm over the entire data set to identify all instances of the target pattern. Our initial data set is a relatively modest 2 TB data set, comprising just over 53 billion records generated from a distributed sensor network. Each record represents several sensor measurements at a specific location at a specific time. Sensors are geographically distributed but reside in fixed, known locations. Measurements are taken 30 times per second and synchronized using a global clock, enabling a precise reconstruction of events. Because all of the sensors are recording the status of the same, tightly connected network, there should be a high correlation between all readings.

    Given the size of our data set, simply running R on a desktop machine is not an option. To provide the required scalability, we use an analysis package called RHIPE (pronounced ree-pay) (RHIPE, 2012). RHIPE, short for the R and Hadoop Integrated Programming Environment, provides an R interface to Hadoop. This interface hides much of the complexity of running parallel analyses, including many of the traditional Hadoop management tasks. Further, by providing access to all of the standard R functions, RHIPE allows the analyst to focus on the analysis itself rather than on parallel code development, even when exploring large data sets. A brief introduction to both the Hadoop programming paradigm, also known as the MapReduce paradigm, and RHIPE is provided in Section 1.3. We assume that readers already have a working knowledge of R.

    As with many sensor data sets, there are a large number of erroneous records in the data, so a significant focus of our work has been on identifying and filtering these records. Identifying bad records requires a variety of analysis techniques including summary statistics, distribution checking, autocorrelation detection, and repeated value distribution characterization, all of which are discovered or verified by exploratory data analysis. Once the data set has been cleaned, meaningful events can be extracted. For example, events that result in a network partition or isolation of part of the network are extremely interesting to power engineers.
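
    To give a concrete flavor of such filters, the following illustrative R sketch (not the production filters used in our analysis) screens a single simulated 60-Hz frequency series for stuck-at repeated values, missing autocorrelation, and out-of-range measurements; the simulated series, the column choice, and the thresholds are all hypothetical:

    # Illustrative data-quality screens on one simulated frequency series
    set.seed(1)
    x <- c(rnorm(300, mean = 60, sd = 0.02),   # normal behavior
           rep(59.98, 40),                     # a stuck sensor repeating one value
           rnorm(160, mean = 60, sd = 0.02))

    # Repeated-value check: long runs of an identical reading suggest a fault
    maxRun <- max(rle(x)$lengths)
    stuck <- maxRun > 30

    # Autocorrelation check: genuine grid measurements are strongly
    # autocorrelated, so a near-zero lag-1 autocorrelation suggests pure noise
    lag1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
    noisy <- lag1 < 0.5

    # Summary-statistics check: count measurements outside a plausible band
    outOfRange <- sum(x < 59.5 | x > 60.5)

    c(stuck = stuck, noisy = noisy, outOfRange = outOfRange)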

    The core of this chapter is the presentation of several example algorithms to manage, explore, clean, and apply basic feature extraction routines over our data set. These examples are generalized versions of the code we use in our analysis. Section 1.3.3.2.2 describes these examples in detail, complete with sample code. Our hope is that this approach will provide the reader with a greater understanding of how to proceed when unique modifications to standard algorithms are warranted, which in our experience occurs quite frequently.

    Before we dive into the analysis, however, we begin with an overview of the power grid, which is our application domain.

    1.2 A Brief Overview of the Power Grid

    The U.S. national power grid, also known as the electrical grid or simply the grid, was named the greatest engineering achievement of the twentieth century by the U.S. National Academy of Engineering (Wulf, 2000). Although many of us take for granted the flow of electricity when we flip a switch or plug in our chargers, it takes a large and complex infrastructure to reliably support our dependence on energy.

    Built over 100 years ago, at its core the grid connects power producers and consumers through a complex network of transmission and distribution lines connecting almost every building in the country. Power producers use a variety of generator technologies, from coal to natural gas to nuclear and hydro, to create electricity. There are hundreds of large and small generation facilities spread across the country. Power is transferred from the generation facility to the transmission network, which moves it to where it is needed. The transmission network is comprised of high-voltage lines that connect the generators to distribution points. The network is designed with redundancy, which allows power to flow to most locations even when there is a break in the line or a generator goes down unexpectedly. At specific distribution points, the voltage is decreased and then transferred to the consumer. The distribution networks are disconnected from each other.

    The U.S. grid has been divided into three smaller grids: the western interconnection, the eastern interconnection, and the Texas interconnection. Although connections between these regions exist, there is limited ability to transfer power between them and thus each operates essentially as an independent power grid. It is interesting to note that the regions covered by these interconnections include parts of Canada and Mexico, highlighting our international interdependency on reliable power. In order to be manageable, a single interconnect may be further broken down into regions which are much more tightly coupled than the major interconnects, but are operated independently.

    Within each interconnect, there are several key roles that are required to ensure the smooth operation of the grid. In many cases, a single company will fill multiple roles—typically with policies in place to avoid a conflict of interest. The goal of the power producers is to produce power as cheaply as possible and sell it for as much as possible. Their responsibilities include maintaining the generation equipment and adjusting their generation based on guidance from a balancing authority. The balancing authority is an independent agent responsible for ensuring the transmission network has sufficient power to meet demand, but not a significant excess. They will request power generators to adjust production on the basis of the real-time status of the entire network, taking into account not only demand, but factors such as transmission capacity on specific lines. They will also dynamically reconfigure the network, opening and closing switches, in response to these factors. Finally, the utility companies manage the distribution system, making sure that power is available to consumers. Within its distribution network, a utility may also dynamically reconfigure power flows in response to both planned and unplanned events. In addition to these primary roles, there are a variety of additional roles a company may play—for example, a company may lease the physical transmission or distribution lines to another company which uses those to move power within its network. Significant communication between roles is required in order to ensure the stability of the grid, even in normal operating circumstances. In unusual circumstances, such as a major storm, communication becomes critical to responding to infrastructure damage in an effective and efficient manner.

    Despite being over 100 years old, the grid remains remarkably stable and reliable. Unfortunately, new demands on the system are beginning to affect it. In particular, energy demand continues to grow within the United States—even in the face of declining usage per person (DOE, 2012). New power generators continue to come online to address this need, with new capacity increasingly either being powered by natural gas generators (projected to be 60% of new capacity by 2035) or based on renewable energy (29% of new capacity by 2035) such as solar or wind power (DOE, 2012). Although there are many advantages to the development of renewable energy sources, they provide unique challenges to grid stability due to their unpredictability. Because electricity cannot be easily stored, and renewables do not provide a consistent supply of power, ensuring there is sufficient power in the system to meet demand without significant overprovisioning (i.e., wasting energy) is a major challenge facing grid operators. Further complicating the situation is the distribution of the renewable generators. Although some renewable sources, such as wind farms, share many properties with traditional generation capabilities—in particular, they generate significant amounts of power and are connected to the transmission system—consumer-based systems, such as solar panels on a business, are connected to the distribution network, not the transmission network. Although this distributed generation system can be extremely helpful at times, it is very different from the current model and introduces significant management complexity (e.g., it is not currently possible for a transmission operator to control when or how much power is being generated from solar panels on a house).

    To address these needs, power companies are looking toward a number of technology solutions. One potential solution being considered is transitioning to real-time pricing of power. Today, the price of power is fixed for most customers—a watt used in the middle of the afternoon costs the same as a watt used in the middle of the night. However, the demand for power varies dramatically during the course of a day, with peak demand typically being during standard business hours. Under this scenario, the price for electricity would vary every few minutes depending on real-time demand. In theory, this would provide an incentive to minimize use during peak periods and transfer that utilization to other times. Because the grid infrastructure is designed to meet its peak load demands, excess capacity is available off-hours. By redistributing demand, the overall amount of energy that could be delivered with the same infrastructure is increased. For this scenario to work, however, consumers must be willing to adjust their power utilization habits. In some cases, this can be done by making appliances cost aware and having consumers define how they want to respond to differences in price. For example, currently water heaters turn on and off solely on the basis of the water temperature in the tank—as soon as the temperature dips below a target temperature, the heater goes on. This happens without considering the time of day or water usage patterns by the consumer, which might indicate if the consumer even needs the water in the next few hours. A price-aware appliance could track usage patterns and delay heating the water until either the price of electricity fell below a certain limit or the water was expected to be needed soon. Similarly, an air conditioner might delay starting for 5 or 10 min to avoid using energy during a time of peak demand/high cost without the consumer even noticing.

    Interestingly, the increasing popularity of plug-in electric cars provides both a challenge and a potential solution to the grid stability problems introduced by renewables. If the vehicles remain price insensitive, there is the potential for them to cause sudden, unexpected jumps in demand if a large number of them begin charging at the same time. For example, one car model comes from the factory preset to begin charging at midnight local time, with the expectation that this is a low-demand time. However, if there are hundreds or thousands of cars within a small area, all recharging at the same time, the sudden surge in demand becomes significant. If the cars are price aware, however, they can charge whenever demand is lowest, as long as they are fully charged when their owner is ready to go. This would spread out the charging over the entire night, smoothing out the demand. In addition, a price-aware car could sell power to the grid at times of peak demand by partially draining its battery. This would benefit the owner through a buy low, sell high strategy and would mitigate the effect of short-term spikes in demand. This strategy could help stabilize the fluctuations caused by renewables by, essentially, providing a large-scale power storage capability.

    In addition to making devices cost aware, the grid itself needs to undergo a significant change in order to support real-time pricing. In particular, the distribution system needs to be extended to support real-time recording of power consumption. Current power meters record overall consumption, but not when the consumption occurred. To enable this, many utilities are in the process of converting their customers to smart meters. These new meters are capable of sending the utility real-time consumption information and have other advantages, such as dramatically reducing the time required to read the meters, which have encouraged their adoption. On the transmission side, existing sensors provide operators with the status of the grid every 4 seconds. This is not expected to be sufficient given increasing variability, and thus new sensors called phasor measurement units (PMUs) are being deployed. PMUs provide information 30-60 times per second. The sensors are time synchronized to a global clock so that the state of the grid at a specific time can be accurately reconstructed. Currently, only a few hundred PMUs are deployed; however, the NASPI project anticipates having over 1000 PMUs online by the end of 2014 (Silverstein, 2012) with a final goal of between 10,000 and 50,000 sensors deployed over the next 20 years.
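
    To put these rates in perspective, 1000 PMUs each reporting 30 times per second generate 1000 × 30 × 86,400, or roughly 2.6 billion, measurement records per day, and a deployment of 10,000-50,000 sensors would multiply that figure by 10-50.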

    An important side effect of the new sensors on the transmission and distribution networks is a significant increase in the amount of information that power companies need to collect and process. Currently, companies are using the real-time streams to identify some critical events, but are not effectively analyzing the resulting data set. The reasons for this are twofold. First, the algorithms that have been developed in the past are not scaling to these new data sets. Second, exactly what new insights can be gleaned from this more refined data is not clear. Developing scalable algorithms for known events is clearly a first step. However, additional investigation into the data set using techniques such as exploratory analysis is required to fully utilize this new source of information.

    1.3 Introduction to MapReduce, Hadoop, and RHIPE

    Before presenting the power grid data analysis, we first provide an overview of MapReduce and associated topics including Hadoop and RHIPE. We present and discuss the implementation of a simple MapReduce example using RHIPE. Finally, we discuss other parallel R approaches for dealing with large-scale data analysis. The goal is to provide enough background for the reader to be comfortable with the examples provided in the following section.

    The example we provide in this section is a simple implementation of a MapReduce operation using RHIPE on the iris data (Fisher, 1936) included with R. The goal is to solidify our description of MapReduce through a concrete example, introduce basic RHIPE commands, and prepare the reader to follow the code examples on our power grid work presented in the following section. In the interest of space, our explanations focus on the various aspects of RHIPE, and not on R itself. A reasonable skill level of R programming is assumed.

    A lengthy exposition on all of the facets of RHIPE is not provided. For more details, including information about installation, job monitoring, configuration, debugging, and some advanced options, we refer the reader to RHIPE (2012) and White (2010).

    1.3.1 MapReduce

    MapReduce is a simple but powerful programming model for breaking a task into pieces and operating on those pieces in an embarrassingly parallel manner across a cluster. The approach was popularized by Google (Dean and Ghemawat, 2008) and is in wide use by companies processing massive amounts of data.

    MapReduce algorithms operate on data structures represented as key/value pairs. The data are split into blocks; each block is represented as a key and value. Typically, the key is a descriptive data structure of the data in the block, whereas the value is the actual data for the block. MapReduce methods perform independent parallel operations on input key/value pairs, and their output is also key/value pairs. The MapReduce model comprises two phases, the map and the reduce, which work as follows (a small pure-R emulation follows the list of steps):

    Map: A map function is applied to each input key/value pair, which does some user-defined processing and emits new key/value pairs to intermediate storage to be processed by the reduce.

    Shuffle/Sort: The map output values are collected for each unique map output key and passed to a reduce function.

    Reduce: A reduce function is applied in parallel to all values corresponding to each unique map output key and emits output key/value pairs.
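
    The following tiny pure-R emulation of these three phases (no Hadoop involved) uses the word-count task commonly used to introduce MapReduce to make the model concrete; the input documents here are arbitrary:

    # Input key/value pairs: (document id, line of text)
    input <- list(list(1, "the quick brown fox"),
                  list(2, "the lazy dog"),
                  list(3, "the quick dog"))

    # Map: for each input pair, emit an intermediate (word, 1) pair
    mapped <- do.call(c, lapply(input, function(kv) {
      lapply(strsplit(kv[[2]], " ")[[1]], function(w) list(w, 1))
    }))

    # Shuffle/Sort: collect the intermediate values by unique key
    grouped <- split(sapply(mapped, `[[`, 2), sapply(mapped, `[[`, 1))

    # Reduce: apply a summary (here, a sum) to the values for each key
    counts <- lapply(grouped, sum)
    counts$the   # 3: "the" appears three times across the input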

    1.3.1.1 An Example: The Iris Data

    The iris data are very small and methods can be applied to them in memory, within R, without splitting them into pieces and applying MapReduce algorithms. They are an accessible introductory example nonetheless, as it is easy to compare computations done with MapReduce against those done with the traditional in-memory approach. It is the MapReduce principles, not the size of the data, that are important: once an algorithm has been expressed in MapReduce terms, it theoretically can be applied unchanged to much larger data.

    The iris data are a data frame of 150 measurements of iris petal and sepal lengths and widths, with 50 measurements for each species of setosa, versicolor, and virginica. Let us assume that we are doing some computation on the sepal length. To simulate the notion of data being partitioned and distributed, consider the data being randomly split into 3 blocks. We can achieve this in R with the following:

    > set.seed(4321)  # make sure that we always get the same partition
    > permute <- sample(1:150, 150)
    > splits <- rep(1:3, 50)
    > irisSplit <- tapply(permute, splits, function(x) {
        iris[x, c("Sepal.Length", "Species")]
      })
    > irisSplit
    $`1`
       Sepal.Length    Species
    51          7.0 versicolor
    7           4.6     setosa
    … # output truncated

    Throughout this chapter, code blocks that also display output will distinguish lines containing code with > at the beginning of the line. Code blocks that do not display output do not add this distinction.

    This partitions the Sepal.Length and Species variables into three random subsets, having the keys 1, 2, or 3, which correspond to our blocks. Consider a calculation of the maximum sepal length by species with irisSplit as the set of input key/value pairs. This can be achieved with MapReduce by the following steps:

    Map: Apply a function to each division of the data which, for each species, computes the maximum sepal length and outputs key=species and value=max sepal length.

    Shuffle/Sort: Gather all map outputs with key setosa and send to one reduce, then all with key versicolor to another reduce, etc.

    Reduce: Apply a function to all values corresponding to each unique map output key (species) which gathers and calculates the maximum of the values.

    It can be helpful to view this process visually, as is shown in Figure 1.1. The input data are the irisSplit set of key/value pairs. As described in the steps above, applying the map to each input key/value pair emits a key/value pair of the maximum sepal length per species. These are gathered by key (species) and the reduce step is applied which calculates a maximum of maximums, finally outputting a maximum sepal length per species. We will revisit this Figure with a more detailed explanation of the calculation in Section 1.3.3.2.

    Figure 1.1 An illustration of applying a MapReduce job to calculate the maximum Sepal.Length by Species for the irisSplit data.
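
    As a preview of the RHIPE syntax introduced in Section 1.3.3, the calculation in Figure 1.1 might be expressed along the following lines. This is a condensed sketch based on RHIPE's documented interface, with hypothetical HDFS paths; exact arguments can vary across RHIPE versions, and the detailed treatment appears in Section 1.3.3.2:

    # Write the irisSplit blocks to HDFS as key/value pairs (paths hypothetical)
    kvPairs <- lapply(names(irisSplit), function(k) list(k, irisSplit[[k]]))
    rhwrite(kvPairs, "/tmp/irisSplit")

    # Map: emit (species, block maximum of Sepal.Length) for each block
    maxMap <- expression({
      for (d in map.values) {
        blockMax <- tapply(d$Sepal.Length, d$Species, max)
        blockMax <- blockMax[!is.na(blockMax)]  # skip species absent from a block
        for (sp in names(blockMax)) rhcollect(sp, blockMax[[sp]])
      }
    })

    # Reduce: take the maximum of the per-block maximums for each species
    maxReduce <- expression(
      pre = { mx <- -Inf },
      reduce = { mx <- max(mx, unlist(reduce.values)) },
      post = { rhcollect(reduce.key, mx) }
    )

    maxSepal <- rhwatch(map = maxMap, reduce = maxReduce,
      input = rhfmt("/tmp/irisSplit", type = "sequence"),
      output = rhfmt("/tmp/irisMax", type = "sequence"),
      readback = TRUE)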

    1.3.2 Hadoop

    Hadoop is an open-source distributed software system for writing MapReduce applications capable of processing vast amounts of data, in parallel, on large clusters of commodity hardware, in a fault-tolerant manner. It consists of the Hadoop Distributed File System (HDFS) and the MapReduce parallel compute engine. Hadoop was inspired by papers written about Google’s MapReduce and Google File System (Dean and Ghemawat, 2008).

    Hadoop handles data by distributing key/value pairs into the HDFS. Hadoop schedules and executes the computations on the key/value pairs in parallel, attempting to minimize data movement. Hadoop handles load balancing and automatically restarts jobs when a fault is encountered.

    Hadoop has changed the way many organizations work with their data, bringing cluster computing to people with little knowledge of the complexities of distributed programming. Once an algorithm has been written the MapReduce way, Hadoop provides concurrency, scalability, and reliability for free.

    1.3.3 RHIPE: R with Hadoop

    RHIPE is a merger of R and Hadoop. It enables an analyst of large data sets to apply numeric and visualization methods in R. Integration of R and Hadoop is accomplished by a set of components written in R and Java. The components handle the passing of information between R and Hadoop, making the internals of Hadoop transparent to the user. However, the user must be aware of MapReduce as well as parameters for tuning the performance of a Hadoop job.

    One of the main advantages of using R with Hadoop is that it allows rapid prototyping of methods and algorithms. Although R is not as fast as pure Java, it was designed as a programming environment for working with data and has a wealth of statistical methods and tools for data analysis and manipulation.

    1.3.3.1 Installation

    RHIPE depends on Hadoop, which can be tricky to install and set up. Excellent references for setting up Hadoop can be found on the web. The site www.rhipe.org provides installation instructions for RHIPE as well as a virtual machine with a local single-node Hadoop cluster. A single-node setup is good for experimenting with RHIPE syntax and for prototyping jobs before attempting to run at scale. Whereas we used a large institutional cluster to perform analyses on our real data, all examples in this chapter were run on a single-node virtual machine. We are using RHIPE version 0.72. Although RHIPE is a mature project, software can change over time; we advise the reader to check www.rhipe.org for notes of any changes since version 0.72.
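
    As a quick check that an installation is working, initializing the package and listing the HDFS root should succeed; this is a minimal sketch, and the RHIPE documentation covers configuration in detail:

    library(Rhipe)   # load the RHIPE package
    rhinit()         # initialize RHIPE's connection to Hadoop
    rhls("/")        # list the contents of the HDFS root directory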
