
Cody's Data Cleaning Techniques Using SAS, Third Edition
Ebook, 463 pages, 3 hours


About this ebook

Find errors and clean up data easily using SAS!

Thoroughly updated, Cody's Data Cleaning Techniques Using SAS, Third Edition, addresses a task that nearly every data analyst needs to perform: making sure that data errors are located and corrected. Written in Ron Cody's signature informal, tutorial style, this book develops and demonstrates data cleaning programs and macros that you can use as written, or modify, to make your job of data cleaning easier, faster, and more efficient.

Building on the author's experience from teaching a data cleaning course for over 10 years, as well as on advances in SAS, this third edition includes four new chapters, covering topics such as using Perl regular expressions to check the format of character values (such as ZIP codes or email addresses) and standardizing company names and addresses.

With this book, you will learn how to:

  • find and correct errors in character and numeric values
  • develop programming techniques related to dates and missing values
  • deal with highly skewed data
  • develop techniques for correcting your data errors
  • use integrity constraints and audit trails to prevent errors from being added to a clean data set
Language: English
Publisher: SAS Institute
Release date: Mar 15, 2017
ISBN: 9781635260670
Author

Ron Cody

Ron Cody, EdD, is a retired professor from the Rutgers Robert Wood Johnson Medical School who now works as a private consultant and national instructor for SAS. A SAS user since 1977, he is known for extensive knowledge and an innovative style that have made him a popular presenter at local, regional, and national SAS conferences. He has authored or co-authored numerous books, including Learning SAS by Example: A Programmer's Guide, Second Edition; A Gentle Introduction to Statistics Using SAS Studio; Cody's Data Cleaning Techniques Using SAS, Third Edition; SAS Functions by Example, Second Edition; and several other books on SAS programming and statistical analysis. During his career at Rutgers Medical School, he authored or co-authored over 100 articles in scientific journals.

Book preview

Cody's Data Cleaning Techniques Using SAS, Third Edition - Ron Cody

Introduction

What is data cleaning?  Also referred to as data cleansing or data scrubbing, data cleaning has several definitions, depending on the type of data and the type of analysis that you are going to pursue. The basic idea is to analyze your data to look for possible errors. Data errors can be conveniently divided into two broad categories, depending on whether the data values are character or numeric.

As an example of a character error, consider a variable called Gender that has been defined as either an 'M' or an 'F'. A value of 'X' would clearly be a data error. What about a missing value for Gender? This could be considered a data error in a study where the gender of every subject needs to be known. What about an observation where a value of 'F' is coded, but the subject is a male? This is also a data error, but one that might be very difficult to detect. Other types of character data errors include misspelled names and other typographical errors. Another task involving character data is ensuring consistency across multiple observations or multiple data sets. For example, a company name might be coded as 'IBM' in one location and 'International Business Machines' in another location.
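To make this concrete, here is a minimal sketch of one way to standardize such values. The data set name Companies and the variables Company and Company_Std are assumptions for illustration; they are not examples from the book:

data Companies_Std;
   set Companies;
   length Company_Std $ 30;
   *Map known variants of a name to one canonical value;
   select (upcase(strip(Company)));
      when ('IBM', 'INTERNATIONAL BUSINESS MACHINES')
         Company_Std = 'IBM';
      otherwise Company_Std = Company;
   end;
run;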

Numeric data errors are generally more straightforward. For some variables, such as pulse rate or blood pressure, data errors can often be identified by inspecting data values outside of a predetermined range. For example, a resting heart rate over 100 would be worthy of further investigation. For other types of numeric data, such as bank deposits or withdrawals, you could use automatic methods that look for what are called outliers—data values that are not consistent with other values.
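For instance, a range check of this kind can be run with PROC PRINT and a WHERE statement, a technique covered later in this book. The following is a minimal sketch that anticipates the Patients data set created in Chapter 1:

title 'Resting Heart Rates Greater Than 100';
proc print data=Clean.Patients;
   id Patno;
   var HR;
   where HR gt 100;
run;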

The methods you use to search for bad data depend on several factors. If you are conducting a clinical trial for a drug application to the FDA, you need to ensure that all the values in your data set accurately reflect the values that were collected in the study. If a possible error is found, you need to go back to the original data source and attempt to find the correct value. If you are building a data warehouse for business applications and you find a data error, you might decide to throw out the offending data value.

SAS is especially well suited for data cleaning tasks. As a SAS programmer, you have extremely powerful character functions available in the DATA step. Many of the SAS procedures (PROCs) can be used as well.

As you can see in the Table of Contents, this book covers detecting character and numeric data errors. You will find chapters devoted to standardization of data and other chapters devoted to using Perl regular expressions to look for data values that do not conform to a specific pattern, such as US ZIP codes or phone numbers.
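As a taste of the pattern-checking approach, here is a minimal sketch that uses the PRXMATCH function to flag values that do not look like US ZIP codes (five digits, optionally followed by a hyphen and four digits). The variable name Zip and the sample values are assumptions for illustration:

data _null_;
   input Zip $10.;
   *PRXMATCH returns the position of the match, or 0 if there is none;
   if prxmatch('/^\d{5}(-\d{4})?$/', strip(Zip)) = 0 then
      put 'Invalid ZIP code: ' Zip;
datalines;
08822
0882
08822-1234
;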

One of the later chapters discusses how to correct data errors and another chapter discusses how to create integrity constraints on a cleaned data set. Integrity constraints can automatically prevent data values that violate a predetermined definition from being added to the original clean data set. Along with integrity constraints, you will see how to create an audit trail data set that reports on data violations in the data you are attempting to add to the original clean data set.
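To preview what that looks like in practice, here is a minimal sketch that uses PROC DATASETS to add a check constraint and initiate an audit trail on the Patients data set created in Chapter 1. The constraint name Valid_Gender and the message text are assumptions for illustration:

proc datasets library=Clean nolist;
   modify Patients;
      ic create Valid_Gender = check(where=(Gender in ('F','M')))
         message='Gender must be F or M';
   run;
   audit Patients;
      initiate;
   run;
quit;

Once the audit trail is initiated, observations that violate the constraint are rejected, and information about them is recorded in the audit trail data set.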

Chapter 1: Working with Character Data

Introduction

Using PROC FREQ to Detect Character Variable Errors

Changing the Case of All Character Variables in a Data Set

A Summary of Some Character Functions (Useful for Data Cleaning)

UPCASE, LOWCASE, and PROPCASE

NOTDIGIT, NOTALPHA, and NOTALNUM

VERIFY

COMPBL

COMPRESS

MISSING

TRIMN and STRIP

Checking that a Character Value Conforms to a Pattern

Using a DATA Step to Detect Character Data Errors

Using PROC PRINT with a WHERE Statement to Identify Data Errors

Using Formats to Check for Invalid Values

Creating Permanent Formats

Removing Units from a Value

Removing Non-Printing Characters from a Character Value

Conclusions

Introduction

This chapter will discuss and demonstrate some basic methods for checking the validity of character data. For some character variables such as Gender or Race, where there are a limited number of valid values (for example, 'M' and 'F' for Gender), it is quite straightforward to use PROC FREQ to list frequencies for these variables and inspect the output for invalid values.

In other circumstances, you might have rules that a set of character values must adhere to. For example, a variable such as ID might be stored as character, but there is a requirement that it must only contain digits. Other constraints on character data might be that all the characters are letters or, possibly, letters or digits (called alphanumeric values). SAS has some powerful tools to test rules of this sort.
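For example, here is a minimal sketch that uses the NOTDIGIT function to flag ID values containing anything other than digits. The variable name ID and the sample values are assumptions for illustration:

data _null_;
   input ID $5.;
   *NOTDIGIT returns the position of the first character that is not
    a digit, or 0 if every character is a digit;
   if notdigit(strip(ID)) then
      put 'ID contains a non-digit character: ' ID;
datalines;
00123
12A45
;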

This book uses several data sources where errors were added intentionally. One of these files, called Patients.txt, is a text file containing fictional data on 101 patients.

The data layout for this file is shown in Table 1.1:

Table 1.1: Data Layout for the File Patients.txt

The first few lines of Patients.txt are shown in Figure 1.1:

Figure 1.1: First Few Lines of the Patients.txt File

You can use the following SAS program to read this text file and create a SAS data set:

Program 1.1: Reading the Patients.txt File

libname Clean 'c:\books\Clean3'; ➊

data Clean.Patients;
   infile 'c:\books\Clean3\Patients.txt'; ➋
   input @1  Patno      $3. ➌
         @4  Account_No $7.
         @11 Gender     $1.
         @12 Visit      mmddyy10.
         @22 HR         3.
         @25 SBP        3.
         @28 DBP        3.
         @31 Dx         $7.
         @38 AE         1.;
   label Patno      = 'Patient Number' ➍
         Account_No = 'Account Number'
         Gender     = 'Gender'
         Visit      = 'Visit Date'
         HR         = 'Heart Rate'
         SBP        = 'Systolic Blood Pressure'
         DBP        = 'Diastolic Blood Pressure'
         Dx         = 'Diagnosis Code'
         AE         = 'Adverse Event?';
   format Visit mmddyy10.; ➎
run;

proc sort data=Clean.Patients; ➏
   by Patno Visit;
run;

proc print data=Clean.Patients; ➐
   id Patno;
run;

➊   The LIBNAME statement assigns the library reference Clean to the folder where the permanent data set will be stored (c:\books\Clean3).

➋   The text file Patients.txt is stored in the folder c:\books\Clean3. You use an INFILE statement to instruct the program to read data from the Patients.txt file in this folder.

➌   This type of input is called formatted input: you use a column pointer (@n) to indicate the starting column for each variable and an informat to tell the program how to read the value. This INPUT statement reads character data with the $n. informat, numeric data with the n. informat, and the visit date with the mmddyy10. informat.

➍   You use a LABEL statement to associate labels with variable names. These labels will appear when you run certain procedures (such as PROC MEANS) or in other procedures such as PROC PRINT (if you use a LABEL option). You do not have to use labels, but it makes some of the output easier to read.

➎   You use a FORMAT statement so that the visit date is in the month-day-year format (otherwise, it will appear as the number of days from January 1, 1960). You are free to use any date format you wish, such as date9. It does not have to be the same form that you used to read the date.

➏   You use PROC SORT to sort the data set by patient ID (Patno) and visit date (Visit).

➐   Use PROC PRINT to list the observations in the data set. The ID statement is an instruction to place this variable in the first column and to omit the Obs (observation number) column that PROC PRINT would include if you did not use an ID statement.

Before you look at the output from this program, take a look at the SAS Log to see if there were any errors in the program or in the data values. This is discussed in more detail in Chapter 7.

A listing of the first few lines is shown below:

Figure 1.2: Listing of the First Few Observations of Data Set Patients

Using PROC FREQ to Detect Character Variable Errors

We are going to start by checking the character variables for possible invalid values. The variable Patno (which stands for patient number) should contain only digits and should always have a value (that is, no missing values are allowed for this variable). The variable Account_No (account number) consists of a two-character state abbreviation followed by five digits. The variable Gender has valid values of 'M' and 'F' (uppercase); missing values or values in lowercase are not allowed. Finally, the variable Dx (diagnosis code) consists of three digits, followed by a period, followed by three digits.

You can check Gender and the first two digits of the account number by computing frequencies, using the following program:

Program 1.2: Computing Frequencies Using PROC FREQ

libname Clean 'c:\Books\Clean3'; ➊

*Program to Compute Frequencies for Gender and the
 First Two Digits of the Account_No (State Abbreviation);

data Check_Char;
   set Clean.Patients(keep=Patno Account_No Gender); ➋
   length State $ 2; ➌
   State = Account_No; ➍
run;

title 'Frequencies for Gender and the First Two Digits of Account_No';

proc freq data=Check_Char; ➎
   tables Gender State / nocum nopercent; ➏
run;

➊   The LIBNAME statement points to the folder c:\Books\Clean3, where the Patients data set is located. You can change this value to point to a valid location on your computer.

➋   The SET statement brings in the observations from the Patients data set. Note the use of the KEEP= data set option, which is more efficient than a KEEP statement. This is an important point: if you are not familiar with the distinction between the two, pay close attention. With a KEEP statement, all the variables from data set Patients would be brought into the PDV (program data vector, the place in memory that holds the variables and their values). With the KEEP= data set option, only the variables Patno, Account_No, and Gender are read from data set Patients (State is created in the program). For data sets with a large number of variables, using a KEEP= option is much more efficient than using a KEEP statement (a short sketch contrasting the two follows these notes).

➌   The LENGTH statement sets the length of the new variable State to two bytes.

➍   Because State has a length of two bytes, assigning the value of Account_No to State results in State holding the first two characters of Account_No. You could have used the SUBSTR (substring) function to create the variable State, but the method shown here is more efficient and is used by many SAS programmers.

➎   Use PROC FREQ to compute frequencies.

➏   The TABLES statement lists the two variables (State and Gender) for which you want to compute frequencies. You can use the two options NOCUM and NOPERCENT to suppress the output of cumulative statistics and percentages.
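Here is a short sketch contrasting the two approaches mentioned in note ➋. The output data set names are illustrative; both steps produce the same result, but the KEEP= version reads less data into the PDV:

*KEEP= data set option: only the three listed variables enter the PDV;
data Option_Version;
   set Clean.Patients(keep=Patno Account_No Gender);
run;

*KEEP statement: every variable in Patients enters the PDV,
 and the statement only limits what is written to the output data set;
data Statement_Version;
   set Clean.Patients;
   keep Patno Account_No Gender;
run;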

Output from Program 1.2 is shown next:

Figure 1.3: First Part of the Output with Values for Gender

There are several invalid values for Gender as well as three missing values.
