
R: Data Analysis and Visualization
Ebook · 3,564 pages · 20 hours


About this ebook

This course is for data scientists and quantitative analysts who want to learn R and take advantage of its powerful analytical framework. It's a seamless journey toward becoming a full-stack R developer.
Language: English
Release date: Jun 24, 2016
ISBN: 9781786460486
Author

Brett Lantz

Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data.



    Book preview

    R - Brett Lantz

    Table of Contents

    R: Data Analysis and Visualization

    Meet Your Course Guide

    Course Structure

    Course journey

    The Course Roadmap and Timeline

    I. Module 1: Data Analysis with R

    1. RefresheR

    Navigating the basics

    Arithmetic and assignment

    Logicals and characters

    Flow of control

    Getting help in R

    Vectors

    Subsetting

    Vectorized functions

    Advanced subsetting

    Recycling

    Functions

    Matrices

    Loading data into R

    Working with packages

    2. The Shape of Data

    Univariate data

    Frequency distributions

    Central tendency

    Spread

    Populations, samples, and estimation

    Probability distributions

    Visualization methods

    3. Describing Relationships

    Multivariate data

    Relationships between a categorical and a continuous variable

    Relationships between two categorical variables

    The relationship between two continuous variables

    Covariance

    Correlation coefficients

    Comparing multiple correlations

    Visualization methods

    Categorical and continuous variables

    Two categorical variables

    Two continuous variables

    More than two continuous variables

    4. Probability

    Basic probability

    A tale of two interpretations

    Sampling from distributions

    Parameters

    The binomial distribution

    The normal distribution

    The three-sigma rule and using z-tables

    5. Using Data to Reason About the World

    Estimating means

    The sampling distribution

    Interval estimation

    How did we get 1.96?

    Smaller samples

    6. Testing Hypotheses

    Null Hypothesis Significance Testing

    One and two-tailed tests

    When things go wrong

    A warning about significance

    A warning about p-values

    Testing the mean of one sample

    Assumptions of the one sample t-test

    Testing two means

    Don't be fooled!

    Assumptions of the independent samples t-test

    Testing more than two means

    Assumptions of ANOVA

    Testing independence of proportions

    What if my assumptions are unfounded?

    7. Bayesian Methods

    The big idea behind Bayesian analysis

    Choosing a prior

    Who cares about coin flips

    Enter MCMC – stage left

    Using JAGS and runjags

    Fitting distributions the Bayesian way

    The Bayesian independent samples t-test

    8. Predicting Continuous Variables

    Linear models

    Simple linear regression

    Simple linear regression with a binary predictor

    A word of warning

    Multiple regression

    Regression with a non-binary predictor

    Kitchen sink regression

    The bias-variance trade-off

    Cross-validation

    Striking a balance

    Linear regression diagnostics

    Second Anscombe relationship

    Third Anscombe relationship

    Fourth Anscombe relationship

    Advanced topics

    9. Predicting Categorical Variables

    k-Nearest Neighbors

    Using k-NN in R

    Confusion matrices

    Limitations of k-NN

    Logistic regression

    Using logistic regression in R

    Decision trees

    Random forests

    Choosing a classifier

    The vertical decision boundary

    The diagonal decision boundary

    The crescent decision boundary

    The circular decision boundary

    10. Sources of Data

    Relational Databases

    Why didn't we just do that in SQL?

    Using JSON

    XML

    Other data formats

    Online repositories

    11. Dealing with Messy Data

    Analysis with missing data

    Visualizing missing data

    Types of missing data

    So which one is it?

    Unsophisticated methods for dealing with missing data

    Complete case analysis

    Pairwise deletion

    Mean substitution

    Hot deck imputation

    Regression imputation

    Stochastic regression imputation

    Multiple imputation

    So how does mice come up with the imputed values?

    Methods of imputation

    Multiple imputation in practice

    Analysis with unsanitized data

    Checking for out-of-bounds data

    Checking the data type of a column

    Checking for unexpected categories

    Checking for outliers, entry errors, or unlikely data points

    Chaining assertions

    Other messiness

    OpenRefine

    Regular expressions

    tidyr

    12. Dealing with Large Data

    Wait to optimize

    Using a bigger and faster machine

    Be smart about your code

    Allocation of memory

    Vectorization

    Using optimized packages

    Using another R implementation

    Use parallelization

    Getting started with parallel R

    An example of (some) substance

    Using Rcpp

    Be smarter about your code

    13. Reproducibility and Best Practices

    R Scripting

    RStudio

    Running R scripts

    An example script

    Scripting and reproducibility

    R projects

    Version control

    Communicating results

    II. Module 2: R Graphs

    1. R Graphics

    Base graphics using the default package

    Trellis graphs using lattice

    Graphs inspired by Grammar of Graphics

    2. Basic Graph Functions

    Introduction

    Creating basic scatter plots

    Getting ready

    How to do it...

    How it works...

    There's more...

    A note on R's built-in datasets

    See also

    Creating line graphs

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Creating bar charts

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Creating histograms and density plots

    How to do it...

    How it works...

    There's more...

    See also

    Creating box plots

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Adjusting x and y axes' limits

    How to do it...

    How it works...

    There's more...

    See also

    Creating heat maps

    How to do it...

    How it works...

    There's more...

    See also

    Creating pairs plots

    How to do it...

    How it works...

    There's more...

    See also

    Creating multiple plot matrix layouts

    How to do it...

    How it works...

    There's more...

    See also

    Adding and formatting legends

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Creating graphs with maps

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Saving and exporting graphs

    How to do it...

    How it works...

    There's more...

    See also

    3. Beyond the Basics – Adjusting Key Parameters

    Introduction

    Setting colors of points, lines, and bars

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Setting plot background colors

    Getting ready

    How to do it...

    How it works...

    There's more...

    Setting colors for text elements – axis annotations, labels, plot titles, and legends

    Getting ready

    How to do it...

    How it works...

    There's more...

    Choosing color combinations and palettes

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Setting fonts for annotations and titles

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Choosing plotting point symbol styles and sizes

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Choosing line styles and width

    Getting ready

    How to do it...

    How it works...

    See also

    Choosing box styles

    Getting ready

    How to do it...

    How it works...

    There's more...

    Adjusting axis annotations and tick marks

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Formatting log axes

    Getting ready

    How to do it...

    How it works...

    There's more...

    Setting graph margins and dimensions

    Getting ready

    How to do it...

    How it works...

    See also

    4. Creating Scatter Plots

    Introduction

    Grouping data points within a scatter plot

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Highlighting grouped data points by size and symbol type

    Getting ready

    How to do it...

    How it works...

    Labeling data points

    Getting ready

    How to do it...

    How it works...

    There's more...

    Correlation matrix using pairs plots

    Getting ready

    How to do it...

    How it works...

    Adding error bars

    Getting ready

    How to do it...

    How it works...

    There's more...

    Using jitter to distinguish closely packed data points

    Getting ready

    How to do it...

    How it works...

    Adding linear model lines

    Getting ready

    How to do it...

    How it works...

    Adding nonlinear model curves

    Getting ready

    How to do it...

    How it works...

    Adding nonparametric model curves with lowess

    Getting ready

    How to do it...

    How it works...

    Creating three-dimensional scatter plots

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating Quantile-Quantile plots

    Getting ready

    How to do it...

    How it works...

    There's more...

    Displaying the data density on axes

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating scatter plots with a smoothed density representation

    Getting ready

    How to do it...

    How it works...

    There's more...

    5. Creating Line Graphs and Time Series Charts

    Introduction

    Adding customized legends for multiple-line graphs

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Using margin labels instead of legends for multiple-line graphs

    Getting ready

    How to do it...

    How it works...

    There's more...

    Adding horizontal and vertical grid lines

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Adding marker lines at specific x and y values using abline

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating sparklines

    Getting ready

    How to do it...

    How it works...

    Plotting functions of a variable in a dataset

    Getting ready

    How to do it...

    How it works...

    There's more...

    Formatting time series data for plotting

    Getting ready

    How to do it...

    How it works...

    There's more...

    Plotting the date or time variable on the x axis

    Getting ready

    How to do it...

    How it works...

    There's more...

    Annotating axis labels in different human-readable time formats

    Getting ready

    How to do it...

    How it works...

    There's more...

    Adding vertical markers to indicate specific time events

    Getting ready

    How to do it...

    How it works...

    There's more...

    Plotting data with varying time-averaging periods

    Getting ready

    How to do it...

    How it works...

    Creating stock charts

    Getting ready

    How to do it...

    How it works...

    There's more...

    6. Creating Bar, Dot, and Pie Charts

    Introduction

    Creating bar charts with more than one factor variable

    Getting ready

    How to do it...

    How it works...

    See also

    Creating stacked bar charts

    Getting ready

    How to do it...

    How it works...

    There's more...

    Adjusting the orientation of bars – horizontal and vertical

    Getting ready

    How to do it...

    How it works...

    There's more...

    Adjusting bar widths, spacing, colors, and borders

    Getting ready

    How to do it...

    How it works...

    There's more...

    Displaying values on top of or next to the bars

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Placing labels inside bars

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating bar charts with vertical error bars

    Getting ready

    How to do it...

    How it works...

    There's more...

    Modifying dot charts by grouping variables

    Getting ready

    How to do it...

    How it works...

    Making better, readable pie charts with clockwise-ordered slices

    Getting ready

    How to do it...

    How it works...

    See also

    Labeling a pie chart with percentage values for each slice

    Getting ready

    How it works...

    There's more...

    See also

    Adding a legend to a pie chart

    Getting ready

    How to do it...

    How it works...

    There's more...

    7. Creating Histograms

    Introduction

    Visualizing distributions as count frequencies or probability densities

    Getting ready

    How to do it...

    How it works...

    There's more

    Setting the bin size and the number of breaks

    Getting ready

    How to do it...

    How it works...

    There's more

    Adjusting histogram styles – bar colors, borders, and axes

    Getting ready

    How to do it...

    How it works...

    There's more

    Overlaying a density line over a histogram

    Getting ready

    How to do it...

    How it works...

    Multiple histograms along the diagonal of a pairs plot

    Getting ready

    How to do it...

    How it works...

    Histograms in the margins of line and scatter plots

    Getting ready

    How to do it...

    How it works...

    8. Box and Whisker Plots

    Introduction

    Creating box plots with narrow boxes for a small number of variables

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Grouping over a variable

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Varying box widths by the number of observations

    Getting ready

    How to do it...

    How it works...

    Creating box plots with notches

    Getting ready

    How to do it...

    How it works...

    There's more

    Including or excluding outliers

    Getting ready

    How to do it...

    How it works...

    See also

    Creating horizontal box plots

    Getting ready

    How to do it...

    How it works...

    Changing the box styling

    Getting ready

    How to do it...

    How it works...

    There's more

    Adjusting the extent of plot whiskers outside the box

    Getting ready

    How to do it...

    How it works...

    There's more

    Showing the number of observations

    Getting ready

    How to do it...

    How it works...

    There's more

    Splitting a variable at arbitrary values into subsets

    Getting ready

    How to do it...

    How it works...

    There's more

    9. Creating Heat Maps and Contour Plots

    Introduction

    Creating heat maps of a single Z variable with a scale

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Creating correlation heat maps

    Getting ready

    How to do it...

    How it works...

    There's more

    Summarizing multivariate data in a single heat map

    Getting ready

    How to do it...

    How it works...

    There's more

    Creating contour plots

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Creating filled contour plots

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Creating three-dimensional surface plots

    Getting ready

    How to do it...

    How it works...

    There's more

    Visualizing time series as calendar heat maps

    Getting ready

    How to do it...

    How it works...

    There's more

    10. Creating Maps

    Introduction

    Plotting global data by countries on a world map

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Creating graphs with regional maps

    Getting ready

    How to do it...

    How it works...

    There's more

    Plotting data on Google maps

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Creating and reading KML data

    Getting ready

    How to do it...

    How it works...

    See also

    Working with ESRI shapefiles

    Getting ready

    How to do it...

    How it works...

    There's more

    11. Data Visualization Using Lattice

    Introduction

    Creating bar charts

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Creating stacked bar charts

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Creating bar charts to visualize cross-tabulation

    Getting ready

    How to do it…

    How it works…

    There's more…

    Creating a conditional histogram

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Visualizing distributions through a kernel-density plot

    Getting ready

    How to do it…

    How it works…

    There's more…

    Creating a normal Q-Q plot

    Getting ready

    How to do it…

    How it works…

    There's more…

    Visualizing an empirical Cumulative Distribution Function

    Getting ready

    How to do it…

    How it works…

    There's more…

    Creating a boxplot

    Getting ready

    How to do it…

    How it works…

    There's more…

    Creating a conditional scatter plot

    Getting ready

    How to do it…

    How it works…

    There's more…

    12. Data Visualization Using ggplot2

    Introduction

    Creating bar charts

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Creating multiple bar charts

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Creating a bar chart with error bars

    Getting ready

    How to do it…

    How it works…

    There's more…

    Visualizing the density of a numeric variable

    Getting ready

    How to do it...

    How it works…

    There's more...

    Creating a box plot

    Getting ready

    How to do it...

    How it works…

    Creating a layered plot with a scatter plot and fitted line

    Getting ready

    How to do it...

    How it works…

    There's more...

    Creating a line chart

    Getting ready

    How to do it...

    How it works…

    There's more...

    Graph annotation with ggplot

    Getting ready

    How to do it...

    How it works...

    13. Inspecting Large Datasets

    Introduction

    Multivariate continuous data visualization

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Multivariate categorical data visualization

    Getting ready

    How to do it…

    How it works…

    There's more…

    Visualizing mixed data

    Getting ready

    How to do it…

    Zooming and filtering

    Getting ready

    How to do it...

    How it works…

    There's more...

    14. Three-dimensional Visualizations

    Introduction

    Three-dimensional scatter plots

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also...

    Three-dimensional scatter plots with a regression plane

    Getting ready

    How to do it…

    How it works…

    There's more…

    Three-dimensional bar charts

    Getting ready

    How to do it…

    How it works…

    Three-dimensional density plots

    Getting ready

    How to do it...

    How it works…

    15. Finalizing Graphs for Publications and Presentations

    Introduction

    Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Exporting graphs in vector formats – SVG, PDF, and PS

    Getting ready

    How to do it...

    How it works...

    There's more

    Adding mathematical and scientific notations (typesetting)

    Getting ready

    How to do it...

    How it works...

    There's more

    Adding text descriptions to graphs

    Getting ready

    How to do it...

    How it works...

    There's more

    Using graph templates

    Getting ready

    How to do it...

    How it works...

    There's more

    Choosing font families and styles under Windows, Mac OS X, and Linux

    Getting ready

    How to do it...

    How it works...

    There's more

    See also

    Choosing fonts for PostScripts and PDFs

    Getting ready

    How to do it...

    How it works...

    There's more

    III. Module 3: Learning Data Mining with R

    1. Warming Up

    Big data

    Scalability and efficiency

    Data source

    Data mining

    Feature extraction

    Summarization

    The data mining process

    CRISP-DM

    SEMMA

    Social network mining

    Social network

    Text mining

    Information retrieval and text mining

    Mining text for prediction

    Web data mining

    Why R?

    What are the disadvantages of R?

    Statistics

    Statistics and data mining

    Statistics and machine learning

    Statistics and R

    The limitations of statistics on data mining

    Machine learning

    Approaches to machine learning

    Machine learning architecture

    Data attributes and description

    Numeric attributes

    Categorical attributes

    Data description

    Data measuring

    Data cleaning

    Missing values

    Junk, noisy data, or outlier

    Data integration

    Data dimension reduction

    Eigenvalues and Eigenvectors

    Principal-Component Analysis

    Singular-value decomposition

    CUR decomposition

    Data transformation and discretization

    Data transformation

    Normalization data transformation methods

    Data discretization

    Visualization of results

    Visualization with R

    2. Mining Frequent Patterns, Associations, and Correlations

    An overview of associations and patterns

    Patterns and pattern discovery

    The frequent itemset

    The frequent subsequence

    The frequent substructures

    Relationship or rules discovery

    Association rules

    Correlation rules

    Market basket analysis

    The market basket model

    A-Priori algorithms

    Input data characteristics and data structure

    The A-Priori algorithm

    The R implementation

    A-Priori algorithm variants

    The Eclat algorithm

    The R implementation

    The FP-growth algorithm

    Input data characteristics and data structure

    The FP-growth algorithm

    The R implementation

    The GenMax algorithm with maximal frequent itemsets

    The R implementation

    The Charm algorithm with closed frequent itemsets

    The R implementation

    The algorithm to generate association rules

    The R implementation

    Hybrid association rules mining

    Mining multilevel and multidimensional association rules

    Constraint-based frequent pattern mining

    Mining sequence dataset

    Sequence dataset

    The GSP algorithm

    The R implementation

    The SPADE algorithm

    The R implementation

    Rule generation from sequential patterns

    High-performance algorithms

    3. Classification

    Classification

    Generic decision tree induction

    Attribute selection measures

    Tree pruning

    General algorithm for the decision tree generation

    The R implementation

    High-value credit card customers classification using ID3

    The ID3 algorithm

    The R implementation

    Web attack detection

    High-value credit card customers classification

    Web spam detection using C4.5

    The C4.5 algorithm

    The R implementation

    A parallel version with MapReduce

    Web spam detection

    Web key resource page judgment using CART

    The CART algorithm

    The R implementation

    Web key resource page judgment

    Trojan traffic identification method and Bayes classification

    Estimating

    Prior probability estimation

    Likelihood estimation

    The Bayes classification

    The R implementation

    Trojan traffic identification method

    Identify spam e-mail and Naïve Bayes classification

    The Naïve Bayes classification

    The R implementation

    Identify spam e-mail

    Rule-based classification of player types in computer games and rule-based classification

    Transformation from decision tree to decision rules

    Rule-based classification

    Sequential covering algorithm

    The RIPPER algorithm

    The R implementation

    Rule-based classification of player types in computer games

    4. Advanced Classification

    Ensemble (EM) methods

    The bagging algorithm

    The boosting and AdaBoost algorithms

    The Random forests algorithm

    The R implementation

    Parallel version with MapReduce

    Biological traits and the Bayesian belief network

    The Bayesian belief network (BBN) algorithm

    The R implementation

    Biological traits

    Protein classification and the k-Nearest Neighbors algorithm

    The kNN algorithm

    The R implementation

    Document retrieval and Support Vector Machine

    The SVM algorithm

    The R implementation

    Parallel version with MapReduce

    Document retrieval

    Classification using frequent patterns

    The associative classification

    CBA

    Discriminative frequent pattern-based classification

    The R implementation

    Text classification using sentential frequent itemsets

    Classification using the backpropagation algorithm

    The BP algorithm

    The R implementation

    Parallel version with MapReduce

    5. Cluster Analysis

    Search engines and the k-means algorithm

    The k-means clustering algorithm

    The kernel k-means algorithm

    The k-modes algorithm

    The R implementation

    Parallel version with MapReduce

    Search engine and web page clustering

    Automatic abstraction of document texts and the k-medoids algorithm

    The PAM algorithm

    The R implementation

    Automatic abstraction and summarization of document text

    The CLARA algorithm

    The CLARA algorithm

    The R implementation

    CLARANS

    The CLARANS algorithm

    The R implementation

    Unsupervised image categorization and affinity propagation clustering

    Affinity propagation clustering

    The R implementation

    Unsupervised image categorization

    The spectral clustering algorithm

    The R implementation

    News categorization and hierarchical clustering

    Agglomerative hierarchical clustering

    The BIRCH algorithm

    The chameleon algorithm

    The Bayesian hierarchical clustering algorithm

    The probabilistic hierarchical clustering algorithm

    The R implementation

    News categorization

    6. Advanced Cluster Analysis

    Customer categorization analysis of e-commerce and DBSCAN

    The DBSCAN algorithm

    Customer categorization analysis of e-commerce

    Clustering web pages and OPTICS

    The OPTICS algorithm

    The R implementation

    Clustering web pages

    Visitor analysis in the browser cache and DENCLUE

    The DENCLUE algorithm

    The R implementation

    Visitor analysis in the browser cache

    Recommendation system and STING

    The STING algorithm

    The R implementation

    Recommendation systems

    Web sentiment analysis and CLIQUE

    The CLIQUE algorithm

    The R implementation

    Web sentiment analysis

    Opinion mining and WAVE clustering

    The WAVE cluster algorithm

    The R implementation

    Opinion mining

    User search intent and the EM algorithm

    The EM algorithm

    The R implementation

    The user search intent

    Customer purchase data analysis and clustering high-dimensional data

    The MAFIA algorithm

    The SURFING algorithm

    The R implementation

    Customer purchase data analysis

    SNS and clustering graph and network data

    The SCAN algorithm

    The R implementation

    Social networking service (SNS)

    7. Outlier Detection

    Credit card fraud detection and statistical methods

    The likelihood-based outlier detection algorithm

    The R implementation

    Credit card fraud detection

    Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods

    The NL algorithm

    The FindAllOutsM algorithm

    The FindAllOutsD algorithm

    The distance-based algorithm

    The Dolphin algorithm

    The R implementation

    Activity monitoring and the detection of mobile fraud

    Intrusion detection and density-based methods

    The OPTICS-OF algorithm

    The High Contrast Subspace algorithm

    The R implementation

    Intrusion detection

    Intrusion detection and clustering-based methods

    Hierarchical clustering to detect outliers

    The k-means-based algorithm

    The ODIN algorithm

    The R implementation

    Monitoring the performance of the web server and classification-based methods

    The OCSVM algorithm

    The one-class nearest neighbor algorithm

    The R implementation

    Monitoring the performance of the web server

    Detecting novelty in text, topic detection, and mining contextual outliers

    The conditional anomaly detection (CAD) algorithm

    The R implementation

    Detecting novelty in text and topic detection

    Collective outliers on spatial data

    The route outlier detection (ROD) algorithm

    The R implementation

    Characteristics of collective outliers

    Outlier detection in high-dimensional data

    The brute-force algorithm

    The HilOut algorithm

    The R implementation

    8. Mining Stream, Time-series, and Sequence Data

    The credit card transaction flow and STREAM algorithm

    The STREAM algorithm

    The single-pass-any-time clustering algorithm

    The R implementation

    The credit card transaction flow

    Predicting future prices and time-series analysis

    The ARIMA algorithm

    Predicting future prices

    Stock market data and time-series clustering and classification

    The hError algorithm

    Time-series classification with the 1NN classifier

    The R implementation

    Stock market data

    Web click streams and mining symbolic sequences

    The TECNO-STREAMS algorithm

    The R implementation

    Web click streams

    Mining sequence patterns in transactional databases

    The PrefixSpan algorithm

    The R implementation

    9. Graph Mining and Network Analysis

    Graph mining

    Graph

    Graph mining algorithms

    Mining frequent subgraph patterns

    The gPLS algorithm

    The GraphSig algorithm

    The gSpan algorithm

    Rightmost path extensions and their supports

    The subgraph isomorphism enumeration algorithm

    The canonical checking algorithm

    The R implementation

    Social network mining

    Community detection and the shingling algorithm

    The node classification and iterative classification algorithms

    The R implementation

    10. Mining Text and Web Data

    Text mining and TM packages

    Text summarization

    Topic representation

    The multidocument summarization algorithm

    The Maximal Marginal Relevance algorithm

    The R implementation

    The question answering system

    Genre categorization of web pages

    Categorizing newspaper articles and newswires into topics

    The N-gram-based text categorization

    The R implementation

    Web usage mining with web logs

    The FCA-based association rule mining algorithm

    The R implementation

    IV. Module 4: Mastering R for Quantitative Finance

    1. Time Series Analysis

    Multivariate time series analysis

    Cointegration

    Vector autoregressive models

    VAR implementation example

    Cointegrated VAR and VECM

    Volatility modeling

    GARCH modeling with the rugarch package

    The standard GARCH model

    The Exponential GARCH model (EGARCH)

    The Threshold GARCH model (TGARCH)

    Simulation and forecasting

    References and reading list

    2. Factor Models

    Arbitrage pricing theory

    Implementation of APT

    Fama-French three-factor model

    Modeling in R

    Data selection

    Estimation of APT with principal component analysis

    Estimation of the Fama-French model

    References

    3. Forecasting Volume

    Motivation

    The intensity of trading

    The volume forecasting model

    Implementation in R

    The data

    Loading the data

    The seasonal component

    AR(1) estimation and forecasting

    SETAR estimation and forecasting

    Interpreting the results

    References

    4. Big Data – Advanced Analytics

    Getting data from open sources

    Introduction to big data analysis in R

    K-means clustering on big data

    Loading big matrices

    Big data K-means clustering analysis

    Big data linear regression analysis

    Loading big data

    Fitting a linear regression model on large datasets

    References

    5. FX Derivatives

    Terminology and notations

    Currency options

    Exchange options

    Two-dimensional Wiener processes

    The Margrabe formula

    Application in R

    Quanto options

    Pricing formula for a call quanto

    Pricing a call quanto in R

    References

    6. Interest Rate Derivatives and Models

    The Black model

    Pricing a cap with Black's model

    The Vasicek model

    The Cox-Ingersoll-Ross model

    Parameter estimation of interest rate models

    Using the SMFI5 package

    References

    7. Exotic Options

    A general pricing approach

    The role of dynamic hedging

    How R can help a lot

    A glance beyond vanillas

    Greeks – the link back to the vanilla world

    Pricing the Double-no-touch option

    Another way to price the Double-no-touch option

    The life of a Double-no-touch option – a simulation

    Exotic options embedded in structured products

    References

    8. Optimal Hedging

    Hedging of derivatives

    Market risk of derivatives

    Static delta hedge

    Dynamic delta hedge

    Comparing the performance of delta hedging

    Hedging in the presence of transaction costs

    Optimization of the hedge

    Optimal hedging in the case of absolute transaction costs

    Optimal hedging in the case of relative transaction costs

    Further extensions

    References

    9. Fundamental Analysis

    The basics of fundamental analysis

    Collecting data

    Revealing connections

    Including multiple variables

    Separating investment targets

    Setting classification rules

    Backtesting

    Industry-specific investment

    References

    10. Technical Analysis, Neural Networks, and Logoptimal Portfolios

    Market efficiency

    Technical analysis

    The TA toolkit

    Markets

    Plotting charts – bitcoin

    Built-in indicators

    SMA and EMA

    RSI

    MACD

    Candle patterns: key reversal

    Evaluating the signals and managing the position

    A word on money management

    Wrapping up

    Neural networks

    Forecasting bitcoin prices

    Evaluation of the strategy

    Logoptimal portfolios

    A universally consistent, non-parametric investment strategy

    Evaluation of the strategy

    References

    11. Asset and Liability Management

    Data preparation

    Data source at first glance

    Cash-flow generator functions

    Preparing the cash-flow

    Interest rate risk measurement

    Liquidity risk measurement

    Modeling non-maturity deposits

    A model of deposit interest rate development

    Static replication of non-maturity deposits

    References

    12. Capital Adequacy

    Principles of the Basel Accords

    Basel I

    Basel II

    Minimum capital requirements

    Supervisory review

    Transparency

    Basel III

    Risk measures

    Analytical VaR

    Historical VaR

    Monte-Carlo simulation

    Risk categories

    Market risk

    Credit risk

    Operational risk

    References

    13. Systemic Risks

    Systemic risk in a nutshell

    The dataset used in our examples

    Core-periphery decomposition

    Implementation in R

    Results

    The simulation method

    The simulation

    Implementation in R

    Results

    Possible interpretations and suggestions

    References

    V. Module 5: Machine Learning with R

    1. Introducing Machine Learning

    The origins of machine learning

    Uses and abuses of machine learning

    Machine learning successes

    The limits of machine learning

    Machine learning ethics

    How machines learn

    Data storage

    Abstraction

    Generalization

    Evaluation

    Machine learning in practice

    Types of input data

    Types of machine learning algorithms

    Matching input data to algorithms

    Machine learning with R

    Installing R packages

    Loading and unloading R packages

    2. Managing and Understanding Data

    R data structures

    Vectors

    Factors

    Lists

    Data frames

    Matrices and arrays

    Managing data with R

    Saving, loading, and removing R data structures

    Importing and saving data from CSV files

    Exploring and understanding data

    Exploring the structure of data

    Exploring numeric variables

    Measuring the central tendency – mean and median

    Measuring spread – quartiles and the five-number summary

    Visualizing numeric variables – boxplots

    Visualizing numeric variables – histograms

    Understanding numeric data – uniform and normal distributions

    Measuring spread – variance and standard deviation

    Exploring categorical variables

    Measuring the central tendency – the mode

    Exploring relationships between variables

    Visualizing relationships – scatterplots

    Examining relationships – two-way cross-tabulations

    3. Lazy Learning – Classification Using Nearest Neighbors

    Understanding nearest neighbor classification

    The k-NN algorithm

    Measuring similarity with distance

    Choosing an appropriate k

    Preparing data for use with k-NN

    Why is the k-NN algorithm lazy?

    Example – diagnosing breast cancer with the k-NN algorithm

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Transformation – normalizing numeric data

    Data preparation – creating training and test datasets

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Transformation – z-score standardization

    Testing alternative values of k

    4. Probabilistic Learning – Classification Using Naive Bayes

    Understanding Naive Bayes

    Basic concepts of Bayesian methods

    Understanding probability

    Understanding joint probability

    Computing conditional probability with Bayes' theorem

    The Naive Bayes algorithm

    Classification with Naive Bayes

    The Laplace estimator

    Using numeric features with Naive Bayes

    Example – filtering mobile phone spam with the Naive Bayes algorithm

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – cleaning and standardizing text data

    Data preparation – splitting text documents into words

    Data preparation – creating training and test datasets

    Visualizing text data – word clouds

    Data preparation – creating indicator features for frequent words

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    5. Divide and Conquer – Classification Using Decision Trees and Rules

    Understanding decision trees

    Divide and conquer

    The C5.0 decision tree algorithm

    Choosing the best split

    Pruning the decision tree

    Example – identifying risky bank loans using C5.0 decision trees

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – creating random training and test datasets

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Boosting the accuracy of decision trees

    Making some mistakes costlier than others

    Understanding classification rules

    Separate and conquer

    The 1R algorithm

    The RIPPER algorithm

    Rules from decision trees

    What makes trees and rules greedy?

    Example – identifying poisonous mushrooms with rule learners

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    6. Forecasting Numeric Data – Regression Methods

    Understanding regression

    Simple linear regression

    Ordinary least squares estimation

    Correlations

    Multiple linear regression

    Example – predicting medical expenses using linear regression

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Exploring relationships among features – the correlation matrix

    Visualizing relationships among features – the scatterplot matrix

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Model specification – adding non-linear relationships

    Transformation – converting a numeric variable to a binary indicator

    Model specification – adding interaction effects

    Putting it all together – an improved regression model

    Understanding regression trees and model trees

    Adding regression to trees

    Example – estimating the quality of wines with regression trees and model trees

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Visualizing decision trees

    Step 4 – evaluating model performance

    Measuring performance with the mean absolute error

    Step 5 – improving model performance

    7. Black Box Methods – Neural Networks and Support Vector Machines

    Understanding neural networks

    From biological to artificial neurons

    Activation functions

    Network topology

    The number of layers

    The direction of information travel

    The number of nodes in each layer

    Training neural networks with backpropagation

    Example – Modeling the strength of concrete with ANNs

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Understanding Support Vector Machines

    Classification with hyperplanes

    The case of linearly separable data

    The case of nonlinearly separable data

    Using kernels for non-linear spaces

    Example – performing OCR with SVMs

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    8. Finding Patterns – Market Basket Analysis Using Association Rules

    Understanding association rules

    The Apriori algorithm for association rule learning

    Measuring rule interest – support and confidence

    Building a set of rules with the Apriori principle

    Example – identifying frequently purchased groceries with association rules

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – creating a sparse matrix for transaction data

    Visualizing item support – item frequency plots

    Visualizing the transaction data – plotting the sparse matrix

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    Sorting the set of association rules

    Taking subsets of association rules

    Saving association rules to a file or data frame

    9. Finding Groups of Data – Clustering with k-means

    Understanding clustering

    Clustering as a machine learning task

    The k-means clustering algorithm

    Using distance to assign and update clusters

    Choosing the appropriate number of clusters

    Example – finding teen market segments using k-means clustering

    Step 1 – collecting data

    Step 2 – exploring and preparing the data

    Data preparation – dummy coding missing values

    Data preparation – imputing the missing values

    Step 3 – training a model on the data

    Step 4 – evaluating model performance

    Step 5 – improving model performance

    10. Evaluating Model Performance

    Measuring performance for classification

    Working with classification prediction data in R

    A closer look at confusion matrices

    Using confusion matrices to measure performance

    Beyond accuracy – other measures of performance

    The kappa statistic

    Sensitivity and specificity

    Precision and recall

    The F-measure

    Visualizing performance trade-offs

    ROC curves

    Estimating future performance

    The holdout method

    Cross-validation

    Bootstrap sampling

    11. Improving Model Performance

    Tuning stock models for better performance

    Using caret for automated parameter tuning

    Creating a simple tuned model

    Customizing the tuning process

    Improving model performance with meta-learning

    Understanding ensembles

    Bagging

    Boosting

    Random forests

    Training random forests

    Evaluating random forest performance

    12. Specialized Machine Learning Topics

    Working with proprietary files and databases

    Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files

    Querying data in SQL databases

    Working with online data and services

    Downloading the complete text of web pages

    Scraping data from web pages

    Parsing XML documents

    Parsing JSON from web APIs

    Working with domain-specific data

    Analyzing bioinformatics data

    Analyzing and visualizing network data

    Improving the performance of R

    Managing very large datasets

    Generalizing tabular data structures with dplyr

    Making data frames faster with data.table

    Creating disk-based data frames with ff

    Using massive matrices with bigmemory

    Learning faster with parallel computing

    Measuring execution time

    Working in parallel with multicore and snow

    Taking advantage of parallel with foreach and doParallel

    Parallel cloud computing with MapReduce and Hadoop

    GPU computing

    Deploying optimized learning algorithms

    Building bigger regression models with biglm

    Growing bigger and faster random forests with bigrf

    Training and evaluating models in parallel with caret

    A. Reflect and Test Yourself Answers

    Module 1: Data Analysis with R

    Chapter 1: RefresheR

    Chapter 2: The Shape of Data

    Chapter 3: Describing Relationships

    Chapter 4: Probability

    Chapter 5: Using Data to Reason About the World

    Chapter 6: Testing Hypotheses

    Chapter 7: Bayesian Methods

    Chapter 8: Predicting Continuous Variables

    Chapter 9: Predicting Categorical Variables

    Chapter 10: Sources of Data

    Chapter 11: Dealing with Messy Data

    Chapter 12: Dealing with Large Data

    Module 2: R Graphs

    Chapter 1: R Graphics

    Chapter 2: Basic Graph Functions

    Chapter 3: Beyond the Basics – Adjusting Key Parameters

    Chapter 4: Creating Scatter Plots

    Chapter 5: Creating Line Graphs and Time Series Charts

    Chapter 6: Creating Bar, Dot, and Pie Charts

    Chapter 7: Creating Histograms

    Chapter 8: Box and Whisker Plots

    Chapter 9: Creating Heat Maps and Contour Plots

    Module 4: Mastering R for Quantitative Finance

    Chapter 1: Time Series Analysis

    Chapter 3: Forecasting Volume

    Chapter 4: Big Data – Advanced Analytics

    Chapter 5: FX Derivatives

    Chapter 6: Interest Rate Derivatives and Models

    Chapter 7: Exotic Options

    Chapter 8: Optimal Hedging

    Chapter 9: Fundamental Analysis

    Module 5: Machine Learning with R

    Chapter 1: Introducing Machine Learning

    Chapter 2: Managing and Understanding Data

    Chapter 3: Lazy Learning – Classification Using Nearest Neighbors

    Chapter 4: Probabilistic Learning – Classification Using Naive Bayes

    Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules

    Chapter 6: Forecasting Numeric Data – Regression Methods

    Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines

    Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules

    B. Bibliography

    Index

    R: Data Analysis and Visualization


    R: Data Analysis and Visualization

    A course in five modules

    Master the art of building analytical models using R with your Course Guide Edwin Moses

    Learn data analysis, data visualization techniques, data mining, and machine learning all using R and also learn to build models in quantitative finance using this powerful language

    To contact your Course Guide

    Email: <edwinm@packtpub.com>

    BIRMINGHAM - MUMBAI

    Meet Your Course Guide

    Welcome to this course on R, the statistical programming language for data scientists and statisticians. With this course, you'll embark on a journey of learning R for data science.

    If you have any questions along the way, you can reach out to me over email and I'll make sure you get everything from the course that we've planned – for you to become a working R developer. Details of how to contact me are included on the first page of this course.

    Course Structure

    The R learning path created for you has five connected modules. Each of these modules is a mini-course in its own right, and as you complete each one, you'll have gained key skills and be ready for the material in the next module!

    Now, let’s look at the pathway these modules create and how they will take you from doing data analysis with R to creating analytical models based on machine learning.

    Course journey

    This course begins by looking at the Data Analysis with R module. This module will help you navigate the R environment. You'll gain a thorough understanding of statistical reasoning and sampling. Finally, you'll be able to put best practices into effect to make your job easier and facilitate reproducibility.

    The second place to explore is R Graphs. This module will help you leverage R's powerful default graphics and use advanced graphics systems such as lattice and ggplot2, the grammar of graphics. By inspecting large datasets with tableplot and building stunning three-dimensional visualizations, you will learn to produce, customize, and publish advanced visualizations using these popular and powerful frameworks.

    With the third module, Learning Data Mining with R, you will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, associations, and correlations while working with R programs. Discover how to write code for various prediction models, streaming data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. You will finish this module confident in your ability to decide which data mining algorithm to apply in any given situation.

    The Mastering R for Quantitative Finance module pragmatically introduces both the quantitative finance concepts and their modeling in R, enabling you to build a tailor-made trading system on your own. By the end of the module, you will be well versed in various financial techniques using R and will be able to place good bets while making financial decisions.

    Finally, we'll look at the Machine Learning with R module. With this module, you'll discover all the analytical tools you need to gain insights from complex data and learn how to choose the correct algorithm for your specific needs. Through full engagement with the sort of real-world problems data-wranglers face, you'll learn to apply machine learning methods to deal with common tasks, including classification, prediction, forecasting, market analysis, and clustering.

    The Course Roadmap and Timeline

    Here's a view of the entire course plan before we begin. This grid gives you a topic overview of the whole course and its modules, so you can see how we will move through particular phases of learning to use R, what skills you'll be learning along the way, and what you can do with those skills at each point. I also offer you an estimate of the time you might want to take for each module, although a lot depends on your learning style and how much time you're able to give the course each week!

    Part I. Module 1: Data Analysis with R

    Chapter 1. RefresheR

    Before we dive into the (other) fun stuff (sampling multi-dimensional probability distributions, using convex optimization to fit data models, and so on), it would be helpful to review those aspects of R that all subsequent chapters will assume knowledge of.

    If you fancy yourself as an R guru, you should still, at least, skim through this chapter, because you'll almost certainly find the idioms, packages, and style introduced here to be beneficial in following along with the rest of the material.

    If you don't care much about R (yet), and are just in this for the statistics, you can heave a heavy sigh of relief that, for the most part, you can run the code given in this book in the interactive R interpreter with very little modification, and just follow along with the ideas. However, it is my belief (read: delusion) that by the end of this book, you'll cultivate a newfound appreciation of R alongside a robust understanding of methods in data analysis.

    Fire up your R interpreter, and let's get started!

    Navigating the basics

    In the interactive R interpreter, any line starting with a > character denotes R asking for input (If you see a + prompt, it means that you didn't finish typing a statement at the prompt and R is asking you to provide the rest of the expression.). Striking the return key will send your input to R to be evaluated. R's response is then spit back at you in the line immediately following your input, after which R asks for more input. This is called a REPL (Read-Evaluate-Print-Loop). It is also possible for R to read a batch of commands saved in a file (unsurprisingly called batch mode), but we'll be using the interactive mode for most of the book.

    As you might imagine, R supports all of the familiar mathematical operators found in most other languages:

    Arithmetic and assignment

    Check out the following example:

      > 2 + 2

      [1] 4

     

      > 9 / 3

      [1] 3

     

      > 5 %% 2    # modulus operator (remainder of 5 divided by 2)

      [1] 1

    Anything that occurs after the octothorpe or pound sign, #, (or hash-tag for you young'uns), is ignored by the R interpreter. This is useful for documenting the code in natural language. These are called comments.

    In a multi-operation arithmetic expression, R will follow the standard order of operations from math. In order to override this natural order, you have to use parentheses flanking the sub-expression that you'd like to be performed first.

      > 3 + 2 - 10 ^ 2        # ^ is the exponent operator

      [1] -95

      > 3 + (2 - 10) ^ 2

      [1] 67

    In practice, almost all compound expressions are split up, with intermediate values assigned to variables; using a variable in a later expression is just like substituting the variable with the value that was assigned to it. The (primary) assignment operator is <-.

      > # assignments follow the form VARIABLE <- VALUE

      > var <- 10

      > var

      [1] 10

      > var ^ 2

      [1] 100

      > VAR / 2            # variable names are case-sensitive

      Error: object 'VAR' not found

    Notice that the first and second lines in the preceding code snippet didn't have an output to be displayed, so R just immediately asked for more input. This is because assignments don't have a return value. Their only job is to give a value to a variable, or to change the existing value of a variable. Generally, operations and functions on variables in R don't change the value of the variable. Instead, they return the result of the operation. If you want to change a variable to the result of an operation using that variable, you have to reassign that variable as follows:

      > var              # var is 10

      [1] 10

      > var ^ 2

      [1] 100

      > var              # var is still 10

      [1] 10

      > var <- var ^ 2    # no return value

      > var              # var is now 100

      [1] 100

    Be aware that variable names may contain numbers, underscores, and periods; this is something that trips up a lot of people who are familiar with other programming languages that disallow using periods in variable names. The only further restrictions on variable names are that they must start with a letter (or a period and then a letter), and that they must not be one of R's reserved words, such as TRUE, Inf, and so on.
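As a quick illustration of these rules (the variable names here are made up for the example):

```r
# These names are all legal in R:
my_var <- 1      # underscores are fine
my.var <- 2      # periods are fine too, unlike in many other languages
.my.var <- 3     # may start with a period followed by a letter
var2 <- 4        # digits are fine anywhere except the first character

# These would fail if uncommented:
# 2var <- 5      # Error: names cannot start with a digit
# TRUE <- 6      # Error: TRUE is a reserved word
```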

    Although the arithmetic operators that we've seen thus far are functions in their own right, most functions in R take the form function_name(values supplied to the function). The values supplied to the function are called the arguments of that function.

      > cos(3.14159)      # cosine function

      [1] -1

      > cos(pi)          # pi is a constant that R provides

      [1] -1

      > acos(-1)          # arccosine function

      [1] 3.141593

      > acos(cos(pi)) + 10

      [1] 13.14159

      > # functions can be used as arguments to other functions

    (If you paid attention in math class, you'll know that the cosine of π is -1, and that arccosine is the inverse function of cosine.)

    There are hundreds of such useful functions defined in base R, only a handful of which we will see in this book. Two sections from now, we will be building our very own functions.

    Before we move on from arithmetic, it will serve us well to visit some of the odd values that may result from certain operations:

      > 1 / 0

      [1] Inf

      >  0 / 0

      [1] NaN

    It is common during practical usage of R to accidentally divide by zero. As you can see, this undefined operation yields an infinite value in R. Dividing zero by zero yields the value NaN, which stands for Not a Number.
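R also ships with predicate functions for detecting these special values, which is handy when they creep into a calculation unnoticed; a quick sketch:

```r
# Predicate functions for special values; note that is.nan() and
# is.infinite() are distinct checks
x <- 1 / 0                # Inf
y <- 0 / 0                # NaN

is.infinite(x)            # TRUE
is.finite(x)              # FALSE
is.nan(y)                 # TRUE
is.nan(x)                 # FALSE; Inf is infinite, not "not a number"
```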

    Logicals and characters

    So far, we've only been dealing with numerics, but there are other atomic data types in R. To wit:

      > foo <- TRUE        # foo is of the logical data type

      > class(foo)        # class() tells us the type

      [1] "logical"

      > bar <- "hi!"      # bar is of the character data type

      > class(bar)

      [1] "character"

    The logical data type (also called Booleans) can hold the values TRUE or FALSE or, equivalently, T or F. The familiar operators from Boolean algebra are defined for these types:

      > foo

      [1] TRUE

      > foo && TRUE                # boolean and

      [1] TRUE

      > foo && FALSE

      [1] FALSE

      > foo || FALSE                # boolean or

      [1] TRUE

      > !foo                        # negation operator

      [1] FALSE

    In a Boolean expression with a logical value and a number, any number that is not 0 is interpreted as TRUE.

      > foo && 1

      [1] TRUE

      > foo && 2

      [1] TRUE

      > foo && 0

      [1] FALSE

    Additionally, there are functions and operators that return logical values such as:

      > 4 < 2          # less than operator

      [1] FALSE

      > 4 >= 4          # greater than or equal to

      [1] TRUE

      > 3 == 3          # equality operator

      [1] TRUE

      > 3 != 2          # inequality operator

      [1] TRUE

    Just as there are functions in R that are only defined for work on the numeric and logical data type, there are other functions that are designed to work only with the character data type, also known as strings:

      > lang.domain <- "statistics"

      > lang.domain <- toupper(lang.domain)

      > print(lang.domain)

      [1] "STATISTICS"

      > # retrieves substring from first character to fourth character

      > substr(lang.domain, 1, 4)

      [1] "STAT"

      > gsub("I", "1", lang.domain)  # substitutes every "I" for "1"

      [1] "STAT1ST1CS"

      > # combines character strings

      > paste("R does", lang.domain, "!!!")

      [1] "R does STATISTICS !!!"

    Flow of control

    The last topic in this section will be flow of control constructs.

    The most basic flow of control construct is the if statement. The argument to an if statement (what goes between the parentheses), is an expression that returns a logical value. The block of code following the if statement gets executed only if the expression yields TRUE. For example:

      > if(2 + 2 == 4)

      +    print("very good")

      [1] "very good"

      > if(2 + 2 == 5)

      +    print("all hail to the thief")

      >

    It is possible to execute more than one statement if an if condition is triggered; you just have to use curly brackets ({}) to contain the statements.

      > if((4/2==2) && (2*2==4)){

      +    print("four divided by two is two...")

      +    print("and two times two is four")

      + }

      [1] "four divided by two is two..."

      [1] "and two times two is four"

      >

    It is also possible to specify a block of code that will get executed if the if conditional is FALSE.

      > closing.time <- TRUE

      > if(closing.time){

      +    print("you don't have to go home")

      +    print("but you can't stay here")

      + } else{

      +    print("you can stay here!")

      + }

      [1] "you don't have to go home"

      [1] "but you can't stay here"

      > if(!closing.time){

      +    print("you don't have to go home")

      +    print("but you can't stay here")

      + } else{

      +    print("you can stay here!")

      + }

      [1] "you can stay here!"

      >

    There are other flow of control constructs (like while and for), but we won't directly be using them much in this text.
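For completeness, here is a brief sketch of what for and while look like in R (the variable names are chosen just for this example):

```r
# A for loop iterates over the elements of a sequence
total <- 0
for (i in 1:5) {
  total <- total + i
}
total                # 15

# A while loop repeats as long as its condition is TRUE
n <- 1
while (n < 100) {
  n <- n * 2         # doubles n until it reaches at least 100
}
n                    # 128
```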

    Getting help in R

    Before we go further, it would serve us well to have a brief section detailing how to get help in R. Most R tutorials leave this for one of the last sections—if it is even included at all! In my own personal experience, though, getting help is going to be one of the first things you will want to do as you add more bricks to your R knowledge castle. Learning R doesn't have to be difficult; just take it slowly, ask questions, and get help early. Go you!

    It is easy to get help with R right at the console. Running the help.start() function at the prompt will start a manual browser. From here, you can do anything from going over the basics of R to reading the nitty-gritty details on how R works internally.

    You can get help on a particular function in R if you know its name, by supplying that name as an argument to the help function. For example, let's say you want to know more about the gsub() function that I sprang on you before. Running the following code:

      > help(gsub)

      > # or simply

      > ?gsub

    will display a manual page documenting what the function is, how to use it, and examples of its usage.

    This rapid accessibility to documentation means that I'm never hopelessly lost when I encounter a function which I haven't seen before. The downside to this extraordinarily convenient help mechanism is that I rarely bother to remember the order of arguments, since looking them up is just seconds away.

    Occasionally, you won't quite remember the exact name of the function you're looking for, but you'll have an idea about what the name should be. For this, you can use the help.search() function.

      > help.search("chisquare")

      > # or simply

      > ??chisquare

    For tougher, more semantic queries, nothing beats a good old fashioned web search engine. If you don't get relevant results the first time, try adding the term programming or statistics in there for good measure.

    Vectors

    Vectors are the most basic data structures in R, and they are ubiquitous indeed. In fact, even the single values that we've been working with thus far were actually vectors of length 1. That's why the interactive R console has been printing [1] along with all of our output.

    Vectors are essentially an ordered collection of values of the same atomic data type. Vectors can be arbitrarily large (with some limitations), or they can be just one single value.

    The canonical way of building vectors manually is by using the c() function (which stands for combine).

      > our.vect <- c(8, 6, 7, 5, 3, 0, 9)

      > our.vect

      [1] 8 6 7 5 3 0 9

    In the preceding example, we created a numeric vector of length 7 (namely, Jenny's telephone number).

    Note that if we tried to put character data types into this vector as follows:

      > another.vect <- c(8, 6, 7, "-", 3, 0, 9)

      > another.vect

      [1] "8" "6" "7" "-" "3" "0" "9"

    R would convert all the items in the vector (called elements) into character data types to satisfy the condition that all elements of a vector must be of the same type. A similar thing happens when you try to use logical values in a vector with numbers; the logical values would be converted into 1 and 0 (for TRUE and FALSE, respectively). These logicals will turn into "TRUE" and "FALSE" (note the quotation marks) when used in a vector that contains characters.
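    You can see this coercion in action right at the console:

      > c(TRUE, FALSE, 2)

      [1] 1 0 2

      > c(TRUE, FALSE, "two")

      [1] "TRUE"  "FALSE" "two"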

    Subsetting

    It is very common to want to extract one or more elements from a vector. For this, we use a technique called indexing or subsetting. After the vector, we put an integer inside square brackets ([]), which are called the subscript operator. This instructs R to return the element at that index. The indices (plural of index, in case you were wondering!) for vectors in R start at 1, and stop at the length of the vector.

      > our.vect[1]                  # to get the first value

      [1] 8

      > # the function length() returns the length of a vector

      > length(our.vect)

      [1] 7

      > our.vect[length(our.vect)]  # get the last element of a vector

      [1] 9

    Note that in the preceding code, we used a function in the subscript operator. In cases like these, R evaluates the expression in the subscript operator, and uses the number it returns as the index to extract.

    If we get greedy, and try to extract an element at an index that doesn't exist, R will respond with NA, meaning, not available. We will see this special value cropping up from time to time throughout this text.

      > our.vect[10]

      [1] NA

    One of the most powerful ideas in R is that you can use vectors to subset other vectors:

      > # extract the first, third, fifth, and

      > # seventh element from our vector

      > our.vect[c(1, 3, 5, 7)]

      [1] 8 7 3 9

    The ability to use vectors to index other vectors may not seem like much now, but its usefulness will become clear soon.
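    Relatedly, using negative integers in the subscript operator excludes the elements at those indices:

      > our.vect[-1]                # everything but the first element

      [1] 6 7 5 3 0 9

      > our.vect[c(-1, -7)]

      [1] 6 7 5 3 0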

    Another way to create vectors is by using sequences.

      > other.vector <- 1:10

      > other.vector

      [1]  1  2  3  4  5  6  7  8  9 10

      > another.vector <- seq(50, 30, by=-2)

      > another.vector

      [1] 50 48 46 44 42 40 38 36 34 32 30

    In the preceding code, the 1:10 statement creates a vector of the integers from 1 to 10; 10:1 would have created the same 10-element vector, but in reverse. The seq() function is more general in that it allows sequences to be made using arbitrary step sizes (among many other things).
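    For instance, instead of a step size, you can ask seq() for a sequence of a particular length with the length.out argument:

      > seq(0, 1, length.out=5)

      [1] 0.00 0.25 0.50 0.75 1.00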

    Combining our knowledge of sequences and vector subsetting, we can get the first five digits of Jenny's number thusly:

      > our.vect[1:5]

      [1] 8 6 7 5 3

    Vectorized functions

    Part of what makes R so powerful is that many of R's functions take vectors as arguments. These vectorized functions are usually extremely fast and efficient. We've already seen one such function, length(), but there are many, many others.

      > # takes the mean of a vector

      > mean(our.vect)

      [1] 5.428571

      > sd(our.vect)    # standard deviation

      [1] 3.101459

      > min(our.vect)

      [1] 0

      > max(1:10)

      [1] 10

      > sum(c(1, 2, 3))

      [1] 6
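    The arithmetic operators are vectorized, too; they operate on each element of a vector:

      > our.vect + 1

      [1]  9  7  8  6  4  1 10

      > our.vect * 2

      [1] 16 12 14 10  6  0 18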

    In practical settings, such as when reading data from files, it is common to have NA values in vectors:

      > messy.vector <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)

      > messy.vector

      [1]  8  6 NA  7  5 NA  3  0  9

      > length(messy.vector)

      [1] 9

    Some vectorized functions will not allow NA values by default. In these cases, an extra keyword argument must be supplied along with the first argument to the function.

      > mean(messy.vector)

      [1] NA
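    For mean(), this argument is called na.rm; setting it to TRUE tells the function to remove the NA values before computing:

      > mean(messy.vector, na.rm=TRUE)

      [1] 5.428571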
