Mastering Scientific Computing with R
Ebook, 830 pages, 4 hours

About This Book
  • Perform publication-quality science using R
  • Use some of R’s most powerful and least known features to solve complex scientific computing problems
  • Learn how to create visual illustrations of scientific results
Who This Book Is For

If you want to learn how to quantitatively answer scientific questions for practical purposes using the powerful R language and the open source R tool ecosystem, this book is ideal for you. It is best suited to scientists who understand scientific concepts, know a little R, and want to start applying R to answer empirical scientific questions. Some R exposure is helpful, but not compulsory.

Language: English
Release date: Jan 31, 2015
ISBN: 9781783555260

    Mastering Scientific Computing with R - Paul Gerrard

    Table of Contents

    Mastering Scientific Computing with R

    Credits

    About the Authors

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Programming with R

    Data structures in R

    Atomic vectors

    Operations on vectors

    Lists

    Attributes

    Factors

    Multidimensional arrays

    Matrices

    Data frames

    Loading data into R

    Saving data frames

    Basic plots and the ggplot2 package

    Flow control

    The for() loop

    The apply() function

    The if() statement

    The while() loop

    The repeat{} and break statement

    Functions

    General programming and debugging tools

    Summary

    2. Statistical Methods with R

    Descriptive statistics

    Data variability

    Confidence intervals

    Probability distributions

    Fitting distributions

    Higher order moments of a distribution

    Other statistical tests to fit distributions

    The propagate package

    Hypothesis testing

    Proportion tests

    Two sample hypothesis tests

    Unit root tests

    Summary

    3. Linear Models

    An overview of statistical modeling

    Model formulas

    Explanatory variables interactions

    Error terms

    The intercept as parameter 1

    Updating a model

    Linear regression

    Plotting a slope

    Analysis of variance

    Generalized linear models

    Generalized additive models

    Linear discriminant analysis

    Principal component analysis

    Clustering

    Summary

    4. Nonlinear Methods

    Nonparametric and parametric models

    The adsorption and body measures datasets

    Theory-driven nonlinear regression

    Visually exploring nonlinear relationships

    Extending the linear framework

    Polynomial regression

    Performing a polynomial regression in R

    Spline regression

    Nonparametric nonlinear methods

    Kernel regression

    Kernel weighted local polynomial fitting

    Optimal bandwidth selection

    A practical scientific application of kernel regression

    Locally weighted polynomial regression and the loess function

    Nonparametric methods with the np package

    Nonlinear quantile regression

    Summary

    5. Linear Algebra

    Matrices and linear algebra

    Matrices in R

    Vectors in R

    Matrix notation

    The physical functioning dataset

    Basic matrix operations

    Element-wise matrix operations

    Matrix subtraction

    Matrix addition

    Matrix sweep

    Basic matrixwise operations

    Transposition

    Matrix multiplication

    Multiplying square matrices for social networks

    Outer products

    Using sparse matrices in matrix multiplication

    Matrix inversion

    Solving systems of linear equations

    Determinants

    Triangular matrices

    Matrix decomposition

    QR decomposition

    Eigenvalue decomposition

    Lower upper decomposition

    Cholesky decomposition

    Singular value decomposition

    Applications

    Rasch analysis using linear algebra and a paired comparisons matrix

    Calculating Cronbach's alpha

    Image compression using discrete cosine transform

    Importing an image into R

    The compression technique

    Creating the transformation and quantization matrices

    Putting the matrices together for image compression

    DCT in R

    Summary

    6. Principal Component Analysis and the Common Factor Model

    A primer on correlation and covariance structures

    Datasets used in this chapter

    Principal component analysis and total variance

    Understanding the basics of PCA

    How does PCA relate to SVD?

    Scaled versus unscaled PCA

    PCA for dimension reduction

    PCA to summarize wine properties

    Choosing the number of principal components to retain

    Formative constructs using PCA

    Exploratory factor analysis and reflective constructs

    Familiarizing yourself with the basic terms

    Matrices of interest

    Expressing factor analysis in a matrix model

    Basic EFA and concepts of covariance algebra

    Concepts of EFA estimation

    The centroid method

    Multiple factors

    Direct factor extraction by principal axis factoring

    Performing principal axis factoring in R

    Other factor extraction methods

    Factor rotation

    Orthogonal factor rotation methods

    Quartimax rotation

    Varimax rotation

    Oblique rotations

    Oblimin rotation

    Promax rotation

    Factor rotation in R

    Advanced EFA with the psych package

    Summary

    7. Structural Equation Modeling and Confirmatory Factor Analysis

    Datasets

    Political democracy

    Physical functioning dataset

    Holzinger-Swineford 1939 dataset

    The basic ideas of SEM

    Components of an SEM model

    Path diagram

    Matrix representation of SEM

    The reticular action model (RAM)

    An example of SEM specification

    An example in R

    SEM model fitting and estimation methods

    Assessing SEM model fit

    Using OpenMx and matrix specification of an SEM

    Summarizing the OpenMx approach

    Explaining an entire example

    Specifying the model matrices

    Fitting the model

    Fitting SEM models using lavaan

    The lavaan syntax

    Comparing OpenMx to lavaan

    Explaining an example in lavaan

    Explaining an example in OpenMx

    Summary

    8. Simulations

    Basic sample simulations in R

    Pseudorandom numbers

    The runif() function

    Bernoulli random variables

    Binomial random variables

    Poisson random variables

    Exponential random variables

    Monte Carlo simulations

    Central limit theorem

    Using the mc2d package

    One-dimensional Monte Carlo simulation

    Two-dimensional Monte Carlo simulation

    Additional mc2d functions

    The mcprobtree() function

    The cornode() function

    The mcmodel() function

    The evalmcmod() function

    Data visualization

    Multivariate nodes

    Monte Carlo integration

    Multiple integration

    Other density functions

    Rejection sampling

    Importance sampling

    Simulating physical systems

    Summary

    9. Optimization

    One-dimensional optimization

    The golden section search method

    The optimize() function

    The Newton-Raphson method

    The Nelder-Mead simplex method

    More optim() features

    Linear programming

    Integer-restricted optimization

    Unrestricted variables

    Quadratic programming

    General non-linear optimization

    Other optimization packages

    Summary

    10. Advanced Data Management

    Cleaning datasets in R

    String processing and pattern matching

    Regular expressions

    Floating point operations and numerical data types

    Memory management in R

    Basic R memory commands

    Handling R objects in memory

    Missing data

    Computational aspects of missing data in R

    Statistical considerations of missing data

    Deletion methods

    Listwise deletion or complete case analysis

    Pairwise deletion

    Visualizing missing data

    An overview of multiple imputation

    Imputation basic principles

    Approaches to imputation

    The Amelia package

    Getting estimates from multiply imputed datasets

    Extracting the mean

    Extracting the standard error of the mean

    The mice package

    Imputation functions in mice

    Summary

    Index

    Mastering Scientific Computing with R

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: January 2015

    Production reference: 1270115

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78355-525-3

    www.packtpub.com

    Cover image by Jason Dupuis Mayer (<jdmphoto011@gmail.com>)

    Credits

    Authors

    Paul Gerrard

    Radia M. Johnson

    Reviewers

    Laurent Drouet

    Ratanlal Mahanta

    Mzabalazo Z. Ngwenya

    Donato Teutonico

    Commissioning Editor

    Kartikey Pandey

    Acquisition Editor

    Greg Wild

    Content Development Editor

    Akshay Nair

    Technical Editors

    Rosmy George

    Ankita Thakur

    Copy Editors

    Shivangi Chaturvedi

    Pranjali Chury

    Puja Lalwani

    Adithi Shetty

    Project Coordinator

    Mary Alex

    Proofreaders

    Simran Bhogal

    Martin Diver

    Ameesha Green

    Paul Hindle

    Bernadette Watkins

    Indexer

    Priya Subramani

    Graphics

    Sheetal Aute

    Disha Haria

    Abhinash Sahu

    Production Coordinator

    Conidon Miranda

    Cover Work

    Conidon Miranda

    About the Authors

    Paul Gerrard is a physician and healthcare researcher who is based out of Portland, Maine, where he currently serves as the medical director of the cardiopulmonary rehabilitation program at New England Rehabilitation Hospital of Portland. He studied business economics in college. After completing medical school, he did a residency in physical medicine and rehabilitation at Harvard Medical School and Spaulding Rehabilitation Hospital, where he served as chief resident and stayed on as faculty at Harvard before moving to Portland. He continues to collaborate on research projects with researchers at other academic institutions within the Boston area and around the country. He has published and presented research on a range of topics, including traumatic brain injury, burn rehabilitation, health outcomes, and the epidemiology of disabling medical conditions.

    I would like to thank my beautiful wife, Deirdre, and my son, Patrick. My work on this book is dedicated to the loving memory of Fiona.

    Radia M. Johnson has a doctoral degree in immunology and currently works as a research scientist at the Institute for Research in Immunology and Cancer at the Université de Montréal, where she uses genomics and bioinformatics to identify and characterize the molecular changes that contribute to cancer development. She routinely uses R and other computer programming languages to analyze large data sets from ongoing collaborative projects. Since obtaining her PhD at the University of Toronto, she has also worked as a research associate at the University of Cambridge in Hematology, where she gained experience using systems biology to study blood cancer.

    I would like to thank Dr. Charlie Massie for teaching me to love programming in R and Dr. Phil Kousis for all his support through the years. You are both excellent mentors and wonderful friends!

    About the Reviewers

    Laurent Drouet holds a PhD in economics and social sciences from the University of Geneva, Switzerland, and a master's degree in applied mathematics from the Institute of Applied Mathematics of Angers, France. He was also a postdoctoral research fellow at the Research Lab of Economics and Environmental Management at the Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland. He was also a researcher at the Public Research Center Tudor, Luxembourg. He is currently a senior researcher at Fondazione Eni Enrico Mattei (FEEM) and a research affiliate at Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC), Italy.

    His main research is related to integrated assessment modeling and energy modeling. For more than a decade, he designed scientific tools to perform data analysis for this type of modeling. He also built optimization frameworks to couple models of many kinds (such as climate models, air quality models, and economy models). He created and developed the bottom-up techno-economic energy model ETEM to study optimal energy policies at urban or national levels.

    I want to thank my wife for her support every day both in my private life and professional life.

    Ratanlal Mahanta holds an MSc in computational finance. He is currently working at GPSK Investment Group as a senior quantitative analyst. He has 4 years of experience in quantitative trading and strategy development for sell-side and risk consulting firms. He is an expert in high-frequency and algorithmic trading. He has expertise in these areas: quantitative trading (FX, equities, futures and options, and engineering on derivatives); algorithms (partial differential equations, stochastic differential equations, the finite difference method, Monte Carlo, and machine learning); code (R programming, C++, MATLAB, HPC, and scientific computing); data analysis (Big Data analytics [EOD to TBT], Bloomberg, Quandl, and Quantopian); and strategies (vol-arbitrage, vanilla and exotic options modeling, trend following, mean reversion, co-integration, Monte Carlo simulations, Value at Risk, stress testing, buy-side trading strategies with high Sharpe ratio, credit risk modeling, and credit rating).

    He has reviewed Mastering R for Quantitative Finance, Packt Publishing. He is currently reviewing two other books for Packt Publishing: Mastering Python for Data Science and Machine Learning with R Cookbook.

    Mzabalazo Z. Ngwenya holds a postgraduate degree in mathematical statistics from the University of Cape Town. He has worked extensively in the field of statistical consulting, where he has used a variety of statistical software, including R. His areas of interest are primarily centered around statistical computing. Previously, he was involved in reviewing Learning RStudio for R Statistical Computing, Mark P.J. van der Loo and Edwin de Jonge; R Statistical Application Development Example Beginner's Guide, Prabhanjan Narayanachar Tattar; Machine Learning with R, Brett Lantz; R Graph Essentials, David Alexandra Lillis; and R Object-oriented Programming, Kelly Black, all by Packt Publishing. He currently works as a biometrician.

    Donato Teutonico has several years of experience in modeling and the simulation of drug effects and clinical trials in industrial and academic settings. He received his PharmD degree from the University of Turin, Italy, specializing in chemical and pharmaceutical technology, and his PhD in pharmaceutical sciences from Paris-Sud University, France.

    He is the author of two R packages for pharmacometrics, CTStemplate and panels-for-pharmacometrics, which are both available on Google Code. He is also the author of Instant R Starter, Packt Publishing.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    As an open source computing environment, R is rapidly becoming the lingua franca of the statistical computing community. R's base functions, statistical tools, open source nature, and avid user community have given it an expansive library of powerful, cutting-edge quantitative methods not yet available to users of high-cost statistical programs.

    With this book, you will learn not just about R, but how to use R to answer conceptual, scientific, and experimental questions.

    Beginning with an overview of fundamental R concepts, including data types, R program flow, and basic coding techniques, you'll learn how R can be used to achieve the most commonly needed scientific data analysis tasks, including testing for statistically significant differences between groups and modeling relationships in data. You will also learn parametric and nonparametric techniques for both difference testing and relationship modeling.

    You will delve into linear algebra and matrix operations with an emphasis not on the R syntax, but on how these operations can be used to address common computational or analytical needs. This book also covers the application of matrix operations to finding structure in high-dimensional data using principal component analysis, exploratory factor analysis, and confirmatory factor analysis, in addition to structural equation modeling. You will also master methods for simulation, learn about optimization, and finish by going to the next level with advanced data management focused on the messy and problematic datasets that serious analysts deal with daily.

    By the end of this book, you will be able to undertake publication-quality data analysis in R.

    What this book covers

    Chapter 1, Programming with R, presents an overview of how data is stored and accessed in R. Then, we will go over how to load data into R using built-in functions and useful packages for easy import from Excel worksheets. We will also cover how to use flow control statements and functions to reduce complexity and help you program more efficiently.

    Chapter 2, Statistical Methods with R, presents an overview of how to summarize your data and get useful statistical information for downstream analysis. We will show you how to plot and get statistical information from probability distributions and how to test the fit of your sample distribution to well-defined probability distributions.

    Chapter 3, Linear Models, covers linear models, which are probably the most commonly used statistical methods to study the relationships between variables. The Generalized linear model section will delve into a bit more detail than typical R books, discussing the nature of link functions and canonical link functions.

    Chapter 4, Nonlinear Methods, reviews applications of nonlinear methods in R using both parametric and nonparametric methods for both theory-driven and exploratory analysis.

    Chapter 5, Linear Algebra, covers linear algebra techniques in R, including operations such as transposition, inversion, matrix multiplication, and a number of matrix transformations.

    Chapter 6, Principal Component Analysis and the Common Factor Model, helps you understand the application of linear algebra to covariance and correlation matrices. We will cover how to use PCA to account for total variance in a set of variables and how to use EFA to model common variance among these variables in R.

    Chapter 7, Structural Equation Modeling and Confirmatory Factor Analysis, covers the fundamental ideas underlying structural equation modeling, which are often overlooked in other books discussing SEM in R, and then delves into how SEM is done in R.

    Chapter 8, Simulations, explains how to perform basic sample simulations and how to use simulations to answer statistical problems. We will also learn how to use R to generate random numbers, and how to simulate random variables from several common probability distributions.

    Chapter 9, Optimization, explores methods and techniques to optimize a variety of functions. We will also cover how to use a wide range of R packages and functions to set up, solve, and visualize different optimization problems.

    Chapter 10, Advanced Data Management, walks you through the basic techniques for data handling and some basic memory management considerations.

    What you need for this book

    The software that we require for this book is R Version 3.0.1 or higher, OpenMx Version 1.4, and RStudio.

    Who this book is for

    If you want to learn how to quantitatively answer scientific questions for practical purposes using the powerful R language and the open source R tool ecosystem, this book is ideal for you. It is best suited to scientists who understand scientific concepts, know a little R, and want to start applying R to answer empirical scientific questions. Some R exposure is helpful, but not compulsory.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: You can also retrieve additional information on the objects stored in your environment using the str() function.

    A block of code is set as follows:

    > integer_vector <- c(1L, 2L, 12L, 29L)

    > integer_vector

    [1]  1  2 12 29

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: To install R on Windows, click on Download R for Windows, and then click on base for the download link and installation instructions.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output.

    You can download this file from: https://www.packtpub.com/sites/default/files/downloads/5253OS_ColoredImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.

    Chapter 1. Programming with R

    Scientific computing is an informatics approach to problem solving using mathematical models and/or applying quantitative analysis techniques to interpret, visualize, and solve scientific problems. Generally speaking, scientists and data analysts are concerned with understanding certain phenomena or processes using observations from an experiment or through simulation. For example, a biologist may want to understand what changes in gene expression are required for a normal cell to become a cancerous cell, or a physicist may want to study the life cycle of galaxies through numerical simulations. In both cases, they will need to collect the data, and then manipulate and process it before it can be visualized and interpreted to answer their research question. Scientific computing is involved in all these steps.

    R is an excellent open source language for scientific computing. R is widely used in industry and academia because it offers great value for its performance and provides a cutting-edge software environment. It was initially designed as a software tool for statistical modeling but has since evolved into a powerful tool for data mining and analytics. In addition to its rich collection of classical numerical methods, there are also hundreds of R packages for a wide variety of scientific computing needs, such as state-of-the-art visualization methods, specialized data analysis tools, machine learning, and even packages such as Shiny to build interactive web applications. In this book, we will teach you how to use R and some of its packages to define and manipulate your data using a variety of methods for data exploration and visualization. This book will present state-of-the-art mathematical and statistical methods needed for scientific computing. We will also teach you how to use R to evaluate complex arithmetic expressions and to perform statistical modeling. We will also cover how to deal with missing data and the steps needed to write your own functions tailored to your analysis requirements. By the end of this book, you will not only be comfortable using R and its many packages, but you will also be able to write your own code to solve your own scientific problems.

    This first chapter will present an overview of how data is stored and accessed in R. Then, we will look at how to load your data into R using built-in functions and useful packages, in order to easily import data from Excel worksheets. We will also show you how to transform your data using the reshape2 package to make your data ready to graph by plotting functions such as those provided by the ggplot2 package. Next, you will learn how to use flow-control statements and functions to reduce complexity, and help you program more efficiently. Lastly, we will go over some of the debugging tools available in R to help you successfully run your programs in R.

    The following is a list of the topics that we will cover in this chapter:

    Atomic vectors

    Lists

    Object attributes

    Factors

    Matrices and arrays

    Data frames

    Plots

    Flow control

    Functions

    General programming and debugging tools

    Before we begin our overview of R data structures, if you haven't already installed R, you can download the most recent version from http://cran.r-project.org. R compiles and runs on Linux, Mac, and Windows, and precompiled binaries are available for each, so you can download the one that matches your computer. For example, go to http://cran.r-project.org, click on Download R for Linux, and then click on ubuntu to get the most up-to-date instructions to install R on Ubuntu. To install R on Windows, click on Download R for Windows, and then click on base for the download link and installation instructions. For Mac OS users, click on Download R for (Mac) OS X for the download links and installation instructions.

    In addition to the most recent version of R, you may also want to download RStudio, an integrated development environment whose user interface makes learning R easier and more fun. The main limitation of RStudio is that it has difficulty loading very large datasets, so if you are working with very large tables, you may want to run your analysis in R directly. That being said, RStudio lets you inspect the objects stored in your workspace at the click of a button, and you can easily search help pages and packages by clicking on the appropriate tabs. Essentially, RStudio puts everything you need to analyze your data at your fingertips. The following screenshot is an example of the RStudio user interface running the code from this chapter:

    You can download RStudio for all platforms at http://www.rstudio.com/products/rstudio/download/.

    Finally, the font conventions used in this book are as follows. The code you should type directly into R is preceded by >, and any text preceded by # will be treated as a comment by R.

    > The user will type this into R

    This is the response from R

    > # If the user types this, R will treat it as a comment

    Note

    Note that all the code written in this book was run with R Version 3.0.2.

    Data structures in R

    R objects can be grouped into two categories:

    Homogeneous: This is when the content is of the same type of data

    Heterogeneous: This is when the content contains different types of data

    Atomic vectors, matrices, and arrays are data structures that are used to store homogeneous data, while lists and data frames are typically used to store heterogeneous data. R objects can also be organized based on the number of dimensions they contain. For example, atomic vectors and lists are one-dimensional objects, whereas matrices and data frames are two-dimensional objects. Arrays, however, are objects that can have any number of dimensions. Unlike other programming languages such as Perl, R does not have scalar or zero-dimensional objects. All single numbers and strings are stored in vectors of length one.
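
    As a quick check of the dimensionality described above, the following minimal sketch inspects a few objects with length() and dim(). The object names (single_number, v, l, m, df, a) are our own illustrative choices and do not come from the chapter's example code.

```r
# A single number is a numeric vector of length one, not a scalar.
single_number <- 42
is.vector(single_number)   # TRUE
length(single_number)      # 1

# Atomic vectors and lists are one-dimensional: they have a length
# but no dim attribute, so dim() returns NULL.
v <- c(1.5, 2.5, 3.5)
l <- list(1.5, "a", TRUE)
dim(v)                     # NULL
dim(l)                     # NULL

# Matrices and data frames are two-dimensional (rows, columns).
m <- matrix(1:6, nrow = 2)
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
dim(m)                     # 2 3
dim(df)                    # 3 2

# Arrays can have any number of dimensions; here, three.
a <- array(1:24, dim = c(2, 3, 4))
dim(a)                     # 2 3 4
```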

    Atomic vectors

    Vectors are the basic data structure in R and include atomic vectors and lists. Atomic vectors are flat and can be logical, numeric (double), integer, character, complex, or raw. To create a vector, we use the c() function, which means combine elements into a vector:

    > x <- c(1, 2, 3)

    To create an integer vector, append L to each number, as follows:

    > integer_vector <- c(1L, 2L, 12L, 29L)

    > integer_vector

    [1]  1  2 12 29
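
    To see the difference the L suffix makes, we can ask R how each vector is stored. This is a small sketch using typeof(), is.numeric(), and is.integer(); it simply recreates the two vectors defined earlier.

```r
x <- c(1, 2, 3)                        # no L suffix: stored as double
integer_vector <- c(1L, 2L, 12L, 29L)  # L suffix: stored as integer

typeof(x)                   # "double"
typeof(integer_vector)      # "integer"

# Both count as numeric, so is.numeric() alone cannot tell them apart.
is.numeric(x)               # TRUE
is.numeric(integer_vector)  # TRUE
is.integer(x)               # FALSE
is.integer(integer_vector)  # TRUE
```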

    To create a logical vector, add TRUE (T) and FALSE (F), as follows:

    > logical_vector <- c(T, TRUE, F, FALSE)

    > logical_vector

    [1]  TRUE  TRUE FALSE FALSE
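
    One caveat is worth a short illustration: TRUE and FALSE are reserved words, whereas T and F are ordinary variables that merely start out bound to TRUE and FALSE. The sketch below deliberately overwrites T to show the effect; the names shadowed_vector and restored_vector are hypothetical and used only for this demonstration.

```r
# Assigning to T is legal and silently changes what T means.
T <- 0
shadowed_vector <- c(T, TRUE, F, FALSE)
shadowed_vector            # no longer a logical vector
typeof(shadowed_vector)    # "double"

# Removing our own T makes base R's T (which is TRUE) visible again.
rm(T)
restored_vector <- c(T, TRUE, F, FALSE)
typeof(restored_vector)    # "logical"
```

    For this reason, spelling out TRUE and FALSE is the safer habit in scripts and functions.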

    To create a vector containing strings, simply add the words/phrases in double quotes:

    > character_vector <- c("Apple", "Pear", "Red", "Green", "These are my favorite fruits and colors")

    > character_vector

    [1] "Apple"

    [2] "Pear"

    [3] "Red"

    [4] "Green"

    [5] "These are my favorite fruits and colors"

    > numeric_vector <- c(1, 3.4, 5, 10)

    > numeric_vector

    [1]  1.0  3.4  5.0 10.0

    R also includes functions that allow you to create vectors containing repetitive elements with rep() or a sequence of numbers with seq():

    > seq(1, 12, by=3)

    [1]  1  4  7 10

    > seq(1, 12) #note the default parameter for by is 1

    [1]  1  2  3  4  5  6  7  8  9 10 11 12

    Instead of using the seq() function, you can also use a colon, :, to indicate that you would like numbers 1 to 12 to be stored as a vector, as shown in the following example:

    > y <- 1:12

    > y

    [1]  1  2  3  4  5  6  7  8  9 10 11 12

    > z <- c(1:3, y)

    > z

    [1]  1  2  3  1  2  3  4  5  6  7  8  9 10 11 12
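
    Beyond the by argument, seq() accepts a few other arguments that are often useful, and base R also provides the helpers seq_len() and seq_along(). The following is a brief sketch of these; the variable names are our own.

```r
# Ask for a fixed number of evenly spaced values instead of a step size.
quarters <- seq(0, 1, length.out = 5)
quarters                   # 0.00 0.25 0.50 0.75 1.00

# A negative step counts downwards.
countdown <- seq(10, 1, by = -3)
countdown                  # 10 7 4 1

# seq_along() returns the indices of an existing vector.
fruits <- c("Apple", "Pear", "Plum")
seq_along(fruits)          # 1 2 3

# seq_len(n) is safer than 1:n when n may be zero:
n <- 0
length(seq_len(n))         # 0, an empty integer vector
1:n                        # 1 0, because the colon operator counts down
```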

    To replicate elements of a vector, you can simply use the rep() function, as follows:

    > x <- rep(3, 14)

    > x

    [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3

    You can also replicate complex patterns as follows:

    > rep(seq(1, 4), 3)

    [1] 1 2 3 4 1 2 3 4 1 2 3 4
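
    The rep() function also takes the times, each, and length.out arguments, which control how the pattern is repeated. The short sketch below compares them on the same input; the variable names are illustrative only.

```r
# times repeats the whole vector.
whole <- rep(1:4, times = 3)
whole                      # 1 2 3 4 1 2 3 4 1 2 3 4

# each repeats every element before moving on to the next one.
grouped <- rep(1:4, each = 3)
grouped                    # 1 1 1 2 2 2 3 3 3 4 4 4

# A vector of times gives a separate count for each element.
uneven <- rep(1:3, times = c(3, 1, 2))
uneven                     # 1 1 1 2 3 3

# length.out recycles the pattern until the requested length is reached.
clipped <- rep(1:4, length.out = 6)
clipped                    # 1 2 3 4 1 2
```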

    Atomic vectors can only be of one type, so if you mix elements of different types, your vector will be coerced into the most flexible type present. From the most to the least flexible, the vector types are character, numeric, integer, and logical, as shown in the following diagram:
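
    The ordering can also be checked directly in R. The following minimal sketch combines values one step apart in the hierarchy and inspects the result with typeof(); the values are arbitrary and chosen only for illustration.

```r
# Each call mixes two types; the result takes the more flexible one.
typeof(c(TRUE, 2L))        # "integer"   (logical + integer)
typeof(c(2L, 3.4))         # "double"    (integer + numeric)
typeof(c(3.4, "Apple"))    # "character" (numeric + character)

# Mixing all four types yields a character vector.
all_mixed <- c(TRUE, 2L, 3.4, "Apple")
all_mixed                  # "TRUE" "2" "3.4" "Apple"
```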

    This means that if you mix numbers with strings, your vector will be coerced into a character vector, which is the more flexible type of the two. The following two examples show this coercion in practice. In the first example, a character vector and a numeric vector are combined, and the new object is a character vector because the character type is more flexible than the numeric type. Similarly, in the second example, the class of the new object x is numeric because the numeric type is more flexible than the integer type. The two examples are as follows:

    Example 1:

    > mixed_vector <- c(character_vector, numeric_vector)

    > mixed_vector

    [1] "Apple"

    [2] "Pear"

    [3] "Red"

    [4]
