
This video is about summarizing data.

In the previous videos, we talked a little bit about how we would get data from files, either on the internet or files that we have on our local computer. Once we have the data loaded into R, we want to do some sort of summarization to see whether the data have any problems or to identify characteristics of the data that might be useful to us during the analysis. So, why do we summarize data? The first thing to keep in mind is that the data are almost always too big to look at the whole thing. Except in very extreme circumstances, it's very difficult to just eyeball the entire data set and see interesting patterns or potential problems with the data. Since the first step is to find those problems, or to find issues that are interesting to look at downstream, you definitely need to summarize the data in ways that make it possible to identify those patterns. When you do these summaries, some things you might be looking out for are missing values and values that are outside of the expected ranges. For example, if you're measuring temperature in Celsius in Baltimore and you see a measurement of 250, that's probably a little bit high, so you should look for those sorts of things. And lest you think those sorts of things never happen, there are almost always at least one or two crazy values in every data set that I've seen. You might also look for values that seem to be in the wrong unit, say if most of the measurements are in Celsius and one measurement is in Fahrenheit. You also want to look for mislabeled variables or columns, and for variables that are the wrong class, so variables that look like they should be quantitative but are actually labeled as character variables, and so forth. So, we're going to talk a little bit about the ways that you can summarize data. Again, this is not comprehensive; there are a large number of ways that you can summarize data, and depending on the type of data that you're looking at, some will be better than others.

So, I'm going to give you an overview of the basic and most useful ways to summarize data, and if you need to summarize data in other ways, the best way to do that is to look at the data type that you're summarizing and search on Google; seriously, that's the best way to do it. So, this is an earthquake data set that we're going to be using to illustrate some of these ideas. It's available from data.gov. This is another one of those examples where the time that you download the data set really matters. This data set is actually updated every week, and it only covers the earthquakes for the past seven days. So, if you're running these slides at some unspecified time in the future, seven days after I created them, some of the exact numbers you'll see are going to be a little bit different. That's just something to keep in mind when you're running these slides or looking at these commands; if you get something slightly different, it might just be because you ran it at a different time. So, this is the URL for the data set that we're going to be looking at. Again, we can use the download.file command that we learned about in the getting data lectures. We pass the file URL to download.file, save the data set to the earthquakeData.csv file, and also record the date that it was downloaded. These slides were created at this time on Sunday, January 27th, 2013, so again, if it's seven days beyond that, you're going to get a slightly different data set. Then we can read in the resulting csv file using read.csv, and now we have it stored in the eData variable. So again, the purpose of summarizing is that it's very hard to look at the whole data set. If I just type eData and hit Return in R after loading it in, I get a very long data frame.
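A minimal sketch of those steps, assuming the variable fileUrl holds the data.gov address of the seven-day earthquake feed (the exact URL isn't reproduced in this transcript, so the one below is only a placeholder):

fileUrl <- "https://..."                       # placeholder for the data.gov URL
download.file(fileUrl, destfile = "earthquakeData.csv")
dateDownloaded <- date()                       # record when the file was downloaded
eData <- read.csv("earthquakeData.csv")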

It gives me the variables across the columns, and in the rows are each of the observations, each one corresponding to a specific earthquake. So, we get the source, the earthquake ID, the version, the date and time, and you can't actually see all the other variables that are being output as well; they fall off the screen here. So, looking at the full data set is not a viable option for understanding potential patterns in the data. Here are some of the very first things that you should always run when you load a data set into R. First, you look at the dimensions of the data frame. In this case, I did dim(eData) and I see that there are about 1,000 earthquakes, 1,057 rows exactly, and 10 columns. One reason I always run this as one of the first commands is that if I know there are 11 variables and I only see 10 here, then there was a problem reading the data into R. Similarly, if I know there should be, say, 10,000 rows and I only see 1,057, which often happens if the data are stored in a weird format, you can detect that the data have been read in incorrectly from this very simple summary. The next thing you can do is look at the names of the variables in the data frame; again, you apply names to eData and get the list of names of the 10 variables. These should be the variable names you're expecting, in this case for the earthquake data. You can also look specifically at the number of rows or the number of columns in the data set. dim gives you both the row number, that's the first number, and the column number, that's the second number, or you can get them individually using the nrow or ncol commands.
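A minimal sketch of those first checks, assuming eData has been loaded as above:

dim(eData)     # number of rows and columns, e.g. 1057 and 10
names(eData)   # the names of the 10 variables
nrow(eData)    # number of rows only
ncol(eData)    # number of columns only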

So then, there are some other ways you can start summarizing the data once you've looked at the very basics in terms of the size and shape of the data set that you've loaded in. One of these, for quantitative variables, is to look at the quantiles of the data. The quantiles are sort of like percentiles: you can imagine, if you took the SAT and you were at the 99th percentile, then 99% of the people who took the SAT that year got a lower score than you did. It's the same sort of thing for the quantiles. So, 0% of the values are less than -61 for the eData latitude variable. When I apply quantile to the eData latitude, I get the 0th, 25th, 50th, 75th, and 100th percentiles. This gives me an idea of the range of values that I observe and where the middle of the values is, so I can use it to identify whether some of the values are really outside of the expected range. If I saw, for example, a latitude of 5,064, I'd know that it was either measured on some very different scale or it's an incorrect value. You can also apply the summary command to the entire data frame. When you do that, you get quantile information for the quantitative variables, but you also get other information for the other variables, say, for qualitative variables. For example, the source variable in the data frame is not a quantitative variable; it has different characters corresponding to different detectors, and in this case most of the earthquakes were detected with the ak detector, and there are 330 of them that correspond to that. So, summary describes both the quantitative and qualitative variables for you, and you can get a first-glance look at what the data set looks like and whether you notice any particular problems.
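A minimal sketch of those two summaries, assuming the latitude column is named Lat, as it is later in this lecture:

quantile(eData$Lat)   # 0th, 25th, 50th, 75th and 100th percentiles of latitude
summary(eData)        # quantiles for quantitative variables, counts for qualitative ones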

The other thing that's very useful is to determine whether variables that should be characters are being loaded by R as numeric variables, or vice versa; the more likely scenario is that a numeric variable is loaded as a character variable. So, you can check for that. First, you can look at the class of the entire data frame, and of course it comes up as a data frame. Then, you can look at the class of each individual column. To do this, there's a slightly tricky way of doing it: you look at the first row of the data frame. This is the eData frame, and we select just the first row using the comma here; if you put a number before the comma, it selects a row, and if you put a number after the comma, it selects a column. So here, we've selected the first row of the eData data set, and what we would like to do is apply the class function to every single element of that first row, which we can do with the sapply function. What sapply does is run along every value in this vector and apply the function to it. So, we see that for the source variable we get a factor, for the earthquake ID we get a factor, and so forth. For latitude and longitude we get numeric variables, as well as for magnitude. These are all what we were expecting. So, this is another way to determine whether the data have been loaded in properly and whether the variables were loaded the way you expect them to be. The next step is to start looking at the actual values that you see for different variables, and a couple of very useful functions there are unique, length, and table. One example is to look at the unique values. Some variables, particularly qualitative variables, will only take a certain number of unique values, whereas quantitative variables might have entirely unique values. So, we're looking at this qualitative variable source, and when we look at the unique values, you can see listed here all the values that the variable takes. This is a way of summarizing a qualitative variable very succinctly.
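A minimal sketch of those checks, assuming the source column is named Src, as it is elsewhere in this lecture:

class(eData)               # the whole object is a data.frame
sapply(eData[1, ], class)  # the class of each column, using the first row
unique(eData$Src)          # the distinct values the source variable takes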

And if you see that there are classes of that variable that should not be there, you can start exploring them further. You can also look at the length of the unique values for a particular variable. So again, we've taken the unique values for this variable source and looked at the length of that, and we see that there are 11 unique values for source. This is another way of succinctly summarizing how many values you see, and if you expected to see more or fewer, you can quickly assess that you have a problem with the data. You can also do a table of a qualitative variable. If you do a table of a quantitative variable, you're going to get a very big table, because every value will be unique and you'll get exactly one count for each of the categories. But for a qualitative variable, if you do a table of it, in this case eData$Src, you can see that each of the unique values is listed and underneath each unique value is the number of times that it appears. Remember, in summary we saw that ak appeared 330 times in the source variable, and so again, when we take the table of that source variable, we see that it appears 330 times. But we also see, for all the other values the variable can take, the number of times it takes that value. So, this gives you an idea of the distribution of a qualitative variable. The table command is actually more flexible than just letting you look at single variables. Suppose we want to look at the relationship between the source variable and the version variable for this data set; we can do table of the first variable, eData$Src, comma, the second variable, eData$Version, and we get a two-dimensional table. What this table shows, first, along the rows, are the values of the source variable, ak, ci, and so forth. Along the columns, we see the different versions that you can have: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, but also A, B, D, and E. And then, what you see is the count in each cell of this table.
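A minimal sketch of those tabulations, again assuming the columns are named Src and Version:

length(unique(eData$Src))        # how many distinct sources there are, 11 here
table(eData$Src)                 # counts for each source value
table(eData$Src, eData$Version)  # two-dimensional table of source by version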

For example, 211 rows of the data frame have the source variable equal to ak and the version variable equal to 2, and it's the same for each of the other cells. So, you can see the relationship between these two variables, and see, for example, that most of the values seem to occur among a small number of detectors and a small number of versions. You can also see places where there are no values, for example where particular sources, ak, ci, and so forth, do not have any values that come from certain versions. Another way you can look at data, in addition to table and unique, is with any and all. any and all are particularly useful when looking at missing data, but also if you want to check whether some particular characteristic exists for any of the values of a variable in a data set. For example, if we look at the latitude data, eData$Lat, and look at the first ten values, this just subsets to the first ten of those latitude values, and we see them listed here. Suppose we want to see which of those values are greater than 40. It's kind of hard to eyeball it directly, but you can define a logical variable, and the way you do that is eData$Lat[1:10] > 40. What that does, for every value, is check whether it's greater than 40 or not, and if it is greater than 40, it reports TRUE. For example, the third value is 65, which is greater than 40, so it's TRUE, and if a value is less than 40, in this case 38.83, then it reports FALSE for that value. So now, we have a new vector that's the same length as the original vector of ten latitudes, and it tells us which ones are greater than 40. Then, if we want to see whether any of them are greater than 40, we can just ask whether any of these values are greater than 40, and it tells us TRUE.
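A minimal sketch of that logical check, using the Lat column from the earthquake data:

eData$Lat[1:10]            # the first ten latitude values
eData$Lat[1:10] > 40       # TRUE wherever a latitude exceeds 40
any(eData$Lat[1:10] > 40)  # TRUE if at least one value exceeds 40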

So sometimes, what you're looking for is just whether, for some variable, there are any values that have a particular characteristic, and you can use the any command to check whether that's true. The all command, on the other hand, checks whether all values have some property. For example, we again define the logical vector of whether each of those latitude values is greater than 40, and then we can see whether all of the values are greater than 40 by applying all to that vector. In this case, we actually get FALSE, because a large number of the entries are equal to FALSE; the only way this would return TRUE is if every single value of the vector were TRUE. So, these two functions, any and all, allow you to evaluate whether there are particular patterns in the data set, particularly whether a pattern affects all of the values of a variable or any one of them. The next thing we can do is subset the values, and we can do this in more complicated ways than we saw in the original lectures. One example is that we can use the ampersand sign to do "and" operations. What we're going to do here is look at the data frame eData and take the columns named latitude and longitude for this data set. We want to subset to the columns with those names, which is why they appear after the comma. Before the comma, we specify the rows we're going to subset to, and to do that we again define logical vectors: eData$Lat > 0 will be equal to TRUE whenever the latitude is greater than zero and FALSE whenever it's less than or equal to zero. Similarly, we can define the same sort of thing for longitude: a logical vector that's equal to TRUE whenever the longitude is greater than zero and FALSE whenever the longitude is less than or equal to zero.

Then, we want to find all the cases where both latitude and longitude are greater than zero. To do that, we just stick an ampersand in between the two logical vectors, and what you get out is the set of rows where both the latitude and the longitude are greater than zero. Another case you might want to look at is where either the latitude or the longitude is greater than zero, so at least one of those two things has to be true. Here, we use the or symbol to ask whether either the latitude or the longitude is greater than zero. In this case, you see some rows where latitude is positive and longitude is negative, and some the other way around, where longitude is positive and latitude is negative, but one or the other of the two conditions has to hold: either the latitude is positive or the longitude is positive. So, you don't see any cases where both the latitude and the longitude take on negative values.
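A minimal sketch of the all command and these logical subsets, assuming the latitude and longitude columns are named Lat and Lon:

all(eData$Lat[1:10] > 40)   # TRUE only if every one of the ten values exceeds 40
# rows where both latitude and longitude are positive, keeping only those two columns
eData[eData$Lat > 0 & eData$Lon > 0, c("Lat", "Lon")]
# rows where at least one of latitude or longitude is positive
eData[eData$Lat > 0 | eData$Lon > 0, c("Lat", "Lon")]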

So now, after we've looked at a couple of different ways to subset the data, look at unique values, and so on, we're going to look at another data set. This is a data set that was put together for a paper I wrote a couple of years ago on submissions and reviews in an experiment. In this experiment, people solved problems, like SAT problems, and submitted them to a computer. The computer then randomly assigned them to other people to review, and the reviewers could say that a solution was either correct or incorrect, and from that we can learn a little bit about the peer review system. This is particularly relevant because your data analyses will be graded with a peer review system. And we learned that cooperation between peer reviewers and authors increases the accuracy of the review process. So, we're going to look at this data because it will show us a couple of other ways that we can manipulate data sets, look at summaries, and figure out how they're working. Here, we need to download two data sets, which are on Dropbox, so we've assigned the two URLs for the two data sets, and then we download the two files using the same methodology as before. Then we can read them in; they're both csv files, so we use read.csv to read the two files in, and we can look at the top of those files. Here is the top of the reviews file: it has an ID, a solution ID, a reviewer ID, a start and stop time, and so forth, and you can see that they're all quantitative variables. We also look at the head of the solutions file; again, it has an ID, but now a problem ID, and then some of the same variables we saw before. One thing we might want to do is determine whether there are any missing values, and one way to do that is to use the is.na function. Suppose we want to look at the reviews time_left variable; we can look at the first 10 values of that variable and see which of them are NA. If you apply is.na to a vector, it will look at every value one at a time and tell you whether that value is NA, that is, missing. In this case, the eighth value of the time_left variable is a missing value, and all the rest are FALSE because they're not NA values. The other thing we can do, given this logical vector defined by applying is.na to the entire time_left vector, is use sum to calculate the total number of times that you see an NA value. Remember, TRUE means the value is missing, an NA value, and if you take the sum of a logical vector, it counts up the number of times you see TRUE. So, you see 84 missing values for this reviews$time_left variable. And indeed, if you do a table of is.na(reviews$time_left), you get a table of this logical vector of whether each value is missing or not.

You see that 84 of the time it's missing and 115 of the time it is not missing. An important issue about dealing with tables and missing values is illustrated with this next example. Here, I've created a vector that has nothing to do with the previous experiment; it has the values 0, 1, 2, 3, NA, 3, 3, 2, 2, 3, with NA being the missing value. If I type table of that vector, I see the number of times that 0, 1, 2, and 3 appear, but I don't see the number of times that the missing indicator appears. That's because one of the options, the useNA option, is set by default not to show NAs. If you run table on that exact same vector but set the useNA option to "ifany", then if there are NA values, you will see the NA value appear here as well; there's one missing value in that vector. That's just an important little trick to remember: if you're looking at the number of values in a vector and you want to make sure you see the missing values as well, you need to change the useNA parameter that you're passing to the table function.
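A minimal sketch of those missing-value checks, using the time_left variable from the reviews data:

is.na(reviews$time_left[1:10])   # TRUE wherever a value is missing
sum(is.na(reviews$time_left))    # total number of missing values, 84 here
table(is.na(reviews$time_left))  # counts of FALSE and TRUE
table(c(0, 1, 2, 3, NA, 3, 3, 2, 2, 3), useNA = "ifany")  # include NA in the counts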

So, another thing that you can do is summarize by rows and columns. Rather than summarizing individual variables at the level of a table, you can just look at the sum of all the values in a particular column, or the mean of all the values in a particular column. This can be useful when you're checking whether any variables have an unusually high or unusually low mean; it really only applies to quantitative variables. Since we're using only quantitative variables in these reviews, we can take the column sums. The column sums include the sum of all the reviewer IDs, which in this case might not be a very useful number. But you see that the column sums for reviews are NA for the start, stop, time_left, and accept variables, and that's because if there are any NA values, the sum will always be equal to NA. So, you might need to use the na.rm parameter to ignore the NA values. For example, if you take the column means of the same reviews data frame and set na.rm=TRUE, it takes the mean of each column by completely skipping any values that are NA for that variable. So, for example, for the start variable, it takes the mean of the start values while ignoring any that are equal to NA, and you end up with these values for each of the column means, ignoring the NA values. You can also do the same thing for row means; all this does is, instead of getting a mean for each column, get a mean for each row. Again, you might need to set na.rm=TRUE, because otherwise any row with an NA will get a value of NA when you apply rowMeans to it. So, I know this was a super quick summary of ways that you can summarize data, but this is the first pass in a data analysis: you almost always run one or several of these functions to get a feel for the shape, the structure, the number of NAs, and so forth in the data set. It also lets you summarize a little bit the distributions of quantitative variables, using quantile and things like that. The next thing we're going to talk about is data munging, which is going to be a key component of any data analysis; it's usually performed after summarizing the data sets, but it can also be performed before summarizing them.
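For reference, a minimal sketch of those column and row summaries, assuming the reviews data frame was loaded as above and contains only quantitative variables:

colSums(reviews)                 # NA for any column that contains missing values
colMeans(reviews, na.rm = TRUE)  # mean of each column, ignoring NA values
rowMeans(reviews, na.rm = TRUE)  # mean of each row, ignoring NA values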
