
BIOL 322 Fall 2017: Analysis of single cell gene expression data using R programming

Introduction: The dataset we are using comes from a developmental model of mouse inner ear neuron development.
Neurons of the developing inner ear were fluorescently labeled red using a genetic regulation system called the
Cre/lox system. The end result of this genetic manipulation is that any cell that expressed the Neurog1 gene at any
point during development now expresses the red fluorescent protein called "tomato". We can isolate these cells
through a method called flow cytometry. Individual cells were collected, RNA was isolated, cDNA was synthesized, and
expression of 192 genes was analyzed by qRT-PCR. In total, 513 cells were analyzed for all 192 genes, resulting in
98,496 data points. As you can imagine, trying to analyze this much data can be really daunting. We will use the R
programming language to visualize and analyze this amount of data using a specific clustering technique called
principal component analysis. Follow the directions below to visualize and ask questions about some of the genes that
are involved in inner ear neuron development.

It is important to start your analysis with a clean and normalized dataset of expression values.

The data table for analysis is ordered in the following way. Column 1 is the date of the experiment, Column 2 is the
cell number (identifier), Column 3 is the age of the animal from which the cell was taken, Column 4 is the somite
stage (another way of defining developmental age). The rest of the table is numerical data representing all the genes
that were analyzed. R requires the “.csv” format for data import into the program.
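As a purely hypothetical illustration (the real dates, cell identifiers, ages, and values will differ), the first rows of such a ".csv" file could look like this:

```text
date,cell.number,age,somite.range,Neurog1,Isl1
09-05-17,cell_001,E10.5,9-12,4.8,0.0
09-05-17,cell_002,E10.5,9-12,0.0,6.2
```

The four metadata columns come first, followed by one column of Log2Ex values per gene.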

Starting RStudio and importing your data


1) Generate a working directory for this specific project. Any appropriate location on your computer is fine.

2) Copy your dataset (“.csv” file) into the working directory.

3) Rename your dataset to “data.csv”.

4) Open RStudio.

5) File -> New Project…

6) Now select “Create project from Existing Directory” and pick the directory that you just created.

7) RStudio should now have 3 windows open: the Console is on the left (this is where you can type commands); the
other two windows on the right will come into play later.

8) File -> New File -> R Script. This creates a new window (called the editor) that opens on the left side above the
console. Now 4 windows are open, which is the standard environment you will get used to.

The editor holds your R script; it is your lab book for this data analysis project. Here you write or paste your code, and
you annotate your commands so that everything you do to your data is documented.

9) Save your (empty) R script file into your working directory (either via File -> Save or by clicking on the disk icon).
Frequent saving of your R script ensures that it is protected from occasional crashes of the program.

10) Now have a look at your RStudio project window. Upper left is the editor, lower left is the console, upper right
is the "Environment", which will harbor all your data and variables, and the bottom right window is where you can
find lots of information such as your files, your plots, installed packages, and help.

11) Let's type the following line into the console after the > prompt (don't type the ">" sign):
> getwd()
You will see that the console returns a line that lists the file path to your working directory. If you ever want to change
the working directory, you can do this with: setwd("/your_directory").
Let’s import the example data. Copy the following lines of code into your editor.
# Load data --> input format = csv file with columns = genes, rows = samples, column names = TRUE, row names = FALSE
data <- read.csv('data.csv', header = T, sep = ",", as.is = T)

You will notice that the pasted text is syntax-highlighted by the editor. The first line, which starts with a hash sign (#), is
shown in green because it is interpreted as a comment or annotation. Use hash-sign comments to document your R script
- this is how you use the script as your lab book.

The second line is code for importing your data.csv file into a so-called data frame variable named data. The
arguments passed to the read.csv command are header (TRUE), separator (comma), and "as is" (TRUE),
which means that the data is not converted into a specific format. This is needed because the data frame consists
of a mix of metadata and numeric data.
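To see what read.csv with these arguments does, here is a minimal, self-contained sketch on a throwaway two-gene file (the file contents, GeneA, and GeneB are hypothetical stand-ins for the real dataset):

```r
# Write a tiny hypothetical csv file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("date,cell,age,somite.range,GeneA,GeneB",
             "d1,c1,E10.5,9-12,0.0,5.2",
             "d1,c2,E10.5,9-12,3.1,0.0"), tmp)
# Same arguments as the import above
toy <- read.csv(tmp, header = T, sep = ",", as.is = T)
str(toy)   # metadata columns import as character, gene columns as numeric
```

With your real data.csv, dim(data) and str(data) are quick ways to run the same sanity check.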

Run the 2 code lines by highlighting them (don't worry if you also highlight commented lines; they are inert) and
by clicking the Run button in the upper right corner of the editor window.

Have a look at the upper right window. The “Environment” tab should be active and you will see a line named “data”
followed by text stating “513 obs. of 196 variables”. The objects are your cells and the variables are the different
columns.

Double click on the "data" line and your data table will appear on the left side in the upper window (similar to how
it looked in Excel). You can also click on the small arrow in front of the "data" line to expand your data view to the
individual variables. The first line should be the date, followed by metadata, and ultimately the first gene in the
dataset. Your gene expression (Log2Ex) values should be of type "num" = numeric.

At this point, double check that everything looks all right. Does the data table (the matrix) make sense? Are the gene
names the correct ones? etc. Familiarize yourself with the way R imported the non-numeric data. There is no need to
fully understand the different data formats; just remember that there are formats for numeric data and for other
data, that there is metadata and gene expression data, and be aware of how the data matrix is organized.

Preparation of your data for analysis


There are many different and probably more elegant solutions for handling data matrices. The goal is not to offer the
most elegant programming solution but a process that can be followed and that leads to accurate results.

1) Look at your data. This is best done using the “grid” view in the upper left window. You can also type
> View(data)
Count the number of columns that represent metadata; it should be 4. It is important to be aware of the columns that do
not contain gene expression data, and because you organized your data in a specific way, these columns are the first
of your dataset.

2) Create the “non.data.col” variable with the number 4 with the following line of code:
# define a variable with the number of columns on the left of your grid that do not contain gene expression data
non.data.col <- 4

3) Now, create a data frame expr that contains only the gene expression data with:
# remove meta information, retain expression values
expr <- data[, (non.data.col+1):ncol(data)]

4) Double check that the variable expr is making sense.
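The slicing logic behind this line can be seen on a tiny, hypothetical data frame (df and its columns are made-up stand-ins):

```r
# Toy sketch of column slicing with numeric indices
df <- data.frame(date = "d1", cell = "c1", GeneA = 1.5, GeneB = 0)
# With 2 metadata columns, keep everything from position 3 onward
df[, 3:ncol(df)]   # drops date and cell, keeps GeneA and GeneB
```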

5) Conversely, extract all metadata columns and assign these to the variable meta:
# extract meta data columns
meta <- data[, 1:(non.data.col)]

6) Double check that the variable meta is making sense.

7) Now, we do something that is not intuitive. We convert the data frame expr into a matrix. Type at the console:
> class(expr)
The function returns a line confirming that expr is a data frame. Conversion into a matrix is simply done with:
# convert data frame into matrix - "protects" data from accidental conversions
expr <- as.matrix(expr)

8) Check the conversion with:


> class(expr)

9) Ok, let's manipulate the matrix. First, we want to find out whether our matrix contains any values < 0. Negative values
are not interpretable (there is no negative gene expression), but these values sometimes occur during normalization.
To find out whether and how many negative values are hidden in your data matrix, type:
> table(expr < 0)
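What table() does here can be seen on a toy matrix with made-up values:

```r
# table() tallies the result of an element-wise comparison
m <- matrix(c(-1, 2, 0, 3), nrow = 2)  # hypothetical values, one negative
table(m < 0)   # FALSE: 3, TRUE: 1
```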

10) Let’s set all negative values to zero:


# Set all values < 0 to 0
expr[expr < 0] <- 0

11) Check with


> table(expr < 0)

12) In the next step, we will remove all columns (genes) that do not harbor any data (all zeros).
# if sum is not zero (T = True), then the column will be retained, if F (False) remove the column
# define the function
f <- apply(X = expr, MARGIN = 2, FUN = function(x){sum(x) > 0})
# use the function
expr <- expr[, f]

Ok, this is a tough one. Type ?apply to learn more about the apply function, which will tell you about the parameters
MARGIN and FUN. And ok again - this is Daniel's elegant example. You could also achieve the same thing with
(alternative):
# A simpler way for doing this would be:
expr <- expr[, colSums(expr) > 0]
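To make MARGIN and FUN concrete, here is the same filtering on a toy matrix with made-up values, where the second column is all zeros and gets dropped:

```r
# 2 rows (cells) x 3 columns (genes); column 2 contains no data
m <- matrix(c(1, 2,
              0, 0,
              3, 0), nrow = 2)
# MARGIN = 2 applies FUN once per column; sum(x) > 0 flags non-empty columns
keep <- apply(X = m, MARGIN = 2, FUN = function(x){sum(x) > 0})
keep          # TRUE FALSE TRUE
m <- m[, keep]   # the all-zero column is gone
```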

13) The data matrix “expr” now contains only columns with gene expression values and none of the columns is
empty. Just double check this to make sure.

Creating bar plot of gene expression data

Info.

2D PCA and plotting the data


Principal Component Analysis (PCA) is a statistical method of grouping data together based on the variance within
the data. We will generate plots that represent all 513 cells in the dataset. These cells will be organized so that cells
that have very dissimilar expression profiles across all 192 genes will be located further away from each other. For PCA
we use the command prcomp. Type ?prcomp into the console and a detailed description of the function will appear
in the lower right window.
1) Execute:
#create PCA object; x contains principal components, i.e. PCAresult$x[,1] = 1.PC etc
PCAresult <- prcomp(expr, scale. = T)
This performs the PCA with the cleaned-up data matrix and returns the results to the variable PCAresult. The PCA is
done with scaled data, which ensures that genes with different dynamic ranges are treated equally. This is standard
practice. If you want to have a look at a PCA of your data that does not use scaled variables, replace T with F.
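To get a feel for what prcomp returns, here is a sketch using R's built-in iris measurements as a stand-in for expr (rows = samples, columns = variables, just like our matrix):

```r
# PCA on the four iris measurement columns, with scaled variables
pca <- prcomp(iris[, 1:4], scale. = T)
summary(pca)   # proportion of variance captured by each PC
dim(pca$x)     # one row per sample, one column per principal component
```

With your data, summary(PCAresult) tells you how much of the variance PC1 and PC2 capture.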

2) Now you can quickly check the result of the PCA with the plot function:
plot(PCAresult$x)

3) Let's make the plot a little prettier. Execute the next block of commands:
# Color by gene; define color ramp
cRamp <- colorRampPalette(c("seagreen", "yellow", "red"))
ncolors <- 100
cols <- c("gray", cRamp(ncolors))

cRamp <- colorRampPalette(c("seagreen", "yellow", "red")) defines a color ramp from green via yellow to red.
ncolors <- 100 segments the ramp into 100 gradations; depending on the number of cells and the dynamics of gene
expression you might want to change this number at a later stage. cols <- c("gray", cRamp(ncolors)) assigns the
graded ramp to the variable cols and adds a leading value "gray" as the first color.
If you like, type cols into the console and you will get in return the hex codes for the first element "gray" plus the
100 elements of your color ramp.
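A key point worth seeing in isolation: colorRampPalette returns a function, and calling that function with a number n gives you n hex color codes along the ramp:

```r
cRamp <- colorRampPalette(c("seagreen", "yellow", "red"))
cRamp(3)                        # the two endpoints plus yellow in the middle
length(c("gray", cRamp(100)))   # 101 colors: gray plus the 100-step ramp
```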

4) Let’s map the gene expression values of one example gene (here “Isl1”) to the color palette. This is done by
executing 5 lines of code. You can execute them one line at a time and double check how the variables gexpr and
gexpr.col change by simply typing the variables into the console command line:
# Map gene expression to color palette
gname <- "Isl1" #name of gene of interest
gexpr <- expr[, gname] #select column w/ gene expression data
gexpr <- ceiling((gexpr / max(gexpr)) * ncolors) #non-detects = 0, others in [1, ncolors]
gexpr <- gexpr + 1 #shift by 1 as array index starts with 1; new range: [1, ncolors + 1], whereas 1 = non-detects
gexpr.col <- cols[gexpr] #map gene expression to color code

gname <- "Isl1" defines the gname (column header) that is being mapped.

gexpr <- expr[, gname] picks the column with the header "Isl1" and assigns it to the vector gexpr. Typing "gexpr"
into the command line shows the Log2Ex values for Isl1.

gexpr <- ceiling((gexpr / max(gexpr)) * ncolors) performs the mapping by scaling each value onto the 100 gradations.
The function "ceiling" rounds up to the next integer, which means that all values > 0 are assigned an integer
with 100 as the maximum. Zeros are preserved.

The next line of code, gexpr <- gexpr + 1, adds "1" to each element of the gexpr vector so that its values now range
from 1 to 101. As a result, any non-detect (now encoded as "1") maps to the first element of the "cols" vector,
which is "gray", while the color ramp occupies elements 2-101.

Finally, map the gene expression to the color ramp with gexpr.col <- cols[gexpr]. If you type "gexpr.col" you will
get a vector with one color per cell, representing the mapped gene expression colors for Isl1.
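The mapping arithmetic is easy to follow on a toy vector of made-up expression values for five hypothetical cells:

```r
# Hypothetical Log2Ex values for 5 cells; 0 = non-detect
gexpr <- c(0, 2.5, 5, 7.5, 10)
ncolors <- 100
idx <- ceiling((gexpr / max(gexpr)) * ncolors)  # 0 25 50 75 100
idx <- idx + 1                                  # 1 26 51 76 101
# idx now indexes cols: 1 = gray (non-detect), 2-101 = the ramp
```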

5) Let's look at the colors projected onto the PCA data. First, pick the dimensions of the PCA with pcs <- c(1, 2). This
defines the dimensions that you want to plot. For example, use pcs <- c(1, 3) for PC1 and PC3 or pcs <- c(2, 3) for PC2
and PC3, etc.
#index of principal components to visualize
pcs <- c(1, 2)

6) Finally, plot the projected gene expression data for Isl1 on the PC1/PC2 2-dimensional map:

plot(PCAresult$x[, pcs[1]], PCAresult$x[, pcs[2]], col = gexpr.col, pch = 19,
     main = gname, xlab = paste("PC", pcs[1]), ylab = paste("PC", pcs[2]))

The parameters should be pretty self-explanatory, but you can always use the help function (?plot) to get details.

7) Let’s save your plot. We do this by writing with the plot function directly into an open file. So the order of
commands for this is important. First, we will open the file, then we write the plot into the file, and finally, we will
close (and thereby save) the file.
# open png pipeline with the name "gname".png
png(filename = paste(gname, ".png", sep = ""), width = 480, height = 480,
    units = "px", pointsize = 12, bg = "white", res = NA)

This opens the pipeline. Note the dimensions (size) of the .png file. This can all be customized.
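For instance, a larger and higher-resolution version of the same output could look like this (the "_hires" file name and the chosen dimensions are just illustrative; the placeholder plot stands in for the PCA plot call):

```r
gname <- "Isl1"   # gene of interest, as above
png(filename = paste(gname, "_hires.png", sep = ""),
    width = 1600, height = 1600, units = "px", res = 300)
plot(1:10)   # placeholder; substitute the PCA plot call here
dev.off()
```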

plot(PCAresult$x[, pcs[1]], PCAresult$x[, pcs[2]], col = gexpr.col, pch = 19,
     main = gname, xlab = paste("PC", pcs[1]), ylab = paste("PC", pcs[2]))

Same plot command code line as above, but this time the plot will be written into the png pipeline because you opened
the pipeline.

Finally, close the pipeline:


dev.off()

dev.off() closes the graphics device (in this case, the file) and saves it to the working directory.

8) You should now find the file “Isl1.png” in your working directory.

You can change the gene by reassigning gname and running the mapping code lines again.
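If you find yourself doing this often, the mapping and plotting steps can be wrapped in a loop that writes one png per gene. The sketch below is built on a tiny simulated matrix (GeneA/GeneB/GeneC are hypothetical) so it runs stand-alone; with the real data you would loop over columns of expr and reuse PCAresult, cols, ncolors, and pcs from above instead:

```r
set.seed(1)
# Simulated stand-in: 20 cells x 3 hypothetical genes
toy.expr <- matrix(runif(60, 0, 10), nrow = 20,
                   dimnames = list(NULL, c("GeneA", "GeneB", "GeneC")))
toy.pca <- prcomp(toy.expr, scale. = T)
toy.cols <- c("gray", colorRampPalette(c("seagreen", "yellow", "red"))(100))
for (gname in colnames(toy.expr)) {
  # same mapping as in the single-gene walkthrough
  idx <- ceiling((toy.expr[, gname] / max(toy.expr[, gname])) * 100) + 1
  png(filename = paste(gname, ".png", sep = ""))
  plot(toy.pca$x[, 1], toy.pca$x[, 2], col = toy.cols[idx], pch = 19,
       main = gname, xlab = "PC 1", ylab = "PC 2")
  dev.off()
}
```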

Projecting metadata onto PCA plot


Sometimes you may want to have a look at how your metadata relates to the distribution of cells on a PCA plot. For
this, you need to know the column header of the metadata you want to project. For example, the provided dataset
contains a column labeled "somite.range", so let's have a look at the somite range.

1) Assign somite.range to the variable "m.header": m.header <- "somite.range". Because the metadata is not
necessarily numerical, we use the function "as.factor" in the following line to extract the metadata column and assign
it to the variable "m": m <- as.factor(meta[, m.header]). When you type "m" into the console you will get the values
for "m" as well as a list of the different levels. In our specific example,
these would be the different somite ranges.
# Projecting metadata onto PCA plot
m.header <- "somite.range"
m <- as.factor(meta[, m.header])
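What as.factor does, and why plot() can use the result as a color, can be seen on a toy vector of made-up somite ranges:

```r
# Hypothetical metadata values for 4 cells
m <- as.factor(c("9-12", "13-16", "9-12", "17-20"))
levels(m)        # the distinct values (the "levels")
as.integer(m)    # integer codes per cell; plot() uses these to pick colors
```

Cells sharing a level get the same integer code, and therefore the same color in the plot.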

2) You plot the data by executing the following code line.


plot(PCAresult$x[, pcs[1]], PCAresult$x[, pcs[2]], col = m, pch = 16,
     main = m.header, xlab = paste("PC", pcs[1]), ylab = paste("PC", pcs[2]))

3) If you want to add a legend, execute the following code.


m.values <- as.factor(levels(m))
legend(legend = m.values, "topright", pch = 16, col = m.values)

4) To save a plot, simply execute the next lines of code, which again should be self-explanatory. Did you keep
track of the current working directory?
# create a file with projected metadata
png(filename = paste(m.header, ".png", sep = ""), width = 480, height = 480,
    units = "px", pointsize = 12, bg = "white", res = NA) #opens png pipeline
plot(PCAresult$x[, pcs[1]], PCAresult$x[, pcs[2]], col = m, pch = 16,
     main = m.header, xlab = paste("PC", pcs[1]), ylab = paste("PC", pcs[2]))
m.values <- as.factor(levels(m))
legend(legend = m.values, "topright", pch = 16, col = m.values)
dev.off()