
Affymetrix Data analysis

Course in Practical Analysis of Microarray Data


Computational Exercises

2010 March 22-26, Technische Universität München


Amin Moghaddasi, Kurt Fellenberg
Adapted from Wolfgang Huber, Robert Gentleman, best practice exercises.

This exercise guides you through the analysis of Affymetrix microarray data. It covers loading data into
R/Bioconductor, quality control, and preprocessing of raw data. Use the help() command to find information
about a particular command or object.

1. You need to have the latest version of R and Bioconductor installed, along with the packages affy, estrogen,
hgu95av2, hgu95av2cdf, and vsn from the latest Bioconductor version.

> library(affy)
> library(estrogen)
> library(vsn)

To install missing packages issue the following commands:

> source("http://bioconductor.org/biocLite.R")
> biocLite("PKGNAME")

Replace PKGNAME with the missing package.

2. Reading data files.


2.1. In this exercise the CEL files are located at ~/R/library/estrogen/extdata. Find the directory

> datadir <- system.file("extdata", package = "estrogen")


> datadir
[1] "C:/PROGRA~1/R/R-28~1.1/library/estrogen/extdata"
> dir(datadir)
[1] "bad.cel" "estrogen.txt" "high10-1.cel" "high10-2.cel" "high48-1.cel"
[6] "high48-2.cel" "low10-1.cel" "low10-2.cel" "low48-1.cel" "low48-2.cel"
[11] "phenoData.txt" "targLimma.txt"

Set the working directory path to that location.

> setwd(datadir)

The function system.file is used to locate the subfolder extdata of the estrogen package installed on your
computer.

2.2. The file estrogen.txt contains information on how the samples were hybridized onto the arrays. Open it
in your favorite text editor and have a look around. The same folder also holds targLimma.txt, which the
command below loads into a phenoData or AnnotatedDataFrame object:

> pd <- read.AnnotatedDataFrame("targLimma.txt", header = TRUE, row.names = 2)


> pData(pd)

Sample annotations are stored in AnnotatedDataFrame objects. Examples are treatment conditions in a cell
line experiment, or clinical or histopathological characteristics of tissue biopsies. The header and
row.names options tell the read.AnnotatedDataFrame function whether the annotation file has a line of
column headings and which column holds the row names, respectively.

2.3. Load the data from the CEL files as well as the phenotypic data into an AffyBatch object.

> a <- ReadAffy(filenames = rownames(pData(pd)), phenoData = pd, verbose = TRUE)


> a
trying URL
'http://bioconductor.org/packages/2.3/data/annotation/bin/windows/contrib/2.8/hgu95av2cdf_2.3.0.zip'
Content type 'application/zip' length 1354027 bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb

package 'hgu95av2cdf' successfully unpacked and MD5 sums checked

updating HTML package descriptions


AffyBatch object
size of arrays=640x640 features (9 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=8
number of genes=12625
annotation=hgu95av2
notes=

Note: If you don't have permission to write to the R library directory, you can create a personal library with
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE). For details, see help(.libPaths).

3. CEL file images. The image function allows us to look at the spatial distribution of the intensities on a chip.
This can be useful for quality control. Use the image function to create images of all 8 CEL files.

> image(a[, 1])

Compare these CEL images with the image of another file that shows spatial artifacts. The file is named
bad.cel and is located in the same folder.
> badc = ReadAffy("bad.cel")
> image(badc)
4. Histograms. Another way to visualize the chip is to look at the intensity distribution histogram. Because of
the large dynamic range (O(10^4)), it is useful to look at the log-transformed values:

> hist(log2(intensity(a[, 4])), breaks = 100, col = "red")

5. Normalization.
5.1. Before comparing data from different arrays the probe-level data has to be summarized to represent
expression levels per gene and intensities have to be normalized between different arrays. We can use
the function expresso to choose between different methods to normalize the data and calculate
expression values.

> x <- expresso(a, bg.correct = FALSE, normalize.method = "vsn",
+     normalize.param = list(subsample = 1000), pmcorrect.method = "pmonly",
+     summary.method = "medianpolish")
> x
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 8 samples
element names: exprs, se.exprs
phenoData
sampleNames: low10-1.cel, low10-2.cel, ..., high48-2.cel (8 total)
varLabels and varMetadata description:
Name:
Target:
estrogen:
time.h:
featureData
featureNames: 100_g_at, 1000_at, ..., AFFX-YEL024w/RIP1_at (12625 total)
fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: hgu95av2

The parameter subsample determines the time consumption, as well as the precision of the calibration.
The default (if you omit the parameter normalize.param = list(subsample=1000)) is 20000;
here we chose a smaller value for the sake of demonstration.
There is a possibility that expresso does not work properly due to memory problems (normally it should
work with 384 MB). Do not worry: the variable x is already available in the current workspace of R
(workspace.RData includes the expression set x and the AffyBatch a), so you can simply continue with the
next commands.

5.2. What are other available methods for normalization, and expression value calculation? Choose
another method (for example, MAS5 or RMA) and compare the results. For example, look at
scatterplots of the probe set summaries for the same arrays between different methods.

> normalize.methods(a)
[1] "constant" "contrasts" "invariantset" "loess"
[5] "qspline" "quantiles" "quantiles.robust" "vsn"

> express.summary.stat.methods()
[1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"

6. Boxplot.
6.1. To compare the intensity distribution across several chips, we can look at the boxplots, both of the
raw intensities a and the normalized probe set values x:

> boxplot(a, col = "red")


> boxplot(data.frame(exprs(x)), col = "blue")

In the commands above, note the different syntax: a is an object of type AffyBatch, and the boxplot
function has been programmed to know automatically what to do with it. exprs(x) is an object of type
matrix. What happens if you do boxplot(x) or boxplot(exprs(x))?

> class(x)
[1] "ExpressionSet"

attr(,"package")
[1] "Biobase"
> class(exprs(x))

7. Scatterplot.
7.1. The scatterplot is a visualization that is useful for assessing the variation (or reproducibility, depending
on how you look at it) between chips. We can look at all probes, the perfect match probes only, the
mismatch probes only, and of course also at the normalized, probe-set-summarized data. Why are the
arrays that were made at t = 48h much brighter than those at t = 10h? Look at histograms and
scatterplots of the probe intensities from chips at 10h and at 48h to see whether you can find any
evidence of saturation, changes in experimental protocol, or other quality problems. Distinguish
between probes that are supposed to represent genes (you can access these e.g. through the
function pm()) and control probes.

> plot(exprs(a)[, 1:2], log = "xy", pch = ".", main = "all")


> plot(pm(a)[, 1:2], log = "xy", pch = ".", main = "pm")
> plot(mm(a)[, 1:2], log = "xy", pch = ".", main = "mm")
> plot(exprs(x)[, 1:2], pch = ".", main = "x")

8. Heatmap.
8.1. Heatmaps are a quick and easy way to visualize and see structures in medium sized tables of data.
Here we do a rough selection of the 50 genes with the highest variation (standard deviation) across
chips.

> rsd <- apply(exprs(x), 1, sd)


> sel <- order(rsd, decreasing = TRUE)[1:50]
> heatmap(exprs(x)[sel, ], col = gentlecol(256))
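
The selection step can be tried without the estrogen data. The following is a minimal sketch on a simulated matrix (an assumption, not the course data). Because gentlecol is not defined anywhere in this handout, the sketch substitutes a palette built with colorRampPalette from grDevices; the names m and pal are illustrative only.

```r
# Simulated stand-in for exprs(x): 200 "genes" x 8 "chips" (not real data).
set.seed(1)
m <- matrix(rnorm(200 * 8, mean = 8, sd = 0.2), nrow = 200,
            dimnames = list(paste0("g", 1:200), paste0("chip", 1:8)))
# Plant 10 genes that differ strongly between the first and last four chips.
m[1:10, 5:8] <- m[1:10, 5:8] + 3

# Same selection logic as in the exercise: row-wise standard deviation, top 50.
rsd <- apply(m, 1, sd)
sel <- order(rsd, decreasing = TRUE)[1:50]

# Hypothetical replacement palette for gentlecol(256).
pal <- colorRampPalette(c("blue", "white", "red"))(256)
# Only draw when a user is at the console.
if (interactive()) heatmap(m[sel, ], col = pal)
```

In this simulation the ten planted genes should occupy the first ten positions of sel, which is a quick check that the ordering does what is intended.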

9. ANOVA.
9.1. A better way to select genes is to estimate the variation between different types of experiments in the
data. One way to do this is ANOVA. We set up a linear model with main effects for the level of estrogen
(estrogen) and the time (time.h), plus their interaction. Both are factors with 2 levels. Now we can start
analysing our data for biological effects.

> lm.coef <- function(y) lm(y ~ factor(estrogen) * factor(time.h),
+     data = pData(pd))$coefficients

> eff <- esApply(x, 1, lm.coef)

For each gene, we obtain the fitted coefficients for main effects and interaction:
> dim(eff)
[1] 4 12625

> rownames(eff)
[1] "(Intercept)"
[2] "factor(estrogen)present"
[3] "factor(time.h)48"
[4] "factor(estrogen)present:factor(time.h)48"
> affyids <- colnames(eff)
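
To see what the four coefficients mean, the per-gene model fit can be reproduced in base R. This is a sketch under stated assumptions: the design table and the matrix y are simulated, and apply stands in for esApply, so no Bioconductor package is needed.

```r
# Simulated 2 x 2 design with two replicates per cell (not the estrogen data).
set.seed(2)
design <- data.frame(
  estrogen = rep(c("absent", "present"), each = 2, times = 2),
  time.h   = rep(c(10, 48), each = 4)
)
y <- matrix(rnorm(5 * 8, mean = 8), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))
# Give gene1 a pure estrogen effect of +2.
y[1, design$estrogen == "present"] <- y[1, design$estrogen == "present"] + 2

# One model per row; `*` expands to both main effects plus their interaction.
lm.coef <- function(v) {
  coef(lm(v ~ factor(estrogen) * factor(time.h), data = design))
}
eff <- apply(y, 1, lm.coef)
dim(eff)        # 4 coefficients (rows) by 5 genes (columns)
rownames(eff)
```

With two factors of two levels each and an interaction, the model has one parameter per design cell, so the intercept is simply the mean of the samples without estrogen at 10 h.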

Let's bring up the mapping from the vendor's probe set identifier to gene names.

> library(hgu95av2)
> ls("package:hgu95av2")
[1] "hgu95av2" "hgu95av2ACCNUM"
[3] "hgu95av2CHR" "hgu95av2CHRLENGTHS"
[5] "hgu95av2CHRLOC" "hgu95av2ENTREZID"
[7] "hgu95av2ENZYME" "hgu95av2ENZYME2PROBE"
[9] "hgu95av2GENENAME" "hgu95av2GO"
[11] "hgu95av2GO2ALLPROBES" "hgu95av2GO2PROBE"
[13] "hgu95av2LOCUSID" "hgu95av2MAP"
[15] "hgu95av2MAPCOUNTS" "hgu95av2OMIM"
[17] "hgu95av2ORGANISM" "hgu95av2PATH"
[19] "hgu95av2PATH2PROBE" "hgu95av2PFAM"
[21] "hgu95av2PMID" "hgu95av2PMID2PROBE"
[23] "hgu95av2PROSITE" "hgu95av2QC"
[25] "hgu95av2REFSEQ" "hgu95av2SUMFUNC_DEPRECATED"
[27] "hgu95av2SYMBOL" "hgu95av2UNIGENE"

Let's first look at the estrogen main effect, and print the top 3 genes with the largest effect in one
direction, as well as in the other direction. Then, look at the estrogen:time interaction.

> lowest <- sort(eff[2, ], decreasing = FALSE)[1:3]


> mget(names(lowest), hgu95av2GENENAME)
$`846_s_at`
[1] "BCL2-antagonist/killer 1"
$`37294_at`
[1] "B-cell translocation gene 1, anti-proliferative"
$`38551_at`
[1] "L1 cell adhesion molecule"
> highest <- sort(eff[2, ], decreasing = TRUE)[1:3]
> mget(names(highest), hgu95av2GENENAME)
$`31798_at`
[1] "trefoil factor 1"
$`910_at`
[1] "thymidine kinase 1, soluble"
$`40117_at`
[1] "minichromosome maintenance complex component 6"
> hist(eff[4, ], breaks = 100, col = "blue", main = "estrogen:time interaction")
> highia <- sort(eff[4, ], decreasing = TRUE)[1:3]
> mget(names(highia), hgu95av2GENENAME)
$`1651_at`
[1] "ubiquitin-conjugating enzyme E2C"
$`40412_at`
[1] "pituitary tumor-transforming 1"

$`1945_at`
[1] "cyclin B1"

Differential gene expression

1. ALL data.
The data used for these exercises come from a study of Chiaretti et al. (Blood 103:2771-8, 2004) on acute
lymphoblastic leukemia (ALL), which was conducted with HG-U95Av2 Affymetrix arrays. They are used to
demonstrate the functions to find differential genes. The data package ALL contains an ExpressionSet object
called ALL, which contains the expression data that were normalized with rma (intensities are on the log2
scale), and annotations of the samples.

1.1. Load the ALL package. What is the dimension of the expression data matrix?
1.2. Use the function show to get an overview of the ExpressionSet object. What are the variables describing
the samples stored in the pData slot?

> library(ALL)
> library(hgu95av2)
> library(annotate)
> data(ALL)
> show(ALL)
> dim(exprs(ALL))
> print(summary(pData(ALL)))

2. B-cell ALL.
We want to look at the B-cell ALL samples (they can be identified by the column BT of the pData slot of
the ExpressionSet ALL). Of particular interest is the comparison of samples with the BCR/ABL fusion gene
resulting from a translocation of the chromosomes 9 and 22 (labelled BCR/ABL in the column mol), with
samples that are cytogenetically normal (labelled NEG).
2.1. Define an ExpressionSet object containing only the data from the B-cell ALL samples. How many samples
belong to the cytogenetically defined groups?

> pdat <- pData(ALL)


> table(pdat$BT)
> table(pdat$mol)
> subset <- intersect(grep("^B", as.character(pdat$BT)),
+     which(as.character(pdat$mol) %in% c("BCR/ABL", "NEG")))
> eset <- ALL[, subset]
> table(eset$mol)

3. Non-specific filtering.
Many of the genes on the chip won't be expressed in the B-cell lymphocytes studied here, or might have
only small variability across the samples.
3.1. We try to remove these genes (more precisely: the corresponding probe sets) with an intensity filter
(the intensity of a gene should be above log2(100) in at least 25 percent of the samples), and a
variance filter (the interquartile range of log2 intensities should be at least 0.5). We create a new
ExpressionSet containing only the probe sets which passed our filter. How many probe sets do we get?

> library(genefilter)
> f1 <- pOverA(0.25, log2(100))

> f2 <- function(x) (IQR(x) > 0.5)
> ff <- filterfun(f1, f2)
> selected <- genefilter(eset, ff)
> sum(selected)
> esetSub <- eset[selected, ]

4. Differential expression.
Now we are ready to examine the selected genes for differential expression between the BCR/ABL samples
and the cytogenetically normal ones.
4.1. Use the two-sample t-test to identify genes that are differentially expressed between the two groups.
The function mt.teststat from the multtest package allows computing several commonly used
test statistics for all rows of a data matrix; study its help page. First, we calculate the nominal
p-values (the function pt gives the distribution function of the t-distribution). We can get an
impression of the amount of differential gene expression by looking at a histogram of the p-value
distribution.

> library(multtest)
> cl <- as.numeric(esetSub$mol == "BCR/ABL")
> t <- mt.teststat(exprs(esetSub), classlabel = cl, test = "t.equalvar")
> pt <- 2 * pt(-abs(t), df = ncol(exprs(esetSub)) - 2)
> hist(pt, 50)
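
The p-value formula can be checked on a single simulated gene. The sketch below computes the equal-variance t statistic by hand and compares it with the built-in t.test; the data are simulated and the variable names are illustrative. Note that the exercise code stores its results in variables called t and pt. R still finds the functions of the same name when they are called, but the sketch uses different names to avoid confusion.

```r
# Simulated log2 intensities for one gene (not the ALL data).
set.seed(3)
cl <- rep(c(0, 1), times = c(12, 10))
g  <- rnorm(length(cl), mean = 8 + 0.8 * cl)

n1 <- sum(cl == 1)
n0 <- sum(cl == 0)
# Pooled variance, as assumed by the equal-variance t-test.
sp2 <- ((n1 - 1) * var(g[cl == 1]) + (n0 - 1) * var(g[cl == 0])) / (n1 + n0 - 2)
tstat <- (mean(g[cl == 1]) - mean(g[cl == 0])) / sqrt(sp2 * (1 / n1 + 1 / n0))
# Two-sided p-value from the t distribution with n1 + n0 - 2 degrees of freedom.
pval <- 2 * stats::pt(-abs(tstat), df = n1 + n0 - 2)

# Cross-check against the built-in test.
ref <- t.test(g[cl == 1], g[cl == 0], var.equal = TRUE)
c(by.hand = pval, t.test = ref$p.value)
```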

4.2. The function p.adjust contains different multiple testing procedures. Look at the help page of this
function. For p-value adjustment in terms of the FDR, we use the method of Benjamini and Hochberg.
How many genes do you get when imposing an FDR of 0.1?

> pa <- p.adjust(pt, method = "BH")


> sum(pa < 0.1)
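
For readers who want to see what the "BH" method does, here is a short sketch that reproduces the adjustment by hand on simulated p-values and compares it with p.adjust. The p-values are made up for illustration and the names p, adj and o are not part of the exercise.

```r
# Simulated p-values: 20 "true effects" mixed into 180 nulls (not real data).
set.seed(4)
p <- c(rbeta(20, 0.1, 5), runif(180))

# Benjamini-Hochberg: scale the i-th smallest p-value by n / i, then enforce
# monotonicity from the largest p-value downwards and cap at 1.
n <- length(p)
o <- order(p, decreasing = TRUE)
adj <- numeric(n)
adj[o] <- pmin(1, cummin(p[o] * n / (n:1)))

all.equal(adj, p.adjust(p, method = "BH"))
sum(adj < 0.1)   # number of discoveries at an FDR of 0.1
```

An adjusted value is never smaller than the raw p-value, which is why fewer genes pass a threshold of 0.1 after adjustment than before.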

4.3. Plot the p-values against the log-ratios (differences between the mean log-intensities of the two
groups) in a volcano plot. Note the asymmetry of the volcano plot.

> logRatio <- rowMeans(exprs(esetSub)[, cl == 1]) - rowMeans(exprs(esetSub)[,
+     cl == 0])
> plot(logRatio, -log10(pt), xlab = "log-ratio", ylab = "-log10(p)")

5. Limma.
A t-test analysis can also be conducted with functions of the limma package.
5.1. First, we have to define the design matrix. One possibility is to use an intercept term (first column
consisting of 1s), which with the 0/1 coding of cl represents the mean log-intensity of a gene in the
samples with cl = 0, and to encode the difference between the two classes in the second column.

> library(limma)
> design <- cbind(mean = 1, diff = cl)
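
What the two columns of such a design matrix estimate can be verified with ordinary least squares in base R. This is a sketch on one simulated gene, using lm.fit rather than limma; the data and the names g and fit are assumptions for illustration.

```r
# Simulated log2 intensities for one gene in two groups (not the ALL data).
set.seed(5)
cl <- rep(c(0, 1), each = 6)
g  <- rnorm(12, mean = 7 + 1.5 * cl)

design <- cbind(mean = 1, diff = cl)
# Ordinary least squares with an explicit design matrix (no extra intercept).
fit <- lm.fit(design, g)
fit$coefficients

# Under 0/1 coding the first coefficient is the mean of the cl == 0 group
# and the second is the difference between the two group means.
c(mean(g[cl == 0]), mean(g[cl == 1]) - mean(g[cl == 0]))
```

The two printed vectors agree, so the coefficient named diff is exactly the log-ratio between the groups.
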
5.2. A linear model is fitted for every gene by the function lmFit, and Empirical Bayes moderation of the
standard errors is done by the function eBayes.

> fit <- lmFit(esetSub, design)


> fit2 <- eBayes(fit)
> topTable(fit2, coef = "diff", adjust.method = "fdr")

5.3. When you compare the resulting p-values with those from the parametric t-test (Exercise 4.1), you
will see that they are almost identical. Because of the large number of samples, the Empirical Bayes
moderation is not so relevant in this data set: the gene-specific variance can well be estimated from
the data of each gene.

> plot(log10(pt), log10(fit2$p.value[, "diff"]), xlab = "two-sample t-test",
+     ylab = "limma")
> abline(c(0, 1), col = "Red")

6. Annotation.
6.1. Now we want to see which genes are the most significant ones, and look at their raw and adjusted
p-values from the different methods. Gene symbols are provided in the annotation package hgu95av2.

> diff <- order(pa)[1:10]


> genesymbols <- mget(featureNames(esetSub)[diff], hgu95av2SYMBOL)
> pvalues <- cbind(pt, pa)[diff, ]
> rownames(pvalues) <- genesymbols
> print(pvalues)

6.2. The top 3 probe sets represent the ABL1 gene, which is affected by the translocation characterizing
the BCR/ABL samples. Now we want to see whether there are further probe sets representing this
gene, and whether these have been selected by our nonspecific filtering.

> geneSymbols = mget(featureNames(ALL), hgu95av2SYMBOL)


> ABL1probes <- which(geneSymbols == "ABL1")
> selected[ABL1probes]

7. Gene Ontology.
7.1. Many of the effects due to the BCR/ABL translocation are mediated by tyrosine kinase activity. Let's
look at the probe sets that are annotated with the GO term protein-tyrosine kinase activity, which has
the identifier GO:0004713.

> gN <- featureNames(esetSub)


> tykin <- unique(lookUp("GO:0004713", "hgu95av2", "GO2ALLPROBES"))
> str(tykin)
> sel <- (gN %in% tykin)

7.2. We can now check whether there are more differentially expressed genes among the tyrosine kinases
than among the other genes. Fisher's exact test for contingency tables is used to check whether the
proportions of differentially expressed genes are significantly different in the two gene groups.

> tab <- table(pt < 0.05, sel, dnn = c("p < 0.05", "tykin"))
> print(tab)
> fisher.test(tab)

8. ROC curve screening.

8.1. We want to find marker genes that are specifically expressed in leukemias with the BCR/ABL
translocation. At different cutoff levels we can determine how well the expression levels of the genes
are separated between the two classes and calculate specificity and sensitivity for each gene. At a
specificity of at least 0.9, we would like to identify the genes with the best sensitivity for the BCR/ABL
phenotype. This can be expressed by the partial area under the ROC curve (pAUC, we choose t0 = 0.1).
To limit the computation time, we compute the pAUC statistic only for the first 100 probe sets.

> library(ROC)
> mypauc1 <- function(x) {
+ pAUC(rocdemo.sca(truth = cl, data = x, rule = dxrule.sca),
+ t0 = 0.1)
+ }
> pAUC1s <- esApply(esetSub[1:100, ], 1, mypauc1)
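
The idea of trading sensitivity against specificity can be illustrated without the ROC package. The sketch below is a simplified stand-in, not the pAUC statistic itself: for one simulated probe set it scans all cutoffs and reports the best sensitivity reachable at a specificity of at least 0.9, which is a single point on the ROC curve rather than an area under it. All data and names are assumptions.

```r
# Simulated expression values for one probe set (not the ALL data).
set.seed(6)
cl <- rep(c(0, 1), each = 40)
x  <- rnorm(80, mean = 7 + 1.2 * cl)

# Candidate cutoffs: call a sample "positive" when its value exceeds the cutoff.
cutoffs <- sort(unique(x))
sens <- sapply(cutoffs, function(cc) mean(x[cl == 1] > cc))
spec <- sapply(cutoffs, function(cc) mean(x[cl == 0] <= cc))

# Best sensitivity among cutoffs that keep specificity at 0.9 or above.
ok <- spec >= 0.9
best.sens <- max(sens[ok])
best.sens
```

Raising the cutoff can only increase specificity and decrease sensitivity, so the best value is reached at the lowest cutoff that still satisfies the specificity constraint.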

8.2. Select the 2 probe sets with the maximal value of our pAUC statistic, and plot the corresponding ROC
curves. For comparison, look at the t-test p-values for these genes.

> best <- order(pAUC1s, decreasing = T)[1:2]


> x11()
> par(mfrow = c(1, 2))
> for (pS in best) {
+ RC <- rocdemo.sca(truth = cl, data = exprs(esetSub)[pS, ],
+ rule = dxrule.sca)
+ plot(RC, main = featureNames(esetSub)[pS])
+ }
> print(pt[best])

Clustering gene expression data

1. Hierarchical clustering
The first step in hierarchical clustering is to compute a matrix containing all distances between the objects
that are to be clustered.

1.1. For the ALL data set, the distance matrix is of size 128 x 128. From this we compute a dendrogram
using complete linkage hierarchical clustering:

> cl <- substr(as.character(ALL$BT), 1, 1)


> dat <- exprs(ALL)
> d <- dist(t(dat))
> image(as.matrix(d))
> hc <- hclust(d, method = "complete")
> plot(hc, labels = cl)

1.2. We now split the dendrogram into two clusters (using the function cutree) and compare the resulting
clusters with the true classes. Compute a contingency table and then apply an independence test.

> groups <- cutree(hc, k = 2)


> table(groups, cl)
> fisher.test(groups, cl)$p.value
[1] 0.03718791

The null hypothesis of the Fisher test is "the true labels and the cluster results are independent". The
p-value shows that we can reject this hypothesis at a significance level of 5%. But this is just statistics: What is
your impression of the quality of this clustering? Did we already succeed in reconstructing the true classes?
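
For comparison, it helps to see what the same pipeline returns when the group structure is strong. The following sketch runs dist, hclust and cutree on simulated data with two clearly separated groups; the matrix sim and the labels truth are assumptions and do not come from the ALL package.

```r
# Simulated data: 50 "genes" x 30 "samples" in two well-separated groups.
set.seed(7)
truth <- rep(c("B", "T"), times = c(20, 10))
sim <- matrix(rnorm(50 * 30), nrow = 50)
sim[1:20, truth == "T"] <- sim[1:20, truth == "T"] + 6

# dist() works on rows, so transpose to get sample-by-sample distances.
d.sim  <- dist(t(sim))
hc.sim <- hclust(d.sim, method = "complete")
grp    <- cutree(hc.sim, k = 2)

tab <- table(grp, truth)
tab
fisher.test(tab)$p.value
```

In this simulation each cluster contains samples of one class only. A table with many off-diagonal counts, even with a p-value below 0.05, is a much weaker result.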

2. Gene selection before clustering samples


2.1. The dendrogram plot suggests that the clustering can still be improved. And the p-value is not that
low. To improve our results, we should try to avoid using genes that contribute just noise and no
information. A simple approach is to exclude all genes that show little variance across the samples;
here we keep the 100 genes with the highest variance. Then we repeat the analysis from the last section.

> genes.var <- apply(dat, 1, var)


> genes.var.select <- order(genes.var, decreasing = T)[1:100]
> dat.s <- dat[genes.var.select, ]
> d.s <- dist(t(dat.s))
> hc.s <- hclust(d.s, method = "complete")
> plot(hc.s, labels = cl)

2.2. Plot the distance matrix d.s. Compare it to d. Then split the tree again and analyze the result by a
contingency table and an independence test:

> groups.s <- cutree(hc.s, k = 2)
> table(groups.s, cl)
cl
groups.s B T
1 95 0
2 0 33
> fisher.test(groups.s, cl)$p.value
[1] 2.326082e-31

It seems that reducing the number of genes was a good idea. But where does the effect come from: Is it
just dimension reduction (from 12625 to 100) or do high-variance genes carry more information than other
genes? We investigate by comparing our results to a clustering on 100 randomly selected genes:

> genes.random.select <- sample(nrow(dat), 100)


> dat.r <- dat[genes.random.select, ]
> d.r <- dist(t(dat.r))
> image(as.matrix(d.r))
> hc.r <- hclust(d.r, method = "complete")
> plot(hc.r, labels = cl)
> groups.r <- cutree(hc.r, k = 2)
> table(groups.r, cl)
> fisher.test(groups.r, cl)$p.value

3. K-means
3.1. First look at help(kmeans) to become familiar with the function. Set the number of clusters to k=2.
Then perform k-means clustering for all samples and all genes. k-means uses a random starting
solution and thus different runs can lead to different results. Run k-means 10 times and use the result
with smallest within-cluster variance.

> k <- 2
> withinss <- Inf
> for (i in 1:10) {
+ kmeans.run <- kmeans(t(dat), k)
+ print(sum(kmeans.run$withinss))
+ print(table(kmeans.run$cluster, cl))
+ cat("----\n")
+ if (sum(kmeans.run$withinss) < withinss) {
+ result <- kmeans.run
+ withinss <- sum(result$withinss)
+ }
+ }
> table(result$cluster, cl)
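
The loop over 10 runs can also be delegated to kmeans itself through its nstart argument. The sketch below shows this on simulated data; the matrix sim and the labels truth are assumptions for illustration, with samples already in rows so no transposition is needed.

```r
# Simulated data: 30 samples (rows) x 50 features, two groups (not real data).
set.seed(8)
truth <- rep(c("B", "T"), times = c(20, 10))
sim <- matrix(rnorm(30 * 50), nrow = 30)
sim[truth == "T", 1:20] <- sim[truth == "T", 1:20] + 6

# nstart = 10 tries 10 random initialisations and returns the fit with the
# smallest total within-cluster sum of squares.
km <- kmeans(sim, centers = 2, nstart = 10)
km$tot.withinss
table(km$cluster, truth)
```

The component tot.withinss equals sum(km$withinss), which is the quantity the explicit loop compares between runs.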

3.2. The last result is the statistically best out of 10 tries, but does it reflect biology? Now do k-means again
using the 100 top-variance genes. Compare the results.

> kmeans.s <- kmeans(t(dat.s), k)


> table(kmeans.s$cluster, cl)

4. PAM: Partitioning Around Medoids
4.1. As explained in today's lecture, PAM is a generalization of k-means. Look at the help page of the R
function. Then apply pam for clustering the samples using all genes:

NOTE: In case you don't have the cluster package, do

> install.packages("cluster")
> library(cluster)
> result <- pam(t(dat), k)
> table(result$clustering, cl)

4.2. Now use k=2:50 top variance genes for clustering and calculate the number of misclassifications in
each step. Plot the number of genes versus the number of misclassifications. What is the minimal
number of genes needed for obtaining 0 misclassifications?

> ngenes <- 2:50


> o <- order(genes.var, decreasing = T)
> miscl <- NULL
> for (k in ngenes) {
+ dat.s2 <- dat[o[1:k], ]
+ pam.result <- pam(t(dat.s2), k = 2)
+ ct <- table(pam.result$clustering, cl)
+ miscl[k] <- min(ct[1, 2] + ct[2, 1], ct[1, 1] + ct[2, 2])
+ }
> xl = "# genes"
> yl = "# misclassification"
> plot(ngenes, miscl[ngenes], type = "l", xlab = xl, ylab = yl)
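
The expression used for miscl in the loop deserves a short explanation, since cluster numbers carry no meaning of their own. The toy example below uses hypothetical assignments (not results from the ALL data) to show why the minimum over both matchings is taken.

```r
# Hypothetical cluster assignments and true labels, for illustration only.
truth   <- c("B", "B", "B", "B", "T", "T", "T")
cluster <- c(2, 2, 2, 1, 1, 1, 1)

ct <- table(cluster, truth)
ct
# Cluster numbers are arbitrary, so count errors under both possible
# cluster-to-class matchings and keep the smaller count.
miscl <- min(ct[1, 2] + ct[2, 1], ct[1, 1] + ct[2, 2])
miscl   # 1: the fourth sample sits with the T samples
```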

5. The objective function


5.1. Partitioning methods try to optimize an objective function. The objective function of k-means is the
sum of all within-cluster sum of squares. Run k-means with k=2:20 clusters and 100 top variance
genes. For k=1 calculate the total sum of squares. Plot the obtained values of the objective function.

> totalsum <- sum(diag((ncol(dat.s) - 1) * cov(t(dat.s))))


> withinss <- numeric()
> withinss[1] <- totalsum
> for (k in 2:20) {
+ withinss[k] <- sum(kmeans(t(dat.s), k)$withinss)
+ }
> plot(1:20, withinss, xlab = "# clusters", ylab = "objective function",
+     type = "b")

Why is the objective function not necessarily decreasing? Why is k=2 a good choice?
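
The formula used for totalsum can be checked against two other ways of computing the total sum of squares. This is a sketch on simulated data with samples in rows, so nrow(X) here plays the role of ncol(dat.s) in the exercise; the matrix X is an assumption.

```r
# Simulated data: 40 samples (rows) x 10 features (not the ALL data).
set.seed(9)
X <- matrix(rnorm(40 * 10), nrow = 40)

# Three equivalent ways to get the total sum of squares about the overall mean:
tss.cov    <- sum(diag((nrow(X) - 1) * cov(X)))          # covariance formula
tss.direct <- sum(scale(X, center = TRUE, scale = FALSE)^2)
tss.kmeans <- kmeans(X, centers = 2)$totss               # reported by kmeans

c(tss.cov, tss.direct, tss.kmeans)
```

Since kmeans returns totss for any number of clusters, the k=1 value of the objective function is available without a separate calculation.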

6. The silhouette score


6.1. First read the Details section of help(silhouette) for a definition of the silhouette score. Then compute
silhouette values for PAM clustering results with k=2:20 clusters. Plot the silhouette widths. Choose an
optimal k according to the maximal average silhouette width. Compare silhouette plots for k=2 and
k=3. Why is k=2 optimal? Which observations are misclassified? Which cluster is more compact?

> asw <- numeric()


> for (k in 2:20) {
+ asw[k] <- pam(t(dat.s), k)$silinfo$avg.width
+ }
> plot(1:20, asw, xlab = "# clusters", ylab = "average silhouette width",
+     type = "b")
> plot(silhouette(pam(t(dat.s), 2)))
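
To make the definition concrete, the average silhouette width can be computed by hand in base R. The function sil.width below is a hypothetical helper written for this sketch, not part of the cluster package, and the data are simulated. For each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette value is (b - a) / max(a, b).

```r
# Average silhouette width from a distance object and cluster labels.
sil.width <- function(d, cluster) {
  D <- as.matrix(d)
  ids <- unique(cluster)
  s <- sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)            # singleton clusters get s(i) = 0
    a <- sum(D[i, own]) / (sum(own) - 1)    # mean distance within own cluster
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(D[i, cluster == k])))
    (b - a) / max(a, b)
  })
  mean(s)
}

# Simulated data: two well-separated groups of 15 points (not the ALL data).
set.seed(10)
pts <- rbind(matrix(rnorm(30), ncol = 2), matrix(rnorm(30, mean = 5), ncol = 2))
lab <- rep(1:2, each = 15)
sil.width(dist(pts), lab)
sil.width(dist(pts), sample(lab))   # shuffled labels for comparison
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate that observations lie between clusters or are assigned to the wrong one.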

7. How to check significance of clustering results


7.1. Randomly permute class labels 20 times. For k=2, calculate the average silhouette width (asw) for pam
clustering results with true labels and for each random labeling. Generate a box-plot of the asw values
obtained from the random labelings. Compare it with the asw value obtained from the true labels.

> d.s <- dist(t(dat.s))


> cl.true <- as.numeric(cl == "B")
> asw <- mean(silhouette(cl.true, d.s)[, 3])
> asw.random <- rep(0, 20)
> for (sim in 1:20) {
+ cl.random = sample(cl.true)
+ asw.random[sim] = mean(silhouette(cl.random, d.s)[, 3])
+ }
> symbols(1, asw, circles = 0.01, ylim = c(-0.1, 0.4), inches = F, bg = "red")
> text(1.2, asw, "Average silhouette value\n for true labels")
> boxplot(data.frame(asw.random), col = "blue", add = T)

You will see: The average silhouette value of the pam clustering result lies well outside the range of values
achieved by random clusters.

8. Clustering of genes
8.1. Cluster the 100 top variance genes with hierarchical clustering. You can get the gene names from the
annotation package hgu95av2.

> library(hgu95av2)
> gene.names <- sapply(featureNames(ALL), get, hgu95av2SYMBOL)
> gene.names <- gene.names[genes.var.select]
> d.g = dist(dat.s)
> hc = hclust(d.g, method = "complete")
> plot(hc, labels = gene.names)

8.2. Now plot the average silhouette widths for k=2:20 clusters using pam and have a look at the silhouette
plot for k=2. What do you think: Did we find a structure in the genes?

> asw = numeric()
> for (k in 2:20) {
+ asw[k] = pam(dat.s, k)$silinfo$avg.width
+ }
> plot(1:20, asw, xlab = "# clusters", ylab = "average silhouette width",
+     type = "b")
> plot(silhouette(pam(dat.s, 2)))
