Thyroid cancer can develop from the different cells that form the follicles of the thyroid. This gland located at the base of the throat secretes hormones such as T3 and T4 that have their metabolic functionalities such as control of heart rate, blood pressure, body temperature and weight. See it on TCGA.
From the four types of thyroid cancer (papillary, follicular, medullary, and anaplastic thyroid cancer) Papillary Thyroid Carcinoma (PTC) is the most common type of thyroid cancer (Agrawal et al., 2014). It is more prevalent in women than men and its common diagnosis occurs around 49 years old.
Driver onco-mutations for this type of cancer appear as alterations of the MAPK signaling pathway and the PI3K-AKT pathway (Agrawal et al., 2014). Both are implied in cell proliferation and survival and in human tumorigenesis. The overactivation of the MAPK pathway because of mutations such as the BRAFV600E mutation, leads to the development of PTC from follicular thyroid cells (Xing, 2013). On the other hand, mutations that activate the PI3K-AKT pathway, such as mutations in RAS, PTEN and PIK3CA, mostly leads to development of Follicular Thyroid Adenoma (FTA) and Follicular Thyroid Cancer (FTC) also in follicular thyroid cells (Xing, 2013).
Research from The Cancer Genome Atlas (TCGA) focused on PTC and they found that a part from alterations in BRAF (specifically BRAFV600E) and RAS, there are other driver mutations such the ones in EIF1AX PPM1D, and CHEK2 genes that are main alterations for that type of thyroid cancer(Agrawal et al., 2014).
We start importing the raw table of counts.
library(SummarizedExperiment)
thca <- readRDS("seTHCA.rds")
thca
## class: RangedSummarizedExperiment
## dim: 20115 572
## metadata(5): experimentData annotation cancerTypeCode
## cancerTypeDescription objectCreationDate
## assays(1): counts
## rownames(20115): 1 2 ... 102724473 103091865
## rowData names(3): symbol txlen txgc
## colnames(572): TCGA.4C.A93U.01A.11R.A39I.07
## TCGA.BJ.A0YZ.01A.11R.A10U.07 ... TCGA.KS.A41J.11A.12R.A23N.07
## TCGA.KS.A41L.11A.11R.A23N.07
## colData names(549): type bcr_patient_uuid ...
## lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
Explore the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata.
dim(colData(thca))
## [1] 572 549
col.data <- colData(thca)
col.data[1:5, 1:5]
## DataFrame with 5 rows and 5 columns
## type bcr_patient_uuid
## <factor> <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 tumor E1C8AF06-9EA4-4062-9264-AA916E0025D1
## TCGA.BJ.A0YZ.01A.11R.A10U.07 tumor f3ee888e-5116-4313-a2f6-3b1dcc3e4bc1
## TCGA.BJ.A0Z0.01A.11R.A10U.07 tumor 72d8dcc3-0709-4fd1-98d4-fb75e9340758
## TCGA.BJ.A0Z2.01A.11R.A10U.07 tumor f9ceffc0-d544-418d-b4a9-bd3c84e37026
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 tumor 331cae6e-2868-4c58-9302-709a9ff7d025
## bcr_patient_barcode form_completion_date
## <factor> <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 TCGA-4C-A93U 2014-11-12
## TCGA.BJ.A0YZ.01A.11R.A10U.07 TCGA-BJ-A0YZ 2011-4-11
## TCGA.BJ.A0Z0.01A.11R.A10U.07 TCGA-BJ-A0Z0 2011-4-15
## TCGA.BJ.A0Z2.01A.11R.A10U.07 TCGA-BJ-A0Z2 2011-4-15
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 TCGA-BJ-A0Z3 2011-6-2
## prospective_collection
## <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07 NO
## TCGA.BJ.A0Z0.01A.11R.A10U.07 NO
## TCGA.BJ.A0Z2.01A.11R.A10U.07 NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 NO
col.data.meta <- mcols(colData(thca), use.names=TRUE)
col.data.meta
## DataFrame with 549 rows and 2 columns
## labelDescription
## <character>
## type sample type (tumor/normal)
## bcr_patient_uuid bcr patient uuid
## bcr_patient_barcode bcr patient barcode
## form_completion_date form completion date
## prospective_collection tissue prospective collection indicator
## ... ...
## lymph_nodes_pelvic_pos_total total pelv lnp
## lymph_nodes_aortic_examined_count total aor lnr
## lymph_nodes_aortic_pos_by_he aln pos light micro
## lymph_nodes_aortic_pos_by_ihc aln pos ihc
## lymph_nodes_aortic_pos_total total aor-lnp
## CDEID
## <character>
## type NA
## bcr_patient_uuid NA
## bcr_patient_barcode 2673794
## form_completion_date NA
## prospective_collection 3088492
## ... ...
## lymph_nodes_pelvic_pos_total 3151828
## lymph_nodes_aortic_examined_count 3104460
## lymph_nodes_aortic_pos_by_he 3151832
## lymph_nodes_aortic_pos_by_ihc 3151831
## lymph_nodes_aortic_pos_total 3151827
Now, explore the row (feature) data.
rowRanges(thca)
## GRanges object with 20115 ranges and 3 metadata columns:
## seqnames ranges strand | symbol txlen
## <Rle> <IRanges> <Rle> | <character> <integer>
## 1 chr19 [58345178, 58362751] - | A1BG 3322
## 2 chr12 [ 9067664, 9116229] - | A2M 4844
## 9 chr8 [18170477, 18223689] + | NAT1 2280
## 10 chr8 [18391245, 18401218] + | NAT2 1322
## 12 chr14 [94592058, 94624646] + | SERPINA3 3067
## ... ... ... ... . ... ...
## 100996331 chr15 [20835372, 21877298] - | POTEB 1706
## 101340251 chr17 [40027542, 40027645] - | SNORD124 104
## 101340252 chr9 [33934296, 33934376] - | SNORD121B 81
## 102724473 chrX [49303669, 49319844] + | GAGE10 538
## 103091865 chr21 [39313935, 39314962] + | BRWD1-IT2 1028
## txgc
## <numeric>
## 1 0.5644190
## 2 0.4882329
## 9 0.3942982
## 10 0.3895613
## 12 0.5249429
## ... ...
## 100996331 0.4308324
## 101340251 0.4903846
## 101340252 0.4074074
## 102724473 0.5055762
## 103091865 0.5924125
## -------
## seqinfo: 455 sequences (1 circular) from hg38 genome
Looking deeper into col.data with:
col.data[1:5, 30:35]
## DataFrame with 5 rows and 6 columns
## lymph_nodes_examined_count
## <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07 [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07 YES
## TCGA.BJ.A0Z2.01A.11R.A10U.07 NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 YES
## lymph_nodes_examined
## <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 41
## TCGA.BJ.A0YZ.01A.11R.A10U.07 [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07 5
## TCGA.BJ.A0Z2.01A.11R.A10U.07 [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 3
## lymph_nodes_examined_he_count
## <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 1
## TCGA.BJ.A0YZ.01A.11R.A10U.07 [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07 0
## TCGA.BJ.A0Z2.01A.11R.A10U.07 [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 0
## weiss_score_overall mitoses_per_50_hpf
## <factor> <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 NA NA
## TCGA.BJ.A0YZ.01A.11R.A10U.07 NA NA
## TCGA.BJ.A0Z0.01A.11R.A10U.07 NA NA
## TCGA.BJ.A0Z2.01A.11R.A10U.07 NA NA
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 NA NA
## ajcc_pathologic_tumor_stage
## <factor>
## TCGA.4C.A93U.01A.11R.A39I.07 Stage IVA
## TCGA.BJ.A0YZ.01A.11R.A10U.07 Stage II
## TCGA.BJ.A0Z0.01A.11R.A10U.07 Stage II
## TCGA.BJ.A0Z2.01A.11R.A10U.07 Stage IVC
## TCGA.BJ.A0Z3.01A.11R.A13Y.07 Stage I
We observed that there are different ways to indicate that the data is not available (NA, [Not Available]…). Consequently, we decide to study which variables have more information. To do so, we create two functions, one (na.replace
) to change those values into the standarized NA
nomenclature, and filter.info
to quantify how many of each variable is not NA
.
na.replace <- function(x){
# Remove the unwanted factors
lev <- levels(x)
lev <- lev[!(lev %in% "[Not Available]")]
lev <- lev[!(lev %in% "[Unknown]")]
lev <- lev[!(lev %in% "[Not Applicable]")]
lev <- lev[!(lev %in% "[Not Available]|[Not Available]|[Not Available]")]
lev <- lev[!(lev %in% "[Not Evaluated]")]
lev <- lev[!(lev %in% "None")]
x <- factor(x, levels=lev)
}
filter.info <- function(x){
# Function to filter the NA, [Not Available], [Not Applicable] and [Unknown]
# Return the proportion of information on x
nas <- sum(is.na(x))
return(1 - (nas/length(x)))
}
We apply these functions to the col.data
and we visualize it:
meta.info <- as.data.frame(lapply(col.data, na.replace))
variable.info <- sapply(meta.info, filter.info)
# Representing the information by column vs total information
relative.info <- variable.info[order(variable.info)]/sum(variable.info)
plot(relative.info, main="Information brought by column")
# Plotting the histogram of the information
hist(variable.info,
main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")
# Seeing the most informative to set a threshold,
# frequency are the number of columns with such % of information
hist(variable.info, xlim = c(0.3, 1), ylim = c(0, 35),
main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")
As we can see, in many variables the main value is just NA
or other derivatives meaning “No information available for this variable”. Therefore, in order to keep the most informative variables, we apply a threshold of 60% (of the values containing informative data) to select the relevant ones:
filtered <- meta.info[, variable.info > 0.6]
dim(filtered)
## [1] 572 44
meta.info.summary2 <- col.data.meta[variable.info > 0.6,]
Thus we ended with 44 informatives variables.
To explore the variables associated with the samples we first create an object with ggplot
:
library("ggplot2")
fplot <- ggplot(as.data.frame(filtered))
## ggplot options for axes and background color
# Options of the labels
l <- theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 1))
# Removing the background
b <- theme(axis.line = element_line(colour = "black"),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey",
size = 0.5, linetype = 2),
panel.grid.minor.y = element_line(colour = "grey",
size = 0.25, linetype = 3),
panel.border = element_blank(),
panel.background = element_blank())
y.axis <- ylab("number of cases")
And we plot some relationships for some significant variables:
In the “normal” type samples most of the information is not available. This could be due to the fact that all “normal” samples were obtained from patients with tumors. So in order to have the maximum information of each sample, we infer “normal” samples information from its paired sample data (mainly gender and ethnicity).
fplot + geom_bar(aes(type, fill = gender)) + b + y.axis +
ggtitle("Gender distribution across normal and tumour samples")
As expected, there are more female samples than male samples. As we mentioned before, thyroid cancer is more prevalent in women than men and we can see that not only for tumor samples but for normals samples too. In addition, there are more tumor cases than normal cases.
fplot + geom_bar(aes(type, fill = tumor_status)) + b + y.axis +
ggtitle("Tumor status across normal and tumour samples")
We also have examined the tumor status of the tumor samples. There are more cancer patients without tumor (“tumor free”) than with the tumor (“with tumor”) after the collection of the data. We assume that most of the patients that participate in the study have recovered after it.
fplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b +
y.axis + ggtitle("Ethnicity distribution across normal and tumour samples")
In this plot we can see clearly the lack of data for some variables regarding normal samples. In other words, there is no record for normal samples neither females nor males about the hispanic origin of the individuals.
labels <- c("AM. INDIAN OR ALASKA NAT.", "ASIAN", "BLACK OR AFRICAN AM.",
"WHITE", "NA")
fplot + facet_wrap(~ ethnicity) + geom_bar(aes(race, fill = gender)) + b +
l + y.axis + scale_x_discrete(breaks=waiver(), labels = labels) +
ggtitle("Gender distribution across different ethnicity and race.")
If we have a look at the ethnicities, gender and hispanic origin, the data shows more cases for “white” individuals “not hispanic or latino” than “white” individuals “hispanic or latino”.
fplot + geom_bar(aes(x=gender, fill = race)) + b + y.axis +
ggtitle("Ethnicity distribution in male and female")
If we ignore the hispanic origin of the individuals and just focus on the main ethnicities of the study we can observe that for both female and male samples there is a majority of “white” individuals. However, the fact that “white” subjects are hispanic or not can provide some variability so that is something we will take into account later.
fplot + facet_wrap(~ type ) + geom_bar(aes(tumor_focality, fill = gender)) + b +
l + y.axis + xlab("tumor focality") +
ggtitle("Gender distribution in different tumour focality")
There is no data about tumor focality in the normal samples, but this could be due to the fact that are samples obtained probably during the tumor removal or for a biopsy, from a healty part of the tyroid gland. For this reason there is no reason for the normal samples to have this feature.
labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular",
"Papillary - Tall Cell", "NA")
hd.xaxis <- scale_x_discrete(name = "histologic diagnosis", breaks=waiver(),
labels=labels)
fplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + hd.xaxis +
y.axis +
ggtitle("Gender distribution across different types of hystologic diagnosis")
Here, we can see that the tumor samples came mainly from classical Papillary Thyroid Carcinoma. We expect differential expression between the different thyroid cancer phenotypes so this might be a source of variability.
labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular",
"Papillary - Tall Cell", "NA")
ggplot(filtered[!is.na(filtered$age_at_diagnosis), ]) +
geom_bar(aes(age_at_diagnosis, na.rm = TRUE, fill = histologic_diagnosis)) +
b + l + labs(x="age at diagnosis") + y.axis +
scale_fill_discrete(name = "histologic diagnosis", breaks = waiver(),
labels = labels) +
ggtitle("Histologic diagnosis distribution across different ages")
Even thought, thyroid cancer is commontly diagnosed at mature ages (~50) this plot shows that the classical papillary thyroid cancer has similar representation in all ages for this dataset.
As a summary from the previous plots, the most abundant population in the samples is the “white” and “not hispanic or latino”, so we decide to perform the analysis on this group of individuals, whose tumoral sample correspond to the most common cancer in thyroids: Papillary Thyroid Carcinoma(PTC).
With this approach we want to avoid to include in the analysis, samples that introduce variability (mainly differential expression due to enviromental or epigenetic factors) as much as possible. To have more insight on the disease we want to have just paired samples, that is, samples from the same participant. So we filter the data accordingly:
# Paired data
paired <- intersect(thca[, thca$type == 'tumor']$bcr_patient_barcode,
thca[, thca$type == 'normal']$bcr_patient_barcode)
paired_mask <- thca$bcr_patient_barcode %in% paired
filtered_filt <- filtered[paired_mask, ]
# Imposing the selected conditions
tumor_names <- row.names(subset(filtered_filt,
type == "tumor" &
race == "WHITE" &
histologic_diagnosis == "Thyroid Papillary Carcinoma - Classical/usual" &
ethnicity == "NOT HISPANIC OR LATINO" ))
# Extracting the participants identifiers
participants <- unlist(lapply(tumor_names, substr, 9, 12))
participants <- participants[participants != ""]
check <- function(x, checklist){
# Returning name of the sample just if the name is the same as in checklist
a <- substr(x, 9, 12)
if(a %in% checklist){
return(x)
}
}
# Extracting sample names of tumor and normal data.
sample_names <- unlist(lapply(row.names(filtered_filt), check,
checklist = participants))
thca.filtered <- thca[, sample_names]
We are left with 66 samples. We explore again the variables for this subset. In this case, we show just some of the relationship plots we have seen previously:
filtered_filt <- filtered_filt[row.names(filtered_filt) %in% sample_names, ]
mplot <- ggplot(filtered_filt) + y.axis
mplot + geom_bar(aes(type, fill = gender)) + b +
ggtitle("Ethnicity and type of sample after filtering")
This plot just checks whether we have the same number of cases in normal sample and tumor sample so that we have done correctly the filtering.
mplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b +
ggtitle("Hispanic origin, gender and type of sample after filtering")
Hispanic origin data is only available for tumor samples but as we have mentioned before we can extrapolate ethinicity and origin to the normal samples because they are paired (see below).
mplot + geom_bar(aes(x=gender, fill = ethnicity)) + b + l +
ggtitle("Gender samples and hispanic origin after filtering")
Same proportion of being or not being “hispanic or latino” in females and males. Subset filtering has not modified the fact that we have more female samples than males.
mplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l +
scale_x_discrete(name = "histologic diagnosis", breaks=waiver(),
labels = c("Papillary - Classical", "NA")) +
ggtitle("Gender distribution across different types of hystologic diagnosis
after filtering")
Papillary Classical Thyroid cancer in the same proportion as “NA” because we have the same number of normal cases and classical thyroid tumor cases.
ggplot(filtered_filt[!is.na(filtered_filt$age_at_diagnosis), ]) +
geom_bar(aes(age_at_diagnosis, fill = histologic_diagnosis)) + b + l + y.axis +
xlab("age at diagnosis") + scale_fill_discrete(name = "histologic diagnosis",
breaks=waiver(), labels = c("Papillary - Classical", "NA")) +
ggtitle("Histologic diagnosis distribution across different ages after filtering")
Number of cases across ages is more or less similar after the filtering exept for some ages that have more cases around 30, 34, 42 and 47.
To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a `DGEList’ object.
library(edgeR)
# Normalization just with the selected samples
dge.subset <- DGEList(counts = assays(thca.filtered)$counts,
genes = as.data.frame(mcols(thca.filtered)))
dge <- DGEList(counts = assays(thca)$counts, genes = as.data.frame(mcols(thca)))
Now calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.
assays(thca.filtered)$logCPM <- cpm(dge.subset, log=TRUE, prior.count=3)
assays(thca.filtered)$logCPM[1:5, 1:5]
## TCGA.BJ.A28R.01A.11R.A16R.07 TCGA.BJ.A28W.01A.11R.A32Y.07
## 1 3.014910 3.895927
## 2 8.681899 9.333470
## 9 -4.164318 -4.164318
## 10 -4.164318 -4.164318
## 12 4.423408 4.912253
## TCGA.BJ.A2N7.01A.11R.A18C.07 TCGA.BJ.A2N8.01A.11R.A18C.07
## 1 2.074429 3.634778
## 2 9.313603 8.873442
## 9 -4.164318 -4.164318
## 10 -4.164318 -4.164318
## 12 4.523368 4.366761
## TCGA.BJ.A2N9.01A.11R.A18C.07
## 1 3.436989
## 2 8.467420
## 9 -4.164318
## 10 -4.164318
## 12 4.584505
Let’s examine the library sizes in terms of total number of sequence read counts per sample. Figure 5 below shows library sizes per sample in increasing order.
ord <- order(dge$sample$lib.size)
barplot(dge$sample$lib.size[ord]/1e6, las=1, ylab="Millions of reads",
xlab="Samples", col=c("blue", "red")[(thca$type[ord] == "tumor") + 1],
border = NA)
legend("topleft", c("tumor", "normal"), fill=c("red", "blue"), inset=0.01)
Plotting for all samples makes it impossible to see accurately the blue and red colors corresponding to tumor and normal respectively.
ord.subset <- order(dge.subset$sample$lib.size)
barplot(dge.subset$sample$lib.size[ord.subset]/1e6, las=1, ylab="Millions of reads",
xlab="Samples", col=c("blue", "red")[(thca.filtered$type[ord.subset] == "tumor") + 1],
border = NA)
legend("topleft", c("tumor", "normal"), fill=c("red", "blue"), inset=0.01)
After the filtering the subset plot allows one to see that there is similar heterogeneity in tumor and normal library sizes. In addition, this figure reveals substantial differences in sequencing depth between samples and we may consider discarding those samples whose depth is substantially lower than the rest. To identify which are these samples we may simply look at the actual numbers including portion of the sample identifier that distinguishes them.
sampledepth <- round(dge.subset$sample$lib.size / 1e6, digits = 1)
names(sampledepth) <- substr(sample_names, 6, 12)
qqnorm(sampledepth)
qqline(sampledepth)
Looking into the qqplot we decide to filter out those below 40 but keeping only those that are paired:
sample_names <- sample_names[sampledepth > 40]
keep.paired <- function(pattern, x){
# Return the names if it is paired
pos <- grep(pattern, x, fixed = TRUE)
if (length(pos) > 1){
return(x[pos])
}
}
sample_names <- unique(unlist(lapply(substr(sample_names, 9, 12),
keep.paired, sample_names)))
thca.filtered <- thca.filtered[, colnames(thca.filtered) %in% sample_names]
dge.subset <- DGEList(counts = assays(thca.filtered)$counts,
genes = mcols(thca.filtered))
## Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
## Arguments in '...' ignored
Thus we have 59 samples which are above the threshold and 54 of them are paired, which are the ones which will be used from here on.
Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure 7.
library(geneplotter)
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(assays(thca.filtered[,
thca.filtered$type == "tumor"])$logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Tumor samples",
las = 1)
multidensity(as.list(as.data.frame(assays(thca.filtered[,
thca.filtered$type == "normal"])$logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Normal samples",
las = 1)
These two spikes are due to the lowly expressed genes and the highly expressed genes, but in the lowly expressed genes the effect of GC content and the read length may have higher impact.
To see the impact of GC and length on the number of read we plot them:
gc_len <- rowRanges(thca.filtered)
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txgc)
## [1] -0.01173666
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txlen)
## [1] -0.0008702022
par(mfrow=c(1,2))
plot(log10(rowMeans(assays(thca.filtered)$counts)), gc_len$txgc,
main = "GC content over expression", xlab = "Expression (log10)",
ylab = "GC")
plot(log10(rowMeans(assays(thca.filtered)$counts)), log10(gc_len$txlen),
main = "Length over the expression", xlab = "Expression (log10)",
ylab = "Length (log10)")
There is a low correlation on the counts and the GC content or the length but as there are clearly two groups based on the length, so we will normalize also using the length with the package cqn:
library("cqn")
cqn.subset <- cqn(assays(thca.filtered)$counts, x = gc_len$txgc, lengths = gc_len$txlen)
par(mfrow = c(1, 2))
boxplot(cqn.subset$func1, names = FALSE, xlab = "Samples",
main = "Estimated effect of GC")
boxplot(cqn.subset$func2, names = FALSE, xlab = "Samples",
main = "Estimated effect of length")
As we thought the effect of the length is higher than the effect of GC content
We can see if the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately:
cqn.logCPM <- cqn.subset$y + cqn.subset$offset
assays(thca.filtered)$cqn.logCPM <- cqn.logCPM
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(
cqn.logCPM[,thca.filtered$type == "tumor"])),
xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
cqn.logCPM[, thca.filtered$type == "normal"])),
xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
These “after-normalization” plots are quite better than the previous ones used without GC and length normalization on the second spike. With this normalization we do not appreciate substantial differences between the samples in the distribution of expression values. However, below 0 there is a lot of variability.
Let’s calculate now the average expression per gene through all the samples. Figure 9 shows the distribution of those values across genes.
par(mfrow = c(1, 2))
avgexp <- rowMeans(assays(thca.filtered)$logCPM)
hist(avgexp, xlab = "log2 CPM", main = "Using DGEList normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)
# Using the corrected logCPM
avgexp.cqn <- rowMeans(assays(thca.filtered)$cqn.logCPM)
hist(avgexp.cqn, xlab = "log2 CPM", main = "Using cqn normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)
In the light of previous plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.
mask <- avgexp.cqn > 1
thca.filtered <- thca.filtered[mask, ]
dge.subset <- dge.subset[mask, ]
We have nrow(thca.filtered)
genes remaining on our dataset from 54 samples, half of them are healthy and the other half are tumoral.
Since we will use the offsets computed by cqn
, there is no need to normalize using the normalization tools from edgeR
, such as calcNormFactors
according to cqn vignette.
dge.subset$offset <- cqn.subset$glm.offset
dge.norm <- calcNormFactors(dge.subset)
assays(thca.filtered)$n.logCPM <- cpm(dge.norm, log = TRUE, prior.count = 3)
par(mfrow = c(3, 2))
# Not normalized but filtered
multidensity(as.list(as.data.frame(
assays(thca.filtered[,thca.filtered$type == "tumor"])$logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
assays(thca.filtered[,thca.filtered$type == "tumor"])$logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
# Normalize with calcNormFactors after filtering out those
multidensity(as.list(as.data.frame(
assays(thca.filtered[,thca.filtered$type == "tumor"])$n.logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
assays(thca.filtered[,thca.filtered$type == "tumor"])$n.logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
# Normalized with cqn and filtered the low values:
multidensity(as.list(as.data.frame(
assays(thca.filtered[,thca.filtered$type == "tumor"])$cqn.logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
assays(thca.filtered[,thca.filtered$type == "tumor"])$cqn.logCPM)),
xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
The normalization after filtering lowly-expressed genes with ‘cqn’ package clusters samples into one thin gaussian curve whereas normalization with calcNormFactors makes it thiker.
We examine now the MA-plots of the normalized expression profiles. We look first to the tumor samples.
par(mfrow = c(9, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type == "tumor"]
dgetmp <- dge.subset[, thca.filtered$type == "tumor"]
for (i in 1:ncol(setmp)) {
A <- rowMeans(assays(setmp)$cqn.logCPM)
M <- assays(setmp)$cqn.logCPM[, i] - A
samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
smoothScatter(A, M, main = samplename, las = 1)
abline(h = 0, col = "blue", lwd = 2)
lo <- lowess(M ~ A)
lines(lo$x, lo$y, col = "red", lwd = 2)
}
We do not observe samples with major expression-level dependent biases.
Let’s look now to the normal samples:
par(mfrow = c(9, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type == "normal"]
dgetmp <- dge.subset[, thca.filtered$type == "normal"]
for (i in 1:ncol(setmp)) {
A <- rowMeans(assays(setmp)$cqn.logCPM)
M <- assays(setmp)$cqn.logCPM[, i] - A
samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
smoothScatter(A, M, main = samplename, las = 1)
abline(h = 0, col = "blue", lwd = 2)
lo <- lowess(M ~ A)
lines(lo$x, lo$y, col = "red", lwd = 2)
}
We do not observe either important expression-level dependent biases among the normal samples except on the sample TCGA-BJ-A3PR. Which we proceed to remove it from the cqn.logCP:
mask <- -grep("TCGA.BJ.A3PR", colnames(thca.filtered))
thca.filtered <- thca.filtered[,mask]
sample_names <- sample_names[mask]
dge.subset <- dge.subset[,mask]
Thus the remaining samples are 52 from 26 patients. Recall that we still have 11801 genes from each sample.
We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see the wiki), following the strategy described here we are going to derive different elements of the TCGA barcode and examine their distribution across samples.
tss <- substr(sample_names, 6, 7)
table(tss)
## tss
## BJ EL ET H2
## 13 33 2 4
samplevial <- substr(sample_names, 14, 16)
table(samplevial)
## samplevial
## 01A 11A 11C
## 26 25 1
portion <- substr(sample_names, 18, 19)
table(portion)
## portion
## 11 12 21 22 31
## 42 6 2 1 1
table(substr(sample_names, 20, 20))
##
## R
## 52
plate <- substr(sample_names, 22, 25)
table(plate)
## plate
## A16R A180 A18C A19O A20F A21D A220 A22L A23N
## 2 2 5 2 10 6 6 7 12
center <- substr(sample_names, 27, 28)
table(center)
## center
## 07
## 52
We can observe that the samples where collected at 4 tissues source sites but the majority of them at University of Pittsburg (BJ) and MD Anderson(EL). However,the manager center of the data is the same (University of North Carolina) and the same metabolite (RNA). In addition, they come from 7 different plates more or less distributed. However, most samples came from 2 out of 5 different portion and 2 out of 3 vials.
We are going to check if the previous variables of the barcode are surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with the variables:
table(data.frame(TYPE = thca.filtered$type, TSS = tss))
## TSS
## TYPE BJ EL ET H2
## normal 0 20 2 4
## tumor 13 13 0 0
# We will check for the other variables
table(data.frame(TYPE = thca.filtered$type, PLATE = plate))
## PLATE
## TYPE A16R A180 A18C A19O A20F A21D A220 A22L A23N
## normal 0 2 0 0 3 0 2 7 12
## tumor 2 0 5 2 7 6 4 0 0
table(data.frame(TYPE = thca.filtered$type, Portion = portion))
## Portion
## TYPE 11 12 21 22 31
## normal 18 4 2 1 1
## tumor 24 2 0 0 0
table(data.frame(TYPE = thca.filtered$type, Vial = samplevial))
## Vial
## TYPE 01A 11A 11C
## normal 12 13 1
## tumor 14 12 0
From the tables we can already see that all of them can be candidates as indicators for batch effect as there is no homogeneity of the number of tumor and normal samples across any of the variables.
To double check if they have some effect on the batch we make the hierarchical clustering of them. First, we set some functions to plot the hierarhcical clustering:
help.dendogram <- function(x, batch, labels) {
# Helper function to plot dendograms
if (is.leaf(x)) {
## color by batch
attr(x, "nodePar") <- list(lab.col = as.vector(batch[attr(x, "label")]))
## label by outcome
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}
plot.batch <- function(dendrogram, batching, se, out){
# Function to see the batch
# Given a sampleDendogram use batching to colour the leafs by sample.names of se
sample.names = colnames(se)
batch <- as.integer(factor(batching))
names(batch) <- sample.names
i.dendrogram <- dendrapply(dendrogram, help.dendogram, batch, out)
plot(i.dendrogram, main = "Hierarchical clustering")
legend("topright", paste("Batch", levels(factor(batching))), fill = sort(unique(batch)))
}
With this functions to plot the clustering with different labelings is easier:
d <- as.dist(1-cor(assays(thca.filtered)$cqn.logCPM, method = "spearman"))
sampleClustering <- hclust(d)
sampleDendrogram <- as.dendrogram(sampleClustering, hang = 0.1)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)
par(mfrow = c(1,1))
plot.batch(sampleDendrogram, portion, thca.filtered, outcome)
plot.batch(sampleDendrogram, samplevial, thca.filtered, outcome)
plot.batch(sampleDendrogram, tss, thca.filtered, outcome)
plot.batch(sampleDendrogram, as.character(thca.filtered$type), thca.filtered, outcome)
plot.batch(sampleDendrogram, plate , thca.filtered, outcome)
We can see that the sample from A3PR-normal clusters with the tumoral ones, and the A3H2 tumoral clusters with the healthy patients on Figure 14.D. To visualize the relationship we use the multiscaling dimensional plot colored by portion, tss, type, and gender:
batch_effects <- list(portion, tss, as.character(thca.filtered$type),
as.character(thca.filtered$gender))
for (batching in batch_effects){
batch <- as.integer(factor(batching))
names(batch) <- sample_names
plotMDS(dge.subset, labels = outcome, col = batch)
legend("bottomleft", paste("Batch", levels(factor(batching))),
fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
We can observe that there are two groups of tumoral samples, and one of them is close to healthy samples.
Our model of the disease will affect the analysis from here onwards. Therefore we need to select our model and null hypothesis. We define them as:
# Create the design matrix involving tumor and normal samples and males and females
treat <- factor(paste(thca.filtered$type, thca.filtered$gender, sep = "."))
paired <- as.numeric(as.factor(thca.filtered$bcr_patient_barcode))
design <- model.matrix(~treat + paired, data=colData(thca.filtered))
colnames(design) <- c(levels(treat), "paired")
# design <- cbind("(Intercept)" = 1, design)
head(design)
## normal.FEMALE normal.MALE tumor.FEMALE
## TCGA.BJ.A28R.01A.11R.A16R.07 1 0 1
## TCGA.BJ.A2N7.01A.11R.A18C.07 1 0 1
## TCGA.BJ.A2N8.01A.11R.A18C.07 1 0 1
## TCGA.BJ.A2N9.01A.11R.A18C.07 1 0 1
## TCGA.BJ.A2NA.01A.12R.A19O.07 1 0 0
## TCGA.BJ.A3PU.01A.11R.A220.07 1 0 0
## tumor.MALE paired
## TCGA.BJ.A28R.01A.11R.A16R.07 0 21
## TCGA.BJ.A2N7.01A.11R.A18C.07 0 30
## TCGA.BJ.A2N8.01A.11R.A18C.07 0 31
## TCGA.BJ.A2N9.01A.11R.A18C.07 0 32
## TCGA.BJ.A2NA.01A.12R.A19O.07 1 33
## TCGA.BJ.A3PU.01A.11R.A220.07 1 39
# And the null model
mod0 <- model.matrix(~1 + gender + type + paired, data = colData(thca.filtered))
head(mod0)
## (Intercept) genderMALE typetumor paired
## TCGA.BJ.A28R.01A.11R.A16R.07 1 0 1 21
## TCGA.BJ.A2N7.01A.11R.A18C.07 1 0 1 30
## TCGA.BJ.A2N8.01A.11R.A18C.07 1 0 1 31
## TCGA.BJ.A2N9.01A.11R.A18C.07 1 0 1 32
## TCGA.BJ.A2NA.01A.12R.A19O.07 1 1 1 33
## TCGA.BJ.A3PU.01A.11R.A220.07 1 1 1 39
Our model includes the interaction of gender and type, while our null model just includes the gender and type (in both cases with the paired factor to take into account the fact that the samples are paired). This way comparing this models we will see if there is interaction between gender and tumor or not.
After looking for batch effect we couldn’t find a clear cause of the batch effect. We suspect that sample vial has something to do. However, we try to see if we can improve the clustering by using method ComBat for sample vial.
library("sva")
combatexp <- ComBat(assays(thca.filtered)$cqn.logCPM, tss, mod = design)
## Found 4 batches
## Adjusting for 4 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
d <- as.dist(1 - cor(combatexp, method = "spearman"))
cluster.combat <- hclust(d)
cluster.combat <- as.dendrogram(cluster.combat, hang = 0.1)
names(batch) <- colnames(thca.filtered)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)
plot.batch(cluster.combat, samplevial, thca.filtered, outcome)
plot.batch(cluster.combat, tss, thca.filtered, outcome)
plot.batch(cluster.combat, thca.filtered$gender, thca.filtered, outcome)
plot.batch(cluster.combat, thca.filtered$type, thca.filtered, outcome)
plot.batch(cluster.combat, portion, thca.filtered, outcome)
We can observe a slightly improvement, but still some tumoral samples cluster together with the normal ones.
batch_effects <- list(samplevial, tss, as.character(thca.filtered$gender),
as.character(thca.filtered$type), portion)
for (batching in batch_effects){
batch <- as.integer(factor(batching))
names(batch) <- sample_names
plotMDS(combatexp, labels = outcome, col = batch)
legend("bottomleft", paste("Batch", levels(factor(batching))),
fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
To improve the batch effect removal we look if QR decomposition yields better results. In this case, we look for sample vial and TSS batch effects removal. To do so, we have to built the full model which would include gender and type as fixed effects.
library("limma")
qrexp <- removeBatchEffect(assays(thca.filtered)$cqn.logCPM, tss, design = design)
d <- as.dist(1 - cor(qrexp, method = "spearman"))
cluster.qr <- hclust(d)
cluster.qr <- as.dendrogram(cluster.qr, hang = 0.1)
plot.batch(cluster.qr, samplevial, thca.filtered, outcome)
plot.batch(cluster.qr, tss, thca.filtered, outcome)
plot.batch(cluster.qr, thca.filtered$gender, thca.filtered, outcome)
plot.batch(cluster.qr, thca.filtered$type, thca.filtered, outcome)
plot.batch(cluster.qr, portion, thca.filtered, outcome)
With QR decomposition we can observe that some tumoral samples cluster with other normal samples on a lower branch.
for (batching in batch_effects){
batch <- as.integer(factor(batching))
names(batch) <- sample_names
plotMDS(qrexp, labels = outcome, col = batch)
legend("bottomleft", paste("Batch", levels(factor(batching))),
fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
We further try if SVD yelds a better result:
library("corpcor")
s <- fast.svd(t(scale(t(assays(thca.filtered)$cqn.logCPM),
center = TRUE, scale = TRUE)))
pcSds <- s$d
pcSds[1] <- 0
svdexp <- s$u %*% diag(pcSds) %*% t(s$v)
colnames(svdexp) <- colnames(thca.filtered)
class(svdexp)
## [1] "matrix"
dim(svdexp)
## [1] 11801 52
d <- as.dist(1 - cor(svdexp, method = "spearman"))
cluster.svd <- hclust(d)
cluster.svd <- as.dendrogram(cluster.svd, hang = 0.1)
names(batch) <- colnames(thca.filtered)
plot.batch(cluster.svd, samplevial, thca.filtered, outcome)
plot.batch(cluster.svd, tss, thca.filtered, outcome)
plot.batch(cluster.svd, thca.filtered$gender, thca.filtered, outcome)
plot.batch(cluster.svd, thca.filtered$type, thca.filtered, outcome)
plot.batch(cluster.svd, portion, thca.filtered, outcome)
SVD clusters together normal and tumoral samples, maybe because some of them are quite similar according to the PCA previously performed. Therefore we don’t plot the MDS for SVD.
After all the process the best one, that separates better tumor and normal samples is the ComBat function, as we have seen on the figure 17.D, which will be used for following-up analysis.
assays(thca.filtered)$qrexp <- qrexp
assays(thca.filtered)$combatexp <- combatexp
We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva using the models previously created.
pv <- f.pvalue(assays(thca.filtered)$combatexp, design, mod0)
sum(p.adjust(pv, method = "fdr") < 0.01)
## [1] 0
There are 0 genes changing significantly their expression at FDR < 1%. In Figure 18 below we show the distribution of the resulting p-values.
hist(pv, main = "Histogram of p-values", las = 1)
hist(pv, main = "Histogram of p-values", xlim = c(0, 0.1), breaks = 80)
As we can see the distribution of p-values is not uniform, and reach the maximum frequency at p-value 1, and the minimum at p-value 0.
Using sva
to estimate and remove artifacts from high dimensional data, where we use our alternative hypothesis and our null hypothesis:
sv <- sva(assays(thca.filtered)$combatexp, design, mod0)
## Number of significant surrogate variables is: 9
## Iteration (out of 5 ):1 2 3 4 5
The SVA algorithm has found 9 surrogate variables. Let’s use them to assess against the extent of differential expression this time adjusting for these surrogate variables.
modsv <- cbind(design, sv$sv)
colnames(modsv) <- c(colnames(design), paste0("sv", c(1:sv$n)))
mod0sv <- cbind(mod0, sv$sv)
colnames(mod0sv) <- c(colnames(mod0), paste0("sv", c(1:sv$n)))
pvsv <- f.pvalue(assays(thca.filtered)$combatexp, modsv, mod0sv)
sum(p.adjust(pvsv, method = "fdr") < 0.01)
## [1] 0
Apparently we haven’t increased the number of DE genes, but we can see if the distribution of p-values has changed:
hist(pvsv, main = "Histogram of p-values adjusting surrogate variables",
las = 1)
hist(pvsv, main = "Histogram of p-values adjusting surrogate variables",
las = 1, xlim = c(0, 0.1), breaks = 80)
As we can see the distribution is now uniform, thus sva
corrected for artifacts and the underlying hypotesis of uniform distribution of gene expression is true.
Instead of working with the surrogate variables in the model, we could remove them using the following function
# Code from https://www.biostars.org/p/121489/#121500
cleanY = function(y, mod, svs) {
# Function to remove the efect of sv following a design from y
X = cbind(mod, svs)
Hat = solve(t(X) %*% X) %*% t(X)
beta = (Hat %*% t(y))
rm(Hat)
gc()
P = ncol(mod)
return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}
e.no_surrogates <- cleanY(assays(thca.filtered)$combatexp, design, sv$sv)
The object e.no_surrogates contains the expression after removing the surrogates variables.
We want ensembl annotation, and gene symbol of our probes, and we want also the gene ontologies from all the human genes (will be later used for functional enrichment):
library("org.Hs.eg.db")
##
annot <- select(org.Hs.eg.db,
columns = c("SYMBOL", "ENSEMBL"),
keys = rownames(thca.filtered),
keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
all.Human.GO <- select(org.Hs.eg.db,
columns = c("GO", "SYMBOL"),
key = keys(org.Hs.eg.db, keytype = "ENTREZID"),
keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
annot.Entrez.GO <- select(org.Hs.eg.db,
columns = c("GO", "SYMBOL"),
keys = rownames(thca.filtered),
keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
The different objects contain different information: annot
contains the Symbol and Ensembl identifier from the remaining genes, all.Human.GO
is the object with Gene Ontologies and Symbols for all human genes, while anot.Entrez.GO
has the GO and the symbols for the genes.
Using the expression values with batch effect removed by ComBat
we proceed with the voom normalization, note that the surrogate variables effect is still in there, because it is better adjusting than removing the effect of a variable.
# Transform the log2 normalized by cqn to counts without the surrogate variables
n_counts <- apply(assays(thca.filtered)$combatexp, 1:2, function(x){2^(x)})
# Normalize by mean variance with voom
vo <- voom(n_counts, modsv, plot = TRUE)
fit <- lmFit(vo, modsv)
# Make the comparisons we are interested in based on our design
cm <- makeContrasts(
FemaleTumorVsFemaleNormal = tumor.FEMALE - normal.FEMALE,
MaleTumorVsMaleNormal = tumor.MALE - normal.MALE,
FemaleTumorVsMaleTumor = tumor.FEMALE - tumor.MALE,
FemaleNormalVsMaleNormal = normal.FEMALE - normal.MALE,
TumorVsNormal = (tumor.MALE + tumor.FEMALE) - (normal.MALE + normal.FEMALE),
FemaleVsMale = (tumor.FEMALE + normal.FEMALE) - (tumor.MALE + normal.MALE),
FemaleTumorvsMaleNormal = tumor.FEMALE - normal.MALE,
FemalNormaleVsMaleTumor = normal.FEMALE - tumor.MALE,
levels = modsv)
cm[1:5,]
## Contrasts
## Levels FemaleTumorVsFemaleNormal MaleTumorVsMaleNormal
## normal.FEMALE -1 0
## normal.MALE 0 -1
## tumor.FEMALE 1 0
## tumor.MALE 0 1
## paired 0 0
## Contrasts
## Levels FemaleTumorVsMaleTumor FemaleNormalVsMaleNormal
## normal.FEMALE 0 1
## normal.MALE 0 -1
## tumor.FEMALE 1 0
## tumor.MALE -1 0
## paired 0 0
## Contrasts
## Levels TumorVsNormal FemaleVsMale FemaleTumorvsMaleNormal
## normal.FEMALE -1 1 0
## normal.MALE -1 -1 -1
## tumor.FEMALE 1 1 1
## tumor.MALE 1 -1 0
## paired 0 0 0
## Contrasts
## Levels FemalNormaleVsMaleTumor
## normal.FEMALE 1
## normal.MALE 0
## tumor.FEMALE 0
## tumor.MALE -1
## paired 0
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
# Decide the threshold of differentially expressed genes
DE_threshold <- 4
lfc_threshold <- log2(DE_threshold)
res <- decideTests(fit2, lfc = lfc_threshold)
summary(res)
## FemaleTumorVsFemaleNormal MaleTumorVsMaleNormal FemaleTumorVsMaleTumor
## -1 10630 0 2
## 0 1169 11801 11799
## 1 2 0 0
## FemaleNormalVsMaleNormal TumorVsNormal FemaleVsMale
## -1 1 9941 1
## 0 776 1852 773
## 1 11024 8 11027
## FemaleTumorvsMaleNormal FemalNormaleVsMaleTumor
## -1 1 3
## 0 11800 1153
## 1 0 10645
We decided a threshold of differentially expressed genes of 4 (thus log2 of it is 2), meaning that it must be 4 times more expressed in one group than in other in orther to pass the filter. Based on these we are only interested in four contrast which will be annotated with symbol and ensembl id:
tt_tumor <- topTable(fit2, number = Inf, coef = "TumorVsNormal", p.value = 0.05)
tt_gender <- topTable(fit2, number = Inf, coef = "FemaleVsMale", p.value = 0.05)
tt_females <- topTable(fit2, number = Inf, coef = "FemaleTumorVsFemaleNormal", p.value = 0.05)
tt_males <- topTable(fit2, number = Inf, coef = "MaleTumorVsMaleNormal", p.value = 0.05)
# Annotate with the gene symbol
tt_tumor <- merge(tt_tumor, annot, by.x = 0, by.y = "ENTREZID",
all.x = T, all.y = F)
tt_males <- merge(tt_males, annot, by.x = 0, by.y = "ENTREZID",
all.x = T, all.y = F)
tt_gender <- merge(tt_gender, annot, by.x = 0, by.y = "ENTREZID",
all.x = T, all.y = F)
tt_females <- merge(tt_females, annot, by.x = 0, by.y = "ENTREZID",
all.x = T, all.y = F)
head(tt_tumor)
## Row.names logFC AveExpr t P.Value adj.P.Val
## 1 10002 -9.476681 8.292238 -10.373128 4.543685e-13 1.700603e-12
## 2 100033416 -2.200815 2.309834 -3.099678 3.481826e-03 4.556335e-03
## 3 100033417 -5.299424 7.891636 -2.295198 2.687264e-02 3.224772e-02
## 4 100033420 -4.215710 4.668286 -2.464289 1.797807e-02 2.194418e-02
## 5 100033424 -4.426622 3.413475 -7.221180 7.724593e-09 1.644856e-08
## 6 100033426 -5.603019 6.223775 -14.069474 2.382373e-17 2.333144e-16
## B SYMBOL ENSEMBL
## 1 19.457926 NR2E3 ENSG00000278570
## 2 -1.862352 SNORD116-4 ENSG00000275529
## 3 -4.046471 SNORD116-5 ENSG00000207191
## 4 -3.554915 SNORD116-8 ENSG00000207093
## 5 9.910661 SNORD116-12 ENSG00000207197
## 6 28.689808 SNORD116-14 ENSG00000206621
Diagnostics plot:
pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_gender$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 6], df = fit2$df.prior + fit2$df.residual[6], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 6], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between females and males patients",
outer = TRUE, cex = 1.5, side = 3, line = -2)
pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_tumor$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 5], df = fit2$df.prior + fit2$df.residual[5], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 5], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between healthy and tumoral patients",
outer = TRUE, cex = 1.5, side = 3, line = -2)
pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_females$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 1], df = fit2$df.prior + fit2$df.residual[1], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 1], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between females patients and healthy.", outer = TRUE, cex = 1.5,
side = 3, line = -2)
# pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
# hist(tt_males$P.Value, xlab = "Raw P-values", main = "", las = 1)
# qqt(fit2$t[, 2], df = fit2$df.prior + fit2$df.residual[2], main = "", pch = ".", cex = 3)
# qqline(fit2$t[, 2], lwd = 2, col = "red")
# abline(0, 1, lwd = 2)
# mtext("Comparing genes between male patients and healthy.", outer = TRUE, cex = 1.5,
# side = 3, line = -2)
We can see in volanco plots how look like the data
volcano <- function(data, title){
# Plot a volcano plot of a toptable output of limma
if (nrow(data) == 0) {
return(NULL)
}
ggplot(data, aes(logFC, B, colour = adj.P.Val)) + geom_point() + theme_bw()+
ggtitle(title) + geom_vline(xintercept = c(2, -2), col = "red")+
geom_hline(yintercept = 4.6, col = "red")
}
volcano(tt_tumor, "Comparing tumoral vs normal")
volcano(tt_gender, "Comparing females vs males")
volcano(tt_females, "Comparing females: tumoral vs normal")
The number of genes which are differentialed expressed can be calculated with:
de_tumor <- subset(tt_tumor, abs(logFC) >= lfc_threshold & B >= 4.6)
de_gender <- subset(tt_gender, abs(logFC) >= lfc_threshold & B >= 4.6)
tryCatch(de_males <- subset(tt_males, abs(logFC) >= lfc_threshold & B >= 4.6),
error = function(e){return(0)})
## [1] 0
de_females <- subset(tt_females, abs(logFC) >= lfc_threshold & B >= 4.6)
# Number of genes which pass the filter
de_genes <- c(nrow(de_tumor), nrow(de_gender), 0, nrow(de_females))
names(de_genes) <- c("Tumor vs Normal", "Females vs Males", "Males", "Females")
de_genes
## Tumor vs Normal Females vs Males Males Females
## 7204 10630 0 8976
As we can see there aren’t any gene of the comparison male vs female that pass the filter. We can compare if in the different comparisons such genes remain allways up-regulated or down-regulated.
But we create some function to make it easier to compare several lists:
up_down <- function(data, names = NULL){
# Classify the genes in up or down, names are the names of the genes and
# should be on the same order
dif_genes <- ifelse(data$logFC > 0, "up", "down")
if (is.null(names)){
names(dif_genes) <- rownames(data)
} else {
names(dif_genes) <- names
}
dif_genes
}
classify <- function(regulated1, regulated2){
# Check how much up regulated genes in one comparison are down regulated in others
inter <- intersect(names(regulated1), names(regulated2))
genes <- c()
genes2 <- c()
# Reorder them by the name for the comparison with table
for (name in inter){
genes <- c(genes, regulated1[name])
genes2 <- c(genes2, regulated2[name])
}
# Perform the comparison
table(genes, genes2,
dnn = c(deparse(substitute(regulated1)), deparse(substitute(regulated2))))
}
genes_names <- function(regulated1, regulated2){
# Extract the names of those up-regulated, down-regulated or not clear in
# both lists
inter <- intersect(names(regulated1), names(regulated2))
up <- c()
down <- c()
not_clear <- c()
for (name in inter){
if (regulated1[name] == regulated2[name]){
if (regulated1[name] == "up"){
up <- c(up, name)
} else{
down <- c(down, name)
}
} else{
not_clear <- c(not_clear, name)
}
}
list(up = up, down = down, not_clear = not_clear)
}
The function up_down
classify the genes in up-regulated and down-regulated. The function classify
given two lists of genes up or down regulates, find the genes common in them and check if in both lists are up-regulated or down-regulated, however it doesn’t display the names, just the numbers. However the function genes_names
classify the genes on the three elements of the list: up if in both sets is up-regulated, down if in both sets the genes is down-regulated and not_clear if in a dataset the gene is up-regulated and in the other is down-regulated. We can see here the results of several comparisons:
r_tumor <- up_down(de_tumor)
r_females <- up_down(de_females)
# r_males <- up_down(de_males)
r_gender <- up_down(de_gender)
# We see how do they overlap, more up-regulated, or more dow-regulated genes.
# classify(r_tumor, r_males)
classify(r_tumor, r_females)
## r_females
## r_tumor down
## down 5747
# classify(r_females, r_males)
# classify(r_gender, r_males)
classify(r_females, r_gender)
## r_gender
## r_females down up
## down 1 7973
classify(r_tumor, r_gender)
## r_gender
## r_tumor down up
## down 1 6391
# We collect the name of genes for further studies
# genes_females_males <- genes_names(r_females, r_males)
# genes_tumor_males <- genes_names(r_tumor, r_males)
genes_tumor_females <- genes_names(r_tumor, r_females)
DE_tumor <- unique(rownames(de_tumor))
# DE_males <- unique(rownames(de_males))
DE_females <- unique(rownames(de_females))
DE_gender <- unique(rownames(de_gender))
To visualize it we plot them with venn diagrams:
library("Vennerable")
plot(Venn(list(DE_tumor, DE_females),
SetNames = c("Tumoral", "Females")))
As we can observe of the comparison, many genes differentially expressed on females comparing healthy samples and tumoral ones are not found on males.
We can perform a functional enrichment on our data with the GO already stored
library(GOstats)
go_obj <- function(genes_id, universe = unique(annot.Entrez.GO$ENTREZID)){
# Create an object for GOstat from ENTREZIDs
new("GOHyperGParams", geneIds = genes_id,
universeGeneIds = universe,
annotation = "org.Hs.eg.db", ontology = "BP",
pvalueCutoff = 0.05, testDirection = "over",
conditional = TRUE)
}
params_tumor <- go_obj(DE_tumor)
# params_males <- go_obj(DE_males)
params_females <- go_obj(DE_females)
parmas_gender <- go_obj(DE_gender)
# The set of differentially expressed only in females
# mask <- DE_females %in% intersect(DE_males, DE_females)
# DE_females_exc <- DE_females[!mask]
# DE_females_exc <- DE_females_exc[!DE_females_exc %in% intersect(DE_females_exc, DE_tumor)]
# params_female_exc <- go_obj(DE_females_exc)
# The set of differentially expressed only in males
# mask <- DE_males %in% intersect(DE_males, DE_females)
# DE_males_exc <- DE_males[!mask]
# DE_males_exc <- DE_males_exc[!DE_males_exc %in% intersect(DE_males_exc, DE_tumor)]
# params_male_exc <- go_obj(DE_males_exc)
# We explore those genes that are shared among the comparisons
# go_females_males_up <- go_obj(genes_females_males$up)
# go_females_males_down <- go_obj(genes_females_males$down)
# go_females_males_not_clear <- go_obj(genes_females_males$not_clear)
# go_tumor_females_not_clear <- go_obj(genes_tumor_females$not_clear)
# go_tumor_females_up <- go_obj(genes_tumor_females$up) # Genes not in the universe
go_tumor_females_down <- go_obj(genes_tumor_females$down)
# go_tumor_males_not_clear <- go_obj(genes_tumor_males$not_clear)
# go_tumor_males_up <- go_obj(genes_tumor_males$up)
# go_tumor_males_down <- go_obj(genes_tumor_males$down)
Once prepared with the object we can run the analysis:
go_analisys <- function(go_obj, annots.GO, size = 5, count = 5, pvalue = 0.05){
hgOver <- hyperGTest(go_obj)
GO <- GOstats::summary(hgOver)
GO <- subset(GO, Size >= size & Count >= count & Pvalue <= pvalue)
if (nrow(GO) == 0){
warning("Not significant GO on this group with these settings.")
return(NULL)
}
# Visualize the data
plotgo <- ggplot(GO, aes(Size, Count, size = OddsRatio, colour = Pvalue)) +
geom_point() + b + ggtitle(deparse(substitute(go_obj)))
GO <- GO[order(GO$OddsRatio, decreasing = TRUE), ]
resume <- function(id, annot){
#
a <- unique(annot[annot$ENTREZID %in% id, "SYMBOL"])
if (length(a) >= 1){
return(a)
} else{
return("")
}
}
geneIDs <- geneIdsByCategory(hgOver)[GO$GOBPID]
geneSYMs <- sapply(geneIDs, resume, annot = annots.GO)
if (is.matrix(geneSYMs)){
geneSYMs <- apply(geneSYMs, 2, paste, collapse = ", ")
} else {
geneSYMs <- sapply(geneSYMs, paste, collapse = ", ")
}
GO <- merge(GO, geneSYMs, by.x = "GOBPID", by.y = 0)
GO <- GO[order(-GO$OddsRatio, GO$Pvalue, -GO$ExpCount, -GO$Count), ]
rownames(GO) <- 1:nrow(GO)
colnames(GO) <- c(colnames(GO)[-ncol(GO)], "Genes")
list(GO = GO, plot = plotgo)
}
GO_tumor <- go_analisys(params_tumor, annot.Entrez.GO)
if (!is.null(GO_tumor)){
print(head(GO_tumor$GO))
plot(GO_tumor$plot)
}
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0051482 5.341315e-05 Inf 2.341270 8 8
## 2 GO:0030194 1.811497e-04 Inf 2.045881 7 7
## 3 GO:0007158 6.257946e-04 Inf 1.755952 6 6
## 4 GO:0018119 6.257946e-04 Inf 1.755952 6 6
## 5 GO:0044144 6.257946e-04 Inf 1.755952 6 6
## 6 GO:0051481 6.257946e-04 Inf 1.755952 6 6
## Term
## 1 positive regulation of cytosolic calcium ion concentration involved in phospholipase C-activating G-protein coupled signaling pathway
## 2 positive regulation of blood coagulation
## 3 neuron cell-cell adhesion
## 4 peptidyl-cysteine S-nitrosylation
## 5 modulation of growth of symbiont involved in interaction with host
## 6 negative regulation of cytosolic calcium ion concentration
## Genes
## 1 AGTR1, S1PR1, LPAR1, EDNRA, F2R, GRM1, HTR2C, KISS1
## 2 CD36, F2R, F3, F7, F12, S100A9, SELP
## 3 ASTN1, RET, TNR, NRXN3, NRXN1, NRXN2
## 4 ADH5, ACE, GAPDH, S100A8, S100A9, SNTA1
## 5 CD36, CTSG, IFNG, IL10, IL12B, OSBP
## 6 ATP1A2, GTF2I, HRC, KCNA5, KCNK3, SLC8A1
We should do the same if we had some genes differentially expressed on the comparison between gender.
GO_females <- go_analisys(params_females, annot.Entrez.GO)
if (!is.null(GO_females)){
print(head(GO_females$GO))
plot(GO_females$plot)
}
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0045909 1.030450e-06 Inf 5.236111 14 14
## 2 GO:0007188 6.933480e-06 Inf 4.463845 12 12
## 3 GO:0010248 5.311264e-05 Inf 3.740079 10 10
## 4 GO:0033762 1.380714e-04 Inf 3.354971 9 9
## 5 GO:0071887 3.722405e-04 Inf 2.983501 8 8
## 6 GO:1901381 3.781912e-04 Inf 2.989409 8 8
## Term
## 1 positive regulation of vasodilation
## 2 adenylate cyclase-modulating G-protein coupled receptor signaling pathway
## 3 establishment or maintenance of transmembrane electrochemical gradient
## 4 response to glucagon
## 5 leukocyte apoptotic process
## 6 positive regulation of potassium ion transmembrane transport
## Genes
## 1 ADM, ADRA2A, ADRB2, AGTR2, ALOX12, CPS1, F2RL1, GJA5, INS, PTAFR, PTPRM, UCN, VIP, KAT2B
## 2 AVPR2, CCR3, CNR1, FPR1, GCG, GCGR, GNAT1, GNAT2, PRKACB, PTGER4, GLP2R, WASF2
## 3 ATP1A1, ATP1A2, ATP1A3, ATP12A, ATP1A4, ATP1B3, FXYD2, ATP4A, BAX, SLC22A1
## 4 CDO1, CREB1, CRY1, CYC1, HMGCS2, PFKFB1, QDPR, RPS6KB1, SREBF1
## 5 CLC, DFFA, IFNG, IL2RA, RIPK1, BCL2L11, SIVA1, TRAF3IP2
## 6 DPP6, EDN3, KCNC1, KCNC2, KCNE1, KCNH2, KCNJ2, KCNQ1
GO_gender <- go_analisys(parmas_gender, annot.Entrez.GO)
if (!is.null(GO_gender)){
print(head(GO_gender$GO))
plot(GO_gender$plot)
}
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0030199 1.995872e-06 Inf 6.259921 15 15
## 2 GO:0043279 4.414534e-06 Inf 5.808478 14 14
## 3 GO:0034698 4.792845e-06 Inf 5.842593 14 14
## 4 GO:0030155 8.343936e-06 Inf 5.293155 13 13
## 5 GO:0007188 2.580717e-05 Inf 4.979673 12 12
## 6 GO:0033189 2.762569e-05 Inf 5.007937 12 12
## Term
## 1 collagen fibril organization
## 2 response to alkaloid
## 3 response to gonadotropin
## 4 regulation of cell adhesion
## 5 adenylate cyclase-modulating G-protein coupled receptor signaling pathway
## 6 response to vitamin A
## Genes
## 1 ANXA2, SERPINH1, COL1A1, COL2A1, COL3A1, COL5A1, COL5A2, COL11A1, COL12A1, LUM, DDR2, P4HA1, COL14A1, ADAMTS3, ADAMTS2
## 2 ADA, BCHE, EDNRA, FGA, HIF1A, IL1B, PPP2R2A, PPP5C, PRKCE, PRKCG, TAC1, UQCRC1, PEA15, TMED10
## 3 ASNS, CTSV, GATA4, GATA6, GJB2, HMGCS1, ICAM1, ITGA3, ITGB1, PAWR, SLC5A5, SRD5A1, STAR, PAX8
## 4 S1PR1, ICAM1, LAMA3, LAMA4, LAMA5, PPP1R12A, SERPINI1, PPP2R1A, PRKX, SOX9, SRF, CYTIP, NUAK1
## 5 AVPR2, CCR3, CNR1, FPR1, GCG, GCGR, GNAT1, GNAT2, PRKACB, PTGER4, GLP2R, WASF2
## 6 ARG1, CAT, DNMT1, DNMT3A, DNMT3B, GATA4, RARA, RXRA, SLC34A1, TSHB, TYMS, ALDH1A2
GO_tumor_females_down <- go_analisys(go_tumor_females_down, annot.Entrez.GO)
if (!is.null(GO_tumor_females_down)){
print(head(GO_tumor_females_down$GO))
plot(GO_tumor_females_down$plot)
}
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0051482 9.634658e-06 Inf 1.890653 8 8
## 2 GO:0009629 7.346161e-04 Inf 1.181658 5 5
## 3 GO:0042148 7.346161e-04 Inf 1.181658 5 5
## 4 GO:0051918 7.346161e-04 Inf 1.181658 5 5
## 5 GO:2000074 7.346161e-04 Inf 1.181658 5 5
## 6 GO:0001774 2.595136e-04 22.69022 1.890653 7 8
## Term
## 1 positive regulation of cytosolic calcium ion concentration involved in phospholipase C-activating G-protein coupled signaling pathway
## 2 response to gravity
## 3 strand invasion
## 4 negative regulation of fibrinolysis
## 5 regulation of type B pancreatic cell development
## 6 microglial cell activation
## Genes
## 1 AGTR1, S1PR1, LPAR1, EDNRA, F2R, GRM1, HTR2C, KISS1
## 2 BGLAP, CBLB, PTAFR, SPARC, FOSL1
## 3 RAD51, RAD51C, RAD51B, RAD51D, XRCC2
## 4 APOH, F2, HRG, SERPINE1, USF1
## 5 ARNTL, BAD, RFX3, RHEB, CLOCK
## 6 CLU, CX3CR1, IL13, JUN, CX3CL1, SNCA, TRPV1
We should look in details to those values and associations in the tables.
library(xtable)
table_h <- function(go_obj, ftitle){
# Function to output the table in html format
if (is.null(go_obj)){
warning("Null Object: unable to creat the table")
} else {
xtab_tumor <- xtable(go_obj, align="l|c|r|r|r|r|r|p{3cm}|p{3cm}")
print(x=xtab_tumor, file = ftitle, type = "html")
}
}
table_g <- function(tt_obj, ftitle){
# Function to output the table in html format
if (is.null(tt_obj)){
warning("Null Object: unable to creat the table")
} else {
rownames(tt_obj) <- 1:nrow(tt_obj)
xtab_tumor <- xtable(tt_obj, align="l|c|r|r|r|r|r|p{3cm}|p{3cm}|r|")
print(x=xtab_tumor, file = ftitle, type = "html")
}
}
table_h(GO_tumor$GO, "GO_tumor_table.html")
table_h(GO_females$GO, "GO_females_table.html")
# table_h(GO_males$GO, "GO_males_table.html")
table_h(GO_gender$GO, "GO_gender_table.html")
# table_h(GO_female_exc$GO, "GO_female_exc_table.html")
# table_h(GO_male_exc$GO, "GO_male_exc_table.html")
# table_h(GO_females_males_up$GO, "GO_females_males_up_table.html")
# table_h(GO_females_males_down$GO, "GO_females_males_down_table.html")
# table_h(GO_females_males_not_clear$GO, "GO_females_males_not_clear_table.html")
# table_h(GO_tumor_females_not_clear$GO, "GO_tumor_females_not_clear_table.html")
# table_h(GO_tumor_males_not_clear$GO, "GO_tumor_males_not_clear_table.html")
# table_h(GO_tumor_females_up$GO, "GO_tumor_females_up_table.html")
# table_h(GO_tumor_males_up$GO, "GO_tumor_males_up_table.html")
table_h(GO_tumor_females_down$GO, "GO_tumor_females_down_table.html")
# table_h(GO_tumor_males_down$GO, "GO_tumor_males_down_table.html")
table_g(de_tumor, "DE_tumor.html")
table_g(de_gender, "DE_gender.html")
table_g(de_females, "DE_females.html")
# table_g(de_males, "DE_males.html")
The table of genes which are differentially expressed in the different comparisons are in the following sites:
* The comparison of tumoral vs normal samples.
* The comparison of females vs males samples.
* The comparison of tumoral vs normal female samples.
* The comparison of tumoral vs normal male samples.
To explore more than the top 6 GO we should look into the tables generated with any browser. * In GO_tumor_table we have the list of GO enrich on the differentially expressed genes in the comparison tumor vs healthy samples.
* In GO_females_table we have the list of GO enrich on the differentially expressed genes in the comparison tumoral females vs healthy females.
* In GO_males_table we have the list of GO enrich on the differentially expressed genes in the comparison between tumoral males samples and healthy males samples.
* In GO_gender_table we have the list of GO enrich on the differentially expressed genes in the comparison between females and males.
* In GO_tumor_females_down we have the list of GO enrich on the differentially expressed genes down-regulated in the comparison of tumoral vs healthy samples and tumoral samples and healthy females.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] xtable_1.8-2 GO.db_3.3.0
## [3] GOstats_2.38.0 graph_1.50.0
## [5] Category_2.38.0 Matrix_1.2-6
## [7] Vennerable_3.1.0.9000 org.Hs.eg.db_3.3.0
## [9] corpcor_1.6.8 sva_3.20.0
## [11] genefilter_1.54.2 mgcv_1.8-12
## [13] nlme_3.1-128 cqn_1.18.0
## [15] quantreg_5.26 SparseM_1.7
## [17] preprocessCore_1.34.0 nor1mix_1.2-1
## [19] mclust_5.2 geneplotter_1.50.0
## [21] annotate_1.50.0 XML_3.98-1.4
## [23] AnnotationDbi_1.34.3 lattice_0.20-33
## [25] edgeR_3.14.0 limma_3.28.6
## [27] ggplot2_2.1.0 SummarizedExperiment_1.2.2
## [29] Biobase_2.32.0 GenomicRanges_1.24.1
## [31] GenomeInfoDb_1.8.1 IRanges_2.6.0
## [33] S4Vectors_0.10.1 BiocGenerics_0.18.0
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.1 colorspace_1.2-6 htmltools_0.3.5
## [4] yaml_2.1.13 survival_2.39-4 RBGL_1.48.1
## [7] DBI_0.4-1 RColorBrewer_1.1-2 plyr_1.8.4
## [10] stringr_1.0.0 zlibbioc_1.18.0 MatrixModels_0.4-1
## [13] munsell_0.4.3 gtable_0.2.0 evaluate_0.9
## [16] labeling_0.3 knitr_1.13 GSEABase_1.34.0
## [19] Rcpp_0.12.5 KernSmooth_2.23-15 scales_0.4.0
## [22] formatR_1.4 XVector_0.12.1 digest_0.6.9
## [25] stringi_1.1.1 grid_3.3.1 tools_3.3.1
## [28] magrittr_1.5 RSQLite_1.0.0 rmarkdown_0.9.6
## [31] AnnotationForge_1.14.2