Introduction

Thyroid cancer can develop from the different cells that form the follicles of the thyroid. This gland located at the base of the throat secretes hormones such as T3 and T4 that have their metabolic functionalities such as control of heart rate, blood pressure, body temperature and weight. See it on TCGA.

From the four types of thyroid cancer (papillary, follicular, medullary, and anaplastic thyroid cancer) Papillary Thyroid Carcinoma (PTC) is the most common type of thyroid cancer (Agrawal et al., 2014). It is more prevalent in women than men and its common diagnosis occurs around 49 years old.

Driver onco-mutations for this type of cancer appear as alterations of the MAPK signaling pathway and the PI3K-AKT pathway (Agrawal et al., 2014). Both are implied in cell proliferation and survival and in human tumorigenesis. The overactivation of the MAPK pathway because of mutations such as the BRAFV600E mutation, leads to the development of PTC from follicular thyroid cells (Xing, 2013). On the other hand, mutations that activate the PI3K-AKT pathway, such as mutations in RAS, PTEN and PIK3CA, mostly leads to development of Follicular Thyroid Adenoma (FTA) and Follicular Thyroid Cancer (FTC) also in follicular thyroid cells (Xing, 2013).

Research from The Cancer Genome Atlas (TCGA) focused on PTC and they found that a part from alterations in BRAF (specifically BRAFV600E) and RAS, there are other driver mutations such the ones in EIF1AX PPM1D, and CHEK2 genes that are main alterations for that type of thyroid cancer(Agrawal et al., 2014).

Data import

We start importing the raw table of counts.

library(SummarizedExperiment)

thca <- readRDS("seTHCA.rds") 
thca
## class: RangedSummarizedExperiment 
## dim: 20115 572 
## metadata(5): experimentData annotation cancerTypeCode
##   cancerTypeDescription objectCreationDate
## assays(1): counts
## rownames(20115): 1 2 ... 102724473 103091865
## rowData names(3): symbol txlen txgc
## colnames(572): TCGA.4C.A93U.01A.11R.A39I.07
##   TCGA.BJ.A0YZ.01A.11R.A10U.07 ... TCGA.KS.A41J.11A.12R.A23N.07
##   TCGA.KS.A41L.11A.11R.A23N.07
## colData names(549): type bcr_patient_uuid ...
##   lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total

Data exploration

Explore the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata.

dim(colData(thca))
## [1] 572 549
col.data <- colData(thca)
col.data[1:5, 1:5]
## DataFrame with 5 rows and 5 columns
##                                  type                     bcr_patient_uuid
##                              <factor>                             <factor>
## TCGA.4C.A93U.01A.11R.A39I.07    tumor E1C8AF06-9EA4-4062-9264-AA916E0025D1
## TCGA.BJ.A0YZ.01A.11R.A10U.07    tumor f3ee888e-5116-4313-a2f6-3b1dcc3e4bc1
## TCGA.BJ.A0Z0.01A.11R.A10U.07    tumor 72d8dcc3-0709-4fd1-98d4-fb75e9340758
## TCGA.BJ.A0Z2.01A.11R.A10U.07    tumor f9ceffc0-d544-418d-b4a9-bd3c84e37026
## TCGA.BJ.A0Z3.01A.11R.A13Y.07    tumor 331cae6e-2868-4c58-9302-709a9ff7d025
##                              bcr_patient_barcode form_completion_date
##                                         <factor>             <factor>
## TCGA.4C.A93U.01A.11R.A39I.07        TCGA-4C-A93U           2014-11-12
## TCGA.BJ.A0YZ.01A.11R.A10U.07        TCGA-BJ-A0YZ            2011-4-11
## TCGA.BJ.A0Z0.01A.11R.A10U.07        TCGA-BJ-A0Z0            2011-4-15
## TCGA.BJ.A0Z2.01A.11R.A10U.07        TCGA-BJ-A0Z2            2011-4-15
## TCGA.BJ.A0Z3.01A.11R.A13Y.07        TCGA-BJ-A0Z3             2011-6-2
##                              prospective_collection
##                                            <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                    YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z0.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z2.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                     NO
col.data.meta <- mcols(colData(thca), use.names=TRUE)
col.data.meta
## DataFrame with 549 rows and 2 columns
##                                                          labelDescription
##                                                               <character>
## type                                           sample type (tumor/normal)
## bcr_patient_uuid                                         bcr patient uuid
## bcr_patient_barcode                                   bcr patient barcode
## form_completion_date                                 form completion date
## prospective_collection            tissue prospective collection indicator
## ...                                                                   ...
## lymph_nodes_pelvic_pos_total                               total pelv lnp
## lymph_nodes_aortic_examined_count                           total aor lnr
## lymph_nodes_aortic_pos_by_he                          aln pos light micro
## lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
## lymph_nodes_aortic_pos_total                                total aor-lnp
##                                         CDEID
##                                   <character>
## type                                       NA
## bcr_patient_uuid                           NA
## bcr_patient_barcode                   2673794
## form_completion_date                       NA
## prospective_collection                3088492
## ...                                       ...
## lymph_nodes_pelvic_pos_total          3151828
## lymph_nodes_aortic_examined_count     3104460
## lymph_nodes_aortic_pos_by_he          3151832
## lymph_nodes_aortic_pos_by_ihc         3151831
## lymph_nodes_aortic_pos_total          3151827

Now, explore the row (feature) data.

rowRanges(thca)
## GRanges object with 20115 ranges and 3 metadata columns:
##             seqnames               ranges strand |      symbol     txlen
##                <Rle>            <IRanges>  <Rle> | <character> <integer>
##           1    chr19 [58345178, 58362751]      - |        A1BG      3322
##           2    chr12 [ 9067664,  9116229]      - |         A2M      4844
##           9     chr8 [18170477, 18223689]      + |        NAT1      2280
##          10     chr8 [18391245, 18401218]      + |        NAT2      1322
##          12    chr14 [94592058, 94624646]      + |    SERPINA3      3067
##         ...      ...                  ...    ... .         ...       ...
##   100996331    chr15 [20835372, 21877298]      - |       POTEB      1706
##   101340251    chr17 [40027542, 40027645]      - |    SNORD124       104
##   101340252     chr9 [33934296, 33934376]      - |   SNORD121B        81
##   102724473     chrX [49303669, 49319844]      + |      GAGE10       538
##   103091865    chr21 [39313935, 39314962]      + |   BRWD1-IT2      1028
##                  txgc
##             <numeric>
##           1 0.5644190
##           2 0.4882329
##           9 0.3942982
##          10 0.3895613
##          12 0.5249429
##         ...       ...
##   100996331 0.4308324
##   101340251 0.4903846
##   101340252 0.4074074
##   102724473 0.5055762
##   103091865 0.5924125
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

Looking deeper into col.data with:

col.data[1:5, 30:35]
## DataFrame with 5 rows and 6 columns
##                              lymph_nodes_examined_count
##                                                <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                        YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07            [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                        YES
## TCGA.BJ.A0Z2.01A.11R.A10U.07                         NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                        YES
##                              lymph_nodes_examined
##                                          <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                   41
## TCGA.BJ.A0YZ.01A.11R.A10U.07      [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                    5
## TCGA.BJ.A0Z2.01A.11R.A10U.07      [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                    3
##                              lymph_nodes_examined_he_count
##                                                   <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                             1
## TCGA.BJ.A0YZ.01A.11R.A10U.07               [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                             0
## TCGA.BJ.A0Z2.01A.11R.A10U.07               [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                             0
##                              weiss_score_overall mitoses_per_50_hpf
##                                         <factor>           <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                  NA                 NA
## TCGA.BJ.A0YZ.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z0.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z2.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                  NA                 NA
##                              ajcc_pathologic_tumor_stage
##                                                 <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                   Stage IVA
## TCGA.BJ.A0YZ.01A.11R.A10U.07                    Stage II
## TCGA.BJ.A0Z0.01A.11R.A10U.07                    Stage II
## TCGA.BJ.A0Z2.01A.11R.A10U.07                   Stage IVC
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                     Stage I

We observed that there are different ways to indicate that the data is not available (NA, [Not Available]…). Consequently, we decide to study which variables have more information. To do so, we create two functions, one (na.replace) to change those values into the standarized NA nomenclature, and filter.info to quantify how many of each variable is not NA.

na.replace <- function(x){
    # Remove the unwanted factors
    lev <- levels(x)
    lev <- lev[!(lev %in% "[Not Available]")]
    lev <- lev[!(lev %in% "[Unknown]")]
    lev <- lev[!(lev %in% "[Not Applicable]")]
    lev <- lev[!(lev %in% "[Not Available]|[Not Available]|[Not Available]")]
    lev <- lev[!(lev %in% "[Not Evaluated]")]
    lev <- lev[!(lev %in% "None")]

    x <- factor(x, levels=lev)
}

filter.info <- function(x){
    # Function to filter the NA, [Not Available], [Not Applicable] and [Unknown]
    # Return the proportion of information on x
    nas <- sum(is.na(x))
    return(1 - (nas/length(x)))
}

We apply these functions to the col.data and we visualize it:

meta.info <- as.data.frame(lapply(col.data, na.replace))
variable.info <- sapply(meta.info, filter.info)

# Representing the information by column vs total information
relative.info <- variable.info[order(variable.info)]/sum(variable.info)
plot(relative.info, main="Information brought by column")
\label{fig:col.data.analysis}Figure 1. Information in each column

Figure 1. Information in each column

# Plotting the histogram of the information
hist(variable.info, 
     main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")
\label{col.data.analysis2}Figure 2.A Histogram of information of additional information per sample.

Figure 2.A Histogram of information of additional information per sample.

# Seeing the most informative to set a threshold,
# frequency are the number of columns with such % of information
hist(variable.info, xlim = c(0.3, 1), ylim = c(0, 35), 
     main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")
\label{col.data.analysis3}Figure 2.B: A zoom on the most informative variables.

Figure 2.B: A zoom on the most informative variables.

As we can see, in many variables the main value is just NA or other derivatives meaning “No information available for this variable”. Therefore, in order to keep the most informative variables, we apply a threshold of 60% (of the values containing informative data) to select the relevant ones:

filtered <- meta.info[, variable.info  > 0.6]

dim(filtered)
## [1] 572  44
meta.info.summary2 <- col.data.meta[variable.info  > 0.6,]

Thus we ended with 44 informatives variables.

Associated variables exploration

To explore the variables associated with the samples we first create an object with ggplot:

library("ggplot2")
fplot <- ggplot(as.data.frame(filtered))

## ggplot options for axes and background color
# Options of the labels
l <- theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 1))
# Removing the background
b <-  theme(axis.line = element_line(colour = "black"),
            panel.grid.major.x = element_blank(),
            panel.grid.major.y = element_line(colour = "grey", 
                                              size = 0.5, linetype = 2),
            panel.grid.minor.y = element_line(colour = "grey", 
                                              size = 0.25, linetype = 3),
            panel.border = element_blank(),
            panel.background = element_blank())
y.axis <- ylab("number of cases")

And we plot some relationships for some significant variables:
In the “normal” type samples most of the information is not available. This could be due to the fact that all “normal” samples were obtained from patients with tumors. So in order to have the maximum information of each sample, we infer “normal” samples information from its paired sample data (mainly gender and ethnicity).

fplot + geom_bar(aes(type, fill = gender)) + b + y.axis + 
    ggtitle("Gender distribution across normal and tumour samples")
\label{gender}Figure 3.A: The gender distribution: threre are more females than males.

Figure 3.A: The gender distribution: threre are more females than males.

As expected, there are more female samples than male samples. As we mentioned before, thyroid cancer is more prevalent in women than men and we can see that not only for tumor samples but for normals samples too. In addition, there are more tumor cases than normal cases.

fplot + geom_bar(aes(type, fill = tumor_status)) + b + y.axis + 
    ggtitle("Tumor status across normal and tumour samples")
\label{status}Figure 3.B: Some tumoral samples are not clearly assigned a tumor status.

Figure 3.B: Some tumoral samples are not clearly assigned a tumor status.

We also have examined the tumor status of the tumor samples. There are more cancer patients without tumor (“tumor free”) than with the tumor (“with tumor”) after the collection of the data. We assume that most of the patients that participate in the study have recovered after it.

fplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b + 
    y.axis + ggtitle("Ethnicity distribution across normal and tumour samples")
\label{ethnicity}Figure 3.C: The predominant ethnicity are not hispanic or latino

Figure 3.C: The predominant ethnicity are not hispanic or latino

In this plot we can see clearly the lack of data for some variables regarding normal samples. In other words, there is no record for normal samples neither females nor males about the hispanic origin of the individuals.

labels <- c("AM. INDIAN OR ALASKA NAT.", "ASIAN", "BLACK OR AFRICAN AM.", 
            "WHITE", "NA")
fplot + facet_wrap(~ ethnicity) + geom_bar(aes(race, fill = gender)) + b +
    l + y.axis + scale_x_discrete(breaks=waiver(), labels = labels) +
    ggtitle("Gender distribution across different ethnicity and race.")
\label{ethnicity2}Figure 3.D: Most samples are from white ancestry.

Figure 3.D: Most samples are from white ancestry.

If we have a look at the ethnicities, gender and hispanic origin, the data shows more cases for “white” individuals “not hispanic or latino” than “white” individuals “hispanic or latino”.

fplot + geom_bar(aes(x=gender, fill = race)) + b + y.axis + 
    ggtitle("Ethnicity distribution in male and female")
\label{race}Figure 3.E: Both in males and females the most abundant race is white.

Figure 3.E: Both in males and females the most abundant race is white.

If we ignore the hispanic origin of the individuals and just focus on the main ethnicities of the study we can observe that for both female and male samples there is a majority of “white” individuals. However, the fact that “white” subjects are hispanic or not can provide some variability so that is something we will take into account later.

fplot + facet_wrap(~ type ) + geom_bar(aes(tumor_focality, fill = gender)) + b +
    l + y.axis + xlab("tumor focality") + 
    ggtitle("Gender distribution in different tumour focality")
\label{focality}Figure 3.F: There are some tumors which we don't know the focality.

Figure 3.F: There are some tumors which we don’t know the focality.

There is no data about tumor focality in the normal samples, but this could be due to the fact that are samples obtained probably during the tumor removal or for a biopsy, from a healty part of the tyroid gland. For this reason there is no reason for the normal samples to have this feature.

labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular", 
            "Papillary - Tall Cell", "NA")
hd.xaxis <-  scale_x_discrete(name = "histologic diagnosis", breaks=waiver(), 
                labels=labels)
fplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + hd.xaxis + 
    y.axis + 
    ggtitle("Gender distribution across different types of hystologic diagnosis")
\label{diagnosis}Figure 3.G The most common diagnosticated tumor is Papillary thyroid Carcinoma.

Figure 3.G The most common diagnosticated tumor is Papillary thyroid Carcinoma.

Here, we can see that the tumor samples came mainly from classical Papillary Thyroid Carcinoma. We expect differential expression between the different thyroid cancer phenotypes so this might be a source of variability.

labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular", 
            "Papillary - Tall Cell", "NA")
ggplot(filtered[!is.na(filtered$age_at_diagnosis), ]) + 
    geom_bar(aes(age_at_diagnosis, na.rm = TRUE, fill = histologic_diagnosis)) +
    b + l + labs(x="age at diagnosis") + y.axis + 
    scale_fill_discrete(name = "histologic diagnosis", breaks = waiver(), 
    labels = labels) + 
    ggtitle("Histologic diagnosis distribution across different ages")
\label{diagnosis2}Figure 3.H: There is a representative sample of diagnosis over the age of patients.

Figure 3.H: There is a representative sample of diagnosis over the age of patients.

Even thought, thyroid cancer is commontly diagnosed at mature ages (~50) this plot shows that the classical papillary thyroid cancer has similar representation in all ages for this dataset.

As a summary from the previous plots, the most abundant population in the samples is the “white” and “not hispanic or latino”, so we decide to perform the analysis on this group of individuals, whose tumoral sample correspond to the most common cancer in thyroids: Papillary Thyroid Carcinoma(PTC).

Subseting

With this approach we want to avoid to include in the analysis, samples that introduce variability (mainly differential expression due to enviromental or epigenetic factors) as much as possible. To have more insight on the disease we want to have just paired samples, that is, samples from the same participant. So we filter the data accordingly:

# Paired data
paired <- intersect(thca[, thca$type == 'tumor']$bcr_patient_barcode, 
                    thca[, thca$type == 'normal']$bcr_patient_barcode)
paired_mask <- thca$bcr_patient_barcode %in% paired

filtered_filt <- filtered[paired_mask, ]

# Imposing the selected conditions
tumor_names <- row.names(subset(filtered_filt, 
    type == "tumor" &
    race == "WHITE" &
    histologic_diagnosis == "Thyroid Papillary Carcinoma - Classical/usual" &
    ethnicity == "NOT HISPANIC OR LATINO" ))

# Extracting the participants identifiers
participants <- unlist(lapply(tumor_names, substr, 9, 12))
participants <- participants[participants != ""]

check <- function(x, checklist){
    # Returning name of the sample just if the name is the same as in checklist
    a <- substr(x, 9, 12)
    if(a %in% checklist){
        return(x)
    }
}

# Extracting sample names of tumor and normal data.
sample_names <- unlist(lapply(row.names(filtered_filt), check, 
                              checklist = participants))
thca.filtered <- thca[, sample_names]

We are left with 66 samples. We explore again the variables for this subset. In this case, we show just some of the relationship plots we have seen previously:

filtered_filt <- filtered_filt[row.names(filtered_filt) %in% sample_names, ]
mplot <- ggplot(filtered_filt) + y.axis
mplot + geom_bar(aes(type, fill = gender)) + b +
  ggtitle("Ethnicity and type of sample after filtering")
\label{pairedg}Figure 4.A: The samples are consistent with the paired data.

Figure 4.A: The samples are consistent with the paired data.

This plot just checks whether we have the same number of cases in normal sample and tumor sample so that we have done correctly the filtering.

mplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b +
  ggtitle("Hispanic origin, gender and type of sample after filtering")
\label{pairede}Figure 4.B: The ethnicity is also consistent with the filter applied

Figure 4.B: The ethnicity is also consistent with the filter applied

Hispanic origin data is only available for tumor samples but as we have mentioned before we can extrapolate ethinicity and origin to the normal samples because they are paired (see below).

mplot + geom_bar(aes(x=gender, fill = ethnicity)) + b + l +
  ggtitle("Gender samples and hispanic origin after filtering")
\label{pairedge}Figure 4.C: The normal samples are perfectly correlated with the number of tumor samples, despite normal samples have 'NA' as ethnicity

Figure 4.C: The normal samples are perfectly correlated with the number of tumor samples, despite normal samples have ‘NA’ as ethnicity

Same proportion of being or not being “hispanic or latino” in females and males. Subset filtering has not modified the fact that we have more female samples than males.

mplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + 
    scale_x_discrete(name = "histologic diagnosis", breaks=waiver(), 
    labels = c("Papillary - Classical", "NA")) + 
    ggtitle("Gender distribution across different types of hystologic diagnosis
            after filtering")
\label{paireda}Figure 4.D: This graph shows that the paired data filtering is consistent

Figure 4.D: This graph shows that the paired data filtering is consistent

Papillary Classical Thyroid cancer in the same proportion as “NA” because we have the same number of normal cases and classical thyroid tumor cases.

ggplot(filtered_filt[!is.na(filtered_filt$age_at_diagnosis), ]) +
    geom_bar(aes(age_at_diagnosis, fill = histologic_diagnosis)) + b + l + y.axis +
    xlab("age at diagnosis") + scale_fill_discrete(name = "histologic diagnosis",
    breaks=waiver(), labels = c("Papillary - Classical", "NA")) + 
    ggtitle("Histologic diagnosis distribution across different ages after filtering")
\label{fig:subset.explore5}Figure 4.E: We have a epresentative sample of all ages.

Figure 4.E: We have a epresentative sample of all ages.

Number of cases across ages is more or less similar after the filtering exept for some ages that have more cases around 30, 34, 42 and 47.

Transforming counts to rpkm

To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a `DGEList’ object.

library(edgeR)
# Normalization just with the selected samples
dge.subset <- DGEList(counts = assays(thca.filtered)$counts, 
                      genes = as.data.frame(mcols(thca.filtered)))
dge <- DGEList(counts = assays(thca)$counts, genes = as.data.frame(mcols(thca)))

Now calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.

assays(thca.filtered)$logCPM <- cpm(dge.subset, log=TRUE, prior.count=3)
assays(thca.filtered)$logCPM[1:5, 1:5]
##    TCGA.BJ.A28R.01A.11R.A16R.07 TCGA.BJ.A28W.01A.11R.A32Y.07
## 1                      3.014910                     3.895927
## 2                      8.681899                     9.333470
## 9                     -4.164318                    -4.164318
## 10                    -4.164318                    -4.164318
## 12                     4.423408                     4.912253
##    TCGA.BJ.A2N7.01A.11R.A18C.07 TCGA.BJ.A2N8.01A.11R.A18C.07
## 1                      2.074429                     3.634778
## 2                      9.313603                     8.873442
## 9                     -4.164318                    -4.164318
## 10                    -4.164318                    -4.164318
## 12                     4.523368                     4.366761
##    TCGA.BJ.A2N9.01A.11R.A18C.07
## 1                      3.436989
## 2                      8.467420
## 9                     -4.164318
## 10                    -4.164318
## 12                     4.584505

Quality assessment and normalization

Library sizes

Let’s examine the library sizes in terms of total number of sequence read counts per sample. Figure 5 below shows library sizes per sample in increasing order.

ord <- order(dge$sample$lib.size)
barplot(dge$sample$lib.size[ord]/1e6, las=1, ylab="Millions of reads",
        xlab="Samples", col=c("blue", "red")[(thca$type[ord] == "tumor") + 1],
        border = NA)
legend("topleft", c("tumor", "normal"), fill=c("red", "blue"), inset=0.01)
\label{libsize1}Figure 5.A: Library sizes in increasing order of the whole dataset

Figure 5.A: Library sizes in increasing order of the whole dataset

Plotting for all samples makes it impossible to see accurately the blue and red colors corresponding to tumor and normal respectively.

ord.subset <- order(dge.subset$sample$lib.size)
barplot(dge.subset$sample$lib.size[ord.subset]/1e6, las=1, ylab="Millions of reads",
        xlab="Samples", col=c("blue", "red")[(thca.filtered$type[ord.subset] == "tumor") + 1],
        border = NA)
legend("topleft", c("tumor", "normal"), fill=c("red", "blue"), inset=0.01)
\label{libsize2}Figure 5.B: Library sizes in increasing order of the paired data.

Figure 5.B: Library sizes in increasing order of the paired data.

After the filtering the subset plot allows one to see that there is similar heterogeneity in tumor and normal library sizes. In addition, this figure reveals substantial differences in sequencing depth between samples and we may consider discarding those samples whose depth is substantially lower than the rest. To identify which are these samples we may simply look at the actual numbers including portion of the sample identifier that distinguishes them.

sampledepth <- round(dge.subset$sample$lib.size / 1e6, digits = 1)
names(sampledepth) <- substr(sample_names, 6, 12)

qqnorm(sampledepth)
qqline(sampledepth)
\label{libsizen}Figure 6: Distribution of library size compared with the normal distribution.

Figure 6: Distribution of library size compared with the normal distribution.

Looking into the qqplot we decide to filter out those below 40 but keeping only those that are paired:

sample_names <- sample_names[sampledepth > 40]

keep.paired <- function(pattern, x){
    # Return the names if it is paired
    pos <- grep(pattern, x, fixed = TRUE)
    if (length(pos) > 1){
        return(x[pos])
    }
}

sample_names <- unique(unlist(lapply(substr(sample_names, 9, 12), 
                                     keep.paired, sample_names)))

thca.filtered <- thca.filtered[, colnames(thca.filtered) %in% sample_names]
dge.subset <- DGEList(counts = assays(thca.filtered)$counts, 
                      genes = mcols(thca.filtered))
## Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
## Arguments in '...' ignored

Thus we have 59 samples which are above the threshold and 54 of them are paired, which are the ones which will be used from here on.

Distribution of expression levels among samples

Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure 7.

library(geneplotter)
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(assays(thca.filtered[, 
            thca.filtered$type ==  "tumor"])$logCPM)),
            xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", 
            las = 1)
multidensity(as.list(as.data.frame(assays(thca.filtered[,
            thca.filtered$type ==  "normal"])$logCPM)),
            xlab = "log 2 CPM", legend = NULL, main = "Normal samples",
            las = 1)
\label{distRawExp}Figure 7: Non-parametric density distribution of expression profiles per sample.

Figure 7: Non-parametric density distribution of expression profiles per sample.

These two spikes are due to the lowly expressed genes and the highly expressed genes, but in the lowly expressed genes the effect of GC content and the read length may have higher impact.

GC and length effect

To see the impact of GC and length on the number of read we plot them:

gc_len <- rowRanges(thca.filtered)
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txgc)
## [1] -0.01173666
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txlen)
## [1] -0.0008702022
par(mfrow=c(1,2))
plot(log10(rowMeans(assays(thca.filtered)$counts)), gc_len$txgc, 
     main = "GC content over expression", xlab = "Expression (log10)", 
     ylab = "GC")
plot(log10(rowMeans(assays(thca.filtered)$counts)), log10(gc_len$txlen), 
     main = "Length over the expression", xlab = "Expression (log10)", 
     ylab = "Length (log10)")
\label{gc_length}Figure 8: Relationship between GC content and length with the mean expression of genes.

Figure 8: Relationship between GC content and length with the mean expression of genes.

There is a low correlation on the counts and the GC content or the length but as there are clearly two groups based on the length, so we will normalize also using the length with the package cqn:

library("cqn")
cqn.subset <- cqn(assays(thca.filtered)$counts, x = gc_len$txgc, lengths = gc_len$txlen)
par(mfrow = c(1, 2))
boxplot(cqn.subset$func1, names = FALSE, xlab = "Samples", 
        main = "Estimated effect of GC")
boxplot(cqn.subset$func2, names = FALSE, xlab = "Samples", 
        main = "Estimated effect of length")
\label{cqn}Figure 9: The effect of GC and length on each sample.

Figure 9: The effect of GC and length on each sample.

As we thought the effect of the length is higher than the effect of GC content

We can see if the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately:

cqn.logCPM <- cqn.subset$y + cqn.subset$offset
assays(thca.filtered)$cqn.logCPM <- cqn.logCPM
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(
    cqn.logCPM[,thca.filtered$type ==  "tumor"])),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    cqn.logCPM[, thca.filtered$type ==  "normal"])),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
\label{distExp}Figure 10: Non-parametric density distribution of expression profiles per sample.

Figure 10: Non-parametric density distribution of expression profiles per sample.

These “after-normalization” plots are quite better than the previous ones used without GC and length normalization on the second spike. With this normalization we do not appreciate substantial differences between the samples in the distribution of expression values. However, below 0 there is a lot of variability.

Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 9 shows the distribution of those values across genes.

par(mfrow = c(1, 2))
avgexp <- rowMeans(assays(thca.filtered)$logCPM)
hist(avgexp, xlab = "log2 CPM", main = "Using DGEList normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)

# Using the corrected logCPM
avgexp.cqn <- rowMeans(assays(thca.filtered)$cqn.logCPM)
hist(avgexp.cqn, xlab = "log2 CPM", main = "Using cqn normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)
\label{exprdist}Figure 11: Distribution of average expression level per gene using DGE normalization and cqn normalization

Figure 11: Distribution of average expression level per gene using DGE normalization and cqn normalization

Filtering of lowly-expressed genes

In the light of previous plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.

mask <- avgexp.cqn > 1
thca.filtered <- thca.filtered[mask, ]
dge.subset <- dge.subset[mask, ]

We have nrow(thca.filtered) genes remaining on our dataset from 54 samples, half of them are healthy and the other half are tumoral.

Normalization

Since we will use the offsets computed by cqn, there is no need to normalize using the normalization tools from edgeR, such as calcNormFactors according to cqn vignette.

dge.subset$offset <- cqn.subset$glm.offset

dge.norm <- calcNormFactors(dge.subset)
assays(thca.filtered)$n.logCPM <- cpm(dge.norm, log = TRUE, prior.count = 3)

par(mfrow = c(3, 2))

# Not normalized but filtered
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

# Normalize with calcNormFactors after filtering out those
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$n.logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$n.logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

# Normalized with cqn and filtered the low values:
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$cqn.logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$cqn.logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
\label{norm}Figure 12: Tumor and Normal samples after filtering, without normalization, with calcNormFactors normalization and with cqn normalization

Figure 12: Tumor and Normal samples after filtering, without normalization, with calcNormFactors normalization and with cqn normalization

The normalization after filtering lowly-expressed genes with ‘cqn’ package clusters samples into one thin gaussian curve whereas normalization with calcNormFactors makes it thiker.

MA-plots

Tumoral samples

We examine now the MA-plots of the normalized expression profiles. We look first to the tumor samples.

par(mfrow = c(9, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type ==  "tumor"]
dgetmp <- dge.subset[, thca.filtered$type ==  "tumor"]
for (i in 1:ncol(setmp)) {
  A <- rowMeans(assays(setmp)$cqn.logCPM)
  M <- assays(setmp)$cqn.logCPM[, i] - A
  samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
  smoothScatter(A, M, main = samplename, las = 1)
  abline(h = 0, col = "blue", lwd = 2)
  lo <- lowess(M ~ A)
  lines(lo$x, lo$y, col = "red", lwd = 2)
}
\label{maPlotsTumor}Figure 13.A: MA-plots of the tumor samples.

Figure 13.A: MA-plots of the tumor samples.

We do not observe samples with major expression-level dependent biases.

Healthy samples

Let’s look now to the normal samples:

par(mfrow = c(9, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type ==   "normal"]
dgetmp <- dge.subset[, thca.filtered$type ==  "normal"]
for (i in 1:ncol(setmp)) {
  A <- rowMeans(assays(setmp)$cqn.logCPM)
  M <- assays(setmp)$cqn.logCPM[, i] - A
  samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
  smoothScatter(A, M, main = samplename, las = 1)
  abline(h = 0, col = "blue", lwd = 2)
  lo <- lowess(M ~ A)
  lines(lo$x, lo$y, col = "red", lwd = 2)
}
\label{maPlotsNormal}Figure 13.B: MA-plots of the normal samples.

Figure 13.B: MA-plots of the normal samples.

We do not observe either important expression-level dependent biases among the normal samples except on the sample TCGA-BJ-A3PR. Which we proceed to remove it from the cqn.logCP:

mask <- -grep("TCGA.BJ.A3PR", colnames(thca.filtered))
thca.filtered <- thca.filtered[,mask]
sample_names <- sample_names[mask]
dge.subset <- dge.subset[,mask]

Thus the remaining samples are 52 from 26 patients. Recall that we still have 11801 genes from each sample.

Batch identification

We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see the wiki), following the strategy described here we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(sample_names, 6, 7)
table(tss)
## tss
## BJ EL ET H2 
## 13 33  2  4
samplevial <- substr(sample_names, 14, 16)
table(samplevial)
## samplevial
## 01A 11A 11C 
##  26  25   1
portion <- substr(sample_names, 18, 19)
table(portion)
## portion
## 11 12 21 22 31 
## 42  6  2  1  1
table(substr(sample_names, 20, 20))
## 
##  R 
## 52
plate <- substr(sample_names, 22, 25)
table(plate)
## plate
## A16R A180 A18C A19O A20F A21D A220 A22L A23N 
##    2    2    5    2   10    6    6    7   12
center <- substr(sample_names, 27, 28)
table(center)
## center
## 07 
## 52

We can observe that the samples where collected at 4 tissues source sites but the majority of them at University of Pittsburg (BJ) and MD Anderson(EL). However,the manager center of the data is the same (University of North Carolina) and the same metabolite (RNA). In addition, they come from 7 different plates more or less distributed. However, most samples came from 2 out of 5 different portion and 2 out of 3 vials.

We are going to check if the previous variables of the barcode are surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with the variables:

table(data.frame(TYPE = thca.filtered$type, TSS = tss))
##         TSS
## TYPE     BJ EL ET H2
##   normal  0 20  2  4
##   tumor  13 13  0  0
# We will check for the other variables
table(data.frame(TYPE = thca.filtered$type, PLATE = plate))
##         PLATE
## TYPE     A16R A180 A18C A19O A20F A21D A220 A22L A23N
##   normal    0    2    0    0    3    0    2    7   12
##   tumor     2    0    5    2    7    6    4    0    0
table(data.frame(TYPE = thca.filtered$type, Portion = portion))
##         Portion
## TYPE     11 12 21 22 31
##   normal 18  4  2  1  1
##   tumor  24  2  0  0  0
table(data.frame(TYPE = thca.filtered$type, Vial = samplevial))
##         Vial
## TYPE     01A 11A 11C
##   normal  12  13   1
##   tumor   14  12   0

From the tables we can already see that all of them can be candidates as indicators for batch effect as there is no homogeneity of the number of tumor and normal samples across any of the variables.

To double check if they have some effect on the batch we make the hierarchical clustering of them. First, we set some functions to plot the hierarhcical clustering:

help.dendogram <- function(x, batch, labels) {
    # Helper function to plot dendograms
    if (is.leaf(x)) {
        ## color by batch
        attr(x, "nodePar") <- list(lab.col = as.vector(batch[attr(x, "label")])) 
        ## label by outcome
        attr(x, "label") <- as.vector(labels[attr(x, "label")])
    }
    x
}

plot.batch <- function(dendrogram, batching, se, out){
    # Function to see the batch
    # Given a sampleDendogram use batching to colour the leafs by sample.names of se
    sample.names = colnames(se)
    batch <- as.integer(factor(batching))
    names(batch) <- sample.names
    i.dendrogram <- dendrapply(dendrogram, help.dendogram, batch, out)
    plot(i.dendrogram, main = "Hierarchical clustering")
    legend("topright", paste("Batch", levels(factor(batching))), fill = sort(unique(batch)))
}

With this functions to plot the clustering with different labelings is easier:

d <- as.dist(1-cor(assays(thca.filtered)$cqn.logCPM, method = "spearman"))
sampleClustering <- hclust(d)
sampleDendrogram <- as.dendrogram(sampleClustering, hang = 0.1)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
                        as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)
par(mfrow = c(1,1))
plot.batch(sampleDendrogram, portion, thca.filtered, outcome)
\label{clustering.n1}Figure 14.A: Hierarchical clustering of the samples by portion

Figure 14.A: Hierarchical clustering of the samples by portion

plot.batch(sampleDendrogram, samplevial, thca.filtered, outcome)
\label{clustering.n2}Figure 14.B: Hierarchical clustering of the samples by sample vial

Figure 14.B: Hierarchical clustering of the samples by sample vial

plot.batch(sampleDendrogram, tss, thca.filtered, outcome)
\label{clusering.n3}Figure 14.C: Hierarchical clustering of the samples by TSS

Figure 14.C: Hierarchical clustering of the samples by TSS

plot.batch(sampleDendrogram, as.character(thca.filtered$type), thca.filtered, outcome)
\label{clustering.n5}Figure 14.D: Hierarchical clustering of the samples by type

Figure 14.D: Hierarchical clustering of the samples by type

plot.batch(sampleDendrogram, plate , thca.filtered, outcome)
\label{clustering.n5}Figure 14.E: Hierarchical clustering of the samples by prospective collection

Figure 14.E: Hierarchical clustering of the samples by prospective collection

We can see that the sample from A3PR-normal clusters with the tumoral ones, and the A3H2 tumoral clusters with the healthy patients on Figure 14.D. To visualize the relationship we use the multiscaling dimensional plot colored by portion, tss, type, and gender:

batch_effects <- list(portion, tss, as.character(thca.filtered$type),
                      as.character(thca.filtered$gender))
for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(dge.subset, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

We can observe that there are two groups of tumoral samples, and one of them is close to healthy samples.

Models of the disease

Our model of the disease will affect the analysis from here onwards. Therefore we need to select our model and null hypothesis. We define them as:

# Create the design matrix involving tumor and normal samples and males and females
treat <- factor(paste(thca.filtered$type, thca.filtered$gender, sep = "."))
paired <- as.numeric(as.factor(thca.filtered$bcr_patient_barcode))
design <- model.matrix(~treat + paired, data=colData(thca.filtered))
colnames(design) <- c(levels(treat), "paired")
# design <- cbind("(Intercept)" = 1, design)
head(design)
##                              normal.FEMALE normal.MALE tumor.FEMALE
## TCGA.BJ.A28R.01A.11R.A16R.07             1           0            1
## TCGA.BJ.A2N7.01A.11R.A18C.07             1           0            1
## TCGA.BJ.A2N8.01A.11R.A18C.07             1           0            1
## TCGA.BJ.A2N9.01A.11R.A18C.07             1           0            1
## TCGA.BJ.A2NA.01A.12R.A19O.07             1           0            0
## TCGA.BJ.A3PU.01A.11R.A220.07             1           0            0
##                              tumor.MALE paired
## TCGA.BJ.A28R.01A.11R.A16R.07          0     21
## TCGA.BJ.A2N7.01A.11R.A18C.07          0     30
## TCGA.BJ.A2N8.01A.11R.A18C.07          0     31
## TCGA.BJ.A2N9.01A.11R.A18C.07          0     32
## TCGA.BJ.A2NA.01A.12R.A19O.07          1     33
## TCGA.BJ.A3PU.01A.11R.A220.07          1     39
# And the null model
mod0 <- model.matrix(~1 + gender + type + paired, data = colData(thca.filtered))
head(mod0)
##                              (Intercept) genderMALE typetumor paired
## TCGA.BJ.A28R.01A.11R.A16R.07           1          0         1     21
## TCGA.BJ.A2N7.01A.11R.A18C.07           1          0         1     30
## TCGA.BJ.A2N8.01A.11R.A18C.07           1          0         1     31
## TCGA.BJ.A2N9.01A.11R.A18C.07           1          0         1     32
## TCGA.BJ.A2NA.01A.12R.A19O.07           1          1         1     33
## TCGA.BJ.A3PU.01A.11R.A220.07           1          1         1     39

Our model includes the interaction of gender and type, while our null model just includes the gender and type (in both cases with the paired factor to take into account the fact that the samples are paired). This way comparing this models we will see if there is interaction between gender and tumor or not.

Removing batch effect

ComBat

After looking for batch effect we couldn’t find a clear cause of the batch effect. We suspect that sample vial has something to do. However, we try to see if we can improve the clustering by using method ComBat for sample vial.

library("sva")

combatexp <- ComBat(assays(thca.filtered)$cqn.logCPM, tss, mod = design)
## Found 4 batches
## Adjusting for 4 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
d <- as.dist(1 - cor(combatexp, method = "spearman"))
cluster.combat <- hclust(d)
cluster.combat <- as.dendrogram(cluster.combat, hang = 0.1)
names(batch) <- colnames(thca.filtered)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
                            as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)

plot.batch(cluster.combat, samplevial, thca.filtered, outcome)
\label{clustering.combat}Figure 16.A: Hierarchical clustering of the samples with ComBat correction colored by samplevial.

Figure 16.A: Hierarchical clustering of the samples with ComBat correction colored by samplevial.

plot.batch(cluster.combat, tss, thca.filtered, outcome)
\label{clustering.combat2}Figure 16.B: Hierarchical clustering of the samples with ComBat correction coloured by tissues sourec site

Figure 16.B: Hierarchical clustering of the samples with ComBat correction coloured by tissues sourec site

plot.batch(cluster.combat, thca.filtered$gender, thca.filtered, outcome)
\label{clustering.combat3}Figure 16.C: Hierarchical clustering of the samples with ComBat correction coloured by gender.

Figure 16.C: Hierarchical clustering of the samples with ComBat correction coloured by gender.

plot.batch(cluster.combat, thca.filtered$type, thca.filtered, outcome)
\label{clustering.combat4}Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by type.

Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by type.

plot.batch(cluster.combat, portion, thca.filtered, outcome)
\label{clustering.combat5}Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by portion

Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by portion

We can observe a slightly improvement, but still some tumoral samples cluster together with the normal ones.

batch_effects <- list(samplevial, tss, as.character(thca.filtered$gender),
                      as.character(thca.filtered$type), portion)
for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(combatexp, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
\label{mds}Figure 17: Multidimensional scaling plot of the samples.

Figure 17: Multidimensional scaling plot of the samples.

\label{mds}Figure 17: Multidimensional scaling plot of the samples.

Figure 17: Multidimensional scaling plot of the samples.

\label{mds}Figure 17: Multidimensional scaling plot of the samples.

Figure 17: Multidimensional scaling plot of the samples.

\label{mds}Figure 17: Multidimensional scaling plot of the samples.

Figure 17: Multidimensional scaling plot of the samples.

\label{mds}Figure 17: Multidimensional scaling plot of the samples.

Figure 17: Multidimensional scaling plot of the samples.

QR Decomposition

To improve the batch effect removal we look if QR decomposition yields better results. In this case, we look for sample vial and TSS batch effects removal. To do so, we have to built the full model which would include gender and type as fixed effects.

library("limma")

qrexp <- removeBatchEffect(assays(thca.filtered)$cqn.logCPM, tss, design = design)
d <- as.dist(1 - cor(qrexp, method = "spearman"))
cluster.qr <- hclust(d)
cluster.qr <- as.dendrogram(cluster.qr, hang = 0.1)

plot.batch(cluster.qr, samplevial, thca.filtered, outcome)
\label{clustering.QRdecomposition.1}Figure 18.A: Hierarchical clustering of the samples with QR descompostion for sample vial

Figure 18.A: Hierarchical clustering of the samples with QR descompostion for sample vial

plot.batch(cluster.qr, tss, thca.filtered, outcome)
\label{clustering.QRdecomposition.2}Figure 18.B: Hierarchical clustering of the samples with QR descompostion for TSS

Figure 18.B: Hierarchical clustering of the samples with QR descompostion for TSS

plot.batch(cluster.qr, thca.filtered$gender, thca.filtered, outcome)
\label{clustering.QRdecomposition.3}Figure 18.C: Hierarchical clustering of the samples with QR descompostion for gender

Figure 18.C: Hierarchical clustering of the samples with QR descompostion for gender

plot.batch(cluster.qr, thca.filtered$type, thca.filtered, outcome)
\label{clustering.QRdecomposition.4}Figure 18.D: Hierarchical clustering of the samples with QR descompostion for type

Figure 18.D: Hierarchical clustering of the samples with QR descompostion for type

plot.batch(cluster.qr, portion, thca.filtered, outcome)
\label{clustering.QRdecomposition.5}Figure 18.E: Hierarchical clustering of the samples with QR descompostion for portion

Figure 18.E: Hierarchical clustering of the samples with QR descompostion for portion

With QR decomposition we can observe that some tumoral samples cluster with other normal samples on a lower branch.

for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(qrexp, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
\label{mds}Figure 19: Multidimensional scaling plot of the samples with qrexp.

Figure 19: Multidimensional scaling plot of the samples with qrexp.

\label{mds}Figure 19: Multidimensional scaling plot of the samples with qrexp.

Figure 19: Multidimensional scaling plot of the samples with qrexp.

\label{mds}Figure 19: Multidimensional scaling plot of the samples with qrexp.

Figure 19: Multidimensional scaling plot of the samples with qrexp.

\label{mds}Figure 19: Multidimensional scaling plot of the samples with qrexp.

Figure 19: Multidimensional scaling plot of the samples with qrexp.

\label{mds}Figure 19: Multidimensional scaling plot of the samples with qrexp.

Figure 19: Multidimensional scaling plot of the samples with qrexp.

Removing batch effect with SVD

We further try if SVD yelds a better result:

library("corpcor")
s <- fast.svd(t(scale(t(assays(thca.filtered)$cqn.logCPM),
                      center = TRUE, scale = TRUE)))
pcSds <- s$d
pcSds[1] <- 0
svdexp <- s$u %*% diag(pcSds) %*% t(s$v)
colnames(svdexp) <- colnames(thca.filtered)
class(svdexp)
## [1] "matrix"
dim(svdexp)
## [1] 11801    52
d <- as.dist(1 - cor(svdexp, method = "spearman"))
cluster.svd <- hclust(d)
cluster.svd <- as.dendrogram(cluster.svd, hang = 0.1)
names(batch) <- colnames(thca.filtered)

plot.batch(cluster.svd, samplevial, thca.filtered, outcome)
\label{clustering.svd1}Figure 20.A: Hierarchical clustering of the samples with SVD for sample vial

Figure 20.A: Hierarchical clustering of the samples with SVD for sample vial

plot.batch(cluster.svd, tss, thca.filtered, outcome)
\label{clustering.svd2}Figure 20.B: Hierarchical clustering of the samples with SVD for TSS

Figure 20.B: Hierarchical clustering of the samples with SVD for TSS

plot.batch(cluster.svd, thca.filtered$gender, thca.filtered, outcome)
\label{clustering.svd3}Figure 20.C: Hierarchical clustering of the samples with SVD for gender

Figure 20.C: Hierarchical clustering of the samples with SVD for gender

plot.batch(cluster.svd, thca.filtered$type, thca.filtered, outcome)
\label{clustering.svd4}Figure 20.D: Hierarchical clustering of the samples with SVD for type

Figure 20.D: Hierarchical clustering of the samples with SVD for type

plot.batch(cluster.svd, portion, thca.filtered, outcome)
\label{clustering.svd3}Figure 20.E: Hierarchical clustering of the samples with SVD for portion

Figure 20.E: Hierarchical clustering of the samples with SVD for portion

SVD clusters together normal and tumoral samples, maybe because some of them are quite similar according to the PCA previously performed. Therefore we don’t plot the MDS for SVD.

After all the process the best one, that separates better tumor and normal samples is the ComBat function, as we have seen on the figure 17.D, which will be used for following-up analysis.

assays(thca.filtered)$qrexp <- qrexp
assays(thca.filtered)$combatexp <- combatexp

Initial assesment on differential expression

We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva using the models previously created.

pv <- f.pvalue(assays(thca.filtered)$combatexp, design, mod0)
sum(p.adjust(pv, method = "fdr") < 0.01)
## [1] 0

There are 0 genes changing significantly their expression at FDR < 1%. In Figure 18 below we show the distribution of the resulting p-values.

hist(pv, main = "Histogram of p-values", las = 1)
\label{pdist}Figure 19.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Figure 19.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

hist(pv, main = "Histogram of p-values", xlim = c(0, 0.1), breaks = 80)
\label{pdist2}Figure 18.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Figure 18.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

As we can see the distribution of p-values is not uniform, and reach the maximum frequency at p-value 1, and the minimum at p-value 0.

Surrogate variables with sva

Using sva to estimate and remove artifacts from high dimensional data, where we use our alternative hypothesis and our null hypothesis:

sv <- sva(assays(thca.filtered)$combatexp, design, mod0)
## Number of significant surrogate variables is:  9 
## Iteration (out of 5 ):1  2  3  4  5

The SVA algorithm has found 9 surrogate variables. Let’s use them to assess against the extent of differential expression this time adjusting for these surrogate variables.

modsv <- cbind(design, sv$sv)
colnames(modsv) <- c(colnames(design), paste0("sv", c(1:sv$n)))
mod0sv <- cbind(mod0, sv$sv)
colnames(mod0sv) <- c(colnames(mod0), paste0("sv", c(1:sv$n)))
pvsv <- f.pvalue(assays(thca.filtered)$combatexp, modsv, mod0sv)
sum(p.adjust(pvsv, method = "fdr") < 0.01)
## [1] 0

Apparently we haven’t increased the number of DE genes, but we can see if the distribution of p-values has changed:

hist(pvsv, main = "Histogram of p-values adjusting surrogate variables", 
     las = 1)
\label{psvdist}Figure 20.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

Figure 20.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

hist(pvsv, main = "Histogram of p-values adjusting surrogate variables", 
     las = 1, xlim = c(0, 0.1), breaks = 80)
\label{psvdist2}Figure 20.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

Figure 20.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

As we can see the distribution is now uniform, thus sva corrected for artifacts and the underlying hypotesis of uniform distribution of gene expression is true.

Removing surrogate variables effect

Instead of working with the surrogate variables in the model, we could remove them using the following function

# Code from https://www.biostars.org/p/121489/#121500

cleanY = function(y, mod, svs) {
    # Function to remove the efect of sv following a design from y
    X = cbind(mod, svs)
    Hat = solve(t(X) %*% X) %*% t(X)
    beta = (Hat %*% t(y))
    rm(Hat)
    gc()
    P = ncol(mod)
    return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}

e.no_surrogates <- cleanY(assays(thca.filtered)$combatexp, design, sv$sv)

The object e.no_surrogates contains the expression after removing the surrogates variables.

Functional Annotation

We want ensembl annotation, and gene symbol of our probes, and we want also the gene ontologies from all the human genes (will be later used for functional enrichment):

library("org.Hs.eg.db")
## 
annot <- select(org.Hs.eg.db, 
                columns = c("SYMBOL", "ENSEMBL"),
                keys = rownames(thca.filtered),
                keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
all.Human.GO <- select(org.Hs.eg.db,
                       columns = c("GO", "SYMBOL"), 
                       key = keys(org.Hs.eg.db, keytype = "ENTREZID"), 
                       keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
annot.Entrez.GO <- select(org.Hs.eg.db, 
                columns = c("GO", "SYMBOL"),
                keys = rownames(thca.filtered),
                keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns

The different objects contain different information: annot contains the Symbol and Ensembl identifier from the remaining genes, all.Human.GO is the object with Gene Ontologies and Symbols for all human genes, while anot.Entrez.GO has the GO and the symbols for the genes.

Differential expression of paired data

Using the expression values with batch effect removed by ComBat we proceed with the voom normalization, note that the surrogate variables effect is still in there, because it is better adjusting than removing the effect of a variable.

# Transform the log2 normalized by cqn to counts without the surrogate variables
n_counts <- apply(assays(thca.filtered)$combatexp, 1:2, function(x){2^(x)})

# Normalize by mean variance with voom
vo <- voom(n_counts, modsv, plot = TRUE)
\label{voom} Figure 21: The effect of the normalization by voom.

Figure 21: The effect of the normalization by voom.

fit <- lmFit(vo, modsv)

# Make the comparisons we are interested in based on our design
cm <- makeContrasts(
    FemaleTumorVsFemaleNormal = tumor.FEMALE - normal.FEMALE,
    MaleTumorVsMaleNormal = tumor.MALE - normal.MALE,
    FemaleTumorVsMaleTumor = tumor.FEMALE - tumor.MALE,
    FemaleNormalVsMaleNormal = normal.FEMALE - normal.MALE,
    TumorVsNormal = (tumor.MALE + tumor.FEMALE) - (normal.MALE + normal.FEMALE),
    FemaleVsMale = (tumor.FEMALE + normal.FEMALE) - (tumor.MALE + normal.MALE),
    FemaleTumorvsMaleNormal = tumor.FEMALE - normal.MALE,
    FemalNormaleVsMaleTumor = normal.FEMALE - tumor.MALE,
    levels = modsv)
cm[1:5,]
##                Contrasts
## Levels          FemaleTumorVsFemaleNormal MaleTumorVsMaleNormal
##   normal.FEMALE                        -1                     0
##   normal.MALE                           0                    -1
##   tumor.FEMALE                          1                     0
##   tumor.MALE                            0                     1
##   paired                                0                     0
##                Contrasts
## Levels          FemaleTumorVsMaleTumor FemaleNormalVsMaleNormal
##   normal.FEMALE                      0                        1
##   normal.MALE                        0                       -1
##   tumor.FEMALE                       1                        0
##   tumor.MALE                        -1                        0
##   paired                             0                        0
##                Contrasts
## Levels          TumorVsNormal FemaleVsMale FemaleTumorvsMaleNormal
##   normal.FEMALE            -1            1                       0
##   normal.MALE              -1           -1                      -1
##   tumor.FEMALE              1            1                       1
##   tumor.MALE                1           -1                       0
##   paired                    0            0                       0
##                Contrasts
## Levels          FemalNormaleVsMaleTumor
##   normal.FEMALE                       1
##   normal.MALE                         0
##   tumor.FEMALE                        0
##   tumor.MALE                         -1
##   paired                              0
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)

# Decide the threshold of differentially expressed genes
DE_threshold <- 4
lfc_threshold <- log2(DE_threshold)
res <- decideTests(fit2, lfc = lfc_threshold)
summary(res)
##    FemaleTumorVsFemaleNormal MaleTumorVsMaleNormal FemaleTumorVsMaleTumor
## -1                     10630                     0                      2
## 0                       1169                 11801                  11799
## 1                          2                     0                      0
##    FemaleNormalVsMaleNormal TumorVsNormal FemaleVsMale
## -1                        1          9941            1
## 0                       776          1852          773
## 1                     11024             8        11027
##    FemaleTumorvsMaleNormal FemalNormaleVsMaleTumor
## -1                       1                       3
## 0                    11800                    1153
## 1                        0                   10645

We decided a threshold of differentially expressed genes of 4 (thus log2 of it is 2), meaning that it must be 4 times more expressed in one group than in other in orther to pass the filter. Based on these we are only interested in four contrast which will be annotated with symbol and ensembl id:

tt_tumor <- topTable(fit2, number = Inf, coef = "TumorVsNormal", p.value = 0.05)
tt_gender <- topTable(fit2, number = Inf, coef = "FemaleVsMale", p.value = 0.05)
tt_females <- topTable(fit2, number = Inf, coef = "FemaleTumorVsFemaleNormal", p.value = 0.05)
tt_males <- topTable(fit2, number = Inf, coef = "MaleTumorVsMaleNormal", p.value = 0.05)
    
# Annotate with the gene symbol
tt_tumor <- merge(tt_tumor, annot, by.x = 0, by.y = "ENTREZID", 
                  all.x = T, all.y = F)
tt_males <- merge(tt_males, annot, by.x = 0, by.y = "ENTREZID", 
                  all.x = T, all.y = F)
tt_gender <- merge(tt_gender, annot, by.x = 0, by.y = "ENTREZID", 
                   all.x = T, all.y = F)
tt_females <- merge(tt_females, annot, by.x = 0, by.y = "ENTREZID", 
                   all.x = T, all.y = F)
head(tt_tumor)
##   Row.names     logFC  AveExpr          t      P.Value    adj.P.Val
## 1     10002 -9.476681 8.292238 -10.373128 4.543685e-13 1.700603e-12
## 2 100033416 -2.200815 2.309834  -3.099678 3.481826e-03 4.556335e-03
## 3 100033417 -5.299424 7.891636  -2.295198 2.687264e-02 3.224772e-02
## 4 100033420 -4.215710 4.668286  -2.464289 1.797807e-02 2.194418e-02
## 5 100033424 -4.426622 3.413475  -7.221180 7.724593e-09 1.644856e-08
## 6 100033426 -5.603019 6.223775 -14.069474 2.382373e-17 2.333144e-16
##           B      SYMBOL         ENSEMBL
## 1 19.457926       NR2E3 ENSG00000278570
## 2 -1.862352  SNORD116-4 ENSG00000275529
## 3 -4.046471  SNORD116-5 ENSG00000207191
## 4 -3.554915  SNORD116-8 ENSG00000207093
## 5  9.910661 SNORD116-12 ENSG00000207197
## 6 28.689808 SNORD116-14 ENSG00000206621

Accuracy

Diagnostics plot:

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_gender$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 6], df = fit2$df.prior + fit2$df.residual[6], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 6], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between females and males patients", 
      outer = TRUE, cex = 1.5, side = 3, line = -2)
\label{accuracy}Figure 22.A: P-value distribution and comparison with the t-student distribution in the comparison Males vs Females.

Figure 22.A: P-value distribution and comparison with the t-student distribution in the comparison Males vs Females.

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_tumor$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 5], df = fit2$df.prior + fit2$df.residual[5], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 5], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between healthy and tumoral patients", 
      outer = TRUE, cex = 1.5, side = 3, line = -2)
\label{accuracy2}Figure 22.B: P-value distribution and comparison with the t-student distribution in the comparison Tumoral and Healthy samples.

Figure 22.B: P-value distribution and comparison with the t-student distribution in the comparison Tumoral and Healthy samples.

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_females$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 1], df = fit2$df.prior + fit2$df.residual[1], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 1], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between females patients and healthy.", outer = TRUE, cex = 1.5, 
      side = 3, line = -2)
\label{accuracy3}Figure 22.C: P-value distribution and comparison with the t-student distribution in the female comparison tumoral and healthy samples.

Figure 22.C: P-value distribution and comparison with the t-student distribution in the female comparison tumoral and healthy samples.

# pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
# hist(tt_males$P.Value, xlab = "Raw P-values", main = "", las = 1)
# qqt(fit2$t[, 2], df = fit2$df.prior + fit2$df.residual[2], main = "", pch = ".", cex = 3)
# qqline(fit2$t[, 2], lwd = 2, col = "red")
# abline(0, 1, lwd = 2)
# mtext("Comparing genes between male patients and healthy.", outer = TRUE, cex = 1.5, 
#       side = 3, line = -2)

Volcano plots

We can see in volanco plots how look like the data

volcano <- function(data, title){
    # Plot a volcano plot of a toptable output of limma
    if (nrow(data) == 0) {
        return(NULL)
    }
    ggplot(data, aes(logFC, B, colour = adj.P.Val)) + geom_point() + theme_bw()+ 
        ggtitle(title) + geom_vline(xintercept = c(2, -2), col = "red")+ 
        geom_hline(yintercept = 4.6, col = "red")
}

volcano(tt_tumor, "Comparing tumoral vs normal")
\label{volcano1}Figure 23.A: Volcano plot of the comparison healthy vs tumoral.

Figure 23.A: Volcano plot of the comparison healthy vs tumoral.

volcano(tt_gender, "Comparing females vs males")
\label{volcano2}Figure 23.B: Volcano plot of the comparison males vs females.

Figure 23.B: Volcano plot of the comparison males vs females.

volcano(tt_females, "Comparing females: tumoral vs normal")
\label{volcano4}Figure 23.C: Volcano plot of the comparison healthy fremales vs tumoral females.

Figure 23.C: Volcano plot of the comparison healthy fremales vs tumoral females.

Genes differentially expressed

The number of genes which are differentialed expressed can be calculated with:

de_tumor <- subset(tt_tumor, abs(logFC) >= lfc_threshold & B >= 4.6)
de_gender <- subset(tt_gender, abs(logFC) >= lfc_threshold & B >= 4.6)
tryCatch(de_males <- subset(tt_males, abs(logFC) >= lfc_threshold & B >= 4.6),
         error = function(e){return(0)})
## [1] 0
de_females <- subset(tt_females, abs(logFC) >= lfc_threshold & B >= 4.6)

# Number of genes which pass the filter
de_genes <- c(nrow(de_tumor), nrow(de_gender), 0, nrow(de_females))
names(de_genes) <- c("Tumor vs Normal", "Females vs Males", "Males", "Females")
de_genes
##  Tumor vs Normal Females vs Males            Males          Females 
##             7204            10630                0             8976

As we can see there aren’t any gene of the comparison male vs female that pass the filter. We can compare if in the different comparisons such genes remain allways up-regulated or down-regulated.

Exploration of differential expressed genes

But we create some function to make it easier to compare several lists:

up_down <- function(data,  names = NULL){
    # Classify the genes in up or down, names are the names of the genes and 
    # should be on the same order
    dif_genes <- ifelse(data$logFC > 0, "up", "down")
    if (is.null(names)){
        names(dif_genes) <- rownames(data)
    } else {
        names(dif_genes) <- names
    }
    
    dif_genes
}

classify <- function(regulated1, regulated2){
    # Check how much up regulated genes in one comparison are down regulated in others

    inter <- intersect(names(regulated1), names(regulated2))
    
    genes <- c()
    genes2 <- c()
    
    # Reorder them by the name for the comparison with table
    for (name in inter){
        genes <- c(genes, regulated1[name])
        genes2 <- c(genes2, regulated2[name])
    }
    
    # Perform the comparison
    table(genes, genes2, 
          dnn = c(deparse(substitute(regulated1)), deparse(substitute(regulated2))))
}

genes_names <- function(regulated1, regulated2){
    # Extract the names of those up-regulated, down-regulated or not clear in
    # both lists
    inter <- intersect(names(regulated1), names(regulated2))
    
    up <- c()
    down <- c()
    not_clear <- c()
    for (name in inter){
        if (regulated1[name] == regulated2[name]){
            if (regulated1[name] == "up"){
                up <- c(up, name)
            } else{
                down <- c(down, name)
            }
        } else{
            not_clear <- c(not_clear, name)
        }
    }
    
    list(up = up, down = down, not_clear = not_clear)
}

The function up_down classify the genes in up-regulated and down-regulated. The function classify given two lists of genes up or down regulates, find the genes common in them and check if in both lists are up-regulated or down-regulated, however it doesn’t display the names, just the numbers. However the function genes_names classify the genes on the three elements of the list: up if in both sets is up-regulated, down if in both sets the genes is down-regulated and not_clear if in a dataset the gene is up-regulated and in the other is down-regulated. We can see here the results of several comparisons:

r_tumor <- up_down(de_tumor)
r_females <- up_down(de_females)
# r_males <- up_down(de_males)
r_gender <- up_down(de_gender)

# We see how do they overlap, more up-regulated, or more dow-regulated genes.
# classify(r_tumor, r_males)
classify(r_tumor, r_females)
##        r_females
## r_tumor down
##    down 5747
# classify(r_females, r_males)

# classify(r_gender, r_males)
classify(r_females, r_gender)
##          r_gender
## r_females down   up
##      down    1 7973
classify(r_tumor, r_gender)
##        r_gender
## r_tumor down   up
##    down    1 6391
# We collect the name of genes for further studies
# genes_females_males <- genes_names(r_females, r_males)
# genes_tumor_males <- genes_names(r_tumor, r_males)
genes_tumor_females <- genes_names(r_tumor, r_females)
DE_tumor <- unique(rownames(de_tumor))
# DE_males <- unique(rownames(de_males))
DE_females <- unique(rownames(de_females))
DE_gender <- unique(rownames(de_gender))

Venn diagrams

To visualize it we plot them with venn diagrams:

library("Vennerable")
plot(Venn(list(DE_tumor, DE_females),
              SetNames = c("Tumoral", "Females")))
\label{venn1}Figure 24.A: Representation of the number of genes differentially expressed in comparison between females, and healthy vs tumoral.

Figure 24.A: Representation of the number of genes differentially expressed in comparison between females, and healthy vs tumoral.

As we can observe of the comparison, many genes differentially expressed on females comparing healthy samples and tumoral ones are not found on males.

Functional enrichment

We can perform a functional enrichment on our data with the GO already stored

library(GOstats)

go_obj <- function(genes_id, universe = unique(annot.Entrez.GO$ENTREZID)){
    # Create an object for GOstat from ENTREZIDs
    new("GOHyperGParams", geneIds = genes_id,
                    universeGeneIds = universe,
                    annotation = "org.Hs.eg.db", ontology = "BP",
                    pvalueCutoff = 0.05, testDirection = "over", 
                    conditional = TRUE)
}

params_tumor <- go_obj(DE_tumor)
# params_males <- go_obj(DE_males)
params_females <- go_obj(DE_females)
parmas_gender <- go_obj(DE_gender)

# The set of differentially expressed only in females
# mask <- DE_females %in% intersect(DE_males, DE_females)
# DE_females_exc <- DE_females[!mask]
# DE_females_exc <- DE_females_exc[!DE_females_exc %in% intersect(DE_females_exc, DE_tumor)]
# params_female_exc <- go_obj(DE_females_exc)

# The set of differentially expressed only in males
# mask <- DE_males %in% intersect(DE_males, DE_females)
# DE_males_exc <- DE_males[!mask]
# DE_males_exc <- DE_males_exc[!DE_males_exc %in% intersect(DE_males_exc, DE_tumor)]
# params_male_exc <- go_obj(DE_males_exc)

# We explore those genes that are shared among the comparisons
# go_females_males_up <- go_obj(genes_females_males$up)
# go_females_males_down <- go_obj(genes_females_males$down)
# go_females_males_not_clear <- go_obj(genes_females_males$not_clear)

# go_tumor_females_not_clear <- go_obj(genes_tumor_females$not_clear)
# go_tumor_females_up <- go_obj(genes_tumor_females$up) # Genes not in the universe
go_tumor_females_down <- go_obj(genes_tumor_females$down)

# go_tumor_males_not_clear <- go_obj(genes_tumor_males$not_clear)
# go_tumor_males_up <- go_obj(genes_tumor_males$up)
# go_tumor_males_down <- go_obj(genes_tumor_males$down)

Once prepared with the object we can run the analysis:

go_analisys <- function(go_obj, annots.GO, size = 5, count = 5, pvalue = 0.05){
    hgOver <- hyperGTest(go_obj)
    
    GO <- GOstats::summary(hgOver)
    
    GO <- subset(GO, Size >= size & Count >= count & Pvalue <= pvalue)
    if (nrow(GO) == 0){
        warning("Not significant GO on this group with these settings.")
        return(NULL)
    }
    
    # Visualize the data
    plotgo <- ggplot(GO, aes(Size, Count, size = OddsRatio, colour = Pvalue)) + 
        geom_point() + b + ggtitle(deparse(substitute(go_obj)))
    
    GO <- GO[order(GO$OddsRatio, decreasing = TRUE), ]
    resume <- function(id, annot){
        #
        a <- unique(annot[annot$ENTREZID %in% id, "SYMBOL"])
        if (length(a) >= 1){
            return(a)
        } else{
            return("")
        }
    }
    geneIDs <- geneIdsByCategory(hgOver)[GO$GOBPID]
    geneSYMs <- sapply(geneIDs, resume, annot = annots.GO)
    if (is.matrix(geneSYMs)){
        geneSYMs <- apply(geneSYMs, 2, paste, collapse = ", ")
    } else {
        geneSYMs <- sapply(geneSYMs, paste, collapse = ", ")
    }
    GO <- merge(GO, geneSYMs, by.x = "GOBPID", by.y = 0)
    GO <- GO[order(-GO$OddsRatio, GO$Pvalue, -GO$ExpCount, -GO$Count), ]
    rownames(GO) <- 1:nrow(GO)
    colnames(GO) <- c(colnames(GO)[-ncol(GO)], "Genes")
    list(GO = GO, plot = plotgo)
}
GO_tumor <- go_analisys(params_tumor, annot.Entrez.GO)
if (!is.null(GO_tumor)){
    print(head(GO_tumor$GO))
    plot(GO_tumor$plot)
}
##       GOBPID       Pvalue OddsRatio ExpCount Count Size
## 1 GO:0051482 5.341315e-05       Inf 2.341270     8    8
## 2 GO:0030194 1.811497e-04       Inf 2.045881     7    7
## 3 GO:0007158 6.257946e-04       Inf 1.755952     6    6
## 4 GO:0018119 6.257946e-04       Inf 1.755952     6    6
## 5 GO:0044144 6.257946e-04       Inf 1.755952     6    6
## 6 GO:0051481 6.257946e-04       Inf 1.755952     6    6
##                                                                                                                                    Term
## 1 positive regulation of cytosolic calcium ion concentration involved in phospholipase C-activating G-protein coupled signaling pathway
## 2                                                                                              positive regulation of blood coagulation
## 3                                                                                                             neuron cell-cell adhesion
## 4                                                                                                     peptidyl-cysteine S-nitrosylation
## 5                                                                    modulation of growth of symbiont involved in interaction with host
## 6                                                                            negative regulation of cytosolic calcium ion concentration
##                                                 Genes
## 1 AGTR1, S1PR1, LPAR1, EDNRA, F2R, GRM1, HTR2C, KISS1
## 2                CD36, F2R, F3, F7, F12, S100A9, SELP
## 3                ASTN1, RET, TNR, NRXN3, NRXN1, NRXN2
## 4             ADH5, ACE, GAPDH, S100A8, S100A9, SNTA1
## 5                 CD36, CTSG, IFNG, IL10, IL12B, OSBP
## 6            ATP1A2, GTF2I, HRC, KCNA5, KCNK3, SLC8A1
\label{tumor1}Figure 25.A: Representation of the P-value and effect of the GO of the comparison of healty and tumoral samples.

Figure 25.A: Representation of the P-value and effect of the GO of the comparison of healty and tumoral samples.

We should do the same if we had some genes differentially expressed on the comparison between gender.

GO_females <- go_analisys(params_females, annot.Entrez.GO)
if (!is.null(GO_females)){
    print(head(GO_females$GO))
    plot(GO_females$plot)
}
##       GOBPID       Pvalue OddsRatio ExpCount Count Size
## 1 GO:0045909 1.030450e-06       Inf 5.236111    14   14
## 2 GO:0007188 6.933480e-06       Inf 4.463845    12   12
## 3 GO:0010248 5.311264e-05       Inf 3.740079    10   10
## 4 GO:0033762 1.380714e-04       Inf 3.354971     9    9
## 5 GO:0071887 3.722405e-04       Inf 2.983501     8    8
## 6 GO:1901381 3.781912e-04       Inf 2.989409     8    8
##                                                                        Term
## 1                                       positive regulation of vasodilation
## 2 adenylate cyclase-modulating G-protein coupled receptor signaling pathway
## 3    establishment or maintenance of transmembrane electrochemical gradient
## 4                                                      response to glucagon
## 5                                               leukocyte apoptotic process
## 6              positive regulation of potassium ion transmembrane transport
##                                                                                      Genes
## 1 ADM, ADRA2A, ADRB2, AGTR2, ALOX12, CPS1, F2RL1, GJA5, INS, PTAFR, PTPRM, UCN, VIP, KAT2B
## 2           AVPR2, CCR3, CNR1, FPR1, GCG, GCGR, GNAT1, GNAT2, PRKACB, PTGER4, GLP2R, WASF2
## 3               ATP1A1, ATP1A2, ATP1A3, ATP12A, ATP1A4, ATP1B3, FXYD2, ATP4A, BAX, SLC22A1
## 4                           CDO1, CREB1, CRY1, CYC1, HMGCS2, PFKFB1, QDPR, RPS6KB1, SREBF1
## 5                                  CLC, DFFA, IFNG, IL2RA, RIPK1, BCL2L11, SIVA1, TRAF3IP2
## 6                                     DPP6, EDN3, KCNC1, KCNC2, KCNE1, KCNH2, KCNJ2, KCNQ1
\label{tumor2}Figure 25.B: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in females comparison

Figure 25.B: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in females comparison

GO_gender <- go_analisys(parmas_gender, annot.Entrez.GO)
if (!is.null(GO_gender)){
    print(head(GO_gender$GO))
    plot(GO_gender$plot)
}
##       GOBPID       Pvalue OddsRatio ExpCount Count Size
## 1 GO:0030199 1.995872e-06       Inf 6.259921    15   15
## 2 GO:0043279 4.414534e-06       Inf 5.808478    14   14
## 3 GO:0034698 4.792845e-06       Inf 5.842593    14   14
## 4 GO:0030155 8.343936e-06       Inf 5.293155    13   13
## 5 GO:0007188 2.580717e-05       Inf 4.979673    12   12
## 6 GO:0033189 2.762569e-05       Inf 5.007937    12   12
##                                                                        Term
## 1                                              collagen fibril organization
## 2                                                      response to alkaloid
## 3                                                  response to gonadotropin
## 4                                               regulation of cell adhesion
## 5 adenylate cyclase-modulating G-protein coupled receptor signaling pathway
## 6                                                     response to vitamin A
##                                                                                                                    Genes
## 1 ANXA2, SERPINH1, COL1A1, COL2A1, COL3A1, COL5A1, COL5A2, COL11A1, COL12A1, LUM, DDR2, P4HA1, COL14A1, ADAMTS3, ADAMTS2
## 2                          ADA, BCHE, EDNRA, FGA, HIF1A, IL1B, PPP2R2A, PPP5C, PRKCE, PRKCG, TAC1, UQCRC1, PEA15, TMED10
## 3                          ASNS, CTSV, GATA4, GATA6, GJB2, HMGCS1, ICAM1, ITGA3, ITGB1, PAWR, SLC5A5, SRD5A1, STAR, PAX8
## 4                          S1PR1, ICAM1, LAMA3, LAMA4, LAMA5, PPP1R12A, SERPINI1, PPP2R1A, PRKX, SOX9, SRF, CYTIP, NUAK1
## 5                                         AVPR2, CCR3, CNR1, FPR1, GCG, GCGR, GNAT1, GNAT2, PRKACB, PTGER4, GLP2R, WASF2
## 6                                      ARG1, CAT, DNMT1, DNMT3A, DNMT3B, GATA4, RARA, RXRA, SLC34A1, TSHB, TYMS, ALDH1A2
\label{tumor3}Figure 25.D: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in females againts males

Figure 25.D: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in females againts males

GO_tumor_females_down <- go_analisys(go_tumor_females_down, annot.Entrez.GO)
if (!is.null(GO_tumor_females_down)){
    print(head(GO_tumor_females_down$GO))
    plot(GO_tumor_females_down$plot)
}
##       GOBPID       Pvalue OddsRatio ExpCount Count Size
## 1 GO:0051482 9.634658e-06       Inf 1.890653     8    8
## 2 GO:0009629 7.346161e-04       Inf 1.181658     5    5
## 3 GO:0042148 7.346161e-04       Inf 1.181658     5    5
## 4 GO:0051918 7.346161e-04       Inf 1.181658     5    5
## 5 GO:2000074 7.346161e-04       Inf 1.181658     5    5
## 6 GO:0001774 2.595136e-04  22.69022 1.890653     7    8
##                                                                                                                                    Term
## 1 positive regulation of cytosolic calcium ion concentration involved in phospholipase C-activating G-protein coupled signaling pathway
## 2                                                                                                                   response to gravity
## 3                                                                                                                       strand invasion
## 4                                                                                                   negative regulation of fibrinolysis
## 5                                                                                      regulation of type B pancreatic cell development
## 6                                                                                                            microglial cell activation
##                                                 Genes
## 1 AGTR1, S1PR1, LPAR1, EDNRA, F2R, GRM1, HTR2C, KISS1
## 2                    BGLAP, CBLB, PTAFR, SPARC, FOSL1
## 3                RAD51, RAD51C, RAD51B, RAD51D, XRCC2
## 4                       APOH, F2, HRG, SERPINE1, USF1
## 5                       ARNTL, BAD, RFX3, RHEB, CLOCK
## 6         CLU, CX3CR1, IL13, JUN, CX3CL1, SNCA, TRPV1
\label{tumor10}Figure 25.K: Representation of the P-value and effect of the GO of the genes down-regulated in both comparisons between normal and tumors and tumor vs normal in females.

Figure 25.K: Representation of the P-value and effect of the GO of the genes down-regulated in both comparisons between normal and tumors and tumor vs normal in females.

Result tables

We should look in details to those values and associations in the tables.

library(xtable)
table_h <- function(go_obj, ftitle){
    # Function to output the table in html format
    if (is.null(go_obj)){
        warning("Null Object: unable to creat the table")
    } else {
        xtab_tumor <- xtable(go_obj, align="l|c|r|r|r|r|r|p{3cm}|p{3cm}")
        print(x=xtab_tumor, file = ftitle, type = "html")
    }
}

table_g <- function(tt_obj, ftitle){
    # Function to output the table in html format
    if (is.null(tt_obj)){
        warning("Null Object: unable to creat the table")
    } else {
        rownames(tt_obj) <- 1:nrow(tt_obj)
        xtab_tumor <- xtable(tt_obj, align="l|c|r|r|r|r|r|p{3cm}|p{3cm}|r|")
        print(x=xtab_tumor, file = ftitle, type = "html")
    }
}

table_h(GO_tumor$GO, "GO_tumor_table.html")
table_h(GO_females$GO, "GO_females_table.html")
# table_h(GO_males$GO, "GO_males_table.html")
table_h(GO_gender$GO, "GO_gender_table.html")

# table_h(GO_female_exc$GO, "GO_female_exc_table.html")
# table_h(GO_male_exc$GO, "GO_male_exc_table.html")

# table_h(GO_females_males_up$GO, "GO_females_males_up_table.html")
# table_h(GO_females_males_down$GO, "GO_females_males_down_table.html")
# table_h(GO_females_males_not_clear$GO, "GO_females_males_not_clear_table.html")

# table_h(GO_tumor_females_not_clear$GO, "GO_tumor_females_not_clear_table.html")
# table_h(GO_tumor_males_not_clear$GO, "GO_tumor_males_not_clear_table.html")
# table_h(GO_tumor_females_up$GO, "GO_tumor_females_up_table.html")
# table_h(GO_tumor_males_up$GO, "GO_tumor_males_up_table.html")
table_h(GO_tumor_females_down$GO, "GO_tumor_females_down_table.html")
# table_h(GO_tumor_males_down$GO, "GO_tumor_males_down_table.html")

table_g(de_tumor, "DE_tumor.html")
table_g(de_gender, "DE_gender.html")
table_g(de_females, "DE_females.html")
# table_g(de_males, "DE_males.html")

The table of genes which are differentially expressed in the different comparisons are in the following sites:
* The comparison of tumoral vs normal samples.
* The comparison of females vs males samples.
* The comparison of tumoral vs normal female samples.
* The comparison of tumoral vs normal male samples.

To explore more than the top 6 GO we should look into the tables generated with any browser. * In GO_tumor_table we have the list of GO enrich on the differentially expressed genes in the comparison tumor vs healthy samples.
* In GO_females_table we have the list of GO enrich on the differentially expressed genes in the comparison tumoral females vs healthy females.
* In GO_males_table we have the list of GO enrich on the differentially expressed genes in the comparison between tumoral males samples and healthy males samples.
* In GO_gender_table we have the list of GO enrich on the differentially expressed genes in the comparison between females and males.
* In GO_tumor_females_down we have the list of GO enrich on the differentially expressed genes down-regulated in the comparison of tumoral vs healthy samples and tumoral samples and healthy females.

SessionInfo

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] xtable_1.8-2               GO.db_3.3.0               
##  [3] GOstats_2.38.0             graph_1.50.0              
##  [5] Category_2.38.0            Matrix_1.2-6              
##  [7] Vennerable_3.1.0.9000      org.Hs.eg.db_3.3.0        
##  [9] corpcor_1.6.8              sva_3.20.0                
## [11] genefilter_1.54.2          mgcv_1.8-12               
## [13] nlme_3.1-128               cqn_1.18.0                
## [15] quantreg_5.26              SparseM_1.7               
## [17] preprocessCore_1.34.0      nor1mix_1.2-1             
## [19] mclust_5.2                 geneplotter_1.50.0        
## [21] annotate_1.50.0            XML_3.98-1.4              
## [23] AnnotationDbi_1.34.3       lattice_0.20-33           
## [25] edgeR_3.14.0               limma_3.28.6              
## [27] ggplot2_2.1.0              SummarizedExperiment_1.2.2
## [29] Biobase_2.32.0             GenomicRanges_1.24.1      
## [31] GenomeInfoDb_1.8.1         IRanges_2.6.0             
## [33] S4Vectors_0.10.1           BiocGenerics_0.18.0       
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.1         colorspace_1.2-6       htmltools_0.3.5       
##  [4] yaml_2.1.13            survival_2.39-4        RBGL_1.48.1           
##  [7] DBI_0.4-1              RColorBrewer_1.1-2     plyr_1.8.4            
## [10] stringr_1.0.0          zlibbioc_1.18.0        MatrixModels_0.4-1    
## [13] munsell_0.4.3          gtable_0.2.0           evaluate_0.9          
## [16] labeling_0.3           knitr_1.13             GSEABase_1.34.0       
## [19] Rcpp_0.12.5            KernSmooth_2.23-15     scales_0.4.0          
## [22] formatR_1.4            XVector_0.12.1         digest_0.6.9          
## [25] stringi_1.1.1          grid_3.3.1             tools_3.3.1           
## [28] magrittr_1.5           RSQLite_1.0.0          rmarkdown_0.9.6       
## [31] AnnotationForge_1.14.2

Bibliography

  • Agrawal N, Akbani R, Aksoy BA, Ally A, Arachchi H, Asa SL, Zou L. (2014). Integrated Genomic Characterization of Papillary Thyroid Carcinoma. Cell, 159(3), 676-690.
  • The Canger Genome Atlas. Papillary Thyroid Carcinoma. Retrieved May 6, 2016. Available from: http://cancergenome.nih.gov/cancersselected/thyroid
  • Xing M. (2013). Molecular pathogenesis and mechanisms of thyroid cancer. Nature Reviews. Cancer, 13(3), 184-99.