Supplementary material of New insights of the RNa-seq data of The Cancer Genome Atlas for Papillary Thyroid Carcinoma

Introduction

Thyroid cancer can develop from the different cells that form the follicles of the thyroid. This gland located at the base of the throat secretes hormones such as T3 and T4 that have their metabolic functionalities such as control of heart rate, blood pressure, body temperature and weight. See it on TCGA.

From the four types of thyroid cancer (papillary, follicular, medullary, and anaplastic thyroid cancer) Papillary Thyroid Carcinoma (PTC) is the most common type of thyroid cancer (Agrawal et al., 2014). It is more prevalent in women than men and its common diagnosis occurs around 49 years old.

Driver onco-mutations for this type of cancer appear as alterations of the MAPK signaling pathway and the PI3K-AKT pathway (Agrawal et al., 2014). Both are implied in cell proliferation and survival and in human tumorigenesis. The overactivation of the MAPK pathway because of mutations such as the BRAFV600E mutation, leads to the development of PTC from follicular thyroid cells (Xing, 2013). On the other hand, mutations that activate the PI3K-AKT pathway, such as mutations in RAS, PTEN and PIK3CA, mostly leads to development of Follicular Thyroid Adenoma (FTA) and Follicular Thyroid Cancer (FTC) also in follicular thyroid cells (Xing, 2013).

Research from The Cancer Genome Atlas (TCGA) focused on PTC and they found that a part from alterations in BRAF (specifically BRAFV600E) and RAS, there are other driver mutations such the ones in EIF1AX PPM1D, and CHEK2 genes that are main alterations for that type of thyroid cancer(Agrawal et al., 2014).

Data import

We start importing the raw table of counts.

library(SummarizedExperiment)

thca <- readRDS("seTHCA.rds") 
thca

## class: RangedSummarizedExperiment 
## dim: 20115 572 
## metadata(5): experimentData annotation cancerTypeCode
##   cancerTypeDescription objectCreationDate
## assays(1): counts
## rownames(20115): 1 2 ... 102724473 103091865
## rowData names(3): symbol txlen txgc
## colnames(572): TCGA.4C.A93U.01A.11R.A39I.07
##   TCGA.BJ.A0YZ.01A.11R.A10U.07 ... TCGA.KS.A41J.11A.12R.A23N.07
##   TCGA.KS.A41L.11A.11R.A23N.07
## colData names(549): type bcr_patient_uuid ...
##   lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total

Data exploration

Explore the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata.

dim(colData(thca))

## [1] 572 549

col.data <- colData(thca)
col.data[1:5, 1:5]

## DataFrame with 5 rows and 5 columns
##                                  type                     bcr_patient_uuid
##                              <factor>                             <factor>
## TCGA.4C.A93U.01A.11R.A39I.07    tumor E1C8AF06-9EA4-4062-9264-AA916E0025D1
## TCGA.BJ.A0YZ.01A.11R.A10U.07    tumor f3ee888e-5116-4313-a2f6-3b1dcc3e4bc1
## TCGA.BJ.A0Z0.01A.11R.A10U.07    tumor 72d8dcc3-0709-4fd1-98d4-fb75e9340758
## TCGA.BJ.A0Z2.01A.11R.A10U.07    tumor f9ceffc0-d544-418d-b4a9-bd3c84e37026
## TCGA.BJ.A0Z3.01A.11R.A13Y.07    tumor 331cae6e-2868-4c58-9302-709a9ff7d025
##                              bcr_patient_barcode form_completion_date
##                                         <factor>             <factor>
## TCGA.4C.A93U.01A.11R.A39I.07        TCGA-4C-A93U           2014-11-12
## TCGA.BJ.A0YZ.01A.11R.A10U.07        TCGA-BJ-A0YZ            2011-4-11
## TCGA.BJ.A0Z0.01A.11R.A10U.07        TCGA-BJ-A0Z0            2011-4-15
## TCGA.BJ.A0Z2.01A.11R.A10U.07        TCGA-BJ-A0Z2            2011-4-15
## TCGA.BJ.A0Z3.01A.11R.A13Y.07        TCGA-BJ-A0Z3             2011-6-2
##                              prospective_collection
##                                            <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                    YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z0.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z2.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                     NO

col.data.meta <- mcols(colData(thca), use.names=TRUE)
col.data.meta

## DataFrame with 549 rows and 2 columns
##                                                          labelDescription
##                                                               <character>
## type                                           sample type (tumor/normal)
## bcr_patient_uuid                                         bcr patient uuid
## bcr_patient_barcode                                   bcr patient barcode
## form_completion_date                                 form completion date
## prospective_collection            tissue prospective collection indicator
## ...                                                                   ...
## lymph_nodes_pelvic_pos_total                               total pelv lnp
## lymph_nodes_aortic_examined_count                           total aor lnr
## lymph_nodes_aortic_pos_by_he                          aln pos light micro
## lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
## lymph_nodes_aortic_pos_total                                total aor-lnp
##                                         CDEID
##                                   <character>
## type                                       NA
## bcr_patient_uuid                           NA
## bcr_patient_barcode                   2673794
## form_completion_date                       NA
## prospective_collection                3088492
## ...                                       ...
## lymph_nodes_pelvic_pos_total          3151828
## lymph_nodes_aortic_examined_count     3104460
## lymph_nodes_aortic_pos_by_he          3151832
## lymph_nodes_aortic_pos_by_ihc         3151831
## lymph_nodes_aortic_pos_total          3151827

Now, explore the row (feature) data.

rowRanges(thca)

## GRanges object with 20115 ranges and 3 metadata columns:
##             seqnames               ranges strand |      symbol     txlen
##                <Rle>            <IRanges>  <Rle> | <character> <integer>
##           1    chr19 [58345178, 58362751]      - |        A1BG      3322
##           2    chr12 [ 9067664,  9116229]      - |         A2M      4844
##           9     chr8 [18170477, 18223689]      + |        NAT1      2280
##          10     chr8 [18391245, 18401218]      + |        NAT2      1322
##          12    chr14 [94592058, 94624646]      + |    SERPINA3      3067
##         ...      ...                  ...    ... .         ...       ...
##   100996331    chr15 [20835372, 21877298]      - |       POTEB      1706
##   101340251    chr17 [40027542, 40027645]      - |    SNORD124       104
##   101340252     chr9 [33934296, 33934376]      - |   SNORD121B        81
##   102724473     chrX [49303669, 49319844]      + |      GAGE10       538
##   103091865    chr21 [39313935, 39314962]      + |   BRWD1-IT2      1028
##                  txgc
##             <numeric>
##           1 0.5644190
##           2 0.4882329
##           9 0.3942982
##          10 0.3895613
##          12 0.5249429
##         ...       ...
##   100996331 0.4308324
##   101340251 0.4903846
##   101340252 0.4074074
##   102724473 0.5055762
##   103091865 0.5924125
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

Looking deeper into col.data with:

col.data[1:5, 30:35]

## DataFrame with 5 rows and 6 columns
##                              lymph_nodes_examined_count
##                                                <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                        YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07            [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                        YES
## TCGA.BJ.A0Z2.01A.11R.A10U.07                         NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                        YES
##                              lymph_nodes_examined
##                                          <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                   41
## TCGA.BJ.A0YZ.01A.11R.A10U.07      [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                    5
## TCGA.BJ.A0Z2.01A.11R.A10U.07      [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                    3
##                              lymph_nodes_examined_he_count
##                                                   <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                             1
## TCGA.BJ.A0YZ.01A.11R.A10U.07               [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                             0
## TCGA.BJ.A0Z2.01A.11R.A10U.07               [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                             0
##                              weiss_score_overall mitoses_per_50_hpf
##                                         <factor>           <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                  NA                 NA
## TCGA.BJ.A0YZ.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z0.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z2.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                  NA                 NA
##                              ajcc_pathologic_tumor_stage
##                                                 <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                   Stage IVA
## TCGA.BJ.A0YZ.01A.11R.A10U.07                    Stage II
## TCGA.BJ.A0Z0.01A.11R.A10U.07                    Stage II
## TCGA.BJ.A0Z2.01A.11R.A10U.07                   Stage IVC
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                     Stage I

We observed that there are different ways to indicate that the data is not available (NA, [Not Available]…). Consequently, we decide to study which variables have more information. To do so, we create two functions, one (na.replace) to change those values into the standarized NA nomenclature, and filter.info to quantify how many of each variable is not NA.

na.replace <- function(x){
    # Remove the unwanted factors
    lev <- levels(x)
    lev <- lev[!(lev %in% "[Not Available]")]
    lev <- lev[!(lev %in% "[Unknown]")]
    lev <- lev[!(lev %in% "[Not Applicable]")]
    lev <- lev[!(lev %in% "[Not Available]|[Not Available]|[Not Available]")]
    lev <- lev[!(lev %in% "[Not Evaluated]")]
    lev <- lev[!(lev %in% "None")]

    x <- factor(x, levels=lev)
}

filter.info <- function(x){
    # Function to filter the NA, [Not Available], [Not Applicable] and [Unknown]
    # Return the proportion of information on x
    nas <- sum(is.na(x))
    return(1 - (nas/length(x)))
}

We apply these functions to the col.data and we visualize it:

meta.info <- as.data.frame(lapply(col.data, na.replace))
variable.info <- sapply(meta.info, filter.info)

# Representing the information by column vs total information
relative.info <- variable.info[order(variable.info)]/sum(variable.info)
plot(relative.info, main="Information brought by column")

$\label{fig:col.data.analysis}Figure 1. Information in each column$

Figure 1. Information in each column

# Plotting the histogram of the information
hist(variable.info, 
     main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")

$\label{col.data.analysis2}Figure 2.A Histogram of information of additional information per sample.$

Figure 2.A Histogram of information of additional information per sample.

# Seeing the most informative to set a threshold,
# frequency are the number of columns with such % of information
hist(variable.info, xlim = c(0.3, 1), ylim = c(0, 35), 
     main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")

$\label{col.data.analysis3}Figure 2.B: A zoom on the most informative variables.$

Figure 2.B: A zoom on the most informative variables.

As we can see, in many variables the main value is just NA or other derivatives meaning “No information available for this variable”. Therefore, in order to keep the most informative variables, we apply a threshold of 60% (of the values containing informative data) to select the relevant ones:

filtered <- meta.info[, variable.info  > 0.6]

dim(filtered)

## [1] 572  44

meta.info.summary2 <- col.data.meta[variable.info  > 0.6,]

Thus we ended with 44 informatives variables.

Associated variables exploration

To explore the variables associated with the samples we first create an object with ggplot:

library("ggplot2")
fplot <- ggplot(as.data.frame(filtered))

## ggplot options for axes and background color
# Options of the labels
l <- theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 1))
# Removing the background
b <-  theme(axis.line = element_line(colour = "black"),
            panel.grid.major.x = element_blank(),
            panel.grid.major.y = element_line(colour = "grey", 
                                              size = 0.5, linetype = 2),
            panel.grid.minor.y = element_line(colour = "grey", 
                                              size = 0.25, linetype = 3),
            panel.border = element_blank(),
            panel.background = element_blank())
y.axis <- ylab("number of cases")

And we plot some relationships for some significant variables:
In the “normal” type samples most of the information is not available. This could be due to the fact that all “normal” samples were obtained from patients with tumors. So in order to have the maximum information of each sample, we infer “normal” samples information from its paired sample data (mainly gender and ethnicity).

fplot + geom_bar(aes(type, fill = gender)) + b + y.axis + 
    ggtitle("Gender distribution across normal and tumour samples")

$\label{gender}Figure 3.A: The gender distribution: threre are more females than males.$

Figure 3.A: The gender distribution: threre are more females than males.

As expected, there are more female samples than male samples. As we mentioned before, thyroid cancer is more prevalent in women than men and we can see that not only for tumor samples but for normals samples too. In addition, there are more tumor cases than normal cases.

fplot + geom_bar(aes(type, fill = tumor_status)) + b + y.axis + 
    ggtitle("Tumor status across normal and tumour samples")

$\label{status}Figure 3.B: Some tumoral samples are not clearly assigned a tumor status.$

Figure 3.B: Some tumoral samples are not clearly assigned a tumor status.

We also have examined the tumor status of the tumor samples. There are more cancer patients without tumor (“tumor free”) than with the tumor (“with tumor”) after the collection of the data. We assume that most of the patients that participate in the study have recovered after it.

fplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b + 
    y.axis + ggtitle("Ethnicity distribution across normal and tumour samples")

$\label{ethnicity}Figure 3.C: The predominant ethnicity are not hispanic or latino$

Figure 3.C: The predominant ethnicity are not hispanic or latino

In this plot we can see clearly the lack of data for some variables regarding normal samples. In other words, there is no record for normal samples neither females nor males about the hispanic origin of the individuals.

labels <- c("AM. INDIAN OR ALASKA NAT.", "ASIAN", "BLACK OR AFRICAN AM.", 
            "WHITE", "NA")
fplot + facet_wrap(~ ethnicity) + geom_bar(aes(race, fill = gender)) + b +
    l + y.axis + scale_x_discrete(breaks=waiver(), labels = labels) +
    ggtitle("Gender distribution across different ethnicity and race.")

$\label{ethnicity2}Figure 3.D: Most samples are from white ancestry.$

Figure 3.D: Most samples are from white ancestry.

If we have a look at the ethnicities, gender and hispanic origin, the data shows more cases for “white” individuals “not hispanic or latino” than “white” individuals “hispanic or latino”.

fplot + geom_bar(aes(x=gender, fill = race)) + b + y.axis + 
    ggtitle("Ethnicity distribution in male and female")

$\label{race}Figure 3.E: Both in males and females the most abundant race is white.$

Figure 3.E: Both in males and females the most abundant race is white.

If we ignore the hispanic origin of the individuals and just focus on the main ethnicities of the study we can observe that for both female and male samples there is a majority of “white” individuals. However, the fact that “white” subjects are hispanic or not can provide some variability so that is something we will take into account later.

fplot + facet_wrap(~ type ) + geom_bar(aes(tumor_focality, fill = gender)) + b +
    l + y.axis + xlab("tumor focality") + 
    ggtitle("Gender distribution in different tumour focality")

$\label{focality}Figure 3.F: There are some tumors which we don't know the focality.$

Figure 3.F: There are some tumors which we don’t know the focality.

There is no data about tumor focality in the normal samples, but this could be due to the fact that are samples obtained probably during the tumor removal or for a biopsy, from a healty part of the tyroid gland. For this reason there is no reason for the normal samples to have this feature.

labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular", 
            "Papillary - Tall Cell", "NA")
hd.xaxis <-  scale_x_discrete(name = "histologic diagnosis", breaks=waiver(), 
                labels=labels)
fplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + hd.xaxis + 
    y.axis + 
    ggtitle("Gender distribution across different types of hystologic diagnosis")

$\label{diagnosis}Figure 3.G The most common diagnosticated tumor is Papillary thyroid Carcinoma.$

Figure 3.G The most common diagnosticated tumor is Papillary thyroid Carcinoma.

Here, we can see that the tumor samples came mainly from classical Papillary Thyroid Carcinoma. We expect differential expression between the different thyroid cancer phenotypes so this might be a source of variability.

labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular", 
            "Papillary - Tall Cell", "NA")
ggplot(filtered[!is.na(filtered$age_at_diagnosis), ]) + 
    geom_bar(aes(age_at_diagnosis, na.rm = TRUE, fill = histologic_diagnosis)) +
    b + l + labs(x="age at diagnosis") + y.axis + 
    scale_fill_discrete(name = "histologic diagnosis", breaks = waiver(), 
    labels = labels) + 
    ggtitle("Histologic diagnosis distribution across different ages")

$\label{diagnosis2}Figure 3.H: There is a representative sample of diagnosis over the age of patients.$

Figure 3.H: There is a representative sample of diagnosis over the age of patients.

Even thought, thyroid cancer is commontly diagnosed at mature ages (~50) this plot shows that the classical papillary thyroid cancer has similar representation in all ages for this dataset.

As a summary from the previous plots, the most abundant population in the samples is the “white” and “not hispanic or latino”, so we decide to perform the analysis on this group of individuals, whose tumoral sample correspond to the most common cancer in thyroids: Papillary Thyroid Carcinoma(PTC).

Subseting

With this approach we want to avoid to include in the analysis, samples that introduce variability (mainly differential expression due to enviromental or epigenetic factors) as much as possible. To have more insight on the disease we want to have just paired samples, that is, samples from the same participant. So we filter the data accordingly:

# Paired data
paired <- intersect(thca[, thca$type == 'tumor']$bcr_patient_barcode, 
                    thca[, thca$type == 'normal']$bcr_patient_barcode)
paired_mask <- thca$bcr_patient_barcode %in% paired

filtered_filt <- filtered[paired_mask, ]

# Imposing the selected conditions
tumor_names <- row.names(subset(filtered_filt, 
    type == "tumor" &
    race == "WHITE" &
    histologic_diagnosis == "Thyroid Papillary Carcinoma - Classical/usual" &
    ethnicity == "NOT HISPANIC OR LATINO" ))

# Extracting the participants identifiers
participants <- unlist(lapply(tumor_names, substr, 9, 12))
participants <- participants[participants != ""]

check <- function(x, checklist){
    # Returning name of the sample just if the name is the same as in checklist
    a <- substr(x, 9, 12)
    if(a %in% checklist){
        return(x)
    }
}

# Extracting sample names of tumor and normal data.
sample_names <- unlist(lapply(row.names(filtered_filt), check, 
                              checklist = participants))
thca.filtered <- thca[, sample_names]

We are left with 66 samples. We explore again the variables for this subset. In this case, we show just some of the relationship plots we have seen previously:

filtered_filt <- filtered_filt[row.names(filtered_filt) %in% sample_names, ]
mplot <- ggplot(filtered_filt) + y.axis
mplot + geom_bar(aes(type, fill = gender)) + b +
  ggtitle("Ethnicity and type of sample after filtering")

$\label{pairedg}Figure 4.A: The samples are consistent with the paired data.$

Figure 4.A: The samples are consistent with the paired data.

This plot just checks whether we have the same number of cases in normal sample and tumor sample so that we have done correctly the filtering.

mplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b +
  ggtitle("Hispanic origin, gender and type of sample after filtering")

$\label{pairede}Figure 4.B: The ethnicity is also consistent with the filter applied$

Figure 4.B: The ethnicity is also consistent with the filter applied

Hispanic origin data is only available for tumor samples but as we have mentioned before we can extrapolate ethinicity and origin to the normal samples because they are paired (see below).

mplot + geom_bar(aes(x=gender, fill = ethnicity)) + b + l +
  ggtitle("Gender samples and hispanic origin after filtering")

$\label{pairedge}Figure 4.C: The normal samples are perfectly correlated with the number of tumor samples, despite normal samples have 'NA' as ethnicity$

Figure 4.C: The normal samples are perfectly correlated with the number of tumor samples, despite normal samples have ‘NA’ as ethnicity

Same proportion of being or not being “hispanic or latino” in females and males. Subset filtering has not modified the fact that we have more female samples than males.

mplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + 
    scale_x_discrete(name = "histologic diagnosis", breaks=waiver(), 
    labels = c("Papillary - Classical", "NA")) + 
    ggtitle("Gender distribution across different types of hystologic diagnosis
            after filtering")

$\label{paireda}Figure 4.D: This graph shows that the paired data filtering is consistent$

Figure 4.D: This graph shows that the paired data filtering is consistent

Papillary Classical Thyroid cancer in the same proportion as “NA” because we have the same number of normal cases and classical thyroid tumor cases.

ggplot(filtered_filt[!is.na(filtered_filt$age_at_diagnosis), ]) +
    geom_bar(aes(age_at_diagnosis, fill = histologic_diagnosis)) + b + l + y.axis +
    xlab("age at diagnosis") + scale_fill_discrete(name = "histologic diagnosis",
    breaks=waiver(), labels = c("Papillary - Classical", "NA")) + 
    ggtitle("Histologic diagnosis distribution across different ages after filtering")

$\label{fig:subset.explore5}Figure 4.E: We have a epresentative sample of all ages.$

Figure 4.E: We have a epresentative sample of all ages.

Number of cases across ages is more or less similar after the filtering exept for some ages that have more cases around 30, 34, 42 and 47.

Transforming counts to rpkm

To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a `DGEList’ object.

library(edgeR)
# Normalization just with the selected samples
dge.subset <- DGEList(counts = assays(thca.filtered)$counts, 
                      genes = as.data.frame(mcols(thca.filtered)))
dge <- DGEList(counts = assays(thca)$counts, genes = as.data.frame(mcols(thca)))

Now calculate $\log_2$ CPM values of expression and put them as an additional assay element to ease their manipulation.

assays(thca.filtered)$logCPM <- cpm(dge.subset, log=TRUE, prior.count=3)
assays(thca.filtered)$logCPM[1:5, 1:5]

##    TCGA.BJ.A28R.01A.11R.A16R.07 TCGA.BJ.A28W.01A.11R.A32Y.07
## 1                      3.014910                     3.895927
## 2                      8.681899                     9.333470
## 9                     -4.164318                    -4.164318
## 10                    -4.164318                    -4.164318
## 12                     4.423408                     4.912253
##    TCGA.BJ.A2N7.01A.11R.A18C.07 TCGA.BJ.A2N8.01A.11R.A18C.07
## 1                      2.074429                     3.634778
## 2                      9.313603                     8.873442
## 9                     -4.164318                    -4.164318
## 10                    -4.164318                    -4.164318
## 12                     4.523368                     4.366761
##    TCGA.BJ.A2N9.01A.11R.A18C.07
## 1                      3.436989
## 2                      8.467420
## 9                     -4.164318
## 10                    -4.164318
## 12                     4.584505

Quality assessment and normalization

Library sizes

Let’s examine the library sizes in terms of total number of sequence read counts per sample. Figure 5 below shows library sizes per sample in increasing order.

ord <- order(dge$sample$lib.size)
barplot(dge$sample$lib.size[ord]/1e6, las=1, ylab="Millions of reads",
        xlab="Samples", col=c("blue", "red")[(thca$type[ord] == "tumor") + 1],
        border = NA)
legend("topleft", c("tumor", "normal"), fill=c("red", "blue"), inset=0.01)

$\label{libsize1}Figure 5.A: Library sizes in increasing order of the whole dataset$

Figure 5.A: Library sizes in increasing order of the whole dataset

Plotting for all samples makes it impossible to see accurately the blue and red colors corresponding to tumor and normal respectively.

ord.subset <- order(dge.subset$sample$lib.size)
barplot(dge.subset$sample$lib.size[ord.subset]/1e6, las=1, ylab="Millions of reads",
        xlab="Samples", col=c("blue", "red")[(thca.filtered$type[ord.subset] == "tumor") + 1],
        border = NA)
legend("topleft", c("tumor", "normal"), fill=c("red", "blue"), inset=0.01)

$\label{libsize2}Figure 5.B: Library sizes in increasing order of the paired data.$

Figure 5.B: Library sizes in increasing order of the paired data.

After the filtering the subset plot allows one to see that there is similar heterogeneity in tumor and normal library sizes. In addition, this figure reveals substantial differences in sequencing depth between samples and we may consider discarding those samples whose depth is substantially lower than the rest. To identify which are these samples we may simply look at the actual numbers including portion of the sample identifier that distinguishes them.

sampledepth <- round(dge.subset$sample$lib.size / 1e6, digits = 1)
names(sampledepth) <- substr(sample_names, 6, 12)

qqnorm(sampledepth)
qqline(sampledepth)

$\label{libsizen}Figure 6: Distribution of library size compared with the normal distribution.$

Figure 6: Distribution of library size compared with the normal distribution.

Looking into the qqplot we decide to filter out those below 40 but keeping only those that are paired:

sample_names <- sample_names[sampledepth > 40]

keep.paired <- function(pattern, x){
    # Return the names if it is paired
    pos <- grep(pattern, x, fixed = TRUE)
    if (length(pos) > 1){
        return(x[pos])
    }
}

sample_names <- unique(unlist(lapply(substr(sample_names, 9, 12), 
                                     keep.paired, sample_names)))

thca.filtered <- thca.filtered[, colnames(thca.filtered) %in% sample_names]
dge.subset <- DGEList(counts = assays(thca.filtered)$counts, 
                      genes = mcols(thca.filtered))

## Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
## Arguments in '...' ignored

Thus we have 59 samples which are above the threshold and 54 of them are paired, which are the ones which will be used from here on.

Distribution of expression levels among samples

Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure 7.

library(geneplotter)
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(assays(thca.filtered[, 
            thca.filtered$type ==  "tumor"])$logCPM)),
            xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", 
            las = 1)
multidensity(as.list(as.data.frame(assays(thca.filtered[,
            thca.filtered$type ==  "normal"])$logCPM)),
            xlab = "log 2 CPM", legend = NULL, main = "Normal samples",
            las = 1)

$\label{distRawExp}Figure 7: Non-parametric density distribution of expression profiles per sample.$

Figure 7: Non-parametric density distribution of expression profiles per sample.

These two spikes are due to the lowly expressed genes and the highly expressed genes, but in the lowly expressed genes the effect of GC content and the read length may have higher impact.

GC and length effect

To see the impact of GC and length on the number of read we plot them:

gc_len <- rowRanges(thca.filtered)
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txgc)

## [1] -0.01173666

cor(rowMeans(assays(thca.filtered)$counts), gc_len$txlen)

## [1] -0.0008702022

par(mfrow=c(1,2))
plot(log10(rowMeans(assays(thca.filtered)$counts)), gc_len$txgc, 
     main = "GC content over expression", xlab = "Expression (log10)", 
     ylab = "GC")
plot(log10(rowMeans(assays(thca.filtered)$counts)), log10(gc_len$txlen), 
     main = "Length over the expression", xlab = "Expression (log10)", 
     ylab = "Length (log10)")

$\label{gc_length}Figure 8: Relationship between GC content and length with the mean expression of genes.$

Figure 8: Relationship between GC content and length with the mean expression of genes.

There is a low correlation on the counts and the GC content or the length but as there are clearly two groups based on the length, so we will normalize also using the length with the package cqn:

library("cqn")
cqn.subset <- cqn(assays(thca.filtered)$counts, x = gc_len$txgc, lengths = gc_len$txlen)
par(mfrow = c(1, 2))
boxplot(cqn.subset$func1, names = FALSE, xlab = "Samples", 
        main = "Estimated effect of GC")
boxplot(cqn.subset$func2, names = FALSE, xlab = "Samples", 
        main = "Estimated effect of length")

$\label{cqn}Figure 9: The effect of GC and length on each sample.$

Figure 9: The effect of GC and length on each sample.

As we thought the effect of the length is higher than the effect of GC content

We can see if the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately:

cqn.logCPM <- cqn.subset$y + cqn.subset$offset
assays(thca.filtered)$cqn.logCPM <- cqn.logCPM
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(
    cqn.logCPM[,thca.filtered$type ==  "tumor"])),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    cqn.logCPM[, thca.filtered$type ==  "normal"])),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

$\label{distExp}Figure 10: Non-parametric density distribution of expression profiles per sample.$

Figure 10: Non-parametric density distribution of expression profiles per sample.

These “after-normalization” plots are quite better than the previous ones used without GC and length normalization on the second spike. With this normalization we do not appreciate substantial differences between the samples in the distribution of expression values. However, below 0 there is a lot of variability.

Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 9 shows the distribution of those values across genes.

par(mfrow = c(1, 2))
avgexp <- rowMeans(assays(thca.filtered)$logCPM)
hist(avgexp, xlab = "log2 CPM", main = "Using DGEList normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)

# Using the corrected logCPM
avgexp.cqn <- rowMeans(assays(thca.filtered)$cqn.logCPM)
hist(avgexp.cqn, xlab = "log2 CPM", main = "Using cqn normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)

$\label{exprdist}Figure 11: Distribution of average expression level per gene using DGE normalization and cqn normalization$

Figure 11: Distribution of average expression level per gene using DGE normalization and cqn normalization

Filtering of lowly-expressed genes

In the light of previous plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.

mask <- avgexp.cqn > 1
thca.filtered <- thca.filtered[mask, ]
dge.subset <- dge.subset[mask, ]

We have nrow(thca.filtered) genes remaining on our dataset from 54 samples, half of them are healthy and the other half are tumoral.

Normalization

Since we will use the offsets computed by cqn, there is no need to normalize using the normalization tools from edgeR, such as calcNormFactors according to cqn vignette.

dge.subset$offset <- cqn.subset$glm.offset

dge.norm <- calcNormFactors(dge.subset)
assays(thca.filtered)$n.logCPM <- cpm(dge.norm, log = TRUE, prior.count = 3)

par(mfrow = c(3, 2))

# Not normalized but filtered
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

# Normalize with calcNormFactors after filtering out those
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$n.logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$n.logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

# Normalized with cqn and filtered the low values:
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$cqn.logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$cqn.logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

$\label{norm}Figure 12: Tumor and Normal samples after filtering, without normalization, with calcNormFactors normalization and with cqn normalization$

Figure 12: Tumor and Normal samples after filtering, without normalization, with calcNormFactors normalization and with cqn normalization

The normalization after filtering lowly-expressed genes with ‘cqn’ package clusters samples into one thin gaussian curve whereas normalization with calcNormFactors makes it thiker.

MA-plots

Tumoral samples

We examine now the MA-plots of the normalized expression profiles. We look first to the tumor samples.

par(mfrow = c(9, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type ==  "tumor"]
dgetmp <- dge.subset[, thca.filtered$type ==  "tumor"]
for (i in 1:ncol(setmp)) {
  A <- rowMeans(assays(setmp)$cqn.logCPM)
  M <- assays(setmp)$cqn.logCPM[, i] - A
  samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
  smoothScatter(A, M, main = samplename, las = 1)
  abline(h = 0, col = "blue", lwd = 2)
  lo <- lowess(M ~ A)
  lines(lo$x, lo$y, col = "red", lwd = 2)
}

$\label{maPlotsTumor}Figure 13.A: MA-plots of the tumor samples.$

Figure 13.A: MA-plots of the tumor samples.

We do not observe samples with major expression-level dependent biases.

Healthy samples

Let’s look now to the normal samples:

par(mfrow = c(9, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type ==   "normal"]
dgetmp <- dge.subset[, thca.filtered$type ==  "normal"]
for (i in 1:ncol(setmp)) {
  A <- rowMeans(assays(setmp)$cqn.logCPM)
  M <- assays(setmp)$cqn.logCPM[, i] - A
  samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
  smoothScatter(A, M, main = samplename, las = 1)
  abline(h = 0, col = "blue", lwd = 2)
  lo <- lowess(M ~ A)
  lines(lo$x, lo$y, col = "red", lwd = 2)
}

$\label{maPlotsNormal}Figure 13.B: MA-plots of the normal samples.$

Figure 13.B: MA-plots of the normal samples.

We do not observe either important expression-level dependent biases among the normal samples except on the sample TCGA-BJ-A3PR. Which we proceed to remove it from the cqn.logCP:

mask <- -grep("TCGA.BJ.A3PR", colnames(thca.filtered))
thca.filtered <- thca.filtered[,mask]
sample_names <- sample_names[mask]
dge.subset <- dge.subset[,mask]

Thus the remaining samples are 52 from 26 patients. Recall that we still have 11801 genes from each sample.

Batch identification

We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see the wiki), following the strategy described here we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(sample_names, 6, 7)
table(tss)

## tss
## BJ EL ET H2 
## 13 33  2  4

samplevial <- substr(sample_names, 14, 16)
table(samplevial)

## samplevial
## 01A 11A 11C 
##  26  25   1

portion <- substr(sample_names, 18, 19)
table(portion)

## portion
## 11 12 21 22 31 
## 42  6  2  1  1

table(substr(sample_names, 20, 20))

## 
##  R 
## 52

plate <- substr(sample_names, 22, 25)
table(plate)

## plate
## A16R A180 A18C A19O A20F A21D A220 A22L A23N 
##    2    2    5    2   10    6    6    7   12

center <- substr(sample_names, 27, 28)
table(center)

## center
## 07 
## 52

We can observe that the samples where collected at 4 tissues source sites but the majority of them at University of Pittsburg (BJ) and MD Anderson(EL). However,the manager center of the data is the same (University of North Carolina) and the same metabolite (RNA). In addition, they come from 7 different plates more or less distributed. However, most samples came from 2 out of 5 different portion and 2 out of 3 vials.

We are going to check if the previous variables of the barcode are surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with the variables:

table(data.frame(TYPE = thca.filtered$type, TSS = tss))

##         TSS
## TYPE     BJ EL ET H2
##   normal  0 20  2  4
##   tumor  13 13  0  0

# We will check for the other variables
table(data.frame(TYPE = thca.filtered$type, PLATE = plate))

##         PLATE
## TYPE     A16R A180 A18C A19O A20F A21D A220 A22L A23N
##   normal    0    2    0    0    3    0    2    7   12
##   tumor     2    0    5    2    7    6    4    0    0

table(data.frame(TYPE = thca.filtered$type, Portion = portion))

##         Portion
## TYPE     11 12 21 22 31
##   normal 18  4  2  1  1
##   tumor  24  2  0  0  0

table(data.frame(TYPE = thca.filtered$type, Vial = samplevial))

##         Vial
## TYPE     01A 11A 11C
##   normal  12  13   1
##   tumor   14  12   0

From the tables we can already see that all of them can be candidates as indicators for batch effect as there is no homogeneity of the number of tumor and normal samples across any of the variables.

To double check if they have some effect on the batch we make the hierarchical clustering of them. First, we set some functions to plot the hierarhcical clustering:

help.dendogram <- function(x, batch, labels) {
    # Helper function to plot dendograms
    if (is.leaf(x)) {
        ## color by batch
        attr(x, "nodePar") <- list(lab.col = as.vector(batch[attr(x, "label")])) 
        ## label by outcome
        attr(x, "label") <- as.vector(labels[attr(x, "label")])
    }
    x
}

plot.batch <- function(dendrogram, batching, se, out){
    # Function to see the batch
    # Given a sampleDendogram use batching to colour the leafs by sample.names of se
    sample.names = colnames(se)
    batch <- as.integer(factor(batching))
    names(batch) <- sample.names
    i.dendrogram <- dendrapply(dendrogram, help.dendogram, batch, out)
    plot(i.dendrogram, main = "Hierarchical clustering")
    legend("topright", paste("Batch", levels(factor(batching))), fill = sort(unique(batch)))
}

With this functions to plot the clustering with different labelings is easier:

d <- as.dist(1-cor(assays(thca.filtered)$cqn.logCPM, method = "spearman"))
sampleClustering <- hclust(d)
sampleDendrogram <- as.dendrogram(sampleClustering, hang = 0.1)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
                        as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)
par(mfrow = c(1,1))
plot.batch(sampleDendrogram, portion, thca.filtered, outcome)

$\label{clustering.n1}Figure 14.A: Hierarchical clustering of the samples by portion$

Figure 14.A: Hierarchical clustering of the samples by portion

plot.batch(sampleDendrogram, samplevial, thca.filtered, outcome)

$\label{clustering.n2}Figure 14.B: Hierarchical clustering of the samples by sample vial$

Figure 14.B: Hierarchical clustering of the samples by sample vial

plot.batch(sampleDendrogram, tss, thca.filtered, outcome)

$\label{clusering.n3}Figure 14.C: Hierarchical clustering of the samples by TSS$

Figure 14.C: Hierarchical clustering of the samples by TSS

plot.batch(sampleDendrogram, as.character(thca.filtered$type), thca.filtered, outcome)

$\label{clustering.n5}Figure 14.D: Hierarchical clustering of the samples by type$

Figure 14.D: Hierarchical clustering of the samples by type

plot.batch(sampleDendrogram, plate , thca.filtered, outcome)

$\label{clustering.n5}Figure 14.E: Hierarchical clustering of the samples by prospective collection$

Figure 14.E: Hierarchical clustering of the samples by prospective collection

We can see that the sample from A3PR-normal clusters with the tumoral ones, and the A3H2 tumoral clusters with the healthy patients on Figure 14.D. To visualize the relationship we use the multiscaling dimensional plot colored by portion, tss, type, and gender:

batch_effects <- list(portion, tss, as.character(thca.filtered$type),
                      as.character(thca.filtered$gender))
for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(dge.subset, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}

$\label{mds}Figure 15: Multidimensional scaling plot of the samples.$

Figure 15: Multidimensional scaling plot of the samples.

$\label{mds}Figure 15: Multidimensional scaling plot of the samples.$

Figure 15: Multidimensional scaling plot of the samples.

$\label{mds}Figure 15: Multidimensional scaling plot of the samples.$

Figure 15: Multidimensional scaling plot of the samples.

$\label{mds}Figure 15: Multidimensional scaling plot of the samples.$

Figure 15: Multidimensional scaling plot of the samples.

We can observe that there are two groups of tumoral samples, and one of them is close to healthy samples.

Models of the disease

Our model of the disease will affect the analysis from here onwards. Therefore we need to select our model and null hypothesis. We define them as:

# Create the design matrix involving tumor and normal samples and males and females
treat <- factor(paste(thca.filtered$type, thca.filtered$gender, sep = "."))
paired <- as.numeric(as.factor(thca.filtered$bcr_patient_barcode))
design <- model.matrix(~treat + paired, data=colData(thca.filtered))
colnames(design) <- c(levels(treat), "paired")
# design <- cbind("(Intercept)" = 1, design)
head(design)

##                              normal.FEMALE normal.MALE tumor.FEMALE
## TCGA.BJ.A28R.01A.11R.A16R.07             1           0            1
## TCGA.BJ.A2N7.01A.11R.A18C.07             1           0            1
## TCGA.BJ.A2N8.01A.11R.A18C.07             1           0            1
## TCGA.BJ.A2N9.01A.11R.A18C.07             1           0            1
## TCGA.BJ.A2NA.01A.12R.A19O.07             1           0            0
## TCGA.BJ.A3PU.01A.11R.A220.07             1           0            0
##                              tumor.MALE paired
## TCGA.BJ.A28R.01A.11R.A16R.07          0     21
## TCGA.BJ.A2N7.01A.11R.A18C.07          0     30
## TCGA.BJ.A2N8.01A.11R.A18C.07          0     31
## TCGA.BJ.A2N9.01A.11R.A18C.07          0     32
## TCGA.BJ.A2NA.01A.12R.A19O.07          1     33
## TCGA.BJ.A3PU.01A.11R.A220.07          1     39

# And the null model
mod0 <- model.matrix(~1 + gender + type + paired, data = colData(thca.filtered))
head(mod0)

##                              (Intercept) genderMALE typetumor paired
## TCGA.BJ.A28R.01A.11R.A16R.07           1          0         1     21
## TCGA.BJ.A2N7.01A.11R.A18C.07           1          0         1     30
## TCGA.BJ.A2N8.01A.11R.A18C.07           1          0         1     31
## TCGA.BJ.A2N9.01A.11R.A18C.07           1          0         1     32
## TCGA.BJ.A2NA.01A.12R.A19O.07           1          1         1     33
## TCGA.BJ.A3PU.01A.11R.A220.07           1          1         1     39

Our model includes the interaction of gender and type, while our null model just includes the gender and type (in both cases with the paired factor to take into account the fact that the samples are paired). This way comparing this models we will see if there is interaction between gender and tumor or not.

Removing batch effect

ComBat

After looking for batch effect we couldn’t find a clear cause of the batch effect. We suspect that sample vial has something to do. However, we try to see if we can improve the clustering by using method ComBat for sample vial.

library("sva")

combatexp <- ComBat(assays(thca.filtered)$cqn.logCPM, tss, mod = design)

## Found 4 batches
## Adjusting for 4 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

d <- as.dist(1 - cor(combatexp, method = "spearman"))
cluster.combat <- hclust(d)
cluster.combat <- as.dendrogram(cluster.combat, hang = 0.1)
names(batch) <- colnames(thca.filtered)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
                            as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)

plot.batch(cluster.combat, samplevial, thca.filtered, outcome)

$\label{clustering.combat}Figure 16.A: Hierarchical clustering of the samples with ComBat correction colored by samplevial.$

Figure 16.A: Hierarchical clustering of the samples with ComBat correction colored by samplevial.

plot.batch(cluster.combat, tss, thca.filtered, outcome)

$\label{clustering.combat2}Figure 16.B: Hierarchical clustering of the samples with ComBat correction coloured by tissues sourec site$

Figure 16.B: Hierarchical clustering of the samples with ComBat correction coloured by tissues sourec site

plot.batch(cluster.combat, thca.filtered$gender, thca.filtered, outcome)

$\label{clustering.combat3}Figure 16.C: Hierarchical clustering of the samples with ComBat correction coloured by gender.$

Figure 16.C: Hierarchical clustering of the samples with ComBat correction coloured by gender.

plot.batch(cluster.combat, thca.filtered$type, thca.filtered, outcome)

$\label{clustering.combat4}Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by type.$

Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by type.

plot.batch(cluster.combat, portion, thca.filtered, outcome)

$\label{clustering.combat5}Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by portion$

Figure 16.D: Hierarchical clustering of the samples with ComBat coloured by portion

We can observe a slightly improvement, but still some tumoral samples cluster together with the normal ones.

batch_effects <- list(samplevial, tss, as.character(thca.filtered$gender),
                      as.character(thca.filtered$type), portion)
for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(combatexp, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}

$\label{mds}Figure 17: Multidimensional scaling plot of the samples.$

Figure 17: Multidimensional scaling plot of the samples.

$\label{mds}Figure 17: Multidimensional scaling plot of the samples.$

Figure 17: Multidimensional scaling plot of the samples.

$\label{mds}Figure 17: Multidimensional scaling plot of the samples.$

Figure 17: Multidimensional scaling plot of the samples.

$\label{mds}Figure 17: Multidimensional scaling plot of the samples.$

Figure 17: Multidimensional scaling plot of the samples.

$\label{mds}Figure 17: Multidimensional scaling plot of the samples.$

Figure 17: Multidimensional scaling plot of the samples.

QR Decomposition

To improve the batch effect removal we look if QR decomposition yields better results. In this case, we look for sample vial and TSS batch effects removal. To do so, we have to built the full model which would include gender and type as fixed effects.

library("limma")

qrexp <- removeBatchEffect(assays(thca.filtered)$cqn.logCPM, tss, design = design)
d <- as.dist(1 - cor(qrexp, method = "spearman"))
cluster.qr <- hclust(d)
cluster.qr <- as.dendrogram(cluster.qr, hang = 0.1)

plot.batch(cluster.qr, samplevial, thca.filtered, outcome)

$\label{clustering.QRdecomposition.1}Figure 18.A: Hierarchical clustering of the samples with QR descompostion for sample vial$

Figure 18.A: Hierarchical clustering of the samples with QR descompostion for sample vial

plot.batch(cluster.qr, tss, thca.filtered, outcome)

$\label{clustering.QRdecomposition.2}Figure 18.B: Hierarchical clustering of the samples with QR descompostion for TSS$

Figure 18.B: Hierarchical clustering of the samples with QR descompostion for TSS

plot.batch(cluster.qr, thca.filtered$gender, thca.filtered, outcome)

$\label{clustering.QRdecomposition.3}Figure 18.C: Hierarchical clustering of the samples with QR descompostion for gender$

Figure 18.C: Hierarchical clustering of the samples with QR descompostion for gender

plot.batch(cluster.qr, thca.filtered$type, thca.filtered, outcome)

$\label{clustering.QRdecomposition.4}Figure 18.D: Hierarchical clustering of the samples with QR descompostion for type$

Figure 18.D: Hierarchical clustering of the samples with QR descompostion for type

plot.batch(cluster.qr, portion, thca.filtered, outcome)

$\label{clustering.QRdecomposition.5}Figure 18.E: Hierarchical clustering of the samples with QR descompostion for portion$

Figure 18.E: Hierarchical clustering of the samples with QR descompostion for portion

With QR decomposition we can observe that some tumoral samples cluster with other normal samples on a lower branch.

for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(qrexp, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}