Analysis of the stats of the Annotation packages in the Bioconductor project.
Here we analyse the Annotation packages of Bioconductor. See the home of the analysis here.
First we read the latest data from the Bioconductor project. There are two files: one with the download stats from 2009 until today, and another with the download stats of the software packages. We will only use the first one:
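library("data.table") # stats and yearly are data.tables
library("ggplot2") # Used for all the plots below
load("stats.RData")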
stats <- stats[Category == "Annotation", ]
yearly <- yearly[Category == "Annotation", ]
stats
## Package Date Month Year
## 1: BSgenome.Dmelanogaster.UCSC.dm3 2014-08-01 02:00:00 08 2014
## 2: BSgenome.Dmelanogaster.UCSC.dm3 2017-01-01 01:00:00 01 2017
## 3: BSgenome.Dmelanogaster.UCSC.dm3 2017-02-01 01:00:00 02 2017
## 4: BSgenome.Dmelanogaster.UCSC.dm3 2017-03-01 01:00:00 03 2017
## 5: BSgenome.Dmelanogaster.UCSC.dm3 2017-04-01 02:00:00 04 2017
## 6: BSgenome.Dmelanogaster.UCSC.dm3 2017-05-01 02:00:00 05 2017
## 7: BSgenome.Dmelanogaster.UCSC.dm3 2017-06-01 02:00:00 06 2017
## 8: BSgenome.Dmelanogaster.UCSC.dm3 2017-07-01 02:00:00 07 2017
## 9: BSgenome.Dmelanogaster.UCSC.dm3 2016-01-01 01:00:00 01 2016
## 10: BSgenome.Dmelanogaster.UCSC.dm3 2016-02-01 01:00:00 02 2016
## ---
## 105857: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017-03-01 01:00:00 03 2017
## 105858: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017-04-01 02:00:00 04 2017
## 105859: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017-05-01 02:00:00 05 2017
## 105860: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017-06-01 02:00:00 06 2017
## 105861: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017-07-01 02:00:00 07 2017
## 105862: SNPlocs.Hsapiens.dbSNP150.GRCh38 2017-07-01 02:00:00 07 2017
## 105863: TxDb.Ggallus.UCSC.galGal5.refGene 2017-04-01 02:00:00 04 2017
## 105864: TxDb.Ggallus.UCSC.galGal5.refGene 2017-05-01 02:00:00 05 2017
## 105865: TxDb.Ggallus.UCSC.galGal5.refGene 2017-06-01 02:00:00 06 2017
## 105866: TxDb.Ggallus.UCSC.galGal5.refGene 2017-07-01 02:00:00 07 2017
## Category Nb_of_distinct_IPs Nb_of_downloads
## 1: Annotation 172 235
## 2: Annotation 74 142
## 3: Annotation 163 285
## 4: Annotation 135 225
## 5: Annotation 125 207
## 6: Annotation 175 259
## 7: Annotation 217 438
## 8: Annotation 65 85
## 9: Annotation 125 186
## 10: Annotation 221 297
## ---
## 105857: Annotation 3 4
## 105858: Annotation 3 3
## 105859: Annotation 14 18
## 105860: Annotation 18 21
## 105861: Annotation 7 7
## 105862: Annotation 1 1
## 105863: Annotation 3 4
## 105864: Annotation 15 18
## 105865: Annotation 12 12
## 105866: Annotation 4 5
There have been 2572 Annotation packages in Bioconductor. Some were added recently and some earlier.
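A count like the one above can be obtained by counting distinct package names. The following is a minimal sketch on a made-up table that only borrows the column names of `stats`; the values and the name `toy` are assumptions for illustration.

```r
library(data.table)

# Toy table with the same column names as `stats` (values are made up)
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC"),
  Date = as.Date(c("2017-01-01", "2017-02-01", "2017-02-01", "2017-02-01")),
  Nb_of_downloads = c(10, 12, 3, 7)
)

# Number of distinct packages that appear at least once in the table
n_packages <- toy[, uniqueN(Package)]
n_packages
```

A package that was never downloaded has no row at all, so a count of this kind only covers packages with at least one recorded download.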
First we explore the number of packages being downloaded by month:
theme_bw <- theme_bw(base_size = 16)
scal <- scale_x_datetime(date_breaks = "3 months")
theme <- theme(axis.text.x=element_text(angle=60, hjust=1))
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages downloaded") +
theme +
scal +
xlab("")
The number of packages being downloaded increases almost exponentially with time. This is partially explained by the incorporation of new packages.
ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Downloads") +
scal +
theme +
xlab("")
Even if the number of packages increases exponentially, the number of downloads from 2011 grows linearly with time. This indicates that each package must compete with more and more packages to be downloaded.
pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads),
sem = sd(Nb_of_downloads)/sqrt(.N)),
by = Date], aes(Date, Number)) +
geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem), width = .1, position = pd) +
geom_point() +
geom_line() +
theme_bw +
ggtitle("Downloads") +
ylab("Mean download for a package") +
scal +
theme +
xlab("")
Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now less dispersion between package downloads.
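The error bars in the plot above are the standard error of the mean, that is, the standard deviation divided by the square root of the number of packages in each month. A minimal sketch with made-up numbers (the table `toy` is hypothetical) shows the computation:

```r
library(data.table)

# Three hypothetical packages in each of two months
toy <- data.table(
  Date = as.Date(c("2017-01-01", "2017-01-01", "2017-01-01",
                   "2017-02-01", "2017-02-01", "2017-02-01")),
  Nb_of_downloads = c(2, 4, 6, 10, 10, 10)
)

# Mean downloads per package and its standard error, month by month
summary_by_date <- toy[, .(Number = mean(Nb_of_downloads),
                           sem = sd(Nb_of_downloads) / sqrt(.N)),
                       by = Date]
summary_by_date
```

In this toy case the second month has no dispersion at all, so its standard error is 0.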
This might be due to an increase in the usage of packages, or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.
today <- Sys.Date()
year <- format(today, "%Y") # Current year as "YYYY"
month <- format(today, "%m") # Current month as "MM", assumed to match the format of the Month column
incorporation <- stats[ , .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date, ]
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme +
xlab("")
We can see that at the beginning of 2009 many Annotation packages were added to Bioconductor, and since then there is an occasional rise to about 250 first downloads in a month (which would correspond to new packages being added). This may be due to database updates and new data releases being incorporated.
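The first month in which a package appears is used here as a proxy for its incorporation date. As a minimal sketch on made-up data (the table `toy` and its values are assumptions), the same idea can be written with `min(Date)` per package, which should be equivalent to selecting the row with `which.min(Date)`:

```r
library(data.table)

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  Date = as.Date(c("2016-03-01", "2016-01-01", "2017-05-01",
                   "2017-06-01", "2016-01-01"))
)

# Earliest month with a recorded download, one row per package
first_seen <- toy[, .(Date = min(Date)), by = Package]

# Number of packages first seen in each month
first_counts <- first_seen[, .(Number = .N), by = Date][order(Date)]
first_counts
```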
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme +
xlab("") +
ylim(c(0, 250))
## Warning: Removed 2 rows containing missing values (position_stack).
Close-up view of the new packages not previously downloaded.
## Removed
Using a similar procedure we can approximate the packages deprecated and removed each month. In this case we look for the last date a package was downloaded, excluding the current month:
deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date", "Year", "Month")]
deprecation <- deprecation[!(Month == month & Year == year), ] # Exclude packages last seen in the current month
histDeprecation <- deprecation[, .(Number = .N), by = Date, ]
ggplot(histDeprecation, aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages without downloads") +
scal +
theme +
ylab("Last seen packages") +
xlab("")
Here we can see the packages whose last download was in a given month, assuming that this means they are deprecated. A package may no longer be downloaded yet still be in the Bioconductor repository; this would explain the spike to 500 packages in the last year. Still, in this last year many more packages have been left without downloads.
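The logic of this approximation can be sketched on made-up data: take the last month each package was seen, and keep only the packages whose last month is earlier than the current one. The table `toy` and the fixed `current_month` are assumptions for illustration; in the real analysis the current month comes from the system date.

```r
library(data.table)

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  Date = as.Date(c("2016-07-01", "2017-07-01", "2016-07-01",
                   "2016-09-01", "2017-06-01"))
)

# Assumed month of the latest data snapshot
current_month <- as.Date("2017-07-01")

# Last month with a recorded download, one row per package
last_seen <- toy[, .(Date = max(Date)), by = Package]

# Packages not seen in the current month are candidates for removal
candidates <- last_seen[Date < current_month]
candidates
```

Comparing full dates rather than only the month number avoids discarding a package whose last download happened in the same calendar month of an earlier year.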
In total there are 1 packages downloaded. We further explore how much time passes between the incorporation of a package and its last download.
df <- merge(incorporation, deprecation, by = "Package")
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days"))/365 # Transform to years
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")
abline(v = median(timeBioconductor), col = "green")
Packages tend to stay up to 8 years. Not surprisingly, the number of packages incorporated before 2009 and still in the repository is 0. But how do the packages that have not been removed do in Bioconductor?
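The time a package has spent in Bioconductor is the difference between its last and first recorded month. A minimal sketch on made-up dates (the tables `first_seen` and `last_seen` are hypothetical) fixes the units explicitly, since the units of a date difference are otherwise chosen automatically:

```r
library(data.table)

first_seen <- data.table(
  Package = c("pkgA", "pkgB"),
  Date = as.POSIXct(c("2010-01-01", "2015-01-01"), tz = "UTC")
)
last_seen <- data.table(
  Package = c("pkgA", "pkgB"),
  Date = as.POSIXct(c("2012-01-01", "2015-01-01"), tz = "UTC")
)

# merge() renames the clashing Date columns to Date.x (first) and Date.y (last)
df <- merge(first_seen, last_seen, by = "Package")

# Difference in days, converted to years of 365 days
years <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days")) / 365
years
```

A package seen in a single month gets a time of 0 under this definition.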
We can start by comparing the number of downloads (different from 0) with how many IPs download each package.
pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs),
sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N)),
by = c("Date")], aes(Date, Number)) +
geom_point() +
geom_errorbar(aes(ymin = Number - sem,
ymax = Number + sem),
width = .1, position = pd) +
geom_line() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Mean downloads per IP") +
xlab("") +
theme +
scal +
guides(col = FALSE)
Not surprisingly, most packages have two downloads from the same IP, one for each Bioconductor release (black line). However, there are some packages where a few IPs download the same package many times, which may indicate that these packages are mostly installed in a few locations.
ratio <- stats[, .(Mean = mean(Nb_of_downloads/Nb_of_distinct_IPs),
sd = sd(Nb_of_downloads/Nb_of_distinct_IPs),
max = max(Nb_of_downloads/Nb_of_distinct_IPs),
min = min(Nb_of_downloads/Nb_of_distinct_IPs)),
by = c("Package")]
ratio <- ratio[order(Mean, decreasing = TRUE), ]
ratio$Package <- as.character(ratio$Package)
ratio
## Package Mean sd max
## 1: hgfocusprobe 10.459925 5.199339 25.300000
## 2: BSgenome.Scerevisiae.UCSC.sacCer1 6.169985 2.540019 14.850000
## 3: pd.hg.u133a.2 3.607621 3.682459 15.266667
## 4: mirna10cdf 3.431727 2.104486 8.243243
## 5: hgu95aprobe 3.387250 3.650635 16.274194
## 6: IlluminaHumanMethylation450k.db 3.030156 2.300729 10.768786
## 7: ath1attigr 3.000000 3.391165 9.000000
## 8: atgenomeattigr 2.800000 3.492850 9.000000
## 9: atgenomeattigrcdf 2.800000 3.492850 9.000000
## 10: atgenomeattigrprobe 2.800000 3.492850 9.000000
## ---
## 2563: MeSH.Sau.MW2.eg.db 1.000000 NA 1.000000
## 2564: MeSH.Sau.Mu3.eg.db 1.000000 0.000000 1.000000
## 2565: MeSH.Sau.Mu50.eg.db 1.000000 NA 1.000000
## 2566: MeSH.Sau.N315.eg.db 1.000000 NA 1.000000
## 2567: MeSH.Sau.Newman.eg.db 1.000000 NA 1.000000
## 2568: MeSH.Sau.RF122.eg.db 1.000000 NA 1.000000
## 2569: MeSH.Sau.USA300FPR3757.eg.db 1.000000 0.000000 1.000000
## 2570: MeSH.Sau.VC40.eg.db 1.000000 NA 1.000000
## 2571: o%2520rg.Hs.eg.db 1.000000 NA 1.000000
## 2572: SNPlocs.Hsapiens.dbSNP150.GRCh38 1.000000 NA 1.000000
## min
## 1: 1.179487
## 2: 1.111111
## 3: 1.000000
## 4: 1.000000
## 5: 1.000000
## 6: 1.243697
## 7: 1.000000
## 8: 1.000000
## 9: 1.000000
## 10: 1.000000
## ---
## 2563: 1.000000
## 2564: 1.000000
## 2565: 1.000000
## 2566: 1.000000
## 2567: 1.000000
## 2568: 1.000000
## 2569: 1.000000
## 2570: 1.000000
## 2571: 1.000000
## 2572: 1.000000
We can see that the package with the most downloads from the same IP is hgfocusprobe, followed by BSgenome.Scerevisiae.UCSC.sacCer1, pd.hg.u133a.2 and, in fourth place, mirna10cdf.
Now we explore whether there are seasonal cycles in the downloads, as figure 2 seems to show some cycles.
First we can explore the number of IPs per month downloading each package:
ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Distinct IP downloads") +
scal +
theme +
guides(col = FALSE)
As we can see, there are two groups of packages in 2009: some with a low number of IPs and some with a larger number of IPs. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package") +
ylab("Downloads") +
scal +
theme +
guides(col = FALSE)
Surprisingly, some packages have a big burst of up to 400k downloads, others of just 100k downloads. But let’s focus on the lower end:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package every three months") +
ylab("Downloads") +
scal +
ylim(0, 10000)+
theme +
guides(col = FALSE)
There are many packages close to 0 downloads each month, and most packages have fewer than 10000 downloads per month:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw+
ggtitle("Downloads per package every three months") +
ylab("Downloads") +
scal +
ylim(0, 2500)+
theme +
guides(col = FALSE)
## Warning: Removed 169 rows containing missing values (geom_path).
As we can see, in general the month of the year also influences the number of downloads. So from 2010 onwards the factors influencing the downloads are the year and the month.
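A rough numerical check of the month effect is to average the downloads by calendar month across years. This is only a sketch on made-up numbers; the table `toy` and its values are assumptions, and only the column names `Year`, `Month` and `Nb_of_downloads` come from `stats`.

```r
library(data.table)

# Hypothetical totals for three calendar months in two years
toy <- data.table(
  Year = rep(c("2015", "2016"), each = 3),
  Month = rep(c("01", "07", "12"), times = 2),
  Nb_of_downloads = c(100, 60, 80, 120, 70, 90)
)

# Mean downloads per calendar month, pooling all years
by_month <- toy[, .(Mean = mean(Nb_of_downloads)), by = Month][order(Month)]
by_month
```

If the month had no influence, these means would be expected to be similar once the yearly trend is accounted for.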
Maybe there is a relationship between the downloads and the number of IPs per date:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Ratio") +
scal +
theme +
guides(col = FALSE)
We can see that some packages have occasional rises in downloads per IP. But for small ranges we miss a lot of packages:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Ratio") +
scal +
theme +
guides(col = FALSE) +
ylim(1, 5)
But most of the packages seem to be more or less constant and around 2.
To measure whether the same IP has downloaded the same package in several months, we can compare the number of distinct IPs per year with the sum of the distinct IPs per month.
staty <- stats[, .(Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs)), by = c("Year", "Package")]
year <- staty[yearly, , on = c("Package", "Year")]
year <- year[, Repeated_IP := Nb_of_distinct_IPs-i.Nb_of_distinct_IPs, by = c("Package", "Year")]
year <- year[, Repeated_IP_per := Repeated_IP/Nb_of_distinct_IPs*100, by = c("Package", "Year")]
ggplot(year, aes(Year, Repeated_IP_per)) +
geom_violin() +
xlab("") +
ylab("Percentage of repeated IPs") +
theme_bw +
guides(col = FALSE)
This suggests that most IPs don’t update all their packages to the newest version, or, since most IPs are dynamic and change over time, that the measure is not completely reliable.
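The repeated-IP measure relies on a simple identity: if an IP downloads a package in several months of the same year, it is counted once in the yearly figure but once per month in the monthly figures. The sum of monthly distinct IPs minus the yearly distinct IPs therefore counts the repeated appearances. A minimal sketch with made-up numbers (the tables `monthly` and `yearly_toy` are hypothetical):

```r
library(data.table)

# Monthly distinct IPs for two hypothetical packages
monthly <- data.table(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  Year = rep("2016", 5),
  Nb_of_distinct_IPs = c(10, 12, 8, 5, 5)
)
# Yearly distinct IPs for the same packages
yearly_toy <- data.table(
  Package = c("pkgA", "pkgB"),
  Year = c("2016", "2016"),
  Nb_of_distinct_IPs = c(24, 10)
)

monthly_sum <- monthly[, .(Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs)),
                       by = c("Year", "Package")]

# In X[Y, on = ...] the clashing column of Y gets the prefix "i."
joined <- monthly_sum[yearly_toy, on = c("Package", "Year")]
joined[, Repeated_IP := Nb_of_distinct_IPs - i.Nb_of_distinct_IPs]
joined[, Repeated_IP_per := Repeated_IP / Nb_of_distinct_IPs * 100]
joined
```

Here pkgA has 30 monthly IP counts but only 24 distinct IPs in the year, so 6 of them (20%) are repeats, while pkgB has none.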
year2 <- year[, .(m = mean(Repeated_IP_per), sem = sd(Repeated_IP_per)/sqrt(.N)), by = "Year"]
ggplot(year2, aes(Year, m)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = m - sem, ymax = m + sem),
width = .1, position = pd) +
ylab("Percentage") +
xlab("") +
ggtitle("Mean download per ") +
theme_bw
We can see, however, that the percentage of repeated IPs is quite similar across the packages in Bioconductor. It also remains around 17% of the total IPs.
year$Year <- as.numeric(year$Year)
d <- date()
thisYear <- as.numeric(substr(d, nchar(d)-3, nchar(d)))
ggplot(year, aes(Year, Repeated_IP_per, col = Package)) +
geom_point() +
geom_line() +
guides(col = FALSE) +
scale_x_continuous(breaks = seq(2009, thisYear, 1)) +
ylab("Percentage") +
ggtitle("Update from the same IP") +
theme_bw
We can see that some packages occasionally rise above the mean.
We can check whether a package has consistently been the most downloaded package of Bioconductor:
PercDate <- stats[, .(Package, Downloads = Nb_of_downloads/sum(Nb_of_downloads)), by = Date]
PercDate <- PercDate[order(Date, Downloads), ] # Sort by date, then by share of downloads
ggplot(PercDate, aes(Date, Downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads of packages by dates") +
xlab("") +
ylab("% of downloads") +
guides(col = FALSE) +
scal +
theme
We can see that the percentage of downloads of a package usually doesn’t reach 5% of the downloads of the category, and that there are huge differences between packages in the number of downloads. Let’s compare the most downloaded package to the other packages:
OrdDate <- PercDate[, .(Package, Ord = Downloads/max(Downloads)), by = Date]
ggplot(OrdDate, aes(Date, Ord, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads of packages by dates") +
xlab("") +
ylab("% of the Package more downloaded") +
guides(col = FALSE) +
scal +
theme
We can observe that usually a few packages are close to the most downloaded package and many are far from it. We can rank them to follow the evolution of a package’s downloads over time:
rankDate <- OrdDate[, .(Package, rank = rank(Ord)/.N), by = Date]
ggplot(rankDate, aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Rank of packages by downloads") +
xlab("") +
ylab("Position by downloads") +
guides(col = FALSE) +
scal +
theme
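The three quantities used above (share of the month's downloads, downloads relative to the top package, and relative rank) can be traced on a single made-up month. This is a sketch only; the table `toy` and its values are assumptions.

```r
library(data.table)

# One hypothetical month with four packages
toy <- data.table(
  Date = as.Date(rep("2017-01-01", 4)),
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Nb_of_downloads = c(50, 30, 15, 5)
)

# Share of all downloads of the month
share <- toy[, .(Package, Downloads = Nb_of_downloads / sum(Nb_of_downloads)),
             by = Date]
# Downloads relative to the most downloaded package of the month
relative <- share[, .(Package, Ord = Downloads / max(Downloads)), by = Date]
# Rank scaled to (0, 1]; 1 is the most downloaded package of the month
ranked <- relative[, .(Package, rank = rank(Ord) / .N), by = Date]
ranked
```

Dividing the rank by `.N` keeps months with different numbers of packages comparable, at the cost of losing the absolute position.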
Only if we select a few packages can we track them:
packages <- c("GO.db", "org.Hs.eg.db", "org.MeSH.Hsa.db", "GenomeInfoDbData",
"BSgenome.Dmelanogaster.UCSC.dm3", "PFAM.db", "PFAM",
"KEGG", "reactome.db")
ggplot(rankDate[Package %in% packages, ], aes(Date, rank, col = Package))+
geom_line() +
theme_bw +
ggtitle("Relative rank of packages by downloads") +
xlab("") +
ylab("Position by downloads") +
scal +
theme
Here we can see which of the selected packages have been among the most downloaded since 2009, which have climbed to the top only recently, and which haven’t reached the top 25% of packages by downloads in any month.
We can also observe when a package reached its maximum number of downloads:
PercPack <- stats[, .(Date, Downloads = Nb_of_downloads/sum(Nb_of_downloads)), by = Package]
ggplot(PercPack, aes(Date, Downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Growth of the packages") +
xlab("") +
ylab("Downloads/sum(Downloads)") +
guides(col = FALSE) +
scal +
theme
Here we can see the date when a package reached its highest number of downloads. For most packages the highest downloads are in recent months.
OrdPack <- PercPack[, .(Date, rank = Downloads/max(Downloads)), by = Package]
ggplot(OrdPack, aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Growth of the packages") +
xlab("") +
ylab("Downloads/max(Downloads)") +
guides(col = FALSE) +
scal +
theme
Here we can see when each package has had the most downloads over time. As usual, we need to focus on fewer packages to be able to distinguish them and see their evolution:
ggplot(OrdPack[Package %in% packages, ],
aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Growth of the packages") +
xlab("") +
ylab("Downloads/max(Downloads)") +
scal +
theme
As expected, the packages that keep up with Bioconductor growth have a peak near the end of the series.
We can combine both models into a single one to have a timeless comparison of the packages. This way we will know whether a package reaches its highest position in Bioconductor when it has the most downloads.
model <- rankDate[OrdPack, on = c("Package", "Date")]
setnames(model, "i.rank", "rank.P")
setnames(model, "rank", "rank.B")
ggplot(model[Package %in% packages], aes(rank.P, rank.B, col = Package)) +
geom_point() +
theme_bw +
geom_line() +
ylab("Position in Bioconductor") +
xlab("Position in package")
This is a timeless comparison of the growth of packages within Bioconductor and against themselves. The higher on the y-axis, the higher they have ranked in Bioconductor, and the further right on the x-axis, the more downloads they had.
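The combined table comes from joining the two rankings on package and date. Because both tables have a column called `rank`, the join keeps one as `rank` and prefixes the other with `i.`, which is why they are renamed afterwards. A minimal sketch with made-up values (the tables `rank_by_date` and `rank_by_pkg` are hypothetical):

```r
library(data.table)

dates <- as.Date(c("2017-01-01", "2017-02-01"))
# Position among all packages in each month
rank_by_date <- data.table(Package = "pkgA", Date = dates, rank = c(0.5, 1))
# Downloads relative to the package's own best month
rank_by_pkg <- data.table(Package = "pkgA", Date = dates, rank = c(0.8, 1))

# X[Y, on = ...]: the column of Y that clashes with X becomes "i.rank"
model_toy <- rank_by_date[rank_by_pkg, on = c("Package", "Date")]
setnames(model_toy, c("rank", "i.rank"), c("rank.B", "rank.P"))
model_toy
```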
Next we add time as a factor, so that we standardize by the time a package has been in Bioconductor. We have omitted months with 0 downloads of a package, so this might bias the ranking a bit.
model[, rank.D := rank(as.numeric(Date))/.N, by = Package]
ggplot(model[Package %in% packages], aes(rank.D, rank.B, col = Package)) +
geom_point() +
theme_bw +
geom_line() +
xlab("By date") +
ylab("Position in Bioconductor")
ggplot(model[Package %in% packages], aes(rank.D, rank.P, col = Package)) +
geom_point() +
theme_bw +
geom_line() +
xlab("By date") +
ylab("Position in Package")
We can observe that some packages climb to the top of Bioconductor sooner than others, and some climb and then see their downloads decrease relative to the number of downloads in Bioconductor.
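The standardized time axis used in the last two plots is the rank of each month within the package's own history, divided by the number of months with downloads. A minimal sketch on made-up dates (the table `toy` is hypothetical) shows how packages of different ages end up on the same scale:

```r
library(data.table)

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  Date = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01",
                   "2017-01-01", "2017-02-01"))
)

# Position of each month in the package's own timeline, scaled to (0, 1]
toy[, rank.D := rank(as.numeric(Date)) / .N, by = Package]
toy
```

Both packages end at 1 regardless of how long they have been available. Months without downloads are absent from the table, so they do not advance this scale.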
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.10.4 ggplot2_2.2.1.9000 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.11 knitr_1.16 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.3-2 rlang_0.1.1 highr_0.6 stringr_1.2.0
## [9] plyr_1.8.4 tools_3.4.0 grid_3.4.0 gtable_0.2.0
## [13] htmltools_0.3.6 yaml_2.1.14 lazyeval_0.2.0 rprojroot_1.2
## [17] digest_0.6.12 tibble_1.3.3 bookdown_0.4 evaluate_0.10.1
## [21] rmarkdown_1.6 labeling_0.3 stringi_1.1.5 compiler_3.4.0
## [25] scales_0.4.1.9002 backports_1.1.0