Bioconductor provide stats of the project. If you are curious about what is the evolution of the downloads of a certain package or how does the downloads progress with time this is the right place.
Bioconductor classifies the packages in three categories:
For each category a page with the stats of the packages of each category is provided, together with a file:
The project also has a support site which has an api to access the data. We will access analyze it too. We first join the stats of the downloads and IPs of the packages to work easier and compare between them.
web <- "https://www.bioconductor.org/packages/stats/" # Base url
colClasses <- c("factor", "character", "character", "numeric", "numeric")
software <- read.delim(paste0(web, "bioc/bioc_pkg_stats.tab"),
colClasses = colClasses)
experimental <- read.delim(paste0(web, "data-experiment/experiment_pkg_stats.tab"),
colClasses = colClasses)
annotation <- read.delim(paste0(web, "data-annotation/annotation_pkg_stats.tab"),
colClasses = colClasses)
# Convert to data.tables
setDT(software)
setDT(experimental)
setDT(annotation)
# Assing Category
software[, Category := "Software"]
experimental[, Category := "Experimental"]
annotation[, Category := "Annotation"]
# Bind
stats <- rbind(software, experimental, annotation)
yearly <- stats[Month == "all", ]
stats <- stats[Month != "all", , ]
stats[, .(Packages = length(unique(Package))), by = Category]
## Category Packages
## 1: Software 1930
## 2: Experimental 728
## 3: Annotation 2586
However, there are some packages are erroniously classified in several categories. You can see it if we calculate the overlap between packages:
packages <- list(Software = unique(software$Package),
Experimental = unique(experimental$Package),
Annotation = unique(annotation$Package))
f <- function(x, y){
length(intersect(x, y))
}
vf <- Vectorize(f)
outer(packages, packages, vf)
## Software Experimental Annotation
## Software 1930 471 27
## Experimental 471 728 17
## Annotation 27 17 2586
We need to classify better, we will use the total number of downloads to put them in a single category based in the total downloads for each category:
# Calculates number of downloads per package and category
max_cat <- stats[, .(Total = sum(Nb_of_downloads)), by = c("Package", "Category")]
# Select the category with max downloads per package
max_cat <- max_cat[, .(Category = Category[which.max(Total)]), by = Package]
# Join by name
stats <- stats[max_cat, , on = "Package"]
yearly <- yearly[max_cat, on = "Package"]
# Substitute name
stats <- stats[, -"Category"]
names(stats)[names(stats) == "i.Category"] <- "Category"
yearly <- yearly[Category == i.Category, ]
yearly <- yearly[, -"i.Category"]
stats[, .(Packages = length(unique(Package))), by = Category]
## Category Packages
## 1: Software 1812
## 2: Experimental 351
## 3: Annotation 2574
We do a little visual exploration of the total downloads per category:
theme_bw <- theme_bw(base_size = 16)
p <- ggplot(stats[ , .(Downloads = sum(Nb_of_downloads)), by = c("Category")]) +
geom_bar(aes(Category, log10(Downloads), fill = Category), stat = "identity") +
theme_bw +
ylab("log10(Downloads per category)") +
ggtitle("Total downloads per category")
print(p)
We can see that Software packages are more downloaded, and Annotation pakcages the least. But is this due to some packages or in general:
p <- ggplot(stats[ , .(Downloads = log10(sum(Nb_of_downloads))), by = c("Package", "Category")]) +
geom_violin(aes(Category, Downloads, fill = Category)) +
theme_bw +
ylab("log10(Downloads per package)")
print(p)
To work more easily, we convert the months to a date:
monthsConvert <- function(x) {
if (x == "Jan") {
"01"
} else if (x == "Feb"){
"02"
} else if (x == "Mar"){
"03"
} else if (x == "Apr"){
"04"
} else if (x == "May"){
"05"
} else if (x == "Jun"){
"06"
} else if (x == "Jul"){
"07"
} else if (x == "Aug"){
"08"
} else if (x == "Sep"){
"09"
} else if (x == "Oct"){
"10"
} else if (x == "Nov"){
"11"
} else if (x == "Dec"){
"12"
}
}
stats$Month <- sapply(stats$Month, monthsConvert)
stats$Date <- as.POSIXct(as.yearmon(paste(stats$Year, stats$Month, sep = "-")), frac = 1)
bioc_packages <- c("BiocInstaller", "Biobase", "BiocGenerics", "S4Vectors", "IRanges", "AnnotationDbi")
stats <- stats[Nb_of_downloads != 0, ] # We remove rows of packages without a download in that month.
# Convert the data from several categories to one entry in the right category
stats <- stats[, .(Month = unique(Month), Year = unique(Year),
Category = unique(Category),
Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs),
Nb_of_downloads = sum(Nb_of_downloads)) ,
by = c("Package", "Date")]
save(stats, yearly, bioc_packages, monthsConvert, file="stats.RData")
We have also stored the data for future uses.
We can observe the number of packages downloaded for each category along time:
theme <- theme(axis.text.x = element_text(angle = 60, hjust = 1))
scal <- scale_x_datetime(date_breaks = "3 months")
p <- ggplot(stats[, .(Downloaded = .N), by = c("Date", "Category")], aes(Date, Downloaded, color = Category)) +
geom_line() +
theme_bw +
ggtitle("Packages downloaded") +
theme +
scal
print(p)
We can see that the number of packages downloaded increase consistently for Software packages and at a slower peace also for Experimental data, but the Annnotations show some peaks. This peaks might be new databases added as a package in Bioconductor.
The number of download per month for each category:
p <- ggplot(stats[, .(Downloads = sum(Nb_of_downloads)), by = c("Date", "Category")], aes(Date, Downloads, color = Category)) +
geom_line() +
theme_bw +
ggtitle("Packages downloaded") +
xlab("") +
theme +
scal
print(p)
The number of downloads in the Annotation and Experimental category remains relatively stable compared to Software packages, which since 2011 increase linearly. This suggest that competency for downloads in the software category has increased:
pd <- position_dodge(0.1)
p <- ggplot(stats[, .(Number = mean(Nb_of_downloads),
sem = sd(Nb_of_downloads)/sqrt(.N)),
by = c("Date", "Category")],
aes(Date, Number, color = Category)) +
geom_errorbar(aes(ymin = Number-sem, ymax = Number + sem),
width = .1, position = pd) +
geom_point() +
geom_line() +
theme_bw +
ggtitle("Downloads") +
ylab("Mean download for a package") +
xlab("") +
theme +
scal
print(p)
Looking at the number of downloads per package along time it contradicts our hypothesis that competence for downloads in the Software package has increased. There is now more variation but the mean number of downloads per package remains more constant.
We can see that for Annotation package the mean download per packages is the same from 2009. Experimental packages had much larger variation, but nowadays fewer experimental packages are used as we can see from the error bars. In software packages there is much larger variation since two years, but traditionally it had more variation and higher downloads per package than any other category.
p <- ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs),
sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N)),
by = c("Date", "Category")],
aes(Date, Number, color = Category)) +
geom_point() +
geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem),
width = .1, position = pd) +
geom_line() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Mean downloads per IP for a package") +
xlab("") +
theme +
scal
print(p)
We can see that in some months there is an increase of downloads per IP in all the three categories whil other months the categories don’t follow the same pattern. In general the downloads per IP are around 1.5 to 2.
p <- ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs)),
by = c("Package", "Category")],
aes(Category, Number, color = Category)) +
geom_violin() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Mean downloads per IP") +
xlab("") +
theme
print(p)
We can see that there is a variation in the number of downloads per IP in the Software package, there is a very extreme value for a package.
p <- ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs)),
by = c("Package", "Category")],
aes(Category, Number, color = Category)) +
geom_violin() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Mean downloads per IP") +
xlab("") +
theme +
ylim(c(1, 3))
print(p)
## Warning: Removed 40 rows containing non-finite values (stat_ydensity).
Looking closer we can see that most of the packages are downloaded between 1 or twice per IP each month where the package is downloaded.
We can explore if the packages has been updated thorough the year by the same IP.
staty <- stats[, .(Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs)), by = c("Year", "Package", "Category")]
year <- staty[yearly, , on = c("Package", "Year")]
year[, Repeated_IP := Nb_of_distinct_IPs-i.Nb_of_distinct_IPs, by = c("Year", "Package", "Category")]
year[, Repeated_IP_per := Repeated_IP/Nb_of_distinct_IPs*100, by = c("Year", "Package", "Category")]
year2 <- year[, .(m = mean(Repeated_IP_per), sem = sd(Repeated_IP_per)/sqrt(.N)), by = c("Year", "Category")]
year2$Year <- as.numeric(year2$Year)
d <- date()
thisYear <- as.numeric(substr(d, nchar(d)-3, nchar(d)))
ggplot(year2, aes(Year, m, col = Category)) +
geom_line() +
geom_point() +
geom_errorbar(aes(ymin = m - sem, ymax = m + sem),
width = .1) +
ylim(c(0, 26)) +
ylab("Percentatge") +
xlab("") +
ggtitle("Update from the same IP") +
theme_bw +
scale_x_continuous(breaks = seq(2009, thisYear, 1))
Experimental and software packages tend to be updated around the 20% while annotation packages historically have been less updated from the same IP.
For each category a similar analysis has been performed:
To analyse the support site relation with the packages we first downloaded and classify the data as described here.
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] zoo_1.8-0 data.table_1.10.4-3 ggplot2_2.2.1
## [4] BiocStyle_2.6.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.14 knitr_1.17 magrittr_1.5 munsell_0.4.3
## [5] lattice_0.20-35 colorspace_1.3-2 rlang_0.1.4 highr_0.6
## [9] stringr_1.2.0 plyr_1.8.4 tools_3.4.3 grid_3.4.3
## [13] gtable_0.2.0 htmltools_0.3.6 yaml_2.1.16 lazyeval_0.2.1
## [17] rprojroot_1.2 digest_0.6.13 tibble_1.3.4 bookdown_0.5
## [21] evaluate_0.10.1 rmarkdown_1.8 labeling_0.3 stringi_1.1.6
## [25] compiler_3.4.3 scales_0.5.0 backports_1.1.2