1 Introduction

Bioconductor provides download statistics for the project. If you are curious about how the downloads of a certain package have evolved, or how downloads progress over time, this is the right place.

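Bioconductor classifies its packages into three categories: Software, Experimental data, and Annotation.

For each category, Bioconductor provides a page with the stats of its packages, together with a tab-delimited file (the *_pkg_stats.tab files read in the code below).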

The project also has a support site with an API to access its data, which we will analyze too. We first join the download and IP stats of the packages to make them easier to work with and compare.

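library("data.table") # setDT() and the [i, j, by] syntax
library("ggplot2")    # Plots
library("zoo")        # as.yearmon()
web <- "https://www.bioconductor.org/packages/stats/" # Base url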
colClasses <- c("factor", "character", "character", "numeric", "numeric")
software <- read.delim(paste0(web, "bioc/bioc_pkg_stats.tab"), 
                    colClasses = colClasses)
experimental <- read.delim(paste0(web, "data-experiment/experiment_pkg_stats.tab"), 
                    colClasses = colClasses)
annotation <- read.delim(paste0(web, "data-annotation/annotation_pkg_stats.tab"), 
                    colClasses = colClasses)
# Convert to data.tables
setDT(software)
setDT(experimental)
setDT(annotation)
# Assign category
software[, Category := "Software"]
experimental[, Category := "Experimental"]
annotation[, Category := "Annotation"]
# Bind
stats <- rbind(software, experimental, annotation)
yearly <- stats[Month == "all", ]
stats <- stats[Month != "all", ]
stats[, .(Packages = length(unique(Package))), by = Category]
##        Category Packages
## 1:     Software     1930
## 2: Experimental      728
## 3:   Annotation     2586
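
To make the layout of these files concrete, here is a minimal sketch that parses a small inline sample with the same colClasses. The sample rows are invented, and the column order (Package, Year, Month, Nb_of_distinct_IPs, Nb_of_downloads) is an assumption inferred from the column names used in the code above, not something checked against the live files.

```r
# Invented sample mimicking the assumed layout of a *_pkg_stats.tab file
sample_tab <- paste(
  "Package\tYear\tMonth\tNb_of_distinct_IPs\tNb_of_downloads",
  "pkgA\t2017\tJan\t10\t15",
  "pkgA\t2017\tFeb\t12\t20",
  "pkgA\t2017\tall\t18\t35",
  sep = "\n"
)
toy <- read.delim(text = sample_tab,
                  colClasses = c("factor", "character", "character",
                                 "numeric", "numeric"))
# Rows with Month == "all" hold the yearly summary; the rest are monthly
toy_yearly  <- toy[toy$Month == "all", ]
toy_monthly <- toy[toy$Month != "all", ]
```

Reading Year and Month as character keeps the "all" marker intact, which is why the split into yearly and monthly rows can be done with a plain comparison.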

However, some packages are erroneously classified in several categories. This shows up when we calculate the overlap between the package lists:

packages <- list(Software = unique(software$Package), 
                 Experimental = unique(experimental$Package), 
                 Annotation = unique(annotation$Package))
f <- function(x, y){
  length(intersect(x, y))
}
vf <- Vectorize(f)
outer(packages, packages, vf)
##              Software Experimental Annotation
## Software         1930          471         27
## Experimental      471          728         17
## Annotation         27           17       2586
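
The off-diagonal counts say how many names are shared, but not which ones. As a small sketch on invented package names (the real lists come from the stats files), the shared names can be listed by counting in how many categories each name appears:

```r
# Invented package names; the real lists come from the stats files
toy_packages <- list(
  Software     = c("pkgA", "pkgB", "pkgC"),
  Experimental = c("pkgB", "pkgD"),
  Annotation   = c("pkgC", "pkgE")
)
# Pairwise overlap counts, built the same way as the matrix above
overlap <- outer(toy_packages, toy_packages,
                 Vectorize(function(x, y) length(intersect(x, y))))
# Count in how many categories each package name appears
n_categories <- table(unlist(lapply(toy_packages, unique)))
# Names present in more than one category
shared <- names(n_categories)[n_categories > 1]
shared
```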

We need a better classification: we assign each package to the single category in which it has the most total downloads:

# Calculates number of downloads per package and category
max_cat <- stats[, .(Total = sum(Nb_of_downloads)), by = c("Package", "Category")]
# Select the category with max downloads per package
max_cat <- max_cat[, .(Category = Category[which.max(Total)]), by = Package]
# Join by name
stats <- stats[max_cat, , on = "Package"]
yearly <- yearly[max_cat, on = "Package"]
# Substitute name
stats <- stats[, -"Category"]
names(stats)[names(stats) == "i.Category"] <- "Category"

yearly <- yearly[Category == i.Category, ]
yearly <- yearly[, -"i.Category"]
stats[, .(Packages = length(unique(Package))), by = Category]
##        Category Packages
## 1:     Software     1812
## 2: Experimental      351
## 3:   Annotation     2574
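
The reassignment rule can be checked by hand on a toy table. The sketch below uses base R instead of data.table and invented numbers; it is only meant to illustrate the rule, not to replace the code above.

```r
# Invented monthly records; pkgB appears under two categories
toy <- data.frame(
  Package  = c("pkgA", "pkgB", "pkgB", "pkgB"),
  Category = c("Software", "Software", "Experimental", "Experimental"),
  Nb_of_downloads = c(10, 5, 20, 30)
)
# Total downloads per package and category
totals <- aggregate(Nb_of_downloads ~ Package + Category, data = toy, FUN = sum)
# For each package keep the category with the largest total
best <- sapply(split(totals, totals$Package), function(d) {
  as.character(d$Category[which.max(d$Nb_of_downloads)])
})
best
```

Here pkgB has 5 downloads as Software and 50 as Experimental, so it ends up in Experimental. Note that which.max() returns the first maximum, so a tie goes to whichever category comes first in the table.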

2 Initial exploratory analysis

We do a little visual exploration of the total downloads per category:

theme_bw <- theme_bw(base_size = 16)
p <- ggplot(stats[ , .(Downloads = sum(Nb_of_downloads)), by = c("Category")]) +
  geom_bar(aes(Category, log10(Downloads), fill = Category), stat = "identity") + 
  theme_bw +
  ylab("log10(Downloads per category)") + 
  ggtitle("Total downloads per category")
print(p)

Figure 1: Total downloads per Category

We can see that Software packages are downloaded the most and Annotation packages the least. But is this due to a few packages, or does it hold in general?

p <- ggplot(stats[ , .(Downloads = log10(sum(Nb_of_downloads))), by = c("Package", "Category")]) +
  geom_violin(aes(Category, Downloads, fill = Category)) + 
  theme_bw +
  ylab("log10(Downloads per package)")
print(p)

Figure 2: Violin plot of downloads per package and category

To work more easily, we convert the months to a date:

monthsConvert <- function(x) {
  # Map English month abbreviations ("Jan" ... "Dec") to "01" ... "12"
  # month.abb is the built-in vector of abbreviations, so match() gives 1..12
  sprintf("%02d", match(x, month.abb))
}
stats$Month <- monthsConvert(stats$Month)
stats$Date <- as.POSIXct(as.yearmon(paste(stats$Year, stats$Month, sep = "-")), frac = 1)
bioc_packages <- c("BiocInstaller", "Biobase", "BiocGenerics", "S4Vectors", "IRanges", "AnnotationDbi")
stats <- stats[Nb_of_downloads != 0, ] # We remove rows of packages without a download in that month.
# Convert the data from several categories to one entry in the right category
stats <- stats[, .(Month = unique(Month), Year = unique(Year), 
                   Category = unique(Category), 
                   Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs), 
                  Nb_of_downloads = sum(Nb_of_downloads)) , 
               by = c("Package", "Date")]
save(stats, yearly, bioc_packages, monthsConvert, file="stats.RData")

We have also stored the data for future uses.
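
As a quick check of the month handling, the sketch below builds dates from invented Year and Month values using only base R. It differs from the code above in two ways that are choices of this example: it returns a Date set to the first day of the month, and it does not use zoo.

```r
# Invented Year/Month values in the format used by the stats files
year  <- c("2016", "2017", "2017")
month <- c("Dec", "Jan", "Mar")
# month.abb is the built-in vector of English abbreviations "Jan" ... "Dec"
month_number <- match(month, month.abb)
# First day of each month as a Date
dates <- as.Date(sprintf("%s-%02d-01", year, month_number))
format(dates, "%Y-%m-%d")
```

Mapping the abbreviations explicitly avoids parsing them with the "%b" format, which depends on the session locale; this may matter here, since the session listed at the end runs with LC_TIME=es_ES.UTF-8.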

We can observe the number of packages downloaded for each category over time:

theme <- theme(axis.text.x = element_text(angle = 60, hjust = 1))
scal <- scale_x_datetime(date_breaks = "3 months")
p <- ggplot(stats[,  .(Downloaded = .N), by = c("Date", "Category")], aes(Date, Downloaded, color = Category)) +
  geom_line() +
  theme_bw +
  ggtitle("Packages downloaded") +
  theme +
  scal
print(p)

Figure 3: Packages downloaded per date and category
For each category the number of packages downloaded is displayed

We can see that the number of packages downloaded increases consistently for Software packages and, at a slower pace, for Experimental data, but the Annotation packages show some peaks. These peaks might be new databases added as packages to Bioconductor.

The number of downloads per month for each category:

p <- ggplot(stats[, .(Downloads = sum(Nb_of_downloads)), by = c("Date", "Category")], aes(Date, Downloads, color = Category)) +
  geom_line() +
  theme_bw +
  ggtitle("Downloads per month") +
  xlab("") +
  theme +
  scal
print(p)

Figure 4: Downloads per date and category
For each category the downloads are displayed

The number of downloads in the Annotation and Experimental categories remains relatively stable compared to Software packages, whose downloads have increased linearly since 2011. This suggests that competition for downloads in the Software category has increased:

pd <- position_dodge(0.1)
p <- ggplot(stats[, .(Number = mean(Nb_of_downloads), 
                  sem = sd(Nb_of_downloads)/sqrt(.N)), 
              by = c("Date", "Category")], 
       aes(Date, Number, color = Category)) +
  geom_errorbar(aes(ymin = Number-sem, ymax = Number + sem), 
                width = .1, position = pd) +
  geom_point() + 
  geom_line() +
  theme_bw +
  ggtitle("Downloads") +
  ylab("Mean download for a package") +
  xlab("") + 
  theme +
  scal
print(p)

Figure 5: Downloads per package
Mean of the downloads per package per category, the error bar is the standard error of the mean.
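
The error bars follow the usual formula for the standard error of the mean, sd(x) / sqrt(n). A worked example on invented download counts for one category in one month:

```r
# Invented download counts for the packages of one category in one month
downloads <- c(120, 80, 100, 140, 60)
n <- length(downloads)
mean_downloads <- mean(downloads)
# Standard error of the mean: sample standard deviation over sqrt(n)
sem <- sd(downloads) / sqrt(n)
# Limits of the error bar drawn around the mean
c(lower = mean_downloads - sem, mean = mean_downloads, upper = mean_downloads + sem)
```

With these numbers the mean is 100 and the standard error is sqrt(200), about 14.1, so the bar spans roughly 85.9 to 114.1.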

Looking at the number of downloads per package over time contradicts our hypothesis that competition for downloads in the Software category has increased. There is now more variation, but the mean number of downloads per package remains fairly constant.
We can see that for Annotation packages the mean downloads per package have stayed the same since 2009. Experimental packages had much larger variation, but nowadays fewer experimental packages are used, as we can see from the error bars. Software packages show much larger variation over the last two years, but they have traditionally had more variation and higher downloads per package than any other category.

p <- ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs),
                 sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N)), 
              by = c("Date", "Category")], 
       aes(Date, Number, color = Category)) +
  geom_point() + 
  geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem), 
                width = .1, position = pd) +
  geom_line() +
  theme_bw +
  ggtitle("Downloads per IP") +
  ylab("Mean downloads per IP for a package") +
  xlab("") + 
  theme +
  scal
print(p)

Figure 6: Downloads per IP
Mean of the downloads per IP per package, the error bars are the standard error of the mean.

We can see that in some months downloads per IP increase in all three categories, while in other months the categories don’t follow the same pattern. In general, the downloads per IP are around 1.5 to 2.

p <- ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs)), 
              by = c("Package", "Category")], 
       aes(Category, Number, color = Category)) +
  geom_violin() +
  theme_bw +
  ggtitle("Downloads per IP") +
  ylab("Mean downloads per IP") +
  xlab("") + 
  theme
print(p)

Figure 7: Downloads per IP
Distribution of the mean downloads per IP per package, by category.

We can see that there is variation in the number of downloads per IP among Software packages, with a very extreme value for one package.

p <- ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs)), 
              by = c("Package", "Category")], 
       aes(Category, Number, color = Category)) +
  geom_violin() +
  theme_bw +
  ggtitle("Downloads per IP") +
  ylab("Mean downloads per IP") +
  xlab("") + 
  theme +
  ylim(c(1, 3))
print(p)
## Warning: Removed 40 rows containing non-finite values (stat_ydensity).

Figure 8: Downloads per IP
Distribution of the mean downloads per IP per package, by category, with the y axis limited to values from 1 to 3.

Looking closer, we can see that most packages are downloaded once or twice per IP in each month in which they are downloaded.

3 Updates

We can explore whether packages have been updated throughout the year by the same IP.

staty <- stats[, .(Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs)), by = c("Year", "Package", "Category")]
year <- staty[yearly, , on = c("Package", "Year")]
year[, Repeated_IP := Nb_of_distinct_IPs-i.Nb_of_distinct_IPs, by =  c("Year", "Package", "Category")]
year[, Repeated_IP_per := Repeated_IP/Nb_of_distinct_IPs*100, by =  c("Year", "Package", "Category")]
year2 <- year[, .(m = mean(Repeated_IP_per), sem = sd(Repeated_IP_per)/sqrt(.N)), by = c("Year", "Category")]
year2$Year <- as.numeric(year2$Year)
thisYear <- as.numeric(format(Sys.Date(), "%Y")) # Current year as a number
ggplot(year2, aes(Year, m, col = Category)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = m - sem, ymax = m + sem), 
                width = .1) + 
  ylim(c(0, 26)) +
  ylab("Percentage") +
  xlab("") +
  ggtitle("Update from the same IP") + 
  theme_bw +
  scale_x_continuous(breaks = seq(2009, thisYear, 1))

Figure 9: Package update
Mean percentage of installations of the same package in a year by the same IP.

Experimental and Software packages tend to be updated from the same IP around 20% of the time, while Annotation packages have historically been updated less often from the same IP.
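
The reasoning behind Repeated_IP is that an IP downloading a package in several months is counted once per month in the monthly rows but only once in the yearly row. The sketch below illustrates this with invented IP sets for one package; it assumes that the yearly "all" row reports the distinct IPs over the whole year.

```r
# Invented sets of IPs that downloaded one package in each month of a year
monthly_ips <- list(
  Jan = c("ip1", "ip2", "ip3"),
  Feb = c("ip2", "ip4"),
  Mar = c("ip1", "ip2")
)
# Sum of the monthly distinct-IP counts (what the monthly rows add up to)
sum_monthly <- sum(lengths(lapply(monthly_ips, unique)))
# Distinct IPs over the whole year (assumed content of the "all" row)
yearly_distinct <- length(unique(unlist(monthly_ips)))
# Monthly counts that came from an IP already counted in another month
repeated <- sum_monthly - yearly_distinct
# Same quantity as a percentage of the summed monthly counts
repeated_per <- repeated / sum_monthly * 100
```

In this toy case the monthly counts add up to 7 while only 4 distinct IPs exist, so 3 of the 7 monthly counts (about 43%) come from returning IPs.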

4 Analysis per category

For each category a similar analysis has been performed:

5 Analysis support site

To analyse the relation between the support site and the packages, we first download and classify the data as described here.

SessionInfo

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] zoo_1.8-0           data.table_1.10.4-3 ggplot2_2.2.1      
## [4] BiocStyle_2.6.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14     knitr_1.17       magrittr_1.5     munsell_0.4.3   
##  [5] lattice_0.20-35  colorspace_1.3-2 rlang_0.1.4      highr_0.6       
##  [9] stringr_1.2.0    plyr_1.8.4       tools_3.4.3      grid_3.4.3      
## [13] gtable_0.2.0     htmltools_0.3.6  yaml_2.1.16      lazyeval_0.2.1  
## [17] rprojroot_1.2    digest_0.6.13    tibble_1.3.4     bookdown_0.5    
## [21] evaluate_0.10.1  rmarkdown_1.8    labeling_0.3     stringi_1.1.6   
## [25] compiler_3.4.3   scales_0.5.0     backports_1.1.2

Bioconductor stats