1 Load data

Here we are going to analyse the software packages of Bioconductor. See the home of the analysis, where we already transformed the data. From those stats we are going to analyse the Software category:

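library("data.table") # Needed for the stats[i, j, by] syntax used below
library("ggplot2") # Needed for the plots
load("stats.RData", verbose = TRUE)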
## Loading objects:
##   stats
##   yearly
##   bioc_packages
##   monthsConvert
stats <- stats[Category == "Software", ]
yearly <- yearly[Category == "Software", ]
stats
##         Package                Date Month Year Category Nb_of_distinct_IPs
##     1:  ABarray 2017-01-01 01:00:00    01 2017 Software                105
##     2:  ABarray 2017-02-01 01:00:00    02 2017 Software                155
##     3:  ABarray 2017-03-01 01:00:00    03 2017 Software                119
##     4:  ABarray 2017-04-01 02:00:00    04 2017 Software                184
##     5:  ABarray 2017-05-01 02:00:00    05 2017 Software                192
##     6:  ABarray 2017-06-01 02:00:00    06 2017 Software                173
##     7:  ABarray 2017-07-01 02:00:00    07 2017 Software                118
##     8:  ABarray 2017-08-01 02:00:00    08 2017 Software                115
##     9:  ABarray 2017-09-01 02:00:00    09 2017 Software                119
##    10:  ABarray 2017-10-01 02:00:00    10 2017 Software                111
##    ---                                                                    
## 89924:    zFPKM 2017-10-01 02:00:00    10 2017 Software                  4
## 89925:    zFPKM 2017-11-01 01:00:00    11 2017 Software                 52
## 89926:    zFPKM 2017-12-01 01:00:00    12 2017 Software                 51
## 89927: zinbwave 2017-06-01 02:00:00    06 2017 Software                  5
## 89928: zinbwave 2017-07-01 02:00:00    07 2017 Software                 20
## 89929: zinbwave 2017-08-01 02:00:00    08 2017 Software                 19
## 89930: zinbwave 2017-09-01 02:00:00    09 2017 Software                 50
## 89931: zinbwave 2017-10-01 02:00:00    10 2017 Software                 75
## 89932: zinbwave 2017-11-01 01:00:00    11 2017 Software                130
## 89933: zinbwave 2017-12-01 01:00:00    12 2017 Software                112
##        Nb_of_downloads
##     1:             153
##     2:             229
##     3:             216
##     4:             272
##     5:             282
##     6:             285
##     7:             180
##     8:             192
##     9:             168
##    10:             165
##    ---                
## 89924:               4
## 89925:              73
## 89926:              67
## 89927:               8
## 89928:              23
## 89929:              25
## 89930:              88
## 89931:             111
## 89932:             280
## 89933:             184

There have been 1812 Software packages in Bioconductor.
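The source does not show how this count was obtained. A minimal sketch, assuming it is the number of distinct values in the `Package` column, could look like this on a made-up toy table (the `toy` data and the `n_packages` name are hypothetical):

```r
library(data.table)

# Toy stand-in for `stats`; the values are made up for illustration.
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC"),
  Nb_of_downloads = c(10L, 12L, 0L, 3L)
)

# Number of distinct packages that appear in the table
n_packages <- uniqueN(toy$Package)
n_packages
```

On the real data the same call would be applied to `stats$Package`.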

2 Packages

2.1 Number

First we explore the number of packages being downloaded by month:

stats <- stats[Nb_of_downloads != 0, ] # We remove rows of packages without any download in that month.
theme_bw <- theme_bw(base_size = 16)
theme <- theme(axis.text.x=element_text(angle = 60, hjust = 1))
scal <- scale_x_datetime(date_breaks = "3 months")
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
  geom_bar(stat = "identity") + 
  theme_bw +
  ggtitle("Packages downloaded") +
  theme + 
  scal + 
  xlab("")

Figure 1: Packages in Bioconductor with downloads

The number of packages being downloaded increases almost exponentially with time. This is partially explained by the incorporation of new packages.

ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
  geom_bar(stat = "identity") + 
  theme_bw +
  ggtitle("Downloads") +
  scal +
  theme + 
  xlab("")

Figure 2: Downloads of packages

Even though the number of packages increases exponentially, the number of downloads has grown linearly with time since 2011. This indicates that a software package must compete with more and more packages to be downloaded.

pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads), 
                  sem = sd(Nb_of_downloads)/sqrt(.N)),
              by = Date], aes(Date, Number)) +
  geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem), 
                width = .1, position = pd) +
  geom_point() + 
  geom_line() +
  theme_bw +
  ggtitle("Downloads") +
  ylab("Mean download for a package") +
  scal +
  theme + 
  xlab("")

Figure 3: Downloads of packages per package
The error bar indicates the standard error of the mean.

Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now more dispersion between the downloads of different packages.
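The three summaries used in figures 1 to 3 (packages per month, total downloads per month, and mean with standard error per month) can be computed in a single grouped call. The following is only a sketch on made-up numbers; `toy` and `per_month` are hypothetical names:

```r
library(data.table)

# Toy stand-in for `stats`; the values are made up for illustration.
toy <- data.table(
  Date = as.POSIXct(c("2017-01-01", "2017-01-01", "2017-01-01",
                      "2017-02-01", "2017-02-01"), tz = "UTC"),
  Package = c("pkgA", "pkgB", "pkgC", "pkgA", "pkgB"),
  Nb_of_downloads = c(10, 20, 30, 5, 15)
)

per_month <- toy[, .(Packages = .N,                       # packages with downloads
                     Total = sum(Nb_of_downloads),        # all downloads that month
                     Mean = mean(Nb_of_downloads),        # mean downloads per package
                     sem = sd(Nb_of_downloads) / sqrt(.N)), # standard error of the mean
                 by = Date]
per_month
```

Comparing `Total` with `Mean` month by month is what separates growth in the number of packages from growth in the usage of each package.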

2.2 Incorporations

This might be due to an increase in the usage of packages, or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.

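today <- base::date() # A string such as "Mon Jan  1 12:00:00 2018"
year <- substr(today, 21, 24) # The four-digit year at the end of the string
month <- monthsConvert(substr(today, 5, 7)) # Month abbreviation converted with the loaded monthsConvert
incorporation <- stats[ , .SD[which.min(Date)], by = Package, .SDcols = "Date"] # First month with downloads
histincorporation <- incorporation[, .(Number = .N), by = Date]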
ggplot(histincorporation, aes(Date, Number)) + 
  geom_bar(stat="identity") + 
  theme_bw + 
  ggtitle("Packages with first download") +
  scal +
  theme +
  xlab("")

Figure 4: New packages

We can see that there were more than 350 packages in Bioconductor before 2009, and since then there are occasional rises to 50 first downloads (which would be new packages being added).

ggplot(histincorporation, aes(Date, Number)) + 
  geom_bar(stat="identity") + 
  theme_bw + 
  ggtitle("Packages with first download") +
  scal +
  theme +
  ylim(c(0, 60)) +
  ylab("New packages") +
  xlab("")
## Warning: Removed 1 rows containing missing values (position_stack).

Figure 5: New packages
Zoom on the new downloads of packages after 2009.

We can now observe that each year there are two spikes of first downloads of packages; usually these are the packages added for the new release of Bioconductor.
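The idea of using the first month with downloads as a proxy for the incorporation date can be sketched on a toy table. This variant uses `min(Date)` instead of `.SD[which.min(Date)]`, which should give the same dates; the data and the names `first_seen` and `new_per_month` are made up for illustration:

```r
library(data.table)

# Toy stand-in for `stats`; the values are made up for illustration.
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  Date = as.POSIXct(c("2009-01-01", "2009-02-01", "2009-04-01",
                      "2009-05-01", "2009-04-01"), tz = "UTC")
)

# First month in which each package was downloaded
first_seen <- toy[, .(Date = min(Date)), by = Package]

# Number of packages first seen in each month
new_per_month <- first_seen[, .(Number = .N), by = Date]
new_per_month
```

Here two packages appear for the first time in April, which is the kind of spike that a release would produce.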

2.3 Removed

Using a similar procedure we can approximate the packages deprecated and removed each month, although a package might have no downloads and still be included in Bioconductor. In this case we look for the last date a package was downloaded, excluding the current month:

deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date",  "Year", "Month")]
deprecation <- deprecation[!(Month == month & Year == year), ] # Exclude packages last seen in the current month
histDeprecation <- deprecation[, .(Number = .N), by = Date]
ggplot(histDeprecation, aes(Date, Number)) + 
  geom_bar(stat = "identity") + 
  theme_bw + 
  ggtitle("Packages without downloads") +
  scal +
  theme + 
  ylab("Last seen packages") + 
  xlab("")

Figure 6: Date where a package was last downloaded
Approximates the date when packages were removed from Bioconductor.

Here we can see the packages whose last download was in a certain month, assuming that this means they are deprecated. It can happen that a package is no longer downloaded but is still in the Bioconductor repository; this would be the reason for the spike to 80 packages in the last month.

We further explore how much time passes between the incorporation of a package and its last download.

df <- merge(incorporation, deprecation, by = "Package")
 # Transform to years
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days"))/365
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")

Figure 7: Time of packages between first and last download
The red line indicates the mean time in Bioconductor

We can see that most deprecated packages last less than a year (I would say around two releases), and some stay in Bioconductor up to 6 years before being removed. Not surprisingly, no package incorporated before 2009 has been removed from the repository. But how do the packages that were not removed fare in Bioconductor?
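Subtracting two date-time vectors returns a `difftime` whose units are chosen automatically, so it is safer to request the units explicitly before converting to years. A small sketch under that assumption, with made-up dates and hypothetical names (`toy`, `span`):

```r
library(data.table)

# Toy stand-in for `stats`; the values are made up for illustration.
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB"),
  Date = as.POSIXct(c("2010-01-01", "2012-01-01", "2011-06-01"), tz = "UTC")
)

# First and last month with downloads for each package
span <- toy[, .(First = min(Date), Last = max(Date)), by = Package]

# Explicit units avoid depending on the automatic choice of difftime
span[, Years := as.numeric(difftime(Last, First, units = "days")) / 365]
span
```

A package seen in a single month, such as `pkgB` here, gets a time of 0 years.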

3 Packages downloads

3.1 Ratio downloads per IP

We can start by comparing the number of downloads with the number of IPs that download each package.

pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs),
                 sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N)), 
              by = c("Date")], aes(Date, Number)) +
  geom_point() +
  geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem),
                width = .1, position = pd) +
  geom_line() +
  theme_bw +
  ggtitle("Downloads per IP") +
  ylab("Mean downloads per IP") +
  xlab("") + 
  theme +
  scal

Figure 8: Downloads per IP
The error bars indicate the standard error of the mean.

We can see that the number of downloads per IP is usually around 2, but that there is much variation between packages. In the points marked in red the variation is bigger than the mean; this might be due to specific packages being downloaded mostly from the same IP:

ratio <- stats[, .(Mean = mean(Nb_of_downloads/Nb_of_distinct_IPs),
                   sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N),
                   sd = sd(Nb_of_downloads/Nb_of_distinct_IPs),
                   max = max(Nb_of_downloads/Nb_of_distinct_IPs),
                   min = min(Nb_of_downloads/Nb_of_distinct_IPs)), 
              by = c("Package")]
ratio <- ratio[order(Mean, decreasing = TRUE), ]
ratio$Package <- as.character(ratio$Package)
ratio
##                  Package      Mean       sem        sd        max      min
##    1:           flowCore 35.384572 7.6679155 79.687316 385.518002 1.424051
##    2:         cummeRbund  6.717642 1.3519148 11.707925  70.841699 1.583333
##    3:           REMPdata  6.000000        NA        NA   6.000000 6.000000
##    4:           picaplot  6.000000        NA        NA   6.000000 6.000000
##    5:        EpicopyData  5.666667        NA        NA   5.666667 5.666667
##    6:            mosaics  5.586846 1.7998096 16.297969  93.465831 1.096774
##    7:  phosphonormalizer  5.295936 1.0675253  3.994314  13.500000 1.464286
##    8:           OPWeight  5.277541 0.8186818  2.166028   7.461538 1.333333
##    9:     topOnto.HDO.db  5.000000        NA        NA   5.000000 5.000000
##   10:        AnimalQTLDB  5.000000        NA        NA   5.000000 5.000000
##   ---                                                                     
## 1803:          exomecopy  1.000000        NA        NA   1.000000 1.000000
## 1804: imageanalysisBrain  1.000000        NA        NA   1.000000 1.000000
## 1805:        neoantigenR  1.000000        NA        NA   1.000000 1.000000
## 1806:          omicplotR  1.000000        NA        NA   1.000000 1.000000
## 1807:          phantasus  1.000000        NA        NA   1.000000 1.000000
## 1808:           projectR  1.000000        NA        NA   1.000000 1.000000
## 1809:          projectoR  1.000000        NA        NA   1.000000 1.000000
## 1810:               roma  1.000000        NA        NA   1.000000 1.000000
## 1811:            snpfier  1.000000        NA        NA   1.000000 1.000000
## 1812:            triwise  1.000000        NA        NA   1.000000 1.000000

We can see that the package with the most downloads from the same IP is flowCore, followed by cummeRbund and REMPdata, with picaplot in fourth place. Some packages (132) have been downloaded each time from a different IP. There are 52 packages with more dispersion than mean downloads per IP, which suggests that they are heavily downloaded in some specific places.
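The source does not show how the two counts above were obtained. One plausible way, assuming they come from the `max`, `sd` and `Mean` columns of `ratio`, is sketched below on a made-up table (`toy_ratio`, `n_always_one` and `n_dispersed` are hypothetical names):

```r
library(data.table)

# Toy stand-in for `ratio`; the values are made up for illustration.
toy_ratio <- data.table(
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Mean = c(35.4, 5.6, 1.7, 1.0),
  sd   = c(79.7, 16.3, 0.1, NA),
  max  = c(385.5, 93.5, 2.0, 1.0)
)

# Packages whose downloads per IP never exceeded 1,
# i.e. every download came from a different IP
n_always_one <- toy_ratio[max == 1, .N]

# Packages whose standard deviation exceeds their mean ratio;
# packages with a single month of data have NA sd and are excluded
n_dispersed <- toy_ratio[!is.na(sd) & sd > Mean, .N]

c(n_always_one, n_dispersed)
```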

I am curious how the default packages of Bioconductor are downloaded; let’s see where they are:

ratio[Package %in% bioc_packages, ]
##          Package     Mean        sem        sd       max      min
## 1: BiocInstaller 2.292161 0.12795222 1.1372642  8.487406 1.190476
## 2:       Biobase 2.120063 0.11134705 1.1571525 11.021502 1.632311
## 3: AnnotationDbi 1.877192 0.04329730 0.4499588  5.708393 1.454985
## 4:       IRanges 1.779438 0.02494051 0.2591894  3.781343 1.408883
## 5:     S4Vectors 1.687693 0.02138638 0.1434642  2.014107 1.406886
## 6:  BiocGenerics 1.594835 0.01217714 0.1047517  2.075613 1.329412

BiocInstaller is the base package with the most downloads per IP, maybe because it is necessary to install the other packages in Bioconductor.

Now we explore whether there are seasonal cycles in the downloads, as figure 2 seems to show some cycles.

3.2 By date

First we can explore the number of IPs per month downloading each package:

ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Distinct IP downloads") +
  scal +
  theme + 
  guides(col = FALSE)


Figure 9: Distinct IP per package

As we can see, there are two groups of packages in 2009: some with a low number of IPs and some with a larger number of IPs. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Package Downloads") +
  ylab("Downloads") +
  scal +
  theme + 
  guides(col = FALSE)

Figure 10: Downloads per year

Surprisingly, some packages have a big burst of downloads, up to 400k, while others reach just 100k. But let’s focus on the lower end:

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Downloads per package") +
  ylab("Downloads") +
  scal +
  ylim(0, 50000)+
  theme + 
  guides(col = FALSE)
## Warning: Removed 24 rows containing missing values (geom_path).

Figure 11: Downloads per year

There are many packages close to 0 downloads each month, but most packages have fewer than 10000 downloads per month:

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw+
  ggtitle("Downloads per package") +
  ylab("Downloads") +
  scal +
  ylim(0, 10000)+
  theme + 
  guides(col = FALSE)
## Warning: Removed 1003 rows containing missing values (geom_path).

Figure 12: Downloads per year

As we can see, in general the month of the year also influences the number of downloads. So from 2010 onwards the factors influencing the downloads are the year and the month.
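A seasonal effect of this kind can also be checked numerically by averaging the downloads by calendar month across years. This is only a sketch on made-up numbers, not a result from the real data; `toy` and `by_month` are hypothetical names:

```r
library(data.table)

# Toy stand-in for `stats`; the values are made up for illustration.
toy <- data.table(
  Year  = c("2016", "2016", "2017", "2017"),
  Month = c("01", "08", "01", "08"),
  Nb_of_downloads = c(100, 60, 140, 80)
)

# Mean downloads for each calendar month, pooled over the years
by_month <- toy[, .(Mean = mean(Nb_of_downloads)), by = Month]
by_month
```

Grouping by `c("Year", "Month")` instead would keep the year effect separate from the month effect.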

Maybe there is a relationship between the downloads and the number of IPs per date:

ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Spread of downloads") +
  ylab("Downloads per IPs") +
  scal +
  theme + 
  guides(col = FALSE)

Figure 13: Ratio downloads per IP per package

We can see that some packages have occasional rises in downloads per IP. But at this scale we miss a lot of packages in the small ranges:

ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Ratio") +
  scal +
  theme + 
  guides(col = FALSE) +
  ylim(1, 5)

Figure 14: Ratio downloads per IP per package

But most of the packages seem to be more or less constant and around 2.

3.3 By year

To measure whether the same IP has downloaded the same package in several months, we can compare the number of distinct IPs per year with the sum of the distinct IPs per month:

staty <- stats[, .(Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs)), by = c("Year", "Package")]
year <- staty[yearly, , on = c("Package", "Year")]
year[, Repeated_IP := Nb_of_distinct_IPs-i.Nb_of_distinct_IPs, by = c("Package", "Year")]
year[, Repeated_IP_per := Repeated_IP/Nb_of_distinct_IPs*100, by = c("Package", "Year")]
ggplot(year, aes(Year, Repeated_IP_per)) + 
  geom_violin() +
  xlab("") +
  ylab("Update from the same IP") +
  theme_bw +
  guides(col = FALSE)

This suggests that most IPs don’t update all their packages to the newest version of each package. Alternatively, since most IPs are dynamic and change over time, the measure may not be completely reliable.
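The arithmetic behind `Repeated_IP` may be easier to follow on a single package. In the sketch below all numbers are made up, and `monthly`, `yearly_toy`, `summed` and `joined` are hypothetical names; the join leaves the summed monthly count in `Nb_of_distinct_IPs` and the yearly count in `i.Nb_of_distinct_IPs`:

```r
library(data.table)

# Made-up monthly counts of distinct IPs for one package in one year
monthly <- data.table(
  Package = "pkgA", Year = "2017",
  Month = c("01", "02", "03"),
  Nb_of_distinct_IPs = c(100, 120, 80)
)
# Made-up yearly count of distinct IPs for the same package
yearly_toy <- data.table(Package = "pkgA", Year = "2017",
                         Nb_of_distinct_IPs = 250)

# Sum of the monthly counts: an IP seen in two months is counted twice
summed <- monthly[, .(Nb_of_distinct_IPs = sum(Nb_of_distinct_IPs)),
                  by = c("Year", "Package")]

# Join with the yearly counts; the yearly column gets the "i." prefix
joined <- summed[yearly_toy, on = c("Package", "Year")]
joined[, Repeated_IP := Nb_of_distinct_IPs - i.Nb_of_distinct_IPs]
joined[, Repeated_IP_per := Repeated_IP / Nb_of_distinct_IPs * 100]
joined
```

Here the monthly counts add up to 300 while only 250 distinct IPs were seen in the year, so 50 of the monthly appearances (about 16.7%) are repeats. An IP seen in three months contributes two repeats.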

year2 <- year[, .(m = mean(Repeated_IP_per), sem = sd(Repeated_IP_per)/sqrt(.N)), by = "Year"]
ggplot(year2, aes(Year, m)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = m - sem, ymax = m + sem), 
                width = .1, position = pd) + 
  ylab("Percentage") +
  xlab("") +
  ggtitle("Update from the same IP") + 
  theme_bw

Figure 15: Mean percentage of updates of the packages
The error bars are the SEM.

We can see, however, that the percentage of IPs which download the package is quite similar across the packages in Bioconductor. It also remains at around 17% of the total IPs.

year$Year <- as.numeric(year$Year)
d <- date()
thisYear <- as.numeric(substr(d, nchar(d)-3, nchar(d)))
ggplot(year, aes(Year, Repeated_IP_per, col = Package)) +
  geom_point() +
  geom_line() +
  guides(col = FALSE) + 
  scale_x_continuous(breaks = seq(2009, thisYear, 1)) + 
  ylab("Percentage") +
  ggtitle("Update from the same IP") +
  theme_bw

Figure 16: Percentage of updates from the same IP by package

We can see that some packages occasionally rise above the mean.

4 Models

4.1 Position in Bioconductor

We can observe whether a package has been consistently among the most downloaded packages of Bioconductor:

PercDate <- stats[, .(Package, Downloads = Nb_of_downloads/sum(Nb_of_downloads)), by = Date]
PercDate <- PercDate[order(Date, Downloads), ] # Sort by date, then by share of downloads
ggplot(PercDate, aes(Date, Downloads, col = Package)) + 
  geom_line() + 
  theme_bw + 
  ggtitle("Downloads of packages by dates") +
  xlab("") + 
  ylab("% of downloads") +
  guides(col = FALSE) +
  scal + 
  theme

Figure 17: Downloads of packages in Bioconductor

We can see that usually the percentage of downloads of a package doesn’t reach 5% of the downloads of the category, and that there are huge differences between packages in the number of downloads. Let’s compare the most downloaded package to the other packages:

OrdDate <- PercDate[, .(Package, Ord = Downloads/max(Downloads)), by = Date]
ggplot(OrdDate, aes(Date, Ord, col = Package)) + 
  geom_line() + 
  theme_bw + 
  ggtitle("Downloads of packages by dates") +
  xlab("") + 
  ylab("% of the most downloaded package") +
  guides(col = FALSE) +
  scal + 
  theme

Figure 18: Position of packages in Bioconductor

We can observe that usually there are few packages close to the most downloaded package, and lots of them are far from it. We can rank them to see the evolution of a package’s downloads over time:

rankDate <- OrdDate[, .(Package, rank = rank(Ord)/.N), by = Date]
ggplot(rankDate, aes(Date, rank, col = Package)) + 
  geom_line() + 
  theme_bw + 
  ggtitle("Rank of packages by downloads") +
  xlab("") + 
  ylab("Position by downloads") +
  guides(col = FALSE) +
  scal + 
  theme

Figure 19: % of packages in Bioconductor
Closer to 1 indicates the top downloaded packages.
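The three steps that lead to this relative rank (share of the month's downloads, share relative to the top package, and rank divided by the number of packages) can be traced on one month of made-up data. The `toy` table and the `Share` column name are hypothetical:

```r
library(data.table)

# One month of made-up downloads for four packages
toy <- data.table(
  Date = as.POSIXct("2017-01-01", tz = "UTC"),
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Nb_of_downloads = c(400, 100, 300, 200)
)

# Share of the month's downloads taken by each package
toy[, Share := Nb_of_downloads / sum(Nb_of_downloads), by = Date]
# Share relative to the most downloaded package of the month
toy[, Ord := Share / max(Share), by = Date]
# Rank scaled to (0, 1]; 1 is the most downloaded package
toy[, rank := rank(Ord) / .N, by = Date]
toy
```

Dividing by `.N` makes months with different numbers of packages comparable, at the cost of hiding how many packages there were.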

Only if we select some packages to follow can we track them:

packages <- c("limma", "GOSemSim", "BioCor", "Clonality", "Prostar", 
             "rintact", "bioassayR", "DESeq", "DESeq2", "edgeR")
ggplot(rankDate[Package %in% packages, ], aes(Date, rank, col = Package))+ 
  geom_line() + 
  theme_bw + 
  ggtitle("Relative rank of packages by downloads") +
  xlab("") + 
  ylab("Position by downloads") +
  scal + 
  theme

Figure 20: Evolution of downloads of packages in Bioconductor

Here we can see that limma has been one of the top downloaded packages since 2009; edgeR and GOSemSim have grown to reach the top downloaded packages, while other packages haven’t reached the top 25% of packages by downloads in a month.

4.2 Package cycle

We can also observe when a package reached its maximum number of downloads:

PercPack <- stats[, .(Date, Downloads = Nb_of_downloads/sum(Nb_of_downloads)), by = Package]
ggplot(PercPack, aes(Date, Downloads, col = Package)) + 
  geom_line() + 
  theme_bw + 
  ggtitle("Growth of the packages") +
  xlab("") + 
  ylab("Fraction of the package's downloads") +
  guides(col = FALSE) +
  scal + 
  theme

Figure 21: Cycle of packages
Percentage of downloads of a package over time.

Here we can see the date when a package reached its highest number of downloads. For most packages the highest downloads are in recent months.
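The month of the peak can also be extracted directly instead of being read from the plot. A minimal sketch on made-up data, reusing the `.SD[which.max(...)]` idiom from earlier in the analysis (`toy` and `peak` are hypothetical names):

```r
library(data.table)

# Toy stand-in for `stats`; the values are made up for illustration.
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgB"),
  Date = as.POSIXct(c("2017-01-01", "2017-02-01", "2017-03-01",
                      "2017-01-01", "2017-02-01", "2017-03-01"), tz = "UTC"),
  Nb_of_downloads = c(10, 20, 30, 50, 40, 5)
)

# Row with the highest number of downloads for each package
peak <- toy[, .SD[which.max(Nb_of_downloads)], by = Package]
peak
```

In this toy example `pkgA` is still growing (peak in the last month) while `pkgB` peaked in the first month and has declined since.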

OrdPack <- PercPack[, .(Date, rank = Downloads/max(Downloads)), by = Package]
ggplot(OrdPack, aes(Date, rank, col = Package)) + 
  geom_line() + 
  theme_bw + 
  ggtitle("Growth of the packages") +
  xlab("") + 
  ylab("Downloads/max(Downloads)") +
  guides(col = FALSE) +
  scal + 
  theme

Here we can see when each package has the most downloads over time. As usual, we need to focus on fewer packages to be able to distinguish between them and see their evolution:

ggplot(OrdPack[Package %in% c(packages, "RTools4TB", "SemSim"), ], 
       aes(Date, rank, col = Package)) + 
  geom_line() + 
  theme_bw + 
  ggtitle("Growth of the packages") +
  xlab("") + 
  ylab("Downloads/max(Downloads)") +
  scal + 
  theme

Figure 22: Cycle of few packages
Position of package downloads with respect to the maximum downloads of each package over time.

As expected, the packages that keep up with Bioconductor growth have a peak near the end of the series. For this reason I added the packages SemSim and RTools4TB, to show packages that have been downloaded less and less.

We can combine both models into a single one to have a timeless comparison of the packages. This way we will know whether a package reaches its maximum position in Bioconductor when it has the most downloads.

model <- rankDate[OrdPack, on = c("Package", "Date")]
setnames(model, "rank", "rank.B")
setnames(model, "i.rank", "rank.P")
ggplot(model[Package %in% c(packages,"RTools4TB", "SemSim")], aes(rank.P, rank.B, col = Package)) +
  geom_point() + 
  theme_bw + 
  geom_line() +
  ylab("Position in Bioconductor") + 
  xlab("Position in package")

Figure 23: Relative position of the packages
Position in Bioconductor downloads and position of package downloads.

This is a timeless comparison of the growth of packages in Bioconductor and against themselves. The higher on the y-axis, the higher they have been in Bioconductor, and the higher on the x-axis, the more downloads they had. Next we add the time as a factor, so that we standardize by the time a package has been in Bioconductor. We have omitted months with 0 downloads of a package, so this might bias the ranking a bit.
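How the two rankings end up side by side, and how the date is turned into a relative position, can be sketched on one made-up package. The tables `rank_bioc` and `rank_pkg` stand in for `rankDate` and `OrdPack`, and all values are hypothetical:

```r
library(data.table)

dates <- as.POSIXct(c("2017-01-01", "2017-02-01", "2017-03-01"), tz = "UTC")

# Made-up position within Bioconductor (stand-in for rankDate)
rank_bioc <- data.table(Package = "pkgA", Date = dates, rank = c(0.2, 0.5, 0.9))
# Made-up position relative to the package's own maximum (stand-in for OrdPack)
rank_pkg <- data.table(Package = "pkgA", Date = dates, rank = c(0.1, 0.4, 1.0))

# Both tables have a `rank` column; after the join the one from
# rank_pkg is prefixed with "i."
toy_model <- rank_bioc[rank_pkg, on = c("Package", "Date")]
setnames(toy_model, c("rank", "i.rank"), c("rank.B", "rank.P"))

# Relative position of each month within the package's own history
toy_model[, rank.D := rank(as.numeric(Date)) / .N, by = Package]
toy_model
```

Because `rank.D` only counts the months that are present, a package with gaps (months dropped for having 0 downloads) gets evenly spaced positions even though the real months are not evenly spaced.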

model[, rank.D := rank(as.numeric(Date))/.N, by = Package]
ggplot(model[Package %in% c(packages,"RTools4TB", "SemSim")], aes(rank.D, rank.B, col = Package)) +
  geom_point() + 
  theme_bw + 
  geom_line() +
  xlab("By date") + 
  ylab("Position in Bioconductor")

Figure 24: Relative position of the packages
Position relative to Bioconductor and itself compared to date.

  
ggplot(model[Package %in%  c(packages,"RTools4TB", "SemSim")], aes(rank.D, rank.P, col = Package)) +
  geom_point() + 
  theme_bw + 
  geom_line() +
  xlab("By date") + 
  ylab("Position in Package")