Introduction

A friend asked how my package handles that the pathways databases are incomplete. At the moment it doesn’t, it is not a smart package (garbage in, garbage out), so make sure of the input information of which genes go to which pathways.

However this got me into thinking about the distribution of the size of the pathways along the number of a pathway a gene has. Or in some related distribution to try to infer if from the number pathways a gene is involved in we can deduce some expected pathway size or the other way around, if given a size of a pathway we can deduce in how many pathways are the genes.

This page is about some ideas I got around this topic, which I would like to explore with my package BioCor (or in Bioconductor), or in general to the pathways database problem.

Differences between databases

One point I would like to address using BioCor is how similar are pathway databases. According to Pathguide there are 166 metabolic pathways. I would like to find if the information is redundant or if there is one better database, by looking at the similarity between genes and the genes annotated in both databases. I recently come up with the idea that the metabolism is like a language and pathways are like sentences. I am considering using the following variables to compare databases:

  • Size of pathways: both the range and the information content
  • Number of pathway per genes: both the range and the information content
  • The number of pathways: One question that would require manual curation is if pathways with similar names have really the same genes involved.
  • The number of genes (and the agreement of genes used)
  • The n-grams of genes: both the number of each n-gram present and the information content, as well as which is the biggest n-gram present, more than once (otherwise that would be the same as the largest pathway)
  • The similarity between pathways (pathSim)
  • The similarity annotation for genes between (geneSim): It would be interesting to compare similarities present in BioCor through several databases.

I am not sure if this would help to “settle” the discussion about pathways and which database use, but at least would help to choose a database for functional enrichment analysis or other related assessments.

De novo pathway finder

One discussion I end up was if it was possible to find new pathways for genes purely in-silico. This is a bit crazy idea, but can we find new pathways without further information aside from what we already have? The idea is to start with some database as seed and then through random (or not) re-sampling of genes labeled as a new pathway see if the similarity of the genes converge (or not).

One could set restriction to the newly created pathways such that they should have certain dissimilarity with all the previous pathways or follow certain distribution from the first point. I think this is kind of alike how the MetaCyc project work, starting from a seed and expand from there.

The features of these new pathways could be:

  • Pathway length: Is a probability function from the seed
  • Pathway similarity to other pathways: must be bigger or equal to the previous existing similarities
  • Pathway must have a new gene not previously used (obvious)

I don’t know how the addition of pathways affect the similarity between genes. I would like to explore this with a couple or three databases.

Information available to compare two pathways

To compare two pathways we have the following parameters:

  • The number of pathways of each gene (and the probability that such coincidence happens)
  • The number of pathways where both genes are present (and the probability that such coincidence happens)
  • The number of pathways where one gene or the other is present (and the probability that such coincidence happens)
  • The similarity between their pathways (and the probability that such coincidence happens)
  • The length of the pathways of each gene (and the probability that such coincidence happens)
  • The length of the pathways where both genes are present (and the probability that such coincidence happens)
  • The length of the pathways where one gene or the other is present (and the probability that such coincidence happens)

Actually BioCor only uses the similarity between the pathways of two genes to calculate how similar they are, which normalize the length of the pathways and the number of pathways for the two genes. This could be complicated/improved using the above variables depending on how much information is carried by each variable.