A friend asked how my package handles the fact that pathway databases are incomplete. At the moment it doesn’t: it is not a smart package (garbage in, garbage out), so make sure the input information about which genes belong to which pathways is correct.
However, this got me thinking about how the size of pathways is distributed relative to the number of pathways a gene belongs to. More generally, I wonder whether some related distribution would let us infer an expected pathway size from the number of pathways a gene is involved in or, the other way around, deduce from the size of a pathway how many pathways its genes are in.
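As a starting point, the relation between the two quantities can be tabulated directly from an annotation. The sketch below is only an illustration on made-up data: the pathway names, gene names, and the helper functions `pathways_per_gene` and `size_by_membership` are all hypothetical and not part of BioCor.

```python
from collections import defaultdict
from statistics import mean

# Toy annotation (hypothetical names): pathway -> set of genes
pathways = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"A", "D", "E", "F"},
}


def pathways_per_gene(pathways):
    """Invert the annotation: gene -> set of pathways it belongs to."""
    gene_to_pathways = defaultdict(set)
    for name, genes in pathways.items():
        for gene in genes:
            gene_to_pathways[gene].add(name)
    return dict(gene_to_pathways)


def size_by_membership(pathways):
    """Map 'number of pathways a gene is in' -> mean size of those pathways."""
    sizes = defaultdict(list)
    for gene, names in pathways_per_gene(pathways).items():
        for name in names:
            # One observation per (gene, pathway) pair
            sizes[len(names)].append(len(pathways[name]))
    return {k: mean(v) for k, v in sorted(sizes.items())}


summary = size_by_membership(pathways)
print(summary)
```

With a real database, the same table (or the full lists behind the means) would show whether genes in many pathways tend to sit in larger or smaller pathways.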
This page collects some ideas I had around this topic, which I would like to explore with my package BioCor (or within Bioconductor), or which apply more generally to the pathway database problem.
One point I would like to address using BioCor is how similar pathway databases are. According to Pathguide there are 166 metabolic pathway resources. I would like to find out whether the information is redundant or whether one database is better than the rest, by looking at the similarity between genes and the genes annotated in both databases. I recently came up with the idea that metabolism is like a language and pathways are like sentences. I am considering using the following variables to compare databases:
I am not sure whether this would help to “settle” the discussion about pathways and which database to use, but it would at least help to choose a database for functional enrichment analysis or other related assessments.
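One simple way to begin such a comparison, independent of whichever variables end up being used, is to measure the overlap of the annotated genes and to look for the closest counterpart of each pathway in the other database. This is a minimal sketch on toy data; the database contents and the helpers `jaccard`, `annotated_genes`, and `best_matches` are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two toy databases (hypothetical names): pathway -> set of genes
db1 = {"path_a": {"g1", "g2", "g3"}, "path_b": {"g4", "g5"}}
db2 = {"path_x": {"g1", "g2", "g3", "g6"}, "path_y": {"g4", "g5"}}


def annotated_genes(db):
    """All genes that appear in at least one pathway of the database."""
    return set().union(*db.values())


def best_matches(db_a, db_b):
    """For each pathway in db_a, the highest Jaccard index with any pathway in db_b."""
    return {
        name: max(jaccard(genes, other) for other in db_b.values())
        for name, genes in db_a.items()
    }


gene_overlap = jaccard(annotated_genes(db1), annotated_genes(db2))
matches = best_matches(db1, db2)
print(gene_overlap, matches)
```

A high gene overlap with low best-match scores would suggest that the databases cover the same genes but cut them into pathways differently.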
One discussion I ended up in was whether it is possible to find new pathways for genes purely in silico. It is a slightly crazy idea, but can we find new pathways without any information beyond what we already have? The idea is to start with some database as a seed and then, through random (or not) re-sampling of genes labeled as a new pathway, see whether the similarity of the genes converges (or not).
One could restrict the newly created pathways so that they have a certain dissimilarity to all the previous pathways, or so that they follow a certain distribution from the first point. I think this is similar to how the MetaCyc project works, starting from a seed and expanding from there.
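The proposal step with a dissimilarity restriction could look roughly like the sketch below. It only covers drawing candidate pathways and rejecting those too close to existing ones; checking whether gene similarities converge would be a separate step. The seed database, the function `propose_pathways`, and its parameters (`size`, `max_similarity`, `max_tries`) are hypothetical choices, and the Jaccard index is just one possible similarity measure.

```python
import random


def jaccard(a, b):
    """Jaccard index of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propose_pathways(seed_db, n_new, size, max_similarity, rng, max_tries=1000):
    """Randomly sample gene sets, keeping those dissimilar enough to all known ones."""
    genes = sorted(set().union(*seed_db.values()))
    known = dict(seed_db)  # seed pathways plus accepted new ones
    new = {}
    tries = 0
    while len(new) < n_new and tries < max_tries:
        tries += 1
        candidate = set(rng.sample(genes, size))
        # Restriction: not too similar to any previous pathway
        if all(jaccard(candidate, p) <= max_similarity for p in known.values()):
            name = f"new_{len(new) + 1}"
            new[name] = candidate
            known[name] = candidate
    return new


# Toy seed database (hypothetical names): pathway -> set of genes
seed_db = {
    "P1": {"A", "B", "C"},
    "P2": {"C", "D", "E"},
    "P3": {"E", "F", "G", "H"},
}
new_pathways = propose_pathways(
    seed_db, n_new=3, size=3, max_similarity=0.5, rng=random.Random(1)
)
print(new_pathways)
```

Passing a seeded `random.Random` keeps a run reproducible, which matters if the resampling is repeated to study convergence.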
The features of these new pathways could be:
I don’t know how the addition of pathways affects the similarity between genes. I would like to explore this with two or three databases.
To compare two pathways we have the following parameters:
Currently BioCor only uses the similarity between the pathways of two genes to calculate how similar the genes are, which normalizes for the length of the pathways and the number of pathways of the two genes. This could be made more elaborate, or improved, using the above variables, depending on how much information each variable carries.
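To make the general idea concrete, here is a small sketch of deriving a gene similarity from pathway similarities. It is not BioCor's code: the Dice coefficient for pathway overlap and the best-match average used to combine the scores are assumptions chosen for illustration, and the data and function names are hypothetical.

```python
def dice(a, b):
    """Dice coefficient: 2 * |intersection| / (|a| + |b|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def gene_similarity(gene1, gene2, gene_to_pathways, pathways):
    """Best-match average of the pairwise similarities between the genes' pathways."""
    rows = [
        [dice(pathways[p], pathways[q]) for q in gene_to_pathways[gene2]]
        for p in gene_to_pathways[gene1]
    ]
    if not rows or not rows[0]:
        return 0.0  # at least one gene has no annotated pathway
    row_best = [max(row) for row in rows]        # best match for each pathway of gene1
    col_best = [max(col) for col in zip(*rows)]  # best match for each pathway of gene2
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Toy annotation (hypothetical names)
pathways = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"A", "D", "E", "F"},
}
gene_to_pathways = {"A": ["P1", "P2", "P3"], "B": ["P1", "P2"], "D": ["P3"]}

sim_ab = gene_similarity("A", "B", gene_to_pathways, pathways)
sim_bd = gene_similarity("B", "D", gene_to_pathways, pathways)
print(sim_ab, sim_bd)
```

The Dice coefficient accounts for pathway length, and averaging the best matches accounts for how many pathways each gene has, which is the kind of normalization described above.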