Integration of relational and hierarchical network information for protein function prediction
Date
2008
Authors
Jiang, Xiaoyu
Nariai, Naoki
Steffen, Martin
Kasif, Simon
Kolaczyk, Eric
Version
OA Version
Abstract
BACKGROUND: In the current climate of high-throughput computational
biology, the inference of a protein's function from related measurements, such as
protein-protein interaction relations, has become a canonical task. Most existing
methods treat this task as a classification problem, on a term-by-term basis, for each
term in a database, such as the Gene Ontology (GO) database, a popular, rigorous vocabulary
for biological functions. However, ontology structures are essentially hierarchies, with
top-to-bottom annotation rules that protein function predictions should in
principle follow. Currently, the most common approach to imposing these hierarchical
constraints on network-based classifiers is to apply transitive closure to their
predictions.

RESULTS: We propose a probabilistic framework to integrate information in
relational data, in the form of a protein-protein interaction network, and a hierarchically
structured database of terms, in the form of the GO database, for the purpose of protein
function prediction. At the heart of our framework is a factorization of local neighborhood
information in the protein-protein interaction network across successive ancestral terms in
the GO hierarchy. We introduce a classifier within this framework, with a computationally
efficient implementation, that produces GO-term predictions that naturally obey a
hierarchical 'true-path' consistency from root to leaves, without the need for further
post-processing.

CONCLUSION: A cross-validation study, using data from the yeast Saccharomyces
cerevisiae, shows our method offers substantial improvements over both standard
'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field
methods, whether in their original form or when post-processed to artificially impose
'true-path' consistency. Further analysis of the results indicates that these improvements
are associated with increased predictive capabilities (i.e., increased positive predictive
value), and that this increase is uniformly consistent across GO-term depths. Additional in
silico validation on a collection of new annotations recently added to GO confirms the
advantages suggested by the cross-validation study. Taken as a whole, our results show that
a hierarchical approach to network-based protein function prediction, one that exploits the
ontological structure of protein annotation databases in a principled manner, can offer
substantial advantages over the successive application of 'flat' network-based
methods.
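
The contrast drawn above, between post-processing 'flat' predictions with transitive closure and factorizing probabilities across ancestral terms, can be illustrated with a minimal sketch. It is not the classifier from the paper: the toy hierarchy, the term names, the scores, and the conditional probabilities are all hypothetical, and the hierarchy is simplified to a tree (each term has one parent), whereas GO allows multiple parents. The point is only that multiplying conditional probabilities along the path to the root yields marginals that never increase with depth, so thresholded predictions are 'true-path' consistent by construction.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy: term -> parent term (the root has parent None).
PARENT: Dict[str, Optional[str]] = {"root": None, "A": "root", "B": "A", "C": "A"}

# Hypothetical conditionals: P(protein has term | protein has the parent term).
COND: Dict[str, float] = {"root": 1.0, "A": 0.8, "B": 0.5, "C": 0.25}

# Hypothetical per-term scores from a 'flat' classifier that ignores the hierarchy.
FLAT: Dict[str, float] = {"root": 1.0, "A": 0.3, "B": 0.6, "C": 0.1}


def marginal(term: str) -> float:
    """Multiply conditional probabilities along the path from term to the root."""
    p = 1.0
    node: Optional[str] = term
    while node is not None:
        p *= COND[node]
        node = PARENT[node]
    return p


def threshold(scores: Dict[str, float], cutoff: float) -> Dict[str, bool]:
    """Turn per-term scores into yes/no predictions."""
    return {term: score >= cutoff for term, score in scores.items()}


def close_upward(labels: Dict[str, bool]) -> Dict[str, bool]:
    """Transitive closure: every ancestor of a predicted term is also predicted."""
    closed = dict(labels)
    for term, predicted in labels.items():
        node = PARENT[term] if predicted else None
        while node is not None:
            closed[node] = True
            node = PARENT[node]
    return closed


def is_true_path_consistent(labels: Dict[str, bool]) -> bool:
    """A predicted term must never have an unpredicted parent."""
    return all(
        PARENT[t] is None or labels[PARENT[t]]
        for t, predicted in labels.items()
        if predicted
    )


flat_labels = threshold(FLAT, 0.5)            # B predicted, its parent A is not
closed_labels = close_upward(flat_labels)     # post-processing repairs A
hier_labels = threshold({t: marginal(t) for t in PARENT}, 0.3)
```

In this toy setting the flat predictions violate the true-path rule until transitive closure is applied, while the factorized marginals need no repair for any cutoff, because each child's marginal is its parent's marginal multiplied by a number no greater than 1.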