Department of Mathematics and Statistics, Boston University, Boston, MA 02215, USA

Bioinformatics Program, Boston University, Boston MA, 02215, USA

Department of Genetics and Genomics, Boston University, Boston MA, 02118, USA

Department of Biomedical Engineering, Boston University, Boston MA, 02215, USA

Abstract

Background

In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing approaches pursue this task as a classification problem, on a term-by-term basis, for each term in a database such as the Gene Ontology (GO), a popular controlled vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top-to-bottom annotation rules that protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is the application of transitive closure to the predictions.

Results

We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing.

Conclusion

A cross-validation study, using data from the yeast

Background

Proteins are fundamental to the complex molecular and biochemical processes taking place within organisms. An understanding of their role is therefore critical in biology and bio-related areas, for purposes ranging from general knowledge to the development of targeted medicine and diagnostics. High-throughput sequencing technology has identified a tremendous number of genes with no known functional annotation. On average, as many as 70% of the genes in a genome have poorly known or unknown functions

Protein function prediction can take many forms. The traditional and most popular methodologies use homology modeling and sequence similarity to infer biochemical function

For example, microarrays are often used to cluster proteins into groups of genes that respond concordantly to a given environmental stimulus. When these groups are strongly enriched in proteins from a given biological process, such as insulin signaling, and also contain proteins without annotation, we often take the leap of faith and predict the unknown proteins to be associated with this process as well. Similarly, when two proteins are found to interact in a high-throughput assay, we also tend to use this as evidence of functional linkage.

However, enrichment and guilt by association are often highly misleading and can lead to a very high false positive rate if not used with caution. The work in

More broadly, the work in this paper is important in demonstrating that an important role can be played in this context by the knowledge captured in biological ontologies, when properly harnessed. That this should be the case is not obvious

Nevertheless, despite such concerns, our work here shows that in the present context of automated protein function prediction, the leverage of hierarchies grounded in biological ontologies can yield real, quantifiable advantages over 'flat' network-based approaches.

Objective

Computational protein function prediction is typically treated as a classification problem. From this perspective, given a protein

Protein-protein interaction (PPI) data are common, and have been used widely in the protein function prediction problem. A functional linkage graph is used to represent the information in the PPI, where nodes represent proteins and edges indicate pairwise interactions, as in Fig.

Visualization of a small PPI network and GO DAG

**Visualization of a small PPI network and GO DAG**. This plot contains two toy examples of a protein-protein interaction network and the Gene Ontology structure. (a) Schematic network of local protein interactions; (b) schematic GO hierarchy, where a thicker link indicates larger weight. Among the neighbors of the central protein in (a), 4 out of 5 are labeled with term

Databases of labels

This annotation rule suggests that when predicting the label of a term in the hierarchy, it is helpful to first consider whether the protein has the parent term or not. Thus, not only the neighbors labeled with the term of interest are informative, but also those labeled with the parent. For instance, to predict the label of term

As currently practiced in most instances, prediction of protein function is done with classifiers trained separately for each possible label

To further illustrate this, we show a toy GO hierarchy in Fig.

Illustration of obedience and disobedience to the true-path rule

**Illustration of obedience and disobedience to the true-path rule**. The plot demonstrates a small example of GO hierarchy with four terms A, B, C and D. The true annotations and the predicted probabilities of the terms for some protein are also given, in a format of "true annotation (probability)". We use this to illustrate predictions that are consistent and are not consistent with the

Most existing methods, as discussed earlier, predict protein function in a term-by-term fashion, without considering the relationship among terms. Suppose the probabilities in the plot are obtained from one such method. If we apply a cut-off of 0.5, a commonly used threshold in this field, we will predict that the protein is NOT annotated with term A, since the probability of having A is 0.4, less than 0.5, and yet is annotated with A's child C. This violates

The basic premise of this paper is that reference to this hierarchical relationship among labels is best incorporated in the initial stage of constructing a classifier, as a valuable source of information in and of itself. Our objective here is to demonstrate the power of this premise and to show that it may be tapped in the form of a single, coherent probabilistic classifier. In particular, we develop a probability model that integrates relational data and a hierarchy of labels, and illustrate its advantages in predicting protein function using a PPI network and the Gene Ontology (GO) hierarchy.

Related Work

Many methodologies have been proposed to predict protein functions. Most of the earlier methods tend to use a single source of protein information, such as PPI. Typical examples include the "Nearest-Neighbor" algorithm, also known as the "guilt-by-association" principle, and the Binomial-Neighborhood (BN) method

These earlier methods were followed later by a surge of interest in combining heterogeneous sources of protein information. For example, a machine learning approach integrating datasets of PPI, gene expression, hydropathy profile and amino acid sequences, in the form of different kernels, has been introduced

Motivated in part by seminal work of

In summary, combining relational protein data, such as PPI, and hierarchical structures, as in GO, in one probabilistic model to predict

Methods

Ontologies like GO are structured as directed acyclic graphs (DAGs), where a child term may have multiple parent terms. The DAG structure, with alternative paths from the root to internal and leaf terms, is one of the reasons that formal approaches to annotation predictions have been difficult. It is well known that computing the most likely assignment of values to variables in a DAG of size

We apply a minimal spanning tree (MST) algorithm to transform a DAG into a tree-structured hierarchy, preserving for each child the link to the parent with the heaviest weight
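As a concrete illustration, the child-keeps-heaviest-parent reduction can be sketched as follows; the term names, weights, and dict-based representation are illustrative assumptions, not the paper's data structures:

```python
# Sketch: collapse a weighted GO DAG to a tree by keeping, for each child
# term, only the edge to its heaviest-weight parent. Term IDs and weights
# here are illustrative.
def dag_to_tree(parent_edges):
    """parent_edges: dict mapping child -> list of (parent, weight) pairs."""
    tree = {}
    for child, parents in parent_edges.items():
        # keep the single parent connected by the largest weight
        best_parent, _ = max(parents, key=lambda pw: pw[1])
        tree[child] = best_parent
    return tree

dag = {
    "B": [("A", 1.0)],
    "C": [("A", 0.3), ("B", 0.9)],  # C has two parents; the link to B is heavier
}
tree = dag_to_tree(dag)
# tree == {"B": "A", "C": "B"}: every non-root term now has a unique parent
```

After this reduction each non-root term has exactly one parent, which is what makes the unique root-to-term path (and the factorization below) possible.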

As a result of this transformation, there now exists a unique path from the root term to any non-root term. That is, let t_d denote a term at depth d of the hierarchy; its unique path to the root is t_d, t_{d-1}, ..., t_1, t_0, with t_0 being the root, and t_{i-1} being the parent of t_i. For example, in Fig.

We propose to build a classifier in this setting based on the use of hierarchical conditional probabilities, where t_d is a GO term of interest. The binary variable X_d takes the value +1 if the protein is labeled with t_d; otherwise, it takes the value -1. Finally, the conditioning is on the neighborhood status relative to t_d. We will refer to

In the remainder of this section, we present certain model assumptions that in turn lead to a particular form for the probabilities

Assumptions

We assume that labels on proteins obey a Markov property with respect to the PPI. That is, the labeling of a protein is independent of that of any other proteins, given that of its neighbors. Similarly, we assume that a Markov property holds on the GO tree-structured hierarchy, meaning that for a given protein the status of a GO term label is independent of that of the other terms, given that of its parent.

In addition, we assume that for any given protein,

n_{ch} | n_{pa} ~ Binomial(n_{pa}, p_1), if the protein is labeled with t_{ch},

and

n_{ch} | n_{pa} ~ Binomial(n_{pa}, p_0), if the protein is labeled with t_{pa} but not with t_{ch},

where

• t_{ch} is the child term; t_{pa} is its parent;

• n_{ch} and n_{pa} are the protein's numbers of neighbors labeled with t_{ch} and with t_{pa}, respectively;

• p_1 is the probability with which neighbors of the protein are labeled with t_{ch} (being already labeled with t_{pa}), given that the protein itself is labeled with t_{ch};

• p_0 is the probability with which neighbors of the protein are labeled with t_{ch} (being already labeled with t_{pa}), given that the protein itself is not labeled with t_{ch} but is labeled with t_{pa}.

We refer to this overall set of model assumptions as the

Parameters p_1 and p_0 are term-specific: different terms have different p_1 and p_0. For a given term t_{ch}, all proteins share the same p_1 and p_0. They are estimated by a pseudo-likelihood approach, from the labeled training data, separately for each t_{ch} to be predicted. When calculating

More specifically, consider the training proteins annotated with t_{pa}, some of which are also annotated with t_{ch}. To simplify notation, let n_{ch,i} and n_{pa,i} be protein i's numbers of neighbors annotated with t_{ch} and t_{pa}, respectively. For the t_{ch}-annotated proteins, we have

n_{ch,i} ~ Binomial(n_{pa,i}, p_1),

and the estimate of p_1 is obtained by maximizing the resulting pseudo-likelihood over all t_{ch}-annotated proteins.

The estimator for p_1 is based on all t_{ch}-annotated proteins' neighborhoods in the training set, and is the ratio of the total number of their t_{ch}-annotated neighbors to the total number of their t_{pa}-annotated neighbors, i.e.,

\hat{p}_1 = \left( \sum_{i \in S_1} n_{ch,i} \right) \Big/ \left( \sum_{i \in S_1} n_{pa,i} \right),

where S_1 denotes the set of t_{ch}-annotated proteins in the training set.

Similarly, the estimator for p_0 is based on all t_{ch}-unannotated proteins' neighborhoods in the training set, and is the ratio of the total number of their t_{ch}-annotated neighbors to the total number of their t_{pa}-annotated neighbors,

\hat{p}_0 = \left( \sum_{i \in S_0} n_{ch,i} \right) \Big/ \left( \sum_{i \in S_0} n_{pa,i} \right),

where S_0 denotes the set of t_{pa}-annotated training proteins not annotated with t_{ch}.
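A minimal sketch of these pooled ratio estimators, with illustrative neighbor counts (the function and variable names are ours, not the paper's):

```python
# Sketch of the ratio estimators for p1 and p0 described above: p1 pools the
# neighborhoods of t_ch-annotated training proteins, p0 those of
# t_ch-unannotated (but t_pa-annotated) proteins.
def estimate_p(neighbor_counts):
    """neighbor_counts: list of (n_ch, n_pa) pairs, one per training protein,
    giving its numbers of t_ch- and t_pa-annotated neighbors."""
    total_ch = sum(n_ch for n_ch, _ in neighbor_counts)
    total_pa = sum(n_pa for _, n_pa in neighbor_counts)
    return total_ch / total_pa if total_pa > 0 else 0.0

# neighborhoods of t_ch-annotated proteins -> p1
p1 = estimate_p([(2, 3), (1, 2)])   # (2+1)/(3+2) = 0.6
# neighborhoods of t_ch-unannotated proteins -> p0
p0 = estimate_p([(0, 4), (1, 6)])   # (0+1)/(4+6) = 0.1
```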

An issue in estimation is the lack of data: terms with few annotated proteins suffer in both predictability and interpretability. Thus, we focus on terms with at least 5 annotated proteins in the GO dataset. In principle, more formal work could be done, using smoothing techniques and Empirical Bayes approaches, which we are exploring in our current work. It appears that improvement is not uniform, and the issue clearly requires separate consideration and will likely form a substantial component of a separate paper. Its subtlety is likely due to the well-known issue of classifiers doing well for classification while still being off-target for estimation

Also notice that we use one-hop neighborhoods in this paper, i.e., neighbors that are directly connected to the protein of study. The extension to larger neighborhoods could be easily done, and would likely yield some improvement in predictive performance, but at the expense of some additional mathematical overhead, replacing the BN framework with one like those in

Local Hierarchical Conditional Probability

By the Markov property assumed on the GO hierarchy, for any non-root term, only the parent affects its labeling. Therefore, to derive an expression for our hierarchical conditional probabilities

Applying Bayes' rule, we have

For the first term in the numerator,

For the second term in the numerator, we use the plug-in estimate

For the denominator, we apply the law of total probability; together with the two results above, the probability in (1) can then be expressed as

where
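One way to assemble the pieces of this derivation is sketched below: binomial likelihoods under p_1 and p_0, a plug-in prior q for the probability of having the child term given the parent, and the law of total probability in the denominator. The function signature and the numbers are illustrative assumptions, not the paper's exact implementation:

```python
from math import comb

def local_conditional(n_ch, n_pa, p1, p0, q):
    """P(X_ch = +1 | neighborhood, X_pa = +1) assembled via Bayes' rule.
    n_ch, n_pa: neighbor counts; p1, p0: neighborhood rates estimated from
    training data; q: plug-in estimate of P(X_ch = +1 | X_pa = +1)."""
    def binom(n, k, p):
        # binomial probability of k successes in n trials
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    like1 = binom(n_pa, n_ch, p1)          # likelihood if the protein has t_ch
    like0 = binom(n_pa, n_ch, p0)          # likelihood if it does not
    denom = q * like1 + (1 - q) * like0    # law of total probability
    return q * like1 / denom

# two of three parent-annotated neighbors carry the child term
post = local_conditional(n_ch=2, n_pa=3, p1=0.6, p0=0.1, q=0.5)
```

With these illustrative numbers the child-annotated neighbors are far more consistent with p_1 than with p_0, so the posterior ends up well above the prior q.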

Global Hierarchical Conditional Probability

Equipped with the local hierarchical conditional probability, for any non-root GO term t_d in the hierarchy, we now derive an expression for the probability that the protein is labeled with t_d given its neighborhood status.

Note that, by the true-path rule, X_d = +1 implies X_{d-1} = +1, where t_{d-1} is the parent of t_d.

Hence,

This logic easily extends recursively back through all ancestors of t_d, and thus the conditional probability (3) can be factorized as

P(X_d = +1 | \text{neighborhood}) = \prod_{m=1}^{d} P(X_m = +1 | X_{m-1} = +1, \text{neighborhood}),

where each factor is the local hierarchical conditional probability linking t_m and t_{m-1}.

Importantly, note that due to the form of the factorization, the global conditional probability for t_d is no greater than that for its parent t_{d-1}, i.e., we have the inequality

P(X_d = +1 | \text{neighborhood}) \le P(X_{d-1} = +1 | \text{neighborhood}).

As we go down along a path from the root of the hierarchy, the probability that the protein is annotated with the successive terms can only decrease.
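The factorization and the resulting monotonicity can be sketched as follows (the local probabilities are illustrative values, not estimates from data):

```python
# The factorization multiplies local conditionals down the path from the
# root, so the global probability can only decrease with depth.
def global_probabilities(local_probs):
    """local_probs: local conditionals P(X_m = +1 | X_{m-1} = +1, neighborhood)
    for m = 1..d along the unique root-to-term path. Returns the global
    conditional probability at each depth."""
    out = []
    running = 1.0          # the root is labeled with probability 1
    for p in local_probs:
        running *= p       # multiply in the next local conditional
        out.append(running)
    return out

g = global_probabilities([0.9, 0.8, 0.5])
# g ≈ [0.9, 0.72, 0.36] -- non-increasing, as the inequality requires
```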

Algorithm

Classification using our Hierarchical Binomial-Neighborhood (HBN) model may be accomplished using a straightforward top-to-bottom algorithm. Specifically, for a given protein

**initialize** P(root) = 1

**for** d = 1, ..., d_max

**while** ∃ unlabeled terms t_m at level d

**compute** the global probability P(X_m = +1 | neighborhood) via the factorization

**if** it meets the threshold, label t_m with +1

**else** set t_m and all of its descendants to -1

**end**

**end**

Notice that setting the labels at each step is not strictly necessary. However, doing so improves computational efficiency, by avoiding the calculation of probabilities that fall below the threshold. By letting

For a given protein, the algorithm requires at most O(N_{GO}) steps, where N_{GO} is the number of GO terms, and therefore, for N_{Protein} proteins, no more than O(N_{Protein} · N_{GO}) steps are needed. Hence, the algorithm is linear in the size of both the PPI and the GO networks. In practice, it has been found to be quite fast, particularly because each protein can be expected to have a large proportion of -1 labels, and once a -1 is assigned to a term it is simply propagated to all descendant terms.
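A minimal sketch of this top-down pass, assuming the tree is given as a child list and the local conditional probabilities have already been computed (both supplied here as plain dicts for illustration):

```python
# Sketch of the top-down HBN labeling pass: descend the tree, label a term
# +1 when its global probability clears the threshold, and propagate -1 to
# every descendant of a rejected term without computing their probabilities.
def hbn_label(children, local_prob, threshold=0.5):
    labels = {"root": +1}
    prob = {"root": 1.0}
    frontier = ["root"]
    while frontier:
        term = frontier.pop()
        for child in children.get(term, []):
            p = prob[term] * local_prob[child]   # global = product of locals
            if p >= threshold:
                labels[child] = +1
                prob[child] = p
                frontier.append(child)           # keep descending
            else:
                # reject the child and all of its descendants at once
                stack = [child]
                while stack:
                    t = stack.pop()
                    labels[t] = -1
                    stack.extend(children.get(t, []))
    return labels

children = {"root": ["A"], "A": ["B", "C"], "B": ["D"]}
local = {"A": 0.9, "B": 0.4, "C": 0.8, "D": 0.9}
labels = hbn_label(children, local)
# A: 0.9 -> +1; B: 0.9*0.4 = 0.36 -> -1 (D inherits -1); C: 0.72 -> +1
```

Note how D's probability is never computed: once B is rejected, the true-path rule settles every term below it.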

Results

Data

The PPI data used in this paper is from the yeast

The Gene Ontology used is

From the initial data, a set of labels is constructed in a way that follows the

Please visit

Cross-Validation Study

We apply our Hierarchical Binomial-Neighborhood (HBN) method, as well as the "Nearest-Neighbor" (NN) algorithm and the Binomial-Neighborhood (BN) method of

Evaluation

We use three metrics by which to evaluate the performance characteristics of each classification method. The first is the standard Receiver Operating Characteristic (ROC) curve, which evaluates a classifier's performance in a manner that aggregates across all terms. We examine ROC curves both for the overall GO hierarchy and within each of the 47 sub-hierarchies.

Since the ROC curve, as a metric, is 'flat', in that it ignores any hierarchical structure among terms, we use as a second metric a hierarchical performance measure, called hF_β

Next, for each protein i, let P_i and T_i denote the sets of predicted and true GO terms, respectively, each augmented with all of their ancestors in the hierarchy. Define hierarchical precision (hP) and hierarchical recall (hR) as

hP = |P_i ∩ T_i| / |P_i| and hR = |P_i ∩ T_i| / |T_i|.

The value hF_β is then defined as the weighted harmonic mean of hP and hR,

hF_β = (1 + β²) · hP · hR / (β² · hP + hR),

where β controls the relative weight on precision and recall. We use hF_1, with equal weights on precision and recall, simply denoted as hF.
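Under one common definition of these hierarchical measures, consistent with the description above (the ancestor map and term names are illustrative), the computation can be sketched as:

```python
# Sketch of hierarchical precision/recall/hF: predicted and true term sets
# are first closed under ancestors, then compared.
def ancestor_closure(terms, parent):
    closed = set()
    for t in terms:
        while t is not None:
            closed.add(t)
            t = parent.get(t)    # walk up toward the root
    return closed

def h_f(pred, true, parent, beta=1.0):
    P = ancestor_closure(pred, parent)
    T = ancestor_closure(true, parent)
    overlap = len(P & T)
    hp = overlap / len(P)        # hierarchical precision
    hr = overlap / len(T)        # hierarchical recall
    b2 = beta * beta
    return (1 + b2) * hp * hr / (b2 * hp + hr)

parent = {"A": None, "B": "A", "C": "A", "D": "B"}
score = h_f(pred={"D"}, true={"C"}, parent=parent)
# P = {A, B, D}, T = {A, C}: overlap 1, hP = 1/3, hR = 1/2 -> hF = 0.4
```

The ancestor closure is what gives the metric partial credit: predicting D when the truth is C still scores on their shared ancestor A.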

Lastly, because accurate positive predictions are of most biological interest in this area, and because predictions of terms increasingly deeper in the GO hierarchy are of increasingly greater use, we examine the positive predictive value (PPV) of each of the methods, as a function of depth in the hierarchy. However, as the prevalence of known terms tends to decrease substantially with depth, and PPV decreases similarly with decreasing prevalence, we normalize PPV by prevalence to allow meaningful comparison across depths. Specifically, we compute a log-odds version of PPV in the form

log[PPV / (1 − PPV)] − log[prev / (1 − prev)],

where prev denotes the prevalence of annotation at the given depth.

An Illustration

To better appreciate the performance gains from HBN that we describe momentarily below, we first present an illustrative example. Consider protein YGL017W (AFT1) and its neighborhood, as depicted in Fig. Here the parent term is t_{pa} = GO:0045449 and the child term is t_{ch} = GO:0045941; three of the neighbors are labeled with t_{pa}, and two with t_{ch}. The prediction from HBN results from applying a threshold to Equation. The analogous probability for BN is given by

Illustration of HBN's working mechanism

**Illustration of HBN's working mechanism**. The plot shows (a) protein YGL017W and its neighborhood, and (b) a small GO hierarchy. Three neighbors are labeled with the parent term GO:0045449; two of them are labeled with the child term GO:0045941. We want to predict whether YGL017W is labeled with GO:0045941.

where


Table

Parameters from Nearest-Neighbor (NN), Binomial-Neighborhood (BN) and Hierarchical Binomial-Neighborhood (HBN)

NN: .

BN: .

HBN: p_1 = 0.2927, p_0 = 0.0992

This table contains the parameters and the corresponding probabilities estimated by the three methods, as discussed in the paper, when predicting whether yeast gene YGL017W has GO term GO:0045941,

Cross-Validation Results

A comparison of the overall performance of the three methods, by ROC curves and the hF measure, favors HBN, with p-values below 10^{-5} for the comparison of HBN with BN and with NN. The gains of HBN over BN directly reflect the benefit of effectively integrating the GO hierarchical information into the construction of our classifier.

Overall method performance comparison by ROC curve

**Overall method performance comparison by ROC curve**. This plot demonstrates the ROC curves of the three methods based on the 5-fold cross-validation study on the whole yeast genome. Colors: HBN (red); BN (light blue); NN (blue).

Overall method performance comparison by hF

**Overall method performance comparison by hF measure**. This plot demonstrates the curves of

Overall method performance comparison by precision and recall

**Overall method performance comparison by precision and recall**. This plot demonstrates the precision versus recall curves of the three methods based on the 5-fold cross-validation study on the whole yeast genome. Colors: HBN (red); BN (light blue); NN (blue).

Recall that, as a result of our predicting only for GO terms annotated with fewer than 300 proteins in the database, the full

**ROC curves and hF plots for 47 sub-hierarchies in cross-validation study**. This file contains the ROC curves and plots of

Click here for file

These ROC plots are constructed using the original BN (and NN) predictions, without any correction for "true-path" consistency. However, the overwhelming improvement of HBN over BN indicated by the ROC curves is actually similar when the initial predictions of BN are post-processed by applying transitive closure. Specifically, HBN improves on BN in 28 of the sub-hierarchies, while BN outperforms HBN in only 4 sub-hierarchies. These results strongly suggest the validity of our premise as to the importance of incorporating hierarchical information in the GO database in the initial construction of a classifier. The

As an illustration, consider the performance on the sub-hierarchy corresponding to Fig. (hF_{NN} = 0.16, hF_{BN} = 0.35, and hF_{HBN} = 0.56).

Method performance comparison by ROC curve on sub-hierarchy GO:0050896

**Method performance comparison by ROC curve on sub-hierarchy GO:0050896**. The plot shows the ROC curves of the three methods based on the 5-fold cross-validation study on the sub-hierarchy with root GO term GO:0050896,

Method performance comparison by hF on sub-hierarchy GO:0050896

**Method performance comparison by hF on sub-hierarchy GO:0050896**. The plot shows the curves of

In contrast, Fig.

Method performance comparison by ROC curve on sub-hierarchy GO:0019538

**Method performance comparison by ROC curve on sub-hierarchy GO:0019538**. The plot shows the ROC curves of the three methods based on the 5-fold cross-validation study on the sub-hierarchy with root GO term GO:0019538,

Comparison of method performance by hF on sub-hierarchy GO:0019538

**Comparison of method performance by hF on sub-hierarchy GO:0019538**. The plot shows the curves of

Lastly, Fig.

Visualization of the averaged positive predictive value comparison

**Visualization of the averaged positive predictive value comparison**. The plot contains the curves of the averaged positive predictive values (PPV) over cross-validation folds of the three methods, against 1-NPV, the averaged negative predictive value (NPV). Colors: HBN (red); BN (light blue); NN (blue).

Visualization of the averaged log-odds positive predictive value comparison on GO hierarchy depth

**Visualization of the averaged log-odds positive predictive value comparison on GO hierarchy depth**. The plot demonstrates the curves of the averaged log-odds PPV over cross-validation folds of the three methods for NPV = 0.987, as a function of the GO hierarchy depth. Colors: HBN (red); BN (light blue); NN (blue).

Shown in Fig.

In Silico Validation Results

Recall that the above results are based on gene-GO term annotations in the January 2007 GO database. As an in silico validation, we also evaluated the methods on annotations newly added between successive releases of the database.

We applied HBN, BN, and NN in each of the 47 sub-hierarchies to genes that (i) were annotated with only the root term in the June 2006 database, and (ii) were assigned more specific functions in that sub-hierarchy in the May 2007 database. There were a total of 508 genes that had received at least one new annotation in one of the sub-hierarchies, with as few as 1 gene and as many as 74 genes per hierarchy. There were 33 sub-hierarchies having such genes. The methods were compared for their accuracy through the

**hF plots for 17 sub-hierarchies in the in silico study**. This file contains the plots of

Click here for file

Overall, most of the plots are consistent with the cross-validation results. Interestingly, however, there are a number of cases where HBN clearly outperforms NN and BN by a larger margin in the

**hF plots for new predictions on sub-hierarchy GO:0050896**. The plot shows the

**hF plots for new predictions on sub-hierarchy GO:0019538**. The plot shows the

Overall, these results suggest that the performance advantages of HBN indicated by the cross-validation study are, if anything, potentially understated.

Discussion

For a well-studied organism, such as

Biological and biomedical ontologies have become a prominent, and perhaps indispensable, tool in bioinformatics and biological research. GO in particular has been used in numerous papers to detect biological process enrichment of co-expressed genes, identify biological processes associated with disease, etc. However, in the vast majority of applications the hierarchical nature of GO is actually not being used directly. For example, in enrichment testing such as GSEA or GNEA we typically test, for every biological process, whether the differentially expressed genes in some condition are associated with that process more than expected by chance.

Thus while GO and other ontologies obviously organize biological knowledge in an intuitive fashion, the structure is not typically exploited for actual inference by predictive analysis tools. This is rather different from evolutionary analysis tools and genetics frameworks where probabilistic ancestor/descendant relationships in phylogenies (hierarchies) are exploited very directly with substantial practical and theoretical benefits.

Our work here suggests that similar developments of probabilistic frameworks are not only feasible, but promising, for improved protein function inference with gene ontologies. In addition, it suggests the need for further research to be done to clarify the utility of different representations for such purposes. Finally, it also raises the prospect of re-engineering ontologies or other similar representations, from the perspective of seeking to provide maximal value for probabilistic inference programs.

Conclusion

We have developed a probabilistic framework for automated prediction of protein function using relational information (e.g., a network of protein-protein interactions) which exploits the hierarchical structure of ontologies, and guarantees that the predictions obey a 'true-path' annotation rule. We have evaluated the performance of our method and compared it with two other network-based methods by both cross-validation and an in silico validation study.

Authors' contributions

XJ carried out the statistical study, implemented and performed the computation, and drafted the manuscript. NN prepared the datasets and helped with the computation. MS interpreted the results and took part in the analysis. SK participated in the design of the study and the analysis. EDK conceived of the study, participated in its design, supervised the analysis and finalized the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Russ Greiner for early feedback. This work was supported in part by NHGRI grant R01 HG003367-01A1, NIH award GM078987, NSF grant ITR-048715, and ONR award N00014-06-1-0096.