Integrated Assessment of Genomic Correlates of Protein Evolutionary Rate
Franzosa, Eric A.
Gerstein, Mark B.
MetadataShow full item record
CitationXia, Yu, Eric A. Franzosa, Mark B. Gerstein. "Integrated Assessment of Genomic Correlates of Protein Evolutionary Rate" PLoS Computational Biology 5(6): e1000413. (2009)
Rates of evolution differ widely among proteins, but the causes and consequences of such differences remain under debate. With the advent of high-throughput functional genomics, it is now possible to rigorously assess the genomic correlates of protein evolutionary rate. However, dissecting the correlations among evolutionary rate and these genomic features remains a major challenge. Here, we use an integrated probabilistic modeling approach to study genomic correlates of protein evolutionary rate in Saccharomyces cerevisiae. We measure and rank degrees of association between (i) an approximate measure of protein evolutionary rate with high genome coverage, and (ii) a diverse list of protein properties (sequence, structural, functional, network, and phenotypic). We observe, among many statistically significant correlations, that slowly evolving proteins tend to be regulated by more transcription factors, deficient in predicted structural disorder, involved in characteristic biological functions (such as translation), biased in amino acid composition, and are generally more abundant, more essential, and enriched for interaction partners. Many of these results are in agreement with recent studies. In addition, we assess information contribution of different subsets of these protein properties in the task of predicting slowly evolving proteins. We employ a logistic regression model on binned data that is able to account for intercorrelation, non-linearity, and heterogeneity within features. Our model considers features both individually and in natural ensembles ("meta-features") in order to assess joint information contribution and degree of contribution independence. Meta-features based on protein abundance and amino acid composition make strong, partially independent contributions to the task of predicting slowly evolving proteins; other meta-features make additional minor contributions. The combination of all meta-features yields predictions comparable to those based on paired species comparisons, and approaching the predictive limit of optimal lineage-insensitive features. Our integrated assessment framework can be readily extended to other correlational analyses at the genome scale. Author Summary Proteins encoded within a given genome are known to evolve at drastically different rates. Through recent large-scale studies, researchers have measured a wide variety of properties for all proteins in yeast. We are interested to know how these properties relate to one another and to what extent they explain evolutionary rate variation. Protein properties are a heterogeneous mix, a factor which complicates research in this area. For example, some properties (e.g., protein abundance) are numerical, while others (e.g., protein function) are descriptive; protein properties may also suffer from noise and hidden redundancies. We have addressed these issues within a flexible and robust statistical framework. We first ranked a large list of protein properties by the strength of their relationships with evolutionary rate; this confirms many known evolutionary relationships and also highlights several new ones. Similar protein properties were then grouped and applied to predict slowly evolving proteins. Some of these groups were as effective as paired species comparison in making correct predictions, although in both cases a great deal of evolutionary rate variation remained to be explained. Our work has helped to refine the set of protein properties that researchers should consider as they investigate the mechanisms underlying protein evolution.