Machine learning methods in construction of transcriptional regulatory networks

Date
2012
Authors
Fan, Yue
Embargo Date
Indefinite
Abstract
The transcriptional regulatory network is a biological network that captures the interactions between transcription factor genes (TF-genes) and the target genes they regulate. Regulation of transcription controls the level of gene expression and thus governs many characteristics of cells. The primary mechanism of transcriptional regulation is DNA binding: a transcription factor usually binds to a DNA binding site, which is sometimes located in the promoter region of a target gene. Constructing the regulatory network can be decomposed into the sub-problems of identifying, for every known gene that produces a TF, its target genes, its binding motif (the common sequence pattern of its DNA binding sites), and the DNA binding sites themselves (nucleotide-level binding locations). Many tools have been developed in the last decade to solve these problems. This thesis presents a series of machine learning-based algorithms, making use of support vector machines (SVMs), which can be used to construct the transcriptional regulatory network. It also establishes a framework that enables other machine learning algorithms to be applied to this field. The connection between the new machine learning methods and traditional methods for solving the above problems also suggests that the machine learning methods introduced have the potential to identify optimal solutions based on the use of training examples of binding motifs, binding sites, and target genes of a given TF. Based on the insights of a pilot project (TFSVM), we first develop a motif discovery tool (SVMotif) to discover binding motifs from a set of pre-identified potential binding sequences. This tool, tested on the yeast genome, validates many previously identified motifs and also discovers novel ones. Besides identifying primary binding motifs, this tool also successfully identifies 20 secondary motifs at the p = 0.15 significance level.
To leverage the advantages of different motif discovery algorithms, an ensemble algorithm is then developed to integrate information from multiple position weight matrices (PWMs) produced by 5 commonly used motif discovery algorithms. A connection between the SVM-based methods and traditional PWM-based methods is described, which becomes the basis for integrating multiple PWMs by treating them as SVM-based weak learners. This ensemble method is tested on the three identification problems mentioned above, and it outperforms its 5 component algorithms on all tasks. Finally, a machine learning framework is proposed and implemented that uses network information to denoise gene expression feature vectors used for diagnosis and prognosis in biological and biomedical problems. Several local smoothing techniques from statistics are generalized to the graphs/networks obtained from the above and other network construction methods. We then apply the algorithm to denoising gene expression profiles; the resulting smoothed gene expression features significantly improve the accuracy of biological phenotype classification.
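As an illustration of how a PWM can be related to a linear classifier such as a linear SVM, the sketch below shows one standard observation: a PWM score is a dot product between the flattened matrix and a one-hot encoding of the sequence. This is not taken from the thesis and may differ from the formulation it describes. The motif counts, the uniform background of 0.25, the pseudocount, and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Flatten a DNA string into a 4 * len(seq) indicator vector."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log-odds weights.

    Assumes a uniform background; the pseudocount avoids log(0).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def pwm_score(pwm, seq):
    """Classic PWM score: sum the weight of the observed base per position."""
    return sum(col[base] for col, base in zip(pwm, seq))


def flatten(pwm):
    """Lay the PWM out in the same order as the one-hot encoding."""
    return [col[b] for col in pwm for b in BASES]


def linear_score(weights, seq):
    """The same score written as a dot product w . x, as in a linear model."""
    return sum(w * x for w, x in zip(weights, one_hot(seq)))


# Hypothetical counts for a 3-bp motif (illustrative only, not real data).
counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 1, "G": 8, "T": 0},
]
pwm = log_odds_pwm(counts)
weights = flatten(pwm)
```

Because the two scoring functions agree on every sequence of the motif's width, a weighted combination of several equal-width PWMs is again a linear function of the one-hot features, which is one way to picture PWMs acting as weak linear learners.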
Description
Thesis (Ph.D.)--Boston University
PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at open-help@bu.edu. Thank you.