A computer program for the production of horizon graphs from multiple samples of single-gene expression data

Date
2012
DOI
Authors
Almozlino, Adam
Version
Embargo Date
Indefinite
OA Version
Citation
Abstract
In bioinformatics, as in many fields of science, it is often necessary to analyze collections of data sets. These data sets are often composed of matched values for an independent and a dependent variable. For example, a chemist may examine the absorption spectra of a group of compounds, a sociologist might examine life expectancy versus income in different communities, or a bioinformatician might examine expression at individual nucleotide positions in an organism's genome. In these examples, the independent variables are frequency, income, and position in the genome, respectively, and the dependent variables are absorption, life expectancy, and expression, respectively. Previously, such a group of data sets would have been displayed as a collection of individual tiled line graphs, or by superimposing multiple line graphs in the same plot area. The Horizon Graph is a recently developed data visualization method for these sorts of data sets, and it outperforms these alternatives. A Horizon Graph is composed of a series of "bars", each of which displays a single data set. Each bar encodes the independent variable along the horizontal axis and the dependent variable using both color and the vertical axis. The bars are stacked vertically, resulting in a single plot that enhances a user's ability to recognize patterns within and between the data sets.
We wrote a program in Java that generates a Horizon Graph. The program accepts a collection of data sets describing mRNA expression of a common genomic region across multiple samples. It then generates a plot that represents the splicing of the genomic region. The program can generate one of two types of Horizon Graph. The first plot type shows splicing for each of the data sets. The second plot type shows the splicing of every data set relative to the splicing of a "control" subset of the data sets. The program also offers several other options for the produced plot, including the ability to mark contiguous ranges of nucleotide positions, which can be used to mark regions of exogenous sequences. The individual tasks required to generate a plot from the data sets were separated into isolated Java classes. These classes interact through well-defined inputs and outputs and are coordinated by another, separate class. The resulting structure makes it easy for future users to reuse portions of the program in novel contexts. The program was demonstrated on a collection of data sets from the mef2d gene locus, each from a unique tissue type.
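The way a bar encodes the dependent variable using both color and height can be illustrated with a minimal Java sketch. This is not the thesis's code: the class name `HorizonBands`, the method names, and the parameters `maxAbs` and `bands` are hypothetical. The sketch assumes the common horizon-graph construction, in which the magnitude of a value is cut into equal-height bands that are drawn on top of one another in progressively darker shades, while the sign of the value selects the hue.

```java
import java.util.Arrays;

/** Hypothetical helper (not from the thesis): folds one value into horizon-graph bands. */
final class HorizonBands {

    /**
     * Returns, for each of {@code bands} layers, the fraction (0..1) of the bar
     * height that the layer fills for the magnitude of {@code value}.
     * Layer 0 is the lightest shade; higher layers would be drawn darker.
     * Magnitudes above {@code maxAbs} are clamped, so every layer is full.
     */
    static double[] fold(double value, double maxAbs, int bands) {
        if (maxAbs <= 0 || bands <= 0) {
            throw new IllegalArgumentException("maxAbs and bands must be positive");
        }
        double bandHeight = maxAbs / bands;
        double magnitude = Math.min(Math.abs(value), maxAbs);
        double[] fill = new double[bands];
        for (int i = 0; i < bands; i++) {
            // Portion of the magnitude that reaches into layer i, scaled to 0..1.
            double above = magnitude - i * bandHeight;
            fill[i] = Math.max(0.0, Math.min(1.0, above / bandHeight));
        }
        return fill;
    }

    /** The sign would pick the hue, e.g. one color for positive and another for negative values. */
    static boolean isNegative(double value) {
        return value < 0;
    }

    public static void main(String[] args) {
        // A value of 7.5 on a 0..9 scale with 3 bands fills two layers and half of the third.
        System.out.println(Arrays.toString(fold(7.5, 9.0, 3)));
    }
}
```

Applying `fold` to every position of a data set gives the layer heights for one bar. For the second plot type, the same folding could be applied to each value after the corresponding control value has been subtracted. That subtraction is also an assumption here, since the abstract does not state how the comparison against the control subset is computed.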
Description
Thesis (M.A.)--Boston University
PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at open-help@bu.edu. Thank you.
License