Computational approaches for whole-transcriptome cancer analysis based on RNA sequencing data
RNA-seq (whole-transcriptome shotgun sequencing) provides an ideal platform for studying the complete set of transcripts present at a specific developmental stage or under a specific physiological condition. It reveals not only expression-level changes but also structural changes in the coding sequences, including gene rearrangements. In this dissertation, I present my contributions to the development of computational tools for the robust and efficient analysis of RNA-seq data in support of cancer research. To automate the laborious and computationally intensive procedure of RNA-seq data management, I worked on the development of Hydra, an RNA-seq pipeline for the parallel processing and quality control of large numbers of samples. With user-friendly quality-control reports and run checkpoints, Hydra makes data processing fast, efficient and reliable. Here, I report my application of the pipeline to the analysis of patient-derived lymphoma xenograft samples, demonstrating Hydra’s ability to detect abnormalities (e.g., mouse tissue contamination) in the sequencing data. Because gene fusions play an important role in carcinogenesis, fusion detection has become an important area of methodological research. Several computational methods have been developed to identify fusion transcripts from RNA-seq data. However, all of these methods require realignment to the transcriptome, a computationally expensive task that is unnecessary in many cases. Here, I present QueryFuse, a novel gene-specific fusion-detection algorithm for aligned RNA-seq data. It is designed to help biologists quickly find and/or computationally validate fusions of interest, and to annotate the detected events with visualizations and detailed properties of the supporting reads. By restricting fusion detection to read pairs aligned to the query genes, QueryFuse not only reduces realignment time but also makes it affordable to use a more accurate, though computationally more expensive, local aligner.
In an extensive evaluation, QueryFuse performed comparably to or better than two widely adopted tools (deFuse and TopHat-Fusion) on two simulated datasets, as well as on cell line datasets with known fusions. Finally, I contributed to the identification, in clinical samples, of a novel fusion event in lymphoma with potential therapeutic implications. I validated this fusion in silico with my putative-reference method before it was validated experimentally.
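The gene-focused read selection described above can be illustrated with a minimal sketch. It is not QueryFuse's actual implementation: the `Mate` and `Region` types, the function name `select_candidate_pairs`, and the two output categories are hypothetical, and the sketch assumes that read pairs have already been aligned and reduced to simple coordinate records. It keeps only pairs in which exactly one mate overlaps the query gene, separating pairs whose other mate aligns elsewhere (candidate fusion-spanning pairs) from pairs whose other mate is unaligned (candidates for local realignment).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Mate:
    """One aligned (or unaligned) read of a pair. Hypothetical record type."""
    chrom: Optional[str]  # None when the mate is unaligned
    start: int = 0        # 0-based, inclusive
    end: int = 0          # 0-based, exclusive


@dataclass(frozen=True)
class Region:
    """Genomic interval of the query gene. Hypothetical record type."""
    chrom: str
    start: int
    end: int


def overlaps(mate: Mate, region: Region) -> bool:
    """True if the mate is aligned and shares at least one base with the region."""
    return (
        mate.chrom == region.chrom
        and mate.start < region.end
        and region.start < mate.end
    )


def select_candidate_pairs(
    pairs: Iterable[Tuple[str, Mate, Mate]], query: Region
) -> Tuple[List[str], List[str]]:
    """Return (discordant, half_mapped) read-pair names for a query gene.

    Assumption of this sketch: only pairs with exactly one mate in the query
    gene are informative. Pairs with both mates inside the gene may still
    contain split reads, but they are skipped here for simplicity.
    """
    discordant: List[str] = []
    half_mapped: List[str] = []
    for name, mate1, mate2 in pairs:
        in1, in2 = overlaps(mate1, query), overlaps(mate2, query)
        if in1 == in2:
            continue  # both mates in the gene, or neither
        other = mate2 if in1 else mate1
        if other.chrom is None:
            half_mapped.append(name)  # other mate unaligned: realign locally
        else:
            discordant.append(name)   # other mate aligned outside the gene
    return discordant, half_mapped
```

Because only the selected pairs proceed to realignment, the number of reads passed to the slower local aligner scales with the coverage of the query gene rather than with the size of the whole dataset.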