Computational analyses of small silencing RNAs
High-throughput sequencing is a powerful tool for studying diverse aspects of biology, with applications in genome, transcriptome, and small RNA profiling. Ever-increasing sequencing throughput and increasingly specialized sequencing assays demand more sophisticated bioinformatics approaches. In this thesis, I present four studies for which I developed computational methods to handle high-throughput sequencing data and gain insights into biology. The first study describes the genome of High Five (Hi5) cells, originally derived from Trichoplusia ni eggs. The chromosome-level assembly (scaffold N50 = 14.2 Mb) contains 14,037 predicted protein-coding genes. Examination and curation of multiple gene families, pathways, and small RNA-producing loci reveal species- and order-specific features. The availability of the genome sequence, together with genome editing and single-cell cloning protocols, establishes Hi5 cells as a new tool for studying small RNAs. The second study focuses on a single class of piRNAs: those produced at the pachytene stage of mammalian spermatogenesis. Despite their abundance, pachytene piRNAs are poorly understood. I find that pachytene piRNAs cleave transcripts of protein-coding genes and also target transcripts from other pachytene piRNA loci. A systematic investigation of piRNA targeting, integrating different types of sequencing data, then uncovers the piRNA targeting rule. The third study describes computational procedures to map splicing branchpoints (BPs) using high-throughput sequencing data. Screening >1.2 trillion RNA-seq reads identifies >140,000 BPs for both human and mouse. These branchpoints are compiled into BPDB (BranchPoint DataBase) to provide a comprehensive branchpoint catalog. The final study combines novel experimental and computational procedures to handle the PCR duplicates that are prevalent in high-throughput sequencing data. Incorporating unique molecular identifiers (UMIs) to tag each read enables unambiguous identification of PCR duplicates. Both simulated and experimental datasets demonstrate that UMI incorporation increases the reproducibility of RNA-seq and small RNA-seq. A survey of seven common variables in high-throughput sequencing reveals that the amount of starting material and the sequencing depth, but not the number of PCR cycles, determine the PCR duplicate frequency. Finally, I show that removing PCR duplicates without UMIs introduces substantial bias into data analysis.
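The contrast between duplicate removal with and without UMIs can be illustrated with a minimal Python sketch. It is not the pipeline developed in the thesis: the `Read` record, the function names, and the toy reads are all hypothetical, and UMIs are compared by exact match, which ignores sequencing errors within the UMI itself.

```python
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for an aligned read carrying a UMI."""
    chrom: str
    pos: int
    strand: str
    umi: str


def count_molecules_with_umi(reads: Iterable[Read]) -> int:
    # Reads are PCR duplicates only if position, strand, AND UMI all match.
    return len({(r.chrom, r.pos, r.strand, r.umi) for r in reads})


def count_molecules_without_umi(reads: Iterable[Read]) -> int:
    # Coordinate-only deduplication: every read at the same position and
    # strand is treated as a duplicate, even if it came from another molecule.
    return len({(r.chrom, r.pos, r.strand) for r in reads})


def duplicate_fraction(reads: Iterable[Read]) -> float:
    # Fraction of reads that are UMI-confirmed PCR duplicates.
    reads = list(reads)
    if not reads:
        return 0.0
    return 1 - count_molecules_with_umi(reads) / len(reads)


# Toy data (made up for illustration only).
reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # true PCR duplicate of the read above
    Read("chr1", 100, "+", "TTGA"),  # distinct molecule at the same position
    Read("chr2", 500, "-", "GGCC"),
]

print(count_molecules_with_umi(reads))     # 3 molecules retained
print(count_molecules_without_umi(reads))  # 2: one real molecule is lost
print(duplicate_fraction(reads))           # 0.25
```

In this toy case, coordinate-only deduplication discards a read from a genuinely distinct molecule, which is the kind of bias the abstract refers to; highly expressed transcripts and short small RNAs, where many molecules share identical coordinates, would be affected most.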
Rights: Attribution-ShareAlike 4.0 International