Analysis of genomic data to derive biological conclusions on (1) transcriptional regulation in the human genome and (2) antibody resistance in hepatitis C virus
High-throughput sequencing has become pervasive in all facets of genomic analysis. I developed computational methods to analyze high-throughput sequencing data and derive biological conclusions in two research areas: transcriptional regulation in mammals and viral evolution under immune pressure. To investigate transcriptional regulation, I integrated data from multiple experiments performed by the ENCODE consortium. First, my analysis revealed that transcription factors (TFs) prefer to bind GC-rich, histone-depleted regions. By comparing in vivo and in vitro nucleosome dynamics, I observed that while histones have an innate preference for binding GC-rich DNA, TF binding overrides this preference and produces a negative correlation between GC content and histone enrichment. Second, I found that the binding events of multiple TFs co-occur at genomic regions enriched in activating histone marks that are typically associated with gene enhancers and promoters, suggesting that these regions may be enhancers or sites of transcription distal to a transcription start site (TSS). Lastly, I trained supervised machine learning models on histone enrichment signals and sequence features to predict transcriptional enhancers for validation in transgenic mouse assays. In a post-clinical-trial exploratory analysis of hepatitis C virus (HCV), I traced the evolutionary path of the envelope proteins E1 and E2 in HCV-infected liver transplant patients in response to a novel antibody. I developed a systematic amino acid-level analysis pipeline that quantifies differences in amino acid frequencies at each position between two time points. When I applied this method across all positions in the E1/E2 region and compared pre-liver-transplant and post-viral-rebound time points, mutations at two positions emerged as key to antibody evasion. Both mutations, N415K/D and N417S, were in the epitope targeted by the antibody but, surprisingly, did not co-occur.
In post-rebound viral genomes that contain the N417S mutation but retain the wild-type residue at position 415, N-linked glycosylation at position 415 is another possible escape mechanism. Using the same analysis pipeline, I also identified additional candidate escape mutations outside the epitope, which could be potential therapeutic targets.
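
The position-wise comparison of amino acid frequencies between two time points can be illustrated with a minimal Python sketch. This is not the pipeline used in this work: the function names, the choice of total variation distance as the difference measure, and the three-residue toy sequences are all assumptions made for illustration, and the input is assumed to be already aligned.

```python
from collections import Counter


def position_frequencies(sequences):
    """Return one {amino_acid: frequency} dict per aligned position."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    frequencies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        frequencies.append({aa: n / total for aa, n in counts.items()})
    return frequencies


def frequency_shift(pre, post):
    """Per-position total variation distance between two time points.

    The distance is 0 when the amino acid distributions are identical
    and 1 when they share no residues. The choice of this measure is an
    assumption of the sketch, not taken from the source.
    """
    pre_freqs = position_frequencies(pre)
    post_freqs = position_frequencies(post)
    if len(pre_freqs) != len(post_freqs):
        raise ValueError("both time points must cover the same positions")
    shifts = []
    for p, q in zip(pre_freqs, post_freqs):
        residues = set(p) | set(q)
        shifts.append(0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in residues))
    return shifts


# Made-up toy alignment (hypothetical data, three positions per sequence).
pre_sequences = ["NAN", "NAN", "NAN", "NAN"]
post_sequences = ["KAN", "NAS", "KAN", "NAS"]
shifts = frequency_shift(pre_sequences, post_sequences)
```

In this toy input, the first and third positions each change in half of the later sequences while the middle position is unchanged, so `shifts` ranks the two variable positions above the conserved one. Ranking positions by such a score is one simple way to shortlist candidate sites for closer inspection.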