High-Performance Software Development for Genomic Sequence Alignment and Analysis
Abstract
Nucleic acid sequencing technology is a powerful tool for understanding genetic information, and genomic data analysis software is critical for transforming complex sequencing results into meaningful biological information. Emerging sequencing technologies help scientists understand biological processes from multiple angles, but they also create the challenge of developing new sequence analysis tools, especially new alignment methods, to support them. In this dissertation, I developed HISAT-3N, a rapid and accurate sequence aligner, to solve the alignment problem posed by nucleotide conversion (NC) sequencing technologies. NC technologies, such as BS-seq and SLAM-seq, convert one type of nucleotide to another, which allows researchers to identify specific chemical modifications in DNA or RNA molecules. However, the conversions these technologies introduce make it difficult to align reads back to the reference genome. To solve this issue, I implemented the 3-letter alignment algorithm in HISAT2, which our lab developed previously, to create HISAT-3N. I thoroughly tested HISAT-3N and demonstrated that it is more accurate and more than seven times faster than widely used sequence aligners, and that it can support all types of nucleotide conversion sequencing technologies, including those that have not yet been developed. Additionally, to generalize the process of developing new alignment methods for new sequencing technologies, I created a platform for the modular design of sequence alignment software. This platform incorporates algorithms from HISAT2, STAR, and BWA, making it more efficient for developers to create novel sequence alignment software and giving users more flexibility to analyze different types of data in a variety of computational environments.
Finally, I developed a metagenomics analysis pipeline that organizes and manages several well-known sequence analysis tools for rapid and accurate soil microbial analysis. The successful development and implementation of these tools demonstrate the robustness of well-designed bioinformatics software and pipeline frameworks. Overall, my work emphasizes the importance of continuously improving genomic data analysis tools, both to support emerging sequencing technologies and to deliver more precise results that help researchers reveal valuable genetic information.
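
The 3-letter alignment idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the HISAT-3N implementation: the function names (`to_three_letter`, `find_hits`, `call_conversions`) and the toy sequences are assumptions made for illustration, only exact matching on one strand is shown, and a real aligner would also handle the reverse strand, mismatches, and an indexed reference. The sketch collapses the converted nucleotide pair (here C and T) in both read and reference so that conversions no longer appear as mismatches, then compares the original sequences to tell converted positions from unconverted ones.

```python
def to_three_letter(seq: str, src: str = "C", dst: str = "T") -> str:
    """Collapse `src` into `dst` so the sequence uses only three letters."""
    return seq.upper().replace(src, dst)


def find_hits(read: str, reference: str, src: str = "C", dst: str = "T") -> list[int]:
    """Return 0-based offsets where the read matches the reference exactly
    after both have been reduced to the 3-letter alphabet."""
    read3 = to_three_letter(read, src, dst)
    ref3 = to_three_letter(reference, src, dst)
    hits = []
    start = ref3.find(read3)
    while start != -1:
        hits.append(start)
        start = ref3.find(read3, start + 1)
    return hits


def call_conversions(read: str, reference: str, offset: int,
                     src: str = "C", dst: str = "T") -> tuple[list[int], list[int]]:
    """Compare the ORIGINAL read and reference at an aligned offset.

    Returns (converted, unconverted): reference positions holding `src`
    where the read shows `dst` (converted) or still shows `src` (unconverted).
    """
    converted, unconverted = [], []
    for i, base in enumerate(read.upper()):
        pos = offset + i
        if reference[pos].upper() == src:
            if base == dst:
                converted.append(pos)
            elif base == src:
                unconverted.append(pos)
    return converted, unconverted


# Toy example (hypothetical data): the read comes from reference offset 1,
# with the C at reference position 2 converted to T.
reference = "AACCGGTTACGT"
read = "ATCGGTT"

plain_hit = reference.find(read)   # -1: a plain exact match fails
hits = find_hits(read, reference)  # [1]: the 3-letter match succeeds
converted, unconverted = call_conversions(read, reference, hits[0])
# converted == [2], unconverted == [3]
```

Passing a different `src`/`dst` pair (for example `"T"` and `"C"`) adapts the same sketch to another conversion type, which is the sense in which a 3-letter approach is not tied to one specific protocol.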