Author: Rabea, Zeinab Mahmoud Ali./ Title: Long read sequence analysis /


Title

Long read sequence analysis /

Author

Rabea, Zeinab Mahmoud Ali.

Committee

Researcher / Zeinab Mohamed Ali Rabea

Supervisor / Magdy Zakaria

Supervisor / Samir Elmougy

Supervisor / Sara Metwally

Examiner / Hamed Mohamed Nasser

Subject

Computer Science. Future Works.

Publication date

2023.

Number of pages

Online resource (119 pages)

Language

English

Degree

Doctorate

Specialization

Computer Science

Approval date

1/1/2023

Place of approval

Mansoura University - Faculty of Computers and Information - Department of Computer Science


Abstract

The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for analyzing and processing sequencing data. The suffix array is one of the data structures used to construct the Burrows-Wheeler transform (BWT) of long genomes. Building a suffix array is itself resource-intensive, since the computation is dominated by sorting suffixes in lexicographic order. Most suffix array construction algorithms target general and integer alphabets and do not take advantage of special cases such as fixed-size alphabets like DNA. In this thesis, we exploit the four-letter DNA alphabet and its predefined lexicographic ordering to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets of different lengths, ranging from the small E. coli genome to the long Homo sapiens GRCh38.p13 chromosomes. For long genomes, the sequence is divided into parts (i.e., reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partial arrays are merged into a single one. We studied the effect of varying the read and overlap lengths on the running time of the proposed algorithm and conclude that the minimum overlap length should equal the average length of the longest common prefix between adjacent parts. The proposed suffix array construction algorithm for DNA alphabets is used to build SuffixAligner, a Python-based aligner for long noisy reads generated by third-generation sequencing machines. SuffixAligner exploits the fixed size and predefined lexical ordering of the biological alphabet to construct a suffix array for indexing a reference genome.
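To make the relationship between the suffix array and the BWT concrete, the following is a minimal sketch, not the thesis's algorithm. It assumes the input contains only the letters A, C, G, and T, appends a `$` terminator, buckets suffixes by their first letter using the fixed ordering `$ < A < C < G < T`, and falls back to plain string comparison inside each bucket. The names `DNA_RANK`, `suffix_array`, and `bwt_from_sa` are hypothetical and chosen for illustration only.

```python
# Fixed, predefined ordering of the DNA alphabet plus the terminator.
# Assumption: the input text contains only A, C, G and T.
DNA_RANK = {"$": 0, "A": 1, "C": 2, "G": 3, "T": 4}


def suffix_array(text):
    """Return the suffix array of text + '$' (illustrative, not optimized)."""
    s = text + "$"
    # Distribute suffix start positions into one bucket per first letter.
    buckets = {ch: [] for ch in DNA_RANK}
    for i, ch in enumerate(s):
        buckets[ch].append(i)
    sa = []
    # Visit buckets in the predefined alphabet order; sort each bucket by
    # direct suffix comparison (simple, but quadratic in the worst case).
    for ch in sorted(DNA_RANK, key=DNA_RANK.get):
        sa.extend(sorted(buckets[ch], key=lambda i: s[i:]))
    return sa


def bwt_from_sa(text, sa):
    """Derive the BWT: the character preceding each sorted suffix."""
    s = text + "$"
    # For i == 0, s[-1] is the terminator, which is the wrap-around character.
    return "".join(s[i - 1] for i in sa)


sa = suffix_array("GATTACA")
print(sa)                          # [7, 6, 4, 1, 5, 0, 3, 2]
print(bwt_from_sa("GATTACA", sa))  # ACTGA$TA
```

The first-letter bucketing only hints at how a small fixed alphabet can reduce sorting work; a practical construction for whole chromosomes would need a far more efficient strategy than comparing full suffixes.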
An FM-index is used to efficiently search the indexed reference and locate exactly matching seeds between the reads and the reference. The matched seeds are grouped into windows (clusters), and the windows with the largest number of seeds are reported as candidate mapping positions. SuffixAligner is evaluated against lordFAST, BWA, and Minimap2 using simulated data and real PacBio and Nanopore data sets. The results show that SuffixAligner mapped more reads than the other tools, with high sensitivity and alignment rates. SuffixAligner can also improve the results of other alignment tools by reading their output SAM files and searching for a mapping position on the reference for every unmapped read.
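The seed-and-window idea can be sketched as follows. This is an illustrative approximation under stated assumptions, not SuffixAligner's implementation: the index is built with a naive suffix sort, exact seeds are located with standard FM-index backward search, and each seed hit votes for a window of the implied read start position. The function names and the `seed_len` and `window` parameters are hypothetical.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    s = reference + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive construction
    bwt = "".join(s[i - 1] for i in sa)
    counts = Counter(s)
    # c_table[ch]: number of characters in s strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c_table, occ


def locate(seed, index):
    """Return sorted reference positions where seed matches exactly."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(seed):  # backward search, last character first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def candidate_windows(read, index, seed_len=4, window=8):
    """Count seed hits per window of implied read start, best first."""
    votes = Counter()
    for offset in range(len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        for pos in locate(seed, index):
            # pos - offset is where the read would start on the reference.
            votes[(pos - offset) // window] += 1
    return votes.most_common()


reference = "ACGTTGCAAGCTTAGGCTAACGGATCCA"
index = build_fm_index(reference)
print(locate("GCT", index))                          # [9, 15]
print(candidate_windows(reference[8:20], index)[0])  # (1, 9)
```

In the usage lines, the read is an error-free substring starting at reference position 8, so all nine of its seeds vote for window 1 (positions 8 to 15). With noisy long reads, some seeds would fail to match and votes would spread across neighboring windows, which is why the top-scoring windows are treated as candidates rather than final alignments.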