Title
Next generation sequence assembly /
Author
El-Metwally, Sara El-Sayed Yousef.
Thesis committee
Researcher / Sara El-Sayed Yousef El-Metwally
Supervisor / Magdy Zakaria Rashad
Supervisor / Taher Tawfik Ahmed Hamza
Supervisor / Michael Waterman
Examiner / Ahmed El-Saeed Tolba
Examiner / Mostafa Mahmoud Aref
Subject
Sequence alignment (Bioinformatics). Sequence Analysis, DNA. Base Sequence. Sequence Alignment.
Publication date
2016.
Number of pages
157 p.
Language
English
Degree
Doctorate
Specialization
Computer Science (miscellaneous)
Date of approval
1/1/2016
Place of approval
Mansoura University - Faculty of Computers and Information - Computer Science

Abstract

The volume of sequencing data has outpaced Moore’s Law, more than doubling every two years since next-generation sequencing (NGS) technologies were introduced. Accordingly, we will be able to generate ever more data at high speed and fixed cost, but will lack the computational resources to store, process, and analyze it. With error-prone, high-throughput NGS reads and genomic repeats, the assembly graph contains a massive number of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory before their workflows can start, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine advanced computing techniques with innovative data structures to encode the assembly graph compactly in memory. This thesis addresses the basic framework of next-generation sequence assemblers, which comprises five interconnected stages of data analysis and processing: error correction, graph construction, graph simplification, contig/scaffold production, and assembly evaluation. Besides surveying a wide variety of techniques, algorithms, and software tools used in each stage and proposing a layered architecture for constructing a general assembler, this thesis introduces LightAssembler, a lightweight assembly algorithm designed to run on a desktop machine. It uses a pair of cache-oblivious Bloom filters, one holding a uniform sample of g-spaced sequenced k-mers and the other holding k-mers classified as likely correct by a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to those of competing tools. On benchmark datasets from the GAGE and Assemblathon projects, our method reduces memory usage compared with other resource-efficient assemblers.
Although LightAssembler can be considered a gap-based sequence assembler, different gap sizes yield an almost constant assembly size and genome coverage. The software is open source and user-friendly, free to download, and easy to install and use.
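
The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not LightAssembler's implementation. The `BloomFilter` class below is a plain Bloom filter (not cache-oblivious), the names `BloomFilter` and `sample_kmers` are hypothetical, and the sketch assumes that gap-spaced sampling means taking one k-mer of length `k` and then skipping `g` start positions before taking the next. The statistical test that fills the second filter is not specified in the abstract, so it is left out.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (illustrative, not cache-oblivious)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def sample_kmers(read: str, k: int, g: int):
    """Yield k-mers from a read, skipping g start positions between samples.

    Assumption: this is one plausible reading of gap-spaced sampling.
    """
    step = g + 1
    for start in range(0, len(read) - k + 1, step):
        yield read[start:start + k]


# Usage: store a gapped sample of one read's k-mers in the first filter.
sample_filter = BloomFilter(num_bits=1 << 16, num_hashes=3)
sampled = list(sample_kmers("ACGTACGTAC", k=4, g=2))
for kmer in sampled:
    sample_filter.add(kmer)
print(sampled)  # ['ACGT', 'TACG', 'GTAC']
```

Storing only a sample of the k-mers is what keeps the first filter small; a larger `g` stores fewer k-mers per read, at the cost of less evidence per genomic position.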