Author: Gamal, Rofida Mohammed./ Title: A Study of Protein Structure Prediction Using Deep Learning Techniques /

Title

A Study of Protein Structure Prediction Using Deep Learning Techniques /

Author

Gamal, Rofida Mohammed.

Committee

Researcher / Rofida Mohammed Gamal

Supervisor / Moheb Ramzy Girgis

Supervisor / Enas Farouk Elgeldawi

Subject

Diagnostic Techniques and Procedures. Deep Learning.

Publication date

2021.

Number of pages

100 p.

Language

English

Degree

Master's

Specialization

Computer Science

Approval date

1/1/2021

Place of approval

Minia University - Faculty of Science - Computer Science

Only 14 of 131 pages are available for public view.

Abstract

Protein structure prediction is one of the most important goals pursued in theoretical chemistry and bioinformatics; it is vital in medicine, biotechnology, and other fields.
Protein secondary structure prediction (PSSP) plays a significant role in predicting protein tertiary structure, as it bridges the gap between the protein primary sequence and the tertiary structure. Protein secondary structures are classified under two schemes: a 3-state scheme and an 8-state scheme. Predicting the 3 states and the 8 states of secondary structure from protein sequences is called the Q3 prediction problem and the Q8 prediction problem, respectively.
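The relationship between the two problems can be illustrated with a minimal sketch. It assumes DSSP-style one-letter state codes and one commonly used 8-to-3 reduction (helices to H, strands and bridges to E, everything else to C); the abstract does not state which labels or reduction the thesis uses, so the mapping and the helper names `reduce_to_q3` and `q_accuracy` are illustrative only. Accuracy here is simply the fraction of residues whose predicted state matches the true state.

```python
# Assumed 8-to-3 state reduction (one common convention, not taken from the thesis).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helix types -> helix
    "E": "E", "B": "E",             # strand and bridge -> strand
    "T": "C", "S": "C", "L": "C",   # turn, bend, loop/other -> coil
}


def reduce_to_q3(states8):
    """Map a string of 8-state labels to the coarser 3-state labels."""
    return "".join(DSSP8_TO_3[s] for s in states8)


def q_accuracy(true_states, predicted_states):
    """Fraction of residues whose predicted state equals the true state."""
    if len(true_states) != len(predicted_states):
        raise ValueError("label strings must have the same length")
    if not true_states:
        raise ValueError("label strings must not be empty")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return correct / len(true_states)


# Hypothetical labels for a 10-residue fragment.
true8 = "LHHHGGTEEB"
pred8 = "LHHHHHSEEE"

q8 = q_accuracy(true8, pred8)                              # 8-state accuracy
q3 = q_accuracy(reduce_to_q3(true8), reduce_to_q3(pred8))  # 3-state accuracy
print(f"Q8 = {q8:.2f}, Q3 = {q3:.2f}")
```

In this toy case every 8-state error falls inside the same 3-state class, so the Q3 score is perfect while the Q8 score is not, which is one way to see why Q8 is the harder problem.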
Ensemble learning methods and deep learning (DL) models have been proposed to tackle the challenges of PSSP, but most of them focus on Q3 prediction. Although Q8 prediction is more challenging and complex, this study focuses on it, because the 8 classes of secondary structure reveal more precise structural information for a variety of applications than the 3 classes do.
In this thesis, we first compare the performance of ensemble learning algorithms with that of individual machine learning (ML) algorithms in Q8 PSSP, by developing an ensemble ML approach for Q8 PSSP. This approach employs two different ensemble methods, namely Bagging and Boosting. The ensemble members (base learners) used to construct the ensemble models are well-known classifiers, namely SVM (Support Vector Machine), KNN (K-Nearest Neighbors), DT (Decision Tree), RF (Random Forest), and NB (Naïve Bayes), combined with two feature extraction techniques, namely LDA (Linear Discriminant Analysis) and PCA (Principal Component Analysis). Experiments were conducted to evaluate the performance of single models as well as ensemble models, with PCA and LDA, in Q8 PSSP. The experimental results confirmed that ensemble ML models are more accurate than individual ML models. They also indicated that features extracted by LDA are more effective than those extracted by PCA.
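As a rough illustration of the Bagging idea mentioned above, the sketch below trains each ensemble member on a bootstrap sample (drawn with replacement) and combines the members by majority vote. It is not the thesis's pipeline: the base learner is a simplified 1-nearest-neighbour classifier, the 2-D features and the two labels are toy data, the PCA/LDA step is omitted, and names such as `fit_bagging` are hypothetical. Boosting, which instead reweights training examples between rounds, is not shown.

```python
import random
from collections import Counter


def nearest_neighbor_predict(train, x):
    """1-NN base learner: return the label of the closest training point."""
    nearest = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )
    return nearest[1]


def fit_bagging(train, n_models, rng):
    """Draw one bootstrap sample (with replacement) per ensemble member."""
    return [[rng.choice(train) for _ in train] for _ in range(n_models)]


def bagging_predict(bootstrap_samples, x):
    """Majority vote over the base learners' predictions."""
    votes = Counter(
        nearest_neighbor_predict(sample, x) for sample in bootstrap_samples
    )
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters with hypothetical labels "H" and "E".
rng = random.Random(0)
train = [((rng.gauss(0, 0.5), rng.gauss(0, 0.5)), "H") for _ in range(10)]
train += [((rng.gauss(10, 0.5), rng.gauss(10, 0.5)), "E") for _ in range(10)]

ensemble = fit_bagging(train, n_models=11, rng=rng)
predictions = [bagging_predict(ensemble, x) for x in [(0.2, -0.1), (9.8, 10.3)]]
print(predictions)
```

An odd number of members avoids tied votes in the two-class toy case; with eight classes, as in Q8, ties would need an explicit tie-breaking rule.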
Then, we explore the performance of various DL architectures for Q8 PSSP, by developing six DL architectures using CNNs (convolutional neural networks), RNNs (recurrent neural networks), and some combinations of them. These architectures are: CNN-SW (CNNs with a sliding window); CNN-WP (CNNs with the whole protein as input); LSTM+ (Long Short-Term Memory (LSTM) and Bidirectional LSTM (BLSTM)); GRU+ (Gated Recurrent Unit (GRU) and Bidirectional GRU (BGRU)); CNN-BGRU (CNNs and BGRUs); and CNN-BLSTM (CNNs and BLSTMs). They include batch normalization, dropout, and fully connected layers. We used the CB6133 and CB513 datasets for training and testing, respectively. The experiments showed that combining CNN with BLSTM or BGRU overcame overfitting and achieved better Q8 accuracy, precision, recall, and F-score. The experiments on CB513 showed that CNN-SW, CNN-BGRU, and CNN-BLSTM achieved Q8 accuracy comparable with that of some state-of-the-art models.
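The sliding-window input used by an architecture such as CNN-SW can be sketched as follows: each residue is represented by a fixed-size window of its neighbours, so the network predicts one state per window. This is a minimal sketch under assumptions not stated in the abstract: residues are one-hot encoded over the 20 standard amino acids, sequence ends are zero-padded, and the window size of 5 is arbitrary. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """One-hot encode a residue; unknown residues stay all-zero."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AA_INDEX.get(residue)
    if idx is not None:
        vec[idx] = 1
    return vec


def sliding_windows(sequence, window_size):
    """Return one zero-padded window of encoded residues per position."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the window has a centre")
    half = window_size // 2
    padding = [[0] * len(AMINO_ACIDS) for _ in range(half)]
    encoded = padding + [one_hot(r) for r in sequence] + padding
    # Window i is centred on residue i of the original sequence.
    return [encoded[i:i + window_size] for i in range(len(sequence))]


windows = sliding_windows("MKTAYIAK", window_size=5)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

A whole-protein variant, as in CNN-WP, would instead pass the full encoded sequence to the network at once and predict all positions together.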
Based on our exploration of both the ML ensemble models and the DL models, we conclude that DL models perform significantly better than ML ensemble models in Q8 PSSP.