Title
Machine Learning and Its Applications to
Genomics Data Science Analysis for
Personalized Medicine/
Author
Manhrawy, Ibrahim Ibrahim Mousa.
Preparation committee
Researcher / Ibrahim Ibrahim Mousa Manhrawy (ابراهيم ابراهيم موسي منهراوي)
Supervisor / بسنت محمد الكفراوي
Subject
Mathematics. Theoretical background. Filter Method. Data and Methodology. Elastic Net.
Publication date
2021.
Number of pages
125 p.
Language
English
Degree
Doctorate
Specialization
Computer Science
Approval date
6/10/2021
Place of approval
Menoufia University - Faculty of Science - Mathematics and Computer Science

Abstract

Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body, and cancer rates are rising at an alarming rate worldwide. Microarray technology monitors the expression of many genes in parallel, which makes it useful for cancer classification. Cancer microarray data typically contain few samples but many gene expression levels as features, and this high dimensionality leads to overfitting. Classification methods are widely used in medicine to support diagnosis and decision-making. Feature selection is an important pre-processing stage before a machine learning model is built, and it is part of many data analysis pipelines. Its objective is to select the most relevant features and eliminate redundant ones from a group of features according to some established metric. This makes it possible to create more efficient and interpretable data mining models, and reducing the number of features also lowers future data collection costs. With the phenomenon widely known as “big data,” the datasets available for analysis keep growing. Common data mining algorithms become unable to process such data in full and, depending on the data size, feature selection algorithms themselves can no longer process the data directly. Since this growth is not expected to stop, scalable feature selection algorithms that increase their processing capacity by taking advantage of computer cluster resources become very important. This doctoral dissertation presents several methods of group feature selection for multiclass classification.
Performance is compared across several machine learning algorithms, namely Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Naive Bayes (NB), on the AML datasets from the National Cancer Institute (NCI), Cairo University. There are three main objectives. The first objective is to evaluate the effect of feature selection on data classification with respect to the efficiency and effectiveness of each algorithm in terms of accuracy, precision, sensitivity, and specificity. Experimental results show that LR gives the best accuracy (92.30%) and the lowest error rate. All experiments were carried out in a simulation environment and implemented in Python 3.7.
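The four evaluation measures named above can be illustrated with a minimal sketch for the two-class case. The function name `binary_metrics` and the toy label vectors are assumptions made for illustration only; they are not taken from the thesis, and the thesis itself reports results on multiclass cancer data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, sensitivity, and specificity
    from paired true/predicted labels (two-class case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted positives).
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }


# Hypothetical labels: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a multiclass problem, one common option is to compute these measures per class in a one-vs-rest fashion and then average them.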
Second, a new hybrid feature selection model called the RBARegulizer model is proposed, which combines two types of feature selection techniques: Relief-based algorithms (RBAs) and regularization algorithms. Two RBAs (ReliefF and MultiSURF) are studied as feature-ranking filters for the most important genes. Similarly, three regularization algorithms (Lasso, ElasticNet, ElasticNetCV) are studied and evaluated for reducing the feature subset and removing noisy and irrelevant features, in order to improve the performance and accuracy of cancer microarray data classification. To evaluate the model, the data filtered by the proposed feature selection model are classified with three different classifiers: SVM, MLP, and Random Forest. Four high-dimensional microarray datasets for different cancer types were used in the experiments. The experimental results showed that the proposed model overcomes the overfitting problem of cancer microarray data.
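To make the filter stage more concrete, the following is a minimal sketch of the basic two-class Relief algorithm, the ancestor of ReliefF and MultiSURF. It is not the RBARegulizer model and not the ReliefF or MultiSURF variants used in the thesis: it uses a single nearest hit and nearest miss per sample, visits every sample once instead of sampling at random, and runs on a made-up toy dataset. The function name `relief_scores` is hypothetical.

```python
def relief_scores(X, y):
    """Score features with a simplified two-class Relief.

    A feature gains weight when it differs on the nearest sample of the
    other class (miss) and loses weight when it differs on the nearest
    sample of the same class (hit).
    """
    n_samples, n_features = len(X), len(X[0])
    # Feature ranges, used to scale each difference into [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(n_features))

    weights = [0.0] * n_features
    for i in range(n_samples):
        hits = [k for k in range(n_samples) if k != i and y[k] == y[i]]
        misses = [k for k in range(n_samples) if y[k] != y[i]]
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(n_features):
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n_samples
    return weights


# Toy data (assumed): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.9], [0.1, 0.1], [0.2, 0.5], [0.9, 0.8], [1.0, 0.2], [0.8, 0.4]]
y = [0, 0, 0, 1, 1, 1]
scores = relief_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

In a hybrid scheme of the kind described above, the top-ranked features from such a filter would then be passed to a regularized model such as Lasso or Elastic Net, which shrinks the coefficients of the remaining uninformative features toward zero.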
Third, we present a model called PSO-ENSVM, a hybrid feature selection, optimization, and classification method. The particle swarm optimization (PSO) algorithm is used to find optimal or near-optimal values for the tuning parameters of Elastic Net and of the SVM used as the classifier. To evaluate the model, we use seven microarray datasets for different cancer types. We compared the PSO-ENSVM model with PSO-SVM, in which the RBF kernel hyperparameters are optimized without feature selection, and with an SVM with the RBF kernel. The experimental results showed that the model is able to obtain a near-ideal subset of features, which increased performance because fewer features were needed for classification. Moreover, the results indicated that the RBARegulizer model is effective at improving cancer microarray data classification accuracy. Overall, the results demonstrated that the PSO-ENSVM model outperforms both PSO-SVM and SVM with the RBF kernel.
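The role of PSO as a hyperparameter tuner can be sketched as follows. This is a generic, minimal PSO under stated assumptions, not the PSO-ENSVM implementation: the objective `toy_error` is a hypothetical stand-in for a cross-validated classification error, the parameter names `alpha` and `l1_ratio` are illustrative, and the inertia and acceleration coefficients are common textbook choices rather than values from the thesis.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iterations=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimize `objective` over a box given by `bounds` with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_value = [objective(p) for p in positions]
    g = min(range(n_particles), key=lambda i: personal_value[i])
    global_best, global_value = personal_best[g][:], personal_value[g]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best position, and pull toward the swarm's best position.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                lo, hi = bounds[d]
                # Clip the new position to the search box.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            value = objective(positions[i])
            if value < personal_value[i]:
                personal_best[i], personal_value[i] = positions[i][:], value
                if value < global_value:
                    global_best, global_value = positions[i][:], value
    return global_best, global_value


def toy_error(params):
    """Hypothetical smooth error surface with its minimum at (0.3, 0.7)."""
    alpha, l1_ratio = params
    return (alpha - 0.3) ** 2 + (l1_ratio - 0.7) ** 2


best_params, best_error = pso_minimize(toy_error, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best_params, best_error)
```

In a setting like the one described, `toy_error` would be replaced by a function that fits the feature selector and classifier with the candidate parameters and returns the validation error, so each particle evaluation corresponds to one model fit.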
Keywords: Feature selection, Machine learning, Regularization algorithms, Microarray, Cancer classification.