Author: Elzeki, Omar Mohamed Abdel-Hamid / Title: Missing values processing in big data

Title

Missing values processing in big data

Author

Elzeki, Omar Mohamed Abdel-Hamid.

Preparation committee

Researcher / عمر محمد عبدالحميد الزكى (Omar Mohamed Abdel-Hamid Elzeki)

Supervisor / سمير الدسوقى الموجى

Supervisor / محمد فتحى الرحماوى

Subject

Computer science.

Publication date

2019.

Number of pages

120 p.

Language

English

Degree

Doctorate

Specialization

Computer Science

Date of approval

1/12/2019

Place of approval

Mansoura University - Faculty of Computers and Information - Department of Computer Science

Abstract

Data cleaning aims to improve data quality so that data are “appropriate for use”, by reducing errors and improving documentation and presentation. In this thesis, we propose a distance function that is more effective at determining the best replacement values for missing data before a classifier is applied to the target dataset. In essence, the Weighted Heuristic Similarity Estimation (WHSE) mechanism consumes substantial effort in practical application fields. WHSE was benchmarked using the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) metrics, and the evaluation was conducted with three distinct classifiers: Nearest-Neighbor (NN), Linear Regression (LR), and Multi-Layer Perceptron (MLP). WHSE was applied to two different datasets, Iris and Forest Fires, to estimate its impact when replacing missing values. Consequently, the WHSE formula can direct the applied classifier to achieve at least similar performance regardless of the characteristics of the imputed data. WHSE is expected to be scalable, stable, and applicable in big data analytics.

This thesis also presents a new algorithm, named EMII, for imputing missing values in medical datasets. EMII combines Information Gain (IG) and a Genetic Algorithm (GA) in an evolutionary manner to jointly generate imputable values. Unlike other GA implementations, EMII is column-oriented rather than instance-oriented, which increases each column's correlation with the class in the same dataset. EMII is evaluated by imputing generated missing values in four standard cancer gene-expression medical datasets (Colon, Leukemia, Lung cancer-Michigan, and Prostate) and comparing the original complete datasets against the imputed datasets. The analysis of the experimental results reveals that the values imputed by EMII were almost the same as the original values and had the same impact on the applied classifiers, whose accuracy was similar to that obtained on the original complete datasets. EMII has a running time of Θ(n²), where n is the total number of columns.
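
The abstract does not state the WHSE formula, so the following is only a minimal sketch of the general workflow it describes: hide known values, fill them in from the most similar rows under a weighted distance, and score the result with MAE and RMSE. The weighted Euclidean distance, the function names, the value of k, and the toy numbers are all assumptions made for illustration; none of them is taken from the thesis.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance over features observed (not None) in both rows.

    Assumption: a generic stand-in, NOT the WHSE distance from the thesis.
    """
    total, used = 0.0, 0.0
    for x, y, w in zip(a, b, weights):
        if x is None or y is None:
            continue
        total += w * (x - y) ** 2
        used += w
    # With no shared observed feature, treat the rows as infinitely far apart.
    return math.sqrt(total / used) if used > 0 else math.inf


def impute(rows, weights, k=1):
    """Fill each None with the column mean over the k nearest donor rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have an observed value in column j.
            donors = sorted(
                (weighted_distance(row, other, weights), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [(d, v) for d, v in donors if d != math.inf][:k]
            if donors:  # otherwise the cell stays None
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled


def mae(truth, estimate):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def rmse(truth, estimate):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


# Toy complete data (illustrative values only).
complete = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [6.5, 3.0, 5.8],
]
hidden = [(0, 2), (3, 1)]  # cells masked to simulate missing values

observed = [list(r) for r in complete]
for i, j in hidden:
    observed[i][j] = None

weights = [1.0, 1.0, 1.0]  # hypothetical equal feature weights
imputed = impute(observed, weights, k=1)

# Score only the cells that were hidden, against the original values.
truth = [complete[i][j] for i, j in hidden]
estimate = [imputed[i][j] for i, j in hidden]
print("MAE :", round(mae(truth, estimate), 4))
print("RMSE:", round(rmse(truth, estimate), 4))
```

Scoring only the masked cells mirrors the comparison of the original complete data against the imputed data; a downstream check in the spirit of the abstract would also train the same classifier on both versions and compare accuracy.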