Title
Measuring Data Quality Using Data Mining Algorithms
Author
Atia, Eslam El Shahat Omara.
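Preparation committee
Researcher / Eslam El Shahat Omara Atia (اسلام الشحات عمارة عطيه)
Supervisor / Taha El-Sayed Taha (طه السيد طه)
Examiner / Mohamed Younes Abdel-Samie El-Hamalawy (محمد يونس عبد السميع الحملاوي)
Examiner / Fathi El-Sayed Abdel-Samie (فتحي السيد عبدالسميع)
Subject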
Measuring instruments- Data processing. Data mining. Algorithms- Data processing.
Publication date
2013.
Number of pages
127 p.
Language
English
Degree
Master's
Specialization
Electrical and Electronic Engineering
Date of approval
1/1/2013
Place of approval
Menoufia University - Faculty of Electronic Engineering - Department of Computer Science and Engineering

Abstract

Data is a vital asset for any organization, and its quality has serious consequences for
the business and the organization. Data quality is a basic concern for a wide range of
information systems, such as data warehouses, business intelligence, customer relationship
management, and supply chain management. The literature describes data quality as a
multidimensional concept that includes completeness, accuracy, timeliness, consistency,
and other dimensions.
In this work, data quality is measured by applying data mining algorithms, which can
discover previously unknown patterns and relationships in a dataset and can handle both
discrete and continuous data. Two algorithms are applied: neural networks and support
vector machines. The completeness and timeliness dimensions are selected for measurement
because they appear in most proposed sets of data quality dimensions and belong to the
basic set of data quality dimensions.
The proposed methodology for measuring the two dimensions considers an important aspect
of the data: the field type, that is, whether a field is mandatory, optional, or not
applicable. The measurement is performed on a real, unbalanced dataset, so the imbalance
is handled before training the data mining models by duplicating the minority instances.
Cross-validation is used to evaluate the performance measures of the two data mining
algorithms, and the recorded results are then compared.
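
The duplication and cross-validation steps described above can be sketched in a few lines of
Python. This is only an illustrative sketch, not the thesis implementation: the function names
`oversample_by_duplication` and `k_fold_indices` are hypothetical, the copies are drawn at
random from the minority class, and the folds are contiguous and unshuffled. The abstract does
not say whether duplication happens before or after the folds are split; duplicating only
within the training portion of each fold avoids placing copies of the same row in both the
training and the test set.

```python
import random
from collections import Counter


def oversample_by_duplication(rows, labels, seed=0):
    """Balance classes by appending randomly chosen copies of minority rows.

    Assumption: every class is grown to the size of the largest class.
    The original rows are kept first, in their original order.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

In use, each training split would be passed through `oversample_by_duplication` before a
classifier is fitted, and the untouched test split would be used to compute the performance
measures.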
First, the completeness dimension is assessed using statistical, neural network, and
support vector machine models, which judge whether each data row is complete; the dataset
completeness is then calculated. The neural network and support vector machine models act
as classifiers of row completeness. Second, the timeliness dimension is likewise assessed
using statistical, neural network, and support vector machine models. These models
calculate a timeliness value for each row, from which the dataset timeliness is measured.
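
As a rough illustration of the statistical side of both measurements, the sketch below scores
rows and then aggregates to the dataset level. Everything specific in it is an assumption, since
the abstract gives no formulas: a row is treated as complete when none of its mandatory fields
is missing (optional and not-applicable fields are ignored), dataset completeness is the
fraction of complete rows, row timeliness is `1 - age / shelf_life` clamped to the range 0 to 1,
and dataset timeliness is the mean of the row values. The field-type constants, the `updated`
field name, and the `shelf_life_days` parameter are hypothetical.

```python
from datetime import date

# Hypothetical labels for the three field types named in the abstract.
MANDATORY, OPTIONAL, NOT_APPLICABLE = "mandatory", "optional", "not_applicable"


def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def row_is_complete(row, field_types):
    """Assumed rule: a row is complete when no mandatory field is missing."""
    return all(
        not is_missing(row.get(name))
        for name, ftype in field_types.items()
        if ftype == MANDATORY
    )


def dataset_completeness(rows, field_types):
    """Fraction of rows judged complete (0.0 for an empty dataset)."""
    if not rows:
        return 0.0
    return sum(row_is_complete(r, field_types) for r in rows) / len(rows)


def row_timeliness(last_update, as_of, shelf_life_days):
    """Assumed formula: 1 - age / shelf life, clamped to [0, 1]."""
    age_days = (as_of - last_update).days
    return min(1.0, max(0.0, 1.0 - age_days / shelf_life_days))


def dataset_timeliness(rows, as_of, shelf_life_days, date_field="updated"):
    """Mean of the row timeliness values (0.0 for an empty dataset)."""
    if not rows:
        return 0.0
    scores = [row_timeliness(r[date_field], as_of, shelf_life_days) for r in rows]
    return sum(scores) / len(scores)
```

The row-level outputs of such a baseline could also serve as training labels for the neural
network and support vector machine classifiers, although the abstract does not state how the
labels were obtained.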