Title
A New strategy for web page classification /
Author
Abul-Wafa, Arwa Essam Mohammed Mohy El-Din.
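Preparation committee
Researcher / Arwa Essam Mohammed Mohy El-Din Abul-Wafa (أروه عصام محمد محي الدين ابوالوفا)
Supervisor / Mohamed Fathy El-Rahmawy (محمد فتحي الرحماوى)
Supervisor / Ahmed Ibrahim Saleh (أحمد إبراهيم صالح)
Examiner / Hesham Arafat Ali (هشام عرفات علي)
Examiner / Ahmed Abou El-Fetouh Saleh (أحمد أبوالفتوح صالح)
Subject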
Database management. Image processing.
Publication date
2016.
Number of pages
137 p. :
Language
English
Degree
Master's
Specialization
Electrical and Electronic Engineering
Approval date
01/01/2016
Place of approval
Mansoura University - Faculty of Engineering - Computers Engineering and Systems
Index
Only 14 pages are available for public view

from 170

Abstract

The World Wide Web is growing continuously, and the volume of web content is expected to increase substantially over the next few years. Automatic web page classification is a prominent research area within the information retrieval field. It differs significantly from traditional text classification because of the additional information provided by the HTML structure and hyperlinks. This thesis introduces a novel strategy for vertical web page classification, called Classification using Multi-layered Domain Ontology (CMDO). It employs several web mining techniques and depends mainly on a proposed multi-layered domain ontology. To improve classification accuracy, CMDO employs a distiller that rejects pages related to other domains. CMDO also employs a novel classification technique, called Graph Based Classification (GBC). The proposed GBC offers features that the compared techniques lack, such as outlier rejection and pruning. Experimental results show that CMDO outperforms recent techniques, achieving better precision, recall, and classification accuracy.
This thesis also introduces a simple but effective modification to the behavior of focused web crawlers. The basic idea is to employ web mining techniques to help focused crawlers achieve their main target. This is accomplished by embedding domain distillers in the focused crawling process: before a retrieved page is passed to the indexer, and before its embedded links are retrieved, the page must pass through a domain distiller. Because this decision is based on the actual content of the page rather than on an estimate, it is more reliable. The domain distiller relies on a proposed Optimized Naïve Bayes (ONB) classifier, which combines Naïve Bayes (NB) and a support vector machine (SVM). Initially, a genetic algorithm (GA) is used to optimize the soft margins of the SVM. The optimized SVM is then used to discard outliers from the available training examples. Finally, the pruned examples are used to train the traditional NB classifier.
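The three ONB training steps described above (GA tuning of the SVM soft margin, SVM-based outlier pruning, NB training on the pruned set) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The following are all assumptions made for illustration: the GA fitness (validation accuracy), the outlier criterion (examples the tuned SVM misclassifies), the use of Gaussian NB on toy 2-D numeric features instead of page features, and every function name and parameter value.

```python
import math
import random

def svm_train(X, y, C, epochs=200, lr=0.001):
    """Linear soft-margin SVM via per-sample subgradient steps on
    0.5*||w||^2 + C*hinge. Labels are +1 / -1; C is the soft-margin penalty."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1
            for j, xj in enumerate(xi):
                w[j] -= lr * (w[j] - (C * yi * xj if violated else 0.0))
            if violated:
                b += lr * C * yi
    return w, b

def svm_predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def accuracy(model, X, y):
    return sum(svm_predict(model, xi) == yi for xi, yi in zip(X, y)) / len(y)

def ga_tune_c(X, y, X_val, y_val, rng, pop_size=6, generations=5):
    """Tiny GA over log10(C). Assumption: fitness is validation accuracy."""
    lo, hi = -2.0, 1.0  # search C in [0.01, 10]
    population = [rng.uniform(lo, hi) for _ in range(pop_size)]

    def fitness(gene):
        return accuracy(svm_train(X, y, 10 ** gene), X_val, y_val)

    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = (p1 + p2) / 2 + rng.gauss(0, 0.3)  # crossover + mutation
            children.append(min(hi, max(lo, child)))
        population = parents + children
    return 10 ** max(population, key=fitness)

def prune_outliers(model, X, y):
    """Assumption: an outlier is any example the tuned SVM misclassifies."""
    return [(xi, yi) for xi, yi in zip(X, y) if svm_predict(model, xi) == yi]

def nb_train(examples):
    """Gaussian NB: per-class prior plus per-feature mean and variance."""
    model = {}
    for label in {yi for _, yi in examples}:
        rows = [xi for xi, yi in examples if yi == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-6
                     for col, m in zip(zip(*rows), means)]
        model[label] = (len(rows) / len(examples), means, variances)
    return model

def nb_predict(model, x):
    def log_posterior(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=log_posterior)

def train_onb(X, y, X_val, y_val, seed=0):
    rng = random.Random(seed)
    best_c = ga_tune_c(X, y, X_val, y_val, rng)   # step 1: GA tunes C
    svm = svm_train(X, y, best_c)
    kept = prune_outliers(svm, X, y)              # step 2: SVM prunes outliers
    return nb_train(kept), kept                   # step 3: NB on pruned data

# Toy data: +1 = in-domain, -1 = out-of-domain; one example is mislabeled.
X = [[2.0, 2.2], [2.5, 1.8], [1.8, 2.4], [2.2, 2.0], [2.6, 2.5],
     [2.1, 1.9],  # sits in the +1 cluster but is labeled -1
     [-2.0, -2.1], [-2.4, -1.7], [-1.9, -2.5], [-2.3, -2.2], [-2.6, -1.8]]
y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
X_val, y_val = [[2.3, 2.1], [-2.2, -2.0]], [1, -1]

nb_model, kept = train_onb(X, y, X_val, y_val)
print(len(kept), nb_predict(nb_model, [2.0, 2.0]), nb_predict(nb_model, [-2.0, -2.0]))
```

On this toy data the mislabeled example is dropped before NB training, so the NB class statistics are estimated from clean examples only. A real distiller would work on features extracted from page content and would need a held-out set large enough for the GA fitness to distinguish between candidate soft-margin values.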
ONB is tested against recent classification techniques. Experimental results show that ONB is effective, achieving the highest classification accuracy, 89%. The results also indicate that the proposed distiller improves the performance of focused crawling in terms of harvest rate.
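The distiller-gated crawling described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's crawler: the in-memory `TOY_WEB`, the keyword-based `keyword_distiller` (a stand-in for the ONB-based distiller), and the function names are all hypothetical. Harvest rate is computed here as the fraction of fetched pages that the distiller accepts.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("python programming tutorial", ["b", "c"]),
    "b": ("football match results", ["d"]),
    "c": ("programming language design", ["e"]),
    "d": ("transfer news and scores", []),
    "e": ("compiler programming notes", []),
}

def keyword_distiller(text, keywords=frozenset({"programming", "compiler"})):
    """Stand-in for a trained domain distiller: accept a page when its
    content mentions any domain keyword."""
    return any(word in keywords for word in text.split())

def focused_crawl(seeds, fetch, distiller):
    """Breadth-first crawl that indexes a page and follows its links only
    after the distiller has accepted the page content."""
    frontier, seen = deque(seeds), set(seeds)
    indexed, fetched = [], 0
    while frontier:
        url = frontier.popleft()
        text, links = fetch(url)
        fetched += 1
        if not distiller(text):
            continue  # rejected: not indexed, embedded links not expanded
        indexed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    harvest_rate = len(indexed) / fetched if fetched else 0.0
    return indexed, harvest_rate

indexed, harvest_rate = focused_crawl(["a"], TOY_WEB.__getitem__, keyword_distiller)
print(indexed, harvest_rate)  # ['a', 'c', 'e'] 0.75
```

In this toy run, page "b" is fetched and rejected, so its link to "d" is never followed. That is the intended effect of the gate: off-topic pages cost one fetch each but do not pull further off-topic pages into the frontier.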