Title
An adaptive model for hidden web crawler
Author
Okasha, Mira El-Sayed.
Preparation Committee
Researcher / Mira El-Sayed Okasha
Supervisor / Aida Othman Abdel-Gawad
Examiner / Aida Othman Abdel-Gawad
Subject
Web Searching. Information Retrieval. Deep Web. Hidden Web Crawler. Automated Agents. Automatic Query Generation. Ontology.
Publication Date
2010.
Number of Pages
108 p.
Language
English
Degree
Master's
Specialization
Systems and Control Engineering
Date of Approval
01/01/2010
Place of Approval
Mansoura University - Faculty of Engineering - Computers and Systems Engineering
Contents
Only 14 pages out of 133 are available for public view.

Abstract

Despite the virtually unlimited number of information sources on the web, search engines cannot find or index a large part of this information because it is located behind HTML web forms. That part of the web is usually known as the hidden web or deep web, and the only way to access and retrieve this information is to fill out those HTML forms. Although these forms and the dynamically generated pages behind them are very helpful to users, who often get exactly the information they want by filling them out, it is tedious for users to visit dozens of websites of the same application and fill out the different web forms provided by each site. Since traditional crawlers lack a suitable technique for getting past HTML forms, many hidden web crawlers try to overcome the problem of retrieving data behind these forms. The greatest challenge is therefore enabling hidden web crawlers to interact automatically with forms that were designed primarily for human interaction.

This thesis introduces a new framework for an Adaptive Hidden Web Crawler that addresses two challenges: hidden web resource discovery and information retrieval. It presents an efficient technique for locating web forms that are entry points to online databases (searchable web forms), fills out these searchable forms automatically with values derived from a suitable ontology, submits the queries automatically, and finally analyzes the result pages. The main contributions of this thesis can be summarized as follows:

- Proposing a new Searchable Form Detector (SFD) algorithm that accurately detects searchable web forms and hence collects queryable pages (see the heuristic sketch below).
- Proposing an Adaptive Links Extractor (ALE) algorithm that employs agents to automatically adapt un-canonical URLs into canonical ones and to exclude unbeneficial links from the crawling process (see the canonicalization sketch below).
- Exploiting Semantic Web technology, specifically domain-specific ontologies, to identify a page's domain and hence choose the specific ontology used to automatically fill in the form's text boxes with suitable matched values.
- Formulating queries automatically, submitting them to the extent possible, and finally collecting and organizing the result pages for the indexing process (see the submission sketch below).

Empirical experiments with the proposed technique on real websites show that it enhances the performance of hidden web crawlers by reducing the time and cost of the crawling process, as well as increasing the precision and recall of the retrieved documents.
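
The Searchable Form Detector is described only at a high level in the publicly available pages, so the following Python sketch is a plausible heuristic rather than the thesis's actual SFD algorithm: it treats a form as searchable when it has at least one free-text input, contains no password field, and shows no login or registration vocabulary. The BeautifulSoup calls are standard; the hint list is an assumption.

```python
# Hypothetical searchable-form heuristic; not the thesis's SFD algorithm.
from bs4 import BeautifulSoup

# Words that usually indicate authentication forms rather than database entry points.
NON_SEARCH_HINTS = {"login", "log in", "sign in", "register", "password", "subscribe"}

def looks_searchable(form) -> bool:
    """Heuristically decide whether an HTML <form> is an entry point to an online database."""
    # A searchable form normally exposes at least one free-text field.
    text_inputs = [
        inp for inp in form.find_all("input")
        if inp.get("type", "text").lower() in ("text", "search")
    ]
    if not text_inputs:
        return False
    # Password fields strongly suggest a login/registration form, not a search form.
    if form.find("input", attrs={"type": "password"}):
        return False
    # Reject forms whose visible text hints at authentication.
    form_text = form.get_text(" ", strip=True).lower()
    return not any(hint in form_text for hint in NON_SEARCH_HINTS)

def searchable_forms(html):
    """Return the forms on a page that look queryable."""
    soup = BeautifulSoup(html, "html.parser")
    return [f for f in soup.find_all("form") if looks_searchable(f)]
```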
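For the Adaptive Links Extractor, the abstract only states that agents adapt un-canonical URLs into canonical ones and exclude unbeneficial links. A minimal, non-agent sketch of that normalization step, using Python's standard urllib.parse, might look as follows; the skipped file extensions and the specific rules (lower-cased host, dropped fragment) are illustrative assumptions.

```python
# Hypothetical URL canonicalization step of the kind ALE requires.
from urllib.parse import urljoin, urlparse, urlunparse

# Resource types that yield no queryable forms and can be skipped by the crawler.
SKIP_EXTENSIONS = (".jpg", ".png", ".gif", ".css", ".js", ".pdf", ".zip")

def canonicalize(base_url, href):
    """Resolve a possibly relative, un-canonical link into a canonical absolute URL.

    Returns None when the link is unlikely to benefit the crawl.
    """
    absolute = urljoin(base_url, href)      # resolve relative links against the page URL
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None                          # skip mailto:, javascript:, etc.
    path = parts.path or "/"
    if path.lower().endswith(SKIP_EXTENSIONS):
        return None                          # skip non-HTML resources
    # Lower-case the host, drop the fragment, keep the query string.
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", parts.query, ""))
```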
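Finally, the ontology-driven form filling and automatic query submission can be illustrated with a toy example in which the "ontology" is reduced to a mapping from concepts to candidate values and the form is submitted with GET via the requests library. The ontology content, field names, and substring-matching rule here are purely hypothetical and stand in for the domain-specific ontologies and matching described in the thesis.

```python
# Toy ontology-driven filling and submission; values and matching are illustrative only.
import requests

# A domain-specific "ontology" reduced to concept -> candidate values.
BOOKS_ONTOLOGY = {
    "author": ["Dickens", "Austen"],
    "title": ["Great Expectations", "Emma"],
}

def match_field(field_label, ontology):
    """Match a form field's label against ontology concepts (simple substring match)."""
    label = field_label.lower()
    for concept, values in ontology.items():
        if concept in label:
            return values
    return []

def submit_queries(action_url, field_name, field_label, ontology):
    """Fill the text box with each matched value, submit, and collect result pages."""
    result_pages = []
    for value in match_field(field_label, ontology):
        response = requests.get(action_url, params={field_name: value}, timeout=10)
        if response.ok:
            result_pages.append(response.text)   # kept for later analysis and indexing
    return result_pages
```

In a full crawler the collected result pages would then be analyzed and handed to the indexer, as the abstract describes for the final stage of the framework.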