Author: Sakr, Noha Ahmed Mohamed./ Title: A proposed clustering algorithm based on data mining techniques /


Title

A proposed clustering algorithm based on data mining techniques /

Author

Sakr, Noha Ahmed Mohamed.

Committee

Researcher / Noha Ahmed Mohamed Sakr (نهى أحمد محمد صقر)

Supervisor / على ابراهيم الدسوقي

Examiner / علاءالدين محمد رياض

Examiner / هشام عرفات علي

Subject

Data Mining Techniques. data.

Publication date

2010.

Number of pages

112 p.

Language

English

Degree

Master's

Specialization

Engineering

Date of approval

1/1/2010

Place of approval

Mansoura University - Faculty of Engineering - Department of Computer Engineering and Systems

Index

Only 14 pages are available for public view

Abstract

In recent years, massive volumes of data have accumulated on our computers at home and at work. Managing this much data manually is a very hard task for humans. As a result, researchers in different information technology fields have introduced novel automated systems to serve users' newly arising requirements. Document clustering, a hidden stage embedded in most information retrieval systems, is considered one of the most common real-life clustering applications. By automatically grouping sets of documents with similar content, document clustering techniques serve most information retrieval and text mining systems. Generating clusters from the whole document collection automatically saves time and effort for those searching electronic data. In the traditional Vector Space Model (VSM), researchers have treated the unique words that occur in the document set as the candidate features. Other researchers have used the semantics of single words to represent the features of a document. Recently, a new trend has emerged that considers the phrase to be a more informative feature. The word-based approaches discard the importance of the phrase, while the phrase-based approach ignores semantics entirely. For these reasons, this thesis presents a framework that considers both issues. The main contribution of this thesis is a framework for computing the similarity measure of the traditional VSM that uses the semantics of the phrases in a document as the constituent terms of the VSM, instead of traditional terms such as words or phrases. In addition, the thesis introduces an algorithm for disambiguating the meanings of the words in a phrase.
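As a point of reference for the traditional VSM mentioned in the abstract, the following is a minimal sketch of word-based document vectors compared with cosine similarity. It is an illustration only, not the similarity measure proposed in the thesis: the function names, the raw-count weighting, and the sample sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to raw term counts (the simplest VSM weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, chosen only for illustration.
doc_a = term_vector("data mining finds patterns in data")
doc_b = term_vector("text mining finds patterns in documents")
doc_c = term_vector("the weather is sunny today")

sim_ab = cosine_similarity(doc_a, doc_b)  # documents share four words
sim_ac = cosine_similarity(doc_a, doc_c)  # documents share no words
```

Because each distinct word is its own dimension, two documents that express the same idea with different words or phrases score as dissimilar, which is the limitation the abstract attributes to purely word-based features.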
The proposed algorithm works by connecting to WordNet, a lexical-semantic network, to retrieve the semantically related tokens and senses for each word in the phrase. The framework is organized into three phases: document pre-processing, which prepares the documents of the dataset; document representation, which represents each document in one of the canonical models so that it is ready for the similarity calculations; and document grouping, which physically divides the whole dataset into standalone clusters. The performance of the proposed framework is validated through a set of practical experiments, together with a comparative study against two recent document clustering approaches. The results show that the proposed framework tends to be accurate and that it scales well to relatively large datasets.
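The three phases named in the abstract (pre-processing, representation, grouping) can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the stop-word list, the `CONCEPTS` dictionary (a toy stand-in for a WordNet lookup, not the thesis's disambiguation algorithm), the single-pass grouping rule, and the threshold value. The abstract does not specify which grouping algorithm the thesis uses.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and concept map. CONCEPTS is a toy stand-in
# for a WordNet lookup; it is not the thesis's disambiguation algorithm.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "on"}
CONCEPTS = {"car": "vehicle", "automobile": "vehicle", "film": "movie"}


def preprocess(text):
    """Phase 1: lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def represent(tokens):
    """Phase 2: map tokens to concepts and count them (a VSM vector)."""
    return Counter(CONCEPTS.get(t, t) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def group(vectors, threshold=0.3):
    """Phase 3: single-pass grouping; each cluster is a list of indices.

    A document joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical sample documents, chosen only for illustration.
docs = [
    "The car is parked in the garage",
    "An automobile parked on the street",
    "The film won an award",
    "A movie award ceremony",
]
vectors = [represent(preprocess(d)) for d in docs]
clusters = group(vectors)
```

In this toy run the first two documents land in one cluster and the last two in another, because "car" and "automobile" (and "film" and "movie") are mapped to the same concept before the vectors are compared.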