Title
Text Mining on Social Networking using NLP Techniques
Author
Asal,Walaa Mohamed Medhat
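Committee
Researcher / Walaa Mohamed Medhat Asal (ولاء محمد مدحت عسل)
Supervisor / Hoda Korashy Mohamed (هدى قرشي محمد)
Supervisor / Ahmed Hassan Yousef (أحمد حسن يوسف)
Examiner / Ayman Mohamed Wahba (ايمن محمد وهبة)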
Publication date
2015.
Number of pages
117 p.
Language
English
Degree
Doctorate
Specialization
Electrical and Electronic Engineering
Approval date
1/1/2015
Place of approval
Ain Shams University - Faculty of Engineering - Electrical Engineering (Computers)

Abstract

The web has recently become a very important source of information as it has turned into a
read-write platform. The dramatic growth of online social networks (OSN), video sharing sites,
online news, online review sites, online forums, and blogs has drawn considerable attention to
user-generated content in the form of unstructured free text, because of its importance to many
businesses. The web is no longer used by English speakers only; it is used by speakers of many
languages. Text mining has become necessary to extract information and discover knowledge
from this huge amount of textual data. Working with text data requires a better understanding
of the text, and Natural Language Processing (NLP) techniques can help provide it.
Sentiment Analysis (SA) is one of the well-known text mining techniques. It is the
computational study of people’s opinions, attitudes, and emotions towards individuals, events, or
topics covered by reviews or news. The target of SA is to find opinions, identify the sentiments
they express, and then classify their polarity.
The thesis proposes a framework for preparing and using corpora from OSN and review sites for
the SA task in two natural languages, English and Arabic. The framework consists of three
phases. The first phase is the preprocessing and cleaning of the collected data, followed by
data annotation. The second phase applies various text processing (NLP) techniques to the
prepared corpora: removing stopwords, replacing each negation word and the negated word that
follows it with the antonym of the negated word, and keeping only words with selected
part-of-speech tags (adjectives and verbs). The third phase is text classification using Naïve
Bayes and Decision Tree classifiers with two feature selection approaches, unigrams and
bigrams. The framework components were analyzed at each stage in order to determine which
scenario works best for each corpus used. The analysis is strengthened by applying the
framework components to the English benchmark movie reviews corpus in addition to the
prepared corpora from OSN sites and a review site.
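
As a rough illustration of the second and third phases, the following Python sketch chains negation replacement, stopword removal, n-gram extraction, and a small Naïve Bayes classifier. It is only a minimal sketch under stated assumptions, not the thesis's implementation: the stopword set, the negation set, the antonym dictionary (a stand-in for a WordNet lookup), the function names, the order of the steps, and the four training sentences are all hypothetical. Part-of-speech filtering and the Decision Tree classifier are left out.

```python
import math
from collections import Counter, defaultdict

# Toy resources, hypothetical: stand-ins for real stopword lists and WordNet antonyms.
STOPWORDS = {"the", "is", "a", "this", "was", "and"}
NEGATIONS = {"not", "never", "no"}
ANTONYMS = {"good": "bad", "bad": "good", "happy": "sad", "sad": "happy"}


def replace_negations(tokens):
    """Replace 'negation + word' with the word's antonym when one is known."""
    out, i = [], 0
    while i < len(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tokens[i] in NEGATIONS and nxt in ANTONYMS:
            out.append(ANTONYMS[nxt])
            i += 2  # consume the negation word and the negated word
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text):
    """Lowercase, split on whitespace, resolve negations, then drop stopwords."""
    tokens = replace_negations(text.lower().split())
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return word n-grams: n=1 gives unigrams, n=2 gives bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_nb(samples):
    """Count features per label; samples is a list of (features, label)."""
    label_counts, feat_counts, vocab = Counter(), defaultdict(Counter), set()
    for feats, label in samples:
        label_counts[label] += 1
        feat_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, feat_counts, vocab


def predict_nb(model, feats):
    """Pick the label with the highest log-posterior (add-one smoothing)."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for f in feats:
            if f in vocab:  # ignore features never seen in training
                score += math.log((feat_counts[label][f] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data, for illustration only.
train = [
    ("a good movie", "pos"),
    ("great acting and a good plot", "pos"),
    ("a bad movie", "neg"),
    ("this was not good", "neg"),
]
model = train_nb([(ngrams(preprocess(t), 1), y) for t, y in train])
print(predict_nb(model, ngrams(preprocess("the plot is not bad"), 1)))
```

In this sketch negations are resolved before stopwords are removed, since a negation word could otherwise be discarded by a stopword list before it is handled; that ordering is an assumption of the example. Passing `2` instead of `1` to `ngrams` switches the features from unigrams to bigrams.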
Language resources for Arabic are lacking, as most of them are still under development.
Using Arabic in the framework requires several resources: stopword lists, an Arabic WordNet,
and an Arabic POS tagger. The problem is that previously generated stopword lists cover
Modern Standard Arabic (MSA), which is not the language commonly used on OSN. We have
generated a stopword list for the Egyptian dialect and a corpus-based list to be used with the
OSN Arabic corpora. We compare the efficiency of text classification when using the generated
lists, previously generated MSA lists, and a combination of the Egyptian dialect list with the
MSA list. The other resources are still under development and are built for MSA, not for
Dialectal Arabic, which is the language used by OSN users.
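
The abstract does not say how the corpus-based list was built. One common approach, assumed here purely for illustration, is to treat the most frequent tokens of the corpus as stopwords. The sketch below shows that idea together with combining an MSA list and an Egyptian dialect list by set union. The function names, the `top_k` cutoff, and the few list entries are hypothetical toy values, not the lists generated in the thesis.

```python
from collections import Counter


def corpus_based_stopwords(docs, top_k):
    """Treat the top_k most frequent tokens in the corpus as stopwords."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {word for word, _ in counts.most_common(top_k)}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the given stopword set."""
    return [t for t in tokens if t not in stopwords]


# Toy entries for illustration only, not the thesis's lists.
MSA_LIST = {"في", "من"}
EGYPTIAN_LIST = {"ده", "اللي"}
COMBINED_LIST = MSA_LIST | EGYPTIAN_LIST

tokens = "الفيلم ده حلو".split()
print(remove_stopwords(tokens, MSA_LIST))       # dialect word is kept
print(remove_stopwords(tokens, COMBINED_LIST))  # dialect word is removed
```

The two calls at the end illustrate the motivation stated above: an MSA-only list leaves dialect function words in place, while a combined list removes them.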
The framework was applied and tested with all its stages on the English corpora. For Arabic,
only the stopword removal technique was applied. The experiments show that the OSN data is
extremely unbalanced for both languages. The results show that applying text processing
techniques improves the classification accuracy of the Naïve Bayes (NB) classifier and reduces
the training time of both classifiers. Performance was measured by accuracy, F-measure, and
training time. The results also show that the Decision Tree classifier gives better results on
imbalanced data for both languages. The experiments on the Arabic corpora show that the
general lists containing Egyptian dialect stopwords give better performance than lists of MSA
stopwords only.
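
To see why F-measure is reported alongside accuracy on unbalanced data, consider the short Python sketch below. The 90/10 class split and the always-positive predictor are hypothetical values chosen for illustration and are not taken from the thesis's experiments; the F-measure is computed here as the balanced F1 score for one chosen class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, target):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical unbalanced test set: 90 positive and 10 negative examples.
y_true = ["pos"] * 90 + ["neg"] * 10
# A trivial classifier that always predicts the majority class.
y_pred = ["pos"] * 100

print(accuracy(y_true, y_pred))          # high, despite ignoring the minority class
print(f_measure(y_true, y_pred, "neg"))  # zero: no negative example is found
```

A classifier that never predicts the minority class still scores high accuracy here, while its F-measure on that class is zero, which is why accuracy alone can be misleading on unbalanced corpora.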