Abstract The web has recently become a very important source of information as it has turned into a read-write platform. The dramatic growth of online social networks (OSN), video sharing sites, online news, online review sites, online forums, and blogs has drawn considerable attention to user-generated content in the form of unstructured free text, because of its importance to many businesses. The web is no longer used by English speakers only; it is used by speakers of many languages. Text mining has become necessary to extract information and discover knowledge from this huge amount of textual data. Working with text data requires a better understanding of the text, and Natural Language Processing (NLP) techniques can help provide it. Sentiment Analysis (SA) is one of the well-known text mining techniques. It is the computational study of people’s opinions, attitudes, and emotions towards individuals, events, or topics covered by reviews or news. The goal of SA is to find opinions, identify the sentiments they express, and then classify their polarity. This thesis proposes a framework for preparing and using corpora from OSN and review sites for the SA task in two natural languages, English and Arabic. The framework consists of three phases. The first phase is preprocessing and cleaning the collected data, followed by data annotation. The second phase applies various text processing (NLP) techniques to the prepared corpora: removing stopwords, replacing each negation word and the word it negates with the antonym of the negated word, and keeping only words with selected part-of-speech tags (adjectives and verbs). The third phase is text classification using Naïve Bayes and Decision Tree classifiers with two feature selection approaches, unigrams and bigrams. The framework components were analyzed at each stage.
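To make the second and third phases concrete, the following is a minimal sketch in Python, not the implementation used in the thesis. It assumes tiny hypothetical word lists (`STOPWORDS`, `NEGATIONS`, `ANTONYMS`) in place of real stopword lists and a WordNet antonym lookup, omits the part-of-speech filtering step, and uses a small hand-written multinomial Naïve Bayes with add-one smoothing rather than any particular toolkit.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy resources; a real system would load full stopword lists
# and look antonyms up in a lexical database such as WordNet.
STOPWORDS = {"the", "a", "is", "this", "was", "it", "and"}
NEGATIONS = {"not", "never", "no"}
ANTONYMS = {"good": "bad", "bad": "good", "like": "dislike", "happy": "sad"}

def preprocess(text):
    """Lowercase, replace 'negation + word' with the antonym, drop stopwords."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in NEGATIONS and nxt in ANTONYMS:
            out.append(ANTONYMS[nxt])  # e.g. "not good" becomes "bad"
            i += 2
            continue
        if tok not in STOPWORDS:
            out.append(tok)
        i += 1
    return out

def ngrams(tokens, n=1):
    """Return n-gram features: n=1 gives unigrams, n=2 gives bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for feats, label in zip(docs, labels):
            self.feat_counts[label].update(feats)
        self.vocab = {f for c in self.feat_counts.values() for f in c}
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for f in feats:
                if f in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Toy training data, invented for illustration only.
train = [
    ("this movie is good", "pos"),
    ("I like it", "pos"),
    ("the plot was bad", "neg"),
    ("this is not good", "neg"),
]
docs = [ngrams(preprocess(text), n=1) for text, _ in train]
labels = [label for _, label in train]
model = NaiveBayes().fit(docs, labels)

print(preprocess("this is not good"))                  # ['bad']
print(model.predict(ngrams(preprocess("not bad"))))    # pos
```

Switching `n=1` to `n=2` in the calls to `ngrams` changes the feature set from unigrams to bigrams without touching the classifier, which is how the two feature approaches can be compared on the same preprocessed corpus.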
It is important to analyze the components of the framework to determine which configuration works best for each corpus. The analysis is strengthened by applying the framework components to the English benchmark movie reviews corpus, in addition to the corpora prepared from OSN sites and a review site. Arabic lacks language resources, as most of them are still under development. Using Arabic in the framework requires several resources: stopword lists, an Arabic WordNet, and an Arabic POS tagger. The problem is that previously generated stopword lists cover Modern Standard Arabic (MSA), which is not the language commonly used in OSN. We have generated a stopword list for the Egyptian dialect and a corpus-based list to be used with the Arabic OSN corpora. We compare the efficiency of text classification when using the generated lists, previously generated MSA lists, and a combination of the Egyptian dialect list with the MSA list. The other resources are still under development, and they are built for MSA rather than the dialectal Arabic used by OSN users. The framework was applied and tested with all its stages on the English corpora. For Arabic, only the stopword removal technique is applied. The experiments show that OSN data is highly imbalanced for both languages. The results show that applying text processing techniques improves the classification accuracy of the Naïve Bayes (NB) classifier and reduces the training time of both classifiers. Performance was measured by accuracy, F-measure, and training time. The results also show that the Decision Tree classifier gives better results on imbalanced data for both languages. The experiments on the Arabic corpora show that the general lists containing Egyptian dialect stopwords give better performance than lists of MSA stopwords only.
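The stopword-list combination and the evaluation criteria can be sketched as follows. This is an illustrative example under stated assumptions, not the thesis's code or data: the helper names (`merge_stopwords`, `accuracy`, `f_measure`) are hypothetical, the Arabic words are a handful of common function words chosen for illustration rather than the generated lists, and the labels are invented to show why accuracy alone can mislead on imbalanced data.

```python
def merge_stopwords(*lists):
    """Union several stopword lists into one set, ignoring blank entries."""
    merged = set()
    for words in lists:
        merged.update(w.strip() for w in words if w.strip())
    return merged

def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f_measure(gold, pred, target):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative samples only, not the lists generated in the thesis.
msa_sample = ["في", "من", "على"]
egyptian_sample = ["ده", "دي", "اللي", "من"]
combined = merge_stopwords(msa_sample, egyptian_sample)
print(len(combined))  # 6, because the shared word is counted once

# Invented imbalanced labels: 8 negative, 2 positive, classifier says all negative.
gold = ["neg"] * 8 + ["pos"] * 2
pred = ["neg"] * 10
print(accuracy(gold, pred))           # 0.8
print(f_measure(gold, pred, "pos"))   # 0.0
```

In this toy case a classifier that never predicts the minority class still reaches 0.8 accuracy while its F-measure for that class is 0.0, which is why reporting F-measure alongside accuracy matters for imbalanced corpora.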