Author: Mostafa, Shimaa Ismail Mohamed. / Title: Semantic-Based Text Similarity.

Title

Semantic-Based Text Similarity.

Author

Mostafa, Shimaa Ismail Mohamed.

Committee

Researcher / Shimaa Ismail Mohamed Mostafa

Supervisor / Tarek Ahmed Elshishtawy

Supervisor / AbdelWahab Kamel Alsammak

Examiner / Osama Abd El-Raouf

Examiner / Khaled Fouad

Subject

Natural Language Processing. Knowledge-based Systems. Semantic Textual Similarity.

Publication date

2022.

Number of pages

93 p.

Language

English

Degree

Doctorate

Specialization

Information Systems

Date of approval

19/7/2022

Place of approval

Benha University - Faculty of Computers and Information - Information Systems

Abstract

This thesis focuses on one of the most important aspects of information retrieval: textual similarity, both lexical and semantic, that is, the technique of comparing two texts to determine how similar they are. It examines several elements of textual similarity as well as the factors that influence it. It covers topics such as language, with a focus on sentence structure, as well as algorithms and other tools that may be used to investigate and assess textual similarity.
Semantic Textual Similarity (STS) has become an important topic for many types of research and applications. It plays a fundamental role in tasks such as information retrieval, question generation and answering, automatic essay scoring, automatic short answer grading, machine translation, text summarization, sentiment analysis, and others. In this thesis, two different hybrid approaches are presented to measure the semantic similarity of two Arabic text snippets:
The first approach depends on the alignment of a word semantic space for measuring the semantic similarity of Arabic text snippets. This approach combines two similarity measurement methods: vector space-based and alignment-based. The vector space-based method depends on a semantic net that represents the meaning of words as vectors. These vectors are lemmatized to enrich the search space. The alignment-based method generates an Alignment Word Space Matrix (AWSM) for the text snippets according to the generated semantic word spaces. Finally, the degree of sentence semantic similarity is measured using the proposed alignment rules. Four experiments are carried out on two different datasets to evaluate the performance of the proposed approach. The experiments showed that the applied methodologies, namely preprocessing, lemmatization of the input text and the vector space model, the semantic word space extraction algorithm, the alignment technique, and the treatment of negation, each have a positive impact on the results. The correlation between the results of the proposed approach and human judgment scores reaches 0.7212, which is considered among the best published results for Arabic semantic textual similarity.
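The general idea of aligning two snippets through a word-by-word similarity matrix can be illustrated with a minimal sketch. This is not the thesis's AWSM or its alignment rules, which the abstract does not specify. It assumes cosine similarity between word vectors and a simple best-match average. The names `TOY_VECTORS`, `alignment_matrix`, and `aligned_similarity` are hypothetical, and the English words and three-dimensional vectors are placeholders for lemmatized Arabic words and real semantic vectors.

```python
import math

# Hypothetical toy vectors; a real system would load vectors for lemmatized Arabic words.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "sat": [0.0, 1.0, 0.0],
    "rested": [0.0, 0.9, 0.1],
    "car": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(a, b, vectors):
    """Vector similarity when both words are known, exact string match otherwise."""
    if a in vectors and b in vectors:
        return cosine(vectors[a], vectors[b])
    return 1.0 if a == b else 0.0


def alignment_matrix(words_a, words_b, vectors):
    """Rows follow words_a, columns follow words_b; each cell is a word-pair similarity."""
    return [[word_similarity(a, b, vectors) for b in words_b] for a in words_a]


def aligned_similarity(words_a, words_b, vectors):
    """Average of each word's best match in the other snippet (an assumed scoring rule)."""
    if not words_a or not words_b:
        return 0.0
    matrix = alignment_matrix(words_a, words_b, vectors)
    best_rows = [max(row) for row in matrix]
    best_cols = [max(col) for col in zip(*matrix)]
    return (sum(best_rows) + sum(best_cols)) / (len(best_rows) + len(best_cols))
```

Under these assumptions, `aligned_similarity(["cat", "sat"], ["kitten", "rested"], TOY_VECTORS)` scores higher than the same call with `["car", "car"]` as the second snippet, because each word finds a close neighbour in the first case and none in the second. Negation handling, which the abstract reports as a contributing factor, is deliberately left out of this sketch.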
The second approach presents a language model for measuring the similarity between Arabic texts both lexically and semantically. This approach uses the edit distance concept as a framing algorithm to capture lexical similarities. In the proposed work, lexical-level distances between lemma-form words are calculated, while partial edit costs are allowed in order to embed semantic similarity measurements. Several knowledge resources are used, such as word synonyms, negation rules, and word semantic spaces. A searchable Arabic thesaurus dictionary is built in two forms: surface form and lemma form. Semantic word spaces are generated using the fastText word embedding model, which represents words in vector spaces. The algorithm is enhanced to overcome the limitation of differing word orders within the given sentences by generating word permutations. This technique selects the alignment sequence of the snippet words that yields the minimum edit distance, and therefore the highest similarity value. The proposed approach also provides another method for dealing with negation terms in the given sentences, which may reverse their similarity. The experimental results are compared with other state-of-the-art algorithms using two benchmark datasets. They show that the proposed algorithm achieves higher Pearson correlation coefficients than other works, improving the correlation measure by 4% and 2% on the two datasets compared with the results of state-of-the-art systems.
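The combination of word-level edit distance, partial substitution costs, and word permutations can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm. The synonym set `SYNONYMS`, the partial cost of 0.25, the normalization by the longer snippet's length, and all function names are hypothetical. English words stand in for lemma-form Arabic words, and the fastText word spaces and negation rules are omitted.

```python
from itertools import permutations

# Hypothetical synonym pairs standing in for the thesaurus mentioned in the abstract.
SYNONYMS = {frozenset(("big", "large")), frozenset(("quick", "fast"))}


def substitution_cost(a, b):
    """Full cost for unrelated words, an assumed partial cost for synonyms, zero for equal words."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in SYNONYMS:
        return 0.25  # assumed value; the abstract does not give the actual costs
    return 1.0


def word_edit_distance(src, dst):
    """Levenshtein distance over words, with weighted substitutions and unit insert/delete."""
    rows, cols = len(src) + 1, len(dst) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = float(i)
    for j in range(cols):
        dist[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # delete a word from src
                dist[i][j - 1] + 1.0,  # insert a word from dst
                dist[i - 1][j - 1] + substitution_cost(src[i - 1], dst[j - 1]),
            )
    return dist[-1][-1]


def best_permutation_similarity(words_a, words_b):
    """Try every ordering of words_a, keep the minimum distance, and map it to [0, 1]."""
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0
    best = min(word_edit_distance(list(p), words_b) for p in permutations(words_a))
    return 1.0 - best / longest
```

With these assumed costs, `word_edit_distance(["big", "dog"], ["dog", "large"])` is 2.0 in the original order, while `best_permutation_similarity` finds the reordering `["dog", "big"]` with distance 0.25. Enumerating permutations grows factorially with sentence length, so a sketch like this is only practical for short snippets.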