Abstract

This thesis presents a novel recurrent neural network language model based on tokenizing each word into three parts: the prefix, the stem, and the suffix. The proposed model is tested on the English AMI speech recognition dataset and the Online Open Source Arabic (OOSA) language corpus. The thesis also proposes a novel hybrid approach to automatically detect and correct Arabic spelling errors. This approach combines a confusion matrix and the noisy channel spelling correction model with the proposed modified recurrent neural network language model. The confusion matrix was constructed from 163,452 pairs of spelling errors and their corrected forms, extracted from the Qatar Arabic Language Bank (QALB). Based on the reported results, automatic spelling correction accuracy improved by about 3.5% on the Arabic misspelling dataset. Finally, the thesis presents a novel approach to automatic Arabic text diacritization that uses deep encoder-decoder recurrent neural networks followed by several text correction steps to improve the overall output accuracy. The proposed model achieves a morphological diacritization word error rate (WER) of 3.85% and a diacritic error rate (DER) of 1.12%.
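The noisy channel idea behind the spelling correction approach can be illustrated with a minimal sketch. It picks the candidate c that maximizes log P(typed | c) + log P(c), where the first term comes from a confusion matrix and the second from a language model. Everything below is an assumption made for illustration: the counts are invented, the examples are English toy words rather than Arabic, only single-character substitutions are modeled, and a small probability table stands in for the recurrent neural network language model. It is not the thesis implementation.

```python
import math

# Toy, made-up counts: CONFUSION[(typed, intended)] = how often `intended`
# was mistyped as `typed`. A real matrix would be estimated from error pairs.
CONFUSION = {("a", "e"): 30, ("e", "a"): 20, ("o", "a"): 5}

# Toy counts of how often each intended character occurs (hypothetical).
CHAR_COUNTS = {"a": 1000, "e": 1200, "i": 900, "o": 800}

# Toy word probabilities; a stand-in for the language model score P(c).
LM_PROB = {"then": 0.006, "than": 0.004, "thin": 0.001}


def channel_log_prob(typed, candidate):
    """Approximate log P(typed | candidate), substitutions only.

    Matching characters contribute log(1) = 0, a simplification.
    Words of different length are treated as impossible here.
    """
    if len(typed) != len(candidate):
        return float("-inf")
    logp = 0.0
    for t, c in zip(typed, candidate):
        if t == c:
            continue
        count = CONFUSION.get((t, c), 0)
        # Add-one smoothing: unseen substitutions are unlikely, not impossible.
        logp += math.log((count + 1) / (CHAR_COUNTS.get(c, 0) + len(CHAR_COUNTS)))
    return logp


def correct(typed, candidates=LM_PROB):
    """Return the candidate maximizing channel score plus language model score."""
    return max(
        candidates,
        key=lambda c: channel_log_prob(typed, c) + math.log(candidates[c]),
    )


if __name__ == "__main__":
    print(correct("thon"))
```

In this toy setup the input "thon" resolves to "than": the language model alone prefers "then", but the confusion counts make typing "o" for "a" more plausible than typing "o" for "e". That interplay between the two scores is what the combined model relies on.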