Abstract

The popularity of the Internet and the World Wide Web increases the need for information management of electronic texts. Textual documents are the easiest way to store information of all kinds on a computer, in spite of the difficulty of making use of that information. Text mining is the discipline of retrieving meaningful information from natural language text. The main problems facing text mining are feature reduction, high dimensionality, and accurate and fast classification.

This thesis introduces an implementation of new intelligent hybrid models that handle these problems. Transformation systems combining evolutionary computing algorithms and machine learning algorithms are used to classify PLSNL (Partially Structured, Largely Natural) documents based on their structuring conventions. Genetic algorithms, as an evolutionary computing algorithm, are used to find the most significant (informative) words in the feature reduction process, based on the line structuring conventions. Thus the most informative features (synopses) are extracted, and a succinct feature vector is prepared to represent the document. Based on this succinct feature vector, a machine learning algorithm is needed to mine the associated categories. Two machine learning algorithms, C4.5 and Classification based on Multiple Association Rules (CMAR), are used to classify the documents.

The new hybrid models, Hybrid Genetic and C4.5 Algorithm for Textual Document Classification and Hybrid Intelligent Model of Genetic Algorithms and Association Rules in Text mining, help the decision-maker derive a set of rules with the highest classification accuracy for documents. A comparison with previous approaches is presented. The comparative study shows the efficiency of the new hybrid models in increasing classification accuracy and reducing the time consumed in the classification process. The details and limitations of the new approaches are discussed, and future work is suggested.
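As a rough illustration of the genetic feature-reduction idea summarized above, the following minimal sketch evolves a bit mask over a vocabulary, where each bit marks a word as kept or dropped. It is not the thesis's implementation: the toy corpus, the function names (`classify`, `fitness`, `evolve`), the GA parameters, and the word-overlap voting classifier are all assumptions made for illustration. In particular, the overlap vote is only a stand-in for C4.5 or CMAR, and the line structuring conventions of PLSNL documents are not modeled.

```python
import random

# Hypothetical toy corpus: each document is a set of words plus a category label.
DOCS = [
    ({"goal", "match", "team", "the"}, "sport"),
    ({"team", "score", "win", "the"}, "sport"),
    ({"match", "win", "a", "the"}, "sport"),
    ({"stock", "market", "price", "the"}, "finance"),
    ({"market", "bank", "rate", "the"}, "finance"),
    ({"price", "rate", "a", "the"}, "finance"),
]
VOCAB = sorted(set().union(*(words for words, _ in DOCS)))


def classify(words, mask, train):
    """Vote for the label whose training documents share the most selected words.

    Stand-in classifier (an assumption), not C4.5 or CMAR.
    """
    selected = {w for w, bit in zip(VOCAB, mask) if bit}
    votes = {}
    for other_words, label in train:
        votes[label] = votes.get(label, 0) + len(words & other_words & selected)
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(votes), key=votes.get)


def fitness(mask):
    """Leave-one-out accuracy minus a small penalty per selected word."""
    correct = 0
    for i, (words, label) in enumerate(DOCS):
        train = DOCS[:i] + DOCS[i + 1:]
        correct += classify(words, mask, train) == label
    return correct / len(DOCS) - 0.01 * sum(mask)


def evolve(generations=30, pop_size=20, mutation_rate=0.05, seed=0):
    """Evolve a word-selection mask; returns the fittest mask found."""
    rng = random.Random(seed)
    n = len(VOCAB)
    # Seed the population with the all-words mask so that, with elitism,
    # the result is never less fit than keeping the full vocabulary.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("selected words:", [w for w, bit in zip(VOCAB, best) if bit])
    print("fitness:", fitness(best))
```

The per-word penalty in `fitness` is one simple way to push the search toward a succinct feature vector; a real system would plug the actual classifier's accuracy into the fitness function instead of the overlap vote.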