Abstract Up to the late 1980s, information was largely centralized, so retrieving it was not a difficult task. In the 1990s, with the arrival of the World Wide Web (WWW), or simply the Web, information became distributed, making the retrieval of relevant documents a challenge. Distributed Information Retrieval (DIR) emerged in response, at the intersection of Information Retrieval (IR) and Distributed Systems (DSs). The objective of a DIR system is to provide a tool that searches the huge number of available databases and merges the results into a single list returned to the user. This thesis proposes a set of algorithms and techniques to improve the performance of DIR systems: a novel IR architecture, an efficient query expansion algorithm based on WordNet, a new crawling technique based on ontology, and a new rapid filtering algorithm based on semantic similarity. The query expansion technique uses synonyms from WordNet to convert a user demand into a set of discrete concepts that semantically interpret the query requirements. Document similarity, in turn, is evaluated by computing the semantic distance between the terms in the document and those in the expanded query. Finally, filtering increases retrieval effectiveness by ranking the relevant documents in an ordered list and discarding those that are least relevant. To validate and test the proposed algorithms and techniques, a DIR system named Ontology Based Distributed Information Retrieval (OBDIR) has been designed and implemented. The OBDIR system is built from the bottom up in Python, with the emphasis on integrating ontology with the principles of focused crawlers. The OBDIR experimental results are very promising.
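The expand, score, and filter steps described above can be illustrated with a minimal sketch. It is not the OBDIR implementation: the small `TOY_SYNONYMS` table is a hypothetical stand-in for WordNet, Jaccard term overlap is a simple substitute for the semantic distance measure used in the thesis, and the function names and the `threshold` parameter are assumptions made for illustration.

```python
# Illustrative sketch only. TOY_SYNONYMS stands in for WordNet, and Jaccard
# overlap stands in for the semantic distance measure used in the thesis.
TOY_SYNONYMS = {
    "car": {"automobile", "auto"},
    "fast": {"quick", "rapid"},
}


def expand_query(query):
    """Return the query terms plus any synonyms found in the toy table."""
    expanded = set()
    for term in query.lower().split():
        expanded.add(term)
        expanded |= TOY_SYNONYMS.get(term, set())
    return expanded


def similarity(document, expanded_query):
    """Jaccard overlap between document terms and expanded query terms."""
    doc_terms = set(document.lower().split())
    union = doc_terms | expanded_query
    if not union:
        return 0.0
    return len(doc_terms & expanded_query) / len(union)


def filter_documents(documents, query, threshold=0.1):
    """Rank documents by similarity and drop those scoring below threshold."""
    expanded = expand_query(query)
    scored = [(similarity(doc, expanded), doc) for doc in documents]
    kept = [(score, doc) for score, doc in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "a quick automobile review",
        "cooking pasta at home",
        "rapid auto repair",
    ]
    for score, doc in filter_documents(docs, "fast car"):
        print(f"{score:.2f}  {doc}")
```

Without expansion, the query "fast car" shares no terms with any of the three sample documents. With the toy synonyms, two of them are retained and ranked, and the unrelated one is discarded.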
The proposed system far outperforms the standard Breadth First (BF) search benchmark, which runs the same query without the preprocessing employed in the proposed system. Finally, a probabilistic model for analyzing the performance of DIR systems is introduced. The model is built on Bayes’ theorem and can help establish a rigorous theory for DIR. It has been used to calculate precision and accuracy, with the underlying probabilities obtained empirically by running the proposed OBDIR system.
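As a hedged illustration of how Bayes' theorem relates to these measures (the thesis's own model is not reproduced in this abstract), let R denote the event that a document is relevant and S the event that the system retrieves it. Both symbols are assumptions introduced here. Precision is then the probability that a retrieved document is relevant, and accuracy is the probability that the system's decision matches the document's relevance:

```latex
\[
\text{precision} = P(R \mid S) = \frac{P(S \mid R)\, P(R)}{P(S)},
\qquad
P(S) = P(S \mid R)\, P(R) + P(S \mid \neg R)\, P(\neg R),
\]
\[
\text{accuracy} = P(S \mid R)\, P(R) + P(\neg S \mid \neg R)\, P(\neg R).
\]
```

Under this reading, each probability on the right-hand side could be estimated from counts of relevant and retrieved documents observed in system runs.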