Author: Omar, Metwally Rashad Metwally./ Title: On string representation for or effcient mining and searching over string database /

Search In this Thesis

العنوان

On string representation for or effcient mining and searching over string database /

المؤلف

Omar, Metwally Rashad Metwally.

هيئة الاعداد

باحث / متولى رشاد متولى

مشرف / ماهر شديد زايد

مشرف / كرم جودة

مناقش / متولى رشاد متولى

الموضوع

Mathematics.

تاريخ النشر

2012.

عدد الصفحات

81 p. ;

اللغة

الإنجليزية

الدرجة

ماجستير

التخصص

الرياضيات (المتنوعة)

تاريخ الإجازة

1/1/2012

مكان الإجازة

جامعة بنها - كلية العلوم - Mathematics

الفهرس

Only 14 pages are availabe for public view

from

Abstract

One of the most important primitive data types in modern data processing is strings.String data is ubiquitous, and its management has taken on particular importance
in the past few years. An important research topic in string processing is string similarity join. A string similarity join nds all similar pairs between two collections
of strings. It is an essential operation in many applications, such as data integra-
tion and cleaning, web page detection, and pattern recognition. Many similarity
functions have been proposed to quantify the similarity between two strings. In this thesis, we study string similarity joins with edit distance constraints. There exist
many algorithms to support string similarity join with edit distance constraint, such
as All-Pairs-Ed, PPJoin, and ED-Join. Most of these algorithms follows the lter
and verication paradigm, where indexes are used to lter many of the unpromising
string pairs and generate candidate pairs, then these candidates are veried to out- put the nal result. Recently, a verication-free, trie-based similarity join framework
is proposed. It uses a trie structure to index all strings in a given dataset. Though,
this framework has shown many advantages over the lter-and-rene framework, the
main problem with current trie-based join algorithms is that they generate and main-
tain lots of candidate prexes called active-nodes which need to be further removed.
With large edit distance, active-node sets becomes large as well. To overcome this
problem, they use many pruning techniques to remove false positive prexes in
a subsequent phase. This makes the existing algorithms inecient for processing
v
vi
large data sets with long strings, and higher edit distance threshold. In this thesis,
we propose a new trie-based join algorithm called PreJoin, which improves upon
current trie-based join methods. It eciently nds all similar string pairs using a
new active-node set generation method, and a dynamic preorder traversal of the
trie index. We improve PreJoin in order to be suitable to deal with large edit dis-
tance. This improvement called PreJoin-Plus algorithm. Experiments show that
our approaches is highly ecient for processing short as well as long strings, andoutperforms the state-of-the-art trie-based join approaches by a factor of ve.