![]() | Only 14 pages are availabe for public view |
Abstract One of the most important primitive data types in modern data processing is strings.String data is ubiquitous, and its management has taken on particular importance in the past few years. An important research topic in string processing is string similarity join. A string similarity join nds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integra- tion and cleaning, web page detection, and pattern recognition. Many similarity functions have been proposed to quantify the similarity between two strings. In this thesis, we study string similarity joins with edit distance constraints. There exist many algorithms to support string similarity join with edit distance constraint, such as All-Pairs-Ed, PPJoin, and ED-Join. Most of these algorithms follows the lter and verication paradigm, where indexes are used to lter many of the unpromising string pairs and generate candidate pairs, then these candidates are veried to out- put the nal result. Recently, a verication-free, trie-based similarity join framework is proposed. It uses a trie structure to index all strings in a given dataset. Though, this framework has shown many advantages over the lter-and-rene framework, the main problem with current trie-based join algorithms is that they generate and main- tain lots of candidate prexes called active-nodes which need to be further removed. With large edit distance, active-node sets becomes large as well. To overcome this problem, they use many pruning techniques to remove false positive prexes in a subsequent phase. This makes the existing algorithms inecient for processing v vi large data sets with long strings, and higher edit distance threshold. In this thesis, we propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It eciently nds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the trie index. We improve PreJoin in order to be suitable to deal with large edit dis- tance. This improvement called PreJoin-Plus algorithm. Experiments show that our approaches is highly ecient for processing short as well as long strings, andoutperforms the state-of-the-art trie-based join approaches by a factor of ve. |