Daisy Z. Wang -- String Matching for Pub/Sub, 2004 Summer

String Maching in Publish/Subscribe System

McAllister Research Award, Summer 2004
Supervisor: Professor Arno Jacobsen
UofT Middleware Research Group

State of the Art:

Technical Report:
REtriever: String Matching Engine for Publish/Subscribe System

Abstract

In this paper, we will first present REtriever, a novel DFA-based String Matching Engine for the "String Retrieval" problem in Publish/Subscribe System. For the current stage, REtriever supports a subset of Regular Expression, which can be used to specify XPath Language, continuous stream queries and string-based rules. Thus REtriever can be adapted to be used for XML Filtering and continous queries on streams. REtriever is based on the idea of combined-NFA in YFilter, but it uses novel techniques to transform Nondeterministic Finite Automata(NFA) to Deterministic Finite Automata(DFA), and it develops several algorithms to further exploit "general overlaps" among subscriptions to further reduce the size of DFA. We will also discuss experimental observations of the performance of REtriever compare with YFilter and Naive String Matching Engines. Our experiment show that Naive Sequential Search algorithm does not scale with the number of subscriptions; and REtriever achieves 50 percent of performance enhancement relative to YFilter. Lastly, we will briefly overview our on-going research on extending REtriever.

For detailed algorithm description and result representation and analysis, please refer to the upcoming Technical Report.

Project Proposal:

The Publish/Subscribe paradigm is a middleware model applicable to selective information dissemination, location-based services and workload management. The publish/subscribe system filters subscriptions, expressing user interests and profiles against publications, submitted by content providers. In contrast to traditional database processing, publish/subscribe tracks occurrences of events that happen in the future and are therefore not stored in a databse. Subscriptions based on string predicates are crucial for the wide-spread use of this new paradigm on the Web and in mainstream products. This project aims at building a matching kernel that scales to millions of subscriptions and is based on an efficient algorithm that can process string predicates and strings. In this context, string predicates constitute subscriptions and strings represent publications.

Resources

YFilter

Papers

Y.L Diao, P. Fischer, M. J. Franklin, R. To, YFilter: Efficient and Acalable Filtering of XML Documents, In Proc. of ICDE 2002
Y.L Diao, M. J. Franklin, Query Processing for High-Volume XML Message Brokering, In Proc. of VLDB 2003
C.Y Chan, M. Garofalakis, R. Rastogi, RE-Tree: An Efficient Index Structure for Regular Expressions, In Proc. of VLDB 20022002, HK
M. Altinel and M.J. Franklin. Efficient Filtering of XML Documents for Selective Dissemination of Information, In Proc. of VLDB 2000
C.-Y. Chan, P. Felber, M. Garofalakis, R. Rastogi, Efficient Filtering of XML Documetns with XPath Expressions, In Proc. of ICDE 2002
W3C XML Path Language (XPath) Version 1.0
I. Burcea, H-A. Jacobsen, XToPSS - Efficient XML Filtering against Large Numbers of XPath Expressions

String Maching in Publish/Subscribe System

State of the Art: Technical Report: REtriever: String Matching Engine for Publish/Subscribe System

Project Proposal:

Resources

Papers

People

State of the Art:

Technical Report:
REtriever: String Matching Engine for Publish/Subscribe System