In this paper, we will first present REtriever, a novel DFA-based String Matching Engine for the "String Retrieval" problem in Publish/Subscribe System. For the current stage, REtriever supports a subset of Regular Expression, which can be used to specify XPath Language, continuous stream queries and string-based rules. Thus REtriever can be adapted to be used for XML Filtering and continous queries on streams. REtriever is based on the idea of combined-NFA in YFilter, but it uses novel techniques to transform Nondeterministic Finite Automata(NFA) to Deterministic Finite Automata(DFA), and it develops several algorithms to further exploit "general overlaps" among subscriptions to further reduce the size of DFA. We will also discuss experimental observations of the performance of REtriever compare with YFilter and Naive String Matching Engines. Our experiment show that Naive Sequential Search algorithm does not scale with the number of subscriptions; and REtriever achieves 50 percent of performance enhancement relative to YFilter. Lastly, we will briefly overview our on-going research on extending REtriever.
For detailed algorithm description and result representation and analysis, please refer to the upcoming Technical Report.
The Publish/Subscribe paradigm is a middleware model applicable to selective information dissemination, location-based services and workload management. The publish/subscribe system filters subscriptions, expressing user interests and profiles against publications, submitted by content providers. In contrast to traditional database processing, publish/subscribe tracks occurrences of events that happen in the future and are therefore not stored in a databse. Subscriptions based on string predicates are crucial for the wide-spread use of this new paradigm on the Web and in mainstream products. This project aims at building a matching kernel that scales to millions of subscriptions and is based on an efficient algorithm that can process string predicates and strings. In this context, string predicates constitute subscriptions and strings represent publications.