Improved Streaming Quotient Filter: A Duplicate Detection Approach for Data
Streams
Shiwei Che, Wu Yang, and Wei Wang
Information Securityresearch Center, Harbin
Engineering University, China
Abstract: The unprecedented development and
popularization of the Internet, combined with the emergence of a variety of
modern applications, such as search engines, online transactions, climate
warning systems and so on, enables the worldwide storage of data to grow
unprecedented. Efficient storage, management and processing of such huge
amounts of data has become an important academic research topic. The detection
and removal of duplicate and redundant data from such multi-trillion data, while
ensuring resource and computational efficiency, has constituted a challenging
area of research.Because of the fact that all
the data of potentially unbounded data streams can not be stored, and the need
to delete duplicated data as accurately as possible, intelligent approximate
duplicate data detection algorithms are urgently required.
Many well-known methods based on the bitmap structure, Bloom Filter and its
variants are listed in the literature. In this paper, we propose a new data
structure, Improved Streaming Quotient Filter (ISQF), to efficiently
detect and remove duplicate data in a data stream. ISQF intelligently stores
the signatures of elements in a data stream, while using an eviction strategy
to provide near zero error rates. We show that ISQF achieves near optimal
performance with fairly low memory requirements, making it an ideal and
efficient method for repeated data detection. It has a very low error rate.
Empirically, we compared ISQF with some existing methods (especially Steaming
Quotient Filter (SQF)). The results show that our proposed method outperforms
theexisting methods in terms of memory usage and accuracy.We also discuss the parallel implementation
of ISQF.
Keywords: Bloom
filters, Computer Network, Data stream, Duplicate detection, False
positive rates.
Received November
30, 2017; accepted July
21, 2019