mproved Streaming Quotient Filter: A Duplicate Detection Approach for Data Streams

Improved Streaming Quotient Filter: A Duplicate Detection Approach for Data Streams

Shiwei Che, Wu Yang, and Wei Wang

Information Securityresearch Center, Harbin Engineering University, China

Abstract: The unprecedented development and popularization of the Internet, combined with the emergence of a variety of modern applications, such as search engines, online transactions, climate warning systems and so on, enables the worldwide storage of data to grow unprecedented. Efficient storage, management and processing of such huge amounts of data has become an important academic research topic. The detection and removal of duplicate and redundant data from such multi-trillion data, while ensuring resource and computational efficiency, has constituted a challenging area of research.Because of the fact that all the data of potentially unbounded data streams can not be stored, and the need to delete duplicated data as accurately as possible, intelligent approximate duplicate data detection algorithms are urgently required. Many well-known methods based on the bitmap structure, Bloom Filter and its variants are listed in the literature. In this paper, we propose a new data structure, Improved Streaming Quotient Filter (ISQF), to efficiently detect and remove duplicate data in a data stream. ISQF intelligently stores the signatures of elements in a data stream, while using an eviction strategy to provide near zero error rates. We show that ISQF achieves near optimal performance with fairly low memory requirements, making it an ideal and efficient method for repeated data detection. It has a very low error rate. Empirically, we compared ISQF with some existing methods (especially Steaming Quotient Filter (SQF)). The results show that our proposed method outperforms theexisting methods in terms of memory usage and accuracy.We also discuss the parallel implementation of ISQF.

Keywords: Bloom filters, Computer Network, Data stream, Duplicate detection, False positive rates.

Received November 30, 2017; accepted July 21, 2019

https://doi.org/10.34028/iajit/17/5/10
Read 2923 times Last modified on Wednesday, 26 August 2020 05:29
Share
Top
We use cookies to improve our website. By continuing to use this website, you are giving consent to cookies being used. More details…