Privacy-Preserving for Distributed Data Streams: Towards l-Diversity
Mona Mohamed, Sahar Ghanem, and Magdy Nagi
Computer and Systems Engineering Department, Alexandria University, Egypt
Abstract: Privacy-preserving data publishing has been studied widely on
static data. However, many recent applications generate data streams that are
real-time, unbounded, rapidly changing, and distributed in nature. Recently, a few
works have addressed k-anonymity and l-diversity for data streams. Their models
imply that if the stream is distributed, it is collected at a central site
for anonymization. In this paper, we propose a novel distributed model where
distributed streams are first anonymized by distributed (collecting) sites
before merging and releasing. Our approach extends Continuously Anonymizing
STreaming data via adaptive cLustEring (CASTLE), a cluster-based approach that provides
both k-anonymity and l-diversity for centralized data streams. The main idea is
for each site to construct its local clustering model and exchange this local
view with other sites to globally construct approximately the same clustering view.
The approach is heuristic in the sense that not every update to the local view is
sent; instead, triggering events are selected for exchanging cluster information.
Extensive experiments on a real data set are performed to study the introduced Information
Loss (IL) under different settings. First, the impact of the different parameters
on IL is quantified. Then k-anonymity and l-diversity are compared in terms of
messaging cost and IL. Finally, the effectiveness of the proposed distributed
model is studied by comparing the introduced IL to the IL of the centralized
model (as a lower bound) and to a distributed model with no communication (as
an upper bound). The experimental results show that the main contributing
factor to IL is the number of attributes in the quasi-identifier (50%-75%),
whereas the number of sites contributes only about 1%, which demonstrates the
scalability of the proposed approach. In addition, providing l-diversity is shown
to introduce about a 25% increase in IL compared to k-anonymity. Moreover, a 35%
reduction in IL is achieved at a messaging cost (in bytes) of about 0.3% of the
data set size.
Keywords: k-anonymity, l-diversity, data streams, clustering.
Received April 20, 2017; accepted December 18, 2017
https://doi.org/10.34028/iajit/17/1/7
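The two privacy conditions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Cluster` class, its method names, and the `information_loss` helper are hypothetical. It assumes that every tuple belongs to a distinct individual (so k-anonymity reduces to a tuple count), that l-diversity means at least l distinct sensitive values, and that IL is measured as the average normalized width of the generalized range of each numeric quasi-identifier attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster: one (low, high) range per numeric quasi-identifier."""
    ranges: list
    sensitive_values: list = field(default_factory=list)

    def add(self, qi, sensitive):
        # Enlarge each range just enough to cover the new tuple.
        self.ranges = [(min(lo, v), max(hi, v))
                       for (lo, hi), v in zip(self.ranges, qi)]
        self.sensitive_values.append(sensitive)

    def is_k_anonymous(self, k):
        # Assumption: each tuple comes from a distinct individual.
        return len(self.sensitive_values) >= k

    def is_l_diverse(self, l):
        # Assumption: "distinct" l-diversity (l different sensitive values).
        return len(set(self.sensitive_values)) >= l


def new_cluster(qi, sensitive):
    """Start a cluster from a single tuple (zero-width ranges)."""
    return Cluster(ranges=[(v, v) for v in qi], sensitive_values=[sensitive])


def information_loss(cluster, domains):
    """Average of (range width / domain width) over all attributes (assumed measure)."""
    losses = [(hi - lo) / (d_hi - d_lo)
              for (lo, hi), (d_lo, d_hi) in zip(cluster.ranges, domains)]
    return sum(losses) / len(losses)


# Illustrative data only: (age, income) as quasi-identifier, diagnosis as sensitive value.
domains = [(0, 100), (0, 100_000)]
cluster = new_cluster((30, 40_000), "flu")
cluster.add((40, 60_000), "flu")
cluster.add((35, 50_000), "asthma")

print(cluster.is_k_anonymous(3), cluster.is_l_diverse(2), cluster.is_l_diverse(3))
print(round(information_loss(cluster, domains), 4))
```

Under these assumptions, a cluster may be released only once both checks pass, and wider ranges (more generalization) translate directly into higher IL, which is why adding attributes to the quasi-identifier tends to dominate the loss.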