基于本地化差分隐私保护的频繁项目挖掘算法-《计算机技术与发展》

文章信息/Info

作者:: 朱美琪¹; 2; 杨庚¹; 2; 白云璐¹; 3; 1. 南京邮电大学计算机学院、网络空间安全学院,江苏南京 210023;
2. 江苏省大数据安全和智能处理重点实验室,江苏南京 210023;
3. 南京市医药大学信息技术学院,江苏南京 210023

Author(s):: ZHU Mei-qi¹; 2; YANG Geng¹; 2; BAI Yun-lu¹; 3; 1. School of Computer Science and Software,Nanjing University of Posts and Telecommunications,Nanjing 210023,China;
2. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing,Nanjing 210023,China;
3. School of Information Engineering,Nanjing University of Chinese Medicine,Nanjing 210023,China

Keywords:: frequent items mining; local differential privacy; set-valued data; privacy protection; randomized response

摘要:: 频繁项目挖掘是数据挖掘的研究热点之一, 若数据集包含敏感信息, 不作处理地发布挖掘结果会有隐私泄露的风险。目前,已有本地化差分隐私的频繁项目挖掘算法, 但还无法满足处理大数据时的实时性和数据可用性要求。针对这些问题,该文提出了一种新的面向本地化差分隐私保护的频繁项目挖掘算法—GFIM (group-based frequent items mining)。该算法把用户随机划分为不相交且大小相等的两组用户,整个运行过程分为两个阶段。第一阶段主要根据全部用户提交的信息挖掘出频繁项目的候选集 C ,而在第二阶段,两组用户分别通过设置冗余项把自身修剪为 O(k) 发送给数据收集者,最终的 top-k 频繁项目将利用上述两个阶段的结果。采用分阶段的思想减少了计算时遍历数据集的次数,加快了整体的运行速度。通过理论证明了该算法满足本地化差分隐私,在多个真实数据集上的实验也验证了该方法的性能。

Abstract:: Frequent items mining is one of the research hotspots of data mining. If the data set contains sensitive information, publishing mining results without processing risks privacy leakage. At present, there are heavy hitter estimation algorithms that meet local differential privacy, but they still cannot meet the real-time and data availability requirements when processing big data. In response to these problems,we propose a new frequent item set mining algorithm for local differential privacy protection—GFIM (group-based frequent items mining),which divides users into two disjoint and equal-sized parts randomly. The whole operation process is divided into two stages. In the first stage,GFIM digs out the candidate set C of frequent items based on the information submitted by all users. In the second stage, two groups of users trim themselves to O(k) by setting redundant items and send them to the data collector. The final top-k frequent items will use the results of the above two stages. According to the two-stage idea,the times of traversal of data set can be reduced and the overall running speed can be accelerated. Theoretically, it has been proved that the algorithm satisfies differential privacy and experiments on multiple real data sets have also verified the performance of the method.