

Research on Parallelization of Density Clustering Algorithm Based on Spark (基于Spark的密度聚类算法并行化研究)


Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 28
Issue:: 2018, No. 06

Pages:: 80-84

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2018-06-10

Article Info

Title:: Research on Parallelization of Density Clustering Algorithm Based on Spark

Article ID:: 1673-629X(2018)06-0080-05

Author(s):: ZHU Zi-long (朱子龙); LI Ling-juan (李玲娟); School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China

Keywords:: DBSCAN; clustering; Spark; parallelization

CLC number:: TP301.6

DOI:: 10.3969/j.issn.1673-629X.2018.06.018

Document code:: A

Abstract:: Cluster analysis is an active research topic in data mining, and DBSCAN is one of its more important density-based algorithms. Apache Spark extends the widely used MapReduce computing model with a memory-based parallel computing framework. By caching intermediate results in memory, it reduces disk I/O and so supports interactive queries, iterative computation, and other computing modes more efficiently. To better support clustering on big data, this paper studies how to parallelize DBSCAN on Spark, a mainstream big data processing framework, and designs a Spark-based parallelization scheme for the algorithm. Through appropriate use of RDDs and of the Sample operator, the map function, the collectAsMap operator, and the reduceByKey operator, the scheme parallelizes the process of finding the data points that are density-reachable from core objects. Parallel clustering experiments with DBSCAN on the Spark platform, using the UCI Wine, Car Evaluation, and Adult data sets, show that the parallelized DBSCAN algorithm achieves good accuracy and time efficiency and is suitable for big data clustering.
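
The dataflow named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is plain Python with no Spark dependency, and the functions `map_partition`, `reduce_by_key`, and `dbscan`, along with the `num_partitions` parameter, are hypothetical stand-ins for an RDD map step, for reduceByKey followed by collectAsMap, and for the driver-side cluster expansion. The Sample operator is left out because the abstract does not say how it is used.

```python
from collections import defaultdict
from itertools import chain
from math import dist


def map_partition(partition, all_points, eps):
    """Stand-in for an RDD map step: emit (index, neighbor_index) pairs."""
    pairs = []
    for i, p in partition:
        for j, q in all_points:
            if dist(p, q) <= eps:
                pairs.append((i, j))
    return pairs


def reduce_by_key(pairs):
    """Stand-in for reduceByKey + collectAsMap: index -> set of neighbor indices."""
    neighbors = defaultdict(set)
    for key, value in pairs:
        neighbors[key].add(value)
    return dict(neighbors)


def dbscan(points, eps, min_pts, num_partitions=2):
    """Return one cluster label per point; -1 marks noise."""
    indexed = list(enumerate(points))
    # Split the data as an RDD would be partitioned; in Spark each
    # partition would be processed by a separate task.
    partitions = [indexed[k::num_partitions] for k in range(num_partitions)]
    pairs = chain.from_iterable(
        map_partition(part, indexed, eps) for part in partitions
    )
    neighbors = reduce_by_key(pairs)

    # A core object has at least min_pts points (itself included) within eps.
    core = {i for i, nbrs in neighbors.items() if len(nbrs) >= min_pts}

    labels = [-1] * len(points)
    cluster_id = 0
    for seed in sorted(core):
        if labels[seed] != -1:
            continue  # already absorbed into an earlier cluster
        labels[seed] = cluster_id
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            for nb in neighbors[current]:
                if labels[nb] == -1:
                    labels[nb] = cluster_id
                    if nb in core:  # only core objects extend the cluster
                        frontier.append(nb)
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this sketch only the neighborhood search is split across partitions, which matches the abstract's statement that the density-reachability search is the parallelized step. The cluster expansion runs sequentially over the collected neighbor map.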

