[1] ZHU Zi-long, LI Ling-juan. Research on Parallelization of Density Clustering Algorithm Based on Spark[J]. Computer Technology and Development, 2018, 28(06): 80-84. [doi:10.3969/j.issn.1673-629X.2018.06.018]

Research on Parallelization of Density Clustering Algorithm Based on Spark

Computer Technology and Development (计算机技术与发展) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
28
Issue:
2018, No. 06
Pages:
80-84
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2018-06-10

Article Info

Title:
Research on Parallelization of Density Clustering Algorithm Based on Spark
Article ID:
1673-629X(2018)06-0080-05
Author(s):
ZHU Zi-long (朱子龙), LI Ling-juan (李玲娟)
School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
Keywords:
DBSCAN; clustering; Spark; parallelization
Classification number:
TP301.6
DOI:
10.3969/j.issn.1673-629X.2018.06.018
Document code:
A
Abstract:
Cluster analysis is an active research topic in data mining, and DBSCAN is one of the more important density-based clustering algorithms. Apache Spark extends the widely used MapReduce computing model with an in-memory parallel computing framework. By caching intermediate results in memory, it reduces disk I/O and so supports interactive queries, iterative computation, and other computing modes more efficiently. To better support clustering of big data, this paper studies how to parallelize the DBSCAN algorithm on Spark, a mainstream big data processing framework. A Spark-based parallelization scheme for DBSCAN is designed: through appropriate use of RDDs and the Sample operator, map function, collectAsMap operator, and reduceByKey operator, the process of finding the data points that are density-reachable from core objects is parallelized. Parallel clustering results on the UCI Wine, Car Evaluation, and Adult data sets on the Spark platform show that the parallelized DBSCAN algorithm achieves good accuracy and timeliness and is suitable for big data clustering.
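The step the abstract singles out, finding the points that are density-reachable from core objects, can be illustrated with a small sketch. This is not the authors' implementation, which the abstract does not give. It is a plain-Python approximation with no Spark dependency, in which the hypothetical helpers `neighbour_pairs` and `reduce_by_key` stand in for a map function and the reduceByKey operator, and a dictionary plays the role of a collectAsMap result. The parameter names `eps` and `min_pts` are assumptions taken from standard DBSCAN.

```python
from math import dist


def reduce_by_key(pairs, combine):
    """Plain-Python stand-in for reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def neighbour_pairs(index, point, lookup, eps):
    """Map step: emit (index, {j}) for every point j within eps of `point` (itself included)."""
    return [(index, {j}) for j, other in lookup.items() if dist(point, other) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    lookup = dict(enumerate(points))  # plays the role of a collectAsMap result
    pairs = [pair for i, p in lookup.items() for pair in neighbour_pairs(i, p, lookup, eps)]
    neighbours = reduce_by_key(pairs, lambda a, b: a | b)
    # Core objects have at least min_pts points in their eps-neighbourhood.
    core = {i for i, n in neighbours.items() if len(n) >= min_pts}

    labels = [-1] * len(points)
    cluster = 0
    for seed in sorted(core):
        if labels[seed] != -1:
            continue  # already absorbed into an earlier cluster
        labels[seed] = cluster
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            for j in neighbours[current]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if j in core:  # only core objects extend the cluster
                        frontier.append(j)
        cluster += 1
    return labels
```

In this sketch the neighbourhood computation is the part that maps naturally onto per-record operators, because each point's pairs can be produced independently. The cluster expansion loop is left sequential here for brevity.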

Last Update: 2018-08-16