Research on Parallelization of Hierarchical Clustering Algorithm Based on Spark

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 30
Issue:: 2020, No. 06

Pages:: 19-22

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2020-06-10

Article Info

Title:: Research on Parallelization of Hierarchical Clustering Algorithm Based on Spark

Article ID:: 1673-629X(2020)06-0019-04

Author(s):: YU Sheng-hui; LI Ling-juan; School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China

Keywords:: Spark; hierarchical clustering; CURE; RDD; parallelization

Classification number:: TP301.6

DOI:: 10.3969/j.issn.1673-629X.2020.06.004

Abstract:: With the advent of the big data era, traditional computing models can no longer cope with such large volumes of data. Spark, a parallel big data computing framework based on in-memory computing, addresses this problem well. CURE is a hierarchical clustering algorithm based on sampling and representative points; it iteratively merges the two closest clusters from the bottom up. Compared with traditional clustering algorithms, CURE is less sensitive to outliers. When processing large amounts of data, however, CURE spends a great deal of time on repeated iterations. This paper exploits the scalability and distributed nature of Spark's RDD programming model to parallelize the computation of the CURE algorithm, which speeds up data processing, lets the algorithm adapt to growing data scale, and improves clustering performance. Results of running the parallelized CURE algorithm on Spark over public datasets show that the Spark-based parallelization preserves clustering accuracy while improving the timeliness of the algorithm.
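The bottom-up merge described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation, which is not shown on this page. It assumes Euclidean distance, a farthest-point heuristic for picking representatives, and omits CURE's sampling and outlier handling. The names `representatives`, `cure`, `c` (representatives per cluster), and `alpha` (shrink factor) are hypothetical. The closest-pair search is written as a map over cluster pairs followed by a min reduction, which is the part of each iteration that a distributed framework could plausibly spread across workers.

```python
import math
from functools import reduce
from itertools import combinations


def representatives(points, c, alpha):
    """Pick up to c well-scattered points and shrink them toward the centroid."""
    dim = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    pool = list(dict.fromkeys(points))  # drop duplicate points, keep order
    # Farthest-point heuristic (an assumption): start from the point
    # farthest from the centroid, then add the point farthest from the
    # representatives chosen so far.
    chosen = [max(pool, key=lambda p: math.dist(p, centroid))]
    while len(chosen) < min(c, len(pool)):
        rest = [p for p in pool if p not in chosen]
        chosen.append(max(rest, key=lambda p: min(math.dist(p, r) for r in chosen)))
    # Shrinking toward the centroid dampens the influence of outliers.
    return [tuple(r[i] + alpha * (centroid[i] - r[i]) for i in range(dim))
            for r in chosen]


def make_cluster(points, c, alpha):
    return {"points": list(points), "reps": representatives(list(points), c, alpha)}


def cluster_distance(a, b):
    """Distance between two clusters = closest pair of their representatives."""
    return min(math.dist(r, s) for r in a["reps"] for s in b["reps"])


def cure(points, k, c=4, alpha=0.3):
    """Merge the two closest clusters until k clusters remain (requires k >= 1)."""
    clusters = [make_cluster([p], c, alpha) for p in points]
    while len(clusters) > k:
        # "map": score every pair of clusters; "reduce": keep the closest pair.
        scored = map(
            lambda ij: (cluster_distance(clusters[ij[0]], clusters[ij[1]]), ij),
            combinations(range(len(clusters)), 2),
        )
        _, (i, j) = reduce(min, scored)
        merged = make_cluster(clusters[i]["points"] + clusters[j]["points"], c, alpha)
        clusters = [cl for n, cl in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(cl["points"]) for cl in clusters]
```

For example, `cure([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10). Every iteration rescans all cluster pairs, which illustrates why repeated iteration becomes costly on large inputs.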
