[1] LI Lin-lin, NI Jian-cheng, CAO Bo, et al. Parallel Clustering Algorithm with Spark Framework[J]. Computer Technology and Development, 2017, 27(05): 97-101.

 Parallel Clustering Algorithm with Spark Framework
Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
27
Issue:
2017, No. 05
Pages:
97-101
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2017-05-10

Article Information/Info

Title:
 Parallel Clustering Algorithm with Spark Framework
Article ID:
1673-629X(2017)05-0087-05
Author(s):
 LI Lin-lin[1], NI Jian-cheng[2], CAO Bo[1], YU Ping-ping[1], YAO Bin-xiu[1]
1. School of Information Science and Engineering, Qufu Normal University; 2. School of Software, Qufu Normal University
Keywords:
 K-means; Spark; big data; Hadoop; MapReduce
CLC number:
TP301.6
Document code:
A
Abstract:
 When processing massive data, the traditional K-means algorithm suffers from a distance-computation bottleneck and can run short of memory as the number of iterations grows. To address these problems, this paper proposes SBTICK-means (Spark Based Triangle Inequality Canopy-K-means), a parallel clustering algorithm built on the Spark framework. To reduce the blindness and randomness of choosing K, the algorithm uses Canopy as a preprocessing step to obtain the initial cluster centers and the value of K. During the K-means iterations it applies the triangle inequality to eliminate redundant distance computations and speed up clustering. The algorithm is parallelized on Spark and exploits Spark's in-memory computing to raise data processing speed and shorten the overall running time. Experimental results show that SBTICK-means greatly improves clustering efficiency while preserving accuracy. Compared with the traditional K-means algorithm, the Canopy-K-means algorithm, and a MapReduce-based implementation of the same algorithm, it improves the speedup ratio, the scale-up ratio, and the running speed, and is therefore better suited to clustering massive data.
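
The two ideas named in the abstract, Canopy preprocessing to choose K and the initial centers, and triangle-inequality pruning during assignment, can be illustrated with a small single-machine sketch. This is not the authors' SBTICK-means implementation: the Spark parallelization is omitted, the Canopy step is simplified to a single threshold (here called t2), and all function names are hypothetical. The pruning rule relies on the fact that if d(c_best, c_j) >= 2 d(p, c_best), then d(p, c_j) >= d(p, c_best), so the distance to c_j need not be computed.

```python
import math


def dist(a, b):
    # Euclidean distance between two points (math.dist needs Python 3.8+).
    return math.dist(a, b)


def canopy_centers(points, t2):
    """Simplified Canopy pass (assumption: single threshold t2 only).

    Takes the first remaining point as a center, then drops every point
    within t2 of it. The number of centers returned is used as K.
    """
    remaining = list(points)
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [p for p in remaining if dist(p, c) > t2]
    return centers


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center.

    Skips a candidate center c_j when d(c_best, c_j) >= 2 * d(p, c_best),
    because the triangle inequality then guarantees c_j is not closer.
    Returns (labels, number of skipped distance computations).
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        for j in range(1, k):
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def kmeans(points, centers, iterations=10):
    """Plain K-means loop that uses the pruned assignment step."""
    labels = []
    for _ in range(iterations):
        labels, _ = assign_with_pruning(points, centers)
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of its members.
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(c)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = canopy_centers(points, t2=3.0)  # K is len(seeds)
final_centers, final_labels = kmeans(points, seeds)
```

In a Spark setting the assignment step would typically run per partition with the current centers broadcast to the workers; that part is left out here to keep the sketch self-contained.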

Last Update: 2017-07-07