[1] LI Yu-bo, YANG Yu-wang, TANG Hao, et al. Optimization of K-means Updating Security Interval Based on Spark[J]. Computer Technology and Development, 2017, 27(08): 1-6.
Optimization of K-means Updating Security Interval Based on Spark

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume: 27
Issue: 2017, No. 08

Pages: 1-6

Section: Intelligence, Algorithms, and Systems Engineering

Publication date: 2017-08-10

Article Info

Title: Optimization of K-means Updating Security Interval Based on Spark

Article ID: 1673-629X(2017)08-0001-06

Author(s): LI Yu-bo[1]; YANG Yu-wang[1]; TANG Hao[1]; CHEN Guang-wei[2]
Affiliations: 1. School of Computer Science and Engineering, Nanjing University of Science and Technology; 2. Purdue University

Keywords: K-means; security interval; Spark; big data; time efficiency

CLC number: TP301

Document code: A

Abstract: Each time the K-means algorithm updates the cluster centers, it recomputes the distance from every point in the dataset to the latest centers in order to obtain each point's current cluster. This global iterative computation makes the traditional K-means algorithm time-inefficient. As the dataset grows, its time efficiency and clustering performance degrade too quickly, so traditional K-means is not well suited to clustering in big data environments. To address time efficiency and performance in big data scenarios, a K-means security interval updating optimization algorithm based on Spark is proposed. After each update of the cluster centers, the algorithm updates the security interval labels. Based on whether a label is greater than 0, it determines in one step the cluster of all data falling within that interval, which avoids computing the distance between every point and every center and reduces the time and computing overhead caused by global iteration. The algorithm builds on the point vector model of the Spark MLlib component to optimize model performance. The optimized K-means and the traditional K-means were compared experimentally on two metrics, the average error criterion and the running time. The results show that the proposed algorithm outperforms the traditional K-means clustering algorithm on both metrics and is suitable for data clustering in big data environments.
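The abstract does not say how the security interval label is computed, so the following is only a minimal sketch under one plausible reading, borrowed from triangle-inequality bounds used in accelerated K-means variants. Each point keeps a margin equal to its distance to the second-closest center minus its distance to the closest one. After the centers move, the margin is reduced by the shift of the point's own center plus the largest shift among the other centers; while the margin stays above 0, the point's cluster cannot have changed and its distances are not recomputed. The names `kmeans`, `margin`, `use_intervals`, and `make_blobs`, as well as the update rule itself, are assumptions for illustration. The sketch is single-machine plain Python, not the authors' Spark MLlib implementation, and it tracks one margin per point rather than one label per interval.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, init_centers, use_intervals=True, max_iter=100):
    """Lloyd-style K-means for k >= 2 with an optional per-point safety margin.

    Returns (centers, assignments, number of skipped distance checks).
    """
    centers = [tuple(c) for c in init_centers]
    k, dim = len(centers), len(centers[0])
    assign = [0] * len(points)
    margin = [0.0] * len(points)  # hypothetical label; <= 0 forces a full check
    skipped = 0
    for _ in range(max_iter):
        for i, p in enumerate(points):
            if use_intervals and margin[i] > 0:
                skipped += 1  # label still positive: cluster cannot have changed
                continue
            ranked = sorted((dist(p, c), j) for j, c in enumerate(centers))
            assign[i] = ranked[0][1]
            margin[i] = ranked[1][0] - ranked[0][0]  # second-closest minus closest
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                n = len(members)
                new_centers.append(tuple(sum(m[d] for m in members) / n for d in range(dim)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        shift = [dist(c, n) for c, n in zip(centers, new_centers)]
        centers = new_centers
        if max(shift) == 0.0:
            break  # no center moved, so the assignments are stable
        for i, a in enumerate(assign):
            # Worst case: own center moved away, another center moved closer.
            others = max(s for j, s in enumerate(shift) if j != a)
            margin[i] -= shift[a] + others
    return centers, assign, skipped


def make_blobs(seed=0, per_blob=50):
    """Three well-separated 2-D Gaussian blobs (synthetic demo data)."""
    rng = random.Random(seed)
    blob_means = [(0.0, 0.0), (20.0, 20.0), (40.0, 0.0)]
    return [(rng.gauss(mx, 1.0), rng.gauss(my, 1.0))
            for mx, my in blob_means for _ in range(per_blob)]


if __name__ == "__main__":
    pts = make_blobs()
    init = [pts[0], pts[50], pts[100]]
    _, labels, skipped = kmeans(pts, init)
    print("skipped distance checks:", skipped)
```

Calling `kmeans(..., use_intervals=False)` on the same data and initial centers runs the plain global iteration, which gives a baseline for comparing both the assignments and the number of distance checks avoided.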

