

K-means Algorithm Based on Density-Optimized Initial Clustering Centers


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 30
Issue:: 2020, No. 12

Pages:: 99-105

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2020-12-10

Article Info

Title:: K-means Algorithm Based on Density Optimization Initial Clustering Center

Article ID:: 1673-629X(2020)12-0099-07

Author(s):: WANG Yan-e¹; AN Jian²; LIANG Yan¹; KANG Jing-jing³; 1. School of Technology, Xi’an Siyuan University, Xi’an 710038, China; 2. Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen 518057, China; 3. School of Information Engineering, Shanxi Agricultural University, Jinzhong 030800, China

Keywords:: K-means algorithm; density; denoising; optimal hypersphere; mean square error; noisy data

CLC number:: TP181

DOI:: 10.3969/j.issn.1673-629X.2020.12.018

Abstract:: The K-means algorithm selects its initial clustering centers at random, which makes it sensitive to noise and outliers, and its results depend heavily on expert experience and therefore lack objectivity. To address these problems, we propose a new measure of sample density for optimizing the selection of the initial clustering centers in K-means. Based on the actual distribution of the samples, the method measures density by the number of samples inside an optimal hypersphere and by the similarity among those samples, so that the selected initial centers lie closer to the actual distribution of the sample set. The algorithm is tested on a breast cancer data set, commonly used UCI data sets, and artificial simulated data sets. The experiments show that, compared with existing similar methods, the proposed algorithm improves the clustering evaluation indices on every data set, runs faster, and gives more stable and more accurate clustering results. Its accuracy on the breast cancer data set wdbc is 91.04%, an increase of 6%; its accuracy on Iris is 94%, an increase of 5%.
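The abstract does not give the density formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a fixed hypersphere radius (half the mean pairwise distance, a made-up default), scores each sample by the number of neighbours inside its hypersphere divided by one plus their mean distance (a stand-in for "similarity"), and picks each further center by maximizing density times distance to the centers already chosen. The names `density_init_centers`, `kmeans`, and `radius` are hypothetical.

```python
import numpy as np


def density_init_centers(X, k, radius=None):
    """Pick k initial centers from dense, mutually distant samples (illustrative only)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if radius is None:
        # Assumed heuristic, not from the paper: half the mean pairwise distance.
        radius = dist.mean() / 2
    inside = dist <= radius            # membership in each sample's hypersphere (includes itself)
    count = inside.sum(axis=1)         # number of samples in the hypersphere
    mean_in = (dist * inside).sum(axis=1) / count  # small value = similar neighbours
    # Assumed density: many neighbours that are close together score high;
    # an isolated outlier has count 1 and therefore a low score.
    density = count / (1.0 + mean_in)

    chosen = [int(np.argmax(density))]
    while len(chosen) < k:
        # Distance from every sample to its nearest already-chosen center.
        min_d = dist[:, chosen].min(axis=1)
        chosen.append(int(np.argmax(density * min_d)))
    return X[chosen]


def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    X = np.asarray(X, dtype=float)
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Calling `kmeans(X, density_init_centers(X, k))` replaces the random initialization with a deterministic one. Because an isolated noise point receives a low density score, it is unlikely to be chosen as an initial center, which is the property the abstract emphasizes.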
