
基于相似度均值的分类数据层次聚类分析算法 (A Hierarchical Clustering Analysis Algorithm of Categorical Data Based on Mean of Similarity)


Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 32
Issue:: 2022, No. 11

Pages:: 154-163

Section:: Artificial Intelligence

Publication date:: 2022-11-10

Article Info

Title:: A Hierarchical Clustering Analysis Algorithm of Categorical Data Based on Mean of Similarity

Article ID:: 1673-629X(2022)11-0154-10

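Author(s) (Chinese):: 褚轲欣; 荀亚玲; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, Shanxi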

Author(s):: CHU Ke-xin; XUN Ya-ling; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China

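Keywords (Chinese):: 层次聚类 (hierarchical clustering); 分类数据 (categorical data); 相似度均值 (mean of similarity); 平稳相似性度量 (steady similarity measure); 分配策略 (allocation strategy)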

Keywords:: hierarchical clustering; categorical data; mean of similarity; steady similarity measure; allocation strategy

CLC number:: TP311

DOI:: 10.3969/j.issn.1673-629X.2022.11.023

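Abstract (translated from the Chinese):: Hierarchical clustering analysis is a widely used unsupervised learning technique in data mining, machine learning and related fields. However, because hierarchical clustering algorithms mainly rely on a manually set similarity threshold to merge or split clusters, the threshold is difficult to set when no prior knowledge is available. Using the mean of similarity together with an allocation strategy for boundary data objects, a hierarchical clustering analysis algorithm for categorical data based on the mean of similarity is proposed. The algorithm uses the mean of similarity to characterize the central tendency of the distribution of data objects in a data set, together with a steady similarity measure, as an important basis for merging or splitting clusters. A formula for computing the mean of similarity is given, so that the similarity threshold can be determined automatically, which removes the need to set the threshold parameter manually. Based on the mean of similarity, an allocation strategy for boundary data objects is given, which effectively improves the accuracy of boundary-object allocation and the clustering quality. Experiments on UCI and synthetic data sets verify that the algorithm has good clustering performance and noise resistance, and that the mean of similarity is stable and effective.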

Abstract:: Hierarchical clustering analysis is a widely used unsupervised learning technique in data mining and machine learning. However, it is difficult to set the similarity threshold without prior knowledge, since hierarchical clustering algorithms mainly rely on a manually set similarity threshold to merge or split clusters. Based on the mean of similarity and a boundary data object allocation strategy, a hierarchical clustering analysis algorithm for categorical data using the mean of similarity is proposed. The algorithm uses the mean of similarity, which captures the central tendency of the distribution of data objects in a data set, together with a steady similarity measure as an important basis for merging or splitting clusters. A formula for calculating the mean of similarity is given, which determines the similarity threshold automatically and removes the need to set it manually. A boundary data object allocation strategy based on the mean of similarity is presented, which effectively improves the accuracy of boundary data object allocation and the clustering quality. Experimental results on UCI and artificial data sets validate the algorithm's clustering performance and noise resistance, as well as the stability and effectiveness of the mean of similarity.
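
The abstract does not reproduce the paper's formula for the mean of similarity, its steady similarity measure, or its boundary-object allocation strategy, so the following is only a minimal sketch of the general idea of replacing a hand-set threshold with one derived from the data. Everything in it is an assumption made for illustration: the simple matching coefficient stands in for the similarity measure, the plain average of all pairwise similarities stands in for the mean of similarity, average linkage stands in for the merge criterion, and the function names are hypothetical.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of attributes on which two categorical records agree.
    (Assumed stand-in; the paper's similarity measure is not given in the abstract.)"""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_similarity(data):
    """Plain average of all pairwise similarities, used here as the threshold.
    (Assumed stand-in for the paper's mean-of-similarity formula.)"""
    pairs = list(combinations(data, 2))
    if not pairs:
        raise ValueError("need at least two records")
    return sum(simple_matching(a, b) for a, b in pairs) / len(pairs)


def average_link(c1, c2, data):
    """Average similarity between the members of two clusters (lists of indices)."""
    total = sum(simple_matching(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(data):
    """Agglomerative merging that stops once no pair of clusters is more
    similar than the data-derived threshold."""
    threshold = mean_similarity(data)
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > 1:
        # Most similar pair of current clusters (p < q by construction).
        best_sim, (p, q) = max(
            (average_link(clusters[p], clusters[q], data), (p, q))
            for p, q in combinations(range(len(clusters)), 2)
        )
        if best_sim <= threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters, threshold


if __name__ == "__main__":
    # Hypothetical toy records with three categorical attributes each.
    sample = [
        ("red", "round", "small"),
        ("red", "round", "large"),
        ("red", "round", "small"),
        ("blue", "square", "large"),
        ("blue", "square", "small"),
        ("blue", "square", "large"),
    ]
    groups, t = cluster(sample)
    print("threshold:", t)
    print("clusters:", groups)
```

The point of the sketch is that the caller never supplies a threshold: the stopping point for merging is computed from the data set itself, which is the property the abstract claims for the proposed algorithm.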

