A K-medoids Clustering Algorithm Based on Standard Deviation - 《计算机技术与发展》 (Computer Technology and Development)

Article Info

Author(s): DENG Yu-fang; ZHANG Ji-fu; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China

Keywords: K-medoids clustering algorithm; initial center point; standard deviation; UCI dataset

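Abstract (translated from the Chinese): K-medoids clustering is relatively insensitive to outliers and is robust, but because of the way the initial cluster centers are selected and the center points are iteratively updated, its clustering accuracy and efficiency are low. Since the standard deviation reflects the dispersion of the data, this paper defines a candidate set of initial center points and gives a K-medoids clustering algorithm based on standard deviation. The algorithm first uses the standard deviation to define the candidate set and determines the initial center points by adding them one at a time, which ensures that sample points in dense regions are chosen as initial cluster centers and avoids choosing sample points in sparse regions, especially outliers, as initial centers. Next, each data sample is assigned to its nearest center point to form the initial clusters, and the cluster centers are updated repeatedly until the sum of squared clustering errors no longer changes, which yields the final clusters. Finally, experiments on UCI datasets and synthetic datasets verify that the algorithm has good clustering accuracy, efficiency, and robustness.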

Abstract: The K-medoids clustering algorithm has the advantages of low sensitivity to isolated points and strong robustness. However, owing to the way the initial cluster centers are selected and the center points are iteratively updated, its clustering accuracy and efficiency are low. Because the standard deviation reflects the dispersion of the data, an initial center point candidate set is defined from it, and a K-medoids clustering algorithm based on standard deviation is presented. First, the initial center point candidate set is defined using the standard deviation, and the initial center points are determined by adding them step by step. This ensures that sample points in dense regions are selected as initial center points and avoids selecting sample points in sparse regions, especially isolated points. Second, following the principle that each data sample belongs to its nearest center point, the initial clusters are formed, and the cluster center points are updated repeatedly until the sum of squared clustering errors no longer changes, which gives the final clusters. Finally, experiments on UCI datasets and artificial datasets validate that the proposed algorithm has good clustering accuracy, efficiency, and robustness.
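
The abstract names the stages of the algorithm but not its formulas, so the following Python sketch is only one possible reading, not the authors' method. Everything specific in it is an assumption made for illustration: the radius is taken as the root-mean-square distance of the points from their centroid, a point's density is the number of points within that radius, the candidate set keeps points with at least average density, the stepwise selection starts from the densest candidate and then adds the candidate farthest from the centers already chosen, and squared Euclidean distance is used as the dissimilarity. The function names (`sd_radius`, `initial_medoids`, `k_medoids_sd`) are hypothetical.

```python
import math


def sd_radius(points):
    """Assumed 'standard deviation': RMS distance of the points from their centroid."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in points) / n)


def initial_medoids(points, k):
    """Pick k initial medoids (as indices) from an assumed density-based candidate set."""
    radius = sd_radius(points)
    # Density of a point = number of points (itself included) within the radius.
    density = [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]
    mean_density = sum(density) / len(points)
    candidates = [i for i, d in enumerate(density) if d >= mean_density]
    if len(candidates) < k:  # fallback so the sketch always returns k medoids
        candidates = list(range(len(points)))
    # Stepwise selection: densest candidate first, then farthest-from-chosen.
    medoids = [max(candidates, key=lambda i: density[i])]
    while len(medoids) < k:
        rest = [i for i in candidates if i not in medoids]
        medoids.append(max(
            rest,
            key=lambda i: min(math.dist(points[i], points[m]) for m in medoids),
        ))
    return medoids


def assign(points, medoids):
    """Assign each point to its nearest medoid; return labels and the SSE."""
    labels = [
        min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
        for p in points
    ]
    sse = sum(math.dist(p, points[medoids[l]]) ** 2 for p, l in zip(points, labels))
    return labels, sse


def k_medoids_sd(points, k, max_iter=100):
    """Alternate assignment and medoid update until the SSE stops changing."""
    medoids = initial_medoids(points, k)
    labels, sse = assign(points, medoids)
    for _ in range(max_iter):
        new_medoids = []
        for c in range(k):
            members = [i for i, l in enumerate(labels) if l == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # New medoid = member with the smallest sum of squared distances
            # to the other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) ** 2 for j in members),
            ))
        medoids = new_medoids
        labels, new_sse = assign(points, medoids)
        if new_sse == sse:  # stopping rule described in the abstract
            break
        sse = new_sse
    return medoids, labels, sse
```

Calling `k_medoids_sd(points, k)` on a list of coordinate tuples returns the medoid indices, a cluster label for each point, and the final sum of squared errors. Under these assumptions an isolated point has low density and falls outside the candidate set, so it cannot be picked as an initial center, which is the behavior the abstract describes.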