Research on a K-means Clustering Algorithm for Linear Text - Computer Technology and Development

Article Info

Author(s):: WEN Bi-long; LI Fei; MA Qiang; School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China

Keywords:: linear text; organizational structure; random and even center point selection; isometric point classification; K-means algorithm

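Abstract (translated from the Chinese):: Computer Technology and Development, 2018, (9). Because the content of a linear text is organized in order, correctly partitioning its ordered topical content so as to mine the information and knowledge hidden in the text is a problem worth studying. At the same time, when the traditional K-means clustering algorithm is applied to linear text, it causes a series of problems such as increased computational complexity, endless iteration, or confused clustering results. To address these problems, the traditional K-means algorithm is studied and its random center-initialization step is improved, yielding a random uniform center-initialization algorithm. The algorithm takes full account of the organizational structure of linear text: after the first center point is randomized, the remaining center points are determined uniformly, which ensures a complete partition of the text's subtopics. In addition, an equidistant-point classification method with preset constraint rules is adopted, so that text units are classified automatically within a limit on the number of iterations. Experimental results show that, when clustering linear text, the algorithm effectively reduces the number of iterations and improves clustering accuracy, ultimately achieving a better clustering result.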

Abstract:: Because the content of a linear text is organized in order, dividing its subject content correctly in order to mine the information and knowledge hidden in the text is a problem worth studying. At the same time, applying the traditional K-means clustering algorithm to linear text leads to a series of problems, such as increased computational complexity, endless iteration, or confused clustering results. To address these problems, we study the traditional K-means algorithm and improve its random center-initialization step, proposing a random uniform center-initialization algorithm. The algorithm takes full account of the organizational structure of linear texts: after the first center point is randomized, the other center points are determined uniformly, which ensures a complete division of the subtopics. In addition, we adopt an equidistant-point classification method with constraint rules, so that text units are classified automatically within a limit on the number of iterations. Experiments show that, when clustering linear texts, the proposed algorithm effectively reduces the number of iterations and improves clustering accuracy, ultimately yielding a better clustering result.
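The abstract names three ingredients without giving their details: a random first center followed by uniformly placed centers, a rule for points equidistant from several centers, and a cap on iterations. The following is a minimal Python sketch of how these could fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names (random_uniform_centers, assign, kmeans_linear), the choice to draw the first center from the first n // k text units and space the rest n / k units apart, the tie rule (an equidistant unit stays in the cluster of the preceding unit if that cluster is among the tied ones, otherwise it goes to the earliest tied center), and the default max_iter value.

```python
import random


def random_uniform_centers(n, k, rng=None):
    """Pick k center positions among n ordered text units.

    Assumed scheme: the first center is random within the first n // k
    units; the others follow at evenly spaced offsets of about n / k.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    rng = rng or random.Random()
    first = rng.randrange(n // k)
    return [first + (i * n) // k for i in range(k)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centers, tol=1e-12):
    """Label each unit with its nearest center, in text order.

    Assumed tie rule for equidistant units: stay with the preceding
    unit's cluster when it is among the tied centers, else take the
    earliest tied center.
    """
    labels = []
    for v in vectors:
        dists = [sq_dist(v, c) for c in centers]
        best = min(dists)
        tied = [j for j, d in enumerate(dists) if d - best <= tol]
        if labels and labels[-1] in tied:
            labels.append(labels[-1])
        else:
            labels.append(tied[0])
    return labels


def kmeans_linear(vectors, k, max_iter=20, rng=None):
    """K-means over ordered text units with an iteration cap (assumed default)."""
    idx = random_uniform_centers(len(vectors), k, rng)
    centers = [list(vectors[i]) for i in idx]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(vectors, centers)
        if new_labels == labels:  # assignments stable: converged early
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # leave an empty cluster's center where it was
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

In this sketch each text unit (for example a sentence or paragraph) is represented by a numeric feature vector, and the list order is the order of the units in the text. Spacing the initial centers across the sequence means each stretch of the text starts with one nearby center, which is one plausible reading of how uniform placement could help keep subtopics intact.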