[1] WANG Anjin. An Improved News Text Clustering Algorithm Based on MinHash[J]. Computer Technology and Development, 2019, 29(02): 39-42. [doi:10.3969/j.issn.1673-629X.2019.02.008]

An Improved News Text Clustering Algorithm Based on MinHash

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
29
Issue:
2019, No. 02
Pages:
39-42
Column:
Intelligence, Algorithms, Systems Engineering
Publication date:
2019-02-10

Article Info

Title:
An Improved News Text Clustering Algorithm Based on MinHash
Article ID:
1673-629X(2019)02-0039-04
Author(s):
WANG An-jin
School of Computer Science and Technology, Donghua University, Shanghai 200000, China
Keywords:
MinHash; Jaccard coefficient; DBSCAN; text clustering
CLC number:
TP301.6
DOI:
10.3969/j.issn.1673-629X.2019.02.008
Abstract:
The continued development of information technology has brought rapid growth in news text on the Internet, and clustering this large volume of text effectively has become important. To meet this need, we propose an improved DBSCAN clustering algorithm based on MinHash. Traditional text clustering based on the vector space model suffers from high data dimensionality, high computational complexity, and heavy resource consumption; the proposed algorithm addresses this by using MinHash to reduce the dimensionality of the feature-word sets of all texts, which cuts resource usage. The Jaccard coefficient is then computed for every pair of data items in the resulting feature matrix. Each result is compared with the neighborhood radius Eps given for DBSCAN, and for the points whose distance is greater than Eps, the algorithm checks whether the number of surrounding points is greater than or equal to MinPts, the minimum number of points required to form a cluster. This determines whether a text is a core point and whether a cluster can be formed. Experimental results show that the method performs well on news text clustering and can effectively cluster the varied and tangled news text found on the Internet.
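The pipeline outlined in the abstract (MinHash signatures, pairwise Jaccard estimates, then a DBSCAN core-point test against Eps and MinPts) can be sketched roughly as follows. This is an illustrative sketch and not the author's implementation: the hash family `(a*x + b) mod PRIME`, the use of `zlib.crc32` to map words to integers, the signature length, and the function names are all assumptions. It also assumes the conventional DBSCAN reading in which the distance is `1 - Jaccard` and a neighbor is any point within `eps`; the paper's exact thresholding may differ.

```python
import random
import zlib
from typing import List, Optional, Set, Tuple

PRIME = 2_147_483_647  # 2^31 - 1; an assumed modulus for the hash family


def make_hash_params(num_hashes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(words: Set[str],
                      params: List[Tuple[int, int]]) -> List[int]:
    """Compress a feature-word set into a fixed-length MinHash signature."""
    ids = [zlib.crc32(w.encode("utf-8")) for w in words]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions estimates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dbscan(signatures: List[List[int]], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over the distance 1 - estimated Jaccard; -1 marks noise."""
    n = len(signatures)
    # Pairwise comparison: neighbors of i are all j within distance eps.
    neighbors = [
        [j for j in range(n)
         if 1.0 - estimated_jaccard(signatures[i], signatures[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None means unvisited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:  # not a core point
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:    # already assigned to a cluster
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # expand only through core points
                queue.extend(neighbors[j])
    return [label if label is not None else -1 for label in labels]


# Toy feature-word sets standing in for preprocessed news texts.
docs = [
    {"stock", "market", "rises", "today"},
    {"stock", "market", "rises", "today"},
    {"stock", "market", "rises", "today"},
    {"football", "team", "wins", "cup"},
    {"football", "team", "wins", "cup"},
    {"weather", "rain"},
]
params = make_hash_params(num_hashes=64)
signatures = [minhash_signature(d, params) for d in docs]
labels = dbscan(signatures, eps=0.5, min_pts=2)
```

Each signature has a fixed length of 64 regardless of vocabulary size, which is where the dimensionality reduction comes from; longer signatures give a tighter Jaccard estimate at higher cost.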


Last Update: 2019-02-10