News Text Clustering and Visualization Based on the TFIDF+LSA Algorithm

Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 32
Issue:: 2022, No. 07

Pages:: 34-38

Section:: Big Data Analysis and Mining

Publication date:: 2022-07-10

Article Info

Title:: News Text Clustering and Visualization Based on TFIDF+LSA Algorithm

Article ID:: 1673-629X(2022)07-0034-05

Author(s):: HAO Xiu-hui (郝秀慧); FANG Xian-jin (方贤进); YANG Gao-ming (杨高明); School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China

Keywords:: term frequency-inverse document frequency; latent semantic analysis; text clustering speed; text clustering visualization; kmeans

Classification number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2022.07.006

Abstract:: Text clustering, an unsupervised machine learning technique, has drawn growing attention in data mining in recent years. Clustering a small text collection into a few categories is relatively easy. Large collections of Chinese text, however, yield high-dimensional, sparse representations, so improving clustering speed and visualization quality without sacrificing clustering quality has become a research topic in its own right. This paper proposes combining the term frequency-inverse document frequency (TFIDF) algorithm with latent semantic analysis (LSA) to improve the speed and visualization of kmeans clustering of Chinese text. Using 11,456 news items collected from web pages, the experiments compare clustering on TFIDF features with clustering on TFIDF+LSA features. The silhouette coefficient (SC), Calinski-Harabasz index (CHI), and Davies-Bouldin index (DBI) indicate that the proposed method preserves clustering quality while greatly improving clustering speed and visualization.
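The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, computes LSA as a truncated SVD of the TF-IDF matrix, and uses a hypothetical six-document English toy corpus in place of the 11,456 news items. The corpus, the helper `cluster_and_score`, the choice of 2 latent dimensions, and k = 2 are all assumptions made for the example.

```python
import time

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical toy corpus: already tokenized, tokens separated by spaces.
# Chinese news text would first need a word segmenter to reach this form.
DOCS = [
    "football match team goal win",
    "team win football league",
    "goal match league team football coach player",
    "stock market bank price rise",
    "bank stock price fall",
    "market price bank stock trade investor fund",
]


def cluster_and_score(features, n_clusters, seed=0):
    """Run k-means on a dense feature matrix; return labels and metrics."""
    start = time.perf_counter()
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    elapsed = time.perf_counter() - start
    scores = {
        "SC": silhouette_score(features, labels),          # higher is better
        "CHI": calinski_harabasz_score(features, labels),  # higher is better
        "DBI": davies_bouldin_score(features, labels),     # lower is better
        "seconds": elapsed,                                # k-means time only
    }
    return labels, scores


# Step 1: TF-IDF turns each document into a high-dimensional sparse vector.
tfidf = TfidfVectorizer().fit_transform(DOCS)

# Step 2: LSA, computed here as a truncated SVD of the TF-IDF matrix,
# projects those vectors onto a few latent dimensions (2 is arbitrary).
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Baseline (TF-IDF only) versus TF-IDF+LSA, both with k = 2 clusters.
# The baseline is densified only because this corpus is tiny.
labels_tfidf, scores_tfidf = cluster_and_score(tfidf.toarray(), n_clusters=2)
labels_lsa, scores_lsa = cluster_and_score(lsa, n_clusters=2)

print("TF-IDF     :", labels_tfidf, scores_tfidf)
print("TF-IDF+LSA :", labels_lsa, scores_lsa)
```

On a corpus this small the timing difference is meaningless; the sketch only shows where the dimensionality reduction sits in the pipeline and how the three indices are computed. The two-dimensional LSA output can also be passed directly to a scatter plot, which is one plausible reading of the visualization benefit the abstract mentions.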
