[1] WEN Xiao-yi, HAO Cheng-cheng. Study on News Header Clustering Based on Singular Value Decomposition[J]. Computer Technology and Development, 2020, 30(02): 42-46. [doi:10.3969/j.issn.1673-629X.2020.02.009]

Study on News Header Clustering Based on Singular Value Decomposition

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
30
Issue:
2020, No. 02
Pages:
42-46
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2020-02-10

Article Info

Title:
Study on News Header Clustering Based on Singular Value Decomposition
Article ID:
1673-629X(2020)02-0042-05
Author(s):
WEN Xiao-yi (文晓艺), HAO Cheng-cheng (郝程程)
School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201600, China
Keywords:
Chinese word segmentation; word cloud diagram; singular value decomposition; latent semantic analysis; K-means clustering
Classification number:
TP31
DOI:
10.3969/j.issn.1673-629X.2020.02.009
Abstract:
Chinese word segmentation and text clustering are important steps in natural language processing and are widely used in organizing, summarizing, and navigating text information. As an unsupervised learning method, text clustering rests on the clustering hypothesis: documents in the same category are highly similar, while documents in different categories are less similar. This paper studies the application of Chinese text clustering algorithms to news headlines. First, the collected news headlines are segmented into words and features are extracted, and the segmented text is converted into a term matrix. The term matrix is then weighted with TF-IDF to obtain a new term matrix based on term weights, and singular value decomposition is applied to this new matrix to obtain a principal component score matrix. Principal components are extracted to analyze the text features, and K-means and hierarchical clustering are performed on the principal component score matrix. Finally, the clustering results are displayed as word clouds and the quality of the clustering is evaluated. The empirical results show that singular value decomposition of the term matrix reduces the dimensionality of the vector space and improves the accuracy and speed of clustering.
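The pipeline described in the abstract (term matrix, TF-IDF weighting, singular value decomposition, K-means) can be illustrated with a minimal sketch. This is not the authors' code. The toy headlines, the function names, the particular TF-IDF variant, and the choice of two components and two clusters are all assumptions made for illustration. The headlines are assumed to be already segmented into tokens, so the Chinese word segmentation step is not shown, and hierarchical clustering and word clouds are omitted. No centering is applied before the SVD, so the scores are closer to latent semantic analysis than to textbook principal component scores.

```python
import numpy as np

# Hypothetical, already-segmented headlines (toy data, not from the paper).
headlines = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["market", "shares", "rise"],
    ["football", "team", "wins", "match"],
    ["football", "match", "tonight"],
    ["team", "wins", "football", "cup"],
]

def term_matrix(docs):
    """Build a document-by-term count matrix and its vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: j for j, t in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for t in d:
            counts[i, index[t]] += 1
    return counts, vocab

def tfidf(counts):
    """Weight counts by TF-IDF (one common variant, assumed here)."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    df = (counts > 0).sum(axis=0)              # documents containing each term
    idf = np.log(counts.shape[0] / df) + 1.0
    return tf * idf

def svd_scores(weighted, k):
    """Project documents onto the first k singular directions (U_k * S_k)."""
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

counts, vocab = term_matrix(headlines)
scores = svd_scores(tfidf(counts), k=2)   # reduced-dimension document scores
labels = kmeans(scores, k=2)              # cluster in the reduced space
print(labels)
```

Clustering on the low-dimensional `scores` rather than on the full weighted term matrix is the dimensionality reduction that the abstract credits with improving speed and accuracy; with real headlines, the number of retained components would need to be chosen from the singular value spectrum.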
Last Update: 2020-02-10