Study on News Headline Clustering Based on Singular Value Decomposition - 《计算机技术与发展》 (Computer Technology and Development)

Article Info

Title:: Study on News Header Clustering Based on Singular Value Decomposition

Author(s):: WEN Xiao-yi; HAO Cheng-cheng; School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201600, China

Keywords:: Chinese word segmentation; word cloud diagram; singular value decomposition; latent semantic analysis; K-means clustering

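Abstract (translated from the Chinese):: Chinese word segmentation and text clustering are important steps in natural language processing and are widely used in organizing, summarizing, and navigating text information. Text clustering is an unsupervised learning method that rests on the clustering hypothesis: documents in the same class are highly similar, while documents in different classes are much less similar. This paper studies the application of Chinese text clustering algorithms to news-headline text. First, a collection of news headlines is segmented into words and features are extracted, and the segmented text is converted into a term matrix. Next, the term matrix is processed with TF-IDF to obtain a new, weighted term matrix; singular value decomposition of this matrix yields a principal component score matrix, from which principal components are extracted to analyze text features, and K-means and hierarchical clustering are performed on the scores. Finally, the clustering results are displayed as word clouds and the quality of the clustering is evaluated. The empirical results show that singular value decomposition of the term matrix reduces the dimensionality of the vector space and improves both the accuracy and the speed of clustering.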

Abstract:: Chinese word segmentation and text clustering are important in natural language processing and are widely used in text information organization, summarization, and navigation. As an unsupervised learning algorithm, text clustering is based on the clustering hypothesis: documents of the same category are more similar, while documents of different categories are less similar. We mainly study the application of Chinese text clustering algorithms to news headlines. First, we perform word segmentation and feature extraction on the collected news headlines and convert the segmented text into a term matrix. Then the term matrix is processed with TF-IDF to obtain a new term matrix weighted by term importance. Singular value decomposition is applied to the new term matrix to obtain the principal component score matrix. The principal components are extracted to analyze the text features, and K-means and hierarchical cluster analysis are performed on the principal component score matrix. Finally, the clustering results are displayed as word clouds and the quality of the clustering is evaluated. The experiment shows that singular value decomposition of the term matrix can effectively reduce the dimension of the vector space, thus improving the accuracy and speed of the clustering.
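
The pipeline described in the abstract (term matrix, TF-IDF weighting, singular value decomposition, K-means) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy headlines are invented and already segmented into space-separated tokens (a real pipeline would first run a Chinese word segmenter), the TF-IDF variant is the plain `tf * log(N / df)` form, the SVD is applied without centering as in latent semantic analysis, and the helper names `tfidf_matrix`, `svd_scores`, and `kmeans` are hypothetical. Hierarchical clustering and word clouds are left out.

```python
import numpy as np

# Invented, pre-segmented toy headlines (illustration only, not the paper's data).
headlines = [
    "stock market rises",
    "stock market falls",
    "bank stock rises",
    "team wins football match",
    "football team loses match",
    "football match tonight",
]

def tfidf_matrix(docs):
    """Build a documents-by-terms TF-IDF matrix; returns (matrix, vocabulary)."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokenized):
        for t in doc:
            counts[i, index[t]] += 1
    tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency per headline
    df = (counts > 0).sum(axis=0)                     # number of headlines containing each term
    idf = np.log(len(docs) / df)                      # assumed IDF variant: log(N / df)
    return tf * idf, vocab

def svd_scores(X, k):
    """Project documents onto the top-k singular directions (scores = U_k * S_k)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

X, vocab = tfidf_matrix(headlines)    # 6 headlines x 11 terms
scores = svd_scores(X, k=2)           # reduced to 2 dimensions
labels = kmeans(scores, k=2)          # cluster in the reduced space
for label, text in zip(labels, headlines):
    print(label, text)
```

Clustering the two-column score matrix instead of the full term matrix is the dimensionality reduction the abstract refers to. In this toy case the finance and football headlines share no terms, so each retained singular direction captures one topic. If the term matrix were column-centered before the SVD, the scores would be principal component scores in the strict sense.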