YIN Shuo, WANG Wei-ya, LIU You-quan. Research on Text Clustering Based on Semantic Feature Extraction[J]. Computer Technology and Development, 2020, 30(03): 46-50. [doi:10.3969/j.issn.1673-629X.2020.03.009]

Research on Text Clustering Based on Semantic Feature Extraction

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
30
Issue:
2020, No. 03
Pages:
46-50
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2020-03-10

Article Info

Title:
Research on Text Clustering Based on Semantic Feature Extraction
Article ID:
1673-629X(2020)03-0046-05
Author(s):
YIN Shuo, WANG Wei-ya, LIU You-quan
School of Information Engineering, Chang’an University, Xi’an 710064, China
Keywords:
text clustering; semantic feature extraction; feature dimension reduction; text similarity; HowNet
Classification number:
TP301.6
DOI:
10.3969/j.issn.1673-629X.2020.03.009
Abstract:
Text clustering based on the vector space model (VSM) suffers from excessively high vector dimensionality and a lack of semantic information, which degrades clustering quality. To address these problems, we introduce HowNet as a semantic dictionary and remedy shortcomings of the existing word similarity algorithm. The improved word semantic similarity algorithm is used to compress the text features semantically so that all feature words are topic-related, and an adjusted TF-IDF algorithm weights the feature items; together these steps complete text feature extraction and reduce the dimension of the text representation model. During clustering, texts of the same class are assigned to the same cluster, and the semantic features of each cluster are extracted from the feature words of all texts in that cluster, so the cluster representation has the same form as the text representation. The semantic similarity between clusters is computed, clusters whose similarity exceeds a threshold are merged, and the cluster features are updated, until the algorithm terminates. Experiments show that, compared with a clustering algorithm based on K-Means and VSM, the proposed algorithm greatly reduces the vector dimension and clearly improves the clustering results.
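The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the improved word similarity formula, the adjusted TF-IDF weighting, or how cluster features are updated, so everything here is an assumption made for illustration. In particular, `TOY_SIM` is a hypothetical lookup table standing in for a HowNet-based word similarity, `feature_sim` is one plausible weighted best-match measure, and `merge_clusters` greedily merges the most similar pair while its similarity exceeds the threshold, keeping the `top_k` heaviest words as the merged cluster's features.

```python
from itertools import combinations

# Hypothetical word-pair similarities standing in for a HowNet-based measure.
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("apple", "fruit")): 0.6,
    frozenset(("banana", "fruit")): 0.6,
    frozenset(("apple", "banana")): 0.7,
}

# Toy documents: feature word -> weight (assumed to come from a TF-IDF-like step).
DOCS = {
    "d1": {"car": 0.7, "engine": 0.3},
    "d2": {"automobile": 0.8, "engine": 0.2},
    "d3": {"apple": 0.6, "fruit": 0.4},
    "d4": {"fruit": 0.5, "banana": 0.5},
}


def word_sim(a, b):
    """Similarity in [0, 1]; identical words score 1, unknown pairs score 0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def feature_sim(f1, f2):
    """Symmetric weighted best-match similarity between two feature maps.

    Assumes all weights are positive. Texts and clusters share this
    {word: weight} form, so the same function compares either.
    """
    if not f1 or not f2:
        return 0.0

    def directed(src, dst):
        total = sum(src.values())
        matched = sum(w * max(word_sim(t, u) for u in dst) for t, w in src.items())
        return matched / total

    return 0.5 * (directed(f1, f2) + directed(f2, f1))


def merge_features(f1, f2, top_k):
    """Sum the weights of both feature maps and keep the top_k heaviest words."""
    merged = dict(f1)
    for term, weight in f2.items():
        merged[term] = merged.get(term, 0.0) + weight
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])


def merge_clusters(docs, threshold=0.5, top_k=3):
    """Merge the most similar pair of clusters until none exceeds the threshold."""
    clusters = [({name}, dict(feats)) for name, feats in docs.items()]
    while len(clusters) > 1:
        score, i, j = max(
            (feature_sim(clusters[i][1], clusters[j][1]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score <= threshold:
            break
        members = clusters[i][0] | clusters[j][0]
        feats = merge_features(clusters[i][1], clusters[j][1], top_k)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, feats))
    return clusters


for members, feats in merge_clusters(DOCS):
    print(sorted(members), feats)
```

Because a merged cluster is again a `{word: weight}` map capped at `top_k` entries, the representation stays low-dimensional as clusters grow, which is the property the abstract emphasizes. The threshold and `top_k` values are arbitrary choices for the toy data.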


Last Update: 2020-03-10