
Short Text Classification with Weighted Similarity Based on Category Topic Word Set

Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume: 32
Issue: 2022, No. 09

Pages: 95-99

Section: Artificial Intelligence

Publication date: 2022-09-10

Article Info

Original title: 基于类别主题词集的加权相似度短文本分类

Article number: 1673-629X(2022)09-0095-05

Author(s): WANG Xiao-nan (王小楠); HUANG Wei-dong (黄卫东); School of Management, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Keywords: Word2Vec; short text classification; similarity; category topic; weighting

CLC number: TP391.1

DOI: 10.3969/j.issn.1673-629X.2022.09.015

Abstract: Short texts have sparse features, so classification performance on them is poor. This paper makes full use of a word vector model and proposes, at the word level, a short text classification algorithm based on weighted similarity to category topic word sets. First, a word vector model is trained. Next, TF-IDF is used to select the topic words that best represent each category, forming a category topic word set. The similarity between the keywords of a short text and the topic words of each category is then computed, with each topic word's contribution to its category expressed as a weight, and the category with the highest similarity is assigned to the short text. Experiments show that, compared with the KNN algorithm, the Logistic regression algorithm, and the decision tree classification algorithm, the proposed method improves precision by 2.9%, 1.8%, and 10.2% respectively, and recall by 3.0%, 1.7%, and 10.4% respectively. A limitation is that the contribution of a category topic word to its category is quantified along only a simple dimension. Overall, the algorithm addresses the problem of insufficient features in short text classification at the word level and improves classification performance.
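
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hand-written 2-d vectors in `VECTORS` stand in for a trained Word2Vec model, the function names `topic_word_sets` and `classify` are hypothetical, and the specific choices (pooling each category's texts into one document for TF-IDF, using normalised TF-IDF scores as weights, averaging the weighted cosine similarities over the text's keywords) are guesses, since the abstract does not give the exact formulas.

```python
import math
from collections import Counter

# Toy 2-d word vectors standing in for a trained Word2Vec model (assumption).
VECTORS = {
    "goal": (1.0, 0.1), "match": (0.9, 0.2), "team": (0.8, 0.3),
    "stock": (0.1, 1.0), "market": (0.2, 0.9), "bank": (0.3, 0.8),
    "striker": (0.95, 0.15), "shares": (0.15, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_word_sets(docs_by_category, top_k=2):
    """Select the top_k TF-IDF words per category and normalise their
    scores into weights that sum to 1 within each category.

    Assumption: all training texts of a category are pooled into one document.
    """
    pooled = {c: Counter(w for doc in docs for w in doc)
              for c, docs in docs_by_category.items()}
    n = len(pooled)
    # Number of categories in which each word occurs.
    df = Counter(w for counts in pooled.values() for w in counts)
    topics = {}
    for c, counts in pooled.items():
        total = sum(counts.values())
        # Smoothed IDF, so words shared by every category still score above zero.
        tfidf = {w: (k / total) * (math.log((1 + n) / (1 + df[w])) + 1)
                 for w, k in counts.items()}
        best = sorted(tfidf.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        norm = sum(score for _, score in best)
        topics[c] = {w: score / norm for w, score in best}
    return topics


def classify(keywords, topics, vectors=VECTORS):
    """Return (best_category, scores). Each score is the weighted cosine
    similarity to a category's topic words, averaged over the keywords
    that have a vector. Returns (None, {}) if no keyword has a vector."""
    known = [w for w in keywords if w in vectors]
    if not known:
        return None, {}
    scores = {}
    for c, weights in topics.items():
        scores[c] = sum(
            weight * cosine(vectors[k], vectors[t])
            for k in known
            for t, weight in weights.items() if t in vectors
        ) / len(known)
    return max(scores, key=scores.get), scores


# Hypothetical, already-tokenised training texts grouped by category.
train = {
    "sports": [["goal", "match", "team"], ["match", "goal"]],
    "finance": [["stock", "market", "bank"], ["market", "stock"]],
}
topics = topic_word_sets(train)
label, scores = classify(["striker"], topics)
print(label, scores)
```

In this sketch the word "striker" never appears in the training texts, yet it can still be assigned a category because its vector lies close to the sports topic words. That is the word-level remedy for feature sparsity that the abstract describes.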

