[1] MENG Tao, WANG Cheng. Research on Short Text Classification Based on Extended Word Feature Vectors[J]. Computer Technology and Development, 2019, 29(04): 57-62. [doi:10.3969/j.issn.1673-629X.2019.04.012]

Research on Short Text Classification Based on Extended Word Feature Vectors

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
29
Issue:
No. 04, 2019
Pages:
57-62
Section:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2019-04-10

Article Info

Title:
Research on Short Text Classification Based on Extended Word Feature Vectors
Article ID:
1673-629X(2019)04-0057-06
Author(s):
MENG Tao, WANG Cheng
School of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China
Keywords:
short text; Word2vec model; word embedding; improved feature weight algorithm; semantic relevance
CLC number:
TP301
DOI:
10.3969/j.issn.1673-629X.2019.04.012
Abstract:
Because short texts contain few words, word co-occurrence information is scarce, which makes short-text features sparse. This sparsity is one of the bottlenecks that keep traditional topic models from making breakthrough progress on short texts. For short text classification, the key is to make full use of every word in the short text while addressing its sparsity. To this end, short texts are extended with word embeddings based on the Word2vec model, and the word vectors are converted into probabilistic semantic distributions to measure semantic relevance. An improved feature weight algorithm, combined with semantic relevance, is then applied to the extended word feature vectors. This method distinguishes the importance of words in the extended short text and thereby yields more accurate semantic relevance. Classification is performed with the KNN algorithm. Experimental results show that extending short text features with semantic relevance learned from an external corpus effectively improves short text classification.
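
The pipeline summarized in the abstract (embedding-based expansion, relevance weighting, KNN) can be illustrated with a minimal sketch. The abstract does not specify the improved feature weight algorithm or how word vectors become a probabilistic distribution, so everything below is an assumption for illustration only: the toy vectors in `EMB` stand in for Word2vec embeddings learned on an external corpus, a softmax over cosine similarities stands in for the probabilistic semantic distribution, and the names `relevance_distribution`, `expand`, and `knn_predict` are hypothetical.

```python
import numpy as np
from collections import Counter

# Hypothetical toy vectors standing in for Word2vec embeddings
# trained on an external corpus (values are made up).
EMB = {
    "goal":    np.array([0.90, 0.10, 0.00]),
    "match":   np.array([0.80, 0.20, 0.10]),
    "striker": np.array([0.85, 0.05, 0.10]),
    "stock":   np.array([0.10, 0.90, 0.10]),
    "market":  np.array([0.05, 0.85, 0.20]),
    "shares":  np.array([0.10, 0.80, 0.15]),
}
VOCAB = sorted(EMB)

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def relevance_distribution(word):
    """Assumed stand-in: softmax over cosine similarities to all other words."""
    others = [w for w in VOCAB if w != word]
    sims = np.array([cosine(EMB[word], EMB[w]) for w in others])
    probs = np.exp(sims) / np.exp(sims).sum()
    return dict(zip(others, probs))

def expand(tokens, top_k=2):
    """Build a weighted feature vector over VOCAB for a short text.

    Each original word contributes weight 1.0 per occurrence; each of its
    top_k most relevant neighbours contributes its relevance probability.
    """
    weights = Counter()
    for t in tokens:
        if t not in EMB:
            continue  # out-of-vocabulary words are skipped in this sketch
        weights[t] += 1.0
        dist = relevance_distribution(t)
        for w, p in sorted(dist.items(), key=lambda kv: -kv[1])[:top_k]:
            weights[w] += p
    return np.array([weights[w] for w in VOCAB])

def knn_predict(query, train, k=3):
    """Majority vote among the k training texts closest to the query."""
    q = expand(query)
    nearest = sorted(train, key=lambda item: -cosine(q, expand(item[0])))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Tiny labelled set of short texts (made up for illustration).
TRAIN = [
    (["goal", "match"], "sports"),
    (["striker"], "sports"),
    (["stock", "market"], "finance"),
    (["shares"], "finance"),
]

print(knn_predict(["match"], TRAIN))
print(knn_predict(["market"], TRAIN))
```

In this toy setup the query `["match"]` shares no word with the training text `["striker"]`, so a plain bag-of-words comparison would score them as unrelated; after expansion both vectors carry weight on each other's words, which is the sparsity-reduction effect the abstract describes.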

Last Update: 2019-04-10