Text Classification Based on LDA and BiGRU - Computer Technology and Development (《计算机技术与发展》)

Article Info

Author(s):: XIAN Guang-ming; WANG Lu-dong; ZENG Bi-qing; MEI Hao-yang; TAO Rui (School of Software, South China Normal University, Foshan 528225, China)

Keywords:: LDA topic model; BiGRU; Word2vec; deep learning; text classification

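Abstract (translated from the Chinese 摘要):: Text classification is a fundamental task in natural language processing. Feature sparsity in the text and the neural network used for feature extraction both affect downstream classification performance. To address insufficient feature information in text and the weakness of traditional models at capturing contextual dependencies, this paper proposes a classification method that fuses TF-IDF-weighted word vectors with an LDA topic model and uses a bidirectional gated recurrent unit (BiGRU) layer to extract deep textual features. The main dataset is the Tianchi competition news text classification dataset. Word vectors are first trained on the corpus with Word2vec and with LDA separately. The Word2vec vectors are weighted by TF-IDF and then concatenated with the LDA-trained word vectors, which are expanded by maximum topic probability, to form a text matrix. The text matrix is fed into the BiGRU network, which extracts feature vectors of deep textual information from the forward and backward directions. Finally, a softmax function performs multi-class classification, and the class is assigned according to the output probabilities. Compared with commonly used text classification models, the method achieves considerably higher accuracy, F1 score, and other evaluation metrics.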

Abstract:: Text classification is a basic task in natural language processing. Feature sparsity in the text and the neural network used for feature extraction both affect the subsequent classification performance. To address feature sparsity in text and the weak handling of context dependence in traditional models, we propose a classification method that combines TF-IDF-weighted word vectors with an LDA topic model and uses a bidirectional gated recurrent unit (BiGRU) layer to extract deep features from the text. The main dataset is the news text classification dataset from the Tianchi competition. First, word vectors are trained on the corpus with the Word2vec and LDA models respectively. The Word2vec word vectors, weighted by TF-IDF, are then concatenated with the word vectors trained by LDA and expanded by maximum topic probability. The concatenation yields a text matrix, which is fed into the BiGRU network to extract deep feature vectors from the forward and backward directions. Finally, the softmax function is used for multi-class classification, and the category is determined from the output probabilities. Compared with commonly used text classification models, accuracy, F1 score, and other evaluation metrics are improved.
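
The pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal, untrained illustration under stated assumptions, not the authors' implementation. The toy corpus, all dimensions, and the helper names (`tfidf`, `text_matrix`, `gru_last_state`, `classify`) are hypothetical. Random vectors stand in for trained Word2vec embeddings and for a fitted LDA model. Each word's topic features are taken to be its full topic distribution, since the abstract does not spell out the "maximum topic probability expansion". The TF-IDF formula is one common smoothed variant, and the GRU omits bias terms.

```python
import math
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

# Toy tokenized corpus (hypothetical; the paper uses the Tianchi news dataset).
docs = [["stock", "market", "rises"],
        ["team", "wins", "match"],
        ["market", "team", "news"]]
vocab = sorted({w for d in docs for w in d})
EMB_DIM, N_TOPICS, HIDDEN, N_CLASSES = 8, 4, 6, 3

# Stand-ins for trained models (assumptions): random Word2vec-style vectors and
# random per-word topic distributions in place of a fitted LDA model.
word_vec = {w: rng.normal(size=EMB_DIM) for w in vocab}
word_topic = {w: rng.dirichlet(np.ones(N_TOPICS)) for w in vocab}

def tfidf(doc):
    """Smoothed TF-IDF weight for each distinct word in one document."""
    tf = Counter(doc)
    weights = {}
    for w, c in tf.items():
        df = sum(w in d for d in docs)
        weights[w] = (c / len(doc)) * (math.log((1 + len(docs)) / (1 + df)) + 1)
    return weights

def text_matrix(doc):
    """One row per word: [TF-IDF-weighted word vector, topic features]."""
    weights = tfidf(doc)
    return np.stack([np.concatenate([weights[w] * word_vec[w], word_topic[w]])
                     for w in doc])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(in_dim, hid):
    # One weight matrix per gate (update z, reset r, candidate h); biases omitted.
    return {k: rng.normal(scale=0.1, size=(in_dim + hid, hid)) for k in "zrh"}

def gru_last_state(X, p):
    """Run a GRU over the rows of X and return the final hidden state."""
    h = np.zeros(HIDDEN)
    for x in X:
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ p["z"])
        r = sigmoid(xh @ p["r"])
        cand = np.tanh(np.concatenate([x, r * h]) @ p["h"])
        h = (1 - z) * h + z * cand
    return h

in_dim = EMB_DIM + N_TOPICS
fwd, bwd = init_gru(in_dim, HIDDEN), init_gru(in_dim, HIDDEN)
W_out = rng.normal(scale=0.1, size=(2 * HIDDEN, N_CLASSES))

def classify(doc):
    X = text_matrix(doc)
    # Read the sequence in both directions and concatenate the final states.
    feat = np.concatenate([gru_last_state(X, fwd), gru_last_state(X[::-1], bwd)])
    logits = feat @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax class probabilities

X = text_matrix(docs[0])
probs = classify(docs[0])
```

Here `X` has one row per word and `EMB_DIM + N_TOPICS` columns, and `probs` is a probability vector over the hypothetical classes. Because the weights are random, the predicted class is meaningless. In practice the embeddings, topic model, and network would be trained, typically with a library such as gensim and a deep learning framework.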