Chinese Short Text Classification Based on the ERNIE-RCNN Model - Computer Technology and Development (《计算机技术与发展》)

Article Info

Author(s): WANG Hao-chang; SUN Ming-ze; School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China

Keywords: Chinese short text classification; ERNIE; ERNIE-RCNN; word vector; feature extraction; deep learning

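Abstract (translated from the Chinese): Chinese short texts contain few feature words, are poorly standardized, and come in large volumes, while the ERNIE pre-trained model has a large memory footprint. As a result, short text classification suffers from sparse vector spaces, inaccurate text pre-training, and high time complexity. To address these problems, this paper proposes Chinese short text classification based on the ERNIE-RCNN model. The model uses ERNIE to produce word vectors, masking entities and word-level semantic units; a Transformer encoder layer is then connected to encode the word embedding vectors output by the ERNIE layer, which mitigates overfitting and improves generalization. The RCNN model extracts features from the word vectors supplied by ERNIE: the convolution layer uses kernels of different sizes to extract features at different scales, the pooling layer performs the mapping, and softmax performs the final classification. The model and seven deep learning text classification models were trained on a Chinese news dataset and compared on accuracy, precision, recall, F1 score, number of iterations, and running time. The results show that ERNIE-RCNN extracts feature information from text well, reduces training time, effectively addresses the difficulties of Chinese short text classification, and achieves very good classification performance.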
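The convolution, pooling, and softmax stages described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: random vectors stand in for the ERNIE output, the kernel widths (2, 3, 4), the ReLU activation, the use of max pooling, and all function names are assumptions chosen for illustration, and the Transformer encoder and the recurrent part of RCNN are omitted.

```python
import math
import random


def conv1d_max_pool(embeddings, kernel):
    """Slide one kernel over the token sequence and keep the largest activation.

    embeddings: list of token vectors (seq_len x dim)
    kernel: list of weight vectors (width x dim), with width <= seq_len
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = sum(w * x
                    for k_row, x_row in zip(kernel, window)
                    for w, x in zip(k_row, x_row))
        activations.append(max(0.0, total))  # ReLU (assumed activation)
    return max(activations)  # max-over-time pooling (assumed pooling type)


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def classify(embeddings, kernels, out_weights):
    """Pool one feature per kernel, then apply a linear layer and softmax."""
    features = [conv1d_max_pool(embeddings, k) for k in kernels]
    logits = [sum(w * f for w, f in zip(row, features)) for row in out_weights]
    return softmax(logits)


rng = random.Random(0)
seq_len, dim, num_classes = 8, 4, 3

# Stand-in for the ERNIE output: one random vector per token (assumption).
embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]

# One kernel each of width 2, 3, and 4 (hypothetical sizes).
kernels = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
           for width in (2, 3, 4)]

# Untrained output layer: one weight row per class.
out_weights = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(num_classes)]

probs = classify(embeddings, kernels, out_weights)
predicted = max(range(num_classes), key=probs.__getitem__)
print(probs, predicted)
```

Each kernel width sees a different number of neighboring tokens, which is one way to read "features of different sizes"; pooling then reduces every kernel's output to a single value regardless of sentence length.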

Abstract: Chinese short texts have few feature words, poor standardization, and large data volume, and the ERNIE pre-training model occupies a large amount of memory, which causes problems such as sparse vector space, inaccurate text pre-training, and high time complexity when classifying short texts. In response to these problems, we propose a Chinese short text classification method based on the ERNIE-RCNN model. The model uses ERNIE to generate word vectors, masking entities and word-sense units, and then connects a Transformer encoding layer that encodes the word embedding vectors output by the ERNIE layer, which mitigates over-fitting and enhances generalization ability. The RCNN model performs feature extraction on the word vectors input from ERNIE: the convolution layer uses convolution kernels of different sizes to extract features of different sizes, the pooling layer performs the mapping, and softmax performs the final classification. The proposed model and seven deep learning text classification models are trained on a Chinese news data set, and comparison results for accuracy, precision, recall, F1 value, number of iterations, and running time are obtained. The results show that ERNIE-RCNN extracts feature information from the text well, reduces training time, and effectively addresses the difficulties of Chinese short text classification, with very good classification performance.
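For readers unfamiliar with the reported metrics, the following is a minimal sketch of how accuracy, precision, recall, and F1 can be computed for a multi-class classifier. The abstract does not say how the per-class scores are averaged, so macro averaging is an assumption here, and the function name and toy labels are hypothetical rather than taken from the paper's news data set.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []

    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,  # macro average (assumption)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only; not results from the paper.
y_true = ["sports", "sports", "finance", "finance", "tech", "tech"]
y_pred = ["sports", "finance", "finance", "finance", "tech", "sports"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Macro averaging weights every class equally; a weighted or micro average would give different numbers on an imbalanced data set, so the choice should be stated whenever models are compared.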