
A CNNs Text E-mail Classification Model Based on Skip-gram


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 29
Issue:: 2019, No. 06

Pages:: 143-147

Column:: Application Development Research

Publication date:: 2019-06-10

Article Info

Title:: CNNs-Highway Text Message Classification Model Based on Skip-gram

Article ID:: 1673-629X(2019)06-0143-05

Author(s):: HUANG He¹; JING Xiao-yuan²; DONG Xi-wei²; WU Fei². 1. School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. School of Automation, Nanjing University of Posts and Telecommunications

Keywords:: natural language processing; word embedding; mail classification; convolutional neural network; deep learning

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2019.06.030

Abstract:: With the development of Internet advertising and the widespread use of e-mail, spam advertisements have become increasingly common, and efficient spam detection has become an active research topic. Natural language has strong sequential dependencies, and converting Chinese e-mails directly into vectors yields excessively high-dimensional representations, which degrades classification accuracy. To address this, the e-mail text is first segmented into words, and a skip-gram model is trained to obtain a word embedding for each word in the data set; the embeddings map the e-mail text to low-dimensional feature vectors. The word embeddings of each e-mail are then combined into a two-dimensional feature matrix that serves as the network input, and the input features are themselves updated as parameters in every training iteration. Finally, the feature matrix is fed into the proposed CNN-HIGHWAY hybrid model for classification. The hybrid model is evaluated on the CCERT Chinese e-mail sample set and compared with traditional machine learning methods and a standard convolutional neural network model. The results show that the model not only resolves the high-dimensionality problem but also improves the accuracy of e-mail classification.
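
The pipeline outlined in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the abstract does not give filter sizes, layer counts, or training details, so every function name, the toy tokens, the two-dimensional embeddings, and the filter weights below are hypothetical. The highway layer uses the commonly cited formulation y = T * H(x) + (1 - T) * x, which is assumed here rather than taken from the paper. Word segmentation, skip-gram training, and the parameter updates mentioned in the abstract are omitted; only the generation of skip-gram training pairs and one forward pass are shown.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Return the (center, context) pairs a skip-gram model would train on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def sentence_matrix(tokens, embeddings):
    """Stack one embedding per token into a (num_tokens x dim) matrix."""
    return [embeddings[t] for t in tokens]

def conv_max_pool(matrix, kernel, bias=0.0):
    """Apply one filter over len(kernel) consecutive word rows, then ReLU
    and max-over-time pooling, giving a single feature per filter."""
    height = len(kernel)
    features = []
    for start in range(len(matrix) - height + 1):
        s = bias
        for r in range(height):
            s += sum(w * x for w, x in zip(kernel[r], matrix[start + r]))
        features.append(max(0.0, s))
    return max(features)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(x, weights, bias):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def highway(x, w_h, b_h, w_t, b_t):
    """Assumed highway layer: y = T * H + (1 - T) * x, where
    H = relu(w_h x + b_h) and T = sigmoid(w_t x + b_t) is the transform gate."""
    h = [max(0.0, v) for v in affine(x, w_h, b_h)]
    t = [sigmoid(v) for v in affine(x, w_t, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

# Toy, already-segmented e-mail and hypothetical 2-dimensional embeddings.
tokens = ["free", "coupon", "click", "here"]
embeddings = {
    "free": [1.0, 0.0],
    "coupon": [0.0, 1.0],
    "click": [1.0, 1.0],
    "here": [0.5, 0.5],
}

pairs = skipgram_pairs(tokens, window=1)
matrix = sentence_matrix(tokens, embeddings)

# Two hypothetical filters, each spanning 2 consecutive words.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
pooled = [conv_max_pool(matrix, k) for k in kernels]

# Highway layer with identity transform weights and a neutral gate (T = 0.5).
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
output = highway(pooled, identity, zero_b, zero_w, zero_b)
```

In a trainable version, the entries of `embeddings` would be treated as parameters and updated together with the filter and highway weights, which is what the abstract means by updating the input features in each iteration. A strongly negative gate bias `b_t` makes the highway layer pass its input through almost unchanged, which is the usual motivation for adding such a layer on top of the pooled convolutional features.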

