Mail Sorting Technology Based on Improved TFIDF (基于改进TFIDF算法的邮件分类技术) - Computer Technology and Development (《计算机技术与发展》)

Article Info

Title:: Mail Sorting Technology Based on Improved TFIDF

Article ID:: 1673-629X(2018)08-0027-05

Author(s):: TAO Feng (陶峰)¹; TANG Kun (汤鲲)²; CHENG Guang (程光)³
1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
2. FiberHome StarrySky Co., Ltd., Nanjing 210019, China;
3. School of Computer Science and Engineering, Southeast University, Nanjing 210096, China

Keywords:: mail classification; distinguishing degree; feature word; weight; feature extraction

CLC number:: TP391.1

DOI:: 10.3969/j.issn.1673-629X.2018.08.006

Document code:: A

Abstract:: As e-mail has become widespread, the proliferation of spam has drawn growing attention, and spam classification has become an active research topic in recent years. Feature selection directly affects the efficiency and accuracy of mail classification, and the TFIDF algorithm can effectively assess how important a feature term is for classification. However, using TFIDF alone to judge whether a feature is discriminative has several shortcomings: it ignores how a feature word is distributed between classes and within a class, and it underestimates the role of high-frequency words while overestimating the role of low-frequency words. This paper modifies the TFIDF algorithm to reduce the influence of feature words that occur frequently only in special-case mails, and introduces a frequency difference that raises the weight of terms occurring frequently within a class and lowers the weight of terms occurring rarely within a class. Finally, the improved TFIDF algorithm is compared with traditional feature extraction algorithms. Experimental results show that the improved algorithm selects a more suitable set of feature terms and thereby improves mail classification.
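The abstract does not state the modified weighting formula, so the following is only a minimal sketch of the two ingredients it names: the classic TFIDF weight, and a per-class frequency difference. The function names, the toy corpus, and the particular definition of the gap (share of in-class mails containing a term minus the share of out-of-class mails containing it) are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def tfidf(docs):
    """Classic TF-IDF per document: tf = count / len(doc), idf = log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def class_doc_frequency_gap(docs, labels, target):
    """Hypothetical illustration, not the paper's formula: share of in-class
    documents containing a term minus the share of out-of-class documents."""
    inside = [set(d) for d, lab in zip(docs, labels) if lab == target]
    outside = [set(d) for d, lab in zip(docs, labels) if lab != target]
    vocab = set().union(*inside, *outside)

    def rate(group, term):
        return sum(term in d for d in group) / len(group) if group else 0.0

    return {t: rate(inside, t) - rate(outside, t) for t in vocab}


# Toy corpus (invented for this sketch): tokenized mails with class labels.
docs = [
    ["free", "prize", "click"],
    ["free", "offer", "click"],
    ["meeting", "agenda", "free"],
    ["meeting", "notes", "project"],
]
labels = ["spam", "spam", "ham", "ham"]

weights = tfidf(docs)
gap = class_doc_frequency_gap(docs, labels, "spam")
print(weights[0])
print(sorted(gap.items(), key=lambda kv: -kv[1]))
```

Plain TFIDF looks only at how many mails contain a term, not at which class those mails belong to. A class-aware gap of this kind is one way to capture the between-class distribution that the abstract says the classic weight ignores.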


Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]
