[1] TAO Feng, TANG Kun, CHENG Guang. Mail Sorting Technology Based on Improved TFIDF[J]. Computer Technology and Development, 2018, 28(08): 27-31. [doi:10.3969/j.issn.1673-629X.2018.08.006]

Mail Sorting Technology Based on Improved TFIDF
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
28
Issue:
2018, No. 08
Pages:
27-31
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2018-08-10

Article Info

Title:
Mail Sorting Technology Based on Improved TFIDF
Article ID:
1673-629X(2018)08-0027-05
Author(s):
TAO Feng 1, TANG Kun 2, CHENG Guang 3
1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
2. FiberHome StarrySky Co., Ltd., Nanjing 210019, China;
3. School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Keywords:
mail classification; distinguishing degree; feature word; weight; feature extraction
CLC number:
TP391.1
DOI:
10.3969/j.issn.1673-629X.2018.08.006
Document code:
A
Abstract:
With the spread of e-mail, the proliferation of spam has drawn growing attention, and spam classification has become an active research topic in recent years. Feature selection directly affects the efficiency and accuracy of classification, and the TFIDF algorithm can effectively assess how important a feature term is for classifying a message. However, using TFIDF alone to judge whether a feature is discriminative has several shortcomings: it ignores how feature words are distributed between classes and within a class, it underestimates the role of high-frequency words, and it overestimates the role of low-frequency words. This paper modifies the TFIDF algorithm to reduce the influence of feature words that occur frequently only in atypical messages, and introduces a frequency difference that increases the weight of terms occurring frequently within a class and decreases the weight of terms occurring rarely within a class. Finally, the improved TFIDF algorithm is compared with traditional feature extraction algorithms. The experimental results show that the improved algorithm selects a more suitable feature set and thereby improves mail classification.
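The shortcoming named in the abstract, that classical TFIDF ignores how a term is distributed across classes, can be illustrated with a minimal Python sketch. The abstract does not give the paper's modified formula, so the code below only implements classical TFIDF; the helper `class_frequency_difference`, the toy corpus, and the `spam`/`ham` labels are hypothetical illustrations and not the authors' method.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classical TFIDF: tf(t, d) * log(N / df(t)) over tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def class_frequency_difference(docs, labels, term, target):
    """Hypothetical helper (an assumption, not the paper's formula):
    share of target-class documents containing `term` minus the share
    of all other documents containing it."""
    in_class = [term in d for d, y in zip(docs, labels) if y == target]
    out_class = [term in d for d, y in zip(docs, labels) if y != target]
    if not in_class or not out_class:
        raise ValueError("both the target class and its complement need documents")
    return sum(in_class) / len(in_class) - sum(out_class) / len(out_class)


# Toy corpus (invented for illustration): two spam and two ham messages.
docs = [
    ["free", "prize", "meeting"],
    ["free", "prize", "offer"],
    ["meeting", "agenda", "notes"],
    ["report", "agenda", "offer"],
]
labels = ["spam", "spam", "ham", "ham"]

weights = tf_idf(docs)
prize_gap = class_frequency_difference(docs, labels, "prize", "spam")
meeting_gap = class_frequency_difference(docs, labels, "meeting", "spam")
```

In this toy corpus, "prize" and "meeting" each appear in two of the four documents, so classical TFIDF gives them the same weight in the first message, even though "prize" occurs only in spam while "meeting" is split evenly between the classes. A class-aware quantity such as the hypothetical difference above separates the two terms, which is the kind of information the abstract says the improved algorithm takes into account.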


Last Update: 2018-10-15