

Improvement of TFIDF Feature Selection Algorithm in Web Document Classification


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 29
Issue:: 2019, No. 05

Pages:: 49-53

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2019-05-10

Article Info

Title:: Improvement of TFIDF Feature Selection Algorithm in Web Document Classification

Article ID:: 1673-629X(2019)05-0049-05

Author(s):: DUAN Guo-lun¹; XIE Jun¹; GUO Lei-lei²; WANG Xiao-ying¹; 1. School of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China; 2. School of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China

Keywords:: Web document classification; feature selection; TFIDF algorithm; SVM

CLC number:: TP391.1

DOI:: 10.3969/j.issn.1673-629X.2019.05.010

Abstract:: With the emergence of massive data resources on the Internet, Web document classification has received increasing attention. Feature selection plays an important role in this research: it effectively reduces the dimensionality of the text vector space model, making it possible to build prediction models that are faster and less costly. The traditional TFIDF algorithm judges the importance of a feature word for document classification solely by its term frequency and inverse document frequency. It ignores how feature terms are distributed within and between classes, as well as imbalance in the data set, which limits its effectiveness. To address these shortcomings, an intra-class distribution factor and an inter-class distribution factor are proposed. Replacing the inverse document frequency with these two factors lets the improved expression select more effective feature words. In comparative text classification experiments with an SVM classifier, the method improves the F1 value to a certain extent over the unimproved method and also classifies well on imbalanced data sets.
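The idea in the abstract can be illustrated with a minimal Python sketch. The first function is classic TF-IDF. The second replaces the inverse document frequency with an intra-class and an inter-class factor. The abstract does not state the paper's formulas, so the two factor definitions below are assumptions chosen only for illustration, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_scores(docs):
    """Classic TF-IDF. `docs` is a list of token lists; returns one dict per document."""
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def class_aware_scores(docs, labels):
    """TF weighted by class distribution instead of IDF.

    ASSUMPTION: these factor definitions are illustrative, not the paper's.
      intra-class factor = share of the document's class that contains the term
      inter-class factor = log(1 + number of classes / classes containing the term)
    """
    class_sizes = Counter(labels)
    n_classes = len(class_sizes)
    # docs_with_term[c][t] = number of documents in class c that contain term t.
    docs_with_term = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        docs_with_term[label].update(set(doc))
    # classes_with_term[t] = number of classes in which term t appears.
    classes_with_term = Counter(
        term for label in docs_with_term for term in docs_with_term[label]
    )
    scores = []
    for doc, label in zip(docs, labels):
        doc_scores = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            intra = docs_with_term[label][term] / class_sizes[label]
            inter = math.log(1 + n_classes / classes_with_term[term])
            doc_scores[term] = tf * intra * inter
        scores.append(doc_scores)
    return scores


# Hypothetical toy corpus: two classes with two documents each.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "team"],
    ["stock", "price", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

baseline = tfidf_scores(docs)
class_aware = class_aware_scores(docs, labels)
```

In this toy corpus, a term that appears in every document of one class and in no other class ("goal") outranks a term shared across classes ("team") and a term that is rare within its own class ("match"). Because the intra-class factor is normalized by class size, a small class is not penalized for having few documents, which is one plausible way to account for imbalanced data.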

