[1] DUAN Guo-lun, XIE Jun, GUO Lei-lei, et al. Improvement of TFIDF Feature Selection Algorithm in Web Document Classification[J]. Computer Technology and Development, 2019, 29(05): 49-53. [doi:10.3969/j.issn.1673-629X.2019.05.010]

Improvement of TFIDF Feature Selection Algorithm in Web Document Classification

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
29
Issue:
2019, No. 05
Pages:
49-53
Column:
Intelligence, Algorithms, Systems Engineering
Publication date:
2019-05-10

Article Info

Title:
Improvement of TFIDF Feature Selection Algorithm in Web Document Classification
Article ID:
1673-629X(2019)05-0049-05
Author(s):
DUAN Guo-lun(1), XIE Jun(1), GUO Lei-lei(2), WANG Xiao-ying(1)
1. School of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; 2. School of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China
Keywords:
Web document classification; feature selection; TFIDF algorithm; SVM
CLC number:
TP391.1
DOI:
10.3969/j.issn.1673-629X.2019.05.010
Abstract:
As massive data resources emerge on the web, Web document classification has received increasing attention, and feature selection plays an important role in it. Feature selection effectively reduces the dimensionality of the text vector space model, making it possible to build prediction models that are faster and cheaper to run. The traditional TFIDF algorithm judges the importance of a feature word for document classification solely from its term frequency and inverse document frequency. It ignores how feature terms are distributed within and between classes and does not account for imbalanced data sets, which limits its effectiveness. To address these shortcomings, an intra-class distribution factor and an inter-class distribution factor are proposed. Replacing the inverse document frequency with these two factors lets the improved expression select more effective feature words. In text classification comparison experiments with an SVM classifier, the method improves the F1 value to a certain extent over the unimproved method and also achieves good classification results on imbalanced data sets.
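
The traditional weighting criticized in the abstract can be illustrated with a minimal sketch of the common textbook form, tf(t, d) × log(N / df(t)). The function name `tfidf_scores` and the toy documents are hypothetical and not taken from the paper, and the paper's intra-class and inter-class distribution factors are not defined on this page, so they are not reproduced here.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Textbook form only: tf = count / document length,
    idf = log(N / df). Class labels are never consulted.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus: the first two documents could be read as
# one class (sports), the third as another (finance).
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
]
scores = tfidf_scores(docs)
```

In this toy corpus, "team" occurs in every document and receives a weight of zero, while "goal" occurs in both of the first two documents and is therefore weighted lower than a word that occurs in only one document, such as "stock". Because the computation never looks at class membership, it cannot tell whether a word's document frequency comes from being concentrated in one class or spread across many, which is the kind of limitation the abstract attributes to the traditional algorithm.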


Last Update: 2019-05-10