[1] ZHU Shi-ling, ZHENG Yan. Research on Improved Text Feature Selection Algorithm[J]. Computer Technology and Development, 2019, 29(05): 66-69. [doi:10.3969/j.issn.1673-629X.2019.05.014]

Research on Improved Text Feature Selection Algorithm

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
29
Issue:
Issue 05, 2019
Pages:
66-69
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2019-05-10

Article Info

Title:
Research on Improved Text Feature Selection Algorithm
Article ID:
1673-629X(2019)05-0066-04
Author(s):
ZHU Shi-ling (朱世玲), ZHENG Yan (郑彦)
School of Computer Software, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
Keywords:
feature selection; mutual information; document frequency; text classification; improved mutual information; improved document frequency
CLC number:
TP301.6
DOI:
10.3969/j.issn.1673-629X.2019.05.014
Abstract:
The quality of feature selection determines the accuracy of text classification. Common text feature selection methods include document frequency (DF), mutual information (MI), information gain, and the chi-square statistic. We discuss the shortcomings of document frequency and mutual information in feature selection and, on that basis, propose an improved algorithm that combines the two. Document frequency favors high-frequency words without considering whether a word discriminates between categories; we therefore use the variance of a word's document frequency across categories as a weight on its document frequency. Mutual information favors low-frequency words and also ignores features whose mutual information is negative, even though some of these words carry more category information; we therefore take the absolute value of mutual information to strengthen their role. An experimental comparison of DF, MI, and the improved DFMI shows that the proposed algorithm improves precision, recall, and the F1 measure, which verifies its effectiveness.
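The two ideas in the abstract (weighting document frequency by its variance across categories, and using the absolute value of mutual information) can be sketched in Python as below. The abstract does not give the exact formulas, so this is only an illustration under stated assumptions: the two parts are combined by multiplication, mutual information is taken as a class-prior-weighted average of absolute pointwise values, and add-one smoothing keeps the logarithm finite. The function name `dfmi_scores` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def dfmi_scores(docs, labels):
    """Score terms with a hypothetical DF x |MI| combination.

    docs: list of token lists; labels: parallel list of category names.
    The combination rule and the smoothing are assumptions, not the
    paper's published formulas.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # term -> documents containing it
    df_by_class = defaultdict(Counter)  # term -> category -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[term][label] += 1

    scores = {}
    for term, n_t in df.items():
        # Document-frequency rate per category and its population variance.
        rates = [df_by_class[term][c] / n_c for c, n_c in class_sizes.items()]
        mean_rate = sum(rates) / len(rates)
        variance = sum((r - mean_rate) ** 2 for r in rates) / len(rates)
        weighted_df = variance * (n_t / n_docs)

        # Absolute pointwise MI per category, averaged with class priors.
        p_t = (n_t + 1) / (n_docs + 2)  # add-one smoothing (assumption)
        abs_mi = 0.0
        for c, n_c in class_sizes.items():
            p_t_given_c = (df_by_class[term][c] + 1) / (n_c + 2)
            abs_mi += (n_c / n_docs) * abs(math.log(p_t_given_c / p_t))

        scores[term] = weighted_df * abs_mi
    return scores


# Toy corpus (hypothetical): "the" appears everywhere, "goal" in one class only.
docs = [
    ["goal", "match", "the"],
    ["goal", "team", "the"],
    ["stock", "market", "the"],
    ["stock", "bank", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
scores = dfmi_scores(docs, labels)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this sketch a word that occurs equally often in every category gets zero variance and therefore a zero score, however frequent it is, while a word concentrated in one category is rewarded both by the variance weight and by the absolute mutual information of the categories it avoids.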

Last Update: 2019-05-10