[1] SUN Yan-xiong, LI Ye-li, BIAN Yu-ning. Application of Random Forest Algorithm for Book Subject Classification[J]. Computer Technology and Development, 2020, 30(06): 65-70. [doi:10.3969/j.issn.1673-629X.2020.06.013]

Application of Random Forest Algorithm for Book Subject Classification
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
30
Issue:
2020, No. 06
Pages:
65-70
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2020-06-10

Article Info

Title:
Application of Random Forest Algorithm for Book Subject Classification
Article ID:
1673-629X(2020)06-0065-06
Author(s):
SUN Yan-xiong, LI Ye-li, BIAN Yu-ning
Beijing Institute of Graphic Communication, Beijing 102600, China
Keywords:
book text classification; random forest; Tr-K method; TRk-SW-RF model; theme classification; decision tree
CLC number:
TP301.6
DOI:
10.3969/j.issn.1673-629X.2020.06.013
Abstract:
Traditional random forest algorithms extract low-quality text features, which leads to poor classification performance. To address this, an improved random forest algorithm is proposed for large-volume text such as books. Because the quality of the individual decision trees in a traditional random forest is also difficult to guarantee, a weighted voting mechanism is proposed to improve it. The algorithm has two main components. The first is the Tr-K method for extracting text theme features, which aims to improve the quality and representativeness of those features. The second is a verification mechanism based on the roughly 1/3 of the data left out of bag (OOB) by bootstrap sampling. Experiments use the 20 Newsgroups dataset and the Chinese classification corpus provided by Sogou Lab; using both an English and a Chinese dataset takes the generalization of the model into account, and the experiments verify that the model classifies better than the traditional random forest algorithm on both datasets. Experimental results in a Python environment show that the proposed method achieves better text classification results than C4.5, KNN, SVM, and the original random forest algorithm.
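
The abstract does not give the weighting formula or the details of the Tr-K method, so the following is only a minimal sketch of the general idea of out-of-bag weighted voting, not the authors' implementation. It assumes each tree's vote is weighted by its accuracy on its own out-of-bag rows, and it uses a one-feature threshold stump as a stand-in for a full decision tree. All function names (`bootstrap_split`, `train_stump`, `fit_weighted_forest`, `weighted_vote`, `predict`) are hypothetical.

```python
import random
from collections import defaultdict


def bootstrap_split(n, rng):
    """Draw n row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]  # about 1/3 of rows for large n
    return in_bag, oob


def train_stump(X, y, feature):
    """Stand-in base learner (assumption): threshold one feature at its mean."""
    thr = sum(row[feature] for row in X) / len(X)
    sides = {True: defaultdict(int), False: defaultdict(int)}
    for row, label in zip(X, y):
        sides[row[feature] > thr][label] += 1
    # An empty side falls back to the most common label in the bag.
    fallback = max(sorted(set(y)), key=y.count)
    vote = {side: (max(c, key=c.get) if c else fallback)
            for side, c in sides.items()}
    return lambda row: vote[row[feature] > thr]


def fit_weighted_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples; weight each by its OOB accuracy."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, oob = bootstrap_split(len(X), rng)
        feature = rng.randrange(len(X[0]))  # random feature per learner
        tree = train_stump([X[i] for i in in_bag],
                           [y[i] for i in in_bag], feature)
        # Assumed weighting rule: fraction of OOB rows classified correctly.
        weight = (sum(tree(X[i]) == y[i] for i in oob) / len(oob)) if oob else 0.0
        forest.append((tree, weight))
    return forest


def weighted_vote(labels, weights):
    """Return the label with the largest total weight."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)


def predict(forest, row):
    """Classify one row by weighted voting over all learners."""
    return weighted_vote([tree(row) for tree, _ in forest],
                         [w for _, w in forest])
```

With weights `[0.9, 0.3, 0.4]` and votes `["a", "b", "b"]`, `weighted_vote` returns `"a"` even though a plain majority vote would pick `"b"`, which is the effect a quality-weighted scheme is meant to have: one reliable tree can outweigh several unreliable ones.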

Last Update: 2020-06-10