[1] ZHANG Yin-jie, CHUAI Jin-hua, ZHAI Xiao-hui. Binary Prediction of Malware Infection Based on Integrated Learning Algorithm[J]. Computer Technology and Development, 2021, 31(05): 15-20. [doi:10.3969/j.issn.1673-629X.2021.05.003]

Binary Prediction of Malware Infection Based on Integrated Learning Algorithm

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
31
Issue:
2021, No. 05
Pages:
15-20
Column:
Big Data Analysis and Mining
Publication date:
2021-05-10

Article Info

Title:
Binary Prediction of Malware Infection Based on Integrated Learning Algorithm
Article ID:
1673-629X(2021)05-0015-06
Author(s):
ZHANG Yin-jie, CHUAI Jin-hua, ZHAI Xiao-hui
School of Information Engineering, Chang'an University, Xi'an 710064, Shaanxi, China
Keywords:
integrated learning; malware infection; binary prediction; random forest; LightGBM; feature engineering
CLC number:
TP311.5
DOI:
10.3969/j.issn.1673-629X.2021.05.003
Abstract:
This paper studies the application of two ensemble (integrated) learning algorithms, random forest and LightGBM, to binary classification prediction of malware infection. For the malware infection prediction dataset, outliers are corrected during preprocessing, and a suitable encoding method is chosen for each type of data in the dataset. Feature engineering is then carried out, including constructing original features, splitting some features, and constructing timestamp features to supplement the missing time information. The random forest algorithm, which is based on Bagging, is used to obtain feature importance scores; these are sorted from high to low to identify the factors that most influence the prediction of malware infection. Datasets containing different features are divided according to the importance scores, and prediction models are built with random forest and with LightGBM, which is based on Boosting. The most suitable feature set is determined from the change in the AUC value of random forest, which achieves dimensionality reduction. A traditional decision tree is compared with the ensemble learning algorithms on five-fold cross-validation results. The experimental results show that the ensemble learning algorithms can determine an appropriate number of features during prediction, and that their prediction performance is significantly higher than that of the traditional decision tree algorithm.
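The workflow summarized in the abstract (rank features by random forest importance, score nested feature subsets by AUC, then compare a single decision tree against ensembles under five-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes scikit-learn is installed, uses a synthetic dataset from make_classification in place of the malware infection data, and substitutes scikit-learn's GradientBoostingClassifier as a generic Boosting stand-in for LightGBM. The subset sizes, tree counts, and the helper name mean_auc are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the malware infection dataset (assumption).
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)

# Step 1: rank features by random forest importance, highest first.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]


def mean_auc(model, columns):
    """Mean AUC over five-fold cross-validation on the chosen columns."""
    scores = cross_val_score(model, X[:, columns], y, cv=5, scoring="roc_auc")
    return scores.mean()


# Step 2: score nested top-k feature subsets with the random forest
# and keep the subset size with the highest mean AUC.
subset_sizes = [2, 4, 8, 12]  # hypothetical cut points
rf_auc = {
    k: mean_auc(RandomForestClassifier(n_estimators=50, random_state=0),
                ranking[:k])
    for k in subset_sizes
}
best_k = max(rf_auc, key=rf_auc.get)

# Step 3: compare a single decision tree with two ensembles on that subset.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Generic boosting stand-in; the paper itself uses LightGBM.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50,
                                                    random_state=0),
}
results = {name: mean_auc(model, ranking[:best_k])
           for name, model in models.items()}

print(f"selected top-{best_k} features: {ranking[:best_k].tolist()}")
for name, score in results.items():
    print(f"{name}: mean AUC = {score:.3f}")
```

For brevity the sketch ranks features on the full dataset before cross-validating, which can leak information into the AUC estimates; a more careful version would compute the ranking inside each training fold.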


Last Update: 2020-05-10