[1] MIAO Li-zhi, DIAO Ji-yao, LOU Chong, et al. Breast Cancer Risk Prediction Analysis Based on Apache Spark and Random Forest Algorithm[J]. Computer Technology and Development, 2019, 29(08): 142-146. [doi:10.3969/j.issn.1673-629X.2019.08.027]

Breast Cancer Risk Prediction Analysis Based on Spark and Random Forest

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
29
Issue:
2019, No. 08
Pages:
142-146
Column:
Application Development Research
Publication date:
2019-08-10

Article Info

Title:
Breast Cancer Risk Prediction Analysis Based on Apache Spark and Random Forest Algorithm
Article ID:
1673-629X(2019)08-0135-05
Author(s):
MIAO Li-zhi (1, 2, 3), DIAO Ji-yao (4), LOU Chong (4), CUI Jin-dong (4)
1. School of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. Jiangsu Engineering Laboratory for Smart Analysis of Healthy Big Data and Location Based Services, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 3. Engineering Research Center of Ubiquitous Network Health Service System, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; 4. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Keywords:
Apache Spark; random forest; disease prediction; machine learning; intelligent health; big data analysis
CLC number:
TP311
DOI:
10.3969/j.issn.1673-629X.2019.08.027
Abstract:
Modern medicine is moving toward intelligent health. Against this background, to improve the detection and prediction of breast cancer risk, we apply big data analysis with a random forest model, which combines multiple weak classifiers by integrating the results of many decision trees to obtain the probability of disease onset. The model is trained with the pipeline learning method, and pathogenic factor analysis and outcome prediction are carried out on that model. In addition, the Pearson correlation coefficient and the Spearman rank correlation coefficient are used for correlation analysis to identify the influencing factors with higher weights, in support of breast cancer risk monitoring and early prevention. The experimental results show that, among the cell nucleus parameters of breast cancer cells, Perimeter, Texture and Concave points have a relatively large influence on breast cancer and are more likely to lead to the disease. The model built with the pipeline training method reaches a prediction accuracy of up to 99.04%, indicating high accuracy and a reliable method. The results provide a useful reference for the detection of breast cancer risk.
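
The abstract names two correlation measures without showing how they differ. The following is a minimal pure-Python sketch of both, not the authors' Spark implementation: Pearson measures linear association on the raw values, while Spearman applies the same formula to the ranks of the values. The function names and the toy `perimeter` and `malignant` values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: a nucleus feature and a 0/1 diagnosis label.
perimeter = [78.0, 85.5, 92.1, 110.4, 130.0, 142.7]
malignant = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print("Pearson :", round(pearson(perimeter, malignant), 4))
    print("Spearman:", round(spearman(perimeter, malignant), 4))
```

Because Spearman depends only on rank order, it is less sensitive to outliers and to monotone but non-linear relationships, which is one plausible reason to report both coefficients when screening influencing factors.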


Last Update: 2019-08-10