

Research on Parallelization of the CVFDT Classification Algorithm Based on Spark


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 28
Issue:: 2018, No. 06

Pages:: 35-38

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2018-06-10

Article Info

Title:: Research on Parallelization of Concept-adapting Very Fast Decision Tree Classification Algorithm Based on Spark

Article ID:: 1673-629X(2018)06-0035-04

Author(s):: ZHUANG Rong (庄荣); LI Ling-juan (李玲娟); School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China

Keywords:: data streams; CVFDT; parallelization; Spark; resilient distributed datasets

Classification number:: TP301.6

DOI:: 10.3969/j.issn.1673-629X.2018.06.008

Document code:: A

Abstract:: To improve the efficiency of classification mining on stream data, this paper studies a scheme for parallelizing the concept-adapting very fast decision tree (CVFDT) algorithm by deploying it on the stream data computing platform Spark, and designs a Spark-based parallel implementation of CVFDT. First, CVFDT is parallelized across attributes, that is, the computation of splitting points is parallelized. Then, during tree construction on Spark, all attribute lists of a node are converted into resilient distributed datasets (RDDs), Spark's characteristic data abstraction. The parallel tasks generated from each RDD are computed, the best splitting point of each attribute is collected and compared, and the Hoeffding bound is computed as the node-splitting condition to determine the optimal splitting point, so that the decision tree is created recursively. Experimental results show that CVFDT classifies significantly more efficiently on a Spark cluster than in a stand-alone environment, that the improved parallel CVFDT adapts well to large-scale stream data processing, and that a reasonable RDD filtering setting can further improve classification efficiency.
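
The split decision described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. Several things here are assumptions that the abstract does not state: information gain is used as the split measure, the confidence parameter `delta` defaults to 1e-6, and a local thread pool stands in for the per-attribute tasks that Spark would generate from the RDDs. The function names are hypothetical. The Hoeffding bound itself is the standard one, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split measure and n is the number of examples seen at the node.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split_for_attribute(item):
    """Find the best binary split 'value <= threshold' for one attribute list.

    item is (attribute_name, [(value, label), ...]).
    Returns (gain, attribute_name, threshold); threshold is None if no split helps.
    """
    name, pairs = item
    pairs = sorted(pairs, key=lambda p: p[0])
    labels = [label for _, label in pairs]
    base = entropy(labels)
    best = (0.0, name, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left, right = labels[:i], labels[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[0]:
            best = (gain, name, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def choose_split(attribute_lists, num_classes, delta=1e-6):
    """Return (attribute, threshold) if the node should split, else None.

    attribute_lists maps attribute name -> [(value, label), ...], one list per
    attribute. Each list is evaluated as an independent task (assumption: a
    thread pool replaces Spark's parallel tasks in this sketch).
    """
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(best_split_for_attribute, attribute_lists.items()))
    candidates.sort(key=lambda c: c[0], reverse=True)

    n = len(next(iter(attribute_lists.values())))
    # For information gain, the range R is log2(number of classes).
    eps = hoeffding_bound(math.log2(num_classes), delta, n)

    best = candidates[0]
    second_gain = candidates[1][0] if len(candidates) > 1 else 0.0
    # Split only when the best attribute beats the runner-up by more than eps.
    if best[2] is not None and best[0] - second_gain > eps:
        return best[1], best[2]
    return None
```

Calling `choose_split` with one list per attribute returns the winning attribute and threshold once enough examples have accumulated for the gap between the two best gains to exceed the bound; with too few examples it returns `None` and the node keeps collecting data.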

