[1] ZHU Wei-lin, MU Wei-min, JIN Zong-ze, et al. Statistical Model of Distributed Data Stream with High Reliability Based on MR[J]. Computer Technology and Development, 2018, 28(01): 6-10. [doi:10.3969/j.issn.1673-629X.2018.01.002]

Statistical Model of Distributed Data Stream with High Reliability Based on MR

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
28
Issue:
2018, No. 01
Pages:
6-10
Publication date:
2018-01-10

Article Info

Title:
Statistical Model of Distributed Data Stream with High Reliability Based on MR
Article ID:
1673-629X(2018)01-0006-05
Author(s):
ZHU Wei-lin [1][2], MU Wei-min [1], JIN Zong-ze [1][2], WANG Wei-ping [1]
1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China
Keywords:
data stream; grouping statistics; continuous query; distributed system; real-time processing
CLC number:
TP391
DOI:
10.3969/j.issn.1673-629X.2018.01.002
Document code:
A
Abstract:
Taking the distinctive characteristics of stream data into account, and using continuous grouping statistics over a window model on data streams as the application scenario, we combine the strengths of two mainstream stream processing platforms, Storm and Spark Streaming, and propose Mars, a distributed data stream statistical model with high throughput, low latency, and high scalability. Mars addresses the heavy throughput pressure and low-latency requirements that arise because stream data is volatile and highly time-sensitive. For fault tolerance, Mars provides at-least-once semantics to guard against major errors. Mars was tested in a real experimental environment and compared with the currently popular distributed stream processing platforms Spark Streaming and Storm. Its latency for real-time operations on data falls between those of the two platforms, but across different cluster sizes its throughput is markedly better than both, by a factor of one to two. In terms of semantic accuracy, Mars achieves the same level of semantic guarantee as Storm.
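To make the application scenario concrete, the following is a minimal single-process sketch of grouped counting over fixed (tumbling) windows. It is not the Mars implementation, which the abstract does not describe in detail; the function name `windowed_group_counts`, the `(timestamp, key)` event shape, and the choice of tumbling windows are all assumptions made for illustration.

```python
from collections import defaultdict


def windowed_group_counts(events, window_size):
    """Count events per key within tumbling windows.

    events: iterable of (timestamp, key) pairs (hypothetical event shape).
    window_size: window length, in the same unit as the timestamps.
    Returns {window_start: {key: count}}, ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Map each event to the start of the window that contains it.
        window_start = (timestamp // window_size) * window_size
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    sample = [(0, "a"), (1, "b"), (4, "a"), (5, "a"), (9, "b"), (10, "a")]
    print(windowed_group_counts(sample, 5))
    # {0: {'a': 2, 'b': 1}, 5: {'a': 1, 'b': 1}, 10: {'a': 1}}
```

A distributed system would partition the events by key across workers and emit each window's result when the window closes. Under at-least-once delivery an event may be replayed after a failure, so counts like these can be overstated unless duplicates are filtered.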


Last Update: 2018-03-09