[1] WANG Gang, LI Sheng-en. Research on Handling Data Skew in MapReduce[J]. Computer Technology and Development, 2016, 26(09): 201-204.

 Research on Handling Data Skew in MapReduce

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
26
Issue:
2016, No. 09
Pages:
201-204
Section:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2016-09-10

Article Info

Title:
 Research on Handling Data Skew in MapReduce
Article ID:
1673-629X(2016)09-0201-04
Author(s):
 WANG Gang, LI Sheng-en
 School of Computer Science and Technology, Shandong Jianzhu University
Keywords:
 big data; MapReduce; load balancing; sampling
Classification number:
TP301
Document code:
A
Abstract:
 With the rapid development of the mobile Internet and the Internet of Things, data volumes are growing explosively, and we have entered the era of big data. MapReduce is a distributed computing framework capable of processing massive data sets, and it has become a research focus in the big data field. However, the performance of MapReduce depends heavily on how the data is distributed. When the data is skewed, MapReduce's default hash partitioning cannot guarantee a balanced load across nodes in the Reduce phase, and the heavily loaded nodes delay the completion of the whole job. To address this problem, a sampling-based approach is used. Before the user's job runs, a separate MapReduce job samples the input in parallel. Once the frequency distribution of the keys has been obtained, it is combined with data locality to produce a load-balanced data partitioning strategy. An experimental platform was built, and the WordCount example was tested on it. The results show that the sampling-based partitioning strategy outperforms MapReduce's default hash partitioning, and that sampling-based partitioning combined with data locality outperforms sampling-based partitioning that ignores data locality.
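
The sampling-and-partitioning idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the greedy heaviest-key-first assignment, the `slack` tolerance, the `local_counts` input, and every function name below are assumptions made for illustration. A CRC32 hash stands in for the key hash used by hash partitioning, and the word list is synthetic.

```python
import random
import zlib
from collections import Counter


def stable_hash(key):
    """Deterministic stand-in for a key hash (assumption, not Hadoop's hashCode)."""
    return zlib.crc32(key.encode("utf-8"))


def sample_key_counts(keys, rate, seed=0):
    """Estimate key frequencies from a Bernoulli sample of the input keys."""
    rng = random.Random(seed)
    return Counter(k for k in keys if rng.random() < rate)


def hash_loads(key_counts, num_reducers, hash_fn=stable_hash):
    """Per-reducer load under hash partitioning: reducer = hash(key) mod R."""
    loads = [0] * num_reducers
    for k, count in key_counts.items():
        loads[hash_fn(k) % num_reducers] += count
    return loads


def balanced_partition(key_counts, num_reducers, local_counts=None, slack=0.0):
    """Greedy assignment: heaviest key first, to a lightly loaded reducer.

    local_counts (hypothetical input): {key: {reducer_index: count}}, the number
    of sampled records of a key already stored on each reducer's node.
    slack (hypothetical parameter): fraction of the ideal per-reducer load by
    which a reducer may exceed the current minimum and still be chosen
    for the sake of locality.
    """
    local_counts = local_counts or {}
    tolerance = slack * sum(key_counts.values()) / num_reducers
    loads = [0] * num_reducers
    assignment = {}
    for k, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = min(loads)
        candidates = [r for r in range(num_reducers) if loads[r] <= lightest + tolerance]
        on_node = local_counts.get(k, {})
        # Prefer the candidate holding most of this key locally;
        # break ties by lower load, then by lower reducer index.
        target = max(candidates, key=lambda r: (on_node.get(r, 0), -loads[r], -r))
        assignment[k] = target
        loads[target] += count
    return assignment, loads


# Illustrative run on a synthetic, skewed word list (made-up data).
words = ["the"] * 5000 + ["of"] * 3000 + ["data"] * 1500 + ["skew", "hash", "node", "key"] * 125
estimated = sample_key_counts(words, rate=0.1, seed=42)
_, loads = balanced_partition(estimated, num_reducers=3)
print("hash partition loads:    ", hash_loads(estimated, 3))
print("balanced partition loads:", loads)
```

Sorting keys by estimated frequency before assigning them is one common heuristic for this kind of load balancing. With `slack=0.0`, locality only breaks ties between equally loaded reducers. A larger `slack` trades some balance for less data movement.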


Last Update: 2016-10-26