
Similarity Joins of High-dimensional Data Based on Spark (基于 Spark 的高维数据相似性连接)


Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 28
Issue:: 2018, No. 08

Pages:: 43-47

Section:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 2018-08-10

Article Info

Title:: Similarity Joins of High-dimensional Data Based on Spark

Article ID:: 1673-629X(2018)08-0043-05

Author(s):: CHENG Xiao-hai (成小海); School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China

Keywords:: high-dimensional data; similarity joins; Spark; piecewise aggregate approximation; symbolic aggregate approximation

CLC number:: TP311

DOI:: 10.3969/j.issn.1673-629X.2018.08.009

Document code:: A

Abstract:: High-dimensional data similarity join (HDSJ) finds the pairs of vectors in a given spatial database that satisfy a given condition, which requires frequent join and distance-calculation operations. As the data volume and the number of dimensions grow, the computational cost of HDSJ increases exponentially. To address the poor efficiency of HDSJ on massive data, this paper proposes an improved HDSJ method on the Spark framework that exploits the distributed, in-memory parallel computing of a Spark cluster. Using Spark's efficient RDD operators, the method represents the original high-dimensional vectors with piecewise aggregate approximation (PAA) and then reorganizes the represented vectors into groups according to their symbolic aggregate approximation (SAX) representations, which avoids a large amount of unnecessary computation. PAA and SAX are both existing dimensionality reduction techniques; used together, they filter out most of the interfering data. Experimental results show that the method increases computation speed while maintaining the accuracy of the results, and that it outperforms existing methods.
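The PAA-then-SAX filtering idea summarized in the abstract can be illustrated with a small single-machine sketch. This is not the paper's algorithm or its Spark implementation: the function names, the Euclidean distance threshold `eps`, the alphabet size, and the two pruning rules (the standard SAX and PAA lower bounds on Euclidean distance) are assumptions chosen for illustration. The sketch also assumes every vector has the same length, divisible by the number of segments. On a cluster, the per-vector reduction and the grouping by SAX word would plausibly map onto RDD operations such as `map` and `groupByKey`, but that mapping is likewise an assumption.

```python
import math
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist


def paa(vec, segments):
    """Piecewise aggregate approximation: the mean of each equal-length segment."""
    n = len(vec)
    if n % segments != 0:
        raise ValueError("this sketch assumes len(vec) is divisible by segments")
    size = n // segments
    return [sum(vec[i:i + size]) / size for i in range(0, n, size)]


def sax_symbols(paa_vec, cuts):
    """Map each PAA value to the index of the breakpoint region it falls in."""
    return tuple(bisect_right(cuts, v) for v in paa_vec)


def sax_mindist(word_a, word_b, cuts, n):
    """Lower bound on the Euclidean distance between any two vectors
    whose SAX words are word_a and word_b."""
    total = 0.0
    for a, b in zip(word_a, word_b):
        lo, hi = min(a, b), max(a, b)
        if hi - lo >= 2:  # equal or adjacent symbols contribute nothing
            total += (cuts[hi - 1] - cuts[lo]) ** 2
    return math.sqrt(n / len(word_a)) * math.sqrt(total)


def paa_lower_bound(p, q, n):
    """sqrt(n / w) * ||p - q||, which never exceeds the original distance."""
    return math.sqrt(n / len(p)) * math.dist(p, q)


def similarity_join(vectors, eps, segments=4, alphabet_size=4):
    """Return index pairs (i, j), i < j, with Euclidean distance <= eps."""
    n = len(vectors[0])
    # Equiprobable breakpoints of the standard normal distribution.
    cuts = [NormalDist().inv_cdf(k / alphabet_size)
            for k in range(1, alphabet_size)]
    reduced = [paa(v, segments) for v in vectors]

    # Group vector indices by SAX word.
    groups = defaultdict(list)
    for idx, p in enumerate(reduced):
        groups[sax_symbols(p, cuts)].append(idx)

    result = []
    words = list(groups)
    for gi, wa in enumerate(words):
        for wb in words[gi:]:
            # Group-level pruning: skip whole group pairs that are too far apart.
            if sax_mindist(wa, wb, cuts, n) > eps:
                continue
            for i in groups[wa]:
                for j in groups[wb]:
                    if wa == wb and j <= i:
                        continue  # avoid self-pairs and duplicates
                    # Vector-level pruning on the reduced representation.
                    if paa_lower_bound(reduced[i], reduced[j], n) > eps:
                        continue
                    # Exact verification on the original vectors.
                    if math.dist(vectors[i], vectors[j]) <= eps:
                        result.append((min(i, j), max(i, j)))
    return sorted(result)


vectors = [
    [0.0] * 8,
    [0.1] * 8,
    [2.0] * 8,
    [0.0, 0.2, 0.0, 0.2, 0.0, 0.2, 0.0, 0.2],
]
print(similarity_join(vectors, eps=0.3))  # [(0, 1), (1, 3)]
```

Because both pruning rules are lower bounds on the true distance, this sketch discards only pairs that cannot qualify, and the final exact check removes any false candidates. SAX is usually applied to z-normalized data so that the symbols are roughly equiprobable; the sketch skips normalization, which affects how evenly the groups are filled but not the correctness of the bounds.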

