[1] WANG Xiao-xia, SUN De-cai. A MapReduce-based Local Similarity Self-join Algorithm[J]. Computer Technology and Development, 2020, 30(02): 88-93. [doi:10.3969/j.issn.1673-629X.2020.02.018]
A MapReduce-based Local Similarity Self-join Algorithm
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
- Volume: 30
- Issue: 2020, No. 02
- Pages: 88-93
- Column: Intelligence, Algorithms, and Systems Engineering
- Publication date: 2020-02-10
Article Info
- Title: A MapReduce-based Local Similarity Self-join Algorithm
- Article ID: 1673-629X(2020)02-0088-06
- Author(s): WANG Xiao-xia (王晓霞); SUN De-cai (孙德才)
- Affiliation: School of Information Science and Technology, Bohai University, Jinzhou 121013, Liaoning, China
- Keywords: similarity join; self-join; MapReduce; data cleaning; big data
- Classification number: TP391
- DOI: 10.3969/j.issn.1673-629X.2020.02.018
- Abstract: Local similarity self-join finds all record pairs in a single dataset that satisfy a given similarity requirement, and it is widely used in data cleaning, gene sequence alignment, and plagiarism (near-duplicate) detection. To study parallel self-join over a single string set, the paper proposes a self-join algorithm based on the MapReduce framework that solves the locating problem of local similarity self-join. The algorithm follows a two-stage filter-verify pattern. In the filter stage, irrelevant-pair filtering and redundant-pair filtering discard a large number of invalid string pairs. In the verify stage, content-reserved terms are generated for the string with the smaller ID, so that string IDs can be matched quickly with string contents. Experiments show that on large datasets the algorithm is consistently faster than LS-Join, a leading existing algorithm, and that it is well suited to local similarity self-join under a dynamically changing edit-distance parameter. The results also show that the proposed techniques effectively speed up local similarity self-join.
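The filter-verify pattern named in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's algorithm. It simplifies local similarity to whole-string edit distance, uses shared q-grams as the filter, and simulates the map and reduce steps with plain Python functions. All names (`map_phase`, `reduce_phase`, `self_join`, `tau`, `q`) are hypothetical. The q-gram filter is only guaranteed not to miss a pair when the longer string of that pair has at least `q * (tau + 1)` characters.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def map_phase(records, q):
    """Emit (q-gram, string_id) pairs, as a mapper would (illustrative)."""
    for sid, text in records.items():
        for gram in {text[i:i + q] for i in range(len(text) - q + 1)}:
            yield gram, sid


def reduce_phase(pairs):
    """Group ids by q-gram and emit candidate id pairs, smaller id first."""
    groups = defaultdict(set)
    for gram, sid in pairs:
        groups[gram].add(sid)
    candidates = set()  # the set drops pairs generated again by another q-gram
    for ids in groups.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def self_join(records, tau, q=2):
    """Filter: pairs sharing no q-gram are never compared.
    Verify: keep candidates whose edit distance is at most tau."""
    candidates = reduce_phase(map_phase(records, q))
    return sorted((a, b) for a, b in candidates
                  if edit_distance(records[a], records[b]) <= tau)


records = {1: "mapreduce", 2: "mapreduse", 3: "hadoop", 4: "mapredvce"}
print(self_join(records, tau=1))
```

In this toy input, string 3 shares no 2-gram with the others and is never verified, while the pair (2, 4) passes the filter but is rejected in the verify step because its edit distance is 2.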
Last Update: 2020-02-10