[1] YANG Bin, GAO Jun-tao, WANG Zhi-bao, et al. A Tuple-level Data Lineage Approach Based on Word Embedding[J]. Computer Technology and Development, 2023, 33(12): 49-57. [doi:10.3969/j.issn.1673-629X.2023.12.007]

A Tuple-level Data Lineage Approach Based on Word Embedding
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
33
Issue:
No. 12, 2023
Pages:
49-57
Column:
Big Data and Cloud Computing
Publication date:
2023-12-10

Article Info

Title:
A Tuple-level Data Lineage Approach Based on Word Embedding
Article number:
1673-629X(2023)12-0049-09
Author(s):
YANG Bin1, GAO Jun-tao1, WANG Zhi-bao1, LI Fei2, MA Qiang2, JIANG Shu-tao1
1. School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China;
2. School of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China
Keywords:
structured data; data lineage; tuple vectors; similarity comparison; word embedding
CLC number:
TP311.13; TP391
DOI:
10.3969/j.issn.1673-629X.2023.12.007
Abstract:
In the era of information explosion, data volumes grow daily, and data mining can uncover the relationships hidden in the data, but only if the data are correct; otherwise all subsequent work is meaningless. Data lineage technology helps data analysts quickly locate the source and processing history of erroneous data, reducing the time and difficulty of analyzing such data, and is valuable for data quality control and trustworthy data management. Existing tuple-level data lineage methods suffer from high storage overhead and low lineage efficiency, so this paper uses word embedding to improve them. First, a tuple vectorization encoding mechanism is studied, and lineage relationships between tuples are identified from the similarity of their tuple vectors. Second, an optimization algorithm based on attribute importance is proposed to improve lineage precision. Third, approximate nearest neighbor search and a tuple filtering optimization mechanism are introduced to reduce the time complexity of lineage tracing. Finally, a directed acyclic graph is used to display the lineage relationships of tuple data. Experimental results show that the proposed method achieves relatively high precision, low time complexity, and low storage consumption, and effectively improves tuple-level data lineage.
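The pipeline summarized in the abstract (tuple vectorization, attribute-weighted similarity, and source-to-target lineage edges) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `token_vector`, `tuple_vector`, `lineage_edges`, the dimension `DIM`, the attribute weights, and the 0.8 threshold are all hypothetical. A hash-based pseudo-embedding stands in for a trained word-embedding model, and a brute-force scan stands in for the approximate nearest neighbor search mentioned in the abstract.

```python
import hashlib
import math

DIM = 16  # hypothetical embedding width, chosen only for illustration


def token_vector(token: str) -> list[float]:
    """Deterministic pseudo-embedding; a stand-in for a trained word-embedding lookup."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def tuple_vector(row: dict[str, str], weights: dict[str, float]) -> list[float]:
    """Encode a tuple as the attribute-weighted average of its token vectors."""
    total = [0.0] * DIM
    weight_sum = 0.0
    for attr, value in row.items():
        w = weights.get(attr, 1.0)  # assumed attribute importance, default 1.0
        for tok in str(value).split():
            vec = token_vector(tok)
            total = [t + w * v for t, v in zip(total, vec)]
            weight_sum += w
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lineage_edges(sources, targets, weights, threshold=0.8):
    """Link each target tuple to its most similar source tuple.

    Returns (source_id, target_id, similarity) triples for matches at or
    above the threshold. Brute force is used here for clarity only.
    """
    src_vecs = {sid: tuple_vector(row, weights) for sid, row in sources.items()}
    if not src_vecs:
        return []
    edges = []
    for tid, row in targets.items():
        tvec = tuple_vector(row, weights)
        best = max(src_vecs, key=lambda sid: cosine(src_vecs[sid], tvec))
        sim = cosine(src_vecs[best], tvec)
        if sim >= threshold:
            edges.append((best, tid, sim))
    return edges
```

Calling `lineage_edges(sources, targets, weights)` with dictionaries that map tuple IDs to attribute dictionaries returns the edges of a lineage graph. Because every edge points from a source tuple to a derived tuple, the result is acyclic as long as the source and target ID sets are disjoint.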

Last Update: 2023-12-10