[1] YANG Zhi-huan, TIAN Wen-long*, HE Ting-ting, et al. Variable Granularity-based Chunk-context Aware Similar Data Deduplication Technique for Cloud Storage[J]. Computer Technology and Development, 2024, 34(04): 16-23. [doi:10.3969/j.issn.1673-629X.2024.04.003]

Variable Granularity-based Chunk-context Aware Similar Data Deduplication Technique for Cloud Storage

Computer Technology and Development [ISSN: 1673-629X / CN: 61-1281/TN]

Volume:
34
Issue:
2024, No. 04
Pages:
16-23
Section:
Big Data and Cloud Computing
Publication date:
2024-04-10

Article Info

Title:
Variable Granularity-based Chunk-context Aware Similar Data Deduplication Technique for Cloud Storage
Article ID:
1673-629X(2024)04-0016-08
Author(s):
YANG Zhi-huan (阳智欢)¹, TIAN Wen-long (田纹龙)¹,²*, HE Ting-ting (何婷婷)³, YE Xu-ming (叶旭明)¹, TANG Jia (唐佳)¹
1. School of Computer Science, University of South China, Hengyang 421001, China;
2. School of Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore;
3. School of Education Science, Hengyang Normal University, Hengyang 421010, China
Keywords:
similar data deduplication; data block semantics; variable granularity; cloud storage; metadata
CLC number:
TP39
DOI:
10.3969/j.issn.1673-629X.2024.04.003
Abstract:
To address the poor effectiveness and high metadata overhead of existing similar data deduplication techniques in cloud storage environments, this paper proposes a variable-granularity, chunk-context-aware similar data deduplication technique for cloud storage. The technique uses a feature extraction algorithm based on sub-block reorganization to extract initial features from the internal structure of each data block's content, and uses a BP (Back Propagation) neural network context-aware model to embed the contextual feature information of the data block into those initial features, yielding variable-granularity data blocks with contextual semantic embedding. By controlling the data block size, the technique dynamically merges neighboring similar data blocks or neighboring non-redundant data blocks to reduce metadata overhead, and splits the transition regions that lie between similar and non-redundant data blocks, which gives a better representation of similar data blocks. Finally, to evaluate its performance, a prototype variable-granularity similar data detection algorithm, rCARD, is implemented and evaluated on real-world datasets. The experimental results show that, compared with Finesse, the latest similarity-detection deduplication technique, rCARD achieves a higher deduplication ratio while significantly reducing metadata size, and speeds up similarity detection by up to 11.07 times.
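The merging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the rCARD implementation: the `Chunk` record, the `"similar"`/`"unique"` labels, and the `max_len` cap are all hypothetical names chosen for illustration, and the labels are assumed to come from some upstream similarity detector. The sketch only shows why merging contiguous chunks of the same class reduces the number of metadata entries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    offset: int  # byte offset of the chunk in the stream
    length: int  # chunk size in bytes
    label: str   # hypothetical class: "similar" or "unique"


def merge_adjacent(chunks: List[Chunk], max_len: int) -> List[Chunk]:
    """Merge contiguous chunks that share a label, capped at max_len bytes."""
    merged: List[Chunk] = []
    for c in chunks:
        last = merged[-1] if merged else None
        if (last is not None
                and last.label == c.label
                and last.offset + last.length == c.offset
                and last.length + c.length <= max_len):
            # Extend the previous entry instead of adding a metadata record.
            merged[-1] = Chunk(last.offset, last.length + c.length, last.label)
        else:
            merged.append(Chunk(c.offset, c.length, c.label))
    return merged


# Illustrative input only: six 4 KiB chunks with made-up labels.
chunks = [
    Chunk(0, 4096, "similar"),
    Chunk(4096, 4096, "similar"),
    Chunk(8192, 4096, "unique"),
    Chunk(12288, 4096, "unique"),
    Chunk(16384, 4096, "unique"),
    Chunk(20480, 4096, "similar"),
]
merged = merge_adjacent(chunks, max_len=8192)
print(len(chunks), "->", len(merged))  # metadata entries before and after
```

With the size cap set to 8 KiB, the six input entries collapse to four while the total number of bytes covered is unchanged. A larger cap would merge more aggressively at the cost of coarser chunks.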
Last Update: 2024-04-10