[1] PENG Shuang-he, Tuergong MAITISABIER, ZHOU Qiao-feng. Research on Deduplication Technique of Chinese Text with Simhash[J]. Computer Technology and Development, 2017, 27(11): 137-140.

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
27
Issue:
Issue 11, 2017
Pages:
137-140
Column:
Application Development Research
Publication date:
2017-11-10

Article Info

Title:
 Research on Deduplication Technique of Chinese Text with Simhash
Article ID:
1673-629X(2017)11-0137-04
Author(s):
 PENG Shuang-he; Tuergong MAITISABIER; ZHOU Qiao-feng
 School of Computer and Information Technology, Beijing Jiaotong University
Keywords:
 data deduplication; Simhash; hash; data blocking
CLC number:
TP311
Document code:
A
Abstract:
 With the rapid development of computer technology, the volume of data held in storage systems across many fields has risen sharply, and the amount of redundant data has grown with it. Previous studies have shown that redundant data accounts for as much as 60% of the data in some storage systems, which drives up the cost of storage management, so handling redundant data has become a focus of storage system research. To address this, a Simhash-based deduplication scheme for Chinese text is proposed. The scheme deduplicates at data-block granularity: special characters in the Chinese text such as ".?!" serve as split points for dividing the data into blocks, the Simhash of each block serves as its unique identifier, and the Hamming distance between identifiers is used to judge similarity and decide which blocks to remove. Comparative experiments show that the proposed scheme achieves a higher deduplication rate and accuracy than traditional hash-based deduplication, indicating good application value and prospects.
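
The pipeline outlined in the abstract (split at punctuation, fingerprint each block with Simhash, compare fingerprints by Hamming distance) might look roughly like the Python sketch below. It is not the authors' implementation. The function names, the character-bigram features, MD5 as the per-feature hash, the 64-bit fingerprint width, the inclusion of full-width punctuation among the split points, and the distance threshold of 3 are all assumptions made for illustration; the abstract does not specify any of them.

```python
import hashlib
import re

# Assumption: split on ".?!" as listed in the abstract, plus their full-width forms.
BLOCK_PATTERN = re.compile(r"[^.?!。?!]+[.?!。?!]*")


def split_blocks(text):
    """Divide text into blocks, each ending at a split-point character."""
    return [b.strip() for b in BLOCK_PATTERN.findall(text) if b.strip()]


def features(block, n=2):
    """Character n-grams of a block (an assumed stand-in for word segmentation)."""
    chars = [c for c in block if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def simhash(block, bits=64):
    """Compute a Simhash fingerprint: per-bit vote over the hashes of all features."""
    weights = [0] * bits
    for feat in features(block):
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(text, max_distance=3):
    """Keep a block only if no already-kept block is within max_distance bits."""
    kept = []  # list of (fingerprint, block) pairs
    for block in split_blocks(text):
        fp = simhash(block)
        if all(hamming_distance(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, block))
    return [block for _, block in kept]
```

Unlike a conventional hash, where any change to a block yields an unrelated digest, blocks that share most of their features tend to produce fingerprints that differ in only a few bits, which is why a distance threshold can catch near-duplicates as well as exact ones. The linear scan over kept fingerprints is the simplest possible lookup and would need an index for large collections.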

Last Update: 2017-12-26