[1] WANG Xiao-lin, XIAO Hui, TAI Wei-peng. Research on Text Similarity Detection System Based on Hadoop[J]. Computer Technology and Development, 2015, 25(08): 90-93.

 Research on Text Similarity Detection System Based on Hadoop

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
25
Issue:
2015, No. 08
Pages:
90-93
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2015-08-10

Article Info

Title:
 Research on Text Similarity Detection System Based on Hadoop
Article ID:
1673-629X(2015)08-0090-04
Author(s):
 WANG Xiao-lin, XIAO Hui, TAI Wei-peng
 School of Computer and Technology, Anhui University of Technology
Keywords:
 text similarity; semantics; Map/Reduce framework; TFIDF algorithm; TFIDFWGE algorithm
CLC number:
TP391
Document code:
A
Abstract:
 In existing methods of calculating text similarity, the TFIDF algorithm, which is commonly used to obtain keyword weights, does not fully account for how a keyword's position in the text and its dispersion across the text library affect its weight, and it runs inefficiently when the text library holds a very large amount of information. To address these problems, this paper proposes a semantics-based TFIDF algorithm built on information entropy and information gain (TFIDFWGE). The algorithm adds a position weight to each given keyword and computes its entropy and information gain to obtain the keyword's final weight. The Map/Reduce framework of the Hadoop platform is then used to implement the text similarity computation with TFIDFWGE and the vector space model (VSM). Experimental results on two real datasets show that, compared with the existing TFIDF algorithm, TFIDFWGE achieves higher recall and precision, and that the text similarity detection system implemented on the Hadoop platform processes large text libraries more efficiently.
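
 For orientation, the baseline that the abstract compares against (plain TFIDF keyword weights combined with VSM cosine similarity) can be sketched as follows. This is a minimal single-process illustration, not the paper's TFIDFWGE algorithm or its Hadoop implementation: the abstract does not give the position-weight, entropy, or information-gain formulas, so they are omitted here. The function names `tf_idf_vectors` and `cosine_similarity` and the use of pre-tokenized input are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (tokenization is assumed to be done already).
    Weight = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no defined direction
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    corpus = [
        ["hadoop", "text", "similarity"],
        ["hadoop", "text", "similarity"],
        ["neural", "network"],
    ]
    vecs = tf_idf_vectors(corpus)
    print(cosine_similarity(vecs[0], vecs[1]))
    print(cosine_similarity(vecs[0], vecs[2]))
```

 In a Map/Reduce setting, the term counting and document-frequency steps are the parts that would typically be distributed across mappers and reducers; the sketch keeps them in memory for readability.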

Last Update: 2015-09-11