[1] ZHAO Tao, ZHANG Tai-hong, CHEN Yan-hong. Research on Duplicate Removal and Similarity Evaluation of Chinese Agricultural Web Pages[J]. Computer Technology and Development, 2015, 25(01): 191-194.


Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
25
Issue:
2015, No. 01
Pages:
191-194
Section:
Application Development Research
Publication date:
2015-01-10

Article Info

Title:
 Research on Duplicate Removal and Similarity Evaluation of Chinese Agricultural Web Pages
Article ID:
1673-629X(2015)01-0191-04
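Author(s):
 ZHAO Tao[1], ZHANG Tai-hong[1][2], CHEN Yan-hong[1]
 1. College of Computer and Information Engineering, Xinjiang Agricultural University; 2. College of Information and Electrical Engineering, China Agricultural University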
Keywords:
 Chinese agricultural Web page; MD5; vector space model; HowNet; latent semantic analysis
CLC number:
TP393
Document code:
A
Abstract:
 With the rapid development of information technology, the number of Web pages on the Internet is growing sharply, and a certain proportion of these massive, heterogeneous pages are duplicates or near-duplicates. To reduce the impact of duplicate and near-duplicate pages in the agricultural domain on the performance of agricultural vertical search engines, the paper first uses the MD5 algorithm to remove identical pages from the page collection, and then computes the similarity of the remaining pages with three methods: the vector space model (VSM), a semantic similarity model based on HowNet, and latent semantic analysis (LSA). The experimental results show that with a similarity threshold of r = 60% and a dimension of K = 250, LSA achieves the highest F1 measure, and its accuracy rate reaches 90.5%.
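The first two stages described in the abstract (exact-duplicate removal by MD5, then a similarity check against the threshold r) can be illustrated with a minimal sketch. This is not the paper's implementation. The function names `md5_dedup`, `cosine_similarity` and `near_duplicates` are hypothetical, term frequencies stand in for whatever term weighting the paper uses, and whitespace tokenization is an assumption made only to keep the example self-contained (Chinese pages would need word segmentation first). The HowNet and LSA models are not shown.

```python
import hashlib
import math
from collections import Counter


def md5_dedup(pages):
    """Keep the first page seen for each distinct MD5 digest of the page text."""
    seen = set()
    unique = []
    for text in pages:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def cosine_similarity(a, b):
    """Cosine between two term-frequency vectors (a simple VSM).

    Assumption: tokens are separated by whitespace.
    """
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(pages, r=0.60):
    """Drop exact duplicates, then list index pairs whose similarity is >= r."""
    unique = md5_dedup(pages)
    pairs = []
    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if cosine_similarity(unique[i], unique[j]) >= r:
                pairs.append((i, j))
    return unique, pairs


if __name__ == "__main__":
    sample = [
        "wheat rust control methods",
        "wheat rust control methods",      # exact duplicate, removed by MD5
        "wheat rust control techniques",   # near-duplicate of the first page
        "cotton irrigation schedule",
    ]
    unique_pages, similar_pairs = near_duplicates(sample, r=0.60)
    print(len(unique_pages), similar_pairs)
```

The default r=0.60 mirrors the threshold reported in the abstract. The pairwise loop is quadratic in the number of pages, so it is only suitable for small collections.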


Last Update: 2015-04-17