[1]张玲,祁玉娟,姜华. 改进的Shark-search算法在网络采集中的应用[J].计算机技术与发展,2017,27(08):192-194.
 ZHANG Ling,QI Yu-juan,JIANG Hua. Application of Improved Shark-search Algorithm in Web Crawler[J].,2017,27(08):192-194.
点击复制

 改进的Shark-search算法在网络采集中的应用()
分享到:

《计算机技术与发展》[ISSN:1006-6977/CN:61-1281/TN]

卷:
27
期数:
2017年08期
页码:
192-194
栏目:
应用开发研究
出版日期:
2017-08-10

文章信息/Info

Title:
 Application of Improved Shark-search Algorithm in Web Crawler
文章编号:
1673-629X(2017)08-0192-03
作者:
 张玲祁玉娟姜华
 湖南省第一师范学院 信息科学与工程学院
Author(s):
 ZHANG LingQI Yu-juanJIANG Hua
关键词:
 Shark-search算法网页分块Web信息搜集链接价值主题漂移
Keywords:
 Shark-search algorithmweb page blockingWeb crawlerlinkages’ valuetopic-drfit
分类号:
G354
文献标志码:
A
摘要:
Shark-search是一种依据链接价值的高低进行优先采集的算法,用于主题信息采集系统时由于只考虑了网页文本和链接锚文本与主题的相关性而忽略了网页的组织结构特性,在抓取有较多噪音链接的网页时效果欠佳.基于网页组织结构特性的分析研究,提出了一种基于网页主题分块的Shark-search算法.该算法在经典Shark-search算法的基础上依据网页组织结构根据网页布局标签对页面内容进行分块,从网页,块和链接三个层面与主题的相关性得到链接的综合价值,因而具有自学习功能,能统计学习与主题相关性较大的块特征,并在发生主题漂移的时候具有自调整功能,给予主题相关性较大的父页面上的链接更多被抓取的机会.采集实验结果表明,所提出的算法在经典Shark-search的基础上能较好地改进主题信息采集的查准率,能够更灵活地针对实际的Web资源状况进行自调整.
Abstract:
 The Shark-search algorithm ranks Web linkages based on their topic value,which only estimates the linkage’s value by pages’ text content and linkages’ anchor text,not taking into account the link structure of the Web and has not good enough performance in crawling web pages including many linkages irrelevant to topic.An improved Shark-search algorithm based on topical segments has been proposed,which segments the Web page into blocks on the basis of the page’s structure.The linkage’s integrated value is comprised of the parent page’s value,the block’s value and the linkage’s value.Moreover,it regards the visited out links as feedback to modify the block’s relevance resulting with self-learning to statistical the characteristic of blocks.It has the ability of self-adjusting in the case of topic-drift to give more chance to the linkages in the web pages more relevant to the topic.The results of experiment in Web crawler show the algorithm proposed can well improve the precision of topical information acquisition on the basis of the classical Shark-search and more flexibly adjusts according to actual Web resources status.

相似文献/References:

[1]张志宏,吴庆波,邵立松,等.基于飞腾平台TOE协议栈的设计与实现[J].计算机技术与发展,2014,24(07):1.
 ZHANG Zhi-hong,WU Qing-bo,SHAO Li-song,et al. Design and Implementation of TCP/IP Offload Engine Protocol Stack Based on FT Platform[J].,2014,24(08):1.
[2]梁文快,李毅. 改进的基因表达算法对航班优化排序问题研究[J].计算机技术与发展,2014,24(07):5.
 LIANG Wen-kuai,LI Yi. Research on Optimization of Flight Scheduling Problem Based on Improved Gene Expression Algorithm[J].,2014,24(08):5.
[3]黄静,王枫,谢志新,等. EAST文档管理系统的设计与实现[J].计算机技术与发展,2014,24(07):13.
 HUANG Jing,WANG Feng,XIE Zhi-xin,et al. Design and Implementation of EAST Document Management System[J].,2014,24(08):13.
[4]侯善江[],张代远[][][]. 基于样条权函数神经网络P2P流量识别方法[J].计算机技术与发展,2014,24(07):21.
 HOU Shan-jiang[],ZHANG Dai-yuan[][][]. P2P Traffic Identification Based on Spline Weight Function Neural Network[J].,2014,24(08):21.
[5]李璨,耿国华,李康,等. 一种基于三维模型的文物碎片线图生成方法[J].计算机技术与发展,2014,24(07):25.
 LI Can,GENG Guo-hua,LI Kang,et al. A Method of Obtaining Cultural Debris’ s Line Chart Based on Three-dimensional Model[J].,2014,24(08):25.
[6]翁鹤,皮德常. 混沌RBF神经网络异常检测算法[J].计算机技术与发展,2014,24(07):29.
 WENG He,PI De-chang. Chaotic RBF Neural Network Anomaly Detection Algorithm[J].,2014,24(08):29.
[7]刘茜[],荆晓远[],李文倩[],等. 基于流形学习的正交稀疏保留投影[J].计算机技术与发展,2014,24(07):34.
 LIU Qian[],JING Xiao-yuan[,LI Wen-qian[],et al. Orthogonal Sparsity Preserving Projections Based on Manifold Learning[J].,2014,24(08):34.
[8]尚福华,李想,巩淼. 基于模糊框架-产生式知识表示及推理研究[J].计算机技术与发展,2014,24(07):38.
 SHANG Fu-hua,LI Xiang,GONG Miao. Research on Knowledge Representation and Inference Based on Fuzzy Framework-production[J].,2014,24(08):38.
[9]叶偲,李良福,肖樟树. 一种去除运动目标重影的图像镶嵌方法研究[J].计算机技术与发展,2014,24(07):43.
 YE Si,LI Liang-fu,XIAO Zhang-shu. Research of an Image Mosaic Method for Removing Ghost of Moving Targets[J].,2014,24(08):43.
[10]余松平[][],蔡志平[],吴建进[],等. GSM-R信令监测选择录音系统设计与实现[J].计算机技术与发展,2014,24(07):47.
 YU Song-ping[][],CAI Zhi-ping[] WU Jian-jin[],GU Feng-zhi[]. Design and Implementation of an Optional Voice Recording System Based on GSM-R Signaling Monitoring[J].,2014,24(08):47.

更新日期/Last Update: 2017-09-22