[1] WU Jia-gao, YU Hao, ZHANG Xue-ying. Study of Topic-driven Web Crawler for Geographic Information Updating Based on Link Backtracking[J]. Computer Technology and Development, 2014, 24(07): 52-55.

 Study of Topic-driven Web Crawler for Geographic Information Updating Based on Link Backtracking (基于链接回溯的地理信息更新主题爬虫研究)

Computer Technology and Development (《计算机技术与发展》) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
24
Issue:
2014, No. 07
Pages:
52-55
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2014-07-10

Article Info

Title:
 Study of Topic-driven Web Crawler for Geographic Information Updating Based on Link Backtracking
Article ID:
1673-629X(2014)07-052-04
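Author(s):
 WU Jia-gao (吴家皋)[1][2], YU Hao (余浩)[1][2], ZHANG Xue-ying (张雪英)[3]
 1. School of Computer Science, Nanjing University of Posts and Telecommunications; 2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks; 3. Key Laboratory of Virtual Geographic Environment, Ministry of Education, Nanjing Normal University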
Keywords:
 topic-driven web crawler; geographic information updating; support vector machine; backtracking algorithm
CLC number:
TP31
Document code:
A
Abstract:
 The rise of the Internet offers a new way to retrieve geographic information updates, with the advantages of strong timeliness and low cost. To address the shortcomings of existing topic-driven crawler algorithms, this paper proposes a topic-driven web crawler for geographic information updating based on link backtracking. First, using support vector machine (SVM) classification, the method quickly and effectively identifies the link directions within a website that are most likely to contain topic-relevant content. It then backtracks to these links and resumes crawling, determining the topic content with a knowledge base of geographic information change elements, which optimizes the crawl path and reduces inefficient crawling. Experimental results show that the method finds the link directions most likely to contain geographic information, substantially improves topic crawling efficiency, and can be extended to other topic areas to some extent.
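
 The two-phase idea in the abstract (score link directions first, then backtrack to the promising ones and crawl them in depth) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy `SITE` graph, the `CHANGE_TERMS` set, and every function name are hypothetical, and a plain keyword test stands in for both the SVM classifier and the knowledge base of geographic information change elements.

```python
from collections import deque

# Toy in-memory site: URL -> (page text, outgoing links). Entirely hypothetical.
SITE = {
    "/": ("home", ["/news", "/sports"]),
    "/news": ("city news index", ["/news/a", "/news/b"]),
    "/news/a": ("new road opened in the east district", ["/news/a/detail"]),
    "/news/a/detail": ("bridge renamed after road opened", []),
    "/news/b": ("county boundary adjusted this month", []),
    "/sports": ("sports index", ["/sports/a"]),
    "/sports/a": ("local team wins the final", []),
}

# Assumed stand-in for the geographic-change knowledge base.
CHANGE_TERMS = {"road opened", "renamed", "boundary adjusted"}


def is_relevant(text):
    """Placeholder for the SVM classifier: a plain keyword test."""
    return any(term in text for term in CHANGE_TERMS)


def crawl(start, max_pages):
    """Breadth-first crawl from `start`, visiting at most `max_pages` pages."""
    seen, queue, pages = {start}, deque([start]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        text, links = SITE[url]
        pages.append((url, text))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def rank_directions(seed, probe_budget=2):
    """Phase 1: probe each link of the seed page with a small page budget and
    score it by the fraction of probed pages judged relevant."""
    scores = {}
    for link in SITE[seed][1]:
        probed = crawl(link, probe_budget)
        hits = sum(is_relevant(text) for _, text in probed)
        scores[link] = hits / len(probed)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def backtrack_and_crawl(seed, top_k=1, full_budget=50):
    """Phase 2: return to the best-scoring directions and crawl them with a
    larger budget, keeping only pages that match the change terms."""
    found = []
    for link, score in rank_directions(seed)[:top_k]:
        if score == 0:
            continue  # skip directions whose probe found nothing relevant
        for url, text in crawl(link, full_budget):
            if is_relevant(text):
                found.append(url)
    return found


if __name__ == "__main__":
    print(rank_directions("/"))
    print(backtrack_and_crawl("/"))
```

 In this sketch the small `probe_budget` keeps the first phase cheap, so most of the crawl budget is spent only on directions that already showed relevant pages. A trained classifier could replace `is_relevant` without changing the rest of the structure.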

Last Update: 2015-03-12