[1] ZHANG Jin, NI Xiao-jun. Research on Topic Crawling Strategy Based on Semantic Tree and VSM[J]. Computer Technology and Development, 2017, 27(11): 66-70.

 Research on Topic Crawling Strategy Based on Semantic Tree and VSM

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
27
Issue:
No. 11, 2017
Pages:
66-70
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2017-11-10

Article Info

Title:
 Research on Topic Crawling Strategy Based on Semantic Tree and VSM
Article ID:
1673-629X(2017)11-0066-05
Author(s):
 ZHANG Jin, NI Xiao-jun
 School of Computer Science, Nanjing University of Posts and Telecommunications
Keywords:
 topic crawler; semantic tree; vector space model (VSM); content relevance; link ranking
CLC number:
TP301
Document code:
A
Abstract:
 Topic crawlers mainly serve users' customized search needs: from ever-growing web data, they quickly, effectively, and accurately select and crawl content on the topics a user cares about. Improving the accuracy of crawling specific information requires judging the topical relevance of each page's content, so relevance calculation is the core problem for a topic crawler. Most existing improved algorithms, however, rely on techniques such as artificial intelligence and machine learning, which increase algorithmic complexity while yielding only limited gains. This paper therefore proposes a topic crawling strategy based on a semantic tree and the vector space model (VSM): semantic similarity calculation is incorporated into both content relevance calculation and link ranking, and the topical discrimination of relevance is optimized by refining details of the algorithms in the strategy. Experimental results show that a topic crawler using this strategy keeps its crawl path on links to highly relevant pages, effectively classifies page links as relevant or irrelevant, and significantly improves crawling accuracy.
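
The abstract does not give the formulas, so the following is only a minimal sketch of how semantic-tree similarity could be folded into a VSM-style relevance score and then used to rank links. Everything specific here is an assumption made for illustration: the toy tree in `PARENT`, the `1 / (1 + distance)` similarity measure, the way each page term is projected onto its closest topic term, and the function names `semantic_similarity`, `content_relevance`, and `rank_links`. It is not the authors' algorithm.

```python
import heapq
import math
from collections import Counter

# Hypothetical semantic tree, stored as child concept -> parent concept.
PARENT = {
    "crawler": "search",
    "spider": "crawler",
    "indexing": "search",
    "football": "sport",
}

def path_to_root(term):
    """Return the nodes from term up to the root of its tree."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def semantic_similarity(a, b):
    """Assumed measure: 1 / (1 + tree distance), or 0 with no shared ancestor."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in steps_from_a:
            return 1.0 / (1.0 + steps_from_a[node] + j)
    return 0.0

def content_relevance(topic_weights, page_terms):
    """Cosine between the topic vector and a semantically projected page vector.

    Each page term is treated as a unit vector at similarity s to its closest
    topic term: s goes to that topic axis, the rest to an off-topic residual.
    """
    if not topic_weights:
        return 0.0
    on_topic = dict.fromkeys(topic_weights, 0.0)
    residual_sq = 0.0
    for term, count in Counter(page_terms).items():
        best_topic, best_sim = max(
            ((t, semantic_similarity(t, term)) for t in topic_weights),
            key=lambda pair: pair[1],
        )
        on_topic[best_topic] += count * best_sim
        residual_sq += (count ** 2) * (1.0 - best_sim ** 2)
    dot = sum(topic_weights[t] * on_topic[t] for t in topic_weights)
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    norm_page = math.sqrt(sum(v * v for v in on_topic.values()) + residual_sq)
    if norm_topic == 0.0 or norm_page == 0.0:
        return 0.0
    return dot / (norm_topic * norm_page)

def rank_links(topic_weights, links):
    """Return (url, score) pairs, highest anchor-text relevance first."""
    heap = []
    for url, anchor_terms in links:
        score = content_relevance(topic_weights, anchor_terms)
        heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    ranked = []
    while heap:
        neg_score, url = heapq.heappop(heap)
        ranked.append((url, -neg_score))
    return ranked
```

For example, with `topic = {"crawler": 1.0}`, calling `rank_links(topic, [("a", ["football"]), ("b", ["crawler"]), ("c", ["spider"])])` places `b` first, then `c`, then `a`. A plain term-matching VSM would score `c` as zero. Here it is ranked above the unrelated link because "spider" is a child of "crawler" in the assumed tree.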


Last Update: 2017-12-26