
Research on Web Crawlers and Anti-Crawler Measures Based on the Scrapy Framework

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 29
Issue:: 2019, No. 02

Pages:: 139-142

Section:: Security and Prevention

Publication date:: 2019-02-10

Article Info

Title:: Research on Crawler and Anti-reptile Based on Scrapy Framework

Article ID:: 1673-629X(2019)02-0139-04

Author(s):: HAN Bei (韩贝)¹; MA Ming-dong (马明栋)²; WANG De-yu (王得玉)²; 1. School of Telecommunication & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China; 2. School of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Keywords:: website; web crawler; anti-crawler; Python; Scrapy framework

Classification number:: TP309

DOI:: 10.3969/j.issn.1673-629X.2019.02.029

Abstract:: With the rapid development of the Internet, obtaining information has become an indispensable part of daily life. Among the many sources of information, most people choose to visit websites through a browser. Gathering information manually in this way, however, is slow and yields only small amounts of data, which gave rise to web crawlers. A web crawler, also known as a web spider or web robot, collects large amounts of specific information from the World Wide Web in a short time according to rules defined by its user. Crawlers also cause problems: large amounts of information are obtained through abnormal access, which is a loss for site owners, and heavy crawler traffic places a considerable burden on website maintenance. Blocking such crawlers effectively while preserving normal access for ordinary users is therefore important. This paper studies crawlers built with Scrapy, an open-source crawler framework written in Python, analyzes the anti-crawler techniques commonly used by websites today, and uses Scrapy and specific websites to illustrate how crawlers respond to these measures.
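
The abstract does not list the specific anti-crawler measures the paper examines. As a hedged illustration only, the sketch below assumes two widely used ones, User-Agent checks and request-rate limits, and shows how a Scrapy project might respond to them. The `USER_AGENTS` values, the `RandomUserAgentMiddleware` class, and the `myproject` module path are hypothetical names introduced for this example and are not taken from the paper; the setting names (`DOWNLOAD_DELAY`, `RANDOMIZE_DOWNLOAD_DELAY`, `AUTOTHROTTLE_ENABLED`, `DOWNLOADER_MIDDLEWARES`) are standard Scrapy settings.

```python
import random

# Hypothetical pool of browser User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request.

    Assumption: the target site filters on the User-Agent header.
    """

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request normally.
        return None


# Fragment intended for the project's settings.py.
# Assumption: the site limits how fast one client may send requests.
DOWNLOAD_DELAY = 2               # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # adapt the delay to observed server latency
DOWNLOADER_MIDDLEWARES = {
    # "myproject" is a hypothetical package name.
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}
```

In a real project the middleware class would live in `middlewares.py` and the settings in `settings.py`; they are shown together here only to keep the sketch self-contained. Slowing the crawl also reduces the maintenance burden on the target site that the abstract describes.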

