Design and Implementation of Data Acquisition System Based on Scrapy Technology

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 28
Issue:: 2018, No. 10

Pages:: 177-181

Column:: Intelligence, Algorithms, Systems Engineering

Publication date:: 2018-10-10

Article Info

Title:: Design and Implementation of Data Acquisition System Based on Scrapy Technology

Article ID:: 1673-629X(2018)10-0177-05

Author(s):: YANG Jun; CHEN Chun-ling; YU Han; School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Keywords:: Scrapy; Django; data acquisition; web crawler

CLC number:: TP302

DOI:: 10.3969/j.issn.1673-629X.2018.10.037

Document code:: A

Abstract:: To cope with the enormous volume and frequent updating of information on the Internet, we design and implement a data acquisition system based on the Scrapy crawler framework. The system not only obtains data according to each user's needs but also lets users perform simple management of their own collection tasks. We introduce the key technologies used in development and discuss the system's framework design, functional modules, and database design. The system is developed with the Django MTV pattern; the underlying data collection framework is Scrapy, an asynchronous website crawling framework written in Python. Web pages are parsed with a combination of XPath and Python regular expressions. The jQuery tree plug-in zTree provides tree-structured task management, and Bootstrap provides the page styling and the combined task-name-plus-keyword data query. The system consists of six modules: web page parsing, data processing, system login, task creation, task management, and data query. Finally, we analyze the data interaction between the browser and the server, and the implementation of web page data location and parsing.
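
The combination of XPath and regular expressions described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: to stay self-contained it uses the limited XPath support in Python's standard `xml.etree.ElementTree` instead of Scrapy's selectors, and the page fragment, the `item` class, the field names, and the `parse_items` function are all hypothetical. XPath locates the nodes of interest, and regular expressions then pull structured values out of the free text inside them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption, not from the paper).
SAMPLE_PAGE = """
<html><body>
  <div class="item"><a href="/news/1">Scrapy release</a><span>Posted 2018-10-10, 120 views</span></div>
  <div class="item"><a href="/news/2">Django tips</a><span>Posted 2018-09-30, 45 views</span></div>
</body></html>
"""

# Regular expressions for values embedded in free text.
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})")
VIEWS_RE = re.compile(r"(\d+)\s+views")


def parse_items(page):
    """Locate item nodes with XPath, then refine their text with regexes."""
    root = ET.fromstring(page.strip())
    items = []
    # Step 1: XPath positions the repeating blocks on the page.
    for node in root.findall(".//div[@class='item']"):
        link = node.find("a")
        meta = node.findtext("span", default="")
        # Step 2: regexes extract fields that XPath alone cannot isolate.
        date = DATE_RE.search(meta)
        views = VIEWS_RE.search(meta)
        items.append({
            "title": link.text if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "date": date.group(1) if date else None,
            "views": int(views.group(1)) if views else None,
        })
    return items


if __name__ == "__main__":
    for item in parse_items(SAMPLE_PAGE):
        print(item)
```

In a Scrapy spider the same two steps would typically live in the spider's parse callback, with the response's selector doing the XPath step; real-world HTML is rarely well-formed XML, which is one reason a dedicated HTML selector library is preferable to `ElementTree` outside a sketch like this.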
