[1] WANG Shu-mei, SHANG Yan-liang. Design and Implementation of Scientific Papers Crawling and Multidimensional Analysis System [J]. Computer Technology and Development, 2020, 30(05): 165-169. [doi:10.3969/j.issn.1673-629X.2020.05.031]
Design and Implementation of Scientific Papers Crawling and Multidimensional Analysis System
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
Volume: 30
Issue: 2020, No. 05
Pages: 165-169
Column: Application Development Research
Publication Date: 2020-05-10
Article Info
Title: Design and Implementation of Scientific Papers Crawling and Multidimensional Analysis System
Article Number: 1673-629X(2020)05-0165-05
Author(s): WANG Shu-mei; SHANG Yan-liang
Affiliation: School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 222111, China
Keywords: paper crawling; multidimensional analysis; data mining; information collection; crawler automation
CLC Number: TP302
DOI: 10.3969/j.issn.1673-629X.2020.05.031
Abstract:
With the advent of the information age, CNKI has become the largest paper database in China, and efficiently obtaining paper information and mining the value of papers has become an urgent problem. Current paper retrieval tools are mostly general-purpose crawlers: they collect only a small amount of information, much of which does not match the user's requirements. It is therefore important to implement a system that integrates focused paper information collection with real-time paper data analysis. Targeting the efficient acquisition of paper information and the mining of paper value, the system uses the Python Django framework together with the Celery framework to couple the website with the crawler, automating the crawling process. The system is divided into a paper crawling module and a multidimensional analysis module. The crawling module uses Selenium to simulate user clicks, parses web content with BeautifulSoup4 and Requests, and stores the extracted paper information in a MySQL database. The multidimensional analysis module uses Highcharts for data presentation, mainly analyzing keyword-related publication trends as well as prolific authors, institutions, and related information. Through this system, researchers can quickly and conveniently obtain paper information in their research field, providing data support for further in-depth study.
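The crawling pipeline described above (a Django site dispatching work to Celery, Selenium simulating user clicks, BeautifulSoup4 parsing the pages, records landing in MySQL) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the task name crawl_papers, the example.org URL, the CSS selectors, and the save_paper stub are all hypothetical stand-ins.

```python
# Hypothetical sketch of the crawling module: a Celery task drives a
# headless browser, parses each result page with BeautifulSoup4, and
# hands rows to a persistence stub. Selectors and URLs are assumptions.
from celery import shared_task
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


def save_paper(title: str, authors: str, year: str) -> None:
    # Stand-in for persistence; per the abstract, the real system stores
    # records in MySQL (e.g. via a Django model's objects.create(...)).
    print(f"{year}\t{authors}\t{title}")


@shared_task
def crawl_papers(keyword: str, pages: int = 1) -> int:
    """Fetch `pages` result pages for `keyword` and store each paper row."""
    opts = Options()
    opts.add_argument("--headless")  # run without a visible browser window
    driver = webdriver.Chrome(options=opts)
    count = 0
    try:
        # Hypothetical search URL; the real system fills in the site's
        # query form by simulating user clicks, as the abstract describes.
        driver.get(f"https://example.org/search?q={keyword}")
        for _ in range(pages):
            soup = BeautifulSoup(driver.page_source, "html.parser")
            for row in soup.select("tr.result-row"):  # assumed selector
                save_paper(
                    title=row.select_one("a.title").get_text(strip=True),
                    authors=row.select_one("td.authors").get_text(strip=True),
                    year=row.select_one("td.year").get_text(strip=True),
                )
                count += 1
            # Simulated click on the "next page" (下一页) link.
            driver.find_element(By.LINK_TEXT, "下一页").click()
    finally:
        driver.quit()
    return count
```

Running the crawl as a Celery task (rather than inside a Django view) keeps the long-lived browser session off the request path, which is presumably why the abstract pairs Django with Celery for crawler automation.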
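The analysis side is described as charting keyword-related publication trends and prolific authors and institutions with Highcharts. Below is a minimal sketch of that aggregation step, assuming rows already fetched from MySQL; the record layout and the function names publication_trend and top_authors are hypothetical.

```python
# Hypothetical sketch of the multidimensional-analysis module: aggregate
# stored paper records into per-year counts and per-author totals, then
# shape the trend as a Highcharts-style options dict that a Django view
# could serialize to JSON for the front end.
import json
from collections import Counter


def publication_trend(records: list[dict]) -> dict:
    """records: rows such as {"year": "2018", "author": "..."} from MySQL."""
    per_year = Counter(r["year"] for r in records)
    years = sorted(per_year)
    return {
        "chart": {"type": "line"},
        "title": {"text": "Publication trend"},
        "xAxis": {"categories": years},
        "series": [{"name": "papers", "data": [per_year[y] for y in years]}],
    }


def top_authors(records: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Most prolific authors among the crawled papers."""
    return Counter(r["author"] for r in records).most_common(n)


if __name__ == "__main__":
    demo = [
        {"year": "2018", "author": "A"},
        {"year": "2019", "author": "A"},
        {"year": "2019", "author": "B"},
    ]
    print(json.dumps(publication_trend(demo), indent=2))
    print(top_authors(demo))
```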
Last Update:
2020-05-10