[1] WANG Shu-mei, SHANG Yan-liang. Design and Implementation of Scientific Papers Crawling and Multidimensional Analysis System [J]. Computer Technology and Development, 2020, 30(05): 165-169. [doi:10.3969/j.issn.1673-629X.2020.05.031]
Design and Implementation of Scientific Papers Crawling and Multidimensional Analysis System
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
Volume: 30
Issue: 2020, No. 05
Pages: 165-169
Column: Application Development Research
Publication Date: 2020-05-10
Article Info
Title: Design and Implementation of Scientific Papers Crawling and Multidimensional Analysis System
Article Number: 1673-629X(2020)05-0165-05
Author(s): WANG Shu-mei; SHANG Yan-liang
Affiliation: School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 222111, China
Keywords: paper crawling; multidimensional analysis; data mining; information collection; crawler automation
CLC Number: TP302
DOI: 10.3969/j.issn.1673-629X.2020.05.031
Abstract:
With the advent of the information age, CNKI has become the largest paper database in China, and efficiently obtaining paper information and mining the value of papers has become an urgent problem. Current paper retrieval tools are mostly general-purpose crawlers: they collect only a small amount of information, much of which does not match the user's requirements. It is therefore important to implement a system that integrates focused paper information collection with real-time paper data analysis. Targeting the efficient acquisition of paper information and the mining of paper value, the system uses the Python Django framework together with the Celery framework to couple the website with the crawler, automating the crawling process. The system is divided into a paper crawling module and a multidimensional analysis module. The crawling module uses Selenium to simulate user clicks, parses web content with BeautifulSoup4 and Requests, and stores the extracted paper information in a MySQL database. The multidimensional analysis module uses Highcharts for data presentation, mainly analyzing keyword-related publication trends as well as prolific authors, institutions, and related information. Through this system, researchers can quickly and conveniently obtain paper information in their research field, providing data support for further in-depth study.
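The crawling pipeline described above (a Django site dispatching work to Celery, Selenium simulating user clicks, BeautifulSoup4 parsing the pages, records landing in MySQL) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the task name crawl_papers, the example.org URL, the CSS selectors, and the save_paper stub are all hypothetical stand-ins.

```python
# Hypothetical sketch of the crawling module: a Celery task drives a
# headless browser, parses each result page with BeautifulSoup4, and
# hands rows to a persistence stub. Selectors and URLs are assumptions.
from celery import shared_task
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


def save_paper(title: str, authors: str, year: str) -> None:
    # Stand-in for persistence; per the abstract, the real system stores
    # records in MySQL (e.g. via a Django model's objects.create(...)).
    print(f"{year}\t{authors}\t{title}")


@shared_task
def crawl_papers(keyword: str, pages: int = 1) -> int:
    """Fetch `pages` result pages for `keyword` and store each paper row."""
    opts = Options()
    opts.add_argument("--headless")  # run without a visible browser window
    driver = webdriver.Chrome(options=opts)
    count = 0
    try:
        # Hypothetical search URL; the real system fills in the site's
        # query form by simulating user clicks, as the abstract describes.
        driver.get(f"https://example.org/search?q={keyword}")
        for _ in range(pages):
            soup = BeautifulSoup(driver.page_source, "html.parser")
            for row in soup.select("tr.result-row"):  # assumed selector
                save_paper(
                    title=row.select_one("a.title").get_text(strip=True),
                    authors=row.select_one("td.authors").get_text(strip=True),
                    year=row.select_one("td.year").get_text(strip=True),
                )
                count += 1
            # Simulated click on the "next page" (下一页) link.
            driver.find_element(By.LINK_TEXT, "下一页").click()
    finally:
        driver.quit()
    return count
```

Running the crawl as a Celery task (rather than inside a Django view) keeps the long-lived browser session off the request path, which is presumably why the abstract pairs Django with Celery for crawler automation.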
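The analysis side is described as charting keyword-related publication trends and prolific authors and institutions with Highcharts. Below is a minimal sketch of that aggregation step, assuming rows already fetched from MySQL; the record layout and the function names publication_trend and top_authors are hypothetical.

```python
# Hypothetical sketch of the multidimensional-analysis module: aggregate
# stored paper records into per-year counts and per-author totals, then
# shape the trend as a Highcharts-style options dict that a Django view
# could serialize to JSON for the front end.
import json
from collections import Counter


def publication_trend(records: list[dict]) -> dict:
    """records: rows such as {"year": "2018", "author": "..."} from MySQL."""
    per_year = Counter(r["year"] for r in records)
    years = sorted(per_year)
    return {
        "chart": {"type": "line"},
        "title": {"text": "Publication trend"},
        "xAxis": {"categories": years},
        "series": [{"name": "papers", "data": [per_year[y] for y in years]}],
    }


def top_authors(records: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Most prolific authors among the crawled papers."""
    return Counter(r["author"] for r in records).most_common(n)


if __name__ == "__main__":
    demo = [
        {"year": "2018", "author": "A"},
        {"year": "2019", "author": "A"},
        {"year": "2019", "author": "B"},
    ]
    print(json.dumps(publication_trend(demo), indent=2))
    print(top_authors(demo))
```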
Last Update:
2020-05-10