[1] MA Lei, FENG Xi-wei, DOU Yu-zi, et al. Research and Realization of Distributed Crawler Based on Nutch[J]. Computer Technology and Development, 2020, 30(02): 192-196. [doi:10.3969/j.issn.1673-629X.2020.02.037]

Research and Implementation of a Distributed Crawler

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
30
Issue:
2020, No. 02
Pages:
192-196
Column:
Applied Development Research
Publication date:
2020-02-10

Article Info

Title:
Research and Realization of Distributed Crawler Based on Nutch
Article ID:
1673-629X(2020)02-0192-05
Author(s):
MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing
School of Computer and Communication Engineering, Liaoning Shihua University, Fushun 113001, China
Keywords:
distributed cluster; Nutch; Solr; enterprise official website
CLC number:
TP391
DOI:
10.3969/j.issn.1673-629X.2020.02.037
Abstract:
Data on the web contains a great deal of valuable information. To meet a practical project requirement for automatically collecting, parsing, and storing large volumes of web page data in a structured format, a distributed web crawler is proposed. It uses the Nutch crawler framework and the Zookeeper distributed coordination service, stores data in the high-performance key-value database Redis, and uses the Solr engine to index and display the crawled information. A page information extraction algorithm streamlines the extraction process, and an optimized keyword matching algorithm selects indicator-related data from the crawled data according to the specified indicators. The feasibility of the Nutch-based distributed web crawler is verified by building a distributed cluster, implementing the Nutch project, and collecting a large amount of data. Analysis of the page parsing process, together with a comparison of multiple sets of experimental data against other crawlers, shows that the Nutch-based distributed crawler outperforms traditional crawlers in both performance and accuracy.
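
The abstract mentions selecting indicator-related data from crawled pages by keyword matching, but this page does not give the algorithm itself. As a rough illustration only, the sketch below shows one simple way such a filter could look. The function name `match_indicators`, the sample URLs, and the indicator names are all hypothetical, and plain case-insensitive substring matching is assumed here rather than the authors' optimized method.

```python
from typing import Dict, List


def match_indicators(
    pages: Dict[str, str], indicators: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """For each indicator, list the URLs whose text contains any of its keywords.

    Assumption: plain case-insensitive substring matching; the paper's
    optimized keyword matching algorithm is not described on this page.
    """
    hits: Dict[str, List[str]] = {name: [] for name in indicators}
    for url, text in pages.items():
        lowered = text.lower()
        for name, keywords in indicators.items():
            if any(keyword.lower() in lowered for keyword in keywords):
                hits[name].append(url)
    return hits


# Hypothetical crawled pages (URL -> extracted text) and indicator keywords.
sample_pages = {
    "http://example.com/a": "Annual revenue grew steadily this year.",
    "http://example.com/b": "We are hiring software engineers.",
}
sample_indicators = {
    "finance": ["revenue", "profit"],
    "recruitment": ["hiring"],
}

result = match_indicators(sample_pages, sample_indicators)
```

In a pipeline like the one described, the page text would come from the crawler's parsed output and the matched records would then be stored and indexed. Those steps are omitted here because they depend on the concrete Nutch, Redis, and Solr setup.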

Last Update: 2020-02-10