Research and Implementation of a Distributed Crawler - Computer Technology and Development (《计算机技术与发展》)

Article Info

Author(s): MA Lei; FENG Xi-wei; DOU Yu-zi; GAO Tian-zhu; ZHU Rui; WU Yan-bing; School of Computer and Communication Engineering, Liaoning Shihua University, Fushun 113001, China

Abstract: Data on the web holds a large amount of valuable information. To meet a practical project requirement for automatically collecting, parsing, and storing in a structured format large volumes of data from web pages, this paper proposes a distributed web crawler. It uses the Nutch crawler framework and the Zookeeper distributed coordination service, stores data in the high-performance key-value database Redis, and uses the Solr engine to index and display the crawled information. A page-information extraction algorithm streamlines the extraction process, and an optimized keyword-matching algorithm retrieves indicator-related data from the crawled data according to given indicators. Building the distributed cluster, implementing the Nutch project, and collecting a large volume of data verify the feasibility of the Nutch-based distributed web crawler. An experimental analysis of the page-parsing process, together with several sets of experiments comparing the Nutch-based distributed crawler with other crawlers, shows that it outperforms traditional crawlers in both performance and accuracy.
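
The abstract mentions selecting indicator-related data from crawled pages by keyword matching but does not describe the algorithm itself. As a rough illustration only, the following Python sketch shows one simple way such a filter might look. The `Page` record, the `match_indicators` function, the indicator name, and the example URLs are all hypothetical and are not taken from the paper; a plain case-insensitive substring test stands in for whatever optimized matching the authors actually use.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical crawled record: a URL and its extracted plain text."""
    url: str
    text: str


def match_indicators(pages, indicators):
    """Map each indicator name to the URLs of pages that mention any of its keywords.

    `indicators` maps an indicator name to a list of keywords.
    Matching is a case-insensitive substring test (an assumption for this sketch).
    """
    hits = {name: [] for name in indicators}
    for page in pages:
        lowered = page.text.lower()
        for name, keywords in indicators.items():
            if any(kw.lower() in lowered for kw in keywords):
                hits[name].append(page.url)
    return hits


# Illustrative data only; not from the paper's experiments.
pages = [
    Page("http://example.com/a", "Crude oil output rose this quarter."),
    Page("http://example.com/b", "The weather was mild."),
]
indicators = {"oil_output": ["crude oil", "oil output"]}
result = match_indicators(pages, indicators)
```

In a pipeline like the one the abstract outlines, a filter of this kind would presumably run after page parsing and before the selected records are stored or indexed, though the abstract does not state where the step sits.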