[1] LI Yue-jian, ZHU Cheng-rong. Study and Improvement on System Architectures of Larbin Web Crawler[J]. Computer Technology and Development, 2012, (07): 147-150.

Study and Improvement on System Architectures of Larbin Web Crawler

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
2012, Issue 07
Pages:
147-150
Column:
Security and Protection
Publication date:
1900-01-01

Article Info

Title:
Study and Improvement on System Architectures of Larbin Web Crawler
Article ID:
1673-629X(2012)07-0147-04
Author(s):
LI Yue-jian, ZHU Cheng-rong
Department of Computer Science and Technology, Tongji University
Keywords:
Larbin; web crawler; hash algorithm; URL deduplication; Bloom filter
CLC number:
TP309
Document code:
A
Abstract:
Larbin is an open-source web crawler (web spider) with very high crawling efficiency. Its URL deduplication scheme is fast and uses very little memory: in theory, downloading 64 million pages requires only 8 MB. Hash collisions, however, sharply degrade its performance. In practice, once the stored URLs reach 10% occupancy, a new URL already has about a 10% probability of colliding with an existing entry, which lowers memory utilization and leaves many pages uncrawled. Drawing on the Bloom filter, this paper improves the URL hash algorithm so that each URL is mapped by several hash functions instead of one, which reduces the collision probability and greatly improves Larbin's memory utilization for URL storage. Experiments show that with a Bloom filter, the same 8 MB of memory, and 10% URL occupancy, using 7 mappings minimizes the collision probability at 0.82%, whereas without a Bloom filter the collision probability reaches 10%.
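The figures in the abstract can be checked against the standard Bloom filter approximation, in which the false-positive rate for k hash functions is (1 - e^(-k·n/m))^k, with n URLs stored in an m-bit table. The Python sketch below is illustrative only and is not taken from the paper or from Larbin's source. It assumes that "10% occupancy" means n/m = 0.10, and the `BloomFilter` class, its SHA-256 based double hashing, and all names in it are hypothetical choices made for this example.

```python
import hashlib
import math


def false_positive_rate(load: float, k: int) -> float:
    """Approximate Bloom filter false-positive rate.

    load is n/m (URLs stored divided by bits in the table);
    k is the number of hash functions ("mappings").
    """
    return (1.0 - math.exp(-k * load)) ** k


class BloomFilter:
    """Minimal Bloom filter over URL strings (illustrative, not Larbin's code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Assumption: derive k bit positions from one SHA-256 digest
        # using double hashing, h_i = h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def __contains__(self, url: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(url))

    def add(self, url: str) -> bool:
        """Insert url; return True if it was possibly seen before."""
        seen = True
        for p in self._positions(url):
            byte, mask = p >> 3, 1 << (p & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


if __name__ == "__main__":
    load = 0.10  # assumed reading of "10% occupancy": n/m = 0.10
    for k in range(1, 11):
        print(f"k={k:2d}  false-positive rate = {false_positive_rate(load, k):.2%}")

    # Small demonstration table; an 8 MB table would be 8 * 1024 * 1024 * 8 bits.
    seen_urls = BloomFilter(num_bits=1 << 16, num_hashes=7)
    print(seen_urls.add("http://example.com/a"))  # first visit: False
    print(seen_urls.add("http://example.com/a"))  # repeat visit: True
```

Under these assumptions the single-hash case (k = 1) gives a rate of roughly 9.5%, in line with the abstract's 10%, and k = 7 gives the minimum among integer k at about 0.82%. Note that a Bloom filter never reports a stored URL as unseen, so a collision can only cause a new page to be skipped, never a page to be fetched twice.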

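Memo

Funding: National 863 High-Tech Development Program project (2010AA122200); Shanghai Science and Technology Commission International Cooperation Project (10510712500).
Authors: LI Yue-jian (b. 1985), male, master's degree, research interest: computer applications; ZHU Cheng-rong, associate professor, research interests: fault-tolerant computing and information security.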
Last Update: 1900-01-01