A Microblog Data Collection Method Based on Simulated Login - Computer Technology and Development (《计算机技术与发展》)

Article Info

Title:: A Microblog Data Collection Method Based on Simulated Login Technology

Author(s):: SUN Qing-yun[1]; WANG Jun-feng[2]; ZHAO Zong-qu[1]; GAO Meng-chao[1]

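Abstract (translated from the Chinese original):: With the arrival of the Web 2.0 era, public opinion information is generated and spreads faster on microblogs. To analyze microblog public opinion effectively, acquiring microblog data is particularly important. Taking Sina Weibo as the object of study, this paper proposes a Web crawler collection scheme based on simulated login. The scheme gets around the limit on the number of microblog API calls imposed on developers and solves the problem that a traditional Web crawler is blocked by identity authentication. It speeds up microblog data collection, so a massive amount of microblog data can be obtained in a short time. Experiments show that a system developed with this scheme collects microblog information quickly, is more flexible, and can supply a public opinion analysis system with a large amount of accurate data.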

Abstract:: With the arrival of the Web 2.0 era, public sentiment information is generated rapidly and disseminated widely on microblogs. Collecting this information has become increasingly important for analyzing public sentiment. A Web crawler based on simulated login technology for the Sina microblog is presented. The crawler avoids the limit on the number of microblog API calls imposed on developers, and it also provides a solution to the authentication problem faced by traditional Web crawlers. Because collection is accelerated, it can gather a huge amount of data in a short time. Experimental results show that the system improves microblog information collection speed, is more flexible, and can provide accurate data for a public sentiment analysis system.