[1] CHEN Ke, LAN Ding-dong, KE Wen-de, et al. Research and Realization of Weibo Crawler with Java[J]. Computer Technology and Development, 2017, 27(09): 191-196.

 Research and Realization of Weibo Crawler with Java
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
27
Issue:
No. 09, 2017
Pages:
191-196
Column:
Application Development Research
Publication date:
2017-09-10

Article Info

Title:
 Research and Realization of Weibo Crawler with Java
Article ID:
1673-629X(2017)09-0191-06
Author(s):
 CHEN Ke, LAN Ding-dong, KE Wen-de, LI Shu-jun, DENG Wen-tian
 School of Computer and Electronic Information, Guangdong University of Petrochemical Technology
Keywords:
 Sina Weibo; Web crawler; Java; data mining
Classification number:
TP39
Document code:
A
Abstract:
 Traditional Weibo crawlers either call the Weibo API or crawl the web version (the .com site), and both approaches run into problems during data collection. To obtain more microblog data efficiently, a Java-based web crawler system for Sina Weibo's Weibo.cn site is designed and developed. The system collects page source by combining breadth-first traversal with URL assembly. The page source obtained this way is more concise and cleaner, which reduces network transmission load and shortens HTML parsing time. The system mainly implements simulated Weibo login, Weibo page crawling, Weibo page data extraction, and task scheduling control. The crawled data are analyzed, and a topic-based Weibo filtering function is added to the crawler. To verify the effectiveness and feasibility of the system, it is compared with other traditional methods. The experimental results show that the proposed system crawls more efficiently and is simpler to implement.
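
The abstract names the technique (breadth-first traversal combined with URL assembly, plus topic filtering) but does not give code. The following is a minimal sketch of how such a traversal could look, not the authors' implementation. The class name, the URL pattern in `BASE`, the `neighbors` callback that returns user IDs discovered on a page, and the keyword-based `matchesTopic` filter are all assumptions made for illustration. Login, HTTP fetching, and HTML parsing are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class BfsCrawlSketch {
    // Assumed URL pattern for illustration only; the real Weibo.cn paths may differ.
    static final String BASE = "https://weibo.cn/";

    /** Assembles a page URL from a user ID and page number instead of following links in the HTML. */
    public static String buildUrl(String userId, int page) {
        return BASE + userId + "?page=" + page;
    }

    /**
     * Breadth-first traversal over user IDs, starting from a seed user.
     * The neighbors callback is hypothetical: it stands in for the step that
     * extracts further user IDs (for example followees) from a fetched page.
     * Returns the assembled URLs in visit order for at most maxUsers users.
     */
    public static List<String> crawl(String seed, int maxUsers, int pagesPerUser,
                                     Function<String, List<String>> neighbors) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> urls = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        int visited = 0;
        while (!frontier.isEmpty() && visited < maxUsers) {
            String user = frontier.remove();
            visited++;
            // Assemble one URL per page of this user's timeline.
            for (int page = 1; page <= pagesPerUser; page++) {
                urls.add(buildUrl(user, page));
            }
            // Enqueue users not seen before, so each user is visited once.
            for (String next : neighbors.apply(user)) {
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return urls;
    }

    /** Assumed topic filter: keeps a post if it contains any of the given keywords. */
    public static boolean matchesTopic(String text, List<String> keywords) {
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the follow graph; a real crawler would parse fetched pages.
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"));
        for (String url : crawl("a", 3, 2, u -> graph.getOrDefault(u, List.of()))) {
            System.out.println(url);
        }
    }
}
```

Passing the neighbor lookup in as a function keeps the traversal independent of any particular HTTP client or HTML parser, so the queue logic can be exercised without network access.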

Last Update: 2017-10-26