[1] CHENG Wei-qing, YU Jing, YANG Jing, et al. Web Information Extraction Research Based on Page Classification[J]. Computer Technology and Development, 2013, (01): 54-58.

Web Information Extraction Research Based on Page Classification

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
2013, No. 01
Pages:
54-58
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
1900-01-01

Article Info

Title:
Web Information Extraction Research Based on Page Classification
Article ID:
1673-629X(2013)01-0054-05
Author(s):
CHENG Wei-qing; YU Jing; YANG Jing; YANG Long
School of Computer Science, Nanjing University of Posts and Telecommunications
Keywords:
Web information extraction; regular expressions; page classification; HTMLParser; node tree
Document code:
A
Abstract:
An analysis of existing Web information extraction methods and of the characteristics of current Web pages shows that existing extraction techniques have two shortcomings: they handle only fixed page types, and their extraction results are inaccurate. To address both, this paper proposes a Web information extraction method based on page classification that can extract the mainstream types of information on the Internet. Classifying pages overcomes the fixed-page-type limitation, and extracting the main body of each page improves the accuracy of the results. A complete Web information extraction model is designed, and implementation methods are given for each functional module. The model comprises page main body extraction, page classification, and information extraction modules, and it uses regular expressions to generate extraction rules automatically, which improves the generality and accuracy of the method. Experiments confirm the effectiveness and correctness of the method.

Last Update: 1900-01-01