[1] SHAO Zhen-kai. Web Page Information Extraction Technology[J]. Computer Technology and Development, 2013, (09): 36-38.

Web Page Information Extraction Technology

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
Issue:
2013, No. 09
Pages:
36-38
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
1900-01-01

Article Information

Title:
Web Page Information Extraction Technology
Article ID:
1673-629X(2013)09-0036-03
Author(s):
SHAO Zhen-kai
School of Computer Science and Engineering, Anhui University of Science and Technology
Keywords:
DOM; tag extraction; information extraction; Web page purification
Document code:
A
Abstract:
With the rapid growth of the Internet, the volume of information on Web pages has become enormous, and quickly and effectively retrieving valuable information from it has become an important topic in Web research. To address this, a tag extraction method is proposed. JTidy is used to clean a Web page into a well-formed HTML document and parse it into a DOM tree. The tag extraction method then extracts the leaf-node tags of the DOM tree that contain text content, deletes the tags used to control page interactivity and display, and applies a punctuation-based information extraction method to remove copyright notices and similar information. Extraction experiments on pages from different websites show that the tag extraction method is both highly general and able to accurately extract the main topical content of a Web page.
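
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: it swaps JTidy and an explicit DOM tree for Python's standard-library `html.parser`, and the names `NON_CONTENT_TAGS`, `TextBlockCollector`, `extract_main_text`, and the `min_punct` threshold are all hypothetical. The punctuation rule in particular is an assumption (keep a text block only if it contains at least `min_punct` punctuation marks), since the abstract does not state the exact criterion.

```python
from html.parser import HTMLParser

# Assumption: tags treated as controlling interaction or display, not content.
NON_CONTENT_TAGS = {"script", "style", "noscript", "form",
                    "select", "button", "iframe", "nav"}

# Assumption: punctuation marks counted by the filter (ASCII and Chinese).
PUNCTUATION = set(",.;:!?，。；：！？、")


class TextBlockCollector(HTMLParser):
    """Collects text runs that are not nested inside a non-content tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a non-content tag
        self.blocks = []      # text blocks in document order

    def handle_starttag(self, tag, attrs):
        if tag in NON_CONTENT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_CONTENT_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.blocks.append(text)


def extract_main_text(html, min_punct=2):
    """Return text blocks that contain at least `min_punct` punctuation marks."""
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    return [block for block in collector.blocks
            if sum(ch in PUNCTUATION for ch in block) >= min_punct]


if __name__ == "__main__":
    sample = ("<body><script>var a = 1, b = 2;</script>"
              "<div>Home | News | About</div>"
              "<p>The Web is large, and growing. Finding useful text is hard.</p>"
              "<div>Copyright 2013 Example Inc</div></body>")
    print(extract_main_text(sample))
```

The filter relies on the observation that body text tends to be punctuated sentences, while navigation bars and short copyright lines usually are not. A copyright line that is itself heavily punctuated would pass this simple threshold, so the value of `min_punct` would need tuning per site.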


Last Update: 1900-01-01