[1] 殷彬, 杨会志. 灵活结构网页的正文提取[J]. 计算机技术与发展, 2011, (09): 111-113.
　YIN Bin, YANG Hui-zhi. Content Extraction Based on Unknown Structure Web[J]. Computer Technology and Development, 2011, (09): 111-113.

Content Extraction Based on Unknown Structure Web (灵活结构网页的正文提取)

Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:: 2011, No. 09

Pages:: 111-113

Column:: Intelligence, Algorithms, and Systems Engineering

Publication date:: 1900-01-01

Article Info

Title:: Content Extraction Based on Unknown Structure Web

Article ID:: 1673-629X(2011)09-0111-03

Author(s):: YIN Bin (殷彬); YANG Hui-zhi (杨会志); Zhongshan Institute, University of Electronic Science and Technology of China (电子科技大学中山学院)

Keywords:: Web data mining; Web information extraction; content node; hyperlink node; node weight; link density

CLC number:: TP391

Document code:: A

Abstract:: Web pages often contain noise such as hyperlinks to other pages and copyright notices, which degrades the accuracy of Web data mining. Pages therefore need to be cleaned and their main text extracted before mining. In practice, however, the HTML of many pages is not well-formed. To address this, the paper proposes a content extraction algorithm suited to pages of flexible (unknown) structure. The algorithm splits a page into nodes at HTML tags, finds one node that contains main-text content, and then searches forward and backward from that node for the remaining content nodes; the article is obtained by concatenating the contents of all nodes found. Experiments show that the algorithm is widely applicable, handles loosely structured pages, and achieves relatively high accuracy.
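
The abstract only outlines the procedure, so the following is a minimal Python sketch of the general idea rather than the authors' implementation. Everything concrete in it is an assumption made for illustration: the node weight (plain-text characters minus characters inside `<a>`), the thresholds `max_link_density` and `min_chars`, the set of inline tags that do not split a node, and the names `NodeSplitter`, `split_nodes`, and `extract_content`.

```python
from html.parser import HTMLParser

# Assumption: these inline tags do not start a new node; every other tag does.
INLINE_TAGS = {"a", "b", "i", "u", "em", "strong", "span", "font"}
SKIPPED_TAGS = {"script", "style"}


class NodeSplitter(HTMLParser):
    """Split a page into text nodes and count the characters of each
    node that sit inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.nodes = []          # list of (text, link_chars)
        self._chunks = []
        self._link_chars = 0
        self._in_link = 0
        self._skipping = 0

    def flush(self):
        if self._chunks:
            # Joining chunks with spaces is a simplification for the sketch.
            self.nodes.append((" ".join(self._chunks), self._link_chars))
        self._chunks, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skipping += 1
        elif tag == "a":
            self._in_link += 1
        elif tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skipping = max(0, self._skipping - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        chunk = " ".join(data.split())   # normalize whitespace
        if self._skipping or not chunk:
            return
        self._chunks.append(chunk)
        if self._in_link:
            self._link_chars += len(chunk)


def split_nodes(html):
    parser = NodeSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.nodes


def extract_content(html, max_link_density=0.3, min_chars=10):
    """Pick the heaviest node as the seed, then grow the selection
    backward and forward while neighbours still look like content."""
    nodes = split_nodes(html)
    if not nodes:
        return ""

    # Hypothetical node weight: text length with link text discounted.
    weights = [len(text) - link_chars for text, link_chars in nodes]
    seed = max(range(len(nodes)), key=weights.__getitem__)

    def is_content(i):
        text, link_chars = nodes[i]
        return (len(text) >= min_chars
                and link_chars / len(text) <= max_link_density)

    start = seed
    while start > 0 and is_content(start - 1):
        start -= 1
    end = seed
    while end < len(nodes) - 1 and is_content(end + 1):
        end += 1
    return "\n".join(text for text, _ in nodes[start:end + 1])


if __name__ == "__main__":
    page = ('<div><a href="/">Home</a> <a href="/news">News</a></div>'
            "<p>A long paragraph of article text that carries the content.</p>")
    print(extract_content(page))
```

Because the sketch uses an event-based parser and never requires tags to be balanced, it does not depend on the page being well-formed, which is the property the abstract emphasizes. Stopping the expansion at the first neighbour that fails the link-density test is a deliberately simple choice; a tolerance for short gaps could be added at the cost of another parameter.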


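Memo:: Supported by the Zhongshan Science and Technology Plan Project (20092A210). YIN Bin (b. 1978), male, lecturer, M.S.; research interests: Web data mining and Web business intelligence. YANG Hui-zhi, professor, Ph.D.; research interests: data warehousing and data mining.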

