[1] LI Xin, LI Shao-wen, XU Gao-jian, et al. Research on a Data Structuralization Method of Bamboo Species Based on Regular Extraction Model [J]. Computer Technology and Development, 2018, 28(06): 147-150. [doi:10.3969/j.issn.1673-629X.2018.06.033]

Research on a Data Structuralization Method of Bamboo Species Based on Regular Extraction Model

Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
28
Issue:
2018, No. 06
Pages:
147-150
Section:
Application Development Research
Publication date:
2018-06-10

Article Information

Title:
Research on a Data Structuralization Method of Bamboo Species Based on Regular Extraction Model
Article number:
1673-629X(2018)06-0147-04
Author(s):
LI Xin (李欣), LI Shao-wen (李绍稳), XU Gao-jian (许高建), LIN Jian-bin (林建彬)
School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
Keywords:
information extraction; regular expression; bamboo species data; data structuring
CLC number:
TP391
DOI:
10.3969/j.issn.1673-629X.2018.06.033
Document code:
A
Abstract:
This study uses rule-based information extraction to automate the extraction and structured storage of morphological data on bamboo germplasm resources (hereafter, bamboo species), and proposes a data structuring method based on a regular extraction model for rapidly building a bamboo species database. The method takes the table structure of the bamboo species database as the extraction template and the attribute names of the data table as rule trigger words, and builds the extraction rules from regular expressions to form the regular extraction model. Using the online edition of the Flora of China (《中国植物志》) as the experimental source, bamboo species data were automatically extracted and structured in two steps: web page parsing and field extraction. The experiment extracted more than five hundred bamboo species records. In a sampled statistical analysis of the first eight fields of the data table, the accuracy of the extracted valid field information was above 89%. The results indicate that data structuring based on regular extraction is feasible and effective. A bamboo species information extraction system implementing the method was developed in Java.
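The field extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' system (which was written in Java) and does not reproduce their rules: the column names in `FIELDS`, the helper names `build_rules` and `extract_record`, and the sample description are all hypothetical, and the sketch assumes each field in the source text is introduced by its trigger word followed by a colon.

```python
import re

# Hypothetical column names standing in for the attribute names of the
# bamboo species table; each one doubles as a rule trigger word.
FIELDS = ["culm", "leaf", "flowering"]


def build_rules(fields):
    """Build one regular expression per column (assumed 'trigger: value' layout)."""
    alternation = "|".join(re.escape(f) for f in fields)
    rules = {}
    for field in fields:
        # Capture lazily from the trigger word up to the next trigger word
        # (followed by a colon) or the end of the text.
        pattern = (
            rf"\b{re.escape(field)}\s*:\s*(.*?)\s*"
            rf"(?=\b(?:{alternation})\s*:|$)"
        )
        rules[field] = re.compile(pattern, re.IGNORECASE | re.DOTALL)
    return rules


def extract_record(text, rules):
    """Apply every rule to the text and return one row keyed by column name."""
    record = {}
    for field, rule in rules.items():
        match = rule.search(text)
        # Fields whose trigger word is absent are stored as None.
        record[field] = match.group(1).rstrip(" .;") if match else None
    return record


if __name__ == "__main__":
    # Hypothetical description text, not taken from the source website.
    sample = (
        "Culm: 6-8 m tall, 3.5 cm in diameter. "
        "Leaf: blade lanceolate, 5-12 cm long. "
        "Flowering: May to June."
    )
    print(extract_record(sample, build_rules(FIELDS)))
```

Because the rules are generated from the column list, adding a column to the template adds a rule without further code changes, which is one plausible reading of using the table structure as the extraction template.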


Last Update: 2018-08-22