

A Method for Constructing an Overlapping-Relation Corpus for Domain-Specific Text


Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 32
Issue:: 2022, No. 10

Pages:: 126-131

Column:: Artificial Intelligence

Publication date:: 2022-10-10

Article Info

Title:: Constructing of Corpus of Overlapping Relationships for Domain-specific Text

Article ID:: 1673-629X(2022)10-0126-06

Author(s):: LIU Kai (刘凯); LIAO Xiang-lin (廖湘琳); ZHANG Hong-jun (张宏军); School of Command and Control Engineering, Army Engineering University of PLA (陆军工程大学指挥控制工程学院), Nanjing 210000, Jiangsu, China

Keywords:: entity relations; information extraction; corpus construction; schema; trigger word

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2022.10.021

Abstract:: Entity-relation corpora are a basic data resource for information extraction, and their scale and quality directly affect how well deep learning extraction models perform. Existing domain-specific corpora have paid little attention to overlapping relations, and existing construction methods carry a high manual annotation cost. This paper combines existing corpus annotation algorithms based on entity recognition and trigger-word rules and, guided by a custom relation schema, automatically annotates overlapping relations in web text. First, named entities are recognized with a domain-specific professional dictionary to build a named entity set. Next, feature words are clustered according to the custom relation schema and dependency parsing to build a trigger-word dictionary. Finally, the corpus is back-labeled using the named entity set and the trigger-word dictionary. The algorithm reduces the amount of manual annotation, annotates quickly, yields a comparatively large corpus, and captures overlapping-relation information, offering a feasible way to expand corpora for domain-specific information extraction. The paper also examines the usability of the data sources, evaluates annotation quality, and reports a statistical analysis of the corpus. In the experiments, the overall back-labeling success rate is 76.7% and the overall relation annotation accuracy is 85.8%; in an experiment with a basic overlapping-relation extraction model, the F1 score reaches 93.68%.
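
The back-labeling step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the entity dictionary, the schema, the relation names, and the functions `find_entities` and `annotate` are all hypothetical, and the dependency parsing and feature-word clustering used in the paper to build the trigger-word dictionary are replaced here by a hand-written trigger set. The sketch only shows how a dictionary-derived entity set plus schema-bound trigger words can yield several triples that share one entity, which is what makes the relations overlapping.

```python
from itertools import permutations

# Hypothetical domain dictionary: surface form -> entity type (illustrative only).
ENTITY_DICT = {
    "Acme Corp": "Organization",
    "Berlin": "Location",
    "Jane Doe": "Person",
}

# Hypothetical schema: relation -> (subject type, object type, trigger words).
# In the paper the trigger words come from clustering; here they are hand-written.
SCHEMA = {
    "headquartered_in": ("Organization", "Location", {"headquartered", "based"}),
    "founded_by": ("Organization", "Person", {"founded"}),
}


def find_entities(sentence):
    """Dictionary lookup: return (surface, type) for each entry found in the sentence."""
    return [(name, etype) for name, etype in ENTITY_DICT.items() if name in sentence]


def annotate(sentence):
    """Emit every schema relation whose entity types and a trigger word both match."""
    entities = find_entities(sentence)
    # Crude tokenization: lowercase, treat commas and periods as spaces.
    words = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    triples = []
    for (subj, s_type), (obj, o_type) in permutations(entities, 2):
        for relation, (want_s, want_o, triggers) in SCHEMA.items():
            if s_type == want_s and o_type == want_o and triggers & words:
                triples.append((subj, relation, obj))
    return triples


sentence = "Acme Corp, headquartered in Berlin, was founded by Jane Doe."
triples = annotate(sentence)
for triple in triples:
    print(triple)
```

Both triples produced for the sample sentence have "Acme Corp" as subject, so they overlap on a single entity. Because the trigger check in this sketch is sentence-level rather than tied to the dependency path between the two entities, it can over-generate on longer sentences; that is one reason a dependency-based trigger dictionary, as the abstract describes, would be preferable in practice.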

