Layered Regional Exhaustive Model for Chinese Nested Named Entity Recognition - Computer Technology and Development

Article Info

Title:: Layered Regional Exhaustive Model for Chinese Nested Named Entity Recognition

Article ID:: 1673-629X(2022)09-0161-06

Author(s):: YU Shi-yuan (余诗媛)¹,²; GUO Shu-ming (郭淑明)²; HUANG Rui-yang (黄瑞阳)²; ZHANG Jian-peng (张建朋)²; HU Nan (胡楠)¹,²
1. School of Software, Zhengzhou University, Zhengzhou 450001, China
2. National Digital Switching System Engineering and Technological R&D Center, Zhengzhou 450002, China

Keywords:: nested named entity recognition; layered regional exhaustive model; convolutional neural network; bi-directional long short-term memory network; information extraction

Classification number:: TP18

DOI:: 10.3969/j.issn.1673-629X.2022.09.025

Abstract:: Nested named entities carry rich semantic relationships and structural information, so algorithms that identify them accurately are of significant research value. Existing Chinese nested named entity datasets contain mislabeled and missing annotations, and most existing recognition methods ignore the associations among the internal information of nested entities, which lowers accuracy. To address these problems, a new Chinese nested named entity dataset, NEPD, is constructed by combining automatic generation with manual annotation, and on this basis a layered regional exhaustive model (LREM) for Chinese nested named entity recognition is designed. First, the model traverses the text to compose candidate entities and obtains the word embedding information of the lower encoding layer. Next, to exchange information between adjacent encoding layers, the word embedding information of the lower encoding layer is incorporated into the higher encoding layer. Finally, multiple decoding layers ensure that a named entity of length L is predicted only at layer L, which effectively prevents error propagation and improves recognition accuracy. Experimental results show that, without external knowledge resources, the LREM model reaches F1 values of 87.19% and 86.27% on nested and non-nested named entities, respectively; the F1 value for non-nested named entity recognition is 1.18% higher than that of the traditional BiLSTM+CRF model, which verifies the reliability of the model.
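
The layering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `layered_regions` and `compose_layers` are hypothetical, and plain averaging stands in for whatever learned operator the model uses to pass lower-layer information upward. The sketch only shows that layer L holds exactly the regions of length L, and that each layer's representations can be built from the adjacent lower layer. The classifier that labels each region is omitted.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def layered_regions(tokens: List[str], max_len: int) -> Dict[int, List[Span]]:
    """Enumerate every contiguous region; layer L holds all regions of length L."""
    n = len(tokens)
    return {
        length: [(start, start + length) for start in range(n - length + 1)]
        for length in range(1, min(max_len, n) + 1)
    }


def compose_layers(
    embeddings: List[List[float]], max_len: int
) -> Dict[int, List[List[float]]]:
    """Build layer L from layer L-1 by combining adjacent lower-layer vectors.

    Assumption: element-wise averaging is a placeholder for the learned
    composition; it is not the operator used in the paper.
    """
    layers: Dict[int, List[List[float]]] = {1: embeddings}
    for length in range(2, min(max_len, len(embeddings)) + 1):
        lower = layers[length - 1]
        layers[length] = [
            [(a + b) / 2 for a, b in zip(lower[i], lower[i + 1])]
            for i in range(len(lower) - 1)
        ]
    return layers


if __name__ == "__main__":
    text = "郑州大学软件学院"  # "School of Software, Zhengzhou University"
    tokens = list(text)
    regions = layered_regions(tokens, max_len=len(tokens))
    # Toy one-dimensional embeddings, one per character.
    layers = compose_layers([[float(i)] for i in range(len(tokens))], len(tokens))
    for length, spans in regions.items():
        # Each layer has one representation per candidate region.
        assert len(spans) == len(layers[length])
        print(length, [text[s:e] for s, e in spans])
```

In this example the nested candidates 郑州 (length 2), 郑州大学 (length 4), and 郑州大学软件学院 (length 8) fall into different layers, so each can be scored independently at its own layer rather than relying on a decision made at another layer.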

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
