[1] TAO Hui-dan, DUAN Liang, WANG Jia-hui, et al. BERT Based Pre-training Model of Folk Literature Texts [J]. Computer Technology and Development, 2022, 32(11): 164-170. [doi:10.3969/j.issn.1673-629X.2022.11.024]

BERT Based Pre-training Model of Folk Literature Texts

Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
32
Issue:
2022, No. 11
Pages:
164-170
Section:
Artificial Intelligence
Publication Date:
2022-11-10

Article Info

Title:
BERT Based Pre-training Model of Folk Literature Texts
Article ID:
1673-629X(2022)11-0164-07
Authors:
TAO Hui-dan 1,2, DUAN Liang 1,2, WANG Jia-hui 1,2, YUE Kun 1,2
1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China;
2. Key Lab of Intelligent Systems and Computing of Yunnan Province, Yunnan University, Kunming 650500, China
Keywords:
pre-training language model; folk literature texts; BERT; natural language processing; downstream tasks
CLC Number:
TP391
DOI:
10.3969/j.issn.1673-629X.2022.11.024
Abstract:
Folk literature texts contain a large number of vivid figures of speech; their personal and place names are extremely complex, making word boundaries hard to determine; and their expression differs considerably from modern Chinese. As a result, pre-training language models struggle to learn the implicit knowledge in these texts, which poses great challenges for machine natural language understanding. This paper proposes MythBERT, a BERT-based pre-training model for folk literature texts. It is pre-trained on a folk literature corpus and replaces BERT's character-level masking strategy with a Chinese word-level masking strategy. Annotation words that gloss characters and words in the folk literature texts are preferentially masked, which reduces the randomness of BERT's masking and helps the model learn word-level semantic information. The annotations are also used to enhance the language model's representations, addressing problems such as polysemy and divergence between ancient and modern word meanings. MythBERT is compared with mainstream Chinese pre-training models, including BERT, BERT-WWM and RoBERTa, on four natural language processing tasks: sentiment analysis, semantic similarity, named entity recognition and question answering. Experimental results show that the annotation-enhanced folk literature pre-training model MythBERT significantly improves performance on folk literature text tasks and achieves the best results among the compared baselines.
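To make the masking strategy concrete, the following is a minimal sketch of Chinese whole-word masking with annotation-word priority, assuming jieba as the word segmenter; the 15% masking rate, the selection order, the helper name whole_word_mask and the example sentence are illustrative assumptions, not the authors' released implementation.

```python
import random

import jieba  # assumed segmenter; the paper does not name its word-segmentation tool


def whole_word_mask(sentence, mask_rate=0.15, priority_words=None):
    """Mask whole Chinese words rather than isolated characters.

    priority_words holds annotation terms (glossed archaic words) that are
    masked first, mirroring the annotation-focused strategy described in the
    abstract; the rate and selection order here are assumptions.
    """
    words = list(jieba.cut(sentence))
    budget = max(1, int(sum(len(w) for w in words) * mask_rate))

    # Annotation words first, then the remaining words in random order.
    priority = [i for i, w in enumerate(words)
                if priority_words and w in priority_words]
    others = [i for i in range(len(words)) if i not in priority]
    random.shuffle(others)

    to_mask, masked_chars = set(), 0
    for i in priority + others:
        if masked_chars >= budget:
            break
        to_mask.add(i)
        masked_chars += len(words[i])

    tokens, labels = [], []
    for i, w in enumerate(words):
        for ch in w:
            # Every character of a chosen word is masked together, so the
            # model must recover the full word, not a lone character.
            tokens.append("[MASK]" if i in to_mask else ch)
            labels.append(ch if i in to_mask else None)
    return tokens, labels


# Hypothetical usage: register a folk-literature name so it segments as one
# word, then mask it as a unit because it appears in the annotation set.
jieba.add_word("阿诗玛")
tokens, labels = whole_word_mask("阿诗玛是撒尼人的女儿", priority_words={"阿诗玛"})
print(tokens)
```

Masking all characters of a word together forces the model to predict the word from its surrounding context, which is the intuition behind replacing BERT's character-level masking for texts where word boundaries carry most of the semantic load.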

Similar References:

[1] CUI Cong-min, SHI Yun-mei, YUAN Bo, et al. Research on Relation Extraction Method for Government Documents [J]. Computer Technology and Development, 2021, 31(12): 26. [doi:10.3969/j.issn.1673-629X.2021.12.005]

Last Update: 2022-11-10