Research on Error Correction of News Text Based on Soft-Masked BERT - Computer Technology and Development (《计算机技术与发展》)

Article Info

Title:: Research on Error Correction of News Text Based on Soft-Masked BERT

Article ID:: 1673-629X(2022)05-0202-06

Author(s):: SHI Jian-ting (史健婷)¹; WU Lin-hao (吴林皓)¹; ZHANG Ying-tao (张英涛)²; CHANG Liang (常亮)¹
1. Heilongjiang University of Science and Technology (黑龙江科技大学), Harbin, Heilongjiang 150022, China
2. Harbin Institute of Technology (哈尔滨工业大学), Harbin, Heilongjiang 150000, China

Keywords:: news release; computer-aided technology; deep learning; Chinese text error correction; Soft-Masked BERT

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2022.05.034

Abstract:: In the Internet era, the news and publicity field produces a massive volume of text manuscripts every day. Proofreading these first drafts is a huge amount of work, and relying on manual correction alone is extremely costly and inefficient, so finding a way to correct news drafts automatically is of great practical significance. Computer-aided review of news manuscripts greatly improves proofreading efficiency and reduces labor costs. Going further, a deep learning model trained on a corpus from a specific news domain can be customized to that domain and achieve better error-correction results there. This paper uses a new Chinese text error correction model, Soft-Masked BERT, which separates the error detection process for Chinese text from the error correction process: the input of the correction network comes from the output of the detection network. The paper aims to improve on Soft-Masked BERT and apply it. It uses 10 000 text sequences from news articles on the "Harbin Institute of Technology News Network" (the HIT News Site dataset) as the initial training corpus, and then proofreads the Chinese text of related articles from that news site. The results show that the Soft-Masked model outperforms BERT-Finetune overall on the HIT News Site dataset: accuracy improves by 0.6 percentage points, precision by 1.3 percentage points, recall by 1.5 percentage points, and the F1 score by 1.4 percentage points.
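
The hand-off from the detection network to the correction network can be illustrated with a small sketch. It assumes the soft-masking rule from the original Soft-Masked BERT proposal, in which each character embedding e_i is blended with the mask embedding according to the detected error probability p_i, giving e'_i = p_i * e_mask + (1 - p_i) * e_i. The function name `soft_mask`, the toy 2-dimensional embeddings, and the probabilities below are hypothetical and are not taken from this paper, whose improved variant may differ in detail.

```python
from typing import List

Vector = List[float]


def soft_mask(embeddings: List[Vector],
              error_probs: List[float],
              mask_embedding: Vector) -> List[Vector]:
    """Blend each character embedding with the mask embedding.

    Implements e'_i = p_i * e_mask + (1 - p_i) * e_i, where p_i is the
    error probability the detection network assigns to character i.
    """
    if len(embeddings) != len(error_probs):
        raise ValueError("one error probability is required per character")
    blended = []
    for e_i, p_i in zip(embeddings, error_probs):
        if not 0.0 <= p_i <= 1.0:
            raise ValueError("error probabilities must lie in [0, 1]")
        blended.append([p_i * m + (1.0 - p_i) * x
                        for x, m in zip(e_i, mask_embedding)])
    return blended


# Hypothetical toy embeddings for a three-character sequence.
char_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mask_embedding = [4.0, 4.0]
# Hypothetical detection outputs: the second character is judged almost
# certainly wrong, the third is uncertain, the first is judged correct.
error_probs = [0.0, 1.0, 0.5]

soft_masked = soft_mask(char_embeddings, error_probs, mask_embedding)
print(soft_masked)  # [[1.0, 0.0], [4.0, 4.0], [3.0, 3.0]]
```

A character judged correct keeps its own embedding, a character judged wrong is effectively replaced by the mask embedding, and uncertain characters fall in between. The blended vectors are what the correction network would receive as input in this sketch.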

Computer Technology and Development (《计算机技术与发展》) [ISSN: 1006-6977 / CN: 61-1281/TN]
