WANG Lian-xi, HU Guan-feng. Domain-specific Pretrained Language Model in Public Health Domain[J]. Computer Technology and Development, 2024, 34(12): 141-147. [doi:10.20165/j.cnki.ISSN1673-629X.2024.0266]

Domain-specific Pretrained Language Model in Public Health Domain

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
34
Issue:
No. 12, 2024
Pages:
141-147
Section:
Artificial Intelligence
Publication Date:
2024-12-10

Article Info

Title:
Domain-specific Pretrained Language Model in Public Health Domain
Article ID:
1673-629X(2024)12-0141-07
Author(s):
WANG Lian-xi, HU Guan-feng
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
Keywords:
public health; domain-specific pretrained language model; natural language processing; BERT; adaptive pre-training
CLC Number:
TP312;TP391
DOI:
10.20165/j.cnki.ISSN1673-629X.2024.0266
Abstract:
The suddenness, variability, and unpredictability of public health emergencies increase the difficulty of processing and monitoring public health information, while constructing a domain-specific pretrained model can improve the performance of downstream tasks. Several enhanced pretrained models for the public health domain already exist, focusing on social media and medical texts; however, their training corpora are small, drawn from limited sources, and composed of short texts, lacking long-text unsupervised corpora rich in semantic information. To address this problem, we adopt the base BERT model to perform adaptive pre-training on a large-scale corpus of public health news, and build PHD-News-BERT, a domain-specific pretrained model suited to deep semantic learning, to better support tasks in this field. Experiments on eight datasets covering five downstream tasks, compared against five baseline models, show that PHD-News-BERT achieves strong performance on most tasks, demonstrating good generalization and robustness. It is expected to provide a new benchmark for future work in the public health domain.
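Adaptive pre-training of this kind typically continues BERT's masked-language-model objective on the in-domain corpus. As a minimal illustration of the standard BERT 80/10/10 masking rule (an assumption here; the abstract does not state the paper's exact masking settings, and `VOCAB` is a toy stand-in vocabulary):

```python
import random

MASK = "[MASK]"
# Toy stand-in vocabulary for the 10% random-replacement case.
VOCAB = ["公共", "卫生", "疫情", "监测", "新闻", "事件"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: select each token with probability
    mask_prob; of the selected tokens, replace 80% with [MASK],
    10% with a random vocabulary token, and keep 10% unchanged.
    Returns (masked_tokens, labels), where labels holds the original
    token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)
            elif r < 0.9:
                masked.append(rng.choice(VOCAB))
            else:
                masked.append(tok)
        else:
            labels.append(None)  # position excluded from the loss
            masked.append(tok)
    return masked, labels
```

In practice the same rule is applied on the fly over mini-batches of tokenized news articles, and the loss is computed only at positions whose label is not None.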
Last Update: 2024-12-10