GSGD: An Automatic Grading System Based on BERT and Ontology Reasoning - Computer Technology and Development (《计算机技术与发展》)

Article Information

Authors (Chinese): 王珊珊¹,²; 邹佳¹,²; 程序¹,²; 刘汪洋¹,²; 蔡惠民¹,²
1. 中电科大数据研究院有限公司, 贵州贵阳 550022; 2. 提升政府治理能力大数据应用技术国家工程实验室, 贵州贵阳 550022

Author(s): WANG Shan-shan¹,²; ZOU Jia¹,²; CHENG Xu¹,²; LIU Wang-yang¹,²; CAI Hui-min¹,²
1. CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China; 2. National Engineering Laboratory for Big Data Application in Improving Government Governance Capabilities, Guiyang 550022, China

Keywords: data grading; government data; BERT; legal ontology; cosine similarity

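Abstract (translated from Chinese): Graded management of government data resources is key to government data sharing, data opening, and data governance. Because the data resources are large in scale, the grading scheme is incomplete, and tools are lacking, the work is mostly done by hand, which results in weak supporting evidence, strong subjectivity, poor accuracy, and limited effectiveness. This paper designs and implements GSGD (grading system for government data), an automatic grading system for government data based on policies, regulations, and typical cases. First, an ontology library is built from policies, regulations, and typical cases, and custom inference rules are defined according to the grading objectives and the characteristics of the ontology. Next, BERT is used to obtain semantic word/sentence feature vectors for the input data and for the keywords, and the cosine similarity between these vectors is computed. Finally, for the keywords with high similarity, Jena is used to query and reason over the policy and regulation library and the typical case library, yielding both a grading result and the basis for it. This automates the grading of government data and improves the efficiency of grading work. Comparative experiments verify the effectiveness of the method.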

Abstract: Grading government data resources is key to government data sharing, opening, and governance. Because the data resources are large in scale, the grading scheme is imperfect, and tools are lacking, this work is mostly carried out manually, which leads to an insufficient supporting basis, strong subjectivity, poor accuracy, and limited effectiveness. We therefore design and implement GSGD, an automatic grading system for government data based on policies, regulations, and typical cases. First, policies, regulations, and typical cases are used to build an ontology, and custom inference rules are defined according to the grading objectives and the characteristics of the ontology. Then, semantic word/sentence feature vectors of the input data and of the keywords are obtained through BERT, and the cosine similarity between the vectors is calculated. Finally, for keywords with high similarity, Jena is used to query and reason over the policy and regulation database and the typical case database to obtain the grading result and its basis, so that government data is graded automatically and grading efficiency improves. Comparative experiments verify the effectiveness of the method.
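
The matching step described in the abstract (compare the input's vector with keyword vectors by cosine similarity, then look up a grade and its basis for the best match) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and does not come from the paper: the 3-dimensional toy vectors stand in for BERT embeddings, the `0.8` threshold is arbitrary, the grade labels are made up, and the `GRADE_TABLE` dictionary is only a placeholder for the ontology query and rule reasoning that the paper performs with Jena.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


def match_keywords(query_vec, keyword_vecs, threshold=0.8):
    """Return (keyword, similarity) pairs at or above threshold, best first."""
    scored = [(kw, cosine_similarity(query_vec, vec))
              for kw, vec in keyword_vecs.items()]
    hits = [(kw, sim) for kw, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for BERT word/sentence embeddings.
KEYWORD_VECS = {
    "identity number": [0.9, 0.1, 0.0],
    "weather report": [0.0, 0.2, 0.9],
}

# Hypothetical placeholder for the ontology lookup: keyword -> (grade, basis).
GRADE_TABLE = {
    "identity number": ("restricted", "hypothetical regulation clause"),
    "weather report": ("open", "hypothetical typical case"),
}


def grade(query_vec, threshold=0.8):
    """Grade an input vector via its most similar keyword, or None if no match."""
    hits = match_keywords(query_vec, KEYWORD_VECS, threshold)
    if not hits:
        return None
    best_keyword, _ = hits[0]
    level, basis = GRADE_TABLE[best_keyword]
    return best_keyword, level, basis
```

Calling `grade([0.8, 0.2, 0.1])` selects the keyword whose vector points in nearly the same direction as the input and returns its grade together with the basis, mirroring the abstract's "grading result and basis" output. An input that is not close to any keyword returns `None` rather than a forced grade.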