[1] LI Bao-zhen, GU Xiu-lian. Co-occurrence Word Embedding Improvement for Unknown and Polysemous Words[J]. Computer Technology and Development, 2022, 32(12): 117-122. [doi:10.3969/j.issn.1673-629X.2022.12.018]

Co-occurrence Word Embedding Improvement for Unknown and Polysemous Words

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
32
Issue:
No. 12, 2022
Pages:
117-122
Section:
Artificial Intelligence
Publication date:
2022-12-10

Article Info

Title:
Co-occurrence Word Embedding Improvement for Unknown and Polysemous Words
Article ID:
1673-629X(2022)12-0117-06
Author(s):
LI Bao-zhen (李保珍), GU Xiu-lian (顾秀莲)
School of Information Engineering, Nanjing Audit University, Nanjing 211815, Jiangsu, China
Keywords:
word embedding; unknown words; polysemous words; co-occurrence matrix; word vector
Classification number (CLC):
TP391
DOI:
10.3969/j.issn.1673-629X.2022.12.018
Abstract:
Word embedding models that build semantic word vectors from a corpus can quantitatively describe the contextual semantics of words. However, traditional word embedding models have limitations in representing polysemous words, such as an uncertain dimensionality of the semantic space and a lack of intuitive interpretability. In addition, there is still no effective way to embed unknown (out-of-vocabulary) words. To address the polysemy and unknown-word problems, the advantages of word embedding and word co-occurrence can be combined to compensate for these shortcomings of traditional models: uncertain semantic space dimensionality, uninterpretable semantic dimensions, and neglect of unknown words. The main contributions of this paper are: (1) constructing global corpus word vectors from the trained word embedding matrix and the word-normalized co-occurrence matrix; (2) creating corpus word vectors for unknown words and fusing them, by weight, with the global corpus word vectors to improve the precision of word embedding. Two experiments on public datasets show that the proposed co-occurrence-based embedding model for polysemous and unknown words effectively improves the accuracy of word embedding and shortens its processing time.
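The abstract does not give the paper's formulas, so the following is only a minimal sketch of one way the two contributions could fit together, not the authors' implementation. It assumes row normalization of the co-occurrence counts, a co-occurrence-weighted average as the "corpus word vector", and a single hypothetical fusion weight `alpha`; all function and parameter names are illustrative.

```python
import numpy as np


def row_normalize(counts):
    """Scale each row of a count matrix to sum to 1 (all-zero rows stay zero)."""
    counts = np.asarray(counts, dtype=float)
    sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, sums, out=np.zeros_like(counts), where=sums > 0)


def global_corpus_vectors(cooc, emb):
    """Assumed construction: each word's corpus vector is the
    co-occurrence-weighted average of the trained embeddings."""
    return row_normalize(cooc) @ emb


def unknown_word_vector(context_counts, emb, global_vecs, alpha=0.5):
    """Assumed fusion for an out-of-vocabulary word.

    context_counts: how often the unknown word co-occurs with each
    in-vocabulary word. alpha is a hypothetical fusion weight.
    """
    weights = row_normalize(np.atleast_2d(context_counts))[0]
    local_vec = weights @ emb           # built from trained embeddings
    global_vec = weights @ global_vecs  # built from global corpus vectors
    return alpha * local_vec + (1.0 - alpha) * global_vec


# Toy data: 3 in-vocabulary words with 2-dimensional trained embeddings.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cooc = np.array([[0, 2, 2], [2, 0, 0], [2, 0, 0]])

G = global_corpus_vectors(cooc, emb)
v = unknown_word_vector([1, 1, 0], emb, G, alpha=0.5)
print(G)
print(v)
```

Because every dimension of a corpus vector is a weighted mix of known words' embeddings, an unknown word gets a vector without retraining the embedding model, which is consistent with the reduced processing time the abstract reports, though the paper's actual procedure may differ.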


Last Update: 2022-12-10