Research on Prediction Method of N6-methylation Sites Based on Ensemble Learning - Computer Technology and Development

Article Info

Title:: Research on Prediction Method of N6-methylation Sites Based on Ensemble Learning

Article ID:: 1673-629X(2021)03-0149-08

Author(s):: ZHAO Yuan-yuan (赵媛媛)¹; CHEN Jin-xiang (陈进祥)¹; LI Fu-yi (李富义)²,³; WU Hao (吴昊)¹; LIU Quan-zhong (刘全中)¹*
1. School of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China;
2. Monash Centre for Data Science, Monash University, Melbourne VIC 3800, Australia;
3. Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia

Keywords:: N6-methyladenine (6mA); stacking ensemble learning; extreme gradient boosting (XGBoost); LightGBM; support vector machine

CLC number:: TP181

DOI:: 10.3969/j.issn.1673-629X.2021.03.026

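Abstract (translated from Chinese):: N6-methyladenine (6mA) refers to the methylation of the nitrogen atom at position 6 of adenine. 6mA plays a very important role in biological processes such as maintaining normal cellular transcriptional activity, DNA damage repair, chromatin remodeling, genetic imprinting, embryonic development, and tumorigenesis. Identifying 6mA sites through biological experiments is time-consuming and expensive. In recent years, the research community has proposed a number of machine learning-based 6mA site prediction methods, but these methods rely too heavily on a single learning model, which leads to insufficient generalization ability and low prediction accuracy. Ensemble learning combines the strengths of multiple prediction models and offers better generalization and predictive performance. Therefore, to further improve the accuracy of 6mA site prediction, a 6mA site prediction model based on stacking ensemble learning, Stack6mAPred, is proposed. The model consists of two layers of classifiers: the first layer integrates three mainstream classifiers, namely naive Bayes, support vector machine (SVM), and LightGBM, and the second layer uses a logistic regression (LR) classifier. Stack6mAPred encodes experimentally identified 6mA sequences and non-6mA sequences using five kinds of features, including enhanced nucleotide composition, and uses the XGBoost (extreme gradient boosting) algorithm for feature selection to remove redundant features. In five-fold cross-validation on the rice benchmark dataset, compared with MM-6mAPred, the best-performing method at present, Stack6mAPred achieves better sensitivity, specificity, accuracy, MCC, and AUC, with improvements of 1.7%, 1.36%, 1.72%, 0.06, and 0.031, respectively.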
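The abstract names enhanced nucleotide composition as one of the five encodings. As a rough illustration, and assuming this is the sliding-window encoding often abbreviated ENAC, the sketch below computes the frequency of A, C, G, and T in each window of a sequence. The function name `enac_encode`, the default window of 5, and the example sequence are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"  # fixed output order for each window


def enac_encode(sequence, window=5):
    """Return per-window nucleotide frequencies, flattened in A, C, G, T order.

    Hypothetical helper: the window size of 5 is an assumed default,
    not a setting reported in the abstract.
    """
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    features = []
    # Slide the window one position at a time from the 5' end to the 3' end.
    for start in range(len(sequence) - window + 1):
        counts = Counter(sequence[start:start + window])
        features.extend(counts[n] / window for n in NUCLEOTIDES)
    return features


# Made-up 8-nt sequence: 4 windows of length 5, so 16 features.
vector = enac_encode("AAAAACGT", window=5)
```

Each window contributes four values that sum to 1 when the window contains only A, C, G, and T, so a sequence of length L yields 4 × (L − window + 1) features under these assumptions.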

Abstract:: N6-methyladenine (6mA) refers to the methylation of the nitrogen atom at position 6 of adenine. It plays an important role in maintaining the normal transcriptional activity of cells, DNA damage repair, chromatin remodeling, genetic imprinting, embryonic development, and tumorigenesis. However, detecting 6mA sites through experimental methods is time-consuming and costly. In recent years, the research community has proposed several machine learning-based approaches for 6mA site detection. These approaches rely heavily on a single learning model, which captures only some aspects of the problem, so their accuracy and generalization ability leave room for improvement. Ensemble learning can achieve strong performance by combining multiple models. To address the drawbacks of a single learning model, a stacking ensemble-based 6mA site prediction method called Stack6mAPred is proposed. Stack6mAPred consists of two layers of classifiers. The first layer integrates three mainstream classifiers, namely naive Bayes, support vector machine (SVM), and LightGBM, and the second layer uses a logistic regression (LR) classifier. Stack6mAPred uses five sequence features to encode the experimentally identified 6mA sequences and non-6mA sequences into feature vectors, and employs the XGBoost (extreme gradient boosting) algorithm to select important features from the high-dimensional feature space. We conduct a five-fold cross-validation test on the benchmark rice dataset and compare against the current best-performing method, MM-6mAPred. The results show that Stack6mAPred achieves better performance on five common evaluation metrics: sensitivity, specificity, accuracy, MCC (Matthews correlation coefficient), and AUC (area under the ROC curve). These five metrics increase by 1.7%, 1.36%, 1.72%, 0.06, and 0.032, respectively.
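The five reported metrics can be illustrated with a small sketch. It computes sensitivity, specificity, accuracy, and MCC from confusion-matrix counts using their standard definitions, and AUC as the fraction of positive/negative score pairs that are ranked correctly. The function names, counts, and scores below are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        # MCC is conventionally reported as 0 when its denominator is 0.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


def pairwise_auc(positive_scores, negative_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is quadratic in the number of
    samples, which is fine for a small illustration.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical counts and scores, not taken from the paper.
metrics = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
auc = pairwise_auc([0.9, 0.8], [0.1, 0.85])
```

Sensitivity, specificity, and accuracy are fractions that the abstract reports as percentages, while MCC ranges from -1 to 1 and AUC from 0 to 1, which is why their improvements are quoted as absolute differences.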

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
