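A Classification and Recognition Model of Malicious Requests Based on Logistic Regression - Computer Technology and Development (《计算机技术与发展》)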

Article Info

Title:: A Classification and Recognition Model of Malicious Requests Based on Logistic Regression

Author(s):: CHEN Chun-ling; WU Fan; YU Han; School of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Keywords:: Web requests; logistic regression; maximum likelihood estimation; TF-IDF; classification model

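Abstract (translated from Chinese):: To counter attacks on the Web application layer and to classify and identify malicious requests effectively, this paper studies supervised learning methods in depth. To address the short content and sparse features of request text, it proposes a malicious-request classification model built on a TF-IDF tokenization strategy with non-repeating multiple N-Grams combined with logistic regression. Features are extracted from a large number of samples collected from sources such as the Secrepo security data repository and used to train the model, with maximum likelihood estimation as the optimization objective. Gradient descent is used to obtain the optimal classification model, and its reliability is verified on a test set. The experimental results show that, for short-text, low-semantic request content, a classification model built at the letter level with multi-N-Gram tokenization achieves higher classification accuracy and scores than models built on word-level or single N-Gram tokenization, with little difference in training time. The final model trained with this method achieves accuracy, recall, and F1 score all above 99% on the test set.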

Abstract:: To defend effectively against attacks on the Web application layer and to classify and recognize malicious requests, supervised learning methods are studied in depth. To address the insufficient content and sparse features of request text, we propose a malicious-request classification model based on logistic regression and TF-IDF tokenization with non-repeating multi-N-Gram segmentation. The model is trained after feature extraction on a large number of samples collected from online security datasets such as Secrepo. Taking maximum likelihood estimation as the optimization objective, we use gradient descent to obtain the optimal classification model, and its reliability is validated on the test set. The experiments show that, for short, low-semantic request content, a classification model built on letter-level multi-N-Gram segmentation achieves higher accuracy and scores than models built on word-level or single N-Gram segmentation, while their training times differ little. The final model trained in this way exceeds 99% in accuracy, recall, and F1-measure on the test set.
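The pipeline summarized in the abstract (letter-level multi-N-Gram TF-IDF features, then logistic regression fitted by gradient descent on the likelihood) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Several details are assumptions: "non-repeating" is read here as keeping each distinct n-gram once in the vocabulary, the n values (1, 2, 3), the smoothed IDF formula, L2 normalisation, the learning rate, and the epoch count are illustrative choices, and the eight requests are made-up toy data rather than Secrepo samples.

```python
import math
from collections import Counter

def char_ngrams(text, n_values=(1, 2, 3)):
    """Return all character n-grams of `text` for every n in `n_values` (assumed sizes)."""
    text = text.lower()
    return [text[i:i + n] for n in n_values for i in range(len(text) - n + 1)]

def fit_tfidf(docs, n_values=(1, 2, 3)):
    """Build a vocabulary (each distinct n-gram once) and smoothed IDF weights."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc, n_values)))
    vocab = {gram: idx for idx, gram in enumerate(sorted(doc_freq))}
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    return vocab, idf

def transform(doc, vocab, idf, n_values=(1, 2, 3)):
    """Map one request to a sparse, L2-normalised TF-IDF vector {index: weight}."""
    counts = Counter(g for g in char_ngrams(doc, n_values) if g in vocab)
    vec = {vocab[g]: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {i: v / norm for i, v in vec.items()}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(vec, weights, bias):
    """P(malicious | request) under the logistic model."""
    return sigmoid(bias + sum(weights[i] * v for i, v in vec.items()))

def train(vectors, labels, n_features, lr=1.0, epochs=500):
    """Batch gradient descent on the mean negative log-likelihood (i.e. MLE)."""
    weights, bias = [0.0] * n_features, 0.0
    m = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y  # d(loss)/d(logit)
            grad_b += err
            for i, v in vec.items():
                grad_w[i] += err * v
        bias -= lr * grad_b / m
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
    return weights, bias

def recall_and_f1(y_true, y_pred):
    """Recall and F1 for the positive (malicious) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1

# Hypothetical toy requests (0 = benign, 1 = malicious); not taken from the paper.
requests = [
    "/index.html",
    "/products?id=12",
    "/images/logo.png",
    "/search?q=shoes",
    "/item?id=1' or '1'='1",
    "/search?q=<script>alert(1)</script>",
    "/download?file=../../etc/passwd",
    "/item?id=1 union select password from users",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

vocab, idf = fit_tfidf(requests)
X = [transform(r, vocab, idf) for r in requests]
weights, bias = train(X, labels, len(vocab))
predictions = [int(predict_proba(x, weights, bias) >= 0.5) for x in X]
recall, f1 = recall_and_f1(labels, predictions)
```

The sketch scores only the requests it was trained on, so a perfect fit on this tiny, linearly separable set says nothing about generalization. The figures reported in the abstract refer to a held-out test set drawn from a much larger sample.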