[1] LI Shi-feng*, LUO Xi, LIU Xiao-ru, et al. An End-to-end Video Anomaly Detection Method Based on Transformer Architecture [J]. Computer Technology and Development, 2025, (06): 49-55. [doi:10.20165/j.cnki.ISSN1673-629X.2025.0018]
An End-to-end Video Anomaly Detection Method Based on Transformer Architecture
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
- Volume: -
- Issue: No. 06, 2025
- Pages: 49-55
- Column: Media Computing
- Publication date: 2025-06-10
Article Info
- Title: An End-to-end Video Anomaly Detection Method Based on Transformer Architecture
- Article ID: 1673-629X(2025)06-0049-07
- Author(s): LI Shi-feng*; LUO Xi; LIU Xiao-ru; TIAN Ye
- Affiliation: School of Information Science and Technology, Bohai University, Jinzhou 121000, China
- Keywords: video anomaly detection; Transformer architecture; spatio-temporal information fusion model; deep support vector data description (Deep SVDD); joint training
- CLC number: TP391
- DOI: 10.20165/j.cnki.ISSN1673-629X.2025.0018
- Abstract: Although traditional convolutional neural networks can process spatially structured data, their spatio-temporal modeling capacity is insufficient for large-scale video data. To solve this problem, an efficient model that can handle massive video data is needed. A new end-to-end video anomaly detection method based on the Transformer architecture is proposed. Combining the Swin Transformer architecture and the Video Vision Transformer (ViViT) model, a spatio-temporal information fusion model is designed to extract rich spatio-temporal information from video frame sequences. In addition, by jointly training the spatio-temporal information fusion model with the deep support vector data description (Deep SVDD) method, end-to-end video anomaly detection is realized. A comparison experiment with 10 recent methods was conducted on two public video datasets: on the UCSD Ped2 dataset, the proposed model achieved the highest AUC of 96.5%; on the CUHK Avenue dataset, it achieved an AUC of 80.7%, better than that of most methods. Compared with leading video anomaly detection methods, the proposed method is competitive.
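The joint training described in the abstract pairs a feature extractor with the Deep SVDD one-class objective: embeddings of normal clips are pulled toward a fixed hypersphere centre, and the distance to that centre serves as the anomaly score at test time. A minimal NumPy sketch of that objective follows; the fusion model itself (the Swin Transformer + ViViT stack) is abstracted to a placeholder linear embedding, and all names here are illustrative, not taken from the paper:

```python
import numpy as np

def embed(frames: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Placeholder for the spatio-temporal fusion model: a single
    linear map standing in for the Swin Transformer + ViViT stack."""
    return frames @ W

def deep_svdd_loss(z: np.ndarray, c: np.ndarray) -> float:
    """One-class Deep SVDD objective: mean squared distance of the
    embeddings z to the fixed hypersphere centre c."""
    return float(np.mean(np.sum((z - c) ** 2, axis=1)))

def anomaly_scores(z: np.ndarray, c: np.ndarray) -> np.ndarray:
    """At test time, the squared distance to the centre is the anomaly score."""
    return np.sum((z - c) ** 2, axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                 # stand-in model parameters
normal_frames = rng.normal(size=(32, 8))    # training batch (normal-only clips)
z_train = embed(normal_frames, W)
c = z_train.mean(axis=0)                    # centre = mean of initial embeddings

loss = deep_svdd_loss(z_train, c)           # quantity minimized during joint training
far_frames = normal_frames + 10.0           # a crudely shifted "anomalous" batch
scores_normal = anomaly_scores(z_train, c)
scores_anom = anomaly_scores(embed(far_frames, W), c)
print(scores_anom.mean() > scores_normal.mean())   # anomalies score higher
```

In the actual method, the gradient of this loss would be backpropagated through the fusion model so that feature extraction and the one-class boundary are learned end-to-end, rather than training the extractor and the detector separately.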
Last Update: 2025-06-10