ZHOU Zhen, LI Ying, LIU De-yun, et al. Video Instance Segmentation Method Based on Motion Tracker and Feature Aggregation [J]. Computer Technology and Development, 2022, 32(11): 43-49. [doi:10.3969/j.issn.1673-629X.2022.11.007]

Video Instance Segmentation Method Based on Motion Tracker and Feature Aggregation

Computer Technology and Development 《计算机技术与发展》 [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
32
Issue:
2022, No. 11
Section:
Media Computing
Pages:
43-49
Publication Date:
2022-11-10

Article Info

Title:
Video Instance Segmentation Method Based on Motion Tracker and Feature Aggregation
Article ID:
1673-629X(2022)11-0043-07
Author(s):
ZHOU Zhen, LI Ying, LIU De-yun, JI Gen-lin
School of Computer and Electronic Information / School of Artificial Intelligence, Nanjing Normal University, Nanjing 210023, China
Keywords:
video instance segmentation; image instance segmentation; motion tracker; feature aggregation; attention mechanism
CLC Number:
TP399
DOI:
10.3969/j.issn.1673-629X.2022.11.007
Abstract:
Video instance segmentation (VIS) provides a deeper understanding of video and is a prerequisite for advanced tasks in fields such as intelligent surveillance, autonomous driving, and robotics. Image instance segmentation has been studied extensively, but research on video instance segmentation remains relatively scarce, and directly applying image segmentation methods to video raises many problems. Chief among them is poor tracking and segmentation caused by abnormal conditions such as instance occlusion, poor instance imaging, and blur from fast motion. To address this problem, a video instance segmentation method based on motion tracking and attentional feature aggregation (MTFA) is proposed. The method uses a motion tracking head to track instances across the whole video from motion and appearance information and to assign instance labels. Guided by these labels, each instance in the current frame retrieves features of the same instance from other frames; an attention mechanism fuses these features to enhance the current frame's features, from which the segmentation masks are generated. On the Youtube-VIS dataset, the method achieves a best AP of 38.3% (ResNet-50) and 41.2% (ResNet-101).
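The attention-based aggregation step described in the abstract can be illustrated in code. Below is a minimal PyTorch sketch, not the authors' implementation: the module name `AttentionFeatureAggregation`, the tensor shapes, and the residual connection are assumptions. It only shows how same-instance features gathered from other frames (using the labels assigned by the motion tracker) could be fused into the current frame's instance feature before mask generation.

```python
import torch
import torch.nn as nn


class AttentionFeatureAggregation(nn.Module):
    """Hypothetical sketch: fuse same-instance features from reference
    frames into the current frame's feature via dot-product attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, cur_feat: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # cur_feat:  (N, D)    features of N instances in the current frame
        # ref_feats: (N, T, D) features of the same N instances gathered
        #            from T other frames via the tracker's instance labels
        q = self.q_proj(cur_feat).unsqueeze(1)     # (N, 1, D)
        k = self.k_proj(ref_feats)                 # (N, T, D)
        v = self.v_proj(ref_feats)                 # (N, T, D)
        # Attention weights over the T reference frames per instance
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (N, 1, T)
        fused = (attn @ v).squeeze(1)              # (N, D)
        # Residual enhancement (an assumption); the enhanced feature
        # would then be passed to the segmentation mask head.
        return cur_feat + fused
```

Toy usage: `AttentionFeatureAggregation(256)(torch.randn(5, 256), torch.randn(5, 4, 256))` returns an enhanced `(5, 256)` feature for 5 instances, each aggregated over 4 reference frames.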
Last Update: 2022-11-10