ZHOU Zhen, LI Ying, LIU De-yun, et al. Video Instance Segmentation Method Based on Motion Tracker and Feature Aggregation [J]. Computer Technology and Development, 2022, 32(11): 43-49. [doi:10.3969/j.issn.1673-629X.2022.11.007]

Video Instance Segmentation Method Based on Motion Tracker and Feature Aggregation

Computer Technology and Development 《计算机技术与发展》 [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
32
Issue:
2022, No. 11
Section:
Media Computing
Pages:
43-49
Publication Date:
2022-11-10

Article Info

Title:
Video Instance Segmentation Method Based on Motion Tracker and Feature Aggregation
Article ID:
1673-629X(2022)11-0043-07
Author(s):
ZHOU Zhen, LI Ying, LIU De-yun, JI Gen-lin
School of Computer and Electronic Information / School of Artificial Intelligence, Nanjing Normal University, Nanjing 210023, China
Keywords:
video instance segmentation; image instance segmentation; motion tracker; feature aggregation; attention mechanism
CLC Number:
TP399
DOI:
10.3969/j.issn.1673-629X.2022.11.007
Abstract:
Video instance segmentation (VIS) provides a deeper understanding of video and is a prerequisite for advanced tasks in fields such as intelligent surveillance, autonomous driving, and robotics. Image instance segmentation has been studied extensively, but research on video instance segmentation remains relatively scarce, and directly applying image segmentation methods to video raises many problems. Chief among them is poor tracking and segmentation caused by abnormal conditions such as instance occlusion, poor instance imaging, and blur from fast motion. To address this problem, a video instance segmentation method based on motion tracking and attentional feature aggregation (MTFA) is proposed. The method uses a motion tracking head to track instances across the whole video from motion and appearance information and to assign instance labels. Guided by these labels, each instance in the current frame retrieves features of the same instance from other frames; an attention mechanism fuses these features to enhance the current frame's features, from which the segmentation masks are generated. On the Youtube-VIS dataset, the method achieves a best AP of 38.3% (ResNet-50) and 41.2% (ResNet-101).
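The attention-based aggregation step described in the abstract can be illustrated in code. Below is a minimal PyTorch sketch, not the authors' implementation: the module name `AttentionFeatureAggregation`, the tensor shapes, and the residual connection are assumptions. It only shows how same-instance features gathered from other frames (using the labels assigned by the motion tracker) could be fused into the current frame's instance feature before mask generation.

```python
import torch
import torch.nn as nn


class AttentionFeatureAggregation(nn.Module):
    """Hypothetical sketch: fuse same-instance features from reference
    frames into the current frame's feature via dot-product attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, cur_feat: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # cur_feat:  (N, D)    features of N instances in the current frame
        # ref_feats: (N, T, D) features of the same N instances gathered
        #            from T other frames via the tracker's instance labels
        q = self.q_proj(cur_feat).unsqueeze(1)     # (N, 1, D)
        k = self.k_proj(ref_feats)                 # (N, T, D)
        v = self.v_proj(ref_feats)                 # (N, T, D)
        # Attention weights over the T reference frames per instance
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (N, 1, T)
        fused = (attn @ v).squeeze(1)              # (N, D)
        # Residual enhancement (an assumption); the enhanced feature
        # would then be passed to the segmentation mask head.
        return cur_feat + fused
```

Toy usage: `AttentionFeatureAggregation(256)(torch.randn(5, 256), torch.randn(5, 4, 256))` returns an enhanced `(5, 256)` feature for 5 instances, each aggregated over 4 reference frames.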
Last Update: 2022-11-10