Article
Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
  • Omkar Thawakar, Mohamed bin Zayed University of Artificial Intelligence
  • Sanath Narayan, Inception Institute of Artificial Intelligence
  • Jiale Cao, Tianjin University, China
  • Hisham Cholakkal, Mohamed bin Zayed University of Artificial Intelligence
  • Rao Anwer, Mohamed bin Zayed University of Artificial Intelligence
  • Muhammad Haris Khan, Mohamed bin Zayed University of Artificial Intelligence
  • Salman Khan, Mohamed bin Zayed University of Artificial Intelligence
  • Michael Felsberg, Linköping University, Sweden
  • Fahad Shahbaz Khan, Mohamed bin Zayed University of Artificial Intelligence & Linköping University, Sweden
Document Type
Conference Proceeding
Abstract

State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during attention computations. We argue that such attention computations ignore the multi-scale spatio-temporal feature relationships that are crucial for tackling target appearance deformations in videos. To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. The proposed MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video. We further introduce an attention block in the decoder to enhance the temporal consistency of the detected instances across frames of a video. Moreover, an auxiliary discriminator is introduced during training to ensure better foreground-background separability within the multi-scale spatio-temporal feature space. We conduct extensive experiments on two benchmarks: YouTube-VIS 2019 and 2021. Our MS-STS VIS achieves state-of-the-art performance on both benchmarks. With a ResNet-50 backbone, MS-STS VIS achieves a mask AP of 50.1% on the YouTube-VIS 2019 val. set, outperforming the best previously reported results by 2.7%, and by 4.8% at the higher overlap threshold of AP75, while being comparable in model size and speed. With a Swin Transformer backbone, MS-STS VIS achieves a mask AP of 61.0% on the YouTube-VIS 2019 val. set. Source code is available at https://github.com/OmkarThawakar/MSSTS-VIS. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
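
To give a concrete flavor of the "split" attention idea described in the abstract (spatial attention within each frame, temporal attention across frames, applied per scale and then fused across scales), the snippet below is a minimal, hypothetical PyTorch sketch. All class names, the coarse-to-fine fusion via mean pooling, and the hyperparameters are assumptions for illustration only; they do not reproduce the authors' implementation, which is available in the linked repository.

```python
# Hypothetical sketch of a multi-scale spatio-temporal split attention step.
# For illustration only; see https://github.com/OmkarThawakar/MSSTS-VIS for
# the authors' actual implementation.
import torch
import torch.nn as nn


class SpatioTemporalSplitAttention(nn.Module):
    """Attends spatially within each frame, then temporally across frames."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, HW, C) features of one scale for T frames.
        # Spatial attention: tokens within each frame attend to each other.
        x = x + self.spatial_attn(x, x, x)[0]
        # Temporal attention: each spatial location attends across frames.
        xt = x.permute(1, 0, 2)                     # (HW, T, C)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        return xt.permute(1, 0, 2)                  # back to (T, HW, C)


class MultiScaleSTSBlock(nn.Module):
    """Applies split attention per scale and fuses scales coarse-to-fine."""

    def __init__(self, dim: int, num_scales: int = 3):
        super().__init__()
        self.per_scale = nn.ModuleList(
            [SpatioTemporalSplitAttention(dim) for _ in range(num_scales)]
        )
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, feats):
        # feats: list of per-scale tensors, each (T, H_s * W_s, C), coarse first.
        outs = [attn(f) for attn, f in zip(self.per_scale, feats)]
        fused = outs[0]
        for fine in outs[1:]:
            # Broadcast a pooled summary of the coarser output to the finer
            # scale and fuse the two (an assumed, simplified fusion scheme).
            coarse_ctx = fused.mean(dim=1, keepdim=True).expand(-1, fine.shape[1], -1)
            fused = self.fuse(torch.cat([fine, coarse_ctx], dim=-1))
        return fused
```

The key design point the sketch tries to convey is that attention is factorized: spatial and temporal interactions are computed separately per scale (keeping cost manageable), and multi-scale relationships are then introduced by fusing the per-scale outputs.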

DOI
10.1007/978-3-031-19818-2_38
Publication Date
10-22-2022
Keywords
  • Attention computation,
  • Multi-scale features,
  • Multi-scales,
  • Multiple scale,
  • Spatio-temporal,
  • Spatiotemporal feature,
  • Split attentions,
  • State of the art,
  • Temporal consistency,
  • YouTube

Citation Information
O. Thawakar et al., "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer," in Computer Vision – ECCV 2022, Lecture Notes in Computer Science, vol. 13689, pp. 666-681, Oct. 2022, doi: 10.1007/978-3-031-19818-2_38.