Article
Multi-modal Transformers Excel at Class-agnostic Object Detection
arXiv
  • Muhammad Maaz, Mohamed bin Zayed University of Artificial Intelligence
  • Hanoona Bangalath Rasheed, Mohamed bin Zayed University of Artificial Intelligence
  • Salman Hameed Khan, Mohamed bin Zayed University of Artificial Intelligence & Australian National University
  • Fahad Shahbaz Khan, Mohamed bin Zayed University of Artificial Intelligence
  • Rao Muhammad Anwer, Mohamed bin Zayed University of Artificial Intelligence & Aalto University
  • Ming-Hsuan Yang, University of California & Yonsei University & Google Research
Document Type
Article
Abstract

What constitutes an object? This has been a longstanding question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and for unseen objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. To bridge this gap, we explore recent Multi-modal Vision Transformers (MViT) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on these findings, we develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs offer enhanced interactability with intelligible text queries. Code: https://git.io/J1HPY. © 2021, CC0.

DOI
doi.org/10.48550/arXiv.2111.11430
Publication Date
11-22-2021
Keywords
  • Object detection
  • Semantics
  • Aligned images
  • Excel
  • Feature processing
  • Image texts
  • Learning-based approach
  • Multi-modal
  • Multi-scale features
  • Objects detection
  • State-of-the-art performance
  • Topdown
  • Object recognition
  • Computer Vision and Pattern Recognition (cs.CV)
Comments

Preprint: arXiv

Archived with thanks to arXiv

Preprint License: CC0 1.0

Uploaded 25 March 2022

Citation Information
M. Maaz, H.B. Rasheed, S.H. Khan, F.S. Khan, R.M. Anwer, and M.H. Yang, "Multi-modal transformers excel at class-agnostic object detection", 2021, arXiv:2111.11430