Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction

DOI: 10.7910/DVN/PRHQMK
Author: Melih Serin (Boğaziçi University)
Publisher: Harvard Dataverse
Publication year: 2024
Subject: Engineering
Keywords: Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT); Object Detection; Vision Transformers; End-to-End Training; Generalized Intersection over Union (gIoU) Loss
Dates: 2024-04-14, 2024-04-14
Size: 4173342
Format: application/pdf
Version: 1.0
License: Creative Commons CC0 1.0 Universal Public Domain Dedication

Computer vision has been receiving increasing attention since tech industry giants such as OpenAI and Google released their recent complex generative AI models. This study focuses on one specific subfield: image-guided object detection. A detailed literature survey led us to Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) [1], a complex multi-purpose model that can also perform image-guided object detection as a secondary function. We conducted experiments using the OWL-ViT architecture as the base model and modified the necessary parts to achieve better one-shot performance. Code and models are available on GitHub.