Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction

DOI: 10.7910/DVN/PRHQMK
Author: Melih Serin (Boğaziçi University)
Publisher: Harvard Dataverse, 2024
Subject: Engineering
Keywords: Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT); Object Detection; Vision Transformers; End-to-End Training; Generalized Intersection over Union (gIoU) Loss
Date: 2024-04-14
Format: application/pdf
Size: 4173342
Version: 1.0
License: CC0 1.0

Computer vision has been receiving increasing attention with the recent complex generative AI models released by tech industry giants such as OpenAI and Google. This study, however, focuses on one specific subfield: image-guided object detection. A detailed literature survey led us to a successful study, Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) [1], a complex multifunctional model that can also perform image-guided object detection as a side function. In this study, we conducted experiments using the OWL-ViT architecture as the base model and modified the necessary parts to achieve better one-shot performance. Code and models are available on GitHub.