<?xml version='1.0' encoding='UTF-8'?><metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns="http://dublincore.org/documents/dcmi-terms/"><dcterms:title>Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction</dcterms:title><dcterms:identifier>https://doi.org/10.7910/DVN/PRHQMK</dcterms:identifier><dcterms:creator>Melih Serin</dcterms:creator><dcterms:publisher>Harvard Dataverse</dcterms:publisher><dcterms:issued>2024-04-14</dcterms:issued><dcterms:modified>2024-04-14T22:49:15Z</dcterms:modified><dcterms:description>Computer vision has been receiving increasing attention following the release of complex generative AI models by tech industry giants such as OpenAI and Google. This work focuses on a specific subfield: image-guided object detection. A detailed literature survey directed us towards Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) [1], a multifunctional model that can also perform image-guided object detection as a secondary function. In this study, we conduct experiments using the OWL-ViT architecture as the base model and modify the relevant components to achieve better one-shot performance.
Code and models are available on GitHub.</dcterms:description><dcterms:subject>Engineering</dcterms:subject><dcterms:subject>Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT)</dcterms:subject><dcterms:subject>Object Detection</dcterms:subject><dcterms:subject>Vision Transformers</dcterms:subject><dcterms:subject>End-to-End Training</dcterms:subject><dcterms:subject>Generalized Intersection over Union (gIoU) Loss</dcterms:subject><dcterms:isReferencedBy>10.5281/zenodo.10938342</dcterms:isReferencedBy><dcterms:date>2024-04-14</dcterms:date><dcterms:contributor>KUUJE</dcterms:contributor><dcterms:dateSubmitted>2024-04-14</dcterms:dateSubmitted><dcterms:license>CC0 1.0</dcterms:license></metadata>