<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction</titl><IDNo agency="DOI">doi:10.7910/DVN/PRHQMK</IDNo></titlStmt><distStmt><distrbtr source="archive">Harvard Dataverse</distrbtr><distDate>2024-04-14</distDate></distStmt><verStmt source="archive"><version date="2024-04-14" type="RELEASED">1</version></verStmt><biblCit>Melih Serin, 2024, "Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction", https://doi.org/10.7910/DVN/PRHQMK, Harvard Dataverse, V1</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction</titl><IDNo agency="DOI">doi:10.7910/DVN/PRHQMK</IDNo></titlStmt><rspStmt><AuthEnty affiliation="Boğaziçi University">Melih Serin</AuthEnty></rspStmt><prodStmt/><distStmt><distrbtr source="archive">Harvard Dataverse</distrbtr><contact affiliation="Boğaziçi University" email="melihsrnn@gmail.com">Melih Serin</contact><depositr>KUUJE</depositr><depDate>2024-04-14</depDate></distStmt><holdings URI="https://doi.org/10.7910/DVN/PRHQMK"/></citation><stdyInfo><subject><keyword xml:lang="en">Engineering</keyword><keyword>Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT)</keyword><keyword>Object Detection</keyword><keyword>Vision Transformers</keyword><keyword>End-to-End Training</keyword><keyword>Generalized Intersection over Union (gIoU) Loss</keyword></subject><abstract date="2024-04-15">Computer vision has received increasing attention with the recent release of complex generative AI models by tech industry giants such as OpenAI and Google. Within this field, we focus on a specific subfield: image-guided object detection. 
A detailed literature survey led us to Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) [1], a multifunctional model that can also perform image-guided object detection as a secondary function. In this study, we conduct experiments using the OWL-ViT architecture as the base model and modify the necessary components to achieve better one-shot performance. Code and models are available on GitHub.</abstract><sumDscr/></stdyInfo><method><dataColl><sources/></dataColl><anlyInfo/></method><dataAccs><setAvail/><useStmt/><notes type="DVN:TOU" level="dv">&lt;a href="http://creativecommons.org/publicdomain/zero/1.0">CC0 1.0&lt;/a></notes></dataAccs><othrStdyMat/></stdyDscr><otherMat ID="f10117853" URI="https://dataverse.harvard.edu/api/access/datafile/10117853" level="datafile"><labl>ImageGuidedObjectDetection.pdf</labl><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">application/pdf</notes></otherMat></codeBook>