<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction</titl><IDNo agency="DOI">doi:10.7910/DVN/PRHQMK</IDNo></titlStmt><distStmt><distrbtr source="archive">Harvard Dataverse</distrbtr><distDate>2024-04-14</distDate></distStmt><verStmt source="archive"><version date="2024-04-14" type="RELEASED">1</version></verStmt><biblCit>Melih Serin, 2024, "Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction", https://doi.org/10.7910/DVN/PRHQMK, Harvard Dataverse, V1</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Image-Guided Object Detection using OWL-ViT and Enhanced Query Embedding Extraction</titl><IDNo agency="DOI">doi:10.7910/DVN/PRHQMK</IDNo></titlStmt><rspStmt><AuthEnty affiliation="Boğaziçi University">Melih Serin</AuthEnty></rspStmt><prodStmt/><distStmt><distrbtr source="archive">Harvard Dataverse</distrbtr><contact affiliation="Boğaziçi University" email="melihsrnn@gmail.com">Melih Serin</contact><depositr>KUUJE</depositr><depDate>2024-04-14</depDate></distStmt><holdings URI="https://doi.org/10.7910/DVN/PRHQMK"/></citation><stdyInfo><subject><keyword xml:lang="en">Engineering</keyword><keyword>Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT)</keyword><keyword>Object Detection</keyword><keyword>Vision Transformers</keyword><keyword>End-to-End Training</keyword><keyword>Generalized Intersection over Union (gIoU) Loss</keyword></subject><abstract date="2024-04-15">Computer vision has received increasing attention with the recent release of complex generative AI models by tech industry giants such as OpenAI and Google. Within this field, we focus on a specific subfield: image-guided object detection. 
A detailed literature survey led us to Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) [1], a multifunctional model that can also perform image-guided object detection as a secondary function. In this study, we conduct experiments using the OWL-ViT architecture as the base model and modify the necessary components to achieve better one-shot performance. Code and models are available on GitHub.</abstract><sumDscr/></stdyInfo><method><dataColl><sources/></dataColl><anlyInfo/></method><dataAccs><setAvail/><useStmt/><notes type="DVN:TOU" level="dv">&lt;a href="http://creativecommons.org/publicdomain/zero/1.0">CC0 1.0&lt;/a></notes></dataAccs><othrStdyMat/></stdyDscr><otherMat ID="f10117853" URI="https://dataverse.harvard.edu/api/access/datafile/10117853" level="datafile"><labl>ImageGuidedObjectDetection.pdf</labl><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">application/pdf</notes></otherMat></codeBook>