CoLLaVO: Crayon Large Language and Vision mOdel
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning is driving the evolution of Vision Language Models (VLMs) towards versatile general-purpose models. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities, probed by questions such as ‘what objects are in the image?’ or ‘which object corresponds to a specified bounding box?’. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision-language (VL) tasks, suggesting that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose the Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with the Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding during visual instruction tuning rather than forgetting it, thereby achieving a significant leap on numerous VL benchmarks in a zero-shot setting.
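The abstract names two techniques, the Crayon Prompt and Dual QLoRA, without spelling out their mechanics. The PyTorch sketch below illustrates one plausible reading of each: all class names, tensor shapes, and the adapter-freezing schedule are assumptions made for illustration, not the authors' implementation (the linked paper gives the actual details).

```python
import torch.nn as nn

class CrayonPrompt(nn.Module):
    """Hypothetical sketch: inject panoptic color-map information into
    image patch embeddings as a visual prompt (names/shapes assumed)."""

    def __init__(self, num_categories: int, num_instances: int, dim: int):
        super().__init__()
        self.semantic_embed = nn.Embedding(num_categories, dim)  # 'what' each patch is
        self.instance_embed = nn.Embedding(num_instances, dim)   # 'which one' it is

    def forward(self, patch_feats, category_map, instance_map):
        # patch_feats:  (B, N, D) patch embeddings from the vision encoder
        # category_map: (B, N)    panoptic category id per patch (LongTensor)
        # instance_map: (B, N)    instance id per patch (LongTensor)
        crayon = self.semantic_embed(category_map) + self.instance_embed(instance_map)
        return patch_feats + crayon  # object-aware visual prompt


class DualQLoRALinear(nn.Module):
    """Hypothetical sketch of 'Dual QLoRA': two low-rank adapters over one
    frozen (notionally 4-bit-quantized) base layer. Adapter A would be
    trained on object-level (crayon) data and then frozen; adapter B is
    trained during visual instruction tuning, so A is not overwritten."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)  # stands in for a quantized weight
        d_in, d_out = base.in_features, base.out_features
        self.a_down = nn.Linear(d_in, rank, bias=False)
        self.a_up = nn.Linear(rank, d_out, bias=False)
        self.b_down = nn.Linear(d_in, rank, bias=False)
        self.b_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.a_up.weight)  # adapters start as zero updates
        nn.init.zeros_(self.b_up.weight)

    def forward(self, x):
        return self.base(x) + self.a_up(self.a_down(x)) + self.b_up(self.b_down(x))
```

The intended takeaway is structural: the crayon embeddings are added to patch features before they reach the language backbone, and the second adapter learns on top of a frozen first adapter instead of overwriting it.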
- Anthology ID:
- 2024.findings-acl.66
- Volume:
- Findings of the Association for Computational Linguistics ACL 2024
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand and virtual meeting
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1121–1138
- URL:
- https://aclanthology.org/2024.findings-acl.66
- Bibkey:
- lee-etal-2024-collavo
- Cite (ACL):
- Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. 2024. CoLLaVO: Crayon Large Language and Vision mOdel. In Findings of the Association for Computational Linguistics ACL 2024, pages 1121–1138, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Cite (Informal):
- CoLLaVO: Crayon Large Language and Vision mOdel (Lee et al., Findings 2024)
- PDF:
- https://aclanthology.org/2024.findings-acl.66.pdf
Export citation
@inproceedings{lee-etal-2024-collavo,
    title = "{C}o{LL}a{VO}: Crayon Large Language and Vision m{O}del",
    author = "Lee, Byung-Kwan and
      Park, Beomchan and
      Kim, Chae Won and
      Ro, Yong Man",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.66",
    pages = "1121--1138",
}