CoLLaVO: Crayon Large Language and Vision mOdel (2024)

Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

Abstract

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from ‘what objects are in the image?’ or ‘which object corresponds to a specified bounding box?’. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

Anthology ID:
2024.findings-acl.66
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1121–1138
URL:
https://aclanthology.org/2024.findings-acl.66
Cite (ACL):
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. 2024. CoLLaVO: Crayon Large Language and Vision mOdel. In Findings of the Association for Computational Linguistics ACL 2024, pages 1121–1138, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
CoLLaVO: Crayon Large Language and Vision mOdel (Lee et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.66.pdf


Export citation
  • BibTeX
@inproceedings{lee-etal-2024-collavo,
    title = "{C}o{LL}a{VO}: Crayon Large Language and Vision m{O}del",
    author = "Lee, Byung-Kwan and
      Park, Beomchan and
      Kim, Chae Won and
      Ro, Yong Man",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.66",
    pages = "1121--1138",
}



