Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Tel Aviv University, The Hebrew University of Jerusalem, Google Research
*Equal Contribution

arXiv | Code | 🤗 Dataset


We propose enhancing traditional image-text alignment models with a feedback mechanism that not only scores an image-text pair but also describes the discrepancy in text and annotates it visually in the image.
Our method uses large language models and visual grounding models to automatically generate a training set (TV-Feedback) of plausibly misaligned captions, each paired with a textual explanation and a visual indicator of the mismatch.
We also provide a human-curated test set (SeeTRUE-Feedback) with authentic textual and visual misalignment annotations.
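
To make the shape of this feedback concrete, here is a minimal Python sketch of the generation pipeline: an LLM perturbs an aligned caption into a plausible contradiction with an explanation, and a grounding model localizes the contradicted phrase. The field names and the helpers perturb_caption and ground_phrase are hypothetical stand-ins (stubbed so the sketch runs), not the released implementation.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class FeedbackExample:
    image_path: str
    misaligned_caption: str                     # plausible contradiction of the image
    textual_feedback: str                       # explanation of the mismatch
    visual_feedback: Tuple[int, int, int, int]  # bounding box (x, y, w, h) of the mismatch

def perturb_caption(caption: str) -> Tuple[str, str, str]:
    """Stub for the LLM step: swap one detail and explain the change."""
    return (
        caption.replace("dog", "cat"),
        "The caption mentions a cat, but the image shows a dog.",
        "cat",
    )

def ground_phrase(image_path: str, phrase: str) -> Tuple[int, int, int, int]:
    """Stub for the visual grounding step: locate the contradicted phrase."""
    return (0, 0, 64, 64)  # placeholder box

def build_example(image_path: str, aligned_caption: str) -> FeedbackExample:
    misaligned, explanation, phrase = perturb_caption(aligned_caption)
    box = ground_phrase(image_path, phrase)
    return FeedbackExample(image_path, misaligned, explanation, box)

print(build_example("dog.jpg", "A dog lying on the grass"))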
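
The test set can be inspected with the Hugging Face datasets library. A minimal sketch, assuming the dataset is hosted on the Hub under the ID and split shown below; verify both against the dataset card before use.

from datasets import load_dataset

ds = load_dataset("mismatch-quest/SeeTRUE-Feedback", split="test")
print(ds[0])  # inspect one human-curated feedback annotation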

BibTeX


@misc{gordon2023mismatch,
  title={Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment},
  author={Brian Gordon and Yonatan Bitton and Yonatan Shafir and Roopal Garg and Xi Chen and Dani Lischinski and Daniel Cohen-Or and Idan Szpektor},
  year={2023},
  eprint={2312.03766},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}