claim
Large multimodal vision-language models, including GPT-4V, struggle with counting objects in images, identifying fine-grained differences between similar images, and lack sufficient visual grounding.

Authors

Sources

Referenced by nodes (1)