claim
Large multimodal vision-language models, including GPT-4V, struggle with counting objects in images, identifying fine-grained differences between similar images, and lack sufficient visual grounding.
Authors
Sources
- Understanding LLM Understanding skywritingspress.ca via serper
Referenced by nodes (1)
- Large Vision-Language Models concept