claim
Multimodal large language models, such as Google's Gemini and OpenAI's GPT-4 with vision (GPT-4V), have vision capabilities: they accept images as input alongside text.
