I read the paper Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint from UC Berkeley.
The authors built a hand-crafted benchmark of 432 English rebus puzzles, each annotated with 11 cognitive-skill categories, and tested a wide range of models, from open-source VLMs to reasoning-enabled ones.
Performance was measured in two ways:
- Naive matching: exact answer comparison
- LLM-judged: using GPT-4o or Qwen3-8B to semantically judge correctness (a sketch of both scoring modes follows below).
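A minimal sketch of the two scoring modes, assuming an OpenAI-style judge; the normalization rules and the judge prompt here are my own assumptions, not the paper's exact setup:

```python
# Sketch of the two evaluation modes: naive exact matching vs. an LLM judge.
# The normalization and prompt wording are illustrative assumptions.
import re
from openai import OpenAI


def naive_match(prediction: str, gold: str) -> bool:
    """Exact answer comparison after light normalization (lowercase, strip punctuation)."""
    norm = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
    return norm(prediction) == norm(gold)


def llm_judged(prediction: str, gold: str, model: str = "gpt-4o") -> bool:
    """Ask an LLM judge whether the prediction is semantically equivalent to the gold answer."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        "You are grading a rebus puzzle answer.\n"
        f"Gold answer: {gold}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly 'yes' if they mean the same thing, otherwise 'no'."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


# Naive matching rejects paraphrases that a semantic judge would likely accept.
print(naive_match("once in a blue moon", "Once in a Blue Moon"))  # True
print(naive_match("rarely", "once in a blue moon"))               # False
```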
Below are some main findings:
- GPT-5 performed best but still far below human experts.
- Reasoning-enabled models outperformed non-reasoning ones by a large margin.
- Open-source VLMs barely solved any puzzles (<5%).
- Models did relatively well on symbolic or quantitative reasoning, but failed badly at:
  - visual metaphors
  - negation or absence cues
  - phonetic puns and cultural references