LLMs can look at an image, judge its creativity, and reveal the logic behind the score.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-07-04
A new paper finds that large language models can evaluate visual creativity in images and provide interpretable reasoning for their scores, with Gemini 3 Flash performing best. However, all tested models show a systematic bias: they overrate polished AI-generated images and underrate rough sketches.
Extraction
Topics: llm-evaluation, visual-creativity, multimodal-ai, ai-bias, creativity-assessment
Claims
- LLMs can assess visual creativity in images in a zero-shot setting and provide interpretable reasoning behind their scores.
- Most tested models matched human creativity scores fairly well, with Gemini 3 Flash leading on both image types evaluated.
- Models exhibited systematic bias, rating polished AI-generated images too generously and rough human sketches too harshly.
- When explaining their reasoning, models predominantly discussed visual content, originality, visual quality, and the final score.
- Visual creativity scoring can scale using LLMs, but systematic biases require calibration before reliable deployment.
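The calibration mentioned in the last claim could, in its simplest form, look like a per-image-type offset correction. The sketch below is only an illustration, not the paper's method: the function names `fit_offsets` and `calibrate`, the image-type labels, the score scale, and all numbers are hypothetical and chosen to mirror the reported direction of the bias (AI-generated images scored too high, sketches too low).

```python
from statistics import mean

def fit_offsets(records):
    """Estimate the average model-minus-human score gap per image type.

    records: iterable of (image_type, model_score, human_score) tuples.
    Returns a dict mapping image_type to its mean offset.
    """
    diffs = {}
    for image_type, model_score, human_score in records:
        diffs.setdefault(image_type, []).append(model_score - human_score)
    return {image_type: mean(d) for image_type, d in diffs.items()}

def calibrate(image_type, model_score, offsets):
    """Subtract the learned offset; unseen image types are left unchanged."""
    return model_score - offsets.get(image_type, 0.0)

# Hypothetical, made-up scores purely for illustration (not from the paper).
records = [
    ("ai_generated", 8.0, 6.0),
    ("ai_generated", 7.0, 6.0),
    ("sketch", 3.0, 5.0),
    ("sketch", 4.0, 5.0),
]

offsets = fit_offsets(records)
corrected_ai = calibrate("ai_generated", 8.0, offsets)
corrected_sketch = calibrate("sketch", 3.0, offsets)
```

A mean-offset correction only removes a constant shift per image type. If the bias varies with the score itself, a fitted regression on a held-out set of human-rated images would be a more suitable assumption.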
Key quotes
LLMs can look at an image, judge its creativity, and reveal the logic behind the score.
the models had clear biases: they rated polished AI images too generously and rough sketches too harshly.
visual creativity scoring can scale, while its biases still need calibration