OmniBench: A Comprehensive and Scalable Benchmark for Fine-grained Compositional Text-to-Image Generation

Abstract

Notwithstanding recent progress in text-to-image (T2I) models, current approaches still struggle to integrate multiple concepts from compositional textual prompts into a cohesive scene. Designing benchmarks that systematically assess the compositional capabilities of T2I models is challenging yet crucial. However, existing benchmarks lack comprehensive coverage of fine-grained compositional prompts, limiting their ability to support in-depth, multi-granularity evaluation. To address these issues, we introduce \(\textbf{OmniBench}\), a comprehensive and scalable benchmark for evaluating compositional T2I generation at a fine-grained level. \(\textbf{OmniBench}\) provides a large-scale dataset of textual prompts spanning 29 fine-grained categories and 10 complex compositions. Prompts are assigned difficulty levels, allowing T2I models to be evaluated on prompts ranging from simple to complex. Additionally, we propose \(\textbf{FITScore}\), a Fine-grained, Interpretable, and Training-free evaluation framework tailored for compositional T2I generation. By integrating structured analysis with interpretability, FITScore enables automatic and transparent assessment of text-image alignment, adapts across diverse scenarios, and supports robust evaluation in open-world settings. Extensive evaluation of mainstream T2I models reveals their strengths and limitations on complex compositional tasks.
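To give a rough sense of what a fine-grained, training-free alignment score could look like, the sketch below aggregates per-category pass/fail checks into a single score. All names and the aggregation rule here are hypothetical illustrations, not the paper's actual FITScore definition; the real pipeline would also decompose the prompt and judge each check against the generated image.

```python
from dataclasses import dataclass

@dataclass
class CategoryCheck:
    # One fine-grained check derived from the prompt,
    # e.g. an object, an attribute, a spatial relation, or a count.
    category: str
    passed: bool  # whether the generated image satisfies this check

def fitscore_sketch(checks: list[CategoryCheck]) -> float:
    """Hypothetical aggregation: fraction of fine-grained checks satisfied.

    This only illustrates the final aggregation step; producing the checks
    (prompt decomposition and per-check judging) is done upstream.
    """
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)

# Example: a prompt decomposed into four checks, one of which fails.
checks = [
    CategoryCheck("object: cat", True),
    CategoryCheck("attribute: red scarf", True),
    CategoryCheck("relation: cat on sofa", False),
    CategoryCheck("count: two pillows", True),
]
print(fitscore_sketch(checks))  # 0.75
```

A per-check breakdown like this is also what makes such a score interpretable: a low value can be traced back to the specific categories that failed.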

Prompts Dataset

Evaluation

Evaluation Result


The performance of nine T2I models on OmniBench.


Comparative analysis of FITScore across fine-grained categories and complex compositions.


The performance of nine T2I models on each difficulty level.

Qualitative Results Comparison

Fine-Grained Categories



Complex Compositions