This section first describes the implementation details in Section 4.1. To assess the effectiveness of TIAM-ADV, Section 4.2 presents a user study that compares its scores with human evaluations. We then apply TIAM-ADV to three models, namely SD1.4, SD2, and SD3.5, and summarize their characteristics in Section 4.3.
4.2. User Study
In this experiment, we aimed to measure the gap between human judgment and the scores produced by various evaluation metrics, including the proposed method and existing approaches. First, we conducted a T2I matching survey of the generated images with 30 human participants. The survey presented the participants with several prepared images, each accompanied by a generated descriptive text. The participants were asked to place a check mark before the text if the image matched it. They were instructed to disregard aesthetics and personal preference, and to focus solely on the text–image alignment of each image.
Figure 5 shows example questions from the survey, which was administered via Google Forms. We used 120 images generated in the experiment described later in
Section 4.3.
Table 1 summarizes the object, attribute, action, and position aspects of these images, including the number of erroneous images identified for each aspect. The participants' responses yielded a Fleiss' kappa of 0.694, indicating "substantial agreement" [37] and confirming the reliability of the survey results.
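For concreteness, the agreement statistic can be computed as in the following sketch, which assumes the 30 binary match/no-match votes per image are collapsed into a per-image count table; the vote counts here are randomly generated placeholders, not the survey data.

```python
# Minimal sketch: Fleiss' kappa over 120 images rated by 30 participants,
# with two response categories ("match" / "no match"). Placeholder data.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

rng = np.random.default_rng(0)
match_votes = rng.integers(0, 31, size=120)                # raters who checked "match"
table = np.column_stack([match_votes, 30 - match_votes])   # shape: (n_images, n_categories)
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```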
Second, we compared the human judgment with the scores produced by BLIP-2, PickScore [
38], ImageReward [
39], HPS v2 [
40], and TIAM-ADV using Pearson correlation to quantify how closely each automatic metric reflects human judgment. BLIP-2, PickScore, ImageReward, and HPS v2 are commonly used metrics for evaluating image–text alignment and visual quality in T2I models.
Table 2 presents the experimental results. Among the compared metrics, TIAM-ADV achieved the highest Pearson correlation coefficient, reaching 0.853. A correlation above 0.7 is typically considered a strong positive association, indicating that the proposed method aligns most closely with human judgment.
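The correlation itself is straightforward to compute; the sketch below assumes each image is summarized by a human agreement rate and a metric score, with purely illustrative values.

```python
# Minimal sketch: Pearson correlation between per-image human agreement
# and an automatic metric (e.g., TIAM-ADV). Values are illustrative only.
import numpy as np
from scipy.stats import pearsonr

human = np.array([0.93, 0.10, 0.77, 0.40, 0.87])    # fraction of raters marking "match"
metric = np.array([0.91, 0.05, 0.70, 0.52, 0.95])   # metric score per image
r, p = pearsonr(human, metric)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```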
The prediction capabilities of the metrics were further assessed using ROC curves, as shown in
Figure 6. Among them, TIAM-ADV achieved the highest predictive performance, with an area under the curve (AUC) of 0.96, slightly exceeding ImageReward and outperforming the other metrics. Unlike ImageReward, our method does not require additional training of the VLM, providing greater flexibility. Notably, TIAM-ADV also outperformed BLIP-2 (AUC 0.82) and exhibited a stronger correlation with human judgment, highlighting the effectiveness of its token-level attention-region extraction, which distinguishes it from BLIP-2. This confirms that the attention maps are sufficiently informative for the evaluation task considered in this work.
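The ROC analysis can be sketched as follows, assuming each image receives a binary human label (e.g., the majority vote of the 30 participants) and a continuous metric score; the labels and scores shown are illustrative placeholders.

```python
# Minimal sketch: ROC curve and AUC for one metric against binarized human judgment.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

labels = np.array([1, 0, 1, 0, 1, 1, 0, 0])                  # 1 = humans judged "matched"
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.3, 0.1])  # metric score per image
fpr, tpr, _ = roc_curve(labels, scores)
print(f"AUC = {roc_auc_score(labels, scores):.2f}")
```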
Overall, TIAM-ADV exhibits the highest correlation with human judgment, demonstrating its effectiveness for assessing text–image alignment. A brief comparison with the other metrics clarifies their complementary roles. As highlighted in previous research, PickScore predicts relative scores with an emphasis on user preferences, HPS v2 provides absolute scores but primarily emphasizes image quality rather than text–image alignment, and ImageReward considers both alignment and factors such as fidelity and harmlessness. In contrast, TIAM-ADV produces absolute scores based solely on text–image alignment, making it particularly reliable when alignment is the primary concern, while still serving as a complement to such approaches. Moreover, TIAM-ADV requires no additional training and can readily use different VLMs as needed, making it a flexible and lightweight evaluation approach.
4.3. Results of Model Comparison
In this experiment, we applied our method to three models, namely SD1.4, SD2, and SD3.5. We evaluated their performance by assigning TIAM-ADV scores using four prompt sets that focus on the generation of objects, attributes, actions, and positions. These prompt sets were generated from templates (Table 3) and word sets (Figure 2). Based on these scores, we further analyzed the performance differences among the three models and identified their individual characteristics.
First, we report the performance scores for object generation. The object word set consists of 50 object categories. For each distinct pair of categories, we constructed the prompt "an image of ⟨object 1⟩ and ⟨object 2⟩" (adjusting the articles as needed), generating 1225 images in total. We randomly sampled 600 images each from those generated by SD1.4 and SD2, and 300 images from those generated by SD3.5, to ensure sufficient data for statistical significance testing. The alignment between the prompts and the images was then evaluated using TIAM-ADV. For comparison, we also evaluated 300 images generated from the simpler single-object prompt "an image of ⟨object⟩" while varying the random seed.
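The two-object prompt set can be enumerated as in the sketch below; the object names and the article helper are placeholders rather than the paper's word set, and with 50 categories the enumeration yields C(50, 2) = 1225 prompts.

```python
# Minimal sketch: building two-object prompts from all distinct category pairs.
from itertools import combinations

objects = ["cat", "dog", "apple", "umbrella"]   # placeholder; the paper uses 50 categories

def article(word: str) -> str:
    """Pick 'a' or 'an' based on the leading vowel (simple heuristic)."""
    return "an" if word[0] in "aeiou" else "a"

prompts = [f"an image of {article(o1)} {o1} and {article(o2)} {o2}"
           for o1, o2 in combinations(objects, 2)]
print(len(prompts), prompts[0])                 # 50 categories -> C(50, 2) = 1225 prompts
```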
The results are presented in
Table 4, which reports the evaluation scores for object generation using prompts with one or two objects. The values in parentheses indicate the margin of error of the 95% confidence interval. All models performed well in generating images containing a single object, achieving scores above 0.847. However, when generating two objects simultaneously, SD1.4 and SD2 struggled, achieving significantly lower scores of 0.338 and 0.399, respectively. In contrast, SD3.5 maintained a high score of 0.904. Significance analysis confirms that the differences among the three models are statistically significant at the 95% confidence level. These findings indicate that although SD1.4 and SD2 exhibit comparable performance, SD2 performs slightly better in generating images with two objects. Overall, SD3.5 substantially outperforms both models.
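The reported margins can be approximated under the assumption that each score is the fraction of successfully aligned images and that a standard normal-approximation interval is used; the paper's exact procedure may differ.

```python
# Minimal sketch: 95% margin of error for a score treated as a binomial proportion.
import math

def margin_of_error(score: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of the confidence interval for a proportion."""
    return z * math.sqrt(score * (1.0 - score) / n)

print(f"0.904 +/- {margin_of_error(0.904, 300):.3f}")  # e.g., SD3.5 with 300 sampled images
```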
Next, we characterized the capacity of the models to apply attributes to objects. Two types of prompts were considered: "an image of ⟨attribute⟩ ⟨object⟩" and "an image of ⟨attribute 1⟩ ⟨object 1⟩ and ⟨attribute 2⟩ ⟨object 2⟩," with the attributes and objects drawn from the attribute and object word sets. For the single-attribute prompts, we generated candidate prompts from these sets, used an LLM to filter out prompts with unnatural language expressions, and classified the remaining prompts into "realistic" and "unrealistic" categories. We then randomly sampled 300 images from each category. For the two-attribute prompts, only the realistic prompts were considered. To ensure a sufficient number of samples for statistical significance testing, we randomly sampled 600 images each from those generated by SD1.4 and SD2 and 300 images from those generated by SD3.5.
The results are presented in
Table 5, where the columns correspond to prompts with one realistic attribute, one unrealistic attribute, and two realistic attributes. For the single-attribute prompts, SD1.4 and SD2 achieved scores of approximately 0.7, whereas SD3.5 achieved 0.877. The scores of SD1.4 and SD2 dropped to 0.035 and 0.068, respectively, when generating two objects with attributes, whereas SD3.5 maintained a score of 0.625 for the two-attribute prompts. The statistically significant differences indicate that SD1.4 performs worse than SD2, which in turn is outperformed by SD3.5 in multi-attribute generation. None of the models exhibited a significant difference between the realistic and unrealistic categories. Compared with object-only generation (Table 4), adding attributes lowered the scores of all models. These results suggest that although SD1.4 and SD2 exhibit comparable performance for single-attribute prompts, SD2 outperforms SD1.4 in multi-attribute generation, and SD3.5 substantially outperforms both in attribute generation.
To evaluate the ability of the models to associate actions with objects, we used the action word set to construct prompts. To avoid unnatural prompts, the object set was restricted to {'bird', 'cat', 'horse', 'elephant', 'giraffe', 'person'}. Specifically, two types of prompts were considered: "an image of ⟨action⟩ ⟨object⟩" and "an image of ⟨action 1⟩ ⟨object 1⟩ and ⟨action 2⟩ ⟨object 2⟩" (with articles added as needed), where the actions are drawn from the action word set and the objects from the restricted object set. For the first prompt type, we generated 66 text prompts (35 unrealistic prompts, such as "An image of a talking giraffe," and 31 realistic prompts), yielding 660 images by varying the random seed. For the second prompt type, we randomly sampled 600 images each from those generated by SD1.4 and SD2 and 300 images from those generated by SD3.5, all drawn from the realistic prompts.
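The single-action prompt enumeration can be sketched as follows; the action words are placeholders (the full action set contains 11 entries, giving 11 x 6 = 66 prompts), and the article is assumed to follow the action word.

```python
# Minimal sketch: single-action prompts over the restricted object set.
actions = ["talking", "running", "sleeping"]            # placeholder; the paper uses 11 actions
objects = ["bird", "cat", "horse", "elephant", "giraffe", "person"]

single_action_prompts = [f"an image of a {act} {obj}"   # e.g., "an image of a talking giraffe"
                         for act in actions for obj in objects]
# 11 actions x 6 objects = 66 prompts; 10 seeds per prompt gives 660 images.
print(len(single_action_prompts))
```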
The results (
Table 6) indicate that, for all three models, generation with realistic prompts outperformed generation with unrealistic prompts. Furthermore, the scores decreased as the number of specified actions increased. SD3.5 substantially outperformed the other two models; in the two-action generation experiment, it achieved a score of 0.422, significantly higher than the scores of SD1.4 and SD2 (0.003 and 0.025, respectively). Between these two models, SD2 outperformed SD1.4 in multi-action generation, with statistically significant results. Overall, more than half of the image generations involving two actions failed; therefore, we consider enhancing image generation for multiple actions a promising direction for future work.
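As a sketch of how such pairwise comparisons can be made, the snippet below runs a two-proportion z-test on per-image pass/fail counts; treating the scores as proportions and using this particular test are assumptions, and the counts are illustrative.

```python
# Minimal sketch: significance test between two models' success rates.
from statsmodels.stats.proportion import proportions_ztest

successes = [int(0.422 * 300), int(0.025 * 600)]  # e.g., SD3.5 (n=300) vs. SD2 (n=600), two actions
samples = [300, 600]
stat, p = proportions_ztest(successes, samples)
print(f"z = {stat:.2f}, p = {p:.3g}")
```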
To evaluate the ability of the models to generate images depicting spatial relationships between objects, we considered prompts of the form "an image of ⟨object⟩ ⟨position⟩ ⟨reference object⟩," where the object and the positional relation are drawn from the object and position word sets, respectively, and the reference object is drawn from {'chair', 'bed', 'mirror', 'door', 'desk'}. We obtained 300 prompts from all possible combinations after filtering out unnatural expressions.
The experimental results are presented in
Table 7, showing that SD3.5 also substantially outperforms SD1.4 and SD2 in representing positions, achieving a score of 0.851. Compared with object generation, adding positional relationships reduces the generation scores, particularly for SD1.4 and SD2, whose scores drop from 0.92 and 0.847 to 0.357 and 0.381, respectively.
The experimental results reveal that the text–image alignment performance of SD1.4 and SD2 is roughly comparable, although SD2 performs better in multi-object, multi-attribute, and multi-action image generation. SD3.5 consistently outperforms SD1.4 and SD2 across the generation tasks, with particularly large improvements in position, multi-attribute, and multi-action generation. Nevertheless, the most significant remaining limitation of SD3.5 lies in its multi-action generation capability, which may represent a valuable direction for future research.