Figure 1.
Example of a sample post with sentiment annotations across four aesthetic attributes: Color & Light, Composition, Focus, and Contrast. The figure illustrates the original comment, the revised version, and the corresponding aesthetic evaluation scores for each attribute.
Figure 2.
TASLP pipeline overview comprising four stages: (1) data acquisition and cleaning, (2) aesthetic comment rewriting, (3) aesthetic category clustering, and (4) aesthetic category sentiment labeling.
Figure 3.
RMSD data distribution. The figure shows the distribution of word counts in user comments and the proportion of posts containing four or more comments.
Figure 4.
Performance comparison across prompt configurations for aesthetic sentiment labeling.
Figure 5.
LAGA framework architecture. The model consists of three main components: (1) the Encoding Module, which extracts semantic, positional, and sentiment-aware embeddings using BERT; (2) the Aesthetic Category Group (ACG) Module, which aggregates and pools the representations of the different aesthetic attribute categories; and (3) the Multi-task Module, which jointly performs aesthetic category sentiment analysis (ACSA) and image aesthetic assessment (IAA). In the figure, differently colored nodes represent embeddings associated with the four aesthetic attribute categories: Color & Light, Composition, Focus, and Contrast. CLS denotes the classification token and Enc the encoder module. The four aesthetic category representations are shown together with a binary vector that indicates the presence (1) or absence (0) of each category.
Figure 6.
ACSFM model architecture for unified aesthetic evaluation. Red circles in the LAGA component represent sentiment polarity nodes (positive, neutral, negative), while colored circles in MACE indicate channel-specific visual feature representations.
Figure 7.
Performance comparison of ACSFM variants on the IAC task.
Figure 8.
ACSFM performance comparison against large language models on aesthetic evaluation tasks. BLEU metrics follow standard definitions [50].
Figure 9.
Example failure case: an ambiguous aesthetic judgment with mixed sentiment that LAGA struggles to capture accurately. The model tends to collapse contradictory evaluations into a single uniform sentiment prediction.
Figure 10.
Case study comparing aesthetic scoring predictions across models. GT denotes the ground-truth aesthetic scores. The graph shows that the predictions of LAGA and BLIP (+ACSFM) align closely with the ground-truth values.
Figure 11.
Case study demonstrating comparative results for aesthetic caption generation across models. Generated captions from ACSFM variants exhibit stronger alignment with ground truth evaluations.
Table 1.
Model evaluation and comment revision prompts. The table presents a unified prompt used for model-based evaluation and aesthetic comment revision.
| Unified Evaluation Prompt |
|---|
| Part I: Model Cross Evaluation (Text Rewriting) |
| You are a professional text optimization assistant specializing in photography reviews. Rewrite the following review according to these requirements: |
| Preserve Sentiment: Maintain the original emotional tone (positive, negative, or neutral). |
| Retain Semantics: Preserve the core message and key points without introducing irrelevant content. |
| Condense Word Count: Remove redundancy while maintaining clarity and impact. |
| Enhance Expression: Improve fluency, coherence, and professionalism using precise language that highlights photographic elements. |
| Part II: Aesthetic Comment Revision (Quality Scoring) |
| You are an expert text evaluator assessing the quality of rewritten reviews. Score both the original and revised versions according to the following criteria: |
| Content Retention (Ci): 1 = Significant loss or inconsistencies; 2 = Partial loss or minor semantic drift; 3 = Accurate retention of original content. |
| Fluency (Fi): 1 = Frequent grammatical errors or awkward phrasing; 2 = Mostly correct but somewhat unnatural; 3 = Natural and smooth expression. |
| Readability (Ri): 1 = Unclear structure or logic; 2 = Generally clear but overly complex; 3 = Clear, concise, and well-structured. |
Table 2.
Cross-evaluation scores for comment revision models. Best scores in bold, second-best underlined. Final row shows column sums (200-entry sample).
| Model | LLaMA-2-14B | Gemini | Qwen-14B | GPT-4o |
|---|---|---|---|---|
| LLaMA-2-14B | 0.26 | 0.22 | 0.21 | 0.31 |
| Gemini | 0.23 | 0.34 | 0.26 | 0.32 |
| Qwen-14B | 0.20 | 0.28 | 0.36 | 0.37 |
| GPT-4o | 0.15 | 0.17 | 0.24 | 0.41 |
| Column Sum | 0.84 | 1.01 | 1.07 | 1.41 |
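
The final row of Table 2 can be reproduced with a few lines of arithmetic. The sketch below copies the scores from the table and sums each column; the variable names are illustrative, and it makes no assumption about whether rows or columns correspond to the evaluating model, since the table itself does not say.

```python
# Scores copied from Table 2; each inner list is one table row.
models = ["LLaMA-2-14B", "Gemini", "Qwen-14B", "GPT-4o"]
scores = [
    [0.26, 0.22, 0.21, 0.31],
    [0.23, 0.34, 0.26, 0.32],
    [0.20, 0.28, 0.36, 0.37],
    [0.15, 0.17, 0.24, 0.41],
]

# Sum each column and round to the two decimals used in the table.
column_sums = [
    round(sum(row[j] for row in scores), 2) for j in range(len(models))
]

# The column header with the largest sum.
best = models[column_sums.index(max(column_sums))]

print(dict(zip(models, column_sums)))
print("Highest column sum:", best)
```

The recomputed sums match the "Column Sum" row, with GPT-4o highest.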
Table 3.
Top 10 topics from BERTopic clustering. Subjective attribute words in bold.
| Topic | Count (as proportion) |
|---|---|
| shot_subject_focus_bit | 0.332 |
| trees_tree_branches_foreground | 0.119 |
| sky_clouds_ground_cloud | 0.107 |
| bird_birds_feathers_duck | 0.081 |
| car_cars_lights_headlights | 0.075 |
| hdr_range_halo_effect | 0.063 |
| composition_critique_opinion_composer | 0.043 |
| dog_dogs_pup_doggo | 0.039 |
| crop_cropping_ratio_aspect | 0.036 |
| sky_crop_ratio_horizon | 0.035 |
Table 4.
Prompts for Aesthetic Category Sentiment Analysis.
| Part | No. | Prompt text |
|---|---|---|
| Prompt I | 1 | Extract paragraph attitude toward following dimensions. Attitude range: (positive, neutral, negative). |
| Prompt I | 2 | Extract paragraph attitude toward following dimensions. Attitude range: (positive, neutral, negative). If unrelated to dimension, attitude is neutral. |
| Prompt I | 3 | Provide sentiment polarity for specified dimensions only. Polarity range: (positive, neutral, negative). If unrelated, polarity is neutral. |
| Prompt I | 4 | Comment on paragraph sentiment polarity for specified dimensions. Provide polarity within (positive, neutral, negative). If unrelated, polarity is neutral. |
| Prompt II | 1 | Dimensions: 1. color & light; 2. composition; 3. contrast; 4. focus. Paragraph: [Comment] |
| Prompt II | 2 | Dimensions: 1. Impression of color and light; 2. Impression of composition; 3. Impression of contrast; 4. Impression of focus. Paragraph: [Comment] |
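
One plausible way to turn Table 4 into concrete prompt configurations is to pair each Prompt I instruction with each Prompt II dimension template. The sketch below does that; the pairing scheme, the `build_prompts` helper, and the sample comment are assumptions made for illustration, since the table does not state how the two parts are combined.

```python
from itertools import product

# Prompt I: instruction variants, copied from Table 4.
instructions = [
    "Extract paragraph attitude toward following dimensions. "
    "Attitude range: (positive, neutral, negative).",
    "Extract paragraph attitude toward following dimensions. "
    "Attitude range: (positive, neutral, negative). "
    "If unrelated to dimension, attitude is neutral.",
    "Provide sentiment polarity for specified dimensions only. "
    "Polarity range: (positive, neutral, negative). "
    "If unrelated, polarity is neutral.",
    "Comment on paragraph sentiment polarity for specified dimensions. "
    "Provide polarity within (positive, neutral, negative). "
    "If unrelated, polarity is neutral.",
]

# Prompt II: dimension templates; {comment} stands in for [Comment].
dimension_templates = [
    "Dimensions: 1. color & light; 2. composition; 3. contrast; 4. focus. "
    "Paragraph: {comment}",
    "Dimensions: 1. Impression of color and light; "
    "2. Impression of composition; 3. Impression of contrast; "
    "4. Impression of focus. Paragraph: {comment}",
]


def build_prompts(comment):
    """Return every instruction/template pairing for one comment (hypothetical helper)."""
    return [
        f"{instruction} {template.format(comment=comment)}"
        for instruction, template in product(instructions, dimension_templates)
    ]


prompts = build_prompts("Lovely light, but the horizon is tilted.")
print(len(prompts))
print(prompts[0])
```

Under this pairing assumption, four instructions and two templates give eight candidate configurations per comment.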
Table 5.
Filtering Statistics for RMSD Dataset Construction.
| Filtering Criterion | Excluded | Percentage | Remaining |
|---|---|---|---|
| Initial collection | - | - | 61,224 posts, 141,452 comments |
| Multi-image posts | 8,247 | 13.5% | 52,977 posts |
| Comments with <3 attributes | 23,156 | 16.4% | 118,296 comments |
| Multimodal attachments | 4,893 | 3.5% | 113,403 comments |
| Harmful language | 2,318 | 1.6% | 111,085 comments |
| Final dataset | - | - | 52,977 posts, 111,085 comments |
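
The "Percentage" and "Remaining" columns of Table 5 follow from the initial counts and the per-step exclusions. The sketch below recomputes them, assuming (as the figures suggest) that each percentage is expressed relative to the initial collection of posts or comments; the `apply_filters` helper is illustrative.

```python
initial_posts, initial_comments = 61_224, 141_452

# (criterion, number excluded), in the order listed in Table 5.
post_filters = [("Multi-image posts", 8_247)]
comment_filters = [
    ("Comments with <3 attributes", 23_156),
    ("Multimodal attachments", 4_893),
    ("Harmful language", 2_318),
]


def apply_filters(start, filters):
    """Subtract exclusions in order; report each as a share of `start`."""
    remaining, rows = start, []
    for name, excluded in filters:
        remaining -= excluded
        share = round(100 * excluded / start, 1)
        rows.append((name, excluded, share, remaining))
    return rows


post_rows = apply_filters(initial_posts, post_filters)
comment_rows = apply_filters(initial_comments, comment_filters)

for name, excluded, share, remaining in post_rows + comment_rows:
    print(f"{name}: excluded {excluded:,} ({share}%), remaining {remaining:,}")
```

Applying the exclusions in sequence leaves 52,977 posts and 111,085 comments.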
Table 6.
Summary of publicly available image aesthetic assessment datasets.
| Dataset | Images | Scale | Topics | Attributes | Comments | Annotations |
|---|---|---|---|---|---|---|
| Photo.Net | 20 K | 1–7 | - | - | - | Score |
| CUHK-PQ | 17 K | 0–1 | 7 | - | - | Label |
| AVA | 250 K | 1–10 | 66 | 14 | - | Score |
| AVA-Comments | 250 K | 1–10 | 66 | - | 1.5 M | Text |
| PCCD | 4 K | 1–10 | - | 7 | 29 K | Text |
| DPC-Captions | 110 K | 1–10 | - | 5 | 2.4 M | Text |
| RPCD | 70 K | 1–10 | - | 5 | 219 K | Score |
| RMSD | 61 K | 1–5 | - | 4 | 141 K | Text & Score |
Table 7.
Comparison of text-based models and LAGA on the ACSA task. Best results in bold.
| Model | Acc. (%) | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| BART-Generation [42] | 82.77 | 76.73 | 75.60 | 77.91 |
| HGCN [43] | 82.89 | 76.94 | 77.64 | 76.25 |
| AAGCN [44] | 77.91 | 68.14 | 70.93 | 65.57 |
| LAGA (ours) | 83.02 | 77.05 | 78.52 | 75.64 |
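
The F1 column of Table 7 can be cross-checked against the precision and recall columns. The sketch below assumes F1 is the harmonic mean of the reported precision and recall; the table does not state the averaging scheme, so treat this as a consistency check, not as the evaluation code.

```python
# (precision, recall, reported F1) copied from Table 7, all in percent.
reported = {
    "BART-Generation": (75.60, 77.91, 76.73),
    "HGCN": (77.64, 76.25, 76.94),
    "AAGCN": (70.93, 65.57, 68.14),
    "LAGA": (78.52, 75.64, 77.05),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Absolute gap between the recomputed and the reported F1 per model.
gaps = {
    name: abs(f1_score(p, r) - f1)
    for name, (p, r, f1) in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: |recomputed - reported| = {gap:.3f}")
```

Under this assumption, every recomputed value lands within 0.02 points of the reported one.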
Table 8.
Model performance comparison on the IAA scoring task using the RMSD dataset. Best results in bold. Asterisks (* and **) denote statistical significance in paired t-tests against the second-best model.
| Model | SRCC | PLCC | 2-Cate (%) | D ± 0.5 (%) |
|---|---|---|---|---|
| BART-Generation [42] | 0.805 | 0.807 | 82.13 | 72.57 |
| HGCN [43] | 0.832 | 0.831 | 83.98 | 71.03 |
| AAGCN [44] | 0.700 | 0.662 | 81.94 | 54.86 |
| VGG-16 [45] | 0.068 | 0.041 | 65.50 | 35.00 |
| ResNet-34 [46] | 0.096 | 0.070 | 65.00 | 37.50 |
| ViT-Base [41] | 0.132 | 0.093 | 69.50 | 38.00 |
| AestheticCLIP | 0.456 | 0.442 | 74.23 | 45.67 |
| ViT-GPT2 (+ACSFM ours) | 0.204 | 0.218 | 59.51 | 25.72 |
| BLIP (+ACSFM ours) | 0.615 | 0.610 | 73.16 | 37.84 |
| LAGA (ours) | 0.842 ** | 0.836 ** | 85.67 * | 73.19 ** |
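
The four column headers in Table 8 correspond to two correlation measures and two accuracy-style measures. The sketch below shows one way to compute them in plain Python. SRCC and PLCC follow their standard definitions. The definitions used for 2-Cate (binary agreement after thresholding at 3.0, the midpoint of the 1–5 scale) and D ± 0.5 (share of predictions within 0.5 of the ground truth) are assumptions, as the captions do not define them, and the scores are toy values, not RMSD data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def two_class_accuracy(pred, truth, threshold=3.0):
    """Assumed 2-Cate: % of items on the same side of the threshold."""
    hits = sum((p >= threshold) == (t >= threshold) for p, t in zip(pred, truth))
    return 100 * hits / len(truth)


def within_tolerance(pred, truth, tol=0.5):
    """Assumed D +/- 0.5: % of predictions within `tol` of the ground truth."""
    hits = sum(abs(p - t) <= tol for p, t in zip(pred, truth))
    return 100 * hits / len(truth)


# Toy scores for illustration only.
truth = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.2, 2.6, 2.9, 4.4, 4.8]

print("SRCC:", round(spearman(pred, truth), 3))
print("PLCC:", round(pearson(pred, truth), 3))
print("2-Cate (%):", two_class_accuracy(pred, truth))
print("D +/- 0.5 (%):", within_tolerance(pred, truth))
```

In practice a statistics library would normally replace the hand-written correlation functions; they are spelled out here only to keep the example self-contained.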
Table 9.
Model performance comparison on the aesthetic image captioning task using the RMSD dataset. Best results in bold.
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE |
|---|---|---|---|---|---|
| BLIP [25] | 41.65 | 17.03 | 6.70 | 2.71 | 20.27 |
| BLIP(+ACSFM ours) | 45.92 | 21.79 | 10.31 | 5.30 | 21.47 |
| ViT-GPT2 [48] | 51.39 | 24.78 | 10.88 | 4.95 | 23.54 |
| ViT-GPT2(+ACSFM ours) | 52.27 | 27.32 | 12.09 | 5.52 | 23.89 |
| LLaVA-1.5 | 38.42 | 15.67 | 6.23 | 2.84 | 19.35 |
Table 10.
Performance comparison of LAGA variants on the ACSA task. Best results in bold, second-best underlined. Decreases relative to the full model are marked with a downward arrow followed by the amount of the decrease.
| Model | Acc. (%) | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| LAGA | 83.02 | 77.05 | 78.52 | 75.64 |
| w/o sentiment | 73.12↓(9.90) | 69.42↓(7.63) | 69.18↓(9.34) | 69.67↓(5.97) |
| w/o ACG | 71.10↓(11.92) | 67.83↓(9.22) | 68.32↓(10.20) | 67.34↓(8.30) |
| w/o A-BLSTM | 74.32↓(8.70) | 69.72↓(7.33) | 70.12↓(8.40) | 69.32↓(6.32) |
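
The parenthesized decreases in Table 10 are plain differences between the full model and each ablated variant. The sketch below recomputes them from the table values; the dictionary layout is illustrative.

```python
# Scores copied from Table 10 (all in percent).
full = {"ACC": 83.02, "F1": 77.05, "Precision": 78.52, "Recall": 75.64}
ablations = {
    "w/o sentiment": {"ACC": 73.12, "F1": 69.42, "Precision": 69.18, "Recall": 69.67},
    "w/o ACG": {"ACC": 71.10, "F1": 67.83, "Precision": 68.32, "Recall": 67.34},
    "w/o A-BLSTM": {"ACC": 74.32, "F1": 69.72, "Precision": 70.12, "Recall": 69.32},
}

# Drop of each variant relative to the full model, rounded to two decimals.
drops = {
    name: {metric: round(full[metric] - value, 2) for metric, value in scores.items()}
    for name, scores in ablations.items()
}

# Variant whose removal costs the most accuracy.
largest = max(drops, key=lambda name: drops[name]["ACC"])

for name, per_metric in drops.items():
    print(name, per_metric)
print("Largest accuracy drop:", largest)
```

The recomputed differences match the table, and removing the ACG module produces the largest drop on all four metrics.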
Table 11.
Performance comparison of LAGA variants on the IAA task. Best results in bold, second-best underlined. Decreases relative to the full model are marked with a downward arrow followed by the amount of the decrease.
| Model | SRCC | PLCC | 2-Cate (%) | D ± 0.5 (%) |
|---|---|---|---|---|
| LAGA | 0.842 | 0.836 | 85.67 | 73.19 |
| w/o sentiment | 0.833↓(0.009) | 0.831↓(0.005) | 85.05↓(0.62) | 71.64↓(1.55) |
| w/o ACG | 0.812↓(0.030) | 0.808↓(0.028) | 83.36↓(2.31) | 69.65↓(3.54) |
| w/o A-BLSTM | 0.811↓(0.031) | 0.812↓(0.024) | 83.84↓(1.83) | 69.77↓(3.42) |
Table 12.
Human evaluation results comparing aesthetic caption quality across models. Three expert evaluators rated the generated captions on a 5-point Likert scale along three dimensions: Aesthetic Relevance, Evaluative Depth, and Overall Quality. Inter-annotator agreement, measured by Fleiss’ kappa, indicated substantial agreement. Results are averaged over 100 test samples, with standard deviations reported in parentheses. Bold values indicate the best result for each evaluation dimension.
| Model | Aesthetic Relevance | Evaluative Depth | Overall Quality |
|---|---|---|---|
| BLIP | 3.1 (0.8) | 2.8 (0.7) | 3.0 (0.7) |
| BLIP+ACSFM (ours) | 4.2 (0.6) | 4.1 (0.5) | 4.3 (0.5) |
| ViT-GPT2 | 3.3 (0.7) | 3.0 (0.8) | 3.2 (0.7) |
| ViT-GPT2+ACSFM (ours) | 4.0 (0.6) | 3.9 (0.6) | 4.1 (0.5) |
| LLaVA-1.5 | 2.9 (0.9) | 2.6 (0.8) | 2.8 (0.8) |
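
The caption of Table 12 reports inter-annotator agreement with Fleiss' kappa. The sketch below implements the standard formula for a fixed number of raters per item. The rating counts are toy values for three raters on a 5-point scale, not the study's data, and the function name is illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters who put item i in category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 4 captions, 3 raters, Likert categories 1..5 (columns).
ratings = [
    [0, 0, 0, 3, 0],  # all three raters chose 4
    [0, 0, 1, 2, 0],  # one rater chose 3, two chose 4
    [0, 0, 0, 1, 2],  # one rater chose 4, two chose 5
    [0, 2, 1, 0, 0],  # two raters chose 2, one chose 3
]

print(round(fleiss_kappa(ratings), 3))
```

Fleiss' kappa treats the Likert levels as unordered categories, so near-misses (a 4 versus a 5) count as full disagreement.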