ChatCAS: A Multimodal Ceramic Multi-Agent Studio for Consultation, Image Analysis and Generation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Reviewer's Comments
I am honored to have been selected by your journal to review the authors' revised article and to provide comments that I hope will help improve the final document for publication.
The article is titled:
ChatCAS: A Multimodal Ceramic Multi-Agent Studio for Consultation, Image Analysis and Generation.
Journal: Electronics (ISSN 2079-9292)
This article explores the application of LLMs in the field of ceramics and presents EvalCera, the first open-source evaluation dataset specialized for knowledge, analysis, and image generation in ceramics. It overcomes the limitations of older methods that struggled to analyze complex load data.
This topic is of great importance for the protection of intangible cultural heritage.
General Format
- The authors have written the document well.
- It's better not to use abbreviations such as (LLM) in the abstract.
- Include a table of notations after the keywords.
- The number of references is more than sufficient given the document's length.
- The references are up-to-date.
Equations 1-2
- Provide the optimal values for each variable involved.
- Justify Equation 1.
- Give an example of this equation.
- In Equation 2, rewrite (qdn as qdn).
- What does "can" represent?
- What is the purpose of the set AS?
Simulation
- To measure the performance of your technique, you calculate statistical and physical parameters. What is the role of each parameter, and what is its optimal value range?
- In Table 3, provide the optimal values for EvalAcu and C-Eval.
Author Response
Comments 1:
It's better not to use abbreviations such as (LLM) in the abstract. Include a table of notations after the keywords.
Response 1:
Thank you for pointing this out. We agree with these comments. Therefore, we have removed the abbreviation “LLM” from the abstract and replaced it with the full term “large language model.” This revision can be found in the Abstract section (page 1, lines 2–14) of the revised manuscript, and the updated parts have been highlighted in red for ease of identification.
“[Updated text in the manuscript: Many traditional ceramic techniques are inscribed on UNESCO’s Intangible Cultural Heritage lists, yet expert scarcity, long training cycles, and stylistic homogenization impede intergenerational transmission and innovation. Although large language models offer new opportunities, research tailored to ceramics remains limited. To address this gap, we first construct EvalCera, the first open-source domain large language model evaluation dataset for ceramic knowledge, image analysis, and generation, and conduct large-scale assessments of existing general large language models on ceramic tasks, revealing their limitations. We then release the first ceramics-focused training corpus for large language models and, using it, develop CeramicGPT, the first domain-specific large language model for ceramics. Finally, we build ChatCAS, a workflow multi-agent system built on CeramicGPT and GPT-4o. Experiments show that our model and agents achieve the best performance on EvalCera (A) and (B) text tasks as well as (C) image generation. Code is available at: https://github.com/HanYongyi/HYY.]”
In addition, to improve readability and help the reader follow the mathematical expressions, we have added a table of notations. The table has been inserted immediately before Section 2.3.1 (page 7, Table 1) so that readers can consult it prior to the introduction of the related formulas. The revised parts have been highlighted in red in the manuscript for easy identification.
Comments 2:
Provide the optimal values for each variable involved.
Justify Equation 1.
Give an example of this equation.
In Equation 2, rewrite (qdn as qdn).
What does "can" represent?
What is the purpose of the set AS?
Response 2:
Thank you for the careful reading and helpful suggestions. We have updated the manuscript as follows:
1. “Provide the optimal values for each variable involved.”
We now report the optimal values for all variables appearing in the equations, together with the corresponding experimental settings; these values are listed in the revised manuscript.
2. “Justify Equation 1.”
We added a concise justification of Equation (1) in the Methods/Equation section, explaining the assumptions and the reasoning that lead to this form.
3. “Give an example of this equation.”
As requested, we have added an explicit worked example for Equation (1) in the text.
4. “In Equation 2, rewrite (qdn as qdn).”
We appreciate the sharp observation. This typographical issue in Equation (2) has been corrected throughout.
5. “What does ‘can’ represent?”
We clarify that the symbol can_n denotes the n-th candidate agent selected from the candidate agent set (CA).
6. “What is the purpose of the set AS?”
The set AS aggregates the dispersed results of the candidate agents into a unified whole through the manager agent, providing complete and traceable input for the subsequent voting, discussion, modification, and final decision-making stages (a minimal illustrative sketch is provided at the end of this response).
All of the above edits pertain specifically to Equations (1)–(2) and their associated notation.
These revisions can be found in the revised manuscript at page 8, section 2.3.1-2.3.2 (lines 260–303). The revised passages are highlighted in red for ease of identification.
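For illustration only, the following minimal Python sketch shows the intended relationship between the candidate agents can_n drawn from the set CA and the aggregation set AS; the agent names and the placeholder “reasoning” are our own assumptions, not the paper's implementation.

```python
# Illustrative sketch of the notation clarified above: can_n is the n-th
# candidate agent drawn from the candidate-agent set CA, and AS collects the
# candidates' individual results so the manager agent can forward a unified,
# traceable bundle to the later voting, discussion, and summarization stages.
from dataclasses import dataclass

@dataclass
class CandidateResult:
    agent_name: str
    answer: str

CA = ["glaze_agent", "forming_agent", "firing_agent"]  # hypothetical candidate agents

def run_candidate(can_n: str, query: str) -> CandidateResult:
    # Placeholder for the n-th candidate agent's actual reasoning call.
    return CandidateResult(agent_name=can_n, answer=f"{can_n} proposal for: {query}")

query = "Suggest a glaze scheme for a Yaozhou-style ewer."
AS = [run_candidate(can_n, query) for can_n in CA]  # the aggregation set AS
# The manager agent would pass AS as a whole to voting, discussion, and summarization.
```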
Comments 3:
To measure the performance of your technique, you calculate statistical and physical parameters. What is the role of each parameter, and what is its optimal value range?
Response 3:
Thank you for your insightful comment. We are pleased to clarify the role and interpretation of the evaluation parameters used in our study.
For EvalCera (A) and EvalCera (B), the maximum score is 100 points. These scores are derived from the objective correctness of the model’s answers to multiple-choice or true/false questions. Each item is marked as correct or incorrect, and the total accuracy is converted into a score on a 100-point scale. The role of this parameter is to provide an objective measure of the model’s domain knowledge and image analysis capability. The optimal value range is therefore close to 100 points, with higher scores indicating better performance.
For EvalCera (C), the maximum score is 10 points, based on expert evaluations across four subjective dimensions: aesthetic quality, cultural relevance, creativity, and functional plausibility. Each dimension is rated from 1 to 10, and the aggregated result is normalized to a total of 10 points. The role of this parameter is to capture human-centered aspects of image generation quality that cannot be fully reflected by objective accuracy. Here, the optimal value range is also close to the maximum (10 points), with higher scores indicating superior alignment with professional and cultural standards.
In summary, the roles of the parameters are to provide complementary perspectives: (i) objective correctness for knowledge and analysis tasks (A, B), and (ii) expert-driven qualitative judgment for creative tasks (C). In both cases, higher scores represent better model performance, and the theoretical optimal range extends up to the full score.
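For illustration only, a minimal sketch of how the two score types described above could be computed; the function and variable names are ours, not the authors' evaluation code.

```python
# Objective score for EvalCera (A)/(B): per-item correctness converted to a
# 100-point accuracy scale; subjective score for EvalCera (C): the four 1-10
# dimension ratings aggregated onto a 10-point scale (here via a simple mean).

def evalcera_ab_score(is_correct: list[bool]) -> float:
    return 100.0 * sum(is_correct) / len(is_correct)

def evalcera_c_score(dimension_ratings: dict[str, float]) -> float:
    return sum(dimension_ratings.values()) / len(dimension_ratings)

# Example: 85 of 100 items correct -> 85.0; ratings (8, 7, 9, 8) -> 8.0
print(evalcera_ab_score([True] * 85 + [False] * 15))
print(evalcera_c_score({"aesthetic": 8, "cultural": 7, "creativity": 9, "functional": 8}))
```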
Comments 4:
In Table 3, provide the optimal values for EvalAcu and C-Eval.
Response 4:
Thank you for this valuable comment. Because we added an additional table earlier in the manuscript, the original Table 3 has been renumbered as Table 4. You are correct that the previous version did not clearly indicate which results were optimal; in the revised Table 4 we have now highlighted the best values in bold for ease of identification. Inspired by your suggestion, we further distinguish the optimal results under the two training regimes: for the non-fine-tuned group (“/”), GPT-4o is marked as the best-performing model, while for the LoRA fine-tuned group, Qwen2-7B-Ceramic achieves the highest overall score. These clarifications have also been incorporated into the corresponding description in the main text to ensure consistency and clarity. We also sincerely apologize for an earlier oversight: the column originally labeled EvalAcu was a provisional name that we forgot to update; it should in fact be EvalCera.
These revisions can be found in the revised manuscript at the following locations: page 12 (lines 431–440). The revised parts have been highlighted in red in the manuscript for easy identification.
“[Updated text in the manuscript: To evaluate both general-purpose and domain-specific capabilities, we conducted a comparative assessment on the C-Eval benchmark and the domain-specific EvalCera (A) dataset; the results are reported in Table 4. In the non-fine-tuned baseline cohort, GPT-4o achieved the best performance on both evaluations, DeepSeek-V2 performed the worst, and the remaining baselines exhibited relatively small differences. In the LoRA–tuned cohort, we applied supervised fine-tuning (SFT) on our high-quality ceramics dataset with low-rank adaptation, among which Qwen2-7B-Ceramic attained the best overall performance. Considering both training settings, GPT-4o remains the overall top performer, whereas DeepSeek-V2 is the overall worst. Relative to their original C-Eval scores, the fine-tuned models did not exhibit a statistically significant decrease.]”
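For reference, a minimal LoRA supervised fine-tuning sketch in the spirit of the setup described above, assuming the Hugging Face transformers and peft libraries; the base checkpoint name, rank, and target modules are illustrative assumptions rather than the exact configuration used for Qwen2-7B-Ceramic.

```python
# Minimal LoRA SFT setup sketch (not the authors' training script).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2-7B-Instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)  # would be used to build the SFT dataloader
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common LoRA choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```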
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The EvalCera (B) image analysis dataset in the experiment contained only 127 samples. For a comprehensive assessment of the model's ability to identify multiple kiln systems (such as Ru Kiln, Yaozhou Kiln, Guan Kiln, Ding Kiln, Cizhou Kiln, Jun Kiln, Jingdezhen Kiln, Ge Kiln) and multiple production stages (such as throwing, trimming, glazing, firing, etc.), the sample size may be too small and underrepresentative. It is difficult to ensure that it covers the complexity and diversity of ceramic image analysis.
The ChatCAS multi-agent collaboration framework proposed by the authors in the manuscript (task allocation, analytical recommendations, voting, discussing modifications, and organizing summaries) is its core innovation. However, the paper only validates its effectiveness on specific tasks in the field of ceramics. The framework lacks validation for handling more complex tasks with blurred boundaries, as well as for its applicability and efficiency in other areas.
Author Response
Comments 1:
The EvalCera (B) image analysis dataset in the experiment contained only 127 samples. For a comprehensive assessment of the model's ability to identify multiple kiln systems (such as Ru Kiln, Yaozhou Kiln, Guan Kiln, Ding Kiln, Cizhou Kiln, Jun Kiln, Jingdezhen Kiln, Ge Kiln) and multiple production stages (such as throwing, trimming, glazing, firing, etc.), the sample size may be too small and underrepresentative. It is difficult to ensure that it covers the complexity and diversity of ceramic image analysis.
Response 1:
Thank you for this valuable suggestion. We agree with the reviewer’s comment. In response, we have made the following revisions:
1. Dataset expansion: We have expanded the EvalCera (B) dataset from 127 to 160 carefully selected and annotated image-based items. The items are approximately evenly distributed across eight representative kiln systems (Ru, Yaozhou, Guan, Ding, Cizhou, Jun, Jingdezhen, and Ge). This adjustment enhances representativeness and improves coverage of both kiln identification and production stages (throwing, trimming, glazing, firing, etc.).
2. Visualizations: To provide a clearer view of the dataset composition, we added two distribution figures: a pie chart for EvalCera (B) showing the distribution across the eight kiln systems, and a pie chart for EvalCera (C) showing the proportions of the three creative-element sources (Anime / Natural / Modern cultural); a minimal plotting sketch is provided at the end of this response.
These new plots appear as Figure 1(b) and Figure 1(c) in the revised manuscript (see Page 4), and they are referenced in Section 2.1.2 and Section 2.1.3, respectively.
3. Discussion of limitations (expanded): In the Discussion section, we explicitly acknowledge that the current dataset size remains relatively small and may still lack full representativeness. In addition, we note that the three-way categorization adopted for EvalCera (C) (Anime / Natural / Modern cultural) is preliminary and may not perfectly capture all nuances or avoid category overlap (e.g., meme motifs that reference natural elements). In the Conclusion section, we outline our plan to expand the scale of EvalCera, develop a more fine-grained taxonomy of creative categories, involve a larger and more diverse group of experts, and adopt inter-rater reliability metrics to enhance evaluation credibility.
These revisions can be found in the revised manuscript on page 16, section 4-5 (lines 497-539). The revised parts have been highlighted in red in the manuscript for easy identification.
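As referenced above, a minimal matplotlib sketch of the kind of kiln-distribution pie chart added as Figure 1(b); the per-kiln counts are illustrative placeholders, not the actual EvalCera (B) statistics.

```python
import matplotlib.pyplot as plt

kilns = ["Ru", "Yaozhou", "Guan", "Ding", "Cizhou", "Jun", "Jingdezhen", "Ge"]
counts = [20, 20, 20, 20, 20, 20, 20, 20]  # assumed even split of the 160 items

plt.figure(figsize=(5, 5))
plt.pie(counts, labels=kilns, autopct="%1.0f%%", startangle=90)
plt.title("EvalCera (B): distribution across eight kiln systems")
plt.tight_layout()
plt.savefig("evalcera_b_distribution.png", dpi=300)
```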
Comments 2:
The ChatCAS multi-agent collaboration framework proposed by the authors in the manuscript (task allocation, analytical recommendations, voting, discussing modifications, and organizing summaries) is its core innovation. However, the paper only validates its effectiveness on specific tasks in the field of ceramics. The framework lacks validation for handling more complex tasks with blurred boundaries, as well as for its applicability and efficiency in other areas.
Response 2:
Thank you for this insightful comment. We fully agree that, in its current form, ChatCAS has only been validated on ceramic-specific tasks and lacks evidence on more complex/boundary-blurred settings as well as cross-domain applicability and efficiency. In response, we have revised the Discussion to explicitly acknowledge this limitation and to outline concrete next steps.
These revisions are added in Section 4-5 (Discussion and Conclusion), page 16, (lines 497–539), and have been highlighted in red for easy identification.
"[Updated text in the manuscript:
4. Discussion
This study presents an initial academic exploration of constructing LLMs tailored to the ceramics domain and developing ceramics-specific intelligent agents. Our experimental results highlight the advantages of CeramicGPT and ChatCAS, effectively demonstrating how LLMs can empower the ceramics field. Nevertheless, certain limitations must be acknowledged. First, EvalCera (B) contains only 160 image samples, which are insufficient to fully capture the complexity of diverse kiln systems and production stages. Second, the three-way classification schema of EvalCera (C) remains preliminary and may suffer from category overlap, necessitating sample expansion, expert/user studies, and inter-rater consistency analysis to improve robustness. Third, ChatCAS has thus far been validated only on ceramics-specific tasks; systematic evaluation is still lacking for boundary cases, cross-stage reasoning, and cross-domain applications. Finally, only five raters were invited for the qualitative evaluation; while their expertise lends authority to the assessment, a broader and more diverse reviewer group is still needed to reduce bias.
Overall, this study validates the value of integrating domain-specific data, specialized models, and process-oriented multi-agent collaboration in the ceramics domain. The research project has been implemented at the Yaozhou porcelain Livestreaming Base in Tongchuan City, Shaanxi Province, China, aiming to address challenges in transmitting the Yaozhou porcelain intangible cultural heritage as part of Tongchuan’s city IP—namely, inheritance disruption, insufficient innovation, and inadequate integration with cultural consumption. The livestreaming platform can collect personalized requests and comments from viewers in real time, and our LLM rapidly generates multiple corresponding customized ceramic patterns and design proposals. Upon customer confirmation and order placement, production is immediately initiated. This framework integrates technological innovation, cultural dissemination, and market demand, establishing a brand-new business model for ceramics.
5. Conclusion
This study makes significant progress in applying LLMs to the ceramics domain. By developing EvalCera, the first specialized dataset for ceramic knowledge, image analysis, and generation, we assessed the capabilities of state-of-the-art models, highlighting their limitations in the ceramics field. In response, we introduced CeramicGPT, a domain-specific LLM that outperformed general models in ceramic knowledge and image recognition tasks. We also developed ChatCAS, a multi-agent system powered by CeramicGPT and GPT-4o. Evaluation results show that our model and agents achieve the best performance on EvalCera (A) and (B) text tasks as well as (C) image generation tasks.
In future work on datasets, we plan to expand the scale of EvalCera, develop a more fine-grained taxonomy of creative categories, involve a larger and more diverse group of experts, and adopt inter-rater reliability metrics to enhance evaluation credibility. For ChatCAS, we will improve task allocation, voting, and discussion processes through strategy optimization and ablation studies, introduce confidence estimation and human–AI collaboration mechanisms to enhance efficiency and reliability, and develop ChatCAS 2.0 to further optimize ceramic design generation and extend the multi-agent framework to other specialized domains.]"
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
1. The manuscript emphasizes the claim of being “the first domain-specific large model and multi-agent system for ceramics.” However, compared with existing vertical-domain LLMs (e.g., in medicine and education), the methodological novelty is limited; the work appears more as a domain adaptation effort. The authors need to further clarify the unique aspects of their contribution.
2. The “voting–discussion–summarization” mechanism in ChatCAS is quite similar to existing multi-agent frameworks (e.g., AutoGen, CAMEL). The authors should explicitly highlight the differences and advantages.
3. All reported accuracy metrics are given as point estimates only. The absence of multiple runs with mean ± standard deviation or confidence intervals makes it difficult to judge result stability.
The image generation evaluation is based on two experts’ ratings, which is reasonable in terms of criteria, but lacks significance testing and user studies to strengthen reliability.
The comparison includes only OpenAI models and a few base models; additional benchmarks against other open-source domain-specific LLMs or multi-agent systems are missing.
4. The description of EvalCera’s question sources is too general. It is recommended to provide detailed distributions, representative samples, and confirmation of the completeness of the open-source link.
5. Ceramic image data mainly come from museums and Baidu Baike; copyright and data-cleaning standards should be clarified.
Although a code link is provided, the necessary elements for reproducibility—such as data split scripts, random seeds, and hardware configurations—are not sufficiently transparent.
6. Some sections are overly lengthy, especially the methodology parts (Sections 2.2 and 2.3). The authors are advised to streamline these and emphasize the core ideas.
Figure numbering and referencing need to be more standardized; for example, Figures 3 and 4 lack statistical explanations.
7. Certain expressions are too informal (e.g., “Ok, I’ll assign tasks” in Figure 2) and should be revised into a consistent academic style.
8. The Abstract and Conclusion contain too many quantitative results; these could be condensed to better highlight the contributions and insights.
9. Revised last paragraph of Introduction (section organization) should be:The remainder of this paper is organized as follows. Section 2 introduces the proposed methodology, including the construction of the EvalCera dataset, the development of CeramicGPT, and the multi-agent framework ChatCAS. Section 3 presents the experimental setup and evaluation results, analyzing the performance of general-purpose LLMs, CeramicGPT, and ChatCAS in the ceramics domain. Section 4 discusses the main findings, potential applications, and limitations of this study. Finally, Section 5 concludes the paper and outlines directions for future work.
Author Response
Comments 1:
The manuscript emphasizes the claim of being “the first domain-specific large model and multi-agent system for ceramics.” However, compared with existing vertical-domain LLMs (e.g., in medicine and education), the methodological novelty is limited; the work appears more as a domain adaptation effort. The authors need to further clarify the unique aspects of their contribution.
Response 1:
Thank you for this constructive suggestion. We agree that it is necessary to further articulate the uniqueness of our contributions. Below is a point-by-point clarification that makes explicit our originality beyond domain adaptation:
(i) Data contribution: scarce, expert-curated, and open
Ceramics is a craft domain grounded in lineage and transmission, where core know-how cannot be obtained through simple web crawling. Our training and evaluation data were curated over 3–4 months in an expert-led effort headed by a twelfth-generation inheritor of the Yaozhou-ware lineage (the first author) together with Mr. Li Jinping, a provincial master craftsman, and other practitioners. The resulting dataset is high-quality and highly specialized; it faithfully reflects workshop practice and kiln-style semantics and cannot be reproduced by generic scraping. We have opened these data to ensure transparency and reusability. The data construction itself constitutes a substantive, non-trivial contribution that goes well beyond routine domain adaptation.
(ii) Method contribution: an original multi-agent architecture tailored to ceramics
Our multi-agent system is not a simple reuse of generic “multi-turn/voting” templates; rather, it is purpose-built around the ceramics production workflow. Roles, hand-offs, and consensus mechanisms are mapped to real decision points and process constraints, covering key stages such as raw-material selection, forming, trimming, glazing, firing, and inspection.
On the generative side, we couple a ceramics-specialized model with GPT-4o to enforce kiln-style fidelity and manufacturability in image generation, instead of relying on unconstrained general-purpose prompts. This mechanism is explicitly customized for the ceramics domain and is central to our gains on both reasoning and generation tasks.
We hope this clarification addresses the concern that the work is merely domain adaptation and further illustrates its unique value.
Comments 2:
The “voting–discussion–summarization” mechanism in ChatCAS is quite similar to existing multi-agent frameworks (e.g., AutoGen, CAMEL). The authors should explicitly highlight the differences and advantages.
Response 2:
Thank you for the suggestion. We agree that the differences and advantages should be made explicit. We have therefore added a short clarification at the beginning of Section 2.3 (ChatCAS) explaining why a ceramics-specific design is needed [p. 6, Section 2.3, lines 240–251]. In particular:
What is different in ChatCAS
- Workflow-grounded roles: Agent roles and hand-offs are explicitly aligned with real ceramic production stages (raw materials → forming → trimming → glazing → firing → inspection), rather than generic “assistant/critic” abstractions.
- Reasoning–generation integration: Text-based reasoning agents collaborate with an image-generation agent, and outputs are evaluated not only for textual coherence but also for kiln-style fidelity and manufacturability.
- Domain-specific validation: Structured checks (e.g., style conformity, glaze and firing plausibility) regulate acceptance after collaborative discussion, ensuring technical feasibility.
- Expert knowledge grounding: All agents are guided by an expert-curated ceramics corpus, reducing hallucinations and keeping outputs consistent with workshop practice.
- Orchestrated multi-turn reasoning: ChatCAS organizes multi-turn reasoning among domain-specialized agents, ensuring that ceramic knowledge flows smoothly into concrete, executable decisions rather than remaining as abstract discussion.
Why this helps
- Better style consistency and functional plausibility in generated designs.
- More transparent outputs via checklists and plans that can be audited.
- Greater robustness to off-domain prompts due to domain constraints.
“[updated text in the manuscript: Despite rapid advances in general-purpose multi-agent frameworks such as AutoGen\cite{wu2024autogen} and CAMEL\cite{li2023camel}, systematic adaptation for the ceramics domain remains clearly lacking. On the one hand, there is little domain-specific guidance for image-generation models, making it difficult to produce high-quality ceramic images consistently. On the other hand, there is no effective way to organize multi-turn reasoning in ceramic models, so domain knowledge cannot flow smoothly into executable decisions. To close this gap, we propose ChatCAS, in which agent roles and handoffs map directly to real ceramics production stages—raw materials, forming, trimming, glazing, firing, and inspection—rather than relying on generic “assistant/critic” pairs. Based on this mapping, ChatCAS provides guidance for image generation and delivers an end-to-end path translating knowledge, reasoning, and decisions into practice.
This section describes our proposed collaboration framework, ChatCAS, powered by CeramicGPT and GPT-4o. The symbols in this subsection are summarized in Table~\ref{tab:notation-chatcas}. An overview of the pipeline and an illustrative example are shown in Figure~\ref{fig:maincomponents}. The ChatCAS framework has two components: task assignment driven by CeramicGPT and question answering powered by CeramicGPT and GPT-4o. These two components are further divided into five sub-stages: task assignment, analysis of recommendations, voting, discussion and modification, and final summarization. Each stage corresponds to a production step in ceramics, ensuring that user intent is translated into transparent and reliable decisions. The specific mechanisms of each stage are elaborated in the following subsections.]”
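To make the quoted five-stage workflow concrete, the following schematic, self-contained toy sketch illustrates one possible orchestration; the agent roles, the toy voting rule, and the string-based “reasoning” are our own assumptions, not the released ChatCAS implementation.

```python
# Toy orchestration of the five sub-stages: task assignment, analysis of
# recommendations, voting, discussion and modification, and final summarization.
from collections import Counter

class ToyAgent:
    def __init__(self, name):
        self.name = name

    def analyze(self, task):                      # stage 2: analysis of recommendations
        return f"{self.name}: recommendation for '{task}'"

    def vote(self, proposals):                    # stage 3: each agent votes for one proposal
        return min(proposals, key=len)            # toy rule: prefer the shortest proposal

    def revise(self, proposal):                   # stage 4: discussion and modification
        return proposal + f" (checked by {self.name})"

def chatcas_pipeline(user_request, agents):
    tasks = [f"{user_request} [subtask {i + 1}]" for i in range(len(agents))]   # stage 1
    proposals = [agent.analyze(task) for agent, task in zip(agents, tasks)]
    winner, _ = Counter(agent.vote(proposals) for agent in agents).most_common(1)[0]
    for agent in agents:
        winner = agent.revise(winner)
    return f"Summary: {winner}"                   # stage 5: final summarization

agents = [ToyAgent("forming"), ToyAgent("glazing"), ToyAgent("firing")]
print(chatcas_pipeline("Design a Yaozhou-style ewer with lotus motifs", agents))
```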
Comments 3:
All reported accuracy metrics are given as point estimates only. The absence of multiple runs with mean ± standard deviation or confidence intervals makes it difficult to judge result stability. The image generation evaluation is based on two experts’ ratings, which is reasonable in terms of criteria, but lacks significance testing and user studies to strengthen reliability.
The comparison includes only OpenAI models and a few base models; additional benchmarks against other open-source domain-specific LLMs or multi-agent systems are missing.
Response 3:
We sincerely appreciate your thoughtful and constructive comments. To address your concerns about stability, statistical rigor, and baseline coverage, we have refined our methodology and result presentation, and updated the figures and content in the main text accordingly.
First, to remedy the earlier issue of reporting only point estimates, we re-evaluated each model on EvalCera (A) and EvalCera (B) with multiple independent runs. We now report mean ± standard deviation, and we have added error bars to the relevant figures to visualize variability across runs.
Second, for EvalCera (C) (the expert-rated text-to-image evaluation), we initially established a performance ceiling using the two strongest general-purpose text-to-image models (o3-mini and GPT-4o). Following your suggestion, we have added Doubao and Qwen-Image to broaden the comparison and expanded the review panel from 2 to 5 raters (two ceramic experts, two art historians, and one lay user). In addition, we upgraded the rating scale from 1–5 to 1–10 and had the five raters re-score under the new scale, thereby further improving the discriminability, robustness, and reliability of the ratings. The results show that across the four evaluation dimensions, the two newly added models have overall lower average scores than ChatCAS.
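For illustration, a minimal sketch of the multiple-run aggregation and error-bar plotting described in the first point above; the per-run scores are placeholders, not our measured results.

```python
import numpy as np
import matplotlib.pyplot as plt

runs = {  # EvalCera (A) scores over independent runs (placeholder values)
    "GPT-4o": [78.1, 77.4, 79.0],
    "Qwen2-7B-Ceramic": [82.3, 81.7, 83.0],
}
names = list(runs)
means = [np.mean(runs[m]) for m in names]
stds = [np.std(runs[m], ddof=1) for m in names]  # sample standard deviation

plt.bar(names, means, yerr=stds, capsize=4)
plt.ylabel("EvalCera (A) score (mean ± SD)")
plt.tight_layout()
plt.savefig("evalcera_a_errorbars.png", dpi=300)
```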
These revisions can be found in the revised manuscript at the following locations: page 10, Section 3.1.2 (lines 364–392) and Figure 3a,b; and page 15, Section 3.3.2 (lines 470–496) and Figure 5a,b. The revised parts have been highlighted in red in the manuscript for easy identification.
These updates improve the readability and robustness of the results and more fully address your concerns regarding stability and the breadth of comparisons. Thank you again for your valuable feedback, which has helped us substantially strengthen the rigor and completeness of the manuscript.
“[updated text in the manuscript :
3.1.2. Performance of existing LLMs in EvalCera (C)
We designed a set of subjective evaluation criteria for ceramic image generation, referred to as EvalCera (C), to assess the image generation capabilities of existing LLMs. The evaluation is divided into four dimensions:
Aesthetic Quality assesses the visual appeal, balance, and artistic harmony of the generated ceramic images. This includes shape, texture, proportion, and color coordination as perceived by experienced ceramic designers.
Cultural Relevance evaluates the extent to which the image reflects traditional or contemporary ceramic styles, motifs, and symbolism, and whether it is rooted in relevant cultural contexts.
Creativity measures the originality, innovation, and uniqueness of the design, including unexpected forms, patterns, or conceptual approaches that go beyond standard templates.
Functional Plausibility considers whether the generated object appears realistically manufacturable and usable as a ceramic item, with appropriate structural features (e.g., base, handle, spout) and accurate physical proportions.
Each dimension is rated on a scale from 1 to 10 (1–2 poor, 3–4 fair, 5–6 adequate, 7–8 good, 9–10 excellent). The total score for each image is the sum of the four dimensions, with a maximum possible score of 40 points. We invited a five-member panel—two ceramic design experts, two art historians, and one lay user—to independently evaluate images generated by different models on these four dimensions. The evaluation results reveal that current models exhibit noticeable shortcomings in several areas. As shown in Figure 3b, o3-mini maintains relatively stable performance in terms of detail representation; however, its scores on cultural relevance and creativity are noticeably lower, at only 6 and 7 points, respectively. This suggests that the model has not yet effectively acquired or integrated professional ceramic cultural knowledge. In contrast, GPT-4o demonstrates a more balanced performance across all four dimensions, while Doubao and Qwen-Image generally receive lower scores, with particularly pronounced weaknesses in functional plausibility and cultural relevance.
3.3.2. Performance of ChatCAS models in EvalCera (C)
Figure 6 shows examples of the performance of three different models under prompts combining Yaozhou ceramics with Naruto, Nezha, and Pikachu elements. Using the same subjective evaluation method, we tested the image generation capabilities of ChatCAS, and the results are shown in the figure. From the results, we can see that ChatCAS has made significant improvements in image generation. First, the professional prompt capabilities provided by CeramicGPT make the generated images better match the kiln's background, avoiding the hallucination problem seen in previous generations, where the image did not correspond to the kiln. This is due to CeramicGPT’s expertise in ceramics and its precise prompt support, enabling ChatCAS to understand and generate images that more accurately reflect the actual background. Additionally, the multi-agent collaborative discussion mechanism of ChatCAS further enhances the quality of image generation. Under this mechanism, multiple agents collaborate and discuss in real time during the generation process, gradually adjusting and optimizing the process to ensure that every step of the image generation aligns with the design requirements, leading to higher-quality images step by step. In the illustration, we can clearly see that elements such as Naruto, the Wind Fire Wheels, and the Hun Tian Ling are accurately and skillfully integrated into the images using the Yaozhou ware Fengming ewer as the foundation.
This collaborative mechanism not only improves the accuracy of image generation but also enhances the handling of details, ultimately resulting in more refined and practically suitable images.
As shown in Figure 5b, in the subjective evaluation of ceramic image generation, ChatCAS exhibited the most outstanding overall performance. The model received a score of 8 in both Aesthetic Quality and Cultural Relevance, 7 in Creativity, and a full score of 10 in Functional Plausibility. These results indicate that, under the guidance of CeramicGPT, ChatCAS not only generates ceramic images with superior visual appeal and cultural alignment but also consistently produces structurally sound and practically applicable designs, thereby achieving the best overall performance.]”
Comments 4:
The description of EvalCera’s question sources is too general. It is recommended to provide detailed distributions, representative samples, and confirmation of the completeness of the open-source link.
Response 4:
Thank you for your detailed suggestions concerning the sources of EvalCera items, the presentation of their distributions, and the completeness of the open-source links. We have made the following revisions and clarifications:
1. Item Sources and Design Principles (EvalCera (C))
Grounded in China’s Five Famous Kilns, we constructed 72 subjective text-to-image items for EvalCera (C). The prompt design follows two principles: (i) Anchoring in authoritative ceramic aesthetics to reflect established craft workflows and aesthetic norms; (ii) Systematically introducing contemporary cultural elements to ensure creative diversity and modern relevance.
Specifically, we referred to The History of Chinese Ceramics and the ceramic-craft textbooks in the National Occupational Skill Standards to ensure professional rigor; we also incorporated practice cases from provincial ceramic master Li Jinping used in ceramic design and skills-competition instruction, thereby enhancing creativity and manufacturability. All prompts were manually curated and vetted by experts to ensure that, while integrating diverse creative elements, they remain faithful to ceramic design logic and practical producibility.
2. Category Taxonomy and Distribution Display
Creative elements are organized into three categories: anime (e.g., ninja attire, Sharingan, Rasengan, Doraemon’s four-dimensional pocket, Pikachu), natural (e.g., lotus, peony, the “Four Gentlemen” set of plum, orchid, bamboo, and chrysanthemum, and cloud motifs), and modern cultural (e.g., emoji, meme templates, pixelated QR codes, kaomoji).
The distributions of these three categories are shown in Fig. 1(c) to help readers grasp the thematic and stylistic coverage. Inspired by your comment, we also added the distribution of EvalCera (B) in Fig. 1(b) to further improve transparency regarding data sources and composition.
3. Completeness of the Open-Source Links
We have audited and supplemented the open-source links to ensure that all required files and manifests are complete, the links are accessible, and the resources are convenient for reproduction.
These revisions can be found in the revised manuscript at the following locations: page 4, Sections 2.1.2 and 2.1.3 (lines 132–156) and Figure 1(b),(c). The revised parts have been highlighted in red in the manuscript for easy identification.
Thank you again for your valuable suggestions. These updates enhance the transparency and reproducibility of the dataset and help readers more fully understand the composition and intended use of EvalCera.
“[updated text in the manuscript :
2.1.2 Ceramic Image Analysis
The images used in this study primarily come from the collections of ceramic museums, the Baidu Baike image database (https://image.baidu.com/), and practical teaching cases compiled by Li Jinping. We manually selected high-quality images that clearly display the key elements required for judgment and annotated them by hand. The test questions include identifying the origin of ceramics (such as Ru kiln, Yaozhou kiln, Guan kiln, Ding kiln, Cizhou kiln, Jun kiln, Jingdezhen kiln, and Ge kiln), as well as recognizing different stages of ceramic craftsmanship (such as throwing, trimming, glazing, firing, etc.). The purpose of constructing these multiple-choice questions using the above methods is to comprehensively evaluate the capabilities of LLMs in image understanding and analysis within the field of ceramics. There are a total of 160 annotated image-based items, forming EvalCera (B), as shown in Figure 1b.
2.1.3 Ceramic Image Generation
Based on the five famous kilns of China, we constructed 72 subjective image-generation items for EvalCera (C). As illustrated in Figure~\ref{fig:Evaluation}c, to enable systematic evaluation, the prompts are organized by creative-element source into three categories: (i) anime elements (e.g., ninja attire, Sharingan, Rasengan, and hand signs from Naruto; the four-dimensional pocket, bamboo-copter, and time machine from Doraemon; Pikachu, electric symbols, Poké Balls, and evolution stones from Pokémon; as well as the Wind Fire Wheels, Hun Tian Ling, and lotus pedestal from Nezha); (ii) natural elements (e.g., floral motifs such as lotus, peony, and chrysanthemum; bamboo–plum–orchid–chrysanthemum sets; cloud and ruyi patterns; mountains, rivers, and animal forms); and (iii) modern cultural elements (e.g., emoji/sticker sets, meme-style compositions, pixelated QR codes, and kaomoji). Each prompt requires integrating traditional ceramic aesthetics with the specified elements to generate ceramic designs that are both visually appealing and practically manufacturable.]”
Comments 5:
Ceramic image data mainly come from museums and Baidu Baike; copyright and data-cleaning standards should be clarified. Although a code link is provided, the necessary elements for reproducibility—such as data split scripts, random seeds, and hardware configurations—are not sufficiently transparent.
Response 5:
We agree. We have accordingly revised the manuscript to clarify the copyright and data-cleaning standards and to make the reproducibility elements fully transparent. Specifically, we now provide explicit details of the data split scripts, random seeds, and hardware configurations in our open-source repository (https://github.com/HanYongyi/HYY).
In the revised manuscript, this information has been added in the Methodology section (Page 11, Section 3.2, Lines 412–474) and further emphasized in the Data Availability Statement (Page 17, Lines 546–553). The revised parts have been highlighted in red in the manuscript for easy identification.
[Updated text in the manuscript]
“To ensure reproducibility, we have released all necessary resources, including data split scripts, random seed settings, and hardware configurations, in our GitHub repository (https://github.com/HanYongyi/HYY). In addition, copyright and data-cleaning standards for images sourced from museums and Baidu Baike have been clarified. These resources collectively enable transparent and reproducible experiments for all reported results.”
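For illustration, a minimal reproducibility sketch covering seed fixing and a deterministic data split; the file names and the 90/10 split ratio are assumptions for this example, not necessarily those used in the repository.

```python
import json
import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Hypothetical corpus file; one JSON object per line.
with open("ceramics_corpus.jsonl", encoding="utf-8") as f:
    items = [json.loads(line) for line in f]

random.shuffle(items)                       # deterministic given the fixed seed
split = int(0.9 * len(items))               # assumed 90/10 train/validation split
subsets = {"train.jsonl": items[:split], "val.jsonl": items[split:]}

for name, subset in subsets.items():
    with open(name, "w", encoding="utf-8") as f:
        for item in subset:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```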
Comments 6:
Some sections are overly lengthy, especially the methodology parts (Sections 2.2 and 2.3). The authors are advised to streamline these and emphasize the core ideas. Figure numbering and referencing need to be more standardized; for example, Figures 3 and 4 lack statistical explanations.
Response 6:
Thank you for the constructive comments. We have carefully streamlined Sections 2.2 and 2.3 to make the methodology more concise and focused on the core ideas, thereby improving readability. In addition, inspired by your suggestion, we have merged Figures 3 and 4 into a single figure, and Figures 6 and 7 into another single figure, which not only reduces redundancy but also saves space and improves the overall presentation. We have also standardized all figure numbering and cross-referencing throughout the manuscript to ensure consistency and clarity. Moreover, the merged Figure 3 now includes statistical explanations to improve interpretability. The updated versions can be found in the revised manuscript on Page 9 (Figure 3), Page 15 (Figure 6), and the corresponding text references.
These revisions can be found in the revised manuscript at the following locations: page 4-8, Section 2.2 and 2.3 (lines 157–303). The revised parts have been highlighted in red in the manuscript for easy identification.
Comments 7:
Certain expressions are too informal (e.g., “Ok, I’ll assign tasks” in Figure 2) and should be revised into a consistent academic style.
Response 7:
Thank you for pointing this out. We have revised the informal expression in Figure 2 and replaced it with a more academic wording. Furthermore, we carefully reviewed the entire manuscript to ensure that all expressions follow a consistent academic style. The updated version can be found in the revised manuscript (Figure 2, Page 7) and throughout the text where applicable.
The revised parts have been highlighted in red in the manuscript for easy identification.
Comments 8:
The Abstract and Conclusion contain too many quantitative results; these could be condensed to better highlight the contributions and insights.
Response 8:
Thank you for the helpful suggestion. We have revised both the Abstract and the Conclusion to streamline the quantitative details, focusing instead on highlighting the main contributions and insights of the work. The updated and more concise versions can be found in the revised manuscript (Abstract, Page 1; Conclusion, Page 17).
The revised parts have been highlighted in red in the manuscript for easy identification.
“[updated text in the manuscript :
Abstract:
Many traditional ceramic techniques are inscribed on UNESCO’s Intangible Cultural Heritage lists, yet expert scarcity, long training cycles, and stylistic homogenization impede intergenerational transmission and innovation. Although large language models offer new opportunities, research tailored to ceramics remains limited. To address this gap, we first construct EvalCera, the first open-source domain large language model evaluation dataset for ceramic knowledge, image analysis, and generation, and conduct large-scale assessments of existing general large language models on ceramic tasks, revealing their limitations. We then release the first ceramics-focused training corpus for large language models and, using it, develop CeramicGPT, the first domain-specific large language model for ceramics. Finally, we build ChatCAS, a workflow multi-agent system built on CeramicGPT and GPT-4o. Experiments show that our model and agents achieve the best performance on EvalCera (A) and (B) text tasks as well as (C) image generation. Code is available at: https://github.com/HanYongyi/HYY.
Conclusion:
This study makes significant progress in applying LLMs to the ceramics domain. By developing EvalCera, the first specialized dataset for ceramic knowledge, image analysis, and generation, we assessed the capabilities of state-of-the-art models, highlighting their limitations in the ceramics field. In response, we introduced CeramicGPT, a domain-specific LLM that outperformed general models in ceramic knowledge and image recognition tasks. We also developed ChatCAS, a multi-agent system powered by CeramicGPT and GPT-4o. Evaluation results show that our model and agents achieve the best performance on EvalCera (A) and (B) text tasks as well as (C) image generation tasks.
In future work on datasets, we plan to expand the scale of EvalCera, develop a more fine-grained taxonomy of creative categories, involve a larger and more diverse group of experts, and adopt inter-rater reliability metrics to enhance evaluation credibility. For ChatCAS, we will improve task allocation, voting, and discussion processes through strategy optimization and ablation studies, introduce confidence estimation and human–AI collaboration mechanisms to enhance efficiency and reliability, and develop ChatCAS 2.0 to further optimize ceramic design generation and extend the multi-agent framework to other specialized domains.]”
Comments 9:
Revised last paragraph of Introduction (section organization) should be:The remainder of this paper is organized as follows. Section 2 introduces the proposed methodology, including the construction of the EvalCera dataset, the development of CeramicGPT, and the multi-agent framework ChatCAS. Section 3 presents the experimental setup and evaluation results, analyzing the performance of general-purpose LLMs, CeramicGPT, and ChatCAS in the ceramics domain. Section 4 discusses the main findings, potential applications, and limitations of this study. Finally, Section 5 concludes the paper and outlines directions for future work.
Response 9:
We sincerely thank the reviewer for this valuable guidance. Following the suggestion, we have revised the last paragraph of the Introduction accordingly, making the overall structure of the paper clearer and more logically organized. The updated section organization statement can be found in the revised manuscript (Introduction, Page 3, Lines 87-93).
The revised parts have been highlighted in red in the manuscript for easy identification.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The article presents an interesting innovation, with the development of a specialized LLM (CeramicGPT) and a multi-agent system (ChatCAS) for the ceramics sector. The authors constructed their own evaluation dataset (EvalCera) covering knowledge, image analysis, and image generation. The code is also available, which is something that provides reliability and repeatability.
The work focuses exclusively on Chinese ceramics, which gives it a clear application and scope. From a technological point of view, the paper is well analyzed and well documented, and the technology readiness is also at a very good level.
However, something that could be improved is the evaluation of the qualitative criteria for scoring the final results. Specifically, for the qualitative assessment, two experts were invited to offer their opinion. This certainly adds weight to the evaluation, because it adds human judgment beyond numerical metrics. In fields such as aesthetics or cultural value, a broader group of judges or cross-referencing with audiences of different profiles (e.g., artists, art historians, users) is usually required. For the time being and for this research paper, the number of evaluators/experts could be considered acceptable. However, for future research I would recommend that the authors expand their group of expert evaluators.
To sum up, I really enjoyed reading this paper, which is a technically noteworthy contribution with specific targets and goals. The livestreaming platform that collects personalized requests and comments from viewers in real time is also a great idea for integrating technology with market needs.
Author Response
Comments 1:
However, something that could be improved is the evaluation of the qualitative criteria for scoring the final results. Specifically, for the qualitative assessment, two experts were invited to offer their opinion. This certainly adds weight to the evaluation, because it adds human judgment beyond numerical metrics. In fields such as aesthetics or cultural value, a broader group of judges or cross-referencing with audiences of different profiles (e.g., artists, art historians, users) is usually required. For the time being and for this research paper, the number of evaluators/experts could be considered acceptable.
Response 1 :
Thank you very much for this thoughtful suggestion. We fully agree that qualitative assessment in domains involving aesthetics and cultural value benefits from a broader and more diverse judging panel. In the original submission, EvalCera (C) used four criteria (Aesthetic Quality, Cultural Relevance, Creativity, Functional Plausibility) rated on a 1–5 scale by two ceramic design experts, as described in Section 3.1.2 of the manuscript. We have now revised the protocol accordingly.
What we changed:
1. Expanded judging panel (2 → 5): In addition to the two ceramic design experts, we invited two art historians (specializing in material culture and Chinese ceramics) and one lay user representing end-user perspectives. All judges provided informed consent, and identities are anonymized.
2. Richer scale (5 → 10 points): We retained the same four dimensions but adopted a 10-point anchored scale (1–2 poor, 3–4 fair, 5–6 adequate, 7–8 good, 9–10 excellent) to increase sensitivity. We re-scored all images under the new scale rather than linearly mapping the prior scores.
Where the manuscript was revised:
Section 3.1.2, page 10 (lines 364–392): updated panel composition, 10-point rubrics, judge backgrounds, and the analysis plan, including inter-rater reliability (a minimal agreement-computation sketch follows this list).
Section 3.3.1, page 14 (lines 470–496): updated descriptive statistics under the new protocol; text adjusted accordingly.
Figure 3 and Figure 5 (revised): reflect the 10-point scores.
Sections 4 and 5: further clarified the current limitations of this study and outlined future research plans.
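As referenced above, a minimal sketch of how agreement among the five raters could be summarized; the ratings matrix is a placeholder rather than the study's data, and mean pairwise Pearson correlation stands in for a formal reliability coefficient.

```python
import numpy as np

# Rows: 5 raters (2 ceramic experts, 2 art historians, 1 lay user);
# columns: images; values: 1-10 scores for one evaluation dimension.
ratings = np.array([
    [8, 7, 9, 6, 8],
    [7, 7, 8, 6, 9],
    [8, 6, 9, 7, 8],
    [7, 7, 8, 6, 8],
    [6, 7, 9, 5, 8],
], dtype=float)

per_image_mean = ratings.mean(axis=0)             # aggregated score per image
corr = np.corrcoef(ratings)                       # rater-by-rater Pearson correlations
pairwise = corr[np.triu_indices_from(corr, k=1)]  # off-diagonal upper triangle
print("Per-image mean scores:", per_image_mean)
print("Mean pairwise inter-rater correlation:", round(pairwise.mean(), 3))
```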
We appreciate your constructive recommendation. We believe these changes strengthen the ecological validity and robustness of the qualitative evaluation, and we thank you for helping us improve the study.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
No other concerns.