A Prompt Optimization System Based on Center-Aware Textual Gradients
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Embedding the 'text gradient' into the semantic space and selecting the optimal prompt through the center is essentially an incremental improvement of the ProTeGi framework; no new paradigm is proposed.
- The "center selection" strategy is similar to the prototype network or clustering center idea (reference 31), but lacks theoretical innovation, Semantic stability (reference 24) and embedding space optimization (TextGrad et al.) have been widely studied, but this paper only applies them to prompt engineering and does not address the fundamental bottleneck
- The lack of a clear distinction between the goals of "prompt optimization" and "prompt generalization" biases the experimental design towards performance rather than interpretability.
- Section 3.1 (Background) and Section 3.2 (ProTeGi EMB) both describe the ProTeGi framework; compressing the repeated material would make the logic more compact.
- Algorithm 1 (pseudocode) does not reflect the details of the "Top-K extension" (e.g., how the Top-K gradients are merged), which is inconsistent with the main text.
- The use of cosine similarity and a mean center assumes that the semantic space is isotropic, but the actual embedding space may be non-uniformly distributed (e.g., gradients on the LIAR dataset clustering due to political bias), leading to center shift (a minimal sketch of this selection step is given after this list).
- The sensitivity to alpha (80%) is not verified; the chosen value may be overfitted to specific tasks.
- Only selecting gradients rather than generating them limits the ability to explore new semantic directions (as discussed in Section 6.2); comparison methods such as OPRO's iterative optimization are more dynamic.
- The conclusion states that the 'semantic center is a reliable signal', but Figure 3 shows that in 30% of cases the optimal gradient is not ranked Top-1, so the claimed advantage is probabilistic rather than certain.
- The impact of human annotation bias on Kappa for ETHOS (a subjective task) is not discussed, which may mislead readers into believing that the performance improvement is solely due to prompt optimization.
- The Kappa increase of ProTeGi EMB in Table 3 (e.g., 0.485 vs. 0.457 on LIAR) is not accompanied by a significance test (e.g., a t-test) and may be due to random fluctuation (a sketch of such a test is given after this list).
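A minimal sketch (not the authors' implementation) of the center-aware selection step discussed above: embed the candidate textual gradients, take their mean as the center, and keep the Top-K gradients closest to it by cosine similarity. The embedding function is passed in as a parameter, and the merge-by-concatenation helper at the end is purely an assumption about how the Top-K gradients could be combined.

```python
import numpy as np

def select_top_k_gradients(gradient_texts, embed_fn, k=3):
    """Return the k gradient texts whose embeddings lie closest to the mean center."""
    embs = np.array([embed_fn(t) for t in gradient_texts])    # shape (n, d)
    center = embs.mean(axis=0)                                # mean center of the batch
    # Cosine similarity to the center; this implicitly assumes the embedding
    # space is roughly isotropic, which is exactly the concern raised above.
    sims = embs @ center / (np.linalg.norm(embs, axis=1) * np.linalg.norm(center) + 1e-12)
    top_idx = np.argsort(-sims)[:k]
    return [gradient_texts[i] for i in top_idx]

def merge_gradients(selected):
    # One simple way to "merge" the Top-K gradients is to concatenate them into a
    # single feedback string before the prompt-rewriting step (an assumption; the
    # paper's Algorithm 1 should state the actual merge rule).
    return "\n".join(f"- {g}" for g in selected)
```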
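A hedged sketch of the significance check requested in the last comment: a paired t-test, plus a Wilcoxon signed-rank test as a non-parametric fallback, over per-seed Kappa scores of the baseline vs. ProTeGi EMB. The score arrays below are hypothetical placeholders, not values from the paper.

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-seed Kappa scores for illustration only.
kappa_protegi     = [0.451, 0.460, 0.455, 0.462, 0.457]
kappa_protegi_emb = [0.478, 0.490, 0.481, 0.488, 0.485]

t_stat, p_t = ttest_rel(kappa_protegi_emb, kappa_protegi)
w_stat, p_w = wilcoxon(kappa_protegi_emb, kappa_protegi)
print(f"paired t-test: t={t_stat:.3f}, p={p_t:.4f}")
print(f"Wilcoxon signed-rank: W={w_stat:.3f}, p={p_w:.4f}")
```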
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
- Did you explore other similarity metrics besides cosine similarity? A comparison with alternatives would be interesting (a sketch of candidate metrics is given after this list).
- Does the algorithm rely on any hyperparameters? If so, please specify them and justify the chosen values. Additionally, a sensitivity analysis could help assess the impact of these parameters on performance.
- Would it be possible to include log-likelihood as an additional evaluation metric alongside accuracy? (A small sketch of this metric is given after this list.)
- Please provide details about the computational resources required to run the algorithm. Also, how does the time complexity of your approach compare with other standard methods?
- Given that large language models are inherently influenced by their pretraining data, how do you account for potential bias introduced by their prior knowledge when optimizing prompts?
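A small sketch illustrating the alternative similarity metrics raised in the first comment (cosine similarity, negative Euclidean distance, raw dot product) applied to the same center-ranking step; the function and variable names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rank_by_metric(embs, center, metric="cosine"):
    """Rank embedding rows by closeness to the center under the chosen metric."""
    if metric == "cosine":
        scores = embs @ center / (np.linalg.norm(embs, axis=1) * np.linalg.norm(center) + 1e-12)
    elif metric == "euclidean":
        scores = -np.linalg.norm(embs - center, axis=1)   # closer = higher score
    elif metric == "dot":
        scores = embs @ center
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(-scores)                            # indices, best first

# A sensitivity check could simply compare the Top-K index sets produced by
# each metric and report how often they agree across tasks.
```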
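A minimal sketch of reporting average log-likelihood alongside accuracy, assuming a per-example probability for the gold label is available; how that probability is obtained from the model (e.g., via token log-probs) is left open.

```python
import math

def avg_log_likelihood(gold_label_probs):
    """Mean log-probability assigned to the gold labels (higher is better)."""
    eps = 1e-12
    return sum(math.log(max(p, eps)) for p in gold_label_probs) / len(gold_label_probs)

# Example with hypothetical probabilities:
print(avg_log_likelihood([0.81, 0.64, 0.92, 0.40]))
```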
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The author has completed the revisions to the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
All comments have been appropriately addressed.
