Figure 1.
Significant variations in the tokenization results (token counts) of perturbed text across different LLMs.
Figure 2.
Illustration of the self-correction reflection for improving LLM performance by simulating the brain’s error correction ability and leveraging relevant research findings.
Figure 3.
Flowchart of the self-correction reflection process.
Figure 4.
Flowchart of the MSCS, BERTScore, and BLEU calculation procedures.
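The BLEU-1 step of the procedure in Figure 4 can be sketched as a clipped unigram precision with a brevity penalty. This is a minimal illustration, assuming whitespace tokenization; the paper's exact tokenization pipeline is not shown in this section.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate unigram count by its count in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

An identical perturbed-vs-clean word sequence scores 1.00, which is consistent with the SwapWord columns of Table 5 (word order does not change the unigram multiset).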
Figure 5.
The MSCS variations are visualized via radar charts under three perturbation levels (20%, 50%, 100%) for each LLM, showing how semantic similarity changes across the 12 radar-chart axes (4 datasets × 3 perturbation types).
Figure 6.
Line charts visualizing the performance decline trends of different LLMs under four perturbation levels (0%, 20%, 50%, 100%). The plots show how each LLM's ACC changes across three perturbation types (SwapAlpha, SwapWord, and CaseAlpha) on four datasets (MMLU, LogiQA-en, LSAT, and SAT-en). Each line represents one LLM's average performance decline, with the baseline performance (0% perturbation) normalized to 100%.
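The three perturbation types can be sketched as below. The captions only name the operations and intensity levels, so the exact sampling scheme (which words or characters are perturbed at a given level) is an assumption.

```python
import random

def swap_alpha(text: str, level: float, rng: random.Random) -> str:
    """SwapAlpha: swap two adjacent letters inside a fraction `level` of the words."""
    words = text.split()
    idx = [i for i, w in enumerate(words) if len(w) >= 2]
    for i in rng.sample(idx, round(level * len(idx))):
        w = list(words[i])
        j = rng.randrange(len(w) - 1)
        w[j], w[j + 1] = w[j + 1], w[j]
        words[i] = "".join(w)
    return " ".join(words)

def swap_word(text: str, level: float, rng: random.Random) -> str:
    """SwapWord: swap a fraction `level` of adjacent word pairs."""
    words = text.split()
    pairs = list(range(0, len(words) - 1, 2))
    for i in rng.sample(pairs, round(level * len(pairs))):
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def case_alpha(text: str, level: float, rng: random.Random) -> str:
    """CaseAlpha: flip the case of a fraction `level` of the alphabetic characters."""
    chars = list(text)
    idx = [i for i, c in enumerate(chars) if c.isalpha()]
    for i in rng.sample(idx, round(level * len(idx))):
        chars[i] = chars[i].swapcase()
    return "".join(chars)
```

At `level=1.0` every eligible unit is perturbed, matching the 100% condition used throughout the tables.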
Figure 7.
The ACC variations are visualized via radar charts under four perturbation levels (0%, 20%, 50%, 100%) for each LLM, showing how robustness changes across the 12 radar-chart axes (4 datasets × 3 perturbation types).
Figure 8.
Illustration of ACC performance under four perturbation intensities (0%, 20%, 50%, 100%), showing how different LLMs maintain their robustness across various test scenarios.
Figure 9.
Illustration of MSCS improvement through self-correction reflection, comparing the semantic recovery before and after correction for each LLM. The radar chart plots normalized MSCS percentages (before/after correction relative to 0% perturbation MSCS) on axes representing distinct perturbation scenarios.
Figure 10.
Illustration of ACC enhancement through self-correction reflection, displaying the performance improvement for each LLM across different perturbation types. The radar chart plots normalized ACC percentages (before/after correction relative to 0% perturbation ACC) on axes representing distinct perturbation scenarios.
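The normalized percentages plotted in Figures 9 and 10 can be sketched as ratios to the unperturbed (0%) value, and the "Improv." columns of Tables 9 and 10 as a relative change. Both formulas are inferred from the caption wording rather than stated explicitly, so treat them as assumptions.

```python
def normalized_pct(value: float, clean_baseline: float) -> float:
    """Express a metric (MSCS or ACC) as a percentage of its 0%-perturbation value."""
    return value / clean_baseline * 100.0

def improvement_pct(before: float, after: float) -> float:
    """Relative improvement (%) of the post-correction value over the pre-correction one."""
    return (after - before) / before * 100.0
```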
Figure 11.
Comparison of ACC among LLMs after self-correction reflection, revealing their relative performance strengths across different datasets and perturbation types.
Table 1.
Comparison of external correction methods and the self-correction reflection.
Method | External Intervention | Scalability | Semantic Integrity |
---|---|---|---|
Standard External Correction | Required | Limited | Potentially Compromised |
Self-Correction (ours) | Not Required | High | Preserved |
Table 2.
Example of correction process with LLM’s system prompt, input text, and response.
Correction Process Example | |
---|---|
System Prompt | You are a professional text correction assistant. Your task is to receive a text that may have issues such as incorrect word spelling (e.g., “teh” instead of “the”), incorrect word order (e.g., “book the read” instead of “read the book”), and incorrect capitalization (e.g., “aPple” instead of “Apple”). For example, if the input text is “I likE applEs”, you should correct it to “I like apples”. Then correct the text to the correct and standard form and only output the corrected text content without adding any additional explanations or notes. |
Question | Please correct the following text: “IF 120 iS rEduced tO 96, whAt iS thE Reduction Percent?” |
Response | If 120 is reduced to 96, what is the reduction percent? |
Table 3.
Example of inference process with system prompt, history, question, and response.
Inference Process Example | |
---|---|
System Prompt | Please select and reply with only one of the options A, B, C or D. Do not provide any explanations or additional text. This is very important. |
History | [(“You are a professional text correction assistant. Your task is to receive a text that may have issues such as incorrect word spelling (e.g., ‘teh’ instead of ‘the’), incorrect word order (e.g., ‘book the read’ instead of ‘read the book’), and incorrect capitalization (e.g., ‘aPple’ instead of ‘Apple’). For example, if the input text is ‘I likE applEs’, you should correct it to ‘I like apples’. Then correct the text to the correct and standard form and only output the corrected text content without adding any additional explanations or notes”, “Please correct the following text: ‘IF 120 iS rEduced tO 96, whAt iS thE Reduction Percent?”’, “If 120 is reduced to 96, what is the reduction percent?”)] |
Question | If 120 is reduced to 96, what is the reduction percent? A:30% B:20% C:40% D:10% |
Response | B |
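Tables 2 and 3 together describe a two-stage prompt flow: a correction turn whose transcript is then carried along as history for the answer turn. A minimal sketch, with `chat` as a hypothetical stand-in for any chat-completion call and the correction prompt abbreviated from Table 2:

```python
CORRECTION_SYSTEM = (
    # Abbreviated; Table 2 gives the full correction prompt.
    "You are a professional text correction assistant. Correct the text to its "
    "standard form and output only the corrected text, without any explanations."
)
INFERENCE_SYSTEM = (
    "Please select and reply with only one of the options A, B, C or D. "
    "Do not provide any explanations or additional text. This is very important."
)

def answer_with_self_correction(question: str, chat) -> str:
    """Stage 1: ask the LLM to repair the perturbed question.
    Stage 2: answer the repaired question, keeping the correction turn as history."""
    correction_user = f'Please correct the following text: "{question}"'
    corrected = chat([
        {"role": "system", "content": CORRECTION_SYSTEM},
        {"role": "user", "content": correction_user},
    ])
    return chat([
        {"role": "system", "content": INFERENCE_SYSTEM},
        # Correction history from stage 1, as in the History row of Table 3.
        {"role": "user", "content": correction_user},
        {"role": "assistant", "content": corrected},
        {"role": "user", "content": corrected},
    ])
```

In the paper's examples the question already includes the answer options; here they would simply be part of `question`.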
Table 4.
Response MSCS among different perturbations (SwapAlpha, SwapWord, CaseAlpha) of different levels (0%, 20%, 50%, 100%) across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba).
LLMs | Dataset | MSCS (0%) | SwapAlpha 20% | SwapAlpha 50% | SwapAlpha 100% | SwapWord 20% | SwapWord 50% | SwapWord 100% | CaseAlpha 20% | CaseAlpha 50% | CaseAlpha 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Llama3.1 | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Qwen2.5 | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Falcon3-Mamba | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Table 5.
Response BLEU-1 score among different perturbations (SwapAlpha, SwapWord, CaseAlpha) of different levels (0%, 20%, 50%, 100%) across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba).
Dataset | LLMs | BLEU-1 (0%) | SwapAlpha 20% | SwapAlpha 50% | SwapAlpha 100% | SwapWord 20% | SwapWord 50% | SwapWord 100% | CaseAlpha 20% | CaseAlpha 50% | CaseAlpha 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
MMLU | GLM4 | 1.00 | 0.834 | 0.557 | 0.086 | 1.00 | 1.00 | 1.00 | 0.825 | 0.534 | 0.042 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
LogiQA-en | GLM4 | 1.00 | 0.821 | 0.542 | 0.079 | 1.00 | 1.00 | 1.00 | 0.809 | 0.515 | 0.026 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
LSAT | GLM4 | 1.00 | 0.814 | 0.530 | 0.054 | 1.00 | 1.00 | 1.00 | 0.807 | 0.515 | 0.030 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
SAT-en | GLM4 | 1.00 | 0.814 | 0.533 | 0.066 | 1.00 | 1.00 | 1.00 | 0.807 | 0.518 | 0.039 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
Table 6.
Response BERTScore among different perturbations (SwapAlpha, SwapWord, CaseAlpha) of different levels (0%, 20%, 50%, 100%) across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba).
Dataset | LLMs | BERTScore (0%) | SwapAlpha 20% | SwapAlpha 50% | SwapAlpha 100% | SwapWord 20% | SwapWord 50% | SwapWord 100% | CaseAlpha 20% | CaseAlpha 50% | CaseAlpha 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
MMLU | GLM4 | 1.00 | 0.875 | 0.723 | 0.565 | 0.905 | 0.905 | 0.905 | 0.875 | 0.729 | 0.583 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
LogiQA-en | GLM4 | 1.00 | 0.858 | 0.698 | 0.522 | 0.885 | 0.885 | 0.885 | 0.855 | 0.696 | 0.525 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
LSAT | GLM4 | 1.00 | 0.827 | 0.665 | 0.510 | 0.879 | 0.879 | 0.879 | 0.822 | 0.660 | 0.515 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
SAT-en | GLM4 | 1.00 | 0.794 | 0.681 | 0.607 | 0.911 | 0.911 | 0.911 | 0.789 | 0.674 | 0.610 |
 | Llama3.1 | | | | | | | | | | |
 | Qwen2.5 | | | | | | | | | | |
 | Falcon3-Mamba | | | | | | | | | | |
Table 7.
Response ACC among different perturbations (SwapAlpha, SwapWord, CaseAlpha) of different levels (0%, 20%, 50%, 100%) across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba). The subscript represents the variance in accuracy (%).
LLMs | Dataset | ACC (0%) | SwapAlpha 20% | SwapAlpha 50% | SwapAlpha 100% | SwapWord 20% | SwapWord 50% | SwapWord 100% | CaseAlpha 20% | CaseAlpha 50% | CaseAlpha 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Llama3.1 | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Qwen2.5 | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Falcon3-Mamba | MMLU | | | | | | | | | | |
LogiQA-en | | | | | | | | | | |
LSAT | | | | | | | | | | |
SAT-en | | | | | | | | | | |
Table 8.
Response ACC change rates (%) among different perturbations (SwapAlpha, SwapWord, CaseAlpha) of different levels (20%, 50%, 100%) across four datasets using four LLMs.
Perturbation Type | Perturbation Level | Dataset | GLM4 | Llama3.1 | Qwen2.5 | Falcon3-Mamba |
---|---|---|---|---|---|---|
SwapAlpha | 20% | MMLU | −3.62 | −4.02 | −2.13 | −4.84 |
 | | LogiQA-en | −6.07 | −5.34 | −3.63 | −1.66 |
 | | LSAT | −5.73 | −5.96 | −1.05 | −2.95 |
 | | SAT-en | −5.25 | −4.81 | −0.56 | −2.18 |
 | | Average | −5.17 | −5.03 | −1.84 | −2.91 |
 | 50% | MMLU | −10.73 | −10.97 | −6.81 | −11.02 |
 | | LogiQA-en | −13.72 | −6.91 | −8.08 | −9.77 |
 | | LSAT | −11.64 | −12.83 | −6.67 | −12.51 |
 | | SAT-en | −8.74 | −13.52 | −4.24 | −2.92 |
 | | Average | −11.21 | −11.06 | −6.45 | −9.06 |
 | 100% | MMLU | −24.39 | −22.78 | −17.62 | −24.54 |
 | | LogiQA-en | −24.27 | −19.52 | −19.08 | −20.08 |
 | | LSAT | −28.20 | −27.61 | −23.86 | −21.76 |
 | | SAT-en | −20.98 | −34.84 | −14.41 | −23.35 |
 | | Average | −24.46 | −26.19 | −18.74 | −22.43 |
 | Average | | −13.61 | −14.09 | −9.01 | −11.46 |
SwapWord | 20% | MMLU | −17.79 | −16.07 | −16.98 | −17.06 |
 | | LogiQA-en | −23.79 | −15.19 | −24.79 | −10.69 |
 | | LSAT | −28.57 | −25.28 | −27.03 | −18.67 |
 | | SAT-en | −17.78 | −26.43 | −19.49 | −18.24 |
 | | Average | −21.98 | −20.74 | −22.07 | −16.16 |
 | 50% | MMLU | −23.19 | −20.23 | −20.73 | −21.36 |
 | | LogiQA-en | −33.07 | −18.16 | −28.00 | −12.34 |
 | | LSAT | −34.85 | −31.54 | −36.40 | −24.84 |
 | | SAT-en | −27.11 | −38.75 | −27.97 | −25.55 |
 | | Average | −29.56 | −27.17 | −28.28 | −21.02 |
 | 100% | MMLU | −26.43 | −23.14 | −24.11 | −23.51 |
 | | LogiQA-en | −35.77 | −28.21 | −31.21 | −13.45 |
 | | LSAT | −41.87 | −39.44 | −40.64 | −27.68 |
 | | SAT-en | −34.11 | −42.05 | −31.63 | −27.01 |
 | | Average | −34.54 | −33.21 | −31.90 | −22.91 |
 | Average | | −28.69 | −27.04 | −27.42 | −20.03 |
CaseAlpha | 20% | MMLU | −1.78 | −2.43 | −1.09 | −3.12 |
 | | LogiQA-en | −3.18 | −1.60 | −1.54 | −2.39 |
 | | LSAT | −4.37 | −4.86 | −1.87 | −0.64 |
 | | SAT-en | −2.33 | −5.11 | −1.41 | −2.18 |
 | | Average | −2.92 | −3.50 | −1.48 | −2.08 |
 | 50% | MMLU | −4.76 | −6.30 | −2.85 | −5.77 |
 | | LogiQA-en | −4.31 | −4.94 | −3.34 | −4.23 |
 | | LSAT | −6.65 | −13.04 | −5.21 | −6.81 |
 | | SAT-en | −2.62 | −8.41 | −3.11 | −7.29 |
 | | Average | −4.58 | −8.17 | −3.63 | −6.02 |
 | 100% | MMLU | −9.21 | −12.15 | −5.39 | −9.95 |
 | | LogiQA-en | −11.18 | −6.33 | −8.08 | −9.03 |
 | | LSAT | −13.64 | −16.89 | −11.32 | −13.40 |
 | | SAT-en | −3.21 | −24.34 | −3.67 | −10.21 |
 | | Average | −9.31 | −14.93 | −7.12 | −10.65 |
 | Average | | −5.60 | −8.87 | −4.07 | −6.25 |
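The change rates in Table 8 are consistent with a simple relative change against the 0% baseline, and the per-level "Average" rows with plain means over the four datasets. The formula is inferred from the reported numbers rather than stated explicitly.

```python
def acc_change_rate(acc_perturbed: float, acc_clean: float) -> float:
    """Relative ACC change (%) of a perturbed run vs. the unperturbed (0%) run."""
    return (acc_perturbed - acc_clean) / acc_clean * 100.0

# Example: GLM4 under SwapAlpha at 20% across the four datasets (Table 8).
glm4_swapalpha_20 = [-3.62, -6.07, -5.73, -5.25]
avg = sum(glm4_swapalpha_20) / len(glm4_swapalpha_20)  # matches the reported −5.17
```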
Table 9.
Response MSCS comparison before and after correction for three types of perturbations (SwapAlpha, SwapWord, CaseAlpha) at 100% intensity across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba). For each perturbation type, the table presents the MSCS values before correction (“Before”), after correction (“After”), and the improvement percentage (“Improv.”).
LLMs | Dataset | SwapAlpha Before | SwapAlpha After | SwapAlpha Improv. | SwapWord Before | SwapWord After | SwapWord Improv. | CaseAlpha Before | CaseAlpha After | CaseAlpha Improv. |
---|---|---|---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Llama3.1 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Qwen2.5 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Falcon3-Mamba | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Avg. Improv. | - | - | | - | - | | - | - | |
Table 10.
Response ACC comparison before and after correction for three types of perturbations (SwapAlpha, SwapWord, CaseAlpha) at 100% intensity across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba). For each perturbation type, the table presents the ACC values before correction (“Before”), after correction (“After”), and the improvement percentage (“Improv.”).
LLMs | Dataset | SwapAlpha Before | SwapAlpha After | SwapAlpha Improv. | SwapWord Before | SwapWord After | SwapWord Improv. | CaseAlpha Before | CaseAlpha After | CaseAlpha Improv. |
---|---|---|---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Llama3.1 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Qwen2.5 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Falcon3-Mamba | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Avg. Improv. | - | - | | - | - | | - | - | |
Table 11.
Response MSCS comparison after correction for three types of perturbations (SwapAlpha, SwapWord, CaseAlpha) at 100% intensity across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba). For each perturbation type, the table presents the MSCS values after CTC Detection (CTC), TextBlob, and the self-correction reflection (Ours).
LLMs | Dataset | SwapAlpha CTC | SwapAlpha TextBlob | SwapAlpha Ours | SwapWord CTC | SwapWord TextBlob | SwapWord Ours | CaseAlpha CTC | CaseAlpha TextBlob | CaseAlpha Ours |
---|---|---|---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Llama3.1 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Qwen2.5 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Falcon3-Mamba | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Table 12.
Response ACC comparison after correction for three types of perturbations (SwapAlpha, SwapWord, CaseAlpha) at 100% intensity across four datasets (MMLU, LogiQA-en, LSAT, and SAT-en) using four LLMs (GLM4, Llama3.1, Qwen2.5, and Falcon3-Mamba). For each perturbation type, the table presents the ACC values after CTC Detection (CTC), TextBlob, and the self-correction reflection (Ours).
LLMs | Dataset | SwapAlpha CTC | SwapAlpha TextBlob | SwapAlpha Ours | SwapWord CTC | SwapWord TextBlob | SwapWord Ours | CaseAlpha CTC | CaseAlpha TextBlob | CaseAlpha Ours |
---|---|---|---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Llama3.1 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Qwen2.5 | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Falcon3-Mamba | MMLU | | | | | | | | | |
LogiQA-en | | | | | | | | | |
LSAT | | | | | | | | | |
SAT-en | | | | | | | | | |
Table 13.
Response MSCS comparison of the self-correction reflection (Ours) with the ablated variant (w/o HECP) across three perturbation types at 100% intensity, evaluated on four datasets using four LLMs.
LLMs | Dataset | SwapAlpha w/o HECP | SwapAlpha Ours | SwapWord w/o HECP | SwapWord Ours | CaseAlpha w/o HECP | CaseAlpha Ours |
---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Llama3.1 | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Qwen2.5 | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Falcon3-Mamba | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Table 14.
Response ACC comparison of the self-correction reflection (Ours) with the ablated variant (w/o HECP) across three perturbation types at 100% intensity, evaluated on four datasets using four LLMs.
LLMs | Dataset | SwapAlpha w/o HECP | SwapAlpha Ours | SwapWord w/o HECP | SwapWord Ours | CaseAlpha w/o HECP | CaseAlpha Ours |
---|---|---|---|---|---|---|---|
GLM4 | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Llama3.1 | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Qwen2.5 | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Falcon3-Mamba | MMLU | | | | | | |
LogiQA-en | | | | | | |
LSAT | | | | | | |
SAT-en | | | | | | |
Table 15.
Response GPU memory usage (in GB) of the self-correction reflection (Ours) with the ablated variant (w/o HECP) evaluated on four datasets using four LLMs.
LLMs | Dataset | GPU Usage (GB), w/o HECP | GPU Usage (GB), Ours | Extra Overhead (%) |
---|---|---|---|---|
GLM4 | MMLU | 18.55 | 19.32 | 4.15 |
 | LogiQA-en | | | |
 | LSAT | | | |
 | SAT-en | | | |
Llama3.1 | MMLU | 16.34 | 17.37 | 6.30 |
 | LogiQA-en | | | |
 | LSAT | | | |
 | SAT-en | | | |
Qwen2.5 | MMLU | 16.67 | 17.86 | 7.14 |
 | LogiQA-en | | | |
 | LSAT | | | |
 | SAT-en | | | |
Falcon3-Mamba | MMLU | 16.84 | 18.04 | 7.13 |
 | LogiQA-en | | | |
 | LSAT | | | |
 | SAT-en | | | |
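The "Extra Overhead" column of Table 15 matches the relative increase of GPU memory over the ablated variant; a one-line check of that relationship:

```python
def extra_overhead_pct(without_hecp_gb: float, ours_gb: float) -> float:
    """Extra GPU-memory overhead (%) of the full method over the w/o-HECP variant."""
    return (ours_gb - without_hecp_gb) / without_hecp_gb * 100.0
```

For GLM4 on MMLU, `extra_overhead_pct(18.55, 19.32)` reproduces the reported 4.15% after rounding.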