TC-Verifier: Trans-Compiler-Based Code Translator Verifier with Model-Checking
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The subject of this paper is very interesting. Applying model checking to the verification of code translators has not been well addressed by previous research, and it deserves systematic or experimental study.
However, the paper does not provide enough contributions to this subject. There is a clear gap between what readers expect after reading the summarized contributions in the introduction and what they eventually get in the subsequent sections.
The methodology proposed in the paper is not mature enough to be used by researchers in this area (other than the authors) as a base for further research and development, for the following reasons:
1. The proposed methodology is only applicable to trans-compiler-based code translators. However, the paper tries to position it as a methodology for formally verifying all code translators without clearly addressing the verification challenges posed by LLM-based code translators, even though LLM-based translators are mentioned and explained in the introduction.
2. The proposed architecture of the trans-compiler-based translator verifier is coupled to Uppaal. The explanation of the tool selection in Section 3.2 does not support the argument that Uppaal is compatible with the proposed verification methodology, since the method appears to be designed around how Uppaal can be used rather than around how to verify the translator.
3. The results and discussion cover only findings from a very limited case study and do not provide enough information to evaluate the model-checking-based verification methodology.
There are several limitations to the case study.
1. The paper argues that the model can be applied to any trans-compiler-based code translator that follows a similar translation method. However, potential compatibility issues and the cost of migration should be analyzed. Furthermore, what has to be mitigated if the translation method does not depend on defining rules?
2. The quality of the query files generated with GPT-4 should be analyzed.
3. In the experimental section, the paper covers only three kinds of code statements: variable declaration, function declaration, and class declaration. The paper should at least explain why only those statements are covered, rather than merely arguing that this is a first round of verifying a complete trans-compiler. Readers need to know how far the work is from a practically valuable verifier.
Author Response
Comment 1: There is a clear gap between what readers expect after reading the summarized contributions in the introduction and what they eventually get in the subsequent sections. The methodology proposed in the paper is not mature enough to be used by researchers in this area (other than the authors) as a base for further research and development, for the following reasons: 1. The proposed methodology is only applicable to trans-compiler-based code translators. However, the paper tries to position it as a methodology for formally verifying all code translators without clearly addressing the verification challenges posed by LLM-based code translators, even though LLM-based translators are mentioned and explained in the introduction.
Response 1: Thank you for your comment. LLM verification is not within the scope of this paper. We have made the scope of the presented work clearer in the highlighted parts of the abstract and introduction, and we have changed the title as well. Our focus in this paper is to present a formal verification method that can be applied to rule-based translator software systems; all rule-based trans-compilers can apply the presented methodology.
Comment 2: The proposed architecture of the trans-compiler-based translator verifier is coupled to Uppaal. The explanation of the tool selection in Section 3.2 does not support the argument that Uppaal is compatible with the proposed verification methodology, since the method appears to be designed around how Uppaal can be used rather than around how to verify the translator.
Response 2: We ran an initial pilot test mainly to check the user-friendliness of three model-checking tools. This initial test aimed to choose a tool with which to implement the model-checking idea and prove the concept. Choosing UPPAAL does not mean that the other tools cannot be used; we chose UPPAAL for its user-friendliness and the flexibility it offers in writing the system specifications to be checked. [We have added a list of evaluation metrics that we used to evaluate the tools (Subsection 3.2, lines 135-161) and a table that compares the three tools according to the defined evaluation aspects (Table 1).]
Comment 3: The results and discussion cover only findings from a very limited case study and do not provide enough information to evaluate the model-checking-based verification methodology.
Response 3: Thank you for pointing this out. We have defined an objective metric that the verifier measures, named the verification success rate. It is defined as the number of paths verified over the total number of paths attempted by the symbolic queries. [Edited Section 3.5, lines 249-691; we have also added a summary table with the verification success rates to the results (Table 10).]
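For clarity, the metric defined in this response can be written out as a formula (a direct transcription of the definition above; the notation is chosen here for illustration):

```latex
\[
\text{Verification Success Rate} =
  \frac{\text{number of paths verified}}
       {\text{total number of paths attempted by the symbolic queries}}
\]
```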
Comment 4: The paper argues that the model can be applied to any trans-compiler-based code translator that follows a similar translation method. However, potential compatibility issues and the cost of migration should be analyzed. Furthermore, what has to be mitigated if the translation method does not depend on defining rules?
Response 4: We agree that it is important to highlight the migration cost. The presented TC-Verifier can be used to verify other translators by following the abstraction and modeling explained in the case study. The migration cost is the modeling time spent by the verification engineer. Changing the translator implies modeling new state machines, since state-machine variables and symbols differ from one language to another, while the symbolic queries can be reused for the same source language even if the output language changes. If the source language changes, new queries must be defined. However, queries can be retrieved automatically, either using an LLM, as presented in the approach, or by an automated script that directly extracts the symbolic queries from the XML of the state machine; a minimal sketch of such a script follows.
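The following sketch is not the authors' actual tooling; it only illustrates the automated alternative mentioned above, relying on the fact that UPPAAL models are plain XML with named `template` and `location` elements. File names and the emitted query shape are assumptions.

```python
# Minimal sketch: derive reachability queries from an UPPAAL model file.
import xml.etree.ElementTree as ET

def queries_from_model(path):
    """Yield one 'E<> Template.Location' reachability query for every
    named location in every template of an UPPAAL .xml model."""
    root = ET.parse(path).getroot()
    for template in root.iter("template"):
        tname = template.findtext("name")
        for location in template.iter("location"):
            lname = location.findtext("name")
            if tname and lname:
                yield f"E<> {tname}.{lname}"

# Usage (hypothetical file names): write a .q file for UPPAAL's verifyta.
# with open("model.q", "w") as f:
#     f.write("\n".join(queries_from_model("model.xml")))
```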
Comment 5: The quality of the query files generated with GPT-4 should be analyzed.
Response 5: We manually validated the query files generated with GPT-4. Since there are only a few queries for each modeled state machine, human validation was performed on all queries used. [We have clarified this point in Subsection 3.5, lines 294-296.]
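For illustration, the validated queries might look like the sketch below; the template and state names (`VarDecl`, `Accepted`) are assumptions for this example, not taken from the paper's actual models:

```python
# Hypothetical UPPAAL-style symbolic queries for a variable-declaration
# state machine; each entry is short enough to be checked by hand.
queries = [
    "E<> VarDecl.Accepted",  # the accepting state is reachable
    "A[] not deadlock",      # the machine can never deadlock
]
```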
Comment 6: In the experimental section, the paper covers only three kinds of code statements: variable declaration, function declaration, and class declaration. The paper should at least explain why only those statements are covered, rather than merely arguing that this is a first round of verifying a complete trans-compiler. Readers need to know how far the work is from a practically valuable verifier.
Response 6: For a trans-compiler to be completely verified, the model should cover all input code statement types and the translation process for each statement. In the presented case study, Swift is the input language, and the Swift grammar has nine statement types. We started with declaration statements because the start of any code snippet is a declaration. However, we agree that all the statements covered were declarations, so to further prove the concept we have added an example for loop statements, which come just after declarations in the grammar. Loop statements include four types of loops; we modeled the while loop as an example. The remaining statement types would repeat the same engineering work shown by the four examples given. We agree that all statements should be modeled to obtain a completely verified trans-compiler; however, since we are presenting the architecture and evaluating the concept, we have included case studies showing that the architecture can be applied to the whole translator. [Added a paragraph in Subsection 3.7, lines 377-384; added the while loop in lines 422-425, Table 5, and Figure 6; added the results of modeling the while loop in lines 481-487 and Table 9.]
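To illustrate the kind of per-statement abstraction being repeated, here is a deliberately simplified, hypothetical sketch of a while-statement recognizer as a transition table. It is an analogy to, not a copy of, the UPPAAL models in the paper; the state and token names are invented.

```python
# Hypothetical transition table for a "while" statement state machine.
transitions = {
    ("Start", "while"): "Keyword",
    ("Keyword", "condition"): "Condition",
    ("Condition", "{"): "BodyOpen",
    ("BodyOpen", "statements"): "Body",
    ("Body", "}"): "Accepted",
}

def run(tokens, state="Start"):
    """Walk the token stream; return True only on reaching the accepting state."""
    for tok in tokens:
        state = transitions.get((state, tok))
        if state is None:
            return False
    return state == "Accepted"

print(run(["while", "condition", "{", "statements", "}"]))  # True
```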
4. Response to Comments on the Quality of English Language
Point 1: The English is fine and does not require any improvement.
Response 1: Thank you.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes an architecture based on Uppaal for verifying code translation between high-level programming languages. The proposed architecture is implemented for Swift-to-Java translation.
The traditional method of verifying the fidelity of a translation is the BLEU score. The authors propose a new architecture that can measure the functional equivalence of code.
- It would be nice to name the proposed architecture.
- Would it be possible to have an objective metric in the proposed method? It would make it easier to evaluate the quality of a translation.
- It would be interesting to see how BLEU scores relate to the proposed architecture, to clearly demonstrate what can be evaluated by the proposed method but not by BLEU.
Author Response
Comment 1: It would be nice to name the proposed architecture.
Response 1: Thank you for this note. We have named the presented architecture "TC-Verifier", as it represents an architecture for verifying trans-compiler-based code translators. We have added the name to the title, abstract, introduction, and methodology.
Comment 2: Would it be possible to have an objective metric in the proposed method? It would make it easier to evaluate the quality of a translation.
Response 2: Thank you for this valuable suggestion. We have defined an objective metric that the verifier measures, named the verification success rate. It is defined as the number of paths verified over the total number of paths attempted by the symbolic queries. We have also added a summary table with the verification success rates to the results. [Edited Section 3.5, lines 257-271, and added Table 10 to the results.]
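As a small worked example of this metric (the counts below are invented for illustration, not results from the paper):

```python
# Hypothetical counts, for illustration only.
paths_verified = 18    # paths the model checker proved correct
paths_attempted = 20   # paths targeted by the symbolic queries
success_rate = paths_verified / paths_attempted
print(f"Verification success rate: {success_rate:.0%}")  # -> 90%
```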
Comment 3: It would be interesting to see how BLEU scores relate to the proposed architecture, to clearly demonstrate what can be evaluated by the proposed method but not by BLEU.
Response 3: Yes, thank you for suggesting this comparison. [We have added Subsection 4.5, comparing the BLEU-score evaluator with our verification methodology (lines 490-522). We have also added Tables 11 and 12 to show a practical example of verification versus the BLEU score.]
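A minimal sketch of the contrast discussed in this response, assuming NLTK is installed; the two Java-like snippets are invented for illustration and are not the examples from Tables 11 and 12:

```python
# BLEU scores surface similarity, not functional equivalence.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference translation and a semantically equivalent hypothesis that
# uses different syntax (a while loop instead of a for loop).
reference = "for ( int i = 0 ; i < n ; i ++ ) { sum += i ; }".split()
hypothesis = "int i = 0 ; while ( i < n ) { sum += i ; i ++ ; }".split()

# BLEU penalizes the surface difference even though both loops compute
# the same sum; a model-checking verifier can instead check whether the
# translated statement reaches its accepting state.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # low despite functional equivalence
```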
4. Response to Comments on the Quality of English Language
Point 1: The English is fine and does not require any improvement.
Response 1: Thank you. |