Using Large Language Model to Fill in Web Forms to Support Automated Web Application Testing
Abstract
1. Introduction
- To improve the accuracy of determining whether a form is successfully submitted, we use GPT-4o [8] to assist in the decision-making process, in addition to page comparison.
2. Related Work
3. Review of mUSAGI Method
- Collection: In this step, we aim to gather as many input pages as possible. Each input page serves as an example for the agent to learn from. When encountering an input page, we use random actions (Monkey) to determine the values for input fields (e.g., Email, Name, Password) and then click the “submit” button.
- Training: Using the input pages collected in the previous step, we train an agent with reinforcement learning, defining specific rewards. The agent’s environment provides the tags and texts of the fields. The actions involve selecting, from a list, a value to fill a form field. Rewards are computed based on whether the agent selects the correct action according to the example. The training algorithm used is deep Q-learning [28], and the trained model is then stored. (A minimal sketch of such an agent follows this list.)
- Testing: In this step, we use the trained agent to test another web application, referred to as the AUT. The training and testing applications are different, allowing for a certain level of generalization. If the Istanbul middleware [29] supports the AUT, the model reports code coverage and a directive tree. Otherwise, only the directive tree is reported.
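To make the training step concrete, the following minimal sketch shows an agent of this kind. It is not the mUSAGI implementation: PyTorch, the network size, and the state encoding are assumptions; only the small fixed action list and the match-the-example reward follow the description above.

```python
# Minimal deep Q-learning sketch for field-value selection (assumed
# PyTorch; state encoding and network size are illustrative, not mUSAGI's).
import random
import torch
import torch.nn as nn

# The six value categories available to the original agent.
ACTIONS = ["Email", "Number", "Password", "Random String", "Date", "Full Name"]

class QNetwork(nn.Module):
    """Maps an encoded field (tag + text features) to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int = len(ACTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_action(q_net: QNetwork, state: torch.Tensor, epsilon: float = 0.1) -> int:
    """Epsilon-greedy choice of the value category to fill into a field."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(q_net, optimizer, state, action, reward):
    """One update: reward is 1 when the chosen category matches the
    collected example, 0 otherwise (a terminal, one-step episode)."""
    q_pred = q_net(state)[action]
    loss = nn.functional.mse_loss(q_pred, torch.tensor(float(reward)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```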
- Lack of diverse input data: In this model, the actions for filling fields are limited to the following values: Email, Number, Password, Random String, Date, and Full Name. With such a limited set of values, other types of fields may not be correctly filled. This lack of diversity in input values needs to be addressed. This is particularly important in testing web applications, where almost every AUT contains multiple forms. Enhancing the diversity of input data can ensure that software testing covers more forms, thereby improving the AUT’s reliability and stability.
- Long training time: In our previous method, Monkey was used to randomly fill form fields and collect web forms for training the agent. However, the agent initially requires many attempts to guess the correct field value, resulting in considerable time spent collecting training samples (forms). This paper proposes a different approach to reduce the training time.
- Imprecise determination of form submission status: The previous method relied on DOM similarity to determine whether a web form was successfully submitted. If the similarity between the post-submission page and stored pages was less than 95%, the form was considered successfully submitted; otherwise, it was deemed a failure. The 95% threshold was determined experimentally [3]. However, some web applications display only a small piece of confirmation text upon successful submission. In such cases, the overall DOM similarity remains very high, possibly over 95%, leading to false negatives in which successful submissions are incorrectly marked as failures. Such mistakes cause the agent to repeatedly test the same successful page, lowering efficiency. Therefore, more reliable methods are needed to accurately determine form submission status. (The sketch after this list illustrates the threshold rule.)
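A minimal sketch of the threshold rule just described, assuming serialized DOM strings as input; difflib.SequenceMatcher is an illustrative stand-in, since the paper does not specify the similarity function:

```python
# Threshold-based submission check used by the previous method (sketch).
# difflib is a stand-in; the actual DOM similarity function is unspecified.
from difflib import SequenceMatcher

def page_similarity(before_dom: str, after_dom: str) -> float:
    """Percentage similarity between two serialized DOM strings."""
    return SequenceMatcher(None, before_dom, after_dom).ratio() * 100

def submitted_by_dom_only(before_dom: str, after_dom: str,
                          threshold: float = 95.0) -> bool:
    """DOM-only rule: success only when the page changed enough.
    A small confirmation message keeps similarity above the threshold,
    so a successful submission can be misclassified as a failure."""
    return page_similarity(before_dom, after_dom) < threshold
```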
4. Proposed Approach
4.1. Overview of Proposed Approach
4.2. FormAgent
4.3. ValueGenerator and Prompt Tuning
4.3.1. T5 Model
4.3.2. Predefined Categories
4.3.3. ValueGenerator and Prompt Tuning of T5 Model
- Template: A key element of prompt learning, it constructs prompts by wrapping the original input text in a textual or soft-encoded template, usually containing context markers.
- PromptModel: This component is used for training and inference. It includes a Pre-trained Language Model (PLM), a Template object, and an optional Verbalizer object. Users can combine these modules flexibly and design their interactions. The main goal is to allow training through a unified API without needing specific implementations for different PLMs, enabling more flexible usage.
- PromptDataset: This component is used to load training data. A minimal usage sketch combining these components follows.
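The sketch below, adapted from OpenPrompt’s usage pattern, shows how a T5 backbone, a Template, and a Verbalizer could be combined to map a form field’s tag and text to a category. The template wording, the three-category subset, and the label words are illustrative assumptions, and a manual template stands in for the tuned soft prompt described in this section.

```python
# Sketch of field-category prediction with OpenPrompt and a T5 backbone.
# Template text, category subset, and label words are illustrative only.
import torch
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer
from openprompt import PromptForClassification, PromptDataLoader

classes = ["First Name", "Email", "Date"]  # subset of the predefined categories

plm, tokenizer, model_config, WrapperClass = load_plm("t5", "t5-base")

template = ManualTemplate(
    tokenizer=tokenizer,
    text='Form field: {"placeholder":"text_a"} Its category is {"mask"}.',
)
verbalizer = ManualVerbalizer(
    tokenizer=tokenizer,
    classes=classes,
    label_words={"First Name": ["name"], "Email": ["email"], "Date": ["date"]},
)
model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)

# One hypothetical field, described by its tag and surrounding text.
dataset = [InputExample(guid=0, text_a='input type="text" id="fname" label "First name"')]
loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass,
                          max_seq_length=128, decoder_max_length=3)

model.eval()
with torch.no_grad():
    for batch in loader:
        logits = model(batch)                 # one score per category
        pred = torch.argmax(logits, dim=-1)
        print(classes[pred.item()])
```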
4.3.4. DataFaker
4.4. Submit Button Checker
You are an AI web crawler assistant. The user will give you some web elements. Please answer if it is a form submission button. Please say only yes or no.
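A sketch of how this prompt could be sent to GPT-4o with the OpenAI Python SDK (v1.x) is shown below; the element serialization and the helper name are illustrative assumptions, not the paper’s implementation.

```python
# Sketch of the Submit Button Checker's GPT-4o query (OpenAI SDK v1.x).
# The helper name and element serialization are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an AI web crawler assistant. The user will give you some web "
    "elements. Please answer if it is a form submission button. "
    "Please say only yes or no."
)

def is_submit_button(element_html: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": element_html},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# e.g., is_submit_button('<button type="submit">Sign up</button>')
```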
4.5. Determination of Successful Submission
Algorithm 1: Is directive effective
```
Input:  Page beforeSubmitPage, Page afterSubmitPage
Output: Boolean isSimilar
 1: begin
 2:   similarity ← calculatePagesSimilarity(beforeSubmitPage, afterSubmitPage)
 3:   if similarity == 100 then
 4:     return false
 5:   end if
 6:   if similarity >= 95 then
 7:     beforeSubmitElements ← getElements(beforeSubmitPage)
 8:     afterSubmitElements ← getElements(afterSubmitPage)
 9:     isSimilar ← getGptAnswer(beforeSubmitElements, afterSubmitElements)
10:     return isSimilar
11:   end if
12:   return true
13: end

15: procedure getGptAnswer(beforeSubmitElements, afterSubmitElements)
16: begin
17:   differentElements ← getDiffElements(beforeSubmitElements, afterSubmitElements)
18:   answer ← openAiApi(differentElements)
19:   if answer == "yes" then
20:     return true
21:   else if answer == "no" then
22:     return false
23:   end if
24: end
```
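One possible Python rendering of Algorithm 1 is given below. The similarity function reuses the difflib stand-in from Section 3; get_diff_elements and the ask_gpt callable (wrapping the GPT-4o query of Section 4.4) are illustrative placeholders for the paper’s helpers.

```python
# Possible Python rendering of Algorithm 1 (helper names mirror the
# pseudocode; similarity and diff logic are illustrative stand-ins).
from difflib import SequenceMatcher

def calculate_pages_similarity(before_dom: str, after_dom: str) -> float:
    return SequenceMatcher(None, before_dom, after_dom).ratio() * 100

def get_diff_elements(before_elements, after_elements):
    """Elements that appear after submission but not before."""
    return [e for e in after_elements if e not in before_elements]

def is_directive_effective(before_dom, after_dom,
                           before_elements, after_elements, ask_gpt) -> bool:
    similarity = calculate_pages_similarity(before_dom, after_dom)
    if similarity == 100:   # identical pages: the submission had no effect
        return False
    if similarity >= 95:    # nearly identical: let GPT-4o judge the diff
        answer = ask_gpt(get_diff_elements(before_elements, after_elements))
        return answer == "yes"
    return True             # clearly different page: treat as successful
```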
5. Experiments and Results
5.1. Experimental Environment
5.2. Performance Metrics
- Code coverage. According to Brader et al., “Low coverage means that some of the logic of the code has not been tested. High coverage...nevertheless indicates that the likelihood of correct processing is good” [41]. Therefore, a method achieving a higher percentage of code coverage is considered better. There are two types of code coverage: statement coverage and branch coverage. In the experiments, only statement coverage is reported, as these two are highly correlated. The choice of a code coverage tool is dependent on the programming language in use, as mentioned in Section 4.5. To supplement the code coverage metric, we introduce three additional metrics: the number of input pages, input page depth, and ICI breadth, detailed below.
- Input page depth. This is the number of nodes on the longest path from the root node to the deepest input page node. In Figure 10, the longest path runs from the root node through the directive node (marked with a red circle) with ID 77eb5790 to the final input page with ID 1068395108, so the depth of this tree is 2. This value measures the capability of an approach to explore forms hidden deep within the web application.
- ICI breadth. This is the number of input page nodes that extend to directive nodes. In Figure 10, there are three input page nodes, each connecting to directive nodes, so the ICI breadth in this example is 3. ICI breadth can be used to count the number of forms successfully submitted. In most cases, this metric is highly correlated with the number of input pages.
5.3. Experiment One
5.4. Experiment Two
5.5. Experiment Three
5.5.1. Code Coverage Comparison with QExplorer and mUSAGI
5.5.2. Execution Time Comparison
5.6. Threats to Validity
5.7. Future Directions
- Enabling geographically sensitive values: The current model does not consider geography-related content from URLs or pages of the AUT. By applying LLMs to these elements, it is possible to store geographic information in the proposed model, enabling the generation of geographically sensitive values by LLMs.
- Better mechanism to detect successful submission: The current implementation invokes GPT-4o only when the similarity between two pages exceeds a threshold, keeping test costs low because GPT-4o is a paid service. In the future, we plan to use existing pre-trained LLMs to detect successful submissions, thereby eliminating the need for a threshold.
- Using LLM for value generation: Data fakers limit the categories of form fields. Instead, an LLM can be used as a value generator. To ensure that the generated values are reasonable, another LLM would act as a verifier to approve the values from the generator. (A hypothetical sketch follows this list.)
- Testing more AUTs: Currently, only five AUTs are used to evaluate the testing performance of the proposed model. Testing more AUTs from various categories, such as e-commerce and learning management systems, is desirable to better evaluate its usefulness.
- Improvement of computational efficiency: The current model serves as a proof-of-concept, leveraging open-source resources like Crawljax, data fakers, and pre-trained LLMs. Consequently, its computational efficiency is not optimized. As mentioned in Section 5.5.2, Crawljax’s execution time accounts for more than 90% of the total time. Due to its large codebase, improving Crawljax’s response time is challenging. A potential future direction is to integrate the essential Crawljax functions into the proposed model to reduce overall execution time.
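As a hypothetical sketch of the generator-verifier idea listed above (not an implemented feature), the loop below pairs two GPT-4o calls through the same SDK as in Section 4.4; the prompts, model choice, and retry budget are invented for illustration.

```python
# Hypothetical LLM generator-verifier loop for form field values.
# Prompts, model choice, and retry budget are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content.strip()

def generate_value(field_label: str, max_tries: int = 3) -> str:
    """Generator proposes a value; a second LLM call approves or rejects it."""
    value = ""
    for _ in range(max_tries):
        value = ask(f'Propose one realistic test value for a form field '
                    f'labeled "{field_label}". Reply with the value only.')
        verdict = ask(f'Is "{value}" a reasonable value for a form field '
                      f'labeled "{field_label}"? Reply only yes or no.')
        if verdict.lower().startswith("yes"):
            return value
    return value  # fall back to the last candidate
```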
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. How Many Websites Are There? Available online: https://www.statista.com/chart/19058/number-of-websites-online/ (accessed on 13 January 2025).
2. Liu, C.-H.; You, S.D.; Chiu, Y.-C. A Reinforcement Learning Approach to Guide Web Crawler to Explore Web Applications for Improving Code Coverage. Electronics 2024, 13, 427.
3. Lai, C.-F.; Liu, C.-H.; You, S.D. Using Webpage Comparison Method for Automated Web Application Testing with Reinforcement Learning. Int. J. Eng. Technol. Innov. 2024, accepted.
4. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
5. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv 2021, arXiv:2104.08691.
6. Ding, N.; Hu, S.; Zhao, W.; Chen, Y.; Liu, Z.; Zheng, H.-T.; Sun, M. OpenPrompt: An Open-Source Framework for Prompt-Learning. arXiv 2021, arXiv:2111.01998.
7. Mocker-Data-Generator. Available online: https://github.com/danibram/mocker-data-generator (accessed on 23 May 2024).
8. Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 23 May 2024).
9. Wang, X.; Jiang, Y.; Tian, W. An Efficient Method for Automatic Generation of Linearly Independent Paths in White-Box Testing. Int. J. Eng. Technol. Innov. 2015, 5, 108–120.
10. Malhotra, D.; Bhatia, R.; Kumar, M. Automated Selection of Web Form Text Field Values Based on Bayesian Inferences. Int. J. Inf. Retr. Res. 2023, 13, 1–13.
11. Sunman, N.; Soydan, Y.; Sözer, H. Automated Web Application Testing Driven by Pre-Recorded Test Cases. J. Syst. Softw. 2022, 193, 111441.
12. Crawljax. Available online: https://github.com/zaproxy/crawljax (accessed on 25 October 2023).
13. Negara, N.; Stroulia, E. Automated Acceptance Testing of JavaScript Web Applications. In Proceedings of the 2012 19th Working Conference on Reverse Engineering, Kingston, ON, Canada, 15–18 October 2012.
14. Wu, C.Y.; Wang, F.; Weng, M.H.; Lin, J.W. Automated Testing of Web Applications with Text Input. In Proceedings of the 2015 IEEE International Conference on Progress in Informatics and Computing, Nanjing, China, 18–20 December 2015.
15. Groce, A. Coverage Rewarded: Test Input Generation via Adaptation-Based Programming. In Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering, Lawrence, KS, USA, 6–10 November 2011.
16. Lin, J.-W.; Wang, F.; Chu, P. Using Semantic Similarity in Crawling-Based Web Application Testing. In Proceedings of the 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), Tokyo, Japan, 13–17 March 2017.
17. Document Object Model (DOM) Technical Reports. Available online: https://www.w3.org/DOM/DOMTR (accessed on 20 October 2024).
18. Qi, X.F.; Hua, Y.L.; Wang, P.; Wang, Z.Y. Leveraging Keyword-Guided Exploration to Build Test Models for Web Applications. Inf. Softw. Technol. 2019, 111, 110–119.
19. Liu, C.-H.; Chen, W.-K.; Sun, C.-C. GUIDE: An Interactive and Incremental Approach for Crawling Web Applications. J. Supercomput. 2020, 76, 1562–1584.
20. Carino, S.; Andrews, J.H. Dynamically Testing GUIs Using Ant Colony Optimization. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, Lincoln, NE, USA, 9–13 November 2015.
21. Nguyen, D.P.; Maag, S. Codeless Web Testing Using Selenium and Machine Learning. In Proceedings of the 15th International Conference on Software Technologies, Online, 7–9 July 2020.
22. Kim, J.; Kwon, M.; Yoo, S. Generating Test Input with Deep Reinforcement Learning. In Proceedings of the IEEE/ACM 11th International Workshop on Search-Based Software Testing (SBST), Gothenburg, Sweden, 28–29 May 2018.
23. Zheng, Y.; Liu, Y.; Xie, X.; Liu, Y.; Ma, L.; Hao, J.; Liu, Y. Automatic Web Testing Using Curiosity-Driven Reinforcement Learning. In Proceedings of the 43rd International Conference on Software Engineering (ICSE), Online, 22–30 May 2021.
24. Sherin, S.; Muqeet, A.; Khan, M.U.; Iqbal, M.Z. QExplore: An Exploration Strategy for Dynamic Web Applications Using Guided Search. J. Syst. Softw. 2023, 195, 111512.
25. Liu, E.Z.; Guu, K.; Pasupat, P.; Shi, T.; Liang, P. Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
26. Mridha, N.F.; Joarder, M.A. Automated Web Testing Over the Last Decade: A Systematic Literature Review. Syst. Lit. Rev. Meta-Anal. J. 2023, 4, 32–44.
27. Liu, Z.; Chen, C.; Wang, J.; Che, X.; Huang, Y.; Hu, J.; Wang, Q. Fill in the Blank: Context-Aware Automated Text Input Generation for Mobile GUI Testing. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, VIC, Australia, 14–20 May 2023.
28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
29. Istanbul. Available online: https://istanbul.js.org/ (accessed on 20 October 2023).
30. Gur, I.; Nachum, O.; Miao, Y.; Safdari, M.; Huang, A.; Chowdhery, A.; Narang, S.; Fiedel, N.; Faust, A. Understanding HTML with Large Language Models. arXiv 2022, arXiv:2210.03945.
31. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
32. Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. LaMDA: Language Models for Dialog Applications. arXiv 2022, arXiv:2201.08239.
33. Most Popular Websites Worldwide as of November 2023, by Unique Visitors. Available online: https://www.statista.com/statistics/1201889/most-visited-websites-worldwide-unique-visits/ (accessed on 15 January 2025).
34. Fine-Tuning vs. Prompt Engineering: How to Customize Your AI LLM. Available online: https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/philip-dsouza/2024/06/07/fine-tuning-vs-prompt-engineering-how-to-customize?form=MG0AV3&communityKey=200b84ba-972f-4f79-8148-21a723194f7f (accessed on 17 January 2025).
35. Prompt Tuning vs. Fine-Tuning—Differences, Best Practices and Use Cases. Available online: https://nexla.com/ai-infrastructure/prompt-tuning-vs-fine-tuning/?form=MG0AV3 (accessed on 17 January 2025).
36. TimeOff.Management. Available online: https://github.com/timeoff-management/application (accessed on 20 October 2023).
37. NodeBB. Available online: https://github.com/NodeBB/NodeBB (accessed on 20 October 2023).
38. KeystoneJS. Available online: https://github.com/keystonejs/keystone (accessed on 25 January 2024).
39. Django Blog. Available online: https://github.com/reljicd/django-blog (accessed on 20 October 2023).
40. Spring PetClinic. Available online: https://github.com/spring-projects/spring-petclinic (accessed on 20 October 2023).
41. Brader, L.; Hilliker, H.; Wills, A. Testing for Continuous Delivery with Visual Studio 2012; Microsoft: Washington, DC, USA, 2013; p. 30.
| Category | Count | Category | Count |
|---|---|---|---|
| First Name | 19 | Province | 1 |
| Last Name | 20 | Region | 1 |
|  | 18 | Number | 25 |
| Gender | 1 | Country | 1 |
| String | 32 | Display Name | 1 |
| User Name | 12 | Address | 11 |
| Full Name | 9 | Suburb | 3 |
| Postal Code | 8 | Company Name | 1 |
| Store Name | 1 | Card Number | 1 |
| Phone Number | 7 | Expiration Date | 1 |
| Street Address | 8 | CVV | 1 |
| City | 5 | Date | 6 |
| State | 1 |  |  |
| Hardware/Software | Specifications/Model/Version |
|---|---|
| CPU | Intel Xeon W-2235, 3.80 GHz |
| RAM | 32 GB DDR4 |
| OS | Ubuntu 20.04 |
| GPU | NVIDIA GeForce RTX 2070, 8 GB GDDR6 |
| Selenium | 3.141.0 |
| Crawljax | 3.7 |
| Python | 3.7 |
Application Name | Version | GitHub Stars Count | Lines of Code | Type |
---|---|---|---|---|
TimeOff.Management [36] | V0.10.0 | 921 | 2698 | Attendance Management System |
NodeBB [37] | V1.12.2 | 14 k | 7334 | Online Forum |
KeystoneJS [38] | V4.0.0-beta.5 | 1.1 k | 5267 | Blogging Software |
Django Blog [39] | V1.0.0 | 26 | - | Blogging Software |
Spring Petclinic [40] | V2.6.0 | 22.9 k | - | Veterinary Client Management System |
| Web App | Code Coverage (T5) | Code Coverage (mUSAGI) | Input Pages (T5) | Input Pages (mUSAGI) | Input Page Depth (T5) | Input Page Depth (mUSAGI) | ICI Breadth (T5) | ICI Breadth (mUSAGI) |
|---|---|---|---|---|---|---|---|---|
| TimeOff | 54.67 | 52.51 | 15 | 9 | 3 | 3 | 12 | 9 |
| NodeBB | 44.49 | 41.45 | 7 | 4 | 2 | 1.3 | 3 | 2 |
| KeystoneJS | 49.48 | 49.20 | 14 | 14 | 4 | 3.3 | 5 | 5 |
| Django | - | - | 6 | 4 | 3 | 2 | 6 | 4 |
| Petclinic | - | - | 17 | 14 | 9 | 3.3 | 9 | 9 |
| Web App | Code Coverage (T5-GPT) | Code Coverage (T5) | Input Pages (T5-GPT) | Input Pages (T5) | Input Page Depth (T5-GPT) | Input Page Depth (T5) | ICI Breadth (T5-GPT) | ICI Breadth (T5) |
|---|---|---|---|---|---|---|---|---|
| TimeOff | 54.74 | 54.67 | 14 | 15 | 3 | 3 | 14 | 12 |
| NodeBB | 44.52 | 44.49 | 7 | 7 | 2 | 2 | 3 | 3 |
| KeystoneJS | 50.86 | 49.48 | 20 | 14 | 4 | 4 | 5 | 5 |
| Django | - | - | 6 | 6 | 3 | 3 | 6 | 6 |
| Petclinic | - | - | 17 | 17 | 9 | 9 | 9 | 9 |
| Web App | mUSAGI (Training) | mUSAGI (Test) | T5-GPT | QExplorer |
|---|---|---|---|---|
| TimeOff | 07:08 | ~01:00 | 08:09 | 04:00 |
| NodeBB | 42:19 | 05:11 | - |  |
| KeystoneJS | 20:45 | 05:28 | 04:00 |  |