1. Introduction
Software testing plays a pivotal role in evaluating the quality and correctness of software systems throughout the development lifecycle [
1]. It ensures that applications function as intended and meet user expectations. Software testing can consume over 50% of total development costs, especially in complex modern applications [
1]. Web applications are particularly challenging to test due to their dynamic content, asynchronous interactions, and frequent interface changes, a set of problems well established in the field’s literature [
2,
3,
4]. As these systems evolve rapidly to meet changing user demands, testing methodologies must also adapt to maintain reliability and efficiency [
3,
5]. Inadequate software testing can lead to critical system failures and widespread vulnerabilities, emphasising the need for reliable, automated solutions [
6,
7]. These cases highlight the need for more effective testing methods that can be applied in real-world systems with limited time and resources.
Modern web applications are characterised by complex structures, dynamic content, and asynchronous interactions [
8]. These features present significant testing challenges that have been reported in the literature [
2,
9]. This complexity makes it challenging to test effectively using traditional manual methods [
8]. When relying on manual testing, developers and QA engineers must spend additional time and effort on test design and execution, thereby increasing the organisation’s overall testing costs [
10]. This makes traditional manual testing slow and difficult. Automated testing [
11], codeless testing [
12], and model-based testing using machine learning have been proposed to improve this process [
3]. However, these methods have certain limitations, including difficulties in training models on large datasets and concerns regarding the quality of the resulting product.
LLMs have been widely used in the software development lifecycle [
6], including code generation [
13], requirements engineering [
14], vulnerability detection [
15], and test case creation [
10,
16,
17]. Recently, the use of LLMs in software testing has become a significant and rapidly expanding field of research, with the literature mapping the current landscape of tools and techniques [
18]. Tools such as GPTDroid and PENTESTGPT demonstrate that LLMs can perform automated testing with minimal human assistance [
19,
20]. This demonstrates that LLMs can be integrated into automated and context-sensitive testing systems for improved quality with reduced resources and lower costs. However, to date, according to our knowledge, there is little work that uses LLMs in complete testing pipelines for web applications. A few studies have utilised LLMs in conjunction with web scraping or regular expressions; however, these approaches do not encompass full automation or script refinement [
16]. There exists a research gap concerning automated systems capable of dynamically adapting to changes in page structure while preserving script readability.
This study introduces a test automation framework called AutoQALLMs that utilises GPT-4 alongside Selenium WebDriver, BeautifulSoup, and regular expressions. The proposed approach reads the HTML of any webpage, finds the elements, and uses GPT-4, Claude 4.5 and Grok to generate Selenium scripts. Regular expressions are used to refine and adapt the scripts even when the webpage structure changes. Unlike conventional methods that rely on static scripts and require updates for every UI change, this approach dynamically adapts to the structure of webpages using DOM (Document Object Model) parsing and pattern-based matching, reducing manual effort and improving resilience to UI changes [
3,
4]. Unlike GPTDroid, which is designed for mobile testing [
20], AutoQALLMs are tailored for web application testing. While GPTDroid focuses on extracting UI trees and generating scripts for native mobile apps, AutoQALLMs target dynamic webpages by parsing the DOM using BeautifulSoup and refining scripts with regular expressions. Additionally, AutoQALLMs support zero-shot prompt-based test generation (enabling HTML-to-Selenium translation without prior examples or fine-tuning) using GPT-4, Claude 4.5, and Grok, eliminating the need for intermediate instrumentation or mobile-specific tools. To our knowledge, this is the first study to integrate LLMs with DOM parsing and regex-based script enhancement, creating a fully automated, browser-oriented testing framework.
With the proposed approach, we aim to answer three research questions: (i) How can LLMs be combined with web scraping and Selenium to create test scripts? (ii) Can LLMs turn HTML into working Selenium scripts using zero-shot prompts? (iii) Can AutoQALLMs outperform manual or semi-automated testing in code coverage and fault detection? Our results demonstrate that LLMs can be adapted to convert HTML into valid Selenium scripts without requiring human-written code. The proposed AutoQALLMs are tested across various websites, and expert feedback indicates that they improve coverage and fault detection while reducing the time required to generate scripts compared to traditional methods.
Our key contribution is the development of a framework that extracts HTML from a webpage and parses the DOM using BeautifulSoup. We compose structured prompts to direct LLMs (GPT-4, Claude 4.5, and Grok) in generating Selenium-based test scripts. Additionally, we have implemented a regex-driven refinement module that enhances the robustness of these scripts by identifying element patterns, thereby improving their adaptability to changes in the user interface (UI). Another contribution is the integration of these components into a dynamic test automation flow that supports zero-shot test case generation to eliminate the need for manually written templates or prior training. We also conducted extensive evaluations across 30 websites to measure AutoQALLMs’ performance in terms of coverage, execution time, and fault detection. Furthermore, we gathered and incorporated expert feedback from software testers to validate the system’s practicality and to understand its strengths and limitations in real-world testing environments. AutoQALLMs’ code is made available for future work (
https://github.com/Sindhupriya2797/AutoQALLMs).
We developed AutoQALLMs, a fully automated framework that transforms web content into executable Selenium test scripts using LLM intelligence. We designed the framework to perform zero-shot test generation by extracting HTML from any given webpage, parsing its DOM structure with BeautifulSoup, and constructing structured prompts that guide the LLMs to produce context-aware Selenium scripts. We further implemented a regex-driven refinement module that identifies and optimises element patterns to improve the syntactic validity and adaptability of scripts when the user interface or DOM changes. Through this design, we integrated parsing, generation, and refinement into a single workflow that supports end-to-end, self-adjusting automation from HTML extraction to test execution and reporting.
We also devised an evaluation framework to validate the scalability and practicality of AutoQALLMs. We tested the framework across 30 diverse web applications and measured performance using quantitative metrics such as coverage, generation time, execution speed, and fault detection. We further collected expert feedback from professional testers to assess qualitative attributes, including readability, adaptability, maintainability, and scalability. Our results show that AutoQALLMs generate accurate Selenium scripts within seconds, achieve coverage levels close to manual testing, and significantly reduce human effort. With these contributions, we present a practical and intelligent approach to AI-assisted web testing that advances the role of LLMs in automated quality assurance.
The remainder of this paper is organised as follows:
Section 2 presents related work in software testing, including manual, ML-based, and LLM-based approaches.
Section 3 describes the architecture and methodology of the AutoQALLMs framework.
Section 4 outlines the experimental setup, evaluation metrics, and testing procedures.
Section 5 presents the results, expert feedback, and comparative analysis. Finally,
Section 6 concludes the study and discusses limitations as well as potential directions for future research.
3. Methodology
The proposed AutoQALLMs automate web application testing in four stages: (i) HTML extraction, (ii) LLM-based test script generation, (iii) regex-based script cleaning and optimisation, and (iv) test execution and reporting. These stages form a modular and scalable framework that automatically generates Selenium scripts from web content. The architectural choice to use powerful LLMs such as GPT-4, Claude, and Grok in conjunction with structured parsing of HTML is supported by recent empirical studies, which have found this combination to be highly effective for test generation compared to other LLMs and less structured inputs [
31]. This approach improves the speed and accuracy of web application testing. Each stage operates as an independent unit with defined inputs and outputs, enabling iterative development and practical deployment. Below, the proposed methodology is elaborated with various functions developed for GPT-4. Moreover, the complete implementation of various LLMs, including GPT-4, Claude, and Grok, is available on GitHub (
https://github.com/Sindhupriya2797/AutoQALLMs).
3.1. Step 1: HTML Extraction
The proposed AutoQALLMs approach extracts HTML when the user (software tester) enters the URL of the web application to be tested, as shown in
Figure 1. This URL provides the raw HTML content needed for LLM-based testing. The system calls the fetch_html(url) function, which uses the requests library to send an HTTP GET request. The command response = requests.get(url) fetches the webpage content and response.raise_for_status() handles HTTP errors such as 404 or 500 to ensure reliability during data collection.
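A minimal sketch of this retrieval step, mirroring the fetch_html(url) behaviour described above, is shown below; the timeout value is an assumption rather than a documented setting.

import requests

def fetch_html(url: str) -> str:
    """Send an HTTP GET request and return the raw HTML of the page."""
    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()               # surfaces HTTP errors such as 404 or 500
    return response.text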
After retrieving the HTML, the content is parsed using BeautifulSoup with BeautifulSoup(response.text, 'html.parser'). This converts the unstructured HTML into a well-organised DOM tree. The DOM enables structured access to webpage elements, such as buttons, links, and form fields. This representation is essential for the next stage, where the LLMs generate test cases based on a contextual understanding of the UI. After parsing, the AutoQALLMs extract content using the parse_html(soup) function. This function isolates a predefined set of HTML tags relevant for web UI testing. It specifically targets:
- (a)
The <title> tag provides the title of the webpage to verify that the correct page has been loaded and is displayed as expected in the browser.
- (b)
Anchor (<a>) tags with valid href attributes. These contain hyperlinks for navigation.
- (c)
Header tags from <h1> to <h6>, which capture the semantic layout and hierarchy of the content. These are useful for validating content structure and screen reader accessibility.
- (d)
Image (<img>) tags that include src attributes, representing embedded media that should be loaded and visible.
- (e)
Form (<form>) for verifying form submission structure (attributes such as action, method, and name).
- (f)
Input (<input>) for capturing interactive input fields with attributes (type, id, name, placeholder).
- (g)
Button (<button>) for clickable elements that trigger actions or submissions.
- (h)
Select (<select>) for validating dropdown options and user selections.
This subset was selected as it represents both the fundamental and interactive elements required to verify a webpage’s structure and behaviour. It includes tags for verifying a page’s identity (<title>), navigational integrity (<a>), content hierarchy (<h1>–<h6>), and media rendering (<img>), as well as interactive components such as forms (<form>), input fields (<input>), buttons (<button>), and dropdowns (<select>). Together, these elements capture the core static and dynamic aspects of modern web interfaces, enabling comprehensive validation of user interactions such as data entry, submission, and option selection.
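The following sketch illustrates how such a parse_html(soup) helper could collect this tag subset with BeautifulSoup; it follows the function name used in the paper, but the exact attribute handling is an assumption.

from bs4 import BeautifulSoup

def parse_html(soup: BeautifulSoup) -> dict:
    """Collect the testable elements of the parsed DOM into a dictionary."""
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "headings": {f"h{i}": [h.get_text(strip=True) for h in soup.find_all(f"h{i}")]
                     for i in range(1, 7)},
        "images": [img["src"] for img in soup.find_all("img", src=True)],
        "forms": [{k: f.get(k) for k in ("action", "method", "id", "name")}
                  for f in soup.find_all("form")],
        "inputs": [{k: inp.get(k) for k in ("type", "id", "name", "placeholder")}
                   for inp in soup.find_all("input")],
        "buttons": [{"text": b.get_text(strip=True), "type": b.get("type"),
                     "id": b.get("id"), "name": b.get("name")}
                    for b in soup.find_all("button")],
        "selects": [{"name": s.get("name"), "id": s.get("id"),
                     "options": [o.get_text(strip=True) for o in s.find_all("option")]}
                    for s in soup.find_all("select")],
    }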
These extracted elements are assembled into a Python dictionary using key-value mappings, where each category is stored as a list of values. A representative sample output would resemble the following:
{
  "title": "Example Web Page",
  "links": [
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "headings": {
    "h1": ["Welcome"],
    "h2": ["About Us"]
  },
  "images": [
    "/images/banner.png"
  ],
  "forms": [
    {
      "action": "/submit-form",
      "method": "post",
      "id": "contactForm",
      "name": "contact_form"
    }
  ],
  "inputs": [
    {
      "type": "text",
      "name": "username",
      "id": "userField",
      "placeholder": "Enter your name"
    },
    {
      "type": "email",
      "name": "email",
      "id": "emailField",
      "placeholder": "Enter your email"
    }
  ],
  "buttons": [
    {
      "text": "Submit",
      "type": "submit",
      "id": "submitBtn",
      "name": "submit_button"
    }
  ],
  "selects": [
    {
      "name": "country",
      "id": "countrySelect",
      "options": ["India", "United Kingdom", "United States"]
    }
  ]
}
The parsed output is lightweight and semantically rich, capturing key aspects of the page’s user interface while remaining easy to interpret. It serves as the direct input for the next stage, where LLM prompts are constructed and test scripts are generated using the GPT-4, Claude, and Grok APIs (Application Programming Interface). To ensure the reliable execution of the HTML extraction phase, exception handling is implemented. If the network request fails, the URL is invalid, or the response cannot be parsed, the system catches the error, logs a clear message, and stops further processing. This prevents incomplete or faulty HTML data from affecting later stages, helping to maintain the accuracy and consistency of the overall test automation workflow. Step 1 involves acquiring the webpage content via HTTP requests and then using BeautifulSoup to parse its structure, identifying key elements. This step is fundamental for setting up the automated generation of test cases.
3.2. Step 2: Test Script Generation
After the HTML is parsed and key elements are extracted, the framework enters its second phase, which serves as the intelligence layer, where we utilise various LLMs (GPT-4, Claude, and Grok) to generate automation scripts based on a predefined prompt, requiring minimal manual effort. The structured data from the first phase is passed to the generate_selenium_code() function. This function merges the parsed HTML content with a prompt to generate executable Selenium code. The prompt has two primary purposes: (1) to describe the web content clearly to LLMs and (2) to guide LLMs in producing test logic that follows standard automation practices. The prompt used in this framework is shown in the box below:
(Prompt box: a structured prompt that describes the parsed web content to the LLM and instructs it to produce Selenium test logic following standard automation practices.)
In response, LLMs generate a structured Python script using Selenium WebDriver for browser automation, the find_element() method for DOM interaction, and try-except blocks for error handling. The script begins with WebDriver setup, runs through the test steps, and ends with driver.quit() for cleanup (refer to the video in the GitHub link). This format aligns with standard practices in UI testing. The LLM output often includes extras, such as markdown formatting (e.g., ```python), phrases like “Here is the Selenium code,” and explanatory comments. While helpful for human readers, these elements are unnecessary for automated execution. To address this, a post-processing step is included to clean and optimise the script before running it; this constitutes Step 3 of the proposed AutoQALLMs approach, as shown in
Figure 1. By embedding best practices and test logic within an intelligent agent, this step reduces manual effort and ensures consistency, scalability, and quicker adaptation to UI changes. It demonstrates how LLM-driven development can be integrated into real-world test automation.
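A minimal sketch of this generation step is shown below, assuming the OpenAI Python client; the prompt wording is illustrative and abbreviates the full prompt used by the framework.

import json
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def generate_selenium_code(parsed_elements: dict, url: str) -> str:
    """Ask the LLM to turn the parsed page summary into a Selenium test script."""
    prompt = (
        "You are a QA automation engineer. Write an executable Python Selenium "
        "WebDriver script for the page below. Verify the title, check links and "
        "images, and exercise forms, inputs, buttons, and selects. Wrap interactions "
        "in try-except blocks and end with driver.quit().\n\n"
        f"URL: {url}\nParsed elements:\n{json.dumps(parsed_elements, indent=2)}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content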
3.3. Step 3: Script Cleaning and Optimisation
After the LLMs generate the Selenium test script, the framework moves to its third stage, which prepares the raw output for execution by cleaning the script using automated text processing tools, as shown in
Figure 1. The goal is to convert the semi-structured and often verbose output into a clean Python file that follows syntax rules and software engineering best practices. Although LLMs produce logical and well-structured code, they frequently include additional text intended for human readers. Examples include markdown markers like ```python, phrases such as “Here is the Selenium code,” and placeholders like “Please replace this section…”. These elements are not suitable for automated execution and must be removed. To handle this, the framework uses the clean_selenium_code() function, which applies a set of regular expressions to remove unnecessary content. This is a typical application, as regular expressions are a powerful and widely used technique for text parsing and validation. Although they can be difficult for developers to read and maintain, their utility in programmatic text processing is well-established [
35].
Table 2 lists the types of content removed during this step and explains why each is excluded:
The regex rules are carefully designed to ensure that only executable code remains in the final script. By removing narrative content and presentation-related syntax, the framework provides a cleaned script that is ready for automated testing without causing execution errors by the Python interpreter. After this cleaning step, the script is passed through a formatting stage using the autopep8 library. This tool enforces Python’s PEP 8 style guidelines, which focus on readability, consistent indentation, and proper line spacing. Formatting the code at this point enhances clarity for human reviewers and makes the script more easily integrable into automated workflows or version control systems.
In addition to formatting and content cleanup, the framework uses the remove_lines_after_quit() function. This step ensures the script ends immediately after the driver.quit() command, which marks the close of the Selenium session. Any lines that follow, such as debug notes or leftover placeholders, are removed to prevent errors during test execution. The cleaned code is then saved as a .py file. These cleaning and optimisation steps together form a strong post-processing layer that enforces proper syntax and functional accuracy. This final version of the script moves to the last stage of the framework: test execution and reporting. At this point, the refined and clean script interacts with a live browser session, completing the LLM-powered automation cycle.
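A minimal sketch of this post-processing stage is shown below; the regex patterns are representative examples of the rules in Table 2 rather than the framework's exact set.

import re
import autopep8

def clean_selenium_code(raw: str) -> str:
    """Strip non-executable content from the LLM output and format the result."""
    code = re.sub(r"```(?:python)?", "", raw)                         # markdown fences
    code = re.sub(r"(?im)^.*here is the selenium code.*$", "", code)  # narration
    code = re.sub(r"(?im)^.*please replace.*$", "", code)             # placeholders
    return autopep8.fix_code(code)                                    # PEP 8 formatting

def remove_lines_after_quit(code: str) -> str:
    """Truncate everything after the final driver.quit() call."""
    lines = code.splitlines()
    for i in reversed(range(len(lines))):
        if "driver.quit()" in lines[i]:
            return "\n".join(lines[: i + 1]) + "\n"
    return code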
3.4. Step 4: Test Execution and Reporting
The final stage of the proposed AutoQALLMs framework, i.e., Step 4, focuses on test execution and reporting. It runs the optimised .py test script on the web application and records the outcomes to verify the application’s functionality. Execution is managed by the execute_selenium_code() function, which launches a browser using Selenium’s Chrome WebDriver. The browser is configured with ChromeOptions to open in full-screen mode. Running in non-headless mode enables real-time observation of the test flow, helping to identify issues such as UI misalignment, loading delays, or unexpected pop-ups. The script then carries out a sequence of actions generated by LLMs, including navigating the DOM, locating elements, clicking buttons, submitting forms, and following links.
During execution, all test outcomes, whether passed or failed, are printed to the console along with contextual log messages. These logs serve as a live reporting mechanism, offering immediate insight into the test flow. Although the current setup uses terminal-based reporting, the architecture is flexible. Future versions can easily integrate tools like PyTest plugins or Allure to generate HTML dashboards, JSON logs for CI pipelines, or XML reports for systems like TestRail and Zephyr. Moreover, if an error occurs, whether due to a logic issue, an outdated selector, or an unhandled exception, detailed error traces are captured. This final phase validates both the functionality of the web application and the reliability of the LLM-generated test script. It completes the framework with actionable results, demonstrating that LLM-driven test automation is practical and effective for real-world QA processes.
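A minimal sketch of the execution step is given below; invoking the cleaned .py script with exec() is one plausible mechanism, assuming the generated script performs its own WebDriver setup and teardown as described earlier, and is not necessarily the exact implementation.

import traceback

def execute_selenium_code(script_path: str) -> None:
    """Run the cleaned test script and report outcomes on the console."""
    with open(script_path, encoding="utf-8") as f:
        code = f.read()
    try:
        exec(compile(code, script_path, "exec"), {"__name__": "__main__"})
        print(f"PASSED: {script_path}")
    except Exception:
        # Capture detailed error traces for outdated selectors, timeouts, etc.
        print(f"FAILED: {script_path}")
        traceback.print_exc()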
4. Experimental Setup
4.1. Tools and Technologies
AutoQALLMs combine tools, libraries, and APIs that enable intelligent, scalable, and high-performance test automation.
Table 3 presents a summary of the main tools and their respective functions.
4.2. Testing Strategy
To validate the reliability, performance, and robustness of AutoQALLMs, a multi-layered testing strategy is implemented, comprising unit testing, integration testing, functional testing, and performance evaluation.
Unit testing is applied to core functions such as fetch_html(), parse_html(), and clean_selenium_code(). Each function is tested independently to confirm it works correctly under different input conditions. Simulating edge cases, such as invalid URLs, missing tags, or malformed HTML content, also verifies error handling.
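A minimal pytest sketch of such unit-level checks is shown below; the module path autoqallms is a placeholder for wherever the framework's fetch_html and parse_html helpers are defined.

import pytest
import requests
from bs4 import BeautifulSoup

from autoqallms import fetch_html, parse_html  # placeholder module path

def test_fetch_html_invalid_url_raises():
    # An unreachable host should raise a requests exception, not return partial HTML.
    with pytest.raises(requests.exceptions.RequestException):
        fetch_html("http://invalid.localdomain/")

def test_parse_html_tolerates_missing_tags():
    # A page without links or images should still yield empty collections.
    soup = BeautifulSoup("<html><head><title>T</title></head><body></body></html>",
                         "html.parser")
    parsed = parse_html(soup)
    assert parsed["title"] == "T"
    assert parsed["links"] == []
    assert parsed["images"] == []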
Integration testing verifies that modules work together smoothly. For example, the flow of parsed data from parse_html() to generate_selenium_code() is tested using websites with different layouts. The handoff from LLMs output to the cleaned script is also reviewed to ensure it maintains logic and functionality.
Functional testing is performed on real websites to confirm that the generated scripts can interact with and validate dynamic elements. This includes checking page titles, clicking links, detecting broken images, and filling out input forms. These tests ensure the automation mimics realistic user behaviour.
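The following is an illustrative example of the kind of functional checks involved; the URL and locators are placeholders rather than scripts taken from the evaluation.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")  # placeholder URL

    # Verify that the expected page title is displayed.
    assert "Example" in driver.title, f"Unexpected title: {driver.title}"

    # Detect broken images: naturalWidth == 0 means the image failed to load.
    for img in driver.find_elements(By.TAG_NAME, "img"):
        if driver.execute_script("return arguments[0].naturalWidth;", img) == 0:
            print(f"Broken image: {img.get_attribute('src')}")

    # Exercise a simple form interaction if an input field is present.
    inputs = driver.find_elements(By.TAG_NAME, "input")
    if inputs:
        inputs[0].send_keys("test value")
finally:
    driver.quit()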
Performance testing measures how efficiently the framework runs. It assesses response time and execution speed across different types of websites, ranging from simple static pages to complex, JavaScript-heavy ones. Key metrics such as script generation time, test execution duration, and browser response delays are recorded and analysed.
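A small helper of the following form can capture these timings; its use with the framework's generation and execution functions is illustrative.

import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example (names refer to the framework functions described in Section 3):
# script, generation_time = timed(generate_selenium_code, parsed_elements, url)
# _, execution_time = timed(execute_selenium_code, "generated_test.py")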
Together, these validation steps confirm that the framework runs reliably across various environments. They also demonstrate the framework's adaptability, effectiveness in practical use, and readiness for integration into continuous testing pipelines.
4.3. Test Subjects
To validate the reliability, performance, and robustness of AutoQALLMs, the framework was evaluated on 30 publicly available automation practice websites. The complete list of target applications is provided in
Table 4. These 30 websites encompass a diverse range of structural and functional characteristics, ensuring a comprehensive evaluation of the proposed framework. The selection includes static and text-oriented pages (e.g., HerokuApp: iFrame, Nested Frames), interactive and form-based applications (e.g., DemoQA: Practice Form, LetCode), and multimedia-rich or e-commerce platforms (e.g., DemoBlaze, OpenCart, Greencart). Across these websites, the average number of interactive menus or navigation components ranged from three to eight per page, depending on the website’s complexity and content type. Static websites primarily comprised textual and image elements, whereas dynamic and multimedia sites incorporated AJAX-driven components, modal windows, dynamic tables, and multiple form-input fields. This diversity ensures that AutoQALLMs have been assessed under heterogeneous testing conditions, validating their adaptability and robustness across simple, content-focused pages as well as complex, multi-component web interfaces.
4.4. Evaluation Metrics
To evaluate the performance of the proposed AutoQALLMs, the following metrics are used:
Script Generation Time: The time taken to generate test scripts before execution. Faster script generation reduces manual effort and improves productivity [
33].
Execution Speed: The total time needed to run all test cases after the script is ready. Shorter execution time increases overall testing efficiency [
21].
Test Coverage: Test coverage measures how many elements on a webpage (such as links, headings, and images) are exercised by the test script relative to the total number of elements parsed. Higher coverage indicates more thorough testing [
34].
Currently, AutoQALLMs do not scan or test every element on the page. Still, they cover several common ones, including links, anchors, forms, inputs, buttons, selects, headings, and images, and create 30 basic test cases.
Test coverage is calculated using the following formula:
Test Coverage (%) = (Number of parsed elements exercised by the generated test script / Total number of parsed elements) × 100
Failure Rate: The failure rate is the percentage of test cases that fail during execution. This can happen due to incorrect locators, timing issues, or logic errors in the script [
4].
For the proposed approach, the failure rate is around 20%. This is expected, as zero-shot prompting can occasionally generate incorrect locators for complex or dynamically loaded elements. Future work will aim to reduce this by experimenting with few-shot learning, fine-tuning, and chain-of-thought prompting approaches. Moreover, the observed 20% failure rate suggests that while AutoQALLMs are comparatively effective for rapid, initial test suite generation, their output would still require a brief human review before being integrated into a mission-critical regression pipeline. This positions the tool as a powerful ‘test accelerator’ rather than a complete replacement for QA oversight.
Adaptability: The degree to which a product or system can be effectively and efficiently adapted for different or evolving hardware, software or other operational or usage environments [
36].
Readability: The ease with which a human reader can comprehend the purpose, control flow, and operation of source code. It is a human judgement of how easy a text is to understand [
37].
Maintainability: The degree of effectiveness and efficiency with which the intended maintenance engineers can modify a product or system. Maintainability is composed of modularity, reusability, analysability, modifiability, and testability [
36].
Scalability: The degree to which a system, network, or process can handle a growing amount of work, or its potential to be enlarged to accommodate that growth [
38].
The independent variables considered in this evaluation include script generation time, execution speed, number of web elements parsed, and type of web application. These variables were varied across multiple websites to assess AutoQALLMs’ performance. The dependent variables, namely test coverage, failure rate, readability, adaptability, maintainability, and scalability, were used to measure performance outcomes. The relationship between these variables indicates that as page complexity and element count increased, AutoQALLMs maintained high coverage and low execution time, confirming their scalability and efficiency.
5. Results and Discussion
We evaluated the performance of the AutoQALLMs using 8 widely used metrics: (i) script generation time, (ii) execution speed, (iii) test coverage, (iv) failure rate, (v) adaptability, (vi) readability, (vii) maintainability, and (viii) scalability. We compared the results with manually written Selenium scripts and Monkey Testing. Each method was tested on 30 web applications. The tests were repeated 50 times, and the results were averaged to minimise variation. The evaluation focused on how each method performs in structured testing tasks and how suitable each is for real-world use.
Monkey Testing is a widely used technique for automated testing that involves generating random user interactions on applications without predefined steps [
8,
27,
39]. This method can help identify unexpected issues, such as application crashes, and is often used for stress testing [
39]. However, because the approach “lacks knowledge about the form fields, its random trials can be extremely slow,” which makes it less useful for structured or planned automation tasks [
8].
The performance of AutoQALLMs-generated scripts, manually written scripts, and monkey testing is evaluated, and the results are presented in
Table 5. The comparison highlights how each method performs across different web applications.
5.1. Evaluation Methodology and Expert Rubric
To validate the results, structured feedback was collected from five domain experts in software testing and automation using a short survey, as shown in
Appendix A. The experts were selected from diverse industries and roles to provide balanced insights. Their input helped interpret the observed differences in test performance across AutoQALLMs-generated scripts, manually written scripts, and Monkey Testing (see
Table 6).
Expert 1 stated that AutoQALLMs could create test scripts within seconds, whereas manual scripting typically took several hours to complete. The expert believed this speed was beneficial for manual testers who lack strong coding skills. He also mentioned that AutoQALLMs can generate multiple tests simultaneously, making it easier to scale testing for large projects. However, the expert pointed out that some of the generated scripts were too general. While AutoQALLMs made it easy to regenerate broken tests after UI changes, the expert felt that manual scripts were still easier to understand, fix, and share with others. Expert 2 noted that AutoQALLMs covered slightly fewer UI elements (96%) than manual scripts (98%). However, the expert explained that manual scripts often included more meaningful checks based on the application’s logic. In the expert’s view, AutoQALLMs were good at identifying what was visible on the screen, but they sometimes missed deeper, more thoughtful validations. Expert 2 felt this showed a trade-off between more exhaustive coverage and detailed accuracy.
Similarly, Expert 3 focused on how well the scripts handled changes to the user interface. The expert stated that manual scripts were more effective at handling these changes because human testers could write the code in a way that anticipated updates. In contrast, AutoQALLMs scripts often broke when the layout changed. The expert also said Monkey Testing did not adjust at all, making it unreliable for serious testing. Additionally, Expert 4 agreed with many of the earlier points, particularly regarding the speed at which the scripts ran. The expert stated that AutoQALLMs scripts typically complete in about 20 s, which helps speed up regression testing. Manual scripts took longer (approximately 35 s), primarily due to additional checks added by the tester. Expert 4 also agreed with Expert 3 that manual scripts handled changes in the UI better. Finally, Expert 5 discussed failure rates. The expert added that manual scripts had the lowest failure rate (approximately 10%) because testers could carefully select the correct elements and use custom wait times. AutoQALLMs had a moderate failure rate (approximately 20%); they performed well on stable websites but struggled with dynamic ones. Monkey Testing failed most often (around 50%) because its actions were random. The expert mentioned that AutoQALLMs scripts were easy to read but sometimes longer and less clear than manual scripts.
The expert feedback confirmed that the AutoQALLMs testing brought gains in speed, coverage, and scalability. However, it still had some limitations in terms of test accuracy, UI changes, and deeper validation. These limitations underscore the continued importance of human oversight in automation testing workflows. The results indicate that LLMs can support software testing, but they should be used in conjunction with manual review for reliable outcomes.
To complement the qualitative opinions presented in
Table 6, the same five domain experts were asked to rate the qualitative attributes of the generated test scripts: readability, adaptability, maintainability, and scalability using a five-point Likert rubric (1 = Poor, 5 = Excellent). The detailed rubric and evaluation guidelines provided to experts are presented in
Appendix A.2. The aggregated expert ratings are summarised in
Table 7.
The quantitative evaluation confirmed high consistency among experts, supporting the reliability of the qualitative feedback and validating the robustness of the generated Selenium scripts across all four attributes.
5.2. Model Comparison
In addition to these performance metrics, the analysis was expanded to include cost efficiency.
Table 8 presents both the cost per million tokens and the estimated cost per individual test script. Across the evaluated websites, GPT-4 averaged about USD 0.011 [
40] per test, compared with USD 0.016 for Claude 4.5 [
41] and USD 0.0006 for Grok Fast [
42], based on an average token consumption of 1800 tokens per site. This cost-aware evaluation quantitatively supports GPT-4’s balance between accuracy and affordability, reinforcing the study’s emphasis on optimising trade-offs between performance and operational expense rather than focusing solely on raw coverage. LLM comparison resources and implementation details are available for future work (
https://github.com/Sindhupriya2797/AutoQALLMs).
During experimentation, each test case generation consumed approximately 1800 tokens per website, evenly distributed between input and output tokens. The estimated cost per script for each model was computed using 2025 API pricing as follows:
For 900 input and 900 output tokens, the cost per script is estimated as:
Cost per script (USD) = (900 × input price per million tokens + 900 × output price per million tokens) / 1,000,000
Thus, the average cost per test case was approximately $0.01 for GPT-4, $0.016 for Claude 4.5, and $0.0006 for Grok Fast, demonstrating that GPT-4 offers a strong trade-off between cost and accuracy.
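A simple helper of the following form reproduces these per-script estimates; the per-million-token prices are parameters to be taken from Table 8, not values fixed here.

def cost_per_script(input_tokens: int, output_tokens: int,
                    price_in_per_million: float, price_out_per_million: float) -> float:
    """Estimate the USD cost of generating one test script."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

# Example with the paper's average split of 900 input and 900 output tokens:
# cost_per_script(900, 900, price_in_per_million=..., price_out_per_million=...)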
The pricing structure of each model clarifies the observed cost-performance trade-offs. Grok Fast is designed for high-volume, low-cost tasks, with output rates nearly twenty times cheaper than GPT-4 and thirty times cheaper than Claude 4.5. Claude 4.5 Sonnet is the most expensive, particularly in terms of output tokens, reflecting its emphasis on advanced reasoning. GPT-4 offers a balanced middle ground between capability and affordability. Across all three, output tokens remain significantly more costly than input tokens.
To extend the existing LLM comparisons, as shown in
Table 8, Claude 4.5 achieved the highest coverage and execution reliability, benefiting from its stronger contextual reasoning and coherent instruction-following ability. GPT-4, however, demonstrated the best balance between syntactic accuracy, runtime stability, and execution speed, making it the most consistent model for Selenium-based test automation. Although Grok Fast generated outputs more quickly, its limited handling of dynamic DOM structures led to a lower pass rate.
5.3. Computational Complexity Analysis
In addition to the empirical comparison of execution time, the computational complexity of the proposed framework was analysed to evaluate its scalability. AutoQALLMs operate through four sequential phases: (i) HTML extraction and parsing, (ii) LLM-based script generation, (iii) regex-driven code cleaning, and (iv) Selenium-based execution. The overall time complexity can be expressed as:
T(n, m, t) = O(n) + O(m) + O(t) = O(n + m + t),
where n denotes the number of parsed HTML elements, m represents the length of the generated script, and t corresponds to the number of executed test cases. Since HTML parsing dominates the total computation, the framework exhibits an approximately linear growth rate:
T ≈ O(n).
This indicates that the execution time increases proportionally with the number of web elements.
Manual Selenium scripting, on the other hand, requires repetitive human intervention for every additional element and test case, resulting in a slower, non-scalable process that can be approximated as:
T_manual ≈ O(h · n · t),
where h is the human effort factor per element.
Similarly, Monkey Testing performs random interactions with a complexity of O(r), where r is the number of random events required to reach sufficient coverage. However, due to its stochastic nature, this approach demands significantly more iterations to achieve comparable accuracy.
Empirically, AutoQALLMs generated executable Selenium scripts within approximately 20–25 s per website, compared to 2 h for manual scripting, while maintaining 96% coverage. The average execution time per test was around 20 s, confirming both the linear time complexity and the substantial reduction in computational and analytical effort.
5.4. Comparison with State-of-the-Art Approaches
A significant contribution to codeless test automation was made by [
3], who proposed a framework combining Selenium and machine learning to generate tests without writing code. Their method relied on an SVM trained on manually annotated data to predict actions. However, it could not test several behaviours simultaneously (it was evaluated only on search functionality). Similarly, ref. [
8] introduced a T5-GPT-based framework that combined Crawljax, T5, and GPT-4o to automate web form interaction. Their method parsed the DOM using Crawljax to identify input fields and used LLMs for field classification and value generation. While it improved coverage over reinforcement learning agents, it was limited to form-filling tasks and could not handle categories whose formats depend on the selected locations. In mobile GUI testing, Liu et al. [
20] presented GPTDroid, which redefines testing as a question-and-answer task using GPT-3.5 with functionality-aware prompts. This method achieved high activity coverage on Android apps but was limited to mobile domains, and it faced challenges with UI components that lacked clear text labels. Lastly, ref. [
22] proposed an LLM-powered approach for visual UI and A/B testing, using screenshot comparisons to identify rendering issues. Although effective for visual validation, the method lacked DOM-level analysis and did not produce an actionable test script.
Another key point of comparison is the underlying methodology used to interpret the web application’s structure. While AutoQALLMs generate scripts by providing a direct, parsed HTML summary to the LLM, the framework by [
30] employs an intermediate modelling step. Their approach creates explicit screen transition graphs and state graphs to model the application’s flow before generating tests. This graph-based methodology is specifically designed to improve test coverage for dynamic navigation and complex conditional forms, which are known challenges for more direct generation methods. In contrast, AutoQALLMs eliminate the need for handwritten templates or screenshot-based prompts by utilising various LLMs, including GPT-4, Claude, and Grok, to generate executable Selenium scripts directly from HTML. The framework integrates BeautifulSoup to parse live DOM structures and uses regular expressions to adapt the script automatically to structural changes. Unlike previous approaches, AutoQALLMs cover functional testing, dynamically update scripts, and support a full-cycle pipeline from generation to execution and reporting without relying on visual UI cues or fixed form data. This enables AutoQALLMs to maintain robustness in fast-changing web applications. To the best of our knowledge, we present the comparison between the proposed and state-of-the-art approaches in
Table 9.
In addition to the model comparisons discussed earlier, three complementary categories were examined to contextualise existing approaches: bot-detection and behaviour analysis frameworks, web crawling systems such as Scrapy, and Named Entity Recognition (NER) techniques. Behaviour analysis tools, such as OWASP AppSensor, monitor application layer activities to detect anomalous or automated behaviour and trigger alerts based on predefined sensor rules [
43]. Scrapy is an open-source crawling framework designed for efficient extraction of static HTML content, although it cannot process dynamic or interactive web elements [
44]. NER techniques [
45] identify and classify meaningful entities such as names, locations, and organisations within unstructured text using advanced natural language processing models.
Table 10 summarises these approaches, highlighting their objectives, methodologies, and core limitations in relation to intelligent automation systems that integrate semantic reasoning with executable validation.
5.5. Discussion
The proposed AutoQALLMs provide a preliminary study of how LLMs can collaborate with tools such as DOM parsers and Selenium to automatically generate and execute test scripts for websites. The results show that when provided with well-structured prompts and the correct HTML input, LLMs can generate Selenium scripts that run successfully without any human intervention. This directly answers our first two research questions: (i) “How can LLMs be combined with web scraping and Selenium to create test scripts?”, and (ii) “Can LLMs turn HTML into working Selenium scripts using zero-shot prompts?” By using BeautifulSoup to extract HTML elements and designing prompts that guide the LLMs to generate Selenium test scripts, AutoQALLMs were able to convert web content into test actions. The use of zero-shot prompting also meant that the system could handle new or unfamiliar page elements without requiring retraining of the model.
For the third research question, which asked “Can AutoQALLMs outperform manual or semi-automated testing in code coverage and fault detection?”, the expert feedback revealed both strengths and areas for improvement. On the positive side, AutoQALLMs achieved broad UI coverage (96%) that was competitive with manual methods (98%) and saved significant time in writing scripts. However, there were also concerns, such as missing deeper logic validations and having lower debuggability compared to manually written scripts. AutoQALLMs were able to quickly regenerate broken tests and run them, which is helpful for agile teams and CI/CD pipelines. This capability directly addresses a significant, unresolved challenge in the field, as identified by recent surveys: the high maintenance costs of test suites in the face of frequent UI changes [
9]. Overall, the approach shows promise in making automation more accessible, especially for teams with less coding experience, and it opens up new directions for combining human expertise with LLM-based automation in the future.
While AutoQALLMs are primarily implemented using GPT-4, a comparative evaluation was conducted with Claude 4.5 and Grok Fast to benchmark performance. The findings indicated that Claude 4.5 achieved marginally higher test coverage (96%) due to its advanced architecture and enhanced reasoning capabilities. Nonetheless, GPT-4 consistently generated syntactically valid Selenium scripts with fewer runtime errors and exhibited faster, more predictable generation times across diverse websites. When both cost and stability were considered, GPT-4 emerged as the most balanced option for scalable deployment: its average cost per test generation was lower than that of Claude 4.5, and its prompt outputs were more reproducible under identical conditions, an essential factor for continuous testing pipelines. Consequently, the framework continues to utilise GPT-4 as its default engine, while Claude 4.5 and Grok Fast serve as comparative baselines that contextualise performance-versus-cost trade-offs in LLM-driven testing. It should also be noted that Claude 4.5 represents a newer model generation than GPT-4, which partly explains its marginally higher coverage. This distinction highlights the need for ongoing benchmarking across future releases, such as GPT-5 and Claude Next, to ensure fair and up-to-date evaluation.
5.6. Threats to Validity
A key threat to the internal validity of AutoQALLMs is its dependence on LLMs for generating Selenium scripts. While it performs well overall, LLMs can lose context in long or complex interactions, leading to incomplete or incorrect test cases. The output quality also heavily depends on prompt design, which may affect consistency across different environments. Moreover, we currently parse only a subset of tags (e.g., links, headings, and images), which limits test coverage. This may result in an incomplete representation of real-world page structures.
Construct validity is limited by AutoQALLMs’ ability to convert natural language into actionable test steps reliably. This depends on the model’s understanding of dynamic or ambiguous web structures. Prior work, such as GPTDroid [
20], has demonstrated that LLMs can handle functionality-aware prompts; however, they sometimes struggle to maintain reasoning over multiple turns.
Technical challenges also arise during DOM scraping, especially for JavaScript-heavy pages or those with anti-scraping protections, which can result in the omission of key elements. To address these issues, future work could explore memory-augmented models and explainable AI techniques to enhance the reasoning and traceability of the generated steps.
Finally, since AutoQALLMs currently focus on web GUI testing, their generalizability to other domains, such as mobile apps or chatbot interfaces, remains an open question. Real-world testing also involves complex user behaviours, so integrating multimodal inputs and modelling user actions could further enhance test realism and coverage.
6. Conclusions and Future Work
This study demonstrated that the proposed AutoQALLMs can be used to generate Selenium test scripts that are fast, scalable, and easy to maintain. The results indicate that AutoQALLMs scripts achieved competitive UI coverage while drastically reducing the time needed to create and run tests. Experts’ feedback confirmed that this method can support faster regression testing and make automation more accessible, particularly for teams with less technical skill. However, the scripts sometimes failed when webpages changed or when more thorough checks were required. By contrast, Monkey Testing was less valuable due to its random behaviour and limited test coverage.
The comparison also showed that manual scripts had the lowest failure rates and adapted best to UI changes; however, they required more time to write and maintain. AutoQALLMs scripts offered a good balance between speed and reliability. They were slightly more prone to errors, but could be quickly fixed or regenerated. Expert feedback confirmed that this approach is beneficial in agile environments and can enhance overall testing efficiency when used in conjunction with human review. Our comparative analysis showed that Claude 4.5 attained marginally higher coverage, but GPT-4 demonstrated the best trade-off between accuracy, generation cost, and stability, reaffirming its role as the core model of AutoQALLMs.
In the future, we plan to improve the accuracy and stability of AutoQALLMs scripts. Fine-tuning the LLMs for test automation tasks may help reduce errors. Testing on larger and more complex applications will also help evaluate how well the system scales in real-world projects. Another helpful step is to connect LLM-based testing with tools like JMeter or Gatling to support performance testing. Adding self-healing features could help the system adjust to UI changes without manual updates. This is a key area of research, with recent approaches using LLMs to intelligently re-locate web elements after a UI change, significantly improving test script robustness [
29]. Future work will also explore semi-automated assertion generation to enable basic behavioural validation. By inferring expected outcomes from HTML5 attributes and JavaScript-based interface cues, AutoQALLMs can extend beyond structural testing to verify user interactions and page responses. Furthermore, future work could incorporate visual understanding by leveraging Large Vision-Language Models (LVLMs). As demonstrated by the VETL framework [
32], LVLMs can analyse screenshots to overcome the limitations of ambiguous HTML and better understand complex, dynamic user interfaces, representing a promising path to enhance the robustness of AutoQALLMs. Comparing this approach with traditional frameworks in terms of cost, speed, and resource use will also help guide its use in industry. These steps may lead to more reliable and intelligent testing systems with less human effort. AutoQALLMs demonstrate the growing role of LLMs in shaping the future of scalable and intelligent software testing.