Software · Article · Open Access · 5 August 2025

Research and Development of Test Automation Maturity Model Building and Assessment Methods for E2E Testing

1 Customer Success Division, Nihon Knowledge Co., Ltd., JS Building 9F, 3-19-5 Kotobuki, Taito-ku, Tokyo 111-0042, Japan
2 Faculty of Systems Design, Tokyo Metropolitan University, 6-6 Asahigaoka, Hino-shi, Tokyo 191-0065, Japan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Software Reliability, Security and Quality Assurance

Abstract

Background: While several test-automation maturity models (e.g., CMMI, TMMi, TAIM) exist, none explicitly integrate ISO 9001-based quality management systems (QMS), leaving a gap for organizations that must align E2E test automation with formal quality assurance. Objective: This study proposes a test-automation maturity model (TAMM) that bridges E2E automation capability with ISO 9001/ISO 9004 self-assessment principles, and evaluates its reliability and practical impact in industry. Methods: TAMM comprises eight maturity dimensions, 39 requirements, and 429 checklist items. Three independent assessors applied the checklist to three software teams; inter-rater reliability was ensured via consensus review (Cohen’s κ = 0.75). Short-term remediation actions based on the checklist were implemented over six months and re-assessed. Synergy with the organization’s ISO 9001 QMS was analyzed using ISO 9004 self-check scores. Results: Within six months of remediation, the mean TAMM score rose from 2.75 to 2.85, and inter-rater reliability reached Cohen’s κ = 0.75. Conclusions: The proposed TAMM delivers measurable, short-term maturity gains and complements ISO 9001-based QMS without introducing conflicting processes. Practitioners can use the checklist to identify actionable gaps, prioritize remediation, and quantify progress, while researchers may extend TAMM to other domains or automate scoring via repository mining.

1. Introduction

Today, software has become an indispensable element in almost every aspect of our lives, making daily life more convenient and efficient and supporting comfortable lifestyles. In 2001, 17 software developers who had adopted methods different from traditional software development techniques met to discuss their principles and methods and published the results as the Manifesto for Agile Software Development [1]. Agile software development methodologies and DevOps [2] emerged, enabling development approaches that adapt to the speed of business and the convenience expected of software. Full-stack engineers, who have a wide range of skills and can handle all stages of the development process, from defining system requirements to design, development, operations, and maintenance, have become active by integrating development and operations, prioritizing automation and monitoring, and improving development efficiency. Engineers keeping pace with these new development technologies support the systems that connect our lives and continually strive to improve quality.
On 18 July 2024, a system failure caused by a malfunction in CrowdStrike Holdings, Inc. [3] security software resulted in blue screen errors on Windows devices worldwide. This incident is estimated to have affected approximately 8.5 million Windows devices worldwide, making it the largest such incident in history. The global economic damage from this outage is estimated to be at least USD 10 billion, and it disrupted numerous systems, including medical facilities and public transportation systems, impacting people’s lives around the world.
In Japan, frequent reports of system outages at banks and securities firms have prevented users from accessing systems or conducting transactions, causing widespread disruptions to business operations.
Considering that such a critical outage occurred in a system that requires high reliability and had a social impact, improving development productivity alone is not enough to reduce outages. It is essential to advance quality management technologies without compromising development productivity.
The latest quality-in-use model in the ISO 25000 [4] series of international standards for software quality requirements and evaluation, known as SQuaRE (Systems and software Quality Requirements and Evaluation) and widely used by software quality management engineers, has added “environmental and societal risk” as a sub-characteristic of “freedom from risk,” which requires consideration of measures to mitigate the social impact of systems (Figure 1). Thus, when considering software quality, software developers must take the social impact of failures into account and ensure that users can use the software safely and with confidence.
Figure 1. Quality model of quality-in-use in ISO/IEC 25019.
Definition of environmental and societal risk: the extent to which a product or system mitigates the potential risk to the environment and society at large in the intended contexts of use.
Source: ISO/IEC 25019:2023 [5], Systems and software engineering—Systems and software Quality Requirements and Evaluation (SQuaRE)—Quality-in-use model, 3.2.2.2 freedom from environmental and societal risk.
Agile development methodologies are being adopted by companies developing SaaS and web systems to rapidly develop systems while maintaining high quality without sacrificing productivity. Various quality management and testing methodologies suitable for agile development have emerged, and to maintain development productivity, it is necessary to focus on the efficiency and automation of software testing, which is said to account for the majority of development work.
To effectively implement quality management, it is necessary to use a project Kanban board to visualize the pass rate of automated tests, the number of bugs identified, and the number of fixes, thereby clarifying the status of development productivity and quality.
Mugridge et al. [6] addressed the issues of redundancy, repetition, and high maintenance costs associated with automation using capture-replay tools by developing a custom tool to achieve efficient automation of ATDD (Acceptance Test-Driven Development). Regarding the impact of automation, Berłowski et al. [7] automated over 70% of functional tests in a Java EE/JSF-based project using Selenium WebDriver and JUnit, established a process for continuous execution in Scrum sprints, and demonstrated a 48% reduction in defect leakage rate and a reduction in customer acceptance test duration from 5 days to 3 days after implementation.
Kasurinen et al. [8] studied barriers to test automation by conducting interviews and observing data from 12 teams in six Indian IT companies. They inductively theorized the barriers and success factors for adopting and scaling test automation in agile projects. They found that skill shortages and fear of fragile tests lead to a passive attitude, while management’s requirements for quality metrics and the team’s experimental culture act as drivers.
Furthermore, Kato and Ishikawa [9] proposed a method for using the quality characteristics defined in international software quality standards in agile software development with DevOps. They demonstrated that by using quality characteristics to clarify target quality in Scrum development and sprint plans, the number of defects detected can be significantly reduced, even as test density is increased. Using quality characteristics as project quality KPIs makes it possible to visualize DoR (Deliverable/Required/Acceptable). In addition, they demonstrated that using metrics for each quality characteristic as quality gates for each sprint makes it possible to achieve built-in quality.
So how can we maximize the effectiveness of test automation to facilitate the collection of quality data? In general, organizational maturity in software development is commonly assessed using CMMI® [10]. Gibson et al. [11] systematically examine the extent to which process improvement using CMMI actually works in practice, using publicly available data and 13 case studies from 10 companies. After a cross-sectional analysis of six indicators (cost, schedule, productivity, quality, customer satisfaction, and ROI), the results show a median cost reduction of 34%, an average schedule reduction of 22%, and a productivity improvement of up to 62%.
Meanwhile, the maturity of software testing activities can be assessed with TMMi® [12]. Garousi et al. [13] conducted a multivocal literature review covering 181 articles (130 academic, 51 industry) on software test maturity assessment and improvement (TMA/TPI) and extracted 58 maturity models, of which TMMi was the most frequently used (34 studies). The business benefits of TMMi include up to 63% reduction in defect costs, ROI payback within 12 months, and improved customer satisfaction; the operational benefits include prevention of release delays, shorter test cycles, improved risk management, and more accurate estimates based on measurements; and the technical benefits include fewer critical defects, improved traceability, test automation, and improved design techniques. The review concludes that TMMi is effective regardless of organization size or development process. In addition, it identifies factors that inhibit the adoption of TMMi, such as lack of resources, resistance, and ROI uncertainty, and organizes measures to overcome these factors, providing practical guidance for practitioners considering test process improvement. Also, Eldh et al. [14] proposed the Test Automation Improvement Model (TAIM) as a new approach to improve the quality and efficiency of test automation, complementing TPI and TMMi, which do not address test automation.
The effectiveness of test automation depends on an organization’s adherence to automation principles and the guidelines for achieving them. Objectively evaluating and analyzing this adherence allows us to measure the maturity of test automation within a development organization. However, no existing maturity model explicitly maps its practices to ISO 9001 [15] clauses, leaving compliance-driven organizations without clear guidance.
We have devised the following research questions (Table 1) and elected to develop a quantifiable test automation maturity model (TAMM) employing a checklist to fill this gap, while taking into account the validity of the ensuing hypotheses.
Table 1. Research questions for TAMM.
 H1: 
There is a positive correlation between low maturity and high individual differences among automation engineers, significant variation in design and code, and the number of issues identified during reviews.
 H2: 
A consensus-based, multi-reviewer assessment approach yields consistent checklist scores, with no material divergence among the three evaluators after group discussion (Cohen’s κ ≥ 0.75).
Research Gap: None of the existing maturity models explicitly link E2E test-automation maturity with an ISO 9001-based QMS.
Research Goal: To devise TAMM that bridges E2E automation and ISO 9001 QMS, and to validate it in three industrial projects.
Figure 2 positions TAMM against CMMI, TMMi, and TAIM along two strategic axes (ISO 9001 alignment, automation scope). Through this research, we provide methodologies for improving test automation maturity, maximizing the productivity of test work (which accounts for the majority of development tasks) through automated testing, and further improving development productivity and software quality control to a higher level.
Figure 2. TAMM position.

3. Effectiveness of Guidelines for Test Automation

In general, software development depends heavily on the skills of individual engineers. Consequently, within the domain of software engineering, there has been substantial research on coding conventions, review processes, and related subjects. Enhancing code readability exerts a substantial influence on code quality. When developing software, we therefore consistently adhere to coding conventions and verify compliance through reviews.
In the context of agile development, the project charter serves as a document that provides a comprehensive overview of the project, encompassing its purpose, scope, budget, schedule, and risks. This document is generated at the inception of the project and functions as a framework for project progression (Table 2). The establishment of a project charter serves to facilitate the dissemination of project objectives among its members.
Table 2. Project charter items and contents.
As with general software development, to increase the effectiveness of test automation, the development of automated tests must not become dependent on specific individuals. As shown in Table 3, the actors involved in test automation can naturally be categorized, and this categorization is the same as that used for general development work. According to their skills, designers and developers also serve as reviewers. Based on this classification, a guideline structured as in Listing 1 is often considered most appropriate. However, such guidelines only list activities and standardize the tasks to be performed; they do not improve the quality of each task. Therefore, it is necessary to establish criteria. Clearly setting these criteria and managing compliance with them makes it easier to link with the QMS.
Table 3. Actors of test automation.
Listing 1. Draft table of contents for test automation guidelines.
1. Preparation (Common)
   - Installing the test tool
   - Building a test environment
   - Checking the operation of test tool samples
2. Test design/script implementation (Test scripts designers)
   - Test Scenario Design
   - Design/implementation of use cases
   - Design/implementation of test data
3. Test scripts development (Test scripts developers)
   - Script development
   - Script debugging
   - Test suite creation
4. Test execution (Test executors)
   - Scripts execution
   - Test results analysis
5. Test maintenance/Test operations (Maintenance developers)
   - Development of a general-purpose test environment (CI environment)
   - Maintenance and updating of the test environment
   - Maintenance and updating of test tools
   - Script maintenance
Based on the key points of the identified guidelines, we created guidelines for projects using Ranorex [35] and Power Automate [36].
For projects using Power Automate, we created a separate document outlining coding conventions for flows that function as automation scripts. This difference stems from the fundamental differences between Power Automate and Ranorex. Power Automate is a GUI-based, no-code development tool, whereas Ranorex requires coding. This results in lower update frequencies for coding conventions compared to guidelines, hence the separate document.
Ranorex also supports GUI-based development; however, user code can be implemented in C#, and user coding is essential to improving the quality of automated testing. In other words, logic recorded and edited on a GUI basis can be converted to user code and then refactored into reusable components. As with general development, these components increase the reusability of test code.
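The same componentization idea can be sketched outside Ranorex. The following minimal example, written with the Playwright Python API used by the generative-AI team in Section 6 rather than Ranorex C# user code, extracts steps that recordings would otherwise duplicate into a single reusable, parameterized helper; the URL, selectors, and credentials are hypothetical and not taken from the projects described here.

# A minimal sketch (not Ranorex user code): repeated recorded steps extracted
# into a reusable, parameterized component, analogous to converting Ranorex
# recordings into user-code modules. URL, selectors, and credentials are hypothetical.
from playwright.sync_api import Page, sync_playwright

def login(page: Page, base_url: str, user: str, password: str) -> None:
    """Reusable login steps shared by many E2E scripts."""
    page.goto(f"{base_url}/login")
    page.fill("#username", user)
    page.fill("#password", password)
    page.click("button[type=submit]")
    page.wait_for_selector("text=Dashboard")  # explicit pass/fail criterion

if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        login(page, "https://example.test", "qa_user", "secret")
        browser.close()

Because every script calls the same helper, a change to the login screen is absorbed in one place, which is exactly the maintainability benefit the guidelines aim for.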
Listing 2 is the table of contents for the Power Automate guidelines, which span over 150 pages.
The guidelines provide step-by-step explanations of development and operational procedures, including installing Power Automate and reusing scenario tests created for existing manual tests as automated tests. The guidelines also explain the necessity of subflows and how to create them. The review section describes the criteria for implementing flows and offers tips to prevent those flows from becoming person-dependent. The operations section describes how to perform build verification tests (BVT) within the project’s shared environment and provides points to consider when executing BVT in the shared environment and in the implementer’s local environment. It also describes the criteria for continuously upgrading Power Automate to ensure long-term stable operation.
In contrast, for projects that adopted Ranorex, the guidelines were developed using Confluence [37], a wiki-based development document management tool. Confluence was chosen because the project uses Jira [38], a ticket management tool, for task and bug management. Furthermore, Bitbucket [39], a source code management tool that can be integrated with Ranorex, was adopted to manage test suites created with Ranorex. The guidelines are documented in the Confluence space and integrated with Jira.
The main differences between the Power Automate and Ranorex guidelines stem from these tools’ inherent design differences, with little variation in the rules governing test operations. Ranorex can convert code created in an integrated development environment into user code and transform it further into components.
The Ranorex guidelines also specify that test suites should be used as test case management tools with clearly defined test modules for each case. Additionally, the guidelines outline the implementation process for user code.
Listing 2. Table of contents of guideline for project using Power Automate.
1. About this guideline
  1.1 Purpose of the Guidelines
  1.2 Subject of the Guideline
2. Test Environment construction
  2.1 Execution environment
  2.2 Power Automate License
  2.3 How to install Power Automate for Desktop
  2.4 How to connect from Cloud Flow to Desktop Flow
3. Test Design Documentation
  3.1 Positioning of Test Design Documentation
  3.2 Template files for test design
  3.3 How to design
4. Backup data creation
5. Test files creation
  5.1 Test Case sheets
  5.2 Test Data sheet
  5.3 Generic Data Conversion Tool
  5.4 Global Variables
6. Flow creation
  6.1 Cloud Flow Configuration
  6.2 Desktop Flow Configuration
  6.3 Creation rules
  6.4 Exception Handling
  6.5 Precautions
  6.6 Confirmation of Operation after Creation
  6.7 Merge into shared environment
7. Review
  7.1 Purpose of review
  7.2 Review target
  7.3 Reviewer
  7.4 Example of review perspectives
  7.5 Checklist for review
  7.6 Record of review
8. Operation
  8.1 Operational flow
  8.2 Frequency and timing of test execution
  8.3 Test execution method
  8.4 Test result management
  8.5 Operation for daily build testing
  8.6 Test Result Reporting
  8.7 Revision up Scenario
  8.8 Conversion Scenario
  8.9 Setup test
  8.10 Past trouble collection
9. Maintenance
  9.1 Product Revision Upgrade (without scenario update)
  9.2 Product Revision Upgrade (with scenario update)
  9.3 PAD Version Upgrade
  9.4 Script management
For this project, we used Jira, a ticket management tool, to oversee task execution and track software defects, and managing the test automation documentation in a Confluence space has proven effective in improving efficiency. As noted above, the main difference between the Power Automate and Ranorex guidelines lies in the tools themselves: because Ranorex can convert code created in the GUI into user code and then into components, its guidelines focus primarily on reusability. They explain how to use test suites as test case management tools, how to use the test modules associated with each test case, and the key considerations for implementing user code.
When used together with coding conventions (Figure 6), these rules prevent inconsistencies between developers and reduce dependency on specific individuals. Furthermore, the guidelines clearly explain the concept and implementation of script verification, clarifying test design and implementation rules and enabling the measurement of test coverage.
Figure 6. Example of coding rule.
Ranorex includes a data binding feature that allows tests to be executed repeatedly using different data sets. Utilizing this feature increases test coverage and enables the seamless automation of screen operation tests incorporating diverse data sets. To use the data binding feature without creating dependency on individuals, it is necessary to establish rules. Additionally, to distinguish between software and script issues, it is essential to clearly define test result reporting rules. Depending on the test execution order or environment, “flaky tests” may arise, causing test results to become unstable. Therefore, test reports are important for reducing these issues, and report rules play a crucial role in this process.
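As a hedged illustration of the data-driven principle behind data binding (not of Ranorex itself), the following pytest sketch runs one screen-operation test once per row of an external data file; the file name, its columns, the page fixture provided by the pytest-playwright plugin, and the selectors are all assumptions for illustration.

# Data-driven testing sketch: one test, many external data rows.
import csv
import pytest

def load_rows(path="search_cases.csv"):
    # External data set; with Ranorex this role is played by the bound data source.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

@pytest.mark.parametrize("row", load_rows())
def test_search_result_count(page, row):  # 'page' comes from pytest-playwright
    page.goto("https://example.test/search")
    page.fill("#query", row["keyword"])
    page.click("#search")
    assert page.locator(".result-item").count() == int(row["expected_count"])

Keeping the data outside the script is what makes the rule-setting mentioned above necessary: the team must agree on where data files live, how columns are named, and who maintains them.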
Ranorex can be integrated with Jenkins to build a continuous integration (CI) environment. Recovery scripts and error handling for test failures improve test continuity by enabling repeated execution of automated tests and early detection of degradation.
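The recovery idea can be expressed tool-independently. The sketch below is plain Python with illustrative names that belong to no particular framework: it retries a fragile E2E step a bounded number of times and runs a caller-supplied recovery action, such as restarting the browser or resetting test data, between attempts.

# Minimal recovery/retry sketch for fragile E2E steps in a CI run.
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def run_with_recovery(step: Callable[[], T],
                      recover: Callable[[], None],
                      attempts: int = 3,
                      wait_seconds: float = 5.0) -> T:
    """Run 'step'; on failure run 'recover' and retry, up to 'attempts' times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:          # in practice, catch tool-specific errors
            last_error = exc
            if attempt < attempts:
                recover()                 # e.g., restart the browser, reset test data
                time.sleep(wait_seconds)  # give the environment time to settle
    raise last_error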
Establishing guidelines eliminates human dependency, ensures data quality, improves test accuracy, and contributes to the development of test automation personnel.
Test automation’s effectiveness in agile development methods can be enhanced by combining it with ideas and guidelines related to quality characteristics in agile process sprints. This combination ensures quality checks in each sprint are performed effectively and that each quality characteristic is met when adding new features.
Standardizing test automation provides the same benefits as standardizing the software engineering development process. For example, it improves the accuracy of development man-hour estimates and contributes to quality control. In other words, test automation guidelines should be aligned with development rules and the software engineering development process from the beginning, and it is desirable to apply general development process rules to test automation.
Similar to organizational project management evaluation processes, such as CMMI, developing a test automation maturity model and establishing assessment methods enables objective measurement of guideline effectiveness. Furthermore, integrating with a QMS based on ISO 9001 clarifies the level of test automation adoption and the challenges to quality improvement within the development organization.

4. Developing TAMM

4.1. Preparation for Maturity Model Building

Given that TAIM does not consider compatibility with ISO 9001, it was determined that an examination of methods that can be linked to quality management without compromising the usefulness of previous studies would be necessary.
In the context of constructing TAMM, the use of CMMI as a guideline and reference framework for the enhancement of test processes is a common practice. Furthermore, the utilization of the international standard ISO/IEC 33000 family was contemplated for the evaluation of TMMi and assessment methodologies. Given the pervasive utilization of the TMMi maturity model, a test technology version of CMMI, by numerous test teams, it was determined that the development of TAMM based on the TMMi model was warranted.
The model that was constructed places significant emphasis on E2E testing, with an underlying focus on continuous improvement. It utilizes compliance levels with established policies and rules as a metric for evaluating maturity, thereby enabling a comprehensive assessment of its strategic positioning.

4.2. Development of TAMM

CMMI and TMMi are both five-level models. However, because TAMM excludes organizations that do not implement test automation at all, we positioned it at four levels. Each level of the test automation process is represented in Figure 7.
Figure 7. Developed maturity model for software test automation.
Table 4 presents an overview of the status of each level. It also indicates the status that should be aimed for in order to reach the next level.
Table 4. State of each level and state to aim for to reach the next level.
The model that was developed demonstrates the necessary steps to achieve the subsequent level. The model enables comprehension of the present state of test automation, even in the absence of an assessment. The model was initially presented at a seminar on test automation, where it was met with favorable feedback from numerous attendees.
This model facilitates comprehension of the levels of test automation and elucidates the issues that must be addressed to achieve the subsequent level.
Level 1 is a state in which engineers automate processes without any established rules, relying exclusively on their own methods. In essence, the efficacy of automation is contingent upon the actions of individuals. In general, E2E test automation code and flows created in this state are not only difficult to share with other engineers but also challenging to maintain and operate in the long term. Even unit test code, which is frequently developed by software code implementers, must be designed to avoid becoming dependent on individuals. To advance beyond this stage, it is imperative to elucidate the purpose of test automation, enhance teamwork awareness, and establish and disseminate minimum rules, such as coding conventions and test pass/fail criteria, as guidelines.
Level 2 signifies the ability to design and execute automated tests with a focus on quality enhancement and maintaining a certain level of quality. This is distinct from the mere automation of manual test scenarios. The team begins to recognize the benefits of test automation, and guidelines and rules for test automation are established. In order to progress to the subsequent level, it is imperative to establish a reproducible operational environment and employ build verification testing (BVT) to expeditiously identify degradation. Furthermore, the integration of continuous integration (CI) environments is crucial.
Level 3 is the state where automated tests can be repeatedly executed using external data, i.e., the state where automated testing is utilized as data-driven testing. In order to achieve data-driven testing, it is imperative to design and implement tests with automation in mind. When establishing automation goals, it is imperative to consider the potential benefits of test automation for the entire development organization or project. As an organization reaches this state, it can enjoy more benefits of automation. As the level of automation increases, the organization can leverage the results of automation to improve quality visibility and productivity.
The final level 4 indicates a state in which the organization maintains the level 3 status quo while continuously improving the test automation process by addressing various issues that arise. At this level, test automation is considered a standard component of software development, and the purpose and objectives of test design, implementation, and operation are not dependent on specific individuals. Instead, the implementation of test automation is based on a common understanding within the organization. Automated test results are utilized for the purpose of quality control and are represented visually in real time. Furthermore, various issues that arise in the lifecycle of automated test design, implementation, execution, and operation are addressed within the organization, and activities are carried out to further enhance the effectiveness of automation.
However, the application of this maturity model alone is insufficient to achieve Level 4. Therefore, an assessment checklist was developed, drawing upon previous examples and CMMI and TMMi. This initiative was undertaken in recognition of the imperative to identify issues across the entire test automation lifecycle and to contemplate countermeasures. The objective of conducting evaluations using the checklist is to establish a long-term roadmap for visualizing quality. This will enable the organization to reach the next maturity level and contribute to the reorganization of its development processes and improvements in development productivity.

5. Methodology of Assessment Checklist

5.1. Consideration of Assessment Methods

The checklists employed in the assessment are designed to align with quality improvement through quality management principles. In addition to CMMI and TMMi, which were referenced during the maturity model development, we have extensively referenced ISO 9004 [40], an international standard created as a self-assessment tool for ISO 9001, which outlines the requirements for quality management. TAMM’s categories 7 and 8 refer to ISO 9004, allowing them to be linked to clause 9 of ISO 9001, Performance Evaluation. Additionally, all automation-related issues are linked to improvement activities through management reviews.
Initially, the requirements for test automation are categorized into eight distinct categories. The rationale for this categorization is that test automation is effective for quantitatively assessing software quality. In quality management that adheres to ISO 9001 and fosters continuous improvement initiatives, it is imperative to evaluate the maturity of the test automation lifecycle process and to promote the implementation of improvement processes. By referring to established guidelines and rules, it is possible to determine the level of compliance within projects and organizations, measure the maturity of test automation, and understand the capability to meet requirements. Little research has been conducted on the maturity of test automation in the context of a QMS, particularly with regard to promoting improvement. Given the prevalence of organizations that add features and enhance software quality through derivative development, it is reasonable to expect a correlation between advancements in test automation and the QMS.
Each category comprises four to five requirements, with a focus on more than just the presence or absence of guidelines or policies. Additional considerations include the modularization and reusability of test scripts and flows, as well as parallel execution during test runs. The requirements delineated in Table 5 were formulated with consideration for the utilization, implementation, and operation of design documents and scripts.
Table 5. Contents of assessment.

5.2. Development of Assessment Checklists

For each requirement, we prepared assessment questions that could be scored. Referring to Annex A of ISO 9004 and TMMi, we prepared 11 questions that could be used to determine the compliance status of each requirement. By answering the questions, it is possible to assess each requirement (Table 6).
Table 6. Part of requirements questions from Supplementary Material S1.
Given the four levels of the maturity model and the process and improvement categories, which are based on ISO 9001, the checklist scores range from 1 to 5, similar to ISO 9004. However, a single score is allotted to each requirement, with the maximum score of 5 indicating that the requirement is fully met.
The checklist is to be completed by the assessor; however, an assessor with a high level of experience in automation will need to conduct interviews and make corrections to ensure that there are no problems with the descriptions.
Ultimately, a cumulative score is determined for each category, and the scores from the eight categories are presented in a radar chart to illustrate the variation in scores for each requirement. When determining the score for each item, we made sure to discuss and decide among multiple assessors to avoid inconsistencies.
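For illustration, the following sketch shows one way the agreed per-requirement scores could be rolled up into the eight category scores and drawn as the radar chart described above; the category labels and the numbers are placeholders, not the actual assessment data or the labels of Table 5.

# Placeholder roll-up of consensus scores (1-5) into category means and a radar chart.
import numpy as np
import matplotlib.pyplot as plt

categories = ["Policy", "Design", "Implementation", "Review",
              "Execution", "Operation", "Process", "Improvement"]  # illustrative labels
scores = [3.2, 2.8, 2.5, 2.9, 3.0, 2.6, 2.0, 2.0]                  # mean requirement scores

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
closed_scores = scores + scores[:1]        # repeat the first point to close the polygon
closed_angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(closed_angles, closed_scores, marker="o")
ax.fill(closed_angles, closed_scores, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(categories)
ax.set_ylim(0, 5)
fig.savefig("tamm_radar.png", dpi=150)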

6. Results

6.1. Assessment Implementation Policy

The assessment of the test automation project was conducted in accordance with a predetermined policy. This policy called for checklist responses to be collected within the organization to be assessed, followed by interviews with members of the target organization. Various pieces of actual evidence were also reviewed.
In particular, a firm determination of the sufficiency of each item on the checklist was made through a combination of methods. These methods included the confirmation of test scripts and flows through sampling and the checking of test result reports.
The final assessment implementation policy is illustrated in Figure 8.
Figure 8. Assessment implementation policy.

6.2. Result of Assessment

We initially considered simply assigning a score to each question on the checklist, but decided that, to score the questions meaningfully, scores must first be assigned to all requirements and the checklist then reviewed again by the assessor. This is because the results of related questions need to be reviewed in light of whether policies exist and how the processes are actually performed. To ensure the consistency and reliability of the assessment, a thorough procedure was implemented: the item scores were determined by a panel of multiple assessors to avoid inconsistencies or variations in interpretation.
For example, even if a project or organization has a policy, it may be that only part of the policy is being used or that some people are not following it, so it is necessary to review the overall score.
It is difficult to determine where to place the assessment results on the maturity model, but we believe it is best to make a level judgment based on the presence or absence of policies and audit rules, as well as the implementation status.
Rather than simply presenting the results of the analysis, we believe that adding short- and long-term improvement points makes it easier to target the next level.
Short-term improvement activities focus on identifying issues in the current process and improving them to increase maturity. Long-term improvement items, on the other hand, include issues that are likely to arise after implementing short-term improvements, examples of issues that will become apparent through repeated improvements, and other items necessary to increase test automation maturity over the long term. Finally, we decided to present a roadmap for reaching Level 4, which is an expandable level, through repeated improvements.
The evaluation results report is based not only on a summary of the evaluation results, but also on the analysis results for each question item of the requirements. We have clearly stated the short-term improvement actions for each requirement and identified the issues for each item.
(1)
Example of a team using the capture replay tool
The team performing E2E testing with a capture-replay tool, OpenText Functional Testing [41], has a low overall score (Figure 9).
Figure 9. Team using OpenText Functional Testing.
While the team has documented automation rules in a wiki, these are essentially notes from script writers, and third parties cannot maintain the system based on these records. There is no documentation of the purpose, goals, or value of automation, making it difficult to share the rationale for automation within the team.
Test design for automation has not been performed, and existing manual tests are being automated. Priorities for automation also appear to be set on an ad hoc basis. As a result, even when reviewing test results, it is unclear what level of quality is being ensured by automated tests, making the effectiveness of testing unclear.
Despite progress in automating E2E testing, the effectiveness of automation remains unclear due to the high degree of task specialization. By clarifying the guidelines for tests to be automated, it is believed that the effectiveness of automated testing can be shared across the team.
(2)
Examples of test automation with RPA tools
This is the evaluation result for E2E testing using Power Automate, the tool for which the guidelines described above were created (Figure 10). The guidelines are well established, and the processes of test design, test flow implementation, review, and operation are performed without reliance on specific individuals, resulting in a high overall score. However, the automated tests primarily automate the E2E test scenarios used in manual testing from a business perspective, resulting in only partial benefits from test automation. Although the automation is used as BVT, its effectiveness as a degradation check is low because the operational phase comes late in the project timeline. In addition, the guidelines and review processes rely heavily on external consulting teams, making it difficult to establish an internal improvement process.
Figure 10. Team using Power Automate.
(3)
Example of E2E test automation using generative AI
Finally, we present the evaluation results of a team that uses a proprietary generative AI tool to convert manual E2E scenario tests into automated pytest [42] scripts that drive Playwright (Figure 11).
Figure 11. Team using generative AI tools.
As a generative AI tool, it enables the creation of automated tests immediately after test design is complete, resulting in high automation coverage for E2E testing. It supports automated scenario testing not only for functional combinations but also for business verification, contributing significantly to development productivity. However, because the tool is used to automate existing manual test scenarios, its data-driven testing capability is currently underutilized, which is a key challenge.
It is considered necessary to establish processes, guidelines, and improvement plans for test design rules to address the need for script regeneration due to tool version updates and to further drive automation.
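To make the discussion concrete, the following is an illustrative example of the kind of pytest/Playwright script such a tool might emit for a simple manual scenario (log in, create an order, confirm it appears in the list); the actual tool, its output format, and the application details shown here (URL, selectors, data) are assumptions, not artifacts of the evaluated project.

# Illustrative generated-style E2E test; all application details are hypothetical.
import pytest
from playwright.sync_api import Page, expect

@pytest.fixture
def logged_in_page(page: Page) -> Page:   # 'page' is the pytest-playwright fixture
    page.goto("https://example.test/login")
    page.fill("#username", "qa_user")
    page.fill("#password", "secret")
    page.click("button[type=submit]")
    expect(page.get_by_text("Dashboard")).to_be_visible()
    return page

def test_new_order_appears_in_list(logged_in_page: Page):
    page = logged_in_page
    page.click("text=New Order")
    page.fill("#item-name", "Sample item")
    page.click("#save")
    expect(page.locator(".order-row", has_text="Sample item")).to_be_visible()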

6.3. Statistical Tests

Hypothesis 1: A quantitative approach is recommended to analyze fluctuations in the number of review comments. However, data collected using the same testing tool is unavailable. In the case of RPA tools, review comments are often found in automated test designs because these designs use existing manual scenario tests without modification. This is because existing scenario tests lack design rules and criteria.
Therefore, quality improvement through reviews is necessary.
Currently, multiple evaluators assess the results through qualitative analysis. At the same time, the effectiveness of automated testing in identifying software defects is also being evaluated qualitatively.
According to Hypothesis 2, if the scores given by multiple evaluators differ, the quality of the target software is classified based on quality characteristics. Then, discussions are held based on the qualitative analysis results, and the final decision is made by consensus among the three evaluators.
Hypothesis 2 assumes that the Cohen’s kappa (κ) value among the three evaluators across the 39 requirements should be 0.75 or higher; the current approach meets this criterion.
Going forward, we plan to introduce a mechanism to calculate Spearman’s rho (ρ) in Hypothesis 1 and Cohen’s kappa (κ) in Hypothesis 2.
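A minimal sketch of such a mechanism is shown below, assuming the three assessors’ per-requirement scores (1 to 5 over the 39 requirements) and the per-team counts of review findings are already available as arrays; the dummy data are placeholders, and scikit-learn’s cohen_kappa_score and SciPy’s spearmanr stand in for the planned pipeline.

# Planned reliability metrics, sketched with dummy data.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Dummy data: 39 requirement scores (1-5) for each of the three assessors.
assessor_scores = {name: rng.integers(1, 6, size=39) for name in ("A1", "A2", "A3")}

# H2: Cohen's kappa is pairwise, so average it over the three assessor pairs.
kappas = [cohen_kappa_score(assessor_scores[a], assessor_scores[b])
          for a, b in combinations(assessor_scores, 2)]
print("mean pairwise Cohen's kappa:", np.mean(kappas))

# H1: Spearman's rho between maturity scores and review-finding counts (dummy data).
maturity = rng.uniform(1, 5, size=10)
review_findings = rng.integers(0, 30, size=10)
rho, p_value = spearmanr(maturity, review_findings)
print("Spearman's rho:", rho, "p-value:", p_value)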

7. Discussion

Based on the findings of the TAMM survey, we recommend that the teams responsible for implementing automation using Power Automate promptly revise the following guidelines.
These guidelines classify each chapter, policy, and stakeholder and categorize them as follows: automation objectives and improvements, preparation, automation design, implementation, design/implementation review, operation, and maintenance.
Implementing automation in accordance with the revised guidelines is expected to increase the target values of TAMM’s “8. Improvement” and “7. Process” requirements from 2.0 to 2.5. Clarifying the processes in the guidelines enables a more thorough evaluation of each task’s compliance with the guidelines and improves automation quality by clarifying the criteria for each process. Improving the test automation process improves test automation activities under QMS management. TAMM increases maturity by integrating test automation issues with QMS, thereby improving overall development quality.
However, revising the guidelines takes two to three months. Implementing and operating test automation in accordance with the revised guidelines and conducting another TAMM evaluation after the product is released takes at least six months to a year.

8. Conclusions

This study has two limitations: sample size (n = 3) and follow-up period (six months). Nevertheless, the applicability of the proposed method has been demonstrated with three types of test tools. It has also been shown that implementing short-term measures based on assessment results can raise the mean maturity score from 2.75 to 2.85, an increase of 0.1 points, within that period.
To increase the effectiveness of test automation, organizations must comprehensively review their policies and development processes. However, the absence of a mechanism to analyze the current state poses a significant challenge, hindering the identification of areas for improvement and the determination of necessary actions. Furthermore, even if such improvements are successful, it is difficult to apply unified policies and enhanced development processes to different organizations within the same company because of differences in maturity levels.
With the advent of agile development methods, scenario testing using use cases has expanded, and evaluation results are used as various criteria. Consequently, the number of organizations adopting end-to-end (E2E) test automation is expected to increase in the future. Therefore, measuring the maturity of E2E test automation will make it easier to identify areas for improvement based on the current situation and consider countermeasures, contributing to the improvement of the development process. To address this issue, we developed TAMM, which is compatible with QMS. We defined maturity levels and requirements and devised an assessment checklist and evaluation method. This method includes sample surveys and interviews.
The effectiveness of this evaluation method is as follows:
First, we conducted empirical evaluations to obtain quantitative measurements, targeting organizations that use different test automation tools.
Three evaluators determine scores and agree upon them to ensure a κ ≥ 0.75.
When combined with a QMS based on ISO 9001, TAMM can contribute to the maturity of development organizations and improve productivity by addressing automation issues, particularly in processes and improvement measures.
Test automation significantly improves software development productivity and enables quality visualization through automated test results.
From a quality management perspective, conducting test automation maturity assessments to evaluate development organization maturity and identify future challenges is a major benefit. Additionally, the consulting services provided by the author’s organization include identifying support measures for customers based on assessment results and developing countermeasures jointly. Deploying this method to various development organizations and combining it with data accumulation for improvement is expected to achieve quality visualization and productivity improvements through test automation.
To improve organizational maturity, it is essential to incorporate human resource development elements, such as accumulating test automation knowledge and establishing training programs, into the evaluation framework in addition to technical capabilities. Therefore, formulating test automation guidelines that consider the four Ps of ITIL [43] (people, process, products, and partners) is a prerequisite for considering long-term improvement measures. Furthermore, strengthening the evaluation of test automation by utilizing generative AI and promoting the evolution of evaluation methods is essential.
Future improvements to TAMM will collect additional evaluation result data and prepare a score determination process based on quantitative analysis. Additionally, we will develop a mechanism for setting target scores based on the software domain and required quality. We will also consider analyzing evaluation results that take development productivity into account.
Our goal is to promote the adoption of the proposed methodology and disseminate TAMM guidelines and evaluation checklists to support self-assessment and self-improvement processes within a broader range of organizations, alongside their quality management systems (QMS).

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/software4030019/s1, S1: TAMM Assessment Check sheet.

Author Contributions

Conceptualization, methodology, validation, formal analysis, investigation, resources, data curation, writing—review and editing, D.K.; Data curation, writing—review A.M.; supervision, H.I. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no competing interests. The authors Daiju Kato and Ayane Mogi are employed by the company Nihon Knowledge Co., Ltd. There is no conflict of interest between any of the authors and the company.

References

  1. Beck, K.; Beedle, M.; van Bennekum, A.; Cockburn, A.; Cunningham, W.; Fowler, M.; Grenning, J.; Highsmith, J.; Hunt, A.; Jeffries, R.; et al. Manifesto for Agile Software Development. 2001. Available online: https://agilemanifesto.org/ (accessed on 31 December 2024).
  2. Allspaw, J.; Hammond, P. 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. Presented at the O'Reilly Velocity Conference, San Francisco, CA, USA, 23 June 2009. Available online: https://tech-talks.code-maven.com/ten-plus-deploys-per-day.html (accessed on 7 May 2025).
  3. CrowdStrike Holdings. Available online: https://www.crowdstrike.com/ (accessed on 7 May 2025).
  4. ISO/IEC 25000:2014; Systems and Software Engineering—Systems and Software Quality Requirements and Evaluation (SQuaRE)—Guide to SQuaRE. ISO/IEC: Geneva, Switzerland, 2014.
  5. ISO/IEC 25019:2023; Systems and Software Engineering—Systems and Software Quality Requirements and Evaluation (SQuaRE)—Quality-in-Use Model. ISO/IEC: Geneva, Switzerland, 2023.
  6. Mugridge, R.; Utting, M.; Streader, D. Evolving Web-Based Test Automation into Agile Business Specifications. Future Internet 2011, 3, 159–174. [Google Scholar] [CrossRef]
  7. Berłowski, J.; Chruściel, P.; Kasprzyk, M.; Konaniec, I.; Jureczko, M. Highly Automated Agile Testing Process: An Industrial Case Study. E-Inform. Softw. Eng. J. 2016, 10, 69–87. [Google Scholar] [CrossRef]
  8. Kasurinen, J.; Taipale, O.; Smolander, K. Software Test Automation in Practice: Empirical Observations. Adv. Softw. Eng. 2010, 2010, 620836. [Google Scholar] [CrossRef]
  9. Kato, D.; Ishikawa, H. Quality Control Methods Using Quality Characteristics in Development and Operations. Digital 2024, 4, 232–243. [Google Scholar] [CrossRef]
  10. CMMI (Capability Maturity Model Integration). Available online: https://cmmiinstitute.com/ (accessed on 7 May 2025).
  11. Gibson, D.L.; Goldenson, R.D.; Kost, K. Performance Results of CMMI®-Based Process Improvement; Technical Report CMU/SEI-2006-TR-004; Software Engineering Institute, Carnegie Mellon University: Pittsburgh, PA, USA, 2006; Available online: https://insights.sei.cmu.edu/documents/764/2006_005_001_14762.pdf (accessed on 7 May 2025).
  12. TMMi Foundation. TMMi (Test Maturity Model integration). Available online: https://www.tmmi.org (accessed on 31 August 2024).
  13. Garousi, V.; Felderer, M.; Hacaloğlu, T. Software test maturity assessment and test process improvement: A multivocal literature review. Inf. Softw. Technol. 2017, 85, 16–42. [Google Scholar] [CrossRef]
  14. Eldh, S.; Andersson, K.; Ermedahl, A.; Wiklund, K. Towards a Test Automation Improvement Model (TAIM). In Proceedings of the 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Cleveland, OH, USA, 31 March–4 April 2014; pp. 337–342. [Google Scholar] [CrossRef]
  15. ISO 9001:2015; Quality Management Systems—Requirements. ISO: Geneva, Switzerland, 2015.
  16. ISO/IEC 33001:2015; Information Technology—Process Assessment—Concepts and Terminology. ISO/IEC: Geneva, Switzerland, 2015.
  17. Elbaum, S.; Malishevsky, A.G.; Rothermel, G. Test case prioritization: A family of empirical studies. IEEE Trans. Softw. Eng. 2002, 28, 159–182. [Google Scholar] [CrossRef]
  18. Rothermel, G.; Untch, R.H.; Chu, C.; Harrold, M.J. Prioritizing test cases for regression testing. IEEE Trans. Softw. Eng. 2001, 27, 929–948. [Google Scholar] [CrossRef]
  19. Kato, D. Improvement of Test Efficiency Through Automation of Regression Testing. In Proceedings of the Japan Symposium on Software Testing (JaSST) 2005, Osaka, Japan, 1–2 March 2005; Available online: https://www.jasst.jp/archives/jasst05w/pdf/S3-1.pdf (accessed on 7 May 2025). (In Japanese).
  20. Memon, A.M.; Xie, Q. Studying the Fault-Detection Effectiveness of GUI Test Cases for Rapidly Evolving Software. IEEE Trans. Softw. Eng. 2005, 31, 884–896. [Google Scholar] [CrossRef]
  21. Bertolino, A.; Korel, B. Cost Models and Return on Investment for Automated Regression Testing. ACM SIGSOFT Softw. Eng. Notes 1996, 21, 38–45. [Google Scholar] [CrossRef]
  22. Nagappan, N.; Maximilien, E.M.; Bhat, T.; Williams, L. Realizing quality improvement through test driven development: Results and experiences of four industrial teams. Empir. Softw. Eng. 2008, 13, 289–302. [Google Scholar] [CrossRef]
  23. Garousi, V.; Mäntylä, M. When and what to automate in software testing? A multi-vocal literature review. Inf. Softw. Technol. 2016, 76, 92–117. [Google Scholar] [CrossRef]
  24. Rafi, D.M.; Moses, K.; Petersen, K.; Mäntylä, M.V. Benefits and limitations of automated software testing: Systematic literature review and practitioner survey. In Proceedings of the 2012 7th International Workshop on Automation of Software Test (AST), Zurich, Switzerland, 2–3 June 2012; pp. 36–42. [Google Scholar] [CrossRef]
  25. Parry, O.; Kapfhammer, G.; Hilton, M.; Mcminn, P. A Survey of Flaky Tests. ACM Trans. Softw. Eng. Methodol. 2021, 31, 17. [Google Scholar] [CrossRef]
  26. Liu, X.; Yu, P.; Ma, X. An Empirical Study on Automated Test Generation Tools for Java: Effectiveness and Challenges. J. Comput. Sci. Technol. 2024, 39, 715–736. [Google Scholar] [CrossRef]
  27. Jenkins. Available online: https://www.jenkins.io/ (accessed on 8 September 2024).
  28. Pérez-Verdejo, J.M.; Sánchez-García, Á.J.; Ocharán-Hernández, J.O.; Mezura-Montes, E.; Cortés-Verdín, K. Requirements and GitHub Issues: An Automated Approach for Quality Requirements Classification. Program. Comput. Softw. 2021, 47, 704–721. [Google Scholar] [CrossRef]
  29. Mariani, L.; Pastore, F.; Pezzè, M. The Central Role of Test Automation in Software Quality Assurance. Softw. Qual. J. 2017, 25, 797–802. [Google Scholar] [CrossRef]
  30. Kato, D.; Shimizu, A.; Ishikawa, H. Quality classification for testing work in DevOps. In Proceedings of the 14th International Conference on Management of Digital EcoSystems (MEDES '22), Venice, Italy, 19–21 October 2022; Association for Computing Machinery: New York, NY, USA; pp. 1–8. [Google Scholar] [CrossRef]
  31. Buse, R.P.L.; Weimer, W.R. Learning a Metric for Code Readability. IEEE Trans. Softw. Eng. 2010, 36, 546–558. [Google Scholar] [CrossRef]
  32. Pressman, R.S.; Maxim, B.R. Software Engineering: A Practitioner's Approach; McGraw-Hill: Columbus, OH, USA, 2019; ISBN 978-1260423310. [Google Scholar]
  33. ISO/IEC 25010:2023; Systems and Software Engineering—Systems and Software Quality Requirements and Evaluation (SQuaRE)—Product Quality Model. ISO/IEC: Geneva, Switzerland, 2023.
  34. Wang, Y.; Mäntylä, M.; Eldh, S.; Markkula, S.; Wiklund, K.; Kairi, T.; Raulamo-Jurvanen, P.; Haukinen, A. A Self-assessment Instrument for Assessing Test Automation Maturity. In Proceedings of the 23rd International Conference on Evaluation and Assessment in Software Engineering (EASE ’19), Copenhagen, Denmark, 14–17 April 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 145–154. [Google Scholar] [CrossRef]
  35. Ranorex. Available online: https://ranorex.com/ (accessed on 7 May 2025).
  36. Microsoft. Power Automate. Available online: https://www.microsoft.com/ja-jp/power-platform/products/power-automate (accessed on 7 May 2025).
  37. Atlassian. Confluence. Available online: https://www.atlassian.com/software/confluence (accessed on 11 July 2025).
  38. Atlassian. Jira. Available online: https://www.atlassian.com/software/jira (accessed on 11 July 2025).
  39. Bitbucket. Available online: https://bitbucket.org/product/ (accessed on 7 May 2025).
  40. ISO 9004:2018; Quality Management—Quality of an Organization—Guidance to Achieve Sustained Success. ISO: Geneva, Switzerland, 2018.
  41. OpenText Functional Testing. Available online: https://www.opentext.com/products/functional-testing (accessed on 7 May 2025).
  42. Pytest. Available online: https://docs.pytest.org/en/stable/ (accessed on 7 May 2025).
  43. Axelos. ITIL (Information Technology Infrastructure Library). Available online: https://www.axelos.com/ (accessed on 31 December 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
