Article

AI Test Modeling for Computer Vision System—A Case Study

1 Department of Computer Engineering, College of Engineering, San Jose State University, San Jose, CA 95192, USA
2 ALPSTouchStone, Inc., San Jose, CA 95192, USA
* Author to whom correspondence should be addressed.
Computers 2025, 14(9), 396; https://doi.org/10.3390/computers14090396
Submission received: 7 August 2025 / Revised: 13 September 2025 / Accepted: 15 September 2025 / Published: 18 September 2025
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))

Abstract

This paper presents an intelligent AI test modeling framework for computer vision systems, focused on image-based systems. A three-dimensional (3D) model using decision tables enables model-based function testing, automated test data generation, and comprehensive coverage analysis. A case study using the Seek by iNaturalist application demonstrates the framework’s applicability to real-world CV tasks, evaluating how effectively the app identifies species and non-species under varying image conditions such as distance, blur, brightness, and grayscale. This study contributes a structured methodology that advances the academic understanding of model-based CV testing while offering practical tools for improving the robustness and reliability of AI-driven vision applications.

1. Introduction

Computer vision is a field of computer science that focuses on enabling machines to interpret and understand the visual world using algorithms and image processing techniques. Using it enables computers to identify and understand objects and people in images and videos. Like other types of AI, computer vision seeks to perform and automate tasks by replicating human capabilities. At an abstract level, computer vision problems aim to use the observed image data to infer something about the world. Some common computer vision problems include image classification, object localization and detection, and image segmentation.
The objectives of, and the need for, computer vision systems encompass the following:
  • The goal of computer vision is to understand the content of digital images. Typically, this involves developing methods that attempt to reproduce the capability of human vision.
  • Understanding the content of digital images may involve extracting information from an image, i.e., an object, a text description, a three-dimensional model, and so forth.
  • The goal of computer vision is to replicate human vision using digital images through three main processing components in consecutive order: image acquisition, image processing and image analysis, and understanding.
The advantages of computer vision are as follows:
  • Enhanced automation—With computer vision, machines can take on more complex tasks that would otherwise require human intervention. For example, robots with computer vision can perform tasks like sorting objects or detecting defects in manufacturing processes.
  • Improved accuracy—Unlike humans, computers can analyze visual data with extreme accuracy and precision. This makes computer vision a valuable tool in fields like medical imaging and security, where accurate interpretation of visual data can be critical.
  • Increased efficiency—Computer vision can help streamline processes and make them more efficient. For example, computer vision can be used in agriculture to detect pests or diseases in crops, helping farmers to take targeted action and save time and resources.
  • New possibilities—With computer vision, there are countless new possibilities for innovation and creativity. For example, computer vision can be used to develop new forms of art or entertainment or to create new tools and applications that we cannot even imagine yet.
  • Improved accessibility—Finally, computer vision can help make technology more accessible to people with disabilities. For example, computer vision can be used to develop assistive technologies, such as devices that can help people with vision impairment navigate their surroundings.
The most popular applications of computer vision include smart drone vision, surveillance and security, smart autonomous vehicles, manufacturing, smart agriculture, medicine and healthcare, face detection and recognition, sports (e.g., as a third umpire), image and document processing, industrial defect detection, language translation, entertainment, banking, AR, VR, and MR, number plate recognition, and more. The growing application of computer vision in safety-critical, high-stakes tasks, including autonomous vehicles, medical diagnostics, industrial inspection, and environmental monitoring, has created a pressing demand for systematic testing programs. The structured design of computer vision system test models is referred to as test modeling, which represents the relationship between the contexts (environmental conditions), inputs (object categories, data types), and outputs (images or expected recognition or classification results) of a computer vision system. Test modeling enables systematic test generation, automated data augmentation, and quantitative coverage analysis, which go beyond ad hoc testing or static benchmark validation. Establishing this foundation is significant and timely, as AI-based vision systems are increasingly exposed to dynamic, rapidly changing environments in which poor testing might result in unsafe or unreliable behavior.
The novelty of this paper lies in its discussion of the understanding of the testing of computer vision systems, including object-oriented intelligence validation and document-based intelligence validation. In addition, it discusses 3D test modeling specifically for object tracking intelligence (OTI). A detailed explanation of test generation and data augmentation has been given for image-based data, followed by the adequacy of test coverage. A case study, Seek by iNaturalist, has been presented involving the context, input, output, and decision table for different test cases where species and non-species are categorized based on their image quality, including blurriness, brightness, distance, and grayscale. Sample test cases for species and non-species have been provided, and the test results, along with the bug analysis, have been discussed.
This paper is structured as follows: Section 2 reviews the related research work on the practical application of computer vision, including background work. Section 3 discusses the test modeling of computer vision systems along with 3D intelligent test modeling for OTI systems. Section 4 presents the test generation and data enhancement for intelligent computer vision systems, focusing on image-based data. Section 5 provides test coverage for 3D computer vision systems. Section 6 elaborates on the case study Seek by iNaturalist, including the test results and the failure rate analysis. Section 7 discusses the gaps addressed and the contributions of this work, and the concluding remarks are presented in Section 8.

2. The Literature Review

Artificial intelligence test modeling has been a major topic of attention, with the focus ranging from mobile applications to computer vision systems. Recent work shows the use of AI-based test modeling in intelligent chatbot mobile apps [1]. Machine learning has been widely used in software testing to automate the process of generating, refining, and evaluating test cases [2]. This work identifies current research trends and highlights the need for stronger empirical validation of ML-based techniques in terms of testing. Machine learning and deep learning are widely used in AI-based GUI testing in mobile applications, including Android testing and automated robotic testing [3,4].
Drone-based traffic analysis for smart cities is a common application of computer vision systems [5]. Its basic AI requirements include traffic fleet monitoring, congestion analysis, and traffic analysis report generation. As a recent development in autonomous vehicle machine vision systems, an integrated machine learning model was studied to address this issue and the need for the detection and classification of street intersections, which are dependent on the context of the road and the weather conditions encountered [6]. The data collection, processing, and training efforts were based on the available dataset (BDD100K, COCO) and a new training dataset containing 13 street context and intersection classes, along with 6 classes of weather conditions.
Some review papers have also considered deep learning in knowledge-based computer vision techniques [7] to check how machine learning and the state of the art have been used in computer vision to extract crucial information from images [8]. With an emphasis on ease of use, consistent usability, and expandability [9], the earlier work offers a framework for the automated creation of tests for vision and image recognition systems. It also demonstrates how this work can be applied to testing a specific industrial application that involves identifying riblet surface defects. Recent studies in the field of computer vision have yielded significant theories, effective deep learning models, and applications, including object recognition, object detection, and object segmentation [10]. Other issues that have already been addressed, in terms of emerging trends and future directions, include facial feature extraction, transfer learning, and scene classification.
Industry experts discuss the practices, issues, and ideas for utilizing AI in software testing and development and creating self-testing systems [11]. The actual views of the community and a panel survey on software testing practices in terms of the changing role of AI have been presented. Machine learning software testing faces unique challenges due to the susceptibility of ML models to deception and failure, particularly in safety-critical applications [12]. The unique development methodologies used in AI/ML systems introduce critical testing challenges that differ significantly from traditional software testing approaches [13]. This study identifies and outlines the major testing hurdles in AI/ML applications, providing a foundation for future research into more effective and tailored testing strategies. A systematic representation of the use of AI in software testing based on a review of 20 secondary works has been presented [14]. Software testing can be enhanced by using AI to generate and validate test cases with the help of ML, NLP, and automation frameworks. Most recently, an article has mentioned the advantages in terms of productivity and reach while addressing challenges like AI bias, data needs, and integration hurdles [15].
An overview of the real-life applications of computer vision and AI in various fields of application is discussed in Table 1. It serves two purposes: (a) to highlight the extensive and powerful application of computer vision, thus underlining the urgency of creating systematic testing regimes that can be used across application domains, and (b) to emphasize the current dependency on domain-specific data and ad hoc testing, which supports the necessity of a generalizable model-based testing method, including the 3D model suggested by this work. Consequently, Table 1 was not designed solely as a summary of the literature but as evidence, in its context, to justify the primary goal of this article, i.e., to develop a structured model-based test methodology for computer vision systems scalable across domains.

3. Test Modeling for Computer Vision Systems

The term test modeling describes the method of systematically expressing the components of a system under test (including inputs, contexts, and expected responses) in a way that facilitates the systematic generation of test cases and adequacy evaluations. Decision tables, state-transition diagrams, and classification trees have traditionally been used to model tests in software engineering. These techniques work well with deterministic systems, where it is possible to define and validate the expected outputs. However, with the emergence of AI-driven systems, particularly in computer vision, such conventional modeling is no longer sufficient. AI systems are sensitive to variations in data quality, the environment, and task complexity because they learn from data and operate under uncertain and context-dependent conditions. Consequently, AI test models must extend traditional principles with contextual variability, automated augmentation, and adequacy coverage metrics.

3.1. Understanding of Testing Computer Vision Systems

The issues faced when testing a computer vision system include
  • Difficulty in establishing well-defined, clear, and measurable quality testing requirements;
  • A lack of well-defined and practiced quality testing and assurance standards;
  • A lack of well-defined, systematic quality testing methods and solutions;
  • A lack of automatic test tools with well-defined, adequate quality test coverage;
  • A lack of automatic, adequate quality test coverage analysis techniques and solutions.
Intelligence validation in computer vision systems takes two forms: image-based and document-based. Image-based validation concerns system performance when processing visual inputs under different circumstances (e.g., lighting, blur, occlusion) and evaluates system robustness in recognition, detection, and classification tasks. Document-based validation, by contrast, measures the capability of vision systems to read and recognize structured data in text-laden images and scanned documents, as required in applications such as OCR, automated form reading, and compliance validation. Figure 1 and Figure 2 highlight the validations for image-based and document-based intelligence in computer vision systems, respectively.
This study focuses on and targets image-based computer vision solutions and systems, listed in Table 2.

3.2. Conventional Test Modeling

Conventional test modeling has long been applied in software engineering to systematically design and validate test cases. Decision tables, state-transition diagrams, and classification trees provide structured representations of system behavior, enabling the systematic generation of test cases. In conventional test modeling, the objective is to achieve systematic coverage of a deterministic system’s behavior. The methodology applies formal models (decision tables, state transitions, and classification trees) to map inputs to expected outputs. It is effective in contexts where the outputs are explicitly defined and predictable but struggles with AI-based systems, where outputs are probabilistic and sensitive to diverse environmental conditions.
As a result, conventional test modeling provides a strong foundation for systematic testing but is insufficient for AI-driven applications, particularly in domains such as computer vision, where uncertainty and input variability are dominant.

3.3. AI Test Modeling

AI test modeling extends the traditional methods to address the unique characteristics of AI systems. Unlike deterministic software, AI systems learn from data and must operate under uncertain, context-dependent conditions. The objective of AI test modeling is to capture the robustness of an AI system under real-world variability. It integrates context-based variations, automated test data generation/augmentation, and adequacy-driven coverage metrics. The advantage is that it goes beyond static benchmarks by revealing robustness gaps under diverse conditions. The focus is not only limited to identifying failures that occur but also how and why they occur (failure modes). This approach is particularly relevant for computer vision systems, where environmental variability—such as lighting, blur, distance, occlusion, and color shifts—can significantly impact performance. Therefore, AI test modeling emphasizes both systematic test generation and diagnostic insights, providing a richer evaluation of system behavior than conventional approaches.
The limitations of conventional approaches and the growing demand for robust AI testing highlight the need for structured frameworks tailored to computer vision. Building on these insights, the following section presents our three-dimensional (3D) AI test modeling framework. This model integrates context, input, and output classification trees into decision tables, building upon traditional modeling concepts to address the robustness issues in state-of-the-art computer vision systems. In this paper, AI test modeling is extended to the AI features of computer vision by building three-dimensional models that consider context, input, and output variables. This approach supports systematic test data generation and provides diagnostic information concerning system robustness that transcends conventional or benchmark-based testing. For any given computer vision system, the intelligent test modeling process is carried out as follows:
Loop: For each computer vision feature (function),
  • Establish a test model using the 3D tree model below:
    (a) Establish a context tree model to represent well-classified contexts for the computer vision system;
    (b) Establish an input tree model to represent well-classified inputs for the computer vision system;
    (c) Establish the corresponding output tree model to represent well-classified outputs from the computer vision system under test.
  • Automatically generate a 3D classification decision table for each computer vision feature (function) based on the derived 3D tree model.
In this paper, a 3D classification test model is designed for an object tracking intelligence computer vision system. The different scenarios in the model include three classification tree models: context, input, and output.
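As a minimal sketch of the second step above, deriving a 3D classification decision table from the tree model amounts to a cross-product of the three trees' leaves. The leaf labels below are hypothetical placeholders, not the actual categories of Figures 3–5:

```python
from itertools import product

# Hypothetical leaf categories for each classification tree; the real trees
# (Figures 3-5) are richer, but the table-generation step is identical.
context_tree = ["daylight", "night", "rain"]        # CT-x: environmental contexts
input_tree = ["single-object", "multi-object"]      # IT-y: input categories
output_tree = ["correct", "incorrect"]              # OT-z: expected outcomes

def build_decision_table(contexts, inputs, outputs):
    """Each rule is one (CT-x, IT-y, OT-z) combination from the 3D tree model."""
    return [
        {"rule": i + 1, "context": c, "input": inp, "output": out}
        for i, (c, inp, out) in enumerate(product(contexts, inputs, outputs))
    ]

table = build_decision_table(context_tree, input_tree, output_tree)
print(len(table))  # 3 contexts x 2 inputs x 2 outputs = 12 rules
```

The cross-product makes the table size explicit, which is later useful when computing coverage: each generated rule corresponds to one cell that a test case may exercise.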

3.4. Context Classification Modeling

The context of selected object tracking intelligence consists of the environmental conditions during which the photo/video of an object was captured. Many factors may impact the performance of a function, including weather, season, light/time conditions, background, and background noise. In Figure 3, a context classification tree has been formed while considering different factors that may affect the final performance of the computer vision system.

3.5. Input Classification Modeling

The input classification refers to the model’s technical aspects, including understanding what type of object has been detected. Figure 4 shows the input for a computer vision system in the case of an object tracking intelligence (OTI) system. Generally, the inputs cover behavior tracking, instance tracking, hybrid tracking, and the type of object being tracked.

3.6. Output Classification Modeling

Output classification refers to the possible results that can be obtained for the inputs provided. So, the major divisions for output remain the same as those for input, as seen in Figure 5. The final results obtained for an output classification tree are in the form of correct/incorrect answers.

4. Test Generation and Data Augmentation for Intelligent Computer Vision Systems

This section covers the various techniques for creating and enhancing test data for intelligent computer vision systems.

4.1. Test Generation

The different test generation approaches for computer vision systems are as follows:
  • Test Data Discovery: Using Internet-based solutions to search for, select, and validate discovered test data for a computer vision system’s targeted computer vision feature (or function).
  • Test Data Augmentation: Using diverse test data augmentation solutions and tools, including machine learning models and frameworks, to generate diverse augmented test data based on the selected test data (images, videos, and document images) for a selected computer vision feature/function in a computer vision system under test.
  • Model-Based Test Data Generation: Using a model-based test data generation tool to generate diverse, well-classified model-based test data (images, videos, and document images) for a selected computer vision feature/function in a computer vision system under test, in order to achieve a well-defined (or selected) adequate test coverage criterion.
  • AI-Based Test Data Generation: Using machine learning models and AI techniques to generate desirable test data for a targeted computer vision feature (or function) based on a given test scheme of vision feature (or function) in a computer vision system to achieve well-defined test coverage criteria.
  • Real-Time On-Site Collected Test Data and Processing: Using various test data collection methods and tools to collect real-time on-site raw data (camera videos, photos, and/or document images) via computer vision system APIs. To prepare the collected data as targeted test data, machine learning models or tools will be used to preprocess the collected data and convert it into targeted test data for well-defined, adequate test criteria.
In the case of an object tracking intelligence (OTI) system, a model-based test approach has been used, where initially, a 3D AI test tree model is selected, which can have k separate tree inputs, followed by k individual 3D AI test classification decision tables, as shown on the left side of Figure 6. AI test generation then works upon it, including automatic test data discovery, AI-based test data selection and validation, and AI-based test data augmentation, which are then finally added to the model-based test datasets.

4.2. Data Augmentation

The test data in a computer vision intelligence system can be augmented based on images and documents. Image-based augmentation can be a result of object rotation, season/weather augmentation, light/color augmentation, object distortion, digital augmentation, changed backgrounds, object addition/removal, text addition/removal, or replacing text in the image. Figure 7a shows different images with rotational augmentation, while one can see weather classification in Figure 7b. Document-based data augmentation is a vast field that will be discussed in future works.
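As an illustrative, stdlib-only sketch of two of the augmentations named above (light/color scaling and grayscale conversion), with an image represented as a nested list of RGB tuples; a real pipeline would use a library such as Pillow or OpenCV instead of this toy representation:

```python
def scale_brightness(img, factor):
    """Light augmentation: scale each RGB channel, clamped to [0, 255]."""
    return [[tuple(min(255, int(c * factor)) for c in px) for px in row]
            for row in img]

def to_grayscale(img):
    """Grayscale augmentation using the common luma weights."""
    return [[int(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in img]

# A tiny 2x2 "image" of RGB pixels.
img = [[(200, 100, 50), (0, 255, 0)],
       [(10, 10, 10), (255, 255, 255)]]

dim = scale_brightness(img, 0.5)   # simulates low-light capture
gray = to_grayscale(img)           # simulates black-and-white capture
```

Applying such transformations to a base test image multiplies one discovered or collected input into a family of context-varied test inputs, which is exactly how the context dimension of the 3D model is exercised.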

5. Adequate Test Coverage and Standards for Computer Vision Systems

Quality testing adequacy for image-based computer vision intelligence is critical for the system’s performance and dependability. It entails the determination of the accuracy with which the model can process visual information and the ability of the model to perform accurately in situations that represent the diversity of real-life conditions, such as image qualities, illumination angles, and object perspectives. Table 3 shows the image-based test coverage for computer vision systems.

Model-Based Test Adequacy for a Computer Vision Intelligence System

One can establish a 3D AI test classification decision table for each targeted type of computer vision intelligence under test based on the established 3D AI tree model. Using this test classification decision table as a test scheme, a set of 3D AI test classification test case sets (known as 3DT-Set) can be generated. The four test coverage criteria are defined as follows:
  • Three-Dimensional AI Test Classification Decision Table Test Coverage—To achieve this coverage, the test set (3DT-Set) must include one test case for any 3D element (CT-x, IT-y, OT-z) in the 3D AI test classification decision table;
  • Context classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include one test case for any rule in a context classification decision table;
  • Input classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for any rule in an input classification decision table;
  • Output classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for any rule in an output classification decision table.
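A minimal sketch of how these four criteria might be measured against a 3DT-Set. The flat-set representation of the decision tables and the labels are illustrative assumptions, not the paper's tooling:

```python
def coverage(test_set, contexts, inputs, outputs):
    """Return the covered fraction for each of the four coverage criteria.

    test_set: iterable of (CT-x, IT-y, OT-z) triples exercised by the 3DT-Set.
    """
    covered = set(test_set)
    full_3d = {(c, i, o) for c in contexts for i in inputs for o in outputs}
    return {
        "3d_table": len(covered & full_3d) / len(full_3d),   # criterion 1
        "context": len({c for c, _, _ in covered}) / len(contexts),  # criterion 2
        "input": len({i for _, i, _ in covered}) / len(inputs),      # criterion 3
        "output": len({o for _, _, o in covered}) / len(outputs),    # criterion 4
    }

# Hypothetical example: two test cases against a 2x2x2 decision table.
report = coverage(
    [("bright", "species", "pass"), ("dark", "species", "fail")],
    contexts=["bright", "dark"],
    inputs=["species", "non-species"],
    outputs=["pass", "fail"],
)
print(report)  # context and output fully covered; input only half covered
```

A report like this makes the adequacy gap concrete: the example set covers every context and output rule but only 25% of the 3D table, so further test cases are still needed for full 3D classification decision table coverage.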

6. A Case Study—Seek by iNaturalist

Seek by iNaturalist is an AI-powered mobile app designed to help identify plants and animals. Using computer vision for real-time species identification, it provides educational information as well as conservation insights. Although Seek performs static image classification rather than dynamic object tracking, it provides a suitable case study for demonstrating the generalizability of the proposed 3D test modeling framework. A classification tree diagram for an AI-powered function in the mobile app, Seek, is represented in Figure 8. The same principles of context variations (Figure 8a: blur, brightness, distance, and grayscale), input categories (Figure 8b: species vs. non-species), and output classification (Figure 8c: correct vs. incorrect recognition) apply, illustrating the practical application of structured test modeling in a real-world vision system.

6.1. Methodology and Dataset Construction

To ensure reproducibility and rigor, the methodology implemented to construct the test dataset was as follows:
  • Image Sourcing: Species images were collected directly from the Seek app, while non-species images were supplemented with publicly available resources (e.g., household items, simple backgrounds).
  • Selection Criteria: The images were selected to represent four species categories—insects, flowers, trees, and birds—and four non-species categories—chair, bottle, fan, and plain background. This ensured the inclusion of supported and unsupported inputs.
  • Contextual Variation: Augmentations were systematically applied to introduce controlled variations (e.g., Gaussian blur, brightness scaling, grayscale conversion), ensuring that the test dataset reflected realistic imaging conditions encountered by end users.
These enhancements were chosen because they are among the most common sources of errors in real-world CV performance, making test data realistic and diagnostic and providing a foundation for reproducibility in future studies.

6.2. AI Test Modeling for Selected AI Features

Context Modeling: This model categorizes various input formats. For Seek, the only input is an image, though the image can be of various qualities and settings. To effectively test the species identification feature of Seek, the testing context is broken down into several categories that cover a wide range of scenarios. The first category is brightness, which assesses the identification feature under varying lighting conditions. The second category is blurriness, aiming to evaluate the identification performance when processing images of different quality levels. The third category is distance, which examines the feature’s capability to recognize species regardless of how close or far they appear in the image. Lastly, the fourth category is grayscale, which tests the feature’s effectiveness in identifying species in black and white, rather than in their natural colors.
Input Modeling: The purpose of this model is to focus on the types of input. It is observed that there are two main types of input for Seek: species and non-species. Species inputs are images that contain organisms supported by the identification system, such as birds, insects, trees, and flowers. Inputs that contain unrelated objects, such as soccer balls, houses, and food, will be classified as non-species inputs, as they fall outside the scope of the feature’s recognition capabilities.
Output Modeling: It is observed that there are two outputs for Seek’s identification feature: Pass and Fail. The two main parts of the feature we are testing are species recognition and species identification. Species recognition is the ability to recognize species while not being able to recognize non-species. Species identification is the ability to recognize the correct type of species in the image.
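The recognition/identification logic above can be sketched as a simple test oracle. The function name and labels are illustrative, not part of the Seek app or its API:

```python
def judge(input_type, predicted_label, true_label=None):
    """Test oracle for Seek's identification feature (illustrative sketch).

    Recognition: species inputs must yield some species label, while
    non-species inputs must yield none. Identification: a recognized
    species must additionally match the true label.
    """
    if input_type == "non-species":
        # Any species label on a non-species input is a recognition failure.
        return "Pass" if predicted_label is None else "Fail"
    if predicted_label is None:
        return "Fail"  # species input was not recognized at all
    return "Pass" if predicted_label == true_label else "Fail"

assert judge("non-species", None) == "Pass"          # correctly rejected
assert judge("species", "tree", "flower") == "Fail"  # recognized but misidentified
assert judge("species", "bird", "bird") == "Pass"    # recognized and identified
```

Separating the recognition check from the identification check lets the later failure analysis distinguish "not recognized" failures from "confused with another species" failures.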

6.3. AI Function Test Cases, Data Generation and Test Coverage

The test cases are generated using the AI testing tool, which pulls from the test models defined previously. To stay true to the testing logic, non-species and species images are to be tested separately, even though the test suite combines them into a single test case. Some tabular representations of the test cases generated from the AI testing tool are given for species and non-species in Figure 9 and Figure 10, respectively. For the species, Figure 9a shows the impact of distance, while Figure 9b shows the impact of grayscale on the recognition performance.
Each test case group includes a combination of the context variables (blur, brightness, distance, grayscale) applied to the entire input scope (species type, non-species). Rather than isolating each of the contexts against the input space, combining them provides better insights into which contexts have a larger impact on the accuracy of the computer vision model. This allows for adequate testing of the model’s strengths without performing exhaustive tests.
The test data were generated and augmented in accordance with the practices in this study to represent real-world variations. Specifically, the validity of the test data was ensured through
  • Contextual Relevance: Augmentations were selected based on typical real-world variations that included blur, brightness, distance, and grayscale changes, among others, in image capture. They were chosen as they were generally identified as the main sources of performance degradation in deployed vision systems.
  • Controlled Augmentation: Data augmentation methods (e.g., rotation, brightness scaling, simulated blur) were controlled so that the generated images were realistic without imposing synthetic distortions that would not occur naturally.
This ensures that the generated test data is not arbitrary but is systematically designed to reflect real-world imaging conditions relevant to computer vision performance evaluation.
In Section 5, the manuscript identifies eight major types of coverage for testing computer vision systems (object detection, tracking, behavior classification, counting, segmentation, recognition, feature extraction, and object extraction). In the Seek case study, the test coverage focus is specifically on object detection and classification, object recognition, and object tracking intelligence (OTI). These were selected because they are directly aligned with Seek’s core functionality of identifying species from images and distinguishing them from non-species input. While other coverage dimensions (e.g., behavior classification, segmentation) remain important, they are beyond the scope of this particular demonstration and will be addressed in future work.

6.4. Results and Failure Mode Analysis

The AI function test for Seek’s species identification gave an overall pass rate of 38%. Table 4 shows the success rates for the species and non-species categories. Performance varied by species, with success rates of 48% for birds, 44% for trees, 32% for insects, and 28% for flowers. In contrast, the non-species tests achieved a 100% pass rate, indicating that the system is effective at filtering out irrelevant inputs while facing challenges in accurately identifying some species.
The subsequent analysis in Table 5 indicates that picture quality has a substantial effect on performance. Lower blur, adequate brightness, and close distance yielded higher accuracy, whereas high blur, low brightness, and far distance degraded it. Non-grayscale images performed significantly better (71% pass rate) than grayscale ones (8% pass rate), suggesting weaknesses in Seek’s handling of color-deprived inputs. Blur and brightness also had notable effects, with higher blur and lower brightness leading to degraded performance. These insights suggest that improvements in data augmentation, particularly for color handling and challenging conditions, could yield better models and overall accuracy.
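A minimal sketch of how such per-factor pass rates can be computed from raw test results. The records and field names below are hypothetical, not the actual study data:

```python
from collections import defaultdict

# Hypothetical raw results: one record per executed test case.
results = [
    {"grayscale": False, "blur": "low", "passed": True},
    {"grayscale": False, "blur": "high", "passed": True},
    {"grayscale": True, "blur": "low", "passed": False},
    {"grayscale": True, "blur": "high", "passed": False},
]

def pass_rate_by(factor, records):
    """Group results by one context factor and compute the pass rate per level."""
    totals, passes = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec[factor]] += 1
        passes[rec[factor]] += rec["passed"]  # bool counts as 0/1
    return {level: passes[level] / totals[level] for level in totals}

print(pass_rate_by("grayscale", results))  # {False: 1.0, True: 0.0}
```

Slicing the same result set along each context dimension in turn is what turns a single aggregate pass rate into the per-factor breakdown reported in Table 5.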
The graphical representations of the failure modes for species, specific to distance and grayscale, are presented in Figure 11a and Figure 11b, respectively. The following failure modes were observed:
  • Image quality failures: High blur and low brightness consistently led to misclassifications.
  • Species confusion: Some flower images were misclassified as trees, indicating difficulty with fine-grained distinctions.
  • Color feature weaknesses: Performance was far worse in grayscale (8%) than in color (71%), suggesting an over-reliance on color features rather than robust texture or shape feature extraction.
  • Distance sensitivity: The accuracy dropped at farther distances, showing limitations in feature resolution handling.
These findings demonstrate how the proposed 3D test modeling framework provides structured insights into system weaknesses, moving beyond simple accuracy scores toward diagnostic analysis of robustness.

7. Discussion

The testing of computer vision (CV) systems still faces fundamental gaps that hinder reliability and scalability. Current practices often rely on ad hoc testing, static benchmark datasets, or subjective validation, which fail to capture the diversity and unpredictability of real-world environments. As a result, robustness, reliability, and reproducibility remain difficult to guarantee. This study addresses these gaps by introducing a three-dimensional (3D) AI test modeling framework that integrates context, input, and output classification trees into decision tables. The central contributions are as follows:
  • Academic Contribution: It extends model-based testing concepts to computer vision systems, considering blur, brightness, distance, and color variability. Unlike existing model-based approaches, the framework explicitly incorporates adequacy metrics, automated augmentation, and systematic classification of testing scenarios.
  • Practical Contribution: It gives developers and engineers a consistent and broadly applicable way of testing CV systems. The framework complements ad hoc testing or benchmark-driven validation by integrating model-based design with automated augmentation.
Although the initial inspiration for the work was object tracking intelligence (OTI), it is not limited to this. The principles of systematically structuring the contexts, inputs, and outputs are also applicable to image-based classification and recognition tasks. The case study of Seek by iNaturalist was thus chosen to demonstrate such generalization. Although Seek does not require dynamic tracking, the 3D modeling method still yields important insights into robustness under varying conditions.
The advantages of the methodology studied in this paper are as follows:
  • The proposed 3D AI test model with decision tables provides systematic, structured coverage across multiple dimensions.
  • Model-based AI function testing enhances traceability and repeatability.
  • Automated test data generation and augmentation improve coverage without the need for extensive manual datasets.
  • Contextual variation had strong effects: color images substantially outperformed grayscale ones, blur and distance degraded recognition, and brightness variations altered the outcomes.
  • These findings emphasize the value of context-driven testing, which uncovers weaknesses often hidden by benchmark datasets.
  • By integrating coverage, adequacy, and automated validation, the methodology addresses key shortcomings of the existing solutions.
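The automated augmentation mentioned above can be illustrated with a minimal NumPy sketch of the kinds of contextual variations discussed (brightness shifts, grayscale conversion, blur). The function names are illustrative, and the box blur is a simple stand-in for the heavier blur kernels a real pipeline would use:

```python
import numpy as np

def adjust_brightness(img, factor):
    """Scale pixel intensities; factor < 1 darkens, > 1 brightens."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def to_grayscale(img):
    """Luminance-weighted grayscale, replicated back to 3 channels."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = (img.astype(np.float32) @ weights).astype(np.uint8)
    return np.stack([gray] * 3, axis=-1)

def box_blur(img, k=3):
    """Simple k x k mean filter as a stand-in for heavier blur kernels."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

img = np.full((8, 8, 3), 100, dtype=np.uint8)  # toy solid-gray test image
variants = {
    "dark": adjust_brightness(img, 0.5),
    "bright": adjust_brightness(img, 1.5),
    "gray": to_grayscale(img),
    "blurred": box_blur(img),
}
print({name: v.shape for name, v in variants.items()})
```

Each augmented variant keeps the original label, so one seed image yields several decision-table cells without additional manual data collection.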
Table 6 illustrates the differences between conventional, model-based, AI-driven, and benchmark testing approaches for computer vision systems across objectives, methodologies, data handling, validation, and coverage.
This comparison highlights the evolution of the testing practices for computer vision systems. Conventional testing remains largely ad hoc, with limited coverage and reliance on human oracles, while model-based testing offers systematic coverage by leveraging formal models. AI-driven testing introduces automation in test generation, augmentation, and validation, enabling broader and adaptive coverage, but still faces challenges in standardization. Benchmark testing, although valuable for comparability through fixed test sets, is constrained by static metrics and limited flexibility. Collectively, these insights emphasize the need for hybrid approaches that combine the rigor of model-based testing, the scalability of AI-driven testing, and the comparability offered by benchmark test sets.
Despite these advantages, there are limitations. This research was restricted to a small set of coverage categories (detection, recognition, and tracking), used a small augmentation set, and was validated on a single case study. Although test generation and coverage analysis are automated, interpreting the test results still requires manual effort, and result validation is not yet fully automated.

8. Conclusions and Future Scope

This paper presented a novel approach to intelligent AI test modeling for computer vision systems, emphasizing object tracking intelligence (OTI) through a three-dimensional (3D) test model and a decision table framework. By integrating model-based testing, automated test data generation, and comprehensive test coverage analysis, the proposed methodology advances the validation process for AI-powered computer vision applications. The case study of Seek by iNaturalist focused on image classification rather than tracking, demonstrating the flexibility of the work and highlighting failure modes such as sensitivity to blur, distance, and color features. The results highlight the potential of intelligent test modeling to improve reliability and performance evaluations for real-world AI systems. Specifically, the structured 3D testing model improves coverage and facilitates early defect detection, which is critical in high-stakes applications of computer vision.
Building on the proposed framework, future work can explore several directions. The test model can be extended to support multi-modal AI systems that combine document, image, text, and audio inputs. Applying the model to other computer vision use cases, such as medical imaging, autonomous vehicles, or industrial inspection, could validate its adaptability and domain independence. Finally, incorporating self-healing test strategies and reinforcement-learning-based test optimization may further automate the AI testing lifecycle and improve its accuracy. Although the scope of this work is limited (restricted coverage categories, limited augmentations, and reliance on a single case study), these limitations open clear avenues for development. Future efforts will expand the framework to dynamic tracking, segmentation, and multi-modal CV tasks; incorporate richer augmentation techniques; and provide validation across diverse application domains.

Author Contributions

J.G.: Conceptualization, formal analysis, resources, supervision, review, and administration; R.A.: original draft preparation, validation, case study, data curation, methodology, software, writing, and review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Radhika Agarwal is employed by the company ALPSTouchStone Inc. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Gao, J.; Agarwal, R.; Garsole, P. AI Testing for Intelligent Chatbots—A Case Study. Software 2025, 4, 12. [Google Scholar] [CrossRef]
  2. Durelli, V.H.S.; Durelli, R.S.; Borges, S.S.; Endo, A.T.; Eler, M.M.; Dias, D.R.C.; Guimarães, M.P. Machine Learning Applied to Software Testing: A Systematic Mapping Study. IEEE Trans. Reliab. 2019, 68, 1189–1212. [Google Scholar] [CrossRef]
  3. Zhang, T.; Liu, Y.; Gao, J.; Gao, L.P.; Cheng, J. Deep Learning-Based Mobile Application Isomorphic GUI Identification for Automated Robotic Testing. IEEE Softw. 2020, 37, 67–74. [Google Scholar] [CrossRef]
  4. Gao, Y.; Tao, C.; Guo, H.; Gao, J. A Deep Reinforcement Learning-Based Approach for Android GUI Testing. In Proceedings of the Web and Big Data: 6th International Joint Conference, APWeb-WAIM 2022, Nanjing, China, 25–27 November 2022; Proceedings, Part III. pp. 262–276. [Google Scholar] [CrossRef]
  5. Gao, J.Z. UASACT 2023 Keynote Talk: Smart City Traffic Drone AI Cloud Platform—Intelligence, Big Data, and AI Cloud Infrastructure; Keynote presented at UASACT 2023, Kaohsiung Exhibition Center, Taiwan. Sponsored by TDECA, IEEE CISOSE 2023, and IEEE Future Technology; San Jose State University: San Jose, CA, USA, 2023. [Google Scholar]
  6. Gao, J.; Wang, D.; Lin, C.P.; Luo, C.; Ruan, Y.; Yuan, M. Detecting and learning city intersection traffic contexts for autonomous vehicles. J. Smart Cities Soc. 2022, 1, 1–27. [Google Scholar] [CrossRef]
  7. Matsuzaka, Y.; Yashiro, R. AI-Based Computer Vision Techniques and Expert Systems. AI 2023, 4, 289–302. [Google Scholar] [CrossRef]
  8. Ayub Khan, A.; Laghari, A.A.; Ahmed Awan, S. Machine Learning in Computer Vision: A Review. EAI Endorsed Trans. Scalable Inf. Syst. 2021, 8, e4. [Google Scholar] [CrossRef]
  9. Wotawa, F.; Klampfl, L.; Jahaj, L. A framework for the automation of testing computer vision systems. In Proceedings of the 2021 IEEE/ACM International Conference on Automation of Software Test (AST), Madrid, Spain, 20–21 May 2021; pp. 121–124. [Google Scholar] [CrossRef]
  10. Hassaballah, M.; Hosny, K.M. (Eds.) Recent Advances in Computer Vision: Theories and Applications, 1st ed.; Volume 1: Studies in Computational Intelligence; Springer: Cham, Switzerland, 2019; pp. 113–187. [Google Scholar] [CrossRef]
  11. King, T.M.; Arbon, J.; Santiago, D.; Adamo, D.; Chin, W.; Shanmugam, R. AI for Testing Today and Tomorrow: Industry Perspectives. In Proceedings of the 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), Newark, CA, USA, 4–9 April 2019; pp. 81–88. [Google Scholar] [CrossRef]
  12. Marijan, D.; Gotlieb, A. Software Testing for Machine Learning. Proc. Aaai Conf. Artif. Intell. 2020, 34, 13576–13582. [Google Scholar] [CrossRef]
  13. Sugali, K. Software Testing: Issues and Challenges of Artificial Intelligence & Machine Learning. Int. J. Artif. Intell. Appl. 2021. Available online: https://ssrn.com/abstract=3948930 (accessed on 17 July 2025).
  14. Amalfitano, D.; Faralli, S.; Hauck, J.C.R.; Matalonga, S.; Distante, D. Artificial Intelligence Applied to Software Testing: A Tertiary Study. ACM Comput. Surv. 2023, 56, 1–38. [Google Scholar] [CrossRef]
  15. Baqar, M.; Khanda, R. The Future of Software Testing: AI-Powered Test Case Generation and Validation. In Proceedings of the Intelligent Computing; Arai, K., Ed.; Springer: London, UK, 2025; pp. 276–300. [Google Scholar]
  16. Salman, H.; Uddin, M.N.; Acheampong, S.; Xu, H. Design and Implementation of IoT Based Class Attendance Monitoring System Using Computer Vision and Embedded Linux Platform; Springer International Publishing: Cham, Switzerland, 2019; Volume 927, pp. 25–34. [Google Scholar] [CrossRef]
  17. Khemasuwan, D.; Sorensen, J.S.; Colt, H.G. Artificial intelligence in pulmonary medicine: Computer vision, predictive model and COVID-19. Eur. Respir. Rev. 2020, 29, 200181. [Google Scholar] [CrossRef] [PubMed]
  18. Gargin, V.; Radutny, R.; Titova, G.; Bibik, D.; Kirichenko, A.; Bazhenov, O. Application of the computer vision system for evaluation of pathomorphological images. In Proceedings of the 2020 IEEE 40th International Conference on Electronics and Nanotechnology (ELNANO), Kyiv, Ukraine, 22–24 April 2020; pp. 469–473. [Google Scholar] [CrossRef]
  19. Shams, A.; Schekelmann, A.; Mülder, W. A proof of concept for providing traffic data by AI based computer vision as a basis for smarter industrial areas. Procedia Comput. Sci. 2022, 201, 239–246. [Google Scholar] [CrossRef]
  20. Moore, S.; Liao, Q.V.; Subramonyam, H. fAIlureNotes: Supporting Designers in Understanding the Limits of AI Models for Computer Vision Tasks. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 23–28 April 2023. [Google Scholar] [CrossRef]
  21. Sharma, A.; Prasad, K.; Chakrasali, S.V.; Gowda V, D.; Kumar, C.; Chaturvedi, A.; Pazhani, A.A.J. Computer vision based healthcare system for identification of diabetes & its types using AI. Meas. Sens. 2023, 27, 100751. [Google Scholar] [CrossRef]
  22. Fuentes-Peñailillo, F.; Carrasco Silva, G.; Pérez Guzmán, R.; Burgos, I.; Ewertz, F. Automating Seedling Counts in Horticulture Using Computer Vision and AI. Horticulturae 2023, 9, 1134. [Google Scholar] [CrossRef]
  23. Zhou, L.; Zhang, L.; Konz, N. Computer Vision Techniques in Manufacturing. IEEE Trans. Syst. Man, Cybern. Syst. 2023, 53, 105–117. [Google Scholar] [CrossRef]
  24. Tang, Y.M.; Kuo, W.T.; Lee, C. Real-time Mixed Reality (MR) and Artificial Intelligence (AI) object recognition integration for digital twin in Industry 4.0. Internet Things 2023, 23, 100753. [Google Scholar] [CrossRef]
  25. Li, C.H.; Chow, E.W.H.; Tam, M.; Tong, P.H. Optimizing DG Handling: Designing an Immersive MRsafe Training Program. Sensors 2024, 24, 6972. [Google Scholar] [CrossRef] [PubMed]
  26. Yazdi, M. Augmented Reality (AR) and Virtual Reality (VR) in Maintenance Training. In Advances in Computational Mathematics for Industrial System Reliability and Maintainability; Springer Series in Reliability Engineering; Springer: Cham, Switzerland, 2024; pp. 169–183. [Google Scholar]
  27. Kalluri, P.R.; Agnew, W.; Cheng, M.; Owens, K.; Soldaini, L.; Birhane, A. Computer-vision research powers surveillance technology. Nature 2025, 643, 73–79. [Google Scholar] [CrossRef]
Figure 1. Validation of image-based intelligence for a computer vision system.
Figure 2. Document-based intelligence validation for a computer vision system.
Figure 3. A context classification tree for OTI systems.
Figure 4. An input classification tree for OTI systems.
Figure 5. An output classification tree for an OTI system.
Figure 6. Test service platform used for computer vision.
Figure 7. Samples for data augmentation. (a) Result of rotational augmentation. (b) Result of weather classification.
Figure 8. A sample of a classification tree for an AI-powered function in the mobile app, Seek. (a) A context classification tree. (b) An input classification tree. (c) An output classification tree.
Figure 9. Sample test cases for species inputs demonstrating context variation. (a) Effect of distance on recognition performance. (b) Impact of grayscale vs. color images on recognition performance.
Figure 10. Sample test case for non-species.
Figure 11. Graphical representation of failure modes in species classification. (a) Distance sensitivity. (b) Grayscale sensitivity.
Table 1. Literature review on practical applications of computer vision and AI.
| Ref. | Objective | Methodology | Application | Challenges |
| --- | --- | --- | --- | --- |
| [16] | Fast, accurate IoT-based attendance system on embedded Linux | Haar cascade for face detection, LBP histogram for recognition | Face detection and recognition | Large cloud storage requirements |
| [17] | Review of AI use in pulmonary medicine for specialists | A literature review on CV in imaging, ML for prediction, and AI in COVID-19 response | CV and AI in pulmonary medicine | Potential iatrogenic risks from AI algorithms |
| [18] | CV for scoring and counting in pathomorphological images vs. manual estimation | Python 2.7 for slide analysis | Pathomorphological image evaluation | 9.2% error, 90.8% cancer marker detection accuracy |
| [19] | Scalable, low-cost AI-based CV for real-time traffic in industrial areas | Open-source ML, edge devices, ad hoc setup, modular dashboard | Traffic monitoring, decision support, smart infrastructure | Indoor setup only, heavy inference load, partner expertise needed |
| [20] | Workflow for early exploration of model behavior and failures | fAIlureNotes tool for model evaluation and failure identification | UX support in understanding AI limits | Limited AI expertise and lack of accessible tools |
| [21] | AI/ML model for improved prediction accuracy | Symptom-based diabetes classification from dataset | Diabetes type identification in healthcare | Demand for accurate, timely disease forecasting |
| [22] | Seedling counting via object detection and a mobile app | CRISP-DM for data capture, processing, and training | Horticulture for agro-process automation | Counting efficiency between 57 and 96% |
| [23] | Method survey in CV: detection, recognition, segmentation, 3D modeling | Manufacturing system with sensing, CV, decision making, actuation | Manufacturing | Implementation, preprocessing, labeling, benchmarking |
| [24] | Integrating MR with AI recognition for digital twins | Real-time MR–AI with IoT connectivity | Industry 4.0 object recognition/monitoring | High computation; accuracy–latency trade-offs |
| [25] | Designing MR-based safe training for dangerous goods (DGs) | Designing MR-based safe training for dangerous goods (DGs) | Logistics and industrial safety training | Realism of scenarios; transfer to real-world tasks |
| [26] | Applying AR/VR to enhance maintenance training | Interactive AR/VR learning modules | Industrial system maintenance | Industrial system maintenance |
| [27] | CV–surveillance link analysis | Paper–patent analysis, language pattern mining | Human-targeted detection in surveillance | Obfuscation, normalization, transparency gaps |
Table 2. Focus and targets of computer vision systems.
| Feature | Task Performed |
| --- | --- |
| Object extraction | It helps in the extraction of objects from an image |
| Object detection and classification | It helps with detecting and classifying an object from an image |
| Object tracking and counting | It helps with tracking and counting objects in an image |
| Object behavior detection and classification | It helps with behavior detection for objects and their classification in an image |
| Object identification, recognition, segmentation, and feature extraction | It is used to identify/recognize and segment an object along with extracting features from an image |
| Document preprocessing and classification | It helps with the preprocessing of a document and its classification in an image |
| Text/data extraction and collection from documents | It helps in text and data extraction along with its collection in a document |
| Document analysis and understanding | It helps understand and analyze a document |
| Document review and audit | It helps with auditing and reviewing a document |
| Data/text validation | It is used to validate the data and text in a document |
Table 3. Test coverage for computer vision systems.
| Coverage | Task Performed |
| --- | --- |
| Object detection and classification | The test covers domain-specific object detection and its classification |
| Object tracking | The tracking of objects is covered in this test coverage |
| Object behavior detection and classification | This covers behavior detection for objects and their classification in an image |
| Object counting | This covers counting objects within an image |
| Object segmentation | This covers the segmentation of the object from an image |
| Domain-specific object recognition | This covers the domain-specific recognition of objects in an image |
| Feature extraction | This covers the extraction of features from an image |
| Object extraction | This covers the extraction of specific objects from an image |
Table 4. Classification results per species category.
| Category | Pass Rate (Per Category) |
| --- | --- |
| Species | |
| Insects | 8/25 = 32% |
| Flowers | 7/25 = 28% |
| Trees | 11/25 = 44% |
| Birds | 12/25 = 48% |
| Total | 38/100 = 38% |
| Non-Species | |
| Chair | 6/6 = 100% |
| Fan | 2/2 = 100% |
| Blue Background | 4/4 = 100% |
| Bottle | 4/4 = 100% |
| Total | 16/16 = 100% |
Table 5. Classification results under different contextual variations.
| Context Included | Pass Rate (All Categories) |
| --- | --- |
| Species | |
| Low blur | 23/48 = 48% |
| High blur | 15/52 = 29% |
| Low brightness | 10/36 = 28% |
| Normal brightness | 14/32 = 41% |
| High brightness | 14/32 = 41% |
| Close distance | 22/52 = 42% |
| Far distance | 16/48 = 33% |
| Grayscale | 4/52 = 8% |
| Non-grayscale | 34/48 = 71% |
| Non-Species | |
| Low blur | 2/16 = 12.5% |
| High blur | 2/16 = 12.5% |
| Low brightness | 2/16 = 12.5% |
| Normal brightness | 2/16 = 12.5% |
| High brightness | 2/16 = 12.5% |
| Close distance | 2/16 = 12.5% |
| Far distance | 2/16 = 12.5% |
| Grayscale | 2/16 = 12.5% |
| Non-grayscale | 2/16 = 12.5% |
| Total | 16/16 = 100% |
Table 6. Comparison of testing approaches for computer vision systems.
| Aspect | Conventional Testing | Model-Based Testing | AI Testing | Benchmark Testing |
| --- | --- | --- | --- | --- |
| Objective | Applying traditional system testing methods to support test design | Setting up the test model to support test design and automation | Automating and optimizing the test design using AI/ML testing models | Comparing the performance on fixed test sets with standardized metrics |
| Test Approach | Mostly ad hoc, scripted, conventional system function testing | Formalized using decision tables and models | AI-driven adaptive testing with dynamic feedback loops | Test-set-driven evaluation with static metrics |
| Test Data Generation | Manual or conventional test case generation | Derived systematically from model transitions and constraints | Generated automatically using AI/ML, including domain-specific data | Predefined test sets (e.g., ImageNet, COCO) with limited flexibility |
| Test Augmentation | Rare or manual augmentations | Scenario-based extension using model variations | Automated augmentation (noise, blur, rotation, adversarial perturbations) | Limited augmentation, often restricted to test set scope |
| Test Validation | Human oracle or ad hoc oracle testing, often subjective | Model oracle validation against expected behavior | AI-assisted validation, anomaly detection, bug analysis | Ad hoc test validation based on the given test set |
| Test Coverage | Limited conventional coverage (e.g., test scenario coverage) | Systematic model-based coverage for CV systems | Broad coverage through automated generation and augmentation | Restricted to benchmark test set scope |