1. Introduction
The prevalence of plastic contamination in cotton lint has emerged as a critical concern for the U.S. cotton industry, particularly in light of the introduction of novel harvester designs utilizing plastic-wrapped cotton modules. This development has resulted in an economically significant increase in the incidence of plastic within cotton bales, as reported by textile mills. Samples from U.S. cotton classing offices have identified the primary source of contamination as the plastic wrap used on cotton harvesters' plastic-wrapped modules. Despite rigorous industry efforts to eliminate plastic during module unwrapping, residual contamination persists throughout the cotton gin processing and packaging systems [1].
The economic implications of plastic contamination on U.S. cotton market value are significant. Historically, U.S. cotton commanded a premium of 0.02 USD/kg in international markets due to its superior cleanliness. However, following the introduction of plastic-wrapped modules, economic projections suggest that U.S. cotton now trades at a 0.01 USD/kg discount, resulting in a total loss of 0.034 USD/kg compared to pre-plastic-wrapping market values. Extrapolating these figures to typical annual cotton yields, the estimated financial impact on U.S. producers exceeds 750 million USD per annum, raising substantial concerns among cotton growers and the gin industry. While it is acknowledged that plastic contamination may not be solely responsible for this economic downturn, economic analyses indicate that it is a major contributing factor. Further research is warranted to quantify the precise impact of plastic contamination on cotton quality and market value [2,3,4,5,6,7,8].
The primary aim of this technical manuscript is to advance the software architecture of the authors' machine vision plastic contamination detection system so that it can operate in a fully autonomous mode with, at most, minimal training. To achieve this, several vision transformer AI models were developed. The difficulty with training is that it is unknown which objects will appear, and the subject matter is widely diverse, ranging from unknown reflected light sources, to wasps flying under the cameras, to birds chasing the wasps, to gin personnel reaching in with various objects in their hands and arms, sometimes covered by brightly colored shirts, which the machine vision system rightly flags as not cotton and therefore likely plastic. What we propose in order to solve this is to use an artificial intelligence (AI) vision transformer (ViT) model to classify these edge cases that have been impeding the adoption and use of an auto-calibration system [9,10,11]. The AI vision system's job is to identify what is truly plastic, which images to ignore (birds, wasps, arms, etc.), and which are valid cotton images that should be trained on in order to provide the adaptive performance necessary for robust operation as the cotton changes from pristine white cotton, to cotton with extensive yellow spotting (which looks very close to yellow plastic wrap), to cotton with many green leaves (which looks much like green plastic wrap). Finally, there is a wide variety of other vegetative matter that gets caught up in the harvesters and brought into the gin, eventually making its way down the gin-stand feeder apron, where it is detected by machine vision systems looking for anything that is not ordinary cotton; it should nevertheless be ignored and excluded from both the plastic class and the cotton class, as training on it is not desirable. In one season at one commercial gin, the machine vision systems flagged over 50,000 images as being plastic. This all translates to the need for a very robust AI vision classifier, which, even with transfer learning, requires tens of thousands of annotated images and adds significant cost in developing these image datasets.
For readers who are unfamiliar with training deep learning models, one of the challenges with deep learning models stems from their millions of parameters, which enable them to memorize minute details; if trained on too small a dataset, this leads to spurious classification, where the deciding feature is not what the developer intends but rather some unappreciated, stronger correlator. In the example shown in Figure 1, the two classes are 'dog' and 'cat', but because the model was trained on a very limited dataset, the strongest correlator the model actually found was not whether the image contained a 'dog' or a 'cat' but whether the subject was outside or inside. The term for this is AI spurious classification bias. To avoid these types of incorrectly trained models, robust deep learning (DL) models are typically trained on millions of images to ensure removal of any classification bias in the model. The reason this occurs is that DL models are created by training an enormous number of parameters, so over-fitting the model is tremendously easy. Think of multiple linear regression: every extra variable comes with an associated loss of a degree of freedom. As the number of variables approaches the number of data points, the model falsely obtains a high coefficient of determination, r², even though the model has simply been over-fit.
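To make this concrete, the short sketch below (illustrative only, not part of the authors' codebase) fits ordinary least squares to pure random noise and shows r² climbing toward 1 as the number of variables approaches the number of data points.

```python
# Minimal sketch (illustrative, not from this study's code): r^2 becomes
# meaningless as the number of regression variables approaches the number
# of data points, even when every predictor is pure noise.
import numpy as np

rng = np.random.default_rng(0)
n_points = 20
y = rng.normal(size=n_points)                        # response: pure noise

for n_vars in (2, 10, 19):
    X = rng.normal(size=(n_points, n_vars))          # random, uninformative predictors
    X1 = np.column_stack([np.ones(n_points), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)    # ordinary least squares fit
    resid = y - X1 @ beta
    r2 = 1.0 - resid.var() / y.var()                 # coefficient of determination
    print(f"{n_vars:2d} noise variables -> r^2 = {r2:.3f}")
# r^2 approaches 1.0 as n_vars nears n_points: a textbook over-fit.
```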
Given the high cost of developing massive image datasets, we began examining alternative semi-automated approaches, which are the primary focus of this report. We focused on how to leverage general-purpose (GP) vision-based image-to-caption (ViTC) models to help reduce the labor required to annotate a massive custom image dataset suitable for training a high-speed DL model. In our case, we used this auto-classified image dataset to train a high-speed custom ViT image classifier that could be used in the real-time applications that our plastic-detection machine vision system requires for its auto-calibration system [12,13].
1.1. Vision Transformers
Recent developments in general-purpose (GP) AI image-to-caption models, such as BLIP and GIT, gave us the impetus to explore their use [13,14,15]. Recent work [16] reported on leveraging BLIP to help fine-tune a model the authors were seeking to extend; however, the approach required restructuring the model and using a complicated model augmentation method. Thus, while the approach was reported to be faster than fine-tuning a large pre-trained model, it is still quite extensive and requires significant expertise in how large language models combined with vision transformer models are constructed, so that the researcher can retrofit the model and then retrain it using the reported technique. This is very similar to other approaches that use adapters as alternatives to fine-tuning, which are also powerful techniques and have seen widespread adoption with Low-Rank Adaptation (LoRA) approaches [17]. However, while both of these approaches can achieve excellent predictive results and do reduce the amount of training data required, they are involved and labor-intensive to design, configure, and run, and can only be carried out by experts. As such, both approaches are out of reach for non-AI researchers and would likely have taken us as long as simply hand-annotating the 50,000+ images needed to fine-tune a vision transformer model. For our research program, the interest was in a simpler approach that could leverage the GP-ViTC image-to-caption models without getting buried in the details and extensive time required to build LoRA models and retrain them. Moreover, the GP-ViTC models are too slow to use in real time, and even the LoRA approach requires tens of thousands of images. We were looking for a much simpler approach that would be more accessible to non-AI researchers and would significantly reduce the cost and effort of developing massive image datasets suitable for training task-specific vision transformer models (ViTs).
After running numerous test cases on our cotton images, it became apparent that the captions provided by these GP models aligned naturally with our classes under simple language partitioning. Thus, we set out to explore their potential for classifying our images if these GP-ViTC models were followed by a semantic classifier (SC) that converts the output captions into a formal image class. In exploring this, we found that we could use a tiny fraction of the training set that would be required even for transfer learning of a standard convolutional neural network (CNN) model or a vision transformer (ViT) model. This represents significant time savings through the reduction in the number of images that must be annotated. Suddenly we had the ability, with minimal labor, to annotate our much larger dataset, which could then be leveraged to train a high-speed ViT model.
1.2. Novel Contribution
Novelty: In examining the literature, scant existing research was uncovered on the specific workflow presented here: combining two different off-the-shelf GP-ViTC models (BLIP and GIT) to obtain a richer set of caption keywords; feeding those keywords into a custom-developed, simple, and fast-to-train statistical keyword-based machine learning (ML) semantic classifier; and optionally combining the result with a traditional machine vision (MV) model, such as the authors' PIDES negative classifier [9]. In combination, this creates a composite system that requires minimal training images, unlike what is needed to create new traditional deep learning (DL) models such as ViTC models. This composite "JOINT" model appears to be a unique approach, as it is composed of two GP-ViTC models plus ML and MV stages, with the overall objective of efficiently generating a large annotated dataset for a niche problem and minimizing the number of images required to train custom ViT models, which would normally require manual annotation of 50,000–100,000 images or more, an extremely time-consuming and expensive undertaking.
In practice, we found that the JOINT model was sufficiently predictive without the MV model, and this configuration would likely provide the most useful topology. However, as MV models can be developed with minimal datasets, the extra degree of separation provided by the MV model can be very useful in certain harder-to-separate use cases.
Inherited from others: The BLIP and GIT models themselves are existing tools [13,14], as are the general concepts of image captioning and text classification.
Inherited from the authors' past work: The PIDES negative classifier [9] is a novel machine vision algorithm for classifying images in which one of the primary classes has undefined colors, such that a traditional Bayes pixel classifier cannot be used because there is no way to establish the decision boundary partition between classes.
1.3. Anticipated Research Products
The ability to train a ViT model from a much-reduced hand-annotated image dataset, by using a minimally trained JOINT model to automatically annotate a massive image dataset, provides a unique time-saving solution for practitioners developing in the physical artificial intelligence (AI) space, which is known for its dearth of datasets. To date, the lack of suitable datasets represents a major impediment to progress in these industrial niche areas due to the prohibitive costs associated with creating suitable image datasets. The JOINT model presented here provides a unique solution and is anticipated to open up a new, cost-effective way to develop AI models for edge devices in the physical AI arena. This proposed approach significantly reduces the barriers to adoption in these niche industries and is anticipated to bring newfound accessibility to AI and its performance gains; as such, it represents a transformational inflection point in the development of machine vision and the new applications this approach makes possible.
1.4. Objective
The objective of this technical note is the rapid transfer of this new technique, which has tremendous potential to help other researchers reduce the time and cost typically required to develop new state-of-the-art deep learning image classification models in support of their research endeavors.
2. Materials and Methods
In creating the JOINT model, which is composed of GP-ViTC models whose outputs are fed into a machine learning semantic bag-of-words classifier, a small subset of a much larger pool of machine vision images flagged as plastic contamination was manually classed. The images flagged as "plastic" typically contain actual plastic, an empty tray with reflected lights, bugs flying in front of the camera, or objects and hands of gin personnel clearing debris on the gin-stand feeder apron beneath the machine vision cameras. To supplement these images, a second set of machine vision cameras was installed to sample and store cotton images at a 1 Hz rate. This led to a rich 7 TB dataset of normal cotton images. The combined image training data were obtained in the 2022 season at a commercial gin in West Texas, USA. Each image was hand-classed by an expert into one of four classes (cotton, empty-tray, plastic, hand-intrusion), where a hand-intrusion image was any image containing a hand, an arm, or a ginner's clearing stick. An additional non-determinable class, "ND", was created for images where the object could not be identified (sometimes plastic cannot be differentiated from cotton, especially when a lot of yellow-spotted cotton is intermixed with normal white cotton coming down the feeder apron, which runs under the machine vision cameras) [9].
In working with the GP-ViTC models, six models were examined and the two best were selected to create our JOINT GP-ViTC image classifier system. BLIP and GIT performed the best and were selected to perform the first stage of the image classification [14,15,16]. In practice, because these two GP-ViTC models were trained on different datasets and have different architectures, combining their outputs produced, for many images, complementary captions that helped with class prediction. When developing an auto-calibration system, it is critical not to introduce false positives; the system was therefore designed to fail gracefully, as it is better to skip an image than to allow a false positive to corrupt the image training dataset used in each auto-calibration negative-classifier build, which the high-speed plastic detection machine vision cameras rely on. Individually, neither of the GP-ViTC models was sufficiently accurate to be used standalone.
However, the outputs of GIT and BLIP, when combined, provided complementary captions, which allowed the semantic classifier to use the combined captions in its class inference for the image. This led to the JOINT classifier architecture (see Figure 2). Note: the semantic keyword classifier is covered in the next sub-section.
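To illustrate the caption-generating stage, the sketch below shows one way to obtain and combine captions from publicly available BLIP and GIT checkpoints using the Hugging Face transformers library. The specific checkpoint names and generation settings are assumptions for illustration and are not necessarily the exact models or settings used in this work.

```python
# Illustrative sketch of the caption-generating stage of the JOINT classifier.
# The checkpoint names below are public Hugging Face models chosen for
# illustration; they may differ from the exact models used in this study.
from PIL import Image
import torch
from transformers import (AutoProcessor, AutoModelForCausalLM,
                          BlipProcessor, BlipForConditionalGeneration)

device = "cuda" if torch.cuda.is_available() else "cpu"

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

git_proc = AutoProcessor.from_pretrained("microsoft/git-base-coco")
git_model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco").to(device)

def caption_image(path: str) -> str:
    """Return the combined BLIP + GIT caption for one image."""
    image = Image.open(path).convert("RGB")

    blip_inputs = blip_proc(images=image, return_tensors="pt").to(device)
    blip_ids = blip_model.generate(**blip_inputs, max_new_tokens=30)
    blip_caption = blip_proc.decode(blip_ids[0], skip_special_tokens=True)

    git_pixels = git_proc(images=image, return_tensors="pt").pixel_values.to(device)
    git_ids = git_model.generate(pixel_values=git_pixels, max_new_tokens=30)
    git_caption = git_proc.batch_decode(git_ids, skip_special_tokens=True)[0]

    # Concatenating the two captions gives the downstream semantic classifier
    # a richer keyword pool than either model alone.
    return blip_caption + " " + git_caption

# Example (hypothetical file name): print(caption_image("feeder_apron_frame.jpg"))
```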
To explain the need for speed, it must be appreciated that in any one gin there could be upwards of 20–50 machine vision machines, all feeding images to the AI workstation for processing. As such, the GP-ViTC JOINT approach is too slow for on-line use, as it would require too many AI workstations to process all the images and would not be cost-effective. Thus, the true value of this approach lies in its low-cost, minimal-training development, which can then be leveraged off-line to automatically create a full image dataset for training a ViT model that is deployed to the gins.
To test this new concept, the JOINT GP-ViTC image classifier utilized a modest set of images (<500) to build the semantic classifier, which was then validated using a new independent set comprising 204 images from each gin-stand at two commercial gins, for a total of 1224 images from gin 1 and 816 images from gin 2. As there are two classification objectives, two semantic classifiers were created: one for the prediction of plastic, used to send only plastic images to the website in the gin's control room (VISN-Web), and a second for obtaining pure cotton images suitable for use in an auto-calibration program (CALIBRATOR).
The objective of this report is to transfer the technology to readers. As such, there is no testable hypothesis, and statistical analysis is limited to the performance of the GP-ViTC image classifier, which was trained on a very limited dataset and validated against an independent dataset. Of note is that the validation set was obtained from completely different cotton gins in the following ginning season. This ensures a fair, unbiased validation test set for assessing the potential of this novel approach of using general-purpose, foundational-quality image-to-caption models, where a limited training set is used to design a semantic classifier that transforms GP-ViTC caption outputs into a class prediction. Given that our subject matter is pieces of plastic ripped from larger pieces in the process stream, the positive results were surprising and suggest a potential for wider applications; this may provide an answer for industries with very limited data and scarce budgets, which cannot assemble the traditionally massive datasets normally needed to create AI models to solve their most immediate niche problems. Moreover, it has the potential to do so in a very cost-effective manner.
2.1. Semantic Classifier Portion of JOINT Model
To create the semantic classifier section of the JOINT model, and to transform the captions into a class prediction, we explored the novel potential of utilizing the outputs from foundational-quality vision-to-caption transformer (ViTC, e.g., GIT, BLIP) models to classify our task-specific images, which, in this case, are images of cotton flowing down a slide where the object of interest is some random occurrence of plastic contamination of unknown origin, shape, and color. The caption text outputs from the GIT and BLIP models were analyzed from a small test dataset to form a statistical basis for "bag-of-words" [18,19,20,21,22,23] keyword classification. Given that neither of the foundational ViTC models was ever trained on images remotely like those of our study, the caption outputs exhibited a remarkable alignment with our classes. Used individually, classification rates were below 80–90%, but when the caption outputs from the two ViTC models were combined, the accuracy improved to over 95%. While these ViTC models are slow (8–10 s per image), they are still much faster than manually annotating tens of thousands of images, which is very expensive; moreover, manual classification tends to misclassify edge-case images where the class choice is unclear, except perhaps to an expert with years of experience in that particular field. Automating classification also addresses a well-known problem in the field: disagreement between human graders, who inevitably deviate in their class choices on less obvious images (of which there can be a significant number), is a major hurdle in generating massive image datasets suitable for training large models. This method aims to reduce training time and improve classification accuracy by leveraging the frequency distribution of words across different classes to determine their predictive power in transforming GP-ViTC output captions into predicted image classifications. The machine learning (non-DL) statistical keyword semantic classifier is detailed in
Figure 2, where the image is fed into the GP-ViTC image-to-caption models, whose output captions are passed to a bag-of-words [18,19,20,21,22,23] keyword classifier that is then referenced against an a priori frequency histogram to determine the most likely class associated with the keywords of a particular image. The frequency histograms are developed on a small micro-dataset of a few hundred images, which is orders of magnitude smaller than the number of images that would have been required to train a DL model. While this approach is slow and not suitable for real time, as classification rates are 8–10 s per image (at the time of this writing), the benefit is that the JOINT model can be trained on a few hundred expertly annotated images to build up the frequency histogram (see Figure 3). The use of a frequency histogram for classification of textual data is a well-known and established method [22,23].
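A minimal sketch of how such class-specific keyword frequency histograms can be built is given below. This is an illustrative reimplementation of the idea rather than the supplementary 'histogram_semantic_classifier_r4.py' code; the stop-word list, minimum keyword length, and training captions are assumptions.

```python
# Illustrative sketch of the training phase of the bag-of-words semantic
# classifier: build one keyword-frequency histogram per class from a small
# set of expert-labeled captions. Not the authors' supplementary code.
import re
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is", "are"}  # assumed list
MIN_LEN = 3                      # assumed minimum keyword length

def extract_keywords(caption: str) -> list[str]:
    """Normalize a caption to lowercase keywords (no punctuation/stop/short words)."""
    words = re.findall(r"[a-z]+", caption.lower())
    return [w for w in words if len(w) >= MIN_LEN and w not in STOP_WORDS]

def train_histograms(labeled_captions: list[tuple[str, str]]) -> dict[str, Counter]:
    """labeled_captions: (caption, true_class) pairs -> per-class keyword Counter."""
    histograms: dict[str, Counter] = defaultdict(Counter)
    for caption, true_class in labeled_captions:
        histograms[true_class].update(extract_keywords(caption))
    return dict(histograms)

# Tiny hypothetical training examples (the real training used a few hundred captions):
train = [
    ("a pile of white cotton on a conveyor", "COTTON"),
    ("a piece of yellow plastic wrap on cotton", "PLASTIC"),
    ("an empty metal tray with bright reflections", "TRAY"),
    ("a person's hand and arm reaching over the tray", "HID"),
]
histograms = train_histograms(train)
print(histograms["PLASTIC"].most_common(3))
```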
Once the frequency histogram has been created, the JOINT model can be fed 50,000+ images and left to run unattended, automatically establishing the correct class of each unknown image and thereby creating and annotating a massive image dataset suitable for training DL models, with minimal input from human image annotators, who are expensive and for whom building such datasets takes significant time.
In practice, we found that in numerous cases more than one class was indicated by a single caption, especially when captions were combined from the two GP-ViTC models. In the case of multiple predicted classes, we found it useful to set up a fail-gracefully bias to ensure the least damaging choice was selected. As we had two primary uses, the bias differed depending on the end target use. For our primary use case, this led to a bias, or fail-safe, of TRAY > HID > PLASTIC > COTTON, with ties resolved in favor of the highest priority (so TRAY over-rides HID, etc.). In this example, if the JOINT classifier predicts either a TRAY or HID image, the image is rejected before it can be considered as either a PLASTIC or COTTON image (the two primary classes of interest for our use case).
JOINT Classifier Algorithm Description
This algorithm submits the image to the GIT and BLIP GP-ViTC models, which return captions that are combined and then sent to the statistical keyword bag-of-words semantic classifier (Figure 4 and Figure 5), which classifies the input text captions into predefined categories (TRAY, HID, PLASTIC, COTTON) by learning characteristic keyword frequencies for each category from a labeled training dataset.
Training phase: The model is trained using captions labeled with their true class. Each training caption undergoes a keyword extraction process (normalization to lowercase; removal of punctuation, stop words, and short words) to yield a list of meaningful keywords. Class-specific frequency histograms are then constructed by aggregating keyword counts from all training examples belonging to each category. These histograms represent the statistical distribution of keywords associated with each class;
Prediction phase: To classify a new input caption, it is first processed using the identical keyword extraction method. The algorithm then calculates a log-likelihood score for each potential class, estimating the probability of observing the caption’s keywords given the class’s learned frequency histogram (using a naïve Bayes-like approach with Laplace smoothing for robustness against unseen words);
Ambiguity resolution and biased prediction: If multiple classes yield scores close to the maximum score (within a predefined ambiguity margin), indicating uncertainty, a refined bias mechanism is employed. First, the set of high-scoring candidate classes is filtered to include only those whose training histogram contains at least one keyword present in the input caption. This ensures relevance. If ambiguity still exists among these filtered candidates, a predefined priority order (TRAY > HID > PLASTIC > COTTON) is applied only to this relevant subset, and the highest-priority class is selected as the final prediction. If only one filtered candidate emerges, it is chosen directly;
Output: The algorithm outputs the single, final predicted class for the input caption.
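The sketch below illustrates this prediction phase, reusing the extract_keywords() function and histograms from the training sketch above. The ambiguity margin and Laplace smoothing constant are assumed values; the code is an illustrative reimplementation, not the supplementary classifier.

```python
# Illustrative sketch of the prediction phase: naive-Bayes-like scoring with
# Laplace smoothing, plus the fail-safe priority TRAY > HID > PLASTIC > COTTON
# for ambiguous cases. Assumes extract_keywords() and histograms from the
# training sketch above; the margin and smoothing values are assumptions.
import math
from collections import Counter

PRIORITY = ["TRAY", "HID", "PLASTIC", "COTTON"]   # fail-safe order, highest first
AMBIGUITY_MARGIN = 1.0                            # assumed log-likelihood margin
ALPHA = 1.0                                       # Laplace smoothing constant

def predict(caption: str, histograms: dict[str, Counter]) -> str:
    keywords = extract_keywords(caption)
    vocab = {w for h in histograms.values() for w in h}
    scores = {}
    for cls, hist in histograms.items():
        total = sum(hist.values())
        # log P(keywords | class) with Laplace smoothing for unseen words
        scores[cls] = sum(
            math.log((hist.get(w, 0) + ALPHA) / (total + ALPHA * len(vocab)))
            for w in keywords
        )

    best = max(scores.values())
    candidates = [c for c, s in scores.items() if best - s <= AMBIGUITY_MARGIN]
    if len(candidates) > 1:
        # Keep only candidates whose histogram shares a keyword with the caption
        relevant = [c for c in candidates
                    if any(w in histograms[c] for w in keywords)] or candidates
        # Resolve remaining ambiguity with the fail-safe priority order
        return min(relevant, key=PRIORITY.index)
    return candidates[0]

# Example: print(predict("a piece of torn yellow plastic on the cotton", histograms))
```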
This method offers a computationally efficient approach to leveraging textual information for image classification tasks. By focusing on highly predictive keywords, it reduces the dimensionality of the feature space while maintaining discriminative power. This approach has several key advantages:
Efficiency: This method reduces the complexity and time required for training by focusing on keywords;
Interpretability: The use of confidence scores makes the model’s predictions more interpretable;
Customization: Thresholds and keyword selections can easily be adjusted to balance precision and recall.
The JOINT classifier leverages the caption outputs from two GP-ViTC models, which are combined and then submitted as a bag-of-words [19,20,21,22] input to build the keyword semantic classifier. This approach makes it possible to automate the classification of images via the caption outputs from the latest foundational image-to-caption models, which have been trained on millions of images, and to do so at minimal cost, as all of the training utilizes machine learning techniques that can be performed with minimal training sets. As such, this enables use of the power of deep learning without the high cost of having to train the models on custom use cases. In essence, we have transitioned from writing code in formal, abstract programming languages to specifying behavior in plain English. This keyword frequency histogram semantic classifier approach proved sufficiently predictive for the target application. In our example implementation, we utilized the histogram approach, modified with a bias toward fail-safe class predictions, which was found to be highly accurate for our intended use, even though most of the classes had little to do with the images that the GP-ViTC models were trained on. For the interested reader, the example code is provided in the Supplementary Materials as the file titled 'histogram_semantic_classifier_r4.py', and flow charts of the process are provided in the next section in Figure 4 and Figure 5.
2.2. Source Code
The main source code was written in Python to take advantage of the rich support provided by numerous AI libraries. It is included in the Supplementary Materials under the file titled 'final_imageObject_identifier.py'. This custom Python source code was used to classify images using a combination of outputs from the foundational vision-to-caption models (GIT, BLIP), followed by a histogram-developed semantic classifier applied to the captions provided by the GIT and BLIP models. The custom code utilizes the following approach (also see Figure 4 and Figure 5):
System prerequisites:
- Nvidia graphical processing unit, "GPU" (should be >= RTX 3xxx, i.e., Ampere architecture or newer);
- Linux operating system (OS);
- Hardware platform: code should run on x86-64 (AMD64) or ARMv8 (AArch64) central processing units (CPUs);
  - Tested on an AArch64 Nvidia Orin NX.
System setup:
- Install the Nvidia driver;
- Install CUDA 12.x and cuDNN v8.x.x (be sure to match the cuDNN build to the CUDA version);
- Create a Python virtual environment, "venv";
  - In the "venv", install the requirements, i.e., the Python import libraries called in the source code:
    - Transformers, PyTorch, PIL, Pandas, Tkinter, os, sys, glob, gpustat.
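Before running the classifier, a quick environment sanity check along the following lines can help confirm that the GPU stack and the key libraries are in place (an illustrative sketch, not part of the supplementary code):

```python
# Illustrative environment sanity check (not part of the supplementary code):
# confirms the GPU/CUDA stack and the key libraries are importable.
import sys

import torch
import transformers
import PIL
import pandas

print("Python       :", sys.version.split()[0])
print("PyTorch      :", torch.__version__)
print("Transformers :", transformers.__version__)
print("Pillow (PIL) :", PIL.__version__)
print("Pandas       :", pandas.__version__)

if torch.cuda.is_available():
    print("CUDA device  :", torch.cuda.get_device_name(0))
    print("CUDA version :", torch.version.cuda)
else:
    # The GP-ViTC models will still run on CPU, but caption generation will be
    # considerably slower than the 8-10 s per image reported on GPU.
    print("WARNING: no CUDA-capable GPU detected; falling back to CPU.")
```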
Process steps: JOINT classifier: main application workflow:
Figure 4.
Overview: Bag-of-words statistical semantic classifier with frequency histogram class selection.
Process steps: bag-of-words semantic classifier function:
Figure 5.
Process steps: Bag-of-words statistical semantic classifier using fail-safe overlay with frequency histogram class selection.
4. Summary
The research found that, by combining two general-purpose image-to-caption (GP-ViTC) models, each based on a different version of the transformer architecture, it was possible, with minimal training of a semantic classifier that converts the captions provided by the GP-ViTC models, to develop a robust image classifier suitable for automatically annotating large image datasets. The speed of the combined GP-ViTC (JOINT) model was slow, at roughly 10 s per image. However, because it provides the ability to simply turn on a computer and run this model against tens of thousands of images without needing an expert to hand-annotate each image, it represents tremendous time savings. Our research leveraged this JOINT model to annotate an extensive image dataset that was then utilized to develop several high-speed ViT models for prediction of VISN-Web display "plastic" images as well as safe cotton images suitable for use in a fully automated auto-calibration system (CALIBRATOR). Further details on this work will be forthcoming in a subsequent paper. In summary, this approach of using GP image-to-caption models to create an off-line auto-image annotator was very successful, saved significant research time and expense, and is a valuable tool for our research program.