1. Introduction
The prevalence of plastic contamination in cotton lint has emerged as a critical concern for the U.S. cotton industry, particularly in light of the introduction of novel harvester designs utilizing plastic-wrapped cotton modules. This development has resulted in an economically significant increase in the incidence of plastic within cotton bales, as reported by textile mills. Samples from U.S. cotton classing offices have identified the primary source of contamination as the plastic wrap used on cotton harvesters' plastic-wrapped modules. Despite rigorous industry efforts to eliminate plastic during module unwrapping, residual contamination persists throughout the cotton gin processing and packaging systems [1].
The economic implications of plastic contamination on U.S. cotton market value are significant. Historically, U.S. cotton commanded a premium of 0.02 USD/kg in international markets due to its superior cleanliness. However, following the introduction of plastic-wrapped modules, economic projections suggest that U.S. cotton now trades at a 0.01 USD/kg discount, resulting in a total loss of 0.034 USD/kg compared to pre-plastic-wrapping market values. Extrapolating these figures to typical annual cotton yields, the estimated financial impact on U.S. producers exceeds 750 million USD per annum, raising substantial concerns among cotton growers and the gin industry. While it is acknowledged that plastic contamination may not be solely responsible for this economic downturn, economic analyses indicate that it is a major contributing factor. Further research is warranted to quantify the precise impact of plastic contamination on cotton quality and market value [2,3,4,5,6,7,8].
The primary aim of this technical manuscript is to advance the software architecture of the authors' machine vision plastic contamination detection system so that it can operate in a fully autonomous mode with, at most, minimal training. To achieve this, several vision transformer AI models were developed. The difficulty with training is that it is unknown which objects will appear, and the subject matter is widely diverse, ranging from unknown reflected light sources, to wasps flying under the cameras, to birds chasing the wasps, to gin personnel reaching in with various objects in their hands and arms, sometimes covered by brightly colored shirts, which the machine vision system rightly flags as not cotton and therefore likely plastic. What we propose in order to solve this is to use an artificial intelligence (AI) vision transformer (ViT) model to classify these edge cases that have been impeding the adoption and use of an auto-calibration system [9,10,11]. The AI vision system's job is to identify what is truly plastic, which images to ignore (birds, wasps, arms, etc.), and which are valid cotton images that should be trained on in order to provide the adaptive performance necessary for robust operation as the cotton changes from pristine white cotton, to cotton with extensive yellow spotting (which looks very close to yellow plastic wrap), to cotton with many green leaves (which looks much like green plastic wrap). Finally, there is a wide variety of other vegetative matter that gets caught up in the harvesters and brought into the gin, eventually making its way down the gin-stand feeder apron, where it is detected by machine vision systems looking for anything that is not ordinary cotton; it should nevertheless be ignored and excluded from both the plastic class and the cotton class, as training on it is not desirable. In one season at one commercial gin, the machine vision systems flagged over 50,000 images as being plastic. This all translates to the need for a very robust AI vision classifier, which, even with transfer learning, requires tens of thousands of annotated images and adds significant cost in developing these image datasets.
For readers who are unfamiliar with training deep learning models, one of the challenges with deep learning models stems from their millions of parameters, which enable them to memorize minute details; if trained on too small a dataset, this leads to spurious classification, where the deciding feature is not what the developer intends but rather some unappreciated, stronger correlator. In the example shown in Figure 1, the two classes are 'dog' and 'cat', but because the model was trained on a very limited dataset, the strongest correlator the model actually found was not whether the image contained a 'dog' or a 'cat' but whether the subject was outside or inside. The term for this is AI spurious classification bias. To avoid these types of incorrectly trained models, robust deep learning (DL) models are typically trained on millions of images to ensure removal of any classification bias in the model. The reason this occurs is that DL models are created by training an enormous number of parameters, so over-fitting the model is tremendously easy. Think of multiple linear regression: every extra variable comes with an associated loss of a degree of freedom. As the number of variables approaches the number of data points, the model falsely obtains a high coefficient of determination, r², even though the model has simply been over-fit.
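To make this concrete, the short sketch below (illustrative only, not part of the authors' codebase) fits ordinary least squares to pure random noise and shows r² climbing toward 1 as the number of variables approaches the number of data points.

```python
# Minimal sketch (illustrative, not from this study's code): r^2 becomes
# meaningless as the number of regression variables approaches the number
# of data points, even when every predictor is pure noise.
import numpy as np

rng = np.random.default_rng(0)
n_points = 20
y = rng.normal(size=n_points)                        # response: pure noise

for n_vars in (2, 10, 19):
    X = rng.normal(size=(n_points, n_vars))          # random, uninformative predictors
    X1 = np.column_stack([np.ones(n_points), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)    # ordinary least squares fit
    resid = y - X1 @ beta
    r2 = 1.0 - resid.var() / y.var()                 # coefficient of determination
    print(f"{n_vars:2d} noise variables -> r^2 = {r2:.3f}")
# r^2 approaches 1.0 as n_vars nears n_points: a textbook over-fit.
```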
Given the high cost of developing massive image datasets, we began examining alternative semi-automated approaches, which are the primary focus of this report. We focused on how to leverage general-purpose (GP) vision-based image-to-caption (ViTC) models to help reduce the labor required to annotate a massive custom image dataset suitable for training a high-speed DL model. In our case, we used this auto-classified image dataset to train a high-speed custom ViT image classifier that could be used in the real-time applications that our plastic-detection machine vision system requires for its auto-calibration system [12,13].
1.1. Vision Transformers
Recent developments in general-purpose (GP) AI image-to-caption models, such as BLIP and GIT, gave us the impetus to explore their use [13,14,15]. Recent work [16] reported on leveraging BLIP to help fine-tune a model the authors were seeking to extend; however, the approach required restructuring the model and using a complicated model augmentation method. Thus, while the approach was reported to be faster than fine-tuning a large pre-trained model, it is still quite extensive and requires significant expertise in how large language models combined with vision transformer models are constructed, so that the researcher can retrofit the model and then retrain it using the reported technique. This is very similar to other approaches that use adapters as alternatives to fine-tuning, which are also powerful techniques and have seen widespread adoption with Low-Rank Adaptation (LoRA) approaches [17]. However, while both of these approaches can achieve excellent predictive results and do reduce the amount of training data required, they are involved and labor-intensive to design, configure, and run, and can only be carried out by experts. As such, both approaches are out of reach for non-AI researchers and would likely have taken us as long as simply hand-annotating the 50,000+ images needed to fine-tune a vision transformer model. For our research program, the interest was in a simpler approach that could leverage the GP-ViTC image-to-caption models without getting buried in the details and extensive time required to build LoRA models and retrain them. Moreover, the GP-ViTC models are too slow to use in real time, and even the LoRA approach requires tens of thousands of images. We were looking for a much simpler approach that would be more accessible to non-AI researchers and would significantly reduce the cost and effort of developing massive image datasets suitable for training task-specific vision transformer models (ViTs).
After running numerous test cases on our cotton images, it became apparent that the captions provided by these GP models aligned naturally with our classes under simple language partitioning. Thus, we set out to explore their potential for classifying our images if these GP-ViTC models were followed by a semantic classifier (SC) that converts the output captions into a formal image class. In exploring this, we found that we could use a tiny fraction of the training set that would be required even for transfer learning of a standard convolutional neural network (CNN) model or a vision transformer (ViT) model. This represents significant time savings through the reduction in the number of images that must be annotated. Suddenly we had the ability, with minimal labor, to annotate our much larger dataset, which could then be leveraged to train a high-speed ViT model.
1.2. Novel Contribution
Novelty: In examining the literature, scant existing research was uncovered on the specific workflow presented here: combining two different off-the-shelf GP-ViTC models (BLIP and GIT) to obtain a richer set of caption keywords; feeding those keywords into a custom-developed, simple, and fast-to-train statistical keyword-based machine learning (ML) semantic classifier; and optionally combining the result with a traditional machine vision (MV) model, such as the authors' PIDES negative classifier [9]. In combination, this creates a composite system that requires minimal training images, unlike what is needed to create new traditional deep learning (DL) models such as ViTC models. This composite "JOINT" model appears to be a unique approach, as it is composed of two GP-ViTC models plus ML and MV stages, with the overall objective of efficiently generating a large annotated dataset for a niche problem and minimizing the number of images required to train custom ViT models, which would normally require manual annotation of 50,000–100,000 images or more, an extremely time-consuming and expensive undertaking.
In practice, we found that the JOINT model was sufficiently predictive without the MV model, and this configuration would likely provide the most useful topology. However, as MV models can be developed with minimal datasets, the extra degree of separation provided by the MV model can be very useful in certain harder-to-separate use cases.
Inherited from others: The BLIP and GIT models themselves are existing tools [13,14], as are the general concepts of image captioning and text classification.
Inherited from the authors' past work: The PIDES negative classifier [9] is a novel machine vision algorithm for classifying images in which one of the primary classes has undefined colors, such that a traditional Bayes pixel classifier cannot be used because there is no way to establish the decision boundary partition between classes.
1.3. Anticipated Research Products
The ability to train a ViT model from a much-reduced hand-annotated image dataset, by using a minimally trained JOINT model to automatically annotate a massive image dataset, provides a unique time-saving solution for practitioners developing in the physical artificial intelligence (AI) space, which is known for its dearth of datasets. To date, the lack of suitable datasets represents a major impediment to progress in these industrial niche areas due to the prohibitive costs associated with creating suitable image datasets. The JOINT model presented here provides a unique solution and is anticipated to open up a new, cost-effective way to develop AI models for edge devices in the physical AI arena. This proposed approach significantly reduces the barriers to adoption in these niche industries and is anticipated to bring newfound accessibility to AI and its performance gains; as such, it represents a transformational inflection point in the development of machine vision and the new applications this approach makes possible.
1.4. Objective
The objective of this technical note is the rapid transfer of this new technique, which has tremendous potential to help other researchers reduce the time and cost typically required to develop new state-of-the-art deep learning image classification models in support of their research endeavors.
2. Materials and Methods
In creating the JOINT model, which is composed of GP-ViTC models whose outputs are fed into a machine learning semantic bag-of-words classifier, a small subset of a much larger pool of machine vision images flagged as plastic contamination was manually classed. The images flagged as "plastic" typically contain actual plastic, an empty tray with reflected lights, bugs flying in front of the camera, or objects and hands of gin personnel clearing debris on the gin-stand feeder apron beneath the machine vision cameras. To supplement these images, a second set of machine vision cameras was installed to sample and store cotton images at a 1 Hz rate. This led to a rich 7 TB dataset of normal cotton images. The combined image training data were obtained in the 2022 season at a commercial gin in West Texas, USA. Each image was hand-classed by an expert into one of four classes (cotton, empty-tray, plastic, hand-intrusion), where a hand-intrusion image was any image containing a hand, an arm, or a ginner's clearing stick. An additional non-determinable class, "ND", was created for images where the object could not be identified (sometimes plastic cannot be differentiated from cotton, especially when a lot of yellow-spotted cotton is intermixed with normal white cotton coming down the feeder apron, which runs under the machine vision cameras) [9].
In working with the GP-ViTC models, six models were examined and the two best were selected to create our JOINT GP-ViTC image classifier system. BLIP and GIT performed the best and were selected to perform the first stage of the image classification [14,15,16]. In practice, because these two GP-ViTC models were trained on different datasets and have different architectures, combining their outputs produced, for many images, complementary captions that helped with class prediction. When developing an auto-calibration system, it is critical not to introduce false positives; the system was therefore designed to fail gracefully, as it is better to skip an image than to allow a false positive to corrupt the image training dataset used in each auto-calibration negative-classifier build, which the high-speed plastic detection machine vision cameras rely on. Individually, neither of the GP-ViTC models was sufficiently accurate to be used standalone.
However, the outputs of GIT and BLIP, when combined, provided complementary captions, which allowed the semantic classifier to use the combined captions in its class inference for the image. This led to the JOINT classifier architecture (see Figure 2). Note: the semantic keyword classifier is covered in the next sub-section.
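To illustrate the caption-generating stage, the sketch below shows one way to obtain and combine captions from publicly available BLIP and GIT checkpoints using the Hugging Face transformers library. The specific checkpoint names and generation settings are assumptions for illustration and are not necessarily the exact models or settings used in this work.

```python
# Illustrative sketch of the caption-generating stage of the JOINT classifier.
# The checkpoint names below are public Hugging Face models chosen for
# illustration; they may differ from the exact models used in this study.
from PIL import Image
import torch
from transformers import (AutoProcessor, AutoModelForCausalLM,
                          BlipProcessor, BlipForConditionalGeneration)

device = "cuda" if torch.cuda.is_available() else "cpu"

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

git_proc = AutoProcessor.from_pretrained("microsoft/git-base-coco")
git_model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco").to(device)

def caption_image(path: str) -> str:
    """Return the combined BLIP + GIT caption for one image."""
    image = Image.open(path).convert("RGB")

    blip_inputs = blip_proc(images=image, return_tensors="pt").to(device)
    blip_ids = blip_model.generate(**blip_inputs, max_new_tokens=30)
    blip_caption = blip_proc.decode(blip_ids[0], skip_special_tokens=True)

    git_pixels = git_proc(images=image, return_tensors="pt").pixel_values.to(device)
    git_ids = git_model.generate(pixel_values=git_pixels, max_new_tokens=30)
    git_caption = git_proc.batch_decode(git_ids, skip_special_tokens=True)[0]

    # Concatenating the two captions gives the downstream semantic classifier
    # a richer keyword pool than either model alone.
    return blip_caption + " " + git_caption

# Example (hypothetical file name): print(caption_image("feeder_apron_frame.jpg"))
```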
To explain the need for speed, it must be appreciated that in any one gin there could be upwards of 20–50 machine vision machines, all feeding images to the AI workstation for processing. As such, the GP-ViTC JOINT approach is too slow for on-line use, as it would require too many AI workstations to process all the images and would not be cost-effective. Thus, the true value of this approach lies in its low-cost, minimal-training development, which can then be leveraged off-line to automatically create a full image dataset for training a ViT model that is deployed to the gins.
To test this new concept, the JOINT GP-ViTC image classifier utilized a modest set of images (<500) to build the semantic classifier, which was then validated using a new independent set comprising 204 images from each gin-stand at two commercial gins, for a total of 1224 images from gin 1 and 816 images from gin 2. As there are two classification objectives, two semantic classifiers were created: one for the prediction of plastic, used to send only plastic images to the website in the gin's control room (VISN-Web), and a second for obtaining pure cotton images suitable for use in an auto-calibration program (CALIBRATOR).
The objective of this report is to transfer the technology to readers. As such, there is no testable hypothesis, and statistical analysis is limited to the performance of the GP-ViTC image classifier, which was trained on a very limited dataset and validated against an independent dataset. Of note is that the validation set was obtained from completely different cotton gins in the following ginning season. This ensures a fair, unbiased validation test set for assessing the potential of this novel approach of using general-purpose, foundational-quality image-to-caption models, where a limited training set is used to design a semantic classifier that transforms GP-ViTC caption outputs into a class prediction. Given that our subject matter is pieces of plastic ripped from larger pieces in the process stream, the positive results were surprising and suggest a potential for wider applications; this may provide an answer for industries with very limited data and scarce budgets, which cannot assemble the traditionally massive datasets normally needed to create AI models to solve their most immediate niche problems. Moreover, it has the potential to do so in a very cost-effective manner.
2.1. Semantic Classifier Portion of JOINT Model
To create the semantic classifier section of the JOINT model, and to transform the captions into a class prediction, we explored the novel potential of utilizing the outputs from foundational-quality vision-to-caption transformer (ViTC, e.g., GIT, BLIP) models to classify our task-specific images, which, in this case, are images of cotton flowing down a slide where the object of interest is some random occurrence of plastic contamination of unknown origin, shape, and color. The caption text outputs from the GIT and BLIP models were analyzed from a small test dataset to form a statistical basis for "bag-of-words" [18,19,20,21,22,23] keyword classification. Given that neither of the foundational ViTC models was ever trained on images remotely like those of our study, the caption outputs exhibited a remarkable alignment with our classes. Used individually, classification rates were below 80–90%, but when the caption outputs from the two ViTC models were combined, the accuracy improved to over 95%. While these ViTC models are slow (8–10 s per image), they are still much faster than manually annotating tens of thousands of images, which is very expensive; moreover, manual classification tends to misclassify edge-case images where the class choice is unclear, except perhaps to an expert with years of experience in that particular field. Automating classification also addresses a well-known problem in the field: disagreement between human graders, who inevitably deviate in their class choices on less obvious images (of which there can be a significant number), is a major hurdle in generating massive image datasets suitable for training large models. This method aims to reduce training time and improve classification accuracy by leveraging the frequency distribution of words across different classes to determine their predictive power in transforming GP-ViTC output captions into predicted image classifications. The machine learning (non-DL) statistical keyword semantic classifier is detailed in
Figure 2, where the image is fed into the GP-ViTC image-to-caption models, whose output captions are passed to a bag-of-words [18,19,20,21,22,23] keyword classifier that is then referenced against an a priori frequency histogram to determine the most likely class associated with the keywords of a particular image. The frequency histograms are developed on a small micro-dataset of a few hundred images, which is orders of magnitude smaller than the number of images that would have been required to train a DL model. While this approach is slow and not suitable for real time, as classification rates are 8–10 s per image (at the time of this writing), the benefit is that the JOINT model can be trained on a few hundred expertly annotated images to build up the frequency histogram (see Figure 3). The use of a frequency histogram for classification of textual data is a well-known and established method [22,23].
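A minimal sketch of how such class-specific keyword frequency histograms can be built is given below. This is an illustrative reimplementation of the idea rather than the supplementary 'histogram_semantic_classifier_r4.py' code; the stop-word list, minimum keyword length, and training captions are assumptions.

```python
# Illustrative sketch of the training phase of the bag-of-words semantic
# classifier: build one keyword-frequency histogram per class from a small
# set of expert-labeled captions. Not the authors' supplementary code.
import re
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is", "are"}  # assumed list
MIN_LEN = 3                      # assumed minimum keyword length

def extract_keywords(caption: str) -> list[str]:
    """Normalize a caption to lowercase keywords (no punctuation/stop/short words)."""
    words = re.findall(r"[a-z]+", caption.lower())
    return [w for w in words if len(w) >= MIN_LEN and w not in STOP_WORDS]

def train_histograms(labeled_captions: list[tuple[str, str]]) -> dict[str, Counter]:
    """labeled_captions: (caption, true_class) pairs -> per-class keyword Counter."""
    histograms: dict[str, Counter] = defaultdict(Counter)
    for caption, true_class in labeled_captions:
        histograms[true_class].update(extract_keywords(caption))
    return dict(histograms)

# Tiny hypothetical training examples (the real training used a few hundred captions):
train = [
    ("a pile of white cotton on a conveyor", "COTTON"),
    ("a piece of yellow plastic wrap on cotton", "PLASTIC"),
    ("an empty metal tray with bright reflections", "TRAY"),
    ("a person's hand and arm reaching over the tray", "HID"),
]
histograms = train_histograms(train)
print(histograms["PLASTIC"].most_common(3))
```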
Once the frequency histogram has been created, the JOINT model can be fed 50,000+ images and left to run unattended, automatically establishing the correct class of each unknown image and thereby creating and annotating a massive image dataset suitable for training DL models, with minimal input from human image annotators, who are expensive and for whom building such datasets takes significant time.
In practice, we found that in numerous cases more than one class was indicated by a single caption, especially when captions were combined from the two GP-ViTC models. In the case of multiple predicted classes, we found it useful to set up a fail-gracefully bias to ensure the least damaging choice was selected. As we had two primary uses, the bias differed depending on the end target use. For our primary use case, this led to a bias, or fail-safe, of TRAY > HID > PLASTIC > COTTON, with ties resolved in favor of the highest priority (so TRAY over-rides HID, etc.). In this example, if the JOINT classifier predicts either a TRAY or HID image, the image is rejected before it can be considered as either a PLASTIC or COTTON image (the two primary classes of interest for our use case).
JOINT Classifier Algorithm Description
This algorithm submits the image to the GIT and BLIP GP-ViTC models, which return captions that are combined and then sent to the statistical keyword bag-of-words semantic classifier (Figure 4 and Figure 5), which classifies the input text captions into predefined categories (TRAY, HID, PLASTIC, COTTON) by learning characteristic keyword frequencies for each category from a labeled training dataset.
Training phase: The model is trained using captions labeled with their true class. Each training caption undergoes a keyword extraction process (normalization to lowercase; removal of punctuation, stop words, and short words) to yield a list of meaningful keywords. Class-specific frequency histograms are then constructed by aggregating keyword counts from all training examples belonging to each category. These histograms represent the statistical distribution of keywords associated with each class;
Prediction phase: To classify a new input caption, it is first processed using the identical keyword extraction method. The algorithm then calculates a log-likelihood score for each potential class, estimating the probability of observing the caption’s keywords given the class’s learned frequency histogram (using a naïve Bayes-like approach with Laplace smoothing for robustness against unseen words);
Ambiguity resolution and biased prediction: If multiple classes yield scores close to the maximum score (within a predefined ambiguity margin), indicating uncertainty, a refined bias mechanism is employed. First, the set of high-scoring candidate classes is filtered to include only those whose training histogram contains at least one keyword present in the input caption. This ensures relevance. If ambiguity still exists among these filtered candidates, a predefined priority order (TRAY > HID > PLASTIC > COTTON) is applied only to this relevant subset, and the highest-priority class is selected as the final prediction. If only one filtered candidate emerges, it is chosen directly;
Output: The algorithm outputs the single, final predicted class for the input caption.
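The sketch below illustrates this prediction phase, reusing the extract_keywords() function and histograms from the training sketch above. The ambiguity margin and Laplace smoothing constant are assumed values; the code is an illustrative reimplementation, not the supplementary classifier.

```python
# Illustrative sketch of the prediction phase: naive-Bayes-like scoring with
# Laplace smoothing, plus the fail-safe priority TRAY > HID > PLASTIC > COTTON
# for ambiguous cases. Assumes extract_keywords() and histograms from the
# training sketch above; the margin and smoothing values are assumptions.
import math
from collections import Counter

PRIORITY = ["TRAY", "HID", "PLASTIC", "COTTON"]   # fail-safe order, highest first
AMBIGUITY_MARGIN = 1.0                            # assumed log-likelihood margin
ALPHA = 1.0                                       # Laplace smoothing constant

def predict(caption: str, histograms: dict[str, Counter]) -> str:
    keywords = extract_keywords(caption)
    vocab = {w for h in histograms.values() for w in h}
    scores = {}
    for cls, hist in histograms.items():
        total = sum(hist.values())
        # log P(keywords | class) with Laplace smoothing for unseen words
        scores[cls] = sum(
            math.log((hist.get(w, 0) + ALPHA) / (total + ALPHA * len(vocab)))
            for w in keywords
        )

    best = max(scores.values())
    candidates = [c for c, s in scores.items() if best - s <= AMBIGUITY_MARGIN]
    if len(candidates) > 1:
        # Keep only candidates whose histogram shares a keyword with the caption
        relevant = [c for c in candidates
                    if any(w in histograms[c] for w in keywords)] or candidates
        # Resolve remaining ambiguity with the fail-safe priority order
        return min(relevant, key=PRIORITY.index)
    return candidates[0]

# Example: print(predict("a piece of torn yellow plastic on the cotton", histograms))
```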
This method offers a computationally efficient approach to leveraging textual information for image classification tasks. By focusing on highly predictive keywords, it reduces the dimensionality of the feature space while maintaining discriminative power. This approach has several key advantages:
Efficiency: This method reduces the complexity and time required for training by focusing on keywords;
Interpretability: The use of confidence scores makes the model’s predictions more interpretable;
Customization: Thresholds and keyword selections can easily be adjusted to balance precision and recall.
The JOINT classifier leverages the caption outputs from two GP-ViTC models, which are combined and then submitted as a bag-of-words [19,20,21,22] input to build the keyword semantic classifier. This approach makes it possible to automate the classification of images via the caption outputs from the latest foundational image-to-caption models, which have been trained on millions of images, and to do so at minimal cost, as all of the training utilizes machine learning techniques that can be performed with minimal training sets. As such, this enables use of the power of deep learning without the high cost of having to train the models on custom use cases. In essence, we have transitioned from writing code in formal, abstract programming languages to specifying behavior in plain English. This keyword frequency histogram semantic classifier approach proved sufficiently predictive for the target application. In our example implementation, we utilized the histogram approach, modified with a bias toward fail-safe class predictions, which was found to be highly accurate for our intended use, even though most of the classes had little to do with the images that the GP-ViTC models were trained on. For the interested reader, the example code is provided in the Supplementary Materials as the file titled 'histogram_semantic_classifier_r4.py', and flow charts of the process are provided in the next section in Figure 4 and Figure 5.
2.2. Source Code
The main source code was written in Python to take advantage of the rich support provided by numerous AI libraries. It is included in the Supplementary Materials under the file titled 'final_imageObject_identifier.py'. This custom Python source code was used to classify images using a combination of outputs from the foundational vision-to-caption models (GIT, BLIP), followed by a histogram-developed semantic classifier applied to the captions provided by the GIT and BLIP models. The custom code utilizes the following approach (also see Figure 4 and Figure 5):
System prerequisites:
- Nvidia graphical processing unit, "GPU" (should be >= RTX 3xxx, i.e., Ampere architecture or newer);
- Linux operating system (OS);
- Hardware platform: code should run on x86-64 (AMD64) or ARMv8 (AArch64) central processing units (CPUs);
  - Tested on an AArch64 Nvidia Orin NX.
System setup:
- Install the Nvidia driver;
- Install CUDA 12.x and cuDNN v8.x.x (be sure to match the cuDNN build to the CUDA version);
- Create a Python virtual environment, "venv";
  - In the "venv", install the requirements, i.e., the Python import libraries called in the source code:
    - Transformers, PyTorch, PIL, Pandas, Tkinter, os, sys, glob, gpustat.
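Before running the classifier, a quick environment sanity check along the following lines can help confirm that the GPU stack and the key libraries are in place (an illustrative sketch, not part of the supplementary code):

```python
# Illustrative environment sanity check (not part of the supplementary code):
# confirms the GPU/CUDA stack and the key libraries are importable.
import sys

import torch
import transformers
import PIL
import pandas

print("Python       :", sys.version.split()[0])
print("PyTorch      :", torch.__version__)
print("Transformers :", transformers.__version__)
print("Pillow (PIL) :", PIL.__version__)
print("Pandas       :", pandas.__version__)

if torch.cuda.is_available():
    print("CUDA device  :", torch.cuda.get_device_name(0))
    print("CUDA version :", torch.version.cuda)
else:
    # The GP-ViTC models will still run on CPU, but caption generation will be
    # considerably slower than the 8-10 s per image reported on GPU.
    print("WARNING: no CUDA-capable GPU detected; falling back to CPU.")
```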
Process steps: JOINT classifier: main application workflow:
Figure 4.
Overview: Bag-of-words statistical semantic classifier with frequency histogram class selection.
Process steps: bag-of-words semantic classifier function:
Figure 5.
Process steps: Bag-of-words statistical semantic classifier using fail-safe overlay with frequency histogram class selection.
4. Summary
The research found that, by combining two general-purpose image-to-caption (GP-ViTC) models, each based on a different version of the transformer architecture, it was possible, with minimal training of a semantic classifier that converts the captions provided by the GP-ViTC models, to develop a robust image classifier suitable for automatically annotating large image datasets. The speed of the combined GP-ViTC (JOINT) model was slow, at roughly 10 s per image. However, because it provides the ability to simply turn on a computer and run this model against tens of thousands of images without needing an expert to hand-annotate each image, it represents tremendous time savings. Our research leveraged this JOINT model to annotate an extensive image dataset that was then utilized to develop several high-speed ViT models for prediction of VISN-Web display "plastic" images as well as safe cotton images suitable for use in a fully automated auto-calibration system (CALIBRATOR). Further details on this work will be forthcoming in a subsequent paper. In summary, this approach of using GP image-to-caption models to create an off-line auto-image annotator was very successful, saved significant research time and expense, and is a valuable tool for our research program.