1. Introduction
Disposable bamboo chopsticks, as a representative of traditional tableware, have gained widespread usage worldwide due to their natural, environmentally friendly, and renewable characteristics [1,2]. However, with the continuous expansion of production scale and the increasing market demand, quality issues of bamboo chopsticks have gradually emerged as a major bottleneck restricting industry development. Currently, defect inspection of bamboo chopsticks largely relies on manual visual examination, which suffers from low efficiency, high subjectivity, and a considerable risk of oversight. With the rapid advancement of deep learning technologies, automated defect detection methods based on computer vision have increasingly become a focal point of research [3,4,5].
In industrial quality inspection, automated defect identification technologies typically encompass three categories of tasks: defect classification, defect object detection, and defect segmentation. This study focuses on defect classification, particularly multi-label defect classification. With the advancement of deep learning, convolutional neural network (CNN)-based methods have achieved remarkable breakthroughs in complex visual tasks. For instance, ResNet demonstrated outstanding feature extraction capability on ImageNet [6]; the YOLO series models, benefiting from a complete ecosystem, attained excellent performance and favorable configurability in defect detection scenarios [7,8,9]. In parallel, several recent works have explored large-scale appearance inspection and fine-grained visual understanding in industrial and agricultural domains. Fan et al. [10] introduced a large-scale grain appearance dataset and benchmark, highlighting the importance of fine-grained recognition and distribution-aware modeling in real-world inspection tasks. Meanwhile, the rise of the Vision Transformer (ViT) [11] has driven the further application of self-attention-based models in industrial vision scenarios.
Although ViT-based models have achieved promising results in defect detection, they still face significant challenges when processing images with extreme aspect ratios. Taking bamboo chopstick images, whose aspect ratios often exceed 10:1, as an example, naively resizing them to the default ViT input size of 224 × 224 leads to severe distortion of local details, making it difficult to capture subtle defects such as black spots, mold stains, and cracks. The conventional patch embedding mechanism of ViT also struggles to provide effective positional encoding for such data.
Recent related studies have attempted to fuse CNNs with ViT; however, most approaches merely incorporate CNNs as front-end feature extractors, lacking structural optimization tailored to specific industrial scenarios. For special data such as bamboo chopstick images, which exhibit high aspect ratios and fine-grained defects, these generic fusion strategies fail to provide sufficient geometric representation capacity, thereby limiting the model’s detection performance.
There are also notable differences between multi-class and multi-label classification in data annotation. Multi-class classification addresses problems where a sample must be assigned to exactly one of several mutually exclusive classes, implying that each sample receives only a single label [12,13]. In multi-class tasks, the SoftMax activation is typically used to produce a probability distribution over classes, ensuring that each sample is associated with a single label. In contrast, multi-label classification permits a sample to be associated with multiple non-exclusive labels, making it more suitable for describing samples with multiple attributes. In multi-label tasks, the output layer usually adopts the Sigmoid activation, enabling independent probability outputs for each label and allowing a sample to possess multiple labels simultaneously [14,15].
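As a brief, self-contained illustration of this distinction (the logit values below are arbitrary), the following minimal PyTorch snippet shows how SoftMax enforces a single-label distribution while Sigmoid scores each label independently:

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one sample, three candidate labels

# Multi-class: SoftMax produces a single distribution summing to 1,
# so exactly one class is implied per sample.
print(torch.softmax(logits, dim=1))   # approx. [[0.786, 0.175, 0.039]]

# Multi-label: Sigmoid scores each label independently,
# so several labels can exceed the 0.5 decision threshold at once.
print(torch.sigmoid(logits))          # approx. [[0.881, 0.622, 0.269]]
```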
Consequently, in the bamboo chopstick multi-label defect classification task, severe class imbalance is common; high-frequency “easy” examples tend to dominate gradient updates, making it difficult for the model to focus on more challenging hard examples. Among mainstream approaches, binary cross-entropy (BCE) is the most widely used loss; however, it suffers from the problem of easy examples dominating the gradients [16].
To address these issues, inspired by the integration of CNN and ViT, this paper proposes a Convolutional Feature Embedding (CFE) module tailored to the structural characteristics of bamboo chopstick images, and incorporates it into the ViT framework to construct an improved Vision Transformer, C-ViT. Unlike previous approaches that simply combine CNN and ViT, we carefully design the parameters of the CFE module specifically for the bamboo chopstick dataset, rather than naively replacing positional encoding with conventional convolutions. Leveraging the local receptive field and translation invariance inherent in CNNs, the module encodes features of images with extreme aspect ratios, thereby avoiding the geometric distortion caused by traditional patch embedding and, in particular, preserving the fine texture information of subtle defects [17,18].
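For concreteness, the minimal PyTorch sketch below shows how such a convolution-based embedding could map a 512 × 96 chopstick image to a token sequence for the ViT encoder without square resizing. Only the use of 3 × 3 kernels, MaxPool, and the ViT-Small embedding dimension follows the settings reported later in the ablation study; the layer counts, channel widths, and strides are illustrative assumptions rather than the exact CFE configuration.

```python
import torch
import torch.nn as nn

class ConvFeatureEmbedding(nn.Module):
    """Illustrative convolutional feature embedding for elongated inputs.

    The layer counts, channel widths, and strides are assumptions for
    illustration; 3x3 kernels, MaxPool, and the ViT-Small token dimension
    (384) follow the settings reported in the ablation experiments.
    """

    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),   # keeps locally salient defect cues
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        # x: (B, 3, 96, 512) -- the native elongated resolution, no square resize
        feat = self.stem(x)                          # (B, 384, H', W')
        tokens = feat.flatten(2).transpose(1, 2)     # (B, H'*W', 384) token sequence for the ViT encoder
        return tokens

# Example: a 512 x 96 chopstick image becomes a token sequence without distortion.
tokens = ConvFeatureEmbedding()(torch.randn(2, 3, 96, 512))
print(tokens.shape)  # torch.Size([2, 192, 384]) with the strides assumed above
```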
Meanwhile, to mitigate training bias induced by the dominance of “easy” examples in multi-label classification, this paper proposes a Hard Examples Contrastive Learning (HCL) loss function. HCL dynamically identifies hard examples whose prediction confidence is close to the decision threshold, and constructs a contrastive learning objective based on their label and feature similarity. This mechanism forces the model to enhance its discriminative capability for the subtle defect features present in hard examples.
To validate the effectiveness of the proposed method, a bamboo chopstick defect dataset (BCDD) containing 4000 samples was constructed, covering five defect types: mildew, bending, black spots, cracks, and slenderness. Experimental results demonstrate that the proposed C-ViT model combined with the HCL achieves a mean Average Precision (mAP) of 94.3% on the test set, significantly outperforming existing benchmark models. Furthermore, the generalization capability of the HCL was also verified on public datasets.
The main contributions of this paper are summarized as follows:
- (1)
This paper proposes an improved Vision Transformer model, C-ViT, which introduces the CFE module for bamboo chopstick defect classification and can better handle image samples with extreme aspect ratios (such as bamboo chopsticks).
- (2)
A new loss function, Hard Examples Contrastive Loss (HCL), is proposed, which dynamically selects hard examples to enhance the model’s discriminative ability. Extensive experiments are conducted on self-built and public datasets; for example, on the public dataset VOC2012, the mAP reaches 92.8%.
- (3)
We construct a bamboo chopstick defect multi-label classification dataset (BCDD), which contains 4000 defective samples, with no less than 500 samples in each defect category.
4. Experimental Results and Analysis
4.1. Experimental Environment and Parameter Settings
The operating system used in the experiments is Windows 10, with an AMD Ryzen 9 7950X processor, an NVIDIA GeForce RTX 4090 graphics card, and 128 GB of memory. The deep learning framework is PyTorch 2.0.1, and the CUDA version is 12.0. For model training, the settings are as follows: the input image resolution is adjusted to 512 × 96 pixels, the batch size is set to 64, the optimizer is Adam with an initial learning rate of 1 × 10⁻⁴, the similarity threshold is 0.5, the contrastive loss weight is 0.5, and the total training duration is 300 epochs. The specific experimental running environment is detailed in Table 1.
4.2. Evaluation Indicators
In order to effectively evaluate the improvement brought by the model, this experiment uses three indicators: the mean average precision (mAP) over all categories, the overall F1 score (OF1), and the average per-category F1 score (CF1). The specific calculation formulas are as follows:
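These indicators follow their standard definitions; with $\mathrm{AP}_i$ denoting the average precision of the $i$-th defect category, they can be written as

$$\mathrm{Precision}=\frac{TP}{TP+FP},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},$$

$$\mathrm{AP}=\int_{0}^{1}\mathrm{Precision}(\mathrm{Recall})\,d\,\mathrm{Recall},\qquad \mathrm{mAP}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_{i},$$

$$\mathrm{F1}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},$$

where OF1 is the F1 score computed from Precision and Recall pooled over all labels, and CF1 is the mean of the per-category F1 scores.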
Among them, TP represents the number of defective samples that are correctly detected; FP represents the number of non-defective samples misjudged as defective; FN represents the number of defective samples that are not detected; N represents the total number of defect categories in the detection task; Precision and Recall represent the overall detection precision and recall, respectively; and AP represents the area under the Precision–Recall curve, which is used to comprehensively evaluate detection performance.
4.3. Experimental Results on BCDD
To verify the effectiveness of the improved C-ViT relative to other methods on the BCDD, this section presents comparative experiments with mainstream multi-label classification algorithms such as ViT-Small, ResNet-50, ASL (ResNet-50), and Q2L under the same hardware and training parameters, using mAP, OF1, and CF1 as evaluation metrics.
The experimental results are shown in Table 2 below. ViT-Small performed poorly on all three metrics: mAP at 91.6%, OF1 at 85.8%, and CF1 at only 81.7%, indicating that its ability to identify the more difficult category labels still has room for improvement. The unmodified C-ViT achieved a mAP of 92.8%, slightly better than ResNet-50's 92.0%; its OF1 and CF1, at 87.2% and 82.8%, respectively, were comparable to ResNet-50's 86.9% and 82.5%, indicating that the visual Transformer structure improves semantic modeling. ASL achieved a mAP of 93.6%, slightly lower than Q2L, but its performance in OF1 (89.5%) and CF1 (87.7%) was superior, demonstrating that its adaptive loss mechanism effectively alleviates the label imbalance problem. Q2L achieved a 94.1% mAP, while its OF1 and CF1 reached 90.1% and 88.0%, respectively, indicating significant advantages in overall detection capability and multi-label learning strategies.
Simultaneously, we conducted comparative experiments with recent hybrid models. ConViT integrates inductive biases into the Vision Transformer through the ingenious design of gated positional self-attention units, achieving a mAP of 92.4%, with OF1 and CF1 scores of 87.0% and 82.4%, respectively. FasterViT achieves a mAP of 92.6% by introducing a hierarchical attention mechanism combined with efficient local feature extraction, while its OF1 and CF1 reach 87.1% and 82.6%, respectively.
However, as shown in Figure 5, when using HCL, the performance of C-ViT (HCL) is further improved to a mAP of 94.3%, which is 2.7% higher than ViT-Small's 91.6% and 2.3% higher than ResNet-50. OF1 and CF1 are also improved, to 90.2% and 88.5%, respectively, verifying the effectiveness of this method in enhancing the multi-label classification ability of the model.
4.4. Ablation Experiment
In order to evaluate the impact of the proposed CFE module and HCL function on model performance, this section conducts an ablation experiment based on the ViT-Small model.
First, the effectiveness of the CFE module is assessed by examining whether the CFE module is used and how the input image size affects the model. The specific experimental results are shown in Table 3. We evaluated the impact of image size on the ViT model with two different input sizes. When the input image size is 224 × 224, the mAP is only 91.6%, whereas when the patch size in the patch embedding is modified to fit the bamboo chopstick input image, the mAP decreases to 90.9%. We then introduced the CFE module; with the image size kept at 512 × 96, the mAP reached 92.8%, an increase of 1.2%. After further introducing the HCL function, the mAP reached 94.3%, which clearly demonstrates the effectiveness of the proposed method.
In the CFE module, we employ a small number of convolutional layers for lightweight feature extraction to avoid structural redundancy and control computational overhead. The feature embedding dimension is kept consistent with the baseline model, ViT-Small, to ensure fair comparisons. To analyze the impact of key hyperparameters on performance, we conduct ablation experiments with different kernel sizes (3 × 3, 5 × 5, and 7 × 7) and pooling strategies (AvgPool and MaxPool), as summarized in Table 4.
The experimental results indicate that, under the same pooling strategy, model performance slightly degrades as the kernel size increases, suggesting that an excessively large receptive field is unfavorable for modeling fine-grained bamboo chopstick defect features. In addition, MaxPool consistently outperforms AvgPool across all configurations, demonstrating its advantage in preserving locally salient features. Considering both performance and model simplicity, the configuration with a 3 × 3 kernel and MaxPool achieves the highest mAP (92.8%) and is therefore adopted as the default setting in this study.
Then, to evaluate the effectiveness of the HCL function, we designed ablation experiments on its three components: the hard-example selection mechanism, label similarity, and feature similarity. All models in these experiments use C-ViT. The specific experimental results are shown in Table 5. When the hard-example selection mechanism is used alone and the selected hard examples are reweighted, the mAP improves by 0.8%. Adding label similarity brings a further improvement of 0.2%. When the complete HCL is used, the mAP reaches 94.3%, an increase of 1.5%, which demonstrates the effectiveness of each component of the HCL. Note that feature similarity is not evaluated alone because, in our loss calculation, it depends on the positive and negative sample pairs obtained from label similarity.
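The exact HCL formulation is not reproduced in this section, but a minimal PyTorch sketch of the three ablated components, under the assumption of an InfoNCE-style contrastive term with an arbitrarily chosen temperature, could look as follows (here `feats` denotes the per-sample feature vectors produced by C-ViT, e.g., its pooled encoder output):

```python
import torch
import torch.nn.functional as F

def hcl_loss(logits, feats, labels, topk_ratio=0.4, sim_threshold=0.5,
             contrast_weight=0.5, temperature=0.1):
    """Sketch of the HCL components ablated above (assumed formulation)."""
    # Base multi-label term: standard BCE over all samples.
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())

    # (1) Hard-example selection: samples whose predicted confidences lie
    #     closest to the 0.5 decision threshold are treated as hard.
    probs = torch.sigmoid(logits)
    hardness = -(probs - 0.5).abs().mean(dim=1)       # larger = closer to the threshold
    k = max(2, int(topk_ratio * logits.size(0)))
    hard_idx = hardness.topk(k).indices

    hard_feats = F.normalize(feats[hard_idx], dim=1)
    hard_labels = F.normalize(labels[hard_idx].float(), dim=1)

    # (2) Label similarity decides positive/negative pairs among hard examples.
    label_sim = hard_labels @ hard_labels.T
    pos_mask = (label_sim >= sim_threshold).float()
    pos_mask.fill_diagonal_(0)

    # (3) Feature similarity drives an InfoNCE-style contrast that pulls
    #     together hard examples sharing labels and pushes apart the rest.
    feat_sim = hard_feats @ hard_feats.T / temperature
    self_mask = 1.0 - torch.eye(k, device=logits.device)
    exp_sim = torch.exp(feat_sim) * self_mask
    log_prob = feat_sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-8)
    contrast = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)

    return bce + contrast_weight * contrast.mean()

# Toy usage: 8 samples, 5 defect labels, 384-dim features.
print(hcl_loss(torch.randn(8, 5), torch.randn(8, 384), torch.randint(0, 2, (8, 5))))
```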
The effectiveness of the HCL function primarily depends on the selection of hard examples and the construction of contrastive objectives based on label and feature similarity. To further explain and validate the proposed mechanism, we conducted hyperparameter ablation experiments on the number of hard examples and the similarity threshold.
As shown in Table 6, when the Topk_ratio was set to 10%, the performance improvement of the model was not significant, indicating that selecting too few hard examples limited the formation of sufficient contrastive relationships within a batch. As the Topk_ratio increased, the model performance gradually improved, showing a slight enhancement at 20% and reaching its peak at 40%. However, when the Topk_ratio continued to increase to 60%, the performance declined. This degradation is likely due to the inclusion of excessive samples, which diluted the definition of “hard examples” and introduced easier or irrelevant samples, thereby weakening the model’s ability to focus on truly difficult instances.
The label similarity threshold plays a crucial role in constructing the contrastive loss, as it determines the sets of positive and negative hard examples associated with each current hard example. As shown in Table 7, when the similarity threshold is set to 0.5, the performance improvement is marginal, likely because the threshold is too low to provide sufficient discriminability between positive and negative samples. As the threshold increases to 0.6 and 0.7, a noticeable performance gain is observed, indicating that the model benefits from a clearer separation between positive and negative pairs. However, when the threshold is further increased to 0.8, the performance drops. This degradation may result from the overly strict matching conditions induced by a high threshold, which exclude potential positive pairs and consequently undermine the adequacy of contrastive learning and the model’s generalization capability.
4.5. Experimental Results on Public Datasets
Since our CFE module is designed for the special size of bamboo chopstick images, our comparative experiments on public datasets only compare the HCL function. The dataset used is PASCAL VOC 2012.
As shown in Table 8, the proposed HCL function achieves the best performance on the VOC dataset. Specifically, HCL outperforms the traditional BCE loss on all major indicators. Compared with the Focal Loss, it improves mAP by 1.2%, OF1 by 2.19%, and CF1 by 2.77%. Compared with ASL, it improves mAP by 0.6%, OF1 by 1.84%, and CF1 by 2.47%. These results show that the proposed HCL function has clear performance advantages over the loss functions commonly used in multi-label classification tasks. They also suggest that HCL more effectively focuses on ambiguous or boundary-region samples, which are typically underrepresented or insufficiently weighted by existing loss functions. By integrating hard-example selection and contrastive learning guided by feature and label similarity, HCL strengthens the model’s ability to distinguish subtle differences among confusing defect patterns.
As shown in Figure 6, the visualization provides an intuitive comparison of the performance of different loss functions on the VOC dataset, clearly demonstrating that HCL achieves the best results across all key metrics.
5. Discussion
5.1. Addressing Labor Costs and Work Efficiency Issues in Traditional Methods
As the production scale of disposable bamboo chopsticks continues to expand, the traditional quality inspection process has gradually become a major bottleneck restricting production capacity and quality stability. Although manual visual inspection has long been the primary method, its limited efficiency, poor consistency, and high labor dependency have become increasingly prominent. For example, a bamboo chopstick production line can produce millions of chopsticks per day; even for one million chopsticks, at an inspection efficiency of 2000 chopsticks per person per hour, completing the daily inspection task would require 500 person-hours, i.e., 20 people working continuously for 25 h. This not only results in high labor costs and management pressure, but also causes inspectors to lose focus during long, repetitive tasks, which increases the rate of missed detection of subtle defects, leading to quality fluctuations and potential risks.
The introduction of the C-ViT model has significantly optimized the bamboo chopstick defect detection process. By integrating the model into the production line’s automated optical inspection system, it enables real-time, efficient inspection of bamboo chopsticks and greatly reduces the need for manual inspection. The system can process hundreds of bamboo chopsticks per second, far exceeding manual inspection speeds without interrupting the production line. This effectively alleviates the workload of manual quality inspection, freeing personnel to focus on more critical tasks such as data analysis and quality control, thereby improving the overall level of intelligent production.
From the perspective of reducing labor costs, the deployment of this model effectively replaces traditional, high-intensity manual inspection methods. Manual inspection requires a long time to identify multiple complex defects on bamboo chopsticks under adverse conditions such as noise and strong light. This not only limits efficiency but is also susceptible to subjective factors such as emotion and fatigue. In contrast, the automated inspection system based on the C-ViT model can be linked with high-speed industrial cameras to capture high-definition images and complete defect classification in milliseconds, achieving inspection efficiency far exceeding that of traditional manual teams. Enterprises can significantly reduce the number of quality inspectors, avoid human errors, and achieve round-the-clock, standardized automated quality inspection.
5.2. Novel Contributions to Model Enhancement
Regarding model enhancement, previous ViT models and their variants use patch embedding to positionally encode the input image and obtain its positional information, but most of them are designed for images with moderate aspect ratios, unlike the bamboo chopstick data studied in this paper, which has an extreme aspect ratio. Forcibly resizing the bamboo chopstick data to square image data causes image distortion and the loss of small defect features. Inspired by the characteristics of convolution, we use a series of convolutions to extract and transform the input image and finally obtain the input required by the ViT encoding layers. In the future, the convolutions used in place of patch embedding could be further optimized into dynamic convolution or deformable convolution to adapt to the various complex defects and scale changes in bamboo chopsticks.
Regarding the loss function, the traditional loss function based on binary cross-entropy (BCE) is very common in defect detection studies, but it treats all samples and labels equally, so high-frequency, easy-to-distinguish samples dominate the gradient updates while the contribution of hard examples (such as small targets and occluded samples) is easily ignored. To address this issue, this paper proposes the HCL function, which dynamically selects hard examples whose prediction confidence is close to the decision threshold and constructs a contrastive learning objective based on the labels and feature similarity of those hard examples, thereby enhancing the model’s ability to extract features from hard-example defects. Furthermore, it incorporates a dynamic focusing mechanism to improve the gradient distribution, improving performance in challenging scenarios. Our experimental results demonstrate that the HCL function outperforms traditional BCE-based loss functions. Currently, the contrastive learning component of HCL is primarily based on confidence screening and feature similarity, but in multi-label or multi-scale scenarios, inter-category relationships can be more complex. Future work could consider constructing multi-level, multi-granularity positive and negative contrastive pairs, combined with label semantic topological relationships or graph attention networks, to ensure that the model maintains feature discriminability in the face of multi-category interference.
6. Conclusions
In this paper, we propose an improved ViT architecture. On the basis of ViT, we integrate the proposed CFE module, which is specially designed for bamboo chopstick samples with elongated shapes and uses the local perception ability of convolutional neural networks to efficiently extract subtle defect features on the surface of bamboo chopsticks while avoiding the influence of image deformation on feature representation. We also propose the HCL function, which avoids the need for explicit modeling of complex label dependencies and enhances the model’s ability to learn the defect features of hard examples through the dynamic screening of hard examples.
To verify the validity of the proposed method, we conducted extensive ablation studies and comparative experiments on a newly established private dataset, BCDD, and the public benchmark dataset VOC2012. The results show that each module contributes positively to classification performance, and their combination achieves significant improvements. Our final model achieved 94.3% mAP on BCDD, outperforming the original ViT and other classical baselines. In addition, experiments on the VOC2012 dataset further validate the generalization ability of the proposed method.
Overall, our work provides a robust and scalable identification and classification solution for the quality inspection of bamboo chopsticks. In future work, we plan to explore and introduce lightweight deployment technologies to achieve real-time detection and classification of bamboo chopsticks in field applications.