This section introduces the proposed enhanced YOLOv8-based algorithm. The improvements center on augmenting global feature extraction with the Mobile Vision Transformer v2 (MobileViTv2), replacing expensive standard convolutions in the neck with the lightweight GSConv module to reduce computational overhead, and employing a Slim-neck structure that constrains parameter growth. Finally, we introduce a Triplet Attention mechanism to enhance cross-dimensional feature interaction.
4.1. Mobile Vision Transformer v2 Integration
The original YOLOv8 backbone consists of a hierarchical arrangement of convolutional layers, C2f modules, and a Spatial Pyramid Pooling-Fast (SPPF) block. However, the convolution-centric C2f modules exhibit an inherent inductive bias toward local receptive fields, making them suboptimal for capturing long-range dependencies in highly irregular defect patterns [21,22]. Formally, given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, a convolutional layer with kernel size $k \times k$ produces
$$Y = W \ast X + b,$$
where $W \in \mathbb{R}^{C' \times C \times k \times k}$ and $b \in \mathbb{R}^{C'}$ are learnable weights. The effective receptive field grows only sublinearly with depth, leading to information bottlenecks when modeling global gasket textures.
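To make the receptive-field limitation concrete, the following minimal Python sketch (illustrative, not part of the original pipeline) computes the theoretical receptive field of a stack of stride-1 convolutions; even the theoretical field widens only by $k-1$ pixels per layer, and the effective field observed in practice is smaller still.

```python
# Minimal sketch (not from the paper): theoretical receptive field (RF) of a
# stack of k x k, stride-1 convolutions. RF grows by (k - 1) per layer, so
# covering a large gasket image with 3 x 3 convolutions alone needs very deep
# stacks; the *effective* RF is smaller still.
def receptive_field(depth: int, k: int = 3, stride: int = 1) -> int:
    rf, jump = 1, 1
    for _ in range(depth):
        rf += (k - 1) * jump   # each layer widens the RF by (k-1)*jump pixels
        jump *= stride         # stride-1 layers never increase the jump
    return rf

if __name__ == "__main__":
    for d in (2, 8, 32):
        print(f"depth={d:3d}  ->  RF={receptive_field(d)} px")
    # depth=2 -> 5 px, depth=8 -> 17 px, depth=32 -> 65 px
```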
To overcome this, we integrate MobileViTv2 blocks into the YOLOv8 backbone. Each MobileViTv2 block consists of two components: (1) an MV2 inverted residual block, and (2) a lightweight Transformer encoder that performs patch-wise global reasoning.
- (1) MV2 block. Let $X \in \mathbb{R}^{C \times H \times W}$ denote the input. The MV2 block first applies a pointwise expansion layer:
$$X_e = \sigma\!\left(\mathrm{BN}\!\left(W_e^{1\times1} \ast X\right)\right) \in \mathbb{R}^{tC \times H \times W},$$
where $t$ is the expansion ratio. This is followed by a $3 \times 3$ depthwise convolution:
$$X_d = \sigma\!\left(\mathrm{BN}\!\left(\mathrm{DWConv}_{3\times3}(X_e)\right)\right),$$
and finally by a pointwise compression step:
$$X_m = \mathrm{BN}\!\left(W_c^{1\times1} \ast X_d\right) \in \mathbb{R}^{C' \times H \times W}.$$
- (2) MobileViTv2 block. The MV2 output $X_m$ is partitioned into non-overlapping $h \times w$ patches:
$$X_p \in \mathbb{R}^{P \times N \times C'}, \qquad P = hw, \qquad N = \tfrac{HW}{hw},$$
which are projected into a latent space and processed by a multi-head self-attention mechanism (MHSA):
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$
where $Q$, $K$, $V$ are linear projections of $X_p$. The attention complexity is $\mathcal{O}(N^2 d)$, but MobileViTv2 reduces it via separable local–global mixing, yielding a near-linear computational footprint (a PyTorch sketch of both components appears at the end of this subsection).
Thus, the hybridized backbone enables fine-grained local feature extraction through CNNs while simultaneously capturing global gasket surface dependencies via Transformer-style self-attention.
Figure 2 illustrates the improvement.
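To ground the two components above, the following PyTorch sketch implements an MV2 inverted residual and a simplified MobileViT-style global mixing step. The expansion ratio, patch size, and the use of torch.nn.MultiheadAttention in place of MobileViTv2's separable self-attention are illustrative assumptions, not the exact modules of our network.

```python
import torch
import torch.nn as nn

class MV2Block(nn.Module):
    """Inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 compress."""
    def __init__(self, c_in: int, c_out: int, t: int = 4):
        super().__init__()
        hidden = t * c_in
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expansion
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),              # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # compression
            nn.BatchNorm2d(c_out),
        )
        self.use_res = c_in == c_out

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

class GlobalMixer(nn.Module):
    """Simplified MobileViT-style step: unfold to patch tokens, run MHSA, fold
    back. (MobileViTv2 itself uses linear-cost separable attention instead.)"""
    def __init__(self, c: int, heads: int = 4, patch: int = 2):
        super().__init__()
        self.patch = patch
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.norm = nn.LayerNorm(c)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        # (B, C, H, W) -> (B*p*p, H/p * W/p, C): tokens at the same intra-patch
        # offset attend to each other across all patches (global reasoning).
        t = (x.reshape(b, c, h // p, p, w // p, p)
               .permute(0, 3, 5, 2, 4, 1)
               .reshape(b * p * p, (h // p) * (w // p), c))
        t = self.norm(t)
        t = t + self.attn(t, t, t, need_weights=False)[0]
        return (t.reshape(b, p, p, h // p, w // p, c)
                  .permute(0, 5, 3, 1, 4, 2)
                  .reshape(b, c, h, w))

if __name__ == "__main__":
    x = torch.randn(1, 32, 16, 16)
    y = GlobalMixer(32)(MV2Block(32, 32)(x))
    print(y.shape)  # torch.Size([1, 32, 16, 16])
```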
4.2. Slim-Neck and GSConv Module
The YOLOv8 neck aggregates multi-scale features using stacked C2f modules built upon standard convolution (SC) [23]. For an input tensor $X \in \mathbb{R}^{C_1 \times H \times W}$, SC applies $C_2$ spatial filters of size $k \times k$ over all input channels and produces $Y \in \mathbb{R}^{C_2 \times H \times W}$ at a cost of
$$\mathrm{FLOPs}_{\mathrm{SC}} = H \cdot W \cdot k^2 \cdot C_1 \cdot C_2.$$
SC is expressive but expensive: its cost grows with both the input and output channel counts, which becomes a bottleneck as the neck deepens.
(1) Depthwise separable convolution (DSC)
DSC factorizes SC into a depthwise spatial step (one $k \times k$ filter per input channel) [24] followed by a $1 \times 1$ pointwise mixing step:
$$Y = W^{1\times1} \ast \mathrm{DWConv}_{k\times k}(X).$$
Its complexity becomes
$$\mathrm{FLOPs}_{\mathrm{DSC}} = H \cdot W \cdot \left(k^2 C_1 + C_1 C_2\right),$$
which is substantially smaller than SC for common $k$ (e.g., $k = 3$).
Intuition. DSC keeps spatial filtering light while preserving representational power through the $1 \times 1$ channel mix.
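As a quick sanity check on the two cost formulas, the short script below evaluates both for an assumed neck-layer shape (the dimensions are hypothetical, chosen only for illustration):

```python
# Illustrative sanity check of the SC vs. DSC cost formulas above,
# using an assumed neck-layer shape (not taken from the paper).
def sc_flops(h, w, k, c1, c2):
    return h * w * k * k * c1 * c2          # standard convolution

def dsc_flops(h, w, k, c1, c2):
    return h * w * (k * k * c1 + c1 * c2)   # depthwise + 1x1 pointwise

h = w = 40; k = 3; c1 = c2 = 256            # hypothetical layer
print(f"SC : {sc_flops(h, w, k, c1, c2):,}")    # 943,718,400
print(f"DSC: {dsc_flops(h, w, k, c1, c2):,}")   # 108,544,000
print(f"ratio = {sc_flops(h, w, k, c1, c2) / dsc_flops(h, w, k, c1, c2):.1f}x")
```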
(2) GSConv (mixed SC/DSC fusion)
GSConv runs SC and DSC in parallel and then fuses them. Let a fraction $\alpha$ of the output channels come from SC and $1 - \alpha$ from DSC:
$$Y = \mathrm{Shuffle}\!\left(\left[\mathrm{SC}(X) \,\|\, \mathrm{DSC}(X)\right]\right).$$
With $\alpha = \tfrac{1}{2}$, the compute can be approximated as
$$\mathrm{FLOPs}_{\mathrm{GSConv}} \approx \tfrac{1}{2}\,\mathrm{FLOPs}_{\mathrm{SC}} + \tfrac{1}{2}\,\mathrm{FLOPs}_{\mathrm{DSC}},$$
yielding features richer than DSC alone while avoiding the full cost of SC.
Intuition. The SC branch preserves high-level semantics; the DSC branch adds efficient multi-scale responses; the channel-shuffle layer blends both.
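A minimal PyTorch sketch of this parallel SC/DSC fusion follows; the half-and-half channel split ($\alpha = 1/2$) and the channel-shuffle fusion mirror common GSConv implementations, while details such as the activation function are assumptions:

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: an SC branch and a DSC branch each produce half of the
    output channels; a channel shuffle then blends the two feature families."""
    def __init__(self, c1: int, c2: int, k: int = 3):
        super().__init__()
        c_half = c2 // 2
        self.sc = nn.Sequential(                      # dense spatial filtering
            nn.Conv2d(c1, c_half, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dsc = nn.Sequential(                     # depthwise + pointwise
            nn.Conv2d(c1, c1, k, padding=k // 2, groups=c1, bias=False),
            nn.Conv2d(c1, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = torch.cat((self.sc(x), self.dsc(x)), dim=1)   # [SC || DSC]
        b, c, h, w = y.shape
        # channel shuffle: interleave SC and DSC channels so the families mix
        return y.reshape(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    print(GSConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # [1, 128, 40, 40]
```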
GS bottleneck (residual stabilization). Stacking GSConv with a light residual refinement improves optimization:
$$y = x' + F(x'), \qquad x' = \mathrm{GSConv}(x).$$
If the trailing $F(\cdot)$ is a small residual sub-block, the gradient satisfies
$$\frac{\partial y}{\partial x'} = I + \frac{\partial F(x')}{\partial x'},$$
which remains close to the identity for modest Jacobians, stabilizing training and aiding convergence. Residual mixing refines, rather than overwrites, the mixed SC/DSC features.
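Continuing the sketch, a GS bottleneck can be written as a small GSConv stack wrapped in the identity shortcut analyzed above (this reuses the GSConv class from the previous snippet; the depth of two layers is an assumption):

```python
import torch.nn as nn

class GSBottleneck(nn.Module):
    """Sketch: two stacked GSConv layers with an identity shortcut, so the
    gradient flows as I + dF/dx. Assumes the GSConv class defined above."""
    def __init__(self, c: int, k: int = 3):
        super().__init__()
        self.f = nn.Sequential(GSConv(c, c, k), GSConv(c, c, k))

    def forward(self, x):
        return x + self.f(x)   # residual refines, rather than overwrites, x
```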
(3) From GS blocks to a Slim-neck.
In practice, we replace heavy SC stacks in the neck with GSConv/GS bottlenecks and employ a one-time (TGS-style) aggregation for lateral/top–down fusion. This Slim-neck preserves multi-scale expressiveness while reducing redundant computation and memory traffic. Empirically (
Section 5), it lowers neck FLOPs without sacrificing accuracy, improving real-time behavior on edge devices.
Figure 3 illustrates the GSConv module and
Figure 4 shows the GS bottleneck.
(4) TGS Bottleneck.
The TGS bottleneck (
Figure 5) extends the GS bottleneck by employing a
one-time aggregation strategy inspired by cross-stage partial (CSP) connections. Unlike iterative fusion, which repeatedly combines shallow and deep features across stages, one-time aggregation fuses all available feature maps in a single transformation step.
Let $\{F_i \in \mathbb{R}^{C_i \times H_i \times W_i}\}_{i=1}^{n}$ denote the set of feature maps extracted from different backbone or neck stages, where $H_i \times W_i$ are the spatial dimensions and $C_i$ the channel dimension at stage $i$. To perform aggregation, each feature map must first be transformed into a common spatial and channel space. This is achieved through a mapping $\phi_i(\cdot)$, which may involve resizing and convolution for dimensional alignment.
The aligned features are then concatenated along the channel axis and passed through a transformation $g(\cdot)$, typically a $1 \times 1$ convolution followed by batch normalization and a non-linearity. Formally,
$$F_{\mathrm{agg}} = g\!\left(\left[\phi_1(F_1) \,\|\, \phi_2(F_2) \,\|\, \cdots \,\|\, \phi_n(F_n)\right]\right).$$
Here, $F_{\mathrm{agg}}$ represents the aggregated feature map that combines both shallow fine-grained patterns and deep semantic information.
The computational complexity of this operation is dominated by the $1 \times 1$ convolution inside $g(\cdot)$, with cost
$$\mathcal{O}\!\left(H W \left(\textstyle\sum_{i=1}^{n} C_i\right) C_{\mathrm{out}}\right),$$
which grows linearly with the number of input branches $n$. By contrast, iterative multi-stage fusion would require repeated pairwise combinations, leading to $\mathcal{O}(n^2)$ cost. Thus, the one-time aggregation strategy reduces redundancy while retaining representational richness.
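A minimal sketch of this one-time aggregation is given below, with the alignment maps $\phi_i$ realized as $1 \times 1$ convolutions plus bilinear resizing and $g(\cdot)$ as a $1 \times 1$ convolution with batch normalization and SiLU; the choice of the first branch's resolution as the common scale is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneTimeAggregation(nn.Module):
    """Sketch of TGS-style one-time aggregation: align every stage feature to a
    common (channel, spatial) space, concatenate once, then fuse with g(.)."""
    def __init__(self, in_channels: list, c_mid: int, c_out: int):
        super().__init__()
        # phi_i: one 1x1 conv per branch for channel alignment
        self.align = nn.ModuleList(nn.Conv2d(c, c_mid, 1) for c in in_channels)
        # g: a single 1x1 conv + BN + non-linearity over the concatenation
        self.fuse = nn.Sequential(
            nn.Conv2d(c_mid * len(in_channels), c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, feats):
        target = feats[0].shape[-2:]           # resize all maps to branch 0
        aligned = [F.interpolate(phi(f), size=target, mode="bilinear",
                                 align_corners=False)
                   for phi, f in zip(self.align, feats)]
        return self.fuse(torch.cat(aligned, dim=1))   # one fusion step, O(n)

if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in ((64, 80), (128, 40), (256, 20))]
    print(OneTimeAggregation([64, 128, 256], 64, 128)(feats).shape)
    # torch.Size([1, 128, 80, 80])
```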
(5) Slim-neck Integration
The Slim-neck integrates GSConv, GS bottleneck, and TGS bottleneck modules to form an efficient neck structure. In the original YOLOv8 neck, feature fusion relies heavily on standard convolution (SC) [25]. Given an input tensor $X \in \mathbb{R}^{C_1 \times H \times W}$, SC with kernel size $k \times k$ and $C_2$ output channels requires
$$\mathrm{FLOPs}_{\mathrm{SC}} = H \cdot W \cdot k^2 \cdot C_1 \cdot C_2.$$
This quadratic dependency on the channel widths (through the product $C_1 C_2$) quickly increases computational demands as the network deepens.
By contrast, the Slim-neck reduces this burden through GSConv. Its cost is approximated as
$$\mathrm{FLOPs}_{\mathrm{GSConv}} \approx H \cdot W \cdot \left(k^2 C_1 + C_1 C_2\right),$$
where the first term corresponds to depthwise spatial convolution and the second to pointwise channel mixing. This reduces the number of multiply-accumulate operations significantly, especially for large $k$ or deep networks. When combined with the linear one-time aggregation in TGS, the Slim-neck achieves a FLOP reduction of roughly 45% in practice (cf. Section 4.5), while still preserving multi-scale fusion capability.
4.3. Triplet Attention Mechanism
Traditional channel attention methods, such as SE or CBAM, apply global average pooling to the input tensor $X \in \mathbb{R}^{C \times H \times W}$ along the spatial dimensions:
$$z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c,i,j}.$$
This operation reduces $X$ to a $C$-dimensional descriptor, discarding spatial variation and limiting the ability to model cross-dimensional dependencies.
Triplet Attention [26] alleviates this issue by constructing three rotated branches, each focusing on a different pair of dimensions: $(H, W)$, $(C, W)$, and $(C, H)$. Let $\pi_b$ denote the permutation of tensor axes for branch $b$, and $\mathrm{Z\text{-}Pool}(\cdot)$ denote pooling along the complementary axis. For each branch,
$$A_b = \sigma\!\left(\mathrm{Conv}\!\left(\mathrm{Z\text{-}Pool}\!\left(\pi_b(X)\right)\right)\right),$$
where $\sigma$ is the sigmoid activation. The attention weight $A_b$ is broadcast across the pooled dimension, rotated back by $\pi_b^{-1}$, and applied element-wise to the input:
$$Y_b = \pi_b^{-1}\!\left(A_b \odot \pi_b(X)\right).$$
The final output is obtained by averaging across the three branches:
$$Y = \tfrac{1}{3}\left(Y_1 + Y_2 + Y_3\right).$$
This formulation ensures that attention is not limited to the channel axis but jointly encodes dependencies across all dimension pairs. Intuitively, the HW branch emphasizes spatial saliency, while the CW and CH branches emphasize cross-channel and mixed interactions, respectively. The result is a richer attention mechanism that balances accuracy and computational efficiency, as shown in Figure 6.
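A compact PyTorch rendering of the three branches is sketched below, with the Z-pool realized as concatenated max- and mean-pooling followed by a $7 \times 7$ convolution, following the original Triplet Attention design; the batch-norm placement is an assumption:

```python
import torch
import torch.nn as nn

class ZPoolGate(nn.Module):
    """One Triplet Attention branch: Z-pool (max + mean over dim 1), then a
    7x7 conv and a sigmoid producing the attention map for that rotation."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(1))

    def forward(self, x):                      # x: some rotated view of (B,C,H,W)
        z = torch.cat((x.amax(1, keepdim=True),
                       x.mean(1, keepdim=True)), dim=1)   # Z-pool -> 2 channels
        return x * torch.sigmoid(self.conv(z))            # gated input

class TripletAttention(nn.Module):
    """Three rotated views -- (H,W), (C,W), (C,H) -- gated and averaged."""
    def __init__(self):
        super().__init__()
        self.hw, self.cw, self.ch = ZPoolGate(), ZPoolGate(), ZPoolGate()

    def forward(self, x):                                  # x: (B, C, H, W)
        y_hw = self.hw(x)                                  # pool over C
        y_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # pool over H
        y_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # pool over W
        return (y_hw + y_cw + y_ch) / 3.0

if __name__ == "__main__":
    print(TripletAttention()(torch.randn(2, 64, 32, 32)).shape)  # [2, 64, 32, 32]
```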
4.4. Improved YOLOv8 Module
The integration of the aforementioned modules yields the proposed VST-YOLOv8 architecture. Each component addresses a specific limitation of the original YOLOv8 framework and together they form a coherent, lightweight yet powerful detection system.
First, to strengthen the backbone, several C2f blocks are replaced by MobileViTv2 modules. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, a MobileViTv2 block applies the following transformation:
$$Y = \mathrm{PWConv}\!\left(\mathrm{MHSA}\!\left(\mathrm{DWConv}\!\left(\mathrm{PWConv}(X)\right)\right)\right).$$
This operation expands channels using pointwise convolution, performs depthwise convolution for local feature extraction, applies multi-head self-attention (MHSA) to capture global dependencies, and finally projects the output back to the original dimensionality [27]. In this way, the backbone benefits from both local inductive biases and global reasoning capabilities, which are essential for detecting irregular gasket defects.
In the detection head, the standard convolutions are replaced by GSConv modules in order to balance computational efficiency with semantic richness. For a prediction feature tensor $X \in \mathbb{R}^{C \times H \times W}$, GSConv performs
$$Y = \mathrm{Mix}\!\left(\left[\mathrm{SC}(X) \,\|\, \mathrm{DSC}(X)\right]\right),$$
where $[\cdot \,\|\, \cdot]$ denotes concatenation of the SC and DSC outputs followed by channel mixing through a $1 \times 1$ convolution. This design enables the detection head to emphasize multi-shape feature responses while significantly reducing FLOPs compared to pure SC layers.
To further improve feature fusion, the Slim-neck design is employed, in which multi-scale feature maps $\{F_1, \ldots, F_n\}$ from different backbone stages are aggregated in a single operation rather than through repeated iterative fusion [28]. Formally, the aggregation is expressed as
$$F_{\mathrm{agg}} = g\!\left(\left[\phi_1(F_1) \,\|\, \cdots \,\|\, \phi_n(F_n)\right]\right),$$
where $\phi_i(\cdot)$ aligns the channel dimensions of each input feature, and $g(\cdot)$ is a transformation function such as a $1 \times 1$ convolution with a non-linearity. This one-time aggregation eliminates redundant computations, reduces FLOPs, and preserves the multi-scale context necessary for accurate detection of defects at varying scales.
Finally, Triplet Attention modules are incorporated into selected backbone stages to enhance cross-dimensional feature interactions. For a given stage feature map $F \in \mathbb{R}^{C \times H \times W}$, the attention mechanism produces
$$F' = \frac{1}{3}\sum_{b=1}^{3} \pi_b^{-1}\!\left(A_b \odot \pi_b(F)\right),$$
which jointly models channel, height, and width dependencies. By capturing these interdependencies, Triplet Attention enables the network to better discriminate subtle defect signals from background noise, particularly under challenging lighting conditions or cluttered environments.
Taken together, these modifications yield the proposed VST-YOLOv8 architecture, which combines Vision Transformer modules, Slim-neck efficiency, GSConv-based expressiveness, and Triplet Attention. The overall structure is illustrated in
Figure 7, highlighting the interplay between backbone, neck, and detection head enhancements.
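To illustrate this interplay, the toy wiring below composes the modules sketched in the previous subsections into a miniature backbone–neck–head pipeline. It is a schematic only, not the full VST-YOLOv8: it reuses the MV2Block, GlobalMixer, GSConv, OneTimeAggregation, and TripletAttention classes from the earlier snippets, and all channel sizes and stage counts are assumptions.

```python
import torch
import torch.nn as nn

class MiniVSTPipeline(nn.Module):
    """Toy wiring showing where each sketched module sits; assumes the classes
    defined in the preceding snippets are in scope."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.stage1 = nn.Sequential(MV2Block(32, 64), TripletAttention())  # backbone
        self.stage2 = nn.Sequential(MV2Block(64, 128), GlobalMixer(128))   # backbone
        self.neck = OneTimeAggregation([64, 128], c_mid=64, c_out=128)     # Slim-neck
        self.head = GSConv(128, 256)                                       # head conv

    def forward(self, x):
        x = self.stem(x)
        f1 = self.stage1(x)
        f2 = self.stage2(nn.functional.max_pool2d(f1, 2))
        return self.head(self.neck([f1, f2]))

if __name__ == "__main__":
    print(MiniVSTPipeline()(torch.randn(1, 3, 64, 64)).shape)  # [1, 256, 16, 16]
```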
4.5. Trustworthiness of the Proposed Method
For industrial deployment, accuracy must be paired with trustworthiness, which we operationalize along four axes: (i) reliability (stable predictions under benign changes of input/conditions), (ii) robustness (graceful degradation under noise/blur/compression), (iii) interpretability (human-aligned evidence for detections), and (iv) reproducibility (clear, repeatable procedures). Below we provide an intuitive formulation of the reliability functional and connect each architectural choice to these axes.
Let $f(\cdot)$ denote the detector and $\phi(x)$ its penultimate feature embedding after global pooling. Industrial imaging often introduces nuisance transforms $T$ (small translations, illumination shifts, mild blur, JPEG compression). A detector is reliable if both its representation and its outputs are stable under such $T$. We therefore define a simple, interpretable reliability score for an image $x$ as a convex combination of feature stability and output consistency:
$$R(x) = \lambda\,\mathbb{E}_{T \sim \mathcal{T}}\!\left[\cos\!\left(\phi(x), \phi(T(x))\right)\right] + (1 - \lambda)\,\mathbb{E}_{T \sim \mathcal{T}}\!\left[\mathrm{IoU}\!\left(B(x), B(T(x))\right) - \mathrm{KL}\!\left(p(x) \,\|\, p(T(x))\right)\right].$$
Here, $\cos(\cdot,\cdot)$ is cosine similarity, $B(\cdot)$ the set of predicted boxes after NMS, $p(\cdot)$ the per-box class posteriors (aligned by Hungarian matching), and $\mathcal{T}$ a distribution over mild industrial transforms. Intuition: if a gasket image is shifted by a few pixels or lightly compressed, a reliable model should produce nearly the same features (high cosine similarity), nearly the same boxes (high IoU), and nearly the same class probabilities (small KL), hence a high $R(x)$.
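The sketch below estimates a simplified version of $R(x)$ for any detector exposing pooled features and class posteriors: it keeps the $\lambda$-weighted feature-similarity term and replaces the IoU/KL output term with a single KL-based consistency score (Hungarian box matching is omitted for brevity). The model interface and the transform set are placeholders, not our actual evaluation code.

```python
import torch
import torch.nn.functional as F

def reliability_score(model, x, transforms, lam: float = 0.5) -> float:
    """Simplified estimate of R(x): lam * mean feature cosine similarity under
    nuisance transforms + (1 - lam) * mean class-posterior agreement.
    `model` is a placeholder exposing .features(img) -> (D,) and
    .class_probs(img) -> (K,); box matching is omitted for brevity."""
    phi0, p0 = model.features(x), model.class_probs(x)
    feat_sim, consistency = [], []
    for t in transforms:                       # T ~ mild industrial transforms
        xt = t(x)
        feat_sim.append(F.cosine_similarity(phi0, model.features(xt), dim=0))
        kl = F.kl_div(model.class_probs(xt).log(), p0, reduction="sum")
        consistency.append(torch.exp(-kl))     # map KL to (0, 1]: 1 = identical
    feat = torch.stack(feat_sim).mean()
    cons = torch.stack(consistency).mean()
    return (lam * feat + (1 - lam) * cons).item()

# Example transform set (placeholders): a small shift and a mild blur.
# transforms = [lambda img: torch.roll(img, 2, dims=-1),
#               lambda img: F.avg_pool2d(img.unsqueeze(0), 3, 1, 1).squeeze(0)]
```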
Why CNN + ViT improves reliability (intuition for the convex mixing). Convolutions excel at local texture continuity; Transformers capture global shape/layout. We model their expected contributions as
$$\phi(x) = \beta\,\phi_{\mathrm{CNN}}(x) + (1 - \beta)\,\phi_{\mathrm{ViT}}(x),$$
where $\beta \in [0, 1]$ is chosen to minimize empirical instability (i.e., maximize $R(x)$) on a validation set with synthetic perturbations. When local textures are degraded (e.g., blur), global cues from MobileViTv2 sustain feature stability; when global layout is ambiguous (e.g., clutter), the CNN inductive bias anchors local evidence. Triplet Attention further boosts reliability by enforcing cross-dimensional agreement (across the $(H,W)$, $(C,W)$, and $(C,H)$ “views”), which empirically raises feature stability and preserves IoU under $T$.
Why Slim-neck helps robustness and reproducibility. The Slim-neck replaces expensive standard convolutions with GSConv and one-time TGS aggregation, reducing redundant mixing and controlling capacity:
$$r = 1 - \frac{\mathrm{FLOPs}_{\mathrm{Slim}}}{\mathrm{FLOPs}_{\mathrm{SC}}}.$$
The relative reduction $r$ is ≈0.45 in our setting, which lowers latency variance and hardware-induced jitter. Better-conditioned mixing steps reduce overfitting to spurious textures, making predictions less sensitive to photometric noise and leading to tighter calibration (as observed via lower ECE/NLL in our experiments), hence a higher $R(x)$.
Why global reasoning aids interpretability. MobileViTv2’s global tokens and Triplet Attention’s rotated branches yield saliency maps that emphasize gasket boundaries and defect interiors rather than background clutter; Grad-CAM on the decoupled head shows tighter, class-specific heatmaps. When explanations focus on physically meaningful regions, output consistency under $T$ improves (boxes/classes shift less), which correlates with a higher $R(x)$.