Multi-Scale Gaussian Mixture Model-Gated Mixture of Experts for Fine-Grained Insect Pest Classification

Şahin, Nurullah; Alpaslan, Nuh; Hanbay, Davut

doi:10.3390/electronics15112268

by Nurullah Şahin ¹,*, Nuh Alpaslan ² and Davut Hanbay ¹

¹ Department of Computer Engineering, Faculty of Engineering, İnönü University, 44280 Malatya, Turkey

² Department of Computer Science, Faculty of Engineering and Architecture, Bingöl University, 12000 Bingöl, Turkey

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2268; https://doi.org/10.3390/electronics15112268 (registering DOI)

Submission received: 17 April 2026 / Revised: 17 May 2026 / Accepted: 20 May 2026 / Published: 23 May 2026

Abstract

Fine-grained insect pest classification presents a particularly demanding visual recognition challenge due to severe class imbalance, pronounced intra-class morphological variability across developmental stages, and high inter-class visual similarity among taxonomically related species. Existing deep learning approaches typically rely on a single feature representation extracted from a single network depth, overlooking complementary discriminative cues distributed across multiple abstraction levels. Furthermore, classical attention mechanisms perform spatial weighting deterministically, without explicitly modeling the underlying statistical structure of the feature space, which is inherently multimodal on long-tailed benchmarks such as IP102. This study proposes a Multi-Scale Gaussian Mixture Model-Gated Mixture of Experts (GMM-MoE) architecture that operates as a plug-in module insertable into any convolutional backbone, evaluated here on DenseNet-121 at three distinct feature depths. The proposed module computes analytic GMM posterior responsibilities in closed form, softly assigning each spatial location to dedicated convolutional expert sub-networks. At the same time, a conditional prior mechanism π(x) adapts the routing strategy to individual image content rather than relying on fixed priors. The architecture is evaluated on the IP102 benchmark (102 pest classes, ~75,000 images) under a two-stage training protocol. Ablation experiments confirm that increasing the number of experts consistently improves accuracy across all three routing depths, and that multi-scale fusion surpasses any single-scale configuration. The proposed model achieves a mean top-1 accuracy of 74.12% (±0.25%, 95% CI) across three independent runs on the IP102 test set. 
To the best of our knowledge, this is the first work to employ GMM posterior responsibilities as a spatial routing mechanism within a multi-scale CNN feature hierarchy for fine-grained insect pest classification, establishing a principled probabilistic alternative to deterministic attention weighting in visual recognition systems.

Keywords: mixture of experts; gaussian mixture model; probabilistic routing; fine-grained image classification; insect pest recognition; multi-scale feature learning; convolutional neural network; statistical feature routing; long-tailed classification

1. Introduction

Crop losses caused by insect pests are among the most critical challenges in global agriculture, directly affecting food security and economic stability. Oerke [1] reported that pests are responsible for 30–40% of total crop losses worldwide, while Savary et al. [2] estimated that pathogens and pests together reduce the yield of major staple crops (including wheat, rice, maize, potato, and soybean) by 17–30% annually. These losses disproportionately affect smallholder farmers in developing regions where access to expert agronomic advice is constrained by climatic conditions, geographic isolation, and insufficient agricultural infrastructure [3]. Furthermore, the widespread reliance on chemical pesticides in pest management raises sustainability concerns regarding environmental contamination, biodiversity loss, and human health risks [4]. Accurate and timely identification of pest species thus remains an indispensable prerequisite for the rational implementation of integrated pest management (IPM) strategies capable of minimizing both economic damage and environmental impact [5].

Conventional pest identification methods rely on visual inspection performed by trained entomologists, a process that is inherently time-consuming and laborious at scale [6] and susceptible to inter-observer inconsistency even among domain experts [7]. Xie et al. [8] addressed these limitations by developing automated insect recognition systems that exploit multi-level learning features, demonstrating the value of hierarchical feature representations for pest classification. Over the past decade, image-based deep learning approaches have emerged as a transformative alternative offering the potential for automated, real-time pest identification under field conditions [9]. Kamilaris and Prenafeta-Boldú [10] comprehensively reviewed the application of deep learning techniques to agricultural and food production challenges, demonstrating their superiority over conventional methods. Standard convolutional neural network (CNN) architectures such as DenseNet [11], ResNet [12], and EfficientNet [13] have achieved strong performance in general image classification benchmarks and have been applied to pest recognition with varying degrees of success. However, insect pest classification presents distinctive challenges that extend well beyond standard visual recognition: fine-grained morphological similarity among species, high intra-class appearance variability across developmental stages, naturally long-tailed class distributions, and complex field backgrounds collectively render this a particularly demanding visual recognition task.

The IP102 benchmark dataset, introduced by Wu et al. [14], has become the primary evaluation platform for large-scale research on insect pest classification. Comprising more than 75,000 images across 102 pest categories, arranged in a hierarchical taxonomic structure, the dataset exhibits pronounced inter-class and intra-class variance, as well as severe class imbalance. Moreover, each pest category within the dataset spans multiple developmental stages (including pupa, larva, and adult), further amplifying visual inconsistency across samples [15]. Representative pest species from the dataset are illustrated in Figure 1, which clearly reveals the intrinsic difficulty of the classification task: background complexity, small object scales, and inter-species visual similarity are all evident. Despite extensive research on this benchmark, surpassing 75% recognition accuracy has remained elusive. Nanni et al. [16] reported an accuracy of 74.11% by combining six CNN architectures with an improved Adam optimizer. Liu et al. [17] achieved 74.69% via a self-supervised Transformer pre-training approach employing latent semantic mask autoencoders. Xia et al. [18] achieved 74.2% by integrating DenseNet-201 with an enhanced Vision Transformer in a multi-branch, multi-scale architecture, using ensemble learning. Chen et al. [19] proposed a multi-scale feature localization and adaptive filtering fusion framework that trains high-performance CNN feature extractors and integrates their outputs through soft voting, achieving 73.9% on IP102 under the single-image classification setting and demonstrating that well-designed CNN pipelines remain competitive on this benchmark. Taken together, these results suggest that incremental gains on IP102 increasingly require architectural innovation rather than simple model scaling.

A recurring observation in IP102 studies is that models relying on a single feature representation extracted at one network depth fail to capture complementary discriminative cues distributed across multiple abstraction levels of a deep network. Shallow layers encode low-level patterns such as color and edge statistics; intermediate layers represent part-level structural configurations; and deep layers encode semantic, class-level attributes. Exploiting this hierarchical information holistically is critical for fine-grained classification performance. Qian et al. [20] confirmed that multi-scale feature fusion combined with mixed attention mechanisms yields measurable improvements on fine-grained pest benchmarks. An et al. [21] further demonstrated that feature fusion networks aggregating complementary views from multiple backbone models consistently outperform single-model approaches. Nevertheless, the majority of existing methods fuse multi-scale features through static concatenation or fixed attention weights, without modeling the statistical structure of feature distributions at each depth. This limitation accounts for classification errors arising from the inadequacy of a single feature representation, particularly for pest species exhibiting multiple appearance modes.

Classical attention mechanisms, such as Squeeze-and-Excitation networks [22] and channel attention modules, recalibrate feature maps through deterministic learned weighting functions. While effective, these approaches do not account for the fact that individual feature vectors may correspond to fundamentally distinct visual patterns, and therefore require different processing pathways. The Mixture of Experts (MoE) paradigm, originally proposed by Jacobs et al. [23], addresses this limitation by routing input data to specialized sub-networks via a gating function, enabling input-conditioned computation. Jordan and Jacobs [24] extended this framework to hierarchical MoE architectures trained with the Expectation–Maximization (EM) algorithm, strengthening its theoretical foundations. More recently, Riquelme et al. [25] integrated sparse MoE layers into large-scale Vision Transformer models, demonstrating that effective model capacity can be increased while keeping parameter counts fixed. Despite these advances, GMM-based probabilistic routing has not been applied to multi-scale feature hierarchies in CNN architectures, and no prior work has employed analytic GMM posterior responsibilities as a spatial gating mechanism for fine-grained pest classification. The present work directly addresses this gap.

Gaussian Mixture Models (GMMs) provide a principled probabilistic framework for modeling multi-modal data distributions, and their integration with deep neural networks has attracted growing interest. Dempster et al. [26] established the theoretical foundations of the EM algorithm for parameter estimation under incomplete data, which has served as a key reference point for GMM-based deep learning approaches. Variani et al. [27] proposed a GMM layer jointly optimized with discriminative features within a deep network, showing that GMM posteriors can serve effectively as discriminative feature transformations. Van den Oord and Schrauwen [28] demonstrated the capacity of deep GMMs to capture complex variation in natural images. Variational inference methods [29,30] subsequently offered differentiable, scalable extensions of the EM algorithm, enabling such probabilistic structures to be embedded within end-to-end-trained deep networks. Despite these theoretical foundations, GMM-based gating has not yet been systematically applied to multi-scale feature routing in fine-grained pest classification.

In this work, we propose a Multi-Scale Gaussian Mixture Model-Gated Mixture of Experts (GMM-MoE) architecture that addresses the limitations outlined above by attaching probabilistic routing modules at three distinct feature levels of a DenseNet-121 backbone: the outputs of the conv3, conv4, and conv5 blocks. Unlike conventional attention or MoE mechanisms that employ deterministic gating, the proposed module computes analytic GMM posterior responsibilities in closed form, softly assigning each spatial location to dedicated convolutional expert sub-networks. A conditional prior mechanism π(x) extends the standard GMM formulation by making the mixing coefficients input-dependent, with a learnable blending coefficient α enabling the routing strategy to adapt dynamically to individual image content. Precision-based variance parametrization and dimension-aware temperature scaling ensure numerically stable optimization across layers with varying channel depths. At the same time, data-driven expert center initialization via farthest-point sampling prevents expert collapse during training. The architecture is evaluated on IP102 under a two-stage training protocol, and ablation experiments systematically quantify the contribution of each design component.

The principal contributions of this work are summarized as follows:

(1): GMM-gated MoE routing mechanism: To the best of our knowledge, the first application of analytic GMM posterior responsibilities as a spatial routing gate within a convolutional feature hierarchy. The posterior is computed in closed form, replacing deterministic attention weighting with statistically grounded soft assignment of spatial locations to dedicated convolutional expert sub-networks.
(2): Multi-scale routing architecture: Independent GMM-MoE modules applied at three feature depths of a DenseNet-121 backbone, with multi-scale fusion performed via spatial alignment and channel projection.
(3): Conditional prior mechanism π(x): An input-dependent formulation of the GMM mixing coefficients that transforms the standard mixture model into a conditional mixture through a learnable blending coefficient α.
(4): Precision-based variance parametrization: Combined with dimension-aware temperature scaling to ensure consistent routing behavior across layers of varying channel dimensionality.
(5): Data-driven expert initialization: Farthest-point sampling of real feature distributions combined with calibrated variance initialization to prevent expert collapse during training.

The remainder of this paper is organized as follows. Section 2 presents the materials and methods, including the IP102 dataset, the DenseNet-121 backbone, the theoretical background of Gaussian mixture models and Mixture of Experts, the proposed GMM-MoE architecture, and the experimental setup. Section 3 reports the results, including the ablation study on the expert count K, the final model performance and reproducibility analysis, interpretability evidence through routing maps and class-expert specialization, and a comparison with state-of-the-art methods on the IP102 benchmark. Section 4 discusses the findings in the context of probabilistic routing for fine-grained pest classification. Section 5 concludes the paper and outlines directions for future work.

The overall architecture of the proposed GMM-MoE CNN is illustrated in Figure 2, which depicts the end-to-end processing pipeline from the input image to the final pest-class prediction. The architecture consists of five sequential stages. (i) The RGB input image is first passed through the ImageNet-pre-trained DenseNet-121 backbone, which extracts hierarchical feature representations through its successive dense blocks. (ii) Three GMM-MoE modules are then inserted in parallel at the conv3, conv4, and conv5 outputs of the backbone, allowing the network to operate simultaneously at mid-level structural, compositional, and high-level semantic abstractions. (iii) Within each module, an E-Step block computes spatial responsibility maps, an M-Step block applies the parallel convolutional experts, and a residual output block fuses the expert representation with the original backbone features. (iv) The three module outputs are then unified to a common spatial resolution through scale-alignment layers, after which a multi-scale ensemble block concatenates and fuses them into a single representation. (v) This fused representation is finally passed to a classification head that produces the 102-class pest prediction.

2. Materials and Methods

2.1. Dataset (IP102)

The IP102 benchmark dataset, introduced by Wu et al. [14], is the largest publicly available resource for insect pest recognition, comprising 75,222 images distributed across 102 pest categories organized within a hierarchical taxonomic structure that groups species into eight crop-based superclasses. Images were collected under diverse real-world field conditions encompassing varying illumination, cluttered natural backgrounds, and a wide range of imaging distances.

The class distribution is severely imbalanced, with per-class sample counts ranging from 33 to 4544 images. The majority of categories contain fewer than 500 samples. At the same time, a small number of dominant classes account for a disproportionately large share of the total image count, introducing a systematic bias risk for minority classes that may correspond to agronomically significant pest species.

Two compounding sources of visual heterogeneity further complicate classification. First, pronounced intra-class variability arises because each category spans multiple developmental stages—egg, larva, pupa, and adult—that differ markedly in morphology, coloration, and surface texture. Second, substantial inter-class similarity exists among taxonomically related species within the same superclass, where discriminative cues are confined to subtle features such as wing venation or spot configuration that are easily obscured by background clutter. Representative examples are illustrated in Figure 1. Following the official partitioning, the dataset is divided into 45,095 training, 7508 validation, and 22,619 test images.

2.2. Backbone: DenseNet-121

DenseNet-121 [11] is a 121-layer convolutional architecture in which every layer within a dense block receives direct feature-map connections from all preceding layers, maximizing feature reuse and gradient flow. Formally, the output of the $l$-th layer is defined as $x_{l} = H_{l}([x_{0}, x_{1}, \dots, x_{l-1}])$, where $x_{l}$ denotes the output feature map of layer $l$, $H_{l}$ denotes a composite function of batch normalization, ReLU activation, and convolution, and $[x_{0}, x_{1}, \dots, x_{l-1}]$ denotes channel-wise concatenation of all preceding feature maps within the block.

Feature tensors are extracted at the outputs of the second, third, and fourth dense blocks (conv3_block12_concat, conv4_block24_concat, and conv5_block16_concat in the standard Keras DenseNet-121 layer naming convention). They are denoted as $h_{conv3} \in \mathbb{R}^{28 \times 28 \times 512}$, $h_{conv4} \in \mathbb{R}^{14 \times 14 \times 1024}$, and $h_{conv5} \in \mathbb{R}^{7 \times 7 \times 1024}$, respectively. These tensors serve as the inputs to the proposed GMM-MoE modules. Each level encodes complementary feature abstractions: $h_{conv3}$ captures mid-level structural patterns, $h_{conv4}$ encodes compositional features, and $h_{conv5}$ encodes high-level semantic representations. Figure 3 shows the DenseNet-121 backbone architecture together with the three multi-scale feature extraction points that serve as GMM-MoE integration sites.
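The tensor shapes quoted above can be cross-checked with a short bookkeeping sketch. It assumes the standard DenseNet-121 configuration (64 stem channels, growth rate 32, block depths 6/12/24/16, transition compression 0.5) and a 224 × 224 input; the input size is not stated in this section and is inferred from the 28/14/7 resolutions. The function name `dense_block_outputs` is purely illustrative.

```python
# Channel and resolution bookkeeping for DenseNet-121 (assumed 224x224 input).
GROWTH_RATE = 32                 # channels added by every dense layer
BLOCK_LAYERS = (6, 12, 24, 16)   # layers per dense block in DenseNet-121
COMPRESSION = 0.5                # transition layers halve the channel count


def dense_block_outputs(input_size=224, stem_channels=64):
    """Return [(spatial_size, channels)] at the output of each dense block."""
    size = input_size // 4       # stride-2 stem convolution, then stride-2 pooling
    channels = stem_channels
    outputs = []
    for index, layers in enumerate(BLOCK_LAYERS):
        # Concatenating each layer's output grows the channel axis linearly.
        channels += layers * GROWTH_RATE
        outputs.append((size, channels))
        if index < len(BLOCK_LAYERS) - 1:   # no transition after the last block
            channels = int(channels * COMPRESSION)
            size //= 2
    return outputs


shapes = dense_block_outputs()
for name, (size, channels) in zip(("conv2", "conv3", "conv4", "conv5"), shapes):
    print(f"{name}: {size} x {size} x {channels}")
```

Under these assumptions the last three entries match the shapes of $h_{conv3}$, $h_{conv4}$, and $h_{conv5}$ given in the text.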

2.3. Theoretical Background

2.3.1. Gaussian Mixture Model

A Gaussian Mixture Model (GMM) [31] represents the distribution of an observation $x \in \mathbb{R}^{D}$, where $D$ denotes the dimensionality of the feature vector, as a convex combination of $K$ component densities. Equation (1) defines the GMM marginal density:

p (x) = \sum_{k = 1}^{K} π_{k} N (x ∣ μ_{k}, Σ_{k})

(1)

Here, the mixing coefficients $π_{k}$ satisfy $\sum_{k} π_{k} = 1$, $μ_{k} \in \mathbb{R}^{D}$ is the mean vector of the $k$-th component, $Σ_{k} \in \mathbb{R}^{D \times D}$ is the corresponding covariance matrix, and $N(x ∣ μ_{k}, Σ_{k})$ denotes the multivariate Gaussian density function. GMM parameters are estimated via the Expectation–Maximization (EM) algorithm [26]. Equation (2) shows the posterior responsibility of component $k$ for observation $x_{n}$, which is computed in the E-step:

γ_{n k} = \frac{π_{k} N (x_{n} ∣ μ_{k}, Σ_{k})}{\sum_{j = 1}^{K} π_{j} N (x_{n} ∣ μ_{j}, Σ_{j})}

(2)

where $γ_{nk}$ denotes the soft assignment of observation $x_{n}$ to component $k$, with $\sum_{k} γ_{nk} = 1$ for all $n$.
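As a concrete illustration of Equation (2), the following NumPy sketch evaluates the responsibilities for a toy two-component mixture. The data, the parameter values, and the helper names `gaussian_log_pdf` and `responsibilities` are assumptions made for this example only; they are not taken from the paper's implementation.

```python
import numpy as np


def gaussian_log_pdf(x, mean, cov):
    """Log of N(x | mean, cov) for every row of x, where x has shape [N, D]."""
    dim = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis term (x - mean)^T cov^{-1} (x - mean), one value per row.
    maha = np.einsum("nd,nd->n", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (maha + logdet + dim * np.log(2.0 * np.pi))


def responsibilities(x, weights, means, covs):
    """Equation (2): gamma[n, k] = p(component k | x_n), shape [N, K]."""
    log_joint = np.stack(
        [np.log(w) + gaussian_log_pdf(x, m, c)
         for w, m, c in zip(weights, means, covs)],
        axis=1,
    )
    # Subtract the row-wise maximum before exponentiating to avoid underflow.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)


# Toy mixture: two unit-covariance components centred at (-2, 0) and (2, 0).
weights = np.array([0.5, 0.5])
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
covs = np.array([np.eye(2), np.eye(2)])
points = np.array([[-2.0, 0.0], [0.0, 0.0], [2.0, 0.0]])
gamma = responsibilities(points, weights, means, covs)
print(gamma)
```

The point midway between the two means is shared equally by both components, while each point located at a component mean is assigned almost entirely to that component.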

In the M-step, the mixture weights, component means, and covariance matrices are updated according to the accumulated responsibilities. Equations (3)–(5) define these update rules. Equation (3) shows the update of mixture weights:

π_{k}^{new} = \frac{1}{N_{s}} \sum_{n = 1}^{N_{s}} γ_{n k}

(3)

Equation (4) shows the update of the component mean:

μ_{k}^{new} = \frac{\sum_{n} γ_{n k} x_{n}}{\sum_{n} γ_{n k}}

(4)

Equation (5) shows the update of the covariance matrix:

Σ_{k}^{new} = \frac{\sum_{n} γ_{n k} (x_{n} - μ_{k}) {(x_{n} - μ_{k})}^{⊤}}{\sum_{n} γ_{n k}}

(5)

where $N_{s}$ denotes the total number of observations (sample size). The EM algorithm is guaranteed never to decrease the marginal log-likelihood from one iteration to the next. When covariance matrices are restricted to the diagonal form $Σ_{k} = \operatorname{diag}(σ_{k,1}^{2}, \dots, σ_{k,D}^{2})$, the parameter count reduces from $O(KD^{2})$ to $O(KD)$, where $O(\cdot)$ denotes the standard big-O asymptotic notation indicating how the parameter count scales with $K$ and $D$. Equation (6) shows the resulting log-likelihood expression:

\log N (x ∣ μ_{k}, \operatorname{diag}(σ_{k}^{2})) = - \frac{1}{2} \sum_{d = 1}^{D} \left[ \frac{(x_{d} - μ_{k, d})^{2}}{σ_{k, d}^{2}} + \log σ_{k, d}^{2} + \log 2 π \right]

(6)

where $x_{d}$ and $μ_{k,d}$ denote the $d$-th elements of $x$ and $μ_{k}$, respectively, and $σ_{k,d}^{2}$ is the $d$-th diagonal variance of component $k$.
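Equations (2)–(6) can be combined into a single EM iteration for a diagonal-covariance GMM, sketched below in NumPy. The synthetic data, the initial parameters, and the small variance floor `var_floor` are assumptions added for this illustration. The covariance update uses the freshly updated means, which is the usual EM convention; Equation (5) writes $μ_{k}$ without specifying which estimate is meant.

```python
import numpy as np


def diag_log_pdf(x, mean, var):
    """Equation (6): log N(x | mean, diag(var)) for each row of x ([N, D])."""
    return -0.5 * np.sum(
        (x - mean) ** 2 / var + np.log(var) + np.log(2.0 * np.pi), axis=1
    )


def log_joint(x, weights, means, variances):
    """log(pi_k) + log N(x_n | mu_k, diag(var_k)), shape [N, K]."""
    return np.stack(
        [np.log(w) + diag_log_pdf(x, m, v)
         for w, m, v in zip(weights, means, variances)],
        axis=1,
    )


def log_likelihood(x, weights, means, variances):
    """Marginal log-likelihood of the data, using a stable log-sum-exp."""
    lj = log_joint(x, weights, means, variances)
    peak = lj.max(axis=1, keepdims=True)
    return float(np.sum(peak[:, 0] + np.log(np.exp(lj - peak).sum(axis=1))))


def em_step(x, weights, means, variances, var_floor=1e-6):
    """One EM iteration: E-step (Eq. 2) followed by M-step (Eqs. 3-5)."""
    lj = log_joint(x, weights, means, variances)
    gamma = np.exp(lj - lj.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)            # responsibilities [N, K]
    n_k = gamma.sum(axis=0)                              # effective counts [K]
    new_weights = n_k / x.shape[0]                       # Eq. (3)
    new_means = (gamma.T @ x) / n_k[:, None]             # Eq. (4)
    sq_dev = (x[:, None, :] - new_means[None, :, :]) ** 2
    new_vars = np.einsum("nk,nkd->kd", gamma, sq_dev) / n_k[:, None]  # Eq. (5), diagonal
    # The floor is a safeguard added in this sketch, not part of Eq. (5).
    return new_weights, new_means, np.maximum(new_vars, var_floor)


rng = np.random.default_rng(0)
data = np.concatenate(
    [rng.normal(-3.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))]
)
params = (np.array([0.5, 0.5]),
          np.array([[-1.0, 0.0], [1.0, 0.0]]),
          np.ones((2, 2)))
history = [log_likelihood(data, *params)]
for _ in range(10):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
```

Inspecting `history` on this toy problem gives an empirical check that the log-likelihood does not decrease across iterations.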

2.3.2. Mixture of Experts

The Mixture of Experts (MoE) framework [23,24] decomposes a complex input–output mapping into a weighted combination of specialized sub-networks. Equation (7) shows the MoE output:

y = \sum_{k = 1}^{K} g_{k} (x) f_{k} (x)

(7)

where $f_{k} : \mathbb{R}^{D} \to \mathbb{R}^{C}$ denotes the $k$-th expert network, $g_{k}(x)$ denotes the gating weight, and $y$ denotes the final combined output. Equation (8) shows the standard softmax gating mechanism:

g_{k} (x) = \frac{\exp (w_{k}^{⊤} x + b_{k})}{\sum_{j = 1}^{K} \exp (w_{j}^{⊤} x + b_{j})}

(8)

where $w_{k}$ is the gating weight vector and $b_{k}$ is the bias term.
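Equations (7) and (8) can be illustrated with a minimal NumPy sketch. For brevity, each expert $f_{k}$ is modeled here as a single affine map, which is an assumption of this example rather than a requirement of the MoE framework; all array names and sizes are illustrative.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


def moe_forward(x, gate_w, gate_b, expert_w, expert_b):
    """x: [N, D]; gate_w: [D, K]; gate_b: [K]; expert_w: [K, D, C]; expert_b: [K, C]."""
    gates = softmax(x @ gate_w + gate_b)                            # Eq. (8), [N, K]
    # Affine experts (assumption): f_k(x) = x W_k + b_k, stacked as [N, K, C].
    expert_out = np.einsum("nd,kdc->nkc", x, expert_w) + expert_b
    y = np.einsum("nk,nkc->nc", gates, expert_out)                  # Eq. (7), [N, C]
    return y, gates, expert_out


rng = np.random.default_rng(0)
N, D, K, C = 5, 8, 3, 4
x = rng.normal(size=(N, D))
expert_w = rng.normal(size=(K, D, C))
expert_b = rng.normal(size=(K, C))

y, gates, expert_out = moe_forward(
    x, rng.normal(size=(D, K)), np.zeros(K), expert_w, expert_b
)
# With all-zero gating parameters the gate is uniform and y is the expert average.
y_uniform, gates_uniform, _ = moe_forward(
    x, np.zeros((D, K)), np.zeros(K), expert_w, expert_b
)
```

The uniform-gate case makes the role of $g_{k}(x)$ explicit: without an informative gate, the mixture degenerates to a plain average of the expert outputs.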

2.3.3. Relation Between GMM Responsibilities and Expert Routing

Although classical Mixture-of-Experts models employ a softmax gating mechanism to determine expert contributions, this formulation does not explicitly incorporate the statistical structure of the feature space. In contrast, Gaussian mixture models provide a probabilistic interpretation of component assignments via posterior responsibilities computed in the expectation step of the EM algorithm.

In this formulation, the responsibility value $γ_{nk}$ defined in Equation (2) represents the posterior probability that observation $x_{n}$ originates from component $k$. Consequently, the responsibilities form a normalized probability distribution over the $K$ components for each observation. This probabilistic assignment is conceptually consistent with the routing mechanism of Mixture-of-Experts models, in which gating weights determine the relative contributions of individual experts to the final prediction.

From a probabilistic perspective, both the softmax gating weights $g_{k}(x)$ defined in Equation (8) and the GMM responsibilities $γ_{nk}$ correspond to posterior assignment probabilities of the form $p(z = k ∣ x)$, where the latent variable $z$ indicates the component or expert responsible for generating the observation. However, the GMM formulation incorporates both mixture priors and component likelihoods, thereby explicitly informing the routing process of the statistical structure of the feature space rather than relying solely on a discriminative projection.

This interpretation allows expert routing to be expressed within a probabilistic modeling framework. Consequently, the responsibility-based formulation provides a statistically grounded mechanism for expert assignment that naturally extends the classical Mixture-of-Experts paradigm. This probabilistic perspective establishes a principled connection between Gaussian mixture modeling and expert routing mechanisms, offering a theoretically motivated interpretation of expert selection in mixture-based neural architectures.

This statistical grounding is particularly consequential on benchmarks exhibiting severe class imbalance and multimodal within-class appearance distributions such as IP102. Whereas a deterministic attention weight is computed from a single discriminative projection irrespective of the local feature geometry, the GMM posterior explicitly characterizes the distributional structure of the feature space at each spatial location. For pest categories whose visual appearance spans multiple developmental stages, each of which constitutes a distinct mode in the feature distribution, this probabilistic decomposition enables more principled expert routing than fixed-weight attention alternatives.

2.4. Proposed Method

2.4.1. Overview

The proposed architecture introduces a GMM-Gated Mixture of Experts (GMM-MoE) module that can be inserted at any convolutional feature layer. The internal architecture of each module is illustrated in Figure 4 and operates in five sequential stages.

(i): Feature Projection: the input feature tensor $x \in \mathbb{R}^{H \times W \times C}$ is first passed through a 1 × 1 convolution followed by Layer Normalization to obtain a lower-dimensional projection $h \in \mathbb{R}^{H \times W \times D}$, where $D < C$. This projection serves as the common input to both the GMM routing computation and the expert sub-networks.
(ii): E-Step (GMM Routing): for every spatial location $(i, j)$, the analytic posterior responsibilities $γ_{ijk}$ of the $K$ mixture components are computed in closed form, with a conditional prior $π(x)$ that adapts the mixing coefficients to the input image content and a dimension-aware temperature $T_{eff}$ that stabilizes the softmax across feature scales of differing dimensionality.
(iii): M-Step (Expert Processing): $K$ parallel Conv2D–BatchNorm–ReLU expert sub-networks transform the projected representation independently, and their outputs are aggregated through a responsibility-weighted sum ${\hat{h}}_{ij} = \sum_{k} γ_{ijk} f_{k}(h_{ij})$.
(iv): Output Processing: the aggregated expert representation is fused with the original backbone tensor through a residual connection $z = (1 - w_{r}) x + w_{r} \cdot ProjConv(\hat{h})$, with an adaptive blending weight $w_{r}$ that is gradually annealed during training.
(v): Auxiliary Losses: a load balance term $L_{bal}$, encouraging uniform expert utilization, and a negative log-likelihood term $NLL$, driving the GMM to fit the spatial feature distribution faithfully, are combined into a single auxiliary loss $L_{aux}$ and added to the classification loss $L_{cls}$ during training.

In the present work, three such modules are attached to the DenseNet-121 backbone at the outputs of dense blocks 2, 3, and 4, yielding a multi-scale ensemble.

2.4.2. GMM-MoE Module

Let $x \in \mathbb{R}^{H \times W \times C}$ denote the input spatial feature tensor at a given backbone depth. The module first applies a 1 × 1 convolution followed by Layer Normalization to obtain a lower-dimensional projection $h \in \mathbb{R}^{H \times W \times D}$, where $D < C$. This projection serves as the input to both the GMM routing computation and the expert sub-networks, ensuring that the number of GMM parameters scales with $D$ rather than $C$.

The routing gate computes the posterior responsibility of each component for every spatial location $(i, j)$ via the analytic GMM E-step. Under a diagonal covariance assumption, the un-normalized log responsibility is shown in Equation (9):

{\tilde{l}}_{i j k} = \log π_{k} - \frac{1}{2} \sum_{d = 1}^{D} \left[ τ_{k, d} (h_{i j, d} - μ_{k, d})^{2} - \log τ_{k, d} \right]

(9)

where $π_{k}$ is the mixing coefficient of component $k$; $μ_{k} \in \mathbb{R}^{D}$ is its mean vector; and $τ_{k,d} = 1 / σ_{k,d}^{2}$ is the precision along dimension $d$, that is, the reciprocal of the variance.

The precision parameterization, rather than a direct variance parameterization, is adopted for three complementary reasons. First, $τ_{k,d}$ is the natural parameter of the Gaussian density in the diagonal log-likelihood expression of Equation (6): both $τ_{k,d}$ and $\log τ_{k,d}$ appear directly in the log-likelihood, avoiding the costly reciprocal operations that would otherwise be required at every spatial location and every training step. Second, the numerical stability constraint that prevents components from collapsing onto individual data points takes the form of a simple lower bound $τ_{k,d} \geq τ_{\min}$, which is straightforward to impose through the softplus reparameterization, whereas an equivalent upper bound on $σ_{k,d}^{2}$ would require a non-trivial bounded activation. Third, gradient updates with respect to $τ_{k,d}$ remain well-conditioned in regions of feature space where the model becomes very confident, in contrast to learning $σ_{k,d}^{2}$ directly, where gradients diverge as $σ_{k,d}^{2} \to 0$. To enforce the lower bound in practice, the precision is parameterized as $τ_{k,d} = τ_{\min} + softplus(u_{k,d})$, where $u_{k,d}$ is a learnable scalar and $τ_{\min} = 1 / σ_{\max}^{2}$ imposes a maximum variance constraint corresponding to a minimum precision.
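The softplus precision parameterization and the un-normalized log responsibility of Equation (9) can be sketched in NumPy as follows. The value `SIGMA_MAX = 3.0`, the tensor sizes, the uniform prior, and the function names are hypothetical choices for this example; the paper's actual hyperparameters are not given in this passage, and the conditional prior $π(x)$ is not modeled here.

```python
import numpy as np


def softplus(u):
    """Numerically stable softplus, log(1 + exp(u))."""
    return np.logaddexp(0.0, u)


def precision(u, sigma_max):
    """tau = tau_min + softplus(u), with tau_min = 1 / sigma_max^2."""
    tau_min = 1.0 / sigma_max ** 2
    return tau_min + softplus(u)


def unnormalized_log_resp(h, log_pi, mu, tau):
    """Equation (9). h: [H, W, D]; log_pi: [K]; mu, tau: [K, D]. Returns [H, W, K]."""
    sq_diff = (h[..., None, :] - mu) ** 2          # broadcast to [H, W, K, D]
    return log_pi - 0.5 * np.sum(tau * sq_diff - np.log(tau), axis=-1)


rng = np.random.default_rng(0)
H, W, D, K = 4, 4, 16, 3
SIGMA_MAX = 3.0                                    # hypothetical maximum std. dev.

h = rng.normal(size=(H, W, D))                     # stand-in projected features
mu = rng.normal(size=(K, D))                       # component means
tau = precision(rng.normal(size=(K, D)), SIGMA_MAX)
log_pi = np.log(np.full(K, 1.0 / K))               # uniform prior for simplicity
scores = unnormalized_log_resp(h, log_pi, mu, tau)
```

Because softplus is strictly positive for finite inputs, every entry of `tau` stays at or above $τ_{\min}$ regardless of the value taken by the unconstrained parameter $u_{k,d}$.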

To ensure consistent routing behavior across layers with different feature dimensionality, a dimension-aware temperature $T_{eff}$ is applied before the softmax normalization, as defined in Equation (10):

T_{eff} = T \cdot \sqrt{D_{ref} / D}

(10)

where $T$ is a layer-specific base temperature and $D_{ref} = 512$ is a fixed reference dimensionality. The use of a dimension-aware temperature prevents the magnitude of the log-responsibility scores from becoming disproportionately large or small when the feature dimensionality varies across layers. As a result, the routing behavior remains stable and comparable across modules operating at different representation depths. The scaled log-responsibility values are subsequently normalized through a softmax operation to obtain posterior component assignments. Equation (11) shows the final normalized responsibility value obtained after applying the softmax operation to the scaled log-responsibility scores:

γ_{i j k} = \frac{\exp ({\tilde{l}}_{i j k} / T_{eff})}{\sum_{k^{'} = 1}^{K} \exp ({\tilde{l}}_{i j k^{'}} / T_{eff})}

(11)

where $γ_{ijk}$ denotes the posterior responsibility assigned to component $k$ at spatial location $(i, j)$, $K$ denotes the number of mixture components (experts), $k^{'}$ is a dummy summation index, and ${\tilde{l}}_{ijk^{'}}$ denotes the same log-responsibility expression as defined in Equation (9), evaluated for component $k^{'}$.

The softmax normalization ensures that the responsibilities form a valid probability distribution over the $K$ components at every spatial location, satisfying $\sum_{k = 1}^{K} γ_{ijk} = 1$.
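The temperature scaling of Equation (10) and the normalization of Equation (11) can be sketched as follows. The base temperature of 1.0, the projection dimensionalities 512 and 128, and the random scores standing in for ${\tilde{l}}_{ijk}$ are illustrative assumptions, not values reported by the authors.

```python
import numpy as np


def effective_temperature(base_temperature, dim, dim_ref=512):
    """Equation (10): T_eff = T * sqrt(D_ref / D)."""
    return base_temperature * np.sqrt(dim_ref / dim)


def routing_softmax(log_resp, t_eff):
    """Equation (11): temperature-scaled softmax over the last (expert) axis."""
    scaled = log_resp / t_eff
    scaled = scaled - scaled.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
log_resp = rng.normal(scale=5.0, size=(7, 7, 4))   # stand-in scores, [H, W, K]

t_ref = effective_temperature(1.0, 512)     # D equal to D_ref gives T_eff = T
t_small = effective_temperature(1.0, 128)   # D = D_ref / 4 doubles the temperature
gamma_ref = routing_softmax(log_resp, t_ref)
gamma_small = routing_softmax(log_resp, t_small)
```

For the same scores, the larger effective temperature yields a flatter distribution over experts at every location, which is the usual effect of temperature on a softmax.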

From a probabilistic perspective, the responsibility value $γ_{ijk}$ corresponds to a posterior assignment probability of the form $p(z = k ∣ h_{ij})$, where $z$ denotes the latent mixture component responsible for generating the projected feature vector $h_{ij}$. This interpretation allows the routing mechanism to be viewed as spatially varying probabilistic inference over mixture components, establishing a direct connection between Gaussian mixture modeling and expert routing.

To enrich the projected representation before expert processing, each GMM-MoE module incorporates a lightweight global context branch. Denoting the locally projected feature as $h_{local} \in \mathbb{R}^{H \times W \times D}$, the input tensor $x$ is average-pooled by a factor of two, projected to dimension $D$ via a 1 × 1 convolution, and restored to the original resolution through bilinear upsampling, yielding $h_{global} \in \mathbb{R}^{H \times W \times D}$. The two representations are blended to obtain the expert input $h_{fwd}$ as shown in Equation (12):

h_{fwd} = 0.7 \cdot h_{local} + 0.3 \cdot h_{global}

(12)

Each expert is implemented as a Conv2D–BatchNorm–ReLU block whose kernel size is matched to the spatial scale of the corresponding feature map: 5 × 5 at conv3, 3 × 3 at conv4, and 1 × 1 at conv5 (Table 1). This choice keeps the convolutional receptive field proportional to the available spatial resolution at each backbone depth; in particular, the 1 × 1 kernel at conv5 reflects the saturation of spatial convolution at 7 × 7, where discriminative information is concentrated channel-wise. The design aligns with the multi-kernel philosophy of Inception-style architectures [32] and the multi-depth processing principle of Feature Pyramid Networks [33].

Equation (13) shows the soft aggregation of expert outputs using the responsibility maps as routing weights:

{\hat{h}}_{i j} = \sum_{k = 1}^{K} γ_{i j k} f_{k} (h_{i j})

(13)

where

f_{k}

denotes the transformation implemented by the

k

-th expert network, and

γ_{i j k}

denotes the spatial responsibility map associated with expert

k

. This weighted aggregation enables different experts to specialize in distinct regions of the feature space while maintaining fully differentiable routing behavior.
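The soft aggregation of Equation (13) can be sketched for one spatial location as a responsibility-weighted sum of expert outputs. This is a minimal illustration under stated assumptions: the expert outputs are given as precomputed vectors rather than produced by convolutional blocks, and the name `aggregate_experts` is hypothetical.

```python
def aggregate_experts(gamma, expert_outputs):
    """Soft aggregation at one location: sum over k of gamma[k] * f_k(h).

    gamma: K responsibilities that sum to one.
    expert_outputs: K vectors of equal length, one per expert.
    """
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for weight, output in zip(gamma, expert_outputs):
        for c in range(dim):
            mixed[c] += weight * output[c]
    return mixed

# Hypothetical responsibilities and expert outputs (K = 3, two channels).
gamma = [0.7, 0.2, 0.1]
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_hat = aggregate_experts(gamma, outputs)
```

Because every expert contributes in proportion to its responsibility, the operation stays differentiable with respect to both the routing weights and the expert outputs.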

To preserve the backbone’s representational capacity while incorporating expert-specialized features, the aggregated expert outputs are fused with the original input via a residual connection. Equation (14) shows the residual fusion formulation:

z = (1 - w_{r}) x + w_{r} \cdot ProjConv (\hat{h})

(14)

where

x

denotes the original backbone feature tensor,

\hat{h}

denotes the aggregated expert representation, and

w_{r} \in (0, 1)

denotes the residual fusion weight controlling the contribution of expert-enhanced features. During training,

w_{r}

is gradually increased to allow the model to progressively incorporate the expert-enhanced representation while maintaining the stability of the backbone features.

To prevent routing collapse, where a small subset of experts dominates the responsibility distribution, expert dropout is applied to the responsibility maps during training. In this procedure, a random subset of experts is temporarily masked, and the remaining responsibilities are re-normalized. This mechanism encourages balanced utilization of experts and promotes diversity across expert specializations.
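The masking and re-normalization step can be sketched as follows for the responsibilities at one location. The drop probability, the rule that at least one expert is always kept, and the name `expert_dropout` are assumptions made for illustration; the text does not specify them.

```python
import random

def expert_dropout(gamma, drop_prob, rng):
    """Mask a random subset of experts and renormalize the surviving responsibilities."""
    keep = [rng.random() >= drop_prob for _ in gamma]
    if not any(keep):
        # Assumption: always keep one expert so the renormalization is defined.
        keep[rng.randrange(len(gamma))] = True
    masked = [g if k else 0.0 for g, k in zip(gamma, keep)]
    total = sum(masked)
    if total == 0.0:
        # Survivors all had zero responsibility: fall back to uniform over survivors.
        n_kept = sum(keep)
        return [1.0 / n_kept if k else 0.0 for k in keep]
    return [m / total for m in masked]

rng = random.Random(0)
gamma = [0.6, 0.25, 0.1, 0.05]  # hypothetical responsibilities, K = 4
dropped = expert_dropout(gamma, drop_prob=0.5, rng=rng)
```

Whatever subset is masked, the returned vector is again a valid probability distribution over the experts that remain active.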

Two auxiliary terms supervise the GMM-MoE routing throughout training. The first is the negative log-likelihood (

NLL

) of the projected features under the conditional Gaussian mixture model, which encourages each expert to specialize on a coherent region of the feature space, as defined in Equation (15a):

NLL = - \frac{1}{H \cdot W} \sum_{i, j} \log \sum_{k = 1}^{K} π_{k} (x) N (h_{i j} ∣ μ_{k}, Σ_{k})

(15a)

where

H \cdot W

denotes the spatial resolution of the feature map at the corresponding backbone depth,

π_{k} (x)

is the input-dependent prior of component

k

defined by Equation (16), and

N (h_{i j} ∣ μ_{k}, Σ_{k})

is the multivariate Gaussian density of the projected feature vector

h_{i j}

under the mean

μ_{k}

and covariance

Σ_{k}

of component

k

. In practice,

NLL

is computed efficiently from the unnormalized log-responsibilities

{\tilde{l}}_{i j k}

of Equation (9) through a numerically stable log-sum-exp operation.
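A minimal sketch of this computation is given below. It assumes that the unnormalized log-responsibility of Equation (9) equals the log-prior plus the Gaussian log-density of each component, so that a log-sum-exp over components yields the per-location log-likelihood; the names `logsumexp` and `mixture_nll` and the example values are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_nll(log_joint):
    """Mean negative log-likelihood over spatial locations (form of Eq. 15a).

    log_joint[n][k] is assumed to hold log pi_k(x) + log N(h_n | mu_k, Sigma_k)
    for flattened location n and component k.
    """
    per_location = [logsumexp(row) for row in log_joint]
    return -sum(per_location) / len(per_location)

# Hypothetical joint terms: 2 locations, K = 2 components.
log_joint = [[math.log(0.3), math.log(0.1)],
             [math.log(0.05), math.log(0.15)]]
nll = mixture_nll(log_joint)
```

Shifting by the maximum keeps the computation finite even when all log terms are strongly negative, which is the usual situation for Gaussian log-densities in moderately high dimensions.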

The second auxiliary term is the load balance loss

L_{bal}

, which penalizes the deviation of the mean per-expert responsibility from a uniform distribution across the

K

components, as defined in Equation (15b):

L_{bal} = \sum_{k = 1}^{K} ({\bar{f}}_{k} - \frac{1}{K})^{2}

(15b)

where the mean responsibility

{\bar{f}}_{k}

of expert

k

over the spatial feature map is given by

{\bar{f}}_{k} = \frac{1}{H \cdot W} \sum_{i, j} γ_{i j k}

. The minimum of

L_{bal}

is attained at

{\bar{f}}_{k} = 1 / K

for all

k

, corresponding to perfectly uniform expert utilization; the term therefore discourages routing collapse onto a small subset of experts. The two auxiliary terms are combined into the total auxiliary loss in Equation (15c):

L_{aux} = β \cdot \frac{NLL}{D} + λ_{b} \cdot L_{bal}

(15c)

where

β

is the routing quality weight,

D

is the projection dimensionality of the respective GMM-MoE module that normalizes the

NLL

magnitude so that the auxiliary loss remains invariant across feature scales with differing channel depths, and

λ_{b}

is the load balance penalty weight. The total training loss minimized by the network is the sum of the classification cross-entropy

L_{cls}

and

L_{aux}

.
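Equations (15b) and (15c) can be illustrated with a short sketch that operates on a flattened responsibility map. The function names, the example maps, and the projection dimensionality of 64 are hypothetical; the values of β and λ_b are taken from the candidate ranges reported in Section 2.5.

```python
def balance_loss(gamma_map):
    """Squared deviation of mean per-expert responsibility from 1/K (form of Eq. 15b).

    gamma_map[n][k]: responsibility of expert k at flattened location n.
    """
    n_loc = len(gamma_map)
    k = len(gamma_map[0])
    mean_resp = [sum(row[j] for row in gamma_map) / n_loc for j in range(k)]
    return sum((f - 1.0 / k) ** 2 for f in mean_resp)

def auxiliary_loss(nll, gamma_map, dim, beta, lambda_b):
    """beta * NLL / D + lambda_b * L_bal (form of Eq. 15c)."""
    return beta * nll / dim + lambda_b * balance_loss(gamma_map)

uniform = [[0.5, 0.5], [0.5, 0.5]]    # both experts used equally
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # all mass on the first expert

# Hypothetical NLL of 64.0 with D = 64, so the normalized NLL term equals 1.
l_aux = auxiliary_loss(64.0, collapsed, dim=64, beta=0.05, lambda_b=0.02)
```

For K = 2 the balance term is zero under uniform usage and 0.5 under full collapse, which makes explicit how the penalty grows as routing concentrates on one expert.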

2.4.3. Conditional Prior Mechanism

In the standard GMM formulation, the mixing coefficients

π_{k}

are global constants shared across all inputs. This assumption is restrictive for fine-grained visual recognition, where the relative importance of feature clusters may depend on the specific image content. To address this, a conditional prior mechanism

π (x)

is introduced that produces input-dependent mixing coefficients.

A lightweight branch applies global average pooling to

h

and passes the result through two dense layers to produce conditional logits

π_{cond} (x) \in R^{K}

. These are blended with the fixed prior logits

π_{fixed}

via a learnable blending coefficient

α

, as shown in Equation (16):

\log π (x) = LogSoftmax [(1 - α) π_{fixed} + α π_{cond} (x)]

(16)

where

α

is a scalar learnable factor that controls the interpolation between the fixed prior

π_{fixed}

and the conditional prior

π_{cond} (x)

. Rather than learning

α

directly, the unconstrained underlying parameter

α_{raw}

is introduced, and the constrained blending coefficient is recovered through a sigmoid wrapping

α = sigmoid (α_{raw})

, which guarantees

α \in (0,1)

throughout training and removes the need for an explicit clipping step. The free parameter

α_{raw}

is optimized end-to-end through standard gradient descent on the total training loss, which is the sum of the classification cross-entropy

L_{cls}

and the auxiliary loss

L_{aux}

of Equation (15c). At each training step, the gradient of the total loss with respect to

α_{raw}

is computed via the chain rule along the following sequence of operations: the LogSoftmax in Equation (16) maps

α

into the conditional log-priors

\log π (x)

; these log-priors replace

\log π_{k}

in the unnormalized log-responsibilities of Equation (9); the temperature-scaled softmax of Equation (11) then produces the spatial responsibilities

γ_{i j k}

; and these responsibilities determine the weighted aggregation of expert outputs in Equation (13), which feeds the classification head. Errors at the output therefore propagate backward through Equations (9), (11), (13) and (16), and finally through the sigmoid in

α = sigmoid (α_{raw})

, producing a gradient that adaptively shifts

α

toward whichever regime (fixed-prior or input-conditioned) is most beneficial for the current state of the model. The initialization

α_{raw} = - 1.0

places the network at

α \approx 0.27

so that training begins close to the fixed-prior regime and gradually incorporates input-conditioned adaptation as

α_{raw}

evolves in response to the gradient signal. The blended log-prior replaces

\log π_{k}

in Equation (9), transforming the static GMM into a conditional mixture model.
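The blending in Equation (16) can be sketched in a few lines. This is an illustration under stated assumptions: the conditional logits are supplied directly instead of being produced by the pooling and dense layers, and the function names and example logits are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def conditional_log_prior(fixed_logits, cond_logits, alpha_raw):
    """log pi(x) = LogSoftmax[(1 - alpha) * fixed + alpha * cond], alpha = sigmoid(alpha_raw)."""
    alpha = sigmoid(alpha_raw)
    blended = [(1.0 - alpha) * f + alpha * c
               for f, c in zip(fixed_logits, cond_logits)]
    return log_softmax(blended)

fixed = [0.0, 0.0, 0.0]   # hypothetical fixed prior logits (uniform prior)
cond = [2.0, 0.0, -1.0]   # hypothetical image-dependent logits
log_prior = conditional_log_prior(fixed, cond, alpha_raw=-1.0)
```

At the stated initialization, sigmoid(-1.0) is approximately 0.27, and the derivative of the sigmoid, α(1 - α), is approximately 0.20, so the gradient reaching α_raw is attenuated but far from vanishing at the start of training.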

2.4.4. Multi-Scale Ensemble Architecture

Independent GMM-MoE modules are attached to three feature depths of the DenseNet-121 backbone: the outputs of dense blocks 2, 3, and 4, producing feature tensors with spatial resolutions of 28 × 28, 14 × 14, and 7 × 7 and channel dimensions of 512, 1024, and 1024, respectively. Operating at multiple abstraction levels enables the model to simultaneously exploit mid-level structural patterns, compositional features, and high-level semantic representations.

To enable channel-wise concatenation, each module output is passed through a 1 × 1 convolution that projects all three branches to a common dimensionality of 512. The conv3 output is additionally down-sampled by a factor of 2, and the conv5 output is up-sampled by a factor of 2, so that all three branches are aligned to 14 × 14 spatial resolution. The aligned branches are concatenated and then fused by a final 1 × 1 convolution, followed by batch normalization and ReLU. This multi-scale aggregation enables the model to integrate complementary spatial representations extracted at different semantic depths, improving robustness to variations in object appearance and scale. This multi-scale alignment strategy shares conceptual foundations with multi-rate fusion methods used in complex monitoring systems, where data from different scales are integrated to improve dynamic state estimation and methodological rigor [34]. Equation (17) shows the resulting multi-scale feature fusion formulation:

F = BN [{Conv}_{1 \times 1} ([ϕ_{3} (z_{3}), ϕ_{4} (z_{4}), ϕ_{5} (z_{5})])]

(17)

where

F

denotes the fused multi-scale feature representation,

z_{3}, z_{4}, z_{5}

denote the residually fused outputs of the GMM-MoE modules at the three backbone depths and

ϕ_{l}

denotes the spatial alignment and channel projection operator applied to branch

l

. The fused feature is passed to a global average pooling layer, a 512-unit dense layer with dropout, and a 102-way softmax classifier.

2.4.5. Data-Driven Expert Initialization

The component means

μ_{k}

are initialized from real feature distributions using farthest-point sampling (FPS). A batch of training features is extracted from each backbone layer, projected to the GMM input space, and the

K

centroids are selected iteratively by always choosing the point that is farthest from all previously selected centroids. This ensures that initial means span the feature space rather than clustering in a small region.
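The selection rule can be sketched as follows. The sketch assumes the common formulation of farthest-point sampling, in which each new centroid maximizes the distance to its nearest already-selected centroid, and it assumes that the first centroid is simply the first point of the batch; the text does not state how the first centroid is chosen. The name `farthest_point_sampling` and the toy points are hypothetical.

```python
import math

def farthest_point_sampling(points, k, first_index=0):
    """Pick k points; each new pick maximizes the distance to the chosen set."""
    chosen = [first_index]
    # min_dist[i]: distance from point i to its nearest chosen centroid
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return [points[i] for i in chosen]

# Toy 2-D "projected features"; (0.1, 0.0) nearly duplicates the first point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
means = farthest_point_sampling(points, k=3)
```

In this toy batch the near-duplicate point is never selected, which is the behavior that keeps the initial means from clustering in a small region.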

The initial variance is calibrated to achieve a target component separation score

s^{*} = {\overline{d}}_{μ} / σ_{init}

, where

{\overline{d}}_{μ}

is the mean pairwise distance between initial means. The corresponding precision

u_{k, d}

is set to satisfy this target. This calibration prevents the routing from collapsing to winner-take-all (when variance is too low) or becoming uninformative (when variance is too high) in early training epochs.
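The calibration can be illustrated by solving the separation score for the initial standard deviation. The sketch assumes Euclidean distances between means and assumes that the precision is the reciprocal of the variance; the exact parametrization of the precision is not restated here, so that last step should be read as an assumption. The name `calibrate_sigma` and the example means are hypothetical.

```python
import math
from itertools import combinations

def calibrate_sigma(means, target_separation):
    """Return sigma_init such that (mean pairwise distance) / sigma_init = s*."""
    pairs = list(combinations(means, 2))
    mean_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return mean_dist / target_separation

# Hypothetical initial means with pairwise distances 4, 3, and 5 (mean 4).
means = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
sigma_init = calibrate_sigma(means, target_separation=4.0)
precision_init = 1.0 / sigma_init ** 2  # assumption: precision = 1 / variance
```

With a target score of 4.0, the initial standard deviation is one quarter of the mean distance between means, so neighboring components overlap only in their tails at the start of training.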

2.5. Experimental Setup

Dataset and pre-processing. All experiments were conducted on the IP102 benchmark dataset [14], which comprises 75,222 images spanning 102 pest and disease categories with pronounced class imbalance. The official data partition was adopted throughout, yielding 45,095 training images, 7508 validation images, and 22,619 test images. All images were resized to 224 × 224 pixels and normalized using the DenseNet-specific channel statistics via the standard pre-processing function.

Backbone and feature extraction. DenseNet-121 pre-trained on ImageNet served as the backbone network. Feature maps were extracted at three intermediate levels, namely the outputs of dense blocks 2, 3, and 4 (the conv3, conv4, and conv5 stages, with shapes 28 × 28 × 512, 14 × 14 × 1024, and 7 × 7 × 1024), which encode textural, structural, and semantic representations, respectively. These tapping points served as the injection sites for the GMM-MoE modules in all experiments.

Training protocol. A two-stage training strategy was employed across all configurations. In Stage 1, the backbone weights were frozen, and only the GMM-MoE module(s) and the classification head were optimized using the Adam optimizer with a learning rate of 1 × 10⁻⁴ and gradient clipping (clipnorm = 1.0) for up to 150 epochs. Stage 2 released all weights for full fine-tuning at a reduced learning rate of 5 × 10⁻⁵, also for up to 150 epochs. Both stages applied categorical cross-entropy loss with label smoothing (ε = 0.1). The learning rate was reduced by a factor of 0.2 upon validation accuracy plateaus (patience = 10 epochs), and early stopping was triggered after 20 consecutive epochs without improvement. The batch size was fixed at 32 throughout. The residual blending weight wᵣ was linearly annealed from 0.5 to 0.8 over the first ten epochs of Stage 2, enabling a smooth transition from the backbone feature map to the expert-augmented representation.
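The annealing of the residual weight and its use in the fusion of Equation (14) can be sketched as follows. The sketch assumes zero-based epoch indexing and a per-epoch update, neither of which is specified in the text; the function names are hypothetical and the fused tensors are represented as flat lists.

```python
def residual_weight(epoch, start=0.5, end=0.8, warmup_epochs=10):
    """Linear ramp of w_r from start to end over the first warmup_epochs epochs."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

def residual_fuse(x, projected, w_r):
    """z = (1 - w_r) * x + w_r * projected, element-wise (form of Eq. 14)."""
    return [(1.0 - w_r) * a + w_r * b for a, b in zip(x, projected)]

# Weight at the start, midpoint, and end of the ramp, and after it.
schedule = [residual_weight(e) for e in (0, 5, 10, 50)]
z = residual_fuse([1.0, 2.0], [3.0, 4.0], w_r=residual_weight(0))
```

Under these assumptions the backbone features and the expert-enhanced features are weighted equally at the first Stage 2 epoch, and the expert path carries 80% of the fused signal once the ramp is complete.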

Initialization. Component means

μ_{k}

were initialized in a data-driven manner via farthest-point sampling applied to a batch of training features projected into the GMM input space, with a target separation score

s^{*}

= 4.0. The conditional prior blending coefficient

α

was initialized at

sigmoid (- 1.0) \approx 0.27

and subsequently learned end-to-end during training.

Hyperparameters. The expert count

K

was systematically evaluated over

K \in {2, 4, 6, 8, 10}

for each feature scale independently (Section 3.1). All remaining module-specific hyperparameters adopted in the final ensemble configuration are summarized in Table 1.

The module-level hyperparameters reported in Table 1, specifically the routing temperature

T

, the load balance weight

λ_{b}

, and the routing quality weight

β

, were determined through a structured manual search guided by validation accuracy. For each feature scale independently, candidate values of

T \in {1.3, 1.4, 1.6, 1.8}

,

λ_{b} \in {0.01, 0.02, 0.03}

, and

β \in {0.01, 0.05, 0.10}

were evaluated under Stage 1 training conditions, and the configuration yielding the highest validation accuracy was retained for Stage 2 fine-tuning. Systematic joint optimization across all three parameters and all three scales simultaneously was precluded by the available computational budget. The reported hyperparameter configuration should therefore be regarded as a locally, rather than globally, optimal setting, and a principled automated search is expected to yield further accuracy improvements beyond those reported here.

All experiments were carried out on a single NVIDIA A100 GPU within the Google Colab environment.

3. Results and Discussion

3.1. Ablation Study: Effect of Expert Count K

To isolate the contribution of the GMM-MoE module and to identify the optimal number of mixture components

K

, a systematic ablation was conducted across all three feature scales prior to assembling the final ensemble model. For each scale, specifically conv3, conv4, and conv5, a single-scale model was constructed by inserting one GMM-MoE module between the corresponding DenseNet121 feature map and the classification head, and evaluated for

K

in {1, 2, 4, 6, 8, 10} with all hyperparameters fixed as described in Section 2.5. When

K

equals 1, the mixture degenerates to a single component whose responsibility γ is identically 1, so the module reduces to a standard residual convolutional layer. This configuration therefore serves as the scale-specific lower bound against which all

K

> 1 results are compared.

The results are reported in Table 2. Across all three scales, every

K

> 1 configuration improved top-1 accuracy over the

K

= 1 lower bound, confirming that multi-component mixture routing consistently provides representational benefit beyond a single convolutional transformation. At conv3, accuracy increased from 65.66% at

K

= 1 to 68.39% at

K

= 10. At conv4 the corresponding improvement was from 72.37% to 73.28%, and at conv5 from 72.47% to 73.64%. An exception to the otherwise increasing trend is observed at conv4 for

K

= 6 and

K

= 8, where accuracy declined marginally relative to

K

= 4, reaching 72.94% and 72.91%, respectively, before recovering to 73.28% at

K

= 10. This transient non-monotonic behavior is localized to the intermediate-resolution scale and does not affect the overall conclusion that

K

= 10 yields the highest single-scale accuracy across all three feature levels. The marginal gain per additional component diminishes progressively beyond

K

= 8 at all scales, indicating empirical saturation of the mixture capacity given the approximately 45,000 training images distributed across 102 classes. On this basis,

K

= 10 was adopted for all subsequent experiments.

3.2. Final Model Performance and Reproducibility

Having established

K

= 10 as the optimal expert count at each scale, the three GMM-MoE modules were simultaneously integrated into DenseNet121 and their scale-aligned outputs fused through the multi-scale ensemble mechanism described in Section 2.4. The scale-specific branch weights beta are learned independently without normalization, allowing the network to dynamically adjust the relative contribution of each feature scale during Stage 2 fine-tuning.

To assess the reproducibility of the final model, three independent training runs were conducted under identical conditions, differing only in initialization randomness. The top-1 accuracies and their descriptive statistics are reported in Table 3. The three runs yielded 74.03%, 74.11%, and 74.22%, with a mean of 74.12%, a sample standard deviation of 0.1 percentage points, and a 95% confidence interval of 74.12 ± 0.25% computed using the two-tailed t-distribution with two degrees of freedom (t = 4.303). The narrow confidence interval confirms that the reported performance is stable across runs and is not an artifact of a single favorable initialization. The upper-bound run of 74.22% further suggests that targeted refinements to the Stage 2 fine-tuning protocol, such as a finer learning rate schedule or a more gradual annealing of the residual weight wᵣ, may reduce inter-run variability and shift the mean performance closer to this value without requiring architectural modifications. Figure 5 presents the Stage 1 and Stage 2 training and validation curves. Validation accuracy converges smoothly and remains stable throughout training, with no indication of overfitting.
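The descriptive statistics above can be reproduced from the three reported accuracies with the Python standard library. The variable names are illustrative. With the unrounded sample standard deviation the half-width comes to approximately 0.24 percentage points; the reported 0.25 is obtained when the standard deviation is first rounded to 0.1.

```python
import math
import statistics

runs = [74.03, 74.11, 74.22]  # top-1 accuracies of the three runs (%)
t_crit = 4.303                # two-tailed 95% t value, 2 degrees of freedom

mean = statistics.mean(runs)
s = statistics.stdev(runs)    # sample standard deviation (n - 1 denominator)
half_width = t_crit * s / math.sqrt(len(runs))

# Same interval computed after rounding s to one decimal place.
half_width_rounded_s = t_crit * round(s, 1) / math.sqrt(len(runs))

# Population form (n denominator), shown only for comparison.
s_population = statistics.pstdev(runs)
```

The population form of the standard deviation evaluates to approximately 0.08 percentage points, so the choice of denominator should be stated whenever the figure is quoted.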

The training behavior in Figure 5 warrants explicit interpretation, since the separation between training and validation accuracy could superficially suggest overfitting. In Stage 2, training accuracy approaches saturation, while validation accuracy stabilizes at around 74%. Two diagnostic observations indicate that this pattern reflects capacity saturation rather than harmful overfitting. First, the validation curves plateau without subsequent decline, in contrast to the monotonic decay that characterizes pathological overfitting. Second, comparable train–validation separations are documented for the strongest published methods on the same benchmark [16,17,18], indicating that the accuracy ceiling on IP102 is governed primarily by intrinsic dataset properties—pronounced class imbalance (per-class counts ranging from 33 to 4544), intra-class morphological variability across developmental stages, and small target objects against cluttered backgrounds. The regularization regime employed (label smoothing, dropout, L2 weight decay, expert dropout, and early stopping) further constrains this gap, and the resulting 74.12% test accuracy lies within the upper envelope of reported results on IP102.

The confusion matrix is reported at the superclass level (Figure 6), aggregating the 102 IP102 categories into the eight crop-based superclasses defined by the dataset taxonomy (Rice, Corn, Wheat, Beet, Alfalfa, Vitis, Citrus, Mango). The superclass-level accuracy reaches 83.77%, compared to 74.23% at the fine-grained 102-class level; the 9.54 percentage-point gap corresponds to errors confined within the same crop superclass, i.e., misclassifications between taxonomically related species sharing the same host crop. Per-superclass accuracy is highest for Alfalfa (88.8%), Corn (88.2%), and Mango (87.4%), and lowest for Beet (68.9%) and Wheat (72.9%), reflecting the greater intra-superclass morphological diversity within root and cereal crops.

3.3. Interpretability: Multi-Scale Routing Maps and Expert Specialization

Figure 7 presents the multi-scale maximum-responsibility maps alongside Grad-CAM overlays for ten representative IP102 test images spanning diverse pest morphologies. Grad-CAM serves as a post hoc single-scale reference computed after training. The conv3 γ-maps consistently highlight fine-grained textural detail including wing venation, surface microstructure, and appendage contours, which is consistent with the textural selectivity characteristic of early feature layers. The conv4 maps shift toward structural boundaries and body part transitions, reflecting the intermediate-layer encoding of compositional geometry. The conv5 maps produce spatially concentrated activations centered on the most semantically discriminative body region, consistent with the high-level abstraction encoded at this depth. Across all ten examples, the three routing maps provide non-redundant spatial information among themselves and relative to the Grad-CAM reference.

Figure 7. Multi-scale γ routing maps with Grad-CAM comparison.

Systematic class-level expert specialization is quantified in Figure 8, which reports the normalized mean gamma per class across all 22,619 test images for the conv4 GMM-MoE module. Expert 5 dominates routing for 11 pest classes, Expert 3 for 7 classes, and Experts 1 and 2 each dominate only 2 classes, indicating a skewed but non-degenerate distribution of class-expert associations. Among the 50 displayed classes, classes 14 through 21 consistently activate Experts 3, 8, and 10, with normalized gamma values ranging from 0.14 to 0.21, substantially above the uniform expectation of 0.10. All 10 experts participate in the routing of at least two classes, confirming that no component collapsed during training.

3.4. Comparison with State of the Art

Table 4 situates the proposed GMM-MoE CNN within the existing IP102 literature. The methods listed share a common architectural philosophy: accuracy improvements are achieved through ensemble construction, backbone scaling, or auxiliary self-supervised objectives. The proposed framework pursues a fundamentally different goal—replacing heuristic attention weighting with a statistically grounded probabilistic routing mechanism. To our knowledge, no prior work has applied GMM posterior responsibilities as a spatial routing gate within a CNN feature hierarchy for fine-grained pest classification. The accuracy results in Table 4, therefore, serve not primarily as a competitive benchmark, but as empirical validation that probabilistic routing is viable: a novel paradigm can match or approach the performance of dedicated accuracy-optimized methods without sacrificing interpretability or requiring ensemble inference.

The standard DenseNet-121 backbone trained under identical two-stage fine-tuning conditions achieves 61.10%, establishing a single-backbone reference point. Ensemble methods that combine multiple independently trained networks, such as the six-CNN ensemble of Nanni et al. [16], which achieves 74.11%, consistently improve over single-model baselines but come at the cost of increased training and inference overhead. Among Transformer-based approaches, the self-supervised Vision Transformer pre-training strategy of Liu et al. [17] achieves 74.69%, requiring substantially greater model capacity and an auxiliary self-supervised pre-training stage. Chen et al. [19] proposed a multi-image feature localization and adaptive filtering fusion framework achieving 73.90% under the standard single-image classification protocol, demonstrating that well-designed CNN pipelines with feature fusion remain competitive on this benchmark.

The proposed model achieves 74.12% accuracy using DenseNet-121 as the sole backbone, without auxiliary pre-training or ensemble construction. Liu et al. [17] currently hold the highest reported accuracy on IP102 (74.69%), obtained through self-supervised Vision Transformer pre-training. The 74.12% achieved by the proposed model lies 0.57 percentage points below this figure while requiring only ImageNet pre-training, and the contribution of this work is orthogonal to that benchmark: a probabilistic, interpretable routing mechanism that is not available in ViT-based or ensemble approaches. The result is also statistically comparable to the six-CNN ensemble of Nanni et al. [16] (74.11%), which falls inside the 95% confidence interval reported in Section 3.2, while the proposed approach offers a fundamentally different value proposition: interpretable, probabilistically grounded routing decisions that expose the internal reasoning of the model, a property unavailable with ensemble combination strategies.

Two aspects of the proposed approach distinguish it from the competitive methods in Table 4. First, the accuracy gain over the DenseNet-121 baseline is achieved entirely through probabilistic GMM routing rather than through architectural scaling, ensemble construction, or auxiliary supervision. Second, the multi-scale GMM posterior provides a spatially resolved, probabilistically interpretable routing mechanism that, as demonstrated in Section 3.3, yields structured expert specialization aligned with the visual hierarchy of the IP102 pest categories. These properties are not offered by the attention weighting functions or ensemble combination strategies used in the compared methods.

4. Discussion

The ablation results in Section 3.1 reveal two complementary patterns. First, accuracy increases with

K

at every scale, indicating that the 102-class IP102 embedding space is genuinely multimodal at each abstraction level and that the GMM routing mechanism successfully partitions it into meaningful sub-regions. The consistent improvement across all three scales rules out the possibility that gains are scale-specific artifacts. Second, the transient accuracy decline at conv4 for

K

= 6 and

K

= 8 is informative rather than anomalous. As the number of components increases, the conditional prior

π (x)

must distribute routing responsibility across a larger component set before the gating network accumulates sufficient gradient signal to stabilize. The recovery at

K

= 10, accompanied by the highest conv4 accuracy in the ablation, confirms that this instability is a transient training dynamics effect resolved once the mixture capacity matches the structural complexity of the 14 × 14 intermediate feature space. Together, these two patterns suggest that the optimal

K

reflects both the statistical complexity of the class distribution and the spatial resolution of the feature map at each scale.

The reproducibility analysis in Section 3.2 shows a sample standard deviation of approximately 0.10 percentage points across three independent runs, which is notably low for a 102-class fine-grained benchmark. This stability is attributable to two design decisions. First, farthest-point sampling initialization of the component means ensures well-separated initial prototypes and reduces sensitivity to random initialization. Second, the load balance loss prevents component collapse throughout training. The beta branch weights converged to qualitatively similar values across the three runs, indicating that the multi-scale fusion mechanism reliably identifies the relative informativeness of each scale regardless of initialization. This property has been linked to enhanced performance stability and reliable state estimation in multi-rate monitoring systems [34].

The routing hyperparameters

T

,

λ_{b}

, and

β

merit separate consideration as potential sources of untapped performance. These parameters govern, respectively, the sharpness of the GMM posterior distribution over experts, the degree to which load balancing is enforced during training, and the magnitude of the separation incentive that discourages expert collapse. Although candidate values for each parameter were evaluated independently at each feature scale prior to final configuration selection, a principled joint optimization across all three parameters and all three scales was not conducted owing to computational constraints. The reported values therefore represent a locally but not necessarily globally optimal setting. Notably, lower temperature values are expected to sharpen routing distributions at the semantically richer conv5 level, where class-discriminative information is spatially concentrated. In contrast, higher temperatures may be more appropriate at conv3, where soft assignments over diverse low-level texture regions are beneficial. Disentangling these scale-specific sensitivities through a systematic hyperparameter search constitutes a meaningful direction for improving the reported accuracy without modifying the underlying architecture.

The interpretability evidence in Section 3.3 connects the quantitative performance gains to a mechanistic explanation. The non-redundant routing maps across conv3, conv4, and conv5 indicate that the three GMM-MoE modules learn complementary spatial decompositions rather than duplicating each other’s attention patterns. This complementarity is the mechanism through which the multi-scale ensemble outperforms any single-scale configuration. Each module contributes uniquely to the final prediction by routing to spatial locations that neither of the other modules prioritizes. The class-expert heatmap in Figure 8 further reveals that this complementarity operates not only spatially but also semantically. The concentration of routing for the hemipteran-group classes 14 to 21 onto Experts 3, 8, and 10 with systematically elevated gamma values provides evidence that specific experts have specialized in discriminating visually confusable species pairs. This specialization emerges entirely from the end-to-end training objective without any explicit supervisory signal for expert assignment. This probabilistic transparency has direct relevance beyond benchmark performance. In precision agriculture settings, automated pest identification systems increasingly face regulatory and practitioner requirements for explainable decisions. The class-expert heatmap in Figure 8 enables agronomists to audit model behavior at the species level, revealing which visual subregions drive classification for a given pest category without recourse to post hoc approximation methods.

The comparison with state-of-the-art methods in Section 3.4 places the proposed approach in a broader context. The 13.02 percentage point improvement over the standard DenseNet-121 baseline (61.10% to 74.12%) demonstrates that the GMM-MoE modules introduce substantial representational capacity beyond that of the backbone alone. The competitive standing relative to transformer-based methods that rely on substantially larger architectures or auxiliary self-supervised pre-training objectives confirms that probabilistic multi-scale routing provides an effective and architecturally lightweight alternative for achieving high accuracy on IP102. Unlike heuristic attention mechanisms that learn a dedicated spatial weighting module, the proposed architecture derives spatial importance directly from posterior inference in a well-defined mixture model. Each expert’s activation is linked to the Gaussian likelihood in the projected feature space, making routing decisions probabilistically interpretable and enabling principled diagnostics through standard statistical tools including the separation score, component overlap analysis, and entropy tracking. This probabilistic grounding distinguishes the proposed approach from architectures that achieve similar accuracy through less interpretable means, and it is the property that makes the class-expert heatmap in Figure 8 a meaningful diagnostic rather than a post hoc rationalization.

Several limitations of the present study should be acknowledged. The ablation was conducted up to

K

= 10, and whether larger values would offer further benefit under a larger dataset remains an open question. The architecture was evaluated exclusively on IP102, and the transferability of the learned expert specialization patterns to other fine-grained pest datasets or to crop disease classification has not been demonstrated. Additionally, a thorough computational efficiency analysis comparing the proposed model against a standard DenseNet-121 baseline would be required before deployment in resource-constrained agricultural edge systems.

5. Conclusions

This work introduced the Multi-Scale Gaussian Mixture Model-Gated Mixture of Experts (GMM-MoE) architecture, a probabilistic plug-in module evaluated on DenseNet-121 at three feature extraction depths for fine-grained insect pest classification on IP102. The core contribution is the replacement of deterministic attention weighting with analytic GMM posterior responsibilities computed in closed form, where each spatial location is routed to dedicated convolutional expert sub-networks through the posterior p(z = k ∣ h_{ij}). Three further design elements support this routing. A conditional prior mechanism π(x) renders the mixing coefficients input-dependent via a learnable blending coefficient α. Precision-based variance parametrisation with dimension-aware temperature scaling ensures stable routing across layers of differing channel dimensionality. Data-driven expert center initialization via farthest-point sampling prevents component collapse during training. Applied at three backbone depths and fused through spatial alignment and channel projection, the ensemble achieves a mean top-1 accuracy of 74.12% on IP102 across three independent runs, with a sample standard deviation of approximately 0.10 percentage points (Table 3). To the best of our knowledge, this constitutes the first systematic application of GMM-based probabilistic routing within a multi-scale CNN feature hierarchy for fine-grained pest classification, opening a new direction for statistically interpretable expert routing in agricultural deep learning.
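As a rough illustration of the routing summarised above, the following sketch computes posterior responsibilities p(z = k ∣ h) for a single projected feature vector under a GMM with diagonal precisions. The blending rule used for π(x), the choice to divide the log-scores by a plain scalar temperature T, and all function and variable names are assumptions made for illustration only; they are not taken from the authors' implementation, and the dimension-aware part of the temperature scaling is not modelled here.

```python
import math

def log_gaussian_diag(h, mean, precision):
    """Log-density of a Gaussian with diagonal precision matrix."""
    d = len(h)
    log_det = sum(math.log(p) for p in precision)  # log-determinant of the precision
    maha = sum(p * (x - m) ** 2 for x, m, p in zip(h, mean, precision))
    return 0.5 * (log_det - d * math.log(2.0 * math.pi) - maha)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def responsibilities(h, means, precisions, fixed_prior, image_prior, alpha, temperature):
    """Posterior over K components for one feature vector h."""
    # Assumption: pi(x) is a convex blend of a fixed prior and an image-dependent prior.
    prior = [(1.0 - alpha) * f + alpha * g for f, g in zip(fixed_prior, image_prior)]
    # Assumption: the temperature divides the log joint score of each component.
    scores = [
        (math.log(pk) + log_gaussian_diag(h, mk, lk)) / temperature
        for pk, mk, lk in zip(prior, means, precisions)
    ]
    return softmax(scores)

# Toy example with K = 2 components in a 2-dimensional latent space.
means = [[0.0, 0.0], [2.0, 2.0]]
precisions = [[1.0, 1.0], [1.0, 1.0]]
resp = responsibilities([0.1, -0.2], means, precisions,
                        fixed_prior=[0.5, 0.5], image_prior=[0.7, 0.3],
                        alpha=0.3, temperature=1.6)
print(resp)
```

With temperature 1 and equal priors the result reduces to the textbook GMM E-step; larger temperatures flatten the responsibilities so that more experts receive a non-negligible weight at each location.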

The probabilistic structure of the routing mechanism confers interpretability properties that are not available from heuristic attention approaches. Because routing weights are posterior probabilities under a well-defined generative model, each expert’s activation is statistically grounded rather than being a learned projection without probabilistic semantics. The resulting class-level expert specialization, which emerges entirely from the end-to-end training objective, provides a principled basis for model diagnostics and failure analysis in precision agriculture applications where transparency of automated decisions is increasingly required.
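Because the routing weights are probabilities, simple information-theoretic summaries can be computed from them directly. The sketch below shows one possible diagnostic, the mean routing entropy normalised by log K; the function names and the normalisation are illustrative assumptions and do not reproduce the specific diagnostics used in the paper.

```python
import math

def routing_entropy(resp):
    """Shannon entropy in nats of one responsibility vector (0 * log 0 treated as 0)."""
    return -sum(r * math.log(r) for r in resp if r > 0.0)

def mean_normalised_entropy(resp_map):
    """Average entropy over spatial locations, scaled by log(K) to lie in [0, 1].

    resp_map is a list of responsibility vectors, one per spatial location,
    each of length K > 1.
    """
    k = len(resp_map[0])
    total = sum(routing_entropy(r) for r in resp_map)
    return total / (len(resp_map) * math.log(k))

# One location routed to a single expert, one routed uniformly over K = 4 experts.
example = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
print(mean_normalised_entropy(example))
```

Values near 0 indicate that each location is assigned almost entirely to one expert, whereas values near 1 indicate nearly uniform routing, which may signal poorly separated components.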

The present study has several limitations. The GMM-MoE modules were evaluated exclusively on IP102, and generalization to other fine-grained pest benchmarks or crop disease datasets remains to be demonstrated. Computational overhead relative to the standard DenseNet-121 backbone has not been systematically quantified, which is a prerequisite for deployment in resource-constrained agricultural edge systems.

Future work will pursue five complementary directions to address these limitations and extend the framework. First, scale-differentiated lightweight expert variants, specifically Squeeze-and-Excitation bottleneck designs at deeper scales (conv4, conv5) and depthwise-separable convolution blocks at shallower scales (conv3), are expected to introduce depth-level expert specialization orthogonal to the kernel-size differentiation employed in the present work, while substantially reducing per-expert parameter count and inference latency. Combined with standard model compression techniques (quantisation, pruning, knowledge distillation), these variants would enable operational deployment of GMM-MoE on edge platforms such as NVIDIA Jetson-class GPUs and TensorFlow Lite mobile runtimes for in situ pest monitoring on UAVs and field cameras. Second, the GMM-MoE framework will be extended to additional fine-grained agricultural datasets and to hierarchical classification settings that exploit the taxonomic superclass structure of IP102 and related benchmarks. Third, the routing hyperparameters T, λ_{b}, and β were identified through a structured manual search in the present study; a principled Bayesian optimization over the joint hyperparameter space is expected to yield measurable accuracy improvements without requiring architectural modification. Fourth, as the GMM-MoE module operates as a backbone-agnostic plug-in whose routing is defined entirely within the projected latent space of dimension D, evaluating the framework with stronger backbone networks, including EfficientNetV2 variants and self-supervised Vision Transformer architectures, would establish the performance ceiling of the proposed routing mechanism. Fifth, the conditional prior mechanism π(x) currently adapts the routing to individual image content but operates with fixed component statistics; extending the framework with online updates of the GMM means and precisions during inference would enable continual adaptation to distribution shifts encountered in field deployments, such as new pest variants or changing illumination conditions. Taken together, these directions form a coherent program of future work aimed at advancing beyond the 75% accuracy threshold on IP102 that has remained elusive in the literature.
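To give a sense of the parameter savings that depthwise-separable experts could bring, the following sketch compares weight counts for a standard convolution and a depthwise-separable one. It assumes, purely for illustration, that an expert is a single D-to-D convolution with the conv3 settings from Table 1 (D = 256, 5 × 5 kernel); the actual expert sub-network layout is not specified in this section, and biases and normalisation layers are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a dense k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out

# Hypothetical single-layer expert at conv3: D = 256 channels, 5 x 5 kernel.
d, k = 256, 5
standard = standard_conv_params(d, d, k)
separable = depthwise_separable_params(d, d, k)
print(standard, separable, round(standard / separable, 1))  # 1638400 71936 22.8
```

Under these assumptions the separable variant uses roughly 23 times fewer weights per expert, and the saving shrinks as the kernel size decreases.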

Author Contributions

Conceptualization, N.Ş.; methodology, N.Ş.; software, N.Ş.; formal analysis, N.Ş.; writing—original draft preparation, N.Ş.; visualization, N.Ş.; writing—review and editing, N.A. and D.H.; supervision, N.A. and D.H.; funding acquisition, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Bingöl University Scientific Research Projects Coordination Unit, grant number PİKOM-Bitki.2025.001.

Data Availability Statement

The IP102 dataset analyzed in this study is publicly available at https://github.com/xpwu95/IP102, accessed on 19 May 2026. The source code supporting the findings of this study is available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank the Bingöl University Scientific Research Projects Coordination Unit for their support.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Oerke, E.-C. Crop losses to pests. J. Agric. Sci. 2006, 144, 31–43.
2. Savary, S.; Willocquet, L.; Pethybridge, S.J.; Esker, P.; McRoberts, N.; Nelson, A. The global burden of pathogens and pests on major food crops. Nat. Ecol. Evol. 2019, 3, 430–439.
3. Parsa, S.; Morse, S.; Bonifacio, A.; Chancellor, T.C.B.; Condori, B.; Crespo-Pérez, V.; Hobbs, S.L.A.; Kroschel, J.; Ba, M.N.; Rebaudo, F.; et al. Obstacles to integrated pest management adoption in developing countries. Proc. Natl. Acad. Sci. USA 2014, 111, 3889–3894.
4. Aktar, W.; Sengupta, D.; Chowdhury, A. Impact of pesticides use in agriculture: Their benefits and hazards. Interdiscip. Toxicol. 2009, 2, 1–12.
5. Barzman, M.; Bàrberi, P.; Birch, A.N.E.; Boonekamp, P.; Dachbrodt-Saaydeh, S.; Graf, B.; Hommel, B.; Jensen, J.E.; Kiss, J.; Kudsk, P.; et al. Eight principles of integrated pest management. Agron. Sustain. Dev. 2015, 35, 1199–1215.
6. Lima, M.C.F.; Leandro, M.E.D.A.; Valero, C.; Coronel, L.C.P.; Bazzo, C.O.G. Automatic detection and monitoring of insect pests—A review. Agriculture 2020, 10, 161.
7. Austen, G.E.; Bindemann, M.; Griffiths, R.A.; Roberts, D.L. Species identification by experts and non-experts: Comparing images from field guides. Sci. Rep. 2016, 6, 33634.
8. Xie, C.; Wang, R.; Zhang, J.; Chen, P.; Dong, W.; Li, R.; Chen, T.; Chen, H. Multi-level learning features for automatic classification of field crop pests. Comput. Electron. Agric. 2018, 152, 233–241.
9. Kamilaris, A.; Prenafeta-Boldú, F.X. A review of the use of convolutional neural networks in agriculture. J. Agric. Sci. 2018, 156, 312–322.
10. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90.
11. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
14. Wu, X.; Zhan, C.; Lai, Y.-K.; Cheng, M.-M.; Yang, J. IP102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8779–8788.
15. Gomes, J.C.; Borges, D.L. Insect pest image recognition: A few-shot machine learning approach including maturity stages classification. Agronomy 2022, 12, 1733.
16. Nanni, L.; Manfè, A.; Maguolo, G.; Lumini, A.; Brahnam, S. High performing ensemble of convolutional neural networks for insect pest image detection. Ecol. Inform. 2022, 67, 101515.
17. Liu, H.; Zhan, Y.; Xia, H.; Mao, Q.; Tan, Y. Self-supervised transformer-based pre-training method using latent semantic masking auto-encoder for pest and disease classification. Comput. Electron. Agric. 2022, 203, 107448.
18. Xia, W.; Han, D.; Li, D.; Wu, Z.; Han, B.; Wang, J. An ensemble learning integration of multiple CNN with improved vision transformer models for pest classification. Ann. Appl. Biol. 2023, 182, 144–158.
19. Chen, Y.; Chen, M.; Guo, M.; Wang, J.; Zheng, N. Pest recognition based on multi-image feature localization and adaptive filtering fusion. Front. Plant Sci. 2023, 14, 1282212.
20. Qian, Y.; Xiao, Z.; Deng, Z. Fine-grained crop pest classification based on multi-scale feature fusion and mixed attention mechanisms. Front. Plant Sci. 2025, 16, 1500571.
21. An, J.; Du, Y.; Hong, P.; Zhang, L.; Weng, X. Insect recognition based on complementary features from multiple views. Sci. Rep. 2023, 13, 2966.
22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
23. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87.
24. Jordan, M.I.; Jacobs, R.A. Hierarchical mixtures of experts and the EM algorithm. In Proceedings of the 1993 International Joint Conference on Neural Networks (IJCNN-93), Nagoya, Japan, 25–29 October 1993; Volume 2, pp. 1339–1344.
25. Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; Houlsby, N. Scaling vision with sparse mixture of experts. Adv. Neural Inf. Process. Syst. 2021, 34, 8583–8595.
26. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
27. Variani, E.; McDermott, E.; Heigold, G. A Gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 19–24 April 2015; pp. 4270–4274.
28. van den Oord, A.; Schrauwen, B. Factoring variations in natural images with deep Gaussian mixture models. Adv. Neural Inf. Process. Syst. 2014, 27, 3518–3526.
29. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
30. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. In Learning in Graphical Models; Jordan, M.I., Ed.; Springer: Dordrecht, The Netherlands, 1998; pp. 105–161.
31. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; ISBN 978-0-387-31073-2.
32. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
33. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
34. Wang, Y.; Shi, Y.; Yang, T.; Wang, W.; Sun, Z.; Zhang, Y. Structural performance warning based on computer intelligent monitoring and fractional-order multi-rate Kalman fusion method. Fractal Fract. 2026, 10, 186.
35. Ayan, E.; Erbay, H.; Varçın, F. Crop pest classification with a genetic algorithm-based weighted ensemble of deep convolutional neural networks. Comput. Electron. Agric. 2020, 179, 105809.
36. Gan, Y.; Guo, Q.; Wang, C.; Liang, W.; Xiao, D.; Wu, H. Recognizing crop pests using an improved EfficientNet model. Trans. Chin. Soc. Agric. Eng. 2022, 38, 203–211.
37. Zheng, T.; Yang, X.; Lv, J.; Mi, L.; Wang, S.; Li, W. An efficient mobile model for insect image classification in the field pest management. Eng. Sci. Technol. Int. J. 2023, 39, 101335.

Figure 1. Representative pest species from the IP102 benchmark dataset [14], showing 36 of the 102 pest classes sampled with a fixed random seed for reproducibility. Each subfigure is labeled (a–aj) with its corresponding species name shown above the image. The figure illustrates the principal sources of difficulty in fine-grained pest classification: inter-class visual similarity among taxonomically related species, intra-class variability across developmental stages, small relative scale of target objects, and complex natural backgrounds with varying illumination.

Figure 2. The overall architecture of the proposed GMM-MoE CNN. Three GMM-MoE modules are attached at the conv3, conv4, and conv5 outputs of a DenseNet-121 backbone, scale-aligned to a common 14 × 14 × 512 representation, and fused for 102-class pest classification.

Figure 3. DenseNet-121 Backbone and GMM-MoE Integration Points.

Figure 4. The internal architecture of a single GMM-MoE module, showing the five sequential processing stages: (i) feature projection, (ii) E-Step GMM routing, (iii) M-Step expert processing, (iv) output residual fusion, and (v) auxiliary loss computation.

Figure 5. Training and validation accuracy and loss curves for Stage 1 (backbone frozen) and Stage 2 (full fine-tuning). Top row: Stage 1 accuracy (left) and cross-entropy loss (right) over training epochs. Bottom row: Stage 2 accuracy (left) and loss (right).

Figure 6. The superclass-level confusion matrix of the proposed GMM-MoE CNN on the IP102 test set (22,619 images), aggregated over the eight crop-based superclasses. (a) Raw sample counts; (b) row-normalized percentages of the true class. Diagonal values indicate the per-class accuracy for each superclass. Fine-grained (102-class) accuracy is 74.23% and superclass-level accuracy is 83.77%.

Figure 7. Multi-scale maximum-responsibility maps and Grad-CAM overlays for ten representative IP102 test images.

Figure 8. Class-level expert specialization heatmap for conv4 GMM-MoE module, computed across all 22,619 IP102 test images.

Table 1. GMM-MoE module hyperparameters for each feature scale in the final ensemble configuration. Expert kernel sizes (5 × 5/3 × 3/1 × 1) are differentiated across scales to match the convolutional receptive field to the spatial resolution available at each backbone depth (28 × 28/14 × 14/7 × 7).

Parameter	Conv3	Conv4	Conv5
Number of experts K	10	10	10
Projection dimension D	256	512	512
Expert kernel size	5 × 5	3 × 3	1 × 1
Temperature T	1.3	1.6	1.4
Load balance weight λ_{b}	0.02	0.03	0.03
Routing quality weight β	0.05	0.10	0.10
Expert dropout rate	0.1	0.1	0.1

Table 2. Ablation results on the IP102 test set. Top-1 accuracy (%) is reported for each single-scale GMM-MoE configuration across six expert counts. K = 1 denotes the lower-bound configuration in which the module reduces to a standard residual convolution. The best result per scale is obtained at K = 10 in all three cases. All results correspond to Stage 2 (full fine-tuning), and accuracies are reported on the 22,619-image test set.

Configuration	K	Conv3 (%)	Conv4 (%)	Conv5 (%)
K = 1 (lower bound)	1	65.66	72.37	72.47
K = 2	2	66.40	72.79	72.72
K = 4	4	67.45	73.00	73.10
K = 6	6	67.92	72.94	73.11
K = 8	8	67.95	72.91	73.40
K = 10	10	68.39	73.28	73.64
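The trends in Table 2 can be checked with a few lines of arithmetic. The sketch below copies the tabulated accuracies and reports, for each scale, the gain from K = 1 to K = 10 and whether accuracy is non-decreasing at every step; the helper name `summarise` is hypothetical.

```python
# Top-1 accuracies (%) from Table 2, ordered by K = 1, 2, 4, 6, 8, 10.
accuracy = {
    "conv3": [65.66, 66.40, 67.45, 67.92, 67.95, 68.39],
    "conv4": [72.37, 72.79, 73.00, 72.94, 72.91, 73.28],
    "conv5": [72.47, 72.72, 73.10, 73.11, 73.40, 73.64],
}

def summarise(acc):
    """Return (gain from first to last K, True if accuracy never decreases)."""
    gain = round(acc[-1] - acc[0], 2)
    non_decreasing = all(b >= a for a, b in zip(acc, acc[1:]))
    return gain, non_decreasing

for scale, acc in accuracy.items():
    print(scale, summarise(acc))
# conv3 (2.73, True)
# conv4 (0.91, False)
# conv5 (1.17, True)
```

The overall gain is positive at every scale and largest at conv3, while conv4 dips slightly between K = 4 and K = 8 before reaching its best value at K = 10.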

Table 3. Reproducibility analysis of the proposed ensemble model (conv3, conv4, and conv5 with K = 10) over three independent training runs on the IP102 test set.

Run 1 (%)	Run 2 (%)	Run 3 (%)	Mean (%)	Std (%)	95% CI (%)
74.03	74.11	74.22	74.12	0.1	74.12 ± 0.25

95% CI computed using the two-tailed t-distribution (df = 2, t = 4.303). Std: sample standard deviation.
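The statistics in Table 3 can be recomputed from the three run accuracies. The sketch below uses only the Python standard library and takes the critical value t = 4.303 from the table footnote; the variable names are illustrative.

```python
import math
import statistics

runs = [74.03, 74.11, 74.22]      # top-1 accuracy (%) of the three runs in Table 3
t_crit = 4.303                    # two-tailed 95% critical value for df = 2 (from the footnote)

mean = statistics.mean(runs)
std = statistics.stdev(runs)      # sample standard deviation (n - 1 in the denominator)
half_width = t_crit * std / math.sqrt(len(runs))

print(f"{mean:.2f}")              # 74.12
print(f"{std:.3f}")               # 0.096
print(f"{half_width:.2f}")        # 0.24

# Using the standard deviation rounded to one decimal place, as printed in the table:
print(f"{t_crit * 0.1 / math.sqrt(3):.2f}")  # 0.25
```

The unrounded sample standard deviation gives a half-width of about 0.24 percentage points, and the tabulated ±0.25 is reproduced when the rounded value 0.1 is inserted instead.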

Table 4. Comparison of the proposed GMM-MoE CNN with representative methods on the IP102 benchmark (102-class top-1 accuracy). Except where marked †, results are taken from the respective original publications and correspond to the same official test split. † Trained under Stage 1 conditions only (frozen backbone) to provide a controlled reference.

Method	Year	Backbone	Approach	Acc. (%)
Wu et al. [14]	2019	ResNet-50	Standard fine-tuning	49.40
(†) DenseNet-121 [11]	2017	DenseNet-121	Standard fine-tuning	61.10
Ayan et al. [35]	2020	Multi-CNN ensemble	GAEnsemble (VGG/ResNet/Inception/Xception/MobileNet)	67.13
Gan et al. [36]	2022	EfficientNet	Coordinate attention	69.45
Nanni et al. [16]	2022	CNN ensemble	6-CNN + improved Adam optimizer	74.11
Zheng et al. [37]	2023	EfficientNetV2	PCNet with coordinate attention	73.70
Xia et al. [18]	2023	DenseNet-201 + ViT	Multi-branch multi-scale ensemble	74.20
Liu et al. [17]	2022	ViT	Self-supervised pre-training (LSMAE)	74.69
Chen et al. [19]	2023	CNN (ResNet-based)	Multi-image feature localization and adaptive filtering fusion	73.90
GMM-MoE CNN (proposed)	2026	DenseNet-121	Multi-scale probabilistic GMM-gated MoE routing	74.12

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
