4.2.1. CNN Architecture
A standard CNN architecture typically begins with an input layer, followed by multiple convolutional and pooling layers stacked alternately in a hierarchical manner. The network then terminates with fully connected layers that map the extracted features to the output and generate the final predictions. The overall architecture is illustrated in Figure 10.
Depending on the dimensionality of the input, CNNs are generally categorized into one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) variants. Considering the differences in temporal and spatial representations of the sensing data in this study, the network branches are configured in a targeted manner. For 1D time-series signals—such as power-frequency leakage current, PD signals, and voltage waveforms—1D convolutional kernels along the time axis are employed for feature extraction. For 2D representational data, such as temperature-field distributions, 2D convolutional kernels are adopted to model spatial features.
In this study, convolutions are performed in "same" mode, as indicated by the red box in Figure 11. Specifically, symmetric boundary padding is applied to extend the edges of the input feature map (3 × 3), ensuring that, when a 3 × 3 convolution kernel is used, the output feature map remains exactly the same size as the input. This mode preserves effective extraction of edge features while maintaining continuity of feature propagation through parameter sharing, thereby providing the network with a stable hierarchical flow of spatial information.
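The size-preserving property of "same" mode can be checked with a short calculation. The sketch below is illustrative only: it assumes stride 1 and an odd kernel size, and the helper names `same_padding` and `conv_output_size` are hypothetical.

```python
def same_padding(kernel_size: int) -> int:
    """Padding per side that keeps the output size equal to the input size
    (assumes stride 1 and an odd kernel size)."""
    if kernel_size % 2 == 0:
        raise ValueError("symmetric same-mode padding needs an odd kernel size")
    return (kernel_size - 1) // 2


def conv_output_size(input_size: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Output length along one axis of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


pad = same_padding(3)                           # one padded cell on each side
padded_size = 3 + 2 * pad                       # the 3 x 3 map is extended to 5 x 5
out_size = conv_output_size(3, 3, pad)          # a 3 x 3 kernel brings it back to 3
unpadded_size = conv_output_size(3, 3, 0)       # without padding the map shrinks to 1 x 1
print(pad, padded_size, out_size, unpadded_size)
```

Without padding, a 3 × 3 kernel would reduce the 3 × 3 map to a single value, which is why edge information is lost in that case.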
As shown in Figure 11, the convolution operation follows (1):

$$H_n = f\left(w_n \ast H_{n-1} + b_n\right) \tag{1}$$

where $H_n$ denotes the output of the $n$-th convolutional layer; $H_{n-1}$ denotes the input to the $n$-th convolutional layer (which is also the output of the $(n-1)$-th layer); $f$ is the activation function; $\ast$ denotes the convolution operation; and $w_n$ and $b_n$ are the weights and bias of the $n$-th convolutional layer, respectively.
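A minimal sketch of one convolutional layer applied along the time axis of a single-channel sequence is given below. It is not the implementation used in this study: the ReLU activation, zero padding, kernel values, and the function name `conv1d_same` are all assumptions chosen for illustration, and the sliding product is computed as cross-correlation, as deep-learning frameworks commonly do.

```python
def relu(x: float) -> float:
    """Assumed activation function f."""
    return max(0.0, x)


def conv1d_same(h_prev, w, b, f=relu):
    """Layer output f(w * h_prev + b) for a single-channel sequence.

    The input is zero-padded symmetrically (odd kernel length assumed),
    so the output has the same length as the input.
    """
    k = len(w)
    pad = (k - 1) // 2
    padded = [0.0] * pad + list(h_prev) + [0.0] * pad
    return [
        f(sum(w[j] * padded[i + j] for j in range(k)) + b)
        for i in range(len(h_prev))
    ]


h0 = [1.0, 2.0, 3.0, 4.0]                          # toy input sequence
h1 = conv1d_same(h0, w=[-1.0, 0.0, 1.0], b=0.5)    # difference-like kernel
print(h1)  # [2.5, 2.5, 2.5, 0.0]
```

The last output is clipped to zero by the activation because the padded zero on the right makes the weighted sum negative.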
The pooling layer reduces the resolution of feature maps, thereby decreasing the number of parameters and computational complexity and alleviating overfitting. The pooling operation is illustrated in Figure 12.
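The resolution reduction can be illustrated with max pooling on a small feature map. The 2 × 2 window, the stride of 2, and the input values below are assumptions for illustration and are not taken from the model configuration.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max pooling over a 2D feature map given as a list of rows."""
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(
                fmap[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(fmap)
print(pooled)  # [[6, 5], [7, 9]]
```

Each 2 × 2 window is replaced by its largest value, so the 4 × 4 map is reduced to 2 × 2 and later layers process a quarter of the values.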
The output layer maps the continuous feature representation produced by the fully connected module to a class-level probability distribution, and a Softmax classifier is employed to determine the diagnostic category. The Softmax function is given by (2):

$$p(y = m \mid z) = \frac{e^{z_m}}{\sum_{k=1}^{K} e^{z_k}} \tag{2}$$

where $p(y = m \mid z)$ is the probability that the input vector $z$, of length $K$, belongs to class $m$, and $z_m$ is the $m$-th element of $z$.
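The Softmax mapping can be sketched in a few lines. The input scores below are arbitrary illustrative values, and subtracting the maximum score before exponentiation is a common numerical-stability step that does not change the result.

```python
import math


def softmax(z):
    """Convert a score vector z into class probabilities."""
    z_max = max(z)                              # shift for numerical stability
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]                        # illustrative scores for K = 3 classes
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The probabilities sum to 1, and the diagnostic category is the class with the largest probability.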
4.2.3. Fundamental Principle of the SSA Algorithm
The sparrow search algorithm (SSA) is a swarm-intelligence optimization method that searches the solution space by simulating the division of labor and cooperative behavior of sparrows during foraging and anti-predation processes.
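In the sparrow search algorithm, the sparrow population is represented by the following matrix:

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{bmatrix}$$

where $n$ is the number of sparrows, $d$ is the number of variables to be optimized, and the historically best position is denoted as $X_{best}$. The function $f(x)$ is defined as the food amount in the current region, which represents the fitness value of the objective function. The fitness value of each sparrow can therefore be calculated directly from its position.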
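The population and its fitness values can be sketched as follows. This is an illustrative sketch only: `toy_fitness` is a hypothetical stand-in for the real objective, positions are assumed to lie in the unit hypercube, and a larger fitness value is assumed to mean more food.

```python
import random


def init_population(n, lower, upper, rng):
    """Create n sparrows with positions drawn uniformly inside the bounds."""
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n)
    ]


def toy_fitness(x):
    """Hypothetical objective: larger is better, maximum at 0.3 in every dimension."""
    return -sum((v - 0.3) ** 2 for v in x)


rng = random.Random(0)
population = init_population(10, [0.0] * 5, [1.0] * 5, rng)
fitness = [toy_fitness(x) for x in population]

# The best position found so far is the row with the largest fitness value.
best_index = max(range(len(population)), key=fitness.__getitem__)
x_best = population[best_index]
print(best_index, fitness[best_index])
```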
Once danger occurs, the discoverers need to quickly lead the population to evacuate the current area; if there is no danger, the discoverers need to search for new food sources over a wider range. The update equation for the discoverers is:
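$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(-\dfrac{i}{\alpha \, t_{max}}\right), & R_2 < ST \\ X_{i,j}^{t} + q \cdot L, & R_2 \ge ST \end{cases}$$

where $X_{i,j}^{t}$ is the position of the $i$-th sparrow in the $j$-th dimension at iteration $t$; $t$ denotes the current iteration number; $t_{max}$ is the maximum number of iterations in the entire process; $\alpha$ is a random value generated in an open interval; $R_2$ is the preset warning value; $ST$ is the safety threshold; $q$ is a random number following a normal distribution; and $L$ is a row matrix whose elements are all 1.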
When R2 is less than ST, the population is in a safe state and performs a wide-range search. Conversely, if R2 is greater than or equal to ST, it indicates the presence of predation threats, and the population, guided by the vigilant individuals, moves to a safer area to continue foraging.
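The two search modes of the discoverers can be sketched as below. The sketch assumes the standard SSA formulation: in the safe state the position is scaled by exp(−i / (α · t_max)), and under threat it is shifted by q · L. Drawing α from (0, 1], drawing q from a standard normal distribution, and the function name `update_discoverer` are assumptions.

```python
import math
import random


def update_discoverer(x, i, t_max, r2, st, rng):
    """Return the new position of the i-th discoverer (i starts at 1)."""
    if r2 < st:
        # Safe state: the position is scaled by a random factor in (0, 1).
        alpha = 1.0 - rng.random()              # assumed range (0, 1]
        scale = math.exp(-i / (alpha * t_max))
        return [v * scale for v in x]
    # Threat detected: every dimension is shifted by the same normal step (q * L).
    q = rng.gauss(0.0, 1.0)
    return [v + q for v in x]


rng = random.Random(1)
start = [0.5, -0.8]
safe_move = update_discoverer(start, i=1, t_max=15, r2=0.3, st=0.8, rng=rng)
alert_move = update_discoverer(start, i=1, t_max=15, r2=0.9, st=0.8, rng=rng)
print(safe_move, alert_move)
```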
The update equation for the followers is:
$$X_{i,j}^{t+1} = \begin{cases} q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\ X_{P}^{t+1} + \left|X_{i,j}^{t} - X_{P}^{t+1}\right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases}$$

where $X_P$ represents the best position of the discoverers, while $X_{worst}$ denotes the worst position in the sparrow population at the current time. The matrix $A^{+}$ is derived from a one-row matrix $A$ in which each element takes a value of either 1 or −1, with $A^{+} = A^{T}(AA^{T})^{-1}$. $n$ is the total number of sparrows in the population. When $i > n/2$, the fitness value of the $i$-th sparrow is relatively low; it has not obtained food and therefore needs to fly to other areas to search for food in order to survive.
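A sketch of the follower update is given below, assuming the standard SSA formulation. For a one-row matrix A with entries ±1, the product A·Aᵀ equals the dimension d, so A⁺ reduces to Aᵀ / d and the step becomes a single signed average distance applied to every dimension. The function name and the toy positions are hypothetical.

```python
import math
import random


def update_follower(x, i, n, x_p, x_worst, rng):
    """Return the new position of the i-th follower (i starts at 1)."""
    d = len(x)
    if i > n / 2:
        # Low-ranked follower: fly elsewhere to search for food.
        q = rng.gauss(0.0, 1.0)
        return [q * math.exp((w - v) / i ** 2) for w, v in zip(x_worst, x)]
    # Otherwise: forage near the best discoverer position x_p.
    a = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    step = sum(abs(v - p) * s for v, p, s in zip(x, x_p, a)) / d
    return [p + step for p in x_p]


rng = random.Random(2)
x_p = [0.5, 0.5]
near = update_follower([0.2, 0.6], i=2, n=10, x_p=x_p, x_worst=[0.9, 0.1], rng=rng)
far = update_follower([0.2, 0.6], i=8, n=10, x_p=x_p, x_worst=[0.9, 0.1], rng=rng)
print(near, far)
```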
The initial positions of the vigilant individuals are generated by a randomization method, ensuring that, in the early stage of algorithm iteration, they can be uniformly distributed in the solution space, thereby effectively fulfilling the functions of environmental monitoring and risk warning. The position update equation for the vigilant individuals is:
$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left|X_{i,j}^{t} - X_{best}^{t}\right|, & f_i \neq f_g \\ X_{i,j}^{t} + K \cdot \dfrac{\left|X_{i,j}^{t} - X_{worst}^{t}\right|}{(f_i - f_w) + \varepsilon}, & f_i = f_g \end{cases}$$

where $X_{best}$ represents the global best position at this moment; $\beta$ is a step-size control parameter that regulates the step length; $K$ is a random number in the range (0, 1), which introduces randomness into the movement; $f_i$ denotes the fitness value of the current individual; and $f_g$ and $f_w$ are the best and worst fitness values in the current population, respectively. To prevent the denominator from being zero, a small constant $\varepsilon$ is introduced.
When fi ≠ fg, the individual is mostly located at the edge of the population and faces a higher predation risk. When fi = fg, it indicates that the individual has perceived danger and tends to move toward the population center to reduce the risk.
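The two cases for the vigilant individuals can be sketched as follows, assuming the standard SSA form of the update. The step-size parameter is assumed to be drawn from a standard normal distribution; the second random coefficient (named `k` here) is drawn from the unit interval as stated above, although the original SSA formulation draws it from [−1, 1]. The function name and the value of `eps` are hypothetical.

```python
import random


def update_vigilant(x, f_i, x_best, x_worst, f_g, f_w, rng, eps=1e-8):
    """Return the new position of a vigilant sparrow.

    f_g and f_w are the best and worst fitness values in the population.
    """
    if f_i != f_g:
        # Individual at the edge of the population: jump toward the best position.
        beta = rng.gauss(0.0, 1.0)              # assumed step-size distribution
        return [b + beta * abs(v - b) for v, b in zip(x, x_best)]
    # Individual holding the best fitness: move relative to the worst position.
    k = rng.random()
    return [v + k * abs(v - w) / ((f_i - f_w) + eps) for v, w in zip(x, x_worst)]


rng = random.Random(3)
x_best, x_worst = [0.2, 0.4], [0.9, 0.9]
edge_move = update_vigilant([0.2, 0.4], 0.5, x_best, x_worst, f_g=0.9, f_w=0.1, rng=rng)
center_move = update_vigilant([0.2, 0.4], 0.9, x_best, x_worst, f_g=0.9, f_w=0.1, rng=rng)
print(edge_move, center_move)
```

The constant `eps` only matters when the individual's fitness equals the worst fitness, where it keeps the division finite.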
The overall procedure of SSA is shown in Figure 14.
The power-frequency leakage-current sequences and the PD and voltage-fluctuation feature sequences collected from surge arresters under different operating conditions are taken as inputs, and key features such as amplitude variations and temporal patterns are extracted from each of them by a dedicated branch with two 1D convolutional layers. The temperature branch takes as input the internal temperature measurement points of the varistor blocks in each section at different operating stages, together with the outer-surface temperature data, and adopts a structure composed of three convolutional layers and two pooling units to learn deep representations of the spatial distribution of the temperature field and its temporal variation. Subsequently, the features extracted by the four branches are flattened into 1D vectors through a Flatten layer; after fusion, they are fed into an LSTM network, which further models the temporal dependencies among the multi-source features. Finally, the operating state of the surge arrester and the location of the degraded section are output through a fully connected layer and a Softmax layer. Based on the above analysis, the overall network architecture of the proposed multi-source information fusion degradation assessment algorithm is shown in Figure 15, the multi-source information fusion deterioration assessment process is shown in Figure 16, and the structural framework of the degradation assessment model for multi-source information fusion is shown in Figure 17.
To improve reproducibility, the detailed implementation settings of the proposed multi-branch CNN–LSTM model are summarized in Table 10. The leakage-current, voltage, and PD branches each adopt two 1D convolutional layers followed by a max-pooling layer for local temporal feature extraction, whereas the temperature branch adopts three 2D convolutional layers and two max-pooling layers to capture the spatial–temporal characteristics of the temperature field. After branch-wise feature extraction, the resulting feature vectors are flattened and concatenated, and the fused features are then fed into the LSTM layer for temporal dependency modeling, followed by a fully connected layer and a Softmax classifier for final diagnosis.
As shown in Table 11, the leakage-current, voltage, and PD signals were treated as one-dimensional single-channel sequences, and their input dimensions were therefore expressed in the form $T \times 1$, where $T$ is the sequence length given in Table 11 and the last dimension denotes the number of channels. The temperature data were represented as a $45 \times 7 \times 1$ tensor, where 45 denotes the spatial measurement points, 7 denotes the acquisition instants, and 1 denotes the single temperature channel. After branch-wise feature extraction, the resulting feature maps were flattened into one-dimensional vectors and concatenated to form the fused feature representation. This fused feature vector was then fed into the LSTM layer to model temporal dependency, and the final classification result was obtained through the fully connected layer and the Softmax classifier. Based on the above network architecture, SSA was further employed to optimize the key hyperparameters of the proposed model.
In this study, SSA was used to optimize the key hyperparameters of the proposed CNN–LSTM model, including the number of hidden neurons in the LSTM layer, batch size, initial learning rate, L2 regularization coefficient, and number of training epochs. To ensure reproducibility, the search ranges were set as follows: the number of hidden neurons was searched in the range of 16–64, the batch size in 8–32, the initial learning rate in $1 \times 10^{-4}$ to $1 \times 10^{-2}$, the L2 regularization coefficient in $1 \times 10^{-5}$ to $1 \times 10^{-2}$, and the number of training epochs in 30–80. The population size of SSA was set to 10, and the maximum number of iterations was set to 15. Under these settings, a total of 150 candidate hyperparameter combinations were evaluated during the optimization process. These search intervals were chosen empirically in view of the relatively small sample size, so as to control model complexity while maintaining sufficient optimization flexibility.
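One way to connect a sparrow position to a concrete hyperparameter set is to decode a vector in the unit hypercube into the search ranges listed above. The sketch below is an assumption about how such a mapping could look, not the encoding used in this study: the names `SEARCH_SPACE` and `decode`, the rounding of integer parameters, and the logarithmic scaling of the learning rate and L2 coefficient are all hypothetical choices.

```python
import math

# (lower bound, upper bound, scale) for each hyperparameter, taken from the stated ranges.
SEARCH_SPACE = {
    "lstm_hidden_units": (16, 64, "int"),
    "batch_size": (8, 32, "int"),
    "learning_rate": (1e-4, 1e-2, "log"),
    "l2_coefficient": (1e-5, 1e-2, "log"),
    "epochs": (30, 80, "int"),
}


def decode(position):
    """Map a position in [0, 1]^5 to a hyperparameter dictionary."""
    params = {}
    for u, (name, (lo, hi, kind)) in zip(position, SEARCH_SPACE.items()):
        u = min(max(u, 0.0), 1.0)               # clip positions that leave the bounds
        if kind == "int":
            params[name] = int(round(lo + u * (hi - lo)))
        else:                                   # assumed log-uniform scaling
            exponent = math.log10(lo) + u * (math.log10(hi) - math.log10(lo))
            params[name] = 10 ** exponent
    return params


POPULATION_SIZE = 10
MAX_ITERATIONS = 15
total_evaluations = POPULATION_SIZE * MAX_ITERATIONS   # 10 x 15 = 150 candidates

print(decode([0.5] * 5), total_evaluations)
```

Searching the learning rate and regularization coefficient on a logarithmic scale is a common choice when a range spans several orders of magnitude; a linear mapping would work with the same structure.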
The fitness function was defined as the classification accuracy on the validation subset. For each candidate solution generated by SSA, the model was trained on the training subset and then evaluated on the validation subset, and the resulting fitness value was returned to guide the population update. The predefined search ranges of the hyperparameters and the final optimized configuration obtained after convergence are summarized in Table 12. After the optimization process converged, the final selected hyperparameters were determined as follows: the number of hidden neurons in the LSTM layer was 48, the batch size was 16, the initial learning rate was $2.5 \times 10^{-3}$, the L2 regularization coefficient was $5.0 \times 10^{-3}$, and the number of training epochs was 50.
If multiple candidate solutions yielded similar fitness values, the one with the lower validation loss and more stable convergence behavior was selected as the final configuration.
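The tie-breaking rule can be sketched as a two-step selection. The sketch is illustrative: the tolerance that defines "similar" fitness values, the record fields, and the candidate values are hypothetical and are not results from this study. The convergence-stability criterion is not quantified in the text and is therefore left out.

```python
def select_final(candidates, tol=1e-3):
    """Pick the candidate with the lowest validation loss among those whose
    validation accuracy is within tol of the best accuracy."""
    best_accuracy = max(c["val_accuracy"] for c in candidates)
    near_best = [c for c in candidates if best_accuracy - c["val_accuracy"] <= tol]
    return min(near_best, key=lambda c: c["val_loss"])


# Hypothetical evaluation records, for illustration only.
candidates = [
    {"name": "A", "val_accuracy": 0.9500, "val_loss": 0.21},
    {"name": "B", "val_accuracy": 0.9495, "val_loss": 0.17},
    {"name": "C", "val_accuracy": 0.9100, "val_loss": 0.12},
]
chosen = select_final(candidates)
print(chosen["name"])  # B
```

Candidate C has the lowest loss but is excluded because its accuracy is clearly lower, while A and B count as similar and the lower loss decides between them.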