The hyperparameters of deep learning models have a significant influence on both model performance and the training process. For example, larger kernel sizes in BiTCN modules may increase the computational burden of the model, while smaller ones may fail to capture sufficient temporal information. Grid search is a typical technique for hyperparameter determination, which systematically traverses various hyperparameter combinations to find the best model performance. However, the effectiveness of grid search depends heavily on the predefined traversal space, so it fails to identify the optimal solution if that solution lies outside the predefined values. Moreover, grid search requires nested-loop traversal of all combinations, which increases computational time. Furthermore, as the hyperparameter dimensionality increases, the sparsity of the parameter space intensifies: even if each hyperparameter takes only a few values (e.g., 3–5), the total number of combinations in the high-dimensional space grows exponentially, sharply reducing search efficiency and even rendering exhaustive traversal infeasible.
Therefore, it is difficult to find the best solution by enumerating parameter combinations or using a grid search, since the diagnosis method involves numerous hyperparameters. Fundamentally, hyperparameter optimization for the Transformer–BiTCN deep learning model is a highly non-convex problem, and a metaheuristic algorithm enables more effective exploration of this non-convex search space. The snow geese algorithm (SGA) is a recent nature-inspired metaheuristic. Owing to its Brownian motion mechanism, it excels at escaping local optima and precisely searching for the optimal value. Furthermore, it increases the number of effective iterations when applied to long-cycle, non-stationary SOCF diagnosis, making it particularly suitable for open-circuit fault diagnosis of the Vienna rectifier.
In this paper, an ISGA-optimized Transformer–BiTCN algorithm is presented to determine the model’s hyperparameters, including the learning rate decay factor, dropout rate, batch size, number of Transformer layers, and number of attention heads.
4.1. Presented ISGA Algorithm
Compared with the traditional SGA [20], the following improvements are made to accelerate the algorithm’s convergence and enhance the accuracy of the optimal solution. The ISGA is an intelligent optimization strategy designed to efficiently search the hyperparameter space and structural configuration of the abovementioned deep hybrid model. In detail, the ISGA is employed to optimize key hyperparameters and network structure parameters. This data-driven, automated optimization yields a more efficient, accurate, and generalizable model, avoiding human bias in design and significantly improving diagnostic performance over default or manually tuned models.
- A. Bloch-Based Population Initialization Strategy
In the traditional SGA, the initial population is generated purely at random across the search space, which leads to low convergence accuracy and slow convergence speed. In this paper, a Bloch coordinate-encoding scheme is combined with the SGA to enhance population diversity and accelerate convergence.
Figure 5 shows the Bloch sphere representation of a qubit. It is known that a point P on the sphere is determined by the angles φ and θ, and any qubit corresponds to a point on the Bloch sphere. Therefore, all qubits can be represented using Bloch coordinates, as shown in Equation (5).
Then, the Bloch coordinates of the qubits are used directly as the encoding, with the scheme given in Equation (6), where r is a random number within the interval [0, 1] used to generate the Bloch angles φij and θij; N is the population size, with i = 1, 2, …, N; and n is the dimension of the optimization space, with j = 1, 2, …, n.
Each candidate solution simultaneously occupies three positions in the space, i.e., it represents the following three solutions at the same time: Solution X, Solution Y, and Solution Z, as given in Equation (7).
The Bloch coordinates of the j-th qubit of the candidate solution pi are denoted as (xij, yij, zij). In the optimization problem, the value range of the j-th dimension of the solution space is [aj, bj]. The transformation formula for mapping from the unit space [−1, 1] to the optimization problem’s solution space is given in Equation (8).
Hence, each candidate solution corresponds to three solutions of the optimization problem. Among all candidate solutions, the N individuals with the best fitness values are selected as the initial population. This enhances the ability to traverse the search space and increases population diversity, thereby improving population quality and accelerating the algorithm’s convergence.
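To make the above initialization concrete, the following Python sketch illustrates one possible implementation of the Bloch-based strategy; the specific angle-sampling choices, variable names, and the loss-style fitness (smaller is better) are illustrative assumptions rather than the paper’s exact Equations (6)–(8).

```python
import numpy as np

def bloch_initial_population(N, n, a, b, fitness, rng=None):
    """Sketch of Bloch-based initialization: each candidate encodes three
    solutions (X, Y, Z) through the Bloch coordinates of n qubits, and the
    N best decoded solutions are kept as the initial population."""
    rng = np.random.default_rng(rng)
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Random Bloch angles for every qubit (assumed phi in [0, 2*pi], theta in [0, pi]).
    phi = 2.0 * np.pi * rng.random((N, n))
    theta = np.pi * rng.random((N, n))
    # Bloch coordinates; every candidate yields the three solutions X, Y, Z.
    x = np.cos(phi) * np.sin(theta)
    y = np.sin(phi) * np.sin(theta)
    z = np.cos(theta)
    candidates = np.concatenate([x, y, z], axis=0)   # shape (3N, n), values in [-1, 1]
    # Map from the unit space [-1, 1] to the solution space [a_j, b_j].
    decoded = 0.5 * (b * (1.0 + candidates) + a * (1.0 - candidates))
    # Keep the N candidates with the best fitness as the initial population.
    scores = np.array([fitness(ind) for ind in decoded])
    best = np.argsort(scores)[:N]
    return decoded[best], scores[best]
```

In this sketch, each of the 3N decoded candidates is evaluated once, so the initial population already starts from comparatively good regions of the search space.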
- B. Improved Position Update Technique via Rime Search Strategy
The SGA relies on an exploration stage and an exploitation stage. Specifically, the exploration stage, corresponding to the herringbone formation, is responsible for discovering the globally optimal region through random perturbations and group collaboration, while the exploitation stage, corresponding to the straight-line formation, is responsible for local refinement. After the exploration phase identifies potentially optimal regions, the exploitation phase concentrates resources on a refined search in the neighborhood of the optimal solution through straight-line flight. Hence, the performance of the exploitation phase is critical for achieving a high-efficiency local search and precise convergence.
As shown in Equation (9), the traditional SGA is guided by the current solution if rand > 0.5, which may prevent the algorithm from escaping a local optimum.
Here, ⊕ denotes entry-wise multiplication, and Pbt and Pit are the position of the optimal solution at the current iteration and the position of the current particle, respectively.
In order to solve the abovementioned problem, under the condition of rand > 0.5, Equation (9) of the traditional SGA is refined using the Rime search strategy. As shown in Equation (10), the position of the Rime particles is calculated as follows [21]:
where Rnew(p,q) denotes the new position of the updated particle, with p and q indexing the q-th particle of the p-th Rime agent; Rbest,q is the q-th particle of the best Rime agent in the Rime population R; t is the current iteration number; T is the maximum number of iterations of the algorithm; the rand() function, together with cos θ, controls the direction of particle movement and changes with the number of iterations; and β is the environmental factor, which varies with the number of iterations to simulate the influence of the external environment and is used to ensure the convergence of the algorithm. The default value of ω is 5.
In this way, as the soft-rime condensation area increases, its strong randomness and wide coverage enable the algorithm to rapidly search the full space, thereby balancing global and local search during optimization. Meanwhile, hard rime, influenced by external factors, tends to condense in the same direction; because of this consistent growth direction, it easily intersects with other hard rime, enabling dimension exchange between ordinary particles and the optimal particle, which helps improve solution accuracy.
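For clarity, a sketch of the Equation (10)-style soft-rime update is given below; the expressions used for θ, β, and the attachment probability E follow the standard RIME formulation [21] and are assumptions here, since the equation body itself is not reproduced in the text.

```python
import math
import numpy as np

def soft_rime_update(R, R_best, lb, ub, t, T, w=5, rng=None):
    """Soft-rime position update sketch (Equation (10)-style step).

    R      : (num_agents, dim) current Rime population positions
    R_best : (dim,) position of the best Rime agent
    lb, ub : (dim,) lower / upper bounds of the search space
    t, T   : current / maximum iteration counts
    w      : segment constant of the environmental factor beta (default 5)
    """
    rng = np.random.default_rng(rng)
    num_agents, dim = R.shape
    theta = math.pi * t / (10.0 * T)            # direction term, changes with iterations
    beta = 1.0 - math.floor(w * t / T) / w      # step-wise environmental factor
    E = math.sqrt(t / T)                        # attachment probability grows with iterations
    R_new = R.copy()
    for p in range(num_agents):
        for q in range(dim):
            if rng.random() < E:                # particle (p, q) is updated this iteration
                r1 = 2.0 * rng.random() - 1.0   # random direction in (-1, 1)
                h = rng.random()                # degree of adhesion
                R_new[p, q] = R_best[q] + r1 * math.cos(theta) * beta * (
                    h * (ub[q] - lb[q]) + lb[q])
    return np.clip(R_new, lb, ub)
```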
In general, compared with the traditional SGA, the proposed ISGA with the Rime mechanism improves the convergence of the algorithm while gaining the ability to jump out of local optima.
4.2. Procedure of ISGA-Optimized Transformer–BiTCN Algorithm
In this section, we give a brief introduction to the structure of the proposed algorithm. The ISGA-optimized Transformer–BiTCN algorithm provides an efficient way to obtain the globally optimal hyperparameters of the Transformer–BiTCN model with fast convergence. Specifically, the time series data of currents and voltages are used as the model inputs, and all data are divided into a training set, a validation set, and a testing set.
For the ISGA, the accuracy of the Transformer–BiTCN is taken as the objective function. By optimizing the loss function of the Transformer–BiTCN outputs on both the training and validation data, the optimal weight parameters are obtained. During the model testing stage, the optimized parameters obtained through the ISGA are used to yield the final test results. The optimization range of each hyperparameter to be optimized is shown in Table 2, including the learning rate decay factor η, dropout rate d, batch size ρ, number of Transformer layers l, and number of attention heads h.
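As a purely illustrative example, the Table 2 search space could be encoded as per-dimension bounds for the ISGA; the numeric ranges below are placeholders, not the actual values listed in Table 2.

```python
# Hypothetical encoding of the Table 2 search space; the bounds are placeholders.
SEARCH_SPACE = {
    "learning_rate_decay": (0.80, 0.99),  # eta
    "dropout_rate":        (0.10, 0.50),  # d
    "batch_size":          (16, 128),     # rho, rounded to an integer when decoded
    "transformer_layers":  (1, 6),        # l, rounded to an integer when decoded
    "attention_heads":     (1, 8),        # h, rounded to an integer when decoded
}

# Lower/upper bound vectors (a_j, b_j) used by the ISGA position updates.
lower = [lo for lo, _ in SEARCH_SPACE.values()]
upper = [hi for _, hi in SEARCH_SPACE.values()]
```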
As demonstrated in Figure 6, the detailed structure of the proposed ISGA-optimized Transformer–BiTCN algorithm can be described as follows:
Step 1: Collect the input currents ia, ib, and ic and the DC voltages udc1 and udc2 of the Vienna rectifier shown in Figure 1 under various operating conditions, and pre-process all these data through missing-value elimination, normalization, and label encoding.
Step 2: The pre-processed ia′, ib′, and ic′ and udc1 and udc2 form the inputs of the model and are divided into a training set, a validation set, and a testing set. It should be noted that there is no overlap among the three datasets.
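A minimal sketch of Steps 1–2 is given below, assuming the collected signals are stored in a pandas DataFrame with the column names ia, ib, ic, udc1, udc2, and fault_label; the column names and split ratios are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

def preprocess_and_split(df, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Missing-value elimination, normalization, label encoding, and a
    non-overlapping train/validation/test split (ratios are assumptions)."""
    feature_cols = ["ia", "ib", "ic", "udc1", "udc2"]           # assumed column names
    df = df.dropna(subset=feature_cols + ["fault_label"])       # missing-value elimination
    X = MinMaxScaler().fit_transform(df[feature_cols].values)   # normalization
    y = LabelEncoder().fit_transform(df["fault_label"].values)  # label encoding
    # Shuffle once, then cut the indices into three disjoint sets.
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = int(train_ratio * len(X)), int(val_ratio * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```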
Step 3: Set up the architecture of the Transformer–BiTCN model, as shown in Figure 5, and apply the initialized ISGA parameters with the population initialization strategy outlined in Equations (6)–(8) to the network structure. Apply the Transformer–BiTCN with the initial parameters to recognize the open-circuit fault and calculate the corresponding fitness. The fitness function in this paper reflects the accuracy of the diagnosis results after optimization; a larger value means a better open-circuit diagnosis result. To avoid instability of the fitness function caused by fluctuations in a single training run, the average performance of multiple cross-validations is employed as the fitness value. In this paper, the average accuracy of F-fold cross-validation is used, as described in Equation (11), where F is the number of cross-validation folds and Lossf(ζ) is the accuracy of the f-th fold.
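The fitness evaluation of Equation (11) can be sketched as the average validation accuracy over F folds, as below; train_and_evaluate is a hypothetical helper that builds, trains, and scores a Transformer–BiTCN with the hyperparameter vector ζ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def fitness(zeta, X, y, F=5, train_and_evaluate=None):
    """Average F-fold cross-validation accuracy used as the ISGA fitness;
    a larger value indicates a better diagnosis result."""
    skf = StratifiedKFold(n_splits=F, shuffle=True, random_state=0)
    fold_acc = []
    for train_idx, val_idx in skf.split(X, y):
        # train_and_evaluate is a hypothetical helper: it trains a
        # Transformer-BiTCN with hyperparameters zeta and returns fold accuracy.
        acc = train_and_evaluate(zeta, X[train_idx], y[train_idx],
                                 X[val_idx], y[val_idx])
        fold_acc.append(acc)
    return float(np.mean(fold_acc))
```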
Step 4: Use the ISGA algorithm to optimize the parameters according to the improved position update technique via Equations (9) and (10). During this stage, it is necessary to check whether the algorithm has reached the maximum number of iterations and whether the fitness value is less than the system threshold. If so, the iteration is terminated; otherwise, the iteration is continued.
Step 5: The final positions of snow geese individuals are used as the optimized parameters for the Transformer–BiTCN.
Step 6: The training set is used to train the remaining parameters of the Transformer–BiTCN model with the hyperparameters preset via the ISGA. After each training epoch, the validation set is employed to validate the model, ensuring that the model continuously improves during training. Once training reaches the preset number of iterations, the trained model is saved for subsequent testing.
Step 7: The test set is input into the model saved in Step 6 to evaluate the model’s performance. If the evaluation results indicate that the model’s diagnostic capability has not yet reached the optimal level, the process returns to Step 6 for further training and validation. This iterative process continues until the model’s performance meets the predefined criteria, at which point the best-performing model is obtained. Subsequently, this offline-trained model can be deployed for online SOCF diagnosis in actual operation.
Step 8: Use the optimized parameters to diagnose open-circuit faults of the Vienna rectifier and output the recognition results.
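Finally, Steps 3–5 can be combined into a single outer optimization loop; the sketch below reuses the bloch_initial_population, soft_rime_update, and fitness helpers sketched earlier, and its termination rule (maximum iterations or an accuracy target) is an illustrative assumption.

```python
import numpy as np

def isga_optimize(X, y, lower, upper, pop_size=20, max_iter=30,
                  target_acc=0.99, train_and_evaluate=None):
    """Outer ISGA loop sketch for Transformer-BiTCN hyperparameter search
    (Steps 3-5): minimizes 1 - average cross-validation accuracy."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    objective = lambda zeta: 1.0 - fitness(zeta, X, y,
                                           train_and_evaluate=train_and_evaluate)
    # Step 3: Bloch-based initialization (Equations (6)-(8)) and fitness evaluation.
    pop, scores = bloch_initial_population(pop_size, len(lower), lower, upper, objective)
    best = pop[int(np.argmin(scores))].copy()
    best_score = float(scores.min())
    # Step 4: iterate the improved position update until the stopping rule is met.
    for t in range(1, max_iter + 1):
        pop = soft_rime_update(pop, best, lower, upper, t, max_iter)
        scores = np.array([objective(ind) for ind in pop])
        if scores.min() < best_score:
            best_score = float(scores.min())
            best = pop[int(np.argmin(scores))].copy()
        if 1.0 - best_score >= target_acc:   # accuracy target reached
            break
    # Step 5: the best position found is the optimized hyperparameter vector.
    return best, 1.0 - best_score
```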