4.1. Improved LightGBM Model
Compared to XGBoost, LightGBM offers several advantages, including faster training speed, lower memory usage, and support for parallel computing. Its algorithm introduces the following optimizations based on the traditional GBDT framework:
- (1) Model structure optimization
LightGBM employs a leaf-wise growth strategy constrained by a depth limit, replacing the level-wise growth strategy of traditional GBDT. The level-wise strategy can waste computational resources by splitting low-gain leaves, whereas LightGBM's depth-limited leaf-wise strategy maintains growth efficiency and, with appropriate hyperparameter tuning, effectively prevents overfitting.
- (2) Training process optimization
LightGBM optimizes the training process and significantly improves model efficiency by employing Gradient-based One-Side Sampling (GOSS), a histogram-based decision tree algorithm, and Exclusive Feature Bundling (EFB). These techniques accelerate training while maintaining accuracy, as illustrated by the parameter sketch below.
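As a concrete illustration of these two sets of optimizations, the following sketch shows how the corresponding LightGBM parameters might be configured. The specific values (num_leaves, max_depth, max_bin, and the GOSS sampling rates) are illustrative assumptions, not the tuned hyperparameters used in this paper.

```python
import lightgbm as lgb

# Illustrative parameter sketch only; the values are assumptions,
# not the tuned hyperparameters reported in this paper.
params = {
    "objective": "binary",
    # (1) Model structure: leaf-wise growth constrained by a depth limit.
    "num_leaves": 63,                 # leaf-wise tree complexity
    "max_depth": 8,                   # depth limit to curb overfitting
    # (2) Training process: histogram algorithm, GOSS, and EFB.
    "max_bin": 255,                   # histogram-based split finding
    "data_sample_strategy": "goss",   # GOSS (LightGBM >= 4.0; older versions use boosting="goss")
    "top_rate": 0.2,                  # keep the 20% of samples with the largest gradients
    "other_rate": 0.1,                # randomly sample 10% of the remaining samples
    "enable_bundle": True,            # Exclusive Feature Bundling (enabled by default)
    "learning_rate": 0.05,
    "verbosity": -1,
}

# Hypothetical usage; train_set would be an lgb.Dataset built from the extracted features.
# model = lgb.train(params, train_set, num_boost_round=500)
```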
LightGBM uses decision trees as weak learners. It iteratively trains multiple weak trees by fitting the residuals and combines them into a strong predictive model. The prediction of LightGBM is expressed as follows:

$$\hat{y} = \sum_{k=1}^{K} f_k(x)$$

Here, $x$ represents the input feature set, $K$ denotes the number of decision trees, $f_k(x)$ refers to an individual decision tree, and $\hat{y}$ is the LightGBM prediction value.
The goal of LightGBM model training is to minimize the loss between the predicted result $\hat{y}_i$ and the actual result $y_i$, which can be expressed as:

$$\min \sum_{i=1}^{N} l\left(y_i, \hat{y}_i\right)$$

To achieve faster convergence, the negative gradient of the loss function (its first-order derivative) is used in each iteration as an approximation of the residual to be fitted, which can be expressed as:

$$r_i = -\left[\frac{\partial\, l\left(y_i, \hat{y}_i\right)}{\partial\, \hat{y}_i}\right]$$
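As a concrete instance, for the binary log loss applied to a raw score z with p = Sigmoid(z), the negative gradient reduces to y − p. The short sketch below, using hypothetical arrays, illustrates this pseudo-residual computation; it is not taken from the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores from the current ensemble and the true labels.
z = np.array([-1.2, 0.3, 2.0, -0.5])
y = np.array([0, 1, 1, 0])

p = sigmoid(z)
# Negative gradient of the binary log loss with respect to z:
# the pseudo-residual that the next tree in the ensemble is fitted to.
pseudo_residual = y - p
print(pseudo_residual)
```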
Traditional LightGBM uses several loss functions, including Binary Cross-Entropy (BCE) Loss and Focal Loss. Each of these loss functions is introduced below.
Binary Cross-Entropy Loss (BCE Loss) is one of the most commonly used loss functions for binary classification tasks. It is defined as follows:

$$L_{BCE} = -\left[y \log p + \left(1 - y\right) \log\left(1 - p\right)\right]$$

where $y$ is the true label and $p$ is the predicted probability value, obtained by applying the Sigmoid function so that it lies in the range [0, 1].
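A minimal NumPy sketch of BCE Loss as defined above is given below; the clipping constant is an implementation detail assumed here for numerical stability.

```python
import numpy as np

def bce_loss(y_true, p_pred, eps=1e-7):
    """Binary Cross-Entropy Loss, averaged over samples.

    y_true: array of labels in {0, 1}
    p_pred: predicted probabilities in [0, 1] (e.g., Sigmoid outputs)
    """
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```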
It effectively quantifies the deviation between the model’s predicted value and the true label. Since the Sigmoid function restricts the model output to the range [0, 1], BCE Loss can directly process these prediction values, avoiding numerical overflow. Consequently, it is widely used in binary classification tasks. However, BCE Loss has the following limitations:
- Sample weight equalization: BCE Loss sums the cross-entropy of all training samples equally, assigning the same weight to each sample.
- Imbalanced sensitivity between easy and difficult samples: When a dataset contains both easy-to-classify samples (correctly classified with high confidence) and difficult samples (with ambiguous boundaries), the loss gradient from easy samples disproportionately influences model updates. As a result, the model overemphasizes easy samples during training, leading to poorer classification performance on difficult samples.
Focal Loss addresses class imbalance by dynamically adjusting sample weights during training. Its formula is defined as follows:

$$L_{FL} = -\left(1 - p_t\right)^{\gamma} \log\left(p_t\right)$$

Here, $p_t$ represents the similarity between the model's predicted value and the true label: the larger $p_t$ is, the closer the prediction is to the label, indicating higher classification accuracy. The parameter $\gamma$ is an adjustable focusing factor.
During training, the parameter $\gamma$ can be adjusted to modify the model's focus on difficult-to-classify samples. The term $\left(1 - p_t\right)^{\gamma}$, known as the modulating factor, reduces the weight of easy-to-classify samples, enabling the model to concentrate more on harder cases.
Focal Loss addresses class imbalance by dynamically distinguishing between hard and easy samples. Its core principles are as follows: the modulating factor $\left(1 - p_t\right)^{\gamma}$ reduces the loss contribution of easy-to-classify samples while increasing the relative contribution of hard-to-classify samples, guiding the model to focus on optimizing performance for difficult cases during training.
- Suppression of easy samples: When $p_t$ approaches 1, the sample is considered easy to classify. In this case, the modulating factor approaches 0, reducing its contribution to the loss and effectively lowering the impact of easy samples during training.
- Emphasis on hard samples: When $p_t$ is very small, the predicted value deviates significantly from the true label, indicating a misclassified sample. In this case, the modulating factor approaches 1, allowing the loss to remain large and contribute more significantly to model training.
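The sketch below implements the Focal Loss described above, assuming the commonly used form without a class-balancing term; the default γ = 2 is an illustrative choice, not the value tuned in this paper.

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-7):
    """Focal Loss without a class-balancing term, averaged over samples.

    p_t is the probability assigned to the true class: p if y = 1, else 1 - p.
    The modulating factor (1 - p_t)^gamma down-weights easy samples.
    """
    p = np.clip(p_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)
    return np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t))
```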
Figure 7 illustrates the output of each loss function for different values of $\gamma$. Comparing Focal Loss and BCE Loss across different $\gamma$ values shows that when the predicted probability approaches 0, the loss value of BCE Loss remains relatively small. Due to its symmetric nature, BCE Loss assigns equal loss weight to both easy- and hard-to-classify samples, making it difficult for the model to effectively differentiate sample difficulty. In contrast, Focal Loss enhances the model's focus on hard samples by adjusting the parameter $\gamma$, which increases the loss weight for samples with predicted probabilities close to 0.
Therefore, this paper proposes a hybrid loss function with dynamic weight adjustment (DWM Loss) as the optimization objective for model training. The dynamic weight adjustment strategy employs three decay functions, linear decay (LD), cosine decay (CD), and exponential decay (ED), to adaptively adjust the weighting between the two loss functions at different stages of training. The corresponding decay curves are shown in Figure 8.
Figure 8 illustrates that the weight under the linear decay (LD) function decreases linearly as the number of iterations increases; its advantage lies in enabling a smooth transition from BCE Loss to Focal Loss. During the first half of training (the initial 50% of iterations), the weight of the cosine decay (CD) function remains above 0.5, indicating that BCE Loss primarily optimizes easy-to-classify samples early on, while Focal Loss gradually takes precedence in the later stages to enhance the learning of difficult-to-classify samples. The exponential decay (ED) function significantly reduces the weight of BCE Loss early in training (within the first 1/8 of iterations) and brings it close to zero within the first 1/4 of iterations, so that Focal Loss dominates the later training stages. Considering LightGBM's training characteristics and practical requirements, this paper aims for the model to quickly learn from easy-to-classify samples early on to ensure a low false alarm rate, while improving recognition accuracy on difficult-to-classify samples through Focal Loss in the later stages. Because the ED function best meets this dynamic requirement, it is chosen as the final strategy for dynamic weight adjustment. The ED model loss function is shown in Figure 9.
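The sketch below shows one way the three decay schedules and the resulting dynamic weight could be written, reusing the bce_loss and focal_loss sketches above. The exact functional forms (the cosine period and the exponential rate, chosen so the weight is near zero after roughly the first quarter of iterations) are assumptions consistent with the description above, not the paper's exact formulas.

```python
import numpy as np

def weight_schedule(t, total_iters, mode="ED"):
    """Dynamic weight w(t) in [0, 1] assigned to BCE Loss at iteration t.

    The remaining (1 - w) is assigned to Focal Loss.
    LD: linear decay, CD: cosine decay, ED: exponential decay.
    The functional forms are illustrative assumptions.
    """
    r = t / max(total_iters - 1, 1)              # training progress in [0, 1]
    if mode == "LD":
        return 1.0 - r                           # smooth linear transition
    if mode == "CD":
        return 0.5 * (1.0 + np.cos(np.pi * r))   # stays above 0.5 for the first half
    if mode == "ED":
        return float(np.exp(-16.0 * r))          # near zero after ~1/4 of iterations
    raise ValueError(f"unknown mode: {mode}")

def dwm_loss(y_true, p_pred, t, total_iters, gamma=2.0, mode="ED"):
    """Dynamic-weighted hybrid loss: w * BCE + (1 - w) * Focal.

    bce_loss and focal_loss are the sketches defined earlier in this section.
    """
    w = weight_schedule(t, total_iters, mode)
    return w * bce_loss(y_true, p_pred) + (1.0 - w) * focal_loss(y_true, p_pred, gamma)
```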
In Figure 9, darker blue indicates lower loss values, while darker red indicates higher loss values. After adopting the ED function, the model's loss trend over iterations and its recognition accuracy are shown in Figure 9. In the early training stage, BCE Loss dominates, enabling the model to perform balanced optimization across all samples and achieve stable convergence. In the later stages of training, Focal Loss amplifies the penalty for misclassified or uncertain samples, encouraging the model to focus on learning complex cases with ambiguous boundaries. By adjusting the dynamic weight, the model transitions from global optimization to focusing on difficult-to-classify samples. During this process, samples with predicted probabilities near the decision boundary extremes (positive class p → 0 or negative class p → 1) produce larger loss gradients, so their contribution to the gradient updates is intensified, prompting the model to prioritize correcting high-confidence misclassifications.
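To make the dynamic weighting concrete at the level of LightGBM training, the sketch below wraps the hybrid loss as a custom objective for LGBMClassifier, returning per-sample gradients and Hessians with respect to the raw score. The derivatives are computed numerically to keep the sketch short; the iteration-counting closure, the ED rate, and the finite-difference step are all assumptions rather than the paper's implementation.

```python
import numpy as np
import lightgbm as lgb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_loss(z, y, w, gamma=2.0, eps=1e-7):
    """Per-sample hybrid loss w * BCE + (1 - w) * Focal as a function of the raw score z."""
    p = np.clip(sigmoid(z), eps, 1.0 - eps)
    bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    p_t = np.where(y == 1, p, 1.0 - p)
    fl = -((1.0 - p_t) ** gamma) * np.log(p_t)
    return w * bce + (1.0 - w) * fl

def make_dwm_objective(total_iters, gamma=2.0, h=1e-3):
    """Build a custom objective; assumes LightGBM calls it once per boosting iteration."""
    state = {"t": 0}

    def objective(y_true, raw_score):
        # Exponential-decay (ED) weight on BCE Loss, assumed rate of 16.
        w = float(np.exp(-16.0 * state["t"] / max(total_iters - 1, 1)))
        state["t"] += 1
        # Central finite differences of the per-sample loss with respect to the raw score.
        f0 = sample_loss(raw_score, y_true, w, gamma)
        fp = sample_loss(raw_score + h, y_true, w, gamma)
        fm = sample_loss(raw_score - h, y_true, w, gamma)
        grad = (fp - fm) / (2.0 * h)
        hess = np.maximum((fp - 2.0 * f0 + fm) / (h * h), 1e-6)  # keep Hessians positive
        return grad, hess

    return objective

# Hypothetical usage; X_train and y_train are the extracted feature matrix and labels.
# clf = lgb.LGBMClassifier(objective=make_dwm_objective(total_iters=500),
#                          n_estimators=500, num_leaves=63, max_depth=8)
# clf.fit(X_train, y_train)
```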
4.3. Experimental Results and Analysis
The constructed dataset is split into a training set and a test set at a 9:1 ratio. The LightGBM model is configured with the optimized hyperparameters, and different loss functions are employed for comparative training. BCE Loss is simple and efficient but often suffers from underfitting; Focal Loss mitigates class imbalance but relies on careful parameter tuning. By contrast, the proposed DWM Loss requires slightly longer training time but dynamically adjusts weights according to sample difficulty, offering stronger generalization. The computational time and complexity of the three methods are summarized in Table 6. The model training process is illustrated in Figure 11a,b, which show the curves of model accuracy and loss value, respectively, over the course of training iterations.
As shown in Figure 11a, the LightGBM model experiences short-term underfitting at the start of training. When using BCE Loss, the model tends to prioritize optimizing easy-to-classify samples; its limited capacity to handle difficult samples leads to rapid convergence but ultimately lower accuracy. The model using Focal Loss shows low learning efficiency on easy-to-classify samples early in training, resulting in slow convergence and limited accuracy improvement. In contrast, the model trained with the proposed dynamic weighted hybrid loss function has an initial convergence speed between those of BCE Loss and Focal Loss, but its accuracy improves faster over time, ultimately exceeding that of the other two models.
Figure 11b illustrates how the loss value changes throughout model training. In the initial stage, the dynamic weight is assigned almost entirely to BCE Loss, so the loss values of DWM Loss and BCE Loss are equal. As the weight gradually shifts toward Focal Loss, which emphasizes the more complex samples, the loss curves of DWM Loss and Focal Loss begin to align, indicating that both models have reached convergence.
To evaluate the classification performance of the improved LightGBM algorithm for arc fault detection, this study used a test set of 22,362 samples and generated the corresponding confusion matrix. The test set includes two working conditions: with arc and without arc. The confusion matrix for the improved algorithm is visualized in Figure 12, and the model testing results are shown in Table 7.
In Table 7, TP stands for true positive, FP for false positive, FN for false negative, and TN for true negative. In the test results, 94 normal operation samples were misjudged as arc faults, and 153 arc fault samples were misidentified as normal. The calculated true positive rate, false positive rate, false negative rate, and true negative rate were 99.3%, 1.8%, 0.7%, and 98.2%, respectively, and the overall accuracy was 98.9%. After SVD and VMD feature extraction, the arc features are distinct, model training is accurate and relatively stable, and the classification accuracy fluctuates by only ±0.02%. The proposed model was compared with the Arcnet [36] model using a statistical significance test (a two-tailed two-proportion Z-test, α = 0.05). The calculated Z-value is 2.674 and the p-value is 0.0075, indicating that the proposed method is more sensitive to arc faults than the Arcnet model, although Arcnet performs better in resisting false detection. In practical applications, the proposed method further reduces the false detection rate by adopting a multi-cycle joint decision mechanism. These results demonstrate that the optimized LightGBM algorithm performs exceptionally well in AC series arc fault detection, and its very low false alarm rate indicates that the model effectively avoids overfitting.
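To illustrate the comparison procedure (not to reproduce the exact figures, since the per-model detection counts behind Z = 2.674 are not restated here), a two-tailed two-proportion Z-test can be run as sketched below with hypothetical placeholder counts.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical placeholder counts: correctly detected arc samples for the
# proposed model and for Arcnet, out of the same number of arc test samples.
successes = np.array([9930, 9870])   # placeholder values, not the paper's data
trials = np.array([10000, 10000])    # placeholder sample sizes

z_stat, p_value = proportions_ztest(successes, trials, alternative="two-sided")
print(f"Z = {z_stat:.3f}, p = {p_value:.4f}")
# A p-value below alpha = 0.05 indicates a statistically significant difference
# between the two detection rates.
```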
To assess the impact of SVD on arc fault detection performance, a comparative experiment was conducted, with the results shown in Table 8. The experimental group processed with SVD exhibited greater distinguishability of arc signals in the feature space, fully validating the effectiveness of the feature enhancement method.