4.4.1. Hyper-Parameter Settings
The method proposed in this paper includes four hyper-parameters: the number of attention-map channels M for bilinear attention pooling, the mask threshold θ, the standard deviation σ for discrete label distribution learning, and the weight λ of the loss function. To find the most suitable settings, this paper analyzes the impact of each parameter on the overall model, determines a reasonable range for each, and then searches for the optimal combination.
Firstly, this paper used a model without the discrete label distribution learning module to study the impact of the number of attention feature-map channels M on overall model performance. As shown in Table 3, different values of M lead to different classification results. When M is set within an appropriate range, it significantly improves the model's classification accuracy. If M is too small, the network may fail to learn the more important features; on the contrary, if M is too large, the network may pay too much attention to specific features, leading to overfitting, which harms subsequent classification. Because the distributions of fog features differ between the synthetic and the real dataset, the setting of M should not be the same for the two. Therefore, in the subsequent experiments, we set M separately for the RFID and FRIDA datasets, using the best-performing value from Table 3 for each.
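The bilinear attention pooling step that these M attention channels feed can be sketched as follows. This is a minimal illustrative implementation in plain Python, assuming M attention maps and C feature-map channels on an H × W grid; the function and variable names are our own and are not taken from the paper's code.

```python
def bilinear_attention_pool(attn, feats):
    """Bilinear attention pooling (BAP) sketch.

    attn:  M x H x W attention maps (nested lists).
    feats: C x H x W feature maps (nested lists).
    Returns an M x C matrix of part features: each attention map gates the
    feature maps element-wise, followed by global average pooling.
    """
    M, H, W = len(attn), len(attn[0]), len(attn[0][0])
    C = len(feats)
    parts = [[0.0] * C for _ in range(M)]
    for m in range(M):
        for c in range(C):
            s = 0.0
            for i in range(H):
                for j in range(W):
                    s += attn[m][i][j] * feats[c][i][j]
            parts[m][c] = s / (H * W)
    return parts
```

In a real network this would be a batched einsum over tensors, but the loop form makes explicit why M controls how many distinct local regions the model can attend to.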
Secondly, this paper added the discrete label distribution learning module to the above experiments to investigate the impact of varying the standard deviation σ on overall model performance. As shown in Figure 6, the two plots illustrate the label distributions corresponding to two different original labels. The label distribution gives the probability of each visibility level for an image, and different curves in the same plot correspond to different standard deviations. A larger standard deviation σ assigns a higher probability to every visibility level, indicating an uneven distribution of fog in the image. Conversely, a smaller σ concentrates the visibility probability on a single level, suggesting a uniform fog distribution in the image. The changes in visibility probability brought about by different σ can be clearly observed in Figure 6.
Table 4 demonstrates the influence of different standard deviations σ on the performance of the model when learning from the discrete label distribution. In the visibility estimation task of this paper, a smaller σ is significantly superior to a larger one. This suggests that, since the two datasets contain 6 and 9 levels, respectively, the non-uniform fog features mainly shift to the visibility ranges adjacent to the original label. A large σ makes the discrete label distribution too scattered, hindering the model's search for features in the farthest region. Therefore, σ should be tuned so that the model can search for features in the farthest region more effectively. In the subsequent experiments, we used the best-performing value of σ from Table 4 for both datasets.
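To make the role of σ concrete, the following sketch builds such a discrete label distribution by evaluating a Gaussian centred on the ground-truth visibility level at each discrete level and normalising. This is an illustrative reconstruction; the paper's exact formulation may differ.

```python
import math

def discrete_label_distribution(true_level, num_levels, sigma):
    """Discrete label distribution for visibility estimation (sketch).

    A Gaussian centred on the ground-truth level `true_level` is evaluated
    at each of the `num_levels` discrete visibility levels and normalised
    so the probabilities sum to 1. A small sigma concentrates mass on the
    true level; a large sigma spreads it across neighbouring levels.
    """
    weights = [math.exp(-((k - true_level) ** 2) / (2.0 * sigma ** 2))
               for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with 6 levels and true level 2, σ = 0.5 yields a sharply peaked distribution, while σ = 2.0 spreads probability onto adjacent levels, mirroring the curves in Figure 6.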
Thirdly, this paper introduced an attention-based branch to enhance the original model's extraction of visibility features, and explored the impact of the two branches on overall model performance by adjusting the weight λ between the loss functions of the base branch and the attention-based branch. Table 5 shows the effect of varying λ on the experimental results when learning from the discrete label distribution. A larger λ indicates a greater weight of the base branch in optimizing the model. As shown in Table 5, the performance of the model reaches its optimum when λ is in an appropriate range; there, the fusion of the two different loss functions of the two branches is most effective and enhances the classification accuracy of the model. A value of λ that is too large or too small has a negative impact on the model. Therefore, in the subsequent experiments, we set λ separately for the RFID and FRIDA datasets, using the best-performing value from Table 5 for each.
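A minimal sketch of such a weighted loss fusion is given below, assuming (consistent with the ablation in Section 4.4.3) that the base branch is trained against the discrete label distribution with a KL-divergence loss while the attention-based branch uses a plain cross-entropy loss. The function names and this exact loss pairing are our illustrative assumptions, not the paper's specification.

```python
import math

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) between two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred))

def cross_entropy(pred, label, eps=1e-12):
    """Negative log-likelihood of the one-hot ground-truth label."""
    return -math.log(pred[label] + eps)

def fused_loss(pred_base, target_dist, pred_attn, label, lam):
    """Weighted fusion of the two branch losses.

    `lam` (λ) weights the base branch's label-distribution loss;
    (1 - lam) weights the attention-based branch's classification loss.
    """
    return (lam * kl_divergence(target_dist, pred_base)
            + (1.0 - lam) * cross_entropy(pred_attn, label))
```

Varying `lam`, as in Table 5, shifts how strongly gradients from each branch drive the shared backbone.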
Subsequently, the following experiment studied the impact of the mask threshold θ on overall performance. Table 6 shows the influence of different θ on the performance of the attention-based branch. The results indicate that an appropriate θ improves the performance of the model. Of the two datasets used in this paper, θ has the greater impact on the synthetic dataset FRIDA: a larger θ causes the weakly supervised localization module to select too small a region, providing the model with incorrect local features and causing a significant decrease in accuracy, whereas a smaller θ yields a localization region close to the original image that fails to focus on local features, likewise reducing accuracy. Therefore, θ should be tuned so that the model locates a farthest region of appropriate size. In the subsequent experiments, we set θ to a value randomly selected from the range [0.4, 0.6] for both the RFID and FRIDA datasets to improve the robustness of the model.
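The thresholding-and-cropping behaviour described above can be sketched as follows: θ is sampled from [0.4, 0.6], the attention map is thresholded at that fraction of its peak, and the bounding box of the surviving pixels is returned. This is a simplified stand-in for the paper's weakly supervised localization module; all names are illustrative.

```python
import random

def localize(attn_map, theta_range=(0.4, 0.6), rng=random):
    """Locate the attended region via a sampled mask threshold (sketch).

    attn_map: H x W attention map (nested lists, non-negative values).
    A threshold theta is sampled uniformly from theta_range; pixels with
    attention >= theta * peak form the mask, and the bounding box of the
    mask is returned as (top, left, bottom, right).
    """
    theta = rng.uniform(*theta_range)
    peak = max(max(row) for row in attn_map)
    rows = [i for i, row in enumerate(attn_map)
            if any(v >= theta * peak for v in row)]
    cols = [j for j in range(len(attn_map[0]))
            if any(row[j] >= theta * peak for row in attn_map)]
    return min(rows), min(cols), max(rows), max(cols)
```

A larger θ shrinks the box toward the attention peak, while a smaller θ lets the box approach the full image, matching the two failure modes discussed above.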
Finally, the best combination of M, σ, λ, and θ was selected for each dataset based on the hyper-parameter experiments above: one combination for the RFID dataset and one for the FRIDA dataset. These combinations were then used in the comparison with other state-of-the-art visibility estimation methods.
4.4.2. Comparison with State-of-the-Art Methods
In order to demonstrate the effectiveness of our proposed method, we compared it with other classic classification methods and several improved methods designed for visibility classification, as shown in Table 7. First, we conducted transfer learning experiments on the two fog image datasets, RFID and FRIDA, using several classic deep learning classification models, namely AlexNet, VGG16, ResNet18, and ResNet50. Our proposed method achieved the best performance in terms of accuracy, mean squared error, and F1-score. It is worth noting that our method builds on the ResNet18 network, which indicates that the combination of the introduced BAP, the weakly supervised localization module, and discrete label distribution learning helps to enhance the local feature information in the image. Second, we compared our method with several models proposed specifically for the visibility estimation task, such as SCNN, TVRNet, and VisNet. As shown in Table 7, our method significantly outperforms them, indicating that it enhances fog features more effectively than the image-preprocessing-based data augmentation used in VisNet.
Figure 7 illustrates the accuracy curves during the training and testing stages of the proposed method and four baseline methods: ResNet18, VGG16, VisNet, and TVRNet. The figure shows that the proposed method converges faster and more stably in the training stage, and its accuracy curves in the testing stage are comparable as well. It can therefore be concluded that the proposed method outperforms the baseline methods in terms of model stability and prediction accuracy.
Figure 8 demonstrates the classification results achieved by the proposed method together with the corresponding farthest visible fog regions. The red boxes in the figure come from the weakly supervised localization module, which is guided by the attention map. It can clearly be seen that the image area the attention focuses on is concentrated in the farthest visible region: when visibility is high, attention tends to focus on the more distant parts of the road, while in low-visibility conditions it attends more to nearby regions containing road-texture details.
4.4.3. Ablation Study
The innovations of our proposed method mainly include two aspects: the introduction of dual-branch weakly supervised localization based on bilinear attention pooling and discrete label distribution learning. To verify their effectiveness, we conducted ablation experiments on the combination of these two modules, and the experimental results are shown in
Table 8.
First, we investigated the impact of introducing the dual-branch weakly supervised localization based on bilinear attention pooling. Specifically, we considered three scenarios: ResNet18, ResNet18 combined with the base branch, and ResNet18 combined with both the base and the attention-based branches. The experimental results show that after adding BAP, the model can use the attention mechanism to focus on local features in the image. After adding the attention-based branch, the image produced by the weakly supervised localization module reinforces the feature-region information strengthened by attention in the base branch, thereby improving the accuracy of the model.
We then investigated the impact of incorporating the discrete label distribution learning module into the base branch and the attention-based branch. The results show that adding the discrete label distribution only to the base branch has a positive impact on the network, while adding it to the attention-based branch harms the model. This suggests that the visibility features in the farthest fog-visible region generated from the base branch are more concentrated, so the globally scattered distribution of visibility features no longer needs to be described by a discrete label distribution. Therefore, incorporating the discrete label distribution only into the base branch is more advantageous for the visibility estimation task in our proposed method.