CAM-FRN: Class Attention Map-Based Flare Removal Network in Frontal-Viewing Camera Images of Vehicles

Abstract: In recent years, active research has been conducted on computer vision and artificial intelligence (AI) for autonomous driving, increasing the importance of object detection technology using a frontal-viewing camera. However, using an RGB camera as a frontal-viewing camera can generate lens flare artifacts due to strong light sources, components of the camera lens, and foreign substances, which damage the images and make the shape of objects in the images unrecognizable. Furthermore, object detection performance is significantly reduced by a lens flare during semantic segmentation performed for autonomous driving. Flare artifacts are challenging to remove because they are caused by various scattering and reflection effects. State-of-the-art methods developed for general scene images retain artifactual noise and fail to eliminate the flare entirely when the input image contains a severe level of flare. In addition, no study has been conducted to solve these problems in the field of semantic segmentation for autonomous driving. Therefore, this study proposes a novel lens flare removal technique based on a class attention map-based flare removal network (CAM-FRN) and a semantic segmentation method using the images from which the lens flare has been removed. CAM-FRN is a generative-based flare removal network that estimates flare regions, generates images with these regions highlighted as input, and incorporates the estimated regions into the loss function for successful artifact reconstruction and comprehensive flare removal. We synthesized lens flares using the Cambridge-driving Labeled Video Database (CamVid) and the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) datasets, which are open road scene datasets.
Experimental results showed that semantic segmentation on images from which the lens flare was removed by CAM-FRN achieved 71.26% and 60.27% mean intersection over union (mIoU) on the CamVid and KITTI databases, respectively, significantly outperforming state-of-the-art methods.


Introduction
There is an increasing need for object detection and recognition technologies that prevent accidents during autonomous driving by precisely identifying the road conditions around a vehicle. In recent years, semantic segmentation methods have been used to identify objects on roads accurately, and further studies are being conducted to enhance segmentation performance [1][2][3][4][5][6][7]. However, there are limitations when detecting objects using a frontal-viewing camera. Ceccarelli et al. [8] reported that a flare is one of the causes of the failure of an RGB camera in autonomous driving vehicle applications. Figure 1 shows the generation process of a lens flare. To acquire a normal image, light from the light source and object must reach the image sensor through a correct path, as indicated by the figure's dotted gray and solid black lines. However, unintentional reflection and scattering (indicated by solid orange and yellow lines in Figure 1) may be caused by a light ray from a light source interacting with damage or foreign substances on the front part of a lens. As shown in Figure 1, an artifact caused by a light source is overlaid on top of the existing scene as a layer, which generates a lens flare [9,10]. This significantly degrades semantic segmentation performance and can lead to inaccurate decisions in dangerous situations during autonomous driving. Figure 2 shows the effects of a lens flare on semantic segmentation, which is required for object detection during autonomous driving. Figure 2d shows an image damaged by a lens flare. When semantic segmentation by DeepLabV3+ [7] is performed on the image in Figure 2d, the segmentation error worsens to the extent that objects are undetectable, as shown in Figure 2e. Comparing Figure 2c,e, the DeepLabV3+ [7] segmentation results for the original and flare-damaged images, shows that a lens flare is an obstacle in autonomous driving, as it negatively affects the object detection system.
A lens flare can be prevented to a certain extent by improving the camera hardware. An anti-reflective coating can be applied to a lens, or a lens flare can be suppressed by improving the camera barrel or lens hood. However, such hardware improvement measures are expensive and can prevent only certain types of lens flares [11][12][13]. Another method involves software improvement. In particular, there are handcrafted feature-based methods for automatically detecting and removing flares in images with a lens flare [10,[14][15][16][17][18]]. Because the types of flares that can be removed using handcrafted feature-based methods are limited, such methods are difficult to apply to autonomously driving vehicles. Therefore, we performed semantic segmentation tasks after removing the lens flare using deep learning methods.
However, there is a limitation to removing a lens flare using deep learning methods. There is insufficient training data for supervised learning, and extensive amounts of time and effort are required to obtain a pair of images with and without a lens flare at the same location and time. When such image pairs are acquired, the data acquisition process becomes complicated owing to certain conditions (e.g., the angle at which the light hits the front of the camera lens, the location of the light source, etc.) that must be satisfied to generate a lens flare. Even when a pair of images with and without a lens flare is obtained from the same scene, the two images cannot be guaranteed to be captured under the same conditions. Therefore, we used the lens flare generation method proposed by Wu et al. [9], which classifies lens flares into scattering and reflective cases, to solve the issue of insufficient training data. Scattering lens flare images were generated using a physics-based data generation method grounded in the optics of a lens flare. Conversely, reflective flares were obtained directly through experiments because obtaining this type of data through a simulation is rather difficult. They thus created a dataset for the single image flare removal (SIFR) task by synthesizing lens flares with clean, flare-free images. Data synthesis is used to create a lens flare removal dataset for training, considering that a lens flare, as shown in Figure 1, is overlaid on top of an existing scene. Accordingly, previous work [9] proposed synthesizing lens flare artifacts with flare-free images to generate the training data. Therefore, in this study, we used the method proposed in [9] to synthesize lens flare artifacts with CamVid [19] and KITTI [20] dataset inputs for which a semantic segmentation label exists.
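Because the flare layer is overlaid on the clean scene, the training pairs can be produced by additive compositing. The sketch below is a minimal illustration of that idea, assuming float images in [0, 1]; the function name and the `gain` knob are our own illustration, not the authors' released code.

```python
import numpy as np

def composite_flare(clean, flare, gain=1.0):
    """Additively composite a flare layer onto a clean image.

    A minimal sketch of the overlay used to synthesize flare-corrupted
    training pairs; `gain` (an assumed knob, not from the paper) scales
    the flare intensity. Both inputs are float arrays in [0, 1].
    """
    corrupted = clean + gain * flare   # flare is overlaid as a layer
    return np.clip(corrupted, 0.0, 1.0)

clean = np.zeros((4, 4, 3))            # toy "clean" image
flare = np.full((4, 4, 3), 0.6)        # toy flare layer
out = composite_flare(clean, flare)
```

The clean image serves as the segmentation-labeled ground truth, while the composite serves as the corrupted network input.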
In addition, flare artifacts can be a combination of different types of scattering and reflection artifacts. Accounting for these various artifacts when removing flare remains a challenge, and in some cases the network cannot remove them successfully, leaving artifactual noise [9,21,22]. Furthermore, in some cases the network cannot remove the flare if a severe level of flare exists in the input image [23]. To address these issues, we propose a generative-based flare removal network that estimates the flare region in an image and generates additional images that highlight it, which also serve as input to the network. In addition, we incorporate the estimated flare region into a loss function to successfully reconstruct objects occluded by the artifact and effectively remove flare artifacts that appear throughout the image.
This study proposes a novel lens flare removal technique based on a class attention map-based flare removal network (CAM-FRN) and a semantic segmentation method using images with the lens flare removed. The novelty of the proposed method with respect to previous studies is as follows.

• This study is the first to solve the lens flare problem in the field of semantic segmentation for frontal-viewing camera images using CAM-FRN as a solution;
• We propose a class attention map (CAM) module utilizing a ResNet-50 classifier to effectively detect and remove areas damaged by lens flare artifacts. Additionally, we incorporate the obtained flare regions into the network's objective function, enabling efficient lens flare removal;
• We propose an atrous convolution dual channel attention residual block (ADCARB) that estimates the features corrupted by flare via channel attention and a sigmoid function while performing multi-scale learning utilizing dilated convolution [24] to remove flare;
• By applying self-attention to the latent space, global information is considered. To consider local information simultaneously, the latent space before and after self-attention is fused and then delivered to the decoder. Lastly, the CAM-FRN model with code and the flare-generated image database are publicly disclosed for a fair performance evaluation by other researchers via the GitHub site [25].
The remainder of this paper is organized as follows: Section 2 introduces previous research methods related to this study. Section 3 explains the details of the proposed method. Section 4 analyzes the experimental results, and Section 5 presents the discussion. Lastly, Section 6 concludes the study and presents future research directions.

Related Works
Research on lens flare removal can be categorized into two main areas: general scene image environment, focusing on image quality improvement, and vehicle frontal viewing camera image environment, emphasizing semantic segmentation accuracy. Notably, the latter domain lacks prior research dedicated to solving the lens flare problem. In contrast, the former domain has existing studies proposing lens flare removal methods; however, these methods predominantly concentrate on enhancing image quality and do not address the specific objective of improving semantic segmentation accuracy.

Studies on Image Quality Improvement in General Scene Images
Previous studies that have proposed lens flare removal in general scene images can be categorized into hardware- and software-based methods.

Hardware-Based Methods
Several studies have attempted different methods to mitigate a lens flare through camera hardware and optical design. First, an anti-reflective coating can be applied to the camera lens to prevent flare artifacts from being generated. However, considering that an anti-reflective coating suppresses and removes a lens flare only when a light ray comes in at a specific angle under appropriate conditions, it cannot be used as a solution for all lens flare artifacts. Boynton et al. [11] proposed a simulated-eye design (SED) wherein the camera interior is filled with liquid, which prevents unintentional reflection in a lens by acting as an anti-reflective coating. However, this method requires a complicated camera design compared with a general RGB camera, which increases costs. Unlike previous methods that involved analyzing a lens flare in a two-dimensional image, Raskar et al. [12] demonstrated that lens flares occur in a four-dimensional light ray space and statistically analyzed flare artifacts generated inside a camera. However, as mentioned in [12], their method cannot eliminate the streaks of light appearing on the aperture or the diffraction effect and cannot resolve the issue of light glare caused by the surrounding environment, such as fog. Additionally, the blooming phenomenon caused by a sensor and the purple-fringing phenomenon cannot be resolved, considering that a lens flare cannot be removed if a light source is extended, as in the case of vehicle headlights. Talvala et al. [13] proposed a method for analyzing and removing veiling glare and lens flare artifacts for diverse kinds of digital cameras by configuring an occlusion mask based on measured data and selectively blocking light that triggers flare and glare.

Software-Based Methods
The hardware-based lens flare removal methods explained above are generally designed by analyzing certain types of flare in acquired camera images and hence cannot prevent or remove various types of artifacts. Moreover, additional costs are incurred because the cameras require design modifications. To overcome these drawbacks, software-based methods for detecting and removing a flare in images have been developed based on image processing algorithms, which can be roughly classified into handcrafted feature-based and deep feature-based methods.
(1) Handcrafted Feature-Based Methods
Wu et al. [14] proposed a method to extract shadows in an image through Bayesian optimization. This method, however, requires a user to provide information about the shadows. Asha et al. [15] proposed a method for removing bright spots generated when a scene with a strong light source is captured by a camera. However, the proposed method could only be applied to certain types of artifacts or bright spots. Chabert et al. [16] proposed a two-step post-processing method for detecting the region damaged by a lens flare in an image and restoring the damaged region. The method proposed in [16] is effective in removing certain flare types, such as ghosting, but is ineffective for other types of flare. Similar to [15,16], Vitoria et al. [17] proposed a method for automatically detecting the flare region and estimating and restoring a mask for the detected region. However, their method only detects and removes flare spots and ghosting artifacts caused by the reflection of lens components inside a camera instead of detecting and removing various types of lens flare artifacts. Koreban et al. [18] proposed a method to mitigate a flare using images of two frames captured by a moving camera. The method proposed in [18] is specialized for a specific type of flare and requires continuous images. Zhang et al. [10] removed a flare in an image by decomposing the image damaged by a flare into scene and flare layers and eliminated the effects of a flare by adjusting the brightness and color balance of the scene layer. However, the separation of the scene and lens flare layers may not work appropriately if texture features are not clearly exposed, and the color of a local object may be distorted.
(2) Deep Feature-Based Methods
Deep learning technologies have been gaining wide attention in recent years and have been widely used in restoration tasks. In particular, research is actively being conducted for cases where images are damaged by environmental factors such as fog or rain [21,22]. However, there is limited research on removing artifacts generated inside a camera by a strong light source. Lens flare removal tasks are difficult to solve because distinguishing a light source from a flare is difficult, and obtaining paired data with and without a flare is challenging. Considering these difficulties, the following deep learning-based studies examined different methods for removing lens flare artifacts generated in the process of acquiring images.
Wu et al. [9] successfully developed deep learning-based lens flare removal by focusing on the difficulty of obtaining pairs of images with and without a lens flare. They proposed a semi-synthetic data synthesis technique for creating flare-damaged images using two types of flare artifacts, together with a flare removal method using a U-Net [26] architecture. This method outperforms handcrafted methods for lens flare removal; however, it removes the light source along with the artifacts, so the light source must be re-synthesized through post-processing, and the method still cannot accurately remove flares. Qiao et al. [23] proposed an unpaired dataset called the "unpaired flare removal (UFR) dataset" by focusing on the fact that it is challenging to acquire a paired dataset for flare removal tasks. Furthermore, they observed that information about a flare, such as its shape and color, resides in the light source and hence conducted unsupervised learning based on this observation. A light source mask and a flare mask were estimated within an image using an encoder-decoder structure, and the flare removal and generator modules based on a cycle-consistent generative adversarial network (CycleGAN) were trained using the two masks, flare images, and flare-free images. Although this method can generalize to real-life flare images through unsupervised learning, it inadequately removes lens flare artifacts found throughout an image.
As seen above, previous studies concentrated on improving the quality of general scene images through lens flare removal rather than improving semantic segmentation accuracy. Therefore, no study has examined a solution for the lens flare problem in the field of semantic segmentation in frontal-viewing camera images captured by a vehicle. A more detailed explanation is provided in the following subsection.

Studies on Improving the Semantic Segmentation Accuracy in Frontal-Viewing Camera Images of a Vehicle
Previous studies can be divided into handcrafted feature-based and deep feature-based methods.

Handcrafted Feature-Based Methods
Previous studies on handcrafted semantic segmentation [27][28][29][30][31] performed segmentation using superpixels, which are sets of connected similar pixels, or using contextual models such as the conditional random field (CRF) and the Markov random field (MRF), which is based on Markov theory. Tu et al. [27] proposed a method of utilizing context information to solve high-level vision problems. Kontschieder et al. [28] suggested a method of integrating structural information, wherein the object class labels of semantic segmentation are formed in designated regions of an image, with the random forest framework. Semantic segmentation using a hierarchical CRF, which is an advance over the existing CRF, demonstrates better performance by combining multi-scale contextual information; however, it generates excessively simplified models that cannot allocate multiple labels. Gonfaus et al. [29] suggested harmony potential, which can encode all possible combinations of class labels to overcome this drawback. Furthermore, they suggested a two-stage CRF utilizing harmony potential. Kohli et al. [30] suggested a new segmentation framework using an unsupervised algorithm based on a higher-order CRF. They focused on how the superpixels obtained from the unsupervised segmentation algorithm belong to the same object and how higher-order features can be computed from all pixels constituting a segment and used for classification. Their method proceeds with segmentation by combining conventional unary and pair-wise information using a higher-order CRF for potential functions defined over sets of pixels. Zhang et al. [31] suggested a framework for semantic parsing and object recognition based on depth maps by extracting 3D features of object classes in a dense map using the random forest, followed by segmenting and recognizing various object classes by combining them with the features extracted from the MRF framework.
The handcrafted methods [27][28][29][30][31] exhibit outstanding semantic segmentation performance in frontal-viewing camera images as in the CamVid dataset; however, a user must adjust the detailed parameters, which requires an extensive period of time for optimization. In addition, such methods are inadequate for classifying small objects such as streetlights, road signs, and poles if objects of different sizes are present, as in the CamVid dataset.

Deep Feature-Based Methods
Several studies have been conducted [1][2][3][4][5][6][7] to overcome the shortcomings of existing handcrafted-based methods based on deep learning. SegNet [1] has a symmetrical encoder-decoder structure, where max pooling indices are delivered to the max pooling layer of the encoder and the upsampling layer of the corresponding decoder to preserve the information of the pixels lost during the max pooling process of the encoder. The results of previous segmentation models did not adequately distinguish the objects' boundary; however, SegNet can accurately simulate the object boundary and is efficient in terms of memory and computational time during the inference process. However, relatively smaller objects such as streetlights, poles, road signs, and fences are not adequately detected. A pyramid scene parsing network (PSPNet) [2] was proposed to solve the problem of classifying incorrect semantic classes that are inappropriate for image situations considering previous segmentation methods did not consider the global context of input images. They applied various pooling operations to the feature maps extracted through a convolutional neural network (CNN) and connected them to obtain the segmentation prediction result. Various pooling operations enable the model to learn feature maps in different resolutions, and global contextual information can be considered when all information is combined. Classes appropriate for an image scene can be classified as global contextual information and verified. Image cascade network (ICNet) [3] provides detailed segmentation results with enhanced speed by extracting features from input images of various resolutions based on cascade feature fusion and cascade label guidance. Although the inference time and frames processed per second are improved compared with other models, accuracy is lower compared with state-of-the-art models. 
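SegNet's index-preserving upsampling, described above, can be illustrated with a small sketch using PyTorch's pooling operators (the framework choice is ours; it simply shows how saved max-pooling indices let the decoder place values back at the exact positions the maxima came from):

```python
import torch
import torch.nn.functional as F

# Encoder side: max pooling with return_indices=True records, for every
# pooling window, the flat position of its maximum.
x = torch.arange(16.0).reshape(1, 1, 4, 4)
pooled, idx = F.max_pool2d(x, kernel_size=2, return_indices=True)

# Decoder side: max unpooling restores the pooled values to those exact
# positions and fills the rest with zeros, preserving boundary locations.
up = F.max_unpool2d(pooled, idx, kernel_size=2)
```

Because only indices (not full feature maps) are passed across, this is memory-efficient at inference time, as the paper notes.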
It is important to extract features from various receptive fields to detect different types of objects effectively. Therefore, various versions of DeepLab [4][5][6][7] proceeded with semantic segmentation using atrous convolution (dilated convolution). DeepLabV1 [4] used convolution of a fixed dilated rate; however, DeepLabV2 [5] introduced atrous spatial pyramid pooling (ASPP) where multi-scale feature information can be obtained from various receptive fields by combining features that have undergone different dilated rates. DeepLabV3 [6] uses an ASPP module that is enhanced from ASPP introduced in [5]. The difference is that spatial information loss is reduced significantly by applying different dilated rates according to the changes in the output stride. Segmentation is performed by capturing the information of multi-scale features and various objects in an image accordingly. The authors of [6] predicted segmentation results by applying a simple bilinear upsampling process to the features from the encoder in the decoder, which decreases the resolution of segmentation results, thereby preventing detailed information from being detected. As a solution, DeepLabV3+ [7] predicts the segmentation results by concatenating the feature maps of the interim stage and the last stage of the encoder and upscaling after learning.
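The ASPP idea described above can be sketched as parallel 3 × 3 atrous convolutions whose outputs are concatenated and projected; the dilation rates and channel sizes below are illustrative choices, not the exact DeepLab configuration.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal ASPP sketch (after DeepLabV2/V3): parallel 3x3 atrous
    convolutions with different dilation rates are fused so features from
    several receptive-field sizes contribute to the prediction."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Same spatial size in every branch (padding == dilation), so the
        # multi-scale features can be concatenated channel-wise and fused.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

y = ASPP(8, 16)(torch.randn(1, 8, 32, 32))
```

Setting `padding` equal to the dilation rate keeps the output resolution fixed, which is what allows the branches to be fused without resampling.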
However, previous studies did not consider the lens flare issue in images captured by a frontal-viewing camera of a vehicle. To resolve this problem, this study proposes a novel lens flare removal technique based on CAM-FRN and a semantic segmentation method using images from which the lens flare has been removed. Table 1 compares previous methods and the proposed method of semantic segmentation with frontal-viewing camera images of vehicles.
Table 1. Comparison of previous methods and the proposed method on semantic segmentation with frontal-viewing camera images of vehicles.

Not considering lens flare:
Handcrafted feature-based methods: auto-context algorithm [27], structural information + random forest [28], harmony potential + CRF [29], higher-order CRF [30], and dense depth maps-based framework [31]
Advantages: adequate semantic segmentation performance can be obtained by considering both contextual information and low-level information through superpixels, MRF, and CRF
Disadvantages: user must directly adjust the parameters in detail, and perfect optimization requires a long time
Deep feature-based methods: SegNet [1], PSPNet [2], ICNet [3], and DeepLab [4][5][6][7]
Advantages: objects of various sizes are detected with high accuracy by applying pooling layers of different sizes or receptive fields, or by sending pooling indices information to the decoder
Disadvantages: semantic segmentation performance is degraded when a lens flare occurs in an image because images damaged by a lens flare are not taken into consideration
Considering lens flare:
Proposed method (CAM-FRN)
Advantages: the lens flare region in an image is highlighted through the CAM, and a lens flare is effectively removed by reflecting a binary mask for the lens flare region obtained through the CAM in the loss
Disadvantages: the light source is removed along with a lens flare owing to insufficient training data

Figure 3 shows the overall architecture of the proposed model. In the first step, when a frontal-viewing camera image is input, a CAM with the flare region highlighted is obtained using the weights of the ResNet-50 [32]-based binary classifier, which classifies the presence of a lens flare. The ResNet-50 classifier uses images with and without flare as input during training, labeled 1 and 0, respectively. Then, we create three additional input images through the CAM, which are concatenated channel-wise and input to the proposed CAM-FRN. In the second step, CAM-FRN removes the flare from the received images. Finally, the final segmentation map is predicted as the flare-free image is input to the segmentation network.

Flare Removal by CAM-FRN and Semantic Segmentation
3.2.1. Step 1: Generation of CAM and Channel-Wise Concatenated Inputs to CAM-FRN
In this step, once the image is input to the ResNet-50 classifier, the feature map and the weights determined by the presence of flare in the image are used to find the CAM for the lens flare class. If a lens flare artifact is present in the image, as shown in Figure 4b, the feature map from the last CNN layer of ResNet-50 and the weights of the classifier can be used to find the CAM. The equation to find the CAM for class $c$ is expressed as follows [33]:

$\mathrm{CAM}_c(x, y) = \mathbf{w}_c^{\mathsf{T}} \mathbf{F}(x, y)$, (1)

where $c$ represents the two classes of flare-corrupted and non-flare-corrupted images.
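The CAM computation of Equation (1) reduces to a weighted sum of the last convolutional feature maps. The sketch below illustrates this with NumPy; the variable names and the [0, 1] normalization step are our assumptions (normalization is needed before the fixed 0.2 threshold used later can be meaningful).

```python
import numpy as np

def class_activation_map(feature_map, fc_weights, cls):
    """Compute CAM_c(x, y) = w_c^T F(x, y) as in Equation (1).

    feature_map: (K, H, W) output of the last conv layer of the classifier.
    fc_weights:  (num_classes, K) trained classifier weights.
    The paper uses a ResNet-50 binary classifier (flare / no-flare);
    here the arrays are random stand-ins."""
    cam = np.tensordot(fc_weights[cls], feature_map, axes=([0], [0]))  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()          # normalize to [0, 1] before thresholding
    return cam

F = np.random.rand(2048, 7, 7)    # ResNet-50's last stage has 2048 channels
w = np.random.rand(2, 2048)       # two classes: flare / no-flare
cam = class_activation_map(F, w, cls=1)
```

In practice the low-resolution CAM is upsampled to the input image size before it is used to build the additional network inputs.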
$\mathbf{F}(x, y)$ is the feature map output from the last CNN layer of ResNet-50, and $\mathbf{w}_c$ denotes the trained weights of the ResNet-50 classifier for class $c$. The notation $\mathbf{w}_c^{\mathsf{T}}$ indicates the transpose of $\mathbf{w}_c$, which is applied in the matrix product used to calculate the CAM. The CAM obtained using a flare-corrupted image as input is denoted as $A(x, y)$ in Equations (2) and (3); $x$ and $y$ are the two-dimensional coordinates of the feature map. When Figure 4a is input to the ResNet-50 binary classifier, the flare region appears as shown in Figure 4b, considering the classifier determines whether there is a lens flare in the image. Using Figure 4b, we can highlight and detect the flare region. Figure 4c is the result of multiplying Figure 4a,b, as expressed in Equation (2), and shows the image in which the flare region is highlighted:

$I_h(x, y) = I_f(x, y) \odot A(x, y)$, (2)

where $I_f$ is the flare-corrupted input image and $\odot$ denotes element-wise multiplication. Figure 4d is a binary mask in which the flare region, created through binarization by applying a threshold value to Figure 4b, has a value of 1, while all other regions have a value of 0; the flare region mask $M_f$ is computed as follows:

$M_f(x, y) = \begin{cases} 1, & \text{if } A(x, y) > 0.2 \\ 0, & \text{otherwise,} \end{cases}$ (3)

where the optimal threshold of 0.2 was experimentally determined to obtain the highest accuracy of semantic segmentation using the training data. Lastly, Figure 4e shows an image created by covering the flare region with the mask using Figure 4d and Equation (4):

$I_m(x, y) = I_f(x, y) \odot (1 - M_f(x, y))$. (4)
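The construction of the three auxiliary inputs from the CAM (Equations (2)-(4)) can be sketched in a few lines; the function and variable names are ours, and the CAM is assumed to be already normalized to [0, 1] and upsampled to the image resolution.

```python
import numpy as np

def cam_inputs(image, cam, tau=0.2):
    """Sketch of Equations (2)-(4): build the highlighted image, the
    binary flare mask (threshold tau = 0.2 as in the paper), and the
    masked image from a normalized CAM.

    image: (H, W, 3) float image; cam: (H, W) map in [0, 1].
    Broadcasting applies the per-pixel map to every channel."""
    cam3 = cam[..., None]
    highlighted = image * cam3                 # Eq. (2): flare region emphasized
    mask = (cam > tau).astype(image.dtype)     # Eq. (3): binary flare mask
    masked = image * (1.0 - mask[..., None])   # Eq. (4): flare region blanked out
    return highlighted, mask, masked

img = np.ones((2, 2, 3))
cam = np.array([[0.1, 0.3], [0.5, 0.0]])
hi, mask, masked = cam_inputs(img, cam)
```

Pixels whose CAM value exceeds the threshold are treated as flare, so in the masked image they become zeros, mirroring the missing-pixel formulation used for inpainting.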
For lens flare removal, we used the images shown in Figure 4a-e as inputs to CAM-FRN. Figure 4a is the flare-corrupted input image. Using Figure 4d,e, we defined the flare region as having missing pixel values, as in an inpainting task, and CAM-FRN performs inpainting for the relevant region. In other words, we used four types of input in CAM-FRN to specify the lens flare region in an image through the CAM, provide additional information on the relevant region, and restore the image details covered by the flare region. The inputs provided for this process are defined as:

$I_c = \mathcal{C}(I_f, I_h, M_f, I_m)$, (5)

where $\mathcal{C}(\cdot)$ indicates channel-wise concatenation. Based on Equation (5), we define the concatenated image as $I_c$ and use $I_c$ as the input image of CAM-FRN for flare removal.
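The channel-wise concatenation of Equation (5) is a single array operation. In the sketch below, the 3 + 3 + 1 + 3 = 10 channel layout is our assumption from Figure 4a,c,d,e (three RGB images plus the one-channel binary mask).

```python
import numpy as np

# Equation (5) sketch: the four inputs (flare image, highlighted image,
# binary flare mask, masked image) are concatenated along the channel axis
# to form the network input I_c.
I_f = np.zeros((64, 64, 3))   # flare-corrupted image (Figure 4a)
I_h = np.zeros((64, 64, 3))   # flare-highlighted image (Figure 4c)
M   = np.zeros((64, 64, 1))   # binary flare mask (Figure 4d)
I_m = np.zeros((64, 64, 3))   # flare-masked image (Figure 4e)

I_c = np.concatenate([I_f, I_h, M, I_m], axis=-1)
```

The first convolution of CAM-FRN's encoder would then simply accept a 10-channel input instead of the usual 3.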

3.2.2. Step 2: Lens Flare Removal by CAM-FRN
The image damaged by a lens flare can be expressed using the equation below, based on the observation in Figure 1:

$I_F = I_0 + f$, (6)

where $I_0$ refers to a clean image without a flare and $f$ refers to lens flare artifacts. $I_F$ indicates an image synthesized with a lens flare. We aimed to remove the lens flare artifact ($f$) overlaid in $I_F$ in Equation (6) and retain only $I_0$. In this section, we explain how lens flare artifacts in images captured by a frontal-viewing camera of a vehicle are removed by CAM-FRN, which requires the concatenated image generated using the CAM obtained in step 1, together with the three additional images, as input.
(1) Structure of CAM-FRN
Figure 5 shows the architecture of CAM-FRN. The four types of inputs provided in step 1 are concatenated to generate the input for CAM-FRN. The structure of CAM-FRN includes a generator, comprising an encoder and a decoder, and a discriminator; a reparameterization trick for variational inference is added between the encoder and the decoder.
The concatenated input enters the encoder of CAM-FRN to extract features and passes through a total of five ADCARBs. ADCARBs perform multi-scale feature learning using dilated convolution (atrous convolution) layers and are trained to remove lens flare by extracting the parts of the feature map damaged by lens flare artifacts as a mask. Additionally, Tables 2-4 show in more detail the structure of the layers and modules used in our proposed CAM-FRN.
We performed variational inference using the feature maps that underwent the ADCARBs of the encoder. We aimed to remove lens flare through variational inference, inspired by the variational auto-encoder (VAE) [34,35] and the VAE with denoising criterion [36]. To ensure that the image generated by the generator is semantically similar to the ground truth, the latent variable z was sampled from the probability distribution conditioned on the ground truth. If an image is generated accordingly, the generated image can be semantically similar to the ground truth instead of merely having a close Euclidean distance [35]. To obtain the same effect, variational inference was applied to the proposed method to create an image that is semantically similar to the ground truth. Unlike an existing VAE, which reconstructs its input image, we attempted to reconstruct a clean image without lens flare from the image damaged by a lens flare. Inspired by [36], we defined the lens flare as noise, utilized the proposed encoder network as an inference network for variational inference, and applied the denoising variational lower bound suggested in [36].
The denoising variational lower bound [36] was proposed for applying the denoising criterion used in a denoising auto-encoder (DAE) to a VAE framework, and we used it to remove the lens flare as noise. Then, self-attention was applied to the latent space obtained from variational inference, and the latent spaces with and without self-attention were fused through a convolution layer. A detailed explanation is provided in the section "Variational Inference with Latent Space Fusion Using Self-Attention".
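The latent-space steps described above (reparameterized sampling, self-attention for global context, and fusion of the pre- and post-attention latents for local context) can be sketched as follows; the module layout, head count, and channel sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """Sketch of the latent-space stage: sample z via the reparameterization
    trick, apply self-attention for global information, then fuse the
    pre- and post-attention latents with a 1x1 convolution so local
    information is preserved for the decoder."""
    def __init__(self, ch):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, num_heads=1, batch_first=True)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        b, c, h, w = z.shape
        seq = z.flatten(2).transpose(1, 2)              # (B, HW, C) tokens
        att, _ = self.attn(seq, seq, seq)               # global self-attention
        z_att = att.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([z, z_att], dim=1))  # local + global fusion

out = LatentFusion(8)(torch.zeros(1, 8, 4, 4), torch.zeros(1, 8, 4, 4))
```

Concatenating the two latents before the 1 × 1 convolution lets the network learn how much global (attended) versus local (unattended) information to pass to the decoder.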
The obtained latent space z was sent to the ADCARBs and the decoder. Then, the feature map resolution was increased to the input image size through transposed convolution, and the output image was generated through a convolution layer and a sigmoid layer at the end. The final output image was generated by summing the generated image and the image damaged by the lens flare through a residual connection. The generated image then entered the discriminator of PatchGAN [37]. We ensured that the image generated using the discriminator was similar to the ground truth and improved the quality of the flare removal image. Furthermore, the mask in Figure 4d was created using the CAM, and images utilizing this mask were provided as input to the discriminator; the mask was multiplied as shown in Equations (7) and (8) below.
where the first term denotes the lens-flare-removed image and the second an image in which only the pixel values of the lens flare region are retained. Likewise, the clean image without lens flare and the image showing only the lens flare region of the ground truth are obtained by applying the CAM-based mask to the ground truth instead of to the image generated by CAM-FRN. We input both flare-region images into the discriminator, so that lens flare was removed by focusing more on the flare region. The detailed explanation is provided in Section of "Total loss function of CAM-FRN".
(2) ADCARB
To remove the lens flare component as thoroughly as possible from the input using the four inputs obtained in Step 1, we propose a new residual block, ADCARB, as shown in Figure 6.
When a feature map is provided as an input to ADCARB, we obtain feature maps trained with receptive fields of various sizes through 3 × 3 dilated convolutions with dilation rates of 1, 4, and 16. As the activation function after each dilated convolution layer, the Gaussian error linear unit (GELU) [38] was used instead of the rectified linear unit (ReLU) [39]. Unlike ReLU, which determines a value depending only on the sign of the input feature, GELU introduces probabilistic characteristics by multiplying the input feature with the standard Gaussian cumulative distribution function. The feature maps extracted with different dilation rates pass through GELU and are concatenated. The concatenated feature maps comprise information extracted from receptive fields of various sizes, and they then pass through a residual block that uses GELU, instead of ReLU, in the existing CycleGAN structure. Because the resulting feature map is created by fusing feature maps obtained from various receptive fields, it considers information at different scales for removing flare.
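The multi-receptive-field front end described above can be sketched as follows. This is a minimal PyTorch illustration assuming the layout stated in the text (parallel 3 × 3 dilated convolutions with rates 1, 4, and 16, GELU activations, and channel-wise concatenation); the channel counts and the 1 × 1 fusion convolution are our own illustrative assumptions, not the exact ADCARB implementation.

```python
import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    """Sketch of ADCARB's multi-receptive-field front end (assumed layout):
    parallel 3x3 dilated convolutions with rates 1, 4, and 16, each followed
    by GELU, then channel-wise concatenation and a 1x1 fusion convolution."""
    def __init__(self, channels=64):
        super().__init__()
        # padding == dilation keeps the spatial size for a 3x3 kernel
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (1, 4, 16)
        ])
        self.act = nn.GELU()
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        feats = [self.act(branch(x)) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(1, 64, 75, 75)        # e.g., a downsampled feature map
y = MultiDilationBlock(64)(x)         # spatial size is preserved
```

Because each branch pads by its dilation rate, all three outputs share the input's spatial size and can be concatenated directly.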
In addition, we highlighted the degraded channels within the feature through channel attention and extracted the degraded part as a mask after a 3 × 3 convolution and a sigmoid. Using this mask and the fused feature map, we defined the following equation [40].
where the input term is the feature map fed into ADCARB, and the mask and fused feature map are defined as above.
The resulting feature map is determined by considering the mask in both the input and the fused features. We multiplied the fused feature map with (1 − mask) to highlight the parts not degraded by flare and obtained, from the input feature, a feature map that highlights the parts degraded by flare. The channels with evident features were again highlighted through channel attention for lens flare removal, and the output feature map of the previous ADCARB was considered through a residual connection. Lastly, channel attention was applied to the feature map obtained through Equation (9) to extract feature maps that highlight the channels important for flare removal training. We propose ADCARB to extract features from various receptive fields and to learn the regions covered by lens flare by extracting the flare-damaged regions as a mask, thereby enabling inpainting.
(3) Variational Inference with Latent Space Fusion Using Self-Attention
In this section, we explain the process of sampling the latent space from a meaningful probability distribution that considers the clean image without lens flare through the variational inference introduced in Section of "Structure of CAM-FRN". As proposed in [36], we can define the posterior distribution as in Equation (10), where the corrupted observation is the original image with the lens flare artifacts of Equation (6) added. The lens flare artifacts can be considered as noise, and we can apply a denoising criterion to remove this noise. The corruption distribution in Equation (10) represents the image distribution damaged by noise (lens flare), and we can sample the flare-damaged image from it. q∅(•) represents the proposed encoder network and is used as the inference network for variational inference; its parameters ∅ comprise the mean and variance, which are trained to approximate a Gaussian distribution in the inference network.
Then, the process of generating images with the latent variable z, sampled from q∅(•), can be expressed as shown in Equation (11), where the generator network and its trainable parameters appear explicitly. As in [36], we can express the evidence lower bound (ELBO) for denoising (lens flare removal), as shown in Equations (12) and (13), based on Equations (10) and (11). Because our final goal is to maximize the ELBO ℒ, it can be expressed as shown below, as proven in [36].
If ℒ is maximized, the parameters ∅ are learned to minimize the difference between the true posterior distribution and the posterior distribution q∅(•) inferred from the image damaged by lens flare. We used the denoising variational lower bound proposed in [36] to define the variational lower bound for flare removal, our ultimate goal, as given in Equation (14). Using Equation (14), CAM-FRN can sample the latent space from a meaningful probability distribution that considers the ground truth, thereby generating a clean, noise-free image that is semantically similar to the ground truth. However, we did not simply use only the images damaged by lens flare as input to CAM-FRN. As explained in Section 3.2.1 and Section of "Structure of CAM-FRN", we aimed to improve the performance by providing additional information on the lens flare. The concatenated input defined in Equation (5) becomes the input of CAM-FRN, and ultimately, the loss equations for variational inference are as shown in Equations (15) and (16).
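A minimal sketch of the reparameterization-based sampling and the two variational loss terms discussed above is given below. Because Equations (15) and (16) are not reproduced here, the standard-normal KL term and the L1 reconstruction term are hedged stand-ins for illustration, not the paper's exact losses.

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    so gradients can flow through the sampling step."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def variational_losses(mu, logvar, restored, clean):
    """Hypothetical stand-ins for Equations (15) and (16): a KL term pulling
    the inferred posterior toward a standard-normal prior, and a
    reconstruction term between the restored image and the flare-free
    ground truth (L1 distance assumed here)."""
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    recon = torch.mean(torch.abs(restored - clean))
    return kl, recon
```

With mu = 0 and logvar = 0 the KL term vanishes, matching the prior, and the reconstruction term is zero exactly when the restored image equals the clean ground truth.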
We conducted an ablation study for the cases with and without variational inference, which confirmed that better performance was achieved when variational inference was used. The detailed explanation is provided in Section of "Performance Comparisons with and without Variational Inference".
Both the encoder of CAM-FRN, which was utilized as the inference network, and the layers estimating the mean and variance for variational inference use convolution layers. Although convolution layers can adequately extract local information using filters, they inadequately capture global features (long-range dependency). To address this drawback, we applied the self-attention module proposed in [41] to our method. We considered long-range dependency by applying self-attention to the latent space z sampled from q∅(•). The two latent spaces, before and after self-attention, were concatenated to fuse the two features, so that the original latent space and the latent space with long-range dependency were considered simultaneously. This process is shown in Figure 7. We confirmed that the performance improved when the fused latent space was utilized; the relevant experimental results are explained in Section of "Performance Comparisons According to Module Combinations".
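The latent-space fusion described above can be sketched roughly as follows, assuming a SAGAN-style self-attention module as in [41]; the query/key channel reduction factor and the 1 × 1 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSelfAttentionFusion(nn.Module):
    """Sketch of latent-space fusion: apply (SAGAN-style) self-attention to
    the sampled latent z, concatenate with the original z, and fuse the two
    with a 1x1 convolution. Dimensions are illustrative."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)   # query projection (reduced channels)
        self.k = nn.Conv2d(c, c // 8, 1)   # key projection
        self.v = nn.Conv2d(c, c, 1)        # value projection
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, z):
        b, c, h, w = z.shape
        q = self.q(z).flatten(2).transpose(1, 2)    # (b, hw, c/8)
        k = self.k(z).flatten(2)                    # (b, c/8, hw)
        attn = F.softmax(torch.bmm(q, k), dim=-1)   # (b, hw, hw) pairwise weights
        v = self.v(z).flatten(2)                    # (b, c, hw)
        z_attn = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        # concatenate pre- and post-attention latents, then fuse
        return self.fuse(torch.cat([z, z_attn], dim=1))
```

The attention map relates every spatial position of the latent to every other, supplying the long-range dependency that plain convolutions lack; the fusion convolution lets the decoder weigh the two latent views.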

(4) Total loss function of CAM-FRN
The ultimate purpose of the proposed CAM-FRN is image-to-image translation from an image with lens flare to an image without lens flare. Therefore, we adopted content loss and style loss [42], which are frequently used in the image-to-image translation field. In this study, content and style losses use intermediate layers of the pretrained VGG-19 model [43].
where W and H are the width and height of the feature maps of the pretrained VGG-19 model, respectively, and the feature extractor denotes the feature maps obtained from the intermediate layers (relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1) of the pretrained VGG-19 model, applied to the target image and the restored output of CAM-FRN. The content loss ensures that the features of the restored output become identical to those of the target by minimizing the difference between the feature maps produced by VGG-19. Equation (19) is the style loss, which heightens the similarity between the output features of the target and the model by measuring the correlation between VGG-19 feature maps, so that an output similar to the target is generated. Similar to Equation (17), the feature extractor here is one of the layers (relu5_3) of the pretrained VGG-19, and the remaining terms have the same meaning. Using the mask of the lens flare region obtained with Equation (2), we can obtain the masked images defined in Equations (7) and (8). Based on these, we applied the content and style losses, as shown in Equations (18) and (20), to the flare regions of the ground truth and the predicted image. In other words, we aimed to concentrate more on the artifacts in the flare region for removal, and the restoration performance improved when the content and style losses were applied while considering the mask. The detailed explanations are presented in Section of "Performance Comparisons According to Mask Considering Loss".
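As a minimal illustration of the content and style losses, the sketch below computes a feature distance and a Gram-matrix distance for a pair of feature maps. In the paper these features come from the pretrained VGG-19 layers listed above; here the extractor is left abstract, and the use of an L1 distance is an assumption.

```python
import torch

def gram_matrix(feat):
    """Channel-wise Gram matrix used by the style loss: correlations
    between channels, normalized by the feature-map size."""
    b, c, h, w = feat.shape
    f = feat.flatten(2)                              # (b, c, h*w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def content_loss(feat_pred, feat_target):
    """Distance between (VGG) feature maps of prediction and target."""
    return torch.mean(torch.abs(feat_pred - feat_target))

def style_loss(feat_pred, feat_target):
    """Distance between the Gram matrices of the two feature maps."""
    return torch.mean(torch.abs(gram_matrix(feat_pred)
                                - gram_matrix(feat_target)))
```

Applying both losses to mask-multiplied images, as in Equations (18) and (20), restricts the comparison to the flare region only.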
Equation (21) is a total variation regularizer that applies smoothing to remove artifacts that may remain in an image after the lens flare artifacts are removed. Equation (22) is the edge loss proposed in MPRNet [21], where ∆(•) is the Laplacian operator and a small constant serves as a regularization term. Through the Laplacian operator, we encourage the edge components of the target to become similar to the edge components of the model output, thereby preserving the edges of the contents and objects restored in the image.
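The two regularizers can be sketched as follows. The Charbonnier form of the edge loss follows MPRNet's formulation, but the 3 × 3 Laplacian kernel and the constants here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def tv_loss(img):
    """Total variation regularizer: penalizes differences between
    horizontally and vertically neighboring pixels to smooth residue."""
    dh = torch.abs(img[..., :, 1:] - img[..., :, :-1]).mean()
    dv = torch.abs(img[..., 1:, :] - img[..., :-1, :]).mean()
    return dh + dv

def edge_loss(pred, target, eps=1e-3):
    """Charbonnier loss between Laplacian-filtered images, in the spirit of
    the MPRNet edge loss; the 3x3 Laplacian kernel is an assumption."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
    k = k.view(1, 1, 3, 3).repeat(pred.shape[1], 1, 1, 1)  # depthwise kernel
    lap = lambda x: F.conv2d(x, k, padding=1, groups=x.shape[1])
    diff = lap(pred) - lap(target)
    return torch.sqrt((diff ** 2).mean() + eps ** 2)
```

When prediction and target agree exactly, the edge loss reduces to the eps floor, and a constant image makes the TV term vanish.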
Furthermore, we used the discriminator of PatchGAN [37] to ensure that the distribution of the generated image becomes similar to that of the ground truth; the following equations represent the adversarial loss and the discriminator loss.
As in Equations (18) and (20), we applied loss equations that consider the lens flare region in Equations (24) and (26). As the masked output and masked ground-truth images pass through the discriminator, the network focuses more on the lens flare region to effectively remove lens flare artifacts from the predicted image. In Equations (24) and (26), the CAM-derived mask is multiplied with the generated image and the ground truth to produce the flare-region images defined in Equations (7) and (8). The final loss equation of CAM-FRN is as follows. Similar to Equations (17)-(20), training proceeds using the losses defined in Equations (23)-(26), in which the discriminator considers both the entire image and the lens flare region.
CAM-FRN undergoes the process of optimizing the total generator and discriminator losses, through which the damaged input image can be successfully restored as the output.

Step 3: Semantic Segmentation Network
We perform semantic segmentation after removing lens flare from the frontal-viewing camera images of a vehicle using CAM-FRN. Semantic segmentation is performed with DeepLabV3+ [7] as the network. This study compared the segmentation performance of PSPNet [2], ICNet [3], CGNet [44], and DeepLabV3+ [7]; the experimental results showed that DeepLabV3+ demonstrated the highest accuracy. Therefore, DeepLabV3+ was used as the semantic segmentation network.

Experimental Environment
In all our experiments, we used a desktop computer (Intel® Core™ i9-11900K @ 3.50 GHz × 16 CPU with 64 GB of main memory) equipped with an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) with 24 GB of graphics memory [45] on a Linux operating system. All the training and testing algorithms of our network were implemented with the PyTorch library (version 1.12.0) [46]; no other tools or libraries were used. In addition, our proposed model, the algorithm code, and the flare-generated image database are publicly available for fair performance evaluation by other researchers via the GitHub site [25].

CamVid and KITTI Databases
No open database was available for frontal-viewing camera images of a vehicle containing lens flare artifacts along with segmentation ground-truth labels. Therefore, we synthesized lens flare artifacts on the frontal-viewing camera images of a vehicle as proposed by Wu et al. [9]. We used the Cambridge-driving Labeled Video Database (CamVid) [19] and the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) [20] database, which are open databases with segmentation labels, to measure the segmentation performance for images with lens flare and for the original images without lens flare. Both [19,20] are road scene databases built by capturing various road scenes using a camera installed inside a vehicle. Each database provides segmentation labels comprising the same 12 labels; one is a void label, while the remaining 11 labels are for different objects. The CamVid database comprises the data used in SegNet [1], wherein the input and label have a resolution of 360 × 480 pixels. In the KITTI database, the input and label have various resolutions of 370 × 1220, 376 × 1241, and 375 × 1242 pixels depending on each scene.

Synthesized Lens Flare CamVid and KITTI Databases
Wu et al. [9] synthesized reflective-type lens flare artifacts, obtained through simulation, onto clean images without lens flare. We synthesized lens flare artifacts in the CamVid and KITTI databases, as shown in Figure 9. Figure 9a shows the original image of CamVid, and (b) shows the lens flare synthesized image. Similarly, Figure 9c shows the original image of KITTI, and (d) shows the lens flare synthesized image. In [9], each image was linearized before synthesizing the lens flare artifacts and the clean images without flare. The researchers performed linearization by applying a random gamma value between 1.8 and 2.2, assuming an unknown gamma was applied during image capture. Furthermore, to obtain more diverse synthesized images, they proceeded with synthesis by applying digital gain, Gaussian blur, RGB gain, and offset values within the random ranges shown in Table 5 to the linearized flare images. They also added Gaussian noise to the clean images to represent the various types of noise that can be visually observed during image acquisition, sampling the variance of the Gaussian noise from a scaled chi-square distribution (σ² ~ 0.01χ²). Synthesis was performed by adding the lens flare artifact images to the clean images, as shown in Equation (6). We randomly shuffled all the data in the datasets and divided them into two parts to perform cross-validation on the data synthesized with lens flare artifacts [47]. Training and testing were then performed based on two-fold cross-validation of the datasets, and the final testing accuracy was calculated by averaging the two testing accuracy values.
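A rough sketch of this synthesis pipeline is given below. The gamma range and the chi-square noise scaling follow the text, while the gain range and other specifics are placeholders, not the exact Table 5 values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_flare(clean, flare):
    """Sketch of the Wu et al. [9]-style synthesis: linearize both images
    with a random gamma, add chi-square-scaled Gaussian noise to the clean
    image, apply a random gain to the flare image (range is illustrative),
    and composite additively as in Equation (6)."""
    gamma = rng.uniform(1.8, 2.2)                 # assumed unknown camera gamma
    clean_lin = np.clip(clean, 0, 1) ** gamma
    flare_lin = np.clip(flare, 0, 1) ** gamma
    var = 0.01 * rng.chisquare(1)                 # sigma^2 ~ 0.01 * chi2
    clean_lin = clean_lin + rng.normal(0, np.sqrt(var), clean.shape)
    gain = rng.uniform(0.5, 1.5)                  # illustrative flare gain
    combined = np.clip(clean_lin + gain * flare_lin, 0, 1)
    return combined ** (1 / gamma)                # back to display space

clean = rng.random((8, 8, 3))
flare = rng.random((8, 8, 3)) * 0.3
out = synthesize_flare(clean, flare)
```

Clipping before the inverse gamma keeps the composited image in the valid [0, 1] display range.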

Training Our Proposed Method
First, a ResNet-50 [32]-based binary classifier was trained to extract CAMs for the images synthesized with lens flare. The syn-flare CamVid dataset was trained at the original image size of 360 × 480 pixels, while the syn-flare KITTI dataset was trained after resizing to 400 × 1200 pixels, considering that the original images had three different sizes depending on the road scene. Furthermore, each channel was normalized to a mean and standard deviation of 0.5. The learning rate of the ResNet-50-based binary classifier was 3 × 10−5, with the same value used for both datasets. The number of epochs was 200 and the training batch size was 4 for both datasets. Additionally, adaptive moment estimation (Adam) [48] was applied as the optimizer.
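For reference, once such a binary classifier is trained, a class activation map can be computed by weighting the final convolutional feature maps with the fully connected weights of the target class (Zhou et al.'s original CAM formulation). The sketch below is a generic illustration with assumed shapes, not the paper's exact extraction code.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Generic CAM: weight the last convolutional feature maps by the fully
    connected weights of the target class, sum over channels, and normalize
    to [0, 1]. Shapes are illustrative assumptions."""
    # features: (B, C, H, W); fc_weight: (num_classes, C)
    w = fc_weight[class_idx].view(1, -1, 1, 1)        # (1, C, 1, 1)
    cam = (features * w).sum(dim=1, keepdim=True)     # (B, 1, H, W)
    cam = F.relu(cam)                                 # keep positive evidence
    cam = cam - cam.amin(dim=(2, 3), keepdim=True)    # per-image min-max
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return cam
```

The normalized map can then be thresholded or upsampled to image resolution to serve as the flare-region mask of Equation (2).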
Syn-flare CamVid and syn-flare KITTI input images, along with all other input images, were randomly cropped to 300 × 300 pixels for training CAM-FRN. For both datasets, the learning rate of CAM-FRN was 1 × 10−4, the number of epochs was 400, the batch size was 2, and Adam was used as the optimizer. Inference proceeded at a size of 360 × 480 pixels for the syn-flare CamVid dataset. For the syn-flare KITTI dataset, test prediction images were obtained at a size of 400 × 1200 pixels and were ultimately resized back to their original sizes, using bicubic interpolation, when measuring the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Fréchet inception distance (FID) score in Section 4.3. Figure 10 shows the training and validation losses of CAM-FRN, in which 10% of the training data were used as validation data when measuring the validation loss; the validation data were not used for training. As the number of epochs increases, the training loss converges while decreasing, which indicates that the proposed CAM-FRN was sufficiently trained on the training data. Likewise, the validation loss converges while decreasing, which indicates that the proposed CAM-FRN was not overfitted to the training data.

Evaluation Metrics
We used the following evaluation metrics to compare the proposed model's performance.
Equations (29)-(32) represent the MSE, PSNR, SSIM, and FID, respectively, which are the metrics for evaluating the similarity and accuracy between the image restoration results and the ground truth:

MSE = (1 / (W × H)) Σᵢ Σⱼ (Î(i, j) − I(i, j))², (29)
PSNR = 10 log₁₀ (MAX² / MSE), (30)
SSIM(Î, I) = ((2 μ_Î μ_I + C₁)(2 σ_ÎI + C₂)) / ((μ_Î² + μ_I² + C₁)(σ_Î² + σ_I² + C₂)), (31)
FID = ||μ_∅(Î) − μ_∅(I)||² + Tr(Σ_∅(Î) + Σ_∅(I) − 2(Σ_∅(Î) Σ_∅(I))^(1/2)), (32)

In Equation (30), MAX is the maximum value of an image pixel, and MSE is the mean square error expressed in Equation (29). In Equation (29), W is the image width, H is the image height, Î is the predicted image, and I is the ground-truth image. PSNR evaluates the information lost in terms of the quality of the image generated by the network, and a higher score indicates that the flare is better removed. In Equation (31), μ_Î, μ_I, σ_Î, and σ_I are the means and standard deviations of Î and I, respectively, and σ_ÎI is the cross-covariance of Î and I. C₁ and C₂ are constants that vary depending on the range of image pixel values. SSIM evaluates image quality from luminance, contrast, and structural perspectives, where a value closer to 1 indicates that the image with lens flare removed is closer to the ground-truth image. Lastly, in Equation (32), ∅(Î) and ∅(I) are the intermediate feature maps created after the generated and ground-truth images pass through the Inception v3 network [49]; μ_∅(Î), μ_∅(I), Σ_∅(Î), and Σ_∅(I) are the means and covariances of ∅(Î) and ∅(I), and Tr(•) is the sum of the diagonal components. The FID score calculates the distance between the distribution of the ground-truth images and the distribution of the images with flare removed using feature maps that passed through the Inception network. A lower FID score indicates that an image is more similar to the ground-truth image. We evaluate how well a lens flare is removed based on Equations (29)-(32).
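A minimal NumPy sketch of PSNR and a simplified single-window SSIM is shown below (the full SSIM averages the same expression over local windows); FID is omitted because it requires Inception v3 features.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """PSNR per Equations (29) and (30): 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(pred, gt, max_val=1.0):
    """Simplified single-window SSIM (Equation (31)) with the conventional
    constants C1 = (0.01*MAX)^2 and C2 = (0.03*MAX)^2; the full metric
    averages this over local windows."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = np.mean((pred - mu_p) * (gt - mu_g))
    return (((2 * mu_p * mu_g + c1) * (2 * cov + c2))
            / ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)))
```

For identical images the SSIM is exactly 1, and a uniform offset of 0.1 on a [0, 1] image yields a PSNR of 20 dB, which makes both functions easy to sanity-check.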

Pixel accuracy = Σᵢ TPᵢ / Σᵢ (TPᵢ + FNᵢ), (33)
Class accuracy = (1 / C) Σᵢ TPᵢ / (TPᵢ + FNᵢ), (34)
mIoU = (1 / C) Σᵢ TPᵢ / (TPᵢ + FPᵢ + FNᵢ), (35)

Equations (33)-(35) represent pixel accuracy, class accuracy, and mean intersection over union (mIoU), respectively, which are the metrics for evaluating the semantic segmentation performance on the restored images. In each equation, C is the number of classes for semantic segmentation. True positives (TP) indicate the ground-truth pixels correctly predicted by the segmentation network, false positives (FP) indicate pixels that are not ground truth but are predicted as ground truth by the segmentation network, and false negatives (FN) indicate ground-truth pixels that are not predicted as ground truth. Equation (33) evaluates how accurately the segmentation network predicts the ground-truth pixels among all pixels; Equation (34) evaluates how accurately the segmentation network predicts the ground-truth pixels of each class. Lastly, Equation (35) calculates the ratio of the intersection to the union of the semantic segmentation classes. We used Equations (33)-(35) to evaluate the segmentation performance for the restored images.
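These three metrics can be computed from a per-pixel confusion matrix, as in the following sketch (handling of the void label used in the paper is omitted for brevity).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Per-pixel confusion matrix: rows index ground-truth classes,
    columns index predicted classes."""
    idx = num_classes * gt.reshape(-1) + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def segmentation_metrics(pred, gt, num_classes):
    """Pixel accuracy, class accuracy, and mIoU per Equations (33)-(35),
    derived from per-class TP, FP, and FN counts."""
    cm = confusion_matrix(pred, gt, num_classes).astype(float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # ground-truth pixels missed per class
    fp = cm.sum(axis=0) - tp          # pixels wrongly predicted per class
    pixel_acc = tp.sum() / cm.sum()
    class_acc = np.mean(tp / np.maximum(tp + fn, 1))
    miou = np.mean(tp / np.maximum(tp + fp + fn, 1))
    return pixel_acc, class_acc, miou
```

With a perfect prediction the confusion matrix is diagonal, so all three metrics evaluate to 1.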

Testing with Synthesized Lens Flare CamVid and KITTI Databases
(1) Ablation Study

(a) Performance Comparisons According to Module Combinations
We compared the performance of CAM-FRN by combining the modules proposed in Section 3. Specifically, we compared the semantic segmentation and image restoration performance with and without the following modules: the CAM module, which obtains additional input images using Equations (2)-(4), besides the image damaged by lens flare, by representing the lens flare region in the image; the ADCARB module; and the self-attention module, which fuses the variationally inferred latent spaces with and without self-attention applied and sends the result to the decoder. If ADCARB is not applied, the residual block used in the existing CycleGAN is applied instead; if the self-attention module is not applied, the latent space sampled through variational inference is sent directly to the decoder. When the CAM module is not applied, the additional inputs and the process of reflecting the CAM-derived lens-flare-region mask in the losses are omitted.
To compare the performance of the different module combinations on each dataset, we analyzed the restoration performance with the results shown in Tables 6 and 7. As shown in Tables 6 and 7, the combination of all proposed modules yielded the greatest performance improvement and exhibited the best results for all metrics.
Consequently, we verified two aspects through the ablation study. First, the inputs additionally obtained by CAM enabled the ADCARBs of CAM-FRN to utilize the additional flare information sufficiently and efficiently. CAM provides additional information about the flare region, which highlights the flare-damaged areas within the feature, enabling ADCARB to effectively extract and restore the damaged areas. The evidence for this is as follows: in the ablation study, applying ADCARB and self-attention alone without CAM resulted in worse performance than applying CAM and ADCARB together. Second, it was experimentally shown that using the CAM, ADCARB, and self-attention modules together is effective for inferring the posterior distribution for restoring clean images. Next, we input the restored images into the segmentation network and tested them according to the combination of modules. The evaluation metrics of the semantic segmentation performance according to the combination of modules are shown in Tables 8 and 9, where the combination of all proposed modules again yielded the greatest performance improvement and the best results for all metrics. Accordingly, the semantic segmentation metrics increase along with the restoration performance metrics according to the combination of modules. Tables 10 and 11 present the per-class IoU metrics, in which the IoU of each class improved in line with the restoration performance metrics, as analyzed in Tables 8 and 9. Table 8. Performance evaluation metrics of the semantic segmentation test results for the images in which a lens flare is removed for the syn-flare CamVid dataset (unit: %).

Next, we conducted an ablation study to numerically analyze the semantic segmentation performance for the combinations of inputs created with CAM. As shown in Tables 12 and 13, we measured the semantic segmentation performance according to the combination of additional inputs based on class accuracy, pixel accuracy, and mIoU. The image damaged by flare was used in all combinations as a default, and we compared the performance when the additional inputs proposed in our method were used and unused. According to Table 12, there was no significant difference in performance according to the combination of inputs. However, class accuracy, pixel accuracy, and mIoU were the highest when all inputs were used, as we proposed. In particular, mIoU was 0.13% higher than when only the flare-damaged image was used and 0.03% higher than the input combination that demonstrated the second-highest mIoU. Table 13 similarly shows no significant difference in performance according to the combination of inputs. However, as proposed, using all inputs resulted in the highest class accuracy, pixel accuracy, and mIoU. Specifically, mIoU was 0.21% higher than when only the flare-damaged image was used and 0.12% higher than the second-best combination.
(b) Performance Comparisons with and without Variational Inference
In this section, we conducted an ablation study to verify the effects of adopting variational inference on the performance of our proposed method. The reparameterization trick structure for variational inference was removed between the encoder and decoder of CAM-FRN; then, the latent spaces from the encoder with and without self-attention applied were fused and delivered to the decoder.
Additionally, the experiment was conducted using L1 loss and L2 loss in place of the Kullback-Leibler (KL) divergence, which was applied to reduce the difference between the true posterior distribution and the posterior distribution inferred by the inference network, and the reconstruction loss used for variational inference.
Tables 14 and 15 present the evaluation metrics of the restoration performance for the syn-flare CamVid and syn-flare KITTI datasets. When L1 loss was used in place of variational inference, PSNR and SSIM did not exhibit a noticeable difference for either dataset, but the FID score exhibited a significant difference. The FID score is a metric for evaluating the quality of images generated by the GAN structure, calculated as the distance between the distribution of the generated images and the distribution of the ground-truth images. The reason for using variational inference was to generate images that were as semantically similar to the ground-truth image as possible in the decoder, by sampling a meaningful latent space through inference of the posterior distribution that considers the ground-truth image. In other words, we aim to minimize the difference between the inferred posterior distribution and the true posterior distribution, as shown in Equation (16). Therefore, Tables 14 and 15 show that using variational inference can result in better performance in terms of the FID score.
When L2 loss was used instead of variational inference, the results were similar to the analysis using L1 loss. Again, the FID score was significantly reduced when using variational inference, and as discussed above, using variational inference can lead to better performance in terms of the FID score.
As shown in Tables 16 and 17, the same performance improvements observed in restoration were demonstrated in semantic segmentation: class accuracy, pixel accuracy, and mIoU were all highest when using variational inference.
(c) Performance Comparisons According to Mask Considering Loss
For the next comparative experiment, we evaluated the semantic segmentation and image restoration performance for the two cases of considering and not considering the flare region in the proposed loss equations. For image-to-image translation, we used content loss [42] and style loss [42] based on VGG-19. Content and style losses were applied to the result of multiplying the flare-region mask with the final output image and with the ground-truth image. Furthermore, the lens flare region was considered in the losses utilizing the discriminator; the image restoration performance was improved by making it difficult for the discriminator to discriminate whether the flare region is ground truth, thereby focusing on the flare region. Accordingly, we expected CAM-FRN to concentrate more on the flare region for removal. We experimentally proved that this hypothesis is valid, as shown in Tables 18-21. Tables 18 and 19 are analyzed with respect to the restoration results. When the loss considering the mask was used, PSNR and SSIM improved compared with when it was not used, and the FID score also decreased significantly. These results demonstrate that considering the lens flare region and the entire image together significantly improves performance. Segmentation performance also improved along with the restoration performance. According to Tables 20 and 21, when the loss considering the mask was used, class accuracy, pixel accuracy, and mIoU all increased compared with when it was not used. Table 18. Comparison of the restoration performance when a mask is considered and not considered in the loss for syn-flare CamVid dataset.

(2) Comparisons of Proposed Method and the State-of-the-Art Methods
We compared our proposed lens flare removal method with the previously proposed methods of Qiao et al. [23] and Wu et al. [9]. However, research on lens flare removal has been scarce owing to the difficulty of the task. Therefore, we additionally adopted several networks with purposes similar to our research to compare the performance.
The proposed method used a GAN-based learning method utilizing a discriminator, wherein an image with a flare undergoes image-to-image translation to a clean image without flare. Therefore, we compared our proposed model with Pix2Pix [37] and CycleGAN [50], which are commonly used for image-to-image translation. Lastly, we compared the performance against FFANet [22] and MPRNet [21], proposed for dehazing and deraining, respectively. Dehazing and deraining aim to remove artifacts covering the objects in an image owing to environmental factors, which is similar to the lens flare removal task of removing artifacts generated by a strong light source in the surroundings. In particular, haze is similar to the lens flare artifact generated by light scattering and to veiling glare, which reduces contrast within an image and causes it to appear hazy; therefore, we compared our proposed method with networks designed for dehazing, as they were deemed potentially effective in removing lens flare artifacts. Rain streaks are similar to the light streaks that radiate from a light source among the various lens flare artifacts; we therefore compared our proposed method with MPRNet, which was considered effective in restoring the image details covered by light streaks. Figures 11 and 12 show the restoration results of the proposed method and the aforementioned methods. The method proposed by Qiao et al. [23] effectively removes reflection artifacts where the flare region is visible; however, it proved ineffective in removing lens flare artifacts spread throughout the image. As shown in Figures 11a and 12a, large flare regions were not effectively removed throughout the image, and the restoration result was poorer than that of our proposed method. The method proposed by Wu et al. [9] demonstrated better performance than [23]; however, the lens flare was not completely removed.
As shown in Figure 11b, this method could not restore the details of the boundary between the road and sidewalk or the details of the people riding a bicycle. In Figure 12b, the artifacts overlaid on the pedestrian are not completely removed.
Next, we analyzed the restoration results of Pix2Pix and CycleGAN, which were proposed for image-to-image translation. Pix2Pix produced far better results than CycleGAN, whose bidirectional translation between flare and clean images was not sufficiently trained. The restoration results of Pix2Pix, in which the ground-truth image is given directly to the discriminator as a condition, were visually superior. Compared with the proposed CAM-FRN, however, Pix2Pix did not adequately restore the details of the boundary between the road and sidewalk, and light streaks of flare remained.
Lastly, our proposed method was compared with FFANet and MPRNet, proposed for dehazing and deraining, respectively. Compared with the previously compared models, these models demonstrated visually far better performance. MPRNet was effective in removing light streaks; however, it did not accurately restore the details of roads. FFANet successfully restored the image made hazy by flare back to a clean image, but the flare artifacts were not perfectly removed. In contrast, our proposed method successfully restored the details of roads while adequately removing the flare artifacts, demonstrating outstanding performance.
Tables 22 and 23 present the numerical performance evaluation metrics for the results of each method on the images synthesized with lens flare. Among the various models, the PSNR, SSIM, and FID scores of our proposed model were the best. FFANet demonstrated the second-highest performance, with PSNR and SSIM similar to our method; however, its FID score was approximately 30 points higher than that of our proposed method. These results indicate that the distance between the feature maps extracted from the Inception v3 network [49] was smaller for the proposed CAM-FRN than for FFANet, and the resulting image was closer to the ground-truth image. Table 22. Comparison of the restoration performance of the proposed method and state-of-the-art methods for syn-flare CamVid dataset.

Figures 13 and 14 show the semantic segmentation test results for the images restored by our proposed method, the previously proposed flare removal methods, the image-to-image translation methods, and the methods for dehazing and deraining. Overall, segmentation performance improved substantially as restoration performance improved. The method proposed by Qiao et al. [23] was ineffective in removing the lens flare artifacts generated throughout the image, as seen in its restoration result, and thus also exhibited poor segmentation results. The method proposed by Wu et al. [9] removed lens flare artifacts more effectively than the method in [23], but as shown in the segmentation results in Figure 13b, the details of the road, the sidewalk boundary, and the person riding the bike were not properly restored, which contradicts the restoration results in Figure 11b, and the pedestrian was not detected at all in Figure 14b. Next, we analyzed the segmentation test results for the restorations produced by Pix2Pix and CycleGAN, which were proposed for image-to-image translation. In line with the restoration results, the restoration of Pix2Pix was better than that of CycleGAN, and the segmentation result of the image restored by Pix2Pix was likewise better. However, Pix2Pix could not remove lens flare artifacts completely; the road or the people riding a bicycle were not properly detected, depending on the restoration results, as shown in the enlarged parts of Figures 13 and 14. Lastly, we compared the segmentation performance of FFANet and MPRNet. The segmentation results of the images restored by FFANet in Figures 13e and 14e visually showed a greater improvement than (a)-(d), the results of the previous methods. Nevertheless, as presented in Tables 24 and 25, the class accuracy, pixel accuracy, and mIoU were lower than those of the proposed method.
Additionally, as shown in the enlarged part of Figure 13e, the proposed method detected the shape of the bicyclist object more effectively than the other methods. Likewise, comparing the pedestrian class in the enlarged part of Figure 14e with that of the proposed method shows that the proposed method detected the shape of the pedestrian more effectively. The segmentation result of the images restored by MPRNet in Figure 13f demonstrated a noticeable improvement over the other methods, as in Figure 14e; however, the class accuracy, pixel accuracy, and mIoU were 6.49%, 2.99%, and 8.91% lower, respectively, than those of the proposed method. In the segmentation result of the image restored by MPRNet shown in Figure 14f, the pedestrian was not detected at all, and the class accuracy, pixel accuracy, and mIoU were 5.31%, 4.63%, and 7.3% lower, respectively, than those of the proposed method. Tables 26 and 27 present the per-class IoU of the semantic segmentation results of the images restored by each method. As analyzed above, the proposed method demonstrates the best performance in terms of per-class IoU. Figure 15 shows the Grad-CAM [51] maps for the bicyclist class extracted from DeepLabV3+ [7] when the original CamVid image and the images restored by CAM-FRN and the other methods are given as input; Figure 16 shows the corresponding Grad-CAM maps for the pedestrian class on the KITTI dataset. In both figures, the Grad-CAM of our proposed method is the closest to that of the original image.
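Grad-CAM itself reduces to a few array operations: the channel weights are the spatially averaged gradients of the class score with respect to the feature maps, and the map is the ReLU of the weighted channel sum. A minimal NumPy sketch of the standard formulation follows; the shapes and inputs are illustrative, not taken from DeepLabV3+.

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations: (K, H, W) feature maps; gradients: (K, H, W) class-score gradients
    weights = gradients.mean(axis=(1, 2))        # alpha_k: global average pooling
    cam = np.tensordot(weights, activations, 1)  # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                   # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                    # normalize to [0, 1] for display
    return cam

# Illustrative inputs: only channel 0 receives gradient signal.
acts = np.ones((2, 3, 3)); acts[1] *= 2.0
grads = np.zeros((2, 3, 3)); grads[0] = 1.0
cam = grad_cam(acts, grads)
```

In practice, the activations and gradients would be captured with forward and backward hooks on the last convolutional layer of the segmentation backbone.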
Based on the preceding ablation study and the comparative analysis with existing methods, we can explain why our proposed method outperforms them. Existing methods fail to remove artifacts properly when the input image contains complex flare artifacts or a severe level of flare [9,21,22,23]. To solve these problems, we used CAM to provide the network with additional information about the flare regions and reflected it in the loss function to successfully restore the parts covered by flare; this also allowed us to handle composite flare artifacts. As a result, we achieved better restoration than other methods, and, building on these restorations, the performance of our final goal, semantic segmentation, is also better than that of the existing restoration methods. We calculated p-values between the proposed method and the second-best method for all semantic segmentation evaluation metrics in Table 24, conducting a t-test [52] and measuring Cohen's d [53] to assess the significance of the performance difference between the two methods. As shown in Figure 17, the p-value for pixel accuracy is 0.05, which indicates that the null hypothesis is rejected at the 95% confidence level, i.e., the two methods differ in pixel accuracy at that confidence level. We then measured Cohen's d for pixel accuracy, obtaining 6.7363. Cohen's d thresholds of 0.2, 0.5, and 0.8 correspond to small, medium, and large effect sizes, respectively. Our Cohen's d is well above 0.8, indicating that the performance difference between our method and the second-best method in pixel accuracy corresponds to a large effect size.
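As a minimal sketch (not the authors' evaluation code), Cohen's d for two sets of accuracy scores can be computed with the pooled standard deviation as follows; the p-value itself additionally requires the t-distribution CDF (e.g., scipy.stats.ttest_ind). The sample values are illustrative.

```python
import math

def cohens_d(a, b):
    # Pooled-standard-deviation effect size between two independent samples.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / sp

# Thresholds 0.2 / 0.5 / 0.8 mark small / medium / large effect sizes.
```

For example, `cohens_d([2, 4, 6], [1, 3, 5])` yields 0.5, a medium effect size under the thresholds above.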

Computational Cost of Proposed Method
Lastly, we measured the number of parameters (Params), floating-point operations (FLOPs), and multiply-accumulate operations (MACs) to compare the computational costs of the proposed and previous methods. In Table 28, CycleGAN and Pix2Pix exhibited the lowest and highest numbers of parameters, respectively, while the proposed method exhibited the second highest. In terms of MACs and FLOPs, [9] exhibited the lowest values while MPRNet had the highest; the proposed method had the fourth highest, indicating that our method is heavy in parameters but fourth most efficient in computation. As shown in Tables 14-16 and 25-27, all methods except FFANet and MPRNet, which were the second and third best after the proposed method, performed poorly at lens flare removal on the syn-flare CamVid and syn-flare KITTI datasets, which also resulted in poor segmentation performance. In other words, among the three competitive methods (FFANet, MPRNet, and the proposed method), the proposed method offers the best balance of lens flare removal and semantic segmentation performance.
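As a hedged sketch of where such counts come from, the MACs of a single 2-D convolution layer follow directly from its shape, and FLOPs are commonly reported as roughly twice the MACs (one multiply plus one add per MAC); counting conventions vary between profiling tools. The layer shape below is illustrative, not taken from any of the compared networks.

```python
def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    # One multiply-accumulate per kernel weight per output position.
    return (c_in // groups) * c_out * k * k * h_out * w_out

def conv2d_flops(*args, **kwargs):
    # Common convention: FLOPs ~ 2 x MACs (multiply and add counted separately).
    return 2 * conv2d_macs(*args, **kwargs)

# Example: a 3x3 conv from 3 to 64 channels on a 224x224 output map.
macs = conv2d_macs(3, 64, 3, 224, 224)
```

Summing this quantity over all layers of a network gives the totals reported in tables such as Table 28; profiling libraries automate exactly this accumulation.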

Limitations of Proposed Method
In this section, we analyze the failure cases of our proposed method when removing lens flare artifacts and the related problems. The most serious problem is that not only the lens flare artifacts generated by a light source are removed, but the light source itself is removed as well. A light source is not an unnatural artifact and therefore should not be removed. This problem stems from the difficulty of finding datasets that pair images with lens flare (input) and clean images without lens flare (label), while we must also consider the semantic segmentation task for images captured by a frontal-viewing camera of a vehicle; it is challenging to find images that have segmentation labels and simultaneously exhibit lens flare. Therefore, to address the lack of data, we constructed datasets by synthesizing lens flare artifacts into semantic segmentation datasets built from images captured by a frontal-viewing camera of a vehicle, namely CamVid [19] and KITTI [20]. Consequently, the target image restored by removing lens flare does not include light sources such as the sun, streetlights, or vehicle headlights, and CAM-FRN is trained to remove light sources as well.
When Figure 18a,d,g,j are input to CAM-FRN, the results in Figure 18c,f,i,l are produced; owing to the lack of light-source information in Figure 18b,e,h,k, the regions where a light source is located are filled with other pixel values. The model is trained to create images similar to the original images in Figure 18b,e,h,k, which is one of the problems to be resolved for lens flare removal. To remove lens flare artifacts more appropriately, further research is needed on methods that distinguish the light source from the flare region and remove only the latter.
Subsequently, we analyzed cases of adequate and inadequate restoration by our proposed model. The first and second rows of Figure 19 show cases where CAM-FRN inadequately removed lens flare in the syn-flare CamVid dataset. In both examples, the lens flare generated on top of an object is removed, but the color of the original image is not restored when compared with the ground-truth image. In the first row, in Figure 19a, a traffic light is located close to a light source and the intensity of the lens flare from that source is fairly strong; therefore, the color of the traffic light in the original image (b) is not properly restored in the CAM-FRN result (c). In the second row, in Figure 19d, the object is not located close to the light source as in the first case, but a strong lens flare completely covers a person in the enlarged part; as a result, the color of the pedestrian's coat in the original image (e) is not appropriately restored in the CAM-FRN result (f). In the third and fourth rows, the lens flare is effectively removed, and the color of the object covered by the flare is restored fairly adequately. As in (a) and (d), objects are covered by lens flare in (g) and (j); however, the pedestrians and the details of the building in the enlarged parts are preserved adequately.
Figure 19. Successful and unsuccessful restoration cases of CAM-FRN in the syn-flare CamVid dataset. Images of unsuccessful restoration: (a,d) input images, (b,e) ground-truth images, and (c,f) prediction images. Images of successful restoration: (g,j) input images, (h,k) ground-truth images, and (i,l) prediction images.
We analyzed cases in the syn-flare KITTI dataset, shown in Figure 20, where the color details of an object behind a flare were successfully or unsuccessfully restored during otherwise adequate lens flare removal. In image (a) in the first row, lens flare forms over a building, and the color of the building in the original image (b) is not properly restored in the flare-removed CAM-FRN image (c). In image (d) in the second row, lens flare again forms over a building, and the paint color of the building in the original image (e) is not properly restored in the flare removal process, appearing gray in image (f). As in the first and second rows, lens flare covers an object in images (g) and (j); however, the flare is effectively removed, and the color of the object behind it is adequately preserved.
The result images in the first and second rows of Figures 19 and 20 are explained by the effect of lens flare on image contrast: a higher flare intensity reduces the contrast. The lens flare in images (a) and (d) of Figures 19 and 20 therefore reduces their contrast, and its intensity is so high that the pixel-value information of the original image is severely lost. When contrast increases, dark areas become darker and bright areas become brighter, creating a more evident contrast; in the opposite case, the contrast between two areas diminishes. Because contrast changes in proportion to pixel intensity, the information loss in pixels that were bright before the lens flare grows as the contrast decreases. As a result, when lens flare is generated on objects with brighter pixel values, as in images (a) and (d) of Figures 19 and 20, the pixel information of the original images cannot be perfectly restored in (c) and (f). However, in the third and fourth rows of Figures 19 and 20, the objects have relatively darker colors with low brightness or pixel intensity, and hence the lens flare is removed while their colors are retained to a certain extent. These limitations can adversely affect semantic segmentation, a pixel-wise classification task; therefore, research is needed on methods that retain the color and contrast of an image while removing lens flare.
Figure 20. Successful and unsuccessful restoration cases of CAM-FRN in the syn-flare KITTI dataset. Images of unsuccessful restoration: (a,d) input images, (b,e) ground-truth images, and (c,f) prediction images. Images of successful restoration: (g,j) input images, (h,k) ground-truth images, and (i,l) prediction images.
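The contrast argument above can be illustrated with a minimal NumPy sketch, assuming a simple additive veiling-glare model (an assumption for illustration, not the paper's flare synthesis pipeline): adding flare and clipping raises the darkest value and saturates the brightest, so Michelson contrast drops and information in bright pixels is lost first.

```python
import numpy as np

def michelson_contrast(img):
    # (max - min) / (max + min) over pixel intensities in [0, 1].
    lo, hi = float(img.min()), float(img.max())
    return (hi - lo) / (hi + lo)

clean = np.array([0.2, 0.5, 0.9])        # dark, mid, and bright pixels
flared = np.clip(clean + 0.4, 0.0, 1.0)  # additive glare, then sensor clipping
# The bright pixel saturates at 1.0, so its original value is unrecoverable,
# while darker pixels retain more of their relative ordering.
```

This is consistent with the observation that darker objects in the third and fourth rows survive flare removal better than bright ones.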

Discussion of the Performance Analysis of the Proposed Method
As mentioned, Table 13 shows that the pixel accuracy metric is worse than the other results; however, we focused on the mIoU metric, which is highest when all input images are utilized. As Equation (33) shows, pixel accuracy only considers true positives (TPs) and false positives (FPs), whereas, as Equation (35) shows, mIoU additionally considers false negatives (FNs) and can therefore also penalize class misclassification; that is, class misclassification can occur even when pixel accuracy is higher. On this basis, we confirm that the proposed method utilizing all inputs, which has the highest mIoU, performs best. In Table 16, our proposed method utilizing variational inference has a slightly worse SSIM score than the other results. However, as stated earlier, we aimed to obtain an image whose distribution is closest to that of flare-free images by using variational inference, rather than simply considering pixel-wise distances. The evaluation metric that reflects this most clearly is the FID score, which is best when variational inference is used. Table 17 also shows that the proposed method using variational inference achieves the best segmentation performance. From this, we conclude that the FID score matters more than the other evaluation metrics in Table 16 and that the best performance is obtained when variational inference is utilized, as in the proposed method. In Tables 26 and 27, "Original" denotes the segmentation accuracies on original images without flare, which serve as baselines. The important part of Tables 26 and 27 is the performance of the segmentation network after flare removal, shown as "With restoration use DeepLabV3+" in each table; the proposed method achieves the highest performance there in each table.
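The distinction between the two metrics can be made concrete with a small confusion-matrix sketch (illustrative numbers, not the paper's results): rows are ground-truth classes, columns are predictions, and mIoU penalizes the false negatives that pixel accuracy can hide.

```python
import numpy as np

def pixel_accuracy(cm):
    # Fraction of all pixels whose predicted class matches the label.
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as class c but labeled otherwise
    fn = cm.sum(axis=1) - tp   # labeled as class c but predicted otherwise
    union = tp + fp + fn
    return float(np.mean(tp / np.maximum(union, 1e-12)))

# A dominant class keeps pixel accuracy high while a rare class is missed entirely:
cm = np.array([[95, 0],
               [ 5, 0]])       # rare class (row 1) never predicted
```

Here pixel accuracy is 0.95, but mIoU collapses to 0.475 because the rare class contributes an IoU of zero, which is exactly why mIoU is the more informative metric for segmentation.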
Table 28 compares the computational cost of the proposed method with those of previously studied methods. Although the proposed method does not show the best results in Table 28, its segmentation accuracies are higher than those of all previous methods, as shown in Tables 24-27; in this paper, we focus on segmentation accuracy rather than computational cost.

Conclusions
This study examined different methods for improving semantic segmentation performance by removing lens flares from images captured by frontal-viewing cameras in vehicles. This study is the first to solve the problem of an autonomous driving vehicle being unable to detect objects owing to a lens flare while simultaneously conducting lens flare removal and segmentation.
The proposed method removes lens flare by extracting the lens flare region of an image as a class attention map and providing additional information on the lens flare artifacts and the flare region through additional inputs. Furthermore, we proposed ADCARB, which uses multi-scale feature learning and extracts the parts damaged by lens flare as a mask for learning; using this block significantly improves lens flare removal performance. Additionally, we generated images as similar to the ground-truth images as possible through variational inference, applying self-attention to the estimated latent space so that global information is considered. The lens flare region mask obtained using CAM was reflected in the style, content, adversarial, and discriminator losses, which improved image quality by removing lens flare with a focus on the flare region. When CAM-FRN was applied to the segmentation task on the restored images, it demonstrated considerable improvements over previous restoration models [9,21,22], achieving a class accuracy of 67.34%, pixel accuracy of 94.33%, and mIoU of 71.26% on the syn-flare CamVid dataset. CAM-FRN likewise exhibited superior performance on the syn-flare KITTI dataset, attaining a class accuracy of 54.73%, pixel accuracy of 90.62%, and mIoU of 60.27%.
As mentioned in Section 5, CAM-FRN removes light sources along with flare artifacts, and it can fail to restore the original color when the contrast is degraded by flare. In follow-up research, we plan to design a restoration network that removes only the lens flare by separating the light source and lens flare regions, and to address the failure to restore color under flare-degraded contrast. Our final goal is an end-to-end network design that combines the lens flare removal and semantic segmentation steps.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.