Extreme Low-Visibility Events Prediction Based on Inductive and Evolutionary Decision Rules: An Explicability-Based Approach

Abstract: In this paper, we propose different explicable forecasting approaches, based on inductive and evolutionary decision rules, for extreme low-visibility events prediction. The explicability of the processes given by the rules is at the core of the proposal. We propose two different methodologies: first, we apply the PRIM algorithm and evolution to obtain induced and evolved rules, and subsequently these rules and boxes of rules are used as a possibly simpler alternative to ML/DL classifiers. Second, we propose to integrate the information provided by the induced/evolved rules into the ML/DL techniques, as extra inputs, in order to enrich the complex ML/DL models. Experiments on the prediction of extreme low-visibility events in Northern Spain due to orographic fog show the good performance of the proposed approaches.


Introduction
Fog-related low-visibility events are extreme situations that can be highly dangerous for road transportation and aviation [1,2]. The accurate prediction of fog-related events has become an important task in the forecasting literature in recent decades. There are different methods devoted to low-visibility forecasting, including numerical weather prediction (NWP), based on the use of mesoscale (or high-resolution) NWP models to accurately forecast fog events [3][4][5][6]; additionally, data-driven and statistical approaches are based on the processing of measured or simulated meteorological variables [7].
As part of data-driven approaches, machine learning (ML) and deep learning (DL) methods have become an attractive tool to tackle the prediction of low-visibility events, obtaining a level of accuracy not seen before. Many of these approaches were, in fact, proposed in the frame of transportation safety, and many of them were tested with airport [8] or road data. For example, in [9], artificial neural networks (ANNs) were used to provide accurate forecasts at Canberra International Airport, employing a 44-year database of standard meteorological observations. In [10], different ML regression approaches were tested on the problem of low-visibility events prediction due to radiation fog at Valladolid airport, Spain. In [11], a multiobjective evolutionary neural network was applied to the same problem, obtaining improvements over alternative ML algorithms. In [12], a hybrid prediction model for daily low-visibility events at Valladolid Airport, which combines fixed-size and dynamic windows and adapts its size according to the dynamics of the time series, was proposed, using ordinal classifiers as the prediction model. In [13], tree-based statistical models based on highly resolved meteorological observations were applied to model the relationship between the runway visual range at the airport and the surface meteorological conditions.
All these previous approaches based on ML and DL techniques yielded accurate results in different fog-related low-visibility events prediction problems, but in the majority of cases, the explicability of the results is not easy to obtain, due to the complexity and extremely nonlinear characteristics of the ML and DL algorithms. In the last few years, explicable artificial intelligence (XAI) [33] has been acknowledged as a crucial feature of AI or ML algorithms when applied to science and engineering problems, and especially in geoscience- and atmospheric-related problems [34]. XAI is based on the fact that, when developing an ML model to solve a given problem, the consideration of interpretability is an extremely important additional design driver, especially in some fields in which the physics of the problem plays a principal role, such as in geoscience and Earth observation problems [35]. This leads to the development of ML models in which the interpretation of the system is a primary objective, in such a way that the models either provide an understanding of the model mechanisms or give the reason why some predictions are obtained. The interest in physics-awareness and XAI in atmospheric-related problems in recent years has prompted the revisiting of classical methods which promote the interpretation of results, such as those based on inductive decision rules.
There have been different attempts to apply inductive decision rules to problems related to fog and visibility prediction. In [36], a C4.5 decision tree was used to generate induction rules for marine fog forecasting at different buoys off the California coast. In [37], a multirule-based method was developed, where threshold values of multiple variables were employed to define fog conditions in the forecast. The objective was to forecast low-visibility episodes due to radiation fog in Abu Dhabi, in the United Arab Emirates. Fuzzy-logic-based rule systems have also been developed for fog prediction, for example, in [38]. In [39], a fuzzy-rules-based system was proposed to predict fog in Lahore, Pakistan. In [40], an objective fuzzy logic fog forecasting model based on outputs from a high-resolution operational NWP model was proposed, where the fuzzy logic was employed to deal with the inaccuracy of NWP prediction and uncertainties associated with relationships between fog predictors and fog occurrence. This methodology was tested at Perth Airport, with good results.
In this paper, we focus on events of extreme low visibility, and on how they can be accurately forecasted using rule induction and rule evolution methods. In this case, the explicability of the rules is at the core of the proposal, in addition to obtaining accurate results for extreme low-visibility events. Two different methodologies were satisfactorily applied. In the first one, both induced and evolved rules and boxes of rules were used as a simpler alternative to ML/DL classifiers, offering, in some cases, some advantages with respect to them: (1) rule-based classification models provide a direct physical explainability of the outcomes, which can be very helpful in the implementation of these models in real applications; (2) the complexity of these models is substantially lower than that of ML/DL classification methods in many cases, thereby considerably reducing the computational cost, as well as the training time and the complexity of the system deployment; and (3) the performance of rule-based classifiers for the prediction of extreme low-visibility events is competitive with respect to far more complex ML/DL models, although this much depends on the specific problem at hand. The second approach focused on the integration of the information provided by the induced/evolved rules in order to enrich the complex ML/DL models. Specifically, the idea is to add to the original database, composed of meteorological variable measurements, an extra Boolean variable, which provides information related to the fulfillment of the rules or boxes of rules. This leads to a significant improvement of the classification models' performance for the three prediction time horizons assessed.
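As a minimal sketch of this second methodology (not the exact implementation used in the paper), the extra Boolean input can be built by checking whether each sample falls inside every interval of a box of rules; the column indices and limits below are hypothetical:

```python
import numpy as np

def rule_feature(X, box):
    """Boolean column: 1 if a sample lies inside every [lo, hi] interval of a box."""
    inside = np.ones(len(X), dtype=bool)
    for var_idx, lo, hi in box:
        inside &= (X[:, var_idx] >= lo) & (X[:, var_idx] <= hi)
    return inside.astype(float)

# Hypothetical box: relative humidity (col 0) near saturation, wind speed (col 1) low.
box = [(0, 95.0, 100.0), (1, 0.0, 3.0)]
X = np.array([[98.0, 1.2], [80.0, 0.5], [99.0, 5.0]])
X_enriched = np.column_stack([X, rule_feature(X, box)])  # adds the extra Boolean input
```

The enriched matrix is then fed to the ML/DL classifier in place of the original one.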
The rest of the paper is structured in the following way: the next section describes the algorithms used in this work for rule induction, namely the PRIM and an evolutionary approach (CRO-SL), together with the ML and DL classification methods employed. Then, Section 3 describes the specific problem of extreme low-visibility events forecasting and gives details on the atmospheric variables involved in the problem. Section 4 presents the experimental section, where the results obtained for this problem are presented and discussed for the two approaches implemented. Finally, Section 5 closes the paper with some concluding remarks on the research carried out.

Materials and Methods
The algorithms and methodologies employed in this paper are further detailed in this section. First, both rule induction and evolutionary rule discovery algorithms are summarized in Section 2.1. Next, the ML and DL methodologies used for building the classification models are listed and detailed in Section 2.2.

Rule Induction and Evolutionary Rule Discovery
This section summarizes the most important characteristics and details of the methods applied for rule discovery in this paper. In particular, the PRIM method for rule induction and an evolutionary-based approach for rule evolution and selection are described.

PRIM
The PRIM method [41] is a rule induction method based on boxes or subsets of the data. Given n input variables and one output variable, we try to optimize the mean of the output variable; that is, we have to find a subregion R such that Equation (1) is satisfied:
f̄_R = ∫_{x∈R} f(x) p(x) dx / ∫_{x∈R} p(x) dx,    (1)
where f̄ is the mean of the output, f(x) is the output value associated with the input data x, and p(x) stands for the probability (or frequency) of the input data x. Our purpose is to find a subset of the data that optimizes f̄. Additionally, the size of the subset needs to be considered, where a balance between size and optimization is the best approach. The PRIM method has two stages: peeling and pasting. The peeling stage is an iterative procedure in which data are cut to isolate the desired variable. Given a value y_i of Y that we desire to optimize, we select a variable X_k and a range [a, b] for this variable such that, by trimming the database to the range of this variable, the ratio of y_i in the new subset is greater than in the original set.
The peeling stage is summarized in Figure 1, where D is the set of data, S_i, i ∈ {1, 2, ..., n}, are the subsets obtained in several iterations, and our objective is to classify the red points; that is, S_4 defines the split into red data (inside the subset) and blue data (outside the subset).
The pasting step is a reverse peeling process designed to recover a portion of valuable data. Starting from the dataset after peeling, a variable X_t is selected whose domain range is [a_t, b_t]. An attempt is made to extend this range by a value ε, either above, [a_t, b_t + ε], or below, [a_t − ε, b_t], so that the proportion of y_i is augmented (or at least maintained) while the size of the dataset increases.
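As an illustration of the peeling stage, the following is a minimal sketch (not the actual PRIM implementation used in the paper) that repeatedly trims a small fraction of the data off one end of one variable, keeping the cut that most increases the mean of the output:

```python
import numpy as np

def peel_once(X, y, alpha=0.05):
    """One peeling step: try trimming a fraction alpha of the data off the low or
    high end of each variable, and keep the cut that maximizes the mean of y."""
    best_mean, best_keep = -np.inf, None
    for j in range(X.shape[1]):
        for keep in [X[:, j] >= np.quantile(X[:, j], alpha),
                     X[:, j] <= np.quantile(X[:, j], 1 - alpha)]:
            if keep.sum() and y[keep].mean() > best_mean:
                best_mean, best_keep = y[keep].mean(), keep
    return best_keep

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = (X[:, 0] > 0.8).astype(float)     # the "red" points live in a corner of variable 1

mask = np.ones(len(y), dtype=bool)    # iterate the peeling, as in Figure 1
for _ in range(10):
    step = peel_once(X[mask], y[mask])
    idx = np.where(mask)[0]
    mask[idx[~step]] = False          # drop the peeled samples from the box
```

After a few iterations, the mean of y inside the surviving box is higher than in the full dataset, at the cost of a smaller support.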

DPCRO-SL
The Coral Reefs Optimization algorithm with Substrate Layers (CRO-SL) [42] was considered in this work to evolve and select rules which improve the PRIM performance. The CRO-SL was first introduced in [43], with further development shown in [44], where a multimethod ensemble was finally defined. In [45], a probabilistic-dynamic version of the algorithm, the DPCRO-SL, was proposed, in which the probability of applying a given operator varies during the algorithm's evolution depending on its performance. This is the CRO-SL version used in this paper for rule evolution.

Rules Encoding
The rule encoding in the PRIM method gives us rules with the structure [N, L_min, L_max], where N is the name of the predictive variable, L_min is the lower limit, and L_max is the upper limit. We adapt this encoding to the solutions in the CRO-SL encoding system, in which a solution is a vector x = [x_1, x_2, ..., x_n], with x_i for i ∈ {1, 2, ..., n}. For evolution, since the PRIM method gives a triplet, understood as a vector with three components, we can adapt the CRO-SL encoding to follow Equation (2):
rule_k = [x_{3k−2}, x_{3k−1}, x_{3k}],    (2)
where k ∈ {1, 2, ..., m} and m = n/3. For the selection case, we change the encoding to a binary vector b = [b_1, b_2, ..., b_n], with b_i ∈ {0, 1}, where n stands for the number of PRIM rules; 0 means that the rule is not selected, and 1 means that it is selected.
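A minimal sketch of these two encodings (triplets flattened into one real-valued vector, plus a binary selection mask); the variable indices and limits below are hypothetical:

```python
# Each PRIM rule [N, L_min, L_max] occupies three consecutive positions of the
# flat CRO-SL individual; a binary mask encodes which rules are selected.
def decode_rules(x):
    assert len(x) % 3 == 0
    return [tuple(x[3 * k: 3 * k + 3]) for k in range(len(x) // 3)]

def select_rules(rules, b):
    return [r for r, keep in zip(rules, b) if keep == 1]

x = [0, 95.0, 100.0, 3, 0.0, 2.5]        # two rules as one flat vector
rules = decode_rules(x)                   # [(0, 95.0, 100.0), (3, 0.0, 2.5)]
chosen = select_rules(rules, [1, 0])      # binary selection vector keeps rule 1
```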

Objective Functions
An important issue to handle is the objective function of the DPCRO-SL. As a metaheuristic method, it needs a heuristic [46] to guide its search. For this purpose, we first have to define some metrics [47], such as the following:
• Precision. The precision is defined by Equation (3).
• Recall. The recall is described by Equation (4).
• Support. The support is defined by Equation (5).
With these metrics in mind, we establish our heuristics as follows:
- H1: Weighted sum (WeSum) [48]. The weighted sum of precision and support. This function is evaluated for different values of α; depending on the value of α, it enhances (or penalizes) precision or support.
- H3: Subtraction pondered by complementary (SubCom). The subtraction of precision and recall, weighted by the complementary of recall. This heuristic penalizes the recall.
- H4: Inverse of absolute subtraction (InvAbs). The inverse of the absolute subtraction of precision and support. This penalizes the difference between precision and support.
Note that we handle the division by zero for H3 and H4 when recall = 1 and precision = support, respectively. Nevertheless, because of the stochastic nature of the process, the probability of this happening is near zero.
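Assuming the standard definitions of precision and recall over the samples covered by a box, and taking support as the fraction of samples covered, the metrics and two of the heuristics can be sketched as follows (the exact forms of Equations (3)-(5) and of the heuristics are assumptions here, not the paper's definitions):

```python
def metrics(tp, fp, fn, n_total):
    """Assumed forms: precision = TP/(TP+FP), recall = TP/(TP+FN),
    support = fraction of the dataset covered by the box."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    support = (tp + fp) / n_total
    return precision, recall, support

def we_sum(precision, support, alpha):        # H1: weighted sum
    return alpha * precision + (1 - alpha) * support

def inv_abs(precision, support):              # H4, guarding precision == support
    if precision == support:
        return float("inf")
    return 1.0 / abs(precision - support)
```

For instance, a box covering 40 of 1000 samples with 30 true positives and 20 missed events gives precision 0.75, recall 0.6, and support 0.04.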
• Cauchy mutation (CM): This operator mutates a larva by adding noise drawn from the Cauchy probability density function,
f(x) = 1 / [πt (1 + (x/t)²)],
where t > 0 is a scale parameter, usually t = 1. Note that the Cauchy probability distribution looks similar to the Gaussian distribution (see Figure 2), but it approaches the axis so slowly that an expectation does not exist and the variance is infinite [52]. In this case, the mutated larva is calculated as x'_i = x_i + ηδ, where η is a scale parameter and δ is a random number following the Cauchy distribution.
• Differential evolution operator (DE): The differential evolution operator is inspired by the homologous evolutionary algorithm [53]. DE creates new larvae by perturbing the population, applying a difference operator over the individuals. The applied difference operator for introducing the perturbations is expressed as follows:
v = x_{r1} + F (x_{r2} − x_{r3}),
where r1, r2, and r3 index three distinct, randomly chosen individuals, and F is a factor which regulates the evolution by weighting the perturbation amplitude.
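A minimal sketch of these two operators, assuming a real-valued population and the DE/rand/1 variant mentioned later in the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def cauchy_mutation(x, eta=0.1, t=1.0):
    """Perturb one larva with heavy-tailed Cauchy noise: x' = x + eta * delta,
    sampling delta via the inverse CDF of the Cauchy(0, t) distribution."""
    delta = t * np.tan(np.pi * (rng.random(x.shape) - 0.5))
    return x + eta * delta

def de_rand_1(pop, F=0.8):
    """DE/rand/1: v = x_r1 + F * (x_r2 - x_r3), with three distinct random rows."""
    r1, r2, r3 = rng.choice(len(pop), size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

pop = rng.normal(size=(10, 6))   # a small population of 6-dimensional larvae
mutant = cauchy_mutation(pop[0])
donor = de_rand_1(pop)
```

The heavy tails of the Cauchy noise occasionally produce large jumps, which is the exploration property discussed in the experimental section.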

Machine Learning and Deep Learning Classification Methods
We present here the most important characteristics of the ML and DL techniques considered in this work.

Support Vector Classifier
The support vector classifier (SVC) [54] is a supervised learning algorithm used in many classification problems [55]. The objective of the SVC algorithm is to find a hyperplane that, to the best degree possible, separates data points of one class from those of another class. Support vectors refer to a subset of the training instances that identify the location of the separating hyperplane. The standard SVC algorithm is formulated for binary classification problems, and multiclass problems are typically reduced to a series of binary ones.
Formally, given a labeled training dataset (x_i, y_i), i = 1, ..., n, and given a nonlinear mapping φ(·), the SVC method solves the following problem:
min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i,
constrained to:
y_i (⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n,
where w and b define a linear classifier in the feature space, and ξ_i are positive slack variables enabling the handling of permitted errors (see [55]). An appropriate choice of the nonlinear mapping φ guarantees that the transformed samples are more likely to be linearly separable in the (higher-dimensional) feature space. The regularization parameter C controls the generalization capability of the classifier, and it must be selected by the user.
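A brief usage sketch with scikit-learn's SVC on synthetic data (not the paper's experimental setup); the RBF kernel plays the role of the nonlinear mapping φ:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Circular decision boundary: not linearly separable in the input space.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(C=1.0, kernel="rbf").fit(X, y)   # C is the regularization parameter above
acc = clf.score(X, y)                      # training accuracy
```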

Random Forest
Random forest (RF) [56] is among the most renowned bagging-like techniques for classification and regression problems. Bagging is the simplest ensemble technique used to train multiple learners and provide a unified output. It considers an ensemble composed of learners with an equal architecture, that is, with the same topology, number of input-output variables, and parameters. RF employs decision or regression trees as predictors, in such a way that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Therefore, RF differs from the pure bagging technique in that the topology of the trees changes among them. The generalization error for forests converges to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them.
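A short scikit-learn sketch on synthetic data (not the paper's setup); as a side benefit, RF exposes a per-variable importance measure, in the spirit of the variable analysis carried out later:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = (X[:, 0] > 0.5).astype(int)    # the label depends only on the first feature

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_   # the informative feature should dominate
```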

Gaussian Naive Bayes
Naive Bayes methods [57] are supervised learning algorithms based on the Bayes theorem. These methods assume the "naive", i.e., simple, condition of independence among every pair of features given the value of the class variable. The Bayes theorem states the following relationship, given the class variable y and the dependent feature vector x_1 through x_n:
P(y | x_1, ..., x_n) = P(x_1, ..., x_n | y) P(y) / P(x_1, ..., x_n).
Assuming the independence of the random variables x_i, we can express the previous equation as follows:
P(y | x_1, ..., x_n) = P(y) ∏_{i=1}^{n} P(x_i | y) / P(x_1, ..., x_n).
The naive Bayes classifier retrieves the maximum argument of the previous expression, the so-called maximum a posteriori, or simply MAP, considering that P(x_1, ..., x_n) is constant:
ŷ = argmax_y P(y) ∏_{i=1}^{n} P(x_i | y).
Gaussian naive Bayes (GNB) simply assumes that each x_i | y is a normal random variable:
P(x_i | y) = (1 / √(2πσ_y²)) exp(−(x_i − μ_y)² / (2σ_y²)).
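The MAP decision above can be implemented directly; the following is a self-contained numpy sketch of Gaussian naive Bayes on synthetic data (a didactic sketch, not the library implementation used in the paper):

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class feature means, variances, and priors."""
    classes = np.unique(y)
    stats = {c: (X[y == c].mean(0), X[y == c].var(0) + 1e-9, (y == c).mean())
             for c in classes}
    return classes, stats

def gnb_predict(X, classes, stats):
    """MAP decision: argmax_y [ log P(y) + sum_i log N(x_i; mu_y, sigma_y^2) ]."""
    scores = []
    for c in classes:
        mu, var, prior = stats[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(1)
        scores.append(np.log(prior) + log_lik)
    return classes[np.argmax(scores, axis=0)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
classes, stats = gnb_fit(X, y)
pred = gnb_predict(X, classes, stats)
```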

K-Nearest Neighbors
The k-NN [58] method is among the top techniques for data mining. It uses the well-known principle of Cicero, birds of a feather flock together, or, literally, equals are easily associated with equals: it tries to classify an unknown sample based on the known classification of its neighbors.
k-NN is a nonparametric ML method which looks for the set of K instances of the training set which are the closest to the new test instance. The term "closest" to an instance x_i is measured with respect to a metric or distance d(·, ·) which fulfills the following: d(x, y) ≥ 0, with d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x) (symmetry); and d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality). Most frequently, the Euclidean distance is used within the k-NN domain.
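A minimal numpy sketch of the k-NN decision with the Euclidean distance:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k Euclidean-nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every sample
    nearest = y_train[np.argsort(d)[:k]]      # labels of the k closest instances
    return np.bincount(nearest).argmax()      # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
label = knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3)
```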

AdaBoost
Adaptive boosting (AB) [59] is a boosting method that proposes to train each of the ensembled learners (individual models) iteratively. Normally, each learner focuses on the data that were misclassified by its predecessor, adapting its parameters to achieve better results. As in all boosting methods, AdaBoost establishes the same structure for all of the learners involved in the ensemble, that is, the same architecture, number of parameters, and input-output variables. After creating the ensemble structure, the learners are trained sequentially, in such a way that each new learner requires that the previous one has already been trained.

Gradient Boost
Gradient boost (GB) [60] also combines a set of learners to create a stronger ensemble. Here, the residual of each learner in the chain becomes the target for the next consecutive learner, and hence it is an additive model. The residuals are used in a step-by-step manner by the learners, in order to capture the maximum variance within the data.
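The residual-fitting idea can be illustrated with a toy additive model in which each weak learner is a one-split stump fitted to the current residuals (a didactic sketch, not the GB implementation used in the paper):

```python
import numpy as np

def fit_stump(X, r):
    """Hypothetical one-split regression stump fitted to the current residuals."""
    j, t = 0, np.median(X[:, 0])
    left, right = r[X[:, j] <= t].mean(), r[X[:, j] > t].mean()
    return lambda Z: np.where(Z[:, j] <= t, left, right)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0]                      # a simple target to be boosted towards

pred = np.zeros(len(y))
for _ in range(20):                    # each stump fits the residual of the chain
    r = y - pred
    stump = fit_stump(X, r)
    pred += 0.5 * stump(X)             # shrinkage (learning rate) of 0.5

mse = ((y - pred) ** 2).mean()         # much lower than predicting zero everywhere
```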

Multilayer Perceptrons
A multilayer perceptron (MLP) [61] is a class of artificial neural network (ANN) [62] consisting of an input layer, several hidden layers, and an output layer, which are all built from a number of special processing units called neurons. Layers are placed consecutively, and each neuron of a layer is connected to the neurons of the consecutive layer by means of weighted links. The values of these weights are related to the capacity of the MLP to learn the problem, and they are learned from a sufficiently large number of examples. The process of assigning values to these weights from labeled examples is known as the training process of the perceptron. MLP training processes are based on stochastic methods; backpropagation is among the best-known learning algorithms applied to train ANNs.
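A brief scikit-learn sketch of an MLP on a synthetic problem that is not linearly separable (not the paper's setup):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # XOR-like quadrant labels

# Two hidden layers of 16 neurons; weights are learned by backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                    random_state=0).fit(X, y)
acc = mlp.score(X, y)
```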

Deep Learning Methods for Time Series Classification
In recent decades, deep learning approaches have become increasingly popular for performing tasks related to time series classification or regression [63,64]. A general deep learning framework for time series forecasting is depicted in Figure 3. These networks are designed to learn hierarchical representations of the data. A deep neural network is a composition of N parametric functions referred to as layers, where each layer is considered a representation of the input domain. Among the existing deep learning methodologies for time series forecasting, the following were selected as classifiers for the fog detection problem: recurrent neural networks (RNNs) [65], gated recurrent units (GRUs) [66], long short-term memory (LSTM) networks [67], and 1D convolutional neural networks (1D-CNNs) [68]. Preliminary work prior to the application of these methods involves the transformation of the data, expressed as matrices, into three-dimensional tensors (variables × time sequence of length L × samples).
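The matrix-to-tensor transformation can be sketched with a simple sliding window; the axis ordering below is one possible convention, not necessarily the one used in the paper:

```python
import numpy as np

def to_sequences(X, L):
    """Stack sliding windows of length L so each sample carries its recent history:
    (time steps, variables) -> (time steps - L + 1, L, variables)."""
    return np.stack([X[i: i + L] for i in range(len(X) - L + 1)])

X = np.arange(20.0).reshape(10, 2)     # 10 time steps, 2 variables
T = to_sequences(X, L=4)               # 7 overlapping sequences of length 4
```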
RNNs are a specialized type of ANN designed to deal with sequential data. They incorporate a feedback loop in which the output of an RNN layer is taken as an input to the next layer. Hence, RNNs are especially suitable for analyzing time series in problems related to text generation [69], speech recognition [70], or forecasting tasks [71].
Long short-term memory (LSTM) networks constitute a specific type of RNN that emerged to solve the vanishing gradient problem, which consists of the gradient decreasing as the number of layers increases, preventing the network from learning when the number of RNN layers is relatively high. LSTM networks include an additional integrated feature, i.e., the memory cell. They use three gates (forget, update, and output) to keep long-standing relevant information and discard the irrelevant. The forget gate decides what relevant information should be saved. The update gate decides what new information to use to update the memory cell. Finally, the output gate decides which output value will be the input of the next hidden unit [72]. Figure 4 shows the architecture of an LSTM unit.
GRU networks emerged in 2014 as a streamlined version of the LSTM, with a simplified memory cell that considerably reduces the high computational cost of the LSTM and achieves comparable performance. GRU layers only present two gates: (1) the update gate, which decides whether the memory state is or is not updated; and (2) the relevance gate, which determines how relevant it is to compute the next candidate.
Finally, CNNs are a specific type of feedforward neural network initially developed for tasks related to image processing and computer vision. The CNN architecture typically consists of a sequential stacking of layers [73]. In recent years, CNNs have become increasingly popular, not only for image processing tasks but also for all kinds of sequential-data-related problems, such as text or time series forecasting. Usually, one-dimensional CNNs (1D-CNNs) are applied to extract features from each variable's time series [63].

Data and Problem Description
Visibility data measured at the Mondoñedo weather station, Galicia, Spain (43.38° N, 7.37° W), from 1 January 2018 to 30 November 2019, with a sampling frequency of 30 min, were used in this paper, accounting for a total of 33,015 samples. This station measures visibility on the A8 motor-road in Lugo, where orographic (hill) fog usually occurs, forcing the closure of the motor-road in cases of extreme low-visibility events. Note that orographic fog is formed when moist air moves up a hill, resulting in the expansion and cooling of the air to its dew point, and producing intense episodes of low visibility. The Mondoñedo area is known to experience frequent low-visibility events caused by orographic fog due to winds coming from the Cantabrian Sea.
The data were acquired with a Biral WS-100 visibility sensor, which operates between 0 and 2000 m. The location of Mondoñedo is 140 m above sea level, at a distance of around 20 km from the sea; all the variables were acquired at a height of 2 m from the ground. Figure 5 illustrates the first 500 samples of the visibility time series, where the tendency of the data to remain close to 2000 or to 0 can be observed, while there are few data in intermediate ranges. This is further illustrated by plotting the density function of the data (Figure 6), where two peaks may be found: one near 0 m, corresponding to very low visibility levels, and a much more prominent peak near 2000 m, corresponding to events where visibility is very good. This work is focused on extreme low-visibility events, i.e., those fog events leading to visibility values below 40 m, which are those affecting normal motor-road circulation the most. Since the problem is approached as a binary classification problem, Figure 7 shows the distribution of samples for each class, where extreme low-visibility events (class 1) account for 7.41% of the total number of samples. Thus, we focus on developing inductive and evolutionary rules for detecting and describing these specific events. The predictive variables used in this study for dealing with extreme low-visibility events consist of 11 exogenous meteorological variables registered by the same weather station at Mondoñedo (Table 1). A brief discussion on the predictive variables is worthwhile here. Note that we used all the variables available (measured) at the Mondoñedo weather station; we did not carry out a previous step of feature selection, so we trust that the proposed algorithms are able to deal with possible redundancy between predictive variables, extracting the useful information from them in terms of their contribution to the visibility prediction.
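The binarization of the visibility series into extreme low-visibility events can be sketched as follows (illustrative values, not the real data):

```python
import numpy as np

# Class 1: extreme low-visibility event, i.e., visibility below 40 m.
visibility = np.array([1800.0, 35.0, 500.0, 12.0, 2000.0, 39.9])
labels = (visibility < 40.0).astype(int)
event_rate = labels.mean()   # in the real dataset this is 7.41% of the samples
```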
In addition, note that some of them are not related to visibility, but they are measured for other reasons, such as for the care of the road. For example, the salinity of water in air is a variable related to the corrosion of elements on the road, but it does not seem to be fully related to orographic fog. Similarly, accumulated precipitation (during the last hour) and ground temperature do not have much relation with the fog formation mechanism, but they can be involved in its persistence. The time horizons of the visibility forecast were established as +3 h, +5 h, and +10 h; therefore, measured variables acquired at t = h (or ranging from t = h − timestep to t = h in the case of DL methods) are used as predictors to forecast visibility at t = h + 3, t = h + 5, and t = h + 10. In order to obtain an idea of the importance of the predictive variables, the Pearson correlation coefficient between the predictive variables and the target variable (prior to its binarization) is presented in Figure 8 for the three prediction time horizons considered. It can be seen that for some variables (qp_ra, sa, td, ws, ap) the correlation remains constant as the prediction horizon increases, while for others (vs, hr, at, st, gr, wd) the correlation varies. The variables showing the highest correlation are the visibility in previous hours, the relative humidity, and the dew temperature. On the contrary, there are some other variables, such as the salinity of water in air or the ground temperature, which, as expected, do not have a high correlation with visibility prediction.

Experiments and Results
This section reports the results obtained in this work. In the first place, Section 4.2 analyzes the rules obtained by the induction and discovery algorithms. Next, Section 4.3 presents and discusses the results obtained for the first proposed methodology, consisting of the use of decision rules as a possible alternative to classification algorithms. Finally, results for the second approach are provided in Section 4.4, where information related to decision rules is incorporated into the ML/DL classifiers as an additional input feature.

Experimental Setup
The experimental setup, along with the parameters set for the different ML and DL methods, is detailed in this section. First, a preliminary dataset preparation is carried out. The steps of this preprocessing are: (1) splitting the dataset into training and test (75-25%) subsets, ensuring that no test instance was seen by the classification methods during training; since we are dealing with time series data, instead of randomly splitting the dataset, the last 25% of the data were separated as test data; and (2) scaling of the features, which is important to keep the data within a predefined range. Feature standardization was performed, causing the data to have zero mean and unit variance, z = (x − x̄)/σ (Equation (25)), where x is the original feature vector, x̄ denotes the feature mean, and σ its standard deviation. Table 2 shows the parameters used for the benchmark methods considered. All of the simulations were run on an Intel(R) Core(TM) i7-10700 CPU at 2.90 GHz with 16 GB RAM, using the Python libraries sklearn, tensorflow, prim, and scipy.
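Steps (1) and (2) can be sketched as follows; note that the standardization statistics are taken from the training part only, so that no test information leaks into the preprocessing (an assumption about the implementation):

```python
import numpy as np

def chronological_split(X, y, test_frac=0.25):
    """Hold out the last fraction of a time series instead of a random split."""
    cut = int(len(X) * (1 - test_frac))
    return X[:cut], X[cut:], y[:cut], y[cut:]

def standardize(X_train, X_test):
    """z = (x - mean) / std, with statistics computed on the training set only."""
    mu, sigma = X_train.mean(0), X_train.std(0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X = np.arange(40.0).reshape(20, 2)     # toy "time series" of 20 samples, 2 features
y = np.arange(20)
X_tr, X_te, y_tr, y_te = chronological_split(X, y)
X_tr_s, X_te_s = standardize(X_tr, X_te)
```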
In addition, we performed an optimization of the DPCRO-SL hyperparameters before using it for rule evolution. For this purpose, a genetic algorithm (GA) was used; there are many examples in the literature of the successful use of a GA for optimizing the hyperparameters of, for example, neural networks [74], which is why we opted for this approach.
As a result of this process of optimization and selection of hyperparameters, we obtained the values specified in Table 2, as well as the operators defined in Section 2.1.2.3. It should be discussed why these operators perform better than others, i.e., DE/rand/1 instead of DE/best/1, and the Cauchy mutation instead of the Gaussian mutation. The reason in both cases is similar: the PRIM rules are already very good themselves, and we are unlikely to find an improvement in their neighborhood. Therefore, a mutation with a greater spread, such as the Cauchy (see Figure 2), facilitates a higher exploration than the Gaussian. Furthermore, the DPCRO-SL itself is responsible for modifying the significance and quantity of the population with which each operator works at all times. This may explain why a 3Px crossover is not necessary: with 2Px already available to cross in blocks of components, and MPx to cross several (or all) components, an intermediate operator between the two does not seem necessary, as the algorithm itself will combine MPx and 2Px to obtain the desired crossover. Finally, the CS operator can offer a fine level of granularity to modify the last details close to good solutions.

Analysis of the Extracted Decision Rules
We start this section by analyzing the heuristics defined in Section 2.1.2.2. For the WeSum objective function, we set 0.01, 0.1, 0.2, 0.3, 0.6, 0.7, and 0.9 as values of α. With this experiment, it is possible to verify whether it is better to penalize precision or support. Each heuristic is used 10 times to evolve and generate the best possible set of rules, i.e., 10 sets of rules (or boxes) are generated with each heuristic.
Then, we establish a threshold for minimum support size and precision for each box, namely, a minimum of 100 individuals (0.61% support) and a minimum of 40% precision. In this way we can measure how many boxes generated by each heuristic satisfy the threshold for each predictive horizon. Figure 9 shows the number of rule sets that fulfill the threshold on each time prediction horizon considered.
For the 3 h case (Figure 9a), the F1-score and PPR heuristics stand out from the others. As for WeSum, it seems that low α values are better. Regarding the other heuristics, none of them can be said to work poorly for this time horizon, because they all generate valid boxes; however, some heuristics clearly perform better than others.
In the 5 h horizon (Figure 9b), the F1-score and PPR heuristics continue to dominate, so that all (or almost all) of the boxes generated exceed the threshold. It also confirms that WeSum is substantially better with small α values. Among the other heuristics, only SubCom survives.
Finally, note that in the 10 h case (Figure 9c), only the F1-score heuristic and the WeSum heuristic with small values of α fulfill the threshold. With all these results in mind, we can draw several conclusions: first, we showed that α values in the range [0.1, 0.3] are the best option for the WeSum heuristic. On the other hand, the best of all the heuristics is undoubtedly the F1-score, which performs well for all time horizons, producing very good sets of rules. It should also be mentioned that PPR performs well overall.
Afterwards, it is possible to observe in Figure 10 how many times each variable is repeated between the different boxes for each time horizon, both for the boxes of PRIM and those obtained by the DPCRO-SL. This helps us to analyze the importance of each variable over time, and the impact that each one of them can have on short- and long-term predictions. We consider the average behavior between PRIM and DPCRO-SL, remembering that the number of boxes that exceed the threshold decreases as the prediction horizon increases.
In turn, Figure 11 shows the rules associated with each variable in the most restrictive box generated for each horizon. As mentioned in Section 2, these rules are expressed as a range of values to be taken by the variable, i.e., the lower limit (minimum) and upper limit (maximum) between which it has to lie. This means that if the range of values specified by the rules covers all possible values, the variable is not very important for the prediction; however, if the range is very small, the variable has great relevance. These figures allow us to reinforce the conclusions that can be obtained from the repetitions between boxes across time horizons.
The first variable to evaluate is the atmospheric pressure. It is observed that with a forecast horizon of +3 h, this variable appears in approximately 75% of the PRIM boxes and 35% of the DPCRO-SL boxes. In the next horizon, it is present in 60% and 90% of the boxes, respectively. Finally, for the 10 h horizon, the atmospheric pressure is found in all generated boxes. The value ranges of the most restrictive box emphasize this point, showing that for 3 h and 5 h the atmospheric pressure is irrelevant, while for the 10 h case it has a more significant role. In summary, these results show that atmospheric pressure can have effects on extreme low-visibility events in the long term.
Something similar happens with air temperature. For a time horizon of 3 h, it appears in 40% of PRIM boxes and 45% of DPCRO-SL boxes. For 5 h, it is found in 40% and 95%, respectively, while for 10 h it is found in 100% and 35%. The case of global solar radiation is even more accentuated, with 40% and 35% for 3 h, 40% and 15% for 5 h, and 90% and 70% for 10 h. Figure 11 reinforces this, as air temperature and global solar radiation acquire importance in the most restrictive box at 10 h. This means that, like atmospheric pressure, air temperature and global solar radiation can influence the long-term prediction of thick fog at the Mondoñedo station.
Accumulated precipitation and relative humidity seem to be very important variables for the prediction: for the 3 h prediction horizon they are found in 75% and 100% of PRIM boxes, respectively, and in 100% of DPCRO-SL boxes in both cases. At the other horizons, they maintain a presence equal to or greater than 70% in the different boxes. The ranges of these variables in the most restrictive box are very small for all horizons. In particular, it is important for fog formation that accumulated precipitation oscillates around 0 and that relative humidity is as close as possible to 100%.
Visibility and wind speed seem to be closely related. Both variables maintain a presence greater than 80% (and almost always close to or equal to 100%) in the PRIM and DPCRO-SL boxes for all prediction time horizons except 10 h, where visibility appears in 60% of the boxes. The evolution of the value ranges shows that both variables have a very considerable influence in the short and medium term, but not in the long-term prediction. Bearing in mind that the 10 h boxes are much less numerous, there is no contradiction in the variables appearing in most 10 h boxes, as long as the rules associated with them select all (or almost all) of the range of possible values.
Dew temperature is found in only 30% of PRIM boxes and 70% of DPCRO-SL boxes for 3 h. At 5 h and 10 h, its presence rises above 60% in all cases. This means it is a variable whose influence is only appreciable in the medium-to-long term. The opposite is true of salinity: it has a large presence in the 3 h boxes (75% and 50%), but its presence at 10 h is somewhat reduced, dropping to 30% for DPCRO-SL. Since its range of values widens over the time horizons, it is a variable with relevance only in the short-term prediction horizon.
Finally, the variables floor temperature and wind direction are irrelevant for the prediction of thick fog, although for different reasons. Floor temperature is present in most boxes, both for PRIM and DPCRO-SL, at the different time steps; its lowest presence is 60% at 10 h, remaining close to 90% or 100% in the other cases. However, Figure 11 shows that the rules associated with floor temperature, although always present, select the full (or nearly full) range of possible values. A rule may, for instance, set a minimum floor temperature of 12, but such a constraint has negligible importance. In the case of wind direction, even though the range of values always covers the full span, it is a variable that is not even present in most boxes: except for the 80% reached in the DPCRO-SL boxes at 5 h, it always remains below 60%, dropping to 10% for DPCRO-SL at 3 h and 0% for PRIM at 5 h.
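This range-based notion of relevance can be illustrated with a minimal Python sketch in which a box is represented as per-variable intervals and each rule is scored by the fraction of the observed range it selects (a value near 1 marks an irrelevant variable, a small value a highly relevant one). All variable names, intervals, and observed ranges below are hypothetical, not the actual induced boxes.

```python
# Hypothetical min/max of each variable in the training data.
observed_range = {
    "relative_humidity": (20.0, 100.0),
    "accum_precip": (0.0, 40.0),
    "floor_temp": (-5.0, 35.0),
}

# Hypothetical induced box: one interval rule [lo, hi] per variable.
box = {
    "relative_humidity": (97.0, 100.0),  # narrow interval -> very relevant
    "accum_precip": (0.0, 0.2),          # narrow interval -> very relevant
    "floor_temp": (12.0, 35.0),          # near-full range -> negligible role
}

def coverage(var):
    """Fraction of the observed range selected by the rule (near 1 = irrelevant)."""
    lo, hi = box[var]
    obs_lo, obs_hi = observed_range[var]
    return (hi - lo) / (obs_hi - obs_lo)

for var in box:
    print(f"{var}: coverage = {coverage(var):.3f}")
```

Under these illustrative values, relative humidity and accumulated precipitation get tiny coverage scores (strong constraints), while floor temperature covers more than half of its observed range, matching the qualitative reading of Figure 11.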

Use of Induced Rules to Anticipate the Presence of Thick Fog
Initially, the induced rules were used as a classification technique to anticipate the presence or absence of thick fog (Figure 12). For this purpose, two different approaches were tested. In the first one, given a test sample, each induced rule was assessed individually on the predictive variables; if the rule was fulfilled, that rule predicted the presence of thick fog. A different output was therefore generated for each rule. To select the best-performing rule, the performance of each rule was first assessed on the training data, selecting the rules with the best scores for the three evaluated metrics, and then those performances were computed again on the test data.
The second approach consists of evaluating boxes of induced rules instead of individual rules. The outcome of the prediction for each box was considered positive (i.e., the presence of fog was forecasted) when all of its individual rules were fulfilled. Again, as the number of boxes is significant, the performance of each box was measured on the training data, selecting the best performers for each metric and then assessing their values on the test data. Test error metrics for these experiments are displayed in Appendix A. Table A1 shows the test error metrics for the +3 h prediction time horizon of all the classifiers evaluated. It can be seen that GB is the most reliable ML method, remaining the most balanced performer across the three metrics measured. Regarding the DL methods, the four classifiers yield similar performance, though slightly worse than GB.
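The two strategies above can be sketched as follows: a single rule fires when the sample lies inside its interval, and a box predicts thick fog only when all of its rules hold simultaneously. The variable names, intervals, and sample values are illustrative assumptions, not the rules actually induced in this work.

```python
def rule_holds(sample, var, lo, hi):
    """A rule is an interval constraint [lo, hi] on one predictive variable."""
    return lo <= sample[var] <= hi

def predict_rule(sample, rule):
    """Individual-rule classifier: predict thick fog iff the rule is fulfilled."""
    var, lo, hi = rule
    return rule_holds(sample, var, lo, hi)

def predict_box(sample, box):
    """Box classifier: predict thick fog iff ALL rules in the box are fulfilled."""
    return all(rule_holds(sample, var, lo, hi) for var, lo, hi in box)

# Illustrative sample and rules (hypothetical values).
sample = {"relative_humidity": 99.0, "accum_precip": 0.0, "wind_speed": 1.2}
rule = ("relative_humidity", 97.0, 100.0)
box = [("relative_humidity", 97.0, 100.0),
       ("accum_precip", 0.0, 0.2),
       ("wind_speed", 0.0, 3.0)]

print(predict_rule(sample, rule))  # individual-rule forecast
print(predict_box(sample, box))    # box-of-rules forecast
```

Selecting the "best" rule or box then reduces to computing recall, precision, and F1-score of these Boolean forecasts on the training data and re-evaluating the winners on the test data, as described above.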
Moving on to the evaluation of rule boxes, it may be observed that, by prioritizing the recall metric, a superior result is obtained compared to the benchmark model with the highest recall, GNB, which achieves 98% detection of severe fog, albeit with a very low precision of 18%. In contrast, the best PRIM box achieves a 90% positive detection rate with a precision of almost 30%, and the best CRO box yields 86% detection with 34% precision. When evaluating the more balanced boxes, i.e., those with the best F1-score, it can be seen that, although they achieve quite good results, especially the CRO-evolved boxes with a score of 0.61, they do not improve on the results achieved by GB (0.66).
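These trade-offs can be checked against the standard F1 formula, F1 = 2PR/(P + R), using the precision/recall pairs reported above (0.30 stands in for the "almost 30%" PRIM-box precision):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.18, 0.98), 2))  # GNB: very high recall, very low precision
print(round(f1(0.30, 0.90), 2))  # best PRIM box
print(round(f1(0.34, 0.86), 2))  # best CRO box
```

The computation makes the point explicit: the near-perfect recall of GNB is heavily penalized by its 18% precision, so both recall-oriented boxes still obtain a clearly higher F1-score.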
Regarding the performance of the individual induced rules used as a classification method, it is observed that, for both the PRIM and the CRO-evolved rules, the ones with the highest recall values reach a 100% thick-fog detection rate, although their use as classifiers lacks practical interest due to the low precision achieved (9.45% in both cases). Nevertheless, the rules with the best F1-score present very competitive results in both cases, even surpassing the best ML classifier in the case of the CRO-evolved rule. Thus, the use of this rule constitutes a simpler, faster, and computationally cheaper alternative to the ML/DL classification techniques. Figure 13 displays the F1-score distribution for all the classification methods evaluated, showing a clear difference in error-distribution shape between the individual induced rules and the rule boxes: while the former reach higher maximum values than the latter, the average for the boxes remains higher. This is explained by the fact that boxes, as combinations of individual rules, take both strong and weak rules and combine them to produce more consistent results than the individual rules. However, to solve the prediction problem tackled in this paper, simply taking the best-performing individual rule is more appealing, as it leads to better results.
Finally, Figure 14 shows the comparison of all the classification methods tested in the +3 h prediction case, sorted from best to worst performance considering the sum of the three error metrics assessed. The individual rules, both PRIM-induced and CRO-evolved, outperform the other methods, representing the best options for predicting thick fog events.

Regarding the performance of the employed methods for the 5 h prediction time horizon, Table A2 shows the error metrics for each of the classification techniques. The conclusions drawn for the preceding time horizon are maintained: GB remains the ML/DL method with the best F1-score, but this result is surpassed by the best individual rule evolved with CRO. Regarding the rule boxes, neither PRIM nor CRO boxes outperform the best individual rules in terms of F1-score, with detection rates above 72% and precision around 25%. Figure 15 shows the sorted comparison of the methods' performances, headed by the best (in terms of F1-score) individual rule evolved by CRO, followed by the two boosting ML methods, GB and AB, and then by the individual rules induced by PRIM.

Finally, Table A3 displays the test results for the +10 h prediction time horizon. In this case, the conclusions are slightly different from the previous scenarios: first, RF becomes the best-performing ML classifier instead of GB. Moreover, the DL methods outperform the ML models in this long-term prediction, with the RNN model obtaining the highest F1-score. Differences are also observed in the performance of the individual rules: here it is the PRIM-induced rules that offer the best overall performance in terms of F1-score, with a significant increase of 6% with respect to the best ML/DL model.
Figure 16 shows the test performance of all the methods sorted according to the sum of the three error metrics assessed. In this case, both the CRO-evolved and the PRIM-induced individual rules remain the most reliable and robust classifiers, outperforming the ML/DL models.

Use of Rules as Additional Predictive Variables in ML/DL
The second methodology implemented in this paper involves the use of the induced rules as additional predictive Boolean variables in the ML/DL classification models. Once the induced rules have been assessed on both the train and test samples, a Boolean variable is created according to whether the rule (or box of rules) is fulfilled or not. This additional variable is then appended to the meteorological variables that form the train and test data. Subsequently, the ML and DL classification models are trained using the new training set, and their test predictions are evaluated. Figure 17 illustrates the flow diagram of the methodology.
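The augmentation step can be sketched as follows, using numpy and a scikit-learn classifier as a stand-in for the paper's ML/DL models; the rule interval, the random feature matrix, and the choice of RandomForestClassifier are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rule_feature(X, col, lo, hi):
    """Boolean column: 1.0 where the interval rule holds for feature `col`."""
    return ((X[:, col] >= lo) & (X[:, col] <= hi)).astype(float)

rng = np.random.default_rng(0)
X_train = rng.random((200, 5))                 # placeholder meteorological data
y_train = (X_train[:, 0] > 0.9).astype(int)    # placeholder fog labels

# Hypothetical best rule: feature 0 constrained to [0.9, 1.0].
extra = rule_feature(X_train, col=0, lo=0.9, hi=1.0)
X_aug = np.hstack([X_train, extra[:, None]])   # original variables + rule variable

# Train the classifier on the augmented feature matrix.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_aug, y_train)
```

The same rule (or box) evaluation is applied to the test samples before prediction, so train and test matrices keep identical column layouts.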
Five different strategies were studied for selecting the additional variables to be introduced into the system, as shown in Table 3. The first and second cases consist of selecting the best box of rules, i.e., the one presenting the highest F1-score on the training data: the best PRIM-induced box (Case 1) and the best CRO-evolved box (Case 2). The same approach was repeated using the best individual rules instead of sets of rules in Cases 3 (best rule induced by PRIM) and 4 (best CRO-evolved rule). Finally, Case 5 adds the variables of the four preceding cases to the database.

Figure 17. Flow diagram of the proposed methodology in which the induced rules are used as additional predictor variables.

Table 3. Different scenarios evaluated when introducing the rule-related information into the database.

Case 0: Original training data.
Case 1: Addition of the PRIM-induced box with the highest F1-score on training data.
Case 2: Addition of the CRO-evolved box with the highest F1-score on training data.
Case 3: Addition of the PRIM-induced rule with the highest F1-score on training data.
Case 4: Addition of the CRO-evolved rule with the highest F1-score on training data.
Case 5: Addition of the four previously mentioned variables.
Next, the results obtained for the three prediction time horizons evaluated (+3 h, +5 h, and +10 h) are presented and discussed. Test error metrics for these experiments are displayed in Appendix B. Table A4 displays the results for the +3 h prediction, revealing a substantial improvement in the three evaluated error metrics, compared to the classifiers experiment, when the information related to the different rules or sets of rules, both induced and evolved, is introduced as predictive variables. It is noteworthy that the only models that maintain identical results are those with a boosting architecture (AB and GB); the added variables apparently do not contribute to the ensembles that form these methods.
When comparing the different cases regarding which rules are introduced as additional variables, by observing the average error metrics across all classifiers, it can be noticed that Case 5 (which includes the four studied variables) achieves the best results, improving the recall metric by 15.8% over the initial results, the precision by 9.1%, and the F1-score by 15.1%. For clarity, the average metrics of all classifiers for the six evaluated cases are represented in Figure 18. Three conclusions can be drawn from this figure: (1) the best-performing case is clearly Case 5, which adds the four additional variables to the system, meaning that the models perform better when handling all the information available to them; (2) Case 2 performs better than Case 1, and Case 4 better than Case 3, indicating that, whether dealing with individual rules or with sets of rules, CRO-evolved rules yield better results than those induced by PRIM; and (3) Cases 3 and 4 perform far better than Cases 1 and 2, suggesting that dealing with a single rule is more efficient than working with a complete set of rules.

Subsequently, results for the 5 h prediction time horizon are shown in Table A5 and Figure 19. Again, a substantial improvement is observed in the performance of all the classification methods, except the boosting methods, when the induced/evolved rules and boxes of rules are added as predictive features. Specifically, the average improvement shown by these models is 14.9% for recall, 8.5% for precision, and 17.1% for F1-score, considering the best-performing way of inputting the rule-based information into the systems: adding the four extra variables to the original database (Case 5). Figure 19 illustrates the differences between the cases evaluated.
On this occasion, it is worth mentioning that the PRIM-induced rules and rule boxes (Cases 1 and 3) perform better than those evolved with CRO (Cases 2 and 4). However, the best-performing approach is undoubtedly still Case 5, with an important enhancement of the three error metrics considered. Finally, Table A6 displays the results for the 10-hour-ahead forecasting. In this case, with the exception of the SVC and KNN models, the improvements achieved are smaller than at the previous time horizons. Again, the best-performing scenario is Case 5, providing an average improvement of 2.6% in recall, keeping the same precision values, and improving the F1-score by 8.8%. Figure 20 illustrates how, in this long-term prediction case, Cases 1 to 4 exhibit performances very similar to the initial database; only in Case 5, when all the supplementary variables are added to the database, are the recall and F1-score increased, though without any improvement in precision.

Conclusions
In this paper, the prediction of extreme low-visibility events due to orographic fog at the A-8 motorway (Galicia, Northern Spain) was tackled as a binary classification problem for three different prediction time horizons, covering mid-term to long-term prediction problems (+3, +5, and +10 h). Specifically, the approach pursued in this work focused on the use of explainable classification models based on inductive and evolutionary decision rules. This approach eases the interpretation of the results, providing an understanding of the model mechanisms and of the reasons why the predictions are obtained. This is a critical piece of information, especially relevant in challenging forecasting problems such as the long-term (10 h) prediction of extreme low-visibility events, where the low persistence of fog significantly hinders the predictions. In these cases, the physical explainability of the forecasts can help a committee of experts in decision making at an early stage.
As a preliminary step, a detailed analysis of the rules and sets of rules extracted by the PRIM and CRO algorithms was conducted, concluding that (1) accumulated precipitation and relative humidity are the most important variables, consistently keeping their significance throughout the different forecasting time horizons evaluated; (2) visibility, wind speed, and salinity present a strong influence over the short- and medium-term forecast, but lose relevance in the long term; (3) regarding the medium- and long-term forecasting, the main significant variables are related to atmospheric pressure, air temperature, global solar radiation, and dew temperature; and (4) some variables, such as floor temperature and wind direction, remain irrelevant and of no interest for the prediction of thick fog across the different prediction horizons.
The first evaluated methodology consisted of applying PRIM-induced and CRO-evolved rules as a possible simpler alternative to more complex ML/DL classification methods, with a reasonable performance in this particular case. Specifically, a comparison among different strategies of applying rule-based information to forecast the presence or absence of thick fog was carried out, demonstrating some performance advantages of these rule-based strategies over conventional ML/DL classifiers, with improvements over the best classification model of 1.05% for the 3 h time horizon, 0.63% for the 5 h, and 6.13% for the 10 h (on the F1-score metric). These methods also offer a further strength over the ML/DL models: their complexity, computational cost, and training time are substantially lower, as they are far simpler models.
The second proposed strategy consists of integrating the information provided by the induced/evolved rules into the ML/DL techniques, as additional predictive features, in order to enrich the complex ML/DL models. Different scenarios were tested to assess the most appropriate way to incorporate the rule-based information into the classification systems. This approach led to a significant enhancement of the classification models' performance for the three prediction time horizons assessed, with F1-score improvement ratios of 15.10% for the 3 h time horizon, 17.10% for the 5 h, and 8.80% for the 10 h.
In this work we showed how the use of explainable induced and evolved rules can be of significant interest in the prediction of extreme low-visibility events, providing strong results for complex forecasting problems such as long-term prediction. Note that the proposal in this paper is generalizable to any other location where orographic fog occurs. Future lines of research will consist of improving the way in which rule information is utilized for prediction, by implementing ensembles of rules or rule boxes with a higher level of complexity, combining the individual rules into a fuzzy-based controller to refine the final results obtained, or testing functional regression methods as final outcomes for the system.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
This section presents the tables with the test error metrics for the three time horizons considered (Tables A1, A2, and A3, respectively), when using the induced/evolved rules as a classification technique.

Appendix B
This section presents the tables with the test error metrics for the three time horizons considered (Tables A4, A5, and A6, respectively), when using the induced/evolved rules as an additional predictive feature in the ML/DL classification techniques.