Feature and Language Selection in Temporal Symbolic Regression for Interpretable Air Quality Modelling

Air quality modelling that relates meteorological, car traffic, and pollution data is a fundamental problem, approached in several different ways in the recent literature. In particular, a set of such data sampled at a specific location and during a specific period of time can be seen as a multivariate time series, and modelling the values of the pollutant concentrations can be seen as a multivariate temporal regression problem. In this paper, we propose a new method for symbolic multivariate temporal regression, and we apply it to several data sets that contain real air quality data from the city of Wrocław (Poland). Our experiments show that our approach is superior to classical ones, especially symbolic ones, both in statistical performance and in the interpretability of the results.


Introduction
Anthropogenic environmental pollution is a known and indisputable issue. In everyday life, we are exposed to a variety of harmful substances, often absorbed by the lungs and the body through the air we breathe; among the most common pollutants, NO₂, NOₓ, and PM₁₀ are the most typical ones in average-sized and big cities. The potential negative effects of such exposure have been deeply studied and confirmed by several authors (see, among others, Refs. [1–6]). Air quality is regularly monitored, and in some cases, alert systems inform residents about the forecasted concentration of air pollutants. Such systems may be based on machine learning technologies, effectively reducing the forecasting problem to an algorithmic one. Air quality data, along with the most well-known influencing factors, are usually monitored in a periodic way; the set of measurements in a given amount of time and at a given geographical point can then be regarded as a time series. In this sense, the problem to be solved is a regression problem and, more specifically, a multivariate temporal regression problem.
A multivariate temporal regression problem can be solved in several ways. Following the classic taxonomy in machine learning, regression can be functional or symbolic; functional regression is a set of techniques and algorithms that allow one to extract a mathematical function that describes a phenomenon, while symbolic regression is devoted to inferring a logical theory. Functional regression, which is far more popular and common, can be as simple as a linear regression or as complex as a neural network. On the other hand, typical symbolic approaches include decision trees, random forests, and rule-based regressors. Temporal regression generalizes regression by taking into account past values of the independent variables to predict the current value of the dependent one, and it has been successfully used in many contexts, including air quality prediction. Examples include autoregressive models [7,8], land use regression models [9–11], and optimized-lag models. Symbolic regression not only returns interpretable models that enable the user to know why a certain prediction has been made, but, at least in this case, the extracted models also show statistically better and more reliable results. In summary, we aim to solve the problem of air quality modelling by defining it as a temporal regression problem, and we benchmark our proposed methodology, based on temporal decision trees, against methods present in the literature that may or may not consider the temporal component explicitly; the symbolic nature of the proposal naturally allows interpreting the underlying temporal theory that describes the data by means of Halpern and Shoham's logic. In this way, we hope to amend some of the well-known problems of prediction methods, including the post hoc interpretability of the results.
The paper is organized as follows. In Section 2, we highlight the needed background on the functional and symbolic temporal regression problem, along with the feature selection process for regression tasks. In Section 3, we propose to solve the symbolic temporal regression problem by means of temporal decision trees. In Section 4, we formalize the feature and language selection learning process by means of multi-objective evolutionary optimization algorithms. In Section 5, we present the data used in our experiments and the experimental settings. The experiments are discussed in Section 6, before concluding.

Functional Temporal Regression
Regression analysis is a method that allows us to predict a numerical outcome variable based on the value of one (univariate regression) or multiple (multivariate regression) predictor variables. The most basic approach to multivariate regression is a linear regression algorithm, typically based on a least squares method. Linear regression assumes that the underlying phenomenon can be approximated with a straight line (or a hyperplane, in the multivariate case). However, in the general case, a functional regression algorithm searches for a generic function to approximate the values of the dependent variable. Assume that A is a data set with n independent variables A_1, ..., A_n and one observed variable B, where Dom(A) (resp., Dom(B)) is the set in which an independent variable (or attribute) A (resp., the dependent variable B) takes a value, and dom(A) (resp., dom(B)) is the set of the actual values of A (resp., B):

A = {(a_{i,1}, ..., a_{i,n}, b_i) | 1 ≤ i ≤ m}. (1)

Then, solving a functional regression problem consists of finding a function F so that the equation:

B = F(A_1, ..., A_n) (2)

is satisfied. When we are dealing with a multivariate time series, composed of n independent and one dependent time series, then data are temporally ordered and associated with a timestamp:

A = {(t_i, a_{i,1}, ..., a_{i,n}, b_i) | 1 ≤ i ≤ m}, (3)

and solving a temporal functional regression problem consists of finding a function F so that the equation:

B(t) = F(A_1(t), ..., A_n(t)) (4)

is satisfied for every t. Temporal regression is different from the non-temporal one when, in identifying the function F, one takes into account the past values of the independent variables as well. Having fixed a maximum lag l, the equation becomes:

B(t) = F(A_1(t), ..., A_n(t), A_1(t − 1), ..., A_n(t − 1), ..., A_1(t − l + 1), ..., A_n(t − l + 1)). (5)

The literature on functional regression is very wide. Methods range from linear regression, to polynomial regression, to generic non-linear regression, and include variants of the least squares method(s), such as robust regression [21,22].
Autoregressive models, typically of the ARIMAX [23] family, are methods that implicitly include the use of past values of the independent variables and, in the most general case, of the dependent one as well (therefore modifying Equation (5) to also include B(t − 1), B(t − 2), ..., B(t − l + 1); as a matter of fact, the simplest autoregressive models are based on the past values of the dependent variable only).
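The autoregressive idea can be illustrated with a short sketch (our own Python code, not the actual ARIMAX machinery; the function names are illustrative), which fits a purely autoregressive model of the dependent variable by ordinary least squares:

```python
import numpy as np

def fit_ar(b, l):
    """Fit a toy autoregressive model b(t) ~ c_1*b(t-1) + ... + c_l*b(t-l)
    by ordinary least squares."""
    X = np.array([b[t - l:t][::-1] for t in range(l, len(b))])  # most recent lag first
    y = b[l:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_ar(b, coef):
    """One-step-ahead predictions from the fitted lag coefficients."""
    l = len(coef)
    return np.array([coef @ b[t - l:t][::-1] for t in range(l, len(b))])

# A synthetic series satisfying an exact order-2 recurrence is recovered perfectly.
b = np.sin(0.3 * np.arange(200))
pred = predict_ar(b, fit_ar(b, l=3))
```

Full ARIMAX models additionally include moving-average terms and exogenous (independent) variables, but the lag structure is the same.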
The machine learning counterpart to temporal functional regression and, in fact, to temporal regression as a whole, consists of using a non-temporal regression algorithm fed with new variables, that is, lagged variables, which correspond to the past values of the variables of the problem. In other words, the typical strategy consists of producing a lagged data set from the original one:

A_lagged = {(t_i, a_{i,1}, a_{i−1,1}, ..., a_{i−l+1,1}, ..., a_{i,n}, a_{i−1,n}, ..., a_{i−l+1,n}, b_i) | l ≤ i ≤ m}. (6)

Such a strategy has the advantage of being applicable to every regression algorithm, from the classic functional regression algorithms to symbolic algorithms for regression. Linear regression is undoubtedly the most popular regression strategy, implemented in nearly every learning suite; in the case of WEKA [20], the class is called LinearRegression, and it can be used with lagged and non-lagged data.
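The construction of such a lagged data set can be sketched as follows (illustrative Python, not WEKA's internal representation):

```python
import numpy as np

def make_lagged(A, b, l):
    """From an (m x n) multivariate series A with target b, build the lagged
    data set: each instance holds the current and the l-1 previous values of
    every variable, labelled with the current value of the target."""
    m, n = A.shape
    X = np.array([A[t - l + 1:t + 1][::-1].T.ravel() for t in range(l - 1, m)])
    y = b[l - 1:]
    return X, y  # m - l + 1 instances, n * l lagged columns

A = np.arange(16, dtype=float).reshape(8, 2)   # n = 2 variables, m = 8 points
b = np.arange(8, dtype=float)
X, y = make_lagged(A, b, l=3)
```

Each row of X lists, per variable, the values at lags 0, 1, ..., l − 1; any off-the-shelf regressor (LinearRegression, RepTree, etc.) can then be applied to (X, y).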

Symbolic Temporal Regression
Classification and regression trees (CART) is a term introduced in [14] to refer to decision tree algorithms that can be used for both classification and regression. A regression tree is a symbolic construct that resembles a decision tree (usually employed for classification), based on the concept of data splitting and on the following language of propositional letters (decisions):

S = {A ⋈ a | A is an attribute and a ∈ dom(A)}, (7)

where ⋈ ∈ {≤, =} and dom(A) is the domain of the attribute A. A regression tree τ is obtained by the following grammar:

τ ::= b̂ | (S ∧ τ) ∨ (¬S ∧ τ), (8)

where S ∈ S and b̂ ∈ Dom(B) (however, b̂ is not necessarily in dom(B)). Solving a regression problem with a regression tree entails finding a tree that induces a function F defined by cases:

F(A_1, ..., A_n) = b̂_j if condition_j holds, for 1 ≤ j ≤ k, (9)

where each condition_j is a propositional logical formula written in the language of S; intuitively, such a function can be read as if the value of these attributes is ..., then the value of the dependent variable is, on average, this one, ..., and so on. In other words, F is a staircase function. The main distinguishing characteristic of a staircase function obtained by a (classic) regression tree is that the conditions are not independent of each other, but have parts in common, as they are extracted from a tree. Therefore, for example, one may have a first condition of the type if A_1 ≤ 5 and A_2 ≤ 3, then B = 1, and a second condition of the type if A_1 ≤ 5 and A_2 > 3, then B = 3. While functional regression is mainly based on the least squares method, the gold standard regression method with trees is splitting by variance, which consists of successively splitting the data set, searching for smaller subsets with lower variance in the observed values of the dependent variable; once the variance in the data set associated with a node is small enough, that node is converted into a leaf, and the value of the dependent variable is approximated with the average value of the data set associated with it. Such an average value labels the leaf.
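Splitting by variance can be sketched for a single attribute as follows (our illustrative code; a full learner applies this recursively over all attributes and adds a stopping criterion):

```python
import numpy as np

def best_split(x, y):
    """Find the threshold a for the decision 'x <= a' that minimizes the
    weighted variance of the dependent variable y in the two induced subsets."""
    best_a, best_score = None, np.inf
    for a in np.unique(x)[:-1]:            # candidate thresholds
        left, right = y[x <= a], y[x > a]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_a, best_score = a, score
    return best_a, best_score
```

When the variance of a node's subset is small enough, the node becomes a leaf labelled with the subset's mean.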
Regression trees are not as common as decision trees in the literature; they are usually employed in ensemble methods such as random forests. However, popular learning suites do include simple implementations of regression trees. In the suite WEKA, the go-to implementation in this case is called RepTree. Despite its name, such an implementation is a variant of the more popular J48, which is, in fact, its counterpart for classification. Regression trees can be used on both atemporal and temporal data by using, as in the functional case, lagged variables.

Feature Selection for Regression
Feature selection (FS) is a data preprocessing technique that consists of eliminating, from the data set, features that are irrelevant to the task to be performed [24]. Feature selection facilitates data understanding, reduces the storage requirements, and lowers the processing time, so that model learning becomes an easier process. Univariate feature selection methods are those that do not incorporate dependencies between attributes: they consist of applying some criterion to each feature–response pair, measuring the individual predictive power of a given feature with respect to the response independently of the other features, so that each feature can be ranked accordingly. In multivariate methods, on the other hand, the assessment is performed for subsets of features rather than single features. From the evaluation strategy point of view, FS can be implemented as single-attribute evaluation (in both the univariate and the multivariate case) or as subset evaluation (only in the multivariate case). Feature selection algorithms are also categorized into filter, wrapper, and embedded models. Filters are algorithms that select features using an evaluation measure that ranks their ability to differentiate classes, without making use of any machine learning algorithm. Wrapper methods select variables driven by the performance of an associated learning algorithm. Finally, embedded models perform the two operations (selecting variables and building a classifier) at the same time. There are several different approaches to feature selection in the literature; among them, evolutionary algorithms are very popular. The use of evolutionary algorithms for the selection of features in the design of automatic pattern classifiers was introduced in [25].
Since then, genetic algorithms have come to be considered a powerful tool for feature selection [26] and have been proposed by numerous authors as a search strategy in filter, wrapper, and embedded models [27–29], as well as in feature weighting and subset selection algorithms [30]. A review of evolutionary techniques for feature selection can be found in [31], and a very recent survey of multi-objective algorithms for data mining in general can be found in [32]. Wrapper methods for feature selection are more common in the literature; often, they are implemented by defining the selection as a search problem and solving it using metaheuristics such as evolutionary computation (see, e.g., Refs. [26,30,33]). The first evolutionary approach involving multi-objective optimization for feature selection was proposed in [34]. A formulation of feature selection as a multi-objective optimization problem was presented in [35]. In [36], a wrapper approach was proposed that takes into account the misclassification rate of the classifier, the difference in the error rate among classes, and the size of the subset, using a multi-objective evolutionary algorithm. The wrapper approach proposed in [37] minimizes both the error rate and the size of a decision tree. Another wrapper method was proposed in [38] to maximize the cross-validation accuracy on the training set, maximize the classification accuracy on the testing set, and minimize the cardinality of feature subsets using support vector machines, applied to protein fold recognition.
A multi-objective optimization problem [39] can be formally defined as the optimization problem of simultaneously minimizing (or maximizing) a set of z arbitrary functions:

min F_1(Ū), min F_2(Ū), ..., min F_z(Ū), (10)

where Ū is a vector of decision variables. A multi-objective optimization problem can be continuous, in which we look for real values, or combinatorial, in which we look for objects from a countably (in)finite set, typically integers, permutations, or graphs. Maximization and minimization problems can be reduced to each other, so that it is sufficient to consider one type only. A set F of solutions for a multi-objective problem is non-dominated (or Pareto optimal) if and only if for each Ū ∈ F, there exists no V̄ ∈ F such that (i) F_j(V̄) < F_j(Ū) for some objective j, and (ii) F_i(V̄) ≤ F_i(Ū) for every objective i. In other words, a solution Ū dominates a solution V̄ if and only if Ū is better than V̄ in at least one objective and not worse than V̄ in the remaining objectives. We say that Ū is non-dominated if and only if there is no other solution that dominates it. The set of non-dominated solutions from F is called the Pareto front. Optimization problems can be approached in several ways; among them, multi-objective evolutionary algorithms are a popular choice (see, e.g., Refs. [31,32,35]). Feature selection can be seen as a multi-objective optimization problem, in which the solution encodes the selected features and the objective(s) are designed to evaluate the performances of some model-extraction algorithm; this may entail, for example, instantiating (10) as:

max Performance(Ū), min Cardinality(Ū), (11)

where Ū represents the chosen features; (11) can be seen as a type of wrapper. When the underlying problem is a regression problem, then (11) is a formulation of the feature selection problem for regression.
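Pareto dominance and the Pareto front, for objectives to be minimized, can be stated compactly in code (an illustrative sketch):

```python
def dominates(u, v):
    """u dominates v iff u is no worse than v in every objective and strictly
    better in at least one (all objectives to be minimized)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_front(solutions):
    """The non-dominated subset of a list of objective vectors."""
    return [u for u in solutions if not any(dominates(v, u) for v in solutions)]
```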

Symbolic Temporal Regression
Let A be a multivariate time series with n independent variables, each with m distinct points (from 1 to m), and no missing values; Figure 1 (top) is an example with n = 2 and m = 8. Any such time series can be interpreted as a temporal data set on its own, in the form of (3). In our example, this corresponds to interpreting the data as in Figure 1 (middle, left). As explained in the previous section, the regression problem for B can be solved in a static way. Moreover, by suitably pre-processing A as in (6), the problem can be seen as a temporal regression problem; in our example, this corresponds to interpreting the data as in Figure 1 (middle, right). The algorithm Temporal C4.5 and its implementation Temporal J48 [16,17] are symbolic (classification and) regression tree learners that can be considered an alternative to the classic solutions: their models are interpretable, as they are based on decision trees; they use lags (but not lagged variables); and they are natively temporal. Briefly, Temporal C4.5 is the natural theoretical extension of the algorithm C4.5, developed by Quinlan in the 1990s, to the temporal case, that is, to instances that are more than propositional, such as multivariate time series, and Temporal J48 is WEKA's extension of J48 to the temporal case; observe that such a distinction must be made since the implementation details may differ between public libraries, but the theory, in general, is the same.
Our approach using Temporal J48 for regression is based on two steps: (i) a filter applied to the original data A and (ii) a regression tree extraction from the filtered data, similar to the classic decision tree extraction problem. The first step consists of extracting from A a new data set in which each instance is, in itself, a multivariate time series. Having fixed a maximum lag l, the i-th new instance (i ≥ 1) is the chunk of the multivariate time series A that contains, for each variable A_1, ..., A_n, the values at times from i to i + l − 1, for 1 ≤ i ≤ m − l + 1 (i.e., an l-point multivariate time series). Each such short time series is labelled with the (i + l − 1)-th value of the dependent variable B. In this way, we have created a new data set with m − l + 1 instances, each of which is a time series. In our example, this is represented as in Figure 1 (bottom), where l = 3. The second step consists of building a regression tree whose syntax is based on a set of decisions that generalizes the propositional decisions of the standard regression tree. Observe that time series describe continuous processes; when they are discretized, it makes little sense to model the behaviour of such complex objects at each single point. Thus, the natural way to represent time series is an interval-based ontology, and the novelty of the proposed methodology is to make decisions over intervals of time. The relationships between intervals in a linear understanding of time are well known; they are called Allen's relations [19], and despite a somewhat cumbersome notation, they capture natural language in a very intuitive way. In particular, Halpern and Shoham's modal logic of Allen's relations (known as HS [18]) is the time interval generalization of propositional logic and encompasses Allen's relations in its language (see Table 1). Being a modal logic, its formulas can be propositional or modal, the latter being, in turn, existential or universal.
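The first step, i.e., the filter that turns the original series into a data set of l-point multivariate time series, can be sketched as follows (illustrative code of ours, not the actual Temporal J48 filter):

```python
import numpy as np

def to_series_instances(A, b, l):
    """Chunk an (m x n) multivariate series into m - l + 1 overlapping
    l-point instances; the i-th instance covers times i..i+l-1 and is
    labelled with the (i+l-1)-th value of the dependent variable."""
    m, _ = A.shape
    chunks = [A[i:i + l] for i in range(m - l + 1)]
    labels = b[l - 1:]
    return chunks, labels

A = np.arange(16, dtype=float).reshape(8, 2)   # as in the example: n = 2, m = 8
chunks, labels = to_series_instances(A, np.arange(8.0), l=3)
```

With l = 3 and m = 8, this yields the six labelled 3-point instances of Figure 1 (bottom).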
Let AP be a set of propositional letters (or atomic propositions). Formulas of HS can be obtained by the following grammar:

φ ::= p | ¬φ | φ ∧ φ | ⟨X⟩φ | [X]φ, (12)

where p ∈ AP, ⟨X⟩ is any of the modalities corresponding to Allen's relations, and [X] denotes its universal version (e.g., ¬⟨A⟩φ ≡ [A]¬φ). On top of Allen's relations, the operator ⟨=⟩ is added to model decisions that are taken on the same interval. For each X ∈ {A, L, B, E, D, O}, the modality ⟨X̄⟩, corresponding to the inverse relation R_X̄ of R_X, is said to be the transpose of the modality ⟨X⟩, and vice versa. Intuitively, the formulas of HS can express properties of a time series such as if there exists an interval in which A_1 is high, during which A_2 is low, then ..., as an example of using existential operators, or if during a certain interval, A_1 is always low, then ..., as an example of using universal ones. Formally, HS formulas are interpreted on time series. We define a time series as a pair T = ⟨D, V⟩, where D = {1, ..., l} is a linearly ordered domain of points and V is a valuation function, which assigns to each proposition p ∈ AP the set of intervals V(p) on which p holds. Note that we deliberately use l for the domain of T, which is also the maximum fixed lag. The truth of a formula φ on a given interval [x, y] in a time series T is defined by structural induction on formulas as follows:

T, [x, y] ⊨ p if and only if [x, y] ∈ V(p);
T, [x, y] ⊨ ¬ψ if and only if T, [x, y] ⊭ ψ;
T, [x, y] ⊨ ψ_1 ∧ ψ_2 if and only if T, [x, y] ⊨ ψ_1 and T, [x, y] ⊨ ψ_2;
T, [x, y] ⊨ ⟨X⟩ψ if and only if there exists [w, z] such that [x, y] R_X [w, z] and T, [w, z] ⊨ ψ,

where X ∈ {A, L, B, E, D, O} (and symmetrically for the transposed modalities). It is important to point out, however, that we use logic as a tool; through it, we describe the time series that predict a certain value, so that the expert is able to understand the underlying phenomenon. The semantics of the relations R_X eases such an interpretation: R_A (meets) accesses an interval that meets the current one; R_L (later than), an interval that is later than the current one; R_E (ends), an interval that ends the current one; R_B (starts), an interval that starts the current one; R_D (during), an interval that is during the current one; and R_O (overlaps), an interval that overlaps the current one.
Thus, a formula of the type p ∧ ⟨A⟩q is interpreted as p holds now (in the current interval), and there is an interval that starts when the current one ends in which q holds.
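The six direct relations can be made concrete with a small sketch that names the relation holding from the current interval to another one (our code; the direction conventions follow the informal descriptions above, and the six inverse relations and equality are omitted):

```python
def allen(i, j):
    """Name of the HS relation R_X holding from interval i = [x, y] to
    j = [w, z] (with x < y and w < z), among the six direct ones."""
    (x, y), (w, z) = i, j
    if y == w:
        return "A"                      # meets: j starts when i ends
    if y < w:
        return "L"                      # later than i
    if x == w and z < y:
        return "B"                      # j starts i
    if z == y and x < w:
        return "E"                      # j ends i
    if x < w and z < y:
        return "D"                      # j is during i
    if x < w < y < z:
        return "O"                      # j overlaps i
    return None                         # equality or an inverse relation
```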
From the syntax, we can easily generalize the concept of decision and define a set of temporal and atemporal decisions S = S_HS ∪ S_=, where:

S_HS = {⟨X⟩(A ⋈ a)^γ | A is an attribute and a ∈ dom(A)},
S_= = {⟨=⟩(A ⋈ a)^γ | A is an attribute and a ∈ dom(A)}, (13)

where ⋈ ∈ {≤, =, ≥, >}, γ ∈ (0.0, 1.0], and ⟨X⟩ is an interval operator of the language of HS. The value γ allows us a certain degree of uncertainty: we interpret the decision A ⋈ a on an interval [x, y] with a given value γ as true if and only if the ratio of points between x and y satisfying A ⋈ a is at least γ. A temporal regression tree is obtained by the following grammar:

τ ::= b̂ | (S ∧ τ) ∨ (¬S ∧ τ), (14)

where S is a (temporal or atemporal) decision and b̂ ∈ Dom(B), in full analogy with non-temporal trees. The idea that drives the extraction of a regression tree is the same in the propositional and the temporal case, and it is based on the concept of splitting by variance. The result is a staircase function, with the additional characteristic that each leaf of the tree, which represents such a function, can be read as a formula of HS. Therefore, if a propositional tree for regression used on lagged data gives rise to rules of the type if A_1 < 3 two units before now and A_2 > 5 one unit before now, then, on average, B = 3.2, Temporal J48 gives rise to rules of the type if mostly A_1 < 3 during an interval before now and mostly A_2 > 5 in an interval during it, then, on average, B = 3.2. It should be clear, then, that Temporal J48 presents a superior expressive power that allows one to capture complex behaviours. It is natural, then, to compare the statistical behaviour of regression trees over lagged data with that of Temporal J48 using the same temporal window.
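The γ-relaxed interpretation of a decision can be sketched as follows (illustrative code; we assume, for simplicity, that the interval [x, y] covers the points x, ..., y of a single variable's series):

```python
import numpy as np

def holds(series, x, y, a, gamma, op=np.less_equal):
    """Relaxed decision 'A <= a' on the interval [x, y]: true iff at least
    a fraction gamma of the points in the interval satisfy the comparison."""
    window = series[x:y + 1]
    return np.mean(op(window, a)) >= gamma
```

With γ = 0.7, the decision mostly A ≤ a holds as soon as 70% of the interval's points satisfy A ≤ a.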

[Table 1. The modalities of HS, their definitions with respect to the interval structure, and examples.]
A temporal regression tree such as the one extracted by Temporal J48 is learned from a temporal data set following the greedy approach of splitting by variance, as in the propositional case. Since the greedy approach is sub-optimal, a worse local choice may, in general, lead to better global results. This is the idea behind feature selection: different subsets of attributes lead to different local choices, in search of global optima. In the case of temporal regression trees, however, the actual set of interval relations that are used for splitting behaves in a similar way: given a subset of all possible relations, a greedy algorithm for temporal regression tree extraction may make worse local choices that lead to better global results. Therefore, we can define a generalization of (11):

max Performance(Ū, V̄), min Cardinality(Ū), (15)

in which Ū represents a selection of features and V̄ represents a selection of interval relations to be used during the extraction. This is a multi-objective optimization problem that generalizes the feature selection problem, and we call it the feature and language selection problem. Observe that there is, in general, an interaction between the two choices: different subsets of features may require different subsets of relations for a regression tree to perform well. The number of interval relations that are actually chosen, however, does not affect the interpretability of the result, and it is therefore not minimized as a further objective.

Multi-Objective Evolutionary Optimization
In the previous section, we defined the feature and language selection problem as an optimization problem. We chose to approach such a problem via an evolutionary algorithm and, in particular, using the well-known algorithm NSGA-II [40], which is available as open source from the suite jMetal [41]. NSGA-II is an elitist Pareto-based multi-objective evolutionary algorithm that employs a strategy with binary tournament selection and a rank-crowding better function, where the rank of an individual in a population is the non-domination level of the individual in the whole population. As the regression algorithm, we used the class TemporalJ48, integrated in the open-source learning suite WEKA, run in full training mode, with the following parameters: l = 10 and γ = 0.7. We used a fixed-length representation, where each individual solution consists of a bit set. In simple feature selection, each individual is of the type:

Ū = (U_1, U_2, ..., U_n), (16)

where, for each 1 ≤ t ≤ n, U_t = 1 (resp., U_t = 0) is interpreted as the t-th attribute being selected (resp., discarded), while in feature and language selection, it becomes of the type:

Ū·V̄ = (U_1, ..., U_n, V_1, ..., V_13), (17)

where, for each 1 ≤ t ≤ 13, V_t = 1 (resp., V_t = 0) is interpreted as the t-th interval relation being selected (resp., discarded). The structure of the second part depends on the fact that there are 13 Allen's relations (including equality) between any two intervals, as we recalled above; there is no natural ordering of interval relations, and we can simply assume that a total ordering has been fixed.
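Decoding such a fixed-length individual is straightforward (illustrative sketch; the constant 13 reflects Allen's twelve relations plus equality):

```python
N_RELATIONS = 13   # Allen's twelve relations plus equality

def decode(bits, n):
    """Split an individual of length n + 13 into the indices of the selected
    features and of the selected interval relations (cf. Equation (16))."""
    assert len(bits) == n + N_RELATIONS
    features = [t for t in range(n) if bits[t] == 1]
    relations = [t for t in range(N_RELATIONS) if bits[n + t] == 1]
    return features, relations
```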
In terms of objectives, minimizing the cardinality of the individuals is straightforward, and we do so by using the function Card(Ū) defined as:

Card(Ū) = Σ_{t=1}^{n} U_t. (18)

As for optimizing the performance of the learning algorithm, we define:

Corr(Ū, V̄) = −ρ(B, F_{Ū,V̄}), (19)

where ρ() measures the correlation between the stochastic variable obtained by the observations and the staircase function obtained by Temporal J48 using only the features selected by Ū and the interval relations selected by V̄. The correlation varies between −1 (perfect negative correlation) and 1 (perfect positive correlation), with 0 representing no correlation at all. Defined in this way, Corr ought to be minimized.
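A minimal sketch of the two objectives (our code; negating the correlation is one natural way to obtain a quantity that ought to be minimized):

```python
import numpy as np

def card(u):
    """First objective: the number of selected features (to be minimized)."""
    return int(np.sum(u))

def corr_objective(observed, predicted):
    """Second objective: the negated Pearson correlation between the
    observations and the model's predictions; minimizing it maximizes
    the correlation."""
    return -np.corrcoef(observed, predicted)[0, 1]
```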

Data and Experiments
Our purpose in this paper is to solve a temporal regression problem for air quality modelling and prediction. We consider an air quality database that contains measurements of several parameters in the city of Wrocław (Poland); in particular, we consider data from a communication station located within a wide street with two lanes in each direction (GPS coordinates: 51.086390 North, 17.012076 East; see Figure 2). One of the largest intersections in Wrocław is located approximately 30 meters from the measuring station and is covered by traffic monitoring cameras. A weather measurement station is located on the outskirts of the city, at 9.6 km from the airport, and our data set is structured so that all such data are combined in an attempt to predict pollution concentrations. Pollution data are collected by the Provincial Environment Protection Inspectorate and encompass the hourly NO₂ concentration values during three years, from 2015 to 2017. The traffic data are provided by the Traffic Public Transport Management Department of the Roads and City Maintenance Board in Wrocław and include the hourly count of all types of vehicles passing the intersection. Public meteorological data are provided by the Institute of Meteorology and Water Management, and they include: air temperature, solar duration, wind speed, relative humidity, and air pressure. In order to make the data uniform, solar duration values were normalized to the real interval [0, 1]. In the pre-processing phase, the instances with at least one missing value (617 samples, 2.3%) were deleted. Some basic statistical indicators on the remaining 25,687 instances are presented in Table 2. We considered, in particular, the set A that contains the transport, meteorological, and pollution data from the year 2017. From it, we extracted the sets A_month, where month ranges from Jan to Dec, each containing the hourly data of the first 10 days of the month. Therefore, each A_month contains exactly 240 instances.
For each month, then, we designed a regression experiment using: (i) classic, non-temporal linear regression (using the class LinearRegression); (ii) classic, non-temporal decision tree regression (using the class RepTree); (iii) lagged linear regression on the lagged version of A_month, with l = 10; (iv) lagged propositional decision tree regression on the lagged version of A_month, with l = 10; and (v) feature and language selection for temporal decision tree regression on the transformed version of A_month, with l = 10 and γ = 0.7. We tested the prediction capabilities of each of the extracted models on the corresponding set A_month. In the case of temporal regression, each experiment returned a set of models, more precisely a Pareto set; from it, we selected the model with the best correlation. All experiments were executed in 10-fold cross-validation mode, which guarantees the reliability of the results. Observe how different experiments correspond, in fact, to different preprocessings of the data. In (i) and (ii), a given A_month contains 240 instances, each corresponding to a specific hour sample, and six (+1) columns, each corresponding to an independent variable (plus the dependent one). In (iii) and (iv), a given A_month contains 60 (+1) columns, each being an independent variable or one of its lagged versions, with lags from one to 10 h; therefore, the number of instances is actually 231 (= m − l + 1), because the first sample for which the dependent value can be computed is the one at Hour 10. Finally, in (v), a given A_month contains 231 multivariate time series, each with 10 values of each of the independent variables, temporally ordered, and labelled with the value of the dependent one, starting, again, from the sample at Hour 10.

Results and Discussion
All results can be seen in Tables 3–6, in which we report, for each experiment, not only the correlation coefficient (cc) between the ground truth b ∈ dom(B) and the predicted value b̂ ∈ Dom(B) [20,42,43], but also the mean absolute error (mae), the root mean squared error (rmse), and the relative absolute error (rae). The first group of results concerns non-lagged data and standard approaches. As we can see, the correlation coefficient ranges from 0.59 to 0.83, with an average of 0.71, in the linear regression models, and from 0.60 to 0.72, with an average of 0.72, in the decision tree models. The fact that the latter show a slightly better behaviour than the former may indicate that the underlying process is not (strongly) linear and that a stepwise function may approximate this reality in a better way. The fact that the average correlation is not too high in both cases and that, in both cases, there is at least one month in which it is particularly low may indicate that non-lagged data probably do not capture the underlying phenomenon in its full complexity.
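The four indicators can be computed as follows (an illustrative sketch following the usual WEKA definitions; in particular, rae is the total absolute error relative to that of always predicting the mean of the observations):

```python
import numpy as np

def regression_metrics(b, b_hat):
    """Correlation coefficient (cc), mean absolute error (mae), root mean
    squared error (rmse), and relative absolute error (rae)."""
    cc = np.corrcoef(b, b_hat)[0, 1]
    mae = np.mean(np.abs(b - b_hat))
    rmse = np.sqrt(np.mean((b - b_hat) ** 2))
    rae = np.sum(np.abs(b - b_hat)) / np.sum(np.abs(b - b.mean()))
    return cc, mae, rmse, rae
```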
As far as lagged data are concerned, in linear regression models the correlation coefficients range from 0.71 to 0.84, with an average of 0.78, while in decision tree models they range from 0.65 to 0.87, with an average of 0.76, as presented in Table 4. As we can see, the situation is reversed, with the linear models being more precise than the decision tree ones. A possible explanation is that, while lagged data, in general, offer more information about the underlying process, reasoning with more variables (i.e., 60 vs. six) allows finding very complex regression hyperplanes, which adapt to the data in a natural way; unfortunately, this is a recipe for non-interpretability: having such a complex regression function, with different coefficients for the same independent variable at different lags, makes it very difficult for the expert to create an explanatory physical theory. To give one example, we consider the linear model extracted from A_Jan and, in particular, the coefficients of each variable, as shown in Table 5. As can be observed, the alleged influence of every variable seems to have some erratic behaviour, with coefficients of different signs and absolute values at different lags. It could be argued that such a matrix of values is, in some sense, no different from a weight matrix of a neural network.
Finally, in Table 6, we can see the results of Temporal J48, in which case the correlation coefficient ranges from 0.78 to 0.87, with an average of 0.83. As can be noticed, in exchange for a higher computational complexity, this method returns clearly better results. This is to be expected, as, by its nature, it combines the benefits of the lagged variables with those of symbolic regression. One can observe an improvement not only in the average, but also in the stability across the twelve months: in the worst case, the correlation coefficient is 0.78, which is to be compared, for example, with the worst case of simple linear regression (0.59). Moreover, it seems that Temporal J48 behaves in a particularly good way on difficult cases: in the case of A jul , for example, we have a correlation coefficient of 0.67 with non-lagged data and decision trees, 0.65 with lagged data and decision trees, and 0.78 with serialized data. In addition to the statistical performances of these models, the following aspects should be noticed. First, these models were extracted in a feature selection context; however, in all cases, the evolutionary algorithm found that all variables had some degree of importance, and no variable was eliminated. Second, the selected language(s) allows one to draw some considerations on the nature of the problem; for example, the fact that, in all cases, the relation during or its inverse (i.e., D or D ) was selected indicates that the past interactions between the variables are a key element for modelling this particular phenomenon. Because Temporal J48, in this experiment, was run without pruning, the resulting trees cannot be easily displayed because of their dimensions.
Nevertheless, thanks to its intrinsic interpretability, meta-rules can be easily extracted from a regression tree, as, for example:

    If Rel.humidity is high while traffic is high, then NO2 tends to be high.
    If Sol.duration is high while traffic is very low, then NO2 tends to be low.
    . . .                                                                    (20)

which can contribute to designing a real-world theory of the modelled phenomenon. The language selection part performed by the optimizer, in general, reduces the set of temporal operators of HS used when extracting the rules (see Table 6), and this is desirable considering that, among many others, one desideratum for interpretability is to explain the reasoning in a way understandable to humans, who have a strong and specific bias towards simpler descriptions [44].

Conclusions
In this paper, we consider an air quality modelling problem as an example of the application of a novel symbolic multivariate temporal regression technique. Multivariate temporal regression is the task of constructing a function that explains the behaviour of a dependent variable over time, using current and past values of a set of independent ones; air quality modelling and, in particular, modelling the values of a pollutant as a function of meteorological and car traffic variables can be seen as a multivariate temporal regression problem. Such problems are classically approached with a number of techniques, which range from simple linear regression to recurrent neural networks; despite their excellent statistical performances, in most cases, such models are unsatisfactory in terms of their interpretability and explainability. Classic symbolic regression is an alternative to functional models; unfortunately, symbolic regression has not been very popular, probably because its statistical performances tend not to be good enough for many problems. Temporal symbolic regression reveals itself as a promising compromise between the two strategies: while keeping a symbolic nature, it takes into account the temporal component of a problem in a native way. In this paper, we not only apply temporal symbolic regression to a real-world problem, but we also show that it can be embedded into a feature selection strategy enriched with a language selection one. The resulting approach shows an interesting potential, the statistical performances of the extracted models being superior to those of both atemporal and temporal classical approaches.

Data Availability Statement: Not applicable; the study does not report any data.