Deep Learning XAI for Bus Passenger Forecasting: A Use Case in Spain

Abstract: Time series forecasting of passenger demand is crucial for optimal planning of limited resources. For smart cities, passenger transport in urban areas is an increasingly important problem, because the construction of infrastructure is not the solution and the use of public transport should be encouraged. One of the most sophisticated techniques for time series forecasting is Long Short Term Memory (LSTM) neural networks. These deep learning models are very powerful for time series forecasting but are not interpretable by humans (black-box models). Our goal was to develop a predictive and linguistically interpretable model, useful for decision making using large volumes of data from different sources. Our case study was one of the most demanded bus lines of Madrid. We obtained an interpretable model from the LSTM neural network using a surrogate model and the 2-tuple fuzzy linguistic model, which improves the linguistic interpretability of the generated Explainable Artificial Intelligence (XAI) model without losing precision.


Introduction
Efficient passenger transport in urban areas is an increasingly important problem in our society. In the past few years, it has been shown empirically that constructing new infrastructure or expanding existing roads is not an adequate solution to this problem. Instead, the use of public transport should be encouraged to alleviate effects such as congestion, accidents and pollution [1].
Knowledge of passenger demand is crucial for decisions related to planning the supply of public transport services, as well as for the design of lines based on demand. Passenger demand depends on many factors: population, income, commercial and service establishments, while other factors depend solely on the transport offer, such as travel time, frequencies offered and comfort. To achieve maximum efficiency in the service, it is essential to know the passenger demand over time, since it is usually characterized by high seasonality with annual, weekly or daily peak periods.
Most public transport systems have frequent delays due to poor planning of the different routes, leading to crowded vehicles on certain routes and completely empty ones on others. This mismanagement of public transport also has other consequences, which affect not only the lives of its users but also those of people who do not use it.
This paper focuses on the bus passenger demand problem. Poor planning of bus routes can cause congestion on the streets, which makes it difficult for those using private vehicles to get around. It can also cause overcrowding at bus stops if not enough buses are assigned to the busiest routes. Preventing these and other public transport problems requires good passenger demand planning.
We live in a highly connected world, so a large amount of data is currently available. Good planning of bus passenger demand requires a detailed description and analysis of travel information that allows us to adopt the most appropriate technical or political measures. It is therefore necessary to obtain accurate and reliable information about trips, such as frequency, duration, type of transport and costs. Through the Internet of Things (IoT), data can be collected throughout the entire journey, including the passenger pick-up, the time and the stop. This real-time data collection, together with its subsequent analysis and study, provides the information necessary to implement and optimize the processes involved in demand planning, in addition to supporting strategic and efficient decision-making [2].
Time series forecasting of bus passenger demand is crucial for optimal planning of limited resources. This is usually a complex task in which nonlinear statistical relations between the predictors and the target variable must be identified. Long short-term memory (LSTM) networks are a type of recurrent neural network used in deep learning [3,4] that can be trained to successfully handle sequence dependency. In fact, LSTMs are often used to predict passenger demand.
A problem with these "black box" models is that it is difficult to explain in detail the internal operations beyond the function provided by each of the layers and the symmetries of the model. However, interpretability of the model is crucial for transparency, traceability, and auditability, especially in public resources allocation. Therefore, the interpretation of such algorithms using XAI techniques is crucial.
Our goal is to develop a predictive and linguistically interpretable model, useful for decision making using large volumes of data, and able to work with various sources including the calendar, that predicts bus passenger demand. Our case will be one of the most demanded bus lines of Madrid.
The original data source has been provided by Empresa Municipal de Transportes de Madrid, EMT (Municipal Transport Company of Madrid, Spain) via email request. It includes the passenger data history of its entire fleet of buses, including night buses from 1 January 2015 to 28 February 2017. These data were obtained from the bus machines that registered the entry of passengers. Every time a passenger inserts their ticket or subscription in the machine, it generates a new entry in the database and is registered as a new trip.
To achieve the goal, we have relied on the explainability of the algorithm developed using deep learning techniques [3,4]. In our project we will focus only on LSTM neural networks [5,6], used for the prediction of time series, since a comparison of models was made and this algorithm was the most predictive.
For this, we first proposed to improve the prediction of passenger demand through historical data and external sources, such as calendar, through the use of deep learning techniques using deep neural LSTM networks. The goal is to make urban transport more efficient, ensuring that the offer meets the requirements of the population. However, LSTM models are "black boxes" which are very difficult to interpret. In many practical cases, this prevents the application of the model as the decisions of the model cannot be audited [7].
Surrogate models, in contrast, can be highly interpretable from a business point of view. In the case we are dealing with in this paper, however, we have a large volume of data on a large number of means of transport, each of which will be explained by a different surrogate model. In this case, interpretability is reduced, as the regression trees usually have several end nodes with precise values of the dependent variable that must be interpreted relative to each specific means of transport. Thus, 2000 passengers may be an extreme figure for a particular bus line or a low number for another. To increase the interpretability of these variables, we propose the use of fuzzy linguistic variables, which were originally proposed by Zadeh [8]. This type of variable is based on fuzzy sets that represent linguistic concepts normally handled by human beings; for example, with reference to a means of transport, it is easy to interpret that the number of passengers is high or medium. In order to increase interpretability, we propose the use of the fuzzy 2-tuple linguistic model. The advantage of this model is that it allows such interpretability in addition to achieving greater precision in the representation and computation of these fuzzy concepts.
The rest of the article is structured as follows: Section 2 presents the state of the art and compares related works to our proposal; Section 3 presents the foundations of deep learning, fuzzy models and XAI on which our proposal is based; Section 4 presents the model for bus passenger demand forecasting and its interpretation, applied to data from the city of Madrid (Spain); Section 5 discusses the results obtained by this model; and finally, Section 6 presents conclusions and suggests future work.

Related Works
The problem raised in this paper can be considered within the field of smart public transport. This is one of the most important issues for so-called smart cities [9]. Intelligent transport focuses on controlling and managing transport networks and systems to improve safety and efficiency. In such problems, there are multitudes of devices generating huge amounts of data. Based on such data, a crucial issue for decision making in a smart city is the prediction related to traffic flow [10], travel times [11], congestion prevention, or passenger demand, among others.
Our proposal focuses on forecasting public bus passenger demand through deep learning techniques, and in this section, we will first study the background of this type of work in the literature. For this purpose, we will conduct a bibliometric study on this topic. Specifically, we will draw a bibliometric map. This type of map is part of science mapping, which aims to obtain the conceptual structure of a given area prior to its interpretation. We follow the methodology based on [12] and the SciMat tool in order to obtain such a map.
In the first stage, the data, i.e., the articles that compose the analysis, must be retrieved. Thus, bibliographic records were downloaded from the main Web of Science collection using the query: TS = (("forecast*" OR "prediction*") NEAR/6 ("passenger" or "traveler" or "traveller" or "voyager") AND ("machine learning" or "deep learning")) where the TS field is a search based on a given topic (title, abstract and keywords), NEAR/n is a proximity operator that requires both terms to appear close to each other, up to n-words, in the text. The query was run at the end of March 2022 and returned 120 documents which were manually reviewed.
In the next phase, pre-processing, we unified duplicate terms and fixed misspelled items. Thereafter, we carried out the remaining phases of the process: network extraction, normalization, mapping, analysis and visualization.
An advantage of the methodology followed is that it allows us to obtain a strategic diagram of the topics based on two measures: centrality, i.e., the degree of interaction of a network with other networks, and density, i.e., the internal strength of the network or of the keywords that describe the theme in any science mapping workflow. We show the strategic diagram obtained for our analysis in Figure 1, including the interpretation of the topics of each quadrant.
We can identify four main themes, whose thematic networks are shown in Figure 2:
• DEEP-LEARNING. Based on neural networks, these algorithms are widely used for the traffic flow prediction problem presented here.
• LSTM. This theme includes specific deep/machine learning algorithms which are important or motor for the problem at hand. Thus, LSTM models are the central and predominant topic, Convolutional Neural Networks (CNN) are in second place, and finally, Support Vector Machine (SVM) is much less prominent. The theme is related to time model prediction, which is very common in the problems we are dealing with.
• PASSENGER-FLOW-PREDICTION. This is a heterogeneous topic that includes the optimization problems and techniques studied, including genetic algorithms. It also includes other predictive techniques such as XGBoost. This theme includes other interesting terms that point to the use of real-time data and predictive applications on urban and public transportation.
• TRANSPORT. This is a very small and declining theme which includes the use of statistical analysis and generic machine learning for the problem posed.
Therefore, we consider in this work a model widely used in the related literature, the LSTM. On the other hand, the prediction of passengers involves various means of transport. Table 1 shows the work specifically related to buses. Our proposal is also included in this table.
As can be deduced from a review of the related literature, most of the machine learning algorithms used are so-called black box algorithms, i.e., they are able to predict but are not interpretable by humans. If we focus the analysis on bus passenger flow predictions, there are many works on this type of prediction, but very few authors have been concerned with the interpretability of the models, as can be seen in Table 1. The only works we found are limited to obtaining the importance of the variables involved in the model [19].
Therefore, interpreting the results of such algorithms using XAI techniques is appropriate, although these are still rather new techniques whose main field of action is very generic and not so much focused on concrete problems [33,34]. Certainly, some work is oriented towards the field of smart cities [35][36][37][38], but very few works focus specifically on public transport. Among these, we can find works oriented to the prediction and understanding of traffic flow [39] and the work in [40], which seeks to make recommendations on the transport fleet interpretable. To the best of our knowledge, there is no work specifically oriented to the problem of predictability and interpretability regarding public bus users, as the work presented here is.
There are several approaches that incorporate fuzzy logic-based techniques into the XAI problem [41], as will be shown in Section 3.3. The present work incorporates a fuzzy technique, called 2-tuple model that improves the linguistic interpretability of the generated XAI model without losing accuracy. We did not find in the literature the joint use of both approaches.

Methodology
The proposed model is based on three important models that will be explained in this section: the 2-tuple fuzzy linguistic model, the LSTM and surrogate trees and rules.

The 2-Tuple Fuzzy Linguistic Model
The 2-tuple model, proposed in [42], allows a more accurate representation of the fuzzy linguistic terms of a linguistic variable without losing linguistic interpretability. For this reason, it is widely used in a variety of areas, e.g., [43,44]. This model represents the information as a pair of values $(s_i, \alpha_i)$, where $s_i \in S$ and $\alpha_i \in [-0.5, 0.5)$.

Definition 1 ([42]). Let $S = \{s_0, \ldots, s_g\}$ be a set of linguistic terms, and $\beta \in [0, g]$ a value in the granularity interval of $S$. The symbolic translation of a linguistic term is a number valued in the interval $[-0.5, 0.5)$ which expresses the difference in information between a given value $\beta \in [0, g]$ obtained from a symbolic operation, and the index of the closest linguistic term $s_i$ in $S$.
This representation model defines a pair of functions to perform transformations between numerical values defined in the granularity interval and 2-tuple linguistic values to perform the computational processes on 2-tuple linguistic values.

Definition 2 ([42]). Let $S = \{s_0, \ldots, s_g\}$ be a set of linguistic terms, $\bar{S} = S \times [-0.5, 0.5)$ and $\beta \in [0, g]$ a value representing the result of a symbolic operation. Then the linguistic 2-tuple expressing information equivalent to $\beta$ is obtained using the following function:

$$\Delta : [0, g] \to \bar{S}, \qquad \Delta(\beta) = (s_i, \alpha), \quad \text{with } i = \mathrm{round}(\beta), \quad \alpha = \beta - i, \quad \alpha \in [-0.5, 0.5),$$

where round(·) is the usual rounding operator, $s_i$ is the label with index closest to $\beta$ and $\alpha$ is the value of the symbolic translation.
Thus, a value in the interval $[0, g]$ is always identified with a 2-tuple linguistic value in $\bar{S}$.

Definition 3. Let $S = \{s_0, \ldots, s_g\}$ be a set of linguistic terms and $(s_i, \alpha_i) \in \bar{S} = S \times [-0.5, 0.5)$. The numerical value in the granularity interval $[0, g]$ representing the linguistic 2-tuple value $(s_i, \alpha_i)$ is obtained using the function:

$$\Delta^{-1} : \bar{S} \to [0, g], \qquad \Delta^{-1}(s_i, \alpha_i) = i + \alpha_i.$$

Along with the representation model seen above, we can analyse the associated computational model, for which the following operators are defined:
2-tuple linguistic value comparison operator. Given two 2-tuple linguistic values $(s_k, \alpha_1)$ and $(s_l, \alpha_2)$ representing quantities of information:
• if $k < l$, then $(s_k, \alpha_1)$ is smaller than $(s_l, \alpha_2)$;
• if $k = l$, then: if $\alpha_1 = \alpha_2$, both 2-tuples represent the same information; if $\alpha_1 < \alpha_2$, then $(s_k, \alpha_1)$ is smaller than $(s_l, \alpha_2)$; if $\alpha_1 > \alpha_2$, then $(s_k, \alpha_1)$ is greater than $(s_l, \alpha_2)$.
What we have to predict in this article is the number of passengers. As will be seen in Section 4, in order to improve the linguistic interpretability of this variable in the Machine Learning (ML) models obtained, we will represent this variable with this 2-tuple model, with the definition shown in Figure 3. Thus, for example, if in a given context a model predicts a number of passengers (High, −0.1), we can interpret it as being well above Medium and well below Very High, and slightly below the High concept, namely by a symbolic translation of 0.1.
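The transformation functions and the comparison operator of the 2-tuple model can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in this work: the five labels in `LABELS` are an assumed term set (the actual set is the one defined in Figure 3), and the names `delta`, `delta_inv`, `compare` and `describe` are hypothetical.

```python
import math

# Assumed five-term set, for illustration only; the real set is defined in Figure 3.
LABELS = ["Very Low", "Low", "Medium", "High", "Very High"]
G = len(LABELS) - 1  # granularity: beta lies in [0, G]


def delta(beta):
    """Map beta in [0, G] to a 2-tuple (label index i, symbolic translation alpha)."""
    if not 0 <= beta <= G:
        raise ValueError("beta must lie in [0, g]")
    # floor(beta + 0.5) rounds halves up, so alpha stays in [-0.5, 0.5).
    # Python's built-in round() rounds halves to even and could yield alpha = 0.5.
    i = math.floor(beta + 0.5)
    return i, beta - i


def delta_inv(two_tuple):
    """Map a 2-tuple (i, alpha) back to its numerical value beta = i + alpha."""
    i, alpha = two_tuple
    return i + alpha


def compare(a, b):
    """Return -1, 0 or 1 if 2-tuple a is smaller than, equal to or greater than b."""
    (k_a, alpha_a), (k_b, alpha_b) = a, b
    if k_a != k_b:
        return -1 if k_a < k_b else 1
    if alpha_a == alpha_b:
        return 0
    return -1 if alpha_a < alpha_b else 1


def describe(two_tuple):
    """Render a 2-tuple with its linguistic label, e.g. (High, -0.10)."""
    i, alpha = two_tuple
    return f"({LABELS[i]}, {alpha:+.2f})"


print(describe(delta(2.9)))  # prints "(High, -0.10)"
```

With this assumed term set, the value 2.9 is rendered as (High, -0.10), which matches the reading given in the text: slightly below High, well above Medium.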

LSTM Model
LSTM neural networks [5,6] are a type of recurrent neural network in which each neuron receives feedback from the others and, in addition, has an internal state whose value is modified according to experience: each neuron learns to detect which specific patterns activate its particular memory, which ones reset it, and what information to memorise. In this way, a particular neuron can decide to keep the information about a pattern that it detected long ago, until another concrete pattern updates this information. The richness of dynamics that can be detected by this type of network is therefore enormous, and the architecture also mitigates the "vanishing gradient" effect that affects traditional recurrent neural networks (the impossibility of keeping in memory information beyond a certain time window).

Through an LSTM network we can classify and predict time series that present fast dynamics mixed with slow dynamics and whose time scales we do not know at first.

Surrogate Trees and Rules
The great progress that Artificial Intelligence has made in recent years has been largely due to so-called black box algorithms, i.e., algorithms that are not intuitively interpretable by humans. The XAI area has emerged in response to the paradox that there are many important decisions made by algorithms that we cannot understand [33]. There are two basic approaches in this area. On the one hand, there are global models that focus on explaining the decisions of black box models in a generic way. On the other hand, we have local approaches that try to explain particular decisions of the model (for a particular entity). The main XAI techniques are shown in Table 2. Since we want to understand transport forecasting as a whole (at bus line level) we will use a global model. We are dealing with a regression problem so we will use an interpretable model based on regression trees. For this purpose, we will use the following algorithm (based on [47]):

1. Select the dataset X used to train the black box model.
2. For the selected dataset X, get the predictions of the black box model.
3. Select a regression tree model.
4. Train the regression tree on the dataset X and its predictions.
5. Obtain the rules from the regression tree.
6. Measure how well the surrogate model replicates the predictions of the black box model.
7. Fuzzify the variable to be predicted using the fuzzy 2-tuple linguistic model.
8. Interpret the surrogate fuzzy linguistic model.
For step 6, we use the most typical measure, the adjusted $R^2$, which measures how well the regression tree explains the predictions of the black box model we are trying to explain.
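Steps 1 to 6 of the algorithm can be illustrated with a minimal, dependency-free Python sketch. It is not the implementation used in this work: `black_box` is a hypothetical stand-in for the trained LSTM, the surrogate is a depth-1 regression tree (a single split) rather than a full tree, and all names and numbers are illustrative assumptions.

```python
from statistics import mean


def black_box(day_of_week):
    """Hypothetical stand-in for the LSTM: scaled demand by day (0 = Monday)."""
    return 0.65 - 0.01 * day_of_week if day_of_week < 5 else 0.30


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: pick the threshold minimising squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # these thresholds keep both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        l_mean, r_mean = mean(left), mean(right)
        sse = sum((y - l_mean) ** 2 for y in left) + sum((y - r_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    return best[1:]  # (threshold, left leaf value, right leaf value)


def predict_stump(stump, x):
    threshold, l_mean, r_mean = stump
    return l_mean if x < threshold else r_mean


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 of y_pred against y_true for a model with n_features inputs."""
    n = len(y_true)
    y_bar = mean(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Steps 1-2: dataset X (four weeks of day-of-week indices) and black-box predictions.
X = [d % 7 for d in range(28)]
y_bb = [black_box(x) for x in X]

# Steps 3-4: train the surrogate on X and the black-box predictions (not the true labels).
stump = fit_stump(X, y_bb)

# Step 5: read the rules off the tree.
threshold, low_side, high_side = stump
rules = [
    f"if day_of_week < {threshold} then demand = {low_side:.2f}",
    f"if day_of_week >= {threshold} then demand = {high_side:.2f}",
]

# Step 6: fidelity of the surrogate to the black box.
fidelity = adjusted_r2(y_bb, [predict_stump(stump, x) for x in X], n_features=1)
print(rules, round(fidelity, 3))
```

The key design point is that the surrogate is fitted to the black-box predictions, so the adjusted $R^2$ measures fidelity to the black box rather than accuracy on the observed demand.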
Decision trees are surrogate models that are often used to interpret black box models. The final tree generated can be expressed as a set of rules that are easy to interpret. Therefore, several works focus on extracting these rules [48][49][50]. The fuzzy version of these rules also belongs to the class of interpretable models, because the rules use fuzzy sets (which can model the values of linguistic variables) that allow us to model fuzzy concepts in a way that is closer to human reasoning [51]. An example of a fuzzy rule is: if previous day's traveler demand is Low and today's weather is Cold then traveler demand is Medium. In this example, the variables "previous day's traveler demand", "today's weather" and "traveler demand" are linguistic variables whose values are Low, Cold and Medium, respectively, which could be represented by fuzzy sets. Most fuzzy rule applications are used for classification, i.e., problems with a categorical target variable [52,53].
In our case, the target variable is continuous, i.e., we are dealing with a case of regression. For a better understanding of the rules, we represent the target variable with a linguistic variable that has as its basis a set of linguistic values (e.g., Low, Medium and High).
In the presented algorithm, in step 7, we obtain the most appropriate fuzzy linguistic value of the target variable. The value of this variable is crisp and the process of fuzzification by converting it into a label of the linguistic variable may cause a loss of information. Using the 2-tuple model, this representation of a precise value as a fuzzy value can be made without loss of information thanks to the use of two values: the linguistic label itself and the symbolic translation. Although fuzzy rules have been applied to regression problems [54], as mentioned above, we have not found application of the 2-tuple model to this interpretability problem.
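The fuzzification in step 7 can be sketched as follows. The sketch rests on several assumptions that are not taken from this work: a five-label term set, a linear mapping of the crisp value onto the granularity interval, and an upper bound of 5000 passengers (`MAX_PASSENGERS`); the function names are hypothetical. It is meant only to show that a crisp leaf value can be rewritten as a 2-tuple and recovered exactly.

```python
import math

LABELS = ["Very Low", "Low", "Medium", "High", "Very High"]  # assumed term set
MAX_PASSENGERS = 5000  # assumed upper bound of the variable's range


def fuzzify(passengers):
    """Convert a crisp passenger count into a 2-tuple (label, symbolic translation)."""
    g = len(LABELS) - 1
    clipped = min(max(passengers, 0), MAX_PASSENGERS)
    beta = clipped / MAX_PASSENGERS * g  # linear scaling onto [0, g] (assumption)
    i = math.floor(beta + 0.5)           # index of the closest label
    return LABELS[i], beta - i


def defuzzify(label, alpha):
    """Recover the crisp passenger count represented by a 2-tuple."""
    g = len(LABELS) - 1
    return (LABELS.index(label) + alpha) / g * MAX_PASSENGERS


# A crisp leaf value of a regression tree rewritten as a linguistic consequent.
label, alpha = fuzzify(3066)
print(f"then passengers is ({label}, {alpha:+.2f})")
```

Under these assumptions, 3066 passengers becomes (Medium, +0.45), and `defuzzify` returns the original count, which is the sense in which the 2-tuple representation avoids the information loss of a plain label.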

Proposed Model
To achieve our goals, we propose a model based on KDD and CRISP [55]. This model is shown in Figure 4 and its steps are explained below. Although this model can be used to independently address each of the transport means we want to predict, as a use case we have taken a single bus line.

Developing and Understanding of the Application
The objective was to obtain a model that can be used to predict and interpret real passenger demand given by EMT. For this company, it is critical to know the passenger demand that any bus line in the city may have on a given day. However, it is also critical to know why this demand is determined. In this way, the company will be able to make various decisions about the transport service, based on knowledge of what is expected to be happening and the factors and rules that will cause this demand.
In this work we focused on the analysis of bus line 1 in the afternoon schedule in the period January 2015-February 2017.

Creating a Target Data Set
The data of the mentioned bus line was provided by means of flat files (csv format). These files contained events (each time a passenger inserts the ticket or travel card into the machine, an event is generated that includes timestamp information, bus ID, bus stop ID, bus direction and ticket type). Given the volume of the transactional data provided, we decided to use Big Data platforms. Therefore, these files were ingested in a Hadoop platform within a cluster, in order to have the data in a coherent and correctly integrated way to proceed with its subsequent analysis. For this, Hive was used, through which the input data from csv files were inserted into our Hadoop database. Spark was also used, through which we were able to work with the data inserted in a Hadoop database in a parallel way.
The output variables are:
- Date: field that indicates the date, formed by year, month and day (e.g., 20160101);
- Passengers: number of passengers in the selected time slot.

Data Understanding
This is the second phase of the CRISP-DM process, which focuses on data collection and quality review. We focused on the analysis of bus line 1 in the afternoon schedule. We observed that November was the month with the highest average daily demand, while August was the month with the lowest average daily demand. On the other hand, Wednesday was the day with the highest demand and Sunday the day with the fewest passengers. The daily average number of passengers was 3066.
Data understanding also involves looking for patterns in order to form hypotheses about the information. For this, we carried out an analysis of monthly, weekly and hourly behaviour, in which we saw that February and December were the months with the highest demand, while August was the one with the lowest demand. Monday was the day with the most demand and Thursday the day with the fewest passengers. Finally, by time slot [0-6, 6-12, 12-18 and 18-24], the peaks were at twelve noon and at six, while at lunchtime the number of passengers decreased. The average number of passengers was 3065.6.
In Figure 5, the time series of the daily number of passengers in the afternoon is shown. The histogram of this variable is shown in Figure 6.

Data Cleansing and Pre-Processing
This is the third stage of the KDD process, which focuses on cleaning and pre-processing the target data. To do this, we represented line 1 by time slots and according to whether the day was a working day or a weekend day.
We noticed the presence of outliers or errors, since we found some travelers on daytime buses during nighttime hours, and on the other hand, travelers on night buses during daylight hours, so we removed these cases. For the night lines, we proceeded to maintain only the values that appear between 23 and 07 h.
Our database consisted of two sources: data from the buses, which we call "internal sources", and all the data that do not come from the buses, which we call "external sources".
First of all, we loaded the data in csv format or other formats to the HDFS system to later be inserted into our database. For this purpose, we developed two scripts in bash language that are responsible for loading the data from the cluster itself to the HDFS system. These scripts were divided according to the origin of the data, (internal sources versus external sources).
The internal sources are obtained from the bus machines that record the entry of passengers. Every time a passenger inserts the ticket or travel card into the machine, it generates a new entry in the database and is recorded as a new trip.
Finally, we obtained from http://www.calendarioslaborales.com (accessed on 28 February 2022) the holiday calendar of the Community of Madrid.

Data Transformation
This is the fourth stage of the KDD process that focuses on data transformation, so that algorithms can be easily implemented.
Some models are sensitive to the scale of the input data, so we rescaled the variable to be predicted by dividing it by 5000. For all models except LSTM, we performed a one-hot encoding of the categorical variables day of week and month, obtaining the following dummies:
- day of week: Mon., Tue., Wed., Thu., Fri., Sat., Sun.;
- month: Jan., Feb., March, April, May, June, July, Aug., Sept., Oct., Nov., Dec.
In the LSTM model, the categorical variables were processed using embedding layers. Finally, we computed the lagged versions of the variables up to 5 days.
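The transformations described in this stage (scaling by 5000, one-hot encoding of day of week and month, and lags of up to 5 days) can be sketched in plain Python. This is an illustrative sketch, not the pipeline used in this work: the helper names `one_hot` and `build_rows` and the toy passenger counts are invented for the example, and only the target variable is lagged here for brevity.

```python
from datetime import date

SCALE = 5000  # scaling constant stated in the text
MAX_LAG = 5   # lagged versions are computed up to 5 days


def one_hot(index, size):
    """Return a list of `size` dummies with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def build_rows(series):
    """Build model rows from a list of (date, passengers), one entry per day, sorted."""
    scaled = [passengers / SCALE for _, passengers in series]
    rows = []
    for t in range(MAX_LAG, len(series)):  # the first MAX_LAG days have no full history
        day = series[t][0]
        rows.append({
            "target": scaled[t],
            "lags": [scaled[t - k] for k in range(1, MAX_LAG + 1)],  # lag 1 first
            "dow": one_hot(day.weekday(), 7),     # Monday = index 0
            "month": one_hot(day.month - 1, 12),  # January = index 0
        })
    return rows


# Toy data: made-up counts for eight consecutive days, purely for illustration.
toy = [(date(2016, 1, d), 3000 + 10 * d) for d in range(1, 9)]
rows = build_rows(toy)
print(len(rows), rows[0]["target"], rows[0]["lags"])
```

For the LSTM, the text replaces the one-hot dummies with embedding layers, so only the scaling and lag construction of this sketch would carry over.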

Choosing the Most Suitable ML Algorithm
At this stage of the KDD process, the appropriate data mining task is chosen. Once we had performed the transformation of the data, we proceeded to build the predictive system. For this, we developed different models with the aim of selecting the most accurate. Finally, the one selected for the project was the LSTM, which was developed in Keras [56]. The comparison of models is shown in Table 3.
The blocks of an LSTM network contain different internal gates that control the flow of information (Figure 7), in such a way that when a gate has a value close to 0, information does not flow through it, and when it has a value close to 1, information flows.
One of the gates (Forget Gate) is dedicated to forgetting, and controls the degree to which a value remains in the memory of the neuron, determining how much information about the past state should be discarded:

$$f_t = \sigma(W_f \times [h_{t-1}, x_t] + b_f)$$

where $f_t$ is the forget gate; $\sigma$ is the sigmoid function; $W_f$ is the forget gate weight matrix; $\times$ is the multiplication operator; $h_{t-1}$ is the output of the previous LSTM cell; $x_t$ is the input; and $b_f$ is the bias for the gate.
The input gate, $i_t$, determines how much information about the current state of the network should be stored in the internal state:

$$i_t = \sigma(W_i \times [h_{t-1}, x_t] + b_i)$$

The output gate, $o_t$, determines how much information from the current internal state should be sent to the external state:

$$o_t = \sigma(W_o \times [h_{t-1}, x_t] + b_o)$$

The "tanh" node can be thought of as the activation function of the hidden layer, and its output is:

$$\tilde{C}_t = \tanh(W_C \times [h_{t-1}, x_t] + b_C)$$

In the final LSTM model, we used a configuration with 5 LSTM neurons, a window size of 5, a batch size of 32, and 2000 training epochs. Each of the categorical input variables (day of week and month) was processed by an embedding layer.
Finally, the dataset was divided into training dataset (80%, January 2015-September 2016) and test dataset (20%, September 2016-February 2017). In Figure 8, we can see the day-by-day prediction of a range of days in the test set. We can observe that the prediction fits quite well (R 2 train = 0.92, R 2 test = 0.89).
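The gate mechanism described above can be traced numerically with a single LSTM time step. The sketch below follows the standard LSTM formulation; the cell-state and hidden-state updates (`c_t`, `h_t`) are not spelled out in the text and are assumed here, and all weights are hypothetical toy values rather than those of the trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_j for row, b_j in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step (standard formulation, assumed here).

    `params` maps "f", "i", "o", "c" to (W, b) pairs acting on [h_prev, x_t].
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"][:1], z, params["f"][1])]
    i_t = [sigmoid(a) for a in affine(params["i"][0], z, params["i"][1])]
    o_t = [sigmoid(a) for a in affine(params["o"][0], z, params["o"][1])]
    c_hat = [math.tanh(a) for a in affine(params["c"][0], z, params["c"][1])]
    # The forget gate scales the old memory; the input gate scales the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # The output gate decides how much of the memory reaches the external state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t, (f_t, i_t, o_t)


# One hidden unit, one input; zero weights and biases put every gate at 0.5.
zero = ([[0.0, 0.0]], [0.0])
params = {"f": zero, "i": zero, "o": zero, "c": zero}
h_t, c_t, gates = lstm_step([1.0], [0.0], [2.0], params)

# A strongly negative forget bias closes the gate: the old memory is discarded.
closed = dict(params, f=([[0.0, 0.0]], [-20.0]))
_, c_closed, _ = lstm_step([1.0], [0.0], [2.0], closed)
```

With all gates at 0.5, half of the previous memory (2.0) survives. With the forget gate close to 0, almost none of it does, which is the behaviour the text attributes to gate values near 0.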

Interpreting Mined Patterns
Once our ML model, which is a non-interpretable black-box model, has been built, in this phase we obtain a surrogate rule model that allows interpretation from the business domain point of view.
To this end, we applied steps 1 to 6 of the algorithm explained in Section 3.2 to the LSTM model (chosen in the previous phase), obtaining the rules shown in Table 4. As discussed in Section 3.1, to improve the interpretability of the number of passengers (the prediction variable), we can represent it with its corresponding 2-tuple value. To this end, we applied steps 7 and 8 of the algorithm explained in Section 3.3. We first transformed this variable with the usual linear min-max transformation to map it to the interval [0, 1], which is a necessary prior step for the transformation to a 2-tuple value (see Figure 3). The results of these transformations are shown in Table 5 (2-tuple representation of the prediction variable).
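The core idea of a surrogate rule model is that an interpretable model is fitted to the predictions of the black box rather than to the observed data. The sketch below illustrates this with a depth-one regression tree (a stump). It is not the algorithm of Section 3.2: `black_box` is a hypothetical stand-in for the trained LSTM, and the two features and their values are invented for illustration.

```python
def black_box(is_weekend, prev_demand):
    """Hypothetical stand-in for the trained LSTM: any opaque predictor works."""
    return 0.4 * prev_demand + (0.2 if is_weekend else 1.0)


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(X, y):
    """Return (feature index, threshold, left mean, right mean) of the best split."""
    best = None
    for j in range(len(X[0])):
        # Every observed value except the largest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, thr, mean(left), mean(right))
    return best[1:]


# Toy inputs: (is_weekend, scaled demand on the previous day).
X = [(w, p) for w in (0, 1) for p in (1.0, 1.5, 2.0, 2.5)]
# Key step: the surrogate is trained on the black box's predictions.
y = [black_box(w, p) for w, p in X]

feature, threshold, left_mean, right_mean = fit_stump(X, y)
names = ["is_weekend", "prev_demand"]
print(f"IF {names[feature]} <= {threshold} THEN demand ~ {left_mean:.2f} "
      f"ELSE demand ~ {right_mean:.2f}")
```

A deeper tree would yield one rule per leaf, which is the form in which the rules of Table 4 are read.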

Once the predicted target variable had been represented linguistically, we replaced its crisp values in the corresponding rules, in a fuzzification process of this variable. After this process, we obtained the rules shown in Table 6. It should be noted that the 2-tuple model represents the original variable linguistically but with full accuracy, without loss of information. Because we worked with time series, the predicted variable itself was also used as an independent variable at earlier points in time. These independent variables were also replaced by their corresponding 2-tuple values to improve the linguistic interpretability of the rules.
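The min-max normalization and the conversion to a 2-tuple can be sketched as follows, using the usual definition of the 2-tuple linguistic model (a label plus a symbolic translation in [-0.5, 0.5)). The five-label set and the demand range used here are assumptions for illustration; the actual label set is the one defined in Figure 3.

```python
import math

# Assumed label set, ordered from lowest to highest demand.
LABELS = ["very low", "low", "medium", "high", "very high"]


def min_max(x, lo, hi):
    """Linear min-max transformation of x to the interval [0, 1]."""
    return (x - lo) / (hi - lo)


def to_two_tuple(x01, labels=LABELS):
    """Map a value in [0, 1] to (label, symbolic translation alpha)."""
    g = len(labels) - 1
    beta = x01 * g
    # Round half up so that alpha always lies in [-0.5, 0.5).
    i = min(g, math.floor(beta + 0.5))
    return labels[i], beta - i


def from_two_tuple(label, alpha, labels=LABELS):
    """Inverse mapping: recover the value in [0, 1] from a 2-tuple."""
    g = len(labels) - 1
    return (labels.index(label) + alpha) / g


# Hypothetical demand range of 2000 to 14000 passengers per day.
x01 = min_max(6200, 2000, 14000)
label, alpha = to_two_tuple(x01)
recovered = from_two_tuple(label, alpha)
```

Because the symbolic translation keeps the distance to the nearest label, the inverse mapping returns the original normalized value, which is why the text can claim no loss of information.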

Using Discovered Knowledge
This is the final step of the KDD process, in which the discovered knowledge is put to use in different use cases. In Section 4.6, we obtained an LSTM model that predicts passenger demand with good accuracy (R² = 0.92 on the training set and 0.89 on the test set), as can be seen in Figure 8. As seen in Section 4.7, the model can also be interpreted from the business point of view through rules with good precision (R² = 0.95 on both the training and test sets). Therefore, we know qualitatively the reason for each prediction.
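The coefficient of determination used for both figures can be computed as in the minimal sketch below. The values are toy numbers, not the paper's data. Which series serves as the reference for the rules' R² (the observed demand or the LSTM output) is not stated in this section, so the function simply takes a reference series and a predicted series.

```python
def r2_score(y_ref, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_ref = sum(y_ref) / len(y_ref)
    ss_res = sum((r - p) ** 2 for r, p in zip(y_ref, y_pred))
    ss_tot = sum((r - mean_ref) ** 2 for r in y_ref)
    return 1.0 - ss_res / ss_tot


# Toy reference values and predictions (hypothetical).
y_ref = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
score = r2_score(y_ref, y_pred)
```

Passing the observed demand as `y_ref` measures accuracy, while passing the black-box predictions as `y_ref` measures how faithfully the rules reproduce the LSTM.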

Discussion
We built an LSTM model that forecasts bus passenger demand with high accuracy. We then applied the methodology explained in Section 3.3, constructing a surrogate tree and expressing its rules with 2-tuple linguistic values. The surrogate rules obtained for the LSTM model with the 2-tuple linguistic representation are shown in Table 6 and can be interpreted as follows.
Passenger demand will be medium for the following cases:
• Rule 3: Days in July that are neither weekends nor holidays. On those days, students do not have to attend classes and, from the second fortnight onwards, some workers start their vacations. Therefore, passenger demand starts to decrease.
• Rule 4: Saturdays when passenger demand on the previous day was very high. A large part of the population does not work on Saturday afternoons. On those days, public transport is usually used for leisure activities, so passenger demand is lower than on the previous day.
Passenger demand will be low for the following cases:
• Rule 5: Holidays that are not weekends or days in August. It seems that during those holidays people prefer to stay at home or move to other areas.
• Rule 6: August days that are not weekends. In August many companies close and workers have to take mandatory vacations. In addition, students do not have to go to class, so passenger demand declines.
• Rule 7: Saturdays when passenger demand on the previous day was high or medium. Same interpretation as in Rule 4.
• Rule 8: Sundays when passenger demand on the previous day was neither low nor very low. The activities carried out by people in Madrid are similar on Saturdays and Sundays, but many street stores are closed on Sundays. For this reason, passenger demand on these two days is similar, with lower demand on Sundays.
Passenger demand will be very low for the following case:
• Rule 9: Sundays when passenger demand on the previous day was very low. Same interpretation as in Rule 8.
We emphasize that the 2-tuple model gave us, on the one hand, good linguistic interpretability of the LSTM model and, on the other hand, precision in the interpretation of the rules. For example, Rules 5, 6, 7 and 8 all imply a low demand forecast but with different symbolic translations, so we can distinguish that demand is lower in some cases than in others.

Conclusions and Future Work
The problem addressed in this article is one of the most important for smart cities, which aim to make their transport networks more efficient and easier to control, here through the prediction of bus passenger demand. Most of the machine learning algorithms used with time series for this purpose are so-called black-box algorithms: they predict well but are not interpretable by humans. However, interpretability of the model is crucial for transparency, traceability, and auditability, especially in the allocation of public resources.
In the literature, we found works oriented to the prediction and understanding of the traffic flow and works that seek to make the recommendations on the transport fleet interpretable. As far as we know, there is no work specifically oriented to the problem of predictability and interpretability of public bus users like the work presented here.
We built an LSTM model that is able to predict passenger demand in the public transport network, and we demonstrated it empirically with a use case. We used a large amount of data from internal and external sources.
Finally, the model was made interpretable thanks to the surrogate model and the 2-tuple fuzzy linguistic model. The 2-tuple fuzzy technique improved the linguistic interpretability of the generated XAI model without losing accuracy. We did not find the joint use of both approaches in the literature.
From the economic point of view, passenger demand forecasting can be used for optimal planning of resources, leading to energy and cost savings for the public transport company. Additionally, this methodology can be applied in future work to predict passenger demand for other types of transport (air, railway, marine).