Trip-Based Modeling of Fuel Consumption in Modern Heavy-Duty Vehicles Using Artificial Intelligence

Heavy-duty trucks account for approximately 20% of fuel consumption in the United States of America (USA). The fuel economy of heavy-duty vehicles (HDV) is affected by several real-world factors, including road parameters, driver behavior, weather conditions, and vehicle parameters. Although modern vehicles comply with emissions regulations, potential engine malfunction, regular wear and tear, or other factors can affect vehicle performance. Predicting fuel consumption per trip from dynamic on-road data can help the automotive industry reduce the cost and time of on-road testing. With knowledge of the input parameters, data modeling can also help to diagnose the reasons behind the observed fuel consumption. In this paper, an artificial neural network (ANN) was implemented to model fuel consumption in modern heavy-duty trucks and to predict the total and instantaneous fuel consumption of a trip from very few key parameters, such as engine load (%), engine speed (rpm), and vehicle speed (km/h). Instantaneous fuel consumption data can help to predict patterns in fuel consumption for optimized fleet operations. The data used for modeling were collected at a frequency of 1 Hz during on-road testing of modern HDVs at the West Virginia University Center for Alternative Fuels Engines and Emissions (WVU CAFEE) using a portable emissions monitoring system (PEMS). The performance of the artificial neural network was evaluated using mean absolute error (MAE) and root mean square error (RMSE). The model was further evaluated with data collected from a vehicle on-road trip. The study shows that the artificial neural network performed slightly better than other machine learning techniques, such as linear regression (LR) and random forest (RF), with a higher R-squared (R²) and a lower root mean square error.


Introduction
Improving the fuel efficiency of heavy-duty trucks benefits not only the automotive and transportation industry but also a country's economy and the global environment [1,2]. Fuel accounts for approximately 30% of a heavy-duty truck's life cycle cost, so reducing fuel consumption by just a few percent can significantly reduce costs for the transportation industry [3,4]. Effective and accurate estimation of fuel consumption (fuel consumed in L/km) can help to analyze emissions as well as prevent fuel-related fraud. According to Environmental Protection Agency (EPA) reports, 28% of total greenhouse gas emissions come from transportation (heavy-duty vehicles and passenger cars) [5]. The United States Environmental Protection Agency (US EPA) has introduced Corporate Average Fuel Economy (CAFE) standards that require automotive manufacturers to comply with limits regulating fuel consumption [6,7]. US EPA regulations released in 2016 target fuel economy improvements in freight trucks, whose fuel efficiency is predicted to improve by 11-14% by 2021 [8]. Most states have now mandated that truck fleets update their vehicle inventory with modern vehicles due to air quality regulations.
Several studies have been presented in the past for evaluating the fuel efficiency of vehicles using simulation-based models and data-driven models. A simulation model was developed based on engine capacity, fuel injection, fuel specification, aerodynamic drag, grade resistance, rolling resistance, and atmospheric conditions, with simulated dynamic driving conditions to predict fuel consumption [9]. A statistical model which is fast and simple compared to the physical load-based approach was developed to predict vehicle emissions and fuel consumption [10]. The impact of road infrastructure [11], traffic conditions [12][13][14], drivers' behavior [15], weather conditions [16][17][18], and the ambient temperature on fuel consumption were studied, and it was determined that fuel consumption can be reduced by 10% with eco-driving influences. The era of big data and artificial intelligence has enabled the modeling of huge volumes of data for companies to reduce emissions and fuel consumption. Machine learning techniques such as support vector machine (SVM) [19], random forest (RF) [20], and artificial neural networks (ANN) [21,22] are widely applied to turn data into meaningful insights and solve complex problems. These techniques have been applied to estimate emissions and fuel consumption in motor vehicles [23], trucks [24], ships [25], and aircraft [26]. A comparison of previous studies has been shown in Table 1.
While the current approaches determine the fuel consumption of the vehicle, combining these techniques with data helps to identify parameters that may cause anomalies, such as malfunctions due to wear and tear of the engine, improper maintenance, engine failure, exhaust after-treatment system, and external factors like climate, traffic, road conditions, etc. Most studies in the literature have been limited to passenger cars [21,27], light-duty vehicles [28], heavy-duty vehicles [29], or were based on a huge number of parameters or limited dynamic data collected during on-road trips. The relative importance of various factors influencing fuel consumption was reviewed in the past [30,31]. However, modeling modern heavy-duty trucks with very few parameters is much more complicated.
The current study models fuel consumption in modern heavy-duty trucks based on portable emissions monitoring system (PEMS) data collected during on-road testing. An artificial neural network was developed to predict the total fuel consumed by a vehicle on a trip from very few key parameters, such as engine load (%), engine speed (rpm), and vehicle speed (km/h). The model also provides the trend in fuel consumption over the trip, which gives insight into how the input parameters affect the performance of the truck and supports diagnostics. The model predicts the total fuel consumed with a mean absolute error of 0.0014 and a root mean square error of 0.0025, which is more accurate than other techniques such as linear regression [32] and random forest [20].

Methodology
Regression analysis was performed using machine learning techniques, namely an artificial neural network, linear regression, and random forest, to estimate the fuel consumption of modern heavy-duty trucks from PEMS data. The preprocessed dataset, which related to a single vehicle, contained 672,658 rows of actual torque (Nm), vehicle speed (km/h), and engine speed (rpm), which were used as inputs for the models. The implementation stages of the artificial neural network for fuel consumption modeling were as follows:
• Pre-processing the data and identifying multi-collinear variables;
• Developing the neural networks and identifying the network with the best-performing hyperparameters. The hyperparameters include the number of hidden layers, the number of hidden neurons per layer, the learning rate, and the optimization function;
• Calculating the correlation coefficient on the reduced database using the best-performing model selected;
• Performing the generalization analysis of the model by calculating performance measures such as MAE, RMSE, and R²;
• Evaluating the performance of the model by comparing the predicted values with the actual values collected during on-road testing.
Figure 1 shows the overall workflow for this work.

Data Collection and Pre-Processing
Data collection methods such as onboard emission measurement [36], laboratory measurement, and tunnel studies [37] have been used in the past. On-road data collection using PEMS is increasingly common; it makes it possible to collect real-world fuel consumption and emission data [38] and has proved to be reliable [39]. The data used in the current study were collected at a frequency of 1 Hz using a PEMS device mounted on the vehicle during on-road testing. PEMS software outputs for the sensor ports were used to process the second-by-second data into a comma-separated values (CSV) file for each trip. Over 100 parameters, such as fuel rate (L/h), engine speed (rpm), speed (km/h), gas temperature, CO2, NOx, GPS altitude, GPS longitude, and GPS latitude, were collected for each trip based on the data logger settings. Two modern heavy-duty trucks with the same make and model of diesel engine, manufactured in Detroit in 2016, were used in this study. The trucks were Cascadia models manufactured by Freightliner with DD13 engines and were used as goods movement trucks. The on-road tests were performed in California; the fuel consumption during these tests was recorded, and the cumulative fuel consumed per trip was calculated by summing the values. The vehicles were tested over multiple on-road trips with different routes, drivers, and conditions. However, modeling with too many parameters might overfit the ANN model, resulting in poor performance. Hence, a subset of 10 features was selected based on previous studies and domain knowledge. These features were trip number, engine speed (rpm), trip distance (km), vehicle speed (km/h), fuel temperature (°C), fuel rate (L/s), accelerator pedal position (%), actual torque (Nm), power (kW), and engine load (%).
For better modeling of the neural network, the data collected must be representative. The raw dataset contains noise, missing values, redundant values, and outliers caused by sensor failures or by sensors that were not enabled for recording. With feature engineering, the raw data are transformed into features that better represent the underlying relationships for the predictive model, resulting in better prediction accuracy. The interpretation of a regression model is difficult when independent variables are multicollinear: highly correlated independent variables carry largely redundant information, since a change in one is accompanied by a significant change in another, and this can lead to overfitting. Hence, to identify the multicollinear variables, a correlation matrix giving the correlation coefficient of each variable with every other variable in the data is shown in Figure 2. The last four rows and columns of the correlation matrix indicate that the independent features accelerator pedal position (%), actual torque (ft-lb), power (bhp), and engine load (%) are highly correlated with each other, with correlation coefficients of 0.85 and higher. Hence, to prevent overfitting of the model, only engine load (%) of these four parameters was used in modeling. Feature dimension can be reduced further by identifying the features that are highly correlated with the target variable, fuel rate (L/s). The recursive feature elimination (RFE) method was used to identify and plot the feature importance scores.
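The collinearity screening described in this paragraph can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the `keep_first` argument, and the sample values are assumptions made for the example, and only the 0.85 threshold comes from the text.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_collinear(columns, keep_first, threshold=0.85):
    """Return the feature names that survive the collinearity screen.

    `columns` maps feature name -> list of values. Names in `keep_first`
    are examined first, so they are retained when they are collinear
    with later features (hypothetical helper, not from the paper).
    """
    order = list(keep_first) + [c for c in columns if c not in keep_first]
    kept = []
    for name in order:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Illustrative values only (not measured data).
data = {
    "engine_load_pct": [10, 35, 60, 80, 95, 40],
    "accel_pedal_pct": [12, 33, 62, 79, 97, 41],  # tracks engine load closely
    "engine_speed_rpm": [900, 1500, 1100, 1700, 1200, 1600],
}
kept = drop_collinear(data, keep_first=["engine_load_pct"])
print(kept)
```

Listing engine load in `keep_first` mirrors the choice in the text to retain engine load (%) out of the group of mutually correlated features.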
RFE selects the desired number of features by starting with all features and recursively removing the least important ones. The importance of the remaining features, including engine load (%), accelerator pedal position (%), fuel temperature (°C), vehicle speed (km/h), trip distance (km), and engine speed (rpm), with respect to fuel rate (L/s) was determined with the RFE technique, and the three features with the highest scores were selected for modeling. Based on this feature analysis, three independent features of high importance, namely engine load (%), vehicle speed (km/h), and engine speed (rpm), were selected to model the dependent feature fuel rate (L/s) and to identify patterns in fuel consumption.
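A rough sketch of recursive feature elimination is given below. The text does not state which estimator was used to rank the features, so a linear least-squares model on standardized features is assumed here purely for illustration. The data values and the synthetic target are hypothetical. In practice a library implementation (for example the RFE class in scikit-learn) would normally be used.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def standardize(col):
    """Scale a column to zero mean and unit variance."""
    mean = sum(col) / len(col)
    std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - mean) / std for v in col]

def rfe(features, target, n_keep):
    """Repeatedly fit a linear model and drop the feature with the smallest |weight|."""
    cols = {name: standardize(vals) for name, vals in features.items()}
    y_mean = sum(target) / len(target)
    y = [v - y_mean for v in target]
    names = list(cols)
    while len(names) > n_keep:
        # Normal equations (X^T X) w = X^T y on the standardized features
        A = [[sum(a * b for a, b in zip(cols[p], cols[q])) for q in names] for p in names]
        rhs = [sum(a * t for a, t in zip(cols[p], y)) for p in names]
        w = solve(A, rhs)
        weakest = min(range(len(names)), key=lambda i: abs(w[i]))
        names.pop(weakest)
    return names

# Illustrative values only (not measured data).
features = {
    "engine_load_pct": [10, 35, 60, 80, 95, 40, 20, 70, 55, 85],
    "vehicle_speed_kmh": [5, 40, 90, 60, 100, 30, 80, 20, 70, 50],
    "engine_speed_rpm": [900, 1500, 1100, 1700, 1200, 1600, 1000, 1400, 1800, 1300],
    "fuel_temp_c": [30, 32, 31, 35, 33, 36, 34, 30, 32, 35],
    "trip_distance_km": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
# Synthetic target that depends only on the first three features.
fuel_rate = [0.02 * a + 0.01 * b + 0.001 * c
             for a, b, c in zip(features["engine_load_pct"],
                                features["vehicle_speed_kmh"],
                                features["engine_speed_rpm"])]
selected = rfe(features, fuel_rate, n_keep=3)
print(selected)
```

Standardizing the features before each fit makes the weight magnitudes comparable across features with different units, which is what allows the smallest weight to be read as the least important feature.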

Artificial Neural Network
An artificial neural network (ANN) is a machine learning technique inspired by biological neurons that can produce accurate, time-efficient predictive models [40]. An ANN consists of multiple computational neurons, and the connections between the neurons determine the functionality of the network [41]. The building block of a neural network is the neuron, which computes a weighted sum of its inputs and passes it through a non-linear activation function. The multi-layer perceptron (MLP) is a type of neural network that consists of an input layer, one or more hidden layers, and an output layer. The input layer takes the parameters/features as inputs to the network. At least one hidden layer performs computations on the input data by applying a non-linear activation function. The output layer produces the output for the task the network is trained for. ANNs have gained popularity due to their adaptive learning ability and their capacity to approximate non-linear functions to make predictions [42]. In this study, a feed-forward neural network [43] was used, in which data are transmitted in the forward direction from the input layer to the output layer through weighted connections between layers. The network was trained with the backpropagation algorithm [44], ReLU [45] activation, and the mean square error loss function. Backpropagation is a learning method that optimizes the weights of the neurons in a neural network by repeatedly adjusting them to minimize the prediction error. The network used for this work has three inputs to the input layer, two hidden layers with six and eight neurons, respectively, and an output layer with a single neuron, as shown in Figure 3.
The available dataset of vehicle 1 was divided into train and test sets, with 70% used to train the network and 30% used as a validation dataset to test the generalization of the network [46]. The trained model weights were then used to make predictions on unseen test data (a single trip from vehicle 2).
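To make the described architecture concrete, the forward pass of a 3-6-8-1 feed-forward network with ReLU hidden layers can be sketched as below. This is only an illustration under stated assumptions: the weights are random rather than trained, the output neuron is assumed to be linear (the text does not specify its activation), the inputs are assumed to be pre-scaled, and all function names are hypothetical.

```python
import random

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] connects input i to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random initial weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Feed-forward pass: ReLU on hidden layers, linear output neuron (assumed)."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return dense(x, weights, biases)[0]

rng = random.Random(0)
sizes = [3, 6, 8, 1]  # inputs, hidden layer 1, hidden layer 2, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

# Hypothetical, already-scaled inputs: engine load, engine speed, vehicle speed
prediction = forward([0.45, 0.60, 0.70], layers)
print(n_params, prediction)
```

With these layer sizes the network has (3·6 + 6) + (6·8 + 8) + (8·1 + 1) = 89 trainable parameters, which is small relative to the several hundred thousand training rows mentioned in the Methodology section.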
The performance of a neural network depends on many hyperparameters like the learning rate, number of epochs for training, initial weights, number of hidden layers, and the number of neurons in hidden layers. Multiple experiments were performed with different hyperparameters and the best results for the optimal network are presented in the results section.

Multiple Linear Regression
Linear Regression (LR) is the most well-known regression technique, in which the data are fitted to a straight line and the output is predicted by minimizing a cost function or error. Because there are multiple input parameters in this study, the multivariable linear equation given by Equation (1) is used.
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 \qquad (1)$$
where y is the output, x_1, x_2, x_3 are the input variables, and θ_0, θ_1, θ_2, θ_3 are the parameters to learn.
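As a small worked illustration of Equation (1), the parameters θ can be learned by minimizing the squared error with batch gradient descent. The fitting procedure, learning rate, epoch count, and data below are assumptions for the sketch; the text does not state how the linear model was fitted.

```python
from itertools import product

def predict(theta, x):
    """Equation (1): y = theta0 + theta1*x1 + theta2*x2 + theta3*x3."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def fit(X, y, lr=0.1, epochs=5000):
    """Batch gradient descent on half the mean squared error."""
    n, d = len(X), len(X[0])
    theta = [0.0] * (d + 1)
    for _ in range(epochs):
        errors = [predict(theta, x) - t for x, t in zip(X, y)]
        grad = [sum(errors) / n]  # derivative with respect to theta0
        grad += [sum(e * x[j] for e, x in zip(errors, X)) / n for j in range(d)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Illustrative inputs scaled to [0, 1]: the eight corners of the unit cube.
X = [list(p) for p in product([0.0, 1.0], repeat=3)]
true_theta = [0.5, 2.0, 1.0, 3.0]  # hypothetical parameters
y = [predict(true_theta, x) for x in X]

theta = fit(X, y)
print([round(t, 4) for t in theta])
```

Because the synthetic target is exactly linear in the inputs, the fitted parameters converge to the hypothetical `true_theta`; on real data the fit would only approximate the measurements.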

Random Forest
Random Forest (RF) is an ensemble machine learning method for regression and classification tasks. This method uses many decision trees, and the outcome is based on the combined predictions of these trees. Increasing the number of trees can therefore improve the accuracy of the model and make it more robust to outliers. In this study, the random forest was trained with 100 trees, because using more than 100 trees did not significantly improve accuracy and was computationally more expensive.

Performance Measures
The performance of the machine learning models for the regression problem was evaluated using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-Squared (R²).

Mean Absolute Error
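Mean Absolute Error (MAE) is the average absolute difference between the predicted value and the actual value, given by Equation (2).
$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| y_{pred,t} - y_{actual,t} \right| \qquad (2)$$
where y_{pred,t} is the predicted value, y_{actual,t} is the measured fuel consumption at the same instant of time, and n is the number of samples.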

Root Mean Squared Error
Root Mean Squared Error (RMSE) is the square root of the average squared difference between the predicted value and the actual value, given by Equation (3). The smaller the value, the closer the predicted values are to the actual values.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( \hat{y}_t - y_t \right)^2} \qquad (3)$$
where ŷ_t is the predicted value, y_t is the measured fuel consumption at the same instant of time, and n is the number of samples.

R-Squared
R-Squared (R²) is the statistical measure of the proportion of variance in the dependent variable that is explained by the regression model, given by Equation (4).
$$R^2 = 1 - \frac{SS_{res}}{SS_{total}} \qquad (4)$$
where SS_{res} is the sum of squares of residuals and SS_{total} is the total sum of squares.
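The three performance measures of Equations (2) to (4) can be computed directly, as in the minimal sketch below. The function names and sample values are illustrative and are not results from the study.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, Equation (2)."""
    return sum(abs(p - a) for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, Equation (3)."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination, Equation (4)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_total = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_total

# Illustrative values only (not measured data).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mae(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))
```

For these sample values the MAE is 0.25, the RMSE is about 0.354, and R² is 0.9. RMSE is never smaller than MAE because squaring weights large errors more heavily.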

Results and Discussion
This study presents the fuel consumption modeling for modern heavy-duty vehicles using PEMS data under various driving conditions, different routes, and external factors. Engine Load (%), Engine Speed (rpm), and Vehicle Speed (km/h) were used as inputs for the ANN. Based on the hyper-parameter tuning, the neural network was trained for 100 epochs with a learning rate of 0.00001. During each epoch, the loss for each data item/batch in the training dataset and validation dataset was calculated. The loss plots shown in Figure 4 indicate the mean absolute error (MAE) and mean square error (MSE) on both training data and validation data.
The small generalization gap between the training and validation loss curves indicates a good fit. The generalization of the neural network was tested on data collected from a single trip of another vehicle. In Figure 5, the data points close to the line indicate that the neural network model can accurately predict fuel consumption, with few outliers. The points far from the regression line are outliers that the neural network did not capture; they correspond to sudden transitions in vehicle speed (km/h) and engine speed (rpm) during which fuel consumption was high.
To determine the total fuel consumed by the vehicle, the cumulative fuel consumption was calculated by adding the instantaneous fuel rate values for every second. The performance measures described in Section 3 were used to evaluate the model, and the values obtained for the training data are MAE: 0.0009 and RMSE: 0.0021. The R² values of 0.7806 on the test data and 0.7762 on the training data indicate that the neural network model generalizes well to unseen data.
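The cumulative calculation described here amounts to a running sum of the per-second fuel rate. A minimal sketch follows, assuming the fuel rate is expressed in L/s and sampled at 1 Hz; the function name and sample values are illustrative.

```python
from itertools import accumulate

def cumulative_fuel(fuel_rate_l_per_s, dt_s=1.0):
    """Running total of fuel consumed (L) from a fuel-rate trace sampled every dt_s seconds."""
    return list(accumulate(rate * dt_s for rate in fuel_rate_l_per_s))

# Illustrative values only (not measured data).
rates = [0.002, 0.004, 0.003, 0.001]
totals = cumulative_fuel(rates)
trip_total = totals[-1]
print(totals, trip_total)
```

The last element of the running sum is the total fuel for the trip. The same function can be applied to measured and predicted fuel-rate traces to compare their cumulative curves.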
The cumulative fuel consumption with distance was plotted against the actual data to determine how well the neural network predicted total fuel consumption. To evaluate the performance, the neural network predictions were compared with the predictions of linear regression and random forest. The performance metrics MAE, RMSE, and R² are compared in Table 2. Figure 6 compares the cumulative fuel consumed against distance traveled for all models. The neural network prediction was closer to the actual measured data than that of the linear regression model, which overestimated the cumulative fuel consumed, and that of the random forest, which underestimated it. The neural network model therefore compares favorably with the other models, because its predictions track the measured values more closely. In addition, the neural network model can be fine-tuned for more complex learning on other vehicle types without developing a new model for the same task.

Conclusions
In conclusion, the study demonstrates the modeling of fuel consumption in modern heavy-duty vehicles with an artificial neural network using very few technical parameters. An attempt was made to develop a model using very few parameters collected under different conditions. Data from modern heavy-duty trucks of the same make and model, driven by different persons on various routes under different external conditions, were used for training the artificial neural network. The model relies on very few parameters that can be obtained quickly and easily from a vehicle during a trip, unlike other parameters such as road grade, latitude, longitude, and traffic information. Moreover, the three parameters used were able to capture a minimum of 78% of the variance in the fuel rate, whereas other studies use many parameters. Adding more input parameters could improve the performance of the ANN, but collecting such data might require additional equipment setup. The performance measures MAE, RMSE, and R² indicate that accurate prediction can be obtained with the model. The data modeling can help to identify the trend in instantaneous fuel consumption and to calculate the total fuel consumed by the vehicle for each trip, which can further help in diagnosing vehicle performance in the case of abnormalities. Models that are accurate, fast, and able to predict in real time will enable the optimization of fuel consumption. Based on the input features that were modeled, it is easy to determine the parameter affecting fuel consumption in the case of an anomaly. This study presents an efficient and practical method of estimating fuel consumption per trip based on very few parameters for which data are easily available. The cost incurred in modeling the data is very low compared with simulation methods, which also consume more time. The model can be fine-tuned easily to model more complex data from other vehicles of different makes and models that do not have the amount of on-road data needed to train a network. This work can be extended to include other factors that affect fuel consumption, such as time, traffic information, road information, and GPS data, and to estimate vehicle exhaust emissions.