1. Introduction
Full truckload (FTL) is a common transportation method in which the goods fill an entire truck. It perfectly suits large volumes of goods, where one load covers the whole truck space. Apart from FTL, there exists an alternative method called less-than-truckload (LTL), in which a truck carries several partial loads to different contracted load/unload locations within a single journey. This work focuses on FTL, however from a rarely addressed perspective.
In the case of external fleet contract pricing, the contracts are priced according to a varying contractor policy that takes into account several objective and subjective market and non-market factors [1,2], such as contract-dependent, economic, regulatory, general and purely ambivalent ones. In the opinion of the decision maker, those factors reflect the shipping cost for a given commodity along a determined route within a certain time period. Dynamic pricing both reflects the increasing dynamics of the shipping business [3] and determines it simultaneously [4]. It is further assumed that the contractor may use a custom dynamic pricing model, which is associated with serious challenges [5].
Shipping cost estimation plays an even more important role in the case of short routes, where the common relations between the price, the fuel costs and the driver time are not straightforwardly applicable. The pricing of long-range FTL contracts is frequently solved using deterministic analytical freight calculators [6] or with the use of algorithmic estimators. It is worth noting that AI and machine learning (ML) methods [7] lead in that area. The literature mostly focuses on blind machine learning approaches [2,8,9,10] or hybrid ones [11].
In contrast, the task of short-range FTL shipment cost estimation, i.e., for routes shorter than 50 km, is seldom addressed in the literature. One may point to several reasons. Firstly, the task is of a smaller order of magnitude and is often hidden within the overall data, or omitted due to the lower absolute cost of such routes. Secondly, these routes often play a complementary or secondary role in relation to the long-distance ones. Thirdly, it is a thankless and simply difficult task. We undertake it precisely because of its difficulty and because of observations made while dealing with the general task of estimating FTL costs [11], where the largest relative errors occur for short routes, spoiling the overall picture of the modeling and estimation task.
The multi-criteria assessment methodology is the second contribution of this work. It is a well-known fact that each performance index, such as the mean square error, the absolute error or various residuum statistics, exhibits different properties and is sensitive to different aspects of the estimation error. Despite that knowledge, there is a significant gap in the literature, as multi-criteria residuum assessment approaches are hardly reported [12]. Researchers mostly use the mean square error (MSE), the mean absolute error (MAE) or the relative mean absolute percentage error (MAPE). Each of them has different properties and emphasizes different data features. Squared errors are sensitive to large residua, while frequent but small errors are neglected. Absolute errors equalize these differences and reflect small residua as well.
During this research, different measures are investigated: classical (normal), robust and L-moments, the tail index and the Geweke–Porter–Hudak estimator of the fractional order of the ARFIMA filter. They are presented following the statistical approach of moment ratio diagrams (MRD) and L-moment ratio diagrams (LMRD). Our work supplements the estimation residuum analysis with a multi-criteria approach named the Index Ratio Diagrams (IRD), using various measures.
The main contribution of this work lies in the proposal of the multi-criteria residuum analysis concept, as this aspect is hardly present in the research. The FTL estimation task is considered a representative example for the assessment methodology.
The general FTL cost estimation task is formulated in Section 2, while Section 3 describes the utilized assessment methods and estimation algorithms. Various estimators are compared in Section 4. Section 5 presents the results of the multi-criteria residuum analysis, while Section 6 concludes the paper.
3. Methods and Algorithms
This research uses quite a large scope of methods, which are included in the proposed IRD framework: integral indexes; classical, robust and L-moments; the tail index; and the Geweke–Porter–Hudak fractional order estimator of the ARFIMA filter. The methods used during the calculations are described below.
3.1. Integral Measures
The MSE measure is calculated as the mean of the squared residua $\varepsilon_k = y_k - \hat{y}_k$ over some time period of $N$ samples:
$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \varepsilon_k^2.$$
It penalizes large errors, neglecting the smaller ones. This measure is significantly affected by outlying observations and exhibits a zero breakdown point [13]. The MAE index averages the absolute residuum values:
$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| \varepsilon_k \right|.$$
The MAE is less conservative, as it penalizes persisting small residua. Though its breakdown point is zero as well, it is robust against a portion of outliers. The MAPE is defined in a relative way:
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{k=1}^{N} \left| \frac{\varepsilon_k}{y_k} \right|.$$
Generally, it is difficult to define which error value is good and which model is proper enough. Lewis, in [14], proposed an interpretation of typical MAPE values, which is presented in Table 3.
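For illustration, the three integral measures can be computed as in the following minimal sketch; the function and variable names are ours, and the arrays stand for hypothetical observed and estimated shipping costs.

```python
import numpy as np

def integral_measures(y_true, y_pred):
    """Integral residuum measures of Section 3.1 (illustrative sketch)."""
    eps = y_true - y_pred                         # residua
    mse = np.mean(eps ** 2)                       # penalizes large errors
    mae = np.mean(np.abs(eps))                    # treats residua linearly
    mape = 100.0 * np.mean(np.abs(eps / y_true))  # relative error, in percent
    return mse, mae, mape
```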
3.2. Statistical Moments
This research follows a theoretical approach that assumes some distribution which correctly represents the underlying process. Such a probability density function (PDF) is utilized through its factors and moments (if they exist).
Let us assume that $x$ is a given time series with the mean $\mu$ and the $r$-th central moment $m_r = E\left[(x - \mu)^r\right]$, where $E[\cdot]$ denotes the expectation. The mean $\mu$ is the first moment, and the variance $\sigma^2 = m_2$ is the second one, where $\sigma$ denotes the standard deviation. These moments are often used together with the third one, i.e., the skewness $\gamma = m_3 / \sigma^3$, and the fourth one, the kurtosis $\kappa = m_4 / \sigma^4$. The skewness reflects data asymmetry and the kurtosis its concentration.
The existence of outlying observations in a time series causes its distribution to become fat-tailed [15]. This feature biases the moment estimation. The use of statistical factors in residuum analysis has quite a long history, following the legacy of Gauss [16,17]. They are strictly connected with the assumption of data normality and with normality tests.
3.3. L-Moments
The theory of L-moments was proposed by Hosking [18] as linear combinations of order statistics. It provides new descriptions of the distribution shape, helps to estimate the factors of an assumed statistical function and allows the testing of hypotheses about theoretical distributions. L-moments may be defined for any random variable whose expected value exists. They deliver almost unbiased statistics, even for a small sample, and are less sensitive to the distribution tails [19]. These properties are appreciated in the life sciences, although they might also be used in control engineering. Their calculation proceeds as follows. The data $x_{(i)}$, $i = 1, \ldots, N$, where $N$ is the number of samples, are ranked in ascending order. Next, the sample L-moments $l_1, \ldots, l_4$, the sample L-skewness $\tau_3$ and the L-kurtosis $\tau_4$ are evaluated as
$$l_1 = b_0, \quad l_2 = 2 b_1 - b_0, \quad l_3 = 6 b_2 - 6 b_1 + b_0, \quad l_4 = 20 b_3 - 30 b_2 + 12 b_1 - b_0, \quad \tau_3 = \frac{l_3}{l_2}, \quad \tau_4 = \frac{l_4}{l_2},$$
where
$$b_r = \frac{1}{N} \sum_{i=r+1}^{N} \frac{(i-1)(i-2)\cdots(i-r)}{(N-1)(N-2)\cdots(N-r)}\, x_{(i)}.$$
The statistical properties are reflected in the L-shift $l_1$, the L-scale $l_2$, the coefficient of L-variation (L-Cv) $\tau = l_2 / l_1$, the L-skewness $\tau_3$ and the L-kurtosis $\tau_4$. They help to fit a distribution to a dataset. L-skewness and L-kurtosis work as goodness-of-fit measures. They can be calculated for theoretical PDFs [20]; the normal distribution has $l_1 = \mu$, $l_2 = \sigma / \sqrt{\pi}$, $\tau_3 = 0$ and $\tau_4 \approx 0.1226$.
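The sample computation can be sketched as follows. This is an illustrative NumPy implementation of the $b_r$ route described above, not the code used in the experiments.

```python
import numpy as np

def sample_l_moments(x):
    """Sample L-moments l1..l4 and ratios tau3, tau4 (Hosking's b_r route)."""
    x = np.sort(np.asarray(x, dtype=float))       # ascending order statistics
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0                                       # L-shift
    l2 = 2 * b1 - b0                              # L-scale
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3 / l2, l4 / l2               # ..., tau3, tau4
```

For Gaussian data, the returned ratios should approach $\tau_3 = 0$ and $\tau_4 \approx 0.1226$.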
The L-moments deliver reliable estimates, especially for small samples and fat-tailed distributions. They form the backbone of the L-moment ratio diagrams, which support the fitting of distributions to empirical samples. The most common diagram uses the L-kurtosis ($\tau_4$) versus L-skewness ($\tau_3$) relationship [19]. Apart from that, L-moment diagrams are used to compare various samples originating from different sources in search of homogeneity [21,22]. These features constitute the research idea behind the proposed IRDs.
3.4. Robust Statistics
Robust statistics is taken into consideration to address the impact of outliers. Robust estimators acquired popularity with the works of Huber [23]. They allow the evaluation of the shift, the scale and the regression coefficients for data affected by outliers. This work utilizes the M-estimators with the logistic $\psi$-function implemented in the LIBRA toolbox [24].
M-estimators generalize the maximum likelihood (ML) estimator, which for a given density $f$ maximizes the log-likelihood $\sum_{i=1}^{N} \ln f(x_i)$. The location M-estimator $T_N$ is defined as a solution of
$$\sum_{i=1}^{N} \psi\!\left(\frac{x_i - T_N}{S_N}\right) = 0,$$
where $\psi$ is an influence function, $T_N$ is the location estimate and $S_N$ is an assumed scale. In a similar way, the scale M-estimator $S_N$ is defined as a solution of
$$\frac{1}{N} \sum_{i=1}^{N} \rho\!\left(\frac{x_i - T_N}{S_N}\right) = \kappa_\rho,$$
where $\rho$ is a loss function, $T_N$ is a preliminary location and $\kappa_\rho$ is a constant ensuring consistency. The work utilizes the logistic influence function, i.e., the ML score of the logistic distribution, given by
$$\psi(x) = \tanh\!\left(\frac{x}{2}\right),$$
together with the corresponding logistic loss function $\rho$.
The utilization of robust statistics is quite straightforward, as they form a natural extension of the statistical scale measures (the variance and the standard deviation) in the case of outliers [25], which occur frequently in real-life applications [26]. With their use, we do not bias our assessment with anomalies or erroneous records.
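For illustration, a location M-estimator with the logistic $\psi$-function can be realized with a simple iteratively reweighted scheme. The following is our own illustrative sketch under an assumed fixed MAD scale, not the LIBRA routine [24].

```python
import numpy as np

def m_location_logistic(x, tol=1e-8, max_iter=100):
    """Location M-estimator with the logistic psi(x) = tanh(x/2).

    Illustrative iteratively reweighted scheme with a fixed MAD scale;
    not the LIBRA [24] implementation.
    """
    x = np.asarray(x, dtype=float)
    t = np.median(x)                                   # robust start
    s = 1.4826 * np.median(np.abs(x - t))              # bias-corrected MAD
    for _ in range(max_iter):
        u = (x - t) / s
        u_safe = np.where(np.abs(u) < 1e-12, 1.0, u)
        # weights w(u) = psi(u)/u turn the psi-equation into a weighted mean
        w = np.where(np.abs(u) < 1e-12, 0.5, np.tanh(u_safe / 2.0) / u_safe)
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < tol:
            break
        t = t_new
    return t
```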
3.5. Moment Ratio Diagrams
Moment ratio diagrams graphically show the statistical properties of the considered time series in a plane. The MRD is a graphical representation, in Cartesian coordinates, of a pair of standardized moments. There are two versions [27]. The MRD($\gamma$, $\kappa$) shows the third standardized moment $\gamma$ (or its square $\gamma^2$) as the abscissa and the fourth standardized moment $\kappa$ as the ordinate, plotted upside down. There exists a theoretical limitation of the accessible area, as $\kappa \geq \gamma^2 + 1$. The locus corresponding to a PDF can be a point, a curve or a region, depending on the number of shape parameters. PDFs lacking a shape factor (like the Gaussian or Laplace ones) are represented by a point, and functions with one shape coefficient are represented by a curve. Regions reflect functions with two shape factors. The second type of diagram, the MRD($\sigma^2$, $\gamma$), represents the variance $\sigma^2$ as the abscissa and the skewness $\gamma$ as the ordinate.
Moment ratio diagrams thus constitute an early formulation of a multi-criteria assessment approach, though in a statistical context. This research uses this idea in the residual analysis.
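For illustration, the construction of an MRD($\gamma$, $\kappa$) scatter for a set of model residua can be sketched as follows; the dictionary of residua and all names are hypothetical.

```python
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

def mrd_plot(residua_by_model):
    """Scatter the (skewness, kurtosis) pairs of model residua in a plane."""
    for name, eps in residua_by_model.items():
        g = skew(eps)
        k = kurtosis(eps, fisher=False)    # plain (non-excess) kurtosis
        plt.scatter(g, k)
        plt.annotate(name, (g, k))
    plt.gca().invert_yaxis()               # kurtosis plotted upside down
    plt.xlabel("skewness")
    plt.ylabel("kurtosis")
    plt.show()
```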
3.6. L-Moment Ratio Diagrams
L-moments were introduced by Hosking [18]. The LMRDs are popular in extreme value analysis. They allow the identification of a proper distribution for empirical observations. The LMRD($\tau_3$, $\tau_4$) is the most common one; it shows the L-kurtosis $\tau_4$ versus the L-skewness $\tau_3$. Similarly to the MRDs, one can confront the empirical data with a theoretical PDF candidate [19]. A blank diagram with shapes (points or curves) for some theoretical PDFs is presented in Figure 1.
Similarly to the MRD($\sigma^2$, $\gamma$), there exist LMRD versions that combine in a single plot the scale and the skewness, such as the LMRD($l_2$, $\tau_3$). As control performance assessment (CPA) analyses frequently use kurtosis, it is proposed to investigate a new formulation: the LMRD($l_2$, $\tau_4$). The LMRDs are the successors of the MRDs and the predecessors of the IRDs, and they should be considered from that perspective.
3.7. The α-Stable Distribution
Apart from the specific robust estimators or L-moments, one may use other distributions. Stable functions deliver an alternative set of statistical measures [28]. The $\alpha$-stable distribution is expressed by the characteristic function
$$\varphi(t) = \exp\!\left\{ i \delta t - |\gamma t|^{\alpha} \left( 1 - i \beta\, \mathrm{sgn}(t)\, \omega(t, \alpha) \right) \right\},$$
where
$$\omega(t, \alpha) = \begin{cases} \tan \frac{\pi \alpha}{2}, & \alpha \neq 1, \\ -\frac{2}{\pi} \ln |t|, & \alpha = 1. \end{cases}$$
The factor $0 < \alpha \leq 2$ is called the index of stability (stability exponent), $-1 \leq \beta \leq 1$ is the skewness factor, $\delta$ is the shift and $\gamma > 0$ is the scale. Thus, the $\alpha$-stable distribution has one shift factor, one scale and two shape coefficients: $\alpha$ and $\beta$.
The $\alpha$-stable distribution has an increasing potential in assessment approaches [29], as it allows measuring the data diversity (the scale factor $\gamma$) and other shaping factors, such as the skewness ($\beta$) and the tailedness ($\alpha$). These features nominate its factors as potential measures in the residual analysis as well. Their application might be constrained for data which do not fall into the stable family, which should be validated before use.
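A minimal sketch of such a fit, using SciPy's levy_stable distribution, is given below; the heavy-tailed Student-t stand-in data are only an assumption for the example, and the maximum-likelihood fit may be slow for large samples.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
eps = rng.standard_t(df=3, size=200)       # heavy-tailed stand-in residua

# Maximum-likelihood fit of the four stable factors (may take a while).
alpha, beta, loc, scale = levy_stable.fit(eps)
print(f"stability exponent alpha = {alpha:.3f}")  # tailedness
print(f"skewness factor beta     = {beta:.3f}")   # asymmetry
print(f"shift delta              = {loc:.3f}")
print(f"scale gamma              = {scale:.3f}")
```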
3.8. Tail Index
Statistics frequently use the law of large numbers and the central limit theorem. Once data exhibits outliers, which is revealed in the form of tails, the majority of the assumptions made are not met. In such a case the knowledge of where the tail starts and which observations are located in the tail plays an important role [
30,
31]. There are many methods to estimate it and the tail index, denoted as
, is the most promising one [
32]. There are quite a few tail index estimation approaches, with two leading ones: the Hill [
33] and Huisman estimator [
34]. This work uses the second one.
The tail index as such is an extension of the $\alpha$-stable distribution's stability exponent, and it measures where the distribution tail starts. This perspective nominates the tail index as a potential measure of the data properties and of their contamination with anomalies.
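For illustration, the classical Hill estimator [33] can be sketched as follows; the Huisman et al. estimator [34] used in this work additionally reduces the small-sample bias by regressing such estimates over a range of k.

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill estimate of the tail index from the k largest absolute values."""
    x = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]  # descending
    assert 0 < k < len(x), "k must leave at least one threshold observation"
    logs = np.log(x[:k + 1])
    gamma = np.mean(logs[:k] - logs[k])    # mean log-excess over x_(k+1)
    return 1.0 / gamma                     # tail index = 1 / gamma
```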
3.9. ARFIMA Models and Fractional Order
The ARFIMA time series is treated as an extension of the classical ARIMA regression models; see [35]. The process $y_t$ is denoted as ARFIMA($p$, $d$, $q$) if
$$\Phi(B) (1 - B)^{d} y_t = \Theta(B) \varepsilon_t,$$
where $\Phi(B)$ and $\Theta(B)$ are polynomials in the discrete time delay (backshift) operator $B$, and $\varepsilon_t$ is random noise with finite or infinite variance. We use Gaussian noise in this research. The fractional order $d$ refers to the process memory.
For $0 < d < 0.5$, the process exhibits long memory or long-range positive dependence (persistence). The process has intermediate memory (anti-persistence), or long-range negative dependence, when $-0.5 < d < 0$. The process has short memory for $d = 0$; it is then a stationary and invertible ARMA process. An ARFIMA time series is obtained by $d$-fractional integration of a classical ARMA process. The $d$-fractional integration through the operator $(1 - B)^{-d}$ causes dependence between observations, even when they are far apart in time.
The Geweke–Porter–Hudak (GPH) estimator proposed in [36] uses a semi-parametric procedure to estimate the memory parameter $d$ of an ARFIMA process $y_t$ from the log-periodogram regression
$$\ln I(\omega_j) = \beta_0 - d \ln\!\left[ 4 \sin^2\!\left(\frac{\omega_j}{2}\right) \right] + e_j.$$
Next, ordinary least squares (LS) is applied to estimate $d$ from the periodogram ordinates evaluated at the fundamental frequencies $\omega_j = \frac{2 \pi j}{N}$, $j = 1, \ldots, m$, where $m$ is the largest integer in $N^{\theta}$ and $0 < \theta < 1$ is a constant. The discrete Fourier transform $w(\omega_j)$ is evaluated as
$$w(\omega_j) = \frac{1}{\sqrt{2 \pi N}} \sum_{t=1}^{N} y_t e^{-i t \omega_j}.$$
Application of the least squares algorithm to Equation (16) yields the final formulation
$$\hat{d} = \frac{\sum_{j=1}^{m} \left( z_j - \bar{z} \right) \ln I(\omega_j)}{\sum_{j=1}^{m} \left( z_j - \bar{z} \right)^2}, \qquad z_j = -\ln\!\left[ 4 \sin^2\!\left(\frac{\omega_j}{2}\right) \right],$$
with $I(\omega_j) = |w(\omega_j)|^2$ being the periodogram and $\bar{z}$ the mean of the regressors. The GPH algorithm calculates $\hat{d}$ without explicit assumptions about the ARMA polynomial orders. We use the implementation of [37].
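An illustrative implementation of the GPH regression, not the code of [37], can be sketched as follows; the bandwidth exponent theta is an assumed tuning constant.

```python
import numpy as np

def gph_estimator(y, theta=0.5):
    """GPH log-periodogram estimate of the ARFIMA memory parameter d."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    m = int(n ** theta)                            # bandwidth: lowest m freqs
    j = np.arange(1, m + 1)
    w = 2.0 * np.pi * j / n                        # fundamental frequencies
    dft = np.fft.fft(y - y.mean())[1:m + 1]
    periodogram = np.abs(dft) ** 2 / (2.0 * np.pi * n)
    z = -np.log(4.0 * np.sin(w / 2.0) ** 2)        # regressor
    zc = z - z.mean()
    return np.sum(zc * np.log(periodogram)) / np.sum(zc ** 2)
```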
The use of the Geweke–Porter–Hudak fractional order estimation in the assessment task is relatively new [38] and still requires much attention. Nonetheless, the first results are quite promising, which is why it is included in the analysis. The argumentation could also be extended to the fractal, multi-fractal and data persistence time series assessment perspective of this estimator, which finds earlier references [39,40].
5. The Results
We start the analysis with the calculation of the common integral measures. Table 4 compares the obtained values. Even a draft analysis allows for an interesting observation: each measure indicates a different model as the best one. The MSE highly penalizes large errors [76] and is sensitive to any outlying occurrences, while the MAE is less conservative. It relates more closely to smaller variations and to economic considerations [77].
Moreover, the relative indexes point out or penalize yet other methods. Therefore, the decision about the model that should be chosen highly depends on the selected index. Practice shows that this decision is often made unconsciously, which might be costly in further practical applications.
In Section 3, we describe various performance indexes that can be found in the literature. Moreover, we suggest performing a visual multi-criteria analysis using the so-called Index Ratio Diagrams (IRD). This idea follows the notion of moment ratio diagrams, known in statistics.
We start the analysis with the classical moment ratio diagram that shows the relationship between the third and the fourth moment, i.e., between the skewness, denoted as $\gamma$, and the kurtosis $\kappa$. Figure 3 presents the respective diagram. Each shaded circle denotes one model, labeled with a blue number according to the notation sketched in Table 4. The circles are shaded according to some other index, in this case the MAE. Generally, such a drawing brings relative visual information, while we still expect to obtain a single performance indicator. Actually, we may assume that the best tuning is reflected by the shortest distance from some optimal point; in this case, we may assume the point representing the normal distribution.
As we wish to obtain this value independently of the particular index scales, we scale the coordinates and obtain the following IRD distance index $d_{\mathrm{IRD}}(x, y)$ for the scaled values $x$ and $y$:
$$d_{\mathrm{IRD}}(x, y) = \sqrt{x^2 + y^2},$$
where $x$ and $y$ denote the scaled deviations of the two indexes from the assumed optimal point. We name this index the Aggregated Distance Measure (ADiMe). The assumed scaling factors, which are used in each case, are denoted on the plots.
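For illustration, the ADiMe computation for a single model can be sketched as follows; the function name, the optimal-point argument and the scaling factors are ours and reflect the Euclidean-distance reading above.

```python
import numpy as np

def adime(ix, iy, opt=(0.0, 0.0), scale=(1.0, 1.0)):
    """Aggregated Distance Measure (ADiMe) for one IRD point (sketch).

    ix, iy  - the two index values of a model,
    opt     - assumed optimal point of the diagram (e.g., (0, 2) for the
              stable (beta, alpha) plane, where normality resides),
    scale   - per-axis scaling factors denoted on the plots.
    """
    x = (ix - opt[0]) / scale[0]
    y = (iy - opt[1]) / scale[1]
    return float(np.hypot(x, y))           # Euclidean distance after scaling
```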
The IRD($\gamma$, $\kappa$) diagram points out model no. 24, i.e., the Histogram Gradient Boosting Regression (HGBoostRT), as the best modeling approach. We may observe that the selected model is the same as the one selected by the MSE index. The value of the IRD distance index equals 0.674. Interestingly, the second-best model is the Quadratic Support Vector Machine (QSVM), which is not appreciated by any of the integral indexes.
Figure 4 presents the same IRD($\gamma$, $\kappa$) relationship, but with different shading, conducted according to the relative MAPE index. It was decided that all the consecutive diagrams are shaded according to the MAE index.
The next two plots show the IRD diagrams depicting the relationship between the standard deviation (the second moment) and the skewness. Figure 5 uses the classical Gaussian standard deviation estimator, while Figure 6 uses its robust counterpart. In both cases, the HGBoostRT model is indicated as the best one. In contrast, the next-best models differ: the Fine Regression Tree (F-DTR) and the Coarse Regression Tree (C-DTR), respectively.
Figure 7 presents the standard L-moment ratio diagram, i.e., the IRD($\tau_3$, $\tau_4$). As the L-skewness and the L-kurtosis are normalized, there is no need for any further scaling. This approach indicates the k-Nearest Neighbors Regressor (k-NN). Interestingly, the two next-best models are the Rational Quadratic Gaussian Process Regression (RGPR) and the Exponential Gaussian Process Regression (EGPR). The favoring of these models is intriguing, because they are not indicated by the other indicators. We may pose the hypothesis that their residua exhibit, in general, neutral statistical properties. This issue requires further investigation.
The following two diagrams take into account the L-$l_2$ scale measure together with the L-skewness in Figure 8 and the L-kurtosis in Figure 9. The IRD(L-$l_2$, $\tau_3$) approach points out the Coarse Regression Tree (C-DTR), which is highly favored by all integral measures. On the contrary, the IRD(L-$l_2$, $\tau_4$) selects the Orthogonal Matching Pursuit (OMP) method. It must be noted that the selected best models lie quite close to the runner-up ones, and thus the indications are not very decisive.
The next two plots combine the L-scale factor L-$l_2$ with the alternative measures of the tail: the tail index in Figure 10 and the Geweke–Porter–Hudak ARFIMA filter fractional order estimate $\hat{d}$ in Figure 11. Both diagrams select the same modeling approach, the Bagged Regression Trees (BRTR), which is highly favored by the MAE index. In both cases, the second-best model is the Extremely Randomized Trees (ERTR).
Finally, the IRD diagrams are constructed using the factors of the $\alpha$-stable distribution. Figure 12 presents the model selection according to the combination of the skewness $\beta$ and the stability exponent $\alpha$. The considered optimal point is $(\beta, \alpha) = (0, 2)$, as this point indicates the normal distribution. In that sense, the Stepwise Linear Regression (SLR) model is selected. The next-best models are the RGPR (Rational Quadratic Gaussian Process Regression) and the k-NN.
The last two plots present the combination of the scale factor $\gamma$ together with the stability exponent $\alpha$ in Figure 13 and with the skewness $\beta$ in Figure 14. The first one selects the Bagged Regression Trees (BRTR) approach, while the latter selects the Extremely Randomized Trees (ERTR). Both modeling approaches are strongly favored by the integral indexes.
It should be noted that the difference between these two diagrams is quite significant. The shape factor $\alpha$ is responsible for the tails, i.e., it informs about the ratio of outlying observations, while the second shape factor, the skewness $\beta$, measures the residuum asymmetry. Keeping this difference in mind, we may select which feature of the modeling error is considered the most important for us.
Finally, let us compare all the approaches, the features they favor and the models they indicate. Table 5 aggregates the features favored by each of the IRD diagrams together with the selected models.
Generally, the regression tree approaches are the best fitted to the considered estimation task. However, each method has different features, and an objective comparison is highly relative and sensitive to the selection of the index. The Histogram Gradient Boosting Regression method captures the outliers and utilizes them in the estimation, while the k-NN approach focuses on the bulk of the data and neglects the outliers.
Concluding, one should first define which feature of the estimation matters the most and, according to that, select the assessment methodology and the indexes used. The scaling indexes and the visual inspection of the IRD diagrams deliver an additional degree of freedom, allowing for deeper insight into the properties of the considered model.
6. Conclusions and Further Research
This work focuses on two aspects. It addresses the issue of the cost estimation for short routes of an external FTL fleet. This subject is hardly recognized in the literature, it is genuinely difficult, and it has large practical importance.
The second contribution, in our opinion the most important one, is connected with the model assessment. Assessing a model is highly subjective, as we do not have a single universal measure. Each index favors different properties. If we neglect that fact, the resulting model can miss our expectations without any clue as to why. The model assessment should not be limited to a simple comparison of single measure values; a deeper investigation and an appropriate index selection, even using visual inspection, might help. We propose a novel approach using the Index Ratio Diagrams (IRD) and the resulting Aggregated Distance Measure (ADiMe).
The practical perspective of this research has three dimensions. It bridges the gap between statistics and machine learning, as nowadays researchers tend to forget the potential lying in statistical analysis. The review of the ML-based estimation reports and papers does not deliver positive conclusions. Almost always, the authors do not try to assess why their model is so good or so bad. They simply report a residuum measure index, often using one single value. They do not check whether their selection captures the data properties. This work aims to show that the assessment task is not one-dimensional and that the analysis of the nuances can improve both the work and the knowledge. This fact has further, more significant consequences. The obtained models might not be as good as they appear, and therefore the industrial end-user can be frustrated by the results. That might lead to a lack of satisfaction, no further use of the tool and a general resistance to new ideas.
Finally, this work offers a method of multi-criteria residual analysis accompanied by new, almost unknown measuring opportunities that can highlight currently unobserved properties.
The proposed method is not universal, and some limitations can be observed. First of all, especially at the level of presenting the results to the end-user, it might be challenging to explain the conclusions. Also, the selection of the scaling index values might be subjective, and work on the normalization of the results is still required. It would be interesting to conduct more research aimed at the synthesis of the observations in the direction of Pareto-front analysis. The connection between the IRD observations, their explanation and the way to improve the model should be investigated.
The analysis is still not over. The model assessment, though considered simple, is not as simple as perceived. One index is not equal to another, and this mistake can lead to costly consequences. Many subjects remain open. How can we assess models using various criteria? How can we combine our expectations about the model features with a proper performance index selection? How can we make the residuum analysis simple, clear and comparable?