On Comparing Cross-Validated Forecasting Models with a Novel Fuzzy-TOPSIS Metric: A COVID-19 Case Study

Abstract: Time series cross-validation is a technique to select forecasting models. Despite the sophistication of cross-validation over single test/training splits, traditional and independent metrics, such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), are commonly used to assess the model's accuracy. However, what if decision-makers have different model fitting expectations for each moment of a time series? What if the precision of the forecasted values is also important? This is the case of predicting COVID-19 in Amapá, a Brazilian state in the Amazon rainforest. Due to the lack of hospital capacities, a model that promptly and precisely responds to notable ups and downs in the number of cases may be more desired than average models that only have good performances in more frequent and calm circumstances. In line with this, this paper proposes a hybridization of the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and fuzzy sets to create a similarity metric, the closeness coefficient (CC), that enables relative comparisons of forecasting models under heterogeneous fitting expectations and also considers volatility in the predictions. We present a case study using three parametric and three machine learning models commonly used to forecast COVID-19 numbers. The results indicate that the introduced fuzzy similarity metric is a more informative performance assessment metric, especially when using time series cross-validation.


Introduction
By 27 October 2021, almost two years after the initial occurrence of SARS-COV-2, the World Health Organization (WHO) had announced a total of 219.4 million cases worldwide and 5 million accumulated deaths due to coronavirus disease [1]. Indeed, by 22 November 2021, some countries in Europe had announced partial or complete lockdowns aimed at overcoming the infections spreading across Europe, notwithstanding the economic sustainability problems involved [2]. After 80.2 thousand confirmed cases worldwide in almost two months, Brazilian authorities confirmed the country's first SARS-COV-2 infection on 25 February 2020 [3]. After a lag of two months, Brazil saw its initial pandemic numbers soar. At the end of October 2021, Brazil had the third-largest number of confirmed cases globally (21.75 million) and the second-highest number of deaths (606 thousand). Moreover, the number of daily new cases and deaths started decreasing only in June 2021, after mass vaccinations took effect [4].
In Brazil, huge cities such as São Paulo and Rio de Janeiro attracted media attention mainly due to their population, economic concentration, and the consequent dimension of their SARS-COV-2 numbers. Nevertheless, the pandemic hit other Brazilian regions even harder, such as the North, which is covered mainly by the Amazon forest and holds about half of the Brazilian territory. Although the population density (4.78 inh./km²) and concentration (8.8%) in the North region are low, by the end of September 2021 it was responsible for 30.3% of all Brazilian SARS-COV-2 confirmed cases [5]. Moreover, the death risk by standardized rates in the North region capitals was significantly higher [6,7] than in the rest of the country, mainly due to poor sanitary and social conditions.
In the North region of Brazil lies a state like an island, surrounded by and carved out of the Amazon rainforest, as it holds no land traffic routes to any other state of Brazil (see Figure 1). The Amapá state has just 877,613 residents living in an area larger than England, which is, in turn, 67 times more densely populated. As a result, Amapá has already experienced an overload of mortality from transmissible infections, predominantly amidst indigenous groups, as have other parts of the Brazilian Amazon [8]. In addition, despite previous government efforts, several social and health challenges remain for the many people residing in the state, such as public sanitation and minimal access to clean water [9]. This particular scenario makes Amapá receptive to SARS-COV-2 and other disease outbreaks that may occur. Until August 2020, according to the state's official data, the state reported the second-highest infection rate in Brazil [5]. Consequently, in late May 2021, the capital of Amapá, Macapá, suffered the collapse of its healthcare system due to SARS-COV-2. Researchers have been presenting several models to assist local authorities in Amapá and worldwide in many fields [10-15], some of them assisting in forecasting COVID-19 numbers, such as when the outbreak will peak, how long it will last, how many people will be infected or die, and how hospital demands will evolve [16-19]. The models vary from univariate [16,20,21] to multivariate approaches [18,22] and from ex ante to ex post [23-25]. Other variables, such as the number of PCR tests, mobility data, meteorological data, and internet activity, are also commonly forecasted or used as exogenous variables while predicting others [18,23,24,26].
The RMSE is applicable for evaluating the overall accuracy of forecasts while punishing significant forecast errors in square order [50]. The MAE is popular because it is straightforward to calculate and understand; however, it can produce biased results when datasets contain meaningful outliers, and the MSE shares the same limitation [51-53]. Another widely used evaluation measure is the MAPE, due to its scale-independency and interpretability; however, MAPE has the notable limitation of producing infinite or undefined values for zero or close-to-zero observations [51,53,54]. Other metrics used by researchers to assess the performance of machine learning methods are R², adjusted R², precision, recall, F1-score, and accuracy, as well as the Matthews Correlation Coefficient (MCC) and the area under the receiver operating characteristic (ROC) curve, also known as AUC [44,48,55]. A forecasting method whose R², adjusted R², F1-score, or AUC is closest to 1 is the one that should be chosen. Higher values for metrics that carry the word "Mean" in their names indicate poor performance of a given algorithm [52,53,55].
Despite the variety of existing metrics, none of them seem to be specially tailored to cross-validation forecasting approaches or to assessing the multiple forecasted values produced for each observation in the testing sets. Metrics based only on averages may not exploit other information that the multiplicity of forecasted values may bring [56], such as variability and different fitting expectations for each data point [57]. In fact, in practice, frequently used forecasting metrics are also correlated, meaning that the performance of one model according to a given metric will be linearly correlated to the others [55].
The objective of this paper is to present a novel metric that, in addition to averaging errors, also deals with heterogeneous fitting expectations, capturing the volatility of the forecasted values during cross-validation. As a consequence, this measure can relatively assess the performance of COVID-19 forecasting models. To do so, we adopt a similarity metric, the closeness coefficient (CC), which we take from Fuzzy-TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution), an outranking method generally used in the context of Multi-Criteria Decision Making (MCDM). A metric responsive to fitting expectations is one that can capture different perspectives or decision criteria over a set of competing forecasting models. In our case, the perspectives are different periods of a time series, such as periods of increase, stability, or decrease in the number of COVID-19 cases. Furthermore, by using triangular fuzzy sets, the metric can consider the volatility in the forecasted data related to the output of the models. Finally, it is a relative metric that only makes sense when more than one model is in the pool of potential models to be chosen. We exemplify the usage of this metric with the case of selecting COVID-19 forecasting models for the Amapá state, Brazil.
The main justification for using TOPSIS over other well-consolidated MCDM methods is that it operates with the concept of distance to the ideal solution [58], which is a key aspect of the metric we propose. Other distance-to-ideal-solution methods, such as VIKOR (VIseKriterijumska Optimizacija I Kompromisno Resenje; in English, Multicriteria Optimization and Compromise Solution), could also be used to the same end, potentially producing similar results. However, TOPSIS is still preferred due to its ease of use and level of consolidation among scholars.
To the best of our knowledge, this metric has never been presented before to assess forecasting models, especially in the context of COVID-19 or other epidemics. The main contributions of this paper and the proposed metric are summarized as follows:
• We introduce a novel similarity metric that, in addition to averaging errors, can also capture heterogeneous fitting expectations and volatility in the forecasted values.
• The metric is suitable for comparing forecasting models trained and tested with cross-validation techniques, such as rolling forward forecasting.
• The metric may potentially bring positive implications for automated model selection in a robust and sustainable manner.
• The metric also enables the conjunction of data-driven and expert-driven models, such as TOPSIS and machine learning forecasting models, which are not commonly tightly coupled to produce decision support systems [58].
• To exemplify the usage of the metric, we present a case study that compares three classical and three machine learning forecasting models: Holt-Winters, ARIMA, Prophet, KNN, RFR, and SVR.
The proposed metric also has limitations, especially compared with more traditional metrics that are widely used to compare forecasting models, such as MAE, RMSE, SMAPE, and others:
• Since it is a relative metric, it is not suitable for assessing standalone models' performances or comparing models applied to different time series.
• The metric is more complex and demands more effort to set up than other classical metrics.
• It also requires judgments from decision-makers, making it more time-consuming.
The rest of this paper is organized as follows: Section 2 briefly introduces the TOPSIS and fuzzy-TOPSIS models, as well as the proposed approach, which is an adaptation of fuzzy-TOPSIS to rank forecasting models. Section 3 presents an actual case study that ranks models for the reality of a Brazilian state carved out of the Amazon region during the COVID-19 pandemic. In Section 4, we discuss the results and compare the proposed metric to the ones already diffused in the literature. Finally, we provide the conclusions in Section 5.

Methodology
This section introduces the models that form the base of the proposed forecasting performance metric, as well as the fuzzy-TOPSIS model adapted to rank forecasting models.

TOPSIS
TOPSIS is an MCDM method that evaluates a given alternative by its distance to ideal and non-ideal concepts (or solutions). Hwang and Yoon [59] presented TOPSIS in 1981, and since then, researchers have primarily employed it to rank alternatives in different fields of study, such as business [58], construction [60], healthcare [61], and mining [62]. Hwang and Yoon [59] proposed the following step-by-step TOPSIS procedure to select the most suitable alternatives given a list of decision criteria.
Step 1: Let $A = [a_{ij}]_{m \times n}$ be a decision matrix with m rows and n columns. The element in the i-th row and j-th column, $a_{ij}$, indicates the value given to alternative i regarding decision criterion j. Consider also an n-dimensional vector C, where each position $c_j$ corresponds to the weight assigned to decision criterion j, with $\sum_{j=1}^{n} c_j = 1$.
Step 2: Normalize and weight the elements of A by dividing them by the square root of the column's sum of squares and multiplying them by the corresponding criterion weight, as in Equation (1):
$$\bar{a}_{ij} = c_j \frac{a_{ij}}{\sqrt{\sum_{i=1}^{m} a_{ij}^2}}. \qquad (1)$$
Thus, a matrix $\bar{A} = [\bar{a}_{ij}]_{m \times n}$ with all normalized and weighted values is set, as in Equation (2).
Step 3: Let $V^+$ and $V^-$ be n-dimensional vectors corresponding to the Positive and Negative Ideal Solutions, respectively. For each criterion j, $v_j^+$ takes the best normalized value among all alternatives (the maximum for benefit criteria, the minimum for cost criteria), and $v_j^-$ takes the worst.
Step 4: Calculate the separation measures $d_i^+$ and $d_i^-$ between each row vector $\bar{a}_i$ and $V^+$ and $V^-$, respectively, using the Euclidean distance, i.e., $d_i^{\pm} = \sqrt{\sum_{j=1}^{n} (\bar{a}_{ij} - v_j^{\pm})^2}$.
Step 5: Calculate the closeness coefficient (CC) of each alternative i to the ideal solutions, as in Equation (7):
$$CC_i = \frac{d_i^-}{d_i^+ + d_i^-}. \qquad (7)$$
Then, rank the alternatives by sorting the CC values in decreasing order.
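For concreteness, the five steps above can be condensed into a short Python sketch. This is our own illustration, not code from the original TOPSIS literature; the decision matrix, weights, and benefit mask in the example are made up:

```python
import numpy as np

def topsis(A, weights, benefit):
    """Minimal TOPSIS (Steps 1-5). A: m x n decision matrix,
    weights: n criterion weights summing to 1,
    benefit: boolean mask, True for benefit criteria, False for cost."""
    # Step 2: vector-normalize each column and apply the criterion weights
    V = weights * A / np.sqrt((A ** 2).sum(axis=0))
    # Step 3: positive/negative ideal solutions, per criterion
    v_pos = np.where(benefit, V.max(axis=0), V.min(axis=0))
    v_neg = np.where(benefit, V.min(axis=0), V.max(axis=0))
    # Step 4: Euclidean separation of each alternative from the ideals
    d_pos = np.sqrt(((V - v_pos) ** 2).sum(axis=1))
    d_neg = np.sqrt(((V - v_neg) ** 2).sum(axis=1))
    # Step 5: closeness coefficient; higher is better
    return d_neg / (d_pos + d_neg)

# Example: 3 alternatives, 2 criteria (first benefit, second cost)
A = np.array([[7.0, 9.0], [8.0, 6.0], [9.0, 8.0]])
cc = topsis(A, np.array([0.6, 0.4]), np.array([True, False]))
print(cc.argsort()[::-1])  # alternatives ranked best-first
```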
Under several circumstances, crisp data are inadequate to model real-life situations. Human judgments are often vague, and people cannot express their preferences with accurate numerical values. Thus, we might use linguistic assessments, a more realistic approach, rather than numerical values to produce the ratings and weights of the criteria in the problem [63,64]. Furthermore, crisp averages may lose information carried by a set of samples of the same quantitative observation [57]. To tackle those gaps, researchers have mainly applied fuzzy set theory to TOPSIS approaches. A positive fuzzy number, $\tilde{x}$, is characterized by a membership function, $\mu_{\tilde{x}}$, which takes values in the interval [0, 1]. Fuzzy set theory is an extension of classical set theory, and its operations extend the fundamental set operations, such as complement [65,66].
The triangular function is the most common membership function in fuzzy-MCDM applications [58]. In contrast to more complex and novel membership functions and types of fuzzy sets, triangular fuzzy numbers are the easiest to learn and understand. Thus, they are traditionally suitable for first fuzzification attempts on novel MCDM methods where uncertainty is primarily tackled [58,66]. For instance, classic MCDM methods, such as AHP [67,68] and TOPSIS [63,69], debuted in the fuzzy environment by employing the concept of triangular fuzzy numbers. For a given fuzzy number $\tilde{x}$, its associated membership function is defined by a lower limit ($x_1$), a middle value ($x_2$), and an upper limit ($x_3$), as illustrated in Figure 2, where:
$$\mu_{\tilde{x}}(x) = \begin{cases} 0, & x < x_1 \\ \dfrac{x - x_1}{x_2 - x_1}, & x_1 \leq x \leq x_2 \\ \dfrac{x_3 - x}{x_3 - x_2}, & x_2 \leq x \leq x_3 \\ 0, & x > x_3. \end{cases}$$
Chen [63] performed the first extension of TOPSIS with triangular fuzzy numbers. This approach is the baseline for the adaptation of Fuzzy-TOPSIS we propose later on. The procedure of Fuzzy-TOPSIS is similar to classical TOPSIS and is summarized as follows [69].
Step 1: Let $\tilde{A} = [\tilde{a}_{ij}]_{m \times n}$ be the fuzzy decision matrix, with $\tilde{a}_{ij}$ being a triplet $(a_{ij1}, a_{ij2}, a_{ij3})$, where $a_{ij1}$ is the lower limit, $a_{ij2}$ the middle value, and $a_{ij3}$ the upper value of $\tilde{a}_{ij}$. Consider also an n-dimensional vector $\tilde{C}$, where each element $\tilde{c}_j$ is a triplet $(c_{j1}, c_{j2}, c_{j3})$, with $c_{j1}$, $c_{j2}$, and $c_{j3}$ being the lower, middle, and upper values of $\tilde{c}_j$, respectively.
Step 2: Compute the normalized fuzzy decision matrix, $\tilde{R} = [\tilde{r}_{ij}]_{m \times n}$, where each element $\tilde{r}_{ij}$ is calculated according to Equation (9); for benefit criteria (cost criteria are handled analogously):
$$\tilde{r}_{ij} = \left( \frac{a_{ij1}}{a^*_{j3}}, \frac{a_{ij2}}{a^*_{j3}}, \frac{a_{ij3}}{a^*_{j3}} \right), \quad a^*_{j3} = \max_i a_{ij3}. \qquad (9)$$
Step 5: Compute the positive ($d_i^*$) and negative ($d_i^-$) distances of each alternative i to its FPIS and FNIS, respectively.
Step 6: Compute the closeness coefficient of each alternative i, $CC_i$, according to Equation (16):
$$CC_i = \frac{d_i^-}{d_i^* + d_i^-}. \qquad (16)$$
Then, sort the alternatives by decreasing value of CC.
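The distance between two triangular fuzzy numbers used in Steps 5 and 6 (and later in our Step 10) is commonly computed with the vertex method, $d(\tilde{a}, \tilde{b}) = \sqrt{\frac{1}{3}\left[(a_1-b_1)^2 + (a_2-b_2)^2 + (a_3-b_3)^2\right]}$. A minimal Python sketch, our illustration assuming this standard formula:

```python
import numpy as np

def vertex_distance(a, b):
    """Vertex-method distance between two triangular fuzzy numbers,
    each given as a (lower, middle, upper) triplet."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sqrt(((a - b) ** 2).mean())  # mean over 3 terms gives the 1/3 factor

def closeness_coefficient(d_pos, d_neg):
    """Step 6: closeness coefficient from the distances to FPIS and FNIS."""
    return d_neg / (d_pos + d_neg)

# Example: a fuzzy rating compared against FPIS = (1, 1, 1) and FNIS = (0, 0, 0)
d_pos = vertex_distance((0.5, 0.7, 0.9), (1.0, 1.0, 1.0))
d_neg = vertex_distance((0.5, 0.7, 0.9), (0.0, 0.0, 0.0))
print(closeness_coefficient(d_pos, d_neg))
```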

The Proposed Approach
The proposed model is essentially an adaptation of the Fuzzy-TOPSIS model proposed by Chen [63]. Steps 1-5 organize the decision environment and create data through forecasting; to the best of our knowledge, these steps are not found in other TOPSIS-based models. Steps 6-11 directly use the logic and equations proposed by Chen, but with several adjustments to accommodate the values predicted by the forecasting models. We present these adjustments within each stage of the proposed model. The step-by-step procedure of the introduced model is given next.
Step 1: Divide the time series observations into two portions: a fixed training set and a fluctuating testing/training set, as represented in Figure 4.

Step 2: Segment the fluctuating testing/training set into n categories according to the decision-makers' preferences and/or the time series characteristics. Figure 5 exemplifies n = 4. Please notice that non-successive observations may fall into the same category, since a time series may display similar behaviors during different periods in time.

Step 3: Aggregate the observations according to the n categories and weight the observations inside each category according to the decision-makers' preferences. Table 1 shows the weighting scale, containing the conversion from linguistic to crisp and fuzzy weights, used to assess each category. All data points that fall into a category receive the weight corresponding to that category. Thus, for the $p_k$ observations inside category k, the weights are equal to the weight given to the category, so $c^k = (c^k_1, c^k_2, \cdots, c^k_{p_k})$. The vector with the weights for all r observations in the fluctuating testing/training set is given in Equation (17):
$$C = (c^1, c^2, \ldots, c^n), \qquad (17)$$
with $r = p_1 + p_2 + \cdots + p_n$. In Table 1, the column Linguistic refers to the linguistic weights that the decision-makers may assign to each criterion, the column Crisp refers to the corresponding crisp scale that is commonly used in MCDM methods, and the column Fuzzy corresponds to the fuzzy scale, which we use to convert the linguistic variables into triangular fuzzy numbers.
Step 4: Select the m forecasting models to be compared and assemble the decision hierarchy (see Figure 6).

Step 5: For the time series defined in Step 1, run the models q times each, where q = r − w + 1, with w being the forecasting window, which is also the testing set. Thus, for each new model run, add one observation to the training set, until only w observations remain for testing. Figure 7 graphically exemplifies how the training set grows in time and how the origin of the testing set moves one step ahead with each new run of the cross-validation.

Figure 7. Forecasting window: with each run (1, 2, ..., q), the training data grow by one observation and the w-step test-data window rolls one step forward.
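Step 5 can be sketched as a simple walk-forward loop, with q = r − w + 1 runs. The code below is a schematic, not the authors' implementation; `fit_and_forecast` is a hypothetical placeholder for any of the compared models:

```python
import numpy as np

def rolling_forward_runs(series, fixed_train_len, w):
    """Walk-forward cross-validation (Step 5): each run adds one
    observation to the training set and forecasts the next w points.
    `series` is a 1-D numpy array."""
    r = len(series) - fixed_train_len          # size of the fluctuating set
    predictions = []                           # one length-w forecast per run
    for run in range(r - w + 1):               # q = r - w + 1 runs
        train = series[: fixed_train_len + run]  # training set grows by one
        predictions.append(fit_and_forecast(train, horizon=w))
    return np.array(predictions)               # shape (q, w)

def fit_and_forecast(train, horizon):
    # hypothetical stand-in: a naive forecast repeating the last value
    return np.repeat(train[-1], horizon)
```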
Step 6: For each observation of the fluctuating testing/training set, assemble a triangular fuzzy number, $\tilde{e}^k_{ij} = (e^k_{ij1}, e^k_{ij2}, e^k_{ij3})$, with k = 1, 2, ..., n, where $e^k_{ij1}$ is the smallest residual, $e^k_{ij2}$ is the median of all residuals for a given observation, and $e^k_{ij3}$ is the largest residual of the forecasts, over the r observations and m forecasting models. For the first and last observations of the fluctuating testing/training set, $e^k_{ij1} = e^k_{ij2} = e^k_{ij3}$. Thus, the residual fuzzy matrix aggregated according to the n categories, $\tilde{E}_{r \times m}$, for r observations and m models, is given by Equation (18).

Step 7: Compute the normalized residual fuzzy matrix, $\tilde{G} = [\tilde{g}^k_{ij}]_{r \times m}$, just as presented by Chen [63], where each element $\tilde{g}^k_{ij}$, for k = 1, ..., n; i = 1, ..., $p_k$; j = 1, ..., m, is given by
$$\tilde{g}^k_{ij} = \left( \frac{e^k_{ij1}}{e^*_{j3}}, \frac{e^k_{ij2}}{e^*_{j3}}, \frac{e^k_{ij3}}{e^*_{j3}} \right),$$
with the lower, middle, and upper values $e^k_{ij1}$, $e^k_{ij2}$, and $e^k_{ij3}$ coming from matrix $\tilde{E}$, and $e^*_{j3} = \max_{i,k} e^k_{ij3}$.
Step 8: Construct the weighted normalized fuzzy decision matrix, $\tilde{V} = [\tilde{v}^k_{ij}]_{r \times m}$, by multiplying each normalized fuzzy residual by the fuzzy weight of its category, $\tilde{v}^k_{ij} = \tilde{g}^k_{ij} \otimes \tilde{c}_k$.

Step 9: Compute the Fuzzy Positive Ideal Solution (FPIS) and the Fuzzy Negative Ideal Solution (FNIS). We assume that the best forecasting model for a given observation is the one closest in distance to its FPIS and farthest from its FNIS. Thus, the FPIS is the minimal possible residual, and the FNIS is formed by the maximum residuals among all forecasting models for the given observation.
Step 10: Compute the distance of each model's predicted observations from the FPIS, $d^*_j$, and from the FNIS, $d^-_j$, using Equations (22) and (23). We calculate the distances between two triangular fuzzy numbers using the vertex method (see Figure 3 and Equation (15)).
Step 11: Compute the closeness coefficient of each forecasting model j, $CC_j$, according to Equation (24), and rank the models from the highest closeness coefficient to the lowest.
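Bringing Steps 6-11 together, the sketch below is our illustration, not the authors' implementation. It assumes the common simplifications of elementwise multiplication for weighting triangular fuzzy numbers and a componentwise maximum for the per-observation FNIS; the toy weights reuse a (4, 5, 5) triplet from the Table 1 fuzzy scale:

```python
import numpy as np

def fuzzy_topsis_cc(residual_triplets, category_weights):
    """residual_triplets: array (r, m, 3) of (min, median, max) absolute
        residuals per observation i and model j, built from the CV runs.
    category_weights: array (r, 3) of triangular fuzzy weights, one per
        observation, inherited from its category.
    Returns one closeness coefficient per model; higher is better."""
    E = np.asarray(residual_triplets, float)
    C = np.asarray(category_weights, float)
    # Step 7: normalize by each model's largest upper residual
    G = E / E[..., 2].max(axis=0)[None, :, None]
    # Step 8: weight each observation's fuzzy residual by its category weight
    V = G * C[:, None, :]
    # Step 9: FPIS is the zero residual; FNIS is the per-observation maximum
    fnis = V.max(axis=1)                                   # shape (r, 3)
    # Step 10: vertex-method distances, summed over all observations
    d_pos = np.sqrt((V ** 2).mean(axis=2)).sum(axis=0)
    d_neg = np.sqrt(((V - fnis[:, None, :]) ** 2).mean(axis=2)).sum(axis=0)
    # Step 11: closeness coefficient per model
    return d_neg / (d_pos + d_neg)

# Toy example: 5 observations, 2 models, all weighted (4, 5, 5)
rng = np.random.default_rng(0)
E = np.sort(rng.uniform(0, 50, size=(5, 2, 3)), axis=2)  # ordered triplets
C = np.tile((4.0, 5.0, 5.0), (5, 1))
print(fuzzy_topsis_cc(E, C))
```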

Case Study: Forecasting in the Amazon Region
In this section, we exemplify the usage of the proposed metric with the case study of a Brazilian state, Amapá. We first present the data we use and how we collected them. Then, we briefly introduce the forecasting protocol and the models used to predict future observations of the target variable. Finally, we evaluate the models according to the similarity metric we introduced in Section 2.2.

Data Acquisition
We performed all modeling on the daily number of confirmed COVID-19 cases in the Amapá state, fully located in the Brazilian Amazon region. The number of observations/timestamps in this case study, from the first official case on 20 March 2020 up to 28 September 2021, is 558. We gathered the data from official reports at the state level. The collected data are also available through an application programming interface provided by the Brasil.io repository [5], where the dataset is named "caso" and is presented under the "COVID-19" section.
The data we use may diverge from the Brazilian government's website, as the counting protocol may differ from that of the state of Amapá. Additionally, this paper does not treat cases of sub-notification, and the target variable is forecasted by the models considering only lagged values as predictors. All data we use, as well as the results from the subsequent sections of this paper, can be found in the Supplemental File.
We divided the time series into two parts, respecting Step 1 of the proposed methodology. Let the fixed training set be denoted by A and the fluctuating testing/training set by B. From the 558 observations/days, A encompasses 238 observations, while B encompasses 320. Figure 8 shows A and B, as well as their 42-day centered moving average. Following Step 2, we segmented B into categories of increasing, stable, and decreasing behavior, as well as the starting periods of each behavior. These categories reflect the concerns of local decision-makers, who also weigh the importance of each category according to their needs. For instance, better predictions during periods of increase in the number of daily COVID-19 confirmed cases are the most appreciated, since those times may require more infrastructure investments or negotiations with other regions to transfer the excess of new patients. Predictions during decreasing periods are similarly valued, though less critical, since during these times the state may be open to receiving patients from other regions or may close temporary facilities, such as campaign hospitals. Figure 9 shows the splitting of B into categories. Then, following Step 3, Table 2 presents the linguistic grades given by the decision-makers to each of the categories: the column Category refers to each category defined by the decision-makers, Code refers to its abbreviation, and Scale shows the verbal grade assigned, according to Table 1.

Forecasting Protocol
Following Step 5 of the proposed methodology, we adopt a walk-forward protocol with a forecasting window of w = 21 observations. Thus, the first forecasting run uses the whole fixed training set A as the training set and predicts the first 21 observations of B. Then, we walk forward over the fluctuating testing/training set B, now training with 239 observations (238 from A and 1 from B) and attempting to predict from the 2nd to the 22nd observation of B, as shown in Figure 7. Figure 10 shows a flowchart describing the adopted forecasting protocol.

Forecasting Models
We run six forecasting models according to the forecasting protocol defined in Section 3.2, taking the data presented in Section 3.1 as input. First, we select six forecasting models commonly used for predicting the number of COVID-19 cases. Once the models are defined, we complete the decision hierarchy, as prescribed by Step 4 of the proposed methodology (see Figure 11).
For the forecasting shown in Figure 9, we re-indexed the time series so that point 1 corresponds to original observation 238 and point 320 to original observation 558, both included in the forecasting window. Table 3 shows the defined categories, their corresponding codes, and the data points of the fluctuating testing/training set that fall inside each category. Figure 11 presents the case decision hierarchy. As proposed in Step 5 of the methodology, for each model we perform a total of 320 runs. Figure 12 presents a graphical summary of the mean predicted value for each observation in the fluctuating testing/training set. The parameters and hyperparameters are tuned for each new run of the models; thus, we do not list the optimal combination of parameters for each model at each new run of the cross-validation. The following section briefly presents the six forecasting models used in this research.

ARIMA
ARIMA, also known as the Box-Jenkins model [70], is a statistical approach commonly used for time series analysis and forecasting. The model combines autoregressive (AR), integration (I), and moving average (MA) components. Time series components consist of trend, seasonal, cyclic, and random or irregular movements [35]. ARIMA can additionally be set to recognize seasonality, whose optimal order can be found by running a Canova-Hansen test [71].
The ARIMA model is commonly referred to as ARIMA(p, d, q), where p is the order of autoregression, d the degree of differencing, and q the order of the moving average [31].
The optimal values of the autoregressive order (p), the degree of differencing (d), and the moving average order (q) may also be found by grid search. Generally, the chosen parameter values are those that minimize the Akaike Information Criterion (AIC). Benvenuto et al. [21], Ceylan [31], and Singh et al. [32] present examples of the applicability of ARIMA to forecasting the number of COVID-19 cases. The general equations for the AR and MA models are [31]:
$$Y_t = \alpha + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \varepsilon_t,$$
$$Y_t = \alpha + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},$$
where $Y_t$, $\varepsilon_t$, $\phi$, and $\theta$ are the observed value at time t, the value of the random shock at time t, and the AR and MA parameters, respectively. Thus, an ARIMA model is given by:
$$Y_t = \alpha + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},$$
where α is a constant. When dealing with non-stationarity, the data may be differenced first, and the ARIMA model is then applied.
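As an illustration (not the tuned configuration used in the paper), a minimal statsmodels sketch fits an ARIMA and produces one 21-day forecasting window; the order and the dummy series are assumptions:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.random.default_rng(0).poisson(100, size=238).cumsum()  # dummy cumulative series
model = ARIMA(y, order=(2, 1, 1)).fit()   # illustrative p=2, d=1, q=1
forecast = model.forecast(steps=21)       # one 21-step forecasting window
```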

Holt-Winters
Holt [72] and Winters [73] are the forerunners of the Holt-Winters method, also known as triple exponential smoothing. The Holt-Winters method is an improved version of the simple exponential smoothing model that recognizes trend and seasonality in a time series. The method has three parameters: α, the level smoothing factor; β, the trend smoothing parameter; and γ, which relates to seasonality. Numerous authors have used this model to forecast the number of COVID-19 cases [38,39].
The two Holt-Winters models in the literature use additive or multiplicative settings, based on the seasonal component. The additive model can be applied with a linear or an exponential trend, and it is suitable for data with trends and seasonality that do not grow over time [35,36]. The equations of the additive model are as follows:
$$S_t = \alpha (y_t - I_{t-L}) + (1 - \alpha)(S_{t-1} + b_{t-1}), \qquad (28)$$
where t indicates a given period, $S_t$ is the smoothed observation (level) at period t, L is the cycle length, α is the level smoothing parameter, and $y_t$ is the value of the target variable at t. The trend factor ($b_t$), the seasonal index ($I_t$), and the forecast at m steps ahead ($F_{t+m}$) are given by Equations (29)-(31), respectively:
$$b_t = \beta (S_t - S_{t-1}) + (1 - \beta) b_{t-1}, \qquad (29)$$
$$I_t = \gamma (y_t - S_t) + (1 - \gamma) I_{t-L}, \qquad (30)$$
$$F_{t+m} = S_t + m b_t + I_{t-L+m}, \qquad (31)$$
where β and γ are the smoothing parameters of trend and season, respectively.
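A hedged statsmodels sketch of an additive Holt-Winters fit; the weekly seasonal period (7) and the dummy series are our assumptions for daily counts:

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

y = np.random.default_rng(1).poisson(100, size=238).cumsum().astype(float)
model = ExponentialSmoothing(y, trend="add", seasonal="add",
                             seasonal_periods=7).fit()  # additive model
forecast = model.forecast(21)  # one 21-step window
```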

Support Vector Regression (SVR)
The Support Vector Machine (SVM) is a supervised machine learning technique adopted for regression, classification, and time series forecasting [37]. Vapnik [74] pioneered this technique, as well as its regression variant, Support Vector Regression (SVR), which was widely disseminated by the work of Drucker et al. [75]. We can find some applications of SVR [16,48,49] in the COVID-19 forecasting context. The core logic of an SVR is moderately simple. For example, assume a linear regression, which aims to minimize the sum of squared errors:
$$\min \sum_{i} (y_i - w x_i - b_i)^2,$$
where $y_i$ is the target, w the coefficient, and $x_i$ the feature. The SVR training, instead, intends to minimize the following system:
$$\min \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad |y_i - (w x_i + b_i)| \leq \varepsilon + \xi_i^{(*)}, \; \xi_i^{(*)} \geq 0,$$
where $b_i$ is a linear coefficient and ε is the error tolerance. Cost (C) and kernel are two examples of hyperparameters usually tuned in this algorithm.
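A hedged scikit-learn sketch: SVR over lagged values as the only predictors, mirroring the paper's use of lags; C, epsilon, the RBF kernel, the number of lags, and the dummy series are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVR

def make_lags(y, n_lags=7):
    """Rows of X are n_lags consecutive values; the target is the next one."""
    X = np.column_stack([y[i: len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

y = np.random.default_rng(2).poisson(100, size=238).cumsum().astype(float)
X, target = make_lags(y)
model = SVR(kernel="rbf", C=100.0, epsilon=0.1).fit(X, target)
next_value = model.predict(y[-7:].reshape(1, -1))  # one-step-ahead forecast
```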

K-Nearest Neighbors (KNN)
The K-Nearest Neighbors (KNN) technique is a nonparametric, lazy learning method [37,44]. Initially, KNN was designed for classification problems. Notwithstanding, decades after KNN's original conceptualization, around the early 1990s, researchers began investigating it for regression [76]. Instead of learning from the training dataset, KNN has no training phase and simply stores the training dataset [44].
In a time series context, the KNN algorithm explores the k most comparable past values (nearest neighbors) by minimizing a similarity measure [37]. The forecast is then an average of these k nearest neighbors, the past cases with the smallest dissimilarity to the new case [37]. Although it sounds straightforward, it can require a significant computational cost [43]. In the COVID-19 environment, several researchers have employed this approach for classification problems; nevertheless, only a few have applied it to forecast the number of COVID-19 cases [43]. For this algorithm, the number of neighbors is the most commonly tuned hyperparameter. The central distance function employed for continuous variables is the Minkowski distance:
$$d(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}, \qquad (35)$$
where k refers to the number of samples. When q = 1, the metric produces the Manhattan distance, whereas when q = 2, one has the Euclidean distance (both frequently used distance metrics for this end).
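A hedged scikit-learn sketch of KNN regression over lagged values; k = 5 and the Euclidean metric (q = 2 in Equation (35)) are illustrative choices, and the series is made up:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

y = np.random.default_rng(3).poisson(100, size=238).cumsum().astype(float)
n_lags = 7
X = np.column_stack([y[i: len(y) - n_lags + i] for i in range(n_lags)])
model = KNeighborsRegressor(n_neighbors=5, p=2).fit(X, y[n_lags:])
next_value = model.predict(y[-n_lags:].reshape(1, -1))  # mean of the 5 neighbors
```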

Random Forest Regression (RFR)
Random Forest (RF) is a machine learning algorithm based on several decision trees, proposed by Breiman [77] as a compound of the bagging and random subspaces methods. Currently, machine learning practitioners and researchers apply the RF approach to regression and classification tasks. For example, the authors of [11,16,46,47] have employed the RF approach to deal with COVID-19 forecasting.
In the RF algorithm, the data are initially split at random into two parts: training data (in-bag) for learning and validation data (out-of-bag) for testing the learning levels. Next, the algorithm randomly grows many decision trees from bootstrap samples of the data [11,44,46]. Subsequently, randomly selected predictors define the branching of each tree at its node points. Lastly, the average of the results from each tree is the final RF estimate [44,46]. The method performs remarkably well given the randomness of time series [16]. As the splitting criterion at each tree branch in regression applications, we use the Mean Squared Error (MSE). For RFR, the number of estimators and the maximum tree depth are the most commonly tuned hyperparameters.
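A hedged scikit-learn sketch of random forest regression on lagged features; the number of estimators, maximum depth, and the dummy series are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

y = np.random.default_rng(4).poisson(100, size=238).cumsum().astype(float)
n_lags = 7
X = np.column_stack([y[i: len(y) - n_lags + i] for i in range(n_lags)])
model = RandomForestRegressor(n_estimators=200, max_depth=8,
                              criterion="squared_error",  # MSE split criterion
                              random_state=0).fit(X, y[n_lags:])
next_value = model.predict(y[-n_lags:].reshape(1, -1))
```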

Prophet
The Facebook team released a decomposable forecasting model called Prophet as an open-source library [40]. It applies a decomposable time series model with three central elements, the trend function ($g(t)$), the periodic function ($s(t)$), and the holidays component ($h(t)$), plus an error term $\varepsilon_t$ accounting for changes the model does not predict:
$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t.$$
The piecewise linear trend is given by Equation (37):
$$g(t) = \left(k + \mathbf{a}(t)^{\top}\boldsymbol{\delta}\right) t + \left(m + \mathbf{a}(t)^{\top}\boldsymbol{\gamma}\right), \qquad (37)$$
where k is the growth rate, δ the vector of rate adjustments, m the offset parameter, and $\gamma_j$ is set to $-s_j \delta_j$ to make the function continuous. An additional important feature is that the model performs automated changepoint selection by setting a sparse prior on δ.
In addition, Prophet relies on Fourier series to incorporate daily, weekly, and annual seasonalities. In the case of COVID-19, we are mostly concerned with weekly seasonality [8].
To date, Prophet has seen only a few applications to forecasting deaths and accumulated confirmed cases in the COVID-19 context [20,41].
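A hedged sketch with the open-source prophet package; the start date matches the case study, while the dummy values and the seasonality settings are our assumptions:

```python
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({
    "ds": pd.date_range("2020-03-20", periods=238, freq="D"),
    "y": range(238),  # dummy cumulative counts
})
m = Prophet(weekly_seasonality=True, yearly_seasonality=False,
            daily_seasonality=False).fit(df)
future = m.make_future_dataframe(periods=21)          # 21-day horizon
forecast = m.predict(future)[["ds", "yhat"]].tail(21)
```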

Model Evaluation and Comparison
After running each model 320 times, each time one step ahead over the fluctuating testing/training set (see Figure 10), we reorganize the observations into the categories and then build triangular fuzzy numbers, $\tilde{e}^k_{ij} = (e^k_{ij1}, e^k_{ij2}, e^k_{ij3})$, for each predicted observation, all in accordance with Step 6 of the proposed methodology. For instance, after running the Holt-Winters method, represented by j = 2, the observation t = 274, which occupies position i = 13 inside category k = 2, has the following rounded predicted values, as collected from runs 17 to 37: [331, 406, 396, 308, 249, 390, 383, 355, 370, 331, 301, 284, 280, 250, 343, 342, 361, 338, 346, 335, 305], for the actual value y = 203. Thus, the lower limit $e^2_{13,2,1} = 250$ is the predicted value with minimum deviation from the actual value, $e^2_{13,2,3} = 406$ is the value with maximum deviation, and $e^2_{13,2,2} = 338$ is the median prediction considering all 21 predicted values. This way, $\tilde{e}^2_{13,2} = (250, 338, 406)$.

In possession of all fuzzy numbers for each observation of B and for the six models, we then build the residual fuzzy matrix, $\tilde{E}_{320 \times 6}$, whose first and last three rows we represent in tabular form in Table 4. Please notice that each model represents one column of the matrix defined by Equation (18), and the observations inside the fluctuating testing/training set are represented by the rows. Tables 6-9 are also tabular forms of the matrices defined in the step-by-step of the proposed model, and l, m, and u in the sub-headings of the tables denote the lower, middle, and upper values of the triangular fuzzy numbers.

According to Step 7, we compute the normalized fuzzy matrix, $\tilde{R}_{320 \times 6}$; Table 5 shows its first and last three rows in tabular form. Using the fuzzy scale of Table 1, we convert the linguistic weights assigned by the decision-makers to the criteria and construct the weighted fuzzy matrix, $\tilde{V} = [\tilde{v}_{ij}]_{320 \times 6}$, as requested by Step 8; Table 6 shows its first and last three rows in tabular form. Table 7 shows the first and last three FPIS and FNIS rows of the fuzzy ideal solutions. According to Step 10, we calculate the Euclidean distances of each model's predicted values from the FPIS and FNIS using the vertex method, which also defuzzifies the fuzzy numbers, yielding crisp values once again; Table 8 shows the first and last three rows of the positive and negative distances for each model.

Finally, the total distances, $d^*_j$ and $d^-_j$ (see Step 10), as well as the closeness coefficient, $CC_j$ (see Step 11), are calculated for each model (see Table 9). According to the model we propose, forecasting models with a higher CC, which results from the association of large negative distances with small positive distances, are preferred. In Table 9, the positive distances contribute more than the negative distances to the advantage of Prophet over Holt-Winters. Moreover, although ARIMA has a significant distance to the negative ideal solution, its distance to the positive ideal solution is not sufficient for it to overcome Holt-Winters and Prophet. The application of the metric and the input data can be found in the Supplemental File.

Results and Discussion
To evaluate the proposed similarity metric, we first calculate the Mean Absolute Error (MAE) of each model for each of the six categories described in Sections 3.1 and 3.3. We also calculate the overall MAE for all models, as commonly found in the literature.
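As a quick reference, the per-category MAE and the weighted variant (WMAE) used later in this section can be computed as in the sketch below; the residuals, category labels, and crisp weights are made-up examples:

```python
import numpy as np

def mae_per_category(residuals, categories):
    """Mean absolute error of one model, split by category label."""
    r = np.abs(np.asarray(residuals, float))
    return {c: r[categories == c].mean() for c in np.unique(categories)}

def wmae(residuals, categories, crisp_weights):
    """Weighted MAE: each observation weighted by its category's crisp weight."""
    r = np.abs(np.asarray(residuals, float))
    w = np.asarray([crisp_weights[c] for c in categories], float)
    return (w * r).sum() / w.sum()

residuals = np.array([10, -5, 8, -2, 4, 12])
categories = np.array(["I", "I", "S", "S", "D", "D"])
print(mae_per_category(residuals, categories))
print(wmae(residuals, categories, {"I": 5, "S": 3, "D": 4}))
```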
As we may observe in Figure 13, Holt-Winters is the model with the best overall performance according to MAE. It is the model with the best performance after behavior changes, since it has the smallest MAE for the categories Stability Start and Decreasing Start and holds the second-best performance for Increasing Start. Moreover, according to MAE, Holt-Winters is also the best model during Increasing and the second-best model during Stability and Decreasing. In the radar chart of Figure 13, Holt-Winters displays the smallest area, with an advantage over the second- and third-best models, RFR and KNN. These results are not surprising, since Holt-Winters has shown good forecasting performance, especially for short-term forecasts, even against fashionable models such as machine learning models [39,78-80]. Overall, Prophet and ARIMA are the worst models, displaying higher MAE values.
We can also calculate the Weighted Mean Absolute Error (WMAE) by taking as weights the crisp values corresponding to the linguistic grades assigned by the decision-makers in Table 2. As observed in Figure 14, Holt-Winters is still the best overall model, but it loses performance, especially during periods of Decreasing, where Prophet has the best performance. In fact, with this approach, Prophet improves its overall performance, since periods of decrease account for 33% of all observations in the fluctuating testing/training set. Notice how the polar graph goes almost to zero for some categories, such as $c_{SS}$, $c_{IS}$, and $c_{DS}$: the smaller weights given to the starting periods, together with the few observations that fall into those categories, shrink their effect on the overall result.

However, both MAE and WMAE consider only averages, and high volatilities in the data are not taken into account. For example, by looking at Figure 15, we notice how volatile Holt-Winters is when compared to Prophet. The former presents a much more significant difference between the lower and upper values computed for the fuzzy numbers than the latter; however, the Holt-Winters middle values are closer to the target values. This variation can be better perceived by looking at the difference between the fuzzy residuals of Holt-Winters and Prophet with respect to the positive ideal solution, FPIS = (0, 0, 0). When Holt-Winters has larger residuals than Prophet, the difference is positive; when Prophet has larger residuals than Holt-Winters, the difference is negative. In Figure 16, we calculate the differences between the residuals of Holt-Winters and Prophet, where the residuals are computed between the target variable and the middle values (which also correspond to the MAE values; see Figure 13) for both models. However, when we use the upper residual values instead, the results are far different (see Figure 17): the extremes of the predicted values become more evident, and, in this scenario, Prophet displays better results than Holt-Winters, not only in the most important categories, such as $c_D$, but also where it lacked accuracy before, such as $c_S$.
This new variability perspective, added to the perspective of the mean, explains the results of Table 9 and the dominance of Prophet over the other models. Despite presenting the worst MAE compared to other models, Prophet performs better in the categories most critical to the decision-makers and displays less variability in the predicted data when we perform the moving forward validation. If classical TOPSIS were used instead of fuzzy-TOPSIS, the variability would not be captured, and the final metric would be formed only by weighted means. With classical TOPSIS, the ranking would be HW, KNN, RFR, ARIMA, Prophet, and SVR, which is the same ranking provided by the WMAE in Figure 14.

Table 10 summarizes the rankings produced by MAE, WMAE, the TOPSIS similarity metric, and the Fuzzy-TOPSIS similarity metric. For MAE and WMAE, smaller values are better; for TOPSIS and Fuzzy-TOPSIS, greater values are better. For the sake of comprehension, the results of Table 10 are normalized in Table 11, where greater values are better for all ranking approaches. The normalized values also allow us to better perceive the distances between the models under the four ranking approaches. As noted, while WMAE and classic TOPSIS are able to capture heterogeneous fitting expectations, they do not penalize models with greater volatility; thus, their ranks are similar to those obtained with MAE. In this way, the volatility treatment brought by the usage of fuzzy numbers is the biggest game-changer of the model we propose, at least for the data we explored.

Conclusions
This paper proposed a novel forecasting metric, a fuzzy similarity metric, that, in addition to averaging errors, can capture heterogeneous fitting expectations and volatility in the forecasted values, especially when using cross-validation approaches. We tested the metric by selecting the best model to predict future values of the number of COVID-19 confirmed cases in Amapá, Brazil. Models that quickly respond to increases in the series are preferred over others that tend to be more conservative in those situations. Besides a small mean error associated with each predicted observation, low volatility during cross-validation is also desired, since steadier models may lead to more expected scenarios. The similarity metric proposed in this paper ranks Holt-Winters as a second option to Prophet. Despite presenting a modest performance according to MAE, Prophet presents low volatility in the forecasted data, meaning that its worst predictions are still closer to the target variable than the worst predictions of Holt-Winters, for example. Furthermore, when weighting the time series periods according to the decision-makers' preferences, Prophet has a good performance during periods of decrease, which account for almost 33% of the fluctuating testing/training set. Nevertheless, other metrics that may capture heterogeneous fitting expectations, such as WMAE or even classical TOPSIS, also penalize Holt-Winters, but they still rank the models similarly to MAE. Thus, capturing volatility in the data is the biggest game-changer in the case we presented. Since this is the first introduction of a novel metric, future explorations must test it on other datasets and under different time series perspectives. Our work is also limited to univariate time series, and the metric's response to multiple target variables is still unknown.

Conflicts of Interest:
All authors have approved the manuscript and agree with its submission. Furthermore, we confirm that this manuscript has not been previously published and is not under consideration for publication in any other journal. No conflicts of interest need to be disclosed.

Abbreviations
The following abbreviations are used in this manuscript:

AIC	Akaike Information Criterion
ARIMA	Integration (I) between autoregressive (AR) and moving average (MA) models
AUC	Area under the receiver operating characteristic (ROC) curve
BIC	Bayesian Information Criterion