Assessment of Variability in Irregularly Sampled Time Series: Applications to Mental Healthcare

Variability is defined as the propensity at which a given signal is likely to change. There are many choices for measuring variability, and it is not generally known which ones offer better properties. This paper compares different variability metrics applied to irregularly (nonuniformly) sampled time series, which have important clinical applications, particularly in mental healthcare. Using both synthetic and real patient data, we identify the most robust and interpretable variability measures out of a set 21 candidates. Some of these candidates are also proposed in this work based on the absolute slopes of the time series. An additional synthetic data experiment shows that when the complete time series is unknown, as it happens with real data, a non-negligible bias that favors normalized and/or metrics based on the raw observations of the series appears. Therefore, only the results of the synthetic experiments, which have access to the full series, should be used to draw conclusions. Accordingly, the median absolute deviation of the absolute value of the successive slopes of the data is the best way of measuring variability for this kind of time series.


Introduction
Variability, also known as statistical dispersion, scatter, or spread, can be defined as the propensity at which a given signal is likely to change. The analysis of variability is of major importance in many fields, such as astrophysics [1,2], hydrology [3], agriculture [4], or ecology [5,6]. Measuring variability also has important applications in medicine. For example, it can be used to accurately measure the bioequivalence of doses of different drugs [7], predict medication demand in a hospital [8], or to measure heart rate variability, which in turn can be used as a biomarker of anxiety [9]. Variability is of particular usefulness in mental healthcare, as fluctuations in suicidal ideation have been identified as a phenotypic marker for stress-induced suicide risk [10]. In this paper, we focus on how to mathematically describe variability in nonuniformly sampled time series, which can be used in turn to alleviate problems in mental healthcare. It must be mentioned that there is a surge in the application of statistics and machine learning to solve other problems in medicine, such as classifying normal versus cancerous cells [11] or diagnosing mental disorders with electroencephalogram data [12]. For a discussion about how these methods should be incorporated in practice, refer to the work in [13].
A variability measure is an operator that transforms a time series into a single scalar, whose value hinges on the dispersion of the values of the sample. In order to make them easier to understand, many of those measures are expressed in the same units as the studied data and are non-negative, in such a way that a value of 0 indicates no change at all. Most of the measures commonly used in psychiatry have these characteristics. One possible way to achieve this is to summarize the expected distance with respect to a reference point [14]. For instance, if the reference point is the sample mean, and the expected distance is measured as the square root of the sum of the squared distances to the mean, divided by the sample size minus one, the dispersion will be expressed in terms of standard deviations. However, if the median and the median of the absolute distances are the selected criteria, the dispersion will be measured in terms of median absolute deviations, which can be more robust to outliers.
Another approach to express variability in the same units of the observations is to compute the difference of some sorted observations, thereby removing the need of a reference value. This can be useful when such a computation can get skewed by the inherent noise or other factors. In particular, if the difference is taken from the most extreme values, variability is measured in terms of range, but it is possible to compute it making use of the different percentiles of the dataset. In other cases, it may be convenient to measure variability independently of both the particular scale and the units, so that different datasets can be compared [15]. The simplest way to achieve this is to obtain ratios of variability measures, for example, the ratio of the standard deviation and the mean expresses variability as coefficients of variation [16].
The analysis of variability requires the definition of a variable of interest whose changes are computed. In psychiatry, this variable of interest is usually a symptom, which has traditionally been measured by questionnaires administered in clinical sites. This method of assessment presents several disadvantages, such as recall bias (difficulty to remember past events) [17] or ecological invalidity (by asking the questions outside the usual environment) [18]. To alleviate this, a research line has tried to obtain extemporaneous feedback from the participants by regularly requesting them to answer a set of questions via their electronic devices. This approach, which can be categorized as ecological momentary assessment (EMA) [19], suffers from fatigue effects. This consists in a decrease in the number of questions answered as time goes by, which can result in withdrawal from the study [20].
To decrease such fatigue, our research group has recently adopted a new sampling approach for the questions that consists in randomly asking some of them out of a fixed pool, thereby decreasing the sense of repetitiveness [21]. This approach results in studies with longer follow-up periods, at the expense of every set of responses for the same question being sampled non-equidistantly (nonuniformly). Because of this, special care must be taken when measuring variability when using this approach.
In this paper, we study and propose metrics which are suited for this kind of data. However, the findings of this research are not only useful for this specific type of design, as they are also applicable to any analysis in which a significant amount of missing data is produced, yielding a nonuniformly sampled time series. Missing data represent a particularly common issue in EMA studies and it is usually dealt with discarding data points that do not meet certain criteria, such as removing responses provided over 30 min after being prompted [22]. We expect that our findings will provide an accurate method of analyzing variability in non-equidistantly (nonuniformly) sampled time series, what will also allow to better cope with missing data in mental health time series.
The rest of the paper is organized as follows. Section 2 summarizes the main variability measures that have been used in the literature. As most of these measures have only been applied to equidistant, dense time series, in Section 3 we propose some modifications to those measures so that they can be applied to irregularly sampled data. In this section, we also present some novel methods of measuring variability that could also be advantageous for this task. Then, Section 4 analyzes these methods using synthetic data, while Section 5 makes use of real data to assess their robustness. To correctly interpret the results, Section 6 provides additional simulations. Finally, in Section 7 we summarize our findings, discuss their implications, and we point to next lines of research.

Review of Variability Assessment
In this section, we summarize the most important measures used for assessing variability in mental healthcare time series. One of the most popular measures is the root mean square of successive differences (RMSSD) [10], that is, where T is the number of observations, and x t is the time series value, for instance, the response of the patient recorded at (discrete) time t in the considered application. The square is taken so that increases and decreases are not mutually canceled. This measure assumes that changes in either direction count the same. Such an assumption is usually fulfilled in psychiatry research, as commonly used scales are sometimes measured so that higher scores indicate better health status, and sometimes so that higher scores indicate worse health status [23]. The RMSSD contains some information about the temporal evolution of the assessments, as it is only invariant to complete reversal and some pairwise random permutations of the observations. This measure, normalized by the mean or by the standard deviation of the sample, also known as coefficient of variation (CoV) and ratio of determination (RoD), respectively, has been explored in [24] to produce variability measurements less sensitive to the mean value of the observations. Finally, notice that the square root is taken to recover the natural units of the explored variable in the series, though some studies do not do it [25]. This choice is not important, since the square root is a monotone transformation: it is similar to working with standard deviations or variances.
Another relevant measure of variability is the standard deviation (SD) of the observations [26]. This measure has the advantage of providing an interpretable parametric meaning in the case that the observations are normally distributed. However, this method does not carry any information about the order of the data (any permutation yields the same SD), and this approach is not appropriate when the distribution of the data contains many outliers or it is highly nonsymmetric [27]. A related measure that has also been employed is the standard error of the mean (SEM) [28], which is given by the SD divided by the square root of the number of observations.
Although simple and nonlinear, the range of the observations, measured as the difference between the most extreme values, has also been used [29]. On the other hand, more refined measures, such as the entropy (H) or the Teager-Kaiser energy operator (TKEO) have also been explored [30]. The former comes from the Information Theory field and it is a measure of data uncertainty, but it is not sensitive to local variations. We shall estimate it as where R = {r 1 , . . . , r N } is the set of N possible values in the time series andp(r t ) is the estimated probability of r t , which is given byp , with #(·) being the count function. Notice that (2) is measured in bits, and the sum explicitly removes terms involving null probabilities, which otherwise would create numerical indeterminations. Indeed, it is judicious to do so, as one of the properties of the entropy is that incorporating or discarding events with zero probability does not alter its value. The TKEO is an approximation of the energy of a signal, which depends on its amplitude and frequency. As highly variable signals have high frequency components, larger TKEO values are associated with greater variability. This metric requires at least three data points to be computed, and it may produce negative values under some circumstances [31]. To avoid this, we will use the mean of the absolute value of the TKEO (MATKEO) as in [32], that is, One final comment is in order. The variance terms of mixed-effect models have been used as an alternative approach to measure variability, either with raw [33] or with grand mean centered data [34]. However, we do not further analyze this model-based approach, as it is unfeasible to make fair comparisons of models tailored for specific settings, and these methods are somehow equivalent to computing the standard deviation of the data.

Variability Assessment of Irregularly Sampled Data
The variability measures presented above were not designed to cope with irregularly (nonuniformly) sampled time series. Therefore, in this section we propose some modifications to adapt them to such setting. First, notice that if (1) were directly applied to nonuniformly sampled data, its estimation would produce a non-desired result, as the same change in the scores, e.g., 20 units, would weight the same regardless of the time that it took such a change to happen, e.g., one day or one month. To correct this, we argue that the RMSSD in (1) can be understood as a finite-difference approximation of the square root of the mean of the squared first derivative, that is, which is a finite-difference approximation with (possibly) non-homogeneous step h t between x t+1 and x t . Moreover, (5) reduces to (1) when h t = 1, ∀t = 1, . . . , T − 1, in which case t N −t 1 T−1 = 1 as well. This measure, which we call the root mean square of successive slopes (RMSSS), is expressed in units of change in the data per unit of time. Building on this interpretation, we can propose measures that take into account the sampling interval, i.e., based on slopes. Specifically, we propose to work with the square root of the squared slopes of successive samples, that is, the absolute value. Then, we can use the measures presented in the previous section and apply them to absolute-value slopes (therefore, we shall append the subscript "as" to the abbreviations of those measures when there is risk of confusion) instead of the raw observations (the acronyms of which will end by the subscript "raw").
We introduce other measures for summarizing the time series into a scalar, which have been useful in other fields. In particular, we shall explore the potential of (1) the interquartile range (IQR) [35], that is, the value of the 75th percentile minus that of the 25th one. (2) The Gini's mean difference (GMD) [36], namely, the expected value of the absolute difference of every possible pair of the observed values, Moreover, (3) the median absolute deviation (MAD) [37], given by where median(·) is the median operator andx = median {x t } T t=1 . Notice that for clarity of presentation, (6) and (7) compute the corresponding measures for the raw scores, as the previous equations, but the variability measures can easily be assessed in terms of the absolute value of the slopes. For instance, the GMD of the absolute value of the slopes is given by For an schematic view of all the measures analyzed in this work, refer to Figure 1.  One final comment is in order. Overall, the complexity of the studied variability measures is comparable, since most measures are O(T) computationally complex, where O(·) denotes the big O notation [38]. All the metrics based on the absolute value of the slopes requires the additional computation of the absolute value of the successive slopes, but this is an inexpensive operation, which requires a number of operations which increases linearly with the number of observations. On the other hand, the (exact) sorting of the observed values used to compute some metrics, like IQR, is O(T log T) computationally complex [39]. Further, normalized metrics (i.e., SEM, CoV, and RoD) need to iterate twice the time series, but this can be done parallel. Finally, notice that the complexity of computing H raw also increases linearly as the number of possible values the time series can take on, N, is increased. The only metric that could be computationally more demanding is GMD, which has quadratic complexity in the number of observations, as can be seen in (6). How-ever, for time series with few to moderate observations, the differences in time complexity are negligible in practice.

Comparison on Synthetic Data
In this section, we study several variability measures using synthetic data. To the best of our knowledge, there is no previous work that has simulated responses to mental health questionnaires to assess it. Therefore, we describe a possible way to achieve this, which is supported by the item response theory (IRT). The IRT states that the answers that a person provides to a questionnaire can be modeled by a transformation of a given latent variable [40], which we shall denote by Ψ. Without loss of generality, let Ψ ∈ {ψ ∈ R : 0 ≤ ψ ≤ 100}. For instance, if Ψ measures the participants' mood, 100 may correspond with an objective, saturated feeling of elation, while 0 would be the worst possible feeling of sorrow and any value in between them represents an (objective) feeling proportional to the distance to the extremes. In our experiments, we set the possible responses to R = {x ∈ Z : 0 ≤ x ≤ 100}. In doing so, we account for the measurement error given by the precision of the scale. However, it is important to model other sources of noise, such as respondent inconsistency [41]. Indeed, it would be very unlikely that a participant always reports the same value in the questionnaire for the same value of Ψ. Furthermore, different participants, all sharing the same latent value of Ψ, may provide higher or lower values in the questionnaire if they tend to be optimistic or pessimistic, respectively.
Defining t as time in days, we exemplify three different changes for the latent state over 60 days as follows, Thus, (9) represents a slow linear drift towards the maximum value, while (10) and (11) model oscillations around the medium value with a periodicity of one week. Notice that (11) has double the amplitude of (10). These are prototypic types of behaviors that were found to be good examples of other more general cases which were analyzed but are not shown here for simplicity. Those other more general cases include a cosine with increasing amplitude or abrupt ground levels changes as time goes on, and a cubic function. In all those cases we arrived to the same conclusions.
Another limitation that should be considered is the temporal resolution. In our experiments, we restrict the observations to a maximum of one per day. Consequently, the first step is to obtain the days when the data will be acquired by sampling uniformly without replacement the set of 60 possible days. Then, using Equation (9), (10), or (11), the true value of the latent state is obtained. To account for the respondent inconsistency, we model the answer the participants would be willing to provide as a random realization of a normal distribution with mean Ψ t and standard deviation SD n , as normal distributions are well-suited for modeling responses to Likert questionnaires [42]. Finally, to account for the error produced by the precision of the scale, the observed data will be projected onto the set R.
In order to objectively assess the variability measures, we follow an approach similar to that of [43]. That is, we evaluate the measures in a favorable scenario, and then we obtain the performance degradation by computing the Cohen's d between the observed values in the favorable scenario and those observed in unfavorable scenarios. In particular, we compute it as where x re f is the mean in the reference set, in other words, the mean of a given variability measure in the favorable scenario. Moreover, x comp represents the mean of the comparison set, i.e., the mean of the same variability measure in the unfavorable scenario, and N re f (conversely N comp ) is the number of observations in the reference (conversely, comparison) dataset. Finally, SD re f (conversely SD comp ) stands for the standard deviation in the reference (conversely, comparison) dataset. We shall complicate the problem by reducing the number of observations and increasing the variance of the Gaussian function for the same Ψ t . In the favorable scenario, we observe the 60 possible answers with very low respondent inconsistency, i.e., SD n = 0.5. Figure 2 shows one realization under such favorable settings for the three explored changes, and the mean values of the variability measures over 4000 realizations are presented in Figures 3a, 4a and 5a. Notice that in these figures, to facilitate the interpretation, the same colormap has been used, which spans the range of all the data in the three figures. Finally, the performance degradation is depicted in Figures 3b, 4b and 5b, also using a single colormap, which spans the full range of the data in these three new figures. To further ease the interpretation, Figures 3b, 4b and 5b show the absolute value of the Cohen's d, as it is of little relevance whether the measure underestimates or overestimates the variability in the complicated scenario: for the robustness analysis the importance lies in the magnitude of the discrepancy. As previously explained, each row in these figures corresponds to a different value of SD n and number of observations, N comp , where this reduced number of observations is obtained by randomly sampling the 60 days without replacement.         As can be seen in Figures 3a, 4a and 5a, some measures are insensitive to the kind of change that Ψ describes. For instance, RoD raw and H raw provide very similar values for the two cosine functions, and COV as and RoD as cannot distinguish any of the three profiles. On the other hand, MAD raw and GMD raw are difficult to interpret, as they yield higher variability values for the slow linear drift than the periodic function, while the vast majority of the remaining measures experiment a nearly twofold increase when the amplitude of the cosine is multiplied by 2. Another issue regarding interpretability is that RoD raw , RoD as , and CoV as are not well-defined when the data do not vary at all, because their computation would involve a division of 0 by 0 under such a situation. Similarly, CoV_raw is undetermined if all observed values are equal to zero due to the same problem. This was not observed during the simulations, as Ψ was studied under three different varying conditions and there was also noise.
With respect to the degradation in the performance as the samples become more scarce and less reliable, Figure 3b shows that measures based on the raw value of the points of the time series struggle in the simple linear drift: SEM raw is the one with the highest absolute value of Cohen's d for most combinations of SD n and N comp , followed by RoD raw , RMSSD and CoV raw (recall that a magnitude of Cohen's d greater than 0.8 is considered a large difference [44]). In general, the rest of the variability measures based on the raw scores perform worse than those based on the absolute value of the slopes, with the notable exceptions of GMD raw , which performs exceptionally well for low values of added noise, and SD raw and IQR raw , which offer competitive robustness. The rest of the measures based on the slopes, except for SEM as , MATKEO as , IQR as , SD as , and RMSSS, are robust to both added noise and missing observations. Figure 4b reveals that the worst performing measure is RoD raw , followed by SEM raw , when Ψ describes a cosine with amplitude 20. When we compare Figures 3b and 4b we see that even if most of the raw-based variability measures perform better with the cosine profile than in the slow drift case (especially SEM raw for low values of noise and missing observations) they are still surpassed by slopebased ones, even if RMSSS, CoV as , RoD as , and SEM as are less reliable with the periodic profile than in the linear case. When analyzing the effect of doubling the amplitude of the cosine, i.e., comparing Figures 4b and 5b, we observe that RMSSS, SEM raw , and RMSSD are the measures whose performance is more affected.
When analyzing Figures 3b, 4b and 5b as a whole, we see how badly SEM raw , RoD raw , RMSSD, CoV raw , and MATKEO raw systematically perform in all the cases. RMSSD is one of the less robust measures for nonuniformly sampled observations, in spite of its wide applicability with uniform sampling. On the other hand, when looking at the most reliable ways of measuring variability, we notice that CoV as and RoD as offer averagely robustness for the linear case, although they fail to be robust enough for the cosine changes. Therefore, they should only be used if linear changes are expected. Similarly, IQR as , H raw , GMD as , and SD as only behave reliably if the underlying change is periodic. When the sign of the the Cohen's d value is considered (recall that Figures 3b, 4b and 5b show its absolute value) it is observed that the measures tend to underestimate variation as the comparison set is distorted by removing observations and adding noise. When comparing the first, sixth, and eleventh rows of these figures, we can analyze the performance with the same number of observations (i.e., 25) but increasing noise levels. The raw-based variability measures are the ones more affected by noise, notably RMSSD in the linear case. In general, noise in the responses causes more distortion than missing observations, what suggests that a small, yet reliable dataset can be more informative than a larger, noisier one. Finally, only SD raw , and both range raw , MAD raw and range as and MAD as are consistently subject to small degradations in the three cases, but MAD as is the most stable one for all the cases. In light of these findings, we conclude that the MAD as is the most reasonable measure to assess variability in irregularly sampled time series.

Comparison on Real Data
The previous section has shown the performance of the measures under three different synthetic scenarios. As the underlying data-generating process was known, it was possible to objectively characterize the robustness of each measure by computing the Cohen's d of the variability scores obtained from the complete time series used as a reference and a sampled version thereof. To complement that analysis, we now perform a similar experiment upon real data. However, the best reference that can be used with real data, namely, all the available responses of a participant, is already a sample of the complete time series. Therefore, we compute the variability measures for all the responses of all participants and use them as the reference sets. Then, up to 1000 degenerated time series are obtained by removing a given percentage of the observations in the reference sequence selecting the observations to discard uniformly without replacement at random, and the measures are computed thereon to obtain the comparison set. Finally, the Cohen's d is computed between the sets in order to assess the robustness of each variability measure.
The dataset contains the responses from the 198 participants who rated at least four times (so that the most restrictive metric, MATKEO as , can be computed) each of the following items: (a) "Today I feel the wish to live" and (b) "Today I feel tired during the day because of my sleep problems." For rating the items, participants used a slider with integer precision on scale that goes from 0 to 100. The mean ±SD sequence length of the above-mentioned items among all participants are 16.79 ± 11.00 and 12.01 ± 8.01, respectively. The mean distance ±SD among subjects is 7.85 ± 9.66 and 11.17 ± 12.59 days, respectively. The results are shown in Figure 6, where the absolute value of the Cohen's d has been taken so that the results can be more easily shown on a semi-logarithmic plot. The sample is composed of patients with a history of suicidal thoughts or behaviors, and all of them gave their informed consent. Overall, both Figure 6a,b exhibit a similar behavior, suggesting that the ensuing conclusions are not limited to a specific kind of questions, as the same pattern is observed for different topics (wish to live and sleep problems). Indeed, taking into account the error bars, there is not a significant loss in performance as more data are ignored. Other peculiarity of Figure 6 is that the Cohen's d values are significantly smaller than those of Figures 3b, 4b and 5b. Furthermore, according to Figure 6, GMD raw is the most reliable measure, clearly outperforming the rest. Equation (6) shows that such a variability measure is very reliable to simple loss of observations (without adding noise). Indeed, the synthetic data experiments also showcased very small absolute values of Cohen's d for low levels of added noise. Finally, when analyzing in more detail Figure 6, we see that SD raw , MATKEO raw CoV raw , and RMSSD and SEM as behave robustly.

Additional Synthetic Data Experiment
The two previous sections have shown that there is a disagreement between the results of the synthetic data experiments and those of the real data. On one hand, synthetic experiments show that the assessment of variability becomes harder as more data are lost, and that one of the ways of obtaining robust measures is to make use of the distance among the observations by considering the absolute value of the slopes of the consecutive data points in the time series. On the other hand, results with real data depict optimistic reliability for every measure of variability, regardless of the percentage of missing observations, and they favor simple measures that are based on raw values. This discrepancy warrants an additional synthetic data analysis to ascertain if the fact that the reference used in the real data experiments is itself a sample of the complete time series can bias the results. For this purpose, let us consider the following experiment. First, the given set of days that will be used as a reference out of the 60 possible days is sampled according to a Hawkes process whose (conditional) intensity function is The intensity function measures the number of times an observation is acquired per unit of time [45]. In (13), µ is a parameter expressing the basal intensity and α controls the increase of the probability of observing additional data points after the observations have been acquired. That is, α determines the burstiness of the locations of the data points in the sequence: low values of α produce more evenly spaced time series, while high values yield points that tend to follow each other forming clusters. In this way, it is also possible to study if the relative location of the points used in the reference sequence has any influence on the results. We take this cluster-producing sampling, and the simplest way of modeling it, i.e., the Hawkes process, as a relevant example of possible kinds of structures in the location of the data points in the reference set that could induce a bias.
Second, (11) is used to simulate the time series on the sampled instances obtained using the Hawkes process in (13), yielding N re f observations from which the variability measures will be evaluated, thereby obtaining the reference set. Third, 70% of those N re f observations are removed by means of uniform sampling without replacement, and the variability measures are computed in order to get the comparison set. Last, the process is repeated, and the Cohen's d is obtained. Figure 7 shows those values of the Cohen's d (keeping the sign) for 4000 repetitions and different values of µ and α, which, in turn, determines the mean number of observations in the reference set, explicitly denoted by "mean(N re f )", for clarity. As can be seen, the magnitude of the Cohen's d in Figure 7 lies within the same range of those of the real data analysis (Figure 6), which are significantly smaller than those of the first synthetic data experiments (Figures 3b, 4b and 5b). Therefore, the low density of observations in the reference set causes a bias towards low values of Cohen's d. This stems from the fact that the lower the number of observations for the same time series, the lower are the chances of observing changes, in such a way that the comparison set becomes more and more similar to the reference one. In fact, the large majority of the variability measures attain the best results for mean sample densities as low as (16.5 × 100)/60 = 27.5%, as depicted in the bottom row of Figure 7.  However, such a bias does not equally affect all the variability measures. Indeed, the value of Cohen's d of the measures based on the raw answers decreases faster to 0 than those based on the absolute value of the slopes, as the reference time series has fewer observations, particularly for simpler methods such as RMSSD, CoV raw , RoD raw , and SEM raw . Nonetheless, MAD as , GMA as , IQR as , and SE as , all of which performed averagely to poorly in the real data experiments, have a slight bias that increases the value of the Cohen's d as the reference losses observations. This can be explained by the fact that the absolute value of the slopes lies on a larger range of values than that of the raw observations, so the measures based on the absolute slopes can still capture some variability even when the reference observations are scant. Therefore, when the observations are removed in the comparison set, part of such variability is lost, and this discrepancy is shown in the Cohen's d between the two sets. On the other hand, normalization contributes to increase the bias towards low Cohen's d values. This explains why SEM as , RoD as , and Co as , which are based on the absolute slopes, suffer from it. In addition, SEM raw , RoD raw , and Co raw are the measures that decrease faster to zero as the reference becomes lighter. Therefore, the division by an already biased normalizing factor, such as the SD, amplifies the bias in variability measures.
With regard to the possible influence of the relative location of the observations used as a reference, Figure 7 allows us to compare different combinations of µ and α, which have roughly the same number of mean observations in the reference set. In particular, it is possible to compare: (1) the 2nd and 5th rows to observe an increase from α = 0.5 to α = 0.7 with mean(N re f ) ≈ 45; (2) the 8th and 10th rows to see what happens when the burstiness is increased from α = 0.1 to α = 0.5 with mean(N re f ) ≈ 32; and (3) the 11th and 13th rows to study the case wherein α = 0.3 and α = 0.7 with mean(N re f ) ≈ 28. When the differences among those rows is inspected, the values are very low and no consistent pattern is found, suggesting the density of the observations used as a benchmark is what determines the bias induced when the reference time series is incomplete.
In summary, this additional synthetic data experiment shows that using incomplete time series as references induces non-negligible bias, reducing the differences between the reference and comparison sets for simple variability measures based on raw observations. Therefore, the real data results from Section 5, which have very low density of observations in the benchmark (<15%), should not be considered for analyzing the robustness of the variability measures. Instead, the synthetic data experiments from Section 4, which have access to the complete sequence, are the ones that should be taken into account.

Conclusions and Future Work
In this paper, we have summarized the main variability measures used in the literature, with special emphasis in mental healthcare time series. Datasets in this kind of studies have a strong tendency to be nonuniformly sampled regardless of the experimental design; even studies that aim for data uniformity end up being nonuniform due to the large amount of missing data. We have explored the ability of different commonly used variability measures to address this issue, and we have proposed and studied the potential of other variability indicators not explored yet to cope with this inimical situation. We compared a total 21 candidates, exploring their robustness and interpretability using synthetic data. Thanks to the known profiles used in the synthetic data, some variability measures, such as MAD raw and GMD raw , have been identified as difficult to interpret, as they provided higher values for profiles that intuitively are less variable. On the other hand, these experiments identified measures based on the absolute value of the slopes of consecutive observations as the most robust ones. In particular, MAD as was the best candidate.
However, we reached different conclusions in the experiments performed upon real data, as they showed that the most robust measures are the ones that are normalized and/or based on the raw values, such as RMSSD. To shed some light on this, an additional synthetic data experiment showed that low number of observations used in a benchmark (such as the one of the real data analysis) induces a non-negligible bias that favors the variability measures that are normalized and/or based on the raw value of the series. Therefore, only the results of the real data experiments should be used to determine which is the most reliable measure, as the full time series is used as a reference. We can therefore conclude that the best way to assess variability is computing the MAD as , namely, the median absolute deviation of the absolute value of the successive slopes of the data in the time series. Finally, it would be instructive to conduct additional analyses on real and dense time series, as data density would decrease the risk of bias.  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data are not publicly available due to ethical and privacy restrictions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: (·) as (subscript indicating) computed upon the absolute value of the slopes (·) n (subscript indicating) of the noise (·) raw (subscript indicating) computed upon the (raw) scores