Learning Using Concave and Convex Kernels: Applications in Predicting Quality of Sleep and Level of Fatigue in Fibromyalgia

Fibromyalgia is a medical condition characterized by widespread muscle pain and tenderness and is often accompanied by fatigue and alteration in sleep, mood, and memory. Poor sleep quality and fatigue, as prominent characteristics of fibromyalgia, have a direct impact on patient behavior and quality of life. As such, the detection of extreme cases of sleep quality and fatigue level is a prerequisite for any intervention that can improve sleep quality and reduce fatigue level for people with fibromyalgia and enhance their daytime functionality. In this study, we propose a new supervised machine learning method called Learning Using Concave and Convex Kernels (LUCCK). This method employs similarity functions whose convexity or concavity can be configured so as to determine a model for each feature separately, and then uses this information to reweight the importance of each feature proportionally during classification. The data used for this study was collected from patients with fibromyalgia and consisted of blood volume pulse (BVP), 3-axis accelerometer, temperature, and electrodermal activity (EDA), recorded by an Empatica E4 wristband over the courses of several days, as well as a self-reported survey. Experiments on this dataset demonstrate that the proposed machine learning method outperforms conventional machine learning approaches in detecting extreme cases of poor sleep and fatigue in people with fibromyalgia.


Introduction
Fibromyalgia is medical condition characterized by widespread muscle pain and tenderness that is typically accompanied by a constellation of other symptoms, including fatigue and poor sleep [1][2][3][4][5][6][7][8][9]. Poor sleep, which is a cardinal characteristic of fibromyalgia, is strongly related to greater pain and fatigue, and lower quality of life [10][11][12][13][14][15][16]. As a result, any intervention that can improve sleep quality may enhance daytime functionality and reduce fatigue in people with fibromyalgia.
Studies of sleep in fibromyalgia often rely on self-reported measures of sleep or polysomnography. While easy to administer, self-reported measures of sleep demonstrate limited reliability and validity in terms of their correspondence with objective measures of sleep. In contrast, polysomnography is considered the gold standard of objective sleep measurement; however, it is expensive, difficult to administer, especially on a large scale, and may lack ecological validity. Autonomic nervous system (ANS) imbalance during sleep has been implicated as a mechanism underlying unrefreshed sleep in fibromyalgia. ANS activity can be assessed unobtrusively through ambulatory measures of heart rate variability (HRV) and electrodermal activity (EDA) [17,18]. Wearable devices such as the Empatica E4 are able to directly, continuously, and unobtrusively measure autonomic functioning such as EDA and HRV [19][20][21][22].
In the literature, there are few studies in which machine learning methods are used for classification or prediction of conditions related to fibromyalgia, none of which use physiological signals. A recent survey paper [23] summarizes various types of machine learning methods that have been used in pain research, including fibromyalgia. Previously, using data from 26 individuals (14 individuals with fibromyalgia and 12 healthy controls), the relative performance of machine learning methods for classification of individuals with and without pain using neuroimaging and self-reported data have been compared [24]. In another study using MRI images of 59 subjects, support vector machine (SVM) and decision tree models were used to first distinguish healthy control patients from those with fibromyalgia or chronic fatigue syndrome, and then differentiate fibromyalgia from chronic fatigue syndrome [25]. In [26], an SVM trained on fMRI images was used to distinguish fibromyalgia patients from healthy controls. The combination of fMRI with multivariate pattern analysis has also been investigated in classifying fibromyalgia patients, rheumatoid arthritis patients and healthy controls [27]. Psychopathologic features within an ADABoost classifier have also been employed for classification of patients with fibromyalgia and arthritis [28]. In another recent work [29], secondary analysis of gene expression data from 28 patients with fibromyalgia and 19 healthy controls was used to distinguish between these two groups.
In this study our immediate interest is to predict extreme cases of fatigue and poor sleep in people with fibromyalgia. For such an analysis, we use self-reported quality of sleep and fatigue severity, continuously collected data from the Empatica E4, to measure autonomic nervous system activity during sleep (Section 2). These signals are preprocessed to remove noise and other artifacts as described in Section 3.1. After preprocessing, a number of mathematical features are extracted, including various statistics, signal characteristics, and HRV features (Section 3.2). Section 4 provides a detailed description of our novel Learning Using Concave and Convex Kernels (LUCCK) machine learning method. This model, along with other conventional machine learning methods, were trained on the extracted features and used to predict extreme cases of poor sleep and fatigue, with our method yielding the best results (Section 5).
We believe this analytical framework can be readily extended to outpatient monitoring of daytime activity, with applications to assessing extreme levels of fatigue and pain, such as those experienced by patients undergoing chemotherapy.

Dataset
The data used for this study was collected from a group of 20 adults with fibromyalgia and consists primarily of a set of signals recorded by an Empatica E4 wristband over the course of seven days (removing 1 h/day for charging/download). Most (80%) participants were female with mean age = 38.79 (min-max=18-70 years). Of a possible 140 nights of sleep data, the sample had data for 119 (85%) nights. In this dataset, 19.9% of heartbeats were missing due to noisy signals or failure of the Empatica E4 in detecting beats. Data were divided into 5-min windows for HRV analysis; windows with more than 15% missing peaks were eliminated. This led to the exclusion of 30.9% of the windows.
The signals used in this analysis are each patient's blood volume pulse (BVP), 3-axis accelerometer, temperature, and EDA. In addition to these recordings, each subject self-reported his or her wake and sleep times, as well as self-assessed his or her level of fatigue and quality of sleep every morning. These data are labeled by self-reported quality of sleep (1 to 10, 1 being the worst) and level of fatigue (from 1 to 10, 10 indicating the highest level of fatigue).

Signal Processing: Preprocessing, Filtering, and Feature Extraction
The schematic diagram of Figure 1 represents our approach to analyzing the BVP and accelerometer signals in the fibromyalgia dataset. During preprocessing, we remove noise from the input signals and format them for future processing (via the Epsilon Tube filter). Once the BVP and accelerometer signals are fully processed, they along with the EDA and temperature signals can then be analyzed and features can be extracted, which in turn leads to the application of machine learning. The final output is a prediction model to which new data can be fed.

Preprocessing
To begin, the raw signals are extracted per patient according to his or her reported wake and sleep times. These are then split into two groups: awake and asleep. For each patient and day, the awake data is paired with the following night's data and ensuing morning's self-assessed level of fatigue and quality of sleep.
Our approach to preprocessing BVP signals consists of a bandpass filter (to remove both the low-frequency components and the high-frequency noise), a wavelet filter (to help reduce motion artifacts while maintaining the underlying rhythm), and Epsilon Tube filtering. In order to least perturb the true BVP signal, we chose the Daubechies mother wavelet of order 2 ('db2') as it closely resembles the periodic shape of the BVP signal. Other wavelets were also considered but ultimately discarded. Once we selected a mother wavelet, we performed an eight-level deconstruction of the input BVP signal. By setting threshold values for each level of detail coefficients (Table 1) and using the results to reconstruct the original signal, we were able to significantly reduce the amount of noise present without compromising the measurement integrity of the underlying physiological values. Utilizing this filter on a number of test cases showed that the threshold values produced consistently useful results regardless of the input, meaning tailored interactions are not required for each signal. The accelerometer data was upsampled from 32 Hz to 64 Hz via spline interpolation to match the sampling frequency of the BVP signal. The other signals (temperature and EDA) were left unfiltered. We then use these preprocessed signals as input into our main filtering approach (Epsilon Tube), the output of which is then used for feature extraction (Section 3.2).
After filtering of the BVP signal and interpolation of the accelerometer signal, the Epsilon Tube filter [30] is the final component of the preprocessing stage. As discussed in [30], since the BVP signal (and generally any impedance-plethysmography-based measurements) is very susceptible to motion artifact, reduction of this noise is a crucial part of the filtering process. This method uses the synchronized accelerometer data to estimate the motion artifact of BVP signal while leaving the periodic component intact. Let b t represent BVP values at time t, A a matrix whose rows are the accelerometer signals, and w the vector of Epsilon Tube filter coefficients. Given the tube radius , the error of b t estimation, i.e., y t (A, w), is zero if the point b t falls inside the tube |b t − y t (A, w)| = max{0, |b t − y t (A, w)| − }.
The Epsilon Tube filter is formulated as a constrained optimization problem that can be expressed as where N is the length of BVP signal, ζ t and ζ t are slack variables, R(s, A, w) is the regularization term and c is a designated parameter that adjusts the trade-off between the two objectives. More information about the Epsilon Tube filter can be found in [30]. Taking both the BVP and accelerometer signals as input, the method assumes periodicity in the BVP signal and looks for a period of inactivity at the beginning of the data to use as a template for the rest of the signal. To achieve this, the calmest section of the accelerometer signal (as determined by the longest stretch during which the values never exceed one standard deviation from the mean of the signal) is found. The signal is then shifted so this period of inactivity is at the beginning, and the BVP signal is also shifted to ensure the timestamps remain aligned. The shifted signals are then fed into the Epsilon Tube algorithm, and the resulting output is used for feature extraction.

Feature Extraction
Once the BVP and accelerometer signals are processed, the full signal set is used for feature extraction. There are 91 features extracted from each of the following signals: • Denoised (filtered) BVP signal, i.e., the output of the Epsilon Tube algorithm, with sampling frequency of 64 Hz. • Low-band, mid-band, and high-band pass filters applied to the denoised BVP signal. • Interpolated accelerometer signal, from 32 HZ to 64 Hz. • Tube sizes from the Epsilon Tube filtering method, another output of the Epsilon Tube algorithm that has the time-varying tube size signal. • Temperature signal, with sampling frequency of 4 Hz.
• EDA signal, with sampling frequency of 4 Hz.
• The calculated breaths per minute (BPM) signal based on the denoised BVP signal.
• The calculated HRV signal based on the denoised BVP signal.
The extracted features are listed in Table 2. These are extracted from both the awake and the sleep signals, resulting in a full feature set consisting of 182 features. When feature selection is performed using Weka's information gain algorithm [31] on the first four subjects, the only feature ranked consistently near the top is the average of the BVP signal after being run through a mid-band bandpass filter.

Machine Learning: Learning Using Concave and Convex Kernels
The final step in the analysis pipeline is the creation of a model that can be used to predict the extreme cases of quality of sleep or level of fatigue for people with fibromyalgia. As detailed in Section 5, in addition to testing a number of conventional machine learning methods, we tested a novel supervised machine learning called Learning Using Concave and Convex Kernels (LUCCK). A key factor in the classification of complex data is the ability of the machine learning algorithm to use vital, feature-specific information to detect settled and complex patterns of changes in the data. The LUCCK method does this by employing similarity functions (defined below) to capture and quantify a model for each of the features separately. The similarity functions are parametrized so that the concavity or convexity of the function within the feature space can be modified as desired. Once the similarity functions and attendant parameters are chosen, the model uses this information to reweight the importance of each feature proportionally during classification.

Notation
In this section, x ∈ R n is a real-valued vector of features such that x = (x 1 , . . . , x n ), and x i is a real-valued (scalar) feature. Throughout this section, we consider d classes, n features and m (data) samples; also the indexes k = 1, . . . , d; i = 1, . . . , n; and j = 1, . . . , m are used for classes, features and samples respectively. Additionally, j = 1, . . . , m k refers to m k < m samples in class C k .

Classification Using a Similarity Function
An instructive model for comparison to the Learning Using Concave and Convex Kernels method is the k-nearest neighbors algorithm [33][34][35] and weighted k-nearest neighbors algorithm [36]. In k-nearest neighbors, a test sample x is classified by comparing it to the k nearest training samples in each class. This can make the classification sensitive to a small subset of samples. Instead, LUCCK classifies test data by comparing it to all training data, properly weighted according to their distance to x, which is determined by a similarity function. One major difference between LUCCK and weighted k-nearest neighbors is that our approach is based on a similarity function that can be highly non-convex. A fat-tailed (relative to a Gaussian) distribution is more realistic for our data, given that there is a small but non-negligible chance that large errors may occur during measurement, resulting in a large deviation in the values of one or more of the features. The LUCCK method allows for large deviations in a few of the features with only a moderate penalty. Methods based on convex notions of similarity or distance (such as the Mahalanobis distance) are unable to deal adequately with such errors.
Suppose that the feature space is comprised of real-valued vectors x ∈ R n . A similarity function is a function Q : R n → R that measures the closeness of x to the origin, and satisfies the following properties: 1. Q(x) > 0 for all x ∈ R n ; 2. Q(x) = Q(−x) for all x ∈ R n ; 3. Q(λx) > Q(x) if x ∈ R n is non-zero and |λ| < 1.
The value Q(x − y) measures the closeness between the vectors x and y. Using the similarity function Q(x), a classification algorithm can be created as follows: The set of training data C is a subset of R n and is a disjoint union of d classes: C = C 1 ∪ C 2 ∪ · · · ∪ C d . Let m = |C| be the cardinality of C and define m k = |C k | for all k so that m = m 1 + · · · + m d . To measure the proximity of a feature vector x to a set Y of training samples, we simply add the contributions of each of the elements in Y: A vector x is classified in class C k , where k is chosen such that R(x, C k ) is maximal. This classification approach can also be used as the maximum a posteriori estimation (details can be found in Appendix A).

Choosing the Similarity Function
The function Q(x) has to be chosen carefully. Let Q(x) be defined as the product where x = (x 1 , . . . , x n ) ∈ R n and Q i (x i ) only depends on the i-th feature. The function Q i (x i ) is again a similarity function satisfying the properties Q i (−x i ) = Q i (x i ) > 0 for all x ∈ R, and Q(x) > Q(y) whenever |x| < |y|. After normalization, the Q, Q 1 , Q 2 , . . . , Q n can be considered as probability density functions. As such, the product formula can be interpreted as instance-wise independence for the comparison of training and test data. In the naive Bayes method, features are assumed to be independent globally [37]. Summing over all instances in the training data allows for features to be independent in our model. Next we need to choose the functions Q 1 , . . . , Q n . One could choose Q i (x i ) = e −γ i x 2 , so that is a Gaussian kernel function (up to a scalar). However, this does not work well in practice: • One or more of the features is prone to large errors -The value of Q(x − y) is close to 0 even if x and y only differ significantly in a few of the features. This choice of Q(x) is therefore very sensitive to small subsets of bad features. • The curse of dimensionality-For the training data to properly represent the probability distribution function underlying the data, the number of training vectors should be exponential in n, the number of features. In practice, it usually is much smaller. Thus, if x is a test vector in class C k , there may not be a training vector y in C k for which Q(x − y) is not small.

Consequently, let
for some parameters λ i , θ i > 0. The function Q i (x i ) can behave similarly to the Cauchy distribution. This function has a "fat tail": as x → ∞ the rate that Q i (x i ) goes to 0 is much slower than the rate at which e −γ i x 2 goes to 0. We have The function Q has a finite integral if θ i > 1 2 for all i, though this is not required. Three examples of this function can be found in Appendix B.

Choosing the Parameters
Values for the parameters λ 1 , λ 2 , . . . , λ n and θ 1 , θ 2 , . . . , θ n must be chosen to optimize classification performance. The value of log(Q i (x i )) = −θ i log(1 + λ i x 2 ) is the most sensitive to changes in x when maximal. An easy calculation shows that this occurs when x = λ where Λ is some fixed parameter. Next we choose the parameters θ 1 , . . . , θ n . We fix a parameter Θ that will be the average value of θ 1 , . . . , θ n . If we use only the i-th feature, then we define for any set Y of feature vectors. For x in the class C k , 1 measures how much closer x i is to samples in the class C k than to vectors in the set C of all feature vectors except x itself. This value measures how well the i-th feature can classify x as lying in C k as opposed to some other class. If we sum over all x ∈ C and ensure that the result is non-negative we obtain The θ 1 , . . . , θ n can be chosen so that they have the same ratios as α 1 , . . . , α n and sum up to nΘ: In terms of complexity, if n is the number of features and m is the number of training samples then the complexity of the proposed method would be O(n × m 2 ).

Reweighting the Classes
Sometimes a disproportionate number of test vectors are classified as belonging to a particular class. In such cases one might get better results after reweighting the classes. The weights ω 1 , ω 2 , . . . , ω d can be chosen so that all are greater than or equal to 1. If p is a probability vector, then we can reweight it to a vector If the output of the algorithm consists of the probability vectors p(x (1) ), . . . , p(x (m) ) the algorithm can be modified so that it yields the output W ω (p(x (1) )), . . . , W ω (p(x (m) )). A good choice for the weights ω 1 , . . . , ω d can be learned by using a portion of the training data. To determine how well a training vector x ∈ C can be classified using the remaining training vectors in C \ {x}, we define The value p k (x) is an estimate for the probability that x lies in the class C k , based on all feature vectors in C except x itself. We consider the effect of reweighting the probabilities p k (x), by . If x lies in the class C k , then the quantity measures how badly x is misclassified if the reweighting is used. The total amount of misclassification is We would like to minimize this over all choices of ω 1 , . . . ω d ≥ 1. As this is a highly nonlinear problem, making optimization difficult, we instead minimize instead. This minimization problem can be solved using linear programming, i.e., by minimizing the quantity for the variables ω 1 , . . . , ω d and new variables z (1) , . . . , z (m) under the constraints that z (j) ≥ ω k p(x (j) ) and ω k ≥ 1 for all k and j with 1 ≤ k ≤ d and 1 ≤ j ≤ m.

Experiments
In this section, the performance of LUCCK is first compared with other common machine learning methods using four conventional datasets, after which its performance on the fibromyalgia dataset is evaluated.

UCI Machine Learning Repository
In this set of experiments, LUCCK in compared to some well-known classification methods on a number of datasets downloaded from the University of California, Irvine (UCI) Machine Learning Repository [38]. Each method was tested on each dataset using 10-fold cross-validation, with the average performance and execution time across all folds provided in Table 3. Table 4 contains the average values for accuracy and time across all four datasets.

Fibromyalgia Dataset
In this study, we have created a model that can be used to predict the quality of sleep or level of fatigue for people with fibromyalgia. The labels are self-assessed scores ranging from 1 to 10. Attempts to develop a regression model showed less promise than the results from a binary split. The most likely reason for this failure of the linear regression model is the nature of self-reported scores, especially those related to patient assessment of their level of pain. This fact is primarily due to the differences in individual levels of pain-tolerance. In previous studies [39,40], proponents of neural "biomarkers" argued that self-reported scores are unreliable, making objective markers of pain imperative. In another study [24], self-reported scores were found to be reliable only for extreme cases of pain and fatigue. Consequently, in this study, binary classification of extreme cases of fatigue and poor sleep is investigated. In this situation, a cutoff value is selected: patients that chose a value less than the threshold are placed in one group, while those that chose a value above the threshold are placed in another. As such, the values >8 are chosen for extreme cases of fatigue, and the values <4 are chosen for extreme cases of poor sleep quality. In this way, binary classifications are possible (>8 vs. <8 for fatigue and >4 vs. <4 for sleep). Using the extracted feature set, machine learning algorithms are applied and tested using 10-fold cross-validation. This is done in a way so as to prevent the data from any one patient being in multiple folds: all of a given patient's data are including entirely in a single fold. In addition, in order to address possibly imbalanced data during fold creation, random undersampling is performed to ensure the ratio between the two classes is not less than 0.3 (this rate is chosen since the extreme cases are at most 30 percent of the [1,10] interval of self-reported scores). This prevents the methods from developing a bias towards the larger class.

Results with Conventional Machine Learning Methods
A number of conventional machine learning models listed in Table 5 were applied to the extracted data in this study. As can be seen, many major machine learning methods were tested. For each of these methods, various configurations were tested, and the best sets of parameters were chosen using cross-validation (hyperparameter optimization). For instance, we used the combination of AdaBoost with different types of standard methods such as Decision Stump and Random Forest in order to explore the possibility of improving the performance of these methods via boosting. The k-nearest neighbor method with k = 7 was used in this experiment. For the weighted k-nearest neighbor method [36], the inversion kernel (inverse distance weights) with k = 7 resulted in the best performance. For the Neural Network algorithm, the Weka (Waikato Environment for Knowledge Analysis) [41] multilayer perceptron with two hidden layers was used. The results of using these machine learning approaches for prediction of extreme sleep quality (cutoff of 4) and fatigue level (cutoff of 8) are presented in Table 5. As shown in this table, the AdaBoost method based on random forest yielded the best results for quality of sleep (based on area under the receiver operating characteristic curve, or AUROC). For level of fatigue, the neural network was the best performing model.

Results with Our Machine Learning Method: Machine Learning Using Concave and Convex Kernels
In addition to the aforementioned conventional methods, we also used our machine learning approach that resulted in superior performance compared to the standard machine learning methods discussed above. Recall that in the Learning Using Concave and Convex Kernels algorithm, test data is classified by comparing it to all training data, properly weighted according to information extracted from each of the features (see Section 4 for further details). The results of applying our method to fibromyalgia are presented in Table 5, with cutoff values of 4 and 8 for quality of sleep and level of fatigue, respectively. As can be seen, LUCCK was able to vastly outperform other models on the fatigue outcome; however, the improvement on sleep outcome was not significant. This disparity is likely due to the different feature spaces for the sleep and fatigue outcomes. In general, the feature space for fatigue is significantly more dispersed, due to there being more samples (during daytime) and also that daytime activity negatively affects the signal quality, increasing dispersion. In contrast, signals (and their associated features) recorded during sleep are of better quality. This leads to the better prediction result for sleep in all methods used. Our proposed LUCCK algorithm can ameliorate the nature of the fatigue feature space, as it is specifically designed to reduce the effect of training data for which there is a large deviation from test data. As such, LUCCK was able to vastly outperform other models on the fatigue outcome. We should note that while the cohort size in this study seems to be limited, the continuous recording of physiological signals for seven days and nights created a comprehensive dataset. Additionally, similar to k-NN and its weighted version (and unlike SVM and neural network models), LUCKK can be trained even with few samples, which is one advantage of the proposed algorithm.

Conclusions and Discussion
In this study we primarily focused on prediction of the extreme cases of fatigue and poor sleep. As such, we have created preprocessing/conditioning methods that have the ability to improve the quality of parts of the signals with low quality due to motion artifact and noise. In addition, we identified a set of mathematical features that are important in extracting patterns from physiological signals that can distinguish poor and good clinical outcomes for applications such as fibromyalgia. Additionally, we showed that our proposed machine learning method outperformed the standard methods in predicting the outcomes such as fatigue and sleep quality. Generally, our proposed framework (preprocessing, mathematical features, and proposed machine learning method) can be employed in any study that involves prediction using BVP, HRV and EDA signals.

Conflicts of Interest:
The authors declare no conflict of interest.

Patents:
The epsilon tube filter is covered by US Patent 10,034,638, for which Kayvan Najarian is a named inventor.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Classification as Maximum a Posteriori Estimation
The classification approach suggested in Section 4.2 can also be viewed in terms of probability density functions. Suppose that R n Q(x) = e with 0 < e < ∞. The function is therefore a probability density function. This probability density function is an estimation for the probability distribution from which the training data were taken.
We have is a probability density function for the training data in class C k for k = 1, 2, . . . , d and p(C k ) := m k m is the probability that a randomly chosen training vector lies in C k . f C can be considered as a mixture of the probability density functions f C 1 , . . . , f C d . Suppose that x ∈ R n is taken from the distribution f C k with probability p(C k ), then the distribution for x is f C . Given the outcome x, the probability that it was taken from the distribution f C k is This shows that the classifying scheme is the maximum a posteriori estimation. Instead of classifying a feature vector, the probability vector p(x) = (p 1 (x), p 2 (x), . . . , p d (x)) can be given as output. The formula is well-formed, even if Q(x) does not have a finite integral, which may be the case in some examples.
As λ 1 goes to zero, the function converges to the normal distribution e −x 2 (the red curve in Figure A1).
Example A2. Suppose that n = 2, then Q(x) is defined as with θ 1 = θ 2 = λ 1 = λ 2 = 1. Q(x) is depicted in Figure A2 at various level curves for Q(x) = α, with 0 < α < 1 . The Equation Q(x 1 , x 2 ) = α is a closed curve. Such a curve can be thought of as the set of all points that have a given distance to the origin. We observe that for α ≥ 1 4 , the neighborhood {x ∈ R 2 | Q(x) > α} of the origin is convex, but for α < 1 4 it is not.