This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Several measurements are used to describe the behavior of a diabetic patient's blood glucose. We describe a wavelet-based algorithm whose results indicate that a new measurement, called the PLA index, could be used to quantify the variability or predictability of blood glucose. This wavelet-based approach emphasizes the shape of a blood glucose graph. Combined with continuous glucose monitors (CGMs), this measurement could become a new tool for classifying patients based on their blood glucose behavior and may become a new method in the management of diabetes.

Over the past ten years, continuous glucose monitors (CGMs) have become readily available for use by type 1 diabetic patients, and insurance companies are beginning to accept them as a legitimate part of managing diabetes. As the use of CGMs increases, both physicians and patients are faced with the question of how to make sense of the tidal wave of data generated by these devices. Unlike the measuring of blood sugar four to six times a day through a finger-prick or the determination four times a year of A_{1c} (a measure of the average blood glucose level over the past 3–4 months), current CGMs generate a reading every five minutes, leading to nearly 9000 readings and approximately thirty graphs of data per month, if the patient wears the monitor all the time. What can be done to mine this data and learn new facts about a patient's blood glucose management?

This paper presents an analysis of this data via an experimental, unsupervised learning algorithm that uses data generated by a wavelet analysis of a patient's daily CGM data. The results indicate that the shape of a graph corresponding to a day's worth of CGM data may be an important attribute in diabetes care. The shape is intimately connected to the variability of blood glucose readings, which past research suggests is a fundamental part of diabetes care [

The CGM data utilized in this analysis (which were provided by Medtronic) contained blood glucose readings for 106 type 1 patients [

For each day of CGM data, a graph can be generated. The PLA index is a scalar value that depends on the shape, and hence character, of a graph. It quantifies how variable a graph is.

A Piecewise Linear Approximation (henceforth PLA) can approximate any function using a piecewise linear function. (This was first mathematically formulated in [

Thus, we can fix a tolerance, and then for each of the over 6600 graphs, we can compute a PLA factor. Testing different tolerances, we found that 12 mg/dL led to an adequate representation of the data: it was sufficient to capture overall trends without introducing extraneous segments when blood sugar is relatively stable. Whenever we refer to the PLA factor, the tolerance is therefore understood to be 12 mg/dL. We then average the PLA factors for a particular patient to obtain a single PLA index for that patient.
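To make the idea of counting segments at a fixed tolerance concrete, the following is a minimal sketch of one way a PLA factor could be computed. It is not the construction used in this paper, whose details are not reproduced here. It assumes a greedy scheme in which each segment joins two readings by a straight line and is extended for as long as every intermediate reading stays within the tolerance. The function names `chord_fits`, `pla_breakpoints`, and `pla_factor` are hypothetical.

```python
import numpy as np


def chord_fits(y, i, j, tolerance):
    """Return True if the line from (i, y[i]) to (j, y[j]) stays within
    `tolerance` of every reading between the two endpoints."""
    t = np.arange(i, j + 1)
    chord = y[i] + (y[j] - y[i]) * (t - i) / (j - i)
    return bool(np.max(np.abs(y[i:j + 1] - chord)) <= tolerance)


def pla_breakpoints(readings, tolerance=12.0):
    """Greedy piecewise linear approximation (an assumed scheme).

    Starting from the first reading, each segment is extended one reading
    at a time until the next extension would exceed the tolerance.
    Returns the indices where segments begin and end.
    """
    y = np.asarray(readings, dtype=float)
    n = len(y)
    breakpoints = [0]
    start = 0
    while start < n - 1:
        end = start + 1
        while end + 1 < n and chord_fits(y, start, end + 1, tolerance):
            end += 1
        breakpoints.append(end)
        start = end
    return breakpoints


def pla_factor(readings, tolerance=12.0):
    """Number of line segments needed at the given tolerance (mg/dL)."""
    return len(pla_breakpoints(readings, tolerance)) - 1
```

Under this sketch a steadily rising day yields a PLA factor of 1, while a day that falls and then rises sharply yields a factor of 2. A different segmentation rule would in general give different counts, so the numbers produced here should not be compared directly with those reported in the paper.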

The PLA indices were found to range from 13 to 40, with most being between 21 and 26.
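As an illustration, the averaging step and the three classes of variability listed in the table (22 or less, 23 to 25, 26 or more) might be coded as follows. This is a sketch under assumptions: the table lists whole numbers, so we assume here that the index is rounded to the nearest integer before classification, and the function names are hypothetical.

```python
from statistics import mean


def pla_index(daily_pla_factors):
    """Average of a patient's daily PLA factors."""
    return mean(daily_pla_factors)


def variability_class(index):
    """Map a PLA index to the color classes of the variability table.

    Assumption: the index is rounded to the nearest whole number first,
    since the table lists only integer thresholds.
    """
    rounded = round(index)
    if rounded <= 22:
        return "green"
    if rounded <= 25:
        return "red"
    return "blue"
```

For example, a patient whose daily PLA factors are 20, 22, and 21 has a PLA index of 21 and falls in the green class.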

The classification scheme in this table corresponds to a separate classification, discovered independently of any knowledge of the PLA index, that was generated using a wavelet-based unsupervised learning algorithm described in Section 4. The algorithm clusters data based on inherent similarities within the data. The unsupervised quality of the algorithm is vital because we start with no preconception about what clusters, if any, exist in the data. When clusters do form, there is a significant relationship between points within them as seen in [

Using our wavelet-based approach, one possible way to identify clusters from the blood glucose data is the three banded clusters shown in the figure of the wavelet-based clustering. The PLA index could be used alongside A_{1c} to analyze the interaction between average glucose value and variability for a patient.

The PLA index characterizes a graph in terms of its variability. A high PLA index means the typical glucose graph requires a larger number of line segments in its PLA approximation, and this is evidence of unpredictable blood glucose behavior. This suggests the patient is having difficulty controlling his or her diabetes, because he or she is prone to a large amount of variability and hence unpredictability, and an adjustment to basal dosage, bolus dosage, diet, or exercise is recommended. A low PLA index, by contrast, indicates predictable blood glucose behavior, but it does not necessarily indicate good control. (Take the extreme example of blood sugar being constant at 300 mg/dL: the PLA index is low, but this is certainly not healthy.) It could, however, denote a high level of predictability, meaning only slight adjustments are needed for the patient to successfully control his or her diabetes.

The methods involved in the wavelet analysis used to create

After partitioning the CGM data into days as described in Section 2, we applied wavelet filters to the data. We give a brief description of wavelet filtering here. For a complete description, see [

Using a pair of filters called the low pass filter,

The high pass filter results, called the

A property of wavelets ensures that only finitely many of the filter coefficients are nonzero. We applied the Daubechies D_{4} wavelet filters three times to yield three different levels, or sets of data, describing a patient's daily CGM data. The filters determined by the D_{4} wavelet are

The D_{4} wavelet has the interesting property that when its filters are applied to a linear signal, the detail coefficients are 0. Thus, it is able to detect sections of a signal that have a local linear trend; these trends are indicated by zeros in the high pass results. This provides a possible mathematical justification for the correspondence between the wavelet-based analysis and the PLA index.
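The vanishing of detail coefficients on linear signals can be checked numerically. The sketch below assumes the standard four-tap Daubechies filter coefficients and a simple boundary treatment in which only windows lying fully inside the signal are used; the paper's own normalization and boundary handling may differ. The names `d4_level` and `d4_three_levels` are hypothetical.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

# Standard four-tap Daubechies low pass filter (assumed normalization).
LOW_PASS = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
# Matching high pass filter: reverse the low pass taps and alternate signs.
HIGH_PASS = np.array([LOW_PASS[3], -LOW_PASS[2], LOW_PASS[1], -LOW_PASS[0]])


def d4_level(signal):
    """One level of filtering with downsampling by two.

    Assumption: no padding or wrap-around; only length-4 windows that fit
    entirely inside the signal are used. Requires at least 4 samples.
    """
    x = np.asarray(signal, dtype=float)
    n_out = (len(x) - 4) // 2 + 1
    windows = np.array([x[2 * i:2 * i + 4] for i in range(n_out)])
    approximation = windows @ LOW_PASS
    detail = windows @ HIGH_PASS
    return approximation, detail


def d4_three_levels(signal):
    """Apply the filters three times, refiltering the low pass output."""
    details = []
    approximation = signal
    for _ in range(3):
        approximation, detail = d4_level(approximation)
        details.append(detail)
    return approximation, details
```

Applied to a signal such as `100 + 0.5 * np.arange(32)`, every detail coefficient returned by `d4_level` is zero up to rounding error, whereas a signal with a single kink produces a nonzero detail coefficient only in a window that straddles the kink.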

Along with performing the wavelet filtering, we also augmented the data with the calculation of linear predictors and their associated errors. Linear predictors, developed by Simoncelli and Buccigrossi in [

Our analysis expressed each first level detail coefficient magnitude as a linear combination of four neighboring coefficient magnitudes. They were the coefficient immediately prior to it in the time series analysis, the neighbor coefficient two values past it in the time series, the _{i}_{1}, _{2}, _{3}, _{4}:
_{i}_{i}

Because this is an overdetermined system, the most predictive

This amounts to setting (The derivation of this can be found in [

The base two log error for the linear predictor is also computed as

By taking the logarithm of a vector, we mean taking base two log of each component of the vector. Additionally, |
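A least-squares fit of this kind can be sketched as follows. The sketch assumes the usual formulation: the coefficient magnitudes are collected in a vector `c`, the four neighboring magnitudes for each coefficient form the rows of a matrix `Q`, the weights minimize the squared error of `Q @ w` against `c`, and the log error is the componentwise difference of base two logarithms of the actual and predicted magnitudes. The symbols `Q`, `c`, `w`, the small constant `eps` that guards against taking the logarithm of zero, and the function names are all assumptions made for illustration.

```python
import numpy as np


def fit_predictor_weights(neighbor_magnitudes, magnitudes):
    """Least-squares weights w minimizing ||c - Q w||^2.

    neighbor_magnitudes: array of shape (n, 4), one row per coefficient.
    magnitudes: array of shape (n,), the coefficient magnitudes.
    """
    Q = np.asarray(neighbor_magnitudes, dtype=float)
    c = np.asarray(magnitudes, dtype=float)
    weights, *_ = np.linalg.lstsq(Q, c, rcond=None)
    return weights


def log2_prediction_error(neighbor_magnitudes, magnitudes, weights, eps=1e-12):
    """Componentwise log2|c| - log2|Q w| (assumed form of the log error)."""
    Q = np.asarray(neighbor_magnitudes, dtype=float)
    c = np.asarray(magnitudes, dtype=float)
    predicted = np.abs(Q @ weights)
    return np.log2(np.abs(c) + eps) - np.log2(predicted + eps)
```

When `Q` has full column rank, `np.linalg.lstsq` returns the same weights as the normal equations, while being numerically more stable than forming the inverse explicitly.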

From the first level detail coefficients and errors associated with their linear predictors, we created a statistical signature for a single day of a diabetic's CGM data. Four statistics—the mean, standard deviation, skewness, and kurtosis—were generated directly from the first level detail coefficient magnitudes (and not directly from the blood glucose readings of a diabetic over the course of a day, which have distributions best approximated by a log-normal distribution). The same four statistics were also calculated from the data set of errors associated with a linear predictor for each detail coefficient. Hence, the blood glucose readings of a single day were encapsulated by an 8-dimensional vector of statistics. These specific statistics were chosen due to their success in the detection of art and handwriting forgeries [
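The 8-dimensional signature described above could be assembled as in the following sketch. It assumes population (biased) moments and the non-excess form of kurtosis, neither of which is specified in the text, and it assumes the inputs are not constant, so that the standard deviation is nonzero. The function names are hypothetical.

```python
import numpy as np


def four_statistics(values):
    """Mean, standard deviation, skewness, and kurtosis of a sample.

    Assumptions: population moments, and kurtosis reported without
    subtracting 3. The sample must not be constant.
    """
    v = np.asarray(values, dtype=float)
    mean = v.mean()
    std = v.std()
    z = (v - mean) / std
    return np.array([mean, std, (z ** 3).mean(), (z ** 4).mean()])


def day_signature(detail_magnitudes, predictor_errors):
    """8-dimensional vector: four statistics of the first level detail
    coefficient magnitudes followed by four statistics of the errors."""
    return np.concatenate([
        four_statistics(detail_magnitudes),
        four_statistics(predictor_errors),
    ])
```

The Euclidean distance between two days is then `np.linalg.norm(signature_a - signature_b)`.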

With these vectors in ℝ^{8}, a distance or measure of similarity was calculated between two different days' worth of blood glucose data by computing the Euclidean norm between the vectors associated with each day. But in order to first compare two patients, we had to ensure that the samples were as representative as possible of an “ordinary” day for the patient under investigation. Due to the inherent unpredictability of diabetes, there are almost assuredly going to be a few days for each patient that are not representative of the patient's typical blood glucose behavior. To find these “outlying” days, we applied the Laplacian Eigenmap Method of Belkin and Niyogi, using a slight modification of the Euclidean norm known as the heat kernel, to map each patient's days to points in ℝ^{2}. We then computed the centroid of these points, removing the point farthest from the centroid, recomputing the centroid, and repeating until 25% of the points were removed. Subsequently, only these non-outlying days were considered. We chose 25% to ensure that the outlying days were removed while still leaving enough days within the central cluster to properly compare to other diabetics. Refining this process of selecting a subset of the CGM data to appropriately represent a patient is a possibility for future research.
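The iterative removal of outlying days can be sketched in a few lines. The sketch assumes the days have already been mapped to points in the plane and that the number of days to remove is the fraction rounded down; the function name `trim_outlying_days` is hypothetical.

```python
import numpy as np


def trim_outlying_days(points, fraction=0.25):
    """Repeatedly drop the point farthest from the current centroid.

    points: array of shape (n_days, 2), one embedded point per day.
    Returns the indices of the days that are kept.
    Assumption: int(fraction * n_days) days are removed.
    """
    pts = np.asarray(points, dtype=float)
    keep = np.arange(len(pts))
    n_remove = int(fraction * len(pts))
    for _ in range(n_remove):
        centroid = pts[keep].mean(axis=0)
        distances = np.linalg.norm(pts[keep] - centroid, axis=1)
        keep = np.delete(keep, np.argmax(distances))
    return keep
```

Because the centroid is recomputed after each removal, a single extreme day does not distort which days are removed next.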

With each patient being represented by a set of 8-dimensional vectors associated with these non-outlying days, two different patients were compared by using the Hausdorff metric. Hausdorff distances were then used to calculate the four patients nearest to a particular patient (one of these will always be the particular patient himself), and this information was used as input in the Laplacian Eigenmap Method using the nearest neighbor strategy developed in [

Examination of the statistics vectors revealed significant variation primarily in three statistics: skewness of the coefficient magnitudes, kurtosis of the coefficient magnitudes, and skewness of the error of the linear predictors. Thus, these were the factors that most strongly influenced the clustering. Additionally, all three statistics were related in the sense that if one was high for a patient, the other two were also likely to be high. (High values were as follows: skewness of the coefficient magnitudes (7 or higher), kurtosis of the coefficient magnitudes (65 or higher), and skewness of the error of the linear predictors (−0.05 or higher).) From this information, we devised a coloring scheme that indicated the average level of these statistics. They were colored green, red, and blue for high, medium, and low values respectively. Coloring

Computationally, this method is highly dependent on the number of diabetic patients being compared and the number of days' worth of complete data for each patient. The actual steps to compute the 8-dimensional vector for each day's worth of blood glucose readings are very simple, but the input and output operations required to produce and store the vectors dramatically increase the running time to several minutes for patients with hundreds of days of readings. Once this is completed, the main limiting factors are performing the Laplacian Eigenmap Method and computing Hausdorff distances between patients. The former relies on numerically solving for eigenvectors of matrices whose size equals the number of days' worth of data for each patient. Even for hundreds of days' worth of data, this is a very feasible operation. Computing Hausdorff distances, however, requires comparing each day of each diabetic with each day of every other diabetic. Still, this operation need only be carried out once, so for the purposes of this paper it was not an inhibiting factor.
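The cost of the Hausdorff step is visible in a direct implementation, which forms every pairwise distance between the days of two patients. The following is a minimal sketch using the standard symmetric Hausdorff distance with the Euclidean norm; the function names are hypothetical, and each patient is assumed to be given as an array with one 8-dimensional row per non-outlying day.

```python
import numpy as np


def hausdorff_distance(days_a, days_b):
    """Symmetric Hausdorff distance between two sets of day vectors."""
    A = np.asarray(days_a, dtype=float)
    B = np.asarray(days_b, dtype=float)
    # pairwise[i, j] is the Euclidean distance from day i of A to day j of B.
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_to_b = pairwise.min(axis=1).max()
    b_to_a = pairwise.min(axis=0).max()
    return float(max(a_to_b, b_to_a))


def nearest_patients(patients, index, k=4):
    """Indices of the k patients closest to patients[index].

    The patient is at distance 0 from itself, so it is always included.
    """
    distances = [hausdorff_distance(patients[index], other) for other in patients]
    order = np.argsort(distances, kind="stable")[:k]
    return [int(i) for i in order]
```

For two patients with m and n days, the pairwise array has m times n entries, which is why this step dominates when many patients with long records are compared.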

We have presented a new tool, the PLA index, that encompasses blood glucose tendencies, specifically variability. This measure was discovered while applying a wavelet-based approach to clustering time series—a technique previously shown to capture fundamental characteristics of data. We anticipate further research and further refinements of our method will confirm the utility of the PLA index and that the PLA index will become part of a robust toolkit of analytic methods that will help type 1 diabetic patients better control their blood glucose.

One way the application of this index can be refined is to calculate the index using data from a limited amount of time, such as three months. In this way, doctors and patients can observe how the index is changing over time, much like A_{1c}, and whether these changes are due to changes in therapy, stress, or other factors can be explored.

There are several papers that develop other mathematical methods to quantify glucose variability, or to use CGM data. Kovatchev

PLA (purple) of a blood glucose graph (black). The PLA factor is 10.

A graph with a high PLA index is shown above, and a graph with a low PLA index below.

Clustering using the wavelet-based method. Green corresponds to high wavelet-based statistics. Red corresponds to medium values, while blue corresponds to small values.

Clustering using the PLA index. Green corresponds to a low PLA index, red is medium, and blue is large.

Results of the Eigenmap method, before determining the clusters. Nearby patients should be similar in some way.

Three classes of variability.

Color | PLA index
Green | 22 or less
Red | 23, 24, or 25
Blue | 26 or more

This research was conducted at the Grand Valley State Research Experience for Undergraduates program, which was partially supported by the National Science Foundation under Grant No. DMS-0451254. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

We thank Medtronic for sharing the data with us. We thank Cesar Palerm for sharing his comments on an earlier version of this paper. We also thank Larry Shepp of the Department of Statistics at Rutgers University for introducing us to Medtronic and for his support. Our two anonymous reviewers also provided helpful observations to strengthen the final version of this paper.

Derek Olson was an undergraduate mathematics major at Drake University and is now a graduate student at the University of Minnesota. Robert Castellano is an undergraduate mathematics major at Stony Brook University. Edward Aboufadel is a Professor of Mathematics at Grand Valley State University and served as a faculty advisor to Olson and Castellano throughout.

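The Laplacian Eigenmap Method used in Section 4 is a non-linear dimensionality reduction algorithm developed by Belkin and Niyogi. Given points $x_1, \ldots, x_k$ with $x_i \in \mathbb{R}^n$, it produces points $y_1, \ldots, y_k$ with $y_i \in \mathbb{R}^2$. The algorithm finds the $y_i$ so that $y_i$ and $y_j$ are close whenever $x_i$ and $x_j$ are close, where closeness is encoded by a weight $W_{ij}$ assigned to each pair $x_i$, $x_j$.

Given the symmetric weight matrix $W$, the algorithm proceeds as follows.

Let $D$ be the diagonal matrix with entries $D_{ii} = \sum_j W_{ji}$.

Let $L = D - W$.

Compute eigenvectors $f_1$ and $f_2$ satisfying $L f_m = \lambda_m D f_m$ for $m = 1, 2$, where $\lambda_1$ and $\lambda_2$ are the two smallest non-zero eigenvalues.

The vector $y_i$ is then given by $y_i = (f_1(i), f_2(i))$.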
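The steps of the Laplacian Eigenmap Method can be sketched as follows, using the standard notation of a weight matrix W, a diagonal matrix D of its column sums, and L = D - W, with the generalized eigenproblem L f = λ D f. The sketch makes several assumptions: heat kernel weights of the form exp(-||x_i - x_j||^2 / t) between every pair of points, a hypothetical parameter value `t`, no self-weights, and a connected graph so that exactly one eigenvalue is zero. The function names are hypothetical.

```python
import numpy as np


def heat_kernel_weights(points, t=1.0):
    """Symmetric weights W[i, j] = exp(-||x_i - x_j||^2 / t), zero diagonal.

    The value of t is a hypothetical choice made for illustration.
    """
    X = np.asarray(points, dtype=float)
    squared = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-squared / t)
    np.fill_diagonal(W, 0.0)
    return W


def laplacian_eigenmap(W, dim=2):
    """Solve L f = lambda D f and return the embedding and eigenvalues.

    The generalized problem is reduced to an ordinary symmetric one via
    D^(-1/2) L D^(-1/2) u = lambda u with f = D^(-1/2) u.
    Assumes a connected graph, so only the first eigenvalue is zero.
    """
    degrees = W.sum(axis=0)
    L = np.diag(degrees) - W
    inv_sqrt = 1.0 / np.sqrt(degrees)
    L_sym = inv_sqrt[:, None] * L * inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L_sym)
    # Skip the trivial eigenvalue 0 and keep the next `dim` eigenvectors.
    embedding = inv_sqrt[:, None] * eigenvectors[:, 1:dim + 1]
    return embedding, eigenvalues[1:dim + 1]
```

Row i of the returned embedding plays the role of the point in the plane associated with the i-th input point. For two well-separated groups of points, the first coordinate typically takes opposite signs on the two groups, which is what makes the embedding useful for clustering.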