Identifying Patients with Familial Chylomicronemia Syndrome Using FCS Score-Based Data Mining Methods

Background: There are no exact data about the prevalence of familial chylomicronemia syndrome (FCS) in Central Europe. We aimed to identify FCS patients using the FCS score proposed by Moulin et al. together with data mining, and to assess the diagnostic applicability of the FCS score. Methods: Analyzing medical records of 1,342,124 patients, we calculated the FCS score of each patient. Based on the data of previously diagnosed FCS patients, we trained machine learning models to identify other features that may improve FCS score calculation. Results: We identified 26 patients with an FCS score of ≥10. Among the trained models, boosting tree models and support vector machines performed best for patient recognition, with an overall AUC above 0.95, while artificial neural networks achieved an AUC above 0.8, indicating lower efficacy. We identified laboratory features that can be considered as additions to the FCS score calculation. Conclusions: The estimated prevalence of FCS was 19.4 per million in our region, which exceeds the prevalence data of other European countries. Analysis of larger regional and country-wide data might increase the number of FCS cases. Although the FCS score is an excellent tool for identifying potential FCS patients, consideration of some other features may improve its accuracy.


Introduction
Fasting chylomicronemia may rarely be due to a monogenic disorder that markedly reduces the activity of lipoprotein lipase (LPL), resulting in a decreased clearance of the triglyceride-rich lipoproteins from plasma [1]. This condition, referred to as familial chylomicronemia syndrome (FCS), is characterized by severe hypertriglyceridemia and sustained fasting chylomicronemia, thus predisposing affected individuals to recurrent episodes of pancreatitis. With an estimated frequency of one per million in the population, FCS is usually due to homozygous or compound heterozygous mutations of the LPL gene, leading to a severe lack of functioning LPL protein [2]. Although the majority of FCS patients are carriers of loss-of-function mutations in the LPL gene, similar mutations in other genes have also been found to be causal in FCS, including those encoding apolipoproteins C2 and A5 (APOC2 and APOA5, respectively), lipase maturation factor 1 (LMF1), glycosylphosphatidylinositol-anchored high-density lipoprotein-binding protein 1 (GPIHBP1) and glycerol-3-phosphate dehydrogenase 1 (G3PDH1) [3][4][5][6].
Compared to those with multifactorial chylomicronemia syndrome (MFCS), patients with FCS are usually younger and less likely to possess any of the aggravating factors of hypertriglyceridemia; however, they are more prone to develop pancreatitis on the basis of the sustained chylomicronemia [7]. Interestingly, FCS patients are less likely to have cardiovascular disease (CVD), probably because of the severe reduction in LPL activity reducing the formation and accumulation of the atherogenic chylomicron and very low density lipoprotein (VLDL) remnants [2]. With a mortality rate of 2-5%, acute pancreatitis is the most dangerous consequence of hypertriglyceridemia [8]. Recently, an international expert panel proposed an excellent and easy-to-use diagnostic tool named the FCS score (Table 1) for the better identification of FCS patients [6]. According to Moulin et al., the FCS score turned out to have a sensitivity of 88% and specificity of 85% in identifying individuals with "very likely FCS". Although the disease represents a great health burden, exact data are lacking about the frequency of the disease in Hungary and other European countries as well [6]. Therefore, we aimed to identify FCS patients using the above mentioned FCS score with data mining methods in two major hospitals of the Northern Great Plain region of Hungary. We also tried to assess the usability of the FCS score using various machine learning methods that were trained on the data of previously identified FCS patients, individuals likely to have FCS based on their FCS score and the total clinical population in Debrecen (n = 590,500).

Patients and Methods
We obtained raw data from the hospital record system of the two leading medical centers of the Northern Great Plain region of Hungary: the University of Debrecen Clinical Center (UDCC) and the County Hospital of Szabolcs-Szatmár-Bereg (CHSSB). Covering eight years in total, the data source contained all medical records from these two centers between 1 January, 2007 and 31 December, 2014. Through the servers of Aesculab Medical Solutions (Black Horse Group Ltd., Debrecen, Hungary), we accessed, cleaned, preprocessed and structured anonymous data that contained all medical records from these healthcare providers. As discussed previously [9], the studied population was considered to be representative of the regional population; therefore, the calculated prevalence may serve as an estimate of the regional prevalence of FCS. The information processed for the study came from three data sources: (i) laboratory data, (ii) diagnostic data using, and transforming to, the International Statistical Classification of Diseases and Related Health Problems (ICD)-10 convention and (iii) textual data including all hospital appointments. Data cleaning, preprocessing steps, detailed methodologies and software used were described previously [9]. The feature set (feature space) for the training included (i) all available nominal laboratory data during the medical history, with nominal values calculated for the same units (e.g., triglycerides above 1.7 mmol/L) and (ii) the medical history either available from the diagnosis or mined from the textual data and calculated to 5 characters of the ICD-10 (e.g., E7800). The FCS score calculations and chart generation were performed with open-source software solutions on the textual data (Appendix A).
From the mined data, we calculated the previously proposed FCS score for each patient and grouped the patients according to the likelihood of FCS. Following data selection and screening, models were trained on the medically evaluated data with multiple machine learning techniques, including neural networks with rectified linear unit (ReLU) activations, adaptive boosting (AdaBoost), gradient boosting (XGBoost) and support vector machines (SVM). The training was carried out with open-source software (Appendix A) using the UDCC site clinical data. Tests were performed both on the UDCC data (with a 50-50 split between training and testing) and on the CHSSB data. We trained binary classification models on a dataset that contained all previously identified FCS patients, labeled as "positive", and randomly selected patients from the remaining part of the clinical population with no previous diagnosis of FCS, labeled as "negative". We also experimented with models trained on a dataset in which individuals likely to have FCS based on their FCS score were assigned the "positive" label.
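The dataset construction described above can be illustrated with a minimal sketch. It assumes hypothetical patient identifiers and a hypothetical number of sampled negatives, and it stratifies the 50-50 split by label so that both halves contain positives; the text does not state whether the original split was stratified, so this is an assumption rather than the study's implementation.

```python
import random


def build_training_set(positive_ids, population_ids, n_negatives, seed=0):
    """Return (patient_id, label) pairs: every known positive plus
    randomly sampled negatives from the rest of the population."""
    rng = random.Random(seed)
    positives = set(positive_ids)
    candidates = sorted(pid for pid in population_ids if pid not in positives)
    negatives = rng.sample(candidates, n_negatives)
    return [(pid, 1) for pid in sorted(positives)] + [(pid, 0) for pid in negatives]


def split_half(labeled, seed=0):
    """50-50 split, stratified by label (assumption: stratification is
    not mentioned in the text, it only keeps both halves usable)."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [row for row in labeled if row[1] == label]
        rng.shuffle(group)
        half = len(group) // 2
        train.extend(group[:half])
        test.extend(group[half:])
    return train, test


# Hypothetical toy data: 1000 anonymized identifiers, 6 known positives.
population = range(1000)
known_fcs = [3, 57, 120, 433, 871, 902]
data = build_training_set(known_fcs, population, n_negatives=60)
train, test = split_half(data)
```

Fixing the random seed keeps the sampled negatives reproducible, which matters when so few positives are available and results can vary noticeably between draws.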

About Machine Learning
We may define the problem as a traditional binary classification, as we have a finite, real-valued descriptor and a binary label for each patient. A patient either has FCS and is labeled "1", or lacks FCS and is labeled "0". Based on the annotated dataset, several ways exist to identify relations between the features (including the elements of the descriptors that contain the ICD-10 diagnosis, as well as laboratory test values) and the known labels. In order to determine the best method for FCS classification and to approximate the performance of the models over the whole population, we built models using subsets of patients with known true labels (clinically diagnosed FCS) and evaluated the performance of the learned models on an independent dataset with known true labels. Our reasoning was based on the fundamental theory of generalization introduced by Vapnik and Chervonenkis in 1971 [10] and on a set of consequences of the theorem, which apply to all methods except a special class of neural networks. For the latter, we refer to Nagarajan and Kolter [11] and Devroye et al. [12]. Therefore, even if the bounds in the Vapnik-Chervonenkis generalization are not informative about deep neural networks at first sight, there may be an underlying structure for which the theorem is meaningful in practice, too. There are three key rules based on the theorem, which are in line with the fundamentals of data mining and machine learning: (i) prefer models with low complexity rather than ones with the capacity to learn any labeling [13], (ii) evaluate on an independent test set and (iii) use a training set as large as possible.
To cover different yet efficient methods, we selected three widely used machine learning frameworks: tree ensembles (AdaBoost and XGBoost) [13,14], "shallow" neural networks with kernel functions (SVM) [15] and fully connected "deep" neural networks with ReLU activations [16]. Compared with ReLU networks, tree ensemble methods are less powerful function approximators, but their smaller capacity helps with small datasets like ours and with variables lacking spatiotemporal structure, where no recurring structures over the features are known in advance. The order of the features is arbitrary in our study, as they do not form rigid structures; hence, we used the only viable option and adopted fully connected artificial neural networks. Tree ensembles and kernel-based methods are not sensitive to the order of the features.
Tree ensembles build a set of "weak" classifiers from small, almost random decision trees. There are several methods to determine the set of decision trees and their importance, e.g., random forest [17], adaptive boosting [13] or gradient boosting machines [14]. In the case of the neural networks, we built fully connected deep artificial neural networks (ANN) with ReLU activations, and the parameters were optimized with adaptive moment estimation (Adam) [18]. The networks contained five hidden layers, each with the default number of units. Finally, SVM models were trained with various kernel functions, including linear, polynomial and radial basis functions. Table 2 indicates the best performing methods per class.
Besides sensitivity, specificity and accuracy, the most important metric for evaluating our binary classifiers is the area under the receiver operating characteristic curve (ROC AUC). Sensitivity is measured as the proportion of true positives among patients with FCS, while specificity describes the proportion of true negatives among patients without FCS. Accuracy is the proportion of patients in the studied population that are correctly identified. The ROC curve is defined by the point pairs of true positive rates (sensitivity) and false positive rates (1 minus specificity) at different threshold settings. AUC can be interpreted as the probability that a positive sample is classified with higher confidence than a negative sample.
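As a small illustration of the metrics defined above, the following sketch computes sensitivity, specificity and accuracy from thresholded predictions, and AUC directly from its probabilistic interpretation (the fraction of positive-negative pairs in which the positive sample receives the higher score, counting ties as one half). The labels, scores and the 0.5 threshold are hypothetical toy values, not study data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fp, tn, fn):
    return tp / (tp + fn)  # true positives among patients with FCS


def specificity(tp, fp, tn, fn):
    return tn / (tn + fp)  # true negatives among patients without FCS


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy example.
y_true = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold

counts = confusion_counts(y_true, y_pred)
sens = sensitivity(*counts)
spec = specificity(*counts)
acc = accuracy(*counts)
auc = roc_auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; on population-scale data a rank-based computation would be the usual choice.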
It is important to note that, based on the trees learned by a gradient boosted tree model, it is possible to rank the features using their position in the trees. There are multiple methods, ranging from a simple count of occurrences to a complex subset identification, that may yield a reasonably good ranking of the features. We relied on a weighted version of the former, most commonly used method [19]. Additionally, the order of the trees learned during the boosting phase is of utmost importance; thus, we decided to investigate the splits of the first few trees learned by the model.
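A weighted occurrence count of this kind might look like the sketch below. It assumes, purely for illustration, that each tree is available as a list of (feature, weight) pairs, one per split, where the weight could be the gain of the split; the feature names and weights are hypothetical, and the exact weighting used in the study is not specified here. Totals are scaled so that the top feature receives 100.

```python
from collections import defaultdict


def rank_features(trees):
    """Sum the split weights per feature over all trees and scale
    the totals so that the most important feature scores 100."""
    totals = defaultdict(float)
    for splits in trees:
        for feature, weight in splits:
            totals[feature] += weight
    top = max(totals.values())
    scaled = {name: 100.0 * value / top for name, value in totals.items()}
    return sorted(scaled.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy ensemble, listed in boosting order.
toy_trees = [
    [("tg_max", 4.0), ("tg_mean", 2.0)],
    [("tg_max", 3.0), ("chol_max", 1.0)],
    [("tg_mean", 1.5), ("tg_max", 1.0)],
]
ranking = rank_features(toy_trees)

# Features used by the first tree, which boosting fits before any other.
first_tree_features = [feature for feature, _ in toy_trees[0]]
```

With unit weights the function reduces to the plain occurrence count, so the same code covers both the simple and the weighted variant.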

Results
Based upon the features of the previously proposed FCS score, we calculated the score of each individual that visited the two major healthcare providers in our region during the study period (n = 1,341,722; mean age: 38.12 ± 23.37 years, male/female: 602,258/739,464; 45/55%). Patient characteristics and their calculated FCS score are listed in Table 3. We identified a total of 26 patients very likely with FCS (score ≥ 10). These data suggest that FCS might be more frequent, at least in our region, with an estimated prevalence of 19.4 per million. For a rapid estimation of FCS scores, we gradually cut down data based on some strong key features of the score system to estimate the number of the patients that fell into the three major categories of "highly unlikely FCS", "unlikely FCS" and "likely FCS". As FCS is a disease characterized by serum triglyceride (TG) levels, we chose features which contributed markedly to the FCS score and were easily measurable with less subjectivity (Figure 1). Therefore, we took patients with fasting TG levels exceeding 10 mmol/L for three consecutive cases (+5 points) and those who never had TG levels less than 2 mmol/L (thus avoiding the −5 points), and added those patients who had no secondary causes such as diabetes mellitus, metabolic syndrome, hypothyroidism, corticosteroid therapy or alcohol abuse (+2 points).
To further enhance this estimation of FCS scores and find those who potentially live with undiagnosed FCS, we added the key features of fasting TG levels exceeding 20 mmol/L at least once (+1 point), symptoms below 40 years (+1 point) and positive history of pancreatitis (+1 point). Key features in the two major healthcare providers (UDCC and CHSSB) for FCS score estimation and the number of the patients falling into the score categories are presented in Table 4. Some intra-regional difference was detectable, as we estimated the prevalence of "likely FCS" to be 8.47 per million in UDCC and 5.32 per million in CHSSB. As with the total population, we also calculated the FCS score for every single patient available in the hospital database, separately in the two medical centers (Table 5). Based on our results, the calculated prevalence of FCS is 27.11 per million in the Debrecen (UDCC) region and 13.3 per million in the Nyíregyháza (CHSSB) region. Overall, male patients had a 4 to 5 times increased chance of "likely FCS" compared with females. The magnitude of the number of patients with a calculated FCS score of 10+ ("likely FCS") was comparable with the estimated prevalence when checking the patients individually.
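The rapid estimation based on these key features can be sketched as follows. This is a partial score that uses only the six features and point values named in the text, not the complete FCS score of Moulin et al.; the function name, argument names and example patients are hypothetical.

```python
def partial_fcs_score(fasting_tg, has_secondary_cause, onset_age, pancreatitis_history):
    """Partial FCS score from the key features named in the text.

    fasting_tg: chronological list of fasting TG values in mmol/L.
    onset_age: age at first symptoms in years, or None if unknown.
    """
    score = 0

    # +5: fasting TG above 10 mmol/L on three consecutive measurements.
    run = 0
    three_consecutive = False
    for tg in fasting_tg:
        run = run + 1 if tg > 10 else 0
        if run >= 3:
            three_consecutive = True
    if three_consecutive:
        score += 5

    # +1: fasting TG above 20 mmol/L at least once.
    if any(tg > 20 for tg in fasting_tg):
        score += 1

    # -5: any previous TG below 2 mmol/L.
    if any(tg < 2 for tg in fasting_tg):
        score -= 5

    # +2: no secondary cause (e.g., diabetes mellitus, alcohol abuse).
    if not has_secondary_cause:
        score += 2

    # +1: symptoms below 40 years of age.
    if onset_age is not None and onset_age < 40:
        score += 1

    # +1: positive history of pancreatitis.
    if pancreatitis_history:
        score += 1

    return score


# Hypothetical patients.
patient_a = partial_fcs_score([12.4, 15.0, 22.3, 11.1], False, 31, True)
patient_b = partial_fcs_score([12.0, 1.8, 14.0], True, 55, False)
likely_fcs_a = patient_a >= 10  # "likely FCS" cut-off used in the text
```

With only these features the maximum attainable value is 10, so a patient reaches the "likely FCS" cut-off in this estimate only when every listed criterion is met.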
As our estimated prevalence turned out to be one order of magnitude higher than the literature data, we decided to evaluate thoroughly those patients of UDCC with an estimated 7+ score (n = 275, see Table 3). Therefore, all patients of this medical center with an estimated score falling into the "unlikely FCS" and "likely FCS" categories underwent a detailed evaluation of their medical history, TG levels and clinical signs in order to find those with undiagnosed FCS. During this data revision, we identified 7 patients with FCS and, without genetic testing, marked an additional 14 individuals with potential FCS. These data indicate an estimated prevalence of 11.8-35.6 FCS patients per million, which is of a similar magnitude to our calculation detailed above. We then applied machine learning, trained and tested on the UDCC dataset, to identify those FCS patients who had ever been hospitalized. As training data, we used the above-mentioned 7 confirmed and 14 potential FCS patients against those who scored 7+ in the FCS score system and against random individuals. The results of the modeling are shown in Table 2, while model parameters are detailed in Appendix B. During classification, boosting models (i.e., AdaBoost and XGBoost) performed best in terms of ROC AUC, closely followed by support vector machines. Deep neural networks lagged behind notably in terms of overall performance. Table 6 shows the summarized importance of conditions of the history in defining FCS, using all model trainings. To evaluate the accuracy of the FCS score, we trained models on these confirmed and potential FCS patients vs. patients with a 7+ FCS score. Individual laboratory measurements were mined from the medical histories of the patients with no absolute values assigned to them. The parameters were ranked by the models from 0 to 100, where the value of 100 indicates the most important condition in decision making.
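The prevalence figures quoted in this section follow from a simple cases-per-million calculation, sketched below. The case counts and denominators are taken from the text (26 of 1,341,722 patients overall; 7 confirmed and 14 potential cases among the 590,500 patients in Debrecen); treating 590,500 as the denominator for the UDCC range is an assumption, and small rounding differences from the reported values are possible.

```python
def prevalence_per_million(cases, population):
    """Number of cases per one million individuals."""
    return cases / population * 1_000_000


# Overall estimate from patients with an FCS score of 10 or more.
overall = prevalence_per_million(26, 1_341_722)

# UDCC range after chart review: confirmed only vs. confirmed plus potential.
udcc_lower = prevalence_per_million(7, 590_500)
udcc_upper = prevalence_per_million(7 + 14, 590_500)

print(f"overall: {overall:.1f} per million")
print(f"UDCC range: {udcc_lower:.2f} to {udcc_upper:.2f} per million")
```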
Our results confirmed the foundational importance of the TG levels, as (i) the highest TG level and (ii) the average TG level were found to be the most important features, while (iii) conditions characterizing deviations in the TG concentrations (i.e., TG fluctuation, as well as highest and lowest TG levels) were also among the top conditions of the history. Cholesterol level also turned out to be a substantial feature in defining FCS. These conditions are the most important ones to distinguish FCS patients from those with no FCS but a high FCS score. To find the most important conditions and decisive laboratory cut values that can be used for population screening, we also trained machine learning models using the data of the confirmed and potential FCS patients vs. all patients (Table 7). The cut values do not indicate absolute importance, but they help clinicians move closer to or away from the likelihood of FCS. Altogether, we found that patients may be identified based upon their highest and lowest TG levels, average TG levels and TG level deviations, as well as the highest and lowest total cholesterol concentrations and the deviations of the total cholesterol level. We also identified other parameters that may help to find individuals with potential FCS, as increasing hemoglobin, MCHC, basophil granulocyte, lymphocyte, or amylase above the cut levels raised the probability of FCS. On the other hand, elevated GPT, GGT, glucose, sodium and creatinine measurement cut levels decreased the chance of FCS. Interestingly, we also found that inflammatory markers such as WBC and CRP, as well as the amylase activity, had a negative impact on the probability of FCS.
Table 7. Summary of the most decisive laboratory value cuts in machine learning models and their impact on getting closer to (+) or away (−) from likelihood of FCS.

Discussion
We estimated the regional frequency of FCS to be 19.4 per million among hospital goers, which exceeds the estimated worldwide prevalence of 1 per million [20]. Although FCS is considered to be a rare disease, recent data indicate a higher frequency of the disease when larger cohorts are used. Indeed, reviewing the data of more than 1.5 million patients, Pallazola et al. found an FCS prevalence of 13 per million among the patients of a quaternary medical center [21]. On a smaller dataset of thirty thousand children, the prevalence of type 1 hyperlipoproteinemia (i.e., familial chylomicronemia syndrome) was estimated to be about 1 in 300,000 [22]. It is important to emphasize that we studied a population that was treated or checked in a hospital, which might have contributed to the variance of the disease prevalence. Though falling into the same magnitude, the FCS prevalence also differed between the medical providers, whether estimated using key features of the disease or calculated individually in each patient. These discrepancies are presumably due to the different levels of care and the covered territories of the medical providers (university hospital vs. county hospital). Indeed, with its various lipid/metabolic disease outpatient clinics, our university hospital accepts patients from the county hospital as well. More targeted history taking and wider diagnostic and laboratory availabilities may also explain our prevalence results after revisiting the university hospital data. Besides indicating the usability of our methods in distinct populations, our findings highlight the need for the specialist's expertise in recognizing FCS.
The diagnosis of FCS is largely based upon genetic analysis and post-heparin LPL activity assay [7]. Recently, an expert panel of lipidologists proposed a very practical FCS scoring system for the better identification of patients with this rare, inherited disease [6]. A solid advantage of the FCS score is the strong reliance on the exact serum triglyceride measurements. Indeed, the selection of the potential patients can be reduced to 1-2‰, if studying those with TG levels exceeding 10 mmol/L for three consecutive occasions and never below 2 mmol/L (as indicated on Table 4). Adding the other strong and measurable condition (TG levels exceeding 20 mmol/L at least once) cut down the patient selection to the zone of ten thousandths (‱).
On the other hand, we realized that the patients with the highest FCS scores are not necessarily the same ones that we diagnosed. This can be due to incomplete history taking (e.g., missing targeted questions on conditions aggravating hypertriglyceridemia), which can hamper proper diagnosis [23]. Therefore, FCS scoring seems to work best when all such secondary factors can be excluded by a dedicated physician, while there is room for improvement when applying the FCS score on a larger, automated level.
Machine learning, however, may serve as a helpful tool to better identify rare diseases when using larger datasets [9,24]. With models trained and tested on the UDCC data, we also tried to find those FCS patients who, with any diagnosis, had ever been hospitalized in our university hospital. We found gradient boosting and SVM to be the most successful in terms of ROC AUC. In contrast to neural networks, these models were more useful for finding those with FCS. Our analysis of laboratory parameters indicated that mild-to-moderate or very high TG concentration cuts further improve the identification of potential FCS patients, even when peaking above 20 mmol/L. Interestingly, total cholesterol level may also be a promising asset to improve identification. The role of cholesterol, however, seems to be more complex, as the likelihood of FCS decreases below 4 mmol/L and above 11 mmol/L. In other words, patients with low or with very high cholesterol levels should not be considered to have FCS, which indicates the importance of the triglyceride-rich lipoprotein cholesterol and the intimate interplay between cholesterol and triglyceride metabolism [25].
On the other hand, we found several metabolic parameters, including liver transaminases and serum glucose, whose increased activities or concentrations negatively affected the probability of FCS. These findings might be due to the common presence of insulin-resistant conditions such as obesity, type 2 diabetes mellitus and non-alcoholic fatty liver disease (NAFLD) among hospital goers and are concordant with the recent report of Paquette et al., who found higher activities of gamma-glutamyl transferase (GGT) in MFCS compared to FCS [7]. Of note, although occurring in both FCS and MFCS patients, NAFLD was observed to be significantly less frequent in patients with familial chylomicronemia syndrome [26].
Interestingly, we found that elevated amylase activity had a negative impact on FCS probability, which indicates a high prevalence of such laboratory findings in the studied population. Longitudinal studies on well-characterized patient populations, however, confirmed the higher incidence of acute pancreatitis in FCS patients [27]. These investigations may also shed light on cardiovascular outcomes in these subjects, as well. Nevertheless, besides indicating the potential existence of multifactorial backgrounds, our findings may also help to increase FCS awareness, as higher glucose levels or transaminase activities decrease the probability of FCS.
Limitations also exist in our study. Hospital goers represent a population that differs from the normal population; therefore, our calculations might overestimate the frequency of the disease. Although we could study a relatively large cohort of patients, it did not directly represent the total population in our region, as not 100% of the population goes to hospital each year. Also, we were unable to assess the data about family history and did not perform genetic testing to diagnose FCS. Verifying the existence of confirmed or potentially pathogenic mutations in LPL or other genes modulating lipoprotein lipase activity would have contributed to improve identification of potential FCS patients in the studied population. Genetic analysis of gene variants with triglyceride-lowering effect would also have modified our results. In addition, hospital goers tended to be older and checked more frequently. On the contrary, younger patients usually had less thorough laboratory examinations and their history was less detailed and asked less frequently. Such tendencies bias the identification of FCS patients towards the elderly. Additionally, a larger population is needed to define those exact cuts in cholesterol levels that could improve FCS scoring. Although our machine learning models found their impact on the likelihood of FCS, the real-life importance of the other laboratory parameters should also be addressed in future studies. While machine learning may overestimate the incidence of FCS, it also may help to reduce the number of those individuals that would require expensive and time-consuming genetic analysis.

Conclusions
Using the previously proposed FCS scoring based on a large hospital database, we found an increased prevalence of familial chylomicronemia syndrome in our region. Data mining and machine learning seem to be promising tools in screening for FCS; however, further studies on larger, national or international datasets are of major importance to prove their accuracy and usefulness. Also, an analysis of larger populations might increase the number of discovered FCS cases.
Although FCS scoring is an easy-to-use tool to set FCS and MFCS apart, "fine tuning" of the features and inclusion of the total cholesterol levels may be considered to better identify FCS patients. Although the weight of cholesterol levels in the score has to be determined, this may alleviate the need for systematic genotyping in patients with severe hypertriglyceridemia and would also help identify the high-priority candidates for genetic analysis. Furthermore, early and accurate diagnosis is essential for effective treatment to avoid severe, life-threatening complications of FCS.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: All data generated or analyzed during this study are included in this published article. All data generated or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest: Ákos Németh is a co-worker of Aesculab Medical Solutions (Black Horse Group Ltd.), while also on staff at University Debrecen Department of Internal Medicine as a PhD candidate. As stated in the article, the company is a contractual partner of the university who provided cleaned, anonymized data for the research. The authors declared they do not have anything to disclose regarding conflict of interest with respect to this manuscript.

Appendix A
For the analysis of the textual data, we utilized Python 3.8.