Automatic Search of Cataclysmic Variables Based on LightGBM in LAMOST-DR7

Abstract: The search for special and rare celestial objects has always played an important role in astronomy. Cataclysmic Variables (CVs) are special and rare binary systems with accretion disks. Most CVs are in the quiescent period, and their spectra show emission lines of the Balmer series, HeI, and HeII. A few CVs in the outburst period show absorption lines of the Balmer series. Because known CVs are scarce, expanding the set of CV spectra is valuable for studying the formation of accretion disks and the evolution of binary star systems. The study of astronomical spectra has now entered the era of Big Data. The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has produced tens of millions of spectra. The latest release, LAMOST-DR7, includes 10.6 million low-resolution spectra in 4926 sky regions, providing ideal data support for searching for CV candidates. To process and analyze this massive amount of spectral data, this study employed the Light Gradient Boosting Machine (LightGBM) algorithm, which is based on the ensemble tree model, to conduct the search in LAMOST-DR7 automatically. In total, 225 CV candidates were found, of which four were identified as new CV candidates after verification against SIMBAD and published catalogs. This study also built Gradient Boosting Decision Tree (GBDT), Adaptive Boosting (AdaBoost), and eXtreme Gradient Boosting (XGBoost) models and used Accuracy, Precision, Recall, the F1-score, and the ROC curve to compare the four models comprehensively. Experimental results showed that LightGBM is more efficient. The search for CVs based on LightGBM not only enriches the existing CV spectral library, but also provides a reference for the data mining of other rare celestial objects in massive spectral data.


Introduction
Cataclysmic Variables (CVs) are binary star systems with accretion disks [1]. The binary system consists of a white dwarf star [2] and a late main-sequence companion star [3]. The companion star transfers material to the main star through the accretion disk [4][5][6]. According to their amplitudes and timescale of variability and magnetism, CVs can be divided into five subtypes, namely, Novae-Like variables (NLs), Classical Novae (CNs), Dwarf Novae (DNs), Recurrent Novae (RNs), and Magnetic Cataclysmic Variables (MCVs) [7,8]. Studying the different subtypes of CVs is important to understand the accretion physics of CVs and the evolution of compact binaries [9].
The spectra of CVs fall into two types. In the quiescent period, the spectrum is dominated by emission lines of the Balmer series, HeI, or HeII, and the accretion disk is the source of the hydrogen and helium emission lines [10]. In the outburst period, the spectrum shows broad Balmer absorption lines, because the emission lines are overwhelmed by the continuum. Some CV spectra during the outburst period also show pure absorption in the HeI and HeII lines, and a few Balmer absorption lines have emission cores, that is, an emission line surrounded by absorption [8,11].
The traditional ways to search for CVs are spectroscopic and photometric observations [6]. The light curves from follow-up observations can help to further divide the CVs into subtypes. Szkody et al. [12][13][14][15][16][17][18] examined the spectral data released by the Sloan Digital Sky Survey (SDSS) [19] from 2002 to 2009 and published a catalog of 285 CV candidates in total in 2011 [20]. In 2014, Drake et al. obtained 855 CV candidates from the Catalina Real-time Transient Survey (CRTS) [21], of which 137 have been confirmed [22]. In 2015, Mróz et al. discovered 1091 dwarf nova candidates in the Optical Gravitational Lensing Experiment (OGLE) survey [23,24]. With the development of machine-learning and data-mining technologies, various machine-learning algorithms are gradually being applied in astronomy. Jiang et al. used PCA+SVM and the random forest algorithm separately to search for CVs in SDSS and LAMOST-DR1 and provided 58 and 16 new candidates, respectively [25,26]. Hou et al. used random forest and BaggingTopPush to search in LAMOST-DR5, and 54 of the results were verified as new candidates [8].
According to the features of the CV spectra, this study proposes a Light Gradient Boosting Machine (LightGBM) [27] model based on the ensemble tree to achieve automatic classification of the spectra of LAMOST-DR7. As rare and special objects, CVs have few observed spectra, whereas LAMOST-DR7 contains more than 10 million spectra that are numerous and complex. The scarcity of CV spectra and the complexity and diversity of the massive data increase the difficulty of model training. With such imbalanced classes, Accuracy alone is an inappropriate evaluation criterion. We therefore also used Precision, Recall, the F1-score, the Receiver Operating Characteristic (ROC) curve, and runtime to evaluate model performance comprehensively; the evaluation indicators are defined in Section 4.1. Then, we used the best-trained classifier to search for CV candidates in LAMOST-DR7.
The outline of this article is as follows. Section 2 describes the experimental data, including positive and negative data. Section 3 introduces the method used in this study. Section 4 presents the implementation of the method and the model performance evaluation in detail. Section 5 provides the conclusions and outlines the plans for future work.

Dataset Preparation
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST), which was designed and constructed by Chinese scientists, is a 4 m quasi-meridian reflecting Schmidt telescope equipped with 4000 fibers. Owing to this design, the LAMOST can observe up to 4000 targets per exposure [28][29][30][31]. At present, the LAMOST has released more spectra than all other optical telescopes in the world combined, making it the telescope with the highest spectral acquisition rate in the world [32]. LAMOST-DR7, released to astronomers in March 2020, covers 4926 low-resolution observation areas and contains 10.6 million low-resolution spectra. These spectra provide a data source for searching for special and rare objects. In this study, the experimental data comprise more than 10 million spectra, including stars, galaxies, QSOs, and unknown objects from the 4926 regions of the LAMOST-DR7 low-resolution observations. The distribution of the LAMOST spectra is shown in Table 1. Searching for CVs requires some known CV spectra as templates. We referred to the published CV catalogs (Szkody et al. [20], Drake et al. [22], and Hou et al. [8]) and the SIMBAD database. After cross-matching with the LAMOST-DR7 and SDSS catalogs within a radius of 5″, we manually selected 567 high-quality spectra that have evident CV spectral characteristics: 272 spectra from SDSS and 295 spectra from LAMOST. Most of these CVs have emission line features, and only 54 of the 567 CV spectra have absorption features. Although the two types of CV spectra are different, the potential relationship between them can be extracted to construct the feature matrix by using the method proposed in this study. The result also proves the feasibility of the proposed method. The two types of CV spectra are shown in Figure 1.
Figure 1. The two types of CV spectra. The upper two CV spectra are in the quiescent period, and the lower two are in the outburst period.
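The cross-matching step described above can be illustrated with a short sketch. It is not the authors' pipeline: the function names, the brute-force loop, and the example coordinates are hypothetical, and the matching radius is taken to be 5 arcseconds, which is an assumption about the unit.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation of two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for the very small angles
    # that occur in catalog cross-matching.
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def cross_match(catalog, targets, radius_arcsec=5.0):
    """Return (target_index, catalog_index) pairs within radius_arcsec.

    Brute force, O(len(targets) * len(catalog)); a real survey-scale
    match would use a spatial index instead.
    """
    matches = []
    for i, (t_ra, t_dec) in enumerate(targets):
        for j, (c_ra, c_dec) in enumerate(catalog):
            if angular_separation_arcsec(t_ra, t_dec, c_ra, c_dec) <= radius_arcsec:
                matches.append((i, j))
    return matches

# Hypothetical positions in degrees (not taken from any catalog).
known_cvs = [(30.8416, 46.1254)]
survey_spectra = [(30.8420, 46.1255), (30.9000, 46.1254)]
matched = cross_match(survey_spectra, known_cvs)
print(matched)
```

Only the first survey position lies within the assumed radius of the known CV, so it is the only pair reported; the matched spectra would then be inspected manually, as the text describes.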

Method
LightGBM is a distributed gradient boosting framework based on the ensemble tree, open-sourced by Microsoft. The algorithm has been applied in various fields. Pulicherla et al. used LightGBM to predict turnover probability [33]. Wang et al. identified and classified miRNA targets in breast cancer based on LightGBM [34]. Sun et al. applied the LightGBM algorithm to the cryptocurrency market and successfully predicted the price trend [35]. The basic idea of the algorithm is to generate regression trees iteratively, with each new tree fitting the residual left by the previous trees. The model thus combines multiple weak classifiers into a stronger classifier by accumulation. It is efficient, fast, and accurate, which makes LightGBM well suited to high-dimensional and large-scale data.
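For a dataset composed of n samples with m features, $D = \{(x_i, y_i)\}$ with $x_i \in \mathbb{R}^m$ and $|D| = n$, the output of the model can be expressed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{1}$$

where K is the total number of trees, $f_k$ is the regression tree generated in the k-th iteration, and $\hat{y}_i$ is the prediction for sample i. The objective function $O(\phi)$ of LightGBM is:

$$O(\phi) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$

where $\sum_{i=1}^{n} l(y_i, \hat{y}_i)$ is the loss function and $l(y_i, \hat{y}_i)$ measures the residual between the label of sample i and the accumulated value of the tree model, that is, the difference between $y_i$ and $\hat{y}_i$. The regularization term of each tree in $\sum_{k} \Omega(f_k)$ can be expressed as:

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{3}$$

where T is the total number of leaf nodes, $w_j$ is the weight of leaf node j, and $\gamma$ and $\lambda$ are the regularization parameters.
LightGBM is an additive model. The output at iteration t is the output of the former t − 1 iterations plus the prediction of the t-th regression tree. Therefore, the objective function of the model at iteration t can be expressed as:

$$O^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{4}$$

Expanding (4) to second order by using Taylor's formula gives:

$$O^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{5}$$

where $g_i$ and $h_i$ are the first and second derivatives, respectively, of the loss function with respect to $\hat{y}_i^{(t-1)}$.
The regression tree $f_t$ can be expressed as:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^m \rightarrow \{1, \dots, T\} \tag{6}$$

The tree structure q is the mapping of samples to the leaf nodes; the leaf nodes are the nodes that are not split in the tree structure.
Let $I_j = \{ i \mid q(x_i) = j \}$ be the set of samples assigned to leaf j. Dropping the constant term $l(y_i, \hat{y}_i^{(t-1)})$, (5) can be rewritten as:

$$O^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \tag{7}$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ are the sums of the first and second derivatives, respectively, of the loss function over the samples in leaf j. To obtain the minimum value of the objective function, set its derivative with respect to $w_j$ to zero; the weight of the leaf node is then:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{8}$$

The minimum of the objective function is:

$$O^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{9}$$

The existing leaf nodes are split through the greedy algorithm [36], and the optimal segmentation point is found by comparing the objective before and after the split:

$$\mathrm{SplitGain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{10}$$

where $G_L$ and $G_R$ are the sums of the first derivatives in the left and right subtrees after splitting, and $H_L$ and $H_R$ are the corresponding sums of the second derivatives. The greater the value of SplitGain, the more the split reduces the objective. At each step, the feature with the largest SplitGain value is selected for splitting; the tree stops growing when the regression tree can no longer split.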
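A small numerical sketch may make the leaf weight and SplitGain expressions concrete. It is an illustration only: the helper names are hypothetical, and a squared-error loss $l = \frac{1}{2}(y - \hat{y})^2$ is assumed so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the paper does not state which loss was used.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight: -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """The G^2 / (H + lambda) term that appears in the split gain."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into a left and a right child."""
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    return 0.5 * (children - parent) - gamma

# Five samples, current prediction 0 for all of them (assumed toy data).
y = [1.0, 1.0, 1.0, -1.0, -1.0]
y_hat = [0.0] * 5
g = [p - t for p, t in zip(y_hat, y)]   # first derivatives of 0.5 * (y - y_hat)^2
h = [1.0] * 5                           # second derivatives

# Candidate split: first three samples go left, last two go right.
G_L, H_L = sum(g[:3]), sum(h[:3])
G_R, H_R = sum(g[3:]), sum(h[3:])

gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)
w_left = leaf_weight(G_L, H_L, lam=1.0)
w_right = leaf_weight(G_R, H_R, lam=1.0)
print(gain, w_left, w_right)
```

With $\lambda = 1$ the left leaf receives weight 0.75 rather than the unregularized mean of 1, which shows how $\lambda$ shrinks leaf weights toward zero.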
The further optimization algorithms proposed by LightGBM are as follows:
(a) Histogram algorithm: Compared with a presorted algorithm, which consumes more runtime and memory space, LightGBM divides the continuous floating-point values of each feature of the sample data into N integer intervals and constructs a histogram containing N bins by counting the number of discretized values falling into each of the N intervals. When the tree model is splitting, LightGBM only traverses the N discrete values in the histogram to find the optimal segmentation point, which reduces the memory consumption. The time complexity of finding a split, which qualitatively describes the runtime of the algorithm, is reduced from $O(d \times f)$ to $O(N \times f)$, where d is the sample size of the training set, f is the feature size, and N is the number of histogram bins. For high-dimensional and large-scale spectral data, LightGBM can greatly speed up the calculation.
(b) Leafwise growth [27] algorithm: Traditional tree-boosting implementations such as XGBoost [37] grow levelwise, in which the leaf nodes in the same layer are split at the same time and then pruned. This splitting mode causes much unnecessary computational consumption. LightGBM uses leafwise growth. Each time, the model searches for the node with the maximum gain among all the current leaf nodes, splits it, and iterates until the decision tree is completely generated. Leafwise growth is more efficient than levelwise growth, but it easily produces trees that are too deep, which leads to overfitting. If the decision tree does not have a maximum depth limit, the tree will continue to split. For the same number of splits, leafwise growth produces a deeper decision tree. Excessive splitting makes the model learn information that is not important for the classification, thus reducing the accuracy of the classifier. The algorithm therefore needs to control the maximum depth of the tree to reduce the risk of overfitting. The two growth strategies are shown in Figure 2.
(c) Acceleration of histogram differences: LightGBM accelerates the training process by using the differences of the histograms while constructing them. When splitting, the histogram of the current node is obtained as the difference between the histogram of the parent node and that of the sibling node. This type of acceleration greatly improves the training speed and efficiency [38]. The schematic is shown in Figure 3.
Figure 3. Acceleration of histogram differences. The red node represents the parent node; the blue node represents the left child node; the yellow node represents the right child node.
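The histogram algorithm in (a) and the histogram-difference acceleration in (c) can be sketched with NumPy for a single feature. This is a simplified illustration under stated assumptions, not LightGBM's actual implementation: the function names, the quantile binning, the 16 bins, and the synthetic gradients are all hypothetical choices.

```python
import numpy as np

def build_histogram(bin_idx, grad, hess, n_bins):
    """Sum gradients and Hessians of the samples falling into each bin."""
    G = np.bincount(bin_idx, weights=grad, minlength=n_bins)
    H = np.bincount(bin_idx, weights=hess, minlength=n_bins)
    return G, H

def best_split(G, H, lam=1.0):
    """Scan the n_bins - 1 bin boundaries instead of every sample value."""
    G_tot, H_tot = G.sum(), H.sum()
    G_L, H_L = np.cumsum(G)[:-1], np.cumsum(H)[:-1]
    G_R, H_R = G_tot - G_L, H_tot - H_L
    gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G_tot**2 / (H_tot + lam))
    k = int(np.argmax(gain))
    return k, float(gain[k])

rng = np.random.default_rng(0)
n_bins = 16
x = rng.normal(size=1000)                  # one continuous feature
grad = np.where(x > 0.5, -1.0, 1.0)        # synthetic first derivatives
hess = np.ones_like(x)                     # synthetic second derivatives

# Discretize the feature into integer bin indices 0 .. n_bins - 1.
edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, x)

G_parent, H_parent = build_histogram(bin_idx, grad, hess, n_bins)
k, gain = best_split(G_parent, H_parent)

# Build the left child's histogram explicitly ...
left = bin_idx <= k
G_left, H_left = build_histogram(bin_idx[left], grad[left], hess[left], n_bins)
# ... and obtain the right child's histogram by subtraction from the parent.
G_right, H_right = G_parent - G_left, H_parent - H_left

# Direct construction, shown only to check the subtraction.
G_direct, H_direct = build_histogram(bin_idx[~left], grad[~left], hess[~left], n_bins)
print(k, gain, np.allclose(G_right, G_direct))
```

The subtraction touches only N bins, whereas building the sibling's histogram directly requires a pass over all of its samples, which is where the speed-up comes from.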
In addition, LightGBM uses the Gradient-based One-Side Sampling (GOSS) algorithm [27] to sample data randomly based on gradients, uses the Exclusive Feature Bundling (EFB) algorithm [27] to further compress features, and supports efficient parallelism, which together improve the algorithm's efficiency without affecting the accuracy.
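The gradient-based sampling idea can be sketched as follows, in the form described in the LightGBM paper [27]: keep the fraction a of samples with the largest absolute gradients, draw a random fraction b of all samples from the remainder, and up-weight the drawn samples by (1 − a)/b. The function name and the default values of a and b are hypothetical, and EFB is not shown.

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """Return sampled indices and per-sample weights (GOSS-style sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = grad.shape[0]
    n_top = int(a * n)                      # large-gradient samples, always kept
    n_rand = int(b * n)                     # small-gradient samples, drawn at random
    order = np.argsort(-np.abs(grad))       # indices by decreasing |gradient|
    top_idx, rest_idx = order[:n_top], order[n_top:]
    rand_idx = rng.choice(rest_idx, size=n_rand, replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(idx.shape[0])
    # Up-weight the random part so gradient sums stay approximately unbiased.
    weights[n_top:] = (1.0 - a) / b
    return idx, weights

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)                # synthetic gradients
idx, weights = goss_sample(grad, a=0.2, b=0.1, rng=rng)
print(idx.shape[0], weights.sum())
```

Here 300 of the 1000 samples are retained, yet the weights sum to about 1000, so statistics accumulated over the sample approximate those of the full data.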

Experimental Process and Analysis
In this study, we selected a total of 567 CV template spectra as positive samples and 20,000 random unlabeled spectra in LAMOST-DR7 as negative samples. The mixed dataset of positive and negative samples was divided into the training set and the testing set in a ratio of 7:3. Since the wavelength range of each spectrum in LAMOST-DR7 is not consistent, we unified it by selecting the range of 4000-8900 Å, which contains evident spectral characteristic peaks, and sampling each spectrum at 3473 points. In machine learning, if the values of different features differ greatly in scale, the algorithm will favor the features with larger values, which will mislead the prediction. To enable the algorithm to treat each feature equally, we normalized the data. The normalization formula is:

$$s_i' = \frac{s_i - \bar{s}_i}{\sigma(s_i)}$$

where $s_i$ represents the one-dimensional vector formed by the spectral flux at wavelength i, $\bar{s}_i$ is the mean of $s_i$, and $\sigma(\cdot)$ is the standard deviation operator. Next, we constructed the input matrix of the algorithm from the normalized dataset.
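The preprocessing described above can be sketched with NumPy. Several details are assumptions rather than the authors' procedure: the text does not specify how the 3473 points are spaced (a logarithmic grid is assumed here), linear interpolation is used for resampling, the 7:3 split is a plain random split, and all function names and the synthetic spectra are hypothetical. The normalization follows the description of the formula (mean and standard deviation per wavelength), which is a z-score: it yields zero mean and unit variance per feature rather than values confined to a fixed interval.

```python
import numpy as np

def make_grid(lo=4000.0, hi=8900.0, n=3473):
    """Common wavelength grid in Angstrom (log spacing is an assumption)."""
    return np.logspace(np.log10(lo), np.log10(hi), n)

def resample(wave, flux, grid):
    """Linearly interpolate one spectrum onto the common grid."""
    return np.interp(grid, wave, flux)

def standardize(X):
    """Per-wavelength normalization: (s_i - mean(s_i)) / std(s_i)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant columns
    return (X - mean) / std

def split_7_3(X, y, seed=0):
    """Random 7:3 split into training and testing sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    cut = int(round(0.7 * len(y)))
    train, test = perm[:cut], perm[cut:]
    return X[train], X[test], y[train], y[test]

# Synthetic stand-ins for 50 spectra (5 positives, 45 negatives).
rng = np.random.default_rng(0)
grid = make_grid()
wave = np.linspace(3700.0, 9100.0, 4000)
X = np.vstack([resample(wave, rng.normal(1.0, 0.1, wave.size), grid)
               for _ in range(50)])
y = np.r_[np.ones(5, dtype=int), np.zeros(45, dtype=int)]

Xn = standardize(X)
X_train, X_test, y_train, y_test = split_7_3(Xn, y)
print(Xn.shape, X_train.shape, X_test.shape)
```

In a stricter setup the mean and standard deviation would be estimated on the training set only and then applied to the testing set, to avoid leaking test statistics into training.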

Experimental Metrics
To assess the performance of the models on the dataset, Accuracy, Precision, Recall, the F1-score, and the Receiver Operating Characteristic (ROC) curve were calculated as the experimental metrics, as given below:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where: (i) TP is the number of positive samples predicted correctly as CVs; (ii) FN is the number of positive samples that were not predicted as CVs; (iii) FP is the number of negative samples predicted incorrectly as CVs; (iv) TN is the number of negative samples predicted correctly as negative samples.
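The following sketch computes the four metrics from confusion-matrix counts and illustrates why Accuracy alone is misleading on imbalanced data. The counts are hypothetical: they only mimic the approximate size of a testing set obtained from a 7:3 split of 567 positive and 20,000 negative samples (about 170 CVs and 6000 negatives) and are not results from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, Precision, Recall and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# A degenerate classifier that labels every spectrum as "not a CV".
all_negative = classification_metrics(tp=0, fn=170, fp=0, tn=6000)

# A hypothetical useful classifier on the same testing set.
useful = classification_metrics(tp=160, fn=10, fp=8, tn=5992)

print(all_negative)
print(useful)
```

The degenerate classifier still reaches an Accuracy above 97% while its Recall is zero, which is why Precision, Recall, and the F1-score are reported alongside Accuracy.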
The ROC curve, which does not depend on the scale of the test set, can evaluate the performance of the model comprehensively. The curve is based on the False Positive Rate (FPR) and the True Positive Rate (TPR). The false positive rate is the ratio of the number of negative samples predicted incorrectly as CVs to the actual number of negative samples. The true positive rate is the ratio of the number of positive samples predicted correctly as CVs to the total number of actual CVs. By adjusting the decision threshold of the model, we obtain different (FPR, TPR) points, and the ROC curve connects these points. The Area Under the ROC Curve is the AUC. The larger the AUC (i.e., the closer the curve is to the point (0, 1)), the better the classification performance of the model. If the ROC curve of one model is completely enclosed by the ROC curve of another model, the latter is considered to perform better than the former on this dataset.
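The threshold sweep that produces the (FPR, TPR) points, and the area under the resulting curve, can be sketched with NumPy. The function names are hypothetical and the four-sample input is a toy example, not data from the paper; the area is computed with the trapezoidal rule.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold step by step."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")   # highest score first
    y, s = y_true[order], scores[order]
    tps = np.cumsum(y)                # true positives at each threshold
    fps = np.cumsum(1 - y)            # false positives at each threshold
    # Keep one point per distinct score so tied scores form a single step.
    distinct = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    tpr = np.r_[0.0, tps[distinct] / tps[-1]]
    fpr = np.r_[0.0, fps[distinct] / fps[-1]]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, scores)
area = auc(fpr, tpr)
print(fpr, tpr, area)
```

In this toy case one negative sample is scored above one positive sample, so one of the four positive-negative pairs is ranked wrongly and the AUC is 0.75; a classifier that ranks every CV above every non-CV would give an AUC of 1.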

Process Analysis
On the basis of the above dataset, our first step was to train LightGBM classifiers. By using grid search, we tuned the learning rate, n_estimators, max_depth, and num_leaves and obtained the best classifier. The main parameters are defined as follows: (i) The learning rate determines whether and when the objective function converges to the local minimum; (ii) n_estimators is the number of iterations of the model; (iii) max_depth limits the maximum depth of the decision tree; (iv) num_leaves limits the maximum number of leaf nodes of the decision tree.
The best main parameters of the LightGBM classifier are shown in Table 2. In the second step, we used Accuracy, Precision, Recall, and the F1-score to evaluate the performance of the LightGBM model. Table 3 shows that LightGBM performed well on the testing set. All the indicators of LightGBM were over 90%, and the Accuracy reached 99.69%. Moreover, we can obtain a distribution map of the importance scores of the spectral features from the LightGBM classification model. The importance score measures how much the feature at a given wavelength contributed to the classification performance during training. The higher the importance score, the more important the feature is for the classification model. Figure 4 shows that the importance scores of Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å), Hα (6563 Å), HeII (4685 Å), and HeI (5876 Å) were relatively high. This indicates that LightGBM had good generalization capabilities and could extract the complex features of CV spectra. This result is consistent with the description of the spectral characteristics of CVs [26]. Although there is noise in the spectral data, the search for cataclysmic variable candidates focuses on the Balmer and He lines. Moreover, LightGBM constructs decision trees based on a combination of multiple features. In each iteration, each split of the decision tree selects the spectral feature with the maximum gain, such as the Balmer and He lines. As the decision trees grow and iterate, LightGBM selects multiple spectral features. Noise at a single wavelength, which yields little gain in a split, is rarely or never selected. Thus, LightGBM can effectively prevent noise in a single feature from degrading the classification performance. The experimental results also support this point.
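Mapping importance scores back to wavelengths, as done for Figure 4, can be sketched as follows. The importance array here is synthetic; with a fitted scikit-learn-style LightGBM classifier it would come from the `feature_importances_` attribute. The line centers are those quoted in the text, while the function names, the linear wavelength grid, and the 10 Å tolerance are hypothetical.

```python
import numpy as np

# Line centers in Angstrom, as quoted in the text.
CV_LINES = {"H-delta": 4102.0, "H-gamma": 4340.0, "H-beta": 4861.0,
            "H-alpha": 6563.0, "HeII": 4685.0, "HeI": 5876.0}

def top_wavelengths(importance, grid, k=10):
    """Wavelengths of the k features with the highest importance scores."""
    idx = np.argsort(importance)[::-1][:k]
    return grid[idx]

def match_lines(wavelengths, lines=CV_LINES, tol=10.0):
    """Assign each wavelength to any known line within tol Angstrom."""
    hits = {}
    for w in wavelengths:
        for name, centre in lines.items():
            if abs(w - centre) <= tol:
                hits.setdefault(name, []).append(float(w))
    return hits

# Synthetic example: 3473 features with importance concentrated near two lines.
grid = np.linspace(4000.0, 8900.0, 3473)
importance = np.zeros(grid.size)
importance[np.argmin(np.abs(grid - 6563.0))] = 120.0
importance[np.argmin(np.abs(grid - 4861.0))] = 80.0

hits = match_lines(top_wavelengths(importance, grid, k=2))
print(sorted(hits))
```

Checking which high-importance wavelengths fall near known Balmer and He lines is a simple sanity test that the classifier relies on physically meaningful features rather than on noise.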

Comparison of the Models
In this paper, we also trained the AdaBoost [39], GBDT [40], and XGBoost models on the same training set and tested them on the same testing set for comparison with the LightGBM model. The comparison result is shown in Figure 5. The results showed that the four models performed well on spectral classification and that the evaluation indicators of all models were over 80%. LightGBM performed best, with the highest Accuracy, Precision, Recall, and F1-score among the four models. Table 4 shows the runtime of the four models, which was calculated from multiple runs. The runtime of the LightGBM model was far shorter than those of the other models, and its classification efficiency was high, which makes it suitable for application to larger-scale spectral data. Given the imbalance of positive and negative samples, this study also compared the ROC curves of the four models. Figure 6 shows that the ROC curves of AdaBoost, GBDT, and XGBoost are enclosed by the ROC curve of the LightGBM model, and the AUC of LightGBM is the largest among all models. This outcome indicates that the LightGBM classifier had higher accuracy and stable performance, which further supports the superiority of LightGBM.

Experimental Result
This study used the LightGBM classifier to search for CV candidates in LAMOST-DR7, and 225 CV candidates were found. After verification against SIMBAD and the published CV catalogs, four of them were new CV candidates, including one (J020321.98+460731.5) in the outburst period, for which the emission core of the Balmer line is clearly visible in the spectrum. The list of CV candidates is shown in Table 5, and the spectra are shown in Figure 7.

Conclusions
A CV is a type of variable star whose spectra show two different sets of characteristics in different periods. The spectra of CVs are therefore complex, and conventional methods cannot learn these two sets of characteristics at the same time. This study proposed a method to search for CV candidates automatically by using the LightGBM classifier in LAMOST-DR7. The model can extract the potential relationship between CV spectra in the quiescent and outburst periods during the training process. By combining multiple features when constructing the decision trees, LightGBM prevents noise in a single feature from degrading the classification accuracy. Finally, the experiment successfully found four new CV candidates, including a CV candidate in the outburst period, which verifies the accuracy and feasibility of the LightGBM model and enriches the existing CV spectral library. This study also used multiple indicators to compare LightGBM with AdaBoost, GBDT, and XGBoost. The result showed that the evaluation indicators of all models were over 80%, and all indicators of LightGBM were better than those of the other models. In addition, the runtime of LightGBM was much shorter, and its classification efficiency was higher. LightGBM is therefore more suitable for large-scale and high-dimensional spectral data. The successful application of LightGBM in searching for CV candidates also provides a reference for the data mining of other rare objects, such as planetary nebulae and HII regions.
[...] be found at www.sdss.org (accessed on 11 November 2021). We acknowledge the use of spectra from the LAMOST and SDSS. This research made use of the SIMBAD database, operated by CDS, Strasbourg, France.

Conflicts of Interest:
The authors declare no conflict of interest.