Automated Classification of Massive Spectra Based on Enhanced Multi-Scale Coded Convolutional Neural Network

Abstract: The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has produced massive medium-resolution spectra. Mining these massive LAMOST spectra for special and rare stars is of great significance. Feature extraction plays an important role in automatic spectral classification. A proper classification network extracts most of the common spectral features with minimal noise and individual features; such a network has better generalization capability and can extract sufficient features for classification. A variety of one-dimensional and two-dimensional classification networks are designed and implemented systematically in this paper to verify whether spectra are easier to deal with in the 2D case. The experimental results show that a fully connected neural network cannot extract enough features. Although a convolutional neural network (CNN) with strong feature extraction capability can quickly achieve satisfactory results on the training set, it tends to overfit. The signal-to-noise ratio also affects the networks. To address these problems, various techniques are tested and the enhanced multi-scale coded convolutional neural network (EMCCNN) is proposed and implemented, which performs spectral denoising and feature extraction at different scales in a more efficient manner. In a specified search, eight known and one possible cataclysmic variables (CVs) in LAMOST MRS are identified by EMCCNN, including four CVs, one dwarf nova, and three novae. The result supplements the spectra of CVs. Furthermore, these spectra are the first medium-resolution spectra of CVs. The EMCCNN model can be easily extended to search for other rare stellar spectra.


Introduction
SDSS (the Sloan Digital Sky Survey [1,2]) and LAMOST (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope [3,4]) are two of the most ambitious spectroscopic surveys worldwide. The latest and final data release of optical spectra of SDSS is Data Release 16. LAMOST has been upgraded to provide both low- and medium-resolution modes of observation and has completed Data Release 7 (DR7) [5], which is the product of the medium-resolution spectra (MRS) survey.

Positive Samples
Training samples, especially positive samples, are crucial to a machine learning classifier. Because of the lack of previously identified CV spectra in LAMOST, the training set is taken from SDSS, whose spectra are homogeneous with LAMOST. SDSS has an authoritative pipeline that has already classified the spectra into specified subclasses. We can therefore verify the proposed methods and construct a credible network with SDSS spectra, then search for CVs in the unlabeled LAMOST spectra.
A total of 417 1D spectra were selected from SDSS [30], as shown in Figure 1. Figure 2 shows an MRS spectrum of LAMOST. The top and bottom panels are the B (blue, 4950∼5350 Å) and R (red, 6300∼6800 Å) bands, respectively (all figures of LAMOST spectra follow the same convention). The SDSS spectra are trimmed to the same bands as LAMOST in preprocessing.

Experimental Spectra for 1D Classification
A total of 46,180 1D M-type spectra were selected from SDSS-DR16 for 1D spectral classification. The 1,234,445 unlabeled 1D MRS spectra from LAMOST-DR7 are the data source for the CV search, as shown at the bottom of Figure 1.

Experimental Spectra for 2D Testing
LAMOST has 16 spectrographs in total, and each spectrograph accommodates 250 fibers. The light of 4000 celestial objects is fed into the fibers simultaneously and recorded on the CCD detectors after passing through the spectrographs. The 2D fiber spectral images of LAMOST are thus obtained. Figure 3 shows 250 2D fiber spectra arranged from left to right, with wavelength increasing from top to bottom. Figure 4 is a single 2D spectrum extracted from Figure 3. The 2D spectra are normally used to produce the 1D spectra in subsequent processing steps.
Owing to the lack of labeled 2D spectra from LAMOST, we fold the 1D SDSS spectra into spectral matrices for 2D processing.

Data Preprocessing
SDSS (R∼2000) and LAMOST (R∼7500) have different spectral resolutions, and the two sets need to be brought to the same resolution; otherwise, at their respective resolutions, the spectrum of the same object from the two surveys will present different features. Wavelength resampling and normalization are also carried out in data preprocessing.
The flux of each spectrum is interpolated in data preprocessing. Using this interpolation, the spectra with ∼4000 flux values (features) are resampled to 5000 dimensions and folded into an m × n spectral matrix for 2D processing. In our experiment, m = 50 and n = 100 are empirical values.
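The resampling-and-folding step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the paper uses quadratic interpolation, while `np.interp` (linear) is used here as a simpler stand-in, and the wavelength grid and spectrum are synthetic.

```python
import numpy as np

def fold_spectrum(wavelength, flux, n_points=5000, shape=(50, 100)):
    """Resample a 1D spectrum to n_points and fold it into a 2D matrix."""
    grid = np.linspace(wavelength.min(), wavelength.max(), n_points)
    resampled = np.interp(grid, wavelength, flux)
    # Normalize before folding so all spectra share a common scale.
    resampled = (resampled - resampled.mean()) / resampled.std()
    return resampled.reshape(shape)

# Synthetic spectrum with ~4000 flux values and an emission-like peak.
rng = np.random.default_rng(0)
wl = np.linspace(4950, 6800, 4000)
fx = np.exp(-0.5 * ((wl - 6563) / 5.0) ** 2) + 0.1 * rng.random(4000)
matrix = fold_spectrum(wl, fx)
print(matrix.shape)  # (50, 100)
```

With m = 50 and n = 100 the 5000 resampled flux values fill the matrix exactly, so no padding is needed.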
SNR and spectral type are two important factors influencing classification. M-type stars from SDSS are selected to analyze their influence in the experiment. The distribution of the M-type spectra is shown in Figure 5. The spectra are grouped according to SNR and the M-star subtype in SDSS. The SNR is divided into the ranges 5∼10, 10∼15, and above 15, and the M stars are divided into five subclasses from M0 to M4. Example spectra are shown in Figure 6.

Methods
The goal of spectral classification is to maximize the extraction of common features while minimizing noise and individual spectral features. A variety of classification networks are employed and tested.

Convolutional Neural Network
A CNN can handle both 1D and 2D data, depending on whether its kernels are 1D or 2D. We use a CNN to classify M stars because CNNs perform well in image classification and can also handle 1D spectral classification. The feature extraction capability of a CNN can help improve the accuracy of traditional classification methods. Quadratic interpolation is performed to facilitate folding the spectra into a 2D spectral matrix.
As a special form of CNN, the 1D CNN has established applications in signal processing. We adopt the network structure shown in Figure 7. In the network construction, the backpropagation algorithm is used: the 1D input composed of spectra is trained continuously to fit the parameters of each layer by gradient descent.
Using the total error to compute the partial derivative with respect to a parameter, the magnitude of that parameter's influence on the overall error can be obtained and used to correct the parameter during backpropagation. Since the network is a sequential arrangement of layers, the total error computed at the output layer is used to obtain the partial derivatives of the parameters in all layers. The local gradient of a neuron depends on the neurons it feeds in the subsequent layer, so its calculation requires a backward recursive pass through each later layer. Denoting the linear output of a layer by v and the activation output by Out = ϕ(v), the recursive formula for the local gradient of neuron j is

δ_j = ϕ′(v_j) Σ_k w_kj δ_k,

where the sum runs over the neurons k of the next layer and w_kj is the weight connecting j to k. In this successive calculation, the gradient of each parameter under the total error is used as the basis for correcting the parameters in the direction that minimizes the loss function.
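The recursive gradient calculation described above can be checked numerically on a toy two-layer network. The tanh activation, layer sizes, and squared-error loss below are illustrative assumptions, not the paper's network; the point is only that the backward recursion reproduces the finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
W1 = rng.normal(size=(4, 3))    # hidden-layer weights
w2 = rng.normal(size=4)         # output-layer weights

def forward(W1, w2, x):
    v = W1 @ x                  # linear output of the hidden layer
    out = np.tanh(v)            # Out = phi(v)
    y = w2 @ out                # scalar network output
    return v, out, y

v, out, y = forward(W1, w2, x)
loss = 0.5 * y ** 2             # toy total error

# Backward pass: local gradient at the output, then the recursion
# delta_j = phi'(v_j) * sum_k w_kj delta_k.
delta_out = y
delta_hidden = (1 - np.tanh(v) ** 2) * (w2 * delta_out)
grad_W1 = np.outer(delta_hidden, x)   # dE/dW1

# Finite-difference check of one weight confirms the recursion.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
_, _, yp = forward(W1p, w2, x)
numeric = (0.5 * yp ** 2 - loss) / eps
print(abs(numeric - grad_W1[0, 0]) < 1e-4)  # True
```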

VGGNet and ResNet
VGGNet, with prominent generalization ability across different datasets, is widely used in 2D data processing. Considering that our spectral dimensions are not particularly large, and in order to prevent overfitting on the training set, the relatively shallow VGG16 network shown in Figure 8 is used in our experiments. The key improvement of VGGNet is to replace large, shallow convolutional layers with stacks of small, deep ones by reducing the kernel size. In a convolutional layer, the number of extracted features output by each layer can be calculated as

n_out = ((n_in − k)/s + 1)/p,

where p, k, and s are the max-pooling size, kernel size, and stride, respectively. As classification networks have developed, they have been constantly deepened for better feature extraction and combination, but deep networks are difficult to train and risk losing information.
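The output-size formula above can be written as a small helper. It assumes a "valid" convolution with no padding and non-overlapping pooling, since the paper does not state the padding scheme; the example numbers are illustrative.

```python
# k, s and p are kernel size, stride and max-pooling size, respectively.
def feature_map_size(n_in, k, s, p):
    conv_out = (n_in - k) // s + 1   # valid convolution, no padding assumed
    return conv_out // p             # non-overlapping max pooling

# Example: a 5000-point spectrum through a k=3, s=1 conv and p=2 pooling.
print(feature_map_size(5000, k=3, s=1, p=2))  # 2499
```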
ResNet, as shown in Figure 9, proposes residual learning to solve these problems, allowing information to be transmitted across layers to preserve its integrity while learning only the residual of the previous network output. This allows the number of convolutional layers to be increased, which is an effective way to train a very deep network. The most prominent characteristic of ResNet is that, in addition to the result of the conventional convolution computation, the initial input value is also added to the final output. The network therefore fits the difference between the two, and the calculation of each ResNet layer is

H(x) = F(x) + x,

where x is the layer input and F(x) is the residual branch. The gradient calculation is then changed from the conventional one to

∂L/∂x = (∂L/∂H)(1 + ∂F/∂x).

Compared with traditional networks, the extra value "1" makes the calculated gradient difficult to vanish, which means the gradient computed at the last layer can be transmitted backward effectively; this effective transmission of the gradient makes training on spectral features more efficient. However, considering that our spectral features are limited and may suffer strong noise interference, the efficient feature extraction of the network may lead to significant overfitting, giving the trained network poor generalization capability.
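The effect of the extra "1" in the residual gradient can be seen in a scalar illustration: with H(x) = F(x) + x, the derivative is 1 + F′(x), so it cannot vanish even when the residual branch F is nearly flat. The branch below is an arbitrary illustrative function, not a ResNet layer.

```python
import numpy as np

def F(x):
    return 0.1 * np.tanh(x)      # a weak residual branch (illustrative)

def H(x):
    return F(x) + x              # skip connection adds the input back

x0, eps = 0.7, 1e-6
numeric_grad = (H(x0 + eps) - H(x0 - eps)) / (2 * eps)
analytic = 1.0 + 0.1 * (1 - np.tanh(x0) ** 2)   # the extra "1" from the skip
print(abs(numeric_grad - analytic) < 1e-6)  # True
```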

EMCCNN
EMCCNN is a network designed for the characteristics of spectra and achieves good results across different SNRs. We find that, for spectra, deep convolutional layers can lead to significant overfitting and very poor generalization capability. Direct convolution of spectra extracts a lot of noise and individual features, which is not ideal for classification. Unsupervised denoising methods tend to remove some spectral feature peaks, which is also not an ideal way to maximize the common features of a spectral type.
Therefore, we add a supervised denoising network before the convolutional neural network, which helps to distinguish features from noise and allows the network to extract spectral features instead of noise. On the other hand, the types of spectral feature peaks are not uniform: for different types of peaks, different convolution kernel sizes may extract features of different quality. Some features may be extracted better by larger convolution kernels, others by small-scale kernels. We therefore let convolution kernels of different scales learn features simultaneously, combine these features, and obtain different weights for them through the fully connected layer. The resulting feature extraction network, EMCCNN, is shown in Figure 10. A detailed description of the EMCCNN architecture is given in Appendix A, and we use cross-entropy loss as the loss function of EMCCNN.
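The multi-scale idea can be sketched in a few lines of numpy: the same (already denoised) spectrum is convolved with kernels of several widths and the resulting feature maps are concatenated before the fully connected layer. The kernel sizes and random weights below are purely illustrative; they are not the trained EMCCNN parameters, and the supervised denoising encoder is omitted.

```python
import numpy as np

def multi_scale_features(spectrum, kernel_sizes=(3, 7, 15), seed=0):
    """Convolve one spectrum with kernels of several widths and concatenate."""
    rng = np.random.default_rng(seed)
    features = []
    for k in kernel_sizes:
        kernel = rng.normal(size=k)              # stand-in learned weights
        fmap = np.convolve(spectrum, kernel, mode="valid")
        features.append(np.maximum(fmap, 0.0))   # ReLU activation
    return np.concatenate(features)

spec = np.sin(np.linspace(0, 20, 5000))          # stand-in spectrum
feats = multi_scale_features(spec)
print(feats.shape)  # (14978,) = 4998 + 4994 + 4986
```

In a real network the concatenated features would feed a fully connected layer, which learns the relative weight of each scale.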

1D Convolution Experiment and Results
Since the massive LAMOST spectra in this experiment are unlabeled, we verified the validity and efficiency of the method on SDSS spectra before searching for CVs in the massive LAMOST spectra.
We first compare the spectral classification of a DNN (deep neural network) with four hidden layers and a 1D CNN, as shown in Figure 11. The optimizer was stochastic gradient descent, the learning rate was set to 10⁻⁴, and the activation function was ReLU (rectified linear unit). We found that the CNN can quickly extract the corresponding features, but it also overfits quickly: its accuracy (number of spectra correctly classified to the parent class / total number of spectra) is 60% on the test set, and its generalization capability is extremely poor.
The CNN has strong feature extraction capability, but it is not suitable for spectra because it also extracts many individual features and much noise. The problem becomes even worse as the SNR increases.
Considering that VGG16 is composed of several convolutional layers, direct convolution of the spectra would behave like the direct use of a CNN, leading to overfitting of the training data. Hence, we add an encoder to the VGG16 network for denoising. After denoising, VGG16 is not as poor as the CNN for spectra within the range 5 ≤ SNR ≤ 15, but it is still not satisfactory for spectra with SNR > 15. We consider that the features of high-SNR data are more obvious, so individual spectral features and noise strongly interfere with a very deep network. Moreover, the feedback path of the denoising encoder is too long, making it difficult to separate individual features and noise from common features.

ResNet's interlayer transfer mechanism might help solve the overfitting problem, but the effect is not very satisfactory. ResNet and the CNN also perform well on the training data, reaching more than 99% accuracy in the three SNR ranges, but they do not perform well on the test set, although ResNet is not as poor as the CNN. We tried adding a dropout layer to reduce overfitting, but this makes training very slow, and the final result is still unsatisfactory, as shown in Figure 12.

The experiments summarized in Table 1 show that networks with deep structures are not suitable for spectral classification: they fully extract noise and individual spectral features, and placing a denoising encoder before them makes the feedback path too long, leading to poor denoising. Based on these conclusions, we propose EMCCNN, which is not too deep and is thus beneficial to training the denoising capability of the encoder. We find that EMCCNN is very robust to SNR, which makes it especially suited to LAMOST spectra.
For an objective comparison, we used a support vector machine (SVM) to classify the spectra directly and compare the results with those of EMCCNN in Table 2. The comparison demonstrates that the SVM can quickly fit the training set but tends to overfit severely in the experiment.
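A baseline in the spirit of this comparison can be sketched with scikit-learn: raw flux vectors fed directly to an RBF-kernel SVM. The synthetic two-class data below stands in for the real spectra, which are not distributed with the paper, and the hyperparameters are scikit-learn defaults rather than tuned values.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 200, 500
X = rng.normal(size=(n, dim))                    # stand-in "flux vectors"
y = (X[:, :10].sum(axis=1) > 0).astype(int)      # toy two-class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print(f"train acc {clf.score(X_tr, y_tr):.2f}, test acc {clf.score(X_te, y_te):.2f}")
```

Comparing the train and test scores on real spectra is exactly how the overfitting tendency reported above would show up.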

2D Convolution Experiment and Results
We apply classification networks designed for 2D data to spectral classification, to explore the feasibility of classifying 2D foldings of the 1D data. We fold the spectra into 50 × 100 matrices and feed them into a 2D classification network. The feature peaks of the spectra become inconspicuous after folding: the values are correlated only along each row, and the correlation down each column is not obvious. However, Figure 13 clearly shows that each type of spectrum looks different after folding, which makes applying a 2D classification network possible, and we aim to apply it fully so that satisfactory performance can be achieved. As shown in Table 3, VGG16 performs quite well in the range 10 < SNR < 15, and ResNet also performs better than on 1D spectra in the range 5 < SNR < 15. This discovery is encouraging: although the rows of the folded 1D spectra are not directly correlated with each other, the 2D classification network can still achieve satisfactory results, which proves that 2D classification of 1D spectra is feasible and can even outperform 1D classification for some spectra. Because the results of ResNet_2d on the training set are very good, we tried early stopping to overcome overfitting; the results of ResNet_2d at different epochs are shown in Figure 14. Evidently, early stopping cannot improve the accuracy on the test set.

After the above comprehensive analysis and comparison of the methods under different situations, EMCCNN shows its superiority on 1D spectra, especially its robustness against noise, and is selected as the final structure to search for CVs in the LAMOST archives.

Subclass Classification Results
Because the experimental spectral categories are not uniform, we explore the detailed per-category results in this section. We use precision (P), recall (R), and F-measure (F) to judge the performance of EMCCNN for each subclass.
precision = TP / (TP + FP), recall = TP / (TP + FN),

F-measure = (2 · precision · recall) / (precision + recall), (10)

where TP counts current-subclass samples predicted correctly by the classifier, FP counts other-subclass samples wrongly predicted as the current subclass, and FN counts current-subclass samples wrongly predicted as other subclasses. Because the subclass data are not uniform and the numbers of features are not consistent, the performance of EMCCNN differs between subclasses. The results are shown in Table 4.
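The per-subclass metrics above reduce to three lines of arithmetic on the confusion counts. The example counts below are illustrative, not values from Table 4.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from per-subclass confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = precision_recall_f(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```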

Results of CVs Searching in LAMOST Spectra
A systematic search for CVs in the MRS of LAMOST was carried out with EMCCNN. After cross-matching the results of EMCCNN with SIMBAD, 15 spectra are identified by EMCCNN as CVs, of which one belongs to a CV candidate observed three times. In Table 5, the identified CVs are listed with RA, GCVS name, type, etc. Their spectra are shown in Figures 15-18. We report the spectra at https://github.com/whkunyushan/CVs.

Conclusions
There is a great need for accurate, automatic methods of classifying specified objects in massive spectra. The goal we propose for spectral feature extraction is to maximize the extraction of common features and minimize the extraction of noise and individual spectral features, which enhances the generalization capability of the network. We found that a simple DNN cannot extract spectral features well, while a CNN with strong feature extraction capability leads to significant overfitting. It also emerges that a deep network is not suited to the denoising training of the encoder, because its feedback path is very long; this makes the encoder difficult to train and prevents a good denoising result.
In most cases, CVs, especially CVs at quiescence, show emission lines of the Balmer series and HeI and HeII lines. For LAMOST spectra, only the Hα emission line (6563 Å) is covered, in the R band, which places a higher demand on the feature extraction method. In this work, EMCCNN simultaneously performs supervised denoising, feature extraction, and classification. EMCCNN is not a deep network, which is ideal for the encoder to achieve a good denoising result. On the other hand, convolution kernels of different scales extract spectral peaks of different quality, which provides sufficient options for the classifier. By weighing the quality of features, the network can select high-quality spectral-type features. This design enables EMCCNN to achieve the best results in the three SNR ranges of the 1D data.
The traditional view is that folding a 1D spectrum into a 2D image causes information loss, and the network has to do extra learning to understand that the pixels are correlated only along the horizontal axis and not along the vertical axis. However, EMCCNN achieves precise acquisition of the characteristic features of folded 2D spectra and good classification results, especially in cases where 1D classification tends to overfit severely. This proves that folding the spectra into 2D can effectively reduce the tendency to overfit under certain circumstances.
Furthermore, this paper demonstrates that the classification of 2D spectra is feasible, which means that powerful deep learning SDKs (software development kits) such as Caffe, Cognitive Toolkit, PyTorch, and TensorFlow, as well as image processing libraries, can be used for 2D spectral classification directly.
The MRS of LAMOST provides more samples for astronomers to characterize the population of CVs. More new CVs will be discovered with the gradual release of LAMOST spectra.