On the Quality of Deep Representations for Kepler Light Curves Using Variational Auto-Encoders

: Light curve analysis usually involves extracting manually designed features associated with physical parameters and visual inspection. The large amount of data collected nowadays in astronomy by different surveys represents a major challenge of characterizing these signals. Therefore, ﬁnding good informative representation for them is a key non-trivial task. Some studies have tried unsupervised machine learning approaches to generate this representation without much effectiveness. In this article, we show that variational auto-encoders can learn these representations by taking the difference between successive timestamps as an additional input. We present two versions of such auto-encoders: Variational Recurrent Auto-Encoder plus time (VRAE t ) and re-Scaling Variational Recurrent Auto Encoder plus time (S-VRAE t ). The objective is to achieve the most likely low-dimensional representation of the time series that matched latent variables and, in order to reconstruct it, should compactly contain the pattern information. In addition, the S-VRAE t embeds the re-scaling preprocessing of the time series into the model in order to use the Flux standard deviation in the learning of the light curves structure. To assess our approach, we used the largest transit light curve dataset obtained during the 4 years of the Kepler mission and compared to similar techniques in signal processing and light curves. The results show that the proposed methods obtain improvements in terms of the quality of the deep representation of phase-folded transit light curves with respect to their deterministic counterparts. Speciﬁcally, they present a good balance between the reconstruction task and the smoothness of the curve, validated with the root mean squared error, mean absolute error, and auto-correlation metrics. Furthermore, there was a good disentanglement in the representation, as validated by the Pearson correlation and mutual information metrics. Finally, a useful representation to distinguish categories was validated with the F 1 score in the task of classifying exoplanets. Moreover, the S-VRAE t model increases all the advantages of VRAE t , achieving a classiﬁcation performance quite close to its maximum model capacity and generating light curves that are visually comparable to a Mandel–Agol ﬁt. Thus, the proposed methods present a new way of analyzing and characterizing light curves.


Introduction
New instrumentation technologies, such as the Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory [1], the Transiting Exoplanet Survey Satellite (TESS, [2]), and future space-based telescopes and ground-based observatories have motivated the use of automatic techniques to process and analyze the large amount of data that is being and will be generated.
In the exoplanet domain, advances in instrumentation and data analysis have allowed the discovery of thousands of exoplanets. NASA has reported (last updated 12 October 2021: https://exoplanets.nasa.gov accessed on 13 October 2021) more than 4500 exoplanets detected using different techniques, despite the fact that planets emit or reflect very faint light compared to their host star and that their orbital distance is very small relative to observational distance. The analysis of light curves has been the main source of candidate objects. These series are photometric observations of light intensity as a function of time, where the observed star luminosity varies through time as a result of intrinsic processes or due to external influences, such as an orbiting planet passing between the star and the observer. The latter phenomenon is called a transit and has been used as an effective method (http://exoplanetarchive.ipac.caltech.edu/docs/counts_detail.html accessed on 12 October 2021) method to study candidate orbiting objects. A transit light curve can be mathematically modeled as a natural phenomenon of an orbiting object, and algorithms such as the well-known Mandel-Agol method [3] can be used to fit such models. Unfortunately, the diversity of planetary systems (and their corresponding astrophysical scenarios) prevents unique model selection due to the unknown number of transits and complex dynamics, such as eclipsing binary systems, very faint observation, or the microvariability of the star [4].
The application of task-based learning models to solve specific tasks (e.g., exoplanet search) has been the common approach to reduce human effort [5][6][7][8]. These are methods that learn representations and model parameters based on a specific learning task and a particular objective function, usually based on the available labels. On the other hand, learning models to build self-generated representations of light curves have been largely unexplored.
Task-free methods, such as those in [9][10][11], learn the key features (representation) for detecting candidate exoplanets from data instead of imposing a model without the correct capacity to learn the task (i.e., excessively simple or complex for the observed data). These methods learn pattern structures from the data without the need for labels (i.e., unsupervised learning), having the advantage of using all the available data.
A major challenge of working with time series in astronomy is that they are usually noisy, meaning that values are missing for several timestamps or that the measurements are not uniform throughout the series. Different approaches have been proposed to solve this representation issue such as binning the light curve based on a previous folding of the periodic behavior [6], extracting specialized features based on subsets of data points [12], or explicitly modeling the time dependencies [13]. Based on the latter, some approaches have included the time information in the input representation of the learning model, showing improved results [10,14,15].
Motivated by the success of deep learning techniques [16] in different research fields, we focused on using the auto-encoder models. In this paper, we propose using variational (stochastic) models as dimensionality reduction techniques in order to learn a quality deep representation for Kepler light curves. Specifically, we propose two variational recurrent auto-encoder extensions that can process the noisy Kepler time series by using the time information of each observation. We extend the idea of using all the information on the time series by including the standard re-scaling pre-processing task into the learning model as an end-to-end architecture. The motivation to extract the information on the original scale of the light curve, based on Flux standard deviation (F-std), emerges from the fact that the relative sizes of exoplanets and noise are correlated with the F-std of the signal.
The concept of quality used and proposed in this paper is based on obtaining a compact yet robust representation (i.e., parsimonious), with the capability of smooth time series and with less correlated features (disentangled). In addition, good quality should imply an informative representation of the behavior on the time series, for example by having a better performance on classification tasks (such as time series categories). The need to learn quality representation in astronomy has potential uses in cases such as denoising, large-scale data processing, clustering, manifold learning, characterization, and other unsupervised applications.
Our experiments show that the proposed variational models are robust in learning patterns with less noise (denoising effect), having a balanced behavior between reproducing the original time series and smoothing. Indeed, the resulting light curve could be compared to the isolated simulation of a transit object. The proposed quality factor of the representation was also achieved, representing an improvement over the deterministic counterparts proposed thus far. This paper is organized as follows. In Section 2, the main methods for light curve representation are jointly described with our proposed methods and their main differences. Then, the experiment setting and results were introduced in Section 3. A brief discussion is presented with regard to the advantage of the proposed methods in Section 4. Finally, Section 5 summarizes the conclusions of this work.

Materials and Methods
We review the most relevant current works for light curve representation in Section 2.1. Then, we introduce the proposed methods for learning light curve representation in Section 2.2, a simple extension in Section 2.2.2, and another more complex in Section 2.2.3.

Light Curve Representation Methods
A light curve is a time series (function of time) containing measurements of the light intensity of a celestial object or region. When one celestial body crosses in front of another astronomical object and blocks any fraction of its light, it is called a transit. In this section, we briefly introduce different representation approaches used to study transit light curves (detection, vetting, classification, characterization) and related fields with model-based and self-generated representations.

Model-Based Representations
The Mandel-Agol simulation [3] models the transit of a spherical planet around a spherical star assuming a non-linear limb darkening model on the light source. It does this by modeling the opacity observed on the light intensity according to the planet's position with a function over time c(t). When the planet is in front of the star, the opacity is at maximum ( c < 1). On the other hand, when the planet orbits without blocking the star light, the opacity is minimum ( c = 1). When the planet is close to being in front of the star, the intensity c(t) is modeled as a polynomial based on the limb darkening of the star. This requires knowing the distance from the center of the planet to the center of the parent star, as well as the radius of each one of the bodies, the transit period, the inclination, and the limb darkening models (coefficients). There are techniques that find the model parameters through least squares (LS) or Markov Chain Monte Carlo (MCMC).
However, light curves can be observed not only by a transit, but several inherent and exogenous processes might be involved in the variability of the observed intensity of a star. Therefore, several machine learning techniques have been used to classify variable stars, typically by manually extracting specialized features from the light curve as representation and then applying classic pattern recognition methods to them. For example, Richards et al. [12] presents a catalog of variable stars where 53 specialized features are extracted from light curves through statistics such as kurtosis, skewness, standard deviation, and Stetson, plus other features based on the period and frequency analysis of a Lomb-Scargle [13] fitted model. Donalek et al. [17] also worked on classifying variable stars on the Catalina Real-Time Transient Survey (CRTS) and the Kepler Mission, extracting similar features from the light curves. In the work of Nun et al. [18], statistical descriptors are used as inputs for a random forest algorithm that can detect anomalous light curves based on probabilistic learning models. These outliers are removed from the training set used in variable stars' classification.
For transit vetting, we found the so-called Autovetter [5] which uses the random forest model over the features derived from the statistics pipeline on the Kepler mission for vetting candidate objects on the Kepler Threshold-Crossing Event (TCE) data. Another approach is to represent the light curve as phase-aligned sections called "folds" to address the problem of irregular sampling in transits. This folded light curve centers the transit and stacks all the times on which it occurs, as shown in Figure 1. In this sense, the most common is to bin the folded light curve based on a window proportional to the estimated transit period. For example, as a dimensionality reduction method (embedding representations) of folded light curves, Thompson et al. [11] proposed a locality preserving projection to filter out light curves with non-transit-like shapes, while Armstrong et al. [19] proposes a self-organization map on the classification of true planets.
Neural networks and deep learning [16] algorithms have become very popular in solving problems where feature extraction from data is non-trivial. These algorithms have been successfully applied for transit vetting in recent years by using the binned folded light curve representation. For example, Shallue & Vanderburg [6] used a one-dimensional convolutional neural network (1D CNN) model on vetting candidate exoplanets on the Kepler TCE data with the global and local representation of the folded light curve. Pearson et al. [7] used a similar approach to [6] for detecting transit shape objects trained on simulated data and evaluated using the Kepler mission dataset. Schanche et al. [8] also used a 1D CNN model for distinguishing and vetting candidate exoplanets among variable stars (four classes) on the Wide Angle Search for Planets dataset.

Self-Generated Representations
While most of the representations focus on using astrophysical knowledge to ease a specific task (task-based), e.g., classifying star variability or vetting transits, only a few of them tried different representation approaches without direct human intervention or being especially designed for some task (task-free), which may have the advantage of using unlabeled data and being used for different purposes. Mackenzie et al. [9] used an unsupervised learning algorithm known as affinity propagation with a custom distance function to build a new representation from light curves and then use it on a linear support vector machine to classify variable stars. The representation is based on the similarity between fragments of the light curves and cluster exemplars or centroids. Bugueño et al. [20] also used an unsupervised learning algorithm to build a new representation on the Kepler mission, and in a second phase, used supervised learning to classify true planets. The principal component analysis (PCA) to extract features on the frequency domain representation of light curves, based on a discrete Fourier transform, was used in [20].
In the work of Mahabal et al. [21], an image representation (i.e., grid) was obtained from the variations of magnitude through time from light curves with missing values. This work used the variable star data from the CRTS dataset and completed the classification task using two-dimensional convolutional neural networks (2D CNN), which are models that learn local spatial dependencies [22]. Aguirre et al. [14] also obtained variations in magnitude from variable stars light curves and the delta times of each sample. These were used as different input channels to train a 1D CNN classifier with shared-weights through novel data augmentation techniques.
In the work of Naul et al. [10], time was used as an additional input channel but through recurrent neural network (RNN) models, which are models that learn temporal dependencies [23]. Naul et al. [10] presented a recurrent auto-encoder (RAE) that learned how to embed and then reconstruct a light curve by setting the original times using RNN models in the encoder and decoder phase-which we named "RAE t " (RAE plus time information). The authors in [10,15] showed that learned representations are useful for classifying variable stars, thus improving the results obtained using statistical features [12]. It also explores the use of the folded light curve representation which further improves the obtained results.
Other types of research have been performed to characterize noisy signals. The most common approach is to estimate the missing values. Rehfeld et al. [24] presented the principal techniques used to analyze the correlation on irregularly sampled time series. Mondal & Percival [25] proposed an estimator for wavelet variance for time series with missing data. There is also the Lomb-Scargle periodogram [13] for period analysis on data with missing values and the previously discussed Mandel-Agol model.
Marquardt & Acuff [26] included delta times (time differences between two adjacent samples) to obtain a spectral analysis from time series. However, to the best of our knowledge, including delta times as an input in neural networks models of noisy times series was first explored by Che et al. [27]. This study presents a modification of a gated recurrent unit (GRU) model [28], namely GRU-D, which uses a binary mask for missing values and the delta times as input channels. The objective was to impute (fill) the missing values to improve the predictions on medical problems.

Variational Auto-Encoders
The variational auto-encoder (VAE) is a stochastic auto-encoder (AE) learned in a probabilistic fashion based on the variational lower bound or evidence lower bound. The VAE framework [29] is extended to work with uniform time series on the variational recurrent auto-encoder (VRAE) proposed by Fabius & van Amersfoort [30]. The motivation behind this is that VAEs are deep generative models trained on an unsupervised scenario, learning latent variable representation as it trains. These variables are learned through the observed distribution, so they are built to adapt to the variations in the behavior of the data. This is the main difference of standard (deterministic) AEs that learn an invariable specific point for the input pattern [31].
The use of the VRAE on different time series applications is associated with anomaly detection [32][33][34], with the objective of detecting outliers. It usually compares the reconstructed (or generated) input with the original values and sets some threshold of tolerance to normal behavior. The sampled values from the latent distribution have smooth transitions [29], so the reconstructed data should reduce the bias of the specific patterns [34]. In addition, a VAE for generating transit-shape light curves as a data augmentation technique was presented by Woodward et al. [35].
In the work of Locatello et al. [36], the importance of disentangled representations for better unsupervised learning was discussed-better in the sense that the representation should contain the information of the data in a compact and interpretable structure, among other points. In [37], a good representation was defined as a representation that discovers and disentangles some underlying factors in variation that the data may reveal. These should be explanatory factors of the data that are a priori unknown and need to be learned. Locatello et al. [36] showed that the VAE regularization also aims to achieve this.

Extending VRAE to Include Temporal Information
Based on the effectiveness of deep stochastic models, we propose two VRAE extensions that incorporate the time information in the dimensionality reduction of light curve time series, based on the time processing proposed by [10]. The first model is a natural extension from VRAE that includes the temporal information of the signal, while the second model adds the information of the Flux standard deviation for the dimensionality reduction task.

Problem Setup
cording to an unknown probability distribution p( x). These inputs pattern are vectors of variable length, j be the timestamp when the j-th observation was obtained for datum i. Furthermore, we define the time interval, or delta, for each observation: be the standard deviation of the i-th time series. This paper focuses on transit-shape domain objects over a light curve x (i) in order to recognize patterns of exoplanets orbiting its host star. An example is shown in Figure 1 (the time unit used is explained in Section 3.1).
Assuming a detrended light curve x (i) (with zero mean), the standard deviation, s (i) ∈ R + , expresses the scale magnitude of the variability on flux x (i) : which we abbreviate to F-std (Flux standard deviation).

VRAE Including Delta Times
A VAE, as any other auto-encoder architecture, is composed of a tangled encoder and decoder models trained on an unsupervised scenario (unsupervised refers to the fact that no labels are used as inputs to the model). The encoder model q φ ( z| x), with parameters φ, codifies the input pattern x to a multi-dimensional latent variable z, which has fewer dimensions than the original space (data compression). The decoder model p θ ( x| z), with parameters θ, reconstructs the input pattern from the codification z. The objective of the model is to maximize a (variational) lower bound L(θ, φ; D) of the log-likelihood (θ, φ; D) [29]. For example, for an input pattern x, we have: where the first term of L is related to the expected reconstruction likelihood and the second enforces the consistency between the posterior obtained by the encoder q φ ( z| x) and some prior p θ ( z), based on the Kullback-Leibler (KL) divergence [38]. In the standard VAE, the distribution q φ ( z| x) is typically normal, N (µ( x), diag(σ( x))), where µ( x) and σ( x) are modeled with neural networks yield vectors. However, the common choices of p θ ( z) lead to a KL divergence that can be analytically integrated. However, the first term of L needs to be approximated with the so-called re-parametrization trick:ˆ z = µ + σ , with an auxiliary noise variable ∼ N ( 0, I). The operator symbolizes the Hadamard product, i.e., an element-wise product of vectors and matrices.
The fully connected VAE for light curves proposed by [35] could be extended to adapt RNN models into the encoder q φ ( z| x) and decoder p θ ( x| z) as a VRAE model. However, to adapt the model to incorporate the time information of the light curves, we need the time intervals ( δ) as an extra input channel. To do this, we followed the idea of the RAE t archi-tecture proposed by Naul et al. [10] (shown in Figure 2), which adds the time information in both encoding and decoding sections of the auto-encoder. The proposed extension is named VRAE t model (VRAE plus time information). This model could be expressed by the encoder q φ ( z| x, δ) and the decoder p θ ( x| z, δ), which have the time information δ of every observation on the time series x as an extra input signal.
The motivation behind the variational extension is that the generative (stochastic) model learns a latent variable with smoother transitions, meaning that their latent space must be continuous [29]. Furthermore, since the model learns the distribution of the encoded variable, it becomes robust to input variations, similarly to a denoising autoencoder [31]. Indeed, the learned distribution must have more likely regions or confidence intervals where the input data should be.  [10]. The sequence is processed by recurrent layers and produce a final fixed-length embedding with a single fully connected layer on last state. The decoder first repeats this embedding T i times, and then appends the delta times (δ t , in Figure ∆t). To determine the points at which the function will be evaluated, the delta times are input to both the encoder and decoder.

VRAE with Embedded Re-Scaling
Currently, deep learning methods need a standardized version of the input representation x (i) that retains the original distribution but re-scaled to more tractable magnitudes. This is used for properly training neural network models, based on the generalization principle that magnitudes of weights and activation functions must be somehow bounded [39,40]. Furthermore, Ioffe & Szegedy [41] recommend that a normalization step is added after each layer to obtain more stable training. On a time series, this transformation is usually based on its own magnitude behavior, e.g., the range, standard deviation, and norm. For example, These transformations are necessary for the model to detect pattern behaviors instead of magnitude behaviors. The problem is that the information on the original scale of variability is lost in the process, which can be useful in some cases. Focused on light curves, we propose adding the Flux standard deviation s (i) of the flux time series x (i) as an additional input to the VRAE t model, but still use the re-scaled version of the data for achieving bounded weights and activations. To the best of our knowledge, this is the first approach to do this as a end-to-end architecture.
We consider the F-std s (i) (Section 2.2.1 for problem notation) as another input pattern of the auto-encoder model VRAE t that also needs to be reconstructed. With this, the objective of the VRAE t with re-scaling or S-VRAE t is to reconstruct the time series on the un-scaled values, based on the F-std.
S-VRAE t architecture: The outline of the main components of the S-VRAE t model which clarify the differences with respect to VRAE t , as summarized here: 1.
Re-scale data: The first layer of the encoder q φ (·) re-scales the data by dividing on the F-std. This step is performed in order to use the standardized version of the data, as the literature recommends.

2.
Encode: The encoder q φ (·) adds the F-std as an input pattern to the coding task by ) in order to extract the information on it. 3.
Sample: The sampled latent variable is given by adds the F-std to the reconstruction task in order to estimate the original F-std by p θ (s (i) | z (i) ).

5.
Re-scale reconstruction: The last layer of the decoder re-scales the data, by returning the reconstructed F-std (multiplied by it). This final step is performed in order to obtain a reconstructed time series on the un-scaled values' representation.
Based on the high variability in the range of values of the F-std and inspired by [41], we introduced a normalization logarithm layer (Norm L ) that is applied to s (i) . Considering a random variable a, the forward pass to obtain a new value b is given by The mean and standard deviation (std) are previously computed over the whole dataset. Furthermore, another layer is introduced to reverse this transformation, given by Compared against the VRAE t on Algorithm 1, a pseudo-code of the S-VRAE t forward pass is presented on Algorithm 2. Here, it can be seen that the main differences are the re-scaling process inside the model and the additional input pattern to be used. The first thing to formalize is that g w (·) and f w (·) are non-linear functions (i.e., deep learning models) parameterized by the weights w. The E(·), which stands for the embedding function, corresponds to the first layers of a deep learning model. Inside f φ (·), the E 1 (·) is an RNN model and E 2 (·) is a multi-layer perceptron (MLP) model with Norm L (·) as the first layer. Here, E 1 (·) codifies the information from the noisy standardized time series, while E 2 codifies the F-std. On the decoder phase, the g 1 θ (·) is similar to a mirror model of . Then, g 1 θ (·) reconstructs the standardized version of the time series and g 2 θ (·) reconstruct the F-std. The architecture of the proposed S-VRAE t is illustrated in Figure 3. Please note that the Encode and Reconstruct blocks are the same as the RAE t in Figure 2 but without the F-std transformations. In summary, S-VRAE t learns a coded deep representation of the noisy un-scaled time series in an unsupervised way. The advantage is that those features are optimized for the specific input behavior (rather than classification) and are more disentangled (more independent between each other) than human-crafted counterparts. The evidence for this claim can be found in Section 3.3.
∼ N ( 0, I) // Auxiliary noise 5:ˆ z (i) ← µ (i) + σ (i) 6: // Decode or Reconstruct: Input: x (i) -measurements of the time series δ ( ) -delta times of the time series Output:ˆ x (i) -reconstructed time series 1: ∼ N ( 0, I) // Auxiliary noise 7:ˆ z (i) ← µ (i) + σ (i) // Step-3 8: // Decode or Reconstruct: Step-4 9:ˆ x (i) In any VAE model, the objective of learning the distribution of the latent variable is to obtain the more likely reconstructed input pattern. In our S-VRAE t , the time series is reconstructed on the original un-scaled behaviorˆ x (i) , so we can consider it as a smoothing or denoising model in a data-dependent way (based on data behavior). As the model processes the F-std inside it, it allows a robust F-std reconstruction (based on the noise).
For example, consider the following decomposition of a time series: p , flux measurements with an isolated or perfect environment such as the one modeled by [3], and n (i) as the intrinsic noise of these measurements, both with zero mean. Based on this decomposition, we can express the noised and denoised dispersion, F-std, as p could only be close-estimated by S-VRAE t in order to re-scale the denoised reconstructed time serieŝ Step 5 of Algorithm 2). Other models, with a fixed re-scaling process, will always use the noised F-std: s  Figure 3. Diagram of the S-VRAE t (VRAE t with re-scaling) architecture for time series. Firstly, the raw time series input is re-scaled and passed to the RNN encoder block together with the delta times. In parallel, the F-std input is also encoded. A concatenation is performed on both learned embedding values to obtain the Normal distribution parameters of the latent variable. After a sample is performed over this distribution, the value is used to reconstruct the F-std. In parallel, the sample is repeated and concatenated with the delta times in order to reconstruct the scaled time series on the time space. Finally, the F-std is returned to the time series through the network estimation.

Loss Function
In this section, we describe the optimization objectives of the proposed variational auto-encoder models (VRAE t and S-VRAE t ) for the time series domain.
Reconstruction loss: We used a modified version of the mean squared error (MSE) function for the un-scaled variable time series, named weighted or re-scaled MSE, given by where the additional weight w (i) is associated to every input pattern (as a sample weight [42]) and is defined as the inverse of the variance in the time series: w (i) = (var( x (i) )) −1 . The weight value comes from w i = (s (i) ) −2 , with the idea of normalizing and removing the intrinsic dispersion of the time series x (i) , and so every input pattern has the same impact in the objective (note that SMSE is similar to the chi-square error of a linear least-squares fit with (uncorrelated) uncertainties that follows from a maximum likelihood approach). In order to perform a proper reconstruction of the input patterns, the SMSE is applied as a loss function for the deterministic counterpart of VRAE t , i.e., the RAE t model.

Variational loss:
The two factors to optimize on the variational lower bound (Equation (1)) could be expressed by: (i) a reconstruction factor through the SMSE described above; and (ii) the closed solution for the KL divergence with normal priors [29]. Following Higgins et al. [43], we combine these two factors using a regularization parameter β obtaining the loss function of the VRAE t model: Our earlier experiments on the validation set led us to set the β hyper-parameter to a small value of 10 −3 . This means that the priority is set on the reconstruction task but with a variational component of regularization.
Variational loss with re-scaling: As our proposed S-VRAE t adds the F-std to the reconstruction task, we need to specify an additional loss that guides this reconstruction. We used the mean squared logarithm error (MSLE) as a loss function, given by Then, the learning objective of the S-VRAE t model is given by where the α hyper-parameter was set to 10 −1 /var(log S). The same motivation used for β led to set the 10 −1 value, firstly giving relevance to the reconstruction loss SMSE inside V. However, the var(log S) value is to re-scale the values of the MSLE loss (by removing the standard deviations) to the same proportions of SMSE loss. The loss function used helps the models, guiding them into the right reconstruction. As each component has a hyper-parameter weight, this does not force the model to explicitly predict the exact value, as it only acts as a data-driven loss.

Experimental Setup and Results
This section presents the experimental setup in which the proposed methods were validated and the results obtained through that setup.

Dataset
Our work used the Kepler Objects of Interest (KOI, [44]) dataset described by the NASA Exoplanet Archive [45]. It is composed of 8054 light curve records (https://archive. stsci.edu/kepler/koi/search.php accessed on 20 September 2019) where every KOI describes a target star in the field of view that exhibits transit-like signatures in its light curve. Primarily, a TCE that was accepted as a valid astrophysical signal based on the diagnostic tests described in [44] was designated as a KOI. According to the NASA Exoplanet Archive (https://exoplanetarchive.ipac.caltech.edu/ accessed on 10 July 2021), KOIs are categorized into three tags: Confirmed, claimed exoplanets through extensive scientific analysis and follow ups; False Positive, evidence indicates that they correspond to another type of behavior not associated with transit exoplanets (e.g., eclipsing binary systems), and Candidate, those that are still under study (unlabeled data). In some cases, multiple KOIs are obtained from the analyzed host star, where each object contains the raw flux measurements with the timestamps (in BKJD), including the instrumental error associated with each measure. The time unit Barycentric Kepler Julian Day (BKJD) corresponds to BKJD = BJD − 2454833, while the Barycentric Julian Day (BJD) is simply Julian days (the continuous count of days since the beginning of the Julian period) with some corrections based in the Earth's position and the barycenter of the Solar System. The long-cadence sampling rate of the 4-years' worth of measurements was 0.0204 BKJD on average (i.e., half an hour); however, there is 22.98% of missing values on average in the collected and used light curves. The light curve data [46][47][48] described here may be obtained from the Mikulski Archive for Space Telescopes (MAST) (https://doi.org/10.17909/T9RP4V accessed on 10 July 2021).
The archive also provides a set of metadata values (https://exoplanetarchive.ipac. caltech.edu/docs/API_kepcandidate_columns.html accessed on 10 July 2021), some of which (usually those related to the host star such as its effective temperature and metallicity) are cross-matched from other catalogs, while others are calculated following a Mandel-Agol modeling fit [3] according to LS and MCMC. For comparing our self-generated features with model-based ones, we selected those that could be obtained from the light curves based on the modeling, which are: Period; First Transit Time; Inclination; Planet Radius (over Stellar Radius); Semi-Major Axis (over Stellar Radius); Transit Duration; Impact Parameter; and Fitted Stellar Density. Furthermore, we included the Limb Darkening Coefficients (quadratic model) even though they are not solved -they are fixed in the fit for each star, but they are still involved in the modeling of the light curves.
For data pre-processing, we used a detrending process that also normalized the stellar flux and apriori removed low frequency noise. It combines ideas from the combined differential photometric precision proxy used by [49]-subtract polynomial fit and sigmaclipping, and the Argabrightening detection by [50]-subtract polynomial fit and subtract median filter. This detrending process was performed over the raw flux measurements by using time. It consists of applying and then subtracting a two-degree polynomial fit with a window of 151 points (approximate 3 days on Kepler), the Sav-Gol filter [51]. Then, it subtracts a moving median filter with a window of 25 points. Finally, during the sigmaclipping, we remove positive outliers that are greater than 5σ and remove negative outliers less than −40σ.

Data Representation
For the detrended flux light curve f (i) , we used the global-folded representation proposed by Shallue & Vanderburg [6]. First, it produces p (i) , a vector with the same number of points by folding the detrended light curves on the period (P with the event centered (fold step). Then, a median window was applied every P/T times over the folded light curve p (i) (global step) producing x (i) , a vector with T points (empty values masked with 0) such that each light curve has the same length with the width interval depending on the period P. The parameter T controls the trade-off between a detailed representation and having enough points on each window for the median to be meaningful. Figure 4 shows examples of folded and global-folded representations of some light curves (based on Kepler). The second example illustrates a general disadvantage of this method, where long-period KOIs may end up with very narrow transits that fall entirely within a small number of bins. On the other hand, the third example shows how the global-folded representation helps obtain a cleaner version of the light curve when the planets are small.
As explained in Section 2.2, the delta times ( δ (i) ) of x (i) are considered as an additional input for the learning models. Furthermore, please recall that the measurements of the global-folded light curve x (i) were previously scaled through x (i) s (i) for VRAE t , while the proposed S-VRAE t model directly uses the un-scaled global-folded light curve x (i) .

Data Selection and Augmentation
We set a data-driven mask over all the objects (labeled and unlabeled) of the KOI data (8054 records) to filter out light curves without a transit behavior, obtaining 4317 objects to train in an unsupervised fashion. The process is described below: • Check for Kepler flags (in the metadata) and remove objects with "secondary event" or "not transit-like" flags; • Remove objects with a "transit score" (in the Kepler metadata) less than 0.55; • Perform a Mandel-Agol fit and remove objects with (SMSE) residual greater than 1.
The candidate objects that are not transits ("not transit-like" flags) could be because the detection was from instrumental artifacts errors, non-eclipsing variable stars or another object (non-planetary) in the background. On the other hand, the objects with a statistically different secondary event ("secondary event" flag) are most likely caused by an eclipsing binary system, i.e., two stars orbiting around their barycenter.
Furthermore, as a data augmentation step, we doubled this dataset by mirroring each folded light curves [52]. This represents of the same object, with the same properties, orbits the star in the opposite direction.  . Examples of the different representations that can be obtained from a periodic light curve by knowing the period P. The example images on every column are the raw detrended light curve, the folded (phase-space), and global-folded (setting the bins to T = 1000), respectively. The third column is the representation used in this work. The time unit is BKJD.

Model Assessment and Implementation
For assessing the quality in the latent representation of global-folded light curves learned by the proposals, we propose validating the reconstruction task based on that representation, the representation disentanglement, and the true planet classification by using the representation (extrinsic validation).

Reconstruction Validation
Using the encoder-decoder architecture, we assessed the reconstruction task by comparing the estimated input patternˆ x and real value x using the root mean squared error (RMSE) and mean absolute error (MAE) as performance metrics. For evaluating the denoising effect on the reconstruction, we used the estimated input patternˆ x and measure the auto-correlation (Auto-C) and the mean of the differences (Diff-M) between consecutive values. A high Auto-C and low Diff-M, is associated with a smoother time series. For measuring the structure left within the residual noise (difference between real and estimated time series), we used an information theory score called spectral entropy (Spectral-H) [53]. A high value of this entropy means a less structured residual in terms of signal frequencies.
Smoothing a time series is a classical signal processing task. Unfortunately, this is not straightforward for transits because transits are (statistically speaking) isolated behaviors or anomalies as shown in Figure 4. Therefore, a standard denoising model will remove transits by considering them anomaly deviations of the "normal" stellar flux. Based on this, we selected the following signal processing baselines to compare our models: the Butterworth passband filter [54], a moving average filter with different window sizes, and the RAE t [10]. In addition, we performed a Mandel-Agol (M-A) fit over the global-folded light curves, modeling points on the phase-space of this representation. The M-A fit was performed to obtain a reasonable time period for the exoplanet population analysis, using the same model calculation given by Kepler. This was implemented using the ktransit library [55] with an LS optimization based on a quadratic limb darkening model and the star coefficients of the Kepler metadata, given by [56]. At this point, the LS optimization hyper-parameters were set to the library defaults.
To make a fair comparison, the initial estimate for the period and mid-transit time was the same as that used in the folding of the light curves.

Disentanglement Validation
Using the encoder sub-architecture, we analyzed the disentanglement of the latent representation based on the features' dependency. To perform this in an unsupervised way [36], we used the Pearson correlation between all the features on the representation to determine how orthogonal the generated features are. We report the average over all the values (Pcorr) and the average over the absolute values (Pcorr-A). The mutual information (MI) between the continuous features was also measured, including a normalized version (of a value in the range [0, 1]) that is obtained by dividing on the entropy of every feature (N-MI). The calculation of MI is a discrete approximation of the real continuous space calculation, based on the k-neighbors' implementation [57].
Here, we compare the high-level specialized features included in Kepler's metadata with the features learned by RAE t [10] and the PCA features extracted over the frequency domain (spectrum) representation of the time series (Fourier followed by PCA or F-PCA) [20].

Classification Validation
Using the latent representation learned by the encoder sub-architecture, we analyzed the performance in a binary classification task (is the object a true exoplanet?). This is an extrinsic assessment of the quality of the representation since the model did not have learning as a specific objective (task-free). The classification was made using an MLP model built over the representation with 128 units and relu activation, ending with a sigmoid classification layer (one-unit) as output. The dataset labels were used to carry out this evaluation, with 2281 exoplanets and 3976 non-exoplanets (1797 of the KOIs are unlabeled, i.e., candidates). The network is trained in 70% of the labeled data with 30% remaining as a test set. The best parameters found over 200 epochs are stored (with 128 batch size). As the dataset contains a larger number of non-exoplanet objects (in an unbalanced scenario), an F 1 score macro averaged (F 1 -Ma) criterion was used to assess the results, where a high value means high precision and recall simultaneously on both classes.

Model Implementation
Following the RAE t model [10], the recurrent layers of our models implement GRU over the widely known long short-term memory (LSTM, [58]) as it presents roughly the same performance [59], but has fewer parameters and needs to store less information per time (i.e., it is faster). The encoder and decoder recurrent blocks of the models (i.e., E 1 (·) and g 1 (·) on Algorithms 1 and 2) are a stack of two bidirectional RNN layers of 64 units in order to increase representation complexity as [10] presented. The models were trained for 300 epochs with a batch size of 64 and an Adam optimizer [60]. The implementation was performed using the Keras library (https://keras.io, accessed on 10 July 2021).

Results
The following results were obtained at the Chilean Virtual Observatory (ChiVO) datacenter [61] on an Intel Xeon CPU E5-2680 2.50 GHz with 12 cores and 64 GB of RAM.
For the experiments, we set T = 300 and the dimension of the encoder representation D to 16, generating 95% compression. The results correspond to just one sample set. All the implementations are available in our repository (Github, accessed on 12 October 2021).
3.3.1. Is the Time Needed? Table 1 shows the results on the behavior of the model when using the time information. The first thing to mention is that the learning models, on average, manage to reconstruct the data without the time channel. However, including the delta time channel as input to both sub-architectures (encoder and decoder) helps the methods to regularize the reconstruction. This phenomenon can be seen by the RMSE and MAE metrics since they report an increase when using time, while the Spectral-H indicates that the models leave more noise in the residual reconstructing non-uniform patterns of the original time series. In that sense, the denoising results report that the consecutive values of the reconstructed time series are more auto-correlated and with smoother transitions. All this indicates that the model does not focus on exactly imitating each measurement but generates smoother light curves with more residual noise. This effect is stronger on the RAE t model, since the variational loss on VRAE t already forces some regularizations. Table 1. Reconstruction and denoising results with the associated metrics. In both, compared methods are shown if the time information is used (as delta intervals δ). ↑ symbol means that a higher value is better on that task, while ↓ means that a lower value is better.

Reconstruction
Denoising

Quality Evaluation
The reconstruction and smoothing scores for different methods and configurations are presented in Table 2. As expected, passband methods offer a very good denoising behavior, but fail to properly reconstruct the signal. For example, if we vary the filter on the passband, by specifically narrowing the pass frequency, the reconstruction error increases and becomes smother (high Auto-C and low Diff-M). The same effect occurs for the moving average method when we increase the window size. Compared to the passband, the moving average method presents smaller reconstruction errors but producing a rougher time series (low Auto-C). The Mandel-Agol fit has a smaller reconstruction error compared to the passband and moving average methods, which is because it performs a specific and specialized problem fit (transit light curves). The simulation function does not consider factors of smoothness, as it models the ideal behavior of the transit objects. Nevertheless, the auto-correlation is still strong as in the previous methods, with the smallest difference on consecutive values.
It is clear that the learning-based methods (auto-encoders), due to data learning, have the lowest reconstruction errors only comparable with a moving average with a window size of 3. Furthermore, the reconstructed data are still significantly smoother than those of the original time series (denoising effect). Between them, variational proposals have a greater auto-correlation (i.e. smoother) than their deterministic counterpart RAE t . Except for the Mandel-Agol fit and the narrowest passband, they have the smallest difference on consecutive values (Diff-M), showing smoother transitions in the local behavior of the time series. Table 2. Reconstruction validation. Reconstruction and denoising results with the associated metrics on different methods. The configuration of the passband corresponds to low and high-pass filters, respectively, while for moving average corresponds to window size. For the Mandel-Agol it indicates the library used for the fit, while deep auto-encoder methods have the dimension of the encoder representation. The ↑/↓ symbols mean that a higher/lower value is better on that task respectively. If we take a look at the spectral entropy of the residual, we clearly identify that learning-based methods exhibit the best scores. This means that these methods really focus on learning the time series patterns to perform the reconstruction instead of removing structure for achieving high smoothness as in the generic denoising methods presented. Please note that the spectral entropy has a reference value of 0.840 (i.e., on the original light curve) and its maximum value is 1.000 (i.e., a uniform distribution of frequencies). Figure 5 shows a few examples of reconstructed light curves for a better understanding of the previous results. The RAE t approach tends to keep the noise on the edges of the time series (in order to perform a good reconstruction), while our proposal (S-VRAE t ) generates a smoother transit to the extent of being comparable with the M-A transit model, removing more high-frequency noises, as can be seen in the figure. The VRAE t model is not shown because visually it is quite similar to S-VRAE t . On the other hand, the Mandel-Agol model residuals show that the reconstruction errors are related to the positive values, as the function models that the measured light could not be greater than the regular/normal star's light intensity (without objects in front of the observer), i.e., it is limited by the model itself.

Reconstruction
In Table 3, we present the disentanglement of the representation based on the feature dependence. Here, it can be seen that learning-based methods (AEs) obtain features with less information duplicity than the metadata features obtained by Kepler, i.e., the learned features are more disentangled. Minimizing the correlation and mutual information could simplify the task of identifying independent linear and non-linear processes in the system, respectively.
We computed an F-PCA representation from the whole (unfolded) raw measurements and from the folded representation. Both obtained components were linearly independent between them based on the orthogonality of the PCA formulation. However, the F-PCA method over the fold representation produces less MI, as an example of optimal independence for the problem. The closer values to F-PCA (in the folded curve) are from the proposed model with re-scaling (S-VRAE t ) showing the good impact of including the F-std into the model. It is worth noting that the variational proposed models (VRAE t and S-VRAE t ) learn a space with more independent components (linearly and based on mutual information) than the deterministic counterpart RAE t .

Application of the Learned Representation
An extrinsic way to assess the quality of a representation is to impose a task: in this case, classifying candidate light curves as true exoplanets (Confirmed) or another variability (False Positive). Table 4 presents the results of this binary classification. Here, we compare different representations, including the auto-encoders and supervised methods from the globalfolded light curve shape (with T = 300 bins). The "1D CNN" is a supervised method that extracts features based on convolutions, while "RNN t " does it based on recurrence, both with the classification layer mentioned in Section 3.2.3 on top. The 1D CNN network architecture is inspired by [6], until the third convolutional block using only the global view on this KOI dataset (a subset of the TCE). While the RNN t architecture is the recurrent encoder of RAE t /VRAE t without the decoder phase. In Table 4, it can be seen that the representation of the variational extension (VRAE t ) gives an improvement on all the classification metrics with respect to the RAE t representation, as well as being better than the other unsupervised methods. This can be explained due to the generalization capacity of the implemented latent variables within the variational model. Furthermore, the results of the S-VRAE t suggest that the method succeeds in using the Flux standard deviation of the light curve, as it is able to codify better patterns in the representation to distinguish in the classification task. Indeed, the S-VRAE t representation has the best classification performance among unsupervised techniques on all the metrics, being quite close to the supervised ones. This indicates that, without access to labels, we are obtaining a representation almost as good as the supervised method on that task and at the same time, useful in other factors.

Discussion
In Section 3.3.1, we show that the time is needed if a regularization effect and smoothing is desired. We briefly discuss the reasoning behind why time could contribute to phase-space light curves. As [10] introduced, the time information is not only used as input in the first stage (encoder process), as the decoder process needs the space reference where to reconstructs it. Each light curve has its own phase-space based on its period, then the delta times of the bins for different light curves will be different. Even if the delta times are similar for some light curves, the decoder does not know that information. The model will have to learn to reconstruct in each phase-space, since the decoder processes the previous time and its memory with a recurring operation. By having the same vectorẑ repeated at each time-step in the first decoder layer (Figure 3), the temporal information of each point for the different light curves could be beneficial in the reconstruction. Section 3.3.2 shows that the proposed methods offer a good combination of reconstruction and denoising of the light curves. In addition, adding the re-scaling into the model (S-VRAE t ) seems to produce a smoother reconstructed time series but with the cost of increasing the reconstruction error, but both results are quite similar. The disadvantage in reconstruction is expected because the model needs to learn to simultaneously codify and reconstruct the F-std of the data. These results indicate that the representation learned by the variational proposals allow a smoother version of the light curve reconstruction that is the most similar (except for RAE t ) to the original curves, these being the methods that perform the best removal of unstructured patterns (noise) on the light curves among all the methods. While the results in Table 3 illustrate that more exclusive information could be stored through the learned variational features of the proposed models (as every feature represent a more different structure) and used for different proposes, as shown in the following. In addition, as shown in [36], a more disentangled representation indicates that the features found are closer to the underlying factors that explain the data behavior.
The unsupervised feature extraction methods are useful based on the classification improvement over the raw global-folded representation, as shown in Section 3.3.3. This suggests that it is easier to discriminate using these compact representations than the raw folded measurements (95% compression). In addition, the 16-dimensional representation is computationally lighter, taking a fraction of the time to train the supervised network (i.e., 10%). However, supervised methods that train an internal representation explicitly specialized for the task have better classification results over all unsupervised techniques. This is expected if we take a look at the input mapping for the desired output y with a classification model C(·; w) parameterized by w. For example, (considering that the input x should be ( x, δ)), the VRAE t will estimateŷ = C( f ( x; φ); w) by learning w with φ fixed, since the representation µ = f ( x; φ) is already learned (in the unsupervised auto-encoder objective without knowing y). However, the supervised version of this method (RNN t ) will estimateŷ by learning φ as well (in the supervised objective knowing y). Despite this, it is worth noting that our proposed S-VRAE t achieves a performance quite close to the maximum model capacity for this specific task (RNN t ) by learning φ without labels (y). In this case, we refer to the capacity to extract the information from the input x, and it would be given by the structure of f ( x; φ), e.g., a linear function or not, in the case of a neural network, the architecture, or how φ is organized within the network.

Conclusions
In this work, we used variational (stochastic) auto-encoder models to learn quality deep representations for global-folded transit light curves. We focused on adapting the variational auto-encoders to properly include the time information as delta times outperformed all of their deterministic competitors. We presented two methods, one of which included a re-scaling pre-processing of time series (as an end-to-end architecture) which led to significant improvements on different quality evaluation schemes that we proposed.
The evaluation of the learned representation showed that the variational proposals obtain a better quality. This means that the representation is: (i) more informative, i.e., more independent features, so more information could be stored; (ii) more useful, i.e., more effective for the classification of exoplanets; and (iii) more robust, i.e., learn a smoother light curve reconstruction. By adding the re-scaling into the model, these three effects increase. For example, the S-VRAE t model ends up being almost as informative as the optimal PCA, at the same time that it produces a denoising effect on the reconstruction that is visually comparable to a Mandel-Agol fitted model. Furthermore, the model achieves a classification performance quite close to its maximum model capacity.

Future Remarks
Future work includes performing a sensitivity analysis on the architecture parameters. Furthermore, we believe that interpreting these informative, useful, and robust features by mapping each dimension to an astronomical concept could enrich current knowledge about orbiting objects. Furthermore, this could be a useful tool for characterizing exoplanets and identifying non-conventional planetary properties, such as magnetic bow shocks, exomoons, or exorings. Another exciting line of study considers the use of the S-VRAE t model to help elucidate which characteristics of a transit model are most relevant. With this, we could generate new synthetic transit light curves and face other learning tasks such as classification problems.