Unveiling Hidden Insights in Gas Chromatography Data Analysis with Generative Adversarial Networks

The gas chromatography analysis method enables the components of a chemical mixture to be distinguished precisely. This paper presents a technique for augmenting time-series data of chemicals measured by gas chromatography instruments with artificial intelligence techniques such as generative adversarial networks (GAN). We propose a novel GAN algorithm for gas chromatography data called GCGAN, a unified model of an autoencoder (AE) and a GAN that learns time-series data effectively with an attention mechanism. The proposed GCGAN uses the AE to learn from a limited amount of data more effectively. We also build the layers of a high-performance generative adversarial network based on an analysis of the features of data measured by gas chromatography instruments. Based on the proposed learning, GCGAN synthesizes the features embedded in the gas chromatography data into a feature distribution that captures the temporal variability of the data. We have fully implemented the proposed GCGAN and experimentally verified that the data augmented by GCGAN retain the characteristic properties of the original gas chromatography data. The augmented data are of high quality, with the Pearson correlation coefficient, Spearman correlation coefficient, and cosine similarity all exceeding 0.9, and they improve the performance of AI classification models by 40%. This research can also be applied to small-dataset domains other than gas chromatography, where data samples are limited and difficult to obtain.


Introduction
Chemical weapons pose a significant threat to global security and have been used in numerous conflicts around the world [1]. Chemical weapons comprise microbiological and biological toxins of various types, such as nerve agents and blister agents [2]. Their use has resulted in devastating consequences, including death, injury, and long-term health effects. The development of effective countermeasures and protective measures against chemical weapons is a critical area of research that requires accurate and efficient analysis of chemical data [3]. Traditional chemical research methods, such as laboratory experiments and manual analysis, are limited in efficiency, accuracy, and scalability, especially when dealing with large datasets.
Among the chemical analysis techniques that address this problem, gas chromatography is of particular importance because it can separate and accurately detect compounds within complex mixtures. It is therefore indispensable for observing subtle changes in chemical properties, which is crucial for identifying and mitigating the threats posed by chemical weapons. This technique provides not only accuracy but also the detailed differentiation necessary for detecting harmful substances [4].
Artificial intelligence is being developed and studied in various fields such as detection, identification, and optimization [5,6,7]. In particular, recent advances in artificial intelligence and machine learning offer significant potential to improve chemical research by augmenting data for hard-to-test chemical experiments [8,9]. Artificial intelligence has been widely applied to chemical analysis fields such as gas chromatography, with promising results [10]. However, chemical data are complex and structured, the limited quantity of experimental data is often insufficient for training AI models, and AI models may have difficulty capturing the complex distributions of chemical properties [11].
Generative adversarial networks (GAN) are a representative artificial intelligence technique for the limited-data problem described above [12]. A GAN is a type of neural network that consists of two parts: a generator and a discriminator. The generator produces synthetic data designed to look like real data, while the discriminator tries to distinguish between real and synthetic data. Through an iterative process, the generator learns to produce synthetic data that are increasingly difficult for the discriminator to distinguish from real data.
Training a GAN requires a large amount of training data. Data generation with GAN is currently dominated by research on 2D data such as images, while text generation models such as ChatGPT, which are currently in the spotlight, also learn from giant corpora [13]. In this paper, we propose and implement gas chromatography GAN (GCGAN), which is specifically designed to augment real chemical data. By enhancing chemical data through GCGAN, we aim to improve the accuracy and efficiency of chemical data analysis research, in addition to contributing to the development of effective countermeasures against chemical weapons.
Contributions of this paper: (1) Novel attention mechanism. Recently, with the development of transformer models such as ChatGPT, many attention mechanisms have been studied [14]. We propose a novel attention mechanism, the truncated attention mechanism, to improve the performance of deep learning algorithms on gas chromatography data. Because of the characteristics of gas chromatography data, existing deep learning techniques face severe limitations in training. The truncated attention mechanism, however, adequately learns the large peaks at the retention time and the small peaks over the rest of the time range, which are the defining properties of gas chromatography data. This suggests that the truncated attention mechanism can be applied widely to data with similar properties, in addition to chemical data such as gas chromatography.
Contributions of this paper: (2) High-performance GAN architecture. In this paper, we design and fully implement a high-performance GAN architecture that generates chemical data using the truncated attention mechanism. GAN have so far been studied mainly to augment visual objects such as images. Because many GAN studies focus on performance improvement, they often suffer from long training times or training limitations such as computational cost [15]. Our proposed GCGAN therefore aims at both efficiency and high performance to address these limitations. GCGAN is a fusion structure that uses an autoencoder (AE) in addition to the truncated attention mechanism. This provides several advantages for training on chemical substances. First, the training of the generator is guided efficiently by reusing the AE model that has learned useful features of the data. In addition, transfer learning of the discriminator from the AE improves the ability of the discriminator, allowing more accurate classification of original and synthetic data. GCGAN is thus a generative model structure that can not only generate high-quality simulated chemical data but also learn the data efficiently.
Overall, our research is motivated by the need to overcome the limitations of traditional chemical research methods and to leverage the potential of AI and ML for improving chemical research. Our proposed novel attention mechanism and GAN structure offer a promising approach for augmenting chemical data and advancing chemical research.
The remainder of this paper is organized as follows. In Section 2, we review the existing literature on AI and ML in chemical research and current attention mechanisms and GAN structures. In Section 3, we propose new attention mechanisms and GAN structures for chemical data. In Section 4, we present the results of our performance evaluation experiments using actual chemical data. Finally, in Section 5, we discuss the implications of our research and potential directions for future research.

Gas Chromatography Analysis
Gas chromatography is a widely used analytical technique in chemistry, especially in the field of organic chemistry. Gas chromatography separates and analyzes the components of a mixture based on their physical and chemical properties. In gas chromatography, the sample is vaporized and injected into a chromatographic column, where it interacts with a stationary phase. The components of the mixture separate on the basis of their interactions with the stationary phase and emerge from the column at different times, which are recorded as peaks on a chromatogram.
The resulting data from gas chromatography form a time series of signals, with each signal representing the concentration of a particular compound over time. The data are typically noisy, with fluctuations in the signal due to variations in experimental conditions such as temperature and flow rate. Individual compounds can nevertheless be identified and quantified accurately from their peaks [16]. Additionally, the data can be highly dimensional, with hundreds or thousands of signals collected for a single sample. In gas chromatography, the retention time refers to the time it takes for a compound to travel through the chromatographic column and reach the detector. Retention time is an important parameter for identifying compounds, as it is influenced by the properties of the compound and the chromatographic conditions. The retention time of a compound can be used as a fingerprint for identification, with different compounds exhibiting characteristic retention times. However, an accurate determination of the retention time can be challenging due to the presence of noise and other interfering compounds in the gas chromatography data.
Augmenting data with GAN can improve the performance of artificial intelligence models in several ways [17]. In this work, we generate synthetic data to increase the size and diversity of gas chromatography datasets and thereby improve the performance of artificial intelligence models.
In addition, the generated gas chromatography data can be used to filter noise or artifacts from the original gas chromatography data, resulting in cleaner and more reliable data for artificial intelligence models. The gas chromatography data generated in this way can help artificial intelligence models learn more meaningful features that are more useful for classification or prediction.
Additionally, the use of artificial intelligence, particularly attention mechanisms, can help accurately identify and quantify compounds based on retention time. Attention mechanisms can be applied to gas chromatography data to highlight the specific signals corresponding to the retention times of interest. This can aid in the identification and quantification of compounds, particularly in cases where the peaks overlap or the data are noisy. By focusing on the relevant signals, attention mechanisms can help to improve the accuracy and efficiency of gas chromatography data analysis.
Overall, the characteristics of gas chromatography data present significant challenges for traditional analytical tasks such as identification. Therefore, the use of artificial intelligence, specifically generative adversarial networks and the attention mechanism, offers the potential to enhance the accuracy and efficiency of the analysis of gas chromatography data.
The integration of artificial intelligence methodologies into gas chromatography analysis is increasingly prominent, as shown by the recent studies compared in Table 1. These studies predominantly use machine learning techniques, such as convolutional neural networks (CNN) and long short-term memory networks, for tasks including peak classification, species authentication, and prediction of chromatography retention indices [9,18,19,20,21]. Each of these efforts has contributed significantly to enhancing the precision and efficiency of chemical analysis. Moreover, the transformer structure, which is the basis for the LLM generation models represented by ChatGPT, is suited to data with a Zipfian distribution and burstiness. It is therefore not well suited to chemical data analysis, where values are measured uniformly over time. This characteristic underscores the suitability of the GAN-based model proposed in this paper for GC data augmentation, as supported by our previous research [11,23].
Distinguished from existing approaches, our work introduces GCGAN, leveraging a generative adversarial network enhanced by an innovative attention mechanism and transfer learning to conditionally generate high-quality synthetic GC data.This not only expands the available dataset, particularly beneficial where real samples are scarce, but also pioneers a novel avenue in the AI-chromatography domain by focusing on data generation.

GAN
GAN is a type of deep learning architecture that has gained significant attention because of its ability to generate highly realistic data samples. A GAN consists of two neural networks: a generator (G) and a discriminator (D). The generator creates new data samples from a random noise vector, while the discriminator evaluates samples to determine whether they are real or fake.
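Goodfellow et al. define GAN as a minimax game with value function V(G, D), as shown in Equation (1) [12]:

$$\min_{G} \max_{D} V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$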
One of the key characteristics of GAN is their ability to learn from and mimic the underlying distribution of the training data. This is achieved through an iterative training process in which the generator learns to create increasingly realistic samples while the discriminator becomes more skilled at identifying fake samples. As a result, GAN are capable of generating high-quality synthetic data that are often indistinguishable from the real data.
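As a small numerical illustration of the value function of Goodfellow et al. [12], the following sketch estimates V(G, D) = E[log D(x)] + E[log(1 − D(G(z)))] from discriminator outputs. The function name `gan_value` and the sample probabilities are hypothetical and are not taken from the GCGAN implementation.

```python
import math


def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    eps guards against log(0); it is an illustrative numerical choice.
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well yields a high value.
confident = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# A fully fooled discriminator outputs 0.5 everywhere, giving -log(4).
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator is trained to push this value up, while the generator is trained to push it down toward the fooled case.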
GAN have been widely used for generating images and text, but their potential applications in chemistry are only beginning to be explored [24]. In particular, GAN can be used to augment chemical data, which is important for tasks such as predicting chemical properties, designing new molecules, and identifying potential drug candidates.
One of the main advantages of using GAN for chemical data augmentation is their flexibility and adaptability [25]. GAN can be trained on a wide variety of data types, including images, audio, and text, which makes them well suited for use in chemistry, where data come in various forms. Moreover, GAN can be modified and customized to suit specific data types and applications. For instance, attention mechanisms can be incorporated into GAN architectures to improve their performance in analyzing and generating chemical data.
Another important characteristic of GAN is their ability to generate large quantities of data samples. This is particularly useful in chemistry, where data are often scarce and expensive to obtain. By augmenting the available data, GAN can improve the accuracy and reliability of machine learning models trained on chemical data.
In conclusion, this paper shows that generating high-quality synthetic data can contribute substantially to the development of the chemical field. Furthermore, the ability to generate new chemical properties using GAN can greatly accelerate the drug discovery process and lead to the development of new and more effective drugs. However, further research is needed to develop new attention mechanisms and GAN architectures specifically tailored to chemical data. We propose the design of an appropriate artificial intelligence model in Section 3 and demonstrate its performance in Section 4.

Attention Mechanism
The attention mechanism is a computational model that mimics the human cognitive system by selectively focusing on certain features or regions of input data while filtering out irrelevant information [26]. In the context of chemical data, the attention mechanism is a type of neural network architecture that weights input features selectively according to their relevance to the task at hand [27]. Previously studied attention mechanisms are computed from an attention score, which expresses the importance between input elements, and an attention weight, which expresses how much each input element attends to another element. The attention score is computed by a method such as dot-product or additive attention and indicates the importance or relevance of each element [28]. The attention scores are then usually normalized with a softmax function so that the resulting attention weights sum to 1, and the attention weights determine how much each element of the input sequence contributes to the output representation [14].
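The score-then-softmax procedure used by existing attention mechanisms can be sketched as follows. This is a minimal dot-product variant for a single query. The function names and toy vectors are illustrative assumptions and do not come from the cited works.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys, values):
    """Return (weights, output) for one query over a sequence of keys/values."""
    # Attention score: dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Output: attention-weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
weights, output = dot_product_attention(query, keys, values)
```

The key most aligned with the query receives the largest weight, and the weights sum to 1 by construction.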
In chemical data analysis, the attention mechanism has shown promise in improving the performance of models by allowing for more fine-grained analysis of data [29]. For example, in gas chromatography data, the attention mechanism can be used to focus on specific peaks or retention times that are indicative of particular compounds or classes of compounds [11]. However, studies that incorporate existing attention mechanisms into various generative models for gas chromatography data show no significant performance improvement over models without attention mechanisms [11,30]. Therefore, we argue for a new attention mechanism that focuses on the inherent properties of gas chromatography data. Properties of gas chromatography data that are important for identifying and quantifying compounds include retention time, mass-to-charge ratio (m/z), peak area (PA), and peak height (PH) [31]. In this work, we propose a novel attention mechanism that focuses on the retention time among these properties.
The retention time, which is a characteristic of gas chromatography data, is a measure of the time it takes for a compound to pass through the chromatography column and reach the detector. The characteristic peaks that appear at these retention times are important features of gas chromatography data because they provide information about the chemical composition of the sample.
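To make the notion of retention time concrete, the sketch below reads it off a toy chromatogram as the time of the tallest point. It assumes a single dominant peak and clean data. The function name and the sample values are hypothetical, and real chromatograms generally need proper peak detection and baseline correction.

```python
def find_retention_time(times, signal):
    """Return (time, height) of the tallest point in a chromatogram.

    Assumes one dominant peak; this is an illustrative simplification.
    """
    if len(times) != len(signal) or not signal:
        raise ValueError("times and signal must be non-empty and equal in length")
    peak_index = max(range(len(signal)), key=lambda i: signal[i])
    return times[peak_index], signal[peak_index]


# Toy chromatogram: a flat, low baseline with one large peak at t = 1.5.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
signal = [0.02, 0.03, 0.10, 4.80, 0.12, 0.02]
rt, height = find_retention_time(times, signal)
```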
Our proposed attention mechanism emphasizes the retention time feature in the chemical data and enables the generative model to produce more accurate and meaningful chemical data. The mechanism selectively focuses on retention time and its relationship with other chemical features, allowing the model to learn the complex dependencies and correlations that exist between these features. Therefore, the attention mechanism enables the generative model to produce more diverse and representative chemical data, which can be used to augment existing datasets and enhance the performance of chemical analysis and identification tasks.

Design Principle and Architecture
In this section, we propose an optimal preprocessing algorithm for gas chromatography data and a truncated attention mechanism that efficiently learns gas chromatography data. Furthermore, we propose GCGAN, shown in Figure 1, a high-performance generative model that conditionally generates gas chromatography data by applying several deep learning techniques, including transfer learning that reflects real chemical acquisition scenarios.

Preprocessing Algorithm of GCGAN
We propose a preprocessing technique that allows gas chromatography data in time-series format to be used appropriately in deep learning models such as the GCGAN proposed in this paper. The proposed preprocessing technique consists of sampling and robust scaling processes.
First, we sample the complex gas chromatography data, which consist of a very large number of time steps, with a systematic sampling method that extracts samples from the population according to a constant rule [32]. Apart from the large peak at the retention time, gas chromatography data are evenly distributed over most of the time range, so the systematic sampling method is suitable, as shown in Figure 1. Ensuring that the selected samples are evenly distributed across the population can help reduce the risk of overfitting to the original gas chromatography data during learning.
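A minimal sketch of systematic sampling, assuming the constant rule is "keep every k-th time step". The step size, the starting offset, and the toy signal are illustrative assumptions. The paper does not state the values used.

```python
def systematic_sample(signal, step, start=0):
    """Keep every `step`-th point of a sequence, beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return signal[start::step]


# Toy chromatogram: flat baseline of 20 points with one peak at index 10.
signal = [0.1] * 20
signal[10] = 5.0

aligned = systematic_sample(signal, step=5)             # indices 0, 5, 10, 15
misaligned = systematic_sample(signal, step=5, start=1)  # indices 1, 6, 11, 16
```

In this toy case the second call skips the single-point peak, so in practice the step would need to be small relative to the peak width for the retention-time peak to survive sampling.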
Additionally, we apply robust scaling to the sampled data so that deep learning models can properly learn the outliers at the retention time of gas chromatography data. The robust scaler is a data normalization technique that scales the data based on the median and interquartile range (IQR) instead of the mean and standard deviation. The robust scaler used in this study is $x_{\mathrm{scaled}} = \dfrac{x - \mathrm{median}(X)}{\mathrm{IQR}(X)}$, where x is the original value, median(X) is the median of the feature, and IQR(X) is the interquartile range of the feature. The interquartile range is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data [33]. Gas chromatography data often contain extreme values, which can be outliers and may not follow a normal distribution, so robust scaling can help normalize the data and make them more suitable for analysis with machine learning algorithms [34]. Furthermore, robust scaling is less affected by outliers than other scaling methods such as min-max scaling, which can make it more effective in preserving the information in the data.
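The robust scaling step can be sketched with the Python standard library as follows. The quartile convention (`method="inclusive"`, which interpolates linearly between data points) is an assumption, since the paper does not say how Q1 and Q3 are computed, and the sample values are illustrative.

```python
import statistics


def robust_scale(values):
    """Scale values as (x - median) / IQR, where IQR = Q3 - Q1."""
    med = statistics.median(values)
    # Quartiles with linear interpolation between data points (assumed convention).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined for this data")
    return [(v - med) / iqr for v in values]


# Four baseline values and one extreme value standing in for a peak.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(values)
# median = 3, Q1 = 2, Q3 = 4, IQR = 2  ->  [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The extreme value stays large after scaling, while the median and IQR that define the scale are barely affected by it. A min-max scaler applied to the same data would instead compress the four baseline values into a narrow band near 0.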

Truncated Attention Mechanism
The truncated attention mechanism is designed to capture and learn the unique characteristics of gas chromatography data, which consist of large peaks at the retention time and small peaks over the rest of the time range. The mechanism focuses on regions where the slope of the data changes rapidly, indicating the presence of peaks. Let x(t) be the input data, where t ∈ [0, T] represents the time step and T is the total number of time steps. The truncated attention mechanism is formulated in Equations (2)-(10). First, we define the slope function s(t) as the absolute difference between adjacent time steps, as shown in Equation (2):

$$s(t) = \left| x(t + h) - x(t) \right| \tag{2}$$

where h is the time step size. Next, as shown in Equation (3), we introduce the attention weight function a(t), which is determined by the slope function s(t) and a threshold value τ:

$$a(t) = \sigma\big(\alpha\,(s(t) - \tau)\big) \tag{3}$$

where σ(·) is the sigmoid function, defined in Equation (4):

$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{4}$$

The sigmoid function is a smooth, continuous function that maps any real-valued input to a value between 0 and 1. The threshold value τ is calculated as the mean of the slope function over the entire time range, as shown in Equation (5):

$$\tau = \frac{1}{T} \int_{0}^{T} s(t)\, dt \tag{5}$$

The scaling factor α controls the steepness of the sigmoid function, determining how quickly the attention weight transitions from 0 to 1 around the threshold. When the slope function s(t) is much larger than the threshold τ, the attention weight a(t) approaches 1, indicating a strong emphasis on the corresponding time step. Conversely, when the slope function is much smaller than the threshold, the attention weight approaches 0, effectively truncating the influence of that time step.

The truncated attention mechanism is applied to the input data x(t) to obtain the attended data $\tilde{x}(t)$, as shown in Equation (6):

$$\tilde{x}(t) = a(t)\, x(t) \tag{6}$$

The attended data $\tilde{x}(t)$ preserve the large peaks while suppressing the small peaks and non-peak regions, effectively focusing the attention of the model on the most informative parts of the data. The truncated attention mechanism can be further generalized to handle multidimensional input data $x(t) \in \mathbb{R}^{n}$, where n is the number of features. In this case, the slope function and attention weight function are applied element-wise to each feature, as shown in Equations (7) and (8):

$$s_i(t) = \left| x_i(t + h) - x_i(t) \right| \tag{7}$$

$$a_i(t) = \sigma\big(\alpha\,(s_i(t) - \tau_i)\big) \tag{8}$$

The threshold value $\tau_i$ is calculated as the mean of the slope function over the entire time range for each feature, as shown in Equation (9):

$$\tau_i = \frac{1}{T} \int_{0}^{T} s_i(t)\, dt \tag{9}$$

The multidimensional attended data $\tilde{x}(t)$ are obtained by element-wise multiplication of the attention weight functions and the input data, as shown in Equation (10):

$$\tilde{x}(t) = a(t) \odot x(t) \tag{10}$$

In contrast to existing attention mechanisms, our proposed truncated attention mechanism does not explicitly compute attention scores from the input sequence. Both approaches aim to focus on the critical parts of the input, but they differ in how attention weights are calculated and applied to input sequences [28]. Existing attention mechanisms derive attention weights by calculating attention scores, whereas the proposed truncated attention mechanism selectively focuses on specific parts of the input based on the gradient and local variations.
By applying the truncated attention mechanism, the model can focus on the regions with large peaks while suppressing the influence of small peaks and non-peak regions. This enables the model to capture the essential characteristics of the gas chromatography data and improve its learning performance.
The truncated attention mechanism provides a new method for processing data with distinct peak patterns, such as gas chromatography data, and has the potential to be applied to other domains in which similar data characteristics are observed. The mathematical formulation presented above provides the basis for understanding and implementing the truncated attention mechanism in a deep learning model for gas chromatography data analysis.
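The one-dimensional case can be sketched in a few lines. This sketch assumes a forward difference for the slope, a discrete mean for the threshold τ, the weight a(t) = σ(α(s(t) − τ)), and element-wise weighting of the input. The value of α, the treatment of the final sample, and the toy signal are illustrative choices that the text does not specify.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def truncated_attention(x, alpha=1.0):
    """Return (weights, attended) for a 1D signal with at least two samples."""
    # Slope: absolute difference between adjacent samples (forward difference).
    slopes = [abs(x[i + 1] - x[i]) for i in range(len(x) - 1)]
    slopes.append(slopes[-1])  # assumption: last sample reuses the previous slope
    # Threshold: mean slope over the whole signal.
    tau = sum(slopes) / len(slopes)
    # Attention weight: near 1 where the slope exceeds tau, near 0 elsewhere.
    weights = [sigmoid(alpha * (s - tau)) for s in slopes]
    attended = [w * v for w, v in zip(weights, x)]
    return weights, attended


# Toy chromatogram: small ripples plus one large peak at index 4.
x = [0.0, 0.1, 0.0, 0.1, 5.0, 0.1, 0.0, 0.1]
weights, attended = truncated_attention(x, alpha=5.0)
```

With a forward difference, the sample just before the peak also receives a weight close to 1 because the slope leading into the peak is large. A centered difference would distribute the emphasis symmetrically around the peak.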

Unified Structure of GCGAN
Various studies have pretrained the generators of generative adversarial networks using autoencoders [35]. However, whereas generally only the latent vector of the autoencoder is used for pretraining a GAN, we propose a unified learning scheme that uses the autoencoder more efficiently. We unify two outputs of the autoencoder with the GAN, as shown in Figure 2, to train properly on gas chromatography data. First, instead of random variables, we use latent vectors holding the data compressed by the encoder as the input of the generator in GCGAN. As a result, the generator of GCGAN starts from a meaningful mapping, and learning proceeds smoothly because the generator produces data from a compressed representation of the sensitive information in the input gas chromatography data. The latent vector of the autoencoder can help reduce the computational cost of the GAN training process, as it allows the model to learn compressed representations of the data [36]. Furthermore, the use of latent vectors of the autoencoder can help overcome the mode collapse problem in GAN training and ensure that the generator captures and generates important features of the input data [37]. The second element we emphasize in GCGAN is the use of transfer learning from the decoder, as shown in Figure 2.
Transfer learning allows an artificial intelligence model to use the knowledge learned from one task for another related task [38]. In our proposed GCGAN, we apply transfer learning to the discriminator network, which distinguishes between real and synthetic data. After a substance is acquired in an actual threat scenario, various organic and chemical reactions may occur over time. Assuming this situation, we can learn various features and potential reactions of chemical data by pretraining the discriminator on chemicals that have reacted with solvents over a long period of time. In addition, the discriminator can quickly adapt to the new task of distinguishing real chemical data from synthetic chemical data without having to start from scratch. As shown in Figure 3, measurements taken 1 week after mixing a chemical with the solvent are generally similar to those taken immediately after mixing, although the impurities partially differ. In the traditional GAN method, generating data for the immediately measured target requires repeated training with random variables as latent vectors [12]. In contrast, our approach learns efficiently by reusing aged data rather than discarding them. By leveraging aged data, the model learns from a wider range of chemical reactions and their temporal progression, which supports the robustness and generalization of the synthetic data generation process. This approach not only preserves valuable data resources but also enhances the ability of the model to handle changes and anomalies in chemical data over time. Overall, the shape of gas chromatography data does not change significantly, as shown in Figure 3, even if a reaction with the solvent occurs over time. Therefore, using transfer learning on the discriminator has the following additional benefits:

• A pretrained discriminator can leverage the knowledge learned from related tasks to provide better discriminative performance, giving a more accurate and robust model.
• Transfer learning can help reduce overfitting, as pretrained models provide better regularization and can prevent the discriminator from memorizing training data.
• Using transfer learning for the discriminator leads to more efficient and effective models, providing better results and reducing the time and resources required for training.

Structure of GCGAN
We construct the encoder of GCGAN as a 1D CNN layer, which allows us to capture the time dependence of gas chromatography data, as shown in Figure 1. This allows the generator of GCGAN to receive data in which the time information of gas chromatography is compressed and to generate samples that preserve the time pattern of the data.
The CNN is often used to process visual data, such as images, in which shape and pattern are important [39]. For the data produced by the preprocessing algorithm described above, GCGAN first uses a CNN to capture the relationships and patterns in data with time-series characteristics [40]. Through this, the 1D CNN extracts important features from the data and generates a latent vector. The output of the 1D CNN is computed as shown in Equation (11):

$$y_{i,k} = b_k + \sum_{j=0}^{F-1} w_{k,j}\, x_{i+j} \tag{11}$$

In this formula, $y_{i,k}$ represents the k-th feature map at position i, $b_k$ is the bias term for the k-th filter, $w_{k,j}$ is the weight for the j-th element of the k-th filter, and $x_{i+j}$ is the input signal value at position i + j. The sum is taken over the filter length F, and the dot product between the filter weights and the input signal values is computed for each position i. When the result of a 1D CNN layer is compressed in an autoencoder, it is flattened into a 1D vector that serves as the latent representation of the input signal. Therefore, given the gas chromatography input signal x and the y of Equation (11), the latent representation z of the input signal can be calculated as z = flatten(y). We use this latent vector z as the input to the generator network of GCGAN in the proposed unified structure. Therefore, the generator function of GCGAN can be expressed as G(flatten(y)). The retention time $t_r$ is incorporated into the latent vector z during the encoding process. This allows z to maintain the time structure of $t_r$ and the generator to generate data that accurately reflect the retention time characteristics. This relationship is denoted by $z = \mathrm{flatten}(y_{t_r})$, where $y_{t_r}$ encodes the retention time information. The discriminator network of GCGAN distinguishes real gas chromatography data x from the synthetic data of the generator. As shown in Figure 2, the output reconstructed by the decoder of the autoencoder model for input gas chromatography data x is $f_{\mathrm{trans}}(x)$. Therefore, the discriminator for input x uses the proposed transfer learning technique, as shown in Equation (12):

$$D(x) = \sigma\big(W_d \cdot f_{\mathrm{trans}}(x)\big) \tag{12}$$

In Equation (12), σ is the sigmoid activation function, $W_d$ is the weight matrix, and · denotes matrix multiplication.
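The 1D convolution and flattening step of the encoder can be sketched as below, assuming unit stride, no padding, no activation function, and filter indices that start at 0. The filter values and the input signal are made up for illustration and are not the trained GCGAN weights.

```python
def conv1d(x, filters, biases):
    """Compute y[i][k] = b_k + sum_j w[k][j] * x[i + j] (valid positions only)."""
    filter_len = len(filters[0])
    positions = len(x) - filter_len + 1
    return [
        [
            biases[k] + sum(filters[k][j] * x[i + j] for j in range(filter_len))
            for k in range(len(filters))
        ]
        for i in range(positions)
    ]


def flatten(y):
    """Flatten the (position, filter) feature maps into one latent vector z."""
    return [value for row in y for value in row]


# Toy signal with one peak, and two hypothetical filters of length F = 3:
# a local sum and a simple slope detector.
x = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0]
filters = [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]]
biases = [0.0, 0.0]

y = conv1d(x, filters, biases)
z = flatten(y)  # latent vector that would be passed to the generator
```

Because each output position depends only on a short window of neighboring samples, the order of the values in z still follows the time axis, which is how the retention-time structure can be carried into the latent vector.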
In GCGAN, our proposed truncated attention mechanism can be applied to the discriminator to improve its ability to distinguish between real and synthetic gas chromatography data. The truncated attention mechanism allows the discriminator to handle data in which most values are small and only certain parts are very large, which matches the properties of gas chromatography data. The discriminator using the truncated attention mechanism is expressed in Equation (13), in which $W_d$ and $b_d$ are the weight matrix and bias term of the fully connected layer in the discriminator, $z_i$ is the i-th feature map of the input data x obtained from the CNN layer, and $\alpha_i$ is the attention weight for the i-th feature map.
Combining Equations (12) and (13), the discriminator model of GCGAN that we propose is given in Equation (14). In GCGAN, G(flatten(y)) is trained to deceive $D(X; W_d)$ by generating data that are difficult to distinguish from real data. The objective of G(flatten(y)) is to maximize the error rate of $D(X; W_d)$, while the objective of $D(X; W_d)$ is to minimize its error rate, as in Equation (1).

Performance Evaluation
In this section, we describe our experiments on augmenting chemical data using GCGAN with the truncated attention mechanism and the transfer learning technique. We implement all of the proposed algorithms and measure their performance on experimentally measured chemical data.

Datasets
We used gas chromatography data of DMMP, DFP, and 2-CEES that we measured experimentally using the Agilent 8890 GC system (G3540A). Each chemical was measured under the experimental conditions shown in Table 2, using the instrument shown in Figure 4c, and we followed the standard experimental protocol [41]. Each substance consists of a pair of measurements: one taken immediately on the day of mixing and one taken after a week of chemical reaction, as shown in Figure 3. A solvent is used to carry each material through the gas chromatography device. We mixed each chemical with the solvents using the ultrasonic equipment shown in Figure 4b.
We experimentally confirmed that impurities such as ethyl dipentyl phosphate and triisopropyl phosphate are produced by the reaction with the solvent over 1 week, as shown in Figure 3b. The time gap between the paired measurements models real situations in which immediate measurement is not possible, for example a chemical terrorism incident in which samples must first be transported. We therefore generate synthetic gas chromatography data by transferring gas chromatography data across this time gap.

Implementation
We used an Intel Dual-Core Xeon CPU @ 2.30 GHz, 32 GB of RAM, and an Nvidia Tesla P100 GPU to train the fully implemented GCGAN and to augment the gas chromatography data. Our GCGAN processes GC data as time series to effectively capture and model retention time properties in chromatography columns. To implement the proposed GCGAN, we set the hyperparameters of the autoencoder part and the GAN part of GCGAN as shown in Table 3. The whole dataset is trained as a single batch, because sampling is performed by the preprocessing algorithm proposed in Section 3. The mean squared error (MSE) loss used in Table 3a measures the mean squared difference between the predicted value and the true value. Here, it is calculated between the output of the autoencoder and the actual chromatography data, so a smaller loss indicates a better reconstruction. This makes it a suitable loss function for the autoencoder, which aims to minimize the reconstruction error between input and output. In contrast, the binary cross-entropy (BCE) loss used in Table 3b measures the discrepancy between the predicted probability distribution and the actual probability distribution. The generator is updated to minimize this discrepancy, which encourages it to generate gas chromatography data similar to the real data.
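The two loss functions named above can be written out in a few lines. This is a minimal sketch in plain Python, not the training code of GCGAN: the function names are hypothetical, and the generator loss shown assumes the common practice of scoring generated samples against the label 1 ("real"), which the text does not state explicitly.

```python
import math
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared difference between a reconstruction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_loss(probs: Sequence[float], labels: Sequence[int],
             eps: float = 1e-12) -> float:
    """Binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Autoencoder side: reconstruction error on a toy trace (hypothetical values).
reconstruction_error = mse_loss([1.0, 2.0], [1.0, 4.0])

# Generator side (assumed labeling): discriminator outputs on generated
# samples are compared against the label 1, so the loss falls as the
# discriminator is fooled more often.
generator_loss = bce_loss([0.5, 0.5], [1, 1])
print(reconstruction_error, generator_loss)
```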
We implement an early stopping algorithm to choose the training length and stopping point of GCGAN and to prevent overfitting. The algorithm monitors the root mean squared error (RMSE) between real and generated synthetic data on both the training and validation datasets and compares it with the best RMSE observed so far. Training stops as soon as no improvement in RMSE is observed for 50 consecutive epochs, according to the predefined patience parameter given in Table 3b.
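The early stopping rule described above can be sketched as a small helper. The class name `EarlyStopping`, the `min_delta` parameter, and the toy RMSE history are assumptions for illustration; the sketch tracks a single monitored RMSE value, whereas the text monitors both training and validation RMSE.

```python
import math
from typing import Sequence


def rmse(real: Sequence[float], generated: Sequence[float]) -> float:
    """Root mean squared error between a real and a generated trace."""
    return math.sqrt(
        sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real)
    )


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 50, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta  # hypothetical tolerance, 0 by default
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, value: float) -> bool:
        """Record one epoch's RMSE; return True when training should stop."""
        if value < self.best - self.min_delta:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a short patience so the behaviour is easy to follow.
stopper = EarlyStopping(patience=3)
history = [1.0, 0.9, 0.95, 0.93, 0.92, 0.5]
stopped_at = None
for epoch, value in enumerate(history):
    if stopper.step(value):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

With the patience of 50 stated in the text, the same object would be constructed as `EarlyStopping(patience=50)` and fed the monitored RMSE once per epoch.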
Consequently, we demonstrate that stable learning is possible with the optimized implementation of the proposed GCGAN on the gas chromatography dataset, as shown in Figure 5. In particular, in the GAN training process in Figure 5b, the loss decreases rapidly in the initial epochs. This indicates that the proposed unified structure of GCGAN and the truncated attention mechanism worked properly and learned efficiently.

Evaluation Metrics
We use the following metrics to compare the gas chromatography data generated by GCGAN with the original data. In addition, we implement a deep learning-based artificial intelligence model to demonstrate the value of the synthetic gas chromatography data in ways that are difficult to capture with existing evaluation metrics.

• Visual inspection: We visually inspect the generated gas chromatography data by plotting the original data and the generated data on the same graph and comparing the retention times and peak values, which are important indicators in gas chromatography, as mentioned in Section 2. Since each chemical has its own retention time, close agreement between the generated and original data in these plots indicates that GCGAN performs well [46].
• Quantitative evaluation: We use the Pearson correlation coefficient (PCC), the Spearman correlation coefficient (SCC), and cosine similarity to quantitatively evaluate the performance of GCGAN. These three metrics are commonly used to measure chromatographic similarity. For an accurate evaluation, we generate 10 synthetic samples for each original sample and use the trace obtained by averaging the peak values of the generated samples at each timestamp. PCC measures the linear correlation between two variables and ranges from −1 to 1, with 1 indicating a perfect positive correlation, 0 indicating no linear correlation, and −1 indicating a perfect negative correlation; it is widely used to evaluate machine learning models, including GANs, in various applications [47]. SCC is a nonparametric measure of the monotonicity of the relationship between two datasets, which can be used to assess the similarity of the original and generated gas chromatography data even if their relationship is not linear [48]. Cosine similarity measures the cosine of the angle between two vectors, providing a value between −1 and 1, with 1 indicating the same direction, 0 indicating no similarity, and −1 indicating opposite directions. By using these three metrics together, we can comprehensively assess the similarity between the original and generated gas chromatography data, ensuring the quality and reliability of the GCGAN-augmented data [49].

• Using deep learning model: To demonstrate the usefulness of the synthetic chemical data, we also build a very basic discriminant model consisting of a single dense layer. This simple model is designed to show the effect of progressively learning from the synthetic data generated by GCGAN. The evaluation uses a deep neural network (DNN) with a fully connected layer. The architecture is derived from a backpropagation neural network (BPNN) model used in research dealing with the complex patterns inherent in chromatographic data, and it is used to evaluate the quality and usefulness of the synthetic data generated by GCGAN [18]. This experiment highlights the effectiveness of GCGAN-generated data in improving the learning ability of the most basic discriminant model. By gradually incorporating synthetic data, we observe a noticeable increase in the accuracy of the model, which emphasizes the value of synthetic data for gas chromatography classification tasks even with rudimentary model architectures. One of the contributions of this study is to improve the performance of artificial intelligence models through high-quality synthetic data generation. Deep learning models generally perform better as the amount of appropriate training data increases [50]. Therefore, we implement a classification model based on fully connected layers and measure its performance as a function of the amount of synthetic training data, in two situations. We generated 1, 15, and 50 synthetic samples for each chemical for the experiments that demonstrate the effectiveness of synthetic data. Each dataset instance, whether real or synthetic, has a uniform shape of (2, 58,500), ensuring consistency in the data representation across all experiments. We used models trained on these datasets to measure the impact of data augmentation on classification performance, using both synthetic and real data to evaluate the ability of the model to classify three chemicals: 2-CEES, DFP, and DMMP. This approach effectively mitigates the risk of discriminator overfitting, which is generally a concern when training GANs with limited datasets [51]. First, we conducted a classification experiment with validation datasets generated using random variables drawn from a normal distribution, based on the gas chromatography data of DMMP, DFP, and 2-CEES that were used to train GCGAN. Second, we performed a more complex classification experiment by enriching the validation dataset of the first experiment with gas chromatography data not used in GCGAN training. For this, we include data of 2-chloroethyl phenyl sulfide (2-CEPS), a chemical not involved in GCGAN training, measured with four solvents: ethanol, methanol, dimethyl carbonate, and tetrahydrofuran. Additionally, we incorporate data from a 1-week reaction of 2-CEES with methanol solvent into the validation dataset, providing further dimensions for model evaluation.
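The three similarity metrics in the quantitative evaluation, together with the per-timestamp averaging of generated samples, can be sketched using only the Python standard library. The function names and toy traces below are assumptions for illustration; Spearman correlation is computed here as the Pearson correlation of average ranks, which is the standard definition when ties are present.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_trace(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average several generated traces timestamp by timestamp."""
    return [sum(column) / len(samples) for column in zip(*samples)]


# Toy example (hypothetical values): one original trace, two generated traces.
original = [0.0, 0.1, 5.0, 0.2, 0.05]
generated = [[0.0, 0.2, 4.8, 0.3, 0.05], [0.1, 0.1, 5.1, 0.2, 0.0]]
averaged = mean_trace(generated)
print(pearson(original, averaged),
      spearman(original, averaged),
      cosine_similarity(original, averaged))
```

For traces that are near zero almost everywhere with a few large peaks, PCC and cosine similarity tend to be dominated by the peaks, while SCC is sensitive to the ordering of the many small baseline values, so the three metrics can diverge noticeably.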
We measured the performance of the deep learning models for chemical classification using confusion-matrix-based accuracy and the area under the receiver operating characteristic curve (AUC). The confusion matrix is a performance evaluation tool for a classification model and consists of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [52]. Accuracy is calculated as (TP + TN)/(TP + TN + FP + FN), the fraction of all classifications that are correct, so a higher value is better. However, accuracy is not reliable when the target classes are imbalanced. We therefore also measure AUC to evaluate the classification model more accurately on various gas chromatography data. AUC is the area under the receiver operating characteristic curve, which plots the TP rate, TP/(TP + FN), against the FP rate, FP/(FP + TN) [53]. As a result, we demonstrated the performance of GCGAN by showing that the more synthetic data are used for training, the better the classification model performs.
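The classification metrics can be made concrete with a short sketch that uses the standard definitions of accuracy, TP rate, and FP rate (FP rate = FP/(FP + TN)). The function names and numbers are illustrative assumptions. The AUC helper handles the binary case through the rank-based (Mann-Whitney) formulation; how the three-class AUC was aggregated in the experiments is not stated in the text, so a one-vs-rest use of this helper would be an additional assumption.

```python
from typing import Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def true_positive_rate(tp: int, fn: int) -> float:
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of actual negatives that were predicted positive."""
    return fp / (fp + tn)


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical confusion-matrix counts and classifier scores.
print(accuracy(tp=40, tn=45, fp=5, fn=10))
print(true_positive_rate(tp=40, fn=10), false_positive_rate(fp=5, tn=45))
print(binary_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```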

Evaluation Results
We successfully generated synthetic data conditionally for each of the three substances using GCGAN, as shown in Figure 6. These results indicate that GCGAN was able to augment the chemical data effectively and capture the important features of the input data. We experimented with training for up to 4000 epochs, as shown in Figure 5b, to find a suitable number of epochs, and confirmed that the loss of GCGAN stabilizes before 1000 epochs. Therefore, we set the hyperparameters as in Table 3. With these settings, the learning process for 2-CEES, DFP, and DMMP proceeds appropriately, as shown in Figure 7.
Visual inspection. Figure 6 compares the synthetic data generated by GCGAN with the original data for three different chemicals: 2-CEES, DFP, and DMMP. The blue line represents the original data for each material and the red line represents the corresponding synthetic data. Visual inspection shows that the retention times and peak values of the generated data agree very well with those of the original data for all three chemicals. This shows the effectiveness of GCGAN in capturing the intrinsic properties of gas chromatography data and generating realistic synthetic samples that can be used for data augmentation. Furthermore, we conducted experiments to demonstrate the effectiveness of the truncated attention mechanism. We trained on 2-CEES with a GAN that has the same structure and hyperparameters as the proposed GCGAN but lacks the truncated attention mechanism, and found that it still struggles to learn at 1000 epochs, as shown in Figure 8. This suggests that the truncated attention mechanism is suitable for time-series data with large deviations, such as gas chromatography data.
Quantitative evaluation. Table 4 shows that the generated synthetic data are similar to the original data in their chemical properties. The table presents the quantitative evaluation results of the synthetic data generated for three different chemicals: 2-CEES, DFP, and DMMP. The PCC values for the three chemicals are extremely high, ranging from 0.9965 to 0.9984, indicating a strong linear correlation between the original and generated data [54]. The SCC values, which measure the monotonic relationship between the datasets, are also relatively high, ranging from 0.8192 to 0.8352, suggesting a strong overall similarity in the rank order of the data points. Finally, the cosine similarity values are identical to the PCC values, further confirming the high degree of similarity between the original and generated data vectors. Overall, these results show the effectiveness of the GCGAN model in generating synthetic gas chromatography data that closely mimic the characteristics of the original data for various chemicals [49]. The high values of PCC, SCC, and cosine similarity demonstrate that the generated data capture the essential features and patterns present in the original data, validating the quality and reliability of the synthetic data generated by GCGAN for augmenting gas chromatography datasets. Additionally, we validate the quality of the synthetic data generated by the proposed model using further chemical analysis measures [55,56]:
• Peak area (PA): The peak area is calculated as the sum of the values under a peak, which represents the concentration of the compound.
• Peak height (PH): The peak height is the maximum value of the peak, which also correlates with the concentration of the compound.
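The peak area and peak height measures defined above can be sketched as follows. The peak window boundaries, the toy traces, and in particular the `similarity_percent` formula (100 × (1 − relative difference)) are assumptions for illustration; the text does not state how the similarity percentages were computed.

```python
from typing import Sequence


def peak_area(signal: Sequence[float], start: int, end: int) -> float:
    """Sum of the signal values inside the peak window [start, end)."""
    return sum(signal[start:end])


def peak_height(signal: Sequence[float], start: int, end: int) -> float:
    """Maximum signal value inside the peak window [start, end)."""
    return max(signal[start:end])


def similarity_percent(original: float, synthetic: float) -> float:
    """Assumed similarity: 100 * (1 - |original - synthetic| / |original|)."""
    return 100.0 * (1.0 - abs(original - synthetic) / abs(original))


# Hypothetical traces with a single peak occupying indices 1..3.
original_trace = [0.0, 1.0, 4.0, 1.0, 0.0]
synthetic_trace = [0.0, 1.0, 3.0, 1.5, 0.0]
window = (1, 4)

pa_similarity = similarity_percent(peak_area(original_trace, *window),
                                   peak_area(synthetic_trace, *window))
ph_similarity = similarity_percent(peak_height(original_trace, *window),
                                   peak_height(synthetic_trace, *window))
print(pa_similarity, ph_similarity)
```

In this toy case the area agrees more closely than the height, because area sums over the whole peak while height depends on a single maximum value.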
The PA and PH values in Table 5 verify the similarity of the generated synthetic data from the perspective of chemical analysis. Specifically, the PA of DMMP, 2-CEES, and DFP show similarities of 98.31%, 97.24%, and 97.10%, respectively, indicating high accuracy in how the synthetic data express concentration. Similarly, the PH of DMMP, 2-CEES, and DFP show similarities of 69.20%, 87.28%, and 97.44%, respectively. Although these results leave room for improvement in the PH of 2-CEES, they show similarities in excess of 90% overall in terms of peak-based chemical analysis. This verifies the effectiveness of the proposed data augmentation method by showing that the synthetic data retain the characteristics of the original data.
Using deep learning model. We demonstrate that synthetic gas chromatography data improve the performance of classification models, as shown in Table 6. Table 6 shows the performance of the classification model trained on 1, 15, and 50 synthetic samples generated by GCGAN for each of the three chemicals. This experiment shows that the ability of the model to identify mixed chemical data improves even with a modest augmentation of 15 synthetic samples per real sample, and improves much more with 50 synthetic samples per real sample.
This is an important contribution to improving the performance of models for the urgent identification of hazardous chemicals, which is the use case considered in this paper.
In the first experiment, using only validation data, the classification model achieved an accuracy of 0.3367 and an AUC of 0.5025 when trained on only 1 pair of training datasets, as shown in Table 6a. This baseline establishes the initial capability of our model to classify gas chromatography data without the aid of augmented data. When the model was subsequently trained with increasing quantities of augmented data, the accuracy and AUC gradually increased, demonstrating the added value of synthetic data. Accuracy and AUC both reached 1 when the model was trained with 50 synthetic samples, indicating that the inclusion of diverse synthetic samples enriches the training set enough to yield perfect classification on this validation set.
In the second experiment, using untrained data, the classification model achieved an accuracy of 0.3067 and an AUC of 0.5485 when trained on only one pair of training datasets, as shown in Table 6b. As in the first experiment, this initial result highlights the challenges faced by the classification model when limited to sparse training data. By training with increasing quantities of augmented data up to 50, we achieved AUC and accuracy of 0.8133 and 0.9357, respectively. This is a significant improvement over the model trained on the original data alone, and it underscores the effectiveness of synthetic data in giving the model a richer, more comprehensive view of the data distribution.
Additionally, we conduct the experiments shown in Figure 9 to demonstrate the effectiveness of using vectors extracted from the encoder of the autoencoder in GCGAN, as shown in Figure 2. To this end, we compare data generated by a truncated attention GAN driven by random variables drawn from a normal distribution, as in conventional GAN training, with data generated by the complete GCGAN structure that we propose. As shown in Figure 9, when the data size is 1, the accuracy of identifying DMMP, DFP, and 2-CEES and assigning the correct label is low in both cases. However, as more synthetic data are generated for training the classification model, the complete GCGAN structure improves performance much more rapidly. This improvement is most pronounced between 15 and 50 datasets, demonstrating the robustness of the generated synthetic data in enriching the training environment.
The results in Figure 9 demonstrate the efficiency of the proposed GCGAN structure and the advantage of using encoder-extracted vectors over conventional random sampling. This implies that GCGAN uses the intrinsic properties of the input data more effectively, which is essential for generating high-fidelity synthetic data. These results indicate that generating data from encoder vectors is effective and suggest that our method can contribute to various AI models used in the chemical field.
Overall, our results demonstrate that augmenting gas chromatography data with synthetic data generated by GCGAN can improve the performance of an artificial intelligence classification model. It is particularly noteworthy that the synthetic data not only complement but also extend the representational diversity of the training set, thereby enhancing the predictive accuracy and generalization capability of the model. Furthermore, the experimental results confirm that GCGAN can produce high-quality synthetic data that closely resemble the original data, validating our approach as a viable method for data augmentation in gas chromatography analysis.

Discussion
Our results demonstrate that GCGAN, with the truncated attention mechanism and the transfer learning technique in a unified model, can effectively augment chemical data and improve the performance of chemical modeling. The truncated attention mechanism and the transfer learning technique improved the ability of the discriminator to distinguish between real and generated data. Furthermore, in our implementation, GCGAN configured without the truncated attention mechanism and transfer learning failed to learn gas chromatography data regardless of how advanced the neural network was. Therefore, the truncated attention mechanism and transfer learning technique in the unified model are suitable for learning time-series data such as gas chromatography data, where the deviation between the value at one time and the value at another is very large.
Our approach has several potential applications in drug discovery and chemical modeling. By augmenting chemical data, we can generate larger datasets for training machine learning models. This can improve the accuracy and reliability of these models and help accelerate the drug discovery process. Our approach can also be used to generate novel compounds that may have potential therapeutic properties [57].
However, our approach has some limitations that need to be addressed in future work. One limitation is that the generated chemical data may not always be chemically realistic. Future research may explore ways to ensure that the generated data are physically and chemically realistic. In addition, the metrics used may not fully capture the complexity of the problem, and future work can explore alternative metrics that are more suitable for the task of generating chemical data. Furthermore, we plan to apply the attention mechanism to other important parameters, such as m/z in GC data, to improve the robustness and applicability of the generative model. We also aim to validate our model on chemicals measured under various time and mixture conditions to ensure comprehensive performance evaluation, along with metrics such as resolution and contrast.

Conclusions
In this study, we present a data augmentation technique based on generative adversarial networks that can be applied to time-series data of chemicals. The algorithm generates simulated data similar to time-series data measured with gas detection equipment such as gas chromatography instruments, and it can also generate simulated data for chemicals whose individual properties change over time.
We summarize the contents and contributions of this study as follows:
• We developed a novel attention mechanism that attends to the specific critical portions of gas chromatography data.
• We designed GCGAN with transfer learning for scenarios that take into account the actual chemical acquisition time, and we demonstrated its performance by fully implementing it.
• We evaluated GCGAN using gas chromatography data that we acquired directly by experiment, rather than open-source or simulated data.
To improve on the proposed study, we plan future studies as follows:
• We have implemented a classification model consisting of deep learning layers to evaluate the quality of the synthetic data generated by GCGAN. Further work could lead to a more advanced gas chromatography data classification model using a novel attention mechanism or neural network layers, which is much needed in the field of chemical analysis.
• Future research should explore how to visualize and analyze the generated compounds to better understand the performance of the model.
The simulated data generated in this way can support research that develops toxic chemical detection algorithms and improves their performance by identifying singularities and learning patterns. In the future, we plan to increase the diversity of the generated simulated data by applying various statistical techniques to the actual data in the preprocessing step.

Figure 3. Measured gas chromatography data of various chemicals. Panels (a,b) show that various impurities were generated through chemical reactions, in addition to the peak values at the retention time. (a) Data measured immediately after mixing the chemical with the solvent. (b) Data measured 1 week after mixing the chemical with the solvent.

Figure 4. Experimental acquisition process of gas chromatography (GC) data. (a) Process of combining chemicals and solvents. (b) Process of mixing chemicals and solvents. (c) Analysis with GC equipment.

Figure 5. Loss of GCGAN in the training process. The blue line represents training loss, and the orange line represents validation loss. (a) Loss of the autoencoder in the training process. (b) Loss of the GAN in the training process.

Figure 6. Comparison of the synthetic data with the original data. The blue line represents the original data for each material and the red line is the synthetic data. In each graph, the horizontal axis represents time (minutes) and the vertical axis represents the peak value. (a) Comparison of the synthetic data with the original data for 2-CEES. (b) Comparison of the synthetic data with the original data for DFP. (c) Comparison of the synthetic data with the original data for DMMP.

Figure 7. Process of gas chromatography data generation. The blue line is the original data of each material, and the red line is the synthetic data generated during the training process. In each graph, the horizontal axis represents time (minutes), and the vertical axis represents the peak value. (a) Training of 2-CEES at 100 epochs. (b) Training of 2-CEES at 200 epochs. (c) Training of 2-CEES at 800 epochs. (d) Training of DFP at 100 epochs. (e) Training of DFP at 200 epochs. (f) Training of DFP at 800 epochs. (g) Training of DMMP at 100 epochs. (h) Training of DMMP at 200 epochs. (i) Training of DMMP at 800 epochs.

Figure 8. Results of training GCGAN without the truncated attention mechanism. In each graph, the horizontal axis represents time (minutes), and the vertical axis represents the peak value. (a) Training of 2-CEES at 100 epochs without the truncated attention mechanism. (b) Training of 2-CEES at 500 epochs without the truncated attention mechanism. (c) Result of 2-CEES at 1000 epochs without the truncated attention mechanism.

Figure 9. Effectiveness of using encoder vectors in GCGAN. The blue line is the method using a vector extracted from the encoder we propose, and the orange line is the method using a normal distribution-based vector instead of an encoder vector.

Table 1. Comparison of artificial intelligence studies on chromatography data.

Table 4. Quantitative evaluation of generated synthetic data.

Table 5. Chemical analysis similarity of the generated synthetic data.

Table 6. Experiments on performance improvement of the classification model. (a) Experimental results of the classification model for the validation dataset consisting of DMMP, DFP, and 2-CEES. (b) Experimental results of the classification model for the dataset with 2-CEPS under various solvents and various time conditions.