Comparison of the Diagnostic Performance of Deep Learning Algorithms for Reducing the Time Required for COVID-19 RT–PCR Testing

(1) Background: Rapid and accurate negative discrimination enables efficient management of scarce isolation bed resources and adequate patient accommodation in the many areas experiencing an explosion of confirmed cases due to the Omicron variant. Until now, artificial intelligence or deep learning methods intended to replace time-consuming RT-PCR have relied on CXR, chest CT, blood test results, or clinical information. (2) Methods: We proposed and compared five types of deep learning algorithms (RNN, LSTM, Bi-LSTM, GRU, and Transformer) for reducing the time required for RT-PCR diagnosis by learning the change in fluorescence values derived over time during the RT-PCR process. (3) Results: Among the five deep learning algorithms capable of training on time series data, Bi-LSTM and GRU were shown to be able to decrease the time required for RT-PCR diagnosis by half or to a quarter without significantly impairing the diagnostic performance of the COVID-19 RT-PCR test. (4) Conclusions: Compared with the 40 cycles of RT-PCR currently used for diagnosis, the diagnostic performance of the models developed in this study shows the possibility of nearly halving the time required for RT-PCR diagnosis.


Introduction
Until the emergence of the Omicron variant, most nations tried to implement strategies to quickly find positive cases, isolate them, initiate early treatment, and precisely identify negative patients in order to minimize the spread of infection and to avoid progression to a critical disease condition.
Nevertheless, with the emergence of the Omicron variant, SARS-CoV-2 infection, which began in the winter of 2019 and has spread throughout the world, entered a new phase. The emergence of Omicron has resulted in a rapid increase in the number of confirmed cases, and as a result of this rising trend, many countries around the world have been exposed to a situation that is difficult to manage with the current medical capabilities associated with COVID-19. In this situation, the previously employed strategy of rapid isolation of patients, confirmation of the patient's diagnosis and treatment, and tracking of the patient's contacts is no longer feasible. Although RT-PCR is still considered the gold standard for the confirmation of a COVID-19 diagnosis because of its high diagnostic accuracy, this test is labour-intensive and takes a longer time than the rapid antigen test (RAT).

Study Participants
We enrolled patients who visited a specialized outpatient department for COVID-19 or who visited an emergency department for possible COVID-19 between 23 November 2020 and 25 September 2021. During this period, a total of 27,835 raw RT-PCR curve data points were obtained to identify cases of COVID-19. For this research, 1270 positive and 1270 negative test results were chosen from among these cases.
This study was approved by the Institutional Review Committee (HKS 2020-07-007) of Hallym University Kangnam Sacred Heart Hospital in Korea; the requirement for informed consent was waived because the subjects' data were anonymized. This study was conducted in accordance with the STARD guidelines and regulations for a study related to the diagnostic accuracy of COVID-19 RT-PCR.

Materials
A MagNa Pure 96 System was used to extract RNA from the samples (Roche Diagnostics, Rotkreuz, Switzerland). A STANDARD M nCoV Real-Time Detection kit (SD Biosensor, Gyeonggi, Republic of Korea) was utilized in this investigation, and a Bio-Rad CFX96 analyser (Bio-Rad Laboratories, Inc., Hercules, CA, USA) was used for the RT-PCR test (Figure 1).




Data Description
The RT-PCR results from nasopharyngeal swab specimens of the patients who received SARS-CoV-2 RT-PCR testing at Kangnam Sacred Heart Hospital were included in the raw data. The fluorescence values that were measured for a total of 40 cycles via the RT-PCR test were recorded for each patient sample, and the raw data comprised the fluorescence values that were produced during the RT-PCR testing for SARS-CoV-2.
As a result, for each sample, 40 fluorescence values were measured across 40 cycles, yielding a total of 2540 raw data points. The fluorescence readings were recorded in 40 columns for each sample across a total of 2540 rows in the raw data. There were 1270 verified positive and 1270 verified negative test results in total.
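As an illustration of this layout, a hypothetical stand-in for the raw matrix (random values in place of the real fluorescence curves) can be arranged in the shape that recurrent models expect; the variable names and random data below are assumptions, not the study's actual preprocessing code.

```python
import numpy as np

# Hypothetical illustration of the data layout described above:
# 2540 samples (1270 positive + 1270 negative), each with 40
# fluorescence readings, one per RT-PCR cycle.
rng = np.random.default_rng(0)
n_pos, n_neg, n_cycles = 1270, 1270, 40

# Placeholder curves; the real values come from the CFX96 analyser.
fluorescence = rng.random((n_pos + n_neg, n_cycles))
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

# Recurrent models take inputs shaped (samples, timesteps, features).
X = fluorescence.reshape(-1, n_cycles, 1)
print(X.shape)   # (2540, 40, 1)
```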

Development of the DL Model
The output variable for training the models was the RT-PCR findings (positive or negative). A total of 40 models were produced and evaluated, starting with the model trained with only the fluorescence value of the first RT-PCR cycle, and the last model was the model trained with the fluorescence value of all 40 RT-PCR cycles.
The first model, for example, was trained using the fluorescence value from the first RT-PCR cycle, whereas the second model was trained using the fluorescence values from the first and the second RT-PCR cycles. Similarly, the fluorescence values from the first to the 39th RT-PCR cycle were used to train the 39th model, and the fluorescence values from the first to the 40th RT-PCR cycle were used to train the 40th model.
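The incremental truncation described above can be sketched as follows; the function name and toy data are hypothetical, and the study's actual pipeline is not shown here.

```python
import numpy as np

def truncate_cycles(X, k):
    """Input for model No. k: only the first k of the 40 RT-PCR cycles."""
    return X[:, :k, :]

# Toy stand-in for the curves: 4 samples, 40 cycles, 1 feature.
X = np.arange(4 * 40, dtype=float).reshape(4, 40, 1)

# Model No. 1 sees cycle 1 only; model No. 40 sees all 40 cycles.
for k in (1, 2, 39, 40):
    print(k, truncate_cycles(X, k).shape)
```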
The raw data for the RT-PCR fluorescence values that were used in this study exhibited sequential characteristics that varied with the extraction time during the cycle. Thus, different deep learning models ("Recurrent Neural Network" (RNN), "Long Short-Term Memory" (LSTM), "Bidirectional Long Short-Term Memory" (Bi-LSTM), "Gated Recurrent Unit" (GRU), and "Transformer") that are suitable for time series processing were applied in this study using TensorFlow (hardware: GPU, RTX 2080 Ti × 2; CPU, i7-9800X; RAM, 64 GB).
An RNN is a simple recurrent neural network that sequentially processes time series data [9]. LSTM is a neural network used to prevent vanishing gradients, which are known to be a problem with existing RNNs [10]. This model has a four-layer structure and consists of three gates (forget, input, and output). The forget gate determines how much past information is to be forgotten, and the input gate determines how much current information is to be remembered. The output gate exports the final hidden result. The existing LSTM learns in the forward direction, but the Bi-LSTM also learns in the reverse direction [11]; therefore, both outputs from bidirectional learning are used for prediction. The GRU has a simplified model structure compared to the existing LSTM. The GRU consists of two gates (update and reset), and it solves the long-term dependency problem and reduces computation. The reset gate decides how to merge the new input with the old memory, and the update gate decides how much of the old memory to keep [12]. Transformer models, on the other hand, were originally composed of encoders and decoders and are typically employed for sequence-to-sequence learning tasks, such as translation [13]. Encoder blocks composed of normalization and attention were employed in this investigation; instead of learning sequentially, the encoder learned through attention weights (Figure 2). The information on the optimal hyperparameters of the DL models based on the grid search method is provided in Table 1, and we used the early stopping function to prevent over-fitting.
Through these proposed models, the positive and negative RT-PCR results were classified, and then the results were compared and analysed.
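As a minimal numerical sketch of the GRU gating described above, a single hand-written cell can be run over one toy 40-cycle curve; weights, sizes, and data below are arbitrary illustrations, not the TensorFlow layers actually used in the study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step with an update gate z and a reset gate r (per [12])."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x @ Wz + h @ Uz + bz)          # how much old memory to keep
    r = sigmoid(x @ Wr + h @ Ur + br)          # how to merge input with old memory
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)
    return (1.0 - z) * h + z * h_tilde         # blended new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 1, 8                               # 1 fluorescence value per cycle
params = [rng.standard_normal(s) * 0.1
          for s in [(d_in, d_h), (d_h, d_h), (d_h,)] * 3]

# Feed a toy 40-cycle fluorescence curve through the cell.
h = np.zeros(d_h)
for cycle_value in rng.random(40):
    h = gru_step(np.array([cycle_value]), h, params)
print(h.shape)   # (8,)
```

In a classifier such as the ones compared here, the final hidden state would feed a sigmoid output unit for the positive/negative decision.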


Training and Test Datasets
To train the models, the results of the RT-PCR virology tests were utilized as a reference. Of the 2540 patients whose data were included in the research, 1270 had positive RT-PCR results and 1270 had negative results. These data were split into training (2000) and test (540) datasets. The raw RT-PCR curve data from 1000 positive and 1000 negative cases were divided into training and validation sets in an 80:20 ratio. For testing, 270 positive and 270 negative results were utilized (Figure 3).
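The split described above can be sketched as follows; random arrays stand in for the real curves, and the shuffling procedure is an assumption since the paper does not specify it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the 1270 positive and 1270 negative curves.
pos = rng.random((1270, 40, 1))
neg = rng.random((1270, 40, 1))

# 1000 positives + 1000 negatives for development, the rest for testing.
X_dev = np.concatenate([pos[:1000], neg[:1000]])
y_dev = np.concatenate([np.ones(1000), np.zeros(1000)])
X_test = np.concatenate([pos[1000:], neg[1000:]])
y_test = np.concatenate([np.ones(270), np.zeros(270)])

# 80:20 split of the 2000 development samples into training/validation.
idx = rng.permutation(len(X_dev))
n_train = int(0.8 * len(X_dev))
train_idx, val_idx = idx[:n_train], idx[n_train:]
print(len(train_idx), len(val_idx), len(X_test))   # 1600 400 540
```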

Outcomes
The primary endpoints were the sensitivity, specificity, AUROC values, positive predictive value (PPV), negative predictive value (NPV), and accuracy. The PPV, NPV and accuracy were assessed using a 5% prevalence assumption and in consideration of the rapid surge in the number of confirmed cases of the Omicron variant.
The secondary endpoints were the comparisons of the diagnostic performances in each of the algorithms of models 10 and 20.
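The prevalence-adjusted PPV and NPV follow from Bayes' rule; the sketch below shows the calculation at the 5% prevalence assumption, with illustrative sensitivity and specificity values that are not the study's results.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Prevalence-adjusted predictive values via Bayes' rule."""
    p, se, sp = prevalence, sensitivity, specificity
    ppv = se * p / (se * p + (1 - sp) * (1 - p))
    npv = sp * (1 - p) / (sp * (1 - p) + (1 - se) * p)
    return ppv, npv

# Illustrative values only, evaluated at the study's 5% prevalence assumption.
ppv, npv = ppv_npv(sensitivity=0.95, specificity=0.70, prevalence=0.05)
print(round(ppv, 3), round(npv, 3))   # 0.143 0.996
```

Note how a modest specificity at low prevalence yields a low PPV but a very high NPV, the pattern reported for the models in this study.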

Statistical Analysis
SPSS software V.26.0 was used for all statistical analyses (IBM SPSS, Inc., Chicago, IL, USA). In addition to the positivity or negativity of the RT-PCR data, the sensitivity (the proportion of actual positives correctly identified) and specificity (the proportion of actual negatives correctly identified) were also determined. Variables are expressed as the mean and 95% confidence interval. To compare the diagnostic performance between algorithms, the AUROC values of the algorithms were compared using the DeLong method; statistical significance was defined as a p value less than 0.05.
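The DeLong comparison itself is typically run in a statistical package, as was done here in SPSS. As a rough sketch of the underlying idea, comparing two paired AUROCs measured on the same test set, a paired bootstrap can be used instead; everything below (function names, toy scores) is illustrative and is not the paper's implementation.

```python
import numpy as np

def auroc(y, s):
    """AUROC via the Mann-Whitney rank formulation (assumes no tied scores)."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def paired_bootstrap_p(y, s1, s2, n_boot=2000, seed=0):
    """Two-sided paired-bootstrap p-value for the difference of two AUROCs
    on one test set (a simple alternative to the DeLong test)."""
    rng = np.random.default_rng(seed)
    observed = auroc(y, s1) - auroc(y, s2)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # resample must contain both classes
            continue
        diffs.append(auroc(y[idx], s1[idx]) - auroc(y[idx], s2[idx]))
    diffs = np.asarray(diffs)
    # centre the bootstrap distribution to approximate the null hypothesis
    return float(np.mean(np.abs(diffs - diffs.mean()) >= abs(observed)))

# Toy check: a perfectly ranked score list gives AUROC = 1.0.
y = np.array([0, 0, 1, 1])
print(auroc(y, np.array([0.1, 0.2, 0.8, 0.9])))   # 1.0
```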




The Diagnostic Performance of Each DL Algorithm
The sensitivities of the algorithm (Model No. 10) that was trained with the raw data up to 10 cycles (the first 25% of the total of 40 cycles) were 96.7 (95% CI, 93.8-…) (Figure 4C, Supplementary Table S1).

The sensitivities of the algorithm (Model No. 20) that was trained with the raw data up to 20 cycles (half of the total number of cycles) were 97.8 (95% CI, 95.…) (Figure 4C, Supplementary Table S1).

The Effect of the Prevalence on the Diagnostic Performance of Each DL Algorithm
As of 25 February 2022, the calculated prevalence rates in Korea, Germany, and the United States were 2.96 percent, 4.49 percent, and 8.42 percent, respectively. Considering the trend in the current prevalence, which has been affected by the Omicron variant, a prevalence of 5% was utilized to calculate the PPV and NPV in this study [14].


Discussion
Prior to the discussion, it is vital to clarify the objective of this study and the significance of the result values that are displayed in the study findings in order to properly convey the study's meaning. This study aims to develop and validate models for reducing the RT-PCR test time by utilizing five deep learning methods for time-domain data processing. Therefore, in the results of this study, we confirmed and compared the performance of several models: Model No. 20 was trained using the raw data from up to 20 cycles and can decrease the RT-PCR test time by half, and Model No. 10 was trained using the raw data from up to 10 cycles and can decrease the test time to a quarter.
In Model No. 10, the AUROCs when using Bi-LSTM and GRU were 85.2 (82.2-88.1) and 84.3 (81.6-87.1), respectively, and these demonstrated the best diagnostic performance; no statistically significant difference existed between the two algorithms, and both significantly outperformed the rest of the algorithms. There was no statistically significant difference among the algorithms in Model No. 20; however, the algorithm using Bi-LSTM had an AUROC of 93.2 (91.0-95.0), which demonstrated the best diagnostic performance (Figure 6B, Table 3). Additionally, Bi-LSTM demonstrated superior performance not only in terms of sensitivity and specificity but also in terms of PPV, NPV, and accuracy, which are influenced by prevalence.
According to the literature, diagnostic performance is evaluated as showing no discrimination when the AUROC value is 50, acceptable discrimination when the AUROC value is between 70 and 80, excellent discrimination when it is between 80 and 90, and outstanding discrimination when it is greater than 90 [14][15][16][17]. The diagnostic performances of the models that were developed in this study were close to or exceeded 80 in Model No. 10 regardless of the algorithm, indicating generally excellent discrimination, and exceeded 90 in Model No. 20 regardless of the algorithm, indicating outstanding discrimination.
In terms of sensitivity, Kim et al. [18] previously reported that the pooled sensitivity of RT-PCR was 89.0 (95% CI, 81.0-94.0), and an earlier meta-analysis by Hayer et al. [19] indicated that the overall sensitivity of RAT was 74.7 (95% CI, 63.7-80.9), while Ricco et al. [20] indicated that the pooled sensitivity of RAT was 64.8 (95% CI, 54.5-74.0). In this study, it was confirmed that the sensitivities of all the algorithms of Model No. 10 exceeded 90.
The specificities calculated in this study tended to be slightly lower than those in Ricco et al.'s [20] meta-analysis, which found that the pooled specificity of RAT was 98.0 (95% CI, 95.8-99.0): the specificity in Model No. 10 was close to or exceeded 70, and in Model No. 20 it was close to 90. For the specificity to be clinically meaningful as a means of preventing the spread of infection, however, the NPV, which changes according to prevalence, must be sufficiently high. Although the specificities of the models were slightly lower than that of the RAT in this investigation, the NPVs in all of the algorithms of Models No. 10 and No. 20 were greater than 99.0, compared to a range of 26.2 to 94.1 in the previous meta-analysis by Ricco et al. [20]. In other words, there is a 99% or greater likelihood that a verified negative patient is truly negative, suggesting that these algorithms may be safe screening approaches for preventing viral spread.
The PPVs (with a prevalence of 5%) in this study were observed to have a range of 12.6 to 16.2 in Model No. 10 and a range of 27.3 to 32.2 in Model No. 20. The PPVs have ranged from 57.1 to 100.0 in studies that did not reflect the difference in the prevalence, and this was previously demonstrated in a meta-analysis study on the RAT diagnostic performance by Ricco et al. [20].
Studies regarding various artificial intelligence, deep learning, and machine learning methods to replace RT-PCR, which is the gold standard for the diagnosis of COVID-19, have been reported.
As mentioned previously and in view of the results of various previous studies that have evaluated the replacement of the RT-PCR diagnostic test, the findings of these previous studies seem to be applicable to the clinical setting, and the goal is to minimize the possibility for viral transmission through the quick identification of patients, the isolation of patients, and the safe discharge of patients from isolation.
Nonetheless, it is challenging to utilize these diagnostics in a clinical setting due to the following issues.
These previous studies have indicated that there are problems, including imbalance and bias, in the data used for model learning because of the study limitations. Laghi concurs with efforts to diagnose COVID-19 using AI models [40]. However, given the clinical course of SARS-CoV-2 infection, the models provided in these studies were not developed to reflect the difference in test results across the time frame from the onset of infection to test execution. As a result, the use of these models in a true clinical setting appears to be extremely risky.
Considering this point, the prior study had a bias in the data used for learning, and it assessed a model that only utilized a single algorithm, LSTM. However, in the present study, 1270 negative results and 1270 positive results were acquired and applied at the same rate to boost the reliability of the learned models. Additionally, by learning through five different algorithms, it was feasible to evaluate each algorithm's diagnostic performance.
In this study, the sensitivity, specificity, and AUROC results show similarly high performance for LSTM, GRU, and Bi-LSTM. This reflects the tendency of RNN-based models to show excellent performance in processing time series data. In other words, this study confirmed that RNN-type models are very suitable deep learning models for training on RT-PCR time series data.
In contrast, the simple RNN and the Transformer demonstrated poor performance, for the following reasons. The vanishing gradient problem is a limitation of RNNs in which previous information is lost as the length of the time series increases; as the cycle count grows, initial information is lost, resulting in poor performance. Additionally, the Transformer's structure is more complex than that of the other DL models. Consequently, it requires a large quantity of data to be utilized effectively, but the dataset in this study is relatively small for the characteristics of the domain, which may explain its poor performance. Therefore, RT-PCR data were deemed unsuitable for Transformers, which are extremely complex and large models, and for very simple RNN models, and it was determined that LSTM, GRU, and Bi-LSTM are more applicable DL models for clinical applications. In the future, we expect to further improve the generalization performance of the DL models by collecting more data and redesigning the model structures so that they can be effectively used in the clinical field.
Nevertheless, this study has several limitations. First, the outcomes of this study do not represent differences in race or geographical area, as the data that were used to train and test each algorithm-specific model were obtained in South Korea. However, the fluorescence values acquired during the 40 cycles of RT-PCR, and the RT-PCR test procedure itself, are identical regardless of race or geographical region; as a result, even if there are differences in the results due to race or geographical area, there is no need to collect further data to reflect these differences for model learning and testing. Second, in the case of the PPV, which is one of the variables affected by the prevalence, the performance of the models that were developed in this work is poor in comparison to the RT-PCR or RAT PPV values. However, when previous studies examined the diagnostic performance of RT-PCR or RAT, either the prevalence was not addressed or the PPV was determined from diagnostic data obtained from particular population groups, such as symptomatic patients; as a result, those values may appear inflated. Thus, it is difficult to draw direct comparisons between the findings of those studies and the findings of this study. Third, because the patients' symptoms, blood test results, and X-ray findings were not linked to the data that were used for learning and testing in this study, these additional data were not used in conjunction with the fluorescence data; therefore, it is unreasonable to apply these models directly in the clinical field. If the patient's sex, age, symptoms, vital signs, blood test results, and X-ray findings are all analysed and used for learning, it is believed that it will be feasible to determine whether the model is beneficial and can be applied to real-world patients.
However, the models in this study were not constructed in this manner. Nonetheless, to our knowledge, except for our own previous study, no study has been undertaken on the diagnostic performance of deep learning models trained to lower the time necessary for RT-PCR diagnosis by utilizing the raw fluorescence values from 40 cycles of RT-PCR. If the models developed in this study are combined with other clinical data in the future, a diagnostic approach for a variety of infectious diseases may be conceivable; therefore, additional research will be required.

Conclusions
Among the five deep learning algorithms capable of training time series data, Bi-LSTM and GRU were shown to be suitable for halving or quartering the time required for RT-PCR diagnosis without significantly impairing the diagnostic performance of the COVID-19 RT-PCR test.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/v15020304/s1, Table S1: Diagnostic performance of DL models in five different algorithms using the test dataset; Table S2: The effect of the prevalence on the diagnostic performance of each DL algorithm.
Author Contributions: Y.L. was involved in the conceptualisation. Y.L. and Y.-S.K. were involved in the study design, management, data collection, interpretation of the results and cowriting of the manuscript. Y.L. and Y.-S.K. contributed equally to this study. G.H.K., H.Y.C., J.G.K., Y.S.J. and W.K. were involved in the interpretation of the data and the critical revision of the paper for important intellectual content. D.I.L. contributed to the development and analysis of the DL models using Python. As the corresponding author, S.J. was involved in the study concept and design, critical revision of the paper, and final approval of the version to be published. All authors have read and agreed to the published version of the manuscript.