Using Machine Learning Methods in Identifying Genes Associated with COVID-19 in Cardiomyocytes and Cardiac Vascular Endothelial Cells

Coronavirus disease 2019 (COVID-19) not only damages the respiratory system but also places strain on the cardiovascular system. Vascular endothelial cells and cardiomyocytes play an important role in cardiac function, and aberrant gene expression in these cells can lead to cardiovascular diseases. In this study, we sought to explain the influence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection on the gene expression levels of vascular endothelial cells and cardiomyocytes. We designed a machine learning-based workflow to analyze the gene expression profiles of vascular endothelial cells and cardiomyocytes from patients with COVID-19 and healthy controls. An incremental feature selection method with a decision tree was used to build efficient classifiers and to summarize quantitative classification genes and rules. The gene expression matrix covered 104,182 cardiomyocytes, including 12,007 cells from patients with COVID-19 and 92,175 cells from healthy controls, and 22,438 vascular endothelial cells, including 10,812 cells from patients with COVID-19 and 11,626 cells from healthy controls. From this matrix, some key genes that exert important effects on cardiac function, such as MALAT1, MT-CO1, and CD36, were extracted. The findings reported in this study may provide insights into the effect of COVID-19 on cardiac cells, further explain the pathogenesis of COVID-19, and facilitate the identification of potential therapeutic targets.


Introduction
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1][2][3], was declared a pandemic on 11 March 2020 due to its rapid global spread [4,5]. As of 23 July 2022, over 569 million cases had been reported worldwide.
Figure caption (analysis workflow): Genes were ordered according to the degree of correlation with COVID-19 using four feature selection methods. The four ordered gene lists obtained were fed into the incremental feature selection (IFS) method, where decision tree (DT) was employed as the classification algorithm. Finally, the best feature subsets and classification rules were extracted according to the IFS results using DT.

Data
The current study integrated data from the GEM database [34], which contains the gene expression profiles of cardiac cells from patients with COVID-19 and healthy controls. The data were downloaded from https://singlecell.broadinstitute.org/single_cell/study/SCP1216 (accessed on 23 May 2022). Cardiomyocytes and vascular endothelial cells in the heart were analyzed because they are closely related to heart function. Cells were grouped according to cell type, which yielded two datasets. The first set contained 104,182 cardiomyocyte samples, including 12,007 COVID-19 cardiomyocytes and 92,175 healthy control cardiomyocytes. The second set contained 22,438 vascular endothelial cell samples, including 10,812 cells from patients with COVID-19 and 11,626 cells from healthy controls. In each cell type, COVID-19 and healthy control were used as class labels, which were combined with the gene expression features to form a classification problem. Essential features can be obtained by investigating this classification problem.

Feature Selection Methods
Each sample contained a large number of gene expression features, but only a small fraction was associated with COVID-19. We analyzed these genes using the following four feature selection methods: least absolute shrinkage and selection operator (LASSO) [35], light gradient boosting machine (LightGBM) [36], Monte Carlo feature selection (MCFS) [37], and random forest (RF) [38], and we ranked them according to their association with COVID-19.

Least Absolute Shrinkage and Selection Operator
In the LASSO algorithm, a first-order penalty term based on the L1 norm is added to the objective function, where the independent variables are the input gene features. The penalty shrinks the coefficients of variables with low correlation and small predictive contribution to zero, thereby eliminating these unimportant features. This reduces the data dimension and helps prevent overfitting. After optimization, the input features can be sorted according to the absolute values of their coefficients.
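The coefficient-based ranking described above can be illustrated with a minimal sketch. It assumes a squared-error LASSO objective solved by cyclic coordinate descent on a tiny toy matrix; the solver, the gene names, the data, and the penalty value `lam` are all hypothetical choices for illustration and are not taken from the study.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Prediction for sample i with feature j left out.
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return w


def rank_by_abs_coefficient(names, w):
    """Order feature names by decreasing absolute coefficient."""
    order = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return [names[j] for j in order]


# Toy data (hypothetical): three mutually orthogonal features, four samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
X = [
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
    [1.0, -1.0, -1.0],
]
# The response depends strongly on GENE_C, weakly on GENE_B, not at all on GENE_A.
y = [2.5, -1.5, 1.5, -2.5]

w = lasso_coordinate_descent(X, y, lam=0.1)
ranking = rank_by_abs_coefficient(genes, w)
```

In this toy case the penalty leaves the coefficient of the irrelevant feature at exactly zero, so it falls to the bottom of the ranked list.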

Light Gradient Boosting Machine
LightGBM is an improved implementation of the gradient boosting DT algorithm. On a large dataset, it bundles mutually exclusive features and down-samples data instances with small gradients, thus reducing the effective data size and improving efficiency. It grows trees with a leaf-wise strategy, so each step extends the branch that is expected to be most beneficial. The more often a feature is used in building the trees, the more it contributes to the prediction. Thus, features can be ranked according to the number of times they occur in the constructed DTs.
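The occurrence-based ranking can be sketched as follows. This is only an illustration of counting how often each feature is used for a split; the nested-dictionary tree format and the gene names are hypothetical and do not reflect the internal representation used by LightGBM.

```python
from collections import Counter


def rank_by_split_count(trees):
    """Rank features by how many internal nodes split on them across all trees."""
    counts = Counter()

    def walk(node):
        # Leaves carry no "feature" key, so recursion stops there.
        if node is None or "feature" not in node:
            return
        counts[node["feature"]] += 1
        walk(node.get("left"))
        walk(node.get("right"))

    for tree in trees:
        walk(tree)
    return [feature for feature, _ in counts.most_common()], counts


# Two toy trees (hypothetical structure and gene names).
toy_trees = [
    {"feature": "GENE_A",
     "left": {"feature": "GENE_B", "left": {}, "right": {}},
     "right": {"feature": "GENE_A", "left": {}, "right": {}}},
    {"feature": "GENE_A",
     "left": {},
     "right": {"feature": "GENE_B", "left": {}, "right": {"feature": "GENE_C"}}},
]

split_ranking, split_counts = rank_by_split_count(toy_trees)
```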

Monte Carlo Feature Selection
The MCFS algorithm constructs a large number of independent DTs. It repeatedly draws random subsets of features, and for each feature subset, it repeatedly draws random training data. In total, p feature subsets are obtained, and for each feature subset, training data are randomly constructed t times and a DT is built each time. Thus, p × t trees are obtained. The relative importance (RI) score quantifies the importance of a feature g:

$$\mathrm{RI}_g = \sum_{\tau=1}^{p \times t} (\mathrm{wAcc})^{u} \sum_{n_g(\tau)} \mathrm{IG}(n_g(\tau)) \left( \frac{\text{no.in } n_g(\tau)}{\text{no.in } \tau} \right)^{v}$$

In the formula, wAcc is the weighted accuracy of the tree τ under consideration, n_g(τ) is a node of the DT τ that splits on feature g and whose information gain is denoted as IG(n_g(τ)), no.in n_g(τ) denotes the sample size of n_g(τ), and no.in τ is the number of samples in the root of τ; u and v are two positive reals weighting wAcc and the ratio no.in n_g(τ)/no.in τ, respectively. Features can be ranked in a list in decreasing order of their RI scores.
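As a worked illustration of the RI score described above, the following minimal sketch sums, over all trees, the weighted accuracy raised to u times the information gain of every node that splits on the feature, scaled by the node's share of root samples raised to v. The dictionary layout, feature names, and numbers are hypothetical.

```python
def relative_importance(trees, feature, u=1.0, v=1.0):
    """Relative importance of `feature` accumulated over a collection of trees."""
    ri = 0.0
    for tree in trees:
        node_sum = sum(
            node["info_gain"] * (node["n_samples"] / tree["n_root"]) ** v
            for node in tree["nodes"]
            if node["feature"] == feature
        )
        ri += (tree["wacc"] ** u) * node_sum
    return ri


# Two toy trees (hypothetical values).
toy_trees = [
    {"wacc": 0.9, "n_root": 100, "nodes": [
        {"feature": "g1", "info_gain": 0.5, "n_samples": 100},
        {"feature": "g2", "info_gain": 0.2, "n_samples": 40},
    ]},
    {"wacc": 0.8, "n_root": 100, "nodes": [
        {"feature": "g1", "info_gain": 0.3, "n_samples": 50},
    ]},
]

ri_g1 = relative_importance(toy_trees, "g1")  # 0.9*0.5*1.0 + 0.8*0.3*0.5
ri_g2 = relative_importance(toy_trees, "g2")  # 0.9*0.2*0.4
```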

Random Forest
Permutation feature importance was first introduced for RFs by Breiman [39] in 2001 and was later applied to other models by Fisher et al. [38]. The rationale is easy to explain. If a feature is more important, it leads to a greater increase in prediction error when it is randomized. The importance of features can be measured by the increase in prediction error after feature permutation. If permuting its value does not increase the prediction error, the feature is not important. Based on the increment in prediction error, features can be ranked in a list.
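The permutation procedure can be sketched in a model-agnostic way. The example below is a minimal illustration that assumes an already fitted classifier exposed as a `predict` function and uses the misclassification rate as the prediction error; the toy classifier, the data, the number of repeats, and the seed are hypothetical.

```python
import random


def error_rate(predict, X, y):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(error_rate(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy classifier (hypothetical): uses feature 0 only and ignores feature 1.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0


X = [[0, 5], [1, 3], [0, 1], [1, 4], [0, 2], [1, 0]]
y = [0, 1, 0, 1, 0, 1]

importances = permutation_importance(toy_predict, X, y)
feature_ranking = sorted(range(len(importances)), key=lambda j: -importances[j])
```

Shuffling the ignored feature leaves the error unchanged, so its importance is zero, whereas shuffling the informative feature raises the error.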

Incremental Feature Selection
Using the above methods, four ranked feature lists were obtained. However, deciding which features should be used for classification remains challenging. For a given classification algorithm, the selected features should allow the classifier to achieve good performance. This study used the IFS method [32] to determine such features. It constructs a series of feature subsets from a given feature list. The first subset contains the first 10 features, and each subsequent subset adds the next 10 features in the list. Each subset is used to train a classifier with the given classification algorithm, and all classifiers are then evaluated using 10-fold cross-validation. The classifier with the best performance is identified, and the feature subset used in this classifier is selected for further investigation.
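The IFS loop can be sketched as follows. This is a minimal illustration in which `evaluate` is a hypothetical callback standing in for "train a DT on these features and return its 10-fold cross-validated F1 measure"; the toy scoring function and feature names are assumptions made only so that the example runs.

```python
def incremental_feature_selection(ranked_features, evaluate, step=10):
    """Score the top-k prefixes of a ranked list for k = step, 2*step, ..."""
    best_subset, best_score = [], float("-inf")
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]
        score = evaluate(subset)
        curve.append((k, score))  # one point of the IFS curve
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, curve


# Toy ranked list and toy score that peaks at 20 features (both hypothetical).
ranked = [f"gene_{i}" for i in range(50)]


def toy_evaluate(subset):
    return 1.0 - abs(len(subset) - 20) / 100


best_subset, best_score, curve = incremental_feature_selection(ranked, toy_evaluate)

# With a step of 10, a list of 2000 ranked genes gives 200 subsets.
n_subsets_for_2000 = len(range(10, 2000 + 1, 10))
```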

Synthetic Minority Oversampling Technique
The datasets showed a disparity in the number of samples in each class. For example, the number of healthy cardiomyocytes was 7.7 times that of COVID-19 cardiomyocytes. Direct use of these data would bias classifiers toward the majority class. This study used the synthetic minority oversampling technique (SMOTE) [40] to address this issue. For a minority class, the method randomly selects a sample, say a, and finds its k nearest neighbors of the same class according to the Euclidean distance. A sample b is randomly selected among these k neighbors, and a point is randomly chosen on the line segment connecting a and b. This point is taken as a new sample of the minority class. The process is repeated until enough new samples have been generated for the minority class that the numbers of samples in the classes are balanced.
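The interpolation step of SMOTE can be sketched in a few lines. This is a minimal illustration of the procedure described above rather than the implementation used in the study; the function name, the toy points, the value of k, and the seed are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a within the minority class (a itself excluded).
        others = [p for p in minority if p is not a]
        others.sort(key=lambda p: math.dist(a, p))
        b = rng.choice(others[:k])
        t = rng.random()  # random position on the segment from a to b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


# Toy minority class (hypothetical): four points on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=6, k=2)

# Number of synthetic COVID-19 cardiomyocytes needed to match the healthy class.
n_needed = 92175 - 12007
```

Because every synthetic point lies on a segment between two existing minority samples, it stays within the region already occupied by that class.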

Decision Tree
The IFS method requires a classification algorithm. Here, we selected the DT algorithm [33], which constructs a tree structure with internal nodes, branches, and leaves. The internal nodes contain tests on features, and samples are assigned to different branches according to the outcomes of these tests. Eventually, the samples reach the leaves, which represent the categories. The DT algorithm is a classical white-box algorithm; that is, its decision process is transparent and can suggest meaningful clues to help elucidate the effect of COVID-19 on gene expression levels. These clues are contained in a group of quantitative classification rules, which can be extracted from the constructed trees.
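Rule extraction from a tree amounts to listing every root-to-leaf path. The sketch below illustrates this on a small hand-written tree; the nested-dictionary format, the gene names, and the thresholds are hypothetical and are not rules reported by the study.

```python
def extract_rules(node, conditions=()):
    """Return one (conditions, predicted_class) pair per root-to-leaf path."""
    if "label" in node:  # leaf: the path so far forms a complete rule
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right = extract_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left + right


# Toy tree (hypothetical genes and thresholds).
toy_tree = {
    "feature": "GENE_A", "threshold": 1.5,
    "left": {"label": "COVID-19"},
    "right": {
        "feature": "GENE_B", "threshold": 0.7,
        "left": {"label": "Healthy control"},
        "right": {"label": "COVID-19"},
    },
}

rules = extract_rules(toy_tree)
for index, (conds, label) in enumerate(rules):
    print(f"Rule {index}: IF " + " AND ".join(conds) + f" THEN {label}")
```

Each rule states whether a gene must be at or below, or above, a threshold, which is what allows a rule to be read as a low-expression or high-expression requirement for that gene.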

Performance Evaluation
In the IFS method, many classifiers were constructed and evaluated using 10-fold cross-validation [41]. The F1 measure was taken as the key metric for measuring the performance of the classifiers [42][43][44][45][46][47][48]. It was calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. A higher F1 measure indicates better classifier performance. We further used accuracy (ACC) and the Matthews correlation coefficient (MCC) [49] for reference. MCC indicates the agreement between predicted and observed labels and remains balanced when classes have different sample sizes. Higher MCC and ACC likewise indicate better performance.
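The three metrics named above can be computed directly from the four confusion-matrix counts. The following minimal sketch uses the standard definitions of F1, ACC, and MCC; the function name and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (F1, ACC, MCC) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, acc, mcc


# Imbalanced toy example (hypothetical counts): 10 positives, 90 negatives.
f1, acc, mcc = classification_metrics(tp=8, fp=2, tn=88, fn=2)

# A classifier that always predicts the majority (negative) class.
f1_maj, acc_maj, mcc_maj = classification_metrics(tp=0, fp=0, tn=90, fn=10)
```

The second case shows why ACC alone can mislead on imbalanced data: always predicting the majority class gives an ACC of 0.9 while F1 and MCC are both 0.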

Results
The gene expression profiles of cardiomyocytes and cardiac vascular endothelial cells from patients with COVID-19 and healthy population were analyzed in this study, as shown in Figure 1. The results obtained in each step are presented in this section.

Results of Feature Ranking and Incremental Feature Selection
The expression profile of each cardiac cell comprised 29,071 gene features, which were analyzed using the LASSO, LightGBM, MCFS, and RF methods. Four ranked feature (gene) lists were obtained, which are provided in Table S1. As only a small number of genes would be significantly associated with COVID-19, the top 2000 genes in each list were selected for subsequent analysis. Using the IFS method with a step of 10 and the DT algorithm, 200 feature subsets were constructed from each list, yielding 200 DT classifiers. All classifiers were assessed by 10-fold cross-validation.
On the cardiomyocyte dataset, the four IFS curves obtained on the four feature lists are illustrated in Figure 2. The DT classifier using the first 20 genes in the list yielded by LightGBM showed the best performance, with an F1 measure of 0.983, together with high ACC and MCC (0.996 and 0.980, respectively; see Table 1). The best DT classifiers on the other three feature lists also performed excellently: the highest F1 measures were 0.975 (LASSO), 0.975 (MCFS), and 0.976 (RF) when the first 20, 1500, and 340 genes in the corresponding lists were used, respectively. Their detailed performance is listed in Table 1.
On the dataset of vascular endothelial cells, the four IFS curves are shown in Figure 3. The DT classifier using the first 120 genes in the list yielded by MCFS showed the highest performance (F1 measure of 0.949), and the best ACC and MCC values (0.951 and 0.902, respectively) were obtained on the same set of genes, as listed in Table 1. Similarly, the best DT classifiers on the lists yielded by LASSO, LightGBM, and RF also performed well, with F1 measures of 0.923, 0.948, and 0.945, respectively, when the first 60, 80, and 60 genes in the corresponding lists were used.
According to the above results, the best DT classifiers on every feature list performed well, suggesting that they can be useful tools for identifying patients with COVID-19. For each cell type, the genes with which the DT classifier achieved its best performance were extracted from each feature list as important genes; their numbers are listed in Table 1. However, for some feature lists this yielded a large number of genes. For example, for cardiomyocytes, 1500 genes were taken from the list yielded by MCFS, which makes further investigation difficult. The most important genes therefore had to be extracted from these sets. After checking the IFS results on the two cell types, we selected the top 60 genes from the list yielded by MCFS and the top 60 genes from the list yielded by RF for cardiomyocytes, and the top 20 genes from the list yielded by LightGBM and the top 20 genes from the list yielded by MCFS for vascular endothelial cells. The F1 measures of the DT classifiers using these features are marked in Figures 2 and 3. They were slightly lower than those of the best DT classifiers on the same feature lists, but the numbers of features used were sharply reduced; these genes were therefore relatively more important than the remaining genes. For the other feature lists, no further selection was necessary because only a few genes had been selected. Accordingly, the top 20, 20, 60, and 60 genes in the lists yielded by LASSO, LightGBM, MCFS, and RF, respectively, were intersected for cardiomyocytes, and a Venn diagram was plotted, as shown in Figure 4. Similarly, the most important genes for vascular endothelial cells (the first 60, 20, 20, and 60 genes of the lists) were analyzed, and the resulting Venn diagram is shown in Figure 5. We then searched for the key genes that were commonly identified by these methods and have been shown to be closely associated with COVID-19.
Detailed intersection results are shown in Table S3.
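The intersection analysis can be sketched by counting, for each gene, how many of the four top-gene sets contain it. The gene identifiers and list contents below are hypothetical placeholders; the actual intersections are those reported in Table S3.

```python
from collections import Counter


def genes_by_support(top_lists):
    """Count in how many method-specific top lists each gene appears."""
    return Counter(gene for genes in top_lists.values() for gene in set(genes))


# Toy top-gene lists per feature selection method (hypothetical contents).
top_lists = {
    "LASSO": ["G1", "G2", "G3"],
    "LightGBM": ["G1", "G3", "G4"],
    "MCFS": ["G1", "G2", "G5"],
    "RF": ["G1", "G6", "G3"],
}

support = genes_by_support(top_lists)
common_to_all = sorted(g for g, c in support.items() if c == len(top_lists))
in_at_least_three = sorted(g for g, c in support.items() if c >= 3)
```

Genes supported by several methods correspond to the overlapping regions of the Venn diagram and are natural candidates for closer inspection.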

Classification Rules
The DT algorithm is a classical white-box algorithm. It explicitly shows the process of classification and provides interpretable classification rules. These rules may facilitate the analysis of the significance of genes that may be differentially expressed in patients with COVID-19. By using the best DT classifiers on the four feature lists, we summarized four groups of quantitative classification rules each for cardiomyocytes and cardiac vascular endothelial cells. The detailed classification rules are provided in Table S4. On the one hand, these rules can be used to distinguish patients with COVID-19 from healthy controls. On the other hand, they clearly display the different expression patterns in cardiomyocytes or cardiac vascular endothelial cells between healthy controls and patients with COVID-19. Some rules are discussed in detail below.

Discussion
We identified a set of potential signature genes that show differential expression associated with COVID-19 in cardiomyocytes and vascular endothelial cells. These genes can be useful in understanding how SARS-CoV-2 infection affects gene expression in cardiac cells. Confirming these genes can enhance the understanding of the pathogenesis of COVID-19-induced cardiovascular diseases and aid in the clinical diagnosis and treatment of related diseases. According to recently published papers, several of the features and quantitative rules we identified are related to cardiovascular diseases caused by COVID-19.

Analysis of Gene Features in Cardiac Cells for COVID-19
Based on the machine learning-based workflow, a set of significant genes were identified, which may be differentially expressed in vascular endothelial cells and cardiomyocytes. The genes facilitated the differentiation of patients with COVID-19 from healthy populations. Under pathological conditions, several top genes are involved in distinct biological activities in cardiomyocytes and vascular endothelial cells. Here, we analyzed the first five genes listed in Table 2 for cardiomyocytes and vascular endothelial cells. The possible impact of the differential expression of identified top features in vascular endothelial cells and cardiomyocytes on the hearts of patients with COVID-19 is discussed below. This discussion may explain the vulnerability of patients with COVID-19 to cardiovascular diseases and may offer direction for clinical diagnosis and treatment.

Qualitative Features in Vascular Endothelial Cells
The first identified feature gene is MALAT1 (ENSG00000251562), which has a significant impact on the heart given that it positively regulates the proliferation of cardiomyocytes [50] and plays an important role in the regulation of cardiovascular inflammation [51]. Low levels of MALAT1 lncRNA were discovered in patients with severe COVID-19 [52,53]. Another study suggested that MALAT1 depletion is responsible for the sepsis inflammatory response by inhibiting the expression of IL-6 and TNF-α and the NF-κB signaling pathway by upregulating miR-150-5p [54]. COVID-19 is associated with proinflammatory cytokine release [55], suggesting that SARS-CoV-2 infection alters the expression of MALAT1. Therefore, MALAT1 can be considered a potential feature for identifying patients with COVID-19. In vascular endothelial cells, the differential expression of MALAT1 is associated with several cardiovascular diseases, and MALAT1 is associated with the inflammation and apoptosis of vascular endothelial cells [56], risk of coronary heart disease [57,58], and deep vein thrombosis [59].
The next identified features were ID1 (ENSG00000125968) and ID3 (ENSG00000117318). They belong to the inhibitor of DNA binding (ID) family. IDs are required for the formation of the heart [60,61] and skeletal myogenesis [62] and play an important role in angiogenesis [63]. Thus, a potential relationship between IDs and the heart was revealed. A recent study discovered ID1 differential expression in COVID-19 retest-positive patients [64], demonstrating the potential influence of COVID-19 on ID expression. The beneficial function of IDs in reducing viral replication has been shown in some articles [65,66]. The expression of IDs in vascular endothelial cells may have a potential role in increasing the risk of some cardiovascular diseases. ID3 expression is involved in the protective process of coronary artery disease [67] and atherosclerosis [60]. Moreover, ID1 plays an important role in repair after a vascular injury [68].
The next identified gene was MT-CO1 (ENSG00000198804), which is a cytochrome c oxidase gene and a mitochondrial marker that plays an important role in mitochondrial aerobic respiration [69]. Given that the heart's driving function depends on ATP produced by aerobic respiration, MT-CO1 is essential for the heart. Several studies have demonstrated the effect of COVID-19 on MT-CO1 expression. In 2022, researchers found low mRNA levels of MT-CO1 in patients with COVID-19 [70], and another article noted that MT-CO1 was downregulated in patients with COVID-19 and recovered individuals [71]. The differential expression of MT-CO1 in vascular endothelial cells due to COVID-19 may contribute to a range of cardiovascular diseases. The inhibition of MT-CO1 expression may lead to inefficient electron transfer, resulting in the production of reactive oxygen species [72]. The resulting damage to the vascular endothelium has detrimental effects on cardiac function. In addition, MT-CO1 downregulation produces mitochondrial oxidative stress, which may increase the risk of atherosclerosis and coronary artery disease [73]. Thus, the higher risk of cardiovascular disease in patients with COVID-19 is partially explained by the differential expression of the MT-CO1 gene in vascular endothelial cells [74].
The last identified feature was EGFL7 (ENSG00000172889). EGFL7 gene is associated with angiogenesis and is highly expressed in the developing neonatal vasculature [75]. It is upregulated in adults with a vascular injury [76,77]. In 2020, Leng et al. [78] discovered that EGFL7 is downregulated in patients with COVID-19, suggesting that EGFL7 can be identified as a valid feature for distinguishing patients with COVID-19. The negative consequences that may be caused by EGFL7 gene differential expression in vascular endothelial cells in the heart have been explored. One study discovered that increasing EGFL7 expression enhances neoangiogenesis within plaques and promotes the development of atherosclerosis [79]. Moreover, EGFL7 knockdown induces cardiac dysfunction and fibrosis [80], and EGFL7 genetic deletion causes severe vascular defects [81].

Qualitative Features in Cardiomyocytes
The first identified feature gene was MALAT1 (ENSG00000251562). We have discussed the basic function of the MALAT1 gene and its differential expression in patients with COVID-19. In contrast to the previous discussion, the COVID-19-induced differential expression of MALAT1 in cardiomyocytes produced different negative effects on patients' hearts. Increased MALAT1 expression enhances cardiomyocyte apoptosis [50,82,83], and MALAT1 is highly expressed in patients with acute myocardial infarction [84,85]. However, the low expression of MALAT1 may promote myocardial hypertrophy [86].
The next identified gene was CD36 (ENSG00000135218). It plays an important role in lipid metabolism, especially in the uptake of long-chain fatty acids. Because these fatty acids are the primary source of myocardial energy supply, CD36 is closely related to heart function [87]. CD36 is differentially expressed in patients with COVID-19 and recovered individuals [88]. Moreover, human primary monocytes infected with SARS-CoV-2 alter the expression of genes related to lipid uptake, such as CD36 [89]. All these results suggest that COVID-19 may alter CD36 gene expression. Alterations in CD36 gene expression may have adverse effects on the heart and lead to the disruption of fatty acid or lipid metabolism, resulting in a range of chronic diseases, such as cardiac hypertrophy, heart failure, and cardiac ischemia/reperfusion [90]. In CD36-deficient patients, the myocardial uptake of long-chain fatty acids is significantly reduced and glucose utilization is increased [91]. Such a shift in energy supply accelerates the progression of myocardial hypertrophy to heart failure [92].
The next identified feature was LARGE1 (ENSG00000133424), which is highly abundant in the heart [93] and involved in the regulation of cellular homeostasis in myocytes [94]. Although no direct evidence has shown that LARGE1 gene expression is altered in patients with COVID-19, two studies on the Lassa fever virus, which also contains spike proteins, found LARGE1 gene overexpression [95,96]. Thus, LARGE1 expression is likely to change as a result of SARS-CoV-2 infection. LARGE1 overexpression exacerbates muscular dystrophy [97,98]. Given that this disease affects cardiac function and even leads to heart failure [99], the altered expression of LARGE1 in cardiac myocytes caused by COVID-19 may have an adverse effect on the heart.
The next identified gene was RYR2 (ENSG00000198626), which is highly expressed in the heart, is involved in action potential regulation in atrial myocytes [100], and positively regulates atrial contractility [101]. These functions indicate the important role of RYR2 in the heart. Indirect evidence has demonstrated the association between COVID-19 and RYR2 expression. COVID-19 may cause hypoxia in patients [102], and RYR2 expression can be significantly reduced upon hypoxic exposure [103]. RYR2 channels exhibit increased activity in a patient's brain [104], and alterations in RYR2 expression may cause a range of negative effects on the heart. Moreover, pathological Ca2+ channel remodeling and heart failure progression are linked to RYR2 [105,106]. Additionally, catecholamine-sensitive ventricular tachycardia is caused by RYR2 dysfunction [107], and low RYR2 expression likely causes a comparable range of negative effects.
The last identified feature was PLCG2 (ENSG00000197943), which is a SARS-CoV-2 infection-related gene associated with immune response [108]. SARS-CoV-2 infection may lead to viral myocarditis, which may be associated with the expression of PLCG2. The expression of PLCG2 transcripts is upregulated in club cells after SARS-CoV-2 infection [109], and PLCG2 is highly expressed in the kidneys of patients with COVID-19 [108]. Cardiac injury due to COVID-19-induced differential expression of PLCG2 in cardiomyocytes may be associated with myocarditis. Given that PLCG2 is involved in multiple immune responses [110,111], its high expression may promote inflammation leading to myocarditis.

Analysis of Decision Rules in Cardiac Cells for COVID-19
As described above, we identified a validated set of genes that may help in qualitatively distinguishing cardiac cells of patients with COVID-19 from those of uninfected populations on the basis of cardiac gene expression. The ability of some of these genes to categorize samples at the transcriptome level is supported by recent studies. In addition, quantitative rules were established by the computational workflow, and some representative rules for cardiomyocytes and vascular endothelial cells were selected for in-depth discussion, as listed in Table 3. The rules allowed us to accurately identify patients with COVID-19 according to changes in gene expression in cardiac cells.

Table 3. Representative rules generated in cardiac cells to identify patients with COVID-19. (Columns: Cell Type, Rules, Parameters, Predicted Class.)

[...] expressed in each rule. The pathogenicity analysis of the parameters in the rule is more specific than that in the previous section and is strictly based on the "upregulation" or "downregulation" indicated in the rule.

Quantitative Rules in Vascular Endothelial Cells
The first rule (Rule 0) involves four parameters that help in identifying individuals infected with SARS-CoV-2 and in reflecting the impact of COVID-19 on heart function. The first parameter, MALAT1 (ENSG00000251562), was downregulated in this rule for the identification of patients with COVID-19. In 2020, a study found that MALAT1 was downregulated in the bronchoalveolar lavage fluids of patients with mild and severe SARS-CoV-2 infections [112]. Another study in 2021 found similar results [113]. MALAT1 can be considered a reliable parameter for distinguishing patients with COVID-19, although its expression level varies with the severity of COVID-19 and the type of cells tested [52,114]. MALAT1 is associated with vascular endothelial cell homeostasis [115], and reduced MALAT1 expression may impede vascular repair after SARS-CoV-2 infection. MT-CO1 (ENSG00000198804), the second parameter, was downregulated in vascular endothelial cells after SARS-CoV-2 infection in this rule. In 2022, researchers found reduced mRNA levels of MT-CO1 in patients with COVID-19 [70], validating this parameter. Another study from 2021 discovered that after SARS-CoV-2 infection, the expression of mitochondrial genes, including the MT-CO1 gene, was downregulated [116]. Reduced MT-CO1 expression in this rule may increase the incidence of coronary artery disease and contribute to poor outcomes in patients with coronary artery disease [117,118]. The next parameter in the vascular endothelium was HIF3A (ENSG00000124440), which was shown to be downregulated in this rule. In 2021, HIF3A was found to be downregulated within the frontal cortices of patients with COVID-19 [119]. HIF3A is induced by hypoxia [120] and downregulated in inflammatory states [121], partially explaining the validity of this parameter. However, studies on the potential damage caused by low HIF3A expression to the heart are limited. A 2021 publication found that silencing HIF3A reduces apoptosis and has a cardioprotective effect [122]. The last parameter in Rule 0 was SNHG7 (ENSG00000233016), which was predicted to be downregulated in the vascular endothelium of patients with COVID-19. In 2021, researchers found that SNHG7 lncRNA was upregulated in SARS-CoV-2-infected cells and tissues of patients with COVID-19 [123]. This result is contrary to the prediction of Rule 0. The discrepancy may arise because that study tested other tissues, such as lung tissue, whose exposure to SARS-CoV-2 differs from that of the heart. A recent publication suggested that SNHG7 protects against atherosclerosis development [124]; its downregulation may therefore make patients with COVID-19 more likely to develop atherosclerosis or more severe atherosclerosis. SNHG7 was also downregulated in unstable plaques of patients with coronary artery disease [125], suggesting a potential role of its low expression in the regulation of coronary artery disease.
The second rule (Rule 1), which helped us identify healthy people without SARS-CoV-2 infection, contained four parameters. The first parameter was MALAT1 (ENSG00000251562), whose expression in vascular endothelial cells was consistently high in healthy populations. The downregulation of MALAT1 in patients with COVID-19 and its potential effects on cardiac health have been discussed above. The second parameter, ID1 (ENSG00000125968), showed consistently high levels in vascular endothelial cells in this rule and was used in distinguishing individuals who were not infected by SARS-CoV-2. The knockdown of ID1 was found to be beneficial for the replication and transcription of multiple viruses [65,66], partially demonstrating that ID1 helps in distinguishing uninfected populations. Endocytic activation and angiogenesis are hampered by ID1 gene inhibition [126], and high levels of ID1 expression regulate the severity of inflamed tissue injury [127]. Therefore, lowering its expression following SARS-CoV-2 infection may reduce protection against heart damage. The next parameter was PDLIM5 (ENSG00000163110), which had low levels in the vascular endothelial cells of the SARS-CoV-2-uninfected population. Given the crucial function of PDLIM5 in repair when the heart is damaged [128,129], low levels of PDLIM5 in vascular endothelial cells in healthy populations seem understandable. Given that PDLIM5 is a pro-atherosclerotic gene [130], SARS-CoV-2 infection that results in increased PDLIM5 expression levels may accelerate atherosclerosis. The next parameter, PHACTR1 (ENSG00000112137), had low levels in vascular endothelial cells in this rule. It was used in distinguishing uninfected individuals from patients with COVID-19. According to a 2018 study, PHACTR1 interacts with MRTF-A to mediate inflammation, which may help to explain why uninfected people continue to have low levels of PHACTR1. The COVID-19-mediated overexpression of PHACTR1 may cause a range of cardiovascular diseases, including coronary artery disease [131][132][133] and acute myocardial infarction [133].
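The two vascular endothelial cell rules described above can be read as conjunctions of "high" or "low" conditions on individual genes. The following is a minimal sketch of that reading, not the classifier built in this study: the gene directions are taken from the text, whereas the numeric cut-offs in `THRESHOLDS`, the helper names `matches` and `classify`, and the example cells are hypothetical placeholders, because the exact thresholds are given only in Table S4.

```python
from typing import Dict, List, Optional, Tuple

# A condition is (gene, direction): "low" means expression <= cut-off,
# "high" means expression > cut-off.
Condition = Tuple[str, str]

# Directions follow the discussion of Rule 0 and Rule 1 in the text.
VEC_RULES: Dict[str, Tuple[List[Condition], str]] = {
    "Rule 0": ([("MALAT1", "low"), ("MT-CO1", "low"),
                ("HIF3A", "low"), ("SNHG7", "low")], "COVID-19"),
    "Rule 1": ([("MALAT1", "high"), ("ID1", "high"),
                ("PDLIM5", "low"), ("PHACTR1", "low")], "healthy"),
}

# HYPOTHETICAL cut-offs for illustration only; the real values are in Table S4.
THRESHOLDS: Dict[str, float] = {
    gene: 1.0
    for gene in ("MALAT1", "MT-CO1", "HIF3A", "SNHG7", "ID1", "PDLIM5", "PHACTR1")
}


def matches(cell: Dict[str, float], conditions: List[Condition],
            thresholds: Dict[str, float]) -> bool:
    """Return True if the cell satisfies every condition of one rule."""
    for gene, direction in conditions:
        # Genes missing from the cell are treated as not expressed (0.0).
        is_high = cell.get(gene, 0.0) > thresholds[gene]
        if (direction == "high") != is_high:
            return False
    return True


def classify(cell: Dict[str, float]) -> Tuple[Optional[str], str]:
    """Return (rule name, label) of the first matching rule, if any."""
    for name, (conditions, label) in VEC_RULES.items():
        if matches(cell, conditions, THRESHOLDS):
            return name, label
    return None, "unassigned"


# Made-up expression values, used only to exercise the rules.
infected_like = {"MALAT1": 0.2, "MT-CO1": 0.1, "HIF3A": 0.0, "SNHG7": 0.3}
healthy_like = {"MALAT1": 3.0, "ID1": 2.0, "PDLIM5": 0.1, "PHACTR1": 0.0}
print(classify(infected_like))
print(classify(healthy_like))
```

In a decision tree, each such rule corresponds to one root-to-leaf path, so a cell that satisfies neither of the two rules shown here would be covered by another rule of the full tree rather than left unassigned.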

Quantitative Rules in Cardiomyocytes
The first rule (Rule 0), with two parameters, facilitated the identification of patients with COVID-19 according to the differential expression of these parameters in cardiomyocytes. The first parameter was MALAT1 (ENSG00000251562), which showed low expression levels in the cardiomyocytes of patients with COVID-19. Low MALAT1 expression in cardiomyocytes may damage the hearts of patients with COVID-19. Chen et al. [86] found that MALAT1 knockout exacerbates angiotensin II-induced cardiac hypertrophy, suggesting that patients with COVID-19 are susceptible to myocardial hypertrophy. The second parameter, TTN (ENSG00000155657), was downregulated in this rule and was used in distinguishing patients with COVID-19. Kanduc et al. [134] found that titin, the protein encoded by the TTN gene, shares 29 pentapeptides with the spike glycoprotein of SARS-CoV-2. This similarity may cause immune cross-reactivity and thus affect TTN gene expression. Another publication stated that TTN is downregulated after IL-6 treatment [55], implying a corresponding downregulation in patients with COVID-19. TTN expression and myocardial function are tightly associated [55,135], and cardiomyocytes with TTN gene knockdown show weaker and disordered sarcomeres [136], which may decrease the heart's ability to contract.
The second rule (Rule 1) contained three parameters for the identification of populations without SARS-CoV-2 infection. The first parameter was EMC10 (ENSG00000161671), which had low levels in the cardiomyocytes of uninfected individuals. With regard to this gene's potential impact on the heart, SARS-CoV-2 infection may result in the overexpression of EMC10, a gene primarily involved in cardiac repair [134]. As a result, we considered it a potential adaptive repair response. The second parameter, NEB (ENSG00000183091), retained low expression levels in the cardiomyocytes of uninfected people in this rule and was used in excluding SARS-CoV-2-infected people. SARS-CoV-2 infection affects NEB gene expression and induces an inflammatory response [137], contributing to the exclusion of healthy individuals. Patients with COVID-19 have higher levels of NEB expression in their cardiomyocytes. Actin filament length is regulated by NEB [138], suggesting that NEB overexpression may cause sarcomere dysfunction, which in turn results in impaired heart contraction. The last parameter in this rule was PTGDS (ENSG00000107317), a gene whose specific expression level in cardiomyocytes helps in distinguishing healthy individuals without SARS-CoV-2 infection. In this rule, the expression of PTGDS should be maintained at a high level. In 2021, Haslbauer et al. [139] found that PTGDS gene expression was downregulated in patients with COVID-19, suggesting the dysregulation of arachidonic acid metabolism. The heart may be affected by the COVID-19-induced decrease in the expression of PTGDS, which is involved in the production of prostaglandin D2 necessary for the survival of cardiomyocytes [140]. Therefore, increased cardiomyocyte apoptosis due to decreased PTGDS expression can harm the heart. Furthermore, Zhao et al. [141] revealed that reduced PTGDS expression may be a biomarker of myocardial infarction, indicating that patients with COVID-19 may have a higher risk of experiencing this condition.
Overall, previous studies support most of the parameters in the top quantitative rules discussed above.

Limitations of This Study
In the current study, essential information (essential genes and classification rules) was extracted from a large gene expression profile of two types of heart cells from patients with COVID-19 and healthy controls. Some of the genes and rules could be linked to COVID-19-related cardiovascular diseases through recent publications. However, this study did not provide direct experimental (wet-lab) evidence. Furthermore, the machine learning-based workflow output many genes and rules, of which only a few were discussed. Some essential findings may remain hidden among the undiscussed genes and rules. We hope that future studies can identify more essential genes and rules.

Conclusions
To investigate the effect of COVID-19 on cardiac cells, a machine learning-based workflow was designed in this study to analyze the gene expression data of two heart cell types, cardiomyocytes and cardiac vascular endothelial cells, from patients with COVID-19 and healthy controls. Some essential genes related to COVID-19 in these cells were identified, which can be latent biomarkers for identifying patients with COVID-19. In addition, effective classification rules were established, indicating the special expression patterns in these cells for patients with COVID-19. Furthermore, some efficient DT classifiers were built in this study, which can be useful tools for diagnosing COVID-19. These findings provide a reference for uncovering the mechanism by which COVID-19 damages the heart and suggest some possible therapeutic targets. In contrast to traditional drug target studies, which investigate the association between a drug and genes, we believe that drugs should target specific cell types and the genes within them. Single-cell methods have the potential to revolutionize drug discovery.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/life13041011/s1, Table S1: Feature lists obtained by using LASSO, LightGBM, MCFS, and RF; Table S2: IFS results of decision tree on different feature lists for two cell types; Table S3: Intersection of important genes identified by LASSO, LightGBM, MCFS, and RF for the two cell types. The features identified by 4, 3, 2, and 1 method(s) are shown; Table S4: Classification rules generated by the optimal DT classifiers on different feature lists for two cell types.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.