Unraveling the Molecular Puzzle: Exploring Gene Networks across Diverse EMT Status of Cell Lines

Understanding complex disease mechanisms requires a comprehensive understanding of the gene regulatory networks, as complex diseases are often characterized by the dysregulation and dysfunction of molecular networks, rather than abnormalities in single genes. Specifically, the exploration of cell line-specific gene networks can provide essential clues for precision medicine, as this methodology can uncover molecular interplays specific to particular cell line statuses, such as drug sensitivity, cancer progression, etc. In this article, we provide a comprehensive review of computational strategies for cell line-specific gene network analysis: (1) cell line-specific gene regulatory network estimation and analysis of gene networks under varying epithelial–mesenchymal transition (EMT) statuses of cell lines; and (2) an explainable artificial intelligence approach for interpreting the estimated massive multiple EMT-status-specific gene networks. The objective of this review is to help readers grasp the concept of computational network biology, which holds significant implications for precision medicine by offering crucial clues.


Introduction
Recently, heterogeneous gene regulatory networks, which are regulatory interactions between genes controlling specific cell functions, have garnered a large amount of attention in various fields of research to understand disease mechanisms arising from complex molecular networks.To estimate gene regulatory networks, various computational methodologies have been developed.Furthermore, the effectiveness of these gene networks has been validated in various works, e.g., drug combinations identification, cancer prediction, etc. [1][2][3].Although numerous studies on gene regulatory networks have been widely conducted, most of the existing studies have been based on averaged gene networks across all cell lines.Thus, we cannot reveal cell line (or patient)-specific gene regulatory networks that contain crucial clues for precision medicine.
In this article, we review computational strategies for cell line characteristic-specific gene network analysis.We first review machine learning approaches for sample-specific gene network estimation.We then focus on epithelial-mesenchymal transition (EMT), which is a biological phenomenon wherein epithelial cells undergo a transformation, leading to the loss of their characteristic properties and the acquisition of mesenchymal characteristics.The EMT plays key roles in cancer invasion, metastasis, and resistance to chemotherapy.Thus, uncovering the EMT mechanism is crucial for developing effective strategies to target cancer metastasis and enhance therapeutic efficacy.To understand the mechanism of transformation from epithelial cells to mesenchymal states and the related markers, we consider gene networks under varying EMT statuses of cell lines and review a study on uncovering system changes in gene networks under varying EMT statuses of cell lines [4].We then turn our focus to the drawbacks in the studies involving the analysis of cell line characteristic-specific gene networks, specifically the interpretation of the massive multiple networks.The cell line-specific gene network analysis provides hundreds of networks for hundreds of cell lines, where a network is given in matrix form consisting of more than 10,000 rows with more than 1000 columns.Thus, the task of interpreting the massive multiple networks remains a big challenge.However, the inference of gene regulatory networks is not the final result; rather, these networks are intended to help solve biological and biomedical problems by effectively interpreting the estimated gene networks [5].Gene regulatory networks play a crucial role in comprehending the maintenance, establishment, and disruption of cellular identity in diseases [6].The gene networks facilitate our understanding of the molecular mechanisms governing organisms and reveal the fundamental principles governing a wide array of biological processes and reactions in organisms [7].Furthermore, extensive evidence shows that epithelialmesenchymal transition (EMT) is regulated by numerous transcription factors and signaling pathways, and gene regulatory networks play a critical role in controlling EMT programs during both developmental processes and disease [8,9].
In the second part of this article, we review the explainable artificial intelligence (XAI) approach, TRIP, which is used to interpret large-scale gene regulatory networks, enabling researchers to unravel complex biological processes and gain a deeper understanding of disease mechanisms [10].We also review the application results of XAI in interpreting the estimated EMT status-specific gene networks [11].TRIP was applied to gene networks estimated a decade ago and EMT markers were uncovered.Interestingly, some of the genes identified in this analysis were also identified as EMT markers in a previous study on EMT network analysis [4].This implies that the dataset from 10 years ago contained knowledge of the EMT markers and their corresponding EMT-related mechanisms, which have subsequently been discovered in the last decade.Due to the lack of computational strategies for comprehensive analysis and interpretation of the massive gene regulatory networks, we were unable to uncover the EMT markers and their EMT-related mechanisms 10 years ago.Based on these reviews, it is expected that the incorporation of XAI will bring forth new possibilities in computational network biology, and may lead to a more effective understanding of complex disease mechanisms.
The remainder of this paper is organized as follows.In Section 2, we review the computational strategies for cell line-specific gene network analysis.The application results of the strategy for estimating cell line-specific gene networks under varying EMT statuses of cell lines are reviewed in Section 2.Then, we review the explainable artificial intelligence approach for interpreting the massive multiple networks in Section 3. The conclusions are provided in the Discussion section.

Computational Strategies
The expression levels of p regulator genes are given as X = (x 1 , ..., x n ) T ∈ R n×p , where x i = (x i1 , ..., x ip ), which control target gene transcription y ∈ R n for n cell lines, = 1, ..., q.The gene regulatory network can be described by the following linear regression framework, where β = (β 1 , ..., β p ) T is the regression coefficient vector that represents the effect of p regulator genes on the th target gene, and i is a random error vector.
The L 1 -type regularization methods have been used to estimate gene regulatory networks, e.g., elastic net [12], where where λ > 0 is a regularization parameter controlling the degree of shrinkage applied to β , and 0 ≤ δ ≤ 1 is a mixing parameter between the ridge [13] and lasso [14] penalties.
Although the L 1 -type regularization methodologies successfully estimate gene regulatory networks, the methods provide an averaged gene network that provides an average representation across various cell lines However, molecular interplays exhibit diverse structures depending on the characteristics of cell lines.Figure 1 shows the correlation between two genes (i.e., UPF1 and DPM1) in cell lines corresponding to low (drug-sensitive), middle (drug-moderate), and high (drug-resistant) IC50 values of an anti-cancer drug (i.e., capecitabine).As shown in Figure 1, the two genes show a positive correlation in the drug-sensitive cell lines, whereas a negative correlation is seen in the drug resistance cell lines.Furthermore, the positive and negative correlation patterns between two genes cannot be discerned without considering the characteristics of cell lines (i.e., all cell lines).This implies that the characteristics of cell lines should be considered when estimating gene regulatory networks to extract crucial information for precision medicine.That is, cell line-specific gene regulatory networks are essential to accurately uncover the gene regulatory system for the specific characteristics of each cell line.RNA expression data were obtained from the CCLE dataset, and drug sensitivity data were downloaded from the Genomics of Drug Sensitivity in Cancer Project.
where β (m α ) is the regression coefficient vector that describes the effect of p regulator genes on the th target gene in the α th cell line.The α th cell line has m α as a cancer-related characteristic, where the characteristic of cell lines is referred to as a modulator M (e.g., EMT status of cell lines).To estimate the varying coefficient β (m α ), Shimamura et al. [4] developed the kernel-based L1-type regularization method as follows, where P{β (m α )} is the recursive elastic net penalty, and is the Gaussian kernel function used for grouping cell lines according to their cancer-related characteristics (i.e., modulator m i for i = 1, ..., n).In other words, the Gaussian kernel function is employed to quantify the similarity between the characteristics of cell lines and determine the weights that control the influence of samples in estimating the gene network for the α th cell line.Thus, we can estimate β (m α ) based only on cell lines having similar characteristics m i to those of the target cell line m α .This implies that NetworkProfiler has the ability to identify specific molecular interactions for cancer-related statuses of cell lines (e.g., EMT status, drug sensitivity, cancer progression-specific gene regulatory networks).In practice, genomic datasets frequently contain outliers originating from diverse sources, such as experimental errors, coding errors, and other factors.However, L 1 -type regularization methods and NetworkProfiler suffer from outliers because the methods are based on the least-squares loss function.Thus, the network estimation and edge selection procedures are disturbed in the presence of outliers.In short, we cannot effectively perform personalized gene network analysis.In the subsequent subsection, we review robust personalized gene network analysis.

Robust NetworkProfiler
In practice, clinical and genomic alteration datasets typically encompass outliers arising from multiple sources (e.g., experimental errors, reporting or labeling errors, etc.).The genomic dataset consists of a substantial number of features (e.g., genes) and a limited number of samples (e.g., cell lines).This type of data is called high-dimensional data.Identifying and managing outliers in a high-dimensional genomic dataset are crucial and challenging tasks.To effectively detect outliers in a high-dimensional genomic dataset, Park et al. [16] considered the following robust Mahalanobis distance computed in the robust principal component space, where T r.pc and C r.pc are the robust mean and covariance matrices, respectively, estimated using the minimum volume ellipsoid, and T is a κ-dimensional matrix of the robust principal components.By using the robust Mahalanobis distance, Park et al. [16] developed the following weight to control outliers in high-dimensional genomic data, where k = χ 2 (df = κ) is the 95% quantile of the χ 2 (df = κ) distribution [17].Park et al. [16] incorporated the weight into the kernel-based L 1 -type regularization method as follows The robust NetworkProfiler detects data points as outliers if their R.MD r.pc i is greater than the 95th percentile of the χ 2 (d f = κ) distribution, and then reduces the effect of the detected outliers by applying the weight R κ i to the network estimation.Due to its robust nature, NetworkProfiler can effectively estimate personalized gene networks, even in the presence of outliers.

Uncovering Changes in Gene Regulatory Networks in the Epithelial-Mesenchymal Transition
We review the application results of NetworkProfiler for gene network analysis under varying EMT statuses of cell lines.Shimamura et al. [4] applied NetworkProfiler to reveal system changes under varying EMT statuses of cell lines.EMT status-specific gene networks were estimated using the EMT modulator that describes the EMT statuses of cell lines, defined using the module discovery method [18].This method is based on 50 genes labeled as EMT-related genes (i.e., EMT-UP, EMT-DN, JECHLIN-GER-EMT-UP, and JECHLIN-GER-EMT-DN) in the Molecular Signatures Database v2.5, where low and high values of the EMT modulator represent the epithelial-and mesenchymal-like cell lines, respectively.The expression profiles of 13,508 genes in 762 cell lines were obtained from the Sanger Cell Line Project (http://bonsai.hgc.jp/~shima/NetworkProfiler/,accessed on 25 June 2023).The gene networks were estimated with 13,508 target genes and 1732 regulator genes, consisting of 47 nuclear receptors, 1183 transcription factors, and 502 human miRNA.For the 762 modulator values describing the EMT statuses of 762 cell lines, 762 gene regulatory networks between the 13,508 target and 1732 regulator genes were estimated.
To reveal system changes in the context of EMT, Shimamura et al. [4] focused on the well-known EMT marker, E-cadherin, because the loss of the cell adhesion molecule E-cadherin is a biomarker of EMT.Then, they identified candidate regulators of E-cadherin based on the following regulatory effect of the j th regulator on the l th target gene (i.e., E-cadherin) at the α th cell line, where π j α is the set of all possible paths from x j (i.e., regulators) to y (i.e., E-cadherin), and β(j→ ) (m α ) is the product of the estimated coefficients on the th path in π j α , where the length of the path from x j to y was regarded as 1 or 2. In other words, they considered the parent and grandparent genes as potential regulatory factors.Then, the following regulatory effect change according to the EMT status of cell lines was computed, to measure how the EMT status affects the regulatory effect of the regulators on E-cadherin.The highest ranked 25 genes corresponding the highest REC values were extracted as candidate regulators of E-cadherin: IRF6 (-), miR-141 ( Among the 25 candidate regulators identified, about half of the genes had wellestablished evidence supporting their regulatory roles in E-cadherin, whereas the mechanisms of the remaining genes were yet to be revealed.NetworkProfiler was employed to predict the mechanistic interpretations of the E-cadherin-related regulatory system as follows: • The expression of miR-141 had a strong positive effect on the expression of E-cadherin in epithelial-like cells, whereas this effect decreased as the transition from epithelialto mesenchymal-like cell lines occurred.

•
The expression of ZEB1 had a weak negative effect on the expression of E-cadherin in epithelial-like cells, whereas this effect increased as the transition from epithelial-to mesenchymal-like cell lines occurred.
• miR-141 and ZEB1 had a strong negative effect on each other only in epitheliallike cells.
The findings suggest the existence of an adverse feedback loop between miR-141 and ZEB1 in epithelial-like cells, and this interaction had been previously revealed in [27].

•
As the transition from epithelial-like cells to mesenchymal-like cells occurred, the expression levels of miR-141 and E-cadherin decreased, whereas the expression level of ZEB1 increased.
Shimamura et al. [4] suggested, based on the aforementioned results, that the inhibition of miR-141 in mesenchymal-like cells disrupts the adverse feedback loop between miR-141 and ZEB1, consequently resulting in decreased expression of E-cadherin due to the increased expression of ZEB1.

Limitations of Current Personalized Gene Network Analysis
Personalized gene network analysis (e.g., EMT status-specific gene network) generates massive multiple networks, where each network is represented in matrix form with 1732 columns and 13,508 rows.This indicates that the interpretation of the analysis of EMT status-specific gene networks involves the examination of 762 extensive matrices, with each matrix corresponding to a distinct cell line.
Although NetworkProfiler enables us to explore changes in the system under different EMT status conditions, the existing study focused only on the well-known EMT marker, E-cadherin, and then interpreted the results based on the neighboring genes of E-cadherin.This approach was taken as the comprehensive analysis and interpretation of the extensive multiple networks were not feasible.However, the narrow interpretation is insufficient to understand the complex mechanism of disease.To effectively advance precision medicine, a comprehensive interpretation of the massive multiple gene networks is essential.

Explainable Artificial Intelligence (XAI) for Comprehensive Gene Network Analysis
In this section, we review the explainable artificial intelligence approach, known as Tensor Reconstruction-based Interpretable Prediction (TRIP), for the analysis of massive multiple networks [10].In recent years, artificial intelligence has garnered considerable attention in various fields of research, statistics, computer science, biomedical, etc.Although AI approaches have shown significant success in terms of prediction or classification accuracy, the methods frequently encounter the black-box problem, wherein the decisionmaking process of AI machines cannot be explained due to the highly intricate nature of the deep learning model's decision rules.The utilization of AI in the medical field is limited because a basis for explanation is required.To address the black-box problem, Maruhashi et al. [10] developed an XAI strategy called TRIP, which is a deep learning approach for tensor decomposition.It aims to find a subspace of the data from multiple networks that minimizes prediction error while retaining as much of the data information as possible.TRIP estimates a human-readable low-dimensional subspace and performs predictions based on the estimated subspace.Thus, we can effectively interpret and understand the analysis of massive multiple networks because the decision boundaries of a model can be efficiently visualized in a lower human-readable dimension, even though the model is learned through a deep learning approach.We briefly review the mathematical formula of TRIP in the following subsection.

Method: Tensor Reconstruction-Based Interpretable Prediction (TRIP)
The gene network connecting target genes with regulator genes can be regarded as a second-order tensor.Assume a K-mode tensor X with size I 1 × • • • × I K .TRIP estimates a projection matrix C (k) ∈ R I k ×J k for X and projects X onto a lower-dimensional subspace using the estimated C (k) , i.e., Then, the projected tensor X i is used as the input for the prediction or classification model, as follows ŷi = f (X i , θ) for i = 1, ..., n. ( This implies that the prediction or classification is conducted within the estimated human-readable low-dimensional subspace.The projection-based prediction leads to more explainable and interpretable results in the analysis of multilayer gene networks because the complex massive multiple gene networks are effectively visualized within a human-readable dimensional subspace. TRIP estimates the projection matrices C (k) and learns the deep learning-based prediction model simultaneously by minimizing not only the prediction error but also the subspace estimation error.The objective function of TRIP is given as where γ > 0 is the tuning parameter for the projection error.The first term L(y i , ŷi ) and the second term are the loss functions for predicting and estimating the projection matrices, respectively.As shown in the objective function ( 14), TRIP learns a model by minimizing these two loss functions simultaneously.In other words, the projection matrices C (k) are estimated to minimize errors in both projection and prediction.This implies that TRIP enables us to achieve effective prediction results while still retaining a significant amount of the original data variance within the estimated subspace.For details about TRIP, please refer to Maruhashi et al. [10]

Comprehensive Interpretation of the Massive Multiple Gene Networks across Varying EMT Statuses
Park et al. [11] applied TRIP to estimate 762 gene regulatory networks across varying EMT statuses of cell lines.The gene network between 13,508 targets with 1762 regulators was regarded as a second-order tensor.TRIP first estimated the projection matrices C (k) , k = 1,2 for the regulator and target genes axes and learned a 50 × 50 subspace of the 762 networks.Then, 50 crucial components that describe the importance of the target and regulator genes for EMT-modulator prediction were extracted.Park et al. [11] interpreted the results based on the crucial components of the subspace for regulator genes.Among the estimated 50 components, the first three components explained approximately 70% of the variability in the regulator genes within EMT-related gene networks (first component: 56%; second component: 8%; third component: 4%).Then, the interpretation of the EMT networks was conducted using the first three components.Figure 2 shows the overall framework of the EMT network analysis.In other words, EMT status-specific gene networks were estimated for 762 cell lines.To interpret the massive multiple gene networks, the explainable AI technique, TRIP, was applied.The interpretation of these networks was based on the first three crucial components of the regulator axis.In order to achieve a more biologically reliable interpretation, Park et al. [11] combined the results obtained using TRIP with well-known EMT markers, i.e., ZEB1, ZEB2, SNAIL1, SNAIL2, and TWIST1 (EMT-TFs).The target networks of the five EMT-TFs were extracted separately for the high-and low-value regions of each component.The target genes (TG.EMT-TFs) of the EMT-TFs and the target genes of the TG.EMT-TFs were also extracted.For the networks in the high and low regions of each component, the binary adjacency matrices were computed.The 10 genes with the most significant differences in edge structure between the binary adjacency matrices for the high and low regions of each component were extracted.Table 1 shows the identified genes from the three components and their EMT-related evidence, where " " in the column "In NetworkProfiler" indicates that the genes were also identified as top-25-ranked regulators of E-cadherin in the previous study of EMT status-specific gene network [4].
Most of the EMT markers identified using TRIP have been supported by strong evidence as EMT markers, and their EMT-related mechanisms have been previously reported in multiple studies.This implies that TRIP has provided biologically reliable insights into the identification of EMT-related mechanisms.Out of the 17 genes identified, GRHL2, IRF6, LSR, and OVOL2 were also identified as EMT markers (i.e., regulators of E-cadherin) in the previous EMT network analysis using the NetworkProfiler [4].The EMT-related mechanisms of genes such as GRHL2, IRF6, LSR, and OVOL2 were not known a decade ago.However, over the past decade, the EMT-related mechanisms of the genes have been uncovered.Interestingly, the EMT status-specific gene networks were estimated a decade ago based on data from that time, and TRIP was applied to the data to identify the EMT-related mechanisms.This implies that the data from 10 years ago already contained insights into EMT mechanisms that have since been revealed in the past decade.Because of a shortage of computational strategies for comprehensive analysis of the large-scale gene regulatory network, we were unable to reveal EMT-related mechanisms a decade ago (e.g., GRHL2, IRF6, OVOL2, LSR shown in Table 1).The application of the explainable AI technique, TRIP, allowed us to uncover the EMT-related mechanisms that have been unveiled over the past ten years in a single, unified analysis.Reviews on EMT gene network analysis indicate that the application of explainable AI could usher in a new era for computational network biology, potentially revolutionizing our understanding of intricate disease mechanisms.

Discussion
The aim of this review paper was to provide insights into the computational strategies for cell line-specific gene networks and the use of the explainable AI approach to overcome the bottleneck of existing black-box AI (i.e., interpreting the massive multiple gene networks).We also reviewed computational strategies for analyzing massive multiple gene networks.Although many studies have been conducted on gene network analysis, the existing studies were related to averaged gene networks for all cell lines (or patients), which cannot provide cell line-specific molecular characteristics.To address this issue, recently, computational strategies for cell line-specific gene network estimation have been developed.Cell line characteristic-specific gene network analysis enables us to uncover gene regulatory systems for specific characteristics of cell lines.The gene networks identified through this approach provide indispensable information that can significantly contribute to precision medicine.However, interpreting the massive multiple gene networks remains a challenge, mainly due to the extensive scale of the networks involved.Each interpretation subject entails hundreds of massive networks, comprising more than 10,000 rows and over 1000 columns.We reviewed the computational strategies for cell line-specific gene network estimation and the explainable AI approach for interpreting the estimated massive multiple gene regulatory networks.We also reviewed the analysis results of the gene regulatory system changes under varying EMT statuses of cell lines.Based on the reviews of two studies, it was found that the explainable AI approach, TRIP, has successfully unraveled the EMT-related mechanisms that have been discovered over the past ten years.This implies that the lack of computational strategies for the comprehensive analysis of large-scale gene networks has hindered the identification of the EMT-related mechanisms that have been revealed in the past ten years.We expect that explainable AI methods, such as TRIP, can offer insightful and interpretable solutions for understanding the complex dynamics of cancer network biology.
In this article, we have reviewed computational strategies for cell line-specific gene network analysis and their application for EMT status-specific gene networks.The Connectivity Map is one of the most powerful tools for identifying connections among small molecules sharing a mechanism of action, chemicals and physiological processes, and diseases and drugs [61].It can be suggested that the use of a Connectivity Map with the reviewed strategies can lead to more comprehensive results, especially in the interpretation of drug-related characteristic-specific gene network analysis.

Figure 2 .
Figure 2. Overall framework of EMT network analysis: (1) Estimation of EMT status-specific gene networks for 762 cell lines.(2) Interpretation of the massive multiple gene networks using the explainable AI technique, TRIP.(3) Interpretation of the networks based on the first three crucial components of the regulator axis.