Integrative Long Non-Coding RNA Analysis and Recurrence Prediction in Cervical Cancer Using a Recurrent Neural Network

Senthilkumar, Geeitha; Pitchaimuthu, Renuka; Panneerselvam, Prabu Sankar; Alagarswamy, Rama Prasath; Dhanasekaran, Seshathiri

doi:10.3390/diagnostics15222848

by Geeitha Senthilkumar ¹, Renuka Pitchaimuthu ¹, Prabu Sankar Panneerselvam ², Rama Prasath Alagarswamy ³ and Seshathiri Dhanasekaran ⁴,*

¹ Department of Information Technology, M. Kumarasamy College of Engineering, Thalavapalayam, Karur 639113, Tamil Nadu, India

² Shanmuga Hospital, Salem 636007, Tamil Nadu, India

³ Department of Computer Applications, SRM Institute of Science and Technology, Trichy 620015, Tamil Nadu, India

⁴ Department of Computer Science, UiT The Arctic University of Norway, 9037 Tromsø, Norway

* Author to whom correspondence should be addressed.

Diagnostics 2025, 15(22), 2848; https://doi.org/10.3390/diagnostics15222848

Submission received: 19 June 2025 / Revised: 23 September 2025 / Accepted: 7 November 2025 / Published: 10 November 2025

(This article belongs to the Special Issue Applications of Machine Learning in Obstetrics and Gynecology)

Abstract

Background: Recurrent cervical cancer is one of the most serious threats to patient survival, underscoring the need for prognostic models that identify high-risk patients. Objectives: This study integrates clinical data with the GSE44001 dataset to identify key risk factors associated with the recurrence of cervical cancer, to stratify patients into high-, moderate-, and low-risk groups using selected clinical and molecular features, and to identify a long non-coding RNA (lncRNA) gene signature associated with recurrent cervical cancer. Methods: From the clinical data collected, 138 patients with recurrent cervical cancer were identified. The GSE44001 dataset was downloaded from the NCBI GEO database, and long non-coding RNAs were filtered using GENCODE annotation. The dataset was then linked with the filtered lncRNAs. The Least Absolute Shrinkage and Selection Operator (LASSO) was employed to select features in the gene expression analysis, and risk factors for recurrent cervical cancer were identified. A risk value was assigned to each individual based on the selected lncRNAs and the corresponding LASSO coefficients. Results: The RNN Long Short-Term Memory model demonstrates prognostic value: high-risk patients experience shorter recurrence-free survival (p < 0.05). Patients with recurrent, progressive cervical carcinoma were associated with ATXN8OS, C5orf60, and INE1. In contrast, patients diagnosed at earlier stages were associated with KCNQ1DN, LOH12CR2, RFPL1S, and KCNQ1OT1. Patients at moderate stages were primarily associated with EMX2OS. Conclusions: The findings demonstrate that the nine-lncRNA signature, when combined with deep learning, offers a powerful approach for recurrence risk stratification in cervical cancer.

Keywords:

prognosis; long non-coding RNA; biomarker; recurrent neural network; recurrent cervical cancer

1. Introduction

Cervical cancer is a leading cause of cancer-related mortality among women worldwide. Despite many advances in treatment, recurrence rates remain significant and are a hurdle to long-term survival, so identifying accurate risk factors for recurrence is important. Clinical parameters, such as tumor stage, lymph node metastasis, and tumor diameter, are associated with the risk of recurrence. Many studies have shown that long non-coding RNAs (lncRNAs) play a vital role in tumor progression and prognosis. lncRNAs such as HOTAIR and MALAT1 have been linked to cervical cancer progression, but the predictive potential of lncRNA signatures for recurrence has received limited attention. A gap in the existing literature is that prior studies investigated either clinical factors or molecular features independently, without combining both sources of data in an integrated forecasting model. This research aims to integrate real-world clinical data with transcriptomic data (GSE44001) to identify key recurrence-related factors and construct a deep learning-based predictive model. By combining clinical and molecular features, the work aims to develop a clinically useful tool for assessing recurrence risk in cervical cancer patients. Patients are stratified into high-, moderate-, and low-risk groups using selected clinical and molecular features, and a long non-coding RNA (lncRNA) gene signature associated with recurrence in cervical cancer is identified.

Preventing cervical carcinoma is possible with early detection, as it is the most prevalent cancer in women [1]. Cervical cancer causes many deaths in women [2]. Recurrence of cervical cancer is the development of a tumor in a local area or distant metastasis within six months following the completion of initial treatment. Recurrence commonly occurs within two years in two-thirds of patients who have undergone cervical carcinoma treatment. Computed tomography and magnetic resonance imaging are the main techniques used to identify recurrent cervical cancer [3]. Forecasting the recurrence of cervical cancer is challenging, and an AI-powered neural network method has been proposed to address it [4]. Between 1994 and 2004, the MD Anderson Cancer Center treated 28 individuals who underwent pelvic exenteration for recurrent cervical carcinoma [5].

Doctors from 21 medical centers in Norway surveyed 58 cervical cancer patients between 2012 and 2016. Medical and self-reported characteristics were collected using a standardized survey [6]. The risk of recurrence and mortality is increased in patients with larger tumor size, advanced FIGO stage, and lymph node involvement. As tumor size increases, the likelihood of recurrence also increases. Because of these results, gynecologists schedule regular appointments and assess the likelihood of tumor recurrence in patients with cervical cancer [7]. Medical professionals identify recurrent cervical cancer by evaluating the risk factors with their clinical expertise. Both scientific study and clinical practice attempt to identify significant risk factors for recurrence [8]. Clinicians categorize patients by prognosis, assign them to treatment studies, and perform precise and repeated evaluations to improve results [9]. One study employs K-means clustering, principal component analysis, and multivariate Cox survival analysis to assess the variables that influence recurrence prediction, and evaluates eight distinct machine learning methods using logistic and Cox models [10]. After optimization of some variables, Cox regression analysis demonstrates that treatment given, clinical phase, and premature delivery were all substantially linked to the recurrence of cervical cancer [11].

Patients at stage 1A1 did not require specialized medical follow-up after cervical cancer treatment. This conclusion is supported by the fact that, during the previous nine years, there were only eight recurrences and one death among the 510 patients in this cohort [12]. Using Cox regression analysis, four long non-coding RNAs (lncRNAs) were significantly linked to a lower recurrence-free outcome in cervical cancer, and MIR22HG was identified as a predictive biomarker for cervical carcinoma [13]. Another study examines the relationship between diagnostic techniques and their effectiveness in relation to survival rates across various clinicopathological features [14]. A literature search on recurrent cervical cancer treatment and care was conducted in the Medline and CancerLit databases, with randomized studies given priority in the collection and analysis of all pertinent information [15]. A further study sets out to determine the clinical value of the lncRNA SPRY4-IT1 in the progression of cervical carcinoma and to assess its expression level.

According to research, lncRNA SPRY4-IT1 is a novel molecule associated with the progression of cervical carcinoma. As such, it could be a helpful target for therapy, as well as a potential prognostic indicator [16]. A recurrent neural network (RNN) is employed to process information related to gene expression, as described in the research. LSTM and GRU, two varieties of RNNs, were examined. A proposed method for improving the RNN’s design and hyperparameter scores calculates the sample precision for classification, as well as the F1 score. This parameter enables assessments of sample distribution into the appropriate classes [17].

2. Related Works

Multiple bioinformatics approaches are used to detect 143 differentially expressed genes (DEGs) related to cervical carcinoma and show that these genes contribute to cervical cancer development [18]. A Gated Recurrent Unit-based recurrent neural network forecasts clinical outcomes, showing its strength on sequence data and its ability to retain past information [19]. Researchers employed data mining through the NCBI Gene Expression Omnibus to ascertain the medical importance of certain DEGs in cervical carcinoma in women [20]. A nine-lncRNA signature achieved higher predictive accuracy than FIGO stage [21].

To investigate the survival experience of cervical cancer patients who experience recurrence, a recurrence model based on the nomogram is used, analyzing the signatures of long non-coding RNA [22]. Early immunological status assessment of cervical cancer patients is facilitated by a prognostic framework based on Lymph node metastasis-relevant long non-coding RNAs (lncRNAs), as outlined in the HSIC Model [23]. A total of 289 RNA sequence data and related clinical information were acquired. Two of the forty-nine lncRNAs that we found to be differentially expressed were linked to prolonged life for patients with cervical cancer. Collectively, both of these lncRNAs (RAS4CP and ILF3-AS1) have been identified to form a single predictive characteristic.

In contrast, women with low-risk cervical cancer had an improved prognosis and saw a significant correlation in overall survival (p < 0.001). Subsequent investigation revealed that the two-lncRNA expression profile could potentially be used as a stand-alone indicator of the prognosis of cervical cancer patients [24]. Identifying the risk factors of recurrent cervical cancer is essential for treatment management and improved outcomes [25]. A prognostic scoring model is developed to predict patient outcomes through integrative analyses that identify ion channel-related long non-coding RNAs (lncRNAs) [26]. Recently discovered functions of lncRNAs in cervical carcinoma relate to metabolic reprogramming, HPV control, therapy resistance, cancer metastasis, and cancer development, and lncRNAs have vast potential as indicators for the detection of cervical carcinoma [27].

If clinicians knew the prognosis of recurrence, patients could receive therapy earlier. A hybrid approach is recommended to address this issue: features are extracted via transfer learning and then combined with conventional machine learning to assess a patient's likelihood of metastasis and recurrence [28]. Another method performs multi-label classification on a patient's medical visit data by extending a long short-term memory network with two distinct approaches [29]. Language, time series, and speech are examples of sequential data that can be handled by neural network architectures such as the recurrent neural network (RNN). An RNN's fundamental idea is that it retains links to previous states, which allows the network to retain knowledge about earlier parts of the sequence [30,31,32].

3. Materials and Methods

Clinical information is collected from patients with cervical cancer undergoing therapy at Shanmuga Hospital, Salem. The data consists of complete demographic details of patients with cervical cancer, containing several variables that might affect mortality and recurrence.

The research analyzed a total of 739 cervical cancer patients with 29 clinical and pathological features, of whom 138 experienced recurrence during continuous follow-up [33]. The proposed architecture, illustrated in Figure 1, involves preprocessing the dataset by imputing missing values using the median (numerical features) and mode (categorical features), encoding categorical variables through label encoding, and normalizing the data using MinMax scaling. The most important features are selected using Recursive Feature Elimination. A Long Short-Term Memory (LSTM) recurrent neural network is deployed to find key risk factors associated with recurrence. The clinical high-risk factors are tumor stage, lymph node metastasis, tumor size, PV/PR examination, and disease-free survival rate. To enhance the analysis, the publicly available GSE44001 dataset is acquired from the NCBI GEO. The Information Gain method is used for feature selection. GENCODE annotation is used to filter out the long non-coding RNAs [34].

The samples are matched to the original dataset based on stage and recurrence status. Patients are classified into high-, moderate-, and low-risk groups using the selected clinical and molecular features. Significant long non-coding RNAs (lncRNAs) were identified through LASSO regression, and individual risk scores were calculated using the corresponding LASSO coefficients to evaluate the recurrence risk for each patient [35]. Table 1 summarizes recurrence across selected categories of cervical cancer patient characteristics.

Disease-free survival and overall survival show a strong positive correlation, with a correlation coefficient of 0.71. This indicates that prolonged disease-free survival is linked with extended overall survival in patients. Negative coefficients indicate an inverse relationship between the variables, as illustrated in Figure 2.

Table 2 represents the statistical table for numeric values in the dataset. This table presents the mean and median calculations, along with the standard deviation, for numeric columns to extract information from the dataset.

3.1. Data Preprocessing

Preprocessing is an important stage that prepares the data so that neural network algorithms can function effectively. Missing values in categorical features are replaced with the mode, and missing values in numerical features are replaced with the median [36].

Handling missing values is crucial for creating a well-structured dataset and obtaining consistent results. For a categorical attribute $Y$ with values $(y_1, y_2, \dots, y_n)$, missing entries are replaced with the value that appears most frequently in $Y$. In Equation (1), the mode is defined as the element $y$ of $Y$ that has the highest frequency of occurrence.

$$\mathrm{Mode}(Y) = y_{max} \tag{1}$$

where $y_{max}$ is the value of $y$ that appears the maximum number of times.

Median imputation is used for numerical data because the median is robust to outliers. In Equation (2), the median is the middle value of the data sorted in ascending order. If the number of elements $m$ is odd, the median is the middle value; if $m$ is even, the median is the average of the two middle elements.

$$\mathrm{Median}(Z) = \begin{cases} z_{(m+1)/2} & \text{if } m \text{ is odd} \\ \dfrac{z_{m/2} + z_{(m/2)+1}}{2} & \text{if } m \text{ is even} \end{cases} \tag{2}$$

where $Z = (z_1, z_2, \dots, z_m)$ is the sorted data and $m$ is the number of elements.
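The two imputation rules in Equations (1) and (2) can be illustrated with a minimal Python sketch. It uses only the standard library; the function name `impute_column` and the toy columns are hypothetical and are not taken from the study's data or code.

```python
from statistics import median, mode

def impute_column(values, kind):
    """Fill None entries with the median (numeric) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns; None marks a missing entry.
tumor_diameter_cm = [2.0, None, 4.5, 3.0, 10.0]
stage = ["I", "II", None, "II", "III"]

print(impute_column(tumor_diameter_cm, "numeric"))
print(impute_column(stage, "categorical"))
```

In the numeric column the outlying value 10.0 barely moves the fill value, which is the robustness argument made for the median.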

Table 3 details the proportion of missing data per feature. Label encoding is used to transform all non-numerical categorical variables into numeric representations. A dataset with categorical features, such as sex, diagnosis categories, and imaging techniques, needs to be converted into numerical form for further analysis. In label encoding, each category of a feature is assigned a unique numerical value.

For a categorical feature $y$ with categories $\{g_1, g_2, \dots, g_n\}$, label encoding is the mapping

$$f : y \to z, \quad \text{where } f(g_i) = i - 1 \quad (\text{i.e., } g_1 \to 0,\ g_2 \to 1,\ g_3 \to 2,\ \dots,\ g_n \to n - 1) \tag{3}$$

Mathematically:

$$f(y) = i - 1, \quad \text{where } y = g_i \text{ and } i \in \{1, 2, \dots, n\} \tag{4}$$

Equations (3) and (4) state that if $y = g_i$, then $f(y) = i - 1$, so $f(y)$ produces a numerical value in $\{0, 1, \dots, n - 1\}$ for the categorical variable $y$. For example, if Stage (one of the attributes in the dataset) has four categories, Stage 1 is encoded as 0, Stage 2 as 1, and so on.

Machine learning methods, such as Random Forest, require numerical inputs; therefore, categorical parameters need to be converted into numerical values.
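As a small illustration of the label encoding described above, the following Python sketch maps each category to an integer from 0 to n - 1. Assigning codes in sorted order is an assumption made here for determinism; the helper name `fit_label_encoder` and the example values are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer 0..n-1, in sorted order."""
    categories = sorted(set(values))
    return {category: index for index, category in enumerate(categories)}

# Hypothetical Stage column with four categories.
stages = ["Stage 2", "Stage 1", "Stage 4", "Stage 3", "Stage 1"]
mapping = fit_label_encoder(stages)
encoded = [mapping[s] for s in stages]

print(mapping)
print(encoded)
```

The integer codes carry an implied order, which suits an ordinal attribute such as stage but may be less appropriate for unordered categories.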

The proposed model uses MinMaxScaler to scale each attribute $z$ in the dataset to the range 0 to 1, as shown in Equation (5). Features in the dataset have different ranges, and the LSTM RNN model trains more reliably on normalized inputs.

$$Z_{min\text{-}max} = \frac{z - z_{min}}{z_{max} - z_{min}} \tag{5}$$

where $z$ is the original value of an attribute in the data, $z_{min}$ and $z_{max}$ are the minimum and maximum values of that attribute, and $Z_{min\text{-}max}$ is the normalized value.
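Equation (5) can be written as a short Python function. This is a sketch in plain Python rather than the MinMaxScaler class itself; returning zeros for a constant column is an assumption made here to avoid division by zero, and the example values are hypothetical.

```python
def min_max_scale(values):
    """Scale a numeric column to [0, 1] following Equation (5)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the denominator would be zero, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical tumor diameters in cm.
print(min_max_scale([2.0, 4.0, 6.0]))
```

In a cross-validated pipeline, the minimum and maximum would normally be taken from the training fold only and then applied to the held-out fold.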

LSTM can capture intricate, non-linear relationships in high-dimensional long non-coding RNA (lncRNA) expression and clinical data, which gives LSTM networks a distinct benefit over conventional models. Its gated design can also be applied in cases where the input is not strictly sequential. As a result, compared with conventional analytical methods, LSTMs are well suited to forecasting the recurrence of cervical cancer.

Neural network models, particularly the Long Short-Term Memory (LSTM) model, are sensitive to the scale of their inputs. MinMaxScaler ensures that each attribute makes a comparable contribution. The detected lncRNAs and clinical variables carry the prognostic value and are the primary inputs of the predictive model.

3.2. Feature Selection

Feature selection identifies the important attributes in a given dataset, which helps the algorithm concentrate on the relevant data and reduces the chance of overfitting. A Random Forest classifier is used as the base model for Recursive Feature Elimination (RFE) [37]. RFE repeatedly refits the model and eliminates the least significant features until the required number of features is obtained. This allows the model to retain the most essential features while reducing its complexity, which may improve generalization.

The dataset consists of 29 features, $Z = \{z_1, z_2, \dots, z_{29}\}$. The goal is to shortlist the $k$ most important attributes, where $k < 29$, to enhance the model's performance.

Let $Z_t$ be the set of features at step $t$ and $I_t$ the corresponding set of feature importance scores. Train the model $N$ on $Z_t$ and calculate

$$I_t = \{I_{t,1}, I_{t,2}, \dots, I_{t,|Z_t|}\} \tag{6}$$

In Equation (6), the model assigns an importance score to each feature, and these scores are collected as $I_t$.

$$z_{min,t} = \arg\min_{i} I_{t,i} \tag{7}$$

Equation (7) identifies the feature with the minimum importance score in $I_t$.

$$Z_{t+1} = Z_t \setminus \{z_{min,t}\} \tag{8}$$

Equation (8) removes that least important feature from the feature set.

The recursion stops once $|Z_t| = k$, leaving the $k$ most important features. Random Forest is capable of capturing the relevance of features, which makes it suitable as the base model for feature selection. The features selected as important include total full-term pregnancies, PV/PR examination, history, largest diameter (in cm), lymph node metastasis, histological type, treatment type, stage, imaging, and disease-free survival rate (in months). These features are used to train the model.
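The elimination loop of Equations (6) to (8) can be sketched in Python as follows. To keep the example dependency-free, the importance score is the absolute Pearson correlation with the target; this is a stand-in assumed here, not the Random Forest importance used in the study, and unlike a refitted model it does not change as features are removed. All names and toy values are hypothetical.

```python
from statistics import mean

def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used as a stand-in importance score."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def recursive_feature_elimination(columns, target, k, importance=abs_correlation):
    """Drop the least important feature until k features remain."""
    remaining = dict(columns)                          # Z_t
    while len(remaining) > k:
        scores = {name: importance(col, target)        # I_t, Equation (6)
                  for name, col in remaining.items()}
        weakest = min(scores, key=scores.get)          # z_min,t, Equation (7)
        del remaining[weakest]                         # Z_{t+1}, Equation (8)
    return list(remaining)

# Hypothetical toy data: two informative columns and one noise column.
columns = {
    "stage": [1, 2, 3, 4, 4, 1],
    "tumor_cm": [1.0, 2.5, 3.0, 5.0, 4.0, 1.5],
    "noise": [3, 1, 2, 3, 1, 2],
}
recurrence = [0, 0, 1, 1, 1, 0]
selected = recursive_feature_elimination(columns, recurrence, k=2)
print(selected)
```

Passing a different `importance` callable, for example one that refits a tree ensemble on the remaining columns, would bring the sketch closer to the procedure described in the text.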

3.3. RNN LSTM

A recurrent neural network model is developed using a long short-term memory (LSTM) architecture for sequential data [38]. The LSTM recurrent neural network analyzes sequential patterns in the selected attributes and uses this subset of relevant features to predict the recurrence of cervical cancer. If the chosen features contain such information, the LSTM determines whether an individual is at significant risk of recurrence by analyzing patterns in clinical factors recorded over time, as shown in Figure 3.

LSTM is well suited to handling information with temporal dependence. Because LSTM also performs well with structured, ordered information, it was employed on this dataset even though the data are not strictly a time series. The architecture consists of two LSTM layers: the first with 64 units and the second with 32 units. A dropout layer with a rate of 0.3 is added after every LSTM layer; by randomly dropping neurons during training, it reduces overfitting and improves generalization. L2 regularization is applied to the LSTM layers to penalize large weights and further limit overfitting [39], which keeps the model from becoming overly dependent on any particular feature. Since forecasting the recurrence of cervical carcinoma is a binary classification problem, a sigmoid activation function is used in the output layer. Algorithm 1 describes the identification of relevant features. The model is compiled with the Adam optimizer and binary cross-entropy as the loss function, and accuracy is reported among the performance metrics.
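To give a sense of the size of the two-layer architecture described above, the following Python sketch counts trainable parameters using the standard LSTM formula (four gate blocks, each with input weights, recurrent weights, and a bias). The input dimension of 10 is an assumption made here, corresponding to one time step carrying the ten selected clinical features; the source does not state the input shape.

```python
def lstm_param_count(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """Fully connected layer: one weight per input per unit, plus biases."""
    return input_dim * units + units

n_features = 10  # assumption: ten selected features per time step

layer1 = lstm_param_count(n_features, 64)   # first LSTM layer, 64 units
layer2 = lstm_param_count(64, 32)           # second LSTM layer, 32 units
output = dense_param_count(32, 1)           # sigmoid output unit
total = layer1 + layer2 + output            # dropout layers add no parameters

print(layer1, layer2, output, total)
```

Under these assumptions the model has a few tens of thousands of parameters for 739 patients, which is consistent with the emphasis the text places on dropout and L2 regularization.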

Algorithm 1. Identification of Relevant Features

First stage: Preparing the Dataset and Setting Up Hyperparameters
Step 1: Represent the gene expression data in the dataset as a matrix.

$$D = (d_{mn})_{i \times j} \tag{9}$$

where $i$ is the number of patient samples, $j$ is the number of selected lncRNA genes used as input features, and $d_{mn}$ is the expression of gene $n$ in patient $m$.
Step 2: Initialization of hyperparameters.
Configure the LSTM network-specific hyperparameters.
Start with one LSTM layer, and use up to three layers at most.
Initial value: $k = 30$ neurons.
Range: $k_{min} = 30$ to $k_{max} = 80$, incrementing by $\Delta k = 5$.
Step 3: Define the activation functions.
Use tanh activation in the LSTM's internal processing.
For the binary classification of recurrence status (recurrence versus non-recurrence), use sigmoid activation.
Step 4: Splitting of data.
$E_{train}$: 70% of the data, used to train the LSTM model.
$E_{test}$: 30% of the data, held out for final evaluation.

2nd Stage: Developing LSTM Models using Hyperparameter Adjustment
Step 5: The LSTM unit performs the following calculations at every step based on the gene expression pattern of each sample.
Step 6: Forget gate: The forget gate regulates which data from the prior cell state should be kept.

$$d_t = \sigma(U_f \cdot [g_{t-1}, y_t] + a_f) \tag{10}$$

where $d_t$ is the forget gate output, $U_f$ and $a_f$ are the forget gate weight matrix and bias vector, $g_{t-1}$ is the previous hidden state, and $y_t$ is the input.
Step 7: Input gate: Chooses which data from $y_t$ should be incorporated into the cell state.

$$j_t = \sigma(U_i \cdot [g_{t-1}, y_t] + a_i) \tag{11}$$

$$\hat{C}_t = \tanh(U_c \cdot [g_{t-1}, y_t] + a_c) \tag{12}$$

where $j_t$ is the input gate activation, $\hat{C}_t$ is the candidate cell state, $g_{t-1}$ is the previous hidden state, and $U_i$, $a_i$, $U_c$, $a_c$ are the respective weights and biases.

Step 8: Cell state update: Updates the cell state by combining data from the past and present.

$$C_t = d_t \odot C_{t-1} + j_t \odot \hat{C}_t \tag{13}$$

where $C_t$ is the updated cell state, $d_t$ is the forget gate output, $C_{t-1}$ is the previous cell state, $j_t$ is the input gate output, and $\odot$ denotes element-wise multiplication.
Step 9: Output gate: Uses the updated cell state to determine the next hidden state $g_t$.

$$O_t = \sigma(U_o \cdot [g_{t-1}, y_t] + a_o) \tag{14}$$

$$g_t = O_t \odot \tanh(C_t) \tag{15}$$

where $O_t$ is the output gate activation, $\sigma$ is the sigmoid function, $U_o$ is the weight matrix for the output gate, $g_{t-1}$ and $y_t$ are the previous hidden state and current input, and $a_o$ is the bias term.

Step 10: Output layer: The output layer receives the last hidden state $g_t$ from the last LSTM cell.

$$Z_t = \sigma(U_{g_0} \cdot g_t + a_{g_0}) \tag{16}$$

where $Z_t$ is the probability of recurrence, $\sigma$ is the sigmoid activation function, $U_{g_0}$ is the weight matrix, $a_{g_0}$ is the bias, and $g_t$ is the hidden state.
Step 11: Train the LSTM model on $E_{train}$ and compute accuracy, F1 score, and loss to track convergence and prevent overfitting.
Step 12: Repeat the hyperparameter adjustment until $k$ reaches $k_{max} = 80$.
Step 13: After determining the optimal configuration, test the model on $E_{test}$ and estimate the final metrics, such as F1 score, accuracy, and ROC-AUC.
3rd Stage: Result analysis and optimal configuration.

A grid search with 10-fold cross-validation was applied to tune the LSTM classifier's hyperparameters. Tuning starts from a single-layer LSTM with 30 neurons, and the number of neurons is increased in increments of ∆k = 5 until the maximum of 80 neurons per layer is reached. Architectures with one to three LSTM layers are assessed, along with different batch sizes, learning rates, and dropout rates. The mean cross-validated AUC was the primary criterion for ranking the candidate models.
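The architecture part of this search space can be enumerated with a short Python sketch. Only the layer counts and neuron counts stated in the text are included; the batch sizes, learning rates, and dropout rates that were searched are not given in the source, so they are left out rather than guessed. The helper `best_config` is a hypothetical illustration of ranking by mean cross-validated AUC.

```python
from itertools import product
from statistics import mean

units_options = list(range(30, 81, 5))   # 30, 35, ..., 80 neurons per layer
layer_options = [1, 2, 3]                # one to three LSTM layers

grid = [{"layers": n_layers, "units": units}
        for n_layers, units in product(layer_options, units_options)]

def best_config(cv_results):
    """cv_results: list of (config, per-fold AUC list) pairs.

    Returns the config with the highest mean AUC across folds.
    """
    return max(cv_results, key=lambda item: mean(item[1]))[0]

print(len(grid), grid[0], grid[-1])
```

Each extra hyperparameter multiplies the number of candidates, and every candidate is trained once per fold, so the full search is considerably more expensive than this architecture-only grid suggests.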

The selected features are given as input to the LSTM model after normalization. The forget gate learns to retain essential information from the prior state, such as high-risk indicators, and to discard the rest. The input gate determines how much new information is written to the cell state. Together, the forget and input gates update the cell state [40], integrating the current input features with the information stored from the previous step. This research employs LSTM because of its ability to capture complex, long-range dependencies among features. The integrated dataset of lncRNA expression profiles and clinical parameters exhibits a quasi-sequential structure due to correlated gene expression patterns and their biological pathway relationships. The inclusion of recurrence-free survival follow-up times introduces a temporal dimension to the prediction task. The gating mechanisms of the LSTM unit allow for the selective retention and forgetting of contextual information, which is advantageous for high-dimensional biomedical data that contain partially imputed values.
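The gate computations of Equations (10) to (15) can be traced with a minimal single-step LSTM in plain Python. This is a sketch for illustration, not the trained model: vectors are lists, weight matrices are lists of rows, and the parameter dictionary keys (`U_f`, `a_f`, and so on) simply mirror the symbols of Algorithm 1. The toy weights are all zero so that the arithmetic is easy to follow by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, vector, bias):
    """Compute U * v + a for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(y_t, g_prev, c_prev, params):
    """One LSTM step: returns the new hidden state g_t and cell state C_t."""
    concat = g_prev + y_t                                   # [g_{t-1}, y_t]
    d_t = [sigmoid(v) for v in affine(params["U_f"], concat, params["a_f"])]      # forget gate (10)
    j_t = [sigmoid(v) for v in affine(params["U_i"], concat, params["a_i"])]      # input gate (11)
    c_hat = [math.tanh(v) for v in affine(params["U_c"], concat, params["a_c"])]  # candidate (12)
    c_t = [d * c + j * ch for d, c, j, ch in zip(d_t, c_prev, j_t, c_hat)]        # cell update (13)
    o_t = [sigmoid(v) for v in affine(params["U_o"], concat, params["a_o"])]      # output gate (14)
    g_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                            # hidden state (15)
    return g_t, c_t

# Toy setup: hidden size 1, input size 2, all weights and biases zero.
zero_row = [[0.0, 0.0, 0.0]]
params = {"U_f": zero_row, "a_f": [0.0], "U_i": zero_row, "a_i": [0.0],
          "U_c": zero_row, "a_c": [0.0], "U_o": zero_row, "a_o": [0.0]}
g_t, c_t = lstm_step(y_t=[0.3, 0.7], g_prev=[0.0], c_prev=[1.0], params=params)
print(g_t, c_t)
```

With zero weights every gate evaluates to 0.5 and the candidate state to 0, so the cell state is simply halved and the hidden state is half of its tanh; learned weights are what let the gates retain or discard specific features.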

In higher-risk cases, the input gate could learn to write disease-free survival into the cell state more strongly, which could represent a powerful signal of recurrence probability. The output gate then emphasizes specific features when producing the hidden state that is passed to the following LSTM cell. If features such as stage and treatment type are associated with malignant recurrence, they have a significant impact on the hidden state. The model learns over multiple steps from these hidden representations, using features such as disease-free survival and recurrence risk, as detailed in Algorithm 1.

The GSE44001 dataset is downloaded from the NCBI Gene Expression Omnibus (GEO); it consists of clinical and gene expression data from 300 cervical cancer patients [41]. The dataset contains clinical features such as tumor stage, the largest diameter of the tumor, disease-free survival (DFS) in months, and DFS status. Kaplan–Meier survival analysis is performed using the GSE44001 dataset to examine stage-wise variations in survival probabilities among patients, showing the difference in recurrence and mortality risk across tumor stages [42].
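The Kaplan–Meier estimate mentioned above multiplies, at each observed event time, the fraction of at-risk patients who remain event-free. A minimal Python sketch is shown below; the follow-up times and event flags are hypothetical toy values, not GSE44001 data, and a stage-wise analysis would apply the same function to each stage's patients separately.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up in months; events: 1 = recurrence observed, 0 = censored.
    """
    survival = 1.0
    curve = []
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        recurrences = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - recurrences / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical disease-free survival data for five patients.
dfs_months = [5, 8, 8, 12, 20]
dfs_status = [1, 1, 0, 1, 0]
for month, prob in kaplan_meier(dfs_months, dfs_status):
    print(month, round(prob, 3))
```

Censored patients contribute to the at-risk count until their last follow-up but never trigger a drop in the curve, which is how the estimator uses incomplete follow-up.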

To detect long non-coding RNAs (lncRNAs) linked with recurrence, gene annotation is performed using the HUGO Gene Nomenclature Committee (HGNC) and GENCODE databases. The Linear Models for Microarray Data (limma) package in R is used to identify lncRNAs that are differentially expressed between the recurrence and non-recurrence groups [43]. The GSE44001 clinical attribute file and gene family information are downloaded using packages in R 4.5.2, merged, and converted into a CSV file containing 29,378 gene IDs and corresponding clinical data for 300 patients. A list of long non-coding RNAs downloaded from GENCODE [44] is mapped to the GSE44001 dataset, and 249 lncRNA gene signatures are identified using the GENCODE and HGNC annotations. The dataset is quantile-normalized to ensure that lncRNA expression values are comparable across samples.
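Quantile normalization forces every sample to share the same distribution of expression values. The Python sketch below shows the basic idea under simplifying assumptions: ties are broken by position rather than averaged, and the two toy samples are hypothetical. It is not the R routine used in the study.

```python
def quantile_normalize(samples):
    """samples: list of equal-length expression vectors, one per patient sample."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the r-th smallest value across samples.
    reference = [sum(s[r] for s in sorted_samples) / len(samples)
                 for r in range(n_genes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])  # gene indices by rank
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical samples measured on three genes.
raw = [[5.0, 2.0, 3.0],
       [4.0, 1.0, 6.0]]
print(quantile_normalize(raw))
```

After normalization each sample contains exactly the same set of values; only the assignment of values to genes differs, which preserves the within-sample ranking of each lncRNA.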

The least absolute shrinkage and selection operator (LASSO) is applied to the training data to find the relevant lncRNAs [45]. LASSO is frequently employed for feature selection in gene expression analysis because it sets the coefficients of less significant variables to zero, leaving a small group of significant lncRNAs. Features that contribute little to the prediction are thereby removed, while the important features are retained.

The objective function for LASSO regression is:

$$\text{objective function} = \frac{1}{2M} \sum_{i=1}^{M} (q_i - P_i \cdot \alpha)^2 + \lambda \sum_{j=1}^{p} |\alpha_j| \tag{17}$$

where $q_i$ is the target variable for patient $i$, $P_i$ is the vector of gene expression values for patient $i$, $\alpha$ is the vector of lncRNA coefficients, $M$ is the number of observations, $p$ is the number of lncRNA predictors, and $\lambda$ is the regularization parameter that controls the amount of shrinkage.

Equation (17) is the objective function of the LASSO regression, whose first term measures the prediction error between the actual and predicted outcomes. The penalty term controlled by $\lambda$ shrinks the coefficients of less important features toward zero, so that only the most informative lncRNAs keep non-zero coefficients.
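How the L1 penalty drives coefficients to exactly zero can be seen in a small coordinate descent solver for the objective in Equation (17). The Python sketch below is for illustration only: coordinate descent with soft-thresholding is one common way to minimize this objective, but the source does not say which solver or software was used, and the design matrix and targets are hypothetical.

```python
def lasso_objective(P, q, alpha, lam):
    """Squared error over 2M plus the L1 penalty, as in Equation (17)."""
    M = len(q)
    squared_error = sum((q_i - sum(p * a for p, a in zip(P_i, alpha))) ** 2
                        for P_i, q_i in zip(P, q))
    return squared_error / (2 * M) + lam * sum(abs(a) for a in alpha)

def soft_threshold(value, lam):
    """Shrink toward zero by lam; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(P, q, lam, n_iter=100):
    """Minimize the objective one coefficient at a time."""
    M, p = len(q), len(P[0])
    alpha = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for P_i, q_i in zip(P, q):
                pred_without_j = sum(P_i[k] * alpha[k] for k in range(p) if k != j)
                rho += P_i[j] * (q_i - pred_without_j)
                z += P_i[j] ** 2
            alpha[j] = soft_threshold(rho / M, lam) / (z / M) if z > 0 else 0.0
    return alpha

# Hypothetical data: feature 0 drives the target, feature 1 barely matters.
P = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
q = [2.0, 2.0, 0.1, 0.1]
alpha = lasso_coordinate_descent(P, q, lam=0.1)
print(alpha, lasso_objective(P, q, alpha, lam=0.1))
```

In this toy case the weak feature's coefficient falls inside the threshold and is set to exactly zero, while the strong feature's coefficient is only shrunk; that is the selection behavior the text relies on.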

A 10-fold cross-validation was applied to optimize the model and prevent overfitting [46].

Based on the selected lncRNAs and their LASSO coefficients, individual risk scores are calculated as:

$$\text{Risk score}_i = \sum_{j \in S} Y_{ij} \cdot \alpha_j \tag{18}$$

where $Y_{ij}$ is the expression level of lncRNA $j$ for patient $i$, $\alpha_j$ is the LASSO coefficient for lncRNA $j$, and $S$ is the set of lncRNAs selected by LASSO.

Equation (18) gives the individual recurrence risk score for each patient, calculated as a weighted sum of the expression values of the selected lncRNAs. Positive coefficients indicate that higher expression increases risk and reduces DFS; negative coefficients indicate a protective effect.
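The weighted sum in Equation (18) and a three-way stratification can be sketched in Python as follows. The gene names, coefficients, expression values, and cut points are hypothetical placeholders; in particular, the source does not state how the boundaries between the low-, moderate-, and high-risk groups were chosen, so the two cut points are an assumption of this sketch.

```python
def risk_score(expression, coefficients):
    """Weighted sum of expression values over the selected lncRNAs."""
    return sum(expression[gene] * weight for gene, weight in coefficients.items())

def stratify(scores, low_cut, high_cut):
    """Label each score low, moderate, or high using two cut points."""
    labels = []
    for score in scores:
        if score < low_cut:
            labels.append("low")
        elif score < high_cut:
            labels.append("moderate")
        else:
            labels.append("high")
    return labels

# Hypothetical coefficients: positive raises risk, negative is protective.
coefficients = {"lncRNA_A": 0.5, "lncRNA_B": -0.25}
patients = [
    {"lncRNA_A": 2.0, "lncRNA_B": 4.0},
    {"lncRNA_A": 4.0, "lncRNA_B": 0.0},
    {"lncRNA_A": 1.0, "lncRNA_B": 8.0},
]
scores = [risk_score(p, coefficients) for p in patients]
groups = stratify(scores, low_cut=-1.0, high_cut=1.0)
print(scores, groups)
```

Cut points could instead be derived from the score distribution, for example at tertiles of the training cohort, provided they are fixed before being applied to held-out patients.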

After the analysis, the nine lncRNA gene signatures—ATXN8OS, C5orf60, DIO3OS, EMX2OS, INE1, KCNQ1DN, KCNQ1OT1, LOH12CR2, and RFPL1S—are found to be significant predictors of cervical cancer recurrence [47]. These features were further validated using Cox regression to derive risk scores, which formed the basis for the composite risk assessment model.

4. Results

Real-time clinical data are gathered from Shanmuga Hospital, Salem, and the GSE44001 dataset is downloaded from the NCBI GEO database. Common high-risk factors are then identified. The clinical high-risk factors are disease-free survival, stage, lymph node metastasis, tumor size, and PV/PR examination. The datasets are linked to long non-coding RNA gene expression data using the following procedure. The GENCODE annotation is used to link gene IDs to known long non-coding RNAs (lncRNAs) and thereby annotate the GSE44001 expression dataset, so that only gene identifiers corresponding to lncRNAs are retained. Patients in the original dataset are matched with the GSE44001 clinical attributes through staging and recurrence status using the R language. Patients are then categorized into reduced-risk, moderate-risk, and elevated-risk groups using the long non-coding RNA attributes selected by LASSO. A risk value is determined for each individual based on the chosen lncRNAs and the associated LASSO coefficients.

The research builds an LSTM model with 10-fold cross-validation, preceded by data preparation and feature selection using RFE. Cross-validation provides a reliable assessment of the model’s performance, and regularization techniques are applied to minimize overfitting. The model is trained, and features are identified to evaluate risk prediction. The integration of clinical data with gene expression profiles enables the system to determine individual risk scores, categorizing patients into three risk groups (reduced, moderate, and elevated risk) based on the selected long non-coding RNAs (lncRNAs) and their associated coefficients.

A two-stage feature selection and validation strategy is implemented to minimize overfitting. Dimensionality is reduced using Random Forest Recursive Feature Elimination (RF-RFE) and Least Absolute Shrinkage and Selection Operator (LASSO) regression, which reduce complexity and identify the most predictive lncRNA features. The number of input variables to the classifier is therefore reduced prior to model training. A recurrent neural network LSTM classifier is then trained using tenfold cross-validation, with recurrence status balanced across folds. Hyperparameters were optimized within the cross-validation loop to prevent information leakage, mitigating the risk of overfitting in the high-dimensional, low-sample-size setting. Figure 4 shows the training and validation accuracy loss for epochs 1 to 50.
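A minimal sketch of how folds can be built so that recurrence status stays balanced is given below. It assumes a simple round-robin assignment within each class and uses the GSE44001 class counts (38 recurrent, 262 non-recurrent) only as an illustrative label vector; the function name is hypothetical and the authors' actual splitting code is not described in the text.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold has a similar class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled members of this class out to the folds in turn.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Illustrative labels: 1 = recurrence, 0 = no recurrence.
labels = [1] * 38 + [0] * 262
folds = stratified_folds(labels, k=10)

for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # Feature selection and hyperparameter tuning would be fitted on `train`
    # only and evaluated on `held_out`, so no held-out data leaks into training.
    assert not set(train) & set(held_out)

print([sum(labels[i] for i in fold) for fold in folds])
```

Keeping every data-dependent step inside the loop is the design choice that prevents information leakage: the held-out fold is never seen when features or hyperparameters are chosen.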

The Kaplan–Meier survival curves by stage for the GSE44001 dataset highlight the variation in survival probability among tumor stages. Figure 5 illustrates the survival rates over time of individuals with cervical carcinoma at Stages IB1, IA2, IB2, and IIA. Throughout the observed time period, patients at Stage IB1 had the best chance of surviving; the curve for this stage remains relatively steady compared with the other stages, indicating a lower recurrence rate. The Stage IA2 survival curve begins similarly to that of Stage IB1 but displays a more rapid decline in survival probability over time, which corresponds to a higher recurrence rate than at Stage IB1. Stage IB2 has a moderate survival probability. Individuals at Stage IIA have the highest risk of recurrence and the sharpest decline in survival probability.
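For readers unfamiliar with how such curves are obtained, the product-limit (Kaplan–Meier) estimate can be sketched as below. The follow-up times and event indicators are invented for illustration and do not come from GSE44001; in practice one curve would be computed per stage.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time per patient (for example, months)
    events : 1 if the event (recurrence) was observed, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        observed = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if observed:
            # Multiply by the fraction of at-risk patients who remain event-free.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical follow-up data for six patients.
times = [5, 8, 8, 12, 20, 24]
events = [1, 1, 0, 1, 0, 1]

for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored patients (event indicator 0) still count in the at-risk set until their last follow-up, which is why the estimate differs from a simple proportion of event-free patients.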

Cervical carcinoma that is confined to the cervix or shows minimal dissemination is referred to as early-stage cervical cancer. Cervical cancer that has extended beyond the cervix but has not spread to distant sites is referred to as locally advanced cervical cancer (LACC) [48]; growth within the pelvic area continues and is more common in the later stages of development. Cervical carcinoma that has progressed outside the pelvic area to distant organs is referred to as advanced-stage cervical cancer. Cervical carcinoma is anticipated to be the first cancer to be eradicated [49].

In Figure 6, a violin plot displays the median and interquartile range (IQR) as white box plots enclosed within each violin, whereas black jittered dots represent individual patient observations. Showing the raw data range together with a statistical summary reduces the masking effect of extreme values. Visual inspection confirms stage-related variation in DFS, with broader violins in advanced stages indicating greater variability. Because DFS is not normally distributed, groups were compared using the Kruskal–Wallis test, a nonparametric technique (p < 0.001). Real-time data collected from Shanmuga Hospital, Salem, are compared with the GSE44001 data for clinical characteristics, specifically tumor size and DFS. GSE44001 consists of 300 samples, of which 38 are recurrent and 262 are non-recurrent.

Comparison of Common Features

In the comparison of the two datasets, tumor size, disease-free survival (DFS), and recurrence status were identified as the common variables. To ensure data compatibility and support the integration of the hospital recurrence dataset with the GSE44001 public dataset, a fuzzy matching algorithm is employed based on clinically relevant variables, including tumor size and disease-free survival.

Figure 7 compares the common features of the two datasets. After normalizing both features, patients were matched by minimizing the Euclidean distance between records, yielding 138 matched pairs with GSE44001 samples. The average difference between matched records was 8.46 months for DFS and 0.105 cm for tumor size, indicating strong clinical consistency across datasets.
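The matching step can be sketched as follows. The text does not specify the normalization or the matching strategy, so this sketch assumes min-max scaling over the combined cohorts and a greedy one-to-one nearest-neighbour assignment; the function name and the records are hypothetical.

```python
import math


def match_patients(hospital, public):
    """Greedy one-to-one matching on (tumor_size_cm, dfs_months) records.

    Returns a list of (hospital_index, public_index, distance) tuples.
    """
    combined = hospital + public
    # Min-max scale each feature over both cohorts (assumed normalization).
    scaled_columns = []
    for col in range(2):
        column = [record[col] for record in combined]
        lo, hi = min(column), max(column)
        scaled_columns.append([(v - lo) / (hi - lo) for v in column])
    scaled = list(zip(*scaled_columns))
    h_scaled, p_scaled = scaled[:len(hospital)], scaled[len(hospital):]

    unused = set(range(len(public)))
    pairs = []
    for i, h in enumerate(h_scaled):
        if not unused:
            break
        # Pick the closest public record that has not been matched yet.
        j = min(sorted(unused), key=lambda k: math.dist(h, p_scaled[k]))
        pairs.append((i, j, math.dist(h, p_scaled[j])))
        unused.remove(j)
    return pairs


# Hypothetical records: (tumor size in cm, DFS in months).
hospital = [(3.0, 20.0), (5.0, 60.0)]
public = [(5.2, 58.0), (2.9, 24.0), (4.0, 40.0)]

pairs = match_patients(hospital, public)
print([(i, j, round(d, 3)) for i, j, d in pairs])
```

A greedy assignment depends on the order in which hospital records are processed; an optimal assignment method could be substituted where a globally minimal total distance is required.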

To statistically validate the comparability of these variables, an independent t-test was performed, yielding a non-significant p-value of 0.1194 for tumor size, which suggests no significant difference in central tendency. The Kolmogorov–Smirnov test detected a mild distributional shift (p = 0.0290), which is expected given real-world heterogeneity. Cohen’s d values of 0.161 for tumor size and 0.335 for DFS indicate small to moderate effect sizes, thereby supporting the biological alignment of the matched cohorts.

Cohen’s d was used to assess effect size:

m = \frac{\bar{Y}_{1} - \bar{Y}_{2}}{X_{p}}

(19)

X_{p} = \sqrt{\frac{(n_{1} - 1) X_{1}^{2} + (n_{2} - 1) X_{2}^{2}}{n_{1} + n_{2} - 2}}

(20)

m: Cohen’s d (effect size).

\bar{Y}_{1}, \bar{Y}_{2}: means of the two datasets.

X_{1}, X_{2}: standard deviations of tumor size in the two datasets.

n_{1}, n_{2}: sample sizes.

X_{p}: pooled standard deviation.

Equation (19) calculates Cohen’s d as the standardized mean difference between the two matched datasets. Equation (20) gives the pooled standard deviation, which is the square root of the weighted average of the two group variances, with weights determined by the sample sizes.
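Equations (19) and (20) translate directly into a few lines of code. The sketch below uses invented values rather than the study data, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample SDs (n - 1 denominator)
    pooled_sd = math.sqrt(
        ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    )
    return (mean(group1) - mean(group2)) / pooled_sd


# Hypothetical tumor sizes (cm) from two datasets.
dataset_1 = [2.0, 4.0, 6.0]
dataset_2 = [1.0, 3.0, 5.0]

print(cohens_d(dataset_1, dataset_2))
```

By the commonly used convention, values near 0.2 are read as small effects and values near 0.5 as moderate effects, which is the scale against which the reported 0.161 and 0.335 are interpreted.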

Fuzzy matching summaries and statistical validations are presented in Table 4, along with their interpretations. The t-test p-value for the tumor size comparison indicates no significant difference between the two datasets.

In Figure 8, the Kaplan–Meier survival curves show that patients in the real-time recurrent cervical cancer dataset have a steep decline in survival, particularly those diagnosed at advanced stages, resulting in a shorter disease-free survival [50]. In comparison, the GSE44001 dataset exhibits longer DFS. A chi-square test comparing recurrence and DFS status between the two datasets yields a p-value of 1.0, indicating no significant difference. The coefficient table (Table 5) is derived from the recurrent cervical cancer data.

The coefficient for stage is 0.39, indicating that the chance of recurrence increases as the stage increases. A larger tumor size corresponds to a higher hazard ratio, indicating a greater likelihood of recurrence. Long non-coding RNA (lncRNA) profiles from GENCODE are mapped to the GSE44001 dataset and matched based on clinical features, mainly tumor stage and recurrence status. After merging, patients were categorized into three groups: reduced-risk, moderate-risk, and elevated-risk. The contribution of each lncRNA to the risk index is shown by its coefficient. Positive coefficients indicate that higher expression increases risk and reduces DFS; negative coefficients indicate a protective effect. Three lncRNAs have positive coefficients: ATXN8OS, C5orf60, and INE1. DIO3OS, EMX2OS, KCNQ1DN, KCNQ1OT1, LOH12CR2, and RFPL1S have negative coefficients, which imply a protective effect: higher expression of these lncRNAs reduces the risk and is associated with a greater chance of survival.

Figure 9 shows the distribution of risk scores derived from the nine-lncRNA gene signature. The histogram marks the median risk score (red line), which acts as the cut-off point to classify patients into two groups: the high-risk group lies above the median, and the low-risk group lies below it. A scatter plot of disease-free survival time versus risk score is shown in the same figure. As predicted by the nine-lncRNA prognostic model, higher risk values are often associated with shorter survival periods and greater probabilities of recurrence.
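The median cut-off can be sketched as follows. The scores are hypothetical, and assigning a patient whose score equals the median to the low-risk group is an assumption made here, since the text does not state how ties are handled.

```python
from statistics import median


def split_by_median(scores):
    """Label each patient 'high' (above the median) or 'low' (at or below it)."""
    cutoff = median(scores.values())
    groups = {
        pid: ("high" if score > cutoff else "low")
        for pid, score in scores.items()
    }
    return groups, cutoff


# Hypothetical risk scores for four patients.
risk_scores = {"p1": -0.5, "p2": 0.2, "p3": 1.15, "p4": 0.9}

groups, cutoff = split_by_median(risk_scores)
print(cutoff, groups)
```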

Figure 10 presents the heat map of the expression levels of the nine-lncRNA signature, showing the stratification of patients into reduced-risk, moderate-risk, and elevated-risk categories through distinct color patterns. Figure 11 shows the time-dependent ROC curves for different follow-up periods, which reflect the prognostic performance of the nine-lncRNA LSTM model. According to the time-dependent ROC analysis, the AUC values at 12, 36, and 60 months were 0.55, 0.67, and 0.74, respectively. The model’s predictive accuracy therefore improved over time, which may reflect the gradual emergence of prognostic factors through slow changes in clinical markers or cumulative effects that become apparent in the medium to long term. This trend indicates that the model is more effective at differentiating recurrence risk during extended follow-up periods. The reliability of these estimates was evaluated using 1000 bootstrap resamples, yielding 95% confidence intervals of 0.51–0.60 for 12 months, 0.63–0.71 for 36 months, and 0.70–0.78 for 60 months.
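The bootstrap procedure behind such confidence intervals can be sketched as below. This is a simplified illustration for a single fixed follow-up horizon with binary recurrence labels; it does not handle censoring as a full time-dependent ROC analysis would, and the labels, scores, and function names are hypothetical.

```python
import random


def auc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, seed=0):
    """Percentile 95% confidence interval for the AUC from n_boot resamples."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required to compute an AUC")
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that contain a single class
            estimates.append(auc(y, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical recurrence labels (1 = recurred) and model risk scores.
labels = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.5, 0.45, 0.2, 0.8, 0.6, 0.7]

print(auc(labels, scores))
print(bootstrap_auc_ci(labels, scores, n_boot=1000))
```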

A chi-square test was used to confirm that high-risk patients have higher recurrence rates. Table 6 shows the relationship between the reduced-, moderate-, and elevated-risk groups and recurrence rates. The ROC curve shows the capability of the nine-lncRNA signature to differentiate between high- and low-risk patients. In the chi-square test, the high-risk group consists of 74 individuals and the low-risk group of 64 patients, analyzed by stage. A p-value below 0.05 indicates that patients classified as high risk by the nine-lncRNA signature experienced a significantly greater chance of recurrence than low-risk patients.
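A chi-square test of this kind can be sketched for a 2x2 table as follows. The group sizes (74 high-risk, 64 low-risk) follow the text, but the split of each group into recurrent and non-recurrent cases is hypothetical, because those counts are not given here; the function name is also an assumption.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table = [[a, b], [c, d]]: rows are risk groups, columns are
    (recurred, did not recur). Returns (statistic, p_value), 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    statistic = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 30 of 74 high-risk and 10 of 64 low-risk patients recur.
table = [[30, 44], [10, 54]]
statistic, p_value = chi_square_2x2(table)
print(round(statistic, 3), p_value)
```

With these assumed counts the statistic exceeds the 3.84 critical value for one degree of freedom, so the difference in recurrence rates would be significant at the 0.05 level.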

To confirm the robustness of results, additional quality control and statistical validation are performed using the GSE44001 dataset.

First, the mean–variance trend plot in Figure 12 confirms that the variance of expression values stabilizes after log-transformation, supporting the assumptions of the LIMMA framework for differential expression analysis.

Second, the moderated t-statistic plot in Figure 13 shows the close alignment between the observed and theoretical quantiles, indicating that the statistical tests follow the expected null distribution and reduce the risk of inflated false positives.

Finally, the LIMMA differential expression Venn diagram in Figure 14 shows the overlaps among differentially expressed lncRNAs across patient subgroups (p adj < 0.05). This confirms that the selected lncRNAs are not specific to a single subgroup but are consistently associated with recurrence risk across comparisons.

These analyses provide strong evidence that the feature selection and modeling pipeline are reliable. The real-time dataset is matched with the GSE44001 dataset based on tumor stage. From this analysis, patients with advanced stages of cervical cancer were associated with ATXN8OS, C5orf60, and INE1. In contrast, patients diagnosed at earlier stages were associated with KCNQ1DN, LOH12CR2, RFPL1S, and KCNQ1OT1. Patients in moderate stages were primarily associated with EMX2OS. This connection between the GSE44001 analysis and the real-time clinical data supports the robustness of the nine-lncRNA gene signature, suggesting that it classifies patients by recurrence risk.

5. Discussion

By combining molecular signatures with clinical features, a nine-lncRNA prognostic model is developed to classify patients into three groups with reduced, moderate, and elevated recurrence risk. This integrative approach demonstrates the potential of combining transcriptomic and clinical information in recurrence prediction. This research integrates a real-time cervical cancer recurrence dataset with the publicly available GSE44001 dataset to identify long non-coding RNA signatures. By applying a feature selection framework, the nine-lncRNA signature is identified in relation to recurrence risk and disease-free survival. The clinical high-risk factors are disease-free survival, stage, lymph node metastasis, tumor size, and PV/PR examination. The results show that three long non-coding RNAs, ATXN8OS, C5orf60, and INE1, are associated with increased recurrence risk. Findings from prior studies indicate that dysregulated lncRNAs, such as HOTAIR and MALAT1, promote cervical tumor progression and metastasis [51]. The lncRNAs DIO3OS, EMX2OS, KCNQ1DN, KCNQ1OT1, LOH12CR2, and RFPL1S have negative coefficients in the signature, which links them to tumor suppression mechanisms [21]. The Kaplan–Meier and time-dependent ROC curves reinforce the predictive capability of the nine-lncRNA signature, with prediction accuracy improving over longer follow-up periods. This finding aligns with earlier survival-based lncRNA studies in gynecological cancers, which suggest that extended observation enhances predictive discrimination. Long non-coding RNA signatures provide predictive power that complements clinical features, particularly tumor size and stage [52].

The risk scores of ATXN8OS, C5orf60, and INE1 are 28.45, 32.18, and 75.93, respectively, indicating a strong correlation with shorter disease-free survival and increased recurrence probability. These results suggest that these lncRNAs act as oncogenic drivers in cervical carcinoma, consistent with previous findings that several lncRNAs promote tumor progression, whereas DIO3OS, KCNQ1DN, LOH12CR2, RFPL1S, and KCNQ1OT1 showed protective effects, correlating with longer progression-free survival. lncRNAs can act as either tumor suppressors or oncogenes in cancer development, and the mixed predictive scores observed here reflect this dual role. Time-dependent ROC analysis validates the model; the reliability of the AUC estimates was evaluated using 1000 bootstrap resamples, yielding 95% confidence intervals of 0.51–0.60 for 12 months, 0.63–0.71 for 36 months, and 0.70–0.78 for 60 months. This supports the long-term prognostic value of the nine-lncRNA signature and suggests a long-term impact of molecular dysregulation. A practical implication is that patients with elevated-risk signatures should undergo closer surveillance and aggressive adjuvant therapy, while low-risk patients can avoid overtreatment. Integrative analysis identified a nine-lncRNA signature with distinct prognostic associations in cervical cancer recurrence, as shown in Table 6. This shows that molecular markers complement traditional clinical prognostic factors, improving personalized treatment planning. A new contribution of this research is the integration of real-world hospital data with GSE44001, which enhances external validation. Limitations include a modest sample size compared to larger population-based cohorts.

In the future, we plan to collaborate with additional hospitals as an extension of this research. A further limitation is that RNA sequencing data contain inherent batch effects, despite efforts to normalize and preprocess them. The findings of this research demonstrate that a nine-lncRNA prognostic signature derived from both real-time and public datasets stratifies cervical cancer patients by recurrence risk. This integrative approach enhances the generalizability of results, bridges clinical and transcriptomic data, and paves the way for future studies and potential translational applications in risk prediction and therapy optimization.

6. Conclusions

This research investigates lncRNA gene signatures associated with cervical cancer recurrence by integrating clinical features with lncRNA expression profiles. Clinical data were matched with the GSE44001 dataset using fuzzy alignment of tumor stage and disease-free survival (DFS), and relevant lncRNAs were identified using the LIMMA framework. Through feature selection and analysis, a nine-lncRNA signature is identified as a predictor of recurrence risk alongside clinical factors. Among these, INE1, C5orf60, and ATXN8OS were identified as high-risk markers; the other six lncRNAs are associated with protective effects. Patients classified as high risk by this signature had considerably shorter DFS and greater recurrence rates.

The accuracy of predictions is enhanced by the model’s ability to identify intricate relationships between molecular and clinical variables through the use of an LSTM-based recurrent neural network. These results highlight the potential of lncRNA-driven predictive models for patient categorization. Crucially, the nine-lncRNA signature provides an important resource for identifying high-risk patients, enabling individualized treatment plans and more stringent monitoring. Future validation in larger, multi-center cohorts and functional analyses to reveal the biological mechanisms of these lncRNAs are crucial for advancing this signature toward clinical application.

Author Contributions

Conceptualization, G.S. and R.P.; data curation, G.S. and P.S.P.; methodology, R.P. and S.D.; software, R.P.A. and S.D.; validation, R.P. and P.S.P.; formal analysis, G.S. and S.D.; investigation, G.S., R.P.A. and R.P.; resources, S.D.; visualization, G.S. and S.D.; supervision, G.S. and S.D.; project administration, S.D. and R.P.; writing—original draft, G.S. and R.P.; writing—review and editing, S.D., R.P. and R.P.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by Anusandhan National Research Foundation [Science and Engineering Research Board]—Core Research Grant (Grant No.CRG/2022/008526), Department of Science and Technology, India.

Institutional Review Board Statement

All analyses were conducted using anonymized clinical data that contained no identifiable patient information. Formal ethics approval was not required, in accordance with Health Research Authority guidance for secondary use of data collected for clinical purposes.

Informed Consent Statement

Patient consent was not needed for this study because only anonymized, non-identifiable data were used.

Data Availability Statement

The datasets used and analyzed during the current study were collected from Shanmuga Hospital, Salem. The original data presented in the study are openly available in NCBI GEO under accession number GSE44001.

Acknowledgments

The authors are grateful for the financial assistance and support provided by Anusandhan National Research Foundation, India.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Al Mudawi, N.; Alazeb, A. A Model for Predicting Cervical Cancer Using Machine Learning Algorithms. Sensors 2022, 22, 4132.
2. Ghoneim, A.; Muhammad, G.; Hossain, M.S. Cervical cancer classification using convolutional neural networks and extreme learning machines. Future Gener. Comput. Syst. 2020, 102, 643–649.
3. Antunes, D.; Cunha, T.M. Recurrent Cervical Cancer: How Can Radiology be Helpful. OMICS J. Radiol. 2013, 2, 138.
4. Senthilkumar, G.; Ramakrishnan, J.; Frnda, J.; Ramachandran, M.; Gupta, D.; Tiwari, P.; Shorfuzzaman, M.; Mohammed, M.A. Incorporating Artificial Fish Swarm in Ensemble Classification Framework for Recurrence Prediction of Cervical Cancer. IEEE Access 2021, 9, 83876–83886.
5. Roszik, J.; Ring, K.L.; Wani, K.M.; Lazar, A.J.; Yemelyanova, A.V.; Soliman, P.T.; Frumovitz, M.; Jazaeri, A.A. Gene Expression Analysis Identifies Novel Targets for Cervical Cancer Therapy. Front. Immunol. 2018, 9, 2102.
6. Vistad, I.; Bjorge, L.; Solheim, O.; Fiane, B.; Sachse, K.; Tjugum, J.; Skroppa, S.; Bentzen, A.G.; Stokstad, T.; Iversen, G.A.; et al. A national, prospective observational study of first recurrence after primary treatment for gynecological cancer in Norway. Acta Obs. Gynecol. Scand. 2017, 96, 1162–1169.
7. Chang, C.; Chen, J.; Chang, W.Y.; Chiang, A.J. Tumor Size Has a Time-Varying Effect on Recurrence in Cervical Cancer. J. Low. Genit. Tract Dis. 2016, 20, 317–320.
8. Tseng, C.-J.; Lu, C.-J.; Chang, C.-C.; Chen, G.-D. Application of Machine Learning to Predict the Recurrence-Proneness for Cervical Cancer. Neural Comput. Appl. 2014, 24, 1311–1316.
9. Chang, C.C.; Cheng, S.L.; Lu, C.J.; Liao, K.H. Prediction of Recurrence in Patients with Cervical Cancer Using MARS and Classification. Int. J. Mach. Learn. Comput. 2013, 3, 75–78.
10. Guo, C.; Wang, J.; Wang, Y.; Qu, X.; Shi, Z.; Meng, Y.; Qiu, J.; Hua, K. Novel artificial intelligence machine learning approaches to precisely predict survival and site-specific recurrence in cervical cancer: A multiinstitutional study. Transl. Oncol. 2021, 14, 101032.
11. Li, J.; Liu, G.; Luo, J.; Yan, S.; Ye, P.; Wang, J.; Luo, M. Cervical cancer prognosis and related risk factors for patients with cervical cancer: A long-term retrospective cohort study. Sci. Rep. 2022, 12, 13994.
12. Taarnhoj, G.A.; Christensen, I.J.; Lajer, H.; Fuglsang, K.; Jeppesen, M.M.; Kahr, H.S.; Hogdall, C. Risk of recurrence, prognosis, and follow-up for Danish women with cervical cancer in 2005–2013: A national cohort study. Cancer 2018, 24, 943–951.
13. Zhang, Y.; Zhang, X.; Zhu, H.; Liu, Y.; Cao, J.; Li, D.; Ding, B.; Yan, W.; Jin, H.; Wang, S. Identification of Potential Prognostic Long Non-coding RNA Biomarkers for Predicting Recurrence in Patients with Cervical Cancer. Cancer Manag. Res. 2020, 12, 719–730.
14. Chao, X.; Fan, J.; Song, X.; You, Y.; Wu, H.; Wu, M.; Li, L. Diagnostic Strategies for Recurrent Cervical Cancer: A Cohort Study. Front. Oncol. 2020, 10, 591253.
15. Peiretti, M.; Zapardiel, I.; Zanagnolo, V.; Landoni, F.; Morrow, C.P.; Maggioni, A. Management of recurrent cervical cancer: A review of the literature. Surg. Oncol. 2012, 21, e59–e66.
16. Cao, Y.; Liu, Y.; Lu, X.; Wang, Y.; Qiao, H.; Liu, M. Upregulation of long noncoding RNA SPRY4-IT1 correlates with tumor progression and poor prognosis in cervical cancer. FEBS Open Bio 2016, 6, 954–960.
17. Babichev, S.; Liakh, I.; Kalinina, I. Applying a Recurrent Neural Network-Based Deep Learning Model for Gene Expression Data Classification. Appl. Sci. 2023, 13, 11823.
18. Deng, S.P.; Zhu, L.; Huang, D.S. Predicting Hub Genes Associated with Cervical Cancer through Gene Co-Expression Networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 13, 27–35.
19. Yan, Y.; Zhao, K.; Cao, J.; Ma, H. Prediction research of cervical cancer clinical events based on recurrent neural network. Procedia Comput. Sci. 2021, 183, 221–229.
20. Annapurna, S.D.; Pasumarthi, D.; Pasha, A.; Doneti, R.; Sheela, B.; Botlagunta, M.; Vijaya Lakshmi, B.; Pawar, S.C. Identification of Differentially Expressed Genes in Cervical Cancer Patients by Comparative Transcriptome Analysis. Biomed. Res. Int. 2021, 2021, 8810074.
21. Mao, Y.; Dong, L.; Zheng, Y.; Dong, J.; Li, X. Prediction of Recurrence in Cervical Cancer Using a Nine-lncRNA Signature. Front. Genet. 2019, 10, 284.
22. Ding, H.; Zhang, L.; Zhang, C.; Song, J.; Jiang, Y. Screening of Significant Biomarkers Related to Prognosis of Cervical Cancer and Functional Study Based on lncRNA-associated ceRNA Regulatory Network. Comb. Chem. High. Throughput Screen. 2021, 24, 472–482.
23. Geeitha, S.; Prabha, K.R.; Cho, J.; Easwaramoorthy, S.V. Bidirectional recurrent neural network approach for predicting cervical cancer recurrence and survival. Sci. Rep. 2024, 14, 31641.
24. Wu, W.; Sui, J.; Liu, T.; Yang, S.; Xu, S.; Zhang, M.; Huang, S.; Yin, L.; Pu, Y.; Liang, G. Integrated analysis of two-lncRNA signature as a potential prognostic biomarker in cervical cancer: A study based on public database. PeerJ 2019, 7, e6761.
25. Geeitha, S.; Renuka, P.; Thilagavathi, C.; Ananth, S.; Ramya, S.; Sinduja, K. LSTM—Recurrent Neural Network Model to Forecast the Risk Factors in Recurrent Cervical Carcinoma. In Proceedings of the 2nd International Conference on Self-Sustainable Artificial Intelligence Systems, Erode, India, 23–25 October 2024; pp. 132–137.
26. Wang, B.; Wang, W.; Zhou, W.; Zhao, Y.; Liu, W. Cervical cancer-specific long non-coding RNA landscape reveals the favorable prognosis predictive performance of an ion-channel-related signature model. Cancer Med. 2024, 13, e7389.
27. He, J.; Huang, B.; Zhang, K.; Liu, M.; Xu, T. Long non-coding RNA in cervical cancer: From biology to therapeutic opportunity. Biomed. Pharmacother. 2020, 127, 110209.
28. Ye, Z.; Zhang, Y.; Liang, Y.; Lang, J.; Zhang, X.; Zang, G.; Yuan, D.; Tian, G.; Xiao, M.; Yang, J. Cervical Cancer Metastasis and Recurrence Risk Prediction Based on Deep Convolutional Neural Network. Curr. Bioinform. 2022, 17, 164–173.
29. Men, L.; Ilk, N.; Tang, X.; Liu, Y. Multi-disease prediction using LSTM recurrent neural networks. Expert. Syst. Appl. 2021, 177, 114905.
30. Gholami, H.; Mohammadifar, A.; Golzari, S.; Song, Y.; Pradhan, B. Interpretability of simple RNN and GRU deep learning models used to map land susceptibility to gully erosion. Sci. Total Environ. 2023, 904, 166960.
31. Zhao, Y.; Chen, Z.; Dong, Y.; Tu, J. An interpretable LSTM deep learning model predicts the time-dependent swelling behaviour in CERCER composite fuels. Mater. Today Commun. 2023, 37, 106998.
32. Amendolara, A.B.; Sant, D.; Rotstein, H.G.; Fortune, E. LSTM-based recurrent neural network provides effective short-term flu forecasting. BMC Public Health 2023, 23, 1788.
33. Perkins, R.B.; Guido, R.L.; Saraiya, M.; Sawaya, G.F.; Wentzensen, N.; Schiffman, M.; Feldman, S. Summary of Current Guidelines for Cervical Cancer Screening and Management of Abnormal Test Results: 2016–2020. J. Womens Health 2021, 30, 5–13.
34. Harrow, J.; Frankish, A.; Gonzalez, J.M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B.L.; Barrell, D.; Zadissa, A.; Searle, S.; et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 2012, 22, 1760–1774.
35. Li, X.; Jin, F.; Li, Y. A novel autophagy-related lncRNA prognostic risk model for breast cancer. J. Cell. Mol. Med. 2021, 25, 4–14.
36. Ren, L.; Wang, T.; Seklouli, A.S.; Zhang, H.; Bouras, A. A review on missing values for main challenges and methods. Inf. Syst. 2023, 119, 102268.
37. Alghamdi, T.A.; Javaid, N. A Survey of Preprocessing Methods Used for Analysis of Big Data Originated from Smart Grids. IEEE Access 2022, 10, 29149–29171.
38. Gao, C.; Yan, J.; Zhou, S.; Chen, B.; Varshney, K.P.; Liu, H. Long short-term memory-based recurrent neural networks for nonlinear target tracking. Signal Process. 2019, 164, 67–73.
39. Belagoune, S.; Bali, N.; Bakdi, A.; Baadji, B.; Atif, K. Deep learning through LSTM classification and regression for transmission line fault detection, diagnosis, and location in large-scale multi-machine power systems. Measurement 2021, 177, 109330.
40. Ding, Y.; Zhu, Y.; Feng, J.; Zhang, P.; Cheng, Z. Interpretable spatio-temporal attention LSTM model for flood forecasting. Neurocomputing 2020, 403, 348–359.
41. Zou, J.; Lin, Z.; Jiao, W.; Chen, J.; Lin, L.; Zhang, F.; Zhang, X.; Zhao, J. A multi-omics-based investigation of the limitation and immunological impact of necroptosis-related mRNA in patients with cervical squamous carcinoma and adenocarcinoma. Sci. Rep. 2022, 12, 16773.
42. Li, N.; Yu, K.; Lin, Z.; Zeng, D. Identifying a cervical cancer survival signature based on mRNA expression and genome-wide copy number variations. Exp. Biol. Med. 2022, 247, 207–220.
43. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47.
44. Derrien, T.; Johnson, R.; Bussotti, G.; Tanzer, A.; Djebali, S.; Tilgner, H.; Guernec, G.; Martin, D.; Merkel, A.; Knowles, D.G.; et al. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res. 2012, 22, 1775–1789.
45. Ge, X.; Lei, S.; Wang, P.; Wang, W.; Wang, W. The metabolism-related lncRNA signature predicts the prognosis of breast cancer patients. Sci. Rep. 2024, 14, 3500.
46. Allgaier, J.; Pryss, R. Cross-Validation Visualized: A Narrative Guide to Advanced Methods. Mach. Learn. Knowl. Extr. 2024, 6, 1378–1388.
47. Wu, M.; Zhang, X.; Han, X.; Pandey, V.; Lobie, P.E.; Zhu, T. The potential of long noncoding RNAs for precision medicine in human cancer. Cancer Lett. 2021, 50, 12–19.
48. Zhang, Y.; Zou, J.; Li, L.; Han, M.; Dong, J.; Wang, X. Comprehensive assessment of postoperative recurrence and survival in patients with cervical cancer. Eur. J. Surg. Oncol. 2024, 50, 108583.
49. Burmeister, C.A.; Khan, S.F.; Schäfer, G.; Mbatani, N.; Adams, T.; Moodley, J.; Prince, S. Cervical cancer therapies: Current challenges and future perspectives. Tumour Virus Res. 2022, 13, 200238.
50. Chen, Q.; Hu, L.; Huang, D.; Chen, K.; Qiu, X.; Qiu, B. Six-lncRNA Immune Prognostic Signature for Cervical Cancer. Front. Genet. 2020, 11, 533628.
51. Zhou, Y.; Wang, Y.; Lin, M.; Wu, D.; Zhao, M. LncRNA HOTAIR promotes proliferation and inhibits apoptosis by sponging miR-214-3p in HPV16-positive cervical cancer cells. Cancer Cell Int. 2021, 21, 400.
52. Sun, M.; Liu, X.; Xia, L.; Chen, Y.; Kuang, L.; Gu, X.; Li, T. A nine-lncRNA signature predicts distant relapse-free survival of HER2-negative breast cancer patients receiving taxane and anthracycline-based neoadjuvant chemotherapy. Biochem. Pharmacol. 2021, 189, 114285.

Figure 1. Proposed research method.

Figure 2. Correlation matrix of RCC.

Figure 3. Recurrent neural network (LSTM) used to identify the relevant features.

Figure 4. Training and validation accuracy and loss.

Figure 5. Kaplan-Meier survival curves by stage.

Figure 6. Violin plot for disease-free survival vs. staging categories.

Figure 7. Comparison of common features between tumor size and disease-free survival.

Figure 8. Kaplan-Meier survival curve comparison.

Figure 9. Risk score distribution and survival time prediction.

Figure 10. Heatmap of nine-lncRNA signature expression levels.

Figure 11. Time-dependent ROC curve for different follow-up times.

Figure 12. Mean-variance trend plot.

Figure 13. Moderated t-statistic plot.

Figure 14. LIMMA differential expression.

Table 1. Details of patients with recurrent cervical cancer.

Category	Details
Total patients with recurrence (n)	138
Age (years), mean ± SD	49.7 ± 14.6
FIGO 2009 staging, n (%)
IVB	21 (15.2%)
IVA	12 (8.7%)
IIIC2	13 (9.4%)
IIIC1	8 (5.8%)
IIIB	16 (11.6%)
IIIA	7 (5.1%)
IIB	16 (11.6%)
IB3	12 (8.7%)
IB2	11 (8.0%)
IB1	9 (6.5%)
IA2	2 (1.4%)
IA1	11 (8.0%)
Stages, n (%)
Early	22 (15.9%)
Locally advanced	67 (48.6%)
Advanced	49 (35.5%)
Histological subtypes, n (%)
Squamous cell carcinoma (SCC)	51 (37.0%)
Adenocarcinoma (ADC)	53 (38.4%)
Other	34 (24.6%)

Table 2. Descriptive statistics for numeric features in the dataset.

Feature	Mean	Median	Standard Deviation
Age (years)	49.65	50.0	14.57
Age at initial diagnosis (years)	50.35	52.0	14.89
Time since menopause (years)	7.60	6.0	7.6
Tumor size (cm)	3.21	3.34	1.183
DFS (months)	59.45	45.0	45.75

Table 3. Extent of missingness for each clinical and imaging feature.

Feature	Missing (n)	Missing (%)
Age	5	0.7%
Age of Initial Diagnosis	6	0.8%
Post Menopause in Years	28	3.8%
Symptoms	10	1.4%
Duration of Symptoms	35	4.7%
Comorbidities	12	1.6%
Comorbidities Details	45	6.1%
Addictive Habits	60	8.1%
PV Examination	15	2.0%
PR Examination	18	2.4%
Primary Lesion—MRI	55	7.4%
Primary Lesion—CT Scan	110	14.9%
HPV Infection	95	12.9%
HPV Vaccination Status	140	18.9%
Smoking	80	10.8%
Chlamydia Infection	65	8.8%
BMI	50	6.8%
Oral Contraceptives Use	90	12.2%
Number of Full-term Pregnancies	25	3.4%
Age at First Full-term Pregnancy	60	8.1%
History	75	10.1%
Tumor Size (cm)	30	4.1%
Lymph Node Metastasis	40	5.4%
Histological Type	15	2.0%
Treatment Type	5	0.7%
FIGO Stage	0	0.0%
Imaging	12	1.6%
DFS (Months)	0	0.0%

Table 4. Fuzzy matching summary and statistical validation.

Metric	Value	Interpretation
Number of matched records	138	Patients successfully matched on tumor size and DFS
GSE44001 records	299	Full public dataset size
Mean tumor size difference (cm)	0.105	Very close alignment in tumor size
Mean DFS difference (months)	8.457	Acceptable difference considering real-world variability
t-test p-value (tumor size)	0.1194	No significant difference in mean tumor size
KS-test p-value (tumor size)	0.0290	Mild distributional shift detected
Cohen’s d (tumor size)	0.161	Small effect size; distributions are broadly similar
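
The effect-size row of Table 4 can be illustrated with a short sketch. It assumes Cohen's d is computed with a pooled sample standard deviation, which the article does not state explicitly. The function name `cohens_d` and the two small tumor-size lists are hypothetical and are not the study data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical tumor sizes in cm (illustration only, NOT the study data).
clinical_sizes = [2.0, 3.0, 4.0, 5.0]
public_sizes = [1.0, 2.0, 3.0, 4.0]

d = cohens_d(clinical_sizes, public_sizes)
print(f"Cohen's d = {d:.3f}")
```

By the usual convention, an absolute value of d below roughly 0.2 is read as a small effect, which is consistent with how the table interprets the reported 0.161.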

Table 5. Coefficient table for recurrence of cervical cancer.

Variable	Coef	exp(Coef)	z	p-Value	95% CI for exp(Coef)
Stage	0.39	1.48	13.92	<0.005	(1.40–1.56)
Largest Diameter (cm)	0.17	1.19	2.93	<0.005	(1.06–1.33)
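
The columns of Table 5 are linked by simple arithmetic, which the sketch below reproduces. It assumes the z column is a Wald statistic (coefficient divided by its standard error) and that the interval is a normal-approximation 95% interval on the exp(Coef) scale; neither assumption is stated in the article. The helper name `ratio_with_ci` is hypothetical.

```python
from math import exp


def ratio_with_ci(coef, z, z_crit=1.96):
    """Return exp(coef) and its approximate 95% interval.

    Assumes z = coef / SE (Wald statistic), so SE = coef / z.
    """
    se = coef / z
    lower = exp(coef - z_crit * se)
    upper = exp(coef + z_crit * se)
    return exp(coef), lower, upper


# Coefficient and z values copied from Table 5.
rows = {
    "Stage": (0.39, 13.92),
    "Largest Diameter (cm)": (0.17, 2.93),
}

results = {name: ratio_with_ci(coef, z) for name, (coef, z) in rows.items()}
for name, (ratio, lower, upper) in results.items():
    print(f"{name}: exp(coef) = {ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the recomputed values round to the tabulated ones: 1.48 (1.40 to 1.56) for stage and 1.19 (1.06 to 1.33) for largest diameter.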

Table 6. Estimated Prognosis of Different Biomarkers.

lncRNA	Prognostic Value	Classification	Estimated Prognosis
ATXN8OS	28.45	Elevated Risk	Decreased progression-free period
C5orf60	32.18	Elevated Risk	Decreased progression-free period
INE1	75.93	Elevated Risk	Decreased progression-free period
DIO3OS	−45.27	Reduced Risk	Prolonged progression-free period
EMX2OS	−1.25	Moderate Risk	Medium progression-free period
KCNQ1DN	−4.87	Low Risk	Prolonged progression-free period
LOH12CR2	−0.85	Low Risk	Prolonged progression-free period
RFPL1S	−0.62	Low Risk	Prolonged progression-free period
KCNQ1OT1	−0.95	Low Risk	Prolonged progression-free period
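
A per-patient risk value of the kind described in the abstract can be sketched as a weighted sum of lncRNA expression values. The sketch below assumes, purely for illustration, that the "Prognostic Value" column of Table 6 serves as the per-lncRNA weight; the article does not confirm this. The expression profiles and the function name `risk_score` are hypothetical.

```python
# Weights copied from the "Prognostic Value" column of Table 6.
# ASSUMPTION: these values act as linear weights; the article does not say so.
WEIGHTS = {
    "ATXN8OS": 28.45,
    "C5orf60": 32.18,
    "INE1": 75.93,
    "DIO3OS": -45.27,
    "EMX2OS": -1.25,
    "KCNQ1DN": -4.87,
    "LOH12CR2": -0.85,
    "RFPL1S": -0.62,
    "KCNQ1OT1": -0.95,
}


def risk_score(expression, weights=WEIGHTS):
    """Weighted sum of expression values over the nine signature lncRNAs."""
    # A missing gene raises KeyError rather than being silently treated as 0.
    return sum(weight * expression[gene] for gene, weight in weights.items())


# Hypothetical expression profiles (illustration only, NOT study data).
baseline = {gene: 0.0 for gene in WEIGHTS}
patient_a = {**baseline, "INE1": 1.0}    # one positively weighted lncRNA expressed
patient_b = {**baseline, "DIO3OS": 1.0}  # one negatively weighted lncRNA expressed

score_a = risk_score(patient_a)
score_b = risk_score(patient_b)
print(f"patient_a: {score_a:.2f}, patient_b: {score_b:.2f}")
```

Under this assumption, higher expression of positively weighted lncRNAs raises the score and higher expression of negatively weighted ones lowers it; any cut-offs used to form high-, moderate-, and low-risk groups would have to come from the study itself.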

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
