A Graph-Based Algorithm for Detecting Long Non-Coding RNAs Through RNA Secondary Structure Analysis

Cabrera-Ibarra, Hugo; Hernández-Granados, David; Riego-Ruiz, Lina

doi:10.3390/a18100652


by Hugo Cabrera-Ibarra ¹,*,†, David Hernández-Granados ¹,*,† and Lina Riego-Ruiz ²,†

¹ División de Control y Sistemas Dinámicos, Instituto Potosino de Investigación Científica y Tecnológica A.C. (IPICyT), Camino a la Presa San José 2255, Lomas 4ta Sección, San Luis Potosí 78216, SLP, Mexico

² División de Biología Molecular, Instituto Potosino de Investigación Científica y Tecnológica A.C. (IPICyT), Camino a la Presa San José 2255, Lomas 4ta Sección, San Luis Potosí 78216, SLP, Mexico

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Algorithms 2025, 18(10), 652; https://doi.org/10.3390/a18100652

Submission received: 30 August 2025 / Revised: 14 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025


Abstract

Non-coding RNAs (ncRNAs) are involved in many biological processes, making their identification and functional characterization a priority. Among them, long non-coding RNAs (lncRNAs) have been shown to regulate diverse cellular processes, such as cell development, stress response, and transcriptional regulation. The continued identification of new lncRNAs highlights the demand for reliable detection methods, with structural analysis offering useful information. Currently, lncRNAs are identified using tools such as LncFinder, whose database has a large collection of lncRNAs from humans, mice, and chickens, among others. In this work, we present a graph-based algorithm to represent and compare RNA secondary structures. Rooted tree graphs were used to compare two groups of Saccharomyces cerevisiae RNA sequences, lncRNAs and non-lncRNAs, by searching for structural similarities within each group. When applied to a novel candidate sequence dataset, the algorithm evaluated whether characteristic structures identified in known lncRNAs recurred. If so, the sequences were classified as likely lncRNAs. These results indicate that graph-based structural analysis may complement existing sequence-based tools such as LncFinder or PreLnc. Recent studies have shown that tumor cells can secrete lncRNAs into human biological fluids, forming circulating lncRNAs, which can be used as biomarkers for cancer. Our algorithm could be applied to identify novel lncRNAs with structural similarities to those associated with tumor malignancy.

Keywords:

RNA; lncRNAs; secondary structure; rooted tree graphs; algorithm; Saccharomyces cerevisiae

MSC:

90C35; 92B99

Graphical Abstract

1. Introduction

It has been established that non-coding RNAs (ncRNAs) are involved in several cellular processes. For example, they are known to be key players in cell differentiation, cell lineage choice, and organogenesis [1,2,3]. Bernstein et al. [4] suggest that transcriptional ncRNAs are more closely related to biological processes than previously believed. For their study, RNAs have been categorized into the following two groups [5]: ncRNAs with fewer than 200 nucleotides, for example, microRNAs (miRNAs) or small RNAs (sRNAs), and ncRNAs with more than 200 nucleotides, known as lncRNAs. Given the importance of lncRNAs, a method is needed for analyzing their structure and for determining, if possible, whether any particular substructure characterizes them.

Liu et al. [6] studied R-loops, a class of non-canonical nucleic acid structures that typically form during transcription. This was carried out by associating a rooted tree graph and a unique polynomial, called the tree polynomial, with an RNA secondary structure. They established a strong correlation between the coefficient sums of tree polynomials and the experimental probability of R-loop formation.

Gan et al. [7] developed a two-dimensional graphical representation approach to describe and estimate the size of RNA’s secondary structural repertoire. They used rooted tree graphs to describe RNA tree motifs and pointed out that tree topologies not found in RNA databases could be candidate templates for designing novel RNA sequences.

For several years, biology has incorporated mathematics and programming as tools. Taking advantage of this, persistent characteristics are sought in RNA folding to characterize it [8,9]. In addition, representing the RNA interaction networks through graphs has been very useful in their structural analysis. In the case of detecting lncRNAs, Siyu et al. [10] introduced LncFinder, an integrated platform based on machine learning algorithms that includes an lncRNA predictor with good performance while detecting lncRNAs from humans, mice, and chickens, among others. Another tool, proposed by Cao et al. [11], is PreLnc, which uses high-confidence lncRNA and mRNA transcripts to build prediction models using feature selection and classifiers. They analyzed the tri-nucleotide composition of transcripts from different species and concluded that this approach is promising for large-scale transcriptome annotation of lncRNAs.

The study of lncRNAs in yeast has been scarce compared to recent studies focused on humans, mammals, and plants. Yamashita et al. [12] mention that studying yeasts is important because of their genetic tractability and the speed and ease with which experiments can be completed. Furthermore, they state that the big question of whether there are unifying principles for lncRNA functions remains open. In order to answer this question, studies using yeast are essential. With this motivation, our analysis was focused on lncRNAs of Saccharomyces cerevisiae. Following the approach of Gan et al. [7], we developed an algorithm that allowed us to analyze the structure of RNAs. First, the algorithm analyzes the secondary structures associated with RNAs in a set, assigning a graph to each of them. Second, it compares two different sets of graphs in order to establish whether they share any substructures. Finally, the results are used to generate a conclusion that indicates whether an RNA sequence has the potential to belong to a set with certain characteristics.

This paper is organized as follows: In Section 2, we show how to assign a string of dots and parentheses to a rooted tree graph in a plane; this string determines the graph. In Section 3, given an RNA and the Dot–Bracket Notation (DBN) associated with its secondary structure, we show how to associate a rooted tree graph with it, whose determining string will be called the Simplified Dot–Bracket Notation (SDBN). This graph carries essential information on the secondary structure related to the corresponding RNA. Subsequently, in Section 4, we present the development of an algorithm designed to identify the frequently occurring strings, providing evidence of potential structural similarities. Then, in Section 5, the selected datasets are analyzed using this algorithm. Finally, in Section 6, we underscore the relevance of the proposed approach and highlight the results obtained.

2. Rooted Tree Graphs

In order to talk about graphs, we will use some definitions [13]. A graph $G$ is a pair $G = (V, E)$, where $E \subset V \times V$; see Example 1. A tree is a connected graph that contains no cycles; see Figure 1. Note that a tree graph can be drawn in the plane without any edges crossing. In this work, we focus on planar tree graphs, i.e., graphs contained in $\mathbb{R}^{2}$.

Example 1.

Given $V = \{A, B, C, D, E\}$ and $E = \{(A,B), (B,C), (B,D), (B,E)\}$, the resulting tree graph is the one shown in Figure 1.

A tree $(V, E)$ with a selected vertex $r \in V$ is a rooted tree, denoted by $(V, E, r)$, and the vertex $r$ is called the root of the tree. Sometimes, the names of the vertices are not displayed, and the root vertex is highlighted; see Figure 2.

A basic tool that we will be using is a string of dots and parentheses assigned to a rooted tree graph. To assign the string, we follow the contour of the graph in a clockwise direction, starting and ending at the root vertex. Along this walk, each edge is traversed twice, and each vertex is encountered as many times as it has incident edges, except for the root vertex, which is encountered one additional time. We assign a dot each time a vertex is encountered (so the root vertex receives an extra dot). The first time we encounter an edge, we assign an open parenthesis, ‘(’, and the second time, a closing one, ‘)’.

Example 2.

The string associated with the rooted tree graph in Figure 2 is

.(.(.).(.(.).(.(.).).).(.).).

while

.(.(.).(.).(.).).

is the string associated with the rooted tree graph $(V, E, A)$ in Figure 1.

We will see that these strings of dots and parentheses are useful in studying the graphs associated with the secondary structures obtained from RNA sequences.
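The contour walk described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `tree_to_string` and the `children` dictionary are hypothetical, and the order of each child list is assumed to be the order in which the children are met when following the contour.

```python
def tree_to_string(children, root):
    """Return the dot-parenthesis string of a rooted plane tree.

    `children` maps a vertex to the ordered list of its children
    (assumed to be listed in contour order).
    """
    def walk(vertex):
        parts = ["."]                  # a dot each time the vertex is met
        for child in children.get(vertex, []):
            parts.append("(")          # first traversal of the edge
            parts.append(walk(child))
            parts.append(")")          # second traversal of the edge
            parts.append(".")          # back at the vertex
        return "".join(parts)

    # A vertex with k children receives k + 1 dots. For a non-root vertex
    # this equals its number of incident edges; for the root it is one more,
    # which is the extra dot mentioned in the text.
    return walk(root)


# Tree of Example 1, rooted at A.
example_1 = {"A": ["B"], "B": ["C", "D", "E"]}
print(tree_to_string(example_1, "A"))
```

For the tree of Example 1 rooted at A, this prints `.(.(.).(.).(.).).`, the second string of Example 2.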

3. From RNA Sequences to Graphs

An RNA sequence can be described by a string in FASTA format. This format describes an RNA by a raw sequence over the following four-letter alphabet: A (Adenine), C (Cytosine), G (Guanine), and U (Uracil); see Example 3. When a string in this format is given to RNA folding software, the RNA secondary structure is obtained. The secondary structure establishes how the nitrogenous bases pair and is described by the following structural motifs: stem, bulge, hairpin, and junction. Several software tools achieve this goal; for example, methods based on dynamic programming and thermodynamic calculation, such as NUPACK Web [14] or RNAfold WebServer [15], or methods based on neural networks (Bi-LSTM), including rna-state-inf [16]. In this paper, given an RNA sequence in FASTA format, we used NUPACK Web [14] to obtain the associated secondary structure. After that, we associate it with a rooted tree graph by applying the rules proposed by Gan et al. [7].

Example 3.

A raw sequence is expressed as follows:

ACCCGGCCACAGUGAGCGGAACACCCGUGACUCAUUUCGAACCUCGGAAGUUAAGCCGCUCACGUUGGUGGGGCCGUGGAUAACCGUGAGGAUCCGCAGCCCCACUAAGCUGGGAU

Given the FASTA sequence in Example 3, by using NUPACK Web [14], the associated secondary structure is as shown in Figure 3A, where bulges, hairpins, junctions, and the 5′ and 3′ ends are numbered.

3.1. From Secondary Structures to Strings in Dot–Bracket Notation

Given an RNA sequence and its secondary structure, it is common to associate a string with it in Dot–Bracket Notation (DBN). DBN is a text string made up of dots and parentheses that contains information about the RNA secondary structure and where each character represents a base. Note that an open parenthesis “(” indicates that the base is paired with another base ahead of it, while a closed parenthesis “)” indicates that the base is paired with a previous base, and a dot “.” indicates an unpaired base.
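As a minimal sketch of how a DBN string is read, the following Python function recovers the list of paired positions with a stack. The function name `dbn_to_pairs` is hypothetical and positions are 0-based; this is the standard way of parsing balanced brackets rather than code taken from the paper.

```python
def dbn_to_pairs(dbn):
    """Return the sorted list of (i, j) base pairs encoded by a DBN string."""
    stack, pairs = [], []
    for i, ch in enumerate(dbn):
        if ch == "(":
            stack.append(i)                 # base paired with one further ahead
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')' at position %d" % i)
            pairs.append((stack.pop(), i))  # base paired with a previous one
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unmatched '(' at position %d" % stack[-1])
    return sorted(pairs)


print(dbn_to_pairs("((...))."))
```

For the toy string `((...)).` the pairs are (0, 6) and (1, 5); positions 2, 3, 4, and 7 are unpaired.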

Example 4.

The secondary structure shown in Figure 3A has the following Dot–Bracket Notation:

.((((((....(((((((((((..(((.(..((.....))..).)))..)))...)))))))).((((((((((.((((((..((....)))))))).))))))))))))))))..

From a topological point of view, we are interested in the RNA structure, the secondary structure, rather than in the length of the stems, hairpins, or bulges. That is why associating rooted tree graphs to the secondary structure, as we will see in the next section, comes in handy.

3.2. The Associated Rooted Tree Graphs

One way to analyze the RNA secondary structure is by associating it with a rooted tree graph. This can be done by following, with a slight modification, the rules proposed by Gan et al. in [7], as follows:

The 5′ and 3′ ends are considered the root vertex.
A bulge or hairpin is considered a vertex when there are two or more consecutive unmatched nucleotides.
A junction is considered a vertex.
A stem is considered an edge if it has two or more complementary base pairs.

We consider the 5′ and 3′ ends as a vertex because, in rooted tree graphs, any edge must join two vertices. The other vertices might be involved in base pairing with unpaired bases elsewhere in the RNA molecule via tertiary interactions, stabilizing RNA’s three-dimensional structures, and they usually involve more than one base pair. Therefore, RNA’s bulges, loops, and junctions, represented as vertices, are determinants of RNA’s interaction, flexibility, and tertiary structure. Conversely, a minimum of two base pairs ensures the stability of the RNA stem against thermal fluctuations.

Example 5.

Following the above rules, Figure 3B shows the rooted tree graph associated with the secondary structure of the RNA shown in Figure 3A.

In order to obtain it, we associate the root vertex with the 5′ and 3′ ends, as stated in the first rule. Note that since bulge 8 does not satisfy the second rule, it does not represent a vertex; all other bulges and hairpins have a vertex associated with them. Then, because the stem between bulges 5 and 6 has only one complementary base pair, by the fourth rule it is not an edge, and both bulges are considered as just one. All other stems will be associated with an edge.

In Figure 3C, the associated rooted tree graph is shown; by the procedure of Section 2, it is associated with the string

.(.(.(.(.(.).).).).(.(.).).).

that encodes the graph and, hence, the structure of the RNA.

Note that the DBN in Example 4 codifies the secondary structure shown in Figure 3A. An interesting task is obtaining the string

.(.(.(.(.(.).).).).(.(.).).).

just from the DBN.

Given the RNA, with the above procedure, we associate it with a rooted tree graph or, equivalently, a string of dots and parentheses determining the rooted tree graph. This can be seen in Example 5. We will call such a string of dots and parentheses the Simplified Dot–Bracket Notation (SDBN), which encodes the graph. Note that a string in DBN is associated with a secondary structure while a string in SDBN is associated with a rooted tree graph. Therefore, given a string in DBN, the goal is to apply Gan et al.’s rules [7] and obtain the SDBN, which encodes the associated rooted tree graph.
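Since an SDBN string encodes a rooted tree graph, the tree can be rebuilt from the string. The sketch below is an illustration under our own conventions (the names `sdbn_to_tree` and `count_vertices` are hypothetical): each vertex is represented by the list of its children, an open parenthesis descends into a new child, and a closing parenthesis returns to the parent.

```python
def sdbn_to_tree(sdbn):
    """Parse an SDBN string into nested lists (a vertex = list of its children)."""
    root = []
    stack = [root]
    for ch in sdbn:
        if ch == "(":
            child = []
            stack[-1].append(child)   # new edge from the current vertex
            stack.append(child)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()               # walk back along the edge
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
        # dots only mark visits to the current vertex
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root


def count_vertices(tree):
    """Number of vertices of a tree given as nested lists."""
    return 1 + sum(count_vertices(child) for child in tree)


tree = sdbn_to_tree(".(.(.(.(.(.).).).).(.(.).).).")
print(count_vertices(tree))
```

For the SDBN of Example 5 this reports 8 vertices, one more than the 7 edges given by the number of open parentheses.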

It is worth noticing that, given a string in DBN, reducing each run of consecutive identical characters to a single character seems, at first glance, appropriate to obtain the string in SDBN. However, in the case of the Dot–Bracket Notation associated with Figure 3A, expressed as follows:

.((((((....(((((((((((..(((.(..((.....))..).)))..)))...)))))))).((((((((((.((((((..((....)))))))).))))))))))))))))..

reducing consecutive characters gives the following string:

.(.(.(.(.(.).).).).).(.(.(.).).

which is not associated with any rooted tree graph, since it does not have the same number of open and closing parentheses. The correct SDBN is obtained as shown in Example 5.
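The failure of the naive reduction can be checked directly. The following sketch (helper names `collapse_runs` and `is_balanced` are ours) collapses every run of identical characters in the DBN of Figure 3A and then counts the parentheses.

```python
from itertools import groupby


def collapse_runs(s):
    """Replace every run of identical characters by a single character."""
    return "".join(ch for ch, _ in groupby(s))


def is_balanced(s):
    """True if the parentheses in s are properly matched."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# DBN of Figure 3A (Example 4).
dbn = (".((((((....(((((((((((..(((.(..((.....))..).)))..)))...))))))))"
       ".((((((((((.((((((..((....)))))))).))))))))))))))))..")
naive = collapse_runs(dbn)
print(naive)
print(naive.count("("), naive.count(")"), is_balanced(naive))
```

The collapsed string has 8 open and 7 closing parentheses, so `is_balanced` returns `False`, which is why the stems and loops must be adjusted before the runs are collapsed.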

In the next section, given a string in DBN, we will see how to obtain the associated string in SDBN.

3.3. From DBN to SDBN

So far, given an RNA in FASTA format, we used a folding program to obtain the RNA secondary structure and the DBN. Now, following Gan et al.’s rules [7], we would like to generate the SDBN of the associated rooted tree graph. This will be done by modifying the original secondary structure using the following procedure:

i.: If a stem consists of only one complementary base pair, it must be removed. In addition, bulges or hairpins with fewer than two consecutive unpaired nucleotides are removed.
ii.: All bulges (or junctions) must have at least one dot separating each of the convergent stems; if this is not the case, insert one dot.
iii.: Finally, consecutive sequences of the same character will be reduced to one single character.

Example 6.

As an example of this procedure, we used the DBN associated with Figure 3A and reduced it to its SDBN; the changes from one step to the other are highlighted. In (a), we have the original DBN; note that since the stem between bulges 5 and 6 has only one complementary base pair, it is removed, obtaining (b). Since bulge 8 has no two or more consecutive unmatched nucleotides, it is also removed, obtaining (c). In bulges 2, 3, and 9, a dot must be inserted in regions I, II, and III, obtaining (d); see Figure 4. Finally, we reduce the characters and obtain (e), which is the SDBN of the rooted tree graph associated with the secondary structure in Figure 3.

(a): .((((((....(((((((((((..(((.(..((.....))..).)))..)))...)))))))).((((((((((.((((((..((....)))))))).))))))))))))))))..
(b): .((((((....(((((((((((..(((....((.....))....)))..)))...)))))))).((((((((((.((((((..((....)))))))).))))))))))))))))..
(c): .((((((....(((((((((((..(((....((.....))....)))..)))...)))))))).(((((((((((((((((..((....)))))))))))))))))))))))))..
(d): .((((((....((((((((.(((..(((....((.....))....)))..)))...)))))))).(((((((((((((((((..((....)).))))))))))))))))).))))))..
(e): .(.(.(.(.(.).).).).(.(.).).).

It is worth noting that, under this procedure, different RNA sequences may produce secondary structures that generate the same SDBN, and consequently, the same associated rooted tree graph. This occurs because our analysis emphasizes the rooted tree graph rather than the length of the stems, bulges, or junctions in the secondary structure.

In the following sections, we compare two sets of RNA sequences with different characteristics, with the aim of determining whether they share any underlying structural similarities. To address this question, Section 4 introduces an algorithm specifically designed for this purpose.

4. The Comparing Algorithm

For this analysis, we worked with three sets of RNA sequences in FASTA format. Two of these are control sets, denoted by $\bar{A}$ and $\bar{B}$, each containing sequences with defined functionality. We were interested to know whether the RNAs within these sets share structural characteristics that could be related to their function. The third set, $\bar{C}$, was the test group in which we searched for the presence of such shared structural features. In this study, $\bar{A}$ consists of experimentally verified lncRNAs, while $\bar{B}$ contains sequences confirmed not to be lncRNAs. The test set $\bar{C}$ includes RNA sequences whose potential classification as lncRNAs is being evaluated.

For each RNA sequence in FASTA format, we generated its secondary structure and associated DBN using NUPACK Web [14]. Thus, each dataset ($\bar{A}$, $\bar{B}$, and $\bar{C}$) includes the RNA identifier, its nucleotide sequence in FASTA format, and the associated DBN string. The DBN strings were then ordered by length, from longest to shortest. Finally, applying the rules proposed by Gan et al. [7], we obtained the corresponding SDBN sets A, B, and C. This procedure is outlined in Figure 5.

Given sets A, B, and C in SDBN, the proposed algorithm analyzes them by looking for repeated strings. It determines the repeated strings that appear most frequently within sets A and B, and it also determines whether each element of C shares more strings with set A or with set B. In each case, such an element is more likely to have the same functionality as the elements of the corresponding set; see Figure 6. This algorithm was implemented in Python v.3.8, and the corresponding pseudocode is provided in Appendix A.

In the next subsection, we describe in detail the algorithm that, given the SDBN sets A, B, and C, compares the shared strings between them and evaluates whether the elements in C exhibit structural characteristics consistent with lncRNAs.

The Algorithm

As mentioned previously, given sets $\bar{A}$, $\bar{B}$, and $\bar{C}$ in FASTA format, we apply a folding program, such as NUPACK Web [14], to obtain, respectively, the sets $\hat{A}$, $\hat{B}$, and $\hat{C}$ in DBN. Then, using Gan et al.’s rules [7], we get the sets $A_{H}$, $B_{H}$, and $C_{H}$, also in DBN, which are then simplified to SDBN, which carries the associated rooted tree graph information, obtaining sets A, B, and C, respectively. The algorithm uses sets A, B, and C.

Given integer parameters $l_{1}$ and $l_{2}$, we searched for repeated strings of length $l$ satisfying $l_{1} \leq l \leq l_{2}$ within set A (B), obtaining the collection of repeated strings $S_{A}$ ($S_{B}$). The choice of $l_{1}$ and $l_{2}$ depends on the datasets under study. In practice, we suggest using $l_{1} \geq 18$, as shorter strings are unlikely to provide meaningful structural information and would significantly increase the computational cost.
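The search for repeated strings can be sketched as follows. This is a hedged illustration, not the pseudocode of Appendix A: the function name `repeated_strings` is hypothetical, and we assume that a string counts as repeated when it occurs as a substring of at least two different elements of the set, which the text does not state explicitly.

```python
from collections import defaultdict


def repeated_strings(sdbn_set, l1, l2):
    """Substrings of length l1..l2 found in at least two members of sdbn_set.

    Assumption: "repeated" means shared by two or more different elements.
    """
    owners = defaultdict(set)
    for idx, s in enumerate(sdbn_set):
        for length in range(l1, l2 + 1):
            for start in range(len(s) - length + 1):
                owners[s[start:start + length]].add(idx)
    return sorted(sub for sub, who in owners.items() if len(who) >= 2)


# Toy SDBN-like strings; real runs would use l1 >= 18 as suggested above.
toy_set = [".(.).(.).", ".(.).", "(.)"]
print(repeated_strings(toy_set, 3, 4))
```

The number of candidate substrings grows with both the string lengths and the width of the interval from `l1` to `l2`, which is consistent with the remark that short lengths increase the computational cost.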

To identify the most relevant substructures in $S_{A}$ for distinguishing lncRNAs, we first counted the occurrences of each string from $S_{A}$ in both sets A and B. This produces the matrices $M_{AA}$ and $M_{AB}$, respectively, where the $ij$-th entry represents the number of times the $j$-th string of $S_{A}$ appears in the $i$-th sequence of A (B). If A contains $m$ elements and $S_{A}$ contains $n$ strings, then $M_{AA}$ is an $m \times n$ matrix. Analogously, using $S_{B}$, A, and B, we obtain the matrices $M_{BA}$ and $M_{BB}$.

Next, we compared the matrices $M_{AA}$ and $M_{AB}$ to determine which strings in $S_{A}$ form the subset of relevant strings from A, denoted by $S'_{A}$. This subset can be used as a marker for the detection of lncRNAs. To achieve this, we sum each column of $M_{AA}$ ($M_{AB}$) to obtain the vector $C_{AA}$ ($C_{AB}$). We then established a discriminant threshold $l_{3}$ and evaluated whether $C_{AA} - C_{AB} \geq l_{3}$. When $C_{AA_i} - C_{AB_i} \geq l_{3}$, it means that the $i$-th element of $S_{A}$ belongs to $S'_{A}$; also note that if $C_{AA_i} - C_{AB_i} = 0$, then such an element appears as many times in A as in B, and therefore, it cannot be used to detect lncRNAs. Similarly, with $M_{BA}$ and $M_{BB}$, we evaluated whether $C_{BB} - C_{BA} \geq l_{3}$ and then determined the set of relevant strings of B, $S'_{B}$.
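The occurrence matrices, their column sums, and the threshold test can be sketched with plain lists. The helper names below are hypothetical, and counting overlapping occurrences is our assumption, since the text does not say whether overlaps are counted.

```python
def count_overlapping(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def occurrence_matrix(sequences, strings):
    """Entry [i][j] = occurrences of strings[j] in sequences[i]."""
    return [[count_overlapping(seq, s) for s in strings] for seq in sequences]


def column_sums(matrix, width):
    return [sum(row[j] for row in matrix) for j in range(width)]


def relevant_strings(strings, own_set, other_set, l3):
    """Strings whose column-sum difference (own minus other) is at least l3."""
    c_own = column_sums(occurrence_matrix(own_set, strings), len(strings))
    c_other = column_sums(occurrence_matrix(other_set, strings), len(strings))
    return [s for s, a, b in zip(strings, c_own, c_other) if a - b >= l3]


# Toy data standing in for A, B, and S_A.
A = [".(.).(.).", ".(.)."]
B = ["(.)"]
S_A = ["(.)", ".(."]
print(occurrence_matrix(A, S_A))         # plays the role of M_AA
print(relevant_strings(S_A, A, B, l3=3)) # plays the role of S'_A
```

Calling `relevant_strings(S_B, B, A, l3)` with the roles of the two sets exchanged gives the analogous subset for B.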

Now we look for the relevant strings from A and B in set C; to do so, we define the set $D = A + B + C$, i.e., the elements of A followed by those of B and those of C. As before, we looked for elements from $S'_{A}$ in A and D, obtaining the matrices $M'_{AA}$ and $M'_{AD}$ (we followed the same computation for $M'_{BB}$ and $M'_{BD}$). From $M'_{AA}$, we compute the row-sum vector $R_{AA}$; note that each $R_{AA_i}$ indicates the number of relevant strings in the $i$-th element (the vectors $R_{BB}$, $R_{AD}$, and $R_{BD}$ are computed in the same way).

The next step is to compute the first quartile $Q1_{A}$ ($Q1_{B}$) of the vector $R_{AA}$ ($R_{BB}$), which will be used to establish a ranking that ensures a high probability that an RNA is an lncRNA. We hypothesized that if the string of an RNA possesses many elements of $S'_{A}$ and only a few elements of $S'_{B}$, it is more likely that the RNA belongs to set A. Therefore, the next step is to determine whether the $i$-th entries $R_{AD_i}$ in $R_{AD}$ and $R_{BD_i}$ in $R_{BD}$ satisfy the following restrictions:

$$R_{AD_i} > 0.5 \cdot Q1_{A}, \qquad R_{BD_i} < 1.5 \cdot Q1_{B}.$$

If so, it means that the $i$-th element of D possesses more of the relevant strings from $S'_{A}$ and only a few of those from $S'_{B}$. In this case, such an element in D is identified as an lncRNA. Therefore, we can assign a vector $V_{X}$ of length $|X|$, where $X \in \{A, B, C, D\}$, as follows: If the $i$-th element in X is detected as an lncRNA, a one is assigned to the $i$-th position of the vector $V_{X}$; otherwise, it is a zero. Note that in this case, we have $V_{D} = V_{A} + V_{B} + V_{C}$.
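The quartile test can be sketched with the standard library. The function names are hypothetical, and the quartile convention is an assumption: `statistics.quantiles` with `method="inclusive"` is used here, whereas the paper does not state which definition of the first quartile was applied, and other conventions can give slightly different thresholds.

```python
from statistics import quantiles


def first_quartile(values):
    """First quartile (inclusive method; an assumed convention)."""
    return quantiles(values, n=4, method="inclusive")[0]


def detect(r_ad, r_bd, q1_a, q1_b):
    """1 where R_AD_i > 0.5 * Q1_A and R_BD_i < 1.5 * Q1_B, else 0."""
    return [1 if a > 0.5 * q1_a and b < 1.5 * q1_b else 0
            for a, b in zip(r_ad, r_bd)]


# Toy row-sum vectors standing in for R_AA, R_BB, R_AD, and R_BD.
q1_a = first_quartile([1, 2, 3, 4, 5])
q1_b = first_quartile([1, 2, 3, 4, 5])
print(detect([3, 1, 5], [0, 4, 2], q1_a, q1_b))
```

With these toy values both quartiles equal 2.0, so an element is flagged when it has more than 1 relevant string from $S'_{A}$ and fewer than 3 from $S'_{B}$.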

Clearly, more elements of A should be detected as lncRNAs than those from B, and in this case, the elements in C that were also detected have a higher probability of being lncRNAs. Since this was not the case for some combinations of sets A, B, and C, we added the following restriction, which retains only the cases where more elements from A are detected than from B. Let $N_{A}$ and $N_{B}$ be the numbers of elements of A and B, respectively, detected as lncRNAs. When $N_{A}$ and $N_{B}$ satisfy

$$N_{A} \geq l_{4} \quad \text{and} \quad N_{B} \leq l_{5},$$

the elements of C also detected are more likely to be lncRNAs. Here, $l_{4}$ and $l_{5}$ are suggested to satisfy $l_{4} \geq \frac{|A|}{2}$ and $l_{5} \leq \frac{|B|}{4}$, where $|A|$ and $|B|$ are the cardinalities of A and B, respectively.

Given sets A, B, and C, the result of applying this algorithm is the vector $V_{C}$, which will be denoted as follows:

$$\mathcal{A}(A, B, C, l_{1}, l_{2}, l_{3}, l_{4}, l_{5}) \quad \text{or just} \quad \mathcal{A}(A, B, C).$$

For example, if C has six elements and $\mathcal{A}(A, B, C) = V_{C} = [0, 1, 0, 0, 1, 0]$, this means that the algorithm detects as possible lncRNAs the second and fifth elements of set C. Note that for some combinations of sets A, B, and C, it could be possible that the inequalities $N_{A} \geq l_{4}$ and $N_{B} \leq l_{5}$ are not satisfied. In this case, the output will be $V_{C} = [0, 0, 0, 0, 0, 0]$. It is worth noting that this result may still be obtained even when the inequalities are satisfied. A schematic representation of the algorithm is provided in Figure 7.
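The final gating step can be written as a small function. The name `gated_output` is hypothetical; the sketch only assumes that the detection vectors for A, B, and C are already available as lists of zeros and ones.

```python
def gated_output(v_a, v_b, v_c, l4, l5):
    """Return V_C if N_A >= l4 and N_B <= l5, otherwise an all-zero vector."""
    n_a, n_b = sum(v_a), sum(v_b)   # elements of A and B detected as lncRNAs
    if n_a >= l4 and n_b <= l5:
        return list(v_c)
    return [0] * len(v_c)


# 12-element control sets with l4 = 7 and l5 = 3.
v_a = [1] * 8 + [0] * 4
v_b = [1] * 2 + [0] * 10
print(gated_output(v_a, v_b, [0, 1, 0, 0, 1, 0], l4=7, l5=3))
```

For control sets of 12 elements, the suggested bounds give $l_{4} \geq 6$ and $l_{5} \leq 3$, which the values 7 and 3 used in this toy call satisfy.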

5. Results

The algorithm was used to determine whether, in a set of sequences, some of them could be identified as potential lncRNAs. Table 1 lists, along with their lengths, the sequences in the following sets: the control set $\bar{A}$, which consists of the 18 sequences recorded in the literature [17] as lncRNAs until 2019; the control set $\bar{B}$, which consists of 18 non-lncRNA sequences; and the test set $\bar{C}$, which consists of 3 RNA sequences that are known not to be lncRNAs (KAP123, GRE2, and ECM11) and 3 potential lncRNA sequences. Then, as mentioned above, by using NUPACK Web [14], we obtained the sets $\hat{A}$, $\hat{B}$, and $\hat{C}$ in DBN and then, after applying Gan et al.’s rules [7], we obtained the sets A, B, and C in SDBN. All sequences used here are from S. cerevisiae.

While thousands of putative lncRNAs have been computationally annotated in S. cerevisiae, only a small subset has experimentally confirmed functional evidence. For model training, we used 18 such high-confidence lncRNAs curated from SGD, NONCODE, and the literature (Xu et al. [18]; van Dijk et al. [19]; Geisler et al. [20]; Castelnuovo et al. [21]; and Balarezo-Cisneros et al. [22]). This choice prioritizes well-characterized lncRNAs over dataset size, minimizing the inclusion of uncertain or artifactual transcripts.

5.1. Computations

To validate our results, we randomly drew 30 subsets $A_{i}$ of 12 elements of A and another 30 subsets $CA_{i}$ of 3 elements of A, where $i = 1, 2, \dots, 30$. The sets $A_{i}$ and $CA_{i}$ are disjoint. Following the same procedure, we obtained subsets $B_{j}$ and $CB_{j}$ for $j = 1, 2, \dots, 30$. The algorithm parameters were heuristically calibrated so that a run is taken into account when it detects at least seven elements of set A and, at most, three elements of set B. The parameters used were $l_{1} = 22$, $l_{2} = 34$, $l_{3} = 2$, $l_{4} = 7$, and $l_{5} = 3$. The threshold $l_{3} = 2$ was chosen after observing that, in this case, the values of the elements in the vectors $C_{AA} - C_{AB}$ and $C_{BB} - C_{BA}$ typically range between 0 and 6. As this value increases, the number of associated strings decreases. Then, given $K = (0, 0, 0, 0, 0, 0)$, we performed the following computations:

$$\begin{array}{l} \text{for } i \text{ in range}(30) \\ \quad \text{for } j \text{ in range}(30) \\ \quad \quad K = K + \mathcal{A}(A_{i}, B_{j}, CA_{i} + CB_{j}). \end{array}$$

Note that vector K will keep track of the detected lncRNAs.
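The resampling scheme can be sketched as follows. This is an illustration under stated assumptions: `split_control` and `bootstrap` are hypothetical names, `algorithm` stands for any callable that takes three sets and returns a six-entry 0/1 vector, and the disjoint subsets are obtained by sampling 15 elements and splitting them into 12 and 3, which is one possible way to meet the disjointness requirement.

```python
import random


def split_control(items, n_train, n_test, rng):
    """Draw disjoint random subsets of sizes n_train and n_test from items."""
    picked = rng.sample(items, n_train + n_test)
    return picked[:n_train], picked[n_train:]


def bootstrap(algorithm, A, B, rounds=30, seed=0):
    """Accumulate K over all pairs (i, j) of resampled control sets."""
    rng = random.Random(seed)
    a_splits = [split_control(A, 12, 3, rng) for _ in range(rounds)]
    b_splits = [split_control(B, 12, 3, rng) for _ in range(rounds)]
    K = [0] * 6
    for A_i, CA_i in a_splits:
        for B_j, CB_j in b_splits:
            V = algorithm(A_i, B_j, CA_i + CB_j)   # six-entry 0/1 vector
            K = [k + v for k, v in zip(K, V)]
    return K


# Stand-in algorithm that always flags the three held-out elements of A.
always_first_three = lambda a, b, c: [1, 1, 1, 0, 0, 0]
print(bootstrap(always_first_three, list(range(18)), list(range(18, 36))))
```

With 30 splits of each control set the double loop runs 900 times, so the stand-in algorithm yields `[900, 900, 900, 0, 0, 0]`.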

Clearly, the elements of $CA_{i}$ should be more likely to be detected as lncRNAs while the ones in $CB_{j}$ should not. If we use algorithm $\mathcal{A}$ to perform 900 trials and sum the vectors obtained, we get results such as $K = (170, 381, 259, 41, 2, 195)$, which means that the second element was detected as lncRNA 381 times, while the fifth element was detected only twice. When considering the overall results of the detected sequences, we found that the lncRNAs were correctly identified 77.3% of the time, while 22.7% were not. Recall that in this case, the third set $CA_{i} + CB_{j}$ varies with $i$ and $j$; therefore, it is different each time.

In contrast, instead of varying the third set, we could fix it to $CA_{3} + CB_{5}$ (any other combination would also work) and compute $\mathcal{A}(A_{i}, B_{j}, CA_{3} + CB_{5})$, varying $i$ and $j$. The result was $(370, 406, 597, 149, 0, 0)$, which means that the third element in $CA_{3}$ was correctly detected as lncRNA 597 times out of 900, while the second element of $CB_{5}$ was not detected as lncRNA even once. This implies that 90.21% of the time, the lncRNAs in $CA_{3}$ were correctly detected. We performed the same computations for other fixed sets, $CA_{i_{0}} + CB_{j_{0}}$, and whenever we ran the experiment, the largest number was among the first three positions, i.e., it corresponded to an lncRNA in the set $CA_{i_{0}}$.
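The two percentages quoted for the tallies $(170, 381, 259, 41, 2, 195)$ and $(370, 406, 597, 149, 0, 0)$ follow from dividing the detections in the first three positions (the known lncRNAs) by the total number of detections. A short check, with the hypothetical helper `share_of_known`:

```python
def share_of_known(K, n_known=3):
    """Percentage of all detections that fall on the first n_known positions."""
    return 100 * sum(K[:n_known]) / sum(K)


print(round(share_of_known([170, 381, 259, 41, 2, 195]), 1))  # varying third set
print(round(share_of_known([370, 406, 597, 149, 0, 0]), 2))   # fixed third set
```

This gives 810/1048, about 77.3%, for the varying third set and 1373/1522, about 90.21%, for the fixed one.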

If we record as lncRNA only the sequence with the largest associated number, then for the case $(370, 406, 597, 149, 0, 0)$, we only take as a detected lncRNA the third sequence and denote this result as $V_{R} = [0, 0, 1, 0, 0, 0]$. Then we performed this procedure 500 times, but varying the set $CA_{3j} + CB_{5j}$, for $j = 1, \dots, 500$, and the result of the sum of these computations was the vector

$$V_{R} = [149, 170, 181, 0, 0, 0],$$

which indicates that, in all 500 runs, the sequences detected were indeed lncRNAs, and at the same time, no element of the set of non-lncRNAs was detected as an lncRNA. The results are shown in Figure 8A.

Nevertheless, if we fix the third set as the original set C and, as before, we run the algorithm 500 times in order to record the number of times each of the elements in the set C was detected as lncRNA (see Figure 8B), we obtain the following:

$$V_{R} = [0, 0, 0, 5, 306, 189].$$

Therefore, our results indicate that RNA 6754 was detected as lncRNA 61.2% of the time, while RNA 12189 and RNA 1477 were detected as lncRNA 37.8% and 1% of the time, respectively. Note that the other three RNAs, known not to be lncRNAs, were not detected even once as lncRNAs. Since in all cases the sets $A_{i}$ and $B_{j}$ are taken randomly, the results vary slightly each time. For example, in another round, the results were as follows:

$$V_{R} = [0, 0, 0, 3, 291, 206].$$

It is important to note that RNA 12189 was experimentally validated as an lncRNA by Novačić et al. [23]. This finding supports our results and further suggests that RNA 6754, which consistently showed the highest values in our analysis, should be prioritized for experimental testing to determine whether it is indeed an lncRNA.

We estimated the time complexity of this algorithm empirically by measuring the average time it takes the algorithm to conduct 100 tests when the number of subsets taken into account varies between 16, 18, 20, …, 32. The results are shown in Figure 9, where the graph of the polynomial $1.35 x^{2} - 12.90 x + 129$ that best approximates the points $(x_{i}, y_{i})$ is shown in red.

$$x = [16, 18, 20, 22, 24, 26, 28, 30, 32], \qquad y = [308, 367, 448, 579, 639, 743, 850, 972, 1165].$$

This suggests that its time complexity is $O(n^{2})$. Note that even if we increase the number of RNAs in sets A, B, and C, if we keep fixed the sizes of the sets used for the bootstrapping (in this case 12, 12, and 6), similar results to those shown in Figure 9 are expected.
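A quadratic can be fitted to measured timings by ordinary least squares. The sketch below is one self-contained way to do it, solving the 3 × 3 normal equations with Cramer's rule in exact rational arithmetic; the function name `fit_quadratic` is hypothetical, and the fitting method actually used for Figure 9 is not stated in the text, so this is an assumption.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares [a, b, c] for y ~ a*x**2 + b*x + c."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sum of y*x^k
    # Normal equations, one row per partial derivative (a, b, c).
    M = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(M)
    coeffs = []
    for col in range(3):
        Mc = [row[:] for row in M]
        for r in range(3):
            Mc[r][col] = rhs[r]
        coeffs.append(det3(Mc) / d)
    return coeffs


x = [16, 18, 20, 22, 24, 26, 28, 30, 32]
y = [308, 367, 448, 579, 639, 743, 850, 972, 1165]
print([float(c) for c in fit_quadratic(x, y)])
```

The printed coefficients can be compared with the reported polynomial. A good quadratic fit over this range is empirical evidence of quadratic growth rather than a proof of the asymptotic bound.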

5.2. Using Other RNA Folding Software Programs

As different software programs can generate different RNA secondary structures, we decided to use the algorithm with RNAfold WebServer [15] and rna-state-inf [16] as folding software. In addition, we used the same parameters ($l_{1} = 22$, $l_{2} = 34$, $l_{3} = 2$, $l_{4} = 7$, and $l_{5} = 3$) as in the NUPACK Web [14] case.

In Figure 10A, we show the results in the case of the RNAfold WebServer [15] while using $C A_{3 j} + C B_{5 j}$ as the third set, i.e., known lncRNAs and known non-lncRNAs. In this case, we obtained the following vector:

V_{R} = [164, 183, 142, 3, 6, 2] .

We observed that in 489 out of 500 trials (97.8%), the known lncRNAs were correctly detected, while only in 11 cases (2.2%) were the known non-lncRNAs mistakenly classified as lncRNAs.

In contrast, Figure 10B shows the results when the third set was C, where the corresponding vector was as follows:

V_{R} = [9, 0, 0, 0, 0, 491] .

This outcome indicates that, when using RNAfold WebServer, RNA 12189 is the most likely candidate to be an lncRNA, which agrees with its experimental validation [23].

The results when we used rna-state-inf [16] are shown in Figure 11. In the case of known lncRNAs and known non-lncRNAs, the result was as follows:

V_{R} = [183, 158, 155, 2, 1, 1] .

Figure 11A shows that lncRNA sequences were accurately detected 99.2% of the time, while 0.8% of the detected sequences were incorrectly identified.

In contrast, when we used set C, we obtained the following associated vector:

V_{R} = [0, 114, 103, 0, 0, 283] .

This vector suggests that RNA 12189 should be an lncRNA. Note that the RNAs GRE2 and ECM11, which are not lncRNAs, were wrongly detected 114 and 103 times, respectively (see Figure 11B).

Although there are several other software programs to predict the secondary structure of RNA, such as MFold WebServer [24], mXfold2 [25], SPOT-RNA2 [26], and UFold [27], we were unable to obtain DBN strings with them for all sequences listed in Table 1 because of the sequence lengths. More specifically, these tools restrict the lengths of RNA sequences to 2400, 2000, 1000, and 600 nucleotides, respectively. It is important to emphasize that the purpose of this analysis was not to compare or determine which RNA folding software performs best, but rather to demonstrate that the detection of known lncRNAs using our proposed algorithm remains consistent across different prediction programs. Known lncRNAs were correctly identified in 100%, 97.8%, and 99.2% of the instances when using NUPACK Web [14], RNAfold WebServer [15], and rna-state-inf [16], respectively, which supports the robustness and reliability of our approach.

As previously described, the parameters $l_{i}$, for $i = 1, 2, \dots, 5$, were determined heuristically using NUPACK Web [14] and were kept unchanged when applying the same procedure with RNAfold WebServer [15] and rna-state-inf [16]. However, it is expected that suitable adjustment of these parameters for each specific folding prediction program could further enhance performance, particularly with RNAfold WebServer [15] or rna-state-inf [16]. The observed variations among the three programs in dataset C are consistent with the fact that different folding algorithms generate distinct secondary structures and, consequently, different associated graphs. For example, RNA 6754 exhibits a larger number of relevant substructures in common with those of set A when analyzed with NUPACK Web [14], making it more likely to be classified as an lncRNA. However, this same RNA was not identified as an lncRNA when processed with RNAfold [15] or rna-state-inf [16], underscoring the influence of the underlying folding model and energy parameters on the resulting graph representation and detection outcome. Finally, a more detailed assessment of the sensitivity of the proposed method across folding programs is left for future work.
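To make the comparison across folding programs concrete, the sketch below tabulates the vectors reported in Sections 5.1 and 5.2. It assumes that the entries of each set-C vector follow the order of Table 1 and that, in the control vectors, the first three entries correspond to known lncRNAs and the last three to known non-lncRNAs (both assumptions are consistent with the reported percentages). All names are illustrative.

```python
SET_C = ["KAP123", "GRE2", "ECM11", "1477", "6754", "12189"]

# Detection counts on the original set C, one vector per folding program.
SET_C_COUNTS = {
    "NUPACK Web": [0, 0, 0, 5, 306, 189],
    "RNAfold WebServer": [9, 0, 0, 0, 0, 491],
    "rna-state-inf": [0, 114, 103, 0, 0, 283],
}

# Detection counts on the control third set (known lncRNAs + non-lncRNAs).
CONTROL_COUNTS = {
    "RNAfold WebServer": [164, 183, 142, 3, 6, 2],
    "rna-state-inf": [183, 158, 155, 2, 1, 1],
}


def top_candidate(counts):
    """Name and count of the most frequently detected element of set C."""
    best = max(range(len(counts)), key=counts.__getitem__)
    return SET_C[best], counts[best]


def control_split(counts, runs=500):
    """Percentage of detections that hit known lncRNAs vs. non-lncRNAs."""
    return 100 * sum(counts[:3]) / runs, 100 * sum(counts[3:]) / runs


for tool, counts in SET_C_COUNTS.items():
    print(tool, "->", top_candidate(counts))
for tool, counts in CONTROL_COUNTS.items():
    print(tool, "->", control_split(counts))
```

Laid out this way, the disagreement on RNA 6754 and the agreement on RNA 12189 for two of the three programs are easy to see.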

6. Conclusions

To study the structural similarities between RNA sequences, we used the approach of Gan et al. [7], which consists of assigning a graph to an RNA secondary structure. They used it to estimate the secondary structural repertoire of RNA and showed that known RNA trees represented a small subset of all possible motifs; therefore, some of the “missing” motifs could represent new RNAs subject to design. Building on these ideas, we proposed a set of rules to derive simplified DBN representations (SDBN) only from secondary structures in DBN strings. Furthermore, we introduced an algorithm designed to identify shared strings between elements of two RNA sets, with the goal of defining the parameters capable of distinguishing between the different sets of RNAs.

This algorithm was tested on three data sets from S. cerevisiae as follows: two control sets, $\bar{A}$ and $\bar{B}$, and one test set $\bar{C}$ from Table 1. From these, we generated the corresponding structure sets that allowed us to evaluate whether an RNA sequence is more likely to be an lncRNA. In particular, when the third set was fixed as $C A_{i_{0}} + C B_{j_{0}}$, for given $i_{0}$ and $j_{0}$, the algorithm consistently achieved certainty levels close to 100%, regardless of the folding software used (NUPACK Web [14], RNAfold WebServer [15], and rna-state-inf [16]).

Our results highlighted two RNA sequences, 6754 and 12189, as strong lncRNA candidates. Importantly, sequence 12189 had already been experimentally validated as an lncRNA by Novačić et al. [23] in 2020, lending independent support to our methodology. These findings suggest that the proposed approach can serve as a valuable computational tool for lncRNA identification in S. cerevisiae.

In contrast to the approaches of Gan et al. [7]—which explored the global landscape of RNA structural motifs—and Siyu et al. [10]—who relied on sequence composition and machine learning for lncRNA prediction—our work introduces a fundamentally different strategy. By defining a rule-based algorithm that generates simplified dot–bracket representations and identifies shared structural patterns between RNA sets, we provide a direct, structure-centered means to distinguish functional RNA classes and uncover novel lncRNA candidates with high confidence.

In the future, we plan to enhance the predictive power of the algorithm by integrating additional features, such as free energy analysis, structural motif characterization, and improved input set homogenization. We also aim to extend our study to RNA sequences from other organisms, thereby testing the broader applicability of the method.

In summary, by using RNA secondary structure information, we were able to identify RNAs that share structural fingerprints with known lncRNAs and are hence more likely to share similar functions. Although this study focused on lncRNAs, the approach can be applied to explore other RNA functionalities, opening the door to new applications in RNA biology. As an example, recent studies have shown that tumor cells can secrete lncRNAs into human biological fluids in the form of microvesicles, exosomes, or protein complexes. These secreted molecules form circulating lncRNAs that remain in a stable state and are not degraded by RNA enzymes. Such circulating lncRNAs can serve as biomarkers for cancer. Hu et al. [29] showed that the lncRNAs SPRY4-IT1, ANRIL, and NEAT1 exhibit high sensitivity and specificity in non-small cell lung cancer (NSCLC), suggesting their potential as novel diagnostic markers for NSCLC. Similarly, in the case of breast cancer, Xu et al. [28] compared the ROC and AMC values of lncRNAs with some common serum tumor markers and demonstrated that lncRNA RP11-445H22.4 showed the highest sensitivity and specificity. Our algorithm could potentially be applied to identify novel lncRNAs with structural similarities to those associated with tumor malignancy.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by D.H.-G., H.C.-I. and L.R.-R. The first draft of the manuscript was written by D.H.-G. and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The software implementation of the presented algorithm is openly available at https://github.com/Dave-HG/A-graph-based-algorithm-for-detecting-long-non-coding-RNAs-through-RNA-secondary-structure-analysis, accessed on 13 October 2025.

Acknowledgments

The authors wish to express their sincere gratitude to the anonymous reviewers for their suggestions, which have greatly improved the article. D. Hernández-Granados would like to thank Secretaría de Ciencia, Humanidades, Tecnologías e Innovación (SECIHTI) for the granted doctoral scholarship. H. Cabrera-Ibarra also thanks SECIHTI (CVU: 25479) and IPICYT for their support. We are grateful to Alfredo Trujillo-Rodríguez for kindly providing the RNA groups and sequences used in this study, as well as for their insightful comments and suggestions, which greatly contributed to improving the analyses of this work.

Conflicts of Interest

No conflict of interest exists, and if accepted, the article will not be published elsewhere in the same form, in any language, without the written consent of the publisher.

Appendix A

In this section, the pseudocode of the algorithm is shown so that it can be implemented in different programming languages. It takes as input the sets $\hat{A}$, $\hat{B}$, and $\hat{C}$, consisting of known lncRNAs, known non-lncRNAs, and candidate lncRNAs, respectively.

Pseudocode:

1: Use a software program to obtain the strings in DBN from the RNAs in FASTA format.
2: Simplify, based on Gan et al.’s rules [7], the strings in DBN to SDBN, obtaining the sets A, B, and C, respectively.
3: Given $l_{1}$ and $l_{2}$, generate the subset $S_{A}$ ($S_{B}$) of repeated strings within A (B) with length l satisfying $l_{1} \leq l \leq l_{2}$.
4: Use $S_{A}$ ($S_{B}$) to compute the matrix of repeated strings present in A, denoted by $M_{A A}$ (similarly compute $M_{A B}$, $M_{B B}$, and $M_{B A}$).
5: Establish an adequate discriminant $l_{3}$, here $l_{3} = 2$, and sum the columns in $M_{A A}$ to obtain $C_{A A}$ (similarly $C_{A B}$, $C_{B B}$, and $C_{B A}$). Evaluate $C_{A A} - C_{A B} \geq l_{3}$ to determine the set of relevant strings ${S'}_{A}$ of A ($C_{B B} - C_{B A} \geq l_{3}$ for ${S'}_{B}$).
6: As in step 4, with sets ${S'}_{A}$ (${S'}_{B}$) and $D = A + B + C$, compute the matrix of relevant strings present in A, denoted by ${M'}_{A A}$ (similarly compute ${M'}_{B B}$, ${M'}_{A D}$, and ${M'}_{B D}$). Also, compute the row sums $R_{A A}$ ($R_{A D}$, $R_{B B}$, and $R_{B D}$), where the i-th entry in $R_{A A}$ is the number of appearances of elements of ${S'}_{A}$ in the i-th element of A. In addition, compute the associated first quartile $Q 1_{A}$ ($Q 1_{B}$).
7: Determine whether the i-th entries $R_{A D i}$ in $R_{A D}$ and $R_{B D i}$ in $R_{B D}$ satisfy both restrictions: $R_{A D i} > 0.5 * Q 1_{A}$ and $R_{B D i} < 1.5 * Q 1_{B}$. If so, the i-th element of D is detected as an lncRNA.
8: Let $N_{A}$ ($N_{B}$) be the number of elements of A (B) detected as lncRNAs. When $N_{A} \geq 7$ and $N_{B} \leq 3$, the elements of C that were also detected, $V_{C}$, are more likely to be lncRNAs.
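Steps 3 to 7 can be sketched in a few functions. The code below is only one possible reading of the pseudocode, not the authors' implementation (which is linked in the Data Availability Statement). It assumes that a "repeated" string is a substring occurring in at least two members of a set, that the matrices record presence (0 or 1) rather than multiplicity, and that the first quartile is computed with Python's default `statistics.quantiles` method. The toy strings and the short lengths (3 instead of 22 to 34) are chosen only to keep the example readable.

```python
from statistics import quantiles


def substrings(s, lo, hi):
    """Distinct substrings of s with length between lo and hi, inclusive."""
    return {s[i:i + n] for n in range(lo, hi + 1) for i in range(len(s) - n + 1)}


def repeated_strings(group, lo, hi):
    """Step 3 (assumed reading): substrings found in at least two members."""
    seen, repeated = set(), set()
    for s in group:
        subs = substrings(s, lo, hi)
        repeated |= subs & seen
        seen |= subs
    return sorted(repeated)


def presence_matrix(strings, group):
    """Steps 4 and 6 (assumed reading): M[i][j] = 1 if strings[j] is in group[i]."""
    return [[int(t in s) for t in strings] for s in group]


def relevant_strings(own, other, lo, hi, l3):
    """Step 5: keep strings present in at least l3 more members of own than other."""
    cand = repeated_strings(own, lo, hi)
    c_own = [sum(col) for col in zip(*presence_matrix(cand, own))]
    c_other = [sum(col) for col in zip(*presence_matrix(cand, other))]
    return [t for t, a, b in zip(cand, c_own, c_other) if a - b >= l3]


def row_sums(strings, group):
    """Step 6: number of relevant strings present in each member of group."""
    return [sum(row) for row in presence_matrix(strings, group)]


def detect(s_a, s_b, a, b, d):
    """Step 7: flag members of d that satisfy both quartile restrictions."""
    q1_a = quantiles(row_sums(s_a, a), n=4)[0]
    q1_b = quantiles(row_sums(s_b, b), n=4)[0]
    return [ra > 0.5 * q1_a and rb < 1.5 * q1_b
            for ra, rb in zip(row_sums(s_a, d), row_sums(s_b, d))]


# Toy SDBN-like strings, invented for illustration only.
A = ["..(.)..", ".(.).", ".(.)(.)."]
B = ["((.))", "(((.)))", ".((.))"]
C = [".(.)..", "((.)).", "...."]

S_A = relevant_strings(A, B, lo=3, hi=3, l3=2)
S_B = relevant_strings(B, A, lo=3, hi=3, l3=2)
print(S_A, S_B, detect(S_A, S_B, A, B, C))
```

Step 8 would then count how many members of A and B are flagged by `detect` and accept the flagged members of C only when those counts meet the stated thresholds.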

References

1. Eddy, S.R. Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. 2001, 2, 919–929.
2. Waters, L.S.; Storz, G. Regulatory RNAs in bacteria. Cell 2009, 136, 615–628.
3. Fernandes, J.C.R.; Acuña, S.M.; Aoki, J.I.; Floeter-Winter, L.M.; Muxel, S.M. Long Non-Coding RNAs in the Regulation of Gene Expression: Physiology and Disease. Noncoding RNA 2019, 5, 17.
4. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57–74.
5. Zampetaki, A.; Albrecht, A.; Steinhofel, K. Long Non-coding RNA Structure and Function: Is There a Link? Front. Physiol. 2019, 10, 1127.
6. Liu, P.; Lusk, J.; Jonoska, N.; Vázquez, M. Tree polynomials identify a link between co-transcriptional R-loops and nascent RNA folding. PLoS Comput. Biol. 2024, 20, e1012669.
7. Gan, H.H.; Pasquali, S.; Schlick, T. Exploring the repertoire of RNA secondary motifs using graph theory: Implications for RNA design. Nucleic Acids Res. 2003, 31, 2926–2943.
8. Mamuye, A.L.; Rucco, M.; Tesei, L.; Merelli, E. Persistent Homology Analysis of RNA. Mol. Based Math. Biol. 2016, 4, 14–25.
9. Agrawal, D.K.; Tang, X.; Westbrook, A.; Marshall, R.; Maxwell, C.S.; Lucks, J.; Noireaux, V.; Beisel, C.L.; Dunlop, M.J.; Franco, E. Mathematical Modeling of RNA-Based Architectures for Closed Loop Control of Gene Expression. ACS Synth. Biol. 2018, 7, 1219–1228.
10. Siyu, H.; Yanchun, L.; Qin, M.; Yangyi, X.; Yu, Z.; Wei, D.; Cankun, W.; Ying, L. LncFinder: An integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief. Bioinform. 2019, 20, 2009–2027.
11. Cao, L.; Wang, Y.; Bi, C.; Ye, Q.; Yin, T.; Ye, N. PreLnc: An Accurate Tool for Predicting lncRNAs Based on Multiple Features. Genes 2020, 11, 981.
12. Yamashita, A.; Shichino, Y.; Yamamoto, M. The long non-coding RNA world in yeasts. Biochim. Biophys. Acta 2016, 1859, 147–154.
13. Diestel, R. Graph Theory; Graduate Texts in Mathematics, 5 (173); Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–17.
14. Zadeh, J.N.; Steenberg, C.D.; Bois, J.S.; Wolfe, B.R.; Pierce, M.B.; Khan, A.R.; Dirks, R.M.; Pierce, N.A. NUPACK: Analysis and design of nucleic acid systems. J. Comput. Chem. 2011, 32, 170–173.
15. Gruber, A.R.; Lorenz, R.; Bernhart, S.H.; Neuböck, R.; Hofacker, I.L. The Vienna RNA Websuite. Nucl. Acids Res. 2008, 36, 70–74.
16. Willmott, D.; Murrugarra, D.; Ye, Q. Improving RNA secondary structure prediction via state inference with deep recurrent neural networks. Comput. Math. Biophys. 2020, 8, 36–50.
17. Cherry, J.M.; Hong, E.L.; Amundsen, C.; Balakrishnan, R.; Binkley, G.; Chan, E.T.; Christie, K.R.; Costanzo, M.C.; Dwight, S.S.; Engel, S.R.; et al. Saccharomyces Genome Database: The genomics resource of budding yeast. Nucleic Acids Res. 2012, 40, 36–50.
18. Xu, Z.; Wei, W.; Gagneur, J.; Perocchi, F.; Clauder-Münster, S.; Camblong, J.; Guffanti, E.; Stutz, F.; Huber, W.; Steinmetz, L.M. Bidirectional promoters generate pervasive transcription in yeast. Nature 2009, 457, 1033–1037.
19. van Dijk, E.; Chen, C.; d’Aubenton-Carafa, Y.; Gourvennec, S.; Kwapisz, M.; Roche, V.; Bertrand, C.; Silvain, M.; Legoix-Né, P.; Loeillet, S.; et al. XUTs are a class of Xrn1-sensitive antisense regulatory non-coding RNA in yeast. Nature 2011, 475, 1033–1037.
20. Geisler, S.; Lojek, L.; Khalil, A.M.; Baker, K.E.; Coller, J. Decapping of long noncoding RNAs regulates inducible genes. Mol. Cell 2012, 45, 1097–2767.
21. Castelnuovo, M.; Rahman, S.; Guffanti, E.; Infantino, V.; Stutz, F.; Zenklusen, D. Bimodal expression of PHO84 is modulated by early termination of antisense transcription. Nat. Struct. Mol. Biol. 2013, 20, 851–858.
22. Balarezo-Cisneros, L.N.; Parker, S.; Fraczek, M.G.; Timouma, S.; Wang, P.; O’Keefe, R.T.; Millar, C.B.; Delneri, D. Functional and transcriptional profiling of non-coding RNAs in yeast reveal context-dependent phenotypes and in trans effects on the protein regulatory network. PLoS Genet. 2021, 17, e1008761.
23. Novačić, A.; Vučenović, I.; Primig, M.; Stuparević, I. Noncoding RNAs as cell wall regulators in Saccharomyces cerevisiae. Crit. Rev. Microbiol. 2020, 46, 15–25.
24. Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucl. Acids Res. 2003, 31, 3406–3415.
25. Sato, K.; Akiyama, M.; Sakakibara, Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat. Commun. 2021, 12, 941.
26. Singh, J.; Paliwal, K.; Zhang, T.; Shing, J.; Litfin, T.; Zhou, Y. Improved RNA secondary structure and tertiary base-pairing prediction using evolutionary profile, mutational coupling and two-dimensional transfer learning. Bioinformatics 2021, 37, 2589–2600.
27. Fu, L.; Cao, Y.; Wu, J.; Peng, Q.; Nie, Q.; Xie, X. UFold: Fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Res. 2022, 50, e14.
28. Xu, N.; Chen, F.; Wang, F.; Lu, X.; Wang, X.; Lv, M.; Lu, C. Clinical significance of high expression of circulating serum lncRNA RP11-445H22.4 in breast cancer patients: A Chinese population-based study. Tumor Biol. 2015, 36, 7659–7665.
29. Hu, X.; Bao, J.; Wang, Z.; Zhang, Z.; Gu, P.; Tao, F.; Cui, D.; Jiang, W. The plasma lncRNA acting as fingerprint in nonsmall-cell lung cancer. Tumor Biol. 2016, 37, 3497–3504.

Figure 1. The tree graph associated with Example 1.

Figure 2. A rooted tree graph in the plane (A). By following the red arrows in a clockwise direction, we assign a dot or an open or closed parenthesis according to the string assignment rules (B). The associated string is .(.(.).(.(.).(.(.).).).(.).)

Figure 3. Secondary structure from the RNA in Example 3 (A); the rooted tree graph associated after applying Gan et al.’s rules [7] (B); and its associated rooted tree graph with its SDBN (C): .(.(.(.(.(.).).).).(.(.).).)

Figure 4. Secondary structure from Example 6 (a) with yellow marking regions I, II, and III, where, to obtain an SDBN, a dot must be inserted (A). Secondary structure from Example 6 (d), with yellow highlighting the same region where a dot was inserted (B).

Figure 5. A schematic representation of the procedure for obtaining an SDBN from an RNA sequence [7].

Figure 6. A sketch of the proposed algorithm. It identifies the important strings in sets A and B and then uses them to detect lncRNAs in set C.

Figure 7. Graphical description of the computational process for searching for repeated strings in rooted tree graphs associated with lncRNAs.

Figure 8. The results of running the algorithm 500 times with $C = C A_{3 j} + C B_{5 j}$ (A) and with the original set C (B), using NUPACK Web [14].

Figure 9. The polynomial $1.35 x^{2} - 12.90 x + 129$, which best approximates the points $(x_{i}, y_{i})$, is shown.

Figure 10. The results of running the algorithm 500 times with $C = C A_{3 j} + C B_{5 j}$ (A) and with the original set C (B), using RNAfold WebServer [15].

Figure 11. The results of running the algorithm 500 times with $C = C A_{3 j} + C B_{5 j}$ (A) and with the original set C (B), using rna-state-inf [16].

Table 1. Control set $\bar{A}$, consisting of 18 lncRNA sequences; control set $\bar{B}$, consisting of 18 non-lncRNA sequences; and test set $\bar{C}$, with 6 RNA sequences. Sets A and B are ordered by length to facilitate searching for shared strings.

Control Set $\bar{A}$		Control Set $\bar{B}$		Test Set $\bar{C}$
RNA	Length $|\bar{A}|$ (nt)	RNA	Length $|\bar{B}|$ (nt)	RNA	Length $|\bar{C}|$ (nt)
ICR1	3199	15S Ribosomal	1649	KAP123	3342
RME2	2223	LSR1	1175	GRE2	1029
RME3	1905	Telomerase	1158	ECM11	909
IRT1	1489	SNR86	1004	1477	843
TLC1	1301	SNR30	609	6754	840
PWR1	941	Small nuclear SNR30	606	12189	583
RUF5-1	710	SNR19	568
RUF21	707	SNR84	550
ETS1-1	700	Small nuclear SNR84	537
SRG1	551	RPM1	483
RUF22	515	SNR17B	462
RUF20	443	Nuclear RNASE P	358
ITS1-1	361	SNR42	351
RUF23	254	NME1	340
ITS2-1	232	U3	334
ETS2-1	211	Small nuclear U3	333
RNA170	169	RNASE MRP	332
ZOD1	58	SNR83	306

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
