Models of Low-Dimensional Vector-Fuzzy Representations of Genetic Sequences and Amino Acids

Sereti, Fotini; Georgiou, Dimitrios; Karakasidis, Theodoros

doi:10.3390/appliedmath6030039

by Fotini Sereti ¹, Dimitrios Georgiou ² and Theodoros Karakasidis ³,*

¹ Department of Chemical Engineering, University of Western Macedonia, 50100 Kozani, Greece
² Department of Mathematics, University of Patras, 26504 Patra, Greece
³ Condensed Matter Physics Laboratory, Department of Physics, University of Thessaly, 35100 Lamia, Greece
* Author to whom correspondence should be addressed.

AppliedMath 2026, 6(3), 39; https://doi.org/10.3390/appliedmath6030039

Submission received: 11 January 2026 / Revised: 15 February 2026 / Accepted: 26 February 2026 / Published: 4 March 2026

Abstract

Genetic sequences play a central role in biological and medical research, and mathematics provides powerful means for their representation and analysis. Conventional approaches, such as the fuzzy polynucleotide space $[0, 1]^{12}$, model codons as 12-dimensional vectors, but this comes at the cost of high dimensionality. In this study, we introduce two new models, Vector-Fuzzy-I and Vector-Fuzzy-II, that map codons and genetic sequences into the 4-dimensional Euclidean space ℝ⁴ using vector algebra and fuzzy set theory. In the first model, sequence structure is represented by successive vector addition, while in the second, it is represented by positional frequencies normalized by nucleotide locations. These low-dimensional representations are unique, preserve sequence order, and allow effective measurement of similarity and difference via Euclidean metrics. Compared with the fuzzy polynucleotide space, the proposed models achieve dimensionality reduction while enhancing the resolution of sequence differentiation. Our approach offers new mathematical perspectives for sequence analysis in theoretical biology.

Keywords:

DNA; RNA; amino acid; pseudo amino acid composition; polynucleotides; vectors; fuzzy sets; Euclidean metric

1. Introduction

It is commonly accepted that bioinformatics is a constantly developing field in which mathematics plays an important role in the analysis of genetic sequences (see for example [1,2]), allowing biological data to be studied in a different manner. The genetic material of living organisms consists of nucleic acids, DNA and RNA, and their investigation is a fundamental task in medicine, ranging from understanding disease mechanisms to studying evolution.

The “language” of DNA and RNA consists of four nucleotides: T, C, A, G in the case of DNA, and U, C, A, G in the case of RNA (A = adenine, C = cytosine, G = guanine, T = thymine, U = uracil) [3]. In particular, the coding regions of DNA and RNA are read as triplets XYZ of three nucleotides (each triplet is a codon and encodes an amino acid), where each of X, Y, Z can be any one of the above “letters”. Although over 500 amino acids exist in nature, by far the most important are the 20 amino acids that are incorporated into proteins and appear in the genetic code (see Table 1).

Related studies refer to amino acids, their biological distance and the concept of pseudo amino acid (pseAA) composition proposed in [4]. Several investigations have used various numerical values to represent the amino acids through the vehicle of pseAA composition (see for example [5,6,7,8,9,10,11,12,13,14]). Thus, it is natural to give special attention to amino acids, pseudo-amino acid composition and genetic sequences in general. Several methods are used to extract the characteristics of genomes and to find common characteristics among their constituents. Sadegh-Zadeh [15] showed that the genetic code can be represented in a 12-dimensional space, because a triplet codon XYZ has a 3 × 4 = 12 dimensional fuzzy code $(a_{1}, \dots, a_{12})$ and is a point in the 12-dimensional fuzzy polynucleotide space $I^{12} = [0, 1]^{12}$ (where $I = [0, 1]$).

Additionally, the same paper introduced the fuzzy polynucleotide space (FPS) based on the principle of the fuzzy hypercube [16]. In this notation, a polynucleotide consisting of a sequence of k triplets XYZ is a point in $I^{12 \times k}$.

Torres and Nieto [17] mapped a polynucleotide onto the space $I^{12}$ by considering the frequencies of the nucleotides at the three base sites of a codon in the coding sequence. In that work, referring to the metric of [15,18,19], they calculated the distances between nucleotides. Further work has been performed using the idea of [20], in which the influences of several metrics have been examined. We note that, in related studies, it is also very important to be able to determine how close two genetic sequences are, as there are many important biological and medical implications (see for example [21,22,23]).

Another line of research involves time-series methods, where physicochemical indices of 20 amino acids are combined with the Hungarian algorithm to map each amino acid into a vector. In this way, a protein sequence can be transformed into a time series within an eleven-dimensional space [24]. Other studies, like that of [25], use systematic experiments with synthetic sequences to explore how CNN architecture—particularly convolutional filter size and max-pooling—affects the ability of first-layer filters to learn sequence motif representations. Their findings suggest that CNNs designed to promote hierarchical representation learning capture sequence motifs more effectively.

There is also a growing interest in incorporating contextual features. By representing similarity to homologous proteins both within and across species, the prediction of shared paralog functions can be significantly improved. Overall, these results demonstrate that alternative similarity metrics capture complementary aspects of functional similarity that go beyond sequence identity alone [26]. Comparative methods such as sequence analysis are commonly used to detect differences between sequences, whether at the single nucleotide level or as the result of larger phenomena like recombination or deletion. Identifying these variations is crucial for biological and medical insights; however, the massive scale and complexity of genomic data demand considerable computational resources [27].

As the applications of mathematics develop new directions in biological and medical studies (see for example [28,29,30]), the mathematical branch of vector algebra comes to introduce a new chapter in the direction of mathematical biology. Vector algebra has many applications in medicine such as gene therapy and data analysis of gene expression. Vectors, in this field, can represent genetic material for therapeutic delivery and sets of variables for analysis (see for example [31,32]).

1.1. Description of the Contributions of the Paper

Amino acids are essential components used to synthesize proteins in the human body. In general, protein sequences are the linear orders of amino acids within a polypeptide chain. A schematic representation of such a chain is given in Figure 1, where each amino acid of the chain is presented in a different color.

A chain of single-letter residue codes contains significant information. Since protein sequences comprise many amino acids, and since their structure is connected with the physicochemical properties of each amino acid (or of each part of this linear order, when viewed as a genetic sequence), we are motivated to study new mathematical representations of amino acids and genetic sequences.

Until now, such biological data have been represented in the space $[0, 1]^{12}$, viewed as a subset of ℝ¹². However, it is believed that reducing the amino acid composition serves not only as a key method for analyzing protein structure and function but also expands opportunities within the broader field of machine learning (see for example [33], where the authors, taking this need for reduction into consideration, reviewed the strategies and methods of reduced amino acid alphabets). Thus, in our investigation, in order to simplify implementation (with fewer coordinates in the amino acids’ compositions) and to differentiate genetic sequences faster, we reduce the dimension of the space in which such biological data can be “embedded” and represented in different vector forms.

More precisely, we follow the directions of vector algebra and fuzzy set theory, combine them with the field of representations of biological data and develop new mathematical models for studying amino acids and genetic sequences in general. We pay attention to the following axes:

We introduce two new representations of genetic sequences in ℝ⁴, using vectors and fuzzy sets. In the new representations, we succeed in representing nucleotides, triplets and general genetic sequences in the 4-dimensional Euclidean space ℝ⁴, assigning unit vectors of ℝ⁴ to such sequences. Thus, we avoid restricting ourselves to only the 12-dimensional space ℝ¹² and investigate the geometrical image in the Euclidean space ℝ⁴, significantly decreasing the number of coordinates of the corresponding biological data: amino acids and larger genetic sequences.
Having these new representations, we investigate the similarity, difference and Euclidean distances between genetic sequences, studying the influence of the new methodologies. For that, we compare the new models with known models representing genetic sequences in the fuzzy polynucleotide space $[0, 1]^{12}$. Our study shows that the new models differentiate the sequences better and that the criteria of similarity, difference and Euclidean distance provide better qualitative and quantitative data.

1.2. Outline of the Paper

The paper is organized as follows. In Section 2, we present the basic notions related to fuzzy set theory, vector algebra, metrics and similarity/difference. In Section 3, we introduce the first new model of representation for genetic sequences and apply it to known amino acids and genetic sequences. In Section 4, we present the second model together with its applications to known amino acids and genetic sequences. Finally, in Section 5, we investigate the influence of the new methodologies by comparing them with the known methodologies representing genetic sequences in the fuzzy polynucleotide space $[0, 1]^{12}$, studying the characteristics of Euclidean distance and similarity/difference.

2. Preliminaries

In this section we give some basic notions which are useful for the rest of our study. In particular, we present three subsections: the first for fuzzy sets, the second for the fuzzy polynucleotide space (FPS) and the third for vector algebra.

2.1. Basic Notions for Fuzzy Sets

We begin by reminding the reader of the basic notions of fuzzy set theory. For more details we refer to [34,35,36,37]. Fuzzy sets were introduced in 1965 by Lotfi Zadeh [34]; they replace the binary notion of an element belonging or not belonging to a set with the notion of the element belonging to the set with a degree of membership (see [35,36,37]). There are several modifications of fuzzy sets and, recently, in [38], the authors introduced the notion of ℚ(ε)-fuzzy sets: sets with values that may be infinitesimal or infinitely close to some rational number from the interval [0, 1]. However, in our investigation we always use the standard notion of fuzzy sets, as follows:

Definition 1 ([34]). Let X be a set. A subset A of X is called a fuzzy subset of X if there is a function $μ_{A}$ such that

(1): $μ_{A}$ : X → [0, 1] and
(2): A = {(x, $μ_{A}$ (x)): x ∈ X}, that is A is the set of all pairs (x, $μ_{A}$ (x)) such that x ∈ X and $μ_{A}$ (x) is the degree of its membership in A.

The nearer the value of $μ_{A} (x)$ to unity, the higher the grade of membership of x in A. That is, for x ∈ X, $μ_{A} (x) \in [0, 1]$ gives the so-called degree of participation of x in A. In what follows, if $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ and $A = \{(x_{1}, μ_{A} (x_{1})), \dots, (x_{n}, μ_{A} (x_{n}))\}$, then we simply write $A = (μ_{A} (x_{1}), \dots, μ_{A} (x_{n}))$ or $A = (a_{1}, \dots, a_{n})$, where $μ_{A} (x_{i}) = a_{i}$ and $a_{i} \in [0, 1]$ for each $i = 1, 2, \dots, n$.

Definition 2 ([34]). Let A and B be two fuzzy subsets of a set X. Then

(1): by A ∧ B we denote the fuzzy set whose membership function $μ_{A \land B}$ : X → [0, 1] is given by

$μ_{A \land B} (x) = \min \{μ_{A} (x), μ_{B} (x)\}, \text{ for every } x \in X$

(1)
(2): by A ∨ B we denote the fuzzy set whose membership function $μ_{A \lor B}$ : X → [0, 1] is given by

$μ_{A \lor B} (x) = \max \{μ_{A} (x), μ_{B} (x)\}, \text{ for every } x \in X$

(2)

Additionally, for a set $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ and a fuzzy subset $A = (a_{1}, \dots, a_{n})$ of X, where $a_{i} \in [0, 1]$, by c(A) (see [15,19]) we denote the number

$c (A) = \sum_{i = 1}^{n} a_{i}$

(3)

Definition 3 ([39]). Let $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ be a set and let $A = (a_{1}, \dots, a_{n})$, $B = (b_{1}, \dots, b_{n})$, where $a_{i}, b_{i} \in [0, 1]$, be two fuzzy subsets of X.

(1): The degree of similarity between A and B, denoted by sim(A,B), is defined to be the number

$\mathrm{sim} (A, B) = \frac{c (A \land B)}{c (C)}$

(4)

where C is the fuzzy set $(\frac{a_{1} + b_{1}}{2}, \dots, \frac{a_{n} + b_{n}}{2})$ , that is, C is the canonical midpoint between A and B.
(2): The degree of difference between A and B, denoted by dif(A,B), is defined to be the number

$\mathrm{dif} (A, B) = 1 - \mathrm{sim} (A, B) .$

(5)
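
The similarity and difference of Definition 3 can be checked numerically with a few lines of code. The following is a minimal Python sketch, not part of the original paper: the function names `cardinality`, `fuzzy_and`, `sim` and `dif` are illustrative choices, fuzzy sets are passed as plain sequences of membership degrees, and raising an error when c(C) = 0 is an assumption, since the definition does not cover that case.

```python
from typing import List, Sequence


def cardinality(a: Sequence[float]) -> float:
    """c(A): the sum of the membership degrees, as in Equation (3)."""
    return sum(a)


def fuzzy_and(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """A ∧ B: coordinate-wise minimum, as in Equation (1)."""
    return [min(x, y) for x, y in zip(a, b)]


def sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Degree of similarity c(A ∧ B) / c(C), with C the midpoint of A and B."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of coordinates")
    midpoint = [(x + y) / 2 for x, y in zip(a, b)]
    denominator = cardinality(midpoint)
    if denominator == 0:
        # Assumption: the all-zero case is left undefined by Definition 3.
        raise ValueError("sim is undefined when both fuzzy sets are empty")
    return cardinality(fuzzy_and(a, b)) / denominator


def dif(a: Sequence[float], b: Sequence[float]) -> float:
    """Degree of difference, 1 - sim(A, B), as in Equation (5)."""
    return 1 - sim(a, b)
```

With these helpers, identical fuzzy sets give `sim` equal to 1 and fuzzy sets with disjoint supports give `sim` equal to 0, which matches the intended reading of the two measures.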

2.2. Fuzzy Polynucleotide Space (FPS)

Let us describe how amino acids and genetic sequences can be “embedded” in the hypercube $I^{12}$. A codon corresponds to a corner of the 12-dimensional unit hypercube $I^{12}$, called the fuzzy polynucleotide space (in short, FPS). In particular, any element of $I^{12}$ may be viewed as a fuzzy codon. In the case of RNA, we make the following correspondence: U = (1, 0, 0, 0), C = (0, 1, 0, 0), A = (0, 0, 1, 0) and G = (0, 0, 0, 1). As a consequence, the codon UCG (serine) is written in $I^{12}$ as

$\mathrm{UCG} = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1) .$

(6)

For a polynucleotide that is a sequence of m triplets, one would need a hypercube of dimension m × 12. For example, the polynucleotide described by the sequence s₁ = UAC-UGU (tyrosine/cysteine) is a point in $I^{2 \times 12} = I^{24}$ and, with the above correspondence, is represented by

$s_{1} = (1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0) .$

(7)
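
The corner encoding described above amounts to concatenating one 4-coordinate block per nucleotide. The sketch below illustrates this under the correspondence U, C, A, G stated in the text; the names `NUCLEOTIDE_CODE` and `encode_sequence` are hypothetical, and treating "-" as a codon separator to be ignored is an assumption about the input format.

```python
from typing import List

# Correspondence stated in the text: one corner of I^4 per nucleotide.
NUCLEOTIDE_CODE = {
    "U": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "A": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}


def encode_sequence(sequence: str) -> List[int]:
    """Map a sequence of m codons to a corner of the hypercube I^(12 m)."""
    bases = sequence.replace("-", "").upper()  # assumption: "-" only separates codons
    if not bases or len(bases) % 3 != 0:
        raise ValueError("the sequence must consist of complete triplets")
    point: List[int] = []
    for base in bases:
        if base not in NUCLEOTIDE_CODE:
            raise ValueError(f"unknown nucleotide: {base!r}")
        point.extend(NUCLEOTIDE_CODE[base])
    return point


print(encode_sequence("UCG"))  # the 12 coordinates of the codon UCG
```

For the codon UCG this reproduces the twelve coordinates of Equation (6), and a two-codon sequence yields 24 coordinates.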

However, if one considers the frequencies of the nucleotides of the alphabet at the three base sites of a codon (in the coding sequence or any other region), the polynucleotide may be viewed as a point in the hypercube $I^{12}$, as given in the following FPS representation (see Table 2).

2.3. Basic Notions for Vectors

When dealing with genetic sequences it is of interest to be able to describe how different two sequences are. For this reason, the notions of Euclidean distance, cosine similarity (i.e., cosine of angle), similarity and difference are used (see [15,20,39]). One of the most well-known mathematical spaces is the Euclidean space ℝⁿ, $n = 1, 2, \dots$, where the Euclidean distance and the norm are defined, respectively, as follows:

$d (x, y) = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}$

(8)

for every $x = (x_{1}, x_{2}, \dots, x_{n})$, $y = (y_{1}, y_{2}, \dots, y_{n})$ ∈ ℝⁿ and

${‖ x ‖}_{d} = \sqrt{\sum_{i = 1}^{n} x_{i}^{2}}$

(9)

for every $x = (x_{1}, x_{2}, \dots, x_{n})$ ∈ ℝⁿ (subsequently, following the usual notation for vectors, we denote any vector in bold, for example $\mathbf{x}$). For more details on vector and linear algebra we refer to [40]. Several other metrics and applications are also presented in [41,42,43]. However, in this paper we focus only on the Euclidean space ℝ⁴.

From a geometric point of view, the sum of two vectors $x = (x_{1}, x_{2}, \dots, x_{n})$ and $y = (y_{1}, y_{2}, \dots, y_{n})$, belonging to ℝⁿ, can be seen as the diagonal of the parallelogram formed by those two vectors (see Figure 2).

Furthermore, given two vectors $x$ and $y$, the so-called cosine similarity measures how similar they are irrespective of their size (it is widely used, for instance, to compare documents). From a purely mathematical viewpoint, it is the cosine of the angle, say θ, between the vectors in a multi-dimensional space (see Figure 3). The smaller the angle, the higher the cosine similarity [43]. In particular, given two vectors $x = (x_{1}, x_{2}, \dots, x_{n})$, $y = (y_{1}, y_{2}, \dots, y_{n})$ in ℝⁿ, $n = 1, 2, \dots$, both different from 0 = (0, 0, $\dots$, 0) ∈ ℝⁿ, the angle θ between $x$ and $y$ is defined as follows:

(1): $0 \leq θ \leq π$ and
(2): $\cos θ = \frac{1}{{‖ x ‖}_{d} \cdot {‖ y ‖}_{d}} \sum_{i = 1}^{n} x_{i} \cdot y_{i}$ .
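
As an illustration of the quantities in this subsection, the following Python sketch computes the Euclidean norm, the Euclidean distance and the angle between two vectors. It is only a sketch: the function names are illustrative, vectors are plain sequences of floats, and clamping the cosine to [-1, 1] before calling `math.acos` is a numerical safeguard added here, not something stated in the text.

```python
import math
from typing import Sequence


def norm(x: Sequence[float]) -> float:
    """Euclidean norm, as in Equation (9)."""
    return math.sqrt(sum(xi * xi for xi in x))


def distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance, as in Equation (8)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    nx, ny = norm(x), norm(y)
    if nx == 0 or ny == 0:
        raise ValueError("the angle is only defined for non-zero vectors")
    value = sum(xi * yi for xi, yi in zip(x, y)) / (nx * ny)
    # Safeguard against rounding slightly outside [-1, 1].
    return max(-1.0, min(1.0, value))


def angle_degrees(x: Sequence[float], y: Sequence[float]) -> float:
    """Angle θ in degrees, with 0 <= θ <= 180."""
    return math.degrees(math.acos(cosine(x, y)))
```

For example, two distinct standard basis vectors of ℝ⁴ are at distance √2 and form a right angle.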

3. Vector-Fuzzy-I Representation of Genetic Sequences

In this section, we present the first model of vector representation of the biological data of amino acids and genetic sequences.

Let (ℝ⁴, +, ·) be the Euclidean vector space and let $r_{1}$ = (1, 0, 0, 0), $r_{2}$ = (0, 1, 0, 0), $r_{3}$ = (0, 0, 1, 0) and $r_{4}$ = (0, 0, 0, 1) be the linearly independent unit vectors (the standard basis) of ℝ⁴. The first model provides a representation in ℝ⁴ of a given genetic sequence

$s = u_{1} u_{2} u_{3} - u_{4} u_{5} u_{6} - u_{7} u_{8} u_{9} - \dots - u_{k - 5} u_{k - 4} u_{k - 3} - u_{k - 2} u_{k - 1} u_{k}$

(10)

which consists of k nucleotides $u_{1}, u_{2}, \dots, u_{k}$, each of which is an element of the set {U, C, A, G}. The representation is described by the following rule (note that, since the nucleotides are grouped into triplets XYZ, k is a natural number divisible by 3 and the k nucleotides form $\frac{k}{3}$ triplets).

New Representation
[called Vector-Fuzzy-I (in short VF-I) representation of s]

Step 1. We make the correspondence of unit vectors to each nucleotide as follows:

$r_{1} = (1, 0, 0, 0) = U, r_{2} = (0, 1, 0, 0) = C, r_{3} = (0, 0, 1, 0) = A, r_{4} = (0, 0, 0, 1) = G .$

(11)

As each $u_{i}$ , i = 1, 2, $\dots$ , k, of the above sequence s is one of the elements U, C, A, G, we find that each $u_{i}$ , i = 1, 2, $\dots$ , k, is one of the vectors $r_{1}$ , $r_{2}, r_{3}, r_{4} .$ Go to Step 2.
Step 2. Find the vector:

$w_{1} = \frac{u_{1} + u_{2}}{{‖ u_{1} + u_{2} ‖}_{d}}$

(12)

and go to Step 3.
Step 3. Find the vector:

$w_{2} = \frac{w_{1} + u_{3}}{{‖ w_{1} + u_{3} ‖}_{d}}$

(13)

and go to Step 4.
Step 4. Find the vector:

$w_{3} = \frac{w_{2} + u_{4}}{{‖ w_{2} + u_{4} ‖}_{d}}$

(14)

and go to Step 5.
⋮
Step k − 1. Find the vector:

$w_{k - 2} = \frac{w_{k - 3} + u_{k - 1}}{{‖ w_{k - 3} + u_{k - 1} ‖}_{d}}$

(15)

and go to Step k.
Step k. Find the vector:

$w_{k - 1} = \frac{w_{k - 2} + u_{k}}{{‖ w_{k - 2} + u_{k} ‖}_{d}}$

(16)

and go to Step k + 1.
Step k + 1. Assign the genetic sequence s to the vector $w_{k - 1}$.
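
The steps above can be sketched as a short loop: start from the unit vector of the first nucleotide, add the unit vector of the next nucleotide, renormalize, and repeat. The Python code below is a minimal illustration under the stated correspondence U, C, A, G to $r_{1}, r_{2}, r_{3}, r_{4}$; the names `BASIS`, `normalize` and `vf1` are hypothetical, and ignoring "-" separators is an assumption about the input format.

```python
import math
from typing import Tuple

Vector = Tuple[float, float, float, float]

# Step 1: unit vector assigned to each nucleotide.
BASIS = {
    "U": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "A": (0.0, 0.0, 1.0, 0.0),
    "G": (0.0, 0.0, 0.0, 1.0),
}


def normalize(v: Vector) -> Vector:
    """Divide a non-zero vector by its Euclidean norm."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def vf1(sequence: str) -> Vector:
    """VF-I sketch: successive vector addition followed by normalization."""
    bases = sequence.replace("-", "").upper()  # assumption: "-" only separates codons
    if not bases or len(bases) % 3 != 0:
        raise ValueError("the sequence must consist of complete triplets")
    w = BASIS[bases[0]]
    for base in bases[1:]:
        u = BASIS[base]
        # All coordinates are non-negative, so the sum is never the zero vector.
        w = normalize(tuple(a + b for a, b in zip(w, u)))
    return w


print(vf1("CAU"))  # approximately (0.7071, 0.5, 0.5, 0.0)
```

Because each step works on the running vector only, the sketch needs constant memory regardless of the sequence length, and the final vector always has unit norm.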

Clearly, the vectors $w_{1}, w_{2}, \dots, w_{k - 1}$ are elements of the hypercube $I^{4}$, and the location of the nucleotides U, C, A, G ensures the uniqueness of each representation, which gives the following remark.

Remark 1.

The VF-I representation of genetic sequences satisfies the following:

(1): The final vector w of the VF-I representation of a genetic sequence s is unique.
(2): If $s_{1}$ and $s_{2}$ are two different genetic sequences and $w_{1}, w_{2}$ are their VF-I representations, respectively, then $w_{1} \neq w_{2}$ .

In the following examples we shall apply the VF-I representation to amino acids and genetic sequences (using their genetic code and Chou’s pseudo-amino acid composition) in order to obtain their vector image in the Euclidean space ℝ⁴.

Example 1.

We shall find the VF-I representation of some known codons as follows.

(1): For CAU (histidine) we have C = $r_{2}$ , A = $r_{3}$ and U = $r_{1}$ . Thus, the corresponding vectors of VF-I representation are as follows:

Step 1. We add the first two vectors corresponding to C and A and construct the corresponding unit vector, as follows:

$w_{1} = \frac{r_{2} + r_{3}}{{‖ r_{2} + r_{3} ‖}_{d}} = \frac{(0, 1, 0, 0) + (0, 0, 1, 0)}{{‖ (0, 1, 0, 0) + (0, 0, 1, 0) ‖}_{d}} = \frac{(0, 1, 1, 0)}{\sqrt{2}} = (0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, 0)$

(17)
Step 2. We add to the previous unit vector w₁ the next vector corresponding to U, as follows:

$w_{2} = \frac{w_{1} + r_{1}}{{‖ w_{1} + r_{1} ‖}_{d}} = \frac{(0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, 0) + (1, 0, 0, 0)}{{‖ (0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, 0) + (1, 0, 0, 0) ‖}_{d}} = \frac{(1, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, 0)}{\sqrt{2}} = (\frac{\sqrt{2}}{2}, \frac{1}{2}, \frac{1}{2}, 0)$

(18)

Thus, CAU corresponds to the following unit vector:

$C A U = (\frac{\sqrt{2}}{2}, \frac{1}{2}, \frac{1}{2}, 0) .$

(19)

A schematic geometrical image of CAU is given in Figure 4.

In Figure 4, we first consider the sum $r_{2} + r_{3}$, obtaining the corresponding parallelogram; the vector $w_{1}$ is then the unit vector along the diagonal of this parallelogram. Next, we consider the sum $w_{1} + r_{1}$, obtaining the corresponding parallelogram; the final vector $w_{2} = \mathrm{CAU}$ (in blue) is the unit vector along the diagonal of this parallelogram.

(2): For CCG (proline) we have C = $r_{2}$ , C = $r_{2}$ and G = $r_{4}$ . Thus, the corresponding vectors of VF-I representation are as follows:

Step 1. We add the first two vectors corresponding to C and C and construct the corresponding unit vector, as follows:

$w_{1} = \frac{r_{2} + r_{2}}{{‖ r_{2} + r_{2} ‖}_{d}} = \frac{(0, 1, 0, 0) + (0, 1, 0, 0)}{{‖ (0, 1, 0, 0) + (0, 1, 0, 0) ‖}_{d}} = \frac{(0, 2, 0, 0)}{2} = (0, 1, 0, 0) = r_{2}$

(20)
Step 2. We add to the resulting unit vector w₁ the unit vector corresponding to G, as follows:

$w_{2} = \frac{w_{1} + r_{4}}{{‖ w_{1} + r_{4} ‖}_{d}} = \frac{(0, 1, 0, 0) + (0, 0, 0, 1)}{{‖ (0, 1, 0, 0) + (0, 0, 0, 1) ‖}_{d}} = \frac{(0, 1, 0, 1)}{\sqrt{2}} = (0, \frac{\sqrt{2}}{2}, 0, \frac{\sqrt{2}}{2})$

(21)

Thus, CCG corresponds to the following unit vector:

$C C G = (0, \frac{\sqrt{2}}{2}, 0, \frac{\sqrt{2}}{2}) .$

(22)

A schematic geometrical image of CCG is given in Figure 5.

In Figure 5, we first consider the sum $r_{2} + r_{2}$; the vector $w_{1}$ is then the unit vector in the direction of this sum, which equals $r_{2}$. Afterwards, we consider the sum $w_{1} + r_{4}$, obtaining the corresponding parallelogram; the final vector $w_{2} = \mathrm{CCG}$ (in blue) is the unit vector along the diagonal of this parallelogram.

Regarding the cosine of the angle θ between the above vector representations

$\mathrm{CAU} = (\frac{\sqrt{2}}{2}, \frac{1}{2}, \frac{1}{2}, 0) \text{ and } \mathrm{CCG} = (0, \frac{\sqrt{2}}{2}, 0, \frac{\sqrt{2}}{2})$

(23)

applying the definition of the angle given in Section 2.3, we have $\cos θ = \frac{\sqrt{2}}{4}$. Thus, θ ≈ 69°.

(3): Similarly, Table 3 lists the VF-I representations of the genetic sequences $s_{1}, s_{2}, s_{3}, s_{4}, s_{5}$ and $s_{6}$ (rounded to two decimal digits), the genetic codes of which are given in Table 2.
(4): Table 4 lists the VF-I representations of the genetic sequences with three triplets (rounded to two decimal digits).

4. Vector-Fuzzy-II Representation of Genetic Sequences

In this section, we present the second model of vector representation of the biological data of amino acids and genetic sequences.

Let (ℝ⁴, +, ·) be the Euclidean vector space and consider the following genetic sequence

$s = u_{1} u_{2} u_{3} - u_{4} u_{5} u_{6} - u_{7} u_{8} u_{9} - \dots - u_{k - 5} u_{k - 4} u_{k - 3} - u_{k - 2} u_{k - 1} u_{k},$

(24)

which consists of k nucleotides arranged in $\frac{k}{3}$ triplets (as in Section 3), where k is a natural number divisible by 3 and each of $u_{1}, u_{2}, \dots, u_{k}$ is an element of the set {U, C, A, G}. For the new model, we consider the following notions.

Definition 4. Let s be the above genetic sequence (24) and N be any of the nucleotides U, C, A, G. The frequency of N, denoted by F(N), is defined to be the natural number from the set {0, 1, 2, $\dots$} that counts how many times the nucleotide N appears in the sequence s.

As each genetic sequence consists of triplets of nucleotides, it is necessary to consider the location of each nucleotide U, C, A, G in each triplet. This will give us the uniqueness of the representation of s. Each nucleotide U, C, A, G is located at the 1st, 2nd or 3rd position in a triplet, which leads to the following notions of partial positional weight and total positional weight.

Definition 5. Let s be a genetic sequence as in (24), let $j = 1, 2, \dots, \frac{k}{3}$ index the triplets of s, and let N be any of the nucleotides U, C, A, G.

(1): The partial positional weight of N, denoted by $S_{i}^{j}$ (N), is defined to be the number from the set {0, 1, 2, 3} that records whether N occupies position i of the j-th triplet of s. That is,

$S_{i}^{j} (N) = \begin{cases} i, & \text{if } N \text{ is located at the } i\text{-th position in the } j\text{-th triplet} \\ 0, & \text{otherwise} \end{cases}$

(25)
(2): The total positional weight of N, denoted by Σ(N), is defined to be the sum of all $S_{i}^{j}$ (N) in the sequence s. That is,

$Σ (N) = \sum_{j = 1}^{k / 3} \sum_{i = 1}^{3} S_{i}^{j} (N)$

(26)

Clearly, F(N) = 0 if and only if Σ(N) = 0.

Definition 6. Let s be a genetic sequence as in (24) and N be any of the nucleotides U, C, A, G. The positional frequency of N, denoted by PF(N), is defined to be the following rational number:

$PF (N) = \begin{cases} 0, & \text{if } F (N) = 0, \\ \frac{F (N)}{Σ (N)}, & \text{if } F (N) \neq 0 . \end{cases}$

(27)

Example 2.

We shall find the positional frequencies of U, C, A, G in the following codons and sequences:

(1): For UCG (serine) we have the following:

$PF (U) = \frac{F (U)}{Σ (U)} = \frac{1}{1} = 1, PF (C) = \frac{F (C)}{Σ (C)} = \frac{1}{2}, PF (A) = 0 \text{ (since } F (A) = 0 \text{)}, PF (G) = \frac{F (G)}{Σ (G)} = \frac{1}{3} .$

(28)
(2): For CGU (arginine) we have the following:

$PF (U) = \frac{F (U)}{Σ (U)} = \frac{1}{3}, PF (C) = \frac{F (C)}{Σ (C)} = \frac{1}{1} = 1, PF (A) = 0 \text{ (since } F (A) = 0 \text{)}, PF (G) = \frac{F (G)}{Σ (G)} = \frac{1}{2} .$

(29)
(3): For UCG-CGU we have the following:

$PF (U) = \frac{F (U)}{Σ (U)} = \frac{2}{1 + 3} = \frac{1}{2}, PF (C) = \frac{F (C)}{Σ (C)} = \frac{2}{2 + 1} = \frac{2}{3}, PF (A) = 0 \text{ (since } F (A) = 0 \text{)}, PF (G) = \frac{F (G)}{Σ (G)} = \frac{2}{3 + 2} = \frac{2}{5} .$

(30)
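
The positional frequencies of Definitions 4 to 6 can be computed in a single pass over the sequence. The following Python sketch uses exact fractions so that the values can be compared directly with Example 2; the function name `positional_frequencies` is hypothetical, and ignoring "-" separators is an assumption about the input format.

```python
from fractions import Fraction
from typing import Dict


def positional_frequencies(sequence: str) -> Dict[str, Fraction]:
    """Return PF(N) = F(N) / Σ(N) for N in U, C, A, G (0 when N is absent)."""
    bases = sequence.replace("-", "").upper()  # assumption: "-" only separates codons
    if not bases or len(bases) % 3 != 0:
        raise ValueError("the sequence must consist of complete triplets")
    frequency = {n: 0 for n in "UCAG"}  # F(N)
    weight = {n: 0 for n in "UCAG"}     # Σ(N)
    for index, base in enumerate(bases):
        if base not in frequency:
            raise ValueError(f"unknown nucleotide: {base!r}")
        frequency[base] += 1
        weight[base] += index % 3 + 1   # position 1, 2 or 3 inside the triplet
    return {
        n: Fraction(frequency[n], weight[n]) if frequency[n] else Fraction(0)
        for n in "UCAG"
    }


print(positional_frequencies("UCG-CGU"))
```

For UCG-CGU this yields 1/2, 2/3, 0 and 2/5 for U, C, A and G, in agreement with Equation (30).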

New Representation
[called Vector-Fuzzy-II (in short VF-II) representation of s]
Step 1. Find the vector:

$w = \frac{(P F (U), P F (C), P F (A), P F (G))}{{‖ (P F (U), P F (C), P F (A), P F (G)) ‖}_{d}}$

(31)

and go to Step 2.
Step 2. Assign the genetic sequence s of (24) to the vector w.
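
The two steps of the VF-II rule can be sketched as follows: compute the four positional frequencies and divide the resulting vector by its Euclidean norm. This is a minimal Python illustration, not the authors' implementation; the name `vf2` is hypothetical and "-" separators are assumed to be ignorable.

```python
import math
from typing import Tuple


def vf2(sequence: str) -> Tuple[float, ...]:
    """VF-II sketch: normalized vector (PF(U), PF(C), PF(A), PF(G))."""
    bases = sequence.replace("-", "").upper()  # assumption: "-" only separates codons
    if not bases or len(bases) % 3 != 0:
        raise ValueError("the sequence must consist of complete triplets")
    frequency = {n: 0 for n in "UCAG"}  # F(N)
    weight = {n: 0 for n in "UCAG"}     # Σ(N)
    for index, base in enumerate(bases):
        if base not in frequency:
            raise ValueError(f"unknown nucleotide: {base!r}")
        frequency[base] += 1
        weight[base] += index % 3 + 1   # position inside the triplet
    pf = [frequency[n] / weight[n] if frequency[n] else 0.0 for n in "UCAG"]
    # A non-empty sequence has at least one positive PF, so the norm is non-zero.
    length = math.sqrt(sum(p * p for p in pf))
    return tuple(p / length for p in pf)


print(vf2("AUG"))  # approximately (3/7, 0, 6/7, 2/7)
```

Unlike the successive-addition scheme of Section 3, this rule needs only the counts and position sums per nucleotide, so it can be evaluated in one pass without intermediate normalizations.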

Clearly, the vector w is an element of the hypercube $I^{4}$, and the location of the nucleotides U, C, A, G ensures the uniqueness of each representation, which gives the following remark.

Remark 2.

The VF-II representation of genetic sequences satisfies the following:

(1): The vector w of the VF-II representation of a genetic sequence s is unique.
(2): If $s_{1}$ and $s_{2}$ are two different genetic sequences and $w_{1}, w_{2}$ are their VF-II representations, respectively, then $w_{1} \neq w_{2} .$

In the following examples we shall apply the VF-II representation to amino acids and genetic sequences (using their genetic code and Chou’s pseudo-amino acid composition) in order to obtain their vector image in the Euclidean space ℝ⁴.

Example 3.

We shall find the VF-II representation of some known codons as follows.

(1): For UGU (cysteine) we have $P F (U) = \frac{2}{1 + 3} = \frac{1}{2},$ as U is present 2 times at positions 1 and 3, $P F (C) = P F (A) = 0$ and $P F (G) = \frac{1}{2},$ as G is present only once at position 2. Thus, the corresponding vector of VF-II is as follows:

$w = \frac{(P F (U), P F (C), P F (A), P F (G))}{{‖ (P F (U), P F (C), P F (A), P F (G)) ‖}_{d}} = \frac{(\frac{1}{2}, 0, 0, \frac{1}{2})}{{‖ (\frac{1}{2}, 0, 0, \frac{1}{2}) ‖}_{d}} = \frac{2}{\sqrt{2}} (\frac{1}{2}, 0, 0, \frac{1}{2}) = (\frac{\sqrt{2}}{2}, 0, 0, \frac{\sqrt{2}}{2})$

(32)

Thus, UGU corresponds to the following unit vector:

$U G U = (\frac{\sqrt{2}}{2}, 0, 0, \frac{\sqrt{2}}{2}) .$

(33)
(2): For AUG (methionine) we have $P F (U) = \frac{1}{2}$ , as U is present only once at position 2, PF(C) = 0, $P F (A) = \frac{1}{1} = 1,$ as A is located once at position 1, and $P F (G) = \frac{1}{3},$ as G is present only once at position 3. Thus, the corresponding vector of VF-II is as follows:

$w = \frac{(P F (U), P F (C), P F (A), P F (G))}{{‖ (P F (U), P F (C), P F (A), P F (G)) ‖}_{d}} = \frac{(\frac{1}{2}, 0, 1, \frac{1}{3})}{{‖ (\frac{1}{2}, 0, 1, \frac{1}{3}) ‖}_{d}} = \frac{6}{7} (\frac{1}{2}, 0, 1, \frac{1}{3}) = (\frac{3}{7}, 0, \frac{6}{7}, \frac{2}{7})$

(34)

Thus, AUG corresponds to the following unit vector:

$A U G = (\frac{3}{7}, 0, \frac{6}{7}, \frac{2}{7}) .$

(35)

Additionally, regarding the cosine of the angle θ between the above vector representations

$\mathrm{UGU} = (\frac{\sqrt{2}}{2}, 0, 0, \frac{\sqrt{2}}{2}) \text{ and } \mathrm{AUG} = (\frac{3}{7}, 0, \frac{6}{7}, \frac{2}{7})$

(36)

and applying the definition of the angle given in Section 2.3, we have cosθ = $\frac{5 \sqrt{2}}{14} .$ Thus, θ ≈ 60°.
(3): Similarly, in Table 5 and Table 6 we can see the VF-II representations of the genetic sequences $s_{1}, s_{2}, s_{3}, s_{4}, s_{5}, s_{6}$ ; $Σ_{1}, Σ_{2}$ ; and $Σ_{3}$ (approximately, two decimal digits), the genetic codes of which are given in Table 2, Table 3 and Table 4.

Here, it is interesting to look at the cosine similarity of the sequences $s_{5}$ and $s_{6}$, since their representations in Table 5 are similar. Applying the definition of the angle given in Section 2.3, we get cosθ ≈ 1. Hence, the corresponding vectors of $s_{5}$ and $s_{6}$ are nearly parallel and exhibit very high similarity, with an angle close to θ = 0° between them.

5. Characteristics of Vector-Fuzzy Representation

The advantage and, simultaneously, the essential characteristic of the new vector-fuzzy models is that we can represent the biological data of amino acids and longer genetic sequences as vectors in ℝ⁴, giving a purely mathematical approach. Additionally, the new models provide a way to investigate “measurements” between genetic sequences. The distance $d (s_{i}, s_{j})$ between two genetic sequences $s_{i}$ and $s_{j}$ is determined by the corresponding Euclidean distance $d (w_{i}, w_{j})$ of the unit vectors $w_{i}$ and $w_{j}$ that are assigned to $s_{i}$ and $s_{j}$, respectively, and the similarity $\mathrm{sim} (s_{i}, s_{j})$ (resp. the difference $\mathrm{dif} (s_{i}, s_{j})$) is determined by the corresponding similarity $\mathrm{sim} (w_{i}, w_{j})$ (resp. $\mathrm{dif} (w_{i}, w_{j})$), where $w_{i}$ and $w_{j}$ are given by the Vector-Fuzzy-I and Vector-Fuzzy-II representations.
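
To make the three measurements concrete, the sketch below compares two unit vectors that are already available from Section 3, namely the VF-I vectors of CAU and CCG. It is an illustration only: the function name `compare` and the returned dictionary keys are hypothetical, and the similarity follows Definition 3 with the midpoint fuzzy set in the denominator.

```python
import math
from typing import Dict, Sequence


def compare(w_i: Sequence[float], w_j: Sequence[float]) -> Dict[str, float]:
    """Euclidean distance, similarity and difference of two representations."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_i, w_j)))
    overlap = sum(min(a, b) for a, b in zip(w_i, w_j))     # c(A ∧ B)
    midpoint = sum((a + b) / 2 for a, b in zip(w_i, w_j))  # c(C)
    if midpoint == 0:
        raise ValueError("similarity is undefined for two zero vectors")
    similarity = overlap / midpoint
    return {"distance": distance, "sim": similarity, "dif": 1 - similarity}


# VF-I unit vectors computed in Example 1.
cau = (math.sqrt(2) / 2, 0.5, 0.5, 0.0)
ccg = (0.0, math.sqrt(2) / 2, 0.0, math.sqrt(2) / 2)

result = compare(cau, ccg)
print(result)
```

Since both inputs are unit vectors with non-negative coordinates, the distance always lies between 0 and √2, which keeps values comparable across sequences of different lengths.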

In Section 3 and Section 4 we have seen the new models of vector-fuzzy representations (VF-I and VF-II) for the genetic sequences $s_{1}, s_{2}, s_{3}, s_{4}, s_{5}$ and $s_{6}$. Using the known FPS representation of these sequences in $I^{12}$, we can compare the new models with FPS via the characteristics of Euclidean distance, similarity and difference.

Application 1: Regarding $s_{1}$ in combination with $s_{2}$ and $s_{3}$

With respect to the Euclidean distance and the similarity, in the FPS representation $s_{1}$ is closer to $s_{2}$ than to $s_{3}$. This remains true in the new representations, as expected: $s_{1}$ and $s_{3}$ differ in two bases, whereas $s_{1}$ and $s_{2}$ differ in only one (see Table 7 and Table 8). In the new models, the distances between the investigated sequences are smaller and the degrees of similarity are larger than in the FPS representation. In both VF-I and VF-II, the distance between $s_{1}$ and $s_{3}$ remains larger than the distance between $s_{1}$ and $s_{2}$, and the degree of similarity between $s_{1}$ and $s_{3}$ remains smaller than the degree of similarity between $s_{1}$ and $s_{2}$, in agreement with the composition of the sequences.
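The comparison in Application 1 can be sketched in a few lines of Python. The vectors below are copied from Tables 2, 3 and 5 (rounded as printed there); the dictionary layout and function names are illustrative assumptions, not part of the models themselves.

```python
import math

# Representations of s1, s2 and s3 copied from Tables 2, 3 and 5.
REPRESENTATIONS = {
    "FPS": {
        "s1": (1, 0, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0),
        "s2": (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0),
        "s3": (0.5, 0.5, 0, 0, 0.5, 0, 0, 0.5, 0.5, 0.5, 0, 0),
    },
    "VF-I": {
        "s1": (0.90, 0.16, 0.12, 0.39),
        "s2": (0.87, 0.27, 0.11, 0.41),
        "s3": (0.89, 0.22, 0.00, 0.40),
    },
    "VF-II": {
        "s1": (0.61, 0.34, 0.51, 0.51),
        "s2": (0.50, 0.50, 0.50, 0.50),
        "s3": (0.57, 0.57, 0.00, 0.57),
    },
}


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def compare_s1(vectors):
    """Return (d(s1, s2), d(s1, s3)) for one representation."""
    d12 = euclidean_distance(vectors["s1"], vectors["s2"])
    d13 = euclidean_distance(vectors["s1"], vectors["s3"])
    return d12, d13


for name, vectors in REPRESENTATIONS.items():
    d12, d13 = compare_s1(vectors)
    # s1 should be closer to s2 (one base changed) than to s3 (two bases).
    print(f"{name:6s} d(s1,s2)={d12:.7f} d(s1,s3)={d13:.7f} closer_to_s2={d12 < d13}")
```

In each of the three representations the check `d12 < d13` should hold, which is the ordering reported in Table 7.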

Application 2: Regarding $s_{2}$ in combination with $s_{3}, s_{4}, s_{5}$ and $s_{6}$

In the following we calculate the distances and similarities of the sequences $s_{1}, s_{3}, s_{4}, s_{5}$ and $s_{6}$ from $s_{2}$. Each of these sequences differs from $s_{2}$ in exactly one base: for example, $s_{1}$ and $s_{2}$ have the same structure except for the first base of the first triplet, where $s_{1}$ has a U while $s_{2}$ has a C. In the FPS representation, all these distances are equal to each other, and so are all the similarities. Thus, FPS gives no indication of how different the sequences are. In the vector-fuzzy representations, however, the distances differ, giving a quantitative criterion for how large the difference between two polynucleotides is (see Table 9 and Table 10). Additionally, the fact that, in VF-II, the representations of $s_{5}$ and $s_{6}$ tend to be parallel (see Section 4) leads to the coincidence of their distances from $s_{2}$, as follows:

d(s_{2}, s_{5}) = d(s_{2}, s_{6})

(37)

and also of their similarities with $s_{2}$, as follows:

sim(s_{2}, s_{5}) = sim(s_{2}, s_{6}) .

(38)
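A short Python sketch can make this contrast concrete. It takes the vectors from Tables 2, 3 and 5 (rounded as printed) and lists the Euclidean distance of every other sequence from $s_{2}$ in each representation. The helper name `distances_from_s2` and the dictionary layout are illustrative assumptions.

```python
import math

# Representations copied from Tables 2, 3 and 5 (rounded values as printed).
FPS = {
    "s1": (1, 0, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0),
    "s2": (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0),
    "s3": (0.5, 0.5, 0, 0, 0.5, 0, 0, 0.5, 0.5, 0.5, 0, 0),
    "s4": (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 1, 0, 0, 0),
    "s5": (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0.5),
    "s6": (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0),
}
VF_I = {
    "s1": (0.90, 0.16, 0.12, 0.39),
    "s2": (0.87, 0.27, 0.11, 0.41),
    "s3": (0.89, 0.22, 0.00, 0.40),
    "s4": (0.91, 0.10, 0.10, 0.39),
    "s5": (0.84, 0.12, 0.12, 0.52),
    "s6": (0.87, 0.11, 0.27, 0.41),
}
VF_II = {
    "s1": (0.61, 0.34, 0.51, 0.51),
    "s2": (0.50, 0.50, 0.50, 0.50),
    "s3": (0.57, 0.57, 0.00, 0.57),
    "s4": (0.33, 0.77, 0.38, 0.38),
    "s5": (0.39, 0.78, 0.39, 0.31),
    "s6": (0.39, 0.78, 0.31, 0.39),
}


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distances_from_s2(vectors):
    """Distance of every other sequence from s2, rounded to 7 decimals."""
    ref = vectors["s2"]
    return {
        name: round(euclidean_distance(ref, vec), 7)
        for name, vec in vectors.items()
        if name != "s2"
    }


fps_d = distances_from_s2(FPS)
vf1_d = distances_from_s2(VF_I)
vf2_d = distances_from_s2(VF_II)

for label, dists in (("FPS", fps_d), ("VF-I", vf1_d), ("VF-II", vf2_d)):
    # Count how many different distance values the representation produces.
    print(label, dists, "distinct values:", len(set(dists.values())))
```

With the tabulated values, FPS yields a single distance value for all five sequences, while VF-I yields five different ones. In VF-II the vectors of $s_{5}$ and $s_{6}$ printed in Table 5 contain the same coordinates with the last two swapped, and the vector of $s_{2}$ has four equal coordinates, so the two distances from $s_{2}$ coincide exactly.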

We attribute these results to the fact that, unlike FPS, the new methodologies take into account the local structure of the sequences: in VF-I, the unit vectors are constructed by the successive addition of the vectors corresponding to each nucleotide, and in VF-II, the frequencies are normalized by the locations of the nucleotides.
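The following worked calculation is a sketch of why FPS cannot separate the single-base substitutions considered above. It assumes, consistently with Table 2, that the FPS vector of a two-codon sequence is the average of the 0/1 vectors of its codons, so that each FPS vector sums to 3. The symbols $p$ and $p'$ (the FPS vectors before and after one substitution) are introduced here for illustration, and the difference degree is assumed to be dif = Σ|p_k − p'_k| / Σ(p_k + p'_k) with sim = 1 − dif, which matches the tabulated values.

```latex
% p, p' : FPS vectors of a two-codon sequence before and after one
% base substitution (illustrative symbols, not taken from the paper).
% The substitution changes exactly two coordinates, each by 1/2.
\[
  d(p, p') = \sqrt{\left(\tfrac{1}{2}\right)^{2} + \left(\tfrac{1}{2}\right)^{2}}
           = \frac{\sqrt{2}}{2} \approx 0.7071068 .
\]
% Assumed difference degree, with each FPS vector summing to 3:
\[
  dif(p, p') = \frac{\tfrac{1}{2} + \tfrac{1}{2}}{3 + 3} = \frac{1}{6} \approx 0.1666667,
  \qquad
  sim(p, p') = 1 - \frac{1}{6} = \frac{5}{6} \approx 0.8333333 .
\]
```

Under these assumptions the result does not depend on which base is replaced or where it sits, which agrees with the constant FPS columns of Tables 9 and 10.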

6. Discussion

The main tasks in bioinformatics rely on the comparison, classification, and phylogenetic analysis of genetic sequences, and vector representations of the sequences support the extraction of useful conclusions. However, such work faces two main problems: high computational complexity, especially in high-dimensional spaces, and a variety of definitions and approaches that are developed independently, which can make results difficult to compare across related studies. In the present study we address both problems: we significantly reduce the dimension of the space in which the data are studied, and we use only the standard, widely used notions of fuzzy sets and vectors, avoiding new definitions that could cause further confusion.

More precisely, in this paper we have presented new approaches to the representation of amino acids and genetic sequences that achieve dimensionality reduction and better differentiation of the sequences. The new models are based on vector algebra and fuzzy set theory and represent genetic sequences as unit vectors in the 4-dimensional space ℝ⁴. The location of the amino acids plays an important role in each model, as it ensures that each vector is assigned to a unique amino acid and vice versa. We applied the new models to several known amino acids and genetic sequences, providing examples of their representations in ℝ⁴. Unlike FPS, the new models take into account the local structure of the sequences: in VF-I, the unit vectors are constructed by the successive addition of the vectors corresponding to each nucleotide, and in VF-II, the frequencies are normalized by the locations of the nucleotides.

7. Conclusions

Two new models for the vector representation of amino acids and genetic sequences in the space ℝ⁴ have been presented, investigated and applied to several genetic sequences. Each amino acid and genetic sequence is represented by a vector in the Euclidean space ℝ⁴. Compared with the conventional fuzzy polynucleotide space $[0, 1]^{12}$ (as a subspace of ℝ¹²), the proposed models achieve dimensionality reduction while enhancing the resolution of sequence differentiation, thus providing promising tools for sequence analysis in theoretical biology.

Author Contributions

Methodology, F.S., D.G., T.K.; investigation, F.S., D.G., T.K.; writing—original draft preparation, F.S., D.G., T.K.; writing—review and editing, F.S., D.G., T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. A schematic representation of a genetic sequence.

Figure 2. Schematic representation of the vectors $x$ and $y$ (in black) and their sum $x + y$ (in blue), together with the corresponding parallelogram and its diagonal.

Figure 3. Schematic representation of the angle θ between the vectors $x$ and $y$.

Figure 4. Schematic geometric representation of CAU (in blue) as a vector in the frame of VF-I representation.

Figure 5. Schematic geometric representation of CCG (in blue) as a vector in the frame of VF-I representation.

w_{1} + r_{4} .

Table 1. The 20 amino acids.

Amino Acid	3-Letter Code	Name	Reverse Codon
1.	Ala	Alanine	GCU, GCC, GCA, GCG
2.	Cys	Cysteine	UGU, UGC
3.	Asp	Aspartic acid	GAU, GAC
4.	Glu	Glutamic acid	GAA, GAG
5.	Phe	Phenylalanine	UUU, UUC
6.	Gly	Glycine	GGU, GGC, GGA, GGG
7.	His	Histidine	CAU, CAC
8.	Ile	Isoleucine	AUU, AUC, AUA
9.	Lys	Lysine	AAA, AAG
10.	Leu	Leucine	UUA, UUG, CUU, CUC, CUA, CUG
11.	Met	Methionine	AUG
12.	Asn	Asparagine	AAU, AAC
13.	Pro	Proline	CCU, CCC, CCA, CCG
14.	Gln	Glutamine	CAA, CAG
15.	Arg	Arginine	CGU, CGC, CGA, CGG, AGA, AGG
16.	Ser	Serine	UCU, UCC, UCA, UCG, AGU, AGC
17.	Thr	Threonine	ACU, ACC, ACA, ACG
18.	Val	Valine	GUU, GUC, GUA, GUG
19.	Trp	Tryptophan	UGG
20.	Tyr	Tyrosine	UAU, UAC

Table 2. FPS representation of genetic sequences.

Genetic Sequence	Symbol	Genetic Code	FPS Representation
tyrosine/cysteine	$s_{1}$	UAC-UGU	(1, 0, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0)
histidine/cysteine	$s_{2}$	CAC-UGU	(0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0)
leucine/cysteine	$s_{3}$	CUC-UGU	(0.5, 0.5, 0, 0, 0.5, 0, 0, 0.5, 0.5, 0.5, 0, 0)
histidine/cysteine	$s_{4}$	CAU-UGU	(0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 1, 0, 0, 0)
glutamine/cysteine	$s_{5}$	CAG-UGU	(0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0.5)
glutamine/cysteine	$s_{6}$	CAA-UGU	(0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0)

Table 3. VF-I representation of genetic sequences with 2 amino acids.

Genetic Sequence	Symbol	Genetic Code	VF-I Representation
tyrosine/cysteine	$s_{1}$	UAC-UGU	(0.90, 0.16, 0.12, 0.39)
histidine/cysteine	$s_{2}$	CAC-UGU	(0.87, 0.27, 0.11, 0.41)
leucine/cysteine	$s_{3}$	CUC-UGU	(0.89, 0.22, 0.00, 0.40)
histidine/cysteine	$s_{4}$	CAU-UGU	(0.91, 0.10, 0.10, 0.39)
glutamine/cysteine	$s_{5}$	CAG-UGU	(0.84, 0.12, 0.12, 0.52)
glutamine/cysteine	$s_{6}$	CAA-UGU	(0.87, 0.11, 0.27, 0.41)

Table 4. VF-I representation of genetic sequences with 3 amino acids.

Genetic Sequence	Symbol	Genetic Code	VF-I Representation
leucine/asparagine/serine	$Σ_{1}$	UUA-AAU-UCU	(0.91, 0.39, 0.13, 0.00)
alanine/glutamic acid/phenylalanine	$Σ_{2}$	GCU-GAG-UUU	(0.98, 0.02, 0.07, 0.16)
valine/tyrosine/arginine	$Σ_{3}$	GUG-UAU-AGG	(0.19, 0.00, 0.30, 0.94)

Table 5. VF-II representation of genetic sequences with 2 amino acids.

Genetic Sequence	Symbol	Genetic Code	VF-II Representation
tyrosine/cysteine	$s_{1}$	UAC-UGU	(0.61, 0.34, 0.51, 0.51)
histidine/cysteine	$s_{2}$	CAC-UGU	(0.50, 0.50, 0.50, 0.50)
leucine/cysteine	$s_{3}$	CUC-UGU	(0.57, 0.57, 0.00, 0.57)
histidine/cysteine	$s_{4}$	CAU-UGU	(0.33, 0.77, 0.38, 0.38)
glutamine/cysteine	$s_{5}$	CAG-UGU	(0.39, 0.78, 0.39, 0.31)
glutamine/cysteine	$s_{6}$	CAA-UGU	(0.39, 0.78, 0.31, 0.39)

Table 6. VF-II representation of genetic sequences with 3 amino acids.

Genetic Sequence	Symbol	Genetic Code	VF-II Representation
leucine/asparagine/serine	$Σ_{1}$	UUA-AAU-UCU	(0.58, 0.58, 0.58, 0.00)
alanine/glutamic acid/phenylalanine	$Σ_{2}$	GCU-GAG-UUU	(0.43, 0.49, 0.49, 0.58)
valine/tyrosine/arginine	$Σ_{3}$	GUG-UAU-AGG	(0.53, 0.00, 0.71, 0.46)

Table 7. Comparing $s_{1}$, $s_{2}$ and $s_{3}$ via the Euclidean distance.

Euclidean Distance d	FPS Representation (Table 2)	VF-I Representation (Table 3)	VF-II Representation (Table 5)
$d (s_{1}, s_{2})$	0.7071068	0.1161895	0.1946792
$d (s_{1}, s_{3})$	1	0.1349074	0.5640922

Table 8. Comparing $s_{1}$, $s_{2}$ and $s_{3}$ via the similarity/difference.

Similarity/Difference	FPS Representation (Table 2)	VF-I Representation (Table 3)	VF-II Representation (Table 5)
$s i m (s_{1}, s_{2})$	0.8333333	0.9473684	0.9269521
$d i f (s_{1}, s_{2})$	0.1666667	0.0526316	0.0730479
$s i m (s_{1}, s_{3})$	0.6666667	0.9350649	0.7717391
$d i f (s_{1}, s_{3})$	0.3333333	0.0649351	0.2282609

Table 9. Comparing $s_{2}$ with $s_{3}$, $s_{4}$, $s_{5}$ and $s_{6}$ via the Euclidean distance.

Euclidean Distance d	FPS Representation (Table 2)	VF-I Representation (Table 3)	VF-II Representation (Table 5)
$d (s_{1}, s_{2})$	0.7071068	0.1161895	0.1946792
$d (s_{2}, s_{3})$	0.7071068	0.1228821	0.5144900
$d (s_{2}, s_{4})$	0.7071068	0.1760682	0.3613862
$d (s_{2}, s_{5})$	0.7071068	0.1886796	0.3724245
$d (s_{2}, s_{6})$	0.7071068	0.2262741	0.3724245

Table 10. Comparing $s_{2}$ with $s_{3}$, $s_{4}$, $s_{5}$ and $s_{6}$ via the similarity/difference.

Similarity/Difference	FPS Representation (Table 2)	VF-I Representation (Table 3)	VF-II Representation (Table 5)
$s i m (s_{2}, s_{3})$	0.8333333	0.9400631	0.8086253
$d i f (s_{2}, s_{3})$	0.1666667	0.0599369	0.1913747
$s i m (s_{2}, s_{4})$	0.8333333	0.9240506	0.8238342
$d i f (s_{2}, s_{4})$	0.1666667	0.0759494	0.1761658
$s i m (s_{2}, s_{5})$	0.8333333	0.9079755	0.8217054
$d i f (s_{2}, s_{5})$	0.1666667	0.0920245	0.1782946
$s i m (s_{2}, s_{6})$	0.8333333	0.9036145	0.8217054
$d i f (s_{2}, s_{6})$	0.1666667	0.0963855	0.1782946

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sereti, F.; Georgiou, D.; Karakasidis, T. Models of Low-Dimensional Vector-Fuzzy Representations of Genetic Sequences and Amino Acids. AppliedMath 2026, 6, 39. https://doi.org/10.3390/appliedmath6030039

