Article

SAIN: Search-And-INfer, a Mathematical and Computational Framework for Personalised Multimodal Data Modelling with Applications in Healthcare

by Cristian S. Calude 1, Patrick Gladding 2, Alec Henderson 3 and Nikola Kasabov 4,*

1 Department of Computer Science, University of Auckland, Auckland 1142, New Zealand
2 Bioengineering Institute, University of Auckland, Auckland 1142, New Zealand
3 University of Queensland Center for Clinical Research, Brisbane, QLD 4072, Australia
4 Department of Computer Science, Auckland University of Technology, Auckland 1142, New Zealand
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(10), 605; https://doi.org/10.3390/a18100605
Submission received: 26 June 2025 / Revised: 10 September 2025 / Accepted: 18 September 2025 / Published: 26 September 2025
(This article belongs to the Special Issue Algorithms for Computer Aided Diagnosis: 2nd Edition)

Abstract

Personalised modelling has become dominant in personalised medicine and precision health. It creates a computational model for an individual, based on large repositories of existing personalised data, aiming to achieve the best possible personal diagnosis or prognosis and to derive an informative explanation for it. Current methods still work on a single data modality or treat all modalities with the same method. The proposed method, SAIN (Search-And-INfer), offers better results and an informative explanation for classification and prediction tasks on a new multimodal object (sample) using a database of similar multimodal objects. The method is based on different distance measures suitable for each data modality and introduces a new formula to aggregate all modalities into a single vector distance measure, used to find the objects closest to a new one, which are then used for probabilistic inference. This paper describes SAIN and applies it to two types of multimodal data, cardiovascular diagnosis and EEG time series, modelled by integrating modalities such as numbers, categories, images, and time series, using a software implementation of SAIN.

1. Introduction

Multimodal data has been gathered at a personalised level in large quantities, worldwide, for many applications, such as neuro-imaging analysis, personalised health diagnosis and prognosis, environmental modelling, and financial modelling, to mention only a few. Still, there are no efficient methods to integrate the various multimodal data of a new subject and derive a more accurate and explainable diagnosis or prognosis based on the existing multimodal data of many other subjects. The goal of this paper is to create such a method.
There are three main approaches to multimodal data integration in machine learning, explored so far [1]:
1. Early integration, where a common vector represents all modalities used for training a model and for its recall. This approach has been used for integrating time series and textual information [2,3], for image integration [4], and for the integration of clinical, social, and cognitive data modalities to predict psychosis in young adults [5]. While the method in [4] is based on deep, feedforward neural networks, the methods in [2,5] use a brain-inspired spiking neural network architecture NeuCube [6].
2. Late integration, where a model is created and trained for each of the modalities of data, and the results from all models are integrated to calculate the output. This approach has been demonstrated in [1] on integrating clinical, genetic, cognitive, and social data for medical prognosis.
3. Hybrid, early, late, and intermediate integration of data modalities, where the two approaches above are combined [1].
The proposed SAIN method in this paper is designed for early integration of data modalities, where specific encoding and distance metrics are suggested for different types of data, along with novel algorithms for search in a multimodal database and inference. These search and inference algorithms are related to building a personalised model for individual outcome assessment and its explanation.
Personalised modelling is concerned with the creation of an individual model for a new personal (individual) record of data X, using an already existing repository D of many other personal records for which the outcomes are known, to assess the outcome of the new record X [7]. Methods for personalised modelling have been developed to work mainly on a single modality of data [8,9]. These methods have been used in many applications and constitute the state of the art in the field (e.g., refs. [1,9,10,11,12,13]). In [9,14], personalised modelling based on static and temporal data modalities is proposed, where these modalities are used to train a spiking neural network model.
The enormous growth of personal multimodal data worldwide demands more advanced methods for personal modelling with the use of multimodal data. This paper offers such a method, called SAIN, where the specificity of each data modality is considered and new algorithms are proposed for the encoding of multimodal data, for search in a multimodal data repository, and for multimodal inference, along with its explanation and visualisation. In contrast with the statistical solutions used in [15,16,17], we adopt a probabilistic framework that gives more precise evaluations of probabilities of outcomes for an individual.

2. Mathematical Description

In this section, we present the mathematical method. We start with the database coding and the class of distances (metrics) used in this article, followed by the list of tasks (problems) and their solutions, illustrated with detailed examples. Solutions to three key problems, namely survival analysis, heart disease diagnosis, and time series classification, are then presented and illustrated with numerical examples.

2.1. Database

We will work with multidimensional data described as follows:
  • $m > 1$ objects (samples) $o_1, \dots, o_m$;
  • each object $o_i$ ($1 \le i \le m$) is defined by $n > 1$ criteria (variables) $c_1, \dots, c_n$ with values in linearly ordered domains $D_j$ with $\min D_j$ and $\max D_j$; if some value $a_{i,j} \in D_j$ ($1 \le i \le m$, $1 \le j \le n$) is either missing or uncertain, then it is recorded as $\ast$;
  • $n > 1$ weights $w_1, \dots, w_n$ in $[0, 1]$ with $\sum_{i=1}^{n} w_i = 1$, where each $w_i$ ($1 \le i \le n$) quantifies the importance of the criterion $c_i$; if $w_i = \frac{1}{n}$ for all $1 \le i \le n$, then all criteria are equally important; a criterion $c_i$ is ignored if $w_i = 0$.
Data of independent variables are organised as in Table 1.

2.2. Distance Metrics

A distance metric on a space X is a non-negative real-valued function $d : X \times X \to \mathbb{R}_{+}$ satisfying the following three conditions for all $x, y, z \in X$: (a) $d(x, y) = 0$ if and only if $x = y$; (b) $d(x, y) = d(y, x)$; (c) $d(x, z) \le d(x, y) + d(y, z)$.
The multicriteria metrics [18,19] (used in multicriteria recommendation systems [20]) presented in this part can be used on a variety of domains $X = D_j$: they can be sets of logical values, rational numbers, percentages, digitally codified images, sounds, videos, and many others. We use a bounded distributive complemented lattice $(L, \vee, \wedge, \bar{x}, 0, 1)$ to describe the domains $D_j$ uniformly. We rank all objects according to their aggregated distance to a new one; based on that, we calculate the probabilities of the new object belonging to the different classes represented in the object repository.
Here is a list with illustrative, but far from exhaustive, examples of domains $D_j$:
  • Logical Boolean domain: $(\{0, 1\}, \max, \min, \bar{x}, 0, 1)$, where $\bar{x} = 1 - x$ for $x \in \{0, 1\}$.
  • Logical non-Boolean domain: $(\{0, \frac{1}{N-1}, \frac{2}{N-1}, \dots, \frac{N-2}{N-1}, 1\}, \max, \min, \bar{x}, 0, 1)$, where $\bar{x} = 1 - x$ for $x \in \{0, \frac{1}{N-1}, \dots, 1\}$.
  • Numerical domain with natural values: $(\{0, 1, \dots, N\}, \max, \min, \bar{x}, 0, N)$, where $\bar{x} = N - x$ for $x \in \{0, 1, \dots, N\}$.
  • Numerical domain with rational values: $(\{x \mid a \le x \le A\}, \max, \min, \bar{x}, a, A)$, where $\bar{x} = A - x$ for $a \le x \le A$.
  • Binary code: $(\{0, 1\}^n, \max, \min, \bar{x}, 00\cdots0, 11\cdots1)$, where the domain consists of all binary strings of length n, $\{0, 1\}^n = \{x_1 x_2 \cdots x_n \mid x_i \in \{0, 1\}\}$, and for all $x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n \in \{0, 1\}^n$: $\max(x_1 \cdots x_n, y_1 \cdots y_n) = \max(x_1, y_1) \max(x_2, y_2) \cdots \max(x_n, y_n)$, $\min(x_1 \cdots x_n, y_1 \cdots y_n) = \min(x_1, y_1) \min(x_2, y_2) \cdots \min(x_n, y_n)$, and $\overline{x_1 x_2 \cdots x_n} = (1 - x_1)(1 - x_2) \cdots (1 - x_n)$.
In the lattice $(L, \vee, \wedge, \bar{x}, 0, 1)$ we introduce, following [18], the metric
$$d(x, y) = \begin{cases} (x \wedge \bar{y}) \vee (\bar{x} \wedge y), & \text{if } x \ne y, \\ 0, & \text{otherwise,} \end{cases}$$
for $x, y \in L$. This metric d can be extended to $L \cup \{\ast\}$ as follows:
$$d_{\ast}(x, y) = \begin{cases} d(x, y), & \text{if } x, y \in L, \\ \sigma(x), & \text{if } x \in L \text{ and } y = \ast, \\ \sigma(y), & \text{if } y \in L \text{ and } x = \ast, \\ 0, & \text{otherwise,} \end{cases}$$
where $\sigma(x) = \max(x, \bar{x})$.
The metrics $d_{\ast,i}$ on $L_i \cup \{\ast\}$, $1 \le i \le n$, can be extended to $(L_1 \cup \{\ast\}) \times \cdots \times (L_n \cup \{\ast\})$, i.e., to n-dimensional vectors, as follows:
$$d_{\ast}(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n) = \sum_{i=1}^{n} d_{\ast,i}(x_i, y_i),$$
where $x_i, y_i \in L_i \cup \{\ast\}$, $1 \le i \le n$.
In what follows, we write d for $d_{\ast}$ when the meaning is clear from the context.
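To make the definitions concrete, here is a minimal Python sketch of the metric d, the function $\sigma$, and the vector extension $d_{\ast}$ on the unit-interval lattice (with $\wedge = \min$, $\vee = \max$, and complement $\bar{x} = 1 - x$); the function names are ours, not those of the SAIN software.

    AST = None  # we represent the missing value * as None

    def d(x, y):
        # d(x, y) = (x AND not-y) OR (not-x AND y); 0 when x == y
        if x == y:
            return 0.0
        return max(min(x, 1 - y), min(1 - x, y))

    def sigma(x):
        # sigma(x) = max(x, complement of x)
        return max(x, 1 - x)

    def d_star(x, y):
        # extension of d to L union {*}
        if x is not AST and y is not AST:
            return d(x, y)
        if x is not AST:      # y is missing
            return sigma(x)
        if y is not AST:      # x is missing
            return sigma(y)
        return 0.0            # both values missing

    def d_vector(xs, ys):
        # componentwise sum over n-dimensional vectors of lattice values
        return sum(d_star(x, y) for x, y in zip(xs, ys))

    # Example: two Boolean vectors, one entry missing:
    # d(1, 1) + sigma(0) + d(1, 0) = 0 + 1 + 1 = 2.0
    print(d_vector([1, 0, 1], [1, AST, 0]))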

2.3. Tasks Specification

Data organised as in Table 2, Table 3 and Table 4 consists of independent objects augmented with a column of labels, the weights of criteria, and a new unlabelled object, respectively.
Additional information associated with data in Table 2 may include the range of each criterion $c_j$ and the associated specific distance, e.g., the Euclidean distance for real numbers and the distance $d_{\ast}$ for binary strings or strings over a non-binary alphabet (e.g., for images or colours).
We consider the following tasks:
  • Task 1: Calculate the distance (or similarity metric) between the new object x and each object in Table 2. If the distance corresponding to criterion $c_j$ is $d_j$, then
$$d(o_i, x) = \sum_{j=1}^{n} w_j \cdot d_j(a_{i,j}, x_j).$$
  • Task 2: Given a threshold δ > 0 , calculate all objects o i at a distance at most δ to x.
  • Task 3: Calculate the probability of a new object belonging to a labelled class (e.g., low risk vs. high risk) using a threshold δ and Table 2.
  • Task 4: Rank the criteria in Table 2 and determine the marker criterion (or criteria), i.e., the most important one(s).
  • Task 5: Assign alternative weights to criteria.
  • Task 6: Test the accuracy of the data and of the method used for Task 4.

2.4. Tasks Solutions

For Task 1, we calculate the distances $d(o_i, x)$ between each object $o_i$ in Table 2 and x in Table 4.
For Task 2, given a threshold $\delta > 0$, we calculate all objects in Table 2 at a distance at most $\delta$ to x, that is, the objects which are $\delta$-similar to x:
$$C_{\delta,x} = \{o_i \mid d(x, o_i) \le \delta,\ 1 \le i \le m\},$$
and its complement $\overline{C_{\delta,x}}$.
For Task 3, we calculate the probability that x has the class label $l_t$, which is the ratio of the number of objects in $C_{\delta,x}$ with the label $l_t$ to the size of the cluster $C_{\delta,x}$:
$$Prob(x \text{ has label } l_t) = \frac{\#\{o_i \in C_{\delta,x} \mid l_i = l_t\}}{\#(C_{\delta,x})},$$
where $\#S$ denotes the number of elements of the set S.
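A short Python sketch of the solutions to Tasks 1–3, under simple assumptions: each criterion $c_j$ supplies its own distance function $d_j$, and each object is a pair (values, label). The helper names are hypothetical, not part of the SAIN software.

    def weighted_distance(o, x, dists, weights):
        # Task 1: d(o, x) = sum_j w_j * d_j(a_j, x_j)
        return sum(w * dj(a, b) for w, dj, a, b in zip(weights, dists, o, x))

    def delta_cluster(objects, x, dists, weights, delta):
        # Task 2: all labelled objects within distance delta of x
        return [(o, lab) for o, lab in objects
                if weighted_distance(o, x, dists, weights) <= delta]

    def class_probabilities(objects, x, dists, weights, delta):
        # Task 3: Prob(x has label l) = #{o in C | label(o) = l} / #C
        labels = [lab for _, lab in delta_cluster(objects, x, dists, weights, delta)]
        return {l: labels.count(l) / len(labels) for l in set(labels)}

    # Example: two numeric criteria, equal weights, absolute-difference distance
    abs_d = lambda a, b: abs(a - b)
    data = [((0.1, 0.9), 1), ((0.2, 0.8), 1), ((0.9, 0.1), 2)]
    print(class_probabilities(data, (0.15, 0.85), [abs_d, abs_d], [0.5, 0.5], 0.2))
    # {1: 1.0} -- only the two class 1 objects are within delta = 0.2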
For Task 4, we work with Table 2. Recall that, for each criterion c i , we have a domain D i augmented with information “high” or “low,” indicating whether higher or lower values are desirable. Based on this information, we can construct a hypothetical object (see Table 5) which has the most desirable values for each criterion: one could see this object as an “exemplar” one.
Sometimes, criteria are interrelated or correlated. This means that, in some cases, there is no unique “exemplar object”, but a couple of them have to be studied in ranking the importance of criteria.
For example, fix an "exemplar object" $o_E$.
  • Compute the distances $d(o_i, o_E)$ between each object $o_i$ in Table 2 and $o_E$, obtaining a vector with m non-negative real components, $V_0 = (d_1^0, \dots, d_m^0)$.
  • For each $1 \le t \le n$, compute the distances $d(o_i, o_E)$ taking into consideration all criteria in Table 2 except $c_t$: obtain the vector $V_t = (d_1^t, \dots, d_m^t)$.
  • Compute the distances $dist(V_0, V_t)$, $1 \le t \le n$, using the formula
$$dist(V_0, V_t) = \sum_{i=1}^{m} |d_i^0 - d_i^t|,$$
and sort them in increasing order. The criterion $c_t$ is a marker if $dist(V_0, V_t) \ge dist(V_0, V_j)$ for every $1 \le j \le n$.
We repeat this procedure for each "exemplar object" and study possible variations.
For Task 5, normalise the distances $dist(V_0, V_t)$ and use these values to construct the weights $w_t^*$, $1 \le t \le n$.
For Task 6, assume we have weights $(w_i)$ associated with Table 2 (see Table 1). To test the accuracy of the data and of the method used for Task 4, compare the original weights $(w_i)$ with $(w_i^*)$. Serious discrepancies signal issues either with the data or with the choices made in applying the method.
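The marker-ranking procedure of Tasks 4–6 can be sketched in Python as follows (same assumptions as above): $V_0$ holds the m distances to the exemplar $o_E$ over all n criteria, and $V_t$ the distances with criterion $c_t$ left out; the normalised leave-one-out scores give the weights $w_t^*$.

    def marker_ranking(objects, exemplar, dists):
        # objects: list of value tuples; exemplar: the "exemplar object" o_E
        n = len(exemplar)

        def dist_to_exemplar(o, skip=None):
            return sum(dj(a, e)
                       for j, (dj, a, e) in enumerate(zip(dists, o, exemplar))
                       if j != skip)

        v0 = [dist_to_exemplar(o) for o in objects]
        scores = []
        for t in range(n):
            vt = [dist_to_exemplar(o, skip=t) for o in objects]
            scores.append(sum(abs(a - b) for a, b in zip(v0, vt)))  # dist(V0, Vt)

        weights = [s / sum(scores) for s in scores]        # Task 5: weights w*
        marker = max(range(n), key=lambda t: scores[t])    # Task 4: marker index
        return marker, weights

Comparing the returned weights with the originally supplied ones implements the consistency check of Task 6.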

2.5. An Example

We illustrate the above tasks with an example of a labelled database in Table 6 and a new object (see Table 7), all having the following seven characteristics (the last column of Table 6 holds the class labels 1 and 2):
  • $c_1$: real number in [0, 100], e.g., age, weight, BMI, etc.;
  • $c_2$: Boolean value {0, 1}, e.g., gender;
  • $c_3$: integer number in {0, …, 100,000}, e.g., gene expression;
  • $c_4$: categorical value {small, med, large}, e.g., size of tumour, body size, keywords;
  • $c_5$: colour {red, yellow, white, black}, e.g., colour of a spot on the body or on the heart;
  • $c_6$: spike sequence over {−1, 0, 1}, e.g., encoded EEG, ECG;
  • $c_7$: black and white image, e.g., MRI, face image.
Table 6. An example of labelled data.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7 (image rows) | Class
o1 | 68.2 | 0 | 6,789 | small | red | 0, 1, −1, −1, 1, 1, 0, 0, 1, −1 | 110 / 001 / 001 | 1
o2 | 93 | 1 | 98,000 | medium | yellow | 0, −1, −1, −1, −1, 0, 0, 1, −1, 1 | 100 / 001 / 001 | 1
o3 | 44.5 | 1 | 5,600 | large | red | 0, 1, −1, 1, −1, 1, 0, 0, 1, −1 | 110 / 101 / 111 | 1
o4 | 56.8 | 0 | 89 | small | white | 1, −1, −1, −1, −1, 1, 0, 0, 1, −1 | 110 / 011 / 101 | 1
o5 | 26.3 | 0 | 9,456 | large | black | 1, −1, −1, −1, 0, 1, 0, 0, 1, −1 | 110 / 111 / 101 | 2
o6 | 81.5 | 1 | 78,955 | medium | red | 0, 1, −1, 1, −1, −1, 0, 0, 1, −1 | 110 / 001 / 111 | 2
o7 | 56.7 | 1 | 68,900 | small | black | 1, −1, −1, 1, −1, 1, 0, 0, 1, 1 | 111 / 001 / 111 | 2
o8 | 20 | 0 | 7,833 | large | yellow | 1, 1, −1, −1, 1, 1, 0, −1, −1, 1 | 100 / 001 / 111 | 2
o9 | 20 | 0 | 7,833 | ∗ | yellow | 1, 1, −1, −1, 1, 1, 0, −1, −1, 1 | 100 / 001 / 111 | 2
(The three rows of each 3 × 3 image in c7 are separated by "/"; the missing value of c4 for o9 is marked ∗.)
Table 7. An example of a new unlabelled object.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7 (image rows)
x | 48.5 | 1 | 45,679 | large | red | 1, 0, 0, −1, 1, −1, 1, 0, 0, 1 | 110 / 001 / 101
In this fictitious example, for simplicity, we did not use weights.
The first step is to code the data in Table 6 and Table 7: the categorical values of $c_4$ are coded as 0, 1, 2; the colours of $c_5$ as 24-bit RGB codes (given in Table 8 in both hexadecimal and binary form); the spike sequences of $c_6$ as ten-digit strings, with −1 coded as 2; and the images of $c_7$ as 9-bit strings. The coded data is in Table 8 and Table 9.
Then, we normalise the data in Table 8 and Table 9: the entries in the first, third, and fourth columns have been divided by 100, 100,000, and 2, respectively. The entries in the last three columns have been transformed into reals in the unit interval, and the column of labels has been removed. In this way, we have obtained Table 10 and Table 11.
Then, we choose an appropriate distance for each criterion; the resulting distances between x and each object, per criterion and in total, are in Table 12 and, ranked, in Table 13.
We can compute $C_{\delta,x} = \{o_i \mid d(o_i, x) \le \delta\}$ and, accordingly, the probability that x would be labelled in class 1 or class 2.
If $\delta = 3.5$, then $C_{3.5,x} = \{o_1, o_2, o_3, o_5, o_6, o_7, o_8\}$, so the probability that x is in class 1 is 3/7 and the probability that x is in class 2 is 4/7. If $\delta = 2.5$, then its closest cluster is $C_{2.5,x} = \{o_3, o_6, o_7\}$, so the probability that x is in class 1 is 1/3 and the probability that x is in class 2 is 2/3.
This induces the following ranking of the objects in Table 8, from closest to farthest: $o_3, o_6, o_7, o_5, o_2, o_8, o_1, o_9, o_4$.
For Task 4, assume that the criteria $c_1, \dots, c_7$ in Table 10 have the additional information (m, m, m, m, M, M, m), where m (M) means that the exemplar value is the minimum (maximum) value. Based on this vector, we compute the exemplar object (see Table 14).
Next we calculate $V_0, \dots, V_7$ (see Table 15), and finally the distances $dist(V_0, V_t)$, $t = 1, 2, \dots, 7$, and the weights as their normalised values (see Table 16). The marker, in this case, is the criterion $c_5$.

2.6. Complexity Estimation of the SAIN Method

To compute the similarity between a new object and the N objects in a data repository, the proposed method performs one distance evaluation per object, so its running time is linear in N (for a fixed number of criteria); hence it is very fast and scales to large repositories.

3. Survival Analysis in SAIN

Medical survival analysis evaluates the time until an event of interest occurs, such as death or disease recurrence, in a group of patients. This analysis is often used to compare treatment outcomes or predict prognosis. In contrast with the statistical solutions used in [15,16,17], we adopt a probabilistic framework that gives more precise evaluations of probabilities.

3.1. Data and Tasks

We are given the following data:
1. Table 17, in which the first column lists the patients treated for the same disease with the same method under strict conditions, and whose last column records the times until the patients' deaths.
2. Table 18, which includes the record of the new patient p.
3. A threshold $\delta$ which defines the acceptable similarity between p and the relevant $p_i$'s in the survival database (i.e., $d(p, p_i) \le \delta$).
We consider the following tasks:
Task 1: What is the life expectancy of p?
Task 2: What is the probability that the life expectancy of p is greater than or equal to a given T?

3.2. Tasks Solutions

Using a standard method of survival analysis:
1. For Task 1,
(a) compute the set of patients that are similar up to $\delta$ to p:
$$C_{\delta,p} = \{p_i \mid d(p, p_i) \le \delta,\ 1 \le i \le m\};$$
(b) using $C_{\delta,p}$, compute the probability that p will survive the time $t_j$:
$$Prob_\delta(p \text{ survives time } t_j) = \frac{\#\{p_i \in C_{\delta,p} \mid t_i = t_j\}}{\#(C_{\delta,p})};$$
(c) compute the life expectancy of p using the formula
$$LE_\delta(p) = \sum_{t_j} t_j \times Prob_\delta(p \text{ survives time } t_j),$$
where the sum ranges over the distinct survival times $t_j$ of the patients in $C_{\delta,p}$.
2. For Task 2, calculate the probability that the life expectancy of p is at least time T:
$$Prob_\delta(LE(p) \ge T) = \sum_{t_j \ge T} Prob_\delta(p \text{ survives time } t_j),$$
where the sum again ranges over the distinct survival times of the patients in $C_{\delta,p}$.
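A sketch of this survival inference in Python, with hypothetical helper names; with the distances of Table 21 (rounded), the times of Table 19, and $\delta = 3.37$, it reproduces $LE_\delta(p) = 50.11$ and $Prob_\delta(LE_\delta(p) \ge 60) = 4/9$.

    from collections import Counter

    def survival_inference(distances, times, delta, T=None):
        # survival times of the patients in C_{delta,p}
        cluster = [t for dist, t in zip(distances, times) if dist <= delta]
        # Prob_delta(p survives time t_j), for each distinct time t_j
        probs = {t: c / len(cluster) for t, c in Counter(cluster).items()}
        life_expectancy = sum(t * pr for t, pr in probs.items())
        prob_at_least_T = sum(pr for t, pr in probs.items()
                              if T is None or t >= T)
        return life_expectancy, probs, prob_at_least_T

    times = [12.3, 15, 68, 1.4, 40.5, 97.2, 97.2, 55.7, 63.7]   # Table 19
    dists = [2.674, 1.956, 0.529, 3.361, 1.806, 1.251, 1.537, 2.085, 3.085]
    le, probs, p60 = survival_inference(dists, times, delta=3.37, T=60)
    print(round(le, 2), round(p60, 4))  # 50.11 0.4444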

3.3. An Example

We illustrate the above tasks with an example of a database in which columns 2–8 record patients' medical test results and the last column records the time to death (see Table 19), together with a new patient (see Table 20).
The distance for column 4 is $d(x, y) = |x - y|$ if $x, y \ne \ast$, and $d(x, \ast) = \max(x, 1 - x)$. For example, $d(1, \ast) = \max(1, 1 - 1) = 1$. For all other columns, the distance is $d(x, y) = |x - y|$. Finally, the total distance is the sum of the individual distances (seven terms), with the results in Table 21.
The results for Tasks 1 and 2, for two choices of $\delta$, are listed below:
1. For $\delta \ge 3.37$, $C_{\delta,p} = \{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9\}$, that is, the entire database. Then,
(a) $LE_\delta(p) = 50.11$;
(b)
i. $Prob_\delta(p$ survives time $12.3) = 1/9$,
ii. $Prob_\delta(p$ survives time $15) = 1/9$,
iii. $Prob_\delta(p$ survives time $68) = 1/9$,
iv. $Prob_\delta(p$ survives time $1.4) = 1/9$,
v. $Prob_\delta(p$ survives time $40.5) = 1/9$,
vi. $Prob_\delta(p$ survives time $97.2) = 2/9$,
vii. $Prob_\delta(p$ survives time $55.7) = 1/9$,
viii. $Prob_\delta(p$ survives time $63.7) = 1/9$.
(c)
i. $Prob_\delta(LE_\delta(p) \ge 1.4) = 1$,
ii. $Prob_\delta(LE_\delta(p) \ge 12.3) = 8/9$,
iii. $Prob_\delta(LE_\delta(p) \ge 15) = 7/9$,
iv. $Prob_\delta(LE_\delta(p) \ge 40.5) = 6/9$,
v. $Prob_\delta(LE_\delta(p) \ge 55.7) = 5/9$,
vi. $Prob_\delta(LE_\delta(p) \ge 63.7) = 4/9$,
vii. $Prob_\delta(LE_\delta(p) \ge 68) = 3/9$,
viii. $Prob_\delta(LE_\delta(p) \ge 97.2) = 2/9$.
  • We can calculate other probabilities; for example, $Prob_\delta(LE_\delta(p) \ge 60) = Prob_\delta(p$ survives time $63.7) + Prob_\delta(p$ survives time $68) + Prob_\delta(p$ survives time $97.2) = 1/9 + 1/9 + 2/9 = 4/9$.
2. For $\delta = 2.5$, $C_{\delta,p} = \{p_2, p_3, p_5, p_6, p_7, p_8\}$. Then,
(a) $LE_\delta(p) = 62.27$;
(b)
i. $Prob_\delta(p$ survives time $15) = 1/6$,
ii. $Prob_\delta(p$ survives time $68) = 1/6$,
iii. $Prob_\delta(p$ survives time $40.5) = 1/6$,
iv. $Prob_\delta(p$ survives time $97.2) = 2/6$,
v. $Prob_\delta(p$ survives time $55.7) = 1/6$.
(c)
i. $Prob_\delta(LE_\delta(p) \ge 15) = 1$,
ii. $Prob_\delta(LE_\delta(p) \ge 40) = 5/6$,
iii. $Prob_\delta(LE_\delta(p) \ge 55.7) = 4/6$,
iv. $Prob_\delta(LE_\delta(p) \ge 68) = 3/6$,
v. $Prob_\delta(LE_\delta(p) \ge 97.2) = 2/6$.
  • Similarly, we can calculate the probabilities $Prob_\delta(LE_\delta(p) \ge 45) = 4/6$ and $Prob_\delta(LE_\delta(p) \ge 100) = 0$.
In contrast with the statistical solutions used in [15,16,17], we adopted a probabilistic framework that gives more precise evaluations of probabilities. The SAIN algorithms also include some statistically established methods, such as the t-test, for ranking variables before applying the inference method.

4. SAIN: A Modular Diagram and Functional Information Flow

The SAIN framework consists of the following modules (Figure 1):
1. Multimodal data of a new object X.
2. An existing repository D of multimodal data of many objects, labelled with their outcome.
3. A module of algorithms for searching in the database D, based on the distance between X and each object in D.
4. Defining a subset $D_x$ of D, such that X is close to the objects in $D_x$ based on a given threshold.
5. A module of algorithms for building a model $M_x$ in $D_x$.
6. An inference algorithm to derive the output for X from the model $M_x$ and to visualise it for explanation purposes. Figure 1 gives a modular view of the SAIN framework and Figure 2 shows the information processing flow:
(a) Encoding the multimodal data of X and D.
(b) Choosing a distance metric and similarity search in the dataset D.
(c) Calculating the aggregated distance between the new data vector X and the closest vectors in $D_x$.
(d) Creating a model $M_x$ in $D_x$.
(e) Applying inference by calculating the value $X_{c_j}$ for each class $C_j$ (or output value), using the wwkNN method [7].
(f) Reporting and visualisation of the results of the individual model $M_x$. This is illustrated in Figure 3.
The inference method is based on the wwkNN (weighted variables, weighted samples k-nearest neighbour) method proposed by Kasabov [7]. This method first ranks the (multimodal) variables using a t-test to estimate their weights/impact towards the output; then, it measures the variable-weighted distance between the new object X and the objects in the database $D_x$. For each class $C_j$, the closer the samples of class $C_j$ are to X, the higher the calculated value $X_{c_j}$. The new object X is classified in class $C_l$ if $X_{c_l}$ is the highest among all $X_{c_j}$ values. In Figure 1 and Figure 2, we present the modular diagram and the functional information flow of SAIN.
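A minimal Python sketch of wwkNN-style inference as described above, assuming a two-class problem with numerical (already encoded) variables: the variable weights come from the absolute t-statistics, and each of the k nearest neighbours votes for its class with a weight decreasing in its distance to X. This is our reading of [7], not the authors' exact implementation.

    import numpy as np
    from scipy import stats

    def wwknn_classify(X_train, y_train, x_new, k=5):
        c0, c1 = np.unique(y_train)
        # rank variables by the absolute t-statistic between the two classes
        t = np.array([abs(stats.ttest_ind(X_train[y_train == c0, j],
                                          X_train[y_train == c1, j]).statistic)
                      for j in range(X_train.shape[1])])
        w = t / t.sum()                       # variable weights
        # variable-weighted Euclidean distance to every sample in D_x
        dist = np.sqrt(((X_train - x_new) ** 2 * w).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # distance-weighted votes: closer samples contribute more to X_{c_j}
        votes = {c0: 0.0, c1: 0.0}
        for i in nearest:
            votes[y_train[i]] += 1.0 / (1.0 + dist[i])
        return max(votes, key=votes.get), votes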
Figure 1. A modular diagram of the proposed SAIN computational framework.
Figure 2. A flow of data and information processing in the SAIN computational framework.
Figure 3. An example of visualisation of a personalised SAIN model. The closest four samples (out of six) to the new object (star) are from class 1 (shown in red; class 2 is shown in blue), using the top three informative variables. Each sample is multimodal, and the top three variables can be of different modalities.

5. Case Studies for Medical Diagnosis and Prognosis

We present three case studies in which we applied SAIN.

5.1. Heart Disease Diagnosis

We worked with the well-known Cleveland dataset, which contains multiple data types [21]. The UCI Heart Disease dataset includes 76 attributes. As in most articles, the attributes in our experiments were restricted to 14 (see Table 22).
The problem is a binary classification of whether the patient has or does not have heart disease.
First, we selected suitable distance metrics and weights for the attributes. For binary attributes, the distance is simply whether the values are equal; for non-binary discrete attributes, such as the resting electrocardiographic results, the appropriate distance measure is not obvious and should be informed by an expert. We code the electrocardiographic results as 0 for normal, 1 for having ST-T wave abnormality, and 2 for showing probable or definite left ventricular hypertrophy following Estes' criteria.
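As an illustration, the per-attribute distances could be assembled as in the following Python sketch; the attribute ranges and the choice of range-scaled absolute difference are illustrative assumptions, not the exact expert-chosen metrics.

    # Hypothetical per-attribute distance functions for the Cleveland data
    binary_d = lambda a, b: 0.0 if a == b else 1.0      # e.g., sex, fbs, exang

    def scaled_d(lo, hi):
        # absolute difference scaled to [0, 1] by the attribute's range
        return lambda a, b: abs(a - b) / (hi - lo)

    DISTANCES = {
        "age": scaled_d(29, 77),       # illustrative range of ages
        "sex": binary_d,
        "restecg": scaled_d(0, 2),     # 0 normal, 1 ST-T abnormality, 2 LVH
        "thalach": scaled_d(71, 202),  # illustrative range of max heart rates
    }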
Many studies have tested the Cleveland dataset with different machine learning techniques. For example, ref. [21] lists different algorithms with performances ranging from 47% to 80% accuracy. SAIN achieved an 82% accuracy score. Why SAIN? The search is fast, uses appropriate distances chosen by a medical expert, and provides explainability at a personal level, including probabilities. It offers different scenarios for modelling by experimenting with different sets of features, parameters, and preferred outcome visualisations.
The SAIN experiment used binary and numerical representations for each variable, as described. We used the same data representation (recommended by medical experts) as in the original paper [21]. The accuracy of the SAIN experiment was 82%, the same accuracy as in [6], which used classical machine learning. In addition, SAIN allows visualising each personalised model, as shown in the examples in Figure 3 and Figure 4.

5.2. Time Series Classification

The proposed SAIN framework can incorporate time series data as another modality, in addition to the other modalities of data for a person, making a joint multimodal personal vector. Time series data is encoded into a ternary vector using spike encoding algorithms [6]: if there is a positive change from one discrete time point to the next in the time series, there will be a positive spike (encoded as 1); a negative change results in a negative spike (−1); and no change results in a 0 value. This is illustrated on a hypothetical time series in Figure 5. This approach applies to any raw time series data, at any time scale; here we show just two hypothetical examples, of brain EEG data (Figure 6) and cardio data (Figure 7).
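A sketch of this sign-based spike encoding in Python; an optional noise threshold makes the encoder act as a filter (cf. Figure 7). One spike value is produced per pair of consecutive time points.

    def spike_encode(series, threshold=0.0):
        # positive change -> 1, negative change -> -1, no (significant) change -> 0
        spikes = []
        for prev, curr in zip(series, series[1:]):
            change = curr - prev
            if change > threshold:
                spikes.append(1)
            elif change < -threshold:
                spikes.append(-1)
            else:
                spikes.append(0)
        return spikes

    print(spike_encode([0.2, 0.5, 0.5, 0.1, 0.4]))  # [1, 0, -1, 1]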
Many datasets for classifying outcomes of events consist of multiple time series, and each variable in a time series may depend on other variables that change in time. The proposed model deals with this by encoding each time series (signal) into a ternary vector, which can then be processed for classification in the SAIN framework. In the EEG example below, the variables are 14 channels of temporal EEG data, recorded at places of interest on the human scalp.
Examples of signals measured over the same period include EEG channels, fMRI voxels, ECG electrodes, seismic sensor signals, financial time series, gene expression, voice, and music frequency bands [11]. Even when the variable (signal) measurements are independent, the signals may impact each other, as they represent the same object/person over the same time period. The number N of these signals can vary from just a few, for a short time window T (Figure 5), to hundreds and thousands, when the time varies from a few milliseconds to minutes, hours, days, etc.
Figure 6 shows an EEG experiment, and Figure 7 shows a cardio-vascular disease signal.
Next, we present a simple example of how this search can be computed for a new record X consisting of only three variables/signals (e.g., EEG channels, ECG electrodes) over a short period of five time moments and the database D consisting of only six such records, which are labelled by outcome labels 1, 2, 3 (e.g., diagnosis, prognosis).
In addition to the record X, a weight vector is supplied with the weighted importance of the signals at different time points, e.g., W = [ 0.1 , 0.2 , 0.4 , 0.2 , 0.1 ] , meaning that the most important and informative part of the measurements is at time point 3.
The new record is X = [(1, 1, −1, 0, 1) (signal, EEG channel 1); (0, 1, 1, 1, −1) (signal, EEG channel 2); (1, 1, −1, −1, 0) (signal, EEG channel 3)], with W = [0.1, 0.2, 0.4, 0.2, 0.1].
The database contains the records $R_1, \dots, R_6$ with labels L (see Table 23).
The new record X of EEG signals will be classified in class 1, as it is closest, according to the Euclidean distance, to the class 1 data samples $R_1$ and $R_2$.
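The classification step of this example can be reproduced with the following sketch: a per-time-point weighted Euclidean distance, summed over the three channels, between X and each record of Table 23 (the helper names are ours).

    import numpy as np

    W = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # time-point weights
    X = np.array([[1, 1, -1, 0, 1],            # channel 1
                  [0, 1, 1, 1, -1],            # channel 2
                  [1, 1, -1, -1, 0]])          # channel 3

    def record_distance(R, X, w):
        # weighted Euclidean distance per channel, summed over the channels
        return sum(np.sqrt((w * (r - x) ** 2).sum()) for r, x in zip(R, X))

    R2 = np.array([[1, 0, -1, 0, 1], [0, 1, 1, 1, -1], [1, 0, -1, -1, 1]])
    print(record_distance(R2, X, W))  # small distance: X is close to class 1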

5.3. Predicting Longevity in Cardiac Patients

We utilised a dataset [22] on which we applied a binary classification of whether the patient had an event (e.g., death) and, for those that had an event, whether it would occur in the near future (within the next 180 days, i.e., approximately six months). The dataset contained a set of 150 variables and an outcome, with 295 patients in the first dataset and 49 in the second. The data included a mix of variables that can be grouped as follows:
  • Demographics, risk factors, disease states, medication, and deprivation scores;
  • Echocardiography, cardiac ultrasound measurements;
  • Advanced ECG measurements.
The other data includes the days until the event occurred and the censoring date for the Cox proportional hazards modelling.
The objectives are to predict an arrhythmic event or death.
Before running the algorithm, the data was normalised and, to account for the data being unbalanced, we applied the SMOTE data balancing method [11] within each leave-one-out iteration (ensuring that the held-out data point was never used to generate synthetic samples). For the event classification dataset, the model achieved an accuracy of 79%. This breaks down into classifying no event with 80% accuracy (198/247) and an event with 73% accuracy (36/49). It is worth noting that the confidence of each individual classification can be explored; a sample of the classification confidence is shown in Figure 8.
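A sketch of this evaluation protocol, using the scikit-learn and imbalanced-learn packages with a kNN classifier as a stand-in for the SAIN inference step; SMOTE is fitted inside each leave-one-out split, so the held-out point never influences the synthetic samples.

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    def loo_accuracy_with_smote(X, y, k=5):
        correct = 0
        for train_idx, test_idx in LeaveOneOut().split(X):
            # balance only the training fold
            X_res, y_res = SMOTE().fit_resample(X[train_idx], y[train_idx])
            clf = KNeighborsClassifier(n_neighbors=k).fit(X_res, y_res)
            correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
        return correct / len(y)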
For the second experiment, we normalised the dataset and removed any columns with unknown values. We then applied a genetic algorithm to find the set of features to use for classification. We found a set of 34 variables which provided an accuracy of 81%, with 34/34 for class 0 and 6/15 for class 1. Alternatively, if we apply SMOTE and focus more on the accuracy of class 1, we obtain 69% accuracy, distributed more evenly, with 24/34 for class 0 and 10/15 for class 1.
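The feature selection step can be sketched as a simple genetic algorithm over binary feature masks; the population size, rates, and elitist selection below are illustrative assumptions, and the fitness function would be, e.g., the leave-one-out accuracy of the classifier restricted to the masked features.

    import random

    def ga_feature_select(fitness, n_features, pop=30, gens=50, p_mut=0.02):
        population = [[random.randint(0, 1) for _ in range(n_features)]
                      for _ in range(pop)]
        for _ in range(gens):
            ranked = sorted(population, key=fitness, reverse=True)
            parents = ranked[:pop // 2]                 # keep the fitter half
            children = []
            while len(children) < pop - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_features)   # one-point crossover
                child = a[:cut] + b[cut:]
                child = [g ^ (random.random() < p_mut) for g in child]  # mutate
                children.append(child)
            population = parents + children
        return max(population, key=fitness)             # best feature mask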
Experiment one, which used the non-balanced complete dataset, showed satisfactory results: 80% for class 0 (no event) and 73% for class 1 (event). This demonstrates the ability of the SAIN method to work with imbalanced data. It also suggests that making use of records with missing values is preferable to removing them, as was done in experiment two. Furthermore, the selection of variables with a genetic algorithm also showed improvement. The genetic algorithm, included in the SAIN software (version 0.1), can also help select biomarker variables in other cases (see Figure 4).

6. Discussion

In Table 24, the features implemented in the proposed SAIN methodology and framework are compared qualitatively with similar features of already developed methods for personalised modelling. SAIN can deal with an unlimited number of data modalities, including numbers, time series, images, videos, and digitally encoded elements. It also offers multicriteria metrics for distance measurement across variables from different domains, e.g., logical Boolean domains, logical non-Boolean domains, numerical domains with natural and rational values, and binary codes. Furthermore, SAIN processes all modalities of data into a single individual (personalised) vector, which allows the early integration of the modalities and facilitates the discovery of causal associations across modalities to explain individual outcomes. This contrasts with the late integration of modalities [1], where a separate model is built for each modality and their outputs are weighted to calculate the final output. The explainability offered by each model is listed in Table 24.
When tested on benchmark and domain problem data, all personalised models have exceeded the accuracy of the corresponding global models, where one model is created from all data. In terms of speed of processing, the proposed SAIN method is superior, due to the early integration and the binary representation of most of the modalities.

7. Conclusions

This paper presents a new search and inference method, called SAIN, for multimodal data integration and personalised model creation based on these multimodal data. The model not only evaluates the outcome for a person more accurately than traditional machine learning methods using a single modality of data, but also explains the proposed solution in terms of probability and visual explanation.
In its current form, this paper is more directed towards revealing a new methodology and algorithms than real full-scale medical applications. However, we have illustrated the methods using hypothetical and real-case health and medical datasets. Further utilisation of the proposed framework is currently being developed for large-scale biomedical data.
The proposed new method offers new functionality and features for personalised search and model creation in multimodal data, some of which are listed below:
  • The method is suitable for multimodal data searches in heterogeneous datasets, e.g., numbers, text, images, sound, categorical data.
  • It is suitable for personalised model creation to classify or predict specific outcomes based on multimodal and heterogeneous data.
  • It uses a similarity measure based on multicriteria metrics. In this way, inaccurate measurement of similarity on a large number of heterogeneous variables is avoided.
  • Its search is fast even on large datasets and includes advanced personalised searches with multiple parameters and features.
  • It facilitates multiple solutions with corresponding probabilities.
  • It is suitable for unsupervised clustering in multimodal heterogeneous data.
In conclusion, integrating all possible data modalities for a single subject to predict/classify the subject's state in relation to existing ones is an open problem in data science. While the creation of personalised models based on single-modality data [23] and the clustering of single-modality data into a single cluster [26] have been successfully developed, the theory, framework, and algorithms proposed in this paper are the first to integrate all data modalities for a single subject into a single vector-based representation and to make an inference based on it. For the first time, time series, such as EEG and ECG data, are included in this unified representation after suitable encoding. In this respect, spike encoding of time series is used, integrating statistical and brain-inspired information representation. The human brain integrates sensory data modalities into its spatio-temporal structure, and brain-inspired models using spike information representation have already been developed for learning [6,11,27] and for explanation of the learned patterns [28]. However, brain-inspired computers are still at an early stage of development [11], and even when they are developed, they may not be able to integrate all possible modalities of data into one brain-inspired model.
This paper offers a solution to the problem of multimodal personalised data integration and inference, with six novel features: (1) it includes all possible modalities of data; (2) it can be implemented on any conventional computer platform; (3) it takes into account the differences across modalities of data by offering different distance measures; (4) it offers a new way of ranking existing multimodal objects in order of similarity to a new multimodal object and uses that for building multiple neighbourhood clusters; (5) it offers a probability-based inference with the use of the different similarity clusters; (6) it explains the inferred results, both in terms of probabilities and visual representation. The original method proposed here is planned to be applied to large-scale multimodal data for biomedical and health applications in the future.
The proposed method is implemented as a computer system and applied to several case studies to illustrate its advantages and applicability. The SAIN method described in Section 4 was implemented as a software system, as noted in the Data Availability Statement.

Author Contributions

N.K. designed the overall framework of SAIN, wrote the initial draft of the paper and took part in the experiments and the paper revision; C.S.C. introduced the mathematical description of the distance measures for different data modalities, took part in the preparation and the revision of the paper; A.H. developed the software implementation of SAIN in Python and ran the experiments, also took part in the paper preparation and its revision; P.G. provided cardio data for experiments and took part in the analysis of results. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data has been obtained from the UCI Heart Disease (Cleveland) repository at https://archive.ics.uci.edu/dataset/45/heart+disease (accessed on 20 February 2024); the EEG data is available from https://github.com/KEDRI-AUT/NeuCube-Py/tree/master/example_data (accessed on 20 February 2024). Access to the software is available upon request.

Acknowledgments

We thank Elena Calude for her contributions to the mathematical model. We also thank the referees for their suggestions, which improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Budhraja, S.; Singh, B.; Doborjeh, M.; Doborjeh, Z.; Tan, S.; Lai, E.; Goh, W.; Kasabov, N. Mosaic LSM: A Liquid State Machine Approach for Multimodal Longitudinal Data Analysis. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, QLD, Australia, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–8. [Google Scholar] [CrossRef]
  2. AbouHassan, I.; Kasabov, N.K.; Jagtap, V.; Kulkarni, P. Spiking neural networks for predictive and explainable modelling of multimodal streaming data with a case study on financial time series and online news. Sci. Rep. 2023, 13, 18367. [Google Scholar] [CrossRef] [PubMed]
  3. Rodrigues, F.; Markou, I.; Pereira, F.C. Combining time-series and textual data for taxi demand prediction in event areas: A deep learning approach. Inf. Fusion 2019, 49, 120–129. [Google Scholar] [CrossRef]
  4. Li, J.; Liu, J.; Zhou, S.; Zhang, Q.; Kasabov, N.K. GeSeNet: A General Semantic-Guided Network With Couple Mask Ensemble for Medical Image Fusion. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 16248–16261. [Google Scholar] [CrossRef]
  5. Doborjeh, Z.; Doborjeh, M.; Sumich, A.; Singh, B.; Merkin, A.; Budhraja, S.; Goh, W.; Lai, E.M.; Williams, M.; Tan, S.; et al. Investigation of social and cognitive predictors in non-transition ultra-high-risk’individuals for psychosis using spiking neural networks. Schizophrenia 2023, 9, 10. [Google Scholar] [CrossRef]
  6. Kasabov, N.K. NeuCube: A spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data. Neural Netw. 2014, 52, 62–76. [Google Scholar] [CrossRef]
  7. Kasabov, N. Global, local and personalised modeling and pattern discovery in bioinformatics: An integrated approach. Pattern Recognit. Lett. 2007, 28, 673–685. [Google Scholar] [CrossRef]
  8. Kasabov, N. Data Analysis and Predictive Systems and Related Methodologies. U.S. Patent 9,002,682 B2, 7 April 2015. [Google Scholar]
  9. Doborjeh, M.; Doborjeh, Z.; Merkin, A.; Bahrami, H.; Sumich, A.; Krishnamurthi, R.; Medvedev, O.N.; Crook-Rumsey, M.; Morgan, C.; Kirk, I.; et al. Personalised predictive modelling with brain-inspired spiking neural networks of longitudinal MRI neuroimaging data and the case study of dementia. Neural Netw. 2021, 144, 522–539. [Google Scholar] [CrossRef]
  10. Kasabov, N.K. Evolving Connectionist Systems, 2nd ed.; Springer: London, UK, 2007. [Google Scholar]
  11. Kasabov, N.K. Time-Space, Spiking Neural Networks and Brain-Inspired Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2019; Volume 750. [Google Scholar]
  12. Santomauro, D.F.; Herrera, A.M.M.; Shadid, J.; Zheng, P.; Ashbaugh, C.; Pigott, D.M.; Abbafati, C.; Adolph, C.; Amlag, J.O.; Aravkin, A.Y.; et al. Global prevalence and burden of depressive and anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic. Lancet 2021, 398, 1700–1712. [Google Scholar]
  13. Swaddiwudhipong, N.; Whiteside, D.J.; Hezemans, F.H.; Street, D.; Rowe, J.B.; Rittman, T. Pre-diagnostic cognitive and functional impairment in multiple sporadic neurodegenerative diseases. bioRxiv 2022. [Google Scholar] [CrossRef]
  14. Kasabov, N.K.; Hou, Z.; Feigin, V.; Chen, Y. Improved Method and System for Predicting Outcomes Based on Spatio/Spectro-Temporal Data. 2015. Available online: https://patents.google.com/patent/WO2015030606A2/en (accessed on 20 December 2015).
  15. Paprotny, D.; Morales-Nápoles, O.; Worm, D.T.; Ragno, E. BANSHEE—A MATLAB toolbox for non-parametric Bayesian networks. SoftwareX 2020, 12, 100588. [Google Scholar] [CrossRef]
  16. Koot, P.; Mendoza-Lugo, M.A.; Paprotny, D.; Morales-Nápoles, O.; Ragno, E.; Worm, D.T. PyBanshee version (1.0): A Python implementation of the MATLAB toolbox BANSHEE for Non-Parametric Bayesian Networks with updated features. SoftwareX 2023, 21, 101279. [Google Scholar] [CrossRef]
  17. Mendoza-Lugo, M.A.; Morales-Nápoles, O. Version 1.3-BANSHEE—A MATLAB toolbox for Non-Parametric Bayesian Networks. SoftwareX 2023, 23, 101479. [Google Scholar] [CrossRef]
  18. Calude, C.; Calude, E. A metrical method for multicriteria decision making. St. Cerc. Mat 1982, 34, 223–234. [Google Scholar]
  19. Calude, C. A simple non-uniform operation. Bull. Eur. Assoc. Theor. Comput. Sci. 1983, 20, 40–46. [Google Scholar]
  20. Akhtarzada, A.; Calude, C.S.; Hosking, J. A Multi-Criteria Metric Algorithm for Recommender Systems. Fundam. Informaticae 2011, 110, 1–11. [Google Scholar] [CrossRef]
  21. Kahramanli, H.; Allahverdi, N. Design of a hybrid system for the diabetes and heart diseases. Expert Syst. Appl. 2008, 35, 82–89. [Google Scholar] [CrossRef]
  22. Gleeson, S.; Liao, Y.W.; Dugo, C.; Cave, A.; Zhou, L.; Ayar, Z.; Christiansen, J.; Scott, T.; Dawson, L.; Gavin, A.; et al. ECG-derived spatial QRS-T angle is associated with ICD implantation, mortality and heart failure admissions in patients with LV systolic dysfunction. PLoS ONE 2017, 12, e0171069. [Google Scholar] [CrossRef]
  23. Song, Q.; Kasabov, N. TWNFI–a transductive neuro-fuzzy inference system with weighted data normalization for personalized modeling. Neural Netw. 2006, 19, 1591–1596. [Google Scholar] [CrossRef]
  24. Song, Q.; Kasabov, N.K. NFI: A neuro-fuzzy inference method for transductive reasoning. IEEE Trans. Fuzzy Syst. 2005, 13, 799–808. [Google Scholar] [CrossRef]
  25. Sengupta, N.; McNabb, C.B.; Kasabov, N.; Russell, B.R. Integrating space, time, and orientation in spiking neural networks: A case study on multimodal brain data modeling. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5249–5263. [Google Scholar] [CrossRef]
  26. Kasabov, N. NeuCube EvoSpike Architecture for Spatio-temporal Modelling and Pattern Recognition of Brain Signals. In Artificial Neural Networks in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2012; pp. 225–243. [Google Scholar] [CrossRef]
  27. Kumarasinghe, K.; Kasabov, N.; Taylor, D. Deep learning and deep knowledge representation in Spiking Neural Networks for Brain-Computer Interfaces. Neural Netw. 2020, 121, 169–185. [Google Scholar] [CrossRef]
  28. Futschik, M.; Kasabov, N. Fuzzy clustering of gene expression data. In Proceedings of the 2002 IEEE World Congress on Computational Intelligence, 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02. Proceedings (Cat. No.02CH37291), Honolulu, HI, USA, 12–17 May 2002; pp. 414–419. [Google Scholar] [CrossRef]
Figure 4. An individual model of survival showing the closest neighbouring samples in the top 3 ranked variables. The green star is the individual; purple denotes class 0 and yellow denotes class 1.
Figure 5. Every time series can be represented as a 3-value vector through a spike encoding method over time [11]. If at a time t the time series is increasing in value, there will be a positive spike (1); if decreasing, a negative spike (−1); and if there is no change, no spike (0) (left figure). Each element in this vector represents the signal change at a time. The original signal can be recovered over time using this vector (right figure) if necessary. The length of the vector is equal to the number of time points measured (reproduced from [11]).
Figure 6. EEG signals from EEG electrodes spatially distributed on the scalp are spatio-temporal signals (left figure). Each time series signal from an electrode is measured every 1 millisecond. The figure on the right shows the measurements of 14 EEG electrodes over 124 milliseconds. Each signal can be encoded into a 124-element vector according to Figure 5, making altogether 14 such vectors to be processed in the SAIN framework (reproduced from [11]).
Figure 7. ECG (electrocardiogram) signals ((a) noisy and (b) filtered) can be encoded into ternary vectors according to the spike-encoding method from Figure 5. Spike encoding is robust to noise, as any noise below a threshold does not cause the generation of a spike (either positive or negative), so the encoder acts as a filter. The vector's length equals the number of measurement time points. The vector data can be further processed in the SAIN framework.
Figure 8. A sample of the classification breakdown (with class 0 in green and class 1 in red) and the confusion matrix.
Table 1. Unlabelled database.
Objects/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
$o_1$ | $a_{1,1}$ | $a_{1,2}$ | ... | $a_{1,j}$ | ... | $a_{1,n}$
⋮
$o_i$ | $a_{i,1}$ | $a_{i,2}$ | ... | $a_{i,j}$ | ... | $a_{i,n}$
⋮
$o_m$ | $a_{m,1}$ | $a_{m,2}$ | ... | $a_{m,j}$ | ... | $a_{m,n}$
w | $w_1$ | $w_2$ | ... | $w_j$ | ... | $w_n$
Table 2. The labelled database.
Objects/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$ | Class Label
$o_1$ | $a_{1,1}$ | $a_{1,2}$ | ... | $a_{1,j}$ | ... | $a_{1,n}$ | $l_1$
⋮
$o_i$ | $a_{i,1}$ | $a_{i,2}$ | ... | $a_{i,j}$ | ... | $a_{i,n}$ | $l_i$
⋮
$o_m$ | $a_{m,1}$ | $a_{m,2}$ | ... | $a_{m,j}$ | ... | $a_{m,n}$ | $l_m$
Table 3. Weights.
Criteria/Weights | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
w | $w_1$ | $w_2$ | ... | $w_j$ | ... | $w_n$
Table 4. A new unlabelled object.
Object/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
x | $x_1$ | $x_2$ | ... | $x_j$ | ... | $x_n$
Table 5. The hypothetical object.
Object/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$ | Class Label
$o_E$ | $n_1$ | $n_2$ | ... | $n_j$ | ... | $n_n$ | $l_h$
Table 8. Coded labelled data.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7 | Class
o1 | 68.2 | 0 | 6,789 | 0 | FF0000 = 111111110000000000000000 | 0122110012 | 110001001 | 1
o2 | 93 | 1 | 98,000 | 1 | FFFF00 = 111111111111111100000000 | 0222200121 | 100001001 | 1
o3 | 44.5 | 1 | 5,600 | 2 | FF0000 = 111111110000000000000000 | 0121210012 | 110101111 | 1
o4 | 56.8 | 0 | 89 | 0 | FFFFFF = 111111111111111111111111 | 1222210012 | 110011101 | 1
o5 | 26.3 | 0 | 9,456 | 2 | 000000 = 000000000000000000000000 | 1222010012 | 110111101 | 2
o6 | 81.5 | 1 | 78,955 | 1 | FF0000 = 111111110000000000000000 | 0121220012 | 110001111 | 2
o7 | 56.7 | 1 | 68,900 | 0 | 000000 = 000000000000000000000000 | 1221210011 | 111001111 | 2
o8 | 20 | 0 | 7,833 | 2 | FFFF00 = 111111111111111100000000 | 1122110221 | 100001111 | 2
o9 | 20 | 0 | 7,833 | ∗ | FFFF00 = 111111111111111100000000 | 1122110221 | 100001111 | 2
(Colours in c5 are coded as 24-bit RGB values; in c6, the spike value −1 is coded as 2.)
Table 9. The new unlabelled object, coded.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7
x | 48.5 | 1 | 45,679 | 2 | FF0000 = 111111110000000000000000 | 1002121001 | 110001101
Table 10. Coded labelled normalised data.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7
o1 | 0.682 | 0 | 0.06789 | 0 | 0.2 | 0.0122110012 | 0.110001001
o2 | 0.93 | 1 | 0.98 | 0.5 | 0.6 | 0.0222200121 | 0.100001001
o3 | 0.445 | 1 | 0.056 | 1 | 0.2 | 0.0121210012 | 0.110101111
o4 | 0.568 | 0 | 0.00089 | 0 | 1 | 0.1222210012 | 0.110011101
o5 | 0.263 | 0 | 0.09456 | 1 | 0 | 0.1222010012 | 0.110111101
o6 | 0.815 | 1 | 0.78955 | 0.5 | 0.2 | 0.0121220012 | 0.110001111
o7 | 0.567 | 1 | 0.689 | 0 | 0 | 0.1221210011 | 0.111001111
o8 | 0.2 | 0 | 0.07833 | 1 | 0.6 | 0.1122110221 | 0.100001111
o9 | 0.2 | 0 | 0.07833 | ∗ | 0.6 | 0.1122110221 | 0.100001111
Table 11. The new unlabelled object, coded and normalised.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7
x | 0.485 | 1 | 0.45679 | 1 | 0.2 | 0.1002121001 | 0.110001101
Table 12. Normalised distances from the new object to all objects.
Distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | Total
$d(o_1, x)$ | 0.197 | 1 | 0.3889 | 1 | 0 | 0.4 | 0.11111111 | 3.09701111
$d(o_2, x)$ | 0.445 | 0 | 0.52321 | 0.5 | 0.33333333 | 0.6 | 0.22222222 | 2.62376556
$d(o_3, x)$ | 0.04 | 0 | 0.40079 | 0 | 0 | 0.5 | 0.22222222 | 1.16301222
$d(o_4, x)$ | 0.083 | 1 | 0.4559 | 1 | 0.66666667 | 0.45 | 0.11111111 | 3.76667778
$d(o_5, x)$ | 0.222 | 1 | 0.36223 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.58978556
$d(o_6, x)$ | 0.33 | 0 | 0.33276 | 0.5 | 0 | 0.45 | 0.11111111 | 1.72387111
$d(o_7, x)$ | 0.082 | 0 | 0.23221 | 1 | 0.33333333 | 0.45 | 0.22222222 | 2.31976556
$d(o_8, x)$ | 0.285 | 1 | 0.37846 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.66901556
$d(o_9, x)$ | 0.285 | 1 | 0.37846 | 1 | 0.33333333 | 0.45 | 0.22222222 | 3.66901556
Table 13. Ranking of the distances in Table 12 in increasing order.
Distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | Total
$d(o_3, x)$ | 0.04 | 0 | 0.40079 | 0 | 0 | 0.5 | 0.22222222 | 1.16301222
$d(o_6, x)$ | 0.33 | 0 | 0.33276 | 0.5 | 0 | 0.45 | 0.11111111 | 1.72387111
$d(o_7, x)$ | 0.082 | 0 | 0.23221 | 1 | 0.33333333 | 0.45 | 0.22222222 | 2.31976556
$d(o_5, x)$ | 0.222 | 1 | 0.36223 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.58978556
$d(o_2, x)$ | 0.445 | 0 | 0.52321 | 0.5 | 0.33333333 | 0.6 | 0.22222222 | 2.62376556
$d(o_8, x)$ | 0.285 | 1 | 0.37846 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.66901556
$d(o_1, x)$ | 0.197 | 1 | 0.3889 | 1 | 0 | 0.4 | 0.11111111 | 3.09701111
$d(o_9, x)$ | 0.285 | 1 | 0.37846 | 1 | 0.33333333 | 0.45 | 0.22222222 | 3.66901556
$d(o_4, x)$ | 0.083 | 1 | 0.4559 | 1 | 0.66666667 | 0.45 | 0.11111111 | 3.76667778
Table 14. An exemplar object.
Object | c1 | c2 | c3 | c4 | c5 | c6 | c7
$o_E$ | 0.2 | 0 | 0.00089 | 0 | 1 | 0.1222210012 | 0.100001001
Table 15. Vectors $V_0, \dots, V_7$, rounded to three decimals.
Object | $V_0$ | $V_1$ | $V_2$ | $V_3$ | $V_4$ | $V_5$ | $V_6$ | $V_7$
o1 | 1.469 | 0.987 | 1.469 | 1.402 | 1.469 | 0.669 | 1.359 | 1.459
o2 | 3.709 | 2.979 | 2.709 | 2.730 | 3.209 | 3.309 | 3.609 | 3.709
o3 | 3.220 | 2.975 | 2.220 | 3.165 | 2.220 | 2.420 | 3.110 | 3.210
o4 | 0.378 | 0.010 | 0.378 | 0.378 | 0.378 | 0.378 | 0.378 | 0.368
o5 | 2.167 | 2.104 | 2.167 | 2.073 | 1.167 | 1.167 | 2.167 | 2.157
o6 | 3.824 | 3.209 | 2.824 | 3.035 | 3.324 | 3.024 | 3.714 | 3.814
o7 | 3.066 | 2.699 | 2.066 | 2.378 | 3.066 | 2.066 | 3.066 | 3.055
o8 | 1.487 | 1.487 | 1.487 | 1.410 | 0.487 | 1.087 | 1.477 | 1.487
o9 | 1.487 | 1.487 | 1.487 | 1.410 | 0.487 | 1.087 | 1.477 | 1.487
Table 16. Distances $dist(V_0, V_t)$ and (normalised) weights.
 | c1 | c2 | c3 | c4 | c5 | c6 | c7
Distances | 2.870 | 4.00 | 2.826 | 5.00 | 5.60 | 0.450 | 0.061
Weights | 0.137 | 0.192 | 0.135 | 0.240 | 0.269 | 0.021 | 0.002
Table 17. Survival database.
Patients/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$ | Units of Time
$p_1$ | $a_{1,1}$ | $a_{1,2}$ | ... | $a_{1,j}$ | ... | $a_{1,n}$ | $t_1$
⋮
$p_i$ | $a_{i,1}$ | $a_{i,2}$ | ... | $a_{i,j}$ | ... | $a_{i,n}$ | $t_i$
⋮
$p_m$ | $a_{m,1}$ | $a_{m,2}$ | ... | $a_{m,j}$ | ... | $a_{m,n}$ | $t_m$
Table 18. The new patient record.
Patient/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
p | $x_1$ | $x_2$ | ... | $x_j$ | ... | $x_n$
Table 19. Patient records.
Patients | c1 | c2 | c3 | c4 | c5 | c6 | c7 | Units of Time
p1 | 0.682 | 0 | 0.06789 | 0 | 0.2 | 0.0122110012 | 0.110001001 | 12.3
p2 | 0.93 | 1 | 0.98 | 0.5 | 0.6 | 0.0222200121 | 0.100001001 | 15
p3 | 0.445 | 1 | 0.056 | 1 | 0.2 | 0.0121210012 | 0.110101111 | 68
p4 | 0.568 | 0 | 0.00089 | 0 | 1 | 0.1222210012 | 0.110011101 | 1.4
p5 | 0.263 | 0 | 0.09456 | 1 | 0 | 0.1222010012 | 0.110111101 | 40.5
p6 | 0.815 | 1 | 0.78955 | 0.5 | 0.2 | 0.0121220012 | 0.110001111 | 97.2
p7 | 0.567 | 1 | 0.689 | 0 | 0 | 0.1221210011 | 0.111001111 | 97.2
p8 | 0.2 | 0 | 0.07833 | 1 | 0.6 | 0.1122110221 | 0.100001111 | 55.7
p9 | 0.2 | 0 | 0.07833 | ∗ | 0.6 | 0.1122110221 | 0.100001111 | 63.7
(The missing value of c4 for p9 is marked ∗.)
Table 20. The new patient record.
Patient | c1 | c2 | c3 | c4 | c5 | c6 | c7
p | 0.485 | 1 | 0.45679 | 1 | 0.2 | 0.1002121001 | 0.110001101
Table 21. Distances between all patients and the new patient.
Distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | Total d
$d(p_1, p)$ | 0.197 | 1 | 0.3889 | 1 | 0 | 0.0880010989 | 0.0000001 | 2.6739011989
$d(p_2, p)$ | 0.445 | 0 | 0.52321 | 0.5 | 0.4 | 0.0779920880 | 0.0100001 | 1.9562021880
$d(p_3, p)$ | 0.04 | 0 | 0.40079 | 0 | 0 | 0.0880910989 | 0.00010001 | 0.5289811089
$d(p_4, p)$ | 0.083 | 1 | 0.4559 | 1 | 0.8 | 0.0220089011 | 0.00001 | 3.3609189011
$d(p_5, p)$ | 0.222 | 1 | 0.36223 | 0 | 0.2 | 0.0219889011 | 0.00011 | 1.8063289011
$d(p_6, p)$ | 0.33 | 0 | 0.33276 | 0.5 | 0 | 0.0880900989 | 0.00000001 | 1.2508501089
$d(p_7, p)$ | 0.082 | 0 | 0.23221 | 1 | 0.2 | 0.0219089010 | 0.00100001 | 1.5371189110
$d(p_8, p)$ | 0.285 | 1 | 0.37846 | 0 | 0.4 | 0.0119989220 | 0.00999999 | 2.0854589120
$d(p_9, p)$ | 0.285 | 1 | 0.37846 | 1 | 0.4 | 0.0119989220 | 0.00999999 | 3.0854589120
Table 22. The 14 variables used in the heart disease diagnosis case.
Name | Data Type | Definition
age | integer | age in years
sex | binary | sex
cp | {1, 2, 3, 4} | chest pain type
trestbps | integer | resting blood pressure
chol | integer | serum cholesterol in mg/dL
fbs | binary | fasting blood sugar > 120 mg/dL
restecg | {0, 1, 2} | resting electrocardiographic results
thalach | integer | maximum heart rate achieved
exang | binary | exercise-induced angina
oldpeak | float | ST depression induced by exercise relative to rest
slope | {1, 2, 3} | the slope of the peak exercise ST segment
ca | {0, 1, 2, 3} | number of major vessels coloured by fluoroscopy
thal | {3, 6, 7} | heart status
num | {0, 1, 2, 3, 4} | diagnosis of heart disease
Table 23. Database of EEG records.
Record | Channel 1 | Channel 2 | Channel 3 | Label
R1 | (1, 1, −1, 0, 1) | (0, 1, 1, 1, −1) | (1, 1, −1, −1, 0) | 1
R2 | (1, 0, −1, 0, 1) | (0, 1, 1, 1, −1) | (1, 0, −1, −1, 1) | 1
R3 | (1, 1, −1, 0, 1) | (0, −1, 1, 1, −1) | (1, 1, −1, 0, 1) | 2
R4 | (1, 1, −1, 0, 1) | (0, −1, 1, 0, −1) | (1, 1, −1, 0, 1) | 2
R5 | (1, 1, −1, 0, 0) | (0, −1, 0, 1, −1) | (1, 1, −1, 1, 1) | 3
R6 | (1, −1, −1, 0, 1) | (0, −1, 1, 0, −1) | (1, 1, −1, 0, 1) | 3
Table 24. Comparison between SAIN and other existing methods for personalised modelling.
Source | Number of Modalities | Number of Metrics | Types of Data Sets | Type of Integration | Explainability | Machine Learning Method
[1] | 3 | 1 | Longitudinal time series data | Late | No | Liquid State Machine
[2] | 2 | 1 | Time series; on-line text | Early | Reveals the impact of news on time series | SNN
[3] | 2 | 1 | Time series; on-line text | Late | No | Deep NN
[4] | 3 | 1 | Brain images (pixel values) | Early | Moderate | Deep NN
[5] | 2 | 1 | Social and cognitive data as numbers | Early | Feature interaction network | SNN
[6] | 2 | 1 | Time and space | Early | Feature interaction network | SNN
[9,14] | 3 | 1 | Personalised vector data of numbers; time and space | Preliminary selection | Reveals feature interaction over time | SNN
[23,24] | 1 | 1 | Numerical vector-based data | No integration | Extracted fuzzy rules | Fuzzy neural networks
[25] | 3 | 1 | Time, space and direction: fMRI + DTI data | Early | Feature interaction network | SNN
This paper (SAIN) | Multiple, practically unlimited | Multiple, multicriteria metrics | Multiple, practically unlimited | Early | Visualisation and interpretation of the personalised model | Statistical: search and inference using wwkNN
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
