1. Introduction
The enormous growth of data in almost every area of life has generated great demand for analytical techniques that can turn data into useful knowledge. Efforts to gain knowledge from these data are often referred to as data mining [1,2]. As an important research topic in data mining and artificial intelligence, clustering is an unsupervised pattern recognition method that works without prior information (the dataset has no column indicating which observation belongs to a particular class or group). It aims to find potential similarities and groupings so that the distance between two points in the same cluster is as small as possible and the distance between points in different clusters is as large as possible [3].
One of the data analysis techniques often used in data mining is clustering analysis [4,5,6]. The data used in clustering take the form of a two-dimensional (2D) data matrix consisting of rows and columns; rows usually represent observations and columns represent data attributes. Clustering analysis aims to divide a set of observations into several groups or clusters. Observations in the same cluster are similar to one another and dissimilar to observations in other clusters [7].
Clustering analysis can only group objects along one data dimension (observations or attributes) at a time, so the resulting data sub-matrix is a subset of observations that contains all attributes [8,9]. However, some research problems require a data sub-matrix formed by a subset of observations that does not have to contain all attributes. This requires clustering the observations and attributes simultaneously. The data analysis technique that can do this is biclustering analysis. Biclustering analysis groups the two dimensions of the data simultaneously, so that further knowledge can be obtained about the problem described by the data [10,11].
As data problems become increasingly complex, the data may be three-dimensional (3D) rather than two-dimensional, and data analysis techniques have been developed that can cluster three data dimensions simultaneously. This analysis is known as triclustering analysis. For example, if the 3D data consist of observation, attribute, and context dimensions, triclustering analysis yields a tricluster, which is a sub-space of the 3D data. The resulting tricluster is a subset of observations, a subset of attributes, and a subset of contexts from the 3D data [8,9].
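As a minimal sketch of what a tricluster is, the following Python snippet extracts a sub-space from a small 3D dataset stored as nested lists. The helper name `extract_tricluster` and the toy values are illustrative assumptions and are not taken from the original program.

```python
def extract_tricluster(data, observations, attributes, contexts):
    """Return data[i][j][k] restricted to the given index subsets."""
    return [
        [[data[i][j][k] for k in contexts] for j in attributes]
        for i in observations
    ]


# Toy 3D data: 3 observations x 2 attributes x 2 contexts (made-up values).
data = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]

# A tricluster built from observations {0, 2}, attribute {1}, contexts {0, 1}.
sub = extract_tricluster(data, observations=[0, 2], attributes=[1], contexts=[0, 1])
print(sub)  # [[[3, 4]], [[11, 12]]]
```

Unlike a clustering result, the extracted sub-space does not have to keep every attribute or every context.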
Triclustering analysis is often used to analyze microarray gene expression data. Microarray is a technology that can measure the expression levels of thousands of genes under several conditions simultaneously. Microarray technology can also observe gene expression at specific time points, resulting in 3D gene expression data [12]. 3D microarray gene expression data usually consist of gene, condition, and observation time point dimensions. The purpose of triclustering analysis on microarray gene expression data is to find groups of genes that have similar expression across a subset of conditions as well as a subset of observation time points [12].
The triclustering method for analyzing 3D gene expression data was first proposed by [13] under the name Tricluster. Since then, many researchers have proposed new methods for analyzing 3D data. In 2007, Ref. [14] proposed the Extended Dimension Iterative Signature Algorithm (EDISA), which can be used when the data are unbalanced, that is, when the number of observation time points is not the same in each experimental condition. Ref. [15] proposed the OPTricluster algorithm, which preserves the time sequence in gene expression data. OPTricluster is effective for analyzing short time series data (3–5 time points) with 2–5 experimental conditions [15].
One of the triclustering methods used to analyze microarray gene expression data is the δ-Trimax method. The δ-Trimax method can produce a sub-space or tricluster that has a mean square residual smaller than δ [12]. The δ-Trimax method can be used for gene expression data that has many conditions and time points (long time series) in an experiment. In this paper, a δ-Trimax program is created using the Python programming language, with improvements in the calculation of multiple node deletion and node addition. The program also adds an evaluation computed for each tricluster found. This is the novelty relative to the program previously made by [16]. The program can be accessed at https://github.com/novalsaputra/Delta-Trimax (accessed on 18 January 2021).
The δ-Trimax method was implemented on microarray gene expression data from the differentiation process of human induced pluripotent stem cells (HiPSC) in patients with heart disease and HIV-1. These data were obtained from the web page “www.ncbi.nlm.nih.gov”. The expected result of this study is the discovery of gene groups that have similar expression across a subset of conditions and a subset of observation time points. The resulting gene groups can be used as a guide by medical practitioners and biologists to carry out further action on patients.
This paper mainly discusses the triclustering method applied to gene expression data in microarray format. We see a match between our topic and the mathematics and computer science scope of the journal Symmetry, especially in transformation (matrix transformation), pattern recognition (unsupervised learning), diversity, and similarity. Therefore, we chose Symmetry as the venue for our research work. We also cite one study published in Symmetry to deepen our knowledge of recent clustering research (although it does not necessarily use the same dataset or the same data dimension as this paper).
3. Methodology
The method used to perform triclustering in this study is the δ-Trimax method. The 3D data in this method are focused on the problem of microarray gene expression, where the data consist of the dimensions of genes (observations), conditions (attributes), and time (context). The δ-Trimax method is a greedy approach that uses an iterative search scheme, in which objects are gradually added to and removed from the candidate tricluster sub-space until certain criteria are met. This method aims to find a tricluster that has a mean square residual (S) smaller than δ, where δ is a threshold chosen by the researcher. The δ-Trimax method is composed of several algorithms, namely the multiple node deletion algorithm, the single node deletion algorithm, and the node addition algorithm. As the names imply, these algorithms iteratively delete and add nodes, resulting in a tricluster that has a mean square residual smaller than δ. The tricluster that has been found is then masked so that other triclusters can be found in the 3D data. The workflow of the δ-Trimax method in this study is shown in Figure 1.
3.1. Multiple Node Deletion Algorithm
The multiple node deletion algorithm deletes nodes that are thought to increase the mean square residual S. The number of nodes removed in each iteration is greater than or equal to one. This algorithm uses a value λ as the threshold that controls the number of deletions performed. The value of λ can be adjusted experimentally to optimize the speed of the algorithm. When the number of genes, conditions, or times in the data is less than 50, this algorithm is not executed on the data [12]. This avoids deleting all nodes in small 3D data. The purpose of this algorithm is to obtain a data sub-space whose S value is not too far from the predetermined δ, which saves computation in the next step.
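The steps of the multiple node deletion algorithm on the 3D data matrix are as follows, where $r_{ijk}$ is the residual of element $(i, j, k)$ from Equation (10) and $I$, $J$, and $K$ are the current sets of genes, conditions, and times:
Step 1. When $S > \delta$, proceed to step 2; otherwise the process is not continued and the current sub-space is returned as the final result of this algorithm.
Step 2. Delete the $i$th gene if it satisfies the following inequality:
$$\frac{1}{|J||K|} \sum_{j \in J,\, k \in K} r_{ijk}^{2} > \lambda S.$$
Step 3. Recalculate the mean values and $S$.
Step 4. Delete the $j$th condition if it satisfies the following inequality:
$$\frac{1}{|I||K|} \sum_{i \in I,\, k \in K} r_{ijk}^{2} > \lambda S.$$
Step 5. Recalculate the mean values and $S$.
Step 6. Delete the $k$th time if it satisfies the following inequality:
$$\frac{1}{|I||J|} \sum_{i \in I,\, j \in J} r_{ijk}^{2} > \lambda S.$$
Step 7. Recalculate the mean values and $S$.
Step 8. Repeat step 2 to step 7. If no genes, conditions, or times are deleted, the iteration stops.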
The complexity of this algorithm is O(max(m, n, p)), where m, n, and p are the numbers of genes, conditions, and time points [12].
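The multiple node deletion loop can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' program: the residual is assumed to be $r_{ijk} = a_{ijk} - a_{iJK} - a_{IjK} - a_{IJk} + 2a_{IJK}$ as in the δ-Trimax literature, a node is dropped when its mean squared residual exceeds `lam * S` with `lam` assumed to be at least 1, and the names `residual_stats`, `multiple_node_deletion`, `lam`, and `min_size` as well as the toy data are hypothetical.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def residual_stats(A, I, J, K):
    """Return (S, squared residuals keyed by (i, j, k)) for sub-space I x J x K.

    Assumed residual: r_ijk = a_ijk - a_iJK - a_IjK - a_IJk + 2 * a_IJK.
    """
    a_iJK = {i: mean(A[i][j][k] for j in J for k in K) for i in I}
    a_IjK = {j: mean(A[i][j][k] for i in I for k in K) for j in J}
    a_IJk = {k: mean(A[i][j][k] for i in I for j in J) for k in K}
    a_IJK = mean(A[i][j][k] for i in I for j in J for k in K)
    r2 = {
        (i, j, k): (A[i][j][k] - a_iJK[i] - a_IjK[j] - a_IJk[k] + 2 * a_IJK) ** 2
        for i in I for j in J for k in K
    }
    return mean(r2.values()), r2


def multiple_node_deletion(A, I, J, K, delta, lam, min_size=50):
    """Drop every node whose score exceeds lam * S until S <= delta or nothing changes."""
    I, J, K = list(I), list(J), list(K)
    # The text skips this algorithm on small data (fewer than 50 nodes).
    if min(len(I), len(J), len(K)) < min_size:
        return I, J, K
    S, r2 = residual_stats(A, I, J, K)
    while S > delta:
        before = (len(I), len(J), len(K))
        I = [i for i in I if mean(r2[i, j, k] for j in J for k in K) <= lam * S]
        S, r2 = residual_stats(A, I, J, K)
        J = [j for j in J if mean(r2[i, j, k] for i in I for k in K) <= lam * S]
        S, r2 = residual_stats(A, I, J, K)
        K = [k for k in K if mean(r2[i, j, k] for i in I for j in J) <= lam * S]
        S, r2 = residual_stats(A, I, J, K)
        if (len(I), len(J), len(K)) == before:
            break  # no node was deleted in this pass
    return I, J, K


# Toy data (made up): genes 0 and 1 follow an additive pattern, gene 2 does not.
A = [
    [[0, 4], [2, 6]],
    [[1, 5], [3, 7]],
    [[0, 9], [9, 0]],
]
# min_size=0 only so that the tiny example is processed at all.
kept = multiple_node_deletion(A, [0, 1, 2], [0, 1], [0, 1], delta=0.5, lam=1.2, min_size=0)
print(kept)
```

In this toy run only gene 2 has a score above `lam * S`, so it is the only node removed; with the default `min_size=50` the same call returns the input index sets unchanged.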
3.2. Single Node Deletion Algorithm
The single node deletion algorithm deletes nodes iteratively until the S of the data sub-space is smaller than or equal to the threshold δ. It deletes only one node in each iteration, so the computation time at this step is quite long. Therefore, the multiple node deletion algorithm is run first, before the single node deletion algorithm.
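The steps of the single node deletion algorithm are as follows, where $r_{ijk}$ is the residual of element $(i, j, k)$ from Equation (10):
Step 1. Detect the gene, condition, and time that have the highest residual scores, computed in the following way:
The residual score for the $i$th gene:
$$\frac{1}{|J||K|} \sum_{j \in J,\, k \in K} r_{ijk}^{2}$$
The residual score for the $j$th condition:
$$\frac{1}{|I||K|} \sum_{i \in I,\, k \in K} r_{ijk}^{2}$$
The residual score for the $k$th time:
$$\frac{1}{|I||J|} \sum_{i \in I,\, j \in J} r_{ijk}^{2}$$
Step 2. Delete the gene, condition, or time that has the highest score.
Step 3. Recalculate the mean values and $S$.
Step 4. Repeat step 1 to step 3. If $S \le \delta$, the iteration stops.
The final result of the single node deletion algorithm is a sub-space with $S \le \delta$ whose gene, condition, and time sets are subsets of those of the input data.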
The complexity of this algorithm is O(log m + log n + log p), where m, n, and p are the numbers of genes, conditions, and time points [12].
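A minimal Python sketch of single node deletion is given below. It assumes the residual $r_{ijk} = a_{ijk} - a_{iJK} - a_{IjK} - a_{IJk} + 2a_{IJK}$ from the δ-Trimax literature; the function names, the guard that never empties a dimension, and the toy data are assumptions made for illustration only.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def residual_stats(A, I, J, K):
    """Return (S, squared residuals keyed by (i, j, k)) for sub-space I x J x K."""
    a_iJK = {i: mean(A[i][j][k] for j in J for k in K) for i in I}
    a_IjK = {j: mean(A[i][j][k] for i in I for k in K) for j in J}
    a_IJk = {k: mean(A[i][j][k] for i in I for j in J) for k in K}
    a_IJK = mean(A[i][j][k] for i in I for j in J for k in K)
    r2 = {
        (i, j, k): (A[i][j][k] - a_iJK[i] - a_IjK[j] - a_IJk[k] + 2 * a_IJK) ** 2
        for i in I for j in J for k in K
    }
    return mean(r2.values()), r2


def single_node_deletion(A, I, J, K, delta):
    """Remove the single highest-scoring node per iteration until S <= delta."""
    I, J, K = list(I), list(J), list(K)
    S, r2 = residual_stats(A, I, J, K)
    while S > delta:
        candidates = []  # (score, dimension name, node index)
        if len(I) > 1:  # assumed guard: never delete the last node of a dimension
            candidates += [(mean(r2[i, j, k] for j in J for k in K), "gene", i) for i in I]
        if len(J) > 1:
            candidates += [(mean(r2[i, j, k] for i in I for k in K), "condition", j) for j in J]
        if len(K) > 1:
            candidates += [(mean(r2[i, j, k] for i in I for j in J), "time", k) for k in K]
        if not candidates:
            break
        _, dimension, node = max(candidates, key=lambda c: c[0])
        {"gene": I, "condition": J, "time": K}[dimension].remove(node)
        S, r2 = residual_stats(A, I, J, K)
    return I, J, K, S


# Toy data (made up): genes 0 and 1 follow an additive pattern, gene 2 does not.
A = [
    [[0, 4], [2, 6]],
    [[1, 5], [3, 7]],
    [[0, 9], [9, 0]],
]
I, J, K, S = single_node_deletion(A, [0, 1, 2], [0, 1], [0, 1], delta=0.5)
print(I, J, K, S)
```

In this toy run gene 2 has the highest score, so it is removed in the first iteration, after which S falls below the threshold and the loop stops.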
3.3. Node Addition Algorithm
The δ-Trimax method aims to find the tricluster that has maximum volume and a mean square residual (S) smaller than the threshold δ. The single node deletion algorithm yields a tricluster with $S \le \delta$. However, this tricluster may not be the one with the maximum volume that can be extracted from the 3D dataset, so it is necessary to recheck the previously deleted nodes. Checking is done by adding nodes that are not members of the tricluster (nodes that were deleted previously) on the condition that the value of S remains smaller than δ. The steps of the node addition algorithm are as follows, where $r_{ijk}$ is the residual of element $(i, j, k)$ from Equation (10):
Step 1. Add genes $i \notin I$ that satisfy
$$\frac{1}{|J||K|} \sum_{j \in J,\, k \in K} r_{ijk}^{2} \le S.$$
Step 2. Recalculate the mean values and $S$.
Step 3. Add conditions $j \notin J$ that satisfy
$$\frac{1}{|I||K|} \sum_{i \in I,\, k \in K} r_{ijk}^{2} \le S.$$
Step 4. Recalculate the mean values and $S$.
Step 5. Add times $k \notin K$ that satisfy
$$\frac{1}{|I||J|} \sum_{i \in I,\, j \in J} r_{ijk}^{2} \le S.$$
Step 6. Recalculate the mean values and $S$.
Step 7. Repeat step 1 to step 6. If no more genes, conditions, or times are added, the iteration stops.
The final result of the node addition algorithm is a sub-space whose gene, condition, and time sets contain those of the tricluster obtained from single node deletion. The data sub-space generated by this algorithm is a tricluster with $S \le \delta$ and maximum volume. The complexity of this algorithm is O(mnp), as each step iterates (m + n + p) times [12].
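Node addition can be sketched in Python as shown below. The acceptance rule used here (a candidate node is added when its mean squared residual, computed against the current tricluster means, does not exceed the current S) is the rule from the δ-Trimax literature and is an assumption with respect to this text; the names `residual_fn` and `node_addition` and the toy data are hypothetical.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def residual_fn(A, I, J, K):
    """Return (r2, S): a squared-residual function valid for any node, and the MSR.

    Means along each dimension are taken over the current tricluster, so r2 can
    also score genes, conditions, or times that are not yet members.
    """
    n_i, n_j, n_k = len(A), len(A[0]), len(A[0][0])
    a_i = {i: mean(A[i][j][k] for j in J for k in K) for i in range(n_i)}
    a_j = {j: mean(A[i][j][k] for i in I for k in K) for j in range(n_j)}
    a_k = {k: mean(A[i][j][k] for i in I for j in J) for k in range(n_k)}
    a = mean(A[i][j][k] for i in I for j in J for k in K)

    def r2(i, j, k):
        return (A[i][j][k] - a_i[i] - a_j[j] - a_k[k] + 2 * a) ** 2

    S = mean(r2(i, j, k) for i in I for j in J for k in K)
    return r2, S


def node_addition(A, I, J, K):
    """Add non-member nodes whose score is at most S until nothing more is added."""
    I, J, K = list(I), list(J), list(K)
    n_i, n_j, n_k = len(A), len(A[0]), len(A[0][0])
    while True:
        size_before = len(I) + len(J) + len(K)

        r2, S = residual_fn(A, I, J, K)
        I += [i for i in range(n_i) if i not in I
              and mean(r2(i, j, k) for j in J for k in K) <= S]

        r2, S = residual_fn(A, I, J, K)
        J += [j for j in range(n_j) if j not in J
              and mean(r2(i, j, k) for i in I for k in K) <= S]

        r2, S = residual_fn(A, I, J, K)
        K += [k for k in range(n_k) if k not in K
              and mean(r2(i, j, k) for i in I for j in J) <= S]

        if len(I) + len(J) + len(K) == size_before:
            return sorted(I), sorted(J), sorted(K)


# Toy data (made up): genes 0, 1, and 3 share an additive pattern, gene 2 does not.
A = [
    [[0, 4], [2, 6]],
    [[1, 5], [3, 7]],
    [[0, 9], [9, 0]],
    [[3, 7], [5, 9]],
]
result = node_addition(A, I=[0, 1], J=[0, 1], K=[0, 1])
print(result)
```

Starting from the tricluster formed by genes 0 and 1, the sketch re-admits gene 3, which fits the pattern, and keeps gene 2 out.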
3.4. Algorithm Simulation
This section simulates tricluster discovery with the proposed δ-Trimax algorithm on a small sample of data, given in Table 1. The steps in this simulation follow Figure 1. The dataset in Table 1 consists of five genes, three conditions, and four time steps. Before employing the δ-Trimax algorithm, threshold values for both δ and λ
are required. In this simulation, we initialize
and
, when using real data,
and
can be obtained in discussion 4.2. According to
Table 1, the numbers of genes, conditions, and time steps are all less than 50, so multiple node deletion is not needed and single node deletion is applied directly. This step begins by calculating the mean values and S using Equations (2)–(4) and (11), respectively.
3.4.1. Precomputing
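Assume that I is the set of genes, J is the set of conditions, and K is the set of time steps. According to Table 1, |I| = 5, |J| = 3, and |K| = 4. The calculations of the mean value for each gene are given as follows:
The calculations of the mean value for each condition are given as follows:
The calculations of the mean value for each time step are given as follows:
The calculation of the overall mean value is given as follows: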
Using the values obtained above, we can obtain the residual value of each element in the dataset using Equation (10). For example, for element
(
, and
):
By computing the residual value of each element, we obtain the residuals given in Table 2.
According to Table 2, we calculate S as follows:
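The precomputing step can also be reproduced programmatically. The sketch below uses a hypothetical 5 × 3 × 4 array (the values of Table 1 are not used here) and assumes the residual definition $r_{ijk} = a_{ijk} - a_{iJK} - a_{IjK} - a_{IJk} + 2a_{IJK}$ from the δ-Trimax literature; the function name `precompute` is an illustrative choice.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def precompute(A):
    """Return gene, condition, time, and overall means, the residuals, and S."""
    I = range(len(A))
    J = range(len(A[0]))
    K = range(len(A[0][0]))
    gene_means = [mean(A[i][j][k] for j in J for k in K) for i in I]
    cond_means = [mean(A[i][j][k] for i in I for k in K) for j in J]
    time_means = [mean(A[i][j][k] for i in I for j in J) for k in K]
    overall = mean(A[i][j][k] for i in I for j in J for k in K)
    residuals = [
        [
            [A[i][j][k] - gene_means[i] - cond_means[j] - time_means[k] + 2 * overall
             for k in K]
            for j in J
        ]
        for i in I
    ]
    # Mean square residual: average of the squared residuals over all elements.
    S = mean(r ** 2 for gene in residuals for row in gene for r in row)
    return gene_means, cond_means, time_means, overall, residuals, S


# Hypothetical data: 5 genes x 3 conditions x 4 time steps with an additive
# pattern, plus one perturbed element so that the residuals are not all zero.
A = [[[float(i + 2 * j + 3 * k) for k in range(4)] for j in range(3)] for i in range(5)]
A[0][0][0] += 6.0

gene_means, cond_means, time_means, overall, residuals, S = precompute(A)
print(gene_means[0], cond_means[0], time_means[0], overall)
print(residuals[0][0][0], S)
```

A perfectly additive array would give S = 0; the single perturbed element is what makes S positive in this example.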
3.4.2. Single Node Deletion
Single node deletion is carried out through the following four iterations:
3.4.3. Node Addition Algorithm
We add genes, conditions, and time steps while keeping the S value smaller than the δ value. The data used are given in
Table 11 where
and
. The mean values for each gene, condition, and time step and the value of S for the data in Table 11 were obtained previously, so there is no need to recompute them.
3.4.4. Masking
Following node deletion and node addition, one tricluster has been obtained from Table 1. The resulting tricluster is the data shown in Table 11. To find another tricluster, the data elements of Table 1 that belong to the tricluster are replaced with random numbers. Since the tricluster obtained previously is a data sub-space with
, and
, the data elements that are exchanged are elements that are members
.
After the masking process has been carried out, the next step is to repeat the δ-Trimax method using the 3D data in Table 11, so that a new tricluster can be found. The tricluster search stops when no tricluster with a mean square residual smaller than δ can be found in the data; this condition ends the search iteration of the δ-Trimax method. The numbers in bold in Table 16 are the data that have been masked.
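The masking step can be sketched as follows. The text only states that tricluster members are replaced with random numbers; drawing them uniformly from a caller-supplied range `[low, high]` and the names `mask_tricluster` and `seed` are assumptions of this illustration.

```python
import random


def mask_tricluster(A, I, J, K, low, high, seed=None):
    """Return a copy of A in which every element of I x J x K is a random number."""
    rng = random.Random(seed)
    # Copy the three nesting levels so the original data stay untouched.
    masked = [[list(row) for row in gene] for gene in A]
    for i in I:
        for j in J:
            for k in K:
                masked[i][j][k] = rng.uniform(low, high)
    return masked


# Toy data (made up): 3 genes x 2 conditions x 2 time steps.
A = [
    [[0, 4], [2, 6]],
    [[1, 5], [3, 7]],
    [[0, 9], [9, 0]],
]
# Mask the tricluster formed by genes {0, 1}, conditions {0, 1}, and time {0}.
masked = mask_tricluster(A, I=[0, 1], J=[0, 1], K=[0], low=0.0, high=10.0, seed=1)
print(masked)
```

Elements outside the tricluster keep their original values, so the next search sees the same data except for the region that has already been reported.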
6. Conclusions
The δ-Trimax method forms triclusters from 3D data with large dimensions. Our proposed method can be used when the data have many time points (long time series). The δ-Trimax method can produce a tricluster whose mean square residual is smaller than the threshold δ, where the mean square residual is an indicator of the homogeneity of a tricluster. The authors have created a new program for analyzing 3D data using the δ-Trimax method. The novelty of the program lies in improvements to the calculation of multiple node deletion and node addition and in an evaluation algorithm for each generated tricluster.
The δ-Trimax method was implemented on gene expression data from the HiPSC differentiation process in patients with heart disease and HIV-1. From the results of the implementation, the following conclusions were obtained:
From several simulations using different and , the best simulation is obtained when using and for HiPSC, and for HIV-1.
The five best triclusters for the HiPSC data were selected based on the smallest TQI. The gene expression group within these five triclusters is thought to be a feature of heart disease. Therefore, this gene group can be used by medical experts in providing further treatment, such as making the genes in these triclusters a therapeutic target or a basis for drug development.
Three biomarkers for HIV-1 were obtained from the 10 selected triclusters. The biomarkers consist of the genes AGFG1, EGR1, and HLA-C.
Further research includes conducting gene ontology (GO) analysis to see the relationships between genes based on their biological characteristics. Furthermore, parallel computing can be used to reduce the computation time of the δ-Trimax method.