Membrane Clustering of Coronavirus Variants Using Document Similarity

In the wake of the COVID-19 pandemic, bioinformatics, genomics, and biological computation are receiving increased attention. The genome of a virus can be represented as a character string over its nucleobases. Document similarity metrics can be applied to these strings to measure how similar they are, and clustering algorithms can then group the genomes based on the resulting similarities. P systems, or membrane systems, are computation models inspired by the flow of information across cell membranes. They can be used for various purposes, one of them being data clustering. This paper studies a novel and versatile clustering method for genomes that drives such membrane clustering models with document similarity metrics, a combination that is not yet well studied.


Introduction
Deoxyribonucleic acid (DNA) is a complex molecule consisting of nucleotides. It contains genetic information. Nucleotides consist of three components: a nucleobase, a sugar called deoxyribose, and a phosphate group. There are four kinds of nucleobases: adenine, cytosine, guanine, and thymine. A genome sequence lists the nucleobases of the chromosomes of an individual or species. These genomes are stored as strings composed of the initial letters of the nucleobases; in the case of a DNA genome, these are 'A', 'C', 'G', and 'T' [1].
In this paper, we use text similarity metrics, such as Doc2Vec and MinHash, to calculate the similarity of virus genomes. Text similarity metrics can be used to assign vectors to texts and then measure their distances in the vector space. The first step in our experiments was to study how well these metrics can measure the similarity or difference of genomes. Then, with the use of a membrane clustering algorithm, we calculated regular and hierarchical clusterings of coronavirus variants. The algorithm first determines centroids in the space, one for each cluster. Then, using evolution rules (in our case, Particle Swarm Optimization (PSO)), it calculates the next configuration in each round until a given number of steps is taken. In the second step of our experiments, we present the clusters of viral genomes that were created by our method. At the end of the paper, the method is also compared with K-Means. The advantages of our proposed algorithm are shown by comparing the clustering validity indexes of the clusters produced by our method and by K-Means.
The paper is arranged as follows. Section 2 presents previous studies related to this paper, including studies about the usage of document similarity metrics in the comparison of genomes, the clustering of genomes, and membrane computing or membrane-based clustering. In Sections 3.1-3.4, the similarity metrics that we used are presented. In Sections 3.5 and 3.6, the concepts related to clustering algorithms and membrane systems are presented. Then, in Section 3.7, we present the previous studies of ours that this paper builds upon. In Section 4, we present the new methods added to our algorithm to cluster genomes, and we describe how we experimented with these methods.
In more detail, Section 4.1 discusses how we used the similarity metrics to measure the similarity of genomes. Then, Section 4.2 presents our hierarchical membrane clustering method, the experiments, and how we connected this method with the genome similarity results. Sections 5.1 and 5.2 contain the evaluation of using the described similarity metrics on genomes. Then, Sections 5.3 and 5.4 describe the clustering results we obtained with our membrane clustering methods on genomes, including a comparison with the results of another clustering algorithm. The final conclusions, together with suggestions for future research, are gathered in Section 6.

Related Works
In this section, we gathered some studies that are related to this paper. First, we present some studies discussing genome similarity metrics. In [2], the authors designed and implemented SimilarityAtScale, a communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. They packaged their routines in a tool called GenomeAtScale, which combines the proposed algorithm with tools for processing input sequences.
The authors of [3] introduced the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. In [4], the authors introduced Mash, extending the MinHash dimensionality-reduction technique to include a p-value significance test and a pairwise mutation distance. Their method reduces sequence sets and large sequences to small, representative sketches, from which global mutation distances can be rapidly estimated.
The authors of [5] introduced the containment MinHash approach, for estimating the Jaccard index of sets of different sizes by leveraging another probabilistic method, Bloom filters for fast membership queries. In [6], the authors introduced Mashtree, which uses min-hash values to cluster genomes into trees using the neighbor-joining algorithm.
The authors of [7] proposed an automatic feature learning approach to avoid explicit and predefined feature extraction. The proposed approach is based on the adaptation of two extensively used natural language processing techniques, namely Word2Vec and Doc2Vec. In [8], the authors applied an unsupervised sequence embedding technique (Doc2Vec) to represent protein sequences as rich feature vectors with a low dimensionality. Training a Random Forest (RF) classifier on a dataset covering known protein-protein interactions (PPIs) between humans and all viruses, they obtained excellent predictive accuracy that outperformed various combinations of machine learning algorithms and commonly used sequence encoding schemes.
Considering the related studies in the field of coronavirus genome studies, the authors of [9] proposed a method for predicting coronavirus disease 2019 (COVID-19). They introduced similarity features to distinguish COVID-19 from other human coronaviruses. In [10], the authors created a protocol for the analysis and phylogenetic clustering of SARS-CoV-2 genomes using an open-source tool, Nextstrain, for real-time interactive visualization of genome sequencing data.
Next, we present some related studies that discuss clustering methods, first genome clustering and then membrane-based clustering. The authors of [1] summarized the biological background of the clustering and classification of genomes. They then presented the mathematical models used to analyze documents written in natural languages, such as string distances and the n-gram technique, and analyzed the language of DNA text. Finally, they used all these techniques to introduce the clustering of genomes. In [11], the authors presented algorithms using nucleotide n-grams that require no preprocessing steps such as sequence alignment, solving the problems of classification and hierarchical clustering of isolates to determine the family a given genome belongs to. They also introduced a new distance measure between n-gram profiles.
Based on [12], P systems, also known as membrane systems, are a class of distributed parallel computing models widely used to solve clustering problems. That paper introduced an improved PSO-based (Particle Swarm Optimization) clustering algorithm inspired by a tissue-like P system. The proposed clustering algorithm adopts the tissue-like P system structure, which contains a loop of cells, and represents a group of candidate cluster centers by an object in each cell. Communication and evolution rules are also adopted in this approach: a local neighborhood topology is built using the communication rules, by virtue of the loop structure of the cells, which increases the diversity of objects in the system and promotes their co-evolution. Different PSO-based evolution rules are used to evolve poor objects and common objects, respectively.
In [13], the authors introduced a variant of a tissue-like P system with active membranes for the clustering process in the calculation of the density of data points using the K-nearest neighbors and Shannon entropy. The authors of [14] proposed an improved spectral clustering algorithm based on a cell-like P system. Instead of the K-Means algorithm they used the bisecting K-Means algorithm. To improve the spectral clustering, as the framework of this algorithm, they constructed a cell-like P system. The efficiency of the bisecting K-Means is improved by the maximum parallelism of the P system.
A summary of the analyzed literature can be seen in Table 1. Many of the related studies mention the limited variety of their datasets as a limitation: usually, this means that many of the methods were only validated on given genomes, or they were not validated outside the field of computational biology. Some of the related studies also mention the lack of improved database construction. In this paper, we still focus on a given set of genomes as the dataset, because our method has previously been validated on other kinds of data outside the field of computational biology in our previous works. Those studies also analyzed the utilization of different database management systems as data sources [15,16].

Table 1. Summary of the analyzed literature.

Author (Year) | Title | Finding
Arslan (2021) | COVID-19 prediction based on genome similarity of human SARS-CoV-2 and bat SARS-CoV-like coronavirus | Similarity features to distinguish COVID-19 from other human coronaviruses.
Jolly (2021) | Computational analysis and phylogenetic clustering of SARS-CoV-2 genomes | A protocol for the analysis and clustering of SARS-CoV-2 genomes.
Tomović (2006) | n-Gram-based classification and unsupervised hierarchical clustering of genome sequences | A new distance measure between n-gram profiles.
Gao (2018) | An improved PSO-based clustering algorithm inspired by tissue-like P system | A local neighborhood topology increasing the diversity and co-evolution of objects in the P system.
Jiang (2019) | A density peak clustering algorithm based on the K-nearest Shannon entropy and tissue-like P system | A P system variant for the calculation of the density of data points using the K-nearest neighbors and Shannon entropy.
Zhang (2019) | An improved spectral clustering algorithm based on cell-like P system | Improved efficiency of K-Means in spectral clustering using the maximum parallelism of the P system.

Cosine Similarity
Cosine similarity is a similarity measure between two sequences of numbers or vectors. It corresponds to the cosine of the angle between the vectors, which is zero when the vectors are perpendicular and maximal when they span a zero angle. Given the vectors A and B, the Cosine similarity cos(θ) is expressed using the dot product and the magnitudes:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}},$$

where $A_i$ and $B_i$ are the components of A and B, respectively.
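As a minimal illustration (a sketch using NumPy, not necessarily the implementation used in our experiments), the measure can be computed directly from this definition:

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # 1.0 (zero angle)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))  # 0.0 (perpendicular)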

Jaccard Similarity
The dissimilarity and similarity of two sets can be measured with the Jaccard similarity, which is usually used to find documents that are textually similar. It is defined as the size of the intersection divided by the size of the union of two sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

where 0 ≤ J(A, B) ≤ 1. If the sets A and B are both empty, then their similarity is J(A, B) = 1 [17].
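A direct set-based sketch of the index, including the empty-set convention above (the example n-gram sets are illustrative):

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard({"ACG", "CGT"}, {"ACG", "GTA"}))  # 1/3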

MinHash
The similarity of two sets can be calculated rapidly with MinHash, as introduced in [18]. It can be applied to large-scale clustering problems, for example, clustering documents based on the similarity of their sets of words or detecting duplicate web pages. Similarly, we are going to use it to cluster genomes based on the similarity of their sets of n-grams.
Let A and B be two subsets of a set U, let perm be a random permutation of the elements of U, and let h be a hash function mapping the members of U to distinct numbers. For any set S, define h_min(S) as the member x of S with the minimal value of h(perm(x)). Applying h_min to both A and B, and assuming there is no hash collision, h_min(A) and h_min(B) are equal if and only if, among all elements of A ∪ B, the element with the minimum hash value lies in the intersection A ∩ B. The probability of this case is:

$$\Pr\left[h_{\min}(A) = h_{\min}(B)\right] = \frac{|A \cap B|}{|A \cup B|} = J(A, B).$$
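A minimal sketch of this estimation using the datasketch library (which we also use in Section 5.2); the function name and num_perm = 128 are illustrative choices, not the exact configuration of our experiments:

from datasketch import MinHash

def minhash_jaccard(ngrams_a, ngrams_b, num_perm=128):
    # Estimate J(A, B) from num_perm independent min-hash values per set.
    m1, m2 = MinHash(num_perm=num_perm), MinHash(num_perm=num_perm)
    for g in ngrams_a:
        m1.update(g.encode("utf8"))
    for g in ngrams_b:
        m2.update(g.encode("utf8"))
    return m1.jaccard(m2)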

Word2Vec and Doc2Vec
An open-source natural language processing tool called Word2Vec was introduced in [19]. It provides effective algorithms to represent words as word embeddings, which are N-dimensional real vectors, and it is also used to measure the similarity of words. The algorithm uses a simple neural network model with a single hidden layer. Taking a large text as input, it assigns a vector to each distinct word, creating a vector space of hundreds of dimensions. Words appearing in similar contexts are located close to each other in the vector space. These vectors are chosen so that the Cosine similarity function can capture the semantic similarity between the words they represent. Word2Vec has two models, continuous bag-of-words (CBOW) and Skip-Gram.
Doc2Vec is an extension of Word2Vec that was proposed in [20]. It constructs embeddings for documents of any length and also uses a vector called the Paragraph ID (or doc ID). Two algorithms are used to calculate Doc2Vec. One is similar to the CBOW model and is called the Distributed Memory version of Paragraph Vector (PV-DM). The other is called the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), which is similar to Skip-Gram.
In this paper, we used Skip-Gram, which converts each word into a feature vector using Word2Vec. The average of these vectors then gives the Doc2Vec vector, whose length is the same as that of the Word2Vec vectors. Formally:

$$D2V = \frac{1}{n} \sum_{i=1}^{n} W2V(w_i),$$

where $W2V(w_i)$ represents the $i$-th word's Word2Vec vector and $n$ is the number of vectors.
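As an illustrative sketch using Gensim (the library we use in Section 5.1), a Doc2Vec model can be trained on the m-grams of genomes and queried for genome similarities. Note that this sketch relies on Gensim's PV-DBOW training rather than the explicit averaging above; the toy sequences, the non-overlapping splitting in to_mgrams, and the hyperparameters are assumptions of this sketch:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def to_mgrams(genome, m):
    # Split a genome string into consecutive, non-overlapping m-grams.
    return [genome[i:i + m] for i in range(0, len(genome) - m + 1, m)]

genomes = {"229E": "ACGTTGCAATCGGTAC", "NL63": "ACGTTGCAGTCGGTAC"}  # toy sequences
docs = [TaggedDocument(words=to_mgrams(seq, 4), tags=[name])
        for name, seq in genomes.items()]
model = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, epochs=40)  # dm=0: PV-DBOW
print(model.dv.similarity("229E", "NL63"))  # Cosine similarity of the two genome vectors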

Clustering Algorithms
The objective of clustering algorithms is to discover, using a function of goodness, groupings of a specified data set, where each data point of the set belongs to a group. Within such a cluster, the similarity (in our case, document similarity) of the members is maximized, while between separate groups, the similarity of the points is kept as low as possible. Many distinct approaches exist to solve this clustering problem (centroid-based, density-based, connectivity-based), each with its own strong points.
At the end of the paper, we are going to compare our clustering algorithm with K-Means [21]. This algorithm partitions a set of n data vectors $x_1, \ldots, x_n \in X$ into k disjoint clusters, each described by the mean $c_j \in C$ of the samples it contains. The algorithm clusters the samples by minimizing the within-cluster sum of squares:

$$\sum_{i=1}^{n} \min_{c_j \in C} \|x_i - c_j\|^2.$$

The goodness of a clustering can be measured by multiple methods. In this paper, we are going to use the Davies-Bouldin index [22] and the Silhouette coefficient [23] to compare the results of our membrane-based clustering method with the results of K-Means.
The Davies-Bouldin index is the average similarity of each cluster with its most similar cluster, where this similarity is the ratio of within-cluster distances to between-cluster distances. Clusters that are less dispersed and farther apart therefore result in a better score.
The Silhouette coefficient is measured using two values: for each sample, a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. With these values, the coefficient of a sample is calculated in the following way:

$$s = \frac{b - a}{\max(a, b)}.$$
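Both indices can be computed with the sklearn implementation that we use in Section 5.4; a minimal sketch on stand-in data (the random matrix X substitutes for our Doc2Vec genome vectors):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

X = np.random.rand(100, 50)                              # stand-in for Doc2Vec genome vectors
labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)  # baseline clustering
print(davies_bouldin_score(X, labels))                   # lower is better
print(silhouette_score(X, labels))                       # higher is better, range [-1, 1]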

Membrane Computing
Various types of membranes delimit the parts of a membrane system or P system. In biology and chemistry, membranes keep certain chemicals together and allow other chemicals to pass selectively. These systems are organized into a structure: a cell-like membrane system takes the form of a tree, while a tissue-like membrane system takes the form of an arbitrary graph. These structures consist of parallel computing units called cells.
Evolution rules are contained in the regions defined by these cells. These rules delineate the calculations in the system as a sequence of transitions between the states of the system. A multiset of objects is also contained in each region. When the system takes a step in the computation, it chooses non-deterministically, in a maximally parallel manner, from all the applicable rules.
The system reaches a new state or configuration when a step is applied. Furthermore, when there is no possibility for any transitions, meaning there are no rules in any of the cells that could be applied, the calculation terminates. After the system halts, the result of the computation may be defined by the state of a specific cell [24].
In the case of the clustering task, an object is a vector containing the potential cluster centroids. The evolution rules move the centroids of the objects of each cell in each step until a given number of steps is reached. The centroids start from an initial state randomly chosen from the data points to be clustered.
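As an illustration of such an evolution rule, the standard PSO update of one object (a flattened vector of candidate centroids) can be sketched as follows; the inertia and acceleration constants w, c1, and c2 are typical textbook values, not necessarily those of our implementation [16]:

import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    # One PSO update of object x and its velocity v; pbest is the object's best
    # position so far, gbest the best position found in its neighborhood.
    r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v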

Our Approach from Our Previous Works
We partially used the same membrane clustering method as in two of our previous studies [15,16]. In the first study, we experimented with membrane-based clustering on data points stored in the PostgreSQL DBMS. We also validated our clustering using the Davies-Bouldin index, the Silhouette coefficient, and the Calinski-Harabasz index. In the current paper, we use the input parameter settings found to be optimal in that work.
In our second previous study, we performed experiments using larger datasets stored in NoSQL DBMSs: Redis and MongoDB. We evaluated the running time, storage size, and memory usage of these systems in combination with our algorithm. The detailed description of the algorithm and the equations that we used can also be found in that paper.

Experiments and the Algorithm
In this paper, we added document similarity metrics underneath our algorithm and a hierarchical layer above it. In this section, we walk through these extensions and explain our experiments.

Experiments on Using Doc2Vec and MinHash
In the first two experiments, we performed simple tests in which we calculated the distance of the Doc2Vec and MinHash vectors of three viruses: Human Coronavirus 229E, Human Coronavirus NL63, and Hepatitis C. We wanted to create associations between smaller parts of the genome text; to do this, we split the sequences into smaller parts of size m. We will call these elements m-grams from now on.
Next, we trained the model. For this, we simply used every genome we had, for every m-gram size m we wanted to analyze. We used the trained model to calculate the vectors of two genomes and then used Cosine distance to obtain their similarity. Euclidean distance could also be used and would produce roughly the same results.
For MinHash, we used the same examples that we tried in the previous model with Jaccard similarity.

Hierarchical Membrane Clustering
In the following section, we describe our hierarchical membrane clustering method and the four experiments we created to evaluate it using Doc2Vec and MinHash. The following elements are used in the pseudo codes:
• c: a cluster containing a list of vectors.
• t: threshold, the maximal size of the clusters we want to create.
• memb_clust: our original membrane clustering algorithm, described in detail in [16].
Hierarchical clustering differs from regular clustering in that a cluster can become part of another cluster. Our hierarchical membrane clustering recursively calls our original membrane clustering method, splitting each larger cluster c of the previous clustering round into two (or N) branches until every cluster contains at most t samples.
Pseudo code for hierarchical membrane clustering is sketched below. For example, in the experiments using hierarchical clustering, we created clusters containing at most two elements and divided the larger clusters into two parts in each round.
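A Python-style sketch of this recursion, written to match the description above; passing memb_clust as a parameter and its (cluster, branch count) signature are illustrative choices of this sketch:

def hier_clust(c, t, memb_clust, n=2):
    # c: cluster (list of vectors); t: threshold; memb_clust: the membrane
    # clustering routine of [16], assumed here to return n sub-clusters of c.
    if len(c) <= t:
        return [c]                  # small enough: keep as a leaf cluster
    branches = memb_clust(c, n)     # split into n branches
    return [leaf for b in branches for leaf in hier_clust(b, t, memb_clust, n)]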
As can be seen in Section 5, we experimented with the usage of document similarity metrics with our clustering in four ways.
Our first idea was to represent each virus as a vector with Doc2Vec and run the clustering algorithm on the results. To achieve this, we first needed to be able to obtain the vector of a single genome. Then we applied this algorithm to all genome documents and stored their vectors in a list. Here, we used a fixed m for the m-grams.
Pseudo code for the first experiment is sketched below.
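A Python-style sketch matching this description (d2v_vector, genome_docs, and the memb_clust call are illustrative names; to_mgrams is the splitting helper from the earlier Doc2Vec sketch):

def d2v_vector(model, genome, m):
    # Infer the Doc2Vec vector of one genome from its m-grams.
    return model.infer_vector(to_mgrams(genome, m))

vectors = [d2v_vector(model, g, m=14) for g in genome_docs]  # one vector per genome
clusters = memb_clust(vectors, k)  # cluster the genome vectors into k clusters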
For our second experiment, we did the same thing, but instead of selecting a specific m-gram size m, we concatenated the vectors for every size, thus trying to retain as much stored information as possible (see the sketch after this paragraph). A workflow of this second experiment can be seen in Figure 1; the workflow of the first experiment would be similar, but with only one m-gram list created per genome. In the last experiment of our hierarchical clustering, we used the same methodology but with MinHash and Jaccard similarity. A workflow of the third and last experiments can be seen in Figure 2.
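The concatenation step of the second experiment can be sketched as follows; model_by_m, an assumed dictionary of one Doc2Vec model per m-gram size, and the 4-16 range of sizes are illustrative:

import numpy as np

def concat_vector(model_by_m, genome, m_sizes=range(4, 17)):
    # Concatenate the Doc2Vec vectors of one genome over several m-gram sizes;
    # model_by_m maps each size m to a Doc2Vec model trained on m-grams of that size.
    return np.concatenate([model_by_m[m].infer_vector(to_mgrams(genome, m))
                           for m in m_sizes])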

Results
In this section, we present the results that we obtained with the methods described in Section 4. We used the NCBI Reference Sequence (RefSeq) [25] and European Nucleotide Archive [26] databases to collect the genomes used in the following.

Evaluation of Using Doc2Vec
We examined the ways text similarity metrics can be used to measure the similarity of genomes, with most attention focused on the viruses described in Section 4.1. For Doc2Vec, we used Gensim's implementation [27].
In Table 2, we calculated the distances of Human Coronavirus 229E-Hepatitis C (denoted as C1), Human Coronavirus NL63-Hepatitis C (denoted as C2), and Human Coronavirus 229E-Human Coronavirus NL63 (denoted as C3). The metric goes from 0 to 1, and a smaller number means that the two genomes are more similar. We wanted to determine the ideal m-gram size. The distance between the two Corona variants must be smaller than the distance between a Corona variant and the Hepatitis C virus, so we also calculated the differences between the distances: column C1-C3 contains the difference between the first and third columns, and column C2-C3 the difference between the second and third columns. In Table 3, we calculated the running times of the above calculations and their average, in seconds, for each m.
Based on the test results in Table 2, we found that the ideal m is 14 in this case, since this is where the two Corona variants are the closest to each other relative to the Hepatitis C virus. Based on other experiments with other data, m between 3 and 14 usually gives valid results: below that, the genomes are too uniform; above that, there are too many unique sequences and the similarity between genomes is too close to 0.
Considering only the scores, m = 14 seems to be the best. If we also consider the running times in Table 3, we can see that the running time decreases as m increases. In this experiment, with m = 14, both the score and the running time are fine. However, even if, under other circumstances such as with other data, a smaller m is found to be optimal, m can still be increased up to 14 for a smaller running time, since the scores usually remain valid in the range of 3 to 14.

Evaluation of Using MinHash
To experiment with MinHash, we used the implementation of datasketch [28].
The results can be seen in Table 4. The metric still goes from 0 to 1, but here a higher value means that the two genomes are more similar. Based on this experiment, m between 5 and 8 provided correct results, because a bigger similarity was measured between the two Corona variants than between the Hepatitis C virus and a Corona variant. However, when m < 5 or m > 8, the results seem incorrect in most cases: with these m values, a greater or equal similarity was sometimes measured between the Hepatitis C virus and a Corona variant than between the two Corona variants. Furthermore, when m > 8, the method started to measure 0 similarity between different genome sequences.
We measured the running times again in Table 5. Although Doc2Vec is faster when m < 2 and when m > 10, we would not use these values anyway considering the similarity scores. Overall, the running times are smaller for m between 5 and 8 in the case of MinHash, where we received the best scores. However, these running times are bigger than the running time with m = 14 when using Doc2Vec, which is the value we would use as giving valid results.

Table 4. Similarity scores (229E-Hepa (C1), NL63-Hepa (C2), and NL63-229E (C3)) of the MinHash of the three viruses and the differences of these distances (C1-C3 and C2-C3) using different m-gram sizes.

Table 5. Running times of the calculation of the similarity scores (229E-Hepa (C1), NL63-Hepa (C2), and NL63-229E (C3)) of the MinHash of the three viruses and the average of these running times in seconds using different m-gram sizes.

Creating Clusters Using Hierarchical Membrane Clustering
The next thing we wanted to do was to find a way to efficiently sort multiple viruses into clusters, using the methods described in Section 4.2. The first experiment did not bring satisfying results here, so we omitted its results from this paper.
Our second experiment could effectively distinguish the murine hepatitis variants from the other viruses. It also collected the bovine coronavirus variants into the same cluster, with the exception of one variant, and the influenza variants into one cluster as well. It successfully distinguished the SARS CUHK variants from the other variants and viruses and, in the end, collected many other SARS variants into the same clusters. It also correctly collected the SARS Sin variants into the same cluster. Furthermore, it nearly clustered the West Nile variants correctly but was ultimately unsuccessful.
For the third experiment, after running some tests, we obtained a good clustering of the viruses using m = 11 in our membrane-based hierarchical clustering. At given levels of the hierarchy, it could separate the influenza and West Nile variants from the others. It collected the SARS Sin variants together, except for one variant, and the murine hepatitis variants into one cluster, except for one variant. Furthermore, it collected the bovine coronavirus variants into one cluster, except for one. It also clustered the SARS CUHK variants correctly.
Overall, the first and fourth attempts with the similarity matrices produced the worst results; most clusters were essentially random. We did not obtain a satisfying result with MinHash in the end, so in the following, we used Doc2Vec.
The results of the second and third Doc2Vec tests with correct m parameters were mostly correct, with one or two viruses out of place. We continued with these parameters in the last tests.

Comparison of Our Membrane-Based Approach with K-Means
We evaluated the performance of the clustering results based on two clustering validity indices, the Davies-Bouldin index and the Silhouette coefficient, and compared our results with those received with K-Means. We used the implementation of sklearn [29]. First, we present the indexes we reached with our membrane-based clustering and K-Means on the results of the third Doc2Vec test with m = 11.
We present the best Davies-Bouldin indexes that we obtained for the different numbers of clusters, together with the membrane configurations used to reach these scores, compared to the indexes reached with K-Means, in Table 6. In the case of the Davies-Bouldin index, a lower value means better clustering. It can be seen that for each number of clusters, we found at least one configuration that reached a better index than the one produced by K-Means; the average index reached with our method is also much better. Based on these results, the clustering improved as the number of clusters increased, and our method with 14 clusters produced the best results.

In Table 7, the best Silhouette indexes that we reached can be seen, similarly to Table 6. In the case of the Silhouette index, a higher value means better clustering. For most numbers of clusters, but not all, we found at least one configuration that reached a better index than the one produced by K-Means. The average value reached with our method is also not better by as much as with the Davies-Bouldin index. Based on these results, the clustering improved as the number of clusters decreased, and our method with six clusters produced the best result.
Similarly to the clustering of the results of the third Doc2Vec test with m = 11, we now discuss the indexes we obtained in the second Doc2Vec test, with m set between 4 and 16.
First, we present the best Davies-Bouldin indexes that we reached for the different numbers of clusters, together with the membrane configurations used to reach these scores, compared to the indexes reached with K-Means, in Table 8. For most numbers of clusters, we found at least one configuration that reached a better index than the one produced by K-Means. In contrast with the previous experiment, this experiment showed that increasing the number of clusters did not lead to better clustering. Moreover, K-Means found very good clusterings with 9-10 clusters, 9 clusters being the best in this experiment, while our algorithm performed the worst with these cluster numbers. Furthermore, the resulting values and averages are better than those of the previous experiment with a fixed m.

Table 7. Silhouette scores of K-Means and the membrane clustering with different numbers of clusters, an m-gram size of 11, and different membrane configurations. The better scores are bold in the table.

We present the best Silhouette scores that we obtained for the different numbers of clusters, together with the membrane configurations used to reach these scores, compared to the indexes reached with K-Means, in Table 9. For most numbers of clusters, but not all, we found at least one configuration that reached a better index than the one produced by K-Means. Furthermore, although we produced better values with higher cluster numbers, the average with K-Means was still better. Again, the best clusterings consisted of nine clusters, both with our method and with K-Means; moreover, our method with nine clusters led to the best clustering in this experiment. Furthermore, the resulting values here are better than those of the previous experiment with a fixed m.

Conclusions

We examined various methods and algorithms in the process of clustering genomes. First, we evaluated the usage of Doc2Vec and MinHash and how they can be used to measure the distance between genomes. After that, we applied our hierarchical membrane clustering method to the results of the similarity metrics and validated the results using virus genomes. Then, we compared our membrane clustering method with K-Means and evaluated their clustering results using the Silhouette coefficient and the Davies-Bouldin index.
In the end, we can conclude that our membrane clustering method can be effectively used to cluster virus genomes because it reached good scores compared to K-Means. We also found that our clustering methods worked better with Doc2Vec than with MinHash.

Limitations
To achieve better results compared to rival solutions, evolutionary optimization methods have been utilized as the basis of clustering algorithms, including Particle Swarm Optimization (PSO), Red Fox Optimization (RFO), Polar Bear Optimization (PBO), and the Chimp Optimization Algorithm (ChOA), or even combinations of these as their frameworks [30-33].
Further experiments and evaluations comparing these heuristics were not in the scope of this paper, so we used PSO, which we had already built into our membrane clustering approach in our previous studies, as described in Section 3.7. However, in future studies, this aspect could also be evaluated in more detail.
We evaluated the clustering performance of our method, but it could still be examined and improved in other aspects. For example, since membrane systems are parallel models, the running time of our algorithm could be improved in a distributed or multi-threaded environment.
It would also be interesting to examine and test the usage of classification models, such as Decision Trees, Support Vector Machines, Random Forests, or Neural Networks, to classify genomes.
Furthermore, the accuracy of the algorithm could also be tested on larger datasets, for example, using other kinds of coronaviruses such as MERS [34] and SARS-CoV [35]. It would also be interesting to try our algorithm and these models not only on virus genomes but also on bacteria or other species, to inspect how these longer genome sequences affect the running time, memory usage, etc.
Funding: This study was supported by the ÚNKP-21-3 New National Excellence Program of the Ministry for Innovation and Technology from the source of the National Research, Development and Innovation Fund. This research was also supported by grants of the "Application Domain Specific Highly Reliable IT Solutions" project that has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme TKP2020-NKA-06 (National Challenges Subprogramme) funding scheme.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
We used the NCBI Reference Sequence (RefSeq) [25] and European Nucleotide Archive [26] databases to collect the genomes used in this paper.
Acknowledgments: Special thanks to Dániel Szabó, who contributed to the paper by helping in the implementation of the software.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: