Identifying a Correlation among Qualitative Non-Numeric Parameters in Natural Fish Microbe Dataset Using Machine Learning

Appl. Sci. 2022, 12, 5927

Abstract: Recent technical innovations and developments in computer-based technology have enabled bioscience researchers to acquire comprehensive datasets and identify unique parameters within experimental datasets. However, field researchers may face the challenge that datasets exhibit few associations among measurement results (e.g., from analytical instruments, phenotype observations, and field environmental data) and may contain non-numerical, qualitative parameters, which make statistical analyses difficult. Here, we propose an advanced analysis scheme that combines two machine learning steps to mine association rules between non-numerical parameters. The aim of this analysis is to identify relationships between variables and enable the visualization of association rules from data of samples collected in the field, which have fewer correlations between genetic, physical, and non-numerical qualitative parameters. The analysis scheme presented here may increase the potential to identify important characteristics of big datasets.


Introduction
Technical innovation has enhanced progress in many fields of research and broadened the potential for data collection and analysis. The term "omics" has recently emerged, referring to exhaustive datasets of pools of biological molecules, for example, metabolomics, genomics, and proteomics. Next-generation sequencing is an example of a technical innovation that has contributed to genomics and has been used to elucidate the reasons that many species cannot grow in particular environments, leading to a deeper understanding of microbial communities that grow without cultivation [1-3]. Microbes exist in various locations, sometimes in symbiotic relationships with various animals [4]. While living on animals, the microorganisms interact with hosts or items that the host carries, for example, food residue [5]. Commensal microbes are those which benefit from the host, while the host receives no benefit; they exhibit high cell turnover and produce high levels of secretions, including sweat and mucus [6-8]. Commensal microbes can, however, affect the host's immunity and behavior, for example, via physical attachment and the release of bacterial components and metabolites [9-12]. Because of the effects that commensal microbes have on the host, exhaustive information on the microbial content of the host's intestine may provide information relating to health status [13,14]. Moreover, clarifying the significance of not only commensal microbes but also the workings of nature for the host, and for human social and environmental benefit, might allow humans to manage nature in pursuit of the sustainable development goals (SDGs), particularly those concerning food consumption. However, in nature, a significant variety of parameters exist that cannot be controlled. Although technical innovation provides us with a large amount of relevant information, we must develop appropriate methods to overcome the problems in accomplishing the SDGs.
Informatics is a constantly evolving discipline that advances alongside computer performance, and various methods for characterizing experimental datasets have been developed, for example, similarity assessment by clustering and dimensionality reduction for obtaining an overview of a dataset. In recent years, machine learning and deep learning have become essential for obtaining an overview of big data, such as omics datasets [15,16]. Several complementary methods may be used simultaneously to solve complex data mining issues. Moreover, several methods have been improved and their concepts and applications extended [17,18].
Numerical data are typically used when the statistical analysis of experimental data is undertaken; when qualitative data are used, appropriate quantification must be performed before statistical analysis is possible. Thus, qualitative data that cannot be easily converted into numerical data present a significant challenge to statistical analysis [17].
The present study aimed to mine characteristic variance from crude omics data and to identify, using informatics, association trends within subjective and qualitative data that are difficult to convert to numerical data. We used two machine learning algorithms: K-means clustering and random forest [19,20]. The former is a well-known clustering method that uses centroids to classify data coordinates, but the optimal number of groups for a particular dataset and the validity of the separation cannot be known prior to the analysis. Random forest is an ensemble learning method for classification and regression. It is a high-performance method but requires datasets in which some class information is already known (for example, control group, study group, and dosage amount) and is thus classified as a supervised learning method. We attempted to analyze non-experimental datasets from the field, which did not have supervised information. Using a combination of the two algorithms, we were able to overcome the limitations of each method, extract the characteristics of the dataset (that is, its notable or interesting points for study), and assign non-numerical classification information for downstream analysis. We propose that our advanced analysis scheme is applicable to unsupervised data and/or data with multiple non-numerical variables, both to extract characteristic information for follow-up studies and to extract association rules between multiple variables, including non-numerical parameters.

Related Works
Several researchers have used machine learning to extract important characteristics and to perform classification. In 2020, Li et al. used machine learning to diagnose a disorder [21]. Mesiar and Sheikhi applied machine learning to nonlinear entities [18]. Wei et al. and Tatsumi et al. used machine learning for data from the field, such as those from land and sea [22,23]. Wang et al. attempted to improve calculation performance by combining machine learning methods [24]. Shiokawa et al. applied market basket analysis to visualize and select associations between human lifestyle factors and experimental measurements [17].

Proposed Framework
Our proposed framework consists of three well-known algorithms: K-means, random forest, and Apriori. The algorithms are briefly explained below, following previous reports [17,25-28]. In this study, the algorithms were computed in R using the packages "randomForest" and "arules".

K-Means
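A set $X$ of $n$-dimensional points is clustered into $K$ groups, where $K$ is a number chosen arbitrarily by the user. First, $K$ initial centers $\{c_1, c_2, \ldots, c_K\}$ are set. Each cluster $C_i$ ($i \in \{1, \ldots, K\}$) is then defined as the set of points in $X$ that are closer to $c_i$ than to $c_j$ for all $j \neq i$, and each center is updated to the center of mass of all points in $C_i$:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

The assignment and update steps are repeated until the centers no longer change.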
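The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the R routine used in the study; the function name `kmeans`, its arguments, and the choice of drawing the initial centers from the data points are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means: assign each point to its nearest center, then move centers."""
    rng = random.Random(seed)
    # Assumption: arbitrary initial centers are drawn from the data points.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        labels = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # Update step: each center becomes the center of mass of its cluster.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if not members:
                new_centers.append(centers[i])  # keep the center of an empty cluster
                continue
            dim = len(members[0])
            new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dim)))
        if new_centers == centers:  # converged: the centers no longer move
            break
        centers = new_centers
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

Because the result depends on the initial centers, different values of `seed` can give different partitions on less clearly separated data.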

Random Forest
Random forest is a machine learning method for classification and regression based on the decision tree algorithm, developed by Breiman [27]. The algorithm trains each tree on a bootstrap sample of the data, and the decision trees use Gini impurity to generate branches and to extract variable importance. For a set of data with $J$ classes, where $p_i$ is the fraction of items labeled with class $i$ ($i \in \{1, 2, \ldots, J\}$), Gini impurity is computed as follows.

$$\text{Gini impurity} = 1 - \sum_{i=1}^{J} p_i^2$$
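As a small worked illustration of Gini impurity, the Python sketch below computes the impurity of a set of class labels and the impurity decrease produced by a split, which is the quantity that Gini-based variable importance accumulates over the trees. The function names `gini_impurity` and `impurity_decrease` are hypothetical and are not taken from the "randomForest" package.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class fractions; 0 means a pure set."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

parent = ["class1", "class1", "class2", "class2"]
mixed = gini_impurity(parent)
gain = impurity_decrease(parent, parent[:2], parent[2:])
```

A split that separates the two classes perfectly removes all of the parent's impurity, so a variable that produces such splits often would receive a high importance.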

Apriori
We performed association rule mining as previously described [17]. Briefly, we used the Apriori algorithm [28] as implemented in the R package "arules", with the parameters "support", "confidence", and "lift". Their meanings are given below in terms of the probabilities of X and Y for a rule X ⇒ Y.

$$\text{support}(X \Rightarrow Y) = P(X \cap Y)$$

$$\text{confidence}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)}$$

$$\text{lift}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)\,P(Y)}$$

In this study, the parameters were set as follows: "support" was 0.063, "confidence" was 0.25, and "lift > 1", with "maxlen = 2". For Apriori, the datasets were converted to zero-one data by ranking: for each variable, the quarter of the values belonging to "high" or "low" was coded "one" and the remaining three quarters were coded "zero". Therefore, if two such items co-occurred purely at random, support would be approximately 0.25 × 0.25 ≈ 0.063 and confidence would be approximately 0.063/0.25 = 0.25.
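The three measures and the thresholds quoted above can be illustrated with a short Python sketch. It enumerates every ordered pair of single items directly rather than using Apriori's candidate pruning; for rules limited to two items, this brute-force enumeration should select the same pairs under the same thresholds. The function name `pair_rules` and the toy transactions are assumptions made for the example and do not reproduce the "arules" implementation.

```python
from itertools import permutations

def pair_rules(transactions, min_support=0.063, min_confidence=0.25):
    """Return rules X -> Y between single items that pass the thresholds and have lift > 1."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    # P(item): fraction of transactions that contain the item.
    prob = {i: sum(i in t for t in transactions) / n for i in items}
    rules = []
    for x, y in permutations(items, 2):
        support = sum(x in t and y in t for t in transactions) / n  # P(X and Y)
        if support < min_support:
            continue
        confidence = support / prob[x]   # P(X and Y) / P(X)
        lift = confidence / prob[y]      # P(X and Y) / (P(X) * P(Y))
        if confidence >= min_confidence and lift > 1:
            rules.append((x, y, support, confidence, lift))
    return rules

# Toy data: each set lists the items coded "one" for a sample.
transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
rules = pair_rules(transactions)

# Chance level for two independent items that are each "one" in a quarter of samples.
chance_support = 0.25 * 0.25
chance_confidence = chance_support / 0.25
```

In the toy data, items "a" and "b" co-occur more often than expected by chance, so both directions of that rule have a lift above 1, whereas item "c" never co-occurs with the others and yields no rule.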
The analysis flow details are given in the Materials and Methods section below.

Overview
The analytical scheme of our study is presented in Figure 1. We used a two-stage analysis: first, the group was divided using K-means clustering of the applied dataset, and the validity of the result was evaluated using random forest based on the error rate. Important variables were extracted simultaneously with this step. This step was therefore termed "importance-based K-means", referred to as "i-means" for short. Second, association rules were mined in each dataset and qualitative information was evaluated by Apriori [29].
NMR measurements were normalized by 2,2-dimethyl-2-silapentane-5-sulfonate (DSS). Each dataset was converted to the ratio of each value to the sum for each sample. K-means clustering was performed on each dataset and the validity was checked using the random forest importance-based error rate. This part is referred to as "importance-based K-means", or "i-means". The resulting and original datasets were converted into zero-one data by data ranking or by the class information resulting from i-means. The zero-one data were analyzed by the Apriori algorithm. Finally, from the association rules extracted by the Apriori algorithm, we selected meaningful rules according to the importance calculated by random forest.

Sample Preparation
We used natural fish samples from Japanese hydrosphere cultivation at 118 points in the river or sea. The samples (n = 315 individuals from 21 taxonomic orders) were collected over the period of 2012-2016. Collected samples were stored at −30 °C or −80 °C until anatomic analysis was performed in the lab to obtain intestinal content for measurements. Observers assessed and attached qualitative information: feeding behavior, ecology, habitat, body shape, tail type, season of collection, color, and scale type (Table 1). Fish length was measured, and the samples were divided into four groups based on the average and variance of length. The intestinal content of the fishes was collected in a sample tube, freeze-dried, powdered, and stored at −80 °C for further experiments.

In Table 1, each sample was assigned a single category; where there were overlaps or missing values, the category counts do not always reflect the number of samples.

Nuclear Magnetic Resonance
For nuclear magnetic resonance (NMR) observations, 18 mg of each powdered sample was extracted in 600 µL of KPi buffer containing 90% deuterium oxide and 1 mM sodium 2,2-dimethyl-2-silapentane-5-sulfonate (DSS) at 65 °C for 15 min, and then centrifuged at 17,800× g for 5 min. The entire supernatant was mixed and transferred to a 5-mm NMR tube. Two-dimensional J-resolved (2D J-RES) NMR spectra were acquired at 298 K using a Bruker AVANCE II 700 spectrometer equipped with a ¹H inverse triple-resonance cryogenically cooled probe with Z-axis gradients (Bruker BioSpin GmbH, Rheinstetten, Germany). In brief, 2D J-RES NMR spectra were acquired using the standard Bruker pulse program jresgpprqf, with 16 K (F2) and 16 (F1) points, 32 transients, and 16 dummy scans.

Data Processing
The 2D J-RES NMR spectra were processed using TopSpin software (Bruker Biospin: https://www.bruker.com/en/products-and-solutions/mr/nmr-software/topspin.html, accessed on 29 April 2022). Tilt correction and symmetrization were performed and projections of the 2D spectra were obtained. We defined 235 regions of interest manually using Revolution R Open software (https://cran.r-project.org/, accessed on 29 April 2022).

Data Preprocessing for Analysis and Annotation
The NMR measurement outcomes were normalized according to the DSS signal and peaks were annotated by comparison with premeasured standards from our database of metabolites acquired under the same conditions (Figure S1). The NMR and gut microbe analysis (MiSeq) datasets were each processed to obtain the composition ratio, that is, the ratio of each value to the sum of the measurements in fish intestines for each sample. Subjective qualitative information was assigned to samples by observation, as mentioned above (Table 1). These observations were made subjectively by multiple persons. The classification of a sample was difficult if its qualitative information belonged to multiple categories.
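The two scaling steps described above (normalization to a reference signal, then conversion to a ratio of the per-sample sum) can be sketched as follows. The function names, the dictionary layout, and the example values are assumptions made for illustration and are not the study's data or code.

```python
def normalize_to_reference(intensities, reference):
    """Divide every signal of one sample by its reference (e.g., DSS) intensity."""
    if reference <= 0:
        raise ValueError("reference intensity must be positive")
    return {name: value / reference for name, value in intensities.items()}

def to_composition(sample):
    """Scale one sample's non-negative measurements so that they sum to 1."""
    total = sum(sample.values())
    if total == 0:
        return {name: 0.0 for name in sample}
    return {name: value / total for name, value in sample.items()}

# Hypothetical read counts for one fish gut sample.
counts = {"Firmicutes": 30, "Proteobacteria": 60, "Actinobacteria": 10}
ratios = to_composition(counts)

# Hypothetical NMR signal intensities for one sample with a DSS intensity of 2.0.
scaled = normalize_to_reference({"acetate": 4.0, "TMAO": 1.0}, reference=2.0)
```

Expressing each sample as ratios makes samples with different total signal or read depth comparable before clustering.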

Fish Gut Microbe Analysis by MiSeq
Fish intestinal microbial DNA was extracted as described in a previous report [30]. Further, DNA amplification was conducted according to the same report. Briefly, we used universal primers 954 f and 1369 r, targeted to the V6-8 regions of the 16S rRNA coding region.

Data Analysis

i-Means Analysis
The cluster centers of the NMR and MiSeq datasets were calculated using K-means clustering in R (https://cran.r-project.org/, accessed on 29 April 2022). K-means requires the number of clusters, k, to be specified in advance; k was determined by reference to previous reports [22,31,32]. Briefly, k was set to three for the bacterial data and four for the NMR data. Next, the validity of the calculated clusters was assessed by evaluating the random forest error rate using the "randomForest" package in R. In other words, K-means clustering assigned tentative class information to the datasets, and the random forest algorithm then confirmed whether the samples could be correctly classified using this tentative class information as supervised data. At the same time, we computed the silhouette index of each K-means result as an internal validation index to evaluate the clustering quality. These steps were repeated at least 10,000 times to improve validity and to identify the classification and characteristic variances at the plateau of the error rate curve. Importance parameters that appeared at high frequency (over four times in the five times i-means was performed) were adopted as meaningful characteristic variances and used for subsequent analyses. In this step, we also measured the elapsed times on two computers for reference, using the MiSeq data (53 × 209 matrix, 10,000 repetitions). The elapsed time clearly depended on the data size, number of repetitions, and computer environment.
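One way to read the repetition described above is as a keep-the-best loop: cluster with a new random initialization, score the tentative classes with a classifier error rate, and retain the clustering with the lowest error so far. The Python sketch below shows only this control flow under that assumption. The clustering and scoring routines are passed in as callables (`cluster_fn` and `error_fn`, both hypothetical names), for example a K-means routine and a random forest error rate; the sketch is not the authors' R code.

```python
import random

def i_means(data, k, cluster_fn, error_fn, repeats=10, seed=0):
    """Repeat clustering and keep the labels with the lowest classifier error.

    cluster_fn(data, k, seed) -> list of tentative class labels (assumed interface)
    error_fn(data, labels)    -> classification error rate in [0, 1] (assumed interface)
    """
    rng = random.Random(seed)
    best_labels, best_error = None, float("inf")
    history = []  # best error after each repetition; a plateau suggests convergence
    for _ in range(repeats):
        labels = cluster_fn(data, k, rng.randrange(2 ** 32))
        error = error_fn(data, labels)
        if error < best_error:
            best_labels, best_error = labels, error
        history.append(best_error)
    return best_labels, best_error, history
```

Plotting `history` against the repetition number gives a non-increasing curve; the point at which it flattens corresponds to the plateau of the error rate mentioned in the text.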

Association Rule Mining (Apriori)
Numerical data from NMR and MiSeq were converted to zero-one data and categorized as high, low, or null according to the interquartile range, as per the previous study [17]. "High" was defined as above the interquartile range; "low" as below the interquartile range; and all other data points were classified as "null." The converted data were then merged with the qualitative information shown in Table 1 into a single matrix. The matrix was analyzed using the Apriori algorithm (support = 0.063, confidence = 0.25, maxlen = 2) and association rules with lift > 1 were extracted using the R package "arules." The calculation conditions were chosen according to a previous report [17]. The association rule network was drawn with Gephi, an open-source software (https://gephi.org/, accessed on 29 April 2022).
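The high/low/null coding and the merge with qualitative items can be sketched in Python as follows. The quartile boundaries come from `statistics.quantiles` with its default method, which may differ slightly from the quantile definition used in R; the function names, item labels, and example values are assumptions made for illustration.

```python
import statistics

def encode_high_low(name, values):
    """Return one item set per sample: {name_high}, {name_low}, or empty (null)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first and third quartiles
    items = []
    for v in values:
        if v > q3:
            items.append({f"{name}_high"})   # above the interquartile range
        elif v < q1:
            items.append({f"{name}_low"})    # below the interquartile range
        else:
            items.append(set())              # inside the range: null
    return items

def merge_items(*columns):
    """Union the item sets of each sample across all encoded columns."""
    return [set().union(*row) for row in zip(*columns)]

# Hypothetical composition ratios for eight samples, plus one qualitative column.
acetate = encode_high_low("acetate", [1, 2, 3, 4, 5, 6, 7, 8])
habitat = [{"habitat_2"}] * 4 + [{"habitat_4"}] * 4
transactions = merge_items(acetate, habitat)
```

Each element of `transactions` is the set of items coded "one" for a sample, which is the form that association rule mining expects as input.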

Overview
The heatmap illustrating the correlation between bacterial variances and the NMR signal is shown in Figure S1. The advanced "i-means" method was able to process unsupervised data with the supervised random forest algorithm. Figure 2 and Figure S3 show the results of mining the characteristic variance of the groups that were determined using K-means clustering based on the importance calculated by random forest. The information that was classified using i-means analysis with additional qualitative parameters is presented in Table 1, and it showed better separation than category-based separation (Figure S4). Following zero-one data conversion, we extracted 22,462 association rules (Figure 3A). In this study, we verified the K-means validity by using the random forest error rate and not internal validation metrics, such as the silhouette index. The positive judgment rate converged to a plateau with repeated i-means (Figure S5). However, the silhouette index did not correlate with the random forest error rate (Table S2).
In Figure 2, the ordinate indicates importance based on Gini impurity. The graph shows the importance ranking of bacteria; three bacterial factors have high importance.

Association Rules Focused on the Bacterial Data
The relationship between bacterial/NMR classification and qualitative parameters extracted from the association network is illustrated in Figure 3B,C. Moreover, representative association rules are summarized in Tables 3-5. Interestingly, all bacterial classes exhibited associations with other classes, but two NMR classes showed no association with any variables. Specifically, Bacterial Class 1 was linked with a higher ratio of Firmicutes and feeding behavior category 4. Bacterial Classes 1 and 3 both showed associations with Ecology Category 2. Bacterial Classes 2 and 3 were both found to have high proportions of Proteobacteria compared with other bacterial groups, and showed associations with four qualitative parameters (color category 5, length category 1, scale category 3, and tail category 1) and one NMR-annotated signal (Table 3). Bacterial Class 2 was associated with high levels of Actinobacteria, four qualitative parameters (body category 3, color category 4, habitat category 3, and season category 3), and many amino acid signals in NMR Class 1. Bacterial Class 3 did not exhibit a high proportion of any one bacterial group, but had a low ratio of three common bacteria (Actinobacteria, Bacteroidetes, and Firmicutes) and was found to be associated with six qualitative parameters (body category 1, 2; food category 6; habitat category 4; and season 1, 2) and a high acetate NMR signal (Table 3). In contrast, NMR Classes 1 and 2 had many common NMR signals and were associated with three qualitative parameters but with no bacterial variables (Table 4).

Association Rules Focused on NMR Signals
We identified two associations for NMR Class 1, one with a high proportion of Proteobacteria and one with a lower proportion of other bacterial groups. Finally, the NMR signal network indicated that some source factors might affect the TMAO ratio in the NMR signals of intestinal content samples. Overall, the network was found to have 48 association rules consisting of seven bacterial factors (including one class factor), 18 NMR factors (including two class factors), and 23 qualitative factors (Figure 3D, Table 5). These qualitative factors included four habitat categories (Categories 2-5).

Discussion
We have developed a two-part advanced analysis scheme, involving "i-means" and market basket analyses (Figure 1). The i-means step includes two machine learning algorithms, "K-means clustering" and "random forest." While K-means is a well-known clustering algorithm, the appropriate number of categories is usually not known and the result depends on the initial positions of the centroids, which is known as the initialization trap [33]. To overcome the problem of selecting an appropriate number of groups, we referred to previous reports [22,31,32]. To address the issue of the centroids, we checked the validity of the initial centroid coordinates by evaluating the K-means result against the random forest error rate. Using i-means, we were able to successfully divide a complicated dataset because the error rate from random forest improved with repeated i-means analyses (Figure S4). The plateauing of the error rate indicates that repeated i-means analysis might converge on the best clusters of the dataset (Figure S5). The divided classes of the NMR and MiSeq datasets could be mined for unique variables, and unique groups were identified (Figures S3 and S4). However, clustering the internal validation index, which

Figure 1. Overview of our proposed analysis.

Figure 2. The results of mining the characteristic variance of the groups that were determined using i-means.

Figure 3. Association networks of association rule mining results constructed using zero-one-converted datasets. (A) Network of associations of high importance and annotated signals extracted from all associations. (B) Bacterial association network. (C) Associations determined from NMR data. The network hubs for B and C are class information determined by i-means. Both networks sometimes share some factors. (D) Depiction of some rules from some factors to trimethylamine oxide (TMAO).

Table 1. Added qualitative information and sample numbers.

Table 2. NMR signal annotation of gut contents of fish extracts using KPi/D2O.

Table 3. Extracted association rules correlating bacterial class information from the i-means result with other variances.

Table 4. Extracted association rules correlating NMR class information from the i-means result with other variances.

Table 5. Extracted association rules to high TMAO.