Article

Identifying a Correlation among Qualitative Non-Numeric Parameters in Natural Fish Microbe Dataset Using Machine Learning

1
RIKEN Center for Sustainable Resource Science, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan
2
Graduate School of Medical Life Science, Yokohama City University, Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan
3
Graduate School of Bioagricultural Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(12), 5927; https://doi.org/10.3390/app12125927
Submission received: 29 April 2022 / Revised: 6 June 2022 / Accepted: 8 June 2022 / Published: 10 June 2022
(This article belongs to the Special Issue Latest Advances and Prospects in Big Data)

Abstract

Recent technical innovations and developments in computer-based technology have enabled bioscience researchers to acquire comprehensive datasets and identify unique parameters within experimental datasets. However, field researchers may face the challenge that datasets exhibit few associations among measurement results (e.g., from analytical instruments, phenotype observations, and field environmental data) and may contain non-numerical, qualitative parameters, which make statistical analyses difficult. Here, we propose an advanced analysis scheme that combines two machine learning steps to mine association rules between non-numerical parameters. The aim of this analysis is to identify relationships between variables and enable the visualization of association rules from data of samples collected in the field, which have fewer correlations between genetic, physical, and non-numerical qualitative parameters. The analysis scheme presented here may increase the potential to identify important characteristics of big datasets.

1. Introduction

Technical innovation has enhanced progress in many fields of research and broadened the potential for data collection and analysis. The term “omics” has recently emerged, referring to exhaustive datasets of pools of biological molecules (for example, metabolomics, genomics, and proteomics). Next-generation sequencing is an example of a technical innovation that has contributed to genomics and has been used to elucidate the reasons that many species cannot grow in particular environments, leading to a deeper understanding of microbe communities that grow without cultivation [1,2,3]. Microbes exist in various locations, sometimes in symbiotic relationships with various animals [4]. While living on animals, the microorganisms interact with hosts or items that the host carries, for example, with food residue [5]. Commensal microbes are those which benefit from the host, while the host receives no benefit; they exhibit high cell turnover and produce high levels of secretions, including sweat and mucus [6,7,8]. Commensal microbes can, however, affect the host’s immunity and behavior, for example, via physical attachment and the release of bacterial components and metabolites [9,10,11,12]. Because of the effects that commensal microbes have on the host, exhaustive information on the microbial content of the host’s intestine may provide information relating to the status of health [13,14]. Moreover, to clarify the significance of not only commensal microbes but also the workings of nature for the host or human social and environmental benefits, humans might control nature in pursuit of the sustainable development goals (SDGs), particularly food consumption. However, in nature, a significant variety of parameters exist that cannot be controlled. Although technical innovation provides a considerable amount of information, we must develop appropriate methods to overcome the problems in accomplishing the SDGs.
Informatics is constantly evolving alongside advances in computer performance, and various methods of calculating the unique characteristics of experimental datasets have been developed, for example, similarity assessment by clustering methods and dimensionality reduction to obtain an overview of a dataset. In recent years, machine learning and deep learning have become essential in order to obtain an overview of big data, such as omics datasets [15,16]. Several complementary methods may be used simultaneously to solve complex data mining issues. Moreover, several methods have been improved, and their concepts and applications have been extended [17,18].
Numerical data are typically used when the statistical analysis of experimental data is undertaken; when qualitative data are used, appropriate quantification must be performed before statistical analysis is possible. Thus, qualitative data, which cannot be easily converted into numerical data, present a significant challenge to statistical analysis [17].
The present study aimed to mine characteristic variance from crude omics data, and to identify association trends within subjective and qualitative data, which are difficult to convert to numerical data using informatics. We used two machine learning algorithms: K-means clustering and random forest [19,20]. The former is a well-known clustering analysis method that uses centroids to classify data coordinates, but the optimal number of groups for a particular dataset and the validity of separation cannot be known prior to the analysis. Random forest is an ensemble learning method for classification and regression. It is a high-performance method but requires datasets where some class information is already known (for example, control group, study group, and dosage amount) and is thus classified as a supervised learning method. We attempted to analyze non-experimental datasets from the field, which did not have supervised information. Using a combination of the two algorithms, we were able to overcome the limitations of each method, extract the characteristics (the notable and/or interesting points of the dataset to study), and assign non-numerical classification information for downstream analysis. We propose that our advanced analysis scheme is applicable to unsupervised data and/or data with multiple non-numerical variables, both to extract characteristic information for follow-up study and to extract association rules between multiple variables, including non-numerical parameters.

2. Related Works

Several researchers have used machine learning to extract important characteristics and perform classification. In 2020, Li et al. used machine learning to diagnose a disorder [21]. Mesiar and Sheikhi applied machine learning to nonlinear entities [18]. Wei et al. and Tatsumi et al. used machine learning for data from the field, such as those from land and sea [22,23]. Wang et al. attempted to improve calculation performance by combining machine learning methods [24]. Shiokawa et al. applied market basket analysis to visualize and select associations between human lifestyle factors and experimental measurements [17].

3. Proposed Framework

Our proposed framework consists of three well-known algorithms: K-means, random forest, and Apriori. The algorithms are briefly explained below, following previous reports [17,25,26,27,28]. In this study, the algorithms were computed in R using the packages “randomForest” and “arules”.

3.1. K-Means

A set $X$ of n-dimensional points is clustered into $K$ clusters, where $K$ is an arbitrary number chosen by the user. First, $K$ initial centers $C = \{c_1, c_2, \ldots, c_K\}$ are set. Each cluster $C_i$ ($i \in \{1, \ldots, K\}$) is then set to be the set of points in $X$ that are closer to $c_i$ than to $c_j$ for all $j \neq i$, and each center is updated to the center of mass of all points in $C_i$:
$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
These two steps are repeated until $C$ is stable.
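As an illustration of the iteration described above, the following is a minimal Python sketch (the study itself used R). The function name `kmeans`, the `seed` and `max_iter` parameters, and the choice of initial centers by random sampling from the data are assumptions made for this example, not details taken from the original analysis.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (sequences of floats) into k groups by iterating
    assignment and center-of-mass updates until the labels stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # the clustering is stable
        labels = new_labels
        # Update step: move each center to the center of mass of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2, seed=1)
```

Because the result depends on the initial centers, different seeds can yield different partitions on less clearly separated data, which is the limitation that the validation step of the proposed scheme is meant to address.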

3.2. Random Forest

Random forest is a machine learning method based on the decision tree algorithm for classification and regression, developed by Breiman [27]. This algorithm trains each tree on a bootstrap sample of the data, and the decision trees use Gini impurity to generate branches and extract variable importance. For a set of data with $J$ classes, where $p_i$ is the fraction of items labeled with class $i$ ($i \in \{1, 2, \ldots, J\}$), Gini impurity is computed as follows.
$$\text{Gini impurity}(p) = 1 - \sum_{i=1}^{J} p_i^2$$
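The Gini impurity formula can be checked on small examples. The sketch below is an illustrative Python helper (the name `gini_impurity` is an assumption of this example, not a function from the R packages used in the study); it counts class labels and applies the formula directly.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_i p_i^2, where p_i is the fraction of items in class i."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity(["a", "a", "a", "a"])    # a single class: no impurity
mixed = gini_impurity(["a", "a", "b", "b"])   # two equally frequent classes
```

A pure node has an impurity of zero, and the impurity grows as the classes become more evenly mixed, which is why a split that lowers it is preferred when a tree generates branches.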

3.3. Association Rule Mining (Apriori)

We performed association rule mining as previously described [17]. Briefly, we used the Apriori algorithm [28] and applied the parameters “support”, “confidence”, and “lift” using the R package “arules”. Their meanings are given by the formulas below, where $P(X \cap Y)$ is the probability that a sample contains both X and Y.
$$\text{support}(X \Rightarrow Y) = P(X \cap Y)$$
$$\text{confidence}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)}$$
$$\text{lift}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)\,P(Y)}$$
In this study, the parameters were set as follows: “support” was 0.063, “confidence” was 0.25, and “lift > 1” with “maxlen = 2.” For Apriori, the datasets were converted to zero-one data by ranking, so that for each “high” or “low” item, a quarter of the samples were coded “one” and the remaining three quarters were coded “zero”. Therefore, for an association rule arising purely at random between two such items, the support is approximately 0.063 (0.25 × 0.25 = 0.0625) and the confidence is 0.25.
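To make the three measures and the chance-level thresholds concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the “arules” implementation: the helper names and the toy transactions are hypothetical, and each transaction is represented as a set of items.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items` (a set)."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """P(X and Y) / P(X) for the rule X => Y; x and y are sets of items."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """P(X and Y) / (P(X) P(Y)); a value above 1 means X and Y co-occur
    more often than expected under independence."""
    return confidence(transactions, x, y) / support(transactions, y)


# Hypothetical toy data: four samples with items "A" and "B".
transactions = [{"A", "B"}, {"A"}, {"B"}, {"A", "B"}]
s = support(transactions, {"A", "B"})        # 2 of 4 samples
c = confidence(transactions, {"A"}, {"B"})   # 2 of the 3 samples with "A"
l = lift(transactions, {"A"}, {"B"})

# Chance-level baseline for two independent items that are each "one"
# in a quarter of the samples.
p = 0.25
baseline_support = p * p                        # 0.0625, about 0.063
baseline_confidence = baseline_support / p      # 0.25
baseline_lift = baseline_support / (p * p)      # 1.0
```

Under this reading, the thresholds stated above (support 0.063, confidence 0.25, lift > 1) keep only rules that occur more often than two independent quarter-frequency items would by chance.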
The analysis flow details are given in the Materials and Methods section below.

4. Materials and Methods

4.1. Overview

The analytical scheme of our study is presented in Figure 1. We used a two-stage analysis. First, the samples were divided into groups using K-means clustering of the applied dataset, and the validity of the result was evaluated using random forest based on the error rate. Important variables were extracted simultaneously in this step. This step was therefore termed “importance-based K-means”, referred to as “i-means” for short. Second, association rules were mined in each dataset and qualitative information was evaluated by Apriori [29].
NMR measurements were normalized by 2,2-dimethyl-2-silapentane-5-sulfonate (DSS). Each dataset was converted as per the ratio to the sum of each sample. K-means clustering was performed on each dataset and the validity checked using the random forest importance-based error rate. This part is referred to as “importance-based K-means”, or “i-means”. The resulting and original datasets were converted into zero-one data by data ranking or class information resulting from i-means. The zero-one data were analyzed by the Apriori algorithm. Finally, we selected meaningful association rules by the importance calculated by random forest from the extracted association rules by the Apriori algorithm.

4.2. Sample Preparation

We used natural fish samples collected from the Japanese hydrosphere at 118 points in rivers or the sea. The samples (n = 315 individuals from 21 orders of taxonomy) were collected over the period of 2012–2016. Collected samples were stored at −30 °C or −80 °C until anatomic analysis was performed in the lab to obtain intestinal content for measurements. Observers assessed and attached the following qualitative information: feeding behavior, ecology, habitat, body shape, tail type, season of collection, color, and scale type (Table 1). The length of each fish was measured, and the samples were divided into four groups according to the average and variance of length. The intestinal content of the fishes was collected in a sample tube, freeze-dried, powdered, and stored at −80 °C for further experiments.

4.3. Nuclear Magnetic Resonance

For nuclear magnetic resonance (NMR) observations, 18 mg of each powdered sample was extracted in 600 μL of KPi buffer containing 90% deuterium oxide and 1 mM sodium 2,2-dimethyl-2-silapentane-5-sulfonate (DSS) at 65 °C for 15 min, and then centrifuged at 17,800× g for 5 min. The entire supernatant was mixed and transferred to a 5-mm NMR tube. Two-dimensional J-resolved (2D J-RES) NMR spectra were acquired at 298 K using a Bruker AVANCE II 700 spectrometer equipped with a 1H inverse triple-resonance cryogenically cooled probe with Z-axis gradients (Bruker BioSpin GmbH, Rheinstetten, Germany). In brief, 2D J-RES NMR spectra were acquired using the standard Bruker pulse program jresgpprqf, with 16 K (F2) and 16 (F1) points, 32 transients, and 16 dummy scans.

4.4. Data Processing

The 2D J-RES NMR spectra were processed using TopSpin software (Bruker Biospin: https://www.bruker.com/en/products-and-solutions/mr/nmr-software/topspin.html, accessed on 29 April 2022). Tilt correction and symmetrization were performed and projections of the 2D spectra were obtained. We defined 235 regions of interest manually using Revolution R Open software (https://cran.r-project.org/, accessed on 29 April 2022).

4.5. Data Preprocessing for Analysis and Annotation

The NMR measurement outcomes were normalized according to the DSS signal and peaks were annotated by comparison with premeasured standards from our database of metabolites acquired under the same conditions (Figure S1, Table 2). Datasets from NMR and gut microbe analysis (MiSeq) were processed to obtain the composition ratio to the sum of the measurement in fish intestines for each sample. Subjective qualitative information was assigned to samples by observation, as mentioned above (Table 1). These observations were made by multiple persons subjectively. The classification of a sample was difficult if qualitative information belonged to multiple categories.
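The conversion to a composition ratio described above amounts to dividing each measured value by the sum over the same sample. A minimal Python sketch follows; the function name and the example values are hypothetical, and the DSS normalization step is not shown.

```python
def to_composition_ratio(sample):
    """Divide each measurement in one sample by the sum over that sample."""
    total = sum(sample)
    if total == 0:
        raise ValueError("sample sums to zero; the ratio is undefined")
    return [value / total for value in sample]


# Hypothetical counts of three bacterial groups in one intestinal sample.
ratios = to_composition_ratio([2.0, 6.0, 2.0])
```

After this step, every sample sums to one, so samples with different total signal or read counts can be compared on the same scale.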

4.6. Fish Gut Microbe Analysis by MiSeq

Fish intestinal microbial DNA was extracted as described in a previous report [30]. Further, DNA amplification was conducted according to the same report. Briefly, we used universal primers 954f and 1369r, targeted to the V6–8 regions of the 16S rRNA coding region.

4.7. Data Analysis

4.7.1. i-Means Analysis

The center points (centroids) of the NMR and MiSeq datasets were calculated using K-means clustering in R (https://cran.r-project.org/, accessed on 29 April 2022). K-means requires the number of centroids, k, to be specified in advance, and the values of k were determined by reference to previous reports [22,31,32]. Briefly, k was three for the bacterial data and four for the NMR data. Next, the validity of the calculated centroids was assessed by evaluating the random forest error rate using the “randomForest” package in R. In other words, K-means clustering attached tentative class information to the datasets, and the random forest algorithm then confirmed whether the samples could be correctly classified when this tentative class information was used as supervised data. At the same time, we computed the silhouette index of each K-means result as an internal validation index to evaluate the clustering quality. These steps were repeated at least 10,000 times to improve validity and to identify the classification and characteristic variances at which the error rate curve reached a plateau. Each i-means analysis comprised at least 10,000 repetitions, and importance parameters that appeared at high frequency (over four times in the five times i-means was performed) were adopted as meaningful characteristic variances and used for subsequent analyses. In this step, for reference, we measured the elapsed times on two computers with different performance using MiSeq data (53 × 209 matrix, 10,000 repetitions). The elapsed time clearly depended on the data size, the number of repetitions, and the computer environment.
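The loop described above (cluster, treat the cluster labels as tentative classes, estimate a supervised error rate, repeat, and keep the best result) can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: to stay self-contained it replaces the random forest error rate with a leave-one-out nearest-neighbour error, it omits the extraction of variable importance, and all function names and parameters are hypothetical.

```python
import math
import random


def kmeans_labels(points, k, rng, max_iter=100):
    """Plain K-means with randomly chosen initial centers; returns labels."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def loo_nearest_neighbour_error(points, labels):
    """Stand-in for the random forest error rate (NOT a random forest):
    predict each point's label from its nearest other point."""
    n = len(points)
    errors = 0
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        errors += labels[nearest] != labels[i]
    return errors / n


def repeated_validated_kmeans(points, k, repetitions=20, seed=0):
    """Repeat K-means from different starts and keep the labelling whose
    tentative classes are reproduced with the lowest supervised error."""
    rng = random.Random(seed)
    best_labels, best_error = None, float("inf")
    for _ in range(repetitions):
        labels = kmeans_labels(points, k, rng)
        error = loo_nearest_neighbour_error(points, labels)
        if error < best_error:
            best_labels, best_error = labels, error
    return best_labels, best_error


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_labels, best_error = repeated_validated_kmeans(points, k=2)
```

The design point this sketch tries to capture is that the supervised error acts as an external check on each clustering, so a poor random start is discarded rather than trusted.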

4.7.2. Association Rule Mining (Apriori)

Numerical data from NMR and MiSeq were converted to zero-one data and categorized as high, low, or null according to the interquartile range, as per the previous study [17]. “High” was defined as above the interquartile range (i.e., above the third quartile); “low,” as below the interquartile range (i.e., below the first quartile); and other data points were classified as “null.” The data were then merged with the qualitative information shown in Table 1 into a single matrix. The matrix was analyzed by the Apriori algorithm (support = 0.063, confidence = 0.25, maxlen = 2) and association rules were extracted (lift > 1) using the R package “arules.” The calculation conditions were decided according to a previous report [17]. The association rule network was depicted by Gephi, an open-source software (https://gephi.org/, accessed on 29 April 2022).
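The high/low/null categorization can be illustrated with a short Python sketch. The function names and the item naming (`<variable>_high`, `<variable>_low`) are assumptions of this example, and the quartiles are computed with the default convention of Python's `statistics.quantiles`, which may differ slightly from the convention used in the original R analysis.

```python
import statistics


def encode_high_low(values):
    """Label each value "high" (above the third quartile), "low" (below the
    first quartile), or "null" (inside the interquartile range)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    labels = []
    for v in values:
        if v > q3:
            labels.append("high")
        elif v < q1:
            labels.append("low")
        else:
            labels.append("null")
    return labels


def to_zero_one_items(name, values):
    """Turn one numerical variable into two zero-one item columns."""
    labels = encode_high_low(values)
    return {
        f"{name}_high": [int(lab == "high") for lab in labels],
        f"{name}_low": [int(lab == "low") for lab in labels],
    }


# Hypothetical signal intensities for one variable across eight samples.
items = to_zero_one_items("TMAO", [1, 2, 3, 4, 5, 6, 7, 8])
```

With this encoding, roughly a quarter of the samples carry a “one” for each item, which is the property that the chance-level support and confidence thresholds rely on.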

5. Results

5.1. Overview

The heatmap illustrating the correlation between bacterial variances and the NMR signal is shown in Figure S1. The advanced “i-means” method made it possible to apply the supervised random forest algorithm to unsupervised data. Figure 2 and Figure S3 show the results of mining the characteristic variance of the groups that were determined using K-means clustering based on the importance calculated by random forest. The information that was classified using i-means analysis with additional qualitative parameters is presented in Table 1, and it showed better separation than category-based separation (Figure S4). Following zero-one data conversion, we extracted 22,462 association rules (Figure 3A). In this study, we verified the K-means validity by using the random forest error rate and not internal validation metrics, such as the silhouette index. The positive judgment rate converged to a plateau as i-means was repeated (Figure S5). However, the silhouette index did not correlate with the random forest error rate (Table S2).
The vertical axis indicates Gini impurity (importance). The graph shows the importance ranking of bacteria; three bacterial factors have high importance.

5.2. Association Rules Focused on the Bacterial Data

The relationship between bacterial/NMR classification and qualitative parameters extracted from the association network is illustrated in Figure 3B,C. Moreover, representative association rules are summarized in Table 3, Table 4 and Table 5. Interestingly, all bacterial classes exhibited associations with other classes, but two NMR classes showed no association with any variables. Specifically, Bacterial Class 1 was linked with a higher ratio of Firmicutes and feeding behavior category 4. Bacterial Classes 1 and 3 both showed associations with Ecology Category 2. Bacterial Classes 2 and 3 were both found to have high proportions of Proteobacteria compared with other bacterial groups, and showed associations with four qualitative parameters (color category 5, length category 1, scale category 3 and tail category 1) and one NMR-annotated signal (Table 3). Bacterial Class 2 was associated with high levels of Actinobacteria and four qualitative parameters (body category 3, color category 4, habitat category 3 and season category 3) and many amino acid signals in NMR Class 1. Bacterial Class 3 did not exhibit a high proportion of any one bacterial group, but had a low ratio of three common bacteria (Actinobacteria, Bacteroidetes and Firmicutes) and was found to be associated with six qualitative parameters (body category 1, 2; food category 6; habitat category 4; and season 1, 2) and a high acetate NMR signal (Table 3). In contrast, NMR Classes 1 and 2 had many common NMR signals and were associated with three qualitative parameters but with no bacterial variables (Table 4).

5.3. Association Rules Focused on NMR Signals

We identified two associations for NMR Class 1, one with a high proportion of Proteobacteria and one with a lower proportion of other bacterial groups. Finally, the NMR signal network indicated that some source factors might affect the TMAO ratio in the NMR signals of intestinal content samples. Overall, the network was found to have 48 association rules consisting of seven bacterial factors, including one class factor; 18 NMR factors, including two class factors; and 23 qualitative factors (Figure 3D, Table 5). These qualitative factors included four habitats (Categories 2–5).

6. Discussion

We have developed a two-part advanced analysis scheme, involving “i-means” and market basket analyses (Figure 1). The i-means step includes two machine learning algorithms, “K-means clustering” and “random forest.” While K-means is a well-known clustering algorithm, the appropriate number of categories is usually not known and the result depends on the initial point of the centroids, which is known as the initialization trap [33]. To overcome the problem of selecting an appropriate number of groups, we referred to previous reports [22,31,32]. To address the issue of the centroids, we checked the validity of initial centroid coordinates by comparing the random forest error rate to the K-means result. Using i-means, we were able to successfully divide a complicated dataset because the error rate from random forest improved with repeated i-means analyses (Figure S4). The plateauing of the error rate indicates that repeated i-means analysis might converge on the best clusters of the dataset (Figure S5). The divided classes of the NMR and MiSeq datasets could be mined for unique variables, and unique groups were identified (Figures S3 and S4). However, the silhouette coefficient, a well-known internal validation index for clustering, did not correlate with the random forest error rate. This could mean that better clustering does not always provide meaningful separation to focus on unique characteristics of datasets (Table S2). The result was thought to be due to the data being from the field rather than from experiments: they included several parameters and relevant variance but did not have an ideal data distribution. This likely indicates that it is difficult to judge complicated data only by an internal clustering validation index, such as the silhouette coefficient.
Additionally, qualitative data could be attached to the class parameters obtained from i-means analysis, and market basket analysis enabled the extraction of association rules linking these additional parameters to the i-means classification (Figure 3B,C). These results demonstrate that our proposed scheme is suitable for identifying notable variables and proposing related parameters in challenging datasets.
We performed i-means analysis on the MiSeq dataset in order to divide the data into three groups because it has been previously reported that there are three human enterotypes based on gut microbiota at the phylum level, predominantly Firmicutes, Bacteroidetes, or Actinobacteria [31]. Previous studies have proposed that Proteobacteria represent the prominent microbial phyla in the fish gut [34]. The present study revealed the fish gut microbiota to be predominantly composed of Firmicutes, Actinobacteria, or Proteobacteria, rather than Bacteroidetes as in human gut enterotypes (Figures S3 and S4). Bacteroidetes were in low-to-high abundance in individual fish gut microbiomes (Figure S3A), but their composition ratio had only a limited impact on group separation, because the importance value calculated by i-means was lower than those of Proteobacteria, Firmicutes, and Actinobacteria (Figure S3). This may indicate that Proteobacteria and Bacteroidetes play similar roles in the gut microbiota of fish and humans, respectively. In marine biofilms, a considerable amount of carbohydrate metabolism is carried out by Bacteroidetes and Gammaproteobacteria, a class of Proteobacteria; thus, Proteobacteria in the fish gut might decompose consumed carbohydrates [35]. Our analysis revealed that Bacterial Class 3 was associated with a high abundance of Proteobacteria and a low abundance of Bacteroidetes, and the high-intensity NMR signal that was annotated as acetate might be derived from carbohydrate degradation (Figure 3B).
Bacterial Class 1 was found to have fewer association rules by our proposed scheme than Bacterial Classes 2 and 3. The associations were a higher proportion of Firmicutes and food category 4, which includes crustacean eaters; Bacterial Class 1 was also found to share ecology category 2 with Bacterial Class 3. Crustacea use chitin and/or chitosan to form their exoskeletons [36], and it has been reported that many species of bacteria, including Bacillus, which belongs to the Firmicutes, express enzymes with chitinase activity [37]. Fish with a Bacterial Class 1 gut microbiome may, therefore, digest chitin in their feed via enzymes originating from their intestinal Firmicutes. This suggests that food influences the intestinal microbiome [38,39].
We performed i-means analysis on NMR datasets in order to divide the data into four groups. This revealed that the intestinal contents of fish have many variables, but the only variable with a high importance value for dividing the NMR signal clusters, as calculated by i-means, was TMAO, which suggests that the number of high-impact factors (by Gini impurity) was smaller than for the bacterial data (Figure S3). The most important variable for dividing the NMR dataset was the signal annotated as TMAO. The group with the highest TMAO proportion was NMR Class 4 (Figure S4B). However, this group did not exhibit associations with other NMR classes, bacterial classes, or qualitative information. It is known that TMAO is involved in osmoregulation in the muscle of marine creatures [40,41]. TMAO showed association rules with four habitat factors: brackish water, coast, offshore, and deep sea (i.e., not fresh water) (Figure 3D).
Freshwater fish have been reported to have lower TMAO concentrations in their muscle tissue than seawater fish [42,43,44]. Our scheme could be used to identify the characteristics of organisms using qualitative and non-numerical information.

7. Conclusions and Future Work

Our scheme overcomes some of the limitations of K-means clustering and random forest by combining both methods in an iterative process. Moreover, machine learning clustering methods provide a useful approach to extract association rules when combined with market basket analysis. This method may enable researchers to cluster and characterize challenging unsupervised datasets which have few correlations, making it possible to determine notable points and identify important characteristics. We could not discuss all association rules; some rules may warrant further research.
In the future, we hope to develop other combination schemes; for example, in “i-means”, random forest could be replaced with a support vector machine or, alternatively, K-means with K-means++. A support vector machine might work as well in i-means, but K-means++ may not. The method improves the elapsed time of K-means results based on the silhouette index. In this study, the silhouette index was not necessarily appropriate, and it is necessary to suggest a better index to extract meaningful characteristics. We must select a better analysis scheme depending on the situation and dataset obtained. Our scheme may enable the extraction of characteristics and association rules from unsupervised or multiple non-numerical variable datasets obtained from uncontrolled and nonexperimental situations, such as environmental field data. However, it is necessary to experimentally assess the rules extracted by our scheme to clarify their meaning.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app12125927/s1, Figure S1: Heatmap of correlations between bacterial variances and nuclear magnetic resonance (NMR) signals. Figure S2: Annotation of metabolites in gut contents extract using KPi/D2O. Figure S3: The results of mining the characteristic variance of the groups that were determined using i-means. Figure S4: Box plots of the group result using i-means importance and classification. Figure S5: Line graph of 1 − error rate (correction ratio) using i-means.

Author Contributions

Y.S. and J.K. designed the experiments; Y.S. and T.A. collected fish samples; Y.S., K.S. and T.A. performed experiments; H.S. assembled the analysis methods (i-means); H.S. and K.S. analyzed the data and created the figures and tables; H.S., K.S. and J.K. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

The authors’ work described in this article was supported, in part, by grants from the Ministry of Agriculture, Forestry and Fisheries in Japan.

Data Availability Statement

Numerical data are available from: http://datahub.riken.jp/dataset/EMAR0040/ (accessed on 3 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Handelsman, J. Metagenomics: Application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 2004, 68, 669–685.
  2. Lasken, R.S. Genomic sequencing of uncultured microorganisms from single cells. Nat. Rev. Microbiol. 2012, 10, 631–640.
  3. Albertsen, M.; Hugenholtz, P.; Skarshewski, A.; Nielsen, K.L.; Tyson, G.W.; Nielsen, P.H. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 2013, 31, 533–538.
  4. Moran, N.A.; Wernegreen, J.J. Lifestyle evolution in symbiotic bacteria: Insights from genomics. Trends Ecol. Evol. 2000, 15, 321–326.
  5. Leahy, S.; Higgins, D.; Fitzgerald, G.; Van Sinderen, D. Getting better with bifidobacteria. J. Appl. Microbiol. 2005, 98, 1303–1315.
  6. Ashida, H.; Ogawa, M.; Kim, M.; Mimuro, H.; Sasakawa, C. Bacteria and host interactions in the gut epithelial barrier. Nat. Chem. Biol. 2012, 8, 36–45.
  7. Tsutsui, S.; Date, Y.; Kikuchi, J. Visualizing Individual and Region-specific Microbial–metabolite Relations by Important Variable Selection Using Machine Learning Approaches. J. Comput. Aided Chem. 2017, 18, 31–41.
  8. Sicard, J.-F.; Le Bihan, G.; Vogeleer, P.; Jacques, M.; Harel, J. Interactions of intestinal bacteria with components of the intestinal mucus. Front. Cell. Infect. Microbiol. 2017, 7, 387.
  9. Ohno, H. Gut microbial short-chain fatty acids in host defense and immune regulation. Inflamm. Regen. 2015, 35, 114–121.
  10. Forsythe, P.; Sudo, N.; Dinan, T.; Taylor, V.H.; Bienenstock, J. Mood and gut feelings. Brain Behav. Immun. 2010, 24, 9–16.
  11. Schnupf, P.; Gaboriau-Routhiau, V.; Gros, M.; Friedman, R.; Moya-Nilges, M.; Nigro, G.; Cerf-Bensussan, N.; Sansonetti, P.J. Growth and host interaction of mouse segmented filamentous bacteria in vitro. Nature 2015, 520, 99–103.
  12. Hase, K.; Kawano, K.; Nochi, T.; Pontes, G.S.; Fukuda, S.; Ebisawa, M.; Kadokura, K.; Tobe, T.; Fujimura, Y.; Kawano, S. Uptake through glycoprotein 2 of FimH+ bacteria by M cells initiates mucosal immune response. Nature 2009, 462, 226–230.
  13. Osaka, T.; Moriyama, E.; Arai, S.; Date, Y.; Yagi, J.; Kikuchi, J.; Tsuneda, S. Meta-analysis of fecal microbiota and metabolites in experimental colitic mice during the inflammatory and healing phases. Nutrients 2017, 9, 1329.
  14. Carding, S.; Verbeke, K.; Vipond, D.T.; Corfe, B.M.; Owen, L.J. Dysbiosis of the gut microbiota in disease. Microb. Ecol. Health Dis. 2015, 26, 26191.
  15. Shima, H.; Masuda, S.; Date, Y.; Shino, A.; Tsuboi, Y.; Kajikawa, M.; Inoue, Y.; Kanamoto, T.; Kikuchi, J. Exploring the impact of food on the gut ecosystem based on the combination of machine learning and network visualization. Nutrients 2017, 9, 1307.
  16. Zhang, Z.; Zhao, Y.; Liao, X.; Shi, W.; Li, K.; Zou, Q.; Peng, S. Deep learning in omics: A survey and guideline. Brief. Funct. Genom. 2019, 18, 41–57.
  17. Shiokawa, Y.; Misawa, T.; Date, Y.; Kikuchi, J. Application of market basket analysis for the visualization of transaction data based on human lifestyle and spectroscopic measurements. Anal. Chem. 2016, 88, 2714–2719.
  18. Mesiar, R.; Sheikhi, A. Nonlinear random forest classification, a copula-based approach. Appl. Sci. 2021, 11, 7140.
  19. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1967; pp. 281–297.
  20. Fawagreh, K.; Gaber, M.M.; Elyan, E. Random forests: From early developments to recent advancements. Syst. Sci. Control Eng. Open Access J. 2014, 2, 602–609.
  21. Li, W.T.; Ma, J.; Shende, N.; Castaneda, G.; Chakladar, J.; Tsai, J.C.; Apostol, L.; Honda, C.O.; Xu, J.; Wong, L.M. Using machine learning of clinical data to diagnose COVID-19. medRxiv 2020.
  22. Wei, F.; Ito, K.; Sakata, K.; Asakura, T.; Kikuchi, J. Fish ecotyping based on machine learning and inferred network analysis of chemical and physical properties. Sci. Rep. 2021, 11, 3766.
  23. Tatsumi, K.; Yamashiki, Y.; Torres, M.A.C.; Taipe, C.L.R. Crop classification of upland fields using Random forest of time-series Landsat 7 ETM+ data. Comput. Electron. Agric. 2015, 115, 171–179.
  24. Wang, J.; Wu, X.; Zhang, C. Support vector machines based on K-means clustering for real-time business intelligence systems. Int. J. Bus. Intell. Data Min. 2005, 1, 54–64.
  25. Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461.
  26. Arthur, D.; Vassilvitskii, S. k-Means++: The Advantages of Careful Seeding; Stanford University: Stanford, CA, USA, 2006.
  27. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  28. Agrawal, R.; Srikant, R. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, 12–15 September 1994; pp. 487–499.
  29. Woo, J. Market basket analysis algorithms with mapreduce. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2013, 3, 445–452. [Google Scholar] [CrossRef]
  30. Date, Y.; Iikura, T.; Yamazawa, A.; Moriya, S.; Kikuchi, J. Metabolic sequences of anaerobic fermentation on glucose-based feeding substrates based on correlation analyses of microbial and metabolite profiling. J. Proteome Res. 2012, 11, 5602–5610. [Google Scholar] [CrossRef]
  31. Arumugam, M.; Raes, J.; Pelletier, E.; Le Paslier, D.; Yamada, T.; Mende, D.R.; Fernandes, G.R.; Tap, J.; Bruls, T.; Batto, J.-M. Enterotypes of the human gut microbiome. Nature 2011, 473, 174–180. [Google Scholar] [CrossRef] [PubMed]
  32. Wei, F.; Fukuchi, M.; Ito, K.; Sakata, K.; Asakura, T.; Date, Y.; Kikuchi, J. Large-scale evaluation of major soluble macromolecular components of fish muscle from a conventional 1H-NMR spectral database. Molecules 2020, 25, 1966. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Ikotun, A.M.; Almutari, M.S.; Ezugwu, A.E. K-Means-Based Nature-Inspired Metaheuristic Algorithms for Automatic Data Clustering Problems: Recent Advances and Future Directions. Appl. Sci. 2021, 11, 11246. [Google Scholar] [CrossRef]
  34. Egerton, S.; Culloty, S.; Whooley, J.; Stanton, C.; Ross, R.P. The gut microbiota of marine fish. Front. Microbiol. 2018, 9, 873. [Google Scholar] [CrossRef] [PubMed]
  35. Stal, L.J.; Bolhuis, H.; Cretoiu, M.S. Phototrophic marine benthic microbiomes: The ecophysiology of these biological entities. Environ. Microbiol. 2019, 21, 1529–1551. [Google Scholar] [CrossRef]
  36. Kurita, K. Chitin and chitosan: Functional biopolymers from marine crustaceans. Mar. Biotechnol. 2006, 8, 203–226. [Google Scholar] [CrossRef]
  37. Askarian, F.; Zhou, Z.; Olsen, R.E.; Sperstad, S.; Ringø, E. Culturable autochthonous gut bacteria in Atlantic salmon (Salmo salar L.) fed diets with or without chitin. Characterization by 16S rRNA gene sequencing, ability to produce enzymes and In Vitro growth inhibition of four fish pathogens. Aquaculture 2012, 326, 1–8. [Google Scholar] [CrossRef] [Green Version]
  38. Warren, F.J.; Fukuma, N.M.; Mikkelsen, D.; Flanagan, B.M.; Williams, B.A.; Lisle, A.T.; Ó Cuív, P.; Morrison, M.; Gidley, M.J. Food starch structure impacts gut microbiome composition. mSphere 2018, 3, e00086-18. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Albenberg, L.G.; Lewis, J.D.; Wu, G.D. Food and the gut microbiota in IBD: A critical connection. Curr. Opin. Gastroenterol. 2012, 28, 314–320. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  40. Downing, A.B.; Wallace, G.T.; Yancey, P.H. Organic osmolytes of amphipods from littoral to hadal zones: Increases with depth in trimethylamine N-oxide, scyllo-inositol and other potential pressure counteractants. Deep. Sea Res. Part I Oceanogr. Res. Pap. 2018, 138, 1–10. [Google Scholar] [CrossRef]
  41. Kelly, R.H.; Yancey, P.H. High contents of trimethylamine oxide correlating with depth in deep-sea teleost fishes, skates, and decapod crustaceans. Biol. Bull. 1999, 196, 18–25. [Google Scholar] [CrossRef] [PubMed]
  42. Seibel, B.A.; Walsh, P.J. Trimethylamine oxide accumulation in marine animals: Relationship to acylglycerol storagej. J. Exp. Biol. 2002, 205, 297–306. [Google Scholar] [CrossRef]
  43. Summers, G.; Wibisono, R.; Hedderley, D.; Fletcher, G. Trimethylamine oxide content and spoilage potential of New Zealand commercial fish species. N. Z. J. Mar. Freshw. Res. 2017, 51, 393–405. [Google Scholar] [CrossRef]
  44. Yin, X.; Gibbons, H.; Rundle, M.; Frost, G.; McNulty, B.A.; Nugent, A.P.; Walton, J.; Flynn, A.; Brennan, L. The Relationship between Fish Intake and Urinary Trimethylamine-N-Oxide. Mol. Nutr. Food Res. 2020, 64, 1900799. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of our proposed analysis.
Figure 2. Results of mining the characteristic variables of the groups determined using k-means.
Figure 3. Networks of association rule mining results constructed from zero-one-converted datasets. (A) Network of high-importance associations and annotated signals extracted from all associations. (B) Bacterial association network. (C) Associations determined from NMR data. The network hubs in (B) and (C) are the class labels determined by k-means; the two networks share some factors. (D) Selected rules leading from various factors to trimethylamine oxide (TMAO).
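The zero-one conversion mentioned in the Figure 3 caption can be illustrated with a minimal sketch. The sample records and the helper name `to_zero_one` are hypothetical and are not the authors' implementation; the sketch only shows how qualitative labels, including overlapping or missing ones, might be turned into binary items before association rule mining.

```python
def to_zero_one(samples):
    """Convert qualitative labels into a zero-one (one-hot) table.

    `samples` is a list of dicts mapping a parameter name to one label
    or to a list of labels (a sample may fall into several categories).
    A parameter that is missing from a sample simply yields zeros.
    """
    def labels_of(value):
        return value if isinstance(value, list) else [value]

    # Collect every label that occurs anywhere, in a stable order.
    items = sorted({
        label
        for sample in samples
        for value in sample.values()
        for label in labels_of(value)
    })
    rows = []
    for sample in samples:
        present = set()
        for value in sample.values():
            present.update(labels_of(value))
        rows.append([1 if item in present else 0 for item in items])
    return items, rows


# Hypothetical records; the label names follow Table 1, the values do not
# come from the study data.
samples = [
    {"Body": "Body1", "Food": ["Food4", "Food6"], "Habitat": "Habitat4"},
    {"Body": "Body2", "Food": ["Food4"], "Habitat": "Habitat3"},
    {"Body": "Body2", "Food": ["Food3", "Food4"]},  # Habitat is missing
]
items, rows = to_zero_one(samples)
for row in rows:
    print(dict(zip(items, row)))
```

Each row of the resulting table can then be read as a transaction whose items are the columns equal to 1.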
Table 1. Added qualitative information and sample numbers.
Qualitative Parameter | Observed Number
Bacterial Class 1 (from k-means) | 26
Bacterial Class 2 | 93
Bacterial Class 3 | 90
Body1 (Spindle-shaped) | 64
Body2 (Compressed) | 144
Body3 (Cubic-shaped) | 81
Body4 (Flat-shaped) | 16
Body5 (Long, slender) | 14
Color1 (Red) | 85
Color2 (Blue) | 16
Color3 (Yellow) | 10
Color4 (Brown) | 75
Color5 (Black) | 131
Ecology1 (Migratory) | 36
Ecology2 (Territorial-staying) | 188
Ecology3 (Rockfish) | 66
Ecology4 (Demersal fish) | 26
Food1 (Plankton eater) | 42
Food2 (Herbivore) | 41
Food3 (Polychaeta eater) | 150
Food4 (Crustacea eater) | 259
Food5 (Mollusk eater) | 82
Food6 (Fish eater) | 162
Habitat1 (Freshwater) | 24
Habitat2 (Brackish water) | 22
Habitat3 (Coast) | 136
Habitat4 (Offshore) | 92
Habitat5 (Deep sea) | 48
Length1 (50–150 mm) | 94
Length2 (150–200 mm) | 97
Length3 (200–300 mm) | 88
Length4 (Over 300 mm) | 58
NMR Class 1 (from k-means) | 192
NMR Class 2 | 99
NMR Class 3 | 7
NMR Class 4 | 11
Scales1 (Matte) | 52
Scales2 (Glossy) | 133
Scales3 (Shiny) | 131
Season1 (March–May) | 90
Season2 (June–August) | 68
Season3 (September–November) | 115
Season4 (December–February) | 46
Tail1 (Two distinct) | 122
Tail2 (Others) | 191
Each sample was generally assigned a single category per parameter; because some samples had overlapping categories or missing values, the counts for a parameter do not always sum to the number of samples.
Table 2. NMR signal annotation of fish gut contents extracted using KPi/D2O.
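Peak Number | δ 1H (ppm) | δ 13C (ppm) | Annotation
1 | 0.93 | 13.9 | Ile
2 | 0.946 | 23.7 | Leu
3 | 0.97 | 19.4 | Val
4 | 1.001 | 20.6 | Ile
5 | 1.034 | 20.6 | Val
6 | 1.319 | 22.2 | Lactate, Thr
7 | 1.471 | 18.9 | Ala
8 | 1.701 | 26.7 | Leu
9 | 1.72 | 29 | Lys
10 | 1.909 | 25.9 | Acetate
11 | 1.973 | 38.6 | Ile
12 | 1.999 | 31.9 | Pro
13 | 2.045 | 29.7 | Glu
14 | 2.124 | 29 | Gln
15 | 2.261 | 31.9 | Val
16 | 2.339 | 36.2 | Glu
17 | 2.394 | 36.9 | Succinate
18 | 2.636 | 31.5 | Met
19 | 2.677 | 39.4 | Asp
20 | 2.711 | 37.2 | DMA
21 | 2.807 | 39.4 | Asp
22 | 2.876 | 47.4 | TMA
23 | 3.025 | 39.7 | Creatine
24 | 3.196 | 56.6 | Choline
25 | 3.233 | 43.3 | Arg
26 | 3.223 | 76.9 | Glucose
27 | 3.258 | 62.3 | TMAO, Taurine
28 | 3.414 | 38.1 | Taurine
29 | 3.501 | 70 | Choline
30 | 3.548 | 44.2 | Gly
31 | 3.645 | 65.3 | Glycerol
32 | 3.921 | 56.5 | Creatine
33 | 3.955 | 63 | Ser
34 | 4.103 | 71.2 | Lactate
35 | 4.241 | 68.7 | Thr
36 | 4.642 | 98.6 | Glucose
37 | 5.224 | 94.8 | Glucose
38 | 6.091 | 91.1 | Inosine
39 | 6.887 | 118.6 | Tyr
40 | 7.108 | 119.8 | His
41 | 7.181 | 133.4 | Tyr
42 | 7.265 | 124.8 | Trp
43 | 7.318 | 131.9 | Phe
44 | 7.365 | 130.5 | Phe
45 | 7.412 | 132 | Phe
46 | 7.518 | 114.6 | Trp
47 | 7.721 | 121.2 | Trp
48 | 7.999 | 138.6 | His
49 | 8.178 | 148.3 | Inosine
50 | 8.223 | 148.9 | Inosine
51 | 8.338 | 143 | Inosine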
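As an illustration of how an annotation table such as Table 2 might be used, the following minimal sketch looks up an observed 1H/13C peak against a few reference rows. The function name `annotate_peak` and the matching tolerances are assumptions made for this example; the source does not state how peaks were matched.

```python
# A few rows copied from Table 2: (1H ppm, 13C ppm, annotation).
REFERENCE_PEAKS = [
    (1.471, 18.9, "Ala"),
    (1.909, 25.9, "Acetate"),
    (2.876, 47.4, "TMA"),
    (3.258, 62.3, "TMAO, Taurine"),
    (4.103, 71.2, "Lactate"),
]


def annotate_peak(h_ppm, c_ppm, tol_h=0.03, tol_c=0.5):
    """Return annotations whose reference shifts lie within the tolerances.

    tol_h and tol_c are hypothetical matching windows (in ppm), chosen
    only for illustration.
    """
    return [
        name
        for ref_h, ref_c, name in REFERENCE_PEAKS
        if abs(h_ppm - ref_h) <= tol_h and abs(c_ppm - ref_c) <= tol_c
    ]


print(annotate_peak(3.26, 62.1))   # close to the TMAO/Taurine row
print(annotate_peak(5.00, 100.0))  # no reference row within tolerance
```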
Table 3. Association rules extracted from the bacterial classes determined by k-means to other variables.
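Source | Target | Support | Confidence | Lift
Bac_class1 | High_Firmicutes | 0.08 | 0.96 | 6.44
Bac_class1 | Ecology2 (Territorial-staying) | 0.06 | 0.77 | 1.29
Bac_class1 | Food4 (Crustacea eater) | 0.07 | 0.88 | 1.08
Bac_class2 | High_Actinobacteria | 0.10 | 0.32 | 2.42
Bac_class2 | Low_Bacteria.Other | 0.13 | 0.43 | 1.83
Bac_class2 | High_Proteobacteria | 0.23 | 0.77 | 1.44
Bac_class2 | Season3 (September–November) | 0.13 | 0.45 | 1.24
Bac_class2 | Color5 (Black) | 0.15 | 0.51 | 1.22
Bac_class2 | Color4 (Brown) | 0.08 | 0.28 | 1.17
Bac_class2 | Body3 (Cubic-shaped) | 0.09 | 0.30 | 1.17
Bac_class2 | Scales3 (Shiny) | 0.14 | 0.47 | 1.14
Bac_class2 | NMR_class1 | 0.20 | 0.68 | 1.11
Bac_class2 | Habitat3 (Coast) | 0.14 | 0.46 | 1.07
Bac_class2 | High_Gln.Malic.acid | 0.17 | 0.57 | 1.07
Bac_class2 | Tail1 (Two distinct) | 0.12 | 0.41 | 1.05
Bac_class2 | Length1 (50–150 mm) | 0.09 | 0.31 | 1.04
Bac_class2 | High_Thr | 0.24 | 0.82 | 1.03
Bac_class2 | High_Ile | 0.29 | 0.99 | 1.01
Bac_class2 | High_Creatine.G3P.GPC | 0.29 | 0.99 | 1.01
Bac_class2 | High_Lactate | 0.29 | 0.99 | 1.01
Bac_class2 | High_Val | 0.29 | 0.99 | 1.01
Bac_class2 | High_Ile | 0.29 | 0.99 | 1.01
Bac_class2 | High_Ala | 0.29 | 0.99 | 1.01
Bac_class2 | High_Gly | 0.29 | 0.99 | 1.01
Bac_class2 | High_TMAO | 0.29 | 0.99 | 1.01
Bac_class2 | High_Creatine | 0.29 | 0.99 | 1.01
Bac_class2 | High_Leu | 0.29 | 0.98 | 1.00
Bac_class3 | Low_Actinobacteria | 0.07 | 0.26 | 2.30
Bac_class3 | Low_Firmicutes | 0.08 | 0.29 | 2.17
Bac_class3 | Low_Bacteroidetes | 0.10 | 0.34 | 1.78
Bac_class3 | High_Proteobacteria | 0.26 | 0.90 | 1.68
Bac_class3 | Body1 (Spindle-shaped) | 0.08 | 0.28 | 1.37
Bac_class3 | Season2 (June–August) | 0.07 | 0.26 | 1.18
Bac_class3 | Ecology2 (Territorial-staying) | 0.20 | 0.69 | 1.15
Bac_class3 | Tail1 (Two distinct) | 0.13 | 0.44 | 1.15
Bac_class3 | Low_Bacteria.Other | 0.08 | 0.27 | 1.14
Bac_class3 | Length1 (50–150 mm) | 0.10 | 0.33 | 1.12
Bac_class3 | Habitat4 (Offshore) | 0.09 | 0.32 | 1.10
Bac_class3 | Food6 (Fish eater) | 0.16 | 0.57 | 1.10
Bac_class3 | High_Gln.Malic.acid | 0.17 | 0.58 | 1.08
Bac_class3 | Color5 (Black) | 0.13 | 0.44 | 1.07
Bac_class3 | Scales3 (Shiny) | 0.13 | 0.44 | 1.07
Bac_class3 | NMR_class2 | 0.10 | 0.33 | 1.06
Bac_class3 | Body2 (Compressed) | 0.14 | 0.48 | 1.05
Bac_class3 | Season1 (March–May) | 0.08 | 0.29 | 1.01
Bac_class3 | High_Acetate | 0.26 | 0.92 | 1.00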
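The three columns reported in Table 3 follow the standard association rule definitions: support is the fraction of samples containing both source and target, confidence is the fraction of source-containing samples that also contain the target, and lift is the confidence divided by the overall frequency of the target. A minimal sketch of these definitions is given below; the function name `rule_metrics` and the toy transactions are illustrative assumptions and are not the study data or the authors' code.

```python
def rule_metrics(transactions, source, target):
    """Return (support, confidence, lift) for the rule source -> target.

    support    = P(source and target)
    confidence = P(target | source)
    lift       = confidence / P(target)
    Assumes both items occur at least once in `transactions`.
    """
    n = len(transactions)
    n_source = sum(1 for t in transactions if source in t)
    n_target = sum(1 for t in transactions if target in t)
    n_both = sum(1 for t in transactions if source in t and target in t)
    if n_source == 0 or n_target == 0:
        raise ValueError("source and target must each occur at least once")
    support = n_both / n
    confidence = n_both / n_source
    lift = confidence / (n_target / n)
    return support, confidence, lift


# Toy transactions (one set of items per sample); NOT the study data.
transactions = [
    {"Bac_class1", "High_Firmicutes"},
    {"Bac_class1", "High_Firmicutes"},
    {"Bac_class2", "High_Proteobacteria"},
    {"Bac_class2", "High_Firmicutes"},
    {"Bac_class3", "High_Proteobacteria"},
    {"Bac_class3", "High_Proteobacteria"},
    {"Bac_class3"},
    {"Bac_class2"},
]
support, confidence, lift = rule_metrics(
    transactions, "Bac_class1", "High_Firmicutes"
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift well above 1, as for the Bac_class1 to High_Firmicutes rule in Table 3 (6.44), indicates that the target occurs far more often together with the source than its overall frequency alone would suggest.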
Table 4. Association rules extracted from the NMR classes determined by k-means to other variables.
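Source | Target | Support | Confidence | Lift
NMR_class1 | Body1 (Spindle-shaped) | 0.17 | 0.27 | 1.33
NMR_class1 | Scales3 (Shiny) | 0.32 | 0.53 | 1.26
NMR_class1 | Tail1 (Two distinct) | 0.29 | 0.48 | 1.24
NMR_class1 | Low_Bacteria.Other | 0.17 | 0.28 | 1.18
NMR_class1 | Color5 (Black) | 0.29 | 0.47 | 1.13
NMR_class1 | Ecology2 (Territorial-staying) | 0.41 | 0.67 | 1.12
NMR_class1 | High_Gln.Malic.acid | 0.36 | 0.59 | 1.11
NMR_class1 | Bac_class2 | 0.20 | 0.33 | 1.11
NMR_class1 | High_Proteobacteria | 0.36 | 0.58 | 1.09
NMR_class1 | Food5 (Mollusk eater) | 0.17 | 0.28 | 1.08
NMR_class1 | Length3 (200–300 mm) | 0.18 | 0.30 | 1.06
NMR_class1 | Length2 (150–200 mm) | 0.20 | 0.32 | 1.05
NMR_class1 | Color1 (Red) | 0.17 | 0.28 | 1.04
NMR_class1 | Season3 (September–November) | 0.23 | 0.38 | 1.03
NMR_class1 | High_Thr | 0.50 | 0.82 | 1.03
NMR_class1 | High_Acetate | 0.57 | 0.94 | 1.02
NMR_class1 | High_Leu | 0.61 | 1.00 | 1.02
NMR_class1 | High_Ile | 0.61 | 1.00 | 1.02
NMR_class1 | High_Creatine.G3P.GPC | 0.61 | 1.00 | 1.02
NMR_class1 | High_Lactate | 0.61 | 1.00 | 1.02
NMR_class1 | High_Val | 0.61 | 1.00 | 1.02
NMR_class1 | High_Ile | 0.61 | 1.00 | 1.02
NMR_class1 | High_Ala | 0.61 | 1.00 | 1.02
NMR_class1 | High_Gly | 0.61 | 1.00 | 1.02
NMR_class1 | High_TMAO | 0.61 | 1.00 | 1.02
NMR_class1 | High_Creatine | 0.61 | 1.00 | 1.02
NMR_class1 | Food6 (Fish eater) | 0.32 | 0.52 | 1.01
NMR_class1 | Season1 (March–May) | 0.17 | 0.29 | 1.00
NMR_class1 | Body2 (Compressed) | 0.28 | 0.46 | 1.00
NMR_class1 | Food4 (Crustacea eater) | 0.50 | 0.82 | 1.00
NMR_class2 | Color4 (Brown) | 0.12 | 0.39 | 1.65
NMR_class2 | Ecology3 (Rockfish) | 0.10 | 0.32 | 1.54
NMR_class2 | Scales2 (Glossy) | 0.18 | 0.58 | 1.36
NMR_class2 | Food3 (Polychaeta eater) | 0.20 | 0.63 | 1.32
NMR_class2 | Season2 (June–August) | 0.09 | 0.27 | 1.26
NMR_class2 | Length1 (50–150 mm) | 0.12 | 0.37 | 1.25
NMR_class2 | Body3 (Cubic-shaped) | 0.10 | 0.31 | 1.22
NMR_class2 | Tail2 (Others) | 0.23 | 0.74 | 1.22
NMR_class2 | Habitat4 (Offshore) | 0.10 | 0.32 | 1.11
NMR_class2 | High_Lactate | 0.14 | 0.44 | 1.09
NMR_class2 | Habitat3 (Coast) | 0.15 | 0.46 | 1.08
NMR_class2 | High_Thr | 0.27 | 0.85 | 1.06
NMR_class2 | High_Acetate | 0.31 | 0.98 | 1.06
NMR_class2 | Bac_class3 | 0.10 | 0.30 | 1.06
NMR_class2 | Food4 (Crustacea eater) | 0.27 | 0.87 | 1.06
NMR_class2 | Length2 (150–200 mm) | 0.10 | 0.32 | 1.05
NMR_class2 | High_Ile | 0.31 | 1.00 | 1.02
NMR_class2 | High_Creatine.G3P.GPC | 0.31 | 1.00 | 1.02
NMR_class2 | High_Lactate | 0.31 | 1.00 | 1.02
NMR_class2 | High_Val | 0.31 | 1.00 | 1.02
NMR_class2 | High_Ile | 0.31 | 1.00 | 1.02
NMR_class2 | High_Ala | 0.31 | 1.00 | 1.02
NMR_class2 | High_Gly | 0.31 | 1.00 | 1.02
NMR_class2 | High_TMAO | 0.31 | 1.00 | 1.02
NMR_class2 | High_Creatine | 0.31 | 1.00 | 1.02
NMR_class2 | Body2 (Compressed) | 0.15 | 0.46 | 1.02
NMR_class2 | High_Leu | 0.31 | 0.99 | 1.01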
Table 5. Extracted association rules to high TMAO.
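Source | Target | Support | Confidence | Lift
Body1 (Spindle-shaped) | High_TMAO | 0.20 | 1.00 | 1.02
Body3 (Cubic-shaped) | High_TMAO | 0.26 | 1.00 | 1.02
Color4 (Brown) | High_TMAO | 0.24 | 1.00 | 1.02
Ecology1 (Migratory) | High_TMAO | 0.11 | 1.00 | 1.02
Ecology4 (Demersal fish) | High_TMAO | 0.08 | 1.00 | 1.02
Food1 (Plankton eater) | High_TMAO | 0.13 | 1.00 | 1.02
Habitat2 (Brackish water) | High_TMAO | 0.07 | 1.00 | 1.02
Habitat5 (Deep sea) | High_TMAO | 0.15 | 1.00 | 1.02
High_Acetate | High_TMAO | 0.92 | 1.00 | 1.02
High_Ala | High_TMAO | 0.98 | 1.00 | 1.02
High_Creatine | High_TMAO | 0.98 | 1.00 | 1.02
High_Creatine.G3P.GPC | High_TMAO | 0.98 | 1.00 | 1.02
High_Gln.Malic.acid | High_TMAO | 0.53 | 1.00 | 1.02
High_Gly | High_TMAO | 0.98 | 1.00 | 1.02
High_Ile | High_TMAO | 0.98 | 1.00 | 1.02
High_Ile | High_TMAO | 0.98 | 1.00 | 1.02
High_Lactate | High_TMAO | 0.41 | 1.00 | 1.02
High_Lactate | High_TMAO | 0.98 | 1.00 | 1.02
High_Leu | High_TMAO | 0.98 | 1.00 | 1.02
High_malic.acid | High_TMAO | 0.16 | 1.00 | 1.02
High_Thr | High_TMAO | 0.80 | 1.00 | 1.02
High_Val | High_TMAO | 0.98 | 1.00 | 1.02
Length2 (150–200 mm) | High_TMAO | 0.31 | 1.00 | 1.02
Low_Actinobacteria | High_TMAO | 0.11 | 1.00 | 1.02
Low_Cyanobacteria | High_TMAO | 0.09 | 1.00 | 1.02
Low_Fusobacteria | High_TMAO | 0.08 | 1.00 | 1.02
Low_Ile | High_TMAO | 0.09 | 1.00 | 1.02
Low_Planctomycetes | High_TMAO | 0.11 | 1.00 | 1.02
Low_Tenericutes | High_TMAO | 0.07 | 1.00 | 1.02
Low_Tyr | High_TMAO | 0.11 | 1.00 | 1.02
NMR_class1 | High_TMAO | 0.61 | 1.00 | 1.02
NMR_class2 | High_TMAO | 0.31 | 1.00 | 1.02
Scales1 (Matte) | High_TMAO | 0.17 | 1.00 | 1.02
Season2 (June–August) | High_TMAO | 0.22 | 1.00 | 1.02
Season4 (December–February) | High_TMAO | 0.15 | 1.00 | 1.02
Food6 (Fish eater) | High_TMAO | 0.51 | 0.99 | 1.01
Tail1 (Two distinct) | High_TMAO | 0.38 | 0.99 | 1.01
Season3 (September–November) | High_TMAO | 0.36 | 0.99 | 1.01
Length1 (50–150 mm) | High_TMAO | 0.30 | 0.99 | 1.01
Bac_class2 | High_TMAO | 0.29 | 0.99 | 1.01
Habitat4 (Offshore) | High_TMAO | 0.29 | 0.99 | 1.01
Color1 (Red) | High_TMAO | 0.27 | 0.99 | 1.01
Food5 (Mollusk eater) | High_TMAO | 0.26 | 0.99 | 1.01
Low_Bacteria.Other | High_TMAO | 0.23 | 0.99 | 1.01
Habitat3 (Coast) | High_TMAO | 0.43 | 0.99 | 1.00
Ecology3 (Rockfish) | High_TMAO | 0.21 | 0.98 | 1.00
Scales3 (Shiny) | High_TMAO | 0.41 | 0.98 | 1.00
Length4 (Over 300 mm) | High_TMAO | 0.18 | 0.98 | 1.00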
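The near-constant lift values in Table 5 can be read with the standard definitions, confidence = support / P(source) and lift = confidence / P(target). The following sketch, which assumes these textbook definitions apply to the reported values and uses a hypothetical helper name `implied_prevalence`, backs out the implied frequencies of the source and the target from a few rows of the table.

```python
def implied_prevalence(support, confidence, lift):
    """Back out P(source) and P(target) from reported rule metrics.

    Uses the standard definitions:
      confidence = support / P(source)    ->  P(source) = support / confidence
      lift       = confidence / P(target) ->  P(target) = confidence / lift
    """
    return support / confidence, confidence / lift


# Values copied from Table 5 (rounded to two decimals in the source),
# given as (support, confidence, lift).
rules = {
    "Body1 (Spindle-shaped)": (0.20, 1.00, 1.02),
    "High_Ala": (0.98, 1.00, 1.02),
    "Food6 (Fish eater)": (0.51, 0.99, 1.01),
}
for source, (s, c, l) in rules.items():
    p_source, p_target = implied_prevalence(s, c, l)
    print(f"{source}: P(source)~{p_source:.2f}, P(High_TMAO)~{p_target:.2f}")
```

Under these definitions, a confidence of 1.00 with a lift of 1.02 implies that High_TMAO occurs in roughly 98% of all samples, which would explain why the lift stays close to 1 regardless of the source item.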
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Shima, H.; Sato, Y.; Sakata, K.; Asakura, T.; Kikuchi, J. Identifying a Correlation among Qualitative Non-Numeric Parameters in Natural Fish Microbe Dataset Using Machine Learning. Appl. Sci. 2022, 12, 5927. https://doi.org/10.3390/app12125927