Integration of Morphometrics and Machine Learning Enables Accurate Distinction between Wild and Farmed Common Carp

Morphology and feature selection are key approaches to address several issues in fisheries science and stock management, such as the hypothesis of admixture of Caspian common carp (Cyprinus carpio) and farmed carp stocks in Iran. The present study was performed to investigate the population classification of common carp in the southern Caspian basin using data mining algorithms to find the most important characteristic(s) differing between Iranian and farmed common carp. A total of 74 individuals were collected from three locations within the southern Caspian basin and from one farm between November 2015 and April 2016. A dataset of 26 traditional morphometric (TMM) attributes and a dataset of 14 geometric landmark points were constructed and then subjected to various machine learning methods. In general, the machine learning methods had a higher prediction rate with TMM datasets. The highest decision tree accuracy of 77% was obtained by rule and decision tree parallel algorithms, and “head height on eye area” was selected as the best marker to distinguish between wild and farmed common carp. Various machine learning algorithms were evaluated, and we found that the linear discriminant was the best method, with 81.1% accuracy. The results obtained from this novel approach indicate that Darwin’s domestication syndrome is observed in common carp. Moreover, they pave the way for automated detection of farmed fish, which will be most beneficial to detect escapees and improve restocking programs.


Life 2022, 12, 957

Introduction
The Cyprinidae clade has the broadest geographical distribution among fish families, with more than 2000 species across four continents [1]. Cyprinids contribute to over 20 million metric tons of worldwide fish production, which equates to 40% of total global aquaculture production, and 70% of total freshwater fish farming [2]. Common carp (Cyprinus carpio) is an economically important species of Cyprinidae, originally native to Central Asia and introduced worldwide over time [3]. Native common carp is found throughout all Caspian Sea drainages from north to south and from west to east, as the fish enter the rivers to breed. A dramatic stock reduction has been observed recently due to overfishing and dam construction during the last few decades. While the Iranian Fisheries Organization has practiced semi-artificial fingerling production to boost Caspian Sea fish stocks, the capture rate of Caspian carp still shows no improvement. Among several reasons accounting for the unsuccessful recovery programs of Caspian fish species, mixing events between wild and farmed populations are of utmost importance.
Investigation of the diagnostic morphological features has been taken into consideration in fisheries science and ichthyology to identify and define different species and strains [4][5][6]. The farmed stocks of common carp in Iranian farms are from the European strain, which has a deeper body form than native common carp from the Caspian Sea. Domestication, as a process in which wild animals are adapted to anthropogenic conditions, has been recognized to produce behavioral, molecular, and morphological alterations through generations [7,8]. According to the phenomenon known as Darwin's domestication syndrome [9], the captive phenotypes show distinctive traits compared with their wild conspecifics of similar sizes, such as faster growth and maturity under the nurture conditions and lower reproductive success [10] and reduced swimming performance in nature [11]. It has been postulated that the cultured carp strain may have escaped from the farms and hybridized with common wild carp in the Caspian Sea [12][13][14]. In their study, Khalili and Amirkolaie [15] found some genotypes of farmed common carp in the Caspian Sea. Mixing wild populations and/or hybridization events between farmed and native species will reduce the genetic diversity and fitness of the species [16][17][18].
Computational approaches such as machine learning, decision trees, and attribute weighting have been used in biological data processing to determine evolutionary solutions of pattern identification, classification, and prediction [19][20][21][22][23]. Decision tree models find the best possible decision from serial decisions made in uncertain conditions [24][25][26][27][28]. These robust models can be used on different sets of biological (e.g., phenotypic) data. Guisande et al. [29] successfully identified 847 marine and freshwater fish species using a machine-learning-based system (IPez), and Hnin and Lynn [30] likewise reported high accuracy and fast prediction for fish classification based on machine learning techniques. Genetic/genomic data provide helpful information on the assignment of fish populations, but morphometric data have advantages compared with molecular data, since they are relatively easier, cheaper, and faster to obtain. The application of morphometric data in robust machine-learning-based algorithms is expected to provide fast, reliable, and accurate detection in fish compared with traditional methods [31]. Hence, the present study was conducted to investigate the potential of machine learning to (i) identify the morphological variability of common carp in different habitats, and (ii) introduce the diagnostic morphometric feature(s) that distinguish the wild Caspian carp population from its farmed counterparts.

Sampling
Sixty specimens were taken from three locations in the southern Caspian basin, including Gomishan (E: 53° …). The number of annuli in scales or otoliths was not determined but, based on fish size, the age range of the specimens can be estimated at one to three years.

Data Preparation
The traditional morphometric (TMM) data, including 26 features (Figure 2), were extracted using the ImageJ Software Version 1.45s, Bethesda, MD, USA [32]. To minimize the effect of fish size on the measured morphometric characters, the allometric method of the PAST Software Version 2.17c, Oslo, Norway [33] was used on the raw morphometric data [34].

\[ M_{adj} = M \left( \frac{L_s}{L_o} \right)^{b} \tag{1} \]
where $M_{adj}$ is the adjusted measurement of size, $M$ is the observed length of each character, and $L_s$ is the overall average size of standard length. $L_o$ stands for standard height for each sample, and $b$ is related to the allometric growth coefficient. All measurements can be found in Supplementary Materials Table S1.
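The size adjustment in Equation (1) can be sketched in a few lines of Python. This is only an illustration, not the PAST implementation: it assumes that b is estimated for each character as the least-squares slope of log M on log Lo across all specimens (a common choice that the text does not confirm), and the function names and sample values are hypothetical.

```python
import math

def allometric_slope(lengths, measurements):
    """Least-squares slope of log10(M) on log10(Lo) (assumed estimator of b)."""
    xs = [math.log10(v) for v in lengths]
    ys = [math.log10(v) for v in measurements]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def size_adjust(lengths, measurements):
    """Return M_adj = M * (Ls / Lo) ** b for every specimen."""
    b = allometric_slope(lengths, measurements)
    ls = sum(lengths) / len(lengths)  # Ls: overall mean of the size variable
    return [m * (ls / lo) ** b for lo, m in zip(lengths, measurements)]

# Hypothetical values in cm, for illustration only (not data from the study).
standard_length = [20.0, 25.0, 30.0, 35.0]
head_height = [5.1, 6.2, 7.6, 8.5]
adjusted = size_adjust(standard_length, head_height)
```

If a character scaled with body size by an exact power law, every adjusted value would collapse to the same number, which is the sense in which the size effect is removed.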
In order to investigate the body form variations of the common carp under study, 14 landmark points were digitized on the left side of each specimen using tpsDig2 Version 2.16 (Figure 3).

Data Analysis
Regarding the TMM, a dataset containing 74 samples (14 from Anzali, 27 from Gomishan, 19 from Miankaleh, and 14 from the farmed population) with 26 measured features was imported into RapidMiner software Version 7.0 (Rapid-I, GmbH, Dortmund, Germany). The records were shuffled, missing data were handled, and the cleaned output file was named FCDB (final cleaned database). A one-way ANOVA was performed on the morphometric data to assess the level of variability of each trait among different locations. In order to remove the effects of non-shape information (scale, direction, and position) from the geometric morphometric data, a generalized Procrustes analysis (GPA) was performed on the landmark data using Morpho J version 1.02 [35]. After normalization, the consensus shape variations of Caspian and farmed common carp were visualized using the wireframe graphs in Morpho J. Then, the following steps of data mining analysis were performed on the FCDB datasets of both TMM and geomorph data.
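To make the role of the Procrustes step concrete, the following is a minimal pure-Python sketch of a generalized Procrustes superimposition for 2D landmarks. It is not the Morpho J implementation: landmarks are represented as complex numbers (x + yj), reflection and tangent-space projection are ignored, and all function names are hypothetical.

```python
def center_and_scale(shape):
    """Remove position and scale: centroid at the origin, centroid size 1."""
    centroid = sum(shape) / len(shape)
    centered = [p - centroid for p in shape]
    size = sum(abs(p) ** 2 for p in centered) ** 0.5
    return [p / size for p in centered]

def rotate_onto(shape, reference):
    """Rotate a centered shape to minimise its squared distance to reference."""
    s = sum(r * p.conjugate() for p, r in zip(shape, reference))
    if s == 0:
        return list(shape)
    rotation = s / abs(s)  # unit complex number = optimal rotation
    return [rotation * p for p in shape]

def generalized_procrustes(shapes, iterations=10):
    """Align all shapes to an iteratively updated mean (consensus) shape."""
    aligned = [center_and_scale(s) for s in shapes]
    reference = aligned[0]
    for _ in range(iterations):
        aligned = [rotate_onto(s, reference) for s in aligned]
        mean = [sum(points) / len(aligned) for points in zip(*aligned)]
        reference = center_and_scale(mean)
    return aligned, reference
```

After alignment, the remaining differences between configurations reflect shape only, which is what the wireframe comparison and the downstream weighting models operate on.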

Attribute Weighting
Attribute weighting is a unique method to illustrate the impact of each feature on the target or label attribute [36,37]. Ten attribute weighting algorithms, namely PCA, SVM, relief, uncertainty, Gini index, chi-squared, deviation, rule, information gain, and information gain ratio, were applied to the FCDB. Each attribute weighting method or feature selection model gives a weighted score between 0.0 and 1.0 for each attribute based on their impact on the population target feature. The attributes with a weighted score greater than 0.70 in all algorithms were considered important features. Generally speaking, the relevance of a feature to each weighting model is calculated based on the class distribution, as follows [38].
Information gain: The relevance of an attribute is evaluated by computing the information gain.
Information gain ratio: The relevance of an attribute is evaluated by computing the information gain ratio.
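As an illustration of the two criteria named above, the sketch below computes information gain and information gain ratio for a single attribute. It assumes that a continuous morphometric trait has already been discretized into categories (for example "low" and "high"); the function names are hypothetical, and this is not the RapidMiner code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy after splitting on the attribute values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute takes a single value, so it cannot split
    return information_gain(values, labels) / split_info
```

The gain ratio penalizes attributes that split the data into many small groups, which is why the two criteria can rank the same trait differently.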

Machine Learning Prediction of Target Populations
The original FCDB and the ten datasets from the attribute weighting models above were then used to develop machine-based prediction systems. The performance of each model on each dataset was measured based on their accuracy [38].

Tree Induction
Tree induction is an efficient and popular method for classifying populations. To build decision trees, four induction algorithms (decision tree, random forest, decision tree parallel, and decision stump) were applied to all 11 datasets (the FCDB and the 10 datasets generated by the attribute weighting models, which included only the important features that scored higher than 0.70; Supplementary Materials Table S1). Each tree induction algorithm was run with four different criteria (gain ratio, information gain, Gini index, and accuracy) using 10-fold cross-validation, following our previously published papers, with default parameters for the local random seed and a stratified sampling type [39][40][41][42][43]. Hence, a total of 176 trees (4 algorithms × 4 criteria × 11 datasets) were generated.
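The experimental grid described above can be enumerated directly, which also shows where the figure of 176 comes from. The second function is a simplified, unshuffled sketch of a stratified 10-fold assignment; it is an assumption about how such a split could be built, not RapidMiner's sampling routine, and the dataset labels are hypothetical names.

```python
from itertools import product

algorithms = ["decision tree", "random forest",
              "decision tree parallel", "decision stump"]
criteria = ["gain ratio", "information gain", "Gini index", "accuracy"]
weighting_models = ["PCA", "SVM", "relief", "uncertainty", "Gini index",
                    "chi-squared", "deviation", "rule",
                    "information gain", "information gain ratio"]
datasets = ["FCDB"] + [f"{name} subset" for name in weighting_models]

# One induced tree per (algorithm, criterion, dataset) combination.
runs = list(product(algorithms, criteria, datasets))

def stratified_folds(labels, k=10):
    """Deal sample indices into k folds class by class (round-robin)."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

# Class sizes as reported for the TMM dataset.
labels = (["Anzali"] * 14 + ["Gomishan"] * 27 +
          ["Miankaleh"] * 19 + ["farmed"] * 14)
folds = stratified_folds(labels)
```

With 74 specimens, each fold holds seven or eight fish, so every accuracy estimate rests on very small test sets.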

Naïve Bayes
The naïve Bayes classifier is an effective classification method even if the dataset is not very large [44]. This classifier is based on Bayes' rule of conditional probability; two algorithms (naïve Bayes and naïve Bayes kernel) were run on all 11 prepared datasets (the FCDB and the 10 generated by the attribute selection processes).
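A minimal sketch of how a naïve Bayes classifier applies Bayes' rule to numeric traits is given below. It assumes Gaussian class-conditional densities for every feature, which is only one possible choice (the kernel variant mentioned above estimates densities differently); the function names and measurements are hypothetical and are not data from the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant avoids division by zero for constant features.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one specimen."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += (-0.5 * math.log(2 * math.pi * var)
                      - (value - mean) ** 2 / (2 * var))
        return score
    return max(model, key=log_posterior)

# Hypothetical (head height, postorbital length) pairs in cm.
rows = [(8.3, 4.6), (8.1, 4.5), (8.0, 4.4),
        (7.0, 4.1), (6.9, 4.0), (7.2, 4.2)]
labels = ["farmed"] * 3 + ["wild"] * 3
model = fit_gaussian_nb(rows, labels)
```

The "naïve" part is the product over features inside `log_posterior`, which treats the traits as conditionally independent given the population.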

Linear Discriminant Analysis (LDA)
The LDA method [44] tries to separate two or more target classes by linear combinations of features. The resulting linear classifier is used to discriminate between two or more naturally occurring groups, whether with a descriptive or a predictive objective. The same 11 datasets mentioned above were fed into this model, and its accuracy was calculated. The LDA on geomorph data was performed using the Morpho J software version 1.02.
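The idea of separating classes by a linear combination of features can be illustrated with Fisher's two-class discriminant on two traits. This is a deliberate simplification of the four-class analysis described above, written in pure Python with a closed-form 2x2 inverse; the function names and measurements are hypothetical and are not data from the study.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature tuples."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def scatter_matrix(rows, mean):
    """2x2 scatter matrix of the rows around their class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        d = [x[0] - mean[0], x[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(class_a, class_b):
    """Return weights w = Sw^-1 (mean_a - mean_b) and the midpoint threshold."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_matrix(class_a, ma), scatter_matrix(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    threshold = sum(wi * (a + b) / 2 for wi, a, b in zip(w, ma, mb))
    return w, threshold

def classify(x, w, threshold, label_a="farmed", label_b="wild"):
    """Project x onto w and compare the score with the threshold."""
    score = w[0] * x[0] + w[1] * x[1]
    return label_a if score > threshold else label_b

# Hypothetical (head height, postorbital length) pairs in cm.
farmed = [(8.3, 4.6), (8.1, 4.5), (8.0, 4.7)]
wild = [(7.0, 4.1), (6.9, 4.3), (7.2, 4.2)]
w, threshold = fisher_discriminant(farmed, wild)
```

With more than two groups, the same within-class and between-class scatter matrices yield several discriminant axes (LD1, LD2, and so on) instead of a single direction.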

Attribute Weighting (Feature Selection) Models
One-way ANOVA on morphometric data showed that 24 out of 26 investigated morphometric traits differed significantly among locations (p < 0.05), the exceptions being caudal peduncle length and anal fin base length. In traditional morphometric (TMM) data, 80% of attribute weighting models allocated weights greater than 0.7 to HH1 (maximum head height); the Gini index, info gain, and info gain ratio models assigned the highest possible weight of 1.0 to this feature. A proportion of 70% of the attribute weighting models assigned weights greater than 0.7 to the PelH (pelvic fin height) feature, while POL (postorbital length), HL (head length), and PH (pectoral fin height) were identified by 50% of the models with weights above 0.7 (Table 1). The complete attribute weighting results are available in Supplementary Materials Table S2. In attribute weighting models using the geomorph dataset, landmark point 12 (related to the pectoral fin position) was given a weight higher than 0.7 by 70% of the models, followed by landmark point 5 (close to the origin of the dorsal fin), which was supported by 50% of the models with a weight above 0.7 (Table 2).

Predictions Based on Machine-Learning Algorithms
The overall performance of the 16 different tree induction models applied to the 11 datasets was less than 60% in most cases. The best performance (77%) on the basis of the TMM approach was obtained when the decision tree parallel model ran on the rule dataset with the accuracy criterion. The best performance of the decision stump model was 59%; under the decision tree model, the performance went up to 72% (see Table 3). With the Gini index criterion, the best performance on the Gini index dataset was obtained by the random forest algorithm. Based on the visualized induced tree with the highest performance on TMM (Figure 4A), the HH1 (head height) trait was placed at the root of the tree as the best feature to identify common carp populations. When HH1 was greater than 8.079 and the value of the ED feature (eye diameter) was higher than 1.44, the samples belonged to the Anzali population; otherwise, they were from the farmed group. Moreover, when the value of POL is >4.249, carp individuals with HH1 ≤ 7.824 and 7.824 < HH1 ≤ 8.079 originate from the Anzali and Gomishan populations, respectively. The Miankaleh population includes individuals with POL ≤ 4.249 and HH1 ≤ 6.335. Based on geomorph data, random forest with the accuracy criterion resulted in a maximum of 61% precision using the FCDB dataset (Figure 4B). The best performance of the naïve Bayes models on the 11 prepared datasets of the traditional and geomorph approaches was 77% and 60%, respectively, obtained when the naïve Bayes model ran on FCDB (Table 4).
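The split rules reported for the best TMM tree can be written out as a small function. This is a literal reading of the thresholds in the text, not the tree exported from RapidMiner: it assumes that the ED test applies only when HH1 exceeds 8.079, and it returns None for the branch that the text does not describe (POL ≤ 4.249 with 6.335 < HH1 ≤ 8.079). The function name is hypothetical.

```python
def classify_by_reported_rules(hh1, ed, pol):
    """Assign a population from size-adjusted HH1, ED, and POL values."""
    if hh1 > 8.079:
        # Large head height: eye diameter separates Anzali from farmed fish.
        return "Anzali" if ed > 1.44 else "farmed"
    if pol > 4.249:
        # Long postorbital length: head height separates Anzali from Gomishan.
        return "Anzali" if hh1 <= 7.824 else "Gomishan"
    if hh1 <= 6.335:
        return "Miankaleh"
    return None  # branch not described in the text
```

For example, a fish with HH1 = 8.5 and ED = 1.2 falls in the farmed group under this reading, regardless of its POL value.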

Linear Discriminant Analysis (LDA)
The overall prediction accuracy of LDA was over 81% with the FCDB of the TMM approach, while the LDA accuracy based on geometric morphometrics was only 57.9%. The best class prediction was computed for the farmed site samples, with a precision that reached 100%. The Anzali class was the second best, predicted with 87.5% accuracy but less precision (Table 5). The clustering of individual fish in the LDA model showed that the first two components of the LD explained 89% of the variation among the populations. The farmed population constituted an entirely separate group according to LD1 and LD2 (Figure 5). The ANOVA based on LD1 showed significant differences between the populations of common carp (F-value = 229.5, p < 0.001); the Gomishan and Miankaleh samples were the only pairwise comparison that did not show a significant difference (p = 0.266).

Geomorph Variations
The body form variations of common carp showed that the first two components represented 89% of the variance (PC1 = 58% and PC2 = 31%) among the populations studied; landmarks 4, 5, 11, 12, and 13 were the most variable (Figure 5). The CVA scatter plot based on the geomorph data illustrated a distribution pattern similar to the TMM approach, separating the farmed population from the Caspian carp populations (Figure S1). Comparison of body shapes between Caspian and farmed common carp populations revealed that they differed in body depth and head size (Figure 6).

Discussion
The new machine learning tools used in the present study enabled us to accurately distinguish farmed common carp from its wild counterparts in the southern Caspian Sea using morphometric information. Based on the morphological data obtained in this study, we suggest a considerable admixture structure of wild common carp in the south-southeast of the Caspian Sea, while Anzali in the southwest represented a distinct stock of the Caspian common carp. Wild population management is critically dependent on maintaining the populations' differentiation to stabilize the productivity of ecosystems as a whole [45]. Machine learning analysis is well documented in biology [46], but in aquaculture and fisheries science, this approach is still in its infancy. This study analyzed the morphometric data (traditional morphometric and geometric morphometric) taken from common carp across the southern Caspian basin using new machine learning analysis methods, including attribute weighting, decision tree, and naïve Bayes prediction. The highest accuracy and prediction power were obtained by applying these models on traditional morphometric datasets. The higher accuracy by traditional morphometrics may be due to the fact that geometric morphometric data are two-dimensional data and need to be converted to distance-like data in TMM. Based on 10 attribute weighting models, 80% of the models identified head height as the key trait contributing to variation among populations. The farmed population had a larger head height (8.19 ± 0.52 cm) compared with the wild forms (Table S3), while amongst the wild Caspian common carp, head height was larger in Anzali (7.36 ± 2.13 cm) than in Gomishan (7.03 ± 1.60 cm) and Miankaleh (6.99 ± 1.18 cm). 
This phenotype is likely linked to the domestication syndrome in farmed carp and to differences in environmental conditions between locations in the case of Anzali (a resident form of wild carp in Anzali lagoon) versus Miankaleh and Gomishan populations (Caspian carp). Domestication generates morphologic alterations leading to captive phenotypes across several generations and is accompanied by epigenetic and genetic changes [7,8,47]. Head depth enlargement and deeper caudal peduncle and body profile have been observed as typical characteristics of the captive phenotypes in steelhead trout compared with the wild counterparts [48]. Body shape variation of common carp based on geomorph data also supported a deeper body form and larger head size in farmed population compared with the Caspian form of common carp. Hence, head size, especially head height, and body depth are the main parameters that distinguish the Iranian stocks of common carp from the farmed population.
The results obtained from decision trees have categorized the fish groups correctly. The comparison between the best-obtained accuracy by decision tree (79%) and naïve Bayesian model (77%) indicates no substantial difference between these two methods of machine learning analysis in categorizing common carp populations using morphometric information. The highest accuracy obtained was 81% by LDA, which could be further improved by increasing the dataset size. Nevertheless, the farmed population was accurately identified through the current models. It seems that admixture of the wild stocks has diminished the overall accuracy, especially in the southeast population. The wild stocks of common carp across the southern coasts of the Caspian Sea have been experiencing mixing between them due to the semi-natural proliferation and restocking program. It should be noted that some individuals that have not been correctly categorized based on the location of sampling can be related to migration between sites. Several publications have mentioned the negative effects of dam constructions on marine life [49,50]. The Caspian Sea is a closed lake, and its seawater level has decreased by two meters since 1995 [51]. Dam building programs on the main drainages of the Caspian Sea and global warming are thought to be the main causes of the lowering sea level, which in turn reduces the breeding and feeding grounds of common carp, and makes mixing of wild populations more likely than before. Migration events can also be explained by the restocking program since fish are not always released in the location where they had initially been caught for reproduction. Based on the classification using cluster analysis, it can be concluded that, in the Caspian Sea, there are two phenotypically distinct and geographically separated groups of common carp: (i) one population in the west (Anzali) and (ii) a stock including Gomishan and Miankaleh populations. 
This observation is supported by the genomic structure investigation of common carp in the Caspian Sea [52]. During the past decade, landings of common carp have seen a dramatic reduction, and the LDA plot obtained in the present study indicates that the stocks of common carp are experiencing a reduction in heterozygosity. Machine-learning- and deep-learning-based analytical toolkits provide the most accurate predictions and offer practical advantages over the basic statistical models, such as easy identification of trends and patterns, continued improvement, handling of multi-dimensional and multi-variety data, and a wide range of applications [53]. While population and sub-population identification of fish species is of great importance in conservation ecology and applied ichthyology [54], most studies applying novel analytical methods such as deep learning to fish have focused on species identification. In a study performed on commercial carp species, deep-learning-based methods were applied and successfully identified four different species of farmed carp [55]. In the Triglidae family, three morphologically similar species were recognized based on morphometric data using the deep learning approach [56]. Courtenay et al. [57] have tested the potential of deep learning on the processing of morphological data to provide a hybrid approach that efficiently overcomes taphonomic equifinality in the archaeological and paleontological register.

Conclusions
To the best of our knowledge, this is the first time that machine learning algorithms have been used in fish stock management using both morphometric and geometric morphometric information. The origin of common carp individuals caught in the southern basin of the Caspian Sea was predicted with maximum accuracy by the LDA prediction model, which could be further improved using a larger dataset. The present study demonstrates that machine-learning-based methods can be successfully applied to morphometric data to accurately assign common carp specimens to farmed or wild populations. Thus, machine learning and deep learning methods have enormous potential in aquaculture, fisheries, and ecology to identify farmed escapees in wild stocks, manage restocking programs, and monitor the robustness of fish in aquaculture conditions.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/life12070957/s1, Table S1: 11 different generated datasets using attribute weighting models on the morphometric traits of common carp, Table S2: The whole results of ten attribute weighting models on traditional morphometric data of Caspian carp, Table S3: Mean ± SD for each morphometric trait of common carp per each region, Figure S1: The CVA scatter plot of farmed and Caspian carp populations based on the first two components using geomorph data (A: Anzali lagoon; P: farmed population; M: Miankaleh; G: Gomishan).