Fishes
  • Article
  • Open Access

17 December 2025

Generative Model Construction Based on Highly Rated Koi Images to Evaluate Koi Quality

1 Graduate School of Science and Technology, Niigata University, Niigata 950-2181, Japan
2 Faculty of Engineering, Niigata University, Niigata 950-2181, Japan
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Application of Artificial Intelligence in Aquaculture

Abstract

Nishikigoi are highly valued ornamental fish whose evaluation affects their market price. However, the judging criteria of exhibitions remain unclear. This study applies a generative artificial intelligence model to explore potential factors behind non-award-winning Kohaku Nishikigoi. An improved Variational Autoencoder (VAE) is developed from the standard VAE as follows: introducing perceptual loss to enhance detail, adding mask loss to maintain body shape consistency, and replacing transposed convolutions with UpSampling layers to reduce artifacts. With the improved VAE, we propose a method to evaluate a non-award-winning Koi. Specifically, because the improved VAE is trained to generate images that could potentially win competitions, inputting a non-award-winning image produces large differences between the input and output images, which reveal the visual deficiencies of the input. For the experiments, synthetic non-award-winning Koi images were created by modifying award-winning ones. The synthesized non-award-winning images were input into the improved VAE, and the generated images were obtained. Experimental results showed that the shape inconsistency measured by the Multi-layer Sliding Window metric was lower for award-winning images (0.110) than for non-award-winning images (0.141). Also, the average difference in color was smaller for award-winning Koi (4.75%) than for non-award-winning Koi (28.7%).
Key Contribution:
This study utilizes a generative model to reconstruct images of award-winning Kohaku Nishikigoi and to analyze the body and color features that contribute to Koi failing to win awards.

1. Introduction

Nishikigoi (hereafter referred to as Koi), also known as ornamental carp, are a very popular species that are believed to have originated in the Yamakoshi region of Niigata, Japan [1,2]. Their ancestor is the Magoi, which was originally raised as a food source in the heavy snow environment of the region [3,4]. During long-term breeding, individuals with color mutations appeared by chance. Through repeated trials and selective breeding, local residents developed many varieties, and it is said that more than one hundred types now exist [5,6]. Among them, Kohaku is regarded as the most basic and representative variety, characterized by a white body surface with red markings, and considered the origin of many other strains [7].
In the early years Koi did not attract wide attention, but their exhibition at the Taisho Exposition in 1914 marked a turning point [8]. Since then, they have gradually gained recognition in Japan and around the world, with some individuals sold for prices reaching several hundred thousand dollars [9,10,11]. In Japan, exhibitions are venues to display breeding results and to exchange knowledge about cultivation and improvement of varieties. They were first held in Yamakoshi and later developed into both national and regional contests. In these exhibitions, Koi are evaluated by panels of experts who score each Koi against established criteria. Of a total of one hundred points, 50% is assigned to body shape and 20% to coloration, while pattern, gracefulness, and dignity account for 10% each [12,13]. However, these professional standards remain difficult for ordinary enthusiasts to understand.
Such reliance on human expertise is also common in other fields such as agriculture and medicine [14,15,16,17]. Although computer-assisted evaluation has gradually been introduced in these areas, the judging of Koi still mainly depends on human assessment, and related computer-based analysis remains limited. Previous research has attempted to explain the mechanisms behind award-winning Koi through image analysis. For example, Domasevich et al. studied images of Kohaku by removing backgrounds and cropping fins and tails, then calculated the body length-to-width ratio [18]. They also applied the HSVA (Hue, Saturation, Value, and Alpha) model to identify red, white, and light red regions and to measure their surface coverage. These features were then analyzed using multiple linear regression to explore the reasons for awards. Although such methods provide some explanatory power, they rely heavily on manually defined image features and cannot fully represent the overall visual impression of Koi. Recently, methods based on deep learning have been widely used, in which large amounts of pre-acquired image data are preprocessed, features are extracted, and the images are classified into target categories. Convolutional Neural Networks (CNNs) are one type of deep learning model and could, in principle, distinguish between award-winning and non-award-winning Koi by learning from a large dataset of both. Images of award-winning Koi, typically with similar backgrounds and poses, are abundantly available. A major challenge, however, is that images of non-award-winning Koi are extremely scarce.
Therefore, in this study, we aim to construct a generative artificial intelligence (Generative AI) model. The proposed model is based on the Variational Autoencoder (VAE). We improve the training loss of the standard VAE by adding two terms: a perceptual loss to mitigate image blurring and a mask loss to maintain body-shape consistency. We also modify the architecture by replacing the transposed convolution layers with UpSampling layers, which contributes to producing clearer, more natural images.
In addition, we propose a method, called the differential comparison method, to judge whether a Koi image is likely to win an award. The method uses the improved VAE, which learns only from images of award-winning Koi and therefore generates an award-winning-like Koi for any input image. By comparing the difference between the input and output of the improved VAE, we can distinguish award-winning from non-award-winning Koi images: if the difference is small, the Koi is likely to win an award; otherwise, it is unlikely to.
The proposed method performs a task analogous to how humans visually evaluate and discriminate the quality of Koi. Thus, this method can serve as an auxiliary tool for Koi breeders or enthusiasts, providing reference information for improving Koi quality.

2. Dataset and Methods

2.1. Dataset

The dataset used in this study was provided by a Koi publishing house and consists of 214 images of award-winning Kohaku Koi collected from exhibitions across Japan. All images are in RGB (Red, Green, and Blue) color and saved in JPEG (Joint Photographic Experts Group) format, with an original resolution of no less than 1795 × 2512 pixels. The images retain both the blue background and the fish body, and the orientation of the Koi is kept consistent. In exhibitions, Koi are usually classified by body length in 5 cm intervals and then evaluated within each group. However, this classification results in a limited number of images in each category. To avoid this situation, all images in this study were grouped together as “award-winning Koi images.” Due to copyright restrictions, the actual dataset cannot be publicly displayed. Therefore, similar images downloaded from a public dataset [19] are shown in this paper, as in Figure 1.
Figure 1. Example of an award-winning Kohaku Koi image used in training. This image was downloaded from a public database [19].
The dataset provided by the Koi publishing house is used for training the Generative AI model described in the following sections. For the training, all input images are uniformly preprocessed. First, to reduce GPU (Graphics Processing Unit) workload, the image resolution is standardized. An analysis of aspect ratios shows that most images (208, approximately 97.2%) have a height-to-width ratio of about 1.4. Based on this, all images are resized to a resolution of 1088 × 768 pixels, which preserves structural information while meeting memory constraints. The images are then normalized by scaling pixel values to the range [0, 1] to accelerate convergence and improve training stability. No cropping or background removal is applied during preprocessing, ensuring that the model can learn complete visual information, including both the Koi and its background.
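As an illustration, the preprocessing described above can be sketched as follows. This is a minimal sketch assuming a Pillow/NumPy pipeline; the function name `preprocess` and the choice of bilinear resampling are our own illustrative assumptions, not taken from the paper's code.

```python
# Illustrative preprocessing sketch (not the authors' code): resize each
# image to 1088 x 768 pixels and scale pixel values to [0, 1].
import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 1088, 768  # resolution used in this study

def preprocess(src) -> np.ndarray:
    """Load a Koi image (path or file-like object), resize, and normalize."""
    img = Image.open(src).convert("RGB")
    # PIL's resize takes (width, height); the resampling filter here is an
    # assumption, since the paper does not state one.
    img = img.resize((TARGET_W, TARGET_H), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]
```

The resulting arrays of shape (1088, 768, 3) can then be fed to the model without cropping or background removal, matching the description above.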

2.2. Computational Environment

All experiments were conducted using an NVIDIA RTX A6000 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The models were implemented using TensorFlow (version 2.13.0, Google LLC, Mountain View, CA, USA) with CUDA (version 11.8, NVIDIA Corporation, Santa Clara, CA, USA). The operating system was Ubuntu 20.04.6 LTS (Canonical Ltd., London, UK).

2.3. Proposed Methods

2.3.1. Standard VAE Model

As described above, since only images of Koi that won awards at exhibitions are available, it is not possible to apply a CNN to classify images into “award-winning” and “non-award-winning” categories [20]. Therefore, a standard VAE, which is a kind of Generative AI model, is applied to perform generative modeling of award-winning Koi images. As presented in Figure 2, the model consists of three components: an encoder, a latent space, and a decoder. The encoder on the left side is composed of four repeating modules, each containing two convolutional layers and one pooling layer, with the input being an RGB image of 1088 × 768 resolution. The dimension of the latent space is set to 256. The decoder on the right side mirrors the encoder and is also composed of four modules, each containing two convolutional layers and one transposed convolution layer, producing an output image with the same size as the input. The detailed parameters of each layer are shown in Figure 2.
Figure 2. Detailed hyperparameters of a VAE model.
During training, a loss function is used as a combination of Mean Squared Error (MSE) reconstruction loss and Kullback–Leibler (KL) divergence, with both weights set to 1.0 [21,22]. The Adam optimizer is used with an initial learning rate of 0.0004, a batch size of 8, and a total of 100 epochs. The dataset is divided into training and validation sets at a ratio of 8:2.
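The combined training objective can be sketched as follows. This is a framework-agnostic NumPy sketch of the two loss terms with both weights set to 1.0 as described above; the actual implementation uses TensorFlow, and all names here are illustrative.

```python
# Illustrative sketch (not the authors' code) of the standard VAE loss:
# pixel-wise MSE reconstruction plus KL divergence, each weighted 1.0.
import numpy as np

def vae_loss(x, x_hat, z_mean, z_log_var, w_rec=1.0, w_kl=1.0):
    """Total loss = w_rec * MSE(x, x_hat) + w_kl * KL(q(z|x) || N(0, I))."""
    # Pixel-wise mean squared error between input and reconstruction.
    rec = np.mean((x - x_hat) ** 2)
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * np.mean(
        np.sum(1.0 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=-1)
    )
    return w_rec * rec + w_kl * kl
```

In the TensorFlow implementation the same terms would be computed with `tf` ops and minimized with the Adam optimizer at the stated learning rate of 0.0004.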

2.3.2. Perceptual Loss

In the standard VAE, pixel-wise MSE is commonly adopted as the reconstruction loss. However, MSE measures the average squared difference between the input and output images in the pixel space. This pixel-level measurement tends to lead the model to generate averaged and blurry images. Furthermore, this approach is often inconsistent with human perception of image quality. To improve the clarity of generated images and better align with human visual characteristics, perceptual loss is introduced. Perceptual loss measures the difference between generated and original images in the feature space, addressing the limitations of pixel-level reconstruction loss in preserving structure and texture [23]. In this study, a ResNet-50 model pretrained on ImageNet is employed as the feature extraction network [24]. As presented in [23], perceptual loss is calculated as the MSE between the feature maps of a convolutional layer in the ResNet-50 for an input image and the corresponding output image of the VAE. In this study, three convolutional layers close to the input of the ResNet-50 are selected to calculate the MSE, as shown in Figure 3. Their weighted sum forms the overall perceptual loss term, with weights of 3.0, 0.5, and 0.5 assigned to the three MSE terms.
Figure 3. Structure of the VAE model with perceptual loss.
With the above improvements, the final loss function consists of three components: reconstruction error (MSE), KL divergence, and perceptual loss, each with a weight of 1.0. Because perceptual loss increases the computational cost, the batch size was reduced to 4, while all other training parameters and the network architecture remained unchanged.
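The weighted perceptual loss can be sketched as below. The `feature_maps` function is a dummy stand-in for the pretrained ResNet-50 activations (so the sketch runs without TensorFlow); only the weighting scheme (3.0, 0.5, 0.5) and the MSE-in-feature-space structure follow the description above.

```python
# Illustrative sketch (not the authors' code) of the weighted perceptual
# loss: MSE between feature maps at three layers, weighted 3.0, 0.5, 0.5.
import numpy as np

LAYER_WEIGHTS = [3.0, 0.5, 0.5]  # weights for the three selected layers

def feature_maps(img):
    """Dummy stand-in for ResNet-50 activations at three early conv layers."""
    return [img, img[..., ::2, :], img[..., ::4, :]]  # toy multi-scale stub

def perceptual_loss(x, x_hat):
    """Weighted sum of per-layer MSEs computed in feature space."""
    total = 0.0
    for w, fx, fy in zip(LAYER_WEIGHTS, feature_maps(x), feature_maps(x_hat)):
        total += w * np.mean((fx - fy) ** 2)  # MSE in feature space
    return total
```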

2.3.3. Mask Loss

Perceptual loss mainly plays a role in correcting blur between input and output images, but it is also necessary to preserve body shape consistency between the input and output images. Therefore, a mask loss is incorporated into the VAE model. We introduce mask loss by referring to the conditional VAE (CVAE) originally developed by Sohn et al. [25].
First, we use an image segmentation tool to extract the contour of each original award-winning Koi image; the contoured image serves as the mask image. The mask images are binary, with white pixels representing the Koi and black pixels representing the background, as shown in Figure 4.
Figure 4. Examples of original and mask images. (a) Award-winning Koi image used for model training. This image was downloaded from a public database [19]; (b) corresponding mask image of the Koi in (a).
Second, in the output layer of the VAE, in addition to the original three-channel RGB output, an extra single-channel output is added to predict the corresponding mask image, with the two outputs produced in parallel, as shown in Figure 5. During training, the MSE between the mask of the input image and the predicted mask for the output image is calculated as the mask loss. This loss is combined with the original reconstruction loss (MSE), KL divergence, and perceptual loss to form the total loss function, which is used for backpropagation and parameter updates during training. It should be emphasized that the mask corresponding to each input image is used only for loss computation and is not provided as an input feature to the VAE model. In terms of experimental settings, aside from the addition of mask loss, the number of training epochs is increased to 150, while all other training parameters remain unchanged.
Figure 5. Workflow of mask loss computation. The VAE illustrated in the figure includes perceptual loss.
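A minimal sketch of the mask-loss computation is given below, assuming the decoder's four output channels are ordered RGB then mask; names and shapes are illustrative, not taken from the paper's code.

```python
# Illustrative sketch (not the authors' code): split the 4-channel decoder
# output into an RGB image and a predicted mask, then compute the mask loss
# as the MSE against the ground-truth binary mask.
import numpy as np

def split_outputs(decoder_out):
    """Split an (H, W, 4) decoder output into RGB and predicted mask."""
    return decoder_out[..., :3], decoder_out[..., 3]

def mask_loss(mask_true, mask_pred):
    """MSE between the binary ground-truth mask and the predicted mask."""
    return np.mean((mask_true - mask_pred) ** 2)
```

In training, `mask_loss` would be added to the reconstruction, KL, and perceptual terms to form the total loss, as described above.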
In addition, to evaluate the effect of mask loss on constraining body shape consistency, a dedicated metric is proposed in this study. The method is as follows: first, an image segmentation tool is used to annotate the bounding boxes of the Koi regions in both the input and output images. It should be noted that the bounding boxes cover only the body of the Koi and do not include the fins or tail. Next, the aspect ratio and area ratio of the Koi are calculated from the bounding boxes. Finally, by comparing the differences in aspect ratio and area ratio between input and output images, the consistency of body shape in the generated results is measured, as expressed in Equations (1) and (2).
$$\left| \frac{R_{\mathrm{in}} - R_{\mathrm{out}}}{R_{\mathrm{in}}} \right|, \tag{1}$$
$$\left| \frac{S_{\mathrm{in}} - S_{\mathrm{out}}}{S_{\mathrm{in}}} \right|, \tag{2}$$
where $R_{\mathrm{in}}$ and $R_{\mathrm{out}}$ denote the aspect ratios of the bounding boxes in the input and output images, while $S_{\mathrm{in}}$ and $S_{\mathrm{out}}$ denote their corresponding areas. This metric enables the quantitative evaluation of the effect of mask loss, avoiding reliance solely on learning curves or visual inspection for subjective judgment.
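Equations (1) and (2) can be computed directly from the annotated bounding boxes, as in the sketch below; the (width, height) tuple representation of a box is our own assumption.

```python
# Illustrative sketch (not the authors' code) of the body-shape consistency
# metric: relative differences in bounding-box aspect ratio (Eq. 1) and
# area (Eq. 2) between the input and output Koi images.
def shape_consistency(box_in, box_out):
    """Each box is a (width, height) tuple from an annotation tool."""
    w_in, h_in = box_in
    w_out, h_out = box_out
    r_in, r_out = h_in / w_in, h_out / w_out   # aspect ratios
    s_in, s_out = w_in * h_in, w_out * h_out   # areas
    ratio_diff = abs((r_in - r_out) / r_in)    # Equation (1)
    area_diff = abs((s_in - s_out) / s_in)     # Equation (2)
    return ratio_diff, area_diff
```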

2.3.4. UpSampling Layer

In the VAE model, transposed convolution layers are commonly used to progressively upsample low-resolution feature maps into high-resolution images. The advantage of transposed convolution is that its parameters can be optimized during training, which can improve texture details in the generated images. However, this process often introduces checkerboard artifacts in the outputs, reducing color quality. The presence of checkerboard artifacts in the generated images may cause them to exhibit periodic grid-like textures. This introduces discrepancies from the input images in terms of color, influencing the subsequent evaluation. To improve the color quality of generated Koi images, all transposed convolution layers in the VAE are replaced with UpSampling layers using bilinear interpolation. This improvement leaves all other training parameters and the overall model architecture unchanged.
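In Keras terms, the replacement amounts to swapping each transposed convolution for an `UpSampling2D` layer with bilinear interpolation followed by an ordinary convolution, roughly as sketched in this architecture fragment; the filter count and kernel size are placeholders, not the paper's exact values.

```python
# Illustrative decoder block (not the authors' exact architecture): bilinear
# upsampling doubles the spatial resolution without the checkerboard
# artifacts that transposed convolutions can introduce.
from tensorflow.keras import layers

def decoder_block(x, filters):
    x = layers.UpSampling2D(size=(2, 2), interpolation="bilinear")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```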

3. Experimental Results

3.1. Results by Standard VAE Model

Figure 6 shows the variation in the loss function during training, including reconstruction loss (MSE) and KL divergence on both the training and validation dataset. The horizontal axis represents the number of epochs, and the vertical axis represents the loss values. The blue curve corresponds to the training loss, while the orange curve corresponds to the validation loss.
Figure 6. Training and validation loss curves for the standard VAE.
In the standard VAE model, reconstruction loss gradually decreased with training iterations and tended to converge in the later stages. KL divergence also showed an overall decreasing trend but with large fluctuations, indicating instability during training. This phenomenon is common in VAE training and mainly arises from the difficulty of balancing KL divergence with reconstruction loss during optimization. By the 100th epoch the total loss had already reached a low level, with values of 0.0170 for the training loss and 0.0187 for the validation loss.
To further compare the performance of the model in image generation, 40 images were randomly selected from the dataset as input samples, and the outputs were compared against the original input images. For the standard VAE, the generated images successfully reproduced the overall outline of the Koi, including fins, tails, and background regions, but the results remained blurry. In particular, the white regions were not reconstructed correctly, and only the red patterns were preserved. These results indicate that although the standard VAE demonstrates some ability in structural reconstruction, it still shows clear limitations in restoring fine details and accurately reproducing color distributions. To objectively assess the quality of the generated images, the Natural Image Quality Evaluator (NIQE) was employed [26]. NIQE is a no-reference image quality metric in which lower scores indicate better visual quality. The generated images achieved an average NIQE score of 13.55 across the 40 samples.

3.2. Results with Perceptual Loss

The training results of the VAE model with perceptual loss are shown in Figure 7, which presents the total loss, reconstruction loss (MSE), KL divergence, and perceptual loss. The meaning of the axes is consistent with those in the standard VAE. As shown in the learning curves, the total loss values of this model are higher than those of the standard VAE (0.7805 for the training set and 0.7311 for the validation set), mainly because perceptual loss accounts for a large proportion of the overall value. This can be seen from the fact that the scale of the vertical axis for perceptual loss in Figure 7 is an order of magnitude different from those for reconstruction loss and KL loss. When excluding the perceptual loss component, the reconstruction loss (MSE) of this model is comparable to that of the standard VAE.
Figure 7. Training and validation loss curves for the VAE with perceptual loss.
The same 40 images as in Section 3.1 were selected as input samples, and the generated results were visualized for analysis. The results demonstrate that the VAE with perceptual loss achieves better performance in both structural fidelity and color reproduction. The model not only preserves the overall body outline of the Koi but also reconstructs the red and white regions more accurately. The distribution of red and white patches in the generated images does not exactly match the originals, showing a degree of recombination, which indicates the model’s ability to capture and reorganize image features. Concurrently, a NIQE evaluation was performed on these same 40 images, mirroring the standard VAE experiment. The results show an average NIQE score of 10.96, which is lower than the 13.55 achieved by the standard VAE. This reduction in the NIQE score objectively signifies an improvement in the perceptual quality of the generated images.

3.3. Results with Mask Loss

Figure 8 shows the training results after introducing mask loss into the model, with the axes having the same meaning as in the previous experiments. Due to the addition of mask loss, the overall loss was higher than in the first two experiments (0.8332 for the training loss and 1.0652 for the validation loss). However, the reconstruction loss, perceptual loss, and mask loss all gradually converged, indicating that the training process was stable.
Figure 8. Training and validation loss curves for the VAE with both perceptual loss and mask loss.
To verify the effect of mask loss on maintaining body shape consistency, the proposed evaluation method was applied to compare the model with perceptual loss only and the model with both perceptual loss and mask loss. For verification, 15 images were randomly selected from the dataset and input into the two models. The differences in aspect ratio and area between input and output images were computed, with the average values used as the evaluation metrics. The results showed that for the aspect ratio difference of Equation (1), the mean value decreased from 0.073 with perceptual loss alone to 0.043 with the addition of mask loss. For the area difference of Equation (2), the mean value decreased from 0.103 to 0.036. These findings demonstrate that introducing mask loss improves the consistency of body shape between input and output images.

3.4. Results with UpSampling Layer

Figure 9 shows the training results when the transposed convolutional layers are replaced with UpSampling layers as explained in Section 2.3.4. Since only the method of enlarging feature maps was changed, there is no significant modification to the overall network structure, and the training outcomes did not differ greatly from the previous experiments. The total loss reached 0.7923 for the training set and 1.2000 for the validation set.
Figure 9. Learning curves of the VAE after replacing transposed convolution layers with UpSampling layers.
To evaluate the generation quality, we compared the output images for the same inputs between the model using transposed convolution layers and the model using UpSampling layers. With the same 40 images as in Section 3.1 and Section 3.2, NIQE evaluation yielded an average score of 9.13, lower than the 10.96 achieved by the model with perceptual loss in Section 3.2. Since a smaller NIQE indicates better image quality, the results show that using UpSampling layers yields generated images with better color quality. Therefore, this final enhanced model, incorporating perceptual loss, mask loss, and UpSampling layers, was adopted to further verify its performance on non-award-winning Koi images.

4. Comparative Experiment

4.1. Preparation of Non-Award-Winning Koi Images

Since the dataset used in this study was provided by a Koi image publishing house which only photographs and records award-winning Koi from exhibitions, images of non-award-winning Koi were not available. To evaluate the performance of the trained model on non-award-winning Koi, artificial samples were constructed from award-winning Koi images, categorized into two types: body shape defects and color defects.
For the body shape defect images, graphics editing software was used to edit Koi images by manually erasing parts of the body or fins. This design was based on the judging criteria of exhibitions, where body symmetry and fin completeness are considered important factors. Thirty award-winning Koi images were randomly selected, of which fifteen were modified to have incomplete bodies and another fifteen to have incomplete fins. Examples are shown in Figure 10 and Figure 11.
Figure 10. Examples of an original image and a body shape defect image. The original image was downloaded from a public database [19]. (a) Original Koi image; (b) Koi image with a body shape defect (part of the left side of the body is erased).
Figure 11. Examples of an original image and a fin defect image. The original image was downloaded from a public database [19]. (a) Original Koi image; (b) Koi image with a right fin defect.
For the color defect images, an image annotation tool was first used to obtain masks of the red regions. Then, we converted the award-winning Koi images from RGB to HSV color space to adjust their color. Concretely, the saturation (S) channel of the red regions was reduced by 30% to simulate a lack of vivid coloration. This procedure was guided by the exhibition criteria that emphasize bright and vivid red coloration. Fifteen award-winning Koi images were selected to create color defect images. An example color defect image is shown in Figure 12.
Figure 12. An example of an original image and a color defect image. The original image was downloaded from a public database [19]. (a) Original image; (b) color defect image with a 30% reduction in saturation in the red area.
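The saturation reduction can be sketched as follows, assuming the image has already been converted to an HSV array in [0, 1] and a boolean mask of the red regions is available; the function and argument names are illustrative, not taken from the paper's code.

```python
# Illustrative sketch (not the authors' code): reduce the saturation (S)
# channel of the red regions by 30% to synthesize a color-defect image.
import numpy as np

def make_color_defect(hsv_img, red_mask, reduction=0.30):
    """hsv_img: (H, W, 3) HSV array in [0, 1]; red_mask: (H, W) boolean."""
    out = hsv_img.copy()
    s = out[..., 1]                    # view of the saturation channel
    s[red_mask] *= (1.0 - reduction)   # dull only the masked red regions
    return out
```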

4.2. Differential Comparison Method and Experiments

Our goal is to determine whether a Koi image is likely to win an award, using a model trained only on images of award-winning Koi. The decision process, called the differential comparison method, is shown in Figure 13. First, award-winning Koi images are input into the final trained model from Section 3.4 to obtain the corresponding outputs. Next, non-award-winning Koi images are input into the same model under the same conditions to obtain their outputs. Because the model is trained only on award-winning Koi, even when an image of a non-award-winning Koi is input, the output tends to resemble an award-winning Koi. Therefore, when comparing the input-output differences of the two groups, the Koi with the smaller difference can be said to be the one more likely to win an award. Conversely, if the difference between input and output is large, there is a clear gap from award-winning quality, and it is possible to analyze whether the cause lies in body shape or color.
Figure 13. The differential comparison method. Award-winning Koi images were first input into the model to generate corresponding outputs, followed by non-award-winning Koi images input into the same model to obtain their outputs. By comparing the differences between input and output images, the potential reasons for Koi not winning can be further analyzed.
Based on the differential comparison method described above, we conducted a judgment experiment. Because all the images in the database are of award-winning Koi, the non-award-winning Koi images were those artificially created in Section 4.1.
In the experiment of body shape analysis, non-award-winning Koi images were divided into two categories: body defects and fin defects, and each group was compared against the corresponding award-winning Koi images. To eliminate color interference and focus on body differences, this study compares the mask of the input image with the mask of the generated output image, rather than comparing the RGB images directly. The input masks were manually obtained using an annotation tool, while the output masks were generated directly by the model.
Generally, the Intersection over Union (IoU) and the Dice coefficient are indices used to evaluate the similarity of objects in two images; they are suitable for evaluating overlap across the entire image [27]. However, they are inadequate here, since the defect regions occupy only a small fraction of the image (approximately 4.40% to 11.48% for body defects and even less for fin defects). IoU and the Dice coefficient tend to dilute the discrepancies caused by local defects with the large overlapping background regions, making it difficult to effectively distinguish between award-winning and non-award-winning images.
To address this issue, this study proposes a Multi-layer Sliding Window evaluation metric. Specifically, spatially corresponding local windows of identical size are applied to both the input and output masks. Starting from the top-left corner of the image, the absolute pixel difference within each window is calculated and divided by the window area to obtain a local difference score. Subsequently, the window slides across the entire image with a stride of half the window size. To balance sensitivity to defects of varying scales with noise robustness, window sizes were set to 32, 64, 128, and 256 pixels. Using only large windows risks overlooking minute defects (like the limitations of IoU or Dice), while using only small windows makes the metric overly sensitive to minor generation errors (noise). The multi-layer approach effectively balances these factors.
Formally, let $D = |M_{\mathrm{in}} - M_{\mathrm{out}}|$ be the pixel-wise absolute difference map between the ground-truth and predicted masks, where $M_{\mathrm{in}}$ is the mask of the input image and $M_{\mathrm{out}}$ is the mask of the output image. For a given scale $s \in S$, where $S = \{32, 64, 128, 256\}$ is the set of window sizes, we denote by $V_s$ the set of normalized sliding-window scores derived from $D$; each element of $V_s$ is the sum of $D$ over a window divided by the window area $s^2$. The scale-specific difference score $S_s$ is defined as the average of the top $\alpha\%$ of values in $V_s$, to focus on sparse defects:
$$S_s = \mathbb{E}\left[\, v \mid v \in V_s,\; v \ge P_{1-\alpha}(V_s) \,\right], \tag{3}$$
where $P_k(\cdot)$ denotes the $k$-th percentile function and $\alpha$ is an error-tolerance threshold, set to 5% in this study. The final score is the average across all scales in $S$:
$$\text{Final-Score} = \frac{1}{|S|} \sum_{s \in S} S_s, \tag{4}$$
where $|S|$ is the number of elements in the window-size set, which is 4 in this study.
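The Multi-layer Sliding Window score can be sketched as follows, following the description above (stride of half the window size, normalization by window area, top 5% of window scores per scale, average over the four scales); edge handling for windows that do not fit exactly is our own assumption.

```python
# Illustrative sketch (not the authors' code) of the Multi-layer Sliding
# Window score of Equations (3) and (4).
import numpy as np

SCALES = (32, 64, 128, 256)
ALPHA = 0.05  # error-tolerance threshold: keep the top 5% of window scores

def window_scores(diff, size):
    """Normalized absolute-difference score of each window at one scale."""
    stride = size // 2  # windows slide by half the window size
    h, w = diff.shape
    scores = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            scores.append(diff[y:y + size, x:x + size].sum() / (size * size))
    return np.array(scores)

def msw_score(mask_in, mask_out):
    """Average, over all scales, of the mean of the top 5% window scores."""
    diff = np.abs(mask_in.astype(float) - mask_out.astype(float))
    per_scale = []
    for s in SCALES:
        v = np.sort(window_scores(diff, s))
        k = max(1, int(np.ceil(ALPHA * len(v))))  # top 5%, at least one window
        per_scale.append(v[-k:].mean())           # Equation (3)
    return float(np.mean(per_scale))              # Equation (4)
```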
The reason for selecting the top 5% is based on an analysis of the window score distribution. As shown in Figure 14 (using a window size of 32 as an example), the histogram on the left displays the scores on the horizontal axis and their frequency on the vertical axis, indicating that many window scores are concentrated near 0. To observe the distribution more clearly, the chart on the right presents the data in logarithmic scale. It can be observed that many windows fall within the 0 to 0.2 score range, corresponding to minor noise in the image generation process. In contrast, high scores representing actual defects are distributed only on the far right. Therefore, adopting the top 5% threshold effectively filters out this noise.
Figure 14. Distribution of Multi-layer Sliding Window Difference Scores (Window Size: 32).
In the experiments, the non-award-winning group consisted of 30 images (15 body defects and 15 fin defects), and 30 award-winning images were selected for comparison. Since the Multi-layer Sliding Window scores for the two groups did not follow a normal distribution, the Mann–Whitney U test was employed for statistical analysis [28]. The null hypothesis assumed no significant difference between the distributions of the two groups, while the alternative hypothesis assumed a significant difference. This test was used to determine whether the difference in body shape consistency between the award-winning and non-award-winning groups was statistically significant.
In the color analysis, 15 color defect images and 15 award-winning Koi images were input into the model, and their outputs were compared. Since color defects were generated by reducing the saturation of the red regions, histograms of the red regions in both input and output images were computed, and the average saturation differences were calculated. A paired t-test was then applied to the results of the award-winning and non-award-winning groups. The null hypothesis assumed no significant difference between the mean saturation values of input and output images, while the alternative hypothesis assumed a significant difference. This analysis was conducted to assess whether the differences in color consistency between award-winning and non-award-winning Koi images were statistically significant.

4.3. Comparative Results

In the body shape analysis, the case of body defects was first examined, using the mean Multi-layer Sliding Window score defined by Equations (3) and (4) as the evaluation metric. For the 15 award-winning Koi images, the mean Multi-layer Sliding Window score was 0.110, while that for the body defect images was 0.141. A Mann–Whitney U test yielded p = 0.014, which is less than 0.05, suggesting a statistically significant difference in body shape consistency between the two groups. In the case of fin defects, the mean Multi-layer Sliding Window score of award-winning Koi images was 0.119, compared with 0.149 for non-award-winning Koi images, with a Mann–Whitney U test result of p = 0.045, also less than 0.05. These results indicate that the model partially repaired the defects of the input images and that the proposed differential comparison method can detect statistically significant body shape differences between award-winning and non-award-winning Koi images.
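Since Equations (3) and (4) are not reproduced in this section, the sketch below is only a plausible stand-in for the Multi-layer Sliding Window score: it tiles two binary body masks with fixed-size windows, scores each window by its mean absolute difference, and averages the top 5% of window scores. The window size, stride, and aggregation are assumptions for illustration:

```python
import numpy as np

def sliding_window_score(a, b, window=32, stride=32, top_frac=0.05):
    """Mean of the top-5% per-window absolute differences between two
    binary body masks a and b (a guessed stand-in for Eqs. (3)-(4))."""
    scores = []
    h, w = a.shape
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            pa = a[y:y + window, x:x + window]
            pb = b[y:y + window, x:x + window]
            scores.append(np.abs(pa - pb).mean())
    scores = np.sort(np.asarray(scores))
    k = max(1, int(len(scores) * top_frac))
    return float(scores[-k:].mean())  # average of the worst windows

# Identical masks score 0; a mask with a notch scores higher.
mask = np.zeros((128, 128)); mask[32:96, 16:112] = 1.0
notched = mask.copy(); notched[32:48, 40:72] = 0.0  # simulated body defect
print(sliding_window_score(mask, mask), sliding_window_score(mask, notched))
```

Under this construction a lower score means the output body shape tracks the input more closely, matching the direction of the reported group means.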
In the color analysis, differences in the saturation of red regions between input and output images were compared. For non-award-winning Koi images, the originally dull red regions were enhanced in the outputs, bringing them closer to the vivid coloration characteristic of award-winning Koi. For quantitative analysis, saturation histograms of the red regions were plotted for both award-winning and non-award-winning Koi images and their corresponding outputs, as shown in Figure 15. The horizontal axis represents saturation values (converted from the range 0–1 to 0–255 for computation), and the vertical axis represents pixel counts. The mean saturation of the red regions was calculated for each image and then averaged across the 15 samples. The mean saturation of the red regions in award-winning Koi images was 207.62, while in their outputs it decreased to 197.76, a 4.75% difference (paired t-test, p = 0.00088 < 0.05). In contrast, for non-award-winning Koi images, the mean saturation of the red regions increased from 148.58 to 191.24 after reconstruction, a 28.7% difference (paired t-test, p = 9.8 × 10⁻¹³ < 0.05). The differential comparison method could thus also detect statistically significant color differences between award-winning and non-award-winning Koi images.
Figure 15. Comparison of saturation histograms of red regions in award-winning and non-award-winning Koi images. (a) Histogram of red regions for award-winning Koi image and its output, where blue indicates input image and orange indicates output; (b) histogram of red regions for non-award-winning Koi image and its output, with the same color scheme as in (a).
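The red-region saturation measurement can be sketched directly in NumPy: compute per-pixel saturation as (max − min)/max over the RGB channels, select roughly-red pixels, and rescale the mean to 0–255. The red-mask thresholds here are assumptions for illustration, not the paper's exact segmentation:

```python
import numpy as np

def red_region_mean_saturation(rgb):
    """Mean saturation (0-255 scale) over roughly-red pixels of an RGB
    image with values in [0, 1]. The red-pixel thresholds below are
    assumed for illustration only."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    sat = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0)
    red_mask = (r > 0.5) & (r > g + 0.15) & (r > b + 0.15)  # assumed mask
    if not red_mask.any():
        return 0.0
    return float(sat[red_mask].mean() * 255.0)

# A dull red patch vs. a vivid red patch (toy 2x2 images).
dull = np.full((2, 2, 3), [0.7, 0.35, 0.35])   # low-saturation red
vivid = np.full((2, 2, 3), [0.9, 0.1, 0.1])    # high-saturation red
print(red_region_mean_saturation(dull), red_region_mean_saturation(vivid))
```

Comparing this statistic between an input image and its reconstruction mirrors the before/after saturation comparison summarized in Figure 15.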

5. Discussion

In this study, the standard VAE and its improved variants were trained to generate Koi images that could potentially receive awards in exhibitions. The results showed that while the standard VAE was able to reconstruct the overall body structure, the outputs were generally blurry and inadequate in color reproduction, especially in restoring white regions. This phenomenon is related to the design of the VAE loss function: MSE tends to produce blurry averaged images during optimization, while KL divergence further suppresses the representation of rare features in the latent space, such as white regions. Since red regions dominate the dataset, the model more easily learns and retains red features, leading to insufficient reconstruction of white areas.
Introducing perceptual loss into the standard VAE effectively alleviated the blurring problem. By measuring differences between generated and original images in feature space, perceptual loss guided the model to focus more on structural and textural information. In particular, the use of shallow features from the early layers of a pretrained ResNet-50 improved the recognition and reconstruction of white regions. Although the addition of perceptual loss increased the overall loss values, the generated images showed clear improvements in the distinction between red and white regions as well as in structural completeness. Furthermore, the introduction of mask loss constrained the consistency of body shape between inputs and outputs. Evaluations with the specifically designed body shape consistency metric demonstrated that mask loss made the input and output body shapes more similar, effectively mitigating inconsistency issues. Experiments replacing transposed convolution layers with UpSampling layers also showed that the colors of the outputs were closer to those of the inputs, with improved clarity and the avoidance of checkerboard artifacts. Compared with traditional approaches based on handcrafted features, this generative modeling framework better preserved global image information, especially in reconstructing details such as fins and tails.
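A body-shape constraint of the kind described can be sketched as a soft Dice loss between predicted and target body masks, in the spirit of V-Net [27]. Whether the study's mask loss uses exactly this formulation is an assumption, since the equation is not reproduced in this section:

```python
import numpy as np

def dice_mask_loss(pred_mask, target_mask, eps=1e-6):
    """Soft Dice loss between two body masks: 0 for identical masks,
    approaching 1 for disjoint ones. An assumed formulation, after
    V-Net; not necessarily the paper's exact mask loss."""
    inter = (pred_mask * target_mask).sum()
    denom = pred_mask.sum() + target_mask.sum()
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice)

# Identical masks incur no penalty; a sideways-drifted body is penalized.
body = np.zeros((64, 64)); body[16:48, 8:56] = 1.0
shifted = np.roll(body, 6, axis=1)  # output body drifted 6 px sideways
print(round(dice_mask_loss(body, body), 4), round(dice_mask_loss(body, shifted), 4))
```

Because the penalty grows with the mismatched mask area, adding such a term to the VAE objective pushes the decoder toward outputs whose silhouette tracks the input, which is the behavior the body shape consistency metric measures.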
In the comparative experiment, two types of non-award-winning Koi images were created, representing body shape defects and color defects, corresponding to the major judging criteria of exhibitions. Qualitative analysis revealed that regardless of whether the input contained body or color defects, the model exhibited a clear tendency to generate outputs with award-winning characteristics. Quantitative analysis showed that in both the body defect and fin defect experiments, the mean final Multi-layer Sliding Window score of the award-winning group was consistently lower than that of the non-award-winning group. Furthermore, the Mann–Whitney U test confirmed that this difference reached statistical significance in both cases (p < 0.05).
In the color analysis, comparisons of histograms and mean saturation values further revealed differences between award-winning and non-award-winning Koi images. Results showed that award-winning Koi images still exhibited differences between input and output in red regions (p < 0.05), suggesting that the model has room for improvement in reproducing red coloration. In contrast, for non-award-winning Koi images, the saturation of red regions in the outputs increased significantly, producing a more vivid visual effect. This indicates that when processing non-award-winning Koi images, the model actively improved color features, making them closer to the standards of award-winning Koi.

6. Conclusions

In this study, the standard VAE was improved by introducing perceptual loss and mask loss and by replacing transposed convolution layers with UpSampling layers, enhancing the quality of Koi image generation. The experimental results demonstrated that the improved model was able to generate Koi images with features characteristic of award-winning Koi at exhibitions and could also partially restore artificially constructed non-award-winning Koi images.
We also developed a differential comparison method that compares the inputs and outputs of both award-winning and non-award-winning Koi images. This method could detect differences in body shape and color, as confirmed by statistical analysis using the Mann–Whitney U test and paired t-test. The results indicate that the model strengthened features associated with award-winning Koi during the generation process.
A primary limitation of this study is the size of the training dataset, which comprised only 214 award-winning Koi images. Obtaining high-quality photographs of Koi is inherently difficult due to the subjects being living, moving creatures. Consequently, to mitigate the risk of overfitting resulting from limited training data, a relatively simple VAE model was adopted and subsequently refined. Furthermore, a limitation of the comparative experiments is that the non-award-winning images used were artificially constructed, and the quantity of images for each category was only 15 samples.
Overall, this work demonstrates the feasibility of using generative models to explore the mechanisms of non-award-winning Koi and provides a new perspective for applying AI techniques to assist in Koi evaluation.

Author Contributions

Conceptualization, J.G., T.Y. and Y.I.; methodology, J.G. and T.Y.; software, J.G.; validation, J.G. and T.Y.; formal analysis, J.G.; investigation, J.G.; data curation, J.G.; writing—original draft preparation, J.G.; writing—review and editing, T.Y. and Y.I.; visualization, J.G.; supervision, T.Y.; project administration, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The training dataset used in this study is not publicly available due to copyright restrictions imposed by the provider (Kinsai Publishing Co., Ltd.). However, the representative images used for visualization in Figure 1, Figure 4, Figure 10, Figure 11 and Figure 12 were obtained from the publicly available Kaggle dataset “Dataset Images Koi” (https://www.kaggle.com/datasets/farizp/dataset-images-koi (accessed on 30 October 2025)) and can be accessed under the terms specified by the original source.

Acknowledgments

The authors sincerely thank Kinsai Publishing Co., Ltd. for providing the Nishikigoi images. The authors also thank the members of the Yamazaki Laboratory at Niigata University for their helpful discussions and encouragement.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kottelat, M.; Freyhof, J. Handbook of European Freshwater Fishes; Publications Kottelat: Cornol, Switzerland, 2007; pp. 147–148.
  2. Daniel, W.M.; Morningstar, C.R.; Procopio, J. Cyprinus rubrofuscus Lacepède, 1803; U.S. Geological Survey, Nonindigenous Aquatic Species Database: Gainesville, FL, USA, 2000.
  3. Balon, E.K. Origin and Domestication of the Wild Carp, Cyprinus carpio: From Roman Gourmets to the Swimming Flowers. Aquaculture 1995, 129, 3–48.
  4. Kodama, T. What Is a Nishikigoi? Koi Fish History Explained, Meaning, & Japanese Significance. Kodama Koi Farm. 2018. Available online: https://www.kodamakoifarm.com/what-is-nishikigoi-history-meaning/ (accessed on 30 September 2025).
  5. De Kock, S.; Gomelsky, B. Japanese Ornamental Koi Carp: Origin, Variation and Genetics. In Biology and Ecology of Carp; Pietsch, C., Hirsch, P., Eds.; CRC Press: Boca Raton, FL, USA, 2015; pp. 27–53.
  6. Varieties of Nishikigoi. Zen Nippon Airinkai. Available online: https://zna.jp/en/keep-nishikigoi/varieties-of-nishikigoi/ (accessed on 30 September 2025).
  7. Kuroki, T. The Latest Manual to Nishikigoi, 9th ed.; Shin Nippon Kyoiku Tosho Co., Ltd.: Shimonoseki, Japan, 1990.
  8. The Birth and History of Nishikigoi—All Japan Nishikigoi Promotion Association. 2024. Available online: https://jnpa.info/en/nishikigoi/history/ (accessed on 2 October 2025).
  9. Tamadachi, M. The Cult of the Koi; T.F.H. Publications: Neptune City, NJ, USA, 1994.
  10. Pond Informer: How Much Are Koi Fish Worth? Available online: https://pondinformer.com/how-much-are-koi-fish/ (accessed on 2 October 2025).
  11. Kodama Koi Farm. How Much Do Koi Fish Cost? Koi Pricing Guide for USA. 2023. Available online: https://www.kodamakoifarm.com/how-much-do-koi-fish-cost/ (accessed on 2 October 2025).
  12. De Kock, S.; Watt, R. Koi: A Handbook on Keeping Nishikigoi; Firefly Books: Richmond Hill, ON, Canada, 2006; ISBN 9781554072156.
  13. Hoshino, S.; Fujita, S. Nishikigoi Mondo; International Nishikigoi Promotion Centre, Ed.; NABA Corporation: Tokyo, Japan, 2009.
  14. Iyoubi, E.M.; El Boq, R.; Izikki, K.; Tetouani, S.; Cherkaoui, O.; Soulhi, A. Revolutionizing Smart Agriculture: Enhancing Apple Quality with Machine Learning. Data Metadata 2024, 3, 592.
  15. Nguyen, N.H.; Michaud, J.; Mogollon, R.; Zhang, H.; Hargarten, H.; Leisso, R.; Torres, C.A.; Honaas, L.; Ficklin, S. Rating Pome Fruit Quality Traits Using Deep Learning and Image Processing. Plant Direct 2024, 8, e70005.
  16. Miyachi, Y.; Ishii, O.; Torigoe, K. Design, Implementation, and Evaluation of the Computer-Aided Clinical Decision Support System Based on Learning-to-Rank: Collaboration between Physicians and Machine Learning in the Differential Diagnosis Process. BMC Med. Inform. Decis. Mak. 2023, 23, 26.
  17. Kong, Z.; Kong, D.; Kong, J.; Xing, Y.; Liang, P. The Performance Evaluation of the AI-Assisted Diagnostic System in China. BMC Health Serv. Res. 2025, 25, 1179.
  18. Domasevich, M.A.; Hasegawa, H.; Yamazaki, T. Quality Evaluation of Kohaku Koi (Cyprinus rubrofuscus) Using Image Analysis. Fishes 2022, 7, 158.
  19. Fariz, P.P. Dataset Images Koi. Kaggle, 2022. Available online: https://www.kaggle.com/datasets/farizp/dataset-images-koi?select=Dataset (accessed on 2 October 2025).
  20. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551.
  21. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
  22. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  23. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV) 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711.
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  25. Sohn, K.; Yan, X.; Lee, H. Learning Structured Output Representation using Deep Conditional Generative Models. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
  26. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212.
  27. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
  28. Chicco, D.; Sichenze, A.; Jurman, G. A Simple Guide to the Use of Student’s t-test, Mann–Whitney U test, Chi-squared test, and Kruskal–Wallis test in Biostatistics. BioData Min. 2025, 18, 56.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
