Feature Normalization Reweighting Regression Network for Sugar Content Measurement of Grapes

Abstract: The measurement of grape sugar content is an important index for classifying grapes by quality. Owing to the correlation between grape sugar content and appearance, non-destructive measurement is possible using computer vision and deep learning. This study investigates the quality classification of the Red Globe grape. Three times more grapes were collected in the 15~16% range than in the <14% or >18% ranges. This study presents a framework named feature normalization reweighting regression (FNRR) to address this imbalanced distribution of sugar content in the grape datasets. The experimental results show that the FNRR framework can measure the sugar content of a whole bunch of grapes with high accuracy using typical convolutional neural networks and a visual transformer model. Specifically, the visual transformer model achieved the best accuracy with a balanced loss function, with a correlation coefficient R = 0.9599 and a root mean squared error RMSE = 0.3841%. The results show that the visual transformer model outperforms the convolutional neural networks. The findings also indicate that the visual transformer model based on the proposed framework can accurately predict the sugar content of grapes, enables non-destructive evaluation of grape quality, and could provide reference values for grape harvesting.


Introduction
The measurement of grape sugar content, and thus grape quality, is a primary activity in packaging, preservation, storage, and transportation. The sugar content of grapes relates both to grape quality and to the determination of storage time before consumption. As the table grape industry develops towards large-scale, centralized production, an objective and non-destructive testing method is required to monitor the grading of grape quality in the supply chain and meet the needs of large-scale production [1].

Related Work
Phenotypic characteristics, such as the degree of fruit coloration and concentration, are important indicators for judging table grapes' maturity and quality [2,3]. Grape color is caused by the accumulation of anthocyanins in the peel. Anthocyanins are a type of water-soluble pigment, and their presence is positively correlated with fruit coloration [2].
Red light irradiation during plant growth increases the degree of fruit coloring, and the sugar content increases accordingly [3]. This demonstrates that the color concentration of grape peel has a positive correlation with sugar content. The correlation coefficients between anthocyanin content and the fruit color index of the four grape cultivars "Jingxiu," "Juxing," "Fujiminori," and "Yongyou" were 0.821, 0.946, 0.996, and 0.857, respectively; the correlation coefficients between anthocyanins and sugar content were 0.804, 0.953, 0.932, and 0.911, respectively [3]. The visual characteristics of table grapes reflect the physiological characteristics of a bunch of grapes during the growth process; hence, this provides a research basis for predicting the sugar content of grapes from visual features.
Computer vision technology has been widely used in agricultural production because it is fast and non-destructive compared to manual detection [4,5]. Computer vision detection methods are based on grape color and shape, feature fusion, and deep learning. Information from the color channels of the L*a*b* color space, translated from RGB images, was applied to extract color features, which have potential for apple sugar content prediction [6]. A combination of visible-range image processing, image feature extraction, a hybrid imperialist competitive algorithm, and artificial neural network regression yielded a squared correlation coefficient (R²) of 0.843 ± 0.043 for pH on a test set of Thomson navel oranges [7]. These experiments show that color features can indicate internal fruit quality. In [8], an RGB image of oranges was obtained using a camera, and color, shape, and texture features were extracted from the image and used as input to a neural network model that predicted sugar content and pH; the results showed that neural networks could determine both with high accuracy solely from the appearance of the fruit. A multiple linear regression model of the Red Globe grape's color features and sugar content was established to classify boxes of Red Globe grapes in [9], where R² was 0.84 and the RMSE was 0.82%. Together, these methods demonstrate the feasibility of using computer vision to detect the internal quality of fruit non-destructively.

Contribution
In this study, we divided the grape dataset into multiple categories based on the imbalanced distribution of the grapes' sugar content. A deep convolutional neural network, a transfer learning method, and a transformer model are adopted to extract image features of the grapes from the input data. We then present a computer vision system for the non-destructive testing of table grapes. The system was further improved by optimizing the loss function, which increased the accuracy on the test dataset. The major contributions are as follows: (a) A feature normalization reweighting regression network (FNRR-Net) is proposed for grape datasets with imbalanced distributions, which predicts the sugar content of grapes with a high degree of confidence. (b) We group and label the datasets over different sugar content intervals and propose a balanced loss function under the visual transformer model to accommodate the imbalanced grape datasets. (c) The non-destructive measurement of a grape's sugar content is efficient, economical, and convenient.

Sample Collection
The samples comprised Red Globe grapes harvested from a local market in Wuhan, China. Imitating the industrial inspection process, we designed a grape image acquisition system (Figure 1) consisting of the following components: a panel, an electric control track, an electric control hook, a camera, a brightness adjustment knob, a computer, and four LED lamps. Image acquisition was performed between 9 a.m. and 6 p.m. in the laboratory, where the temperature was approximately 25 °C. The panel was used as the background, and the electric control track transported the electric control hook, with grapes attached, to the front of the panel, while the brightness adjustment knob controlled the brightness of the chassis by controlling the lamps. Each LED lamp was 30 cm long, with a power of 7.9 W and a color temperature of 6500 K during photographing; the electric control track was located in the center of the four LED lamps, at a vertical distance of 30 cm. The computer used a JAI AD080GE camera to take pictures of the grapes, and the resolution of the obtained images was 1024 × 768 pixels. The distance between the grapes and the camera was 35 cm, and the distance between the grapes and the background plate was 15 cm. The grapes were numbered for imaging and sugar content measurement. A saccharometer is an instrument that measures sugar content based on the principle of optical refraction; it is commonly used to measure the sugar content of fruit juice. This study used a VBS2T/ATC saccharometer with a measurement precision (Brix%) of 0.2. In total, 228 pictures were collected using the image acquisition system; 205 were used to train the model, and the model was tested on the remaining 23.

Image Extraction
Foreground segmentation was performed to better extract the features from the grape images (shown in Figure 2). U-Net [10], a deep segmentation model, was used for the segmentation of the table grapes; its structure consists of an encoder (left), a decoder (right), and skip connections. The encoder contains four sub-modules; each consists of two convolutional layers, both followed by a ReLU activation function. By downsampling the original image with 2 × 2 maximum pooling, the network encodes the main features of the input image. The decoder is symmetrical to the encoder and contains four sub-modules; however, the maximum pooling layer is replaced with upsampling, and the image resolution is sequentially increased until the final output size matches the input size. The skip connections transfer the upsampling information to the corresponding layer of the downsampling part, reusing the learned features to achieve more accurate decoding.

Appl. Sci. 2022, 12, x FOR PEER REVIEW

Figure 2. Grape semantic segmentation images using U-Net.
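The 2 × 2 maximum pooling used in the U-Net encoder can be illustrated with a minimal pure-Python sketch (no framework assumed; names are illustrative, not the paper's code):

```python
# 2x2 max pooling with stride 2: each output pixel is the maximum of a
# non-overlapping 2x2 block of the input, halving each spatial dimension.
def max_pool_2x2(img):
    return [[max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
             for j in range(0, len(img[0]) - 1, 2)]
            for i in range(0, len(img) - 1, 2)]

pooled = max_pool_2x2([[1, 3, 2, 0],
                       [4, 2, 1, 5],
                       [0, 1, 3, 2],
                       [6, 2, 1, 4]])  # -> [[4, 5], [6, 4]]
```

Each downsampling step in the encoder applies this operation after the two convolutional layers, so the feature maps shrink while the strongest responses are kept.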

Label Grouping with Specific Interval
The grape dataset has an imbalanced distribution in terms of sugar content: the collected grapes' sugar content (Brix%) ranged from 12.9 to 18.6, and there are three times more grape samples in the 15~16% range than in the <14% or >18% ranges. Standard data processing usually assumes that the number of samples in each category is approximately uniformly distributed; this is called category balance [11]. To address the non-uniform distribution, we divided the grape dataset so that the data were more uniformly distributed, using intervals of 0.4%, 0.6%, 0.8%, and 1% (Figure 3). Since the data in a given interval are close to the label we define for it, we can obtain the final regression result by operating on the label vector. This division method is simple and effective for the proposed FNRR network. We compared the performance of our method on the different interval divisions of the data to determine which yielded the best performance. Image rotation was used for data augmentation [10] to give each group the same amount of data; left and right rotation was performed on the existing grape segmentation images, with rotation angles between −20° and 20°.
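The interval grouping described above can be sketched in a few lines. The following pure-Python snippet (hypothetical helper names, not the paper's code) bins sugar readings into fixed-width intervals and takes each group's mean as its category label:

```python
# Group sugar readings (Brix%) into fixed-width intervals starting at the
# minimum observed value; each category's label is the mean of its members.
def group_labels(readings, interval, lo=12.9):
    groups = {}
    for r in readings:
        idx = int((r - lo) // interval)  # index of the interval the reading falls in
        groups.setdefault(idx, []).append(r)
    return {i: sum(v) / len(v) for i, v in sorted(groups.items())}

labels = group_labels([13.0, 13.2, 15.5, 15.7, 18.4], interval=0.6)
```

The number of categories then follows from the covered Brix range; the paper reports 14, 9, 7, and 6 categories for the 0.4%, 0.6%, 0.8%, and 1% interval widths.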


FNRR Network
In this experiment, we propose an innovative deep learning framework, FNRR, shown in Figure 4, for the non-destructive prediction of the sugar content of grapes. The framework is composed of two modules: (1) the feature extraction module, which obtains the feature vector from convolutional neural networks or transformer models, and (2) the normalization function, which produces a categorical distribution from the previous layer's output. The entries of the labeled vector are the average labels of each category, and the result is obtained as the expectation of the normalized vector against the labeled vector. The training process is described in Algorithm 1.

Algorithm 1. Training the feature normalization reweighting regression models (Epoch is the number of iterations during training).
1 for epoch = 1 to Epoch do
2   Read a segmented grape image
3   Put the image into the deep learning model to obtain the feature maps of the image
4   Put the feature maps into the FC layer to obtain the matrix
5   Normalize the matrix to obtain the normalized vector
6   Compute the output vector from the normalized vector and the label vector
7   Average the elements of the output vector to obtain the final Brix value of the image
8 end for

Due to the non-uniform distribution of the data, the direct use of a multiple linear regression model imposes a data prior: the model predicts data in the sparse intervals as if they were in the denser intervals. Because the characteristics of the predicted data are more similar to the data in adjacent intervals, grouping allows us to enrich the characteristics of the sparse-interval data to a certain extent. We divided the grape image dataset into 14, 9, 7, and 6 categories with sugar content intervals of 0.4%, 0.6%, 0.8%, and 1%, respectively, and labeled each category with an appropriate sugar content value. We used typical deep learning-based models to obtain the features. A feature map is obtained by the convolution layer or the attention mechanism. The convolution layer is specified by its kernel size, stride, padding, input channels, and output channels; for details, see Section 2.3.1. For transformer models, the image is processed by layers of encoder blocks based on the self-attention mechanism to obtain the feature maps; the self-attention mechanism is described in Section 2.3.3. The last layer of the model is a fully connected layer whose size is the number of categories. The output of the fully connected layer is passed through a normalization function, which assigns a probability to each category label; the prediction is then calculated as the expected value with the vector of the category labels.
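The FNRR head described above (normalization of the FC output into a categorical distribution, followed by an expectation against the label vector) can be illustrated as follows. This is a sketch under the paper's description, with illustrative logits and an assumed 0.6%-interval label vector rather than actual trained outputs:

```python
import math

def softmax(z):
    # numerically stable softmax: the "normalization function" of FNRR
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fnrr_predict(fc_output, label_vector):
    probs = softmax(fc_output)                              # normalized vector
    return sum(p * l for p, l in zip(probs, label_vector))  # expectation = Brix value

label_vector = [13.4, 14.0, 14.6, 15.2, 15.8, 16.4, 17.0, 17.6, 18.2]
brix = fnrr_predict([0.1, 0.2, 0.5, 2.0, 4.0, 1.0, 0.3, 0.1, 0.0], label_vector)
```

Because the prediction is a probability-weighted average of the category labels, it varies continuously between labels rather than snapping to one category, which is what turns the classification-style head into a regression output.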

Convolutional Neural Network
A convolutional neural network is a typical feedforward neural network [12] and is particularly prominent in image recognition research. The basic architecture of a convolutional neural network usually consists of three parts: The convolutional layer, the pooling layer, and the fully connected layer. The purpose of the convolutional layer is to learn the input sample characteristics [13,14], while the pooling layer, which is also called a downsampling operation, aims to control the spatial distortion of the data and reduce the resolution of the feature map [15]; the pooling layer is usually located between two convolutional layers. After the convolutional neural network has gone through several convolution and pooling layers, one or more fully connected layers will be set.
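As an illustration of the convolution operation described above, here is a minimal single-channel "valid" convolution in pure Python (unit stride, no padding; purely for exposition, not the networks used in the experiments):

```python
# Slide the kernel over the image and take the elementwise product-sum at
# each position; output size shrinks by (kernel size - 1) per dimension.
def conv2d(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

# A vertical-edge kernel responds strongly at the boundary between regions.
edge = conv2d([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [1, 1, 0, 0]],
              [[1, -1],
               [1, -1]])  # -> [[0, 2, 0], [0, 2, 0]]
```

Stacking such learned kernels, interleaved with pooling, is what lets the convolutional layers extract progressively more abstract sample characteristics.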

Transfer Learning
Transfer learning uses the correlation between multiple domains to transfer the knowledge learned in the source domain to the target domain to assist in training the target model [16]. By combining source domain knowledge and target domain data, the amount of target-domain data the model requires is reduced, and the model can use knowledge learned from other related tasks to assist the target task. Deep neural networks automatically learn features from the original data; the lower layers learn detailed information, such as textures, while the higher layers learn rich semantic information. Low-level features can be shared between many tasks; hence, transfer learning models can be reused directly. During the experiment, the model was pre-trained on the ImageNet dataset and used as a feature extraction network to extract features from the input grape images [17].

Transformer Model
A transformer model is a deep neural network, usually based on a self-attention mechanism, that was initially applied in the field of natural language processing. Inspired by its powerful representation capabilities, it was extended to computer vision tasks [18]. For visual tasks, convolutional neural networks have the advantages of inductive bias, translational equivariance, and locality; however, the model requires more convolutional layers to expand the receptive field. The self-attention mechanism, in contrast, captures long-range information and performs well on different visual tasks [19]. In Figure 5, the input grape image is divided into fixed-size patches, and patch embedding is performed using a linear transformation, with the patch embedding carrying the image information and the position embedding carrying the positional information of the patches; the resulting sequence of token embeddings is taken as the transformer model's input [19]. The self-attention mechanism aggregates information by assigning different weights to the input information according to the current query. For a given query vector, k key vectors are matched using inner products; the obtained inner products are then normalized using the softmax function to obtain k weights, and the output of the query's attention is the weighted average of the value vectors corresponding to the k key vectors [20]. The matrix form of this calculation is given by

Attention(Q, K, V) = softmax(QK^T / √d_k) V,    (1)

where Q, K, and V are the query, key, and value matrices of the self-attention mechanism, and d_k is the dimension of K [20].
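The scaled dot-product attention of Equation (1) can be sketched directly in plain Python (single head, matrices as nested lists; an illustrative sketch, not the ViT implementation used in the experiments):

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q K^T: query-key inner products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)                       # weighted average of the values

out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
```

Each output row is a convex combination of the value rows, which is exactly the "weighted average of the value vectors" described in the text; the √d_k scaling keeps the softmax inputs in a reasonable range as the key dimension grows.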
The transformer uses multi-head self-attention: self-attention is applied to the input sequence split into h sub-sequences, the outputs of the h different heads are concatenated together, and the final output is obtained using a linear transformation [18]. The formula is defined as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),    (3)

where head_i is the attention head obtained from Equation (1) [20], and W^O, W_i^Q, W_i^K, and W_i^V are
parameter matrices of the model. In this experiment, we use pre-trained transformer models to extract image features. Since the normalized feature is obtained from a classification model, we propose a multi-label loss function and compare the plain mean squared error loss with a loss combining the mean squared error and cross-entropy, as described in Equations (4) and (5):

Loss1 = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,    (4)
Loss2 = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² − λ (1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} p_{ij} log p̂_{ij},    (5)

where y_i is the reference value, ŷ_i is the model estimate, p_{ij} is the probability of the j-th category reference value, p̂_{ij} is the probability distribution of the j-th category model estimate, n is the number of images in one batch, and m is the size of the categories. In this work, we find that the widely used MSE loss (Loss1) can be ineffective in imbalanced regression, so we propose a new balanced loss function from an imbalance perspective to adapt to the label distribution of the imbalanced training set. The balanced loss function (Loss2) has two parts: the first part is the standard MSE loss, and the second part is a new balancing term. With the data we constructed, each group achieved a balanced quantity after augmentation. Based on our proposed FNRR framework, we introduced the cross-entropy balancing term with the hyperparameter λ = 0.05, obtained from multiple experiments.

Experimental Setting and Evaluation
In this study, we tested the performance of the FNRR framework on the four grape datasets using three methods: deep convolutional neural network retraining, transfer learning, and a transformer model. The experiment was conducted on Ubuntu 16.04 with an NVIDIA GeForce GTX 1080 Ti. R and RMSE are mainly used to evaluate the performance of each model based on the FNRR framework, and MaximumError is used for reference. A higher R indicates that the model is more stable, and a lower RMSE indicates greater model accuracy. MaximumError is the largest absolute difference between the label truth and the predicted value, given as a reference. These values are defined as follows:

R = σ_yŷ / (σ_y σ_ŷ),    (6)
RMSE = √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²),    (7)
MaximumError = max_i |y_i − ŷ_i|,    (8)

where y_i is the reference value, ŷ_i is the model estimate, σ_yŷ is the covariance between y and ŷ, σ_y and σ_ŷ are the respective standard deviations of y and ŷ, and max is the function that picks the maximum value from the data.
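The three evaluation metrics above can be computed in a few lines of plain Python (an illustrative sketch; in practice a library such as scikit-learn offers equivalent functions):

```python
import math

def metrics(y, y_hat):
    # Pearson correlation R, RMSE, and MaximumError between reference
    # values y and model estimates y_hat.
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat)) / n
    sy = math.sqrt(sum((a - my) ** 2 for a in y) / n)
    sh = math.sqrt(sum((b - mh) ** 2 for b in y_hat) / n)
    r = cov / (sy * sh)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n)
    max_err = max(abs(a - b) for a, b in zip(y, y_hat))
    return r, rmse, max_err
```

For a perfect predictor, R is 1 while RMSE and MaximumError are 0; on the test set reported later, the best model reaches R = 0.9599 and RMSE = 0.3841%.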

Analysis of Convolutional Neural Network Retraining Results
AlexNet [21] is a classic convolutional neural network; it was retrained end-to-end on the small datasets. The results showed that as the interval of the Brix measure was reduced, the RMSE became smaller. AlexNet yielded the best performance for an interval of 0.4%, with R = 0.9022 and RMSE = 0.6061%. As Table 1 shows, the unstable results are not suitable for practical application due to the limited feature expression ability; transfer learning, however, can provide more feature expression capability.

Analysis of Transfer Learning Results
The transfer learning method used in this study was an end-to-end model; this allowed the model to learn how to extract key features, and it proved effective for small samples. The transfer learning networks ResNet50 [22] and InceptionV3 [23] were used in this experiment. Under the transfer learning method, the fluctuation ranges of R and RMSE were relatively small on all four datasets (Table 1). For ResNet50, the differences between the maximum and minimum R and RMSE were 0.0381 and 0.0891%, respectively. For InceptionV3, the differences between the maximum and minimum R and RMSE were 0.0313 and 0.1499%, respectively. InceptionV3 obtained its best performance with intervals of 0.4%, with R = 0.9120 and RMSE = 0.6174%; the ResNet50 model obtained its best performance with intervals of 0.4%, with R = 0.9067 and RMSE = 0.6095%.

Analysis of Transformer Model Result
In this experiment, four transformer models were used for training: ViT-B_32, ViT-B_16, ViT-L_32, and ViT-L_16. The accuracy on the test set is shown in Table 2. Figure 6 shows the R of the four transformer models, based on the average of the results of all four data division methods. Smaller patch sizes yield a better score with more layers, and the models performed better overall with a patch size of 16. These results show that for grape samples of small size, the smaller the patch size, the better the model is at feature extraction. Table 2 suggests that three transformer models achieved the best results on the datasets with an interval of 0.6%, while the remaining models achieved the best results on the datasets with an interval of 0.8%. Comparing the optimal model with the other models, there was a difference of 0.009 and 0.0019% in the R and RMSE values for the datasets with an interval of 0.6%. This suggests that it was easier to obtain optimal results for the datasets with an interval of 0.6%. Table 2 also shows that, for the two models with a patch size of 16, the proposed Loss2 achieves higher accuracy on the test set than the traditional Loss1, and the dataset with the 0.8% interval achieved the best results.
In comparison with Loss1, Loss2 increased R by 0.0444 and reduced RMSE by 0.182%. ViT-L_16 obtained the best performance on the datasets with a 0.8% interval, with R = 0.9599 and RMSE = 0.3841%. The effect of the transformer model is significantly better than that of the classic convolutional neural network: R increased by 0.0577 and RMSE was reduced by 0.2220% compared to the AlexNet model, which achieved the best result among the typical convolutional neural network models (Table 3). The visual analysis corresponding to Table 3 is shown in Figure 7. In addition, the visual transformer models and the typical convolutional neural network models were each consistent on a specific partition: the typical convolutional neural network models achieved the best performance on the 0.4% interval, while the visual transformer models obtained optimal results more easily for the datasets with an interval of 0.6%.

Conclusions
This study used deep learning technology to predict the sugar content of Red Globe grapes. Deep convolutional neural network retraining using AlexNet, transfer learning using InceptionV3 and ResNet50, and the transformer model can preserve the integrity of the grapes while accurately and effectively predicting the sugar content. Furthermore, we designed a grape image acquisition system. However, the collected grape dataset presented an imbalanced distribution. We proposed the FNRR framework to address this issue and demonstrated that it performs well on small samples of grapes with an imbalanced distribution. In addition, under our proposed multi-label loss function, and among all prediction models, the FNRR framework in conjunction with the ViT-L_16 transformer and the dataset categorized with intervals of 0.8% yielded the optimal performance, with R = 0.9599 and RMSE = 0.3841%. Using transformer models to predict the sugar content of grapes is a promising method for the non-destructive quality testing and grading of grapes. In future studies, we aim to enrich our dataset to make it applicable to other fruits and to apply the FNRR framework to industrial production.