Defect Detection in Food Using Multispectral and High-Definition Imaging Combined with a Newly Developed Deep Learning Model

Abstract: The automatic detection of defects (cortical fibers) in pickled mustard tubers (Chinese Zhacai) remains a challenge. Moreover, few papers have discussed detection based on the segmentation of the physical characteristics of this food. In this study, we designate cortical fibers in pickled mustard as the target class, while considering the background and the edible portion of pickled mustard as other classes. We attempt to realize an automatic defect-detection system to accurately and rapidly detect cortical fibers in pickled mustard based on multiple images combined with a UNet4+ segmentation model. A multispectral (MS) sensor covering nine wavebands, with a resolution of 870 × 750 pixels and an imaging speed over two frames per second, and a high-definition (HD) imaging system with a resolution of 4096 × 3000 pixels were used to obtain MS and HD images of 200 pickled mustard tuber samples. An improved image fusion method was applied to fuse the MS images with the HD images. After image fusion and other preprocessing steps, each image contained a single target; 150 images were randomly selected as the training data and 50 images as the test data. Furthermore, a segmentation model called UNet4+ was developed to detect the cortical fibers in the pickled mustard tubers. Finally, the UNet4+ model was tested on three types of datasets (MS, HD, and fusion images), and the detection results were compared based on Recall, Precision, and Dice values. Our study indicates that the model can successfully detect cortical fibers within about 30 ± 3 ms for each type of image. Among the three types of images, the fusion images achieved the highest mean average Dice value of 73.91% for the cortical fibers. We also compared the UNet4+ model with the UNet++ and UNet3+ models using the same fusion data; the results show that our model achieved better prediction performance in terms of Dice values, i.e., 9.72% and 27.41% higher than those of the UNet++ and UNet3+ models, respectively.


Introduction
Food quality and safety, as foundational pillars of public health, societal stability, and development, are of paramount significance in our society [1]. Food defects are one of the most important reasons for reduced food quality. Therefore, the importance of reliable defect-detection techniques is increasing due to the growing demand for improved food quality and safety [2]. In recent years, the detection of food defects was predominantly performed through manual processes, with workers positioned alongside conveyor systems, visually inspecting defects in processed food items, such as spoilage, injuries, diseases, bruises, etc. [3]. However, using this approach, efficiency rapidly declines after several hours of continuous inspection. It is necessary to develop more effective, accurate, and rapid food defect-detection methods to replace manual inspection systems.
Many effective technologies have been adopted for food defect-detection purposes [4], including electronic noses [5], X-ray [6], ultrasound [7], thermal imaging [8,9], fluorescence spectroscopy [10], terahertz imaging [11], and spectroscopy [12,13]. Each of these methods has its merits and restrictions. For instance, although X-ray systems can capture high-resolution images, they struggle to detect low-density objects, such as plastics and wood fibers. Terahertz imaging is highly accurate when employed for food detection; however, the high cost of terahertz equipment and protracted data acquisition times limit the viability of this technology in factory line operations [14].
Spectral imaging, including hyperspectral (HS) and multispectral (MS) imaging techniques, can be used to acquire visual details of foods, making it possible to determine size, shape, texture, color, and other invisible information. HS imaging technology, by integrating spectral and imaging features, provides heterogeneous information [15] which can effectively capture food-quality characteristics. However, the development of real-time detection using HS imaging has run into a bottleneck due to its inefficiency in terms of acquiring and processing hundreds of continuous and narrowband HS images [16]. MS imaging is an alternative which overcomes this problem, as it not only exhibits high efficiency in processing narrow-band images of discrete spectral ranges, but also enables the acquisition of images at certain multi-wavelengths across the electromagnetic spectrum [17]. Considering its advantages, the MS imaging technique is recognized as a superior method of meeting the speed demands for food image processing [18].
Generally, food defect detection requires automatic image recognition and machine vision methods after image acquisition. In recent years, with the advancement of computer-related technologies, machine learning (ML) has shown great advantages and potential in the field of machine vision [19]. However, ML requires the extraction of a lot of features from images to optimize its parameters in order to produce good results. In this regard, HS and MS images can provide sufficient information. Therefore, spectral imaging technology combined with ML has been widely used to detect defects in food, including meat [20], seeds [21], and vegetables [22]. As a subfield of ML, deep learning (DL) stands out as a powerful approach in various research fields, including natural language processing [23] and medical imaging diagnoses [24]. DL was noted as one of the 10 breakthrough technologies in the MIT Technology Review [25]. Undoubtedly, the application of DL will be an inevitable trend in agriculture applications in the future [26]. For instance, the authors of [27] exploited various popular DL models for the classification of sunflower diseases. They then performed a comparative analysis of their classification results. The findings demonstrated the efficacy of DL models in terms of accurately identifying categories of sunflower diseases. The authors of [28] evaluated and analyzed various classification models and detection techniques for citrus fruit diseases. The authors noted that ML and DL models are widely applied in the detection of citrus fruit diseases. The authors of [29] introduced an automatic detection model for various types of potato defects based on multispectral data and the YOLOv3-tiny model. A comparative analysis was conducted with other deep learning models, demonstrating the effectiveness of the proposed approach in terms of accurately detecting different types of defects in potatoes. In this study, we introduce the DL technique in a food engineering experiment, i.e., defect detection using MS and HD images. Compared to traditional or manual methods, the application of DL improves the detection accuracy and decreases cost.
Fuling mustard tuber, renowned for its fresh, fragrant, crisp, and tender qualities, has gained worldwide acclaim and is exported to over 50 countries and regions, including Russia, Japan, Singapore, and South Africa. This food is one of the most famous pickled vegetables (alongside Chinese Zhacai, European pickled cucumber, and German sauerkraut) [30]. According to relevant data (https://www.huaon.com/channel/trend/841302.html (accessed on 14 November 2023)), the packaged pickled mustard tuber market in China has shown an upward trend, increasing from 37.8 billion RMB in 2013 to 82.9 billion RMB in 2021. The sales volume of pickled mustard tuber in China has been on the rise in recent years, increasing from 186,000 tons in 2013 to 334,000 tons in 2021. In the period from January to August 2022, China exported 15,101 tons of pickled mustard tuber, with an export value of 128.751 million RMB. Fuling pickled mustard tuber had the highest market share, accounting for 31% of the Chinese pickled mustard tuber market. However, the cortical fibers contained in mustard greatly affect its taste. Moreover, it is difficult and inefficient to remove cortical fibers manually, because the cortical fiber is very small and has no regular shape. Therefore, the question of how to remove cortical fibers has troubled people for a long time. A fast and simple approach is urgently needed to detect and remove cortical fibers in mustard. To address this issue, this paper employed a recent DL method, based on improved convolutional neural networks, to realize the real-time detection of cortical fibers. The main contributions of this paper are as follows:
(1) We achieved cortical fiber detection based on the segmentation of the physical features of food using deep learning. Most past studies have focused on food classification, calorie estimation, and quality detection. However, few papers have discussed the segmentation of the physical characteristics of food. Especially for complex and indistinct defects in foods, traditional methods have been found to be completely ineffective. Therefore, to contribute to this field, this study took mustard as an example to realize the semantic segmentation of the physical features of food through deep learning.
(2) An improved fusion method with guided filtering was used to fuse MS images and HD images. The sigmoid function was introduced to normalize weights for the generation of suitable fusion images. The method was shown to be capable of integrating features from multiple source images, making less conspicuous defects in food images appear more distinct, thereby aiding in the identification of defects. The detailed structure of the proposed method will be discussed in Section 2.2.
(3) A novel image segmentation model for the extraction of cortical fibers, named UNet4+ and based on the semantic segmentation models UNet++ and UNet3+, is proposed. The model employs multiscale semantic connections and dense convolutional layers, enabling the extraction of fine-grained, intricate, and deep-level characteristics of the target object. This results in superior performance for the detection of complex objects compared to conventional models. This approach can therefore facilitate more effective detection of objects similar to cortical fibers in pickled mustard tubers. The detailed structure of the proposed technology will be discussed in Section 2.3.
(4) We compared the performance of the proposed model based on MS, HD, and Fusion images. Detailed results and discussion can be found in Section 3.2.
(5) We compared the recognition results of our model with those of relevant segmentation models (UNet++, UNet3+) based on the data in this paper. The detailed results and discussion can be found in Section 3.5.

Image Acquisition
In this study, 50 MS raw images and 50 HD raw images (with each image including four pickled mustard targets) were acquired in a mustard processing plant. Figure 1a shows the mosaic multispectral (MS) imaging camera, i.e., a high-speed mosaic multispectral imaging camera produced by Championoptics (Changchun, China). Each MS image was made up of nine spectral bands (with band 1 to band 9 being 620 nm, 638 nm, 657 nm, 683 nm, 711 nm, 730 nm, 755 nm, 779 nm, and 816 nm, respectively). The resolution of each image is 750 pixels in width and 870 pixels in height. The advantages of this mosaic-type MS camera are the direct and fast acquisition of digital images of high quality. The imaging speed is over two frames per second, which can sufficiently satisfy the needs for industrial use. The HD images were collected by a high-definition (HD) camera (iRAYPLE, Hangzhou, China, Figure 1b) with a resolution of 4096 (width) × 3000 (height).
The MS camera and HD camera were mounted on a stand above a conveyor belt at a height of approximately 0.5 m so that we could obtain data synchronously. During data acquisition, the pickled mustards were cleaned and sliced into pieces 2-3 mm thick; these were then placed on the conveyor belt in rows of two slices, perpendicular to the forward direction of the belt. Then, the conveyor belt moved the pickled mustard slices to the cameras at an appropriate speed. After imaging, each view contained four slices. Figure 2 shows a schematic diagram of data acquisition. Figure 3 shows sample images of pickled mustard tuber obtained with the HD camera and MS camera. Once the model had detected cortical fiber, a high-pressure water jet on a robotic arm autonomously separated those fibers, mimicking the manual process.

Image Fusion
Image fusion is a technology in computer vision that is commonly employed to aggregate valuable features from multiple images to generate a new image with multiple features derived from the amalgamation of those features. In the field of remote sensing, images exhibit varying resolutions. Multispectral images typically have lower resolution, while panchromatic images possess higher resolution but are composed of a single spectral band. Therefore, images with high resolution and multispectral images need to be fused to obtain fusion images that combine both high resolution and multispectral characteristics to assist in the subsequent processing of remote sensing images. In this study, the MS images possessed obvious spectral characteristics of cortical fibers, while HD images had high resolution. Therefore, to obtain images with multispectral features and high resolution, we attempted to use a fusion method to fuse MS and HD images.
Image fusion with guided filtering [31] is a rapid and powerful method that can extract substantial relevant data from source images to create a new fusion image with enhanced informational content. The method involves a two-stage image decomposition, which breaks each image down into a base layer, encompassing the coarse-grained structure, and a detail layer, capturing the fine-grained details. A guided filter [32] is adopted to smooth the weights, allowing for the effective fusion of the base and detail layers with spatial consistency. The weights of the base layers and detail layers are described as follows:

$$w_i^{\mathrm{base}} = \sigma\left(w^{\mathrm{base}}\right) \tag{1}$$

$$w_i^{\mathrm{detail}} = \sigma\left(w^{\mathrm{detail}}\right) \tag{2}$$

where $w^{\mathrm{base}}$ and $w^{\mathrm{detail}}$ represent the weights of the base layers and detail layers, respectively, smoothed by the guided filter according to the method presented in [31]; $w_i^{\mathrm{base}}$ and $w_i^{\mathrm{detail}}$ are the final weight maps of the base and detail layers for the $i$th source image, respectively; and $\sigma$ is the sigmoid function [33]. Usually, the sigmoid function is employed in the activation layers of a neural network; its output range is 0-1. Here, the sigmoid function is used to normalize the weight values such that they sum to one at each pixel $k$. The sigmoid function is described as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

Finally, the base and detail layers from the different source images are multiplied by the corresponding weight maps and fused through weighted averaging:

$$\overline{\mathrm{base}} = \sum_{i} w_i^{\mathrm{base}} \, \mathrm{base}_i \tag{4}$$

$$\overline{\mathrm{detail}} = \sum_{i} w_i^{\mathrm{detail}} \, \mathrm{detail}_i \tag{5}$$

where $\mathrm{base}_i$ and $\mathrm{detail}_i$ denote the base and detail layers of the $i$th source image. Then, the fused image Fusion is obtained by combining the fused base layer $\overline{\mathrm{base}}$ and the fused detail layer $\overline{\mathrm{detail}}$:

$$\mathrm{Fusion} = \overline{\mathrm{base}} + \overline{\mathrm{detail}} \tag{6}$$
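The fusion steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `box_blur` and `fuse` are hypothetical, a simple mean filter stands in for the base-layer decomposition, and the raw weight maps are taken as given inputs (in the method of [31] they would come from saliency maps smoothed by a guided filter, which is not reproduced here).

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def box_blur(img, radius=1):
    """Mean filter with edge clamping (assumed stand-in for the base-layer filter)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def fuse(sources, raw_w_base, raw_w_detail, radius=1):
    """Fuse single-channel images: sigmoid-weighted base layers plus detail layers.

    sources, raw_w_base and raw_w_detail are lists of 2D lists with equal shapes;
    the raw weight maps are hypothetical inputs for this sketch.
    """
    h, w = len(sources[0]), len(sources[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for src, wb, wd in zip(sources, raw_w_base, raw_w_detail):
        base = box_blur(src, radius)  # coarse structure
        for y in range(h):
            for x in range(w):
                detail = src[y][x] - base[y][x]  # fine structure
                fused[y][x] += (sigmoid(wb[y][x]) * base[y][x]
                                + sigmoid(wd[y][x]) * detail)
    return fused
```

Note that the sketch applies the sigmoid exactly as described and adds no further normalization. For two sources, the sigmoid outputs sum to one at a pixel when the two raw weights are negatives of each other (for example, both zero gives 0.5 and 0.5), in which case fusing an image with itself returns that image unchanged.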

Defect Detection Based on UNet4+
The model (which we named UNet4+) was inspired by existing medical image segmentation models, i.e., UNet++ [34] and UNet3+ [35]. UNet++ is a segmentation architecture using nested and dense skip connections, while UNet3+ establishes tighter connections through all scales, combining low-level details with high-level semantics from feature maps of different scales. Our model structure is as follows:
1. Encoder: Visual Geometry Group Network-16 (VGG-16) serves as the backbone of the entire network, namely, $X^{j,0}$ ($j \in [0, 4]$). Layers $X^{0}$ and $X^{1}$ are blocks of two convolutional layers, while the remaining layers are blocks of three convolutional layers. Furthermore, the convolved data from layers $X^{1}$ to $X^{4}$ are upsampled and provided to the decoding layers.

2. Decoder: There are several decoding layers in the network, which can obtain extensive information from different scales. As shown in Figure 4, each decoding block fuses the data from its adjacent block with the upsampled data from the block at its lower left. Every two encoding blocks plus a decoding block can be considered as a small UNet network [36]. In addition, skip connections are used in the network where more than two decoding blocks are involved, in order to connect coarse-grained and fine-grained information, which can help the network model learn more useful knowledge.

3. Skip connection: In this network, to capture more effective information, we drew inspiration from UNet3+ and added multiscale skip connections to the network. To make the training more efficient, these skip connections associate high-level semantic information with low-level information (such as color, border, and texture) throughout the encoding and decoding process of the network.
Figure 5 shows the process of multiscale skip connection and how the feature maps of $X^{0,3}$ and $X^{1,2}$ are constructed. As in UNet, decoder layer $X^{0,3}$ directly receives the feature map from the same-scale decoder layer $X^{0,2}$ and the upsampled result from the higher-scale layer $X^{1,2}$, which deliver low-level information and high-level semantic information, respectively. Moreover, a series of multiscale skip connections pass the higher-level semantic information from encoder layer $X^{3,1}$ and decoder layer $X^{2,1}$ by using bilinear interpolation, with different scale factors selected according to the different expansion scales. A 3 × 3 convolution then follows to update the number of channels and reduce the quantity of unnecessary information.

4. Deep supervision [37,38]: Similar to UNet++, this model uses deep supervision, which concurrently minimizes detection error and improves the directness and transparency of the hidden layer learning process; it is implemented with a 1 × 1 convolution. The final result is produced by adding the information from decoder layers $X^{0,1}$, $X^{0,2}$, $X^{0,3}$, and $X^{0,4}$.

5. The differences between our model and other models: In comparison to UNet3+, the proposed model simplifies lateral connections and introduces vertical multiscale connections. This design choice aims to enable the model to capture more complex fine-grained features across different scales. Unlike UNet3+, we employed a multilayer dense convolutional network, allowing the model to extract features on various scales. The features obtained on different scales are then fused through the dense convolutional network, enhancing the model's ability to achieve superior results when handling complex targets.
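To make the multiscale skip connections more concrete, the following sketch does the bookkeeping for the feature-map sizes and the bilinear scale factors. It rests on assumptions that are not stated in the text: each encoder stage halves the spatial size (as VGG-16 pooling does), the input is the 256 × 256 image used in this study, and every node in row j of the grid shares the resolution of encoder stage j. The function names are hypothetical.

```python
def node_size(input_size, row):
    """Side length of a node in grid row `row`, assuming one 2x pooling per stage."""
    return input_size // (2 ** row)


def upsample_factor(src_row, dst_row):
    """Scale factor that brings a feature map from row `src_row` up to row `dst_row`."""
    if src_row < dst_row:
        raise ValueError("source row must be at the same depth as the target or deeper")
    return 2 ** (src_row - dst_row)


if __name__ == "__main__":
    size = 256  # input resolution used in this study
    for row in range(5):
        print(f"row {row}: {node_size(size, row)} x {node_size(size, row)} pixels, "
              f"x{upsample_factor(row, 0)} to reach row 0")
```

Under these assumptions, a skip connection from a row-1 node into a row-0 decoder node needs a factor of 2, one from row 2 a factor of 4, and one from row 3 a factor of 8, which is what "different scale factors based on different expansion scales" amounts to in practice.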
Processes 2023, 11, x FOR PEER REVIEW 6 of 18

Design of Experiments
First, to reduce background and computational complexity, YOLOv5 [39] was used to detect the individual mustard targets in all 50 raw images. We obtained 200 images, each of which included one mustard target and was resized to 256 × 256 pixels. Second, bands 8, 4, and 3, which were chosen experimentally, were extracted from the MS image and composed into a standard-format false-color image after a 2% linear stretch. Then, we fused the HD and MS data with the fusion method based on guided filtering. Finally, we had three kinds of data for the detection of mustard cortical fibers: HD data, MS data, and Fusion data.
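The band selection and stretch step can be illustrated with a short sketch. It assumes that "linearly stretched by 2%" means the common percent-clip stretch, in which values below the 2nd percentile and above the 98th percentile are clipped before linear rescaling to 0-255; the function names, the rounding-based percentile rule, and the channel order (bands 8, 4, 3 mapped to the three output channels in that order) are assumptions made for illustration only.

```python
# Center wavelengths of bands 1..9, as listed in the Image Acquisition section.
BAND_WAVELENGTHS_NM = [620, 638, 657, 683, 711, 730, 755, 779, 816]


def percentile(sorted_vals, pct):
    """Pick the value at the rounded fractional index (a simple percentile rule)."""
    idx = round((len(sorted_vals) - 1) * pct / 100.0)
    return sorted_vals[idx]


def linear_stretch(band, low_pct=2.0, high_pct=98.0):
    """Clip a 2D band at the given percentiles and rescale it linearly to 0..255."""
    flat = sorted(v for row in band for v in row)
    lo, hi = percentile(flat, low_pct), percentile(flat, high_pct)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in band]
    return [[round(255 * min(max((v - lo) / (hi - lo), 0.0), 1.0)) for v in row]
            for row in band]


def false_color(cube, bands=(8, 4, 3)):
    """Build three stretched channels from a cube where cube[b - 1] is band b."""
    return [linear_stretch(cube[b - 1]) for b in bands]
```

With this band numbering, the selected bands 8, 4, and 3 correspond to 779 nm, 683 nm, and 657 nm.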
Subsequently, the network was trained for 200 epochs, with default initialization applied to the weights. The Adam optimizer was utilized with default momentum for weight optimization, employing a mini-batch size of 16 and a weight decay of 0.0001. The α-balanced focal loss function [40] was introduced into our model; following the standard form given in [40], its expression is as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\alpha \, y_i \left(1 - p_i\right)^{\gamma} \log p_i + \left(1 - \alpha\right)\left(1 - y_i\right) p_i^{\gamma} \log\left(1 - p_i\right)\right] \tag{7}$$

where α is obtained from α = p/(p + n), p represents the number of positive samples, and n represents the number of negative samples. γ is a parameter that adjusts for the imbalance between positive and negative samples, γ ∈ [0, 1]; the value of γ was set to 0.9 after experiments. N is the number of pixels in each image, $y_i$ is the label of sample i (1 for the positive class and 0 for the negative class), and $p_i$ is the probability that sample i is predicted to be a positive class. Furthermore, the 200 images were manually labeled with the ROI tool of ENVI (Exelis Visual Information Solutions, Boulder, CO, USA), and 75% of each type of dataset was randomly chosen to create the training dataset (150 images), while the rest (50 images) were used as a testing dataset. In addition, to prevent overfitting of the model and increase the amount of training data, we used three data augmentation methods to expand the training dataset: (a) clockwise rotation of all original images by 90 degrees; (b) vertical mirroring of each original image; and (c) addition of salt and pepper noise to each original image. As a result, the number of training images was increased fourfold, giving a training dataset of 600 images (150 raw images and 450 augmented images) for each type of data (HD images, MS images, Fusion images). The network was trained on a terminal with an 11th Gen Intel(R) Core(TM) i7-11800H CPU @ 2.30 GHz (Intel, Santa Clara, CA, USA) and an NVIDIA RTX 3090 GPU with 24 GB of memory (NVIDIA, Santa Clara, CA, USA). The running environment included Python 3.8, CUDA 11.3.0, and PyTorch 1.10.0. The related code and UNet4+ model are obtainable on GitHub (https://github.com/monch999/pickled-mustard-tuber/tree/master (accessed on 14 November 2023)).
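As an illustration of the loss described above, the sketch below evaluates an α-balanced focal loss over a flat list of per-pixel probabilities. It assumes the standard form from [40], in which α weights the positive term and (1 − α) the negative term; this may differ in detail from the code in the authors' repository. The function names and the clamping constant `eps` are hypothetical additions for this example.

```python
import math


def alpha_from_counts(p, n):
    """alpha = p / (p + n), with p positive and n negative samples."""
    return p / (p + n)


def focal_loss(probs, labels, alpha, gamma=0.9, eps=1e-7):
    """Mean alpha-balanced focal loss over N pixels.

    probs[i] is the predicted probability of the positive class for pixel i,
    labels[i] is 1 (cortical fiber) or 0 (everything else).
    """
    total = 0.0
    for p_i, y_i in zip(probs, labels):
        p_i = min(max(p_i, eps), 1.0 - eps)  # keep log() finite
        if y_i == 1:
            total += -alpha * (1.0 - p_i) ** gamma * math.log(p_i)
        else:
            total += -(1.0 - alpha) * p_i ** gamma * math.log(1.0 - p_i)
    return total / len(probs)
```

The modulating factors `(1 - p_i) ** gamma` and `p_i ** gamma` shrink the contribution of pixels that are already classified confidently, so with γ = 0 the expression reduces to an α-weighted cross-entropy.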

Assessment System
The result of detecting mustard cortical fiber using the UNet4+ model was assessed by Recall (R), Precision (P), and Dice [41]. Dice is a function for calculating the similarity between two sets; it is commonly utilized to determine the similarity between two samples. This coefficient is widely employed in the semantic segmentation of medical images to help evaluate the reliability of the models. The purpose of our experiment was the detection of cortical fibers of pickled mustard to allow a machine to remove them accurately and quickly. Therefore, Dice could be used in our experiment to evaluate the performance of our model. R, P, and Dice were calculated according to the equations below:

$$R = \frac{TP}{TP + FN} \tag{8}$$

$$P = \frac{TP}{TP + FP} \tag{9}$$

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{10}$$

where TP, FP, and FN represent the number of true mustard cortical fiber pixels (True Positives), the number of false mustard cortical fiber pixels (False Positives), and the number of missed mustard cortical fiber pixels (False Negatives), respectively.
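These three metrics can be computed directly from a predicted mask and a ground-truth mask. The sketch below is a minimal example, assuming both masks are 2D lists of 0/1 values with 1 marking cortical fiber pixels; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example rather than something specified in the text.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives pixel by pixel."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return tp, fp, fn


def recall_precision_dice(pred, truth):
    """Return (Recall, Precision, Dice) for two binary masks of equal shape."""
    tp, fp, fn = confusion_counts(pred, truth)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return recall, precision, dice


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    truth = [[1, 0, 0], [0, 1, 1]]
    print(recall_precision_dice(pred, truth))  # TP = 2, FP = 1, FN = 1
```

In the small example, two fiber pixels are found, one background pixel is wrongly flagged, and one fiber pixel is missed, so all three metrics equal 2/3.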

Spectral Attributes of Cortical Fiber and Meat Tissues
The mean reflectance spectra curves of meat and cortical fiber over the spectral range from 600 to 800 nm are displayed in Figure 6. At the 657 nm (band 3) wavelength, which corresponds to the color of the meat, there was good separability between mustard meat and cortical fiber, which indicates that this wavelength is able to provide valuable spectral differentiation information, contributing to the detection of mustard cortical fibers. Cortical fiber has a small peak at around 683 nm (band 4), making it a useful band for the identification of cortical fibers. In other bands, there were no significant differences in spectral trends. Therefore, a multispectral band combination including band 3 and band 4 could be selected as the feature bands. After testing, we selected band 3, band 4, and band 8 (false-color image shown in Figure 7) as the feature bands for the subsequent image fusion and network training.

As illustrated in Figure 7, there was a significant spectral difference between meat and cortical fibers within the 650-700 nm range. This may be related to the higher water, protein, and mineral content of meat, whereas cortical fibers contain more cellulose. As a result, meat absorbs more strongly in the 650-700 nm range, leading to noticeable distinctions between meat and cortical fibers in band 3 (657 nm) and band 4 (683 nm).
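The band comparison described above can be sketched as follows: average the reflectance of each class per band, then rank bands by the gap between the two mean spectra. This is only an illustration under assumed conventions (a `(H, W, B)` reflectance cube, an integer label map with 1 = meat and 2 = cortical fiber, and the hypothetical helpers `mean_spectra` and `rank_bands`); the authors state that the final bands were chosen after testing, which this sketch does not reproduce.

```python
import numpy as np

def mean_spectra(cube, labels, class_ids=(1, 2)):
    """Mean reflectance per band for each labeled class.

    cube   : (H, W, B) array of reflectance values
    labels : (H, W) integer array; assumed coding 1 = meat, 2 = cortical fiber
    """
    return {c: cube[labels == c].mean(axis=0) for c in class_ids}

def rank_bands(spec_a, spec_b):
    """Return 1-based band numbers sorted by decreasing absolute difference."""
    diff = np.abs(np.asarray(spec_a) - np.asarray(spec_b))
    return np.argsort(diff)[::-1] + 1

# Synthetic 9-band cube in which only bands 3 and 4 separate the classes.
demo_cube = np.full((4, 4, 9), 0.5)
demo_labels = np.ones((4, 4), dtype=int)
demo_labels[:, 2:] = 2                       # right half is "cortical fiber"
demo_cube[demo_labels == 2, 2] = 0.9         # band 3 (index 2)
demo_cube[demo_labels == 2, 3] = 0.7         # band 4 (index 3)

demo_spectra = mean_spectra(demo_cube, demo_labels)
demo_ranking = rank_bands(demo_spectra[1], demo_spectra[2])
```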

Performance of Image Fusion
The fusion technology based on guided filtering was designed to efficiently fuse spectral information from the MS image with spatial information from the HD image, and thus, to achieve better resolution while preserving spectral information. Figure 8a shows a false-color image, displaying the original wavebands 8, 4, and 3 as RGB. From the images, it is evident that the resolution is a bit low and there is a noticeable frosted texture; since cortical fibers appear white, distinct spectral differences can be observed between meat and cortical fibers. Compared with Figure 8a, Figure 8b is clearer, smoother, and has better boundaries, while the differences between cortical fiber and meat are less obvious compared to those in the MS images. Figure 8c shows the image after the fusion of MS and HD images. The resolution has been greatly improved compared to the original MS image and is virtually the same as the HD image. In comparison to the HD images, the spectral information of the image has been strengthened, making the difference between cortical fibers and meat more pronounced.
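To make the idea of guided-filter fusion concrete, the sketch below implements the classic guided filter with box means and applies it band by band, using a grayscale HD image as the guide and an MS band as the input. It is not the improved fusion method developed in this study: the function names, the radius `r`, the regularizer `eps`, and the assumption that the MS bands have already been registered and resampled to the HD grid are all choices made for this illustration.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, using edge padding and an integral image."""
    k = 2 * r + 1
    xp = np.pad(x, r, mode="edge")
    s = np.zeros((xp.shape[0] + 1, xp.shape[1] + 1))
    s[1:, 1:] = xp.cumsum(axis=0).cumsum(axis=1)
    total = s[k:, k:] - s[:-k, k:] - s[k:, :-k] + s[:-k, :-k]
    return total / (k * k)

def guided_filter(guide, src, r=4, eps=1e-3):
    """Classic guided filter: filter `src` so that it follows the edges of `guide`."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    var_i = box_mean(guide * guide, r) - mean_i * mean_i
    cov_ip = box_mean(guide * src, r) - mean_i * mean_p
    a = cov_ip / (var_i + eps)           # local linear gain
    b = mean_p - a * mean_i              # local linear offset
    return box_mean(a, r) * guide + box_mean(b, r)

def fuse_bands(ms_bands, hd_gray, r=4, eps=1e-3):
    """Filter each MS band (already on the HD grid) with the HD image as guide.

    ms_bands : (B, H, W) array, e.g. bands 8, 4, and 3
    hd_gray  : (H, W) array, grayscale HD image scaled to [0, 1]
    """
    return np.stack([guided_filter(hd_gray, band, r, eps) for band in ms_bands])

# Small synthetic check on random data.
_rng = np.random.default_rng(0)
demo_guide = _rng.random((32, 32))
demo_ms = _rng.random((3, 32, 32))
demo_fused = fuse_bands(demo_ms, demo_guide)
```

A smaller `eps` lets the output follow the guide's edges more closely, while a larger value smooths more; suitable values depend on the reflectance scale of the data.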

Performance of UNet4+
The performance of the UNet4+ model was assessed regarding the detection of mustard cortical fiber. The UNet4+ dense network was composed of several UNet networks of different scales, allowing for the simultaneous acquisition of multi-scale information during runtime. In addition, we incorporated multiscale connections in the network, which facilitated the integration of coarse- and fine-grained information and improved performance. The model could identify cortical fibers in each type of the 50 test images, including some small targets. Figure 9 presents the results for the three types of images and shows that our model can effectively detect cortical fibers. However, in this study, some of the meat of the pickled mustard tubers was in the early stages of cortical fiber growth. These regions have similar spectral characteristics to cortical fibers, suggesting that they might be misidentified by the model as cortical fiber. Meanwhile, all training samples were labeled by hand. Since the cortical fibers of pickled mustard vary in shape, the boundary between cortical fiber and meat is not always clear. Therefore, the annotated data may introduce errors that affect the detection results. In summary, the model had a good segmentation effect on each type of data, making it a reliable segmentation model for the detection of cortical fibers in pickled mustard tubers.

Comparison of Three Types of Images Based on UNet4+
UNet4+ was trained on each of the three kinds of images to obtain the corresponding mustard cortical fiber segmentation models; the results of UNet4+ on the testing dataset are displayed in Table 1 and Figure 10. The R, P, and Dice of the fusion images reached 82.87%, 68.13%, and 73.91%, which were 2.25%, 6.21%, and 5.31% higher than with the MS images and 1.23%, 10.15%, and 7.1% higher than with the HD images, respectively. The R, P, and Dice of the MS images reached 80.62%, 61.92%, and 68.60%; compared with the HD images, R was 1.02% lower, while P and Dice were 3.94% and 1.79% higher. The reason that the R of MS images was lower than that of HD images may be that the information from high-resolution images was more conducive to the broad identification of cortical fibers in some images, while the spectral information was beneficial for correctly identifying cortical fibers. Consequently, the fused images integrating high-resolution and spectral information could achieve better results for the segmentation of cortical fibers.
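The differences quoted above can be checked with a few lines of arithmetic. In the sketch below, the fusion and MS scores and the fusion-versus-HD gains are taken from the text; the HD scores are back-calculated from those gains (they are not read from Table 1) and serve only as a consistency check.

```python
# Scores reported in the text, in percent.
fusion = {"R": 82.87, "P": 68.13, "Dice": 73.91}
ms = {"R": 80.62, "P": 61.92, "Dice": 68.60}

# Reported gains of fusion over HD; HD itself is back-calculated here (assumption).
fusion_minus_hd = {"R": 1.23, "P": 10.15, "Dice": 7.1}
hd = {k: round(fusion[k] - fusion_minus_hd[k], 2) for k in fusion}

# Differences in percentage points.
fusion_minus_ms = {k: round(fusion[k] - ms[k], 2) for k in fusion}
ms_minus_hd = {k: round(ms[k] - hd[k], 2) for k in fusion}

for name, table in [("fusion - MS", fusion_minus_ms), ("MS - HD", ms_minus_hd)]:
    print(name, table)
```

The back-calculated HD scores reproduce the stated MS-versus-HD differences, so the three sets of numbers in the paragraph are mutually consistent.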

Comparison of the UNet4+ Model with UNet++ and UNet3+
The UNet++ and UNet3+ models were also trained on the fusion images to obtain the corresponding mustard defect detection models. The results of each model on the test data are shown in Table 2. The parameter details of each model are shown in Table 3. The R of UNet4+ reached 82.87%, which was 3.78% and 7.55% higher than those of the UNet++ and UNet3+ models, respectively. The P of UNet4+ reached 68.13%, which was 11.79% and 30.97% higher than those of the UNet++ and UNet3+ models, respectively. The Dice of UNet4+ reached 73.91%, which was 9.72% and 27.41% higher than those of the UNet++ and UNet3+ models, respectively. This demonstrated that the model proposed in this paper had a significant advantage in the detection of complex defects such as cortical fibers in pickled mustard tuber.
In terms of model size, the model used in this paper consumes more memory (13.2 MB) than the UNet3+ model (7.83 MB) or the UNet++ model (11.7 MB); however, our model had an advantage in defect detection accuracy. Regarding detection time, with the computing resources provided by the same GPU, the three models in ascending order of time usage were as follows: UNet++ (17 ms), UNet3+ (22 ms), and UNet4+ (31 ms).
In terms of processing speed, the model proposed in this paper did not hold an advantage. The reason might have been that the bilinear upsampling operation in the multiscale connections took more time. Therefore, the UNet++ model, which does not employ this operation, tended to be faster. However, our model achieved much better detection results than the other two models.
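Per-image detection times such as those above are typically obtained by timing repeated forward passes after a few warm-up calls. The helper below is a generic sketch of that procedure, not the timing code used in the study; `time_per_call_ms` and its parameters are hypothetical, and on a GPU the timed callable would additionally need to synchronize the device before each clock reading.

```python
import statistics
import time

def time_per_call_ms(fn, arg, warmup=3, repeats=20):
    """Return (mean, standard deviation) of the wall-clock time of fn(arg) in ms."""
    for _ in range(warmup):              # warm-up calls are not timed
        fn(arg)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in workload; a real measurement would pass the model's inference function.
demo_mean_ms, demo_std_ms = time_per_call_ms(lambda n: sum(range(n)), 10_000, repeats=5)
```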

Figure 13 illustrates some details regarding the training process of each model. Figure 13a shows the convergence curves of loss values for each model during training, indicating an overall decreasing trend in loss values that stabilized after certain epochs. Figure 13b displays the accuracy curves (expressed in Dice values) during the training process. It can be seen that the accuracy of each network increased during the training process and tended to stabilize after reaching a certain epoch, with minimal differences in Dice values among the models. Figure 13c demonstrates the Dice curves during the model validation process, highlighting the rapid attainment of high accuracy by all three models. Meanwhile, our model maintained a consistently high level of accuracy. Finally, Figure 13d depicts the results of testing of each model after training completion. The modest fluctuations observed in our model across diverse and complex test data indicate its capacity to effectively learn the intricate features of the target. In comparison to other models, our model exhibited enhanced resistance to interference.
As shown in Figure 14, owing to the usage of a dense network and multiscale skip connections, our model performed better in terms of handling small defects. The dense connection network made it possible to simultaneously process images on different scales, capturing their features and then fusing these features by using a method of deep supervision. Similarly, the UNet++ model also used a dense network which can roughly detect defects; however, it was relatively weaker in handling fine details compared to the model proposed in this paper. Multiscale skip connections, on the other hand, conveyed fine-grained semantic information to other layers, which was beneficial for detecting the details of pickled mustard defects. The UNet3+ model, because of its lack of multi-level convolution layers, exhibited relatively weaker learning capabilities and detection performance compared to the other two models.
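The core operation of a multiscale skip connection (bilinear upsampling of a coarser feature map followed by merging with a finer one) can be sketched in plain NumPy. This is an illustration only: the convolution that follows the upsampling in the actual network is omitted, channel concatenation is an assumed merge strategy, and `bilinear_upsample` and `multiscale_skip` are hypothetical names rather than functions from the UNet4+ code.

```python
import numpy as np

def bilinear_upsample(feat, scale):
    """Bilinearly upsample a (C, h, w) feature map by an integer factor.

    Output corners coincide with input corners (the 'align corners' convention).
    """
    _, h, w = feat.shape
    ys = np.linspace(0, h - 1, h * scale)
    xs = np.linspace(0, w - 1, w * scale)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]        # vertical interpolation weights
    wx = (xs - x0)[None, None, :]        # horizontal interpolation weights
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bottom = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def multiscale_skip(fine, coarse):
    """Upsample `coarse` to the spatial size of `fine` and stack along channels."""
    scale = fine.shape[1] // coarse.shape[1]
    return np.concatenate([fine, bilinear_upsample(coarse, scale)], axis=0)

demo_coarse = np.arange(4.0).reshape(1, 2, 2)   # one channel, 2 x 2
demo_fine = np.zeros((3, 4, 4))                 # three channels, 4 x 4
demo_merged = multiscale_skip(demo_fine, demo_coarse)
```

Because every output pixel of the upsampled map is interpolated from four inputs, this step adds work that a purely same-scale connection does not need, which is in line with the longer inference time discussed for UNet4+.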
Besides detection accuracy, the detection time for mustard products is important for image processing and cortical fiber removal system applications. In our experiment, each image contained exactly one target. With this data acquisition method, the proposed image detection system required 135 ms (105 ± 3 ms for image preprocessing and 30 ± 3 ms for image detection) to process a single fusion image, meaning that it can process approximately seven images, and thus seven targets, per second with an RTX 3090 24 GB GPU. The system is therefore sufficient to meet the needs of the study factory.
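The throughput figure follows directly from the per-image latency. The short calculation below reproduces it and adds a simple range obtained by applying both ± 3 ms tolerances in the same direction; that range is an assumption made for illustration, not a value reported in the study.

```python
preprocess_ms = 105.0    # image preprocessing time per fusion image
detect_ms = 30.0         # UNet4+ detection time per fusion image
tolerance_ms = 3.0       # reported spread of each stage

total_ms = preprocess_ms + detect_ms
images_per_second = 1000.0 / total_ms

# Assumed bounds: both stages at their slowest or fastest at the same time.
slowest_rate = 1000.0 / (total_ms + 2 * tolerance_ms)
fastest_rate = 1000.0 / (total_ms - 2 * tolerance_ms)

print(f"{total_ms:.0f} ms per image -> {images_per_second:.1f} images/s "
      f"(between {slowest_rate:.1f} and {fastest_rate:.1f})")
```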

Discussion
In our proposed approach, we first used image fusion based on guided filtering to merge the HD and MS images into a fused image. This image possessed both the high resolution of the HD image and the spectral characteristics of the MS image, which facilitated the extraction of features related to cortical fibers in pickled mustard. Additionally, we trained and tested the proposed model on three different types of images. The results indicated that our model was effective at detecting cortical fibers in pickled mustard. Quantitative comparisons across the three types of data showed that the fused images yielded the highest Dice value, significantly improving the detection performance. Therefore, for the subject of this study, image fusion was deemed necessary.
Next, we used the fused data to compare our model with existing deep learning models (UNet++ and UNet3+). While the results indicate that existing models can generally detect the target of this study, our model demonstrated several advantages:

1. Innovation: Our model combined multiscale connections with dense convolutional networks. Multiscale connections link features of different granularity across layers. The dense convolutional network further merges the feature information from the multiscale connections. This increases the model's complexity, enabling it to acquire a broader array of features. Consequently, compared to traditional models, our model exhibited improved detection capabilities for small, irregular objects.

2. Quantitative Comparison: Our model achieved higher accuracy than other models in quantitative comparisons, with a modest increase in the number of parameters. Our model also showed strong anti-interference ability, making it suitable for the detection of complex targets.

3. Production Efficiency: In terms of production, our model can yield better results within a specified time frame, leading to higher efficiency compared to other models.

4. Architectural Design: Our model utilized multi-scale connections and multiple layers of densely connected convolutional networks, enabling the extraction of finer and more diverse features from the target. Consequently, it was effective at recognizing complex targets like cortical fibers in pickled mustard.
However, our model also had areas for improvement. First, while it effectively identified the cortical fibers in pickled mustard, there was still room for improvement in terms of detection accuracy. Additionally, our model is more complex than the other models, which may be attributed to the use of dense convolutional networks and multiscale connections. This complexity resulted in longer inference times compared to the other two models.

Conclusions
In this study, a mustard imaging system was applied to obtain Multispectral (MS) images and High-Definition (HD) images of mustard. Then, image fusion with guided filtering was used to combine the MS images with the HD images. Based on the features of cortical fibers in mustard, the UNet4+ model, with a dense convolution block and multiscale skip connections, extracted abundant semantic information from the mustard images. Subsequently, the detection results of UNet4+ on three types of test data were compared. Each type of image was evaluated with Precision (P), Recall (R), and Dice, as well as the required detection time. The results revealed that the fusion images achieved the highest Dice (73.91%) at a detection time of 30 ± 3 ms per image. In addition, the UNet4+ model was compared with two other segmentation models (UNet++ and UNet3+) on the same fused images. The experimental results showed that our model achieved the highest accuracy, with weights occupying 13.2 MB of storage. We can conclude that the mustard imaging system, together with guided-filtering image fusion and the UNet4+ model, can effectively detect mustard cortical fiber. Furthermore, we applied a deep learning-based approach to the semantic segmentation of the physical features of food. Our model used multi-scale connections and densely connected convolutional layers to capture and fuse deep features of the samples, thus achieving effective segmentation of small objects. As such, it could be conveniently applied to similar scenarios. However, there was still room for improvement in terms of accuracy.

Figure 1. Images of the High-Definition and Multispectral cameras.

Figure 3. Samples of pickled mustard tuber obtained with HD cameras (left) and MS cameras (right), with cortical fibers in red circles.

Figure 5. Structural details of skip connections. A skip connection includes bilinear upsampling and convolution operations.

Figure 6. Mean spectral reflectance curves extracted from areas of cortical fiber and meat.

Figure 7. False-color image; the red box marks cortical fiber, which appears white and irregular in shape.

Figure 8. Comparison of images before and after fusion.

Figure 10. Evaluation results of three types of images.

Figure 11. The left image shows the Dice curve of each test image for the Fusion images and Multispectral (MS) images and their relative relationships. The right image shows the same for the Fusion images and High-Definition (HD) images. Red bars represent how much the Dice values for Fusion images increased compared to the other images, while green bars indicate that they decreased.

Figure 12. Detection results of partial data for three types of images: High-Definition (HD), Multispectral (MS), and Fusion images.

Figure 13. (a) The convergence curve of the loss function during the training process. (b) The Dice values of each model on the training dataset during the training process. (c) The Dice values of each model on the validation dataset during the training process. (d) The Dice values of each model on a test dataset of 50 images after training had been completed.

Table 1. Evaluation of detection results using High-Definition (HD), Multispectral (MS), and fusion images.