1. Introduction
Tilapia is a key source of high-quality, economical whitefish products, and its processing volume has been substantial in recent years, with annual growth. After preprocessing, tilapia is typically cut into fillets or segments for direct sale or used as raw material for further processing [
1]. The presence of foreign matter in fillet and segment products is a critical issue affecting food safety and quality [
2]. A typical assembly line for whitefish fillets, such as tilapia, includes processes such as deheading, filleting, deboning, trimming, cutting, washing, and quality inspection. Common spray and bubble washing methods can effectively remove most processing residues, such as blood and meat fragments, but they cannot completely eliminate foreign objects like tiny scales and broken bones. In actual production, fish scales and broken bones typically adhere to the surface of the fish meat or are embedded just beneath the surface. Due to their tiny size and translucent color, they are difficult to detect. Currently, the primary detection methods for the aforementioned endogenous foreign bodies include manual inspection, X-ray detection, spectral imaging techniques, and so on [
3,
4]. Manual inspection relies on visual observation under natural light, and translucent scales adhering to fish fillets can often only be found by touch, which makes the result highly susceptible to subjective factors. Furthermore, fish bone fragments may closely resemble the color of fish meat, making accurate visual differentiation quite challenging [5]. Most corporate hygiene protocols require staff to wear rubber gloves, and the chilled state of the raw materials further reduces tactile sensitivity, increasing the risk of missing bone fragments and fish scales. X-ray detection primarily identifies foreign bodies based on density differences and is unaffected by factors such as the color, shape, or position of the fish meat. However, for low-density foreign materials like fish bones and scales, equipment with higher-intensity radiation sources is required, which elevates costs. In some cases, the bones of certain fish species, which have even lower density, may remain undetectable [
6,
7]. In response to this issue, researchers in recent years have proposed utilizing the spectral characteristics of different substances to identify foreign bodies. For example, Song et al. developed a method for detecting fish bones based on Raman hyperspectral imaging technology [
8]. Wang et al. used hyperspectral technology to model the spectral differences between fish skin and scales at different wavelengths, thereby achieving a quantitative assessment of the scaling rate of carp [
9]. These studies have demonstrated the feasibility of detecting foreign objects by utilizing the differences in light absorption and reflection of fish skin, flesh, and scales within specific wavelength ranges. However, these techniques can only detect fish scales or bones attached to the surface of the fish. Without the introduction of externally excited enhanced imaging technology, it remains impossible to identify foreign objects embedded deep within fish flesh.
Ultraviolet fluorescence imaging (UVFI) with a wavelength range of 320–400 nm (i.e., UVA band or near-ultraviolet light) is an imaging technique based on the molecular fluorescence response generated by materials under UV irradiation [
7]. Ultraviolet light in this band possesses strong penetrability and can effectively excite fluorophores in samples. Objects with similar colors exhibit analogous hues within the visible light spectrum (400–760 nm). However, under ultraviolet irradiation, different substances with distinct fluorescence radiation capabilities exhibit different fluorescence reflection intensities. Such intensity differences can improve the contrast of similarly colored targets under ultraviolet excitation and provide favorable detection conditions for foreign bodies hidden in deep fish tissues. For example, Wang et al. combined color and texture features from UV fluorescence images with a CNN model to predict total volatile basic nitrogen (TVB-N) in tilapia subjected to repeated freeze–thaw cycles, thereby demonstrating the feasibility of using fish scales as a carrier of fluorescence information [
10]. However, this study focused on predicting quality indicators rather than detecting foreign objects, and it did not address the issue of signal attenuation in deeper layers. In addition, Bao et al. used the UV-press method in conjunction with ISO standard procedures to detect Anisakis simplex in farmed Atlantic cod from Norway, thereby validating the effectiveness of UV fluorescence technology in the detection of foreign bodies in fish [
11]. However, this method is destructive, as it alters the original morphology of the sample, and it targets exogenous parasites rather than endogenous foreign bodies in fish. Furthermore, for foreign bodies embedded deep within muscle tissue, the press method struggles to effectively excite and capture fluorescent signals, leaving a risk of missed detection. Although the aforementioned studies did not resolve the issue of non-destructive detection of deep-seated endogenous foreign bodies, they demonstrated the feasibility of UVFI in the identification and quality assessment of foreign bodies in fish. Despite the enhanced contrast offered by UVFI, the intricate biological properties of fish tissues can, in practice, lead to overlapping or ambiguous fluorescence responses between foreign bodies and the surrounding substrate. This similarity in spectral signatures can complicate visual interpretation and traditional image analysis, potentially limiting identification accuracy. Therefore, accurately distinguishing foreign objects from the surrounding matrix has become key to improving recognition performance. The emergence of deep learning has provided a new and effective approach to addressing this challenge. Zhang et al. utilized a U-Net network to achieve accurate identification and segmentation of foreign objects in birds’ nests [
12], and Wang et al. employed an improved U-Net model to detect foreign objects on power transmission lines [
13]. These studies demonstrate that the U-Net family of models exhibits significant effectiveness in image-based foreign object detection tasks [
7]. Their advantage lies in the ability to automatically learn complex and discriminative features from raw image data [
14], thereby effectively extracting subtle patterns used to distinguish targets from the background. However, the original U-Net model achieves pixel-level semantic segmentation primarily by learning high-level semantic features within the global context of an image; its classification mechanism relies heavily on deep spatial feature mappings rather than explicit analysis of color and texture features. In the task of identifying foreign objects in tilapia fillets, color and texture differences between various foreign objects serve as critical discriminative information. Both the fish meat itself and foreign objects (especially small fish bones) exhibit fluorescence; although the fluorescence from the fish meat is weaker than that from small fish bones, the small surface area of the fish bones makes them easily obscured by the fluorescent background generated by the fish meat. Although end-to-end models possess strong recognition capabilities, they may still misidentify key features of foreign objects against a broad fluorescent background. In this study, the impact of the fish meat’s fluorescent background on the classification performance of end-to-end models cannot be ignored [15]. For high-dimensional biological features such as color and texture, previous studies have demonstrated that support vector machines (SVMs) exhibit excellent classification performance. For example, Azarmdel et al. utilized an SVM classifier to achieve high-accuracy automatic classification of four fish species in an intelligent fish processing system [
16], while Windarsih et al. employed SVMs to achieve high-precision prediction of adulteration levels in the detection of pork fat adulteration in tuna oil [
17]. Therefore, by combining an efficient classifier with these distinctive visual features, it is expected that a robust identification model capable of accurately classifying various types of endogenous foreign bodies can be developed.
This study employs ultraviolet excitation technology to enhance the visualization of foreign objects in tilapia fillets. By leveraging the differences in fluorescence intensity between fish meat and foreign objects such as scales and bones under ultraviolet excitation, and by integrating classical machine learning with deep learning methods, a foreign object recognition model for tilapia fillets was developed. The model is designed to enable rapid, non-contact detection of endogenous foreign objects such as scales and bones. The main research content is as follows: (1) Through scanning electron microscopy (SEM) experiments and organic solvent immersion tests, we analyzed the microscopic mechanisms underlying the fluorescence responses of fish scales and bones, thereby verifying the feasibility and reliability of non-destructive foreign object detection in fish fillets based on fluorescence response. (2) We developed a foreign object detection system utilizing ultraviolet fluorescence to capture fluorescence response images of foreign objects and fish fillets. By comparing the performance of basic image processing methods with the U-Net deep learning method, U-Net was ultimately adopted to achieve high-precision localization of foreign objects in complex backgrounds. (3) Multi-color model space features were fused, and principal component analysis (PCA) was employed for dimensionality reduction. Combining the gray-level co-occurrence matrix (GLCM), local binary pattern (LBP), and histogram indices (HI), a set of discriminative color and texture features was constructed. (4) By optimizing the classic support vector machine (SVM) model using genetic algorithms (GAs) and particle swarm optimization (PSO), we achieved rapid and high-precision identification and detection of foreign objects in fish fillets. The overall research framework and methodology flowchart of this study are illustrated in
Figure 1.
3. Results and Analysis
3.1. Microstructural Analysis of Fish Scales, Bones, and Muscle Tissue
Figure 4a–c show the scanning electron microscopy (SEM) images of fish scales, fish bones, and fish meat, respectively. As seen in
Figure 4a, the scale patterns on the surface of the fish scale form raised ridge-like structures of the bony layer, arranged densely and uniformly with specific angular orientation and distinct directionality. In
Figure 4b, the fish bone, formed through the ossification of myoseptal connective tissue, exhibits a hard exterior and hollow interior, primarily composed of carbonates, crude protein, phosphorus, and other substances [
34], with a densely packed and compact structure. In contrast,
Figure 4c reveals that fish meat consists mainly of muscle fibers with relatively large inter-fiber gaps and a loosely organized structure. The presence or absence of fluorescence in these tissues is governed by two fundamental conditions: the frequency of the incident radiation must be compatible with the molecular structure, and the material must possess a sufficient fluorescence quantum yield after energy absorption at specific wavelengths [
35]. This explains why fish bones and fish scales, sharing compositional and structural similarities with their layered inorganic structures, effectively absorb and re-emit ultraviolet light, fulfilling both conditions to produce distinct fluorescent responses. Conversely, the myofibrillar protein structure of fish meat, despite absorbing ultraviolet light, fails to satisfy the second condition, as it does not efficiently re-emit the energy as longer-wavelength light, resulting in a weak fluorescent response.
3.2. Organic Solvent Immersion Experiment
Figure 5 shows the fluorescence response of fish scales and fish bones immersed in different solvents under visible and ultraviolet light.
Figure 5a,b depict fish scales and fish bones placed in test tubes containing three different solvents (C₂H₆O, H₂O, and C₄H₈O₂) under visible light irradiation, while
Figure 5c,d show the corresponding images under ultraviolet light irradiation. Under natural light, neither fish scales nor fish bones exhibit a fluorescence response. Under ultraviolet light irradiation, only the fish scales and fish bones demonstrate a fluorescence response, while the supernatant in the solution shows no fluorescence. This indicates that the fluorescence is caused by the structural coloration of the fish scales and fish bones and is not influenced by external factors such as feeding conditions or environments, ensuring the stability and reliability of detection in the ultraviolet wavelength range. The differences in fluorescence response between fish flesh, scales, and bones also demonstrate the feasibility of implementing the method described in this paper.
3.3. Segmentation Results and Comparative Analysis
The segmentation results based on the U-Net network and classical threshold segmentation techniques are shown in
Figure 6. The first column displays the collected fluorescence images of fish fillets, while columns 2–5 present the segmentation results obtained by the four methods. Each result was produced by a bitwise AND operation between the mask generated by the method and the original image, which isolates the fish scales and bone fragments. Since the foreign body FB-F is located on the surface layer of the fish meat, it exhibits a significant difference in fluorescence response compared to the background fish meat [
36]. All four methods can extract the foreign body to some extent, and the Intersection over Union (IOU) values of all tested segmentation algorithms are listed in
Table 1. Among them, the fixed threshold method (threshold set to 125) shows obvious discontinuities in the segmentation results, while the Otsu method and K-means method exhibit burrs and pits in the extraction of fishbone edges. The U-Net network achieved a more complete and smoother segmentation, with an IOU of 92.6%. For the foreign body FB-I embedded within fish flesh, part of the light is absorbed by the surrounding flesh, weakening the fluorescence response. Consequently, the classical segmentation methods show significantly reduced performance: the Otsu method incorrectly identifies the entire fish fillet as foreign matter, the fixed threshold method detects only sparse bright regions, and K-means fails to identify the foreign body at all, with an IOU of 0.0%. In contrast, U-Net still achieves relatively complete segmentation of this foreign body, with an IOU of 90.8%. For FS foreign body detection, the transparent nature of fish scales allows their fluorescence images to reveal the underlying texture of the fish meat. Consequently, the Otsu method and K-means both misclassify the fish meat in the edge regions as foreign bodies, while the fixed threshold method identifies only a very small number of bright areas, with an IOU of only 1.8%. In contrast, the U-Net network accurately and completely extracted the FS foreign body, achieving an IOU of 95.3%. In summary, this study compared three classical image-processing techniques (fixed threshold, Otsu, and K-means) with the U-Net network. Because fluorescence intensity varies among the foreign bodies, the classical methods failed to segment all three foreign body types accurately. The U-Net network, however, significantly improved segmentation performance for all three foreign bodies, demonstrating that its adoption for foreign body segmentation is both reasonable and accurate.
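The evaluation steps described above (thresholding, isolation of the foreign body with a mask, and IOU scoring) can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the 4 × 4 toy image, the ground-truth mask, and all function names are assumptions made for the example; only the fixed threshold of 125 comes from the text.

```python
import numpy as np

def fixed_threshold_mask(gray, threshold=125):
    """Binary mask: True where the grayscale value exceeds the threshold."""
    return gray > threshold

def iou(pred_mask, true_mask):
    """Intersection over Union of two boolean masks (0.0 if both are empty)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, true).sum() / union)

def apply_mask(image, mask):
    """Keep pixels inside the mask and set the rest to 0 (black).

    For a 0/255 mask image this is equivalent to a bitwise AND with the original.
    """
    return np.where(mask[..., None], image, 0).astype(image.dtype)

# Toy example (hypothetical values): a 2x2 "bone" region on darker "flesh".
gray = np.array([[40,  50,  60, 40],
                 [50, 200, 210, 60],
                 [60, 190, 120, 50],
                 [40,  50,  60, 40]], dtype=np.uint8)
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True

pred = fixed_threshold_mask(gray)            # misses the dimmer pixel (120)
score = iou(pred, truth)                     # 3 overlapping pixels / 4 in the union
rgb = np.stack([gray, gray, gray], axis=-1)
isolated = apply_mask(rgb, pred)             # background pixels become 0
```

In this toy case the fixed threshold drops the dimmest foreign-body pixel, which lowers the IOU to 0.75 and mirrors, in miniature, the discontinuities reported for the fixed threshold method.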
3.4. Experimental Results of Color and Texture Feature Characterization
For the segmented foreign body images mentioned above, a total of 12 single-value features were extracted from individual color channels across the four color models (RGB, L*a*b*, HSV, and YCbCr). As shown in
Figure 7, which takes randomly selected foreign body images from the FS, FB-F, and FB-I samples as an example, all components of the four color models are visualized.
Figure 7 shows the RGB, L*a*b*, HSV, and YCbCr color models and their respective color channel grayscale images. As can be intuitively observed from the grayscale images of the color channels, the regions and extent of textural prominence in the foreign body area differ across the various channels. This occurs since the individual channels encode distinct discriminative attributes (e.g., color, brightness, hue), causing the perceived textural characteristics of the foreign body to exhibit divergence across them. This difference better highlights the textural features of the foreign body image from multiple perspectives. Additionally, while rich color and texture information exists across different channels, some redundant information negatively impacts model accuracy. Before performing principal component analysis (PCA), the pixel values of the 12 monochrome grayscale images were normalized to the range [0, 1] using the min-max normalization method to eliminate scale differences between channels. PCA was then applied to the normalized grayscale images for dimensionality reduction. Principal component images with a cumulative contribution rate of 99% were selected for further extraction of textural features.
Table 2 presents the eigenvalues and contribution rates of the principal components obtained by the PCA method. The cumulative variance contribution rate of the first three principal components (PC1, PC2, and PC3) reached 99.32%, effectively encompassing nearly all the information present in the original data.
Figure 8 shows the first three principal component images obtained. It can be observed from the figure that the image clarity of PC1, PC2, and PC3 gradually decreases, indicating that the foremost principal components concentrate most of the information from the original images. Therefore, in this study, PC1, PC2, and PC3 were selected as the primary principal component images for subsequent texture feature extraction.
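The normalization and dimensionality-reduction procedure described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function names are hypothetical, PCA is computed from the eigendecomposition of the channel covariance matrix, and the 12 input channels are synthetic mixtures of three random patterns rather than real fluorescence images.

```python
import numpy as np

def min_max_normalize(channel):
    """Scale one channel to [0, 1]; a constant channel maps to all zeros."""
    channel = channel.astype(np.float64)
    span = channel.max() - channel.min()
    if span == 0:
        return np.zeros_like(channel)
    return (channel - channel.min()) / span

def pca_component_images(channels, target=0.99):
    """channels: array of shape (n_channels, H, W).

    Returns the principal component images needed to reach the target
    cumulative contribution rate, plus the contribution rate of every component.
    """
    n, h, w = channels.shape
    data = np.stack([min_max_normalize(c).ravel() for c in channels], axis=1)
    data = data - data.mean(axis=0)              # center each channel
    cov = np.cov(data, rowvar=False)             # (n, n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()             # contribution rates
    k = int(np.searchsorted(np.cumsum(ratios), target) + 1)
    k = min(k, n)
    scores = data @ eigvecs[:, :k]               # project pixels onto the first k PCs
    return scores.T.reshape(k, h, w), ratios

# Synthetic stand-in for the 12 color-channel images (illustration only).
rng = np.random.default_rng(0)
latent = rng.random((3, 8, 8))                   # three underlying patterns
mixing = rng.random((12, 3))
channels = np.tensordot(mixing, latent, axes=1)  # shape (12, 8, 8)
channels += 1e-3 * rng.standard_normal(channels.shape)

pc_images, ratios = pca_component_images(channels)
```

Because the synthetic channels are built from only three patterns, at most three components are needed to pass the 99% threshold here; with real images the number retained depends on the data.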
For the first three principal component images obtained, texture feature extraction was performed using the three methods mentioned in
Section 2.6: GLCM, LBP, and HI. Specifically, through the GLCM algorithm, a total of 48 texture features were extracted from each foreign body image (3 principal components × 4 directions × 4 texture features). Through the uniform LBP algorithm, a total of 177 texture features were extracted from each foreign body image (3 principal components × 59 texture features). Through the HI algorithm, a total of 18 texture features were extracted from each foreign body image (3 principal components × 6 texture features).
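As an illustration of how the 48-dimensional GLCM feature vector arises (3 principal components × 4 directions × 4 features), the following sketch computes contrast, energy, entropy, and correlation for each direction. The pixel distance of 1, the quantization to 8 gray levels, the symmetric normalization, the definition of energy as the sum of squared probabilities, and all function names are assumptions made for the example; the settings actually used are those of Section 2.6.

```python
import numpy as np

# (row, column) offsets for the four directions at a pixel distance of 1.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def quantize(img, levels=8):
    """Min-max scale an image and map it to integer gray levels 0..levels-1."""
    img = img.astype(np.float64)
    span = img.max() - img.min()
    scaled = (img - img.min()) / span if span > 0 else np.zeros_like(img)
    return np.minimum((scaled * levels).astype(int), levels - 1)

def glcm(img, offset, levels):
    """Normalized symmetric gray-level co-occurrence matrix for one offset."""
    dr, dc = offset
    h, w = img.shape
    mat = np.zeros((levels, levels), dtype=np.float64)
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                mat[img[r, c], img[r2, c2]] += 1
    mat = mat + mat.T                      # count each pair in both orders
    return mat / mat.sum()

def glcm_features(p):
    """Return [contrast, energy, entropy, correlation] for a normalized GLCM."""
    i, j = np.indices(p.shape)
    contrast = float(((i - j) ** 2 * p).sum())
    energy = float((p ** 2).sum())
    nonzero = p[p > 0]
    entropy = float(-(nonzero * np.log2(nonzero)).sum())
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    if sd_i == 0 or sd_j == 0:
        correlation = 1.0                  # convention for a constant image
    else:
        correlation = float(((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j))
    return [contrast, energy, entropy, correlation]

def texture_vector(pc_images, levels=8):
    """Concatenate the 4 features over 4 directions for every component image."""
    feats = []
    for pc in pc_images:
        q = quantize(pc, levels)
        for angle in (0, 45, 90, 135):
            feats.extend(glcm_features(glcm(q, OFFSETS[angle], levels)))
    return feats

rng = np.random.default_rng(0)
vector = texture_vector(rng.random((3, 16, 16)))   # random stand-in images
flat = glcm_features(glcm(np.zeros((5, 5), dtype=int), OFFSETS[0], 8))
```

The uniform LBP (59 bins per component) and HI (6 indices per component) vectors are assembled in the same per-component fashion, giving 177 and 18 features, respectively.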
3.5. Optimal Parameters of the Model
Support vector machine (SVM), a well-established supervised learning algorithm, was selected as the core model for foreign body classification in this study. Its key advantage lies in the ability to project linearly inseparable features from the original space into a higher-dimensional feature space via kernel functions, thereby constructing an optimal separating hyperplane in this transformed space. This characteristic makes it particularly suitable for handling moderate-sized classification tasks with complex feature distributions, such as the one addressed in this research. Therefore, to evaluate the discriminative capability of the extracted texture features and construct high-performance classification models, parameter optimization was performed on the SVM models corresponding to different feature sets. The optimization process employed a genetic algorithm (GA) to optimize the penalty parameter ‘c’ and the kernel parameter ‘g’ of the support vector machine. During optimization, the classification accuracy of the model on the validation set was used as the fitness evaluation metric. As shown in
Figure 9a–c, the orange circles and red dots in the figures represent the average fitness and the best fitness values for each iteration, respectively. The best fitness curves of all models converged as the iterations progressed, indicating that the model parameters had reached their optimal solutions. Under these conditions, the parameter combinations corresponding to the highest validation accuracy were selected as the final model parameters. The optimal parameters and their corresponding validation accuracy for each model are presented in
Table 3. The Color-LBP-GASVM model achieved the highest validation accuracy of 98.41%, outperforming the Color-GLCM-GASVM and Color-HI-GASVM models (both with validation accuracies below 96%), preliminarily indicating that the Color-LBP-GASVM model possesses superior classification performance.
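The parameter search described above can be illustrated with a small genetic algorithm. The sketch below is not the authors' implementation: the fitness function is a hypothetical smooth surrogate standing in for validation accuracy (a real run would train an SVM with each candidate pair and score it on the validation set), the parameters are searched on an assumed log2 scale, and the population size, selection scheme, crossover, and mutation settings are all illustrative choices.

```python
import math
import random

def surrogate_accuracy(log2_c, log2_g):
    """Hypothetical stand-in for validation accuracy; it peaks at (3, -5)."""
    return math.exp(-((log2_c - 3.0) ** 2 + (log2_g + 5.0) ** 2) / 20.0)

def genetic_search(fitness, bounds, pop_size=20, generations=40,
                   mutation_rate=0.2, seed=0):
    """Maximize fitness over the box given by bounds; returns best and histories."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    best_history, mean_history = [], []
    for _ in range(generations):
        scored = sorted(population, key=lambda ind: fitness(*ind), reverse=True)
        fits = [fitness(*ind) for ind in scored]
        best_history.append(fits[0])                 # best fitness this iteration
        mean_history.append(sum(fits) / len(fits))   # average fitness this iteration
        parents = scored[: pop_size // 2]            # truncation selection
        children = [scored[0][:]]                    # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]  # blend crossover
            for k, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[k] += rng.gauss(0.0, 0.1 * (hi - lo))  # Gaussian mutation
                child[k] = min(max(child[k], lo), hi)            # stay inside bounds
            children.append(child)
        population = children
    best = max(population, key=lambda ind: fitness(*ind))
    return best, best_history, mean_history

BOUNDS = [(-5.0, 15.0), (-15.0, 3.0)]        # assumed ranges for log2(c) and log2(g)
best, best_history, mean_history = genetic_search(surrogate_accuracy, BOUNDS)
c, g = 2.0 ** best[0], 2.0 ** best[1]        # candidate SVM parameters
```

With elitism the best-fitness curve never decreases, which matches the converging best-fitness curves described for Figure 9, while the average fitness fluctuates as mutation introduces new candidates.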
3.6. Classification Results of the Model
In this study, the performance of each model was comprehensively analyzed based on three metrics: validation set accuracy, test set accuracy, and F1 score. As shown in
Table 3, the Color-GLCM-GASVM and Color-HI-GASVM models achieved accuracy rates of 95.0% and 93.65%, respectively, on the validation set; however, their accuracy rates on the test set were both below 80.0%, and neither model’s F1 score exceeded 80.0%. These results indicate that while these two models exhibit good fitting performance on the training data, their stability and generalization capabilities are relatively insufficient. Among the three SVM models, the Color-LBP-GASVM model performed the best, achieving accuracy rates of 98.41% and 95.9% on the validation and test sets, respectively, with an F1 score of 96.15%. These results indicate that the Color-LBP-GASVM model demonstrates favorable classification performance in the foreign object recognition task, while also exhibiting reasonable generalization capabilities and model stability. It is worth noting that the model’s accuracy on the validation set is higher than that on the test set, which may suggest a slight degree of overfitting. However, the difference between the two is approximately 2.5 percentage points, which remains within an acceptable range. We believe that the primary cause of this discrepancy is not model overfitting, but rather statistical fluctuations resulting from the random division of small subsets. The validation set consists of 20% of the training set (63 images in total), while the test set comprises 270 independent images. Even if both sets follow the same data distribution, random division under small-sample conditions can still lead to natural fluctuations in accuracy. For example, if the validation set happens to contain more “simple” samples with clear fluorescence images and distinct foreign object edges, while the test set contains more “difficult” samples with complex backgrounds and locally blurred foreign objects, it is reasonable for the validation set to have a relatively higher accuracy than the test set.
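The size of the sampling fluctuation invoked above can be estimated with a simple binomial model. The sketch below is a rough illustration that assumes independent samples and uses the reported test accuracy as the underlying rate; it is not a formal significance test, and the function name is hypothetical.

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

n_val, n_test = 63, 270
val_correct = round(0.9841 * n_val)       # 98.41% of 63 images -> 62 correct
step_val = 100.0 / n_val                  # one validation image is worth ~1.59 points

p = 0.959                                 # reported test accuracy, used as the assumed rate
se_val = 100.0 * accuracy_standard_error(p, n_val)     # ~2.5 percentage points
se_test = 100.0 * accuracy_standard_error(p, n_test)   # ~1.2 percentage points
```

Under these assumptions the 2.5-point gap is about one standard error of the validation estimate, and a single additional misclassified validation image would shift the validation accuracy by roughly 1.6 points, which is consistent with the sampling-fluctuation explanation.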
The confusion matrices of the aforementioned models on the test set are shown in
Figure 10. As observed in
Figure 10a,b, models constructed based on GLCM and HI features exhibit inferior classification performance for both FB-F and FB-I categories. This is primarily because GLCM and HI are global texture statistical features. GLCM extracts the second-order joint probability distribution of an image, characterizing texture by analyzing the spatial co-occurrence relationships between pixel grayscale values across the entire image; in essence, it provides a macroscopic description of the entire image area. HI similarly performs statistical analysis based on global hue, saturation, and intensity distributions. In this study, the foreign object region accounts for a small proportion (approximately 5–15%), while the background is uniformly black (pixel value of 0). Under these conditions, the GLCM co-occurrence matrix is dominated by “0–0” pixel pairs, whose probability is far higher than that of grayscale pairs within the foreign object or at its edges. Consequently, statistical measures such as entropy, correlation, energy, and contrast are primarily determined by the background rather than the foreign object. These global metrics reflect the overall disorder and line complexity of the entire image; as a result, the true texture information of foreign objects is severely diluted or even overwhelmed by a large amount of background data. Similarly, the HI statistics on global color intensity are easily dominated by uniform background regions. Consequently, features extracted based on GLCM and HI contain only weak discriminative information between FB-F and FB-I, resulting in limited classification performance. In contrast, LBP employs a local neighborhood comparison strategy; the LBP code for each pixel depends solely on the gray-level contrast of the 8 pixels within its 3 × 3 neighborhood and is completely independent of the global pixel distribution. 
For uniform background regions (where all pixel values are 0), the LBP encoding is uniformly quantized to a specific uniform pattern (i.e., one of the LBP uniform patterns). The contribution of this pattern to the LBP histogram is concentrated on a single dimension, rather than generating scattered background noise across multiple statistics as in GLCM. More importantly, through local binarization, LBP is inherently insensitive to illumination and background grayscale, with its encoding results reflecting only relative changes within the local neighborhood [
37]. When the proportion of foreign objects is very small, although background regions contain a large number of pixels, they occupy only one or a few feature dimensions in the LBP histogram, whereas the edges of foreign objects, internal microstructures, and local differences between FB-F and FB-I are encoded into different LBP pattern dimensions. This sparsity effect in feature dimensions enables LBP to effectively suppress global background noise and concentrate limited discriminative information into a few feature channels, thereby significantly enhancing the classifier’s ability to resolve weak signals. Consequently, compared to other classical machine learning models, the Color-LBP-GASVM model not only more accurately identifies FS but also achieves higher classification accuracy for both FB-I and FB-F, with an average model accuracy of 95.9%. In summary, when using color and LBP-based texture features as model inputs, the method demonstrates high classification performance, enabling the model to cover the vast majority of target samples. The Color-LBP-GASVM model achieved a macro-average accuracy, precision, recall, and F1 score of 96.59%, 96.59%, 96.37%, and 96.15%, respectively. These results indicate that the model performs well in terms of overall classification performance and demonstrates reasonable reliability and potential practicality.
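The concentration of a uniform black background into a single LBP histogram bin can be checked with a small example. The sketch below is an illustrative pure-NumPy implementation of the 3 × 3 uniform LBP with 59 bins; the neighbor ordering, the `>=` comparison rule, the function names, and the synthetic 20 × 20 test image with a 6 × 6 textured patch (9% of the area) are assumptions made for the example.

```python
import numpy as np

def transitions(code):
    """Number of 0/1 changes in the circular 8-bit pattern."""
    bits = [(code >> k) & 1 for k in range(8)]
    return sum(bits[k] != bits[(k + 1) % 8] for k in range(8))

# The 58 uniform patterns get their own bins; all other patterns share bin 58.
UNIFORM_CODES = [c for c in range(256) if transitions(c) <= 2]
BIN_OF = {code: index for index, code in enumerate(UNIFORM_CODES)}

# Neighbor offsets in circular order around the center pixel.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def uniform_lbp_histogram(img):
    """59-bin uniform LBP histogram over the interior pixels of a 2D image."""
    h, w = img.shape
    hist = np.zeros(59, dtype=np.int64)
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            code = 0
            for k, (dr, dc) in enumerate(NEIGHBORS):
                if img[r + dr, c + dc] >= img[r, c]:
                    code |= 1 << k
            hist[BIN_OF.get(code, 58)] += 1
    return hist

# Black background (value 0) with a small textured patch standing in for a foreign object.
img = np.zeros((20, 20), dtype=np.int64)
img[7:13, 7:13] = (np.arange(36).reshape(6, 6) * 37) % 200 + 50

hist = uniform_lbp_histogram(img)
background_bin = BIN_OF[255]      # a zero-valued center always yields the all-ones pattern
```

All 288 interior background pixels fall into the single all-ones bin, while the patch pixels spread over other bins, so the background occupies one dimension of the feature vector instead of diluting every statistic.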
Figure 10c shows the confusion matrix for the Color-LBP-GASVM classification model. As shown in the figure, the recall rates for FB-F, FB-I, and FS are 98.9%, 88.9%, and 100%, respectively, with FB-I having the lowest recall rate. In terms of precision, both FB-F and FB-I achieve 100%, while FS has a precision of 89.1%; the model’s overall classification accuracy is 95.9%. The confusion matrix shows that 10 FB-I samples were misclassified as FS, while 1 FB-F sample was misclassified as FS. This may be because FB-I is embedded within the fish meat, with fish tissue adhering to its surface; consequently, the FB-I region segmented by the U-Net includes some fish meat. Additionally, since FS appears colorless and translucent and is attached to the surface of the fish meat, its image exhibits some of the textural features of fish tissue, leading the model to misclassify some FB-I instances as FS during feature analysis. Overall, the Color-LBP-GASVM classification model performs effectively in the classification tasks for the three types of foreign objects.
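The per-class figures quoted above follow directly from the confusion matrix. In the sketch below the matrix is reconstructed from the reported misclassifications (10 FB-I and 1 FB-F samples predicted as FS) under the assumption that the 270 test images are split evenly into 90 per class; that split is inferred from the reported recall rates rather than stated in this section, and the helper names are hypothetical.

```python
LABELS = ["FB-F", "FB-I", "FS"]

# Rows are true classes, columns are predicted classes (order as in LABELS).
CONFUSION = [
    [89, 0, 1],     # FB-F: 1 sample predicted as FS
    [0, 80, 10],    # FB-I: 10 samples predicted as FS
    [0, 0, 90],     # FS: all samples correct
]

def recall(matrix, k):
    """Fraction of true class-k samples that were predicted as class k."""
    return matrix[k][k] / sum(matrix[k])

def precision(matrix, k):
    """Fraction of class-k predictions that are truly class k."""
    return matrix[k][k] / sum(row[k] for row in matrix)

def accuracy(matrix):
    """Fraction of all samples on the diagonal."""
    return sum(matrix[k][k] for k in range(len(matrix))) / sum(map(sum, matrix))

recalls = [round(100 * recall(CONFUSION, k), 1) for k in range(3)]
precisions = [round(100 * precision(CONFUSION, k), 1) for k in range(3)]
overall = round(100 * accuracy(CONFUSION), 1)
```

Because every misclassified sample is assigned to FS, FS is the only class whose precision falls below 100%, while FB-I carries almost all of the recall loss.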
In summary, the Color-LBP-GASVM classification model performed best among the three models, with recall rates of 98.9%, 88.9%, and 100% for FB-F, FB-I, and FS, respectively. Both FB-F and FS were identified accurately, whereas the classification performance for FB-I was relatively weaker. This may be attributed to the fact that the surface of FB-I is obscured by fish meat, resulting in less distinct color and texture features. These results indicate that the approach based on LBP texture features and color characteristics enables effective detection of the three types of foreign bodies and achieves satisfactory classification performance.
3.7. Comparative Experiment with End-to-End Models
To validate the rationality of the model architecture design, we selected two mainstream end-to-end image classification models (YOLO and ResNet) and conducted direct classification and recognition tests on ultraviolet fluorescence images of three types of foreign objects (FS, FB-F, and FB-I) using the same dataset and training and testing strategies. The experimental results are shown in
Figure 11. Compared to the model developed in this study, ResNet and YOLO performed poorly overall in the foreign object identification task for tilapia fillets, with particularly significant shortcomings regarding FB-I-class foreign objects: their recall rates for FB-I were only 57.8% and 63.3%, respectively, with prominent false negatives, making it difficult to reliably capture the features of this type of foreign object. A preliminary analysis suggests that this is primarily due to the inherent fluorescent properties of the fish meat itself. Since foreign objects (especially small fish bones) are relatively small in volume compared to the fish meat background, the end-to-end models may have overemphasized the fish meat background, thereby weakening the key features of the foreign objects and significantly interfering with the classification results.
In contrast, the two-stage architecture proposed in this study (U-Net segmentation, multi-feature fusion, and GASVM classification) first uses a segmentation module to precisely locate foreign object regions and then performs fine-grained classification based on differences in color and texture features. This effectively addresses a shortcoming of end-to-end models, which struggle to capture small, low-contrast features of foreign objects because they learn global features directly from raw images, thereby demonstrating greater validity and applicability in the current application scenarios.
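The control flow of the segment-then-classify architecture can be sketched as a small function that chains the stages. Everything in this example is a placeholder chosen for illustration: the stage functions are toy stand-ins (a threshold instead of U-Net, two simple statistics instead of the color and LBP features, and a rule instead of the GASVM classifier), and the names and labels are hypothetical.

```python
def detect_and_classify(image, segment, extract_features, classify):
    """Run segmentation, masking, feature extraction, and classification in order."""
    mask = segment(image)
    if not any(any(row) for row in mask):
        return None                              # no foreign-object pixels found
    # Keep only the segmented region; everything else becomes black (0).
    region = [[px if keep else 0 for px, keep in zip(img_row, mask_row)]
              for img_row, mask_row in zip(image, mask)]
    return classify(extract_features(region))

# Toy stand-ins for the three stages (not the models used in the study).
def toy_segment(image):
    return [[px > 125 for px in row] for row in image]

def toy_features(region):
    kept = [px for row in region for px in row if px > 0]
    return [sum(kept) / len(kept), len(kept)]    # mean intensity and area

def toy_classify(features):
    mean_intensity, _area = features
    return "bright" if mean_intensity >= 180 else "dim"

label = detect_and_classify([[10, 200], [220, 10]],
                            toy_segment, toy_features, toy_classify)
empty = detect_and_classify([[10, 20], [30, 40]],
                            toy_segment, toy_features, toy_classify)
```

The point of the structure is that the classifier only ever sees features computed from the masked region, so the fluorescent fish-meat background cannot dominate the classification stage.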
3.8. Generalization Evaluation
To further evaluate the model’s generalization ability in real-world complex scenarios and prevent it from overfitting to the sample characteristics and background distribution of tilapia, we additionally procured three common whitefish species (Cyprinus carpio, Ctenopharyngodon idella, and Gadus morhua). Using the same sample preparation process as for tilapia, we prepared 90 samples of each fish species, with 30 samples for each of the three foreign object categories, and captured corresponding UV fluorescence images. The image data for these new fish species were directly fed into the pre-trained Color-LBP-GASVM classification model (without any model parameter adjustments or retraining). As shown in
Figure 12, the model achieved overall classification accuracies of 92.2%, 91.1%, and 94.4% for Ctenopharyngodon idella, Cyprinus carpio, and Gadus morhua, respectively, all exceeding 90%. However, the recall rate for FB-I class foreign bodies was lower in the three fish species than in tilapia. Preliminary analysis indicates that carp, grass carp, and cod differ from tilapia in terms of muscle protein content and texture. These differences may affect the UV fluorescence response of embedded foreign objects, thereby complicating feature extraction and target recognition and increasing the likelihood of missed detections. Taken together, the results of this cross-species validation demonstrate that the method proposed in this study exhibits reasonable generalizability in detecting foreign bodies within different fish species.
3.9. Industrial Applicability and Efficiency Evaluation
To assess the potential value of the detection method described in this paper for industrial applications, this study analyzed the operational efficiency and adaptability of the detection system under dynamic production line conditions. We randomly selected 20 samples from the test set and measured their execution times at each stage of the model. The results showed that the time required to perform precise segmentation of a single foreign object image using the U-Net neural network was consistently less than 365 milliseconds; the time required to extract color and texture features of foreign objects using the feature extraction module was consistently less than 83 milliseconds; and the time required to classify foreign objects using the SVM classification model was consistently less than 0.07 milliseconds. When the processing times of all modules are combined, the entire detection system completes the full identification process for a single foreign object sample in less than 500 ms, generally adapting to the actual production pace of tilapia fillet processing lines; simultaneously, the introduction of a multi-channel parallel detection architecture can further accelerate the process, effectively enhancing the algorithm’s engineering application value. When applying this method to actual production, the “clean/contaminated” classification problem must also be addressed. This can be achieved by directly utilizing the background regions (areas with pixel values of 0) obtained from U-Net segmentation of existing labeled images. By randomly sampling image patches, “foreign-object-free” class samples are constructed and combined with foreign-object region samples to train a simple binary classifier. Furthermore, to address practical operational challenges such as variations in fish fillet thickness and on-site environmental noise, the model can undergo incremental training using actual production data.
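The timing budget above can be checked with a few lines of arithmetic. The sketch below sums the reported per-stage upper bounds and derives a lower bound on throughput; the assumption that throughput scales linearly with the number of parallel detection channels is an idealization introduced for the example, and the names are hypothetical.

```python
# Reported upper bounds on per-sample processing time, in milliseconds.
STAGE_UPPER_BOUNDS_MS = {
    "U-Net segmentation": 365.0,
    "feature extraction": 83.0,
    "SVM classification": 0.07,
}

total_ms = sum(STAGE_UPPER_BOUNDS_MS.values())    # worst-case time per sample
per_channel_rate = 1000.0 / total_ms              # samples per second, lower bound

def throughput(channels):
    """Idealized samples per second with the given number of parallel channels."""
    return channels * per_channel_rate

share = {name: ms / total_ms for name, ms in STAGE_UPPER_BOUNDS_MS.items()}
```

The worst-case total is 448.07 ms, which leaves about 52 ms of headroom under the 500 ms figure. Segmentation accounts for roughly 81% of the budget, so it is the stage where acceleration would pay off most.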