1. Introduction
The detection and control of viral contaminants in biopharmaceutical processes are critical for ensuring product quality and patient safety [1,2]. Retroviruses such as murine leukemia virus (MLV) or xenotropic murine leukemia virus (X-MuLV) can contaminate mammalian cell substrates used for vaccines and therapeutic proteins, posing substantial biosafety risks [3]. Among current detection methods, Transmission Electron Microscopy (TEM) remains one of the most reliable tools for identifying viral contaminants, as it allows for direct visualization of viral particles without prior knowledge of the pathogen or specific reagents [4].
TEM is widely used to detect both known and emerging viral agents in unprocessed bulk (UPB) materials, making it an essential component of quality control during biologics manufacturing [5,6]. For example, TEM analysis of cell lines and viral harvests has proven effective in identifying retrovirus-like particles that may be invisible to infectivity assays. However, manual TEM analysis is time-consuming, requiring skilled operators and often taking hours to process a single batch of images, with a high risk of undercounting due to human fatigue [1,7].
Artificial intelligence (AI), particularly deep learning, has recently shown great promise in automating virus detection from microscopy images [8,9]. The YOLO (You Only Look Once) object detection family is well known for delivering real-time detection with high accuracy, making it an attractive choice for virus detection applications [10]. Studies have successfully applied CNN-based methods for detecting viral particles in TEM images, achieving near-perfect detection metrics [11]. Furthermore, the availability of benchmark TEM datasets and advances in network architectures such as DenseNet201, ViResNet, and transformer-based models (e.g., VISN) have paved the way for more accurate and efficient virus detection pipelines [12,13,14].
In this work, we developed a YOLOv11-based virus detection system, enhanced by integrating BiFPN (Bidirectional Feature Pyramid Network) for multi-scale feature fusion and C3K2_IDWC for lightweight, morphology-adaptive detection. Our model was specifically designed for detecting retrovirus particles in UPB samples. We demonstrate that the proposed model significantly improves detection performance, reduces computational overhead, and provides reliable results for downstream virus safety testing in biopharmaceutical production.
2. Data Processing
2.1. Sample Collection and Preparation
The samples to be tested were first clarified by medium-low speed centrifugation and filtration, followed by ultra-high-speed centrifugation. The precipitate was resuspended in a fixed-volume solution, mixed with an equal volume of agar, then subjected to pre-fixation with glutaraldehyde and paraformaldehyde, post-fixation with tannic acid, and staining with uranyl acetate. The fixed samples underwent gradient dehydration and infiltration and were subsequently embedded in resin to form sample blocks after polymerization. The ultrathin sectioning technique was applied to section the samples, which were then fished onto copper grids, stained, and dried. Finally, the samples were examined and photographed under a transmission electron microscope.
2.2. Data Collection and Preprocessing
In the image processing stage, raw images were first screened to remove blurred images and those without UPB virus strain cells, yielding 195 valid images. The LabelImg tool was employed to annotate the UPB virus strain samples. During annotation, precise marking targeted the morphological characteristics of the virus strains (spherical, rod-shaped, and irregular forms) and their backgrounds (single-cell backgrounds, multi-cell overlapping backgrounds, and complex backgrounds with tiny impurities). After annotation, 19 images (10% of the total) were selected as the test set, and the remaining images were divided into a training set and a validation set at a ratio of 8:1; a minimal sketch of this split is given below. The training set was used to train the detection model, the validation set to tune parameters and optimize the model, and the test set to evaluate detection performance. To ensure the representativeness of the test set, virus strains with different morphologies and backgrounds were considered when selecting test images, enhancing the reliability of the detection results. Additionally, the input size of the YOLOv11 network is 640 × 640, while the cell images obtained by computer scanning are 1095 × 866, larger than the network's input size. Since each image contains almost only one target and the YOLO network automatically resizes inputs to 640 × 640, no separate cropping was performed in this study. In object detection, the quality of training labels directly affects the final detection results; it is therefore necessary to create labels for the UPB virus strain cells in each image.
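For illustration, the split described above can be reproduced with a few lines of Python. This is a minimal sketch only: the directory layout and file names are hypothetical, not the scripts actually used in this study.

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("UPB_dataset/images").glob("*.jpg"))  # 195 screened images
random.shuffle(images)

n_test = round(0.10 * len(images))            # 19 images held out for testing
test, rest = images[:n_test], images[n_test:]
n_val = round(len(rest) / 9)                  # remaining images split train:val = 8:1
val, train = rest[:n_val], rest[n_val:]

# Write YOLO-style image list files for each subset.
for name, subset in (("train", train), ("val", val), ("test", test)):
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in subset))
```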
Figure 1 shows the process of preparation, electron microscopy imaging of UPB virus strain samples, and the training and validation of the detection model based on YOLOv11.
2.3. Structure of the YOLOv11 Algorithm
The YOLO algorithm is a one-stage object detection algorithm that, after multiple iterations, has become a well-established detector characterized by fast detection speed and high accuracy. This paper employs the YOLOv11 network to detect viruses in UPB automatically. The YOLOv11 model is a relatively new addition to the YOLO series. After receiving an input image, the model extracts image features through a convolutional neural network and detects targets by performing regression operations on the feature maps. Its overall structure is shown in Figure 2.
YOLOv11 consists of three sections: the Backbone, the Neck, and the Head. In the Backbone, YOLOv11 uses the C3K2 module instead of the C2f module in YOLOv10. As shown in Figure 3, the C3K2 module is composed of multiple C3K modules; stacking these modules extracts information at different depths. The C3K module, in turn, contains multiple BottleNeck modules, to which residual connections can optionally be added. A residual connection adds the input of the BottleNeck to its output, combining shallow and deep information while adding only a small amount of computation; a simplified sketch is given below. This enhances the network's feature extraction capability.
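The optional-residual BottleNeck can be sketched in PyTorch as follows. This is a simplified illustration only; the actual channel widths and layer composition of YOLOv11's C3K/C3K2 modules differ.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified BottleNeck with an optional residual connection."""
    def __init__(self, channels: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())
        self.cv2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())
        self.add = shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.cv2(self.cv1(x))
        # Residual: add the block input to its output so shallow and deep
        # information are combined at negligible extra cost.
        return x + y if self.add else y
```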
In addition, YOLOv11 incorporates the SPPF module in the second-to-last layer of its backbone. This module is crucial for extracting multi-scale feature information. By stacking three max-pooling operations, it achieves feature fusion from local to global, further enhancing the information representation of the feature maps and improving detection accuracy. The structure of the SPPF module is shown in Figure 4, where the kernel size of each max-pooling operation is 5 × 5.
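A minimal PyTorch sketch of such an SPPF block, assuming the common design in which a 1 × 1 convolution halves the channels and the outputs of three shared 5 × 5 max-pools are concatenated:

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Sketch of SPPF: three stacked 5x5 max-pools fuse features from local to global."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # 5x5 receptive field
        y2 = self.pool(y1)   # effectively ~9x9
        y3 = self.pool(y2)   # effectively ~13x13, toward global context
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```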
In the last layer of the backbone, YOLOv11 uses the C2PSA module. This module further enhances the model's spatial attention through the stacking of multiple PSA modules, facilitating feature extraction. The Attention module within PSA is inspired by the Vision Transformer and adopts its attention mechanism to strengthen the network's feature extraction; a minimal illustration is given below. The structure of C2PSA is shown in Figure 5.
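The sketch below illustrates only the ViT-style spatial self-attention idea underlying PSA; the actual PSA/C2PSA modules are more elaborate (multi-head attention plus convolutional feed-forward blocks), and all names and widths here are chosen for illustration.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Minimal single-head self-attention over spatial positions."""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, 1)   # query/key/value projections
        self.proj = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        attn = torch.softmax(q.transpose(1, 2) @ k * self.scale, dim=-1)  # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return x + self.proj(out)  # residual connection around the attention
```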
In the Neck, YOLOv11 employs a Path Aggregation Network (PANet) to enhance feature fusion. During upsampling, the spatial dimensions of the feature maps are increased by interpolating between the pixels of the original feature maps. Here, nearest-neighbor interpolation doubles the height and width of the feature maps before they are stacked with the feature maps at the corresponding scale in the backbone. This enables communication between deep-layer and shallow-layer information and facilitates the transfer of deep-layer information to the shallow layers. After two rounds of upsampling and stacking, YOLOv11 performs downsampling, which halves the height and width of the feature maps through convolution. The reduced feature maps are then stacked with the feature maps from before the upsampling, further promoting the fusion of deep-layer and shallow-layer information. The upsample-and-stack step is sketched below.
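A short sketch of nearest-neighbor upsampling followed by channel-wise stacking; the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def upsample_and_stack(deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
    """Double H and W of the deeper map, then concatenate with the shallower map."""
    deep_up = F.interpolate(deep, scale_factor=2, mode="nearest")
    return torch.cat([deep_up, shallow], dim=1)

# Example with feature maps at strides 32 and 16:
deep = torch.randn(1, 512, 20, 20)
shallow = torch.randn(1, 256, 40, 40)
fused = upsample_and_stack(deep, shallow)  # -> shape (1, 768, 40, 40)
```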
In the Head, compared to YOLOv10, YOLOv11 replaces the two CBS modules in the classification branch with DWConv modules. DWConv is a depthwise (grouped) convolution module, with the number of groups set to the greatest common divisor of the input and output channel numbers; a one-line sketch is given below. This convolution effectively reduces redundant parameters and increases training efficiency.
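A minimal sketch of such a grouped convolution, with channel counts chosen for illustration:

```python
import math
import torch.nn as nn

def dwconv(c_in: int, c_out: int, k: int = 3) -> nn.Conv2d:
    """Grouped convolution with groups = gcd(c_in, c_out), as described above."""
    g = math.gcd(c_in, c_out)
    return nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=g, bias=False)

conv = dwconv(64, 80)  # groups = 16: far fewer parameters than a dense 3x3 conv
```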
2.4. The Improved YOLOv11
Given that the standard YOLOv11 is designed for multi-class detection, applying it to a single category, such as retrovirus particles in UPB, leads to redundant parameters and computations. To address this, we simplified the YOLOv11 network while maintaining training accuracy, thereby reducing the number of training parameters and improving efficiency.
We propose replacing the original PANet with BiFPN to significantly reduce network parameters. This modification optimizes feature fusion by enhancing cross-scale connections, akin to improvements in YOLOv5 where key modules such as feature fusion paths were optimized to reduce redundant computations while preserving feature representation [15]. BiFPN achieves efficient fusion by increasing node inputs and introducing bidirectional paths, thereby avoiding excessive feature extraction and aligning well with the single-class UPB virus dataset.
While PANet improves accuracy through downsampling, it demands more parameters and computational effort. Given the relatively simple structure of our dataset, PANet would introduce unnecessary computational load. We therefore introduced the Weighted Bi-directional Feature Pyramid Network (BiFPN), which optimizes the cross-scale feature fusion of the traditional FPN. BiFPN allows each node to have multiple inputs, enhancing fusion efficiency; additionally, for nodes whose input and output sit at the same level, BiFPN adds an extra input from the backbone, enabling richer feature fusion at little additional cost. Unlike PANet, BiFPN treats its bidirectional paths as a single feature layer that is repeated multiple times to strengthen fusion. The BiFPN structure, shown in Figure 6, includes the Node_mode and Fusion modules. The Node_mode is composed of three C3K2 modules and extracts features through their stacking. The Fusion module generates trainable fusion weights, which are normalized and multiplied by the input feature maps to produce the final fused feature; a sketch of this weighted fusion is given below.
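A minimal sketch of the weighted fusion, assuming the fast normalized fusion used in the original BiFPN (ReLU-clipped trainable weights normalized to sum to one); module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Trainable fusion weights, normalized and multiplied with each input map."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):           # list of same-shape feature maps
        w = torch.relu(self.w)           # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)     # normalize so the weights sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

fusion = WeightedFusion(n_inputs=3)
maps = [torch.randn(1, 256, 40, 40) for _ in range(3)]
fused = fusion(maps)                     # weighted sum, shape (1, 256, 40, 40)
```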
The irregular shapes of viruses and viral particles in the UPB dataset present challenges for traditional square convolution kernels. Conventional convolution fails to capture the complete target information, especially when the target occupies a small portion of the image, reducing model accuracy. Furthermore, redundant computations arise from processing a large number of similar channels. To mitigate this, we introduce the C3K2_IDWC module, which replaces some C3K2 modules to perform multi-scale convolution and reduce computational load. This approach, inspired by applications in agricultural disease detection [16], uses 3 × 3, 1 × 11, and 11 × 1 convolutions to adapt to irregular virus morphologies. The grouped convolution, with the number of groups set to 1/8 of the input channels, reduces redundant computations while enhancing the network's ability to capture virus-specific features. The C3K2_IDWC structure, shown in Figure 7, includes two InceptionDWConv2d operations that divide the input channels into four parts, each undergoing a different convolution to improve feature extraction; a sketch is given below. The final architecture replaces the last two C3K2 modules with C3K2_IDWC modules and replaces the PANet neck with BiFPN. These improvements strike a balance between lightweight design and high precision, consistent with the logic of optimizing YOLO for specific tasks [17]. The structure of the improved YOLOv11 network is shown in Figure 8.
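The InceptionDWConv2d operation can be sketched as follows, assuming the four-way channel split described above (identity, 3 × 3, 1 × 11, and 11 × 1 depthwise branches) with 1/8 of the input channels per convolutional branch; the exact branch ratios of the implementation may differ.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Channels are split into four parts routed through identity, square,
    horizontal-band, and vertical-band depthwise convolutions."""
    def __init__(self, channels: int, band_k: int = 11):
        super().__init__()
        gc = channels // 8                        # channels per conv branch
        self.split = (channels - 3 * gc, gc, gc, gc)
        self.dw_sq = nn.Conv2d(gc, gc, 3, padding=1, groups=gc)                           # 3x3
        self.dw_h = nn.Conv2d(gc, gc, (1, band_k), padding=(0, band_k // 2), groups=gc)   # 1x11
        self.dw_v = nn.Conv2d(gc, gc, (band_k, 1), padding=(band_k // 2, 0), groups=gc)   # 11x1

    def forward(self, x):
        x_id, x_sq, x_h, x_v = torch.split(x, self.split, dim=1)
        return torch.cat([x_id, self.dw_sq(x_sq), self.dw_h(x_h), self.dw_v(x_v)], dim=1)
```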
3. Model Selection
Training of the YOLOv11 model was conducted in a Python 3.9 environment with the PyTorch 2.3.1 framework, running on an NVIDIA GeForce RTX 3090 GPU (Wuxi, China). The SGD optimizer was used to optimize the network parameters, with the momentum set to 0.937, the batch size set to 4, and the number of training epochs set to 100. Early stopping was triggered if the composite metric (calculated as 0.9 × mAP@0.5 + 0.1 × mAP@0.5:0.95) showed no improvement over 50 consecutive epochs. An illustrative training call with these settings is sketched below.
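The sketch below uses the Ultralytics API with the hyperparameters reported above; the model and dataset YAML file names are hypothetical, and the improved network itself would be defined in a custom model configuration rather than this stock call.

```python
from ultralytics import YOLO

# Hypothetical YAMLs: the improved architecture (BiFPN neck, C3K2_IDWC blocks)
# would be defined in a custom model configuration file.
model = YOLO("yolov11n-bifpn-idwc.yaml")
model.train(
    data="upb_virus.yaml",  # dataset config: image lists and the single class
    epochs=100,
    batch=4,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
    patience=50,            # stop if the fitness metric stalls for 50 epochs
    device=0,               # single RTX 3090
)
```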
In the YOLO object detection algorithm, detection accuracy is evaluated using the mean Average Precision (mAP). As with other machine learning models, predictions can be categorized into four cases: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
To calculate mAP, it is necessary to first compute Recall (R) and Precision (P). Recall is the ratio of the number of true positives to the total number of positive samples, which reflects the rate of missed detections of UPB virus strains:

$$R = \frac{TP}{TP + FN}$$

Precision is the ratio of true positives to all detected samples, which reflects the false detections of UPB virus strains:

$$P = \frac{TP}{TP + FP}$$
Object detection algorithms typically use average precision (AP) and mean average precision (mAP) to assess the accuracy of model detection results. AP measures the detection accuracy for a single target class by evaluating the trade-off between precision and recall, i.e., the area under the precision-recall curve:

$$AP = \int_{0}^{1} P(R)\,dR$$

mAP is the mean of the AP values over all classes and is used to evaluate multiclass detection accuracy, offering insight into the model's performance across all classes:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where N is the number of classes. A small numerical sketch of these metrics is given below.
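For concreteness, the metrics can be computed as in the following sketch; the 101-point interpolation is one common convention for approximating the area under the precision-recall curve, not necessarily the exact scheme used by our evaluation code.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """P = TP / (TP + FP), R = TP / (TP + FN), as defined above."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP approximated as the area under the P-R curve (101-point interpolation)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 101

# For a single-class task such as UPB virus detection, mAP equals the AP of
# that one class (N = 1 in the mAP formula above).
print(precision_recall(tp=19, fp=3, fn=0))  # -> (0.864, 1.0), cf. Section 4.3
```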
5. Discussion
Viral safety is of paramount importance in biopharmaceutical production, particularly for biologics derived from mammalian cells. Retroviruses such as murine leukemia virus (MLV) and xenotropic murine leukemia virus (X-MuLV) can contaminate cell substrates, posing significant biosafety risks [4]. Transmission Electron Microscopy (TEM) serves as a key tool for detecting viruses in unprocessed bulk (UPB) samples, enabling direct visualization of viral particles without requiring prior knowledge of the pathogen [3]. Nevertheless, manual TEM analysis is time-consuming, taking 2–4 h to complete, and human fatigue may lead to undercounting or misjudgment, issues that could delay production or result in missed contamination [7,18].
The improved YOLOv11 model (YOLOv11n + BiFPN + IDWC) presents a viable solution to these challenges. With a mean Average Precision (mAP@0.5) of 0.995, it ensures reliable virus identification, thereby meeting regulatory requirements. With 1.74M parameters and 6.0 GFLOPs, it is compatible with standard laboratory equipment, striking a balance between detection accuracy and practical feasibility. For example, reducing TEM analysis from 2–4 h to under 5 min per batch (with each image processed in under 1 s) could shorten QC release timelines.
From a biological perspective, the model is well-adapted to the characteristics of viral particles in TEM images. The BiFPN module enhances cross-scale feature fusion, making it effective for identifying viruses of varying sizes within complex backgrounds, while the C3K2_IDWC module captures irregular viral morphologies through multi-scale convolutions, contributing to the model's high precision (0.954) and recall (1.0). Notably, the noise robustness experiment (Section 4.3) further validates the model's adaptability to the actual characteristics of TEM images. Under 3 × 3, 5 × 5, and 7 × 7 Gaussian blurring (simulating common issues in industrial scenarios such as slight defocusing and uneven section thickness), the model's Recall remained consistently at 1.0 (all 19 real viral particles were detected). This demonstrates that the cross-scale feature fusion of BiFPN can accurately anchor the core morphological features of viruses even when viral edges are weakened by blurring, and that the multi-scale convolution (3 × 3, 1 × 11, 11 × 1) design of C3K2_IDWC effectively counteracts the interference of blurring with the recognition of irregular virus morphologies. This complements the detection performance under ideal conditions (Section 4.1), highlighting the model's biological adaptability in non-ideal TEM workflows. A sketch of this blur test is given below.
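The blur robustness check can be sketched as follows, assuming an Ultralytics-style trained model; the weight and image paths are illustrative.

```python
import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # trained weights (illustrative path)
img = cv2.imread("test_images/upb_001.jpg")        # one of the 19 test images (illustrative)
for k in (3, 5, 7):
    blurred = cv2.GaussianBlur(img, (k, k), 0)     # sigma derived from the kernel size
    results = model.predict(blurred, imgsz=640)    # run detection on the degraded image
```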
In practical applications, the model enables automated detection, processing image data within seconds and providing real-time results. This not only shortens quality control cycles but also allows dynamic adjustments to downstream purification processes, minimizing production losses caused by detection delays. For a company producing 100 batches annually, the model saves 235 min per batch (240 min vs. 5 min), totaling 391.7 h (≈49 working days) of labor time each year. This reduction directly cuts manual TEM analysis costs and enables dynamic adjustments to purification processes, minimizing production delays. More crucially, accelerating QC cycles can expedite drug development: each month shaved off the time-to-market of a biopharmaceutical can yield millions in revenue. By eliminating weeks of testing delays, the model has the potential to advance regulatory approval and commercial launch by 3–6 months, delivering substantial competitive and financial advantages. As highlighted in recent industry analyses [19], early viral detection can avoid facility downtime and production delays, with the biomanufacturing viral detection market projected to grow at a CAGR of 8.2–9.37% through 2033 [2,5].
However, the study has certain limitations. The dataset, consisting of 195 images, covers a relatively narrow range of viral morphologies. In real-world scenarios, viruses may develop new morphological features due to changes in culture conditions or mutations, and the model's ability to recognize such "unseen" morphologies remains untested. To address this, future work could expand the dataset by integrating TEM images of UPB samples containing viruses with diverse morphological variations, including those induced by controlled adjustments of culture conditions (e.g., temperature, nutrient concentration) and laboratory-simulated mutation models. Additionally, U-Net (a mainstream segmentation model) was not compared here because it targets semantic segmentation (evaluated via the Dice coefficient), whereas our study focuses on object detection (evaluated via mAP); segmentation-based benchmarks will be included in future work if we extend to virus quantification.
Additionally, the samples used are limited to monoclonal antibody intermediates derived from CHO cells; the model's adaptability to other cell lines or production processes requires further verification. This limitation may be mitigated by validating the model on UPB samples from a broader spectrum of biopharmaceutical production systems, such as those using HEK293 cells (for recombinant protein production) or Vero cells (for vaccine development), and across different manufacturing processes (e.g., batch, fed-batch, and continuous production).
Furthermore, the robustness of the model against actual TEM preparation artifacts had not previously been systematically evaluated. The noise robustness experiment added in this study (Section 4.3) fills part of this gap: verification with Gaussian blur on the 19 test set images confirmed that the model tolerates minor disturbances (3 × 3 to 7 × 7 kernels, Recall = 1.0). The experiment nevertheless has three limitations. First, only a single disturbance type, Gaussian blur, was covered; more common artifacts in industrial scenarios, such as uneven staining (e.g., uranyl acetate staining spots) and copper grid impurities (e.g., grid line interference), were not involved. Second, extreme disturbance scenarios (e.g., blur kernels larger than 9 × 9 or Gaussian noise with σ > 0.03) were not tested; although such situations are rare in standardized QC, occasional preparation errors (e.g., overly thick ultrathin sections) may still cause a sudden drop in image quality. Third, the consistent false detections observed in the experiment (three non-viral impurities misjudged as viruses, Precision = 0.864) remain unresolved. This is because the C3K2_IDWC module is designed only for viral morphology and lacks targeted extraction of impurity features such as cell debris and staining residues, which in practice could trigger unnecessary re-inspection of qualified batches and increase production costs. To improve noise resilience, researchers could generate a dedicated noisy image subset by simulating common sample preparation artifacts (e.g., staining irregularities, debris accumulation, and sectioning defects) and fine-tune the model on this subset. Building on these results, future optimization could focus on the following aspects: to address false detections, a lightweight impurity classification branch could be added to the model Head and trained on a small number of TEM impurity samples (such as CHO cell debris and tannic acid staining residues) to distinguish viral from non-viral features accurately; at the same time, the disturbance dataset could be expanded to include uneven staining, copper grid impurities, and similar artifacts in the robustness test, and an adaptive wavelet denoising module could be integrated to improve tolerance to extreme noise without adding excessive parameters (kept within 2M). Alternatively, integrating a preprocessing module (e.g., adaptive median filtering or wavelet denoising) before the YOLOv11 detection pipeline could reduce the impact of extreme noise on detection performance.
Section 4.3) has filled some gaps—verification through Gaussian blur on 19 test set images confirmed that the model is tolerant to minor disturbances (3 × 3~7 × 7 kernel) (Recall = 1.0). The experiment still has three limitations: first, the type of disturbance is single, only covering Gaussian blur, and does not involve more common artifacts in industrial scenarios such as uneven staining (e.g., uranyl acetate staining spots), copper mesh impurities (e.g., grid line interference), etc.; second, extreme disturbance scenarios (e.g., blur larger than 9 × 9, Gaussian noise with σ > 0.03) were not tested. Although such situations are rare in standardized QC, occasional preparation errors (e.g., overly thick ultra-thin sections) may still lead to a sudden drop in image quality; third, the fixed false detection problem found in the experiment (3 non-viral impurities were misjudged as viruses, Precision = 0.864) has not been resolved. This is because the C3K2_IDWC module is only designed for viral morphology and lacks targeted extraction of features of impurities such as cell debris and staining residues, which may trigger misjudged re-inspection of qualified batches in practical applications and increase production costs. To improve noise resilience, researchers could generate a dedicated noisy image subset by simulating common sample preparation artifacts (e.g., staining irregularities, debris accumulation, and sectioning defects) and fine-tune the model using this subset. Based on the results of this experiment, future optimizations can further focus on the following aspects: To address the issue of false detections, add a lightweight impurity classification branch in the model section Head. Train this branch using a small number of TEM impurity samples (such as CHO cell debris, tannic acid staining residues) to achieve accurate distinction between viral and non-viral features; at the same time, expand the interference type dataset, include uneven staining, copper mesh impurities, etc., in the robustness test, and integrate an adaptive wavelet denoising module to improve the model’s tolerance to extreme noise without adding excessive parameters (controlled within 2M). Alternatively, integrating a preprocessing module (e.g., adaptive median filtering or wavelet denoising) before the YOLOv11 detection pipeline could reduce the impact of extreme noise on detection performance.