4.1. Target Detection
In this section, we evaluate the object detection performance of our model using standard metrics, including Precision, Recall, Average Precision (AP), mean Average Precision (mAP), and the F1 score. These metrics collectively offer a comprehensive assessment of the model’s ability to detect and classify the target class—debris—against a background.
Precision and recall are calculated by ranking the detection outputs according to their confidence scores. Precision measures the proportion of true positive detections among all predicted positives, while recall measures the proportion of true positives among all actual objects. Their definitions are given by
\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \]
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
The F1 score, which balances both precision and recall, is computed as
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \]
As shown in Figure 17, the precision–recall curve for the “debris” class remains consistently high across a wide range of recall values, achieving an average precision (AP) of 0.955. The model’s mean average precision (mAP) at an IoU threshold of 0.5 is also calculated. mAP represents the mean value of AP over all classes and is defined as
\[ \text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i, \]
where N is the number of classes and \(\mathrm{AP}_i\) is the average precision of class i.
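The ranking procedure behind these curves can be sketched in a few lines of NumPy. This is an illustrative implementation using COCO-style 101-point interpolation, which may differ from the interpolation the authors' framework applies; the function names and inputs (per-detection confidence scores, true-positive flags, and the ground-truth count) are hypothetical:

```python
import numpy as np

def precision_recall_ap(scores, is_tp, n_gt):
    """Rank detections by confidence, accumulate TP/FP, and compute AP
    as the (interpolated) area under the precision-recall curve."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # 101-point interpolation of the PR curve (COCO-style)
    ap = np.mean([np.max(precision[recall >= r], initial=0.0)
                  for r in np.linspace(0.0, 1.0, 101)])
    return precision, recall, ap
```

mAP@0.5 would then be the mean of the per-class AP values; with a single "debris" class, mAP equals AP.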
To further explore the influence of confidence thresholds, we examine the relationships between confidence and performance metrics. In
Figure 18, the F1 score peaks at 0.92 at a threshold of 0.716, reflecting the optimal balance point. In
Figure 19, precision reaches 1.00 when the threshold is increased to 0.961, and
Figure 20 confirms that recall stays near-perfect at lower thresholds.
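The threshold sweep behind these curves can be sketched as follows; inputs (confidence scores, matched true-positive flags, ground-truth count) are hypothetical names, not the evaluation code used by the authors:

```python
import numpy as np

def f1_vs_threshold(scores, is_tp, n_gt, thresholds):
    """For each confidence threshold, keep detections at or above it
    and compute precision, recall, and F1; return the best threshold."""
    scores = np.asarray(scores, dtype=float)
    tp_flags = np.asarray(is_tp)
    f1s = []
    for t in thresholds:
        keep = scores >= t
        tp = int(tp_flags[keep].sum())
        fp = int(keep.sum()) - tp
        fn = n_gt - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    best = int(np.argmax(f1s))
    return thresholds[best], f1s[best]
```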
The model’s training progress is visualized in
Figure 21. The box loss, classification loss, and distribution focal loss (DFL) all decrease steadily during training. Evaluation metrics such as mAP also improve continuously, reflecting stable convergence and good generalization ability.
We also visualize the spatial distribution and size of detected objects to assess the model’s localization behavior. As shown in
Figure 22, debris instances are well-distributed across the image, and the bounding box sizes remain consistent, indicating that the model avoids positional bias and detects targets with stable scale inference.
To evaluate classification performance, we analyze the confusion matrices in
Figure 23.
Figure 23a shows the raw confusion matrix, where the model correctly classifies nearly all debris instances with minimal confusion.
Figure 23b, the normalized version, further confirms strong class separation, with nearly all predictions lying along the diagonal.
To support our model selection, we conducted a quantitative comparison among DETR, RT-DETR, YOLOv5, and YOLOv8 using standard evaluation metrics, including Precision (P), Recall (R), mAP@50, and mAP@50:95. As shown in
Figure 24, YOLOv8 consistently outperforms the other models across all evaluation metrics. Specifically, it achieves the highest precision (93.5%), recall (86.5%), mAP@50 (95.5%), and mAP@50:95 (66.9%), which clearly demonstrates its superior ability to detect and localize the target class with both accuracy and generalization. Moreover, despite its outstanding performance, YOLOv8 maintains a remarkably small parameter size (3.01M) and low computational complexity (8.3 GFLOPs), making it particularly suitable for deployment in real-time or resource-constrained environments.
Therefore, considering both detection performance and model efficiency, YOLOv8 was chosen as the primary detection backbone in this work. Its superior balance of accuracy and speed ensures that it not only delivers state-of-the-art results but also scales well for practical application scenarios.
To assess the real-time performance of the final detection model, we conducted a benchmark test using a standard YOLOv8 model deployed on an NVIDIA RTX 4060 GPU, as shown in
Table 9. The input resolution was set to 640 × 640, and the batch size was fixed at 4. A total of 500 test images were processed using the official Ultralytics YOLOv8 framework. The total inference time was measured at 2.94 s, resulting in an average inference speed of approximately
170 FPS. This demonstrates that, despite the incorporation of additional modules such as stereo vision and trajectory estimation, the system remains computationally efficient and suitable for real-time applications in space debris detection.
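The reported throughput follows directly from the benchmark figures:

```python
# Throughput from the reported benchmark: 500 images in 2.94 s total.
n_images, total_seconds = 500, 2.94
fps = n_images / total_seconds                   # ~170 frames per second
latency_ms = 1000.0 * total_seconds / n_images   # ~5.88 ms per image
```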
Finally, to realistically simulate space debris in a controlled visual environment, we conducted a simplified modeling process using MATLAB R2024a. In this simulation, two white spherical objects of approximately the same size were used to represent space debris. The spheres were placed in a dark scene with a black background to mimic the low-illumination conditions of outer space. A single point light source was introduced and its angle was adjusted to cast realistic highlights and shadows on the surfaces of the spheres, enhancing contrast and replicating the optical features of space-based imagery.
The reason for selecting debris targets of similar size in this experiment is two-fold. First, it allows the object detection model to rapidly localize each target and treat them as individual points in space, simplifying the visual inference process. Second, this setup enables the more effective extraction of spatial and kinematic parameters—such as velocity vectors, movement direction, and trajectory prediction—for the subsequent analysis of potential collisions. By using geometrically uniform shapes, the focus of the experiment remains on evaluating the detection performance and motion estimation under space-like visual conditions, rather than introducing additional complexity due to scale variation.
Figure 25 shows an example detection result on a test image. The model accurately localizes multiple debris targets, each annotated with confidence scores above 0.69. This visual outcome reinforces the strong quantitative metrics reported above and illustrates the model’s effectiveness in real-world inference.
The modeling and simulation process was conducted in MATLAB. Two cameras were positioned in the 3D space for this simulation. Camera 1 was placed at the coordinates (2.0, −1.0, 1.5) and oriented to face the point (1, 1, 1), while Camera 2 was placed at (0.3, 2.2, 1.4), also facing towards the same point (1, 1, 1). The detection views from both cameras are shown in
Figure 26.
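The binocular measurement underlying the velocity extraction can be illustrated with standard linear (DLT) triangulation: a detected point's pixel coordinates in both views are combined with the two camera projection matrices to recover its 3D position, and velocities follow from finite differences of successive positions. The matrices in the test are generic placeholders, not the calibrated matrices of the two simulated cameras:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its pixel
    coordinates x1, x2 in two cameras with 3x4 projection matrices P1, P2."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```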
Based on the velocity extraction via binocular vision described in
Section 3.3.1 and the kinematic model of spatial binary-sphere collision outlined in
Section 3.3.2, we calculated the velocities of two spheres in 3D space that are likely to collide. Specifically,
Sphere 1 has an initial velocity of [0.3, 0.3, 0.3], moving along the spatial diagonal;
Sphere 2 has an initial velocity of [−0.3, −0.3, −0.29], in an approximately opposite direction.
According to the kinematic model, the two spheres are expected to collide at time t = 1.1667 s at the spatial coordinate [0.55, 0.55, 0.65]. At t = 1.2 s, the positions of the two spheres are observed to be [0.531, 0.531, 0.684] and [0.569, 0.569, 0.528], respectively. The robotic manipulator’s end-effector will subsequently perform servo control targeting Sphere 1’s predicted position at that moment. The post-collision velocities are computed using the conservation of momentum and a restitution coefficient of e = 0.9.
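The restitution step can be sketched as follows, assuming equal sphere masses (the text does not restate the masses here) and a known unit contact normal n along the line of centers:

```python
import numpy as np

def collide(v1, v2, n, e=0.9, m1=1.0, m2=1.0):
    """Post-collision velocities of two spheres with restitution e.
    Tangential velocity components are unchanged; normal components
    satisfy momentum conservation and Newton's restitution law."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    u1, u2 = np.dot(v1, n), np.dot(v2, n)  # pre-impact normal speeds
    # Solve m1*u1 + m2*u2 = m1*w1 + m2*w2  and  w2 - w1 = -e*(u2 - u1)
    w1 = (m1 * u1 + m2 * u2 - m2 * e * (u1 - u2)) / (m1 + m2)
    w2 = (m1 * u1 + m2 * u2 + m1 * e * (u1 - u2)) / (m1 + m2)
    return v1 + (w1 - u1) * n, v2 + (w2 - u2) * n
```

With e = 1 and equal masses the spheres would exchange their normal velocity components; e = 0.9 dissipates 10% of the normal relative speed.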
We then simulated the entire collision process in MATLAB, producing a 3D visualization that shows the trajectories of both spheres before and after the collision, as illustrated in
Figure 27.
4.2. Robotic Arm Trajectory Planning
Based on the robotic arm parameters presented in
Table 10, the robotic arm is modeled in MATLAB, as shown in
Figure 28.
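The arm model can be sketched with standard Denavit-Hartenberg transforms chained from base to end-effector; the DH parameters in the test below are hypothetical planar placeholders, not the values from Table 10:

```python
import numpy as np

def dh_matrix(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform for one joint."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(joint_angles, dh_table):
    """Chain per-joint DH transforms to obtain the end-effector pose."""
    T = np.eye(4)
    for q, (d, a, alpha) in zip(joint_angles, dh_table):
        T = T @ dh_matrix(q, d, a, alpha)
    return T
```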
The following figures illustrate the impact of employing the PSO algorithm to enhance the trajectory planning of robotic joints. As depicted in
Figure 29, the evolution of each joint’s historical best fitness value is tracked across optimization iterations, highlighting the algorithm’s convergence behavior. A detailed comparison of joint position, velocity, and acceleration profiles—before and after optimization—offers clear insight into the improvements achieved.
Figure 30 presents this comparative analysis, showcasing the effectiveness of the PSO-based optimization approach.
Comparing the unoptimized and optimized joint angle trajectories of the 6-DOF robotic arm shows clear performance gains. In the unoptimized case (left column), the motion takes about 6 s, and although each joint-angle curve is smooth, poor synchronization causes uneven timing and inefficiencies. After optimization (right column), the same maneuver completes in about 2.2 s, with all peak displacements constrained to approximately ±3 rad, yielding much tighter coordination and reduced overshoot.
Differences in joint speed profiles are even more pronounced. Unoptimized joint velocities display slow, staggered transitions that lead to lag across axes. In the optimized case, peak speeds are substantially higher and the velocity peaks across all joints become closely aligned. Despite the higher speeds, the curves remain smooth, evidencing a balanced plan that boosts throughput without sacrificing continuity.
Acceleration profiles highlight system responsiveness. The original acceleration curves stay within a narrow band, reflecting a conservative, cautious controller. Post-optimization, accelerations reach markedly higher magnitudes yet avoid sharp spikes, delivering noticeably greater responsiveness, improved path adherence, and ultimately shorter cycle times with less mechanical wear.
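A minimal PSO sketch illustrates the optimizer driving these results; the cost function in the test is a stand-in sphere function, whereas the actual objective combines trajectory time and smoothness across the six joints, and the inertia/cognitive/social weights below are conventional defaults, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(cost, dim, n_particles=30, iters=100, bounds=(-1.0, 1.0)):
    """Minimal particle swarm optimizer: each particle tracks its
    personal best, and the swarm shares a global best."""
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros((n_particles, dim))
    pbest = x.copy()
    pbest_f = np.array([cost(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, social weights
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([cost(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, pbest_f.min()
```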
4.3. Transformer-KAN Prediction
To assess the impact of introducing Kolmogorov–Arnold Networks (KANs) into the Transformer architecture, we conducted a series of controlled experiments. The evaluation leverages three common regression metrics—Huber loss, root mean squared error (RMSE), and the coefficient of determination (R²)—as well as trajectory prediction comparisons. Below, we present and analyze the corresponding results in detail.
We begin by introducing the definitions of the evaluation metrics.
Huber loss:
\[ L_\delta(r) = \begin{cases} \frac{1}{2} r^2, & |r| \le \delta, \\ \delta \left( |r| - \frac{1}{2}\delta \right), & |r| > \delta, \end{cases} \qquad r = y_i - \hat{y}_i. \]
Huber loss combines the sensitivity of squared error for small residuals with the robustness of absolute error for larger deviations, thereby reducing the influence of outliers.
Root mean squared error (RMSE):
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}. \]
RMSE shares the same units as the original target variable, providing a more interpretable measure of absolute prediction error.
Coefficient of determination (R²):
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}, \]
where \(\bar{y}\) is the mean of the actual values. R² reflects how much variance in the true values is captured by the model; values close to 1 indicate an excellent fit, while values below 0 imply that the model performs worse than a constant predictor.
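The three metrics can be computed directly; a small NumPy sketch, using the conventional threshold delta = 1 for the Huber loss (the paper does not specify the value used):

```python
import numpy as np

def huber(y, yhat, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear beyond delta."""
    r = np.abs(y - yhat)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

def rmse(y, yhat):
    """Root mean squared error, in the units of the target variable."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def r2(y, yhat):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```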
Training loss (Huber) analysis:
Figure 31 plots the Huber loss over 300 training epochs for the Transformer model. The curve begins at a moderate initial value of around 0.06, reflecting the model’s early prediction errors under the robust Huber metric. It then drops precipitously within the first 5–10 epochs—from about 0.06 down to roughly 0.012—indicating very rapid initial learning.
Following this burst of improvement, the loss curve oscillates modestly and enters a slower, steady descent. From approximately epoch 10 to 50, the Huber loss levels off near 0.009, fluctuating only slightly as the model fine-tunes. Beyond epoch 50, the loss continues to decrease gradually, reaching about 0.004 by epoch 200.
In the late stage of training (epochs 200–300), both the training and test curves converge around 0.002, signifying that further epochs yield only marginal gains. The tight overlap between the train and test losses throughout also suggests strong generalization with minimal overfitting.
In contrast,
Figure 32 presents the Huber loss of the Transformer+KAN model over 270 training epochs. The curve begins at a relatively high value and falls sharply by epoch 20, indicating rapid initial learning. Between epochs 20 and 50, the loss continues a smooth decline, after which it enters a steadier, slower descent through epoch 100. In the late training stage (epochs 100–270), the loss tapers off and converges by epoch 270.
Throughout training, the test loss (orange) nearly overlaps the train loss (blue), with only minor fluctuations, indicating excellent generalization and minimal overfitting. This behavior shows that adding KAN not only accelerates early learning but also drives the model to a significantly lower final error, underscoring its enhanced ability to capture complex nonlinear and temporal patterns.
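For illustration only, a much-simplified KAN-style layer (forward pass, no training): each input-output edge carries its own learnable univariate function, here parameterized with a Gaussian radial-basis expansion rather than the B-splines of the original KAN formulation. This is a conceptual sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class KANLayer:
    """KAN-style layer: output j sums a learnable univariate function of
    each input i, with every edge function expressed as a coefficient
    vector over a shared grid of Gaussian basis functions."""
    def __init__(self, n_in, n_out, n_basis=8, x_range=(-2.0, 2.0)):
        self.centers = np.linspace(x_range[0], x_range[1], n_basis)
        self.width = (x_range[1] - x_range[0]) / n_basis
        # one coefficient vector per (input, output) edge
        self.coef = 0.1 * rng.standard_normal((n_in, n_out, n_basis))

    def __call__(self, x):
        # x: (batch, n_in) -> basis activations: (batch, n_in, n_basis)
        basis = np.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # sum edge functions over inputs and basis terms for each output
        return np.einsum('bik,iok->bo', basis, self.coef)
```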
RMSE Comparison: Figure 33 plots the RMSE of the Transformer model over 300 training epochs. The RMSE drops steeply within the first epoch, demonstrating rapid initial learning, and continues to fall over the next 20 epochs, followed by a smoother decline through epochs 100 and 150. In the late stage (epochs 150–300), RMSE continues to improve gradually, approaching its final value by epoch 300. The near-coincidence of training and test curves throughout indicates strong generalization and minimal overfitting. These trends suggest that the Transformer quickly captures dominant low-frequency patterns, while the further refinement of higher-frequency components yields diminishing returns over extended training.
In contrast,
Figure 34 demonstrates that Transformer+KAN maintains a steady downward trend throughout training, ultimately reaching an RMSE of approximately 0.12. The absence of strong fluctuations and the continuous decline suggest a more stable and effective learning process. The reduced RMSE indicates that the KAN module helps the model minimize absolute prediction errors more consistently across epochs.
R² score evolution: Figure 35 plots the R² progression for the Transformer model. Initially, the model exhibits highly unstable R² values below 0.5, with noticeable noise in the first 100 epochs. It eventually stabilizes, but only reaches around 0.82 by the end of training, indicating that roughly 18% of the variance in the data remains unexplained.
In comparison, the Transformer+KAN model shown in
Figure 36 follows a much more desirable trajectory. Although it also starts with erratic fluctuations, R² begins to rise steadily after epoch 80. By epoch 200, it surpasses 0.9 and ultimately reaches around 0.935. This high final score signifies that the model captures nearly all the variance in the data, showcasing its superior generalization ability.
Trajectory prediction on X axis: The effectiveness of the model in capturing real-world temporal dynamics is further demonstrated by the predicted trajectory in the X-axis dimension, as shown in
Figure 37. The Transformer+KAN model generates predictions that closely align with the ground truth across the entire time span. It successfully reproduces the amplitude and frequency of oscillations, including sharp peaks and valleys, with high fidelity.
In contrast to the baseline Transformer, the +KAN architecture displays a significantly better ability to represent high-frequency components and non-stationary transitions. This advantage can be attributed to the KAN module’s capacity to capture localized dynamic behavior, making it particularly suitable for complex motion forecasting tasks.
In conclusion, the Transformer+KAN architecture consistently outperforms the vanilla Transformer across all evaluation metrics. The combination of faster convergence, lower prediction error, higher explanatory power, and stronger trajectory fidelity confirms that the integration of KAN significantly enhances the model’s ability to learn complex temporal dependencies in sequential data.