Abstract
The generalization capability of the decision-making modules in unmanned ground vehicles (UGVs) is critical for their safe deployment in unseen environments. Prevailing evaluation methods, which rely on aggregated performance over static benchmark sets, lack the granularity to diagnose the root causes of model failure because they often conflate the distinct influences of scenario similarity and intrinsic difficulty. To overcome this limitation, we introduce a fine-grained, dynamic evaluation framework that deconstructs generalization along the dual axes of multi-level difficulty and similarity. First, scenario similarity is quantified through a four-layer hierarchical decomposition, with the layer-wise results aggregated into a composite similarity score. Second, test scenarios are independently classified into ten discrete difficulty levels via a consensus mechanism that integrates large language models and task-specific proxy models. Constructing a three-dimensional (3D) performance landscape over similarity, difficulty, and task performance then enables detailed behavioral diagnosis: the framework assesses robustness from performance within the high-similarity band (90–100%), while the full 3D landscape characterizes generalization under distribution shift. Seven interpretable metrics are derived to quantify distinct facets of generalization and robustness. This initial validation focuses on the path-planning layer under full state observability, establishing a proof-of-concept for the framework. The framework not only ranks algorithms but also reveals non-trivial behavioral patterns, such as the decoupling of in-distribution robustness from out-of-distribution generalization, and it provides a reliable, interpretable basis for assessing the readiness of UGVs for safe deployment in unseen environments.
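To make the construction concrete, the minimal sketch below illustrates how a composite similarity score, a discrete difficulty level, and per-scenario task performance could be binned into a 3D performance landscape, and how the high-similarity robustness band could be read off. The function names, equal layer weights, and bin edges are illustrative assumptions, not the framework's exact definitions.

```python
import numpy as np

# Illustrative sketch only: weights, bin edges, and aggregation rules are assumptions.

def composite_similarity(layer_scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Aggregate four per-layer similarity scores (each in [0, 1]) into one composite score."""
    return float(np.dot(layer_scores, weights))

def performance_landscape(records, n_sim_bins=10, n_diff_levels=10):
    """Bin (similarity, difficulty level, performance) triples into a 3D landscape:
    mean task performance over a similarity x difficulty grid."""
    grid_sum = np.zeros((n_sim_bins, n_diff_levels))
    grid_cnt = np.zeros((n_sim_bins, n_diff_levels))
    for sim, diff_level, perf in records:
        i = min(int(sim * n_sim_bins), n_sim_bins - 1)  # similarity bin index
        j = diff_level - 1                              # difficulty levels assumed 1..10
        grid_sum[i, j] += perf
        grid_cnt[i, j] += 1
    # Empty cells are reported as NaN rather than zero performance.
    return np.where(grid_cnt > 0, grid_sum / np.maximum(grid_cnt, 1), np.nan)

def robustness_score(records):
    """Mean performance within the high-similarity band (composite similarity >= 0.9)."""
    in_band = [perf for sim, _, perf in records if sim >= 0.9]
    return float(np.mean(in_band)) if in_band else float("nan")

# Example: three test scenarios as (composite similarity, difficulty level, performance).
records = [(0.95, 3, 0.88), (0.72, 7, 0.54), (0.91, 5, 0.80)]
landscape = performance_landscape(records)
print(robustness_score(records))  # robustness over the 90-100% similarity band
```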