Automated Synthesis of Hierarchical Deep Learning Cascades for Identifying Visually Similar Objects in UAV Imagery

Borovyk, Dmytro; Barmak, Oleksander; Radiuk, Pavlo; Krak, Iurii

doi:10.3390/technologies14060360


¹ Department of Computer Science, Khmelnytskyi National University, 11 Instytuts’ka Str., 29016 Khmelnytskyi, Ukraine

² Department of Theoretical Cybernetics, Taras Shevchenko National University of Kyiv, 4d Akademika Glushkova Ave, 03680 Kyiv, Ukraine

³ Laboratory of Communicative Information Technologies, V.M. Glushkov Institute of Cybernetics, 40 Akademika Glushkova Ave, 03187 Kyiv, Ukraine

* Author to whom correspondence should be addressed.

Technologies 2026, 14(6), 360; https://doi.org/10.3390/technologies14060360 (registering DOI)

Submission received: 29 April 2026 / Revised: 6 June 2026 / Accepted: 9 June 2026 / Published: 13 June 2026

(This article belongs to the Special Issue Advanced Technologies in Computer Vision and Applications)

Abstract

Accurate identification of visually similar targets in Unmanned Aerial Vehicle (UAV) imagery is hindered by significant inter-class ambiguity and viewpoint variability. While hierarchical deep learning mitigates these challenges, existing architectures rely on manual design, introducing subjectivity and limiting cross-domain scalability. In this work, we propose an objective, data-driven method for the automated synthesis of hierarchical classification structures. Our approach uses a hybrid inter-class proximity metric that integrates geometric distances between latent-feature-space centroids with empirical misclassification probabilities. Using a hierarchical agglomerative clustering algorithm optimized via an inconsistency coefficient, we synthesize a coarse-to-fine cascade that deploys YOLOv11 for feature extraction and FT-Transformers for specialized identification. Experimental validation on the VisDrone2019 and UAV123 datasets demonstrates that the automatically generated hierarchy achieves a peak F1-score of 94.9%, outperforming the monolithic YOLOv11 model by 0.8% and matching human-designed cascades. Sensitivity analysis indicates an optimal hybrid weight range of 0.4–0.6. The findings confirm that our automated synthesis provides high adaptability and reliability for real-time edge AI deployments, ensuring robust performance in dynamic monitoring environments without requiring manual redesign.

Keywords:

Unmanned Aerial Vehicles (UAVs); similar objects; classification; deep learning; architecture search

1. Introduction

Ensuring a high level of situational awareness is a critical task for the effective deployment of Unmanned Aerial Vehicles (UAVs) in monitoring missions, search and rescue operations, and critical infrastructure inspections. Between 2023 and 2025, deep learning architectures, specifically the YOLO model series (v8 through v11) [1] and specialized Vision Transformers, demonstrated significant progress in object detection speed and accuracy [2,3]. However, the identification of visually similar targets in aerial imagery remains a fundamental challenge [4]. Limited sensor resolution, viewpoint variability (top-down view), and significant inter-class ambiguity often lead to critical errors in monolithic models that attempt to learn all features within a single flat classification space.

To overcome these limitations, modern literature proposes hierarchical approaches that decompose the complex recognition task into a cascade of simpler subtasks [5,6]. Our previous study [5] introduced a modular architecture (Faster R-CNN, YOLO, FT-Transformer) that improved identification accuracy using expert-defined superclasses. However, this approach has a significant limitation: it relies on subjective expert knowledge, making the system difficult to scale or transfer to new data types and changing sensor environments.

An analysis of recent studies reveals a substantial research gap in methods for automatically synthesizing such hierarchies. Existing approaches generally rely on two paradigms, neither of which fully accounts for the specifics of UAV operations:

Semantic Hierarchies (e.g., WordNet-based). These group classes by their linguistic meaning. However, in aerial imagery, semantically distant classes (e.g., “building roof” and “truck”) may appear visually similar, whereas semantically related objects may have drastically different visual descriptors due to the specific nadir (top-down) perspective.
Latent Feature Clustering. These methods form hierarchies based solely on geometric distance within the feature space. They ignore the actual “behavior” of the model—specifically, how a given neural network confuses classes under conditions of noise, digital artifacts, and partial occlusions typical of UAV monitoring.

The primary contribution of this paper is the development and experimental validation of a method for automatically synthesizing an optimal hierarchical structure using a hybrid inter-class similarity metric. Unlike existing solutions, the proposed approach integrates two data sources: the geometric distance between class centroids in the latent space and an empirical measure of classifier confusion (confusion matrix). This enables the automatic generation of a coarse-to-fine cascade structure that objectively reflects both the internal complexity of the data and the real-world limitations of the neural models used.

Building on previous research that demonstrated the effectiveness of a manually designed multi-stage pipeline, this paper focuses on the algorithmic automation of the cascade synthesis process. We propose a data-driven approach to automatically generate the hierarchy based on inter-class proximity metrics.

The paper is organized as follows: Section 2 provides an analysis of current research in hierarchical classification and UAV-based detection; Section 3 details the mathematical framework of the hybrid metric and the automatic cascade construction algorithm; Section 4 is dedicated to the experimental validation of the proposed method using representative VisDrone and UAV123 datasets, alongside a comparative analysis with modern SOTA (State-of-the-Art) models; Section 5 discusses the results, their practical significance, and the method’s limitations; finally, the Conclusions section summarizes the study and outlines prospects for future work.

2. Related Works

The problem of automatically synthesizing an optimal hierarchical classification structure is multidisciplinary, situated at the intersection of deep learning, computer vision, and computer-aided design systems. An analysis of recent research highlights several key technological trends and fundamental contradictions.

2.1. Evolution of UAV Object Detection and the Limits of Monolithic Architectures

The evolution of detection methods for UAVs has progressed from classical algorithms to modern neural networks such as the YOLO (v8–v11) series, Faster R-CNN, and SSD [1,3,7,8]. Modern architectures utilize attention mechanisms and advanced feature extraction blocks (C3k2, PSA), achieving high accuracy on standard datasets [3,9,10].

Despite this progress, monolithic (flat) architectures exhibit a significant decrease in accuracy when identifying visually similar targets. This is caused by the “feature overlap” effect in the latent space: when objects share identical geometric proportions and color characteristics, the network is forced to establish decision hyperplanes for embedding vectors that are too closely positioned. This leads to classification instability under minor changes in lighting, perspective, or the presence of digital noise [8,11,12].

2.2. Hierarchical Approaches and the Semantic Gap

To overcome the limits of flat models, researchers have proposed various hierarchical structures: generative part-based models [13], the use of hyperbolic geometry for hierarchy modeling [14,15], Concept Bottleneck frameworks [6], and hierarchical segmentation [16,17].

Decomposing the task into subtasks allows for specialized feature extraction at each level of the cascade. Nonetheless, most authors rely on pre-existing semantic hierarchies (e.g., WordNet), where classes are grouped according to ontological logic [18,19]. However, a semantic gap occurs in aerial imagery: objects belonging to different linguistic groups (e.g., “building roof” and “truck”) are often visually much closer at a 90° angle than objects within the same group [20]. A network trained in a semantic hierarchy is forced to search for common features in nodes that lack visual correlation, which degrades training efficiency [5,21]. This underscores the need for data-driven synthesis, in which the cascade structure is determined exclusively by visual descriptors.

2.3. Automation of Hierarchy Synthesis: Metric Challenges and Edge AI

The automatic synthesis of classification trees is typically based on latent feature clustering [15,22] or the analysis of confusion matrices [19].

Purely geometric methods (Euclidean or cosine distance) often become uninformative in the deep layers of a network, where classes “cluster” into dense bundles due to the influence of the SoftMax function. Furthermore, they do not account for empirical model confusion caused by atmospheric interference. Conversely, methods based solely on confusion matrices are unstable and highly dependent on the quality of the training sample.

In this domain, Hierarchical Agglomerative Clustering (HAC) is considered more appropriate than K-means, as it naturally forms a dendrogram [23,24]. This allows for a flexible definition of abstraction levels without pre-specifying the number of clusters [25,26]. However, existing implementations lack integration with edge AI platform constraints, such as NVIDIA Jetson (NVIDIA Corporation, Santa Clara, CA, USA) [4,27], where cascade depth directly impacts inference latency [28,29].

Recent advancements in cross-domain machine learning emphasize the importance of robust pre-processing for imbalanced and highly correlated data. For instance, in the domain of intrusion detection, methods proposed by Semenov et al. [30] demonstrate the effectiveness of handling imbalanced feature spaces, which is methodologically similar to resolving class ambiguity in UAV imagery. Similarly, article [31] highlights the role of statistical feature selection in creating lightweight architectures, a principle that is vital for the deployment of deep learning cascades on resource-constrained Edge AI platforms.

2.4. Problem Statement and Scientific Contradiction

Based on the analysis presented in Table 1, a fundamental contradiction has been identified: the need to improve identification accuracy through complex hierarchical structures versus the requirement for automated, scalable systems without expert intervention. Current hierarchical models for UAVs are designed subjectively, limiting their application in dynamic environments.

The research gaps addressed in this study are:

The absence of a hybrid metric that combines the geometric separability of the latent space with the statistics of the model’s empirical errors;
The lack of algorithms for the automatic selection of the dendrogram cutting threshold ($\tau$) to balance cascade depth and computational complexity;
Insufficient integration of automatically synthesized hierarchies with modern Transformer-CNN pipelines for Edge computing.

Given these points, the aim of this study is to increase the accuracy and reliability of classifying visually similar objects in UAV imagery by developing and implementing a method for automatically synthesizing an optimal hierarchical cascade structure, thereby eliminating the need for subjective architectural design.

To achieve this objective, the following research tasks were defined:

Develop a mathematical model for automatic hierarchy synthesis based on a hybrid inter-class distance metric;
Integrate the automatic construction method into the general hierarchical cascade pipeline established in [5];
Perform a comparative analysis of the automatically generated hierarchy against an expert-driven approach using representative UAV imagery datasets (VisDrone, UAV123);
Evaluate the generalization capability and computational efficiency of the resulting models within real-time operational scenarios.

Table 1. Comparative analysis of approaches to hierarchical target classification.

Approach	Sources	Advantages	Key Disadvantages and Gaps
Monolithic SOTA	[1,8,9,10]	Maximum speed (Real-time)	“Feature overlap” effect; low accuracy on visually similar targets.
Semantic Hierarchies	[6,16,21]	Logical interpretability	Semantic gap: mismatch between linguistics and visual similarity.
Geometric Clustering	[13,14,15]	Objectivity	Fails to account for empirical confusion and specific neural network architecture.
Confusion Matrix Analysis	[19]	Adaptation to the model behavior	High sensitivity to the dataset; risk of creating redundant levels.
Hybrid Transformer-CNN	[21,22,23,24,32]	High generalization capability	Lack of algorithms for the automatic formation of the cascade structure.

3. Materials and Methods

3.1. General Description of the Automatic Cascade Generation Approach

General Architecture of the Object Identification Cascade

The proposed approach is based on the paradigm of hierarchical decomposition of complex feature spaces. According to this paradigm, the overall object identification task is not solved directly but is sequentially broken down into a system of nested subtasks, with results gradually refined. This approach enables effective handling of high-dimensional feature spaces and reduces classification complexity at each stage. Based on the results presented in [5], a multi-level architecture combining a detection stage with several sequential levels of classification refinement is most appropriate for UAV monitoring tasks (Figure 1).

In the initial stage (Stage 0), potential objects of interest are localized within the input image. This is accomplished using a Faster R-CNN network, which detects Regions of Interest (ROIs) regardless of target type. The primary priority of this stage is to maximize detection recall—ensuring that no potentially relevant object is missed, even at the expense of an increased false-positive rate. The resulting ROIs are then passed to subsequent processing levels.

Subsequently, at Stage 1, primary classification is performed, where the detected ROIs are distributed among the most visually distinct superclasses. At this level, generalized features are utilized to separate fundamentally different object categories; for instance, distinguishing biological entities from technical assets, or pedestrians from vehicles.

The next level of the hierarchy (Stage 2) involves intermediate classification refinement, where complex superclasses are further subdivided into more homogeneous subclasses. For example, vehicles may be differentiated by dimensions or functional purpose, thereby reducing intra-class variance and facilitating subsequent identification.

The final level of the hierarchy provides fine-grained, detailed identification of specific object types, such as buses, trucks, or vans, utilizing highly specific feature sets (Stage 3). In contrast to the manual semantic grouping used in [5], this study proposes a method for automatic hierarchy synthesis that replaces subjective expert decisions with an objective mathematical analysis of the feature space formed by the deep neural network.

3.2. Formalization of Feature Extraction and Class Centroids

Let the training dataset be defined as a set of pairs $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the image of an individual object and $y_i \in C = \{c_1, \dots, c_M\}$ is its class label. The set $C$ corresponds to the set of all possible classes in the dataset. The process of informative feature extraction is described by the mapping $\Phi : X \to \mathbb{R}^{d}$, implemented as a deep neural network acting as a backbone feature extractor. Consequently, each image $x_i$ is mapped to a feature vector (embedding) $f_i = \Phi(x_i)$.

For each base class $c_k \in C$, we define its mathematical expectation in the feature space, which serves as the class centroid $\mu_k$:

$$\mu_k = \frac{1}{|D_k|} \sum_{(x_j, y_j) \in D_k} \Phi(x_j), \tag{1}$$

where $D_k = \{(x_j, y_j) \in D : y_j = c_k\}$ is the set of all samples in the dataset belonging to class $c_k$.

It is assumed that the centroid $\mu_k$ serves as a reference representative of the class, minimizing intra-class variance.
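As an illustration of Equation (1), the following minimal sketch computes per-class centroids from a matrix of embeddings. The function name `class_centroids` and the toy two-dimensional embeddings are hypothetical; in practice the rows of `features` would be the vectors produced by the backbone.

```python
import numpy as np

def class_centroids(features, labels):
    """Return (classes, centroids), where centroids[k] is the mean
    embedding of all samples labelled classes[k] (Equation (1))."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique class labels
    centroids = np.stack(
        [features[labels == c].mean(axis=0) for c in classes]
    )
    return classes, centroids

# Toy embeddings (assumed values, for illustration only).
feats = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
labs = np.array(["car", "car", "bus", "bus"])
classes, mu = class_centroids(feats, labs)
```

Here `classes` is sorted alphabetically, so `mu[0]` is the "bus" centroid `[11, 12]` and `mu[1]` is the "car" centroid `[1, 0]`.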

To ensure the feature space retains both abstract semantic context and fine-grained morphological details (such as vehicle contours), the mapping $\Phi$ is implemented using the multi-layer concatenation strategy proven optimal in our foundational study [5]. Specifically, the feature vector fuses outputs from the final, penultimate, and earlier contour-sensitive convolutional blocks of the YOLOv11 backbone.
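One possible reading of this multi-layer concatenation is sketched below. The text does not specify how the feature maps are reduced to vectors, so the global average pooling step, the function name `fuse_feature_maps`, and the channel counts and spatial sizes are all assumptions made for illustration.

```python
import numpy as np

def fuse_feature_maps(feature_maps):
    """Pool each (channels, height, width) feature map to a vector of
    length `channels` and concatenate the results into one embedding.
    Global average pooling is an assumption, not the authors' stated choice."""
    pooled = [np.asarray(fm, dtype=float).mean(axis=(1, 2)) for fm in feature_maps]
    return np.concatenate(pooled)

# Random stand-ins for three backbone blocks (hypothetical shapes).
rng = np.random.default_rng(0)
maps = [
    rng.normal(size=(64, 40, 40)),   # earlier, contour-sensitive block
    rng.normal(size=(128, 20, 20)),  # penultimate block
    rng.normal(size=(256, 10, 10)),  # final block
]
f = fuse_feature_maps(maps)  # length 64 + 128 + 256 = 448
```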

To automate the construction of the hierarchical classification cascade, this study proposes a hybrid inter-class distance metric $D_{ij}$. This metric integrates both the theoretical similarity of classes in the latent feature space and the empirical complexity of their practical differentiation by a neural network. This approach accounts not only for the geometric structure of the feature space formed by the deep extractor but also for the actual behavior of the classifier under conditions of noise, partial occlusions, viewpoint changes, and other factors typical of aerial imagery.

The geometric component of the metric reflects the fundamental visual similarity between classes and is based on the analysis of the relative positions of their centroids in the $d$-dimensional feature space $\mathbb{R}^{d}$. For each pair of classes $c_i$ and $c_j$, the Euclidean distance between their respective centroids, $\mu_i$ and $\mu_j$, is calculated and normalized by the maximum distance between any pair of centroids in the feature space.

Formally, this value is defined as

$$d_{ij}^{\mathrm{feat}} = \frac{\| \mu_i - \mu_j \|_2}{\max_{p, q} \| \mu_p - \mu_q \|_2}. \tag{2}$$

Normalization confines the resulting values to the unified range $[0, 1]$. A small $d_{ij}^{\mathrm{feat}}$ value indicates high similarity in texture, geometric, and structural characteristics, suggesting that these classes should be grouped at higher levels of the hierarchy. Thus, the geometric component captures the theoretical proximity of classes from the perspective of the neural network’s learned representation.
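Equation (2) can be sketched with NumPy broadcasting as follows. The function name `geometric_distance` and the toy centroids are hypothetical; the guard for the degenerate case where all centroids coincide is an added assumption.

```python
import numpy as np

def geometric_distance(centroids):
    """Pairwise Euclidean distances between class centroids,
    divided by the largest pairwise distance (Equation (2))."""
    centroids = np.asarray(centroids, dtype=float)
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    max_dist = dist.max()
    if max_dist == 0.0:  # all centroids identical: nothing to normalize
        return dist
    return dist / max_dist

# Three collinear toy centroids (assumed values).
mu = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d_feat = geometric_distance(mu)
```

For these centroids the raw distances are 5, 10, and 5, so the normalized matrix has 0.5 between neighbours and 1.0 between the two outermost classes.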

However, analyzing only the geometric structure of the feature space does not fully account for the practical classification difficulties encountered during model inference on real-world data. To address this, a confusion component is introduced based on the analysis of the confusion matrix $M$, obtained during the testing of a baseline, non-hierarchical multi-class model. In this matrix, element $M_{ij}$ represents the frequency with which objects of class $c_i$ are erroneously assigned to class $c_j$. Based on these values, a symmetrized confusion measure between pairs of classes is determined and formalized as:

$$d_{ij}^{\mathrm{conf}} = 1 - \frac{M_{ij} + M_{ji}}{M_{ii} + M_{jj} + \varepsilon}, \tag{3}$$

where $\varepsilon$ is a small regularization constant to prevent division by zero.

This component reflects the empirical complexity of class differentiation: if two classes are frequently confused, their corresponding distance decreases regardless of the spatial separation of their centroids in the feature space. Consequently, the confusion component identifies pairs of classes that “merge” from the classifier’s perspective due to unfavorable observation conditions or insufficient informativeness of local features, even when their mean vector representations differ significantly.

The robustness of the proposed proximity metric against class imbalance is achieved by its bidirectional nature. The distance calculation incorporates misclassification rates from class $i$ to $j$ and vice versa ($i \leftrightarrow j$). This reciprocal evaluation ensures that the proximity value reflects the inherent visual ambiguity between classes rather than the absolute number of samples in the training set. Consequently, even under severe imbalance, the automated synthesis correctly identifies clusters of highly correlated classes by focusing on their mutual confusion patterns.
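A minimal sketch of the symmetrized confusion distance in Equation (3) follows. The function name `confusion_distance` and the toy matrix are hypothetical. Two details are assumptions beyond the formula: the diagonal is set to zero explicitly, and values are clipped at zero, because Equation (3) becomes negative whenever the mutual confusions of a pair exceed the sum of their correct predictions.

```python
import numpy as np

def confusion_distance(M, eps=1e-9):
    """Symmetrized confusion distance from a confusion matrix M, where
    M[i, j] counts objects of class i assigned to class j (Equation (3))."""
    M = np.asarray(M, dtype=float)
    diag = np.diag(M)
    mutual = M + M.T                            # M_ij + M_ji
    denom = diag[:, None] + diag[None, :] + eps  # M_ii + M_jj + eps
    d = 1.0 - mutual / denom
    d = np.clip(d, 0.0, None)   # assumption: keep distances non-negative
    np.fill_diagonal(d, 0.0)    # assumption: zero self-distance
    return d

# Toy confusion matrix: classes 0 and 1 are confused, class 2 is clean.
M = np.array([[90, 10, 0],
              [20, 80, 0],
              [0, 0, 100]])
d_conf = confusion_distance(M)
```

In this example classes 0 and 1 share 30 mutual errors against 170 correct predictions, giving a distance of about 0.82, while class 2 stays at distance 1 from both.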

The final integrated inter-class distance matrix $D$ is formed by combining the geometric and confusion components as a weighted sum

$$D_{ij} = \alpha \cdot d_{ij}^{\mathrm{feat}} + (1 - \alpha) \cdot d_{ij}^{\mathrm{conf}}, \tag{4}$$

where the weight coefficient $\alpha \in [0, 1]$ determines the balance between theoretical feature similarity and the actual behavior of the classifier.

The value of $\alpha$ allows the method to adapt to the quality and stability of the utilized feature extractor: with a high-performance backbone module that forms well-separated classes in the latent space, more weight is assigned to the geometric component. Conversely, for a weaker or less generalizable model, it is advisable to emphasize empirical classification errors. The resulting integrated distance matrix serves as the formal foundation for the automatic synthesis of the class hierarchical structure and the construction of an efficient classification cascade.
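The weighted sum in Equation (4) can be sketched as below. The function name `hybrid_distance` and the two-class toy matrices are hypothetical; the default `alpha=0.5` is simply a value inside the 0.4–0.6 range that the abstract reports as optimal.

```python
import numpy as np

def hybrid_distance(d_feat, d_conf, alpha=0.5):
    """Weighted combination of geometric and confusion distances
    (Equation (4)); alpha weights the geometric component."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(d_feat, dtype=float) + (1.0 - alpha) * np.asarray(d_conf, dtype=float)

# Two classes that are far apart geometrically but often confused (assumed values).
d_feat = np.array([[0.0, 0.8], [0.8, 0.0]])
d_conf = np.array([[0.0, 0.2], [0.2, 0.0]])

D_balanced = hybrid_distance(d_feat, d_conf, alpha=0.5)   # 0.5*0.8 + 0.5*0.2
D_geometric = hybrid_distance(d_feat, d_conf, alpha=1.0)  # geometric term only
```

With equal weights the pair ends up at distance 0.5, so frequent confusion pulls the classes closer than their centroids alone would suggest.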

3.3. Automated Cascade Generation via Hierarchical Agglomerative Clustering

To transform the set of classes $C$ into a multi-level hierarchical structure, this study proposes the use of Hierarchical Agglomerative Clustering (HAC) [31] with the average linkage method, applied to the previously formed integrated inter-class distance matrix $D$. The application of HAC enables the formal and automated construction of a class similarity tree without the need for expert-defined rules or manual grouping, relying exclusively on the properties of the latent feature space and classifier behavior.

At the initial stage, the HAC algorithm sequentially merges pairs of classes or clusters with the minimum distance value, forming a tree-like structure. Initially, each base class is treated as an individual cluster; subsequently, the most similar elements are merged at each iteration. This results in a complete hierarchy of nested clusters, reflecting a gradual transition from a highly detailed representation to generalized class groups. This dendrogram provides a compact formal description of the similarity structure among classes within the training set.
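The merging procedure described above can be reproduced with SciPy's average-linkage implementation. This is a sketch under stated assumptions: the helper name `build_dendrogram`, the class names, and the toy distance matrix are hypothetical, and the paper does not say which library the authors used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def build_dendrogram(D):
    """Average-linkage HAC on a precomputed square distance matrix.
    Returns the SciPy linkage matrix (one row per merge)."""
    D = np.asarray(D, dtype=float)
    D = (D + D.T) / 2.0            # enforce exact symmetry
    np.fill_diagonal(D, 0.0)
    condensed = squareform(D, checks=False)  # square -> condensed vector
    return linkage(condensed, method="average")

# Toy hybrid distances (assumed values).
names = ["pedestrian", "car", "van", "truck"]
D = np.array([
    [0.00, 0.90, 0.90, 0.95],
    [0.90, 0.00, 0.20, 0.50],
    [0.90, 0.20, 0.00, 0.40],
    [0.95, 0.50, 0.40, 0.00],
])
Z = build_dendrogram(D)
```

Each row of `Z` lists the two merged cluster indices, the merge height, and the size of the new cluster. In this toy case "car" and "van" merge first at height 0.2, "truck" joins them at 0.45, and "pedestrian" is attached last.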

The synthesis of classification cascade levels is performed by “cutting” the resulting dendrogram at a fixed threshold $\tau$, which defines the maximum allowable intra-cluster distance:

Superclasses. Clusters formed above this threshold are interpreted as superclasses for the initial and intermediate levels of the cascade.
Atomic Classes. Leaf nodes correspond to the final, atomic classes used at the fine-grained identification level.

This procedure automatically implements the coarse-to-fine principle: the most distant and semantically distinct object groups are separated at early stages, while visually and structurally similar classes are grouped and passed to deeper levels of analysis. For example, fundamentally different objects, such as pedestrians and vehicles, are differentiated at the upper levels, while similar vehicle types remain within a single superclass until the final stage.

The threshold $\tau$, which determines whether further clustering subdivision is required, is automatically estimated from the dendrogram-derived inconsistency coefficients. For each internal node $i$, the inconsistency coefficient $I_i$ measures the normalized deviation between the linkage height $h_i$ and the statistical properties of merge heights within a local neighborhood of depth $d$. This allows identifying merges that significantly deviate from the underlying hierarchical structure.

The global threshold is defined as:

$$\tau = \mu_I + k \sigma_I, \tag{5}$$

where $\mu_I$ and $\sigma_I$ denote the mean and standard deviation of all inconsistency coefficients across the dendrogram, and $k$ is a sensitivity parameter controlling the strictness of the splitting criterion.

This formulation provides a statistically grounded and data-driven mechanism that does not require dataset-specific tuning.

Clusters corresponding to nodes with $I_i > \tau$ are recursively subdivided, while those with $I_i \leq \tau$ are considered sufficiently homogeneous. As a result, the hierarchy depth is determined adaptively: more complex datasets yield deeper structures, whereas simpler datasets yield more compact representations.

This enables adaptive control of the cascade depth by identifying statistically significant deviations in cluster merging, thus preventing unnecessary subdivision when intra-cluster homogeneity is already sufficient.
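The threshold rule in Equation (5) can be sketched with SciPy, whose `inconsistent` function returns, for every merge, the mean and standard deviation of merge heights in a neighbourhood of the given depth, together with the resulting coefficient. Treating the subdivision rule as SciPy's `"inconsistent"` flat-clustering criterion is an assumption, as are the helper names, the default values `depth=2` and `k=1.0`, and the toy distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, inconsistent, linkage
from scipy.spatial.distance import squareform

def inconsistency_threshold(Z, depth=2, k=1.0):
    """tau = mean(I) + k * std(I) over all merges (Equation (5))."""
    R = inconsistent(Z, d=depth)
    coeffs = R[:, 3]               # fourth column: inconsistency coefficient
    tau = coeffs.mean() + k * coeffs.std()
    return tau, coeffs

def cut_by_inconsistency(Z, depth=2, k=1.0):
    """Flat clusters in which every merge has inconsistency <= tau."""
    tau, _ = inconsistency_threshold(Z, depth=depth, k=k)
    labels = fcluster(Z, t=tau, criterion="inconsistent", depth=depth)
    return labels, tau

# Toy hybrid distances for four classes (assumed values).
D = np.array([
    [0.00, 0.90, 0.90, 0.95],
    [0.90, 0.00, 0.20, 0.50],
    [0.90, 0.20, 0.00, 0.40],
    [0.95, 0.50, 0.40, 0.00],
])
Z = linkage(squareform(D, checks=False), method="average")
tau, coeffs = inconsistency_threshold(Z, depth=2, k=1.0)
labels, _ = cut_by_inconsistency(Z, depth=2, k=1.0)
```

Lowering `k` makes the criterion stricter, so more merges exceed the threshold and the resulting partition becomes finer.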

Because the hierarchy is synthesized once, during the offline phase, the computational cost of clustering does not affect inference-time performance.

Algorithm 1 describes the full cycle of forming the hierarchical classification architecture, from raw data analysis to the training of a suite of specialized models for each level of the cascade.

Algorithm 1. Automatic Cascade Construction

Input: Training set $D = \{(x_i, y_i)\}$, feature extractor $\Phi(\cdot)$, weight coefficient $\alpha$.

Output: Hierarchy structure $L = \{L_1, \dots, L_K\}$, set of trained classifiers $\Theta = \{\theta_1, \dots, \theta_K\}$.

Extraction: Obtain feature vectors $f_i$ for the entire dataset using $\Phi(\cdot)$.
Centroids: Compute centroids $\mu_k$ for each unique class in $D$.
Metrics:
• Construct the geometric distance matrix $d^{\mathrm{feat}}$;
• Perform baseline training of a flat classifier to obtain the confusion matrix $M$;
• Calculate the final proximity matrix $D$ according to the weight parameter $\alpha$.
Clustering: Execute hierarchical agglomerative clustering.
Hierarchy Synthesis: Determine the optimal threshold $\tau$ using the inconsistency coefficient and cut the dendrogram. Define the composition of superclasses at each level $L_k$.
Recursive Training: For each level $L_k$:
• Assemble a reduced dataset $D_k$, where original class labels are replaced by the corresponding superclass labels for that level;
• Train a specialized classifier $\theta_k$ on the set $D_k$.
Return: Generated cascade structure and a set of trained model weights.

To obtain the empirical confusion matrix M necessary for the hybrid metric, a monolithic YOLOv11s model was trained for 20 epochs on the flat, multi-class dataset. The resulting validation confusion matrix serves as the empirical input for the clustering algorithm.
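For reference, a confusion matrix with the convention used in Equation (3) (rows are true classes, columns are predicted classes) can be accumulated from validation predictions as sketched below. The function name and the toy label lists are hypothetical. A detector's validation matrix may also contain a background row and column, which would need to be dropped before applying Equation (3); that handling is not shown here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """M[i, j] = number of samples of true class i predicted as class j."""
    M = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(M, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return M

# Toy validation labels (assumed values).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 0, 1, 0, 2]
M = confusion_matrix(y_true, y_pred, num_classes=3)
```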

Following the formation of the cascade structure, recursive training of specialized classifiers is conducted. For each level of the hierarchy, a corresponding training set $D_k$ is generated, where original class labels are substituted with the labels of the respective superclasses defined at that level. For instance, if the “Truck” and “Bus” classes are automatically merged into a single cluster at an early level, the model for that level is trained to recognize them as the generalized category “Large Vehicle.” This approach significantly simplifies the task for each individual classifier, reduces the number of alternative decisions, and consequently enhances recognition stability and accuracy.

The inference process in the hierarchical architecture is implemented as a sequential descent through the model cascade. At each level, the decision is made by the classifier $\theta_k$ considering the set of allowed classes $S_k$, which is determined by the output of the previous level:

$$\hat{y}_k = \theta_k(f, S_k). \tag{6}$$

This mechanism ensures that each model focuses exclusively on relevant classes and their corresponding features, reducing the risk of error accumulation between levels and increasing system reliability in complex UAV monitoring scenarios. Ultimately, the proposed procedure provides a fully automated construction of a hierarchical classification architecture aligned with both the feature space geometry and the practical behavior of the neural models.
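The sequential descent of Equation (6) can be sketched as a walk down a tree of classifiers, where the children of the current node play the role of the allowed set $S_k$. Everything in this sketch is hypothetical: the `CascadeNode` structure, the `threshold_classifier` stand-ins (real nodes would be trained FT-Transformer classifiers), and the two-element toy feature vectors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

Scores = Dict[str, float]

@dataclass
class CascadeNode:
    """One node of the hierarchy; leaf nodes have no classifier or children."""
    classifier: Optional[Callable[[Sequence[float]], Scores]] = None
    children: Dict[str, "CascadeNode"] = field(default_factory=dict)

def cascade_predict(root: CascadeNode, f: Sequence[float]) -> List[str]:
    """Descend from the root, choosing at each level the best-scoring
    label among the allowed set S_k (the children of the current node)."""
    node, path = root, []
    while node.children:
        scores = node.classifier(f)
        allowed = {c: s for c, s in scores.items() if c in node.children}
        best = max(allowed, key=allowed.get)
        path.append(best)
        node = node.children[best]
    return path

def threshold_classifier(index, cut, low, high):
    """Toy stand-in for a trained classifier: thresholds one feature."""
    def classify(f):
        is_high = float(f[index] > cut)
        return {low: 1.0 - is_high, high: is_high}
    return classify

large = CascadeNode(threshold_classifier(1, 10.0, "truck", "bus"),
                    {"truck": CascadeNode(), "bus": CascadeNode()})
vehicle = CascadeNode(threshold_classifier(0, 5.0, "small vehicle", "large vehicle"),
                      {"small vehicle": CascadeNode(), "large vehicle": large})
root = CascadeNode(threshold_classifier(0, 1.0, "pedestrian", "vehicle"),
                   {"pedestrian": CascadeNode(), "vehicle": vehicle})

path = cascade_predict(root, [8.0, 12.0])
```

For the feature vector `[8.0, 12.0]` the descent yields `["vehicle", "large vehicle", "bus"]`, while `[0.5, 0.5]` stops after one level at `["pedestrian"]`, so the deeper classifiers are never evaluated for that sample.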

3.4. General Pipeline of the Approach

The proposed approach implements a full cycle of intelligent processing for UAV-acquired imagery, covering all key stages from training data analysis to real-time decision-making. The general workflow, shown in Figure 2, is divided into two conceptually distinct but closely related phases: (1) Offline Cascade Synthesis, the primary contribution of this work, which automatically generates the hierarchy from raw data and trains the cascade; and (2) Online Hierarchical Inference, which uses the synthesized structure for real-time multi-level target identification, as established in our foundational research [5]. This separation moves computationally intensive, structure-forming procedures to the pre-processing stage, ensuring high system throughput during operation.

In the offline phase, the transition from raw annotated data to a formalized hierarchical classification model is performed. The initial stage involves forming a representative feature space using a shared backbone module based on the YOLOv11 architecture. The YOLOv11 architecture was selected as the foundational backbone due to its implementation of C3k2 modules and a refined spatial attention mechanism, which together provide a more robust and discriminative feature representation. These architectural improvements enable more precise alignment of the representative feature space, especially when capturing the nuanced characteristics required for subsequent hierarchical classification and inter-class distance analysis. This module extracts feature vectors $f_i$ from all annotated objects in the training sample, providing a unified and consistent representation of the input data regardless of subsequent model specialization. Based on these representations and the error analysis of the baseline flat model, a hybrid inter-class distance matrix $D_{ij}$ is calculated, integrating geometric feature similarity and empirical class confusion.

Subsequently, the classification cascade structure is automatically generated using the $D_{ij}$ matrix. Hierarchical agglomerative clustering is applied, followed by dendrogram cutting at threshold $\tau$, which simultaneously determines the number of hierarchy levels $L$ and the composition of superclasses at each classification stage, as proposed in [5]. Thus, the structure of Stages 1–3 is formed without expert intervention, based solely on the properties of the latent feature space and model behavior. After synthesizing the hierarchy, specialized model training is performed in a transfer learning mode. For each hierarchy node, a specific model combination is trained, including an adapted YOLOv11 feature extractor optimized for the specific visual characteristics of the class group, and an FT-Transformer-based classifier that effectively forms decision boundaries in the corresponding feature subspace.

Once offline preparation is complete, the system is deployed for online operation onboard the UAV or at a ground processing station. Each input frame is processed according to a multi-level “coarse-to-fine” cascade principle. Initially, the input image is analyzed by a Faster R-CNN network, acting as a high-precision region of interest (ROI) generator. This stage effectively separates the background and localizes all potential objects regardless of their class. Next, the extracted ROIs are passed to the first level of the cascade for coarse filtering and separation into fundamental categories, such as humans and vehicles. The next level performs intermediate refinement for objects assigned to the vehicle category, distinguishing between large and small vehicles. The final cascade level provides fine-grained identification of critically similar targets, such as distinguishing between visually similar transport types that require the most specific features (visualized in Appendix A).

The combination of the automated hierarchy synthesis proposed in this work with the Faster R-CNN/YOLO/FT-Transformer cascade architecture provides several fundamental advantages. The system demonstrates high adaptability, as it can automatically reconfigure itself for any set of classes without manual hierarchy design. Model specialization at each level effectively reduces inter-class ambiguity, a critical issue for visually similar objects in UAV imagery, while simultaneously lowering the risk of error propagation. Furthermore, computational efficiency is achieved by using the most complex and resource-intensive models only for ROIs that have passed the preliminary filtering stages, thereby balancing accuracy and speed.

3.5. Datasets

To ensure the objectivity of the results and the ability to compare with previous research, the proposed automatic cascade synthesis method was validated on several representative datasets covering various UAV monitoring scenarios. The VisDrone2019 dataset was selected as the primary source, comprising over 8000 high-resolution images collected from various UAVs under diverse weather and lighting conditions. This dataset was used to calculate the distance matrix D_{ij} and for the automatic synthesis of the cascade structure, as well as for training and testing the hierarchical models with an official benchmark 80/20 data split.

Dynamic scenarios were analyzed based on the UAV123 video dataset, which includes 123 sequences and is critical for assessing detection and classification stability under scale variation, camera motion, and partial occlusions. Utilizing this dataset confirmed the hypothesis that the automatically synthesized hierarchy can effectively adapt to new challenging conditions without expert intervention, relying solely on the analysis of the latent feature space.

3.6. Evaluation Metric System and Method Validation Strategy

A multi-level system of metrics and validation procedures is employed for a comprehensive and objective assessment of the proposed approach, allowing for the analysis of both the local properties of individual cascade components and the generalized behavior of the system as a whole. The proposed evaluation framework covers three complementary levels of analysis, ensuring correct interpretation of results in the context of hierarchical classification of UAV imagery.

At the local evaluation level, the classification quality at each node of the cascade is considered separately. For all models corresponding to various hierarchy levels, standard classification metrics, specifically Precision, Recall, and F1-score, are calculated. This analysis quantitatively assesses the degree of class separability at each level of the hierarchy and identifies bottlenecks where the greatest inter-class ambiguity occurs. Special attention is given to the intermediate levels of the cascade, as errors at these stages directly influence the set of allowed classes at subsequent levels.
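As an illustration of the node-level metrics, the following sketch derives Precision, Recall, and F1-score for each class from a confusion matrix. The two-class matrix is invented for demonstration, and the convention that rows are true classes and columns are predictions is an assumption of this sketch.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    metrics = {}
    size = len(labels)
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[row][k] for row in range(size)) - tp  # predicted k, truly other
        fn = sum(cm[k]) - tp                              # truly k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Invented counts for a single fine-grained node.
labels = ["van", "bus"]
cm = [
    [8, 2],  # true van: 8 predicted van, 2 predicted bus
    [1, 9],  # true bus: 1 predicted van, 9 predicted bus
]
node_metrics = per_class_metrics(cm, labels)
for label, (p, r, f1) in node_metrics.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```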

To evaluate the system’s overall quality for the detection task, the mean Average Precision (mAP) is used, a standard metric in computer vision. Two of its modifications are considered: mAP@0.50, which evaluates accuracy at a fixed Intersection over Union (IoU) threshold, and mAP@0.50:0.95, which averages accuracy across a wide range of IoU thresholds and allows for a more detailed assessment of object localization quality. Using both indicators provides a balanced assessment of the system’s ability to both accurately detect objects and correctly localize them in the image space.
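To make the role of the IoU threshold concrete, the sketch below computes IoU for two axis-aligned boxes and checks which of the ten thresholds commonly used for mAP@0.50:0.95 (0.50 to 0.95 in steps of 0.05) the prediction would satisfy. The box coordinates are invented, and the `(x1, y1, x2, y2)` box format is an assumption of this example.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Thresholds 0.50, 0.55, ..., 0.95 used when averaging over IoU levels.
THRESHOLDS = [t / 100 for t in range(50, 100, 5)]

ground_truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)  # same size, shifted 2 units to the right

overlap = iou(prediction, ground_truth)
passed = [t for t in THRESHOLDS if overlap >= t]
print(f"IoU = {overlap:.3f}; counted as a match at thresholds {passed}")
```

A detection that counts as correct at IoU 0.50 can fail at stricter thresholds, which is why the averaged metric is more sensitive to localization quality.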

The aggregated cascade efficiency metric M_{avg}, proposed in [5], is introduced separately to reflect the system’s averaged performance across all hierarchy levels and to quantitatively assess the cumulative effect of error accumulation. This metric is especially important for hierarchical architectures, as standard indicators at the individual level do not always reflect the quality of the final decision.

The validation strategy for the proposed method is structured to independently verify the correctness of the automatic hierarchy synthesis and the effectiveness of the integral approach as a whole. The first stage involves the internal validation of the automatic hierarchy generation algorithm. The goal of this stage is to prove that the hierarchy constructed by the agglomerative clustering method based on the D matrix objectively reflects the internal structure and complexity of the data. To achieve this, a comparative structural analysis is performed between the automatically generated dendrogram and the expert hierarchy proposed in [5]. Particular attention is paid to non-trivial or “unexpected” clusters identified by the algorithm as critically similar but not explicitly highlighted in the expert approach, which allows for assessing the method’s ability to reveal hidden patterns in the feature space.

Furthermore, an ablation study is conducted on the influence of the weight coefficient α, which determines the balance between the geometric and confusion components of the hybrid metric. The value of α varies in the range from 0, where only the confusion matrix is considered, to 1, where only the distance between feature centroids is used. This analysis allows for determining the optimal ratio between theoretical class similarity and actual classifier behavior specifically for UAV image processing tasks.
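The way α interpolates between the two components can be sketched as a convex combination of two distance matrices. This is a minimal illustration under stated assumptions: the max-scaling of centroid distances, the conversion of confusion rates into a distance via `1 - mean rate`, and all numeric values are choices made for this example and may differ from the exact definition of the hybrid metric used in the paper.

```python
import math

def centroid_distances(centroids):
    """Pairwise Euclidean distances between class centroids, scaled to [0, 1]."""
    n = len(centroids)
    raw = [[math.dist(centroids[i], centroids[j]) for j in range(n)]
           for i in range(n)]
    peak = max(max(row) for row in raw) or 1.0
    return [[value / peak for value in row] for row in raw]

def confusion_distances(cm):
    """Symmetric distance from a confusion matrix (rows = true class):
    classes that are often mistaken for each other end up close together."""
    n = len(cm)
    rate = [[cm[i][j] / sum(cm[i]) for j in range(n)] for i in range(n)]
    return [[0.0 if i == j else 1.0 - 0.5 * (rate[i][j] + rate[j][i])
             for j in range(n)] for i in range(n)]

def hybrid_distance(geo, conf, alpha):
    """alpha = 1 keeps only geometry; alpha = 0 keeps only confusion."""
    n = len(geo)
    return [[alpha * geo[i][j] + (1.0 - alpha) * conf[i][j] for j in range(n)]
            for i in range(n)]

# Invented centroids and confusion counts for three classes.
classes = ["car", "van", "person"]
centroids = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
cm = [[80, 18, 2], [20, 78, 2], [1, 1, 98]]

geo = centroid_distances(centroids)
conf = confusion_distances(cm)
for alpha in (0.0, 0.5, 1.0):
    d = hybrid_distance(geo, conf, alpha)
    print(f"alpha={alpha}: car-van={d[0][1]:.3f}, car-person={d[0][2]:.3f}")
```

Sweeping `alpha` over a grid and re-running the hierarchy synthesis for each resulting matrix corresponds to the ablation described above.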

In the second stage, external end-to-end validation of the integral approach is performed, within which the performance of the full pipeline, including detection, automatic hierarchy synthesis, and multi-level classification, is analyzed. A direct comparison of the results of the proposed model with the automatically generated hierarchy and the expert cascade described in [5] is conducted based on F1-score and mAP metrics. This experimental comparison confirms the hypothesis that automating the hierarchy construction process does not degrade, and in some cases improves, recognition quality.

The generalization capability of the approach under varying visual domains is additionally investigated. For this purpose, the synthesis procedure configured on the VisDrone dataset is applied to the UAV123 dataset without any manual modifications. This experiment is critical as it demonstrates the ability of the automated approach to adapt to new observation conditions and different data types without repeated expert system design. The final validation stage is the analysis of computational latency, measuring real-time frame processing speed on the NVIDIA Jetson Xavier NX platform.

3.7. Experimental Setup

To ensure reproducibility and consistency of results, all experiments were conducted in an environment identical to that described in [5].

The hardware configuration included a 64-bit Linux workstation equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB GDDR6X) (NVIDIA Corporation, Santa Clara, CA, USA), an AMD Ryzen 9-series processor (Advanced Micro Devices, Inc., Santa Clara, CA, USA), and 64 GB of RAM. For mobile platform tests (Edge AI), the NVIDIA Jetson Xavier NX module (NVIDIA Corporation) was utilized.

The software environment included Python v3.10, PyTorch v2.2.1, the Ultralytics library for YOLOv11 models, Scikit-learn v1.8.0 for hierarchical clustering, and Hugging Face Transformers v5.5.2 for FT-Transformer architectures.

Regarding hyperparameters, models were trained using the Adam optimizer with an initial learning rate of 0.001 over 20 epochs.

For the automatic cascade construction, base values of α = 0.5 (representing an equal balance between geometry and confusion) and a threshold τ were established, ensuring the formation of three classification levels for direct comparison with [5].

3.8. Ethical Considerations

All datasets used in this study are publicly available and were collected in accordance with the ethical standards and privacy regulations of the respective organizations. The research focuses on the technical aspects of target identification for situational awareness and does not involve the collection of personal identifiable information.

4. Results

The main objective of this section is to provide a quantitative and qualitative evaluation of the impact of automated hierarchy synthesis on the performance of the object detection and classification system in UAV imagery, in comparison with (i) a flat model without hierarchy; (ii) existing approaches; (iii) a cascade architecture with an expert-designed hierarchy.

For automatic construction of the cascade structure, the VisDrone2019 dataset served as the basis. The automatic synthesis of the class hierarchy was performed using agglomerative hierarchical clustering based on a combined proximity matrix D that integrates geometric feature similarity and the classifier’s empirical confusion. The result of this stage is a dendrogram that reflects the multi-level relationships among the classes in the VisDrone2019 dataset.

The constructed dendrogram (Figure 3) clearly illustrates the formation of a natural hierarchy of objects, where large semantic groups are separated at the early stages of aggregation, while visually and functionally similar classes merge only at the lower levels of the tree. This confirms the proposed method’s ability to automatically implement the coarse-to-fine principle without relying on expert-defined rules. Thus, the dendrogram serves not only as an analytical tool but also as a key structural component of the entire pipeline.

While the automated synthesis successfully replicated the primary semantic boundary of the expert-designed cascade (isolating “People” from “Vehicles”), it removed human bias in the sub-classification of vehicles. Whereas our previous expert model [5] manually forced “Van” and “Bus” into a shared intermediate node, the data-driven dendrogram (Figure 3) reveals that “Truck” is structurally distinct enough to branch off earlier, while “Van” shares a tighter geometric and empirical overlap with the “Other Vehicles” category. This indicates that automated synthesis can discover routing paths that a human designer might overlook.

At the same time, the automated approach demonstrates an important difference from expert-based design: the hierarchy structure is determined not only by semantic considerations but also by the model’s actual statistical error patterns. For example, the algorithm confirmed the appropriateness of merging the “Car” and “Van” classes into a single group at an early level. However, it also revealed increased empirical confusion between certain subclasses, such as “Truck” and “Bus,” leading to the separation of the “Truck” category at the second level of the hierarchy.

Such “unexpected” clusters should not be interpreted as algorithmic errors. On the contrary, they reflect the true complexity of visual recognition in UAV imagery, including small object sizes, occlusions, and varying viewing angles. This observation confirms that the model operates not only on visual features but also accounts for contextual similarities between objects under challenging aerial imaging conditions.

Accordingly, since the cascade structure for the VisDrone2019 dataset was constructed in a manner similar to that described in [5], it becomes possible to perform a comparative evaluation of the complete object detection and classification pipeline against other existing approaches (Table 2).

The results presented in Table 2 are reported as the mean value and standard deviation obtained from three independent training runs of the models, following the same evaluation protocol as in [5]. As shown in the table, the proposed pipeline for hierarchy construction, model training, and final classification outperforms several existing classification methods, including approaches based on direct multi-class classification. The pipeline also constructs a hierarchical structure similar to that in [5], which yields comparable metric values.

The FPS metrics reported in Table 2 were recorded on the high-performance workstation (RTX 3090). For Edge AI deployment on the NVIDIA Jetson Xavier NX, as noted in our previous work [5], the baseline monolithic models achieve ~38 FPS, while the cascaded inference operates at ~14 FPS depending on the density of detected ROIs. Future optimization via TensorRT is expected to bridge this gap.
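For intuition, the reported frame rates can be converted into per-frame processing times. The short sketch below uses only the approximate Jetson Xavier NX figures quoted above (~38 FPS and ~14 FPS); the helper name `frame_time_ms` is introduced here purely for illustration.

```python
def frame_time_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps

monolithic_fps = 38.0  # approximate baseline figure quoted in the text
cascade_fps = 14.0     # approximate cascade figure quoted in the text

monolithic_ms = frame_time_ms(monolithic_fps)
cascade_ms = frame_time_ms(cascade_fps)
slowdown = cascade_ms / monolithic_ms

print(f"monolithic: {monolithic_ms:.1f} ms per frame")
print(f"cascade:    {cascade_ms:.1f} ms per frame")
print(f"slowdown:   {slowdown:.2f}x")
```

Because the cascade figure depends on the density of detected ROIs, the ratio is indicative and not a fixed property of the architecture.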

One of the key components of the proposed automatic hierarchy synthesis method is a hybrid distance metric that combines the geometric proximity of features in the latent space with empirical information about class confusion obtained from a baseline classifier. The weighting coefficient α determines the relative contribution of each of these components in the hierarchy construction process and therefore directly affects the cascade structure and the final recognition performance.

The goal of this experimental analysis is not only to determine the optimal value of the parameter α, but also to demonstrate the fundamental validity of the hybrid approach. In particular, the analysis aims to determine whether geometric feature similarity alone is sufficient or whether it is necessary to incorporate the model’s actual behavior under real classification errors.

To this end, a series of experiments was conducted in which the parameter α was varied from 0 to 1 in fixed steps. For each value of α, the class hierarchy was automatically synthesized, after which the full training cycle of the cascade architecture was performed and its performance was evaluated on the test set. The final performance was measured using the integral F1-score metric (similar trends were also observed for mAP).

The experimental results are presented in Figure 4, where the X-axis represents the values of the parameter α, and the Y-axis shows the final cascade accuracy. Analysis of the curve indicates that the extreme cases α = 0 and α = 1 demonstrate reduced performance. In the first case, the hierarchy is formed exclusively based on the confusion matrix, making it overly dependent on random errors and noise in the training data. In the second case, only the geometric distance between feature centroids is used, which ignores the actual difficulty of class discrimination for the neural network under UAV imaging conditions.

The highest recognition performance is achieved within the interval α ∈ [0.4, 0.6], where a clearly pronounced maximum of the F1-score is observed. This indicates that the optimal hierarchy is formed precisely when the cascade structure considers both the theoretical visual similarity of objects and the empirical “experience” of the model captured in the confusion matrix. In other words, the hybrid metric enables alignment of the feature space with the real challenges of classification, particularly for small, visually similar objects in UAV imagery.

Thus, the conducted analysis confirms that the use of a hybrid metric is not an arbitrary design choice but rather a necessary condition for stable and efficient automatic synthesis of the hierarchical structure. The results also demonstrate that the proposed method is relatively insensitive to the exact value of the parameter α near the optimum, which simplifies its practical application and adaptation to new datasets without manual fine-tuning.

Another key component of the proposed automatic hierarchy synthesis method is the cutting threshold τ. While the parameter α regulates the balance between latent geometry and empirical error rates in the proximity metric, the threshold τ serves as the primary structural controller. It determines the granularity of the hierarchical decomposition by defining the points at which the dendrogram is “cut” to form functional cascade stages.

As illustrated in Figure 5, the relationship between τ and the system’s performance is non-linear. Small values of τ lead to an overly granular hierarchy with a large number of specialized nodes. While this can theoretically improve fine-grained precision, it significantly increases the inference latency due to the depth of the cascade. Conversely, excessively high values of τ cause the algorithm to merge distinct “confusion zones,” effectively collapsing the hierarchy back toward a monolithic architecture and leading to a drop in the F1-score.

The experimental results indicate that the optimal classification accuracy is achieved within the interval τ ∈ [0.45, 0.55]. Within this range, the synthesized structure maintains a balance between specialized feature extraction and computational efficiency. Specifically, at the selected value of τ, the algorithm successfully identifies the “elbow point” in the linkage distances, separating classes into a three-level hierarchy that mirrors the complexity of the VisDrone2019 dataset’s inter-class overlaps. This data-driven selection of τ confirms that the framework can independently reach a configuration that matches or exceeds the reliability of expert-designed systems.

To systematically validate the structural soundness of the proposed automated framework, we conducted extensive ablation studies on the two core hyperparameters that govern the cascade synthesis: the hybrid metric weight coefficient α and the dendrogram cutting threshold τ.

The parameter α dynamically balances the theoretical geometry of the latent feature space against the empirical error statistics of the baseline model. As illustrated in Figure 4, executing the pipeline at the extreme boundaries, α = 0 (relying exclusively on the confusion matrix) and α = 1 (relying exclusively on centroid Euclidean distances), yields sub-optimal classification performance, with F1-scores dropping below 88.5%.

When α = 1, the framework remains strictly “blind” to the real-world sensor noise, atmospheric interference, and structural distortions inherent in UAV imagery, grouping classes that are geometrically proximate but empirically highly volatile. Conversely, when α = 0, the structure becomes overly sensitive to the specific statistical biases and random noise of the baseline training sample, leading to poor generalization. The peak F1-score of 94.9% is achieved within the stable interval of α ∈ [0.4, 0.6], indicating that an effective architectural synthesis requires a balance between learned deep representations and empirical model behavior.

Similarly, the cutting threshold τ serves as the primary controller of the cascade’s granularity (Figure 5). Lower values of τ trigger excessive, hyper-specific cluster splitting, which increases cascade depth, induces severe inference latency, and accelerates early-stage error propagation. High values of τ collapse the hierarchy back into a monolithic “flat” architecture, failing to isolate the primary “confusion zones.” The experimental optimization identifies τ ∈ [0.45, 0.55] as the ideal trade-off, where the algorithm independently discovers a stable three-level taxonomy that minimizes inter-class ambiguity while preserving a real-time frame rate of 41 FPS.

To evaluate the adaptability of the proposed framework, we performed a synthesis experiment on the UAV123 dataset. Rather than applying the VisDrone2019-derived hierarchy, the system independently synthesized a new cascade structure (Figure 6) based on the unique visual features and class correlations present in UAV123. This demonstrates the method’s ability to self-configure when presented with a novel domain. This dataset was selected due to its similarity to the VisDrone2019 dataset. Both datasets share several common classes, while also containing certain classes that differ.

As shown in Figure 6, the algorithm constructed a hierarchy whose structure is largely similar to that obtained for VisDrone2019, although certain differences can be observed. At the first stage, the classes “Person” and “Bike” were merged into a common subclass, “Human-like,” since objects in these classes share human-related characteristics (a bicycle is typically associated with a human rider). Meanwhile, the refinement of the “Vehicle” category demonstrates almost the same behavior as observed for the VisDrone2019 dataset.

Models trained using the cascade structure synthesized by the automatic approach achieved the following performance values on the UAV123 dataset (Table 3).

To rigorously evaluate the generalization capability and robustness of the proposed automated cascade synthesis method outside the primary training domain, a comprehensive cross-domain validation was executed using the UAV123 video dataset. Unlike the static, high-resolution aerial captures that characterize the VisDrone2019 dataset, the UAV123 benchmark poses a drastically different set of challenges. It features continuous aerial video sequences recorded at lower relative altitudes, marked by rapid camera ego-motion, severe aspect ratio transformations, out-of-view illumination changes, and persistent partial or full target occlusions. Testing the framework on this dataset serves as a strict baseline to verify that the automatically synthesized hierarchy does not suffer from overfitting and can adapt autonomously to unfamiliar environmental and sensor configurations.

A key advantage of the proposed methodology is its complete independence from human intervention when transitioning between distinct data domains. Without any manual modifications, architectural adjustments, or expert-guided class grouping, the automated pipeline was deployed directly to analyze the latent feature space and baseline confusion matrix of the novel UAV123 data distribution.

Driven strictly by the hybrid proximity metric D_{ij} with an optimized weight parameter α = 0.5 and the data-adaptive inconsistency threshold τ, the Hierarchical Agglomerative Clustering (HAC) algorithm independently synthesized a new, tailored classification cascade topology (Figure 6).

The resulting dendrogram, illustrated in Figure 6, objectively maps the structural and visual correlations specific to the UAV123 feature distribution. At the first hierarchy level, the algorithm successfully isolated a “Human-like” superclass by automatically merging the “Person” and “Bike” categories. This structural grouping uncovers a subtle visual and contextual pattern within the data—namely, that bicycles in aerial monitoring scenes are almost invariably associated with human riders, resulting in overlapping latent descriptors. Simultaneously, the refinement path for the broader “Vehicle” superclass replicated the core behavior observed during the VisDrone2019 synthesis, segregating distinct sub-nodes for highly correlated transportation categories.

The specialized deep learning models, comprising the adapted YOLOv11 feature extractors and fine-grained FT-Transformer classifiers, were trained recursively according to this automatically generated three-level taxonomy.

To further solidify the empirical validation and benchmark the robustness of our framework under severe, unconstrained environmental stress, we conducted an additional round of stress-testing on the highly challenging UAVDT (Unmanned Aerial Vehicle Benchmark Object Detection and Tracking) dataset [33]. The UAVDT dataset poses extreme domain challenges specific to vehicular surveillance analytics, as it comprises dense traffic sequences shot across various complex urban typologies (e.g., main streets, highways, intersections, and toll booths) under highly volatile shooting attributes, including extreme night illumination, heavy atmospheric fog, sudden camera ego-motion, and critical target occlusions. The end-to-end quantitative evaluation of the resulting hierarchical cascade on several test sequences is compiled in Table 4.

To gain a deeper understanding of the behavior of the automatically constructed cascade and to analyze the mechanisms of classification error reduction, a stage-by-stage diagnostic analysis of each hierarchy level was performed using confusion matrices. Unlike aggregate metrics such as mAP or F1-score, confusion matrices allow a direct assessment of which specific classes are confused with each other and how these errors evolve throughout the multi-level processing pipeline.

For each level of the cascade, a separate confusion matrix was constructed using the validation dataset. It is important to note that at each subsequent stage the number of classes decreases according to the automatically generated hierarchy, which fundamentally affects both the structure and the interpretation of the corresponding matrices.

The confusion matrix of the first level (Figure 7a) corresponds to the task of coarse classification between fundamental superclasses such as People and Vehicle. As shown by the results, the matrix exhibits a clearly pronounced diagonal, indicating the model’s strong ability to distinguish semantically distant categories even under complex UAV scene conditions.

The off-diagonal elements at this level are minimal and are mostly associated with borderline cases, such as very small or partially occluded objects. This confirms that the automatically generated superclasses at the first level are well separated in the latent feature space.

The confusion matrix of the second level (Figure 7b) corresponds to the identification of the Truck class, which is the most distinct category within the first-level Vehicle superclass. As shown in the matrix, the diagonal remains clearly pronounced, indicating high classification accuracy.

The confusion matrix of the final level (Figure 7c) corresponds to the most challenging task—distinguishing between highly similar classes such as Van and Bus. It is precisely at this stage that flat classification models typically exhibit the highest concentration of errors.

In addition, it is useful to compare these confusion matrices with the matrix obtained from a model used for standard multi-class classification.

To construct this matrix (Figure 8), a fine-tuned YOLOv11s model was utilized, having been trained on the same dataset as the models within the cascade architecture. As the figure illustrates, the resulting matrix exhibits a more diffuse pattern: while broad superclasses such as “People” and “Other Vehicle” are relatively well separated, the intermediate classes remain highly ambiguous, leading to a decrease in classification accuracy.

As shown in Table 2, the proposed pipeline outperforms the monolithic YOLOv11s model by 0.8% in F1-score, albeit with a slight reduction in frame rate (FPS). Although this numerical gain may appear incremental, its true significance lies in the enhanced reliability of the classification process. The hierarchical structure specifically targets and resolves the “confusion zones” where the monolithic model’s performance typically plateaus, as clearly evidenced by comparing Figure 7 and Figure 8. Furthermore, the decrease in processing speed is not a critical limitation, as both models can comfortably operate in real-time. By automating the synthesis of this cascade, the system achieves a level of precision comparable to expert-designed architectures [5], while ensuring the model is optimally adapted to the specific statistical challenges inherent in the VisDrone2019 dataset.

Based on this analysis, it can be concluded that under the cascade-based approach, the diagonal of the confusion matrix remains distinctly pronounced, while off-diagonal misclassifications are significantly reduced compared to baseline multi-class methods. This indicates that cascade decomposition enables the model to focus exclusively on a relevant subset of classes at each processing stage, thereby minimizing the risk of error accumulation between semantically distant categories.

5. Discussion

The obtained results confirm that the automated synthesis of a hierarchical classification structure is an effective alternative to manual design of cascade architectures for UAV image analysis. Unlike traditional approaches, where the hierarchy is defined based on expert assumptions regarding the semantic similarity of classes, as in [5], the proposed method constructs the cascade structure through an objective analysis of the feature space and the empirical behavior of the model. This enables the classification architecture to be aligned with the actual complexities of the data rather than relying solely on their formal semantic descriptions.

A particularly important finding is that the automatically generated hierarchy not only reproduces the key structural decisions embedded in the expert-designed hierarchy of the previous work, but in several cases also provides more refined grouping of classes. This is especially evident at the intermediate levels of the cascade, where the algorithm identifies increased empirical confusion between certain types of vehicles and accordingly adjusts the structure of the superclasses. As a result, the hierarchy ceases to be a static design artifact and instead becomes an adaptive structure that reflects the statistical properties of classification errors.

Quantitative results on the primary VisDrone2019 dataset demonstrate that automation of the architecture design process does not reduce classification performance. On the contrary, a consistent improvement in the aggregated F1-score was observed compared to flat models without cascade decomposition. This confirms the hypothesis that the coarse-to-fine strategy is particularly effective for recognizing visually similar objects in UAV imagery, where class differences are often subtle and context-dependent.

The analysis of confusion matrices at each cascade level provides further insight into the mechanisms underlying the improved performance. The results show that the automatically synthesized hierarchy enables progressive localization of classification errors: semantically distant classes are effectively separated at early stages, while the most challenging cases are concentrated at the final stage, where the model has access to more specialized features. This reduces the risk of systematic misclassification and confirms the structural validity of the generated cascade.

At the same time, the analysis of the hybrid distance metric parameter indicates that the success of automatic hierarchy synthesis critically depends on balancing geometric feature similarity and empirical confusion information. Maximum performance is achieved within the interval α ∈ [0.4, 0.6], indicating that both the theoretical structure of the feature space and the practical “experience” of the model must be considered. This finding points to a general principle of hierarchical structure design: relying solely on feature space geometry (α → 1) proves insufficient, as it remains “blind” to the real-world sensor noise and environmental interference typical of UAV imagery. Conversely, relying exclusively on empirical error statistics (α → 0) makes the architecture prone to overfitting and sensitive to dataset-specific biases. The near-equal balance at the optimum ensures that the resulting cascade is both mathematically sound and empirically resilient, overcoming the inherent limitations of using either metric in isolation.

An important advantage of the proposed method is the significant reduction of the entry barrier for adapting monitoring systems to new operational environments. Traditional hierarchical approaches require expert involvement to construct semantic trees whenever the set of target objects or imaging conditions changes. In contrast, the proposed methodology automates this process: when new domain-specific classes are introduced (e.g., new types of vehicles or industrial equipment), the system automatically determines their position in the hierarchy based on visual similarity and empirical classification errors within the latent feature space. This capability enables rapid deployment in dynamic operational scenarios where manual architecture design is impractical.

To fully evaluate the operational viability of the proposed automated multi-stage framework, it is essential to establish a rigorous comparative analysis against state-of-the-art single-stage (flat) detectors, such as monolithic YOLO architectures. While modern single-stage detectors excel in localized feature extraction and high-speed processing, they hit a performance plateau when confronting severe inter-class ambiguity in UAV monitoring scenarios. This comparative paradigm centers on a fundamental trade-off between architectural specificity, computational latency, and final recognition accuracy.

The proposed multi-stage cascade offers two primary advantages over a representative single-stage architecture, as presented in [5].

Mitigation of “feature overlap”. Monolithic models are forced to optimize decision boundaries for all target classes simultaneously within a single, flat latent feature space. In aerial imagery, visually similar classes (e.g., “Van” and “Bus”, or “Person” and “Pedestrian”) generate embedding vectors that lie very close together, leading to significant classification instability under variable lighting or viewpoint changes. By contrast, our framework sequentially decomposes the global classification task into specialized subtasks. Early stages handle well-separated coarse categories, while the deeper levels deploy fine-grained FT-Transformer classifiers optimized exclusively to isolate the micro-features of highly ambiguous subsets.

Data-driven error localization. Flat architectures diffuse classification errors across the entire label space. Conversely, the proposed cascade localizes empirical confusion. By executing hierarchical agglomerative clustering guided by a hybrid metric (α ∈ [0.4, 0.6]), the cascade structures itself around the baseline model’s actual statistical failures, isolating highly correlated classes into closed diagnostic nodes and minimizing global error propagation.
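The grouping step can be illustrated with a small, self-contained Python sketch of average-linkage agglomerative clustering over a precomputed distance matrix. Note the assumptions: the sketch stops merging at a plain distance threshold `tau` rather than using the inconsistency coefficient employed in the paper, and the class list, the distance values, and the function names are hypothetical.

```python
from itertools import combinations


def average_linkage(dist, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerate(dist, tau):
    """Merge the closest pair of clusters until no pair is closer than tau.

    Returns the final clusters and the merge history as (cluster_a, cluster_b, distance).
    """
    clusters = [(i,) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        (a, b), d = min(
            (((a, b), average_linkage(dist, a, b)) for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if d > tau:
            break  # remaining clusters are treated as separate branches
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        history.append((a, b, d))
    return clusters, history


# Hypothetical hybrid distances between five classes (illustrative values only).
classes = ["Person", "Car", "Van", "Bus", "Truck"]
dist = [
    [0.00, 0.90, 0.90, 0.90, 0.90],
    [0.90, 0.00, 0.30, 0.60, 0.60],
    [0.90, 0.30, 0.00, 0.40, 0.50],
    [0.90, 0.60, 0.40, 0.00, 0.35],
    [0.90, 0.60, 0.50, 0.35, 0.00],
]

coarse, merges = agglomerate(dist, tau=0.7)  # superclass level
fine, _ = agglomerate(dist, tau=0.4)         # finer grouping inside the vehicles
for cluster in coarse:
    print([classes[i] for i in cluster])
```

With these toy distances the higher threshold separates “Person” from a single vehicle group, while the lower threshold keeps two vehicle subgroups apart, which mirrors the coarse-to-fine structure of the cascade.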

Despite these advantages, the proposed approach has several limitations. The primary one is the increase in computational complexity caused by the multi-stage inference process. Although cascade processing allows complex models to be applied only to subsets of relevant objects, the overall latency may still be a critical factor for UAV onboard systems with limited computational resources. In addition, cascade classification has issues with early-stage decision vulnerability. The cascade operates on a sequential refinement principle, meaning that irreversible errors introduced at Stage 0 (ROI detection recall) or Stage 1 (coarse classification) cannot be corrected by the specialized fine-grained models in deeper layers. If a “Van” is misrouted into a non-vehicle superclass at an early stage, the downstream FT-Transformer will never have the opportunity to evaluate its visual descriptors.
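The effect of irreversible early-stage errors can be expressed as a simple bound: a target is identified correctly only if every stage on its path handles it correctly. The short Python sketch below computes this bound under the simplifying assumption that stage outcomes are independent. The per-stage rates and the names are hypothetical and are not taken from the reported results.

```python
from math import prod


def end_to_end_rate(stage_rates):
    """Probability that a target is handled correctly at every stage on its path.

    Assumes independent stage outcomes, which is a simplification.
    """
    return prod(stage_rates)


# Hypothetical per-stage rates for one path through the cascade.
path = {
    "stage0_roi_recall": 0.98,
    "stage1_superclass": 0.99,
    "stage3_fine_grained": 0.97,
}
overall = end_to_end_rate(path.values())
print(f"end-to-end rate: {overall:.4f}")
```

Because the product can never exceed its smallest factor, no improvement in the fine-grained classifiers can compensate for targets lost at Stage 0 or misrouted at Stage 1.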

Regarding practical operational scenarios and real-world suitability, monolithic single-stage detectors offer the best utility in computationally unconstrained environments with high-contrast, structurally distinct target sets, whereas the practical deployment of UAV monitoring platforms often introduces harsh environmental and geometric constraints. Based on the empirical findings and the properties of the automatically synthesized hierarchy, we identify several real-world operational scenarios in which the proposed multi-stage approach has a structural advantage over modern flat architectures.

Firstly, in high-altitude monitoring missions, the operational distance between the airborne sensor and the ground causes a drastic reduction in target resolution. From a strict 90-degree nadir perspective, volumetric objects that are otherwise highly distinct lose their vertical contextual features and collapse into flat geometric profiles. Under these conditions, a monolithic detector experiences severe inter-class confusion, frequently collapsing “Vans”, “Cars”, and “Buses” into a single generic vehicle type because of overlapping rectangular boundaries. The proposed cascade addresses this by deploying the Stage 1 superclass filter to isolate the broader vehicle domain and then activating specialized, high-capacity FT-Transformer classification nodes at Stage 3. These downstream transformers concentrate computation exclusively on localized micro-features, such as specific roof aspect ratios, window distributions, and cooling vent placement, thereby maintaining fine-grained accuracy where flat models plateau.

Secondly, real-world autonomous monitoring operations are frequently challenged by atmospheric interference (e.g., fog, haze, or dust storms), motion blur from high-speed UAV positioning, and partial occlusions caused by urban infrastructure or dense canopies. Under these conditions, the internal visual features of targets degrade severely. Monolithic architectures, relying entirely on a standard spatial embedding representation, fail due to “feature overlap.” The proposed framework is structurally resilient in this scenario because its hierarchy synthesis is guided by a hybrid metric, weighted by α, that incorporates the empirical confusion matrix of a baseline model. This means the cascade structure adapts to how neural networks actually fail under noise, grouping highly vulnerable class combinations into specialized sub-nodes that are systematically decoupled from clear, easily recognizable classes (e.g., separating the highly confused “Truck” category at Stage 2).

In addition, traditional hierarchical frameworks require extensive expert engineering effort to rebuild the classification tree whenever a UAV is redeployed to a new geographic area with an entirely different set of operational classes (e.g., transitioning from urban traffic monitoring to maritime harbor surveillance). The proposed data-driven framework allows for rapid, automated cross-domain scaling. By passing the feature space of the new target dataset through the optimized HAC pipeline, the framework independently calculates the optimal splitting threshold τ and automatically synthesizes a tailored cascade structure (as demonstrated in the adaptation from VisDrone2019 to UAV123). This removes the reliance on manual expert design and substantially shortens deployment time in critical, time-sensitive operational environments.

Future research directions include several promising avenues. First, the application of optimization techniques such as knowledge distillation, model quantization, and hardware-aware optimization could significantly reduce inference latency without substantial accuracy loss. Second, integrating uncertainty-aware routing mechanisms, such as confidence-based branching or feedback across cascade levels, may further improve robustness. Finally, extending the approach to multisensor data (e.g., thermal cameras or LiDAR) and incorporating explainable AI techniques could enhance system reliability and suitability for safety-critical applications.
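As one possible reading of the confidence-based branching mentioned above, the following Python sketch commits to a fine-grained branch only when the coarse stage is sufficiently confident and otherwise falls back to a flat classifier. This is a hypothetical illustration of a future direction, not a component of the evaluated system: the `route` function, the 0.8 threshold, and the stub classifiers are all assumptions.

```python
from typing import Callable, Dict, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])
Classifier = Callable[[object], Prediction]


def route(
    features: object,
    coarse: Classifier,
    fine: Dict[str, Classifier],
    fallback: Classifier,
    min_confidence: float = 0.8,
) -> Prediction:
    """Route a sample through the cascade unless the coarse decision is uncertain."""
    superclass, confidence = coarse(features)
    if confidence < min_confidence:
        # Uncertain coarse decision: avoid committing to a possibly wrong branch.
        return fallback(features)
    if superclass not in fine:
        # Leaf superclass with no further refinement.
        return superclass, confidence
    return fine[superclass](features)


# Stub classifiers standing in for trained models (hypothetical outputs).
fine_models = {"Vehicle": lambda x: ("Van", 0.90)}
flat_model = lambda x: ("Bus", 0.60)

confident = route(None, lambda x: ("Vehicle", 0.95), fine_models, flat_model)
uncertain = route(None, lambda x: ("Vehicle", 0.55), fine_models, flat_model)
leaf = route(None, lambda x: ("Person", 0.99), fine_models, flat_model)
```

The design choice here is to trade some of the cascade's specialization for protection against misrouting: low-confidence samples are never sent down a branch from which they cannot return.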

Overall, the results of this study demonstrate that automated synthesis of hierarchical classification structures represents a promising direction for the development of computer vision systems for UAV monitoring, combining high accuracy, adaptability, and reduced dependence on manual expert design. Also, while representative single-stage architectures offer maximum computational efficiency for distinct object categories, the proposed multi-stage framework is uniquely suited for specialized, high-stakes UAV monitoring missions where resolving inter-class ambiguity and preventing catastrophic visual misclassifications outweigh the marginal loss in raw frame rate.

6. Conclusions

This study presents a data-driven methodology for the automated synthesis of hierarchical classification architectures, addressing the persistent challenge of identifying visually similar objects in UAV imagery. To eliminate the reliance on subjective expert design, our framework employs a hybrid inter-class proximity metric that balances geometric class separability with empirical classifier confusion. Experimental validation confirms the robustness of this approach, with the synthesized cascade achieving a peak F1-score of 94.9%. This result exceeds the monolithic fine-tuned YOLOv11s baseline by 0.8 percentage points and matches the manually designed expert hierarchy, while maintaining high adaptability across diverse operational domains. Despite these gains in classification accuracy, the approach is constrained by increased computational overhead from multi-stage inference and an inherent vulnerability to irreversible early-stage routing errors.

Future research should focus on optimizing inference speed through knowledge distillation and model quantization tailored for edge computing. Moreover, integrating uncertainty-aware branching and multisensor perception, such as thermal and LiDAR inputs, could further strengthen real-time decision-making, paving the way for highly autonomous, all-weather intelligent UAV monitoring systems.

Author Contributions

Conceptualization, D.B. and O.B.; methodology, D.B. and O.B.; software, D.B. and P.R.; validation, D.B., O.B. and P.R.; formal analysis, O.B., P.R. and I.K.; investigation, D.B.; resources, O.B. and I.K.; data curation, D.B. and P.R.; writing—original draft preparation, D.B. and O.B.; writing—review and editing, P.R. and I.K.; visualization, D.B. and P.R.; supervision, I.K.; project administration, O.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study used publicly available datasets and did not involve humans or animals.

Informed Consent Statement

Not applicable. This study did not involve humans.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/DmytroBorovykKhnu/Target-detection (accessed on 22 April 2026).

Acknowledgments

During the preparation of this manuscript, the authors used generative AI and AI-assisted technologies, namely Gemini 3.1 Pro Preview (Google LLC) and Grammarly (Grammarly, Inc.), for the purposes of copyediting, including grammar refinement, stylistic consistency, and enhancing the overall linguistic quality and readability of the text. The authors have comprehensively reviewed and edited the output, ensuring that all concepts and conclusions accurately reflect the original research, and take full responsibility for the content and integrity of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Examples of correct classification. Purple frames belong to “Vehicle” class, blue frames belong to “Person” class, red frames belong to “Bus” class, violet frames belong to “Truck” class, and orange frames belong to “Van” class.

References

1. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, version 8.4.50; GitHub: San Francisco, CA, USA, 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 22 April 2026).
2. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A survey of object detection for UAVs based on deep learning. Remote Sens. 2024, 16, 149.
3. Liu, C.; Meng, F.; Zhu, Z.; Zhou, L. Object detection of UAV aerial image based on YOLOv8. Front. Comput. Intell. Syst. 2023, 5, 46–50.
4. Lu, T.; Wan, L.; Qi, S.; Gao, M. Land cover classification of UAV remote sensing based on Transformer–CNN hybrid architecture. Sensors 2023, 23, 5288.
5. Borovyk, D.; Barmak, O.; Radiuk, P.; Krak, I. Hierarchical deep learning model for identifying similar targets in UAV imagery. Drones 2025, 9, 743.
6. Pittino, F.; Dimitrievska, V.; Heer, R. Hierarchical concept bottleneck models for vision and their application to explainable fine classification and tracking. Eng. Appl. Artif. Intell. 2023, 118, 105674.
7. Gromada, K.; Siemiatkowska, B.; Stecz, W.; Plochocki, K.; Wozniak, K. Real-time object detection and classification by UAV equipped with SAR. Sensors 2022, 22, 2068.
8. Jiang, B.; Qu, R.; Li, Y.; Li, C. Object detection in UAV imagery based on deep learning: Review. Acta Aeronaut. Astronaut. Sin. 2021, 42, 524519.
9. Li, C.; Zhao, R.; Wang, Z.; Xu, H.; Zhu, X. RemDet: Rethinking efficient model design for UAV object detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 4643–4651.
10. Wang, X.; Peng, Y.; Shen, C. Efficient feature fusion for UAV object detection. In Proceedings of the 2025 International Joint Conference on Neural Networks, Rome, Italy, 30 June–5 July 2025; IEEE Inc.: New York, NY, USA, 2025; pp. 1–8.
11. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124.
12. Leng, J.; Ye, Y.; Mo, M.; Gao, C.; Gan, J.; Xiao, B.; Gao, X. Recent advances for aerial object detection: A survey. ACM Comput. Surv. 2024, 56, 1–36.
13. Bouchard, G.; Triggs, B. Hierarchical part-based visual object categorization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; IEEE Inc.: New York, NY, USA, 2005; pp. 710–715.
14. Fang, P.; Harandi, M.; Le, T.; Phung, D. Hyperbolic geometry in computer vision: A survey. arXiv 2023.
15. Kwon, H.; Jang, J.; Kim, J.; Kim, K.; Sohn, K. Improving visual recognition with hyperbolical visual hierarchy mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA; CVF: New York, NY, USA, 2024; pp. 17364–17374.
16. Abebe, G.; Cavallaro, A. Hierarchical modeling for first-person vision activity recognition. Neurocomputing 2017, 267, 362–377.
17. Ke, T.-W.; Mo, S.; Yu, S.X. Learning hierarchical image segmentation for recognition and by recognition. In International Conference on Learning Representations; OpenReview: Amherst, MA, USA, 2024; pp. 39881–39919. Available online: https://proceedings.iclr.cc/paper_files/paper/2024/file/ae201118dfe8c6529a663c82b6bcdf8c-Paper-Conference.pdf (accessed on 22 April 2026).
18. Kosmopoulos, A.; Partalas, I.; Gaussier, E.; Paliouras, G.; Androutsopoulos, I. Evaluation measures for hierarchical classification: A unified view and novel approaches. Data Min. Knowl. Discov. 2015, 29, 820–865.
19. Roy, D.; Panda, P.; Roy, K. Tree-CNN: A hierarchical deep convolutional neural network for incremental learning. Neural Netw. 2020, 121, 148–160.
20. Svystun, S.; Melnychenko, O.; Radiuk, P.; Savenko, O.; Sachenko, A.; Lysyi, A. Thermal and RGB images work better together in wind turbine damage detection. Int. J. Comput. 2024, 23, 526–535.
21. Chen, T.; Wu, W.; Gao, Y.; Dong, L.; Luo, X.; Lin, L. Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 2023–2031.
22. Xu, Z.; Xiang, X. Learning visual-semantic hierarchical attribute space for interpretable open-set recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 26 February–6 March 2025; IEEE Inc.: New York, NY, USA, 2025; pp. 5697–5706.
23. Märzinger, T.; Kotík, J.; Pfeifer, C. Application of hierarchical agglomerative clustering (HAC) for systemic classification of pop-up housing (PUH) environments. Appl. Sci. 2021, 11, 11122.
24. McCarthy, C.; Quirijnen, L.; van Zandwijk, J.P.; Geradts, Z.; Worring, M. Hi-OSCAR: Hierarchical open-set classifier for human activity recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2025, 9, 199.
25. Dong, F.; Wang, M. HybriDet: A hybrid neural network combining CNN and Transformer for wildfire detection in remote sensing imagery. Remote Sens. 2025, 17, 3497.
26. Ostrovskyi, Z.; Barmak, O.; Radiuk, P.; Krak, I. Unsupervised knowledge extraction of distinctive landmarks from earth imagery using deep feature outliers for robust UAV geo-localization. Mach. Learn. Knowl. Extr. 2025, 7, 81.
27. Yang, J.; Wan, H.; Shang, Z. Enhanced hybrid CNN and transformer network for remote sensing image change detection. Sci. Rep. 2025, 15, 10161.
28. Li, W.; Xue, L.; Wang, X.; Li, G. MCTNet: A multi-scale CNN–Transformer network for change detection in optical remote sensing images. In Proceedings of the 2023 26th International Conference on Information Fusion, Charleston, SC, USA, 27–30 June 2023; IEEE Inc.: New York, NY, USA, 2023; pp. 1–5.
29. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN–Mamba UNet for remote sensing image semantic segmentation. arXiv 2024.
30. Semenov, S.; Krupska-Klimczak, M.; Czapla, R.; Krzaczek, B.; Gavrylenko, S.; Poltorazkiy, V.; Zozulia, V. Intrusion detection method based on preprocessing of highly correlated and imbalanced data. Appl. Sci. 2025, 15, 4243.
31. Kaushik, S.; Bhardwaj, A.; Almogren, A.; Bharany, S.; Altameem, A.; Rehman, A.U.; Hussen, S.; Hamam, H. Robust machine learning based intrusion detection system using simple statistical techniques in feature selection. Sci. Rep. 2025, 15, 3970.
32. Zhang, H.; Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.-N. UAV-DETR: Efficient end-to-end object detection for unmanned aerial vehicle imagery. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems, Hangzhou, China, 19–25 October 2025; IEEE Inc.: New York, NY, USA, 2025; pp. 15143–15149.
33. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Computer Vision—ECCV 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 375–391.

Figure 1. General hierarchical architecture for object detection and classification in UAV imagery.

Figure 2. Comprehensive workflow of the proposed hierarchical approach. The workflow is divided into two phases: Offline Cascade Synthesis, where the inter-class proximity matrix D is used to automatically generate the optimal hierarchy through agglomerative clustering; and Online Hierarchical Inference, which utilizes a staged cascade of Faster R-CNN (for detection), YOLO-based feature extractors, and FT-Transformers (for classification) to progressively identify targets from coarse to fine-grained levels.

Figure 3. Dendrogram of class partitioning obtained during the automatic synthesis of the cascade structure. Line colors indicate distinct clusters formed by the cutting threshold: blue for the isolated “People” superclass and orange for the grouped vehicle sub-classes.

Figure 4. Experimental analysis of the parameter α.

Figure 5. Experimental analysis of the parameter τ.

Figure 6. Dendrogram of class partitioning obtained during the automatic synthesis of the cascade structure for the UAV123 dataset. The uniform line color denotes the complete hierarchical tree, which fundamentally branches into two primary superclasses (“Vehicle” and “Human-like”).

Figure 7. Confusion matrices for the three levels of the proposed cascade: (a) Stage 1 (superclasses), (b) Stage 2 (vehicle refinement), and (c) Stage 3 (final identification).

Figure 8. Confusion matrix of the multi-class classification model.

Table 2. Comparison of the performance of the complete classification pipeline with existing approaches. Numbers in bold represent higher values.

Method	Precision, %	Recall, %	F1-Score, %	mAP@.50, %	mAP@.50:.95, %	FPS
Wang et al. [10]	94.5	94.0	93.2	-	-	-
Zhang et al. [32]	91.4	91.0	91.4	91.8	78.4	-
Fine-tuned YOLOv11s	92.4	93.8	94.1	93.1	81.2	50
Hi-OSCAR [24]	86.1	85.9	86.7	-	-	-
Manual hierarchy [5]	94.6	95.2	94.9	94.1	83.0	41
Proposed approach (automatic hierarchy)	94.6 ± 0.8	95.2 ± 1.3	94.9 ± 1.0	94.1	83.0	41

Table 3. Results of hierarchical classification on the UAV123 dataset at different cascade levels.

Level	Precision, %	Recall, %	F1-Score, %	mAP@.50, %	mAP@.50:.95, %
1	93.4	93.8	94.0	93.1	79.4
2	93.5	93.9	94.1	93.3	80.1
3	94.1	94.7	94.9	94.3	80.5

Table 4. Results of proposed approach on different datasets.

Dataset	Precision, %	Recall, %	F1-Score, %
VisDrone2019	94.6	95.2	94.9
UAV123	94.1	94.5	94.3
UAVDT	93.5	94.2	93.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

