Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation

Dou, Bei; Wu, Tao; Guo, Zhiwei

doi:10.3390/ai6090201

Open AccessArticle

Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation

by

Bei Dou

,

Tao Wu

^* and

Zhiwei Guo

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410028, China

^*

Author to whom correspondence should be addressed.

AI 2025, 6(9), 201; https://doi.org/10.3390/ai6090201

Submission received: 10 July 2025 / Revised: 14 August 2025 / Accepted: 20 August 2025 / Published: 25 August 2025

(This article belongs to the Special Issue Artificial Intelligence-Based Object Detection and Tracking: Theory and Applications)

Download

Browse Figures

Versions Notes

Abstract

Open-vocabulary object detection (OVOD) aims to localize and recognize objects in images by leveraging category-specific textual inputs, including both known and novel categories. While existing methods excel in general scenarios, their performance significantly deteriorates in domain-specific fine-grained detection because of their heavy reliance on high-quality textual descriptions. In specialized domains, such textual descriptions are often affected by newly introduced terms or subjective human biases, limiting their applicability. In this paper, we propose an attribute decomposition–aggregation approach for the OVOD to address these challenges. By decomposing categories into fine-grained attributes and learning them in a multi-label manner, our method mitigates text quality issues caused by novel terms and human bias. During inference, unseen fine-grained category texts can be effectively represented by combining the decomposed attributes for detection. Even if the model learns the attributes, a key limitation of current methods is the insufficient utilization of textual attributes. To mitigate this issue, we propose an attribute-aggregation module that enhances the discriminative capability by emphasizing critical attributes for distinguishing target objects from foreground elements. To demonstrate the effectiveness of our OVOD framework, we evaluate our method on both our newly constructed military dataset and the public LAD dataset. Experimental results demonstrate that our method outperforms existing methods in domain-specific fine-grained open-vocabulary detection tasks.

Keywords:

open-vocabulary object detection; fine-grained; attribute decomposition and aggregation; multi-label learning; attribute text encoding

1. Introduction

Traditional object-detection methods typically adopt a closed-set detection paradigm, where both training and inference are confined to a predefined set of categories. However, in practical applications such as reconnaissance and security surveillance, detection requirements are often open-ended and dynamic; systems must recognize novel categories that are unseen during training. Conventional approaches require recollection of large-scale annotated data and retraining of models from scratch when encountering new categories. This process incurs prohibitive annotation costs and severely limits the adaptability of the system to evolving needs.

Driven by the advancement of the Transformer architecture [1,2,3,4,5], recent advances in large-scale vision-language pretrained models have enabled Open-Vocabulary Object Detection (OVOD) [6,7]. This paradigm establishes a unified visual-language feature space that leverages the semantic generalization capability of text modalities to detect novel categories without the need for retraining. By computing the semantic similarity between image region features and category text embeddings, zero-shot detection of novel classes is achieved.

The fundamental challenge in OVOD centers on achieving accurate alignment between visual region features and textual semantic representations, harnessing the powerful generalization capacity of textual modalities to enable zero-shot recognition of novel object categories. Current research methodologies primarily fall into three distinct categories: preliminary vision-language feature alignment-based approaches, image-level pre-trained model transfer-utilizing methods, and end-to-end visual grounding training–employing techniques.

Preliminary vision-language feature alignment approaches build upon conventional closed-set detectors by replacing their standard classification layers with cross-modal similarity-computation mechanisms. The seminal OVR-CNN [8] framework, as a pioneering work in this field, implements this paradigm within the Faster R-CNN [9] architecture by substituting the traditional classification layer with a visual-language similarity module. This system performs category prediction by computing the cosine similarity between the region proposal features and the corresponding word embeddings. However, these early stage methods exhibit several inherent limitations, including constrained cross-modal representation learning due to training on limited region–text paired datasets, insufficient semantic contextual understanding from static word embeddings, and inadequate discriminative power for fine-grained category differentiation in the classification of images.

Approaches based on image-level pretrained model transfer primarily leverage the generalization capabilities of large-scale vision-language classification models, such as CLIP [10]. These methods transfer image-level semantic knowledge to detectors through either knowledge distillation or joint training [11]. However, they face two inherent challenges: CLIP’s pretraining relies solely on image–text pairs, lacking explicit supervision for object spatial locations, which degrades the localization accuracy of region proposals; furthermore, the domain gap between classification data and detection data introduces substantial noise.

ViLD [12] pioneered a two-stage distillation framework that aligns the visual feature spaces by employing CLIP’s [10] image encoder as the teacher model to guide the student detector based on Faster R-CNN. To address the missing spatial information, RegionCLIP [13] introduced a region–text contrastive learning strategy that enhances regional awareness through the generated pseudo-labels for region descriptions. In a different approach, the OV-DETR [14] modified the detection architecture by integrating the DETR with CLIP for novel category detection. Notably, DetCLIP [11] attempted to jointly optimize the detector using both detection and classification data, although the training samples created by cropping classification images introduced pseudo-label noise from background interference, ultimately compromising the model performance. Beyond the aforementioned approaches, several studies [15,16,17,18,19,20,21] have also achieved open-set object detection by building upon the CLIP framework. However, these methods remain fundamentally constrained by CLIP’s lack of fine-grained spatial supervision, resulting in suboptimal bounding box localization accuracy.

Visual grounding-based approaches [11,22,23,24,25] adopt an end-to-end multimodal joint training paradigm to establish direct mappings from textual descriptions to target object regions. These methods have significantly advanced zero-shot detection performance in general domains [8,26,27], using innovative cross-modal interaction mechanisms. For example, GLIP [28] reformulates the detection task as a phrase grounding problem by introducing region–text contrastive learning, whereas GroundingDINO [29] achieves a more refined cross-modal alignment using a three-stage early fusion architecture that incorporates language-guided query generation and cross-modal feature interaction modules.

Although these methods have demonstrated promising zero-shot detection performance in general domains, they exhibit significant limitations in practical application scenarios. Specifically, when applied to domain-specific tasks, a detection mechanism that relies solely on the similarity between class-level textual and visual features faces critical issues. Their training data are predominantly derived from general domains, and this paradigm assumes that the textual inputs in the application are accurate and well suited for the model. However, for fine-grained or difficult-to-name categories in specialized domains, ensuring that the provided textual inputs meet the required quality is challenging, leading to insufficient reliability when deployed in downstream detection tasks. In fine-grained category detection, textual similarity can induce overconfidence in the detector. As illustrated in Figure 1, when the target is a reconnaissance vehicle, the model misclassifies other visually similar vehicles as the target vehicle owing to inadequate discriminative capability. We observed that the categories requiring detection often exhibit parallel or hierarchical relationships, thereby sharing or mutually excluding part-level attributes. During inference, fusing textual representations of these part-level attributes into a global textual representation can effectively address detection failures caused by low-quality textual inputs.

Therefore, for fine-grained OVOD in specific scenarios, we propose an attribute decomposition and aggregation-based method. The core idea of our method is to assign multiple attribute labels to each fine-grained category, where categories belonging to the same broader class share common attribute labels that characterize the superclass. Subsequently, multi-label learning was employed to enable the model to learn the correspondence between the visual features and multiple attribute labels. We further adopted an attribute text feature-aggregation mechanism to generate more discriminative textual representations, allowing the model to synthesize attribute-level features during inference for flexible and rational detection outputs. In contrast to conventional open-vocabulary detection methods that neglect part-level attributes and hierarchical relationships among categories, our approach explicitly considers parallel or hierarchical relations among classes. By introducing the fusion of part-level attributes to replace the direct input of potentially low-quality class-level textual descriptions, our method better leverages the fusion capability of the attention mechanism in the textual branch and harnesses prior knowledge to improve the recognition of novel categories. The contributions of this paper are summarized as follows:

1.: We propose an attribute enhanced OVOD method. This method decomposes category texts into semantic attributes and employs a multi-positive contrastive learning paradigm to achieve fine-grained alignment between visual features and textual attribute embeddings.
2.: We propose a multi-category text attribute-aggregation method, which employs parallel text encoding for multi-category multi-attribute representations, enhancing the semantic characteristics of attributes while preserving the textual features of other categories.
3.: We present the military dataset, a comprehensive dataset containing fine-grained attributes and diverse types of military vehicles, designed to facilitate fine-grained attribute learning for object detectors.

The remainder of this paper is organized as follows: Section 2 presents our methodological framework, detailing the architecture of the proposed detection system and the operational mechanisms of its constituent modules; Section 3 describes our experimental design, including benchmark datasets, evaluation protocols, implementation details, and comprehensive results analysis. Finally, the conclusions and prospects of the current work are discussed in Section 4.

2. Methodology

In this section, we present an open-vocabulary detection method incorporating attribute decomposition and aggregation. The learned attributes mitigate semantic misinterpretation caused by low-quality or non-standardized textual inputs for novel categories. This overview outlines the core elements, implementation workflow, and notable characteristics of the proposed method.

2.1. Overall Framework

The overall framework is illustrated in Figure 2. We employ Grounding DINO [29] as the baseline algorithm in subsequent experiments, which adopts a dual-encoder single-decoder architecture. We employ Swin-Transformer [30] as the image backbone, leveraging its strong generalization capability from pre-training on Objects365 [31] and goldG [28] datasets, while using BERT [32] for the text feature extraction. The extracted image and text features are fed into a feature-fusion module for multi-modal feature integration. In the decoder component, a language-guided query-selection module initializes the decoder queries, followed by refinement through a cross-modal decoder to predict the position and category for each query.

To address the core issue of performance degradation in OVOD within domain-specific, fine-grained scenarios, primarily caused by the unstable quality of class-level textual descriptions, we propose a novel detection framework based on attribute decomposition and aggregation. The framework comprises three key components: Attribute Text-Decomposition Mechanism: Prior to text encoding during training, prior knowledge of the semantic structure of category names was leveraged to decompose each class name into multiple shareable fine-grained attributes. This transforms the conventional single-label classification paradigm into a multi-label attribute recognition task, thereby reducing the model’s dependence on the precise accuracy of full category names and enhancing its robustness against novel terminology or subjective descriptive variations. Multi-Label Attribute Learning: A multi-positive contrastive learning strategy is employed to align region-level visual features with multiple fine-grained textual attributes. This enhances the discriminative capability of the model among visually and semantically similar categories. Attribute Text Feature-Aggregation Mechanism: During inference, we introduce a multi-category, multi-attribute text-aggregation method. By integrating both category names and decomposed attribute features within the text branch, the model is encouraged to focus more on part-level attribute characteristics, effectively mitigating the adverse impact of low-quality textual inputs.

The proposed framework not only preserves the zero-shot generalization capability inherent in open-vocabulary detection but also significantly improves detection reliability and adaptability for hard-to-name or ambiguously described novel objects in specialized domains, such as military reconnaissance, through a fine-grained semantic disentanglement and re-composition mechanism.

2.2. Attribute Text-Decomposition Mechanism

To achieve alignment between region-level image features and multiple attribute-based textual features, it is essential to decompose each category into multiple attribute representations based on prior knowledge. In fine-grained object-detection tasks, particularly for recognizing diverse vehicle categories (including novel classes) in reconnaissance scenarios, the semantic similarity between fine-grained category names typically indicates shared attributes inherited from their common parent class. For instance, tanks and infantry fighting vehicles, both belonging to the military combat vehicle category, exhibit the shared “artillery-equipped” attribute.

This paper proposes a component-based attribute-decomposition framework for military vehicles, grounded in two foundational principles: (1) alignment with domain-specific understanding of essential vehicle characteristics, and (2) compatibility with the practical requirements of dataset construction for fine-grained visual recognition. These dual considerations ensure that the resulting attribute system is both semantically meaningful within the military context and operationally viable for annotation and downstream analysis. According to drive mechanism, vehicles are categorized into wheeled and tracked types; based on function, they are classified into artillery equipped and missile launching types; and in terms of appearance type, they are divided into armored, monocoque, and cargo style types. An example illustrated in Figure 2 shows the decomposition of a fine-grained category into its corresponding parent class text and attribute text. During text encoding, deduplicated combinations of parent class names and attributes from all training samples are formed, separated by “.”, and then fed into the text encoder for processing.

To satisfy the first principle, namely alignment with domain expertise, the framework adopts a component-based attribute-decomposition scheme. Specifically, the drive mechanism reflects the chassis configuration such as wheeled or tracked; the functional roles captures category-discriminative auxiliary systems such as artillery equipped or missile launching; and the appearance type characterizes the structural integration between the cab and hull, encompassing designs such as monocoque or monocoque. This tripartite schema enables a preliminary yet effective fine-grained distinction among diverse military vehicle types, based on physically interpretable and functionally relevant components. Task-oriented viability constitutes the second principle, prioritizing attributes that are both visually annotatable and discriminative at the fine-grained level. Camouflage patterns, although prevalent in military vehicles, exhibit limited discriminative power across categories and are therefore excluded from the system. Likewise, attributes such as crew size are not visually observable in armored platforms like tanks, due to design imperatives related to protection and concealment. As a result, only those attributes that simultaneously satisfy semantic relevance and annotation feasibility are retained in the final attribute ontology.

According to the above attribute-decomposition rules, the datasets and their corresponding formats used in the subsequent experiments are as follows: a custom military dataset and the LAD dataset [33], both featuring attribute-based annotation schemes to enable fine-grained open-vocabulary detection. The military dataset focuses on military vehicles while the LAD dataset covers civilian transportation, with both employing structured attribute decomposition to facilitate novel category recognition through attribute fusion.

Custom Military Dataset. To address the limitations of existing military datasets which suffer from limited scene diversity and lack of detailed attributes, we curated a new dataset containing 15,438 high-quality annotated images. The military dataset comprises primary categories including personnel, vehicles, and buildings, with further granularity into nine secondary subcategories (e.g., combat/non-combat vehicles). The annotation strategy emphasizes attribute decomposition, particularly for vehicles as they represent clearly decomposable entities. Each vehicle is characterized through three attribute dimensions: drive mechanism (wheeled or tracked), functional role (artillery equipped or missile launching), and appearance type (armored, monocoque, or cargo style). This structured annotation enables our model to learn compositional relationships between attributes and objects. After offline data augmentation, the coarse-grained categories and the number of instances per attribute in the final training dataset are illustrated in Figure 3, while the overall structure of the self-constructed dataset is presented in Figure 4.

LAD [33] Dataset. For cross-domain validation, we employ the LAD dataset which contains 5 supercategories and 230 fine-grained classes. From its transportation supercategory (encompassing cars, ships and aircraft), we selected 8599 images across 24 vehicle classes. The proposed annotation scheme systematically categorizes vehicles along three primary dimensions: drive mechanism, body type, and roof characteristics. For drive mechanism, vehicles are classified into three-wheel, multi-wheel (≥4 wheels), and tracked configurations. Body types encompass six categories: compact, full-size, flatbed, enclosed, tanker, and crane. Roof characteristics are further divided into convertible and hardtop designs. This hierarchical structure enables comprehensive yet efficient vehicle characterization for detection tasks.

2.3. Attribute Text Feature-Aggregation Mechanism

During inference, given an input image and a human-provided textual description composed of multiple attributes (e.g., “wheeled artillery equipped vehicle, tracked armored artillery equipped tank”), conventional text encoders in OVOD exhibit limited capacity in modeling composite attribute structures. While selfattention mechanisms implicitly encode inter-attribute relationships, they often fail to disambiguate semantically similar categories due to diffuse feature representations. To address this limitation, we propose aggregating multiattribute descriptions (e.g., “wheeled_artillery equipped_vehicle”) into a unified, discriminative textual embedding within the text-encoding module. This aggregation explicitly enhances the representation of constituent attributes, thereby improving fine-grained category discrimination.

To this end, we propose an attribute text feature-aggregation mechanism. The core idea is to explicitly model attribute-level features of fine-grained categories through structured text inputs and a constrained attention masking strategy, thereby enhancing discriminative capability across visually similar classes.

We introduce multi-category joint attribute encoding, a structured text-encoding framework for fine-grained category representation. To enable explicit attribute-level modeling, we first construct structured textual inputs with explicit syntactic delimiters: attributes within a single category are concatenated using underscores “_”, while distinct categories are separated by periods “.”. This formatting allows precise identification of attribute boundaries and category scope during tokenization. For example, the input is formatted as: “wheeled_artillery equipped_vehicle. tracked_armored_artillery equipped_tank”.

For the trained model, given an image and attribute description, our objective is to integrate multiple attributes and the combination of class names across various categories into a global feature representation. This global feature should emphasize the semantics of the attributes to facilitate differentiation among other fine-grained categories. For multi-category multi-attribute-detection scenarios, this paper proposes an efficient parallel encoding scheme. Specifically, a hierarchical encoding framework is adopted to process multiple category texts in batches. The global features of all category texts are computed simultaneously, or the attribute features at the same positional index within each global text are processed concurrently.

For each category text, our approach encompasses two stages: global encoding and attribute-level encoding. Initially, a global text containing complete attribute and category information is encoded, facilitating thorough interaction among textual components through a self-attention mechanism. The formula for global text encoding is:

F_{global} = TextEncoder (T, M_{global}),

(1)

where, T denotes the text to be encoded, and

M_{global}

is the mask applicable for global encoding.

F_{global}

is the global text feature.

Building on this, to enhance the semantic representation of attribute words, we devise a context-dependent attribute-encoding method. This method employs a masking mechanism to preserve interactions between the current attribute word and the global text while masking out attention from other irrelevant tokens. The formula forattribute text encoding is:

F_{{attr}_{i}} = TextEncoder (T, M_{{attr}_{i}}),

(2)

where T denotes the text to be encoded,

M_{{attr}_{i}}

is the mask applicable for attribute encoding and

F_{{attr}_{i}}

denotes the feature corresponding to the i-th attribute text.

To enable attribute encoding, it is essential to identify the positions of the tokens corresponding to the attribute text within the global text. We adopt the following method to identify the token positions corresponding to the attribute texts: For a text string composed of multiple attributes and categories, such as “wheeled_artillery equipped_vehicle . tracked_armored_artillery equipped_tank”, we use the symbol “.” to separate different categories and append an underscore “_” after each attribute text to mark the position of attribute words. The obtained mask can be applied to the self-attention computation part in the text-encoding process:

F = softmax (\frac{Q K^{⊤}}{\sqrt{d_{K}}} + M) V,

(3)

where M include the global mask and attribute masks, and F consist of both the global text feature and attribute text features. The global mask governs the computation of global text features, while each attribute mask controls the derivation of its corresponding attribute text features.

The mask is constructed as:

M = [(1 - \bar{M}) ⊙ - \infty],

(4)

where

\bar{M}

can refer to either

{\bar{M}}_{global}

, which is associated with the global text, or

{\bar{M}}_{attr}

, which is associated with the attribute text.

{\bar{M}}_{global} = Ψ Ψ^{⊤} + diag (1 - Ψ),

(5)

where,

Ψ

denotes the mask in which the positions corresponding to the tokens of each global text are set to false.

{\bar{M}}_{attr} = Θ Ψ^{⊤} + diag (1 - Ψ),

(6)

where

Θ

denotes the mask in which only the positions corresponding to the tokens of a single attribute within each global text are set to false.

Furthermore, to address the synchronization challenge in encoding varying numbers of attributes across different categories, we implement a zero-padding strategy during attribute text encoding. Specifically, for categories with missing attributes, their corresponding feature representations are set to zero vectors post-encoding. This approach effectively prevents text features from being contaminated by irrelevant category encoding.

Existing research [34] has found that embeddings of compositional concepts in the CLIP [10] model can typically be approximately decomposed into the sum of vectors corresponding to each individual factor. The addition of global text features and attribute text features is due to the approximate linear nature of text encoding. The features encoded by BERT [32] are visualized in 3D space after principal component analysis (PCA) dimensionality reduction, as shown in Figure 5. PCA is a widely used linear transformation method for feature dimensionality reduction.

The encoded representations of attribute texts differing by a single attribute exhibit approximately linear relationships in the feature space. Based on this observation, this paper employs a linear combination approach to effectively integrate attribute features, thereby enhancing detection performance. As shown in Figure 6, different masking strategies are applied in the attribute-encoding module. Experimental results demonstrate that column-wise interaction approach yields the best performance.

Finally, the global features and individual attribute features are summed to form the fused text representation.

F_{new} = F_{global} + \sum_{i} F_{{attr}_{i}}

(7)

This comprehensive encoding strategy ensures robust differentiation between fine-grained categories while maintaining computational efficiency.

Given that the text inputs during the testing phases are fixed, all text features need to be precomputed only once, ensuring that attribute-level aggregation does not increase the computational burden during inference.

2.4. Multi-Positive Contrastive Learning

Existing methods typically employ image–text contrastive learning to align visual and textual features. When training with detection-format datasets, each image region is associated with a single textual label. The process of calculating the similarity between image region features and individual text features is as follows: first, following the previously defined formulation in Equation (1), textual descriptions or attributes are encoded into textual feature representations F. Subsequently, the input image is processed through an image encoder to generate visual feature representations:

I = ImageEncoder (I m g)

(8)

The similarity measure between each image region feature I and corresponding text feature F is subsequently computed through dot product operations:

S = I F^{⊤},

(9)

where I and F denote the encoded features of the image and text prompt, respectively, while S represents their similarity matrix.

However, in multi-label classification scenarios, a single object may correspond to multiple text labels. For instance, a “truck” image might simultaneously relate to both “wheeled” and “cargo-style” labels. For text obtained via an offline attribute correspondence table, which includes parent class names and their associated subclass attributes, multi-label training is employed. During training, both the parent category and related attributes of the same object are treated as positive samples, establishing a many-to-many relationship between fine-grained category texts and attribute texts. The loss function is calculated accordingly to accommodate this complexity. We formally define the set of positive samples as:

P_{i} = {j ∣ T_{i j} = 1},

(10)

where i and j indicate that the j-th text is a valid label for the i-th image region. The similarity score between the i-th image and the j-th text is normalized via a sigmoid function:

{\hat{S}}_{i j} = \frac{1}{1 + e^{- S_{i j}}}

(11)

The contrastive loss is then computed to maximize the similarity between the image and its associated text labels while minimizing similarity with irrelevant texts:

L_{cls} = - \frac{1}{N} \sum_{i = 1}^{N} \frac{1}{| P_{i} |} \sum_{j \in P_{i}} log \frac{exp ({\hat{S}}_{i j})}{\sum_{k = 1}^{M} exp ({\hat{S}}_{i k})},

(12)

where i is the index of the current image region (

i = 1, \dots, N

), and j denotes a positive text label associated with the i-th image region (

j \in P_{i}

), and k is the index of any text label in the vocabulary.

This approach ensures that the model effectively learns from multiple relevant labels for each object, enhancing its capability to accurately distinguish and classify objects within fine-grained categories under data-limited conditions.

3. Experiments

We implement the proposed method based on the MMDetection framework [35], which is built upon PyTorch 2.1.0. The software environment includes CUDA 12.1 and mmcv 2.1.0 to ensure compatibility with vision models. The baseline algorithm for experiments is Grounding DINO [29]. To guarantee fairness in downstream fine-grained object-detection experiments, all subsequent experiments employ the same image encoder and pretraining data. We adopt a pretrained model based on Swin-Transformer Tiny [30], which is initially trained on the Objects365 [31] and GoldG [28] datasets. Input images maintain their aspect ratios and are uniformly scaled to 800 × 1333 pixels with zero-padding, consistent with the settings of DETR-based detectors [36]. To enhance the diversity of training data, online data augmentation during training includes random horizontal flipping and cropping.

Subsequently, we conduct continual pretraining using self-collected or publicly available datasets, with all data organized in attribute-annotated detection label format. This strategy enables the model to learn fine-grained vision-language alignment for downstream tasks while preserving generalized representation capabilities. All experiments are performed on three NVIDIA A100 GPUs. The datasets are partitioned into training, validation, and test sets in an 8:1:1 ratio to ensure balanced distribution. During training, the total batch size is set to 48, evenly distributed across the three GPUs, with a base learning rate of 0.0004 using the AdamW optimizer and a weight decay of 0.0001. Additionally, to prevent overfitting and maintain training stability, a 0.1× learning rate decay is applied to both the image backbone and the language model to ensure stable optimization. Following the standard practice of modern DETR-based detectors [36], all reported results are obtained after 12 training epochs.

3.1. Datasets

The specific details of the self-made military dataset are presented in Table 1, including the image dimensions, annotation box sizes, aspect ratios, and other related information.

During the dataset construction phase, to enhance the model’s robustness in complex battlefield environments and mitigate data imbalance issues, this paper employs an offline data augmentation strategy on the training set. Specifically, based on the original annotations, a hybrid augmentation pipeline is constructed using the Albumentations framework. This pipeline sequentially introduces meteorological and illumination perturbations, including Gaussian noise, ISO sensor noise, random brightness-contrast distortion, random fog, rain, snow, shadows, and solar glare, with grayscale conversion applied as an additional branch. All geometrically invariant transformations are constrained by bounding boxes, with a minimum visibility threshold of 0.1, ensuring that objects remain detectable after augmentation.

3.2. Evaluation Protocols

To systematically validate the effectiveness of attribute decomposition and fusion mechanisms in fine-grained OVOD, we design a dual-level evaluation framework.

AG-AP (Attribute Aggregation AP). Since the model only learns independent alignment relationships between image features and atomic attributes during training, while actual fine-grained categories require multi-attribute compositional representations, this metric quantifies the model’s ability to perform fine-grained reasoning through attribute combination. Specifically, for categories whose images appeared in the training set, we input dynamically generated attribute aggregation of all test categories during testing. The final metric is obtained by averaging the AP values across all categories.

CG-AP (Compositional Generalization AP). To address military reconnaissance requirements, we propose CG-AP, a metric evaluating fine-grained detection under category existence uncertainty. The core computational logic of CG-AP is consistent with that of the traditional mean average precision. Both metrics assess detection performance by computing the area under the precision-recall curve across varying confidence thresholds. Unlike referring expression comprehension’s Top-1 accuracy—which assumes target presence and measures only description-conditioned localization—CG-AP quantifies generalization over attribute compositions and false positive suppression. Referring expression comprehension is a multimodal vision-language task that aims to precisely localize the unique object instance in an image referred to by a natural language expression, outputting its bounding box. The evaluation set includes distractor categories sharing partial attributes with target classes but differing in full composition. To assess fine-grained detection, we employ structured textual inputs generated by systematically replacing individual attributes within each target’s full attribute set, with combinations separated by a period (“.”). For example, given “wheeled_artillery equipped” (a truck-mounted howitzer), we generate: “wheeled_artillery equipped . wheeled_missile launching . tracked_artillery equipped”, replacing propulsion (wheeled vs. tracked) and function (artillery vs. missile) with mutually exclusive alternatives. All combinations are concatenated using the period delimiter. Model performance is measured via per-class average precision.

3.3. Experiments on Custom Military Dataset

To validate the performance of our method, we first conduct experiments on the aforementioned military dataset, evaluating both the AG-AP and CG-AP metrics. The experimental results are presented below.

Fine-Grained Detection with Attribute Augmentation and Aggregation. We conduct comprehensive benchmarking by selecting state-of-the-art open-source object-detection algorithms with demonstrated superior performance. These baseline methods undergo additional pre-training on our military target dataset before being evaluated under identical experimental conditions alongside our proposed approach. The comparative results are systematically presented in Table 2.

All competing methods were evaluated under identical experimental conditions: trained on domain-specific data and simultaneously assessed on multiple fine-grained novel categories within that domain. Standard OVOD approaches, lacking explicit attribute learning, fail to capture discriminative information for fine-grained distinctions, resulting in substantial false positives and misclassifications, thereby yielding lower AG-AP scores. In contrast, our method explicitly learns inter-class attribute information, enabling superior detection accuracy for fine-grained categories.

Zero-Shot Novel Category Detection via Attribute Composition and Aggregation. To evaluate the effect of attribute aggregation on fine-grained novel category detection, the following types of categories were selected for testing: categories with attribute combinations not encountered in the training set, such as recoilless rifle jeeps, which combine artillery equipped and monocoque; categories missing one attribute from the combinations present in the training set, such as tracked armored ambulances lacking artillery and categories with attribute combinations that appeared during training but exhibited significant appearance variations, such as anti-tank gun.

In accordance with the evaluation metric calculation rules outlined earlier, and considering the challenges of acquiring military target data, 200 images of each target novel category were mixed with 50 images from other novel categories to simulate real-world scenarios where multiple classes coexist and some images may not contain the target category. In addition to the images of the category under test, the evaluation set also included categories from the last experiment and other novel categories. The CG-AP for individual categories was computed, with the experimental results presented in Table 3.

Each test category corresponds to a distinct military vehicle. For evaluation, the text input for each category is generated by systematically combining and sequentially substituting its attributes. This approach enhances the assessment of the model’s fine-grained recognition performance. Compared to directly concatenating attribute-containing text, our method explicitly guides the model to focus on attribute-related textual features, achieving superior fine-grained discrimination.

As summarized in Table 4, the model employed in the evaluation comprises 169.43 MB of parameters and, when executed on a single NVIDIA A100 GPU, sustains a real-time inference throughput of 54.9 frames per second (FPS). The inference speed of the proposed method was evaluated on an NVIDIA Jetson AGX Orin (32 GB) platform. Experimental results indicate that the model achieves an inference rate of approximately 11 frames per second (FPS). While this performance falls below the conventional real-time threshold of 24–30 FPS, it is sufficient for specific mission-critical applications such as low-speed reconnaissance and surveillance, in which strict frame-by-frame processing is not required.

3.4. Experiments on LAD Dataset

To further validate the performance of the proposed method across diverse datasets, we evaluate the detection performance on novel categories after conducting attribute decomposition and training. The test set consists of the following categories: categories with trained attribute combinations but distinct appearances (e.g., crane truck), and completely novel attribute combinations (e.g., tracked transporter combining tracked drive and enclosed attributes). The experimental results presented in Table 5.

Each test category represents a distinct civilian vehicle from the LAD dataset. Evaluation results demonstrate that our method, which explicitly guides the model to focus on attribute-related textual features, achieves superior fine-grained discrimination compared to direct concatenation of attribute-containing text.

3.5. Ablation Studies

Component Ablation. To validate the effectiveness of the proposed model design, we conduct systematic ablation studies in this section. Through progressively removing or modifying key components of the model, we evaluate their impact on overall performance. The detailed experimental setup and corresponding results are presented in Table 6. When the model excludes the three key components—Attribute Text Decomposition, Multi-positive Contrastive Learning, and Attribute Text Feature Aggregation—the AG-AP drops to 0.199. This result demonstrates that open-vocabulary detection struggles to perform well in specialized domains without proper attribute decomposition and fusion mechanisms. Upon incorporating Attribute Text Decomposition and Multi-positive Contrastive Learning, the AG-AP significantly improves to 0.402. This notable enhancement indicates that decomposing attribute texts into meaningful semantic components and leveraging multiple positive samples for contrastive learning effectively mitigates the impact of unseen words and human bias in text inputs. Further introducing the Attribute Text Feature Aggregation module boosts the AP to 0.435, as the aggregated features better capture both local details and global contextual information, thereby improving prediction accuracy.

Attribute-aggregation Strategies. For attribute text encoding, we conduct experiments with three masking strategies, as illustrated in Figure 6, to obtain context-aware attribute representations: row-only interaction, both row and column interaction, and column-only interaction. The comparative results evaluated across all categories on our custom dataset are presented in Table 7.

3.6. Visualization Analysis

As shown in Figure 7, we visualize and analyze the detection results under three different configurations to demonstrate the effectiveness of our approach. The first row presents the baseline results using a standard OVOD framework with only class names as input. The second row shows the improved detection after incorporating attribute decomposition and multi-positive contrastive learning, where attribute features are directly concatenated for prediction. The third row displays the final results of our complete method, which further enhances performance through attribute feature aggregation. The experimental visualization clearly demonstrates that our attribute decomposition and aggregation strategy effectively reduces false positives and false negatives in fine-grained object detection, with only detection boxes exceeding the confidence threshold of 0.5 being displayed.

Figure 7 presents four distinct target categories across its columns: a “three-wheeled express vehicle” (first column), a “tracked transporter” (second column), and two visually similar scenes featuring a “truck crane” (third and fourth columns). The first column illustrates a three-wheeled motorcycle, which should be annotated with the attributes “three-wheel” and “flatbed”. However, this instance is prone to misclassification as a three-wheeled expressed vehicle when relying solely on category name or a naive concatenation of attributes. Qualitative comparisons indicate that the proposed method achieves accurate detection and precise localization in both scenarios (a) and (b). More importantly, in challenging case (c) where “truck” and “truck crane” exhibit high visual similarity, our attribute-aware decomposition and aggregation approach significantly suppresses false alarms while maintaining detection accuracy, demonstrating superior discriminative power and robustness. These results collectively confirm that our proposed attribute decomposition and aggregation mechanism outperforms baseline methods in handling fine-grained objects with high visual similarity, validating its effectiveness for open-vocabulary object-detection tasks.

4. Conclusions

This paper presents a novel attribute decomposition and aggregation framework for open-vocabulary object detection, specifically targeting fine-grained novel categories. Our approach decomposes category texts into multiple attribute descriptions during training and aligns visual features with attribute-level textual representations through multi-label contrastive learning. During inference, we design a parallel attribute-fusion module that supports simultaneous processing of multiple category texts, effectively enhancing attribute semantic representation while maintaining discriminative encoding of diverse category information.

Extensive experiments conducted on our newly constructed military target dataset and the adapted LAD benchmark demonstrate that our method outperforms existing open-vocabulary detection approaches, particularly showing superior capability in distinguishing visually similar fine-grained categories within specialized domains rather than general scenarios. These findings collectively indicate that our attribute-aware text-decomposition and -aggregation paradigm provides a promising direction for domain-specific fine-grained object detection.

Despite these promising outcomes, our work is not devoid of limitations. Currently, the introduced attributes remain limited. While effective for detecting most novel categories, the method may still encounter new attributes in practical applications. Future research should aim to transcend these constraints by developing an approach that can efficiently expand attribute categories through few-shot learning, thereby further enhancing adaptability in open environments. The number of samples per attribute label in the experimental dataset ranges from 5 to 30. To mitigate the risk of catastrophic forgetting in future work, we intend to incorporate a memory replay mechanism and parameter isolation strategies.

Author Contributions

Conceptualization, B.D. and T.W.; methodology, B.D.; software, B.D.; validation, B.D.; formal analysis, B.D. and Z.G.; investigation, B.D.; resources, T.W.; data curation, B.D.; writing—original draft preparation, B.D.; writing—review and editing, Z.G.; visualization, B.D.; supervision, T.W.; project administration, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were utilized in this study, and the corresponding references are provided in the manuscript. The military dataset presented in this study are available on request from the corresponding author due to the sensitive nature of the military dataset. The military dataset used in this study was collected from publicly available online sources. All data have been reprocessed and de-identified to ensure compliance with ethical standards and protect sensitive information. This process included the removal or obfuscation of any personally identifiable information and other sensitive details that could compromise security. The dataset has been reviewed and approved to ensure adherence to all relevant ethical guidelines and legal requirements.

Acknowledgments

Thanks to the Unmanned Systems Research Group at the College of Intelligence Science, National University of Defense Technology, China.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Azzuni, H.; Khan, M.; Elsaddik, A.; Gueaieb, W. Smartphone Video-based Monocular 3D Reconstruction. IEEE Consum. Electron. Mag. 2024, 14, 11–20. [Google Scholar] [CrossRef]
Dawoud, K.; Zaheer, Z.; Khan, M.; Nandakumar, K.; Elsaddik, A.; Khan, M.H. FusedVision: A Knowledge-Infusing Approach for Practical Anomaly Detection in Real-world Surveillance Videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 4036–4046. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Jamil, S.; Abbas, M.S.; Roy, A.M. Distinguishing malicious drones using vision transformer. AI 2022, 3, 260–273. [Google Scholar] [CrossRef]
Wu, L.; Li, J.; Li, S.; Ding, Y.; Zhou, M.; Shi, Y. Integrating Pose Features and Cross-Relationship Learning for Human–Object Interaction Detection. AI 2025, 6, 55. [Google Scholar] [CrossRef]
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2988–2997. [Google Scholar]
Zareian, A.; Rosa, K.D.; Hu, D.H.; Chang, S.F. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14393–14402. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PmLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Yao, L.; Han, J.; Wen, Y.; Liang, X.; Xu, D.; Zhang, W.; Li, Z.; Xu, C.; Xu, H. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Adv. Neural Inf. Process. Syst. 2022, 35, 9125–9138. [Google Scholar]
Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv 2021, arXiv:2104.13921. [Google Scholar]
Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803. [Google Scholar]
Zang, Y.; Li, W.; Zhou, K.; Huang, C.; Loy, C.C. Open-vocabulary detr with conditional matching. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 106–122. [Google Scholar]
Ma, Z.; Luo, G.; Gao, J.; Li, L.; Chen, Y.; Wang, S.; Zhang, C.; Hu, W. Open-vocabulary one-stage detection with hierarchical visual-language knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14074–14083. [Google Scholar]
Pham, C.; Vu, T.; Nguyen, K. Lp-ovod: Open-vocabulary object detection by linear probing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 779–788. [Google Scholar]
Gao, M.; Xing, C.; Niebles, J.C.; Li, J.; Xu, R.; Liu, W.; Xiong, C. Open vocabulary object detection with pseudo bounding-box labels. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 266–282. [Google Scholar]
Feng, C.; Zhong, Y.; Jie, Z.; Chu, X.; Ren, H.; Wei, X.; Xie, W.; Ma, L. Promptdet: Towards open-vocabulary detection using uncurated images. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 701–717. [Google Scholar]
Wu, X.; Zhu, F.; Zhao, R.; Li, H. Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7031–7040. [Google Scholar]
Shi, C.; Yang, S. Edadet: Open-vocabulary object detection using early dense alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 15724–15734. [Google Scholar]
Wu, S.; Zhang, W.; Jin, S.; Liu, W.; Loy, C.C. Aligning bag of regions for open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15254–15264. [Google Scholar]
Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790. [Google Scholar]
Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. Glipv2: Unifying localization and vision-language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36067–36080. [Google Scholar]
Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar]
Ma, Y.; Liu, M.; Zhu, C.; Yin, X.C. HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection. IEEE Trans. Multimed. 2025, 27, 3171–3183. [Google Scholar] [CrossRef]
Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting twenty-thousand classes using image-level supervision. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 350–368. [Google Scholar]
Kaul, P.; Xie, W.; Zisserman, A. Multi-modal classifiers for open-vocabulary object detection. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 15946–15969. [Google Scholar]
Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975. [Google Scholar]
Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 38–55. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 10012–10022. [Google Scholar]
Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 8430–8439. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Zhao, B.; Fu, Y.; Liang, R.; Wu, J.; Wang, Y.; Wang, Y. A large-scale attribute dataset for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Trager, M.; Perera, P.; Zancato, L.; Achille, A.; Bhatia, P.; Soatto, S. Linear spaces of meanings: Compositional structures in vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vancouver, BC, Canada, 17–24 June 2023; pp. 15395–15404. [Google Scholar]
Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]

Figure 1. (a) Training paradigm: Both standard OVOD methods and our proposed method utilize identical training datasets comprising general-domain and specific-domain images. (b) Limitation of standard OVOD: Despite strong open-vocabulary generalization, existing methods suffer from high false positive rates in fine-grained scenarios due to semantic ambiguity in category-level text embeddings. (c) Our solution: Our method achieves superior fine-grained detection performance while maintaining robust generalization, accomplished through the decomposition and aggregation of attribute features. Detection results are denoted by yellow bounding boxes, among which correctly detected instances are outlined with green dashed ellipses, and incorrectly detected instances are marked with red dashed ellipses.

Figure 2. Overview of our proposed attribute-guided fine-grained OVOD. During the training process, the category names are decomposed into fine-grained attributes and parent category names. Multi-positive sample contrastive learning is utilized to align multiple text attributes and image features. In the inference stage, a parallel multi-category and multi-attribute text-aggregation method is employed to achieve detection of new categories.

Figure 3. Instance count distribution for each attribute text in the custom military dataset.

Figure 4. The custom military dataset structure.

Figure 5. The features obtained from BERT encoding are visualized in a three-dimensional space (obtained through PCA dimensionality reduction), with category texts of different attributes connected by black lines. The visualization demonstrates that these three lines remain approximately parallel in the spatial representation. For different category names, the same attribute induces consistent directional shifts of global text features by a certain distance.

Figure 6. Diagram of diverse attribute-encoding approaches: (a) Column-wise interaction: attributes interact only within the same row. (b) Row-wise interaction: attributes interact only within the same column. (c) Full cross-dimensional interaction: attributes dynamically interact across both rows and columns.

Figure 7. Visualization of three representative detection cases: (a) for the category “three-wheeled express vehicle”, we employ the attributes “three wheel” and “enclosed”; (b) For “trucked transporter”, the attributes “tracked” and “enclosed” are utilized; (c) the last two columns demonstrate results for the “crane truck” category using attributes “crane” and “multi wheel”. These visualizations showcase how different attribute combinations contribute to the detection performance across varying object categories.

Table 1. Dataset statistics.

Qualified Labelled Images	15,438
Average Number of Annotation Boxes per Image	1.77
Minimum Number of Annotation Boxes per Image	1
Maximum Number of Annotation Boxes per Image	26
Minimum Image Height	84
Maximum Image Height	1417
Minimum Image Width	140
Maximum Image Width	1443
Minimum Image Area	12,600
Maximum Image Area	1,137,600
Minimum Aspect Ratio (Height/Width)	0.28
Maximum Aspect Ratio (Height/Width)	2.5

Table 2. Results involving multiple object categories participating in the evaluation simultaneously.

Method	Pre-Training Data	Textual Input	AG-AP
GLIP [28]	O365, GoldG	Category Names	0.152
Grounding DINO [29]	O365, GoldG	Category Names	0.199
Ours	O365, GoldG	Attribute Aggregation	0.435

Table 3. Comparative results of attribute concatenation and aggregation on the custom military dataset.

Category	Metric	Attribute Concatenation	Ours	Improve
Truck-Mounted Howitzer	CG-AP	0.472	0.715	+0.243
	CG-AP50	0.531	0.775	+0.244
	CG-AP75	0.498	0.738	+0.240
Medical Evacuation Vehicle	CG-AP	0.661	0.816	+0.155
	CG-AP50	0.710	0.909	+0.199
	CG-AP75	0.680	0.830	+0.150
Recoilless Rifle Jeep	CG-AP	0.459	0.712	+0.253
	CG-AP50	0.490	0.741	+0.251
	CG-AP75	0.479	0.733	+0.254
Amphibious Assault Vehicle	CG-AP	0.642	0.741	+0.099
	CG-AP50	0.790	0.928	+0.138
	CG-AP75	0.638	0.742	+0.104
Anti-Tank Gun	CG-AP	0.262	0.388	+0.126
	CG-AP50	0.311	0.449	+0.138
	CG-AP75	0.261	0.396	+0.135

Table 4. Model parameters and inference speed.

Parameter	Value
Model Parameters size	169.43 MB
FPS	54.9

Table 5. Comparative results of attribute concatenation and aggregation on the LAD dataset.

Category	Metric	Attribute Concatenation	Ours	Improve
Tracked Transporter	CG-AP	0.434	0.693	+0.259
	CG-AP50	0.464	0.740	+0.276
	CG-AP75	0.448	0.717	+0.269
Truck Crane	CG-AP	0.800	0.816	+0.016
	CG-AP50	0.849	0.863	+0.014
	CG-AP75	0.838	0.851	+0.013
Turntable Ladder Fire Engine	CG-AP	0.495	0.567	+0.072
	CG-AP50	0.544	0.634	+0.090
	CG-AP75	0.516	0.590	+0.074
Three-Wheeled Cargo Vehicle	CG-AP	0.497	0.628	+0.131
	CG-AP50	0.505	0.635	+0.130
	CG-AP75	0.489	0.622	+0.133

Table 6. Ablation study on the custom military dataset.

Attribute Text Decomposition	Multi-Positive Contrastive Learning	Attribute Text Feature Aggregation	AG-AP
×	×	×	0.199
✓	✓	×	0.402
✓	✓	✓	0.435

Table 7. Results of different attribute-encoding methods.

Method	AG-AP	AG-AP50	AG-AP75
Row-wise	0.298	0.363	0.306
Both row-wise and column-wise	0.397	0.485	0.410
Column-wise	0.435	0.541	0.456

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Dou, B.; Wu, T.; Guo, Z. Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation. AI 2025, 6, 201. https://doi.org/10.3390/ai6090201

AMA Style

Dou B, Wu T, Guo Z. Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation. AI. 2025; 6(9):201. https://doi.org/10.3390/ai6090201

Chicago/Turabian Style

Dou, Bei, Tao Wu, and Zhiwei Guo. 2025. "Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation" AI 6, no. 9: 201. https://doi.org/10.3390/ai6090201

APA Style

Dou, B., Wu, T., & Guo, Z. (2025). Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation. AI, 6(9), 201. https://doi.org/10.3390/ai6090201

Article Menu

Fine-Grained Open-Vocabulary Object Detection via Attribute Decomposition and Aggregation

Abstract

1. Introduction

2. Methodology

2.1. Overall Framework

2.2. Attribute Text-Decomposition Mechanism

2.3. Attribute Text Feature-Aggregation Mechanism

2.4. Multi-Positive Contrastive Learning

3. Experiments

3.1. Datasets

3.2. Evaluation Protocols

3.3. Experiments on Custom Military Dataset

3.4. Experiments on LAD Dataset

3.5. Ablation Studies

3.6. Visualization Analysis

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI