1. Introduction
Monocular 3D object detection estimates the 3D bounding boxes of objects from a single RGB image, an approach that offers compelling cost and deployment advantages over LiDAR-based or multi-sensor fusion systems. This capability is particularly valuable in large-scale applications such as autonomous driving, robotics, and augmented reality, where hardware simplicity and scalability are paramount. However, despite its practical appeal, monocular 3D detection remains significantly more challenging than its 2D counterpart. This difficulty stems primarily from the inherently ill-posed nature of depth recovery from a single viewpoint and the scarcity of large-scale, semantically diverse 3D annotations.
Recent years have witnessed substantial progress driven by benchmark datasets such as KITTI [1], SUN RGB-D [2], ScanNet V2 [3], and nuScenes [4]. Yet, as summarized in Table 1, these datasets are largely constrained in semantic scope, typically encompassing only 9 to 23 object categories. Most of these categories are restricted to traffic-related agents, such as cars and pedestrians, or common indoor items. In contrast, modern 2D detection benchmarks like COCO and Objects365 cover hundreds of categories, enabling robust generalization across diverse visual concepts. This semantic gap severely limits the applicability of current 3D detectors in open-world environments, where systems must recognize and localize objects beyond a fixed, predefined set.
To bridge this gap, open-vocabulary 3D object detection has emerged as a promising research direction. The goal is to enable models to detect and localize objects described by arbitrary natural language prompts, including those unseen during training. Existing approaches, however, predominantly follow a two-stage paradigm. They first leverage a pre-trained 2D open-vocabulary detector to extract semantic proposals, which are then fed into a class-agnostic 3D detector for geometric regression, as illustrated in Figure 1a. While effective, this design introduces several practical limitations: (i) it relies heavily on external 2D detectors and their associated supervision; (ii) it involves multi-stage training pipelines that are difficult to optimize jointly; and (iii) many methods still depend on point cloud priors or depth estimators, undermining the simplicity of the monocular setting.
Recent attempts, such as OVMono3D [5], have sought to mitigate these issues by incorporating foundation models like SAM for segmentation priors or monocular depth estimators. Nevertheless, they remain fundamentally dependent on pre-trained 2D detectors and auxiliary data sources, preventing true end-to-end training from 3D supervision alone.
Despite these efforts, a critical research gap remains: achieving true end-to-end open-vocabulary 3D detection solely from monocular images, without the architectural complexity and inference latency introduced by auxiliary 2D detectors, multi-stage pipelines, or external depth priors.
In this work, we present CLIP-Mono3D, a novel, end-to-end trainable framework for open-vocabulary monocular 3D object detection. Built upon the MonoDGP architecture [6], our method integrates semantic knowledge directly into the 3D detection pipeline by leveraging a pre-trained FG-CLIP visual–language encoder [7]. Unlike prior approaches, CLIP-Mono3D eliminates the need for an external 2D detector by fusing CLIP-derived visual–semantic features with geometric representations via cross-modal attention. By initializing detection queries using language embeddings, our model achieves zero-shot generalization to novel categories without additional 2D supervision, representing a significant step toward practical and generalizable monocular 3D perception.
To facilitate research in this under-explored direction, we further introduce OV-KITTI, a new benchmark that extends the original KITTI dataset with 40 additional object categories, including animals and everyday items. As shown in Table 1, OV-KITTI not only expands semantic coverage but also provides more diverse shape and scale priors, which help alleviate the depth ambiguity inherent in monocular setups. The dataset is carefully curated to ensure balanced distributions between base and novel categories, enabling fair evaluation of open-vocabulary generalization.
The significance of our work lies in three key contributions:
We propose CLIP-Mono3D, an end-to-end framework that unifies semantic and geometric reasoning. By introducing a cross-modal semantic–geometric fusion module, we inject fine-grained semantic clues into geometric features via a lightweight residual connection, enhancing semantic awareness without disrupting pre-trained geometric cues.
We design a novel query initialization strategy that converts 2D semantic probability maps into explicit 3D query positions. This mechanism significantly improves 3D center localization and recall for open-vocabulary objects compared to standard learned queries.
We introduce OV-KITTI, a large-scale benchmark with controlled semantic and size distributions. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in both closed- and open-vocabulary settings, paving the way for deployment in truly open-world scenarios.
The remainder of this paper is organized as follows. Section 2 reviews related work in monocular and open-vocabulary 3D object detection. Section 3 details the methodology of the proposed CLIP-Mono3D framework. Section 4 introduces the newly curated OV-KITTI benchmark. Section 5 presents the experimental results and discussions. Finally, Section 6 concludes our work.
3. Method
This section details the proposed CLIP-Mono3D framework. We begin by defining the mathematical background of the task and presenting the overall architecture. We then introduce the cross-modal feature fusion module and the language-aware query initialization strategy, describe the open-vocabulary detection head, and formulate the corresponding training objectives.
3.1. Preliminaries and Overall Architecture
Fundamentally, monocular 3D object detection is constrained by the geometry of projective transformations. Following Hartley and Zisserman [45], the mapping of a 3D point $\mathbf{P} = (X, Y, Z)^{\top}$ to a 2D pixel coordinate $\mathbf{p} = (u, v)^{\top}$ is defined by the projection equation $z\,\tilde{\mathbf{p}} = \mathbf{K}\,[\mathbf{R} \mid \mathbf{t}]\,\tilde{\mathbf{P}}$, where $\mathbf{K}$ is the intrinsic camera matrix, $[\mathbf{R} \mid \mathbf{t}]$ denotes the extrinsic rotation and translation, and $z$ represents the depth. Recovering $\mathbf{P}$ from $\mathbf{p}$ is inherently ill-posed since the depth $z$ is lost during projection. Furthermore, practical imaging systems often suffer from optical aberrations, such as radial/tangential distortion and chromatic aberration. These imperfections perturb the ideal linear projection model, introducing non-linear spatial variations that further complicate precise 3D center and geometry recovery.
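As a concrete illustration of this ill-posedness, the following NumPy sketch projects two 3D points lying on the same viewing ray onto identical pixel coordinates; the intrinsics are only KITTI-like placeholder values, not calibration constants from the paper:

```python
import numpy as np

def project(P_world, K, R, t):
    """Project a 3D point to pixel coordinates; returns (u, v) and depth z."""
    P_cam = R @ P_world + t          # world -> camera frame
    z = P_cam[2]                     # depth, lost after the division below
    uv = (K @ P_cam)[:2] / z         # perspective division
    return uv, z

K = np.array([[721.5, 0.0, 609.6],   # fx, skew, cx (KITTI-like placeholders)
              [0.0, 721.5, 172.9],   # fy, cy
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)        # identity extrinsics for simplicity

uv, z = project(np.array([2.0, 1.0, 10.0]), K, R, t)
# A point scaled along the same viewing ray projects to the same pixel,
# which is exactly why monocular depth recovery is ill-posed.
uv2, z2 = project(np.array([4.0, 2.0, 20.0]), K, R, t)
assert np.allclose(uv, uv2) and z2 == 2 * z
```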
Open-vocabulary monocular 3D object detection aims to estimate 3D bounding boxes $\{B_i\}_{i=1}^{N}$ from a single RGB image $I$, guided by a set of arbitrary text prompts $\mathcal{T} = \{t_1, \ldots, t_M\}$. Each object $B_i$ is parameterized by its 3D center $(x, y, z)$, dimensions $(h, w, l)$, orientation $\theta$, and a semantic similarity vector $\mathbf{s}_i \in \mathbb{R}^{M}$. The similarity score for each prompt is computed as:

$$ s_{i,j} = \sigma\big(\langle f_i,\, E_T(t_j) \rangle\big), $$

where $\sigma$ denotes the sigmoid function, $E_I$ and $E_T$ represent the image and text encoders (e.g., CLIP), and $f_i$ denotes the region-level features associated with the $i$-th object, extracted by $E_I$. This formulation enables the detection of novel categories by replacing fixed-set classification with open-ended semantic matching.
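This open-ended semantic matching can be sketched in a few lines of NumPy; the shapes (4 objects, 3 prompts, 512-d embeddings) and the random features are placeholders standing in for real CLIP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: 4 detected objects, 3 text prompts, 512-d CLIP space.
region_feats = l2norm(rng.normal(size=(4, 512)))   # f_i from the image encoder
text_embeds  = l2norm(rng.normal(size=(3, 512)))   # E_T(t_j) for each prompt

# Cosine similarity squashed by a sigmoid: an independent score per
# (object, prompt) pair rather than a fixed-set softmax over known classes.
scores = sigmoid(region_feats @ text_embeds.T)     # shape (4, 3)
assert scores.shape == (4, 3)
```

Because each score is independent, new categories are handled simply by appending their prompt embeddings, without retraining a classification layer.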
As illustrated in Figure 2, CLIP-Mono3D extends the MonoDGP architecture [6]. A ResNet-50 backbone first extracts multi-scale features, which are enhanced by a Region Segmentation Head (RSH) to produce refined feature maps. These features are fused with CLIP-derived semantics (Section 3.2) and processed by a depth predictor to generate depth-aware features. Dual transformer encoders independently process these streams, followed by a 2D visual decoder and a 3D depth-guided decoder. To provide explicit geometric priors, language-aware queries (Section 3.3) are initialized from CLIP similarity maps. Final predictions are refined via geometric depth correction and scored against text embeddings.
3.2. Cross-Modal Feature Fusion
To bridge the semantic gap between vision and language modalities, we introduce a lightweight cross-modal fusion mechanism that injects text-guided spatial priors into the visual backbone. Given an input image $I$ and prompts $\mathcal{T}$, we extract dense visual features and global textual embeddings using a frozen CLIP encoder. Freezing CLIP’s parameters is essential to preserve its zero-shot generalization and prevent the loss of rich semantic knowledge during 3D detection training.
Semantically relevant regions are identified by computing a spatial similarity map $S$ through the aggregation of cosine similarities across all text tokens:

$$ S(u, v) = \frac{1}{M} \sum_{j=1}^{M} \cos\big(F_v(u, v),\, e_j\big), $$

where $F_v$ denotes the dense CLIP visual features and $e_j$ the embedding of the $j$-th text token. This map functions as a soft attention mask, highlighting regions aligned with the linguistic input. Unlike post hoc filtering methods, our approach integrates this signal early in the feature hierarchy, allowing semantic guidance to inform all downstream components.
The similarity map $S$ is bilinearly upsampled to match the dimensions of the intermediate feature map $F$. It is then processed by a convolutional module $\phi$, comprising two $3 \times 3$ convolutions with ReLU activations, to refine its spatial structure and match the channel dimensions. The final fused feature map is obtained via an additive residual connection:

$$ F_{\text{fused}} = F + \phi\big(\mathrm{Up}(S)\big). $$

This residual design ensures that primary geometric and structural information remains intact, which is critical for accurate 3D localization. This early fusion strategy creates a “semantic spotlight” that benefits both the RSH and the subsequent query initialization stage (Section 3.3).
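The fusion pipeline (similarity map, upsampling, convolutional refinement, residual addition) can be sketched as follows; the grid sizes, the mean aggregation over tokens, the nearest-neighbour upsampling, and the 1×1 stand-in for the conv module are simplifying assumptions, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical sizes: CLIP patch grid 7x7 (512-d), backbone map 14x14 (256-ch).
dense_vis = l2norm(rng.normal(size=(7, 7, 512)))    # dense CLIP visual tokens
text_toks = l2norm(rng.normal(size=(5, 512)))       # token-level text embeddings
feat      = rng.normal(size=(14, 14, 256))          # intermediate feature map F

# (1) Spatial similarity map: aggregate cosine similarity across text tokens.
S = (dense_vis @ text_toks.T).mean(axis=-1)          # (7, 7), a soft mask

# (2) Upsample to F's resolution (nearest here; the paper uses bilinear).
S_up = np.repeat(np.repeat(S, 2, axis=0), 2, axis=1) # (14, 14)

# (3) Stand-in for the conv module: lift 1 channel to 256 channels + ReLU.
W = rng.normal(size=(1, 256)) * 0.01                 # hypothetical weights
delta = np.maximum(S_up[..., None] @ W, 0.0)

# (4) Additive residual fusion keeps the geometric features intact.
fused = feat + delta
assert fused.shape == feat.shape
```

The residual form means that when the semantic branch contributes nothing, the original geometric features pass through unchanged.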
3.3. Language-Aware Query Initialization
Standard DETR-like architectures typically initialize object queries as content-agnostic learnable embeddings, which can lead to slow convergence in complex scenes. To address this, we propose a language-aware query initialization strategy that transforms 2D semantic probability maps into explicit 3D geometric anchors.
As shown in Figure 3, the initialization involves three steps. First, we interpret the CLIP similarity map $S$ as a spatial probability distribution and identify the $K$ locations $\{(u_k, v_k)\}_{k=1}^{K}$ with the highest activations. These serve as candidate 2D centers for potential 3D objects.
Second, we employ F.grid_sample to sample local descriptors $\{d_k\}$ from the enhanced feature map at the coordinates $\{(u_k, v_k)\}$. This differentiable operation ensures that each query is initialized with features specific to its corresponding object region.
Third, a global semantic prior $g$ is distilled from these descriptors via a two-layer MLP:

$$ g = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MLP}(d_k). $$

The base queries consist of a content component $Q_c$ and a positional component $Q_p$. We use sine–cosine positional encodings of the coordinates $(u_k, v_k)$ to initialize $Q_p$, while $Q_c$ is enhanced by the global prior:

$$ Q_c' = Q_c + g. $$

By grounding queries in semantically verified regions, we transform the detection process from exhaustive spatial searching to targeted localization. This “semantic priming” accelerates training convergence and reduces attention to background clutter. Importantly, the CLIP-derived prior provides a strong inductive bias for open-world concepts. During inference, if no text is provided, the system reverts to the base queries ($g = \mathbf{0}$) to maintain backward compatibility.
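The three initialization steps can be illustrated with a NumPy sketch; the nearest-neighbour sampling (replacing the differentiable F.grid_sample), the toy MLP weights, and the encoding frequencies are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 14; C = 256; K = 5                      # hypothetical sizes

S    = rng.random((H, W))                       # CLIP similarity map
feat = rng.normal(size=(H, W, C))               # enhanced feature map

# (1) Top-K activations of S become candidate 2D centers.
flat = np.argsort(S.ravel())[::-1][:K]
ys, xs = np.unravel_index(flat, (H, W))

# (2) Sample local descriptors at those centers (nearest-neighbour here;
#     the paper uses the differentiable F.grid_sample).
desc = feat[ys, xs]                              # (K, C)

# (3) Global semantic prior: toy two-layer MLP + mean pooling.
W1 = rng.normal(size=(C, C)) * 0.01
W2 = rng.normal(size=(C, C)) * 0.01
g = (np.maximum(desc @ W1, 0.0) @ W2).mean(axis=0)   # (C,)

# Content queries = learned base + global prior; positions come from
# sine-cosine encodings of the normalised (x, y) centers.
base_content = rng.normal(size=(K, C)) * 0.01
q_content = base_content + g
freqs = 1.0 / (100.0 ** (np.arange(C // 4) / (C // 4)))
xy = np.stack([xs / W, ys / H], axis=-1)             # (K, 2) in [0, 1]
ang = xy[..., None] * freqs                          # (K, 2, C//4)
q_pos = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).reshape(K, C)
assert q_content.shape == q_pos.shape == (K, C)
```

Setting `g` to zero recovers the plain base queries, matching the text-free inference fallback described above.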
3.4. Open-Vocabulary Detection Head and Loss
The detection head enables open-vocabulary classification by computing semantic similarity between decoder outputs and projected text embeddings. For each decoder output $h_i$, we project it into the CLIP feature space, $z_i = \mathrm{norm}(W_v h_i)$, where $W_v$ is a learnable projection matrix and $\mathrm{norm}(\cdot)$ denotes $\ell_2$ normalization. Similarly, text embeddings $e_j$ are projected via a learnable matrix $W_t$ to obtain $\hat{e}_j = \mathrm{norm}(W_t e_j)$. This dual-projection design mitigates the domain gap between the detector’s internal representations and CLIP’s pre-trained embeddings.
The similarity score is computed as a temperature-scaled dot product, $s_{ij} = z_i^{\top} \hat{e}_j / \tau$, where the learnable temperature $\tau$ controls the distribution concentration. To handle background regions, we introduce a learnable background embedding appended to the text embeddings.
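A minimal sketch of this dual-projection scoring, with hypothetical dimensions and randomly initialized matrices standing in for the learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

N, D, Dc = 6, 256, 512                    # hypothetical: 6 queries, CLIP dim 512
h = rng.normal(size=(N, D))               # decoder outputs
text = rng.normal(size=(3, Dc))           # raw text embeddings for 3 prompts

Wv = rng.normal(size=(D, Dc)) * 0.05      # learnable visual projection
Wt = rng.normal(size=(Dc, Dc)) * 0.05     # learnable text projection
bg = rng.normal(size=(1, Dc))             # learnable background embedding

z = l2norm(h @ Wv)                                # project + normalise queries
t = l2norm(np.concatenate([text @ Wt, bg]))       # project texts, append bg
tau = 0.07                                        # placeholder temperature
logits = (z @ t.T) / tau                          # (N, num_prompts + 1)
assert logits.shape == (6, 4)
```

The appended background column lets a query score highest against "no object" instead of being forced onto the nearest prompt.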
Training follows a bipartite matching strategy using the Hungarian algorithm. The matching cost incorporates both geometric and semantic terms:

$$ \mathcal{C}_{\text{match}} = \mathcal{C}_{\text{geo}} - \lambda_{\text{sem}}\, s. $$

Integrating the similarity score $s$ into the matching process ensures that predictions are assigned based on both spatial accuracy and semantic coherence, significantly enhancing generalization to unseen classes.
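The cost structure of this matching step can be illustrated on a toy problem; the cost terms and the weight 2.0 on the semantic term are placeholders, and the brute-force search below stands in for the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment) only because the example is tiny:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
P, G = 4, 3                                # 4 predictions, 3 ground-truth boxes

# Hypothetical cost terms: a geometric cost (e.g. center/GIoU) plus a semantic
# term rewarding high similarity s between a prediction and the GT class prompt.
geo_cost = rng.random((P, G))
sem_sim  = rng.random((P, G))              # similarity scores s in [0, 1]
cost = geo_cost - 2.0 * sem_sim            # lower cost = better match

# Brute-force optimal one-to-one assignment for this toy size.
best = min(itertools.permutations(range(P), G),
           key=lambda rows: sum(cost[r, c] for c, r in enumerate(rows)))
matches = [(r, c) for c, r in enumerate(best)]
assert len(matches) == G
```

Subtracting the semantic term means two spatially similar predictions are disambiguated by which one agrees with the ground-truth prompt.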
For matched pairs $(i, k^{*})$, we apply a contrastive loss:

$$ \mathcal{L}_{\text{con}} = -\log \frac{\exp(s_{i,k^{*}})}{\sum_{k} \exp(s_{i,k})}. $$

The total objective combines this with geometric regression losses $\mathcal{L}_{\text{geo}}$:

$$ \mathcal{L} = \mathcal{L}_{\text{geo}} + \lambda_{\text{con}}\, \mathcal{L}_{\text{con}}. $$

To better understand the optimization dynamics, we explicitly formulate the error backpropagation for the contrastive loss. Let $p_k$ denote the predicted softmax probability for class $k$. The gradient of $\mathcal{L}_{\text{con}}$ with respect to the similarity score $s_{i,k}$ for a matched pair $(i, k^{*})$ is given by:

$$ \frac{\partial \mathcal{L}_{\text{con}}}{\partial s_{i,k}} = p_k - \mathbb{1}[k = k^{*}]. $$

This gradient is strictly bounded within $(-1, 1)$. Since the geometric regression losses ($\mathcal{L}_{\text{geo}}$) employ smooth variants, which also exhibit bounded derivatives, the overall parameter gradients remain stable. This bounded error propagation naturally prevents gradient explosion, demonstrating the theoretical convergence stability of our end-to-end training process.
This co-design ensures that gradients from the contrastive loss are applied to the most semantically relevant predictions, creating a robust optimization cycle that yields a model both geometrically precise and semantically aware.
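The bounded-gradient property can be verified numerically; the similarity scores below are random placeholders, and a finite-difference check confirms the analytic softmax gradient:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
s = rng.normal(size=5) * 3.0          # similarity scores for one matched query
target = 2                            # index of the matched text embedding

p = softmax(s)
loss = -np.log(p[target])             # contrastive (cross-entropy) term

# Analytic gradient: dL/ds_k = p_k - 1[k == target], so every component
# lies strictly inside (-1, 1): bounded error propagation.
grad = p.copy()
grad[target] -= 1.0
assert np.all(np.abs(grad) < 1.0)

# Finite-difference check of one component (hypothetical epsilon).
eps, k = 1e-6, 0
s2 = s.copy()
s2[k] += eps
fd = (-np.log(softmax(s2)[target]) - loss) / eps
assert abs(fd - grad[k]) < 1e-4
```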
4. OV-KITTI Benchmark
Most existing monocular 3D object detection models rely on the KITTI dataset for training and evaluation. However, this dataset presents several critical limitations. KITTI contains only nine object categories, and prior works predominantly evaluated models on the “Car” category due to the scarcity of other instances. In open-vocabulary or zero-shot learning settings, a common practice involves training on “Car” and “Cyclist” classes while attempting to transfer detection capabilities to the “Pedestrian” category. Nevertheless, as illustrated in Figure 4c, the 3D bounding-box dimensions of the “Car” category differ significantly from those of “Pedestrian” and “Cyclist”. This discrepancy leads to an inherent bias where the detection of “Pedestrian” largely relies on knowledge spillover from “Cyclist” rather than benefiting from the rich feature learning associated with the dominant “Car” category. We argue that this imbalance in both category support and scale distribution within KITTI hinders the development of generalizable 3D detection models.
To overcome these limitations, we introduce OV-KITTI, an augmented benchmark based on the KITTI dataset designed specifically for open-vocabulary monocular 3D detection. We enrich the original dataset with 40 additional object categories sourced from Objaverse [46]. These categories encompass a diverse set of animals and household items. The new objects are carefully selected to avoid semantic overlap with existing traffic participants in KITTI, such as cars and pedestrians, thereby enabling precise supervision and evaluation.
4.1. Dataset Construction
The construction of OV-KITTI followed a systematic multi-step pipeline:
(1) Bounding-Box Design: For each new category, we defined physically plausible size constraints for its 3D bounding box. During rendering, the actual box dimensions were randomly sampled within this predefined range to ensure variability.
(2) Scene Integration: Using Blender, we rendered 3D meshes into real driving scenes from KITTI. Objects were placed on the ground plane with random rotation, scaling, and translation. We explicitly enforced constraints to ensure no physical overlap with existing objects in the scene.
(3) Stereo Rendering: We generated stereo-consistent left and right views for each modified scene. Annotations are provided in the standard KITTI format to ensure seamless compatibility with existing detection frameworks.
(4) Balance Control: Special care was taken to balance the 3D box size distributions between known and unknown classes, as shown in Figure 4a,b. This step reduces potential size-related biases and supports a fair evaluation of model generalization capabilities.
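A sketch of how steps (1) and (2) might be implemented; the category size ranges, placement bounds, and the axis-aligned ground-plane overlap test are all simplifying assumptions rather than the actual Blender pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-category size ranges in meters: (l, w, h) min / max.
SIZE_RANGES = {"dog": ((0.6, 0.2, 0.4), (1.1, 0.4, 0.8)),
               "wardrobe": ((0.5, 1.0, 1.8), (0.8, 1.5, 2.2))}

def overlaps_xz(a, b):
    """Axis-aligned ground-plane overlap test between two (x, z, l, w) boxes."""
    ax, az, al, aw = a
    bx, bz, bl, bw = b
    return abs(ax - bx) < (al + bl) / 2 and abs(az - bz) < (aw + bw) / 2

def place(category, existing, tries=100):
    """Sample plausible dimensions and a collision-free ground-plane pose."""
    lo = np.array(SIZE_RANGES[category][0])
    hi = np.array(SIZE_RANGES[category][1])
    for _ in range(tries):
        l, w, h = rng.uniform(lo, hi)          # sample box dimensions in range
        x = rng.uniform(-10.0, 10.0)           # lateral position (placeholder)
        z = rng.uniform(5.0, 40.0)             # depth along the camera axis
        if not any(overlaps_xz((x, z, l, w), e) for e in existing):
            return dict(cls=category, x=x, z=z, l=l, w=w, h=h,
                        ry=rng.uniform(-np.pi, np.pi))   # random yaw
    return None                                # scene too crowded; skip object

scene = [(0.0, 12.0, 4.0, 1.8)]                # an existing car footprint
obj = place("dog", scene)
assert obj is None or SIZE_RANGES["dog"][0][0] <= obj["l"] <= SIZE_RANGES["dog"][1][0]
```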
4.2. Category Statistics
We divided the 40 new classes into known (training) and unknown (testing) categories. The known set comprised 32 classes, consisting of 19 household items and 13 animals, while the unknown set contained 8 classes, including 5 items and 3 animals, for zero-shot evaluation. This split ensured diversity and scale balance across training and testing phases. The full category list and instance counts are provided in Table 2. As shown in Figure 5, OV-KITTI contains high-quality renderings of diverse objects under varied scales and contexts. This provides a challenging yet realistic testbed for evaluating open-vocabulary 3D detection performance.
5. Experiments
In this section, we comprehensively evaluate the proposed CLIP-Mono3D framework. We first describe the experimental setup, including datasets, evaluation metrics, and implementation details. We then present quantitative comparisons with state-of-the-art methods in both closed-vocabulary and open-vocabulary settings. Furthermore, we provide detailed ablation studies to validate our core components, followed by qualitative visualizations and verifications in unconstrained real-world scenarios.
5.1. Experimental Setup
5.1.1. Datasets and Metrics
We evaluated on KITTI and OV-KITTI. For KITTI, we used standard splits (3712 train, 3769 val). For OV-KITTI, we used 5936 training samples (32 classes) and 1545 test samples (8 classes). Original categories were excluded from the open-vocabulary evaluation. We used average precision at 40 recall positions as the metric: $AP_{3D|R40}$ and $AP_{BEV|R40}$ for closed-vocabulary evaluation; $AP_{3D|R40}$ per category for open-vocabulary evaluation.
5.1.2. Settings
We followed MonoDETR-style augmentation and FG-CLIP image preprocessing. The model was implemented in PyTorch 1.13.1 and trained on a single NVIDIA RTX 4090D GPU for 200 epochs with a batch size of eight. We utilized the AdamW optimizer with an initial learning rate of and a weight decay of . A step learning rate scheduler was employed, decaying the learning rate by a factor of 0.5 at epochs 85, 125, 165.
Regarding hyperparameter configurations, we adopted the validated settings from MonoDGP [6] for geometric components to maintain stable 3D localization. Specifically, the bipartite matching costs for the 3D center, 2D bounding box, and GIoU were set to 10, 5, and 2, respectively, and the corresponding loss coefficients were assigned the same weights. For our open-vocabulary modules, the semantic matching cost ($\lambda_{\text{sem}}$) and the contrastive loss coefficient ($\lambda_{\text{con}}$) were both set to 2, with a focal-loss $\alpha$ of 0.25. This balance ensured that semantic learning proceeded without destabilizing the established geometric precision during early training stages.
5.2. Closed-Vocabulary Results
As shown in Table 3, our method achieves state-of-the-art performance in the standard monocular 3D object detection task. On the KITTI val set, our approach attains 31.40% $AP_{3D|R40}$ under the “easy” difficulty level, surpassing previous methods by a clear margin: +2.84 percentage points over MonoDETR [22] and +1.72 over MonoDGP [6]. The improvements are consistent across the moderate and hard difficulty levels, demonstrating enhanced robustness in detecting occluded and distant objects. In BEV detection, our model also sets a new benchmark with 39.47% $AP_{BEV|R40}$ (easy), outperforming both baselines by over 1.5 points.
These gains can be attributed to our improved 2D feature representation enriched with CLIP-derived semantics, which provides stronger cues for depth estimation and spatial localization. Unlike purely geometric reasoning methods, our integration of semantic context allows the detector to better disambiguate scale and position, especially in low-texture or cluttered scenes.
On the OV-KITTI dataset, while the focus shifts toward open-vocabulary generalization, we still report strong closed-vocabulary performance. Our $AP_{3D|R40}$ reaches 14.44%, exceeding MonoDGP by 1.11 percentage points, which indicates that our enhancements do not compromise detection accuracy on seen classes. This balance between specialization and generalization is crucial for real-world deployment, where systems must handle both known and emerging object categories.
As shown in Table 4, we compare the efficiency metrics with baseline methods. It is important to note that while our framework incorporates a frozen CLIP image encoder of approximately 150 M parameters to extract semantic features, these parameters do not require gradient updates. Consequently, our trainable parameter count of 43.48 M is nearly identical to that of MonoDGP at 43.33 M. This minimal increase of approximately 0.15 M confirms that the performance gains stem from our effective semantic–geometric alignment design rather than a brute-force increase in model capacity. Although the integration of the visual encoder introduces a marginal latency increase, the inference speed of 51 ms per frame remains sufficient for real-time autonomous driving applications.
5.3. Open-Vocabulary Results
The true strength of our method lies in its ability to detect objects from previously unseen categories, a key challenge in open-vocabulary 3D detection. As shown in Table 5, our method achieves an overall 5.81% AP on the eight unseen categories in OV-KITTI, outperforming the fine-tuned variant OVMono3D† by +1.21 percentage points.
A critical observation is that fine-tuning on the known training set provides substantial gains for baseline methods, particularly for geometrically complex categories. As shown in the “Gain from FT” rows, OVMono3D† improves by a remarkable +7.02 points on “tiger” and +1.61 on “giraffe” compared to its non-fine-tuned version. This confirms that exposure to 3D shape priors during fine-tuning is essential for accurate localization of categories largely absent from standard 3D datasets.
Despite this significant boost from fine-tuning, our method still achieves superior performance across most categories. We outperform OVMono3D† by +2.73 on “tiger” and +1.04 on “giraffe”, demonstrating that our direct integration of vision-language semantics enables better generalization to novel biological shapes, without relying on cascaded 2D detectors or fine-tuning heuristics. For smaller or less frequent objects like “bucket” and “dog”, our method also shows consistent improvements.
The only exception is “wardrobe”, where OVMono3D† retains a slight edge. We attribute this to its stronger pre-trained knowledge of household items and the relatively simple geometric structure of wardrobes, which are more easily captured by existing 2D open-vocabulary detectors. Overall, our results validate that grounding 3D detection directly in language semantics, rather than relying on fine-tuned 2D detectors, leads to superior generalization, especially for structurally diverse and underrepresented categories like animals.
5.4. Visualization of Cross-Modal Alignment
To elucidate how language priors guide the detection process, we visualize the CLIP-based similarity maps $S$ and their influence on query initialization in Figure 6. As shown in the middle column, $S$ effectively serves as a semantic attention mechanism, highlighting regions aligned with class-specific prompts such as “car” and “wheelchair”. Notably, for the rare class “wheelchair”, which lacks 3D annotations during training, the model identifies plausible candidates by leveraging semantic cues (e.g., human silhouettes with wheels), demonstrating robust open-vocabulary generalization.
The right column illustrates the top-$K$ keypoints selected for query initialization. These points concentrate on the object’s spatial extent, confirming that our language-aware sampling effectively localizes semantically meaningful regions. These keypoints provide informed feature priors that prime the object queries, enabling the decoder to focus on relevant content from the initial stages.
This visualization validates two key advantages: (1) the similarity map bridges language semantics and visual geometry for zero-shot grounding; (2) language-aware initialization mitigates monocular depth ambiguity by providing semantically grounded spatial anchors.
To quantitatively validate the effectiveness of our cross-modal feature fusion, we tracked the evolution of semantic alignment metrics during the training process on the validation set. Figure 7 plots the Mean Positive Cosine Similarity (MPCS), the Mean Negative Cosine Similarity (MNCS), and the Contrastive Matching Accuracy across 200 training epochs.
We observe that the network tends to predict relatively high similarity scores across all queries, which is a common characteristic in dense visual–language matching. However, as training progresses, the model successfully learns to discriminate between target and background regions. As shown in Figure 7a, the MPCS steadily increases to approximately 0.78, establishing a clear and robust discriminative margin against the MNCS. Driven by this semantic margin, the contrastive matching accuracy on the assigned positive queries (Figure 7b) rises consistently, converging to around 84.5%. This quantitative trend complements our qualitative similarity maps, confirming that our framework effectively transforms spatial queries into language-aware object representations despite the inherent difficulty of open-vocabulary reasoning.
5.5. Ablation Studies
To evaluate the contribution of each proposed component, we conducted comprehensive ablation studies on the OV-KITTI benchmark. Unless otherwise specified, performance was measured by $AP_{3D|R40}$ for seen classes (CV) and for unseen categories (OV).
Core Components and Query Initialization. As summarized in Table 6, each module consistently improves performance. While feature fusion enhances the base representation, the language-aware query initialization provides the most significant gain for open-vocabulary generalization (+1.12% OV AP). This suggests that grounding queries in semantic regions from the early decoding stages is more effective for novel objects than relying on generic learnable embeddings.
Regarding the specific initialization design in Table 7, we found that using 50 queries with a global prior aggregated via mean-pooling yielded optimal results. This global context informed each query of the overall semantic landscape, preventing them from over-focusing on isolated, potentially noisy local regions. In contrast, max-pooling over-emphasized the single most salient activation, leading to poorer generalization for diverse scenes. We also compared VLM backbones: FG-CLIP (Frozen) achieved the highest accuracy, while fine-tuning it via LoRA [50] led to a performance drop. This indicates that the 3D dataset’s scale is insufficient to update the VLM without causing catastrophic forgetting of its pre-trained open-world semantic knowledge.
Feature Fusion Strategy and Domain Robustness. Table 8 explores fusion architectures. Additive residual connections at res2 outperform simple concatenation. This additive design acts as a calibrated semantic spotlight that preserves the geometric integrity of intermediate features, whereas concatenation may introduce noise that disrupts sensitive 3D regressions. Injecting semantic guidance early in the feature hierarchy (stage 2) is crucial for informing downstream depth estimation and query sampling.
To investigate whether the model overfit to synthetic artifacts, we conducted a data-mixing study and present the results in Table 9. We progressively introduced real-world KITTI objects into the training pipeline. Crucially, to ensure a fair comparison, the Seen AP was evaluated exclusively on the original synthetic classes. The results show that adding real-world instances with distinct lighting and textures leads to negligible fluctuations in Unseen AP. This stability confirms that CLIP-Mono3D learns generalized semantic concepts rather than low-level texture priors. The drop observed in Row (3) is likely due to the extreme class imbalance introduced by real-world Cars, which biases the optimization away from rare open-vocabulary classes.
5.6. Visualization
To provide intuitive insights into the performance of our CLIP-Mono3D model, we present qualitative visualizations of detection results on both the KITTI and OV-KITTI datasets, as shown in Figure 8 and Figure 9.
Figure 8 visualizes detection results on KITTI. Our method exhibits higher recall for distant vehicles compared to MonoDETR. While sharing a similar architecture to MonoDGP, the integration of CLIP-derived semantics improves 2D detection performance, which subsequently leads to more accurate distance regression. These results confirm that semantically enriched 2D features provide robust spatial priors for 3D localization, especially for long-range objects where geometric cues are often ambiguous.
Figure 9 compares CLIP-Mono3D with OVMono3D on the OV-KITTI benchmark. While OVMono3D occasionally yields more precise 2D boxes, our framework demonstrates superior 3D geometric reasoning. Specifically, our depth-guided transformer and language-aware queries enable more accurate depth estimation (row 2, “bench”), enhanced bounding-box integrity for complex shapes (row 3, “giraffe”), and significantly reduced missed detections for small objects (row 4, “dog”), highlighting the robustness of our end-to-end semantic–geometric alignment in open-world scenarios.
5.7. Real-World Scenario Verification
To verify generalization in unconstrained real-world environments, we extended our evaluation to the standard KITTI and Argoverse datasets. For KITTI, we adopted a cross-category split by training solely on Car and Pedestrian and evaluating zero-shot performance on Cyclist. For Argoverse, we trained on common categories and evaluated on seven distinct novel classes.
To adapt to the complex nature of the Argoverse dataset, we modified our query initialization strategy. Specifically, we shifted from the global mean aggregation used in OV-KITTI to a discrete spatial assignment paradigm. This ensured that queries were initialized at distinct spatial locations, effectively preventing multiple queries from collapsing onto a single salient object in complex scenes.
As shown in Table 10, on the KITTI dataset, despite the extreme scarcity of training categories limiting the learning of generalized 3D shape priors, our end-to-end framework still achieves competitive performance. It is worth noting that the baseline OVMono3D utilizes an external open-vocabulary 2D detector for strong 2D localization, whereas our method operates seamlessly without explicit 2D bounding-box guidance. On the larger-scale Argoverse dataset detailed in Table 11, our method demonstrates robust generalization. Supported by the updated initialization strategy, our model successfully guides geometric reasoning for novel objects, achieving promising results across various categories and obtaining a higher overall average precision than the baseline.
To further illustrate the practical performance, we provide qualitative comparisons in Figure 10. In the first column, our method successfully detects the occluded moped while maintaining the detection completeness of the surrounding objects. In the third column, our model manages to recall the challenging animal class in the distance without missed detections. We acknowledge that the geometric regression for highly irregular novel classes like animals still requires improvement and exhibits certain deviations. This is primarily due to the skewed category distribution of the dataset and the extreme difficulty of these highly crowded scenes.
6. Conclusions
We presented CLIP-Mono3D, an end-to-end open-vocabulary monocular 3D detector that directly integrates vision–language semantics without relying on pre-trained 2D detectors. To facilitate comprehensive evaluation, we introduced OV-KITTI, a new benchmark with balanced category distributions. Furthermore, we validated our framework on real-world datasets including KITTI and Argoverse. Extensive experiments demonstrated that our method achieved state-of-the-art performance, delivering a +1.72% absolute gain on the KITTI easy split and a +1.21% overall improvement on OV-KITTI unseen categories over competitive baselines.
Despite these advances, scaling monocular 3D detection to completely unconstrained open-world environments remains challenging. The primary limitations lie in the inherent scarcity of diverse, real-world 3D annotations and the domain gaps encountered during cross-scene generalization. Achieving true generalized perception requires vastly increasing the scene richness of 3D datasets, maintaining a broad distribution of highly irregular novel classes, and developing targeted adaptation strategies for entirely new domains. Nevertheless, our CLIP-Mono3D provides a viable paradigm to mitigate this data bottleneck by grounding 3D geometry in rich, pre-trained language semantics. Moving forward, future work will focus on alleviating these geometric data constraints, designing robust cross-scene adaptation mechanisms, and extending our semantic–geometric alignment to other data-scarce modalities, such as infrared imaging [51,52,53], to build reliable perception systems under adverse conditions. We hope this work inspires further research in semantic–geometric fusion for 3D perception.