Article

Benchmark Dataset and Deep Model for Monocular Camera Calibration from Single Highway Images

School of Mathematics and Computer Science, Shaanxi University of Technology, Hanzhong 723001, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(18), 5815; https://doi.org/10.3390/s25185815
Submission received: 4 August 2025 / Revised: 10 September 2025 / Accepted: 15 September 2025 / Published: 18 September 2025

Abstract

Single-image based camera auto-calibration holds significant value for improving perception efficiency in traffic surveillance systems. However, existing approaches face dual challenges: scarcity of real-world datasets and poor adaptability to multi-view scenarios. This paper presents a systematic solution framework. First, we constructed a large-scale synthetic dataset containing 36 highway scenarios using the CARLA 0.9.15 simulation engine, generating approximately 336,000 virtual frames with precise calibration parameters. The dataset achieves statistical consistency with real-world scenes by incorporating diverse view distributions, complex weather conditions, and varied road geometries. Second, we developed DeepCalib, a deep calibration network that explicitly models perspective projection features through the triplet attention mechanism. This network simultaneously achieves road direction vanishing point localization and camera pose estimation using only a single image. Finally, we adopted a progressive learning paradigm: robust pre-training on synthetic data establishes universal feature representations in the first stage, followed by fine-tuning on real-world datasets in the second stage to enhance practical adaptability. Experimental results indicate that DeepCalib attains an average calibration precision of 89.6%. Compared to conventional multi-stage algorithms, our method achieves a single-frame processing speed of 10 frames per second, showing robust adaptability to dynamic calibration tasks across diverse surveillance views.

1. Introduction

The evolution of camera calibration technology has significantly advanced video-based analysis from 2D planar to 3D spatial domains, providing critical support for 3D vision tasks such as vehicle speed calculation [1,2], spatial coordinate localization [3,4], traffic flow counting [5], and vehicle pose estimation [6,7]. This progress has substantially enhanced the environmental awareness of traffic surveillance systems. While camera calibration has established a mature theoretical framework as a fundamental computer vision technique, automatic acquisition of intrinsic and extrinsic camera parameters remains challenging in traffic surveillance scenarios due to diverse observation perspectives and unpredictable environmental conditions.
Existing automatic calibration methods can be categorized into two technical paradigms: multi-stage approaches and single-image approaches. The former achieves calibration through modularized processes including 2D-3D feature point matching, vanishing point detection, and parameter optimization, while the latter directly derives camera parameters from geometric features in a single image. Classic multi-stage approaches based on the Perspective-n-Point (PnP) principle [8,9] establish mapping relationships between 3D spatial points and 2D image points to estimate camera focal length and pose parameters [10,11]. In traffic scenarios, these methods typically rely on static landmarks [12,13] or moving vehicles [14,15,16,17] to construct geometric constraint models. However, their performance is highly dependent on accurate feature point detection, making them susceptible to environmental noise such as illumination variations and shadow interference. Even minor localization errors can lead to significant deviations in calibration results.
As the most distinctive geometric feature in panoramic images, vanishing points reflect the visual convergence characteristics of camera perspective projection, with their image positions determined by both intrinsic and extrinsic camera parameters. In traffic scenes, vanishing points typically arise from two orthogonal directions: the viewpoint direction and the horizontal direction. Consequently, numerous studies have creatively utilized these two vanishing point categories for automatic camera calibration [18,19,20,21,22]. Some approaches further attempt to extract a third vanishing point from vertical objects to satisfy the Manhattan World assumption [23,24]. Nevertheless, these methods encounter challenges in highway scenarios. For instance, geometric constraints from single/dual vanishing points remain limited, necessitating supplementary prior information such as landmark dimensions, lane widths, or camera heights. The Manhattan World assumption applies only to artificial structures and not to most natural environments. Additionally, multi-stage methods involve high computational complexity due to iterative optimization across modules. Particularly for Pan-Tilt-Zoom (PTZ) surveillance cameras, stable vanishing point acquisition requires continuous detection of lane markings or vehicle targets, so calibration procedures are often prematurely interrupted whenever the focal length or pose is adjusted.
Therefore, developing efficient and robust fully automatic camera calibration methods holds significant practical value. Based on geometric principles of camera imaging, image perspective features provide critical constraints for solving camera parameters. Compared with traditional algorithms relying on scene priors, deep learning frameworks demonstrate stronger environmental adaptability through data-driven feature extraction mechanisms. Previous studies have demonstrated that convolutional neural networks (CNNs) can localize vanishing points. The work [25] directly regressed vanishing point coordinates from panoramic images. Another category of approaches [26,27] reformulates the vanishing point detection task as a classification problem by discretizing the image space into n × n grids, then using a softmax classifier to predict grid positions containing vanishing points. Similarly, a vanishing point representation method [28] based on quadrant partitioning offers new insights for camera parameter estimation. In recent years, deep learning frameworks have extended to end-to-end single-image calibration through supervised learning, directly regressing camera focal lengths and other parameters [29]. The core motivation of these methods stems from utilizing observable visual cues in images, such as horizon features [30,31,32] and scene vector fields [33]. However, in highway scenarios, such visual cues are often weakened due to homogeneous road structures and diverse camera viewpoints, significantly degrading the performance of existing methods. More critically, the scarcity of publicly available highway scene datasets remains a persistent challenge, leaving camera calibration in such scenes an unresolved problem.
To address these challenges, this paper proposes an automatic calibration framework for highway surveillance cameras using a single image, featuring three primary contributions. (1) We constructed a large-scale synthetic dataset using the CARLA [34] simulation engine, containing 6 map categories and 36 representative highway segments. Through automated annotation pipelines, we generated 336,249 images with ground-truth calibration parameters. This dataset closely matches real-world highway scenarios in camera perspectives, road geometries, and weather conditions, significantly reducing deep learning models’ reliance on real-world data. (2) We developed a deep calibration network (DeepCalib) that integrates the triplet attention module (TAM) [35] into its backbone. This architecture enhances the semantic representation of perspective projection features, enabling joint estimation of vanishing point coordinates and camera pose parameters from single images while automatically adapting to varying observation viewpoints. (3) We adopted a dual-stage training paradigm combining synthetic pre-training and real-data fine-tuning. Robust feature learning is first performed on synthetic data with augmentation strategies to improve generalization. Subsequent parameter fine-tuning on limited real-world data enables virtual-to-real transfer learning. Experimental results demonstrate that this approach significantly enhances model adaptability in complex traffic environments.
The rest of this paper is organized as follows. Section 2 introduces the proposed synthetic dataset. Section 3 details the calibration model, network architecture, and training methodology. Section 4 presents experimental results including comprehensive comparisons with baseline models. Finally, Section 5 concludes the study and explores future research directions.

2. Benchmark Dataset

Large-scale annotated datasets play a pivotal role in enhancing the generalization capability of deep learning models for visual perception tasks. However, existing public highway-scene datasets predominantly exhibit single-view limitations and lack complete annotations of camera intrinsic and extrinsic parameters, which hinders their capacity to support the training demands of high-precision visual perception models. While the prior work [3] has released the Multi-View Camera Calibration Dataset (MVCCD), its sample size remains insufficient to cover the diversity of complex highway scenarios. To address this gap, we constructed a large-scale synthetic dataset using the CARLA [34] traffic simulation platform, employing virtual scene augmentation strategies to explicitly expand data distribution diversity.
We selected 36 arterial roads from 6 virtual city maps as foundational scenarios. On each road, three camera groups (left/center/right) were randomly deployed to achieve multi-view coverage. A comprehensive weather simulation system was developed using procedural generation for typical meteorological conditions including sunny, rainy, cloudy, foggy, and nighttime scenarios, ensuring deep networks maintain robust performance under diverse weather patterns. The traffic flow simulation module incorporated 33 standardized vehicle models with dynamic adjustment capabilities ranging from sparse to dense traffic conditions, maintaining consistency with real highway vehicle density parameters. To simulate operational boundaries of traffic surveillance cameras, we defined four parameter sampling spaces: field-of-view (FOV) [70°, 120°], pitch angle [−28°, 0°], yaw angle [−40°, 40°], and mounting height [10 m, 14.5 m]. Random parameter sampling ensures uniform label distribution across image regions, effectively mitigating training biases caused by imbalanced datasets. Figure 1 illustrates representative synthetic scenes that closely resemble real-world highway environments while exhibiting greater diversity in camera viewpoints and road geometries.
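For concreteness, the snippet below sketches how such a sampling scheme could be scripted with CARLA's Python API. It is an illustrative sketch rather than the authors' actual data-generation pipeline; the function name and spawn location are placeholders.

```python
import random
import carla

# Sampling ranges described above (FOV, pitch, yaw, mounting height).
FOV_RANGE = (70.0, 120.0)
PITCH_RANGE = (-28.0, 0.0)
YAW_RANGE = (-40.0, 40.0)
HEIGHT_RANGE = (10.0, 14.5)

def spawn_surveillance_camera(world, base_location):
    """Spawn one 1920x1080 RGB camera with randomly sampled FOV and pose."""
    bp = world.get_blueprint_library().find('sensor.camera.rgb')
    bp.set_attribute('image_size_x', '1920')
    bp.set_attribute('image_size_y', '1080')
    bp.set_attribute('fov', str(random.uniform(*FOV_RANGE)))
    transform = carla.Transform(
        carla.Location(x=base_location.x, y=base_location.y,
                       z=random.uniform(*HEIGHT_RANGE)),
        carla.Rotation(pitch=random.uniform(*PITCH_RANGE),
                       yaw=random.uniform(*YAW_RANGE),
                       roll=0.0))
    return world.spawn_actor(bp, transform)
```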
The final dataset comprises 336,249 pairs of 1920 × 1080 RGB images and corresponding annotations. Each annotation file records the vanishing point coordinates, pitch angle (ϕ), yaw angle (θ), camera focal length (f), and camera height (h). Data partitioning follows a stratified sampling strategy, allocating samples to training/validation/test sets in 7.5:1.5:1 ratios. Table 1 compares parameter distributions between the real-world dataset (MVCCD_R) and synthetic counterpart (MVCCD_S), demonstrating broader coverage across all dimensions for the proposed dataset.
We systematically demonstrate the parameter distribution characteristics of the constructed dataset by visualizing histograms of vanishing point coordinates, camera focal lengths, and pose parameters. Figure 2 reveals that vanishing point coordinates cover the majority of the image plane. Notably, due to the typical top-down installation of surveillance cameras, vanishing points exhibit a pronounced bias toward the upper image half. This distribution pattern closely aligns with the visual perception of roads receding into the distance in real-world scenarios.
Figure 3 presents statistical histograms of camera parameters in MVCCD_S. The two rotation angles (pitch/yaw) exhibit uniform distributions across their defined angular spaces. The focal lengths demonstrate a broad distribution across 500–1400 pixels, with equivalent focal lengths spanning the operational spectrum from wide-angle to medium-telephoto configurations typical of surveillance systems. Camera height parameters cluster within 10.0–13.5 m, aligning with empirical deployment standards. The dataset maintains statistical equilibrium across critical parameters, providing an ideal benchmark for validating camera calibration algorithms based on geometric constraints.
Figure 4 presents a quantitative comparison between MVCCD_S and MVCCD_R datasets across multiple feature dimensions, including RGB color channels, texture features, pixel intensity, and geometric properties. Through visualized histograms and statistical mean overlays, the following conclusions are drawn:
Color Space Distribution: Synthetic data exhibits slightly lower RGB channel means compared to real data, indicating overall darker brightness. This observation is directly attributed to simulated weather conditions (rain, fog, nighttime) in synthetic scenes, which shift pixel values toward lower luminance regions.
Texture Complexity: Real data demonstrates significantly higher contrast, suggesting richer edge details and high-frequency textures. The disparity in dissimilarity and homogeneity further confirms the regularity of synthetic textures—exhibiting stronger spatial correlation—while real data shows lower texture homogeneity due to natural noise and structural complexity.
Pixel Intensity Dynamics: Real data intensity concentrates in the 10–240 range with a left-skewed peak at 50 gray levels. Synthetic data spans the 20–250 range with bimodal peaks (115 and 180), demonstrating enhanced diversity through simulations of varying weather (sunny/cloudy) and time periods (day/dusk).
Geometric Feature Consistency: Close alignment in orientation angle and anisotropy indicates high statistical consistency between datasets in object orientation and shape anisotropy. The corner count discrepancy suggests room for improvement in modeling complex geometric details, but synthetic data’s directional distribution adequately covers real-world variations.
Overall, synthetic and real datasets demonstrate significant statistical consistency in geometric features, particularly in orientation angle and anisotropy metrics. Color and texture discrepancies highlight the necessity of data augmentation techniques such as stochastic color jittering and noise injection to further improve distributional alignment between synthetic and real-world scenes.
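As a reference point for how such texture statistics can be reproduced, the sketch below computes GLCM-based descriptors (contrast, dissimilarity, homogeneity) for a single grayscale image. The paper does not specify its tooling, so the use of scikit-image and the function name here are assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_stats(gray_u8: np.ndarray) -> dict:
    """GLCM texture descriptors for one 8-bit grayscale image (distance 1, angle 0)."""
    glcm = graycomatrix(gray_u8, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    return {prop: float(graycoprops(glcm, prop)[0, 0])
            for prop in ('contrast', 'dissimilarity', 'homogeneity')}
```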

3. Methods

3.1. Calibration Model for Traffic Surveillance Cameras

Calibration methods for traffic surveillance cameras have been extensively discussed, with detailed derivations referenced in the work [36]. This section provides a concise overview of the underlying principles. In the standard calibration model, a homogeneous 3D spatial point P = [X, Y, Z, 1]^T is projected onto the image plane as a 2D point p = [u, v, 1]^T through the projection matrix M.
$\lambda \, [u, v, 1]^{T} = \mathbf{M} \, [X, Y, Z, 1]^{T},$
The general mathematical formulation of M is expressed as M = K[R|T], where K denotes the camera’s intrinsic parameters (including focal length and principal point coordinates), while R and T represent the extrinsic parameters (relative to the world coordinate system), corresponding to the rotation matrix and translation vector, respectively.
For traffic surveillance cameras, the calibration parameters can be simplified by establishing a rational world coordinate system (refer to Figure 5). Furthermore, under the assumptions that the camera’s principal point coincides with the image center and the roll angle remains zero, the projection matrix M is solely determined by the focal length f, pitch angle ϕ , yaw angle θ , and camera height h.
As a fundamental characteristic of perspective projection, vanishing points exhibit strong correlations with the camera’s focal length, pitch angle, and yaw angle. Their coordinates (u,v) in the image plane can be derived using the following relationships:
$u = c_x + f \, \dfrac{\tan\theta}{\cos\phi}, \qquad v = c_y + f \tan\phi,$
where (cx,cy) denotes the coordinates of the principal point. As demonstrated in Equation (2), the focal length f can be derived given the vanishing point coordinates (u,v), pitch angle ϕ , and yaw angle θ . When combined with the camera height h, these parameters enable the complete construction of the calibration matrix.
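As a worked example of this relationship, the sketch below inverts Equation (2) to recover the focal length from a detected vanishing point and known rotation angles. It assumes the sign conventions of Equation (2) and is only a numerical illustration, not part of the DeepCalib pipeline.

```python
import math

def focal_from_vp(u, v, cx, cy, pitch_deg, yaw_deg):
    """Invert Eq. (2): v - cy = f*tan(phi) and u - cx = f*tan(theta)/cos(phi).
    Returns two focal-length estimates (pixels); they coincide in the noise-free case."""
    phi, theta = math.radians(pitch_deg), math.radians(yaw_deg)
    f_from_v = (v - cy) / math.tan(phi)                    # undefined when phi = 0
    f_from_u = (u - cx) * math.cos(phi) / math.tan(theta)  # undefined when theta = 0
    return f_from_u, f_from_v

# Example: f = 1000 px, principal point (960, 540), pitch -15 deg, yaw 10 deg
# gives a vanishing point near (1142.5, 272.1); both estimates recover f = 1000.
```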

3.2. Single-Image Calibration with DeepCalib

Three-dimensional objects exhibit distinct visual convergence effects after undergoing camera perspective projection transformations. In traffic scenes, geometric deformations and convergence directions of road structures vary significantly across different viewpoints, with these projection patterns universally present in both panoramic images and local objects. This implies that camera parameters can be derived from image projection features. According to this geometric regularity, we developed DeepCalib, a single-image based deep calibration network whose overall framework is illustrated in Figure 6.
The DeepCalib architecture comprises three components: a backbone, a deconvolutional module and a multi-task detection head. The backbone network, built on the ConvNeXt [37] architecture, integrates the TAM [35] module for cross-dimensional feature fusion. After feature encoding, three-stage deconvolutional modules perform progressive upsampling, ultimately generating multi-scale feature maps at 1/16, 1/8, and 1/4 resolutions of the original image. The multi-task detection head contains a key point localization branch and a camera pose estimation branch to perform geometric inference from the captured features. Based on the established calibration model, the network outputs are decoded to obtain both intrinsic parameters (focal length) and extrinsic parameters (rotation angles, translation vectors).

3.2.1. Backbone

The backbone network adopts a ConvNeXt architecture to jointly capture global and local visual features. Its hierarchical design incorporates four cascaded ConvNeXt Block modules, constructing multi-scale feature representations through progressive down sampling and channel expansion. For feature extraction, each ConvNeXt Block replaces 3 × 3 convolutions with 7 × 7 kernels, maintaining local texture modeling capability while expanding receptive fields to capture long-range spatial dependencies. This design enables joint encoding of global semantic contexts and fine-grained local patterns through enhanced feature hierarchies.
In convolutional neural networks, attention mechanisms enable the model to focus on specific visual regions or assign differentiated weights to different regions, thereby filtering critical features from vast information. A typical example is SENet [38], which captures inter-channel importance differences through channel attention. However, its lack of spatial dimension perception leads to insufficient modeling of spatial positional correlations. Although Convolutional Block Attention Module (CBAM) [39] integrates channel and spatial attention, it fails to establish cross-dimensional feature interaction mechanisms. Given the pervasive perspective projection characteristics in panoramic images and local objects, accurately capturing global-local features and their interactions is pivotal for enhancing network performance. The TAM module effectively addresses spatial-channel dimensional feature interactions through three parallel branches. Each branch independently aggregates interaction information between specific dimensions and channel dimensions in the input, forming a cross-dimensional information enhancement mechanism.
To this end, this paper integrates the TAM module into the backbone network, establishing a joint modeling framework for channel-spatial dimensional dependencies. Specifically, we embed a TAM unit within each ConvNeXt module to synchronously extract low-level geometric features and high-level semantic features. As illustrated in Figure 7, the TAM module achieves interaction among channel height (CH), channel width (CW), and spatial attention (HW) through three parallel branches. Each branch follows a three-stage processing pipeline of “Z-pool operation—convolution—Sigmoid activation”, and ultimately generates an attention-weighted tensor of the same dimension as the original feature through point-wise multiplication with the original feature. Specifically, the Z-pool layer compresses the zeroth dimension of the tensor to two dimensions by concatenating the features obtained from average pooling and maximum pooling across that dimension. Mathematically, it can be formalized as follows:
$\mathrm{Zpool}(\chi) = \left[ \mathrm{MaxPool}_{0d}(\chi), \ \mathrm{AvgPool}_{0d}(\chi) \right],$
where 0d denotes the 0th dimension, across which the max and average pooling operations take place.
The first branch is responsible for constructing the interaction relationship between channels and height. For an input tensor χ R C × H × W , it is first rotated 90° anti-clockwise along the H-axis to form χ 1 R W × H × C . Subsequently, χ 1 is compressed to a dimension of 2 × H × C via Z-pool, and then passed through a convolutional layer and a batch normalization layer to generate attention weights. These weights are activated by a Sigmoid function ( σ ) and applied to χ 1 . After that, it is rotated 90° clockwise along the H-axis again to restore the same shape as the original input tensor χ . This branch utilizes height-dimension information to focus on the vertical geometric features of the image, enabling the estimation of the vertical vanishing point (v) and the pitch angle ( ϕ ). The operation process can be formally described as follows:
$\omega_{CH} = \sigma\!\left( \mathrm{Conv}_{7 \times 7}\left( \mathrm{Zpool}(\chi_1) \right) \right) \odot \chi_1,$
where ⊙ denotes broadcast element-wise multiplication.
The second branch deals with channel-width interaction. Similarly, the input χ is rotated 90° anti-clockwise along the W-axis to generate χ 2 R H × C × W . Subsequently, a three-stage processing operation is employed to generate channel-width attention. This branch estimates the horizontal vanishing point (u) and the camera yaw angle ( θ ) by perceiving features in the horizontal dimension. The formulation is presented as follows:
$\omega_{CW} = \sigma\!\left( \mathrm{Conv}_{7 \times 7}\left( \mathrm{Zpool}(\chi_2) \right) \right) \odot \chi_2.$
The third branch directly processes the spatial dimension. The input χ is compressed by Z-pool to a 2 × H × W dimension. The simplified tensor χ 3 captures global contextual dependencies through a 7 × 7 convolution. After sigmoid activation, it generates 1 × H × W attention weights that directly act on the original input χ . This branch enhances the overall perception of road geometry and camera perspective through spatial dimension modeling.
$\omega_{HW} = \sigma\!\left( \mathrm{Conv}_{7 \times 7}(\chi_3) \right) \odot \chi.$
Finally, the C × H × W dimensional fine-tuned attention weights generated by the three branches are fused across dimensions through simple averaging, with the aggregation process expressed as:
$\hat{\chi} = \frac{1}{3}\left( \omega_{CH} + \omega_{CW} + \omega_{HW} \right).$
This architecture preserves the integrity of the original feature space structure while achieving cross-dimensional synergistic enhancement of channel-spatial features. It enables the network to adaptively focus on key perspective-sensitive feature regions, significantly improving the accuracy of vanishing point detection and the robustness of camera rotation angle estimation. These capabilities provide strong support for geometric structure perception in real-world scenarios.
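For readers who want a concrete reference, the following is a minimal PyTorch sketch of the triplet attention block described above, following the public TAM design (Z-pool, 7 × 7 convolution, sigmoid gating, branch averaging). Class and variable names are ours and are not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooled maps along dim 1 (the dimension being compressed)."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> BN -> sigmoid -> element-wise gating of the input."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three parallel branches (C-H, C-W, H-W), averaged as in the aggregation step above."""
    def __init__(self):
        super().__init__()
        self.ch_gate = AttentionGate()   # channel-height interaction
        self.cw_gate = AttentionGate()   # channel-width interaction
        self.hw_gate = AttentionGate()   # plain spatial attention
    def forward(self, x):                # x: (B, C, H, W)
        # C-H branch: rotate so W takes the channel slot, gate, rotate back
        x_ch = self.ch_gate(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # C-W branch: rotate so H takes the channel slot, gate, rotate back
        x_cw = self.cw_gate(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # H-W branch: operate on the original layout
        x_hw = self.hw_gate(x)
        return (x_ch + x_cw + x_hw) / 3.0
```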

3.2.2. Multi-Task Detection Head

The multi-task detection head adopts a dual-branch architecture comprising a keypoint branch and a camera pose branch, responsible for vanishing point detection and camera pose estimation, respectively. The keypoint branch treats the vanishing point along road extension directions as a critical geometric anchor in panoramic imagery. This branch processes 1/4-scale feature maps from the upsampling module, employing two cascaded 1 × 1 convolutional layers for channel dimension reduction, ultimately generating a heatmap at 136 × 240 resolution. During ground truth generation, a 2D Gaussian kernel was used to construct the vanishing point response region, with peak coordinates corresponding to the true vanishing point location. Sub-pixel localization accuracy is achieved through heatmap peak response decoding, enabling precise geometric anchor localization in complex traffic environments.
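A minimal sketch of this peak-decoding step is given below; the tensor shapes and variable names are assumed for illustration (a heatmap at 1/4 resolution, a 2-channel sub-pixel offset map, and a stride of 4 back to the 544 × 960 network input).

```python
import torch

def decode_vanishing_point(heatmap, offset, stride=4):
    """Recover the vanishing point from the heatmap peak plus the predicted sub-pixel
    offset, then map back to input-image coordinates.
    heatmap: (H, W) tensor of responses; offset: (2, H, W) tensor of (dx, dy)."""
    H, W = heatmap.shape
    idx = int(torch.argmax(heatmap.reshape(-1)))
    y, x = divmod(idx, W)
    dx = float(offset[0, y, x])
    dy = float(offset[1, y, x])
    return (x + dx) * stride, (y + dy) * stride
```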
For rotation angle estimation, direct regression of continuous angular values is prone to prediction instability. Inspired by the MultiBin [40] architecture, the camera pose estimation branch adopts a classification and central-residual regression strategy. As illustrated in Figure 8, the rotation angle space is discretized into n overlapping bins. The network first predicts a probability distribution over these bins, then performs residual (δη) regression relative to the selected bin’s central angle. The final rotation angle is obtained by summing the bin center value and the predicted residual.
$\hat{\eta} = c_i + \delta\eta_i,$
where η̂ represents the ground truth angle, c_i denotes the center angle of bin i, and δη_i refers to the residual with respect to the center of bin i.
Regarding camera height regression, a hybrid strategy combining global prior and local refinement was employed. We precomputed a mean height ( h ¯ ) across the entire dataset, with the network only required to predict residual offset ( δ h ) relative to this global prior. This approach significantly reduces parameter search space complexity while maintaining adaptive calibration capability. The absolute camera height for each input image is recomputed by combining the global mean value with the predicted residual offset.
$h = \bar{h} \cdot e^{\delta h},$
where the height residual δh is obtained by applying the sigmoid function σ to the corresponding network output o_h, i.e., δh = σ(o_h) − 1/2.
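The two decoding rules can be summarized in a few lines. The bin centers below are one plausible layout consistent with the binning configuration reported in Section 3.3 (three 12° pitch bins with 4° overlap) and the mean height of MVCCD_S; they are placeholders rather than values taken from released code.

```python
import math

# One plausible bin layout for the pitch range [-28 deg, 0 deg]:
# three 12-degree bins with 4-degree overlap -> centers at -22, -14, -6 degrees.
PITCH_BIN_CENTERS = [-22.0, -14.0, -6.0]
H_MEAN = 11.83   # mean camera height of MVCCD_S (Section 3.3), in metres

def decode_pitch(bin_probs, residuals):
    """Pick the highest-confidence bin and add its predicted residual."""
    i = max(range(len(bin_probs)), key=lambda k: bin_probs[k])
    return PITCH_BIN_CENTERS[i] + residuals[i]

def decode_height(o_h):
    """h = h_mean * exp(sigmoid(o_h) - 0.5), i.e. roughly within [0.61, 1.65] x h_mean."""
    delta_h = 1.0 / (1.0 + math.exp(-o_h)) - 0.5
    return H_MEAN * math.exp(delta_h)
```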

3.2.3. Multi-Task Loss Function

Based on the network outputs, the loss function of DeepCalib comprises four components: vanishing point classification loss L v , offset loss L v o , multibin loss L m , and camera height residual loss L h o . The vanishing point loss is computed using focal loss [41]:
$L_v = -\frac{1}{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \begin{cases} \left(1 - \hat{P}_{cij}\right)^{\alpha} \log\left(\hat{P}_{cij}\right), & P_{cij} = 1 \\ \left(1 - P_{cij}\right)^{\beta} \hat{P}_{cij}^{\alpha} \log\left(1 - \hat{P}_{cij}\right), & \text{otherwise} \end{cases}$
where (H, W) denotes the heatmap size, and N represents the number of positive samples. The terms P_cij and P̂_cij correspond to the ground truth and predicted responses at heatmap position (i, j), respectively. The hyperparameters α and β adjust the loss weights for positive and negative samples, respectively. To compensate for the quantization error due to feature map down sampling, the vanishing point offset loss is calculated as follows:
$L_{vo} = \frac{1}{N} \sum_{P} \left| \hat{\delta}_{P} - \delta_{P} \right|, \qquad \delta_{P} = \frac{P}{R} - \tilde{P},$
where P represents the actual vanishing point coordinates, R denotes the down sampling factor, δ̂_P is the predicted offset, δ_P is the true offset, and P̃ = ⌊P/R⌋ (the symbol ⌊·⌋ indicates the floor operation).
The multibin loss L_m for each rotation angle combines a bin classification term and a within-bin residual regression term:
$L_m = L_c + \omega \times L_{co},$
where ω is the weighting factor, set to 0.5. The confidence loss L_c is the softmax (cross-entropy) loss over the bins, while L_co aims at eliminating the discrepancy between the predicted and true values within each bin. The calculation formula is as follows:
$L_{co} = -\frac{1}{m} \sum_{i} \cos\left( \hat{\eta} - c_i - \delta\eta_i \right),$
where m denotes the number of bins covering the true angle, and η̂ represents the ground truth angle. For the camera height residual loss L_ho, each regression quantity is evaluated using the Smooth L1 loss:
$L_{ho} = \mathrm{SmoothL1}\left( \delta h \right),$
In summary, the total loss function Loss of DeepCalib can be described as follows:
$Loss = \omega_1 \times L_v + \omega_2 \times L_{vo} + \omega_3 \times L_m + \omega_4 \times L_{ho},$
where ω_1, ω_2, ω_3, and ω_4 are the weighting factors between the sub-loss functions, with ω_1 + ω_2 + ω_3 + ω_4 = 1.
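A compact sketch of how these four terms could be combined in PyTorch is shown below. The dictionary keys, tensor layouts, and weight values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vp_focal_loss(pred, gt, alpha=2, beta=4):
    """Heatmap focal loss with alpha = 2 and beta = 4 (Section 3.3)."""
    pred = pred.clamp(1e-6, 1.0 - 1e-6)
    pos = gt.eq(1).float()
    pos_term = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * (1 - pos)
    return -(pos_term.sum() + neg_term.sum()) / pos.sum().clamp(min=1)

def deepcalib_loss(out, tgt, weights=(0.4, 0.1, 0.4, 0.1), omega=0.5):
    """Weighted sum of the four sub-losses; keys and weights are placeholders."""
    l_v = vp_focal_loss(out['heatmap'], tgt['heatmap'])
    l_vo = F.l1_loss(out['vp_offset'], tgt['vp_offset'])            # sub-pixel offset
    l_c = F.cross_entropy(out['bin_logits'], tgt['bin_index'])       # bin classification
    # Residual term: maximise cos(gt_angle - bin_center - predicted_residual)
    l_co = (-torch.cos(tgt['angle'] - tgt['bin_center'] - out['bin_residual'])).mean()
    l_m = l_c + omega * l_co
    l_ho = F.smooth_l1_loss(out['height_residual'], tgt['height_residual'])
    w1, w2, w3, w4 = weights
    return w1 * l_v + w2 * l_vo + w3 * l_m + w4 * l_ho
```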

3.3. Training Details

This section presents a two-stage progressive training paradigm: (1) end-to-end robust feature learning on MVCCD_S without pre-trained weight initialization, followed by (2) task-specific parameter fine-tuning of the multi-task detection head on MVCCD_R. The initial training phase involves the implementation of several data processing procedures.
Preprocessing: During synthetic data generation, viewpoint diversity was simulated through random perturbations of camera parameters. Despite constrained parameter perturbation ranges, certain samples exhibit vanishing points near image boundaries, causing their corresponding heatmap responses to exceed valid perceptual ranges after down sampling. To address this, a data purification step was first implemented to exclude such invalid samples. The retained valid image sequences were then resized to 544 × 960 pixels as standardized input for supervised training.
Data Augmentation: The training pipeline incorporated three data augmentation techniques: horizontal flipping, spatial translation, and color transformation. Horizontal flipping and random translation were applied with a probability of 0.4. Translation vectors were randomly combined from four directions (up/down/left/right), with two safety mechanisms: (1) a 50-pixel displacement threshold per direction, aborting transformations exceeding this limit, and (2) a maximum translation magnitude of 180 pixels. Void regions generated post-translation are filled using nearest-neighbor interpolation to maintain pixel continuity. During horizontal flipping, simultaneous sign inversion of the yaw angle ensures parameter validity. The color augmentation module includes random jittering of brightness/contrast/saturation and Gaussian noise injection. To prevent over-enhancement, this operation was activated with a probability of 0.2.
Hyperparameter Choice: Vanishing point heatmaps were generated using 2D Gaussian masks with radius r = 8, where pixels with mask values ≥ 0.5 were defined as positive samples. For heatmap loss calculation, parameters were configured as α = 2 and β = 4 . For angular discretization, pitch angle ϕ and yaw angle θ were partitioned using 3 and 5 overlapping bins, respectively. The specific binning parameters were configured as: 12° width with 4° overlap for pitch angles, and 20° width with 5° overlap for yaw angles. Pre-computed statistical analysis yielded average camera height of 11.83 m for MVCCD_S and 12.36 m for MVCCD_R.
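The ground-truth heatmap generation described above can be sketched as follows. The Gaussian standard deviation is not stated in the paper, so the sigma = r/3 choice here is an assumption.

```python
import numpy as np

def vp_heatmap(vp_x, vp_y, shape=(136, 240), radius=8):
    """Ground-truth heatmap: 2D Gaussian centred on the down-scaled vanishing point.
    Pixels with a response >= 0.5 are treated as positive samples."""
    h, w = shape
    sigma = radius / 3.0                      # assumed; the paper only specifies r = 8
    ys, xs = np.ogrid[:h, :w]
    g = np.exp(-((xs - vp_x) ** 2 + (ys - vp_y) ** 2) / (2.0 * sigma ** 2))
    return g.astype(np.float32)
```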
Training Strategy: The training process adopted a two-stage paradigm combining pre-training and fine-tuning. During the pre-training phase, comprehensive feature learning was conducted on large-scale synthetic datasets using a batch size of 32 across 20 epochs (157,600 iterations) with an initial learning rate of 2 × 10⁻³. This stage emphasizes robust representation learning through end-to-end optimization of all network parameters. The subsequent fine-tuning stage focused on task-specific parameter refinement for the multi-task detection head using real-world datasets. This phase employed a reduced batch size of 16 over 10 epochs (6720 iterations) with an adjusted initial learning rate of 2 × 10⁻⁵. Both training stages utilized the AdamW [42] optimizer with weight decay regularization and implemented dynamic learning rate adjustment: the learning rate was decayed by a factor of 0.1 when the validation loss showed no improvement for three consecutive epochs. To accelerate convergence, distributed training was performed across four NVIDIA A800 GPUs using data parallelism.

4. Experiments

To validate the proposed method, we conducted five experiments: (1) ablation studies, (2) binning strategy, (3) transfer learning, (4) camera calibration, and (5) time consumption. All experiments were performed under identical conditions using a unified test set (comprising both MVCCD_S and MVCCD_R datasets), consistent hardware configuration (NVIDIA A800 GPU), and hyperparameter settings to ensure direct comparability of results.

4.1. Ablation Studies

This study employed three ConvNeXt sub-architectures (ConvNeXt_tiny, ConvNeXt_small, and ConvNeXt_base) as baseline models. Enhanced networks were developed by integrating TAM modules into these backbones. Ablation experiments were systematically conducted to compare performance differences between baseline and enhanced models. All evaluations were performed on MVCCD_S, with quantitative analysis focusing on four key metrics: vanishing point Euclidean distance (L2 Dis), the Mean Absolute Error (MAE) of pitch angle, yaw angle, and camera height. Experimental results are summarized in Table 2.
Statistical analysis reveals a nonlinear positive correlation between model capacity and performance in the ConvNeXt series. As network scale increases from tiny to base, continuous improvement is observed across all evaluated metrics. Notably, the transition from ConvNeXt_tiny to ConvNeXt_small yields the most significant performance improvements: L2 distance error decreases by 11.17 pixels, while pitch and yaw angle errors reduce by 3.97° and 4.87°, respectively. However, when model capacity is further expanded to the base level, the rate of performance improvement markedly slows, with only a 5.72-pixel reduction in L2 distance error and 0.99° and 1.68° reductions in pitch and yaw angle errors, respectively. This indicates that lightweight models face a feature representation bottleneck, and that mere capacity scaling fails to deliver sustained linear performance gains.
Ablation studies demonstrate that integrating TAM into all baseline models yields significant improvements across four core metrics. Specifically, vanishing point localization errors are reduced by 16.32–22.46 pixels, with ConvNeXt_base_TAM achieving the optimal 22.22-pixel reduction. Pitch/yaw errors for ConvNeXt_base_TAM decrease to 1.33° and 2.45°, representing reductions of 0.41° and 1.49° (23.56% and 37.82% decreases) compared to the baseline model. These results confirm that the TAM module effectively enhances baseline models’ perception of scene geometric structures. While height residual estimation shows relatively modest improvements, the enhanced model maintained stable performance with a mean absolute error (MAE) of 0.89 m versus the baseline’s 0.91 m. This limited enhancement may stem from inherent properties of perspective projection features—camera height does not directly influence perspective effects, resulting in reduced model sensitivity to height variations.

4.2. Binning Strategy

Based on results in Section 4.1, ConvNeXt_base_TAM was selected as the test network to systematically validate rotation angle binning strategies. In these experiments, pitch and yaw angles were partitioned into 2–6 bins with 4° and 5° overlaps, respectively. Classification accuracy and MAE of residual angles served as primary evaluation metrics. Table 3 reveals a distinct trade-off between classification and regression performance as bin counts increased from 2 to 6. For pitch angle: At 2 bins, classification accuracy peaked at 94.23% but with substantial regression error (2.78°). When bin count reached 6, accuracy declined to 79.03% while MAE improved to 3.35°. Similarly, yaw angle showed highest classification accuracy at 2 bins (92.37%) with correspondingly high regression error (5.27°), but accuracy dropped to 76.91% and MAE reduced to 4.45° at 6 bins. This inverse relationship suggests that fewer bins enhance classification discriminability but expand regression range, increasing error. Conversely, more bins refine angular resolution to reduce regression error but blur classification boundaries, compromising accuracy.

4.3. Transfer Learning

This section presents experimental validation of the transfer learning strategy applied to the DeepCalib model. The implementation involved two training paradigms: (1) pre-training from scratch on MVCCD_S followed by fine-tuning on MVCCD_R, and (2) direct training exclusively on MVCCD_R as a baseline comparison. Given the differences in RGB color channels and texture feature distributions between synthetic and real images, this experiment first employed data augmentation techniques such as random color jittering and noise injection to optimize the synthetic dataset, thereby mitigating distribution bias between the two datasets.
Figure 9 systematically illustrates the evolution of loss functions under both strategies, including vanishing point heatmap loss, pitch angle estimation loss, yaw angle estimation loss, and camera height residual loss. Global analysis reveals that compared to direct training, the transfer learning model achieves consistently lower loss values across all metrics, with significantly reduced fluctuations in loss curves during training. Although the initial loss for rotation angle training is higher in the transfer learning approach, its curves demonstrate faster convergence. This indicates that pre-training on synthetic data effectively enhances generalization capability.
Figure 10 provides quantitative comparisons on MVCCD_R. For vanishing point detection, the transfer learning strategy constrains Euclidean distance errors within the [0, 50] pixel range (mean 16 pixels), significantly outperforming direct training’s [0, 220] pixel range (mean 29 pixels). In camera extrinsic parameter estimation, absolute errors for pitch and yaw angles are confined to [0°, 4°] (mean 1.18°) and [0°, 4.2°] (mean 2°), respectively, substantially surpassing direct training results of [0°, 13°] (mean 2.90°) and [0°, 16°] (mean 3.28°). Camera height residuals are controlled within 1.2 m (mean 0.42 m), improving upon direct training’s 0.48 m average. All quantitative metrics confirm the superior performance of transfer learning across evaluated dimensions.
To intuitively verify transfer learning efficacy, Figure 11 compares qualitative results of both strategies in real-world vanishing point estimation. The experimental setup comprises four typical road scene groups arranged in side-by-side format: left panels show direct training predictions, while right panels display transfer learning results. Columns sequentially present input images, ground truth heatmaps, and predicted heatmaps. Observations indicate that transfer learning produces consistently stable vanishing point detection across diverse road environments, particularly excelling in complex curved scenarios with heatmap distributions showing greater consistency with ground truth.
These qualitative findings align with quantitative results, conclusively demonstrating that the proposed synthetic dataset serves as a valuable complement to real-world data, enabling significant performance improvements in practical applications through transfer learning. The statistical consistency of geometric features plays a positive role in enabling DeepCalib to learn the geometric structure and camera view properties of real-world scenes. Although the introduction of weather conditions such as rain, fog, and night in the synthetic dataset results in slightly lower RGB color channel values compared to real scenes, and the inherent homogeneity of images caused by virtual engine characteristics remains an issue, the implementation of reasonable data augmentation strategies (including random color jittering and noise injection) effectively reduces dataset distribution discrepancies. This approach significantly improves vanishing point detection accuracy and camera rotation angle estimation in highway scenarios when transfer learning is applied to real-world data.

4.4. Camera Calibration

This section comprehensively evaluates DeepCalib’s calibration performance in real-world road scenarios through two experiments. As a foundational validation, we assessed DeepCalib’s vanishing point estimation capability on MVCCD_R and another public dataset [27], performing comparative analysis against representative traditional methods (AutoCalib [19] and Edgelets [43]) and state-of-the-art deep learning approaches [3,27]. Considering that AutoCalib and Edgelets apply exclusively to video sequences with fixed camera views, we specifically used videos originating from the same scenarios as MVCCD_R to ensure fair comparison. Identical configurations were adopted for subsequent related experiments. To address resolution discrepancies between datasets, this study employed L2 and normalized distance (NormDis) for quantitative evaluation. The NormDis, standardized by image diagonal length, effectively eliminated resolution variations’ impact on assessment results, enabling cross-dataset objective comparison. The formulation is presented as follows:
$\mathrm{NormDis} = \frac{\left\| \hat{vp} - vp \right\|_2}{d},$
where v p ^ denotes the estimated vanishing point coordinates, while vp represents the ground truth vanishing point coordinates, and d indicates the image diagonal length.
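For reference, NormDis is straightforward to compute; the short sketch below assumes pixel-coordinate tuples and the frame size of the evaluated images.

```python
import math

def norm_dis(vp_pred, vp_gt, image_w=1920, image_h=1080):
    """Normalized vanishing-point error: L2 distance divided by the image diagonal."""
    d = math.hypot(image_w, image_h)
    return math.hypot(vp_pred[0] - vp_gt[0], vp_pred[1] - vp_gt[1]) / d
```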
Table 4 systematically presents comparative results of the five methods. On the MVCCD_R dataset, DeepCalib achieved L2 distance errors of 13 pixels for straight roads and 34 pixels for curved roads, outperforming other algorithms by 44–78 pixels and 39–96 pixels, respectively. On dataset [27], it reduces L2 errors by 2 pixels and 7 pixels compared to DeepCN and DeepVP. In terms of NormDis, DeepCalib attains 0.006 (straight roads) and 0.022 (curved roads) on MVCCD_R, and 0.014 on dataset [27], all significantly lower than traditional methods (e.g., 0.059 for Edgelets on curved roads) and deep learning approaches (e.g., 0.026 for DeepCN on straight roads). This confirms that DeepCalib’s vanishing point localization accuracy is resolution-independent, depending only on scene complexity, which substantially enhances stability in cross-device and cross-scene deployments.
Notably, curved road scenarios exhibit significantly higher vanishing point estimation errors than straight roads. This discrepancy arises because nonlinear road edge distributions challenge traditional methods reliant on linear assumptions. Meanwhile, deep learning approaches also suffer performance degradation due to weakened linear features in such complex scenes. However, DeepCalib maintains the lowest errors in these challenging environments, demonstrating its capacity to capture geometric features of curved roads to some extent.
Based on the visualized experimental setup (Figure 12), we conducted line segment measurements at three distances (6 m, 9 m, 15 m) along standardized highway lane markings. The reference benchmarks (6 m marking intervals and 9 m lane spacing) are explicitly annotated through short line segments and their combinations in the image. Critical measurement lines aligned with lane edges converging toward the vanishing point, enabling direct analysis of perspective projection effects. This experiment compared DeepCalib with manual calibration, traditional methods (AutoCalib and Edgelets), and a deep learning approach (DeepCN). Manual calibration utilized the VWL algorithm from the work [36], where V denotes vanishing point, W represents road width, and L signifies landmark length.
Table 5 presents quantitative results where DeepCalib achieves mean measurements of 6.56 m, 9.96 m, and 16.68 m for the 6 m, 9 m, and 15 m segments, with calibration accuracies of 90.67%, 89.33%, and 88.80%. The overall calibration accuracy reached 89.60%, surpassing AutoCalib (81.46%), Edgelets (76.29%), and DeepCN (86.05%). In contrast, manual calibration achieves centimeter-level precision (≤6 cm). Notably, DeepCalib eliminates scene- and object-specific constraints, demonstrating superior adaptability. This enables high calibration accuracy while maintaining operational flexibility, achieving an optimal balance between precision and generality.
Figure 13 visually demonstrates the calibration performance of DeepCalib under various surveillance camera perspectives. Green circles mark the ground truth of vanishing points, while red circles indicate predictions. To qualitatively evaluate camera parameter prediction accuracy, we employed the line segment reprojection visualization strategy: green line segments represent reprojections of the predicted landmark lengths, and red line segments correspond to projections of the single lane width (3.75 m). Experimental results indicate that DeepCalib exhibits excellent adaptability to camera perspective variations. Predicted vanishing points show high consistency with ground truth, and the line segment reprojections strictly adhere to perspective transformation principles. Notably, while vanishing point localization and reprojection errors increase slightly in curved road scenarios compared to straight sections, overall errors remain within tolerance thresholds. Although current calibration precision still lags behind manual methods, DeepCalib’s advantages lie in its computational efficiency and environmental adaptability. These characteristics make it particularly suitable for dynamic calibration of highway surveillance cameras, offering a practical and scalable solution for intelligent transportation systems.

4.5. Time Consumption

To validate the real-time performance of the DeepCalib model in processing 1920 × 1080 resolution image frames, the experimental protocol decomposed the algorithm workflow into two core modules: vanishing point decoding (VP Decoding) and extrinsic parameter estimation (EP Estimation). Comparative methods included traditional multi-stage calibration techniques (AutoCalib [19] and Edgelets [43]) and a single-image calibration approach (DeepCN [3]). Table 6 systematically records the time consumption results across processing stages for various calibration algorithms. Comparative analysis reveals that traditional multi-stage methods require tens of seconds for calibration (at 25 FPS video streams), while single-image techniques reduce processing time to the 10⁻² s range, significantly enhancing real-time efficiency. Notably, computational bottlenecks in traditional approaches like AutoCalib and Edgelets concentrate in the vanishing point estimation phase. This stage requires continuous vehicle tracking and horizontal edge feature extraction to achieve stable vanishing point localization, resulting in processing duration that strongly correlates with traffic volume. Overall, traditional multi-stage methods exhibit significant environmental dependency in processing efficiency, whereas single-image techniques completely circumvent these limitations. Although DeepCalib demonstrates a 3.56 × 10⁻² s increase in total processing time compared to DeepCN, it still achieves 10 FPS performance. Given its better calibration accuracy, DeepCalib maintains competitive advantages by balancing precision and computational efficiency.

5. Conclusions

This study addresses the bottlenecks in existing automatic calibration methods for traffic surveillance cameras, focusing on resolving two critical challenges: the scarcity of labeled datasets and poor adaptability to multi-view scenes. We first constructed a large-scale synthetic dataset through simulation of highway scenarios, establishing an effective data augmentation framework. The synthetic dataset maintains statistical consistency with real-world scenes in terms of geometric feature distribution. By adopting data augmentation techniques such as random color jittering and noise injection, we have alleviated distribution bias in RGB color channels and texture distributions between the two datasets, thereby establishing a solid foundation for transfer learning. Subsequently, we proposed DeepCalib, a deep calibration network that integrates the triplet attention mechanism to enhance the representational capacity of geometric visual cues, enabling simultaneous vanishing point detection and camera extrinsic parameter estimation. The method operates on single highway images without requiring continuous object detection, significantly improving calibration efficiency. To enhance real-world robustness, we adopted a pre-training and fine-tuning strategy. Experimental results on the proposed benchmark dataset demonstrate that DeepCalib adapts to diverse highway surveillance camera views. Its simple yet efficient architecture shows practical value for real-world applications. While achieving promising performance, calibration accuracy for curved road scenarios requires further improvement. Future work will focus on expanding the real-world highway dataset to better meet deep learning requirements. Additionally, we aim to strengthen utilization of local visual cues (e.g., vehicles) for more comprehensive perspective feature representation. Addressing these challenges holds significant potential to advance automatic traffic surveillance camera calibration technology.

Author Contributions

Conceptualization, W.Z. and W.J.; methodology, W.Z.; software, W.Z.; validation, W.Z. and W.L.; formal analysis, W.Z.; investigation, W.Z. and W.J.; resources, W.Z.; data curation, W.J. and W.L.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z.; visualization, W.Z. and W.L.; supervision, W.J.; project administration, W.J.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Education Department of Shaanxi Province, grant number 23JK0371, the Shaanxi Provincial Science and Technology Department, grant number 2024JC-YBQN-0725, and the Shaanxi University of Technology, grant number SLGRCQD2318.

Data Availability Statement

The dataset used in this research has been published in https://github.com/WenTao10/Multi-View-Camera-Calibration-Dataset, accessed on 25 January 2022.

Acknowledgments

In the preparation of this manuscript, we used the CARLA simulation platform for dataset production and extend our appreciation to it. We have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

PnP: Perspective-n-Point
PTZ: Pan-Tilt-Zoom
CNNs: Convolutional neural networks
TAM: Triplet attention module
MVCCD: Multi-View Camera Calibration Dataset
FOV: Field of view
CH: Channel height
CW: Channel width
VP: Vanishing point
EP: Extrinsic parameter

References

  1. Sochor, J.; Juránek, R.; Špaňhel, J.; Maršík, L.; Široký, A.; Herout, A.; Zemčík, P. Comprehensive data set for automatic single camera visual speed measurement. IEEE Trans. Intell. Transp. Syst. 2018, 20, 1633–1643. [Google Scholar] [CrossRef]
  2. Revaud, J.; Humenberger, M. Robust automatic monocular vehicle speed estimation for traffic surveillance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4551–4561. [Google Scholar] [CrossRef]
  3. Zhang, W.; Song, H.; Liu, L.; Li, C.; Mu, B.; Gao, Q. Vehicle localisation and deep model for automatic calibration of monocular camera in expressway scenes. IET Intell. Transp. Syst. 2022, 16, 459–473. [Google Scholar] [CrossRef]
  4. Qin, L.; Lin, C.; Huang, S.; Yang, S.; Zhao, Y. Camera calibration for the surround-view system: A benchmark and dataset. Vis. Comput. 2024, 40, 7457–7470. [Google Scholar] [CrossRef]
  5. Hu, Z.; Lam, W.H.; Wong, S.; Chow, A.H.; Ma, W. Turning traffic surveillance cameras into intelligent sensors for traffic density estimation. Complex Intell. Syst. 2023, 9, 7171–7195. [Google Scholar] [CrossRef]
  6. Wang, Z.; Huang, X.; Hu, Z. Attention-Based LiDAR–Camera Fusion for 3D Object Detection in Autonomous Driving. World Electr. Veh. J. 2025, 16, 306. [Google Scholar] [CrossRef]
  7. Hu, X.; Chen, T.; Zhang, W.; Ji, G.; Jia, H. MonoAMP: Adaptive Multi-Order Perceptual Aggregation for Monocular 3D Vehicle Detection. Sensors 2025, 25, 787. [Google Scholar] [CrossRef]
  8. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  9. Li, S.; Xu, C.; Xie, M. A robust O(n) solution to the perspective-n-point problem. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1444–1450. [Google Scholar] [CrossRef]
  10. Zheng, Y.; Sugimoto, S.; Sato, I.; Okutomi, M. A general and simple method for camera pose and focal length determination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 430–437. [Google Scholar] [CrossRef]
  11. Penate-Sanchez, A.; Andrade-Cetto, J.; Moreno-Noguer, F. Exhaustive linearization for robust camera pose and focal length estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2387–2400. [Google Scholar] [CrossRef]
  12. Li, S.; Yoon, H.S. Enhancing camera calibration for traffic surveillance with an integrated approach of genetic algorithm and particle swarm optimization. Sensors 2024, 24, 1456. [Google Scholar] [CrossRef]
  13. Guo, S.; Yu, X.; Sha, Y.; Ju, Y.; Zhu, M.; Wang, J. Online camera auto–calibration appliable to road surveillance. Mach. Vis. Appl. 2024, 35, 91–106. [Google Scholar] [CrossRef]
  14. Bhardwaj, R.; Tummala, G.K.; Ramalingam, G.; Ramjee, R.; Sinha, P. Autocalib: Automatic traffic camera calibration at scale. ACM Trans. Sens. Netw. (TOSN) 2018, 14, 1–27. [Google Scholar] [CrossRef]
  15. Bartl, V.; Herout, A. Optinopt: Dual optimization for automatic camera calibration by multi–target observations. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–8. [Google Scholar] [CrossRef]
  16. Bartl, V.; Juranek, R.; Špaňhel, J.; Herout, A. Planecalib: Automatic camera calibration by multiple observations of rigid objects on plane. In Proceedings of the 2020 Digital Image Computing: Techniques and Applications (DICTA), Melbourne, Australia, 29 November–2 December 2020; pp. 1–8. [Google Scholar] [CrossRef]
  17. Bartl, V.; Špaňhel, J.; Dobeš, P.; Juranek, R.; Herout, A. Automatic camera calibration by landmarks on rigid objects. Mach. Vis. Appl. 2021, 32, 2–15. [Google Scholar] [CrossRef]
  18. Alvarez, S.; Llorca, D.F.; Sotelo, M. Camera auto–calibration using zooming and zebra–crossing for traffic monitoring applications. In Proceedings of the 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), The Hague, The Netherlands, 6–9 October 2013; pp. 608–613. [Google Scholar] [CrossRef]
  19. Dubská, M.; Herout, A.; Juránek, R.; Sochor, J. Fully automatic roadside camera calibration for traffic surveillance. IEEE Trans. Intell. Transp. Syst. 2014, 16, 1162–1171. [Google Scholar] [CrossRef]
  20. Wang, N.; Du, H.; Liu, Y.; Tang, Z.; Hwang, J.N. Self–calibration of traffic surveillance cameras based on moving vehicle appearance and 3–D vehicle modeling. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3064–3068. [Google Scholar] [CrossRef]
  21. Kocur, V.; Ftáčnik, M. Traffic camera calibration via vehicle vanishing point detection. In Proceedings of the International Conference on Artificial Neural Networks, Bratislava, Slovakia, 14–17 September 2021; pp. 628–639. [Google Scholar] [CrossRef]
  22. Zhang, W.; Song, H.; Liu, L. Automatic calibration for monocular cameras in highway scenes via vehicle vanishing point detection. J. Transp. Eng. Part A Syst. 2023, 149, 04023050. [Google Scholar] [CrossRef]
  23. Tong, X.; Ying, X.; Shi, Y.; Wang, R.; Yang, J. Transformer based line segment classifier with image context for real–time vanishing point detection in Manhattan world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6083–6092. [Google Scholar] [CrossRef]
  24. Wildenauer, H.; Hanbury, A. Robust camera self–calibration from monocular images of Manhattan worlds. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2831–2838. [Google Scholar] [CrossRef]
  25. Itu, R.; Borza, D.; Danescu, R. Automatic extrinsic camera parameters calibration using Convolutional Neural Networks. In Proceedings of the 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 7–9 September 2017; pp. 273–278. [Google Scholar] [CrossRef]
  26. Borji, A. Vanishing point detection with convolutional neural networks. arXiv 2016, arXiv:1609.00967. [Google Scholar] [CrossRef]
  27. Chang, C.K.; Zhao, J.; Itti, L. Deepvp: Deep learning for vanishing point detection on 1 million street view images. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 4496–4503. [Google Scholar] [CrossRef]
  28. Lee, S.; Kim, J.; Shin Yoon, J.; Shin, S.; Bailo, O.; Kim, N.; Lee, T.H.; Seok Hong, H.; Han, S.H.; So Kweon, I. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1965–1973. [Google Scholar] [CrossRef]
  29. Workman, S.; Greenwell, C.; Zhai, M.; Baltenberger, R.; Jacobs, N. Deepfocal: A method for direct focal length estimation. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 1369–1373. [Google Scholar] [CrossRef]
  30. Hold-Geoffroy, Y.; Sunkavalli, K.; Eisenmann, J.; Fisher, M.; Gambaretto, E.; Hadap, S.; Lalonde, J.F. A perceptual measure for deep single image camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2354–2363. [Google Scholar] [CrossRef]
  31. Workman, S.; Zhai, M.; Jacobs, N. Horizon lines in the wild. arXiv 2016, arXiv:1604.02129. [Google Scholar] [CrossRef]
  32. Lee, J.; Sung, M.; Lee, H.; Kim, J. Neural geometric parser for single image camera calibration. In Computer Vision–ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII 16; Springer: Glasgow, UK, 2020; pp. 541–557. [Google Scholar] [CrossRef]
  33. Jin, L.; Zhang, J.; Hold-Geoffroy, Y.; Wang, O.; Blackburn-Matzen, K.; Sticha, M.; Fouhey, D.F. Perspective fields for single image camera calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17307–17316. [Google Scholar] [CrossRef]
  34. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 1–16. [Google Scholar]
  35. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3138–3147. [Google Scholar] [CrossRef]
  36. Kanhere, N.K.; Birchfield, S.T. A taxonomy and analysis of camera calibration methods for traffic monitoring applications. IEEE Trans. Intell. Transp. Syst. 2010, 11, 441–452. [Google Scholar] [CrossRef]
  37. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976. [Google Scholar] [CrossRef]
  38. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  40. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5632–5640. [Google Scholar] [CrossRef]
  41. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
  42. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  43. Sochor, J.; Juranek, R.; Herout, A. Traffic Surveillance Camera Calibration by 3D Model Bounding Box Alignment for Accurate Vehicle Speed Measurement. Comput. Vis. Image Underst. 2017, 161, 87–98. [Google Scholar] [CrossRef]
Figure 1. Representative images of different scenarios in the real and synthetic datasets. (a) Real dataset; (b) Synthetic dataset.
Figure 2. Distribution of vanishing point coordinates in MVCCD_S.
Figure 3. Histograms of camera parameters in MVCCD_S. (a,b) represent the distributions of pitch angle and yaw angle respectively, while (c,d) correspond to the distributions of focal length and camera height respectively.
Figure 4. Quantitative comparison between MVCCD_S and MVCCD_R datasets across multiple feature dimensions.
Figure 5. Traffic surveillance camera calibration model. The camera is mounted at a height of h meters above the ground plane. The Xw-Yw-Zw axes define the world coordinate system, and the Xc-Yc-Zc axes define the camera coordinate system. U denotes the vanishing point along the road direction, and ρ denotes the ground plane.
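To make the projection geometry of Figure 5 concrete, the sketch below shows how the road-direction vanishing point U relates to the camera pitch and yaw under an assumed pinhole model with the principal point at the image center and the image y-axis pointing downward. The sign conventions and the helper names are illustrative assumptions and may differ from the paper's implementation.

```python
import numpy as np

def vanishing_point(f, cx, cy, pitch, yaw):
    """Project the road direction to its image vanishing point.

    Assumes a pinhole camera with principal point (cx, cy), focal length f
    in pixels, pitch > 0 when the camera tilts down, and yaw measured from
    the road axis. Sign conventions are illustrative only.
    """
    u = cx + f * np.tan(yaw) / np.cos(pitch)
    v = cy - f * np.tan(pitch)
    return u, v

def pose_from_vanishing_point(f, cx, cy, u, v):
    """Invert the mapping: recover pitch and yaw from the vanishing point."""
    pitch = np.arctan2(cy - v, f)
    yaw = np.arctan2((u - cx) * np.cos(pitch), f)
    return pitch, yaw

if __name__ == "__main__":
    f, cx, cy = 2000.0, 960.0, 540.0              # 1920 x 1080 image, assumed focal length
    pitch, yaw = np.radians(12.0), np.radians(8.0)
    u, v = vanishing_point(f, cx, cy, pitch, yaw)
    print(np.degrees(pose_from_vanishing_point(f, cx, cy, u, v)))  # ~[12.  8.]
```

Note that the camera height h is not observable from U alone; recovering metric scale requires a known reference on the ground plane, such as the lane width used in Figure 13.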
Figure 6. The overall framework of DeepCalib. TAM stands for the triplet attention module. The symbol ⊕ denotes feature fusion and © denotes feature map concatenation. The keypoint branch outputs the vanishing point heatmap, and the camera pose branch estimates the extrinsic parameters.
Figure 7. The overall framework of TAM [35].
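As a companion to Figure 7, the following compact PyTorch-style rendition captures the triplet attention idea from [35]: three branches rotate the feature tensor, pool the leading dimension down to a max map and a mean map, apply a 7 × 7 convolution with a sigmoid gate, and average the gated branches. This is a simplified sketch (batch normalization and other details of [35] are omitted), not the exact module used in DeepCalib.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Reduce the leading feature dimension to two maps: element-wise max and mean."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> sigmoid, used as a multiplicative gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three branches: the (C, H, W) view plus two rotated views, averaged at the end."""
    def __init__(self):
        super().__init__()
        self.gate_hw = AttentionGate()   # spatial branch on (C, H, W)
        self.gate_cw = AttentionGate()   # branch after swapping C and H
        self.gate_ch = AttentionGate()   # branch after swapping C and W

    def forward(self, x):                # x: (B, C, H, W)
        y1 = self.gate_hw(x)
        y2 = self.gate_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        y3 = self.gate_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return (y1 + y2 + y3) / 3.0

feat = torch.randn(2, 64, 32, 32)
print(TripletAttention()(feat).shape)    # torch.Size([2, 64, 32, 32])
```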
Figure 8. Schematic diagram of rotation angle estimation.
Figure 9. Loss curves for direct training and transfer learning.
Figure 10. Quantitative analysis of direct training and transfer learning.
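The comparison in Figures 9 and 10 follows the two-stage recipe described in the paper: pre-train on the synthetic MVCCD_S and then fine-tune on the real MVCCD_R. A minimal sketch of that loop is shown below; the model interface, dataloaders, learning rates, checkpoint name, and epoch counts are illustrative assumptions rather than the paper's actual hyperparameters, and AdamW [42] is used here only for illustration.

```python
import torch
from torch.optim import AdamW

def run_epoch(model, loader, optimizer, device="cuda"):
    """One training pass; the model is assumed to return its combined loss."""
    model.train()
    for images, targets in loader:                 # assumes tensor targets
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = model(images, targets)              # heatmap + pose losses combined
        loss.backward()
        optimizer.step()

def train_two_stage(model, synthetic_loader, real_loader, device="cuda"):
    model.to(device)

    # Stage 1: robust pre-training on the synthetic dataset (MVCCD_S).
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    for _ in range(30):                            # illustrative epoch count
        run_epoch(model, synthetic_loader, optimizer, device)
    torch.save(model.state_dict(), "deepcalib_pretrained.pth")

    # Stage 2: fine-tuning on the real dataset (MVCCD_R) with a lower learning rate.
    model.load_state_dict(torch.load("deepcalib_pretrained.pth"))
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
    for _ in range(10):                            # illustrative epoch count
        run_epoch(model, real_loader, optimizer, device)
    return model
```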
Figure 11. Qualitative comparison of vanishing point detection performance by DeepCalib on real-world test images. (a) Straight road scenarios; (b) Curved road scenarios. For each row, the left image shows results from a model trained exclusively on MVCCD_R, while the right image demonstrates outcomes after pre-training on MVCCD_S followed by fine-tuning on MVCCD_R.
Figure 12. Schematic diagram of three different measurement line segments on the lane.
Figure 13. Calibration results of the DeepCalib model in real-world scenarios. (a) Straight road environment; (b) Curved road environment. Green line segments represent reprojections of the predicted landmark lengths, and red line segments correspond to projections of the lane width.
Table 1. Comparison of the real-world and synthetic dataset parameters.
Dataset | Sample Size | ϕ (Pitch) | θ (Yaw) | h (Camera Height) | Format | Resolution
MVCCD_R [3] | 8765 | [−18.4°, 0°] | [−29.3°, 29°] | [10.4 m, 13.9 m] | RGB | 1920 × 1080
MVCCD_S | 336,249 | [−28°, 0°] | [−40°, 40°] | [10 m, 14.5 m] | RGB | 1920 × 1080
Table 2. Comparison of backbone network ablation studies.
Backbone | L2 Dis (Pixel) | Pitch (°) | Yaw (°) | h (m)
ConvNeXt_tiny | 57.15 | 6.70 | 10.49 | 0.98
ConvNeXt_tiny_TAM | 34.69 | 4.43 | 7.84 | 0.98
ConvNeXt_small | 45.98 | 2.73 | 5.62 | 0.94
ConvNeXt_small_TAM | 29.66 | 2.32 | 3.88 | 0.93
ConvNeXt_base | 40.26 | 1.74 | 3.94 | 0.91
ConvNeXt_base_TAM | 22.22 | 1.33 | 2.45 | 0.89
Table 3. Quantitative evaluation of binning strategies on rotation angle estimation.
Angles | Bins Num | Class Accuracy | Residual Error (°)
pitch | 2 | 94.23% | 2.78
pitch | 3 | 92.62% | 1.82
pitch | 4 | 88.68% | 1.49
pitch | 5 | 85.77% | 2.97
pitch | 6 | 79.03% | 3.35
yaw | 2 | 92.37% | 5.27
yaw | 3 | 90.08% | 4.84
yaw | 4 | 87.52% | 4.05
yaw | 5 | 86.04% | 2.33
yaw | 6 | 76.91% | 4.45
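The binning strategy evaluated in Table 3 is the classification-plus-residual formulation popularized for orientation estimation [40] and illustrated in Figure 8: the angle range is divided into a small number of bins, the network classifies the bin and regresses a residual offset, and the final angle is the selected bin center plus that residual. The decoding sketch below uses assumed angle ranges (e.g., yaw in [−40°, 40°], consistent with Table 1); the exact ranges and bin layout used by DeepCalib may differ.

```python
import numpy as np

def decode_binned_angle(bin_logits, residuals, angle_min, angle_max):
    """Recover an angle from bin classification scores and per-bin residuals.

    bin_logits : (num_bins,) classification scores for each bin
    residuals  : (num_bins,) regressed offsets (degrees) relative to bin centers
    """
    num_bins = len(bin_logits)
    bin_width = (angle_max - angle_min) / num_bins
    centers = angle_min + bin_width * (np.arange(num_bins) + 0.5)
    best = int(np.argmax(bin_logits))
    return centers[best] + residuals[best]

# Example: yaw split into 4 bins over an assumed [-40, 40] degree range.
logits = np.array([0.1, 0.2, 3.1, 0.4])        # bin 2 wins -> center at +10 deg
residuals = np.array([0.0, 0.0, -2.5, 0.0])    # regressed offset inside that bin
print(decode_binned_angle(logits, residuals, -40.0, 40.0))   # 7.5
```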
Table 4. Comparison of vanishing point detection performance on different datasets.
Methods | Metrics | MVCCD_R (Straight Roads) | MVCCD_R (Curved Roads) | Dataset [27]
AutoCalib [19] | L2 Dis | 49.23 | 77.61 | –
AutoCalib [19] | NormDis | 0.022 | 0.035 | –
Edgelets [43] | L2 Dis | 70.96 | 130.38 | –
Edgelets [43] | NormDis | 0.032 | 0.059 | –
DeepVP [27] | L2 Dis | 91.08 | 119.43 | 14.64
DeepVP [27] | NormDis | 0.041 | 0.054 | 0.035
DeepCN [3] | L2 Dis | 57.14 | 73.16 | 9.39
DeepCN [3] | NormDis | 0.026 | 0.033 | 0.022
DeepCalib | L2 Dis | 13.11 | 34.29 | 7.15
DeepCalib | NormDis | 0.006 | 0.022 | 0.014
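For reference, the two metrics in Table 4 can be computed as in the sketch below. L2 Dis is the Euclidean pixel distance between the predicted and ground-truth vanishing points; NormDis is assumed here to be that distance normalized by the image diagonal, a reading consistent with the tabulated MVCCD_R values at 1920 × 1080, but it should be treated as an interpretation rather than the paper's exact definition. The example coordinates are arbitrary.

```python
import math

def vp_errors(pred, gt, image_size=(1920, 1080)):
    """Return (L2 distance in pixels, distance normalized by the image diagonal)."""
    l2 = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    diagonal = math.hypot(*image_size)
    return l2, l2 / diagonal

l2, norm = vp_errors((972.0, 431.0), (960.0, 438.0))
print(f"L2 Dis = {l2:.2f} px, NormDis = {norm:.4f}")
```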
Table 5. Line segment measurement comparison.
Methods | 6 m: Mean Value (m) | 6 m: Mean Precision | 9 m: Mean Value (m) | 9 m: Mean Precision | 15 m: Mean Value (m) | 15 m: Mean Precision | Overall Mean Precision
VWL [36] | 6.05 | 99.17% | 8.91 | 99.00% | 14.96 | 99.73% | 99.30%
AutoCalib [19] | 6.95 | 84.17% | 10.92 | 78.67% | 17.77 | 81.53% | 81.46%
Edgelets [43] | 7.11 | 81.50% | 11.45 | 72.78% | 18.81 | 74.60% | 76.29%
DeepCN [3] | 6.82 | 86.33% | 10.27 | 85.89% | 17.11 | 85.93% | 86.05%
DeepCalib | 6.56 | 90.67% | 9.96 | 89.33% | 16.68 | 88.80% | 89.60%
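The precision values in Table 5 are reproduced by taking one minus the relative measurement error: for example, a 6.05 m estimate of the 6 m segment gives 99.17%. The helper below illustrates this metric on the DeepCalib row; the tabulated precisions are averages over many test measurements, so this is a definition sketch rather than a recomputation of the experiment.

```python
def measurement_precision(measured, ground_truth):
    """Precision (%) as 100 * (1 - relative error of the measured length)."""
    return 100.0 * (1.0 - abs(measured - ground_truth) / ground_truth)

# DeepCalib row of Table 5: 6 m, 9 m, and 15 m reference segments.
values = [(6.56, 6.0), (9.96, 9.0), (16.68, 15.0)]
precisions = [measurement_precision(m, gt) for m, gt in values]
print([round(p, 2) for p in precisions])             # [90.67, 89.33, 88.8]
print(round(sum(precisions) / len(precisions), 2))   # overall mean precision ~89.6
```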
Table 6. Processing time comparison between multi-stage and single-image calibration methods.
Methods | VP Decoding (s) | EP Estimation (s) | Total Time (s)
AutoCalib [19] | 1.38 × 10² | – | 1.38 × 10²
Edgelets [43] | 4.32 × 10¹ | – | 4.32 × 10¹
DeepCN [3] | 6.16 × 10⁻² | 1.73 × 10⁻⁴ | 6.18 × 10⁻²
DeepCalib | 9.72 × 10⁻² | 1.90 × 10⁻⁴ | 9.74 × 10⁻²